From 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 21 Feb 2023 18:24:12 -0800 Subject: Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ... --- arch/powerpc/mm/ptdump/ptdump.c | 375 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 375 insertions(+) create mode 100644 arch/powerpc/mm/ptdump/ptdump.c (limited to 'arch/powerpc/mm/ptdump/ptdump.c') diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c new file mode 100644 index 000000000..2313053fe --- /dev/null +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -0,0 +1,375 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2016, Rashmica Gupta, IBM Corp. + * + * This traverses the kernel pagetables and dumps the + * information about the used sections of memory to + * /sys/kernel/debug/kernel_pagetables. + * + * Derived from the arm64 implementation: + * Copyright (c) 2014, The Linux Foundation, Laura Abbott. + * (C) Copyright 2008 Intel Corporation, Arjan van de Ven. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "ptdump.h" + +/* + * To visualise what is happening, + * + * - PTRS_PER_P** = how many entries there are in the corresponding P** + * - P**_SHIFT = how many bits of the address we use to index into the + * corresponding P** + * - P**_SIZE is how much memory we can access through the table - not the + * size of the table itself. + * P**={PGD, PUD, PMD, PTE} + * + * + * Each entry of the PGD points to a PUD. Each entry of a PUD points to a + * PMD. Each entry of a PMD points to a PTE. And every PTE entry points to + * a page. + * + * In the case where there are only 3 levels, the PUD is folded into the + * PGD: every PUD has only one entry which points to the PMD. + * + * The page dumper groups page table entries of the same type into a single + * description. It uses pg_state to track the range information while + * iterating over the PTE entries. When the continuity is broken it then + * dumps out a description of the range - ie PTEs that are virtually contiguous + * with the same PTE flags are chunked together. This is to make it clear how + * different areas of the kernel virtual memory are used. + * + */ +struct pg_state { + struct ptdump_state ptdump; + struct seq_file *seq; + const struct addr_marker *marker; + unsigned long start_address; + unsigned long start_pa; + int level; + u64 current_flags; + bool check_wx; + unsigned long wx_pages; +}; + +struct addr_marker { + unsigned long start_address; + const char *name; +}; + +static struct addr_marker address_markers[] = { + { 0, "Start of kernel VM" }, +#ifdef MODULES_VADDR + { 0, "modules start" }, + { 0, "modules end" }, +#endif + { 0, "vmalloc() Area" }, + { 0, "vmalloc() End" }, +#ifdef CONFIG_PPC64 + { 0, "isa I/O start" }, + { 0, "isa I/O end" }, + { 0, "phb I/O start" }, + { 0, "phb I/O end" }, + { 0, "I/O remap start" }, + { 0, "I/O remap end" }, + { 0, "vmemmap start" }, +#else + { 0, "Early I/O remap start" }, + { 0, "Early I/O remap end" }, +#ifdef CONFIG_HIGHMEM + { 0, "Highmem PTEs start" }, + { 0, "Highmem PTEs end" }, +#endif + { 0, "Fixmap start" }, + { 0, "Fixmap end" }, +#endif +#ifdef CONFIG_KASAN + { 0, "kasan shadow mem start" }, + { 0, "kasan shadow mem end" }, +#endif + { -1, NULL }, +}; + +static struct ptdump_range ptdump_range[] __ro_after_init = { + {TASK_SIZE_MAX, ~0UL}, + {0, 0} +}; + +#define pt_dump_seq_printf(m, fmt, args...) \ +({ \ + if (m) \ + seq_printf(m, fmt, ##args); \ +}) + +#define pt_dump_seq_putc(m, c) \ +({ \ + if (m) \ + seq_putc(m, c); \ +}) + +void pt_dump_size(struct seq_file *m, unsigned long size) +{ + static const char units[] = " KMGTPE"; + const char *unit = units; + + /* Work out what appropriate unit to use */ + while (!(size & 1023) && unit[1]) { + size >>= 10; + unit++; + } + pt_dump_seq_printf(m, "%9lu%c ", size, *unit); +} + +static void dump_flag_info(struct pg_state *st, const struct flag_info + *flag, u64 pte, int num) +{ + unsigned int i; + + for (i = 0; i < num; i++, flag++) { + const char *s = NULL; + u64 val; + + /* flag not defined so don't check it */ + if (flag->mask == 0) + continue; + /* Some 'flags' are actually values */ + if (flag->is_val) { + val = pte & flag->val; + if (flag->shift) + val = val >> flag->shift; + pt_dump_seq_printf(st->seq, " %s:%llx", flag->set, val); + } else { + if ((pte & flag->mask) == flag->val) + s = flag->set; + else + s = flag->clear; + if (s) + pt_dump_seq_printf(st->seq, " %s", s); + } + st->current_flags &= ~flag->mask; + } + if (st->current_flags != 0) + pt_dump_seq_printf(st->seq, " unknown flags:%llx", st->current_flags); +} + +static void dump_addr(struct pg_state *st, unsigned long addr) +{ +#ifdef CONFIG_PPC64 +#define REG "0x%016lx" +#else +#define REG "0x%08lx" +#endif + + pt_dump_seq_printf(st->seq, REG "-" REG " ", st->start_address, addr - 1); + pt_dump_seq_printf(st->seq, " " REG " ", st->start_pa); + pt_dump_size(st->seq, addr - st->start_address); +} + +static void note_prot_wx(struct pg_state *st, unsigned long addr) +{ + pte_t pte = __pte(st->current_flags); + + if (!IS_ENABLED(CONFIG_DEBUG_WX) || !st->check_wx) + return; + + if (!pte_write(pte) || !pte_exec(pte)) + return; + + WARN_ONCE(1, "powerpc/mm: Found insecure W+X mapping at address %p/%pS\n", + (void *)st->start_address, (void *)st->start_address); + + st->wx_pages += (addr - st->start_address) / PAGE_SIZE; +} + +static void note_page_update_state(struct pg_state *st, unsigned long addr, int level, u64 val) +{ + u64 flag = level >= 0 ? val & pg_level[level].mask : 0; + u64 pa = val & PTE_RPN_MASK; + + st->level = level; + st->current_flags = flag; + st->start_address = addr; + st->start_pa = pa; + + while (addr >= st->marker[1].start_address) { + st->marker++; + pt_dump_seq_printf(st->seq, "---[ %s ]---\n", st->marker->name); + } +} + +static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, u64 val) +{ + u64 flag = level >= 0 ? val & pg_level[level].mask : 0; + struct pg_state *st = container_of(pt_st, struct pg_state, ptdump); + + /* At first no level is set */ + if (st->level == -1) { + pt_dump_seq_printf(st->seq, "---[ %s ]---\n", st->marker->name); + note_page_update_state(st, addr, level, val); + /* + * Dump the section of virtual memory when: + * - the PTE flags from one entry to the next differs. + * - we change levels in the tree. + * - the address is in a different section of memory and is thus + * used for a different purpose, regardless of the flags. + */ + } else if (flag != st->current_flags || level != st->level || + addr >= st->marker[1].start_address) { + + /* Check the PTE flags */ + if (st->current_flags) { + note_prot_wx(st, addr); + dump_addr(st, addr); + + /* Dump all the flags */ + if (pg_level[st->level].flag) + dump_flag_info(st, pg_level[st->level].flag, + st->current_flags, + pg_level[st->level].num); + + pt_dump_seq_putc(st->seq, '\n'); + } + + /* + * Address indicates we have passed the end of the + * current section of virtual memory + */ + note_page_update_state(st, addr, level, val); + } +} + +static void populate_markers(void) +{ + int i = 0; + +#ifdef CONFIG_PPC64 + address_markers[i++].start_address = PAGE_OFFSET; +#else + address_markers[i++].start_address = TASK_SIZE; +#endif +#ifdef MODULES_VADDR + address_markers[i++].start_address = MODULES_VADDR; + address_markers[i++].start_address = MODULES_END; +#endif + address_markers[i++].start_address = VMALLOC_START; + address_markers[i++].start_address = VMALLOC_END; +#ifdef CONFIG_PPC64 + address_markers[i++].start_address = ISA_IO_BASE; + address_markers[i++].start_address = ISA_IO_END; + address_markers[i++].start_address = PHB_IO_BASE; + address_markers[i++].start_address = PHB_IO_END; + address_markers[i++].start_address = IOREMAP_BASE; + address_markers[i++].start_address = IOREMAP_END; + /* What is the ifdef about? */ +#ifdef CONFIG_PPC_BOOK3S_64 + address_markers[i++].start_address = H_VMEMMAP_START; +#else + address_markers[i++].start_address = VMEMMAP_BASE; +#endif +#else /* !CONFIG_PPC64 */ + address_markers[i++].start_address = ioremap_bot; + address_markers[i++].start_address = IOREMAP_TOP; +#ifdef CONFIG_HIGHMEM + address_markers[i++].start_address = PKMAP_BASE; + address_markers[i++].start_address = PKMAP_ADDR(LAST_PKMAP); +#endif + address_markers[i++].start_address = FIXADDR_START; + address_markers[i++].start_address = FIXADDR_TOP; +#endif /* CONFIG_PPC64 */ +#ifdef CONFIG_KASAN + address_markers[i++].start_address = KASAN_SHADOW_START; + address_markers[i++].start_address = KASAN_SHADOW_END; +#endif +} + +static int ptdump_show(struct seq_file *m, void *v) +{ + struct pg_state st = { + .seq = m, + .marker = address_markers, + .level = -1, + .ptdump = { + .note_page = note_page, + .range = ptdump_range, + } + }; + + /* Traverse kernel page tables */ + ptdump_walk_pgd(&st.ptdump, &init_mm, NULL); + return 0; +} + +DEFINE_SHOW_ATTRIBUTE(ptdump); + +static void __init build_pgtable_complete_mask(void) +{ + unsigned int i, j; + + for (i = 0; i < ARRAY_SIZE(pg_level); i++) + if (pg_level[i].flag) + for (j = 0; j < pg_level[i].num; j++) + pg_level[i].mask |= pg_level[i].flag[j].mask; +} + +#ifdef CONFIG_DEBUG_WX +void ptdump_check_wx(void) +{ + struct pg_state st = { + .seq = NULL, + .marker = (struct addr_marker[]) { + { 0, NULL}, + { -1, NULL}, + }, + .level = -1, + .check_wx = true, + .ptdump = { + .note_page = note_page, + .range = ptdump_range, + } + }; + + ptdump_walk_pgd(&st.ptdump, &init_mm, NULL); + + if (st.wx_pages) + pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found\n", + st.wx_pages); + else + pr_info("Checked W+X mappings: passed, no W+X pages found\n"); +} +#endif + +static int __init ptdump_init(void) +{ +#ifdef CONFIG_PPC64 + if (!radix_enabled()) + ptdump_range[0].start = KERN_VIRT_START; + else + ptdump_range[0].start = PAGE_OFFSET; + + ptdump_range[0].end = PAGE_OFFSET + (PGDIR_SIZE * PTRS_PER_PGD); +#endif + + populate_markers(); + build_pgtable_complete_mask(); + + if (IS_ENABLED(CONFIG_PTDUMP_DEBUGFS)) + debugfs_create_file("kernel_page_tables", 0400, NULL, NULL, &ptdump_fops); + + return 0; +} +device_initcall(ptdump_init); -- cgit v1.2.3