From 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 21 Feb 2023 18:24:12 -0800 Subject: Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ... --- tools/testing/selftests/x86/test_syscall_vdso.c | 400 ++++++++++++++++++++++++ 1 file changed, 400 insertions(+) create mode 100644 tools/testing/selftests/x86/test_syscall_vdso.c (limited to 'tools/testing/selftests/x86/test_syscall_vdso.c') diff --git a/tools/testing/selftests/x86/test_syscall_vdso.c b/tools/testing/selftests/x86/test_syscall_vdso.c new file mode 100644 index 000000000..8965c311b --- /dev/null +++ b/tools/testing/selftests/x86/test_syscall_vdso.c @@ -0,0 +1,400 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * 32-bit syscall ABI conformance test. + * + * Copyright (c) 2015 Denys Vlasenko + */ +/* + * Can be built statically: + * gcc -Os -Wall -static -m32 test_syscall_vdso.c thunks_32.S + */ +#undef _GNU_SOURCE +#define _GNU_SOURCE 1 +#undef __USE_GNU +#define __USE_GNU 1 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#if !defined(__i386__) +int main(int argc, char **argv, char **envp) +{ + printf("[SKIP]\tNot a 32-bit x86 userspace\n"); + return 0; +} +#else + +long syscall_addr; +long get_syscall(char **envp) +{ + Elf32_auxv_t *auxv; + while (*envp++ != NULL) + continue; + for (auxv = (void *)envp; auxv->a_type != AT_NULL; auxv++) + if (auxv->a_type == AT_SYSINFO) + return auxv->a_un.a_val; + printf("[WARN]\tAT_SYSINFO not supplied\n"); + return 0; +} + +asm ( + " .pushsection .text\n" + " .global int80\n" + "int80:\n" + " int $0x80\n" + " ret\n" + " .popsection\n" +); +extern char int80; + +struct regs64 { + uint64_t rax, rbx, rcx, rdx; + uint64_t rsi, rdi, rbp, rsp; + uint64_t r8, r9, r10, r11; + uint64_t r12, r13, r14, r15; +}; +struct regs64 regs64; +int kernel_is_64bit; + +asm ( + " .pushsection .text\n" + " .code64\n" + "get_regs64:\n" + " push %rax\n" + " mov $regs64, %eax\n" + " pop 0*8(%rax)\n" + " movq %rbx, 1*8(%rax)\n" + " movq %rcx, 2*8(%rax)\n" + " movq %rdx, 3*8(%rax)\n" + " movq %rsi, 4*8(%rax)\n" + " movq %rdi, 5*8(%rax)\n" + " movq %rbp, 6*8(%rax)\n" + " movq %rsp, 7*8(%rax)\n" + " movq %r8, 8*8(%rax)\n" + " movq %r9, 9*8(%rax)\n" + " movq %r10, 10*8(%rax)\n" + " movq %r11, 11*8(%rax)\n" + " movq %r12, 12*8(%rax)\n" + " movq %r13, 13*8(%rax)\n" + " movq %r14, 14*8(%rax)\n" + " movq %r15, 15*8(%rax)\n" + " ret\n" + "poison_regs64:\n" + " movq $0x7f7f7f7f, %r8\n" + " shl $32, %r8\n" + " orq $0x7f7f7f7f, %r8\n" + " movq %r8, %r9\n" + " incq %r9\n" + " movq %r9, %r10\n" + " incq %r10\n" + " movq %r10, %r11\n" + " incq %r11\n" + " movq %r11, %r12\n" + " incq %r12\n" + " movq %r12, %r13\n" + " incq %r13\n" + " movq %r13, %r14\n" + " incq %r14\n" + " movq %r14, %r15\n" + " incq %r15\n" + " ret\n" + " .code32\n" + " .popsection\n" +); +extern void get_regs64(void); +extern void poison_regs64(void); +extern unsigned long call64_from_32(void (*function)(void)); +void print_regs64(void) +{ + if (!kernel_is_64bit) + return; + printf("ax:%016llx bx:%016llx cx:%016llx dx:%016llx\n", regs64.rax, regs64.rbx, regs64.rcx, regs64.rdx); + printf("si:%016llx di:%016llx bp:%016llx sp:%016llx\n", regs64.rsi, regs64.rdi, regs64.rbp, regs64.rsp); + printf(" 8:%016llx 9:%016llx 10:%016llx 11:%016llx\n", regs64.r8 , regs64.r9 , regs64.r10, regs64.r11); + printf("12:%016llx 13:%016llx 14:%016llx 15:%016llx\n", regs64.r12, regs64.r13, regs64.r14, regs64.r15); +} + +int check_regs64(void) +{ + int err = 0; + int num = 8; + uint64_t *r64 = ®s64.r8; + uint64_t expected = 0x7f7f7f7f7f7f7f7fULL; + + if (!kernel_is_64bit) + return 0; + + do { + if (*r64 == expected++) + continue; /* register did not change */ + if (syscall_addr != (long)&int80) { + /* + * Non-INT80 syscall entrypoints are allowed to clobber R8+ regs: + * either clear them to 0, or for R11, load EFLAGS. + */ + if (*r64 == 0) + continue; + if (num == 11) { + printf("[NOTE]\tR11 has changed:%016llx - assuming clobbered by SYSRET insn\n", *r64); + continue; + } + } else { + /* + * INT80 syscall entrypoint can be used by + * 64-bit programs too, unlike SYSCALL/SYSENTER. + * Therefore it must preserve R12+ + * (they are callee-saved registers in 64-bit C ABI). + * + * Starting in Linux 4.17 (and any kernel that + * backports the change), R8..11 are preserved. + * Historically (and probably unintentionally), they + * were clobbered or zeroed. + */ + } + printf("[FAIL]\tR%d has changed:%016llx\n", num, *r64); + err++; + } while (r64++, ++num < 16); + + if (!err) + printf("[OK]\tR8..R15 did not leak kernel data\n"); + return err; +} + +int nfds; +fd_set rfds; +fd_set wfds; +fd_set efds; +struct timespec timeout; +sigset_t sigmask; +struct { + sigset_t *sp; + int sz; +} sigmask_desc; + +void prep_args() +{ + nfds = 42; + FD_ZERO(&rfds); + FD_ZERO(&wfds); + FD_ZERO(&efds); + FD_SET(0, &rfds); + FD_SET(1, &wfds); + FD_SET(2, &efds); + timeout.tv_sec = 0; + timeout.tv_nsec = 123; + sigemptyset(&sigmask); + sigaddset(&sigmask, SIGINT); + sigaddset(&sigmask, SIGUSR2); + sigaddset(&sigmask, SIGRTMAX); + sigmask_desc.sp = &sigmask; + sigmask_desc.sz = 8; /* bytes */ +} + +static void print_flags(const char *name, unsigned long r) +{ + static const char *bitarray[] = { + "\n" ,"c\n" ,/* Carry Flag */ + "0 " ,"1 " ,/* Bit 1 - always on */ + "" ,"p " ,/* Parity Flag */ + "0 " ,"3? " , + "" ,"a " ,/* Auxiliary carry Flag */ + "0 " ,"5? " , + "" ,"z " ,/* Zero Flag */ + "" ,"s " ,/* Sign Flag */ + "" ,"t " ,/* Trap Flag */ + "" ,"i " ,/* Interrupt Flag */ + "" ,"d " ,/* Direction Flag */ + "" ,"o " ,/* Overflow Flag */ + "0 " ,"1 " ,/* I/O Privilege Level (2 bits) */ + "0" ,"1" ,/* I/O Privilege Level (2 bits) */ + "" ,"n " ,/* Nested Task */ + "0 " ,"15? ", + "" ,"r " ,/* Resume Flag */ + "" ,"v " ,/* Virtual Mode */ + "" ,"ac " ,/* Alignment Check/Access Control */ + "" ,"vif ",/* Virtual Interrupt Flag */ + "" ,"vip ",/* Virtual Interrupt Pending */ + "" ,"id " ,/* CPUID detection */ + NULL + }; + const char **bitstr; + int bit; + + printf("%s=%016lx ", name, r); + bitstr = bitarray + 42; + bit = 21; + if ((r >> 22) != 0) + printf("(extra bits are set) "); + do { + if (bitstr[(r >> bit) & 1][0]) + fputs(bitstr[(r >> bit) & 1], stdout); + bitstr -= 2; + bit--; + } while (bit >= 0); +} + +int run_syscall(void) +{ + long flags, bad_arg; + + prep_args(); + + if (kernel_is_64bit) + call64_from_32(poison_regs64); + /*print_regs64();*/ + + asm("\n" + /* Try 6-arg syscall: pselect. It should return quickly */ + " push %%ebp\n" + " mov $308, %%eax\n" /* PSELECT */ + " mov nfds, %%ebx\n" /* ebx arg1 */ + " mov $rfds, %%ecx\n" /* ecx arg2 */ + " mov $wfds, %%edx\n" /* edx arg3 */ + " mov $efds, %%esi\n" /* esi arg4 */ + " mov $timeout, %%edi\n" /* edi arg5 */ + " mov $sigmask_desc, %%ebp\n" /* %ebp arg6 */ + " push $0x200ed7\n" /* set almost all flags */ + " popf\n" /* except TF, IOPL, NT, RF, VM, AC, VIF, VIP */ + " call *syscall_addr\n" + /* Check that registers are not clobbered */ + " pushf\n" + " pop %%eax\n" + " cld\n" + " cmp nfds, %%ebx\n" /* ebx arg1 */ + " mov $1, %%ebx\n" + " jne 1f\n" + " cmp $rfds, %%ecx\n" /* ecx arg2 */ + " mov $2, %%ebx\n" + " jne 1f\n" + " cmp $wfds, %%edx\n" /* edx arg3 */ + " mov $3, %%ebx\n" + " jne 1f\n" + " cmp $efds, %%esi\n" /* esi arg4 */ + " mov $4, %%ebx\n" + " jne 1f\n" + " cmp $timeout, %%edi\n" /* edi arg5 */ + " mov $5, %%ebx\n" + " jne 1f\n" + " cmpl $sigmask_desc, %%ebp\n" /* %ebp arg6 */ + " mov $6, %%ebx\n" + " jne 1f\n" + " mov $0, %%ebx\n" + "1:\n" + " pop %%ebp\n" + : "=a" (flags), "=b" (bad_arg) + : + : "cx", "dx", "si", "di" + ); + + if (kernel_is_64bit) { + memset(®s64, 0x77, sizeof(regs64)); + call64_from_32(get_regs64); + /*print_regs64();*/ + } + + /* + * On paravirt kernels, flags are not preserved across syscalls. + * Thus, we do not consider it a bug if some are changed. + * We just show ones which do. + */ + if ((0x200ed7 ^ flags) != 0) { + print_flags("[WARN]\tFlags before", 0x200ed7); + print_flags("[WARN]\tFlags after", flags); + print_flags("[WARN]\tFlags change", (0x200ed7 ^ flags)); + } + + if (bad_arg) { + printf("[FAIL]\targ#%ld clobbered\n", bad_arg); + return 1; + } + printf("[OK]\tArguments are preserved across syscall\n"); + + return check_regs64(); +} + +int run_syscall_twice() +{ + int exitcode = 0; + long sv; + + if (syscall_addr) { + printf("[RUN]\tExecuting 6-argument 32-bit syscall via VDSO\n"); + exitcode = run_syscall(); + } + sv = syscall_addr; + syscall_addr = (long)&int80; + printf("[RUN]\tExecuting 6-argument 32-bit syscall via INT 80\n"); + exitcode += run_syscall(); + syscall_addr = sv; + return exitcode; +} + +void ptrace_me() +{ + pid_t pid; + + fflush(NULL); + pid = fork(); + if (pid < 0) + exit(1); + if (pid == 0) { + /* child */ + if (ptrace(PTRACE_TRACEME, 0L, 0L, 0L) != 0) + exit(0); + raise(SIGSTOP); + return; + } + /* parent */ + printf("[RUN]\tRunning tests under ptrace\n"); + while (1) { + int status; + pid = waitpid(-1, &status, __WALL); + if (WIFEXITED(status)) + exit(WEXITSTATUS(status)); + if (WIFSIGNALED(status)) + exit(WTERMSIG(status)); + if (pid <= 0 || !WIFSTOPPED(status)) /* paranoia */ + exit(255); + /* + * Note: we do not inject sig = WSTOPSIG(status). + * We probably should, but careful: do not inject SIGTRAP + * generated by syscall entry/exit stops. + * That kills the child. + */ + ptrace(PTRACE_SYSCALL, pid, 0L, 0L /*sig*/); + } +} + +int main(int argc, char **argv, char **envp) +{ + int exitcode = 0; + int cs; + + asm("\n" + " movl %%cs, %%eax\n" + : "=a" (cs) + ); + kernel_is_64bit = (cs == 0x23); + if (!kernel_is_64bit) + printf("[NOTE]\tNot a 64-bit kernel, won't test R8..R15 leaks\n"); + + /* This only works for non-static builds: + * syscall_addr = dlsym(dlopen("linux-gate.so.1", RTLD_NOW), "__kernel_vsyscall"); + */ + syscall_addr = get_syscall(envp); + + exitcode += run_syscall_twice(); + ptrace_me(); + exitcode += run_syscall_twice(); + + return exitcode; +} +#endif -- cgit v1.2.3