aboutsummaryrefslogtreecommitdiff
path: root/Documentation/staging/static-keys.rst
diff options
context:
space:
mode:
authorLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
committerLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
commit5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
treecc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/staging/static-keys.rst
downloadlinux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz
linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
Diffstat (limited to 'Documentation/staging/static-keys.rst')
-rw-r--r--Documentation/staging/static-keys.rst328
1 files changed, 328 insertions, 0 deletions
diff --git a/Documentation/staging/static-keys.rst b/Documentation/staging/static-keys.rst
new file mode 100644
index 000000000..b0a519f45
--- /dev/null
+++ b/Documentation/staging/static-keys.rst
@@ -0,0 +1,328 @@
+===========
+Static Keys
+===========
+
+.. warning::
+
+ DEPRECATED API:
+
+ The use of 'struct static_key' directly, is now DEPRECATED. In addition
+ static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following::
+
+ struct static_key false = STATIC_KEY_INIT_FALSE;
+ struct static_key true = STATIC_KEY_INIT_TRUE;
+ static_key_true()
+ static_key_false()
+
+ The updated API replacements are::
+
+ DEFINE_STATIC_KEY_TRUE(key);
+ DEFINE_STATIC_KEY_FALSE(key);
+ DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
+ DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
+ static_branch_likely()
+ static_branch_unlikely()
+
+Abstract
+========
+
+Static keys allows the inclusion of seldom used features in
+performance-sensitive fast-path kernel code, via a GCC feature and a code
+patching technique. A quick example::
+
+ DEFINE_STATIC_KEY_FALSE(key);
+
+ ...
+
+ if (static_branch_unlikely(&key))
+ do unlikely code
+ else
+ do likely code
+
+ ...
+ static_branch_enable(&key);
+ ...
+ static_branch_disable(&key);
+ ...
+
+The static_branch_unlikely() branch will be generated into the code with as little
+impact to the likely code path as possible.
+
+
+Motivation
+==========
+
+
+Currently, tracepoints are implemented using a conditional branch. The
+conditional check requires checking a global variable for each tracepoint.
+Although the overhead of this check is small, it increases when the memory
+cache comes under pressure (memory cache lines for these global variables may
+be shared with other memory accesses). As we increase the number of tracepoints
+in the kernel this overhead may become more of an issue. In addition,
+tracepoints are often dormant (disabled) and provide no direct kernel
+functionality. Thus, it is highly desirable to reduce their impact as much as
+possible. Although tracepoints are the original motivation for this work, other
+kernel code paths should be able to make use of the static keys facility.
+
+
+Solution
+========
+
+
+gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
+
+https://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
+
+Using the 'asm goto', we can create branches that are either taken or not taken
+by default, without the need to check memory. Then, at run-time, we can patch
+the branch site to change the branch direction.
+
+For example, if we have a simple branch that is disabled by default::
+
+ if (static_branch_unlikely(&key))
+ printk("I am the true branch\n");
+
+Thus, by default the 'printk' will not be emitted. And the code generated will
+consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
+straight-line code path. When the branch is 'flipped', we will patch the
+'no-op' in the straight-line codepath with a 'jump' instruction to the
+out-of-line true branch. Thus, changing branch direction is expensive but
+branch selection is basically 'free'. That is the basic tradeoff of this
+optimization.
+
+This lowlevel patching mechanism is called 'jump label patching', and it gives
+the basis for the static keys facility.
+
+Static key label API, usage and examples
+========================================
+
+
+In order to make use of this optimization you must first define a key::
+
+ DEFINE_STATIC_KEY_TRUE(key);
+
+or::
+
+ DEFINE_STATIC_KEY_FALSE(key);
+
+
+The key must be global, that is, it can't be allocated on the stack or dynamically
+allocated at run-time.
+
+The key is then used in code as::
+
+ if (static_branch_unlikely(&key))
+ do unlikely code
+ else
+ do likely code
+
+Or::
+
+ if (static_branch_likely(&key))
+ do likely code
+ else
+ do unlikely code
+
+Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may
+be used in either static_branch_likely() or static_branch_unlikely()
+statements.
+
+Branch(es) can be set true via::
+
+ static_branch_enable(&key);
+
+or false via::
+
+ static_branch_disable(&key);
+
+The branch(es) can then be switched via reference counts::
+
+ static_branch_inc(&key);
+ ...
+ static_branch_dec(&key);
+
+Thus, 'static_branch_inc()' means 'make the branch true', and
+'static_branch_dec()' means 'make the branch false' with appropriate
+reference counting. For example, if the key is initialized true, a
+static_branch_dec(), will switch the branch to false. And a subsequent
+static_branch_inc(), will change the branch back to true. Likewise, if the
+key is initialized false, a 'static_branch_inc()', will change the branch to
+true. And then a 'static_branch_dec()', will again make the branch false.
+
+The state and the reference count can be retrieved with 'static_key_enabled()'
+and 'static_key_count()'. In general, if you use these functions, they
+should be protected with the same mutex used around the enable/disable
+or increment/decrement function.
+
+Note that switching branches results in some locks being taken,
+particularly the CPU hotplug lock (in order to avoid races against
+CPUs being brought in the kernel while the kernel is getting
+patched). Calling the static key API from within a hotplug notifier is
+thus a sure deadlock recipe. In order to still allow use of the
+functionality, the following functions are provided:
+
+ static_key_enable_cpuslocked()
+ static_key_disable_cpuslocked()
+ static_branch_enable_cpuslocked()
+ static_branch_disable_cpuslocked()
+
+These functions are *not* general purpose, and must only be used when
+you really know that you're in the above context, and no other.
+
+Where an array of keys is required, it can be defined as::
+
+ DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
+
+or::
+
+ DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
+
+4) Architecture level code patching interface, 'jump labels'
+
+
+There are a few functions and macros that architectures must implement in order
+to take advantage of this optimization. If there is no architecture support, we
+simply fall back to a traditional, load, test, and jump sequence. Also, the
+struct jump_entry table must be at least 4-byte aligned because the
+static_key->entry field makes use of the two least significant bits.
+
+* ``select HAVE_ARCH_JUMP_LABEL``,
+ see: arch/x86/Kconfig
+
+* ``#define JUMP_LABEL_NOP_SIZE``,
+ see: arch/x86/include/asm/jump_label.h
+
+* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``,
+ see: arch/x86/include/asm/jump_label.h
+
+* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``,
+ see: arch/x86/include/asm/jump_label.h
+
+* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``,
+ see: arch/x86/kernel/jump_label.c
+
+* ``struct jump_entry``,
+ see: arch/x86/include/asm/jump_label.h
+
+
+5) Static keys / jump label analysis, results (x86_64):
+
+
+As an example, let's add the following branch to 'getppid()', such that the
+system call now looks like::
+
+ SYSCALL_DEFINE0(getppid)
+ {
+ int pid;
+
+ + if (static_branch_unlikely(&key))
+ + printk("I am the true branch\n");
+
+ rcu_read_lock();
+ pid = task_tgid_vnr(rcu_dereference(current->real_parent));
+ rcu_read_unlock();
+
+ return pid;
+ }
+
+The resulting instructions with jump labels generated by GCC is::
+
+ ffffffff81044290 <sys_getppid>:
+ ffffffff81044290: 55 push %rbp
+ ffffffff81044291: 48 89 e5 mov %rsp,%rbp
+ ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9>
+ ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax
+ ffffffff810442a0: 00 00
+ ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax
+ ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax
+ ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi
+ ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr>
+ ffffffff810442bc: 5d pop %rbp
+ ffffffff810442bd: 48 98 cltq
+ ffffffff810442bf: c3 retq
+ ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi
+ ffffffff810442c7: 31 c0 xor %eax,%eax
+ ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk>
+ ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9>
+
+Without the jump label optimization it looks like::
+
+ ffffffff810441f0 <sys_getppid>:
+ ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key>
+ ffffffff810441f6: 55 push %rbp
+ ffffffff810441f7: 48 89 e5 mov %rsp,%rbp
+ ffffffff810441fa: 85 c0 test %eax,%eax
+ ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35>
+ ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax
+ ffffffff81044205: 00 00
+ ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax
+ ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax
+ ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi
+ ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr>
+ ffffffff81044221: 5d pop %rbp
+ ffffffff81044222: 48 98 cltq
+ ffffffff81044224: c3 retq
+ ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi
+ ffffffff8104422c: 31 c0 xor %eax,%eax
+ ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk>
+ ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe>
+ ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
+ ffffffff8104423c: 00 00 00 00
+
+Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction
+vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched
+to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump
+label case adds::
+
+ 6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes.
+
+If we then include the padding bytes, the jump label code saves, 16 total bytes
+of instruction memory for this small function. In this case the non-jump label
+function is 80 bytes long. Thus, we have saved 20% of the instruction
+footprint. We can in fact improve this even further, since the 5-byte no-op
+really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
+However, we have not yet implemented optimal no-op sizes (they are currently
+hard-coded).
+
+Since there are a number of static key API uses in the scheduler paths,
+'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
+performance improvement. Testing done on 3.3.0-rc2:
+
+jump label disabled::
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+ 855.700314 task-clock # 0.534 CPUs utilized ( +- 0.11% )
+ 200,003 context-switches # 0.234 M/sec ( +- 0.00% )
+ 0 CPU-migrations # 0.000 M/sec ( +- 39.58% )
+ 487 page-faults # 0.001 M/sec ( +- 0.02% )
+ 1,474,374,262 cycles # 1.723 GHz ( +- 0.17% )
+ <not supported> stalled-cycles-frontend
+ <not supported> stalled-cycles-backend
+ 1,178,049,567 instructions # 0.80 insns per cycle ( +- 0.06% )
+ 208,368,926 branches # 243.507 M/sec ( +- 0.06% )
+ 5,569,188 branch-misses # 2.67% of all branches ( +- 0.54% )
+
+ 1.601607384 seconds time elapsed ( +- 0.07% )
+
+jump label enabled::
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+ 841.043185 task-clock # 0.533 CPUs utilized ( +- 0.12% )
+ 200,004 context-switches # 0.238 M/sec ( +- 0.00% )
+ 0 CPU-migrations # 0.000 M/sec ( +- 40.87% )
+ 487 page-faults # 0.001 M/sec ( +- 0.05% )
+ 1,432,559,428 cycles # 1.703 GHz ( +- 0.18% )
+ <not supported> stalled-cycles-frontend
+ <not supported> stalled-cycles-backend
+ 1,175,363,994 instructions # 0.82 insns per cycle ( +- 0.04% )
+ 206,859,359 branches # 245.956 M/sec ( +- 0.04% )
+ 4,884,119 branch-misses # 2.36% of all branches ( +- 0.85% )
+
+ 1.579384366 seconds time elapsed
+
+The percentage of saved branches is .7%, and we've saved 12% on
+'branch-misses'. This is where we would expect to get the most savings, since
+this optimization is about reducing the number of branches. In addition, we've
+saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time.