author		2023-02-21 18:24:12 -0800
committer	2023-02-21 18:24:12 -0800
commit		5b7c4cabbb65f5c469464da6c5f614cbd7f730f2
tree		cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /tools/memory-model/Documentation/recipes.txt
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existence
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'tools/memory-model/Documentation/recipes.txt')
-rw-r--r--	tools/memory-model/Documentation/recipes.txt	570
1 file changed, 570 insertions, 0 deletions
diff --git a/tools/memory-model/Documentation/recipes.txt b/tools/memory-model/Documentation/recipes.txt
new file mode 100644
index 000000000..03f58b11c
--- /dev/null
+++ b/tools/memory-model/Documentation/recipes.txt
@@ -0,0 +1,570 @@

This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or only one memory location is accessed, and the
second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.  Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.

4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores; a brief
	illustration follows this list.  There are exceptions to this rule,
	including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().
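
As a minimal illustration of item 4 (a sketch in the style of the litmus
tests used throughout this document, not taken from an in-tree litmus
test; x is a shared variable and r0 and r1 are CPU-local), the marked
accesses prevent the compiler from tearing the store into smaller stores
or from fusing the two loads into one:

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);	/* Emitted as one full-sized store. */
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(x);	/* Each READ_ONCE() is a distinct load, */
		r1 = READ_ONCE(x);	/* so r0 and r1 may legitimately differ. */
	}

Had CPU1() used plain C-language loads, the compiler would have been
within its rights to fuse them into a single load, so that r0 and r1
would always be equal even if some other CPU updated x between the two
statements.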


Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

	/* See MP+polocks.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
		r1 = READ_ONCE(x);
	}

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

	/* See MP+porevlocks.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(y);
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
		WRITE_ONCE(y, 1);
	}

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.
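
For example (a sketch illustrating this extension, not taken from an
in-tree litmus test), if CPU0(), CPU1(), and CPU2() below acquire mylock
in that order, then the basic rule applies to each successive pair of
critical sections, so that both r0 and r1 must have the final value 1:

	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}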

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

	/* See MP+pooncerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		r1 = READ_ONCE(x);
	}

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

	/* See MP+onceassign+derefonce.litmus. */
	int z;
	int *y = &z;
	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		rcu_assign_pointer(y, &x);
	}

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);
		rcu_read_unlock();
	}

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
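
For illustration (a sketch, not taken from an in-tree litmus test, with
hypothetical additional variables "int *p" and "int b" added to the
example above), here are the other two kinds of dependency flowing from
the rcu_dereference():

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		WRITE_ONCE(p, r0);	/* Data dependency: the value stored is the value loaded. */
		if (r0 != &z)
			WRITE_ONCE(b, 1);	/* Control dependency: whether the store executes depends on the value loaded. */
		rcu_read_unlock();
	}

In both cases the ordering extends only from the rcu_dereference() to the
dependent store, not to other, unrelated later accesses.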

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_rmb();
		r1 = READ_ONCE(x);
	}

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

	log->l_curr_block -= log->l_logBBsize;
	ASSERT(log->l_curr_block >= 0);
	smp_wmb();
	log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

	cur_cycle = READ_ONCE(log->l_curr_cycle);
	smp_rmb();
	cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

 *	kernel				user
 *
 *	if (LOAD ->data_tail) {		LOAD ->data_head
 *			(A)		smp_rmb()	(C)
 *		STORE $data		LOAD $data
 *		smp_wmb()	(B)	smp_mb()	(D)
 *		STORE ->data_head	STORE ->data_tail
 *	}

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		smp_mb();
		WRITE_ONCE(x, 1);
	}

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.
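
One caveat, noted here as general background rather than as part of the
recipe (see Documentation/memory-barriers.txt for the full discussion):
the control dependency in CPU0() provides ordering only as long as the
compiler cannot optimize the conditional away.  For example, in the
following sketch the same value is stored on both branches:

	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
		else
			WRITE_ONCE(y, 1);
	}

The compiler is then within its rights to hoist the store out of the
"if" statement entirely, leaving just a load followed by an unconditional
store, which has no control dependency and therefore provides no ordering.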

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

 *	kernel				user
 *
 *	if (LOAD ->data_tail) {		LOAD ->data_head
 *			(A)		smp_rmb()	(C)
 *		STORE $data		LOAD $data
 *		smp_wmb()	(B)	smp_mb()	(D)
 *		STORE ->data_head	STORE ->data_tail
 *	}
 *
 *	Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		r1 = smp_load_acquire(&z);
		r2 = READ_ONCE(x);
	}

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().
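
That weaker variant of CPU1() would look like this (a sketch of the
substitution just described):

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_store_release(&z, 1);
	}

The smp_store_release() still orders CPU1()'s prior load of y against its
store to z, which is what keeps the chain to CPU2() intact.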

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

	/* See SB+fencembonceonces.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	void CPU1(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *	CPU0 - waker			CPU1 - waiter
 *
 *					for (;;) {
 *	@cond = true;			  prepare_to_wait(&wq_head, &wait, state);
 *	smp_mb();			  // smp_mb() from set_current_state()
 *	if (waitqueue_active(wq_head))	  if (@cond)
 *	  wake_up(wq_head);		    break;
 *					  schedule();
 *					}
 *					finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.	Write-to-read, where the next CPU reads the value that the
	previous CPU wrote.  The LB litmus-test patterns contain only
	this type of relation.  In formal memory-modeling texts, this
	relation is called "reads-from" and is usually abbreviated "rf".

2.	Read-to-write, where the next CPU overwrites the value that the
	previous CPU read.  The SB litmus test contains only this type
	of relation.  In formal memory-modeling texts, this relation is
	often called "from-reads" and is sometimes abbreviated "fr".

3.	Write-to-write, where the next CPU overwrites the value written
	by the previous CPU.  The Z6.0 litmus test pattern contains a
	write-to-write relation between the last access of CPU1() and
	the first access of CPU2().  In formal memory-modeling texts,
	this relation is often called "coherence order" and is sometimes
	abbreviated "co".  In the C++ standard, it is instead called
	"modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o	If all links are write-to-read links, then the weakest
	possible ordering within each CPU suffices.  For example, in
	the LB litmus test, a control dependency was enough to do the
	job.

o	If all but one of the links are write-to-read links, then a
	release-acquire chain suffices.  Both the MP and the ISA2
	litmus tests illustrate this case.

o	If more than one of the links are something other than
	write-to-read links, then a full memory barrier is required
	between each successive pair of non-write-to-read links.  This
	case is illustrated by the Z6.0 litmus tests, both in the
	locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
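
For reference, here is a sketch of what such a litmus test looks like in
practice, modeled on the in-tree MP+poonceonces.litmus (the in-tree
version also carries an explanatory comment block), along with the herd7
invocation documented in tools/memory-model/README:

	C MP+poonceonces

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		WRITE_ONCE(*y, 1);
	}

	P1(int *x, int *y)
	{
		int r0;
		int r1;

		r0 = READ_ONCE(*y);
		r1 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 1:r1=0)

Running it from the tools/memory-model directory:

	$ herd7 -conf linux-kernel.cfg litmus-tests/MP+poonceonces.litmus

The "exists" clause names the outcome of interest, and the model reports
whether that outcome is possible under the Linux-kernel memory model.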