Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted

Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2023-02-21 18:24:12 -0800
committer: Linus Torvalds <torvalds@linux-foundation.org> 2023-02-21 18:24:12 -0800
commit: 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
tree: cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/translations/zh_CN/core-api/unaligned-memory-access.rst
download: linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz
linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip
1 files changed, 229 insertions, 0 deletions
diff --git a/Documentation/translations/zh_CN/core-api/unaligned-memory-access.rst b/Documentation/translations/zh_CN/core-api/unaligned-memory-access.rst
new file mode 100644
index 000000000..29c33e7e0
--- /dev/null
+++ b/Documentation/translations/zh_CN/core-api/unaligned-memory-access.rst
@@ -0,0 +1,229 @@
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: Documentation/core-api/unaligned-memory-access.rst
+
+:翻译:
+
+ 司延腾 Yanteng Si <siyanteng@loongson.cn>
+
+:校译:
+
+ 时奎亮 <alexs@kernel.org>
+
+.. _cn_core-api_unaligned-memory-access:
+
+==============
+非对齐内存访问
+==============
+
+:作者: Daniel Drake <dsd@gentoo.org>,
+:作者: Johannes Berg <johannes@sipsolutions.net>
+
+:感谢他们的帮助: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt,
+  Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz,
+  Vadim Lobanov
+
+
+Linux运行在各种各样的架构上，这些架构在内存访问方面有不同的表现。本文介绍了一些
+关于不对齐访问的细节，为什么你需要编写不引起不对齐访问的代码，以及如何编写这样的
+代码
+
+
+非对齐访问的定义
+================
+
+当你试图从一个不被N偶数整除的地址（即addr % N != 0）开始读取N字节的数据时，就
+会发生无对齐内存访问。例如，从地址0x10004读取4个字节的数据是可以的，但从地址
+0x10005读取4个字节的数据将是一个不对齐的内存访问。
+
+上述内容可能看起来有点模糊，因为内存访问可以以不同的方式发生。这里的背景是在机器
+码层面上：某些指令在内存中读取或写入一些字节（例如x86汇编中的movb、movw、movl）。
+正如将变得清晰的那样，相对容易发现那些将编译为多字节内存访问指令的C语句，即在处理
+u16、u32和u64等类型时。
+
+
+自然对齐
+========
+
+上面提到的规则构成了我们所说的自然对齐。当访问N个字节的内存时，基础内存地址必须被
+N平均分割，即addr % N == 0。
+
+在编写代码时，假设目标架构有自然对齐的要求。
+
+在现实中，只有少数架构在所有大小的内存访问上都要求自然对齐。然而，我们必须考虑所
+有支持的架构；编写满足自然对齐要求的代码是实现完全可移植性的最简单方法。
+
+
+为什么非对齐访问时坏事
+======================
+
+执行非对齐内存访问的效果因架构不同而不同。在这里写一整篇关于这些差异的文档是很容
+易的；下面是对常见情况的总结:
+
+ - 一些架构能够透明地执行非对齐内存访问，但通常会有很大的性能代价。
+ - 当不对齐的访问发生时，一些架构会引发处理器异常。异常处理程序能够纠正不对齐的
+   访问，但要付出很大的性能代价。
+ - 一些架构在发生不对齐访问时，会引发处理器异常，但异常中并没有包含足够的信息来
+   纠正不对齐访问。
+ - 有些架构不能进行无对齐内存访问，但会默默地执行与请求不同的内存访问，从而导致
+   难以发现的微妙的代码错误!
+
+从上文可以看出，如果你的代码导致不对齐的内存访问发生，那么你的代码在某些平台上将无
+法正常工作，在其他平台上将导致性能问题。
+
+不会导致非对齐访问的代码
+========================
+
+起初，上面的概念似乎有点难以与实际编码实践联系起来。毕竟，你对某些变量的内存地址没
+有很大的控制权，等等。
+
+幸运的是事情并不复杂，因为在大多数情况下，编译器会确保代码工作正常。例如，以下面的
+结构体为例::
+
+	struct foo {
+		u16 field1;
+		u32 field2;
+		u8 field3;
+	};
+
+让我们假设上述结构体的一个实例驻留在从地址0x10000开始的内存中。根据基本的理解，访问
+field2会导致非对齐访问，这并不是不合理的。你会期望field2位于该结构体的2个字节的偏移
+量，即地址0x10002，但该地址不能被4平均整除（注意，我们在这里读一个4字节的值）。
+
+幸运的是，编译器理解对齐约束，所以在上述情况下，它会在field1和field2之间插入2个字节
+的填充。因此，对于标准的结构体类型，你总是可以依靠编译器来填充结构体，以便对字段的访
+问可以适当地对齐（假设你没有将字段定义不同长度的类型）。
+
+同样，你也可以依靠编译器根据变量类型的大小，将变量和函数参数对齐到一个自然对齐的方案。
+
+在这一点上，应该很清楚，访问单个字节（u8或char）永远不会导致无对齐访问，因为所有的内
+存地址都可以被1均匀地整除。
+
+在一个相关的话题上，考虑到上述因素，你可以观察到，你可以对结构体中的字段进行重新排序，
+以便将字段放在不重排就会插入填充物的地方，从而减少结构体实例的整体常驻内存大小。上述
+例子的最佳布局是::
+
+	struct foo {
+		u32 field2;
+		u16 field1;
+		u8 field3;
+	};
+
+对于一个自然对齐方案，编译器只需要在结构的末尾添加一个字节的填充。添加这种填充是为了满
+足这些结构的数组的对齐约束。
+
+另一点值得一提的是在结构体类型上使用__attribute__((packed))。这个GCC特有的属性告诉编
+译器永远不要在结构体中插入任何填充，当你想用C结构体来表示一些“off the wire”的固定排列
+的数据时，这个属性很有用。
+
+你可能会倾向于认为，在访问不满足架构对齐要求的字段时，使用这个属性很容易导致不对齐的访
+问。然而，编译器也意识到了对齐的限制，并且会产生额外的指令来执行内存访问，以避免造成不
+对齐的访问。当然，与non-packed的情况相比，额外的指令显然会造成性能上的损失，所以packed
+属性应该只在避免结构填充很重要的时候使用。
+
+
+导致非对齐访问的代码
+====================
+
+考虑到上述情况，让我们来看看一个现实生活中可能导致非对齐内存访问的函数的例子。下面这个
+函数取自include/linux/etherdevice.h，是一个优化的例程，用于比较两个以太网MAC地址是否
+相等::
+
+  bool ether_addr_equal(const u8 *addr1, const u8 *addr2)
+  {
+  #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
+		   ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));
+
+	return fold == 0;
+  #else
+	const u16 *a = (const u16 *)addr1;
+	const u16 *b = (const u16 *)addr2;
+	return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;
+  #endif
+  }
+
+在上述函数中，当硬件具有高效的非对齐访问能力时，这段代码没有问题。但是当硬件不能在任意
+边界上访问内存时，对a[0]的引用导致从地址addr1开始的2个字节（16位）被读取。
+
+想一想，如果addr1是一个奇怪的地址，如0x10003，会发生什么？(提示：这将是一个非对齐访
+问。)
+
+尽管上述函数存在潜在的非对齐访问问题，但它还是被包含在内核中，但被理解为只在16位对齐
+的地址上正常工作。调用者应该确保这种对齐方式或者根本不使用这个函数。这个不对齐的函数
+仍然是有用的，因为它是在你能确保对齐的情况下的一个很好的优化，这在以太网网络环境中几
+乎是一直如此。
+
+
+下面是另一个可能导致非对齐访问的代码的例子::
+
+	void myfunc(u8 *data, u32 value)
+	{
+		[...]
+		*((u32 *) data) = cpu_to_le32(value);
+		[...]
+	}
+
+每当数据参数指向的地址不被4均匀整除时，这段代码就会导致非对齐访问。
+
+综上所述，你可能遇到非对齐访问问题的两种主要情况包括:
+
+ 1. 将变量定义不同长度的类型
+ 2. 指针运算后访问至少2个字节的数据
+
+
+避免非对齐访问
+==============
+
+避免非对齐访问的最简单方法是使用<asm/unaligned.h>头文件提供的get_unaligned()和
+put_unaligned()宏。
+
+回到前面的一个可能导致非对齐访问的代码例子::
+
+	void myfunc(u8 *data, u32 value)
+	{
+		[...]
+		*((u32 *) data) = cpu_to_le32(value);
+		[...]
+	}
+
+为了避免非对齐的内存访问，你可以将其改写如下::
+
+	void myfunc(u8 *data, u32 value)
+	{
+		[...]
+		value = cpu_to_le32(value);
+		put_unaligned(value, (u32 *) data);
+		[...]
+	}
+
+get_unaligned()宏的工作原理与此类似。假设'data'是一个指向内存的指针，并且你希望避免
+非对齐访问，其用法如下::
+
+	u32 value = get_unaligned((u32 *) data);
+
+这些宏适用于任何长度的内存访问（不仅仅是上面例子中的32位）。请注意，与标准的对齐内存
+访问相比，使用这些宏来访问非对齐内存可能会在性能上付出代价。
+
+如果使用这些宏不方便，另一个选择是使用memcpy()，其中源或目标（或两者）的类型为u8*或
+非对齐char*。由于这种操作的字节性质，避免了非对齐访问。
+
+
+对齐 vs. 网络
+=============
+
+在需要对齐负载的架构上，网络要求IP头在四字节边界上对齐，以优化IP栈。对于普通的以太网
+硬件，常数NET_IP_ALIGN被使用。在大多数架构上，这个常数的值是2，因为正常的以太网头是
+14个字节，所以为了获得适当的对齐，需要DMA到一个可以表示为4*n+2的地址。一个值得注意的
+例外是powerpc，它将NET_IP_ALIGN定义为0，因为DMA到未对齐的地址可能非常昂贵，与未对齐
+的负载的成本相比相形见绌。
+
+对于一些不能DMA到未对齐地址的以太网硬件，如4*n+2或非以太网硬件，这可能是一个问题，这
+时需要将传入的帧复制到一个对齐的缓冲区。因为这在可以进行非对齐访问的架构上是不必要的，
+所以可以使代码依赖于CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS，像这样::
+
+	#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+		skb = original skb
+	#else
+		skb = copy skb
+	#endif
author	Linus Torvalds <torvalds@linux-foundation.org>	2023-02-21 18:24:12 -0800
committer	Linus Torvalds <torvalds@linux-foundation.org>	2023-02-21 18:24:12 -0800
commit	5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
tree	cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/translations/zh_CN/core-api/unaligned-memory-access.rst
download	linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip