aboutsummaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/sysfs-rules.rst
diff options
context:
space:
mode:
authorLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
committerLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
commit5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
treecc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/admin-guide/sysfs-rules.rst
downloadlinux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz
linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
Diffstat (limited to 'Documentation/admin-guide/sysfs-rules.rst')
-rw-r--r--Documentation/admin-guide/sysfs-rules.rst192
1 files changed, 192 insertions, 0 deletions
diff --git a/Documentation/admin-guide/sysfs-rules.rst b/Documentation/admin-guide/sysfs-rules.rst
new file mode 100644
index 000000000..abad33526
--- /dev/null
+++ b/Documentation/admin-guide/sysfs-rules.rst
@@ -0,0 +1,192 @@
+Rules on how to access information in sysfs
+===========================================
+
+The kernel-exported sysfs exports internal kernel implementation details
+and depends on internal kernel structures and layout. It is agreed upon
+by the kernel developers that the Linux kernel does not provide a stable
+internal API. Therefore, there are aspects of the sysfs interface that
+may not be stable across kernel releases.
+
+To minimize the risk of breaking users of sysfs, which are in most cases
+low-level userspace applications, with a new kernel release, the users
+of sysfs must follow some rules to use an as-abstract-as-possible way to
+access this filesystem. The current udev and HAL programs already
+implement this and users are encouraged to plug, if possible, into the
+abstractions these programs provide instead of accessing sysfs directly.
+
+But if you really do want or need to access sysfs directly, please follow
+the following rules and then your programs should work with future
+versions of the sysfs interface.
+
+- Do not use libsysfs
+ It makes assumptions about sysfs which are not true. Its API does not
+ offer any abstraction, it exposes all the kernel driver-core
+ implementation details in its own API. Therefore it is not better than
+ reading directories and opening the files yourself.
+ Also, it is not actively maintained, in the sense of reflecting the
+ current kernel development. The goal of providing a stable interface
+ to sysfs has failed; it causes more problems than it solves. It
+ violates many of the rules in this document.
+
+- sysfs is always at ``/sys``
+ Parsing ``/proc/mounts`` is a waste of time. Other mount points are a
+ system configuration bug you should not try to solve. For test cases,
+ possibly support a ``SYSFS_PATH`` environment variable to overwrite the
+ application's behavior, but never try to search for sysfs. Never try
+ to mount it, if you are not an early boot script.
+
+- devices are only "devices"
+ There is no such thing like class-, bus-, physical devices,
+ interfaces, and such that you can rely on in userspace. Everything is
+ just simply a "device". Class-, bus-, physical, ... types are just
+ kernel implementation details which should not be expected by
+ applications that look for devices in sysfs.
+
+ The properties of a device are:
+
+ - devpath (``/devices/pci0000:00/0000:00:1d.1/usb2/2-2/2-2:1.0``)
+
+ - identical to the DEVPATH value in the event sent from the kernel
+ at device creation and removal
+ - the unique key to the device at that point in time
+ - the kernel's path to the device directory without the leading
+ ``/sys``, and always starting with a slash
+ - all elements of a devpath must be real directories. Symlinks
+ pointing to /sys/devices must always be resolved to their real
+ target and the target path must be used to access the device.
+ That way the devpath to the device matches the devpath of the
+ kernel used at event time.
+ - using or exposing symlink values as elements in a devpath string
+ is a bug in the application
+
+ - kernel name (``sda``, ``tty``, ``0000:00:1f.2``, ...)
+
+ - a directory name, identical to the last element of the devpath
+ - applications need to handle spaces and characters like ``!`` in
+ the name
+
+ - subsystem (``block``, ``tty``, ``pci``, ...)
+
+ - simple string, never a path or a link
+ - retrieved by reading the "subsystem"-link and using only the
+ last element of the target path
+
+ - driver (``tg3``, ``ata_piix``, ``uhci_hcd``)
+
+ - a simple string, which may contain spaces, never a path or a
+ link
+ - it is retrieved by reading the "driver"-link and using only the
+ last element of the target path
+ - devices which do not have "driver"-link just do not have a
+ driver; copying the driver value in a child device context is a
+ bug in the application
+
+ - attributes
+
+ - the files in the device directory or files below subdirectories
+ of the same device directory
+ - accessing attributes reached by a symlink pointing to another device,
+ like the "device"-link, is a bug in the application
+
+ Everything else is just a kernel driver-core implementation detail
+ that should not be assumed to be stable across kernel releases.
+
+- Properties of parent devices never belong into a child device.
+ Always look at the parent devices themselves for determining device
+ context properties. If the device ``eth0`` or ``sda`` does not have a
+ "driver"-link, then this device does not have a driver. Its value is empty.
+ Never copy any property of the parent-device into a child-device. Parent
+ device properties may change dynamically without any notice to the
+ child device.
+
+- Hierarchy in a single device tree
+ There is only one valid place in sysfs where hierarchy can be examined
+ and this is below: ``/sys/devices.``
+ It is planned that all device directories will end up in the tree
+ below this directory.
+
+- Classification by subsystem
+ There are currently three places for classification of devices:
+ ``/sys/block,`` ``/sys/class`` and ``/sys/bus.`` It is planned that these will
+ not contain any device directories themselves, but only flat lists of
+ symlinks pointing to the unified ``/sys/devices`` tree.
+ All three places have completely different rules on how to access
+ device information. It is planned to merge all three
+ classification directories into one place at ``/sys/subsystem``,
+ following the layout of the bus directories. All buses and
+ classes, including the converted block subsystem, will show up
+ there.
+ The devices belonging to a subsystem will create a symlink in the
+ "devices" directory at ``/sys/subsystem/<name>/devices``,
+
+ If ``/sys/subsystem`` exists, ``/sys/bus``, ``/sys/class`` and ``/sys/block``
+ can be ignored. If it does not exist, you always have to scan all three
+ places, as the kernel is free to move a subsystem from one place to
+ the other, as long as the devices are still reachable by the same
+ subsystem name.
+
+ Assuming ``/sys/class/<subsystem>`` and ``/sys/bus/<subsystem>``, or
+ ``/sys/block`` and ``/sys/class/block`` are not interchangeable is a bug in
+ the application.
+
+- Block
+ The converted block subsystem at ``/sys/class/block`` or
+ ``/sys/subsystem/block`` will contain the links for disks and partitions
+ at the same level, never in a hierarchy. Assuming the block subsystem to
+ contain only disks and not partition devices in the same flat list is
+ a bug in the application.
+
+- "device"-link and <subsystem>:<kernel name>-links
+ Never depend on the "device"-link. The "device"-link is a workaround
+ for the old layout, where class devices are not created in
+ ``/sys/devices/`` like the bus devices. If the link-resolving of a
+ device directory does not end in ``/sys/devices/``, you can use the
+ "device"-link to find the parent devices in ``/sys/devices/``, That is the
+ single valid use of the "device"-link; it must never appear in any
+ path as an element. Assuming the existence of the "device"-link for
+ a device in ``/sys/devices/`` is a bug in the application.
+ Accessing ``/sys/class/net/eth0/device`` is a bug in the application.
+
+ Never depend on the class-specific links back to the ``/sys/class``
+ directory. These links are also a workaround for the design mistake
+ that class devices are not created in ``/sys/devices.`` If a device
+ directory does not contain directories for child devices, these links
+ may be used to find the child devices in ``/sys/class.`` That is the single
+ valid use of these links; they must never appear in any path as an
+ element. Assuming the existence of these links for devices which are
+ real child device directories in the ``/sys/devices`` tree is a bug in
+ the application.
+
+ It is planned to remove all these links when all class device
+ directories live in ``/sys/devices.``
+
+- Position of devices along device chain can change.
+ Never depend on a specific parent device position in the devpath,
+ or the chain of parent devices. The kernel is free to insert devices into
+ the chain. You must always request the parent device you are looking for
+ by its subsystem value. You need to walk up the chain until you find
+ the device that matches the expected subsystem. Depending on a specific
+ position of a parent device or exposing relative paths using ``../`` to
+ access the chain of parents is a bug in the application.
+
+- When reading and writing sysfs device attribute files, avoid dependency
+ on specific error codes wherever possible. This minimizes coupling to
+ the error handling implementation within the kernel.
+
+ In general, failures to read or write sysfs device attributes shall
+ propagate errors wherever possible. Common errors include, but are not
+ limited to:
+
+ ``-EIO``: The read or store operation is not supported, typically
+ returned by the sysfs system itself if the read or store pointer
+ is ``NULL``.
+
+ ``-ENXIO``: The read or store operation failed
+
+ Error codes will not be changed without good reason, and should a change
+ to error codes result in user-space breakage, it will be fixed, or the
+ the offending change will be reverted.
+
+ Userspace applications can, however, expect the format and contents of
+ the attribute files to remain consistent in the absence of a version
+ attribute change in the context of a given attribute.