From 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds
Date: Tue, 21 Feb 2023 18:24:12 -0800
Subject: Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Add dedicated kmem_cache for typical/small skb->head, avoid having
     to access struct page at kfree time, and improve memory use.

   - Introduce sysctl to set default RPS configuration for new netdevs.

   - Define Netlink protocol specification format which can be used to
     describe messages used by each family and auto-generate parsers.
     Add tools for generating kernel data structures and uAPI headers.

   - Expose all net/core sysctls inside netns.

   - Remove 4s sleep in netpoll if carrier is instantly detected on
     boot.

   - Add configurable limit of MDB entries per port, and port-vlan.

   - Continue populating drop reasons throughout the stack.

   - Retire a handful of legacy Qdiscs and classifiers.

  Protocols:

   - Support IPv4 big TCP (TSO frames larger than 64kB).

   - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
     on socket by socket basis.

   - Track and report in procfs number of MPTCP sockets used.

   - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
     manager.

   - IPv6: don't check net.ipv6.route.max_size and rely on garbage
     collection to free memory (similarly to IPv4).

   - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

   - ICMP: add per-rate limit counters.

   - Add support for user scanning requests in ieee802154.

   - Remove static WEP support.

   - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
     reporting.

   - WiFi 7 EHT channel puncturing support (client & AP).

  BPF:

   - Add a rbtree data structure following the "next-gen data structure"
     precedent set by recently added linked list, that is, by using
     kfunc + kptr instead of adding a new BPF map type.

   - Expose XDP hints via kfuncs with initial support for RX hash and
     timestamp metadata.

   - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
     better support decap on GRE tunnel devices not operating in collect
     metadata.

   - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

   - Remove the need for trace_printk_lock for bpf_trace_printk and
     bpf_trace_vprintk helpers.

   - Extend libbpf's bpf_tracing.h support for tracing arguments of
     kprobes/uprobes and syscall as a special case.

   - Significantly reduce the search time for module symbols by
     livepatch and BPF.

   - Enable cpumasks to be used as kptrs, which is useful for tracing
     programs tracking which tasks end up running on which CPUs in
     different time intervals.

   - Add support for BPF trampoline on s390x and riscv64.

   - Add capability to export the XDP features supported by the NIC.

   - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

   - Add cgroup.memory=nobpf kernel parameter option to disable BPF
     memory accounting for container environments.

  Netfilter:

   - Remove the CLUSTERIP target. It has been marked as obsolete for
     years, and we still have WARN splats wrt races of the out-of-band
     /proc interface installed by this target.

   - Add 'destroy' commands to nf_tables. They are identical to the
     existing 'delete' commands, but do not return an error if the
     referenced object (set, chain, rule...) did not exist.

  Driver API:

   - Improve cpumask_local_spread() locality to help NICs set the right
     IRQ affinity on AMD platforms.

   - Separate C22 and C45 MDIO bus transactions more clearly.

   - Introduce new DCB table to control DSCP rewrite on egress.

   - Support configuration of Physical Layer Collision Avoidance (PLCA)
     Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
     shared medium Ethernet.

   - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
     preemption of low priority frames by high priority frames.

   - Add support for controlling MACSec offload using netlink SET.

   - Rework devlink instance refcounts to allow registration and
     de-registration under the instance lock.
     Split the code into multiple files, drop some of the unnecessarily
     granular locks and factor out common parts of netlink operation
     handling.

   - Add TX frame aggregation parameters (for USB drivers).

   - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
     messages with notifications for debug.

   - Allow offloading of UDP NEW connections via act_ct.

   - Add support for per action HW stats in TC.

   - Support hardware miss to TC action (continue processing in SW from
     a specific point in the action chain).

   - Warn if old Wireless Extension user space interface is used with
     modern cfg80211/mac80211 drivers. Do not support Wireless
     Extensions for Wi-Fi 7 devices at all. Everyone should switch to
     using nl80211 interface instead.

   - Improve the CAN bit timing configuration. Use extack to return
     error messages directly to user space, update the SJW handling,
     including the definition of a new default value that will benefit
     CAN-FD controllers, by increasing their oscillator tolerance.

  New hardware / drivers:

   - Ethernet:
      - nVidia BlueField-3 support (control traffic driver)
      - Ethernet support for imx93 SoCs
      - Motorcomm yt8531 gigabit Ethernet PHY
      - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
      - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
      - Amlogic gxl MDIO mux

   - WiFi:
      - RealTek RTL8188EU (rtl8xxxu)
      - Qualcomm Wi-Fi 7 devices (ath12k)

   - CAN:
      - Renesas R-Car V4H

  Drivers:

   - Bluetooth:
      - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

   - Ethernet NICs:
      - Intel (1G, igc):
         - support TSN / Qbv / packet scheduling features of i226 model
      - Intel (100G, ice):
         - use GNSS subsystem instead of TTY
         - multi-buffer XDP support
         - extend support for GPIO pins to E823 devices
      - nVidia/Mellanox:
         - update the shared buffer configuration on PFC commands
         - implement PTP adjphase function for HW offset control
         - TC support for Geneve and GRE with VF tunnel offload
         - more efficient crypto key management method
         - multi-port eswitch support
      - Netronome/Corigine:
         - add DCB IEEE support
         - support IPsec offloading for NFP3800
      - Freescale/NXP (enetc):
         - support XDP_REDIRECT for XDP non-linear buffers
         - improve reconfig, avoid link flap and waiting for idle
         - support MAC Merge layer
      - Other NICs:
         - sfc/ef100: add basic devlink support for ef100
         - ionic: rx_push mode operation (writing descriptors via MMIO)
         - bnxt: use the auxiliary bus abstraction for RDMA
         - r8169: disable ASPM and reset bus in case of tx timeout
         - cpsw: support QSGMII mode for J721e CPSW9G
         - cpts: support pulse-per-second output
         - ngbe: add an mdio bus driver
         - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
         - r8152: handle devices with FW with NCM support
         - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
         - virtio-net: support multi buffer XDP
         - virtio/vsock: replace virtio_vsock_pkt with sk_buff
         - tsnep: XDP support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add support for latency TLV (in FW control messages)
      - Microchip (sparx5):
         - separate explicit and implicit traffic forwarding rules, make
           the implicit rules always active
         - add support for egress DSCP rewrite
         - IS0 VCAP support (Ingress Classification)
         - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
           etc.)
         - ES2 VCAP support (Egress Access Control)
         - support for Per-Stream Filtering and Policing (802.1Q,
           8.6.5.1)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - add MAB (port auth) offload support
         - enable PTP receive for mv88e6390
      - NXP (ocelot):
         - support MAC Merge layer
         - support for the vsc7512 internal copper phys
      - Microchip:
         - lan9303: convert to PHYLINK
         - lan966x: support TC flower filter statistics
         - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
         - lan937x: support Credit Based Shaper configuration
         - ksz9477: support Energy Efficient Ethernet
      - other:
         - qca8k: convert to regmap read/write API, use bulk operations
         - rswitch: Improve TX timestamp accuracy

   - Intel WiFi (iwlwifi):
      - EHT (Wi-Fi 7) rate reporting
      - STEP equalizer support: transfer some STEP (connection to radio
        on platforms with integrated wifi) related parameters from the
        BIOS to the firmware.

   - Qualcomm 802.11ax WiFi (ath11k):
      - IPQ5018 support
      - Fine Timing Measurement (FTM) responder role support
      - channel 177 support

   - MediaTek WiFi (mt76):
      - per-PHY LED support
      - mt7996: EHT (Wi-Fi 7) support
      - Wireless Ethernet Dispatch (WED) reset support
      - switch to using page pool allocator

   - RealTek WiFi (rtw89):
      - support new version of Bluetooth coexistence

   - Mobile:
      - rmnet: support TX aggregation"

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
---
 Documentation/userspace-api/netlink/specs.rst | 425 ++++++++++++++++++++++++++
 1 file changed, 425 insertions(+)
 create mode 100644 Documentation/userspace-api/netlink/specs.rst

(limited to 'Documentation/userspace-api/netlink/specs.rst')

diff --git a/Documentation/userspace-api/netlink/specs.rst b/Documentation/userspace-api/netlink/specs.rst
new file mode 100644
index 000000000..6ffe8137c
--- /dev/null
+++ b/Documentation/userspace-api/netlink/specs.rst
@@ -0,0 +1,425 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+=========================================
+Netlink protocol specifications (in YAML)
+=========================================
+
+Netlink protocol specifications are complete, machine readable descriptions of
+Netlink protocols written in YAML. The goal of the specifications is to allow
+separating Netlink parsing from user space logic and minimize the amount of
+hand written Netlink code for each new family, command, attribute.
+Netlink specs should be complete and not depend on any other spec
+or C header file, making it easy to use in languages which can't include
+kernel headers directly.
+
+Internally kernel uses the YAML specs to generate:
+
+ - the C uAPI header
+ - documentation of the protocol as a ReST file
+ - policy tables for input attribute validation
+ - operation tables
+
+YAML specifications can be found under ``Documentation/netlink/specs/``
+
+This document describes details of the schema.
+See :doc:`intro-specs` for a practical starting guide.
+
+Compatibility levels
+====================
+
+There are four schema levels for Netlink specs, from the simplest used
+by new families to the most complex covering all the quirks of the old ones.
+Each next level inherits the attributes of the previous level, meaning that
+user capable of parsing more complex ``genetlink`` schemas is also compatible
+with simpler ones. The levels are:
+
+ - ``genetlink`` - most streamlined, should be used by all new families
+ - ``genetlink-c`` - superset of ``genetlink`` with extra attributes allowing
+   customization of define and enum type and value names; this schema should
+   be equivalent to ``genetlink`` for all implementations which don't interact
+   directly with C uAPI headers
+ - ``genetlink-legacy`` - Generic Netlink catch all schema supporting quirks of
+   all old genetlink families, strange attribute formats, binary structures etc.
+ - ``netlink-raw`` - catch all schema supporting pre-Generic Netlink protocols
+   such as ``NETLINK_ROUTE``
+
+The definition of the schemas (in ``jsonschema``) can be found
+under ``Documentation/netlink/``.
+
+Schema structure
+================
+
+YAML schema has the following conceptual sections:
+
+ - globals
+ - definitions
+ - attributes
+ - operations
+ - multicast groups
+
+Most properties in the schema accept (or in fact require) a ``doc``
+sub-property documenting the defined object.
+
+The following sections describe the properties of the most modern ``genetlink``
+schema. See the documentation of :doc:`genetlink-c <c-code-gen>`
+for information on how C names are derived from name properties.
+
+genetlink
+=========
+
+Globals
+-------
+
+Attributes listed directly at the root level of the spec file.
+
+name
+~~~~
+
+Name of the family. Name identifies the family in a unique way, since
+the Family IDs are allocated dynamically.
+
+version
+~~~~~~~
+
+Generic Netlink family version, default is 1.
+
+protocol
+~~~~~~~~
+
+The schema level, default is ``genetlink``, which is the only value
+allowed for new ``genetlink`` families.
+
+definitions
+-----------
+
+Array of type and constant definitions.
+
+name
+~~~~
+
+Name of the type / constant.
+
+type
+~~~~
+
+One of the following types:
+
+ - const - a single, standalone constant
+ - enum - defines an integer enumeration, with values for each entry
+   incrementing by 1, (e.g. 0, 1, 2, 3)
+ - flags - defines an integer enumeration, with values for each entry
+   occupying a bit, starting from bit 0, (e.g. 1, 2, 4, 8)
+
+value
+~~~~~
+
+The value for the ``const``.
+
+value-start
+~~~~~~~~~~~
+
+The first value for ``enum`` and ``flags``, allows overriding the default
+start value of ``0`` (for ``enum``) and starting bit (for ``flags``).
+For ``flags`` ``value-start`` selects the starting bit, not the shifted value.
+
+Sparse enumerations are not supported.
+
+entries
+~~~~~~~
+
+Array of names of the entries for ``enum`` and ``flags``.
+
+header
+~~~~~~
+
+For C-compatible languages, header which already defines this value.
+In case the definition is shared by multiple families (e.g. ``IFNAMSIZ``)
+code generators for C-compatible languages may prefer to add an appropriate
+include instead of rendering a new definition.
+
+attribute-sets
+--------------
+
+This property contains information about netlink attributes of the family.
+All families have at least one attribute set, most have multiple.
+``attribute-sets`` is an array, with each entry describing a single set.
+
+Note that the spec is "flattened" and is not meant to visually resemble
+the format of the netlink messages (unlike certain ad-hoc documentation
+formats seen in kernel comments). In the spec subordinate attribute sets
+are not defined inline as a nest, but defined in a separate attribute set
+referred to with a ``nested-attributes`` property of the container.
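As an illustration of the flattened layout described above, the following sketch walks a spec that has already been loaded into Python dictionaries and resolves ``nested-attributes`` references by set name. The family contents (``dev``, ``queue`` and their attributes) and the helper name ``describe`` are hypothetical, made up for this example; only the property names come from the text.

```python
# Hypothetical spec fragment, shaped like the result of loading the YAML.
# The set and attribute names are made up for illustration.
spec = {
    "attribute-sets": [
        {
            "name": "dev",
            "attributes": [
                {"name": "ifindex", "type": "u32"},
                # The nest is not defined inline; it names another set.
                {"name": "queue", "type": "nest", "nested-attributes": "queue"},
            ],
        },
        {
            "name": "queue",
            "attributes": [
                {"name": "id", "type": "u32"},
                {"name": "len", "type": "u32"},
            ],
        },
    ]
}


def describe(spec, set_name, depth=0):
    """Return indented lines showing the message layout implied by a set.

    Assumes that sets do not nest themselves recursively.
    """
    sets = {s["name"]: s for s in spec["attribute-sets"]}
    lines = []
    for attr in sets[set_name]["attributes"]:
        lines.append("  " * depth + f"{attr['name']} ({attr['type']})")
        nested = attr.get("nested-attributes")
        if nested is not None:
            lines.extend(describe(spec, nested, depth + 1))
    return lines


print("\n".join(describe(spec, "dev")))
```

The spec keeps both sets at the same level; the nesting seen on the wire only appears once the reference is followed.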
+
+Spec may also contain fractional sets - sets which contain a ``subset-of``
+property. Such sets describe a section of a full set, allowing narrowing down
+which attributes are allowed in a nest or refining the validation criteria.
+Fractional sets can only be used in nests. They are not rendered to the uAPI
+in any fashion.
+
+name
+~~~~
+
+Uniquely identifies the attribute set, operations and nested attributes
+refer to the sets by the ``name``.
+
+subset-of
+~~~~~~~~~
+
+Re-defines a portion of another set (a fractional set).
+Allows narrowing down fields and changing validation criteria
+or even types of attributes depending on the nest in which they
+are contained. The ``value`` of each attribute in the fractional
+set is implicitly the same as in the main set.
+
+attributes
+~~~~~~~~~~
+
+List of attributes in the set.
+
+Attribute properties
+--------------------
+
+name
+~~~~
+
+Identifies the attribute, unique within the set.
+
+type
+~~~~
+
+Netlink attribute type, see :ref:`attr_types`.
+
+.. _assign_val:
+
+value
+~~~~~
+
+Numerical attribute ID, used in serialized Netlink messages.
+The ``value`` property can be skipped, in which case the attribute ID
+will be the value of the previous attribute plus one (recursively)
+and ``0`` for the first attribute in the attribute set.
+
+Note that the ``value`` of an attribute is defined only in its main set.
+
+enum
+~~~~
+
+For integer types specifies that values in the attribute belong
+to an ``enum`` or ``flags`` from the ``definitions`` section.
+
+enum-as-flags
+~~~~~~~~~~~~~
+
+Treat ``enum`` as ``flags`` regardless of its type in ``definitions``.
+When both ``enum`` and ``flags`` forms are needed ``definitions`` should
+contain an ``enum`` and attributes which need the ``flags`` form should
+use this attribute.
+
+nested-attributes
+~~~~~~~~~~~~~~~~~
+
+Identifies the attribute space for attributes nested within given attribute.
+Only valid for complex attributes which may have sub-attributes.
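The implicit numbering rule for the attribute ``value`` property can be sketched as follows. This is a minimal illustration rather than the kernel's code generator: the helper name ``assign_attr_ids`` and the attribute names are hypothetical, and the first ID defaults to ``0`` because that is what the text above states.

```python
def assign_attr_ids(attributes):
    """Return {name: id}, applying the implicit ``value`` rule.

    An attribute without ``value`` gets the previous attribute's ID plus
    one; the first attribute defaults to 0, as stated in the text above.
    """
    ids = {}
    next_id = 0
    for attr in attributes:
        attr_id = attr.get("value", next_id)
        ids[attr["name"]] = attr_id
        next_id = attr_id + 1
    return ids


# Hypothetical attribute list of one set, as a YAML loader would return it.
attrs = [
    {"name": "ifindex", "type": "u32"},
    {"name": "name", "type": "string"},
    # An explicit value restarts the numbering from this point.
    {"name": "stats", "type": "nest", "value": 10, "nested-attributes": "stats"},
    {"name": "flags", "type": "u32"},
]

print(assign_attr_ids(attrs))
```

An explicit ``value`` only needs to be given where the numbering has to jump; every following attribute continues from it.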
+
+multi-attr (arrays)
+~~~~~~~~~~~~~~~~~~~
+
+Boolean property signifying that the attribute may be present multiple times.
+Allowing an attribute to repeat is the recommended way of implementing arrays
+(no extra nesting).
+
+byte-order
+~~~~~~~~~~
+
+For integer types specifies attribute byte order - ``little-endian``
+or ``big-endian``.
+
+checks
+~~~~~~
+
+Input validation constraints used by the kernel. User space should query
+the policy of the running kernel using Generic Netlink introspection,
+rather than depend on what is specified in the spec file.
+
+The validation policy in the kernel is formed by combining the type
+definition (``type`` and ``nested-attributes``) and the ``checks``.
+
+operations
+----------
+
+This section describes messages passed between the kernel and the user space.
+There are three types of entries in this section - operations, notifications
+and events.
+
+Operations describe the most common request - response communication. User
+sends a request and kernel replies. Each operation may contain any combination
+of the two modes familiar to netlink users - ``do`` and ``dump``.
+``do`` and ``dump`` in turn contain a combination of ``request`` and
+``reply`` properties. If no explicit message with attributes is passed
+in a given direction (e.g. a ``dump`` which does not accept filter, or a ``do``
+of a SET operation to which the kernel responds with just the netlink error
+code) ``request`` or ``reply`` section can be skipped.
+``request`` and ``reply`` sections list the attributes allowed in a message.
+The list contains only the names of attributes from a set referred
+to by the ``attribute-set`` property.
+
+Notifications and events both refer to the asynchronous messages sent by
+the kernel to members of a multicast group. The difference between the
+two is that a notification shares its contents with a GET operation
+(the name of the GET operation is specified in the ``notify`` property).
+This arrangement is commonly used for notifications about
+objects where the notification carries the full object definition.
+
+Events are more focused and carry only a subset of information rather than full
+object state (a made up example would be a link state change event with just
+the interface name and the new link state). Events contain the ``event``
+property. Events are considered less idiomatic for netlink and notifications
+should be preferred.
+
+list
+~~~~
+
+The only property of ``operations`` for ``genetlink``, holds the list of
+operations, notifications etc.
+
+Operation properties
+--------------------
+
+name
+~~~~
+
+Identifies the operation.
+
+value
+~~~~~
+
+Numerical message ID, used in serialized Netlink messages.
+The same enumeration rules are applied as to
+:ref:`attribute values <assign_val>`.
+
+attribute-set
+~~~~~~~~~~~~~
+
+Specifies the attribute set contained within the message.
+
+do
+~~~
+
+Specification for the ``doit`` request. Should contain ``request``, ``reply``
+or both of these properties, each holding a :ref:`attr_list`.
+
+dump
+~~~~
+
+Specification for the ``dumpit`` request. Should contain ``request``, ``reply``
+or both of these properties, each holding a :ref:`attr_list`.
+
+notify
+~~~~~~
+
+Designates the message as a notification. Contains the name of the operation
+(possibly the same as the operation holding this property) which shares
+the contents with the notification (``do``).
+
+event
+~~~~~
+
+Specification of attributes in the event, holds a :ref:`attr_list`.
+``event`` property is mutually exclusive with ``notify``.
+
+mcgrp
+~~~~~
+
+Used with ``event`` and ``notify``, specifies which multicast group
+message belongs to.
+
+.. _attr_list:
+
+Message attribute list
+----------------------
+
+``request``, ``reply`` and ``event`` properties have a single ``attributes``
+property which holds the list of attribute names.
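To make the three kinds of entries concrete, the sketch below classifies entries of an ``operations`` list and looks up the attribute list of a given message. The entries are shown as the Python structure a YAML loader would produce. The names ``dev-get``, ``dev-ntf``, ``link-change`` and ``mgmt``, and the helpers ``classify`` and ``allowed_attrs``, are hypothetical; the property names follow the ``do``, ``dump``, ``notify``, ``event`` and ``mcgrp`` descriptions above.

```python
def classify(entry):
    """Classify an ``operations`` list entry by the properties it carries."""
    if "notify" in entry and "event" in entry:
        raise ValueError("'event' and 'notify' are mutually exclusive")
    if "notify" in entry:
        return "notification"
    if "event" in entry:
        return "event"
    return "operation"


def allowed_attrs(entry, mode, direction):
    """Attribute names allowed in e.g. the ``do`` ``request`` of an entry.

    A skipped section means no attributes are passed in that direction.
    """
    return entry.get(mode, {}).get(direction, {}).get("attributes", [])


# Hypothetical entries; all names are made up for illustration.
ops = [
    {
        "name": "dev-get",
        "attribute-set": "dev",
        "do": {
            "request": {"attributes": ["ifindex"]},
            "reply": {"attributes": ["ifindex", "name"]},
        },
        # This dump accepts no filter, so the request section is skipped.
        "dump": {"reply": {"attributes": ["ifindex", "name"]}},
    },
    # A notification reuses the contents of the named GET operation.
    {"name": "dev-ntf", "notify": "dev-get", "mcgrp": "mgmt"},
    # An event carries its own, narrower attribute list.
    {
        "name": "link-change",
        "attribute-set": "dev",
        "event": {"attributes": ["ifindex"]},
        "mcgrp": "mgmt",
    },
]

for entry in ops:
    print(entry["name"], classify(entry))
print(allowed_attrs(ops[0], "dump", "request"))
```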
+
+Messages can also define ``pre`` and ``post`` properties which will be rendered
+as ``pre_doit`` and ``post_doit`` calls in the kernel (these properties should
+be ignored by user space).
+
+mcast-groups
+------------
+
+This section lists the multicast groups of the family.
+
+list
+~~~~
+
+The only property of ``mcast-groups`` for ``genetlink``, holds the list
+of groups.
+
+Multicast group properties
+--------------------------
+
+name
+~~~~
+
+Uniquely identifies the multicast group in the family. Similarly to
+Family ID, Multicast Group ID needs to be resolved at runtime, based
+on the name.
+
+.. _attr_types:
+
+Attribute types
+===============
+
+This section describes the attribute types supported by the ``genetlink``
+compatibility level. Refer to documentation of different levels for additional
+attribute types.
+
+Scalar integer types
+--------------------
+
+Fixed-width integer types:
+``u8``, ``u16``, ``u32``, ``u64``, ``s8``, ``s16``, ``s32``, ``s64``.
+
+Note that types smaller than 32 bit should be avoided as using them
+does not save any memory in Netlink messages (due to alignment).
+See :ref:`pad_type` for padding of 64 bit attributes.
+
+The payload of the attribute is the integer in host order unless ``byte-order``
+specifies otherwise.
+
+.. _pad_type:
+
+pad
+---
+
+Special attribute type used for padding attributes which require alignment
+bigger than standard 4B alignment required by netlink (e.g. 64 bit integers).
+There can only be a single attribute of the ``pad`` type in any attribute set
+and it should be automatically used for padding when needed.
+
+flag
+----
+
+Attribute with no payload, its presence is the entire information.
+
+binary
+------
+
+Raw binary data attribute, the contents are opaque to generic code.
+
+string
+------
+
+Character string. Unless ``checks`` has ``unterminated-ok`` set to ``true``
+the string is required to be null terminated.
+``max-len`` in ``checks`` indicates the longest possible string,
+if not present the length of the string is unbounded.
+
+Note that ``max-len`` does not count the terminating character.
+
+nest
+----
+
+Attribute containing other (nested) attributes.
+``nested-attributes`` specifies which attribute set is used inside.
--
cgit v1.2.3