diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/driver-api/vfio-mediated-device.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/driver-api/vfio-mediated-device.rst')
-rw-r--r-- | Documentation/driver-api/vfio-mediated-device.rst | 379 |
1 files changed, 379 insertions, 0 deletions
diff --git a/Documentation/driver-api/vfio-mediated-device.rst b/Documentation/driver-api/vfio-mediated-device.rst new file mode 100644 index 000000000..fdf7d6937 --- /dev/null +++ b/Documentation/driver-api/vfio-mediated-device.rst @@ -0,0 +1,379 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. include:: <isonum.txt> + +===================== +VFIO Mediated devices +===================== + +:Copyright: |copy| 2016, NVIDIA CORPORATION. All rights reserved. +:Author: Neo Jia <cjia@nvidia.com> +:Author: Kirti Wankhede <kwankhede@nvidia.com> + + + +Virtual Function I/O (VFIO) Mediated devices[1] +=============================================== + +The number of use cases for virtualizing DMA devices that do not have built-in +SR_IOV capability is increasing. Previously, to virtualize such devices, +developers had to create their own management interfaces and APIs, and then +integrate them with user space software. To simplify integration with user space +software, we have identified common requirements and a unified management +interface for such devices. + +The VFIO driver framework provides unified APIs for direct device access. It is +an IOMMU/device-agnostic framework for exposing direct device access to user +space in a secure, IOMMU-protected environment. This framework is used for +multiple devices, such as GPUs, network adapters, and compute accelerators. With +direct device access, virtual machines or user space applications have direct +access to the physical device. This framework is reused for mediated devices. + +The mediated core driver provides a common interface for mediated device +management that can be used by drivers of different devices. This module +provides a generic interface to perform these operations: + +* Create and destroy a mediated device +* Add a mediated device to and remove it from a mediated bus driver +* Add a mediated device to and remove it from an IOMMU group + +The mediated core driver also provides an interface to register a bus driver. +For example, the mediated VFIO mdev driver is designed for mediated devices and +supports VFIO APIs. The mediated bus driver adds a mediated device to and +removes it from a VFIO group. + +The following high-level block diagram shows the main components and interfaces +in the VFIO mediated driver framework. The diagram shows NVIDIA, Intel, and IBM +devices as examples, as these devices are the first devices to use this module:: + + +---------------+ + | | + | +-----------+ | mdev_register_driver() +--------------+ + | | | +<------------------------+ | + | | mdev | | | | + | | bus | +------------------------>+ vfio_mdev.ko |<-> VFIO user + | | driver | | probe()/remove() | | APIs + | | | | +--------------+ + | +-----------+ | + | | + | MDEV CORE | + | MODULE | + | mdev.ko | + | +-----------+ | mdev_register_parent() +--------------+ + | | | +<------------------------+ | + | | | | | nvidia.ko |<-> physical + | | | +------------------------>+ | device + | | | | callbacks +--------------+ + | | Physical | | + | | device | | mdev_register_parent() +--------------+ + | | interface | |<------------------------+ | + | | | | | i915.ko |<-> physical + | | | +------------------------>+ | device + | | | | callbacks +--------------+ + | | | | + | | | | mdev_register_parent() +--------------+ + | | | +<------------------------+ | + | | | | | ccw_device.ko|<-> physical + | | | +------------------------>+ | device + | | | | callbacks +--------------+ + | +-----------+ | + +---------------+ + + +Registration Interfaces +======================= + +The mediated core driver provides the following types of registration +interfaces: + +* Registration interface for a mediated bus driver +* Physical device driver interface + +Registration Interface for a Mediated Bus Driver +------------------------------------------------ + +The registration interface for a mediated device driver provides the following +structure to represent a mediated device's driver:: + + /* + * struct mdev_driver [2] - Mediated device's driver + * @probe: called when new device created + * @remove: called when device removed + * @driver: device driver structure + */ + struct mdev_driver { + int (*probe) (struct mdev_device *dev); + void (*remove) (struct mdev_device *dev); + unsigned int (*get_available)(struct mdev_type *mtype); + ssize_t (*show_description)(struct mdev_type *mtype, char *buf); + struct device_driver driver; + }; + +A mediated bus driver for mdev should use this structure in the function calls +to register and unregister itself with the core driver: + +* Register:: + + int mdev_register_driver(struct mdev_driver *drv); + +* Unregister:: + + void mdev_unregister_driver(struct mdev_driver *drv); + +The mediated bus driver's probe function should create a vfio_device on top of +the mdev_device and connect it to an appropriate implementation of +vfio_device_ops. + +When a driver wants to add the GUID creation sysfs to an existing device it has +probe'd to then it should call:: + + int mdev_register_parent(struct mdev_parent *parent, struct device *dev, + struct mdev_driver *mdev_driver); + +This will provide the 'mdev_supported_types/XX/create' files which can then be +used to trigger the creation of a mdev_device. The created mdev_device will be +attached to the specified driver. + +When the driver needs to remove itself it calls:: + + void mdev_unregister_parent(struct mdev_parent *parent); + +Which will unbind and destroy all the created mdevs and remove the sysfs files. + +Mediated Device Management Interface Through sysfs +================================================== + +The management interface through sysfs enables user space software, such as +libvirt, to query and configure mediated devices in a hardware-agnostic fashion. +This management interface provides flexibility to the underlying physical +device's driver to support features such as: + +* Mediated device hot plug +* Multiple mediated devices in a single virtual machine +* Multiple mediated devices from different physical devices + +Links in the mdev_bus Class Directory +------------------------------------- +The /sys/class/mdev_bus/ directory contains links to devices that are registered +with the mdev core driver. + +Directories and files under the sysfs for Each Physical Device +-------------------------------------------------------------- + +:: + + |- [parent physical device] + |--- Vendor-specific-attributes [optional] + |--- [mdev_supported_types] + | |--- [<type-id>] + | | |--- create + | | |--- name + | | |--- available_instances + | | |--- device_api + | | |--- description + | | |--- [devices] + | |--- [<type-id>] + | | |--- create + | | |--- name + | | |--- available_instances + | | |--- device_api + | | |--- description + | | |--- [devices] + | |--- [<type-id>] + | |--- create + | |--- name + | |--- available_instances + | |--- device_api + | |--- description + | |--- [devices] + +* [mdev_supported_types] + + The list of currently supported mediated device types and their details. + + [<type-id>], device_api, and available_instances are mandatory attributes + that should be provided by vendor driver. + +* [<type-id>] + + The [<type-id>] name is created by adding the device driver string as a prefix + to the string provided by the vendor driver. This format of this name is as + follows:: + + sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name); + +* device_api + + This attribute shows which device API is being created, for example, + "vfio-pci" for a PCI device. + +* available_instances + + This attribute shows the number of devices of type <type-id> that can be + created. + +* [device] + + This directory contains links to the devices of type <type-id> that have been + created. + +* name + + This attribute shows a human readable name. + +* description + + This attribute can show brief features/description of the type. This is an + optional attribute. + +Directories and Files Under the sysfs for Each mdev Device +---------------------------------------------------------- + +:: + + |- [parent phy device] + |--- [$MDEV_UUID] + |--- remove + |--- mdev_type {link to its type} + |--- vendor-specific-attributes [optional] + +* remove (write only) + +Writing '1' to the 'remove' file destroys the mdev device. The vendor driver can +fail the remove() callback if that device is active and the vendor driver +doesn't support hot unplug. + +Example:: + + # echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove + +Mediated device Hot plug +------------------------ + +Mediated devices can be created and assigned at runtime. The procedure to hot +plug a mediated device is the same as the procedure to hot plug a PCI device. + +Translation APIs for Mediated Devices +===================================== + +The following APIs are provided for translating user pfn to host pfn in a VFIO +driver:: + + int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova, + int npage, int prot, struct page **pages); + + void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, + int npage); + +These functions call back into the back-end IOMMU module by using the pin_pages +and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently +these callbacks are supported in the TYPE1 IOMMU module. To enable them for +other IOMMU backend modules, such as PPC64 sPAPR module, they need to provide +these two callback functions. + +Using the Sample Code +===================== + +mtty.c in samples/vfio-mdev/ directory is a sample driver program to +demonstrate how to use the mediated device framework. + +The sample driver creates an mdev device that simulates a serial port over a PCI +card. + +1. Build and load the mtty.ko module. + + This step creates a dummy device, /sys/devices/virtual/mtty/mtty/ + + Files in this device directory in sysfs are similar to the following:: + + # tree /sys/devices/virtual/mtty/mtty/ + /sys/devices/virtual/mtty/mtty/ + |-- mdev_supported_types + | |-- mtty-1 + | | |-- available_instances + | | |-- create + | | |-- device_api + | | |-- devices + | | `-- name + | `-- mtty-2 + | |-- available_instances + | |-- create + | |-- device_api + | |-- devices + | `-- name + |-- mtty_dev + | `-- sample_mtty_dev + |-- power + | |-- autosuspend_delay_ms + | |-- control + | |-- runtime_active_time + | |-- runtime_status + | `-- runtime_suspended_time + |-- subsystem -> ../../../../class/mtty + `-- uevent + +2. Create a mediated device by using the dummy device that you created in the + previous step:: + + # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \ + /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create + +3. Add parameters to qemu-kvm:: + + -device vfio-pci,\ + sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 + +4. Boot the VM. + + In the Linux guest VM, with no hardware on the host, the device appears + as follows:: + + # lspci -s 00:05.0 -xxvv + 00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550]) + Subsystem: Device 4348:3253 + Physical Slot: 5 + Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- + Stepping- SERR- FastB2B- DisINTx- + Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- + <TAbort- <MAbort- >SERR- <PERR- INTx- + Interrupt: pin A routed to IRQ 10 + Region 0: I/O ports at c150 [size=8] + Region 1: I/O ports at c158 [size=8] + Kernel driver in use: serial + 00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00 + 10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00 + 20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32 + 30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00 + + In the Linux guest VM, dmesg output for the device is as follows: + + serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10 + 0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A + 0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A + + +5. In the Linux guest VM, check the serial ports:: + + # setserial -g /dev/ttyS* + /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4 + /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10 + /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10 + +6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or + /dev/ttyS2 with hardware flow control disabled. + +7. Type data on the minicom terminal or send data to the terminal emulation + program and read the data. + + Data is loop backed from hosts mtty driver. + +8. Destroy the mediated device that you created:: + + # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove + +References +========== + +1. See Documentation/driver-api/vfio.rst for more information on VFIO. +2. struct mdev_driver in include/linux/mdev.h +3. struct mdev_parent_ops in include/linux/mdev.h +4. struct vfio_iommu_driver_ops in include/linux/vfio.h |