From 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 21 Feb 2023 18:24:12 -0800 Subject: Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ... --- drivers/vfio/pci/vfio_pci_igd.c | 451 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 451 insertions(+) create mode 100644 drivers/vfio/pci/vfio_pci_igd.c (limited to 'drivers/vfio/pci/vfio_pci_igd.c') diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c new file mode 100644 index 000000000..5e6ca5926 --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_igd.c @@ -0,0 +1,451 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * VFIO PCI Intel Graphics support + * + * Copyright (C) 2016 Red Hat, Inc. All rights reserved. + * Author: Alex Williamson + * + * Register a device specific region through which to provide read-only + * access to the Intel IGD opregion. The register defining the opregion + * address is also virtualized to prevent user modification. + */ + +#include +#include +#include +#include + +#include "vfio_pci_priv.h" + +#define OPREGION_SIGNATURE "IntelGraphicsMem" +#define OPREGION_SIZE (8 * 1024) +#define OPREGION_PCI_ADDR 0xfc + +#define OPREGION_RVDA 0x3ba +#define OPREGION_RVDS 0x3c2 +#define OPREGION_VERSION 0x16 + +struct igd_opregion_vbt { + void *opregion; + void *vbt_ex; +}; + +/** + * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position. + * @dst: User buffer ptr to copy to. + * @off: Offset to user buffer ptr. Increased by bytes on return. + * @src: Source buffer to copy from. + * @pos: Increased by bytes on return. + * @remaining: Decreased by bytes on return. + * @bytes: Bytes to copy and adjust off, pos and remaining. + * + * Copy OpRegion to offset from specific source ptr and shift the offset. + * + * Return: 0 on success, -EFAULT otherwise. + * + */ +static inline unsigned long igd_opregion_shift_copy(char __user *dst, + loff_t *off, + void *src, + loff_t *pos, + size_t *remaining, + size_t bytes) +{ + if (copy_to_user(dst + (*off), src, bytes)) + return -EFAULT; + + *off += bytes; + *pos += bytes; + *remaining -= bytes; + + return 0; +} + +static ssize_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev, + char __user *buf, size_t count, loff_t *ppos, + bool iswrite) +{ + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; + struct igd_opregion_vbt *opregionvbt = vdev->region[i].data; + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0; + size_t remaining; + + if (pos >= vdev->region[i].size || iswrite) + return -EINVAL; + + count = min_t(size_t, count, vdev->region[i].size - pos); + remaining = count; + + /* Copy until OpRegion version */ + if (remaining && pos < OPREGION_VERSION) { + size_t bytes = min_t(size_t, remaining, OPREGION_VERSION - pos); + + if (igd_opregion_shift_copy(buf, &off, + opregionvbt->opregion + pos, &pos, + &remaining, bytes)) + return -EFAULT; + } + + /* Copy patched (if necessary) OpRegion version */ + if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) { + size_t bytes = min_t(size_t, remaining, + OPREGION_VERSION + sizeof(__le16) - pos); + __le16 version = *(__le16 *)(opregionvbt->opregion + + OPREGION_VERSION); + + /* Patch to 2.1 if OpRegion 2.0 has extended VBT */ + if (le16_to_cpu(version) == 0x0200 && opregionvbt->vbt_ex) + version = cpu_to_le16(0x0201); + + if (igd_opregion_shift_copy(buf, &off, + (u8 *)&version + + (pos - OPREGION_VERSION), + &pos, &remaining, bytes)) + return -EFAULT; + } + + /* Copy until RVDA */ + if (remaining && pos < OPREGION_RVDA) { + size_t bytes = min_t(size_t, remaining, OPREGION_RVDA - pos); + + if (igd_opregion_shift_copy(buf, &off, + opregionvbt->opregion + pos, &pos, + &remaining, bytes)) + return -EFAULT; + } + + /* Copy modified (if necessary) RVDA */ + if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) { + size_t bytes = min_t(size_t, remaining, + OPREGION_RVDA + sizeof(__le64) - pos); + __le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ? + OPREGION_SIZE : 0); + + if (igd_opregion_shift_copy(buf, &off, + (u8 *)&rvda + (pos - OPREGION_RVDA), + &pos, &remaining, bytes)) + return -EFAULT; + } + + /* Copy the rest of OpRegion */ + if (remaining && pos < OPREGION_SIZE) { + size_t bytes = min_t(size_t, remaining, OPREGION_SIZE - pos); + + if (igd_opregion_shift_copy(buf, &off, + opregionvbt->opregion + pos, &pos, + &remaining, bytes)) + return -EFAULT; + } + + /* Copy extended VBT if exists */ + if (remaining && + copy_to_user(buf + off, opregionvbt->vbt_ex + (pos - OPREGION_SIZE), + remaining)) + return -EFAULT; + + *ppos += count; + + return count; +} + +static void vfio_pci_igd_release(struct vfio_pci_core_device *vdev, + struct vfio_pci_region *region) +{ + struct igd_opregion_vbt *opregionvbt = region->data; + + if (opregionvbt->vbt_ex) + memunmap(opregionvbt->vbt_ex); + + memunmap(opregionvbt->opregion); + kfree(opregionvbt); +} + +static const struct vfio_pci_regops vfio_pci_igd_regops = { + .rw = vfio_pci_igd_rw, + .release = vfio_pci_igd_release, +}; + +static int vfio_pci_igd_opregion_init(struct vfio_pci_core_device *vdev) +{ + __le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR); + u32 addr, size; + struct igd_opregion_vbt *opregionvbt; + int ret; + u16 version; + + ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr); + if (ret) + return ret; + + if (!addr || !(~addr)) + return -ENODEV; + + opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL); + if (!opregionvbt) + return -ENOMEM; + + opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB); + if (!opregionvbt->opregion) { + kfree(opregionvbt); + return -ENOMEM; + } + + if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) { + memunmap(opregionvbt->opregion); + kfree(opregionvbt); + return -EINVAL; + } + + size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16)); + if (!size) { + memunmap(opregionvbt->opregion); + kfree(opregionvbt); + return -EINVAL; + } + + size *= 1024; /* In KB */ + + /* + * OpRegion and VBT: + * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4. + * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough + * to hold the VBT data, the Extended VBT region is introduced since + * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are + * introduced to define the extended VBT data location and size. + * OpRegion 2.0: RVDA defines the absolute physical address of the + * extended VBT data, RVDS defines the VBT data size. + * OpRegion 2.1 and above: RVDA defines the relative address of the + * extended VBT data to OpRegion base, RVDS defines the VBT data size. + * + * Due to the RVDA definition diff in OpRegion VBT (also the only diff + * between 2.0 and 2.1), exposing OpRegion and VBT as a contiguous range + * for OpRegion 2.0 and above makes it possible to support the + * non-contiguous VBT through a single vfio region. From r/w ops view, + * only contiguous VBT after OpRegion with version 2.1+ is exposed, + * regardless the host OpRegion is 2.0 or non-contiguous 2.1+. The r/w + * ops will on-the-fly shift the actural offset into VBT so that data at + * correct position can be returned to the requester. + */ + version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion + + OPREGION_VERSION)); + if (version >= 0x0200) { + u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion + + OPREGION_RVDA)); + u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + + OPREGION_RVDS)); + + /* The extended VBT is valid only when RVDA/RVDS are non-zero */ + if (rvda && rvds) { + size += rvds; + + /* + * Extended VBT location by RVDA: + * Absolute physical addr for 2.0. + * Relative addr to OpRegion header for 2.1+. + */ + if (version == 0x0200) + addr = rvda; + else + addr += rvda; + + opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB); + if (!opregionvbt->vbt_ex) { + memunmap(opregionvbt->opregion); + kfree(opregionvbt); + return -ENOMEM; + } + } + } + + ret = vfio_pci_core_register_dev_region(vdev, + PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE, + VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops, + size, VFIO_REGION_INFO_FLAG_READ, opregionvbt); + if (ret) { + if (opregionvbt->vbt_ex) + memunmap(opregionvbt->vbt_ex); + + memunmap(opregionvbt->opregion); + kfree(opregionvbt); + return ret; + } + + /* Fill vconfig with the hw value and virtualize register */ + *dwordp = cpu_to_le32(addr); + memset(vdev->pci_config_map + OPREGION_PCI_ADDR, + PCI_CAP_ID_INVALID_VIRT, 4); + + return ret; +} + +static ssize_t vfio_pci_igd_cfg_rw(struct vfio_pci_core_device *vdev, + char __user *buf, size_t count, loff_t *ppos, + bool iswrite) +{ + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; + struct pci_dev *pdev = vdev->region[i].data; + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; + size_t size; + int ret; + + if (pos >= vdev->region[i].size || iswrite) + return -EINVAL; + + size = count = min(count, (size_t)(vdev->region[i].size - pos)); + + if ((pos & 1) && size) { + u8 val; + + ret = pci_user_read_config_byte(pdev, pos, &val); + if (ret) + return ret; + + if (copy_to_user(buf + count - size, &val, 1)) + return -EFAULT; + + pos++; + size--; + } + + if ((pos & 3) && size > 2) { + u16 val; + __le16 lval; + + ret = pci_user_read_config_word(pdev, pos, &val); + if (ret) + return ret; + + lval = cpu_to_le16(val); + if (copy_to_user(buf + count - size, &lval, 2)) + return -EFAULT; + + pos += 2; + size -= 2; + } + + while (size > 3) { + u32 val; + __le32 lval; + + ret = pci_user_read_config_dword(pdev, pos, &val); + if (ret) + return ret; + + lval = cpu_to_le32(val); + if (copy_to_user(buf + count - size, &lval, 4)) + return -EFAULT; + + pos += 4; + size -= 4; + } + + while (size >= 2) { + u16 val; + __le16 lval; + + ret = pci_user_read_config_word(pdev, pos, &val); + if (ret) + return ret; + + lval = cpu_to_le16(val); + if (copy_to_user(buf + count - size, &lval, 2)) + return -EFAULT; + + pos += 2; + size -= 2; + } + + while (size) { + u8 val; + + ret = pci_user_read_config_byte(pdev, pos, &val); + if (ret) + return ret; + + if (copy_to_user(buf + count - size, &val, 1)) + return -EFAULT; + + pos++; + size--; + } + + *ppos += count; + + return count; +} + +static void vfio_pci_igd_cfg_release(struct vfio_pci_core_device *vdev, + struct vfio_pci_region *region) +{ + struct pci_dev *pdev = region->data; + + pci_dev_put(pdev); +} + +static const struct vfio_pci_regops vfio_pci_igd_cfg_regops = { + .rw = vfio_pci_igd_cfg_rw, + .release = vfio_pci_igd_cfg_release, +}; + +static int vfio_pci_igd_cfg_init(struct vfio_pci_core_device *vdev) +{ + struct pci_dev *host_bridge, *lpc_bridge; + int ret; + + host_bridge = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0, 0)); + if (!host_bridge) + return -ENODEV; + + if (host_bridge->vendor != PCI_VENDOR_ID_INTEL || + host_bridge->class != (PCI_CLASS_BRIDGE_HOST << 8)) { + pci_dev_put(host_bridge); + return -EINVAL; + } + + ret = vfio_pci_core_register_dev_region(vdev, + PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE, + VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG, + &vfio_pci_igd_cfg_regops, host_bridge->cfg_size, + VFIO_REGION_INFO_FLAG_READ, host_bridge); + if (ret) { + pci_dev_put(host_bridge); + return ret; + } + + lpc_bridge = pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(0x1f, 0)); + if (!lpc_bridge) + return -ENODEV; + + if (lpc_bridge->vendor != PCI_VENDOR_ID_INTEL || + lpc_bridge->class != (PCI_CLASS_BRIDGE_ISA << 8)) { + pci_dev_put(lpc_bridge); + return -EINVAL; + } + + ret = vfio_pci_core_register_dev_region(vdev, + PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE, + VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG, + &vfio_pci_igd_cfg_regops, lpc_bridge->cfg_size, + VFIO_REGION_INFO_FLAG_READ, lpc_bridge); + if (ret) { + pci_dev_put(lpc_bridge); + return ret; + } + + return 0; +} + +int vfio_pci_igd_init(struct vfio_pci_core_device *vdev) +{ + int ret; + + ret = vfio_pci_igd_opregion_init(vdev); + if (ret) + return ret; + + ret = vfio_pci_igd_cfg_init(vdev); + if (ret) + return ret; + + return 0; +} -- cgit v1.2.3