aboutsummaryrefslogtreecommitdiff
path: root/Documentation/driver-api/dma-buf.rst
diff options
context:
space:
mode:
authorLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
committerLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
commit5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
treecc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/driver-api/dma-buf.rst
downloadlinux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz
linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
Diffstat (limited to 'Documentation/driver-api/dma-buf.rst')
-rw-r--r--Documentation/driver-api/dma-buf.rst354
1 files changed, 354 insertions, 0 deletions
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
new file mode 100644
index 000000000..622b8156d
--- /dev/null
+++ b/Documentation/driver-api/dma-buf.rst
@@ -0,0 +1,354 @@
+Buffer Sharing and Synchronization
+==================================
+
+The dma-buf subsystem provides the framework for sharing buffers for
+hardware (DMA) access across multiple device drivers and subsystems, and
+for synchronizing asynchronous hardware access.
+
+This is used, for example, by drm "prime" multi-GPU support, but is of
+course not limited to GPU use cases.
+
+The three main components of this are: (1) dma-buf, representing a
+sg_table and exposed to userspace as a file descriptor to allow passing
+between devices, (2) fence, which provides a mechanism to signal when
+one device has finished access, and (3) reservation, which manages the
+shared or exclusive fence(s) associated with the buffer.
+
+Shared DMA Buffers
+------------------
+
+This document serves as a guide to device-driver writers on what is the dma-buf
+buffer sharing API, how to use it for exporting and using shared buffers.
+
+Any device driver which wishes to be a part of DMA buffer sharing, can do so as
+either the 'exporter' of buffers, or the 'user' or 'importer' of buffers.
+
+Say a driver A wants to use buffers created by driver B, then we call B as the
+exporter, and A as buffer-user/importer.
+
+The exporter
+
+ - implements and manages operations in :c:type:`struct dma_buf_ops
+ <dma_buf_ops>` for the buffer,
+ - allows other users to share the buffer by using dma_buf sharing APIs,
+ - manages the details of buffer allocation, wrapped in a :c:type:`struct
+ dma_buf <dma_buf>`,
+ - decides about the actual backing storage where this allocation happens,
+ - and takes care of any migration of scatterlist - for all (shared) users of
+ this buffer.
+
+The buffer-user
+
+ - is one of (many) sharing users of the buffer.
+ - doesn't need to worry about how the buffer is allocated, or where.
+ - and needs a mechanism to get access to the scatterlist that makes up this
+ buffer in memory, mapped into its own address space, so it can access the
+ same area of memory. This interface is provided by :c:type:`struct
+ dma_buf_attachment <dma_buf_attachment>`.
+
+Any exporters or users of the dma-buf buffer sharing framework must have a
+'select DMA_SHARED_BUFFER' in their respective Kconfigs.
+
+Userspace Interface Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Mostly a DMA buffer file descriptor is simply an opaque object for userspace,
+and hence the generic interface exposed is very minimal. There's a few things to
+consider though:
+
+- Since kernel 3.12 the dma-buf FD supports the llseek system call, but only
+ with offset=0 and whence=SEEK_END|SEEK_SET. SEEK_SET is supported to allow
+ the usual size discover pattern size = SEEK_END(0); SEEK_SET(0). Every other
+ llseek operation will report -EINVAL.
+
+ If llseek on dma-buf FDs isn't support the kernel will report -ESPIPE for all
+ cases. Userspace can use this to detect support for discovering the dma-buf
+ size using llseek.
+
+- In order to avoid fd leaks on exec, the FD_CLOEXEC flag must be set
+ on the file descriptor. This is not just a resource leak, but a
+ potential security hole. It could give the newly exec'd application
+ access to buffers, via the leaked fd, to which it should otherwise
+ not be permitted access.
+
+ The problem with doing this via a separate fcntl() call, versus doing it
+ atomically when the fd is created, is that this is inherently racy in a
+ multi-threaded app[3]. The issue is made worse when it is library code
+ opening/creating the file descriptor, as the application may not even be
+ aware of the fd's.
+
+ To avoid this problem, userspace must have a way to request O_CLOEXEC
+ flag be set when the dma-buf fd is created. So any API provided by
+ the exporting driver to create a dmabuf fd must provide a way to let
+ userspace control setting of O_CLOEXEC flag passed in to dma_buf_fd().
+
+- Memory mapping the contents of the DMA buffer is also supported. See the
+ discussion below on `CPU Access to DMA Buffer Objects`_ for the full details.
+
+- The DMA buffer FD is also pollable, see `Implicit Fence Poll Support`_ below for
+ details.
+
+- The DMA buffer FD also supports a few dma-buf-specific ioctls, see
+ `DMA Buffer ioctls`_ below for details.
+
+Basic Operation and Device DMA Access
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-buf.c
+ :doc: dma buf device access
+
+CPU Access to DMA Buffer Objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-buf.c
+ :doc: cpu access
+
+Implicit Fence Poll Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-buf.c
+ :doc: implicit fence polling
+
+DMA-BUF statistics
+~~~~~~~~~~~~~~~~~~
+.. kernel-doc:: drivers/dma-buf/dma-buf-sysfs-stats.c
+ :doc: overview
+
+DMA Buffer ioctls
+~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: include/uapi/linux/dma-buf.h
+
+DMA-BUF locking convention
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-buf.c
+ :doc: locking convention
+
+Kernel Functions and Structures Reference
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-buf.c
+ :export:
+
+.. kernel-doc:: include/linux/dma-buf.h
+ :internal:
+
+Reservation Objects
+-------------------
+
+.. kernel-doc:: drivers/dma-buf/dma-resv.c
+ :doc: Reservation Object Overview
+
+.. kernel-doc:: drivers/dma-buf/dma-resv.c
+ :export:
+
+.. kernel-doc:: include/linux/dma-resv.h
+ :internal:
+
+DMA Fences
+----------
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+ :doc: DMA fences overview
+
+DMA Fence Cross-Driver Contract
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+ :doc: fence cross-driver contract
+
+DMA Fence Signalling Annotations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+ :doc: fence signalling annotation
+
+DMA Fences Functions Reference
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+ :export:
+
+.. kernel-doc:: include/linux/dma-fence.h
+ :internal:
+
+DMA Fence Array
+~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence-array.c
+ :export:
+
+.. kernel-doc:: include/linux/dma-fence-array.h
+ :internal:
+
+DMA Fence Chain
+~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence-chain.c
+ :export:
+
+.. kernel-doc:: include/linux/dma-fence-chain.h
+ :internal:
+
+DMA Fence unwrap
+~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: include/linux/dma-fence-unwrap.h
+ :internal:
+
+DMA Fence uABI/Sync File
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/sync_file.c
+ :export:
+
+.. kernel-doc:: include/linux/sync_file.h
+ :internal:
+
+Indefinite DMA Fences
+~~~~~~~~~~~~~~~~~~~~~
+
+At various times struct dma_fence with an indefinite time until dma_fence_wait()
+finishes have been proposed. Examples include:
+
+* Future fences, used in HWC1 to signal when a buffer isn't used by the display
+ any longer, and created with the screen update that makes the buffer visible.
+ The time this fence completes is entirely under userspace's control.
+
+* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
+ been set. Used to asynchronously delay command submission.
+
+* Userspace fences or gpu futexes, fine-grained locking within a command buffer
+ that userspace uses for synchronization across engines or with the CPU, which
+ are then imported as a DMA fence for integration into existing winsys
+ protocols.
+
+* Long-running compute command buffers, while still using traditional end of
+ batch DMA fences for memory management instead of context preemption DMA
+ fences which get reattached when the compute job is rescheduled.
+
+Common to all these schemes is that userspace controls the dependencies of these
+fences and controls when they fire. Mixing indefinite fences with normal
+in-kernel DMA fences does not work, even when a fallback timeout is included to
+protect against malicious userspace:
+
+* Only the kernel knows about all DMA fence dependencies, userspace is not aware
+ of dependencies injected due to memory management or scheduler decisions.
+
+* Only userspace knows about all dependencies in indefinite fences and when
+ exactly they will complete, the kernel has no visibility.
+
+Furthermore the kernel has to be able to hold up userspace command submission
+for memory management needs, which means we must support indefinite fences being
+dependent upon DMA fences. If the kernel also support indefinite fences in the
+kernel like a DMA fence, like any of the above proposal would, there is the
+potential for deadlocks.
+
+.. kernel-render:: DOT
+ :alt: Indefinite Fencing Dependency Cycle
+ :caption: Indefinite Fencing Dependency Cycle
+
+ digraph "Fencing Cycle" {
+ node [shape=box bgcolor=grey style=filled]
+ kernel [label="Kernel DMA Fences"]
+ userspace [label="userspace controlled fences"]
+ kernel -> userspace [label="memory management"]
+ userspace -> kernel [label="Future fence, fence proxy, ..."]
+
+ { rank=same; kernel userspace }
+ }
+
+This means that the kernel might accidentally create deadlocks
+through memory management dependencies which userspace is unaware of, which
+randomly hangs workloads until the timeout kicks in. Workloads, which from
+userspace's perspective, do not contain a deadlock. In such a mixed fencing
+architecture there is no single entity with knowledge of all dependencies.
+Thefore preventing such deadlocks from within the kernel is not possible.
+
+The only solution to avoid dependencies loops is by not allowing indefinite
+fences in the kernel. This means:
+
+* No future fences, proxy fences or userspace fences imported as DMA fences,
+ with or without a timeout.
+
+* No DMA fences that signal end of batchbuffer for command submission where
+ userspace is allowed to use userspace fencing or long running compute
+ workloads. This also means no implicit fencing for shared buffers in these
+ cases.
+
+Recoverable Hardware Page Faults Implications
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Modern hardware supports recoverable page faults, which has a lot of
+implications for DMA fences.
+
+First, a pending page fault obviously holds up the work that's running on the
+accelerator and a memory allocation is usually required to resolve the fault.
+But memory allocations are not allowed to gate completion of DMA fences, which
+means any workload using recoverable page faults cannot use DMA fences for
+synchronization. Synchronization fences controlled by userspace must be used
+instead.
+
+On GPUs this poses a problem, because current desktop compositor protocols on
+Linux rely on DMA fences, which means without an entirely new userspace stack
+built on top of userspace fences, they cannot benefit from recoverable page
+faults. Specifically this means implicit synchronization will not be possible.
+The exception is when page faults are only used as migration hints and never to
+on-demand fill a memory request. For now this means recoverable page
+faults on GPUs are limited to pure compute workloads.
+
+Furthermore GPUs usually have shared resources between the 3D rendering and
+compute side, like compute units or command submission engines. If both a 3D
+job with a DMA fence and a compute workload using recoverable page faults are
+pending they could deadlock:
+
+- The 3D workload might need to wait for the compute job to finish and release
+ hardware resources first.
+
+- The compute workload might be stuck in a page fault, because the memory
+ allocation is waiting for the DMA fence of the 3D workload to complete.
+
+There are a few options to prevent this problem, one of which drivers need to
+ensure:
+
+- Compute workloads can always be preempted, even when a page fault is pending
+ and not yet repaired. Not all hardware supports this.
+
+- DMA fence workloads and workloads which need page fault handling have
+ independent hardware resources to guarantee forward progress. This could be
+ achieved through e.g. through dedicated engines and minimal compute unit
+ reservations for DMA fence workloads.
+
+- The reservation approach could be further refined by only reserving the
+ hardware resources for DMA fence workloads when they are in-flight. This must
+ cover the time from when the DMA fence is visible to other threads up to
+ moment when fence is completed through dma_fence_signal().
+
+- As a last resort, if the hardware provides no useful reservation mechanics,
+ all workloads must be flushed from the GPU when switching between jobs
+ requiring DMA fences or jobs requiring page fault handling: This means all DMA
+ fences must complete before a compute job with page fault handling can be
+ inserted into the scheduler queue. And vice versa, before a DMA fence can be
+ made visible anywhere in the system, all compute workloads must be preempted
+ to guarantee all pending GPU page faults are flushed.
+
+- Only a fairly theoretical option would be to untangle these dependencies when
+ allocating memory to repair hardware page faults, either through separate
+ memory blocks or runtime tracking of the full dependency graph of all DMA
+ fences. This results very wide impact on the kernel, since resolving the page
+ on the CPU side can itself involve a page fault. It is much more feasible and
+ robust to limit the impact of handling hardware page faults to the specific
+ driver.
+
+Note that workloads that run on independent hardware like copy engines or other
+GPUs do not have any impact. This allows us to keep using DMA fences internally
+in the kernel even for resolving hardware page faults, e.g. by using copy
+engines to clear or copy memory needed to resolve the page fault.
+
+In some ways this page fault problem is a special case of the `Infinite DMA
+Fences` discussions: Infinite fences from compute workloads are allowed to
+depend on DMA fences, but not the other way around. And not even the page fault
+problem is new, because some other CPU thread in userspace might
+hit a page fault which holds up a userspace fence - supporting page faults on
+GPUs doesn't anything fundamentally new.