aboutsummaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/device-mapper/dm-clone.rst
diff options
context:
space:
mode:
authorLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
committerLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
commit5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
treecc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/admin-guide/device-mapper/dm-clone.rst
downloadlinux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz
linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
Diffstat (limited to 'Documentation/admin-guide/device-mapper/dm-clone.rst')
-rw-r--r--Documentation/admin-guide/device-mapper/dm-clone.rst333
1 files changed, 333 insertions, 0 deletions
diff --git a/Documentation/admin-guide/device-mapper/dm-clone.rst b/Documentation/admin-guide/device-mapper/dm-clone.rst
new file mode 100644
index 000000000..b43a34c14
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-clone.rst
@@ -0,0 +1,333 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+========
+dm-clone
+========
+
+Introduction
+============
+
+dm-clone is a device mapper target which produces a one-to-one copy of an
+existing, read-only source device into a writable destination device: It
+presents a virtual block device which makes all data appear immediately, and
+redirects reads and writes accordingly.
+
+The main use case of dm-clone is to clone a potentially remote, high-latency,
+read-only, archival-type block device into a writable, fast, primary-type device
+for fast, low-latency I/O. The cloned device is visible/mountable immediately
+and the copy of the source device to the destination device happens in the
+background, in parallel with user I/O.
+
+For example, one could restore an application backup from a read-only copy,
+accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
+etc.), into a local SSD or NVMe device, and start using the device immediately,
+without waiting for the restore to complete.
+
+When the cloning completes, the dm-clone table can be removed altogether and be
+replaced, e.g., by a linear table, mapping directly to the destination device.
+
+The dm-clone target reuses the metadata library used by the thin-provisioning
+target.
+
+Glossary
+========
+
+ Hydration
+ The process of filling a region of the destination device with data from
+ the same region of the source device, i.e., copying the region from the
+ source to the destination device.
+
+Once a region gets hydrated we redirect all I/O regarding it to the destination
+device.
+
+Design
+======
+
+Sub-devices
+-----------
+
+The target is constructed by passing three devices to it (along with other
+parameters detailed later):
+
+1. A source device - the read-only device that gets cloned and source of the
+ hydration.
+
+2. A destination device - the destination of the hydration, which will become a
+ clone of the source device.
+
+3. A small metadata device - it records which regions are already valid in the
+ destination device, i.e., which regions have already been hydrated, or have
+ been written to directly, via user I/O.
+
+The size of the destination device must be at least equal to the size of the
+source device.
+
+Regions
+-------
+
+dm-clone divides the source and destination devices in fixed sized regions.
+Regions are the unit of hydration, i.e., the minimum amount of data copied from
+the source to the destination device.
+
+The region size is configurable when you first create the dm-clone device. The
+recommended region size is the same as the file system block size, which usually
+is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors
+(1GB) and a power of two.
+
+Reads and writes from/to hydrated regions are serviced from the destination
+device.
+
+A read to a not yet hydrated region is serviced directly from the source device.
+
+A write to a not yet hydrated region will be delayed until the corresponding
+region has been hydrated and the hydration of the region starts immediately.
+
+Note that a write request with size equal to region size will skip copying of
+the corresponding region from the source device and overwrite the region of the
+destination device directly.
+
+Discards
+--------
+
+dm-clone interprets a discard request to a range that hasn't been hydrated yet
+as a hint to skip hydration of the regions covered by the request, i.e., it
+skips copying the region's data from the source to the destination device, and
+only updates its metadata.
+
+If the destination device supports discards, then by default dm-clone will pass
+down discard requests to it.
+
+Background Hydration
+--------------------
+
+dm-clone copies continuously from the source to the destination device, until
+all of the device has been copied.
+
+Copying data from the source to the destination device uses bandwidth. The user
+can set a throttle to prevent more than a certain amount of copying occurring at
+any one time. Moreover, dm-clone takes into account user I/O traffic going to
+the devices and pauses the background hydration when there is I/O in-flight.
+
+A message `hydration_threshold <#regions>` can be used to set the maximum number
+of regions being copied, the default being 1 region.
+
+dm-clone employs dm-kcopyd for copying portions of the source device to the
+destination device. By default, we issue copy requests of size equal to the
+region size. A message `hydration_batch_size <#regions>` can be used to tune the
+size of these copy requests. Increasing the hydration batch size results in
+dm-clone trying to batch together contiguous regions, so we copy the data in
+batches of this many regions.
+
+When the hydration of the destination device finishes, a dm event will be sent
+to user space.
+
+Updating on-disk metadata
+-------------------------
+
+On-disk metadata is committed every time a FLUSH or FUA bio is written. If no
+such requests are made then commits will occur every second. This means the
+dm-clone device behaves like a physical disk that has a volatile write cache. If
+power is lost you may lose some recent writes. The metadata should always be
+consistent in spite of any crash.
+
+Target Interface
+================
+
+Constructor
+-----------
+
+ ::
+
+ clone <metadata dev> <destination dev> <source dev> <region size>
+ [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]]
+
+ ================ ==============================================================
+ metadata dev Fast device holding the persistent metadata
+ destination dev The destination device, where the source will be cloned
+ source dev Read only device containing the data that gets cloned
+ region size The size of a region in sectors
+
+ #feature args Number of feature arguments passed
+ feature args no_hydration or no_discard_passdown
+
+ #core args An even number of arguments corresponding to key/value pairs
+ passed to dm-clone
+ core args Key/value pairs passed to dm-clone, e.g. `hydration_threshold
+ 256`
+ ================ ==============================================================
+
+Optional feature arguments are:
+
+ ==================== =========================================================
+ no_hydration Create a dm-clone instance with background hydration
+ disabled
+ no_discard_passdown Disable passing down discards to the destination device
+ ==================== =========================================================
+
+Optional core arguments are:
+
+ ================================ ==============================================
+ hydration_threshold <#regions> Maximum number of regions being copied from
+ the source to the destination device at any
+ one time, during background hydration.
+ hydration_batch_size <#regions> During background hydration, try to batch
+ together contiguous regions, so we copy data
+ from the source to the destination device in
+ batches of this many regions.
+ ================================ ==============================================
+
+Status
+------
+
+ ::
+
+ <metadata block size> <#used metadata blocks>/<#total metadata blocks>
+ <region size> <#hydrated regions>/<#total regions> <#hydrating regions>
+ <#feature args> <feature args>* <#core args> <core args>*
+ <clone metadata mode>
+
+ ======================= =======================================================
+ metadata block size Fixed block size for each metadata block in sectors
+ #used metadata blocks Number of metadata blocks used
+ #total metadata blocks Total number of metadata blocks
+ region size Configurable region size for the device in sectors
+ #hydrated regions Number of regions that have finished hydrating
+ #total regions Total number of regions to hydrate
+ #hydrating regions Number of regions currently hydrating
+ #feature args Number of feature arguments to follow
+ feature args Feature arguments, e.g. `no_hydration`
+ #core args Even number of core arguments to follow
+ core args Key/value pairs for tuning the core, e.g.
+ `hydration_threshold 256`
+ clone metadata mode ro if read-only, rw if read-write
+
+ In serious cases where even a read-only mode is deemed
+ unsafe no further I/O will be permitted and the status
+ will just contain the string 'Fail'. If the metadata
+ mode changes, a dm event will be sent to user space.
+ ======================= =======================================================
+
+Messages
+--------
+
+ `disable_hydration`
+ Disable the background hydration of the destination device.
+
+ `enable_hydration`
+ Enable the background hydration of the destination device.
+
+ `hydration_threshold <#regions>`
+ Set background hydration threshold.
+
+ `hydration_batch_size <#regions>`
+ Set background hydration batch size.
+
+Examples
+========
+
+Clone a device containing a file system
+---------------------------------------
+
+1. Create the dm-clone device.
+
+ ::
+
+ dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \
+ $source_dev 8 1 no_hydration"
+
+2. Mount the device and trim the file system. dm-clone interprets the discards
+ sent by the file system and it will not hydrate the unused space.
+
+ ::
+
+ mount /dev/mapper/clone /mnt/cloned-fs
+ fstrim /mnt/cloned-fs
+
+3. Enable background hydration of the destination device.
+
+ ::
+
+ dmsetup message clone 0 enable_hydration
+
+4. When the hydration finishes, we can replace the dm-clone table with a linear
+ table.
+
+ ::
+
+ dmsetup suspend clone
+ dmsetup load clone --table "0 1048576000 linear $dest_dev 0"
+ dmsetup resume clone
+
+ The metadata device is no longer needed and can be safely discarded or reused
+ for other purposes.
+
+Known issues
+============
+
+1. We redirect reads, to not-yet-hydrated regions, to the source device. If
+ reading the source device has high latency and the user repeatedly reads from
+ the same regions, this behaviour could degrade performance. We should use
+ these reads as hints to hydrate the relevant regions sooner. Currently, we
+ rely on the page cache to cache these regions, so we hopefully don't end up
+ reading them multiple times from the source device.
+
+2. Release in-core resources, i.e., the bitmaps tracking which regions are
+ hydrated, after the hydration has finished.
+
+3. During background hydration, if we fail to read the source or write to the
+ destination device, we print an error message, but the hydration process
+ continues indefinitely, until it succeeds. We should stop the background
+ hydration after a number of failures and emit a dm event for user space to
+ notice.
+
+Why not...?
+===========
+
+We explored the following alternatives before implementing dm-clone:
+
+1. Use dm-cache with cache size equal to the source device and implement a new
+ cloning policy:
+
+ * The resulting cache device is not a one-to-one mirror of the source device
+ and thus we cannot remove the cache device once cloning completes.
+
+ * dm-cache writes to the source device, which violates our requirement that
+ the source device must be treated as read-only.
+
+ * Caching is semantically different from cloning.
+
+2. Use dm-snapshot with a COW device equal to the source device:
+
+ * dm-snapshot stores its metadata in the COW device, so the resulting device
+ is not a one-to-one mirror of the source device.
+
+ * No background copying mechanism.
+
+ * dm-snapshot needs to commit its metadata whenever a pending exception
+ completes, to ensure snapshot consistency. In the case of cloning, we don't
+ need to be so strict and can rely on committing metadata every time a FLUSH
+ or FUA bio is written, or periodically, like dm-thin and dm-cache do. This
+ improves the performance significantly.
+
+3. Use dm-mirror: The mirror target has a background copying/mirroring
+ mechanism, but it writes to all mirrors, thus violating our requirement that
+ the source device must be treated as read-only.
+
+4. Use dm-thin's external snapshot functionality. This approach is the most
+ promising among all alternatives, as the thinly-provisioned volume is a
+ one-to-one mirror of the source device and handles reads and writes to
+ un-provisioned/not-yet-cloned areas the same way as dm-clone does.
+
+ Still:
+
+ * There is no background copying mechanism, though one could be implemented.
+
+ * Most importantly, we want to support arbitrary block devices as the
+ destination of the cloning process and not restrict ourselves to
+ thinly-provisioned volumes. Thin-provisioning has an inherent metadata
+ overhead, for maintaining the thin volume mappings, which significantly
+ degrades performance.
+
+ Moreover, cloning a device shouldn't force the use of thin-provisioning. On
+ the other hand, if we wish to use thin provisioning, we can just use a thin
+ LV as dm-clone's destination device.