diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/admin-guide/device-mapper/dm-clone.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/admin-guide/device-mapper/dm-clone.rst')
-rw-r--r-- | Documentation/admin-guide/device-mapper/dm-clone.rst | 333 |
1 files changed, 333 insertions, 0 deletions
diff --git a/Documentation/admin-guide/device-mapper/dm-clone.rst b/Documentation/admin-guide/device-mapper/dm-clone.rst new file mode 100644 index 000000000..b43a34c14 --- /dev/null +++ b/Documentation/admin-guide/device-mapper/dm-clone.rst @@ -0,0 +1,333 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +======== +dm-clone +======== + +Introduction +============ + +dm-clone is a device mapper target which produces a one-to-one copy of an +existing, read-only source device into a writable destination device: It +presents a virtual block device which makes all data appear immediately, and +redirects reads and writes accordingly. + +The main use case of dm-clone is to clone a potentially remote, high-latency, +read-only, archival-type block device into a writable, fast, primary-type device +for fast, low-latency I/O. The cloned device is visible/mountable immediately +and the copy of the source device to the destination device happens in the +background, in parallel with user I/O. + +For example, one could restore an application backup from a read-only copy, +accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE, +etc.), into a local SSD or NVMe device, and start using the device immediately, +without waiting for the restore to complete. + +When the cloning completes, the dm-clone table can be removed altogether and be +replaced, e.g., by a linear table, mapping directly to the destination device. + +The dm-clone target reuses the metadata library used by the thin-provisioning +target. + +Glossary +======== + + Hydration + The process of filling a region of the destination device with data from + the same region of the source device, i.e., copying the region from the + source to the destination device. + +Once a region gets hydrated we redirect all I/O regarding it to the destination +device. + +Design +====== + +Sub-devices +----------- + +The target is constructed by passing three devices to it (along with other +parameters detailed later): + +1. A source device - the read-only device that gets cloned and source of the + hydration. + +2. A destination device - the destination of the hydration, which will become a + clone of the source device. + +3. A small metadata device - it records which regions are already valid in the + destination device, i.e., which regions have already been hydrated, or have + been written to directly, via user I/O. + +The size of the destination device must be at least equal to the size of the +source device. + +Regions +------- + +dm-clone divides the source and destination devices in fixed sized regions. +Regions are the unit of hydration, i.e., the minimum amount of data copied from +the source to the destination device. + +The region size is configurable when you first create the dm-clone device. The +recommended region size is the same as the file system block size, which usually +is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors +(1GB) and a power of two. + +Reads and writes from/to hydrated regions are serviced from the destination +device. + +A read to a not yet hydrated region is serviced directly from the source device. + +A write to a not yet hydrated region will be delayed until the corresponding +region has been hydrated and the hydration of the region starts immediately. + +Note that a write request with size equal to region size will skip copying of +the corresponding region from the source device and overwrite the region of the +destination device directly. + +Discards +-------- + +dm-clone interprets a discard request to a range that hasn't been hydrated yet +as a hint to skip hydration of the regions covered by the request, i.e., it +skips copying the region's data from the source to the destination device, and +only updates its metadata. + +If the destination device supports discards, then by default dm-clone will pass +down discard requests to it. + +Background Hydration +-------------------- + +dm-clone copies continuously from the source to the destination device, until +all of the device has been copied. + +Copying data from the source to the destination device uses bandwidth. The user +can set a throttle to prevent more than a certain amount of copying occurring at +any one time. Moreover, dm-clone takes into account user I/O traffic going to +the devices and pauses the background hydration when there is I/O in-flight. + +A message `hydration_threshold <#regions>` can be used to set the maximum number +of regions being copied, the default being 1 region. + +dm-clone employs dm-kcopyd for copying portions of the source device to the +destination device. By default, we issue copy requests of size equal to the +region size. A message `hydration_batch_size <#regions>` can be used to tune the +size of these copy requests. Increasing the hydration batch size results in +dm-clone trying to batch together contiguous regions, so we copy the data in +batches of this many regions. + +When the hydration of the destination device finishes, a dm event will be sent +to user space. + +Updating on-disk metadata +------------------------- + +On-disk metadata is committed every time a FLUSH or FUA bio is written. If no +such requests are made then commits will occur every second. This means the +dm-clone device behaves like a physical disk that has a volatile write cache. If +power is lost you may lose some recent writes. The metadata should always be +consistent in spite of any crash. + +Target Interface +================ + +Constructor +----------- + + :: + + clone <metadata dev> <destination dev> <source dev> <region size> + [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]] + + ================ ============================================================== + metadata dev Fast device holding the persistent metadata + destination dev The destination device, where the source will be cloned + source dev Read only device containing the data that gets cloned + region size The size of a region in sectors + + #feature args Number of feature arguments passed + feature args no_hydration or no_discard_passdown + + #core args An even number of arguments corresponding to key/value pairs + passed to dm-clone + core args Key/value pairs passed to dm-clone, e.g. `hydration_threshold + 256` + ================ ============================================================== + +Optional feature arguments are: + + ==================== ========================================================= + no_hydration Create a dm-clone instance with background hydration + disabled + no_discard_passdown Disable passing down discards to the destination device + ==================== ========================================================= + +Optional core arguments are: + + ================================ ============================================== + hydration_threshold <#regions> Maximum number of regions being copied from + the source to the destination device at any + one time, during background hydration. + hydration_batch_size <#regions> During background hydration, try to batch + together contiguous regions, so we copy data + from the source to the destination device in + batches of this many regions. + ================================ ============================================== + +Status +------ + + :: + + <metadata block size> <#used metadata blocks>/<#total metadata blocks> + <region size> <#hydrated regions>/<#total regions> <#hydrating regions> + <#feature args> <feature args>* <#core args> <core args>* + <clone metadata mode> + + ======================= ======================================================= + metadata block size Fixed block size for each metadata block in sectors + #used metadata blocks Number of metadata blocks used + #total metadata blocks Total number of metadata blocks + region size Configurable region size for the device in sectors + #hydrated regions Number of regions that have finished hydrating + #total regions Total number of regions to hydrate + #hydrating regions Number of regions currently hydrating + #feature args Number of feature arguments to follow + feature args Feature arguments, e.g. `no_hydration` + #core args Even number of core arguments to follow + core args Key/value pairs for tuning the core, e.g. + `hydration_threshold 256` + clone metadata mode ro if read-only, rw if read-write + + In serious cases where even a read-only mode is deemed + unsafe no further I/O will be permitted and the status + will just contain the string 'Fail'. If the metadata + mode changes, a dm event will be sent to user space. + ======================= ======================================================= + +Messages +-------- + + `disable_hydration` + Disable the background hydration of the destination device. + + `enable_hydration` + Enable the background hydration of the destination device. + + `hydration_threshold <#regions>` + Set background hydration threshold. + + `hydration_batch_size <#regions>` + Set background hydration batch size. + +Examples +======== + +Clone a device containing a file system +--------------------------------------- + +1. Create the dm-clone device. + + :: + + dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \ + $source_dev 8 1 no_hydration" + +2. Mount the device and trim the file system. dm-clone interprets the discards + sent by the file system and it will not hydrate the unused space. + + :: + + mount /dev/mapper/clone /mnt/cloned-fs + fstrim /mnt/cloned-fs + +3. Enable background hydration of the destination device. + + :: + + dmsetup message clone 0 enable_hydration + +4. When the hydration finishes, we can replace the dm-clone table with a linear + table. + + :: + + dmsetup suspend clone + dmsetup load clone --table "0 1048576000 linear $dest_dev 0" + dmsetup resume clone + + The metadata device is no longer needed and can be safely discarded or reused + for other purposes. + +Known issues +============ + +1. We redirect reads, to not-yet-hydrated regions, to the source device. If + reading the source device has high latency and the user repeatedly reads from + the same regions, this behaviour could degrade performance. We should use + these reads as hints to hydrate the relevant regions sooner. Currently, we + rely on the page cache to cache these regions, so we hopefully don't end up + reading them multiple times from the source device. + +2. Release in-core resources, i.e., the bitmaps tracking which regions are + hydrated, after the hydration has finished. + +3. During background hydration, if we fail to read the source or write to the + destination device, we print an error message, but the hydration process + continues indefinitely, until it succeeds. We should stop the background + hydration after a number of failures and emit a dm event for user space to + notice. + +Why not...? +=========== + +We explored the following alternatives before implementing dm-clone: + +1. Use dm-cache with cache size equal to the source device and implement a new + cloning policy: + + * The resulting cache device is not a one-to-one mirror of the source device + and thus we cannot remove the cache device once cloning completes. + + * dm-cache writes to the source device, which violates our requirement that + the source device must be treated as read-only. + + * Caching is semantically different from cloning. + +2. Use dm-snapshot with a COW device equal to the source device: + + * dm-snapshot stores its metadata in the COW device, so the resulting device + is not a one-to-one mirror of the source device. + + * No background copying mechanism. + + * dm-snapshot needs to commit its metadata whenever a pending exception + completes, to ensure snapshot consistency. In the case of cloning, we don't + need to be so strict and can rely on committing metadata every time a FLUSH + or FUA bio is written, or periodically, like dm-thin and dm-cache do. This + improves the performance significantly. + +3. Use dm-mirror: The mirror target has a background copying/mirroring + mechanism, but it writes to all mirrors, thus violating our requirement that + the source device must be treated as read-only. + +4. Use dm-thin's external snapshot functionality. This approach is the most + promising among all alternatives, as the thinly-provisioned volume is a + one-to-one mirror of the source device and handles reads and writes to + un-provisioned/not-yet-cloned areas the same way as dm-clone does. + + Still: + + * There is no background copying mechanism, though one could be implemented. + + * Most importantly, we want to support arbitrary block devices as the + destination of the cloning process and not restrict ourselves to + thinly-provisioned volumes. Thin-provisioning has an inherent metadata + overhead, for maintaining the thin volume mappings, which significantly + degrades performance. + + Moreover, cloning a device shouldn't force the use of thin-provisioning. On + the other hand, if we wish to use thin provisioning, we can just use a thin + LV as dm-clone's destination device. |