diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/admin-guide/mm/memory-hotplug.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/admin-guide/mm/memory-hotplug.rst')
-rw-r--r-- | Documentation/admin-guide/mm/memory-hotplug.rst | 677 |
1 files changed, 677 insertions, 0 deletions
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst new file mode 100644 index 000000000..a3c9e8ad8 --- /dev/null +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -0,0 +1,677 @@ +.. _admin_guide_memory_hotplug: + +================== +Memory Hot(Un)Plug +================== + +This document describes generic Linux support for memory hot(un)plug with +a focus on System RAM, including ZONE_MOVABLE support. + +.. contents:: :local: + +Introduction +============ + +Memory hot(un)plug allows for increasing and decreasing the size of physical +memory available to a machine at runtime. In the simplest case, it consists of +physically plugging or unplugging a DIMM at runtime, coordinated with the +operating system. + +Memory hot(un)plug is used for various purposes: + +- The physical memory available to a machine can be adjusted at runtime, up- or + downgrading the memory capacity. This dynamic memory resizing, sometimes + referred to as "capacity on demand", is frequently used with virtual machines + and logical partitions. + +- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One + example is replacing failing memory modules. + +- Reducing energy consumption either by physically unplugging memory modules or + by logically unplugging (parts of) memory modules from Linux. + +Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also +used to expose persistent memory, other performance-differentiated memory and +reserved memory regions as ordinary system RAM to Linux. + +Linux only supports memory hot(un)plug on selected 64 bit architectures, such as +x86_64, arm64, ppc64, s390x and ia64. + +Memory Hot(Un)Plug Granularity +------------------------------ + +Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the +physical memory address space into chunks of the same size: memory sections. The +size of a memory section is architecture dependent. For example, x86_64 uses +128 MiB and ppc64 uses 16 MiB. + +Memory sections are combined into chunks referred to as "memory blocks". The +size of a memory block is architecture dependent and corresponds to the smallest +granularity that can be hot(un)plugged. The default size of a memory block is +the same as memory section size, unless an architecture specifies otherwise. + +All memory blocks have the same size. + +Phases of Memory Hotplug +------------------------ + +Memory hotplug consists of two phases: + +(1) Adding the memory to Linux +(2) Onlining memory blocks + +In the first phase, metadata, such as the memory map ("memmap") and page tables +for the direct mapping, is allocated and initialized, and memory blocks are +created; the latter also creates sysfs files for managing newly created memory +blocks. + +In the second phase, added memory is exposed to the page allocator. After this +phase, the memory is visible in memory statistics, such as free and total +memory, of the system. + +Phases of Memory Hotunplug +-------------------------- + +Memory hotunplug consists of two phases: + +(1) Offlining memory blocks +(2) Removing the memory from Linux + +In the fist phase, memory is "hidden" from the page allocator again, for +example, by migrating busy memory to other memory locations and removing all +relevant free pages from the page allocator After this phase, the memory is no +longer visible in memory statistics of the system. + +In the second phase, the memory blocks are removed and metadata is freed. + +Memory Hotplug Notifications +============================ + +There are various ways how Linux is notified about memory hotplug events such +that it can start adding hotplugged memory. This description is limited to +systems that support ACPI; mechanisms specific to other firmware interfaces or +virtual machines are not described. + +ACPI Notifications +------------------ + +Platforms that support ACPI, such as x86_64, can support memory hotplug +notifications via ACPI. + +In general, a firmware supporting memory hotplug defines a memory class object +HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI +driver will hotplug the memory to Linux. + +If the firmware supports hotplug of NUMA nodes, it defines an object _HID +"ACPI0004", "PNP0A05", or "PNP0A06". When notified about an hotplug event, all +assigned memory devices are added to Linux by the ACPI driver. + +Similarly, Linux can be notified about requests to hotunplug a memory device or +a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory +blocks, and, if successful, hotunplug the memory from Linux. + +Manual Probing +-------------- + +On some architectures, the firmware may not be able to notify the operating +system about a memory hotplug event. Instead, the memory has to be manually +probed from user space. + +The probe interface is located at:: + + /sys/devices/system/memory/probe + +Only complete memory blocks can be probed. Individual memory blocks are probed +by providing the physical start address of the memory block:: + + % echo addr > /sys/devices/system/memory/probe + +Which results in a memory block for the range [addr, addr + memory_block_size) +being created. + +.. note:: + + Using the probe interface is discouraged as it is easy to crash the kernel, + because Linux cannot validate user input; this interface might be removed in + the future. + +Onlining and Offlining Memory Blocks +==================================== + +After a memory block has been created, Linux has to be instructed to actually +make use of that memory: the memory block has to be "online". + +Before a memory block can be removed, Linux has to stop using any memory part of +the memory block: the memory block has to be "offlined". + +The Linux kernel can be configured to automatically online added memory blocks +and drivers automatically trigger offlining of memory blocks when trying +hotunplug of memory. Memory blocks can only be removed once offlining succeeded +and drivers may trigger offlining of memory blocks when attempting hotunplug of +memory. + +Onlining Memory Blocks Manually +------------------------------- + +If auto-onlining of memory blocks isn't enabled, user-space has to manually +trigger onlining of memory blocks. Often, udev rules are used to automate this +task in user space. + +Onlining of a memory block can be triggered via:: + + % echo online > /sys/devices/system/memory/memoryXXX/state + +Or alternatively:: + + % echo 1 > /sys/devices/system/memory/memoryXXX/online + +The kernel will select the target zone automatically, depending on the +configured ``online_policy``. + +One can explicitly request to associate an offline memory block with +ZONE_MOVABLE by:: + + % echo online_movable > /sys/devices/system/memory/memoryXXX/state + +Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by:: + + % echo online_kernel > /sys/devices/system/memory/memoryXXX/state + +In any case, if onlining succeeds, the state of the memory block is changed to +be "online". If it fails, the state of the memory block will remain unchanged +and the above commands will fail. + +Onlining Memory Blocks Automatically +------------------------------------ + +The kernel can be configured to try auto-onlining of newly added memory blocks. +If this feature is disabled, the memory blocks will stay offline until +explicitly onlined from user space. + +The configured auto-online behavior can be observed via:: + + % cat /sys/devices/system/memory/auto_online_blocks + +Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or +``online_movable`` to that file, like:: + + % echo online > /sys/devices/system/memory/auto_online_blocks + +Similarly to manual onlining, with ``online`` the kernel will select the +target zone automatically, depending on the configured ``online_policy``. + +Modifying the auto-online behavior will only affect all subsequently added +memory blocks only. + +.. note:: + + In corner cases, auto-onlining can fail. The kernel won't retry. Note that + auto-onlining is not expected to fail in default configurations. + +.. note:: + + DLPAR on ppc64 ignores the ``offline`` setting and will still online added + memory blocks; if onlining fails, memory blocks are removed again. + +Offlining Memory Blocks +----------------------- + +In the current implementation, Linux's memory offlining will try migrating all +movable pages off the affected memory block. As most kernel allocations, such as +page tables, are unmovable, page migration can fail and, therefore, inhibit +memory offlining from succeeding. + +Having the memory provided by memory block managed by ZONE_MOVABLE significantly +increases memory offlining reliability; still, memory offlining can fail in +some corner cases. + +Further, memory offlining might retry for a long time (or even forever), until +aborted by the user. + +Offlining of a memory block can be triggered via:: + + % echo offline > /sys/devices/system/memory/memoryXXX/state + +Or alternatively:: + + % echo 0 > /sys/devices/system/memory/memoryXXX/online + +If offlining succeeds, the state of the memory block is changed to be "offline". +If it fails, the state of the memory block will remain unchanged and the above +commands will fail, for example, via:: + + bash: echo: write error: Device or resource busy + +or via:: + + bash: echo: write error: Invalid argument + +Observing the State of Memory Blocks +------------------------------------ + +The state (online/offline/going-offline) of a memory block can be observed +either via:: + + % cat /sys/device/system/memory/memoryXXX/state + +Or alternatively (1/0) via:: + + % cat /sys/device/system/memory/memoryXXX/online + +For an online memory block, the managing zone can be observed via:: + + % cat /sys/device/system/memory/memoryXXX/valid_zones + +Configuring Memory Hot(Un)Plug +============================== + +There are various ways how system administrators can configure memory +hot(un)plug and interact with memory blocks, especially, to online them. + +Memory Hot(Un)Plug Configuration via Sysfs +------------------------------------------ + +Some memory hot(un)plug properties can be configured or inspected via sysfs in:: + + /sys/devices/system/memory/ + +The following files are currently defined: + +====================== ========================================================= +``auto_online_blocks`` read-write: set or get the default state of new memory + blocks; configure auto-onlining. + + The default value depends on the + CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration + option. + + See the ``state`` property of memory blocks for details. +``block_size_bytes`` read-only: the size in bytes of a memory block. +``probe`` write-only: add (probe) selected memory blocks manually + from user space by supplying the physical start address. + + Availability depends on the CONFIG_ARCH_MEMORY_PROBE + kernel configuration option. +``uevent`` read-write: generic udev file for device subsystems. +====================== ========================================================= + +.. note:: + + When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two + additional files ``hard_offline_page`` and ``soft_offline_page`` are available + to trigger hwpoisoning of pages, for example, for testing purposes. Note that + this functionality is not really related to memory hot(un)plug or actual + offlining of memory blocks. + +Memory Block Configuration via Sysfs +------------------------------------ + +Each memory block is represented as a memory block device that can be +onlined or offlined. All memory blocks have their device information located in +sysfs. Each present memory block is listed under +``/sys/devices/system/memory`` as:: + + /sys/devices/system/memory/memoryXXX + +where XXX is the memory block id; the number of digits is variable. + +A present memory block indicates that some memory in the range is present; +however, a memory block might span memory holes. A memory block spanning memory +holes cannot be offlined. + +For example, assume 1 GiB memory block size. A device for a memory starting at +0x100000000 is ``/sys/device/system/memory/memory4``:: + + (0x100000000 / 1Gib = 4) + +This device covers address range [0x100000000 ... 0x140000000) + +The following files are currently defined: + +=================== ============================================================ +``online`` read-write: simplified interface to trigger onlining / + offlining and to observe the state of a memory block. + When onlining, the zone is selected automatically. +``phys_device`` read-only: legacy interface only ever used on s390x to + expose the covered storage increment. +``phys_index`` read-only: the memory block id (XXX). +``removable`` read-only: legacy interface that indicated whether a memory + block was likely to be offlineable or not. Nowadays, the + kernel return ``1`` if and only if it supports memory + offlining. +``state`` read-write: advanced interface to trigger onlining / + offlining and to observe the state of a memory block. + + When writing, ``online``, ``offline``, ``online_kernel`` and + ``online_movable`` are supported. + + ``online_movable`` specifies onlining to ZONE_MOVABLE. + ``online_kernel`` specifies onlining to the default kernel + zone for the memory block, such as ZONE_NORMAL. + ``online`` let's the kernel select the zone automatically. + + When reading, ``online``, ``offline`` and ``going-offline`` + may be returned. +``uevent`` read-write: generic uevent file for devices. +``valid_zones`` read-only: when a block is online, shows the zone it + belongs to; when a block is offline, shows what zone will + manage it when the block will be onlined. + + For online memory blocks, ``DMA``, ``DMA32``, ``Normal``, + ``Movable`` and ``none`` may be returned. ``none`` indicates + that memory provided by a memory block is managed by + multiple zones or spans multiple nodes; such memory blocks + cannot be offlined. ``Movable`` indicates ZONE_MOVABLE. + Other values indicate a kernel zone. + + For offline memory blocks, the first column shows the + zone the kernel would select when onlining the memory block + right now without further specifying a zone. + + Availability depends on the CONFIG_MEMORY_HOTREMOVE + kernel configuration option. +=================== ============================================================ + +.. note:: + + If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/ + directories can also be accessed via symbolic links located in the + ``/sys/devices/system/node/node*`` directories. + + For example:: + + /sys/devices/system/node/node0/memory9 -> ../../memory/memory9 + + A backlink will also be created:: + + /sys/devices/system/memory/memory9/node0 -> ../../node/node0 + +Command Line Parameters +----------------------- + +Some command line parameters affect memory hot(un)plug handling. The following +command line parameters are relevant: + +======================== ======================================================= +``memhp_default_state`` configure auto-onlining by essentially setting + ``/sys/devices/system/memory/auto_online_blocks``. +``movable_node`` configure automatic zone selection in the kernel when + using the ``contig-zones`` online policy. When + set, the kernel will default to ZONE_MOVABLE when + onlining a memory block, unless other zones can be kept + contiguous. +======================== ======================================================= + +See Documentation/admin-guide/kernel-parameters.txt for a more generic +description of these command line parameters. + +Module Parameters +------------------ + +Instead of additional command line parameters or sysfs files, the +``memory_hotplug`` subsystem now provides a dedicated namespace for module +parameters. Module parameters can be set via the command line by predicating +them with ``memory_hotplug.`` such as:: + + memory_hotplug.memmap_on_memory=1 + +and they can be observed (and some even modified at runtime) via:: + + /sys/module/memory_hotplug/parameters/ + +The following module parameters are currently defined: + +================================ =============================================== +``memmap_on_memory`` read-write: Allocate memory for the memmap from + the added memory block itself. Even if enabled, + actual support depends on various other system + properties and should only be regarded as a + hint whether the behavior would be desired. + + While allocating the memmap from the memory + block itself makes memory hotplug less likely + to fail and keeps the memmap on the same NUMA + node in any case, it can fragment physical + memory in a way that huge pages in bigger + granularity cannot be formed on hotplugged + memory. +``online_policy`` read-write: Set the basic policy used for + automatic zone selection when onlining memory + blocks without specifying a target zone. + ``contig-zones`` has been the kernel default + before this parameter was added. After an + online policy was configured and memory was + online, the policy should not be changed + anymore. + + When set to ``contig-zones``, the kernel will + try keeping zones contiguous. If a memory block + intersects multiple zones or no zone, the + behavior depends on the ``movable_node`` kernel + command line parameter: default to ZONE_MOVABLE + if set, default to the applicable kernel zone + (usually ZONE_NORMAL) if not set. + + When set to ``auto-movable``, the kernel will + try onlining memory blocks to ZONE_MOVABLE if + possible according to the configuration and + memory device details. With this policy, one + can avoid zone imbalances when eventually + hotplugging a lot of memory later and still + wanting to be able to hotunplug as much as + possible reliably, very desirable in + virtualized environments. This policy ignores + the ``movable_node`` kernel command line + parameter and isn't really applicable in + environments that require it (e.g., bare metal + with hotunpluggable nodes) where hotplugged + memory might be exposed via the + firmware-provided memory map early during boot + to the system instead of getting detected, + added and onlined later during boot (such as + done by virtio-mem or by some hypervisors + implementing emulated DIMMs). As one example, a + hotplugged DIMM will be onlined either + completely to ZONE_MOVABLE or completely to + ZONE_NORMAL, not a mixture. + As another example, as many memory blocks + belonging to a virtio-mem device will be + onlined to ZONE_MOVABLE as possible, + special-casing units of memory blocks that can + only get hotunplugged together. *This policy + does not protect from setups that are + problematic with ZONE_MOVABLE and does not + change the zone of memory blocks dynamically + after they were onlined.* +``auto_movable_ratio`` read-write: Set the maximum MOVABLE:KERNEL + memory ratio in % for the ``auto-movable`` + online policy. Whether the ratio applies only + for the system across all NUMA nodes or also + per NUMA nodes depends on the + ``auto_movable_numa_aware`` configuration. + + All accounting is based on present memory pages + in the zones combined with accounting per + memory device. Memory dedicated to the CMA + allocator is accounted as MOVABLE, although + residing on one of the kernel zones. The + possible ratio depends on the actual workload. + The kernel default is "301" %, for example, + allowing for hotplugging 24 GiB to a 8 GiB VM + and automatically onlining all hotplugged + memory to ZONE_MOVABLE in many setups. The + additional 1% deals with some pages being not + present, for example, because of some firmware + allocations. + + Note that ZONE_NORMAL memory provided by one + memory device does not allow for more + ZONE_MOVABLE memory for a different memory + device. As one example, onlining memory of a + hotplugged DIMM to ZONE_NORMAL will not allow + for another hotplugged DIMM to get onlined to + ZONE_MOVABLE automatically. In contrast, memory + hotplugged by a virtio-mem device that got + onlined to ZONE_NORMAL will allow for more + ZONE_MOVABLE memory within *the same* + virtio-mem device. +``auto_movable_numa_aware`` read-write: Configure whether the + ``auto_movable_ratio`` in the ``auto-movable`` + online policy also applies per NUMA + node in addition to the whole system across all + NUMA nodes. The kernel default is "Y". + + Disabling NUMA awareness can be helpful when + dealing with NUMA nodes that should be + completely hotunpluggable, onlining the memory + completely to ZONE_MOVABLE automatically if + possible. + + Parameter availability depends on CONFIG_NUMA. +================================ =============================================== + +ZONE_MOVABLE +============ + +ZONE_MOVABLE is an important mechanism for more reliable memory offlining. +Further, having system RAM managed by ZONE_MOVABLE instead of one of the +kernel zones can increase the number of possible transparent huge pages and +dynamically allocated huge pages. + +Most kernel allocations are unmovable. Important examples include the memory +map (usually 1/64ths of memory), page tables, and kmalloc(). Such allocations +can only be served from the kernel zones. + +Most user space pages, such as anonymous memory, and page cache pages are +movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones. + +Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable +allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is +absolutely no guarantee whether a memory block can be offlined successfully. + +Zone Imbalances +--------------- + +Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance, +which can harm the system or degrade performance. As one example, the kernel +might crash because it runs out of free memory for unmovable allocations, +although there is still plenty of free memory left in ZONE_MOVABLE. + +Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1 +are definitely impossible due to the overhead for the memory map. + +Actual safe zone ratios depend on the workload. Extreme cases, like excessive +long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all. + +.. note:: + + CMA memory part of a kernel zone essentially behaves like memory in + ZONE_MOVABLE and similar considerations apply, especially when combining + CMA with ZONE_MOVABLE. + +ZONE_MOVABLE Sizing Considerations +---------------------------------- + +We usually expect that a large portion of available system RAM will actually +be consumed by user space, either directly or indirectly via the page cache. In +the normal case, ZONE_MOVABLE can be used when allocating such pages just fine. + +With that in mind, it makes sense that we can have a big portion of system RAM +managed by ZONE_MOVABLE. However, there are some things to consider when using +ZONE_MOVABLE, especially when fine-tuning zone ratios: + +- Having a lot of offline memory blocks. Even offline memory blocks consume + memory for metadata and page tables in the direct map; having a lot of offline + memory blocks is not a typical case, though. + +- Memory ballooning without balloon compaction is incompatible with + ZONE_MOVABLE. Only some implementations, such as virtio-balloon and + pseries CMM, fully support balloon compaction. + + Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be + disabled. In that case, balloon inflation will only perform unmovable + allocations and silently create a zone imbalance, usually triggered by + inflation requests from the hypervisor. + +- Gigantic pages are unmovable, resulting in user space consuming a + lot of unmovable memory. + +- Huge pages are unmovable when an architectures does not support huge + page migration, resulting in a similar issue as with gigantic pages. + +- Page tables are unmovable. Excessive swapping, mapping extremely large + files or ZONE_DEVICE memory can be problematic, although only really relevant + in corner cases. When we manage a lot of user space memory that has been + swapped out or is served from a file/persistent memory/... we still need a lot + of page tables to manage that memory once user space accessed that memory. + +- In certain DAX configurations the memory map for the device memory will be + allocated from the kernel zones. + +- KASAN can have a significant memory overhead, for example, consuming 1/8th of + the total system memory size as (unmovable) tracking metadata. + +- Long-term pinning of pages. Techniques that rely on long-term pinnings + (especially, RDMA and vfio/mdev) are fundamentally problematic with + ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside + on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they + have to be migrated off that zone while pinning. Pinning a page can fail + even if there is plenty of free memory in ZONE_MOVABLE. + + In addition, using ZONE_MOVABLE might make page pinning more expensive, + because of the page migration overhead. + +By default, all the memory configured at boot time is managed by the kernel +zones and ZONE_MOVABLE is not used. + +To enable ZONE_MOVABLE to include the memory present at boot and to control the +ratio between movable and kernel zones there are two command line options: +``kernelcore=`` and ``movablecore=``. See +Documentation/admin-guide/kernel-parameters.rst for their description. + +Memory Offlining and ZONE_MOVABLE +--------------------------------- + +Even with ZONE_MOVABLE, there are some corner cases where offlining a memory +block might fail: + +- Memory blocks with memory holes; this applies to memory blocks present during + boot and can apply to memory blocks hotplugged via the XEN balloon and the + Hyper-V balloon. + +- Mixed NUMA nodes and mixed zones within a single memory block prevent memory + offlining; this applies to memory blocks present during boot only. + +- Special memory blocks prevented by the system from getting offlined. Examples + include any memory available during boot on arm64 or memory blocks spanning + the crashkernel area on s390x; this usually applies to memory blocks present + during boot only. + +- Memory blocks overlapping with CMA areas cannot be offlined, this applies to + memory blocks present during boot only. + +- Concurrent activity that operates on the same physical memory area, such as + allocating gigantic pages, can result in temporary offlining failures. + +- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap + Optimization (HVO) is enabled. + + Offlining code may be able to migrate huge page contents, but may not be able + to dissolve the source huge page because it fails allocating (unmovable) pages + for the vmemmap, because the system might not have free memory in the kernel + zones left. + + Users that depend on memory offlining to succeed for movable zones should + carefully consider whether the memory savings gained from this feature are + worth the risk of possibly not being able to offline memory in certain + situations. + +Further, when running into out of memory situations while migrating pages, or +when still encountering permanently unmovable pages within ZONE_MOVABLE +(-> BUG), memory offlining will keep retrying until it eventually succeeds. + +When offlining is triggered from user space, the offlining context can be +terminated by sending a fatal signal. A timeout based offlining can easily be +implemented via:: + + % timeout $TIMEOUT offline_block | failure_handling |