aboutsummaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/ext4.rst
diff options
context:
space:
mode:
authorLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
committerLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
commit5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
treecc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/admin-guide/ext4.rst
downloadlinux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz
linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
Diffstat (limited to 'Documentation/admin-guide/ext4.rst')
-rw-r--r--Documentation/admin-guide/ext4.rst627
1 files changed, 627 insertions, 0 deletions
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
new file mode 100644
index 000000000..4c559e08d
--- /dev/null
+++ b/Documentation/admin-guide/ext4.rst
@@ -0,0 +1,627 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+ext4 General Information
+========================
+
+Ext4 is an advanced level of the ext3 filesystem which incorporates
+scalability and reliability enhancements for supporting large filesystems
+(64 bit) in keeping with increasing disk capacities and state-of-the-art
+feature requirements.
+
+Mailing list: linux-ext4@vger.kernel.org
+Web site: http://ext4.wiki.kernel.org
+
+
+Quick usage instructions
+========================
+
+Note: More extensive information for getting started with ext4 can be
+found at the ext4 wiki site at the URL:
+http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+
+ - The latest version of e2fsprogs can be found at:
+
+ https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+
+ or
+
+ http://sourceforge.net/project/showfiles.php?group_id=2406
+
+ or grab the latest git repository from:
+
+ https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+
+ - Create a new filesystem using the ext4 filesystem type:
+
+ # mke2fs -t ext4 /dev/hda1
+
+ Or to configure an existing ext3 filesystem to support extents:
+
+ # tune2fs -O extents /dev/hda1
+
+ If the filesystem was created with 128 byte inodes, it can be
+ converted to use 256 byte for greater efficiency via:
+
+ # tune2fs -I 256 /dev/hda1
+
+ - Mounting:
+
+ # mount -t ext4 /dev/hda1 /wherever
+
+ - When comparing performance with other filesystems, it's always
+ important to try multiple workloads; very often a subtle change in a
+ workload parameter can completely change the ranking of which
+ filesystems do well compared to others. When comparing versus ext3,
+ note that ext4 enables write barriers by default, while ext3 does
+ not enable write barriers by default. So it is useful to use
+ explicitly specify whether barriers are enabled or not when via the
+ '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
+ for a fair comparison. When tuning ext3 for best benchmark numbers,
+ it is often worthwhile to try changing the data journaling mode; '-o
+ data=writeback' can be faster for some workloads. (Note however that
+ running mounted with data=writeback can potentially leave stale data
+ exposed in recently written files in case of an unclean shutdown,
+ which could be a security exposure in some situations.) Configuring
+ the filesystem with a large journal can also be helpful for
+ metadata-intensive workloads.
+
+Features
+========
+
+Currently Available
+-------------------
+
+* ability to use filesystems > 16TB (e2fsprogs support not available yet)
+* extent format reduces metadata overhead (RAM, IO for access, transactions)
+* extent format more robust in face of on-disk corruption due to magics,
+* internal redundancy in tree
+* improved file allocation (multi-block alloc)
+* lift 32000 subdirectory limit imposed by i_links_count[1]
+* nsec timestamps for mtime, atime, ctime, create time
+* inode version field on disk (NFSv4, Lustre)
+* reduced e2fsck time via uninit_bg feature
+* journal checksumming for robustness, performance
+* persistent file preallocation (e.g for streaming media, databases)
+* ability to pack bitmaps and inode tables into larger virtual groups via the
+ flex_bg feature
+* large file support
+* inode allocation using large virtual block groups via flex_bg
+* delayed allocation
+* large block (up to pagesize) support
+* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
+ the ordering)
+* Case-insensitive file name lookups
+* file-based encryption support (fscrypt)
+* file-based verity support (fsverity)
+
+[1] Filesystems with a block size of 1k may see a limit imposed by the
+directory hash tree having a maximum depth of two.
+
+case-insensitive file name lookups
+======================================================
+
+The case-insensitive file name lookup feature is supported on a
+per-directory basis, allowing the user to mix case-insensitive and
+case-sensitive directories in the same filesystem. It is enabled by
+flipping the +F inode attribute of an empty directory. The
+case-insensitive string match operation is only defined when we know how
+text in encoded in a byte sequence. For that reason, in order to enable
+case-insensitive directories, the filesystem must have the
+casefold feature, which stores the filesystem-wide encoding
+model used. By default, the charset adopted is the latest version of
+Unicode (12.1.0, by the time of this writing), encoded in the UTF-8
+form. The comparison algorithm is implemented by normalizing the
+strings to the Canonical decomposition form, as defined by Unicode,
+followed by a byte per byte comparison.
+
+The case-awareness is name-preserving on the disk, meaning that the file
+name provided by userspace is a byte-per-byte match to what is actually
+written in the disk. The Unicode normalization format used by the
+kernel is thus an internal representation, and not exposed to the
+userspace nor to the disk, with the important exception of disk hashes,
+used on large case-insensitive directories with DX feature. On DX
+directories, the hash must be calculated using the casefolded version of
+the filename, meaning that the normalization format used actually has an
+impact on where the directory entry is stored.
+
+When we change from viewing filenames as opaque byte sequences to seeing
+them as encoded strings we need to address what happens when a program
+tries to create a file with an invalid name. The Unicode subsystem
+within the kernel leaves the decision of what to do in this case to the
+filesystem, which select its preferred behavior by enabling/disabling
+the strict mode. When Ext4 encounters one of those strings and the
+filesystem did not require strict mode, it falls back to considering the
+entire string as an opaque byte sequence, which still allows the user to
+operate on that file, but the case-insensitive lookups won't work.
+
+Options
+=======
+
+When mounting an ext4 filesystem, the following option are accepted:
+(*) == default
+
+ ro
+ Mount filesystem read only. Note that ext4 will replay the journal (and
+ thus write to the partition) even when mounted "read only". The mount
+ options "ro,noload" can be used to prevent writes to the filesystem.
+
+ journal_checksum
+ Enable checksumming of the journal transactions. This will allow the
+ recovery code in e2fsck and the kernel to detect corruption in the
+ kernel. It is a compatible change and will be ignored by older
+ kernels.
+
+ journal_async_commit
+ Commit block can be written to disk without waiting for descriptor
+ blocks. If enabled older kernels cannot mount the device. This will
+ enable 'journal_checksum' internally.
+
+ journal_path=path, journal_dev=devnum
+ When the external journal device's major/minor numbers have changed,
+ these options allow the user to specify the new journal location. The
+ journal device is identified through either its new major/minor numbers
+ encoded in devnum, or via a path to the device.
+
+ norecovery, noload
+ Don't load the journal on mounting. Note that if the filesystem was
+ not unmounted cleanly, skipping the journal replay will lead to the
+ filesystem containing inconsistencies that can lead to any number of
+ problems.
+
+ data=journal
+ All data are committed into the journal prior to being written into the
+ main file system. Enabling this mode will disable delayed allocation
+ and O_DIRECT support.
+
+ data=ordered (*)
+ All data are forced directly out to the main file system prior to its
+ metadata being committed to the journal.
+
+ data=writeback
+ Data ordering is not preserved, data may be written into the main file
+ system after its metadata has been committed to the journal.
+
+ commit=nrsec (*)
+ This setting limits the maximum age of the running transaction to
+ 'nrsec' seconds. The default value is 5 seconds. This means that if
+ you lose your power, you will lose as much as the latest 5 seconds of
+ metadata changes (your filesystem will not be damaged though, thanks
+ to the journaling). This default value (or any low value) will hurt
+ performance, but it's good for data-safety. Setting it to 0 will have
+ the same effect as leaving it at the default (5 seconds). Setting it
+ to very large values will improve performance. Note that due to
+ delayed allocation even older data can be lost on power failure since
+ writeback of those data begins only after time set in
+ /proc/sys/vm/dirty_expire_centisecs.
+
+ barrier=<0|1(*)>, barrier(*), nobarrier
+ This enables/disables the use of write barriers in the jbd code.
+ barrier=0 disables, barrier=1 enables. This also requires an IO stack
+ which can support barriers, and if jbd gets an error on a barrier
+ write, it will disable again with a warning. Write barriers enforce
+ proper on-disk ordering of journal commits, making volatile disk write
+ caches safe to use, at some performance penalty. If your disks are
+ battery-backed in one way or another, disabling barriers may safely
+ improve performance. The mount options "barrier" and "nobarrier" can
+ also be used to enable or disable barriers, for consistency with other
+ ext4 mount options.
+
+ inode_readahead_blks=n
+ This tuning parameter controls the maximum number of inode table blocks
+ that ext4's inode table readahead algorithm will pre-read into the
+ buffer cache. The default value is 32 blocks.
+
+ nouser_xattr
+ Disables Extended User Attributes. See the attr(5) manual page for
+ more information about extended attributes.
+
+ noacl
+ This option disables POSIX Access Control List support. If ACL support
+ is enabled in the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL
+ is enabled by default on mount. See the acl(5) manual page for more
+ information about acl.
+
+ bsddf (*)
+ Make 'df' act like BSD.
+
+ minixdf
+ Make 'df' act like Minix.
+
+ debug
+ Extra debugging information is sent to syslog.
+
+ abort
+ Simulate the effects of calling ext4_abort() for debugging purposes.
+ This is normally used while remounting a filesystem which is already
+ mounted.
+
+ errors=remount-ro
+ Remount the filesystem read-only on an error.
+
+ errors=continue
+ Keep going on a filesystem error.
+
+ errors=panic
+ Panic and halt the machine if an error occurs. (These mount options
+ override the errors behavior specified in the superblock, which can be
+ configured using tune2fs)
+
+ data_err=ignore(*)
+ Just print an error message if an error occurs in a file data buffer in
+ ordered mode.
+ data_err=abort
+ Abort the journal if an error occurs in a file data buffer in ordered
+ mode.
+
+ grpid | bsdgroups
+ New objects have the group ID of their parent.
+
+ nogrpid (*) | sysvgroups
+ New objects have the group ID of their creator.
+
+ resgid=n
+ The group ID which may use the reserved blocks.
+
+ resuid=n
+ The user ID which may use the reserved blocks.
+
+ sb=
+ Use alternate superblock at this location.
+
+ quota, noquota, grpquota, usrquota
+ These options are ignored by the filesystem. They are used only by
+ quota tools to recognize volumes where quota should be turned on. See
+ documentation in the quota-tools package for more details
+ (http://sourceforge.net/projects/linuxquota).
+
+ jqfmt=<quota type>, usrjquota=<file>, grpjquota=<file>
+ These options tell filesystem details about quota so that quota
+ information can be properly updated during journal replay. They replace
+ the above quota options. See documentation in the quota-tools package
+ for more details (http://sourceforge.net/projects/linuxquota).
+
+ stripe=n
+ Number of filesystem blocks that mballoc will try to use for allocation
+ size and alignment. For RAID5/6 systems this should be the number of
+ data disks * RAID chunk size in file system blocks.
+
+ delalloc (*)
+ Defer block allocation until just before ext4 writes out the block(s)
+ in question. This allows ext4 to better allocation decisions more
+ efficiently.
+
+ nodelalloc
+ Disable delayed allocation. Blocks are allocated when the data is
+ copied from userspace to the page cache, either via the write(2) system
+ call or when an mmap'ed page which was previously unallocated is
+ written for the first time.
+
+ max_batch_time=usec
+ Maximum amount of time ext4 should wait for additional filesystem
+ operations to be batch together with a synchronous write operation.
+ Since a synchronous write operation is going to force a commit and then
+ a wait for the I/O complete, it doesn't cost much, and can be a huge
+ throughput win, we wait for a small amount of time to see if any other
+ transactions can piggyback on the synchronous write. The algorithm
+ used is designed to automatically tune for the speed of the disk, by
+ measuring the amount of time (on average) that it takes to finish
+ committing a transaction. Call this time the "commit time". If the
+ time that the transaction has been running is less than the commit
+ time, ext4 will try sleeping for the commit time to see if other
+ operations will join the transaction. The commit time is capped by
+ the max_batch_time, which defaults to 15000us (15ms). This
+ optimization can be turned off entirely by setting max_batch_time to 0.
+
+ min_batch_time=usec
+ This parameter sets the commit time (as described above) to be at least
+ min_batch_time. It defaults to zero microseconds. Increasing this
+ parameter may improve the throughput of multi-threaded, synchronous
+ workloads on very fast disks, at the cost of increasing latency.
+
+ journal_ioprio=prio
+ The I/O priority (from 0 to 7, where 0 is the highest priority) which
+ should be used for I/O operations submitted by kjournald2 during a
+ commit operation. This defaults to 3, which is a slightly higher
+ priority than the default I/O priority.
+
+ auto_da_alloc(*), noauto_da_alloc
+ Many broken applications don't use fsync() when replacing existing
+ files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/
+ rename("foo.new", "foo"), or worse yet, fd = open("foo",
+ O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4
+ will detect the replace-via-rename and replace-via-truncate patterns
+ and force that any delayed allocation blocks are allocated such that at
+ the next journal commit, in the default data=ordered mode, the data
+ blocks of the new file are forced to disk before the rename() operation
+ is committed. This provides roughly the same level of guarantees as
+ ext3, and avoids the "zero-length" problem that can happen when a
+ system crashes before the delayed allocation blocks are forced to disk.
+
+ noinit_itable
+ Do not initialize any uninitialized inode table blocks in the
+ background. This feature may be used by installation CD's so that the
+ install process can complete as quickly as possible; the inode table
+ initialization process would then be deferred until the next time the
+ file system is unmounted.
+
+ init_itable=n
+ The lazy itable init code will wait n times the number of milliseconds
+ it took to zero out the previous block group's inode table. This
+ minimizes the impact on the system performance while file system's
+ inode table is being initialized.
+
+ discard, nodiscard(*)
+ Controls whether ext4 should issue discard/TRIM commands to the
+ underlying block device when blocks are freed. This is useful for SSD
+ devices and sparse/thinly-provisioned LUNs, but it is off by default
+ until sufficient testing has been done.
+
+ nouid32
+ Disables 32-bit UIDs and GIDs. This is for interoperability with
+ older kernels which only store and expect 16-bit values.
+
+ block_validity(*), noblock_validity
+ These options enable or disable the in-kernel facility for tracking
+ filesystem metadata blocks within internal data structures. This
+ allows multi- block allocator and other routines to notice bugs or
+ corrupted allocation bitmaps which cause blocks to be allocated which
+ overlap with filesystem metadata blocks.
+
+ dioread_lock, dioread_nolock
+ Controls whether or not ext4 should use the DIO read locking. If the
+ dioread_nolock option is specified ext4 will allocate uninitialized
+ extent before buffer write and convert the extent to initialized after
+ IO completes. This approach allows ext4 code to avoid using inode
+ mutex, which improves scalability on high speed storages. However this
+ does not work with data journaling and dioread_nolock option will be
+ ignored with kernel warning. Note that dioread_nolock code path is only
+ used for extent-based files. Because of the restrictions this options
+ comprises it is off by default (e.g. dioread_lock).
+
+ max_dir_size_kb=n
+ This limits the size of directories so that any attempt to expand them
+ beyond the specified limit in kilobytes will cause an ENOSPC error.
+ This is useful in memory constrained environments, where a very large
+ directory can cause severe performance problems or even provoke the Out
+ Of Memory killer. (For example, if there is only 512mb memory
+ available, a 176mb directory may seriously cramp the system's style.)
+
+ i_version
+ Enable 64-bit inode version support. This option is off by default.
+
+ dax
+ Use direct access (no page cache). See
+ Documentation/filesystems/dax.rst. Note that this option is
+ incompatible with data=journal.
+
+ inlinecrypt
+ When possible, encrypt/decrypt the contents of encrypted files using the
+ blk-crypto framework rather than filesystem-layer encryption. This
+ allows the use of inline encryption hardware. The on-disk format is
+ unaffected. For more details, see
+ Documentation/block/inline-encryption.rst.
+
+Data Mode
+=========
+There are 3 different data modes:
+
+* writeback mode
+
+ In data=writeback mode, ext4 does not journal data at all. This mode provides
+ a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+ mode - metadata journaling. A crash+recovery can cause incorrect data to
+ appear in files which were written shortly before the crash. This mode will
+ typically provide the best ext4 performance.
+
+* ordered mode
+
+ In data=ordered mode, ext4 only officially journals metadata, but it logically
+ groups metadata information related to data changes with the data blocks into
+ a single unit called a transaction. When it's time to write the new metadata
+ out to disk, the associated data blocks are written first. In general, this
+ mode performs slightly slower than writeback but significantly faster than
+ journal mode.
+
+* journal mode
+
+ data=journal mode provides full data and metadata journaling. All new data is
+ written to the journal first, and then to its final location. In the event of
+ a crash, the journal can be replayed, bringing both data and metadata into a
+ consistent state. This mode is the slowest except when data needs to be read
+ from and written to disk at the same time where it outperforms all others
+ modes. Enabling this mode will disable delayed allocation and O_DIRECT
+ support.
+
+/proc entries
+=============
+
+Information about mounted ext4 file systems can be found in
+/proc/fs/ext4. Each mounted filesystem will have a directory in
+/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or
+/proc/fs/ext4/dm-0). The files in each per-device directory are shown
+in table below.
+
+Files in /proc/fs/ext4/<devname>
+
+ mb_groups
+ details of multiblock allocator buddy cache of free blocks
+
+/sys entries
+============
+
+Information about mounted ext4 file systems can be found in
+/sys/fs/ext4. Each mounted filesystem will have a directory in
+/sys/fs/ext4 based on its device name (i.e., /sys/fs/ext4/hdc or
+/sys/fs/ext4/dm-0). The files in each per-device directory are shown
+in table below.
+
+Files in /sys/fs/ext4/<devname>:
+
+(see also Documentation/ABI/testing/sysfs-fs-ext4)
+
+ delayed_allocation_blocks
+ This file is read-only and shows the number of blocks that are dirty in
+ the page cache, but which do not have their location in the filesystem
+ allocated yet.
+
+ inode_goal
+ Tuning parameter which (if non-zero) controls the goal inode used by
+ the inode allocator in preference to all other allocation heuristics.
+ This is intended for debugging use only, and should be 0 on production
+ systems.
+
+ inode_readahead_blks
+ Tuning parameter which controls the maximum number of inode table
+ blocks that ext4's inode table readahead algorithm will pre-read into
+ the buffer cache.
+
+ lifetime_write_kbytes
+ This file is read-only and shows the number of kilobytes of data that
+ have been written to this filesystem since it was created.
+
+ max_writeback_mb_bump
+ The maximum number of megabytes the writeback code will try to write
+ out before move on to another inode.
+
+ mb_group_prealloc
+ The multiblock allocator will round up allocation requests to a
+ multiple of this tuning parameter if the stripe size is not set in the
+ ext4 superblock
+
+ mb_max_inode_prealloc
+ The maximum length of per-inode ext4_prealloc_space list.
+
+ mb_max_to_scan
+ The maximum number of extents the multiblock allocator will search to
+ find the best extent.
+
+ mb_min_to_scan
+ The minimum number of extents the multiblock allocator will search to
+ find the best extent.
+
+ mb_order2_req
+ Tuning parameter which controls the minimum size for requests (as a
+ power of 2) where the buddy cache is used.
+
+ mb_stats
+ Controls whether the multiblock allocator should collect statistics,
+ which are shown during the unmount. 1 means to collect statistics, 0
+ means not to collect statistics.
+
+ mb_stream_req
+ Files which have fewer blocks than this tunable parameter will have
+ their blocks allocated out of a block group specific preallocation
+ pool, so that small files are packed closely together. Each large file
+ will have its blocks allocated out of its own unique preallocation
+ pool.
+
+ session_write_kbytes
+ This file is read-only and shows the number of kilobytes of data that
+ have been written to this filesystem since it was mounted.
+
+ reserved_clusters
+ This is RW file and contains number of reserved clusters in the file
+ system which will be used in the specific situations to avoid costly
+ zeroout, unexpected ENOSPC, or possible data loss. The default is 2% or
+ 4096 clusters, whichever is smaller and this can be changed however it
+ can never exceed number of clusters in the file system. If there is not
+ enough space for the reserved space when mounting the file mount will
+ _not_ fail.
+
+Ioctls
+======
+
+Ext4 implements various ioctls which can be used by applications to access
+ext4-specific functionality. An incomplete list of these ioctls is shown in the
+table below. This list includes truly ext4-specific ioctls (``EXT4_IOC_*``) as
+well as ioctls that may have been ext4-specific originally but are now supported
+by some other filesystem(s) too (``FS_IOC_*``).
+
+Table of Ext4 ioctls
+
+ FS_IOC_GETFLAGS
+ Get additional attributes associated with inode. The ioctl argument is
+ an integer bitfield, with bit values described in ext4.h.
+
+ FS_IOC_SETFLAGS
+ Set additional attributes associated with inode. The ioctl argument is
+ an integer bitfield, with bit values described in ext4.h.
+
+ EXT4_IOC_GETVERSION, EXT4_IOC_GETVERSION_OLD
+ Get the inode i_generation number stored for each inode. The
+ i_generation number is normally changed only when new inode is created
+ and it is particularly useful for network filesystems. The '_OLD'
+ version of this ioctl is an alias for FS_IOC_GETVERSION.
+
+ EXT4_IOC_SETVERSION, EXT4_IOC_SETVERSION_OLD
+ Set the inode i_generation number stored for each inode. The '_OLD'
+ version of this ioctl is an alias for FS_IOC_SETVERSION.
+
+ EXT4_IOC_GROUP_EXTEND
+ This ioctl has the same purpose as the resize mount option. It allows
+ to resize filesystem to the end of the last existing block group,
+ further resize has to be done with resize2fs, either online, or
+ offline. The argument points to the unsigned logn number representing
+ the filesystem new block count.
+
+ EXT4_IOC_MOVE_EXT
+ Move the block extents from orig_fd (the one this ioctl is pointing to)
+ to the donor_fd (the one specified in move_extent structure passed as
+ an argument to this ioctl). Then, exchange inode metadata between
+ orig_fd and donor_fd. This is especially useful for online
+ defragmentation, because the allocator has the opportunity to allocate
+ moved blocks better, ideally into one contiguous extent.
+
+ EXT4_IOC_GROUP_ADD
+ Add a new group descriptor to an existing or new group descriptor
+ block. The new group descriptor is described by ext4_new_group_input
+ structure, which is passed as an argument to this ioctl. This is
+ especially useful in conjunction with EXT4_IOC_GROUP_EXTEND, which
+ allows online resize of the filesystem to the end of the last existing
+ block group. Those two ioctls combined is used in userspace online
+ resize tool (e.g. resize2fs).
+
+ EXT4_IOC_MIGRATE
+ This ioctl operates on the filesystem itself. It converts (migrates)
+ ext3 indirect block mapped inode to ext4 extent mapped inode by walking
+ through indirect block mapping of the original inode and converting
+ contiguous block ranges into ext4 extents of the temporary inode. Then,
+ inodes are swapped. This ioctl might help, when migrating from ext3 to
+ ext4 filesystem, however suggestion is to create fresh ext4 filesystem
+ and copy data from the backup. Note, that filesystem has to support
+ extents for this ioctl to work.
+
+ EXT4_IOC_ALLOC_DA_BLKS
+ Force all of the delay allocated blocks to be allocated to preserve
+ application-expected ext3 behaviour. Note that this will also start
+ triggering a write of the data blocks, but this behaviour may change in
+ the future as it is not necessary and has been done this way only for
+ sake of simplicity.
+
+ EXT4_IOC_RESIZE_FS
+ Resize the filesystem to a new size. The number of blocks of resized
+ filesystem is passed in via 64 bit integer argument. The kernel
+ allocates bitmaps and inode table, the userspace tool thus just passes
+ the new number of blocks.
+
+ EXT4_IOC_SWAP_BOOT
+ Swap i_blocks and associated attributes (like i_blocks, i_size,
+ i_flags, ...) from the specified inode with inode EXT4_BOOT_LOADER_INO
+ (#5). This is typically used to store a boot loader in a secure part of
+ the filesystem, where it can't be changed by a normal user by accident.
+ The data blocks of the previous boot loader will be associated with the
+ given inode.
+
+References
+==========
+
+kernel source: <file:fs/ext4/>
+ <file:fs/jbd2/>
+
+programs: http://e2fsprogs.sourceforge.net/
+
+useful links: https://fedoraproject.org/wiki/ext3-devel
+ http://www.bullopensource.org/ext4/
+ http://ext4.wiki.kernel.org/index.php/Main_Page
+ https://fedoraproject.org/wiki/Features/Ext4