diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/scheduler/sched-design-CFS.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/scheduler/sched-design-CFS.rst')
-rw-r--r-- | Documentation/scheduler/sched-design-CFS.rst | 249 |
1 files changed, 249 insertions, 0 deletions
diff --git a/Documentation/scheduler/sched-design-CFS.rst b/Documentation/scheduler/sched-design-CFS.rst new file mode 100644 index 000000000..03db55504 --- /dev/null +++ b/Documentation/scheduler/sched-design-CFS.rst @@ -0,0 +1,249 @@ +============= +CFS Scheduler +============= + + +1. OVERVIEW +============ + +CFS stands for "Completely Fair Scheduler," and is the new "desktop" process +scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the +replacement for the previous vanilla scheduler's SCHED_OTHER interactivity +code. + +80% of CFS's design can be summed up in a single sentence: CFS basically models +an "ideal, precise multi-tasking CPU" on real hardware. + +"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical +power and which can run each task at precise equal speed, in parallel, each at +1/nr_running speed. For example: if there are 2 tasks running, then it runs +each at 50% physical power --- i.e., actually in parallel. + +On real hardware, we can run only a single task at once, so we have to +introduce the concept of "virtual runtime." The virtual runtime of a task +specifies when its next timeslice would start execution on the ideal +multi-tasking CPU described above. In practice, the virtual runtime of a task +is its actual runtime normalized to the total number of running tasks. + + + +2. FEW IMPLEMENTATION DETAILS +============================== + +In CFS the virtual runtime is expressed and tracked via the per-task +p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately +timestamp and measure the "expected CPU time" a task should have gotten. + + Small detail: on "ideal" hardware, at any time all tasks would have the same + p->se.vruntime value --- i.e., tasks would execute simultaneously and no task + would ever get "out of balance" from the "ideal" share of CPU time. + +CFS's task picking logic is based on this p->se.vruntime value and it is thus +very simple: it always tries to run the task with the smallest p->se.vruntime +value (i.e., the task which executed least so far). CFS always tries to split +up CPU time between runnable tasks as close to "ideal multitasking hardware" as +possible. + +Most of the rest of CFS's design just falls out of this really simple concept, +with a few add-on embellishments like nice levels, multiprocessing and various +algorithm variants to recognize sleepers. + + + +3. THE RBTREE +============== + +CFS's design is quite radical: it does not use the old data structures for the +runqueues, but it uses a time-ordered rbtree to build a "timeline" of future +task execution, and thus has no "array switch" artifacts (by which both the +previous vanilla scheduler and RSDL/SD are affected). + +CFS also maintains the rq->cfs.min_vruntime value, which is a monotonic +increasing value tracking the smallest vruntime among all tasks in the +runqueue. The total amount of work done by the system is tracked using +min_vruntime; that value is used to place newly activated entities on the left +side of the tree as much as possible. + +The total number of running tasks in the runqueue is accounted through the +rq->cfs.load value, which is the sum of the weights of the tasks queued on the +runqueue. + +CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the +p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it. +As the system progresses forwards, the executed tasks are put into the tree +more and more to the right --- slowly but surely giving a chance for every task +to become the "leftmost task" and thus get on the CPU within a deterministic +amount of time. + +Summing up, CFS works like this: it runs a task a bit, and when the task +schedules (or a scheduler tick happens) the task's CPU usage is "accounted +for": the (small) time it just spent using the physical CPU is added to +p->se.vruntime. Once p->se.vruntime gets high enough so that another task +becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a +small amount of "granularity" distance relative to the leftmost task so that we +do not over-schedule tasks and trash the cache), then the new leftmost task is +picked and the current task is preempted. + + + +4. SOME FEATURES OF CFS +======================== + +CFS uses nanosecond granularity accounting and does not rely on any jiffies or +other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the +way the previous scheduler had, and has no heuristics whatsoever. There is +only one central tunable (you have to switch on CONFIG_SCHED_DEBUG): + + /sys/kernel/debug/sched/min_granularity_ns + +which can be used to tune the scheduler from "desktop" (i.e., low latencies) to +"server" (i.e., good batching) workloads. It defaults to a setting suitable +for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too. + +Due to its design, the CFS scheduler is not prone to any of the "attacks" that +exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c, +chew.c, ring-test.c, massive_intr.c all work fine and do not impact +interactivity and produce the expected behavior. + +The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH +than the previous vanilla scheduler: both types of workloads are isolated much +more aggressively. + +SMP load-balancing has been reworked/sanitized: the runqueue-walking +assumptions are gone from the load-balancing code now, and iterators of the +scheduling modules are used. The balancing code got quite a bit simpler as a +result. + + + +5. Scheduling policies +====================== + +CFS implements three scheduling policies: + + - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling + policy that is used for regular tasks. + + - SCHED_BATCH: Does not preempt nearly as often as regular tasks + would, thereby allowing tasks to run longer and make better use of + caches but at the cost of interactivity. This is well suited for + batch jobs. + + - SCHED_IDLE: This is even weaker than nice 19, but its not a true + idle timer scheduler in order to avoid to get into priority + inversion problems which would deadlock the machine. + +SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by +POSIX. + +The command chrt from util-linux-ng 2.13.1.1 can set all of these except +SCHED_IDLE. + + + +6. SCHEDULING CLASSES +====================== + +The new CFS scheduler has been designed in such a way to introduce "Scheduling +Classes," an extensible hierarchy of scheduler modules. These modules +encapsulate scheduling policy details and are handled by the scheduler core +without the core code assuming too much about them. + +sched/fair.c implements the CFS scheduler described above. + +sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than +the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT +priority levels, instead of 140 in the previous scheduler) and it needs no +expired array. + +Scheduling classes are implemented through the sched_class structure, which +contains hooks to functions that must be called whenever an interesting event +occurs. + +This is the (partial) list of the hooks: + + - enqueue_task(...) + + Called when a task enters a runnable state. + It puts the scheduling entity (task) into the red-black tree and + increments the nr_running variable. + + - dequeue_task(...) + + When a task is no longer runnable, this function is called to keep the + corresponding scheduling entity out of the red-black tree. It decrements + the nr_running variable. + + - yield_task(...) + + This function is basically just a dequeue followed by an enqueue, unless the + compat_yield sysctl is turned on; in that case, it places the scheduling + entity at the right-most end of the red-black tree. + + - check_preempt_curr(...) + + This function checks if a task that entered the runnable state should + preempt the currently running task. + + - pick_next_task(...) + + This function chooses the most appropriate task eligible to run next. + + - set_curr_task(...) + + This function is called when a task changes its scheduling class or changes + its task group. + + - task_tick(...) + + This function is mostly called from time tick functions; it might lead to + process switch. This drives the running preemption. + + + + +7. GROUP SCHEDULER EXTENSIONS TO CFS +===================================== + +Normally, the scheduler operates on individual tasks and strives to provide +fair CPU time to each task. Sometimes, it may be desirable to group tasks and +provide fair CPU time to each such task group. For example, it may be +desirable to first provide fair CPU time to each user on the system and then to +each task belonging to a user. + +CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks to be +grouped and divides CPU time fairly among such groups. + +CONFIG_RT_GROUP_SCHED permits to group real-time (i.e., SCHED_FIFO and +SCHED_RR) tasks. + +CONFIG_FAIR_GROUP_SCHED permits to group CFS (i.e., SCHED_NORMAL and +SCHED_BATCH) tasks. + + These options need CONFIG_CGROUPS to be defined, and let the administrator + create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See + Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about this filesystem. + +When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each +group created using the pseudo filesystem. See example steps below to create +task groups and modify their CPU share using the "cgroups" pseudo filesystem:: + + # mount -t tmpfs cgroup_root /sys/fs/cgroup + # mkdir /sys/fs/cgroup/cpu + # mount -t cgroup -ocpu none /sys/fs/cgroup/cpu + # cd /sys/fs/cgroup/cpu + + # mkdir multimedia # create "multimedia" group of tasks + # mkdir browser # create "browser" group of tasks + + # #Configure the multimedia group to receive twice the CPU bandwidth + # #that of browser group + + # echo 2048 > multimedia/cpu.shares + # echo 1024 > browser/cpu.shares + + # firefox & # Launch firefox and move it to "browser" group + # echo <firefox_pid> > browser/tasks + + # #Launch gmplayer (or your favourite movie player) + # echo <movie_player_pid> > multimedia/tasks |