diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/scheduler/sched-util-clamp.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/scheduler/sched-util-clamp.rst')
-rw-r--r-- | Documentation/scheduler/sched-util-clamp.rst | 741 |
1 files changed, 741 insertions, 0 deletions
diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst new file mode 100644 index 000000000..74d5b7c64 --- /dev/null +++ b/Documentation/scheduler/sched-util-clamp.rst @@ -0,0 +1,741 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Utilization Clamping +==================== + +1. Introduction +=============== + +Utilization clamping, also known as util clamp or uclamp, is a scheduler +feature that allows user space to help in managing the performance requirement +of tasks. It was introduced in v5.3 release. The CGroup support was merged in +v5.4. + +Uclamp is a hinting mechanism that allows the scheduler to understand the +performance requirements and restrictions of the tasks, thus it helps the +scheduler to make a better decision. And when schedutil cpufreq governor is +used, util clamp will influence the CPU frequency selection as well. + +Since the scheduler and schedutil are both driven by PELT (util_avg) signals, +util clamp acts on that to achieve its goal by clamping the signal to a certain +point; hence the name. That is, by clamping utilization we are making the +system run at a certain performance point. + +The right way to view util clamp is as a mechanism to make request or hint on +performance constraints. It consists of two tunables: + + * UCLAMP_MIN, which sets the lower bound. + * UCLAMP_MAX, which sets the upper bound. + +These two bounds will ensure a task will operate within this performance range +of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies +capping a task. + +One can tell the system (scheduler) that some tasks require a minimum +performance point to operate at to deliver the desired user experience. Or one +can tell the system that some tasks should be restricted from consuming too +much resources and should not go above a specific performance point. Viewing +the uclamp values as performance points rather than utilization is a better +abstraction from user space point of view. + +As an example, a game can use util clamp to form a feedback loop with its +perceived Frames Per Second (FPS). It can dynamically increase the minimum +performance point required by its display pipeline to ensure no frame is +dropped. It can also dynamically 'prime' up these tasks if it knows in the +coming few hundred milliseconds a computationally intensive scene is about to +happen. + +On mobile hardware where the capability of the devices varies a lot, this +dynamic feedback loop offers a great flexibility to ensure best user experience +given the capabilities of any system. + +Of course a static configuration is possible too. The exact usage will depend +on the system, application and the desired outcome. + +Another example is in Android where tasks are classified as background, +foreground, top-app, etc. Util clamp can be used to constrain how much +resources background tasks are consuming by capping the performance point they +can run at. This constraint helps reserve resources for important tasks, like +the ones belonging to the currently active app (top-app group). Beside this +helps in limiting how much power they consume. This can be more obvious in +heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the +background tasks to stay on the little cores which will ensure that: + + 1. The big cores are free to run top-app tasks immediately. top-app + tasks are the tasks the user is currently interacting with, hence + the most important tasks in the system. + 2. They don't run on a power hungry core and drain battery even if they + are CPU intensive tasks. + +.. note:: + **little cores**: + CPUs with capacity < 1024 + + **big cores**: + CPUs with capacity = 1024 + +By making these uclamp performance requests, or rather hints, user space can +ensure system resources are used optimally to deliver the best possible user +experience. + +Another use case is to help with **overcoming the ramp up latency inherit in +how scheduler utilization signal is calculated**. + +On the other hand, a busy task for instance that requires to run at maximum +performance point will suffer a delay of ~200ms (PELT HALFIFE = 32ms) for the +scheduler to realize that. This is known to affect workloads like gaming on +mobile devices where frames will drop due to slow response time to select the +higher frequency required for the tasks to finish their work in time. Setting +UCLAMP_MIN=1024 will ensure such tasks will always see the highest performance +level when they start running. + +The overall visible effect goes beyond better perceived user +experience/performance and stretches to help achieve a better overall +performance/watt if used effectively. + +User space can form a feedback loop with the thermal subsystem too to ensure +the device doesn't heat up to the point where it will throttle. + +Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints. + +In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any +performance point rather than being tied to MAX frequency all the time. Which +can be useful on general purpose systems that run on battery powered devices. + +Note that by design RT tasks don't have per-task PELT signal and must always +run at a constant frequency to combat undeterministic DVFS rampup delays. + +Note that using schedutil always implies a single delay to modify the frequency +when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only +helps picking what frequency to request instead of schedutil always requesting +MAX for all RT tasks. + +See :ref:`section 3.4 <uclamp-default-values>` for default values and +:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change RT tasks +default value. + +2. Design +========= + +Util clamp is a property of every task in the system. It sets the boundaries of +its utilization signal; acting as a bias mechanism that influences certain +decisions within the scheduler. + +The actual utilization signal of a task is never clamped in reality. If you +inspect PELT signals at any point of time you should continue to see them as +they are intact. Clamping happens only when needed, e.g: when a task wakes up +and the scheduler needs to select a suitable CPU for it to run on. + +Since the goal of util clamp is to allow requesting a minimum and maximum +performance point for a task to run on, it must be able to influence the +frequency selection as well as task placement to be most effective. Both of +which have implications on the utilization value at CPU runqueue (rq for short) +level, which brings us to the main design challenge. + +When a task wakes up on an rq, the utilization signal of the rq will be +affected by the uclamp settings of all the tasks enqueued on it. For example if +a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs +to respect to this request as well as all other requests from all of the +enqueued tasks. + +To be able to aggregate the util clamp value of all the tasks attached to the +rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the +scheduler hot path. Hence care must be taken since any slow down will have +significant impact on a lot of use cases and could hinder its usability in +practice. + +The way this is handled is by dividing the utilization range into buckets +(struct uclamp_bucket) which allows us to reduce the search space from every +task on the rq to only a subset of tasks on the top-most bucket. + +When a task is enqueued, the counter in the matching bucket is incremented, +and on dequeue it is decremented. This makes keeping track of the effective +uclamp value at rq level a lot easier. + +As tasks are enqueued and dequeued, we keep track of the current effective +uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on +how this works. + +Later at any path that wants to identify the effective uclamp value of the rq, +it will simply need to read this effective uclamp value of the rq at that exact +moment of time it needs to take a decision. + +For task placement case, only Energy Aware and Capacity Aware Scheduling +(EAS/CAS) make use of uclamp for now, which implies that it is applied on +heterogeneous systems only. +When a task wakes up, the scheduler will look at the current effective uclamp +value of every rq and compare it with the potential new value if the task were +to be enqueued there. Favoring the rq that will end up with the most energy +efficient combination. + +Similarly in schedutil, when it needs to make a frequency update it will look +at the current effective uclamp value of the rq which is influenced by the set +of tasks currently enqueued there and select the appropriate frequency that +will satisfy constraints from requests. + +Other paths like setting overutilization state (which effectively disables EAS) +make use of uclamp as well. Such cases are considered necessary housekeeping to +allow the 2 main use cases above and will not be covered in detail here as they +could change with implementation details. + +.. _uclamp-buckets: + +2.1. Buckets +------------ + +:: + + [struct rq] + + (bottom) (top) + + 0 1024 + | | + +-----------+-----------+-----------+---- ----+-----------+ + | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N | + +-----------+-----------+-----------+---- ----+-----------+ + : : : + +- p0 +- p3 +- p4 + : : + +- p1 +- p5 + : + +- p2 + + +.. note:: + The diagram above is an illustration rather than a true depiction of the + internal data structure. + +To reduce the search space when trying to decide the effective uclamp value of +an rq as tasks are enqueued/dequeued, the whole utilization range is divided +into N buckets where N is configured at compile time by setting +CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5. + +The rq has a bucket for each uclamp_id tunables: [UCLAMP_MIN, UCLAMP_MAX]. + +The range of each bucket is 1024/N. For example, for the default value of +5 there will be 5 buckets, each of which will cover the following range: + +:: + + DELTA = round_closest(1024/5) = 204.8 = 205 + + Bucket 0: [0:204] + Bucket 1: [205:409] + Bucket 2: [410:614] + Bucket 3: [615:819] + Bucket 4: [820:1024] + +When a task p with following tunable parameters + +:: + + p->uclamp[UCLAMP_MIN] = 300 + p->uclamp[UCLAMP_MAX] = 1024 + +is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket +4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in +this range. + +The rq then keeps track of its current effective uclamp value for each +uclamp_id. + +When a task p is enqueued, the rq value changes to: + +:: + + // update bucket logic goes here + rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN]) + // repeat for UCLAMP_MAX + +Similarly, when p is dequeued the rq value changes to: + +:: + + // update bucket logic goes here + rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value() + // repeat for UCLAMP_MAX + +When all buckets are empty, the rq uclamp values are reset to system defaults. +See :ref:`section 3.4 <uclamp-default-values>` for details on default values. + + +2.2. Max aggregation +-------------------- + +Util clamp is tuned to honour the request for the task that requires the +highest performance point. + +When multiple tasks are attached to the same rq, then util clamp must make sure +the task that needs the highest performance point gets it even if there's +another task that doesn't need it or is disallowed from reaching this point. + +For example, if there are multiple tasks attached to an rq with the following +values: + +:: + + p0->uclamp[UCLAMP_MIN] = 300 + p0->uclamp[UCLAMP_MAX] = 900 + + p1->uclamp[UCLAMP_MIN] = 500 + p1->uclamp[UCLAMP_MAX] = 500 + +then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN +and UCLAMP_MAX become: + +:: + + rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500 + rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900 + +As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max +aggregation is the cause of one of limitations when using util clamp, in +particular for UCLAMP_MAX hint when user space would like to save power. + +2.3. Hierarchical aggregation +----------------------------- + +As stated earlier, util clamp is a property of every task in the system. But +the actual applied (effective) value can be influenced by more than just the +request made by the task or another actor on its behalf (middleware library). + +The effective util clamp value of any task is restricted as follows: + + 1. By the uclamp settings defined by the cgroup CPU controller it is attached + to, if any. + 2. The restricted value in (1) is then further restricted by the system wide + uclamp settings. + +:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand +further on that. + +For now suffice to say that if a task makes a request, its actual effective +value will have to adhere to some restrictions imposed by cgroup and system +wide settings. + +The system will still accept the request even if effectively will be beyond the +constraints, but as soon as the task moves to a different cgroup or a sysadmin +modifies the system settings, the request will be satisfied only if it is +within new constraints. + +In other words, this aggregation will not cause an error when a task changes +its uclamp values, but rather the system may not be able to satisfy requests +based on those factors. + +2.4. Range +---------- + +Uclamp performance request has the range of 0 to 1024 inclusive. + +For cgroup interface percentage is used (that is 0 to 100 inclusive). +Just like other cgroup interfaces, you can use 'max' instead of 100. + +.. _uclamp-interfaces: + +3. Interfaces +============= + +3.1. Per task interface +----------------------- + +sched_setattr() syscall was extended to accept two new fields: + +* sched_util_min: requests the minimum performance point the system should run + at when this task is running. Or lower performance bound. +* sched_util_max: requests the maximum performance point the system should run + at when this task is running. Or upper performance bound. + +For example, the following scenario have 40% to 80% utilization constraints: + +:: + + attr->sched_util_min = 40% * 1024; + attr->sched_util_max = 80% * 1024; + +When task @p is running, **the scheduler should try its best to ensure it +starts at 40% performance level**. If the task runs for a long enough time so +that its actual utilization goes above 80%, the utilization, or performance +level, will be capped. + +The special value -1 is used to reset the uclamp settings to the system +default. + +Note that resetting the uclamp value to system default using -1 is not the same +as manually setting uclamp value to system default. This distinction is +important because as we shall see in system interfaces, the default value for +RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the +future. + +3.2. cgroup interface +--------------------- + +There are two uclamp related values in the CPU cgroup controller: + +* cpu.uclamp.min +* cpu.uclamp.max + +When a task is attached to a CPU controller, its uclamp values will be impacted +as follows: + +* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup + v2 documentation <cgroupv2-protections-distributor>`. + + If a task uclamp_min value is lower than cpu.uclamp.min, then the task will + inherit the cgroup cpu.uclamp.min value. + + In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child, + parent). + +* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2 + documentation <cgroupv2-limits-distributor>`. + + If a task uclamp_max value is higher than cpu.uclamp.max, then the task will + inherit the cgroup cpu.uclamp.max value. + + In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child, + parent). + +For example, given following parameters: + +:: + + p0->uclamp[UCLAMP_MIN] = // system default; + p0->uclamp[UCLAMP_MAX] = // system default; + + p1->uclamp[UCLAMP_MIN] = 40% * 1024; + p1->uclamp[UCLAMP_MAX] = 50% * 1024; + + cgroup0->cpu.uclamp.min = 20% * 1024; + cgroup0->cpu.uclamp.max = 60% * 1024; + + cgroup1->cpu.uclamp.min = 60% * 1024; + cgroup1->cpu.uclamp.max = 100% * 1024; + +when p0 and p1 are attached to cgroup0, the values become: + +:: + + p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024; + p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024; + + p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact + p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact + +when p0 and p1 are attached to cgroup1, these instead become: + +:: + + p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024; + p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024; + + p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024; + p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact + +Note that cgroup interfaces allows cpu.uclamp.max value to be lower than +cpu.uclamp.min. Other interfaces don't allow that. + +3.3. System interface +--------------------- + +3.3.1 sched_util_clamp_min +-------------------------- + +System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024, +which means that permitted effective UCLAMP_MIN range for tasks is [0:1024]. +By changing it to 512 for example the range reduces to [0:512]. This is useful +to restrict how much boosting tasks are allowed to acquire. + +Requests from tasks to go above this knob value will still succeed, but +they won't be satisfied until it is more than p->uclamp[UCLAMP_MIN]. + +The value must be smaller than or equal to sched_util_clamp_max. + +3.3.2 sched_util_clamp_max +-------------------------- + +System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024, +which means that permitted effective UCLAMP_MAX range for tasks is [0:1024]. + +By changing it to 512 for example the effective allowed range reduces to +[0:512]. This means is that no task can run above 512, which implies that all +rqs are restricted too. IOW, the whole system is capped to half its performance +capacity. + +This is useful to restrict the overall maximum performance point of the system. +For example, it can be handy to limit performance when running low on battery +or when the system wants to limit access to more energy hungry performance +levels when it's in idle state or screen is off. + +Requests from tasks to go above this knob value will still succeed, but they +won't be satisfied until it is more than p->uclamp[UCLAMP_MAX]. + +The value must be greater than or equal to sched_util_clamp_min. + +.. _uclamp-default-values: + +3.4. Default values +------------------- + +By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to: + +:: + + p_fair->uclamp[UCLAMP_MIN] = 0 + p_fair->uclamp[UCLAMP_MAX] = 1024 + +That is, by default they're boosted to run at the maximum performance point of +changed at boot or runtime. No argument was made yet as to why we should +provide this, but can be added in the future. + +For SCHED_FIFO/SCHED_RR tasks: + +:: + + p_rt->uclamp[UCLAMP_MIN] = 1024 + p_rt->uclamp[UCLAMP_MAX] = 1024 + +That is by default they're boosted to run at the maximum performance point of +the system which retains the historical behavior of the RT tasks. + +RT tasks default uclamp_min value can be modified at boot or runtime via +sysctl. See below section. + +.. _sched-util-clamp-min-rt-default: + +3.4.1 sched_util_clamp_min_rt_default +------------------------------------- + +Running RT tasks at maximum performance point is expensive on battery powered +devices and not necessary. To allow system developer to offer good performance +guarantees for these tasks without pushing it all the way to maximum +performance point, this sysctl knob allows tuning the best boost value to +address the system requirement without burning power running at maximum +performance point all the time. + +Application developer are encouraged to use the per task util clamp interface +to ensure they are performance and power aware. Ideally this knob should be set +to 0 by system designers and leave the task of managing performance +requirements to the apps. + +4. How to use util clamp +======================== + +Util clamp promotes the concept of user space assisted power and performance +management. At the scheduler level there is no info required to make the best +decision. However, with util clamp user space can hint to the scheduler to make +better decision about task placement and frequency selection. + +Best results are achieved by not making any assumptions about the system the +application is running on and to use it in conjunction with a feedback loop to +dynamically monitor and adjust. Ultimately this will allow for a better user +experience at a better perf/watt. + +For some systems and use cases, static setup will help to achieve good results. +Portability will be a problem in this case. How much work one can do at 100, +200 or 1024 is different for each system. Unless there's a specific target +system, static setup should be avoided. + +There are enough possibilities to create a whole framework based on util clamp +or self contained app that makes use of it directly. + +4.1. Boost important and DVFS-latency-sensitive tasks +----------------------------------------------------- + +A GUI task might not be busy to warrant driving the frequency high when it +wakes up. However, it requires to finish its work within a specific time window +to deliver the desired user experience. The right frequency it requires at +wakeup will be system dependent. On some underpowered systems it will be high, +on other overpowered ones it will be low or 0. + +This task can increase its UCLAMP_MIN value every time it misses the deadline +to ensure on next wake up it runs at a higher performance point. It should try +to approach the lowest UCLAMP_MIN value that allows to meet its deadline on any +particular system to achieve the best possible perf/watt for that system. + +On heterogeneous systems, it might be important for this task to run on +a faster CPU. + +**Generally it is advised to perceive the input as performance level or point +which will imply both task placement and frequency selection**. + +4.2. Cap background tasks +------------------------- + +Like explained for Android case in the introduction. Any app can lower +UCLAMP_MAX for some background tasks that don't care about performance but +could end up being busy and consume unnecessary system resources on the system. + +4.3. Powersave mode +------------------- + +sched_util_clamp_max system wide interface can be used to limit all tasks from +operating at the higher performance points which are usually energy +inefficient. + +This is not unique to uclamp as one can achieve the same by reducing max +frequency of the cpufreq governor. It can be considered a more convenient +alternative interface. + +4.4. Per-app performance restriction +------------------------------------ + +Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an +app every time it is executed to guarantee a minimum performance point and/or +limit it from draining system power at the cost of reduced performance for +these apps. + +If you want to prevent your laptop from heating up while on the go from +compiling the kernel and happy to sacrifice performance to save power, but +still would like to keep your browser performance intact, uclamp makes it +possible. + +5. Limitations +============== + +.. _uclamp-capping-fail: + +5.1. Capping frequency with uclamp_max fails under certain conditions +--------------------------------------------------------------------- + +If task p0 is capped to run at 512: + +:: + + p0->uclamp[UCLAMP_MAX] = 512 + +and it shares the rq with p1 which is free to run at any performance point: + +:: + + p1->uclamp[UCLAMP_MAX] = 1024 + +then due to max aggregation the rq will be allowed to reach max performance +point: + +:: + + rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024 + +Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for +the rq will depend on the actual utilization value of the tasks. + +If p1 is a small task but p0 is a CPU intensive task, then due to the fact that +both are running at the same rq, p1 will cause the frequency capping to be left +from the rq although p1, which is allowed to run at any performance point, +doesn't actually need to run at that frequency. + +5.2. UCLAMP_MAX can break PELT (util_avg) signal +------------------------------------------------ + +PELT assumes that frequency will always increase as the signals grow to ensure +there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency +increase will be prevented which can lead to no idle time in some +circumstances. When there's no idle time, a task will stuck in a busy loop, +which would result in util_avg being 1024. + +Combing with issue described below, this can lead to unwanted frequency spikes +when severely capped tasks share the rq with a small non capped task. + +As an example if task p, which have: + +:: + + p0->util_avg = 300 + p0->uclamp[UCLAMP_MAX] = 0 + +wakes up on an idle CPU, then it will run at min frequency (Fmin) this +CPU is capable of. The max CPU frequency (Fmax) matters here as well, +since it designates the shortest computational time to finish the task's +work on this CPU. + +:: + + rq->uclamp[UCLAMP_MAX] = 0 + +If the ratio of Fmax/Fmin is 3, then maximum value will be: + +:: + + 300 * (Fmax/Fmin) = 900 + +which indicates the CPU will still see idle time since 900 is < 1024. The +_actual_ util_avg will not be 900 though, but somewhere between 300 and 900. As +long as there's idle time, p->util_avg updates will be off by a some margin, +but not proportional to Fmax/Fmin. + +:: + + p0->util_avg = 300 + small_error + +Now if the ratio of Fmax/Fmin is 4, the maximum value becomes: + +:: + + 300 * (Fmax/Fmin) = 1200 + +which is higher than 1024 and indicates that the CPU has no idle time. When +this happens, then the _actual_ util_avg will become: + +:: + + p0->util_avg = 1024 + +If task p1 wakes up on this CPU, which have: + +:: + + p1->util_avg = 200 + p1->uclamp[UCLAMP_MAX] = 1024 + +then the effective UCLAMP_MAX for the CPU will be 1024 according to max +aggregation rule. But since the capped p0 task was running and throttled +severely, then the rq->util_avg will be: + +:: + + p0->util_avg = 1024 + p1->util_avg = 200 + + rq->util_avg = 1024 + rq->uclamp[UCLAMP_MAX] = 1024 + +Hence lead to a frequency spike since if p0 wasn't throttled we should get: + +:: + + p0->util_avg = 300 + p1->util_avg = 200 + + rq->util_avg = 500 + +and run somewhere near mid performance point of that CPU, not the Fmax we get. + +5.3. Schedutil response time issues +----------------------------------- + +schedutil has three limitations: + + 1. Hardware takes non-zero time to respond to any frequency change + request. On some platforms can be in the order of few ms. + 2. Non fast-switch systems require a worker deadline thread to wake up + and perform the frequency change, which adds measurable overhead. + 3. schedutil rate_limit_us drops any requests during this rate_limit_us + window. + +If a relatively small task is doing critical job and requires a certain +performance point when it wakes up and starts running, then all these +limitations will prevent it from getting what it wants in the time scale it +expects. + +This limitation is not only impactful when using uclamp, but will be more +prevalent as we no longer gradually ramp up or down. We could easily be +jumping between frequencies depending on the order tasks wake up, and their +respective uclamp values. + +We regard that as a limitation of the capabilities of the underlying system +itself. + +There is room to improve the behavior of schedutil rate_limit_us, but not much +to be done for 1 or 2. They are considered hard limitations of the system. |