diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/ia64/fsys.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/ia64/fsys.rst')
-rw-r--r-- | Documentation/ia64/fsys.rst | 303 |
1 files changed, 303 insertions, 0 deletions
diff --git a/Documentation/ia64/fsys.rst b/Documentation/ia64/fsys.rst new file mode 100644 index 000000000..a702d2cc9 --- /dev/null +++ b/Documentation/ia64/fsys.rst @@ -0,0 +1,303 @@ +=================================== +Light-weight System Calls for IA-64 +=================================== + + Started: 13-Jan-2003 + + Last update: 27-Sep-2003 + + David Mosberger-Tang + <davidm@hpl.hp.com> + +Using the "epc" instruction effectively introduces a new mode of +execution to the ia64 linux kernel. We call this mode the +"fsys-mode". To recap, the normal states of execution are: + + - kernel mode: + Both the register stack and the memory stack have been + switched over to kernel memory. The user-level state is saved + in a pt-regs structure at the top of the kernel memory stack. + + - user mode: + Both the register stack and the kernel stack are in + user memory. The user-level state is contained in the + CPU registers. + + - bank 0 interruption-handling mode: + This is the non-interruptible state which all + interruption-handlers start execution in. The user-level + state remains in the CPU registers and some kernel state may + be stored in bank 0 of registers r16-r31. + +In contrast, fsys-mode has the following special properties: + + - execution is at privilege level 0 (most-privileged) + + - CPU registers may contain a mixture of user-level and kernel-level + state (it is the responsibility of the kernel to ensure that no + security-sensitive kernel-level state is leaked back to + user-level) + + - execution is interruptible and preemptible (an fsys-mode handler + can disable interrupts and avoid all other interruption-sources + to avoid preemption) + + - neither the memory-stack nor the register-stack can be trusted while + in fsys-mode (they point to the user-level stacks, which may + be invalid, or completely bogus addresses) + +In summary, fsys-mode is much more similar to running in user-mode +than it is to running in kernel-mode. Of course, given that the +privilege level is at level 0, this means that fsys-mode requires some +care (see below). + + +How to tell fsys-mode +===================== + +Linux operates in fsys-mode when (a) the privilege level is 0 (most +privileged) and (b) the stacks have NOT been switched to kernel memory +yet. For convenience, the header file <asm-ia64/ptrace.h> provides +three macros:: + + user_mode(regs) + user_stack(task,regs) + fsys_mode(task,regs) + +The "regs" argument is a pointer to a pt_regs structure. The "task" +argument is a pointer to the task structure to which the "regs" +pointer belongs to. user_mode() returns TRUE if the CPU state pointed +to by "regs" was executing in user mode (privilege level 3). +user_stack() returns TRUE if the state pointed to by "regs" was +executing on the user-level stack(s). Finally, fsys_mode() returns +TRUE if the CPU state pointed to by "regs" was executing in fsys-mode. +The fsys_mode() macro is equivalent to the expression:: + + !user_mode(regs) && user_stack(task,regs) + +How to write an fsyscall handler +================================ + +The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers +(fsyscall_table). This table contains one entry for each system call. +By default, a system call is handled by fsys_fallback_syscall(). This +routine takes care of entering (full) kernel mode and calling the +normal Linux system call handler. For performance-critical system +calls, it is possible to write a hand-tuned fsyscall_handler. For +example, fsys.S contains fsys_getpid(), which is a hand-tuned version +of the getpid() system call. + +The entry and exit-state of an fsyscall handler is as follows: + +Machine state on entry to fsyscall handler +------------------------------------------ + + ========= =============================================================== + r10 0 + r11 saved ar.pfs (a user-level value) + r15 system call number + r16 "current" task pointer (in normal kernel-mode, this is in r13) + r32-r39 system call arguments + b6 return address (a user-level value) + ar.pfs previous frame-state (a user-level value) + PSR.be cleared to zero (i.e., little-endian byte order is in effect) + - all other registers may contain values passed in from user-mode + ========= =============================================================== + +Required machine state on exit to fsyscall handler +-------------------------------------------------- + + ========= =========================================================== + r11 saved ar.pfs (as passed into the fsyscall handler) + r15 system call number (as passed into the fsyscall handler) + r32-r39 system call arguments (as passed into the fsyscall handler) + b6 return address (as passed into the fsyscall handler) + ar.pfs previous frame-state (as passed into the fsyscall handler) + ========= =========================================================== + +Fsyscall handlers can execute with very little overhead, but with that +speed comes a set of restrictions: + + * Fsyscall-handlers MUST check for any pending work in the flags + member of the thread-info structure and if any of the + TIF_ALLWORK_MASK flags are set, the handler needs to fall back on + doing a full system call (by calling fsys_fallback_syscall). + + * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11, + r15, b6, and ar.pfs) because they will be needed in case of a + system call restart. Of course, all "preserved" registers also + must be preserved, in accordance to the normal calling conventions. + + * Fsyscall-handlers MUST check argument registers for containing a + NaT value before using them in any way that could trigger a + NaT-consumption fault. If a system call argument is found to + contain a NaT value, an fsyscall-handler may return immediately + with r8=EINVAL, r10=-1. + + * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform + any other operation that would trigger mandatory RSE + (register-stack engine) traffic. + + * Fsyscall-handlers MUST NOT write to any stacked registers because + it is not safe to assume that user-level called a handler with the + proper number of arguments. + + * Fsyscall-handlers need to be careful when accessing per-CPU variables: + unless proper safe-guards are taken (e.g., interruptions are avoided), + execution may be pre-empted and resumed on another CPU at any given + time. + + * Fsyscall-handlers must be careful not to leak sensitive kernel' + information back to user-level. In particular, before returning to + user-level, care needs to be taken to clear any scratch registers + that could contain sensitive information (note that the current + task pointer is not considered sensitive: it's already exposed + through ar.k6). + + * Fsyscall-handlers MUST NOT access user-memory without first + validating access-permission (this can be done typically via + probe.r.fault and/or probe.w.fault) and without guarding against + memory access exceptions (this can be done with the EX() macros + defined by asmmacro.h). + +The above restrictions may seem draconian, but remember that it's +possible to trade off some of the restrictions by paying a slightly +higher overhead. For example, if an fsyscall-handler could benefit +from the shadow register bank, it could temporarily disable PSR.i and +PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as +needed. In other words, following the above rules yields extremely +fast system call execution (while fully preserving system call +semantics), but there is also a lot of flexibility in handling more +complicated cases. + +Signal handling +=============== + +The delivery of (asynchronous) signals must be delayed until fsys-mode +is exited. This is accomplished with the help of the lower-privilege +transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user() +checks whether the interrupted task was in fsys-mode and, if so, sets +PSR.lp and returns immediately. When fsys-mode is exited via the +"br.ret" instruction that lowers the privilege level, a trap will +occur. The trap handler clears PSR.lp again and returns immediately. +The kernel exit path then checks for and delivers any pending signals. + +PSR Handling +============ + +The "epc" instruction doesn't change the contents of PSR at all. This +is in contrast to a regular interruption, which clears almost all +bits. Because of that, some care needs to be taken to ensure things +work as expected. The following discussion describes how each PSR bit +is handled. + +======= ======================================================================= +PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used + to ensure the CPU is in little-endian mode before the first + load/store instruction is executed. PSR.be is normally NOT + restored upon return from an fsys-mode handler. In other + words, user-level code must not rely on PSR.be being preserved + across a system call. +PSR.up Unchanged. +PSR.ac Unchanged. +PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed. +PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed. +PSR.pk Unchanged. +PSR.dt Unchanged. +PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.sp Unchanged. +PSR.pp Unchanged. +PSR.di Unchanged. +PSR.si Unchanged. +PSR.db Unchanged. The kernel prevents user-level from setting a hardware + breakpoint that triggers at any privilege level other than + 3 (user-mode). +PSR.lp Unchanged. +PSR.tb Lazy redirect. If a taken-branch trap occurs while in + fsys-mode, the trap-handler modifies the saved machine state + such that execution resumes in the gate page at + syscall_via_break(), with privilege level 3. Note: the + taken branch would occur on the branch invoking the + fsyscall-handler, at which point, by definition, a syscall + restart is still safe. If the system call number is invalid, + the fsys-mode handler will return directly to user-level. This + return will trigger a taken-branch trap, but since the trap is + taken _after_ restoring the privilege level, the CPU has already + left fsys-mode, so no special treatment is needed. +PSR.rt Unchanged. +PSR.cpl Cleared to 0. +PSR.is Unchanged (guaranteed to be 0 on entry to the gate page). +PSR.mc Unchanged. +PSR.it Unchanged (guaranteed to be 1). +PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit. +PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit. +PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit. +PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to + be taken. The trap handler then modifies the saved machine + state such that execution resumes in the gate page at + syscall_via_break(), with privilege level 3. +PSR.ri Unchanged. +PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode + handler performed a speculative load that gets NaTted. If so, this + would be the normal & expected behavior, so no special treatment is + needed. +PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed. + Doing so requires clearing PSR.i and PSR.ic as well. +PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit. +======= ======================================================================= + +Using fast system calls +======================= + +To use fast system calls, userspace applications need simply call +__kernel_syscall_via_epc(). For example + +-- example fgettimeofday() call -- + +-- fgettimeofday.S -- + +:: + + #include <asm/asmmacro.h> + + GLOBAL_ENTRY(fgettimeofday) + .prologue + .save ar.pfs, r11 + mov r11 = ar.pfs + .body + + mov r2 = 0xa000000000020660;; // gate address + // found by inspection of System.map for the + // __kernel_syscall_via_epc() function. See + // below for how to do this for real. + + mov b7 = r2 + mov r15 = 1087 // gettimeofday syscall + ;; + br.call.sptk.many b6 = b7 + ;; + + .restore sp + + mov ar.pfs = r11 + br.ret.sptk.many rp;; // return to caller + END(fgettimeofday) + +-- end fgettimeofday.S -- + +In reality, getting the gate address is accomplished by two extra +values passed via the ELF auxiliary vector (include/asm-ia64/elf.h) + + * AT_SYSINFO : is the address of __kernel_syscall_via_epc() + * AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO + +The ELF DSO is a pre-linked library that is mapped in by the kernel at +the gate page. It is a proper ELF shared object so, with a dynamic +loader that recognises the library, you should be able to make calls to +the exported functions within it as with any other shared library. +AT_SYSINFO points into the kernel DSO at the +__kernel_syscall_via_epc() function for historical reasons (it was +used before the kernel DSO) and as a convenience. |