diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/trace/ftrace-design.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/trace/ftrace-design.rst')
-rw-r--r-- | Documentation/trace/ftrace-design.rst | 411 |
1 files changed, 411 insertions, 0 deletions
diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst new file mode 100644 index 000000000..689339915 --- /dev/null +++ b/Documentation/trace/ftrace-design.rst @@ -0,0 +1,411 @@ +====================== +Function Tracer Design +====================== + +:Author: Mike Frysinger + +.. caution:: + This document is out of date. Some of the description below doesn't + match current implementation now. + +Introduction +------------ + +Here we will cover the architecture pieces that the common function tracing +code relies on for proper functioning. Things are broken down into increasing +complexity so that you can start simple and at least get basic functionality. + +Note that this focuses on architecture implementation details only. If you +want more explanation of a feature in terms of common code, review the common +ftrace.txt file. + +Ideally, everyone who wishes to retain performance while supporting tracing in +their kernel should make it all the way to dynamic ftrace support. + + +Prerequisites +------------- + +Ftrace relies on these features being implemented: + - STACKTRACE_SUPPORT - implement save_stack_trace() + - TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h + + +HAVE_FUNCTION_TRACER +-------------------- + +You will need to implement the mcount and the ftrace_stub functions. + +The exact mcount symbol name will depend on your toolchain. Some call it +"mcount", "_mcount", or even "__mcount". You can probably figure it out by +running something like:: + + $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount + call mcount + +We'll make the assumption below that the symbol is "mcount" just to keep things +nice and simple in the examples. + +Keep in mind that the ABI that is in effect inside of the mcount function is +*highly* architecture/toolchain specific. We cannot help you in this regard, +sorry. Dig up some old documentation and/or find someone more familiar than +you to bang ideas off of. Typically, register usage (argument/scratch/etc...) +is a major issue at this point, especially in relation to the location of the +mcount call (before/after function prologue). You might also want to look at +how glibc has implemented the mcount function for your architecture. It might +be (semi-)relevant. + +The mcount function should check the function pointer ftrace_trace_function +to see if it is set to ftrace_stub. If it is, there is nothing for you to do, +so return immediately. If it isn't, then call that function in the same way +the mcount function normally calls __mcount_internal -- the first argument is +the "frompc" while the second argument is the "selfpc" (adjusted to remove the +size of the mcount call that is embedded in the function). + +For example, if the function foo() calls bar(), when the bar() function calls +mcount(), the arguments mcount() will pass to the tracer are: + + - "frompc" - the address bar() will use to return to foo() + - "selfpc" - the address bar() (with mcount() size adjustment) + +Also keep in mind that this mcount function will be called *a lot*, so +optimizing for the default case of no tracer will help the smooth running of +your system when tracing is disabled. So the start of the mcount function is +typically the bare minimum with checking things before returning. That also +means the code flow should usually be kept linear (i.e. no branching in the nop +case). This is of course an optimization and not a hard requirement. + +Here is some pseudo code that should help (these functions should actually be +implemented in assembly):: + + void ftrace_stub(void) + { + return; + } + + void mcount(void) + { + /* save any bare state needed in order to do initial checking */ + + extern void (*ftrace_trace_function)(unsigned long, unsigned long); + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + + /* restore any bare state */ + + return; + + do_trace: + + /* save all state needed by the ABI (see paragraph above) */ + + unsigned long frompc = ...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + ftrace_trace_function(frompc, selfpc); + + /* restore all state needed by the ABI */ + } + +Don't forget to export mcount for modules ! +:: + + extern void mcount(void); + EXPORT_SYMBOL(mcount); + + +HAVE_FUNCTION_GRAPH_TRACER +-------------------------- + +Deep breath ... time to do some real work. Here you will need to update the +mcount function to check ftrace graph function pointers, as well as implement +some functions to save (hijack) and restore the return address. + +The mcount function should check the function pointers ftrace_graph_return +(compare to ftrace_stub) and ftrace_graph_entry (compare to +ftrace_graph_entry_stub). If either of those is not set to the relevant stub +function, call the arch-specific function ftrace_graph_caller which in turn +calls the arch-specific function prepare_ftrace_return. Neither of these +function names is strictly required, but you should use them anyway to stay +consistent across the architecture ports -- easier to compare & contrast +things. + +The arguments to prepare_ftrace_return are slightly different than what are +passed to ftrace_trace_function. The second argument "selfpc" is the same, +but the first argument should be a pointer to the "frompc". Typically this is +located on the stack. This allows the function to hijack the return address +temporarily to have it point to the arch-specific function return_to_handler. +That function will simply call the common ftrace_return_to_handler function and +that will return the original return address with which you can return to the +original call site. + +Here is the updated mcount pseudo code:: + + void mcount(void) + { + ... + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + + +#ifdef CONFIG_FUNCTION_GRAPH_TRACER + + extern void (*ftrace_graph_return)(...); + + extern void (*ftrace_graph_entry)(...); + + if (ftrace_graph_return != ftrace_stub || + + ftrace_graph_entry != ftrace_graph_entry_stub) + + ftrace_graph_caller(); + +#endif + + /* restore any bare state */ + ... + +Here is the pseudo code for the new ftrace_graph_caller assembly function:: + + #ifdef CONFIG_FUNCTION_GRAPH_TRACER + void ftrace_graph_caller(void) + { + /* save all state needed by the ABI */ + + unsigned long *frompc = &...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + /* passing frame pointer up is optional -- see below */ + prepare_ftrace_return(frompc, selfpc, frame_pointer); + + /* restore all state needed by the ABI */ + } + #endif + +For information on how to implement prepare_ftrace_return(), simply look at the +x86 version (the frame pointer passing is optional; see the next section for +more information). The only architecture-specific piece in it is the setup of +the fault recovery table (the asm(...) code). The rest should be the same +across architectures. + +Here is the pseudo code for the new return_to_handler assembly function. Note +that the ABI that applies here is different from what applies to the mcount +code. Since you are returning from a function (after the epilogue), you might +be able to skimp on things saved/restored (usually just registers used to pass +return values). +:: + + #ifdef CONFIG_FUNCTION_GRAPH_TRACER + void return_to_handler(void) + { + /* save all state needed by the ABI (see paragraph above) */ + + void (*original_return_point)(void) = ftrace_return_to_handler(); + + /* restore all state needed by the ABI */ + + /* this is usually either a return or a jump */ + original_return_point(); + } + #endif + + +HAVE_FUNCTION_GRAPH_FP_TEST +--------------------------- + +An arch may pass in a unique value (frame pointer) to both the entering and +exiting of a function. On exit, the value is compared and if it does not +match, then it will panic the kernel. This is largely a sanity check for bad +code generation with gcc. If gcc for your port sanely updates the frame +pointer under different optimization levels, then ignore this option. + +However, adding support for it isn't terribly difficult. In your assembly code +that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument. +Then in the C version of that function, do what the x86 port does and pass it +along to ftrace_push_return_trace() instead of a stub value of 0. + +Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer. + +HAVE_FUNCTION_GRAPH_RET_ADDR_PTR +-------------------------------- + +An arch may pass in a pointer to the return address on the stack. This +prevents potential stack unwinding issues where the unwinder gets out of +sync with ret_stack and the wrong addresses are reported by +ftrace_graph_ret_addr(). + +Adding support for it is easy: just define the macro in asm/ftrace.h and +pass the return address pointer as the 'retp' argument to +ftrace_push_return_trace(). + +HAVE_SYSCALL_TRACEPOINTS +------------------------ + +You need very few things to get the syscalls tracing in an arch. + + - Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig). + - Have a NR_syscalls variable in <asm/unistd.h> that provides the number + of syscalls supported by the arch. + - Support the TIF_SYSCALL_TRACEPOINT thread flags. + - Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace + in the ptrace syscalls tracing path. + - If the system call table on this arch is more complicated than a simple array + of addresses of the system calls, implement an arch_syscall_addr to return + the address of a given system call. + - If the symbol names of the system calls do not match the function names on + this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and + implement arch_syscall_match_sym_name with the appropriate logic to return + true if the function name corresponds with the symbol name. + - Tag this arch as HAVE_SYSCALL_TRACEPOINTS. + + +HAVE_FTRACE_MCOUNT_RECORD +------------------------- + +See scripts/recordmcount.pl for more info. Just fill in the arch-specific +details for how to locate the addresses of mcount call sites via objdump. +This option doesn't make much sense without also implementing dynamic ftrace. + + +HAVE_DYNAMIC_FTRACE +------------------- + +You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so +scroll your reader back up if you got over eager. + +Once those are out of the way, you will need to implement: + - asm/ftrace.h: + - MCOUNT_ADDR + - ftrace_call_adjust() + - struct dyn_arch_ftrace{} + - asm code: + - mcount() (new stub) + - ftrace_caller() + - ftrace_call() + - ftrace_stub() + - C code: + - ftrace_dyn_arch_init() + - ftrace_make_nop() + - ftrace_make_call() + - ftrace_update_ftrace_func() + +First you will need to fill out some arch details in your asm/ftrace.h. + +Define MCOUNT_ADDR as the address of your mcount symbol similar to:: + + #define MCOUNT_ADDR ((unsigned long)mcount) + +Since no one else will have a decl for that function, you will need to:: + + extern void mcount(void); + +You will also need the helper function ftrace_call_adjust(). Most people +will be able to stub it out like so:: + + static inline unsigned long ftrace_call_adjust(unsigned long addr) + { + return addr; + } + +<details to be filled> + +Lastly you will need the custom dyn_arch_ftrace structure. If you need +some extra state when runtime patching arbitrary call sites, this is the +place. For now though, create an empty struct:: + + struct dyn_arch_ftrace { + /* No extra data needed */ + }; + +With the header out of the way, we can fill out the assembly code. While we +did already create a mcount() function earlier, dynamic ftrace only wants a +stub function. This is because the mcount() will only be used during boot +and then all references to it will be patched out never to return. Instead, +the guts of the old mcount() will be used to create a new ftrace_caller() +function. Because the two are hard to merge, it will most likely be a lot +easier to have two separate definitions split up by #ifdefs. Same goes for +the ftrace_stub() as that will now be inlined in ftrace_caller(). + +Before we get confused anymore, let's check out some pseudo code so you can +implement your own stuff in assembly:: + + void mcount(void) + { + return; + } + + void ftrace_caller(void) + { + /* save all state needed by the ABI (see paragraph above) */ + + unsigned long frompc = ...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + + ftrace_call: + ftrace_stub(frompc, selfpc); + + /* restore all state needed by the ABI */ + + ftrace_stub: + return; + } + +This might look a little odd at first, but keep in mind that we will be runtime +patching multiple things. First, only functions that we actually want to trace +will be patched to call ftrace_caller(). Second, since we only have one tracer +active at a time, we will patch the ftrace_caller() function itself to call the +specific tracer in question. That is the point of the ftrace_call label. + +With that in mind, let's move on to the C code that will actually be doing the +runtime patching. You'll need a little knowledge of your arch's opcodes in +order to make it through the next section. + +Every arch has an init callback function. If you need to do something early on +to initialize some state, this is the time to do that. Otherwise, this simple +function below should be sufficient for most people:: + + int __init ftrace_dyn_arch_init(void) + { + return 0; + } + +There are two functions that are used to do runtime patching of arbitrary +functions. The first is used to turn the mcount call site into a nop (which +is what helps us retain runtime performance when not tracing). The second is +used to turn the mcount call site into a call to an arbitrary location (but +typically that is ftracer_caller()). See the general function definition in +linux/ftrace.h for the functions:: + + ftrace_make_nop() + ftrace_make_call() + +The rec->ip value is the address of the mcount call site that was collected +by the scripts/recordmcount.pl during build time. + +The last function is used to do runtime patching of the active tracer. This +will be modifying the assembly code at the location of the ftrace_call symbol +inside of the ftrace_caller() function. So you should have sufficient padding +at that location to support the new function calls you'll be inserting. Some +people will be using a "call" type instruction while others will be using a +"branch" type instruction. Specifically, the function is:: + + ftrace_update_ftrace_func() + + +HAVE_DYNAMIC_FTRACE + HAVE_FUNCTION_GRAPH_TRACER +------------------------------------------------ + +The function grapher needs a few tweaks in order to work with dynamic ftrace. +Basically, you will need to: + + - update: + - ftrace_caller() + - ftrace_graph_call() + - ftrace_graph_caller() + - implement: + - ftrace_enable_ftrace_graph_caller() + - ftrace_disable_ftrace_graph_caller() + +<details to be filled> + +Quick notes: + + - add a nop stub after the ftrace_call location named ftrace_graph_call; + stub needs to be large enough to support a call to ftrace_graph_caller() + - update ftrace_graph_caller() to work with being called by the new + ftrace_caller() since some semantics may have changed + - ftrace_enable_ftrace_graph_caller() will runtime patch the + ftrace_graph_call location with a call to ftrace_graph_caller() + - ftrace_disable_ftrace_graph_caller() will runtime patch the + ftrace_graph_call location with nops |