diff options
author | 2023-02-21 18:24:12 -0800 | |
---|---|---|
committer | 2023-02-21 18:24:12 -0800 | |
commit | 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch) | |
tree | cc5c2d0a898769fd59549594fedb3ee6f84e59a0 /Documentation/networking/tls-offload.rst | |
download | linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip |
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
Diffstat (limited to 'Documentation/networking/tls-offload.rst')
-rw-r--r-- | Documentation/networking/tls-offload.rst | 539 |
1 files changed, 539 insertions, 0 deletions
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst new file mode 100644 index 000000000..5f0dea3d5 --- /dev/null +++ b/Documentation/networking/tls-offload.rst @@ -0,0 +1,539 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +================== +Kernel TLS offload +================== + +Kernel TLS operation +==================== + +Linux kernel provides TLS connection offload infrastructure. Once a TCP +connection is in ``ESTABLISHED`` state user space can enable the TLS Upper +Layer Protocol (ULP) and install the cryptographic connection state. +For details regarding the user-facing interface refer to the TLS +documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`. + +``ktls`` can operate in three modes: + + * Software crypto mode (``TLS_SW``) - CPU handles the cryptography. + In most basic cases only crypto operations synchronous with the CPU + can be used, but depending on calling context CPU may utilize + asynchronous crypto accelerators. The use of accelerators introduces extra + latency on socket reads (decryption only starts when a read syscall + is made) and additional I/O load on the system. + * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto + on a packet by packet basis, provided the packets arrive in order. + This mode integrates best with the kernel stack and is described in detail + in the remaining part of this document + (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``). + * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where + NIC driver and firmware replace the kernel networking stack + with its own TCP handling, it is not usable in production environments + making use of the Linux networking stack for example any firewalling + abilities or QoS and packet scheduling (``ethtool`` flag ``tls-hw-record``). + +The operation mode is selected automatically based on device configuration, +offload opt-in or opt-out on per-connection basis is not currently supported. + +TX +-- + +At a high level user write requests are turned into a scatter list, the TLS ULP +intercepts them, inserts record framing, performs encryption (in ``TLS_SW`` +mode) and then hands the modified scatter list to the TCP layer. From this +point on the TCP stack proceeds as normal. + +In ``TLS_HW`` mode the encryption is not performed in the TLS ULP. +Instead packets reach a device driver, the driver will mark the packets +for crypto offload based on the socket the packet is attached to, +and send them to the device for encryption and transmission. + +RX +-- + +On the receive side if the device handled decryption and authentication +successfully, the driver will set the decrypted bit in the associated +:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and +are handled normally. ``ktls`` is informed when data is queued to the socket +and the ``strparser`` mechanism is used to delineate the records. Upon read +request, records are retrieved from the socket and passed to decryption routine. +If device decrypted all the segments of the record the decryption is skipped, +otherwise software path handles decryption. + +.. kernel-figure:: tls-offload-layers.svg + :alt: TLS offload layers + :align: center + :figwidth: 28em + + Layers of Kernel TLS stack + +Device configuration +==================== + +During driver initialization device sets the ``NETIF_F_HW_TLS_RX`` and +``NETIF_F_HW_TLS_TX`` features and installs its +:c:type:`struct tlsdev_ops <tlsdev_ops>` +pointer in the :c:member:`tlsdev_ops` member of the +:c:type:`struct net_device <net_device>`. + +When TLS cryptographic connection state is installed on a ``ktls`` socket +(note that it is done twice, once for RX and once for TX direction, +and the two are completely independent), the kernel checks if the underlying +network device is offload-capable and attempts the offload. In case offload +fails the connection is handled entirely in software using the same mechanism +as if the offload was never tried. + +Offload request is performed via the :c:member:`tls_dev_add` callback of +:c:type:`struct tlsdev_ops <tlsdev_ops>`: + +.. code-block:: c + + int (*tls_dev_add)(struct net_device *netdev, struct sock *sk, + enum tls_offload_ctx_dir direction, + struct tls_crypto_info *crypto_info, + u32 start_offload_tcp_sn); + +``direction`` indicates whether the cryptographic information is for +the received or transmitted packets. Driver uses the ``sk`` parameter +to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6). +Cryptographic information in ``crypto_info`` includes the key, iv, salt +as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates +which TCP sequence number corresponds to the beginning of the record with +sequence number from ``crypto_info``. The driver can add its state +at the end of kernel structures (see :c:member:`driver_state` members +in ``include/net/tls.h``) to avoid additional allocations and pointer +dereferences. + +TX +-- + +After TX state is installed, the stack guarantees that the first segment +of the stream will start exactly at the ``start_offload_tcp_sn`` sequence +number, simplifying TCP sequence number matching. + +TX offload being fully initialized does not imply that all segments passing +through the driver and which belong to the offloaded socket will be after +the expected sequence number and will have kernel record information. +In particular, already encrypted data may have been queued to the socket +before installing the connection state in the kernel. + +RX +-- + +In RX direction local networking stack has little control over the segmentation, +so the initial records' TCP sequence number may be anywhere inside the segment. + +Normal operation +================ + +At the minimum the device maintains the following state for each connection, in +each direction: + + * crypto secrets (key, iv, salt) + * crypto processing state (partial blocks, partial authentication tag, etc.) + * record metadata (sequence number, processing offset and length) + * expected TCP sequence number + +There are no guarantees on record length or record segmentation. In particular +segments may start at any point of a record and contain any number of records. +Assuming segments are received in order, the device should be able to perform +crypto operations and authentication regardless of segmentation. For this +to be possible device has to keep small amount of segment-to-segment state. +This includes at least: + + * partial headers (if a segment carried only a part of the TLS header) + * partial data block + * partial authentication tag (all data had been seen but part of the + authentication tag has to be written or read from the subsequent segment) + +Record reassembly is not necessary for TLS offload. If the packets arrive +in order the device should be able to handle them separately and make +forward progress. + +TX +-- + +The kernel stack performs record framing reserving space for the authentication +tag and populating all other TLS header and tailer fields. + +Both the device and the driver maintain expected TCP sequence numbers +due to the possibility of retransmissions and the lack of software fallback +once the packet reaches the device. +For segments passed in order, the driver marks the packets with +a connection identifier (note that a 5-tuple lookup is insufficient to identify +packets requiring HW offload, see the :ref:`5tuple_problems` section) +and hands them to the device. The device identifies the packet as requiring +TLS handling and confirms the sequence number matches its expectation. +The device performs encryption and authentication of the record data. +It replaces the authentication tag and TCP checksum with correct values. + +RX +-- + +Before a packet is DMAed to the host (but after NIC's embedded switching +and packet transformation functions) the device validates the Layer 4 +checksum and performs a 5-tuple lookup to find any TLS connection the packet +may belong to (technically a 4-tuple +lookup is sufficient - IP addresses and TCP port numbers, as the protocol +is always TCP). If connection is matched device confirms if the TCP sequence +number is the expected one and proceeds to TLS handling (record delineation, +decryption, authentication for each record in the packet). The device leaves +the record framing unmodified, the stack takes care of record decapsulation. +Device indicates successful handling of TLS offload in the per-packet context +(descriptor) passed to the host. + +Upon reception of a TLS offloaded packet, the driver sets +the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>` +corresponding to the segment. Networking stack makes sure decrypted +and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer) +and takes care of partial decryption. + +Resync handling +=============== + +In presence of packet drops or network packet reordering, the device may lose +synchronization with the TLS stream, and require a resync with the kernel's +TCP stack. + +Note that resync is only attempted for connections which were successfully +added to the device table and are in TLS_HW mode. For example, +if the table was full when cryptographic state was installed in the kernel, +such connection will never get offloaded. Therefore the resync request +does not carry any cryptographic connection state. + +TX +-- + +Segments transmitted from an offloaded socket can get out of sync +in similar ways to the receive side-retransmissions - local drops +are possible, though network reorders are not. There are currently +two mechanisms for dealing with out of order segments. + +Crypto state rebuilding +~~~~~~~~~~~~~~~~~~~~~~~ + +Whenever an out of order segment is transmitted the driver provides +the device with enough information to perform cryptographic operations. +This means most likely that the part of the record preceding the current +segment has to be passed to the device as part of the packet context, +together with its TCP sequence number and TLS record number. The device +can then initialize its crypto state, process and discard the preceding +data (to be able to insert the authentication tag) and move onto handling +the actual packet. + +In this mode depending on the implementation the driver can either ask +for a continuation with the crypto state and the new sequence number +(next expected segment is the one after the out of order one), or continue +with the previous stream state - assuming that the out of order segment +was just a retransmission. The former is simpler, and does not require +retransmission detection therefore it is the recommended method until +such time it is proven inefficient. + +Next record sync +~~~~~~~~~~~~~~~~ + +Whenever an out of order segment is detected the driver requests +that the ``ktls`` software fallback code encrypt it. If the segment's +sequence number is lower than expected the driver assumes retransmission +and doesn't change device state. If the segment is in the future, it +may imply a local drop, the driver asks the stack to sync the device +to the next record state and falls back to software. + +Resync request is indicated with: + +.. code-block:: c + + void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq) + +Until resync is complete driver should not access its expected TCP +sequence number (as it will be updated from a different context). +Following helper should be used to test if resync is complete: + +.. code-block:: c + + bool tls_offload_tx_resync_pending(struct sock *sk) + +Next time ``ktls`` pushes a record it will first send its TCP sequence number +and TLS record number to the driver. Stack will also make sure that +the new record will start on a segment boundary (like it does when +the connection is initially added). + +RX +-- + +A small amount of RX reorder events may not require a full resynchronization. +In particular the device should not lose synchronization +when record boundary can be recovered: + +.. kernel-figure:: tls-offload-reorder-good.svg + :alt: reorder of non-header segment + :align: center + + Reorder of non-header segment + +Green segments are successfully decrypted, blue ones are passed +as received on wire, red stripes mark start of new records. + +In above case segment 1 is received and decrypted successfully. +Segment 2 was dropped so 3 arrives out of order. The device knows +the next record starts inside 3, based on record length in segment 1. +Segment 3 is passed untouched, because due to lack of data from segment 2 +the remainder of the previous record inside segment 3 cannot be handled. +The device can, however, collect the authentication algorithm's state +and partial block from the new record in segment 3 and when 4 and 5 +arrive continue decryption. Finally when 2 arrives it's completely outside +of expected window of the device so it's passed as is without special +handling. ``ktls`` software fallback handles the decryption of record +spanning segments 1, 2 and 3. The device did not get out of sync, +even though two segments did not get decrypted. + +Kernel synchronization may be necessary if the lost segment contained +a record header and arrived after the next record header has already passed: + +.. kernel-figure:: tls-offload-reorder-bad.svg + :alt: reorder of header segment + :align: center + + Reorder of segment with a TLS header + +In this example segment 2 gets dropped, and it contains a record header. +Device can only detect that segment 4 also contains a TLS header +if it knows the length of the previous record from segment 2. In this case +the device will lose synchronization with the stream. + +Stream scan resynchronization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When the device gets out of sync and the stream reaches TCP sequence +numbers more than a max size record past the expected TCP sequence number, +the device starts scanning for a known header pattern. For example +for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur +in the SSL/TLS version field of the header. Once pattern is matched +the device continues attempting parsing headers at expected locations +(based on the length fields at guessed locations). +Whenever the expected location does not contain a valid header the scan +is restarted. + +When the header is matched the device sends a confirmation request +to the kernel, asking if the guessed location is correct (if a TLS record +really starts there), and which record sequence number the given header had. +The kernel confirms the guessed location was correct and tells the device +the record sequence number. Meanwhile, the device had been parsing +and counting all records since the just-confirmed one, it adds the number +of records it had seen to the record number provided by the kernel. +At this point the device is in sync and can resume decryption at next +segment boundary. + +In a pathological case the device may latch onto a sequence of matching +headers and never hear back from the kernel (there is no negative +confirmation from the kernel). The implementation may choose to periodically +restart scan. Given how unlikely falsely-matching stream is, however, +periodic restart is not deemed necessary. + +Special care has to be taken if the confirmation request is passed +asynchronously to the packet stream and record may get processed +by the kernel before the confirmation request. + +Stack-driven resynchronization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The driver may also request the stack to perform resynchronization +whenever it sees the records are no longer getting decrypted. +If the connection is configured in this mode the stack automatically +schedules resynchronization after it has received two completely encrypted +records. + +The stack waits for the socket to drain and informs the device about +the next expected record number and its TCP sequence number. If the +records continue to be received fully encrypted stack retries the +synchronization with an exponential back off (first after 2 encrypted +records, then after 4 records, after 8, after 16... up until every +128 records). + +Error handling +============== + +TX +-- + +Packets may be redirected or rerouted by the stack to a different +device than the selected TLS offload device. The stack will handle +such condition using the :c:func:`sk_validate_xmit_skb` helper +(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook). +Offload maintains information about all records until the data is +fully acknowledged, so if skbs reach the wrong device they can be handled +by software fallback. + +Any device TLS offload handling error on the transmission side must result +in the packet being dropped. For example if a packet got out of order +due to a bug in the stack or the device, reached the device and can't +be encrypted such packet must be dropped. + +RX +-- + +If the device encounters any problems with TLS offload on the receive +side it should pass the packet to the host's networking stack as it was +received on the wire. + +For example authentication failure for any record in the segment should +result in passing the unmodified packet to the software fallback. This means +packets should not be modified "in place". Splitting segments to handle partial +decryption is not advised. In other words either all records in the packet +had been handled successfully and authenticated or the packet has to be passed +to the host's stack as it was on the wire (recovering original packet in the +driver if device provides precise error is sufficient). + +The Linux networking stack does not provide a way of reporting per-packet +decryption and authentication errors, packets with errors must simply not +have the :c:member:`decrypted` mark set. + +A packet should also not be handled by the TLS offload if it contains +incorrect checksums. + +Performance metrics +=================== + +TLS offload can be characterized by the following basic metrics: + + * max connection count + * connection installation rate + * connection installation latency + * total cryptographic performance + +Note that each TCP connection requires a TLS session in both directions, +the performance may be reported treating each direction separately. + +Max connection count +-------------------- + +The number of connections device can support can be exposed via +``devlink resource`` API. + +Total cryptographic performance +------------------------------- + +Offload performance may depend on segment and record size. + +Overload of the cryptographic subsystem of the device should not have +significant performance impact on non-offloaded streams. + +Statistics +========== + +Following minimum set of TLS-related statistics should be reported +by the driver: + + * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets + which were part of a TLS stream. + * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets + which were successfully decrypted. + * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for + decryption. + * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device + (connection has finished). + * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync + request. + * ``rx_tls_resync_req_start`` - number of times the TLS async resync request + was started. + * ``rx_tls_resync_req_end`` - number of times the TLS async resync request + properly ended with providing the HW tracked tcp-seq. + * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request + procedure was started by not properly ended. + * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to + the driver was successfully handled. + * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to + the driver was terminated unsuccessfully. + * ``rx_tls_err`` - number of RX packets which were part of a TLS stream + but were not decrypted due to unexpected error in the state machine. + * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device + for encryption of their TLS payload. + * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets + passed to the device for encryption. + * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for + encryption. + * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream + but did not arrive in the expected order. + * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of + a TLS stream and arrived out-of-order, but skipped the HW offload routine + and went to the regular transmit flow as they were retransmissions of the + connection handshake. + * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of + a TLS stream dropped, because they arrived out of order and associated + record could not be found. + * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS + stream dropped, because they contain both data that has been encrypted by + software and data that expects hardware crypto offload. + +Notable corner cases, exceptions and additional requirements +============================================================ + +.. _5tuple_problems: + +5-tuple matching limitations +---------------------------- + +The device can only recognize received packets based on the 5-tuple +of the socket. Current ``ktls`` implementation will not offload sockets +routed through software interfaces such as those used for tunneling +or virtual networking. However, many packet transformations performed +by the networking stack (most notably any BPF logic) do not require +any intermediate software device, therefore a 5-tuple match may +consistently miss at the device level. In such cases the device +should still be able to perform TX offload (encryption) and should +fallback cleanly to software decryption (RX). + +Out of order +------------ + +Introducing extra processing in NICs should not cause packets to be +transmitted or received out of order, for example pure ACK packets +should not be reordered with respect to data segments. + +Ingress reorder +--------------- + +A device is permitted to perform packet reordering for consecutive +TCP segments (i.e. placing packets in the correct order) but any form +of additional buffering is disallowed. + +Coexistence with standard networking offload features +----------------------------------------------------- + +Offloaded ``ktls`` sockets should support standard TCP stack features +transparently. Enabling device TLS offload should not cause any difference +in packets as seen on the wire. + +Transport layer transparency +---------------------------- + +The device should not modify any packet headers for the purpose +of the simplifying TLS offload. + +The device should not depend on any packet headers beyond what is strictly +necessary for TLS offload. + +Segment drops +------------- + +Dropping packets is acceptable only in the event of catastrophic +system errors and should never be used as an error handling mechanism +in cases arising from normal operation. In other words, reliance +on TCP retransmissions to handle corner cases is not acceptable. + +TLS device features +------------------- + +Drivers should ignore the changes to the TLS device feature flags. +These flags will be acted upon accordingly by the core ``ktls`` code. +TLS device feature flags only control adding of new TLS connection +offloads, old connections will remain active after flags are cleared. + +TLS encryption cannot be offloaded to devices without checksum calculation +offload. Hence, TLS TX device feature flag requires TX csum offload being set. +Disabling the latter implies clearing the former. Disabling TX checksum offload +should not affect old connections, and drivers should make sure checksum +calculation does not break for them. +Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user +does not want to enable RX csum offload, TLS RX device feature is disabled +as well. |