aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/coco
diff options
context:
space:
mode:
authorLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
committerLibravatar Linus Torvalds <torvalds@linux-foundation.org>2023-02-21 18:24:12 -0800
commit5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 (patch)
treecc5c2d0a898769fd59549594fedb3ee6f84e59a0 /arch/x86/coco
downloadlinux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.tar.gz
linux-5b7c4cabbb65f5c469464da6c5f614cbd7f730f2.zip
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextgrafted
Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
Diffstat (limited to 'arch/x86/coco')
-rw-r--r--arch/x86/coco/Makefile8
-rw-r--r--arch/x86/coco/core.c140
-rw-r--r--arch/x86/coco/tdx/Makefile3
-rw-r--r--arch/x86/coco/tdx/tdcall.S194
-rw-r--r--arch/x86/coco/tdx/tdx.c817
5 files changed, 1162 insertions, 0 deletions
diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
new file mode 100644
index 000000000..c816acf78
--- /dev/null
+++ b/arch/x86/coco/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS_REMOVE_core.o = -pg
+KASAN_SANITIZE_core.o := n
+CFLAGS_core.o += -fno-stack-protector
+
+obj-y += core.o
+
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdx/
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
new file mode 100644
index 000000000..49b44f881
--- /dev/null
+++ b/arch/x86/coco/core.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Confidential Computing Platform Capability checks
+ *
+ * Copyright (C) 2021 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/export.h>
+#include <linux/cc_platform.h>
+
+#include <asm/coco.h>
+#include <asm/processor.h>
+
+static enum cc_vendor vendor __ro_after_init;
+static u64 cc_mask __ro_after_init;
+
+static bool intel_cc_platform_has(enum cc_attr attr)
+{
+ switch (attr) {
+ case CC_ATTR_GUEST_UNROLL_STRING_IO:
+ case CC_ATTR_HOTPLUG_DISABLED:
+ case CC_ATTR_GUEST_MEM_ENCRYPT:
+ case CC_ATTR_MEM_ENCRYPT:
+ return true;
+ default:
+ return false;
+ }
+}
+
+/*
+ * SME and SEV are very similar but they are not the same, so there are
+ * times that the kernel will need to distinguish between SME and SEV. The
+ * cc_platform_has() function is used for this. When a distinction isn't
+ * needed, the CC_ATTR_MEM_ENCRYPT attribute can be used.
+ *
+ * The trampoline code is a good example for this requirement. Before
+ * paging is activated, SME will access all memory as decrypted, but SEV
+ * will access all memory as encrypted. So, when APs are being brought
+ * up under SME the trampoline area cannot be encrypted, whereas under SEV
+ * the trampoline area must be encrypted.
+ */
+static bool amd_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ switch (attr) {
+ case CC_ATTR_MEM_ENCRYPT:
+ return sme_me_mask;
+
+ case CC_ATTR_HOST_MEM_ENCRYPT:
+ return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);
+
+ case CC_ATTR_GUEST_MEM_ENCRYPT:
+ return sev_status & MSR_AMD64_SEV_ENABLED;
+
+ case CC_ATTR_GUEST_STATE_ENCRYPT:
+ return sev_status & MSR_AMD64_SEV_ES_ENABLED;
+
+ /*
+ * With SEV, the rep string I/O instructions need to be unrolled
+ * but SEV-ES supports them through the #VC handler.
+ */
+ case CC_ATTR_GUEST_UNROLL_STRING_IO:
+ return (sev_status & MSR_AMD64_SEV_ENABLED) &&
+ !(sev_status & MSR_AMD64_SEV_ES_ENABLED);
+
+ case CC_ATTR_GUEST_SEV_SNP:
+ return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
+
+ default:
+ return false;
+ }
+#else
+ return false;
+#endif
+}
+
+static bool hyperv_cc_platform_has(enum cc_attr attr)
+{
+ return attr == CC_ATTR_GUEST_MEM_ENCRYPT;
+}
+
+bool cc_platform_has(enum cc_attr attr)
+{
+ switch (vendor) {
+ case CC_VENDOR_AMD:
+ return amd_cc_platform_has(attr);
+ case CC_VENDOR_INTEL:
+ return intel_cc_platform_has(attr);
+ case CC_VENDOR_HYPERV:
+ return hyperv_cc_platform_has(attr);
+ default:
+ return false;
+ }
+}
+EXPORT_SYMBOL_GPL(cc_platform_has);
+
+u64 cc_mkenc(u64 val)
+{
+ /*
+ * Both AMD and Intel use a bit in the page table to indicate
+ * encryption status of the page.
+ *
+ * - for AMD, bit *set* means the page is encrypted
+ * - for Intel *clear* means encrypted.
+ */
+ switch (vendor) {
+ case CC_VENDOR_AMD:
+ return val | cc_mask;
+ case CC_VENDOR_INTEL:
+ return val & ~cc_mask;
+ default:
+ return val;
+ }
+}
+
+u64 cc_mkdec(u64 val)
+{
+ /* See comment in cc_mkenc() */
+ switch (vendor) {
+ case CC_VENDOR_AMD:
+ return val & ~cc_mask;
+ case CC_VENDOR_INTEL:
+ return val | cc_mask;
+ default:
+ return val;
+ }
+}
+EXPORT_SYMBOL_GPL(cc_mkdec);
+
+__init void cc_set_vendor(enum cc_vendor v)
+{
+ vendor = v;
+}
+
+__init void cc_set_mask(u64 mask)
+{
+ cc_mask = mask;
+}
diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
new file mode 100644
index 000000000..46c559985
--- /dev/null
+++ b/arch/x86/coco/tdx/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y += tdx.o tdcall.o
diff --git a/arch/x86/coco/tdx/tdcall.S b/arch/x86/coco/tdx/tdcall.S
new file mode 100644
index 000000000..ad0d51f03
--- /dev/null
+++ b/arch/x86/coco/tdx/tdcall.S
@@ -0,0 +1,194 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+#include <linux/errno.h>
+
+#include "../../virt/vmx/tdx/tdxcall.S"
+
+/*
+ * Bitmasks of exposed registers (with VMM).
+ */
+#define TDX_R10 BIT(10)
+#define TDX_R11 BIT(11)
+#define TDX_R12 BIT(12)
+#define TDX_R13 BIT(13)
+#define TDX_R14 BIT(14)
+#define TDX_R15 BIT(15)
+
+/*
+ * These registers are clobbered to hold arguments for each
+ * TDVMCALL. They are safe to expose to the VMM.
+ * Each bit in this mask represents a register ID. Bit field
+ * details can be found in TDX GHCI specification, section
+ * titled "TDCALL [TDG.VP.VMCALL] leaf".
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \
+ TDX_R12 | TDX_R13 | \
+ TDX_R14 | TDX_R15 )
+
+.section .noinstr.text, "ax"
+
+/*
+ * __tdx_module_call() - Used by TDX guests to request services from
+ * the TDX module (does not include VMM services) using TDCALL instruction.
+ *
+ * Transforms function call register arguments into the TDCALL register ABI.
+ * After TDCALL operation, TDX module output is saved in @out (if it is
+ * provided by the user).
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - TDCALL Leaf number.
+ * RCX,RDX,R8-R9 - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction error code.
+ * RCX,RDX,R8-R11 - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module). It
+ * can be NULL.
+ *
+ * Return status of TDCALL via RAX.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+ TDX_MODULE_CALL host=0
+ FRAME_END
+ RET
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * __tdx_hypercall() - Make hypercalls to a TDX VMM using TDVMCALL leaf
+ * of TDCALL instruction
+ *
+ * Transforms values in function call argument struct tdx_hypercall_args @args
+ * into the TDCALL register ABI. After TDCALL operation, VMM output is saved
+ * back in @args.
+ *
+ *-------------------------------------------------------------------------
+ * TD VMCALL ABI:
+ *-------------------------------------------------------------------------
+ *
+ * Input Registers:
+ *
+ * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
+ * RCX - BITMAP which controls which part of TD Guest GPR
+ * is passed as-is to the VMM and back.
+ * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
+ * specification. Non zero value indicates vendor
+ * specific ABI.
+ * R11 - VMCALL sub function number
+ * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
+ * R8-R9, R12-R15 - Same as above.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction status (Not related to hypercall
+ * output).
+ * R10 - Hypercall output error code.
+ * R11-R15 - Hypercall sub function specific output values.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_hypercall() function ABI:
+ *
+ * @args (RDI) - struct tdx_hypercall_args for input and output
+ * @flags (RSI) - TDX_HCALL_* flags
+ *
+ * On successful completion, return the hypercall error code.
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+
+ /* Save callee-saved GPRs as mandated by the x86_64 ABI */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
+ xor %eax, %eax
+
+ /* Copy hypercall registers from arg struct: */
+ movq TDX_HYPERCALL_r10(%rdi), %r10
+ movq TDX_HYPERCALL_r11(%rdi), %r11
+ movq TDX_HYPERCALL_r12(%rdi), %r12
+ movq TDX_HYPERCALL_r13(%rdi), %r13
+ movq TDX_HYPERCALL_r14(%rdi), %r14
+ movq TDX_HYPERCALL_r15(%rdi), %r15
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * RAX==0 indicates a failure of the TDVMCALL mechanism itself and that
+ * something has gone horribly wrong with the TDX module.
+ *
+ * The return status of the hypercall operation is in a separate
+ * register (in R10). Hypercall errors are a part of normal operation
+ * and are handled by callers.
+ */
+ testq %rax, %rax
+ jne .Lpanic
+
+ /* TDVMCALL leaf return code is in R10 */
+ movq %r10, %rax
+
+ /* Copy hypercall result registers to arg struct if needed */
+ testq $TDX_HCALL_HAS_OUTPUT, %rsi
+ jz .Lout
+
+ movq %r10, TDX_HYPERCALL_r10(%rdi)
+ movq %r11, TDX_HYPERCALL_r11(%rdi)
+ movq %r12, TDX_HYPERCALL_r12(%rdi)
+ movq %r13, TDX_HYPERCALL_r13(%rdi)
+ movq %r14, TDX_HYPERCALL_r14(%rdi)
+ movq %r15, TDX_HYPERCALL_r15(%rdi)
+.Lout:
+ /*
+ * Zero out registers exposed to the VMM to avoid speculative execution
+ * with VMM-controlled values. This needs to include all registers
+ * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
+ * context will be restored.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+
+ /* Restore callee-saved GPRs as mandated by the x86_64 ABI */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ FRAME_END
+
+ RET
+.Lpanic:
+ call __tdx_hypercall_failed
+ /* __tdx_hypercall_failed never returns */
+ REACHABLE
+ jmp .Lpanic
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
new file mode 100644
index 000000000..3bd111d5e
--- /dev/null
+++ b/arch/x86/coco/tdx/tdx.c
@@ -0,0 +1,817 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2021-2022 Intel Corporation */
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/cpufeature.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <asm/coco.h>
+#include <asm/tdx.h>
+#include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+#include <asm/pgtable.h>
+
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO 1
+#define TDX_GET_VEINFO 3
+#define TDX_GET_REPORT 4
+#define TDX_ACCEPT_PAGE 6
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA 0x10001
+
+/* MMIO direction */
+#define EPT_READ 0
+#define EPT_WRITE 1
+
+/* Port I/O direction */
+#define PORT_READ 0
+#define PORT_WRITE 1
+
+/* See Exit Qualification for I/O Instructions in VMX documentation */
+#define VE_IS_IO_IN(e) ((e) & BIT(3))
+#define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1)
+#define VE_GET_PORT_NUM(e) ((e) >> 16)
+#define VE_IS_IO_STRING(e) ((e) & BIT(4))
+
+#define ATTR_SEPT_VE_DISABLE BIT(28)
+
+/* TDX Module call error codes */
+#define TDCALL_RETURN_CODE(a) ((a) >> 32)
+#define TDCALL_INVALID_OPERAND 0xc0000100
+
+#define TDREPORT_SUBTYPE_0 0
+
+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = fn,
+ .r12 = r12,
+ .r13 = r13,
+ .r14 = r14,
+ .r15 = r15,
+ };
+
+ return __tdx_hypercall(&args, 0);
+}
+
+/* Called from __tdx_hypercall() for unrecoverable failure */
+noinstr void __tdx_hypercall_failed(void)
+{
+ instrumentation_begin();
+ panic("TDVMCALL failed. TDX module bug?");
+}
+
+/*
+ * The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
+ * independently from but are currently matched 1:1 with VMX EXIT_REASONs.
+ * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and
+ * guest sides of these calls.
+ */
+static __always_inline u64 hcall_func(u64 exit_reason)
+{
+ return exit_reason;
+}
+
+#ifdef CONFIG_KVM_GUEST
+long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = nr,
+ .r11 = p1,
+ .r12 = p2,
+ .r13 = p3,
+ .r14 = p4,
+ };
+
+ return __tdx_hypercall(&args, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
+#endif
+
+/*
+ * Used for TDX guests to make calls directly to the TD module. This
+ * should only be used for calls that have no legitimate reason to fail
+ * or where the kernel can not survive the call failing.
+ */
+static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out)
+{
+ if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
+ panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
+}
+
+/**
+ * tdx_mcall_get_report0() - Wrapper to get TDREPORT0 (a.k.a. TDREPORT
+ * subtype 0) using TDG.MR.REPORT TDCALL.
+ * @reportdata: Address of the input buffer which contains user-defined
+ * REPORTDATA to be included into TDREPORT.
+ * @tdreport: Address of the output buffer to store TDREPORT.
+ *
+ * Refer to section titled "TDG.MR.REPORT leaf" in the TDX Module
+ * v1.0 specification for more information on TDG.MR.REPORT TDCALL.
+ * It is used in the TDX guest driver module to get the TDREPORT0.
+ *
+ * Return 0 on success, -EINVAL for invalid operands, or -EIO on
+ * other TDCALL failures.
+ */
+int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport)
+{
+ u64 ret;
+
+ ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
+ virt_to_phys(reportdata), TDREPORT_SUBTYPE_0,
+ 0, NULL);
+ if (ret) {
+ if (TDCALL_RETURN_CODE(ret) == TDCALL_INVALID_OPERAND)
+ return -EINVAL;
+ return -EIO;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tdx_mcall_get_report0);
+
+static void tdx_parse_tdinfo(u64 *cc_mask)
+{
+ struct tdx_module_output out;
+ unsigned int gpa_width;
+ u64 td_attr;
+
+ /*
+ * TDINFO TDX module call is used to get the TD execution environment
+ * information like GPA width, number of available vcpus, debug mode
+ * information, etc. More details about the ABI can be found in TDX
+ * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+ * [TDG.VP.INFO].
+ */
+ tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+ /*
+ * The highest bit of a guest physical address is the "sharing" bit.
+ * Set it for shared pages and clear it for private pages.
+ *
+ * The GPA width that comes out of this call is critical. TDX guests
+ * can not meaningfully run without it.
+ */
+ gpa_width = out.rcx & GENMASK(5, 0);
+ *cc_mask = BIT_ULL(gpa_width - 1);
+
+ /*
+ * The kernel can not handle #VE's when accessing normal kernel
+ * memory. Ensure that no #VE will be delivered for accesses to
+ * TD-private memory. Only VMM-shared memory (MMIO) will #VE.
+ */
+ td_attr = out.rdx;
+ if (!(td_attr & ATTR_SEPT_VE_DISABLE))
+ panic("TD misconfiguration: SEPT_VE_DISABLE attibute must be set.\n");
+}
+
+/*
+ * The TDX module spec states that #VE may be injected for a limited set of
+ * reasons:
+ *
+ * - Emulation of the architectural #VE injection on EPT violation;
+ *
+ * - As a result of guest TD execution of a disallowed instruction,
+ * a disallowed MSR access, or CPUID virtualization;
+ *
+ * - A notification to the guest TD about anomalous behavior;
+ *
+ * The last one is opt-in and is not used by the kernel.
+ *
+ * The Intel Software Developer's Manual describes cases when instruction
+ * length field can be used in section "Information for VM Exits Due to
+ * Instruction Execution".
+ *
+ * For TDX, it ultimately means GET_VEINFO provides reliable instruction length
+ * information if #VE occurred due to instruction execution, but not for EPT
+ * violations.
+ */
+static int ve_instr_len(struct ve_info *ve)
+{
+ switch (ve->exit_reason) {
+ case EXIT_REASON_HLT:
+ case EXIT_REASON_MSR_READ:
+ case EXIT_REASON_MSR_WRITE:
+ case EXIT_REASON_CPUID:
+ case EXIT_REASON_IO_INSTRUCTION:
+ /* It is safe to use ve->instr_len for #VE due instructions */
+ return ve->instr_len;
+ case EXIT_REASON_EPT_VIOLATION:
+ /*
+ * For EPT violations, ve->insn_len is not defined. For those,
+ * the kernel must decode instructions manually and should not
+ * be using this function.
+ */
+ WARN_ONCE(1, "ve->instr_len is not defined for EPT violations");
+ return 0;
+ default:
+ WARN_ONCE(1, "Unexpected #VE-type: %lld\n", ve->exit_reason);
+ return ve->instr_len;
+ }
+}
+
+static u64 __cpuidle __halt(const bool irq_disabled)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = hcall_func(EXIT_REASON_HLT),
+ .r12 = irq_disabled,
+ };
+
+ /*
+ * Emulate HLT operation via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>.
+ *
+ * The VMM uses the "IRQ disabled" param to understand IRQ
+ * enabled status (RFLAGS.IF) of the TD guest and to determine
+ * whether or not it should schedule the halted vCPU if an
+ * IRQ becomes pending. E.g. if IRQs are disabled, the VMM
+ * can keep the vCPU in virtual HLT, even if an IRQ is
+ * pending, without hanging/breaking the guest.
+ */
+ return __tdx_hypercall(&args, 0);
+}
+
+static int handle_halt(struct ve_info *ve)
+{
+ const bool irq_disabled = irqs_disabled();
+
+ if (__halt(irq_disabled))
+ return -EIO;
+
+ return ve_instr_len(ve);
+}
+
+void __cpuidle tdx_safe_halt(void)
+{
+ const bool irq_disabled = false;
+
+ /*
+ * Use WARN_ONCE() to report the failure.
+ */
+ if (__halt(irq_disabled))
+ WARN_ONCE(1, "HLT instruction emulation failed\n");
+}
+
+static int read_msr(struct pt_regs *regs, struct ve_info *ve)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = hcall_func(EXIT_REASON_MSR_READ),
+ .r12 = regs->cx,
+ };
+
+ /*
+ * Emulate the MSR read via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
+ */
+ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+ return -EIO;
+
+ regs->ax = lower_32_bits(args.r11);
+ regs->dx = upper_32_bits(args.r11);
+ return ve_instr_len(ve);
+}
+
+static int write_msr(struct pt_regs *regs, struct ve_info *ve)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = hcall_func(EXIT_REASON_MSR_WRITE),
+ .r12 = regs->cx,
+ .r13 = (u64)regs->dx << 32 | regs->ax,
+ };
+
+ /*
+ * Emulate the MSR write via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
+ */
+ if (__tdx_hypercall(&args, 0))
+ return -EIO;
+
+ return ve_instr_len(ve);
+}
+
+static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = hcall_func(EXIT_REASON_CPUID),
+ .r12 = regs->ax,
+ .r13 = regs->cx,
+ };
+
+ /*
+ * Only allow VMM to control range reserved for hypervisor
+ * communication.
+ *
+ * Return all-zeros for any CPUID outside the range. It matches CPU
+ * behaviour for non-supported leaf.
+ */
+ if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
+ regs->ax = regs->bx = regs->cx = regs->dx = 0;
+ return ve_instr_len(ve);
+ }
+
+ /*
+ * Emulate the CPUID instruction via a hypercall. More info about
+ * ABI can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
+ */
+ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+ return -EIO;
+
+ /*
+ * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
+ * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
+ * So copy the register contents back to pt_regs.
+ */
+ regs->ax = args.r12;
+ regs->bx = args.r13;
+ regs->cx = args.r14;
+ regs->dx = args.r15;
+
+ return ve_instr_len(ve);
+}
+
+static bool mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
+ .r12 = size,
+ .r13 = EPT_READ,
+ .r14 = addr,
+ .r15 = *val,
+ };
+
+ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+ return false;
+ *val = args.r11;
+ return true;
+}
+
+static bool mmio_write(int size, unsigned long addr, unsigned long val)
+{
+ return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
+ EPT_WRITE, addr, val);
+}
+
+static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+ unsigned long *reg, val, vaddr;
+ char buffer[MAX_INSN_SIZE];
+ enum insn_mmio_type mmio;
+ struct insn insn = {};
+ int size, extend_size;
+ u8 extend_val = 0;
+
+ /* Only in-kernel MMIO is supported */
+ if (WARN_ON_ONCE(user_mode(regs)))
+ return -EFAULT;
+
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+ return -EFAULT;
+
+ if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+ return -EINVAL;
+
+ mmio = insn_decode_mmio(&insn, &size);
+ if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
+ return -EINVAL;
+
+ if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
+ reg = insn_get_modrm_reg_ptr(&insn, regs);
+ if (!reg)
+ return -EINVAL;
+ }
+
+ /*
+ * Reject EPT violation #VEs that split pages.
+ *
+ * MMIO accesses are supposed to be naturally aligned and therefore
+ * never cross page boundaries. Seeing split page accesses indicates
+ * a bug or a load_unaligned_zeropad() that stepped into an MMIO page.
+ *
+ * load_unaligned_zeropad() will recover using exception fixups.
+ */
+ vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
+ if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE)
+ return -EFAULT;
+
+ /* Handle writes first */
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ memcpy(&val, reg, size);
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn.length;
+ case INSN_MMIO_WRITE_IMM:
+ val = insn.immediate.value;
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn.length;
+ case INSN_MMIO_READ:
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ /* Reads are handled below */
+ break;
+ case INSN_MMIO_MOVS:
+ case INSN_MMIO_DECODE_FAILED:
+ /*
+ * MMIO was accessed with an instruction that could not be
+ * decoded or handled properly. It was likely not using io.h
+ * helpers or accessed MMIO accidentally.
+ */
+ return -EINVAL;
+ default:
+ WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
+ return -EINVAL;
+ }
+
+ /* Handle reads */
+ if (!mmio_read(size, ve->gpa, &val))
+ return -EIO;
+
+ switch (mmio) {
+ case INSN_MMIO_READ:
+ /* Zero-extend for 32-bit operation */
+ extend_size = size == 4 ? sizeof(*reg) : 0;
+ break;
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ /* Zero extend based on operand size */
+ extend_size = insn.opnd_bytes;
+ break;
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ /* Sign extend based on operand size */
+ extend_size = insn.opnd_bytes;
+ if (size == 1 && val & BIT(7))
+ extend_val = 0xFF;
+ else if (size > 1 && val & BIT(15))
+ extend_val = 0xFF;
+ break;
+ default:
+ /* All other cases has to be covered with the first switch() */
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ if (extend_size)
+ memset(reg, extend_val, extend_size);
+ memcpy(reg, &val, size);
+ return insn.length;
+}
+
+static bool handle_in(struct pt_regs *regs, int size, int port)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
+ .r12 = size,
+ .r13 = PORT_READ,
+ .r14 = port,
+ };
+ u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+ bool success;
+
+ /*
+ * Emulate the I/O read via hypercall. More info about ABI can be found
+ * in TDX Guest-Host-Communication Interface (GHCI) section titled
+ * "TDG.VP.VMCALL<Instruction.IO>".
+ */
+ success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT);
+
+ /* Update part of the register affected by the emulated instruction */
+ regs->ax &= ~mask;
+ if (success)
+ regs->ax |= args.r11 & mask;
+
+ return success;
+}
+
+static bool handle_out(struct pt_regs *regs, int size, int port)
+{
+ u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+
+ /*
+ * Emulate the I/O write via hypercall. More info about ABI can be found
+ * in TDX Guest-Host-Communication Interface (GHCI) section titled
+ * "TDG.VP.VMCALL<Instruction.IO>".
+ */
+ return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size,
+ PORT_WRITE, port, regs->ax & mask);
+}
+
+/*
+ * Emulate I/O using hypercall.
+ *
+ * Assumes the IO instruction was using ax, which is enforced
+ * by the standard io.h macros.
+ *
+ * Return True on success or False on failure.
+ */
+static int handle_io(struct pt_regs *regs, struct ve_info *ve)
+{
+ u32 exit_qual = ve->exit_qual;
+ int size, port;
+ bool in, ret;
+
+ if (VE_IS_IO_STRING(exit_qual))
+ return -EIO;
+
+ in = VE_IS_IO_IN(exit_qual);
+ size = VE_GET_IO_SIZE(exit_qual);
+ port = VE_GET_PORT_NUM(exit_qual);
+
+
+ if (in)
+ ret = handle_in(regs, size, port);
+ else
+ ret = handle_out(regs, size, port);
+ if (!ret)
+ return -EIO;
+
+ return ve_instr_len(ve);
+}
+
+/*
+ * Early #VE exception handler. Only handles a subset of port I/O.
+ * Intended only for earlyprintk. If failed, return false.
+ */
+__init bool tdx_early_handle_ve(struct pt_regs *regs)
+{
+ struct ve_info ve;
+ int insn_len;
+
+ tdx_get_ve_info(&ve);
+
+ if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
+ return false;
+
+ insn_len = handle_io(regs, &ve);
+ if (insn_len < 0)
+ return false;
+
+ regs->ip += insn_len;
+ return true;
+}
+
+void tdx_get_ve_info(struct ve_info *ve)
+{
+ struct tdx_module_output out;
+
+ /*
+ * Called during #VE handling to retrieve the #VE info from the
+ * TDX module.
+ *
+ * This has to be called early in #VE handling. A "nested" #VE which
+ * occurs before this will raise a #DF and is not recoverable.
+ *
+ * The call retrieves the #VE info from the TDX module, which also
+ * clears the "#VE valid" flag. This must be done before anything else
+ * because any #VE that occurs while the valid flag is set will lead to
+ * #DF.
+ *
+ * Note, the TDX module treats virtual NMIs as inhibited if the #VE
+ * valid flag is set. It means that NMI=>#VE will not result in a #DF.
+ */
+ tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
+
+ /* Transfer the output parameters */
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = lower_32_bits(out.r10);
+ ve->instr_info = upper_32_bits(out.r10);
+}
+
+/*
+ * Handle the user initiated #VE.
+ *
+ * On success, returns the number of bytes RIP should be incremented (>=0)
+ * or -errno on error.
+ */
+static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+ switch (ve->exit_reason) {
+ case EXIT_REASON_CPUID:
+ return handle_cpuid(regs, ve);
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EIO;
+ }
+}
+
+/*
+ * Handle the kernel #VE.
+ *
+ * On success, returns the number of bytes RIP should be incremented (>=0)
+ * or -errno on error.
+ */
+static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+{
+ switch (ve->exit_reason) {
+ case EXIT_REASON_HLT:
+ return handle_halt(ve);
+ case EXIT_REASON_MSR_READ:
+ return read_msr(regs, ve);
+ case EXIT_REASON_MSR_WRITE:
+ return write_msr(regs, ve);
+ case EXIT_REASON_CPUID:
+ return handle_cpuid(regs, ve);
+ case EXIT_REASON_EPT_VIOLATION:
+ return handle_mmio(regs, ve);
+ case EXIT_REASON_IO_INSTRUCTION:
+ return handle_io(regs, ve);
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EIO;
+ }
+}
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
+{
+ int insn_len;
+
+ if (user_mode(regs))
+ insn_len = virt_exception_user(regs, ve);
+ else
+ insn_len = virt_exception_kernel(regs, ve);
+ if (insn_len < 0)
+ return false;
+
+ /* After successful #VE handling, move the IP */
+ regs->ip += insn_len;
+
+ return true;
+}
+
+static bool tdx_tlb_flush_required(bool private)
+{
+ /*
+ * TDX guest is responsible for flushing TLB on private->shared
+ * transition. VMM is responsible for flushing on shared->private.
+ *
+ * The VMM _can't_ flush private addresses as it can't generate PAs
+ * with the guest's HKID. Shared memory isn't subject to integrity
+ * checking, i.e. the VMM doesn't need to flush for its own protection.
+ *
+ * There's no need to flush when converting from shared to private,
+ * as flushing is the VMM's responsibility in this case, e.g. it must
+ * flush to avoid integrity failures in the face of a buggy or
+ * malicious guest.
+ */
+ return !private;
+}
+
+static bool tdx_cache_flush_required(void)
+{
+ /*
+ * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
+ * TDX doesn't have such capability.
+ *
+ * Flush cache unconditionally.
+ */
+ return true;
+}
+
+static bool try_accept_one(phys_addr_t *start, unsigned long len,
+ enum pg_level pg_level)
+{
+ unsigned long accept_size = page_level_size(pg_level);
+ u64 tdcall_rcx;
+ u8 page_size;
+
+ if (!IS_ALIGNED(*start, accept_size))
+ return false;
+
+ if (len < accept_size)
+ return false;
+
+ /*
+ * Pass the page physical address to the TDX module to accept the
+ * pending, private page.
+ *
+ * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+ */
+ switch (pg_level) {
+ case PG_LEVEL_4K:
+ page_size = 0;
+ break;
+ case PG_LEVEL_2M:
+ page_size = 1;
+ break;
+ case PG_LEVEL_1G:
+ page_size = 2;
+ break;
+ default:
+ return false;
+ }
+
+ tdcall_rcx = *start | page_size;
+ if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
+ return false;
+
+ *start += accept_size;
+ return true;
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest. The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+ phys_addr_t start = __pa(vaddr);
+ phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+ if (!enc) {
+ /* Set the shared (decrypted) bits: */
+ start |= cc_mkdec(0);
+ end |= cc_mkdec(0);
+ }
+
+ /*
+ * Notify the VMM about page mapping conversion. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface (GHCI),
+ * section "TDG.VP.VMCALL<MapGPA>"
+ */
+ if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+ return false;
+
+ /* private->shared conversion requires only MapGPA call */
+ if (!enc)
+ return true;
+
+ /*
+ * For shared->private conversion, accept the page using
+ * TDX_ACCEPT_PAGE TDX module call.
+ */
+ while (start < end) {
+ unsigned long len = end - start;
+
+ /*
+ * Try larger accepts first. It gives chance to VMM to keep
+ * 1G/2M SEPT entries where possible and speeds up process by
+ * cutting number of hypercalls (if successful).
+ */
+
+ if (try_accept_one(&start, len, PG_LEVEL_1G))
+ continue;
+
+ if (try_accept_one(&start, len, PG_LEVEL_2M))
+ continue;
+
+ if (!try_accept_one(&start, len, PG_LEVEL_4K))
+ return false;
+ }
+
+ return true;
+}
+
+void __init tdx_early_init(void)
+{
+ u64 cc_mask;
+ u32 eax, sig[3];
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
+
+ if (memcmp(TDX_IDENT, sig, sizeof(sig)))
+ return;
+
+ setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+ cc_set_vendor(CC_VENDOR_INTEL);
+ tdx_parse_tdinfo(&cc_mask);
+ cc_set_mask(cc_mask);
+
+ /*
+ * All bits above GPA width are reserved and kernel treats shared bit
+ * as flag, not as part of physical address.
+ *
+ * Adjust physical mask to only cover valid GPA bits.
+ */
+ physical_mask &= cc_mask - 1;
+
+ x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
+ x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
+ x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
+
+ pr_info("Guest detected\n");
+}