From 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 21 Feb 2023 18:24:12 -0800 Subject: Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ... --- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 967 +++++++++++++++++++++++++++++++ 1 file changed, 967 insertions(+) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c (limited to 'drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c') diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c new file mode 100644 index 000000000..4b9e7b050 --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c @@ -0,0 +1,967 @@ +/* + * Copyright 2018 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + * + */ +#include +#include "amdgpu.h" +#include "amdgpu_xgmi.h" +#include "amdgpu_ras.h" +#include "soc15.h" +#include "df/df_3_6_offset.h" +#include "xgmi/xgmi_4_0_0_smn.h" +#include "xgmi/xgmi_4_0_0_sh_mask.h" +#include "wafl/wafl2_4_0_0_smn.h" +#include "wafl/wafl2_4_0_0_sh_mask.h" + +#include "amdgpu_reset.h" + +#define smnPCS_XGMI3X16_PCS_ERROR_STATUS 0x11a0020c +#define smnPCS_GOPX1_PCS_ERROR_STATUS 0x12200210 + +static DEFINE_MUTEX(xgmi_mutex); + +#define AMDGPU_MAX_XGMI_DEVICE_PER_HIVE 4 + +static LIST_HEAD(xgmi_hive_list); + +static const int xgmi_pcs_err_status_reg_vg20[] = { + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS, + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS + 0x100000, +}; + +static const int wafl_pcs_err_status_reg_vg20[] = { + smnPCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, + smnPCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS + 0x100000, +}; + +static const int xgmi_pcs_err_status_reg_arct[] = { + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS, + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS + 0x100000, + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS + 0x500000, + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS + 0x600000, + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS + 0x700000, + smnXGMI0_PCS_GOPX16_PCS_ERROR_STATUS + 0x800000, +}; + +/* same as vg20*/ +static const int wafl_pcs_err_status_reg_arct[] = { + smnPCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, + smnPCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS + 0x100000, +}; + +static const int xgmi3x16_pcs_err_status_reg_aldebaran[] = { + smnPCS_XGMI3X16_PCS_ERROR_STATUS, + smnPCS_XGMI3X16_PCS_ERROR_STATUS + 0x100000, + smnPCS_XGMI3X16_PCS_ERROR_STATUS + 0x200000, + smnPCS_XGMI3X16_PCS_ERROR_STATUS + 0x300000, + smnPCS_XGMI3X16_PCS_ERROR_STATUS + 0x400000, + smnPCS_XGMI3X16_PCS_ERROR_STATUS + 0x500000, + smnPCS_XGMI3X16_PCS_ERROR_STATUS + 0x600000, + smnPCS_XGMI3X16_PCS_ERROR_STATUS + 0x700000 +}; + +static const int walf_pcs_err_status_reg_aldebaran[] = { + smnPCS_GOPX1_PCS_ERROR_STATUS, + smnPCS_GOPX1_PCS_ERROR_STATUS + 0x100000 +}; + +static const struct amdgpu_pcs_ras_field xgmi_pcs_ras_fields[] = { + {"XGMI PCS DataLossErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, DataLossErr)}, + {"XGMI PCS TrainingErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, TrainingErr)}, + {"XGMI PCS CRCErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, CRCErr)}, + {"XGMI PCS BERExceededErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, BERExceededErr)}, + {"XGMI PCS TxMetaDataErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, TxMetaDataErr)}, + {"XGMI PCS ReplayBufParityErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, ReplayBufParityErr)}, + {"XGMI PCS DataParityErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, DataParityErr)}, + {"XGMI PCS ReplayFifoOverflowErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, ReplayFifoOverflowErr)}, + {"XGMI PCS ReplayFifoUnderflowErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, ReplayFifoUnderflowErr)}, + {"XGMI PCS ElasticFifoOverflowErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, ElasticFifoOverflowErr)}, + {"XGMI PCS DeskewErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, DeskewErr)}, + {"XGMI PCS DataStartupLimitErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, DataStartupLimitErr)}, + {"XGMI PCS FCInitTimeoutErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, FCInitTimeoutErr)}, + {"XGMI PCS RecoveryTimeoutErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, RecoveryTimeoutErr)}, + {"XGMI PCS ReadySerialTimeoutErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, ReadySerialTimeoutErr)}, + {"XGMI PCS ReadySerialAttemptErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, ReadySerialAttemptErr)}, + {"XGMI PCS RecoveryAttemptErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, RecoveryAttemptErr)}, + {"XGMI PCS RecoveryRelockAttemptErr", + SOC15_REG_FIELD(XGMI0_PCS_GOPX16_PCS_ERROR_STATUS, RecoveryRelockAttemptErr)}, +}; + +static const struct amdgpu_pcs_ras_field wafl_pcs_ras_fields[] = { + {"WAFL PCS DataLossErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, DataLossErr)}, + {"WAFL PCS TrainingErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, TrainingErr)}, + {"WAFL PCS CRCErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, CRCErr)}, + {"WAFL PCS BERExceededErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, BERExceededErr)}, + {"WAFL PCS TxMetaDataErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, TxMetaDataErr)}, + {"WAFL PCS ReplayBufParityErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, ReplayBufParityErr)}, + {"WAFL PCS DataParityErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, DataParityErr)}, + {"WAFL PCS ReplayFifoOverflowErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, ReplayFifoOverflowErr)}, + {"WAFL PCS ReplayFifoUnderflowErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, ReplayFifoUnderflowErr)}, + {"WAFL PCS ElasticFifoOverflowErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, ElasticFifoOverflowErr)}, + {"WAFL PCS DeskewErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, DeskewErr)}, + {"WAFL PCS DataStartupLimitErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, DataStartupLimitErr)}, + {"WAFL PCS FCInitTimeoutErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, FCInitTimeoutErr)}, + {"WAFL PCS RecoveryTimeoutErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, RecoveryTimeoutErr)}, + {"WAFL PCS ReadySerialTimeoutErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, ReadySerialTimeoutErr)}, + {"WAFL PCS ReadySerialAttemptErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, ReadySerialAttemptErr)}, + {"WAFL PCS RecoveryAttemptErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, RecoveryAttemptErr)}, + {"WAFL PCS RecoveryRelockAttemptErr", + SOC15_REG_FIELD(PCS_GOPX1_0_PCS_GOPX1_PCS_ERROR_STATUS, RecoveryRelockAttemptErr)}, +}; + +/** + * DOC: AMDGPU XGMI Support + * + * XGMI is a high speed interconnect that joins multiple GPU cards + * into a homogeneous memory space that is organized by a collective + * hive ID and individual node IDs, both of which are 64-bit numbers. + * + * The file xgmi_device_id contains the unique per GPU device ID and + * is stored in the /sys/class/drm/card${cardno}/device/ directory. + * + * Inside the device directory a sub-directory 'xgmi_hive_info' is + * created which contains the hive ID and the list of nodes. + * + * The hive ID is stored in: + * /sys/class/drm/card${cardno}/device/xgmi_hive_info/xgmi_hive_id + * + * The node information is stored in numbered directories: + * /sys/class/drm/card${cardno}/device/xgmi_hive_info/node${nodeno}/xgmi_device_id + * + * Each device has their own xgmi_hive_info direction with a mirror + * set of node sub-directories. + * + * The XGMI memory space is built by contiguously adding the power of + * two padded VRAM space from each node to each other. + * + */ + +static struct attribute amdgpu_xgmi_hive_id = { + .name = "xgmi_hive_id", + .mode = S_IRUGO +}; + +static struct attribute *amdgpu_xgmi_hive_attrs[] = { + &amdgpu_xgmi_hive_id, + NULL +}; +ATTRIBUTE_GROUPS(amdgpu_xgmi_hive); + +static ssize_t amdgpu_xgmi_show_attrs(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct amdgpu_hive_info *hive = container_of( + kobj, struct amdgpu_hive_info, kobj); + + if (attr == &amdgpu_xgmi_hive_id) + return snprintf(buf, PAGE_SIZE, "%llu\n", hive->hive_id); + + return 0; +} + +static void amdgpu_xgmi_hive_release(struct kobject *kobj) +{ + struct amdgpu_hive_info *hive = container_of( + kobj, struct amdgpu_hive_info, kobj); + + amdgpu_reset_put_reset_domain(hive->reset_domain); + hive->reset_domain = NULL; + + mutex_destroy(&hive->hive_lock); + kfree(hive); +} + +static const struct sysfs_ops amdgpu_xgmi_hive_ops = { + .show = amdgpu_xgmi_show_attrs, +}; + +struct kobj_type amdgpu_xgmi_hive_type = { + .release = amdgpu_xgmi_hive_release, + .sysfs_ops = &amdgpu_xgmi_hive_ops, + .default_groups = amdgpu_xgmi_hive_groups, +}; + +static ssize_t amdgpu_xgmi_show_device_id(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct drm_device *ddev = dev_get_drvdata(dev); + struct amdgpu_device *adev = drm_to_adev(ddev); + + return sysfs_emit(buf, "%llu\n", adev->gmc.xgmi.node_id); + +} + +#define AMDGPU_XGMI_SET_FICAA(o) ((o) | 0x456801) +static ssize_t amdgpu_xgmi_show_error(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct drm_device *ddev = dev_get_drvdata(dev); + struct amdgpu_device *adev = drm_to_adev(ddev); + uint32_t ficaa_pie_ctl_in, ficaa_pie_status_in; + uint64_t fica_out; + unsigned int error_count = 0; + + ficaa_pie_ctl_in = AMDGPU_XGMI_SET_FICAA(0x200); + ficaa_pie_status_in = AMDGPU_XGMI_SET_FICAA(0x208); + + if ((!adev->df.funcs) || + (!adev->df.funcs->get_fica) || + (!adev->df.funcs->set_fica)) + return -EINVAL; + + fica_out = adev->df.funcs->get_fica(adev, ficaa_pie_ctl_in); + if (fica_out != 0x1f) + pr_err("xGMI error counters not enabled!\n"); + + fica_out = adev->df.funcs->get_fica(adev, ficaa_pie_status_in); + + if ((fica_out & 0xffff) == 2) + error_count = ((fica_out >> 62) & 0x1) + (fica_out >> 63); + + adev->df.funcs->set_fica(adev, ficaa_pie_status_in, 0, 0); + + return sysfs_emit(buf, "%u\n", error_count); +} + + +static DEVICE_ATTR(xgmi_device_id, S_IRUGO, amdgpu_xgmi_show_device_id, NULL); +static DEVICE_ATTR(xgmi_error, S_IRUGO, amdgpu_xgmi_show_error, NULL); + +static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev, + struct amdgpu_hive_info *hive) +{ + int ret = 0; + char node[10] = { 0 }; + + /* Create xgmi device id file */ + ret = device_create_file(adev->dev, &dev_attr_xgmi_device_id); + if (ret) { + dev_err(adev->dev, "XGMI: Failed to create device file xgmi_device_id\n"); + return ret; + } + + /* Create xgmi error file */ + ret = device_create_file(adev->dev, &dev_attr_xgmi_error); + if (ret) + pr_err("failed to create xgmi_error\n"); + + + /* Create sysfs link to hive info folder on the first device */ + if (hive->kobj.parent != (&adev->dev->kobj)) { + ret = sysfs_create_link(&adev->dev->kobj, &hive->kobj, + "xgmi_hive_info"); + if (ret) { + dev_err(adev->dev, "XGMI: Failed to create link to hive info"); + goto remove_file; + } + } + + sprintf(node, "node%d", atomic_read(&hive->number_devices)); + /* Create sysfs link form the hive folder to yourself */ + ret = sysfs_create_link(&hive->kobj, &adev->dev->kobj, node); + if (ret) { + dev_err(adev->dev, "XGMI: Failed to create link from hive info"); + goto remove_link; + } + + goto success; + + +remove_link: + sysfs_remove_link(&adev->dev->kobj, adev_to_drm(adev)->unique); + +remove_file: + device_remove_file(adev->dev, &dev_attr_xgmi_device_id); + +success: + return ret; +} + +static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev, + struct amdgpu_hive_info *hive) +{ + char node[10]; + memset(node, 0, sizeof(node)); + + device_remove_file(adev->dev, &dev_attr_xgmi_device_id); + device_remove_file(adev->dev, &dev_attr_xgmi_error); + + if (hive->kobj.parent != (&adev->dev->kobj)) + sysfs_remove_link(&adev->dev->kobj,"xgmi_hive_info"); + + sprintf(node, "node%d", atomic_read(&hive->number_devices)); + sysfs_remove_link(&hive->kobj, node); + +} + + + +struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev) +{ + struct amdgpu_hive_info *hive = NULL; + int ret; + + if (!adev->gmc.xgmi.hive_id) + return NULL; + + if (adev->hive) { + kobject_get(&adev->hive->kobj); + return adev->hive; + } + + mutex_lock(&xgmi_mutex); + + list_for_each_entry(hive, &xgmi_hive_list, node) { + if (hive->hive_id == adev->gmc.xgmi.hive_id) + goto pro_end; + } + + hive = kzalloc(sizeof(*hive), GFP_KERNEL); + if (!hive) { + dev_err(adev->dev, "XGMI: allocation failed\n"); + hive = NULL; + goto pro_end; + } + + /* initialize new hive if not exist */ + ret = kobject_init_and_add(&hive->kobj, + &amdgpu_xgmi_hive_type, + &adev->dev->kobj, + "%s", "xgmi_hive_info"); + if (ret) { + dev_err(adev->dev, "XGMI: failed initializing kobject for xgmi hive\n"); + kobject_put(&hive->kobj); + hive = NULL; + goto pro_end; + } + + /** + * Only init hive->reset_domain for none SRIOV configuration. For SRIOV, + * Host driver decide how to reset the GPU either through FLR or chain reset. + * Guest side will get individual notifications from the host for the FLR + * if necessary. + */ + if (!amdgpu_sriov_vf(adev)) { + /** + * Avoid recreating reset domain when hive is reconstructed for the case + * of reset the devices in the XGMI hive during probe for passthrough GPU + * See https://www.spinics.net/lists/amd-gfx/msg58836.html + */ + if (adev->reset_domain->type != XGMI_HIVE) { + hive->reset_domain = + amdgpu_reset_create_reset_domain(XGMI_HIVE, "amdgpu-reset-hive"); + if (!hive->reset_domain) { + dev_err(adev->dev, "XGMI: failed initializing reset domain for xgmi hive\n"); + ret = -ENOMEM; + kobject_put(&hive->kobj); + hive = NULL; + goto pro_end; + } + } else { + amdgpu_reset_get_reset_domain(adev->reset_domain); + hive->reset_domain = adev->reset_domain; + } + } + + hive->hive_id = adev->gmc.xgmi.hive_id; + INIT_LIST_HEAD(&hive->device_list); + INIT_LIST_HEAD(&hive->node); + mutex_init(&hive->hive_lock); + atomic_set(&hive->number_devices, 0); + task_barrier_init(&hive->tb); + hive->pstate = AMDGPU_XGMI_PSTATE_UNKNOWN; + hive->hi_req_gpu = NULL; + + /* + * hive pstate on boot is high in vega20 so we have to go to low + * pstate on after boot. + */ + hive->hi_req_count = AMDGPU_MAX_XGMI_DEVICE_PER_HIVE; + list_add_tail(&hive->node, &xgmi_hive_list); + +pro_end: + if (hive) + kobject_get(&hive->kobj); + mutex_unlock(&xgmi_mutex); + return hive; +} + +void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive) +{ + if (hive) + kobject_put(&hive->kobj); +} + +int amdgpu_xgmi_set_pstate(struct amdgpu_device *adev, int pstate) +{ + int ret = 0; + struct amdgpu_hive_info *hive; + struct amdgpu_device *request_adev; + bool is_hi_req = pstate == AMDGPU_XGMI_PSTATE_MAX_VEGA20; + bool init_low; + + hive = amdgpu_get_xgmi_hive(adev); + if (!hive) + return 0; + + request_adev = hive->hi_req_gpu ? hive->hi_req_gpu : adev; + init_low = hive->pstate == AMDGPU_XGMI_PSTATE_UNKNOWN; + amdgpu_put_xgmi_hive(hive); + /* fw bug so temporarily disable pstate switching */ + return 0; + + if (!hive || adev->asic_type != CHIP_VEGA20) + return 0; + + mutex_lock(&hive->hive_lock); + + if (is_hi_req) + hive->hi_req_count++; + else + hive->hi_req_count--; + + /* + * Vega20 only needs single peer to request pstate high for the hive to + * go high but all peers must request pstate low for the hive to go low + */ + if (hive->pstate == pstate || + (!is_hi_req && hive->hi_req_count && !init_low)) + goto out; + + dev_dbg(request_adev->dev, "Set xgmi pstate %d.\n", pstate); + + ret = amdgpu_dpm_set_xgmi_pstate(request_adev, pstate); + if (ret) { + dev_err(request_adev->dev, + "XGMI: Set pstate failure on device %llx, hive %llx, ret %d", + request_adev->gmc.xgmi.node_id, + request_adev->gmc.xgmi.hive_id, ret); + goto out; + } + + if (init_low) + hive->pstate = hive->hi_req_count ? + hive->pstate : AMDGPU_XGMI_PSTATE_MIN; + else { + hive->pstate = pstate; + hive->hi_req_gpu = pstate != AMDGPU_XGMI_PSTATE_MIN ? + adev : NULL; + } +out: + mutex_unlock(&hive->hive_lock); + return ret; +} + +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev) +{ + int ret; + + if (amdgpu_sriov_vf(adev)) + return 0; + + /* Each psp need to set the latest topology */ + ret = psp_xgmi_set_topology_info(&adev->psp, + atomic_read(&hive->number_devices), + &adev->psp.xgmi_context.top_info); + if (ret) + dev_err(adev->dev, + "XGMI: Set topology failure on device %llx, hive %llx, ret %d", + adev->gmc.xgmi.node_id, + adev->gmc.xgmi.hive_id, ret); + + return ret; +} + + +/* + * NOTE psp_xgmi_node_info.num_hops layout is as follows: + * num_hops[7:6] = link type (0 = xGMI2, 1 = xGMI3, 2/3 = reserved) + * num_hops[5:3] = reserved + * num_hops[2:0] = number of hops + */ +int amdgpu_xgmi_get_hops_count(struct amdgpu_device *adev, + struct amdgpu_device *peer_adev) +{ + struct psp_xgmi_topology_info *top = &adev->psp.xgmi_context.top_info; + uint8_t num_hops_mask = 0x7; + int i; + + for (i = 0 ; i < top->num_nodes; ++i) + if (top->nodes[i].node_id == peer_adev->gmc.xgmi.node_id) + return top->nodes[i].num_hops & num_hops_mask; + return -EINVAL; +} + +int amdgpu_xgmi_get_num_links(struct amdgpu_device *adev, + struct amdgpu_device *peer_adev) +{ + struct psp_xgmi_topology_info *top = &adev->psp.xgmi_context.top_info; + int i; + + for (i = 0 ; i < top->num_nodes; ++i) + if (top->nodes[i].node_id == peer_adev->gmc.xgmi.node_id) + return top->nodes[i].num_links; + return -EINVAL; +} + +/* + * Devices that support extended data require the entire hive to initialize with + * the shared memory buffer flag set. + * + * Hive locks and conditions apply - see amdgpu_xgmi_add_device + */ +static int amdgpu_xgmi_initialize_hive_get_data_partition(struct amdgpu_hive_info *hive, + bool set_extended_data) +{ + struct amdgpu_device *tmp_adev; + int ret; + + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) { + ret = psp_xgmi_initialize(&tmp_adev->psp, set_extended_data, false); + if (ret) { + dev_err(tmp_adev->dev, + "XGMI: Failed to initialize xgmi session for data partition %i\n", + set_extended_data); + return ret; + } + + } + + return 0; +} + +int amdgpu_xgmi_add_device(struct amdgpu_device *adev) +{ + struct psp_xgmi_topology_info *top_info; + struct amdgpu_hive_info *hive; + struct amdgpu_xgmi *entry; + struct amdgpu_device *tmp_adev = NULL; + + int count = 0, ret = 0; + + if (!adev->gmc.xgmi.supported) + return 0; + + if (!adev->gmc.xgmi.pending_reset && + amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_PSP)) { + ret = psp_xgmi_initialize(&adev->psp, false, true); + if (ret) { + dev_err(adev->dev, + "XGMI: Failed to initialize xgmi session\n"); + return ret; + } + + ret = psp_xgmi_get_hive_id(&adev->psp, &adev->gmc.xgmi.hive_id); + if (ret) { + dev_err(adev->dev, + "XGMI: Failed to get hive id\n"); + return ret; + } + + ret = psp_xgmi_get_node_id(&adev->psp, &adev->gmc.xgmi.node_id); + if (ret) { + dev_err(adev->dev, + "XGMI: Failed to get node id\n"); + return ret; + } + } else { + adev->gmc.xgmi.hive_id = 16; + adev->gmc.xgmi.node_id = adev->gmc.xgmi.physical_node_id + 16; + } + + hive = amdgpu_get_xgmi_hive(adev); + if (!hive) { + ret = -EINVAL; + dev_err(adev->dev, + "XGMI: node 0x%llx, can not match hive 0x%llx in the hive list.\n", + adev->gmc.xgmi.node_id, adev->gmc.xgmi.hive_id); + goto exit; + } + mutex_lock(&hive->hive_lock); + + top_info = &adev->psp.xgmi_context.top_info; + + list_add_tail(&adev->gmc.xgmi.head, &hive->device_list); + list_for_each_entry(entry, &hive->device_list, head) + top_info->nodes[count++].node_id = entry->node_id; + top_info->num_nodes = count; + atomic_set(&hive->number_devices, count); + + task_barrier_add_task(&hive->tb); + + if (!adev->gmc.xgmi.pending_reset && + amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_PSP)) { + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) { + /* update node list for other device in the hive */ + if (tmp_adev != adev) { + top_info = &tmp_adev->psp.xgmi_context.top_info; + top_info->nodes[count - 1].node_id = + adev->gmc.xgmi.node_id; + top_info->num_nodes = count; + } + ret = amdgpu_xgmi_update_topology(hive, tmp_adev); + if (ret) + goto exit_unlock; + } + + /* get latest topology info for each device from psp */ + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) { + ret = psp_xgmi_get_topology_info(&tmp_adev->psp, count, + &tmp_adev->psp.xgmi_context.top_info, false); + if (ret) { + dev_err(tmp_adev->dev, + "XGMI: Get topology failure on device %llx, hive %llx, ret %d", + tmp_adev->gmc.xgmi.node_id, + tmp_adev->gmc.xgmi.hive_id, ret); + /* To do : continue with some node failed or disable the whole hive */ + goto exit_unlock; + } + } + + /* get topology again for hives that support extended data */ + if (adev->psp.xgmi_context.supports_extended_data) { + + /* initialize the hive to get extended data. */ + ret = amdgpu_xgmi_initialize_hive_get_data_partition(hive, true); + if (ret) + goto exit_unlock; + + /* get the extended data. */ + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) { + ret = psp_xgmi_get_topology_info(&tmp_adev->psp, count, + &tmp_adev->psp.xgmi_context.top_info, true); + if (ret) { + dev_err(tmp_adev->dev, + "XGMI: Get topology for extended data failure on device %llx, hive %llx, ret %d", + tmp_adev->gmc.xgmi.node_id, + tmp_adev->gmc.xgmi.hive_id, ret); + goto exit_unlock; + } + } + + /* initialize the hive to get non-extended data for the next round. */ + ret = amdgpu_xgmi_initialize_hive_get_data_partition(hive, false); + if (ret) + goto exit_unlock; + + } + } + + if (!ret && !adev->gmc.xgmi.pending_reset) + ret = amdgpu_xgmi_sysfs_add_dev_info(adev, hive); + +exit_unlock: + mutex_unlock(&hive->hive_lock); +exit: + if (!ret) { + adev->hive = hive; + dev_info(adev->dev, "XGMI: Add node %d, hive 0x%llx.\n", + adev->gmc.xgmi.physical_node_id, adev->gmc.xgmi.hive_id); + } else { + amdgpu_put_xgmi_hive(hive); + dev_err(adev->dev, "XGMI: Failed to add node %d, hive 0x%llx ret: %d\n", + adev->gmc.xgmi.physical_node_id, adev->gmc.xgmi.hive_id, + ret); + } + + return ret; +} + +int amdgpu_xgmi_remove_device(struct amdgpu_device *adev) +{ + struct amdgpu_hive_info *hive = adev->hive; + + if (!adev->gmc.xgmi.supported) + return -EINVAL; + + if (!hive) + return -EINVAL; + + mutex_lock(&hive->hive_lock); + task_barrier_rem_task(&hive->tb); + amdgpu_xgmi_sysfs_rem_dev_info(adev, hive); + if (hive->hi_req_gpu == adev) + hive->hi_req_gpu = NULL; + list_del(&adev->gmc.xgmi.head); + mutex_unlock(&hive->hive_lock); + + amdgpu_put_xgmi_hive(hive); + adev->hive = NULL; + + if (atomic_dec_return(&hive->number_devices) == 0) { + /* Remove the hive from global hive list */ + mutex_lock(&xgmi_mutex); + list_del(&hive->node); + mutex_unlock(&xgmi_mutex); + + amdgpu_put_xgmi_hive(hive); + } + + return 0; +} + +static int amdgpu_xgmi_ras_late_init(struct amdgpu_device *adev, struct ras_common_if *ras_block) +{ + if (!adev->gmc.xgmi.supported || + adev->gmc.xgmi.num_physical_nodes == 0) + return 0; + + adev->gmc.xgmi.ras->ras_block.hw_ops->reset_ras_error_count(adev); + + return amdgpu_ras_block_late_init(adev, ras_block); +} + +uint64_t amdgpu_xgmi_get_relative_phy_addr(struct amdgpu_device *adev, + uint64_t addr) +{ + struct amdgpu_xgmi *xgmi = &adev->gmc.xgmi; + return (addr + xgmi->physical_node_id * xgmi->node_segment_size); +} + +static void pcs_clear_status(struct amdgpu_device *adev, uint32_t pcs_status_reg) +{ + WREG32_PCIE(pcs_status_reg, 0xFFFFFFFF); + WREG32_PCIE(pcs_status_reg, 0); +} + +static void amdgpu_xgmi_reset_ras_error_count(struct amdgpu_device *adev) +{ + uint32_t i; + + switch (adev->asic_type) { + case CHIP_ARCTURUS: + for (i = 0; i < ARRAY_SIZE(xgmi_pcs_err_status_reg_arct); i++) + pcs_clear_status(adev, + xgmi_pcs_err_status_reg_arct[i]); + break; + case CHIP_VEGA20: + for (i = 0; i < ARRAY_SIZE(xgmi_pcs_err_status_reg_vg20); i++) + pcs_clear_status(adev, + xgmi_pcs_err_status_reg_vg20[i]); + break; + case CHIP_ALDEBARAN: + for (i = 0; i < ARRAY_SIZE(xgmi3x16_pcs_err_status_reg_aldebaran); i++) + pcs_clear_status(adev, + xgmi3x16_pcs_err_status_reg_aldebaran[i]); + for (i = 0; i < ARRAY_SIZE(walf_pcs_err_status_reg_aldebaran); i++) + pcs_clear_status(adev, + walf_pcs_err_status_reg_aldebaran[i]); + break; + default: + break; + } +} + +static int amdgpu_xgmi_query_pcs_error_status(struct amdgpu_device *adev, + uint32_t value, + uint32_t *ue_count, + uint32_t *ce_count, + bool is_xgmi_pcs) +{ + int i; + int ue_cnt; + + if (is_xgmi_pcs) { + /* query xgmi pcs error status, + * only ue is supported */ + for (i = 0; i < ARRAY_SIZE(xgmi_pcs_ras_fields); i ++) { + ue_cnt = (value & + xgmi_pcs_ras_fields[i].pcs_err_mask) >> + xgmi_pcs_ras_fields[i].pcs_err_shift; + if (ue_cnt) { + dev_info(adev->dev, "%s detected\n", + xgmi_pcs_ras_fields[i].err_name); + *ue_count += ue_cnt; + } + } + } else { + /* query wafl pcs error status, + * only ue is supported */ + for (i = 0; i < ARRAY_SIZE(wafl_pcs_ras_fields); i++) { + ue_cnt = (value & + wafl_pcs_ras_fields[i].pcs_err_mask) >> + wafl_pcs_ras_fields[i].pcs_err_shift; + if (ue_cnt) { + dev_info(adev->dev, "%s detected\n", + wafl_pcs_ras_fields[i].err_name); + *ue_count += ue_cnt; + } + } + } + + return 0; +} + +static void amdgpu_xgmi_query_ras_error_count(struct amdgpu_device *adev, + void *ras_error_status) +{ + struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status; + int i; + uint32_t data; + uint32_t ue_cnt = 0, ce_cnt = 0; + + if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__XGMI_WAFL)) + return ; + + err_data->ue_count = 0; + err_data->ce_count = 0; + + switch (adev->asic_type) { + case CHIP_ARCTURUS: + /* check xgmi pcs error */ + for (i = 0; i < ARRAY_SIZE(xgmi_pcs_err_status_reg_arct); i++) { + data = RREG32_PCIE(xgmi_pcs_err_status_reg_arct[i]); + if (data) + amdgpu_xgmi_query_pcs_error_status(adev, + data, &ue_cnt, &ce_cnt, true); + } + /* check wafl pcs error */ + for (i = 0; i < ARRAY_SIZE(wafl_pcs_err_status_reg_arct); i++) { + data = RREG32_PCIE(wafl_pcs_err_status_reg_arct[i]); + if (data) + amdgpu_xgmi_query_pcs_error_status(adev, + data, &ue_cnt, &ce_cnt, false); + } + break; + case CHIP_VEGA20: + /* check xgmi pcs error */ + for (i = 0; i < ARRAY_SIZE(xgmi_pcs_err_status_reg_vg20); i++) { + data = RREG32_PCIE(xgmi_pcs_err_status_reg_vg20[i]); + if (data) + amdgpu_xgmi_query_pcs_error_status(adev, + data, &ue_cnt, &ce_cnt, true); + } + /* check wafl pcs error */ + for (i = 0; i < ARRAY_SIZE(wafl_pcs_err_status_reg_vg20); i++) { + data = RREG32_PCIE(wafl_pcs_err_status_reg_vg20[i]); + if (data) + amdgpu_xgmi_query_pcs_error_status(adev, + data, &ue_cnt, &ce_cnt, false); + } + break; + case CHIP_ALDEBARAN: + /* check xgmi3x16 pcs error */ + for (i = 0; i < ARRAY_SIZE(xgmi3x16_pcs_err_status_reg_aldebaran); i++) { + data = RREG32_PCIE(xgmi3x16_pcs_err_status_reg_aldebaran[i]); + if (data) + amdgpu_xgmi_query_pcs_error_status(adev, + data, &ue_cnt, &ce_cnt, true); + } + /* check wafl pcs error */ + for (i = 0; i < ARRAY_SIZE(walf_pcs_err_status_reg_aldebaran); i++) { + data = RREG32_PCIE(walf_pcs_err_status_reg_aldebaran[i]); + if (data) + amdgpu_xgmi_query_pcs_error_status(adev, + data, &ue_cnt, &ce_cnt, false); + } + break; + default: + dev_warn(adev->dev, "XGMI RAS error query not supported"); + break; + } + + adev->gmc.xgmi.ras->ras_block.hw_ops->reset_ras_error_count(adev); + + err_data->ue_count += ue_cnt; + err_data->ce_count += ce_cnt; +} + +/* Trigger XGMI/WAFL error */ +static int amdgpu_ras_error_inject_xgmi(struct amdgpu_device *adev, void *inject_if) +{ + int ret = 0; + struct ta_ras_trigger_error_input *block_info = + (struct ta_ras_trigger_error_input *)inject_if; + + if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW)) + dev_warn(adev->dev, "Failed to disallow df cstate"); + + if (amdgpu_dpm_allow_xgmi_power_down(adev, false)) + dev_warn(adev->dev, "Failed to disallow XGMI power down"); + + ret = psp_ras_trigger_error(&adev->psp, block_info); + + if (amdgpu_ras_intr_triggered()) + return ret; + + if (amdgpu_dpm_allow_xgmi_power_down(adev, true)) + dev_warn(adev->dev, "Failed to allow XGMI power down"); + + if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_ALLOW)) + dev_warn(adev->dev, "Failed to allow df cstate"); + + return ret; +} + +struct amdgpu_ras_block_hw_ops xgmi_ras_hw_ops = { + .query_ras_error_count = amdgpu_xgmi_query_ras_error_count, + .reset_ras_error_count = amdgpu_xgmi_reset_ras_error_count, + .ras_error_inject = amdgpu_ras_error_inject_xgmi, +}; + +struct amdgpu_xgmi_ras xgmi_ras = { + .ras_block = { + .ras_comm = { + .name = "xgmi_wafl", + .block = AMDGPU_RAS_BLOCK__XGMI_WAFL, + .type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE, + }, + .hw_ops = &xgmi_ras_hw_ops, + .ras_late_init = amdgpu_xgmi_ras_late_init, + }, +}; -- cgit v1.2.3