diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst index 3087bea715ed2..ca75b35416792 100644 --- a/Documentation/core-api/dma-api.rst +++ b/Documentation/core-api/dma-api.rst @@ -761,7 +761,7 @@ example warning message may look like this:: [] find_busiest_group+0x207/0x8a0 [] _spin_lock_irqsave+0x1f/0x50 [] check_unmap+0x203/0x490 - [] debug_dma_unmap_page+0x49/0x50 + [] debug_dma_unmap_phys+0x49/0x50 [] nv_tx_done_optimized+0xc6/0x2c0 [] nv_nic_irq_optimized+0x73/0x2b0 [] handle_IRQ_event+0x34/0x70 @@ -855,7 +855,7 @@ that a driver may be leaking mappings. dma-debug interface debug_dma_mapping_error() to debug drivers that fail to check DMA mapping errors on addresses returned by dma_map_single() and dma_map_page() interfaces. This interface clears a flag set by -debug_dma_map_page() to indicate that dma_mapping_error() has been called by +debug_dma_map_phys() to indicate that dma_mapping_error() has been called by the driver. When driver does unmap, debug_dma_unmap() checks the flag and if this flag is still set, prints warning message that includes call trace that leads up to the unmap. This interface can be called from dma_mapping_error() diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst index 1887d92e8e926..0bdc2be65e575 100644 --- a/Documentation/core-api/dma-attributes.rst +++ b/Documentation/core-api/dma-attributes.rst @@ -130,3 +130,21 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged subsystem that the buffer is fully accessible at the elevated privilege level (and ideally inaccessible or at least read-only at the lesser-privileged levels). + +DMA_ATTR_MMIO +------------- + +This attribute indicates the physical address is not normal system +memory. It may not be used with kmap*()/phys_to_virt()/phys_to_page() +functions, it may not be cacheable, and access using CPU load/store +instructions may not be allowed. + +Usually this will be used to describe MMIO addresses, or other non-cacheable +register addresses. When DMA mapping this sort of address we call +the operation Peer to Peer as a one device is DMA'ing to another device. +For PCI devices the p2pdma APIs must be used to determine if +DMA_ATTR_MMIO is appropriate. + +For architectures that require cache flushing for DMA coherence +DMA_ATTR_MMIO will not perform any cache flushing. The address +provided must never be mapped cacheable into the CPU. diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst index d0b241628cf13..280673b50350b 100644 --- a/Documentation/driver-api/pci/p2pdma.rst +++ b/Documentation/driver-api/pci/p2pdma.rst @@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth called Peer-to-Peer (or P2P). However, there are a number of issues that make P2P transactions tricky to do in a perfectly safe way. -One of the biggest issues is that PCI doesn't require forwarding -transactions between hierarchy domains, and in PCIe, each Root Port -defines a separate hierarchy domain. To make things worse, there is no -simple way to determine if a given Root Complex supports this or not. -(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel -only supports doing P2P when the endpoints involved are all behind the -same PCI bridge, as such devices are all in the same PCI hierarchy -domain, and the spec guarantees that all transactions within the -hierarchy will be routable, but it does not require routing -between hierarchies. 
- -The second issue is that to make use of existing interfaces in Linux, -memory that is used for P2P transactions needs to be backed by struct -pages. However, PCI BARs are not typically cache coherent so there are -a few corner case gotchas with these pages so developers need to -be careful about what they do with them. +For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up +until they reach a host bridge or root port. If the path includes PCIe switches +then based on the ACS settings the transaction can route entirely within +the PCIe hierarchy and never reach the root port. The kernel will evaluate +the PCIe topology and always permit P2P in these well-defined cases. + +However, if the P2P transaction reaches the host bridge then it might have to +hairpin back out the same root port, be routed inside the CPU SOC to another +PCIe root port, or routed internally to the SOC. + +The PCIe specification doesn't define the forwarding of transactions between +hierarchy domains and kernel defaults to blocking such routing. There is an +allow list to allow detecting known-good HW, in which case P2P between any +two PCIe devices will be permitted. + +Since P2P inherently is doing transactions between two devices it requires two +drivers to be co-operating inside the kernel. The providing driver has to convey +its MMIO to the consuming driver. To meet the driver model lifecycle rules the +MMIO must have all DMA mapping removed, all CPU accesses prevented, all page +table mappings undone before the providing driver completes remove(). + +This requires the providing and consuming driver to actively work together to +guarantee that the consuming driver has stopped using the MMIO during a removal +cycle. This is done by either a synchronous invalidation shutdown or waiting +for all usage refcounts to reach zero. + +At the lowest level the P2P subsystem offers a naked struct p2p_provider that +delegates lifecycle management to the providing driver. It is expected that +drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF +to provide an invalidation shutdown. These MMIO addresess have no struct page, and +if used with mmap() must create special PTEs. As such there are very few +kernel uAPIs that can accept pointers to them; in particular they cannot be used +with read()/write(), including O_DIRECT. + +Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE +pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of +pgmap ensures that when the pgmap is destroyed all other drivers have stopped +using the MMIO. This option works with O_DIRECT flows, in some cases, if the +underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through +FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap +it also relies on architecture support along with alignment and minimum size +limitations. Driver Writer's Guide @@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory. Struct Page Caveats ------------------- -Driver writers should be very careful about not passing these special -struct pages to code that isn't prepared for it. At this time, the kernel -interfaces do not have any checks for ensuring this. This obviously -precludes passing these pages to userspace. +While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs, +pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set. 
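A minimal sketch, illustrative only and not part of this patch, of how a consuming subsystem that has been scrubbed for P2P support might opt in. The helper name and the choice of pin_user_pages_fast() are assumptions; only the FOLL_PCI_P2PDMA requirement comes from the paragraph above::

    #include <linux/mm.h>

    /* Hypothetical helper: pin a user buffer while accepting P2PDMA pages. */
    static int example_pin_user_buffer(unsigned long uaddr, int nr_pages,
                                       struct page **pages)
    {
            int pinned;

            /*
             * Without FOLL_PCI_P2PDMA, MEMORY_DEVICE_PCI_P2PDMA pages in the
             * VMA are rejected. FOLL_LONGTERM must not be combined with it.
             */
            pinned = pin_user_pages_fast(uaddr, nr_pages,
                                         FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
            if (pinned < 0)
                    return pinned;
            if (pinned != nr_pages) {
                    unpin_user_pages(pages, pinned);
                    return -EFAULT;
            }
            return 0;
    }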
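Similarly, for the DMA_ATTR_MMIO attribute documented in dma-attributes.rst above, a hedged sketch of an orchestrator mapping a peer BAR for DMA. dma_map_phys() is assumed here to be the driver-facing counterpart of the .map_phys op this patch introduces; treat the call as illustrative::

    #include <linux/dma-mapping.h>

    static dma_addr_t example_map_peer_bar(struct device *dma_dev,
                                           phys_addr_t bar_phys, size_t len)
    {
            /*
             * DMA_ATTR_MMIO: the address has no struct page, gets no cache
             * maintenance, and must never be fed to phys_to_virt()/kmap*().
             */
            return dma_map_phys(dma_dev, bar_phys, len, DMA_BIDIRECTIONAL,
                                DMA_ATTR_MMIO);
    }

The caller is still expected to check the returned handle with dma_mapping_error() before programming it into the initiating device.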
-P2P memory is also technically IO memory but should never have any side -effects behind it. Thus, the order of loads and stores should not be important -and ioreadX(), iowriteX() and friends should not be necessary. +The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The +KVA is still MMIO and must still be accessed through the normal +readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just +like any other MMIO mapping. While this will actually work on some +architectures, others will experience corruption or just crash in the kernel. +Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU +access happens. + + +Usage With DMABUF +================= + +DMABUF provides an alternative to the above struct page-based +client/provider/orchestrator system and should be used when struct page +doesn't exist. In this mode the exporting driver will wrap +some of its MMIO in a DMABUF and give the DMABUF FD to userspace. + +Userspace can then pass the FD to an importing driver which will ask the +exporting driver to map it to the importer. + +In this case the initiator and target pci_devices are known and the P2P subsystem +is used to determine the mapping type. The phys_addr_t-based DMA API is used to +establish the dma_addr_t. + +Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants +to remove() it must deliver an invalidation shutdown to all DMABUF importing +drivers through move_notify() and synchronously DMA unmap all the MMIO. + +No importing driver can continue to have a DMA map to the MMIO after the +exporting driver has destroyed its p2p_provider. P2P DMA Support Library diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index b280b06eaecc7..8aaced7dbc4e9 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -181,8 +181,20 @@ flag KVM_VM_MIPS_VZ. ARM64: ^^^^^^ -On arm64, the physical address size for a VM (IPA Size limit) is limited -to 40bits by default. The limit can be configured if the host supports the +On arm64, the machine type identifier is used to encode a type and the +physical address size for the VM. The lower byte (bits[7-0]) encode the +address size and the upper bits[11-8] encode a machine type. The machine +types that might be available are: + + ====================== ============================================ + KVM_VM_TYPE_ARM_NORMAL A standard VM + KVM_VM_TYPE_ARM_REALM A "Realm" VM using the Arm Confidential + Compute extensions, the VM's memory is + protected from the host. + ====================== ============================================ + +The physical address size for a VM (IPA Size limit) is limited to 40bits +by default. The limit can be configured if the host supports the extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use KVM_VM_TYPE_ARM_IPA_SIZE(IPA_Bits) to set the size in the machine type identifier, where IPA_Bits is the maximum width of any physical @@ -1295,6 +1307,8 @@ User space may need to inject several types of events to the guest. Set the pending SError exception state for this VCPU. It is not possible to 'cancel' an Serror that has been made pending. +User space cannot inject SErrors into Realms. + If the guest performed an access to I/O memory which could not be handled by userspace, for example because of missing instruction syndrome decode information or because there is no device mapped at the accessed IPA, then @@ -3550,6 +3564,11 @@ Possible features: Depends on KVM_CAP_ARM_EL2_E2H0. 
KVM_ARM_VCPU_HAS_EL2 must also be set. + - KVM_ARM_VCPU_REC: Allocate a REC (Realm Execution Context) for this + VCPU. This must be specified on all VCPUs created in a Realm VM. + Depends on KVM_CAP_ARM_RME. + Requires KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_REC). + 4.83 KVM_ARM_PREFERRED_TARGET ----------------------------- @@ -5123,6 +5142,7 @@ Recognised values for feature: ===== =========================================== arm64 KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE) + arm64 KVM_ARM_VCPU_REC (requires KVM_CAP_ARM_RME) ===== =========================================== Finalizes the configuration of the specified vcpu feature. @@ -6477,6 +6497,30 @@ the capability to be present. `flags` must currently be zero. +4.144 KVM_ARM_VCPU_RMM_PSCI_COMPLETE +------------------------------------ + +:Capability: KVM_CAP_ARM_RME +:Architectures: arm64 +:Type: vcpu ioctl +:Parameters: struct kvm_arm_rmm_psci_complete (in) +:Returns: 0 if successful, < 0 on error + +:: + + struct kvm_arm_rmm_psci_complete { + __u64 target_mpidr; + __u32 psci_status; + __u32 padding[3]; + }; + +Where PSCI functions are handled by user space, the RMM needs to be informed of +the target of the operation using `target_mpidr`, along with the status +(`psci_status`). The RMM v1.0 specification defines two functions that require +this call: PSCI_CPU_ON and PSCI_AFFINITY_INFO. + +If the kernel is handling PSCI then this is done automatically and the VMM +doesn't need to call this ioctl. .. _kvm_run: @@ -8726,6 +8770,47 @@ This capability indicate to the userspace whether a PFNMAP memory region can be safely mapped as cacheable. This relies on the presence of force write back (FWB) feature support on the hardware. +7.44 KVM_CAP_ARM_RME +-------------------- + +:Architectures: arm64 +:Target: VM +:Parameters: args[0] provides an action, args[1] points to a structure in + memory for the action. +:Returns: 0 on success, negative value on error + +Used to configure and set up the memory for a Realm. The available actions are: + +================================= ============================================= + KVM_CAP_ARM_RME_CONFIG_REALM Takes struct arm_rme_config as args[1] and + configures realm parameters prior to it being + created. + + Options are ARM_RME_CONFIG_RPV to set the + "Realm Personalization Value" and + ARM_RME_CONFIG_HASH_ALGO to set the hash + algorithm. + + KVM_CAP_ARM_RME_CREATE_REALM Request the RMM to create the realm. The + realm's configuration parameters must be set + first. + + KVM_CAP_ARM_RME_INIT_RIPAS_REALM Takes struct arm_rme_init_ripas as args[1] + and sets the RIPAS (Realm IPA State) to + RIPAS_RAM of a specified area of the realm's + IPA. + + KVM_CAP_ARM_RME_POPULATE_REALM Takes struct arm_rme_populate_realm as + args[1] and populates a region of protected + address space by copying the data from the + shared alias. + + KVM_CAP_ARM_RME_ACTIVATE_REALM Request the RMM to activate the realm. No + changes can be made to the Realm's populated + memory, IPA state, configuration parameters + or vCPU additions after this step. +================================= ============================================= + 7.45 KVM_CAP_ARM_SEA_TO_USER ---------------------------- @@ -9005,6 +9090,9 @@ is supported, than the other should as well and vice versa. For arm64 see Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL". For x86 see Documentation/virt/kvm/x86/msr.rst "MSR_KVM_STEAL_TIME". 
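As a rough illustration of the flow described above, a hedged VMM-side sketch tying the machine type and the KVM_CAP_ARM_RME actions together. Error handling, the RIPAS/populate steps and vCPU creation with KVM_ARM_VCPU_REC are elided, and the uapi names are taken from the hunks above::

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int example_create_realm(int kvm_fd, int ipa_bits)
    {
            struct arm_rme_config cfg = {
                    .cfg = ARM_RME_CONFIG_HASH_ALGO,
                    .hash_algo = ARM_RME_CONFIG_HASH_ALGO_SHA256,
            };
            struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_RME };
            int vm_fd;

            /* Realm machine type plus the IPA size, as documented above. */
            vm_fd = ioctl(kvm_fd, KVM_CREATE_VM,
                          KVM_VM_TYPE_ARM_REALM | KVM_VM_TYPE_ARM_IPA_SIZE(ipa_bits));

            /* Configure parameters before the realm is created. */
            cap.args[0] = KVM_CAP_ARM_RME_CONFIG_REALM;
            cap.args[1] = (__u64)(unsigned long)&cfg;
            ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

            cap.args[0] = KVM_CAP_ARM_RME_CREATE_REALM;
            cap.args[1] = 0;
            ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

            /* ... init RIPAS, populate memory, create vCPUs with KVM_ARM_VCPU_REC ... */

            cap.args[0] = KVM_CAP_ARM_RME_ACTIVATE_REALM;
            ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
            return vm_fd;
    }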
+Note that steal time accounting is not available when a guest is running +within a Arm CCA realm (machine type KVM_VM_TYPE_ARM_REALM). + 8.25 KVM_CAP_S390_DIAG318 ------------------------- diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c index dc91de50f906d..955b6ca616276 100644 --- a/arch/alpha/kernel/pci_iommu.c +++ b/arch/alpha/kernel/pci_iommu.c @@ -224,28 +224,26 @@ static int pci_dac_dma_supported(struct pci_dev *dev, u64 mask) until either pci_unmap_single or pci_dma_sync_single is performed. */ static dma_addr_t -pci_map_single_1(struct pci_dev *pdev, void *cpu_addr, size_t size, +pci_map_single_1(struct pci_dev *pdev, phys_addr_t paddr, size_t size, int dac_allowed) { struct pci_controller *hose = pdev ? pdev->sysdata : pci_isa_hose; dma_addr_t max_dma = pdev ? pdev->dma_mask : ISA_DMA_MASK; + unsigned long offset = offset_in_page(paddr); struct pci_iommu_arena *arena; long npages, dma_ofs, i; - unsigned long paddr; dma_addr_t ret; unsigned int align = 0; struct device *dev = pdev ? &pdev->dev : NULL; - paddr = __pa(cpu_addr); - #if !DEBUG_NODIRECT /* First check to see if we can use the direct map window. */ if (paddr + size + __direct_map_base - 1 <= max_dma && paddr + size <= __direct_map_size) { ret = paddr + __direct_map_base; - DBGA2("pci_map_single: [%p,%zx] -> direct %llx from %ps\n", - cpu_addr, size, ret, __builtin_return_address(0)); + DBGA2("pci_map_single: [%pa,%zx] -> direct %llx from %ps\n", + &paddr, size, ret, __builtin_return_address(0)); return ret; } @@ -255,8 +253,8 @@ pci_map_single_1(struct pci_dev *pdev, void *cpu_addr, size_t size, if (dac_allowed) { ret = paddr + alpha_mv.pci_dac_offset; - DBGA2("pci_map_single: [%p,%zx] -> DAC %llx from %ps\n", - cpu_addr, size, ret, __builtin_return_address(0)); + DBGA2("pci_map_single: [%pa,%zx] -> DAC %llx from %ps\n", + &paddr, size, ret, __builtin_return_address(0)); return ret; } @@ -290,10 +288,10 @@ pci_map_single_1(struct pci_dev *pdev, void *cpu_addr, size_t size, arena->ptes[i + dma_ofs] = mk_iommu_pte(paddr); ret = arena->dma_base + dma_ofs * PAGE_SIZE; - ret += (unsigned long)cpu_addr & ~PAGE_MASK; + ret += offset; - DBGA2("pci_map_single: [%p,%zx] np %ld -> sg %llx from %ps\n", - cpu_addr, size, npages, ret, __builtin_return_address(0)); + DBGA2("pci_map_single: [%pa,%zx] np %ld -> sg %llx from %ps\n", + &paddr, size, npages, ret, __builtin_return_address(0)); return ret; } @@ -322,19 +320,18 @@ static struct pci_dev *alpha_gendev_to_pci(struct device *dev) return NULL; } -static dma_addr_t alpha_pci_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, +static dma_addr_t alpha_pci_map_phys(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction dir, unsigned long attrs) { struct pci_dev *pdev = alpha_gendev_to_pci(dev); int dac_allowed; - BUG_ON(dir == DMA_NONE); + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; - dac_allowed = pdev ? pci_dac_dma_supported(pdev, pdev->dma_mask) : 0; - return pci_map_single_1(pdev, (char *)page_address(page) + offset, - size, dac_allowed); + dac_allowed = pdev ? pci_dac_dma_supported(pdev, pdev->dma_mask) : 0; + return pci_map_single_1(pdev, phys, size, dac_allowed); } /* Unmap a single streaming mode DMA translation. The DMA_ADDR and @@ -343,7 +340,7 @@ static dma_addr_t alpha_pci_map_page(struct device *dev, struct page *page, the cpu to the buffer are guaranteed to see whatever the device wrote there. 
*/ -static void alpha_pci_unmap_page(struct device *dev, dma_addr_t dma_addr, +static void alpha_pci_unmap_phys(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { @@ -353,8 +350,6 @@ static void alpha_pci_unmap_page(struct device *dev, dma_addr_t dma_addr, struct pci_iommu_arena *arena; long dma_ofs, npages; - BUG_ON(dir == DMA_NONE); - if (dma_addr >= __direct_map_base && dma_addr < __direct_map_base + __direct_map_size) { /* Nothing to do. */ @@ -429,7 +424,7 @@ static void *alpha_pci_alloc_coherent(struct device *dev, size_t size, } memset(cpu_addr, 0, size); - *dma_addrp = pci_map_single_1(pdev, cpu_addr, size, 0); + *dma_addrp = pci_map_single_1(pdev, virt_to_phys(cpu_addr), size, 0); if (*dma_addrp == DMA_MAPPING_ERROR) { free_pages((unsigned long)cpu_addr, order); if (alpha_mv.mv_pci_tbi || (gfp & GFP_DMA)) @@ -643,9 +638,8 @@ static int alpha_pci_map_sg(struct device *dev, struct scatterlist *sg, /* Fast path single entry scatterlists. */ if (nents == 1) { sg->dma_length = sg->length; - sg->dma_address - = pci_map_single_1(pdev, SG_ENT_VIRT_ADDRESS(sg), - sg->length, dac_allowed); + sg->dma_address = pci_map_single_1(pdev, sg_phys(sg), + sg->length, dac_allowed); if (sg->dma_address == DMA_MAPPING_ERROR) return -EIO; return 1; @@ -917,8 +911,8 @@ iommu_unbind(struct pci_iommu_arena *arena, long pg_start, long pg_count) const struct dma_map_ops alpha_pci_ops = { .alloc = alpha_pci_alloc_coherent, .free = alpha_pci_free_coherent, - .map_page = alpha_pci_map_page, - .unmap_page = alpha_pci_unmap_page, + .map_phys = alpha_pci_map_phys, + .unmap_phys = alpha_pci_unmap_phys, .map_sg = alpha_pci_map_sg, .unmap_sg = alpha_pci_unmap_sg, .dma_supported = alpha_pci_supported, diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index 88c2d68a69c9e..a6606ba0584f4 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -624,16 +624,14 @@ static void __arm_dma_free(struct device *dev, size_t size, void *cpu_addr, kfree(buf); } -static void dma_cache_maint_page(struct page *page, unsigned long offset, - size_t size, enum dma_data_direction dir, +static void dma_cache_maint_page(phys_addr_t phys, size_t size, + enum dma_data_direction dir, void (*op)(const void *, size_t, int)) { - unsigned long pfn; + unsigned long offset = offset_in_page(phys); + unsigned long pfn = __phys_to_pfn(phys); size_t left = size; - pfn = page_to_pfn(page) + offset / PAGE_SIZE; - offset %= PAGE_SIZE; - /* * A single sg entry may refer to multiple physically contiguous * pages. But we still need to process highmem pages individually. 
@@ -644,17 +642,18 @@ static void dma_cache_maint_page(struct page *page, unsigned long offset, size_t len = left; void *vaddr; - page = pfn_to_page(pfn); - - if (PageHighMem(page)) { + phys = __pfn_to_phys(pfn); + if (PhysHighMem(phys)) { if (len + offset > PAGE_SIZE) len = PAGE_SIZE - offset; if (cache_is_vipt_nonaliasing()) { - vaddr = kmap_atomic(page); + vaddr = kmap_atomic_pfn(pfn); op(vaddr + offset, len, dir); kunmap_atomic(vaddr); } else { + struct page *page = phys_to_page(phys); + vaddr = kmap_high_get(page); if (vaddr) { op(vaddr + offset, len, dir); @@ -662,7 +661,8 @@ static void dma_cache_maint_page(struct page *page, unsigned long offset, } } } else { - vaddr = page_address(page) + offset; + phys += offset; + vaddr = phys_to_virt(phys); op(vaddr, len, dir); } offset = 0; @@ -676,14 +676,11 @@ static void dma_cache_maint_page(struct page *page, unsigned long offset, * Note: Drivers should NOT use this function directly. * Use the driver DMA support - see dma-mapping.h (dma_sync_*) */ -static void __dma_page_cpu_to_dev(struct page *page, unsigned long off, - size_t size, enum dma_data_direction dir) +void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, + enum dma_data_direction dir) { - phys_addr_t paddr; + dma_cache_maint_page(paddr, size, dir, dmac_map_area); - dma_cache_maint_page(page, off, size, dir, dmac_map_area); - - paddr = page_to_phys(page) + off; if (dir == DMA_FROM_DEVICE) { outer_inv_range(paddr, paddr + size); } else { @@ -692,17 +689,15 @@ static void __dma_page_cpu_to_dev(struct page *page, unsigned long off, /* FIXME: non-speculating: flush on bidirectional mappings? */ } -static void __dma_page_dev_to_cpu(struct page *page, unsigned long off, - size_t size, enum dma_data_direction dir) +void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size, + enum dma_data_direction dir) { - phys_addr_t paddr = page_to_phys(page) + off; - /* FIXME: non-speculating: not required */ /* in any case, don't bother invalidating if DMA to device */ if (dir != DMA_TO_DEVICE) { outer_inv_range(paddr, paddr + size); - dma_cache_maint_page(page, off, size, dir, dmac_unmap_area); + dma_cache_maint_page(paddr, size, dir, dmac_unmap_area); } /* @@ -737,6 +732,9 @@ static int __dma_info_to_prot(enum dma_data_direction dir, unsigned long attrs) if (attrs & DMA_ATTR_PRIVILEGED) prot |= IOMMU_PRIV; + if (attrs & DMA_ATTR_MMIO) + prot |= IOMMU_MMIO; + switch (dir) { case DMA_BIDIRECTIONAL: return prot | IOMMU_READ | IOMMU_WRITE; @@ -1205,7 +1203,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg, unsigned int len = PAGE_ALIGN(s->offset + s->length); if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) - __dma_page_cpu_to_dev(sg_page(s), s->offset, s->length, dir); + arch_sync_dma_for_device(sg_phys(s), s->length, dir); prot = __dma_info_to_prot(dir, attrs); @@ -1307,8 +1305,7 @@ static void arm_iommu_unmap_sg(struct device *dev, __iommu_remove_mapping(dev, sg_dma_address(s), sg_dma_len(s)); if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) - __dma_page_dev_to_cpu(sg_page(s), s->offset, - s->length, dir); + arch_sync_dma_for_cpu(sg_phys(s), s->length, dir); } } @@ -1330,7 +1327,7 @@ static void arm_iommu_sync_sg_for_cpu(struct device *dev, return; for_each_sg(sg, s, nents, i) - __dma_page_dev_to_cpu(sg_page(s), s->offset, s->length, dir); + arch_sync_dma_for_cpu(sg_phys(s), s->length, dir); } @@ -1352,29 +1349,31 @@ static void arm_iommu_sync_sg_for_device(struct device *dev, return; for_each_sg(sg, s, nents, i) - 
__dma_page_cpu_to_dev(sg_page(s), s->offset, s->length, dir); + arch_sync_dma_for_device(sg_phys(s), s->length, dir); } /** - * arm_iommu_map_page + * arm_iommu_map_phys * @dev: valid struct device pointer - * @page: page that buffer resides in - * @offset: offset into page for start of buffer + * @phys: physical address that buffer resides in * @size: size of buffer to map * @dir: DMA transfer direction + * @attrs: DMA mapping attributes * * IOMMU aware version of arm_dma_map_page() */ -static dma_addr_t arm_iommu_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs) +static dma_addr_t arm_iommu_map_phys(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction dir, unsigned long attrs) { struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev); + int len = PAGE_ALIGN(size + offset_in_page(phys)); + phys_addr_t addr = phys & PAGE_MASK; dma_addr_t dma_addr; - int ret, prot, len = PAGE_ALIGN(size + offset); + int ret, prot; - if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) - __dma_page_cpu_to_dev(page, offset, size, dir); + if (!dev->dma_coherent && + !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) + arch_sync_dma_for_device(phys, size, dir); dma_addr = __alloc_iova(mapping, len); if (dma_addr == DMA_MAPPING_ERROR) @@ -1382,12 +1381,11 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, struct page *page, prot = __dma_info_to_prot(dir, attrs); - ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len, - prot, GFP_KERNEL); + ret = iommu_map(mapping->domain, dma_addr, addr, len, prot, GFP_KERNEL); if (ret < 0) goto fail; - return dma_addr + offset; + return dma_addr + offset_in_page(phys); fail: __free_iova(mapping, dma_addr, len); return DMA_MAPPING_ERROR; @@ -1399,82 +1397,27 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, struct page *page, * @handle: DMA address of buffer * @size: size of buffer (same as passed to dma_map_page) * @dir: DMA transfer direction (same as passed to dma_map_page) + * @attrs: DMA mapping attributes * - * IOMMU aware version of arm_dma_unmap_page() + * IOMMU aware version of arm_dma_unmap_phys() */ -static void arm_iommu_unmap_page(struct device *dev, dma_addr_t handle, +static void arm_iommu_unmap_phys(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir, unsigned long attrs) { struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev); dma_addr_t iova = handle & PAGE_MASK; - struct page *page; int offset = handle & ~PAGE_MASK; int len = PAGE_ALIGN(size + offset); if (!iova) return; - if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) { - page = phys_to_page(iommu_iova_to_phys(mapping->domain, iova)); - __dma_page_dev_to_cpu(page, offset, size, dir); - } - - iommu_unmap(mapping->domain, iova, len); - __free_iova(mapping, iova, len); -} - -/** - * arm_iommu_map_resource - map a device resource for DMA - * @dev: valid struct device pointer - * @phys_addr: physical address of resource - * @size: size of resource to map - * @dir: DMA transfer direction - */ -static dma_addr_t arm_iommu_map_resource(struct device *dev, - phys_addr_t phys_addr, size_t size, - enum dma_data_direction dir, unsigned long attrs) -{ - struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev); - dma_addr_t dma_addr; - int ret, prot; - phys_addr_t addr = phys_addr & PAGE_MASK; - unsigned int offset = phys_addr & ~PAGE_MASK; - size_t len = PAGE_ALIGN(size + offset); - - dma_addr = 
__alloc_iova(mapping, len); - if (dma_addr == DMA_MAPPING_ERROR) - return dma_addr; - - prot = __dma_info_to_prot(dir, attrs) | IOMMU_MMIO; - - ret = iommu_map(mapping->domain, dma_addr, addr, len, prot, GFP_KERNEL); - if (ret < 0) - goto fail; - - return dma_addr + offset; -fail: - __free_iova(mapping, dma_addr, len); - return DMA_MAPPING_ERROR; -} + if (!dev->dma_coherent && + !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) { + phys_addr_t phys = iommu_iova_to_phys(mapping->domain, iova); -/** - * arm_iommu_unmap_resource - unmap a device DMA resource - * @dev: valid struct device pointer - * @dma_handle: DMA address to resource - * @size: size of resource to map - * @dir: DMA transfer direction - */ -static void arm_iommu_unmap_resource(struct device *dev, dma_addr_t dma_handle, - size_t size, enum dma_data_direction dir, - unsigned long attrs) -{ - struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev); - dma_addr_t iova = dma_handle & PAGE_MASK; - unsigned int offset = dma_handle & ~PAGE_MASK; - size_t len = PAGE_ALIGN(size + offset); - - if (!iova) - return; + arch_sync_dma_for_cpu(phys + offset, size, dir); + } iommu_unmap(mapping->domain, iova, len); __free_iova(mapping, iova, len); @@ -1485,14 +1428,14 @@ static void arm_iommu_sync_single_for_cpu(struct device *dev, { struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev); dma_addr_t iova = handle & PAGE_MASK; - struct page *page; unsigned int offset = handle & ~PAGE_MASK; + phys_addr_t phys; if (dev->dma_coherent || !iova) return; - page = phys_to_page(iommu_iova_to_phys(mapping->domain, iova)); - __dma_page_dev_to_cpu(page, offset, size, dir); + phys = iommu_iova_to_phys(mapping->domain, iova); + arch_sync_dma_for_cpu(phys + offset, size, dir); } static void arm_iommu_sync_single_for_device(struct device *dev, @@ -1500,14 +1443,14 @@ static void arm_iommu_sync_single_for_device(struct device *dev, { struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev); dma_addr_t iova = handle & PAGE_MASK; - struct page *page; unsigned int offset = handle & ~PAGE_MASK; + phys_addr_t phys; if (dev->dma_coherent || !iova) return; - page = phys_to_page(iommu_iova_to_phys(mapping->domain, iova)); - __dma_page_cpu_to_dev(page, offset, size, dir); + phys = iommu_iova_to_phys(mapping->domain, iova); + arch_sync_dma_for_device(phys + offset, size, dir); } static const struct dma_map_ops iommu_ops = { @@ -1516,8 +1459,8 @@ static const struct dma_map_ops iommu_ops = { .mmap = arm_iommu_mmap_attrs, .get_sgtable = arm_iommu_get_sgtable, - .map_page = arm_iommu_map_page, - .unmap_page = arm_iommu_unmap_page, + .map_phys = arm_iommu_map_phys, + .unmap_phys = arm_iommu_unmap_phys, .sync_single_for_cpu = arm_iommu_sync_single_for_cpu, .sync_single_for_device = arm_iommu_sync_single_for_device, @@ -1525,9 +1468,6 @@ static const struct dma_map_ops iommu_ops = { .unmap_sg = arm_iommu_unmap_sg, .sync_sg_for_cpu = arm_iommu_sync_sg_for_cpu, .sync_sg_for_device = arm_iommu_sync_sg_for_device, - - .map_resource = arm_iommu_map_resource, - .unmap_resource = arm_iommu_unmap_resource, }; /** @@ -1794,20 +1734,6 @@ void arch_teardown_dma_ops(struct device *dev) set_dma_ops(dev, NULL); } -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, - enum dma_data_direction dir) -{ - __dma_page_cpu_to_dev(phys_to_page(paddr), paddr & (PAGE_SIZE - 1), - size, dir); -} - -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size, - enum dma_data_direction dir) -{ - __dma_page_dev_to_cpu(phys_to_page(paddr), paddr & (PAGE_SIZE - 1), - size, 
dir); -} - void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs) { diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 9b96840fb979b..83e03abbb2ca9 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -274,6 +274,10 @@ int arm64_ioremap_prot_hook_register(const ioremap_prot_hook_t hook); #define ioremap_np(addr, size) \ ioremap_prot((addr), (size), __pgprot(PROT_DEVICE_nGnRnE)) + +#define ioremap_encrypted(addr, size) \ + ioremap_prot((addr), (size), PAGE_KERNEL) + /* * io{read,write}{16,32,64}be() macros */ @@ -311,7 +315,7 @@ extern bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size, static inline bool arm64_is_protected_mmio(phys_addr_t phys_addr, size_t size) { if (unlikely(is_realm_world())) - return __arm64_is_protected_mmio(phys_addr, size); + return arm64_rsi_is_protected(phys_addr, size); return false; } diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index 3f59b2b7c18c8..beceeda8945d9 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -681,4 +681,42 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu) vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR; } } + +static inline bool kvm_is_realm(struct kvm *kvm) +{ + if (static_branch_unlikely(&kvm_rme_is_available) && kvm) + return kvm->arch.is_realm; + return false; +} + +static inline enum realm_state kvm_realm_state(struct kvm *kvm) +{ + return READ_ONCE(kvm->arch.realm.state); +} + +static inline bool kvm_realm_is_created(struct kvm *kvm) +{ + return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE; +} + +static inline gpa_t kvm_gpa_from_fault(struct kvm *kvm, phys_addr_t ipa) +{ + if (!kvm_is_realm(kvm)) + return ipa; + + return ipa & ~BIT(kvm->arch.realm.ia_bits - 1); +} + +static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu) +{ + if (static_branch_unlikely(&kvm_rme_is_available)) + return vcpu_has_feature(vcpu, KVM_ARM_VCPU_REC); + return false; +} + +static inline bool kvm_arm_rec_finalized(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.rec.mpidr != INVALID_HWID; +} + #endif /* __ARM64_KVM_EMULATE_H__ */ diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index ca550e369b59a..a1a8d5499ef44 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -27,6 +27,7 @@ #include #include #include +#include #include #define __KVM_HAVE_ARCH_INTC_INITIALIZED @@ -39,7 +40,7 @@ #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS -#define KVM_VCPU_MAX_FEATURES 9 +#define KVM_VCPU_MAX_FEATURES 10 #define KVM_VCPU_VALID_FEATURES (BIT(KVM_VCPU_MAX_FEATURES) - 1) #define KVM_REQ_SLEEP \ @@ -76,9 +77,9 @@ enum kvm_mode kvm_get_mode(void); static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; }; #endif -extern unsigned int __ro_after_init kvm_sve_max_vl; extern unsigned int __ro_after_init kvm_host_sve_max_vl; int __init kvm_arm_init_sve(void); +unsigned int kvm_sve_get_max_vl(struct kvm *kvm); u32 __attribute_const__ kvm_target_cpu(void); void kvm_reset_vcpu(struct kvm_vcpu *vcpu); @@ -406,6 +407,9 @@ struct kvm_arch { * the associated pKVM instance in the hypervisor. 
*/ struct kvm_protected_vm pkvm; + + bool is_realm; + struct realm realm; }; struct kvm_vcpu_fault_info { @@ -889,6 +893,9 @@ struct kvm_vcpu_arch { /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */ struct vncr_tlb *vncr_tlb; + + /* Realm meta data */ + struct realm_rec rec; }; /* @@ -1448,6 +1455,8 @@ struct kvm *kvm_arch_alloc_vm(void); #define vcpu_is_protected(vcpu) kvm_vm_is_protected((vcpu)->kvm) +#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.is_realm) + int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature); bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h new file mode 100644 index 0000000000000..acc198d0b5619 --- /dev/null +++ b/arch/arm64/include/asm/kvm_rme.h @@ -0,0 +1,142 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2023 ARM Ltd. + */ + +#ifndef __ASM_KVM_RME_H +#define __ASM_KVM_RME_H + +#include +#include + +/** + * enum realm_state - State of a Realm + */ +enum realm_state { + /** + * @REALM_STATE_NONE: + * Realm has not yet been created. rmi_realm_create() may be + * called to create the realm. + */ + REALM_STATE_NONE, + /** + * @REALM_STATE_NEW: + * Realm is under construction, not eligible for execution. Pages + * may be populated with rmi_data_create(). + */ + REALM_STATE_NEW, + /** + * @REALM_STATE_ACTIVE: + * Realm has been created and is eligible for execution with + * rmi_rec_enter(). Pages may no longer be populated with + * rmi_data_create(). + */ + REALM_STATE_ACTIVE, + /** + * @REALM_STATE_DYING: + * Realm is in the process of being destroyed or has already been + * destroyed. + */ + REALM_STATE_DYING, + /** + * @REALM_STATE_DEAD: + * Realm has been destroyed. + */ + REALM_STATE_DEAD +}; + +/** + * struct realm - Additional per VM data for a Realm + * + * @state: The lifetime state machine for the realm + * @rd: Kernel mapping of the Realm Descriptor (RD) + * @params: Parameters for the RMI_REALM_CREATE command + * @num_aux: The number of auxiliary pages required by the RMM + * @vmid: VMID to be used by the RMM for the realm + * @mecid: MECID to be used by the RMM for the realm + * @ia_bits: Number of valid Input Address bits in the IPA + */ +struct realm { + enum realm_state state; + + void *rd; + struct realm_params *params; + + unsigned long num_aux; + unsigned int vmid; + unsigned int ia_bits; + unsigned short mecid; + enum { + MEC_POLICY_UNCONFIGURED = 0, /* Use shared for compatibility */ + MEC_POLICY_PRIVATE, /* Allocate private MECID */ + MEC_POLICY_SHARED, /* Use shared MECID */ + } mec_policy; +}; + +/** + * struct realm_rec - Additional per VCPU data for a Realm + * + * @mpidr: MPIDR (Multiprocessor Affinity Register) value to identify this VCPU + * @rec_page: Kernel VA of the RMM's private page for this REC + * @aux_pages: Additional pages private to the RMM for this REC + * @run: Kernel VA of the RmiRecRun structure shared with the RMM + */ +struct realm_rec { + unsigned long mpidr; + void *rec_page; + /* + * REC_PARAMS_AUX_GRANULES is the maximum number of 4K granules that + * the RMM can require. The array is sized to be large enough for the + * maximum number of host sized pages that could be required. 
+ */ + struct page *aux_pages[(REC_PARAMS_AUX_GRANULES * SZ_4K) >> PAGE_SHIFT]; + struct rec_run *run; +}; + +void kvm_init_rme(void); +u32 kvm_realm_ipa_limit(void); +u32 kvm_realm_vgic_nr_lr(void); +u8 kvm_realm_max_pmu_counters(void); +unsigned int kvm_realm_sve_max_vl(void); + +u64 kvm_realm_reset_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val); + +bool kvm_rme_supports_sve(void); + +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap); +int kvm_init_realm_vm(struct kvm *kvm); +void kvm_destroy_realm(struct kvm *kvm); +void kvm_realm_destroy_rtts(struct kvm *kvm, u32 ia_bits); +int kvm_create_rec(struct kvm_vcpu *vcpu); +void kvm_destroy_rec(struct kvm_vcpu *vcpu); + +int kvm_rec_enter(struct kvm_vcpu *vcpu); +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu); +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status); + +void kvm_realm_unmap_range(struct kvm *kvm, + unsigned long ipa, + unsigned long size, + bool unmap_private, + bool may_block); +int realm_map_protected(struct realm *realm, + unsigned long base_ipa, + kvm_pfn_t pfn, + unsigned long size, + struct kvm_mmu_memory_cache *memcache); +int realm_map_non_secure(struct realm *realm, + unsigned long ipa, + kvm_pfn_t pfn, + unsigned long size, + struct kvm_mmu_memory_cache *memcache); +int realm_psci_complete(struct kvm_vcpu *source, + struct kvm_vcpu *target, + unsigned long status); + +static inline bool kvm_realm_is_private_address(struct realm *realm, + unsigned long addr) +{ + return !(addr & BIT(realm->ia_bits - 1)); +} + +#endif /* __ASM_KVM_RME_H */ diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h new file mode 100644 index 0000000000000..02f4d978bc5b4 --- /dev/null +++ b/arch/arm64/include/asm/rmi_cmds.h @@ -0,0 +1,526 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2023 ARM Ltd. + */ + +#ifndef __ASM_RMI_CMDS_H +#define __ASM_RMI_CMDS_H + +#include + +#include + +struct rtt_entry { + unsigned long walk_level; + unsigned long desc; + int state; + int ripas; +}; + +/** + * rmi_data_create() - Create a data granule + * @rd: PA of the RD + * @data: PA of the target granule + * @ipa: IPA at which the granule will be mapped in the guest + * @src: PA of the source granule + * @flags: RMI_MEASURE_CONTENT if the contents should be measured + * + * Create a new data granule, copying contents from a non-secure granule. + * + * Return: RMI return code + */ +static inline int rmi_data_create(unsigned long rd, unsigned long data, + unsigned long ipa, unsigned long src, + unsigned long flags) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE, rd, data, ipa, src, + flags, &res); + + return res.a0; +} + +/** + * rmi_data_create_unknown() - Create a data granule with unknown contents + * @rd: PA of the RD + * @data: PA of the target granule + * @ipa: IPA at which the granule will be mapped in the guest + * + * Return: RMI return code + */ +static inline int rmi_data_create_unknown(unsigned long rd, + unsigned long data, + unsigned long ipa) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE_UNKNOWN, rd, data, ipa, &res); + + return res.a0; +} + +/** + * rmi_data_destroy() - Destroy a data granule + * @rd: PA of the RD + * @ipa: IPA at which the granule is mapped in the guest + * @data_out: PA of the granule which was destroyed + * @top_out: Top IPA of non-live RTT entries + * + * Unmap a protected IPA from stage 2, transitioning it to DESTROYED. 
+ * The IPA cannot be used by the guest unless it is transitioned to RAM again + * by the realm guest. + * + * Return: RMI return code + */ +static inline int rmi_data_destroy(unsigned long rd, unsigned long ipa, + unsigned long *data_out, + unsigned long *top_out) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_DATA_DESTROY, rd, ipa, &res); + + if (data_out) + *data_out = res.a1; + if (top_out) + *top_out = res.a2; + + return res.a0; +} + +/** + * rmi_features() - Read feature register + * @index: Feature register index + * @out: Feature register value is written to this pointer + * + * Return: RMI return code + */ +static inline int rmi_features(unsigned long index, unsigned long *out) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_FEATURES, index, &res); + + if (out) + *out = res.a1; + return res.a0; +} + +/** + * rmi_granule_delegate() - Delegate a granule + * @phys: PA of the granule + * + * Delegate a granule for use by the realm world. + * + * Return: RMI return code + */ +static inline int rmi_granule_delegate(unsigned long phys) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_GRANULE_DELEGATE, phys, &res); + + return res.a0; +} + +/** + * rmi_granule_undelegate() - Undelegate a granule + * @phys: PA of the granule + * + * Undelegate a granule to allow use by the normal world. Will fail if the + * granule is in use. + * + * Return: RMI return code + */ +static inline int rmi_granule_undelegate(unsigned long phys) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_GRANULE_UNDELEGATE, phys, &res); + + return res.a0; +} + +/** + * rmi_psci_complete() - Complete pending PSCI command + * @calling_rec: PA of the calling REC + * @target_rec: PA of the target REC + * @status: Status of the PSCI request + * + * Completes a pending PSCI command which was called with an MPIDR argument, by + * providing the corresponding REC. + * + * Return: RMI return code + */ +static inline int rmi_psci_complete(unsigned long calling_rec, + unsigned long target_rec, + unsigned long status) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_PSCI_COMPLETE, calling_rec, target_rec, + status, &res); + + return res.a0; +} + +/** + * rmi_realm_activate() - Active a realm + * @rd: PA of the RD + * + * Mark a realm as Active signalling that creation is complete and allowing + * execution of the realm. + * + * Return: RMI return code + */ +static inline int rmi_realm_activate(unsigned long rd) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_REALM_ACTIVATE, rd, &res); + + return res.a0; +} + +/** + * rmi_realm_create() - Create a realm + * @rd: PA of the RD + * @params: PA of realm parameters + * + * Create a new realm using the given parameters. + * + * Return: RMI return code + */ +static inline int rmi_realm_create(unsigned long rd, unsigned long params) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_REALM_CREATE, rd, params, &res); + + return res.a0; +} + +/** + * rmi_realm_destroy() - Destroy a realm + * @rd: PA of the RD + * + * Destroys a realm, all objects belonging to the realm must be destroyed first. 
+ * + * Return: RMI return code + */ +static inline int rmi_realm_destroy(unsigned long rd) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_REALM_DESTROY, rd, &res); + + return res.a0; +} + +/** + * rmi_rec_aux_count() - Get number of auxiliary granules required + * @rd: PA of the RD + * @aux_count: Number of granules written to this pointer + * + * A REC may require extra auxiliary granules to be delegated for the RMM to + * store metadata (not visible to the normal world) in. This function provides + * the number of granules that are required. + * + * Return: RMI return code + */ +static inline int rmi_rec_aux_count(unsigned long rd, unsigned long *aux_count) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_REC_AUX_COUNT, rd, &res); + + if (aux_count) + *aux_count = res.a1; + return res.a0; +} + +/** + * rmi_rec_create() - Create a REC + * @rd: PA of the RD + * @rec: PA of the target REC + * @params: PA of REC parameters + * + * Create a REC using the parameters specified in the struct rec_params pointed + * to by @params. + * + * Return: RMI return code + */ +static inline int rmi_rec_create(unsigned long rd, unsigned long rec, + unsigned long params) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_REC_CREATE, rd, rec, params, &res); + + return res.a0; +} + +/** + * rmi_rec_destroy() - Destroy a REC + * @rec: PA of the target REC + * + * Destroys a REC. The REC must not be running. + * + * Return: RMI return code + */ +static inline int rmi_rec_destroy(unsigned long rec) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_REC_DESTROY, rec, &res); + + return res.a0; +} + +/** + * rmi_rec_enter() - Enter a REC + * @rec: PA of the target REC + * @run_ptr: PA of RecRun structure + * + * Starts (or continues) execution within a REC. + * + * Return: RMI return code + */ +static inline int rmi_rec_enter(unsigned long rec, unsigned long run_ptr) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_REC_ENTER, rec, run_ptr, &res); + + return res.a0; +} + +/** + * rmi_rtt_create() - Creates an RTT + * @rd: PA of the RD + * @rtt: PA of the target RTT + * @ipa: Base of the IPA range described by the RTT + * @level: Depth of the RTT within the tree + * + * Creates an RTT (Realm Translation Table) at the specified level for the + * translation of the specified address within the realm. + * + * Return: RMI return code + */ +static inline int rmi_rtt_create(unsigned long rd, unsigned long rtt, + unsigned long ipa, long level) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_RTT_CREATE, rd, rtt, ipa, level, &res); + + return res.a0; +} + +/** + * rmi_rtt_destroy() - Destroy an RTT + * @rd: PA of the RD + * @ipa: Base of the IPA range described by the RTT + * @level: Depth of the RTT within the tree + * @out_rtt: Pointer to write the PA of the RTT which was destroyed + * @out_top: Pointer to write the top IPA of non-live RTT entries + * + * Destroys an RTT. The RTT must be non-live, i.e. none of the entries in the + * table are in ASSIGNED or TABLE state. + * + * Return: RMI return code. 
+ */ +static inline int rmi_rtt_destroy(unsigned long rd, + unsigned long ipa, + long level, + unsigned long *out_rtt, + unsigned long *out_top) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_RTT_DESTROY, rd, ipa, level, &res); + + if (out_rtt) + *out_rtt = res.a1; + if (out_top) + *out_top = res.a2; + + return res.a0; +} + +/** + * rmi_rtt_fold() - Fold an RTT + * @rd: PA of the RD + * @ipa: Base of the IPA range described by the RTT + * @level: Depth of the RTT within the tree + * @out_rtt: Pointer to write the PA of the RTT which was destroyed + * + * Folds an RTT. If all entries with the RTT are 'homogeneous' the RTT can be + * folded into the parent and the RTT destroyed. + * + * Return: RMI return code + */ +static inline int rmi_rtt_fold(unsigned long rd, unsigned long ipa, + long level, unsigned long *out_rtt) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_RTT_FOLD, rd, ipa, level, &res); + + if (out_rtt) + *out_rtt = res.a1; + + return res.a0; +} + +/** + * rmi_rtt_init_ripas() - Set RIPAS for new realm + * @rd: PA of the RD + * @base: Base of target IPA region + * @top: Top of target IPA region + * @out_top: Top IPA of range whose RIPAS was modified + * + * Sets the RIPAS of a target IPA range to RAM, for a realm in the NEW state. + * + * Return: RMI return code + */ +static inline int rmi_rtt_init_ripas(unsigned long rd, unsigned long base, + unsigned long top, unsigned long *out_top) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_RTT_INIT_RIPAS, rd, base, top, &res); + + if (out_top) + *out_top = res.a1; + + return res.a0; +} + +/** + * rmi_rtt_map_unprotected() - Map NS granules into a realm + * @rd: PA of the RD + * @ipa: Base IPA of the mapping + * @level: Depth within the RTT tree + * @desc: RTTE descriptor + * + * Create a mapping from an Unprotected IPA to a Non-secure PA. + * + * Return: RMI return code + */ +static inline int rmi_rtt_map_unprotected(unsigned long rd, + unsigned long ipa, + long level, + unsigned long desc) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_RTT_MAP_UNPROTECTED, rd, ipa, level, + desc, &res); + + return res.a0; +} + +/** + * rmi_rtt_read_entry() - Read an RTTE + * @rd: PA of the RD + * @ipa: IPA for which to read the RTTE + * @level: RTT level at which to read the RTTE + * @rtt: Output structure describing the RTTE + * + * Reads a RTTE (Realm Translation Table Entry). + * + * Return: RMI return code + */ +static inline int rmi_rtt_read_entry(unsigned long rd, unsigned long ipa, + long level, struct rtt_entry *rtt) +{ + struct arm_smccc_1_2_regs regs = { + SMC_RMI_RTT_READ_ENTRY, + rd, ipa, level + }; + + arm_smccc_1_2_invoke(®s, ®s); + + rtt->walk_level = regs.a1; + rtt->state = regs.a2 & 0xFF; + rtt->desc = regs.a3; + rtt->ripas = regs.a4 & 0xFF; + + return regs.a0; +} + +/** + * rmi_rtt_set_ripas() - Set RIPAS for an running realm + * @rd: PA of the RD + * @rec: PA of the REC making the request + * @base: Base of target IPA region + * @top: Top of target IPA region + * @out_top: Pointer to write top IPA of range whose RIPAS was modified + * + * Completes a request made by the realm to change the RIPAS of a target IPA + * range. 
+ * + * Return: RMI return code + */ +static inline int rmi_rtt_set_ripas(unsigned long rd, unsigned long rec, + unsigned long base, unsigned long top, + unsigned long *out_top) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_RTT_SET_RIPAS, rd, rec, base, top, &res); + + if (out_top) + *out_top = res.a1; + + return res.a0; +} + +/** + * rmi_rtt_unmap_unprotected() - Remove a NS mapping + * @rd: PA of the RD + * @ipa: Base IPA of the mapping + * @level: Depth within the RTT tree + * @out_top: Pointer to write top IPA of non-live RTT entries + * + * Removes a mapping at an Unprotected IPA. + * + * Return: RMI return code + */ +static inline int rmi_rtt_unmap_unprotected(unsigned long rd, + unsigned long ipa, + long level, + unsigned long *out_top) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_RTT_UNMAP_UNPROTECTED, rd, ipa, + level, &res); + + if (out_top) + *out_top = res.a1; + + return res.a0; +} + +static inline int rmi_mec_set_shared(unsigned long mecid) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_MEC_SET_SHARED, mecid, &res); + + return res.a0; +} + +static inline int rmi_mec_set_private(unsigned long mecid) +{ + struct arm_smccc_res res; + + arm_smccc_1_1_invoke(SMC_RMI_MEC_SET_PRIVATE, mecid, &res); + + return res.a0; +} + +#endif /* __ASM_RMI_CMDS_H */ diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h new file mode 100644 index 0000000000000..aadae17cc63b7 --- /dev/null +++ b/arch/arm64/include/asm/rmi_smc.h @@ -0,0 +1,274 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2023-2024 ARM Ltd. + * + * The values and structures in this file are from the Realm Management Monitor + * specification (DEN0137) version 1.0-rel0: + * https://developer.arm.com/documentation/den0137/1-0rel0/ + */ + +#ifndef __ASM_RMI_SMC_H +#define __ASM_RMI_SMC_H + +#include + +#define SMC_RMI_CALL(func) \ + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ + ARM_SMCCC_SMC_64, \ + ARM_SMCCC_OWNER_STANDARD, \ + (func)) + +#define SMC_RMI_VERSION SMC_RMI_CALL(0x0150) +#define SMC_RMI_GRANULE_DELEGATE SMC_RMI_CALL(0x0151) +#define SMC_RMI_GRANULE_UNDELEGATE SMC_RMI_CALL(0x0152) +#define SMC_RMI_DATA_CREATE SMC_RMI_CALL(0x0153) +#define SMC_RMI_DATA_CREATE_UNKNOWN SMC_RMI_CALL(0x0154) +#define SMC_RMI_DATA_DESTROY SMC_RMI_CALL(0x0155) + +#define SMC_RMI_REALM_ACTIVATE SMC_RMI_CALL(0x0157) +#define SMC_RMI_REALM_CREATE SMC_RMI_CALL(0x0158) +#define SMC_RMI_REALM_DESTROY SMC_RMI_CALL(0x0159) +#define SMC_RMI_REC_CREATE SMC_RMI_CALL(0x015a) +#define SMC_RMI_REC_DESTROY SMC_RMI_CALL(0x015b) +#define SMC_RMI_REC_ENTER SMC_RMI_CALL(0x015c) +#define SMC_RMI_RTT_CREATE SMC_RMI_CALL(0x015d) +#define SMC_RMI_RTT_DESTROY SMC_RMI_CALL(0x015e) +#define SMC_RMI_RTT_MAP_UNPROTECTED SMC_RMI_CALL(0x015f) + +#define SMC_RMI_RTT_READ_ENTRY SMC_RMI_CALL(0x0161) +#define SMC_RMI_RTT_UNMAP_UNPROTECTED SMC_RMI_CALL(0x0162) + +#define SMC_RMI_PSCI_COMPLETE SMC_RMI_CALL(0x0164) +#define SMC_RMI_FEATURES SMC_RMI_CALL(0x0165) +#define SMC_RMI_RTT_FOLD SMC_RMI_CALL(0x0166) +#define SMC_RMI_REC_AUX_COUNT SMC_RMI_CALL(0x0167) +#define SMC_RMI_RTT_INIT_RIPAS SMC_RMI_CALL(0x0168) +#define SMC_RMI_RTT_SET_RIPAS SMC_RMI_CALL(0x0169) + +#define SMC_RMI_MEC_SET_SHARED SMC_RMI_CALL(0x018C) +#define SMC_RMI_MEC_SET_PRIVATE SMC_RMI_CALL(0x018D) + +#define RMI_ABI_MAJOR_VERSION 1 +#define RMI_ABI_MINOR_VERSION 0 + +#define RMI_ABI_VERSION_GET_MAJOR(version) ((version) >> 16) +#define RMI_ABI_VERSION_GET_MINOR(version) ((version) & 0xFFFF) +#define 
RMI_ABI_VERSION(major, minor) (((major) << 16) | (minor)) + +#define RMI_UNASSIGNED 0 +#define RMI_ASSIGNED 1 +#define RMI_TABLE 2 + +#define RMI_RETURN_STATUS(ret) ((ret) & 0xFF) +#define RMI_RETURN_INDEX(ret) (((ret) >> 8) & 0xFF) + +#define RMI_SUCCESS 0 +#define RMI_ERROR_INPUT 1 +#define RMI_ERROR_REALM 2 +#define RMI_ERROR_REC 3 +#define RMI_ERROR_RTT 4 + +enum rmi_ripas { + RMI_EMPTY = 0, + RMI_RAM = 1, + RMI_DESTROYED = 2, +}; + +#define RMI_NO_MEASURE_CONTENT 0 +#define RMI_MEASURE_CONTENT 1 + +#define RMI_FEATURE_REGISTER_0_S2SZ GENMASK(7, 0) +#define RMI_FEATURE_REGISTER_0_LPA2 BIT(8) +#define RMI_FEATURE_REGISTER_0_SVE_EN BIT(9) +#define RMI_FEATURE_REGISTER_0_SVE_VL GENMASK(13, 10) +#define RMI_FEATURE_REGISTER_0_NUM_BPS GENMASK(19, 14) +#define RMI_FEATURE_REGISTER_0_NUM_WPS GENMASK(25, 20) +#define RMI_FEATURE_REGISTER_0_PMU_EN BIT(26) +#define RMI_FEATURE_REGISTER_0_PMU_NUM_CTRS GENMASK(31, 27) +#define RMI_FEATURE_REGISTER_0_HASH_SHA_256 BIT(32) +#define RMI_FEATURE_REGISTER_0_HASH_SHA_512 BIT(33) +#define RMI_FEATURE_REGISTER_0_GICV3_NUM_LRS GENMASK(37, 34) +#define RMI_FEATURE_REGISTER_0_MAX_RECS_ORDER GENMASK(41, 38) +#define RMI_FEATURE_REGISTER_0_Reserved GENMASK(63, 42) + +#define RMI_REALM_PARAM_FLAG_LPA2 BIT(0) +#define RMI_REALM_PARAM_FLAG_SVE BIT(1) +#define RMI_REALM_PARAM_FLAG_PMU BIT(2) + +/* + * Note many of these fields are smaller than u64 but all fields have u64 + * alignment, so use u64 to ensure correct alignment. + */ +struct realm_params { + union { /* 0x0 */ + struct { + u64 flags; + u64 s2sz; + u64 sve_vl; + u64 num_bps; + u64 num_wps; + u64 pmu_num_ctrs; + u64 hash_algo; + }; + u8 padding0[0x400]; + }; + union { /* 0x400 */ + u8 rpv[64]; + u8 padding1[0x400]; + }; + union { /* 0x800 */ + struct { + u64 vmid; + u64 rtt_base; + s64 rtt_level_start; + u64 rtt_num_start; + u64 flags1; + u64 mecid; + }; + u8 padding2[0x800]; + }; +}; + +/* + * The number of GPRs (starting from X0) that are + * configured by the host when a REC is created. 
+ */ +#define REC_CREATE_NR_GPRS 8 + +#define REC_PARAMS_FLAG_RUNNABLE BIT_ULL(0) + +#define REC_PARAMS_AUX_GRANULES 16 + +struct rec_params { + union { /* 0x0 */ + u64 flags; + u8 padding0[0x100]; + }; + union { /* 0x100 */ + u64 mpidr; + u8 padding1[0x100]; + }; + union { /* 0x200 */ + u64 pc; + u8 padding2[0x100]; + }; + union { /* 0x300 */ + u64 gprs[REC_CREATE_NR_GPRS]; + u8 padding3[0x500]; + }; + union { /* 0x800 */ + struct { + u64 num_rec_aux; + u64 aux[REC_PARAMS_AUX_GRANULES]; + }; + u8 padding4[0x800]; + }; +}; + +#define REC_ENTER_FLAG_EMULATED_MMIO BIT(0) +#define REC_ENTER_FLAG_INJECT_SEA BIT(1) +#define REC_ENTER_FLAG_TRAP_WFI BIT(2) +#define REC_ENTER_FLAG_TRAP_WFE BIT(3) +#define REC_ENTER_FLAG_RIPAS_RESPONSE BIT(4) + +#define REC_RUN_GPRS 31 +#define REC_MAX_GIC_NUM_LRS 16 + +#define RMI_PERMITTED_GICV3_HCR_BITS (ICH_HCR_EL2_UIE | \ + ICH_HCR_EL2_LRENPIE | \ + ICH_HCR_EL2_NPIE | \ + ICH_HCR_EL2_VGrp0EIE | \ + ICH_HCR_EL2_VGrp0DIE | \ + ICH_HCR_EL2_VGrp1EIE | \ + ICH_HCR_EL2_VGrp1DIE | \ + ICH_HCR_EL2_TDIR) + +struct rec_enter { + union { /* 0x000 */ + u64 flags; + u8 padding0[0x200]; + }; + union { /* 0x200 */ + u64 gprs[REC_RUN_GPRS]; + u8 padding1[0x100]; + }; + union { /* 0x300 */ + struct { + u64 gicv3_hcr; + u64 gicv3_lrs[REC_MAX_GIC_NUM_LRS]; + }; + u8 padding2[0x100]; + }; + u8 padding3[0x400]; +}; + +#define RMI_EXIT_SYNC 0x00 +#define RMI_EXIT_IRQ 0x01 +#define RMI_EXIT_FIQ 0x02 +#define RMI_EXIT_PSCI 0x03 +#define RMI_EXIT_RIPAS_CHANGE 0x04 +#define RMI_EXIT_HOST_CALL 0x05 +#define RMI_EXIT_SERROR 0x06 + +struct rec_exit { + union { /* 0x000 */ + u8 exit_reason; + u8 padding0[0x100]; + }; + union { /* 0x100 */ + struct { + u64 esr; + u64 far; + u64 hpfar; + }; + u8 padding1[0x100]; + }; + union { /* 0x200 */ + u64 gprs[REC_RUN_GPRS]; + u8 padding2[0x100]; + }; + union { /* 0x300 */ + struct { + u64 gicv3_hcr; + u64 gicv3_lrs[REC_MAX_GIC_NUM_LRS]; + u64 gicv3_misr; + u64 gicv3_vmcr; + }; + u8 padding3[0x100]; + }; + union { /* 0x400 */ + struct { + u64 cntp_ctl; + u64 cntp_cval; + u64 cntv_ctl; + u64 cntv_cval; + }; + u8 padding4[0x100]; + }; + union { /* 0x500 */ + struct { + u64 ripas_base; + u64 ripas_top; + u8 ripas_value; + u8 padding8[7]; + }; + u8 padding5[0x100]; + }; + union { /* 0x600 */ + u16 imm; + u8 padding6[0x100]; + }; + union { /* 0x700 */ + struct { + u8 pmu_ovf_status; + }; + u8 padding7[0x100]; + }; +}; + +struct rec_run { + struct rec_enter enter; + struct rec_exit exit; +}; + +#endif /* __ASM_RMI_SMC_H */ diff --git a/arch/arm64/include/asm/rsi.h b/arch/arm64/include/asm/rsi.h index b42aeac05340e..88b50d660e85a 100644 --- a/arch/arm64/include/asm/rsi.h +++ b/arch/arm64/include/asm/rsi.h @@ -16,7 +16,7 @@ DECLARE_STATIC_KEY_FALSE(rsi_present); void __init arm64_rsi_init(void); -bool __arm64_is_protected_mmio(phys_addr_t base, size_t size); +bool arm64_rsi_is_protected(phys_addr_t base, size_t size); static inline bool is_realm_world(void) { diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h index aa280f356b96a..db73c9bfd8c90 100644 --- a/arch/arm64/include/asm/virt.h +++ b/arch/arm64/include/asm/virt.h @@ -82,6 +82,7 @@ void __hyp_reset_vectors(void); bool is_kvm_arm_initialised(void); DECLARE_STATIC_KEY_FALSE(kvm_protected_mode_initialized); +DECLARE_STATIC_KEY_FALSE(kvm_rme_is_available); static inline bool is_pkvm_initialized(void) { diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index ed5f3892674c7..79df08c1421d0 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ 
b/arch/arm64/include/uapi/asm/kvm.h @@ -106,6 +106,7 @@ struct kvm_regs { #define KVM_ARM_VCPU_PTRAUTH_GENERIC 6 /* VCPU uses generic authentication */ #define KVM_ARM_VCPU_HAS_EL2 7 /* Support nested virtualization */ #define KVM_ARM_VCPU_HAS_EL2_E2H0 8 /* Limit NV support to E2H RES0 */ +#define KVM_ARM_VCPU_REC 9 /* VCPU REC state as part of Realm */ struct kvm_vcpu_init { __u32 target; @@ -429,6 +430,67 @@ enum { #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3 #define KVM_DEV_ARM_ITS_CTRL_RESET 4 +/* KVM_CAP_ARM_RME on VM fd */ +#define KVM_CAP_ARM_RME_CONFIG_REALM 0 +#define KVM_CAP_ARM_RME_CREATE_REALM 1 +#define KVM_CAP_ARM_RME_INIT_RIPAS_REALM 2 +#define KVM_CAP_ARM_RME_POPULATE_REALM 3 +#define KVM_CAP_ARM_RME_ACTIVATE_REALM 4 + +/* List of configuration items accepted for KVM_CAP_ARM_RME_CONFIG_REALM */ +#define ARM_RME_CONFIG_RPV 0 +#define ARM_RME_CONFIG_HASH_ALGO 1 +#define ARM_RME_CONFIG_MEC 2 +#define ARM_RME_CONFIG_MEC_QUERY 3 + +#define ARM_RME_CONFIG_HASH_ALGO_SHA256 0 +#define ARM_RME_CONFIG_HASH_ALGO_SHA512 1 + +#define ARM_RME_CONFIG_RPV_SIZE 64 + +struct arm_rme_config { + __u32 cfg; + union { + /* cfg == ARM_RME_CONFIG_RPV */ + struct { + __u8 rpv[ARM_RME_CONFIG_RPV_SIZE]; + }; + + /* cfg == ARM_RME_CONFIG_HASH_ALGO */ + struct { + __u32 hash_algo; + }; + + /* cfg == ARM_RME_CONFIG_MEC */ + struct { + /* At the moment this is a bool: 0=private MECID */ + __u32 shared_mec; + }; + + /* cfg == ARM_RME_CONFIG_MEC_QUERY - output */ + struct { + __u32 mec_supported; /* 0=not supported, 1=supported */ + __u32 mec_count; /* Number of available MECIDs */ + }; + /* Fix the size of the union */ + __u8 reserved[256]; + }; +}; + +#define KVM_ARM_RME_POPULATE_FLAGS_MEASURE (1 << 0) +struct arm_rme_populate_realm { + __u64 base; + __u64 size; + __u32 flags; + __u32 reserved[3]; +}; + +struct arm_rme_init_ripas { + __u64 base; + __u64 size; + __u64 reserved[2]; +}; + /* Device Control API on vcpu fd */ #define KVM_ARM_VCPU_PMU_V3_CTRL 0 #define KVM_ARM_VCPU_PMU_V3_IRQ 0 diff --git a/arch/arm64/kernel/rsi.c b/arch/arm64/kernel/rsi.c index ce4778141ec7b..c64a06f58c0bc 100644 --- a/arch/arm64/kernel/rsi.c +++ b/arch/arm64/kernel/rsi.c @@ -84,7 +84,25 @@ static void __init arm64_rsi_setup_memory(void) } } -bool __arm64_is_protected_mmio(phys_addr_t base, size_t size) +/* + * Check if a given PA range is Trusted (e.g., Protected memory, a Trusted Device + * mapping, or an MMIO emulated in the Realm world). + * + * We can rely on the RIPAS value of the region to detect if a given region is + * protected. + * + * RIPAS_DEV - A trusted device memory or a trusted emulated MMIO (in the Realm + * world + * RIPAS_RAM - Memory (RAM), protected by the RMM guarantees. (e.g., Firmware + * reserved regions for data sharing). + * + * RIPAS_DESTROYED is a special case of one of the above, where the host did + * something without our permission and as such we can't do anything about it. + * + * The only case where something is emulated by the untrusted hypervisor or is + * backed by shared memory is indicated by RSI_RIPAS_EMPTY. 
+ */ +bool arm64_rsi_is_protected(phys_addr_t base, size_t size) { enum ripas ripas; phys_addr_t end, top; @@ -101,18 +119,18 @@ bool __arm64_is_protected_mmio(phys_addr_t base, size_t size) break; if (WARN_ON(top <= base)) break; - if (ripas != RSI_RIPAS_DEV) + if (ripas == RSI_RIPAS_EMPTY) break; base = top; } return base >= end; } -EXPORT_SYMBOL(__arm64_is_protected_mmio); +EXPORT_SYMBOL(arm64_rsi_is_protected); static int realm_ioremap_hook(phys_addr_t phys, size_t size, pgprot_t *prot) { - if (__arm64_is_protected_mmio(phys, size)) + if (arm64_rsi_is_protected(phys, size)) *prot = pgprot_encrypted(*prot); else *prot = pgprot_decrypted(*prot); diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 713248f240e03..3a04b040869df 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -37,6 +37,7 @@ menuconfig KVM select HAVE_KVM_VCPU_RUN_PID_CHANGE select SCHED_INFO select GUEST_PERF_EVENTS if PERF_EVENTS + select KVM_GENERIC_PRIVATE_MEM help Support hosting virtualized guest machines. diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile index 3ebc0570345cc..e273838aad9ce 100644 --- a/arch/arm64/kvm/Makefile +++ b/arch/arm64/kvm/Makefile @@ -16,7 +16,7 @@ CFLAGS_handle_exit.o += -Wno-override-init kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ inject_fault.o va_layout.o handle_exit.o config.o \ guest.o debug.o reset.o sys_regs.o stacktrace.o \ - vgic-sys-reg-v3.o fpsimd.o pkvm.o \ + vgic-sys-reg-v3.o fpsimd.o pkvm.o rme.o rme-exit.o \ arch_timer.o trng.o vmid.o emulate-nested.o nested.o at.o \ vgic/vgic.o vgic/vgic-init.o \ vgic/vgic-irqfd.o vgic/vgic-v2.o \ diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index dbd74e4885e24..0a15450571df5 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -148,6 +148,13 @@ static void timer_set_cval(struct arch_timer_context *ctxt, u64 cval) static void timer_set_offset(struct arch_timer_context *ctxt, u64 offset) { + struct kvm_vcpu *vcpu = ctxt->vcpu; + + if (kvm_is_realm(vcpu->kvm)) { + WARN_ON(offset); + return; + } + if (!ctxt->offset.vm_offset) { WARN(offset, "timer %ld\n", arch_timer_ctx_index(ctxt)); return; @@ -462,6 +469,21 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level, timer_ctx); } +void kvm_realm_timers_update(struct kvm_vcpu *vcpu) +{ + struct arch_timer_cpu *arch_timer = &vcpu->arch.timer_cpu; + int i; + + for (i = 0; i < NR_KVM_EL0_TIMERS; i++) { + struct arch_timer_context *timer = &arch_timer->timers[i]; + bool status = timer_get_ctl(timer) & ARCH_TIMER_CTRL_IT_STAT; + bool level = kvm_timer_irq_can_fire(timer) && status; + + if (level != timer->irq.level) + kvm_timer_update_irq(vcpu, level, timer); + } +} + /* Only called for a fully emulated timer */ static void timer_emulate(struct arch_timer_context *ctx) { @@ -1065,7 +1087,9 @@ static void timer_context_init(struct kvm_vcpu *vcpu, int timerid) ctxt->vcpu = vcpu; - if (timerid == TIMER_VTIMER) + if (kvm_is_realm(vcpu->kvm)) + ctxt->offset.vm_offset = NULL; + else if (timerid == TIMER_VTIMER) ctxt->offset.vm_offset = &kvm->arch.timer_data.voffset; else ctxt->offset.vm_offset = &kvm->arch.timer_data.poffset; @@ -1087,13 +1111,19 @@ static void timer_context_init(struct kvm_vcpu *vcpu, int timerid) void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu) { struct arch_timer_cpu *timer = vcpu_timer(vcpu); + u64 cntvoff; for (int i = 0; i < NR_KVM_TIMERS; i++) timer_context_init(vcpu, i); + if (kvm_is_realm(vcpu->kvm)) + cntvoff = 0; + else + cntvoff = kvm_phys_timer_read(); 
+ /* Synchronize offsets across timers of a VM if not already provided */ if (!test_bit(KVM_ARCH_FLAG_VM_COUNTER_OFFSET, &vcpu->kvm->arch.flags)) { - timer_set_offset(vcpu_vtimer(vcpu), kvm_phys_timer_read()); + timer_set_offset(vcpu_vtimer(vcpu), cntvoff); timer_set_offset(vcpu_ptimer(vcpu), 0); } @@ -1633,6 +1663,13 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu) return -EINVAL; } + /* + * We don't use mapped IRQs for Realms because the RMI doesn't allow + * us setting the LR.HW bit in the VGIC. + */ + if (vcpu_is_rec(vcpu)) + return 0; + get_timer_map(vcpu, &map); ret = kvm_vgic_map_phys_irq(vcpu, @@ -1764,6 +1801,9 @@ int kvm_vm_ioctl_set_counter_offset(struct kvm *kvm, if (offset->reserved) return -EINVAL; + if (kvm_is_realm(kvm)) + return -EINVAL; + mutex_lock(&kvm->lock); if (!kvm_trylock_all_vcpus(kvm)) { diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 034b5cecaaa75..b89b7190c0672 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -40,6 +41,7 @@ #include #include #include +#include #include #include @@ -48,6 +50,16 @@ #include "sys_regs.h" +/* + * Expose KVM_CAP_ARM_RME capability number via sysfs so userspace (QEMU) + * can discover it at runtime. This is needed because the capability number + * is not yet stable upstream and can shift when other patches are merged. + * Exposed at: /sys/module/kvm/parameters/kvm_cap_arm_rme + */ +static int kvm_cap_arm_rme = KVM_CAP_ARM_RME; +module_param(kvm_cap_arm_rme, int, 0444); +MODULE_PARM_DESC(kvm_cap_arm_rme, "KVM capability number for ARM RME"); + static enum kvm_mode kvm_mode = KVM_MODE_DEFAULT; enum kvm_wfx_trap_policy { @@ -59,6 +71,8 @@ enum kvm_wfx_trap_policy { static enum kvm_wfx_trap_policy kvm_wfi_trap_policy __read_mostly = KVM_WFX_NOTRAP_SINGLE_TASK; static enum kvm_wfx_trap_policy kvm_wfe_trap_policy __read_mostly = KVM_WFX_NOTRAP_SINGLE_TASK; +DEFINE_STATIC_KEY_FALSE(kvm_rme_is_available); + DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector); DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_base); @@ -137,6 +151,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, r = 0; set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags); break; + case KVM_CAP_ARM_RME: + mutex_lock(&kvm->lock); + r = kvm_realm_enable_cap(kvm, cap); + mutex_unlock(&kvm->lock); + break; default: break; } @@ -168,6 +187,22 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) mutex_unlock(&kvm->lock); #endif + if (type & ~(KVM_VM_TYPE_ARM_MASK | KVM_VM_TYPE_ARM_IPA_SIZE_MASK)) + return -EINVAL; + + switch (type & KVM_VM_TYPE_ARM_MASK) { + case KVM_VM_TYPE_ARM_NORMAL: + break; + case KVM_VM_TYPE_ARM_REALM: + if (!static_branch_unlikely(&kvm_rme_is_available)) + return -EPERM; + WRITE_ONCE(kvm->arch.realm.state, REALM_STATE_NONE); + kvm->arch.is_realm = true; + break; + default: + return -EINVAL; + } + kvm_init_nested(kvm); ret = kvm_share_hyp(kvm, kvm + 1); @@ -199,6 +234,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES); + /* Initialise the realm bits after the generic bits are enabled */ + if (kvm_is_realm(kvm)) { + ret = kvm_init_realm_vm(kvm); + if (ret) + goto err_free_cpumask; + } + return 0; err_free_cpumask: @@ -258,6 +300,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kvm_unshare_hyp(kvm, kvm + 1); kvm_arm_teardown_hypercalls(kvm); + kvm_destroy_realm(kvm); } static bool kvm_has_full_ptr_auth(void) @@ -312,23 +355,25 @@ int kvm_vm_ioctl_check_extension(struct kvm 
*kvm, long ext) case KVM_CAP_ONE_REG: case KVM_CAP_ARM_PSCI: case KVM_CAP_ARM_PSCI_0_2: - case KVM_CAP_READONLY_MEM: case KVM_CAP_MP_STATE: case KVM_CAP_IMMEDIATE_EXIT: case KVM_CAP_VCPU_EVENTS: case KVM_CAP_ARM_IRQ_LINE_LAYOUT_2: case KVM_CAP_ARM_NISV_TO_USER: case KVM_CAP_ARM_INJECT_EXT_DABT: - case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: case KVM_CAP_ARM_SYSTEM_SUSPEND: case KVM_CAP_IRQFD_RESAMPLE: - case KVM_CAP_COUNTER_OFFSET: case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS: case KVM_CAP_ARM_SEA_TO_USER: r = 1; break; + case KVM_CAP_COUNTER_OFFSET: + case KVM_CAP_READONLY_MEM: + case KVM_CAP_SET_GUEST_DEBUG: + r = !kvm_is_realm(kvm); + break; case KVM_CAP_SET_GUEST_DEBUG2: return KVM_GUESTDBG_VALID_MASK; case KVM_CAP_ARM_SET_DEVICE_ADDR: @@ -368,7 +413,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = system_supports_mte(); break; case KVM_CAP_STEAL_TIME: - r = kvm_arm_pvtime_supported(); + if (kvm_is_realm(kvm)) + r = 0; + else + r = kvm_arm_pvtime_supported(); break; case KVM_CAP_ARM_EL1_32BIT: r = cpus_have_final_cap(ARM64_HAS_32BIT_EL1); @@ -380,10 +428,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = cpus_have_final_cap(ARM64_HAS_HCR_NV1); break; case KVM_CAP_GUEST_DEBUG_HW_BPS: - r = get_num_brps(); + r = kvm_is_realm(kvm) ? 0 : get_num_brps(); break; case KVM_CAP_GUEST_DEBUG_HW_WPS: - r = get_num_wrps(); + r = kvm_is_realm(kvm) ? 0 : get_num_wrps(); break; case KVM_CAP_ARM_PMU_V3: r = kvm_supports_guest_pmuv3(); @@ -395,7 +443,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = get_kvm_ipa_limit(); break; case KVM_CAP_ARM_SVE: - r = system_supports_sve(); + if (kvm_is_realm(kvm)) + r = kvm_rme_supports_sve(); + else + r = system_supports_sve(); break; case KVM_CAP_ARM_PTRAUTH_ADDRESS: case KVM_CAP_ARM_PTRAUTH_GENERIC: @@ -419,6 +470,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) else r = kvm_supports_cacheable_pfnmap(); break; + case KVM_CAP_ARM_RME: + r = static_key_enabled(&kvm_rme_is_available); + break; default: r = 0; @@ -471,6 +525,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) /* Force users to call KVM_ARM_VCPU_INIT */ vcpu_clear_flag(vcpu, VCPU_INITIALIZED); + vcpu->arch.rec.mpidr = INVALID_HWID; + vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO; /* Set up the timer */ @@ -581,7 +637,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) struct kvm_s2_mmu *mmu; int *last_ran; - if (is_protected_kvm_enabled()) + if (is_protected_kvm_enabled() || kvm_is_realm(vcpu->kvm)) goto nommu; if (vcpu_has_nv(vcpu)) @@ -624,12 +680,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_timer_vcpu_load(vcpu); kvm_vgic_load(vcpu); kvm_vcpu_load_debug(vcpu); - if (has_vhe()) - kvm_vcpu_load_vhe(vcpu); - kvm_arch_vcpu_load_fp(vcpu); - kvm_vcpu_pmu_restore_guest(vcpu); - if (kvm_arm_is_pvtime_enabled(&vcpu->arch)) - kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu); if (kvm_vcpu_should_clear_twe(vcpu)) vcpu->arch.hcr_el2 &= ~HCR_TWE; @@ -651,6 +701,17 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) &vcpu->arch.vgic_cpu.vgic_v3); } + /* No additional state needs to be loaded on Realmed VMs */ + if (vcpu_is_rec(vcpu)) + return; + + if (has_vhe()) + kvm_vcpu_load_vhe(vcpu); + kvm_arch_vcpu_load_fp(vcpu); + kvm_vcpu_pmu_restore_guest(vcpu); + if (kvm_arm_is_pvtime_enabled(&vcpu->arch)) + kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu); + if (!cpumask_test_cpu(cpu, vcpu->kvm->arch.supported_cpus)) vcpu_set_on_unsupported_cpu(vcpu); } @@ -663,19 +724,24 @@ void 
kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) kvm_call_hyp_nvhe(__pkvm_vcpu_put); } + kvm_timer_vcpu_put(vcpu); + kvm_vgic_put(vcpu); + + vcpu->cpu = -1; + + if (vcpu_is_rec(vcpu)) + return; + kvm_vcpu_put_debug(vcpu); kvm_arch_vcpu_put_fp(vcpu); if (has_vhe()) kvm_vcpu_put_vhe(vcpu); - kvm_timer_vcpu_put(vcpu); - kvm_vgic_put(vcpu); kvm_vcpu_pmu_restore_host(vcpu); if (vcpu_has_nv(vcpu)) kvm_vcpu_put_hw_mmu(vcpu); kvm_arm_vmid_clear_active(); vcpu_clear_on_unsupported_cpu(vcpu); - vcpu->cpu = -1; } static void __kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu) @@ -893,6 +959,11 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) return ret; } + if (!irqchip_in_kernel(kvm) && kvm_is_realm(vcpu->kvm)) { + /* Userspace irqchip not yet supported with Realms */ + return -EOPNOTSUPP; + } + mutex_lock(&kvm->arch.config_lock); set_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags); mutex_unlock(&kvm->arch.config_lock); @@ -1179,6 +1250,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) run->exit_reason = KVM_EXIT_UNKNOWN; run->flags = 0; while (ret > 0) { + bool pmu_stopped = false; + /* * Check conditions before entering the guest */ @@ -1189,6 +1262,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) if (ret > 0) ret = check_vcpu_requests(vcpu); + if (ret > 0 && vcpu_is_rec(vcpu)) + ret = kvm_rec_pre_enter(vcpu); + /* * Preparing the interrupts to be injected also * involves poking the GIC, which must be done in a @@ -1201,6 +1277,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) if (kvm_vcpu_has_pmu(vcpu)) kvm_pmu_flush_hwstate(vcpu); + if (vcpu_is_rec(vcpu) && kvm_pmu_get_irq_level(vcpu)) { + pmu_stopped = true; + arm_pmu_set_phys_irq(false); + } + local_irq_disable(); kvm_vgic_flush_hwstate(vcpu); @@ -1236,7 +1317,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) trace_kvm_entry(*vcpu_pc(vcpu)); guest_timing_enter_irqoff(); - ret = kvm_arm_vcpu_enter_exit(vcpu); + if (vcpu_is_rec(vcpu)) + ret = kvm_rec_enter(vcpu); + else + ret = kvm_arm_vcpu_enter_exit(vcpu); vcpu->mode = OUTSIDE_GUEST_MODE; vcpu->stat.exits++; @@ -1294,13 +1378,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu)); - /* Exit types that need handling before we can be preempted */ - handle_exit_early(vcpu, ret); + if (!vcpu_is_rec(vcpu)) { + /* + * Exit types that need handling before we can be + * preempted + */ + handle_exit_early(vcpu, ret); + } kvm_nested_sync_hwstate(vcpu); preempt_enable(); + if (pmu_stopped) + arm_pmu_set_phys_irq(true); + /* * The ARMv8 architecture doesn't give the hypervisor * a mechanism to prevent a guest from dropping to AArch32 EL0 @@ -1320,7 +1412,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) ret = ARM_EXCEPTION_IL; } - ret = handle_exit(vcpu, ret); + if (vcpu_is_rec(vcpu)) + ret = handle_rec_exit(vcpu, ret); + else + ret = handle_exit(vcpu, ret); } /* Tell userspace about in-kernel device output levels */ @@ -1434,7 +1529,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, return -EINVAL; } -static unsigned long system_supported_vcpu_features(void) +static unsigned long system_supported_vcpu_features(struct kvm *kvm) { unsigned long features = KVM_VCPU_VALID_FEATURES; @@ -1455,6 +1550,9 @@ static unsigned long system_supported_vcpu_features(void) if (!cpus_have_final_cap(ARM64_HAS_NESTED_VIRT)) clear_bit(KVM_ARM_VCPU_HAS_EL2, &features); + if (!kvm_is_realm(kvm)) + clear_bit(KVM_ARM_VCPU_REC, &features); + return features; } @@ -1472,7 +1570,7 @@ static int 
kvm_vcpu_init_check_features(struct kvm_vcpu *vcpu, return -ENOENT; } - if (features & ~system_supported_vcpu_features()) + if (features & ~system_supported_vcpu_features(vcpu->kvm)) return -EINVAL; /* @@ -1494,6 +1592,10 @@ static int kvm_vcpu_init_check_features(struct kvm_vcpu *vcpu, if (test_bit(KVM_ARM_VCPU_HAS_EL2, &features)) return -EINVAL; + /* RME is incompatible with AArch32 */ + if (test_bit(KVM_ARM_VCPU_REC, &features)) + return -EINVAL; + return 0; } @@ -1699,6 +1801,22 @@ static int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, return __kvm_arm_vcpu_set_events(vcpu, events); } +static int kvm_arm_vcpu_rmm_psci_complete(struct kvm_vcpu *vcpu, + struct kvm_arm_rmm_psci_complete *arg) +{ + struct kvm_vcpu *target = kvm_mpidr_to_vcpu(vcpu->kvm, arg->target_mpidr); + + if (!target) + return -EINVAL; + + /* + * RMM v1.0 only supports PSCI_RET_SUCCESS or PSCI_RET_DENIED + * for the status. But, let us leave it to the RMM to filter + * for making this future proof. + */ + return realm_psci_complete(vcpu, target, arg->psci_status); +} + long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -1753,10 +1871,6 @@ long kvm_arch_vcpu_ioctl(struct file *filp, if (unlikely(!kvm_vcpu_initialized(vcpu))) break; - r = -EPERM; - if (!kvm_arm_vcpu_is_finalized(vcpu)) - break; - r = -EFAULT; if (copy_from_user(®_list, user_list, sizeof(reg_list))) break; @@ -1827,6 +1941,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp, return kvm_arm_vcpu_finalize(vcpu, what); } + case KVM_ARM_VCPU_RMM_PSCI_COMPLETE: { + struct kvm_arm_rmm_psci_complete req; + + if (!vcpu_is_rec(vcpu)) + return -EPERM; + if (copy_from_user(&req, argp, sizeof(req))) + return -EFAULT; + return kvm_arm_vcpu_rmm_psci_complete(vcpu, &req); + } default: r = -EINVAL; } @@ -2849,6 +2972,8 @@ static __init int kvm_arm_init(void) in_hyp_mode = is_kernel_in_hyp_mode(); + kvm_init_rme(); + if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) || cpus_have_final_cap(ARM64_WORKAROUND_1508412)) kvm_info("Guests without required CPU erratum workarounds can deadlock system!\n" \ diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c index 16ba5e9ac86c3..c5bdbcede0861 100644 --- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -73,6 +73,25 @@ static u64 core_reg_offset_from_id(u64 id) return id & ~(KVM_REG_ARCH_MASK | KVM_REG_SIZE_MASK | KVM_REG_ARM_CORE); } +static bool kvm_realm_validate_core_reg(u64 off) +{ + /* + * Note that GPRs can only sometimes be controlled by the VMM. + * For PSCI only X0-X6 are used, higher registers are ignored (restored + * from the REC). + * For HOST_CALL all of X0-X30 are copied to the RsiHostCall structure. + * For emulated MMIO X0 is always used. + * PC can only be set before the realm is activated. + */ + switch (off) { + case KVM_REG_ARM_CORE_REG(regs.regs[0]) ... + KVM_REG_ARM_CORE_REG(regs.regs[30]): + case KVM_REG_ARM_CORE_REG(regs.pc): + return true; + } + return false; +} + static int core_reg_size_from_offset(const struct kvm_vcpu *vcpu, u64 off) { int size; @@ -342,7 +361,7 @@ static int set_sve_vls(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) if (!vcpu_has_sve(vcpu)) return -ENOENT; - if (kvm_arm_vcpu_sve_finalized(vcpu)) + if (kvm_arm_vcpu_sve_finalized(vcpu) || kvm_realm_is_created(vcpu->kvm)) return -EPERM; /* too late! 
*/ if (WARN_ON(vcpu->arch.sve_state)) @@ -356,7 +375,7 @@ static int set_sve_vls(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) if (vq_present(vqs, vq)) max_vq = vq; - if (max_vq > sve_vq_from_vl(kvm_sve_max_vl)) + if (max_vq > sve_vq_from_vl(kvm_sve_get_max_vl(vcpu->kvm))) return -EINVAL; /* @@ -600,8 +619,6 @@ static const u64 timer_reg_list[] = { KVM_REG_ARM_PTIMER_CVAL, }; -#define NUM_TIMER_REGS ARRAY_SIZE(timer_reg_list) - static bool is_timer_reg(u64 index) { switch (index) { @@ -616,9 +633,14 @@ static bool is_timer_reg(u64 index) return false; } +static inline unsigned long num_timer_regs(struct kvm_vcpu *vcpu) +{ + return kvm_is_realm(vcpu->kvm) ? 0 : ARRAY_SIZE(timer_reg_list); +} + static int copy_timer_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) { - for (int i = 0; i < NUM_TIMER_REGS; i++) { + for (unsigned long i = 0; i < num_timer_regs(vcpu); i++) { if (put_user(timer_reg_list[i], uindices)) return -EFAULT; uindices++; @@ -656,8 +678,11 @@ static unsigned long num_sve_regs(const struct kvm_vcpu *vcpu) if (!vcpu_has_sve(vcpu)) return 0; - /* Policed by KVM_GET_REG_LIST: */ - WARN_ON(!kvm_arm_vcpu_sve_finalized(vcpu)); + if (!kvm_arm_vcpu_sve_finalized(vcpu)) + return 1; /* KVM_REG_ARM64_SVE_VLS */ + + if (kvm_is_realm(vcpu->kvm)) + return 1; /* KVM_REG_ARM64_SVE_VLS */ return slices * (SVE_NUM_PREGS + SVE_NUM_ZREGS + 1 /* FFR */) + 1; /* KVM_REG_ARM64_SVE_VLS */ @@ -674,9 +699,6 @@ static int copy_sve_reg_indices(const struct kvm_vcpu *vcpu, if (!vcpu_has_sve(vcpu)) return 0; - /* Policed by KVM_GET_REG_LIST: */ - WARN_ON(!kvm_arm_vcpu_sve_finalized(vcpu)); - /* * Enumerate this first, so that userspace can save/restore in * the order reported by KVM_GET_REG_LIST: @@ -686,6 +708,12 @@ static int copy_sve_reg_indices(const struct kvm_vcpu *vcpu, return -EFAULT; ++num_regs; + if (!kvm_arm_vcpu_sve_finalized(vcpu)) + return num_regs; + + if (kvm_is_realm(vcpu->kvm)) + return num_regs; + for (i = 0; i < slices; i++) { for (n = 0; n < SVE_NUM_ZREGS; n++) { reg = KVM_REG_ARM64_SVE_ZREG(n, i); @@ -724,7 +752,7 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu) res += num_sve_regs(vcpu); res += kvm_arm_num_sys_reg_descs(vcpu); res += kvm_arm_get_fw_num_regs(vcpu); - res += NUM_TIMER_REGS; + res += num_timer_regs(vcpu); return res; } @@ -758,7 +786,7 @@ int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) ret = copy_timer_indices(vcpu, uindices); if (ret < 0) return ret; - uindices += NUM_TIMER_REGS; + uindices += num_timer_regs(vcpu); return kvm_arm_copy_sys_reg_indices(vcpu, uindices); } @@ -783,12 +811,44 @@ int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) return kvm_arm_sys_reg_get_reg(vcpu, reg); } +#define KVM_REG_ARM_PMCR_EL0 ARM64_SYS_REG(3, 3, 9, 12, 0) +#define KVM_REG_ARM_ID_AA64DFR0_EL1 ARM64_SYS_REG(3, 0, 0, 5, 0) + +/* + * The RMI ABI only enables setting some GPRs and PC. The selection of GPRs + * that are available depends on the Realm state and the reason for the last + * exit. All other registers are reset to architectural or otherwise defined + * reset values by the RMM, except for a few configuration fields that + * correspond to Realm parameters. 
+ */ +static bool validate_realm_set_reg(struct kvm_vcpu *vcpu, + const struct kvm_one_reg *reg) +{ + if ((reg->id & KVM_REG_ARM_COPROC_MASK) == KVM_REG_ARM_CORE) { + u64 off = core_reg_offset_from_id(reg->id); + + return kvm_realm_validate_core_reg(off); + } else { + switch (reg->id) { + case KVM_REG_ARM_PMCR_EL0: + case KVM_REG_ARM_ID_AA64DFR0_EL1: + case KVM_REG_ARM64_SVE_VLS: + return true; + } + } + + return false; +} + int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) { /* We currently use nothing arch-specific in upper 32 bits */ if ((reg->id & ~KVM_REG_SIZE_MASK) >> 32 != KVM_REG_ARM64 >> 32) return -EINVAL; + if (kvm_is_realm(vcpu->kvm) && !validate_realm_set_reg(vcpu, reg)) + return -EINVAL; + switch (reg->id & KVM_REG_ARM_COPROC_MASK) { case KVM_REG_ARM_CORE: return set_core_reg(vcpu, reg); case KVM_REG_ARM_FW: @@ -856,6 +916,30 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, u64 esr = events->exception.serror_esr; int ret = 0; + if (vcpu_is_rec(vcpu)) { + /* Cannot inject SError into a Realm. */ + if (serror_pending) + return -EINVAL; + + /* + * If a data abort is pending, set the flag and let the RMM + * inject an SEA when the REC is scheduled to be run. + */ + if (ext_dabt_pending) { + /* + * Can only inject SEA into a Realm if the previous exit + * was due to a data abort of an Unprotected IPA. + */ + if (!(vcpu->arch.rec.run->enter.flags & REC_ENTER_FLAG_EMULATED_MMIO)) + return -EINVAL; + + vcpu->arch.rec.run->enter.flags &= ~REC_ENTER_FLAG_EMULATED_MMIO; + vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_INJECT_SEA; + } + + return 0; + } + /* * Immediately commit the pending SEA to the vCPU's architectural * state which is necessary since we do not return a pending SEA diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c index 58c5fe7d75727..70ac7971416c0 100644 --- a/arch/arm64/kvm/hypercalls.c +++ b/arch/arm64/kvm/hypercalls.c @@ -414,14 +414,14 @@ void kvm_arm_teardown_hypercalls(struct kvm *kvm) int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu) { - return ARRAY_SIZE(kvm_arm_fw_reg_ids); + return kvm_is_realm(vcpu->kvm) ? 
0 : ARRAY_SIZE(kvm_arm_fw_reg_ids); } int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) { int i; - for (i = 0; i < ARRAY_SIZE(kvm_arm_fw_reg_ids); i++) { + for (i = 0; i < kvm_arm_get_fw_num_regs(vcpu); i++) { if (put_user(kvm_arm_fw_reg_ids[i], uindices++)) return -EFAULT; } diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c index 18890bbd87c73..03910f9ff449e 100644 --- a/arch/arm64/kvm/inject_fault.c +++ b/arch/arm64/kvm/inject_fault.c @@ -201,7 +201,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt, u32 addr) static void __kvm_inject_sea(struct kvm_vcpu *vcpu, bool iabt, u64 addr) { - if (vcpu_el1_is_32bit(vcpu)) + if (unlikely(vcpu_is_rec(vcpu))) + vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_INJECT_SEA; + else if (vcpu_el1_is_32bit(vcpu)) inject_abt32(vcpu, iabt, addr); else inject_abt64(vcpu, iabt, addr); @@ -298,6 +300,7 @@ void kvm_inject_size_fault(struct kvm_vcpu *vcpu) */ void kvm_inject_undefined(struct kvm_vcpu *vcpu) { + WARN(vcpu_is_rec(vcpu), "Unexpected undefined exception injection to REC"); if (vcpu_el1_is_32bit(vcpu)) inject_undef32(vcpu); else diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c index e2285ed8c91de..6a8cb927fccad 100644 --- a/arch/arm64/kvm/mmio.c +++ b/arch/arm64/kvm/mmio.c @@ -6,6 +6,7 @@ #include #include +#include #include #include "trace.h" @@ -138,14 +139,21 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu) trace_kvm_mmio(KVM_TRACE_MMIO_READ, len, run->mmio.phys_addr, &data); data = vcpu_data_host_to_guest(vcpu, data, len); - vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data); + + if (vcpu_is_rec(vcpu)) + vcpu->arch.rec.run->enter.gprs[0] = data; + else + vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data); } /* * The MMIO instruction is emulated and should not be re-executed * in the guest. */ - kvm_incr_pc(vcpu); + if (vcpu_is_rec(vcpu)) + vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_EMULATED_MMIO; + else + kvm_incr_pc(vcpu); return 1; } @@ -167,14 +175,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa) * No valid syndrome? Ask userspace for help if it has * volunteered to do so, and bail out otherwise. * - * In the protected VM case, there isn't much userspace can do + * In the protected/realm VM case, there isn't much userspace can do * though, so directly deliver an exception to the guest. */ if (!kvm_vcpu_dabt_isvalid(vcpu)) { trace_kvm_mmio_nisv(*vcpu_pc(vcpu), esr, kvm_vcpu_get_hfar(vcpu), fault_ipa); - if (vcpu_is_protected(vcpu)) + if (vcpu_is_protected(vcpu) || vcpu_is_rec(vcpu)) return kvm_inject_sea_dabt(vcpu, kvm_vcpu_get_hfar(vcpu)); if (test_bit(KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER, diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 06fa8766133f5..3aac9ceb51b58 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -319,6 +319,7 @@ static void invalidate_icache_guest_page(void *va, size_t size) * @start: The intermediate physical base address of the range to unmap * @size: The size of the area to unmap * @may_block: Whether or not we are permitted to block + * @only_shared: If true then protected mappings should not be unmapped * * Clear a range of stage-2 mappings, lowering the various ref-counts. Must * be called while holding mmu_lock (unless for freeing the stage2 pgd before @@ -326,21 +327,28 @@ static void invalidate_icache_guest_page(void *va, size_t size) * with things behind our backs. 
*/ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size, - bool may_block) + bool may_block, bool only_shared) { struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); phys_addr_t end = start + size; lockdep_assert_held_write(&kvm->mmu_lock); WARN_ON(size & ~PAGE_MASK); - WARN_ON(stage2_apply_range(mmu, start, end, KVM_PGT_FN(kvm_pgtable_stage2_unmap), - may_block)); + + if (kvm_is_realm(kvm)) { + kvm_realm_unmap_range(kvm, start, size, !only_shared, + may_block); + } else { + WARN_ON(stage2_apply_range(mmu, start, end, + KVM_PGT_FN(kvm_pgtable_stage2_unmap), + may_block)); + } } void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size, bool may_block) { - __unmap_stage2_range(mmu, start, size, may_block); + __unmap_stage2_range(mmu, start, size, may_block, false); } void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end) @@ -354,7 +362,10 @@ static void stage2_flush_memslot(struct kvm *kvm, phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT; phys_addr_t end = addr + PAGE_SIZE * memslot->npages; - kvm_stage2_flush_range(&kvm->arch.mmu, addr, end); + if (kvm_is_realm(kvm)) + kvm_realm_unmap_range(kvm, addr, end - addr, false, true); + else + kvm_stage2_flush_range(&kvm->arch.mmu, addr, end); } /** @@ -872,14 +883,15 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = { .icache_inval_pou = invalidate_icache_guest_page, }; -static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type) +static int kvm_init_ipa_range(struct kvm *kvm, + struct kvm_s2_mmu *mmu, unsigned long type) { u32 kvm_ipa_limit = get_kvm_ipa_limit(); u64 mmfr0, mmfr1; u32 phys_shift; - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) - return -EINVAL; + if (kvm_is_realm(kvm)) + kvm_ipa_limit = kvm_realm_ipa_limit(); phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); if (is_protected_kvm_enabled()) { @@ -942,7 +954,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t return -EINVAL; } - err = kvm_init_ipa_range(mmu, type); + err = kvm_init_ipa_range(kvm, mmu, type); if (err) return err; @@ -1047,6 +1059,10 @@ void stage2_unmap_vm(struct kvm *kvm) struct kvm_memory_slot *memslot; int idx, bkt; + /* For realms this is handled by the RMM so nothing to do here */ + if (kvm_is_realm(kvm)) + return; + idx = srcu_read_lock(&kvm->srcu); mmap_read_lock(current->mm); write_lock(&kvm->mmu_lock); @@ -1065,10 +1081,26 @@ void stage2_unmap_vm(struct kvm *kvm) void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu) { struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); - struct kvm_pgtable *pgt = NULL; + struct kvm_pgtable *pgt; write_lock(&kvm->mmu_lock); pgt = mmu->pgt; + if (kvm_is_realm(kvm) && + (kvm_realm_state(kvm) != REALM_STATE_DEAD && + kvm_realm_state(kvm) != REALM_STATE_NONE)) { + struct realm *realm = &kvm->arch.realm; + + kvm_stage2_unmap_range(mmu, 0, BIT(realm->ia_bits - 1), true); + write_unlock(&kvm->mmu_lock); + kvm_realm_destroy_rtts(kvm, pgt->ia_bits); + + /* + * The PGD pages can be reclaimed only after the realm (RD) is + * destroyed. We call this again from kvm_destroy_realm() after + * the RD is destroyed. 
+ */ + return; + } if (pgt) { mmu->pgd_phys = 0; mmu->pgt = NULL; @@ -1081,7 +1113,8 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu) write_unlock(&kvm->mmu_lock); if (pgt) { - KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt); + if (!kvm_is_realm(kvm)) + KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt); kfree(pgt); } } @@ -1156,6 +1189,10 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, if (is_protected_kvm_enabled()) return -EPERM; + /* We don't support mapping special pages into a Realm */ + if (kvm_is_realm(kvm)) + return -EPERM; + size += offset_in_page(guest_ipa); guest_ipa &= PAGE_MASK; @@ -1470,6 +1507,84 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma) return vma->vm_flags & VM_MTE_ALLOWED; } +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa, + kvm_pfn_t pfn, unsigned long map_size, + enum kvm_pgtable_prot prot, + struct kvm_mmu_memory_cache *memcache) +{ + struct realm *realm = &kvm->arch.realm; + + /* + * Write permission is required for now even though it's possible to + * map unprotected pages (granules) as read-only. It's impossible to + * map protected pages (granules) as read-only. + */ + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W))) + return -EFAULT; + + ipa = ALIGN_DOWN(ipa, PAGE_SIZE); + if (!kvm_realm_is_private_address(realm, ipa)) + return realm_map_non_secure(realm, ipa, pfn, map_size, + memcache); + + return realm_map_protected(realm, ipa, pfn, map_size, memcache); +} + +static int private_memslot_fault(struct kvm_vcpu *vcpu, + phys_addr_t fault_ipa, + struct kvm_memory_slot *memslot) +{ + struct kvm *kvm = vcpu->kvm; + gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa); + gfn_t gfn = gpa >> PAGE_SHIFT; + bool is_priv_gfn = kvm_mem_is_private(kvm, gfn); + struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; + struct page *page; + kvm_pfn_t pfn; + int ret; + /* + * For Realms, the shared address is an alias of the private GPA with + * the top bit set. Thus if the fault address matches the GPA then it + * is the private alias.
+ */ + bool is_priv_fault = (gpa == fault_ipa); + + if (is_priv_gfn != is_priv_fault) { + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE, + kvm_is_write_fault(vcpu), false, + is_priv_fault); + + /* + * KVM_EXIT_MEMORY_FAULT requires a return code of -EFAULT, + * see the API documentation + */ + return -EFAULT; + } + + if (!is_priv_fault) { + /* Not a private mapping, handle it normally */ + return -EINVAL; + } + + ret = kvm_mmu_topup_memory_cache(memcache, + kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu)); + if (ret) + return ret; + + ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL); + if (ret) + return ret; + + /* FIXME: Should be able to use bigger than PAGE_SIZE mappings */ + ret = realm_map_ipa(kvm, fault_ipa, pfn, PAGE_SIZE, KVM_PGTABLE_PROT_W, + memcache); + if (!ret) + return 1; /* Handled */ + + put_page(page); + return ret; +} + static bool kvm_vma_is_cacheable(struct vm_area_struct *vma) { switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) { @@ -1510,6 +1625,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (fault_is_perm) fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu); write_fault = kvm_is_write_fault(vcpu); + + /* + * Realms cannot map protected pages read-only + * FIXME: It should be possible to map unprotected pages read-only + */ + if (vcpu_is_rec(vcpu)) + write_fault = true; + exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu); VM_BUG_ON(write_fault && exec_fault); @@ -1555,6 +1678,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (logging_active) { force_pte = true; vma_shift = PAGE_SHIFT; + } else if (vcpu_is_rec(vcpu)) { + /* Force PTE level mappings for realms */ + force_pte = true; + vma_shift = PAGE_SHIFT; } else { vma_shift = get_vma_page_shift(vma, hva); } @@ -1622,7 +1749,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, ipa &= ~(vma_pagesize - 1); } - gfn = ipa >> PAGE_SHIFT; + gfn = kvm_gpa_from_fault(kvm, ipa) >> PAGE_SHIFT; mte_allowed = kvm_vma_mte_allowed(vma); if (!cpus_have_cap(ARM64_WORKAROUND_NC_TO_NGNRE)) @@ -1715,6 +1842,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return 1; } + /* + * For now we shouldn't be hitting protected addresses because they are + * handled in private_memslot_fault(). In the future this check may be + * relaxed to support e.g. protected devices. + */ + if (vcpu_is_rec(vcpu) && + kvm_gpa_from_fault(kvm, fault_ipa) == fault_ipa) + return -EINVAL; + + /* + * Potentially reduce shadow S2 permissions to match the guest's own + * S2.
For exec faults, we'd only reach this point if the guest @@ -1797,6 +1933,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, */ prot &= ~KVM_NV_GUEST_MAP_SZ; ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags); + } else if (kvm_is_realm(kvm)) { + ret = realm_map_ipa(kvm, fault_ipa, pfn, vma_pagesize, + prot, memcache); } else { ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize, __pfn_to_phys(pfn), prot, @@ -2012,8 +2151,15 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) nested = &nested_trans; } - gfn = ipa >> PAGE_SHIFT; + gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT; memslot = gfn_to_memslot(vcpu->kvm, gfn); + + if (kvm_slot_can_be_private(memslot)) { + ret = private_memslot_fault(vcpu, ipa, memslot); + if (ret != -EINVAL) + goto out; + } + hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable); write_fault = kvm_is_write_fault(vcpu); if (kvm_is_error_hva(hva) || (write_fault && !writable)) { @@ -2056,7 +2202,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) * of the page size. */ ipa |= kvm_vcpu_get_hfar(vcpu) & GENMASK(11, 0); - ret = io_mem_abort(vcpu, ipa); + ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa)); goto out_unlock; } @@ -2088,7 +2234,8 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT, (range->end - range->start) << PAGE_SHIFT, - range->may_block); + range->may_block, + !(range->attr_filter & KVM_FILTER_PRIVATE)); kvm_nested_s2_unmap(kvm, range->may_block); return false; @@ -2101,6 +2248,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) if (!kvm->arch.mmu.pgt) return false; + /* We don't support aging for Realms */ + if (kvm_is_realm(kvm)) + return true; + return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, size, true); @@ -2117,6 +2268,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) if (!kvm->arch.mmu.pgt) return false; + /* We don't support aging for Realms */ + if (kvm_is_realm(kvm)) + return true; + return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, size, false); @@ -2352,6 +2507,30 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, return ret; } +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, + struct kvm_gfn_range *range) +{ + WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)); + return false; +} + +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, + struct kvm_gfn_range *range) +{ + if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) + return false; + + if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) + range->attr_filter = KVM_FILTER_SHARED; + else + range->attr_filter = KVM_FILTER_PRIVATE; + kvm_unmap_gfn_range(kvm, range); + + return false; +} +#endif + void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { } diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index b03dbda7f1ab9..6604aa982d027 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -374,6 +374,9 @@ static bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu) { u64 reg = __vcpu_sys_reg(vcpu, PMOVSSET_EL0); + if (vcpu_is_rec(vcpu)) + return vcpu->arch.rec.run->exit.pmu_ovf_status; + reg &= __vcpu_sys_reg(vcpu, PMINTENSET_EL1); /* @@ -1011,6 +1014,9 @@ u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm) { struct arm_pmu *arm_pmu = kvm->arch.arm_pmu; + 
if (kvm_is_realm(kvm)) + return kvm_realm_max_pmu_counters(); + /* * PMUv3 requires that all event counters are capable of counting any * event, though the same may not be true of non-PMUv3 hardware. diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c index 3b5dbe9a0a0ea..a68f3c1878a57 100644 --- a/arch/arm64/kvm/psci.c +++ b/arch/arm64/kvm/psci.c @@ -103,6 +103,12 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) reset_state->reset = true; kvm_make_request(KVM_REQ_VCPU_RESET, vcpu); + /* + * Make sure we issue PSCI_COMPLETE before the VCPU can be + * scheduled. + */ + if (vcpu_is_rec(vcpu)) + realm_psci_complete(source_vcpu, vcpu, PSCI_RET_SUCCESS); /* * Make sure the reset request is observed if the RUNNABLE mp_state is @@ -115,6 +121,11 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) out_unlock: spin_unlock(&vcpu->arch.mp_state_lock); + if (vcpu_is_rec(vcpu) && ret != PSCI_RET_SUCCESS) { + realm_psci_complete(source_vcpu, vcpu, + ret == PSCI_RET_ALREADY_ON ? + PSCI_RET_SUCCESS : PSCI_RET_DENIED); + } return ret; } @@ -142,6 +153,25 @@ static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu) /* Ignore other bits of target affinity */ target_affinity &= target_affinity_mask; + if (vcpu_is_rec(vcpu)) { + struct kvm_vcpu *target_vcpu; + + /* RMM supports only zero affinity level */ + if (lowest_affinity_level != 0) + return PSCI_RET_INVALID_PARAMS; + + target_vcpu = kvm_mpidr_to_vcpu(kvm, target_affinity); + if (!target_vcpu) + return PSCI_RET_INVALID_PARAMS; + + /* + * Provide the references of the source and target RECs to the + * RMM so that the RMM can complete the PSCI request. + */ + realm_psci_complete(vcpu, target_vcpu, PSCI_RET_SUCCESS); + return PSCI_RET_SUCCESS; + } + /* * If one or more VCPU matching target affinity are running * then ON else OFF diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c index 959532422d3a3..6396aaa367360 100644 --- a/arch/arm64/kvm/reset.c +++ b/arch/arm64/kvm/reset.c @@ -46,7 +46,7 @@ unsigned int __ro_after_init kvm_host_sve_max_vl; #define VCPU_RESET_PSTATE_SVC (PSR_AA32_MODE_SVC | PSR_AA32_A_BIT | \ PSR_AA32_I_BIT | PSR_AA32_F_BIT) -unsigned int __ro_after_init kvm_sve_max_vl; +static unsigned int __ro_after_init kvm_sve_max_vl; int __init kvm_arm_init_sve(void) { @@ -76,9 +76,17 @@ int __init kvm_arm_init_sve(void) return 0; } +unsigned int kvm_sve_get_max_vl(struct kvm *kvm) +{ + if (kvm_is_realm(kvm)) + return kvm_realm_sve_max_vl(); + else + return kvm_sve_max_vl; +} + static void kvm_vcpu_enable_sve(struct kvm_vcpu *vcpu) { - vcpu->arch.sve_max_vl = kvm_sve_max_vl; + vcpu->arch.sve_max_vl = kvm_sve_get_max_vl(vcpu->kvm); /* * Userspace can still customize the vector lengths by writing @@ -137,6 +145,11 @@ int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature) return -EPERM; return kvm_vcpu_finalize_sve(vcpu); + case KVM_ARM_VCPU_REC: + if (!kvm_is_realm(vcpu->kvm) || !vcpu_is_rec(vcpu)) + return -EINVAL; + + return kvm_create_rec(vcpu); } return -EINVAL; @@ -147,6 +160,11 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu) if (vcpu_has_sve(vcpu) && !kvm_arm_vcpu_sve_finalized(vcpu)) return false; + if (kvm_is_realm(vcpu->kvm) && + !(vcpu_is_rec(vcpu) && kvm_arm_rec_finalized(vcpu) && + READ_ONCE(vcpu->kvm->arch.realm.state) == REALM_STATE_ACTIVE)) + return false; + return true; } @@ -161,6 +179,7 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu) free_page((unsigned long)vcpu->arch.ctxt.vncr_array); kfree(vcpu->arch.vncr_tlb); kfree(vcpu->arch.ccsidr); + 
kvm_destroy_rec(vcpu); } static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu) diff --git a/arch/arm64/kvm/rme-exit.c b/arch/arm64/kvm/rme-exit.c new file mode 100644 index 0000000000000..1a8ca75268635 --- /dev/null +++ b/arch/arm64/kvm/rme-exit.c @@ -0,0 +1,207 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2023 ARM Ltd. + */ + +#include +#include +#include + +#include +#include +#include +#include + +typedef int (*exit_handler_fn)(struct kvm_vcpu *vcpu); + +static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + + vcpu_err(vcpu, "Unhandled exit reason from realm (ESR: %#llx)\n", + rec->run->exit.esr); + return -ENXIO; +} + +static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + + /* + * In the case of a write, copy over gprs[0] to the target GPR, + * preparing to handle MMIO write fault. The content to be written has + * been saved to gprs[0] by the RMM (even if another register was used + * by the guest). In the case of normal memory access this is redundant + * (the guest will replay the instruction), but the overhead is + * minimal. + */ + if (kvm_vcpu_dabt_iswrite(vcpu) && kvm_vcpu_dabt_isvalid(vcpu)) + vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), + rec->run->exit.gprs[0]); + + return kvm_handle_guest_abort(vcpu); +} + +static int rec_exit_sync_iabt(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + + vcpu_err(vcpu, "Unhandled instruction abort (ESR: %#llx).\n", + rec->run->exit.esr); + return -ENXIO; +} + +static int rec_exit_sys_reg(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + unsigned long esr = kvm_vcpu_get_esr(vcpu); + int rt = kvm_vcpu_sys_get_rt(vcpu); + bool is_write = !(esr & 1); + int ret; + + if (is_write) + vcpu_set_reg(vcpu, rt, rec->run->exit.gprs[0]); + + ret = kvm_handle_sys_reg(vcpu); + if (!is_write) + rec->run->enter.gprs[0] = vcpu_get_reg(vcpu, rt); + + return ret; +} + +static exit_handler_fn rec_exit_handlers[] = { + [0 ... 
ESR_ELx_EC_MAX] = rec_exit_reason_notimpl, + [ESR_ELx_EC_SYS64] = rec_exit_sys_reg, + [ESR_ELx_EC_DABT_LOW] = rec_exit_sync_dabt, + [ESR_ELx_EC_IABT_LOW] = rec_exit_sync_iabt +}; + +static int rec_exit_psci(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + int i; + + for (i = 0; i < REC_RUN_GPRS; i++) + vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]); + + return kvm_smccc_call_handler(vcpu); +} + +static int rec_exit_ripas_change(struct kvm_vcpu *vcpu) +{ + struct kvm *kvm = vcpu->kvm; + struct realm *realm = &kvm->arch.realm; + struct realm_rec *rec = &vcpu->arch.rec; + unsigned long base = rec->run->exit.ripas_base; + unsigned long top = rec->run->exit.ripas_top; + unsigned long ripas = rec->run->exit.ripas_value; + + if (!kvm_realm_is_private_address(realm, base) || + !kvm_realm_is_private_address(realm, top - 1)) { + vcpu_err(vcpu, "Invalid RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n", + base, top, ripas); + /* Set RMI_REJECT bit */ + rec->run->enter.flags = REC_ENTER_FLAG_RIPAS_RESPONSE; + return -EINVAL; + } + + /* Exit to VMM, the actual RIPAS change is done on next entry */ + kvm_prepare_memory_fault_exit(vcpu, base, top - base, false, false, + ripas == RMI_RAM); + + /* + * KVM_EXIT_MEMORY_FAULT requires a return code of -EFAULT, see the + * API documentation + */ + return -EFAULT; +} + +static int rec_exit_host_call(struct kvm_vcpu *vcpu) +{ + int i; + struct realm_rec *rec = &vcpu->arch.rec; + + vcpu->stat.hvc_exit_stat++; + + for (i = 0; i < REC_RUN_GPRS; i++) + vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]); + + return kvm_smccc_call_handler(vcpu); +} + +static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + + __vcpu_assign_sys_reg(vcpu, CNTV_CTL_EL0, rec->run->exit.cntv_ctl); + __vcpu_assign_sys_reg(vcpu, CNTV_CVAL_EL0, rec->run->exit.cntv_cval); + __vcpu_assign_sys_reg(vcpu, CNTP_CTL_EL0, rec->run->exit.cntp_ctl); + __vcpu_assign_sys_reg(vcpu, CNTP_CVAL_EL0, rec->run->exit.cntp_cval); + + kvm_realm_timers_update(vcpu); +} + +/* + * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on + * proper exit to userspace. + */ +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret) +{ + struct realm_rec *rec = &vcpu->arch.rec; + u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr); + unsigned long status, index; + + status = RMI_RETURN_STATUS(rec_run_ret); + index = RMI_RETURN_INDEX(rec_run_ret); + + /* + * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we might + * see the following status code and index indicating an attempt to run + * a REC when the RD state is SYSTEM_OFF. In this case, we just need to + * return to user space which can deal with the system event or will try + * to run the KVM VCPU again, at which point we will no longer attempt + * to enter the Realm because we will have a sleep request pending on + * the VCPU as a result of KVM's PSCI handling.
+ */ + if (status == RMI_ERROR_REALM && index == 1) { + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN; + return 0; + } + + if (rec_run_ret) + return -ENXIO; + + vcpu->arch.fault.esr_el2 = rec->run->exit.esr; + vcpu->arch.fault.far_el2 = rec->run->exit.far; + /* HPFAR_EL2 is only valid for RMI_EXIT_SYNC */ + vcpu->arch.fault.hpfar_el2 = 0; + + update_arch_timer_irq_lines(vcpu); + + /* Reset the emulation flags for the next run of the REC */ + rec->run->enter.flags = 0; + + switch (rec->run->exit.exit_reason) { + case RMI_EXIT_SYNC: + /* + * HPFAR_EL2_NS is hijacked to indicate a valid HPFAR value, + * see __get_fault_info() + */ + vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar | HPFAR_EL2_NS; + return rec_exit_handlers[esr_ec](vcpu); + case RMI_EXIT_IRQ: + case RMI_EXIT_FIQ: + return 1; + case RMI_EXIT_PSCI: + return rec_exit_psci(vcpu); + case RMI_EXIT_RIPAS_CHANGE: + return rec_exit_ripas_change(vcpu); + case RMI_EXIT_HOST_CALL: + return rec_exit_host_call(vcpu); + } + + kvm_pr_unimpl("Unsupported exit reason: %u\n", + rec->run->exit.exit_reason); + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; + return 0; +} diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c new file mode 100644 index 0000000000000..401a69a37504d --- /dev/null +++ b/arch/arm64/kvm/rme.c @@ -0,0 +1,1970 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2023 ARM Ltd. + */ + +#include + +#include +#include +#include +#include + +#include + +#define MECID_INVALID (-1) + +/* + * struct mecid_state - Global MECID allocation state + * @shared_mecid: The MECID being shared by multiple realms (-1 if none) + * @shared_mecid_members: Number of realms using the shared MECID + * @bitmap: Bitmap tracking allocated MECIDs + * + * All fields protected by mecid_lock defined below. + */ +struct mecid_state { + int shared_mecid; + unsigned int shared_mecid_members; + unsigned long *bitmap; +}; + +static struct mecid_state mecid_state = { + .shared_mecid = MECID_INVALID, + .shared_mecid_members = 0, + .bitmap = NULL, +}; + +/* + * Protects all fields in struct mecid_state. + * Must be taken after rme_vmid_lock if both locks are needed. + */ +static DEFINE_SPINLOCK(mecid_lock); + +static unsigned long rmm_feat_reg0; +static unsigned long rmm_feat_reg1; + +/* + * Feature register 1 contains a 64-bit MAX_MECID, but the architecture only + * allows 16 bits at the moment. + */ +#define mecid_count() ((u32)rmm_feat_reg1 + 1) +/* + * RMM reports MAX_MECID=0 (count=1) when MEC is not supported, + * otherwise reports the actual maximum MECID value. 
+ */ +#define mecid_supported() (mecid_count() != 1) + +#define RMM_PAGE_SHIFT 12 +#define RMM_PAGE_SIZE BIT(RMM_PAGE_SHIFT) + +#define RMM_RTT_BLOCK_LEVEL 2 +#define RMM_RTT_MAX_LEVEL 3 + +/* See ARM64_HW_PGTABLE_LEVEL_SHIFT() */ +#define RMM_RTT_LEVEL_SHIFT(l) \ + ((RMM_PAGE_SHIFT - 3) * (4 - (l)) + 3) +#define RMM_L2_BLOCK_SIZE BIT(RMM_RTT_LEVEL_SHIFT(2)) + +static inline unsigned long rme_rtt_level_mapsize(int level) +{ + if (WARN_ON(level > RMM_RTT_MAX_LEVEL)) + return RMM_PAGE_SIZE; + + return (1UL << RMM_RTT_LEVEL_SHIFT(level)); +} + +static bool rme_has_feature(unsigned long feature) +{ + return !!u64_get_bits(rmm_feat_reg0, feature); +} + +bool kvm_rme_supports_sve(void) +{ + return rme_has_feature(RMI_FEATURE_REGISTER_0_SVE_EN); +} + +static int rmi_check_version(void) +{ + struct arm_smccc_res res; + unsigned short version_major, version_minor; + unsigned long host_version = RMI_ABI_VERSION(RMI_ABI_MAJOR_VERSION, + RMI_ABI_MINOR_VERSION); + + arm_smccc_1_1_invoke(SMC_RMI_VERSION, host_version, &res); + + if (res.a0 == SMCCC_RET_NOT_SUPPORTED) + return -ENXIO; + + version_major = RMI_ABI_VERSION_GET_MAJOR(res.a1); + version_minor = RMI_ABI_VERSION_GET_MINOR(res.a1); + + if (res.a0 != RMI_SUCCESS) { + unsigned short high_version_major, high_version_minor; + + high_version_major = RMI_ABI_VERSION_GET_MAJOR(res.a2); + high_version_minor = RMI_ABI_VERSION_GET_MINOR(res.a2); + + kvm_err("Unsupported RMI ABI (v%d.%d - v%d.%d) we want v%d.%d\n", + version_major, version_minor, + high_version_major, high_version_minor, + RMI_ABI_MAJOR_VERSION, + RMI_ABI_MINOR_VERSION); + return -ENXIO; + } + + kvm_info("RMI ABI version %d.%d\n", version_major, version_minor); + + return 0; +} + +u32 kvm_realm_ipa_limit(void) +{ + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ); +} + +u32 kvm_realm_vgic_nr_lr(void) +{ + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_GICV3_NUM_LRS); +} + +u8 kvm_realm_max_pmu_counters(void) +{ + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_PMU_NUM_CTRS); +} + +unsigned int kvm_realm_sve_max_vl(void) +{ + return sve_vl_from_vq(u64_get_bits(rmm_feat_reg0, + RMI_FEATURE_REGISTER_0_SVE_VL) + 1); +} + +u64 kvm_realm_reset_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val) +{ + u32 bps = u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_NUM_BPS); + u32 wps = u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_NUM_WPS); + u32 ctx_cmps; + + if (!kvm_is_realm(vcpu->kvm)) + return val; + + /* Ensure CTX_CMPs is still valid */ + ctx_cmps = FIELD_GET(ID_AA64DFR0_EL1_CTX_CMPs, val); + ctx_cmps = min(bps, ctx_cmps); + + val &= ~(ID_AA64DFR0_EL1_BRPs_MASK | ID_AA64DFR0_EL1_WRPs_MASK | + ID_AA64DFR0_EL1_CTX_CMPs); + val |= FIELD_PREP(ID_AA64DFR0_EL1_BRPs_MASK, bps) | + FIELD_PREP(ID_AA64DFR0_EL1_WRPs_MASK, wps) | + FIELD_PREP(ID_AA64DFR0_EL1_CTX_CMPs, ctx_cmps); + + return val; +} + +static int get_start_level(struct realm *realm) +{ + /* + * Open coded version of 4 - stage2_pgtable_levels(ia_bits) but using + * the RMM's page size rather than the host's. 
+ */ + return 4 - ((realm->ia_bits - 8) / (RMM_PAGE_SHIFT - 3)); +} + +static int find_map_level(struct realm *realm, + unsigned long start, + unsigned long end) +{ + int level = RMM_RTT_MAX_LEVEL; + + while (level > get_start_level(realm)) { + unsigned long map_size = rme_rtt_level_mapsize(level - 1); + + if (!IS_ALIGNED(start, map_size) || + (start + map_size) > end) + break; + + level--; + } + + return level; +} + +static phys_addr_t alloc_delegated_granule(struct kvm_mmu_memory_cache *mc) +{ + phys_addr_t phys; + void *virt; + + if (mc) { + virt = kvm_mmu_memory_cache_alloc(mc); + } else { + virt = (void *)__get_free_page(GFP_ATOMIC | __GFP_ZERO | + __GFP_ACCOUNT); + } + + if (!virt) + return PHYS_ADDR_MAX; + + phys = virt_to_phys(virt); + if (rmi_granule_delegate(phys)) { + free_page((unsigned long)virt); + return PHYS_ADDR_MAX; + } + + return phys; +} + +static phys_addr_t alloc_rtt(struct kvm_mmu_memory_cache *mc) +{ + phys_addr_t phys = alloc_delegated_granule(mc); + + if (phys != PHYS_ADDR_MAX) + kvm_account_pgtable_pages(phys_to_virt(phys), 1); + + return phys; +} + +static int free_delegated_granule(phys_addr_t phys) +{ + if (WARN_ON(rmi_granule_undelegate(phys))) { + /* Undelegate failed: leak the page */ + return -EBUSY; + } + + free_page((unsigned long)phys_to_virt(phys)); + + return 0; +} + +static void free_rtt(phys_addr_t phys) +{ + if (free_delegated_granule(phys)) + return; + + kvm_account_pgtable_pages(phys_to_virt(phys), -1); +} + +int realm_psci_complete(struct kvm_vcpu *source, struct kvm_vcpu *target, + unsigned long status) +{ + int ret; + + ret = rmi_psci_complete(virt_to_phys(source->arch.rec.rec_page), + virt_to_phys(target->arch.rec.rec_page), + status); + if (ret) + return -EINVAL; + + return 0; +} + +static int realm_rtt_create(struct realm *realm, + unsigned long addr, + int level, + phys_addr_t phys) +{ + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1)); + return rmi_rtt_create(virt_to_phys(realm->rd), phys, addr, level); +} + +static int realm_rtt_fold(struct realm *realm, + unsigned long addr, + int level, + phys_addr_t *rtt_granule) +{ + unsigned long out_rtt; + int ret; + + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1)); + ret = rmi_rtt_fold(virt_to_phys(realm->rd), addr, level, &out_rtt); + + if (rtt_granule) + *rtt_granule = out_rtt; + + return ret; +} + +static int realm_rtt_destroy(struct realm *realm, unsigned long addr, + int level, phys_addr_t *rtt_granule, + unsigned long *next_addr) +{ + unsigned long out_rtt; + int ret; + + ret = rmi_rtt_destroy(virt_to_phys(realm->rd), addr, level, + &out_rtt, next_addr); + + *rtt_granule = out_rtt; + + return ret; +} + +static int realm_create_rtt_levels(struct realm *realm, + unsigned long ipa, + int level, + int max_level, + struct kvm_mmu_memory_cache *mc) +{ + while (level++ < max_level) { + phys_addr_t rtt = alloc_rtt(mc); + int ret; + + if (rtt == PHYS_ADDR_MAX) + return -ENOMEM; + + ret = realm_rtt_create(realm, ipa, level, rtt); + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT && + RMI_RETURN_INDEX(ret) == level - 1) { + /* The RTT already exists, continue */ + free_rtt(rtt); + continue; + } + + if (ret) { + WARN(1, "Failed to create RTT at level %d: %d\n", + level, ret); + free_rtt(rtt); + return -ENXIO; + } + } + + return 0; +} + +static int realm_tear_down_rtt_level(struct realm *realm, int level, + unsigned long start, unsigned long end) +{ + ssize_t map_size; + unsigned long addr, next_addr; + + if (WARN_ON(level > RMM_RTT_MAX_LEVEL)) + return -EINVAL; + + map_size = 
rme_rtt_level_mapsize(level - 1); + + for (addr = start; addr < end; addr = next_addr) { + phys_addr_t rtt_granule; + int ret; + unsigned long align_addr = ALIGN(addr, map_size); + + next_addr = ALIGN(addr + 1, map_size); + + if (next_addr > end || align_addr != addr) { + /* + * The target range is smaller than what this level + * covers, recurse deeper. + */ + ret = realm_tear_down_rtt_level(realm, + level + 1, + addr, + min(next_addr, end)); + if (ret) + return ret; + continue; + } + + ret = realm_rtt_destroy(realm, addr, level, + &rtt_granule, &next_addr); + + switch (RMI_RETURN_STATUS(ret)) { + case RMI_SUCCESS: + free_rtt(rtt_granule); + break; + case RMI_ERROR_RTT: + if (next_addr > addr) { + /* Missing RTT, skip */ + break; + } + /* + * We tear down the RTT range for the full IPA + * space, after everything is unmapped. Also we + * descend down only if we cannot tear down a + * top level RTT. Thus RMM must be able to walk + * to the requested level. e.g., a block mapping + * exists at L1 or L2. + */ + if (WARN_ON(RMI_RETURN_INDEX(ret) != level)) + return -EBUSY; + if (WARN_ON(level == RMM_RTT_MAX_LEVEL)) + return -EBUSY; + + /* + * The table has active entries in it, recurse deeper + * and tear down the RTTs. + */ + next_addr = ALIGN(addr + 1, map_size); + ret = realm_tear_down_rtt_level(realm, + level + 1, + addr, + next_addr); + if (ret) + return ret; + /* + * Now that the child RTTs are destroyed, + * retry at this level. + */ + next_addr = addr; + break; + default: + WARN_ON(1); + return -ENXIO; + } + } + + return 0; +} + +static int realm_tear_down_rtt_range(struct realm *realm, + unsigned long start, unsigned long end) +{ + /* + * Root level RTTs can only be destroyed after the RD is destroyed. So + * tear down everything below the root level + */ + return realm_tear_down_rtt_level(realm, get_start_level(realm) + 1, + start, end); +} + +/* + * Returns 0 on successful fold, a negative value on error, a positive value if + * we were not able to fold all tables at this level. 
+ */ +static int realm_fold_rtt_level(struct realm *realm, int level, + unsigned long start, unsigned long end) +{ + int not_folded = 0; + ssize_t map_size; + unsigned long addr, next_addr; + + if (WARN_ON(level > RMM_RTT_MAX_LEVEL)) + return -EINVAL; + + map_size = rme_rtt_level_mapsize(level - 1); + + for (addr = start; addr < end; addr = next_addr) { + phys_addr_t rtt_granule; + int ret; + unsigned long align_addr = ALIGN(addr, map_size); + + next_addr = ALIGN(addr + 1, map_size); + + ret = realm_rtt_fold(realm, align_addr, level, &rtt_granule); + + switch (RMI_RETURN_STATUS(ret)) { + case RMI_SUCCESS: + free_rtt(rtt_granule); + break; + case RMI_ERROR_RTT: + if (level == RMM_RTT_MAX_LEVEL || + RMI_RETURN_INDEX(ret) < level) { + not_folded++; + break; + } + /* Recurse a level deeper */ + ret = realm_fold_rtt_level(realm, + level + 1, + addr, + next_addr); + if (ret < 0) { + return ret; + } else if (ret == 0) { + /* Try again at this level */ + next_addr = addr; + } + break; + default: + WARN_ON(1); + return -ENXIO; + } + } + + return not_folded; +} + +void kvm_realm_destroy_rtts(struct kvm *kvm, u32 ia_bits) +{ + struct realm *realm = &kvm->arch.realm; + + WARN_ON(realm_tear_down_rtt_range(realm, 0, (1UL << ia_bits))); +} + +static int realm_destroy_private_granule(struct realm *realm, + unsigned long ipa, + unsigned long *next_addr, + phys_addr_t *out_rtt) +{ + unsigned long rd = virt_to_phys(realm->rd); + unsigned long rtt_addr; + phys_addr_t rtt; + int ret; + +retry: + ret = rmi_data_destroy(rd, ipa, &rtt_addr, next_addr); + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) { + if (*next_addr > ipa) + return 0; /* UNASSIGNED */ + rtt = alloc_rtt(NULL); + if (WARN_ON(rtt == PHYS_ADDR_MAX)) + return -ENOMEM; + /* + * ASSIGNED - ipa is mapped as a block, so split. 
The index + * from the return code should be 2 otherwise it appears + * there's a huge page bigger than KVM currently supports + */ + WARN_ON(RMI_RETURN_INDEX(ret) != 2); + ret = realm_rtt_create(realm, ipa, 3, rtt); + if (WARN_ON(ret)) { + free_rtt(rtt); + return -ENXIO; + } + goto retry; + } else if (WARN_ON(ret)) { + return -ENXIO; + } + + ret = rmi_granule_undelegate(rtt_addr); + if (WARN_ON(ret)) + return -ENXIO; + + *out_rtt = rtt_addr; + + return 0; +} + +static int realm_unmap_private_page(struct realm *realm, + unsigned long ipa, + unsigned long *next_addr) +{ + unsigned long end = ALIGN(ipa + 1, PAGE_SIZE); + unsigned long addr; + phys_addr_t out_rtt = PHYS_ADDR_MAX; + int ret; + + for (addr = ipa; addr < end; addr = *next_addr) { + ret = realm_destroy_private_granule(realm, addr, next_addr, + &out_rtt); + if (ret) + return ret; + } + + if (out_rtt != PHYS_ADDR_MAX) { + out_rtt = ALIGN_DOWN(out_rtt, PAGE_SIZE); + free_page((unsigned long)phys_to_virt(out_rtt)); + } + + return 0; +} + +static void realm_unmap_shared_range(struct kvm *kvm, + int level, + unsigned long start, + unsigned long end, + bool may_block) +{ + struct realm *realm = &kvm->arch.realm; + unsigned long rd = virt_to_phys(realm->rd); + ssize_t map_size = rme_rtt_level_mapsize(level); + unsigned long next_addr, addr; + unsigned long shared_bit = BIT(realm->ia_bits - 1); + + if (WARN_ON(level > RMM_RTT_MAX_LEVEL)) + return; + + start |= shared_bit; + end |= shared_bit; + + for (addr = start; addr < end; addr = next_addr) { + unsigned long align_addr = ALIGN(addr, map_size); + int ret; + + next_addr = ALIGN(addr + 1, map_size); + + if (align_addr != addr || next_addr > end) { + /* Need to recurse deeper */ + if (addr < align_addr) + next_addr = align_addr; + realm_unmap_shared_range(kvm, level + 1, addr, + min(next_addr, end), + may_block); + continue; + } + + ret = rmi_rtt_unmap_unprotected(rd, addr, level, &next_addr); + switch (RMI_RETURN_STATUS(ret)) { + case RMI_SUCCESS: + break; + case RMI_ERROR_RTT: + if (next_addr == addr) { + /* + * There's a mapping here, but it's not a block + * mapping, so reset next_addr to the next block + * boundary and recurse to clear out the pages + * one level deeper. + */ + next_addr = ALIGN(addr + 1, map_size); + realm_unmap_shared_range(kvm, level + 1, addr, + next_addr, + may_block); + } + break; + default: + WARN_ON(1); + return; + } + + if (may_block) + cond_resched_rwlock_write(&kvm->mmu_lock); + } + + realm_fold_rtt_level(realm, get_start_level(realm) + 1, + start, end); +} + +static int realm_init_sve_param(struct kvm *kvm, struct realm_params *params) +{ + int ret = 0; + unsigned long i; + struct kvm_vcpu *vcpu; + int vl, last_vl = -1; + + /* + * Get the preferred SVE configuration, set by userspace with the + * KVM_ARM_VCPU_SVE feature and KVM_REG_ARM64_SVE_VLS pseudo-register. 
+ */ + kvm_for_each_vcpu(i, vcpu, kvm) { + mutex_lock(&vcpu->mutex); + if (vcpu_has_sve(vcpu)) { + if (!kvm_arm_vcpu_sve_finalized(vcpu)) + ret = -EINVAL; + vl = vcpu->arch.sve_max_vl; + } else { + vl = 0; + } + mutex_unlock(&vcpu->mutex); + if (ret) + return ret; + + /* We need all vCPUs to have the same SVE config */ + if (last_vl >= 0 && last_vl != vl) + return -EINVAL; + + last_vl = vl; + } + + if (last_vl > 0) { + params->sve_vl = sve_vq_from_vl(last_vl) - 1; + params->flags |= RMI_REALM_PARAM_FLAG_SVE; + } + return 0; +} + +/* Calculate the number of s2 root rtts needed */ +static int realm_num_root_rtts(struct realm *realm) +{ + unsigned int ipa_bits = realm->ia_bits; + unsigned int levels = 4 - get_start_level(realm); + unsigned int sl_ipa_bits = levels * (RMM_PAGE_SHIFT - 3) + + RMM_PAGE_SHIFT; + + if (sl_ipa_bits >= ipa_bits) + return 1; + + return 1 << (ipa_bits - sl_ipa_bits); +} + +static int realm_create_rd(struct kvm *kvm) +{ + struct realm *realm = &kvm->arch.realm; + struct realm_params *params = realm->params; + void *rd = NULL; + phys_addr_t rd_phys, params_phys; + size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr); + u64 dfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64DFR0_EL1); + int i, r; + int rtt_num_start; + + realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr); + rtt_num_start = realm_num_root_rtts(realm); + + if (WARN_ON(realm->rd || !realm->params)) + return -EEXIST; + + if (pgd_size / RMM_PAGE_SIZE < rtt_num_start) + return -EINVAL; + + rd = (void *)__get_free_page(GFP_KERNEL); + if (!rd) + return -ENOMEM; + + rd_phys = virt_to_phys(rd); + if (rmi_granule_delegate(rd_phys)) { + r = -ENXIO; + goto free_rd; + } + + for (i = 0; i < pgd_size; i += RMM_PAGE_SIZE) { + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i; + + if (rmi_granule_delegate(pgd_phys)) { + r = -ENXIO; + goto out_undelegate_tables; + } + } + + params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr); + params->rtt_level_start = get_start_level(realm); + params->rtt_num_start = rtt_num_start; + params->rtt_base = kvm->arch.mmu.pgd_phys; + params->vmid = realm->vmid; + params->num_bps = SYS_FIELD_GET(ID_AA64DFR0_EL1, BRPs, dfr0); + params->num_wps = SYS_FIELD_GET(ID_AA64DFR0_EL1, WRPs, dfr0); + + if (kvm->arch.arm_pmu) { + params->pmu_num_ctrs = kvm->arch.nr_pmu_counters; + params->flags |= RMI_REALM_PARAM_FLAG_PMU; + } + + /* Set MECID in realm parameters - 0 when not supported */ + params->mecid = mecid_supported() ? 
realm->mecid : 0; + + r = realm_init_sve_param(kvm, params); + if (r) + goto out_undelegate_tables; + + params_phys = virt_to_phys(params); + + if (rmi_realm_create(rd_phys, params_phys)) { + r = -ENXIO; + goto out_undelegate_tables; + } + + if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) { + WARN_ON(rmi_realm_destroy(rd_phys)); + r = -ENXIO; + goto out_undelegate_tables; + } + + if (WARN_ON(realm->num_aux > REC_PARAMS_AUX_GRANULES)) { + WARN_ON(rmi_realm_destroy(rd_phys)); + r = -ENXIO; + goto out_undelegate_tables; + } + + realm->rd = rd; + + return 0; + +out_undelegate_tables: + while (i > 0) { + i -= RMM_PAGE_SIZE; + + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i; + + if (WARN_ON(rmi_granule_undelegate(pgd_phys))) { + /* Leak the pages if they cannot be returned */ + kvm->arch.mmu.pgt = NULL; + break; + } + } + if (WARN_ON(rmi_granule_undelegate(rd_phys))) { + /* Leak the page if it isn't returned */ + return r; + } +free_rd: + free_page((unsigned long)rd); + return r; +} + +static void realm_unmap_private_range(struct kvm *kvm, + unsigned long start, + unsigned long end, + bool may_block) +{ + struct realm *realm = &kvm->arch.realm; + unsigned long next_addr, addr; + int ret; + + for (addr = start; addr < end; addr = next_addr) { + ret = realm_unmap_private_page(realm, addr, &next_addr); + + if (ret) + break; + + if (may_block) + cond_resched_rwlock_write(&kvm->mmu_lock); + } + + realm_fold_rtt_level(realm, get_start_level(realm) + 1, + start, end); +} + +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start, + unsigned long size, bool unmap_private, + bool may_block) +{ + unsigned long end = start + size; + struct realm *realm = &kvm->arch.realm; + + if (!kvm_realm_is_created(kvm)) + return; + + end = min(BIT(realm->ia_bits - 1), end); + + realm_unmap_shared_range(kvm, find_map_level(realm, start, end), + start, end, may_block); + if (unmap_private) + realm_unmap_private_range(kvm, start, end, may_block); +} + +static int realm_create_protected_data_granule(struct realm *realm, + unsigned long ipa, + phys_addr_t dst_phys, + phys_addr_t src_phys, + unsigned long flags) +{ + phys_addr_t rd = virt_to_phys(realm->rd); + int ret; + + if (rmi_granule_delegate(dst_phys)) + return -ENXIO; + + ret = rmi_data_create(rd, dst_phys, ipa, src_phys, flags); + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) { + /* Create missing RTTs and retry */ + int level = RMI_RETURN_INDEX(ret); + + WARN_ON(level == RMM_RTT_MAX_LEVEL); + + ret = realm_create_rtt_levels(realm, ipa, level, + RMM_RTT_MAX_LEVEL, NULL); + if (ret) + return -EIO; + + ret = rmi_data_create(rd, dst_phys, ipa, src_phys, flags); + } + if (ret) + return -EIO; + + return 0; +} + +static int realm_create_protected_data_page(struct realm *realm, + unsigned long ipa, + kvm_pfn_t dst_pfn, + kvm_pfn_t src_pfn, + unsigned long flags) +{ + unsigned long rd = virt_to_phys(realm->rd); + phys_addr_t dst_phys, src_phys; + bool undelegate_failed = false; + int ret, offset; + + dst_phys = __pfn_to_phys(dst_pfn); + src_phys = __pfn_to_phys(src_pfn); + + for (offset = 0; offset < PAGE_SIZE; offset += RMM_PAGE_SIZE) { + ret = realm_create_protected_data_granule(realm, + ipa, + dst_phys, + src_phys, + flags); + if (ret) + goto err; + + ipa += RMM_PAGE_SIZE; + dst_phys += RMM_PAGE_SIZE; + src_phys += RMM_PAGE_SIZE; + } + + return 0; + +err: + if (ret == -EIO) { + /* current offset needs undelegating */ + if (WARN_ON(rmi_granule_undelegate(dst_phys))) + undelegate_failed = true; + } + while (offset > 0) { + ipa -= RMM_PAGE_SIZE; + offset 
-= RMM_PAGE_SIZE; + dst_phys -= RMM_PAGE_SIZE; + + rmi_data_destroy(rd, ipa, NULL, NULL); + + if (WARN_ON(rmi_granule_undelegate(dst_phys))) + undelegate_failed = true; + } + + if (undelegate_failed) { + /* + * A granule could not be undelegated, + * so the page has to be leaked + */ + get_page(pfn_to_page(dst_pfn)); + } + + return -ENXIO; +} + +static int fold_rtt(struct realm *realm, unsigned long addr, int level) +{ + phys_addr_t rtt_addr; + int ret; + + ret = realm_rtt_fold(realm, addr, level, &rtt_addr); + if (ret) + return ret; + + free_rtt(rtt_addr); + + return 0; +} + +int realm_map_protected(struct realm *realm, + unsigned long ipa, + kvm_pfn_t pfn, + unsigned long map_size, + struct kvm_mmu_memory_cache *memcache) +{ + phys_addr_t phys = __pfn_to_phys(pfn); + phys_addr_t rd = virt_to_phys(realm->rd); + unsigned long base_ipa = ipa; + unsigned long size; + int map_level = IS_ALIGNED(map_size, RMM_L2_BLOCK_SIZE) ? + RMM_RTT_BLOCK_LEVEL : RMM_RTT_MAX_LEVEL; + int ret = 0; + + if (WARN_ON(!IS_ALIGNED(map_size, RMM_PAGE_SIZE) || + !IS_ALIGNED(ipa, map_size))) + return -EINVAL; + + if (map_level < RMM_RTT_MAX_LEVEL) { + /* + * A temporary RTT is needed during the map, precreate it, + * however if there is an error (e.g. missing parent tables) + * this will be handled below. + */ + realm_create_rtt_levels(realm, ipa, map_level, + RMM_RTT_MAX_LEVEL, memcache); + } + + for (size = 0; size < map_size; size += RMM_PAGE_SIZE) { + if (rmi_granule_delegate(phys)) { + /* + * It's likely we raced with another VCPU on the same + * fault. Assume the other VCPU has handled the fault + * and return to the guest. + */ + return 0; + } + + ret = rmi_data_create_unknown(rd, phys, ipa); + + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) { + /* Create missing RTTs and retry */ + int level = RMI_RETURN_INDEX(ret); + + WARN_ON(level == RMM_RTT_MAX_LEVEL); + ret = realm_create_rtt_levels(realm, ipa, level, + RMM_RTT_MAX_LEVEL, + memcache); + if (ret) + goto err_undelegate; + + ret = rmi_data_create_unknown(rd, phys, ipa); + } + + if (WARN_ON(ret)) + goto err_undelegate; + + phys += RMM_PAGE_SIZE; + ipa += RMM_PAGE_SIZE; + } + + if (map_size == RMM_L2_BLOCK_SIZE) { + ret = fold_rtt(realm, base_ipa, map_level + 1); + if (WARN_ON(ret)) + goto err; + } + + return 0; + +err_undelegate: + if (WARN_ON(rmi_granule_undelegate(phys))) { + /* Page can't be returned to NS world so is lost */ + get_page(phys_to_page(phys)); + } +err: + while (size > 0) { + unsigned long data, top; + + phys -= RMM_PAGE_SIZE; + size -= RMM_PAGE_SIZE; + ipa -= RMM_PAGE_SIZE; + + WARN_ON(rmi_data_destroy(rd, ipa, &data, &top)); + + if (WARN_ON(rmi_granule_undelegate(phys))) { + /* Page can't be returned to NS world so is lost */ + get_page(phys_to_page(phys)); + } + } + return -ENXIO; +} + +int realm_map_non_secure(struct realm *realm, + unsigned long ipa, + kvm_pfn_t pfn, + unsigned long size, + struct kvm_mmu_memory_cache *memcache) +{ + phys_addr_t rd = virt_to_phys(realm->rd); + phys_addr_t phys = __pfn_to_phys(pfn); + unsigned long offset; + /* TODO: Support block mappings */ + int map_level = RMM_RTT_MAX_LEVEL; + int map_size = rme_rtt_level_mapsize(map_level); + int ret = 0; + + if (WARN_ON(!IS_ALIGNED(size, RMM_PAGE_SIZE) || + !IS_ALIGNED(ipa, size))) + return -EINVAL; + + for (offset = 0; offset < size; offset += map_size) { + /* + * realm_map_ipa() enforces that the memory is writable, + * so for now we permit both read and write. 
+ */ + unsigned long desc = phys | + PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) | + KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | + KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W; + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc); + + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) { + /* Create missing RTTs and retry */ + int level = RMI_RETURN_INDEX(ret); + + ret = realm_create_rtt_levels(realm, ipa, level, + map_level, memcache); + if (ret) + return -ENXIO; + + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc); + } + /* + * RMI_ERROR_RTT can be reported for two reasons: either the + * RTT tables are not there, or there is an RTTE already + * present for the address. The above call to create RTTs + * handles the first case, and in the second case this + * indicates that another thread has already populated the RTTE + * for us, so we can ignore the error and continue. + */ + if (ret && RMI_RETURN_STATUS(ret) != RMI_ERROR_RTT) + return -ENXIO; + + ipa += map_size; + phys += map_size; + } + + return 0; +} + +static int populate_region(struct kvm *kvm, + phys_addr_t ipa_base, + phys_addr_t ipa_end, + unsigned long data_flags) +{ + struct realm *realm = &kvm->arch.realm; + struct kvm_memory_slot *memslot; + gfn_t base_gfn, end_gfn; + int idx; + phys_addr_t ipa = ipa_base; + int ret = 0; + + base_gfn = gpa_to_gfn(ipa_base); + end_gfn = gpa_to_gfn(ipa_end); + + idx = srcu_read_lock(&kvm->srcu); + memslot = gfn_to_memslot(kvm, base_gfn); + if (!memslot) { + ret = -EFAULT; + goto out; + } + + /* We require the region to be contained within a single memslot */ + if (memslot->base_gfn + memslot->npages < end_gfn) { + ret = -EINVAL; + goto out; + } + + if (!kvm_slot_can_be_private(memslot)) { + ret = -EPERM; + goto out; + } + + while (ipa < ipa_end) { + struct vm_area_struct *vma; + unsigned long hva; + struct page *page; + bool writeable; + kvm_pfn_t pfn; + kvm_pfn_t priv_pfn; + struct page *gmem_page; + + hva = gfn_to_hva_memslot(memslot, gpa_to_gfn(ipa)); + vma = vma_lookup(current->mm, hva); + if (!vma) { + ret = -EFAULT; + break; + } + + pfn = __kvm_faultin_pfn(memslot, gpa_to_gfn(ipa), FOLL_WRITE, + &writeable, &page); + + if (is_error_pfn(pfn)) { + ret = -EFAULT; + break; + } + + ret = kvm_gmem_get_pfn(kvm, memslot, + ipa >> PAGE_SHIFT, + &priv_pfn, &gmem_page, NULL); + if (ret) + break; + + ret = realm_create_protected_data_page(realm, ipa, + priv_pfn, + pfn, + data_flags); + + kvm_release_page_clean(page); + + if (ret) + break; + + ipa += PAGE_SIZE; + } + +out: + srcu_read_unlock(&kvm->srcu, idx); + return ret; +} + +static int kvm_populate_realm(struct kvm *kvm, + struct arm_rme_populate_realm *args) +{ + phys_addr_t ipa_base, ipa_end; + unsigned long data_flags = 0; + + if (kvm_realm_state(kvm) != REALM_STATE_NEW) + return -EPERM; + + if (!IS_ALIGNED(args->base, PAGE_SIZE) || + !IS_ALIGNED(args->size, PAGE_SIZE) || + (args->flags & ~RMI_MEASURE_CONTENT)) + return -EINVAL; + + ipa_base = args->base; + ipa_end = ipa_base + args->size; + + if (ipa_end < ipa_base) + return -EINVAL; + + if (args->flags & RMI_MEASURE_CONTENT) + data_flags |= RMI_MEASURE_CONTENT; + + /* + * Perform the population in parts to ensure locks are not held for too + * long + */ + while (ipa_base < ipa_end) { + phys_addr_t end = min(ipa_end, ipa_base + SZ_2M); + + int ret = populate_region(kvm, ipa_base, end, + args->flags); + + if (ret) + return ret; + + ipa_base = end; + + cond_resched(); + } + + return 0; +} + +enum ripas_action { + RIPAS_INIT, + RIPAS_SET, +}; + +static int ripas_change(struct kvm *kvm, + struct kvm_vcpu *vcpu, + unsigned 
long ipa, + unsigned long end, + enum ripas_action action, + unsigned long *top_ipa) +{ + struct realm *realm = &kvm->arch.realm; + phys_addr_t rd_phys = virt_to_phys(realm->rd); + phys_addr_t rec_phys; + struct kvm_mmu_memory_cache *memcache = NULL; + int ret = 0; + + if (vcpu) { + rec_phys = virt_to_phys(vcpu->arch.rec.rec_page); + memcache = &vcpu->arch.mmu_page_cache; + + WARN_ON(action != RIPAS_SET); + } else { + WARN_ON(action != RIPAS_INIT); + } + + while (ipa < end) { + unsigned long next; + + switch (action) { + case RIPAS_INIT: + ret = rmi_rtt_init_ripas(rd_phys, ipa, end, &next); + break; + case RIPAS_SET: + ret = rmi_rtt_set_ripas(rd_phys, rec_phys, ipa, end, + &next); + break; + } + + switch (RMI_RETURN_STATUS(ret)) { + case RMI_SUCCESS: + ipa = next; + break; + case RMI_ERROR_RTT: { + int err_level = RMI_RETURN_INDEX(ret); + int level = find_map_level(realm, ipa, end); + + if (err_level >= level) + return -EINVAL; + + ret = realm_create_rtt_levels(realm, ipa, err_level, + level, memcache); + if (ret) + return ret; + /* Retry with the RTT levels in place */ + break; + } + default: + WARN_ON(1); + return -ENXIO; + } + } + + if (top_ipa) + *top_ipa = ipa; + + return 0; +} + +static int realm_set_ipa_state(struct kvm_vcpu *vcpu, + unsigned long start, + unsigned long end, + unsigned long ripas, + unsigned long *top_ipa) +{ + struct kvm *kvm = vcpu->kvm; + int ret = ripas_change(kvm, vcpu, start, end, RIPAS_SET, top_ipa); + + if (ripas == RMI_EMPTY && *top_ipa != start) + realm_unmap_private_range(kvm, start, *top_ipa, false); + + return ret; +} + +static int realm_init_ipa_state(struct kvm *kvm, + unsigned long ipa, + unsigned long end) +{ + return ripas_change(kvm, NULL, ipa, end, RIPAS_INIT, NULL); +} + +static int kvm_init_ipa_range_realm(struct kvm *kvm, + struct arm_rme_init_ripas *args) +{ + gpa_t addr, end; + + addr = args->base; + end = addr + args->size; + + if (end < addr) + return -EINVAL; + + if (kvm_realm_state(kvm) != REALM_STATE_NEW) + return -EPERM; + + return realm_init_ipa_state(kvm, addr, end); +} + +static int kvm_activate_realm(struct kvm *kvm) +{ + struct realm *realm = &kvm->arch.realm; + + if (kvm_realm_state(kvm) != REALM_STATE_NEW) + return -EINVAL; + + if (rmi_realm_activate(virt_to_phys(realm->rd))) + return -ENXIO; + + WRITE_ONCE(realm->state, REALM_STATE_ACTIVE); + return 0; +} + +/* Protects access to rme_vmid_bitmap */ +static DEFINE_SPINLOCK(rme_vmid_lock); +static unsigned long *rme_vmid_bitmap; + +static int rme_vmid_init(void) +{ + unsigned int vmid_count = 1 << kvm_get_vmid_bits(); + + rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL); + if (!rme_vmid_bitmap) { + kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__); + return -ENOMEM; + } + + return 0; +} + +static int rme_vmid_reserve(void) +{ + int ret; + unsigned int vmid_count = 1 << kvm_get_vmid_bits(); + + spin_lock(&rme_vmid_lock); + ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0); + spin_unlock(&rme_vmid_lock); + + return ret; +} + +static void rme_vmid_release(unsigned int vmid) +{ + spin_lock(&rme_vmid_lock); + bitmap_release_region(rme_vmid_bitmap, vmid, 0); + spin_unlock(&rme_vmid_lock); +} + +static int __mecid_alloc(struct mecid_state *state) +{ + lockdep_assert_held(&mecid_lock); + return bitmap_find_free_region(state->bitmap, mecid_count(), 0); +} + +static int __mecid_get_shared(struct mecid_state *state) +{ + int mecid; + + lockdep_assert_held(&mecid_lock); + + if (state->shared_mecid != MECID_INVALID) { + if (WARN_ON(state->shared_mecid_members > 
UINT_MAX)) + return -ENOSPC; + + state->shared_mecid_members++; + return state->shared_mecid; + } + + /* Sanity check: members without valid shared MECID indicates corruption */ + if (WARN_ON(state->shared_mecid_members)) + return -ENXIO; + + mecid = __mecid_alloc(state); + if (mecid < 0) + return mecid; + + if (rmi_mec_set_shared(mecid)) { + bitmap_release_region(state->bitmap, mecid, 0); + return -ENXIO; + } + + state->shared_mecid = mecid; + state->shared_mecid_members++; + + return mecid; +} + +static int __mecid_put_shared(struct mecid_state *state) +{ + lockdep_assert_held(&mecid_lock); + + if (WARN_ON(!state->shared_mecid_members || state->shared_mecid == MECID_INVALID)) + return -EINVAL; + + if (state->shared_mecid_members > 1) { + state->shared_mecid_members--; + return 0; + } + + if (rmi_mec_set_private(state->shared_mecid)) + return -ENXIO; + + bitmap_release_region(state->bitmap, state->shared_mecid, 0); + state->shared_mecid = MECID_INVALID; + state->shared_mecid_members = 0; + + return 0; +} + +static int __mecid_init(struct mecid_state *state) +{ + if (!mecid_supported()) + return 0; + + state->bitmap = bitmap_zalloc(mecid_count(), GFP_KERNEL); + if (!state->bitmap) { + kvm_err("Couldn't allocate rme mecid bitmap\n"); + return -ENOMEM; + } + + return 0; +} + +static void __mecid_destroy(struct mecid_state *state) +{ + bitmap_free(state->bitmap); + state->bitmap = NULL; +} + +/* Public wrappers that handle global state */ +static int rme_mecid_init(void) +{ + return __mecid_init(&mecid_state); +} + +static void rme_mecid_destroy(void) +{ + __mecid_destroy(&mecid_state); +} + +static int rme_mecid_reserve(struct realm *realm) +{ + int ret = 0; + + if (!mecid_supported()) { + /* RMM doesn't support MEC, force MECID to 0 */ + realm->mecid = 0; + return 0; + } + + spin_lock(&mecid_lock); + /* Unconfigured or explicitly shared -> use shared MECID */ + if (realm->mec_policy != MEC_POLICY_PRIVATE) + ret = __mecid_get_shared(&mecid_state); + else + ret = __mecid_alloc(&mecid_state); + if (ret >= 0) { + realm->mecid = ret; + ret = 0; + } + spin_unlock(&mecid_lock); + + return ret; +} + +static void __mecid_release(struct mecid_state *state, unsigned int mecid) +{ + lockdep_assert_held(&mecid_lock); + + if (mecid == state->shared_mecid) + WARN_ON(__mecid_put_shared(state)); + else + bitmap_release_region(state->bitmap, mecid, 0); +} + +static void rme_mecid_release(unsigned int mecid) +{ + if (!mecid_supported()) + return; + + spin_lock(&mecid_lock); + __mecid_release(&mecid_state, mecid); + spin_unlock(&mecid_lock); +} + +static int kvm_create_realm(struct kvm *kvm) +{ + struct realm *realm = &kvm->arch.realm; + int ret; + + if (kvm_realm_is_created(kvm)) + return -EEXIST; + + ret = rme_vmid_reserve(); + if (ret < 0) + return ret; + realm->vmid = ret; + + ret = rme_mecid_reserve(realm); + if (ret < 0) + goto err_free_vmid; + + ret = realm_create_rd(kvm); + if (ret) + goto err_free_mecid; + + WRITE_ONCE(realm->state, REALM_STATE_NEW); + + /* The realm is up, free the parameters. 
*/ + free_page((unsigned long)realm->params); + realm->params = NULL; + + return 0; + +err_free_mecid: + rme_mecid_release(realm->mecid); +err_free_vmid: + rme_vmid_release(realm->vmid); + return ret; +} + +static int config_realm_hash_algo(struct realm *realm, + struct arm_rme_config *cfg) +{ + switch (cfg->hash_algo) { + case ARM_RME_CONFIG_HASH_ALGO_SHA256: + if (!rme_has_feature(RMI_FEATURE_REGISTER_0_HASH_SHA_256)) + return -EINVAL; + break; + case ARM_RME_CONFIG_HASH_ALGO_SHA512: + if (!rme_has_feature(RMI_FEATURE_REGISTER_0_HASH_SHA_512)) + return -EINVAL; + break; + default: + return -EINVAL; + } + realm->params->hash_algo = cfg->hash_algo; + return 0; +} + +static int config_realm_mec(struct realm *realm, + struct arm_rme_config *cfg) +{ + realm->mec_policy = cfg->shared_mec ? MEC_POLICY_SHARED : MEC_POLICY_PRIVATE; + + return 0; +} + +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap) +{ + struct arm_rme_config cfg; + struct realm *realm = &kvm->arch.realm; + int r = 0; + + if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg))) + return -EFAULT; + + /* Query operations don't require realm to be in NEW state */ + if (cfg.cfg == ARM_RME_CONFIG_MEC_QUERY) { + cfg.mec_supported = mecid_supported() ? 1 : 0; + cfg.mec_count = mecid_supported() ? mecid_count() : 0; + + if (copy_to_user((void __user *)cap->args[1], &cfg, sizeof(cfg))) + return -EFAULT; + return 0; + } + + if (kvm_realm_is_created(kvm)) + return -EBUSY; + + switch (cfg.cfg) { + case ARM_RME_CONFIG_RPV: + memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv)); + break; + case ARM_RME_CONFIG_HASH_ALGO: + r = config_realm_hash_algo(realm, &cfg); + break; + case ARM_RME_CONFIG_MEC: + r = config_realm_mec(realm, &cfg); + break; + default: + r = -EINVAL; + } + + return r; +} + +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) +{ + int r = 0; + + if (!kvm_is_realm(kvm)) + return -EINVAL; + + switch (cap->args[0]) { + case KVM_CAP_ARM_RME_CONFIG_REALM: + r = kvm_rme_config_realm(kvm, cap); + break; + case KVM_CAP_ARM_RME_CREATE_REALM: + r = kvm_create_realm(kvm); + break; + case KVM_CAP_ARM_RME_INIT_RIPAS_REALM: { + struct arm_rme_init_ripas args; + void __user *argp = u64_to_user_ptr(cap->args[1]); + + if (copy_from_user(&args, argp, sizeof(args))) { + r = -EFAULT; + break; + } + + r = kvm_init_ipa_range_realm(kvm, &args); + break; + } + case KVM_CAP_ARM_RME_POPULATE_REALM: { + struct arm_rme_populate_realm args; + void __user *argp = u64_to_user_ptr(cap->args[1]); + + if (copy_from_user(&args, argp, sizeof(args))) { + r = -EFAULT; + break; + } + + r = kvm_populate_realm(kvm, &args); + break; + } + case KVM_CAP_ARM_RME_ACTIVATE_REALM: + r = kvm_activate_realm(kvm); + break; + default: + r = -EINVAL; + break; + } + + return r; +} + +void kvm_destroy_realm(struct kvm *kvm) +{ + struct realm *realm = &kvm->arch.realm; + size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr); + int i; + + if (realm->params) { + free_page((unsigned long)realm->params); + realm->params = NULL; + } + + if (!kvm_realm_is_created(kvm)) + return; + + WRITE_ONCE(realm->state, REALM_STATE_DYING); + + if (realm->rd) { + phys_addr_t rd_phys = virt_to_phys(realm->rd); + + if (WARN_ON(rmi_realm_destroy(rd_phys))) + return; + free_delegated_granule(rd_phys); + realm->rd = NULL; + } + + rme_vmid_release(realm->vmid); + rme_mecid_release(realm->mecid); + + for (i = 0; i < pgd_size; i += RMM_PAGE_SIZE) { + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i; + + if 
(WARN_ON(rmi_granule_undelegate(pgd_phys))) + return; + } + + WRITE_ONCE(realm->state, REALM_STATE_DEAD); + + /* Now that the Realm is destroyed, free the entry level RTTs */ + kvm_free_stage2_pgd(&kvm->arch.mmu); +} + +static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu) +{ + struct kvm *kvm = vcpu->kvm; + struct realm_rec *rec = &vcpu->arch.rec; + unsigned long base = rec->run->exit.ripas_base; + unsigned long top = rec->run->exit.ripas_top; + unsigned long ripas = rec->run->exit.ripas_value; + unsigned long top_ipa; + int ret; + + do { + kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache, + kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu)); + write_lock(&kvm->mmu_lock); + ret = realm_set_ipa_state(vcpu, base, top, ripas, &top_ipa); + write_unlock(&kvm->mmu_lock); + + if (WARN_RATELIMIT(ret && ret != -ENOMEM, + "Unable to satisfy RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n", + base, top, ripas)) + break; + + base = top_ipa; + } while (base < top); + + rec->run->exit.ripas_base = base; +} + +/* + * kvm_rec_pre_enter - Complete operations before entering a REC + * + * Some operations require work to be completed before entering a realm. That + * work may require memory allocation so cannot be done in the kvm_rec_enter() + * call. + * + * Return: 1 if we should enter the guest + * 0 if we should exit to userspace + * < 0 if we should exit to userspace, where the return value indicates + * an error + */ +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + + if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE) + return -EINVAL; + + switch (rec->run->exit.exit_reason) { + case RMI_EXIT_HOST_CALL: + case RMI_EXIT_PSCI: + for (int i = 0; i < REC_RUN_GPRS; i++) + rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i); + break; + case RMI_EXIT_RIPAS_CHANGE: + kvm_complete_ripas_change(vcpu); + break; + } + + return 1; +} + +int kvm_rec_enter(struct kvm_vcpu *vcpu) +{ + struct realm_rec *rec = &vcpu->arch.rec; + + return rmi_rec_enter(virt_to_phys(rec->rec_page), + virt_to_phys(rec->run)); +} + +static void free_rec_aux(struct page **aux_pages, + unsigned int num_aux) +{ + unsigned int i, j; + unsigned int page_count = 0; + + for (i = 0; i < num_aux;) { + struct page *aux_page = aux_pages[page_count++]; + phys_addr_t aux_page_phys = page_to_phys(aux_page); + bool should_free = true; + + for (j = 0; j < PAGE_SIZE && i < num_aux; j += RMM_PAGE_SIZE) { + if (WARN_ON(rmi_granule_undelegate(aux_page_phys))) + should_free = false; + aux_page_phys += RMM_PAGE_SIZE; + i++; + } + /* Only free if all the undelegate calls were successful */ + if (should_free) + __free_page(aux_page); + } +} + +static int alloc_rec_aux(struct page **aux_pages, + u64 *aux_phys_pages, + unsigned int num_aux) +{ + struct page *aux_page; + int page_count = 0; + unsigned int i, j; + int ret; + + for (i = 0; i < num_aux;) { + phys_addr_t aux_page_phys; + + aux_page = alloc_page(GFP_KERNEL); + if (!aux_page) { + ret = -ENOMEM; + goto out_err; + } + + aux_page_phys = page_to_phys(aux_page); + for (j = 0; j < PAGE_SIZE && i < num_aux; j += RMM_PAGE_SIZE) { + if (rmi_granule_delegate(aux_page_phys)) { + ret = -ENXIO; + goto err_undelegate; + } + aux_phys_pages[i++] = aux_page_phys; + aux_page_phys += RMM_PAGE_SIZE; + } + aux_pages[page_count++] = aux_page; + } + + return 0; +err_undelegate: + while (j > 0) { + j -= RMM_PAGE_SIZE; + i--; + if (WARN_ON(rmi_granule_undelegate(aux_phys_pages[i]))) { + /* Leak the page if the undelegate fails */ + goto out_err; + } + } + __free_page(aux_page); 
+out_err: + free_rec_aux(aux_pages, i); + return ret; +} + +int kvm_create_rec(struct kvm_vcpu *vcpu) +{ + struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu); + unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu); + struct realm *realm = &vcpu->kvm->arch.realm; + struct realm_rec *rec = &vcpu->arch.rec; + unsigned long rec_page_phys; + struct rec_params *params; + int r, i; + + if (kvm_realm_state(vcpu->kvm) != REALM_STATE_NEW) + return -ENOENT; + + if (rec->run) + return -EBUSY; + + /* + * The RMM will report PSCI v1.0 to Realms and the KVM_ARM_VCPU_PSCI_0_2 + * flag covers v0.2 and onwards. + */ + if (!vcpu_has_feature(vcpu, KVM_ARM_VCPU_PSCI_0_2)) + return -EINVAL; + + if (vcpu->kvm->arch.arm_pmu && !kvm_vcpu_has_pmu(vcpu)) + return -EINVAL; + + BUILD_BUG_ON(sizeof(*params) > PAGE_SIZE); + BUILD_BUG_ON(sizeof(*rec->run) > PAGE_SIZE); + + params = (struct rec_params *)get_zeroed_page(GFP_KERNEL); + rec->rec_page = (void *)__get_free_page(GFP_KERNEL); + rec->run = (void *)get_zeroed_page(GFP_KERNEL); + if (!params || !rec->rec_page || !rec->run) { + r = -ENOMEM; + goto out_free_pages; + } + + for (i = 0; i < ARRAY_SIZE(params->gprs); i++) + params->gprs[i] = vcpu_regs->regs[i]; + + params->pc = vcpu_regs->pc; + + if (vcpu->vcpu_id == 0) + params->flags |= REC_PARAMS_FLAG_RUNNABLE; + + rec_page_phys = virt_to_phys(rec->rec_page); + + if (rmi_granule_delegate(rec_page_phys)) { + r = -ENXIO; + goto out_free_pages; + } + + r = alloc_rec_aux(rec->aux_pages, params->aux, realm->num_aux); + if (r) + goto out_undelegate_rmm_rec; + + params->num_rec_aux = realm->num_aux; + params->mpidr = mpidr; + + if (rmi_rec_create(virt_to_phys(realm->rd), + rec_page_phys, + virt_to_phys(params))) { + r = -ENXIO; + goto out_free_rec_aux; + } + + rec->mpidr = mpidr; + + free_page((unsigned long)params); + return 0; + +out_free_rec_aux: + free_rec_aux(rec->aux_pages, realm->num_aux); +out_undelegate_rmm_rec: + if (WARN_ON(rmi_granule_undelegate(rec_page_phys))) + rec->rec_page = NULL; +out_free_pages: + free_page((unsigned long)rec->run); + free_page((unsigned long)rec->rec_page); + free_page((unsigned long)params); + return r; +} + +void kvm_destroy_rec(struct kvm_vcpu *vcpu) +{ + struct realm *realm = &vcpu->kvm->arch.realm; + struct realm_rec *rec = &vcpu->arch.rec; + unsigned long rec_page_phys; + + if (!vcpu_is_rec(vcpu)) + return; + + if (!rec->run) { + /* Nothing to do if the VCPU hasn't been finalized */ + return; + } + + free_page((unsigned long)rec->run); + + rec_page_phys = virt_to_phys(rec->rec_page); + + /* + * The REC and any AUX pages cannot be reclaimed until the REC is + * destroyed. So if the REC destroy fails then the REC page and any AUX + * pages will be leaked. 
+ */ + if (WARN_ON(rmi_rec_destroy(rec_page_phys))) + return; + + free_rec_aux(rec->aux_pages, realm->num_aux); + + free_delegated_granule(rec_page_phys); +} + +int kvm_init_realm_vm(struct kvm *kvm) +{ + kvm->arch.realm.params = (void *)get_zeroed_page(GFP_KERNEL); + + if (!kvm->arch.realm.params) + return -ENOMEM; + return 0; +} + +void kvm_init_rme(void) +{ + if (PAGE_SIZE != SZ_4K) + /* Only 4k page size on the host is supported */ + return; + + if (rmi_check_version()) + /* Continue without realm support */ + return; + + if (WARN_ON(rmi_features(0, &rmm_feat_reg0))) + return; + + if (WARN_ON(rmi_features(1, &rmm_feat_reg1))) + return; + + if (rme_mecid_init()) + return; + + if (rme_vmid_init()) { + rme_mecid_destroy(); + return; + } + + static_branch_enable(&kvm_rme_is_available); +} diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 74b412640185e..cf3c7db02f3ef 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1346,8 +1346,9 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r, * implements. Ignore this error to maintain compatibility * with the existing KVM behavior. */ - if (!kvm_vm_has_ran_once(kvm) && - !vcpu_has_nv(vcpu) && + if (!kvm_vm_has_ran_once(kvm) && + !kvm_realm_is_created(kvm) && + !vcpu_has_nv(vcpu) && new_n <= kvm_arm_pmu_get_max_counters(kvm)) kvm->arch.nr_pmu_counters = new_n; @@ -1994,7 +1995,7 @@ static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val) /* Hide BRBE from guests */ val &= ~ID_AA64DFR0_EL1_BRBE_MASK; - return val; + return kvm_realm_reset_id_aa64dfr0_el1(vcpu, val); } static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu, @@ -2003,6 +2004,9 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu, { u8 debugver = SYS_FIELD_GET(ID_AA64DFR0_EL1, DebugVer, val); u8 pmuver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMUVer, val); + u8 bps = SYS_FIELD_GET(ID_AA64DFR0_EL1, BRPs, val); + u8 wps = SYS_FIELD_GET(ID_AA64DFR0_EL1, WRPs, val); + u8 ctx_cmps = SYS_FIELD_GET(ID_AA64DFR0_EL1, CTX_CMPs, val); /* * Prior to commit 3d0dba5764b9 ("KVM: arm64: PMU: Move the @@ -2022,10 +2026,11 @@ static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu, val &= ~ID_AA64DFR0_EL1_PMUVer_MASK; /* - * ID_AA64DFR0_EL1.DebugVer is one of those awkward fields with a - * nonzero minimum safe value. + * ID_AA64DFR0_EL1.DebugVer, BRPs and WRPs all have to be greater than + * zero. CTX_CMPs is never greater than BRPs. */ - if (debugver < ID_AA64DFR0_EL1_DebugVer_IMP) + if (debugver < ID_AA64DFR0_EL1_DebugVer_IMP || !bps || !wps || + ctx_cmps > bps) return -EINVAL; return set_id_reg(vcpu, rd, val); @@ -2242,10 +2247,11 @@ static int set_id_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, mutex_lock(&vcpu->kvm->arch.config_lock); /* - * Once the VM has started the ID registers are immutable. Reject any - * write that does not match the final register value. + * Once the VM has started or the Realm descriptor is created, the ID + * registers are immutable. Reject any write that does not match the + * final register value. */ - if (kvm_vm_has_ran_once(vcpu->kvm)) { + if (kvm_vm_has_ran_once(vcpu->kvm) || kvm_realm_is_created(vcpu->kvm)) { if (val != read_id_reg(vcpu, rd)) ret = -EBUSY; else @@ -5272,18 +5278,18 @@ int kvm_arm_sys_reg_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg sys_reg_descs, ARRAY_SIZE(sys_reg_descs)); } -static unsigned int num_demux_regs(void) +static inline unsigned int num_demux_regs(struct kvm_vcpu *vcpu) { - return CSSELR_MAX; + return kvm_is_realm(vcpu->kvm) ? 
0 : CSSELR_MAX; } -static int write_demux_regids(u64 __user *uindices) +static int write_demux_regids(struct kvm_vcpu *vcpu, u64 __user *uindices) { u64 val = KVM_REG_ARM64 | KVM_REG_SIZE_U32 | KVM_REG_ARM_DEMUX; unsigned int i; val |= KVM_REG_ARM_DEMUX_ID_CCSIDR; - for (i = 0; i < CSSELR_MAX; i++) { + for (i = 0; i < num_demux_regs(vcpu); i++) { if (put_user(val | i, uindices)) return -EFAULT; uindices++; @@ -5314,11 +5320,28 @@ static bool copy_reg_to_user(const struct sys_reg_desc *reg, u64 __user **uind) return true; } +static inline bool kvm_realm_sys_reg_hidden_user(const struct kvm_vcpu *vcpu, + u64 reg) +{ + if (!kvm_is_realm(vcpu->kvm)) + return false; + + switch (reg) { + case SYS_ID_AA64DFR0_EL1: + case SYS_PMCR_EL0: + return false; + } + return true; +} + static int walk_one_sys_reg(const struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, u64 __user **uind, unsigned int *total) { + if (kvm_realm_sys_reg_hidden_user(vcpu, reg_to_encoding(rd))) + return 0; + /* * Ignore registers we trap but don't save, * and for which no custom user accessor is provided. @@ -5356,7 +5379,7 @@ static int walk_sys_regs(struct kvm_vcpu *vcpu, u64 __user *uind) unsigned long kvm_arm_num_sys_reg_descs(struct kvm_vcpu *vcpu) { - return num_demux_regs() + return num_demux_regs(vcpu) + walk_sys_regs(vcpu, (u64 __user *)NULL); } @@ -5369,7 +5392,7 @@ int kvm_arm_copy_sys_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) return err; uindices += err; - return write_demux_regids(uindices); + return write_demux_regids(vcpu, uindices); } #define KVM_ARM_FEATURE_ID_RANGE_INDEX(r) \ diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index 4c3c0d82e4760..b841ca6c777c1 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -81,7 +81,7 @@ int kvm_vgic_create(struct kvm *kvm, u32 type) * the proper checks already. 
*/ if (type == KVM_DEV_TYPE_ARM_VGIC_V2 && - !kvm_vgic_global_state.can_emulate_gicv2) + (!kvm_vgic_global_state.can_emulate_gicv2 || kvm_is_realm(kvm))) return -ENODEV; /* diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index 6dd5a10081e27..b62a0ac1d994b 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -10,7 +10,9 @@ #include #include +#include #include +#include #include "vgic.h" @@ -21,6 +23,13 @@ struct vgic_global kvm_vgic_global_state __ro_after_init = { .gicv3_cpuif = STATIC_KEY_FALSE_INIT, }; +static inline int kvm_vcpu_vgic_nr_lr(struct kvm_vcpu *vcpu) +{ + if (unlikely(vcpu_is_rec(vcpu))) + return kvm_realm_vgic_nr_lr(); + return kvm_vgic_global_state.nr_lr; +} + /* * Locking order is always: * kvm->lock (mutex) @@ -838,7 +847,7 @@ static void vgic_flush_lr_state(struct kvm_vcpu *vcpu) lockdep_assert_held(&vgic_cpu->ap_list_lock); count = compute_ap_list_depth(vcpu, &multi_sgi); - if (count > kvm_vgic_global_state.nr_lr || multi_sgi) + if (count > kvm_vcpu_vgic_nr_lr(vcpu) || multi_sgi) vgic_sort_ap_list(vcpu); count = 0; @@ -867,7 +876,7 @@ static void vgic_flush_lr_state(struct kvm_vcpu *vcpu) raw_spin_unlock(&irq->irq_lock); - if (count == kvm_vgic_global_state.nr_lr) { + if (count == kvm_vcpu_vgic_nr_lr(vcpu)) { if (!list_is_last(&irq->ap_list, &vgic_cpu->ap_list_head)) vgic_set_underflow(vcpu); @@ -876,7 +885,7 @@ static void vgic_flush_lr_state(struct kvm_vcpu *vcpu) } /* Nuke remaining LRs */ - for (i = count ; i < kvm_vgic_global_state.nr_lr; i++) + for (i = count; i < kvm_vcpu_vgic_nr_lr(vcpu); i++) vgic_clear_lr(vcpu, i); if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) @@ -895,10 +904,26 @@ static inline bool can_access_vgic_from_kernel(void) return !static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif) || has_vhe(); } +static inline void vgic_rmm_save_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3; + int i; + + for (i = 0; i < kvm_vcpu_vgic_nr_lr(vcpu); i++) { + cpu_if->vgic_lr[i] = vcpu->arch.rec.run->exit.gicv3_lrs[i]; + vcpu->arch.rec.run->enter.gicv3_lrs[i] = 0; + } + + cpu_if->vgic_hcr = vcpu->arch.rec.run->exit.gicv3_hcr; + cpu_if->vgic_vmcr = vcpu->arch.rec.run->exit.gicv3_vmcr; +} + static inline void vgic_save_state(struct kvm_vcpu *vcpu) { if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) vgic_v2_save_state(vcpu); + else if (vcpu_is_rec(vcpu)) + vgic_rmm_save_state(vcpu); else __vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3); } @@ -934,10 +959,30 @@ void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu) vgic_prune_ap_list(vcpu); } +static inline void vgic_rmm_restore_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3; + int i; + + for (i = 0; i < kvm_vcpu_vgic_nr_lr(vcpu); i++) { + vcpu->arch.rec.run->enter.gicv3_lrs[i] = cpu_if->vgic_lr[i]; + /* + * Also populate the rec.run->exit copies so that a late + * decision to back out from entering the realm doesn't cause + * the state to be lost + */ + vcpu->arch.rec.run->exit.gicv3_lrs[i] = cpu_if->vgic_lr[i]; + } + + vcpu->arch.rec.run->enter.gicv3_hcr = cpu_if->vgic_hcr & RMI_PERMITTED_GICV3_HCR_BITS; +} + static inline void vgic_restore_state(struct kvm_vcpu *vcpu) { if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) vgic_v2_restore_state(vcpu); + else if (vcpu_is_rec(vcpu)) + vgic_rmm_restore_state(vcpu); else __vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3); } @@ -1007,7 +1052,10 @@ void kvm_vgic_flush_hwstate(struct 
kvm_vcpu *vcpu) void kvm_vgic_load(struct kvm_vcpu *vcpu) { - if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { + if (unlikely(vcpu_is_rec(vcpu))) + return; + if (unlikely(!irqchip_in_kernel(vcpu->kvm) || + !vgic_initialized(vcpu->kvm))) { if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) __vgic_v3_activate_traps(&vcpu->arch.vgic_cpu.vgic_v3); return; @@ -1021,7 +1069,10 @@ void kvm_vgic_load(struct kvm_vcpu *vcpu) void kvm_vgic_put(struct kvm_vcpu *vcpu) { - if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { + if (unlikely(vcpu_is_rec(vcpu))) + return; + if (unlikely(!irqchip_in_kernel(vcpu->kvm) || + !vgic_initialized(vcpu->kvm))) { if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) __vgic_v3_deactivate_traps(&vcpu->arch.vgic_cpu.vgic_v3); return; diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index d816ff44faff9..e4237637cd8f1 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -854,6 +854,25 @@ static int do_tag_check_fault(unsigned long far, unsigned long esr, return 0; } +static int do_gpf_ptw(unsigned long far, unsigned long esr, struct pt_regs *regs) +{ + const struct fault_info *inf = esr_to_fault_info(esr); + + die_kernel_fault(inf->name, far, esr, regs); + return 0; +} + +static int do_gpf(unsigned long far, unsigned long esr, struct pt_regs *regs) +{ + const struct fault_info *inf = esr_to_fault_info(esr); + + if (!is_el1_instruction_abort(esr) && fixup_exception(regs, esr)) + return 0; + + arm64_notify_die(inf->name, regs, inf->sig, inf->code, far, esr); + return 0; +} + static const struct fault_info fault_info[] = { { do_bad, SIGKILL, SI_KERNEL, "ttbr address size fault" }, { do_bad, SIGKILL, SI_KERNEL, "level 1 address size fault" }, @@ -890,12 +909,12 @@ static const struct fault_info fault_info[] = { { do_bad, SIGKILL, SI_KERNEL, "unknown 32" }, { do_alignment_fault, SIGBUS, BUS_ADRALN, "alignment fault" }, { do_bad, SIGKILL, SI_KERNEL, "unknown 34" }, - { do_bad, SIGKILL, SI_KERNEL, "unknown 35" }, - { do_bad, SIGKILL, SI_KERNEL, "unknown 36" }, - { do_bad, SIGKILL, SI_KERNEL, "unknown 37" }, - { do_bad, SIGKILL, SI_KERNEL, "unknown 38" }, - { do_bad, SIGKILL, SI_KERNEL, "unknown 39" }, - { do_bad, SIGKILL, SI_KERNEL, "unknown 40" }, + { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level -1" }, + { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 0" }, + { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 1" }, + { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 2" }, + { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 3" }, + { do_gpf, SIGBUS, SI_KERNEL, "Granule Protection Fault not on table walk" }, { do_bad, SIGKILL, SI_KERNEL, "level -1 address size fault" }, { do_bad, SIGKILL, SI_KERNEL, "unknown 42" }, { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level -1 translation fault" }, diff --git a/arch/mips/jazz/jazzdma.c b/arch/mips/jazz/jazzdma.c index c97b089b99029..eb9fb2f2a7201 100644 --- a/arch/mips/jazz/jazzdma.c +++ b/arch/mips/jazz/jazzdma.c @@ -521,18 +521,24 @@ static void jazz_dma_free(struct device *dev, size_t size, void *vaddr, __free_pages(virt_to_page(vaddr), get_order(size)); } -static dma_addr_t jazz_dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs) +static dma_addr_t jazz_dma_map_phys(struct device *dev, phys_addr_t phys, + size_t size, 
enum dma_data_direction dir, unsigned long attrs) { - phys_addr_t phys = page_to_phys(page) + offset; + if (unlikely(attrs & DMA_ATTR_MMIO)) + /* + * This check is included because older versions of the code lacked + * MMIO path support, and my ability to test this path is limited. + * However, from a software technical standpoint, there is no restriction, + * as the following code operates solely on physical addresses. + */ + return DMA_MAPPING_ERROR; if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) arch_sync_dma_for_device(phys, size, dir); return vdma_alloc(phys, size); } -static void jazz_dma_unmap_page(struct device *dev, dma_addr_t dma_addr, +static void jazz_dma_unmap_phys(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) @@ -607,8 +613,8 @@ static void jazz_dma_sync_sg_for_cpu(struct device *dev, const struct dma_map_ops jazz_dma_ops = { .alloc = jazz_dma_alloc, .free = jazz_dma_free, - .map_page = jazz_dma_map_page, - .unmap_page = jazz_dma_unmap_page, + .map_phys = jazz_dma_map_phys, + .unmap_phys = jazz_dma_unmap_phys, .map_sg = jazz_dma_map_sg, .unmap_sg = jazz_dma_unmap_sg, .sync_single_for_cpu = jazz_dma_sync_single_for_cpu, diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index b410021ad4c69..eafdd63cd6c4f 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -274,12 +274,12 @@ extern void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl, unsigned long mask, gfp_t flag, int node); extern void iommu_free_coherent(struct iommu_table *tbl, size_t size, void *vaddr, dma_addr_t dma_handle); -extern dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, - struct page *page, unsigned long offset, - size_t size, unsigned long mask, +extern dma_addr_t iommu_map_phys(struct device *dev, struct iommu_table *tbl, + phys_addr_t phys, size_t size, + unsigned long mask, enum dma_data_direction direction, unsigned long attrs); -extern void iommu_unmap_page(struct iommu_table *tbl, dma_addr_t dma_handle, +extern void iommu_unmap_phys(struct iommu_table *tbl, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction, unsigned long attrs); diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c index 4d64a5db50f38..aa3689d619179 100644 --- a/arch/powerpc/kernel/dma-iommu.c +++ b/arch/powerpc/kernel/dma-iommu.c @@ -14,7 +14,7 @@ #define can_map_direct(dev, addr) \ ((dev)->bus_dma_limit >= phys_to_dma((dev), (addr))) -bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr) +bool arch_dma_map_phys_direct(struct device *dev, phys_addr_t addr) { if (likely(!dev->bus_dma_limit)) return false; @@ -24,7 +24,7 @@ bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr) #define is_direct_handle(dev, h) ((h) >= (dev)->archdata.dma_offset) -bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle) +bool arch_dma_unmap_phys_direct(struct device *dev, dma_addr_t dma_handle) { if (likely(!dev->bus_dma_limit)) return false; @@ -93,28 +93,26 @@ static void dma_iommu_free_coherent(struct device *dev, size_t size, /* Creates TCEs for a user provided buffer. The user buffer must be * contiguous real kernel storage (not vmalloc). The address passed here - * comprises a page address and offset into that page. The dma_addr_t - * returned will point to the same byte within the page as was passed in. + * is a physical address to that page. 
The dma_addr_t returned will point + * to the same byte within the page as was passed in. */ -static dma_addr_t dma_iommu_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, +static dma_addr_t dma_iommu_map_phys(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction direction, unsigned long attrs) { - return iommu_map_page(dev, get_iommu_table_base(dev), page, offset, - size, dma_get_mask(dev), direction, attrs); + return iommu_map_phys(dev, get_iommu_table_base(dev), phys, size, + dma_get_mask(dev), direction, attrs); } - -static void dma_iommu_unmap_page(struct device *dev, dma_addr_t dma_handle, +static void dma_iommu_unmap_phys(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction, unsigned long attrs) { - iommu_unmap_page(get_iommu_table_base(dev), dma_handle, size, direction, + iommu_unmap_phys(get_iommu_table_base(dev), dma_handle, size, direction, attrs); } - static int dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist, int nelems, enum dma_data_direction direction, unsigned long attrs) @@ -211,8 +209,8 @@ const struct dma_map_ops dma_iommu_ops = { .map_sg = dma_iommu_map_sg, .unmap_sg = dma_iommu_unmap_sg, .dma_supported = dma_iommu_dma_supported, - .map_page = dma_iommu_map_page, - .unmap_page = dma_iommu_unmap_page, + .map_phys = dma_iommu_map_phys, + .unmap_phys = dma_iommu_unmap_phys, .get_required_mask = dma_iommu_get_required_mask, .mmap = dma_common_mmap, .get_sgtable = dma_common_get_sgtable, diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 244eb4857e7f4..6b5f4b72ce97f 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -848,12 +848,12 @@ EXPORT_SYMBOL_GPL(iommu_tce_table_put); /* Creates TCEs for a user provided buffer. The user buffer must be * contiguous real kernel storage (not vmalloc). The address passed here - * comprises a page address and offset into that page. The dma_addr_t - * returned will point to the same byte within the page as was passed in. + * is physical address into that page. The dma_addr_t returned will point + * to the same byte within the page as was passed in. */ -dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, - struct page *page, unsigned long offset, size_t size, - unsigned long mask, enum dma_data_direction direction, +dma_addr_t iommu_map_phys(struct device *dev, struct iommu_table *tbl, + phys_addr_t phys, size_t size, unsigned long mask, + enum dma_data_direction direction, unsigned long attrs) { dma_addr_t dma_handle = DMA_MAPPING_ERROR; @@ -863,7 +863,7 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, BUG_ON(direction == DMA_NONE); - vaddr = page_address(page) + offset; + vaddr = phys_to_virt(phys); uaddr = (unsigned long)vaddr; if (tbl) { @@ -890,7 +890,7 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, return dma_handle; } -void iommu_unmap_page(struct iommu_table *tbl, dma_addr_t dma_handle, +void iommu_unmap_phys(struct iommu_table *tbl, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction, unsigned long attrs) { diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c index afbaabf182d01..f4f3477d3a234 100644 --- a/arch/powerpc/platforms/ps3/system-bus.c +++ b/arch/powerpc/platforms/ps3/system-bus.c @@ -551,18 +551,20 @@ static void ps3_free_coherent(struct device *_dev, size_t size, void *vaddr, /* Creates TCEs for a user provided buffer. 
The user buffer must be * contiguous real kernel storage (not vmalloc). The address passed here - * comprises a page address and offset into that page. The dma_addr_t - * returned will point to the same byte within the page as was passed in. + * is physical address to that hat page. The dma_addr_t returned will point + * to the same byte within the page as was passed in. */ -static dma_addr_t ps3_sb_map_page(struct device *_dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction direction, - unsigned long attrs) +static dma_addr_t ps3_sb_map_phys(struct device *_dev, phys_addr_t phys, + size_t size, enum dma_data_direction direction, unsigned long attrs) { struct ps3_system_bus_device *dev = ps3_dev_to_system_bus_dev(_dev); int result; dma_addr_t bus_addr; - void *ptr = page_address(page) + offset; + void *ptr = phys_to_virt(phys); + + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; result = ps3_dma_map(dev->d_region, (unsigned long)ptr, size, &bus_addr, @@ -577,8 +579,8 @@ static dma_addr_t ps3_sb_map_page(struct device *_dev, struct page *page, return bus_addr; } -static dma_addr_t ps3_ioc0_map_page(struct device *_dev, struct page *page, - unsigned long offset, size_t size, +static dma_addr_t ps3_ioc0_map_phys(struct device *_dev, phys_addr_t phys, + size_t size, enum dma_data_direction direction, unsigned long attrs) { @@ -586,7 +588,10 @@ static dma_addr_t ps3_ioc0_map_page(struct device *_dev, struct page *page, int result; dma_addr_t bus_addr; u64 iopte_flag; - void *ptr = page_address(page) + offset; + void *ptr = phys_to_virt(phys); + + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; iopte_flag = CBE_IOPTE_M; switch (direction) { @@ -613,7 +618,7 @@ static dma_addr_t ps3_ioc0_map_page(struct device *_dev, struct page *page, return bus_addr; } -static void ps3_unmap_page(struct device *_dev, dma_addr_t dma_addr, +static void ps3_unmap_phys(struct device *_dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction direction, unsigned long attrs) { struct ps3_system_bus_device *dev = ps3_dev_to_system_bus_dev(_dev); @@ -690,8 +695,8 @@ static const struct dma_map_ops ps3_sb_dma_ops = { .map_sg = ps3_sb_map_sg, .unmap_sg = ps3_sb_unmap_sg, .dma_supported = ps3_dma_supported, - .map_page = ps3_sb_map_page, - .unmap_page = ps3_unmap_page, + .map_phys = ps3_sb_map_phys, + .unmap_phys = ps3_unmap_phys, .mmap = dma_common_mmap, .get_sgtable = dma_common_get_sgtable, .alloc_pages_op = dma_common_alloc_pages, @@ -704,8 +709,8 @@ static const struct dma_map_ops ps3_ioc0_dma_ops = { .map_sg = ps3_ioc0_map_sg, .unmap_sg = ps3_ioc0_unmap_sg, .dma_supported = ps3_dma_supported, - .map_page = ps3_ioc0_map_page, - .unmap_page = ps3_unmap_page, + .map_phys = ps3_ioc0_map_phys, + .unmap_phys = ps3_unmap_phys, .mmap = dma_common_mmap, .get_sgtable = dma_common_get_sgtable, .alloc_pages_op = dma_common_alloc_pages, diff --git a/arch/powerpc/platforms/pseries/ibmebus.c b/arch/powerpc/platforms/pseries/ibmebus.c index 3436b0af795e2..cad2deb7e70d9 100644 --- a/arch/powerpc/platforms/pseries/ibmebus.c +++ b/arch/powerpc/platforms/pseries/ibmebus.c @@ -86,17 +86,18 @@ static void ibmebus_free_coherent(struct device *dev, kfree(vaddr); } -static dma_addr_t ibmebus_map_page(struct device *dev, - struct page *page, - unsigned long offset, +static dma_addr_t ibmebus_map_phys(struct device *dev, phys_addr_t phys, size_t size, enum dma_data_direction direction, unsigned long attrs) { - return (dma_addr_t)(page_address(page) + offset); + if (attrs 
& DMA_ATTR_MMIO) + return DMA_MAPPING_ERROR; + + return (dma_addr_t)(phys_to_virt(phys)); } -static void ibmebus_unmap_page(struct device *dev, +static void ibmebus_unmap_phys(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction direction, @@ -146,8 +147,8 @@ static const struct dma_map_ops ibmebus_dma_ops = { .unmap_sg = ibmebus_unmap_sg, .dma_supported = ibmebus_dma_supported, .get_required_mask = ibmebus_dma_get_required_mask, - .map_page = ibmebus_map_page, - .unmap_page = ibmebus_unmap_page, + .map_phys = ibmebus_map_phys, + .unmap_phys = ibmebus_unmap_phys, }; static int ibmebus_match_path(struct device *dev, const void *data) diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c index 1293cd646c949..81bb042b6020a 100644 --- a/arch/powerpc/platforms/pseries/vio.c +++ b/arch/powerpc/platforms/pseries/vio.c @@ -511,18 +511,21 @@ static void vio_dma_iommu_free_coherent(struct device *dev, size_t size, vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE)); } -static dma_addr_t vio_dma_iommu_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction direction, - unsigned long attrs) +static dma_addr_t vio_dma_iommu_map_phys(struct device *dev, phys_addr_t phys, + size_t size, + enum dma_data_direction direction, + unsigned long attrs) { struct vio_dev *viodev = to_vio_dev(dev); struct iommu_table *tbl = get_iommu_table_base(dev); dma_addr_t ret = DMA_MAPPING_ERROR; + if (unlikely(attrs & DMA_ATTR_MMIO)) + return ret; + if (vio_cmo_alloc(viodev, roundup(size, IOMMU_PAGE_SIZE(tbl)))) goto out_fail; - ret = iommu_map_page(dev, tbl, page, offset, size, dma_get_mask(dev), + ret = iommu_map_phys(dev, tbl, phys, size, dma_get_mask(dev), direction, attrs); if (unlikely(ret == DMA_MAPPING_ERROR)) goto out_deallocate; @@ -535,7 +538,7 @@ static dma_addr_t vio_dma_iommu_map_page(struct device *dev, struct page *page, return DMA_MAPPING_ERROR; } -static void vio_dma_iommu_unmap_page(struct device *dev, dma_addr_t dma_handle, +static void vio_dma_iommu_unmap_phys(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction, unsigned long attrs) @@ -543,7 +546,7 @@ static void vio_dma_iommu_unmap_page(struct device *dev, dma_addr_t dma_handle, struct vio_dev *viodev = to_vio_dev(dev); struct iommu_table *tbl = get_iommu_table_base(dev); - iommu_unmap_page(tbl, dma_handle, size, direction, attrs); + iommu_unmap_phys(tbl, dma_handle, size, direction, attrs); vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE(tbl))); } @@ -604,8 +607,8 @@ static const struct dma_map_ops vio_dma_mapping_ops = { .free = vio_dma_iommu_free_coherent, .map_sg = vio_dma_iommu_map_sg, .unmap_sg = vio_dma_iommu_unmap_sg, - .map_page = vio_dma_iommu_map_page, - .unmap_page = vio_dma_iommu_unmap_page, + .map_phys = vio_dma_iommu_map_phys, + .unmap_phys = vio_dma_iommu_unmap_phys, .dma_supported = dma_iommu_dma_supported, .get_required_mask = dma_iommu_get_required_mask, .mmap = dma_common_mmap, diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c index da03636925283..46ef88bc9c26e 100644 --- a/arch/sparc/kernel/iommu.c +++ b/arch/sparc/kernel/iommu.c @@ -260,26 +260,35 @@ static void dma_4u_free_coherent(struct device *dev, size_t size, free_pages((unsigned long)cpu, order); } -static dma_addr_t dma_4u_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t sz, - enum dma_data_direction direction, +static dma_addr_t dma_4u_map_phys(struct device *dev, 
phys_addr_t phys, + size_t sz, enum dma_data_direction direction, unsigned long attrs) { struct iommu *iommu; struct strbuf *strbuf; iopte_t *base; unsigned long flags, npages, oaddr; - unsigned long i, base_paddr, ctx; + unsigned long i, ctx; u32 bus_addr, ret; unsigned long iopte_protection; + if (unlikely(attrs & DMA_ATTR_MMIO)) + /* + * This check is included because older versions of the code + * lacked MMIO path support, and my ability to test this path + * is limited. However, from a software technical standpoint, + * there is no restriction, as the following code operates + * solely on physical addresses. + */ + goto bad_no_ctx; + iommu = dev->archdata.iommu; strbuf = dev->archdata.stc; if (unlikely(direction == DMA_NONE)) goto bad_no_ctx; - oaddr = (unsigned long)(page_address(page) + offset); + oaddr = (unsigned long)(phys_to_virt(phys)); npages = IO_PAGE_ALIGN(oaddr + sz) - (oaddr & IO_PAGE_MASK); npages >>= IO_PAGE_SHIFT; @@ -296,7 +305,6 @@ static dma_addr_t dma_4u_map_page(struct device *dev, struct page *page, bus_addr = (iommu->tbl.table_map_base + ((base - iommu->page_table) << IO_PAGE_SHIFT)); ret = bus_addr | (oaddr & ~IO_PAGE_MASK); - base_paddr = __pa(oaddr & IO_PAGE_MASK); if (strbuf->strbuf_enabled) iopte_protection = IOPTE_STREAMING(ctx); else @@ -304,8 +312,8 @@ static dma_addr_t dma_4u_map_page(struct device *dev, struct page *page, if (direction != DMA_TO_DEVICE) iopte_protection |= IOPTE_WRITE; - for (i = 0; i < npages; i++, base++, base_paddr += IO_PAGE_SIZE) - iopte_val(*base) = iopte_protection | base_paddr; + for (i = 0; i < npages; i++, base++, phys += IO_PAGE_SIZE) + iopte_val(*base) = iopte_protection | phys; return ret; @@ -383,7 +391,7 @@ static void strbuf_flush(struct strbuf *strbuf, struct iommu *iommu, vaddr, ctx, npages); } -static void dma_4u_unmap_page(struct device *dev, dma_addr_t bus_addr, +static void dma_4u_unmap_phys(struct device *dev, dma_addr_t bus_addr, size_t sz, enum dma_data_direction direction, unsigned long attrs) { @@ -753,8 +761,8 @@ static int dma_4u_supported(struct device *dev, u64 device_mask) static const struct dma_map_ops sun4u_dma_ops = { .alloc = dma_4u_alloc_coherent, .free = dma_4u_free_coherent, - .map_page = dma_4u_map_page, - .unmap_page = dma_4u_unmap_page, + .map_phys = dma_4u_map_phys, + .unmap_phys = dma_4u_unmap_phys, .map_sg = dma_4u_map_sg, .unmap_sg = dma_4u_unmap_sg, .sync_single_for_cpu = dma_4u_sync_single_for_cpu, diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c index b720b21ccfbd8..791f0a76665f6 100644 --- a/arch/sparc/kernel/pci_sun4v.c +++ b/arch/sparc/kernel/pci_sun4v.c @@ -352,9 +352,8 @@ static void dma_4v_free_coherent(struct device *dev, size_t size, void *cpu, free_pages((unsigned long)cpu, order); } -static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t sz, - enum dma_data_direction direction, +static dma_addr_t dma_4v_map_phys(struct device *dev, phys_addr_t phys, + size_t sz, enum dma_data_direction direction, unsigned long attrs) { struct iommu *iommu; @@ -362,18 +361,27 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page, struct iommu_map_table *tbl; u64 mask; unsigned long flags, npages, oaddr; - unsigned long i, base_paddr; - unsigned long prot; + unsigned long i, prot; dma_addr_t bus_addr, ret; long entry; + if (unlikely(attrs & DMA_ATTR_MMIO)) + /* + * This check is included because older versions of the code + * lacked MMIO path support, and my ability to test this path + * is limited. 
However, from a software technical standpoint, + * there is no restriction, as the following code operates + * solely on physical addresses. + */ + goto bad; + iommu = dev->archdata.iommu; atu = iommu->atu; if (unlikely(direction == DMA_NONE)) goto bad; - oaddr = (unsigned long)(page_address(page) + offset); + oaddr = (unsigned long)(phys_to_virt(phys)); npages = IO_PAGE_ALIGN(oaddr + sz) - (oaddr & IO_PAGE_MASK); npages >>= IO_PAGE_SHIFT; @@ -391,7 +399,6 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page, bus_addr = (tbl->table_map_base + (entry << IO_PAGE_SHIFT)); ret = bus_addr | (oaddr & ~IO_PAGE_MASK); - base_paddr = __pa(oaddr & IO_PAGE_MASK); prot = HV_PCI_MAP_ATTR_READ; if (direction != DMA_TO_DEVICE) prot |= HV_PCI_MAP_ATTR_WRITE; @@ -403,8 +410,8 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page, iommu_batch_start(dev, prot, entry); - for (i = 0; i < npages; i++, base_paddr += IO_PAGE_SIZE) { - long err = iommu_batch_add(base_paddr, mask); + for (i = 0; i < npages; i++, phys += IO_PAGE_SIZE) { + long err = iommu_batch_add(phys, mask); if (unlikely(err < 0L)) goto iommu_map_fail; } @@ -426,7 +433,7 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page, return DMA_MAPPING_ERROR; } -static void dma_4v_unmap_page(struct device *dev, dma_addr_t bus_addr, +static void dma_4v_unmap_phys(struct device *dev, dma_addr_t bus_addr, size_t sz, enum dma_data_direction direction, unsigned long attrs) { @@ -686,8 +693,8 @@ static int dma_4v_supported(struct device *dev, u64 device_mask) static const struct dma_map_ops sun4v_dma_ops = { .alloc = dma_4v_alloc_coherent, .free = dma_4v_free_coherent, - .map_page = dma_4v_map_page, - .unmap_page = dma_4v_unmap_page, + .map_phys = dma_4v_map_phys, + .unmap_phys = dma_4v_unmap_phys, .map_sg = dma_4v_map_sg, .unmap_sg = dma_4v_unmap_sg, .dma_supported = dma_4v_supported, diff --git a/arch/sparc/mm/io-unit.c b/arch/sparc/mm/io-unit.c index d8376f61b4d08..d409cb450de48 100644 --- a/arch/sparc/mm/io-unit.c +++ b/arch/sparc/mm/io-unit.c @@ -94,13 +94,14 @@ static int __init iounit_init(void) subsys_initcall(iounit_init); /* One has to hold iounit->lock to call this */ -static unsigned long iounit_get_area(struct iounit_struct *iounit, unsigned long vaddr, int size) +static dma_addr_t iounit_get_area(struct iounit_struct *iounit, + phys_addr_t phys, int size) { int i, j, k, npages; unsigned long rotor, scan, limit; iopte_t iopte; - npages = ((vaddr & ~PAGE_MASK) + size + (PAGE_SIZE-1)) >> PAGE_SHIFT; + npages = (offset_in_page(phys) + size + (PAGE_SIZE - 1)) >> PAGE_SHIFT; /* A tiny bit of magic ingredience :) */ switch (npages) { @@ -109,7 +110,7 @@ static unsigned long iounit_get_area(struct iounit_struct *iounit, unsigned long default: i = 0x0213; break; } - IOD(("iounit_get_area(%08lx,%d[%d])=", vaddr, size, npages)); + IOD(("%s(%pa,%d[%d])=", __func__, &phys, size, npages)); next: j = (i & 15); rotor = iounit->rotor[j - 1]; @@ -124,7 +125,8 @@ nexti: scan = find_next_zero_bit(iounit->bmap, limit, scan); } i >>= 4; if (!(i & 15)) - panic("iounit_get_area: Couldn't find free iopte slots for (%08lx,%d)\n", vaddr, size); + panic("iounit_get_area: Couldn't find free iopte slots for (%pa,%d)\n", + &phys, size); goto next; } for (k = 1, scan++; k < npages; k++) @@ -132,30 +134,29 @@ nexti: scan = find_next_zero_bit(iounit->bmap, limit, scan); goto nexti; iounit->rotor[j - 1] = (scan < limit) ? 
scan : iounit->limit[j - 1]; scan -= npages; - iopte = MKIOPTE(__pa(vaddr & PAGE_MASK)); - vaddr = IOUNIT_DMA_BASE + (scan << PAGE_SHIFT) + (vaddr & ~PAGE_MASK); + iopte = MKIOPTE(phys & PAGE_MASK); + phys = IOUNIT_DMA_BASE + (scan << PAGE_SHIFT) + offset_in_page(phys); for (k = 0; k < npages; k++, iopte = __iopte(iopte_val(iopte) + 0x100), scan++) { set_bit(scan, iounit->bmap); sbus_writel(iopte_val(iopte), &iounit->page_table[scan]); } - IOD(("%08lx\n", vaddr)); - return vaddr; + IOD(("%pa\n", &phys)); + return phys; } -static dma_addr_t iounit_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t len, enum dma_data_direction dir, - unsigned long attrs) +static dma_addr_t iounit_map_phys(struct device *dev, phys_addr_t phys, + size_t len, enum dma_data_direction dir, unsigned long attrs) { - void *vaddr = page_address(page) + offset; struct iounit_struct *iounit = dev->archdata.iommu; - unsigned long ret, flags; + unsigned long flags; + dma_addr_t ret; /* XXX So what is maxphys for us and how do drivers know it? */ if (!len || len > 256 * 1024) return DMA_MAPPING_ERROR; spin_lock_irqsave(&iounit->lock, flags); - ret = iounit_get_area(iounit, (unsigned long)vaddr, len); + ret = iounit_get_area(iounit, phys, len); spin_unlock_irqrestore(&iounit->lock, flags); return ret; } @@ -171,14 +172,15 @@ static int iounit_map_sg(struct device *dev, struct scatterlist *sgl, int nents, /* FIXME: Cache some resolved pages - often several sg entries are to the same page */ spin_lock_irqsave(&iounit->lock, flags); for_each_sg(sgl, sg, nents, i) { - sg->dma_address = iounit_get_area(iounit, (unsigned long) sg_virt(sg), sg->length); + sg->dma_address = + iounit_get_area(iounit, sg_phys(sg), sg->length); sg->dma_length = sg->length; } spin_unlock_irqrestore(&iounit->lock, flags); return nents; } -static void iounit_unmap_page(struct device *dev, dma_addr_t vaddr, size_t len, +static void iounit_unmap_phys(struct device *dev, dma_addr_t vaddr, size_t len, enum dma_data_direction dir, unsigned long attrs) { struct iounit_struct *iounit = dev->archdata.iommu; @@ -279,8 +281,8 @@ static const struct dma_map_ops iounit_dma_ops = { .alloc = iounit_alloc, .free = iounit_free, #endif - .map_page = iounit_map_page, - .unmap_page = iounit_unmap_page, + .map_phys = iounit_map_phys, + .unmap_phys = iounit_unmap_phys, .map_sg = iounit_map_sg, .unmap_sg = iounit_unmap_sg, }; diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c index 5a5080db800f5..f48adf62724ab 100644 --- a/arch/sparc/mm/iommu.c +++ b/arch/sparc/mm/iommu.c @@ -181,18 +181,20 @@ static void iommu_flush_iotlb(iopte_t *iopte, unsigned int niopte) } } -static dma_addr_t __sbus_iommu_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t len, bool per_page_flush) +static dma_addr_t __sbus_iommu_map_phys(struct device *dev, phys_addr_t paddr, + size_t len, bool per_page_flush, unsigned long attrs) { struct iommu_struct *iommu = dev->archdata.iommu; - phys_addr_t paddr = page_to_phys(page) + offset; - unsigned long off = paddr & ~PAGE_MASK; + unsigned long off = offset_in_page(paddr); unsigned long npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT; unsigned long pfn = __phys_to_pfn(paddr); unsigned int busa, busa0; iopte_t *iopte, *iopte0; int ioptex, i; + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; + /* XXX So what is maxphys for us and how do drivers know it? 
*/ if (!len || len > 256 * 1024) return DMA_MAPPING_ERROR; @@ -202,10 +204,10 @@ static dma_addr_t __sbus_iommu_map_page(struct device *dev, struct page *page, * XXX Is this a good assumption? * XXX What if someone else unmaps it here and races us? */ - if (per_page_flush && !PageHighMem(page)) { + if (per_page_flush && !PhysHighMem(paddr)) { unsigned long vaddr, p; - vaddr = (unsigned long)page_address(page) + offset; + vaddr = (unsigned long)phys_to_virt(paddr); for (p = vaddr & PAGE_MASK; p < vaddr + len; p += PAGE_SIZE) flush_page_for_dma(p); } @@ -231,19 +233,19 @@ static dma_addr_t __sbus_iommu_map_page(struct device *dev, struct page *page, return busa0 + off; } -static dma_addr_t sbus_iommu_map_page_gflush(struct device *dev, - struct page *page, unsigned long offset, size_t len, - enum dma_data_direction dir, unsigned long attrs) +static dma_addr_t sbus_iommu_map_phys_gflush(struct device *dev, + phys_addr_t phys, size_t len, enum dma_data_direction dir, + unsigned long attrs) { flush_page_for_dma(0); - return __sbus_iommu_map_page(dev, page, offset, len, false); + return __sbus_iommu_map_phys(dev, phys, len, false, attrs); } -static dma_addr_t sbus_iommu_map_page_pflush(struct device *dev, - struct page *page, unsigned long offset, size_t len, - enum dma_data_direction dir, unsigned long attrs) +static dma_addr_t sbus_iommu_map_phys_pflush(struct device *dev, + phys_addr_t phys, size_t len, enum dma_data_direction dir, + unsigned long attrs) { - return __sbus_iommu_map_page(dev, page, offset, len, true); + return __sbus_iommu_map_phys(dev, phys, len, true, attrs); } static int __sbus_iommu_map_sg(struct device *dev, struct scatterlist *sgl, @@ -254,8 +256,8 @@ static int __sbus_iommu_map_sg(struct device *dev, struct scatterlist *sgl, int j; for_each_sg(sgl, sg, nents, j) { - sg->dma_address =__sbus_iommu_map_page(dev, sg_page(sg), - sg->offset, sg->length, per_page_flush); + sg->dma_address = __sbus_iommu_map_phys(dev, sg_phys(sg), + sg->length, per_page_flush, attrs); if (sg->dma_address == DMA_MAPPING_ERROR) return -EIO; sg->dma_length = sg->length; @@ -277,7 +279,7 @@ static int sbus_iommu_map_sg_pflush(struct device *dev, struct scatterlist *sgl, return __sbus_iommu_map_sg(dev, sgl, nents, dir, attrs, true); } -static void sbus_iommu_unmap_page(struct device *dev, dma_addr_t dma_addr, +static void sbus_iommu_unmap_phys(struct device *dev, dma_addr_t dma_addr, size_t len, enum dma_data_direction dir, unsigned long attrs) { struct iommu_struct *iommu = dev->archdata.iommu; @@ -303,7 +305,7 @@ static void sbus_iommu_unmap_sg(struct device *dev, struct scatterlist *sgl, int i; for_each_sg(sgl, sg, nents, i) { - sbus_iommu_unmap_page(dev, sg->dma_address, sg->length, dir, + sbus_iommu_unmap_phys(dev, sg->dma_address, sg->length, dir, attrs); sg->dma_address = 0x21212121; } @@ -426,8 +428,8 @@ static const struct dma_map_ops sbus_iommu_dma_gflush_ops = { .alloc = sbus_iommu_alloc, .free = sbus_iommu_free, #endif - .map_page = sbus_iommu_map_page_gflush, - .unmap_page = sbus_iommu_unmap_page, + .map_phys = sbus_iommu_map_phys_gflush, + .unmap_phys = sbus_iommu_unmap_phys, .map_sg = sbus_iommu_map_sg_gflush, .unmap_sg = sbus_iommu_unmap_sg, }; @@ -437,8 +439,8 @@ static const struct dma_map_ops sbus_iommu_dma_pflush_ops = { .alloc = sbus_iommu_alloc, .free = sbus_iommu_free, #endif - .map_page = sbus_iommu_map_page_pflush, - .unmap_page = sbus_iommu_unmap_page, + .map_phys = sbus_iommu_map_phys_pflush, + .unmap_phys = sbus_iommu_unmap_phys, .map_sg = sbus_iommu_map_sg_pflush, 
.unmap_sg = sbus_iommu_unmap_sg, }; diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c index 3485d419c2f5e..93a06307d9539 100644 --- a/arch/x86/kernel/amd_gart_64.c +++ b/arch/x86/kernel/amd_gart_64.c @@ -222,13 +222,14 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem, } /* Map a single area into the IOMMU */ -static dma_addr_t gart_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, +static dma_addr_t gart_map_phys(struct device *dev, phys_addr_t paddr, + size_t size, enum dma_data_direction dir, unsigned long attrs) { unsigned long bus; - phys_addr_t paddr = page_to_phys(page) + offset; + + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; if (!need_iommu(dev, paddr, size)) return paddr; @@ -242,7 +243,7 @@ static dma_addr_t gart_map_page(struct device *dev, struct page *page, /* * Free a DMA mapping. */ -static void gart_unmap_page(struct device *dev, dma_addr_t dma_addr, +static void gart_unmap_phys(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { @@ -282,7 +283,7 @@ static void gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, for_each_sg(sg, s, nents, i) { if (!s->dma_length || !s->length) break; - gart_unmap_page(dev, s->dma_address, s->dma_length, dir, 0); + gart_unmap_phys(dev, s->dma_address, s->dma_length, dir, 0); } } @@ -487,7 +488,7 @@ static void gart_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_addr, unsigned long attrs) { - gart_unmap_page(dev, dma_addr, size, DMA_BIDIRECTIONAL, 0); + gart_unmap_phys(dev, dma_addr, size, DMA_BIDIRECTIONAL, 0); dma_direct_free(dev, size, vaddr, dma_addr, attrs); } @@ -668,8 +669,8 @@ static __init int init_amd_gatt(struct agp_kern_info *info) static const struct dma_map_ops gart_dma_ops = { .map_sg = gart_map_sg, .unmap_sg = gart_unmap_sg, - .map_page = gart_map_page, - .unmap_page = gart_unmap_page, + .map_phys = gart_map_phys, + .unmap_phys = gart_unmap_phys, .alloc = gart_alloc_coherent, .free = gart_free_coherent, .mmap = dma_common_mmap, diff --git a/block/bio.c b/block/bio.c index 1904683f7ab05..c8fce0d6e3323 100644 --- a/block/bio.c +++ b/block/bio.c @@ -981,7 +981,7 @@ void __bio_add_page(struct bio *bio, struct page *page, WARN_ON_ONCE(bio_full(bio, len)); if (is_pci_p2pdma_page(page)) - bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE; + bio->bi_opf |= REQ_NOMERGE; bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, off); bio->bi_iter.bi_size += len; diff --git a/block/blk-integrity.c b/block/blk-integrity.c index 056b8948369d5..dd97b27366e0e 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -122,64 +122,6 @@ int blk_get_meta_cap(struct block_device *bdev, unsigned int cmd, NULL); } -/** - * blk_rq_map_integrity_sg - Map integrity metadata into a scatterlist - * @rq: request to map - * @sglist: target scatterlist - * - * Description: Map the integrity vectors in request into a - * scatterlist. The scatterlist must be big enough to hold all - * elements. I.e. sized using blk_rq_count_integrity_sg() or - * rq->nr_integrity_segments. 
- */ -int blk_rq_map_integrity_sg(struct request *rq, struct scatterlist *sglist) -{ - struct bio_vec iv, ivprv = { NULL }; - struct request_queue *q = rq->q; - struct scatterlist *sg = NULL; - struct bio *bio = rq->bio; - unsigned int segments = 0; - struct bvec_iter iter; - int prev = 0; - - bio_for_each_integrity_vec(iv, bio, iter) { - if (prev) { - if (!biovec_phys_mergeable(q, &ivprv, &iv)) - goto new_segment; - if (sg->length + iv.bv_len > queue_max_segment_size(q)) - goto new_segment; - - sg->length += iv.bv_len; - } else { -new_segment: - if (!sg) - sg = sglist; - else { - sg_unmark_end(sg); - sg = sg_next(sg); - } - - sg_set_page(sg, iv.bv_page, iv.bv_len, iv.bv_offset); - segments++; - } - - prev = 1; - ivprv = iv; - } - - if (sg) - sg_mark_end(sg); - - /* - * Something must have been wrong if the figured number of segment - * is bigger than number of req's physical integrity segments - */ - BUG_ON(segments > rq->nr_integrity_segments); - BUG_ON(segments > queue_max_integrity_segments(q)); - return segments; -} -EXPORT_SYMBOL(blk_rq_map_integrity_sg); - int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf, ssize_t bytes) { diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c index ad283017caef2..361dc91a90ff0 100644 --- a/block/blk-mq-dma.c +++ b/block/blk-mq-dma.c @@ -2,6 +2,7 @@ /* * Copyright (C) 2025 Christoph Hellwig */ +#include #include #include "blk.h" @@ -10,29 +11,38 @@ struct phys_vec { u32 len; }; -static bool blk_map_iter_next(struct request *req, struct req_iterator *iter, +static bool __blk_map_iter_next(struct blk_map_iter *iter) +{ + if (iter->iter.bi_size) + return true; + if (!iter->bio || !iter->bio->bi_next) + return false; + + iter->bio = iter->bio->bi_next; + if (iter->is_integrity) { + iter->iter = bio_integrity(iter->bio)->bip_iter; + iter->bvecs = bio_integrity(iter->bio)->bip_vec; + } else { + iter->iter = iter->bio->bi_iter; + iter->bvecs = iter->bio->bi_io_vec; + } + return true; +} + +static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter, struct phys_vec *vec) { unsigned int max_size; struct bio_vec bv; - if (req->rq_flags & RQF_SPECIAL_PAYLOAD) { - if (!iter->bio) - return false; - vec->paddr = bvec_phys(&req->special_vec); - vec->len = req->special_vec.bv_len; - iter->bio = NULL; - return true; - } - if (!iter->iter.bi_size) return false; - bv = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter); + bv = mp_bvec_iter_bvec(iter->bvecs, iter->iter); vec->paddr = bvec_phys(&bv); max_size = get_max_segment_size(&req->q->limits, vec->paddr, UINT_MAX); bv.bv_len = min(bv.bv_len, max_size); - bio_advance_iter_single(iter->bio, &iter->iter, bv.bv_len); + bvec_iter_advance_single(iter->bvecs, &iter->iter, bv.bv_len); /* * If we are entirely done with this bi_io_vec entry, check if the next @@ -42,20 +52,16 @@ static bool blk_map_iter_next(struct request *req, struct req_iterator *iter, while (!iter->iter.bi_size || !iter->iter.bi_bvec_done) { struct bio_vec next; - if (!iter->iter.bi_size) { - if (!iter->bio->bi_next) - break; - iter->bio = iter->bio->bi_next; - iter->iter = iter->bio->bi_iter; - } + if (!__blk_map_iter_next(iter)) + break; - next = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter); + next = mp_bvec_iter_bvec(iter->bvecs, iter->iter); if (bv.bv_len + next.bv_len > max_size || !biovec_phys_mergeable(req->q, &bv, &next)) break; bv.bv_len += next.bv_len; - bio_advance_iter_single(iter->bio, &iter->iter, next.bv_len); + bvec_iter_advance_single(iter->bvecs, &iter->iter, next.bv_len); } vec->len = 
bv.bv_len; @@ -79,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req, static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec) { - iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr); + iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr); iter->len = vec->len; return true; } @@ -87,8 +93,13 @@ static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec) static bool blk_dma_map_direct(struct request *req, struct device *dma_dev, struct blk_dma_iter *iter, struct phys_vec *vec) { - iter->addr = dma_map_page(dma_dev, phys_to_page(vec->paddr), - offset_in_page(vec->paddr), vec->len, rq_dma_dir(req)); + unsigned int attrs = 0; + + if (iter->p2pdma.map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) + attrs |= DMA_ATTR_MMIO; + + iter->addr = dma_map_phys(dma_dev, vec->paddr, vec->len, + rq_dma_dir(req), attrs); if (dma_mapping_error(dma_dev, iter->addr)) { iter->status = BLK_STS_RESOURCE; return false; @@ -103,14 +114,18 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev, { enum dma_data_direction dir = rq_dma_dir(req); unsigned int mapped = 0; + unsigned int attrs = 0; int error; iter->addr = state->addr; iter->len = dma_iova_size(state); + if (iter->p2pdma.map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) + attrs |= DMA_ATTR_MMIO; + do { error = dma_iova_link(dma_dev, state, vec->paddr, mapped, - vec->len, dir, 0); + vec->len, dir, attrs); if (error) break; mapped += vec->len; @@ -125,6 +140,69 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev, return true; } +static inline void blk_rq_map_iter_init(struct request *rq, + struct blk_map_iter *iter) +{ + struct bio *bio = rq->bio; + + if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) { + *iter = (struct blk_map_iter) { + .bvecs = &rq->special_vec, + .iter = { + .bi_size = rq->special_vec.bv_len, + } + }; + } else if (bio) { + *iter = (struct blk_map_iter) { + .bio = bio, + .bvecs = bio->bi_io_vec, + .iter = bio->bi_iter, + }; + } else { + /* the internal flush request may not have bio attached */ + *iter = (struct blk_map_iter) {}; + } +} + +static bool blk_dma_map_iter_start(struct request *req, struct device *dma_dev, + struct dma_iova_state *state, struct blk_dma_iter *iter, + unsigned int total_len) +{ + struct phys_vec vec; + + memset(&iter->p2pdma, 0, sizeof(iter->p2pdma)); + iter->status = BLK_STS_OK; + iter->p2pdma.map = PCI_P2PDMA_MAP_NONE; + + /* + * Grab the first segment ASAP because we'll need it to check for P2P + * transfers. + */ + if (!blk_map_iter_next(req, &iter->iter, &vec)) + return false; + + switch (pci_p2pdma_state(&iter->p2pdma, dma_dev, + phys_to_page(vec.paddr))) { + case PCI_P2PDMA_MAP_BUS_ADDR: + return blk_dma_map_bus(iter, &vec); + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + /* + * P2P transfers through the host bridge are treated the + * same as non-P2P transfers below and during unmap. 
+ */ + case PCI_P2PDMA_MAP_NONE: + break; + default: + iter->status = BLK_STS_INVAL; + return false; + } + + if (blk_can_dma_map_iova(req, dma_dev) && + dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len)) + return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec); + return blk_dma_map_direct(req, dma_dev, iter, &vec); +} + /** * blk_rq_dma_map_iter_start - map the first DMA segment for a request * @req: request to map @@ -150,43 +228,9 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev, bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev, struct dma_iova_state *state, struct blk_dma_iter *iter) { - unsigned int total_len = blk_rq_payload_bytes(req); - struct phys_vec vec; - - iter->iter.bio = req->bio; - iter->iter.iter = req->bio->bi_iter; - memset(&iter->p2pdma, 0, sizeof(iter->p2pdma)); - iter->status = BLK_STS_OK; - - /* - * Grab the first segment ASAP because we'll need it to check for P2P - * transfers. - */ - if (!blk_map_iter_next(req, &iter->iter, &vec)) - return false; - - if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) { - switch (pci_p2pdma_state(&iter->p2pdma, dma_dev, - phys_to_page(vec.paddr))) { - case PCI_P2PDMA_MAP_BUS_ADDR: - return blk_dma_map_bus(iter, &vec); - case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: - /* - * P2P transfers through the host bridge are treated the - * same as non-P2P transfers below and during unmap. - */ - req->cmd_flags &= ~REQ_P2PDMA; - break; - default: - iter->status = BLK_STS_INVAL; - return false; - } - } - - if (blk_can_dma_map_iova(req, dma_dev) && - dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len)) - return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec); - return blk_dma_map_direct(req, dma_dev, iter, &vec); + blk_rq_map_iter_init(req, &iter->iter); + return blk_dma_map_iter_start(req, dma_dev, state, iter, + blk_rq_payload_bytes(req)); } EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start); @@ -246,16 +290,11 @@ blk_next_sg(struct scatterlist **sg, struct scatterlist *sglist) int __blk_rq_map_sg(struct request *rq, struct scatterlist *sglist, struct scatterlist **last_sg) { - struct req_iterator iter = { - .bio = rq->bio, - }; + struct blk_map_iter iter; struct phys_vec vec; int nsegs = 0; - /* the internal flush request may not have bio attached */ - if (iter.bio) - iter.iter = iter.bio->bi_iter; - + blk_rq_map_iter_init(rq, &iter); while (blk_map_iter_next(rq, &iter, &vec)) { *last_sg = blk_next_sg(last_sg, sglist); sg_set_page(*last_sg, phys_to_page(vec.paddr), vec.len, @@ -275,3 +314,124 @@ int __blk_rq_map_sg(struct request *rq, struct scatterlist *sglist, return nsegs; } EXPORT_SYMBOL(__blk_rq_map_sg); + +#ifdef CONFIG_BLK_DEV_INTEGRITY +/** + * blk_rq_integrity_dma_map_iter_start - map the first integrity DMA segment + * for a request + * @req: request to map + * @dma_dev: device to map to + * @state: DMA IOVA state + * @iter: block layer DMA iterator + * + * Start DMA mapping @req integrity data to @dma_dev. @state and @iter are + * provided by the caller and don't need to be initialized. @state needs to be + * stored for use at unmap time, @iter is only needed at map time. + * + * Returns %false if there is no segment to map, including due to an error, or + * %true if it did map a segment. + * + * If a segment was mapped, the DMA address for it is returned in @iter.addr + * and the length in @iter.len. If no segment was mapped the status code is + * returned in @iter.status. 
+ * + * The caller can call blk_rq_dma_map_coalesce() to check if further segments + * need to be mapped after this, or go straight to blk_rq_dma_map_iter_next() + * to try to map the following segments. + */ +bool blk_rq_integrity_dma_map_iter_start(struct request *req, + struct device *dma_dev, struct dma_iova_state *state, + struct blk_dma_iter *iter) +{ + unsigned len = bio_integrity_bytes(&req->q->limits.integrity, + blk_rq_sectors(req)); + struct bio *bio = req->bio; + + iter->iter = (struct blk_map_iter) { + .bio = bio, + .iter = bio_integrity(bio)->bip_iter, + .bvecs = bio_integrity(bio)->bip_vec, + .is_integrity = true, + }; + return blk_dma_map_iter_start(req, dma_dev, state, iter, len); +} +EXPORT_SYMBOL_GPL(blk_rq_integrity_dma_map_iter_start); + +/** + * blk_rq_integrity_dma_map_iter_next - map the next integrity DMA segment for + * a request + * @req: request to map + * @dma_dev: device to map to + * @state: DMA IOVA state + * @iter: block layer DMA iterator + * + * Iterate to the next integrity mapping after a previous call to + * blk_rq_integrity_dma_map_iter_start(). See there for a detailed description + * of the arguments. + * + * Returns %false if there is no segment to map, including due to an error, or + * %true if it did map a segment. + * + * If a segment was mapped, the DMA address for it is returned in @iter.addr and + * the length in @iter.len. If no segment was mapped the status code is + * returned in @iter.status. + */ +bool blk_rq_integrity_dma_map_iter_next(struct request *req, + struct device *dma_dev, struct blk_dma_iter *iter) +{ + struct phys_vec vec; + + if (!blk_map_iter_next(req, &iter->iter, &vec)) + return false; + + if (iter->p2pdma.map == PCI_P2PDMA_MAP_BUS_ADDR) + return blk_dma_map_bus(iter, &vec); + return blk_dma_map_direct(req, dma_dev, iter, &vec); +} +EXPORT_SYMBOL_GPL(blk_rq_integrity_dma_map_iter_next); + +/** + * blk_rq_map_integrity_sg - Map integrity metadata into a scatterlist + * @rq: request to map + * @sglist: target scatterlist + * + * Description: Map the integrity vectors in request into a + * scatterlist. The scatterlist must be big enough to hold all + * elements. I.e. sized using blk_rq_count_integrity_sg() or + * rq->nr_integrity_segments. 
+ */ +int blk_rq_map_integrity_sg(struct request *rq, struct scatterlist *sglist) +{ + struct request_queue *q = rq->q; + struct scatterlist *sg = NULL; + struct bio *bio = rq->bio; + unsigned int segments = 0; + struct phys_vec vec; + + struct blk_map_iter iter = { + .bio = bio, + .iter = bio_integrity(bio)->bip_iter, + .bvecs = bio_integrity(bio)->bip_vec, + .is_integrity = true, + }; + + while (blk_map_iter_next(rq, &iter, &vec)) { + sg = blk_next_sg(&sg, sglist); + sg_set_page(sg, phys_to_page(vec.paddr), vec.len, + offset_in_page(vec.paddr)); + segments++; + } + + if (sg) + sg_mark_end(sg); + + /* + * Something must have been wrong if the figured number of segment + * is bigger than number of req's physical integrity segments + */ + BUG_ON(segments > rq->nr_integrity_segments); + BUG_ON(segments > queue_max_integrity_segments(q)); + return segments; +} +EXPORT_SYMBOL(blk_rq_map_integrity_sg); +#endif diff --git a/debian.nvidia-6.17/changelog b/debian.nvidia-6.17/changelog index 0e7f10d1594ec..a1e06588c3972 100644 --- a/debian.nvidia-6.17/changelog +++ b/debian.nvidia-6.17/changelog @@ -1,3 +1,345 @@ +linux-nvidia-6.17 (6.17.0-1010.10) noble; urgency=medium + + * noble/linux-nvidia-6.17: 6.17.0-1010.10 -proposed tracker (LP: #2141777) + + * Enable Coresight in Perf (LP: #2093957) + - [Packaging] Enable coresight in Perf if arm64 + - [Packaging] Add libopencsd-dev as a build dependency + + * Packaging resync (LP: #1786013) + - [Packaging] debian.nvidia-6.17/dkms-versions -- update from kernel- + versions (main/d2026.02.09) + + * Add ARM CCA host support (LP: #2139249) + - NVIDIA: VR: SAUCE: kvm: arm64: Include kvm_emulate.h in kvm/arm_psci.h + - NVIDIA: VR: SAUCE: arm64: RME: Handle Granule Protection Faults (GPFs) + - NVIDIA: VR: SAUCE: arm64: RME: Add SMC definitions for calling the RMM + - NVIDIA: VR: SAUCE: arm64: RME: Add wrappers for RMI calls + - NVIDIA: VR: SAUCE: arm64: RME: Check for RME support at KVM init + - NVIDIA: VR: SAUCE: arm64: RME: Define the user ABI + - NVIDIA: VR: SAUCE: arm64: RME: ioctls to create and configure realms + - NVIDIA: VR: SAUCE: kvm: arm64: Don't expose debug capabilities for realm + guests + - NVIDIA: VR: SAUCE: KVM: arm64: Allow passing machine type in KVM + creation + - NVIDIA: VR: SAUCE: arm64: RME: RTT tear down + - NVIDIA: VR: SAUCE: arm64: RME: Allocate/free RECs to match vCPUs + - NVIDIA: VR: SAUCE: KVM: arm64: vgic: Provide helper for number of list + registers + - NVIDIA: VR: SAUCE: arm64: RME: Support for the VGIC in realms + - NVIDIA: VR: SAUCE: KVM: arm64: Support timers in realm RECs + - NVIDIA: VR: SAUCE: arm64: RME: Allow VMM to set RIPAS + - NVIDIA: VR: SAUCE: arm64: RME: Handle realm enter/exit + - NVIDIA: VR: SAUCE: arm64: RME: Handle RMI_EXIT_RIPAS_CHANGE + - NVIDIA: VR: SAUCE: KVM: arm64: Handle realm MMIO emulation + - NVIDIA: VR: SAUCE: arm64: RME: Allow populating initial contents + - NVIDIA: VR: SAUCE: arm64: RME: Runtime faulting of memory + - NVIDIA: VR: SAUCE: KVM: arm64: Handle realm VCPU load + - NVIDIA: VR: SAUCE: KVM: arm64: Validate register access for a Realm VM + - NVIDIA: VR: SAUCE: KVM: arm64: Handle Realm PSCI requests + - NVIDIA: VR: SAUCE: KVM: arm64: WARN on injected undef exceptions + - NVIDIA: VR: SAUCE: arm64: Don't expose stolen time for realm guests + - NVIDIA: VR: SAUCE: arm64: RME: allow userspace to inject aborts + - NVIDIA: VR: SAUCE: arm64: RME: support RSI_HOST_CALL + - NVIDIA: VR: SAUCE: arm64: RME: Allow checking SVE on VM instance + - NVIDIA: VR: SAUCE: arm64: RME: Always use 4k pages for realms 
+ - NVIDIA: VR: SAUCE: arm64: RME: Prevent Device mappings for Realms + - NVIDIA: VR: SAUCE: arm_pmu: Provide a mechanism for disabling the + physical IRQ + - NVIDIA: VR: SAUCE: arm64: RME: Enable PMU support with a realm guest + - NVIDIA: VR: SAUCE: arm64: RME: Hide KVM_CAP_READONLY_MEM for realm + guests + - NVIDIA: VR: SAUCE: arm64: RME: Propagate number of breakpoints and + watchpoints to userspace + - NVIDIA: VR: SAUCE: arm64: RME: Set breakpoint parameters through + SET_ONE_REG + - NVIDIA: VR: SAUCE: arm64: RME: Initialize PMCR.N with number counter + supported by RMM + - NVIDIA: VR: SAUCE: arm64: RME: Propagate max SVE vector length from RMM + - NVIDIA: VR: SAUCE: arm64: RME: Configure max SVE vector length for a + Realm + - NVIDIA: VR: SAUCE: arm64: RME: Provide register list for unfinalized RME + RECs + - NVIDIA: VR: SAUCE: arm64: RME: Provide accurate register list + - NVIDIA: VR: SAUCE: KVM: arm64: Expose support for private memory + - NVIDIA: VR: SAUCE: KVM: arm64: Expose KVM_ARM_VCPU_REC to user space + - NVIDIA: VR: SAUCE: KVM: arm64: Allow activating realms + - NVIDIA: VR: SAUCE: arm64: RME: Add MECID support + - NVIDIA: VR: SAUCE: arm64: RME: Add bounds check + - NVIDIA: VR: SAUCE: KVM: arm64: Expose KVM_CAP_ARM_RME via module + parameter + - arm64: realm: ioremap: Allow mapping memory as encrypted + - arm64: Enable EFI secret area Securityfs support + - NVIDIA: VR: SAUCE: [Config] Update annotations for ARM CCA + + * kexec: warning when doing kexec -a (LP: #2141705) + - kernel/kexec: change the prototype of kimage_map_segment() + - kernel/kexec: fix IMA when allocation happens in CMA area + + * backport CPPC fixes for "nosmt"/"nosmt=force" scenarios (LP: #2141613) + - cpufreq/amd-pstate: Call cppc_set_auto_sel() only for online CPUs + - NVIDIA: VR: SAUCE: ACPI: CPPC: Fix remaining for_each_possible_cpu() to + use online CPUs + + * Backport DMA_ATTR_MMIO and VFIO/PCI to export MMIO region as DMA-Buf + series from upstream (LP: #2139370) + - dma-mapping: introduce new DMA attribute to indicate MMIO memory + - iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). 
+ - dma-debug: refactor to use physical addresses for page mapping + - dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys + - iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys + - iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys() + - dma-mapping: convert dma_direct_*map_page to be phys_addr_t based + - kmsan: convert kmsan_handle_dma to use physical addresses + - dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs() + - xen: swiotlb: Open code map_resource callback + - dma-mapping: export new dma_*map_phys() interface + - mm/hmm: migrate to physical address-based DMA mapping API + - mm/hmm: properly take MMIO path + - Revert "NVIDIA: SAUCE: Patch NVMe/NVMeoF driver to support GDS on Linux + 6.17 Kernel" + - blk-mq-dma: create blk_map_iter type + - blk-mq-dma: provide the bio_vec array being iterated + - blk-mq-dma: require unmap caller provide p2p map type + - blk-mq: remove REQ_P2PDMA flag + - blk-mq-dma: move common dma start code to a helper + - blk-mq-dma: add scatter-less integrity data DMA mapping + - blk-integrity: use iterator for mapping sg + - nvme-pci: create common sgl unmapping helper + - nvme-pci: convert metadata mapping to dma iter + - blk-mq-dma: bring back p2p request flags + - nvme-pci: migrate to dma_map_phys instead of map_page + - block-dma: properly take MMIO path + - virtio_balloon: Remove redundant __GFP_NOWARN + - virtio_ring: constify virtqueue pointer for DMA helpers + - virtio_ring: switch to use dma_{map|unmap}_page() + - virtio: rename dma helpers + - virtio: introduce virtio_map container union + - virtio_ring: rename dma_handle to map_handle + - virtio: introduce map ops in virtio core + - vdpa: support virtio_map + - vdpa: introduce map ops + - vduse: switch to use virtio map API instead of DMA API + - vduse: Use fixed 4KB bounce pages for non-4KB page size + - virtio-vdpa: Drop redundant conversion to bool + - dma-mapping: prepare dma_map_ops to conversion to physical address + - dma-mapping: convert dummy ops to physical address mapping + - ARM: dma-mapping: Reduce struct page exposure in arch_sync_dma*() + - ARM: dma-mapping: Switch to physical address mapping callbacks + - xen: swiotlb: Switch to physical address mapping callbacks + - dma-mapping: remove unused mapping resource callbacks + - alpha: Convert mapping routine to rely on physical address + - MIPS/jazzdma: Provide physical address directly + - parisc: Convert DMA map_page to map_phys interface + - powerpc: Convert to physical address DMA mapping + - sparc: Use physical address DMA mapping + - x86: Use physical address for DMA mapping + - xen: swiotlb: Convert mapping routine to rely on physical address + - dma-mapping: remove unused map_page callback + - PCI/P2PDMA: Separate the mmap() support from the core logic + - PCI/P2PDMA: Simplify bus address mapping API + - PCI/P2PDMA: Refactor to separate core P2P functionality from memory + allocation + - PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function + - PCI/P2PDMA: Document DMABUF model + - dma-buf: provide phys_vec to scatter-gather mapping routine + - vfio: Export vfio device get and put registration helpers + - vfio/pci: Share the core device pointer while invoking feature functions + - vfio/pci: Enable peer-to-peer DMA transactions by default + - vfio/pci: Add dma-buf export support for MMIO regions + - vfio/nvgrace: Support get_dmabuf_phys + - iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink() + - kmsan: fix missed kmsan_handle_dma() signature conversion + 
- kmsan: fix kmsan_handle_dma() to avoid false positives + - nvme-pci: DMA unmap the correct regions in nvme_free_sgls + - parisc: Set valid bit in high byte of 64‑bit physical address + - dma-buf: fix integer overflow in fill_sg_entry() for buffers >= 8GiB + - vfio: Prevent from pinned DMABUF importers to attach to VFIO DMABUF + - NVIDIA: SAUCE: [Config] Add CONFIG_VFIO_PCI_DMABUF to annotations + + * Update GDS/NVMe SAUCE for v6.17 (LP: #2134960) // Backport DMA_ATTR_MMIO + and VFIO/PCI to export MMIO region as DMA-Buf series from upstream + (LP: #2139370) + - NVIDIA: SAUCE: Patch NVMe/NVMeoF driver to support GDS on Linux 6.17 + Kernel + + * Atlantic fix fragment overflow handling in RX path (LP: #2140997) + - net: atlantic: fix fragment overflow handling in RX path + + * VR: Add Live Firmware Activation (LFA) Support (LP: #2138342) + - NVIDIA: VR: SAUCE: firmware: smccc: add support for Live Firmware + Activation (LFA) + - NVIDIA: VR: SAUCE: firmware: smccc: add timeout, touch wdt + - NVIDIA: VR: SAUCE: firmware: smccc: register as platform driver + - NVIDIA: VR: SAUCE: [Config] Enable ARM LFA support + + * Fix speed PCIe-to-PCI bridge alisaing issue (LP: #2136828) + - NVIDIA: SAUCE: PCI: Add ASPEED vendor ID to pci_ids.h + - NVIDIA: SAUCE: PCI: Add PCI_BRIDGE_NO_ALIASES quirk for ASPEED AST1150 + + * Backport nvgrace-gpu hugepfnmap, ecc patches and miscellaneous cleanups + (LP: #2138892) + - Revert "NVIDIA: SAUCE: vfio/nvgrace-egm: Prevent double-unregister of + pfn_address_space" + - Revert "NVIDIA: SAUCE: vfio/nvgrace-gpu: Avoid resmem pfn + unregistration" + - Revert "NVIDIA: SAUCE: KVM: arm64: Allow exec fault on memory mapped + cacheable in VMA" + - Revert "NVIDIA: SAUCE: arm64: configs: Replace VFIO_CONTAINER with + IOMMUFD_VFIO_CONTAINER" + - Revert "NVIDIA: SAUCE: WAR: Expose PCI PASID capability to userspace" + - Revert "NVIDIA: SAUCE: vfio/nvgrace-egm: Register EGM for runtime ECC + poison errors handling" + - Revert "NVIDIA: SAUCE: vfio/nvgrace-gpu: register device memory for + poison handling" + - Revert "NVIDIA: SAUCE: mm: Change ghes code to allow poison of non- + struct pfn" + - Revert "NVIDIA: SAUCE: mm: Add poison error check in fixup_user_fault() + for mapped pfn" + - Revert "NVIDIA: SAUCE: mm: correctly identify pfn without struct pages" + - Revert "NVIDIA: SAUCE: mm: handle poisoning of pfn without struct pages" + - mm: change ghes code to allow poison of non-struct pfn + - mm: handle poisoning of pfn without struct pages + - KVM: arm64: VM exit to userspace to handle SEA + - KVM: selftests: Test for KVM_EXIT_ARM_SEA + - Documentation: kvm: new UAPI for handling SEA + - vfio: refactor vfio_pci_mmap_huge_fault function + - vfio/nvgrace-gpu: Add support for huge pfnmap + - vfio: use vfio_pci_core_setup_barmap to map bar in mmap + - vfio/nvgrace-gpu: split the code to wait for GPU ready + - vfio/nvgrace-gpu: Inform devmem unmapped after reset + - vfio/nvgrace-gpu: wait for the GPU mem to be ready + - mm: fixup pfnmap memory failure handling to use pgoff + - mm: add stubs for PFNMAP memory failure registration functions + - vfio/nvgrace-gpu: register device memory for poison handling + - NVIDIA: SAUCE: vfio/nvgrace-egm: register EGM PFNMAP range with + memory_failure + - NVIDIA: SAUCE: vfio: Remove vfio_device_from_file() declaration + + * Backport perf: Fix 0 count issue of cpu-clock (LP: #2139648) + - perf: Fix 0 count issue of cpu-clock + + * tegra-qspi: Fix race condition causing NULL pointer dereference + (LP: #2139640) + - spi: tegra210-quad: Return IRQ_HANDLED 
when timeout already processed + transfer + - spi: tegra210-quad: Move curr_xfer read inside spinlock + - spi: tegra210-quad: Protect curr_xfer assignment in + tegra_qspi_setup_transfer_one + - spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer + - spi: tegra210-quad: Protect curr_xfer clearing in + tegra_qspi_non_combined_seq_xfer + - spi: tegra210-quad: Protect curr_xfer check in IRQ handler + + * Backport device passthrough virtualization fixes from v6.19 (LP: #2140343) + - iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases + - mm/hugetlb: fix incorrect error return from hugetlb_reserve_pages() + - iommu/tegra241-cmdqv: Reset VCMDQ in tegra241_vcmdq_hw_init_user() + + * apparmor warnings while building (LP: #2138131) + - SAUCE: security/apparmor: Fix AA_DEBUG_PROFILE define + + * Backport support for T410 PMU (LP: #2139315) + - Revert "NVIDIA: SAUCE: arm64: cputype: Add NVIDIA Olympus definitions" + - arm64: cputype: Add NVIDIA Olympus definitions + - tools headers arm64: Add NVIDIA Olympus part + - perf arm-spe: Add NVIDIA Olympus to neoverse list + - NVIDIA: VR: SAUCE: perf/arm_cspmu: nvidia: Rename doc to Tegra241 + - NVIDIA: VR: SAUCE: perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU + - NVIDIA: VR: SAUCE: perf/arm_cspmu: Add arm_cspmu_acpi_dev_get + - NVIDIA: VR: SAUCE: perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU + - NVIDIA: VR: SAUCE: perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU + - NVIDIA: VR: SAUCE: perf: add NVIDIA Tegra410 CPU Memory Latency PMU + - NVIDIA: VR: SAUCE: perf: add NVIDIA Tegra410 C2C PMU + - NVIDIA: VR: SAUCE: arm64: defconfig: Enable NVIDIA TEGRA410 PMU + - NVIDIA: VR: SAUCE: perf vendor events arm64: Add Tegra410 Olympus PMU + events + - NVIDIA: VR: SAUCE: [Config] nvidia-6.17 enable TEGRA410_C2C_PMU and + TEGRA410_CMEM_LATENCY_PMU + + * Backport support for FEAT_LS64 (LP: #2139248) + - KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots + - KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B + - KVM: arm64: Handle DABT caused by LS64* instructions on unsupported + memory + - arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1 + - KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest + - arm64: Add support for FEAT_{LS64, LS64_V} + - kselftest/arm64: Add HWCAP test for FEAT_LS64 + + * Backport patches to enable ATS to remain always-on for CXL.cache devices + and specific NVIDIA GPUs by adding pci_ats_always_on() API and SMMU driver + support. 
(LP: #2139088) + - NVIDIA: VR: SAUCE: PCI: Allow ATS to be always on for CXL.cache capable + devices + - NVIDIA: VR: SAUCE: PCI: Allow ATS to be always on for non-CXL NVIDIA + GPUs + - NVIDIA: VR: SAUCE: iommu/arm-smmu-v3: Allow ATS to be always on + + * backport "soc/tegra: pmc: Add PMC support for Tegra410" (LP: #2139082) + - NVIDIA: VR: SAUCE: soc/tegra: pmc: Add PMC support for Tegra410 + + * Backport NVIDIA: VR: SAUCE: arm64: Add workaround to convert MT_NORMAL_NC + to Device-nGnRE (LP: #2138952) + - NVIDIA: VR: SAUCE: arm64: Add workaround to convert MT_NORMAL_NC to + Device-nGnRE + - NVIDIA: VR: SAUCE: [Config] Enable arm64 NC-to-NGNRE workaround + + * missing prototype for vfio_device_from_file() (LP: #2138132) + - NVIDIA: SAUCE: vfio: Fix missing prototype warning + + * Backport soc/tegra: fuse: Do not register SoC device on ACPI boot + (LP: #2138239) + - soc/tegra: fuse: Do not register SoC device on ACPI boot + + * backport gpio patches for Tegra 256, tegra 186, tegra410 (LP: #2137739) + - Revert "gpio: tegra186: Add support for Tegra410" + - Revert "gpio: tegra186: Use generic macro for port definitions" + - dt-bindings: gpio: Add Tegra256 support + - gpio: tegra186: Add support for Tegra256 + - gpio: tegra186: Use generic macro for port definitions + - gpio: tegra186: Add support for Tegra410 + - gpio: tegra186: Fix GPIO name collisions for Tegra410 + + * r8127: Downgrade GPL claim to info (LP: #2137588) + - NVIDIA: SAUCE: r8127: print GPL_CLAIM with KERN_INFO + + * Backport i2c patches for Tegra256, Tegra264, and Tegra410 (LP: #2138238) + - i2c: tegra: Add Tegra256 support + - NVIDIA: VR: SAUCE: i2c: tegra: Do not configure DMA if not supported + - NVIDIA: VR: SAUCE: i2c: tegra: Use separate variables for fast and + fastplus + - NVIDIA: VR: SAUCE: i2c: tegra: Update Tegra256 timing parameters + - NVIDIA: VR: SAUCE: i2c: tegra: Add HS mode support + - NVIDIA: VR: SAUCE: i2c: tegra: Add support for SW mutex register + - NVIDIA: VR: SAUCE: i2c: tegra: Add Tegra264 support + - NVIDIA: VR: SAUCE: i2c: tegra: Introduce tegra_i2c_variant to identify + DVC and VI + - NVIDIA: VR: SAUCE: i2c: tegra: Move variant to tegra_i2c_hw_feature + - NVIDIA: VR: SAUCE: i2c: tegra: Add logic to support different register + offsets + - NVIDIA: VR: SAUCE: i2c: tegra: Add support for Tegra410 + + * Add kernel patches for CXL type 3 device support (LP: #2138266) + - NVIDIA: VR: SAUCE: cxl: add support for cxl reset + - NVIDIA: VR: SAUCE: cxl_test: enable zero sized decoders under hb0 + - NVIDIA: VR: SAUCE: cxl: Allow zero sized HDM decoders + - NVIDIA: VR: SAUCE: cxl/hdm: Fix infinite loop in DPA partition discovery + - NVIDIA: VR: SAUCE: cxl/region: Validate partition index before array + access + - NVIDIA: VR: SAUCE: [Config] Add a CXL config for CXL type 3 devices + + * Backport arch_topology stub fix to prevent build failure on ARM without + CONFIG_GENERIC_ARCH_TOPOLOGY (LP: #2138375) + - arch_topology: Provide a stub topology_core_has_smt() for + !CONFIG_GENERIC_ARCH_TOPOLOGY + + * [linux-nvidia-6.17] Backport NVIDIA: VR: SAUCE: soc/tegra: misc: Use SMCCC + to get chipid (LP: #2138329) + - NVIDIA: VR: SAUCE: soc/tegra: misc: Use SMCCC to get chipid + + -- Jacob Martin Sat, 14 Feb 2026 09:29:08 -0600 + linux-nvidia-6.17 (6.17.0-1008.8) noble; urgency=medium * noble/linux-nvidia-6.17: 6.17.0-1008.8 -proposed tracker (LP: #2138765) diff --git a/debian.nvidia-6.17/config/annotations b/debian.nvidia-6.17/config/annotations index 7c1af0e339f8b..f297fa1b11a9f 100644 --- 
a/debian.nvidia-6.17/config/annotations +++ b/debian.nvidia-6.17/config/annotations @@ -117,6 +117,9 @@ CONFIG_DRM_NOUVEAU_SVM note<'Disable nouveau for NVIDIA CONFIG_EFI_CAPSULE_LOADER policy<{'amd64': 'm', 'arm64': 'y'}> CONFIG_EFI_CAPSULE_LOADER note<'LP: #2067111'> +CONFIG_EFI_SECRET policy<{'amd64': 'm', 'arm64': 'm'}> +CONFIG_EFI_SECRET note<'Required for confidential computing guest secrets'> + CONFIG_ETM4X_IMPDEF_FEATURE policy<{'arm64': 'n'}> CONFIG_ETM4X_IMPDEF_FEATURE note<'Required for Grace enablement'> @@ -236,6 +239,9 @@ CONFIG_EC_HUAWEI_GAOKUN policy<{'arm64': 'n'}> CONFIG_GCC_VERSION policy<{'amd64': '130300', 'arm64': '130300'}> CONFIG_HAVE_RUST policy<{'amd64': 'y', 'arm64': '-'}> CONFIG_IOMMUFD_VFIO_CONTAINER policy<{'arm64': 'y'}> +CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES policy<{'amd64': 'y', 'arm64': 'y'}> +CONFIG_KVM_GENERIC_PRIVATE_MEM policy<{'amd64': 'y', 'arm64': 'y'}> +CONFIG_KVM_PRIVATE_MEM policy<{'amd64': 'y', 'arm64': 'y'}> CONFIG_LD_VERSION policy<{'amd64': '24200', 'arm64': '24200'}> CONFIG_MTD_NAND_CORE policy<{'amd64': 'm', 'arm64': 'y'}> CONFIG_NVGRACE_EGM policy<{'arm64': 'm'}> @@ -267,3 +273,4 @@ CONFIG_TOOLS_SUPPORT_RELR policy<{'amd64': 'y', 'arm64': ' CONFIG_UCSI_HUAWEI_GAOKUN policy<{'arm64': '-'}> CONFIG_VFIO_CONTAINER policy<{'amd64': 'y', 'arm64': 'n'}> CONFIG_VFIO_IOMMU_TYPE1 policy<{'amd64': 'm', 'arm64': '-'}> +CONFIG_VFIO_PCI_DMABUF policy<{'amd64': 'y', 'arm64': 'y'}> diff --git a/debian.nvidia-6.17/control.stub.in b/debian.nvidia-6.17/control.stub.in index 808f7dca049c3..0876936d75aae 100644 --- a/debian.nvidia-6.17/control.stub.in +++ b/debian.nvidia-6.17/control.stub.in @@ -52,8 +52,9 @@ Build-Depends: uuid-dev , zstd , bpftool:native [amd64 arm64] , - nvidia-dkms-kernel [amd64 arm64] , - nvidia-kernel-source [amd64 arm64] , + nvidia-dkms-580-open [amd64 arm64] , + nvidia-kernel-source-580-open [amd64 arm64] , + libopencsd-dev [arm64] , Build-Depends-Indep: asciidoc , bzip2 , diff --git a/debian.nvidia-6.17/dkms-versions b/debian.nvidia-6.17/dkms-versions index 131617ee2b513..0b7ba7c1d1d99 100644 --- a/debian.nvidia-6.17/dkms-versions +++ b/debian.nvidia-6.17/dkms-versions @@ -1,4 +1,4 @@ zfs-linux 2.3.4-1ubuntu2 modulename=zfs debpath=pool/universe/z/%package%/zfs-dkms_%version%_all.deb arch=amd64 arch=arm64 arch=ppc64el arch=s390x arch=riscv64 rprovides=spl-modules rprovides=spl-dkms rprovides=zfs-modules rprovides=zfs-dkms -v4l2loopback 0.15.0-0ubuntu2 modulename=v4l2loopback debpath=pool/universe/v/%package%/v4l2loopback-dkms_%version%_all.deb arch=amd64 rprovides=v4l2loopback-modules rprovides=v4l2loopback-dkms +v4l2loopback 0.15.0-0ubuntu2.1 modulename=v4l2loopback debpath=pool/universe/v/%package%/v4l2loopback-dkms_%version%_all.deb arch=amd64 rprovides=v4l2loopback-modules rprovides=v4l2loopback-dkms mstflint 4.26.0-1 modulename=mstflint_access debpath=pool/universe/m/%package%/mstflint-dkms_%version%_all.deb arch=amd64 arch=arm64 rprovides=mstflint-modules rprovides=mstflint-dkms nvidia-fs 2.28.0-1 modulename=nvidia-fs debpath=pool/universe/n/%package%/nvidia-fs-dkms_%version%_amd64.deb arch=amd64 arch=arm64 rprovides=nvidia-fs-modules rprovides=nvidia-fs-dkms type=standalone diff --git a/debian.nvidia-6.17/reconstruct b/debian.nvidia-6.17/reconstruct index 8c46e786cef15..4baa5a320c6df 100644 --- a/debian.nvidia-6.17/reconstruct +++ b/debian.nvidia-6.17/reconstruct @@ -44,4 +44,5 @@ chmod +x 'drivers/net/ethernet/realtek/r8127/rtl_eeprom.h' chmod +x 'drivers/net/ethernet/realtek/r8127/rtltool.c' chmod +x 
'drivers/net/ethernet/realtek/r8127/rtltool.h' # Remove any files deleted from the orig. +rm -f 'Documentation/admin-guide/perf/nvidia-pmu.rst' exit 0 diff --git a/debian.nvidia-6.17/tracking-bug b/debian.nvidia-6.17/tracking-bug index 6400076c36ec6..2dac1580a8326 100644 --- a/debian.nvidia-6.17/tracking-bug +++ b/debian.nvidia-6.17/tracking-bug @@ -1 +1 @@ -2138765 d2025.12.18-2 +2141777 d2026.02.09-1 diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c index b6b1bf0bdd212..01f80f4dcbe66 100644 --- a/drivers/acpi/cppc_acpi.c +++ b/drivers/acpi/cppc_acpi.c @@ -362,7 +362,7 @@ static int send_pcc_cmd(int pcc_ss_id, u16 cmd) end: if (cmd == CMD_WRITE) { if (unlikely(ret)) { - for_each_possible_cpu(i) { + for_each_online_cpu(i) { struct cpc_desc *desc = per_cpu(cpc_desc_ptr, i); if (!desc) @@ -524,7 +524,7 @@ int acpi_get_psd_map(unsigned int cpu, struct cppc_cpudata *cpu_data) else if (pdomain->coord_type == DOMAIN_COORD_TYPE_SW_ANY) cpu_data->shared_type = CPUFREQ_SHARED_TYPE_ANY; - for_each_possible_cpu(i) { + for_each_online_cpu(i) { if (i == cpu) continue; diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c index e4f1933dd7d47..7be26007f1d8e 100644 --- a/drivers/cpufreq/amd-pstate.c +++ b/drivers/cpufreq/amd-pstate.c @@ -1282,7 +1282,7 @@ static int amd_pstate_change_mode_without_dvr_change(int mode) if (cpu_feature_enabled(X86_FEATURE_CPPC) || cppc_state == AMD_PSTATE_ACTIVE) return 0; - for_each_present_cpu(cpu) { + for_each_online_cpu(cpu) { cppc_set_auto_sel(cpu, (cppc_state == AMD_PSTATE_PASSIVE) ? 0 : 1); } diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile index 70ec901edf2c5..2008fb7481b35 100644 --- a/drivers/dma-buf/Makefile +++ b/drivers/dma-buf/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \ - dma-fence-unwrap.o dma-resv.o + dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o obj-$(CONFIG_DMABUF_HEAPS) += heaps/ obj-$(CONFIG_SYNC_FILE) += sync_file.o diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c new file mode 100644 index 0000000000000..b7352e609fbdf --- /dev/null +++ b/drivers/dma-buf/dma-buf-mapping.c @@ -0,0 +1,248 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * DMA BUF Mapping Helpers + * + */ +#include +#include + +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, + dma_addr_t addr) +{ + unsigned int len, nents; + int i; + + nents = DIV_ROUND_UP(length, UINT_MAX); + for (i = 0; i < nents; i++) { + len = min_t(size_t, length, UINT_MAX); + length -= len; + /* + * DMABUF abuses scatterlist to create a scatterlist + * that does not have any CPU list, only the DMA list. + * Always set the page related values to NULL to ensure + * importers can't use it. The phys_addr based DMA API + * does not require the CPU list for mapping or unmapping. 
+		 */
+		sg_set_page(sgl, NULL, 0, 0);
+		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+		sg_dma_len(sgl) = len;
+		sgl = sg_next(sgl);
+	}
+
+	return sgl;
+}
+
+static unsigned int calc_sg_nents(struct dma_iova_state *state,
+				  struct dma_buf_phys_vec *phys_vec,
+				  size_t nr_ranges, size_t size)
+{
+	unsigned int nents = 0;
+	size_t i;
+
+	if (!state || !dma_use_iova(state)) {
+		for (i = 0; i < nr_ranges; i++)
+			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+	} else {
+		/*
+		 * In the IOVA case there is only one SG entry which spans
+		 * the whole IOVA address space, but we need to make sure
+		 * it fits in sg->length, so more entries may be needed.
+		 */
+		nents = DIV_ROUND_UP(size, UINT_MAX);
+	}
+
+	return nents;
+}
+
+/**
+ * struct dma_buf_dma - holds DMA mapping information
+ * @sgt: Scatter-gather table
+ * @state: DMA IOVA state relevant in IOMMU-based DMA
+ * @size: Total size of DMA transfer
+ */
+struct dma_buf_dma {
+	struct sg_table sgt;
+	struct dma_iova_state *state;
+	size_t size;
+};
+
+/**
+ * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment
+ * from arrays of physical vectors. This function is intended for MMIO memory
+ * only.
+ * @attach: [in] attachment whose scatterlist is to be returned
+ * @provider: [in] p2pdma provider
+ * @phys_vec: [in] array of physical vectors
+ * @nr_ranges: [in] number of entries in phys_vec array
+ * @size: [in] total size of phys_vec
+ * @dir: [in] direction of DMA transfer
+ *
+ * Returns an sg_table containing the scatterlist for the mapping; returns
+ * ERR_PTR on error. May return -EINTR if it is interrupted by a signal.
+ *
+ * On success, the DMA addresses and lengths in the returned scatterlist are
+ * PAGE_SIZE aligned.
+ *
+ * A mapping must be unmapped by using dma_buf_free_sgt().
+ *
+ * NOTE: This function is intended for exporters. If direct traffic routing is
+ * mandatory, the exporter should call pci_p2pdma_map_type() before calling
+ * this function.
+ */
+struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
+					 struct p2pdma_provider *provider,
+					 struct dma_buf_phys_vec *phys_vec,
+					 size_t nr_ranges, size_t size,
+					 enum dma_data_direction dir)
+{
+	unsigned int nents, mapped_len = 0;
+	struct dma_buf_dma *dma;
+	struct scatterlist *sgl;
+	dma_addr_t addr;
+	size_t i;
+	int ret;
+
+	if (WARN_ON(!attach || !attach->dmabuf || !provider))
+		/* This function is supposed to work on MMIO memory only */
+		return ERR_PTR(-EINVAL);
+
+	dma_resv_assert_held(attach->dmabuf->resv);
+
+	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+	if (!dma)
+		return ERR_PTR(-ENOMEM);
+
+	switch (pci_p2pdma_map_type(provider, attach->dev)) {
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		/*
+		 * There is no need for an IOVA at all in this flow.
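+		 * The phys_vec entries are translated directly with
+		 * pci_p2pdma_bus_addr_map() in the loop below.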
+		 */
+		break;
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
+		if (!dma->state) {
+			ret = -ENOMEM;
+			goto err_free_dma;
+		}
+
+		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
+		break;
+	default:
+		ret = -EINVAL;
+		goto err_free_dma;
+	}
+
+	nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
+	ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
+	if (ret)
+		goto err_free_state;
+
+	sgl = dma->sgt.sgl;
+
+	for (i = 0; i < nr_ranges; i++) {
+		if (!dma->state) {
+			addr = pci_p2pdma_bus_addr_map(provider,
+						       phys_vec[i].paddr);
+		} else if (dma_use_iova(dma->state)) {
+			ret = dma_iova_link(attach->dev, dma->state,
+					    phys_vec[i].paddr, 0,
+					    phys_vec[i].len, dir,
+					    DMA_ATTR_MMIO);
+			if (ret)
+				goto err_unmap_dma;
+
+			mapped_len += phys_vec[i].len;
+		} else {
+			addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
+					    phys_vec[i].len, dir,
+					    DMA_ATTR_MMIO);
+			ret = dma_mapping_error(attach->dev, addr);
+			if (ret)
+				goto err_unmap_dma;
+		}
+
+		if (!dma->state || !dma_use_iova(dma->state))
+			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
+	}
+
+	if (dma->state && dma_use_iova(dma->state)) {
+		WARN_ON_ONCE(mapped_len != size);
+		ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
+		if (ret)
+			goto err_unmap_dma;
+
+		sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
+	}
+
+	dma->size = size;
+
+	/*
+	 * There is no CPU list; set orig_nents = 0 so importers can detect
+	 * this from the SG table and use nents only.
+	 */
+	dma->sgt.orig_nents = 0;
+
+	/*
+	 * sgl must be NULL here to show that the last entry was consumed and
+	 * that we allocated the correct number of entries in sg_alloc_table().
+	 */
+	WARN_ON_ONCE(sgl);
+	return &dma->sgt;
+
+err_unmap_dma:
+	if (!i || !dma->state) {
+		; /* Do nothing */
+	} else if (dma_use_iova(dma->state)) {
+		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
+				 DMA_ATTR_MMIO);
+	} else {
+		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
+			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
+	}
+	sg_free_table(&dma->sgt);
+err_free_state:
+	kfree(dma->state);
+err_free_dma:
+	kfree(dma);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");
+
+/**
+ * dma_buf_free_sgt - unmaps the buffer
+ * @attach: [in] attachment to unmap buffer from
+ * @sgt: [in] scatterlist info of the buffer to unmap
+ * @dir: [in] direction of DMA transfer
+ *
+ * This unmaps a DMA mapping for @attach obtained
+ * by dma_buf_phys_vec_to_sgt().
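+ *
+ * Minimal pairing sketch for an exporter (illustrative only; the provider
+ * and phys_vec are assumed to come from the exporting driver and are not
+ * defined by this patch)::
+ *
+ *	sgt = dma_buf_phys_vec_to_sgt(attach, provider, phys_vec,
+ *				      nr_ranges, size, dir);
+ *	if (IS_ERR(sgt))
+ *		return PTR_ERR(sgt);
+ *	...
+ *	dma_buf_free_sgt(attach, sgt, dir);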
+ */ +void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, + enum dma_data_direction dir) +{ + struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt); + int i; + + dma_resv_assert_held(attach->dmabuf->resv); + + if (!dma->state) { + ; /* Do nothing */ + } else if (dma_use_iova(dma->state)) { + dma_iova_destroy(attach->dev, dma->state, dma->size, dir, + DMA_ATTR_MMIO); + } else { + struct scatterlist *sgl; + + for_each_sgtable_dma_sg(sgt, sgl, i) + dma_unmap_phys(attach->dev, sg_dma_address(sgl), + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); + } + + sg_free_table(sgt); + kfree(dma->state); + kfree(dma); + +} +EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF"); diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index ea2ef53bd4fef..c92088855450a 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -724,7 +724,12 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, struct device *dev static int dma_info_to_prot(enum dma_data_direction dir, bool coherent, unsigned long attrs) { - int prot = coherent ? IOMMU_CACHE : 0; + int prot; + + if (attrs & DMA_ATTR_MMIO) + prot = IOMMU_MMIO; + else + prot = coherent ? IOMMU_CACHE : 0; if (attrs & DMA_ATTR_PRIVILEGED) prot |= IOMMU_PRIV; @@ -1190,11 +1195,9 @@ static inline size_t iova_unaligned(struct iova_domain *iovad, phys_addr_t phys, return iova_offset(iovad, phys | size); } -dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs) +dma_addr_t iommu_dma_map_phys(struct device *dev, phys_addr_t phys, size_t size, + enum dma_data_direction dir, unsigned long attrs) { - phys_addr_t phys = page_to_phys(page) + offset; bool coherent = dev_is_dma_coherent(dev); int prot = dma_info_to_prot(dir, coherent, attrs); struct iommu_domain *domain = iommu_get_dma_domain(dev); @@ -1208,27 +1211,34 @@ dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, */ if (dev_use_swiotlb(dev, size, dir) && iova_unaligned(iovad, phys, size)) { + if (attrs & DMA_ATTR_MMIO) + return DMA_MAPPING_ERROR; + phys = iommu_dma_map_swiotlb(dev, phys, size, dir, attrs); if (phys == (phys_addr_t)DMA_MAPPING_ERROR) return DMA_MAPPING_ERROR; } - if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) + if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) arch_sync_dma_for_device(phys, size, dir); iova = __iommu_dma_map(dev, phys, size, prot, dma_mask); - if (iova == DMA_MAPPING_ERROR) + if (iova == DMA_MAPPING_ERROR && !(attrs & DMA_ATTR_MMIO)) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); return iova; } -void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle, +void iommu_dma_unmap_phys(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs) { - struct iommu_domain *domain = iommu_get_dma_domain(dev); phys_addr_t phys; - phys = iommu_iova_to_phys(domain, dma_handle); + if (attrs & DMA_ATTR_MMIO) { + __iommu_dma_unmap(dev, dma_handle, size); + return; + } + + phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle); if (WARN_ON(!phys)) return; @@ -1341,7 +1351,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, struct scatterlist *s int i; for_each_sg(sg, s, nents, i) - iommu_dma_unmap_page(dev, sg_dma_address(s), + iommu_dma_unmap_phys(dev, sg_dma_address(s), sg_dma_len(s), dir, attrs); } @@ -1354,8 +1364,8 @@ static int iommu_dma_map_sg_swiotlb(struct device *dev, struct scatterlist *sg, 
sg_dma_mark_swiotlb(sg); for_each_sg(sg, s, nents, i) { - sg_dma_address(s) = iommu_dma_map_page(dev, sg_page(s), - s->offset, s->length, dir, attrs); + sg_dma_address(s) = iommu_dma_map_phys(dev, sg_phys(s), + s->length, dir, attrs); if (sg_dma_address(s) == DMA_MAPPING_ERROR) goto out_unmap; sg_dma_len(s) = s->length; @@ -1429,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, * as a bus address, __finalise_sg() will copy the dma * address into the output segment. */ - s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state, - sg_phys(s)); + s->dma_address = pci_p2pdma_bus_addr_map( + p2pdma_state.mem, sg_phys(s)); sg_dma_len(s) = sg->length; sg_dma_mark_bus_address(s); continue; @@ -1546,20 +1556,6 @@ void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, __iommu_dma_unmap(dev, start, end - start); } -dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys, - size_t size, enum dma_data_direction dir, unsigned long attrs) -{ - return __iommu_dma_map(dev, phys, size, - dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO, - dma_get_mask(dev)); -} - -void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle, - size_t size, enum dma_data_direction dir, unsigned long attrs) -{ - __iommu_dma_unmap(dev, handle, size); -} - static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr) { size_t alloc_size = PAGE_ALIGN(size); @@ -1838,12 +1834,13 @@ static int __dma_iova_link(struct device *dev, dma_addr_t addr, unsigned long attrs) { bool coherent = dev_is_dma_coherent(dev); + int prot = dma_info_to_prot(dir, coherent, attrs); - if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) + if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) arch_sync_dma_for_device(phys, size, dir); return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size, - dma_info_to_prot(dir, coherent, attrs), GFP_ATOMIC); + prot, GFP_ATOMIC); } static int iommu_dma_iova_bounce_and_link(struct device *dev, dma_addr_t addr, @@ -1949,9 +1946,13 @@ int dma_iova_link(struct device *dev, struct dma_iova_state *state, return -EIO; if (dev_use_swiotlb(dev, size, dir) && - iova_unaligned(iovad, phys, size)) + iova_unaligned(iovad, phys, size)) { + if (attrs & DMA_ATTR_MMIO) + return -EPERM; + return iommu_dma_iova_link_swiotlb(dev, state, phys, offset, size, dir, attrs); + } return __dma_iova_link(dev, state->addr + offset - iova_start_pad, phys - iova_start_pad, @@ -2007,7 +2008,7 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev, end - addr, iovad->granule - iova_start_pad); if (!dev_is_dma_coherent(dev) && - !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) + !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) arch_sync_dma_for_cpu(phys, len, dir); swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs); @@ -2031,7 +2032,8 @@ static void __iommu_dma_iova_unlink(struct device *dev, size_t unmapped; if ((state->__size & DMA_IOVA_USE_SWIOTLB) || - (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))) + (!dev_is_dma_coherent(dev) && + !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))) iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs); iommu_iotlb_gather_init(&iotlb_gather); diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 75d60f2ad9008..54cf4d856179b 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -8,8 +8,10 @@ * The datastructure uses the iopt_pages to optimize the storage of the PFNs * between 
the domains and xarray. */ +#include #include #include +#include #include #include #include @@ -284,6 +286,9 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt, case IOPT_ADDRESS_FILE: start = elm->start_byte + elm->pages->start; break; + case IOPT_ADDRESS_DMABUF: + start = elm->start_byte + elm->pages->dmabuf.start; + break; } rc = iopt_alloc_iova(iopt, dst_iova, start, length); if (rc) @@ -468,25 +473,53 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, * @iopt: io_pagetable to act on * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains * the chosen iova on output. Otherwise is the iova to map to on input - * @file: file to map + * @fd: fdno of a file to map * @start: map file starting at this byte offset * @length: Number of bytes to map * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping * @flags: IOPT_ALLOC_IOVA or zero */ int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, - unsigned long *iova, struct file *file, - unsigned long start, unsigned long length, - int iommu_prot, unsigned int flags) + unsigned long *iova, int fd, unsigned long start, + unsigned long length, int iommu_prot, + unsigned int flags) { struct iopt_pages *pages; + struct dma_buf *dmabuf; + unsigned long start_byte; + unsigned long last; + + if (!length) + return -EINVAL; + if (check_add_overflow(start, length - 1, &last)) + return -EOVERFLOW; + + start_byte = start - ALIGN_DOWN(start, PAGE_SIZE); + dmabuf = dma_buf_get(fd); + if (!IS_ERR(dmabuf)) { + pages = iopt_alloc_dmabuf_pages(ictx, dmabuf, start_byte, start, + length, + iommu_prot & IOMMU_WRITE); + if (IS_ERR(pages)) { + dma_buf_put(dmabuf); + return PTR_ERR(pages); + } + } else { + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + pages = iopt_alloc_file_pages(file, start_byte, start, length, + iommu_prot & IOMMU_WRITE); + fput(file); + if (IS_ERR(pages)) + return PTR_ERR(pages); + } - pages = iopt_alloc_file_pages(file, start, length, - iommu_prot & IOMMU_WRITE); - if (IS_ERR(pages)) - return PTR_ERR(pages); return iopt_map_common(ictx, iopt, pages, iova, length, - start - pages->start, iommu_prot, flags); + start_byte, iommu_prot, flags); } struct iova_bitmap_fn_arg { @@ -961,9 +994,15 @@ static void iopt_unfill_domain(struct io_pagetable *iopt, WARN_ON(!area->storage_domain); if (area->storage_domain == domain) area->storage_domain = storage_domain; + if (iopt_is_dmabuf(pages)) { + if (!iopt_dmabuf_revoked(pages)) + iopt_area_unmap_domain(area, domain); + iopt_dmabuf_untrack_domain(pages, area, domain); + } mutex_unlock(&pages->mutex); - iopt_area_unmap_domain(area, domain); + if (!iopt_is_dmabuf(pages)) + iopt_area_unmap_domain(area, domain); } return; } @@ -980,6 +1019,8 @@ static void iopt_unfill_domain(struct io_pagetable *iopt, WARN_ON(area->storage_domain != domain); area->storage_domain = NULL; iopt_area_unfill_domain(area, pages, domain); + if (iopt_is_dmabuf(pages)) + iopt_dmabuf_untrack_domain(pages, area, domain); mutex_unlock(&pages->mutex); } } @@ -1009,10 +1050,16 @@ static int iopt_fill_domain(struct io_pagetable *iopt, if (!pages) continue; - mutex_lock(&pages->mutex); + guard(mutex)(&pages->mutex); + if (iopt_is_dmabuf(pages)) { + rc = iopt_dmabuf_track_domain(pages, area, domain); + if (rc) + goto out_unfill; + } rc = iopt_area_fill_domain(area, domain); if (rc) { - mutex_unlock(&pages->mutex); + if (iopt_is_dmabuf(pages)) + iopt_dmabuf_untrack_domain(pages, area, domain); goto out_unfill; } if 
(!area->storage_domain) { @@ -1021,7 +1068,6 @@ static int iopt_fill_domain(struct io_pagetable *iopt, interval_tree_insert(&area->pages_node, &pages->domains_itree); } - mutex_unlock(&pages->mutex); } return 0; @@ -1042,6 +1088,8 @@ static int iopt_fill_domain(struct io_pagetable *iopt, area->storage_domain = NULL; } iopt_area_unfill_domain(area, pages, domain); + if (iopt_is_dmabuf(pages)) + iopt_dmabuf_untrack_domain(pages, area, domain); mutex_unlock(&pages->mutex); } return rc; @@ -1252,6 +1300,10 @@ static int iopt_area_split(struct iopt_area *area, unsigned long iova) if (!pages || area->prevent_access) return -EBUSY; + /* Maintaining the domains_itree below is a bit complicated */ + if (iopt_is_dmabuf(pages)) + return -EOPNOTSUPP; + if (new_start & (alignment - 1) || iopt_area_start_byte(area, new_start) & (alignment - 1)) return -EINVAL; diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h index b6064f4ce4af9..14cd052fd3204 100644 --- a/drivers/iommu/iommufd/io_pagetable.h +++ b/drivers/iommu/iommufd/io_pagetable.h @@ -5,6 +5,7 @@ #ifndef __IO_PAGETABLE_H #define __IO_PAGETABLE_H +#include #include #include #include @@ -69,6 +70,16 @@ void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages, void iopt_area_unmap_domain(struct iopt_area *area, struct iommu_domain *domain); +int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area, + struct iommu_domain *domain); +void iopt_dmabuf_untrack_domain(struct iopt_pages *pages, + struct iopt_area *area, + struct iommu_domain *domain); +int iopt_dmabuf_track_all_domains(struct iopt_area *area, + struct iopt_pages *pages); +void iopt_dmabuf_untrack_all_domains(struct iopt_area *area, + struct iopt_pages *pages); + static inline unsigned long iopt_area_index(struct iopt_area *area) { return area->pages_node.start; @@ -179,7 +190,22 @@ enum { enum iopt_address_type { IOPT_ADDRESS_USER = 0, - IOPT_ADDRESS_FILE = 1, + IOPT_ADDRESS_FILE, + IOPT_ADDRESS_DMABUF, +}; + +struct iopt_pages_dmabuf_track { + struct iommu_domain *domain; + struct iopt_area *area; + struct list_head elm; +}; + +struct iopt_pages_dmabuf { + struct dma_buf_attachment *attach; + struct dma_buf_phys_vec phys; + /* Always PAGE_SIZE aligned */ + unsigned long start; + struct list_head tracker; }; /* @@ -209,6 +235,8 @@ struct iopt_pages { struct file *file; unsigned long start; }; + /* IOPT_ADDRESS_DMABUF */ + struct iopt_pages_dmabuf dmabuf; }; bool writable:1; u8 account_mode; @@ -220,10 +248,32 @@ struct iopt_pages { struct rb_root_cached domains_itree; }; +static inline bool iopt_is_dmabuf(struct iopt_pages *pages) +{ + if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER)) + return false; + return pages->type == IOPT_ADDRESS_DMABUF; +} + +static inline bool iopt_dmabuf_revoked(struct iopt_pages *pages) +{ + lockdep_assert_held(&pages->mutex); + if (iopt_is_dmabuf(pages)) + return pages->dmabuf.phys.len == 0; + return false; +} + struct iopt_pages *iopt_alloc_user_pages(void __user *uptr, unsigned long length, bool writable); -struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start, +struct iopt_pages *iopt_alloc_file_pages(struct file *file, + unsigned long start_byte, + unsigned long start, unsigned long length, bool writable); +struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx, + struct dma_buf *dmabuf, + unsigned long start_byte, + unsigned long start, + unsigned long length, bool writable); void iopt_release_pages(struct kref *kref); static inline void 
iopt_put_pages(struct iopt_pages *pages) { diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c index 459a7c5169154..f4721afedadcf 100644 --- a/drivers/iommu/iommufd/ioas.c +++ b/drivers/iommu/iommufd/ioas.c @@ -207,7 +207,6 @@ int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd) unsigned long iova = cmd->iova; struct iommufd_ioas *ioas; unsigned int flags = 0; - struct file *file; int rc; if (cmd->flags & @@ -229,11 +228,7 @@ int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd) if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA)) flags = IOPT_ALLOC_IOVA; - file = fget(cmd->fd); - if (!file) - return -EBADF; - - rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, file, + rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, cmd->fd, cmd->start, cmd->length, conv_iommu_prot(cmd->flags), flags); if (rc) @@ -243,7 +238,6 @@ int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd) rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); out_put: iommufd_put_object(ucmd->ictx, &ioas->obj); - fput(file); return rc; } diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 627f9b78483a0..ef2db82e3d7bf 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -19,6 +19,8 @@ struct iommu_domain; struct iommu_group; struct iommu_option; struct iommufd_device; +struct dma_buf_attachment; +struct dma_buf_phys_vec; struct iommufd_sw_msi_map { struct list_head sw_msi_item; @@ -108,7 +110,7 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, unsigned long length, int iommu_prot, unsigned int flags); int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, - unsigned long *iova, struct file *file, + unsigned long *iova, int fd, unsigned long start, unsigned long length, int iommu_prot, unsigned int flags); int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list, @@ -504,6 +506,8 @@ void iommufd_device_pre_destroy(struct iommufd_object *obj); void iommufd_device_destroy(struct iommufd_object *obj); int iommufd_get_hw_info(struct iommufd_ucmd *ucmd); +struct device *iommufd_global_device(void); + struct iommufd_access { struct iommufd_object obj; struct iommufd_ctx *ictx; @@ -711,6 +715,8 @@ bool iommufd_should_fail(void); int __init iommufd_test_init(void); void iommufd_test_exit(void); bool iommufd_selftest_is_mock_dev(struct device *dev); +int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, + struct dma_buf_phys_vec *phys); #else static inline void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd, unsigned int ioas_id, @@ -732,5 +738,11 @@ static inline bool iommufd_selftest_is_mock_dev(struct device *dev) { return false; } +static inline int +iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, + struct dma_buf_phys_vec *phys) +{ + return -EOPNOTSUPP; +} #endif #endif diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 8fc618b2bcf96..9166c39eb0c8b 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -29,6 +29,8 @@ enum { IOMMU_TEST_OP_PASID_REPLACE, IOMMU_TEST_OP_PASID_DETACH, IOMMU_TEST_OP_PASID_CHECK_HWPT, + IOMMU_TEST_OP_DMABUF_GET, + IOMMU_TEST_OP_DMABUF_REVOKE, }; enum { @@ -176,6 +178,14 @@ struct iommu_test_cmd { __u32 hwpt_id; /* @id is stdev_id */ } pasid_check; + struct { + __u32 length; + __u32 open_flags; + } dmabuf_get; + struct { + __s32 dmabuf_fd; + __u32 revoked; + } dmabuf_revoke; }; __u32 
last; }; diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index ce775fbbae94e..5cc4b08c25f58 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -751,6 +751,15 @@ static struct miscdevice vfio_misc_dev = { .mode = 0666, }; +/* + * Used only by DMABUF, returns a valid struct device to use as a dummy struct + * device for attachment. + */ +struct device *iommufd_global_device(void) +{ + return iommu_misc_dev.this_device; +} + static int __init iommufd_init(void) { int ret; @@ -794,5 +803,6 @@ MODULE_ALIAS("devname:vfio/vfio"); #endif MODULE_IMPORT_NS("IOMMUFD_INTERNAL"); MODULE_IMPORT_NS("IOMMUFD"); +MODULE_IMPORT_NS("DMA_BUF"); MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices"); MODULE_LICENSE("GPL"); diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c index 939be83e3b3f7..5a2fcab6f29ce 100644 --- a/drivers/iommu/iommufd/pages.c +++ b/drivers/iommu/iommufd/pages.c @@ -45,6 +45,8 @@ * last_iova + 1 can overflow. An iopt_pages index will always be much less than * ULONG_MAX so last_index + 1 cannot overflow. */ +#include +#include #include #include #include @@ -53,6 +55,7 @@ #include #include #include +#include #include "double_span.h" #include "io_pagetable.h" @@ -258,6 +261,11 @@ static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages, return container_of(node, struct iopt_area, pages_node); } +enum batch_kind { + BATCH_CPU_MEMORY = 0, + BATCH_MMIO, +}; + /* * A simple datastructure to hold a vector of PFNs, optimized for contiguous * PFNs. This is used as a temporary holding memory for shuttling pfns from one @@ -271,7 +279,9 @@ struct pfn_batch { unsigned int array_size; unsigned int end; unsigned int total_pfns; + enum batch_kind kind; }; +enum { MAX_NPFNS = type_max(typeof(((struct pfn_batch *)0)->npfns[0])) }; static void batch_clear(struct pfn_batch *batch) { @@ -348,11 +358,17 @@ static void batch_destroy(struct pfn_batch *batch, void *backup) } static bool batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn, - u32 nr) + u32 nr, enum batch_kind kind) { - const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns)); unsigned int end = batch->end; + if (batch->kind != kind) { + /* One kind per batch */ + if (batch->end != 0) + return false; + batch->kind = kind; + } + if (end && pfn == batch->pfns[end - 1] + batch->npfns[end - 1] && nr <= MAX_NPFNS - batch->npfns[end - 1]) { batch->npfns[end - 1] += nr; @@ -379,7 +395,7 @@ static void batch_remove_pfn_num(struct pfn_batch *batch, unsigned long nr) /* true if the pfn was added, false otherwise */ static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn) { - return batch_add_pfn_num(batch, pfn, 1); + return batch_add_pfn_num(batch, pfn, 1, BATCH_CPU_MEMORY); } /* @@ -492,6 +508,7 @@ static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain, { bool disable_large_pages = area->iopt->disable_large_pages; unsigned long last_iova = iopt_area_last_iova(area); + int iommu_prot = area->iommu_prot; unsigned int page_offset = 0; unsigned long start_iova; unsigned long next_iova; @@ -499,6 +516,11 @@ static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain, unsigned long iova; int rc; + if (batch->kind == BATCH_MMIO) { + iommu_prot &= ~IOMMU_CACHE; + iommu_prot |= IOMMU_MMIO; + } + /* The first index might be a partial page */ if (start_index == iopt_area_index(area)) page_offset = area->page_offset; @@ -512,11 +534,11 @@ static int batch_to_domain(struct 
pfn_batch *batch, struct iommu_domain *domain, rc = batch_iommu_map_small( domain, iova, PFN_PHYS(batch->pfns[cur]) + page_offset, - next_iova - iova, area->iommu_prot); + next_iova - iova, iommu_prot); else rc = iommu_map(domain, iova, PFN_PHYS(batch->pfns[cur]) + page_offset, - next_iova - iova, area->iommu_prot, + next_iova - iova, iommu_prot, GFP_KERNEL_ACCOUNT); if (rc) goto err_unmap; @@ -652,7 +674,7 @@ static int batch_from_folios(struct pfn_batch *batch, struct folio ***folios_p, nr = min(nr, npages); npages -= nr; - if (!batch_add_pfn_num(batch, pfn, nr)) + if (!batch_add_pfn_num(batch, pfn, nr, BATCH_CPU_MEMORY)) break; if (nr > 1) { rc = folio_add_pins(folio, nr - 1); @@ -1126,6 +1148,41 @@ static int pfn_reader_user_update_pinned(struct pfn_reader_user *user, return iopt_pages_update_pinned(pages, npages, inc, user); } +struct pfn_reader_dmabuf { + struct dma_buf_phys_vec phys; + unsigned long start_offset; +}; + +static int pfn_reader_dmabuf_init(struct pfn_reader_dmabuf *dmabuf, + struct iopt_pages *pages) +{ + /* Callers must not get here if the dmabuf was already revoked */ + if (WARN_ON(iopt_dmabuf_revoked(pages))) + return -EINVAL; + + dmabuf->phys = pages->dmabuf.phys; + dmabuf->start_offset = pages->dmabuf.start; + return 0; +} + +static int pfn_reader_fill_dmabuf(struct pfn_reader_dmabuf *dmabuf, + struct pfn_batch *batch, + unsigned long start_index, + unsigned long last_index) +{ + unsigned long start = dmabuf->start_offset + start_index * PAGE_SIZE; + + /* + * start/last_index and start are all PAGE_SIZE aligned, the batch is + * always filled using page size aligned PFNs just like the other types. + * If the dmabuf has been sliced on a sub page offset then the common + * batch to domain code will adjust it before mapping to the domain. + */ + batch_add_pfn_num(batch, PHYS_PFN(dmabuf->phys.paddr + start), + last_index - start_index + 1, BATCH_MMIO); + return 0; +} + /* * PFNs are stored in three places, in order of preference: * - The iopt_pages xarray. 
This is only populated if there is a @@ -1144,7 +1201,10 @@ struct pfn_reader { unsigned long batch_end_index; unsigned long last_index; - struct pfn_reader_user user; + union { + struct pfn_reader_user user; + struct pfn_reader_dmabuf dmabuf; + }; }; static int pfn_reader_update_pinned(struct pfn_reader *pfns) @@ -1180,7 +1240,7 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns) { struct interval_tree_double_span_iter *span = &pfns->span; unsigned long start_index = pfns->batch_end_index; - struct pfn_reader_user *user = &pfns->user; + struct pfn_reader_user *user; unsigned long npages; struct iopt_area *area; int rc; @@ -1212,8 +1272,13 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns) return 0; } - if (start_index >= pfns->user.upages_end) { - rc = pfn_reader_user_pin(&pfns->user, pfns->pages, start_index, + if (iopt_is_dmabuf(pfns->pages)) + return pfn_reader_fill_dmabuf(&pfns->dmabuf, &pfns->batch, + start_index, span->last_hole); + + user = &pfns->user; + if (start_index >= user->upages_end) { + rc = pfn_reader_user_pin(user, pfns->pages, start_index, span->last_hole); if (rc) return rc; @@ -1281,7 +1346,10 @@ static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages, pfns->batch_start_index = start_index; pfns->batch_end_index = start_index; pfns->last_index = last_index; - pfn_reader_user_init(&pfns->user, pages); + if (iopt_is_dmabuf(pages)) + pfn_reader_dmabuf_init(&pfns->dmabuf, pages); + else + pfn_reader_user_init(&pfns->user, pages); rc = batch_init(&pfns->batch, last_index - start_index + 1); if (rc) return rc; @@ -1302,8 +1370,12 @@ static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages, static void pfn_reader_release_pins(struct pfn_reader *pfns) { struct iopt_pages *pages = pfns->pages; - struct pfn_reader_user *user = &pfns->user; + struct pfn_reader_user *user; + + if (iopt_is_dmabuf(pages)) + return; + user = &pfns->user; if (user->upages_end > pfns->batch_end_index) { /* Any pages not transferred to the batch are just unpinned */ @@ -1334,7 +1406,8 @@ static void pfn_reader_destroy(struct pfn_reader *pfns) struct iopt_pages *pages = pfns->pages; pfn_reader_release_pins(pfns); - pfn_reader_user_destroy(&pfns->user, pfns->pages); + if (!iopt_is_dmabuf(pfns->pages)) + pfn_reader_user_destroy(&pfns->user, pfns->pages); batch_destroy(&pfns->batch, NULL); WARN_ON(pages->last_npinned != pages->npinned); } @@ -1413,26 +1486,234 @@ struct iopt_pages *iopt_alloc_user_pages(void __user *uptr, return pages; } -struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start, +struct iopt_pages *iopt_alloc_file_pages(struct file *file, + unsigned long start_byte, + unsigned long start, unsigned long length, bool writable) { struct iopt_pages *pages; - unsigned long start_down = ALIGN_DOWN(start, PAGE_SIZE); - unsigned long end; - if (length && check_add_overflow(start, length - 1, &end)) - return ERR_PTR(-EOVERFLOW); - - pages = iopt_alloc_pages(start - start_down, length, writable); + pages = iopt_alloc_pages(start_byte, length, writable); if (IS_ERR(pages)) return pages; pages->file = get_file(file); - pages->start = start_down; + pages->start = start - start_byte; pages->type = IOPT_ADDRESS_FILE; return pages; } +static void iopt_revoke_notify(struct dma_buf_attachment *attach) +{ + struct iopt_pages *pages = attach->importer_priv; + struct iopt_pages_dmabuf_track *track; + + guard(mutex)(&pages->mutex); + if (iopt_dmabuf_revoked(pages)) + return; + + list_for_each_entry(track, &pages->dmabuf.tracker, 
elm) { + struct iopt_area *area = track->area; + + iopt_area_unmap_domain_range(area, track->domain, + iopt_area_index(area), + iopt_area_last_index(area)); + } + pages->dmabuf.phys.len = 0; +} + +static struct dma_buf_attach_ops iopt_dmabuf_attach_revoke_ops = { + .allow_peer2peer = true, + .move_notify = iopt_revoke_notify, +}; + +/* + * iommufd and vfio have a circular dependency. Future work for a phys + * based private interconnect will remove this. + */ +static int +sym_vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, + struct dma_buf_phys_vec *phys) +{ + typeof(&vfio_pci_dma_buf_iommufd_map) fn; + int rc; + + rc = iommufd_test_dma_buf_iommufd_map(attachment, phys); + if (rc != -EOPNOTSUPP) + return rc; + + if (!IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)) + return -EOPNOTSUPP; + + fn = symbol_get(vfio_pci_dma_buf_iommufd_map); + if (!fn) + return -EOPNOTSUPP; + rc = fn(attachment, phys); + symbol_put(vfio_pci_dma_buf_iommufd_map); + return rc; +} + +static int iopt_map_dmabuf(struct iommufd_ctx *ictx, struct iopt_pages *pages, + struct dma_buf *dmabuf) +{ + struct dma_buf_attachment *attach; + int rc; + + attach = dma_buf_dynamic_attach(dmabuf, iommufd_global_device(), + &iopt_dmabuf_attach_revoke_ops, pages); + if (IS_ERR(attach)) + return PTR_ERR(attach); + + dma_resv_lock(dmabuf->resv, NULL); + /* + * Lock ordering requires the mutex to be taken inside the reservation, + * make sure lockdep sees this. + */ + if (IS_ENABLED(CONFIG_LOCKDEP)) { + mutex_lock(&pages->mutex); + mutex_unlock(&pages->mutex); + } + + rc = sym_vfio_pci_dma_buf_iommufd_map(attach, &pages->dmabuf.phys); + if (rc) + goto err_detach; + + dma_resv_unlock(dmabuf->resv); + + /* On success iopt_release_pages() will detach and put the dmabuf. */ + pages->dmabuf.attach = attach; + return 0; + +err_detach: + dma_resv_unlock(dmabuf->resv); + dma_buf_detach(dmabuf, attach); + return rc; +} + +struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx, + struct dma_buf *dmabuf, + unsigned long start_byte, + unsigned long start, + unsigned long length, bool writable) +{ + static struct lock_class_key pages_dmabuf_mutex_key; + struct iopt_pages *pages; + int rc; + + if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER)) + return ERR_PTR(-EOPNOTSUPP); + + if (dmabuf->size <= (start + length - 1) || + length / PAGE_SIZE >= MAX_NPFNS) + return ERR_PTR(-EINVAL); + + pages = iopt_alloc_pages(start_byte, length, writable); + if (IS_ERR(pages)) + return pages; + + /* + * The mmap_lock can be held when obtaining the dmabuf reservation lock + * which creates a locking cycle with the pages mutex which is held + * while obtaining the mmap_lock. This locking path is not present for + * IOPT_ADDRESS_DMABUF so split the lock class. + */ + lockdep_set_class(&pages->mutex, &pages_dmabuf_mutex_key); + + /* dmabuf does not use pinned page accounting. 
*/ + pages->account_mode = IOPT_PAGES_ACCOUNT_NONE; + pages->type = IOPT_ADDRESS_DMABUF; + pages->dmabuf.start = start - start_byte; + INIT_LIST_HEAD(&pages->dmabuf.tracker); + + rc = iopt_map_dmabuf(ictx, pages, dmabuf); + if (rc) { + iopt_put_pages(pages); + return ERR_PTR(rc); + } + + return pages; +} + +int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area, + struct iommu_domain *domain) +{ + struct iopt_pages_dmabuf_track *track; + + lockdep_assert_held(&pages->mutex); + if (WARN_ON(!iopt_is_dmabuf(pages))) + return -EINVAL; + + list_for_each_entry(track, &pages->dmabuf.tracker, elm) + if (WARN_ON(track->domain == domain && track->area == area)) + return -EINVAL; + + track = kzalloc(sizeof(*track), GFP_KERNEL); + if (!track) + return -ENOMEM; + track->domain = domain; + track->area = area; + list_add_tail(&track->elm, &pages->dmabuf.tracker); + + return 0; +} + +void iopt_dmabuf_untrack_domain(struct iopt_pages *pages, + struct iopt_area *area, + struct iommu_domain *domain) +{ + struct iopt_pages_dmabuf_track *track; + + lockdep_assert_held(&pages->mutex); + WARN_ON(!iopt_is_dmabuf(pages)); + + list_for_each_entry(track, &pages->dmabuf.tracker, elm) { + if (track->domain == domain && track->area == area) { + list_del(&track->elm); + kfree(track); + return; + } + } + WARN_ON(true); +} + +int iopt_dmabuf_track_all_domains(struct iopt_area *area, + struct iopt_pages *pages) +{ + struct iopt_pages_dmabuf_track *track; + struct iommu_domain *domain; + unsigned long index; + int rc; + + list_for_each_entry(track, &pages->dmabuf.tracker, elm) + if (WARN_ON(track->area == area)) + return -EINVAL; + + xa_for_each(&area->iopt->domains, index, domain) { + rc = iopt_dmabuf_track_domain(pages, area, domain); + if (rc) + goto err_untrack; + } + return 0; +err_untrack: + iopt_dmabuf_untrack_all_domains(area, pages); + return rc; +} + +void iopt_dmabuf_untrack_all_domains(struct iopt_area *area, + struct iopt_pages *pages) +{ + struct iopt_pages_dmabuf_track *track; + struct iopt_pages_dmabuf_track *tmp; + + list_for_each_entry_safe(track, tmp, &pages->dmabuf.tracker, + elm) { + if (track->area == area) { + list_del(&track->elm); + kfree(track); + } + } +} + void iopt_release_pages(struct kref *kref) { struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref); @@ -1445,8 +1726,15 @@ void iopt_release_pages(struct kref *kref) mutex_destroy(&pages->mutex); put_task_struct(pages->source_task); free_uid(pages->source_user); - if (pages->type == IOPT_ADDRESS_FILE) + if (iopt_is_dmabuf(pages) && pages->dmabuf.attach) { + struct dma_buf *dmabuf = pages->dmabuf.attach->dmabuf; + + dma_buf_detach(dmabuf, pages->dmabuf.attach); + dma_buf_put(dmabuf); + WARN_ON(!list_empty(&pages->dmabuf.tracker)); + } else if (pages->type == IOPT_ADDRESS_FILE) { fput(pages->file); + } kfree(pages); } @@ -1524,6 +1812,14 @@ static void __iopt_area_unfill_domain(struct iopt_area *area, lockdep_assert_held(&pages->mutex); + if (iopt_is_dmabuf(pages)) { + if (WARN_ON(iopt_dmabuf_revoked(pages))) + return; + iopt_area_unmap_domain_range(area, domain, start_index, + last_index); + return; + } + /* * For security we must not unpin something that is still DMA mapped, * so this must unmap any IOVA before we go ahead and unpin the pages. 
@@ -1599,6 +1895,9 @@ void iopt_area_unmap_domain(struct iopt_area *area, struct iommu_domain *domain) void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages, struct iommu_domain *domain) { + if (iopt_dmabuf_revoked(pages)) + return; + __iopt_area_unfill_domain(area, pages, domain, iopt_area_last_index(area)); } @@ -1619,6 +1918,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain) lockdep_assert_held(&area->pages->mutex); + if (iopt_dmabuf_revoked(area->pages)) + return 0; + rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area), iopt_area_last_index(area)); if (rc) @@ -1678,33 +1980,44 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages) return 0; mutex_lock(&pages->mutex); - rc = pfn_reader_first(&pfns, pages, iopt_area_index(area), - iopt_area_last_index(area)); - if (rc) - goto out_unlock; + if (iopt_is_dmabuf(pages)) { + rc = iopt_dmabuf_track_all_domains(area, pages); + if (rc) + goto out_unlock; + } - while (!pfn_reader_done(&pfns)) { - done_first_end_index = pfns.batch_end_index; - done_all_end_index = pfns.batch_start_index; - xa_for_each(&area->iopt->domains, index, domain) { - rc = batch_to_domain(&pfns.batch, domain, area, - pfns.batch_start_index); + if (!iopt_dmabuf_revoked(pages)) { + rc = pfn_reader_first(&pfns, pages, iopt_area_index(area), + iopt_area_last_index(area)); + if (rc) + goto out_untrack; + + while (!pfn_reader_done(&pfns)) { + done_first_end_index = pfns.batch_end_index; + done_all_end_index = pfns.batch_start_index; + xa_for_each(&area->iopt->domains, index, domain) { + rc = batch_to_domain(&pfns.batch, domain, area, + pfns.batch_start_index); + if (rc) + goto out_unmap; + } + done_all_end_index = done_first_end_index; + + rc = pfn_reader_next(&pfns); if (rc) goto out_unmap; } - done_all_end_index = done_first_end_index; - - rc = pfn_reader_next(&pfns); + rc = pfn_reader_update_pinned(&pfns); if (rc) goto out_unmap; + + pfn_reader_destroy(&pfns); } - rc = pfn_reader_update_pinned(&pfns); - if (rc) - goto out_unmap; area->storage_domain = xa_load(&area->iopt->domains, 0); interval_tree_insert(&area->pages_node, &pages->domains_itree); - goto out_destroy; + mutex_unlock(&pages->mutex); + return 0; out_unmap: pfn_reader_release_pins(&pfns); @@ -1731,8 +2044,10 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages) end_index); } } -out_destroy: pfn_reader_destroy(&pfns); +out_untrack: + if (iopt_is_dmabuf(pages)) + iopt_dmabuf_untrack_all_domains(area, pages); out_unlock: mutex_unlock(&pages->mutex); return rc; @@ -1758,16 +2073,22 @@ void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages) if (!area->storage_domain) goto out_unlock; - xa_for_each(&iopt->domains, index, domain) - if (domain != area->storage_domain) + xa_for_each(&iopt->domains, index, domain) { + if (domain == area->storage_domain) + continue; + + if (!iopt_dmabuf_revoked(pages)) iopt_area_unmap_domain_range( area, domain, iopt_area_index(area), iopt_area_last_index(area)); + } if (IS_ENABLED(CONFIG_IOMMUFD_TEST)) WARN_ON(RB_EMPTY_NODE(&area->pages_node.rb)); interval_tree_remove(&area->pages_node, &pages->domains_itree); iopt_area_unfill_domain(area, pages, area->storage_domain); + if (iopt_is_dmabuf(pages)) + iopt_dmabuf_untrack_all_domains(area, pages); area->storage_domain = NULL; out_unlock: mutex_unlock(&pages->mutex); @@ -2104,15 +2425,14 @@ int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte, if ((flags & IOMMUFD_ACCESS_RW_WRITE) 
&& !pages->writable) return -EPERM; - if (pages->type == IOPT_ADDRESS_FILE) + if (iopt_is_dmabuf(pages)) + return -EINVAL; + + if (pages->type != IOPT_ADDRESS_USER) return iopt_pages_rw_slow(pages, start_index, last_index, start_byte % PAGE_SIZE, data, length, flags); - if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && - WARN_ON(pages->type != IOPT_ADDRESS_USER)) - return -EINVAL; - if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) { if (start_index == last_index) return iopt_pages_rw_page(pages, start_index, diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index de178827a078a..5d14dd0fb37d6 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -5,6 +5,8 @@ */ #include #include +#include +#include #include #include #include @@ -2031,6 +2033,140 @@ void iommufd_selftest_destroy(struct iommufd_object *obj) } } +struct iommufd_test_dma_buf { + void *memory; + size_t length; + bool revoked; +}; + +static int iommufd_test_dma_buf_attach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + return 0; +} + +static void iommufd_test_dma_buf_detach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ +} + +static struct sg_table * +iommufd_test_dma_buf_map(struct dma_buf_attachment *attachment, + enum dma_data_direction dir) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +static void iommufd_test_dma_buf_unmap(struct dma_buf_attachment *attachment, + struct sg_table *sgt, + enum dma_data_direction dir) +{ +} + +static void iommufd_test_dma_buf_release(struct dma_buf *dmabuf) +{ + struct iommufd_test_dma_buf *priv = dmabuf->priv; + + kfree(priv->memory); + kfree(priv); +} + +static const struct dma_buf_ops iommufd_test_dmabuf_ops = { + .attach = iommufd_test_dma_buf_attach, + .detach = iommufd_test_dma_buf_detach, + .map_dma_buf = iommufd_test_dma_buf_map, + .release = iommufd_test_dma_buf_release, + .unmap_dma_buf = iommufd_test_dma_buf_unmap, +}; + +int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, + struct dma_buf_phys_vec *phys) +{ + struct iommufd_test_dma_buf *priv = attachment->dmabuf->priv; + + dma_resv_assert_held(attachment->dmabuf->resv); + + if (attachment->dmabuf->ops != &iommufd_test_dmabuf_ops) + return -EOPNOTSUPP; + + if (priv->revoked) + return -ENODEV; + + phys->paddr = virt_to_phys(priv->memory); + phys->len = priv->length; + return 0; +} + +static int iommufd_test_dmabuf_get(struct iommufd_ucmd *ucmd, + unsigned int open_flags, + size_t len) +{ + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + struct iommufd_test_dma_buf *priv; + struct dma_buf *dmabuf; + int rc; + + len = ALIGN(len, PAGE_SIZE); + if (len == 0 || len > PAGE_SIZE * 512) + return -EINVAL; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->length = len; + priv->memory = kzalloc(len, GFP_KERNEL); + if (!priv->memory) { + rc = -ENOMEM; + goto err_free; + } + + exp_info.ops = &iommufd_test_dmabuf_ops; + exp_info.size = len; + exp_info.flags = open_flags; + exp_info.priv = priv; + + dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(dmabuf)) { + rc = PTR_ERR(dmabuf); + goto err_free; + } + + return dma_buf_fd(dmabuf, open_flags); + +err_free: + kfree(priv->memory); + kfree(priv); + return rc; +} + +static int iommufd_test_dmabuf_revoke(struct iommufd_ucmd *ucmd, int fd, + bool revoked) +{ + struct iommufd_test_dma_buf *priv; + struct dma_buf *dmabuf; + int rc = 0; + + dmabuf = dma_buf_get(fd); + if (IS_ERR(dmabuf)) + return PTR_ERR(dmabuf); + + if (dmabuf->ops != 
&iommufd_test_dmabuf_ops) { + rc = -EOPNOTSUPP; + goto err_put; + } + + priv = dmabuf->priv; + dma_resv_lock(dmabuf->resv, NULL); + priv->revoked = revoked; + dma_buf_move_notify(dmabuf); + dma_resv_unlock(dmabuf->resv); + +err_put: + dma_buf_put(dmabuf); + return rc; +} + int iommufd_test(struct iommufd_ucmd *ucmd) { struct iommu_test_cmd *cmd = ucmd->cmd; @@ -2109,6 +2245,13 @@ int iommufd_test(struct iommufd_ucmd *ucmd) return iommufd_test_pasid_detach(ucmd, cmd); case IOMMU_TEST_OP_PASID_CHECK_HWPT: return iommufd_test_pasid_check_hwpt(ucmd, cmd); + case IOMMU_TEST_OP_DMABUF_GET: + return iommufd_test_dmabuf_get(ucmd, cmd->dmabuf_get.open_flags, + cmd->dmabuf_get.length); + case IOMMU_TEST_OP_DMABUF_REVOKE: + return iommufd_test_dmabuf_revoke(ucmd, + cmd->dmabuf_revoke.dmabuf_fd, + cmd->dmabuf_revoke.revoked); default: return -EOPNOTSUPP; } diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 36b1bc0d56846..572f032616c68 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -951,7 +951,7 @@ static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len) if (dma->need_sync && len) { offset = buf - (head + sizeof(*dma)); - virtqueue_dma_sync_single_range_for_cpu(rq->vq, dma->addr, + virtqueue_map_sync_single_range_for_cpu(rq->vq, dma->addr, offset, len, DMA_FROM_DEVICE); } @@ -959,8 +959,8 @@ static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len) if (dma->ref) return; - virtqueue_dma_unmap_single_attrs(rq->vq, dma->addr, dma->len, - DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC); + virtqueue_unmap_single_attrs(rq->vq, dma->addr, dma->len, + DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC); put_page(page); } @@ -1027,13 +1027,13 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp) dma->len = alloc_frag->size - sizeof(*dma); - addr = virtqueue_dma_map_single_attrs(rq->vq, dma + 1, - dma->len, DMA_FROM_DEVICE, 0); - if (virtqueue_dma_mapping_error(rq->vq, addr)) + addr = virtqueue_map_single_attrs(rq->vq, dma + 1, + dma->len, DMA_FROM_DEVICE, 0); + if (virtqueue_map_mapping_error(rq->vq, addr)) return NULL; dma->addr = addr; - dma->need_sync = virtqueue_dma_need_sync(rq->vq, addr); + dma->need_sync = virtqueue_map_need_sync(rq->vq, addr); /* Add a reference to dma to prevent the entire dma from * being released during error handling. 
This reference @@ -5973,9 +5973,9 @@ static int virtnet_xsk_pool_enable(struct net_device *dev, if (!rq->xsk_buffs) return -ENOMEM; - hdr_dma = virtqueue_dma_map_single_attrs(sq->vq, &xsk_hdr, vi->hdr_len, - DMA_TO_DEVICE, 0); - if (virtqueue_dma_mapping_error(sq->vq, hdr_dma)) { + hdr_dma = virtqueue_map_single_attrs(sq->vq, &xsk_hdr, vi->hdr_len, + DMA_TO_DEVICE, 0); + if (virtqueue_map_mapping_error(sq->vq, hdr_dma)) { err = -ENOMEM; goto err_free_buffs; } @@ -6004,8 +6004,8 @@ static int virtnet_xsk_pool_enable(struct net_device *dev, err_rq: xsk_pool_dma_unmap(pool, 0); err_xsk_map: - virtqueue_dma_unmap_single_attrs(rq->vq, hdr_dma, vi->hdr_len, - DMA_TO_DEVICE, 0); + virtqueue_unmap_single_attrs(rq->vq, hdr_dma, vi->hdr_len, + DMA_TO_DEVICE, 0); err_free_buffs: kvfree(rq->xsk_buffs); return err; @@ -6032,8 +6032,8 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid) xsk_pool_dma_unmap(pool, 0); - virtqueue_dma_unmap_single_attrs(sq->vq, sq->xsk_hdr_dma_addr, - vi->hdr_len, DMA_TO_DEVICE, 0); + virtqueue_unmap_single_attrs(sq->vq, sq->xsk_hdr_dma_addr, + vi->hdr_len, DMA_TO_DEVICE, 0); kvfree(rq->xsk_buffs); return err; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 7e17c3f57d3eb..a578df68555d7 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -176,9 +176,7 @@ struct nvme_dev { u32 last_ps; bool hmb; struct sg_table *hmb_sgt; - mempool_t *dmavec_mempool; - mempool_t *iod_meta_mempool; /* shadow doorbell buffer support: */ __le32 *dbbuf_dbs; @@ -266,9 +264,24 @@ enum nvme_iod_flags { /* single segment dma mapping */ IOD_SINGLE_SEGMENT = 1U << 2, + /* Data payload contains p2p memory */ + IOD_DATA_P2P = 1U << 3, + + /* Metadata contains p2p memory */ + IOD_META_P2P = 1U << 4, + + /* Data payload contains MMIO memory */ + IOD_DATA_MMIO = 1U << 5, + + /* Metadata contains MMIO memory */ + IOD_META_MMIO = 1U << 6, + + /* Metadata using non-coalesced MPTR */ + IOD_SINGLE_META_SEGMENT = 1U << 7, + #ifdef CONFIG_NVFS /* NVFS GPU Direct Storage I/O */ - IOD_NVFS_IO = 1U << 3, + IOD_NVFS_IO = 1U << 8, #endif }; @@ -293,7 +306,8 @@ struct nvme_iod { unsigned int nr_dma_vecs; dma_addr_t meta_dma; - struct sg_table meta_sgt; + unsigned int meta_total_len; + struct dma_iova_state meta_dma_state; struct nvme_sgl_desc *meta_descriptor; #ifdef CONFIG_NVFS void *nvfs_cookie; @@ -653,6 +667,11 @@ static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq, return nvmeq->descriptor_pools.large; } +static inline bool nvme_pci_cmd_use_meta_sgl(struct nvme_command *cmd) +{ + return (cmd->common.flags & NVME_CMD_SGL_ALL) == NVME_CMD_SGL_METASEG; +} + static inline bool nvme_pci_cmd_use_sgl(struct nvme_command *cmd) { return cmd->common.flags & @@ -690,37 +709,74 @@ static void nvme_free_descriptors(struct request *req) } } -static void nvme_free_prps(struct request *req) +static void nvme_free_prps(struct request *req, unsigned int attrs) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); struct nvme_queue *nvmeq = req->mq_hctx->driver_data; unsigned int i; for (i = 0; i < iod->nr_dma_vecs; i++) - dma_unmap_page(nvmeq->dev->dev, iod->dma_vecs[i].addr, - iod->dma_vecs[i].len, rq_dma_dir(req)); + dma_unmap_phys(nvmeq->dev->dev, iod->dma_vecs[i].addr, + iod->dma_vecs[i].len, rq_dma_dir(req), attrs); mempool_free(iod->dma_vecs, nvmeq->dev->dmavec_mempool); } -static void nvme_free_sgls(struct request *req) +static void nvme_free_sgls(struct request *req, struct nvme_sgl_desc *sge, + struct nvme_sgl_desc *sg_list, unsigned int attrs) { - struct 
nvme_iod *iod = blk_mq_rq_to_pdu(req); struct nvme_queue *nvmeq = req->mq_hctx->driver_data; + enum dma_data_direction dir = rq_dma_dir(req); + unsigned int len = le32_to_cpu(sge->length); struct device *dma_dev = nvmeq->dev->dev; - dma_addr_t sqe_dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr); - unsigned int sqe_dma_len = le32_to_cpu(iod->cmd.common.dptr.sgl.length); - struct nvme_sgl_desc *sg_list = iod->descriptors[0]; + unsigned int i; + + if (sge->type == (NVME_SGL_FMT_DATA_DESC << 4)) { + dma_unmap_phys(dma_dev, le64_to_cpu(sge->addr), len, dir, + attrs); + return; + } + + for (i = 0; i < len / sizeof(*sg_list); i++) + dma_unmap_phys(dma_dev, le64_to_cpu(sg_list[i].addr), + le32_to_cpu(sg_list[i].length), dir, attrs); +} + +static void nvme_unmap_metadata(struct request *req) +{ + struct nvme_queue *nvmeq = req->mq_hctx->driver_data; + enum pci_p2pdma_map_type map = PCI_P2PDMA_MAP_NONE; enum dma_data_direction dir = rq_dma_dir(req); + struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + struct device *dma_dev = nvmeq->dev->dev; + struct nvme_sgl_desc *sge = iod->meta_descriptor; + unsigned int attrs = 0; - if (iod->nr_descriptors) { - unsigned int nr_entries = sqe_dma_len / sizeof(*sg_list), i; + if (iod->flags & IOD_SINGLE_META_SEGMENT) { + dma_unmap_page(dma_dev, iod->meta_dma, + rq_integrity_vec(req).bv_len, + rq_dma_dir(req)); + return; + } - for (i = 0; i < nr_entries; i++) - dma_unmap_page(dma_dev, le64_to_cpu(sg_list[i].addr), - le32_to_cpu(sg_list[i].length), dir); - } else { - dma_unmap_page(dma_dev, sqe_dma_addr, sqe_dma_len, dir); + if (iod->flags & IOD_META_P2P) + map = PCI_P2PDMA_MAP_BUS_ADDR; + else if (iod->flags & IOD_META_MMIO) { + map = PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; + attrs |= DMA_ATTR_MMIO; } + + if (!blk_rq_dma_unmap(req, dma_dev, &iod->meta_dma_state, + iod->meta_total_len, map)) { + if (nvme_pci_cmd_use_meta_sgl(&iod->cmd)) + nvme_free_sgls(req, sge, &sge[1], attrs); + else + dma_unmap_phys(dma_dev, iod->meta_dma, + iod->meta_total_len, dir, attrs); + } + + if (iod->meta_descriptor) + dma_pool_free(nvmeq->descriptor_pools.small, + iod->meta_descriptor, iod->meta_dma); } #ifdef CONFIG_NVFS @@ -729,9 +785,11 @@ static void nvme_free_sgls(struct request *req) static void nvme_unmap_data(struct request *req) { + enum pci_p2pdma_map_type map = PCI_P2PDMA_MAP_NONE; struct nvme_iod *iod = blk_mq_rq_to_pdu(req); struct nvme_queue *nvmeq = req->mq_hctx->driver_data; struct device *dma_dev = nvmeq->dev->dev; + unsigned int attrs = 0; #ifdef CONFIG_NVFS /* Check if this was an NVFS I/O and handle unmapping */ @@ -747,11 +805,20 @@ static void nvme_unmap_data(struct request *req) return; } - if (!blk_rq_dma_unmap(req, dma_dev, &iod->dma_state, iod->total_len)) { + if (iod->flags & IOD_DATA_P2P) + map = PCI_P2PDMA_MAP_BUS_ADDR; + else if (iod->flags & IOD_DATA_MMIO) { + map = PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; + attrs |= DMA_ATTR_MMIO; + } + + if (!blk_rq_dma_unmap(req, dma_dev, &iod->dma_state, iod->total_len, + map)) { if (nvme_pci_cmd_use_sgl(&iod->cmd)) - nvme_free_sgls(req); + nvme_free_sgls(req, &iod->cmd.common.dptr.sgl, + iod->descriptors[0], attrs); else - nvme_free_prps(req); + nvme_free_prps(req, attrs); } if (iod->nr_descriptors) @@ -1068,6 +1135,19 @@ static blk_status_t nvme_map_data(struct request *req) if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state, &iter)) return iter.status; + switch (iter.p2pdma.map) { + case PCI_P2PDMA_MAP_BUS_ADDR: + iod->flags |= IOD_DATA_P2P; + break; + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + iod->flags |= 
IOD_DATA_MMIO; + break; + case PCI_P2PDMA_MAP_NONE: + break; + default: + return BLK_STS_RESOURCE; + } + if (use_sgl == SGL_FORCED || (use_sgl == SGL_SUPPORTED && (sgl_threshold && nvme_pci_avg_seg_size(req) >= sgl_threshold))) @@ -1075,70 +1155,83 @@ static blk_status_t nvme_map_data(struct request *req) return nvme_pci_setup_data_prp(req, &iter); } -static void nvme_pci_sgl_set_data_sg(struct nvme_sgl_desc *sge, - struct scatterlist *sg) -{ - sge->addr = cpu_to_le64(sg_dma_address(sg)); - sge->length = cpu_to_le32(sg_dma_len(sg)); - sge->type = NVME_SGL_FMT_DATA_DESC << 4; -} - static blk_status_t nvme_pci_setup_meta_sgls(struct request *req) { struct nvme_queue *nvmeq = req->mq_hctx->driver_data; - struct nvme_dev *dev = nvmeq->dev; + unsigned int entries = req->nr_integrity_segments; struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + struct nvme_dev *dev = nvmeq->dev; struct nvme_sgl_desc *sg_list; - struct scatterlist *sgl, *sg; - unsigned int entries; + struct blk_dma_iter iter; dma_addr_t sgl_dma; - int rc, i; + int i = 0; - iod->meta_sgt.sgl = mempool_alloc(dev->iod_meta_mempool, GFP_ATOMIC); - if (!iod->meta_sgt.sgl) + if (!blk_rq_integrity_dma_map_iter_start(req, dev->dev, + &iod->meta_dma_state, &iter)) + return iter.status; + + switch (iter.p2pdma.map) { + case PCI_P2PDMA_MAP_BUS_ADDR: + iod->flags |= IOD_META_P2P; + break; + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + iod->flags |= IOD_META_MMIO; + break; + case PCI_P2PDMA_MAP_NONE: + break; + default: return BLK_STS_RESOURCE; + } - sg_init_table(iod->meta_sgt.sgl, req->nr_integrity_segments); - iod->meta_sgt.orig_nents = blk_rq_map_integrity_sg(req, - iod->meta_sgt.sgl); - if (!iod->meta_sgt.orig_nents) - goto out_free_sg; + if (blk_rq_dma_map_coalesce(&iod->meta_dma_state)) + entries = 1; - rc = dma_map_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), - DMA_ATTR_NO_WARN); - if (rc) - goto out_free_sg; + /* + * The NVMe MPTR descriptor has an implicit length that the host and + * device must agree on to avoid data/memory corruption. We trust the + * kernel allocated correctly based on the format's parameters, so use + * the more efficient MPTR to avoid extra dma pool allocations for the + * SGL indirection. + * + * But for user commands, we don't necessarily know what they do, so + * the driver can't validate the metadata buffer size. The SGL + * descriptor provides an explicit length, so we're relying on that + * mechanism to catch any misunderstandings between the application and + * device. 
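+	 *
+	 * In short, the selection made just below is:
+	 *
+	 *	if (entries == 1 && !(nvme_req(req)->flags & NVME_REQ_USERCMD))
+	 *		use the implicit-length MPTR;
+	 *	else
+	 *		build an explicit-length metadata SGL;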
+ */ + if (entries == 1 && !(nvme_req(req)->flags & NVME_REQ_USERCMD)) { + iod->cmd.common.metadata = cpu_to_le64(iter.addr); + iod->meta_total_len = iter.len; + iod->meta_dma = iter.addr; + iod->meta_descriptor = NULL; + return BLK_STS_OK; + } sg_list = dma_pool_alloc(nvmeq->descriptor_pools.small, GFP_ATOMIC, &sgl_dma); if (!sg_list) - goto out_unmap_sg; + return BLK_STS_RESOURCE; - entries = iod->meta_sgt.nents; iod->meta_descriptor = sg_list; iod->meta_dma = sgl_dma; - iod->cmd.common.flags = NVME_CMD_SGL_METASEG; iod->cmd.common.metadata = cpu_to_le64(sgl_dma); - - sgl = iod->meta_sgt.sgl; if (entries == 1) { - nvme_pci_sgl_set_data_sg(sg_list, sgl); + iod->meta_total_len = iter.len; + nvme_pci_sgl_set_data(sg_list, &iter); return BLK_STS_OK; } sgl_dma += sizeof(*sg_list); - nvme_pci_sgl_set_seg(sg_list, sgl_dma, entries); - for_each_sg(sgl, sg, entries, i) - nvme_pci_sgl_set_data_sg(&sg_list[i + 1], sg); - - return BLK_STS_OK; + do { + nvme_pci_sgl_set_data(&sg_list[++i], &iter); + iod->meta_total_len += iter.len; + } while (blk_rq_integrity_dma_map_iter_next(req, dev->dev, &iter)); -out_unmap_sg: - dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0); -out_free_sg: - mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool); - return BLK_STS_RESOURCE; + nvme_pci_sgl_set_seg(sg_list, sgl_dma, i); + if (unlikely(iter.status)) + nvme_unmap_metadata(req); + return iter.status; } static blk_status_t nvme_pci_setup_meta_mptr(struct request *req) @@ -1151,6 +1244,7 @@ static blk_status_t nvme_pci_setup_meta_mptr(struct request *req) if (dma_mapping_error(nvmeq->dev->dev, iod->meta_dma)) return BLK_STS_IOERR; iod->cmd.common.metadata = cpu_to_le64(iod->meta_dma); + iod->flags |= IOD_SINGLE_META_SEGMENT; return BLK_STS_OK; } @@ -1172,7 +1266,7 @@ static blk_status_t nvme_prep_rq(struct request *req) iod->flags = 0; iod->nr_descriptors = 0; iod->total_len = 0; - iod->meta_sgt.nents = 0; + iod->meta_total_len = 0; ret = nvme_setup_cmd(req->q->queuedata, req); if (ret) @@ -1283,25 +1377,6 @@ static void nvme_queue_rqs(struct rq_list *rqlist) *rqlist = requeue_list; } -static __always_inline void nvme_unmap_metadata(struct request *req) -{ - struct nvme_iod *iod = blk_mq_rq_to_pdu(req); - struct nvme_queue *nvmeq = req->mq_hctx->driver_data; - struct nvme_dev *dev = nvmeq->dev; - - if (!iod->meta_sgt.nents) { - dma_unmap_page(dev->dev, iod->meta_dma, - rq_integrity_vec(req).bv_len, - rq_dma_dir(req)); - return; - } - - dma_pool_free(nvmeq->descriptor_pools.small, iod->meta_descriptor, - iod->meta_dma); - dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0); - mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool); -} - static __always_inline void nvme_pci_unmap_rq(struct request *req) { if (blk_integrity_rq(req)) @@ -3107,7 +3182,6 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown) static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev) { - size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1); size_t alloc_size = sizeof(struct nvme_dma_vec) * NVME_MAX_SEGS; dev->dmavec_mempool = mempool_create_node(1, @@ -3116,17 +3190,7 @@ static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev) dev_to_node(dev->dev)); if (!dev->dmavec_mempool) return -ENOMEM; - - dev->iod_meta_mempool = mempool_create_node(1, - mempool_kmalloc, mempool_kfree, - (void *)meta_size, GFP_KERNEL, - dev_to_node(dev->dev)); - if (!dev->iod_meta_mempool) - goto free; return 0; -free: - mempool_destroy(dev->dmavec_mempool); - return -ENOMEM; } static void 
nvme_free_tagset(struct nvme_dev *dev) @@ -3578,7 +3642,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) nvme_free_queues(dev, 0); out_release_iod_mempool: mempool_destroy(dev->dmavec_mempool); - mempool_destroy(dev->iod_meta_mempool); out_dev_unmap: nvme_dev_unmap(dev); out_uninit_ctrl: @@ -3642,7 +3705,6 @@ static void nvme_remove(struct pci_dev *pdev) nvme_dbbuf_dma_free(dev); nvme_free_queues(dev, 0); mempool_destroy(dev->dmavec_mempool); - mempool_destroy(dev->iod_meta_mempool); nvme_release_descriptor_pools(dev); nvme_dev_unmap(dev); nvme_uninit_ctrl(&dev->ctrl); diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c index feef537257d05..4e70717143569 100644 --- a/drivers/parisc/ccio-dma.c +++ b/drivers/parisc/ccio-dma.c @@ -517,10 +517,10 @@ static u32 hint_lookup[] = { * ccio_io_pdir_entry - Initialize an I/O Pdir. * @pdir_ptr: A pointer into I/O Pdir. * @sid: The Space Identifier. - * @vba: The virtual address. + * @pba: The physical address. * @hints: The DMA Hint. * - * Given a virtual address (vba, arg2) and space id, (sid, arg1), + * Given a physical address (pba, arg2) and space id, (sid, arg1), * load the I/O PDIR entry pointed to by pdir_ptr (arg0). Each IO Pdir * entry consists of 8 bytes as shown below (MSB == bit 0): * @@ -543,7 +543,7 @@ static u32 hint_lookup[] = { * index are bits 12:19 of the value returned by LCI. */ static void -ccio_io_pdir_entry(__le64 *pdir_ptr, space_t sid, unsigned long vba, +ccio_io_pdir_entry(__le64 *pdir_ptr, space_t sid, phys_addr_t pba, unsigned long hints) { register unsigned long pa; @@ -557,7 +557,7 @@ ccio_io_pdir_entry(__le64 *pdir_ptr, space_t sid, unsigned long vba, ** "hints" parm includes the VALID bit! ** "dep" clobbers the physical address offset bits as well. */ - pa = lpa(vba); + pa = pba; asm volatile("depw %1,31,12,%0" : "+r" (pa) : "r" (hints)); ((u32 *)pdir_ptr)[1] = (u32) pa; @@ -582,7 +582,7 @@ ccio_io_pdir_entry(__le64 *pdir_ptr, space_t sid, unsigned long vba, ** Grab virtual index [0:11] ** Deposit virt_idx bits into I/O PDIR word */ - asm volatile ("lci %%r0(%1), %0" : "=r" (ci) : "r" (vba)); + asm volatile ("lci %%r0(%1), %0" : "=r" (ci) : "r" (phys_to_virt(pba))); asm volatile ("extru %1,19,12,%0" : "+r" (ci) : "r" (ci)); asm volatile ("depw %1,15,12,%0" : "+r" (pa) : "r" (ci)); @@ -704,14 +704,14 @@ ccio_dma_supported(struct device *dev, u64 mask) /** * ccio_map_single - Map an address range into the IOMMU. * @dev: The PCI device. - * @addr: The start address of the DMA region. + * @addr: The physical address of the DMA region. * @size: The length of the DMA region. * @direction: The direction of the DMA transaction (to/from device). * * This function implements the pci_map_single function. 
*/ static dma_addr_t -ccio_map_single(struct device *dev, void *addr, size_t size, +ccio_map_single(struct device *dev, phys_addr_t addr, size_t size, enum dma_data_direction direction) { int idx; @@ -730,7 +730,7 @@ ccio_map_single(struct device *dev, void *addr, size_t size, BUG_ON(size <= 0); /* save offset bits */ - offset = ((unsigned long) addr) & ~IOVP_MASK; + offset = offset_in_page(addr); /* round up to nearest IOVP_SIZE */ size = ALIGN(size + offset, IOVP_SIZE); @@ -746,15 +746,15 @@ ccio_map_single(struct device *dev, void *addr, size_t size, pdir_start = &(ioc->pdir_base[idx]); - DBG_RUN("%s() %px -> %#lx size: %zu\n", - __func__, addr, (long)(iovp | offset), size); + DBG_RUN("%s() %pa -> %#lx size: %zu\n", + __func__, &addr, (long)(iovp | offset), size); /* If not cacheline aligned, force SAFE_DMA on the whole mess */ - if((size % L1_CACHE_BYTES) || ((unsigned long)addr % L1_CACHE_BYTES)) + if ((size % L1_CACHE_BYTES) || (addr % L1_CACHE_BYTES)) hint |= HINT_SAFE_DMA; while(size > 0) { - ccio_io_pdir_entry(pdir_start, KERNEL_SPACE, (unsigned long)addr, hint); + ccio_io_pdir_entry(pdir_start, KERNEL_SPACE, addr, hint); DBG_RUN(" pdir %p %08x%08x\n", pdir_start, @@ -773,17 +773,18 @@ ccio_map_single(struct device *dev, void *addr, size_t size, static dma_addr_t -ccio_map_page(struct device *dev, struct page *page, unsigned long offset, - size_t size, enum dma_data_direction direction, - unsigned long attrs) +ccio_map_phys(struct device *dev, phys_addr_t phys, size_t size, + enum dma_data_direction direction, unsigned long attrs) { - return ccio_map_single(dev, page_address(page) + offset, size, - direction); + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; + + return ccio_map_single(dev, phys, size, direction); } /** - * ccio_unmap_page - Unmap an address range from the IOMMU. + * ccio_unmap_phys - Unmap an address range from the IOMMU. * @dev: The PCI device. * @iova: The start address of the DMA region. * @size: The length of the DMA region. @@ -791,7 +792,7 @@ ccio_map_page(struct device *dev, struct page *page, unsigned long offset, * @attrs: attributes */ static void -ccio_unmap_page(struct device *dev, dma_addr_t iova, size_t size, +ccio_unmap_phys(struct device *dev, dma_addr_t iova, size_t size, enum dma_data_direction direction, unsigned long attrs) { struct ioc *ioc; @@ -853,7 +854,8 @@ ccio_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flag, if (ret) { memset(ret, 0, size); - *dma_handle = ccio_map_single(dev, ret, size, DMA_BIDIRECTIONAL); + *dma_handle = ccio_map_single(dev, virt_to_phys(ret), size, + DMA_BIDIRECTIONAL); } return ret; @@ -873,7 +875,7 @@ static void ccio_free(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_handle, unsigned long attrs) { - ccio_unmap_page(dev, dma_handle, size, 0, 0); + ccio_unmap_phys(dev, dma_handle, size, 0, 0); free_pages((unsigned long)cpu_addr, get_order(size)); } @@ -920,7 +922,7 @@ ccio_map_sg(struct device *dev, struct scatterlist *sglist, int nents, /* Fast path single entry scatterlists. 
*/ if (nents == 1) { sg_dma_address(sglist) = ccio_map_single(dev, - sg_virt(sglist), sglist->length, + sg_phys(sglist), sglist->length, direction); sg_dma_len(sglist) = sglist->length; return 1; @@ -1004,7 +1006,7 @@ ccio_unmap_sg(struct device *dev, struct scatterlist *sglist, int nents, #ifdef CCIO_COLLECT_STATS ioc->usg_pages += sg_dma_len(sglist) >> PAGE_SHIFT; #endif - ccio_unmap_page(dev, sg_dma_address(sglist), + ccio_unmap_phys(dev, sg_dma_address(sglist), sg_dma_len(sglist), direction, 0); ++sglist; nents--; @@ -1017,8 +1019,8 @@ static const struct dma_map_ops ccio_ops = { .dma_supported = ccio_dma_supported, .alloc = ccio_alloc, .free = ccio_free, - .map_page = ccio_map_page, - .unmap_page = ccio_unmap_page, + .map_phys = ccio_map_phys, + .unmap_phys = ccio_unmap_phys, .map_sg = ccio_map_sg, .unmap_sg = ccio_unmap_sg, .get_sgtable = dma_common_get_sgtable, @@ -1072,7 +1074,7 @@ static int ccio_proc_info(struct seq_file *m, void *p) ioc->msingle_calls, ioc->msingle_pages, (int)((ioc->msingle_pages * 1000)/ioc->msingle_calls)); - /* KLUGE - unmap_sg calls unmap_page for each mapped page */ + /* KLUGE - unmap_sg calls unmap_phys for each mapped page */ min = ioc->usingle_calls - ioc->usg_calls; max = ioc->usingle_pages - ioc->usg_pages; seq_printf(m, "pci_unmap_single: %8ld calls %8ld pages (avg %d/1000)\n", diff --git a/drivers/parisc/iommu-helpers.h b/drivers/parisc/iommu-helpers.h index c43f1a212a5c8..0691884f50959 100644 --- a/drivers/parisc/iommu-helpers.h +++ b/drivers/parisc/iommu-helpers.h @@ -14,7 +14,7 @@ static inline unsigned int iommu_fill_pdir(struct ioc *ioc, struct scatterlist *startsg, int nents, unsigned long hint, - void (*iommu_io_pdir_entry)(__le64 *, space_t, unsigned long, + void (*iommu_io_pdir_entry)(__le64 *, space_t, phys_addr_t, unsigned long)) { struct scatterlist *dma_sg = startsg; /* pointer to current DMA */ @@ -28,7 +28,7 @@ iommu_fill_pdir(struct ioc *ioc, struct scatterlist *startsg, int nents, dma_sg--; while (nents-- > 0) { - unsigned long vaddr; + phys_addr_t paddr; long size; DBG_RUN_SG(" %d : %08lx %p/%05x\n", nents, @@ -67,7 +67,7 @@ iommu_fill_pdir(struct ioc *ioc, struct scatterlist *startsg, int nents, BUG_ON(pdirp == NULL); - vaddr = (unsigned long)sg_virt(startsg); + paddr = sg_phys(startsg); sg_dma_len(dma_sg) += startsg->length; size = startsg->length + dma_offset; dma_offset = 0; @@ -76,8 +76,8 @@ iommu_fill_pdir(struct ioc *ioc, struct scatterlist *startsg, int nents, #endif do { iommu_io_pdir_entry(pdirp, KERNEL_SPACE, - vaddr, hint); - vaddr += IOVP_SIZE; + paddr, hint); + paddr += IOVP_SIZE; size -= IOVP_SIZE; pdirp++; } while(unlikely(size > 0)); diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c index fc3863c09f83d..eefb2bac8443f 100644 --- a/drivers/parisc/sba_iommu.c +++ b/drivers/parisc/sba_iommu.c @@ -532,7 +532,7 @@ typedef unsigned long space_t; * sba_io_pdir_entry - fill in one IO PDIR entry * @pdir_ptr: pointer to IO PDIR entry * @sid: process Space ID - currently only support KERNEL_SPACE - * @vba: Virtual CPU address of buffer to map + * @pba: Physical address of buffer to map * @hint: DMA hint set to use for this mapping * * SBA Mapping Routine @@ -569,20 +569,17 @@ typedef unsigned long space_t; */ static void -sba_io_pdir_entry(__le64 *pdir_ptr, space_t sid, unsigned long vba, +sba_io_pdir_entry(__le64 *pdir_ptr, space_t sid, phys_addr_t pba, unsigned long hint) { - u64 pa; /* physical address */ register unsigned ci; /* coherent index */ - pa = lpa(vba); - pa &= IOVP_MASK; + asm("lci 0(%1), %0" 
: "=r" (ci) : "r" (phys_to_virt(pba))); + pba &= IOVP_MASK; + pba |= (ci >> PAGE_SHIFT) & 0xff; /* move CI (8 bits) into lowest byte */ - asm("lci 0(%1), %0" : "=r" (ci) : "r" (vba)); - pa |= (ci >> PAGE_SHIFT) & 0xff; /* move CI (8 bits) into lowest byte */ - - pa |= SBA_PDIR_VALID_BIT; /* set "valid" bit */ - *pdir_ptr = cpu_to_le64(pa); /* swap and store into I/O Pdir */ + /* set "valid" bit, swap and store into I/O Pdir */ + *pdir_ptr = cpu_to_le64((unsigned long)pba | SBA_PDIR_VALID_BIT); /* * If the PDC_MODEL capabilities has Non-coherent IO-PDIR bit set @@ -707,7 +704,7 @@ static int sba_dma_supported( struct device *dev, u64 mask) * See Documentation/core-api/dma-api-howto.rst */ static dma_addr_t -sba_map_single(struct device *dev, void *addr, size_t size, +sba_map_single(struct device *dev, phys_addr_t addr, size_t size, enum dma_data_direction direction) { struct ioc *ioc; @@ -722,7 +719,7 @@ sba_map_single(struct device *dev, void *addr, size_t size, return DMA_MAPPING_ERROR; /* save offset bits */ - offset = ((dma_addr_t) (long) addr) & ~IOVP_MASK; + offset = offset_in_page(addr); /* round up to nearest IOVP_SIZE */ size = (size + offset + ~IOVP_MASK) & IOVP_MASK; @@ -739,13 +736,13 @@ sba_map_single(struct device *dev, void *addr, size_t size, pide = sba_alloc_range(ioc, dev, size); iovp = (dma_addr_t) pide << IOVP_SHIFT; - DBG_RUN("%s() 0x%p -> 0x%lx\n", - __func__, addr, (long) iovp | offset); + DBG_RUN("%s() 0x%pa -> 0x%lx\n", + __func__, &addr, (long) iovp | offset); pdir_start = &(ioc->pdir_base[pide]); while (size > 0) { - sba_io_pdir_entry(pdir_start, KERNEL_SPACE, (unsigned long) addr, 0); + sba_io_pdir_entry(pdir_start, KERNEL_SPACE, addr, 0); DBG_RUN(" pdir 0x%p %02x%02x%02x%02x%02x%02x%02x%02x\n", pdir_start, @@ -778,17 +775,18 @@ sba_map_single(struct device *dev, void *addr, size_t size, static dma_addr_t -sba_map_page(struct device *dev, struct page *page, unsigned long offset, - size_t size, enum dma_data_direction direction, - unsigned long attrs) +sba_map_phys(struct device *dev, phys_addr_t phys, size_t size, + enum dma_data_direction direction, unsigned long attrs) { - return sba_map_single(dev, page_address(page) + offset, size, - direction); + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; + + return sba_map_single(dev, phys, size, direction); } /** - * sba_unmap_page - unmap one IOVA and free resources + * sba_unmap_phys - unmap one IOVA and free resources * @dev: instance of PCI owned by the driver that's asking. * @iova: IOVA of driver buffer previously mapped. * @size: number of bytes mapped in driver buffer. 
@@ -798,7 +796,7 @@ sba_map_page(struct device *dev, struct page *page, unsigned long offset, * See Documentation/core-api/dma-api-howto.rst */ static void -sba_unmap_page(struct device *dev, dma_addr_t iova, size_t size, +sba_unmap_phys(struct device *dev, dma_addr_t iova, size_t size, enum dma_data_direction direction, unsigned long attrs) { struct ioc *ioc; @@ -893,7 +891,7 @@ static void *sba_alloc(struct device *hwdev, size_t size, dma_addr_t *dma_handle if (ret) { memset(ret, 0, size); - *dma_handle = sba_map_single(hwdev, ret, size, 0); + *dma_handle = sba_map_single(hwdev, virt_to_phys(ret), size, 0); } return ret; @@ -914,7 +912,7 @@ static void sba_free(struct device *hwdev, size_t size, void *vaddr, dma_addr_t dma_handle, unsigned long attrs) { - sba_unmap_page(hwdev, dma_handle, size, 0, 0); + sba_unmap_phys(hwdev, dma_handle, size, 0, 0); free_pages((unsigned long) vaddr, get_order(size)); } @@ -962,7 +960,7 @@ sba_map_sg(struct device *dev, struct scatterlist *sglist, int nents, /* Fast path single entry scatterlists. */ if (nents == 1) { - sg_dma_address(sglist) = sba_map_single(dev, sg_virt(sglist), + sg_dma_address(sglist) = sba_map_single(dev, sg_phys(sglist), sglist->length, direction); sg_dma_len(sglist) = sglist->length; return 1; @@ -1061,7 +1059,7 @@ sba_unmap_sg(struct device *dev, struct scatterlist *sglist, int nents, while (nents && sg_dma_len(sglist)) { - sba_unmap_page(dev, sg_dma_address(sglist), sg_dma_len(sglist), + sba_unmap_phys(dev, sg_dma_address(sglist), sg_dma_len(sglist), direction, 0); #ifdef SBA_COLLECT_STATS ioc->usg_pages += ((sg_dma_address(sglist) & ~IOVP_MASK) + sg_dma_len(sglist) + IOVP_SIZE - 1) >> PAGE_SHIFT; @@ -1085,8 +1083,8 @@ static const struct dma_map_ops sba_ops = { .dma_supported = sba_dma_supported, .alloc = sba_alloc, .free = sba_free, - .map_page = sba_map_page, - .unmap_page = sba_unmap_page, + .map_phys = sba_map_phys, + .unmap_phys = sba_unmap_phys, .map_sg = sba_map_sg, .unmap_sg = sba_unmap_sg, .get_sgtable = dma_common_get_sgtable, diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 1cb5e423eed4f..dbb4fc9558174 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -25,12 +25,12 @@ struct pci_p2pdma { struct gen_pool *pool; bool p2pmem_published; struct xarray map_types; + struct p2pdma_provider mem[PCI_STD_NUM_BARS]; }; struct pci_p2pdma_pagemap { - struct pci_dev *provider; - u64 bus_offset; struct dev_pagemap pgmap; + struct p2pdma_provider *mem; }; static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap) @@ -204,8 +204,8 @@ static void p2pdma_page_free(struct page *page) { struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); /* safe to dereference while a reference is held to the percpu ref */ - struct pci_p2pdma *p2pdma = - rcu_dereference_protected(pgmap->provider->p2pdma, 1); + struct pci_p2pdma *p2pdma = rcu_dereference_protected( + to_pci_dev(pgmap->mem->owner)->p2pdma, 1); struct percpu_ref *ref; gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page), @@ -228,56 +228,136 @@ static void pci_p2pdma_release(void *data) /* Flush and disable pci_alloc_p2p_mem() */ pdev->p2pdma = NULL; - synchronize_rcu(); + if (p2pdma->pool) + synchronize_rcu(); + xa_destroy(&p2pdma->map_types); + + if (!p2pdma->pool) + return; gen_pool_destroy(p2pdma->pool); sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); - xa_destroy(&p2pdma->map_types); } -static int pci_p2pdma_setup(struct pci_dev *pdev) +/** + * pcim_p2pdma_init - Initialise peer-to-peer DMA providers + * @pdev: The 
PCI device to enable P2PDMA for + * + * This function initializes the peer-to-peer DMA infrastructure + * for a PCI device. It allocates and sets up the necessary data + * structures to support P2PDMA operations, including mapping type + * tracking. + */ +int pcim_p2pdma_init(struct pci_dev *pdev) { - int error = -ENOMEM; struct pci_p2pdma *p2p; + int i, ret; + + p2p = rcu_dereference_protected(pdev->p2pdma, 1); + if (p2p) + return 0; p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL); if (!p2p) return -ENOMEM; xa_init(&p2p->map_types); + /* + * Iterate over all standard PCI BARs and record only those that + * correspond to MMIO regions. Skip non-memory resources (e.g. I/O + * port BARs) since they cannot be used for peer-to-peer (P2P) + * transactions. + */ + for (i = 0; i < PCI_STD_NUM_BARS; i++) { + if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM)) + continue; - p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); - if (!p2p->pool) - goto out; + p2p->mem[i].owner = &pdev->dev; + p2p->mem[i].bus_offset = + pci_bus_address(pdev, i) - pci_resource_start(pdev, i); + } - error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); - if (error) - goto out_pool_destroy; + ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); + if (ret) + goto out_p2p; - error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); - if (error) + rcu_assign_pointer(pdev->p2pdma, p2p); + return 0; + +out_p2p: + devm_kfree(&pdev->dev, p2p); + return ret; +} +EXPORT_SYMBOL_GPL(pcim_p2pdma_init); + +/** + * pcim_p2pdma_provider - Get peer-to-peer DMA provider + * @pdev: The PCI device to enable P2PDMA for + * @bar: BAR index to get provider + * + * This function gets peer-to-peer DMA provider for a PCI device. The lifetime + * of the provider (and of course the MMIO) is bound to the lifetime of the + * driver. A driver calling this function must ensure that all references to the + * provider, and any DMA mappings created for any MMIO, are all cleaned up + * before the driver remove() completes. + * + * Since P2P is almost always shared with a second driver this means some system + * to notify, invalidate and revoke the MMIO's DMA must be in place to use this + * function. For example a revoke can be built using DMABUF. 
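A minimal provider-side sketch of the calling convention described above; the probe function, BAR index, and error handling are illustrative assumptions, and only pcim_p2pdma_init() and pcim_p2pdma_provider() are the interfaces added by this patch:

	static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		struct p2pdma_provider *provider;
		int ret;

		ret = pcim_p2pdma_init(pdev);
		if (ret)
			return ret;

		/* hypothetical: BAR 2 carries the MMIO to be exported */
		provider = pcim_p2pdma_provider(pdev, 2);
		if (!provider)
			return -EINVAL;

		/*
		 * Hand the provider to the consuming driver, e.g. by wrapping
		 * the MMIO in a DMABUF whose revoke runs before this driver's
		 * remove() completes, as required above.
		 */
		return 0;
	}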
+ */ +struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar) +{ + struct pci_p2pdma *p2p; + + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) + return NULL; + + p2p = rcu_dereference_protected(pdev->p2pdma, 1); + if (WARN_ON(!p2p)) + /* Someone forgot to call to pcim_p2pdma_init() before */ + return NULL; + + return &p2p->mem[bar]; +} +EXPORT_SYMBOL_GPL(pcim_p2pdma_provider); + +static int pci_p2pdma_setup_pool(struct pci_dev *pdev) +{ + struct pci_p2pdma *p2pdma; + int ret; + + p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); + if (p2pdma->pool) + /* We already setup pools, do nothing, */ + return 0; + + p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); + if (!p2pdma->pool) + return -ENOMEM; + + ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); + if (ret) goto out_pool_destroy; - rcu_assign_pointer(pdev->p2pdma, p2p); return 0; out_pool_destroy: - gen_pool_destroy(p2p->pool); -out: - devm_kfree(&pdev->dev, p2p); - return error; + gen_pool_destroy(p2pdma->pool); + p2pdma->pool = NULL; + return ret; } static void pci_p2pdma_unmap_mappings(void *data) { - struct pci_dev *pdev = data; + struct pci_p2pdma_pagemap *p2p_pgmap = data; /* * Removing the alloc attribute from sysfs will call * unmap_mapping_range() on the inode, teardown any existing userspace * mappings and prevent new ones from being created. */ - sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr, + sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj, + &p2pmem_alloc_attr.attr, p2pmem_group.name); } @@ -295,6 +375,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { struct pci_p2pdma_pagemap *p2p_pgmap; + struct p2pdma_provider *mem; struct dev_pagemap *pgmap; struct pci_p2pdma *p2pdma; void *addr; @@ -312,11 +393,21 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, if (size + offset > pci_resource_len(pdev, bar)) return -EINVAL; - if (!pdev->p2pdma) { - error = pci_p2pdma_setup(pdev); - if (error) - return error; - } + error = pcim_p2pdma_init(pdev); + if (error) + return error; + + error = pci_p2pdma_setup_pool(pdev); + if (error) + return error; + + mem = pcim_p2pdma_provider(pdev, bar); + /* + * We checked validity of BAR prior to call + * to pcim_p2pdma_provider. It should never return NULL. 
+ */ + if (WARN_ON(!mem)) + return -EINVAL; p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL); if (!p2p_pgmap) @@ -328,10 +419,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, pgmap->nr_range = 1; pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; pgmap->ops = &p2pdma_pgmap_ops; - - p2p_pgmap->provider = pdev; - p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) - - pci_resource_start(pdev, bar); + p2p_pgmap->mem = mem; addr = devm_memremap_pages(&pdev->dev, pgmap); if (IS_ERR(addr)) { @@ -340,7 +428,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, } error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings, - pdev); + p2p_pgmap); if (error) goto pages_free; @@ -973,16 +1061,26 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) } EXPORT_SYMBOL_GPL(pci_p2pmem_publish); -static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, - struct device *dev) +/** + * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers + * @provider: P2PDMA provider structure + * @dev: Target device for the transfer + * + * Determines how peer-to-peer DMA transfers should be mapped between + * the provider and the target device. The mapping type indicates whether + * the transfer can be done directly through PCI switches or must go + * through the host bridge. + */ +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, + struct device *dev) { enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED; - struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider; + struct pci_dev *pdev = to_pci_dev(provider->owner); struct pci_dev *client; struct pci_p2pdma *p2pdma; int dist; - if (!provider->p2pdma) + if (!pdev->p2pdma) return PCI_P2PDMA_MAP_NOT_SUPPORTED; if (!dev_is_pci(dev)) @@ -991,7 +1089,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, client = to_pci_dev(dev); rcu_read_lock(); - p2pdma = rcu_dereference(provider->p2pdma); + p2pdma = rcu_dereference(pdev->p2pdma); if (p2pdma) type = xa_to_value(xa_load(&p2pdma->map_types, @@ -999,7 +1097,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, rcu_read_unlock(); if (type == PCI_P2PDMA_MAP_UNKNOWN) - return calc_map_type_and_dist(provider, client, &dist, true); + return calc_map_type_and_dist(pdev, client, &dist, true); return type; } @@ -1007,9 +1105,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page) { - state->pgmap = page_pgmap(page); - state->map = pci_p2pdma_map_type(state->pgmap, dev); - state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; + struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page)); + + if (state->mem == p2p_pgmap->mem) + return; + + state->mem = p2p_pgmap->mem; + state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev); } /** diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c index ae437791b5f8c..0e9281e23fcba 100644 --- a/drivers/perf/arm_pmu.c +++ b/drivers/perf/arm_pmu.c @@ -743,6 +743,21 @@ static int arm_perf_teardown_cpu(unsigned int cpu, struct hlist_node *node) return 0; } +void arm_pmu_set_phys_irq(bool enable) +{ + int cpu = get_cpu(); + struct arm_pmu *pmu = per_cpu(cpu_armpmu, cpu); + int irq; + + irq = armpmu_get_cpu_irq(pmu, cpu); + if (irq && !enable) + per_cpu(cpu_irq_ops, cpu)->disable_pmuirq(irq); + else if (irq && enable) + per_cpu(cpu_irq_ops, 
cpu)->enable_pmuirq(irq); + + put_cpu(); +} + #ifdef CONFIG_CPU_PM static void cpu_pm_pmu_setup(struct arm_pmu *armpmu, unsigned long cmd) { diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig index 559fb9d3271fc..857cf288c876a 100644 --- a/drivers/vdpa/Kconfig +++ b/drivers/vdpa/Kconfig @@ -34,13 +34,7 @@ config VDPA_SIM_BLOCK config VDPA_USER tristate "VDUSE (vDPA Device in Userspace) support" - depends on EVENTFD && MMU && HAS_DMA - # - # This driver incorrectly tries to override the dma_ops. It should - # never have done that, but for now keep it working on architectures - # that use dma ops - # - depends on ARCH_HAS_DMA_OPS + depends on EVENTFD && MMU select VHOST_IOTLB select IOMMU_IOVA help diff --git a/drivers/vdpa/alibaba/eni_vdpa.c b/drivers/vdpa/alibaba/eni_vdpa.c index ad7f3447fe90c..e476504db0c82 100644 --- a/drivers/vdpa/alibaba/eni_vdpa.c +++ b/drivers/vdpa/alibaba/eni_vdpa.c @@ -478,7 +478,8 @@ static int eni_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id) return ret; eni_vdpa = vdpa_alloc_device(struct eni_vdpa, vdpa, - dev, &eni_vdpa_ops, 1, 1, NULL, false); + dev, &eni_vdpa_ops, NULL, + 1, 1, NULL, false); if (IS_ERR(eni_vdpa)) { ENI_ERR(pdev, "failed to allocate vDPA structure\n"); return PTR_ERR(eni_vdpa); @@ -496,7 +497,7 @@ static int eni_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id) pci_set_master(pdev); pci_set_drvdata(pdev, eni_vdpa); - eni_vdpa->vdpa.dma_dev = &pdev->dev; + eni_vdpa->vdpa.vmap.dma_dev = &pdev->dev; eni_vdpa->queues = eni_vdpa_get_num_queues(eni_vdpa); eni_vdpa->vring = devm_kcalloc(&pdev->dev, eni_vdpa->queues, diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c index ccf64d7bbfaa2..6658dc74d9150 100644 --- a/drivers/vdpa/ifcvf/ifcvf_main.c +++ b/drivers/vdpa/ifcvf/ifcvf_main.c @@ -705,7 +705,8 @@ static int ifcvf_vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name, vf = &ifcvf_mgmt_dev->vf; pdev = vf->pdev; adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa, - &pdev->dev, &ifc_vdpa_ops, 1, 1, NULL, false); + &pdev->dev, &ifc_vdpa_ops, + NULL, 1, 1, NULL, false); if (IS_ERR(adapter)) { IFCVF_ERR(pdev, "Failed to allocate vDPA structure"); return PTR_ERR(adapter); @@ -713,7 +714,7 @@ static int ifcvf_vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name, ifcvf_mgmt_dev->adapter = adapter; adapter->pdev = pdev; - adapter->vdpa.dma_dev = &pdev->dev; + adapter->vdpa.vmap.dma_dev = &pdev->dev; adapter->vdpa.mdev = mdev; adapter->vf = vf; vdpa_dev = &adapter->vdpa; diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index c7a20278bc3ca..8870a7169267e 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -378,7 +378,7 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr u64 pa, offset; u64 paend; struct scatterlist *sg; - struct device *dma = mvdev->vdev.dma_dev; + struct device *dma = mvdev->vdev.vmap.dma_dev; for (map = vhost_iotlb_itree_first(iotlb, mr->start, mr->end - 1); map; map = vhost_iotlb_itree_next(map, mr->start, mr->end - 1)) { @@ -432,7 +432,7 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr static void unmap_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr *mr) { - struct device *dma = mvdev->vdev.dma_dev; + struct device *dma = mvdev->vdev.vmap.dma_dev; destroy_direct_mr(mvdev, mr); dma_unmap_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, DMA_BIDIRECTIONAL, 0); diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 53cc9ef01e9f7..a7936bd1aabe1 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -3393,14 +3393,17 @@ static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid) return err; } -static struct device *mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx) +static union virtio_map mlx5_get_vq_map(struct vdpa_device *vdev, u16 idx) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); + union virtio_map map; if (is_ctrl_vq_idx(mvdev, idx)) - return &vdev->dev; + map.dma_dev = &vdev->dev; + else + map.dma_dev = mvdev->vdev.vmap.dma_dev; - return mvdev->vdev.dma_dev; + return map; } static void free_irqs(struct mlx5_vdpa_net *ndev) @@ -3684,7 +3687,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = { .set_map = mlx5_vdpa_set_map, .reset_map = mlx5_vdpa_reset_map, .set_group_asid = mlx5_set_group_asid, - .get_vq_dma_dev = mlx5_get_vq_dma_dev, + .get_vq_map = mlx5_get_vq_map, .free = mlx5_vdpa_free, .suspend = mlx5_vdpa_suspend, .resume = mlx5_vdpa_resume, /* Op disabled if not supported. */ @@ -3877,7 +3880,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name, } ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mgtdev->vdpa_ops, - MLX5_VDPA_NUMVQ_GROUPS, MLX5_VDPA_NUM_AS, name, false); + NULL, MLX5_VDPA_NUMVQ_GROUPS, MLX5_VDPA_NUM_AS, name, false); if (IS_ERR(ndev)) return PTR_ERR(ndev); @@ -3963,7 +3966,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name, } ndev->mvdev.mlx_features = device_features; - mvdev->vdev.dma_dev = &mdev->pdev->dev; + mvdev->vdev.vmap.dma_dev = &mdev->pdev->dev; err = mlx5_vdpa_alloc_resources(&ndev->mvdev); if (err) goto err_alloc; diff --git a/drivers/vdpa/octeon_ep/octep_vdpa_main.c b/drivers/vdpa/octeon_ep/octep_vdpa_main.c index 9b49efd24391e..9e8d07078606f 100644 --- a/drivers/vdpa/octeon_ep/octep_vdpa_main.c +++ b/drivers/vdpa/octeon_ep/octep_vdpa_main.c @@ -508,15 +508,15 @@ static int octep_vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name, u64 device_features; int ret; - oct_vdpa = vdpa_alloc_device(struct octep_vdpa, vdpa, &pdev->dev, &octep_vdpa_ops, 1, 1, - NULL, false); + oct_vdpa = vdpa_alloc_device(struct octep_vdpa, vdpa, &pdev->dev, &octep_vdpa_ops, + NULL, 1, 1, NULL, false); if (IS_ERR(oct_vdpa)) { dev_err(&pdev->dev, "Failed to allocate vDPA structure for octep vdpa device"); return PTR_ERR(oct_vdpa); } oct_vdpa->pdev = pdev; - oct_vdpa->vdpa.dma_dev = &pdev->dev; + oct_vdpa->vdpa.vmap.dma_dev = &pdev->dev; oct_vdpa->vdpa.mdev = mdev; oct_vdpa->oct_hw = oct_hw; vdpa_dev = &oct_vdpa->vdpa; diff --git a/drivers/vdpa/pds/vdpa_dev.c b/drivers/vdpa/pds/vdpa_dev.c index 301d95e085960..36f61cc96e211 100644 --- a/drivers/vdpa/pds/vdpa_dev.c +++ b/drivers/vdpa/pds/vdpa_dev.c @@ -632,7 +632,8 @@ static int pds_vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name, } pdsv = vdpa_alloc_device(struct pds_vdpa_device, vdpa_dev, - dev, &pds_vdpa_ops, 1, 1, name, false); + dev, &pds_vdpa_ops, NULL, + 1, 1, name, false); if (IS_ERR(pdsv)) { dev_err(dev, "Failed to allocate vDPA structure: %pe\n", pdsv); return PTR_ERR(pdsv); @@ -643,7 +644,7 @@ static int pds_vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name, pdev = vdpa_aux->padev->vf_pdev; dma_dev = &pdev->dev; - pdsv->vdpa_dev.dma_dev = dma_dev; + pdsv->vdpa_dev.vmap.dma_dev = dma_dev; status = pds_vdpa_get_status(&pdsv->vdpa_dev); if (status == 0xff) { diff --git a/drivers/vdpa/solidrun/snet_main.c 
b/drivers/vdpa/solidrun/snet_main.c index 55ec51c17ab35..4588211d57ebc 100644 --- a/drivers/vdpa/solidrun/snet_main.c +++ b/drivers/vdpa/solidrun/snet_main.c @@ -1008,8 +1008,8 @@ static int snet_vdpa_probe_vf(struct pci_dev *pdev) } /* Allocate vdpa device */ - snet = vdpa_alloc_device(struct snet, vdpa, &pdev->dev, &snet_config_ops, 1, 1, NULL, - false); + snet = vdpa_alloc_device(struct snet, vdpa, &pdev->dev, &snet_config_ops, + NULL, 1, 1, NULL, false); if (!snet) { SNET_ERR(pdev, "Failed to allocate a vdpa device\n"); ret = -ENOMEM; @@ -1052,8 +1052,8 @@ static int snet_vdpa_probe_vf(struct pci_dev *pdev) */ snet_reserve_irq_idx(pf_irqs ? pdev_pf : pdev, snet); - /*set DMA device*/ - snet->vdpa.dma_dev = &pdev->dev; + /* set map metadata */ + snet->vdpa.vmap.dma_dev = &pdev->dev; /* Register VDPA device */ ret = vdpa_register_device(&snet->vdpa, snet->cfg->vq_num); diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index 8a372b51c21ad..34874beb0152e 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -142,6 +142,7 @@ static void vdpa_release_dev(struct device *d) * initialized but before registered. * @parent: the parent device * @config: the bus operations that is supported by this device + * @map: the map operations that is supported by this device * @ngroups: number of groups supported by this device * @nas: number of address spaces supported by this device * @size: size of the parent structure that contains private data @@ -151,11 +152,12 @@ static void vdpa_release_dev(struct device *d) * Driver should use vdpa_alloc_device() wrapper macro instead of * using this directly. * - * Return: Returns an error when parent/config/dma_dev is not set or fail to get + * Return: Returns an error when parent/config/map is not set or fail to get * ida. 
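The new map argument admits two patterns, both visible elsewhere in this series; a condensed, illustrative sketch (the foo_* names are placeholders):

	/* Hardware vDPA parents pass NULL map ops and point the union at a
	 * real DMA device: */
	foo = vdpa_alloc_device(struct foo_vdpa, vdpa, &pdev->dev,
				&foo_vdpa_ops, NULL, 1, 1, NULL, false);
	foo->vdpa.vmap.dma_dev = &pdev->dev;

	/* VDUSE supplies its own virtio_map_ops and an IOVA-domain token: */
	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
				 &vduse_vdpa_config_ops, &vduse_map_ops,
				 1, 1, name, true);
	vdev->vdpa.vmap.iova_domain = dev->domain;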
*/ struct vdpa_device *__vdpa_alloc_device(struct device *parent, const struct vdpa_config_ops *config, + const struct virtio_map_ops *map, unsigned int ngroups, unsigned int nas, size_t size, const char *name, bool use_va) @@ -187,6 +189,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, vdev->dev.release = vdpa_release_dev; vdev->index = err; vdev->config = config; + vdev->map = map; vdev->features_valid = false; vdev->use_va = use_va; vdev->ngroups = ngroups; diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index c204fc8e471a7..c1c6431950e1b 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -215,7 +215,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, else ops = &vdpasim_config_ops; - vdpa = __vdpa_alloc_device(NULL, ops, + vdpa = __vdpa_alloc_device(NULL, ops, NULL, dev_attr->ngroups, dev_attr->nas, dev_attr->alloc_size, dev_attr->name, use_va); @@ -272,7 +272,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr, vringh_set_iotlb(&vdpasim->vqs[i].vring, &vdpasim->iommu[0], &vdpasim->iommu_lock); - vdpasim->vdpa.dma_dev = dev; + vdpasim->vdpa.vmap.dma_dev = dev; return vdpasim; diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 58116f89d8dae..4352b5cf74f07 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -103,19 +103,38 @@ void vduse_domain_clear_map(struct vduse_iova_domain *domain, static int vduse_domain_map_bounce_page(struct vduse_iova_domain *domain, u64 iova, u64 size, u64 paddr) { - struct vduse_bounce_map *map; + struct vduse_bounce_map *map, *head_map; + struct page *tmp_page; u64 last = iova + size - 1; while (iova <= last) { - map = &domain->bounce_maps[iova >> PAGE_SHIFT]; + /* + * When PAGE_SIZE is larger than 4KB, multiple adjacent bounce_maps will + * point to the same memory page of PAGE_SIZE. Since bounce_maps originate + * from IO requests, we may not be able to guarantee that the orig_phys + * values of all IO requests within the same 64KB memory page are contiguous. + * Therefore, we need to store them separately. + * + * Bounce pages are allocated on demand. As a result, it may occur that + * multiple bounce pages corresponding to the same 64KB memory page attempt + * to allocate memory simultaneously, so we use cmpxchg to handle this + * concurrency. 
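A worked instance of the indexing described in this comment, assuming a 64 KiB PAGE_SIZE (BOUNCE_MAP_SHIFT is 12 in this series, so every bounce map covers 4 KiB):

	iova = 0x11000
	map index:  iova >> BOUNCE_MAP_SHIFT               = 17
	head index: (iova & PAGE_MASK) >> BOUNCE_MAP_SHIFT = 16

bounce_maps[17] therefore borrows the bounce_page allocated (once, via cmpxchg) for bounce_maps[16], while keeping its own orig_phys.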
+ */ + map = &domain->bounce_maps[iova >> BOUNCE_MAP_SHIFT]; if (!map->bounce_page) { - map->bounce_page = alloc_page(GFP_ATOMIC); - if (!map->bounce_page) - return -ENOMEM; + head_map = &domain->bounce_maps[(iova & PAGE_MASK) >> BOUNCE_MAP_SHIFT]; + if (!head_map->bounce_page) { + tmp_page = alloc_page(GFP_ATOMIC); + if (!tmp_page) + return -ENOMEM; + if (cmpxchg(&head_map->bounce_page, NULL, tmp_page)) + __free_page(tmp_page); + } + map->bounce_page = head_map->bounce_page; } map->orig_phys = paddr; - paddr += PAGE_SIZE; - iova += PAGE_SIZE; + paddr += BOUNCE_MAP_SIZE; + iova += BOUNCE_MAP_SIZE; } return 0; } @@ -127,12 +146,17 @@ static void vduse_domain_unmap_bounce_page(struct vduse_iova_domain *domain, u64 last = iova + size - 1; while (iova <= last) { - map = &domain->bounce_maps[iova >> PAGE_SHIFT]; + map = &domain->bounce_maps[iova >> BOUNCE_MAP_SHIFT]; map->orig_phys = INVALID_PHYS_ADDR; - iova += PAGE_SIZE; + iova += BOUNCE_MAP_SIZE; } } +static unsigned int offset_in_bounce_page(dma_addr_t addr) +{ + return (addr & ~BOUNCE_MAP_MASK); +} + static void do_bounce(phys_addr_t orig, void *addr, size_t size, enum dma_data_direction dir) { @@ -163,7 +187,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain, { struct vduse_bounce_map *map; struct page *page; - unsigned int offset; + unsigned int offset, head_offset; void *addr; size_t sz; @@ -171,9 +195,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain, return; while (size) { - map = &domain->bounce_maps[iova >> PAGE_SHIFT]; - offset = offset_in_page(iova); - sz = min_t(size_t, PAGE_SIZE - offset, size); + map = &domain->bounce_maps[iova >> BOUNCE_MAP_SHIFT]; + head_offset = offset_in_page(iova); + offset = offset_in_bounce_page(iova); + sz = min_t(size_t, BOUNCE_MAP_SIZE - offset, size); if (WARN_ON(!map->bounce_page || map->orig_phys == INVALID_PHYS_ADDR)) @@ -183,7 +208,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain, map->user_bounce_page : map->bounce_page; addr = kmap_local_page(page); - do_bounce(map->orig_phys + offset, addr + offset, sz, dir); + do_bounce(map->orig_phys + offset, addr + head_offset, sz, dir); kunmap_local(addr); size -= sz; iova += sz; @@ -218,7 +243,7 @@ vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova) struct page *page = NULL; read_lock(&domain->bounce_lock); - map = &domain->bounce_maps[iova >> PAGE_SHIFT]; + map = &domain->bounce_maps[iova >> BOUNCE_MAP_SHIFT]; if (domain->user_bounce_pages || !map->bounce_page) goto out; @@ -236,7 +261,7 @@ vduse_domain_free_kernel_bounce_pages(struct vduse_iova_domain *domain) struct vduse_bounce_map *map; unsigned long pfn, bounce_pfns; - bounce_pfns = domain->bounce_size >> PAGE_SHIFT; + bounce_pfns = domain->bounce_size >> BOUNCE_MAP_SHIFT; for (pfn = 0; pfn < bounce_pfns; pfn++) { map = &domain->bounce_maps[pfn]; @@ -246,7 +271,8 @@ vduse_domain_free_kernel_bounce_pages(struct vduse_iova_domain *domain) if (!map->bounce_page) continue; - __free_page(map->bounce_page); + if (!((pfn << BOUNCE_MAP_SHIFT) & ~PAGE_MASK)) + __free_page(map->bounce_page); map->bounce_page = NULL; } } @@ -254,8 +280,12 @@ vduse_domain_free_kernel_bounce_pages(struct vduse_iova_domain *domain) int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain, struct page **pages, int count) { - struct vduse_bounce_map *map; - int i, ret; + struct vduse_bounce_map *map, *head_map; + int i, j, ret; + int inner_pages = PAGE_SIZE / BOUNCE_MAP_SIZE; + int bounce_pfns = domain->bounce_size >> BOUNCE_MAP_SHIFT; 
+ struct page *head_page = NULL; + bool need_copy; /* Now we don't support partial mapping */ if (count != (domain->bounce_size >> PAGE_SHIFT)) @@ -267,16 +297,23 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain, goto out; for (i = 0; i < count; i++) { - map = &domain->bounce_maps[i]; - if (map->bounce_page) { + need_copy = false; + head_map = &domain->bounce_maps[(i * inner_pages)]; + head_page = head_map->bounce_page; + for (j = 0; j < inner_pages; j++) { + if ((i * inner_pages + j) >= bounce_pfns) + break; + map = &domain->bounce_maps[(i * inner_pages + j)]; /* Copy kernel page to user page if it's in use */ - if (map->orig_phys != INVALID_PHYS_ADDR) - memcpy_to_page(pages[i], 0, - page_address(map->bounce_page), - PAGE_SIZE); + if ((head_page) && (map->orig_phys != INVALID_PHYS_ADDR)) + need_copy = true; + map->user_bounce_page = pages[i]; } - map->user_bounce_page = pages[i]; get_page(pages[i]); + if ((head_page) && (need_copy)) + memcpy_to_page(pages[i], 0, + page_address(head_page), + PAGE_SIZE); } domain->user_bounce_pages = true; ret = 0; @@ -288,8 +325,12 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain, void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain) { - struct vduse_bounce_map *map; - unsigned long i, count; + struct vduse_bounce_map *map, *head_map; + unsigned long i, j, count; + int inner_pages = PAGE_SIZE / BOUNCE_MAP_SIZE; + int bounce_pfns = domain->bounce_size >> BOUNCE_MAP_SHIFT; + struct page *head_page = NULL; + bool need_copy; write_lock(&domain->bounce_lock); if (!domain->user_bounce_pages) @@ -297,20 +338,27 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain) count = domain->bounce_size >> PAGE_SHIFT; for (i = 0; i < count; i++) { - struct page *page = NULL; - - map = &domain->bounce_maps[i]; - if (WARN_ON(!map->user_bounce_page)) + need_copy = false; + head_map = &domain->bounce_maps[(i * inner_pages)]; + if (WARN_ON(!head_map->user_bounce_page)) continue; - - /* Copy user page to kernel page if it's in use */ - if (map->orig_phys != INVALID_PHYS_ADDR) { - page = map->bounce_page; - memcpy_from_page(page_address(page), - map->user_bounce_page, 0, PAGE_SIZE); + head_page = head_map->user_bounce_page; + + for (j = 0; j < inner_pages; j++) { + if ((i * inner_pages + j) >= bounce_pfns) + break; + map = &domain->bounce_maps[(i * inner_pages + j)]; + if (WARN_ON(!map->user_bounce_page)) + continue; + /* Copy user page to kernel page if it's in use */ + if ((map->orig_phys != INVALID_PHYS_ADDR) && (head_map->bounce_page)) + need_copy = true; + map->user_bounce_page = NULL; } - put_page(map->user_bounce_page); - map->user_bounce_page = NULL; + if (need_copy) + memcpy_from_page(page_address(head_map->bounce_page), + head_page, 0, PAGE_SIZE); + put_page(head_page); } domain->user_bounce_pages = false; out: @@ -447,7 +495,7 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain, void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain, size_t size, dma_addr_t *dma_addr, - gfp_t flag, unsigned long attrs) + gfp_t flag) { struct iova_domain *iovad = &domain->consistent_iovad; unsigned long limit = domain->iova_limit; @@ -581,7 +629,7 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size) unsigned long pfn, bounce_pfns; int ret; - bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT; + bounce_pfns = PAGE_ALIGN(bounce_size) >> BOUNCE_MAP_SHIFT; if (iova_limit <= bounce_size) return NULL; @@ -613,7 +661,7 @@ vduse_domain_create(unsigned long 
iova_limit, size_t bounce_size) rwlock_init(&domain->bounce_lock); spin_lock_init(&domain->iotlb_lock); init_iova_domain(&domain->stream_iovad, - PAGE_SIZE, IOVA_START_PFN); + BOUNCE_MAP_SIZE, IOVA_START_PFN); ret = iova_domain_init_rcaches(&domain->stream_iovad); if (ret) goto err_iovad_stream; diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h index 7f3f0928ec781..775cad5238f3a 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.h +++ b/drivers/vdpa/vdpa_user/iova_domain.h @@ -19,6 +19,11 @@ #define INVALID_PHYS_ADDR (~(phys_addr_t)0) +#define BOUNCE_MAP_SHIFT 12 +#define BOUNCE_MAP_SIZE (1 << BOUNCE_MAP_SHIFT) +#define BOUNCE_MAP_MASK (~(BOUNCE_MAP_SIZE - 1)) +#define BOUNCE_MAP_ALIGN(addr) (((addr) + BOUNCE_MAP_SIZE - 1) & ~(BOUNCE_MAP_SIZE - 1)) + struct vduse_bounce_map { struct page *bounce_page; struct page *user_bounce_page; @@ -64,7 +69,7 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain, void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain, size_t size, dma_addr_t *dma_addr, - gfp_t flag, unsigned long attrs); + gfp_t flag); void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size, void *vaddr, dma_addr_t dma_addr, diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index 04620bb77203d..e7bced0b55422 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -814,59 +814,53 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = { .free = vduse_vdpa_free, }; -static void vduse_dev_sync_single_for_device(struct device *dev, +static void vduse_dev_sync_single_for_device(union virtio_map token, dma_addr_t dma_addr, size_t size, enum dma_data_direction dir) { - struct vduse_dev *vdev = dev_to_vduse(dev); - struct vduse_iova_domain *domain = vdev->domain; + struct vduse_iova_domain *domain = token.iova_domain; vduse_domain_sync_single_for_device(domain, dma_addr, size, dir); } -static void vduse_dev_sync_single_for_cpu(struct device *dev, +static void vduse_dev_sync_single_for_cpu(union virtio_map token, dma_addr_t dma_addr, size_t size, enum dma_data_direction dir) { - struct vduse_dev *vdev = dev_to_vduse(dev); - struct vduse_iova_domain *domain = vdev->domain; + struct vduse_iova_domain *domain = token.iova_domain; vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir); } -static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page, +static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, unsigned long attrs) { - struct vduse_dev *vdev = dev_to_vduse(dev); - struct vduse_iova_domain *domain = vdev->domain; + struct vduse_iova_domain *domain = token.iova_domain; return vduse_domain_map_page(domain, page, offset, size, dir, attrs); } -static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs) +static void vduse_dev_unmap_page(union virtio_map token, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) { - struct vduse_dev *vdev = dev_to_vduse(dev); - struct vduse_iova_domain *domain = vdev->domain; + struct vduse_iova_domain *domain = token.iova_domain; return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs); } -static void *vduse_dev_alloc_coherent(struct device *dev, size_t size, - dma_addr_t *dma_addr, gfp_t flag, - unsigned long attrs) +static void *vduse_dev_alloc_coherent(union virtio_map 
token, size_t size, + dma_addr_t *dma_addr, gfp_t flag) { - struct vduse_dev *vdev = dev_to_vduse(dev); - struct vduse_iova_domain *domain = vdev->domain; + struct vduse_iova_domain *domain = token.iova_domain; unsigned long iova; void *addr; *dma_addr = DMA_MAPPING_ERROR; addr = vduse_domain_alloc_coherent(domain, size, - (dma_addr_t *)&iova, flag, attrs); + (dma_addr_t *)&iova, flag); if (!addr) return NULL; @@ -875,31 +869,45 @@ static void *vduse_dev_alloc_coherent(struct device *dev, size_t size, return addr; } -static void vduse_dev_free_coherent(struct device *dev, size_t size, - void *vaddr, dma_addr_t dma_addr, - unsigned long attrs) +static void vduse_dev_free_coherent(union virtio_map token, size_t size, + void *vaddr, dma_addr_t dma_addr, + unsigned long attrs) { - struct vduse_dev *vdev = dev_to_vduse(dev); - struct vduse_iova_domain *domain = vdev->domain; + struct vduse_iova_domain *domain = token.iova_domain; vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs); } -static size_t vduse_dev_max_mapping_size(struct device *dev) +static bool vduse_dev_need_sync(union virtio_map token, dma_addr_t dma_addr) { - struct vduse_dev *vdev = dev_to_vduse(dev); - struct vduse_iova_domain *domain = vdev->domain; + struct vduse_iova_domain *domain = token.iova_domain; + + return dma_addr < domain->bounce_size; +} + +static int vduse_dev_mapping_error(union virtio_map token, dma_addr_t dma_addr) +{ + if (unlikely(dma_addr == DMA_MAPPING_ERROR)) + return -ENOMEM; + return 0; +} + +static size_t vduse_dev_max_mapping_size(union virtio_map token) +{ + struct vduse_iova_domain *domain = token.iova_domain; return domain->bounce_size; } -static const struct dma_map_ops vduse_dev_dma_ops = { +static const struct virtio_map_ops vduse_map_ops = { .sync_single_for_device = vduse_dev_sync_single_for_device, .sync_single_for_cpu = vduse_dev_sync_single_for_cpu, .map_page = vduse_dev_map_page, .unmap_page = vduse_dev_unmap_page, .alloc = vduse_dev_alloc_coherent, .free = vduse_dev_free_coherent, + .need_sync = vduse_dev_need_sync, + .mapping_error = vduse_dev_mapping_error, .max_mapping_size = vduse_dev_max_mapping_size, }; @@ -2003,26 +2011,18 @@ static struct vduse_mgmt_dev *vduse_mgmt; static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name) { struct vduse_vdpa *vdev; - int ret; if (dev->vdev) return -EEXIST; vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev, - &vduse_vdpa_config_ops, 1, 1, name, true); + &vduse_vdpa_config_ops, &vduse_map_ops, + 1, 1, name, true); if (IS_ERR(vdev)) return PTR_ERR(vdev); dev->vdev = vdev; vdev->dev = dev; - vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask; - ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64)); - if (ret) { - put_device(&vdev->vdpa.dev); - return ret; - } - set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops); - vdev->vdpa.dma_dev = &vdev->vdpa.dev; vdev->vdpa.mdev = &vduse_mgmt->mgmt_dev; return 0; @@ -2055,6 +2055,7 @@ static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name, return -ENOMEM; } + dev->vdev->vdpa.vmap.iova_domain = dev->domain; ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num); if (ret) { put_device(&dev->vdev->vdpa.dev); diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c index 8787407f75b06..17a19a728c9cb 100644 --- a/drivers/vdpa/virtio_pci/vp_vdpa.c +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c @@ -511,7 +511,8 @@ static int vp_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name, int ret, i; vp_vdpa = 
vdpa_alloc_device(struct vp_vdpa, vdpa, - dev, &vp_vdpa_ops, 1, 1, name, false); + dev, &vp_vdpa_ops, NULL, + 1, 1, name, false); if (IS_ERR(vp_vdpa)) { dev_err(dev, "vp_vdpa: Failed to allocate vDPA structure\n"); @@ -520,7 +521,7 @@ static int vp_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name, vp_vdpa_mgtdev->vp_vdpa = vp_vdpa; - vp_vdpa->vdpa.dma_dev = &pdev->dev; + vp_vdpa->vdpa.vmap.dma_dev = &pdev->dev; vp_vdpa->queues = vp_modern_get_num_queues(mdev); vp_vdpa->mdev = mdev; diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 2b0172f546652..2b9fca00e9e8b 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -55,6 +55,9 @@ config VFIO_PCI_ZDEV_KVM To enable s390x KVM vfio-pci extensions, say Y. +config VFIO_PCI_DMABUF + def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER + source "drivers/vfio/pci/mlx5/Kconfig" source "drivers/vfio/pci/hisilicon/Kconfig" diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index cf00c0a7e55c8..53f59226ae013 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,6 +2,7 @@ vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o +vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o vfio-pci-y := vfio_pci.o diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c index 7a4b46d972fe1..0dd7bc685f4ad 100644 --- a/drivers/vfio/pci/nvgrace-gpu/main.c +++ b/drivers/vfio/pci/nvgrace-gpu/main.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -920,6 +921,50 @@ nvgrace_gpu_write(struct vfio_device *core_vdev, return vfio_pci_core_write(core_vdev, buf, count, ppos); } +static int nvgrace_get_dmabuf_phys(struct vfio_pci_core_device *core_vdev, + struct p2pdma_provider **provider, + unsigned int region_index, + struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges) +{ + struct nvgrace_gpu_pci_core_device *nvdev = container_of( + core_vdev, struct nvgrace_gpu_pci_core_device, core_device); + struct pci_dev *pdev = core_vdev->pdev; + struct mem_region *mem_region; + + /* + * if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) { + * The P2P properties of the non-BAR memory is the same as the + * BAR memory, so just use the provider for index 0. Someday + * when CXL gets P2P support we could create CXLish providers + * for the non-BAR memory. + * } else if (region_index == USEMEM_REGION_INDEX) { + * This is actually cachable memory and isn't treated as P2P in + * the chip. For now we have no way to push cachable memory + * through everything and the Grace HW doesn't care what caching + * attribute is programmed into the SMMU. So use BAR 0. 
+ * } + */ + mem_region = nvgrace_gpu_memregion(region_index, nvdev); + if (mem_region) { + *provider = pcim_p2pdma_provider(pdev, 0); + if (!*provider) + return -EINVAL; + return vfio_pci_core_fill_phys_vec(phys_vec, dma_ranges, + nr_ranges, + mem_region->memphys, + mem_region->memlength); + } + + return vfio_pci_core_get_dmabuf_phys(core_vdev, provider, region_index, + phys_vec, dma_ranges, nr_ranges); +} + +static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_ops = { + .get_dmabuf_phys = nvgrace_get_dmabuf_phys, +}; + static const struct vfio_device_ops nvgrace_gpu_pci_ops = { .name = "nvgrace-gpu-vfio-pci", .init = vfio_pci_core_init_dev, @@ -940,6 +985,10 @@ static const struct vfio_device_ops nvgrace_gpu_pci_ops = { .detach_ioas = vfio_iommufd_physical_detach_ioas, }; +static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_core_ops = { + .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys, +}; + static const struct vfio_device_ops nvgrace_gpu_pci_core_ops = { .name = "nvgrace-gpu-vfio-pci-core", .init = vfio_pci_core_init_dev, @@ -1206,6 +1255,7 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev, memphys, memlength); if (ret) goto out_put_vdev; + nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops; if (egm_enabled) { ret = register_egm_node(pdev); @@ -1215,6 +1265,8 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev, nvdev->egm_node = egmpxm; } + } else { + nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops; } ret = vfio_pci_core_register_device(&nvdev->core_device); diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index ac10f14417f2f..6d41cf26b5399 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -147,6 +147,10 @@ static const struct vfio_device_ops vfio_pci_ops = { .pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas, }; +static const struct vfio_pci_device_ops vfio_pci_dev_ops = { + .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys, +}; + static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { struct vfio_pci_core_device *vdev; @@ -161,6 +165,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) return PTR_ERR(vdev); dev_set_drvdata(&pdev->dev, vdev); + vdev->pci_ops = &vfio_pci_dev_ops; ret = vfio_pci_core_register_device(vdev); if (ret) goto out_put_vdev; diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 4abd4f2719958..dc4e510e6e1bf 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -590,10 +590,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY); - if (!new_mem) + if (!new_mem) { vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + } /* * If the user is writing mem/io enable (new_mem/io) and we @@ -628,6 +630,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd &= cpu_to_le16(~mask); *virt_cmd |= cpu_to_le16(new_cmd & mask); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } @@ -708,12 +712,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state) { - if (state >= PCI_D3hot) + if (state >= PCI_D3hot) { 
vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + } vfio_pci_set_power_state(vdev, state); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } @@ -901,7 +909,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos, if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } } @@ -983,7 +994,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos, if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } } diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 54c2133501718..a9dd910d068d8 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -28,6 +28,7 @@ #include #include #include +#include #if IS_ENABLED(CONFIG_EEH) #include #endif @@ -286,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, * semaphore. */ vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); + if (vdev->pm_runtime_engaged) { up_write(&vdev->memory_lock); return -EINVAL; @@ -299,11 +302,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, return 0; } -static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, +static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags, void __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); int ret; ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); @@ -320,12 +321,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, } static int vfio_pci_core_pm_entry_with_wakeup( - struct vfio_device *device, u32 flags, + struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_low_power_entry_with_wakeup __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); struct vfio_device_low_power_entry_with_wakeup entry; struct eventfd_ctx *efdctx; int ret; @@ -373,14 +372,14 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev) */ down_write(&vdev->memory_lock); __vfio_pci_runtime_pm_exit(vdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } -static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags, +static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags, void __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); int ret; ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); @@ -695,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) #endif vfio_pci_core_disable(vdev); + vfio_pci_dma_buf_cleanup(vdev); + mutex_lock(&vdev->igate); if (vdev->err_trigger) { eventfd_ctx_put(vdev->err_trigger); @@ -1227,7 +1228,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, */ 
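+	/*
+	 * As in the memory-enable, power-state and FLR paths in
+	 * vfio_pci_config.c, the reset path revokes any DMABUFs exported for
+	 * this device with vfio_pci_dma_buf_move(vdev, true) while
+	 * memory_lock is held for write, and only un-revokes them afterwards
+	 * when __vfio_pci_memory_enabled() shows the BARs are still
+	 * accessible.
+	 */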
vfio_pci_set_power_state(vdev, PCI_D0); + vfio_pci_dma_buf_move(vdev, true); ret = pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); return ret; @@ -1473,11 +1477,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl); -static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, - uuid_t __user *arg, size_t argsz) +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev, + u32 flags, uuid_t __user *arg, + size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); uuid_t uuid; int ret; @@ -1504,16 +1507,21 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz) { + struct vfio_pci_core_device *vdev = + container_of(device, struct vfio_pci_core_device, vdev); + switch (flags & VFIO_DEVICE_FEATURE_MASK) { case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY: - return vfio_pci_core_pm_entry(device, flags, arg, argsz); + return vfio_pci_core_pm_entry(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP: - return vfio_pci_core_pm_entry_with_wakeup(device, flags, + return vfio_pci_core_pm_entry_with_wakeup(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT: - return vfio_pci_core_pm_exit(device, flags, arg, argsz); + return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: - return vfio_pci_core_feature_token(device, flags, arg, argsz); + return vfio_pci_core_feature_token(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF: + return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); default: return -ENOTTY; } @@ -2076,6 +2084,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) { struct vfio_pci_core_device *vdev = container_of(core_vdev, struct vfio_pci_core_device, vdev); + int ret; vdev->pdev = to_pci_dev(core_vdev->dev); vdev->irq_type = VFIO_PCI_NUM_IRQS; @@ -2085,6 +2094,10 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) INIT_LIST_HEAD(&vdev->dummy_resources_list); INIT_LIST_HEAD(&vdev->ioeventfds_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); + ret = pcim_p2pdma_init(vdev->pdev); + if (ret && ret != -EOPNOTSUPP) + return ret; + INIT_LIST_HEAD(&vdev->dmabufs); init_rwsem(&vdev->memory_lock); xa_init(&vdev->ctx); @@ -2449,6 +2462,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, break; } + vfio_pci_dma_buf_move(vdev, true); vfio_pci_zap_bars(vdev); } @@ -2477,8 +2491,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, err_undo: list_for_each_entry_from_reverse(vdev, &dev_set->device_list, - vdev.dev_set_list) + vdev.dev_set_list) { + if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); + } list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) pm_runtime_put(&vdev->pdev->dev); diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c new file mode 100644 index 0000000000000..4be4a85005cbc --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -0,0 +1,362 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. 
+ */ +#include +#include +#include + +#include "vfio_pci_priv.h" + +MODULE_IMPORT_NS("DMA_BUF"); + +struct vfio_pci_dma_buf { + struct dma_buf *dmabuf; + struct vfio_pci_core_device *vdev; + struct list_head dmabufs_elm; + size_t size; + struct dma_buf_phys_vec *phys_vec; + struct p2pdma_provider *provider; + u32 nr_ranges; + u8 revoked : 1; +}; + +static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment) +{ + return -EOPNOTSUPP; +} + +static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment) +{ + /* Do nothing */ +} + +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + if (!attachment->peer2peer) + return -EOPNOTSUPP; + + if (priv->revoked) + return -ENODEV; + + return 0; +} + +static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, + enum dma_data_direction dir) +{ + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + + dma_resv_assert_held(priv->dmabuf->resv); + + if (priv->revoked) + return ERR_PTR(-ENODEV); + + return dma_buf_phys_vec_to_sgt(attachment, priv->provider, + priv->phys_vec, priv->nr_ranges, + priv->size, dir); +} + +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, + struct sg_table *sgt, + enum dma_data_direction dir) +{ + dma_buf_free_sgt(attachment, sgt, dir); +} + +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + /* + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list. + * The refcount prevents both. + */ + if (priv->vdev) { + down_write(&priv->vdev->memory_lock); + list_del_init(&priv->dmabufs_elm); + up_write(&priv->vdev->memory_lock); + vfio_device_put_registration(&priv->vdev->vdev); + } + kfree(priv->phys_vec); + kfree(priv); +} + +static const struct dma_buf_ops vfio_pci_dmabuf_ops = { + .pin = vfio_pci_dma_buf_pin, + .unpin = vfio_pci_dma_buf_unpin, + .attach = vfio_pci_dma_buf_attach, + .map_dma_buf = vfio_pci_dma_buf_map, + .unmap_dma_buf = vfio_pci_dma_buf_unmap, + .release = vfio_pci_dma_buf_release, +}; + +/* + * This is a temporary "private interconnect" between VFIO DMABUF and iommufd. + * It allows the two co-operating drivers to exchange the physical address of + * the BAR. This is to be replaced with a formal DMABUF system for negotiated + * interconnect types. 
+ * + * If this function succeeds the following are true: + * - There is one physical range and it is pointing to MMIO + * - When move_notify is called it means revoke, not move, vfio_dma_buf_map + * will fail if it is currently revoked + */ +int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, + struct dma_buf_phys_vec *phys) +{ + struct vfio_pci_dma_buf *priv; + + dma_resv_assert_held(attachment->dmabuf->resv); + + if (attachment->dmabuf->ops != &vfio_pci_dmabuf_ops) + return -EOPNOTSUPP; + + priv = attachment->dmabuf->priv; + if (priv->revoked) + return -ENODEV; + + /* More than one range to iommufd will require proper DMABUF support */ + if (priv->nr_ranges != 1) + return -EOPNOTSUPP; + + *phys = priv->phys_vec[0]; + return 0; +} +EXPORT_SYMBOL_FOR_MODULES(vfio_pci_dma_buf_iommufd_map, "iommufd"); + +int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges, phys_addr_t start, + phys_addr_t len) +{ + phys_addr_t max_addr; + unsigned int i; + + max_addr = start + len; + for (i = 0; i < nr_ranges; i++) { + phys_addr_t end; + + if (!dma_ranges[i].length) + return -EINVAL; + + if (check_add_overflow(start, dma_ranges[i].offset, + &phys_vec[i].paddr) || + check_add_overflow(phys_vec[i].paddr, + dma_ranges[i].length, &end)) + return -EOVERFLOW; + if (end > max_addr) + return -EINVAL; + + phys_vec[i].len = dma_ranges[i].length; + } + return 0; +} +EXPORT_SYMBOL_GPL(vfio_pci_core_fill_phys_vec); + +int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev, + struct p2pdma_provider **provider, + unsigned int region_index, + struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges) +{ + struct pci_dev *pdev = vdev->pdev; + + *provider = pcim_p2pdma_provider(pdev, region_index); + if (!*provider) + return -EINVAL; + + return vfio_pci_core_fill_phys_vec( + phys_vec, dma_ranges, nr_ranges, + pci_resource_start(pdev, region_index), + pci_resource_len(pdev, region_index)); +} +EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys); + +static int validate_dmabuf_input(struct vfio_device_feature_dma_buf *dma_buf, + struct vfio_region_dma_range *dma_ranges, + size_t *lengthp) +{ + size_t length = 0; + u32 i; + + for (i = 0; i < dma_buf->nr_ranges; i++) { + u64 offset = dma_ranges[i].offset; + u64 len = dma_ranges[i].length; + + if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) + return -EINVAL; + + if (check_add_overflow(length, len, &length)) + return -EINVAL; + } + + /* + * dma_iova_try_alloc() will WARN on if userspace proposes a size that + * is too big, eg with lots of ranges. + */ + if ((u64)(length) & DMA_IOVA_USE_SWIOTLB) + return -EINVAL; + + *lengthp = length; + return 0; +} + +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf get_dma_buf = {}; + struct vfio_region_dma_range *dma_ranges; + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + struct vfio_pci_dma_buf *priv; + size_t length; + int ret; + + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys) + return -EOPNOTSUPP; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(get_dma_buf)); + if (ret != 1) + return ret; + + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf))) + return -EFAULT; + + if (!get_dma_buf.nr_ranges || get_dma_buf.flags) + return -EINVAL; + + /* + * For PCI the region_index is the BAR number like everything else. 
+ */ + if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX) + return -ENODEV; + + dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges, + sizeof(*dma_ranges)); + if (IS_ERR(dma_ranges)) + return PTR_ERR(dma_ranges); + + ret = validate_dmabuf_input(&get_dma_buf, dma_ranges, &length); + if (ret) + goto err_free_ranges; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) { + ret = -ENOMEM; + goto err_free_ranges; + } + priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec), + GFP_KERNEL); + if (!priv->phys_vec) { + ret = -ENOMEM; + goto err_free_priv; + } + + priv->vdev = vdev; + priv->nr_ranges = get_dma_buf.nr_ranges; + priv->size = length; + ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider, + get_dma_buf.region_index, + priv->phys_vec, dma_ranges, + priv->nr_ranges); + if (ret) + goto err_free_phys; + + kfree(dma_ranges); + dma_ranges = NULL; + + if (!vfio_device_try_get_registration(&vdev->vdev)) { + ret = -ENODEV; + goto err_free_phys; + } + + exp_info.ops = &vfio_pci_dmabuf_ops; + exp_info.size = priv->size; + exp_info.flags = get_dma_buf.open_flags; + exp_info.priv = priv; + + priv->dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(priv->dmabuf)) { + ret = PTR_ERR(priv->dmabuf); + goto err_dev_put; + } + + /* dma_buf_put() now frees priv */ + INIT_LIST_HEAD(&priv->dmabufs_elm); + down_write(&vdev->memory_lock); + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = !__vfio_pci_memory_enabled(vdev); + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); + dma_resv_unlock(priv->dmabuf->resv); + up_write(&vdev->memory_lock); + + /* + * dma_buf_fd() consumes the reference, when the file closes the dmabuf + * will be released. + */ + ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); + if (ret < 0) + goto err_dma_buf; + return ret; + +err_dma_buf: + dma_buf_put(priv->dmabuf); +err_dev_put: + vfio_device_put_registration(&vdev->vdev); +err_free_phys: + kfree(priv->phys_vec); +err_free_priv: + kfree(priv); +err_free_ranges: + kfree(dma_ranges); + return ret; +} + +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + lockdep_assert_held_write(&vdev->memory_lock); + + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!get_file_active(&priv->dmabuf->file)) + continue; + + if (priv->revoked != revoked) { + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = revoked; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + } + fput(priv->dmabuf->file); + } +} + +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + down_write(&vdev->memory_lock); + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!get_file_active(&priv->dmabuf->file)) + continue; + + dma_resv_lock(priv->dmabuf->resv, NULL); + list_del_init(&priv->dmabufs_elm); + priv->vdev = NULL; + priv->revoked = true; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + vfio_device_put_registration(&vdev->vdev); + fput(priv->dmabuf->file); + } + up_write(&vdev->memory_lock); +} diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 7b1776bae8026..fd145e34e2aec 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -106,4 +106,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; } 
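Usage note: the VFIO_DEVICE_FEATURE_DMA_BUF flow implemented above is driven from userspace through the existing VFIO_DEVICE_FEATURE ioctl. The sketch below shows how a consumer might request a DMABUF covering the first 64KiB of BAR 0; the exact uapi layout of struct vfio_device_feature_dma_buf and struct vfio_region_dma_range comes from the uapi header added elsewhere in this series, so the field ordering, the device_fd handle and the 64KiB length are illustrative assumptions, and error handling is omitted:

	struct {
		struct vfio_device_feature feature;
		struct vfio_device_feature_dma_buf get_dma_buf;
		struct vfio_region_dma_range range;	/* becomes get_dma_buf.dma_ranges[0] */
	} __attribute__((packed)) arg = {
		.feature = {
			.argsz = sizeof(arg),
			.flags = VFIO_DEVICE_FEATURE_GET |
				 VFIO_DEVICE_FEATURE_DMA_BUF,
		},
		.get_dma_buf = {
			.region_index = 0,	/* BAR 0; the ROM region is rejected */
			.open_flags = O_CLOEXEC,
			.nr_ranges = 1,
		},
		.range = {
			.offset = 0,		/* must be page aligned */
			.length = 0x10000,	/* must be page aligned and non-zero */
		},
	};
	int dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, &arg);

The returned fd is an ordinary DMABUF that can be handed to an importer such as iommufd. Whenever BAR access is disabled the exporter revokes it via vfio_pci_dma_buf_move(), so importers must expect dma_buf_map_attachment() to fail with -ENODEV until access is restored.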
+#ifdef CONFIG_VFIO_PCI_DMABUF +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz); +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); +#else +static inline int +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + return -ENOTTY; +} +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ +} +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, + bool revoked) +{ +} +#endif + #endif diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index 715368076a1fe..81e8ef5801a25 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -171,11 +171,13 @@ void vfio_device_put_registration(struct vfio_device *device) if (refcount_dec_and_test(&device->refcount)) complete(&device->comp); } +EXPORT_SYMBOL_GPL(vfio_device_put_registration); bool vfio_device_try_get_registration(struct vfio_device *device) { return refcount_inc_not_zero(&device->refcount); } +EXPORT_SYMBOL_GPL(vfio_device_try_get_registration); /* * VFIO driver API diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index af1e1fdfd9ed0..05a481e4c385a 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -1318,7 +1318,8 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; - struct device *dma_dev = vdpa_get_dma_dev(vdpa); + union virtio_map map = vdpa_get_map(vdpa); + struct device *dma_dev = map.dma_dev; int ret; /* Device want to do DMA by itself */ @@ -1353,7 +1354,8 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v) static void vhost_vdpa_free_domain(struct vhost_vdpa *v) { struct vdpa_device *vdpa = v->vdpa; - struct device *dma_dev = vdpa_get_dma_dev(vdpa); + union virtio_map map = vdpa_get_map(vdpa); + struct device *dma_dev = map.dma_dev; if (v->domain) { iommu_detach_device(v->domain, dma_dev); diff --git a/drivers/virt/coco/efi_secret/Kconfig b/drivers/virt/coco/efi_secret/Kconfig index 4404d198f3b20..94d88e5da7072 100644 --- a/drivers/virt/coco/efi_secret/Kconfig +++ b/drivers/virt/coco/efi_secret/Kconfig @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only config EFI_SECRET tristate "EFI secret area securityfs support" - depends on EFI && X86_64 + depends on EFI && (X86_64 || ARM64) select EFI_COCO_SECRET select SECURITYFS help diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index eae65136cdfb5..5c34c9b53f0dc 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -205,7 +205,7 @@ static int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_i unsigned int unused, err; /* We should always be able to add these buffers to an empty queue. */ - err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN); + err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT); /* * In the extremely unlikely case that something has occurred and we diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index f5062061c4084..7b6205253b46b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -166,7 +166,7 @@ struct vring_virtqueue { bool packed_ring; /* Is DMA API used? */ - bool use_dma_api; + bool use_map_api; /* Can we use weak barriers? 
*/ bool weak_barriers; @@ -210,8 +210,7 @@ struct vring_virtqueue { /* DMA, allocation, and size information */ bool we_own_ring; - /* Device used for doing DMA */ - struct device *dma_dev; + union virtio_map map; #ifdef DEBUG /* They're supposed to lock for us. */ @@ -268,7 +267,7 @@ static bool virtqueue_use_indirect(const struct vring_virtqueue *vq, * unconditionally on data path. */ -static bool vring_use_dma_api(const struct virtio_device *vdev) +static bool vring_use_map_api(const struct virtio_device *vdev) { if (!virtio_has_dma_quirk(vdev)) return true; @@ -291,33 +290,39 @@ static bool vring_use_dma_api(const struct virtio_device *vdev) static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring, const struct vring_desc_extra *extra) { - return vring->use_dma_api && (extra->addr != DMA_MAPPING_ERROR); + return vring->use_map_api && (extra->addr != DMA_MAPPING_ERROR); } size_t virtio_max_dma_size(const struct virtio_device *vdev) { size_t max_segment_size = SIZE_MAX; - if (vring_use_dma_api(vdev)) - max_segment_size = dma_max_mapping_size(vdev->dev.parent); + if (vring_use_map_api(vdev)) { + if (vdev->map) { + max_segment_size = + vdev->map->max_mapping_size(vdev->vmap); + } else + max_segment_size = + dma_max_mapping_size(vdev->dev.parent); + } return max_segment_size; } EXPORT_SYMBOL_GPL(virtio_max_dma_size); static void *vring_alloc_queue(struct virtio_device *vdev, size_t size, - dma_addr_t *dma_handle, gfp_t flag, - struct device *dma_dev) + dma_addr_t *map_handle, gfp_t flag, + union virtio_map map) { - if (vring_use_dma_api(vdev)) { - return dma_alloc_coherent(dma_dev, size, - dma_handle, flag); + if (vring_use_map_api(vdev)) { + return virtqueue_map_alloc_coherent(vdev, map, size, + map_handle, flag); } else { void *queue = alloc_pages_exact(PAGE_ALIGN(size), flag); if (queue) { phys_addr_t phys_addr = virt_to_phys(queue); - *dma_handle = (dma_addr_t)phys_addr; + *map_handle = (dma_addr_t)phys_addr; /* * Sanity check: make sure we dind't truncate @@ -330,7 +335,7 @@ static void *vring_alloc_queue(struct virtio_device *vdev, size_t size, * warning and abort if we end up with an * unrepresentable address. */ - if (WARN_ON_ONCE(*dma_handle != phys_addr)) { + if (WARN_ON_ONCE(*map_handle != phys_addr)) { free_pages_exact(queue, PAGE_ALIGN(size)); return NULL; } @@ -340,11 +345,12 @@ static void *vring_alloc_queue(struct virtio_device *vdev, size_t size, } static void vring_free_queue(struct virtio_device *vdev, size_t size, - void *queue, dma_addr_t dma_handle, - struct device *dma_dev) + void *queue, dma_addr_t map_handle, + union virtio_map map) { - if (vring_use_dma_api(vdev)) - dma_free_coherent(dma_dev, size, queue, dma_handle); + if (vring_use_map_api(vdev)) + virtqueue_map_free_coherent(vdev, map, size, + queue, map_handle); else free_pages_exact(queue, PAGE_ALIGN(size)); } @@ -356,7 +362,21 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size, */ static struct device *vring_dma_dev(const struct vring_virtqueue *vq) { - return vq->dma_dev; + return vq->map.dma_dev; +} + +static int vring_mapping_error(const struct vring_virtqueue *vq, + dma_addr_t addr) +{ + struct virtio_device *vdev = vq->vq.vdev; + + if (!vq->use_map_api) + return 0; + + if (vdev->map) + return vdev->map->mapping_error(vq->map, addr); + else + return dma_mapping_error(vring_dma_dev(vq), addr); } /* Map one sg entry. 
*/ @@ -372,13 +392,13 @@ static int vring_map_one_sg(const struct vring_virtqueue *vq, struct scatterlist *len = sg->length; - if (!vq->use_dma_api) { + if (!vq->use_map_api) { /* * If DMA is not used, KMSAN doesn't know that the scatterlist * is initialized by the hardware. Explicitly check/unpoison it * depending on the direction. */ - kmsan_handle_dma(sg_page(sg), sg->offset, sg->length, direction); + kmsan_handle_dma(sg_phys(sg), sg->length, direction); *addr = (dma_addr_t)sg_phys(sg); return 0; } @@ -388,11 +408,11 @@ static int vring_map_one_sg(const struct vring_virtqueue *vq, struct scatterlist * the way it expects (we don't guarantee that the scatterlist * will exist for the lifetime of the mapping). */ - *addr = dma_map_page(vring_dma_dev(vq), - sg_page(sg), sg->offset, sg->length, - direction); + *addr = virtqueue_map_page_attrs(&vq->vq, sg_page(sg), + sg->offset, sg->length, + direction, 0); - if (dma_mapping_error(vring_dma_dev(vq), *addr)) + if (vring_mapping_error(vq, *addr)) return -ENOMEM; return 0; @@ -402,20 +422,11 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, void *cpu_addr, size_t size, enum dma_data_direction direction) { - if (!vq->use_dma_api) + if (!vq->use_map_api) return (dma_addr_t)virt_to_phys(cpu_addr); - return dma_map_single(vring_dma_dev(vq), - cpu_addr, size, direction); -} - -static int vring_mapping_error(const struct vring_virtqueue *vq, - dma_addr_t addr) -{ - if (!vq->use_dma_api) - return 0; - - return dma_mapping_error(vring_dma_dev(vq), addr); + return virtqueue_map_single_attrs(&vq->vq, cpu_addr, + size, direction, 0); } static void virtqueue_init(struct vring_virtqueue *vq, u32 num) @@ -449,24 +460,17 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, flags = extra->flags; if (flags & VRING_DESC_F_INDIRECT) { - if (!vq->use_dma_api) - goto out; - - dma_unmap_single(vring_dma_dev(vq), - extra->addr, - extra->len, - (flags & VRING_DESC_F_WRITE) ? - DMA_FROM_DEVICE : DMA_TO_DEVICE); - } else { - if (!vring_need_unmap_buffer(vq, extra)) + if (!vq->use_map_api) goto out; + } else if (!vring_need_unmap_buffer(vq, extra)) + goto out; - dma_unmap_page(vring_dma_dev(vq), - extra->addr, - extra->len, - (flags & VRING_DESC_F_WRITE) ? - DMA_FROM_DEVICE : DMA_TO_DEVICE); - } + virtqueue_unmap_page_attrs(&vq->vq, + extra->addr, + extra->len, + (flags & VRING_DESC_F_WRITE) ? 
+ DMA_FROM_DEVICE : DMA_TO_DEVICE, + 0); out: return extra->next; @@ -790,7 +794,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, extra = (struct vring_desc_extra *)&indir_desc[num]; - if (vq->use_dma_api) { + if (vq->use_map_api) { for (j = 0; j < num; j++) vring_unmap_one_split(vq, &extra[j]); } @@ -1064,12 +1068,13 @@ static int vring_alloc_state_extra_split(struct vring_virtqueue_split *vring_spl } static void vring_free_split(struct vring_virtqueue_split *vring_split, - struct virtio_device *vdev, struct device *dma_dev) + struct virtio_device *vdev, + union virtio_map map) { vring_free_queue(vdev, vring_split->queue_size_in_bytes, vring_split->vring.desc, vring_split->queue_dma_addr, - dma_dev); + map); kfree(vring_split->desc_state); kfree(vring_split->desc_extra); @@ -1080,7 +1085,7 @@ static int vring_alloc_queue_split(struct vring_virtqueue_split *vring_split, u32 num, unsigned int vring_align, bool may_reduce_num, - struct device *dma_dev) + union virtio_map map) { void *queue = NULL; dma_addr_t dma_addr; @@ -1096,7 +1101,7 @@ static int vring_alloc_queue_split(struct vring_virtqueue_split *vring_split, queue = vring_alloc_queue(vdev, vring_size(num, vring_align), &dma_addr, GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, - dma_dev); + map); if (queue) break; if (!may_reduce_num) @@ -1110,7 +1115,7 @@ static int vring_alloc_queue_split(struct vring_virtqueue_split *vring_split, /* Try to get a single page. You are my only hope! */ queue = vring_alloc_queue(vdev, vring_size(num, vring_align), &dma_addr, GFP_KERNEL | __GFP_ZERO, - dma_dev); + map); } if (!queue) return -ENOMEM; @@ -1134,7 +1139,7 @@ static struct virtqueue *__vring_new_virtqueue_split(unsigned int index, bool (*notify)(struct virtqueue *), void (*callback)(struct virtqueue *), const char *name, - struct device *dma_dev) + union virtio_map map) { struct vring_virtqueue *vq; int err; @@ -1157,8 +1162,8 @@ static struct virtqueue *__vring_new_virtqueue_split(unsigned int index, #else vq->broken = false; #endif - vq->dma_dev = dma_dev; - vq->use_dma_api = vring_use_dma_api(vdev); + vq->map = map; + vq->use_map_api = vring_use_map_api(vdev); vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC) && !context; @@ -1195,21 +1200,21 @@ static struct virtqueue *vring_create_virtqueue_split( bool (*notify)(struct virtqueue *), void (*callback)(struct virtqueue *), const char *name, - struct device *dma_dev) + union virtio_map map) { struct vring_virtqueue_split vring_split = {}; struct virtqueue *vq; int err; err = vring_alloc_queue_split(&vring_split, vdev, num, vring_align, - may_reduce_num, dma_dev); + may_reduce_num, map); if (err) return NULL; vq = __vring_new_virtqueue_split(index, &vring_split, vdev, weak_barriers, - context, notify, callback, name, dma_dev); + context, notify, callback, name, map); if (!vq) { - vring_free_split(&vring_split, vdev, dma_dev); + vring_free_split(&vring_split, vdev, map); return NULL; } @@ -1228,7 +1233,7 @@ static int virtqueue_resize_split(struct virtqueue *_vq, u32 num) err = vring_alloc_queue_split(&vring_split, vdev, num, vq->split.vring_align, vq->split.may_reduce_num, - vring_dma_dev(vq)); + vq->map); if (err) goto err; @@ -1246,7 +1251,7 @@ static int virtqueue_resize_split(struct virtqueue *_vq, u32 num) return 0; err_state_extra: - vring_free_split(&vring_split, vdev, vring_dma_dev(vq)); + vring_free_split(&vring_split, vdev, vq->map); err: virtqueue_reinit_split(vq); return -ENOMEM; @@ -1274,22 +1279,16 @@ static void 
vring_unmap_extra_packed(const struct vring_virtqueue *vq, flags = extra->flags; if (flags & VRING_DESC_F_INDIRECT) { - if (!vq->use_dma_api) - return; - - dma_unmap_single(vring_dma_dev(vq), - extra->addr, extra->len, - (flags & VRING_DESC_F_WRITE) ? - DMA_FROM_DEVICE : DMA_TO_DEVICE); - } else { - if (!vring_need_unmap_buffer(vq, extra)) + if (!vq->use_map_api) return; + } else if (!vring_need_unmap_buffer(vq, extra)) + return; - dma_unmap_page(vring_dma_dev(vq), - extra->addr, extra->len, - (flags & VRING_DESC_F_WRITE) ? - DMA_FROM_DEVICE : DMA_TO_DEVICE); - } + virtqueue_unmap_page_attrs(&vq->vq, + extra->addr, extra->len, + (flags & VRING_DESC_F_WRITE) ? + DMA_FROM_DEVICE : DMA_TO_DEVICE, + 0); } static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg, @@ -1366,7 +1365,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq, desc[i].addr = cpu_to_le64(addr); desc[i].len = cpu_to_le32(len); - if (unlikely(vq->use_dma_api)) { + if (unlikely(vq->use_map_api)) { extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr; extra[i].len = len; extra[i].flags = n < out_sgs ? 0 : VRING_DESC_F_WRITE; @@ -1388,7 +1387,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq, sizeof(struct vring_packed_desc)); vq->packed.vring.desc[head].id = cpu_to_le16(id); - if (vq->use_dma_api) { + if (vq->use_map_api) { vq->packed.desc_extra[id].addr = addr; vq->packed.desc_extra[id].len = total_sg * sizeof(struct vring_packed_desc); @@ -1530,7 +1529,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq, desc[i].len = cpu_to_le32(len); desc[i].id = cpu_to_le16(id); - if (unlikely(vq->use_dma_api)) { + if (unlikely(vq->use_map_api)) { vq->packed.desc_extra[curr].addr = premapped ? DMA_MAPPING_ERROR : addr; vq->packed.desc_extra[curr].len = len; @@ -1665,7 +1664,7 @@ static void detach_buf_packed(struct vring_virtqueue *vq, vq->free_head = id; vq->vq.num_free += state->num; - if (unlikely(vq->use_dma_api)) { + if (unlikely(vq->use_map_api)) { curr = id; for (i = 0; i < state->num; i++) { vring_unmap_extra_packed(vq, @@ -1683,7 +1682,7 @@ static void detach_buf_packed(struct vring_virtqueue *vq, if (!desc) return; - if (vq->use_dma_api) { + if (vq->use_map_api) { len = vq->packed.desc_extra[id].len; num = len / sizeof(struct vring_packed_desc); @@ -1962,25 +1961,25 @@ static struct vring_desc_extra *vring_alloc_desc_extra(unsigned int num) static void vring_free_packed(struct vring_virtqueue_packed *vring_packed, struct virtio_device *vdev, - struct device *dma_dev) + union virtio_map map) { if (vring_packed->vring.desc) vring_free_queue(vdev, vring_packed->ring_size_in_bytes, vring_packed->vring.desc, vring_packed->ring_dma_addr, - dma_dev); + map); if (vring_packed->vring.driver) vring_free_queue(vdev, vring_packed->event_size_in_bytes, vring_packed->vring.driver, vring_packed->driver_event_dma_addr, - dma_dev); + map); if (vring_packed->vring.device) vring_free_queue(vdev, vring_packed->event_size_in_bytes, vring_packed->vring.device, vring_packed->device_event_dma_addr, - dma_dev); + map); kfree(vring_packed->desc_state); kfree(vring_packed->desc_extra); @@ -1988,7 +1987,7 @@ static void vring_free_packed(struct vring_virtqueue_packed *vring_packed, static int vring_alloc_queue_packed(struct vring_virtqueue_packed *vring_packed, struct virtio_device *vdev, - u32 num, struct device *dma_dev) + u32 num, union virtio_map map) { struct vring_packed_desc *ring; struct vring_packed_desc_event *driver, *device; @@ -2000,7 +1999,7 @@ static int 
vring_alloc_queue_packed(struct vring_virtqueue_packed *vring_packed, ring = vring_alloc_queue(vdev, ring_size_in_bytes, &ring_dma_addr, GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, - dma_dev); + map); if (!ring) goto err; @@ -2013,7 +2012,7 @@ static int vring_alloc_queue_packed(struct vring_virtqueue_packed *vring_packed, driver = vring_alloc_queue(vdev, event_size_in_bytes, &driver_event_dma_addr, GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, - dma_dev); + map); if (!driver) goto err; @@ -2024,7 +2023,7 @@ static int vring_alloc_queue_packed(struct vring_virtqueue_packed *vring_packed, device = vring_alloc_queue(vdev, event_size_in_bytes, &device_event_dma_addr, GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, - dma_dev); + map); if (!device) goto err; @@ -2036,7 +2035,7 @@ static int vring_alloc_queue_packed(struct vring_virtqueue_packed *vring_packed, return 0; err: - vring_free_packed(vring_packed, vdev, dma_dev); + vring_free_packed(vring_packed, vdev, map); return -ENOMEM; } @@ -2112,7 +2111,7 @@ static struct virtqueue *__vring_new_virtqueue_packed(unsigned int index, bool (*notify)(struct virtqueue *), void (*callback)(struct virtqueue *), const char *name, - struct device *dma_dev) + union virtio_map map) { struct vring_virtqueue *vq; int err; @@ -2135,8 +2134,8 @@ static struct virtqueue *__vring_new_virtqueue_packed(unsigned int index, vq->broken = false; #endif vq->packed_ring = true; - vq->dma_dev = dma_dev; - vq->use_dma_api = vring_use_dma_api(vdev); + vq->map = map; + vq->use_map_api = vring_use_map_api(vdev); vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC) && !context; @@ -2173,18 +2172,18 @@ static struct virtqueue *vring_create_virtqueue_packed( bool (*notify)(struct virtqueue *), void (*callback)(struct virtqueue *), const char *name, - struct device *dma_dev) + union virtio_map map) { struct vring_virtqueue_packed vring_packed = {}; struct virtqueue *vq; - if (vring_alloc_queue_packed(&vring_packed, vdev, num, dma_dev)) + if (vring_alloc_queue_packed(&vring_packed, vdev, num, map)) return NULL; vq = __vring_new_virtqueue_packed(index, &vring_packed, vdev, weak_barriers, - context, notify, callback, name, dma_dev); + context, notify, callback, name, map); if (!vq) { - vring_free_packed(&vring_packed, vdev, dma_dev); + vring_free_packed(&vring_packed, vdev, map); return NULL; } @@ -2200,7 +2199,7 @@ static int virtqueue_resize_packed(struct virtqueue *_vq, u32 num) struct virtio_device *vdev = _vq->vdev; int err; - if (vring_alloc_queue_packed(&vring_packed, vdev, num, vring_dma_dev(vq))) + if (vring_alloc_queue_packed(&vring_packed, vdev, num, vq->map)) goto err_ring; err = vring_alloc_state_extra_packed(&vring_packed); @@ -2217,7 +2216,7 @@ static int virtqueue_resize_packed(struct virtqueue *_vq, u32 num) return 0; err_state_extra: - vring_free_packed(&vring_packed, vdev, vring_dma_dev(vq)); + vring_free_packed(&vring_packed, vdev, vq->map); err_ring: virtqueue_reinit_packed(vq); return -ENOMEM; @@ -2448,8 +2447,8 @@ struct device *virtqueue_dma_dev(struct virtqueue *_vq) { struct vring_virtqueue *vq = to_vvq(_vq); - if (vq->use_dma_api) - return vring_dma_dev(vq); + if (vq->use_map_api && !_vq->vdev->map) + return vq->map.dma_dev; else return NULL; } @@ -2734,19 +2733,20 @@ struct virtqueue *vring_create_virtqueue( void (*callback)(struct virtqueue *), const char *name) { + union virtio_map map = {.dma_dev = vdev->dev.parent}; if (virtio_has_feature(vdev, VIRTIO_F_RING_PACKED)) return vring_create_virtqueue_packed(index, num, vring_align, vdev, weak_barriers, 
may_reduce_num, - context, notify, callback, name, vdev->dev.parent); + context, notify, callback, name, map); return vring_create_virtqueue_split(index, num, vring_align, vdev, weak_barriers, may_reduce_num, - context, notify, callback, name, vdev->dev.parent); + context, notify, callback, name, map); } EXPORT_SYMBOL_GPL(vring_create_virtqueue); -struct virtqueue *vring_create_virtqueue_dma( +struct virtqueue *vring_create_virtqueue_map( unsigned int index, unsigned int num, unsigned int vring_align, @@ -2757,19 +2757,19 @@ struct virtqueue *vring_create_virtqueue_dma( bool (*notify)(struct virtqueue *), void (*callback)(struct virtqueue *), const char *name, - struct device *dma_dev) + union virtio_map map) { if (virtio_has_feature(vdev, VIRTIO_F_RING_PACKED)) return vring_create_virtqueue_packed(index, num, vring_align, vdev, weak_barriers, may_reduce_num, - context, notify, callback, name, dma_dev); + context, notify, callback, name, map); return vring_create_virtqueue_split(index, num, vring_align, vdev, weak_barriers, may_reduce_num, - context, notify, callback, name, dma_dev); + context, notify, callback, name, map); } -EXPORT_SYMBOL_GPL(vring_create_virtqueue_dma); +EXPORT_SYMBOL_GPL(vring_create_virtqueue_map); /** * virtqueue_resize - resize the vring of vq @@ -2880,6 +2880,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index, const char *name) { struct vring_virtqueue_split vring_split = {}; + union virtio_map map = {.dma_dev = vdev->dev.parent}; if (virtio_has_feature(vdev, VIRTIO_F_RING_PACKED)) { struct vring_virtqueue_packed vring_packed = {}; @@ -2889,13 +2890,13 @@ struct virtqueue *vring_new_virtqueue(unsigned int index, return __vring_new_virtqueue_packed(index, &vring_packed, vdev, weak_barriers, context, notify, callback, - name, vdev->dev.parent); + name, map); } vring_init(&vring_split.vring, num, pages, vring_align); return __vring_new_virtqueue_split(index, &vring_split, vdev, weak_barriers, context, notify, callback, name, - vdev->dev.parent); + map); } EXPORT_SYMBOL_GPL(vring_new_virtqueue); @@ -2909,19 +2910,19 @@ static void vring_free(struct virtqueue *_vq) vq->packed.ring_size_in_bytes, vq->packed.vring.desc, vq->packed.ring_dma_addr, - vring_dma_dev(vq)); + vq->map); vring_free_queue(vq->vq.vdev, vq->packed.event_size_in_bytes, vq->packed.vring.driver, vq->packed.driver_event_dma_addr, - vring_dma_dev(vq)); + vq->map); vring_free_queue(vq->vq.vdev, vq->packed.event_size_in_bytes, vq->packed.vring.device, vq->packed.device_event_dma_addr, - vring_dma_dev(vq)); + vq->map); kfree(vq->packed.desc_state); kfree(vq->packed.desc_extra); @@ -2930,7 +2931,7 @@ static void vring_free(struct virtqueue *_vq) vq->split.queue_size_in_bytes, vq->split.vring.desc, vq->split.queue_dma_addr, - vring_dma_dev(vq)); + vq->map); } } if (!vq->packed_ring) { @@ -3137,7 +3138,108 @@ const struct vring *virtqueue_get_vring(const struct virtqueue *vq) EXPORT_SYMBOL_GPL(virtqueue_get_vring); /** - * virtqueue_dma_map_single_attrs - map DMA for _vq + * virtqueue_map_alloc_coherent - alloc coherent mapping + * @vdev: the virtio device we are talking to + * @map: metadata for performing mapping + * @size: the size of the buffer + * @map_handle: the pointer to the mapped address + * @gfp: allocation flag (GFP_XXX) + * + * return virtual address or NULL on error + */ +void *virtqueue_map_alloc_coherent(struct virtio_device *vdev, + union virtio_map map, + size_t size, dma_addr_t *map_handle, + gfp_t gfp) +{ + if (vdev->map) + return vdev->map->alloc(map, size, + map_handle, gfp); + 
else + return dma_alloc_coherent(map.dma_dev, size, + map_handle, gfp); +} +EXPORT_SYMBOL_GPL(virtqueue_map_alloc_coherent); + +/** + * virtqueue_map_free_coherent - free coherent mapping + * @vdev: the virtio device we are talking to + * @map: metadata for performing mapping + * @size: the size of the buffer + * @map_handle: the mapped address that needs to be freed + * + */ +void virtqueue_map_free_coherent(struct virtio_device *vdev, + union virtio_map map, size_t size, void *vaddr, + dma_addr_t map_handle) +{ + if (vdev->map) + vdev->map->free(map, size, vaddr, + map_handle, 0); + else + dma_free_coherent(map.dma_dev, size, vaddr, map_handle); +} +EXPORT_SYMBOL_GPL(virtqueue_map_free_coherent); + +/** + * virtqueue_map_page_attrs - map a page to the device + * @_vq: the virtqueue we are talking to + * @page: the page that will be mapped by the device + * @offset: the offset in the page for a buffer + * @size: the buffer size + * @dir: mapping direction + * @attrs: mapping attributes + * + * Returns mapped address. Caller should check that by virtqueue_mapping_error(). + */ +dma_addr_t virtqueue_map_page_attrs(const struct virtqueue *_vq, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction dir, + unsigned long attrs) +{ + const struct vring_virtqueue *vq = to_vvq(_vq); + struct virtio_device *vdev = _vq->vdev; + + if (vdev->map) + return vdev->map->map_page(vq->map, + page, offset, size, + dir, attrs); + + return dma_map_page_attrs(vring_dma_dev(vq), + page, offset, size, + dir, attrs); +} +EXPORT_SYMBOL_GPL(virtqueue_map_page_attrs); + +/** + * virtqueue_unmap_page_attrs - map a page to the device + * @_vq: the virtqueue we are talking to + * @map_handle: the mapped address + * @size: the buffer size + * @dir: mapping direction + * @attrs: unmapping attributes + */ +void virtqueue_unmap_page_attrs(const struct virtqueue *_vq, + dma_addr_t map_handle, + size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ + const struct vring_virtqueue *vq = to_vvq(_vq); + struct virtio_device *vdev = _vq->vdev; + + if (vdev->map) + vdev->map->unmap_page(vq->map, + map_handle, size, dir, attrs); + else + dma_unmap_page_attrs(vring_dma_dev(vq), map_handle, + size, dir, attrs); +} +EXPORT_SYMBOL_GPL(virtqueue_unmap_page_attrs); + +/** + * virtqueue_map_single_attrs - map DMA for _vq * @_vq: the struct virtqueue we're talking about. * @ptr: the pointer of the buffer to do dma * @size: the size of the buffer to do dma @@ -3147,139 +3249,158 @@ EXPORT_SYMBOL_GPL(virtqueue_get_vring); * The caller calls this to do dma mapping in advance. The DMA address can be * passed to this _vq when it is in pre-mapped mode. * - * return DMA address. Caller should check that by virtqueue_dma_mapping_error(). + * return mapped address. Caller should check that by virtqueue_mapping_error(). 
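+ *
+ * A minimal sketch of the pre-mapped flow (buf and len are placeholders and
+ * handing the address to the ring is elided):
+ *
+ *	dma_addr_t addr;
+ *
+ *	addr = virtqueue_map_single_attrs(vq, buf, len, DMA_TO_DEVICE, 0);
+ *	if (virtqueue_map_mapping_error(vq, addr))
+ *		return -ENOMEM;
+ *	...
+ *	virtqueue_unmap_single_attrs(vq, addr, len, DMA_TO_DEVICE, 0);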
*/ -dma_addr_t virtqueue_dma_map_single_attrs(struct virtqueue *_vq, void *ptr, - size_t size, - enum dma_data_direction dir, - unsigned long attrs) +dma_addr_t virtqueue_map_single_attrs(const struct virtqueue *_vq, void *ptr, + size_t size, + enum dma_data_direction dir, + unsigned long attrs) { - struct vring_virtqueue *vq = to_vvq(_vq); + const struct vring_virtqueue *vq = to_vvq(_vq); - if (!vq->use_dma_api) { - kmsan_handle_dma(virt_to_page(ptr), offset_in_page(ptr), size, dir); + if (!vq->use_map_api) { + kmsan_handle_dma(virt_to_phys(ptr), size, dir); return (dma_addr_t)virt_to_phys(ptr); } - return dma_map_single_attrs(vring_dma_dev(vq), ptr, size, dir, attrs); + /* DMA must never operate on areas that might be remapped. */ + if (dev_WARN_ONCE(&_vq->vdev->dev, is_vmalloc_addr(ptr), + "rejecting DMA map of vmalloc memory\n")) + return DMA_MAPPING_ERROR; + + return virtqueue_map_page_attrs(&vq->vq, virt_to_page(ptr), + offset_in_page(ptr), size, dir, attrs); } -EXPORT_SYMBOL_GPL(virtqueue_dma_map_single_attrs); +EXPORT_SYMBOL_GPL(virtqueue_map_single_attrs); /** - * virtqueue_dma_unmap_single_attrs - unmap DMA for _vq + * virtqueue_unmap_single_attrs - unmap map for _vq * @_vq: the struct virtqueue we're talking about. * @addr: the dma address to unmap * @size: the size of the buffer * @dir: DMA direction * @attrs: DMA Attrs * - * Unmap the address that is mapped by the virtqueue_dma_map_* APIs. + * Unmap the address that is mapped by the virtqueue_map_* APIs. * */ -void virtqueue_dma_unmap_single_attrs(struct virtqueue *_vq, dma_addr_t addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs) +void virtqueue_unmap_single_attrs(const struct virtqueue *_vq, + dma_addr_t addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) { - struct vring_virtqueue *vq = to_vvq(_vq); + const struct vring_virtqueue *vq = to_vvq(_vq); - if (!vq->use_dma_api) + if (!vq->use_map_api) return; - dma_unmap_single_attrs(vring_dma_dev(vq), addr, size, dir, attrs); + virtqueue_unmap_page_attrs(_vq, addr, size, dir, attrs); } -EXPORT_SYMBOL_GPL(virtqueue_dma_unmap_single_attrs); +EXPORT_SYMBOL_GPL(virtqueue_unmap_single_attrs); /** - * virtqueue_dma_mapping_error - check dma address + * virtqueue_mapping_error - check dma address * @_vq: the struct virtqueue we're talking about. * @addr: DMA address * * Returns 0 means dma valid. Other means invalid dma address. */ -int virtqueue_dma_mapping_error(struct virtqueue *_vq, dma_addr_t addr) +int virtqueue_map_mapping_error(const struct virtqueue *_vq, dma_addr_t addr) { - struct vring_virtqueue *vq = to_vvq(_vq); - - if (!vq->use_dma_api) - return 0; + const struct vring_virtqueue *vq = to_vvq(_vq); - return dma_mapping_error(vring_dma_dev(vq), addr); + return vring_mapping_error(vq, addr); } -EXPORT_SYMBOL_GPL(virtqueue_dma_mapping_error); +EXPORT_SYMBOL_GPL(virtqueue_map_mapping_error); /** - * virtqueue_dma_need_sync - check a dma address needs sync + * virtqueue_map_need_sync - check a dma address needs sync * @_vq: the struct virtqueue we're talking about. 
* @addr: DMA address * - * Check if the dma address mapped by the virtqueue_dma_map_* APIs needs to be + * Check if the dma address mapped by the virtqueue_map_* APIs needs to be * synchronized * * return bool */ -bool virtqueue_dma_need_sync(struct virtqueue *_vq, dma_addr_t addr) +bool virtqueue_map_need_sync(const struct virtqueue *_vq, dma_addr_t addr) { - struct vring_virtqueue *vq = to_vvq(_vq); + const struct vring_virtqueue *vq = to_vvq(_vq); + struct virtio_device *vdev = _vq->vdev; - if (!vq->use_dma_api) + if (!vq->use_map_api) return false; - return dma_need_sync(vring_dma_dev(vq), addr); + if (vdev->map) + return vdev->map->need_sync(vq->map, addr); + else + return dma_need_sync(vring_dma_dev(vq), addr); } -EXPORT_SYMBOL_GPL(virtqueue_dma_need_sync); +EXPORT_SYMBOL_GPL(virtqueue_map_need_sync); /** - * virtqueue_dma_sync_single_range_for_cpu - dma sync for cpu + * virtqueue_map_sync_single_range_for_cpu - map sync for cpu * @_vq: the struct virtqueue we're talking about. * @addr: DMA address * @offset: DMA address offset * @size: buf size for sync * @dir: DMA direction * - * Before calling this function, use virtqueue_dma_need_sync() to confirm that + * Before calling this function, use virtqueue_map_need_sync() to confirm that * the DMA address really needs to be synchronized * */ -void virtqueue_dma_sync_single_range_for_cpu(struct virtqueue *_vq, +void virtqueue_map_sync_single_range_for_cpu(const struct virtqueue *_vq, dma_addr_t addr, unsigned long offset, size_t size, enum dma_data_direction dir) { - struct vring_virtqueue *vq = to_vvq(_vq); - struct device *dev = vring_dma_dev(vq); + const struct vring_virtqueue *vq = to_vvq(_vq); + struct virtio_device *vdev = _vq->vdev; - if (!vq->use_dma_api) + if (!vq->use_map_api) return; - dma_sync_single_range_for_cpu(dev, addr, offset, size, dir); + if (vdev->map) + vdev->map->sync_single_for_cpu(vq->map, + addr + offset, size, dir); + else + dma_sync_single_range_for_cpu(vring_dma_dev(vq), + addr, offset, size, dir); } -EXPORT_SYMBOL_GPL(virtqueue_dma_sync_single_range_for_cpu); +EXPORT_SYMBOL_GPL(virtqueue_map_sync_single_range_for_cpu); /** - * virtqueue_dma_sync_single_range_for_device - dma sync for device + * virtqueue_map_sync_single_range_for_device - map sync for device * @_vq: the struct virtqueue we're talking about. 
* @addr: DMA address * @offset: DMA address offset * @size: buf size for sync * @dir: DMA direction * - * Before calling this function, use virtqueue_dma_need_sync() to confirm that + * Before calling this function, use virtqueue_map_need_sync() to confirm that * the DMA address really needs to be synchronized */ -void virtqueue_dma_sync_single_range_for_device(struct virtqueue *_vq, +void virtqueue_map_sync_single_range_for_device(const struct virtqueue *_vq, dma_addr_t addr, unsigned long offset, size_t size, enum dma_data_direction dir) { - struct vring_virtqueue *vq = to_vvq(_vq); - struct device *dev = vring_dma_dev(vq); + const struct vring_virtqueue *vq = to_vvq(_vq); + struct virtio_device *vdev = _vq->vdev; - if (!vq->use_dma_api) + if (!vq->use_map_api) return; - dma_sync_single_range_for_device(dev, addr, offset, size, dir); + if (vdev->map) + vdev->map->sync_single_for_device(vq->map, + addr + offset, + size, dir); + else + dma_sync_single_range_for_device(vring_dma_dev(vq), addr, + offset, size, dir); } -EXPORT_SYMBOL_GPL(virtqueue_dma_sync_single_range_for_device); +EXPORT_SYMBOL_GPL(virtqueue_map_sync_single_range_for_device); MODULE_DESCRIPTION("Virtio ring implementation"); MODULE_LICENSE("GPL"); diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 657b07a607881..f9a29045eca0d 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -133,12 +133,12 @@ virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index, const char *name, bool ctx) { struct vdpa_device *vdpa = vd_get_vdpa(vdev); - struct device *dma_dev; const struct vdpa_config_ops *ops = vdpa->config; bool (*notify)(struct virtqueue *vq) = virtio_vdpa_notify; struct vdpa_callback cb; struct virtqueue *vq; u64 desc_addr, driver_addr, device_addr; + union virtio_map map = {0}; /* Assume split virtqueue, switch to packed if necessary */ struct vdpa_vq_state state = {0}; u32 align, max_num, min_num = 1; @@ -176,23 +176,27 @@ virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index, if (ops->get_vq_num_min) min_num = ops->get_vq_num_min(vdpa); - may_reduce_num = (max_num == min_num) ? false : true; + may_reduce_num = (max_num != min_num); /* Create the vring */ align = ops->get_vq_align(vdpa); - if (ops->get_vq_dma_dev) - dma_dev = ops->get_vq_dma_dev(vdpa, index); + if (ops->get_vq_map) + map = ops->get_vq_map(vdpa, index); else - dma_dev = vdpa_get_dma_dev(vdpa); - vq = vring_create_virtqueue_dma(index, max_num, align, vdev, + map = vdpa_get_map(vdpa); + + vq = vring_create_virtqueue_map(index, max_num, align, vdev, true, may_reduce_num, ctx, - notify, callback, name, dma_dev); + notify, callback, name, map); if (!vq) { err = -ENOMEM; goto error_new_virtqueue; } + if (index == 0) + vdev->vmap = map; + vq->num_max = max_num; /* Setup virtqueue callback */ @@ -462,9 +466,11 @@ static int virtio_vdpa_probe(struct vdpa_device *vdpa) if (!vd_dev) return -ENOMEM; - vd_dev->vdev.dev.parent = vdpa_get_dma_dev(vdpa); + vd_dev->vdev.dev.parent = vdpa->map ? 
&vdpa->dev : + vdpa_get_map(vdpa).dma_dev; vd_dev->vdev.dev.release = virtio_vdpa_release_dev; vd_dev->vdev.config = &virtio_vdpa_config_ops; + vd_dev->vdev.map = vdpa->map; vd_dev->vdpa = vdpa; vd_dev->vdev.id.device = ops->get_device_id(vdpa); diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c index 29257d2639dbf..14077d23f2a19 100644 --- a/drivers/xen/grant-dma-ops.c +++ b/drivers/xen/grant-dma-ops.c @@ -163,18 +163,22 @@ static void xen_grant_dma_free_pages(struct device *dev, size_t size, xen_grant_dma_free(dev, size, page_to_virt(vaddr), dma_handle, 0); } -static dma_addr_t xen_grant_dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, +static dma_addr_t xen_grant_dma_map_phys(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction dir, unsigned long attrs) { struct xen_grant_dma_data *data; + unsigned long offset = offset_in_page(phys); unsigned long dma_offset = xen_offset_in_page(offset), pfn_offset = XEN_PFN_DOWN(offset); unsigned int i, n_pages = XEN_PFN_UP(dma_offset + size); grant_ref_t grant; dma_addr_t dma_handle; + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; + if (WARN_ON(dir == DMA_NONE)) return DMA_MAPPING_ERROR; @@ -190,7 +194,7 @@ static dma_addr_t xen_grant_dma_map_page(struct device *dev, struct page *page, for (i = 0; i < n_pages; i++) { gnttab_grant_foreign_access_ref(grant + i, data->backend_domid, - pfn_to_gfn(page_to_xen_pfn(page) + i + pfn_offset), + pfn_to_gfn(page_to_xen_pfn(phys_to_page(phys)) + i + pfn_offset), dir == DMA_TO_DEVICE); } @@ -199,7 +203,7 @@ static dma_addr_t xen_grant_dma_map_page(struct device *dev, struct page *page, return dma_handle; } -static void xen_grant_dma_unmap_page(struct device *dev, dma_addr_t dma_handle, +static void xen_grant_dma_unmap_phys(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs) { @@ -242,7 +246,7 @@ static void xen_grant_dma_unmap_sg(struct device *dev, struct scatterlist *sg, return; for_each_sg(sg, s, nents, i) - xen_grant_dma_unmap_page(dev, s->dma_address, sg_dma_len(s), dir, + xen_grant_dma_unmap_phys(dev, s->dma_address, sg_dma_len(s), dir, attrs); } @@ -257,7 +261,7 @@ static int xen_grant_dma_map_sg(struct device *dev, struct scatterlist *sg, return -EINVAL; for_each_sg(sg, s, nents, i) { - s->dma_address = xen_grant_dma_map_page(dev, sg_page(s), s->offset, + s->dma_address = xen_grant_dma_map_phys(dev, sg_phys(s), s->length, dir, attrs); if (s->dma_address == DMA_MAPPING_ERROR) goto out; @@ -286,8 +290,8 @@ static const struct dma_map_ops xen_grant_dma_ops = { .free_pages = xen_grant_dma_free_pages, .mmap = dma_common_mmap, .get_sgtable = dma_common_get_sgtable, - .map_page = xen_grant_dma_map_page, - .unmap_page = xen_grant_dma_unmap_page, + .map_phys = xen_grant_dma_map_phys, + .unmap_phys = xen_grant_dma_unmap_phys, .map_sg = xen_grant_dma_map_sg, .unmap_sg = xen_grant_dma_unmap_sg, .dma_supported = xen_grant_dma_supported, diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index da1a7d3d377cf..ccf25027bec19 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -200,17 +200,32 @@ xen_swiotlb_free_coherent(struct device *dev, size_t size, void *vaddr, * physical address to use is returned. * * Once the device is given the dma address, the device owns this memory until - * either xen_swiotlb_unmap_page or xen_swiotlb_dma_sync_single is performed. 
+ * either xen_swiotlb_unmap_phys or xen_swiotlb_dma_sync_single is performed. */ -static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, +static dma_addr_t xen_swiotlb_map_phys(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction dir, unsigned long attrs) { - phys_addr_t map, phys = page_to_phys(page) + offset; - dma_addr_t dev_addr = xen_phys_to_dma(dev, phys); + dma_addr_t dev_addr; + phys_addr_t map; BUG_ON(dir == DMA_NONE); + + if (attrs & DMA_ATTR_MMIO) { + if (unlikely(!dma_capable(dev, phys, size, false))) { + dev_err_once( + dev, + "DMA addr %pa+%zu overflow (mask %llx, bus limit %llx).\n", + &phys, size, *dev->dma_mask, + dev->bus_dma_limit); + WARN_ON_ONCE(1); + return DMA_MAPPING_ERROR; + } + return phys; + } + + dev_addr = xen_phys_to_dma(dev, phys); + /* * If the address happens to be in the device's DMA window, * we can safely return the device addr and not worry about bounce @@ -257,13 +272,13 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page, /* * Unmap a single streaming mode DMA translation. The dma_addr and size must - * match what was provided for in a previous xen_swiotlb_map_page call. All + * match what was provided for in a previous xen_swiotlb_map_phys call. All * other usages are undefined. * * After this call, reads by the cpu to the buffer are guaranteed to see * whatever the device wrote there. */ -static void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, +static void xen_swiotlb_unmap_phys(struct device *hwdev, dma_addr_t dev_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { phys_addr_t paddr = xen_dma_to_phys(hwdev, dev_addr); @@ -325,7 +340,7 @@ xen_swiotlb_sync_single_for_device(struct device *dev, dma_addr_t dma_addr, /* * Unmap a set of streaming mode DMA translations. Again, cpu read rules - * concerning calls here are the same as for swiotlb_unmap_page() above. + * concerning calls here are the same as for swiotlb_unmap_phys() above. 
*/ static void xen_swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sgl, int nelems, @@ -337,7 +352,7 @@ xen_swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sgl, int nelems, BUG_ON(dir == DMA_NONE); for_each_sg(sgl, sg, nelems, i) - xen_swiotlb_unmap_page(hwdev, sg->dma_address, sg_dma_len(sg), + xen_swiotlb_unmap_phys(hwdev, sg->dma_address, sg_dma_len(sg), dir, attrs); } @@ -352,8 +367,8 @@ xen_swiotlb_map_sg(struct device *dev, struct scatterlist *sgl, int nelems, BUG_ON(dir == DMA_NONE); for_each_sg(sgl, sg, nelems, i) { - sg->dma_address = xen_swiotlb_map_page(dev, sg_page(sg), - sg->offset, sg->length, dir, attrs); + sg->dma_address = xen_swiotlb_map_phys(dev, sg_phys(sg), + sg->length, dir, attrs); if (sg->dma_address == DMA_MAPPING_ERROR) goto out_unmap; sg_dma_len(sg) = sg->length; @@ -418,13 +433,12 @@ const struct dma_map_ops xen_swiotlb_dma_ops = { .sync_sg_for_device = xen_swiotlb_sync_sg_for_device, .map_sg = xen_swiotlb_map_sg, .unmap_sg = xen_swiotlb_unmap_sg, - .map_page = xen_swiotlb_map_page, - .unmap_page = xen_swiotlb_unmap_page, + .map_phys = xen_swiotlb_map_phys, + .unmap_phys = xen_swiotlb_unmap_phys, .dma_supported = xen_swiotlb_dma_supported, .mmap = dma_common_mmap, .get_sgtable = dma_common_get_sgtable, .alloc_pages_op = dma_common_alloc_pages, .free_pages = dma_common_free_pages, .max_mapping_size = swiotlb_max_mapping_size, - .map_resource = dma_direct_map_resource, }; diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h index 681cf0c8b9df4..f64e317c091bd 100644 --- a/include/kvm/arm_arch_timer.h +++ b/include/kvm/arm_arch_timer.h @@ -113,6 +113,8 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); +void kvm_realm_timers_update(struct kvm_vcpu *vcpu); + u64 kvm_phys_timer_read(void); void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu); diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h index 96754b51b4116..da32f1bd9f8ce 100644 --- a/include/kvm/arm_pmu.h +++ b/include/kvm/arm_pmu.h @@ -70,6 +70,8 @@ void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu); void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu); void kvm_vcpu_pmu_resync_el0(void); +#define kvm_pmu_get_irq_level(vcpu) ((vcpu)->arch.pmu.irq_level) + #define kvm_vcpu_has_pmu(vcpu) \ (vcpu_has_feature(vcpu, KVM_ARM_VCPU_PMU_V3)) @@ -157,6 +159,8 @@ static inline u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1) return 0; } +#define kvm_pmu_get_irq_level(vcpu) (false) + #define kvm_vcpu_has_pmu(vcpu) ({ false; }) static inline void kvm_pmu_update_vcpu_events(struct kvm_vcpu *vcpu) {} static inline void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) {} diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h index cbaec804eb839..38dab7add79b2 100644 --- a/include/kvm/arm_psci.h +++ b/include/kvm/arm_psci.h @@ -10,6 +10,8 @@ #include #include +#include + #define KVM_ARM_PSCI_0_1 PSCI_VERSION(0, 1) #define KVM_ARM_PSCI_0_2 PSCI_VERSION(0, 2) #define KVM_ARM_PSCI_1_0 PSCI_VERSION(1, 0) diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h index e67a2b6e8f111..b9e6376b5e36a 100644 --- a/include/linux/blk-integrity.h +++ b/include/linux/blk-integrity.h @@ -4,6 +4,7 @@ #include #include +#include struct request; @@ -26,11 +27,17 @@ static inline bool queue_limits_stack_integrity_bdev(struct queue_limits *t, #ifdef 
CONFIG_BLK_DEV_INTEGRITY int blk_rq_map_integrity_sg(struct request *, struct scatterlist *); + int blk_rq_count_integrity_sg(struct request_queue *, struct bio *); int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf, ssize_t bytes); int blk_get_meta_cap(struct block_device *bdev, unsigned int cmd, struct logical_block_metadata_cap __user *argp); +bool blk_rq_integrity_dma_map_iter_start(struct request *req, + struct device *dma_dev, struct dma_iova_state *state, + struct blk_dma_iter *iter); +bool blk_rq_integrity_dma_map_iter_next(struct request *req, + struct device *dma_dev, struct blk_dma_iter *iter); static inline bool blk_integrity_queue_supports_integrity(struct request_queue *q) @@ -115,6 +122,17 @@ static inline int blk_rq_integrity_map_user(struct request *rq, { return -EINVAL; } +static inline bool blk_rq_integrity_dma_map_iter_start(struct request *req, + struct device *dma_dev, struct dma_iova_state *state, + struct blk_dma_iter *iter) +{ + return false; +} +static inline bool blk_rq_integrity_dma_map_iter_next(struct request *req, + struct device *dma_dev, struct blk_dma_iter *iter) +{ + return false; +} static inline struct blk_integrity *bdev_get_integrity(struct block_device *b) { return NULL; diff --git a/include/linux/blk-mq-dma.h b/include/linux/blk-mq-dma.h index c26a01aeae006..cb88fc791fbd1 100644 --- a/include/linux/blk-mq-dma.h +++ b/include/linux/blk-mq-dma.h @@ -5,17 +5,24 @@ #include #include +struct blk_map_iter { + struct bvec_iter iter; + struct bio *bio; + struct bio_vec *bvecs; + bool is_integrity; +}; + struct blk_dma_iter { /* Output address range for this iteration */ dma_addr_t addr; u32 len; + struct pci_p2pdma_map_state p2pdma; /* Status code. Only valid when blk_rq_dma_map_iter_* returned false */ blk_status_t status; /* Internal to blk_rq_dma_map_iter_* */ - struct req_iterator iter; - struct pci_p2pdma_map_state p2pdma; + struct blk_map_iter iter; }; bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev, @@ -41,23 +48,29 @@ static inline bool blk_rq_dma_map_coalesce(struct dma_iova_state *state) * @dma_dev: device to unmap from * @state: DMA IOVA state * @mapped_len: number of bytes to unmap + * @map: peer-to-peer mapping type * * Returns %false if the callers need to manually unmap every DMA segment * mapped using @iter or %true if no work is left to be done. 
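With the iterator now carrying the pci_p2pdma_map_state, a block driver records the P2P mapping type at map time and feeds it back to blk_rq_dma_unmap() on completion. A rough sketch, assuming a hypothetical per-request "iod" that stores dma_state, mapped_len and p2p_map::

	struct blk_dma_iter iter;
	size_t mapped = 0;

	if (!blk_rq_dma_map_iter_start(req, dma_dev, &iod->dma_state, &iter))
		return iter.status;
	do {
		/* program one DMA segment from iter.addr / iter.len */
		mapped += iter.len;
	} while (blk_rq_dma_map_iter_next(req, dma_dev, &iod->dma_state, &iter));
	if (iter.status != BLK_STS_OK)
		return iter.status;	/* iteration stopped on error, not at the end */

	iod->mapped_len = mapped;
	iod->p2p_map = iter.p2pdma.map;

	/* completion path */
	if (!blk_rq_dma_unmap(req, dma_dev, &iod->dma_state, iod->mapped_len,
			      iod->p2p_map))
		; /* walk and unmap each programmed segment individually */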
*/ static inline bool blk_rq_dma_unmap(struct request *req, struct device *dma_dev, - struct dma_iova_state *state, size_t mapped_len) + struct dma_iova_state *state, size_t mapped_len, + enum pci_p2pdma_map_type map) { - if (req->cmd_flags & REQ_P2PDMA) + if (map == PCI_P2PDMA_MAP_BUS_ADDR) return true; if (dma_use_iova(state)) { + unsigned int attrs = 0; + + if (map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) + attrs |= DMA_ATTR_MMIO; + dma_iova_destroy(dma_dev, state, mapped_len, rq_dma_dir(req), - 0); + attrs); return true; } return !dma_need_unmap(dma_dev); } - #endif /* BLK_MQ_DMA_H */ diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 1d6e2df0fdd31..30f6c1d005ec3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -383,7 +383,6 @@ enum req_flag_bits { __REQ_DRV, /* for driver use */ __REQ_FS_PRIVATE, /* for file system (submitter) use */ __REQ_ATOMIC, /* for atomic write operations */ - __REQ_P2PDMA, /* contains P2P DMA pages */ /* * Command specific flags, keep last: */ @@ -416,7 +415,6 @@ enum req_flag_bits { #define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV) #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE) #define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC) -#define REQ_P2PDMA (__force blk_opf_t)(1ULL << __REQ_P2PDMA) #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP) diff --git a/include/linux/dma-buf-mapping.h b/include/linux/dma-buf-mapping.h new file mode 100644 index 0000000000000..a3c0ce2d3a42f --- /dev/null +++ b/include/linux/dma-buf-mapping.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * DMA BUF Mapping Helpers + * + */ +#ifndef __DMA_BUF_MAPPING_H__ +#define __DMA_BUF_MAPPING_H__ +#include + +struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, + struct p2pdma_provider *provider, + struct dma_buf_phys_vec *phys_vec, + size_t nr_ranges, size_t size, + enum dma_data_direction dir); +void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, + enum dma_data_direction dir); +#endif diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index d58e329ac0e71..0bc492090237e 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,7 @@ #include #include #include +#include struct device; struct dma_buf; @@ -530,6 +531,16 @@ struct dma_buf_export_info { void *priv; }; +/** + * struct dma_buf_phys_vec - describe continuous chunk of memory + * @paddr: physical address of that chunk + * @len: Length of this chunk + */ +struct dma_buf_phys_vec { + phys_addr_t paddr; + size_t len; +}; + /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index f3bc0bcd70980..c249912456f96 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -149,7 +149,5 @@ void dma_direct_free_pages(struct device *dev, size_t size, struct page *page, dma_addr_t dma_addr, enum dma_data_direction dir); int dma_direct_supported(struct device *dev, u64 mask); -dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, - size_t size, enum dma_data_direction dir, unsigned long attrs); #endif /* _LINUX_DMA_DIRECT_H */ diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 332b80c42b6f3..4809204c674cc 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -31,10 +31,10 @@ struct dma_map_ops { void *cpu_addr, dma_addr_t dma_addr, size_t size, unsigned long attrs); - 
dma_addr_t (*map_page)(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction dir, unsigned long attrs); - void (*unmap_page)(struct device *dev, dma_addr_t dma_handle, + dma_addr_t (*map_phys)(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction dir, + unsigned long attrs); + void (*unmap_phys)(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs); /* @@ -46,12 +46,6 @@ struct dma_map_ops { enum dma_data_direction dir, unsigned long attrs); void (*unmap_sg)(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); - dma_addr_t (*map_resource)(struct device *dev, phys_addr_t phys_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs); - void (*unmap_resource)(struct device *dev, dma_addr_t dma_handle, - size_t size, enum dma_data_direction dir, - unsigned long attrs); void (*sync_single_for_cpu)(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir); void (*sync_single_for_device)(struct device *dev, @@ -395,15 +389,15 @@ void *arch_dma_set_uncached(void *addr, size_t size); void arch_dma_clear_uncached(void *addr, size_t size); #ifdef CONFIG_ARCH_HAS_DMA_MAP_DIRECT -bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr); -bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle); +bool arch_dma_map_phys_direct(struct device *dev, phys_addr_t addr); +bool arch_dma_unmap_phys_direct(struct device *dev, dma_addr_t dma_handle); bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg, int nents); bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg, int nents); #else -#define arch_dma_map_page_direct(d, a) (false) -#define arch_dma_unmap_page_direct(d, a) (false) +#define arch_dma_map_phys_direct(d, a) (false) +#define arch_dma_unmap_phys_direct(d, a) (false) #define arch_dma_map_sg_direct(d, s, n) (false) #define arch_dma_unmap_sg_direct(d, s, n) (false) #endif diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 55c03e5fe8cb3..8248ff9363eed 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -58,6 +58,26 @@ */ #define DMA_ATTR_PRIVILEGED (1UL << 9) +/* + * DMA_ATTR_MMIO - Indicates memory-mapped I/O (MMIO) region for DMA mapping + * + * This attribute indicates the physical address is not normal system + * memory. It may not be used with kmap*()/phys_to_virt()/phys_to_page() + * functions, it may not be cacheable, and access using CPU load/store + * instructions may not be allowed. + * + * Usually this will be used to describe MMIO addresses, or other non-cacheable + * register addresses. When DMA mapping this sort of address we call + * the operation Peer to Peer as a one device is DMA'ing to another device. + * For PCI devices the p2pdma APIs must be used to determine if DMA_ATTR_MMIO + * is appropriate. + * + * For architectures that require cache flushing for DMA coherence + * DMA_ATTR_MMIO will not perform any cache flushing. The address + * provided must never be mapped cacheable into the CPU. + */ +#define DMA_ATTR_MMIO (1UL << 10) + /* * A dma_addr_t can hold any valid DMA or bus address for the platform. It can * be given to a device to use as a DMA source or target. 
It is specific to a @@ -118,6 +138,10 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, unsigned long attrs); void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs); +dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size, + enum dma_data_direction dir, unsigned long attrs); +void dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size, + enum dma_data_direction dir, unsigned long attrs); unsigned int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, @@ -172,6 +196,15 @@ static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { } +static inline dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction dir, unsigned long attrs) +{ + return DMA_MAPPING_ERROR; +} +static inline void dma_unmap_phys(struct device *dev, dma_addr_t addr, + size_t size, enum dma_data_direction dir, unsigned long attrs) +{ +} static inline unsigned int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs) diff --git a/include/linux/iommu-dma.h b/include/linux/iommu-dma.h index 508beaa44c39e..a92b3ff9b9343 100644 --- a/include/linux/iommu-dma.h +++ b/include/linux/iommu-dma.h @@ -21,10 +21,9 @@ static inline bool use_dma_iommu(struct device *dev) } #endif /* CONFIG_IOMMU_DMA */ -dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs); -void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle, +dma_addr_t iommu_dma_map_phys(struct device *dev, phys_addr_t phys, size_t size, + enum dma_data_direction dir, unsigned long attrs); +void iommu_dma_unmap_phys(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs); int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs); @@ -43,10 +42,6 @@ size_t iommu_dma_opt_mapping_size(void); size_t iommu_dma_max_mapping_size(struct device *dev); void iommu_dma_free(struct device *dev, size_t size, void *cpu_addr, dma_addr_t handle, unsigned long attrs); -dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys, - size_t size, enum dma_data_direction dir, unsigned long attrs); -void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle, - size_t size, enum dma_data_direction dir, unsigned long attrs); struct sg_table *iommu_dma_alloc_noncontiguous(struct device *dev, size_t size, enum dma_data_direction dir, gfp_t gfp, unsigned long attrs); void iommu_dma_free_noncontiguous(struct device *dev, size_t size, diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 39fe3e6cd282f..6c8abb2534cf3 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -527,7 +527,7 @@ extern bool kexec_file_dbg_print; #define kexec_dprintk(fmt, arg...) 
\ do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); +extern void *kimage_map_segment(struct kimage *image, int idx); extern void kimage_unmap_segment(void *buffer); #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; @@ -537,7 +537,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } static inline void crash_kexec(struct pt_regs *regs) { } static inline int kexec_should_crash(struct task_struct *p) { return 0; } static inline int kexec_crash_loaded(void) { return 0; } -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) +static inline void *kimage_map_segment(struct kimage *image, int idx) { return NULL; } static inline void kimage_unmap_segment(void *buffer) { } #define kexec_in_progress false diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index 2b1432cc16d59..f2fd221107bba 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -182,8 +182,7 @@ void kmsan_iounmap_page_range(unsigned long start, unsigned long end); /** * kmsan_handle_dma() - Handle a DMA data transfer. - * @page: first page of the buffer. - * @offset: offset of the buffer within the first page. + * @phys: physical address of the buffer. * @size: buffer size. * @dir: one of possible dma_data_direction values. * @@ -192,7 +191,7 @@ void kmsan_iounmap_page_range(unsigned long start, unsigned long end); * * initializes the buffer, if it is copied from device; * * does both, if this is a DMA_BIDIRECTIONAL transfer. */ -void kmsan_handle_dma(struct page *page, size_t offset, size_t size, +void kmsan_handle_dma(phys_addr_t phys, size_t size, enum dma_data_direction dir); /** @@ -372,8 +371,8 @@ static inline void kmsan_iounmap_page_range(unsigned long start, { } -static inline void kmsan_handle_dma(struct page *page, size_t offset, - size_t size, enum dma_data_direction dir) +static inline void kmsan_handle_dma(phys_addr_t phys, size_t size, + enum dma_data_direction dir) { } diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 8d3fa3a91ce47..2a1f346178024 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -618,6 +618,7 @@ FOLIO_FLAG(dropbehind, FOLIO_HEAD_PAGE) #else PAGEFLAG_FALSE(HighMem, highmem) #endif +#define PhysHighMem(__p) (PageHighMem(phys_to_page(__p))) /* Does kmap_local_folio() only allow access to one page of the folio? */ #ifdef CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 075c20b161d98..b35603a7c94f9 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -16,7 +16,58 @@ struct block_device; struct scatterlist; +/** + * struct p2pdma_provider + * + * A p2pdma provider is a range of MMIO address space available to the CPU. + */ +struct p2pdma_provider { + struct device *owner; + u64 bus_offset; +}; + +enum pci_p2pdma_map_type { + /* + * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before + * the mapping type has been calculated. Exported routines for the API + * will never return this value. + */ + PCI_P2PDMA_MAP_UNKNOWN = 0, + + /* + * Not a PCI P2PDMA transfer. + */ + PCI_P2PDMA_MAP_NONE, + + /* + * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will + * traverse the host bridge and the host bridge is not in the + * allowlist. DMA Mapping routines should return an error when + * this is returned. 
+ */ + PCI_P2PDMA_MAP_NOT_SUPPORTED, + + /* + * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to + * each other directly through a PCI switch and the transaction will + * not traverse the host bridge. Such a mapping should program + * the DMA engine with PCI bus addresses. + */ + PCI_P2PDMA_MAP_BUS_ADDR, + + /* + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk + * to each other, but the transaction traverses a host bridge on the + * allowlist. In this case, a normal mapping either with CPU physical + * addresses (in the case of dma-direct) or IOVA addresses (in the + * case of IOMMUs) should be used to program the DMA engine. + */ + PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, +}; + #ifdef CONFIG_PCI_P2PDMA +int pcim_p2pdma_init(struct pci_dev *pdev); +struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar); int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, @@ -34,7 +85,18 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, bool *use_p2pdma); ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, bool use_p2pdma); +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, + struct device *dev); #else /* CONFIG_PCI_P2PDMA */ +static inline int pcim_p2pdma_init(struct pci_dev *pdev) +{ + return -EOPNOTSUPP; +} +static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, + int bar) +{ + return NULL; +} static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { @@ -90,6 +152,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page, { return sprintf(page, "none\n"); } +static inline enum pci_p2pdma_map_type +pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) +{ + return PCI_P2PDMA_MAP_NOT_SUPPORTED; +} #endif /* CONFIG_PCI_P2PDMA */ @@ -104,51 +171,12 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client) return pci_p2pmem_find_many(&client, 1); } -enum pci_p2pdma_map_type { - /* - * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before - * the mapping type has been calculated. Exported routines for the API - * will never return this value. - */ - PCI_P2PDMA_MAP_UNKNOWN = 0, - - /* - * Not a PCI P2PDMA transfer. - */ - PCI_P2PDMA_MAP_NONE, - - /* - * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will - * traverse the host bridge and the host bridge is not in the - * allowlist. DMA Mapping routines should return an error when - * this is returned. - */ - PCI_P2PDMA_MAP_NOT_SUPPORTED, - - /* - * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to - * each other directly through a PCI switch and the transaction will - * not traverse the host bridge. Such a mapping should program - * the DMA engine with PCI bus addresses. - */ - PCI_P2PDMA_MAP_BUS_ADDR, - - /* - * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk - * to each other, but the transaction traverses a host bridge on the - * allowlist. In this case, a normal mapping either with CPU physical - * addresses (in the case of dma-direct) or IOVA addresses (in the - * case of IOMMUs) should be used to program the DMA engine. 
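A consuming driver is expected to resolve the provider once and then branch on the mapping type: program bus addresses directly for switch-routed traffic, and fall back to a DMA_ATTR_MMIO mapping when the transaction goes through the host bridge. A hedged sketch, where exporter_pdev/bar/dma_dev/phys/len are placeholders and the exporting driver is assumed to have called pcim_p2pdma_init() already::

	struct p2pdma_provider *provider;
	dma_addr_t dma;

	provider = pcim_p2pdma_provider(exporter_pdev, bar);
	if (!provider)
		return -EOPNOTSUPP;

	switch (pci_p2pdma_map_type(provider, dma_dev)) {
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/* stays inside the switch fabric: use the PCI bus address */
		dma = pci_p2pdma_bus_addr_map(provider, phys);
		break;
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		/* routed through the host bridge: map as MMIO via the DMA API */
		dma = dma_map_phys(dma_dev, phys, len, DMA_TO_DEVICE, DMA_ATTR_MMIO);
		if (dma_mapping_error(dma_dev, dma))
			return -EIO;
		break;
	default:
		return -EOPNOTSUPP;	/* PCI_P2PDMA_MAP_NOT_SUPPORTED */
	}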
- */ - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, -}; - struct pci_p2pdma_map_state { - struct dev_pagemap *pgmap; + struct p2pdma_provider *mem; enum pci_p2pdma_map_type map; - u64 bus_off; }; + /* helper for pci_p2pdma_state(), do not use directly */ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page); @@ -167,8 +195,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page) { if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) { - if (state->pgmap != page_pgmap(page)) - __pci_p2pdma_update_state(state, dev, page); + __pci_p2pdma_update_state(state, dev, page); return state->map; } return PCI_P2PDMA_MAP_NONE; @@ -177,16 +204,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev, /** * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address * for a PCI_P2PDMA_MAP_BUS_ADDR transfer. - * @state: P2P state structure + * @provider: P2P provider structure * @paddr: physical address to map * * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer. */ static inline dma_addr_t -pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr) +pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr) { - WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR); - return paddr + state->bus_off; + return paddr + provider->bus_offset; } #endif /* _LINUX_PCI_P2P_H */ diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h index 2d39322c40c43..7b41ea0f51803 100644 --- a/include/linux/perf/arm_pmu.h +++ b/include/linux/perf/arm_pmu.h @@ -186,6 +186,7 @@ void kvm_host_pmu_init(struct arm_pmu *pmu); #endif bool arm_pmu_irq_is_nmi(void); +void arm_pmu_set_phys_irq(bool enable); /* Internal functions only for core arm_pmu code */ struct arm_pmu *armpmu_alloc(void); @@ -196,6 +197,10 @@ void armpmu_free_irq(int irq, int cpu); #define ARMV8_PMU_PDEV_NAME "armv8-pmu" +#else /* CONFIG_ARM_PMU */ + +static inline void arm_pmu_set_phys_irq(bool enable) {} + #endif /* CONFIG_ARM_PMU */ #define ARMV8_SPE_PDEV_NAME "arm,spe-v1" diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 2e7a30fe6b925..4cf21d6e9cfde 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -5,6 +5,7 @@ #include #include #include +#include #include #include #include @@ -70,11 +71,12 @@ struct vdpa_mgmt_dev; /** * struct vdpa_device - representation of a vDPA device * @dev: underlying device - * @dma_dev: the actual device that is performing DMA + * @vmap: the metadata passed to upper layer to be used for mapping * @driver_override: driver name to force a match; do not set directly, * because core frees it; use driver_set_override() to * set or clear it. * @config: the configuration ops for this device. + * @map: the map ops for this device * @cf_lock: Protects get and set access to configuration layout. * @index: device index * @features_valid: were features initialized? 
for legacy guests @@ -87,9 +89,10 @@ struct vdpa_mgmt_dev; */ struct vdpa_device { struct device dev; - struct device *dma_dev; + union virtio_map vmap; const char *driver_override; const struct vdpa_config_ops *config; + const struct virtio_map_ops *map; struct rw_semaphore cf_lock; /* Protects get/set config */ unsigned int index; bool features_valid; @@ -352,11 +355,11 @@ struct vdpa_map_file { * @vdev: vdpa device * @asid: address space identifier * Returns integer: success (0) or error (< 0) - * @get_vq_dma_dev: Get the dma device for a specific + * @get_vq_map: Get the map metadata for a specific * virtqueue (optional) * @vdev: vdpa device * @idx: virtqueue index - * Returns pointer to structure device or error (NULL) + * Returns map token union error (NULL) * @bind_mm: Bind the device to a specific address space * so the vDPA framework can use VA when this * callback is implemented. (optional) @@ -436,7 +439,7 @@ struct vdpa_config_ops { int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); - struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); + union virtio_map (*get_vq_map)(struct vdpa_device *vdev, u16 idx); int (*bind_mm)(struct vdpa_device *vdev, struct mm_struct *mm); void (*unbind_mm)(struct vdpa_device *vdev); @@ -446,6 +449,7 @@ struct vdpa_config_ops { struct vdpa_device *__vdpa_alloc_device(struct device *parent, const struct vdpa_config_ops *config, + const struct virtio_map_ops *map, unsigned int ngroups, unsigned int nas, size_t size, const char *name, bool use_va); @@ -457,6 +461,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, * @member: the name of struct vdpa_device within the @dev_struct * @parent: the parent device * @config: the bus operations that is supported by this device + * @map: the map operations that is supported by this device * @ngroups: the number of virtqueue groups supported by this device * @nas: the number of address spaces * @name: name of the vdpa device @@ -464,10 +469,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent, * * Return allocated data structure or ERR_PTR upon error */ -#define vdpa_alloc_device(dev_struct, member, parent, config, ngroups, nas, \ - name, use_va) \ +#define vdpa_alloc_device(dev_struct, member, parent, config, map, \ + ngroups, nas, name, use_va) \ container_of((__vdpa_alloc_device( \ - parent, config, ngroups, nas, \ + parent, config, map, ngroups, nas, \ (sizeof(dev_struct) + \ BUILD_BUG_ON_ZERO(offsetof( \ dev_struct, member))), name, use_va)), \ @@ -520,9 +525,9 @@ static inline void vdpa_set_drvdata(struct vdpa_device *vdev, void *data) dev_set_drvdata(&vdev->dev, data); } -static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) +static inline union virtio_map vdpa_get_map(struct vdpa_device *vdev) { - return vdev->dma_dev; + return vdev->vmap; } static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) diff --git a/include/linux/vfio.h b/include/linux/vfio.h index eb563f538dee5..217ba4ef17522 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -297,6 +297,8 @@ static inline void vfio_put_device(struct vfio_device *device) int vfio_register_group_dev(struct vfio_device *device); int vfio_register_emulated_iommu_dev(struct vfio_device *device); void vfio_unregister_group_dev(struct vfio_device *device); +bool vfio_device_try_get_registration(struct vfio_device *device); +void vfio_device_put_registration(struct vfio_device 
*device); int vfio_assign_device_set(struct vfio_device *device, void *set_id); unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set); diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 6db13f66b5e4b..db4c6210f2e65 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -26,6 +26,9 @@ struct vfio_pci_core_device; struct vfio_pci_region; +struct p2pdma_provider; +struct dma_buf_phys_vec; +struct dma_buf_attachment; struct vfio_pci_regops { ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf, @@ -49,9 +52,48 @@ struct vfio_pci_region { u32 flags; }; +struct vfio_pci_device_ops { + int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev, + struct p2pdma_provider **provider, + unsigned int region_index, + struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges); +}; + +#if IS_ENABLED(CONFIG_VFIO_PCI_DMABUF) +int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges, phys_addr_t start, + phys_addr_t len); +int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev, + struct p2pdma_provider **provider, + unsigned int region_index, + struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges); +#else +static inline int +vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges, phys_addr_t start, + phys_addr_t len) +{ + return -EINVAL; +} +static inline int vfio_pci_core_get_dmabuf_phys( + struct vfio_pci_core_device *vdev, struct p2pdma_provider **provider, + unsigned int region_index, struct dma_buf_phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, size_t nr_ranges) +{ + return -EOPNOTSUPP; +} +#endif + struct vfio_pci_core_device { struct vfio_device vdev; struct pci_dev *pdev; + const struct vfio_pci_device_ops *pci_ops; void __iomem *barmap[PCI_STD_NUM_BARS]; bool bar_mmap_supported[PCI_STD_NUM_BARS]; u8 *pci_config_map; @@ -94,6 +136,7 @@ struct vfio_pci_core_device { struct vfio_pci_core_device *sriov_pf_core_dev; struct notifier_block nb; struct rw_semaphore memory_lock; + struct list_head dmabufs; }; /* Will be exported for vfio pci drivers usage */ @@ -175,4 +218,7 @@ static inline bool is_aligned_for_order(struct vm_area_struct *vma, !IS_ALIGNED(pfn, 1 << order))); } +int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, + struct dma_buf_phys_vec *phys); + #endif /* VFIO_PCI_CORE_H */ diff --git a/include/linux/virtio.h b/include/linux/virtio.h index db31fc6f4f1fa..96c66126c0741 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -41,6 +41,15 @@ struct virtqueue { void *priv; }; +struct vduse_iova_domain; + +union virtio_map { + /* Device that performs DMA */ + struct device *dma_dev; + /* VDUSE specific mapping data */ + struct vduse_iova_domain *iova_domain; +}; + int virtqueue_add_outbuf(struct virtqueue *vq, struct scatterlist sg[], unsigned int num, void *data, @@ -161,9 +170,11 @@ struct virtio_device { struct virtio_device_id id; const struct virtio_config_ops *config; const struct vringh_config_ops *vringh_config; + const struct virtio_map_ops *map; struct list_head vqs; VIRTIO_DECLARE_FEATURES(features); void *priv; + union virtio_map vmap; #ifdef CONFIG_VIRTIO_DEBUG struct dentry *debugfs_dir; u64 debugfs_filter_features[VIRTIO_FEATURES_DWORDS]; @@ -262,18 +273,41 @@ void unregister_virtio_driver(struct 
virtio_driver *drv); module_driver(__virtio_driver, register_virtio_driver, \ unregister_virtio_driver) -dma_addr_t virtqueue_dma_map_single_attrs(struct virtqueue *_vq, void *ptr, size_t size, + +void *virtqueue_map_alloc_coherent(struct virtio_device *vdev, + union virtio_map mapping_token, + size_t size, dma_addr_t *dma_handle, + gfp_t gfp); + +void virtqueue_map_free_coherent(struct virtio_device *vdev, + union virtio_map mapping_token, + size_t size, void *vaddr, + dma_addr_t dma_handle); + +dma_addr_t virtqueue_map_page_attrs(const struct virtqueue *_vq, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction dir, + unsigned long attrs); + +void virtqueue_unmap_page_attrs(const struct virtqueue *_vq, + dma_addr_t dma_handle, + size_t size, enum dma_data_direction dir, + unsigned long attrs); + +dma_addr_t virtqueue_map_single_attrs(const struct virtqueue *_vq, void *ptr, size_t size, enum dma_data_direction dir, unsigned long attrs); -void virtqueue_dma_unmap_single_attrs(struct virtqueue *_vq, dma_addr_t addr, +void virtqueue_unmap_single_attrs(const struct virtqueue *_vq, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs); -int virtqueue_dma_mapping_error(struct virtqueue *_vq, dma_addr_t addr); +int virtqueue_map_mapping_error(const struct virtqueue *_vq, dma_addr_t addr); -bool virtqueue_dma_need_sync(struct virtqueue *_vq, dma_addr_t addr); -void virtqueue_dma_sync_single_range_for_cpu(struct virtqueue *_vq, dma_addr_t addr, +bool virtqueue_map_need_sync(const struct virtqueue *_vq, dma_addr_t addr); +void virtqueue_map_sync_single_range_for_cpu(const struct virtqueue *_vq, dma_addr_t addr, unsigned long offset, size_t size, enum dma_data_direction dir); -void virtqueue_dma_sync_single_range_for_device(struct virtqueue *_vq, dma_addr_t addr, +void virtqueue_map_sync_single_range_for_device(const struct virtqueue *_vq, dma_addr_t addr, unsigned long offset, size_t size, enum dma_data_direction dir); diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h index 7427b79d6f3d5..16001e9f9b391 100644 --- a/include/linux/virtio_config.h +++ b/include/linux/virtio_config.h @@ -139,6 +139,78 @@ struct virtio_config_ops { int (*enable_vq_after_reset)(struct virtqueue *vq); }; +/** + * struct virtio_map_ops - operations for mapping buffer for a virtio device + * Note: For transport that has its own mapping logic it must + * implements all of the operations + * @map_page: map a buffer to the device + * map: metadata for performing mapping + * page: the page that will be mapped by the device + * offset: the offset in the page for a buffer + * size: the buffer size + * dir: mapping direction + * attrs: mapping attributes + * Returns: the mapped address + * @unmap_page: unmap a buffer from the device + * map: device specific mapping map + * map_handle: the mapped address + * size: the buffer size + * dir: mapping direction + * attrs: unmapping attributes + * @sync_single_for_cpu: sync a single buffer from device to cpu + * map: metadata for performing mapping + * map_handle: the mapping address to sync + * size: the size of the buffer + * dir: synchronization direction + * @sync_single_for_device: sync a single buffer from cpu to device + * map: metadata for performing mapping + * map_handle: the mapping address to sync + * size: the size of the buffer + * dir: synchronization direction + * @alloc: alloc a coherent buffer mapping + * map: metadata for performing mapping + * size: the size of the buffer + * 
map_handle: the mapping address to sync + * gfp: allocation flag (GFP_XXX) + * Returns: virtual address of the allocated buffer + * @free: free a coherent buffer mapping + * map: metadata for performing mapping + * size: the size of the buffer + * vaddr: virtual address of the buffer + * map_handle: the mapping address to sync + * attrs: unmapping attributes + * @need_sync: if the buffer needs synchronization + * map: metadata for performing mapping + * map_handle: the mapped address + * Returns: whether the buffer needs synchronization + * @mapping_error: if the mapping address is error + * map: metadata for performing mapping + * map_handle: the mapped address + * @max_mapping_size: get the maximum buffer size that can be mapped + * map: metadata for performing mapping + * Returns: the maximum buffer size that can be mapped + */ +struct virtio_map_ops { + dma_addr_t (*map_page)(union virtio_map map, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, unsigned long attrs); + void (*unmap_page)(union virtio_map map, dma_addr_t map_handle, + size_t size, enum dma_data_direction dir, + unsigned long attrs); + void (*sync_single_for_cpu)(union virtio_map map, dma_addr_t map_handle, + size_t size, enum dma_data_direction dir); + void (*sync_single_for_device)(union virtio_map map, + dma_addr_t map_handle, size_t size, + enum dma_data_direction dir); + void *(*alloc)(union virtio_map map, size_t size, + dma_addr_t *map_handle, gfp_t gfp); + void (*free)(union virtio_map map, size_t size, void *vaddr, + dma_addr_t map_handle, unsigned long attrs); + bool (*need_sync)(union virtio_map map, dma_addr_t map_handle); + int (*mapping_error)(union virtio_map map, dma_addr_t map_handle); + size_t (*max_mapping_size)(union virtio_map map); +}; + /* If driver didn't advertise the feature, it will never appear. */ void virtio_check_driver_offered_feature(const struct virtio_device *vdev, unsigned int fbit); diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h index 9b33df741b630..c97a12c1cda36 100644 --- a/include/linux/virtio_ring.h +++ b/include/linux/virtio_ring.h @@ -3,6 +3,7 @@ #define _LINUX_VIRTIO_RING_H #include +#include #include #include @@ -79,9 +80,9 @@ struct virtqueue *vring_create_virtqueue(unsigned int index, /* * Creates a virtqueue and allocates the descriptor ring with per - * virtqueue DMA device. + * virtqueue mapping operations. 
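For a transport whose mapping token is just a DMA-capable device, the new ops can simply forward to the DMA API; VDUSE would instead dispatch to its iova_domain. A minimal, illustrative skeleton (only two of the mandatory callbacks shown)::

	static dma_addr_t foo_vmap_page(union virtio_map map, struct page *page,
					unsigned long offset, size_t size,
					enum dma_data_direction dir,
					unsigned long attrs)
	{
		return dma_map_page_attrs(map.dma_dev, page, offset, size, dir, attrs);
	}

	static void foo_vunmap_page(union virtio_map map, dma_addr_t handle,
				    size_t size, enum dma_data_direction dir,
				    unsigned long attrs)
	{
		dma_unmap_page_attrs(map.dma_dev, handle, size, dir, attrs);
	}

	static const struct virtio_map_ops foo_virtio_map_ops = {
		.map_page	= foo_vmap_page,
		.unmap_page	= foo_vunmap_page,
		/* all remaining callbacks must be implemented as well */
	};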
*/ -struct virtqueue *vring_create_virtqueue_dma(unsigned int index, +struct virtqueue *vring_create_virtqueue_map(unsigned int index, unsigned int num, unsigned int vring_align, struct virtio_device *vdev, @@ -91,7 +92,7 @@ struct virtqueue *vring_create_virtqueue_dma(unsigned int index, bool (*notify)(struct virtqueue *vq), void (*callback)(struct virtqueue *vq), const char *name, - struct device *dma_dev); + union virtio_map map); /* * Creates a virtqueue with a standard layout but a caller-allocated diff --git a/include/trace/events/dma.h b/include/trace/events/dma.h index 945fcbaae77e9..b3fef140ae155 100644 --- a/include/trace/events/dma.h +++ b/include/trace/events/dma.h @@ -31,7 +31,8 @@ TRACE_DEFINE_ENUM(DMA_NONE); { DMA_ATTR_FORCE_CONTIGUOUS, "FORCE_CONTIGUOUS" }, \ { DMA_ATTR_ALLOC_SINGLE_PAGES, "ALLOC_SINGLE_PAGES" }, \ { DMA_ATTR_NO_WARN, "NO_WARN" }, \ - { DMA_ATTR_PRIVILEGED, "PRIVILEGED" }) + { DMA_ATTR_PRIVILEGED, "PRIVILEGED" }, \ + { DMA_ATTR_MMIO, "MMIO" }) DECLARE_EVENT_CLASS(dma_map, TP_PROTO(struct device *dev, phys_addr_t phys_addr, dma_addr_t dma_addr, @@ -71,8 +72,7 @@ DEFINE_EVENT(dma_map, name, \ size_t size, enum dma_data_direction dir, unsigned long attrs), \ TP_ARGS(dev, phys_addr, dma_addr, size, dir, attrs)) -DEFINE_MAP_EVENT(dma_map_page); -DEFINE_MAP_EVENT(dma_map_resource); +DEFINE_MAP_EVENT(dma_map_phys); DECLARE_EVENT_CLASS(dma_unmap, TP_PROTO(struct device *dev, dma_addr_t addr, size_t size, @@ -109,8 +109,7 @@ DEFINE_EVENT(dma_unmap, name, \ enum dma_data_direction dir, unsigned long attrs), \ TP_ARGS(dev, addr, size, dir, attrs)) -DEFINE_UNMAP_EVENT(dma_unmap_page); -DEFINE_UNMAP_EVENT(dma_unmap_resource); +DEFINE_UNMAP_EVENT(dma_unmap_phys); DECLARE_EVENT_CLASS(dma_alloc_class, TP_PROTO(struct device *dev, void *virt_addr, dma_addr_t dma_addr, diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 52265c85ced6f..bdfc87bae3e3a 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -682,14 +682,25 @@ struct kvm_enable_cap { #define KVM_S390_SIE_PAGE_OFFSET 1 /* - * On arm64, machine type can be used to request the physical - * address size for the VM. Bits[7-0] are reserved for the guest - * PA size shift (i.e, log2(PA_Size)). For backward compatibility, - * value 0 implies the default IPA size, 40bits. + * On arm64, machine type can be used to request both the machine type and + * the physical address size for the VM. + * + * Bits[11-8] are reserved for the ARM specific machine type. + * + * Bits[7-0] are reserved for the guest PA size shift (i.e, log2(PA_Size)). + * For backward compatibility, value 0 implies the default IPA size, 40bits. 
*/ +#define KVM_VM_TYPE_ARM_SHIFT 8 +#define KVM_VM_TYPE_ARM_MASK (0xfULL << KVM_VM_TYPE_ARM_SHIFT) +#define KVM_VM_TYPE_ARM(_type) \ + (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK) +#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0) +#define KVM_VM_TYPE_ARM_REALM KVM_VM_TYPE_ARM(1) + #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL #define KVM_VM_TYPE_ARM_IPA_SIZE(x) \ ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK) + /* * ioctls for /dev/kvm fds: */ @@ -974,6 +985,7 @@ struct kvm_enable_cap { #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243 #define KVM_CAP_GUEST_MEMFD_FLAGS 244 #define KVM_CAP_ARM_SEA_TO_USER 245 +#define KVM_CAP_ARM_RME 246 struct kvm_irq_routing_irqchip { __u32 irqchip; @@ -1626,4 +1638,13 @@ struct kvm_pre_fault_memory { __u64 padding[5]; }; +/* Available with KVM_CAP_ARM_RME, only for VMs with KVM_VM_TYPE_ARM_REALM */ +struct kvm_arm_rmm_psci_complete { + __u64 target_mpidr; + __u32 psci_status; + __u32 padding[3]; +}; + +#define KVM_ARM_VCPU_RMM_PSCI_COMPLETE _IOW(KVMIO, 0xd6, struct kvm_arm_rmm_psci_complete) + #endif /* __LINUX_KVM_H */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 75100bf009baf..ac2329f241417 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -14,6 +14,7 @@ #include #include +#include #define VFIO_API_VERSION 0 @@ -1478,6 +1479,33 @@ struct vfio_device_feature_bus_master { }; #define VFIO_DEVICE_FEATURE_BUS_MASTER 10 +/** + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the + * regions selected. + * + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC, + * etc. offset/length specify a slice of the region to create the dmabuf from. + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf. + * + * flags should be 0. + * + * Return: The fd number on success, -1 and errno is set on failure. + */ +#define VFIO_DEVICE_FEATURE_DMA_BUF 11 + +struct vfio_region_dma_range { + __u64 offset; + __u64 length; +}; + +struct vfio_device_feature_dma_buf { + __u32 region_index; + __u32 open_flags; + __u32 flags; + __u32 nr_ranges; + struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges); +}; + /* -------- API for Type1 VFIO IOMMU -------- */ /** diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 7458382be8401..138ede653de40 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -39,8 +39,8 @@ enum { dma_debug_single, dma_debug_sg, dma_debug_coherent, - dma_debug_resource, dma_debug_noncoherent, + dma_debug_phy, }; enum map_err_types { @@ -142,8 +142,8 @@ static const char *type2name[] = { [dma_debug_single] = "single", [dma_debug_sg] = "scatter-gather", [dma_debug_coherent] = "coherent", - [dma_debug_resource] = "resource", [dma_debug_noncoherent] = "noncoherent", + [dma_debug_phy] = "phy", }; static const char *dir2name[] = { @@ -1057,17 +1057,16 @@ static void check_unmap(struct dma_debug_entry *ref) dma_entry_free(entry); } -static void check_for_stack(struct device *dev, - struct page *page, size_t offset) +static void check_for_stack(struct device *dev, phys_addr_t phys) { void *addr; struct vm_struct *stack_vm_area = task_stack_vm_area(current); if (!stack_vm_area) { /* Stack is direct-mapped. 
*/ - if (PageHighMem(page)) + if (PhysHighMem(phys)) return; - addr = page_address(page) + offset; + addr = phys_to_virt(phys); if (object_is_on_stack(addr)) err_printk(dev, NULL, "device driver maps memory from stack [addr=%p]\n", addr); } else { @@ -1075,10 +1074,12 @@ static void check_for_stack(struct device *dev, int i; for (i = 0; i < stack_vm_area->nr_pages; i++) { - if (page != stack_vm_area->pages[i]) + if (__phys_to_pfn(phys) != + page_to_pfn(stack_vm_area->pages[i])) continue; - addr = (u8 *)current->stack + i * PAGE_SIZE + offset; + addr = (u8 *)current->stack + i * PAGE_SIZE + + (phys % PAGE_SIZE); err_printk(dev, NULL, "device driver maps memory from stack [probable addr=%p]\n", addr); break; } @@ -1207,9 +1208,8 @@ void debug_dma_map_single(struct device *dev, const void *addr, } EXPORT_SYMBOL(debug_dma_map_single); -void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, - size_t size, int direction, dma_addr_t dma_addr, - unsigned long attrs) +void debug_dma_map_phys(struct device *dev, phys_addr_t phys, size_t size, + int direction, dma_addr_t dma_addr, unsigned long attrs) { struct dma_debug_entry *entry; @@ -1224,19 +1224,18 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, return; entry->dev = dev; - entry->type = dma_debug_single; - entry->paddr = page_to_phys(page) + offset; + entry->type = dma_debug_phy; + entry->paddr = phys; entry->dev_addr = dma_addr; entry->size = size; entry->direction = direction; entry->map_err_type = MAP_ERR_NOT_CHECKED; - check_for_stack(dev, page, offset); + if (!(attrs & DMA_ATTR_MMIO)) { + check_for_stack(dev, phys); - if (!PageHighMem(page)) { - void *addr = page_address(page) + offset; - - check_for_illegal_area(dev, addr, size); + if (!PhysHighMem(phys)) + check_for_illegal_area(dev, phys_to_virt(phys), size); } add_dma_entry(entry, attrs); @@ -1280,11 +1279,11 @@ void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) } EXPORT_SYMBOL(debug_dma_mapping_error); -void debug_dma_unmap_page(struct device *dev, dma_addr_t dma_addr, +void debug_dma_unmap_phys(struct device *dev, dma_addr_t dma_addr, size_t size, int direction) { struct dma_debug_entry ref = { - .type = dma_debug_single, + .type = dma_debug_phy, .dev = dev, .dev_addr = dma_addr, .size = size, @@ -1308,7 +1307,7 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, return; for_each_sg(sg, s, nents, i) { - check_for_stack(dev, sg_page(s), s->offset); + check_for_stack(dev, sg_phys(s)); if (!PageHighMem(sg_page(s))) check_for_illegal_area(dev, sg_virt(s), s->length); } @@ -1448,47 +1447,6 @@ void debug_dma_free_coherent(struct device *dev, size_t size, check_unmap(&ref); } -void debug_dma_map_resource(struct device *dev, phys_addr_t addr, size_t size, - int direction, dma_addr_t dma_addr, - unsigned long attrs) -{ - struct dma_debug_entry *entry; - - if (unlikely(dma_debug_disabled())) - return; - - entry = dma_entry_alloc(); - if (!entry) - return; - - entry->type = dma_debug_resource; - entry->dev = dev; - entry->paddr = addr; - entry->size = size; - entry->dev_addr = dma_addr; - entry->direction = direction; - entry->map_err_type = MAP_ERR_NOT_CHECKED; - - add_dma_entry(entry, attrs); -} - -void debug_dma_unmap_resource(struct device *dev, dma_addr_t dma_addr, - size_t size, int direction) -{ - struct dma_debug_entry ref = { - .type = dma_debug_resource, - .dev = dev, - .dev_addr = dma_addr, - .size = size, - .direction = direction, - }; - - if (unlikely(dma_debug_disabled())) - return; - - 
check_unmap(&ref); -} - void debug_dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size, int direction) { diff --git a/kernel/dma/debug.h b/kernel/dma/debug.h index 48757ca13f314..da7be0bddcf67 100644 --- a/kernel/dma/debug.h +++ b/kernel/dma/debug.h @@ -9,12 +9,11 @@ #define _KERNEL_DMA_DEBUG_H #ifdef CONFIG_DMA_API_DEBUG -extern void debug_dma_map_page(struct device *dev, struct page *page, - size_t offset, size_t size, - int direction, dma_addr_t dma_addr, +extern void debug_dma_map_phys(struct device *dev, phys_addr_t phys, + size_t size, int direction, dma_addr_t dma_addr, unsigned long attrs); -extern void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, +extern void debug_dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size, int direction); extern void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, @@ -31,14 +30,6 @@ extern void debug_dma_alloc_coherent(struct device *dev, size_t size, extern void debug_dma_free_coherent(struct device *dev, size_t size, void *virt, dma_addr_t addr); -extern void debug_dma_map_resource(struct device *dev, phys_addr_t addr, - size_t size, int direction, - dma_addr_t dma_addr, - unsigned long attrs); - -extern void debug_dma_unmap_resource(struct device *dev, dma_addr_t dma_addr, - size_t size, int direction); - extern void debug_dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size, int direction); @@ -62,14 +53,13 @@ extern void debug_dma_free_pages(struct device *dev, struct page *page, size_t size, int direction, dma_addr_t dma_addr); #else /* CONFIG_DMA_API_DEBUG */ -static inline void debug_dma_map_page(struct device *dev, struct page *page, - size_t offset, size_t size, - int direction, dma_addr_t dma_addr, - unsigned long attrs) +static inline void debug_dma_map_phys(struct device *dev, phys_addr_t phys, + size_t size, int direction, + dma_addr_t dma_addr, unsigned long attrs) { } -static inline void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, +static inline void debug_dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size, int direction) { } @@ -97,19 +87,6 @@ static inline void debug_dma_free_coherent(struct device *dev, size_t size, { } -static inline void debug_dma_map_resource(struct device *dev, phys_addr_t addr, - size_t size, int direction, - dma_addr_t dma_addr, - unsigned long attrs) -{ -} - -static inline void debug_dma_unmap_resource(struct device *dev, - dma_addr_t dma_addr, size_t size, - int direction) -{ -} - static inline void debug_dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size, int direction) diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 24c359d9c8799..3e058c99fe856 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -453,7 +453,7 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl, if (sg_dma_is_bus_address(sg)) sg_dma_unmark_bus_address(sg); else - dma_direct_unmap_page(dev, sg->dma_address, + dma_direct_unmap_phys(dev, sg->dma_address, sg_dma_len(sg), dir, attrs); } } @@ -476,16 +476,16 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, */ break; case PCI_P2PDMA_MAP_NONE: - sg->dma_address = dma_direct_map_page(dev, sg_page(sg), - sg->offset, sg->length, dir, attrs); + sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg), + sg->length, dir, attrs); if (sg->dma_address == DMA_MAPPING_ERROR) { ret = -EIO; goto out_unmap; } break; case PCI_P2PDMA_MAP_BUS_ADDR: - sg->dma_address = 
pci_p2pdma_bus_addr_map(&p2pdma_state, - sg_phys(sg)); + sg->dma_address = pci_p2pdma_bus_addr_map( + p2pdma_state.mem, sg_phys(sg)); sg_dma_mark_bus_address(sg); continue; default: @@ -502,22 +502,6 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, return ret; } -dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, - size_t size, enum dma_data_direction dir, unsigned long attrs) -{ - dma_addr_t dma_addr = paddr; - - if (unlikely(!dma_capable(dev, dma_addr, size, false))) { - dev_err_once(dev, - "DMA addr %pad+%zu overflow (mask %llx, bus limit %llx).\n", - &dma_addr, size, *dev->dma_mask, dev->bus_dma_limit); - WARN_ON_ONCE(1); - return DMA_MAPPING_ERROR; - } - - return dma_addr; -} - int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt, void *cpu_addr, dma_addr_t dma_addr, size_t size, unsigned long attrs) diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h index d2c0b7e632fc0..da2fadf45bcd6 100644 --- a/kernel/dma/direct.h +++ b/kernel/dma/direct.h @@ -80,42 +80,57 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev, arch_dma_mark_clean(paddr, size); } -static inline dma_addr_t dma_direct_map_page(struct device *dev, - struct page *page, unsigned long offset, size_t size, - enum dma_data_direction dir, unsigned long attrs) +static inline dma_addr_t dma_direct_map_phys(struct device *dev, + phys_addr_t phys, size_t size, enum dma_data_direction dir, + unsigned long attrs) { - phys_addr_t phys = page_to_phys(page) + offset; - dma_addr_t dma_addr = phys_to_dma(dev, phys); + dma_addr_t dma_addr; if (is_swiotlb_force_bounce(dev)) { - if (is_pci_p2pdma_page(page)) - return DMA_MAPPING_ERROR; + if (attrs & DMA_ATTR_MMIO) + goto err_overflow; + return swiotlb_map(dev, phys, size, dir, attrs); } - if (unlikely(!dma_capable(dev, dma_addr, size, true)) || - dma_kmalloc_needs_bounce(dev, size, dir)) { - if (is_pci_p2pdma_page(page)) - return DMA_MAPPING_ERROR; - if (is_swiotlb_active(dev)) - return swiotlb_map(dev, phys, size, dir, attrs); - - dev_WARN_ONCE(dev, 1, - "DMA addr %pad+%zu overflow (mask %llx, bus limit %llx).\n", - &dma_addr, size, *dev->dma_mask, dev->bus_dma_limit); - return DMA_MAPPING_ERROR; + if (attrs & DMA_ATTR_MMIO) { + dma_addr = phys; + if (unlikely(!dma_capable(dev, dma_addr, size, false))) + goto err_overflow; + } else { + dma_addr = phys_to_dma(dev, phys); + if (unlikely(!dma_capable(dev, dma_addr, size, true)) || + dma_kmalloc_needs_bounce(dev, size, dir)) { + if (is_swiotlb_active(dev)) + return swiotlb_map(dev, phys, size, dir, attrs); + + goto err_overflow; + } } - if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) + if (!dev_is_dma_coherent(dev) && + !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) arch_sync_dma_for_device(phys, size, dir); return dma_addr; + +err_overflow: + dev_WARN_ONCE( + dev, 1, + "DMA addr %pad+%zu overflow (mask %llx, bus limit %llx).\n", + &dma_addr, size, *dev->dma_mask, dev->bus_dma_limit); + return DMA_MAPPING_ERROR; } -static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, +static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { - phys_addr_t phys = dma_to_phys(dev, addr); + phys_addr_t phys; + + if (attrs & DMA_ATTR_MMIO) + /* nothing to do: uncached and no swiotlb */ + return; + phys = dma_to_phys(dev, addr); if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) dma_direct_sync_single_for_cpu(dev, addr, size, dir); diff --git 
a/kernel/dma/dummy.c b/kernel/dma/dummy.c index 92de80e5b057e..16a51736a2a39 100644 --- a/kernel/dma/dummy.c +++ b/kernel/dma/dummy.c @@ -11,17 +11,16 @@ static int dma_dummy_mmap(struct device *dev, struct vm_area_struct *vma, return -ENXIO; } -static dma_addr_t dma_dummy_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, enum dma_data_direction dir, - unsigned long attrs) +static dma_addr_t dma_dummy_map_phys(struct device *dev, phys_addr_t phys, + size_t size, enum dma_data_direction dir, unsigned long attrs) { return DMA_MAPPING_ERROR; } -static void dma_dummy_unmap_page(struct device *dev, dma_addr_t dma_handle, +static void dma_dummy_unmap_phys(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs) { /* - * Dummy ops doesn't support map_page, so unmap_page should never be + * Dummy ops doesn't support map_phys, so unmap_page should never be * called. */ WARN_ON_ONCE(true); @@ -51,8 +50,8 @@ static int dma_dummy_supported(struct device *hwdev, u64 mask) const struct dma_map_ops dma_dummy_ops = { .mmap = dma_dummy_mmap, - .map_page = dma_dummy_map_page, - .unmap_page = dma_dummy_unmap_page, + .map_phys = dma_dummy_map_phys, + .unmap_phys = dma_dummy_unmap_phys, .map_sg = dma_dummy_map_sg, .unmap_sg = dma_dummy_unmap_sg, .dma_supported = dma_dummy_supported, diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 56de28a3b1799..37163eb49f9fa 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -152,12 +152,12 @@ static inline bool dma_map_direct(struct device *dev, return dma_go_direct(dev, *dev->dma_mask, ops); } -dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, - size_t offset, size_t size, enum dma_data_direction dir, - unsigned long attrs) +dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size, + enum dma_data_direction dir, unsigned long attrs) { const struct dma_map_ops *ops = get_dma_ops(dev); - dma_addr_t addr; + bool is_mmio = attrs & DMA_ATTR_MMIO; + dma_addr_t addr = DMA_MAPPING_ERROR; BUG_ON(!valid_dma_direction(dir)); @@ -165,36 +165,65 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, return DMA_MAPPING_ERROR; if (dma_map_direct(dev, ops) || - arch_dma_map_page_direct(dev, page_to_phys(page) + offset + size)) - addr = dma_direct_map_page(dev, page, offset, size, dir, attrs); + (!is_mmio && arch_dma_map_phys_direct(dev, phys + size))) + addr = dma_direct_map_phys(dev, phys, size, dir, attrs); else if (use_dma_iommu(dev)) - addr = iommu_dma_map_page(dev, page, offset, size, dir, attrs); - else - addr = ops->map_page(dev, page, offset, size, dir, attrs); - kmsan_handle_dma(page, offset, size, dir); - trace_dma_map_page(dev, page_to_phys(page) + offset, addr, size, dir, - attrs); - debug_dma_map_page(dev, page, offset, size, dir, addr, attrs); + addr = iommu_dma_map_phys(dev, phys, size, dir, attrs); + else if (ops->map_phys) + addr = ops->map_phys(dev, phys, size, dir, attrs); + + if (!is_mmio) + kmsan_handle_dma(phys, size, dir); + trace_dma_map_phys(dev, phys, addr, size, dir, attrs); + debug_dma_map_phys(dev, phys, size, dir, addr, attrs); return addr; } +EXPORT_SYMBOL_GPL(dma_map_phys); + +dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, + size_t offset, size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ + phys_addr_t phys = page_to_phys(page) + offset; + + if (unlikely(attrs & DMA_ATTR_MMIO)) + return DMA_MAPPING_ERROR; + + if (IS_ENABLED(CONFIG_DMA_API_DEBUG) && + 
+	    WARN_ON_ONCE(is_zone_device_page(page)))
+		return DMA_MAPPING_ERROR;
+
+	return dma_map_phys(dev, phys, size, dir, attrs);
+}
 EXPORT_SYMBOL(dma_map_page_attrs);
 
-void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
+void dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir, unsigned long attrs)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
+	bool is_mmio = attrs & DMA_ATTR_MMIO;
 
 	BUG_ON(!valid_dma_direction(dir));
 	if (dma_map_direct(dev, ops) ||
-	    arch_dma_unmap_page_direct(dev, addr + size))
-		dma_direct_unmap_page(dev, addr, size, dir, attrs);
+	    (!is_mmio && arch_dma_unmap_phys_direct(dev, addr + size)))
+		dma_direct_unmap_phys(dev, addr, size, dir, attrs);
 	else if (use_dma_iommu(dev))
-		iommu_dma_unmap_page(dev, addr, size, dir, attrs);
-	else
-		ops->unmap_page(dev, addr, size, dir, attrs);
-	trace_dma_unmap_page(dev, addr, size, dir, attrs);
-	debug_dma_unmap_page(dev, addr, size, dir);
+		iommu_dma_unmap_phys(dev, addr, size, dir, attrs);
+	else if (ops->unmap_phys)
+		ops->unmap_phys(dev, addr, size, dir, attrs);
+	trace_dma_unmap_phys(dev, addr, size, dir, attrs);
+	debug_dma_unmap_phys(dev, addr, size, dir);
+}
+EXPORT_SYMBOL_GPL(dma_unmap_phys);
+
+void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+	if (unlikely(attrs & DMA_ATTR_MMIO))
+		return;
+
+	dma_unmap_phys(dev, addr, size, dir, attrs);
 }
 EXPORT_SYMBOL(dma_unmap_page_attrs);
@@ -321,41 +350,18 @@ EXPORT_SYMBOL(dma_unmap_sg_attrs);
 
 dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-	const struct dma_map_ops *ops = get_dma_ops(dev);
-	dma_addr_t addr = DMA_MAPPING_ERROR;
-
-	BUG_ON(!valid_dma_direction(dir));
-
-	if (WARN_ON_ONCE(!dev->dma_mask))
+	if (IS_ENABLED(CONFIG_DMA_API_DEBUG) &&
+	    WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
 		return DMA_MAPPING_ERROR;
 
-	if (dma_map_direct(dev, ops))
-		addr = dma_direct_map_resource(dev, phys_addr, size, dir, attrs);
-	else if (use_dma_iommu(dev))
-		addr = iommu_dma_map_resource(dev, phys_addr, size, dir, attrs);
-	else if (ops->map_resource)
-		addr = ops->map_resource(dev, phys_addr, size, dir, attrs);
-
-	trace_dma_map_resource(dev, phys_addr, addr, size, dir, attrs);
-	debug_dma_map_resource(dev, phys_addr, size, dir, addr, attrs);
-	return addr;
+	return dma_map_phys(dev, phys_addr, size, dir, attrs | DMA_ATTR_MMIO);
 }
 EXPORT_SYMBOL(dma_map_resource);
 
 void dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	const struct dma_map_ops *ops = get_dma_ops(dev);
-
-	BUG_ON(!valid_dma_direction(dir));
-	if (dma_map_direct(dev, ops))
-		; /* nothing to do: uncached and no swiotlb */
-	else if (use_dma_iommu(dev))
-		iommu_dma_unmap_resource(dev, addr, size, dir, attrs);
-	else if (ops->unmap_resource)
-		ops->unmap_resource(dev, addr, size, dir, attrs);
-	trace_dma_unmap_resource(dev, addr, size, dir, attrs);
-	debug_dma_unmap_resource(dev, addr, size, dir);
+	dma_unmap_phys(dev, addr, size, dir, attrs | DMA_ATTR_MMIO);
 }
 EXPORT_SYMBOL(dma_unmap_resource);
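
With dma_map_resource() now reduced to a DMA_ATTR_MMIO wrapper around dma_map_phys(),
MMIO and peer-to-peer addresses go through the same entry point. A hedged sketch of
that usage, not taken from this patch (bar_phys and len are placeholder names)::

	/*
	 * bar_phys is not struct-page backed memory (e.g. a peer PCI BAR), so
	 * DMA_ATTR_MMIO is required; no swiotlb bouncing and no cache
	 * maintenance are applied on this path.
	 */
	dma_addr_t dma = dma_map_phys(dev, bar_phys, len, DMA_FROM_DEVICE,
				      DMA_ATTR_MMIO);
	if (dma_mapping_error(dev, dma))
		return -EIO;

	/* ... device-to-device transfer ... */

	dma_unmap_phys(dev, dma, len, DMA_FROM_DEVICE, DMA_ATTR_MMIO);
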
diff --git a/kernel/dma/ops_helpers.c b/kernel/dma/ops_helpers.c
index 9afd569eadb96..20caf9cabf699 100644
--- a/kernel/dma/ops_helpers.c
+++ b/kernel/dma/ops_helpers.c
@@ -64,6 +64,7 @@ struct page *dma_common_alloc_pages(struct device *dev, size_t size,
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 	struct page *page;
+	phys_addr_t phys;
 
 	page = dma_alloc_contiguous(dev, size, gfp);
 	if (!page)
@@ -71,11 +72,12 @@ struct page *dma_common_alloc_pages(struct device *dev, size_t size,
 	if (!page)
 		return NULL;
 
+	phys = page_to_phys(page);
 	if (use_dma_iommu(dev))
-		*dma_handle = iommu_dma_map_page(dev, page, 0, size, dir,
+		*dma_handle = iommu_dma_map_phys(dev, phys, size, dir,
 						 DMA_ATTR_SKIP_CPU_SYNC);
 	else
-		*dma_handle = ops->map_page(dev, page, 0, size, dir,
+		*dma_handle = ops->map_phys(dev, phys, size, dir,
 					    DMA_ATTR_SKIP_CPU_SYNC);
 	if (*dma_handle == DMA_MAPPING_ERROR) {
 		dma_free_contiguous(dev, page, size);
@@ -92,10 +94,10 @@ void dma_common_free_pages(struct device *dev, size_t size, struct page *page,
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
 	if (use_dma_iommu(dev))
-		iommu_dma_unmap_page(dev, dma_handle, size, dir,
+		iommu_dma_unmap_phys(dev, dma_handle, size, dir,
 				     DMA_ATTR_SKIP_CPU_SYNC);
-	else if (ops->unmap_page)
-		ops->unmap_page(dev, dma_handle, size, dir,
+	else if (ops->unmap_phys)
+		ops->unmap_phys(dev, dma_handle, size, dir,
 				DMA_ATTR_SKIP_CPU_SYNC);
 	dma_free_contiguous(dev, page, size);
 }
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 31203f0bacafa..2f1c62c817e6a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -961,17 +961,24 @@ int kimage_load_segment(struct kimage *image, int idx)
 	return result;
 }
 
-void *kimage_map_segment(struct kimage *image,
-			 unsigned long addr, unsigned long size)
+void *kimage_map_segment(struct kimage *image, int idx)
 {
+	unsigned long addr, size, eaddr;
 	unsigned long src_page_addr, dest_page_addr = 0;
-	unsigned long eaddr = addr + size;
 	kimage_entry_t *ptr, entry;
 	struct page **src_pages;
 	unsigned int npages;
+	struct page *cma;
 	void *vaddr = NULL;
 	int i;
 
+	cma = image->segment_cma[idx];
+	if (cma)
+		return page_address(cma);
+
+	addr = image->segment[idx].mem;
+	size = image->segment[idx].memsz;
+	eaddr = addr + size;
 	/*
	 * Collect the source pages and map them in a contiguous VA range.
 	 */
@@ -1012,7 +1019,8 @@ void *kimage_map_segment(struct kimage *image,
 
 void kimage_unmap_segment(void *segment_buffer)
 {
-	vunmap(segment_buffer);
+	if (is_vmalloc_addr(segment_buffer))
+		vunmap(segment_buffer);
 }
 
 struct kexec_load_limit {
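
kimage_map_segment() now takes a segment index and returns the linear mapping
directly for CMA-backed segments, so callers no longer pass an address/size pair
and must not assume the buffer is vmapped. A minimal sketch of the expected caller
pattern, mirroring the IMA caller updated later in this patch (idx is a placeholder
for a valid segment index)::

	void *buf;

	buf = kimage_map_segment(image, idx);
	if (!buf)
		return;

	/* ... read or update the segment contents ... */

	kimage_unmap_segment(buf);
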
diff --git a/mm/hmm.c b/mm/hmm.c
index d545e24949949..012b78688fa18 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -746,12 +746,12 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 	case PCI_P2PDMA_MAP_NONE:
 		break;
 	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-		attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+		attrs |= DMA_ATTR_MMIO;
 		pfns[idx] |= HMM_PFN_P2PDMA;
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
-		return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+		return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
 	default:
 		return DMA_MAPPING_ERROR;
 	}
@@ -775,8 +775,8 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 	if (WARN_ON_ONCE(dma_need_unmap(dev) && !dma_addrs))
 		goto error;
 
-	dma_addr = dma_map_page(dev, page, 0, map->dma_entry_size,
-				DMA_BIDIRECTIONAL);
+	dma_addr = dma_map_phys(dev, paddr, map->dma_entry_size,
+				DMA_BIDIRECTIONAL, attrs);
 	if (dma_mapping_error(dev, dma_addr))
 		goto error;
 
@@ -811,16 +811,17 @@ bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
 	if ((pfns[idx] & valid_dma) != valid_dma)
 		return false;
 
+	if (pfns[idx] & HMM_PFN_P2PDMA)
+		attrs |= DMA_ATTR_MMIO;
+
 	if (pfns[idx] & HMM_PFN_P2PDMA_BUS)
 		; /* no need to unmap bus address P2P mappings */
-	else if (dma_use_iova(state)) {
-		if (pfns[idx] & HMM_PFN_P2PDMA)
-			attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+	else if (dma_use_iova(state))
 		dma_iova_unlink(dev, state, idx * map->dma_entry_size,
 				map->dma_entry_size, DMA_BIDIRECTIONAL, attrs);
-	} else if (dma_need_unmap(dev))
-		dma_unmap_page(dev, dma_addrs[idx], map->dma_entry_size,
-			       DMA_BIDIRECTIONAL);
+	else if (dma_need_unmap(dev))
+		dma_unmap_phys(dev, dma_addrs[idx], map->dma_entry_size,
+			       DMA_BIDIRECTIONAL, attrs);
 
 	pfns[idx] &= ~(HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA |
 			HMM_PFN_P2PDMA_BUS);
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 92ebc0f557d0b..8f22d1f229813 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -338,14 +338,15 @@ static void kmsan_handle_dma_page(const void *addr, size_t size,
 }
 
 /* Helper function to handle DMA data transfers. */
-void kmsan_handle_dma(struct page *page, size_t offset, size_t size,
+void kmsan_handle_dma(phys_addr_t phys, size_t size,
 		      enum dma_data_direction dir)
 {
-	u64 page_offset, to_go, addr;
+	u64 page_offset, to_go;
+	void *addr;
 
-	if (PageHighMem(page))
+	if (PhysHighMem(phys))
 		return;
-	addr = (u64)page_address(page) + offset;
+	addr = phys_to_virt(phys);
 
 	/*
 	 * The kernel may occasionally give us adjacent DMA pages not belonging
 	 * to the same allocation. Process them separately to avoid triggering
@@ -368,8 +369,7 @@ void kmsan_handle_dma_sg(struct scatterlist *sg, int nents,
 	int i;
 
 	for_each_sg(sg, item, nents, i)
-		kmsan_handle_dma(sg_page(item), item->offset, item->length,
-				 dir);
+		kmsan_handle_dma(sg_phys(item), item->length, dir);
 }
 
 /* Functions from kmsan-checks.h follow. */
diff --git a/rust/kernel/dma.rs b/rust/kernel/dma.rs
index 2bc8ab51ec280..61d9eed7a786e 100644
--- a/rust/kernel/dma.rs
+++ b/rust/kernel/dma.rs
@@ -242,6 +242,9 @@ pub mod attrs {
     /// Indicates that the buffer is fully accessible at an elevated privilege level (and
     /// ideally inaccessible or at least read-only at lesser-privileged levels).
     pub const DMA_ATTR_PRIVILEGED: Attrs = Attrs(bindings::DMA_ATTR_PRIVILEGED);
+
+    /// Indicates that the buffer is MMIO memory.
+    pub const DMA_ATTR_MMIO: Attrs = Attrs(bindings::DMA_ATTR_MMIO);
 }
 
 /// An abstraction of the `dma_alloc_coherent` API.
diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
index 7362f68f2d8b1..5beb69edd12fd 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image)
 	if (!image->ima_buffer_addr)
 		return;
 
-	ima_kexec_buffer = kimage_map_segment(image,
-					      image->ima_buffer_addr,
-					      image->ima_buffer_size);
+	ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index);
 	if (!ima_kexec_buffer) {
 		pr_err("Could not map measurements buffer.\n");
 		return;
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index bb4d33dde3c89..44eb2e1366eb5 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1574,6 +1574,49 @@ TEST_F(iommufd_ioas, copy_sweep)
 	test_ioctl_destroy(dst_ioas_id);
 }
 
+TEST_F(iommufd_ioas, dmabuf_simple)
+{
+	size_t buf_size = PAGE_SIZE*4;
+	__u64 iova;
+	int dfd;
+
+	test_cmd_get_dmabuf(buf_size, &dfd);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, 0, &iova);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, buf_size, buf_size, &iova);
+	test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, buf_size + 1, &iova);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
+
+	close(dfd);
+}
+
+TEST_F(iommufd_ioas, dmabuf_revoke)
+{
+	size_t buf_size = PAGE_SIZE*4;
+	__u32 hwpt_id;
+	__u64 iova;
+	__u64 iova2;
+	int dfd;
+
+	test_cmd_get_dmabuf(buf_size, &dfd);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
+	test_cmd_revoke_dmabuf(dfd, true);
+
+	if (variant->mock_domains)
+		test_cmd_hwpt_alloc(self->device_id, self->ioas_id, 0,
+				    &hwpt_id);
+
+	test_err_ioctl_ioas_map_file(ENODEV, dfd, 0, buf_size, &iova2);
+
+	test_cmd_revoke_dmabuf(dfd, false);
+	test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova2);
+
+	/* Restore the iova back */
+	test_ioctl_ioas_unmap(iova, buf_size);
+	test_ioctl_ioas_map_fixed_file(dfd, 0, buf_size, iova);
+
+	close(dfd);
+}
+
 FIXTURE(iommufd_mock_domain)
 {
 	int fd;
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 9f472c20c1905..55c285dbbf526 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -548,6 +548,39 @@ static int _test_cmd_destroy_access_pages(int fd, unsigned int access_id,
 	EXPECT_ERRNO(_errno, _test_cmd_destroy_access_pages( \
 				     self->fd, access_id, access_pages_id))
 
+static int _test_cmd_get_dmabuf(int fd, size_t len, int *out_fd)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DMABUF_GET,
+		.dmabuf_get = { .length = len, .open_flags = O_CLOEXEC },
+	};
+
+	*out_fd = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (*out_fd < 0)
+		return -1;
+	return 0;
+}
+#define test_cmd_get_dmabuf(len, out_fd) \
+	ASSERT_EQ(0, _test_cmd_get_dmabuf(self->fd, len, out_fd))
+
+static int _test_cmd_revoke_dmabuf(int fd, int dmabuf_fd, bool revoked)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DMABUF_REVOKE,
+		.dmabuf_revoke = { .dmabuf_fd = dmabuf_fd, .revoked = revoked },
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+#define test_cmd_revoke_dmabuf(dmabuf_fd, revoke) \
+	ASSERT_EQ(0, _test_cmd_revoke_dmabuf(self->fd, dmabuf_fd, revoke))
+
 static int _test_ioctl_destroy(int fd, unsigned int id)
 {
 	struct iommu_destroy cmd = {
@@ -718,6 +751,17 @@ static int _test_ioctl_ioas_map_file(int fd, unsigned int ioas_id, int mfd,
 			self->fd, ioas_id, mfd, start, length, iova_p,      \
 			IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE))
 
+#define test_ioctl_ioas_map_fixed_file(mfd, start, length, iova)             \
+	({                                                                    \
+		__u64 __iova = iova;                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_map_file(                       \
+				     self->fd, self->ioas_id, mfd, start,     \
+				     length, &__iova,                         \
+				     IOMMU_IOAS_MAP_FIXED_IOVA |              \
+					     IOMMU_IOAS_MAP_WRITEABLE |       \
+					     IOMMU_IOAS_MAP_READABLE));       \
+	})
+
 static int _test_ioctl_set_temp_memory_limit(int fd, unsigned int limit)
 {
 	struct iommu_test_cmd memlimit_cmd = {
diff --git a/tools/virtio/linux/kmsan.h b/tools/virtio/linux/kmsan.h
index 272b5aa285d5a..6cd2e3efd03dc 100644
--- a/tools/virtio/linux/kmsan.h
+++ b/tools/virtio/linux/kmsan.h
@@ -4,7 +4,7 @@
 
 #include
 
-inline void kmsan_handle_dma(struct page *page, size_t offset, size_t size,
+inline void kmsan_handle_dma(phys_addr_t phys, size_t size,
 			     enum dma_data_direction dir)
 {
 }