Tushar Dave wrote this blog post on his experiences with making Linux run well on SPARC systems.
Early last year I had a chance to take a deep dive into the SPARC I/O architecture while trying to root-cause why next-generation SPARC machines running Linux suddenly became unresponsive. The initial investigation pointed to a networking issue; I never expected it to lead deep into the IOMMU and the kernel DMA subsystem.
IOMMU and DMA are sensitive areas of the kernel, and extreme care must be taken when implementing or modifying them with a new idea or solution; otherwise the result can be a disaster such as data corruption or a security breach, or something as simple as poor I/O performance.
In this blog post, I briefly discuss the design of the legacy SPARC IOMMU in Linux, the limitations of the legacy IOMMU (i.e. the issue, with a detailed example), and finally a solution that has been upstreamed to the Linux kernel.
Legacy IOMMU Design:
The traditional SPARC sun4v hypervisor supports only 32-bit address ranges and one IOTSB (I/O Translation Storage Buffer) per PCIe root complex fabric, with a 2GB DVMA space limit per root complex. The DVMA space is exposed to the OS via the device property 'virtual-dma', which represents the base and size of the IOTSB. The DVMA space on traditional sun4v SPARC is shown in Fig 1.
The SPARC sun4v hypervisor exposes various APIs that a guest OS uses to communicate with and configure system hardware. During system boot, the PCI Bus Module (PBM) driver registers the hypervisor APIs for PCI I/O, retrieves 'devhandle' from the device properties ('devhandle' uniquely identifies a sun4v PCI root complex fabric), retrieves 'virtual-dma' from the device properties, and initializes the data structures needed to maintain and optimize access to the IOTSB. Any DMA allocation/map/unmap request from a PCI device then arrives at the PBM driver's DMA ops, where the request is either satisfied or rejected depending on whether the IOTSB has free pages available.
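The grant-or-reject behavior of the DMA ops can be modeled as a simple page pool. The sketch below is a toy illustration, not the actual kernel code; all names (`IotsbPool`, `dma_map`, `dma_unmap`) are hypothetical:

```python
class IotsbPool:
    """Toy model of a per-root-complex IOTSB page pool (illustrative only)."""

    def __init__(self, dvma_bytes, io_page_bytes):
        # Total translation entries = DVMA space / I/O page size.
        self.capacity = dvma_bytes // io_page_bytes
        self.used = 0

    def dma_map(self, npages=1):
        """Grant the mapping if pages are free; reject it otherwise."""
        if self.used + npages > self.capacity:
            return False  # the driver sees a DMA mapping error
        self.used += npages
        return True

    def dma_unmap(self, npages=1):
        """Unmapping returns entries to the pool."""
        self.used -= npages


# Legacy sun4v limits: 2GB DVMA space, 8KB I/O pages.
pool = IotsbPool(2 * 1024**3, 8 * 1024)
print(pool.capacity)  # 262144 translation entries, i.e. 256K
```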
Legacy IOMMU limitation:
During the investigation, I found that with a DVMA space of 2GB and the default I/O page size of 8KB on SPARC, there are a total of 256K I/O pages available in the IOMMU pool. This imposes a limit on PCIe device drivers: the total number of page-sized DMA buffers that can be mapped at any one time cannot exceed 256K. In other words, the IOMMU can serve at most 256K page translations. 256K pages served well until recently. However, with advances in PCIe hardware (e.g. 40Gbps Ethernet), the 2GB DVMA space limit has become a scalability bottleneck: a typical 10G/40G NIC can consume 500MB of DVMA space per instance. And when the DVMA resource is exhausted, devices become unusable, since their drivers can no longer allocate DVMA for DMA mappings. Let's look at this with an example.
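The numbers behind the bottleneck are easy to verify with back-of-the-envelope arithmetic (a quick check, not kernel code):

```python
# Legacy sun4v DVMA: 2GB of space, 8KB I/O page size.
DVMA_BYTES = 2 * 1024**3
IO_PAGE = 8 * 1024

total_entries = DVMA_BYTES // IO_PAGE
print(total_entries)  # 262144 translation entries, i.e. 256K

# A NIC that premaps ~64K page-sized buffers holds this much DVMA space:
nic_entries = 64 * 1024
print(nic_entries * IO_PAGE / 1024**2)  # 512.0 MB -- the ~500MB per instance
```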
Upon bringing up one 40Gbps Ethernet interface on an S7-2 (a dual-processor system with 64 threads per processor), the Linux i40e driver allocates 128 queue pairs (QPs), i.e. 128 queues for TX and 128 for RX descriptor rings. Each queue has 512 hardware descriptors by default (a descriptor is a multi-byte field that contains the DMA buffer address along with some other useful information for the HW).
In the ideal case, during TX, the i40e driver gets an skb from the network stack. The driver checks the length of the skb (along with its fragments) and counts the number of data descriptors needed to transmit the packet. For each skb data fragment, the driver requests a DMA mapping, then signals the HW by writing to the tx_tail register to indicate that data is ready to transmit. Each individual DMA mapping request takes one entry in the IOMMU pool. Once the data has been sent out, the skb data fragment is unmapped by the driver, which frees the entry in the IOMMU pool. The important point is that the driver doesn't need to premap TX buffers for DMA; mapping and unmapping happen as transmit requests arrive from the network stack.
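The transient nature of TX mappings can be sketched as a toy simulation (illustrative only; `tx_packet` and the `used` counter are hypothetical, not driver code):

```python
used = 0  # entries currently held in the IOMMU pool (toy counter)

def tx_packet(nfrags):
    """TX path: map each fragment, transmit, then unmap on completion."""
    global used
    used += nfrags  # one DMA mapping per skb fragment before transmit
    # ... HW transmits the packet, completion is processed ...
    used -= nfrags  # each fragment is unmapped, freeing its pool entry
    return used

# A 3-fragment skb leaves no lasting footprint in the pool.
print(tx_packet(3))  # 0
```

This is why TX alone rarely exhausts the pool: entries are held only for the lifetime of a packet in flight.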
On RX, however, the driver has to allocate RX buffers beforehand and premap them for DMA, so that it is ready for any incoming packets. For each descriptor in an RX queue, therefore, the driver allocates an RX buffer and maps it for DMA. Each mapping occupies one entry in the IOMMU pool. So for 128 receive queues, each with 512 HW descriptors, the driver needs 64K DMA mappings (128 x 512). Bringing up one i40e interface therefore occupies 64K entries in the IOMMU pool, as shown in Fig 2. The S7-2 has four i40e interfaces, and if the remaining three are also probed and opened, the four interfaces together occupy 4 x 64K = 256K entries in the IOMMU pool.
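The RX arithmetic above works out exactly to the size of the legacy pool, which can be checked directly:

```python
QUEUES_PER_NIC = 128   # i40e RX queues brought up on this system
DESC_PER_QUEUE = 512   # default hardware descriptor ring size
NUM_NICS = 4           # four i40e interfaces on the S7-2

per_nic = QUEUES_PER_NIC * DESC_PER_QUEUE
print(per_nic)  # 65536 premapped RX entries (64K) per interface

total = per_nic * NUM_NICS
pool_entries = (2 * 1024**3) // (8 * 1024)  # legacy pool: 2GB / 8KB pages
print(total, pool_entries)  # 262144 262144 -- RX buffers alone consume the whole pool
```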
As stated before, the legacy IOMMU pool has at most 256K entries (2GB DVMA space, 8KB page size). So after bringing up all four i40e interfaces, there are no free entries left in the IOMMU pool. At this point, if a send request arrives at any of these interfaces from the network stack, the request to map an skb for DMA fails instantly because there is no free entry in the IOMMU pool.
In a real-world scenario, many other PCIe devices in the system, such as disks and InfiniBand adapters, are connected to PCIe root complexes alongside i40e. These devices also request resources from the IOMMU pool, so in practice not all 256K entries are available to the NICs, and the IOMMU can run out of DMA addresses as soon as the resource is exhausted. Fig 3 shows a snapshot of the kernel errors that occur in such a scenario, where the i40e device fails to map RX buffers for DMA. The driver floods the kernel ring buffer with DMA map errors, making the S7-2 sluggish and eventually unresponsive.
So after identifying the legacy IOMMU limitation, scaling the DVMA space beyond 2GB became inevitable.
A new IOMMU called the 'Address Translation Unit' (ATU) was enabled in Linux on SPARC, providing support for a virtualized IOMMU, a larger DVMA space (up to 2^56), and IOTSBs with multiple huge page sizes.
The ATU code has been upstreamed and merged into Linux kernel 4.10. ATU allows host and guest OS to create an IOTSB of size 32GB within the 64-bit address ranges available in the ATU hardware. 32GB is more than enough DVMA space for a PCIe device, in contrast to the 2GB DVMA space provided by the legacy IOMMU, which is shared by all PCI devices under a root complex.
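To see how much headroom a 32GB IOTSB buys over the legacy pool, the same page arithmetic applies (assuming the same 8KB I/O page size as the legacy case):

```python
IO_PAGE = 8 * 1024

legacy = (2 * 1024**3) // IO_PAGE   # legacy pool: 2GB of DVMA space
atu = (32 * 1024**3) // IO_PAGE     # ATU IOTSB: 32GB of DVMA space

print(atu, atu // legacy)  # 4194304 16 -- 4M entries, a 16x increase
```

At 4M entries per IOTSB, the 256K mappings that previously exhausted the shared legacy pool become a small fraction of a single device's DVMA space.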
The ATU Linux kernel upstream submission can be found at http://www.spinics.net/lists/sparclinux/msg16572.html