Saturday, May 13, 2006

IOMMU and Virtualization

I/O performance has long been a concern in the virtualization community. For a virtual machine monitor (VMM) to support complete virtualization, there is no way around emulating all the peripherals accessed by a guest operating system. The main reason a VMM cannot let a guest access an I/O device directly is the insecurity inherent in DMA. Since a DMA-capable device on x86 operates on physical rather than virtual addresses (this may not be true in general; SPARC, for example, supports direct virtual memory access), a VMM has no way to restrict the addresses used in a DMA transfer to those belonging to the guest alone. As a result, most hypervisors end up emulating a handful of old I/O devices that are easy to implement but lack most of the advanced features a modern device offers. With AMD's newly proposed I/O memory management unit, or simply IOMMU, this problem finally seems to be coming to an end. In this article, I'll dig a little deeper into this technology and discuss the difference it can make in the life of a system designer.

This paragraph is for beginners; experts, please move on to the next one. The IOMMU provides two main functions: virtual-to-physical address translation and access protection on the memory ranges an I/O device tries to operate on. To support these, it defines several new data structures, of which two are worth mentioning: the device table and the I/O page table. The device table is indexed by a device ID (the one we are used to, with bus, device and function numbers) and contains, among other things, a domain ID (think of it as an address-space ID) and a pointer to an I/O page table. The domain ID lets the host group a set of peripherals that share a virtual address space, which translates into I/O page table and IOTLB sharing. For example, a VMM could put all the hardware used by a VM into a common domain, saving memory on I/O page tables as well as preventing IOTLB thrashing. The I/O page table provides the requisite virtual-to-physical translations and controls access to those pages.
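To make that concrete, here is a toy sketch in C of how the two tables fit together. The layouts, field widths and names are my own inventions for illustration and do not match the actual AMD table formats; the point is only to show how a device ID selects a domain and its I/O page table, which in turn yields the machine address and permissions for a DMA access.

/* Simplified sketch of the device table and I/O page table.  Field widths
 * and layout are illustrative only -- NOT the real AMD IOMMU formats.     */
#include <stdint.h>
#include <stdio.h>

#define IO_PAGE_SHIFT 12
#define IO_PT_ENTRIES 512                   /* one toy single-level table  */

/* I/O page table entry: machine page frame plus per-page permissions.     */
struct io_pte {
    uint64_t machine_pfn;
    unsigned readable : 1;
    unsigned writable : 1;
    unsigned present  : 1;
};

/* Device table entry, indexed by the PCI bus/device/function ID.          */
struct dev_table_entry {
    uint16_t       domain_id;      /* groups devices sharing an I/O address space */
    struct io_pte *io_page_table;  /* root of that domain's I/O page table        */
    unsigned       valid : 1;
};

static struct dev_table_entry device_table[1 << 16];   /* 16-bit device ID */

/* Translate a device-visible DMA address to a machine address, enforcing
 * the per-page permissions -- roughly what the hardware does on every
 * DMA transaction.                                                         */
static int iommu_translate(uint16_t devid, uint64_t dma_addr, int write,
                           uint64_t *machine_addr)
{
    struct dev_table_entry *dte = &device_table[devid];
    if (!dte->valid)
        return -1;                           /* device not admitted         */

    uint64_t vpn = dma_addr >> IO_PAGE_SHIFT;
    if (vpn >= IO_PT_ENTRIES)
        return -1;

    struct io_pte *pte = &dte->io_page_table[vpn];
    if (!pte->present || (write ? !pte->writable : !pte->readable))
        return -1;                           /* transaction is aborted      */

    *machine_addr = (pte->machine_pfn << IO_PAGE_SHIFT)
                    | (dma_addr & ((1ULL << IO_PAGE_SHIFT) - 1));
    return 0;
}

int main(void)
{
    static struct io_pte page_table[IO_PT_ENTRIES];
    page_table[0] = (struct io_pte){ .machine_pfn = 0x12345,
                                     .readable = 1, .writable = 1, .present = 1 };
    device_table[0x0810] = (struct dev_table_entry){ .domain_id = 1,
                                                     .io_page_table = page_table,
                                                     .valid = 1 };
    uint64_t ma;
    if (iommu_translate(0x0810, 0x0ABC, 1, &ma) == 0)
        printf("DMA to 0x0ABC hits machine address 0x%llx\n",
               (unsigned long long)ma);
    return 0;
}

In hardware, of course, this walk happens on every DMA transaction and the results are cached in the IOTLB, which is why sharing a domain across devices also means sharing IOTLB entries.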

To me, the IOMMU is more of an implementation (that we needed so badly) than an innovation in I/O design. For example, we already had the graphics aperture remapping table (GART) to map the aperture memory region to system DRAM; the difference with the IOMMU is that it can now map an arbitrary address, not just pages belonging to the graphics aperture. Protection for DMA-targeted pages was already part of AMD's Pacifica/SVM architecture through the device exclusion vector (DEV), which also incorporated the idea of protection domains. The IOMMU is thus an extension of these technologies with additional communication infrastructure (with the processor) that includes command queuing (to support multiple outstanding commands from the CPU, for efficiency), interrupts (on completion/error), and event logging (error information).
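As a rough picture of that communication infrastructure, here is how a driver might feed such a command queue from the CPU side. The command codes, field names and the doorbell comment below are made up for this sketch; they are not the real AMD IOMMU command formats or registers.

/* Producer side of a hypothetical IOMMU command ring.                     */
#include <stdint.h>
#include <string.h>

#define CMD_RING_ENTRIES 256

/* Invented command codes -- not the real AMD IOMMU command set.           */
enum iommu_cmd_op {
    CMD_INVALIDATE_DEV_ENTRY,     /* a device table entry changed          */
    CMD_INVALIDATE_IOTLB_PAGES,   /* I/O page table mappings changed       */
    CMD_COMPLETION_WAIT,          /* ask for an interrupt when all is done */
};

struct iommu_cmd {
    uint32_t op;
    uint32_t devid_or_domain;
    uint64_t address;
};

/* In-memory ring shared with the IOMMU: the CPU produces at the tail and
 * the hardware consumes at its own head pointer.                          */
static struct iommu_cmd cmd_ring[CMD_RING_ENTRIES];
static unsigned cmd_tail;

static void iommu_queue_command(const struct iommu_cmd *cmd)
{
    memcpy(&cmd_ring[cmd_tail], cmd, sizeof(*cmd));
    cmd_tail = (cmd_tail + 1) % CMD_RING_ENTRIES;
    /* A real driver would now write the new tail to an IOMMU MMIO register
     * (a doorbell) so the hardware knows more work is pending.             */
}

int main(void)
{
    /* Tell the IOMMU that device 0x0810's table entry was just rewritten.  */
    struct iommu_cmd inv = { .op = CMD_INVALIDATE_DEV_ENTRY,
                             .devid_or_domain = 0x0810 };
    iommu_queue_command(&inv);
    return 0;
}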

Hmm.. so it looks cool, now what do I do with it? Well, such a feature has several implications. It lets VMMs cut down on the performance overhead that has been part of I/O virtualization. By setting up I/O page tables that translate guest physical addresses to machine addresses, a VMM can let its guest control a device directly, albeit with a catch. The catch is that a failed translation simply aborts the transfer; there is no restartable I/O page fault the VMM could service and retry. Since the VMM has no idea which guest physical pages are undergoing DMA at any given moment, it has no choice but to pin the guest's entire physical memory. The next implication of an IOMMU is the possibility of a user-space driver. However, interrupt handling still needs to be supported before a driver can move completely to user space; with the current support, a user-space driver has to keep a small part in the kernel to take care of the interrupts generated by the device. Another novel use of this technology is to let legacy 32-bit devices access memory beyond 4GB (on 64-bit machines) by setting up the I/O page table to point to high memory pages.
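Here is a toy sketch of the two mapping tricks just mentioned: wiring a passthrough guest's physical pages into a domain's I/O page table, and pointing a below-4GB I/O address at a machine page above 4GB so a legacy 32-bit device can reach it. The flat single-level table and the map_io_page() helper are my own inventions for illustration; real I/O page tables are multi-level.

/* Toy illustration of populating a domain's I/O page table.               */
#include <stdint.h>

#define IO_PAGE_SIZE   4096ULL
#define IO_PT_ENTRIES  (1u << 20)       /* covers a 4GB I/O address space   */

#define PTE_PRESENT 0x1ULL
#define PTE_WRITE   0x2ULL

/* Flat, single-level I/O page table: entry = machine address | flag bits.
 * Real tables are multi-level; this keeps the example short.              */
static uint64_t io_page_table[IO_PT_ENTRIES];

/* Map one page of the device-visible (I/O virtual) address space onto a
 * machine page with the given permissions.                                */
static void map_io_page(uint64_t io_addr, uint64_t machine_addr, uint64_t flags)
{
    io_page_table[io_addr / IO_PAGE_SIZE] =
        (machine_addr & ~(IO_PAGE_SIZE - 1)) | PTE_PRESENT | flags;
}

int main(void)
{
    /* (1) Passthrough: the guest believes its physical page 0x1000 lives at
     *     0x1000, but the VMM placed it at machine address 0x7654000, so a
     *     guest-programmed DMA to 0x1000 must be redirected there.  The VMM
     *     pins the page beforehand (not shown).                            */
    map_io_page(0x1000, 0x7654000ULL, PTE_WRITE);

    /* (2) A legacy 32-bit device can only emit addresses below 4GB; pointing
     *     a low I/O address at a machine page above 4GB lets it reach high
     *     memory without bounce buffers.                                   */
    map_io_page(0x80000000ULL, 0x140000000ULL, PTE_WRITE);  /* 5GB machine page */

    return 0;
}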

Now, what about devices that bypass the IOMMU in a multipath setup like AMD64? Well, you are pretty much stuck. Since the IOMMU can translate and protect only the I/O traffic that actually passes through it, it can't do much in a multipath scenario. The only solution here is to place multiple IOMMUs, one in each of the I/O hubs. Virtualizing the IOMMU itself is not directly supported either; should a VMM need to do so, it has to emulate one using software techniques. However, these are issues I wouldn't worry much about for now; I'd rather enjoy the facilities that come with it. With all the major hypervisors waiting to jump on it, the I/O performance in VMs in the coming months is anyone's guess.
