IO Virtualization On ARM - Part3

12/3/13
IO Virtualization on ARM
Future Chips
Search
Subscribe
ARM Virtualization I/O Virtualization (Part 3)

April 1, 2013 Posted by Ali Hussain at 9:25 am Add comments Chip Design for All, Tips for Pow er Coders, Understanding Chips Tagged w ith: AMD-Vi, arm, cortex a15, Cortex A57, drivers, emulation, Intel VT-d, IOMMU, paravirtualization, System MMU, virtualization
In the second part of the series we introduced memory management and interrupt handling support provided by virtualization hardware extensions. But effective virtualization solutions need to reach beyond the core to communicate with peripheral devices. In this post we discuss the various techniques used for virtualizing I/O, the problems faced, and the hardware solutions to mitigate these problems. The Difficulty Of Virtualizing I/O Before we talk about the hardware solutions at the system level for virtualization we need to set up a motivation for what is driving these features. To appreciate the problems we have to recognize that in some ways communicating with I/O in a virtualized environment is a paradox. We want to run an operating system in a sandboxed environment where it is oblivious to the the system outside the virtual environment. But I/O cannot be oblivious to the outside environment because it is communicating with that environment. So, understandably virtualizing I/O becomes a difficult problem. So moving away from the philosophical questions, what is the goal of virtualization and how does I/O fit into that goal? In my view it is to provide a managed environment for hosting a VM that improves the overall user experience. To achieve this goal, ideally wed like I/O in a VM to have the following properties: 1. The guest has access to the same I/O devices it would use in a native environment.
Recent Posts Meet Flux7 Labs (update + shameless marketing) ARM Virtualization ARM vs x86 (Part 5) ARM Virtualization Applications (Part 4)
1/11
QR Code
Categories Chip Design for All (21) Fun (11) Parallel Programming (13) Software for Hardware guys (22) Thoughts for Researchers (16) Thoughts on Latest Happenings (8) Tips for Power Coders (25) Understanding Chips (3)
www.futurechips.org/chip-design-for-all/arm-virtualization-part-3-iommu.html
12/3/13
2. The guest OS cannot affect the I/O operations or memory of other guests. 3. The software changes to the guest OS must be minimal. 4. The guest OS needs to be able to recover from a failure of the hardware or migration of the VM. 5. The I/O operations on the guest OS should have similar performance to running natively. In this list we can see how several items on the list are competing with other items on the list. So the final solution will require trade-offs based on the particular use-case. Now, With these goals in mind let us look at the various techniques for implementing I/O virtualization and the problems faced. Emulated Or Paravirtualized Devices When implementing full virtualization, one of the simplest options is for the guest OS to emulate a virtual device on the host. The guest communicates with this virtual device and the hypervisor detects the guests communication. This can be done using trapping of device accesses, or permissions to certain pages of memory. The hypervisor understands the operations by the guest OS on the virtual device and performs the corresponding operation on the physical device. This technique is called hosted or split I/O.
ARM Virtualization I/O Virtualization (Part 3) ARM Virtualization Extensions Memory and Interrupts (Part 2)
Popular Posts Writing and Optimizing Parallel Programs A complete example 57 comment(s) What makes parallel programming hard? 46 comment(s) Quick Post: Should you ever use LinkedLists? 44 comment(s) Parallel Programming: When Amdahls law is inapplicable? 23 comment(s) How to trick C/C++ compilers into generating terrible code? 21 comment(s) Q & A: Do multicores save energy? Not really. 15 comment(s) Which little PC should I buy? Raspberry Pi? Mele A1000? or 14 comment(s) What every Programmer should know about the memory system 12 comment(s) Open MP vs pthreads 11 comment(s) Ten things every programmer must know about hardware 10 comment(s)
Meta Log in Entries RSS Comments RSS
2/11
12/3/13
Twittecounter
It seems there has been an internal server error with the page you've requested. Our coding monkeys have been notified and we'll be back real soon, promise! Send us a note if the problem persists!
What I'm Doing... @liliputingnews Which little PC should I
The advantage of this technique is that since every call goes through the hypervisor, the hypervisor can provide the desired functionality. For example the hypervisor can track every I/O operation the device is presently waiting on. Similarly restricting a guest from affecting other guests becomes simplified because all physical device accesses are managed by the hypervisor. But this technique has a high CPU overhead. The data needs to be copied multiple times, processed through multiple I/O stacks, etc. The performance can be improved by using paravirtualization. In this case the device drivers in the OS implement an ABI with the hypervisor. The device drivers interface with the hypervisor and the hypervisor directly communicates with the physical device
buy? Raspberry Pi? Mele A1000? or http://t.co/ydryDF1K via @sharethis 2012-07-16 @raspberry_pi Which little PC should I buy? Raspberry Pi? Mele A1000? or http://t.co/GSF5bICT 2012-07-16 @DrQz Agreed. In systems/disk usage, they are related directly but do you not agree that a channels latency and throughput are indep? 2012-06-30 More updates... Powered by Twitter Tools
3/11
12/3/13
as is shown in the figure below.

Archives June 2013 April 2013 March 2013 July 2012 June 2012 August 2011 July 2011 June 2011 May 2011
Pages About Us
This technique provides better performance with similar control but there is still a significant performance overhead, for example, in trapping to the hypervisor. Figure below shows the difference observed by IBM in using an emulated IDE controller vs IBMs virtio-blk paravirtualized device drivers in KVM.
4/11
12/3/13
When looking at this overhead it is important to keep in mind it is very use-case dependent. A CPU bound benchmark will not show much sensitivity to the virtualization of I/O. Alternatively for an I/O heavy benchmark this overhead can be significant. As an example the conjugate-gradient method for solving a system of linear equation spends around 70% of CPU cycles in the user mode and spends the remaining time in the hypervisor kernel engaged in disk I/O. Passthrough I/O Passthrough I/O greatly improves performance by remapping the guest page tables to directly write to the physical device. This eliminates most of the overhead in trapping to the hypervisor for every operation. This technique brings the bulk of I/O processing to near-native speeds.
5/11
12/3/13
There are several issues that need to be addressed to effectively virtualize I/O using this technique. Consider the case of a guest using DMA accesses to communicate with a device. In this scenario we need to account for the following issues. Isolation The goal of virtualization is to to sandbox the guest OS to keep it from accessing the data of other guest OSes. We do this in the guest by adding a second stage translation. However, the DMA devices operate on physical addresses and are not
www.futurechips.org/chip-design-for-all/arm-virtualization-part-3-iommu.html 6/11
12/3/13
aware of second stage translations. So if a guest is given unrestricted access to a DMA device it can read or write to any physical address in memory and corrupt the memory of other guests. So there needs to be a protection mechanism instituted to make sure a device only directs DMA requests from a particular guest to go to memory associated with that guest. Furthermore, more than one guest may need to access the same device. The device needs to be able to distinguish between the accesses coming from different devices and redirect them correctly. Physical Address To complete the DMA transaction the guest OS needs to provide the device with the proper physical address in memory to find the data. But the guest does not know the physical address of the data, only the Intermediate Physical Address (IPA) which is in essence a virtual address. For the DMA access to work the device must be able to translate the IPA to the correct physical address. Contiguous Memory Blocks The problem cannot be solved by just providing the device with the correct PA. The device expects the DMA target region to be located in a contiguous region of memory. In a virtualized environment this is not guaranteed. The hypervisor may allocate guest pages that are not contiguous in as small as 4K blocks. So the device must be able to do this translation for the entire DMA region. 32 Bit Devices In Larger Address Spaces This problem is similar to the problem with a 32 bit guest on a 64 bit host discussed in the previous post. The system may have older devices that cannot access the complete larger address spaces of newer systems. An address translation is necessary to use these devices with a DMA outside their normal addressable range. Hardware Support
7/11
12/3/13
The problems mentioned above are not easily solved in software and need a hardware solution that correctly maps device addresses to the correct guest. Most platforms have hardware solutions for this. This mechanism is called IOMMU for IO Memory Management Unit. Intel calls their implementation VT-d, AMD calls their implementation AMD-Vi, and ARM calls their implementation SystemMMU. The basic idea for the IOMMU is simple. An address translation unit is placed in between any devices that may be used by a guest OS. When the hypervisor is setting up second stage page tables for a guest OS to access the device, it sets up the IOMMU too. Similar to tablewalks in the core, address translations are expensive. So TLBs are implemented to reduce the overhead of address translations.
An example system showing where the System MMU can be located. Transactions with the device are translated through the system MMU. System MMU The ARM System MMU is programmed with different translation contexts. It maps each transaction to the corresponding context by matching against expected streams. Based
12/3/13
on the context the System MMU may either bypass the translation, cause a fault, or perform a translation. The System MMU in the ARM architecture provides full 2 stage translation support (as described in the previous post) and depending on the context we may either do a first stage translation or a second stage translation. To perform the translation the System MMU has registers analogous to the TTBRs and other control registers for each context. The system MMU may also receive faults during its translation process or if a context is not setup. Depending on the type of fault and how the System MMU is configured it may take certain actions. A translation fault can trigger an interrupt. This allows an opportunity for the hypervisor to service the interrupt and restart the translation so it can come to completion. The System MMU may also send a BUSERROR to the appropriate requestor. There are syndrome registers present to ease the process of diagnosing and fixing the problem. Some advantages of System MMU dont even need virtualization. Since the System MMU enables every device to perform VA to PA translations, I/O operations can be performed by drivers in user-space using VAs. The permission checking and translation maps can ensure one user application does not corrupt the memory of another application . This would eliminate the traps to kernel presently required further reducing I/O overhead. Another problem is dealing with contiguous memory. Many operations result in very large DMA accesses that cannot be allocated a single chunk of memory by the OS. Presently they need to either be split into multiple DMA requests or performed with complex DMA scatter-gather operations. The System MMU enables the device to communicate via a DMA based on a contiguous VA instead of fragmented PAs. This both reduces the CPU overhead and simplifies the software and device. It should be noted that the System MMU is a part of the platform rather than a part of the core architecture. This means it only affects the drivers. Because of this many features are implementation defined. For example the bits used to match a stream and map it to a context are implementation defined. Since there is no user code that is aware of this part of the system, changes to the system MMU architecture wouldnt require as many legacy code issues. So using these techniques the hypervisor can provide an appropriate implementation of
12/3/13
virtualized I/O according to the use-case. This concludes the third installment of this series on virtualization. This series continues in the next post discussing the use-cases for virtualization especially the use cases targeted in the mobile space by ARM. References For more information check out the following resources. http://xpgc.vicp.net/course/svt/TechDoc/ch12IOArchitecturesForVirtualization.pdf http://nowlab.cse.ohio-state.edu/NOW/dissertations/huang.pdf http://www.ibm.com/developerworks/linux/library/l-virtio/ http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaat/liaatbestpractices_pdf.pdf http://www.mulix.org/lectures/xen-iommu/xen-io.pdf http://developer.amd.com/wordpress/media/2012/10/IOMMU-ben-yehuda.pdf http://www.arm.com/files/pdf/System-MMU-Whitepaper-v8.0.pdf http://software.intel.com/en-us/articles/intel-virtualization-technology-for-directedio-vt-d-enhancing-intel-platforms-for-efficient-virtualization-of-io-devices http://support.amd.com/us/Processor_TechDocs/48882.pdf
You may also like Which little PC should I buy? Raspberry Pi? Mele A1000? or ... Why computer architects MUST benchmark Javascript? Tips for iPhone Developers: The web-based sandbox for understanding Cortex A8 is ready (Part 3) Answers to Computer Science Self-assessment Quiz Computer Science Self-assessment Quiz
0 share 0 0 295
Leave a Reply
12/3/13
Name
(required)
E-mail
(required)
URI
Your Comment
You may use these HTML tags and attributes: < ah r e f = " "t i t l e = " " >< a b b rt i t l e = " " >< a c r o n y mt i t l e = " " >< b > < b l o c k q u o t ec i t e = " " >< c i t e >< c o d e >< d e ld a t e t i m e = " " >< e m >< i >< qc i t e = " " >< s t r i k e >< s t r o n g >
Submit Comment
ARM Virtualization Extensions Memory and Interrupts (Part 2)
ARM Virtualization ARM vs x86 (Part 5)
2012 Future Chips
Suffusion theme by Sayontan Sinha
11/11

IO Virtualization On ARM - Part3

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

IO Virtualization On ARM - Part3

Uploaded by

Copyright:

Available Formats

12/3/13

ARM Virtualization I/O Virtualization (Part 3)

Meta Log in Entries RSS Comments RSS

What I'm Doing... @liliputingnews Which little PC should I

as is shown in the figure below.

ARM Virtualization Extensions Memory and Interrupts (Part 2)

ARM Virtualization ARM vs x86 (Part 5)

2012 Future Chips

Suffusion theme by Sayontan Sinha

You might also like