Joshua LeVasseur    Volkmar Uhlig    Jan Stoess    Stefan Götz
University of Karlsruhe, Germany
Abstract
We propose a method to reuse unmodified device drivers and to improve system dependability using virtual machines. We run the unmodified device driver, with its original operating system, in a virtual machine. This approach enables extensive reuse of existing and unmodified drivers, independent of the OS or device vendor, significantly reducing the barrier to building new OS endeavors. By allowing distinct device drivers to reside in separate virtual machines, this technique isolates faults caused by defective or malicious drivers, thus improving a system's dependability.
We show that our technique requires minimal support infrastructure and provides strong fault isolation. Our prototype's network performance is within 3–8% of a native Linux system. Each additional virtual machine increases the CPU utilization by about 0.12%. We have successfully reused a wide variety of unmodified Linux network, disk, and PCI device drivers.
1 Introduction
The majority of today's operating system code base is accounted for by device drivers.¹ This has two major implications. First, any OS project that aims for even a reasonable breadth of device drivers faces either a major development and testing effort or has to support and integrate device drivers from a driver-rich OS (e.g., Linux or Windows). Even though almost all research OS projects reuse device drivers to a certain extent, full reuse for a significant driver base has remained an elusive goal and so far can be considered unachieved. The availability of drivers solely in binary format from the Windows driver base shows the limitations of integration and wrapping approaches as advocated by the OS-Kit project [10]. Also, implicit, undocumented, or in the worst case incorrectly documented OS behavior makes driver reuse with a fully emulated execution environment questionable.

The second implication of the large fraction of driver code in mature OS's is the extent of programming errors [7]. This is particularly problematic since testing requires accessibility to sometimes exotic or outdated hardware. The likelihood of programming errors in commonly used device drivers is probably much lower than in application code; however, such errors are often fatal. Device drivers, traditionally executing in privileged mode, can potentially propagate faults to other parts of the operating system, leading to sporadic system crashes.

In this paper we propose a pragmatic approach for full reuse and strong isolation of legacy device drivers. Instead of integrating device driver code we leave all drivers in their original and fully compatible execution environment: the original operating system. We run the device driver wrapped in the original operating system in a dedicated virtual machine (VM). Thus we can (almost) guarantee that semantics are preserved and that incompatibilities are limited to timing behavior introduced by virtual machine multiplexing.

¹Linux 2.4.1 drivers cover 70% of its IA32 code base [7].
The virtual machine environment also strongly isolates device drivers from the rest of the system to achieve fault containment. The isolation granularity depends on the number of collocated drivers in a single VM. By instantiating multiple collaborating VMs we can efficiently isolate device drivers with minimal resource overhead.

Reuse of device drivers and driver isolation are two important aspects of operating systems; however, they are usually discussed independently. With virtual machines, we propose to use a single abstraction to solve both problems in an extremely flexible, elegant, and efficient way.
2 Related Work
Our work uses known principles of hardware-based isolation to achieve driver reuse and improved system dependability. It is unique in the manner and extent to which it accomplishes unmodified driver reuse, and in how it improves system dependability with respect to drivers, without system modification.
2.1 Reuse
Binary driver reuse has been achieved with cohosting, as used in VMware Workstation [32]. Cohosting multiplexes the processor between two collaborating operating systems, e.g., the driver OS and the VM monitor. When device activity is necessary, processor control is transferred to the driver OS in a world switch (which restores the interrupt handlers of the driver OS, etc.). The driver OS releases ownership of the processor upon device activity completion. The cohosting method offers no trust guarantees; both operating systems run fully privileged in supervisor mode and can interfere with each other.
Device drivers are commonly reused by transplanting source modules from a donor OS into the new OS [2, 4, 11, 15, 28, 35]. In contrast to cohosting, the new OS dominates the transplanted drivers. The transplant merges two independently developed code bases, glued together with support infrastructure. Ideally the two subsystems enjoy independence, such that the design of one does not interfere with the design of the other. Past work demonstrates that, despite great effort, conflicts are unavoidable and lead to compromises in the structure of the new OS. Transplantation has several categories of reuse issues, which we describe in the following subsections.
Semantic Resource Conflicts
The transplanted driver obtains resources (memory, locks, CPU, etc.) from its new OS, subject to normal obligations and limitations, creating a new and risky relationship between the two components. In the reused driver's raw state, its manner of resource use could violate the resource's constraints. The misuse can cause accidental denial of service (e.g., the reused driver's non-preemptible interrupt handler consumes enough CPU to reduce the response latency for other subsystems), can cause corruption of a manager's state machine (e.g., invoking a non-reentrant memory allocator at interrupt time [15]), or can deadlock in a multiprocessor system.

These semantic conflicts are due to the nature of OS design. A traditional OS divides bulk platform resources such as memory, processor time, and interrupts between an assortment of subsystems. The OS refines the bulk resources into linked lists, timers, hash tables, top-halves and bottom-halves, and other units acceptable for distributing and multiplexing between the subsystems. The resource refinements impose rules on the use of the resources, and depend on cooperation in maintaining the integrity of the state machines. Modules of independent origin substitute a glue layer for the cooperative design. For example, when a Linux driver waits for I/O, it removes the current thread from the run queue. To capture the intended thread operation and to map it into an operation appropriate for the new OS, the glue layer allocates a Linux thread control block when entering a reused Linux component [2, 28]. In systems that use asynchronous I/O, the glue layer converts the thread operations into I/O continuation objects [15].

Sharing Conflicts
A transplanted driver shares the address space and privilege domain with the new OS. Their independently developed structures contend for the same resources in these two domains, and are subject to each other's faults.

Due to picky device drivers and non-modular code, a solution for fair address space sharing may be unachievable. The older Linux device drivers, dedicated to the IA32 platform, assumed virtual memory was idempotently mapped to physical memory. Reuse of these drivers requires modifications to the drivers or loss in flexibility of the address space layout. The authors in [28] decided not to support such device drivers, because the costs conflicted with their goals. The authors of [15] opted to support the drivers by remapping their OS.

Privileged operations generally have global side effects. When a device driver executes a privileged operation for the purposes of its local module, it likely affects the entire system. A device driver that disables processor interrupts disables them for all devices. Cooperatively designed components plan for the problem; driver reuse spoils cooperative design.
Engineering Effort
Device driver reuse reduces engineering effort in OS construction by avoiding reimplementation of the device drivers. Preserving confidence in the correctness of the original drivers is also important. When given device drivers that are already considered to be reliable and correct (error counts tend to reduce over time [7]), it is hoped that their reuse will carry along the same properties. Confidence in the new system follows from thorough knowledge of the principles behind the system's construction, accompanied by testing.
Reusing device drivers through transplantation reduces the overall engineering effort for constructing a new OS, but it still involves substantial work. In [10] Ford et al. report 12% of the OS-Kit code as glue code. Engineering effort is necessary to extract the reused device drivers from their source operating systems, and to compile and link them with the new operating system. The transplant requires glue layers to handle semantic differences and interface translation.
For implementation of a glue layer that gives us confidence in its reliability, intimate knowledge is required about the functionality, interfaces, and semantics of the reused device drivers. The authors in [2, 15, 28] all demonstrate intimate knowledge of their source operating systems.
The problems of semantic and resource conflicts multiply as device drivers from several source operating systems are transplanted into the new OS. Intimate knowledge of the internals of each source operating system is indispensable. Driver update tracking can necessitate adaptation effort as well.
2.2 Dependability
The use of virtual machines to enhance reliability has long been known [16]. A variety of other techniques for enhancing system dependability also exist, such as safe languages and software isolation, and are complementary to our approach. The orthogonal design provided by our solution permits coexistence with incompatible subsystems and development methodologies.
User-level device driver frameworks [9, 11, 17, 20, 26, 31] are a known technique to improve dependability. They are typically deployed in a microkernel environment. Our approach also executes the device drivers at user level; however, we use the platform interface rather than a specialized and potentially more efficient API.

The recent Nooks project [33] shares our goal of retrofitting dependability enhancements in commodity systems. Their solution isolates drivers within protection domains, yet still executes them within the kernel with complete privileges. Without privilege isolation, complete fault isolation is not achieved, nor is detection of malicious drivers possible.
Nooks collocates with the target kernel, adding 22,000 lines of code to the Linux kernel's large footprint, all privileged. The Nooks approach is similar to second-generation microkernels (such as L4, EROS, or K42) in providing address space services and synchronous communication across protection domains, but it does not take the next step of deprivileging the isolation domains (and thus exiting to user level, which is a minuscule overhead compared to the cost of address space switching on IA32).

To compensate for Linux's intricate subsystem entanglement, Nooks includes interposition services to maintain the integrity of resources shared between drivers. In our approach, we connect drivers at a high abstraction level (the request) and thus avoid the possibility of corrupting one driver by the actions of another.
Like us, another contemporary project [12, 13] uses paravirtualization for user-level device drivers, but focuses on achieving a unified device API and driver isolation. Our approach specifically leaves driver interfaces undefined and thus open for specializations and layer-cutting optimizations. Their work argues for a set of universal common-denominator interfaces per device class.

3 Approach
The traditional approach to device driver construction favors intimate relationships between the drivers and their kernel environments, interfering with easy reuse of drivers. On the other hand, applications in the same environments interface with their kernels via well-defined APIs, permitting redeployment on similar kernels. Applications enjoy the benefits of orthogonal design.
To achieve reuse of device drivers from a wide selection of operating systems, we classify drivers as applications subject to orthogonal design, based on the following principles:
Resource delegation: The driver receives only bulk resources, such as memory at page granularity. The responsibility to further refine the bulk resources lies with the device driver. The device driver converts its memory into linked lists and hash tables, manages its stack layout to support reentrant interrupts, and divides its CPU time between its threads.

Separation of name spaces: The device driver executes within its own address space. This requirement avoids naming conflicts between driver instances, and helps prevent faulty accesses to other memory.
Separation of privilege: Like applications, the device driver executes in unprivileged mode. It is unable to interfere with other OS components via privileged instructions.
Secure isolation: The device driver lacks access to the memory of non-trusting components. Likewise, the device driver is unable to affect the flow of execution in non-trusting components. These same properties also protect the device driver from the other system components. When non-trusting components share memory with the drivers, they are expected to protect their internal integrity; sensitive information is not stored on shared pages, or when it is, shadow copies are maintained in protected areas of the clients [14].
Common API: The driver allocates resources and controls devices with an API common to all device drivers. The API is well documented, well understood, powerfully expressive, and relatively static.

Most legacy device drivers in their native state violate these orthogonal design principles. They use internal interfaces of their native operating systems, expect refined resources, execute privileged instructions, and share a global address space. Their native operating systems partially satisfy our requirements. Operating systems provide resource delegation and refinement, and use a common API: the system's instruction set and platform architecture. By running the OS with the device driver in a virtual machine, we satisfy all of the principles and thus achieve orthogonal design.

3.1 Architecture
To reuse and isolate a device driver, we execute it and its native OS within a virtual machine. The driver directly controls its device via a pass-through enhancement to the virtual machine, which permits the device driver OS (DD/OS) to access the device's registers and ports and to receive hardware interrupts. The VM, however, inhibits the DD/OS from seeing and accessing devices which belong to other VMs.
The driver is reused by a client, which is any process in the system external to the VM, at a privileged or user level. The client interfaces with the driver via a translation module added to the device driver's OS. This module behaves as a server in a client-server model. It maps client requests into sequences of DD/OS primitives for accessing the device, and converts completed requests into appropriate responses to the client.
The translation module controls the DD/OS at one of several layers of abstraction: potentially the user-level API of the DD/OS (e.g., file access to emulate a raw disk), raw device access from user level (e.g., raw sockets), abstracted kernel module interfaces such as the buffer cache, or the kernel primitives of the device drivers in the DD/OS. It is important to choose the correct abstraction layer to achieve the full advantages of our device driver reuse approach; the right layer enables a single translation module to reuse a wide variety of devices, hopefully without a serious performance penalty. For example, a translation module that interfaces with the block layer can reuse hard disks, floppy disks, optical media, etc., as opposed to reusing only a single device driver.
To isolate device drivers from each other, we execute the drivers in separate and co-existing virtual machines. This also enables simultaneous reuse of drivers from incompatible operating systems. When an isolated driver relies on another (e.g., a device needs bus services), the two DD/OS's are assembled into a client-server relationship. See Figure 1 for a diagram of the architecture.

The requirement for a complete virtual machine implementation is avoidable by substituting a paravirtualized DD/OS for the unmodified DD/OS. In the paravirtualized model [3, 16], the device driver's OS is modified to interface directly with the underlying system. Most importantly, however, the device drivers in general remain unmodified; they only need to be recompiled.
3.2 Virtual Machine Environment
In our virtualization architecture we differentiate between five entities:
• The hypervisor is the privileged kernel, which securely multiplexes the processor between the virtual machines.

• The virtual machine monitor (VMM) manages and allocates the resources of the virtual machines and implements the virtualization layer. The VMM can be either collocated with the hypervisor in privileged mode or unprivileged, interacting with the hypervisor through a specialized interface.
•Device driver OS’s host unmodified legacy device drivers and have pass-through access to the device.
They control the device via either port IO or memory-mapped IO and can initiate DMA. However, the VMM restricts access to only those devices that are managed by each particular DD/OS.
• Clients use device services exported by the DD/OS's, in a traditional client-server scenario. Recursive usage of driver OS's is possible; i.e., a client can act as a DD/OS for another client. The client could be the hypervisor itself.
• Translation modules are added to DD/OS's to provide device services to the clients. They provide the interface for the client-to-DD/OS communication, and map client requests into DD/OS primitives.

The hypervisor features a low-overhead mechanism for inter-virtual-machine communication. For message notification, each VM can raise a communication interrupt in another VM and thereby signal a pending request. Similarly, on request completion the DD/OS can raise a completion interrupt in the client OS. The hypervisor also provides a mechanism to share memory between multiple virtual machines. The VMM can register memory areas of one VM in another VM's physical memory space, similarly to memory-mapped device drivers.
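To make these primitives concrete, the following C sketch shows what such an interface could look like. The names (vm_raise_irq, vm_map_shared) and types are hypothetical illustrations, not the actual L4 or VMM API.

```c
/* Hypothetical hypervisor interface for inter-VM communication.
 * These declarations illustrate the mechanisms described above;
 * they are not the actual L4 system-call API. */

#include <stdint.h>
#include <stddef.h>

typedef uint32_t vm_id_t;      /* identifies a virtual machine */
typedef uint32_t virq_t;       /* virtual interrupt number     */

/* Raise a communication interrupt in another VM, e.g. to signal
 * a pending request (client -> DD/OS) or a completion
 * (DD/OS -> client). */
int vm_raise_irq(vm_id_t target, virq_t irq);

/* Register a memory area of one VM (source) in another VM's
 * (target) guest-physical address space, similar to a
 * memory-mapped device window. */
int vm_map_shared(vm_id_t source, uintptr_t src_gphys,
                  vm_id_t target, uintptr_t dst_gphys,
                  size_t bytes, unsigned access_rights);
```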
3.3 Client Requests
To provide access to its devices, the driver OS exports a virtual device interface that can be accessed by the client. The interface for client-to-DD/OS device communication is not defined by the hypervisor or the VMM but rather left to the specific translation module implementation. This allows for optimizations such as virtual interrupt coalescing, scatter-gather copying, shared buffers, and producer-consumer rings as used in Xen [3].
The translation module makes one or more memory pages accessible to the client OS and uses interrupts for signalling, subject to the particular interface and request requirements. This is very similar to interaction with real hardware devices. When the client signals the DD/OS, the VMM injects a virtual interrupt to cause invocation of the translation module. When the translation module signals the client in response, it invokes a method of the VMM, which can be implemented as a trap caused by a specific privileged instruction, by an access to an IO port, or by a memory access.
3.4 Enhancing Dependability
Commodity operating systems continue to employ system construction techniques that favor performance over dependability [29]. If their authors intend to improve system dependability, they face the challenge of enhancing the large existing device driver base, potentially without source code access to all drivers.
Our architecture improves system availability and reliability, while avoiding modifications to the device drivers, via driver isolation within virtual machines. The VM provides a hardware protection domain, deprivileges the driver, and inhibits its access to the remainder of the system (while also protecting the driver from the rest of the system). The use of the virtual machine supports today's systems and is practical in that it avoids a large engineering effort.
The device driver isolation helps to improve reliability by preventing fault propagation between independent components. It improves driver availability by supporting fine-grained driver restart (virtual machine reboot). Improved driver availability leads to increased system reliability when clients of the drivers promote fault containment. Proactive restart of drivers, to reset latent errors or to upgrade drivers, reduces dependence on recursive fault containment, thus helping to improve overall system reliability.
The DD/OS solution supports a continuum of configurations for device driver isolation, from individual driver isolation within dedicated VMs to grouping of all drivers within a single VM. Grouping drivers within the same DD/OS reduces the availability of the DD/OS to that of the least stable driver (if not further). Even with driver grouping, the system enjoys the benefits of fault isolation and driver restart.
Driver restart is a response to one of two event types: asynchronous (e.g., in response to fault detection [33], or in response to a malicious driver), or synchronous (e.g., live upgrades [23] or proactive restart [5]). The reboot response to driver failure returns the driver to a known good state: its initial state. The synchronous variant has the advantage of being able to quiesce the DD/OS prior to rebooting, and to negotiate with clients to complete sensitive tasks. Our solution permits restart of any driver via a VM reboot. However, drivers that rely on a hardware reset to reinitialize their devices may not be able to recover their devices.
The interface between the DD/OS and its clients provides a natural layer of indirection to handle the discontinuity in service due to restarts. The indirection captures accesses to a restarting driver. The access is either delayed until the connection is transparently restarted [23] (requiring the DD/OS or the VMM to preserve canonical cached client state across the restart), or reflected back to the client as a fault.
4 Virtualization Issues
The isolation of the DD/OS via a virtual machine introduces several issues: the DD/OS consumes resources beyond those that a device driver requires, it performs DMA operations, and it can violate the special timing needs of physical hardware. Likewise, legacy operating systems are not designed to collaborate with other operating systems to control the devices within the system. This section presents solutions to these issues.
4.1 DMA Address Translation
DMA operates on physical addresses of the machine. In a VM, memory addresses are subject to another address translation: from guest physical to host physical addresses. Since devices are not subject to TLB address translation, DMA addresses calculated inside the VM and fed to a hardware device reference incorrect host memory addresses.
Virtual machine monitors usually run device drivers at kernel privilege level [3, 21, 35]. The VMM exports virtual hardware devices to the VM, which may or may not resemble the real hardware in the system. On device access the monitor intercepts and translates requests and DMA addresses to the machine's real hardware. Since all hardware accesses including DMA requests are intercepted, the VM is confined to its compartment.
When giving a VM unrestricted access to DMA-capable devices, the VM-to-host memory translation either has to be incorporated into all device requests or the DMA address translation has to be preserved. The particular approach depends on available hardware features and the virtualization method (full virtualization vs. paravirtualization).
In a paravirtualized environment the DD/OS can incorporate the VMM page mappings into the DMA address translation. For the Linux kernel this requires modification to only a few functions. The hypervisor also has to support an interface for querying and pinning the VM's memory translations.
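As a rough illustration of how small this modification can be, the sketch below replaces a Linux-style virtual-to-bus translation with one that consults the hypervisor. The hypercalls hv_gphys_to_mphys() and hv_pin_gphys() are assumed names for the query and pin operations mentioned above, not real interfaces.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096

typedef uint64_t dma_addr_t;

/* Assumed hypervisor interface (illustrative, not a real API):
 * query the guest-physical to host-physical mapping and pin it
 * so the translation stays valid for the duration of the DMA. */
extern uint64_t hv_gphys_to_mphys(uint64_t gphys);
extern int      hv_pin_gphys(uint64_t gphys, size_t len);

/* Guest-virtual to guest-physical, e.g. __pa() in Linux. */
extern uint64_t guest_virt_to_phys(void *vaddr);

/* Paravirtualized replacement for the kernel's virtual-to-bus
 * translation: guest-virtual -> guest-physical -> machine. */
static dma_addr_t guest_virt_to_bus(void *vaddr)
{
    uint64_t gphys = guest_virt_to_phys(vaddr);
    hv_pin_gphys(gphys, PAGE_SIZE);    /* keep the mapping constant */
    return (dma_addr_t)hv_gphys_to_mphys(gphys);
}
```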
When DMA address translation functions cannot be overridden, the DD/OS's have to be mapped idempotently to physical memory. At first sight, this would restrict the system to a single DD/OS instance. But by borrowing ideas from single-address-space OS's we can overcome this restriction under certain circumstances. In many cases device drivers only issue DMA operations on dynamically allocated memory, such as the heap or page pool. Hence, only those pages require the restriction of being mapped idempotently. Using a memory balloon driver [36], pages can be reclaimed for use in other DD/OS's, effectively sharing DMA-capable pages between all DD/OS's (see Figure 2).
DMA from static data pages, such as microcode for SCSI controllers, further requires idempotent mapping of data pages. However, dynamic driver instantiation usually places drivers into memory allocated from the page pool anyway. Alternatively, one DD/OS can run completely unrelocated; multiple instances of the same OS can potentially share the read-only parts.
It is important to note that all these solutions assume well-behaved DD/OS's. Without special hardware support, DD/OS's can still bypass memory protection by performing DMA to physical memory outside their compartments.
4.2 DMA and Trust
Code with unrestricted access to DMA-capable hardware devices can circumvent standard memory protection mechanisms. A malicious driver can potentially elevate its privileges by using DMA to replace hypervisor code or data. In any system without explicit hardware support to restrict DMA accesses, we have to consider device drivers as part of the trusted computing base.

Isolating device drivers in separate virtual machines can still be beneficial. Nooks [33] only offers very weak protection by leaving device drivers fully privileged, but [...]
The latter two cases can only be enforced with DMA restrictions as described in the next section.
During a DMA operation, page translations targeted by DMA have to stay constant. If the DD/OS's memory is not statically allocated, it has to explicitly pin the memory. When the DD/OS initiates DMA in or out of the client's memory to eliminate copying overhead, it must pin that memory as well. In the case that the DD/OS is untrusted, the hypervisor has to enable DMA permissions to the memory and to ensure that the DD/OS cannot run denial-of-service attacks by pinning excessive amounts of physical memory.
When the DD/OS and client distrust each other, further provisions are required. If the DD/OS gets charged for pinning memory, a malicious client could run a DoS attack against the driver. A similar attack by the DD/OS against the client is possible when the DD/OS performs the pinning on behalf of the client. The solution is a cooperative approach with both untrusted parties involved. The client performs the pin operation on its own memory, which eliminates a potential DoS attack by the DD/OS. Then the DD/OS validates with the hypervisor that the pages are sufficiently pinned. By using time-bound pinning [27] guaranteed by the hypervisor, the DD/OS can safely perform the DMA operation.
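The handshake can be outlined as follows. The hypervisor calls hv_pin_timebound() and hv_validate_pinned() are hypothetical stand-ins for the time-bound pinning mechanism of [27]; program_device_dma() abstracts the device-specific part.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t vm_id_t;

/* Hypothetical hypervisor calls standing in for the time-bound
 * pinning support of [27]. */
extern int hv_pin_timebound(uintptr_t gphys, size_t len,
                            uint64_t until);   /* charged to the caller */
extern int hv_validate_pinned(vm_id_t owner, uintptr_t gphys,
                              size_t len, uint64_t until);
extern void program_device_dma(uintptr_t gphys, size_t len);

/* Client side: the client pins its own memory, so pinning costs
 * are charged to the client and cannot be turned into a DoS
 * attack against the DD/OS. */
int client_prepare_dma(uintptr_t buf, size_t len, uint64_t deadline)
{
    return hv_pin_timebound(buf, len, deadline);
}

/* DD/OS side: before programming the device, validate with the
 * hypervisor that the client's pages remain pinned long enough
 * for the DMA operation to complete. */
int ddos_start_dma(vm_id_t client, uintptr_t buf, size_t len,
                   uint64_t deadline)
{
    if (hv_validate_pinned(client, buf, len, deadline) != 0)
        return -1;               /* insufficiently pinned: reject */
    program_device_dma(buf, len);
    return 0;
}
```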
Page translations also have to stay pinned during a VM restart, since a faulting DD/OS may leave a device actively using DMA. All potentially targeted memory thus cannot be reclaimed until the VMM is sure that outstanding DMA operations have either completed or aborted.
Likewise,client OS’s must not use memory handed out to the faulted DD/OS until its restart has completed. 4.3IO-MMU and IO Contexts
The IO-MMU, initially designed to overcome the 32-bit address limitation for DMA in 64-bit systems, enables remapping of bus addresses to host addresses at page granularity. IO-MMUs are available in, among others, AMD Opteron [1], Alpha 21172 [8], and HP Itanium systems [22]. They can be used to enforce access permissions for DMA operations and to translate DMA addresses. Thus, DD/OS's can be fully hardware-isolated from the VMM and other VMs, removing device drivers from the trusted computing base [24].
Tailored towards monolithic OS designs, IO-MMUs usually do not support multiple address contexts, such as per-device, per-slot, or per-bus translations. The conflicting sets of virtual-to-physical mappings of isolated device drivers prevent simultaneous use of these IO-MMUs. We emulate multiple IO address contexts by time-multiplexing the IO-MMU between PCI devices. Resembling task scheduling, we periodically schedule IO-MMU contexts and enable bus access for only those devices that are associated with the active context.
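The core of such a context scheduler might look like the sketch below: on each period it revokes bus mastering from the devices of the outgoing context, switches the IO-MMU to the next context's translations, and re-enables the devices bound to the incoming context. The config-space accessors use the standard PCI command register; iommu_load_context() is an assumed chipset-specific hook, and the in-progress-DMA check needed for IDE is omitted for brevity.

```c
#include <stdint.h>

#define PCI_COMMAND        0x04   /* offset of the PCI command register */
#define PCI_COMMAND_MASTER 0x04   /* bus-master enable bit              */

struct pci_dev;                   /* opaque device handle               */

struct io_context {
    struct io_context *next;      /* round-robin list of contexts       */
    uintptr_t pgtable;            /* IO-MMU translations of this VM     */
    struct pci_dev **devs;        /* devices bound to this context      */
    int ndevs;
};

extern uint16_t pci_conf_read16(struct pci_dev *d, int off);
extern void pci_conf_write16(struct pci_dev *d, int off, uint16_t v);
extern void iommu_load_context(uintptr_t pgtable); /* chipset-specific */

static void set_bus_master(struct pci_dev *d, int on)
{
    uint16_t cmd = pci_conf_read16(d, PCI_COMMAND);
    cmd = on ? (cmd | PCI_COMMAND_MASTER) : (cmd & ~PCI_COMMAND_MASTER);
    pci_conf_write16(d, PCI_COMMAND, cmd);
}

/* Invoked from the periodic timer, independently of CPU scheduling. */
void iommu_schedule_tick(struct io_context **cur)
{
    struct io_context *out = *cur, *in = out->next;
    int i;

    for (i = 0; i < out->ndevs; i++)   /* revoke the outgoing devices */
        set_bus_master(out->devs[i], 0);
    iommu_load_context(in->pgtable);   /* switch IO-MMU translations  */
    for (i = 0; i < in->ndevs; i++)    /* grant the incoming devices  */
        set_bus_master(in->devs[i], 1);
    *cur = in;
}
```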
The PCI specification [30] does not define a maximum access latency to the PCI bus, but only requires fair arbitration preventing deadlocks. Devices therefore have to be designed for potentially long bus access latencies of up to multiple milliseconds, which makes a coarse-grained scheduling approach feasible. The scheduling period has to be within the bounds of each device's timing tolerances; the particular handling of timeouts is specific to the device class. For example, network cards simply start dropping packets when the card's internal buffers overflow, whereas the IDE DMA controller signals an error condition.²
A downside of time multiplexing is that the average available bus bandwidth for a device decreases and delivery latency increases. Benchmarks with a gigabit Ethernet NIC show a throughput decrease that is proportional to the allocated bus share. We further reduce the impact of time multiplexing by dynamically adapting bus allocations based on device utilization, preferring active and asynchronously operating devices.

²IO-MMU time multiplexing is not fully transparent for all device classes. For example, the IDE DMA controller in our experimental AMD Opteron system requires dedicated handling. The IDE controller's behavior changes based on its DMA state: DMA startup or in-progress DMA. For DMA startup it can accept a multi-millisecond latency until its first bus access is permitted to proceed. But if its bus master access is rescinded for a multi-millisecond duration during an active DMA operation, it aborts instead of retrying the operation. The problem is that the millisecond scheduling period exceeds the device's latency tolerance. We therefore additionally check for in-progress DMA directly at the IDE controller and delay the preemption until DMA completion. However, to perform this test we need specific device knowledge (even though it is for a whole device class), compromising the transparency of our approach.
The IO-MMU time multiplexing is a performance compromise to support device driver isolation on inadequate hardware, and is a proof of concept for our reuse and isolation goals. Future hardware solutions could eliminate the need for time multiplexing.
4.4 Resource Consumption
Each DD/OS consumes resources that extend beyond the inherent needs of the driver itself. The DD/OS needs a minimum amount of memory for code and data. Furthermore, each DD/OS has a certain dynamic processing overhead for periodic timers and housekeeping, such as page aging and cleaning. Periodic tasks in DD/OS's lead to cache and TLB footprints, imposing overhead on the clients even when not using any device drivers.
Page sharing as described in [36] significantly reduces the memory and cache footprint induced by individual DD/OS's. The sharing level can be very high when the same DD/OS kernel image is used multiple times and customized with loadable device drivers. In particular, the steady-state cache footprint of concurrent DD/OS's is reduced since the same housekeeping code is executed. It is important to note that memory sharing not only reduces overall memory consumption but also the cache footprint for physically tagged caches.
The VMM can further reduce the memory consumption of a VM by swapping unused pages to disk. However, this approach is infeasible for the DD/OS running the swap device itself (and its dependency chain). Hence, standard page swapping is permitted to all but the swap DD/OS. When treating the DD/OS as a black box, we cannot swap unused parts of the swap DD/OS via working-set analysis. All parts of the OS must always be in main memory to guarantee full functionality even for rare corner cases.
Besides memory sharing and swapping, we use three methods to further reduce the memory footprint. First, memory ballooning actively allocates memory in the DD/OS, leading to self-paging [18, 36]. The freed memory is handed back to the VMM. Second, we treat zero pages specially, since they can be trivially restored. Finally, we compress [6] the remaining pages that do not belong to the active working set and that are not safe to swap, and uncompress them on access.
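A minimal sketch of the reclamation decision follows: pages outside the active working set are dropped entirely if they are all zeros (trivially restorable) and compressed otherwise. The helper predicates and compress_page() are illustrative stand-ins; the latter abstracts the compression scheme of [6].

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096

enum page_disposition { PAGE_KEEP, PAGE_ZERO, PAGE_COMPRESS };

extern int in_active_working_set(const void *page);
extern int safe_to_swap(const void *page);
extern void compress_page(const void *page);  /* stand-in for [6] */

/* Decide how to reclaim one DD/OS page. Zero pages need no
 * backing storage at all; other cold, unswappable pages are
 * compressed and uncompressed on access (which requires DMA
 * hardware that faults on unmapped pages). */
enum page_disposition reclaim_page(const uint64_t *page)
{
    size_t i;

    if (in_active_working_set(page) || safe_to_swap(page))
        return PAGE_KEEP;            /* handled by normal swapping */

    for (i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++)
        if (page[i] != 0)
            break;
    if (i == PAGE_SIZE / sizeof(uint64_t))
        return PAGE_ZERO;            /* all zeros: drop the frame  */

    compress_page(page);
    return PAGE_COMPRESS;
}
```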
Page swapping and compression are limited to machines with DMA hardware that can fault on accesses to unmapped pages. Otherwise, a DMA operation could access invalid data (it must be assumed that all pages of a DD/OS are pinned and available for DMA when treating the DD/OS as a black box).

Periodic tasks like timers can create a non-negligible steady-state runtime overhead. In some cases the requirements on the runtime environment of a DD/OS whose sole purpose is to encapsulate a device driver can be weakened in favor of lower resource consumption. For example, a certain clock drift is acceptable for an idle VM as long as it does not lead to malfunction of the driver itself, allowing us to schedule OS's less frequently or to simply drop their timer ticks.
4.5 Timing
Time multiplexing of multiple VMs can violate timing assumptions made in operating system code. OS's assume linear time and non-interrupted execution. Introducing a virtual time base and slowing down the VM only works if there is no dependence on real time. Hardware devices, however, are not subject to this virtual time base. Violating the timing assumptions of device drivers, such as short delays using busy waiting or bounded response times, can potentially lead to malfunctioning of the device.³

³Busy waiting, which relies on correct calibration at boot time, is particularly problematic when the calibration period exceeds a VM scheduling time slice and thus reports a slower processor. A device driver using busy waiting will then undershoot a device's minimal timing requirements.
We use a scheduling heuristic to avoid preemption within time-critical sections, very similar to our approach to lock-holder preemption avoidance described in [34]. When consecutive operations are time-bound, operating systems usually disable preemption, for example by disabling hardware interrupts. When the VMM scheduler would preempt a virtual processor but interrupts are disabled, we postpone the preemption until interrupts are re-enabled, thereby preserving the timing assumptions of the OS. This requires the VMM to trap the re-enable operation. Hard preemption after a maximum period avoids potential DoS attacks by malicious VMs.
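In outline, the heuristic can be expressed as below. The vcpu fields and helper functions are assumptions for illustration; the actual implementation follows our lock-holder preemption avoidance scheme [34].

```c
#include <stdint.h>

struct vcpu {
    int irqs_disabled;      /* guest has interrupts disabled       */
    int preempt_pending;    /* scheduler wanted to preempt this VM */
    uint64_t hard_deadline; /* absolute time of forced preemption  */
};

extern uint64_t now(void);
extern void do_preempt(struct vcpu *v);

/* Called when the VMM scheduler decides to preempt a virtual CPU. */
void request_preemption(struct vcpu *v, uint64_t max_delay)
{
    if (v->irqs_disabled) {
        /* Guest is in a (presumed) time-critical section: postpone
         * until it re-enables interrupts, but bound the delay to
         * defend against malicious VMs. */
        v->preempt_pending = 1;
        v->hard_deadline = now() + max_delay;
    } else {
        do_preempt(v);
    }
}

/* Called when the VMM traps the guest's interrupt re-enable. */
void on_guest_irq_enable(struct vcpu *v)
{
    v->irqs_disabled = 0;
    if (v->preempt_pending) {
        v->preempt_pending = 0;
        do_preempt(v);
    }
}

/* Periodic check: enforce the bound even if the guest never
 * re-enables interrupts. */
void on_timer_tick(struct vcpu *v)
{
    if (v->preempt_pending && now() >= v->hard_deadline) {
        v->preempt_pending = 0;
        do_preempt(v);
    }
}
```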
4.6 Shared Hardware and Recursion

Device drivers assume exclusive access to the hardware device. In many cases exclusiveness can be guaranteed by partitioning the system and only giving device access to a single DD/OS. Inherently shared resources, such as the PCI bus and PCI configuration space, are incompatible with partitioning and require shared and synchronized access for multiple DD/OS's. Following our reuse approach, we give one DD/OS full access to the shared device; all other DD/OS's use driver stubs to access the shared device. The server part in the controlling DD/OS can then apply a fine-grained partitioning policy. For example, our PCI DD/OS partitions devices based on a configuration file, but makes PCI bridges read-only accessible to all client DD/OS's. To simplify VM device discovery, additional virtual devices can be registered.

In a fully virtualized environment, some device drivers cannot be replaced dynamically. Linux, for example, does not allow substituting the PCI bus driver. In those cases, full hardware emulation is required by the VMM. The number of such devices is quite limited. In the case of Linux the limitations include PCI, the interrupt controller, keyboard, mouse, and real-time clock.
5 Evaluation
We implemented a driver reuse system according to the architecture described in the prior sections, and assessed the architecture's performance, resource, and engineering costs. We evaluated reused drivers for the network, disk, and PCI subsystems. We limit our evaluation to a paravirtualization environment.
To support a comparative performance analysis, we constructed a baseline system and a device driver reuse system that closely resemble each other. They use identical device driver code. They run the same benchmarks, utilizing the same protocol stacks and the same OS infrastructure. They differ in their architectures: the baseline uses its native device driver environment, while our system uses the driver reuse environment and is paravirtualized. The baseline OS is a standard Linux operating system. The device driver reuse system is constructed from a set of paravirtualized Linux OS's configured as DD/OS components and client components. The client OS communicates with the reused device drivers via special kernel modules installed into the client OS.
5.1 Virtualization Environment
The paravirtualization environment is based on the L4 microkernel [25]. L4 serves as a small privileged-mode hypervisor. It offers minimal abstractions and mechanisms to support isolation and communication for the virtual machines. Fewer than 13,000 lines of code run privileged.
The DD/OS and the client OS are provided by two different generations of the Linux kernel: versions 2.4.22 and 2.6.8.1. The 2.4 kernel was ported to the L4 environment in the tradition of the original L4Linux adaptation [19]. In contrast, we used a very lightweight adaptation of the 2.6 kernel to L4, with roughly 3000 additional lines of code (and only 450 lines intrusive). The paravirtualized Linux kernels use L4 mechanisms to receive interrupts, to schedule, to manage application memory, and to handle application system calls and exceptions. The VMM, a user-level L4 task, coordinates resources such as memory, device mappings, and I/O port mappings for the DD/OS instances and the client OS.
All components communicate via L4 mechanisms. These mechanisms include the ability to establish shared pages, to perform high-speed IPC, and to efficiently copy memory between address spaces. The mechanisms are coordinated by object interfaces defined in a high-level IDL, which are converted to optimized inlined assembler by an IDL compiler.
5.2 Translation Modules
For efficient data transfer, the client and DD/OS communicate enough information to support DMA directly from the client's pages via a shared producer-consumer command ring. In a typical sequence, the client adds device commands to the ring and activates the DD/OS via a virtual interrupt, and then the DD/OS services the command. Before performing the device DMA operation, the DD/OS validates the legality of the client's addresses and the client's pinning privileges.
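The command ring can be pictured as follows. The layout and field names are illustrative only; the actual interface is deliberately left to each translation module. vm_raise_irq() is the hypothetical signalling primitive sketched in Section 3.2.

```c
#include <stdint.h>

#define RING_SLOTS 64            /* power of two, fits a shared page */

typedef uint32_t vm_id_t;
typedef uint32_t virq_t;
extern int vm_raise_irq(vm_id_t target, virq_t irq);

struct ring_desc {
    uint64_t client_paddr;       /* DMA target in the client's memory   */
    uint32_t length;
    uint32_t opcode;             /* e.g. read/write, device-specific    */
    uint32_t request_id;         /* lets the DD/OS complete out of order */
    uint32_t status;             /* filled in on completion             */
};

struct cmd_ring {                /* lives in a page shared between VMs  */
    volatile uint32_t produced;  /* written by the client */
    volatile uint32_t consumed;  /* written by the DD/OS  */
    struct ring_desc slot[RING_SLOTS];
};

/* Client side: enqueue a request and signal the DD/OS with a
 * virtual interrupt. */
static int client_submit(struct cmd_ring *r, const struct ring_desc *d,
                         vm_id_t ddos, virq_t irq)
{
    uint32_t p = r->produced;
    if (p - r->consumed == RING_SLOTS)
        return -1;                       /* ring full */
    r->slot[p % RING_SLOTS] = *d;
    __sync_synchronize();                /* publish the descriptor first */
    r->produced = p + 1;
    return vm_raise_irq(ddos, irq);
}
```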
The DD/OS does not generate virtual addresses for the client's pages; Linux device drivers are designed to support DMA operations on pages that are not addressable within the Linux kernel's virtual address space (by default, Linux can only address about 940 MB of memory in its kernel space). The Linux drivers refer to pages indirectly via a page map. To leverage Linux's page map, we configure Linux with knowledge of all physical pages on the machine, but reserved from use (any attempt to access memory outside the DD/OS's VM causes a page permission fault), and then convert client request addresses into page map offsets. In case a driver or subsystem places restrictions on acceptable addresses, it may be necessary to first copy the data.
Disk Interface

The disk interface communicates with Linux's block layer, and is added to the DD/OS as a kernel module. It converts client disk operations into Linux block requests, and injects the block requests into the Linux kernel. Linux invokes the translation layer upon completion of the requests via a callback associated with each request. The block layer additionally supports the ability for the DD/OS to process requests out of order. The client and DD/OS share a set of request IDs to identify the reordered commands.
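A sketch of the injection path against the Linux 2.6-era bio interface follows. The dd_request descriptor and the completion plumbing are simplified stand-ins for our module; the end-I/O callback signature matches that kernel generation, and the client page is located through the page map as described above.

```c
/* Kernel module context, Linux 2.6-era block layer (sketch). */
#include <linux/bio.h>
#include <linux/blkdev.h>

struct dd_request {              /* simplified ring descriptor        */
    struct page *page;           /* client page, found via the page map */
    unsigned int offset, length;
    sector_t sector;
    int write;
    u32 request_id;
};

extern void dd_complete(u32 request_id, int error); /* raises the
                           completion interrupt in the client VM */

/* 2.6.8-era completion callback signature. */
static int dd_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
    if (bio->bi_size)            /* partial completion: wait for rest */
        return 1;
    dd_complete((u32)(unsigned long)bio->bi_private, err);
    bio_put(bio);
    return 0;
}

/* Translate one client request into a Linux block request. */
static int dd_submit(struct block_device *bdev, struct dd_request *rq)
{
    struct bio *bio = bio_alloc(GFP_NOIO, 1);
    if (!bio)
        return -ENOMEM;
    bio->bi_bdev    = bdev;
    bio->bi_sector  = rq->sector;
    bio->bi_end_io  = dd_end_io;
    bio->bi_private = (void *)(unsigned long)rq->request_id;
    bio_add_page(bio, rq->page, rq->length, rq->offset);
    submit_bio(rq->write ? WRITE : READ, bio);
    return 0;
}
```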
Network Interface

The network interface has the additional feature of asynchronous inbound packet delivery. We developed our system to support multiple clients, and thus the DD/OS accepts inbound packets into its own memory for demultiplexing. While outbound packets are transmitted from the client via DMA, inbound packets are securely copied from the DD/OS to the client by the L4 microkernel, thus protecting the client memory from the DD/OS (receiving the packets requires the client's agreement). The L4 kernel creates temporary CPU-local memory mappings, within the L4 kernel space, to achieve an optimized copy.
The translation layer is added to the DD/OS as a device driver module. It represents itself to the DD/OS as a Linux network device, attached to a virtual interconnect. But it does not behave as a standard network device; instead it appends outbound packets directly to the real adapter's kernel packet queue (in the manner of network filters), where they are automatically rate-controlled via the real device driver's feedback to the Linux kernel.
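The outbound half can be approximated as below. For brevity the sketch copies the packet data; the actual module references the client's pages directly so the device DMAs from them. The function name dd_xmit is illustrative, but dev_alloc_skb(), skb_put(), and dev_queue_xmit() are the standard Linux primitives of that era.

```c
/* Kernel module context, Linux 2.6 era (sketch). */
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

/* Outbound: hand a client packet straight to the real adapter's
 * queue, so the real driver's feedback to the kernel rate-controls
 * the client automatically. */
static int dd_xmit(struct net_device *real_dev,
                   void *data, unsigned int len)
{
    struct sk_buff *skb = dev_alloc_skb(len);
    if (!skb)
        return -ENOMEM;
    memcpy(skb_put(skb, len), data, len); /* or map client pages for DMA */
    skb->dev = real_dev;
    return dev_queue_xmit(skb);           /* enqueue on the real device  */
}
```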
To participate directly on the physical network, the translation layer accepts inbound packets using the Linux ISO layer-two bridging module hook. The translation layer queues the packets to the appropriate client OS, and eventually copies them to the client.⁴

⁴An alternative to packet copying, page remapping, has a prohibitively expensive TLB flush penalty on SMPs when maintaining TLB coherence. A future alternative is to use a spare hyperthread to copy the packets. If the network DD/OS has only a single client, then the client can provide the pages backing the inbound packets, avoiding the copy.
PCI Interface

When the PCI driver is isolated, it helps the other DD/OS instances discover their appropriate devices on the bus, and restricts device access to only the appropriate DD/OS instances.
The PCI interface is not performance critical. We forward all client PCI configuration-space read and write requests to the PCI DD/OS. It performs write requests only for authorized clients. For read requests, it provides accurate information to the device's DD/OS, and contrived information to other clients.
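The server side in the PCI DD/OS reduces to an ownership check, sketched below. The authorization lookup and the contrived default values are illustrative assumptions.

```c
#include <stdint.h>

typedef uint32_t vm_id_t;

struct pci_loc { uint8_t bus, dev, fn; };

extern int client_owns_device(vm_id_t client, struct pci_loc loc);
extern uint32_t pci_conf_read(struct pci_loc loc, int off);
extern void pci_conf_write(struct pci_loc loc, int off, uint32_t v);
extern uint32_t contrived_value(struct pci_loc loc, int off);

/* Forwarded configuration-space write: only the DD/OS that owns
 * the device may change its configuration. */
void pci_server_write(vm_id_t client, struct pci_loc loc,
                      int off, uint32_t val)
{
    if (client_owns_device(client, loc))
        pci_conf_write(loc, off, val);
    /* unauthorized writes are silently dropped */
}

/* Forwarded read: accurate data for the owner, contrived data
 * (e.g. "no device present") for everyone else. */
uint32_t pci_server_read(vm_id_t client, struct pci_loc loc, int off)
{
    if (client_owns_device(client, loc))
        return pci_conf_read(loc, off);
    return contrived_value(loc, off);
}
```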
We execute the PCI DD/OS at a lower priority than all other system components. With no timing requirements, it can tolerate severe clock drift.
5.3 Resource Consumption
For memory, we measured the active and steady-state page working set sizes of DD/OS instances, and considered the effect of page sharing and memory compression for all pages allocated to the DD/OS instances. For CPU, we focused on the idle cycle consumption (later sections explore the CPU costs of active workloads).
To avoid unnecessary resource consumption in the DD/OS, we configured the Linux kernel, via its build configuration, to include only the device drivers and functionality essential to handle the devices intended to be used in the benchmarks. The runtime environment of each DD/OS is a tiny ROM image which initializes into a single-user mode with almost no application presence.
Figure 3: 90 ms aggregate samples of Linux 2.6.8.1 DD/OS memory working sets (in KB) when idle and for various disk and network benchmarks (curves: TTCP receive, TTCP send, kernel untar, steady state).
The data was collected while using Linux 2.6.8.1. The numbers are generally similar for Linux 2.4.22.

Working Set

Figure 3 is a plot of the memory page working sets of disk and network DD/OS's, where each sample covers 90 ms of events. The "steady state" curve shows the inherent fixed cost of an idle DD/OS, usually around 144 KB, with a housekeeping spike about every two seconds. The remaining curves provide an idea of working set sizes during activity. The "ttcp receive" and "ttcp send" tests show the working set sizes during intense network activity. The "untar" test shows the working set response to the process of unarchiving a Linux kernel source tree to disk. The worst-case working set size reaches 2200 KB, corresponding to high network activity. Our configuration is susceptible to a large working set for network activity because the DD/OS buffers incoming packets within its own memory. However, due to Linux's reuse of packet buffers, the DD/OS working set size remains bounded.
Memory Compression

To test the possibility of sharing and compressing the pages that back the DD/OS instances, we performed an offline analysis of a snapshot of a particular DD/OS configuration. The tested configuration included three DD/OS instances, one each for PCI, IDE, and the Intel e1000 gigabit NIC. The PCI VM was configured with 12 MB and the others with 20 MB of memory each. We ran the PostMark benchmark, stressing a VM with Linux 2.6 serving files via NFS from the local IDE disk over the network. The active memory working set for all DD/OS's was 2.5 MB.
For systems without an IO-MMU, the memory consumption can only be reduced by cooperative memory ballooning [36]. With the balloon driver in the DD/OS's we can reclaim 33% of the memory.
Figure 4: Memory consumption (in MB) of the disk, network, and PCI DD/OS's: (1) combined consumption with 20 MB, 20 MB, and 12 MB VMs, (2) after memory ballooning, (3) with memory compression, and (4) with memory compression and sharing (bar segments: compressable, compressed duplicates, active, active duplicates, zero, balloon).
Using an IO-MMU that can recover from page faults, we can revoke page access rights and compress memory that is not part of the active working set. Support for read-only page access rights by the IO-MMU furthermore enables sharing of identical pages of the active working set via copy-on-write. We searched for duplicate pages among the three DD/OS instances. Any duplicate page is shareable whether it is in an active working set or not. A page in any DD/OS instance is additionally upgraded to an active page if it has a duplicate in any working set, to avoid having both a compressed and an uncompressed copy. Finally, the IO-MMU enables us to reclaim all zero pages uncooperatively. For the given setup, up to 88% of the allocated memory can be reclaimed, reducing the overall memory footprint of the three concurrent DD/OS's to 6 MB (see Figure 4).
Without an IO-MMU, gray-box knowledge enables DD/OS paging. For example, the memory of Linux's page map is never used for a DMA operation, and is thus pageable. Furthermore, the network and block DD/OS each had a contiguous 6.9 MB identical region in their page maps, suitable for sharing.
CPU Utilization

The steady state of a DD/OS has an inherent CPU utilization cost, influenced not just by internal activities, but also by the number of DD/OS's in the system. We measured the DD/OS CPU utilization response to additional DD/OS instances; the first eight DD/OS's each consume 0.12% of the CPU, the ninth consumes 0.15%, and the tenth consumes 0.23% (see Figure 5).
The DD/OS’s were idle with no device activity.Only the first DD/OS was attached to a device—the PCI bus.The others contained a single device driver (the e1000).The machine was a Pentium 42.8GHz with a 1MB L2cache,which can almost fit the steady-state memory working sets of seven DD/OS instances (at 144KB each,see Figure 3).The L2cache miss rate began to rapidly rise with the eighth DD/OS,leading to an inflection in the CPU utilization curve.
Figure 5: Incremental CPU utilization (in percent) for additional steady-state DD/OS instances, representing the fixed cost of executing a DD/OS (x-axis: number of DD/OS instances).
5.4 Performance
A set of benchmarks allowed us to explore the performance costs of the DD/OS approach to device driver reuse, stressing one driver at a time, and then using network and disk drivers together. The networking benchmarks were selected to help provide a point of comparison with recent literature.
We executed our benchmarks with two device driver reuse scenarios: (1) with all drivers consolidated in a single DD/OS, and (2) with the devices isolated in dedicated DD/OS instances. For a baseline, the benchmarks were also executed within the original, native device driver environment.
The benchmark OS ran Debian Sarge with the Linux 2.6 kernels, constrained to 768 MB. When using the Linux 2.4 kernels, performance numbers were very similar. The hardware used in the test system was a Pentium 4 2.8 GHz processor, with an Intel 82540 gigabit network PCI card, and a desktop SATA disk (Maxtor 6Y120M0).

TTCP

Figure 6 presents the throughput of the TTCP benchmark relative to the native throughput, using two packet sizes. Throughput at the 1500-byte packet size remains within 3% of native, and drops to 8% below native for 500-byte packets. Linux performs the packet sizing within the kernel, rather than within TTCP, via use of Linux's maximum transmission unit (MTU) parameter, avoiding a per-packet address space transition. The CPU utilization relative to native Linux was 1.6x for send, 2.06x for receive with 1500-byte MTU, and 2.22x for receive with 500-byte MTU. As expected, network receive generated a larger CPU load than network send due to extra packet copies. TTCP was configured for a 128 KB socket buffer size.
Figure 6: Normalized TTCP throughput results for native Linux (N), consolidated (C), and isolated (I) DD/OS's. Absolute throughput given in MB/s.
Netperf

The Netperf benchmark confirmed the TTCP MTU-1500 results; throughput with driver reuse remained within 3% of native, with 1.6x CPU utilization for sending, and up to 2.03x CPU utilization for receiving. The native throughput was 98.5 MB/s. A substantial increase in TLB and L2 cache misses led to the higher CPU utilization. These misses are inherent to our test platform; the Pentium 4 flushes TLBs and L1 caches on every context switch between the client and DD/OS. The Netperf benchmark transferred one gigabyte, with a 32 KB send and receive size, and a 256 KB socket buffer size.

Disk

Figure 7 presents the results of our streaming disk benchmark for the isolated DD/OS's (consolidated results are identical). The benchmark highlights the overhead of our solution, as opposed to masking it with random-access disk latency. The benchmark bypasses the client's buffer cache (using a Linux raw device) and file system (by directly accessing the disk partition). We thus avoid timing the behavior of the file system. Native throughput averaged 50.75 MB/s with a standard deviation of 0.46 MB/s. For driver reuse, the throughput was nearly identical and the difference less than half the standard deviation, with CPU utilization ranging from 1.2x to 1.9x native.
Application-Level

We studied application-level performance with the PostMark benchmark, run over NFS. This benchmark emulates the file transaction behavior of an Internet electronic mail server; in our scenario, the file storage is provided by an NFS server machine. The benchmark itself executes on a client machine. The NFS server used our driver reuse framework, and was configured as in the microbenchmarks. The client had a 1.4 GHz Pentium 4, 256 MB of memory, a Debian RAM disk, and an Intel 82540 gigabit Ethernet PCI card, and executed a native Linux 2.6.8.1 kernel.
Figure 7: Throughput and CPU utilization relative to native Linux for disk streaming read and write (x-axis: system call block size in KB for readv() and writev(); curves: read and write throughput, read and write CPU utilization).
Figure 8: CPU utilization over time for the NFS server machine while handling the PostMark benchmark, for the isolated, consolidated, and native configurations.
The performance of the NFS server was nearly identical for all driver scenarios, for native Linux and for driver reuse, with an average runtime of 343.4 seconds. The standard deviation, 2.4%, was over twice the loss in performance for driver reuse. Both the isolated and consolidated driver reuse configurations had higher CPU utilization than native Linux; see Figure 8 for CPU utilization traces of the NFS server machine covering the duration of the benchmark. The benchmark starts with a large CPU spike due to file creation. PostMark was configured for file sizes ranging from 500 bytes to 1 MB, a working set of 1000 files, and 10,000 file transactions.
5.5 IO-MMU
We used a 1.6 GHz AMD Opteron system with an AMD 8111 chipset to evaluate IO-MMU time multiplexing. The chipset's graphics aperture relocation table mechanism relocates up to 2 GB of the 4 GB DMA space at a 4 KB granularity [1]. The chipset only supports read-write and no-access permissions.
Each virtual machine running a DD/OS has a dedicated IO-MMU page table, which is synchronized with the guest-physical-to-host-physical mappings of the VM.
Figure 9: Network bandwidth (in Mb/s) in response to various IO-MMU context scheduling rates (x-axis: IDE IO-MMU scheduling time slice in ms; curves: 8 ms, 4 ms, 2 ms, and 1 ms NIC time slices).
When clients grant the DD/OS access to parts of their memory, appropriate entries are added to the IO-MMU page table as well.
The VMM connects the managed PCI devices of each DD/OS with their respective IO-MMU contexts. Periodically, but independent of the processor scheduler, we switch between the IO contexts. On a context switch, the hypervisor enables and disables bus master access in the PCI configuration space for the respective devices. Our shortest scheduling granularity of 1 ms is limited by the frequency of the periodic timer.
We evaluated the performance and overhead of scheduling IO-MMU contexts, as well as the bounds of the scheduling period for hardware devices. The test system contained two DD/OS's, one driving an Intel e1000 gigabit Ethernet adapter and the other handling the IDE disk controller.
First, and most importantly, we can completely isolate the physical memory covered by the IO-MMU and transparently relocate both VMs. Neither VM is able to perform DMA to memory outside its compartment. We ran the TTCP benchmark and varied the bus allocation for the NIC and disk controller. The network throughput scaled almost linearly with the bus allocation. The NIC started dropping packets when it lost access to the bus for more than 8 ms. Figure 9 shows the achieved network bandwidth for various scheduling configurations.
The IDE controller is less bandwidth-sensitive, since its throughput is bounded by disk latency. However, our scheduling granularity of 1 ms exceeds the timeout for in-progress transactions. When disabling bus master access we therefore postpone IDE deactivation while operations are still in flight. The overhead for IO-MMU context switching was a 1% increase in CPU utilization.
            server   client   common   total
network       1152      770      244    2166
block 2.4      805      659      108    1572
block 2.6      751      546        0    1297
PCI            596      209       52     857
common           0        0      620     620
total         3304     2184     1024    6512

Figure 10: Itemization of source lines of code used to implement our evaluation environment. Common lines are counted once.
5.6 Engineering Effort
We estimate engineering effort in man-hours and in lines of code. The translation modules and client device drivers for the block and network interfaces, along with the user-level VMM, were written by a single student over roughly a two-month period, originally for L4Linux 2.4. This student already had experience with Linux network driver development for a paravirtualized Linux on L4. A second student implemented the PCI support within one week.
The 2.4 network translation module was easily upgraded to serve as the translation module for Linux 2.6, with minor changes. However, the 2.4 block translation module was mostly incompatible with 2.6's internal API (Linux 2.6 introduced a new block subsystem). We thus wrote new block translation and client device drivers for 2.6. We successfully reused the 2.6 block and network drivers with the 2.4 client, and vice versa.
See Figure 10 for an itemization of the lines of code. The figure distinguishes between lines specific to the translation modules added to the server, lines specific to the virtual device drivers added to the client, and additional lines that are common (and are counted once).

The achieved code reuse ratio is 99.9% for NIC drivers in Linux; the translation modules add 0.1% to their code base. When we additionally include all code required for the virtualization (the L4 microkernel, the VMM, and the paravirtualization modifications), we still achieve a reuse ratio of 91% just for Linux's NIC driver base.

The engineering effort enabled us to successfully reuse Linux device drivers with all of our tested lab hardware. The following drivers were tested: Intel gigabit, Intel 100 Mbit, Tulip (with a variety of Tulip-compatible hardware), Broadcom gigabit, pcnet32, ATA and SATA IDE, and a variety of uniprocessor and SMP chipsets for Intel Pentium 3/4 and AMD Opteron processors.
6 Discussion and Future Work
We presented a new approach to reusing unmodified device drivers and enhancing system dependability using virtual machines, but evaluated only a paravirtualized implementation. Paravirtualization is an enhanced machine API that relocates some functionality from the guest OS to the VMM and hypervisor [16]. For example, it permits our DD/OS instances to directly translate their virtual addresses into bus addresses for DMA. It also provides performance benefits [3, 16] compared to use of the real machine API. We have discussed the issues related to device driver pass-through with full virtualization, and consider our paravirtualization implementation to be an approximation. In terms of correctness, the primary difference relates to proper address translation for DMA operations, which becomes irrelevant with hardware device isolation (such as the IO-MMU). In terms of performance, the paravirtualization numbers underestimate the costs of a fully virtualized solution.
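To make the DMA example concrete, a minimal sketch of such a translation follows, assuming the VMM maps the guest's physical memory contiguously and reports the bus address of guest physical address 0 at boot; PAGE_OFFSET is the IA32 Linux kernel's linear mapping base, and guest_phys_to_bus_offset is an invented name.

    #include <stdint.h>

    #define PAGE_OFFSET 0xC0000000UL  /* IA32 Linux linear mapping base */

    /* Bus address of guest physical address 0, supplied by the VMM
     * at boot time (hypothetical interface). */
    static uintptr_t guest_phys_to_bus_offset;

    /* With paravirtualization, virtual-to-bus translation for DMA
     * reduces to constant-offset arithmetic. */
    static inline uintptr_t virt_to_bus_addr(const void *vaddr)
    {
        uintptr_t guest_phys = (uintptr_t)vaddr - PAGE_OFFSET;
        return guest_phys + guest_phys_to_bus_offset;
    }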
Our system currently supports a sufficiently large subset of device classes to be self-hosting in a server environment. We have not addressed the desktop environment, which requires support for the graphics console, USB, FireWire, etc.
Generic driver stubs only provide access to the least common denominator, thereby hiding more advanced hardware features. Our client-server model enables device access at any level in the software hierarchy of the DD/OS, even allowing programming against richer OS APIs such as TWAIN, or enabling vendor-specific features such as DVD burning. Using appropriate software engineering methods, e.g., an IDL compiler, one can quickly generate cross-address-space interfaces that support APIs with rich feature sets.
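As a hypothetical illustration (all names invented), the contrast between the two interface levels might look as follows; an IDL compiler would generate the cross-address-space marshaling behind such prototypes.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t dev_handle_t;

    /* Least common denominator: raw sector access works for any
     * block device, but hides device-specific capabilities. */
    int blk_read(dev_handle_t h, uint64_t sector, void *buf, size_t len);

    /* Feature-rich interface served at a higher level of the DD/OS
     * software stack, e.g., exposing DVD burning. */
    int dvd_burn_session(dev_handle_t h, const void *image,
                         uint64_t len, unsigned speed);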
7 Conclusion
Widely used operating systems support a variety of devices; for example, in Linux 2.4 on IA32, 70% of 1.6 million lines of kernel code implement device support [7]. New operating system endeavors have the choice of either leveraging the existing device drivers or expending effort to replicate the driver base. We present a technique that enables unmodified reuse of the existing driver base, and, most importantly, does so in a manner that promotes independence of the new OS endeavor from the reused drivers.
The driver independence provides an opportunity to improve system dependability. The solution fortifies the reused drivers (to the extent supported by hardware) to promote enhanced reliability and availability (via independent driver restart).
Our method for reusing unmodified drivers and improving system dependability via virtual machines achieves good performance. For networking, where packetized throughput is latency-sensitive, the throughput remains within 3–8% of the native system. The driver isolation requires extra CPU utilization, which can be reduced with hardware acceleration (such as direct DMA for inbound packets).
The DD/OS solution is designed for minimal engineering effort, even supporting reuse of binary drivers. The interface implementation between the new OS and the reused drivers constitutes a trivial amount of code, yet leverages the vast world of legacy drivers. Driver source code, by design, remains unmodified.
References
[1] Advanced Micro Devices, Inc. BIOS and Kernel Developer's Guide for AMD Athlon and AMD Opteron Processors, Apr. 2004.
[2] J. Appavoo, M. Auslander, D. DaSilva, D. Edelsohn, O. Krieger, M. Ostrowski, et al. Utilizing Linux kernel components in K42. Technical report, IBM Watson Research, Aug. 2002.
[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, et al. Xen and the art of virtualization. In Proc. of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, Oct. 2003.
[4] E. Bugnion, S. Devine, and M. Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. In Proc. of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, Oct. 1997.
[5] G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Eighth IEEE Workshop on Hot Topics in Operating Systems, Schloss Elmau, Germany, May 2001.
[6] R. Cervera, T. Cortes, and Y. Becerra. Improving application performance through swap compression. In USENIX Annual Technical Conference, Monterey, CA, June 1999.
[7] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An empirical study of operating system errors. In Proc. of the 18th ACM Symposium on Operating Systems Principles, Banff, Canada, Oct. 2001.
[8] Digital Equipment Corporation. Digital Semiconductor 21172 Core Logic Chipset, Technical Reference Manual, Apr. 1996.
[9] K. Elphinstone and S. Götz. Initial evaluation of a user-level device driver framework. In 9th Asia-Pacific Computer Systems Architecture Conference, Beijing, China, Sept. 2004.
[10] B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, and O. Shivers. The Flux OSKit: A substrate for kernel and language research. In Proc. of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, Oct. 1997.
[11] A. Forin, D. Golub, and B. Bershad. An I/O system for Mach 3.0. In Proc. of the Second USENIX Mach Symposium, Monterey, CA, Nov. 1991.
[12] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson. Reconstructing I/O. Technical Report UCAM-CL-TR-596, University of Cambridge, Computer Laboratory, Aug. 2004.
[13] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson. Safe hardware access with the Xen virtual machine monitor. In 1st Workshop on Operating System and Architectural Support for the On-Demand IT Infrastructure, Boston, MA, Oct. 2004.
[14] A. Gefflaut, T. Jaeger, Y. Park, J. Liedtke, K. Elphinstone, V. Uhlig, et al. The SawMill multiserver approach. In 9th SIGOPS European Workshop, Kolding, Denmark, Sept. 2000.
[15] S. Goel and D. Duchamp. Linux device driver emulation in Mach. In USENIX Annual Technical Conference, San Diego, CA, Jan. 1996.
[16] R. P. Goldberg. Survey of virtual machine research. IEEE Computer Magazine, 7(6), 1974.
[17] D. B. Golub, G. G. Sotomayor, Jr., and F. L. Rawson III. An architecture for device drivers executing as user-level tasks. In Proc. of the USENIX Mach III Symposium, Santa Fe, NM, Apr. 1993.
[18] S. M. Hand. Self-paging in the Nemesis operating system. In Proc. of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, Feb. 1999.
[19] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter. The performance of microkernel-based systems. In Proc. of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, Oct. 1997.
[20] H. Härtig, J. Löser, F. Mehnert, L. Reuther, M. Pohlack, and A. Warg. An I/O architecture for microkernel-based operating systems. Technical Report TUD-FI03-08-Juli-2003, TU Dresden, Dresden, Germany, July 2003.
[21] J. Honeycutt. Microsoft Virtual PC 2004 Technical Overview. Microsoft, Nov. 2003.
[22] HP Technical Computing Division. HP zx1 mio ERS, Rev. 1.0. Hewlett Packard, Mar. 2003.
[23] K. Hui, J. Appavoo, R. Wisniewski, M. Auslander, D. Edelsohn, B. Gamsa, et al. Position summary: Supporting hot-swappable components for system software. In Eighth IEEE Workshop on Hot Topics in Operating Systems, Schloss Elmau, Germany, May 2001.
[24] B. Leslie and G. Heiser. Towards untrusted device drivers. Technical Report UNSW-CSE-TR-0303, School of Computer Science and Engineering, UNSW, Mar. 2003.
[25] J. Liedtke. On µ-kernel construction. In Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain Resort, CO, Dec. 1995.
[26] J. Liedtke, U. Bartling, U. Beyer, D. Heinrichs, R. Ruland, and G. Szalay. Two years of experience with a µ-kernel based OS. ACM SIGOPS Operating Systems Review, 25(2), Apr. 1991.
[27] J. Liedtke, V. Uhlig, K. Elphinstone, T. Jaeger, and Y. Park. How to schedule unlimited memory pinning of untrusted processes, or, provisional ideas about service-neutrality. In 7th Workshop on Hot Topics in Operating Systems, Rio Rico, AZ, Mar. 1999.
[28] K. Van Maren. The Fluke device driver framework. Master's thesis, University of Utah, Dec. 1999.
[29] D. A. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, et al. Recovery-Oriented Computing (ROC): Motivation, definition, techniques, and case studies. Technical Report UCB//CSD-02-1175, U.C. Berkeley Computer Science, Mar. 2002.
[30] PCI Special Interest Group. PCI Local Bus Specification, Rev. 2.1, June 1995.
[31] D. S. Ritchie and G. W. Neufeld. User level IPC and device management in the Raven kernel. In USENIX Microkernels and Other Kernel Architectures Symposium, San Diego, CA, Sept. 1993.
[32] J. Sugerman, G. Venkitachalam, and B.-H. Lim. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In Proc. of the 2001 USENIX Annual Technical Conference, Boston, MA, June 2001.
[33] M. Swift, B. Bershad, and H. Levy. Improving the reliability of commodity operating systems. In Proc. of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, Oct. 2003.
[34] V. Uhlig, J. LeVasseur, E. Skoglund, and U. Dannowski. Towards scalable multiprocessor virtual machines. In Proc. of the 3rd Virtual Machine Research and Technology Symposium, San Jose, CA, May 2004.
[35] VMware. VMware ESX Server I/O Adapter Compatibility Guide, Jan. 2003.
[36] C. Waldspurger. Memory resource management in VMware ESX Server. In Proc. of the 5th Symposium on Operating Systems Design and Implementation, Boston, MA, Dec. 2002.