Memory Resource Management in VMware ESX Server

Review: Disco -- the research prototype
  Virtual memory: VA -> PA -> MA
  Key optimization: sharing
  Exception, interrupt, trap: App -> VMM -> guest OS -> VMM -> ...
  IO, network: fake device drivers

Preview: VMware...what a corporate development budget buys you
(and elegant solutions to two significant problems)

BTW: about the same time this paper came out, a class project for this class
(re-)invented ballooning...you can solve real, significant problems...

Observation: This paper seems simpler than Disco in many respects...a "real
system" forces simplicity; commercial goals force focusing on real problems.

Outline

\section{Overview}

VMware ESX Server runs on bare hardware.
  Other VMware products use the "hosted" approach, where the guest OS
  actually talks to the hardware.
  Main benefit of hosted: device driver support.

Main goal/motivation/constraint: {\em NO} changes to guest OS or applications

Detail: Remember x86 has a hardware-walked page table (v. MIPS, which had a
software-loaded TLB)
  --> VMM maintains shadow page tables that contain the actual VA -> MA mappings
  So the OS has a VA -> PA page table and the VMM has a VA -> MA page table.
  Which one does the hardware see? (The shadow table.)
  Think of the HW page table as a big TLB.

\section{Ballooning}

Goal: allow overcommitting of memory by growing/shrinking the amount of
memory dedicated to each VM

Challenge: OSes don't have a facility for changing the amount of physical
memory at runtime

Could try paging out idle pages, but...
  The VMM's page replacement algorithm can pick a page that is important to
  the guest OS. Causes performance anomalies.
  Double paging problem: if the VMM pages a page out first, a later guest OS
  page-out of that page forces the VMM to fault it back in, just so the
  guest can write it to its own swap.

Could try modifying the guest OS, but...
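The two-level mapping above can be sketched as a composition: the shadow
table the hardware walks is the guest's VA -> PA table composed with the
VMM's PA -> MA pmap. A minimal sketch (toy dictionaries, hypothetical
addresses -- not VMware's actual data structures):

```python
def build_shadow(guest_pt: dict, pmap: dict) -> dict:
    """guest_pt: VA -> PA (maintained by the guest OS)
       pmap:     PA -> MA (maintained by the VMM)
       Returns the VA -> MA shadow table the hardware actually walks."""
    return {va: pmap[pa] for va, pa in guest_pt.items() if pa in pmap}

# Hypothetical example addresses:
guest_pt = {0x1000: 0x5000, 0x2000: 0x6000}   # guest believes these are "physical"
pmap     = {0x5000: 0x9000, 0x6000: 0xA000}   # VMM's real machine pages

shadow = build_shadow(guest_pt, pmap)
assert shadow[0x1000] == 0x9000   # hardware translates VA -> MA in one step
```

The VMM must keep the shadow table consistent: whenever the guest updates its
page table or the VMM remaps PA -> MA, the affected shadow entries are
invalidated and rebuilt.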
Solution: add a "balloon" device driver to the guest OS
(OK, we're modifying the guest OS, but in a standard, well-supported way)

Inflate the balloon to get the OS to free (page out) memory
  VMM tells the device driver to inflate the balloon
  Device driver requests physical pages from the OS using the standard
  (per-OS) internal interface
    --> guest OS and applications will not touch these pages any more...
        the VMM can treat them as free
  Device driver tells the VMM which physical pages are safe to reclaim

QUESTION: the balloon module "communicates with ESX Server via a private
channel". How do you think they pull this off?
\mike{It's a device driver, right? The VMM emulates a hardware device on the
IO bus at some physical range of addresses; the OS detects this "hardware
device" at boot time and installs the appropriate device driver; the
"hardware device" can trigger interrupts to invoke the device driver; the
device driver can do PIO to the device's physical addresses to send messages
to the VMM.}

Deflate the balloon to get the OS to use more memory

Pages allocated to the balloon have their pmap entries marked, and can be
reclaimed by the VMM

Problems
  What if the guest OS reads/writes a ballooned page?
  (It should "never" do this, but we don't want to introduce new
  bugs/security vulnerabilities...and, in fact, they observe this happening
  when the guest OS crashes...)
    --> "pop the balloon"
    (0) A ballooned page's machine memory is not mapped for the guest OS
        --> the VMM can detect a reference to a ballooned page (just as an
        application page fault looks to an OS)
    (1) Allocate and map a new zeroed page for the faulting access
    (2) The next interaction with the guest driver resets the balloon state
        (return pages to the OS, then start ballooning from scratch)

  What if an obscure guest OS doesn't support the VMware balloon driver?
  Or the guest OS refuses a request to allocate a physical page to the
  balloon? Or the guest OS limits the balloon's size?
    Can always resort to paging.
    Use a randomized replacement algorithm to avoid pathologically bad cases
    of paging out exactly what the guest OS needs.
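The inflate/deflate protocol above can be sketched with a toy guest
allocator. Everything here is hypothetical (ToyGuest, the free list, the
method names) -- it is not the real ESX/driver interface, just the shape of
the mechanism:

```python
class ToyGuest:
    """Stand-in for a guest OS with a free-page list of guest PPNs."""
    def __init__(self, n_pages: int):
        self.free = list(range(n_pages))

    def alloc_page(self):
        """Standard per-OS internal allocation interface."""
        return self.free.pop() if self.free else None


class BalloonDriver:
    def __init__(self, guest: ToyGuest):
        self.guest = guest
        self.pinned = []              # PPNs held (pinned) by the balloon

    def inflate(self, n: int) -> list:
        """Pin up to n guest pages; report their PPNs to the VMM as
        safe to reclaim (the 'private channel' message)."""
        reclaimable = []
        for _ in range(n):
            ppn = self.guest.alloc_page()
            if ppn is None:           # guest refused: VMM falls back to paging
                break
            self.pinned.append(ppn)
            reclaimable.append(ppn)
        return reclaimable

    def deflate(self, n: int):
        """Return up to n pages to the guest OS's free list."""
        for _ in range(min(n, len(self.pinned))):
            self.guest.free.append(self.pinned.pop())
```

The key property: because the balloon pins the pages inside the guest, the
guest's own replacement policy decides what to page out to make room, so the
VMM never has to guess which pages matter.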
  Might not be able to reclaim memory fast enough (VMware rate-limits
  balloon inflation)
    Can always resort to paging.

\section{Content-based sharing}

Disco's sharing mechanism required modifications to the OS (change bcopy,
change mbufs, new network device, ...)
Sharing by content avoids these changes.
Sharing by content is a trick used in other systems: e.g., Venti (which
tracks version histories) for file systems, and peer-to-peer systems.

Mechanism:
  Hash every page. Store hashes in a hash table.
  On a hash match, compare the pages byte-for-byte. If identical, share
  copy-on-write.
  With no match, store the hash as a hint. On a future match, check whether
  the hint is still valid (page contents have not changed). If it is, share
  the page.
  Share page = mark COW in all pmaps...

How to find matches? Scan random pages in the background.
(Minimal overhead, even for CPU-intensive SPEC.)

Evaluation
  Good mix of synthetic and real-world data
  Synthetic: "easy case" -- six identical VMs running SPEC95
    Reclaim ~2/3 of memory by mapping to shared pages
    No slowdown (small CPU overhead compensated by better cache locality in
    the physically addressed cache)
  Real world: 3 production deployments -- 7.2%-32% of memory reclaimed by
  sharing
    Is this a time average? A snapshot at a random moment in time?

Limitations
  QUESTION: Limitations? Did scanning find most of the available
  opportunities for sharing? How many CPU cycles should I devote to scanning
  to identify, say, 90% of sharing opportunities?
  "[Content based sharing] exploits many opportunities for sharing missed by
  both Disco and the standard copy-on-write techniques used in conventional
  operating systems"
  Are you convinced that content-based sharing is strictly better than
  Disco's approach (assuming that modifying the OS is not a constraint)?
    (1) Is there any evidence to support the claim that they find more
    opportunities for sharing than Disco? (Is the claim supported by data?)
    (2) Are there any ways in which Disco's approach is better?
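The hash/hint/compare mechanism above can be sketched as follows. This is a
simplification under stated assumptions: pages are modeled as byte strings
keyed by PPN, the hash table and `scan` function are hypothetical names, and
"share" just records a COW pair rather than touching real pmaps:

```python
import hashlib

def scan(pages: dict, table: dict) -> set:
    """pages: ppn -> page contents (bytes).
       table: hash -> ('hint' | 'shared', ppn).
       Returns the set of (ppn_a, ppn_b) pairs shared copy-on-write."""
    shared = set()
    for ppn, data in pages.items():
        h = hashlib.sha1(data).digest()
        entry = table.get(h)
        if entry is None:
            table[h] = ('hint', ppn)          # first sighting: just a hint
            continue
        _, other = entry
        # Hash match: verify the hint is still valid with a full compare
        # (the other page may have changed since it was hashed).
        if other != ppn and pages[other] == data:
            table[h] = ('shared', other)
            shared.add((min(ppn, other), max(ppn, other)))  # mark COW in all pmaps
    return shared

# Toy example: two identical pages and one distinct page.
pages = {0: b'a' * 4096, 1: b'a' * 4096, 2: b'b' * 4096}
assert scan(pages, {}) == {(0, 1)}
```

The full byte compare on a hash match is what lets the table store stale
hints cheaply: a false match costs one comparison, never an incorrect share.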
\section{Allocation policy}

Proportional share -- assign shares (tickets) to each VM and allocate memory
proportional to shares
  Constraint: min/max per VM
  Min-funding revocation: if you need a page, take it from the VM with the
  smallest shares/allocation ratio (i.e., take from the VM that is "paying
  the least" for its memory)

But pure proportional share doesn't account for idle pages
  "In general, the goals of performance isolation and efficient memory
  utilization often conflict."

Heuristic: idle memory tax
  Estimate the fraction of each VM's pages that are active
    Statistical sampling
    Exponentially weighted averages over time; biased to react rapidly to
    increased demand but slowly to increased idleness
  Adjusted shares-per-page ratio: rho = S / (P * (f + k(1-f)))
    S = shares (tickets)
    P = allocated pages
    f = fraction of pages active
    k = 1/(1 - taxRate)
    --> taxRate = 0 --> k = 1 --> pure proportional share
        taxRate -> 1 --> k -> inf --> idle pages reclaimed first,
        regardless of S
    Default taxRate = 75%

Evaluation
  Synthetic workloads:
    (1) sampling accurately estimates the idle fraction (fig 6)
    (2) the tax allows the system to move memory from idle VMs to active
    VMs (fig 7)
  Is a 75% tax rate "right"? How would you set it for a real workload?
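The tax formula plus min-funding revocation can be sketched numerically. The
VM records and function names here are hypothetical; the arithmetic follows
the shares-per-page ratio defined above:

```python
def rho(S: float, P: float, f: float, tax_rate: float) -> float:
    """Adjusted shares-per-page ratio.
       S = shares, P = allocated pages, f = fraction of pages active,
       tax_rate in [0, 1): idle pages 'cost' k times more than active ones."""
    k = 1.0 / (1.0 - tax_rate)
    return S / (P * (f + k * (1.0 - f)))

def victim(vms: list, tax_rate: float = 0.75) -> dict:
    """Min-funding revocation: reclaim from the VM paying the least
       per page, i.e., the one with the smallest adjusted ratio."""
    return min(vms, key=lambda vm: rho(vm['S'], vm['P'], vm['f'], tax_rate))

# Two equal-share, equal-size VMs; one is mostly idle.
vms = [
    {'name': 'busy', 'S': 100, 'P': 100, 'f': 1.0},
    {'name': 'idle', 'S': 100, 'P': 100, 'f': 0.2},
]
# With the default 75% tax (k = 4), the idle VM's ratio drops
# (100 / (100 * 3.4) ~= 0.29 vs 1.0), so it loses pages first.
assert victim(vms)['name'] == 'idle'
# With tax_rate = 0, k = 1 and rho = S/P for both: pure proportional share.
assert rho(100, 100, 0.2, 0.0) == rho(100, 100, 1.0, 0.0)
```

This shows why the tax resolves the isolation-vs-utilization conflict: shares
still matter (S stays in the numerator), but idleness discounts them, so an
idle high-share VM yields memory before an active low-share one.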