Memory Resource Management in VMware ESX Server

Review: Disco -- the research prototype
  Virtual memory: VA -> PA -> MA
  Key optimization: sharing
  Exception, interrupt, trap: App -> VMM -> guest OS -> VMM -> ...
  IO, network: fake device drivers

Preview: VMware...what a corporate development budget buys you
(and elegant solutions to two significant problems)

BTW: about the same time this paper came out, a class project for this class
(re-)invented ballooning...you can solve real, significant problems...

Observation: This paper seems simpler than Disco in many respects...a "real
system" forces simplicity; commercial goals force focusing on real problems.

Outline

\section{Overview}

VMware ESX Server runs on bare hardware.
  Other VMware products use the "hosted" approach, where the guest OS
  actually talks to the hardware.
  Main benefit of hosted: device driver support.

Main goal/motivation/constraint: {\em NO} changes to guest OS or applications

Detail: Remember x86 has a hardware-walked page table (v. MIPS, which had a
software-loaded TLB)
  --> VMM maintains shadow page tables that contain the actual VA -> MA mappings
  So the OS has a VA -> PA page table and the VMM has a VA -> MA page table.
  Which one does the hardware see? (The shadow table.)
  Think of the HW page table as a big TLB.

\section{Ballooning}

Goal: allow overcommitting of memory by growing/shrinking the amount of
memory dedicated to each VM

Challenge: OSes don't have a facility for changing the amount of physical
memory at runtime

Could try paging out idle pages, but...
  The VMM's page replacement algorithm can pick a page that is important to
  the guest OS. Causes performance anomalies.
  Double paging problem: if the VMM pages a page out first, a later guest OS
  page-out of that page forces the VMM to fault it back in, just so the
  guest can write it to its own swap.

Could try modifying the guest OS, but...
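The two-level mapping above can be sketched as a composition: the shadow
table the hardware walks is the guest's VA -> PA table composed with the
VMM's PA -> MA pmap. A minimal sketch (toy dictionaries, hypothetical
addresses -- not VMware's actual data structures):

```python
def build_shadow(guest_pt: dict, pmap: dict) -> dict:
    """guest_pt: VA -> PA (maintained by the guest OS)
       pmap:     PA -> MA (maintained by the VMM)
       Returns the VA -> MA shadow table the hardware actually walks."""
    return {va: pmap[pa] for va, pa in guest_pt.items() if pa in pmap}

# Hypothetical example addresses:
guest_pt = {0x1000: 0x5000, 0x2000: 0x6000}   # guest believes these are "physical"
pmap     = {0x5000: 0x9000, 0x6000: 0xA000}   # VMM's real machine pages

shadow = build_shadow(guest_pt, pmap)
assert shadow[0x1000] == 0x9000   # hardware translates VA -> MA in one step
```

The VMM must keep the shadow table consistent: whenever the guest updates its
page table or the VMM remaps PA -> MA, the affected shadow entries are
invalidated and rebuilt.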
Solution: add a "balloon" device driver to the guest OS
(OK, we're modifying the guest OS, but in a standard, well-supported way)

Inflate the balloon to get the OS to free (page out) memory
  VMM tells the device driver to inflate the balloon
  Device driver requests physical pages from the OS using the standard
  (per-OS) internal interface
    --> guest OS and applications will not touch these pages any more...
        the VMM can treat them as free
  Device driver tells the VMM which physical pages are safe to reclaim

QUESTION: the balloon module "communicates with ESX Server via a private
channel". How do you think they pull this off?
\mike{It's a device driver, right? The VMM emulates a hardware device on the
IO bus at some physical range of addresses; the OS detects this "hardware
device" at boot time and installs the appropriate device driver; the
"hardware device" can trigger interrupts to invoke the device driver; the
device driver can do PIO to the device's physical addresses to send messages
to the VMM.}

Deflate the balloon to get the OS to use more memory

Pages allocated to the balloon have their pmap entries marked, and can be
reclaimed by the VMM

Problems
  What if the guest OS reads/writes a ballooned page?
  (It should "never" do this, but we don't want to introduce new
  bugs/security vulnerabilities...and, in fact, they observe this happening
  when the guest OS crashes...)
    --> "pop the balloon"
    (0) A ballooned page's machine memory is not mapped for the guest OS
        --> the VMM can detect a reference to a ballooned page (just as an
        application page fault looks to an OS)
    (1) Allocate and map a new zeroed page for the faulting access
    (2) The next interaction with the guest driver resets the balloon state
        (return pages to the OS, then start ballooning from scratch)

  What if an obscure guest OS doesn't support the VMware balloon driver?
  Or the guest OS refuses a request to allocate a physical page to the
  balloon? Or the guest OS limits the balloon's size?
    Can always resort to paging.
    Use a randomized replacement algorithm to avoid pathologically bad cases
    of paging out exactly what the guest OS needs.
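The inflate/deflate protocol above can be sketched with a toy guest
allocator. Everything here is hypothetical (ToyGuest, the free list, the
method names) -- it is not the real ESX/driver interface, just the shape of
the mechanism:

```python
class ToyGuest:
    """Stand-in for a guest OS with a free-page list of guest PPNs."""
    def __init__(self, n_pages: int):
        self.free = list(range(n_pages))

    def alloc_page(self):
        """Standard per-OS internal allocation interface."""
        return self.free.pop() if self.free else None


class BalloonDriver:
    def __init__(self, guest: ToyGuest):
        self.guest = guest
        self.pinned = []              # PPNs held (pinned) by the balloon

    def inflate(self, n: int) -> list:
        """Pin up to n guest pages; report their PPNs to the VMM as
        safe to reclaim (the 'private channel' message)."""
        reclaimable = []
        for _ in range(n):
            ppn = self.guest.alloc_page()
            if ppn is None:           # guest refused: VMM falls back to paging
                break
            self.pinned.append(ppn)
            reclaimable.append(ppn)
        return reclaimable

    def deflate(self, n: int):
        """Return up to n pages to the guest OS's free list."""
        for _ in range(min(n, len(self.pinned))):
            self.guest.free.append(self.pinned.pop())
```

The key property: because the balloon pins the pages inside the guest, the
guest's own replacement policy decides what to page out to make room, so the
VMM never has to guess which pages matter.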
  Might not be able to reclaim memory fast enough (VMware rate-limits
  balloon inflation)
    Can always resort to paging.

\section{Content-based sharing}

Disco's sharing mechanism required modifications to the OS (change bcopy,
change mbufs, new network device, ...)
Sharing by content avoids these changes.
Sharing by content is a trick used in other systems: e.g., Venti (which
tracks version histories) for file systems, and peer-to-peer systems.

Mechanism:
  Hash every page. Store hashes in a hash table.
  On a hash match, compare the pages byte-for-byte. If identical, share
  copy-on-write.
  With no match, store the hash as a hint. On a future match, check whether
  the hint is still valid (page contents have not changed). If it is, share
  the page.
  Share page = mark COW in all pmaps...

How to find matches? Scan random pages in the background.
(Minimal overhead, even for CPU-intensive SPEC.)

Evaluation
  Good mix of synthetic and real-world data
  Synthetic: "easy case" -- six identical VMs running SPEC95
    Reclaim ~2/3 of memory by mapping to shared pages
    No slowdown (small CPU overhead compensated by better cache locality in
    the physically addressed cache)
  Real world: 3 production deployments -- 7.2%-32% of memory reclaimed by
  sharing
    Is this a time average? A snapshot at a random moment in time?

Limitations
  QUESTION: Limitations? Did scanning find most of the available
  opportunities for sharing? How many CPU cycles should I devote to scanning
  to identify, say, 90% of sharing opportunities?
  "[Content based sharing] exploits many opportunities for sharing missed by
  both Disco and the standard copy-on-write techniques used in conventional
  operating systems"
  Are you convinced that content-based sharing is strictly better than
  Disco's approach (assuming that modifying the OS is not a constraint)?
    (1) Is there any evidence to support the claim that they find more
    opportunities for sharing than Disco? (Is the claim supported by data?)
    (2) Are there any ways in which Disco's approach is better?
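The hash/hint/compare mechanism above can be sketched as follows. This is a
simplification under stated assumptions: pages are modeled as byte strings
keyed by PPN, the hash table and `scan` function are hypothetical names, and
"share" just records a COW pair rather than touching real pmaps:

```python
import hashlib

def scan(pages: dict, table: dict) -> set:
    """pages: ppn -> page contents (bytes).
       table: hash -> ('hint' | 'shared', ppn).
       Returns the set of (ppn_a, ppn_b) pairs shared copy-on-write."""
    shared = set()
    for ppn, data in pages.items():
        h = hashlib.sha1(data).digest()
        entry = table.get(h)
        if entry is None:
            table[h] = ('hint', ppn)          # first sighting: just a hint
            continue
        _, other = entry
        # Hash match: verify the hint is still valid with a full compare
        # (the other page may have changed since it was hashed).
        if other != ppn and pages[other] == data:
            table[h] = ('shared', other)
            shared.add((min(ppn, other), max(ppn, other)))  # mark COW in all pmaps
    return shared

# Toy example: two identical pages and one distinct page.
pages = {0: b'a' * 4096, 1: b'a' * 4096, 2: b'b' * 4096}
assert scan(pages, {}) == {(0, 1)}
```

The full byte compare on a hash match is what lets the table store stale
hints cheaply: a false match costs one comparison, never an incorrect share.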
\section{Allocation policy}

Proportional share -- assign shares (tickets) to each VM and allocate memory
proportional to shares
  Constraint: min/max per VM
  Min-funding revocation: if you need a page, take it from the VM with the
  smallest shares/allocation ratio (i.e., take from the VM that is "paying
  the least" for its memory)

But pure proportional share doesn't account for idle pages
  "In general, the goals of performance isolation and efficient memory
  utilization often conflict."

Heuristic: idle memory tax
  Estimate the fraction of each VM's pages that are active
    Statistical sampling
    Exponentially weighted averages over time; biased to react rapidly to
    increased demand but slowly to increased idleness
  Adjusted shares-per-page ratio: rho = S / (P * (f + k(1-f)))
    S = shares (tickets)
    P = allocated pages
    f = fraction of pages active
    k = 1/(1 - taxRate)
    --> taxRate = 0 --> k = 1 --> pure proportional share
        taxRate -> 1 --> k -> inf --> idle pages reclaimed first,
        regardless of S
    Default taxRate = 75%

Evaluation
  Synthetic workloads:
    (1) sampling accurately estimates the idle fraction (fig 6)
    (2) the tax allows the system to move memory from idle VMs to active
    VMs (fig 7)
  Is a 75% tax rate "right"? How would you set it for a real workload?
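The tax formula plus min-funding revocation can be sketched numerically. The
VM records and function names here are hypothetical; the arithmetic follows
the shares-per-page ratio defined above:

```python
def rho(S: float, P: float, f: float, tax_rate: float) -> float:
    """Adjusted shares-per-page ratio.
       S = shares, P = allocated pages, f = fraction of pages active,
       tax_rate in [0, 1): idle pages 'cost' k times more than active ones."""
    k = 1.0 / (1.0 - tax_rate)
    return S / (P * (f + k * (1.0 - f)))

def victim(vms: list, tax_rate: float = 0.75) -> dict:
    """Min-funding revocation: reclaim from the VM paying the least
       per page, i.e., the one with the smallest adjusted ratio."""
    return min(vms, key=lambda vm: rho(vm['S'], vm['P'], vm['f'], tax_rate))

# Two equal-share, equal-size VMs; one is mostly idle.
vms = [
    {'name': 'busy', 'S': 100, 'P': 100, 'f': 1.0},
    {'name': 'idle', 'S': 100, 'P': 100, 'f': 0.2},
]
# With the default 75% tax (k = 4), the idle VM's ratio drops
# (100 / (100 * 3.4) ~= 0.29 vs 1.0), so it loses pages first.
assert victim(vms)['name'] == 'idle'
# With tax_rate = 0, k = 1 and rho = S/P for both: pure proportional share.
assert rho(100, 100, 0.2, 0.0) == rho(100, 100, 1.0, 0.0)
```

This shows why the tax resolves the isolation-vs-utilization conflict: shares
still matter (S stays in the numerator), but idleness discounts them, so an
idle high-share VM yields memory before an active low-share one.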