Outline
  GFS architecture
  Issues
    2PC
    performance/scalability
    consistency

Note -- mostly I'll talk about the Google File System -- occasionally, I'll
fill in details/tweak things to describe HDFS (I'll try to point out where
I'm mixing, but be warned...)

Cloud background -- warehouse-scale systems
  10K-100K nodes
  50MW
  power-efficient (locate near cheap power; passive cooling, etc.; PUE 1.2 or better)
  highly uniform; commodity parts
  commodity parts and open-source/custom software
    --> cheap per-node costs
    --> need to worry about failures
  --> James Hamilton: resources (CPU, storage, network, power) for a
      10K-100K-node data center are 3-5x cheaper than for a 100-1K-node
      data center

Google File System

Goals
  o 1000's of disks
  o 100's-1000's of clients
  o diverse applications
    read-mostly workload; lots of data mining
  o failures are common

Tech trends
  o seeks = bad
  o lots of machines
  o each machine has several disks
  o disk BW ~= NW BW (100 MB/s v. 1 Gbit/s)
    1 disk's BW can fill the network

Workload
  - multi-GB files are common -- each file contains many smaller
    application objects (e.g., web documents)
  - most updates are appends
  - reads -- large streaming or small random
  - weakened consistency OK -- but optimize consistency for append
    -- handle hundreds of producers simultaneously appending to a file
  - BW important; latency less so

Architecture
  - 1 master
    -- stores all metadata
       -- namespace
       -- ACLs
       -- file->chunk mapping
       -- chunk locations
       -- chunk leases
       -- garbage collection of chunks
  - many chunkservers
    -- store [large] chunks (64MB)
       --- large chunks reduce interactions w/ master
       --- large chunks --> master can cache large mappings in memory
       --- large chunks --> clients tend to contact a small # of
           chunkservers --> reduced TCP overheads
       --- DA: limits load balancing for hot-spot files
    -- redundant storage (3x default)
  - clients
    -- no cache
       -- don't need one for performance
       -- cache coherence adds complexity

Read
  client ------ file name, offset -------> master
         <----- chunk ID, chunkserver ----
         \----- chunk ID, byte range ----> chunkserver
         <-------------------------------- (data)

Performance optimization: client can cache mappings from the master
Performance optimization: on a parallel read by many clients (MapReduce),
  can first learn where the data is, then have each client read and
  process local data

Issue: How to do writes/create a new chunk? (Case study: 2PC)

Create a new chunk:
  Need to atomically update
    3 chunkservers: store [data] and tell me the chunkID
    master: store mapping [fileID, offset] --> [chunkID, chunkserver]*
  How would you do this?

Simple answer 1: basic 2-phase commit (client acts as 2PC coordinator)
  (1) client -- data ----------------------> chunkservers
             <- chunkID, VOTE_COMMIT ------
             -- fileID, offset, chunkID ---> master
             <- VOTE_COMMIT ---------------
  (2) client -- GLOBAL_COMMIT -------------> chunkservers, master

  PROBLEM: 2PC blocks forever if the coordinator dies at just the wrong
  moment -- and here the coordinator is the client, which could die and
  never recover
  -> exactly the wrong property for 2PC

Simple answer 2: basic 2-phase commit (master acts as 2PC coordinator)
  client -- fid, off, data ---> master -- data ----------------> chunkservers
                                       <- blockID, VOTE_COMMIT -
                                       -- GLOBAL_COMMIT -------->
  Does this work?
  o For now, assume the master is highly reliable.
  o What if we try to write to chunkservers 51, 47, and 99 but chunkserver
    47 is down?
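The two "simple answers" above can be sketched as a toy, single-process 2PC; every name here (Participant, two_phase_commit, the coordinator_alive flag) is invented for illustration, not GFS/HDFS code:

```python
# Toy single-process model of the 2PC flows above. All names invented.
VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"

class Participant:
    """A chunkserver or the master, acting as a 2PC participant."""
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.staged, self.committed = None, None

    def prepare(self, data):
        if not self.up:
            return VOTE_ABORT      # models a timeout on a down server
        self.staged = data         # must hold data until commit/abort
        return VOTE_COMMIT

    def global_commit(self):
        self.committed, self.staged = self.staged, None

    def global_abort(self):
        self.staged = None

def two_phase_commit(coordinator_alive, participants, data):
    # Phase 1: gather votes from every participant.
    votes = [p.prepare(data) for p in participants]
    if any(v == VOTE_ABORT for v in votes):
        for p in participants:
            p.global_abort()
        return "ABORTED"
    # The dangerous window: everyone has voted COMMIT and is holding
    # state. If the coordinator dies here, no participant can safely
    # decide commit or abort on its own -- 2PC blocks.
    if not coordinator_alive:
        return "BLOCKED"
    # Phase 2: announce the decision.
    for p in participants:
        p.global_commit()
    return "COMMITTED"
```

The `"BLOCKED"` return models the window where every participant has voted COMMIT and must hold its state until it hears a decision; with the client as coordinator, a dead client leaves everyone stuck there.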
    --> timeout, abort, retry with different chunkservers
  PROBLEM: sending data through the master limits scalability

Better answer (HDFS; probably GFS)
  separate the data path from the metadata path

  (1) Send data to the chunkservers; chunkservers send block IDs to the
      master (chain communication for efficiency)

      client -- data --> CS1
      CS1 -- CS1/CID1, data ------------------------> CS2
      CS2 -- CS2/CID2, CS1/CID1, data --------------> CS3
      CS3 -- CS3/CID3, CS2/CID2, CS1/CID1, hash ----> master, client

      now the master knows who has this data and what the chunk IDs are

  (2) Send objID, hash(data) to the master
      client -- fid, offset, hash --> master
             <-- OK -----------------

  [(3) If the master receives [BID*, hash] from the chunkservers but then
       no binding of [oid, hash] from the client: TIMEOUT; send
       "DELETE [hash]" to the chunkservers.
       If the master receives the client request but not the CS/CID list:
       client timeout; retry the data send to the chunkservers.]

  Where is 2PC's VOTE_COMMIT/GLOBAL_COMMIT?
  [[Step 1 is "VOTE_COMMIT" by everyone except the master; step 2 is
    "VOTE_COMMIT" by the master + GLOBAL_COMMIT by the master.
    Notice that no "GLOBAL_COMMIT" is sent to the chunkservers. By 2PC,
    once they say "VOTE_COMMIT" they need to keep the data unless they are
    told ABORT --> assume "COMMIT" unless you hear "DELETE".

    NOTE: This works because all reads go through the master and because
    GFS/HDFS are (were originally) write-once file systems. Once a chunk
    is written, it is never changed. (Append is allowed.)

    What could happen if you allowed writes to change bytes within a
    chunk? (strange consistency -- without a global commit, chunkservers
    don't know when to stop serving old data and start serving new
    data...)]]

  --> Always safe and live (as long as the master is reliable and
      available)
      safe -- if the client gets an ack, then fileID, offset, data are
              stored
      live -- eventually...

Issue: Master reliability
  What if the master fails?
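Before getting to that, the chained data path in step (1) above can be sketched as a toy in-process model (all class and method names are invented; real HDFS pipelines use streaming RPC, not method calls):

```python
# Toy model of the chained data path: each chunkserver stores the data,
# appends its own (server, chunk-id) pair, and forwards down the chain;
# the last one reports the full list plus a hash to the master. The
# client then binds (fid, offset) to that hash. All names invented.
import hashlib

class Master:
    def __init__(self):
        self.pending = {}   # hash -> [(server, chunk_id), ...]
        self.files = {}     # (fid, offset) -> [(server, chunk_id), ...]

    def report_chain(self, locations, digest):
        # End of step (1): master learns who holds the data.
        self.pending[digest] = locations

    def bind(self, fid, offset, digest):
        # Step (2): client binds (fid, offset) to the stored chunks.
        self.files[(fid, offset)] = self.pending.pop(digest)
        return "OK"

class Chunkserver:
    def __init__(self, name):
        self.name, self.chunks, self.next_id = name, {}, 0

    def store_and_forward(self, data, chain, master, locations=()):
        cid = f"{self.name}/c{self.next_id}"
        self.next_id += 1
        self.chunks[cid] = data
        locations = (*locations, (self.name, cid))
        if chain:   # more servers downstream: forward data + list
            return chain[0].store_and_forward(data, chain[1:], master,
                                              locations)
        digest = hashlib.sha256(data).hexdigest()
        master.report_chain(list(locations), digest)
        return digest

def write(data, fid, offset, servers, master):
    digest = servers[0].store_and_forward(data, servers[1:], master)
    return master.bind(fid, offset, digest)
```

Note that there is no explicit GLOBAL_COMMIT to the chunkservers here, matching the protocol above: once a chunkserver has stored the data, it keeps it unless the master later sends a DELETE.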
Make the master redundant
  simple idea -- before the master "ack"s the client, send the data to 2
  or 3 other machines that hold a redundant log

  but what if the master crashes while doing this?
    suppose the master updates replica 1 but not replica 2 and then
    crashes
    if we fail over to replica 2: "no data"
    if we fail over to replica 1: "data"
    what if we first fail over to replica 2? it cannot contact replica 1,
    so we operate for a while with "no data"; then replica 2 fails and we
    go back to replica 1 or 0? ugh.

  Illustrates the problem with 2PC -- if the coordinator (here, the
  master) dies at just the wrong moment, you're doomed. (You can try to
  add hacks for the specific scenario I described, but there will *always*
  be corner cases. 2PC sacrifices safety or liveness in corner cases.)

  Solution (UpRight HDFS)
    master is replicated with "3PC" (Paxos, PBFT, UpRight)
    --> 2f+1 masters to tolerate f crash failures
    -- always safe
    -- live during periods when the network is well behaved and enough
       machines are up

Issue: Scalability
  - Chunk locations -- master keeps them in memory as soft state

  (1) Won't that limit scalability (memory size)?
      People often get this type of intuition wrong...
      64 bytes of metadata per 64MB chunk on disk
      --> 1GB (10^9) of memory can index 1PB (10^15) of disk

      Even better -- look at costs... as long as a byte of memory costs
      much less than 1MB of disk, this choice doesn't increase the entire
      system cost much
      --> 1GB of DRAM: $30 (2007)
          1PB of disk: $100K (2007)

      Still, a 2008 talk said they are hitting the limits of scalability
      -- the problem seems to be that the machines they use run out of
      slots in which to put memory.
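The 1GB-indexes-1PB claim is just the 64-bytes-per-64MB ratio; a quick sanity check in round numbers:

```python
# Quick check of the metadata-scaling arithmetic above (round numbers).
chunk_size = 64 * 10**6     # 64 MB of disk per chunk
meta_size = 64              # ~64 bytes of master metadata per chunk
ratio = chunk_size // meta_size
# 1 byte of master memory indexes ~1 MB of disk...
assert ratio == 10**6
# ...so 1 GB (10^9 bytes) of memory indexes ~1 PB (10^15 bytes) of disk.
assert 10**9 * ratio == 10**15
```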
      let's see where this might come from
        2TB per data server (2 1TB SATA drives per machine)
        --> 1PB per 500 machines
        --> 1GB of master memory per 500 machines
        --> an 8GB server can index about 4000 machines
      Plausible that it is starting to constrict a few deployments, but
      maybe not quite a critical problem yet.

      Technology trends -- if disk cost falls by 100% per year and memory
      cost falls by 60% per year, the gap is 40% per year
      --> halve the number of machines a single server can support every
          2 years
      --> plausible that they need to solve this problem sometime soon...

  (2) Won't this limit scalability (performance)?
      How many disks can 1 master support?
      How many IOs per disk?
        64MB per IO / 100 MB/s per disk ~= 2 IOs/sec per disk
      1K disks --> 2K IOs/second: 1 IO per 500us --> EASY
      10K disks --> 20K IOs/sec: 1 IO per 50us --> OK

  (3) Isn't this [soft state] icky?
      "Realize that a chunkserver has the final word over what chunks it
      does and does not have on its own disks. There is no point in trying
      to maintain a consistent view of this information on the master
      because errors on a chunkserver may cause chunks to vanish
      spontaneously... or an operator may rename a chunkserver."

  (4) What about startup time?
      Not addressed in the paper. Would want to look at this (could be
      similar to the "recovery storm" in Sprite)... but it's not clear
      that the overhead of reading this info from a chunkserver is that
      much higher than reading this info from disk...

      Back of the envelope: a 1TB disk has ~20K chunks, so ~1MB of chunk
      metadata to send per disk; if the master needs to hear about the
      index for 1PB, it must receive 1GB. This is at least 8 seconds on
      gigabit ethernet. Need to implement this carefully to keep recovery
      time under a minute. Could easily imagine a poor implementation
      taking 10 minutes...
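The startup-time estimate above checks out in round numbers (assuming 64-byte metadata entries and a full 1 Gbit/s of receive bandwidth at the master):

```python
# Back-of-the-envelope check of the startup-time estimate above.
chunks_per_disk = 10**12 // (64 * 10**6)  # 1 TB / 64 MB chunks ~= 15-20K
meta_per_disk = chunks_per_disk * 64      # ~1 MB of metadata per TB disk
total = 1000 * meta_per_disk              # index for 1 PB: ~1 GB
gige = 125 * 10**6                        # 1 Gbit/s ~= 125 MB/s
seconds = total / gige                    # ~8 seconds, best case
```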
So, it is plausible to make this work, but it probably needs to be
designed, implemented, and tested carefully.

Issue: consistency model (new writes to an existing chunk)
  -- file creation/deletion are atomic and totally ordered (centralized
     at the master)
  -- file contents
     --- Challenges:
         ---- several copies
         ---- clients can fail
         ---- replicas can fail
         ---- concurrent writers

     basic problem: under 2PC, once a chunkserver has a new version of a
     chunk ("VOTE_COMMIT"), it doesn't know whether to serve the old or
     the new version on the next read until it hears GLOBAL_COMMIT
     --> reads may block

     Design decision: maximize availability; don't block reads in this
     situation. (--> Give up some consistency; OK for a read-mostly data
     mining workload...)

     --- weak consistency on write (defined/consistent/inconsistent)
         single writer, no failure --> defined
           all future reads will see the full results of the write
           (NOTE: I don't recall if they promise anything to reads while
           writes are in flight... I suspect that while writes are in
           flight, reads at some chunkservers can see the old version and
           some the new)
         multi-writer, no failure --> "consistent" (they call it)
           all future reads will see the same thing (a mix of different
           writes)
         failure --> inconsistent
           future reads may see different things on different replicas

     --- Implementation
         ---- apply mutations to a chunk in the same order at all replicas
              the primary in each chunkgroup holds a lease from the master
              client sends updates (data) to all replicas
              client sends "write" to the primary
              primary assigns a serial number and sends "write" to the
              replicas
         ---- chunk version numbers to detect stale chunks
         ---- prevent modification of stale chunks
         ---- garbage collect stale chunks

     --- for appends -- if no failures, "at least once" semantics +
         padding ("inconsistent" ranges)
         --- Implementation? Not stated, but a guess would be for the
             client to ask the chunkserver for an offset and then write;
             note that if the client crashes, the offset may be left
             empty ("inconsistent" returned on read?)
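The paper doesn't state the append mechanism; a toy version of the guess above (offset reservation followed by the write, with all names invented) shows where the "inconsistent" holes come from:

```python
# Toy model of the guessed append path above. All names invented; the
# paper does not specify this mechanism. The chunkserver hands out an
# offset, then the client writes; if the client dies between the two
# steps, that offset stays empty and reads there return "inconsistent".
INCONSISTENT = object()

class Chunk:
    def __init__(self):
        self.next_off, self.records = 0, {}

    def reserve(self, length):
        off = self.next_off
        self.next_off += length
        return off

    def write(self, off, data):
        self.records[off] = data

    def read(self, off):
        # Reserved-but-never-written ranges read as inconsistent padding.
        return self.records.get(off, INCONSISTENT)

def append(chunk, data, crash_before_write=False):
    off = chunk.reserve(len(data))
    if crash_before_write:
        return None        # client died: a hole is left at `off`
    chunk.write(off, data)
    return off
```

A client that times out and retries simply appends again, which is where at-least-once semantics (duplicates) come from; applications filter with record IDs, as the implications below note.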
  --- Implications for applications
      application-level checksums
      self-describing records
      record IDs (to eliminate duplicates)
      ...

  --- suppose we wanted a stronger guarantee
      3-phase commit?
        PRE-PREPARE: client sends data to chunkservers, gathers acks
        PREPARE: primary sends PREPARE
        COMMIT: primary sends COMMIT
        recovery: master chooses a new primary, polls chunkservers...
      issues: changing membership -- who to poll for state? (by the new
        primary or by a recovering chunkserver)
      solution (?): in the PRE-PREPARE stage, include the list of
        chunkservers participating in the request; chunkservers query the
        others on that list before expiring an item from the write
        buffer... (still some corner cases -- how to guarantee the new
        primary for a chunk knows about the latest committed write for
        that chunk... (chunk version number?)...)

  -- garbage collection v. eager delete
     simplify life -- delete happens at the master, then lazily at the
     chunkservers (several days later)

Issue: Fault tolerance/recovery from failure
  creation, re-replication, rebalancing
  -- key is that the master has a global view
     detect if there are too few replicas of a chunk, or if chunkserver
     load is imbalanced
     ...

  Suppose chunkserver 47 with 10TB fails
    Time to recover 10TB to 1 replacement server with 10 disks and a
    1 Gbit/s NW:
      10TB / 100MB/s (limited by NW bandwidth)
      --> 10^13 / 10^8 = 10^5 seconds --> MTTR > a day
    --> take advantage of parallelism (and the level of indirection from
        the master's directory)
    --> recover data in parallel to different chunkservers (and from
        different chunkservers)
        e.g., suppose 100 servers; now each server only needs to
        write/read 100GB --> recovery can be done in ~10^3 seconds -- a
        few tens of minutes
        (And it gets better as we add more servers!)
    --> master detects the missing chunkserver (heartbeat)
        master identifies which chunks now have only 2 copies
        foreach 2-copy chunk
          pick a random chunkserver
          tell that chunkserver to read and store the chunk
        [[2 outcomes:
          chunkserver registers the chunk --> done
          timeout --> repeat]]
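The recovery arithmetic above, checked in round numbers:

```python
# Check of the recovery-time arithmetic above.
data = 10 * 10**12     # 10 TB on the failed chunkserver
nw = 100 * 10**6       # ~100 MB/s of usable network bandwidth per server

serial = data / nw     # one replacement server does all the work
# 10^13 / 10^8 = 10^5 seconds -- more than a day
assert serial == 10**5 and serial > 86400

parallel = (data / 100) / nw   # spread the chunks across 100 servers
# each moves only 100 GB --> ~10^3 seconds, a few tens of minutes
assert parallel == 10**3
```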