Goals
  o 1000's of disks
  o 100's-1000's of clients
  o diverse applications
  o failures are common

Workload
- multi-GB files common
  -- each file contains many smaller application objects (e.g., web documents)
- most updates are appends
- reads -- large streaming or small random
- weakened consistency OK
  -- but optimize consistency for append
  -- handle hundreds of producers simultaneously appending to a file
- BW important; latency less so

Architecture
- 1 master
  -- stores all metadata
     --- namespace
     --- ACLs
     --- file -> chunk mapping
     --- chunk locations
  -- chunk leases
  -- garbage collection of chunks
- many chunkservers
  -- store [large] chunks (64MB)
     --- large chunks reduce interactions w/ master
     --- large chunks --> master can cache the whole mapping in memory
     --- large chunks --> clients tend to contact a small # of chunkservers --> reduce TCP overheads
     --- DA: limits load balancing for hot-spot files
  -- redundant storage (3x default)
- clients
  -- no cache
     --- don't need one for performance
     --- cache coherence adds complexity

  client --- file name, chunk index ---> master
         <-- chunk handle, chunk locations ---
  client --- chunk handle, byte range ---> chunkserver
         <-- data ---

Design choices
- Chunk locations
  -- master keeps them in memory as soft state

  (1) Won't that limit scalability?
      People often get this type of intuition wrong...
      64 bytes of metadata per 64MB chunk on disk
        --> 1GB (10^9 bytes) of memory can index 1PB (10^15 bytes) of disk
      Even better -- look at costs... as long as a byte of memory costs much less than
      1MB of disk, this choice doesn't increase the total system cost much
        --> 1GB of DRAM: ~$30 (2007)
            1PB of disk: ~$100K (2007)

      Still, a 2008 talk said they are hitting limits of scalability -- the problem seems
      to be that the machines they use run out of slots in which to put memory.
      Let's see where this might come from:
        2TB per data server (2 1TB SATA drives per machine)
        --> 1PB per 500 machines
        --> 1GB of master memory per 500 machines
        --> an 8GB master can index about 4000 machines
      Plausible that this is starting to constrain a few deployments, but maybe not quite
      a critical problem yet.

      Technology trends -- if disk cost per byte improves by ~100% per year and memory
      cost per byte improves by ~60% per year, the gap is ~40% per year
        --> the number of machines a single master can support halves roughly every 2 years
        --> plausible that they need to solve this problem sometime soon...

  (2) Isn't this icky?
      "Realize that a chunkserver has the final word over what chunks it does and does
      not have on its own disks. There is no point in trying to maintain a consistent
      view of this information on the master because errors on a chunkserver may cause
      chunks to vanish spontaneously... or an operator may rename a chunkserver."

  (3) What about startup time?
      Not addressed in the paper. Would want to look at this (could be similar to the
      "recovery storm" in Sprite)... but it is not clear that the overhead of reading
      this info from the chunkservers is much higher than reading it from disk...
      Back of envelope (see the sketch below): a 1TB disk has ~20K chunks, so ~1MB of
      location state per disk for a chunkserver to send to the master; if the master
      needs to hear about the index for 1PB, it must receive ~1GB. That is at least
      8 seconds on gigabit ethernet. Need to implement this carefully to keep recovery
      time under a minute; could easily imagine a poor implementation taking 10 minutes...
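A quick back-of-envelope sketch of the numbers above -- the 64-bytes-per-chunk, 2TB-per-machine, and gigabit-link figures are from the notes; the function names are just illustrative:

```python
# Back-of-envelope: master memory per unit of disk indexed, and time for
# chunkservers to re-report chunk locations to a restarting master.

CHUNK_SIZE = 64 * 2**20          # 64 MB per chunk
META_PER_CHUNK = 64              # ~64 bytes of master metadata per chunk
DISK_PER_MACHINE = 2 * 10**12    # 2 x 1 TB SATA drives per chunkserver
GIGABIT = 125 * 10**6            # ~125 MB/s on a gigabit link

def master_memory(total_disk_bytes):
    """Bytes of master memory needed to index total_disk_bytes of chunk data."""
    return total_disk_bytes / CHUNK_SIZE * META_PER_CHUNK

def machines_indexed(master_memory_bytes):
    """How many 2 TB chunkservers a master with this much memory can index."""
    disk_indexed = master_memory_bytes / META_PER_CHUNK * CHUNK_SIZE
    return disk_indexed / DISK_PER_MACHINE

def report_time(total_disk_bytes):
    """Seconds to stream all chunk-location reports over one gigabit link."""
    return master_memory(total_disk_bytes) / GIGABIT

if __name__ == "__main__":
    PB = 10**15
    print(f"memory to index 1 PB: {master_memory(PB) / 1e9:.1f} GB")          # ~1 GB
    print(f"machines an 8 GB master can index: {machines_indexed(8e9):.0f}")  # ~4000
    print(f"time to report 1 PB of locations: {report_time(PB):.0f} s")       # ~8 s
```

These are the same ~1GB-per-PB, ~4000-machine, and ~8-second figures the notes arrive at, which is why a careless startup implementation (serialized reports, small RPCs) could easily blow the 8-second floor up to minutes.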
So, it is plausible to make this work, but it probably needs to be designed, implemented,
and tested carefully.

- Single master, redundant operation log

- Consistency model
  -- file creation/deletion are atomic and totally ordered (centralized at master)
  -- file contents
     --- Challenges:
         ---- several copies
         ---- clients can fail
         ---- replicas can fail
         ---- concurrent writers
     --- weak consistency on write (defined / consistent / inconsistent)
         single writer, no failure --> defined
             all future reads will see the full results of the write
         multi-writer, no failure --> consistent
             all future reads will see the same thing (a mix of different writes)
         failure --> inconsistent
             future reads may see different things on different replicas
     --- Implementation (see the write-path sketch at the end of these notes)
         ---- apply mutations to a chunk in the same order at all replicas
              the primary in each chunk's replica group holds a lease from the master
              client sends data to all replicas
              client sends "write" to the primary
              primary assigns a serial number and forwards "write" to the replicas
         ---- chunk version numbers to detect stale chunks
         ---- prevent modification of stale chunks
         ---- garbage collect stale chunks
     --- for appends -- if no failures, "at least once" semantics + padding
         ("inconsistent" ranges)
     --- Implementation? Not stated, but a guess would be for the client to ask the
         chunkserver for an offset and then write; note that if the client crashes,
         the offset may be empty ("inconsistent" returned on read?)
     --- Implications for applications
         application-level checksums
         self-describing records
         record IDs (to eliminate duplicates)
         ...
     --- suppose we wanted a stronger guarantee: 3-phase commit?
         PRE-PREPARE: client sends data to the chunkservers, gathers acks
         PREPARE: primary sends PREPARE
         COMMIT: primary sends COMMIT
         recovery: master chooses a new primary, polls the chunkservers...
         issues: changing membership -- who to poll for state?
             (by the new primary or by a recovering chunkserver)
         solution (?): in the PRE-PREPARE stage, include the list of chunkservers
             participating in the request; chunkservers query the others on that list
             before expiring an item from the write buffer...
             (still some corner cases -- how to guarantee the new primary for a chunk
             knows about the latest committed write for that chunk... chunk version
             number?...)

- garbage collection v. eager delete
  -- simplifies life -- delete happens at the master, then lazily at the chunkservers
     (several days later)

Fault tolerance
- creation, re-replication, rebalancing
  -- key is that the master has a global view
     detect if there are too few replicas of a chunk, or if chunkserver load is imbalanced
  ...
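The write/append path above can be summarized in a small sketch. This is not the paper's code -- the `Replica`/`Primary` classes and `record_append` method are illustrative -- but it shows the mechanism the notes describe: the primary picks a serial order, every replica applies mutations in that order, and a retried append can leave duplicates that applications filter out with record IDs.

```python
# Sketch of the GFS write path: the primary assigns serial numbers and all
# replicas apply mutations in that order. Names are illustrative, not from the paper.

class Replica:
    def __init__(self):
        self.chunk = bytearray()
        self.next_serial = 0      # next serial number this replica expects

    def apply(self, serial, data):
        # Mutations must arrive in the primary's serial order; a gap means this
        # replica missed a mutation and would have to be repaired or dropped.
        assert serial == self.next_serial, "mutation out of order / missed"
        self.next_serial += 1
        offset = len(self.chunk)
        self.chunk += data
        return offset

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries   # other replicas in this chunk's group

    def record_append(self, data):
        # At-least-once append: one serial number, one offset, same mutation
        # applied everywhere. If a secondary fails partway, the client retries,
        # possibly leaving duplicates/padding on replicas that already applied it,
        # hence self-describing records and record IDs at the application level.
        serial = self.next_serial
        offset = self.apply(serial, data)
        for s in self.secondaries:
            s.apply(serial, data)
        return offset

primary = Primary([Replica(), Replica()])
off = primary.record_append(b"record|id=42\n")   # self-describing record w/ ID
print("appended at offset", off)
```

The sketch omits the data push (clients send the data to all replicas before the "write" request goes to the primary) and chunk version numbers; it only shows why the serial-number ordering makes replicas agree when no one fails.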
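The fault-tolerance point -- the master's global view makes re-replication and rebalancing decisions cheap -- can also be sketched under the same assumption of an in-memory chunk -> chunkserver map (names hypothetical):

```python
# Sketch: scan the master's in-memory chunk -> locations map for under-replicated
# chunks and per-chunkserver load. Illustrative only; not the paper's code.

from collections import Counter

REPLICATION_TARGET = 3   # default replication factor from the notes

def scan(chunk_locations):
    """chunk_locations: dict of chunk handle -> list of chunkserver IDs."""
    under_replicated = {
        handle: servers
        for handle, servers in chunk_locations.items()
        if len(servers) < REPLICATION_TARGET
    }
    load = Counter(s for servers in chunk_locations.values() for s in servers)
    return under_replicated, load

locations = {
    "chunk-a": ["cs1", "cs2", "cs3"],
    "chunk-b": ["cs1", "cs2"],        # lost one replica -> schedule re-replication
    "chunk-c": ["cs1"],               # furthest from target -> most urgent
}
under, load = scan(locations)
print("re-replicate:", sorted(under, key=lambda h: len(under[h])))
print("load per chunkserver:", load)
```

This only shows the detection step; actually scheduling and rate-limiting the copies is a separate concern.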