Goals
o 1000s of disks
o 100s-1000s of clients
o diverse applications
o failures are common

Workload
- multi-GB files common
-- each file contains many smaller application objects (e.g., web documents)
- most updates are appends
- reads
-- large streaming or small random
- weakened consistency OK
-- but optimize consistency for append
-- handle hundreds of producers simultaneously appending to a file
- bandwidth important; latency less so

Architecture
- 1 master
-- stores all metadata
-- namespace
-- ACL
-- file->chunk mapping
-- chunk location
-- chunk leases
-- garbage collection of chunks
- many chunkservers
-- store [large] chunks (64MB)
--- large chunks reduce interactions w/ master
--- large chunks --> cache large mappings in memory
--- large chunks --> clients tend to contact a small # of chunkservers --> reduce TCP overheads
--- Disadvantage: limits load balancing for hot-spot files
-- redundant storage (3x default)
- clients
-- no cache
-- don't need one for performance
-- cache coherence adds complexity

  client ------ file name, chunk index --------> master
         <----- chunk handle, chunk location --
    /|\
     | \--- chunk handle, byte range --> chunkserver
     \---------------------------------

Design choices
- Chunk locations
-- master keeps them in memory as soft state

(1) Won't that limit scalability?
People often get this type of intuition wrong...
64 bytes of metadata per 64MB chunk on disk
--> 1GB (10^9 bytes) of memory can index 1PB (10^15 bytes)
Even better -- look at costs... as long as a byte of memory costs much less than 1MB of disk, this choice doesn't increase entire system cost much
--> 1GB of DRAM $30 (2007)
    1PB of disk $100K (2007)

(2) Isn't this icky?
"Realize that a chunkserver has the final word over what chunks it does and does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously... or an operator may rename a chunkserver"

(3) What about startup time?
Not addressed in paper. Would want to look at this (could be similar to "recovery storm" in Sprite)... but not clear that the overhead of reading this info from a chunkserver is that much higher than reading this info from disk...

Back of envelope: a 1TB disk holds about 16K chunks (1TB / 64MB), so each disk has about 1MB of chunk metadata (at 64 bytes per chunk) to send to the master; if the master needs to hear about the index for 1K disks, it needs to receive 1GB. This is at least 8 seconds on gigabit Ethernet. Need to implement this carefully to keep recovery time under a minute. Could easily imagine a poor implementation taking 10 minutes...

So, it is plausible to make this work, but it probably needs to be designed, implemented, and tested carefully.

- Single master, redundant operation log
- consistency model
-- file creation/deletion are atomic and totally ordered (centralized at master)
-- file contents
--- Challenges:
---- several copies
---- clients can fail
---- replicas can fail
---- concurrent writers
--- weak consistency on write (defined/consistent/inconsistent)
    single writer, no failure --> defined
        all future reads will see the full results of the write
    multi-writer, no failure  --> consistent
        all future reads will see the same thing (a mix of different writes)
    failure                   --> inconsistent
        future reads may see different things on different replicas
--- Implementation
---- apply mutations to a chunk in the same order at all replicas
     primary in each chunkgroup holds lease from master
     client sends data to all replicas
     client sends "write" to primary
     primary assigns serial number and sends "write" to replicas
---- chunk version numbers to detect stale chunks
---- prevent modification of stale chunks
---- garbage collect stale chunks
--- for appends -- if no failures "at least once" semantics + padding ("inconsistent" ranges)
--- Implementation? Not stated, but a guess would be for the client to ask the chunkserver for an offset and then write; note that if the client crashes, the region at that offset may be left empty ("inconsistent" returned on read?)
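The write path listed under "Implementation" above (client pushes data to all replicas, primary assigns a serial number, every replica applies in that order) can be sketched in a few lines. This is only an illustration under assumptions, not the paper's code: the class names `Replica` and `Primary`, the `data_id` handle, and the in-memory `bytearray` chunk are all hypothetical, and leases, failures, and version numbers are left out.

```python
class Replica:
    """Hypothetical chunk replica: buffers pushed data, applies writes in serial order."""

    def __init__(self, name, size=16):
        self.name = name
        self.chunk = bytearray(size)   # stand-in for the 64MB chunk
        self.buffered = {}             # data_id -> bytes pushed by a client
        self.next_serial = 0           # next serial number this replica expects

    def push_data(self, data_id, data):
        # Step 1: data is pushed to every replica before any write is ordered.
        self.buffered[data_id] = data

    def apply(self, serial, data_id, offset):
        # Mutations must be applied in the order chosen by the primary.
        if serial != self.next_serial:
            raise RuntimeError(f"{self.name}: got serial {serial}, expected {self.next_serial}")
        data = self.buffered.pop(data_id)
        self.chunk[offset:offset + len(data)] = data
        self.next_serial += 1


class Primary(Replica):
    """The replica holding the lease (assumed already granted); it picks the order."""

    def __init__(self, name, secondaries, size=16):
        super().__init__(name, size)
        self.secondaries = secondaries

    def write(self, data_id, offset):
        # Steps 2-3: assign a serial number, apply locally, forward to secondaries.
        serial = self.next_serial
        self.apply(serial, data_id, offset)
        for replica in self.secondaries:
            replica.apply(serial, data_id, offset)
        return serial


def client_write(primary, data_id, data, offset):
    """Client side: push data everywhere, then send the "write" to the primary."""
    for replica in [primary, *primary.secondaries]:
        replica.push_data(data_id, data)
    return primary.write(data_id, offset)


if __name__ == "__main__":
    s1, s2 = Replica("s1"), Replica("s2")
    p = Primary("p", [s1, s2])
    client_write(p, "a", b"AAAA", 0)   # first writer
    client_write(p, "b", b"BB", 2)     # second writer overlaps the first
    # Every replica applied the same two mutations in the same order,
    # so all three hold identical bytes (b"AABB" followed by zeros).
    print(bytes(p.chunk) == bytes(s1.chunk) == bytes(s2.chunk))
```

Because one node chooses the order, overlapping writes from different clients leave every replica with the same (possibly mixed) bytes, which is the "consistent" case in the table above.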
--- Implications for applications
    application-level checksums
    self-describing records
    record IDs (to eliminate duplicates)
    ...
-- garbage collection v. eager delete
   simplifies life -- delete happens at master, then lazily at chunkservers (several days later)

Fault tolerance
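One way an application might tolerate what at-least-once appends leave behind (duplicates, padding, partially written regions) is to combine the three techniques listed under "Implications for applications": self-describing records, a per-record checksum, and a record ID. The sketch below is an assumption-laden illustration, not a format from the paper: the `RCRD` magic value, the header layout, and the function names `encode_record` and `read_records` are all made up for the example.

```python
import struct
import zlib

# Hypothetical self-describing record header:
# 4-byte magic, 8-byte record ID, 4-byte payload length, 4-byte CRC32 of payload.
MAGIC = b"RCRD"
HEADER = struct.Struct(">4sQII")


def encode_record(record_id, payload):
    """Writer side: wrap a payload so a reader can validate and deduplicate it."""
    return HEADER.pack(MAGIC, record_id, len(payload), zlib.crc32(payload)) + payload


def read_records(region):
    """Reader side: skip padding/garbage, verify checksums, drop duplicate IDs."""
    seen = set()
    records = []
    pos = 0
    while pos + HEADER.size <= len(region):
        magic, record_id, length, crc = HEADER.unpack_from(region, pos)
        end = pos + HEADER.size + length
        payload = region[pos + HEADER.size:end]
        if magic != MAGIC or end > len(region) or zlib.crc32(payload) != crc:
            pos += 1   # not a valid record here: resynchronize one byte at a time
            continue
        if record_id not in seen:   # a retried append shows up as a repeated ID
            seen.add(record_id)
            records.append((record_id, payload))
        pos = end
    return records


if __name__ == "__main__":
    a = encode_record(1, b"hello")
    b = encode_record(2, b"world")
    # Record 1 appended twice (a retry), with padding and garbage in between.
    region = a + bytes(7) + a + b"\xff\xff" + b
    print(read_records(region))
```

Byte-by-byte resynchronization is the simplest choice here; a real reader would more likely rely on known padding boundaries, which the notes do not specify.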