Goals
  o 1000's of disks
  o 100's-1000's of clients
  o diverse applications
  o failures are common

Workload
- multi-GB files common
  -- each file contains many smaller application objects (e.g., web documents)
- most updates are appends
- reads -- large streaming or small random
- weakened consistency OK
  -- but optimize consistency for append
  -- handle hundreds of producers simultaneously appending to a file
- BW important; latency less so

Architecture
- 1 master
  -- stores all metadata
     --- namespace
     --- ACLs
     --- file -> chunk mapping
     --- chunk locations
  -- chunk leases
  -- garbage collection of chunks
- many chunkservers
  -- store [large] chunks (64MB)
     --- large chunks reduce interactions w/ master
     --- large chunks --> master can cache the whole mapping in memory
     --- large chunks --> clients tend to contact a small # of chunkservers --> reduce TCP overheads
     --- DA: limits load balancing for hot-spot files
  -- redundant storage (3x default)
- clients
  -- no cache
     --- don't need one for performance
     --- cache coherence adds complexity

  client --- file name, chunk index ---> master
         <-- chunk handle, chunk locations ---
  client --- chunk handle, byte range ---> chunkserver
         <-- data ---

Design choices
- Chunk locations
  -- master keeps them in memory as soft state

  (1) Won't that limit scalability?
      People often get this type of intuition wrong...
      64 bytes of metadata per 64MB chunk on disk
        --> 1GB (10^9 bytes) of memory can index 1PB (10^15 bytes) of disk
      Even better -- look at costs... as long as a byte of memory costs much less than
      1MB of disk, this choice doesn't increase the total system cost much
        --> 1GB of DRAM: ~$30 (2007)
            1PB of disk: ~$100K (2007)

      Still, a 2008 talk said they are hitting limits of scalability -- the problem seems
      to be that the machines they use run out of slots in which to put memory.
      Let's see where this might come from:
        2TB per data server (2 1TB SATA drives per machine)
        --> 1PB per 500 machines
        --> 1GB of master memory per 500 machines
        --> an 8GB master can index about 4000 machines
      Plausible that this is starting to constrain a few deployments, but maybe not quite
      a critical problem yet.

      Technology trends -- if disk cost per byte improves by ~100% per year and memory
      cost per byte improves by ~60% per year, the gap is ~40% per year
        --> the number of machines a single master can support halves roughly every 2 years
        --> plausible that they need to solve this problem sometime soon...

  (2) Isn't this icky?
      "Realize that a chunkserver has the final word over what chunks it does and does
      not have on its own disks. There is no point in trying to maintain a consistent
      view of this information on the master because errors on a chunkserver may cause
      chunks to vanish spontaneously... or an operator may rename a chunkserver."

  (3) What about startup time?
      Not addressed in the paper. Would want to look at this (could be similar to the
      "recovery storm" in Sprite)... but it is not clear that the overhead of reading
      this info from the chunkservers is much higher than reading it from disk...
      Back of envelope (see the sketch below): a 1TB disk has ~20K chunks, so ~1MB of
      location state per disk for a chunkserver to send to the master; if the master
      needs to hear about the index for 1PB, it must receive ~1GB. That is at least
      8 seconds on gigabit ethernet. Need to implement this carefully to keep recovery
      time under a minute; could easily imagine a poor implementation taking 10 minutes...
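A quick back-of-envelope sketch of the numbers above -- the 64-bytes-per-chunk, 2TB-per-machine, and gigabit-link figures are from the notes; the function names are just illustrative:

```python
# Back-of-envelope: master memory per unit of disk indexed, and time for
# chunkservers to re-report chunk locations to a restarting master.

CHUNK_SIZE = 64 * 2**20          # 64 MB per chunk
META_PER_CHUNK = 64              # ~64 bytes of master metadata per chunk
DISK_PER_MACHINE = 2 * 10**12    # 2 x 1 TB SATA drives per chunkserver
GIGABIT = 125 * 10**6            # ~125 MB/s on a gigabit link

def master_memory(total_disk_bytes):
    """Bytes of master memory needed to index total_disk_bytes of chunk data."""
    return total_disk_bytes / CHUNK_SIZE * META_PER_CHUNK

def machines_indexed(master_memory_bytes):
    """How many 2 TB chunkservers a master with this much memory can index."""
    disk_indexed = master_memory_bytes / META_PER_CHUNK * CHUNK_SIZE
    return disk_indexed / DISK_PER_MACHINE

def report_time(total_disk_bytes):
    """Seconds to stream all chunk-location reports over one gigabit link."""
    return master_memory(total_disk_bytes) / GIGABIT

if __name__ == "__main__":
    PB = 10**15
    print(f"memory to index 1 PB: {master_memory(PB) / 1e9:.1f} GB")          # ~1 GB
    print(f"machines an 8 GB master can index: {machines_indexed(8e9):.0f}")  # ~4000
    print(f"time to report 1 PB of locations: {report_time(PB):.0f} s")       # ~8 s
```

These are the same ~1GB-per-PB, ~4000-machine, and ~8-second figures the notes arrive at, which is why a careless startup implementation (serialized reports, small RPCs) could easily blow the 8-second floor up to minutes.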
So, it is plausible to make this work, but it probably needs to be designed, implemented,
and tested carefully.

- Single master, redundant operation log

- Consistency model
  -- file creation/deletion are atomic and totally ordered (centralized at master)
  -- file contents
     --- Challenges:
         ---- several copies
         ---- clients can fail
         ---- replicas can fail
         ---- concurrent writers
     --- weak consistency on write (defined / consistent / inconsistent)
         single writer, no failure --> defined
             all future reads will see the full results of the write
         multi-writer, no failure --> consistent
             all future reads will see the same thing (a mix of different writes)
         failure --> inconsistent
             future reads may see different things on different replicas
     --- Implementation (see the write-path sketch at the end of these notes)
         ---- apply mutations to a chunk in the same order at all replicas
              the primary in each chunk's replica group holds a lease from the master
              client sends data to all replicas
              client sends "write" to the primary
              primary assigns a serial number and forwards "write" to the replicas
         ---- chunk version numbers to detect stale chunks
         ---- prevent modification of stale chunks
         ---- garbage collect stale chunks
     --- for appends -- if no failures, "at least once" semantics + padding
         ("inconsistent" ranges)
     --- Implementation? Not stated, but a guess would be for the client to ask the
         chunkserver for an offset and then write; note that if the client crashes,
         the offset may be empty ("inconsistent" returned on read?)
     --- Implications for applications
         application-level checksums
         self-describing records
         record IDs (to eliminate duplicates)
         ...
     --- suppose we wanted a stronger guarantee: 3-phase commit?
         PRE-PREPARE: client sends data to the chunkservers, gathers acks
         PREPARE: primary sends PREPARE
         COMMIT: primary sends COMMIT
         recovery: master chooses a new primary, polls the chunkservers...
         issues: changing membership -- who to poll for state?
             (by the new primary or by a recovering chunkserver)
         solution (?): in the PRE-PREPARE stage, include the list of chunkservers
             participating in the request; chunkservers query the others on that list
             before expiring an item from the write buffer...
             (still some corner cases -- how to guarantee the new primary for a chunk
             knows about the latest committed write for that chunk... chunk version
             number?...)

- garbage collection v. eager delete
  -- simplifies life -- delete happens at the master, then lazily at the chunkservers
     (several days later)

Fault tolerance
- creation, re-replication, rebalancing
  -- key is that the master has a global view
     detect if there are too few replicas of a chunk, or if chunkserver load is imbalanced
  ...
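The write/append path above can be summarized in a small sketch. This is not the paper's code -- the `Replica`/`Primary` classes and `record_append` method are illustrative -- but it shows the mechanism the notes describe: the primary picks a serial order, every replica applies mutations in that order, and a retried append can leave duplicates that applications filter out with record IDs.

```python
# Sketch of the GFS write path: the primary assigns serial numbers and all
# replicas apply mutations in that order. Names are illustrative, not from the paper.

class Replica:
    def __init__(self):
        self.chunk = bytearray()
        self.next_serial = 0      # next serial number this replica expects

    def apply(self, serial, data):
        # Mutations must arrive in the primary's serial order; a gap means this
        # replica missed a mutation and would have to be repaired or dropped.
        assert serial == self.next_serial, "mutation out of order / missed"
        self.next_serial += 1
        offset = len(self.chunk)
        self.chunk += data
        return offset

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries   # other replicas in this chunk's group

    def record_append(self, data):
        # At-least-once append: one serial number, one offset, same mutation
        # applied everywhere. If a secondary fails partway, the client retries,
        # possibly leaving duplicates/padding on replicas that already applied it,
        # hence self-describing records and record IDs at the application level.
        serial = self.next_serial
        offset = self.apply(serial, data)
        for s in self.secondaries:
            s.apply(serial, data)
        return offset

primary = Primary([Replica(), Replica()])
off = primary.record_append(b"record|id=42\n")   # self-describing record w/ ID
print("appended at offset", off)
```

The sketch omits the data push (clients send the data to all replicas before the "write" request goes to the primary) and chunk version numbers; it only shows why the serial-number ordering makes replicas agree when no one fails.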
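The fault-tolerance point -- the master's global view makes re-replication and rebalancing decisions cheap -- can also be sketched under the same assumption of an in-memory chunk -> chunkserver map (names hypothetical):

```python
# Sketch: scan the master's in-memory chunk -> locations map for under-replicated
# chunks and per-chunkserver load. Illustrative only; not the paper's code.

from collections import Counter

REPLICATION_TARGET = 3   # default replication factor from the notes

def scan(chunk_locations):
    """chunk_locations: dict of chunk handle -> list of chunkserver IDs."""
    under_replicated = {
        handle: servers
        for handle, servers in chunk_locations.items()
        if len(servers) < REPLICATION_TARGET
    }
    load = Counter(s for servers in chunk_locations.values() for s in servers)
    return under_replicated, load

locations = {
    "chunk-a": ["cs1", "cs2", "cs3"],
    "chunk-b": ["cs1", "cs2"],        # lost one replica -> schedule re-replication
    "chunk-c": ["cs1"],               # furthest from target -> most urgent
}
under, load = scan(locations)
print("re-replicate:", sorted(under, key=lambda h: len(under[h])))
print("load per chunkserver:", load)
```

This only shows the detection step; actually scheduling and rate-limiting the copies is a separate concern.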