Outline
  GFS architecture
  Issues
    2PC
    performance/scalability
    consistency

Note -- mostly I'll talk about the Google File System -- occasionally, I'll
fill in details/tweak things to describe HDFS (I'll try to point out where
I'm mixing, but be warned...)

Cloud background -- warehouse-scale systems
  10K-100K nodes
  50MW
  power-efficient (locate near cheap power; passive cooling, etc.; PUE 1.2 or better)
  highly uniform; commodity parts
  commodity parts and open-source/custom software
    --> cheap per-node costs
    --> need to worry about failures
  --> James Hamilton: resources (CPU, storage, network, power) for a
      10K-100K-node data center are 3-5x cheaper than for a 100-1K-node
      data center

Google File System

Goals
  o 1000's of disks
  o 100's-1000's of clients
  o diverse applications
    read-mostly workload; lots of data mining
  o failures are common

Tech trends
  o seeks = bad
  o lots of machines
  o each machine has several disks
  o disk BW ~= NW BW (100 MB/s v. 1 Gbit/s)
    1 disk's BW can fill the network

Workload
  - multi-GB files are common -- each file contains many smaller
    application objects (e.g., web documents)
  - most updates are appends
  - reads -- large streaming or small random
  - weakened consistency OK -- but optimize consistency for append
    -- handle hundreds of producers simultaneously appending to a file
  - BW important; latency less so

Architecture
  - 1 master
    -- stores all metadata
       -- namespace
       -- ACLs
       -- file->chunk mapping
       -- chunk locations
       -- chunk leases
       -- garbage collection of chunks
  - many chunkservers
    -- store [large] chunks (64MB)
       --- large chunks reduce interactions w/ master
       --- large chunks --> master can cache large mappings in memory
       --- large chunks --> clients tend to contact a small # of
           chunkservers --> reduced TCP overheads
       --- DA: limits load balancing for hot-spot files
    -- redundant storage (3x default)
  - clients
    -- no cache
       -- don't need one for performance
       -- cache coherence adds complexity

Read
  client ------ file name, offset -------> master
         <----- chunk ID, chunkserver ----
         \----- chunk ID, byte range ----> chunkserver
         <-------------------------------- (data)

Performance optimization: client can cache mappings from the master
Performance optimization: on a parallel read by many clients (MapReduce),
  can first learn where the data is, then have each client read and
  process local data

Issue: How to do writes/create a new chunk? (Case study: 2PC)

Create a new chunk:
  Need to atomically update
    3 chunkservers: store [data] and tell me the chunkID
    master: store mapping [fileID, offset] --> [chunkID, chunkserver]*
  How would you do this?

Simple answer 1: basic 2-phase commit (client acts as 2PC coordinator)
  (1) client -- data ----------------------> chunkservers
             <- chunkID, VOTE_COMMIT ------
             -- fileID, offset, chunkID ---> master
             <- VOTE_COMMIT ---------------
  (2) client -- GLOBAL_COMMIT -------------> chunkservers, master

  PROBLEM: 2PC blocks forever if the coordinator dies at just the wrong
  moment -- and here the coordinator is the client, which could die and
  never recover
  -> exactly the wrong property for 2PC

Simple answer 2: basic 2-phase commit (master acts as 2PC coordinator)
  client -- fid, off, data ---> master -- data ----------------> chunkservers
                                       <- blockID, VOTE_COMMIT -
                                       -- GLOBAL_COMMIT -------->
  Does this work?
  o For now, assume the master is highly reliable.
  o What if we try to write to chunkservers 51, 47, and 99 but chunkserver
    47 is down?
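The two "simple answers" above can be sketched as a toy, single-process 2PC; every name here (Participant, two_phase_commit, the coordinator_alive flag) is invented for illustration, not GFS/HDFS code:

```python
# Toy single-process model of the 2PC flows above. All names invented.
VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"

class Participant:
    """A chunkserver or the master, acting as a 2PC participant."""
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.staged, self.committed = None, None

    def prepare(self, data):
        if not self.up:
            return VOTE_ABORT      # models a timeout on a down server
        self.staged = data         # must hold data until commit/abort
        return VOTE_COMMIT

    def global_commit(self):
        self.committed, self.staged = self.staged, None

    def global_abort(self):
        self.staged = None

def two_phase_commit(coordinator_alive, participants, data):
    # Phase 1: gather votes from every participant.
    votes = [p.prepare(data) for p in participants]
    if any(v == VOTE_ABORT for v in votes):
        for p in participants:
            p.global_abort()
        return "ABORTED"
    # The dangerous window: everyone has voted COMMIT and is holding
    # state. If the coordinator dies here, no participant can safely
    # decide commit or abort on its own -- 2PC blocks.
    if not coordinator_alive:
        return "BLOCKED"
    # Phase 2: announce the decision.
    for p in participants:
        p.global_commit()
    return "COMMITTED"
```

The `"BLOCKED"` return models the window where every participant has voted COMMIT and must hold its state until it hears a decision; with the client as coordinator, a dead client leaves everyone stuck there.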
    --> timeout, abort, retry with different chunkservers
  PROBLEM: sending data through the master limits scalability

Better answer (HDFS; probably GFS)
  separate the data path from the metadata path

  (1) Send data to the chunkservers; chunkservers send block IDs to the
      master (chain communication for efficiency)

      client -- data --> CS1
      CS1 -- CS1/CID1, data ------------------------> CS2
      CS2 -- CS2/CID2, CS1/CID1, data --------------> CS3
      CS3 -- CS3/CID3, CS2/CID2, CS1/CID1, hash ----> master, client

      now the master knows who has this data and what the chunk IDs are

  (2) Send objID, hash(data) to the master
      client -- fid, offset, hash --> master
             <-- OK -----------------

  [(3) If the master receives [BID*, hash] from the chunkservers but then
       no binding of [oid, hash] from the client: TIMEOUT; send
       "DELETE [hash]" to the chunkservers.
       If the master receives the client request but not the CS/CID list:
       client timeout; retry the data send to the chunkservers.]

  Where is 2PC's VOTE_COMMIT/GLOBAL_COMMIT?
  [[Step 1 is "VOTE_COMMIT" by everyone except the master; step 2 is
    "VOTE_COMMIT" by the master + GLOBAL_COMMIT by the master.
    Notice that no "GLOBAL_COMMIT" is sent to the chunkservers. By 2PC,
    once they say "VOTE_COMMIT" they need to keep the data unless they are
    told ABORT --> assume "COMMIT" unless you hear "DELETE".

    NOTE: This works because all reads go through the master and because
    GFS/HDFS are (were originally) write-once file systems. Once a chunk
    is written, it is never changed. (Append is allowed.)

    What could happen if you allowed writes to change bytes within a
    chunk? (strange consistency -- without a global commit, chunkservers
    don't know when to stop serving old data and start serving new
    data...)]]

  --> Always safe and live (as long as the master is reliable and
      available)
      safe -- if the client gets an ack, then fileID, offset, data are
              stored
      live -- eventually...

Issue: Master reliability
  What if the master fails?
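Before getting to that, the chained data path in step (1) above can be sketched as a toy in-process model (all class and method names are invented; real HDFS pipelines use streaming RPC, not method calls):

```python
# Toy model of the chained data path: each chunkserver stores the data,
# appends its own (server, chunk-id) pair, and forwards down the chain;
# the last one reports the full list plus a hash to the master. The
# client then binds (fid, offset) to that hash. All names invented.
import hashlib

class Master:
    def __init__(self):
        self.pending = {}   # hash -> [(server, chunk_id), ...]
        self.files = {}     # (fid, offset) -> [(server, chunk_id), ...]

    def report_chain(self, locations, digest):
        # End of step (1): master learns who holds the data.
        self.pending[digest] = locations

    def bind(self, fid, offset, digest):
        # Step (2): client binds (fid, offset) to the stored chunks.
        self.files[(fid, offset)] = self.pending.pop(digest)
        return "OK"

class Chunkserver:
    def __init__(self, name):
        self.name, self.chunks, self.next_id = name, {}, 0

    def store_and_forward(self, data, chain, master, locations=()):
        cid = f"{self.name}/c{self.next_id}"
        self.next_id += 1
        self.chunks[cid] = data
        locations = (*locations, (self.name, cid))
        if chain:   # more servers downstream: forward data + list
            return chain[0].store_and_forward(data, chain[1:], master,
                                              locations)
        digest = hashlib.sha256(data).hexdigest()
        master.report_chain(list(locations), digest)
        return digest

def write(data, fid, offset, servers, master):
    digest = servers[0].store_and_forward(data, servers[1:], master)
    return master.bind(fid, offset, digest)
```

Note that there is no explicit GLOBAL_COMMIT to the chunkservers here, matching the protocol above: once a chunkserver has stored the data, it keeps it unless the master later sends a DELETE.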
Make the master redundant
  simple idea -- before the master "ack"s the client, send the data to 2
  or 3 other machines that hold a redundant log

  but what if the master crashes while doing this?
    suppose the master updates replica 1 but not replica 2 and then
    crashes
    if we fail over to replica 2: "no data"
    if we fail over to replica 1: "data"
    what if we first fail over to replica 2? it cannot contact replica 1,
    so we operate for a while with "no data"; then replica 2 fails and we
    go back to replica 1 or 0? ugh.

  Illustrates the problem with 2PC -- if the coordinator (here, the
  master) dies at just the wrong moment, you're doomed. (You can try to
  add hacks for the specific scenario I described, but there will *always*
  be corner cases. 2PC sacrifices safety or liveness in corner cases.)

  Solution (UpRight HDFS)
    master is replicated with "3PC" (Paxos, PBFT, UpRight)
    --> 2f+1 masters to tolerate f crash failures
    -- always safe
    -- live during periods when the network is well behaved and enough
       machines are up

Issue: Scalability
  - Chunk locations -- master keeps them in memory as soft state

  (1) Won't that limit scalability (memory size)?
      People often get this type of intuition wrong...
      64 bytes of metadata per 64MB chunk on disk
      --> 1GB (10^9) of memory can index 1PB (10^15) of disk

      Even better -- look at costs... as long as a byte of memory costs
      much less than 1MB of disk, this choice doesn't increase the entire
      system cost much
      --> 1GB of DRAM: $30 (2007)
          1PB of disk: $100K (2007)

      Still, a 2008 talk said they are hitting the limits of scalability
      -- the problem seems to be that the machines they use run out of
      slots in which to put memory.
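The 1GB-indexes-1PB claim is just the 64-bytes-per-64MB ratio; a quick sanity check in round numbers:

```python
# Quick check of the metadata-scaling arithmetic above (round numbers).
chunk_size = 64 * 10**6     # 64 MB of disk per chunk
meta_size = 64              # ~64 bytes of master metadata per chunk
ratio = chunk_size // meta_size
# 1 byte of master memory indexes ~1 MB of disk...
assert ratio == 10**6
# ...so 1 GB (10^9 bytes) of memory indexes ~1 PB (10^15 bytes) of disk.
assert 10**9 * ratio == 10**15
```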
      let's see where this might come from
        2TB per data server (2 1TB SATA drives per machine)
        --> 1PB per 500 machines
        --> 1GB of master memory per 500 machines
        --> an 8GB server can index about 4000 machines
      Plausible that it is starting to constrict a few deployments, but
      maybe not quite a critical problem yet.

      Technology trends -- if disk cost falls by 100% per year and memory
      cost falls by 60% per year, the gap is 40% per year
      --> halve the number of machines a single server can support every
          2 years
      --> plausible that they need to solve this problem sometime soon...

  (2) Won't this limit scalability (performance)?
      How many disks can 1 master support?
      How many IOs per disk?
        64MB per IO / 100 MB/s per disk ~= 2 IOs/sec per disk
      1K disks --> 2K IOs/second: 1 IO per 500us --> EASY
      10K disks --> 20K IOs/sec: 1 IO per 50us --> OK

  (3) Isn't this [soft state] icky?
      "Realize that a chunkserver has the final word over what chunks it
      does and does not have on its own disks. There is no point in trying
      to maintain a consistent view of this information on the master
      because errors on a chunkserver may cause chunks to vanish
      spontaneously... or an operator may rename a chunkserver."

  (4) What about startup time?
      Not addressed in the paper. Would want to look at this (could be
      similar to the "recovery storm" in Sprite)... but it's not clear
      that the overhead of reading this info from a chunkserver is that
      much higher than reading this info from disk...

      Back of the envelope: a 1TB disk has ~20K chunks, so ~1MB of chunk
      metadata to send per disk; if the master needs to hear about the
      index for 1PB, it must receive 1GB. This is at least 8 seconds on
      gigabit ethernet. Need to implement this carefully to keep recovery
      time under a minute. Could easily imagine a poor implementation
      taking 10 minutes...
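The startup-time estimate above checks out in round numbers (assuming 64-byte metadata entries and a full 1 Gbit/s of receive bandwidth at the master):

```python
# Back-of-the-envelope check of the startup-time estimate above.
chunks_per_disk = 10**12 // (64 * 10**6)  # 1 TB / 64 MB chunks ~= 15-20K
meta_per_disk = chunks_per_disk * 64      # ~1 MB of metadata per TB disk
total = 1000 * meta_per_disk              # index for 1 PB: ~1 GB
gige = 125 * 10**6                        # 1 Gbit/s ~= 125 MB/s
seconds = total / gige                    # ~8 seconds, best case
```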
So, it is plausible to make this work, but it probably needs to be
designed, implemented, and tested carefully.

Issue: consistency model (new writes to an existing chunk)
  -- file creation/deletion are atomic and totally ordered (centralized
     at the master)
  -- file contents
     --- Challenges:
         ---- several copies
         ---- clients can fail
         ---- replicas can fail
         ---- concurrent writers

     basic problem: under 2PC, once a chunkserver has a new version of a
     chunk ("VOTE_COMMIT"), it doesn't know whether to serve the old or
     the new version on the next read until it hears GLOBAL_COMMIT
     --> reads may block

     Design decision: maximize availability; don't block reads in this
     situation. (--> Give up some consistency; OK for a read-mostly data
     mining workload...)

     --- weak consistency on write (defined/consistent/inconsistent)
         single writer, no failure --> defined
           all future reads will see the full results of the write
           (NOTE: I don't recall if they promise anything to reads while
           writes are in flight... I suspect that while writes are in
           flight, reads at some chunkservers can see the old version and
           some the new)
         multi-writer, no failure --> "consistent" (they call it)
           all future reads will see the same thing (a mix of different
           writes)
         failure --> inconsistent
           future reads may see different things on different replicas

     --- Implementation
         ---- apply mutations to a chunk in the same order at all replicas
              the primary in each chunkgroup holds a lease from the master
              client sends updates (data) to all replicas
              client sends "write" to the primary
              primary assigns a serial number and sends "write" to the
              replicas
         ---- chunk version numbers to detect stale chunks
         ---- prevent modification of stale chunks
         ---- garbage collect stale chunks

     --- for appends -- if no failures, "at least once" semantics +
         padding ("inconsistent" ranges)
         --- Implementation? Not stated, but a guess would be for the
             client to ask the chunkserver for an offset and then write;
             note that if the client crashes, the offset may be left
             empty ("inconsistent" returned on read?)
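The paper doesn't state the append mechanism; a toy version of the guess above (offset reservation followed by the write, with all names invented) shows where the "inconsistent" holes come from:

```python
# Toy model of the guessed append path above. All names invented; the
# paper does not specify this mechanism. The chunkserver hands out an
# offset, then the client writes; if the client dies between the two
# steps, that offset stays empty and reads there return "inconsistent".
INCONSISTENT = object()

class Chunk:
    def __init__(self):
        self.next_off, self.records = 0, {}

    def reserve(self, length):
        off = self.next_off
        self.next_off += length
        return off

    def write(self, off, data):
        self.records[off] = data

    def read(self, off):
        # Reserved-but-never-written ranges read as inconsistent padding.
        return self.records.get(off, INCONSISTENT)

def append(chunk, data, crash_before_write=False):
    off = chunk.reserve(len(data))
    if crash_before_write:
        return None        # client died: a hole is left at `off`
    chunk.write(off, data)
    return off
```

A client that times out and retries simply appends again, which is where at-least-once semantics (duplicates) come from; applications filter with record IDs, as the implications below note.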
  --- Implications for applications
      application-level checksums
      self-describing records
      record IDs (to eliminate duplicates)
      ...

  --- suppose we wanted a stronger guarantee
      3-phase commit?
        PRE-PREPARE: client sends data to chunkservers, gathers acks
        PREPARE: primary sends PREPARE
        COMMIT: primary sends COMMIT
        recovery: master chooses a new primary, polls chunkservers...
      issues: changing membership -- who to poll for state? (by the new
        primary or by a recovering chunkserver)
      solution (?): in the PRE-PREPARE stage, include the list of
        chunkservers participating in the request; chunkservers query the
        others on that list before expiring an item from the write
        buffer... (still some corner cases -- how to guarantee the new
        primary for a chunk knows about the latest committed write for
        that chunk... (chunk version number?)...)

  -- garbage collection v. eager delete
     simplify life -- delete happens at the master, then lazily at the
     chunkservers (several days later)

Issue: Fault tolerance/recovery from failure
  creation, re-replication, rebalancing
  -- key is that the master has a global view
     detect if there are too few replicas of a chunk, or if chunkserver
     load is imbalanced
     ...

  Suppose chunkserver 47 with 10TB fails
    Time to recover 10TB to 1 replacement server with 10 disks and a
    1 Gbit/s NW:
      10TB / 100MB/s (limited by NW bandwidth)
      --> 10^13 / 10^8 = 10^5 seconds --> MTTR > a day
    --> take advantage of parallelism (and the level of indirection from
        the master's directory)
    --> recover data in parallel to different chunkservers (and from
        different chunkservers)
        e.g., suppose 100 servers; now each server only needs to
        write/read 100GB --> recovery can be done in ~10^3 seconds -- a
        few tens of minutes
        (And it gets better as we add more servers!)
    --> master detects the missing chunkserver (heartbeat)
        master identifies which chunks now have only 2 copies
        foreach 2-copy chunk
          pick a random chunkserver
          tell that chunkserver to read and store the chunk
        [[2 outcomes:
          chunkserver registers the chunk --> done
          timeout --> repeat]]
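The recovery arithmetic above, checked in round numbers:

```python
# Check of the recovery-time arithmetic above.
data = 10 * 10**12     # 10 TB on the failed chunkserver
nw = 100 * 10**6       # ~100 MB/s of usable network bandwidth per server

serial = data / nw     # one replacement server does all the work
# 10^13 / 10^8 = 10^5 seconds -- more than a day
assert serial == 10**5 and serial > 86400

parallel = (data / 100) / nw   # spread the chunks across 100 servers
# each moves only 100 GB --> ~10^3 seconds, a few tens of minutes
assert parallel == 10**3
```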