Cluster file system, not a local file system. Optimized for large files, access bandwidth, sequential reads and appends. Support for atomic appends (how transactional?). Component (hardware) failures are the norm (as are other failures, like memory and human). Huge files (multi-GB). Files are only appended, then read. Snapshot support for files and directories (AFS, ZFS, etc.).

Name nodes and data nodes: the classic metadata/data distinction. GFS does not provide the POSIX API and therefore need not hook into the Linux vnode layer (learn from AFS). Fixed-size (64 MB) chunks mean easy translation from a file offset to a chunk index, done by the client (sketch below). Each chunk is possibly served by a different data node. Chunks are identified by 64-bit chunk handles. Data chunks are replicated 3 times (why 3?). The client contacts the master for chunk locations but contacts chunkservers directly for data. No client data cache (why not?). Clients do cache metadata (why?). One master node (uh oh). See figure 1. Does GFS support hard links? Lazy chunk allocation justifies the large chunk size (what is the largest source of fragmentation?). Persistent TCP connection to the data node. Might have hot spots for widely accessed small files: write an executable, then many clients execute it → high latency → fixed with a greater replication factor within GFS (a hack).

In master memory (why is memory sufficient?):
1. File and chunk namespace. Changes logged to disk for persistence.
   - Read/write locks for namespace management.
2. Mapping of files to chunks.
3. Location of chunk replicas. Not persistent (why?).

The operation log is vital (just as the integrity of the file system is vital; data don't mean nothin' without the metadata). The master's operation log serializes all namespace operations (are updates linearizable?). Namespace updates are synchronously written locally and remotely (how can they tolerate this latency? How is this synchronized?). The log is periodically checkpointed, done concurrently with mutations because it takes about a minute.

Consistent: all clients see the same data. Defined: all clients see all of the latest write (coherent, or fresh). Application-level checksums for integrity; applications tolerate duplicate records. Also, the file system checksums every 64 KB (sketch below). Why both? End to end? (5.2)

Writes are ordered by a lease: the master grants a lease to a primary chunkserver, and the primary decides the update order (sketch below). Lease lifetime is 60 seconds. Snapshot implementation, step 1: revoke leases. See figure 2. Data is moved in any order, but "committed" in the order determined by the primary. Durability means writes to multiple racks.

Garbage-collect deleted files lazily: deleted files are renamed, and space is reclaimed after 3 days (sketch below). Why this approach? Shadow master for fast failover. How is GFS's serialization of file creates like TxOS?

4.3 Availability: master shadowing, read-only failover. Master restart is fast (what takes the longest?).

Good results. Compare table 6 to AFS server traffic. Where is getattr?

**GFS Evolution**

64 MB chunks make it hard to support small files (think Gmail); 1 MB is the new design target. Master memory limits the number of files in a GFS file system. Maximizing bandwidth at the expense of latency doesn't work so well with user-visible services like Gmail. File-content inconsistencies are a pain point. Support more than 1 master (difficult). Erasure coding (e.g., Reed-Solomon codes) brings the 3x storage cost down to 2.1x with similar availability (overhead arithmetic below).
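
The notes above say the fixed 64 MB chunk size lets the client translate a byte offset into a chunk index with simple arithmetic. A minimal sketch of that translation, assuming a read that can span several chunks (function and names are mine, not GFS's API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def byte_range_to_chunks(offset: int, length: int) -> list[tuple[int, int, int]]:
    """Split an (offset, length) read into (chunk_index, offset_in_chunk, nbytes)
    requests. The client asks the master for the chunk handle and replica
    locations of each chunk index, then reads from chunkservers directly."""
    requests = []
    end = offset + length
    while offset < end:
        chunk_index = offset // CHUNK_SIZE
        within = offset % CHUNK_SIZE
        n = min(CHUNK_SIZE - within, end - offset)
        requests.append((chunk_index, within, n))
        offset += n
    return requests

# A 100 MB read starting at byte 60 MB touches chunks 0, 1, and 2.
print(byte_range_to_chunks(60 * 2**20, 100 * 2**20))
```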
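
For the 64 KB filesystem checksums: chunkservers keep a 32-bit checksum per 64 KB block of each chunk and verify every block a read touches before returning data, so corruption is not propagated to clients or other replicas. A sketch assuming CRC32 (the checksum algorithm is my choice; names are illustrative):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # chunkservers checksum every 64 KB block

def block_checksums(chunk_data: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of a chunk (CRC32 assumed here)."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data: bytes, stored: list[int], offset: int, length: int) -> bool:
    """Verify every block the read overlaps; on a mismatch the chunkserver
    returns an error and reports to the master so the chunk can be
    re-replicated from a good copy."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != stored[b]:
            return False
    return True
```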
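
For write ordering by lease: the master grants one replica a 60-second, renewable lease, making it the primary. Data is pushed to all replicas in any order, but the primary assigns serial numbers and every replica applies mutations in that same order. A sketch of the primary's role, under invented names (not the paper's API):

```python
import itertools
import time

LEASE_SECONDS = 60  # the master grants the primary a 60-second, renewable lease

class PrimaryReplica:
    """Sketch: the primary serializes mutations whose data has already been
    pushed to every replica (data flow is decoupled from control flow)."""

    def __init__(self, lease_expiry: float, secondaries: list):
        self.lease_expiry = lease_expiry
        self.secondaries = secondaries
        self.next_serial = itertools.count(1)

    def commit(self, mutation_id: str) -> bool:
        if time.time() >= self.lease_expiry:
            return False                    # lease lapsed: client retries via the master
        serial = next(self.next_serial)     # commit order is decided here
        self.apply(mutation_id, serial)     # apply locally in serial order
        acks = [s.apply(mutation_id, serial) for s in self.secondaries]
        return all(acks)                    # success only if every replica applied it

    def apply(self, mutation_id: str, serial: int) -> bool:
        # Apply the buffered data for mutation_id at this serial number.
        return True
```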
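
For lazy garbage collection: deletion only renames the file to a hidden, timestamped name; a periodic namespace scan drops hidden files older than 3 days, and the orphaned chunks are reclaimed later when chunkservers report which chunks they hold. A sketch with invented field and method names:

```python
import time

GRACE_PERIOD = 3 * 24 * 3600  # hidden files are reclaimed after 3 days

class MasterNamespace:
    """Sketch of GFS-style lazy deletion (names are illustrative)."""

    def __init__(self):
        self.files = {}    # path -> list of chunk handles
        self.hidden = {}   # hidden name -> (deletion timestamp, chunk handles)

    def delete(self, path: str) -> None:
        # Deletion is just a rename to a hidden, timestamped name.
        chunks = self.files.pop(path)
        self.hidden[f".deleted.{path}"] = (time.time(), chunks)

    def namespace_scan(self) -> list:
        """Periodic scan: drop hidden files past the grace period. Their chunks
        become orphans that the regular chunk GC reclaims later."""
        now = time.time()
        orphaned = []
        for name, (deleted_at, chunks) in list(self.hidden.items()):
            if now - deleted_at > GRACE_PERIOD:
                orphaned.extend(chunks)
                del self.hidden[name]
        return orphaned
```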
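
On the evolution point about erasure coding: an RS(data, parity) code stores (data + parity) / data raw bytes per user byte, versus N for N-way replication. The code parameters behind the quoted 2.1x figure aren't given in the notes, so this is only the generic arithmetic:

```python
def storage_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of user data under an erasure code."""
    return (data_shards + parity_shards) / data_shards

print(storage_overhead(6, 3))  # RS(6,3): 1.5x, tolerates the loss of any 3 shards
print(3 / 1)                   # plain 3-way replication: 3.0x
```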