Disclaimer -- for other google papers the reaction is "wow, how simple"; for this one it is
"do they really need this complexity?" and "what is this feature/restriction buying you?" --
it takes a bit of work to chew through it...

Abstraction
  Provide a big table of data
  each row has a key and a set of columns -- columns are very flexible
    colFamily:qualifier:timestamp -> value
    different rows have the same colFamilies but may have different numbers of items
      (qualifier:timestamp) per colFamily
    (in fact, each colFamily is basically a little key-value store like a hashtable,
     with added support for versioning via timestamps)
  example: web table
    row ID is URL (reverse-sorted hostname for locality: e.g., com.cnn.www/index.html)
  automatic garbage collection of old timestamps (or when too many versions of an item)
  operations:
    read a row
    write a row (can add/update/delete multiple colFamily:qualifier:timestamp items)
      -- update to a row guaranteed to be atomic
      -- update to a row can be conditional on "not changed since read"
    sequential iteration across ranges of rows (for locality/performance)
  bottom line: simple and very flexible abstraction
    bigtable lets you store information about a bunch of items ("rows")
    in each item ("row") you can store arbitrary information via a key->value hashtable
    items in a row can be small or large
    complexity comes from locality -- locality across rows, grouping of related values
      (colFamilies, etc.)
  widely used within google; many applications -- open-source Cassandra has many
    similar aspects

Question: Why not a file system? Why not just Google FS?
  want: fine grained, locality, sparse keyspace, updates
  fine grained --> can't just use a GFS file per row (b/c rows may not be huge)
  locality, sparse keyspace --> nontrivial mapping from rowID to file/offset
  updates: want atomic updates; want small updates to the "middle" of a file
    --> GFS not good at this...

Big picture
  One master; many tablet servers
  master is all soft state
  Divide data into TABLETs
    -- TABLET ~= 100MB
    -- split as needed -- variable number/range of rows per tablet (row size, sparse index)
    -- all accesses to a tablet go via a TABLET SERVER
  master never consulted on reads/writes; it does assignment of tablets to servers only

3 questions
  (1) Which tablets/tablet servers hold which ranges of keys?
  (2) How to assign a tablet to at most 1 tablet server?
  (3) How does a tablet server efficiently and reliably store, update, and access data?

(0) Chubby
  External subsystem: Chubby lock service
  Chubby is based on Paxos (asynchronous, nonblocking, replicated sequence of actions)
  Chubby provides the abstraction of a file system for small files
    -- read/write/lock/create/delete files -- all atomic
    -- client creates a session with chubby
    -- client can cache items and be notified if they change
    -- if the session dies (heartbeats), the client loses all locks and cached items
       (lease expired)
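To make the later steps concrete, here is a minimal in-memory sketch (Python; all class and
method names are hypothetical) of the kind of interface these notes assume Chubby provides:
small files with atomic read/write, exclusive locks, sessions with leases, and change
notification. Real Chubby is a replicated Paxos-based service, not a single object; this is
just a toy to fix the vocabulary.

# Toy, single-process stand-in for the Chubby-style interface used below.
class Session:
    def __init__(self):
        self.alive = True
        self.notifications = []      # paths whose cached copies were invalidated

class ChubbyCell:
    def __init__(self):
        self.files = {}              # path -> small value (e.g., root tablet location)
        self.locks = {}              # path -> session currently holding the exclusive lock
        self.watchers = {}           # path -> set of sessions to notify on change

    def write(self, session, path, value):
        self.files[path] = value
        for s in self.watchers.get(path, set()):
            if s.alive:
                s.notifications.append(path)     # "your cached copy is stale"

    def read(self, session, path, watch=False):
        if watch and session is not None:
            self.watchers.setdefault(path, set()).add(session)
        return self.files.get(path)

    def try_lock(self, session, path):
        # Exclusive lock: succeeds only if no live session holds it.
        holder = self.locks.get(path)
        if holder is None or not holder.alive:
            self.locks[path] = session
            return True
        return False

    def expire_session(self, session):
        # Lease expired: the session loses all of its locks (and, implicitly, its cache).
        session.alive = False
        for path, holder in list(self.locks.items()):
            if holder is session:
                self.locks[path] = None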
(1) Which tablets hold which ranges of keys? Which tablet server to contact for a tablet?
  Array of METADATA tablets
  each keeps (roughly -- details sketchy here in the paper)
    FIRST_KEY,LAST_KEY -> tabletID, tabletServerID
  How to find the METADATA tablets?
    Special tablet: "ROOT_METADATA"; never split; always stored at one tablet server
    Find the ROOT_METADATA tablet server using Chubby
      (chubby keeps the ID of the root metadata server in a well-known chubby file)
    ROOT metadata tablet keeps rows of (roughly -- details sketchy here in the paper)
      FIRST_KEY,LAST_KEY -> tabletID, tabletServerID
    --> so, use chubby to find the ROOT_METADATA tablet's server
        then ask that server for the entry whose key range covers the target
        now you know the tablet server for the desired row
        (a client-side sketch of this lookup appears below, after the tablet server
         read/write path)
  [[PICTURE: bigtable.jpg]]

(2) How to assign a tablet to at most 1 tablet server
  Could find out the name of the tablet server, but by the time you send a message to
    that tablet server, the mapping may have changed
  The mapping above is just a HINT
    -- ensure that each tablet is assigned to at most one tablet server
    -- if a client contacts the wrong tablet server, the server will reject the request;
       client retries
  Assigning tablet servers
    BigTable Master
      -- use chubby to make sure there is at most one master
         (to act as master, must hold the unique MASTER lock in chubby)
    Each tablet server TS
      -- creates its own file under /tservers/ in chubby and grabs an exclusive lock on it
    Tablet assignment
      -- steady state: master periodically tries to acquire the lock for each tablet server
         success --> that tablet server is dead or unreachable
           --> delete its /tservers/ file
           --> reassign its tablets to other tablet servers
      -- bootstrapping
         master asks each TS for the state of its session/lock and the list of tablets served
         if the ROOT_METADATA tablet is not assigned to a server, assign it and update chubby
         scan the ROOT_METADATA tablet, then the METADATA tablets, for the list of
           user/data tablets
         assign any unassigned tablets to tablet servers
      -- master just tells a server to take over an unassigned tablet
      -- the new tserver updates the METADATA row using atomic read-modify-write and
         sequence numbers to ensure atomicity (I'm guessing here -- details vague in paper)
  At most one tablet server per tablet:
    -- as long as the lock is held, TS is the tablet server for all tablets assigned to it
    -- when the session is lost, try to reestablish
    -- when reestablished, try to reacquire the lock on the file
       success --> continue to serve tablets
       failure --> restart with 0 tablets

(3) Tablet server stores state in multiple GFS files
  --> durable
  --> try to make all writes append-only
  (-- one tablet server at a time --> no concurrent updates)
  2 types of file per tablet
    (1) (one) log of recent updates to the tablet
        (actually one logical log; multiple tablets at the same tablet server share the
         same underlying physical log file to reduce seeks)
        (hashtable of recent updates cached in memory)
    (2) (several) SSTable files
        when the log fills up, create a new SSTable file and write the updates from the
          log to that file
        SSTable file = key-sorted list of updates (and delete markers)
          + index of key->offset pairs for the SSTable (~64KB granularity)
        --> to read key k from an SSTable
            (1) read the index
            (2) read the desired 64KB chunk
            (3) binary search within the chunk for the desired row for key k
            (Note: the indexes for all SSTables are kept in the tablet server's memory,
             so step 1 is omitted --> an SSTable read takes 1 disk access)
  Update:
    append the update to the log file
    update the in-memory cache of recent updates
  Read:
    check the cache of recent updates (in memory)
    iterate through the SSTable files and try reading
      --> hit or delete record --> stop
      --> otherwise, try the next SSTable
    [detail: need to keep "delete records" to suppress deleted data in older SSTables
     from reappearing]
    [detail: use a bloom filter to identify which cells are populated --> avoid disk
     accesses for non-present cells]
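A minimal sketch of the update/read path just described, with toy in-memory stand-ins for
the GFS log and SSTable files (a dict per SSTable, a DELETED sentinel for tombstones); the
class and field names are made up for the example.

DELETED = object()     # tombstone recorded on delete

class Tablet:
    def __init__(self):
        self.log = []          # append-only redo log (stand-in for the GFS log file)
        self.memtable = {}     # recent updates: key -> value or DELETED
        self.sstables = []     # immutable dicts standing in for SSTable files, newest first

    def write(self, key, value):
        self.log.append((key, value))     # 1. make it durable (append-only)
        self.memtable[key] = value        # 2. update the in-memory cache

    def delete(self, key):
        self.write(key, DELETED)          # a delete is just a write of a tombstone

    def read(self, key, default=None):
        # Check the memtable first, then SSTables newest-to-oldest; stop at the first
        # hit, treating a tombstone as "not present". (A real server would consult a
        # per-SSTable Bloom filter before touching disk.)
        for table in [self.memtable] + self.sstables:
            if key in table:
                value = table[key]
                return default if value is DELETED else value
        return default

    def minor_compaction(self):
        # Log/memtable got too big: freeze it as the newest SSTable and start fresh.
        self.sstables.insert(0, dict(self.memtable))
        self.memtable, self.log = {}, []

Note that delete() never touches the older SSTables; the tombstone in a newer table is what
hides the old value, until a major compaction rewrites everything and can drop it.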
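Going back to (1)/(2): a hypothetical client-side sketch of the tablet-location lookup,
treating cached locations as hints and retrying when a server rejects a request for a
tablet it no longer serves. The three-level hierarchy is collapsed into two in-memory
indexes, chubby_read() stands in for reading the well-known Chubby file, and the cache is
per key for simplicity (a real client caches key ranges); all names and paths are invented.

import bisect

def covering_entry(index, key):
    """index: list of (last_key, value) sorted by last_key and covering the whole
    keyspace (the final last_key is a max sentinel). Return the value of the first
    entry whose range includes `key`."""
    last_keys = [lk for lk, _ in index]
    return index[bisect.bisect_left(last_keys, key)][1]

def read_row(chubby_read, servers, key, location_hints):
    """servers: dict name -> server object; location_hints: key -> server name (a HINT)."""
    while True:
        if key not in location_hints:
            root_srv = chubby_read("/bigtable/root-location")            # well-known file
            meta_srv = covering_entry(servers[root_srv].root_index, key) # ROOT -> METADATA
            location_hints[key] = covering_entry(servers[meta_srv].meta_index, key)
        server = servers[location_hints[key]]
        ok, row = server.read_row(key)        # rejected if the tablet moved away
        if ok:
            return row
        location_hints.pop(key)               # stale hint: forget it, re-resolve, retry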
  Compaction: speed up reads and garbage collect deleted records
    periodically do a "merging compaction" of several SSTable files into one
    periodically do a "major compaction" of all SSTables into one big SSTable
      (can discard "delete" tombstones)
  If a tablet gets too big, split it
    -- splitting a tablet is easy (both "child" tablets can use the same "parent's" SSTables)

Implementation details
  -- of course this is all layered over GFS
     write: send data to the 3 chunk servers holding replicas
            send commit to the primary replica
            primary commits (and forwards the write order to the secondaries)
            primary replies
  how to make latency OK?
     in the GFS paper they say "we only care about throughput and large files"
     but now they also care about latency for small items (e.g., account information, ...)
     one trick: each tablet server writes its log to two GFS files; if one gets slow,
       it switches to the other (repeating unacknowledged writes; sequence numbers
       suppress duplicates)
     -- batch log writes, group commit (window of vulnerability?)
     experiments show pretty good throughput
     What about latency? (no experiments)
       [my guess at their answer -- end-to-end principle -- build async writes into the
        application b/c you don't want apps waiting for network or disk anyhow...]

Consistency
  single-row transactions -- seem pretty easy since a row lives on a single tablet server;
    all writes go to the redo log
  why not multi-row?
  suppose you wanted multi-row, how would you do it?

Additional details: Locality
  Why not a file system? A Unix file system assumes:
    files are read in their entirety
    related files are stored in a directory
    only the current version is of interest
  Bigtable: simple picture
    [picture: a table of rows; columns are grouped into column families, and column
     families are grouped into locality groups]
         ...  | locality group     | locality group     | ...
         ...  | col fam | col fam  | col fam | col fam  | ...
        row
        row
        row
        ...
  generalize locality across items: rows with similar rowIDs stored together
    --> apps can assign row IDs to expose locality
  expose locality within an item: locality group of columns
    some subsets of columns are read together -> store those column groups together
  rows are sparse -> iterate across sparse items
    -> cache recently accessed items (v. having to cache full disk blocks)
  also...
    different permissions/owners for different column families
    keep several recent versions

Compression
  -- CPU cheaper than network
     --> compress to reduce network BW
     --> compress to increase the effective size of the cache
  compression works really well for their data (10x v. 2x for most text files) -- why?
     they see lots of mechanically generated "template text" that their algorithm can
     replace with small tokens

OLD NOTES:

Implementation
  METADATA table
    chubby file --> root table --> metadata tables --> user tables
    root table and metadata tables just store startRow, endRow --> tablet ID
  Tablet -- range of rowIDs managed by the same server
    server stores tablet info in SSTables
  Master assigns tablets to tablet servers
    (10 to 1000 tablets per tablet server)
    (tablet 100-200MB in size by default)
    all persistent data is stored in GFS (or chubby), so any tablet server can serve
      any tablet
    (probably use location info in GFS to assign tablet servers; also do load balancing;
     details not described in the paper. Any ideas for a simple algorithm? e.g., of the
     three servers that have the tablet's data local in GFS, pick the one with the lowest
     load UNLESS all are above a threshold, in which case pick a remote one that is below
     the load threshold)
  SSTable -- unit of storage
    persistent, ordered, immutable map stored in 64KB blocks
    details hazy -- see end of page 3 / start of page 4
      (1) index -- ??key -> location of value in SSTable??
      (2) ??array of cell values??
      ?? key is (row:string, column:string, time:int64)
    supports random access and efficient sequential scan
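Aside: a toy sketch of an SSTable as described above -- records sorted by key, packed into
fixed-size blocks, plus a small index (the first key of each block) kept in memory -- showing
the read-the-index / read-one-block / binary-search-within-the-block path. The block size and
all names are made up; the real format stores byte offsets into a GFS file, not Python lists.

import bisect

BLOCK_RECORDS = 4          # stand-in for the ~64KB block size

class SSTable:
    def __init__(self, records):
        """records: iterable of (key, value) pairs; stored sorted by key."""
        records = sorted(records)
        self.blocks = [records[i:i + BLOCK_RECORDS]
                       for i in range(0, len(records), BLOCK_RECORDS)]
        self.index = [block[0][0] for block in self.blocks]    # first key of each block

    def get(self, key):
        if not self.blocks or key < self.index[0]:
            return None
        # Step 1 (in memory): find the one block that could contain the key.
        b = bisect.bisect_right(self.index, key) - 1
        block = self.blocks[b]                    # Step 2: "read" that block (1 disk access)
        keys = [k for k, _ in block]
        i = bisect.bisect_left(keys, key)         # Step 3: binary search within the block
        if i < len(keys) and keys[i] == key:
            return block[i][1]
        return None

Keeping only one small index entry per block (rather than one per key) is what lets the whole
index for every SSTable sit in the tablet server's memory.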
  Writes
    logically want to update a row or cell
    QUESTION: why not just update in place?
      (1) versioned data -- almost never overwrite old data with new
          instead, add a new version and later garbage collect one of the old versions
          but even if there were no versioned data...
      (2) sparse data -- cells are not fixed size, # of cells per row is not fixed
          (and can change over time)
          --> update in place is hard
    solution
      (1) tablet updates go to a log
      (2) periodically batch-write recent log entries to an SSTable
      --> current tablet state is a sequence of SSTables
      to read, start with the latest SSTable, then try earlier ones, ...
      to scan from a start key, keep a "finger" in each SSTable, look at all possible
        "next" records, pick the smallest, repeat (see the merged-scan sketch at the end
        of these notes)
      [detail: need to keep "delete records" to suppress deleted data in older SSTables
       from reappearing]
      [detail: use a bloom filter to identify which cells are populated --> avoid disk
       accesses for non-present cells]
    compaction
      (1) minor compaction -- log -> SSTable of recent updates (when the log/in-memory
          table gets too large)
      (2) merging compaction -- merge a few SSTables (bounds the number of SSTables
          per tablet)
      (3) major compaction -- merge all SSTables into one (periodically; can discard
          delete tombstones)
    other good side effects of immutable tables
      (1) splitting a tablet is easy (both "children" can use the same "parent's" SSTables)
      (2) simplified locking
      (3) simple mark-and-sweep garbage collection
      ???they list this as an advantage... how simple is this???

  Implementation details
    -- of course this is all layered over GFS
       write: send data to the 3 chunk servers holding replicas
              send commit to the primary replica
              primary commits (and forwards the write order to the secondaries)
              primary replies
    how to make latency OK?
       in the GFS paper they say "we only care about throughput and large files"
       but now they also care about latency for small items (e.g., account information, ...)
       -- batch log writes, group commit (window of vulnerability?)
       experiments show pretty good throughput
       What about latency? (no experiments)
         [my guess at their answer -- end-to-end principle -- build async writes into the
          application b/c you don't want apps waiting for network or disk anyhow...]

  Consistency
    single-row transactions -- seem pretty easy since a row lives on a single tablet
      server; all writes go to the redo log
    why not multi-row?
    suppose you wanted multi-row, how would you do it?

Future work -- how would you do it? Would you do it?
  multi-row transactions
  secondary indices

Project ideas (GFS, MapReduce, BigTable)
  o scale to 10K, 100K servers
  o 100s start-up time in MapReduce
  o sequential/causal consistency for GFS
  o sequential/causal consistency for BigTable
  ...
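Referring back to the "keep a finger in each SSTable" scan above: a sketch of how that merge
works, assuming each SSTable is just a key-sorted list of (key, value) pairs, newest table
first, with None as the toy delete tombstone. heapq supplies the "pick the smallest next
record" step, and for duplicate keys the newest table's version wins. The same routine is
essentially what a merging or major compaction would write back out as a single new SSTable.

import heapq
from itertools import groupby

TOMBSTONE = None          # toy delete marker

def finger(table, age, start_key):
    # One "finger": walk a single key-sorted SSTable from start_key onward,
    # tagging each record with the table's age (0 = newest).
    for key, value in table:
        if key >= start_key:
            yield (key, age, value)

def merged_scan(sstables, start_key=""):
    """sstables: list of key-sorted [(key, value), ...] lists, newest table first.
    Yields (key, value) for live cells, using the newest version of each key."""
    fingers = [finger(t, age, start_key) for age, t in enumerate(sstables)]
    merged = heapq.merge(*fingers)                 # globally sorted by (key, age)
    for key, versions in groupby(merged, key=lambda rec: rec[0]):
        _, _, newest = next(versions)              # smallest age = newest table
        if newest is not TOMBSTONE:                # a tombstone hides older versions
            yield key, newest

# Example: the newer table deletes "b" and overwrites "c".
old = [("a", 1), ("b", 2), ("c", 3)]
new = [("b", TOMBSTONE), ("c", 30)]
assert list(merged_scan([new, old])) == [("a", 1), ("c", 30)]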