1. Google Bigtable
Fay Chang, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes,
Robert E. Gruber
Google, Inc.
OSDI 2006
Adapted by S. Sudarshan from a talk by
Erik Paulson, UW Madison
2. 2
Google Scale
Lots of data
Copies of the web, satellite data, user data, email and
USENET, Subversion backing store
Many incoming requests
No commercial system big enough
Couldn’t afford it if there was one
Might not have made appropriate design choices
Firm believers in the End-to-End argument
450,000 machines (NYTimes estimate, June 14th, 2006)
3. 3
Building Blocks
Scheduler (Google WorkQueue)
Google Filesystem
Chubby Lock service
Two other pieces helpful but not required
Sawzall
MapReduce (despite what the Internet says)
BigTable: build a more application-friendly
storage service using these parts
4. 4
Google File System
Large-scale distributed “filesystem”
Master: responsible for metadata
Chunk servers: responsible for reading
and writing large chunks of data
Chunks replicated on 3 machines, master
responsible for ensuring replicas exist
OSDI ’04 Paper
5. 5
Chubby
{lock/file/name} service
Coarse-grained locks, can store small
amount of data in a lock
5 replicas, need a majority vote to be
active
Also an OSDI ’06 Paper
6. 6
Data model: a big map
<Row, Column, Timestamp> triple as the key; lookup, insert, and delete API
Arbitrary “columns” on a row-by-row basis
Columns are named family:qualifier; the family is heavyweight, the qualifier lightweight
Column-oriented physical store- rows are sparse!
Does not support a relational model
No table-wide integrity constraints
No multirow transactions
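To make the model concrete, here is a minimal sketch (illustrative class and method names, not Bigtable's actual API) of a map keyed by a (row, column, timestamp) triple with lookup, insert, and delete operations:

    # Minimal sketch of the data model: a map keyed by (row, column, timestamp).
    # Class and method names here are illustrative, not Bigtable's API.
    import time

    class BigMap:
        def __init__(self):
            self.cells = {}                                   # (row, "family:qualifier", ts) -> value

        def insert(self, row, column, value, ts=None):
            ts = ts if ts is not None else time.time()
            self.cells[(row, column, ts)] = value

        def lookup(self, row, column):
            # Return the most recent version for (row, column), or None.
            versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                        if r == row and c == column]
            return max(versions)[1] if versions else None

        def delete(self, row, column):
            for key in [k for k in self.cells if k[0] == row and k[1] == column]:
                del self.cells[key]

    m = BigMap()
    m.insert("com.cnn.www", "contents:", "<html>...</html>")
    print(m.lookup("com.cnn.www", "contents:"))               # -> "<html>...</html>"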
7. 7
SSTable
Immutable, sorted file of key-value
pairs
Chunks of data plus an index
Index is of block ranges, not values
[Diagram: an SSTable consists of 64K data blocks plus a block index]
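As a rough sketch of the layout above (in-memory Python stand-ins rather than on-disk files in GFS), sorted key-value pairs are packed into ~64K blocks and an index records the first key of each block, so a lookup binary-searches the index and then scans a single block:

    # Sketch of an SSTable-style layout: sorted key-value pairs packed into
    # ~64 KB blocks, plus an index of block ranges (first key of each block).
    # Illustrative only; real SSTables are immutable files in GFS.
    import bisect

    BLOCK_SIZE = 64 * 1024

    class SSTable:
        def __init__(self, sorted_items):                 # [(key, value)] sorted by key
            self.blocks, self.index = [], []
            block, size = [], 0
            for k, v in sorted_items:
                if not block:
                    self.index.append(k)                  # index is of block ranges, not values
                block.append((k, v))
                size += len(k) + len(v)
                if size >= BLOCK_SIZE:
                    self.blocks.append(block)
                    block, size = [], 0
            if block:
                self.blocks.append(block)

        def lookup(self, key):
            i = bisect.bisect_right(self.index, key) - 1  # which block could hold the key?
            if i < 0:
                return None
            return dict(self.blocks[i]).get(key)          # scan only that block

    sst = SSTable(sorted({"apple": "1", "boat": "2"}.items()))
    print(sst.lookup("boat"))                             # -> "2"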
8. 8
Tablet
Contains some range of rows of the table
Built out of multiple SSTables
[Diagram: a tablet (Start: aardvark, End: apple) built from multiple SSTables, each with its own index and 64K blocks]
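Building on the illustrative SSTable class sketched above, a tablet can be modeled as a row range served from a stack of SSTables, with reads consulting newer SSTables before older ones (a sketch only, not Bigtable's real structures):

    # Sketch of a tablet: a contiguous row range served from several SSTables.
    # Uses the illustrative SSTable class sketched earlier.
    class Tablet:
        def __init__(self, start_row, end_row, sstables):
            self.start_row, self.end_row = start_row, end_row
            self.sstables = sstables                      # ordered newest first

        def contains(self, row):
            return self.start_row <= row <= self.end_row

        def lookup(self, row):
            if not self.contains(row):
                return None                               # this tablet does not serve the row
            for sst in self.sstables:                     # newer SSTables shadow older ones
                value = sst.lookup(row)
                if value is not None:
                    return value
            return None

    t = Tablet("aardvark", "apple", [SSTable([("apple", "fruit")])])
    print(t.lookup("apple"))                              # -> "fruit"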
9. 9
Table
Multiple tablets make up the table
SSTables can be shared
Tablets do not overlap, SSTables can overlap
[Diagram: a table made of non-overlapping tablets (aardvark to apple, apple_two_E to boat) that share SSTables]
10. 10
Finding a tablet
METADATA table stores Key: table id + end row, Data: tablet location
(location hierarchy: Chubby file -> root metadata tablet -> metadata tablets -> user tablets)
Locations are cached at clients, which may detect that cached data is incorrect,
in which case a lookup on the hierarchy is performed
Also prefetched (for range queries)
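The lookup-by-end-row idea can be sketched as follows; the table names, server names, and helper function are hypothetical, and a real client would also cache and invalidate these locations as described above:

    # Sketch of the METADATA lookup: tablet locations keyed by (table id, end row),
    # so the tablet serving a row is found with a successor search on end row.
    # Hypothetical names and data, not the real METADATA format.
    import bisect

    metadata = {
        ("users", "apple"):    "tabletserver-17",   # tablet ending at "apple"
        ("users", "boat"):     "tabletserver-03",   # tablet ending at "boat"
        ("users", "\xff\xff"): "tabletserver-42",   # last tablet
    }

    def find_tablet(table_id, row):
        end_rows = sorted(end for (t, end) in metadata if t == table_id)
        i = bisect.bisect_left(end_rows, row)        # first end row >= row
        return metadata[(table_id, end_rows[i])]

    print(find_tablet("users", "aardvark"))          # -> tabletserver-17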
11. 11
Servers
Tablet servers manage tablets, multiple tablets
per server. Each tablet is 100-200 MB
Each tablet lives at only one server
Tablet server splits tablets that get too big
Master responsible for load balancing and fault
tolerance
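A sketch of the split decision a tablet server might make; the size bound is taken from the slide, and the function and its policy (split at the middle row) are illustrative assumptions:

    # Sketch of the split check a tablet server might run.  The size bound comes
    # from the slide (tablets of roughly 100-200 MB); names are illustrative.
    TABLET_MAX_BYTES = 200 * 1024 * 1024

    def choose_split_row(tablet_rows, tablet_size_bytes):
        # tablet_rows: sorted list of row keys currently in the tablet.
        if tablet_size_bytes <= TABLET_MAX_BYTES:
            return None                               # no split needed
        return tablet_rows[len(tablet_rows) // 2]     # split roughly in the middle;
                                                      # caller updates METADATA and notifies the master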
12. 12
Master’s Tasks
Use Chubby to monitor health of tablet servers,
restart failed servers
Tablet server registers itself by acquiring a lock in a specific Chubby directory
Chubby gives “lease” on lock, must be renewed periodically
Server loses lock if it gets disconnected
Master monitors this directory to find which servers
exist/are alive
If server not contactable/has lost lock, master grabs lock and
reassigns tablets
GFS replicates data. Prefer to start a tablet server on a machine where the data already resides
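The registration and lease protocol can be sketched against a hypothetical Chubby-like client; chubby.acquire and lock.renew are assumed interfaces made up for illustration, not Chubby's real API:

    # Sketch of the registration/lease protocol described above, against a
    # hypothetical Chubby-like client.
    import time

    def tablet_server_loop(chubby, server_id, lease_seconds=10):
        # Register by taking a lock file under the servers directory.
        lock = chubby.acquire(f"/bigtable/servers/{server_id}")
        while True:
            time.sleep(lease_seconds / 2)
            if not lock.renew():        # lease lost, e.g. disconnected from Chubby
                break                   # stop serving; the master will reassign our tablets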
13. 13
Master’s Tasks (Cont)
When (new) master starts
grabs master lock on chubby
Ensures only one master at a time
Finds live servers (scan chubby directory)
Communicates with servers to find assigned
tablets
Scans metadata table to find all tablets
Keeps track of unassigned tablets, assigns them
Root metadata tablet location comes from Chubby; other metadata tablets are assigned before the scan.
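A sketch of that startup sequence, using hypothetical chubby/metadata/server interfaces (acquire, list, list_tablets, all_tablets are assumptions for illustration) to make the ordering explicit:

    # Sketch of master startup: lock, find live servers, learn their tablets,
    # scan metadata, then assign whatever is left.  Not a real API.
    def master_startup(chubby, servers, metadata_table):
        chubby.acquire("/bigtable/master-lock")            # ensures a single active master
        live = chubby.list("/bigtable/servers")            # scan the Chubby directory of live servers
        assigned = set()
        for server_id in live:
            assigned |= set(servers[server_id].list_tablets())   # tablets already assigned
        all_tablets = set(metadata_table.all_tablets())    # scan the metadata table for every tablet
        unassigned = all_tablets - assigned
        return unassigned                                  # the master assigns these to live servers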
14. 14
Metadata Management
Master handles
table creation, and merging of tablets
Tablet servers directly update metadata on
tablet split, then notify master
a lost notification may be detected lazily by the master
15. 15
Editing a table
Mutations are logged, then applied to an in-memory
memtable
May contain “deletion” entries to handle updates
Group commit on log: collect multiple updates before log flush
[Diagram: inserts and deletes are appended to the tablet log in GFS and applied to the in-memory memtable; the tablet's SSTables remain in GFS]
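A sketch of this write path (illustrative names; the real commit log and SSTables live in GFS): log the batch first with a single group-commit flush, then apply it to the memtable, with None standing in for a deletion entry:

    # Sketch of the write path on a tablet server.
    import io

    class TabletWriter:
        def __init__(self, log_file):
            self.log = log_file
            self.memtable = {}                            # (row, column) -> (timestamp, value or None)

        def apply(self, mutations):
            # mutations: list of (row, column, timestamp, value_or_None)
            for row, column, ts, value in mutations:      # group commit: log the whole batch,
                self.log.write(f"{row}\t{column}\t{ts}\t{value}\n")
            self.log.flush()                              # ...flush once...
            for row, column, ts, value in mutations:      # ...then apply to the memtable
                self.memtable[(row, column)] = (ts, value)

    w = TabletWriter(io.StringIO())
    w.apply([("com.cnn.www", "anchor:cnnsi.com", 1, "CNN"),
             ("com.cnn.www", "anchor:cnnsi.com", 2, None)])   # second entry is a deletion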
16. 16
Compactions
Minor compaction – convert the memtable into
an SSTable
Reduce memory usage
Reduce log traffic on restart
Merging compaction
Reduce number of SSTables
Good place to apply policy “keep only N versions”
Major compaction
Merging compaction that results in only one SSTable
No deletion records, only live data
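A merging compaction can be sketched as a k-way merge over sorted runs that keeps at most N versions per cell; in the major-compaction case, deletion markers and the versions they hide are dropped so only live data remains (illustrative code, not the real compaction path):

    # Sketch of a merging compaction over sorted runs (memtable dump plus SSTables).
    import heapq
    from itertools import groupby

    def compact(runs, keep_versions=3, major=False):
        # Each run: [((row, column, -timestamp), value_or_None)] sorted ascending,
        # so within a cell the newest version comes first.
        merged = heapq.merge(*runs)                       # k-way merge of sorted runs
        out = []
        for _, group in groupby(merged, key=lambda kv: kv[0][:2]):
            versions = list(group)
            if major:
                for i, (_, value) in enumerate(versions):
                    if value is None:                     # deletion marker hides all older versions
                        versions = versions[:i]
                        break
            out.extend(versions[:keep_versions])          # "keep only N versions" policy
        return out

    run1 = [(("r1", "c1", -2), "new"), (("r1", "c1", -1), "old")]
    run2 = [(("r2", "c1", -3), None)]
    print(compact([run1, run2], major=True))              # r2/c1 disappears entirely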
17. 17
Locality Groups
Group column families together into an SSTable
Avoid mingling data, e.g. page contents and page metadata
Can keep some groups all in memory
Can compress locality groups
Bloom Filters on SSTables in a locality group
a bitmap over key-value hashes, used to overestimate which records exist
avoids searching an SSTable if the bits are not set (see the sketch after this list)
Tablet movement
Major compaction (with concurrent updates)
Minor compaction (to catch up with updates) without any
concurrent updates
Load on new server without requiring any recovery action
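A minimal Bloom-filter sketch as referenced above (illustrative parameters and hashing, not Bigtable's implementation): a negative answer is definite, so a read can skip an SSTable entirely:

    # Minimal Bloom filter: a bit array plus k hash functions; answers
    # "maybe present" or "definitely absent".
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 16, num_hashes=4):
            self.bits = bytearray(num_bits // 8)
            self.num_bits, self.num_hashes = num_bits, num_hashes

        def _positions(self, key):
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.num_bits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    bf = BloomFilter()
    bf.add(("com.cnn.www", "contents:"))
    print(bf.might_contain(("com.cnn.www", "contents:")))    # True
    print(bf.might_contain(("missing", "row")))              # almost certainly False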
18. 18
Log Handling
Commit log is per server, not per tablet (why?)
complicates tablet movement
when server fails, tablets divided among multiple
servers
can cause heavy scan load by each such server
optimization to avoid multiple separate scans: sort log by
(table, rowname, LSN), so logs for a tablet are clustered,
then distribute
GFS delay spikes can stall log writes (which are time-critical)
solution: two separate logs, one active at a time
entries may be duplicated between the two logs
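The sort-then-distribute optimization can be sketched as follows; the entry layout (table, row, LSN, mutation) is an assumption made for illustration:

    # Sketch of the log-recovery optimization above: sort the dead server's log
    # by (table, row, sequence number) so each tablet's entries are contiguous,
    # then each recovering server reads only its own slice.
    def cluster_log(entries):
        # entries: [(table, row, lsn, mutation)]
        return sorted(entries, key=lambda e: (e[0], e[1], e[2]))

    def entries_for_tablet(clustered, table, start_row, end_row):
        # Linear filter for clarity; with the clustered order a binary search
        # could jump straight to the tablet's contiguous region.
        return [e for e in clustered
                if e[0] == table and start_row <= e[1] <= end_row]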
19. 19
Immutability
SSTables are immutable
simplifies caching, sharing across GFS etc
no need for concurrency control
SSTables of a tablet recorded in METADATA table
Garbage collection of SSTables done by master
On tablet split, the child tablets can start off quickly on shared SSTables, splitting them lazily
Only the memtable has concurrent reads and updates
copy-on-write rows allow concurrent read/write
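A sketch of copy-on-write rows in a memtable: a writer builds a new immutable version of the row and swaps the reference, so readers never block (illustrative, with a single writer lock for simplicity):

    # Sketch of copy-on-write rows: readers see an immutable snapshot of a row,
    # writers build a new version and swap the reference.
    import threading

    class CowMemtable:
        def __init__(self):
            self.rows = {}                      # row key -> immutable tuple of (column, ts, value)
            self.write_lock = threading.Lock()  # writers are serialized; readers take no lock

        def read(self, row):
            return self.rows.get(row, ())       # consistent immutable snapshot

        def write(self, row, column, ts, value):
            with self.write_lock:
                old = self.rows.get(row, ())
                self.rows[row] = old + ((column, ts, value),)   # copy on write: new tuple, swap reference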
23. 23
Lessons learned
Interesting point: only implement some of the requirements, since the rest are probably not needed
Many types of failure possible
Big systems need proper systems-level
monitoring
Value simple designs