Google Bigtable
Fay Chang, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes,
Robert E. Gruber
Google, Inc.
OSDI 2006
Adapted by S. Sudarshan from a talk by
Erik Paulson, UW Madison
Google Scale
 Lots of data
 Copies of the web, satellite data, user data, email and
USENET, Subversion backing store
 Many incoming requests
 No commercial system big enough
 Couldn’t afford it if there was one
 Might not have made appropriate design choices
 Firm believers in the End-to-End argument
 450,000 machines (NYTimes estimate, June 14, 2006)
Building Blocks
 Scheduler (Google WorkQueue)
 Google Filesystem
 Chubby Lock service
 Two other pieces helpful but not required
Sawzall
MapReduce (despite what the Internet says)
 BigTable: build a more application-friendly
storage service using these parts
Google File System
 Large-scale distributed “filesystem”
 Master: responsible for metadata
 Chunk servers: responsible for reading
and writing large chunks of data
 Chunks replicated on 3 machines, master
responsible for ensuring replicas exist
 OSDI ’04 Paper
Chubby
 {lock/file/name} service
 Coarse-grained locks, can store small
amount of data in a lock
 5 replicas, need a majority vote to be
active
 Also an OSDI ’06 Paper
Data model: a big map
 <Row, Column, Timestamp> triple as the key; the API provides lookup, insert, and delete
 Arbitrary “columns” on a row-by-row basis
 Column names are family:qualifier; families are heavyweight (few, fixed per table), qualifiers lightweight (arbitrary, per row)
 Column-oriented physical store; rows are sparse!
 Does not support a relational model
 No table-wide integrity constraints
 No multirow transactions
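The map can be pictured in a few lines of Python. A toy sketch, not Bigtable's real client API (names are illustrative; real deletes write tombstones, covered later):

```python
import time

class BigMap:
    """Toy model of the data model: cells keyed by (row, column, timestamp).
    Column names are 'family:qualifier' strings. Not Bigtable's real client
    API; deletes here remove cells, where the real system writes tombstones."""

    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}

    def insert(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.cells.setdefault((row, column), {})[ts] = value

    def lookup(self, row, column):
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None  # newest wins

    def delete(self, row, column):
        self.cells.pop((row, column), None)

t = BigMap()
t.insert("com.cnn.www", "contents:", "<html>...</html>")  # row key from the paper
t.insert("com.cnn.www", "anchor:cnnsi.com", "CNN")
print(t.lookup("com.cnn.www", "anchor:cnnsi.com"))        # -> CNN
```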
SSTable
 Immutable, sorted file of key-value
pairs
 Chunks of data plus an index
Index is of block ranges, not values
[Figure: an SSTable is a sequence of 64K blocks plus an index of block ranges]
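A toy sketch of why the index maps block ranges rather than values: it narrows a lookup to a single block, which is then scanned. Structure and names are assumptions; block size is shrunk for illustration:

```python
import bisect

class SSTable:
    """Toy immutable SSTable: sorted (key, value) pairs grouped into fixed-
    size blocks plus an in-memory index of each block's first key. Block
    size is shrunk from 64K to 4 entries for illustration."""

    BLOCK_SIZE = 4

    def __init__(self, sorted_pairs):
        self.blocks = [sorted_pairs[i:i + self.BLOCK_SIZE]
                       for i in range(0, len(sorted_pairs), self.BLOCK_SIZE)]
        self.index = [block[0][0] for block in self.blocks]  # first key per block

    def get(self, key):
        # The index narrows the lookup to one block; only that block is scanned.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

sst = SSTable([("a", 1), ("b", 2), ("f", 3), ("k", 4), ("m", 5)])
print(sst.get("k"))  # -> 4
```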
Tablet
 Contains some range of rows of the table
 Built out of multiple SSTables
[Figure: a tablet (start: aardvark, end: apple) built from two SSTables, each made of 64K blocks plus an index]
Table
 Multiple tablets make up the table
 SSTables can be shared
 Tablets do not overlap, SSTables can overlap
[Figure: two tablets, aardvark to apple and apple_two_E to boat, built over four SSTables, some shared between tablets]
Finding a tablet
 Tablet locations form a hierarchy: a Chubby file points to the root tablet, which indexes METADATA tablets; each METADATA entry stores Key: table id + end row, Data: location
 Locations cached at clients, which may detect cached data to be incorrect
 in which case, a lookup up the hierarchy is performed
 Also prefetched (for range queries)
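A sketch of the lookup, assuming a flat client cache of METADATA entries (a real cache miss walks the hierarchy); keying tablets by end row lets a binary search find the owner:

```python
import bisect

# Toy client-side cache of METADATA entries for one table, sorted by end
# row: (end_row, tablet_location). Because tablets are keyed by their END
# row, the owning tablet is the first entry whose end_row >= the lookup
# key. Server names are illustrative.
metadata = [
    ("apple", "tabletserver-17"),
    ("boat", "tabletserver-02"),
    ("\xff\xff\xff\xff", "tabletserver-09"),  # last tablet: open-ended range
]
end_rows = [end for end, _ in metadata]

def locate(row_key):
    i = bisect.bisect_left(end_rows, row_key)
    return metadata[i][1]

print(locate("aardvark"))     # -> tabletserver-17
print(locate("apple_two_E"))  # -> tabletserver-02
```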
Servers
 Tablet servers manage tablets, multiple tablets
per server. Each tablet is 100-200 MB
 Each tablet lives at only one server
 Tablet server splits tablets that get too big (see the sketch below)
 Master responsible for load balancing and fault
tolerance
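A sketch of the size-triggered split, assuming the 200 MB upper bound from this slide; the split key is chosen where accumulated size crosses roughly half:

```python
# Toy size-triggered tablet split. MAX_TABLET_BYTES follows the slide's
# 100-200 MB figure; the split key is chosen where accumulated size
# crosses roughly half the total, so data (not key space) is halved.

MAX_TABLET_BYTES = 200 * 1024 * 1024

def maybe_split(start, end, rows):
    """rows: sorted (key, size_bytes) pairs in the range [start, end)."""
    total = sum(size for _, size in rows)
    if total <= MAX_TABLET_BYTES:
        return [(start, end)]           # small enough: leave as one tablet
    acc = 0
    for key, size in rows:
        acc += size
        if acc >= total // 2:
            return [(start, key), (key, end)]

print(maybe_split("aardvark", "boat",
                  [("apple", 150 * 1024 * 1024), ("axe", 120 * 1024 * 1024)]))
# -> [('aardvark', 'apple'), ('apple', 'boat')]
```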
Master’s Tasks
 Use Chubby to monitor health of tablet servers,
restart failed servers
 Tablet server registers itself by getting a lock in a specific Chubby directory
 Chubby gives “lease” on lock, must be renewed periodically
 Server loses lock if it gets disconnected
 Master monitors this directory to find which servers
exist/are alive
 If server not contactable/has lost lock, master grabs lock and
reassigns tablets
 GFS replicates the data; prefer to start a tablet server on a machine that already holds the data
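A toy simulation of this protocol, with a dict standing in for the Chubby directory and grab_lock() for the master's probe; all names are illustrative:

```python
# Toy simulation of the liveness protocol: a dict stands in for the Chubby
# servers directory, and grab_lock() for the master's attempt to acquire a
# server's lock (which only succeeds once the server's lease has lapsed).

servers_dir = {"ts-1": "alive", "ts-2": "alive"}  # lock state per server
assignments = {"ts-1": ["tablet-a"], "ts-2": ["tablet-b", "tablet-c"]}
unassigned = []

def grab_lock(server):
    return servers_dir.get(server) != "alive"

def master_scan():
    # The master monitors the directory; if it can grab a server's lock,
    # that server is dead or partitioned, so its tablets are reclaimed.
    for server in list(assignments):
        if grab_lock(server):
            unassigned.extend(assignments.pop(server))

servers_dir["ts-2"] = "expired"   # ts-2 gets disconnected, lease lapses
master_scan()
print(unassigned)                 # -> ['tablet-b', 'tablet-c']
```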
Master’s Tasks (Cont)
 When a (new) master starts
grabs the master lock on Chubby
 Ensures only one master at a time
Finds live servers (scan chubby directory)
Communicates with servers to find assigned
tablets
Scans metadata table to find all tablets
 Keeps track of unassigned tablets, assigns them
 Root metadata tablet location comes from Chubby; other metadata tablets are assigned before the scan
Metadata Management
 Master handles
table creation and merging of tablets
 Tablet servers directly update metadata on
tablet split, then notify master
a lost notification may be detected lazily by the master
Editing a table
 Mutations are logged, then applied to an in-memory
memtable
 May contain “deletion” entries (tombstones), since data already in SSTables cannot be updated in place
 Group commit on log: collect multiple updates before log flush
[Figure: inserts and deletes append to the tablet log in GFS and apply to an in-memory memtable; the tablet's SSTables in GFS hold older data]
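A sketch of this write path under the stated assumptions: a list stands in for the GFS log, group commit is a fixed-size buffer, and deletes become tombstones:

```python
class TabletWriter:
    """Toy write path: mutations append to a commit log, then apply to an
    in-memory memtable. Deletes are recorded as tombstones because the
    on-disk SSTables are immutable. A list stands in for the GFS log."""

    GROUP_SIZE = 3          # mutations buffered per log flush (group commit)
    TOMBSTONE = object()    # marker for deleted rows

    def __init__(self):
        self.log_buffer = []  # mutations not yet flushed to the "log"
        self.log = []         # stand-in for the tablet log in GFS
        self.memtable = {}    # kept sorted in reality; a dict suffices here

    def _maybe_flush(self):
        if len(self.log_buffer) >= self.GROUP_SIZE:
            self.log.extend(self.log_buffer)  # one flush covers many updates
            self.log_buffer.clear()

    def insert(self, row, value):
        self.log_buffer.append(("insert", row, value))
        self._maybe_flush()
        self.memtable[row] = value

    def delete(self, row):
        self.log_buffer.append(("delete", row, None))
        self._maybe_flush()
        self.memtable[row] = self.TOMBSTONE

t = TabletWriter()
t.insert("apple_two_E", "x")
t.delete("apple_two_E")
t.insert("boat", "y")
print(len(t.log))  # -> 3: all three mutations hit the log in one flush
```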
Compactions
 Minor compaction – convert the memtable into
an SSTable
 Reduce memory usage
 Reduce log traffic on restart
 Merging compaction
 Reduce number of SSTables
 Good place to apply policy “keep only N versions”
 Major compaction
 Merging compaction that results in only one SSTable
 No deletion records, only live data
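A sketch of a merging compaction over sorted runs, applying the keep-only-N-versions policy; tombstones are simplified to whole-row deletes:

```python
import heapq

TOMBSTONE = "__deleted__"

def merging_compaction(runs, keep_versions=2, major=False):
    """Merge sorted runs of ((row, -timestamp), value) cells into one run.
    Tombstones are simplified to whole-row deletes. Only a major compaction
    may drop tombstones: afterwards no older SSTable remains that could
    resurrect the deleted row."""
    out, kept, deleted = [], {}, set()
    for (row, neg_ts), value in heapq.merge(*runs):
        if row in deleted:
            continue                      # newer tombstone masks older cells
        if value is TOMBSTONE:
            deleted.add(row)
            if not major:
                out.append(((row, neg_ts), value))  # must survive until major
            continue
        if kept.get(row, 0) < keep_versions:        # "keep only N versions"
            kept[row] = kept.get(row, 0) + 1
            out.append(((row, neg_ts), value))
    return out

# Timestamps are negated so heapq.merge yields newest versions first.
memtable_run = [(("apple", -30), TOMBSTONE)]
sstable_run = [(("apple", -20), "v2"), (("apple", -10), "v1"), (("boat", -10), "b1")]
print(merging_compaction([memtable_run, sstable_run], major=True))
# -> [(('boat', -10), 'b1')]
```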
Locality Groups
 Group column families together into an SSTable
 Avoid mingling data, e.g. page contents and page metadata
 Can keep some groups all in memory
 Can compress locality groups
 Bloom Filters on SSTables in a locality group
 bitmap on key/value hash, used to overestimate which records exist (see the sketch after this slide)
 avoid searching SSTable if bit not set
 Tablet movement
 Major compaction (with concurrent updates)
 Minor compaction (to catch up with updates) without any
concurrent updates
 Load on new server without requiring any recovery action
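A minimal Bloom filter sketch along these lines (not Bigtable's actual hash scheme):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array. A clear bit
    proves absence; all-set bits only suggest presence (false positives
    possible, false negatives not). Not Bigtable's actual hash scheme."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("aardvark")
print(bf.might_contain("aardvark"))  # -> True
print(bf.might_contain("zebra"))     # -> almost surely False: skip the SSTable
```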
Log Handling
 Commit log is per server, not per tablet (why?)
 complicates tablet movement
 when server fails, tablets divided among multiple
servers
 can cause heavy scan load by each such server
 optimization to avoid multiple separate scans: sort the log by (table, rowname, LSN) so each tablet's entries are clustered, then distribute (sketched below)
 GFS delay spikes can mess up log write (time
critical)
 solution: two separate logs, one active at a time
 can have duplicates between these two
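A sketch of the sort-then-distribute idea; the log entry layout is hypothetical:

```python
# Toy recovery optimization. The failed server's commit log interleaves
# entries from many tablets; sorting by (table, row, LSN) clusters each
# tablet's entries, so every recovering server reads one contiguous run
# instead of scanning the whole log. Entry layout here is hypothetical.

log = [
    ("users", "bob",   7, "insert"),
    ("pages", "cnn",   3, "insert"),
    ("users", "alice", 5, "delete"),
    ("pages", "cnn",   4, "insert"),
]

log.sort()  # (table, row, LSN) order

def entries_for(table, start_row, end_row):
    """Replay slice for one recovering tablet: contiguous after the sort."""
    return [e for e in log if e[0] == table and start_row <= e[1] < end_row]

print(entries_for("users", "a", "z"))
# -> [('users', 'alice', 5, 'delete'), ('users', 'bob', 7, 'insert')]
```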
Immutability
 SSTables are immutable
 simplifies caching, sharing across GFS, etc.
 no need for concurrency control
 SSTables of a tablet recorded in METADATA table
 Garbage collection of SSTables done by master
 On tablet split, the child tablets can start off quickly on shared SSTables, splitting them lazily
 Only the memtable sees concurrent reads and updates
 copy-on-write rows allow concurrent read/write (see the sketch below)
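A sketch of copy-on-write for the memtable; for brevity it copies the whole map per write, where the real system copies at row granularity:

```python
import threading

class CowMemtable:
    """Copy-on-write memtable sketch. Writers build a modified copy and
    swap the reference, so readers never lock: a reader that captured
    self.rows keeps a consistent snapshot. For brevity the whole map is
    copied per write; Bigtable copies at row granularity."""

    def __init__(self):
        self.rows = {}
        self._write_lock = threading.Lock()  # writers serialize with writers only

    def read(self, row):
        return self.rows.get(row)            # lock-free snapshot read

    def write(self, row, value):
        with self._write_lock:
            new_rows = dict(self.rows)       # copy on write
            new_rows[row] = value
            self.rows = new_rows             # atomic reference swap

m = CowMemtable()
m.write("aardvark", "v1")
print(m.read("aardvark"))  # -> v1
```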
Microbenchmarks
[Figure-only slides: microbenchmark results, charts not preserved]
Application at Google
Lessons learned
 Interesting point: only implement some of the requirements, since the rest are probably not needed
 Many types of failure possible
 Big systems need proper systems-level
monitoring
 Value simple designs