Outline
                Introduction
                      Design
             Implementation
                     Results
                 Conclusions




Bigtable: A Distributed Storage System for
             Structured Data

                 Alvanos Michalis


                    April 6, 2009




            Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data

1   Introduction
       Motivation
2   Design
       Data model
3   Implementation
       Building blocks
       Tablets
       Compactions
       Refinements
4   Results
       Hardware Environment
       Performance Evaluation
5   Conclusions
       Real applications
       Lessons
       End


Google!

     Lots of different kinds of data!
          Crawling system: URLs, contents, links, anchors, PageRank, etc.
          Per-user data: preferences, recent queries/search history
          Geographic data, images, etc.
     Many incoming requests
     No commercial system is big enough
          Scale is too large for commercial databases
          May not run on their commodity hardware
          No dependence on other vendors
          Optimizations
          Better Price/Performance
          Building internally means the system can be applied across
          many projects for low incremental cost



Google goals



     Fault-tolerant, persistent
     Scalable
          1000s of servers
          Millions of reads/writes, efficient scans
     Self-managing
     Simple!






Bigtable




  Definition
  A Bigtable is a sparse, distributed, persistent multidimensional
  sorted map.

  The map is indexed by a row key, column key, and a timestamp;
  each value in the map is an uninterpreted array of bytes.

  (row:string, column:string, time:int64) -> string
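
The signature above can be illustrated with a minimal in-memory sketch. This is not Bigtable's implementation: the class name `TinyBigtable`, its methods, and the sample keys are assumptions made only for illustration. It shows the three-part key, the sorted order, and sparseness (only cells that were written take up space).

```python
from bisect import insort


class TinyBigtable:
    """Toy in-memory model of (row, column, timestamp) -> bytes."""

    def __init__(self):
        # Keys stay sorted by row, then column, then newest timestamp first.
        self._keys = []      # sorted list of (row, column, -timestamp)
        self._values = {}    # same key -> uninterpreted bytes

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        key = (row, column, -timestamp)
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, row: str, column: str):
        """Return the most recent value of a cell, or None if it is absent."""
        for r, c, neg_ts in self._keys:  # linear scan keeps the sketch short
            if (r, c) == (row, column):
                return self._values[(r, c, neg_ts)]
        return None

    def scan(self):
        """Yield (row, column, timestamp, value) in sorted key order."""
        for r, c, neg_ts in self._keys:
            yield r, c, -neg_ts, self._values[(r, c, neg_ts)]


table = TinyBigtable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3")
table.put("com.cnn.www", "contents:", 5, b"<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
latest = table.get("com.cnn.www", "contents:")
```

Negating the timestamp inside the key is just a convenient way to make the newest version of a cell sort first.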


Rows




       The row keys in a table are arbitrary strings
       Every read or write of data under a single row key is atomic
       Bigtable maintains data in lexicographic order by row key




Column Families




     Column keys are grouped into sets called column families
     All data stored in a column family is usually of the same type
     A column family must be created before data can be stored
     under any column key in that family
     A column key is named using the following syntax:
     family:qualifier
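
The two rules on this slide (the `family:qualifier` syntax, and "create the family first") can be sketched as follows. The `Table` class, its method names, and the sample data are hypothetical; real Bigtable exposes a different client API.

```python
class Table:
    """Toy table that enforces 'create the family before writing to it'."""

    def __init__(self):
        self.families = set()
        self.cells = {}  # (row, column_key) -> value

    def create_family(self, family: str) -> None:
        self.families.add(family)

    def put(self, row: str, column_key: str, value: bytes) -> None:
        # A column key has the form family:qualifier.
        # In this sketch the qualifier is allowed to be empty.
        family, sep, qualifier = column_key.partition(":")
        if not sep:
            raise ValueError(f"expected family:qualifier, got {column_key!r}")
        if family not in self.families:
            raise KeyError(f"column family {family!r} has not been created")
        self.cells[(row, column_key)] = value


t = Table()
t.create_family("anchor")
t.put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
try:
    t.put("com.cnn.www", "language:", b"EN")  # family was never created
    rejected = False
except KeyError:
    rejected = True
```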


Timestamps


     Each cell in a Bigtable can contain multiple versions of the
     same data; these versions are indexed by timestamp (64-bit
     integers).
     Applications that need to avoid collisions must generate
     unique timestamps themselves.
     To make the management of versioned data less onerous, Bigtable
     supports two per-column-family settings that tell it to
     garbage-collect cell versions automatically.
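
The slide does not name the two settings; in the Bigtable paper they are "keep only the last n versions" and "keep only new-enough versions". A minimal sketch of both, where the function name `gc_versions` and its parameter names are assumptions made for illustration:

```python
def gc_versions(versions, max_versions=None, min_timestamp=None):
    """Return the versions of one cell that survive garbage collection.

    versions: list of (timestamp, value) pairs for a single cell.
    max_versions: if set, keep only the newest N versions.
    min_timestamp: if set, keep only versions at least this new.
    """
    kept = sorted(versions, key=lambda tv: tv[0], reverse=True)  # newest first
    if min_timestamp is not None:
        kept = [(ts, v) for ts, v in kept if ts >= min_timestamp]
    if max_versions is not None:
        kept = kept[:max_versions]
    return kept


cell = [(10, b"a"), (30, b"c"), (20, b"b"), (40, b"d")]
last_two = gc_versions(cell, max_versions=2)
recent = gc_versions(cell, min_timestamp=20)
```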






Infrastructure


      Google WorkQueue (scheduler)
      GFS: large-scale distributed file system
          Master: responsible for metadata
          Chunk servers: responsible for reading and writing large chunks of data
          Chunks replicated on 3 machines; master responsible for ensuring replicas exist
      Chubby: lock/file/name service
          Coarse-grained locks; can store a small amount of data in a lock
          5 replicas; need a majority vote to be active






SSTable




     Lives in GFS
     Immutable, sorted file of key-value pairs
     Chunks of data plus an index
     Index is of block ranges, not values
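
A rough sketch of why an index over block ranges is enough: a lookup binary-searches the small index and then reads a single block. Everything here is a simplification and an assumption for illustration: the class `ToySSTable`, the in-memory lists standing in for a file in GFS, and the tiny block size of two entries.

```python
from bisect import bisect_left


class ToySSTable:
    """Immutable, sorted key-value file modelled as in-memory blocks."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items)
        # Split the sorted pairs into fixed-size blocks ("chunks of data").
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        # The index holds only the last key of each block, not the values.
        self.index = [block[-1][0] for block in self.blocks]
        self.blocks_read = 0

    def get(self, key):
        # Binary-search the index for the first block whose last key >= key.
        i = bisect_left(self.index, key)
        if i == len(self.blocks):
            return None            # key is past the end of the file
        self.blocks_read += 1      # stands in for one disk read
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None


sst = ToySSTable([("d", 4), ("a", 1), ("c", 3), ("b", 2), ("e", 5)])
found = sst.get("c")
missing = sst.get("z")
```

Because the index is small, it can be kept in memory, so a lookup costs at most one block read.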



Tablet Design




     Large tables broken into tablets at row boundaries
         Tablets hold contiguous rows
         Approx. 100-200 MB of data per tablet
     Approx. 100 tablets per machine
         Fast recovery
         Load-balancing
     Built out of multiple SSTables
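
To make "broken into tablets at row boundaries" concrete, here is a static sketch that cuts a sorted run of rows into size-limited tablets. Bigtable splits tablets dynamically as they grow, so the one-shot function `split_into_tablets`, its parameters, and the sample sizes are assumptions for illustration only.

```python
def split_into_tablets(rows, max_bytes):
    """Split rows sorted by key into tablets of contiguous rows.

    rows: list of (row_key, size_in_bytes), already sorted by row_key.
    Returns a list of tablets, each a list of row keys. A row is never
    divided, so every tablet begins and ends at a row boundary.
    """
    tablets, current, current_bytes = [], [], 0
    for key, size in rows:
        # Start a new tablet when adding this row would exceed the limit.
        if current and current_bytes + size > max_bytes:
            tablets.append(current)
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        tablets.append(current)
    return tablets


rows = [("a", 60), ("b", 60), ("c", 90), ("d", 30), ("e", 200)]
tablets = split_into_tablets(rows, max_bytes=150)
```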


Tablet Location




      Like a B+-tree, but fixed at 3 levels
      How can we avoid creating a bottleneck at the root?
          Aggressively cache tablet locations
          Lookup starts from the cached leaf location (betting that it is
          correct); on a miss, the client walks back up the hierarchy
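
The caching strategy can be sketched as a client that trusts its cached tablet location first and falls back to an authoritative lookup only on a miss or a stale entry. This sketch collapses the three-level hierarchy into a single `locate` call; `LocationService`, `Client`, the `is_valid` callback, and the `"\uffff"` sentinel end key are all assumptions made for illustration.

```python
from bisect import bisect_left


class LocationService:
    """Toy stand-in for the authoritative tablet location hierarchy."""

    def __init__(self, end_keys, servers):
        self.end_keys = end_keys   # sorted last row key of each tablet
        self.servers = servers     # servers[i] hosts the tablet ending at end_keys[i]
        self.lookups = 0

    def locate(self, row):
        """Expensive lookup: returns (start_key, end_key, server)."""
        self.lookups += 1
        i = bisect_left(self.end_keys, row)
        start = self.end_keys[i - 1] if i > 0 else ""
        return start, self.end_keys[i], self.servers[i]


class Client:
    def __init__(self, service):
        self.service = service
        self.cache = {}            # tablet end key -> (start key, server)

    def find_server(self, row, is_valid=lambda server: True):
        # Bet on the cached entry covering this row, if there is one.
        ends = sorted(self.cache)
        i = bisect_left(ends, row)
        if i < len(ends):
            start, server = self.cache[ends[i]]
            if start < row:
                if is_valid(server):
                    return server
                del self.cache[ends[i]]      # stale entry: fall back
        start, end, server = self.service.locate(row)
        self.cache[end] = (start, server)
        return server


service = LocationService(["m", "\uffff"], ["ts1", "ts2"])
client = Client(service)
first = client.find_server("apple")     # cache miss: authoritative lookup
second = client.find_server("banana")   # same tablet: served from the cache
service.servers[0] = "ts3"              # the tablet moves to another server
third = client.find_server("cherry", is_valid=lambda s: s != "ts1")
```

Most requests never reach the authoritative service, which is how the root avoids becoming a bottleneck.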


Tablet Assignment
     Each tablet is assigned to one tablet server at a time. The
     master keeps track of the set of live tablet servers, and the
     current assignment of tablets to tablet servers.
     Bigtable uses Chubby to keep track of tablet servers. When a
     tablet server starts, it creates, and acquires an exclusive lock
     on, a uniquely-named file in a specific Chubby directory.
     A tablet server stops serving its tablets if it loses its exclusive lock.
     The master is responsible for detecting when a tablet server is
     no longer serving its tablets, and for reassigning those tablets
     as soon as possible.
     When a master is started by the cluster management system,
     it needs to discover the current tablet assignments before it
     can change them.
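
A toy simulation of the assignment rules above: a server counts as live only while it holds its lock, and the master reassigns tablets from servers that no longer do. The class `ToyMaster`, the in-memory set standing in for the Chubby directory, and the least-loaded placement policy are all assumptions made for illustration.

```python
class ToyMaster:
    """Toy master: tracks live servers via lock files and reassigns tablets."""

    def __init__(self):
        self.locks = set()        # servers currently holding their lock
        self.assignment = {}      # tablet -> server

    def server_started(self, server):
        self.locks.add(server)    # server created and locked its file

    def server_lost_lock(self, server):
        self.locks.discard(server)

    def assign(self, tablet):
        # Pick the live server with the fewest tablets (a made-up policy).
        # Assumes at least one live server exists.
        load = {s: 0 for s in self.locks}
        for s in self.assignment.values():
            if s in load:
                load[s] += 1
        target = min(sorted(load), key=load.get)
        self.assignment[tablet] = target
        return target

    def reassign_orphans(self):
        """Reassign every tablet whose server no longer holds its lock."""
        orphans = sorted(t for t, s in self.assignment.items()
                         if s not in self.locks)
        for tablet in orphans:
            self.assign(tablet)
        return orphans


master = ToyMaster()
master.server_started("ts1")
master.server_started("ts2")
for tablet in ["t1", "t2", "t3", "t4"]:
    master.assign(tablet)
master.server_lost_lock("ts1")
moved = master.reassign_orphans()
```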


Serving a Tablet




      Updates are logged
      Each SSTable corresponds to a batch of updates or a
      snapshot of the tablet taken at some earlier time
      Memtable (sorted by key) caches recent updates
      Reads consult both memtable and SSTables
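
The write and read paths on this slide can be sketched as below. Plain dicts stand in for the sorted memtable and for SSTables, and the class `ToyTablet` with its `flush` method is a hypothetical simplification, not Bigtable's tablet server code.

```python
class ToyTablet:
    """Toy tablet: commit log + in-memory memtable + immutable SSTables."""

    def __init__(self):
        self.log = []        # append-only commit log of (key, value)
        self.memtable = {}   # recent updates (a real memtable is kept sorted)
        self.sstables = []   # older data, oldest first; each is a dict here

    def write(self, key, value):
        self.log.append((key, value))   # log the update first
        self.memtable[key] = value      # then apply it in memory

    def flush(self):
        """Freeze the memtable into a new SSTable."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.log.clear()     # flushed updates no longer need replaying

    def read(self, key):
        # Newest data wins: memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


tablet = ToyTablet()
tablet.write("row1", b"old")
tablet.flush()
tablet.write("row1", b"new")
tablet.write("row2", b"x")
```

After a restart, replaying `tablet.log` over the SSTables would rebuild the memtable.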


Compactions

  As write operations execute, the size of the memtable increases.
      Minor compaction: converts the memtable into an SSTable
           Reduce memory usage
           Reduce log traffic on restart
      Merging compaction
           Periodically executed in the background
           Reduce number of SSTables
           Good place to apply a policy such as "keep only N versions"
      Major compaction
           Merging compaction that results in only one SSTable
           No deletion records, only live data
           Reclaim resources.
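
The three kinds of compaction differ mainly in what they merge and whether deletion records survive. A minimal sketch, with dicts standing in for SSTables and a `TOMBSTONE` marker standing in for a deletion record; all names are assumptions for illustration.

```python
TOMBSTONE = object()   # marker standing in for a deletion record


def minor_compaction(memtable):
    """Freeze the memtable into a new immutable, sorted SSTable."""
    return dict(sorted(memtable.items()))


def merging_compaction(sstables):
    """Merge several SSTables (oldest first) into one; newer entries win.

    Deletion records are kept, because older SSTables that were not part
    of this merge may still hold the deleted keys.
    """
    merged = {}
    for sstable in sstables:
        merged.update(sstable)
    return dict(sorted(merged.items()))


def major_compaction(sstables):
    """Merge all SSTables into exactly one and drop deletion records."""
    merged = merging_compaction(sstables)
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}


old = {"a": 1, "b": 2}
newer = minor_compaction({"c": 3, "b": TOMBSTONE})
merged = merging_compaction([old, newer])
final = major_compaction([old, newer])
```

Only the major compaction can safely drop the record for `"b"`, since it is the only one guaranteed to have seen every SSTable.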



Refinements (1/2)


     Locality groups: group column families together into an SSTable.
     Segregating column families that are not typically accessed
     together into separate locality groups enables more efficient reads.
     Locality groups can be compressed, using Bentley and McIlroy’s
     scheme and a fast compression algorithm that looks for
     repetitions.
     Bloom filters on locality groups allow a reader to ask whether an
     SSTable might contain any data for a specified row/column
     pair. This drastically reduces the number of disk seeks required:
     lookups for non-existent rows or columns do not need to touch disk.
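
A Bloom filter answers "might this SSTable contain the key?" with possible false positives but no false negatives, which is why a negative answer lets the reader skip the disk. The sketch below is a generic Bloom filter, not Bigtable's; the bit-array size, the number of hashes, the use of SHA-256, and the `row/column` key encoding are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bloom = BloomFilter()
bloom.add("com.cnn.www/anchor:cnnsi.com")
present = bloom.might_contain("com.cnn.www/anchor:cnnsi.com")
empty_says_no = not BloomFilter().might_contain("com.cnn.www/contents:")
```

A reader would consult the filter before touching an SSTable and go to disk only when `might_contain` returns `True`.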




Refinements (2/2)


     Caching for read performance (two levels of caching)
         Scan Cache: higher-level cache that caches the key-value pairs
         returned by the SSTable interface to the tablet server code.
         Block Cache: lower-level cache that caches SSTable blocks
         that were read from GFS.
     Commit-log implementation
     Speeding up tablet recovery (log entries)
     Exploiting immutability





Hardware Environment


     Tablet servers were configured to use 1 GB of memory and to
     write to a GFS cell consisting of 1786 machines with two 400
     GB IDE hard drives each.
     Each machine had two dual-core Opteron 2 GHz chips
     Enough physical memory to hold the working set of all
     running processes
     Single gigabit Ethernet link
     Two-level tree-shaped switched network with 100-200 Gbps
     aggregate bandwidth at the root.





Results Per Tablet Server




  Number of 1000-byte values read/written per second.




Results Aggregate Rate




  Number of 1000-byte values read/written per second.



Single tablet-server performance


      For random reads, the tablet server executes about 1200 reads per
      second (about 75 MB/s read from GFS), enough to saturate the tablet
      server CPUs because of overheads in the networking stack
      Random and sequential writes perform better than random
      reads (writes are appended to a single commit log, and group
      commit is used)
      No significant difference between random writes and
      sequential writes (both are recorded in the same commit log)
      Sequential reads perform better than random reads (blocks
      fetched from GFS are kept in the block cache)





Scaling


      Aggregate throughput increases dramatically as tablet servers
      are added (e.g. for random reads from memory)
      However, performance does not increase linearly
      Drop in per-server throughput
          Imbalance in load: re-balancing is throttled to reduce the
          number of tablet movements, and the load generated by the
          benchmarks shifts around as the benchmark progresses
          The random read benchmark transfers one 64 KB block over
          the network for every 1000-byte read, which saturates the
          shared 1 Gigabit links
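
The numbers quoted for random reads can be checked with a little arithmetic. The sketch assumes that "64 KB" means 64 × 1024 bytes, that MB/s is measured in units of 2^20 bytes, and that protocol overhead on the link is ignored; the constant names are illustrative.

```python
BLOCK_BYTES = 64 * 1024        # one SSTable block fetched per random read
VALUE_BYTES = 1000             # bytes the benchmark actually asks for
READS_PER_SECOND = 1200        # single tablet server, random reads
LINK_BITS_PER_SECOND = 10**9   # one 1 Gigabit link

# Data pulled from GFS by one tablet server, in units of 2**20 bytes per second.
transfer_mib_per_s = READS_PER_SECOND * BLOCK_BYTES / 2**20

# Bytes crossing the network per byte the client wanted.
amplification = BLOCK_BYTES / VALUE_BYTES

# Upper bound on 64 KB block transfers per second over one link.
max_blocks_per_link = LINK_BITS_PER_SECOND / 8 / BLOCK_BYTES

print(transfer_mib_per_s, amplification, round(max_blocks_per_link))
```

So 1200 reads per second matches the quoted 75 MB/s, each read moves roughly 65 times more data than it returns, and a single gigabit link tops out at under 2000 such block transfers per second, which is why links shared by several tablet servers saturate.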





Real applications




     Google Analytics
     Google Earth
     Personalized Search



Lessons learned
      Large distributed systems are vulnerable to many types of
      failures, not just the standard network partitions and fail-stop
      failures
          Memory and network corruption
          Large clock skew
          Extended and asymmetric network partitions
          Bugs in other systems (Chubby!)
          ...
      Delay adding new features until it is clear how the new
      features will be used
      A practical lesson: the importance of proper system-level
      monitoring
      Keep It Simple!


END!




 QUESTIONS ?




Bigtable: A Distributed Storage System for Structured Data

  • 1. Outline Introduction Design Implementation Results Conclusions Bigtable: A Distributed Storage System for Structured Data Alvanos Michalis April 6, 2009 Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 2. Outline Introduction Design Implementation Results Conclusions 1 Introduction Motivation 2 Design Data model 3 Implementation Building blocks Tablets Compactions Refinements 4 Results Hardware Environment Performance Evaluation 5 Conclusions Real applications Lessons End Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 3. Outline Introduction Design Motivation Implementation Results Conclusions Google! Lots of Different kinds of data! Crawling system URLs, contents, links, anchors, page-rank etc Per-user data: preferences, recent queries/ search history Geographic data, images etc ... Many incoming requests No commercial system is big enough Scale is too large for commercial databases May not run on their commodity hardware No dependence on other vendors Optimizations Better Price/Performance Building internally means the system can be applied across many projects for low incremental cost Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 4. Outline Introduction Design Motivation Implementation Results Conclusions Google goals Fault-tolerant, persistent Scalable 1000s of servers Millions of reads/writes, efficient scans Self-managing Simple! Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 5. Outline Introduction Design Data model Implementation Results Conclusions Bigtable Definition A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. (row:string, column:string, time:int64) -> string Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 6. Outline Introduction Design Data model Implementation Results Conclusions Rows The row keys in a table are arbitrary strings Every read or write of data under a single row key is atomic maintains data in lexicographic order by row key Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 7. Outline Introduction Design Data model Implementation Results Conclusions Column Families Grouped into sets called column families All data stored in a column family is usually of the same type A column family must be created before data can be stored under any column key in that family A column key is named using the following syntax: family:qualifier Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 8. Outline Introduction Design Data model Implementation Results Conclusions Timestamps Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp (64-bit integers). Applications that need to avoid collisions must generate unique timestamps themselves. To make the management of versioned data less onerous, they support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 9. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Infrastructure Google WorkQueue (scheduler) GFS: large-scale distributed file system Master: responsible for metadata Chunk servers: responsible for r/w large chunks of data Chunks replicated on 3 machines; master responsible Chubby: lock/file/name service Coarse-grained locks; can store small amount of data in a lock 5 replicas; need a majority vote to be active Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 10. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions SSTable Lives in GFS Immutable, sorted file of key-value pairs Chunks of data plus an index Index is of block ranges, not values Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 11. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Design Large tables broken into tablets at row boundaries Tablets hold contiguous rows Approx 100 200 MB of data per tablet Approx 100 tablets per machine Fast recovery Load-balancing Built out of multiple SSTables Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 12. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Location Like a B+-tree, but fixed at 3 levels How can we avoid creating a bottleneck at the root? Aggressively cache tablet locations Lookup starts from leaf (bet on it being correct); reverse on miss Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 13. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Assignment Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers, and the current assignment of tablets to tablet servers. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. Tablet server stops serving its tablets if loses its exclusive lock The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 14. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Serving a Tablet Updates are logged Each SSTable corresponds to a batch of updates or a snapshot of the tablet taken at some earlier time Memtable (sorted by key) caches recent updates Reads consult both memtable and SSTables Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 15. Compactions: As write operations execute, the size of the memtable increases. A minor compaction converts the memtable into an SSTable; this reduces memory usage and reduces the amount of log that must be replayed on restart. A merging compaction is executed periodically in the background; it reduces the number of SSTables and is a good place to apply policies such as "keep only N versions". A major compaction is a merging compaction that results in only one SSTable; the result contains no deletion records, only live data, which reclaims resources.
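The three compaction types can be sketched with dictionaries standing in for SSTables. This is a simplified illustration: `TOMBSTONE` and the three function names are hypothetical, each key keeps a single version, and the "keep only N versions" policy is left out.

```python
TOMBSTONE = object()  # hypothetical marker for a deletion record


def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable; return (empty memtable, new table list)."""
    frozen = dict(sorted(memtable.items()))
    return {}, sstables + [frozen]


def merging_compaction(sstables):
    """Merge SSTables (oldest first) into one table; newer entries win.

    Deletion records are kept, because older tables outside this merge
    might still hold data for the deleted keys.
    """
    merged = {}
    for table in sstables:
        merged.update(table)
    return dict(sorted(merged.items()))


def major_compaction(sstables):
    """Merge everything into exactly one SSTable that holds only live data."""
    merged = merging_compaction(sstables)
    return [{key: value for key, value in merged.items() if value is not TOMBSTONE}]


memtable = {"b": 2, "a": 1}
memtable, tables = minor_compaction(memtable, [])
tables.append({"a": TOMBSTONE})        # a later table records the deletion of "a"
merged = merging_compaction(tables)
compacted = major_compaction(tables)
```

After the major compaction the deletion record for `"a"` is gone along with the value it suppressed, which is the point at which the space is reclaimed.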
  • 16. Refinements (1/2): Locality groups: group column families together into an SSTable. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads. Locality groups can be compressed, using Bentley and McIlroy's scheme followed by a fast compression algorithm that looks for repetitions. Bloom filters on locality groups allow a reader to ask whether an SSTable might contain any data for a specified row/column pair. This drastically reduces the number of disk seeks required: lookups for non-existent rows or columns do not need to touch disk.
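A Bloom filter answers "might this key be present?" with no false negatives and a small rate of false positives. The following is a minimal generic sketch, not Bigtable's implementation: the class name, the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions chosen for clarity.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over string keys (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("com.cnn.www/anchor:cnnsi.com")
```

A reader would check `might_contain` for the row/column pair before opening the SSTable, and skip the disk seek whenever the answer is `False`.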
  • 17. Refinements (2/2): Caching for read performance (two levels of caching). Scan Cache: a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. Block Cache: a lower-level cache that caches SSTable blocks that were read from GFS. Commit-log implementation. Speeding up tablet recovery (log entries). Exploiting immutability.
  • 18. Hardware Environment: Tablet servers were configured to use 1 GB of memory and to write to a GFS cell consisting of 1786 machines with two 400 GB IDE hard drives each. Each machine had two dual-core Opteron 2 GHz chips, enough physical memory to hold the working set of all running processes, and a single gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with 100-200 Gbps of aggregate bandwidth at the root.
  • 19. Results Per Tablet Server (figure): Number of 1000-byte values read/written per second.
  • 20. Results Aggregate Rate (figure): Number of 1000-byte values read/written per second.
  • 21. Single tablet-server performance: For random reads, the tablet server executes about 1200 reads per second (about 75 MB/s), which is enough to saturate the tablet server CPUs because of overheads in the networking stack. Random and sequential writes perform better than random reads, because each tablet server appends all incoming writes to a single commit log and uses group commit. There is no significant difference between random writes and sequential writes (both go to the same commit log). Sequential reads perform better than random reads (thanks to the block cache).
  • 22. Scaling: Aggregate throughput increases dramatically as tablet servers are added; for example, the performance of random reads from memory increases. However, performance does not increase linearly: there is a drop in per-server throughput. One cause is load imbalance: re-balancing is throttled to reduce the number of tablet movements, and the load generated by the benchmarks shifts around as the benchmark progresses. Another is the random read benchmark itself: it transfers one 64 KB block over the network for every 1000-byte read, which saturates the shared 1 Gigabit links.
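The figures on the last two slides can be cross-checked with a little arithmetic. This is a back-of-the-envelope sketch: the variable names are hypothetical, 1 MB is taken as 1024 KB, and the link is treated as carrying exactly 10^9 bits per second with no protocol overhead, which is an idealisation.

```python
READS_PER_SECOND = 1200        # random reads per second on one tablet server
BLOCK_BYTES = 64 * 1024        # one 64 KB block is transferred per read
VALUE_BYTES = 1000             # bytes the benchmark actually asks for per read
LINK_BITS_PER_SECOND = 10**9   # idealised 1 Gigabit link

# Data pulled over the network per second, in MB (1 MB = 1024 * 1024 bytes).
transferred_mb_per_s = READS_PER_SECOND * BLOCK_BYTES / (1024 * 1024)

# Fraction of the transferred bytes that the benchmark actually uses.
useful_fraction = VALUE_BYTES / BLOCK_BYTES

# Upper bound on 64 KB block transfers per second over one idealised link.
max_block_reads_per_link = (LINK_BITS_PER_SECOND // 8) // BLOCK_BYTES

print(f"{transferred_mb_per_s:.0f} MB/s transferred")
print(f"{useful_fraction:.1%} of transferred bytes are useful")
print(f"at most {max_block_reads_per_link} block reads/s per link")
```

The first result matches the 75 MB/s quoted for a single tablet server. The bound on block reads per link suggests why shared gigabit links become the limit once many servers issue random reads at the same time.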
  • 23. Real applications (Timestamps): Google Analytics, Google Earth, Personalized Search.
  • 24. Lessons learned: Large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures: memory and network corruption, large clock skew, extended and asymmetric network partitions, bugs in other systems (Chubby!), and so on. Delay adding new features until it is clear how the new features will be used. A practical lesson: the importance of proper system-level monitoring. Keep It Simple!
  • 25. END! QUESTIONS?