Columnar Database and Hadoop



江志伟( Alex Jiang )
2012-12-1
Agenda



1.   Column Advantage
2.   Storage and Process
3.   Hadoop Related
History


    2001 PAX

    Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch
    Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, …

    C-Store: A Column-oriented DBMS

    D. J. Abadi et al.: Integrating Compression and Execution in
    Column-Oriented Database Systems. In SIGMOD, pages 671–682, 2006.

    D. J. Abadi et al.: Materialization Strategies in a Column-Oriented
    DBMS. In ICDE, pages 466–475, 2007.
File Format


PAX
Columnar storage
(Columnar) compression
PPD vs Index or MV
SerDe
PAX




(Picture From oracle blog)
Columnar Store vs Row Store

●   IO-1 (basic column store): Every storage block contains
    data from only ONE column.
●   IO-2: Aggressive compression.
●   IO-3: No record-ids.
●   CPU-4: A column executor.
●   CPU-5: Executor runs on compressed data.
●   CPU-6: Executor can process columns that are in key
    sequence or entry sequence.
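
As a rough illustration of IO-1, the sketch below lays the same toy table out as row blocks and as column blocks, then counts how many values each layout has to read to sum one column. The table contents and the function names are assumptions made for illustration only.

```python
# Toy table: 4 rows, 3 columns (user_id, city, price). Values are illustrative.
rows = [
    (1, "Hangzhou", 10.0),
    (2, "Beijing", 25.5),
    (3, "Hangzhou", 7.25),
    (4, "Shanghai", 12.0),
]

# Row store: each block holds complete records.
row_blocks = [rows[0:2], rows[2:4]]

# Column store: each block holds values from only ONE column (IO-1).
column_blocks = {
    "user_id": [r[0] for r in rows],
    "city": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}


def sum_price_row_store(blocks):
    """Sum price; every field of every record is read along the way."""
    total = 0.0
    values_read = 0
    for block in blocks:
        for record in block:
            values_read += len(record)  # the whole record comes off disk
            total += record[2]
    return total, values_read


def sum_price_column_store(columns):
    """Sum price; only the price column is read."""
    values = columns["price"]
    return sum(values), len(values)


print(sum_price_row_store(row_blocks))
print(sum_price_column_store(column_blocks))
```

Both functions return the same total, but the row layout touches three values per record while the column layout touches one.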
Columnar Store Advantages
●   Compression
      RLE, Bitmap, ...
●   PPD (predicate pushdown)
      reduces IO
●   Late Materialization
      less memory and CPU overhead
●   Block Iteration (Vectorization)
      less CPU overhead
●   Invisible Join
      – block as join key
Compression
●   Run-length Encoding
●   ENCODING DELTAVAL
●   Bit Vector Encoding
●   BLOCK_DICT
       data skew
       compound

●   High Selectivity: Gender, age
●   Mid Selectivity: City, Category
●   Low Selectivity: item_id, user_id, Price, quantity, comment
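
A minimal sketch of run-length encoding on a repetitive column such as Gender. The function names are hypothetical and the sample data is made up. `count_equal` shows the idea of evaluating a predicate on the compressed runs without decoding them first.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def count_equal(runs, target):
    """Count matching rows directly on the runs, without decoding."""
    return sum(length for value, length in runs if value == target)


gender = ["F", "F", "F", "M", "M", "F"]
runs = rle_encode(gender)
print(runs)
print(count_equal(runs, "F"))
```

RLE pays off most when the column is sorted or has few distinct values, because runs are then long.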
Column File Format




(Picture From Vertica Blog)
PPD


Predicate Pushdown
    Continuous IO
    Compound predicates
    Max-Min in each minor block
PAX supports PPD, but not efficiently
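
The max-min idea can be sketched as follows: each block stores the minimum and maximum of its values, and a scan skips any block whose range cannot satisfy the predicate. The block structure and function names here are assumptions for illustration, not the layout of any particular file format.

```python
def build_blocks(values, block_size):
    """Split a column into blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return blocks


def scan_greater_than(blocks, threshold):
    """Return values > threshold and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # no value in this block can match: skip the IO
        blocks_read += 1
        matches.extend(v for v in block["values"] if v > threshold)
    return matches, blocks_read


column = [1, 2, 3, 4, 10, 11, 12, 13, 5, 6, 7, 8]
blocks = build_blocks(column, block_size=4)
print(scan_greater_than(blocks, 9))
```

Skipping works best when the data is sorted or clustered on the filtered column, so that block ranges overlap as little as possible.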
PPD




(Picture from Vertica Blog)
Late Materialization

Construct Row
Apply Filter + Projection


Only the projected columns are needed (also PPD)
Decode column first
Wait until processing
Different compression schemes behave differently
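
A small sketch contrasting the two strategies, assuming a toy in-memory table. Early materialization builds full rows before filtering. Late materialization filters each column into a position list and only stitches together the projected columns for the surviving positions. Column names and data are illustrative.

```python
columns = {
    "city": ["Hangzhou", "Beijing", "Hangzhou", "Shanghai"],
    "price": [10.0, 25.5, 7.25, 12.0],
    "comment": ["ok", "good", "bad", "fine"],
}


def early_materialization(cols, city, min_price):
    """Construct every row first, then filter and project."""
    rows = list(zip(cols["city"], cols["price"], cols["comment"]))
    return [(c, p) for c, p, _ in rows if c == city and p >= min_price]


def late_materialization(cols, city, min_price):
    """Filter column by column on positions; build rows only at the end."""
    positions = [i for i, c in enumerate(cols["city"]) if c == city]
    positions = [i for i in positions if cols["price"][i] >= min_price]
    # The comment column is never touched because it is not projected.
    return [(cols["city"][i], cols["price"][i]) for i in positions]


print(early_materialization(columns, "Hangzhou", 8))
print(late_materialization(columns, "Hangzhou", 8))
```

Both return the same result; the late variant avoids constructing rows that the filter would discard.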
Early Materialization




  (Picture from William McKnight)
Late Materialization




 (Picture from William McKnight)
Common Confusion IO

The more columns selected, the closer to a row store
IO <5%
   record-ID
   Row store free space at block tail
   variable length field
   IO Access Pattern means scalability
   Hardware Trend
   Compression rate
Common Confusion SerDe

Row or PAX SerDe
    CPU cache misses
    no columnar compression
    Block Iteration (construct tuple or row)


Java vs C/C++
   C/C++ direct memory mapping
   Java fastutil
Index and MV
Benefits
    Reduce IO
    Avoid sort
        Index join
    Lookup
    Pre-computation:
        Join
        Group by
    Query rewrite

Costs
    Scalability
    Storage cost
    Complex design
    Hard to maintain
    High latency
    Slows down loading
    Lost details
Data Modeling

Fat table vs 3NF
Hadoop Related


File Format
  Trevni vs IBM CIF
  Schema Evolution
  Portable File Format
   Bigger Block Size
    IO Pattern
    SerDe network influence
Hadoop Related

Storage Cost
NameNode
    Fewer blocks

   Bigger block size

   Cold data even bigger

   No Intermediate Level

JobTracker
    Each job has fewer map and reduce tasks

DataNode
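
To make the "fewer blocks, bigger block size" point concrete, the sketch below counts how many HDFS blocks the NameNode has to track for a given amount of data at different block sizes. The data size and block sizes are illustrative assumptions, and replication is ignored.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def block_count(data_size_bytes, block_size_bytes):
    """Number of blocks needed to hold the data (integer ceiling division)."""
    return (data_size_bytes + block_size_bytes - 1) // block_size_bytes


for block_size in (64 * MIB, 256 * MIB, 1024 * MIB):
    print(block_size // MIB, "MiB ->", block_count(1 * TIB, block_size), "blocks")
```

Every block is an object the NameNode keeps in memory, so quadrupling the block size cuts that metadata to roughly a quarter for the same data.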
Hadoop Related

Real Data ingestion
   HBase + Flume
   Balanced Data
   Write Avro file format first, then sort merge

SerDe memory reduction
    Tuple Structure not row
Batch Update+Delete+Insert
Hadoop Related

MR Performance Boost
  Block Shuffle (3 times faster)

  Skewed data has less overhead

  Fewer map tasks and bigger spills

  Reduce side combine

  Light compression codec (Snappy, not LZO)

  Combiner or in-memory combiner deprecated
Hadoop Related

Easier Performance Tuning
  mapred.min.split.size(deprecated)

  mapred.child.java.opts

  mapred.compress.map.output(deprecated)

  io.sort.mb

  io.sort.spill.percent(deprecated)

  io.sort.factor

  mapred.reduce.parallel.copies(deprecated)

  Map and reduce task counts are easier to estimate

  Reduce algorithm will change
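
As a sketch of why the map count becomes easy to estimate: with file-based input, the split size is commonly computed as max(minSize, min(maxSize, blockSize)), and each file contributes roughly size / splitSize map tasks. The function names below are hypothetical, and the estimate ignores the small slack Hadoop allows on the last split of a file.

```python
def split_size(block_size, min_split_size=1, max_split_size=2 ** 63 - 1):
    """Split size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_split_size, min(max_split_size, block_size))


def estimate_map_tasks(file_sizes, block_size, min_split_size=1):
    """Approximate map task count: one task per split, per file."""
    size = split_size(block_size, min_split_size)
    return sum((f + size - 1) // size for f in file_sizes if f > 0)


MIB = 1024 ** 2
GIB = 1024 ** 3

print(estimate_map_tasks([1 * GIB], block_size=64 * MIB))
print(estimate_map_tasks([1 * GIB], block_size=64 * MIB, min_split_size=256 * MIB))
```

With large, uniformly sized column files and big blocks, the file sizes alone determine the task count, so less per-job tuning is needed.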
Hadoop Related

Easy Management
   Fewer partitions or dynamic partitions

   Integrity constraints and referential integrity

   Statistics make a simpler query engine

   Cold data automatic merge

   Trojan Layout vs Columnar Projections

Less Design complexity
   Map join vs Fat Table

   Group by + Index
Column and hadoop
Reference
●   http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/
●   http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
●   http://www.infoq.com/news/2011/09/nosqlnow-columnar-databases/
●   Dremel: Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010
●   Trevni: http://avro.apache.org/docs/current/trevni/spec.html
●   http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/
Thank you!
                                 Q&A

Alex Jiang

gemini5201314 at gmail dot com

http://www.gemini5201314.net
