Column and Hadoop

Columnar Database and Hadoop

江志伟（ Alex Jiang ）
2012-12-1

Agenda

1. Column Advantage
2. Storage and Process
3. Hadoop Related

History


2001 PAX

Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch
Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, …

C-Store: A Column Oriented DBMS

D. J. Abadi et al.: Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, pages 671–682, 2006.

D. J. Abadi et al.: Materialization Strategies in a Column-Oriented DBMS. In ICDE, pages 466–475, 2007.

File Format

PAX
Columnar storage
(Columnar) compression
PPD vs Index or MV
SerDe

PAX

(Picture from Oracle blog)

Columnar Store vs Row Store

● IO-1 (basic column store): Every storage block contains data from only ONE column.
● IO-2: Aggressive compression.
● IO-3: No record-ids.
● CPU-4: A column executor.
● CPU-5: Executor runs on compressed data.
● CPU-6: Executor can process columns that are in key sequence or entry sequence.

Columnar Store Advantages
● Compression: RLE, bitmap, ...
● PPD: reduces IO
● Late Materialization: less memory and CPU overhead
● Block Iteration (Vectorization): less CPU overhead
● Invisible Join: block as join key
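Block iteration can be illustrated with a minimal sketch. It assumes a column is a plain Python list; the function names and the block size are hypothetical and not taken from any particular engine. The point is only that the executor makes one call per block instead of one call per value.

```python
from typing import Iterator, List, Tuple


def next_value(column: List[int]) -> Iterator[int]:
    """Tuple-at-a-time interface: hands out one value per step."""
    for value in column:
        yield value


def next_block(column: List[int], block_size: int) -> Iterator[List[int]]:
    """Block-at-a-time interface: hands out a slice of values per step."""
    for start in range(0, len(column), block_size):
        yield column[start:start + block_size]


def sum_tuple_at_a_time(column: List[int]) -> Tuple[int, int]:
    """Returns (sum, number of iterator steps)."""
    total, steps = 0, 0
    for value in next_value(column):
        steps += 1
        total += value
    return total, steps


def sum_block_at_a_time(column: List[int], block_size: int) -> Tuple[int, int]:
    """Returns (sum, number of iterator steps); each step handles a whole block."""
    total, steps = 0, 0
    for block in next_block(column, block_size):
        steps += 1
        total += sum(block)  # tight loop over one block of a single column
    return total, steps
```

For a 10-value column and a block size of 4, both functions return the same sum, but the block version takes 3 steps instead of 10.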

Compression

Encodings:
● Run-length Encoding
● ENCODING DELTAVAL
● Bit Vector Encoding
● BLOCK_DICT
data skew
compound

Column examples:
● High Selectivity: Gender, age
● Mid Selectivity: City, Category
● Low Selectivity: item_id, user_id, price, quantity, comment
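The encodings named on this slide can be sketched in a few lines. This is a simplified illustration with hypothetical function names, not the on-disk format of any product: run-length encoding stores (value, run length) pairs, dictionary encoding replaces each value with a small integer code, and bit vector encoding keeps one bitmap per distinct value.

```python
from typing import Dict, Hashable, List, Tuple


def rle_encode(values: List[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Tuple[Hashable, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def rle_decode(runs: List[Tuple[Hashable, int]]) -> List[Hashable]:
    """Expand (value, run_length) pairs back into the original column."""
    out: List[Hashable] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def dict_encode(values: List[Hashable]) -> Tuple[Dict[Hashable, int], List[int]]:
    """Assign each distinct value a code in order of first appearance."""
    dictionary: Dict[Hashable, int] = {}
    codes: List[int] = []
    for value in values:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def bitvector_encode(values: List[Hashable]) -> Dict[Hashable, List[int]]:
    """Build one bitmap per distinct value; bit i is 1 where row i holds that value."""
    vectors: Dict[Hashable, List[int]] = {}
    for i, value in enumerate(values):
        vectors.setdefault(value, [0] * len(values))[i] = 1
    return vectors
```

Run-length encoding pays off on sorted columns with few distinct values (such as gender), while dictionary encoding suits columns like city or category where values repeat but are not adjacent.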

Column File Format

(Picture from Vertica Blog)

PPD

Predicate Pushdown
Continuous IO
Compound predicates
Max-min in each minor block
PAX supports PPD, but not efficiently
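The max-min idea can be sketched as follows, assuming each block of a column stores its minimum and maximum next to the values. `Block`, `build_blocks` and `scan_range` are hypothetical names used only for illustration. A range predicate is checked against the block statistics first, and blocks that cannot contain a match are never read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    min_value: int
    max_value: int
    values: List[int]


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into blocks and record min/max statistics per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(min(chunk), max(chunk), chunk))
    return blocks


def scan_range(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values in [lo, hi] and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Pushed-down predicate: skip the block using statistics only.
        if block.max_value < lo or block.min_value > hi:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if lo <= v <= hi)
    return matches, blocks_read
```

On a sorted column of 100 values in blocks of 10, a predicate `25 <= v <= 34` touches only 2 of the 10 blocks. The benefit depends on the data being sorted or clustered on the filtered column.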

PPD

(Picture from Vertica Blog)

Late Materialization

Construct Row
Apply Filter + Projection

Project only the needed columns (also PPD)
Decoding column first
Wait until processing
Different compression schemes behave differently
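The difference between the two strategies can be sketched on a toy table stored as a dict of column lists. The column names (`name`, `age`, `city`) and the query are hypothetical. Early materialization builds every row before filtering; late materialization filters each column into a position list, intersects the positions, and only then fetches the projected column.

```python
from typing import Dict, List

Table = Dict[str, List]


def early_materialization(table: Table, min_age: int, city: str) -> List[str]:
    """Construct all rows first, then apply filter and projection."""
    names = list(table)
    rows = [dict(zip(names, vals)) for vals in zip(*table.values())]
    return [r["name"] for r in rows if r["age"] > min_age and r["city"] == city]


def late_materialization(table: Table, min_age: int, city: str) -> List[str]:
    """Filter each column to positions, intersect, then fetch needed values."""
    pos_age = {i for i, age in enumerate(table["age"]) if age > min_age}
    pos_city = {i for i, c in enumerate(table["city"]) if c == city}
    positions = sorted(pos_age & pos_city)
    # Only the projected column is touched, and only at surviving positions.
    return [table["name"][i] for i in positions]
```

Both functions return the same result; the late version never builds rows for records that fail the filter and never reads columns outside the query.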

Early Materialization

(Picture from William McKnight)

Late Materialization

(Picture from William McKnight)

Common Confusion IO

The more columns selected, the closer to a row store
IO <5%
record-ID
Row store free space at block tail
variable length field
IO Access Pattern means scalability
Hardware Trend
Compression rate

Common Confusion SerDe

Row or PAX SerDe
CPU cache misses
no columnar compression
Block Iteration (construct tuple or row)

Java vs C/C++
C/C++: direct memory mapping
Java: fastutil

Index and MV
Reduce IO Scalability
Avoid Sort Storage cost
Index join Complex design
Lookup Hard to maintain
Pre-computation : High latency
Join Slow down loading
Group by Lost Details
Query Rewrite

Data Modeling

Fat table vs 3NF

Hadoop Related

File Format
Trevni vs IBM CIF
Schema Evolution
Portable File Format
Bigger Block Size
IO Pattern
SerDe network influence

Hadoop Related

Storage Cost
NameNode
Fewer blocks

Bigger block size

Cold data even bigger

No Intermediate Level

JobTracker
Each job has fewer map and reduce tasks

DataNode
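The NameNode and JobTracker points above come down to simple arithmetic: the NameNode tracks every block, and a job typically gets about one map task per block or split. A minimal sketch, assuming a hypothetical 1 TiB file and illustrative block sizes of 64 MiB and 512 MiB (these numbers are not from the slides):

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to store a file (ceiling division)."""
    return (file_bytes + block_bytes - 1) // block_bytes


small_blocks = block_count(1 * TIB, 64 * MIB)   # 16384 blocks
large_blocks = block_count(1 * TIB, 512 * MIB)  # 2048 blocks
```

An 8x larger block size means 8x fewer block entries for the NameNode to hold and, with one map per block, roughly 8x fewer map tasks for the JobTracker to schedule.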

Hadoop Related

Real Data ingestion
HBase + Flume
Balanced Data
Write Avro files first, then sort-merge

Reduced SerDe memory
Tuple structure, not row
Batch Update + Delete + Insert

Hadoop Related

MR Performance Boost
Block Shuffle (3 times faster)

Skewed data has less overhead

Less map number and bigger spill

Reduce side combine

Light compression codec (Snappy, not LZO)

Combiner or in-memory combiner deprecated

Hadoop Related

Easier Performance Tuning
mapred.min.split.size (deprecated)

mapred.child.java.opts

mapred.compress.map.output (deprecated)

io.sort.mb

io.sort.spill.percent (deprecated)

io.sort.factor

mapred.reduce.parallel.copies (deprecated)

Number of map and reduce tasks is easier to estimate

Reduce algorithm will change

Hadoop Related

Easy Management
Less Partition or Dynamic Partition

Integrity constraints and Referential integrity

Statistics make a simple query engine possible

Cold Data automatic merge

Trojan Layout vs Columnar Projections

Less Design complexity
Map join vs Fat Table

Group by + Index

Reference
● http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/
● http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
● http://www.infoq.com/news/2011/09/nosqlnow-columnar-databases/
● Dremel: Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010
● Trevni: http://avro.apache.org/docs/current/trevni/spec.html
● http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/

Thank you!
Q&A

Alex Jiang

gemini5201314 at gmail dot com

http://www.gemini5201314.net
