09/27/2012
Column Statistics Project
Shreepadma Venugopalan | Platform Engineering
Outline

    •   Motivation
    •   New Statistics
    •   Computing and Persisting Statistics
    •   Summary
    •   Further Readings




                      ©2012 Cloudera, Inc. All Rights Reserved.
Why Column Statistics?

    • RDBMS Query Optimizer
      • Is cost-based
      • Uses the statistical properties of the data to cost
        alternate execution plans
      • Picks the execution plan with the lowest cost


    • Hive Query Optimizer
      • Is rule-based
      • Uses rules of thumb to optimize the execution plan
      • Cannot always pick the most efficient execution
        plan


Why Column Statistics?
• Statistics in an RDBMS
     • Maintained on a per-table, per-partition, and per-column
       basis
     • Used for a wide range of cost-based query optimizations

• Statistics in Hive
     • Maintained at the per-table and per-partition level
     • Can be used for some cost-based optimizations,
       such as choosing a join method
     • Insufficient for other cost-based optimizations, such as
       join reordering and two-stage aggregation

           Solution: Maintain statistics on columns in Hive



What are the New Statistics?

    •   Min Column Value
    •   Max Column Value
    •   Average Length of Column Value
    •   Max Length of Column Value
    •   Number of Distinct Values in a Column
    •   Number of Null Values in a Column
    •   Equi-height Histograms
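
As a rough illustration of the scalar statistics listed above, the following sketch computes them exactly in one pass over a small in-memory string column. The names `ColumnStats` and `compute_string_column_stats` are hypothetical and not part of Hive. The exact set used for the distinct count only works at toy scale.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ColumnStats:
    min_value: Optional[str]
    max_value: Optional[str]
    avg_length: float
    max_length: int
    num_distinct: int
    num_nulls: int


def compute_string_column_stats(values: Iterable[Optional[str]]) -> ColumnStats:
    """Exact single-pass statistics for a string column; None stands for NULL."""
    distinct = set()  # exact NDV: acceptable for a toy example only
    num_nulls = 0
    non_null = 0
    total_length = 0
    max_length = 0
    min_value = None
    max_value = None
    for v in values:
        if v is None:
            num_nulls += 1
            continue
        non_null += 1
        distinct.add(v)
        total_length += len(v)
        max_length = max(max_length, len(v))
        # Lexicographic min/max, since this sketch uses a string column.
        min_value = v if min_value is None else min(min_value, v)
        max_value = v if max_value is None else max(max_value, v)
    avg_length = total_length / non_null if non_null else 0.0
    return ColumnStats(min_value, max_value, avg_length, max_length,
                       len(distinct), num_nulls)


stats = compute_string_column_stats(["pear", None, "fig", "apple", "fig"])
print(stats)
```

Every field except the distinct count needs only constant memory, which is why NDV and histograms get separate treatment later in the deck.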




How to Compute Column Statistics?

    • Explicit Computation
      • Triggered through an ANALYZE command
      • Pros: Admin has fine-grained control over the stats
        job
      • Cons: Requires its own pass over the data; doesn’t
        piggyback on other operations such as a scan
    • Implicit Computation
      • Incrementally compute statistics while loading data
      • Pros: Avoids an additional table scan; more efficient
        than explicit computation
      • Cons: Impacts LOAD performance
How to Compute Column Statistics?

    • Each statistic is an aggregate function of the column
      data, rolled up by table/partition
    • Fits nicely into Hive’s UDAF framework
    • Expect to scan TBs of data at a time
      • Requirement #1: Memory usage has to scale
        sub-linearly with data size
      • Requirement #2: Stats task has to complete in a
        reasonable amount of time
      • Given these requirements, some statistics, such as
        NDV (number of distinct values) and histograms,
        are hard to compute!
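
The roll-up described above can be sketched as a mergeable aggregator: each mapper builds a small partial state, and a reducer combines the partials. The method names below loosely mirror the iterate / terminate-partial / merge / terminate life cycle of a Hive UDAF evaluator, but the classes `LongStatsPartial` and `LongStatsAggregator` are hypothetical Python stand-ins, not Hive code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LongStatsPartial:
    count: int = 0
    num_nulls: int = 0
    min_value: Optional[int] = None
    max_value: Optional[int] = None


class LongStatsAggregator:
    """Constant-memory aggregation state for an integer column."""

    def __init__(self) -> None:
        self.state = LongStatsPartial()

    def iterate(self, value: Optional[int]) -> None:
        """Fold one row into the local state (map side)."""
        s = self.state
        s.count += 1
        if value is None:
            s.num_nulls += 1
            return
        s.min_value = value if s.min_value is None else min(s.min_value, value)
        s.max_value = value if s.max_value is None else max(s.max_value, value)

    def terminate_partial(self) -> LongStatsPartial:
        """Hand the partial state to the next stage."""
        return self.state

    def merge(self, partial: LongStatsPartial) -> None:
        """Combine another mapper's partial state (reduce side)."""
        s = self.state
        s.count += partial.count
        s.num_nulls += partial.num_nulls
        if partial.min_value is not None:
            s.min_value = (partial.min_value if s.min_value is None
                           else min(s.min_value, partial.min_value))
        if partial.max_value is not None:
            s.max_value = (partial.max_value if s.max_value is None
                           else max(s.max_value, partial.max_value))

    def terminate(self) -> LongStatsPartial:
        """Return the final rolled-up statistics."""
        return self.state


# Two "mappers" each see part of the column; one "reducer" merges them.
mapper_a, mapper_b, reducer = (LongStatsAggregator() for _ in range(3))
for v in [3, None, 7]:
    mapper_a.iterate(v)
for v in [1, 9, None, None]:
    mapper_b.iterate(v)
reducer.merge(mapper_a.terminate_partial())
reducer.merge(mapper_b.terminate_partial())
print(reducer.terminate())
```

Min, max, and null count merge trivially because their partial state has a fixed size. NDV and histograms are harder precisely because an exact partial state would grow with the data.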


How to Compute NDVs?

    • Naïve approach
      • Keep every distinct value seen in a column and count them
      • Impractical: memory grows linearly with the number of
        distinct values
    • Flajolet-Martin approach
      • Use probabilistic sketches to estimate NDV
      • Memory required is logarithmic in size of data
      • Estimates are within 10% of the actual value
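
A minimal sketch of the Flajolet-Martin idea, assuming the stochastic-averaging (PCSA) variant with 64 bitmaps and a BLAKE2b-based hash. The class name `FMSketch` and all parameters are illustrative choices, not taken from the Hive patch. Each bitmap records which "lowest set bit" positions have been seen, and the estimate is derived from the average position of the lowest unset bit.

```python
import hashlib

PHI = 0.77351  # bias-correction constant from Flajolet and Martin's analysis


class FMSketch:
    def __init__(self, num_bitmaps: int = 64) -> None:
        self.num_bitmaps = num_bitmaps
        self.bitmaps = [0] * num_bitmaps

    @staticmethod
    def _hash64(value) -> int:
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value) -> None:
        h = self._hash64(value)
        index = h % self.num_bitmaps      # which bitmap this value updates
        rest = (h >> 32) | (1 << 31)      # guard bit keeps rest non-zero
        rho = (rest & -rest).bit_length() - 1  # position of lowest set bit
        self.bitmaps[index] |= 1 << rho

    def merge(self, other: "FMSketch") -> None:
        """Union of two sketches is a bitwise OR, so mapper outputs combine easily."""
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self) -> float:
        total_r = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):      # find the lowest unset bit
                r += 1
            total_r += r
        mean_r = total_r / self.num_bitmaps
        return self.num_bitmaps / PHI * 2 ** mean_r


sketch = FMSketch()
for i in range(10_000):
    sketch.add(i)
    sketch.add(i)  # duplicates do not change the sketch
print(round(sketch.estimate()))
```

Each bitmap needs only about log2(n) bits, which is where the logarithmic memory bound comes from. With 64 bitmaps the expected standard error is roughly 10%, and using more bitmaps trades memory for accuracy.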



How to Compute Histograms?

    • Computing equi-height histograms is a quantile
      computation/estimation problem
    • Merging the quantiles computed at the mappers is
      non-trivial
    • Deterministic parallel algorithms such as QDigest are
      prohibitive in terms of memory required
    • Probabilistic stream-counting algorithms such as
      Count-Min Sketch can be adapted to estimate
      quantiles
       • Memory required is logarithmic in size of data
       • Computationally expensive!
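
To make the link between equi-height histograms and quantiles concrete, here is an exact baseline that sorts the data and reads bin boundaries off the quantile positions. It is not the scalable algorithm discussed above, because it holds the whole column in memory. The function name `equi_height_boundaries` is hypothetical.

```python
from typing import List, Sequence


def equi_height_boundaries(values: Sequence[float], num_bins: int) -> List[float]:
    """Return num_bins + 1 boundaries so each bin holds about len(values) / num_bins rows."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    n = len(ordered)
    boundaries = [ordered[0]]
    for k in range(1, num_bins):
        # The k-th boundary is the (k / num_bins)-quantile of the column.
        boundaries.append(ordered[(k * n) // num_bins])
    boundaries.append(ordered[-1])
    return boundaries


print(equi_height_boundaries(range(100), 4))
```

A streaming version would replace the sort with an approximate quantile summary whose partial results can be merged across mappers, which is the hard part the slide refers to.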


How to Store Column Statistics?

 • Extend metastore schema to store new statistics
 • Extend metastore Thrift API to update, query and
   delete new statistics
 • Size of the column statistics record in metastore is
   independent of table/partition size
 • ~32 bytes/column if histograms are not computed
 • ~320 bytes/column for a 20-bin histogram
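
A back-of-the-envelope sizing based on the per-column figures above, assuming one statistics record per column per partition. That assumption, and the helper name `estimate_stats_bytes`, are illustrative and not stated in the slides.

```python
BYTES_PER_COLUMN_SCALAR = 32        # slide figure: histograms not computed
BYTES_PER_COLUMN_WITH_HIST = 320    # slide figure: 20-bin histogram


def estimate_stats_bytes(num_columns: int, num_partitions: int,
                         with_histogram: bool) -> int:
    """Approximate metastore footprint of column statistics for one table."""
    per_column = (BYTES_PER_COLUMN_WITH_HIST if with_histogram
                  else BYTES_PER_COLUMN_SCALAR)
    # An unpartitioned table is treated as a single partition (assumption).
    return num_columns * max(num_partitions, 1) * per_column


# A hypothetical table with 50 columns and 1,000 partitions.
print(estimate_stats_bytes(50, 1000, with_histogram=False))
print(estimate_stats_bytes(50, 1000, with_histogram=True))
```

Under these assumptions the footprint depends only on the number of columns and partitions, never on the number of rows.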




Summary

 • Scalar statistics have been implemented for primitive-type
   columns in both tables and partitions

 • Patch is available on JIRA (HIVE-1362)

 • Computing equi-height histograms is a work in progress




Questions?




Further Readings

 • Blog
     • http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
 • Academic
     • A. Gruenheid et al., Query Optimization using Column
       Statistics in Hive.
     • S. Chaudhuri, An Overview of Query Optimization in
       Relational Systems.
     • P. Flajolet and G. N. Martin, Probabilistic Counting
       Algorithms for Data Base Applications.
                Contact: shreepadma@cloudera.com




Editor's Notes

  • #4: Explain what cost means in the context of this discussion: the CPU and I/O cost of executing a query plan
  • #6: Talk about how each of the stats will be useful
  • #11: Talk about the algorithms used: Flajolet-Martin, histogram construction