Column Statistics in Hive

09/27/2012
Column Statistics Project
Shreepadma Venugopalan | Platform Engineering

Outline

• Motivation
• New Statistics
• Computing and Persisting Statistics
• Summary
• Further Readings

©2012 Cloudera, Inc. All Rights Reserved.

Why Column Statistics?

• RDBMS Query Optimizer
• Is cost-based
• Uses the statistical properties of the data to cost
alternate execution plans
• Picks the execution plan with the lowest cost

• Hive Query Optimizer
• Is rule-based
• Uses rules of thumb to optimize the execution plan
• Unable to always pick the most efficient execution
plan


Why Column Statistics?
• Statistics in an RDBMS
• Maintained on a per-table, per-partition, and per-column
basis
• Used for a wide range of cost-based query optimizations

• Statistics in Hive
• Maintained at the table and partition level
• Can be used to perform some cost-based optimizations,
such as choosing a join method
• Insufficient for other cost-based optimizations, such as join
reordering and two-stage aggregation

Solution: Maintain statistics on columns in Hive


What are the New Statistics?

• Min Column Value
• Max Column Value
• Average Length of Column Value
• Max Length of Column Value
• Number of Distinct Values in a Column
• Number of Null Values in a Column
• Equi-height Histograms
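
As an illustration of what the scalar statistics listed above mean, here is a minimal Python sketch that computes them exactly for a small in-memory string column. The function name `scalar_column_stats` and the dictionary keys are hypothetical and are not taken from Hive; `None` stands in for NULL, and the exact `set`-based NDV is only workable at toy scale.

```python
def scalar_column_stats(values):
    """Exact scalar statistics for one string column held in a list.

    Hypothetical helper for illustration only; None represents NULL.
    """
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_len": max(lengths) if lengths else 0,
        # Exact distinct count: keeps every distinct value in memory.
        "ndv": len(set(non_null)),
        "num_nulls": len(values) - len(non_null),
    }


stats = scalar_column_stats(["bb", None, "a", "bb", "cccc"])
```

Histograms are left out here because they are not a single scalar per column.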


How to Compute Column Statistics?

• Explicit Computation
• Triggered through an ANALYZE command
• Pros: Admin has fine-grained control over the stats
job
• Cons: Doesn’t piggyback on other operations such as
scans
• Implicit Computation
• Incrementally compute statistics while loading data
• Pros: Avoids an additional table scan, more efficient
than explicit computation
• Cons: Impacts LOAD performance

How to Compute Column Statistics?

• Aggregate function of the column data
rolled up by table/partition
• Fits nicely into Hive’s UDAF framework
• Expect to scan TBs of data at a time
• Requirement #1: Memory usage has to scale
sub-linearly with data size
• Requirement #2: Stats task has to complete in a
reasonable amount of time
• Given these requirements, some statistics, such
as NDV and histograms, are hard to compute!
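
The reason simple statistics fit an aggregate-function framework is that each one can be kept as a small partial result that is updated per row and merged across splits. The sketch below, in Python rather than Hive's Java, loosely mirrors the iterate/merge shape of a UDAF evaluator. The class `LongColumnPartial` and its fields are hypothetical names chosen for illustration, not Hive's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LongColumnPartial:
    """Constant-size partial aggregate for a numeric column (hypothetical)."""
    min_value: Optional[int] = None
    max_value: Optional[int] = None
    num_nulls: int = 0

    def iterate(self, value):
        """Fold one row into the partial result; None represents NULL."""
        if value is None:
            self.num_nulls += 1
            return
        self.min_value = value if self.min_value is None else min(self.min_value, value)
        self.max_value = value if self.max_value is None else max(self.max_value, value)

    def merge(self, other):
        """Combine the partial result computed on another split."""
        self.num_nulls += other.num_nulls
        for bound in (other.min_value, other.max_value):
            if bound is not None:
                self.iterate(bound)


# Two splits processed independently, then merged as a reducer would.
split_a, split_b = LongColumnPartial(), LongColumnPartial()
for v in [5, None, 2]:
    split_a.iterate(v)
for v in [9, None, None, 7]:
    split_b.iterate(v)
split_a.merge(split_b)
```

The partial result stays the same size regardless of how many rows are scanned, which satisfies the memory requirement. NDV and histograms have no such trivially mergeable constant-size state, which is why they need the approximate techniques discussed next.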


How to Compute NDVs?

• Naïve approach
• Maintain the set of distinct values seen in a column
• Impractical given memory requirements
• Flajolet-Martin approach
• Use probabilistic sketches to estimate NDV
• Memory required is logarithmic in size of data
• Estimates are within 10% of the actual value
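
A minimal sketch of the Flajolet-Martin idea, assuming the multi-bitmap variant with stochastic averaging (PCSA) from the Flajolet and Martin paper. The class name `FMSketch`, the choice of SHA-1 as the hash, and the default of 64 bitmaps are assumptions made for this example and do not describe the Hive patch.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


class FMSketch:
    """Probabilistic distinct-value counter (illustrative, not Hive's code)."""

    def __init__(self, num_bitmaps=64):
        self.num_bitmaps = num_bitmaps
        self.bitmaps = [0] * num_bitmaps

    @staticmethod
    def _hash64(value):
        # Stable 64-bit hash; SHA-1 is an arbitrary choice for this sketch.
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        index, rest = h % self.num_bitmaps, h // self.num_bitmaps
        # Position of the least-significant 1 bit of the remaining hash bits.
        rho = (rest & -rest).bit_length() - 1 if rest else 63
        self.bitmaps[index] |= 1 << rho

    def merge(self, other):
        # Sketches built on different splits combine with a bitwise OR.
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self):
        def lowest_zero(bitmap):
            r = 0
            while bitmap & (1 << r):
                r += 1
            return r

        mean_r = sum(lowest_zero(b) for b in self.bitmaps) / self.num_bitmaps
        return self.num_bitmaps / PHI * 2 ** mean_r


sketch = FMSketch(256)
for i in range(20000):
    sketch.add(i)
    sketch.add(i)  # duplicates do not change the sketch
approx_ndv = sketch.estimate()
```

Each bitmap needs only about log2(n) bits, and duplicates leave the bitmaps unchanged, so memory does not grow with the number of rows. Accuracy improves as more bitmaps are used, at the cost of a proportionally larger sketch.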


How to Compute Histograms?

• Computing equi-height histograms is a quantile
computation/estimation problem
• Merging the quantiles computed at the mappers is
non-trivial
• Deterministic parallel algorithms such as QDigest are
prohibitive in terms of memory required
• Probabilistic stream-counting algorithms
such as Count-Min Sketch can be tweaked to
estimate quantiles
• Memory required is logarithmic in size of data
• Computationally expensive!
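
To make the link between equi-height histograms and quantiles concrete, the sketch below computes the bin boundaries exactly by sorting the whole column in memory. This is precisely the approach that does not scale to the data sizes discussed above; it only shows the target that a streaming quantile estimator approximates. The function name `equi_height_boundaries` is hypothetical.

```python
def equi_height_boundaries(values, num_bins):
    """Exact bin boundaries for an equi-height histogram.

    Boundary k is the k/num_bins quantile of the data, so every bin
    holds roughly the same number of rows. Illustrative only: it sorts
    the full column in memory.
    """
    if not values or num_bins < 1:
        raise ValueError("need at least one value and one bin")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (k * n) // num_bins)] for k in range(num_bins + 1)]


boundaries = equi_height_boundaries(list(range(100, 0, -1)), 4)
```

For the values 1 through 100 and four bins, each bin between consecutive boundaries covers 25 rows.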


How to Store Column Statistics?

• Extend metastore schema to store new statistics
• Extend metastore Thrift API to update, query and
delete new statistics
• Size of the column statistics record in metastore is
independent of table/partition size
• ~32 bytes/column if histograms are not computed
• ~320 bytes/column for 20 bin histogram


Summary

• Scalar statistics have been implemented for primitive
type columns in both tables and partitions

• Patch is available on JIRA (HIVE-1362)

• Computing Equi-Height Histograms is a WIP


Questions?


Further Readings

• Blog
• http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
• Academic
• A. Gruenheid et al., Query Optimization using Column
Statistics in Hive.
• S. Chaudhuri, An Overview of Query Optimization in
Relational Systems.
• P. Flajolet and G. N. Martin, Probabilistic Counting
Algorithms for Data Base Applications.
Contact: shreepadma@cloudera.com

