Hive at Yahoo: Letters from the trenches
P R E S E N T E D B Y M i t h u n R a d h a k r i s h n a n , C h r i s D r o m e ⎪ J u n e 1 0 , 2 0 1 5
2 0 1 5 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a
About myself
2 2015 Hadoop Summit, San Jose, California
 Mithun Radhakrishnan
 Hive Engineer at Yahoo!
 Hive Committer and long-time contributor
› Metastore-scaling
› Integration
› HCatalog
 mithun@apache.org
 @mithunrk
About myself
3 2015 Hadoop Summit, San Jose, California
 Chris Drome
 Hive Engineer at Yahoo!
 Hive contributor
 cdrome@yahoo-inc.com
Recap
5 2015 Hadoop Summit, San Jose, California
6 2015 Hadoop Summit, San Jose, California
[Chart: TPC-H at 1 TB scale. X-axis: the 22 TPC-H queries (q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive). Y-axis: time in seconds, 0-2500. Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC on M/R, Hive 0.13 ORC on Tez.]
1 TB
7 2015 Hadoop Summit, San Jose, California
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5-17x
› Average query time: 172 seconds
• Between 5-947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% queries completed in under 2 minutes
› 81% queries completed in under 4 minutes
Explaining the speed-ups
8 2015 Hadoop Summit, San Jose, California
 Hadoop 2.x, et al.
 Apache Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between Map and Reduce
• Intermediate data and HDFS
› Smart scheduling
› Container re-use
› Pipelined job start-up
 Hive
› Statistics
› Vectorized Execution
 ORC
› PPD
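These features are switch-driven. A hedged sketch of the Hive 0.13-era knobs behind the bullets above (setting names from the Apache Hive documentation of that era; values illustrative):
set hive.execution.engine=tez;               -- run on Tez instead of M/R
set hive.vectorized.execution.enabled=true;  -- vectorized operators over ORC
set hive.compute.query.using.stats=true;     -- answer simple aggregates from statistics
set hive.stats.autogather=true;              -- gather table/partition stats on writes
set hive.optimize.index.filter=true;         -- predicate pushdown (PPD) into ORC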
Expectations with Hive 0.13 in production
9 2015 Hadoop Summit, San Jose, California
 Tez would outperform M/R by miles
 Tez would enable better cluster utilization
› Use fewer resources
 Tez (and dependencies) would be “production ready”
› GUI for task logs, DAG overviews, swim-lanes
› Speculative execution
 Similarly, ORC and Vectorization
› Support evolving schemas
The Y!Grid
10 2015 Hadoop Summit, San Jose, California
 18 Hadoop Clusters in YGrid
› 41565 Nodes
› Biggest cluster: 5728 Nodes
› 1M jobs a day
 Hadoop 2.6+
 Large Datasets
› Daily, hourly, minute-level frequencies
› Thousands of partitions, 100s of 1000s of files, TBs of data per partition
› 580 PB of data, total
 Pig 0.14 on Tez, Pig 0.11
 Hive 0.13 on Tez
 HCatalog for interoperability
 Oozie for scheduling
 GDM for data-loading
 Spark, HBase, Storm, etc…
Data processing use cases
11 2015 Hadoop Summit, San Jose, California
 Grid usage
› 30+ million jobs per month
› 12+ million Oozie launcher jobs
 Pig usage
› Handles majority of data pipelines/ETL (~43% of jobs)
 Hive usage
› Relatively smaller niche
› 632,000 queries per month (35% Tez)
 HCatalog for Inter-operability
› Metadata storage for all Hadoop data
› Yahoo-scale
› Pig pipelines with Hive analytics
Business Intelligence Tools
12 2015 Hadoop Summit, San Jose, California
 Tableau, MicroStrategy
 Power users
› Tableau Server for scheduled reports
 Challenges:
› Security
• ACLs, Authentication, Encryption over the wire
› Bandwidth
• Transporting results over ODBC
• Limit result-set to 1000s-10000s of rows
• Aggregations
› Query Latency
• Metadata queries
• Partition/Table scans
• Materialized views
 Data producer owns the data
› Unlike traditional DBs
 Multi-paradigm data access/generation
› Pig/Hive/MapReduce using HCatalog
 Highly available metadata service
 UI for tracking/debugging jobs
 Execution engine should ideally support speculative execution
13 2015 Hadoop Summit, San Jose, California
Non-negotiables for Hive upgrade at Yahoo!
Yahoo! Hive-0.13
14 2015 Hadoop Summit, San Jose, California
 Based on Apache Hive-0.13.1
 Internal Yahoo! Patches (admin web-services, data discovery, etc.)
 Community patches to stabilize Apache Hive-0.13.1
› Tez
• HIVE-7544, HIVE-6748, HIVE-7112, …
› Vectorization
• HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …
› Failures
• HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …
› Optimizations
• HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …
› Data integrity
• HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …
 Phased upgrades
› Phase 1: 285 JIRAs
› Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)
› Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)
 One remote Hive Metastore “instance”
› 4 HCatalog Servers behind a hardware VIP
• L3DSR load balancer
• 96GB-128GB RAM, 16-core boxes
› Backed by Oracle RAC
 About 10 Gateways
› Interactive use of Hive (and Pig, Oozie, M/R)
› hive.metastore.uris -> HCatalog
 About 4 HiveServer2 instances
› Ad Hoc queries, aggregation
15 2015 Hadoop Summit, San Jose, California
Hive deployment (per cluster)
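On the gateways, clients reach the metastore through the VIP rather than any one HCatalog server. A minimal sketch of the client-side wiring (hostname hypothetical; 9083 is the conventional metastore Thrift port):
set hive.metastore.uris=thrift://hcat-vip.ygrid.example.com:9083;  -- VIP fronting the 4 HCatalog servers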
Evolution of grid services at Yahoo!
16 Yahoo Confidential & Proprietary
[Diagram: per-cluster grid services — Browser/HUE and BI Tools in front; Gateway Machines, HiveServer2, and HCatalog in the middle tier; the Grid and Oracle RAC behind them.]
 Query performance on very large data sets
› HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp
 Split-generation on very large data sets
› Tends to generate more splits (map tasks) compared to M/R
› Long split generation times
› Hogging the Hadoop queues
• Wave factor vs multi-tenancy requirements
› HIVE-10114: Split strategies for ORC
 Scaling problems with ATS
› More of a problem with Pig workflows
› 10K+ tasks/job are routine
› AM progress reporting, heart-beating, memory usage
› Hadoop 2.6.0.10+
17 2015 Hadoop Summit, San Jose, California
Challenges experienced with Hive on Tez
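HIVE-10114 made ORC split generation selectable; a hedged sketch of the knob it introduced:
set hive.exec.orc.split.strategy=BI;  -- BI: split fast without reading file footers;
                                      -- ETL: read footers for stripe-aligned splits;
                                      -- HYBRID: pick per table, based on file counts/sizes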
18 Yahoo Confidential & Proprietary
 At Yahoo! Scale,
› 100s of Databases per cluster
› 100s of Tables per database
› 100s of columns per Table
› 1000s of Partitions per Table
• Larger tables: Thousands of partitions per hour
• Millions of partitions every few days
• 10s of millions of partitions, over dataset retention period
 Problems:
› Metadata volume
• Database/Table/Partition IO Formats
• Record serialization details
• HDFS paths
• Statistics
– Per partition
– Per column
19 2015 Hadoop Summit, San Jose, California
Fast execution engines aren’t the whole picture
Letters from the trenches
21 2015 Hadoop Summit, San Jose, California
From: Another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow queries
YHive team,
My query fails with OutOfMemoryError. I tried increasing
container size, but it still fails. Please help!
Here are my settings:
set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts="-Xmx1024m"
...
INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo )
SELECT * FROM {
...
}
...
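(For context, the letter carries its own diagnosis: the containers are 4 GB, but the stale mapred.child.java.opts caps the child JVM heap at 1 GB, and the 16 MB max split size inflates the task count. A hedged sketch of a more consistent setup, values illustrative:)
set mapreduce.input.fileinputformat.split.maxsize=268435456;  -- 256 MB splits: fewer, larger tasks
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;      -- heap ~80% of the container
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;   -- ditto for reducers; drop mapred.child.java.opts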
22 2015 Hadoop Summit, San Jose, California
From: YET another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow UDF performance
YHive team,
Why does using a simple custom UDF cause queries to
time out?
SELECT foo, bar, my_function( goo )
FROM my_large_table
WHERE ...
23 2015 Hadoop Summit, San Jose, California
From: The ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following
6 partition keys: {hourly-timestamp, name, property,
geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the
remaining partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to
be faster, how come queries on our table take forever
just to get off the ground?
Yours gigantically,
Project Grape Ape
24 2015 Hadoop Summit, San Jose, California
25 2015 Hadoop Summit, San Jose, California
Metadata volume and Query Execution time
26 2015 Hadoop Summit, San Jose, California
 Anatomy of a Hive query
1. Compile query to AST
2. Thrift-call to Metastore, for partition list
3. Examine partitions, data-paths, etc. Construct physical query plan.
4. Run optimizers on the plan
5. Execute plan. (M/R, Tez).
 Partition pruner:
› Removes partitions that shouldn’t participate in the query.
› In effect, remove input-directories from the Hadoop job.
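As a toy illustration of pruning (table and values hypothetical):
-- my_table is partitioned by dt; the pruner keeps exactly one input directory:
SELECT COUNT(*) FROM my_table WHERE dt = '2015-06-09';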
The problems of large-scale metadata
27 2015 Hadoop Summit, San Jose, California
 Partition pruner is single-threaded
› Query spans a day
› Query spanning a week? 2 million partitions
 Partition objects are huge:
› HDFS Paths
› IO Formats
› Record Deserializer info
› Data column schema
 Datanucleus:
› 1 Partition: Join 6 Oracle tables in the backend.
 Thrift serialization/deserialization takes minutes.
› *Minutes*.
Immediate workarounds
28 2015 Hadoop Summit, San Jose, California
 “Hive wasn’t originally designed for more than 10000s of partitions, total…”
 Throw hardware at it
› 4 HCatalog servers behind a hardware VIP
› High-RAM boxes:
• 96-128 GB metastore processes
• Tune each to use 100 connections to the Oracle RAC
 Client-side tuning
› Increase hive.metastore.client.socket.timeout
› Increase heap size as needed (container size)
› Multi-threaded fstat operations
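A hedged sketch of the client-side knobs above (the timeout is in seconds; the 0.13-era default was on the order of tens of seconds):
set hive.metastore.client.socket.timeout=600;  -- survive slow getPartitions() calls
-- Client/launcher JVM heap grows with partition counts; size it accordingly,
-- e.g. via HADOOP_CLIENT_OPTS="-Xmx4096m" in the environment.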
Fix the leaky/noisy bits
29 2015 Hadoop Summit, San Jose, California
 Metastore frequently ran out of memory:
› Disable Hadoop FileSystem cache
• HIVE-3098, HDFS-3545
• FileSystem.CACHE used UGI.hashcode()
– Compared Subjects for equality, not equivalence.
› Fixed Thrift 0.9
• TSaslServerTransport had circular references
• JVM couldn’t detect these for cleanup
– WeakReferences are your friend
• Fix incompatibility with L3DSR pings
 Data discovery from Oozie:
› Use JMS notifications, on publication
› Oozie Coordinators wake up on ActiveMQ notification, kick off dependent workflows
› Reduced polling frequency
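Disabling the FileSystem cache is a per-scheme Hadoop switch; a minimal sketch:
set fs.hdfs.impl.disable.cache=true;  -- bypass FileSystem.CACHE for hdfs:// (sidesteps the UGI.hashCode() leak)
set fs.file.impl.disable.cache=true;  -- likewise for file://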
More fixes
30 2015 Hadoop Summit, San Jose, California
 Metadata-only queries:
› SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
› Replace HiveMetaStoreClient::getPartitions() with getPartitionNames().
› Local job, versus cluster.
 Optimize the optimizer:
› The first step in some optimizers:
• List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
• Pray that the client and/or the metastore don’t run out of memory.
• Take a nap.
› Fixed PartitionPruner, MetadataOnlyOptimizer.
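The metadata-only rewrite is itself gated by a flag; a hedged sketch (the setting exists upstream, though defaults vary by version):
set hive.optimize.metadataonly=true;  -- answer partition-key-only queries from the metastore, without scanning data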
Long-term fixes:
31 2015 Hadoop Summit, San Jose, California
 DirectSQL short-circuits:
› Datanucleus problems at scale
• (Yes, we are aware of the irony that might result from extrapolation.)
› Specific to the backing DB.
 Compaction of Partition info:
› HIVE-7223, HIVE-7576, HIVE-9845, etc.
› Schema evolves infrequently
› Partition-info rarely differs from table-info
– Except HDFS paths (which are superstrings of the table path)
› List<Partition> vs Iterator<Partition>
• PartitionSet abstraction
– The delight of Inheritance in Thrift
• Reduced memory footprints
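The direct-SQL path is toggled per metastore; a minimal sketch:
set hive.metastore.try.direct.sql=true;  -- bypass DataNucleus/JDO on hot read paths, falling back to JDO on error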
32 2015 Hadoop Summit, San Jose, California
“The finest trick of The Devil was to
persuade you that he does not exist.”
-- ???
33 2015 Hadoop Summit, San Jose, California
34 2015 Hadoop Summit, San Jose, California
35 2015 Hadoop Summit, San Jose, California
From: A major reporting team
To: The Yahoo Hive Team
Subject: Urgent! Customer reports are borking.
Dear YHive team,
When we connect Tableau Server 8.3 to Y!Hive
0.12/0.13, it is unusably slow. Queries take too long
to run, and time out.
We’d prefer not to change our query-code too much. How soon can Hive accommodate our simple queries?
Yours hysterically,
Project Zodiac
36 2015 Hadoop Summit, San Jose, California
Analysis: The query
37 2015 Hadoop Summit, San Jose, California
 Non-const partition key predicates:
› E.g.
WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd')
AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')
› Solution: Use constant expressions where possible.
› Fix: Hive 1.x supports dynamic partition pruning, and constant folding.
 Costly joins with partitioned dimension tables:
› E.g.
› SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table
WHERE dt IN (SELECT MAX(dt) FROM dimension_table));
› Workaround: External “pointer” tables.
› Fix: Dynamic partition pruning.
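For the first case, one hedged workaround is to pre-compute the date window and substitute constants at launch (variable names hypothetical):
-- e.g. hive -hiveconf start_dt=20150509 -hiveconf end_dt=20150608 -f report.hql
SELECT ...
FROM my_large_table
WHERE utc_time >= '${hiveconf:start_dt}'
AND utc_time <= '${hiveconf:end_dt}';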
Analysis: The data
38 2015 Hadoop Summit, San Jose, California
 Data stored in TEXTFILE
› Solution: Switch to columnar storage
• ORC, dictionary encoding, vectorization, predicate pushdown
 Over-partitioning:
› Too many partition keys
› Diminishing returns with partition pruning
› Solution: Eliminate partition keys, consider sorting
 Small Part files
› Hard-coded nReducers
› E.g.
hive> dfs -count /projects/foo.db/foo_stats;
9081 682735 1876847648672 /projects/foo.db/foo_stats
› Solution:
• set hive.merge.mapfiles=true;
• set hive.merge.mapredfiles=true;
• set hive.merge.tezfiles=true;
We’re not done yet
39 2015 Hadoop Summit, San Jose, California
 Tez/ATS scaling
 Speed up split calculation
 Auto/Offline compaction
 Abuse detection
 Better handling of schema evolution
 Skew Joins in Hive
 UDFs with JNI and configuring LD_LIBRARY_PATH
Questions?
Backup
YHive configuration settings:
42 2015 Hadoop Summit, San Jose, California
set hive.merge.mapfiles=false; -- Except when producing data.
set hive.merge.mapredfiles=false; -- Except when producing data.
set hive.merge.tezfiles=false; -- Except when producing data.
-- For ORC files.
-- dfs.blocksize=134217728; -- hdfs-site.xml
set orc.stripe.size=67108864; -- 64MB stripes.
set orc.compress.size=262144; -- 256KB compress buffer.
set orc.compress=ZLIB; -- Override to NONE, per table.
set orc.create.index=true; -- ORC indexes.
set orc.optimize.index.filter=true; -- Predicate pushdown with ORC index
set orc.row.index.stride=10000;
YHive configuration settings: (contd)
43 2015 Hadoop Summit, San Jose, California
-- Delegation Token Store settings:
set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;
set hive.cluster.delegation.token.renew-interval=172800000;
(Start the HCat Server with -Djute.maxbuffer set to 24 MB, expressed in bytes, to support 190K+ tokens.)
-- Data Nucleus settings:
set datanucleus.connectionPoolingType=DBCP; -- i.e., not BoneCP.
set datanucleus.cache.level1.type=none;
set datanucleus.cache.level2.type=none;
set datanucleus.connectionPool.maxWait=200000;
set datanucleus.connectionPool.minIdle=0;
-- Misc.
set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;
Zookeeper Token Storage performance
44 2015 Hadoop Summit, San Jose, California
Jute Buffer Size (in MB) Max delegation token count
4MB 30K
8MB 60K
12MB 90K
16MB 130K
20MB 160K
24MB 190K
45 2015 Hadoop Summit, San Jose, California
Why Hive on Tez?
46 2015 Hadoop Summit, San Jose, California
 Shark, Impala
› Pre-emption for in-memory systems
› Multi-tenant, shared clusters
› Heterogeneous nodes
› Existing ecosystem
› Community-driven development
 Shark
› Good proof of concept, but was not production ready
› Shuffle performance
› Hive on Spark – under active development
Analysis: Tableau/ODBC driver
47 2015 Hadoop Summit, San Jose, California
 Tableau has come a long way, but
› Schema discovery
• SELECT * FROM my_large_table LIMIT 0;
• SELECT DISTINCT part_key FROM my_large_table;
› SQL dialect
• Depends on vendor-specific driver-name
› Schema metadata-scans
• 3 partition listings per query
› Miscellaneous problems:
• “Custom SQL” rewrites
• Trouble with quoting
 tl;dr : Try to transition to Simba’s 2.0.x Drivers with Tableau 8.3.x
Editor's Notes
  • #3: TODO: Update latest profile pic
  • #4: TODO: Update latest profile pic
  • #6: At last year’s talk, which was received so enthusiastically.
  • #9: Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
  • #10: Why 13? Why move from 12?
  • #11: 10000s of files? Spark, HBase
  • #13: Talk up the work from Gemini. Power-users of Tableau Server. People with RDBMS expertise think Partitions are analogous to Indexes. The more you have, the faster the query should run.
  • #15: Talk up the work from Gemini. Power-users of Tableau Server. People with RDBMS expertise think Partitions are analogous to Indexes. The more you have, the faster the query should run.
  • #16: Add diagram for deployment of Hive, and its evolution. Describe the problem with
  • #18: Add diagram for deployment of Hive, and its evolution.
  • #19: Last year saw a tonne of benchmarketing. Tez vs Spark (vs Impala). We’ve had several choices of execution engines. But we seem to have forgotten to scale a crucial part of the system. The metastore.
  • #20: Talk about the kinds of metadata: Input/Output formats, per table, per partition. Record format information. SerDe classes. Data paths Table/Partition level statistics: Also mention the Hundreds of columns per table.
  • #22: Small split-size.
  • #23: My_function() is a webservice call. hive.log.incremental.plan.progress.
  • #25: This table is our largest. We use this to test and break our system.
  • #27: Focus on data-paths.
  • #29: Interesting segue: The “short” nPartitions parameter.
  • #30: Interesting segue: The “short” nPartitions parameter.
  • #31: Interesting segue: The “short” nPartitions parameter.
  • #32: Elaborate the problems with DataNucleus at scale: thread safety, memory usage, performance. Schema evolution can happen both at a geological pace and at a tectonic scale. Inheritance in Thrift is like implementing it in C. Mention that similar changes were made in Pig/HCatalog, for compressing Partition info: 26x storage saving (for split meta-info), and 10x faster query start-up.
  • #33: The Java anecdote.
  • #34: Verbal Kint.
  • #35: Bonus: The rooftop scene in Sherlock 2.3.
  • #36: Charles Baudelaire. The Java anecdote.
  • #37: Introduce the beast that is Tableau. Flash the “simple” query.
  • #40: Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
  • #43: Talk about distcp –pgrub, for ORC files.
  • #46: At last year’s talk, which was received so enthusiastically.
  • #47: Shark was a good proof-of-concept, but was not production ready.
  • #48: Praise the work from Simba. Rework slide: too much info; just put the TL;DR.