Hive at Yahoo: Letters from the trenches
P R E S E N T E D B Y M i t h u n R a d h a k r i s h n a n , C h r i s D r o m e ⎪ J u n e 1 0 , 2 0 1 5
2 0 1 5 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a
About myself
2 2015 Hadoop Summit, San Jose, California
 Mithun Radhakrishnan
 Hive Engineer at Yahoo!
 Hive Committer and long-time contributor
› Metastore-scaling
› Integration
› HCatalog
 mithun@apache.org
 @mithunrk
About myself
3 2015 Hadoop Summit, San Jose, California
 Chris Drome
 Hive Engineer at Yahoo!
 Hive contributor
 cdrome@yahoo-inc.com
Recap
5 2015 Hadoop Summit, San Jose, California
6 2015 Hadoop Summit, San Jose, California
[Chart: TPC-H at 1 TB scale. X-axis: the 22 TPC-H queries (q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive). Y-axis: time in seconds, 0-2500. Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC on M/R, Hive 0.13 ORC on Tez.]
1 TB
7 2015 Hadoop Summit, San Jose, California
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5-17x
› Average query time: 172 seconds
• Between 5-947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% queries completed in under 2 minutes
› 81% queries completed in under 4 minutes
Explaining the speed-ups
8 2015 Hadoop Summit, San Jose, California
 Hadoop 2.x, et al.
 Apache Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between Map and Reduce
• Intermediate data and HDFS
› Smart scheduling
› Container re-use
› Pipelined job start-up
 Hive
› Statistics
› Vectorized Execution
 ORC
› PPD
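These features are switch-driven. A hedged sketch of the Hive 0.13-era knobs behind the bullets above (setting names from the Apache Hive documentation of that era; values illustrative):
set hive.execution.engine=tez;               -- run on Tez instead of M/R
set hive.vectorized.execution.enabled=true;  -- vectorized operators over ORC
set hive.compute.query.using.stats=true;     -- answer simple aggregates from statistics
set hive.stats.autogather=true;              -- gather table/partition stats on writes
set hive.optimize.index.filter=true;         -- predicate pushdown (PPD) into ORC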
Expectations with Hive 0.13 in production
9 2015 Hadoop Summit, San Jose, California
 Tez would outperform M/R by miles
 Tez would enable better cluster utilization
› Use fewer resources
 Tez (and dependencies) would be “production ready”
› GUI for task logs, DAG overviews, swim-lanes
› Speculative execution
 Similarly, ORC and Vectorization
› Support evolving schemas
The Y!Grid
10 2015 Hadoop Summit, San Jose, California
 18 Hadoop Clusters in YGrid
› 41565 Nodes
› Biggest cluster: 5728 Nodes
› 1M jobs a day
 Hadoop 2.6+
 Large Datasets
› Daily, hourly, minute-level frequencies
› Thousands of partitions, 100s of 1000s of files, TBs of data per partition
› 580 PB of data, total
 Pig 0.14 on Tez, Pig 0.11
 Hive 0.13 on Tez
 HCatalog for interoperability
 Oozie for scheduling
 GDM for data-loading
 Spark, HBase, Storm, etc…
Data processing use cases
11 2015 Hadoop Summit, San Jose, California
 Grid usage
› 30+ million jobs per month
› 12+ million Oozie launcher jobs
 Pig usage
› Handles majority of data pipelines/ETL (~43% of jobs)
 Hive usage
› Relatively smaller niche
› 632,000 queries per month (35% Tez)
 HCatalog for Inter-operability
› Metadata storage for all Hadoop data
› Yahoo-scale
› Pig pipelines with Hive analytics
Business Intelligence Tools
12 2015 Hadoop Summit, San Jose, California
 Tableau, MicroStrategy
 Power users
› Tableau Server for scheduled reports
 Challenges:
› Security
• ACLs, Authentication, Encryption over the wire
› Bandwidth
• Transporting results over ODBC
• Limit result-set to 1000s-10000s of rows
• Aggregations
› Query Latency
• Metadata queries
• Partition/Table scans
• Materialized views
 Data producer owns the data
› Unlike traditional DBs
 Multi-paradigm data access/generation
› Pig/Hive/MapReduce using HCatalog
 Highly available metadata service
 UI for tracking/debugging jobs
 Execution engine should ideally support speculative execution
13 2015 Hadoop Summit, San Jose, California
Non-negotiables for Hive upgrade at Yahoo!
Yahoo! Hive-0.13
14 2015 Hadoop Summit, San Jose, California
 Based on Apache Hive-0.13.1
 Internal Yahoo! Patches (admin web-services, data discovery, etc.)
 Community patches to stabilize Apache Hive-0.13.1
› Tez
• HIVE-7544, HIVE-6748, HIVE-7112, …
› Vectorization
• HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …
› Failures
• HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …
› Optimizations
• HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …
› Data integrity
• HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …
 Phased upgrades
› Phase 1: 285 JIRAs
› Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)
› Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)
 One remote Hive Metastore “instance”
› 4 HCatalog Servers behind a hardware VIP
• L3DSR load balancer
• 96GB-128GB RAM, 16-core boxes
› Backed by Oracle RAC
 About 10 Gateways
› Interactive use of Hive (and Pig, Oozie, M/R)
› hive.metastore.uris -> HCatalog
 About 4 HiveServer2 instances
› Ad Hoc queries, aggregation
15 2015 Hadoop Summit, San Jose, California
Hive deployment (per cluster)
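On the gateways, clients reach the metastore through the VIP rather than any one HCatalog server. A minimal sketch of the client-side wiring (hostname hypothetical; 9083 is the conventional metastore Thrift port):
set hive.metastore.uris=thrift://hcat-vip.ygrid.example.com:9083;  -- VIP fronting the 4 HCatalog servers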
Evolution of grid services at Yahoo!
16 Yahoo Confidential & Proprietary
[Diagram: per-cluster grid services — Browser/HUE and BI Tools in front; Gateway Machines, HiveServer2, and HCatalog in the middle tier; the Grid and Oracle RAC behind them.]
 Query performance on very large data sets
› HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp
 Split-generation on very large data sets
› Tends to generate more splits (map tasks) compared to M/R
› Long split generation times
› Hogging the Hadoop queues
• Wave factor vs multi-tenancy requirements
› HIVE-10114: Split strategies for ORC
 Scaling problems with ATS
› More of a problem with Pig workflows
› 10K+ tasks/job are routine
› AM progress reporting, heart-beating, memory usage
› Hadoop 2.6.0.10+
17 2015 Hadoop Summit, San Jose, California
Challenges experienced with Hive on Tez
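HIVE-10114 made ORC split generation selectable; a hedged sketch of the knob it introduced:
set hive.exec.orc.split.strategy=BI;  -- BI: split fast without reading file footers;
                                      -- ETL: read footers for stripe-aligned splits;
                                      -- HYBRID: pick per table, based on file counts/sizes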
18 Yahoo Confidential & Proprietary
 At Yahoo! Scale,
› 100s of Databases per cluster
› 100s of Tables per database
› 100s of columns per Table
› 1000s of Partitions per Table
• Larger tables: Thousands of partitions per hour
• Millions of partitions every few days
• 10s of millions of partitions, over dataset retention period
 Problems:
› Metadata volume
• Database/Table/Partition IO Formats
• Record serialization details
• HDFS paths
• Statistics
– Per partition
– Per column
19 2015 Hadoop Summit, San Jose, California
Fast execution engines aren’t the whole picture
Letters from the trenches
21 2015 Hadoop Summit, San Jose, California
From: Another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow queries
YHive team,
My query fails with OutOfMemoryError. I tried increasing
container size, but it still fails. Please help!
Here are my settings:
set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts="-Xmx1024m"
...
INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo )
SELECT * FROM {
...
}
...
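(For context, the letter carries its own diagnosis: the containers are 4 GB, but the stale mapred.child.java.opts caps the child JVM heap at 1 GB, and the 16 MB max split size inflates the task count. A hedged sketch of a more consistent setup, values illustrative:)
set mapreduce.input.fileinputformat.split.maxsize=268435456;  -- 256 MB splits: fewer, larger tasks
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;      -- heap ~80% of the container
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;   -- ditto for reducers; drop mapred.child.java.opts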
22 2015 Hadoop Summit, San Jose, California
From: YET another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow UDF performance
YHive team,
Why does using a simple custom UDF cause queries to
time out?
SELECT foo, bar, my_function( goo )
FROM my_large_table
WHERE ...
23 2015 Hadoop Summit, San Jose, California
From: The ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following
6 partition keys: {hourly-timestamp, name, property,
geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the
remaining partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to
be faster, how come queries on our table take forever
just to get off the ground?
Yours gigantically,
Project Grape Ape
24 2015 Hadoop Summit, San Jose, California
25 2015 Hadoop Summit, San Jose, California
Metadata volume and Query Execution time
26 2015 Hadoop Summit, San Jose, California
 Anatomy of a Hive query
1. Compile query to AST
2. Thrift-call to Metastore, for partition list
3. Examine partitions, data-paths, etc. Construct physical query plan.
4. Run optimizers on the plan
5. Execute plan. (M/R, Tez).
 Partition pruner:
› Removes partitions that shouldn’t participate in the query.
› In effect, remove input-directories from the Hadoop job.
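As a toy illustration of pruning (table and values hypothetical):
-- my_table is partitioned by dt; the pruner keeps exactly one input directory:
SELECT COUNT(*) FROM my_table WHERE dt = '2015-06-09';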
The problems of large-scale metadata
27 2015 Hadoop Summit, San Jose, California
 Partition pruner is single-threaded
› Query spans a day
› Query spanning a week? 2 million partitions
 Partition objects are huge:
› HDFS Paths
› IO Formats
› Record Deserializer info
› Data column schema
 Datanucleus:
› 1 Partition: Join 6 Oracle tables in the backend.
 Thrift serialization/deserialization takes minutes.
› *Minutes*.
Immediate workarounds
28 2015 Hadoop Summit, San Jose, California
 “Hive wasn’t originally designed for more than 10000s of partitions, total…”
 Throw hardware at it
› 4 HCatalog servers behind a hardware VIP
› High-RAM boxes:
• 96-128 GB metastore processes
• Tune each to use 100 connections to the Oracle RAC
 Client-side tuning
› Increase hive.metastore.client.socket.timeout
› Increase heap size as needed (container size)
› Multi-threaded fstat operations
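A hedged sketch of the client-side knobs above (the timeout is in seconds; the 0.13-era default was on the order of tens of seconds):
set hive.metastore.client.socket.timeout=600;  -- survive slow getPartitions() calls
-- Client/launcher JVM heap grows with partition counts; size it accordingly,
-- e.g. via HADOOP_CLIENT_OPTS="-Xmx4096m" in the environment.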
Fix the leaky/noisy bits
29 2015 Hadoop Summit, San Jose, California
 Metastore frequently ran out of memory:
› Disable Hadoop FileSystem cache
• HIVE-3098, HDFS-3545
• FileSystem.CACHE used UGI.hashcode()
– Compared Subjects for equality, not equivalence.
› Fixed Thrift 0.9
• TSaslServerTransport had circular references
• JVM couldn’t detect these for cleanup
– WeakReferences are your friend
• Fix incompatibility with L3DSR pings
 Data discovery from Oozie:
› Use JMS notifications, on publication
› Oozie Coordinators wake up on ActiveMQ notification, kick off dependent workflows
› Reduced polling frequency
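Disabling the FileSystem cache is a per-scheme Hadoop switch; a minimal sketch:
set fs.hdfs.impl.disable.cache=true;  -- bypass FileSystem.CACHE for hdfs:// (sidesteps the UGI.hashCode() leak)
set fs.file.impl.disable.cache=true;  -- likewise for file://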
More fixes
30 2015 Hadoop Summit, San Jose, California
 Metadata-only queries:
› SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
› Replace HiveMetaStoreClient::getPartitions() with getPartitionNames().
› Local job, versus cluster.
 Optimize the optimizer:
› The first step in some optimizers:
• List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
• Pray that the client and/or the metastore don’t run out of memory.
• Take a nap.
› Fixed PartitionPruner, MetadataOnlyOptimizer.
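The metadata-only rewrite is itself gated by a flag; a hedged sketch (the setting exists upstream, though defaults vary by version):
set hive.optimize.metadataonly=true;  -- answer partition-key-only queries from the metastore, without scanning data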
Long-term fixes:
31 2015 Hadoop Summit, San Jose, California
 DirectSQL short-circuits:
› Datanucleus problems at scale
• (Yes, we are aware of the irony that might result from extrapolation.)
› Specific to the backing DB.
 Compaction of Partition info:
› HIVE-7223, HIVE-7576, HIVE-9845, etc.
› Schema evolves infrequently
› Partition-info rarely differs from table-info
– Except HDFS paths (which are superstrings of the table path)
› List<Partition> vs Iterator<Partition>
• PartitionSet abstraction
– The delight of Inheritance in Thrift
• Reduced memory footprints
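The direct-SQL path is toggled per metastore; a minimal sketch:
set hive.metastore.try.direct.sql=true;  -- bypass DataNucleus/JDO on hot read paths, falling back to JDO on error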
32 2015 Hadoop Summit, San Jose, California
“The finest trick of The Devil was to
persuade you that he does not exist.”
-- ???
33 2015 Hadoop Summit, San Jose, California
34 2015 Hadoop Summit, San Jose, California
35 2015 Hadoop Summit, San Jose, California
From: A major reporting team
To: The Yahoo Hive Team
Subject: Urgent! Customer reports are borking.
Dear YHive team,
When we connect Tableau Server 8.3 to Y!Hive
0.12/0.13, it is unusably slow. Queries take too long
to run, and time out.
We’d prefer not to change our query-code too much. How soon can Hive accommodate our simple queries?
Yours hysterically,
Project Zodiac
36 2015 Hadoop Summit, San Jose, California
Analysis: The query
37 2015 Hadoop Summit, San Jose, California
 Non-const partition key predicates:
› E.g.
WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd')
AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')
› Solution: Use constant expressions where possible.
› Fix: Hive 1.x supports dynamic partition pruning, and constant folding.
 Costly joins with partitioned dimension tables:
› E.g.
› SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table
WHERE dt IN (SELECT MAX(dt) FROM dimension_table));
› Workaround: External “pointer” tables.
› Fix: Dynamic partition pruning.
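For the first case, one hedged workaround is to pre-compute the date window and substitute constants at launch (variable names hypothetical):
-- e.g. hive -hiveconf start_dt=20150509 -hiveconf end_dt=20150608 -f report.hql
SELECT ...
FROM my_large_table
WHERE utc_time >= '${hiveconf:start_dt}'
AND utc_time <= '${hiveconf:end_dt}';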
Analysis: The data
38 2015 Hadoop Summit, San Jose, California
 Data stored in TEXTFILE
› Solution: Switch to columnar storage
• ORC, dictionary encoding, vectorization, predicate pushdown
 Over-partitioning:
› Too many partition keys
› Diminishing returns with partition pruning
› Solution: Eliminate partition keys, consider sorting
 Small Part files
› Hard-coded nReducers
› E.g.
hive> dfs -count /projects/foo.db/foo_stats;
9081 682735 1876847648672 /projects/foo.db/foo_stats
› Solution:
• set hive.merge.mapfiles=true;
• set hive.merge.mapredfiles=true;
• set hive.merge.tezfiles=true;
We’re not done yet
39 2015 Hadoop Summit, San Jose, California
 Tez/ATS scaling
 Speed up split calculation
 Auto/Offline compaction
 Abuse detection
 Better handling of schema evolution
 Skew Joins in Hive
 UDFs with JNI and configuring LD_LIBRARY_PATH
Questions?
Backup
YHive configuration settings:
42 2015 Hadoop Summit, San Jose, California
set hive.merge.mapfiles=false; -- Except when producing data.
set hive.merge.mapredfiles=false; -- Except when producing data.
set hive.merge.tezfiles=false; -- Except when producing data.
-- For ORC files.
-- dfs.blocksize=134217728; -- hdfs-site.xml
set orc.stripe.size=67108864; -- 64MB stripes.
set orc.compress.size=262144; -- 256KB compress buffer.
set orc.compress=ZLIB; -- Override to NONE, per table.
set orc.create.index=true; -- ORC indexes.
set orc.optimize.index.filter=true; -- Predicate pushdown with ORC index
set orc.row.index.stride=10000;
YHive configuration settings: (contd)
43 2015 Hadoop Summit, San Jose, California
-- Delegation Token Store settings:
set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;
set hive.cluster.delegation.token.renew-interval=172800000;
(Start the HCat Server with -Djute.maxbuffer set to 24 MB, expressed in bytes, to support 190K+ tokens.)
-- Data Nucleus settings:
set datanucleus.connectionPoolingType=DBCP; -- i.e., not BoneCP.
set datanucleus.cache.level1.type=none;
set datanucleus.cache.level2.type=none;
set datanucleus.connectionPool.maxWait=200000;
set datanucleus.connectionPool.minIdle=0;
-- Misc.
set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;
Zookeeper Token Storage performance
44 2015 Hadoop Summit, San Jose, California
Jute Buffer Size (in MB) Max delegation token count
4MB 30K
8MB 60K
12MB 90K
16MB 130K
20MB 160K
24MB 190K
45 2015 Hadoop Summit, San Jose, California
Why Hive on Tez?
46 2015 Hadoop Summit, San Jose, California
 Shark, Impala
› Pre-emption for in-memory systems
› Multi-tenant, shared clusters
› Heterogeneous nodes
› Existing ecosystem
› Community-driven development
 Shark
› Good proof of concept, but was not production ready
› Shuffle performance
› Hive on Spark – under active development
Analysis: Tableau/ODBC driver
47 2015 Hadoop Summit, San Jose, California
 Tableau has come a long way, but
› Schema discovery
• SELECT * FROM my_large_table LIMIT 0;
• SELECT DISTINCT part_key FROM my_large_table;
› SQL dialect
• Depends on vendor-specific driver-name
› Schema metadata-scans
• 3 partition listings per query
› Miscellaneous problems:
• “Custom SQL” rewrites
• Trouble with quoting
 tl;dr : Try to transition to Simba’s 2.0.x Drivers with Tableau 8.3.x
Editor's Notes
  • #3: TODO: Update latest profile pic
  • #4: TODO: Update latest profile pic
  • #6: At last year’s talk, which was received so enthusiastically.
  • #9: Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
  • #10: Why 13? Why move from 12?
  • #11: 10000s of files? Spark, HBase
  • #13: Talk up the work from Gemini. Power-users of Tableau Server. People with RDBMS expertise think Partitions are analogous to Indexes. The more you have, the faster the query should run.
  • #15: Talk up the work from Gemini. Power-users of Tableau Server. People with RDBMS expertise think Partitions are analogous to Indexes. The more you have, the faster the query should run.
  • #16: Add diagram for deployment of Hive, and its evolution. Describe the problem with
  • #18: Add diagram for deployment of Hive, and its evolution.
  • #19: Last year saw a tonne of benchmarketing. Tez vs Spark (vs Impala). We’ve had several choices of execution engines. But we seem to have forgotten to scale a crucial part of the system. The metastore.
  • #20: Talk about the kinds of metadata: Input/Output formats, per table, per partition. Record format information. SerDe classes. Data paths Table/Partition level statistics: Also mention the Hundreds of columns per table.
  • #22: Small split-size.
  • #23: My_function() is a webservice call. hive.log.incremental.plan.progress.
  • #25: This table is our largest. We use this to test and break our system.
  • #27: Focus on data-paths.
  • #29: Interesting segue: The “short” nPartitions parameter.
  • #30: Interesting segue: The “short” nPartitions parameter.
  • #31: Interesting segue: The “short” nPartitions parameter.
  • #32: Elaborate the problems with DataNucleus at scale: thread safety, memory usage, performance. Schema evolution can happen both at a geological pace and at a tectonic scale. Inheritance in Thrift is like implementing it in C. Mention that similar changes were made in Pig/HCatalog, for compressing Partition info: 26x storage saving (for split meta-info), and 10x faster query start-up.
  • #33: The Java anecdote.
  • #34: Verbal Kint.
  • #35: Bonus: The rooftop scene in Sherlock 2.3.
  • #36: Charles Baudelaire. The Java anecdote.
  • #37: Introduce the beast that is Tableau. Flash the “simple” query.
  • #40: Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
  • #43: Talk about distcp –pgrub, for ORC files.
  • #46: At last year’s talk, which was received so enthusiastically.
  • #47: Shark was a good proof-of-concept, but was not production ready.
  • #48: Praise the work from Simba. Rework slide: too much info; just put the TL;DR.