1 © Hortonworks Inc. 2011–2018. All rights reserved.
What is new in Apache Hive?
Ashutosh Chauhan
Apache Hive – Distant Past – First Five Years
• Initial use case: batch processing
• Circa 2008
• Read-only data
• MapReduce
• HiveQL
Apache Hive – Past 5 Years
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Latest released version: Apache Hive 3.0 (May 2018)
• Extensive renovation along four different axes
• Runtime : Enable sub-second queries - LLAP
• Compiler : Cost Based Optimizer
• SQL support : Improved coverage of SQL syntax
• Transactional Support : ACID
Hive – Today
• Comprehensive ANSI SQL including all TPC-DS Queries.
• The only Hadoop SQL with ACID MERGE for easy updates.
• In-Memory caching for MPP performance at Hadoop scale.
• Enables Per-User dynamic row and column security.
• Enables Replication and DR for critical workloads.
• Compatible with every major BI Tool.
• Proven at 300+ PB Scale.
Apache Hive: Fast Facts
Most Queries Per Hour
100,000 Queries Per Hour
Analytics Performance
100 Million rows/s Per Node
Largest Hive Warehouse
300+ PB Raw Storage
Largest Cluster
4,500+ Nodes
Hive: Serving ETL Workloads to BI Systems
[Figure: BI systems served by materialized views, improved stats, constraints, query result cache, workload management and ACID v2]
• Results return from HDFS/cache directly
• Reduce load from repetitive queries
• Allows more queries to be run in parallel
• Reduce resource starvation in large clusters
• Also: Active/Passive HA
• More “tools” for optimizer to use
• More “tools” for DBAs to tune/optimize
• Invisible tuning of DB from users’ perspective
• ACID v2 is as fast as regular tables
• SIGMOD Software Systems Award
• “For developing seminal software systems that served to bring relational-style declarative programming to the Hadoop ecosystem.”
• Earlier recipients include Postgres, SQLite and MonetDB
Hive – How Did We Get Here?
• LLAP Enhancements
• CBO Enhancements
• ACID Enhancements
Materialized Views in Hive
Accelerating Query Processing
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Optimization based on access patterns
Materialized Views to the Rescue
• Speed up aggregates and joins via MVs
• View navigation via CBO/Calcite
• Optionally allow rewrites against out-of-date materializations
Materialized Views in Hive 3
• Multiple storage options: Hive, Druid
• Multiple options to control materialized views lifecycle
Materialized View-based Rewriting
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extendedprice * lo_discount AS d_price,
lo_revenue - lo_supplycost
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
[Figure: the mv contents are built from lineorder joined with customer, dates, part and supplier; the query results are read from the mv contents]
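The rewriting above can be checked on a toy data set. The following is a minimal sketch, assuming a plain SQLite table as a stand-in for the materialized view and a made-up four-row lineorder table; it only illustrates that the original query and the rewritten query return the same answer, not how Hive or Calcite perform the rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dates (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE lineorder (lo_orderdate INTEGER, lo_extendedprice INTEGER, lo_discount INTEGER);
INSERT INTO dates VALUES (1, 2012), (2, 2013), (3, 2013);
INSERT INTO lineorder VALUES (1, 100, 2), (2, 200, 1), (2, 300, 5), (3, 400, 3);
-- Stand-in for the materialized view: the join is computed once and stored.
CREATE TABLE mv AS
SELECT d_year, lo_discount, lo_extendedprice * lo_discount AS d_price
FROM lineorder JOIN dates ON lo_orderdate = d_datekey;
""")

original = """
SELECT SUM(lo_extendedprice * lo_discount)
FROM lineorder, dates
WHERE lo_orderdate = d_datekey AND d_year = 2013 AND lo_discount BETWEEN 1 AND 3
"""
rewritten = """
SELECT SUM(d_price) FROM mv
WHERE d_year = 2013 AND lo_discount BETWEEN 1 AND 3
"""

def run(sql):
    """Return the single value produced by an aggregate query."""
    return conn.execute(sql).fetchone()[0]

print(run(original), run(rewritten))
```

The rewritten query never touches the base tables, which is where the speed-up comes from when the join is expensive.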
Rebuilding Materialized Views
• ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables
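The idea of refreshing only changed data can be sketched in a few lines. This is an illustrative model, assuming an append-only source whose rows carry increasing ids; the class name, the row layout and the id-based watermark are hypothetical and do not describe how Hive tracks changes internally.

```python
from collections import defaultdict

class IncrementalSumView:
    """Keeps SUM(amount) per key and folds in only rows not yet seen."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.last_seen = 0  # highest row id already reflected in the view

    def rebuild(self, source_rows):
        """source_rows: list of (row_id, key, amount), append-only, ids increasing."""
        fresh = [r for r in source_rows if r[0] > self.last_seen]
        for row_id, key, amount in fresh:
            self.totals[key] += amount
            self.last_seen = max(self.last_seen, row_id)
        return len(fresh)  # number of rows actually processed

source = [(1, "a", 10), (2, "b", 5)]
view = IncrementalSumView()
first = view.rebuild(source)      # processes both rows
source.append((3, "a", 7))
second = view.rebuild(source)     # processes only the new row
```

A full rebuild would rescan every source row each time; the incremental version touches only the delta.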
Accelerating Query Processing with Materialized Views in Apache Hive
Jesus Camacho Rodriguez
Tuesday, June 19
2:50 PM - 3:30 PM
Executive Ballroom 210A/E
Workload Management
Overview
• Effectively share LLAP cluster resources
• Resource allocation per user policy; separate ETL and BI, etc.
• Resource-based guardrails
• Protect against long running queries, high memory usage
• Improved, query-aware scheduling
• Scheduler is aware of query characteristics, types, etc.
• Fragments easy to pre-empt compared to containers
• Queries get guaranteed fractions of the cluster, but can use empty space
Resource Plans
• Resource plan is a workload management configuration for a cluster
• Switching is allowed without stopping queries, e.g. based on time of day
• Cluster is divided into query pools (optionally nested)
• Each pool defines query parallelism, cluster resources percentage
• Queries are automatically routed to pools based on user name, app, etc.
• Rules (triggers) to kill, move, or deprioritize queries based on DFS usage, runtime, etc.
• Example :
CREATE RESOURCE PLAN daytime;
CREATE POOL bi IN daytime (resource_percent=75, concurrent_queries=5);
CREATE POOL etl IN daytime (resource_percent=25, concurrent_queries=10);
CREATE RULE downgrade IN daytime WHEN total_runtime > 120 THEN MOVE etl;
ADD RULE downgrade TO bi IN daytime ;
CREATE MAPPING tableau IN daytime (application='Tableau', pool=bi);
ALTER PLAN daytime SET default_pool='etl';
APPLY PLAN daytime;
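The behaviour described by the example plan can be modelled as a small routing table. The sketch below is an assumption-laden illustration: the class names, the seconds unit for total_runtime and the in-memory structures are all hypothetical, and the concurrency limit is stored but not enforced.

```python
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    resource_percent: int
    concurrent_queries: int          # kept for reference, not enforced in this sketch
    running: list = field(default_factory=list)

@dataclass
class Query:
    qid: str
    application: str
    total_runtime: int = 0           # seconds, an assumed unit
    pool: str = ""

class ResourcePlan:
    def __init__(self, default_pool):
        self.pools = {}
        self.mappings = {}           # application name -> pool name
        self.rules = {}              # pool name -> (runtime threshold, target pool)
        self.default_pool = default_pool

    def add_pool(self, pool):
        self.pools[pool.name] = pool

    def route(self, query):
        """Send a query to its mapped pool, or to the default pool."""
        query.pool = self.mappings.get(query.application, self.default_pool)
        self.pools[query.pool].running.append(query)

    def apply_rules(self):
        """Move queries that exceed a pool's runtime threshold to the target pool."""
        for pool_name, (threshold, target) in self.rules.items():
            pool = self.pools[pool_name]
            for q in [q for q in pool.running if q.total_runtime > threshold]:
                pool.running.remove(q)
                q.pool = target
                self.pools[target].running.append(q)

plan = ResourcePlan(default_pool="etl")
plan.add_pool(Pool("bi", resource_percent=75, concurrent_queries=5))
plan.add_pool(Pool("etl", resource_percent=25, concurrent_queries=10))
plan.mappings["Tableau"] = "bi"
plan.rules["bi"] = (120, "etl")

dashboard = Query("q1", application="Tableau", total_runtime=30)
heavy = Query("q2", application="Tableau", total_runtime=300)
nightly = Query("q3", application="spark-job")
for q in (dashboard, heavy, nightly):
    plan.route(q)
plan.apply_rules()
```

After the rules run, the short Tableau query stays in bi, the long-running one is downgraded to etl, and the unmapped job lands in the default pool.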
Decentralized Guaranteed Resources
• A guaranteed task for each resource (executor slots)
• HS2 gives N guaranteed tasks to an AM based on configured resource plan
• Each AM marks N of its most important tasks as guaranteed at any given time
• Guaranteed tasks pre-empt speculative tasks
Guaranteed Tasks – BI and ETL Example
[Figure: HS2 and three LLAP daemons, each with a wait queue and executors; 18 executors total]
• BI pool (80% = 14 guaranteed), Query 1: 10 active tasks (running): 10 guaranteed (running), 4 unused for now
• ETL pool (20% = 4 guaranteed), Query 2: 19 active tasks (8 running): 4 guaranteed (4 running), 15 speculative (4 running)
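The numbers in this example can be reproduced with a short calculation. The sketch below assumes a largest-remainder rounding rule to turn percentages into whole executor slots; the slides do not say which rounding Hive uses, so the function is an illustration only.

```python
import math

def split_guaranteed(total_executors, pool_percent):
    """Largest-remainder split of executor slots across pools (assumed rounding rule)."""
    exact = {p: total_executors * pct / 100 for p, pct in pool_percent.items()}
    alloc = {p: math.floor(x) for p, x in exact.items()}
    leftover = total_executors - sum(alloc.values())
    # Hand remaining slots to the pools with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda p: exact[p] - alloc[p], reverse=True)
    for p in by_fraction[:leftover]:
        alloc[p] += 1
    return alloc

alloc = split_guaranteed(18, {"bi": 80, "etl": 20})
bi_running = min(10, alloc["bi"])            # Query 1 has only 10 active tasks
idle = alloc["bi"] - bi_running              # guaranteed slots unused for now
etl_running = min(19, alloc["etl"] + idle)   # guaranteed tasks plus speculative ones
```

The idle BI slots are what let Query 2 run speculative tasks beyond its own guarantee; they would be pre-empted as soon as BI needs them back.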
Caching
Caching for BI Workloads
• Fine-grained (columnar), compact (dictionary, RLE encoded)
• Important due to projections over many wide EDW tables
• Prioritized – indexes are cached with higher priority
• Important to make use of predicate pushdown
• Off-heap (no extra GC), supports SSD
• LRFU replacement policy avoids the damage from large scans
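To see why LRFU resists large scans, here is a minimal sketch of the policy. It assumes the textbook formulation in which each entry carries a combined recency and frequency score that decays by a factor of 0.5 ** (lam * age); the class, the lam value and the logical clock are hypothetical and are not taken from the LLAP implementation.

```python
class LRFUCache:
    """Toy LRFU cache: evicts the entry with the lowest decayed score."""

    def __init__(self, capacity, lam=0.1):
        self.capacity = capacity
        self.lam = lam          # 0 behaves like LFU, larger values approach LRU
        self.clock = 0          # logical time, advanced on every access
        self.entries = {}       # key -> (score, last_access_time)

    def _decayed(self, key):
        score, last = self.entries[key]
        return score * 0.5 ** (self.lam * (self.clock - last))

    def access(self, key):
        """Return True on a hit, False on a miss (the key is then cached)."""
        self.clock += 1
        if key in self.entries:
            self.entries[key] = (1.0 + self._decayed(key), self.clock)
            return True
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self._decayed)
            del self.entries[victim]
        self.entries[key] = (1.0, self.clock)
        return False

cache = LRFUCache(capacity=3)
for _ in range(5):
    cache.access("hot")           # frequently used block
for i in range(10):
    cache.access(f"scan-{i}")     # one-off blocks from a large scan
```

Under plain LRU the scan would have pushed the hot block out after three misses; here its accumulated score keeps it above the single-use scan blocks.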
Caching for BI Workloads – Formats, Zero-ETL
• ORC, Parquet
• Cached natively
• Zero-ETL analytics on CSV and JSON data with text caching
• Text is efficiently encoded in background; once cached, queries speed up
In-memory Processing – Native Columnar (ORC)
[Figure: compressed data on the distributed FS is read by I/O threads (read planner, compression codec, ORC decoder) into the off-heap cache and SSD cache as compact encoded data, subject to a replacement policy; execution threads run fragments of vectorized Hive operators over native data vectors (col1, col2)]
Running Hive queries fast in the cloud
Nita Dembla
Wednesday, June 20
4:00 PM - 4:40 PM
Grand Ballroom 220C
Druid + Apache Hive
Layer | Data Access Pattern | Features
Hive Layer | Large-scale analytics | Joins, subqueries, windowing functions, transformations, complex aggregations, advanced sorting, UDFs
Druid Layer | Needle-in-a-haystack queries with large numbers of dimensions | Dimensional aggregates, top-N queries, min/max values, timeseries queries, approximate distinct count, approximate histograms
Druid Integration
• Pushdown of aggregate queries
• Pushdown of complex expressions
• Improvements in Druid to support SQL-standard NULL semantics
• Store materialized views in Druid
Hive 3: Real-time Ingestion
[Figure: Kafka-Druid-Hive ingest path connecting Hive and Druid for real-time analytics]
• Druid answers in near real-time
Druid and Hive Together: Interactive Realtime Analytics at Scale
Nishant Bangarwa
Tuesday, June 19
4:50 PM - 5:30 PM
Grand Ballroom 220B
ACID v2
• New on-disk storage format for ACID tables
• Run major compactions before you upgrade
• Update = Delete + Insert
• Performance on par with non-ACID tables
• Support for LOAD statements
• New streaming ingestion library
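The "Update = Delete + Insert" point can be illustrated with a toy model of base and delta data. This is a sketch under simplifying assumptions: row ids are (write_id, row_number) tuples and the deltas are in-memory Python structures, which is not the real on-disk layout.

```python
def apply_update(delete_delta, insert_delta, row_id, new_values, write_id):
    """Record an UPDATE as a delete of the old row version plus an insert of the new one."""
    delete_delta.add(row_id)
    new_row_id = (write_id, len(insert_delta))
    insert_delta[new_row_id] = new_values
    return new_row_id

def read_snapshot(base, insert_delta, delete_delta):
    """Merge base and insert delta, skipping row ids listed in the delete delta."""
    rows = {**base, **insert_delta}
    return {rid: vals for rid, vals in rows.items() if rid not in delete_delta}

base = {(1, 0): {"name": "ann", "qty": 1}, (1, 1): {"name": "bob", "qty": 2}}
insert_delta, delete_delta = {}, set()

new_id = apply_update(delete_delta, insert_delta, (1, 0), {"name": "ann", "qty": 9}, write_id=2)
snapshot = read_snapshot(base, insert_delta, delete_delta)
```

Because the base data is never rewritten in place, readers only need to filter deleted row ids while scanning, which is one plausible reason reads can stay close to non-ACID speed.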
Insert-only Tables
• Transactional Semantics for non-ORC tables
• For INSERT INTO and INSERT OVERWRITE
• With near-zero overhead
• No rename() - Cloud friendly
Transactional Operations in Apache Hive
Eugene Koifman
Wednesday, June 20
11:50 AM - 12:30 PM
Executive Ballroom 210A/E
Disaster Recovery for Hive Data
[Figure: on-premise Data Center (a) and on-premise Data Center (b), each holding data sets A and B, under centralized security and governance; Scheduled Policy (A) runs at 2am, 10am and 6pm daily, Scheduled Policy (B) runs at 2am daily]
1. Data replication with scheduled policy
2. Disaster takes down Data Center (b)
3. Failover to Data Center (a); data set B made active
4. Active data set B changes to B’ in Data Center (a)
Hive-based Replication
• Replv2 introduces new REPL commands
• Incremental replication: only copy delta changes
• Point-in-time replication
• Hive maintains the replication state
• Additional support for other database objects, e.g. functions, constraints, etc.
• Reduced number of copies
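Incremental replication with a tracked replication state can be sketched as replaying an ordered change log. Everything below is a hypothetical model: the Warehouse class, the event tuples and the caller-held last_event_id are illustrative and do not mirror the REPL commands or Hive's internal bookkeeping.

```python
class Warehouse:
    """Toy warehouse that logs every insert as an ordered event."""

    def __init__(self):
        self.tables = {}
        self.events = []  # ordered change log: (event_id, table, rows)

    def insert(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)
        self.events.append((len(self.events) + 1, table, list(rows)))

def replicate(source, target, last_event_id):
    """Copy only events newer than last_event_id; return the new replication state."""
    for event_id, table, rows in source.events:
        if event_id > last_event_id:
            target.tables.setdefault(table, []).extend(rows)
            last_event_id = event_id
    return last_event_id

primary, replica = Warehouse(), Warehouse()
primary.insert("sales", [("a", 1)])
primary.insert("users", [("ann",)])
state = replicate(primary, replica, 0)       # initial copy: two events
primary.insert("sales", [("b", 2)])
state = replicate(primary, replica, state)   # incremental: one event
```

Replaying up to a chosen event id, rather than to the end of the log, would give the point-in-time behaviour mentioned above.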
Seamless Replication and Disaster Recovery for Apache Hive Warehouse
Sankar Hariappan
Thursday, June 21
9:30 AM - 10:10 AM
Meeting Room 211A/B/C/D
One Metastore to Rule Them All
[Figure: HMS alongside HDFS/S3, Kafka, Hive LLAP, Hive on Tez, Spark, Atlas, Ranger and SR]
Between Us and the Grand Vision
• Make HMS separable from Hive
• Standalone Metastore
• Unify HMS and Schema Registry so batch and streaming can see each other’s data
• Also reduces the number of metadata systems admins have to install and maintain
Sharing Metadata Across the Data Lake and Streams
Alan Gates
Wednesday, June 20
11:50 AM - 12:30 PM
Meeting Room 230A
External Access – Spark LLAP
External Access – Relational View for Everyone
• Hive-on-Tez and other DAG executors can use LLAP directly
• LLAP also provides a "relational datanode" view of the data
• Anyone (with access) can push (approved) code in, from complex query fragments to simple data reads
• E.g. a Spark DataFrame can be created with LlapInputFormat
• Gives external services access to
• Hive data: centralized, secure data access
• Ability to read all Hive table types, like ACID transactional tables
• Hive features: from column-level security, to LLAP columnar cache
Support Row/Column-level Security in Spark
spark-shell
pyspark
What Is Required?
• Apache Ranger
• Apache Hive with LLAP
• Spark-LLAP
• A library to integrate the above with SparkSQL
HiveServer2 + LLAP + Ranger
[Figure: client app, HiveServer2 (plan generation, Ranger dynamic policies), Hive query coordinator and LLAP daemons inside a YARN cluster]
SQL Query: select name from users
Plan: TableScan: users, Filter: state = ‘CA’, Projection: mask(name)
1. Client sends query to HiveServer2.
2. Query plan generation by HiveServer2. Ranger security policies applied. Plan modified based on dynamic security policies.
3. Query plan sent to query coordinator.
4. Query plan sent to LLAP daemons for execution. Filtering/masking performed.
5. Results consolidated and sent to client.
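Step 2, where the plan is modified by security policies, can be imitated with a simple query rewrite. The sketch below is an illustration only: it uses SQLite instead of Hive, and the mask function, the apply_policies helper and the sample rows are hypothetical stand-ins for Ranger's row-filter and column-masking policies.

```python
import sqlite3

def mask(value):
    """Hypothetical masking function: keep the first character, hide the rest."""
    return value[:1] + "x" * (len(value) - 1)

def apply_policies(table, columns, row_filter, masked_columns):
    """Rewrite a simple projection so the filter and masks become part of the query."""
    projected = [f"mask({c}) AS {c}" if c in masked_columns else c for c in columns]
    return f"SELECT {', '.join(projected)} FROM {table} WHERE {row_filter}"

conn = sqlite3.connect(":memory:")
conn.create_function("mask", 1, mask)
conn.executescript("""
CREATE TABLE users (name TEXT, state TEXT);
INSERT INTO users VALUES ('alice', 'CA'), ('bob', 'NY'), ('carol', 'CA');
""")

# The user typed "select name from users"; the policies are added before execution.
secured_sql = apply_policies("users", ["name"], "state = 'CA'", {"name"})
rows = conn.execute(secured_sql).fetchall()
```

The user never sees the unmasked names or the rows outside California, even though the original query asked for neither restriction.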
HiveServer2 + LLAP + Ranger
[Figure: client app using LLAP InputFormat, HiveServer2 (plan generation, Ranger dynamic policies), Hive query coordinator and LLAP daemons inside a YARN cluster]
SQL Query: select name from users
Plan: TableScan: users, Filter: state = ‘CA’, Projection: mask(name)
1. Client requests data locations known as “splits” from HiveServer2.
2. Query plan generation by HiveServer2. Ranger security policies applied. Plan modified based on dynamic security policies.
3. Splits returned to client, which include the signed query plan.
4. LLAP splits used by client to securely submit query plan to LLAP. Filtering/masking performed. Data returned to client.
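The signed query plan in steps 3 and 4 is what stops a client from altering the filter or mask before submitting to LLAP. The slides do not describe the signing mechanism, so the sketch below simply assumes an HMAC over the serialized plan with a key shared by HiveServer2 and the daemons; the key, the function names and the split layout are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"shared-cluster-secret"  # hypothetical key known to HS2 and the daemons only

def sign_plan(plan: bytes) -> bytes:
    """Compute an HMAC-SHA256 signature over the serialized plan."""
    return hmac.new(SECRET, plan, hashlib.sha256).digest()

def make_split(plan: bytes, location: str) -> dict:
    """What HS2 hands to the client: a data location plus the signed plan."""
    return {"location": location, "plan": plan, "signature": sign_plan(plan)}

def daemon_accepts(split: dict) -> bool:
    """A daemon runs the plan only if the signature still matches."""
    return hmac.compare_digest(sign_plan(split["plan"]), split["signature"])

secured_plan = b"TableScan users | Filter state = 'CA' | Projection mask(name)"
split = make_split(secured_plan, "llap-daemon-1")

tampered = dict(split, plan=b"TableScan users | Projection name")
```

A client that strips the filter and mask, as in the tampered split, cannot produce a valid signature without the shared key.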
“Other” Improvements
• Query reoptimization
• Constraints
• Vectorization
• Query Cache
• Active/Passive HS2 HA for LLAP
• HLL BitVectors
• CachedStore
• Numerous enhancements in Spark Integration
Future
• Standalone Metastore
• Materialized Views – Automatic Recommendations
• Better integration with cloud storage
• HS2 scalability
Thanks to the open source community for continued success over the last 10 years.
Now, onwards to the next 10 years.

More Related Content

PPTX
Hive 3 - a new horizon
PPTX
Kinesis Firehoseを使ってみた
PDF
Cloud Foundry V2を、もうちょっと深掘りしよう
PDF
Prestoで実現するインタラクティブクエリ - dbtech showcase 2014 Tokyo
PPTX
C#で速度を極めるいろは
PPTX
Glue DataBrewでデータをクリーニング、加工してみよう
PDF
ソーシャルゲームにおけるMongoDB適用事例 - Animal Land
PDF
マルチテナント化で知っておきたいデータベースのこと
Hive 3 - a new horizon
Kinesis Firehoseを使ってみた
Cloud Foundry V2を、もうちょっと深掘りしよう
Prestoで実現するインタラクティブクエリ - dbtech showcase 2014 Tokyo
C#で速度を極めるいろは
Glue DataBrewでデータをクリーニング、加工してみよう
ソーシャルゲームにおけるMongoDB適用事例 - Animal Land
マルチテナント化で知っておきたいデータベースのこと

What's hot (20)

PDF
AWSでDockerを扱うためのベストプラクティス
PDF
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
PDF
Apache Calcite Tutorial - BOSS 21
PDF
MySQL 8.0で憶えておいてほしいこと
PDF
Trino at linkedIn - 2021
PDF
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
PDF
Apache Arrow - データ処理ツールの次世代プラットフォーム
PDF
코드 리뷰의 또 다른 접근 방법: Pull Requests vs. Stacked Changes
PDF
Spring native について
PDF
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜
PDF
Hardening Kafka Replication
PPT
Hive User Meeting August 2009 Facebook
PDF
WindowsのCP932に苦闘している件
PDF
ソーシャルゲームのためのデータベース設計
PDF
ドキュメンテーションを加速するストレスフリーの作図ツール『blockdiag』 jus2011年6月勉強会
PDF
行ロックと「LOG: process 12345 still waiting for ShareLock on transaction 710 afte...
PDF
Elasticsearchを使うときの注意点 公開用スライド
PDF
[JAWS DAYS 2019] Amazon DocumentDB(with MongoDB Compatibility)入門
PPTX
Hadoop Query Performance Smackdown
PDF
Sqoop on Spark for Data Ingestion
AWSでDockerを扱うためのベストプラクティス
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
Apache Calcite Tutorial - BOSS 21
MySQL 8.0で憶えておいてほしいこと
Trino at linkedIn - 2021
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
Apache Arrow - データ処理ツールの次世代プラットフォーム
코드 리뷰의 또 다른 접근 방법: Pull Requests vs. Stacked Changes
Spring native について
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜
Hardening Kafka Replication
Hive User Meeting August 2009 Facebook
WindowsのCP932に苦闘している件
ソーシャルゲームのためのデータベース設計
ドキュメンテーションを加速するストレスフリーの作図ツール『blockdiag』 jus2011年6月勉強会
行ロックと「LOG: process 12345 still waiting for ShareLock on transaction 710 afte...
Elasticsearchを使うときの注意点 公開用スライド
[JAWS DAYS 2019] Amazon DocumentDB(with MongoDB Compatibility)入門
Hadoop Query Performance Smackdown
Sqoop on Spark for Data Ingestion
Ad

Similar to What's new in apache hive (20)

PPTX
Apache Hadoop 3 updates with migration story
PDF
What is new in Apache Hive 3.0?
PPTX
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
PPTX
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
PDF
What is New in Apache Hive 3.0?
PPTX
High throughput data replication over RAFT
PPTX
Hive acid and_2.x new_features
PPTX
LLAP: Building Cloud First BI
PDF
Data in the Cloud Crash Course
PPTX
Containers and Big Data
PDF
Apache Hadoop YARN: state of the union - Tokyo
PDF
Apache Hadoop YARN: state of the union
PDF
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
PDF
Curing the Kafka blindness—Streams Messaging Manager
PPTX
Apache Hadoop YARN: state of the union
PDF
What's New in Apache Hive 3.0?
PDF
What's New in Apache Hive 3.0 - Tokyo
PDF
Apache Hadoop YARN: State of the Union
PDF
HDF 3.1 : An Introduction to New Features
PPTX
Hive Performance Dataworks Summit Melbourne February 2019
Apache Hadoop 3 updates with migration story
What is new in Apache Hive 3.0?
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
What is New in Apache Hive 3.0?
High throughput data replication over RAFT
Hive acid and_2.x new_features
LLAP: Building Cloud First BI
Data in the Cloud Crash Course
Containers and Big Data
Apache Hadoop YARN: state of the union - Tokyo
Apache Hadoop YARN: state of the union
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Curing the Kafka blindness—Streams Messaging Manager
Apache Hadoop YARN: state of the union
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0 - Tokyo
Apache Hadoop YARN: State of the Union
HDF 3.1 : An Introduction to New Features
Hive Performance Dataworks Summit Melbourne February 2019
Ad

More from DataWorks Summit (20)

PPTX
Data Science Crash Course
PPTX
Floating on a RAFT: HBase Durability with Apache Ratis
PPTX
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
PDF
HBase Tales From the Trenches - Short stories about most common HBase operati...
PPTX
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
PPTX
Managing the Dewey Decimal System
PPTX
Practical NoSQL: Accumulo's dirlist Example
PPTX
HBase Global Indexing to support large-scale data ingestion at Uber
PPTX
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
PPTX
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
PPTX
Supporting Apache HBase : Troubleshooting and Supportability Improvements
PPTX
Security Framework for Multitenant Architecture
PDF
Presto: Optimizing Performance of SQL-on-Anything Engine
PPTX
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
PPTX
Extending Twitter's Data Platform to Google Cloud
PPTX
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
PPTX
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
PPTX
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
PDF
Computer Vision: Coming to a Store Near You
PPTX
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Data Science Crash Course
Floating on a RAFT: HBase Durability with Apache Ratis
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
HBase Tales From the Trenches - Short stories about most common HBase operati...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Managing the Dewey Decimal System
Practical NoSQL: Accumulo's dirlist Example
HBase Global Indexing to support large-scale data ingestion at Uber
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Security Framework for Multitenant Architecture
Presto: Optimizing Performance of SQL-on-Anything Engine
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Extending Twitter's Data Platform to Google Cloud
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Computer Vision: Coming to a Store Near You
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Recently uploaded (20)

PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Approach and Philosophy of On baking technology
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PPTX
Machine Learning_overview_presentation.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Assigned Numbers - 2025 - Bluetooth® Document
PPTX
A Presentation on Artificial Intelligence
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
Tartificialntelligence_presentation.pptx
PDF
cuic standard and advanced reporting.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Network Security Unit 5.pdf for BCA BBA.
Mobile App Security Testing_ A Comprehensive Guide.pdf
Approach and Philosophy of On baking technology
Encapsulation_ Review paper, used for researhc scholars
Dropbox Q2 2025 Financial Results & Investor Presentation
Accuracy of neural networks in brain wave diagnosis of schizophrenia
Machine Learning_overview_presentation.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Assigned Numbers - 2025 - Bluetooth® Document
A Presentation on Artificial Intelligence
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Tartificialntelligence_presentation.pptx
cuic standard and advanced reporting.pdf
Machine learning based COVID-19 study performance prediction
The Rise and Fall of 3GPP – Time for a Sabbatical?
Diabetes mellitus diagnosis method based random forest with bat algorithm
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Programs and apps: productivity, graphics, security and other tools
MIND Revenue Release Quarter 2 2025 Press Release
Network Security Unit 5.pdf for BCA BBA.

What's new in apache hive

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved. What is new in Apache Hive? Ashutosh Chauhan
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved. Apache Hive – Distant Past – First Five Years • Initial use case: batch processing • Circa 2008 • Read-only data • MapReduce • HiveQL
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved. Apache Hive – Past 5 Years • Effort to take Hive beyond its batch processing roots • Started in Apache Hive 0.10.0 (January 2013) • Latest released version: Apache Hive 3.0 (May 2018) • Extensive renovation along four different axes • Runtime : Enable sub-second queries - LLAP • Compiler : Cost Based Optimizer • SQL support : Improved coverage of SQL syntax • Transactional Support : ACID
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved. Hive – Today • Comprehensive ANSI SQL including all TPC-DS Queries. • The only Hadoop SQL with ACID MERGE for easy updates. • In-Memory caching for MPP performance at Hadoop scale. • Enables Per-User dynamic row and column security. • Enables Replication and DR for critical workloads. • Compatible with every major BI Tool. • Proven at 300+ PB Scale.
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved. Apache Hive: Fast Facts Most Queries Per Hour 100,000 Queries Per Hour Analytics Performance 100 Million rows/s Per Node Largest Hive Warehouse 300+ PB Raw Storage Largest Cluster 4,500+ Nodes
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved. Hive: Serving ETL Workloads to BI Systems BI systems Materialized view Improved Stats Constraints Query Result Cache Workload manage ment ACID v2 • Results return from HDFS/cache directly • Reduce load from repetitive queries • Allows more queries to be run in parallel • Reduce resource starvation in large clusters • Also: Active/Passive HA • More “tools” for optimizer to use • More ”tools” for DBAs to tune/optimize • Invisible tuning of DB from users’ perspective • ACID v2 is as fast as regular tables
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved. • SIGMOD Software Systems Award • “For developing seminal software systems that served to bring relational-style declarative programming to the Hadoop ecosystem.” • Postgres, SQLLite and MonetDB
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved. Hive – How Did We Get Here? • LLAP Enhancements • CBO Enhancements • ACID Enhancements
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved. Materialized Views in Hive
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved. Accelerating Query Processing • Change data physical properties (distribute, sort) • Filter rows • Denormalize • Preaggregate Optimization based on access patterns
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved. Materialized Views to Rescue  Speed up aggregates and joins via MVs  View navigation via CBO/Calcite  Optionally allow rewrites against out-of-date materializations
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved. Materialized Views in Hive 3 • Multiple storage options: Hive, Druid • Multiple options to control materialized views lifecycle
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved. Materialized View-based Rewriting • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT <dims>, lo_revenue, lo_extprice * lo_disc AS d_price, lo_revenue - lo_supplycost, FROM customer, dates, lineorder, part, supplier WHERE lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and lo_custkey = c_custkey; • Query SELECT sum(lo_extendedprice*lo_discount) FROM lineorder, dates WHERE lo_orderdate = d_datekey and d_year = 2013 and lo_discount between 1 and 3; • Materialized view-based rewriting SELECT SUM(d_price) FROM mv WHERE d_year = 2013 and lo_discount between 1 and 3; supplier part dates customerlineorder mv contents Query results
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved. Rebuilding Materialized Views • ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD; • Incremental materialized view maintenance • Only refresh data that has changed in source tables
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved. Accelerating Query Processing with Materialized Views in Apache Hive Jesus Camacho Rodriguez Tuesday, June 19 2:50 PM - 3:30 PM Executive Ballroom 210A/E
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved. Workload Management
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved. Overview • Effectively share LLAP cluster resources • Resource allocation per user policy; separate ETL and BI, etc. • Resources based guardrails • Protect against long running queries, high memory usage • Improved, query-aware scheduling • Scheduler is aware of query characteristics, types, etc. • Fragments easy to pre-empt compared to containers • Queries get guaranteed fractions of the cluster, but can use empty space
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved. Resource Plans • Resource plan is a workload management configuration for a cluster • Switching is allowed without stopping queries, e.g. based on time of day • Cluster is divided into query pools (optionally nested) • Each pool defines query parallelism, cluster resources percentage • Queries are automatically routed to pools based on user name, app, etc. • Rules (Triggers) to kill, move, or deprioritized queries based on DFS usage, runtime, etc. • Example : CREATE RESOURCE PLAN daytime; CREATE POOL bi IN daytime (resource_percent=75, concurrent_queries=5); CREATE POOL etl IN daytime TIME (resource_percent=25, concurrent_queries=10); CREATE RULE downgrade IN daytime WHEN total_runtime > 120 THEN MOVE etl; ADD RULE downgrade TO bi IN daytime ; CREATE MAPPING tableau IN daytime (application='Tableau', pool=bi); ALTER PLAN daytime SET default_pool='etl'; APPLY PLAN daytime;
  • 19. 19 © Hortonworks Inc. 2011–2018. All rights reserved. Decentralized Guaranteed Resources • A guaranteed task for each resource (executor slots) • HS2 gives N guaranteed tasks to an AM based on configured resource plan • AMs mark N of its most important tasks as guaranteed at any given time • Guaranteed tasks pre-empt speculative tasks
  • 20. 20 © Hortonworks Inc. 2011–2018. All rights reserved. Guaranteed Tasks – BI and ETL Example BI (80% = 14 guaranteed) ETL (20% = 4 guaranteed) Query 1 Query 2 LLAP Daemon 1 LLAP Daemon 2 LLAP Daemon 3 Wait Queue Executors 10 active tasks (running): 10 guaranteed (running) 4 unused for now 19 active tasks (8 running): 4 guaranteed (4 running) 15 speculative (4 running) HS2 18 executors total
  • 21. 21 © Hortonworks Inc. 2011–2018. All rights reserved. Caching
  • 22. 22 © Hortonworks Inc. 2011–2018. All rights reserved. Caching for BI Workloads • Fine-grained (columnar), compact (dictionary, RLE encoded) • Important due to projections over many wide EDW tables • Prioritized – indexes are cached with higher priority • Important to make use of predicate pushdown • Off-heap (no extra GC), supports SSD • LRFU replacement policy avoids the damage from large scans
  • 23. 23 © Hortonworks Inc. 2011–2018. All rights reserved. Caching for BI Workloads – Formats, Zero-ETL • ORC, Parquet • Cached natively • Zero-ETL analytics on CSV and JSON data with text caching • Text is efficiently encoded in background; once cached, queries speed up
  • 24. 24 © Hortonworks Inc. 2011–2018. All rights reserved. In-memory Processing – Native Columnar (ORC) I/O threads SSD cache Off-heap cacheCompact encoded data Distributed FS Compressed data Decoder: ORC col1 col2 Compression codec Read planner Execution thread Fragment Hive operator Hive operator Vectorized processing col1 col2 Native data vectors Replacement policy
  • 25. 25 © Hortonworks Inc. 2011–2018. All rights reserved. Running Hive queries fast in the cloud Nita Dembla Wednesday, June 20 4:00 PM - 4:40 PM Grand Ballroom 220C
  • 26. 26 © Hortonworks Inc. 2011–2018. All rights reserved. Druid + Apache Hive Layer Data Access Pattern Features Hive Layer Large Scale analytics Joins Subqueries Windowing Functions Transformations Complex Aggregations Advanced Sorting UDFs Druid Layer Needles-in-a-haystack queries with large numbers of dimensions Dimensional Aggregates Top N Queries Min/Max Values Timeseries Queries Approximate Distinct Count Approximate Histograms
  • 27. 27 © Hortonworks Inc. 2011–2018. All rights reserved. Druid Integration • Pushdown of aggregate queries • Pushdown of complex expressions • Improvements in Druid to support sql standard NULL semantics • Store MV In Druid
  • 28. 28 © Hortonworks Inc. 2011–2018. All rights reserved. Hive 3: Real-time Ingestion Hive Kafka-Druid- Hive ingest Druid Real-time analytics • Druid answers in near real-time
  • 29. 29 © Hortonworks Inc. 2011–2018. All rights reserved. Druid and Hive Together: Interactive Realtime Analytics at Scale Nishant Bangarwa Tuesday, June 19 4:50 PM - 5:30 PM Grand Ballroom 220B
  • 30. 30 © Hortonworks Inc. 2011–2018. All rights reserved. Acid V2 • New On disk storage format for Acid tables • Run major compactions before you upgrade • Update = Delete + Insert • Performance at par with non-Acid tables • Support for load statements • New Streaming ingestion library
  • 31. Insert-only Tables
    • Transactional semantics for non-ORC tables
    • For INSERT INTO and INSERT OVERWRITE
    • With near-zero overhead
    • No rename(): cloud-friendly
  • 32. Transactional Operations in Apache Hive (Eugene Koifman; Wednesday, June 20, 11:50 AM - 12:30 PM; Executive Ballroom 210A/E)
  • 33. Disaster Recovery for Hive Data
    [Diagram labels: data sets A and B; On-Premise Data Center (a); On-Premise Data Center (b); Centralized Security and Governance; Scheduled Policy (A) (2am, 10am, 6pm daily); Scheduled Policy (B) (2am daily)]
    1. Data replication with scheduled policy
    2. Disaster takes down Data Center (b)
    3. Failover to Data Center (a); data set B made active
    4. Active data set B changes to B’ in Data Center (a)
  • 34. Hive-based Replication
    • Replication v2 introduces new REPL commands
    • Incremental replication: only delta changes are copied
    • Point-in-time replication
    • Hive maintains the replication state
    • Additional support for other database objects, e.g. functions and constraints
    • Reduced number of copies
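A minimal sketch of the incremental idea on this slide, assuming the source warehouse exposes an ordered event log and the target remembers the last event it applied. This is a toy model, not the REPL command syntax; `dump`, `load`, the event fields, and `last_repl_id` are hypothetical names chosen for illustration.

```python
def dump(source_events, from_event_id=0):
    """Return only the events newer than from_event_id, plus the last id covered.

    Assumes source_events is sorted by ascending event id.
    """
    delta = [event for event in source_events if event["id"] > from_event_id]
    last_id = delta[-1]["id"] if delta else from_event_id
    return delta, last_id


def load(target, delta, last_id):
    """Apply a dumped delta to the target and record the replication state."""
    for event in delta:
        if event["op"] == "put":
            target["tables"][event["table"]] = event["data"]
        elif event["op"] == "drop":
            target["tables"].pop(event["table"], None)
    target["last_repl_id"] = last_id


# Usage: a bootstrap run starts from id 0; later runs start from the stored state.
source_log = [
    {"id": 1, "op": "put", "table": "users", "data": ["alice"]},
    {"id": 2, "op": "put", "table": "orders", "data": [101]},
]
replica = {"tables": {}, "last_repl_id": 0}
load(replica, *dump(source_log, replica["last_repl_id"]))

source_log.append({"id": 3, "op": "drop", "table": "orders", "data": None})
delta, last_id = dump(source_log, replica["last_repl_id"])  # only event 3
load(replica, delta, last_id)
```

Because the target stores the last applied event id, each cycle copies only what changed since the previous one, and stopping at a chosen event id gives a point-in-time copy.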
  • 35. Seamless Replication and Disaster Recovery for Apache Hive Warehouse (Sankar Hariappan; Thursday, June 21, 9:30 AM - 10:10 AM; Meeting Room 211A/B/C/D)
  • 36. One Metastore to Rule Them All
    [Diagram labels: HDFS/S3; Kafka; Hive LLAP; Spark; HMS; Atlas; Ranger; SR; Hive on Tez]
  • 37. Between Us and the Grand Vision
    • Make HMS separable from Hive
    • Standalone Metastore
    • Unify HMS and Schema Registry so batch and streaming can see each other’s data
    • Also reduces the number of metadata systems admins have to install and maintain
  • 38. Sharing Metadata Across the Data Lake and Streams (Alan Gates; Wednesday, June 20, 11:50 AM - 12:30 PM; Meeting Room 230A)
  • 39. External Access – Spark LLAP
  • 40. External Access – Relational View for Everyone
    • Hive-on-Tez and other DAG executors can use LLAP directly
    • LLAP also provides a "relational datanode" view of the data
    • Anyone (with access) can push (approved) code in, from complex query fragments to simple data reads
    • E.g. a Spark DataFrame can be created with LlapInputFormat
    • Gives external services access to:
      • Hive data: centralized, secure data access
      • All Hive table types, including ACID transactional tables
      • Hive features: from column-level security to the LLAP columnar cache
  • 41. Support Row/Column-level Security in Spark
    [Screenshots: spark-shell, pyspark]
  • 42. What Is Required?
    • Apache Ranger
    • Apache Hive with LLAP
    • Spark-LLAP
    • A library to integrate the above technologies with SparkSQL
  • 43. HiveServer2 + LLAP + Ranger
    [Diagram labels: YARN Cluster; HiveServer2; Client App; Hive Query Coordinator; Plan Generation; Ranger Dynamic Policies; LLAP Daemons]
    SQL query: select name from users
    Generated plan: TableScan: users; Filter: state = 'CA'; Projection: mask(name)
    1. Client sends the query to HiveServer2.
    2. HiveServer2 generates the query plan. Ranger security policies are applied, and the plan is modified based on dynamic security policies.
    3. The query plan is sent to the query coordinator.
    4. The query plan is sent to the LLAP daemons for execution. Filtering/masking is performed.
    5. Results are consolidated and sent to the client.
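The plan rewrite in step 2 can be sketched as follows. This is a toy model, not Ranger or Hive code: `rewrite_plan`, `execute`, and the policy dictionary layout are hypothetical. The `mask` helper imitates what I understand to be the default behaviour of Hive's `mask()` function (uppercase letters become `X`, lowercase become `x`, digits become `n`); treat that as an assumption.

```python
def mask(value):
    """Replace uppercase with 'X', lowercase with 'x', digits with 'n' (assumed defaults)."""
    masked = []
    for ch in value:
        if ch.isupper():
            masked.append("X")
        elif ch.islower():
            masked.append("x")
        elif ch.isdigit():
            masked.append("n")
        else:
            masked.append(ch)
    return "".join(masked)


def rewrite_plan(table, columns, policy):
    """Build a plan for 'select <columns> from <table>' with policies injected."""
    return {
        "scan": table,
        # Row-level policy becomes a filter the user never wrote.
        "filter": policy.get("row_filter"),
        # Column-level policy wraps selected columns in a masking function.
        "projection": [(col, col in policy.get("masked_columns", ())) for col in columns],
    }


def execute(plan, tables):
    """Run the rewritten plan over in-memory rows (list of dicts)."""
    rows = tables[plan["scan"]]
    if plan["filter"] is not None:
        column, expected = plan["filter"]
        rows = [row for row in rows if row[column] == expected]
    return [
        {col: mask(row[col]) if masked else row[col] for col, masked in plan["projection"]}
        for row in rows
    ]


# Usage: the user asks for "select name from users"; the policy adds the rest.
policy = {"row_filter": ("state", "CA"), "masked_columns": {"name"}}
tables = {"users": [{"name": "Alice", "state": "CA"}, {"name": "Bob", "state": "NY"}]}
plan = rewrite_plan("users", ["name"], policy)
result = execute(plan, tables)
```

The point of the design is that the security policy lives in the plan itself, so whichever component executes the plan enforces filtering and masking without the client being involved.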
  • 44. LLAP InputFormat
    [Diagram labels: YARN Cluster; HiveServer2; Client App; Hive Query Coordinator; Plan Generation; Ranger Dynamic Policies; LLAP Daemons]
    SQL query: select name from users
    Generated plan: TableScan: users; Filter: state = 'CA'; Projection: mask(name)
    1. Client requests data locations, known as "splits", from HiveServer2.
    2. HiveServer2 generates the query plan. Ranger security policies are applied, and the plan is modified based on dynamic security policies.
    3. Splits, which include the signed query plan, are returned to the client.
    4. The client uses the LLAP splits to securely submit the query plan to LLAP. Filtering/masking is performed, and data is returned to the client.
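Why a signed plan matters in steps 3 and 4 can be shown with a small sketch. It assumes, purely for illustration, that HiveServer2 and the LLAP daemons share a secret key and that the signature is an HMAC over the serialized plan; the slide does not say how signing is actually done. `sign_plan`, `get_splits`, `llap_execute`, and `SECRET` are hypothetical names.

```python
import hashlib
import hmac
import json

# Assumption: a key known to HiveServer2 and LLAP daemons but not to clients.
SECRET = b"demo-key-shared-by-hs2-and-llap"


def sign_plan(plan, key=SECRET):
    """Sign a canonical JSON serialization of the plan."""
    payload = json.dumps(plan, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def get_splits(plan, num_splits, key=SECRET):
    """HiveServer2 side: hand the client splits that carry the signed plan."""
    signature = sign_plan(plan, key)
    return [
        {"split": index, "plan": dict(plan), "signature": signature}
        for index in range(num_splits)
    ]


def llap_execute(split, key=SECRET):
    """LLAP side: refuse any plan whose signature does not verify."""
    expected = sign_plan(split["plan"], key)
    if not hmac.compare_digest(expected, split["signature"]):
        raise PermissionError("query plan signature mismatch")
    return f"executing split {split['split']}"


# Usage: the plan already contains the policy-driven filter and mask.
secured_plan = {"scan": "users", "filter": "state = 'CA'", "projection": "mask(name)"}
splits = get_splits(secured_plan, num_splits=2)
status = llap_execute(splits[0])
```

Since the client only relays the plan, it cannot strip the filter or the mask: any edit changes the serialized payload and the daemon rejects it.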
  • 45. “Other” Improvements
    • Query reoptimization
    • Constraints
    • Vectorization
    • Query Cache
    • Active/Passive HS2 HA for LLAP
    • HLL BitVectors
    • CachedStore
    • Numerous enhancements in Spark integration
  • 46. Future
    • Standalone Metastore
    • Materialized Views – Automatic Recommendations
    • Better integration with cloud storage
    • HS2 scalability
  • 47. Thanks to the open source community for continued success over the last 10 years. Now, onwards to the next 10 years.