Apache Hive ACID Project
Eugene Koifman
June 2016
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
▪ Motivations/Goals
▪ What is included in the project
▪ End user point of view
▪ Architecture
▪ Recent progress
▪ Possible future directions
Motivations/Goals
▪ How new data was continuously added to Hive in the past
  – INSERT INTO Target SELECT * FROM Staging
  – ALTER TABLE Target ADD PARTITION (dt='2016-06-30')
    • Lots of files: bad for performance
    • Fewer files: users wait longer to see the latest data
▪ Modifying existing data
  – Not that important when analyzing log files; may be really important when sourcing data from an Operational Data Store
  – INSERT OVERWRITE TABLE Target SELECT * FROM Target WHERE …
    • Concurrency
      – Hope for the best (multiple updates)
      – ZooKeeper lock manager with S/X locks: restrictive
    • Expensive to do repeatedly (write side)
Goals
▪ Make the above use cases easy and efficient
▪ Key requirement
  – Long-running analytics queries should run concurrently with update commands
▪ NOT OLTP!!!
  – Supports slowly changing tables
  – Not for 100s of concurrent queries trying to update the same partition
ACID at High Level
▪ A new type of table that supports INSERT/UPDATE/DELETE SQL operations
▪ Concept of an ACID transaction
  – Atomic, Consistent, Isolated, Durable
▪ Streaming Ingest API
  – Write a continuous stream of events to Hive in micro-batches with transactional semantics
ACID at High Level
[Architecture diagram. Components: Streaming Client, SQL Client, Metastore (backed by an RDBMS), Compute Nodes, HDFS. Arrow labels: openTxn/commit/abort, txnID, Data.]
User Point of View
▪ CREATE TABLE T(a int, b int) CLUSTERED BY (b) INTO 8 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
▪ Not all tables support transactional semantics
▪ Table must be bucketed: important for query performance
▪ Table cannot be sorted: the ACID implementation requires its own sort order
▪ Currently requires ORC, but any format that implements AcidInputFormat/AcidOutputFormat can be supported
▪ Snapshot isolation
  – Locks in the state of the DB as of the start of the query, for the duration of the query
▪ autoCommit=true
Design – Storage Layer
▪ Storage layer enhanced to support an MVCC architecture
  – Multiple versions of each row
  – Allows concurrent readers/writers
▪ HDFS is an append-only file system
  – All update operations are written to a delta file first
  – Files are combined on read and during compaction
▪ Even if you could update a file in the middle
  – The architecture of choice for analytics is columnar storage (ORC)
  – Data is compressed by column, which makes in-place updates difficult
▪ Random data access is prohibitively slow
Storage Layer Example
▪ CREATE TABLE T(a int, b int) CLUSTERED BY (b) INTO 1 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
▪ Suppose the table contains (1,2),(3,4)
hive> update T set a = -3 where a = 3;
hive> update T set a = -1 where a = 1;
▪ Now the table has (-1,2),(-3,4)
▪ hive> dfs -ls -R /user/hive/warehouse/t;
/user/hive/warehouse/t/base_0000022/bucket_00000
/user/hive/warehouse/t/delta_0000023_0000023_0000/bucket_00000
/user/hive/warehouse/t/delta_0000024_0000024_0000/bucket_00000
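The directory names in the listing encode which transactions each directory covers. The following is a minimal Python sketch of decoding them, assuming the scheme `base_<txn>` and `delta_<minTxn>_<maxTxn>_<statementId>`; the function name `parse_acid_dir` and the returned field names are hypothetical and chosen only for illustration.

```python
import re

# Assumed naming scheme, inferred from the listing above:
#   base_<txn>                        everything up to and including <txn>
#   delta_<minTxn>_<maxTxn>[_<stmt>]  changes made by transactions minTxn..maxTxn
_BASE = re.compile(r"^base_(\d+)$")
_DELTA = re.compile(r"^delta_(\d+)_(\d+)(?:_(\d+))?$")


def parse_acid_dir(name):
    """Describe a base/delta directory name, or return None if unrecognized."""
    match = _BASE.match(name)
    if match:
        return {"kind": "base", "max_txn": int(match.group(1))}
    match = _DELTA.match(name)
    if match:
        stmt = match.group(3)
        return {
            "kind": "delta",
            "min_txn": int(match.group(1)),
            "max_txn": int(match.group(2)),
            "stmt_id": int(stmt) if stmt is not None else None,
        }
    return None


listing = [
    "base_0000022",
    "delta_0000023_0000023_0000",
    "delta_0000024_0000024_0000",
]
parsed = [parse_acid_dir(name) for name in listing]
for name, info in zip(listing, parsed):
    print(name, "->", info)
```

Each UPDATE in the example ran in its own transaction (23 and 24), so each produced its own single-transaction delta directory next to the base.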
Example Continued
▪ bin/hive --orcfiledump -j -d /user/hive/warehouse/t/base_0000022/bucket_00000
{"operation":0,"originalTransaction":22,"bucket":0,"rowId":0,"currentTransaction":22,"row":{"a":3,"b":4}}
{"operation":0,"originalTransaction":22,"bucket":0,"rowId":1,"currentTransaction":22,"row":{"a":1,"b":2}}
▪ bin/hive --orcfiledump -j -d /…/t/delta_0000023_0000023_0000/bucket_00000
{"operation":1,"originalTransaction":22,"bucket":0,"rowId":0,"currentTransaction":23,"row":{"_col1":-3,"_col2":4}}
▪ Each file is sorted by primary key: originalTransaction, bucket, rowId
▪ On read, the base and deltas are stitched together to produce the correct version of each row
▪ Each read operation “knows” the state of all transactions up to the moment it started
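To make the dump records easier to read, here is a small Python sketch that decodes them into a row key, an operation name, and the writing transaction. The operation codes 0 and 1 appear in the dump as insert and update; treating 2 as delete is an assumption, and `decode_event` is a hypothetical helper, not part of Hive.

```python
import json

# 0 and 1 are visible in the dump above; 2 = delete is an assumption.
OPERATION_NAMES = {0: "insert", 1: "update", 2: "delete"}


def decode_event(line):
    """Turn one JSON line of the dump into a small, readable dict."""
    record = json.loads(line)
    # The row key identifies the logical row and never changes on update.
    row_key = (record["originalTransaction"], record["bucket"], record["rowId"])
    return {
        "row_key": row_key,
        "operation": OPERATION_NAMES[record["operation"]],
        "written_by_txn": record["currentTransaction"],
        "row": record["row"],
    }


dump_lines = [
    '{"operation":0,"originalTransaction":22,"bucket":0,"rowId":0,"currentTransaction":22,"row":{"a":3,"b":4}}',
    '{"operation":0,"originalTransaction":22,"bucket":0,"rowId":1,"currentTransaction":22,"row":{"a":1,"b":2}}',
    '{"operation":1,"originalTransaction":22,"bucket":0,"rowId":0,"currentTransaction":23,"row":{"_col1":-3,"_col2":4}}',
]
events = [decode_event(line) for line in dump_lines]
for event in events:
    print(event["row_key"], event["operation"], "by txn", event["written_by_txn"], event["row"])
```

The update record in the delta carries the same row key (22, 0, 0) as the row it replaces in the base; only currentTransaction differs, which is what lets a reader match versions of the same row across files.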
Producing The Snapshot
base_0000022/bucket_00000
oTxn  bucket  rowId  cTxn  a   b
22    0       0      22    3   4
22    0       1      22    1   2

delta_0000023_0000023_0000
oTxn  bucket  rowId  cTxn  a   b
22    0       0      23    -3  4

delta_0000024_0000024_0000
oTxn  bucket  rowId  cTxn  a   b
22    0       1      24    -1  2

select * from T
a   b
-3  4
-1  2
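The stitching of base and deltas into a snapshot can be sketched in Python as follows. This is a simplified in-memory model, not Hive's reader: it uses a dictionary where a real reader would merge sorted files, it represents rows as `(a, b)` tuples, and the names `read_snapshot` and `visible_txns` are hypothetical. For each row key it keeps the event written by the highest transaction that the reader is allowed to see.

```python
INSERT, UPDATE, DELETE = 0, 1, 2  # 2 = delete is an assumption


def read_snapshot(files, visible_txns):
    """Return the rows a reader sees, given the transactions visible to it.

    files: iterable of event lists; each event is
           (oTxn, bucket, rowId, cTxn, operation, row).
    visible_txns: set of transaction ids committed before the read started.
    """
    latest = {}  # row key -> (cTxn, operation, row) of the newest visible event
    for events in files:
        for o_txn, bucket, row_id, c_txn, operation, row in events:
            if c_txn not in visible_txns:
                continue  # written by a transaction this reader must not see
            key = (o_txn, bucket, row_id)
            if key not in latest or c_txn > latest[key][0]:
                latest[key] = (c_txn, operation, row)
    # Emit rows in key order, dropping rows whose newest event is a delete.
    return [latest[key][2] for key in sorted(latest) if latest[key][1] != DELETE]


base_22 = [(22, 0, 0, 22, INSERT, (3, 4)), (22, 0, 1, 22, INSERT, (1, 2))]
delta_23 = [(22, 0, 0, 23, UPDATE, (-3, 4))]
delta_24 = [(22, 0, 1, 24, UPDATE, (-1, 2))]
files = [base_22, delta_23, delta_24]

current = read_snapshot(files, visible_txns={22, 23, 24})
before_txn_24 = read_snapshot(files, visible_txns={22, 23})
print(current)        # [(-3, 4), (-1, 2)]
print(before_txn_24)  # [(-3, 4), (1, 2)]
```

The second call models a query that started before transaction 24 committed: it still sees (1, 2) for the second row, even though the delta file for transaction 24 already exists on disk.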
Design - Compactor
▪ More operations = more delta files
▪ Compactor rewrites the table in the background
  – Minor compaction: merges delta files into fewer deltas
  – Major compaction: merges deltas with the base; more expensive
  – This amortizes the cost of updates and self-tunes the tables
    • Makes ORC more efficient: larger stripes, better compression
▪ Compaction can be triggered automatically or on demand
  – Various configuration options control when the process kicks in
  – Compaction itself is a MapReduce job
▪ Key design principle: the compactor does not affect readers/writers
▪ Cleaner process: removes obsolete files
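The difference between minor and major compaction can be illustrated with a toy Python model. It is a sketch under several assumptions: all transactions involved are committed, files are plain in-memory lists, operation code 2 means delete, and the output names (`delta_<min>_<max>`, `base_<max>`) follow the naming seen in the storage example. The function names are hypothetical and this is not the actual compactor job.

```python
INSERT, UPDATE, DELETE = 0, 1, 2  # 2 = delete is an assumption


def _sort_key(event):
    # Row key ascending, newest transaction first within a key.
    o_txn, bucket, row_id, c_txn, _operation, _row = event
    return (o_txn, bucket, row_id, -c_txn)


def minor_compact(deltas):
    """Merge several deltas into one delta, keeping every event."""
    events = sorted((e for d in deltas for e in d["events"]), key=_sort_key)
    lo = min(d["min_txn"] for d in deltas)
    hi = max(d["max_txn"] for d in deltas)
    return {"name": "delta_%07d_%07d" % (lo, hi),
            "min_txn": lo, "max_txn": hi, "events": events}


def major_compact(base, deltas):
    """Fold the deltas into a new base, keeping only the newest version per row."""
    all_events = base["events"] + [e for d in deltas for e in d["events"]]
    newest = {}
    for event in sorted(all_events, key=_sort_key):
        newest.setdefault(event[:3], event)  # first event per key is the newest
    kept = [newest[key] for key in sorted(newest) if newest[key][4] != DELETE]
    hi = max([base["max_txn"]] + [d["max_txn"] for d in deltas])
    return {"name": "base_%07d" % hi, "max_txn": hi, "events": kept}


base = {"name": "base_0000022", "max_txn": 22,
        "events": [(22, 0, 0, 22, INSERT, (3, 4)), (22, 0, 1, 22, INSERT, (1, 2))]}
deltas = [
    {"name": "delta_0000023_0000023_0000", "min_txn": 23, "max_txn": 23,
     "events": [(22, 0, 0, 23, UPDATE, (-3, 4))]},
    {"name": "delta_0000024_0000024_0000", "min_txn": 24, "max_txn": 24,
     "events": [(22, 0, 1, 24, UPDATE, (-1, 2))]},
]

merged_delta = minor_compact(deltas)
new_base = major_compact(base, deltas)
print(merged_delta["name"], len(merged_delta["events"]))
print(new_base["name"], [event[5] for event in new_base["events"]])
```

Minor compaction only reduces the number of files a reader has to open, while major compaction also discards superseded row versions, which is why it is the more expensive of the two.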
Design - Concurrency
▪ Transaction Manager
  – Manages transaction ID assignment
  – Keeps track of transaction state: open, committed, aborted
▪ Lock Manager
  – DDL operations acquire eXclusive locks
  – Read operations acquire Shared locks
  – Main goal is to prevent someone from dropping a table while a query is in progress
▪ State of both is persisted in the Hive Metastore
▪ Write-set tracking prevents write-write conflicts between concurrent transactions
▪ Note that two inserts are never in conflict, since Hive does not enforce unique constraints
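One way to picture write-set tracking is a "first committer wins" check at commit time. The Python sketch below is an illustrative toy, not the Metastore implementation: the class `ToyTxnManager`, its methods, and the choice to track conflicts per resource string (for example a table or partition name) are all assumptions made for the example. Only updates and deletes enter the write set, so concurrent inserts never conflict.

```python
class ToyTxnManager:
    """In-memory model of txn id assignment, txn state, and write-set checks."""

    def __init__(self):
        self._next_id = 1
        self._commit_seq = 0   # increases by one on every successful commit
        self.state = {}        # txn id -> "open" | "committed" | "aborted"
        self._start_seq = {}   # txn id -> commit sequence seen when it opened
        self._pending = {}     # txn id -> resources it updated or deleted
        self._write_sets = []  # (commit sequence, resources) of committed txns

    def open_txn(self):
        txn = self._next_id
        self._next_id += 1
        self.state[txn] = "open"
        self._start_seq[txn] = self._commit_seq
        self._pending[txn] = set()
        return txn

    def record_write(self, txn, resource, operation):
        # Inserts are not tracked: two inserts can never conflict.
        if operation in ("update", "delete"):
            self._pending[txn].add(resource)

    def commit(self, txn):
        mine = self._pending.pop(txn)
        for seq, resources in self._write_sets:
            # Conflict: someone committed a write to the same resource
            # after this transaction started.
            if seq > self._start_seq[txn] and resources & mine:
                self.state[txn] = "aborted"
                return False
        self._commit_seq += 1
        if mine:
            self._write_sets.append((self._commit_seq, mine))
        self.state[txn] = "committed"
        return True


manager = ToyTxnManager()
t1, t2 = manager.open_txn(), manager.open_txn()
manager.record_write(t1, "default.t/dt=2016-06-30", "update")
manager.record_write(t2, "default.t/dt=2016-06-30", "update")
first_committed = manager.commit(t1)
second_committed = manager.commit(t2)

t3, t4 = manager.open_txn(), manager.open_txn()
manager.record_write(t3, "default.t/dt=2016-06-30", "insert")
manager.record_write(t4, "default.t/dt=2016-06-30", "insert")
inserts_committed = (manager.commit(t3), manager.commit(t4))
print(first_committed, second_committed, inserts_committed)
```

In this model the second of two overlapping updates to the same partition is aborted, while the two overlapping inserts both commit.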
▪ You can read ACID and non-ACID tables in the same query
▪ You cannot write to ACID and non-ACID tables at the same time (multi-insert statement)
Design - Streaming Ingest
▪ Allows you to continuously write events to a Hive table
  – Can commit periodically to make writes durable/visible
  – Can also call abort to make writes since the last commit/abort invisible
  – Optimized to handle writing micro-batches of events, as often as every second
    • Multiple transactions are written to one file
  – Only supports adding new data
▪ Streaming tools like Storm and Flume rely on this API to ingest data into Hive
▪ This API is public, so it can be used directly
▪ Data written via the Streaming API has the same transactional semantics as the SQL side
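The commit/abort semantics described above can be modeled with a few lines of Python. This is a toy in-memory model for illustration only: it is not the Hive Streaming API, and the class `ToyStreamingTable` and its methods are hypothetical names. It shows that writes become visible to readers only on commit, and that abort discards everything written since the last commit or abort.

```python
class ToyStreamingTable:
    """Append-only table whose writes are visible only after commit."""

    def __init__(self):
        self._committed = []  # events visible to readers
        self._pending = []    # events written since the last commit/abort

    def write(self, event):
        self._pending.append(event)

    def commit(self):
        # Make the current micro-batch durable and visible in one step.
        self._committed.extend(self._pending)
        self._pending = []

    def abort(self):
        # Discard everything written since the last commit/abort.
        self._pending = []

    def read(self):
        return list(self._committed)


table = ToyStreamingTable()
table.write("event-1")
table.write("event-2")
visible_before_commit = table.read()
table.commit()
table.write("event-3")
table.abort()
table.write("event-4")
table.commit()
visible_at_end = table.read()
print(visible_before_commit, visible_at_end)
```

Readers never observe a partially written micro-batch in this model: "event-3" was written but aborted, so it never appears in any read.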
Recent improvements
▪ Predicate pushdown (PPD)
▪ Schema evolution
▪ Split computation (Tez version 0.7 required)
▪ Usability
  – Better lock info
  – Compaction history
  – SHOW LOCKS filtering
▪ Various safety checks: open txn limit
▪ Metastore-side processes like compaction are no longer singletons
▪ Metastore scalability
▪ Bug fixes (Hive, Flume, Storm)
Future Work (Uncommitted transaction… may be rolled back)
▪ Automatic/dynamic bucketing
▪ MERGE statement (SQL Standard 2003): HIVE-10924
▪ Performance
  – Better vectorization; some operations over ACID tables don't vectorize at all
  – Some do, but not as well as they could
▪ HCatalog integration (at least read side) to read from Pig/MR
▪ Multi-statement transactions, i.e. BEGIN TRANSACTION/COMMIT/ROLLBACK
▪ Finer-grained concurrency management/conflict detection
▪ Better monitoring/alerting
Etc
▪ Documentation
  – https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
  – https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
▪ Follow/Contribute
  – https://issues.apache.org/jira/browse/HIVE-14004?jql=project%20%3D%20HIVE%20AND%20component%20%3D%20Transactions
▪ user@hive.apache.org
▪ dev@hive.apache.org
Thank You

Editor's Notes

  • #4: The easiest way to explain this is to describe how you used to do some of these things in Hive before the Hive ACID project.