How to Use: First Steps
© 2022 Cloudera, Inc. All rights reserved.
Recommended Iceberg Workflow
1. Create Iceberg tables
a. Bring your own datasets by converting your Hive external tables
OR
b. Use the sample airline datasets
CDW: Hive; CDE: Spark SQL
2. Batch insert data
To prepare the Time Travel scenario, insert more data into Iceberg tables with Hive or Spark.
CDE: Spark SQL
3. Create security policy
Create a Ranger policy to mask a column for Fine Grained Access Control (FGAC).
SDX: Ranger
4. Build BI query
Create SQL queries for standard operational reporting.
CDW: Impala SQL
5. Build visualizations
Create data sets and visuals from the query.
CDV: Create data set from query and build visuals
6. Perform Time Travel
Create Time Travel queries and execute them to audit what has changed.
CDW: Hive/Impala SQL; CDE: Spark Scala API
7. Partition evolution
Optimize the partition schema to improve query performance.
CDW: Hive/Impala SQL; CDE: Spark SQL
8. Table maintenance
Manage / expire snapshots.
CDE: Spark SQL
Workflow phases: CREATE, INGEST / PREP, SERVE, OPERATION / MAINTENANCE, GOVERN
SQL Commands (Hive, Spark, Impala)
SQL Commands
[Figure: Iceberg Tables with the command categories Table Conversion, Time Travel, DDL, Query, DML]
Ease of Use through consistent SQL Syntax across compute engines
A rich set of SQL commands has been developed for Hive, Impala and Spark to:
• Create and manipulate database objects
• Run Queries
• Load data into tables
• Modify data in tables
• Perform Time Travel operations
• Convert to Iceberg tables
Snapshot of Iceberg SQL Commands
Command                                            Hive   Impala   Spark
Select                                             ⬤      ⬤        ⬤
DML (INSERT INTO, INSERT OVERWRITE)                ⬤      ⬤        ⬤
Create Table                                       ⬤      ⬤        ⬤
Alter Table                                        ⬤      ⬤        ⬤
Drop Table                                         ⬤      ⬤        ⬤
Truncate Table                                     ⬤      ⬤        NA
Create-Table-As-Select                             ⬤      ⬤        ⬤
Replace-Table-As-Select                            NA     NA       ⬤
Partition Evolution                                ⬤      ⬤        ⬤
Partition Transformation                           ⬤      ⬤        ⬤
Schema Evolution                                   ⬤      ⬤        ⬤
Table Metadata (DESCRIBE TABLE, SHOW CREATE TABLE) ⬤      ⬤        ⬤
Time Travel                                        ⬤      ⬤        Scala API now, SQL is planned
Table Migration                                    ⬤      NA       ⬤
Table Maintenance                                  NA     NA       ⬤
Legend: ⬤ = General Availability or Tech Preview (distinguished by marker color on the original slide)
Compute Engines Interoperability & Fine Grained Access Control
Compute Engine Interoperability & FGAC
❏ Consistent Iceberg table access and processing with SQL using Hive, Spark and Impala (reads and writes)
❏ No partial reads
❏ No adapters needed
❏ Iceberg FGAC support through Ranger integration with Hive / Impala
  ❏ Spark support is planned
❏ Compatible with existing workflows
❏ Optimized for performance, cost and developer efficiency
[Figure: Iceberg Tables; Apache Impala]
Table Conversion SQL Commands / Utility [Tech Preview]
Table Conversion from Hive External to Iceberg Tables
1. Hive table migration:
ALTER TABLE tbl SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler')
2. Spark 3:
a. Import Hive tables into Iceberg (creates a new Iceberg table; the source table is left unchanged)
spark.sql("CALL <catalog>.system.snapshot('<src>', '<dest>')")
b. Migrate Hive tables to Iceberg tables (replaces the source table in place)
spark.sql("CALL <catalog>.system.migrate('<src>')")
Time Travel Operations
Time Travel
Time Travel is the ability to make a query reproducible at a given snapshot and/or time.
Time Travel operations:
● SELECT … AS OF …
Standard SQL operations:
● Queries
● DDL
● DML
[Figure: timeline t starting at T0, with Snapshot A through Snapshot Z; Apache Impala]
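The idea of resolving a point in time to a snapshot can be sketched as follows. This is only an illustrative model, not the engines' implementation: the snapshot log, its IDs and commit times are made up, and the timestamp string is assumed to be in UTC (a real engine may apply the session time zone).

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical snapshot log: (commit time in epoch milliseconds, snapshot ID),
# sorted by commit time. The IDs and times are invented for illustration.
SNAPSHOT_LOG = [
    (1628500000000, 101),  # 2021-08-09 09:06:40 UTC
    (1628503200000, 102),  # 2021-08-09 10:00:00 UTC
    (1628506800000, 103),  # 2021-08-09 11:00:00 UTC
]


def to_epoch_millis(ts: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC (an assumption) into epoch milliseconds."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def snapshot_as_of(log, ts_millis):
    """Return the ID of the latest snapshot committed at or before ts_millis."""
    commit_times = [t for t, _ in log]
    idx = bisect_right(commit_times, ts_millis)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return log[idx - 1][1]


asked = to_epoch_millis("2021-08-09 10:35:57")
print(snapshot_as_of(SNAPSHOT_LOG, asked))  # 102
```

Every query that asks for the same point in time resolves to the same snapshot, which is what makes the result reproducible even after later writes.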
Time Travel Operations
Time Travel Ops: SQL Examples
Hive / Impala (SQL query)
SELECT * FROM tbl FOR SYSTEM_TIME AS OF '2021-08-09 10:35:57';
SELECT * FROM tbl FOR SYSTEM_VERSION AS OF 1234567;
Spark (Scala API)
// time travel to snapshot with ID 10963874102873L
spark.read
  .option("snapshot-id", 10963874102873L)
  .format("iceberg")
  .load("path/to/table")
// time travel to 1985-10-26 08:21:00 UTC (as-of-timestamp is in epoch milliseconds)
spark.read
  .option("as-of-timestamp", "499162860000")
  .format("iceberg")
  .load("path/to/table")
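The as-of-timestamp option takes milliseconds since the Unix epoch, so it helps to be able to convert in both directions. A minimal sketch, assuming UTC throughout (the helper names are made up for this example):

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def millis_to_utc(ms: int) -> str:
    """Render epoch milliseconds as a UTC wall-clock string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime(FMT)


def utc_to_millis(ts: str) -> int:
    """Turn a UTC wall-clock string into epoch milliseconds."""
    dt = datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


# The value used in the Spark example above:
print(millis_to_utc(499162860000))           # 1985-10-26 08:21:00
print(utc_to_millis("1985-10-26 08:21:00"))  # 499162860000
```

Decoded this way, 499162860000 is 08:21:00 UTC on October 26, 1985, which corresponds to 01:21:00 in a UTC-7 time zone.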
Partition Evolution
In-place Partition Evolution
❏ Existing big data solutions don't support in-place partition evolution. The entire table must be rewritten with the new partition column.
❏ With Iceberg's hidden partitioning, which separates the physical layout from the logical table, users are not required to maintain partition columns.
❏ Iceberg tables can evolve partition schemas over time as data volume changes.
❏ Benefits:
  ❏ No costly table rewrites or table migration
  ❏ No query rewrites
  ❏ Reduced downtime and improved SLA
[Figure: timeline t with the partition spec changing on 2022-01-01. Earlier data is partitioned by Month(date) (2021-10-01, 2021-11-01, 2021-12-01); later data is partitioned by Day(date) (2022-01…, shown as a calendar of days 1 to 31). The figure marks the partitions included in the query plan, split into Split plan 1 and Split plan 2.]
SELECT * FROM SALES_ORDER
WHERE
DATE > '2021-11-23' AND
DATE < '2022-01-19'
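How one query can be planned against two partition layouts can be sketched with a small model. The evolution date and the month-then-day layout are read off the figure; the function itself is hypothetical and ignores everything a real planner does beyond picking partitions.

```python
from datetime import date, timedelta

# Assumed layout, taken from the figure: data before 2022-01-01 was written
# with the month(date) spec, data from 2022-01-01 on with the day(date) spec.
EVOLVED_ON = date(2022, 1, 1)


def plan_partitions(lo: date, hi: date):
    """Pick partitions for the predicate lo < date < hi (both bounds exclusive).

    Returns (month_partitions, day_partitions), one list per partition spec.
    """
    first, last = lo + timedelta(days=1), hi - timedelta(days=1)
    months, days = [], []
    current = first
    while current <= last:
        if current < EVOLVED_ON:
            key = current.strftime("%Y-%m")
            if not months or months[-1] != key:
                months.append(key)  # one coarse partition per month
        else:
            days.append(current.isoformat())  # one fine partition per day
        current += timedelta(days=1)
    return months, days


months, days = plan_partitions(date(2021, 11, 23), date(2022, 1, 19))
print(months)             # ['2021-11', '2021-12']
print(days[0], days[-1])  # 2022-01-01 2022-01-18
print(len(days))          # 18
```

The query text never mentions either layout; the 2021-10 month partition and the days after 2022-01-18 are simply never selected.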
Partition Evolution: SQL Examples
Hive / Impala
-- Partition evolution to hour
ALTER TABLE t SET PARTITION SPEC (hour(ts))
Spark SQL
-- Partition evolution to hour
ALTER TABLE t ADD PARTITION FIELD hour(ts)
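To make the hour(ts) transform concrete, here is a rough sketch of the values the time-based transforms derive from a timestamp, following the Iceberg table spec's definitions (hours, days and months counted from 1970-01-01). The function names are made up for this example and timestamps are assumed to be UTC.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_transform(ts: datetime) -> int:
    """Whole hours elapsed since 1970-01-01 00:00:00 UTC."""
    return int((ts - EPOCH).total_seconds() // 3600)


def day_transform(ts: datetime) -> int:
    """Whole days elapsed since 1970-01-01."""
    return (ts - EPOCH).days


def month_transform(ts: datetime) -> int:
    """Whole months elapsed since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


ts = datetime(2022, 1, 1, 5, 30, tzinfo=timezone.utc)
print(month_transform(ts))  # 624
print(day_transform(ts))    # 18993
print(hour_transform(ts))   # 455837
```

Because the partition value is computed from the ts column itself, queries keep filtering on ts and no extra partition column has to be maintained.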
Table Maintenance [Tech Preview]
Table Maintenance [Tech Preview]
Table Maintenance: Examples
Hive / Impala (SQL query)
-- Tentative, proposed syntax, not in GA
-- Expires snapshots that are older than 7 days.
ALTER TABLE test_table EXECUTE expire_snapshots_lt(now() - interval 7 days);
Spark (Iceberg Java Table API)
// Not in GA
// Expires snapshots that are older than 7 days
Table test_table = …
long tsToExpire = System.currentTimeMillis() - (1000L * 60 * 60 * 24 * 7);
test_table.expireSnapshots()
  .expireOlderThan(tsToExpire)
  .commit();
Expiring old snapshots removes them from metadata, so they are no longer available for time travel operations. Data files are
not deleted until they are no longer referenced by a snapshot that may be used for time travel. Regularly expiring snapshots
deletes unused data files.
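The relationship between expired snapshots and deletable data files can be illustrated with a toy model. This is a sketch under stated assumptions, not Iceberg's implementation: the snapshot dictionary, file names and the expire_snapshots function are invented, and the newest snapshot is always kept as the current table state.

```python
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # same cutoff arithmetic as the example above


def expire_snapshots(snapshots, now_ms, max_age_ms=SEVEN_DAYS_MS):
    """Toy model of snapshot expiry.

    snapshots maps snapshot_id -> (commit time in ms, set of data files).
    Returns (kept_snapshot_ids, deletable_files). A file becomes deletable
    only when no kept snapshot references it any more.
    """
    cutoff = now_ms - max_age_ms
    latest = max(snapshots, key=lambda sid: snapshots[sid][0])
    kept = {sid for sid, (ts, _) in snapshots.items() if ts >= cutoff or sid == latest}
    still_referenced = set()
    for sid in kept:
        still_referenced |= snapshots[sid][1]
    deletable = set()
    for sid, (_, files) in snapshots.items():
        if sid not in kept:
            deletable |= files - still_referenced
    return kept, deletable


DAY_MS = 86_400_000
now = 1_700_000_000_000  # arbitrary "current time" for the example
snapshots = {
    1: (now - 10 * DAY_MS, {"a.parquet", "b.parquet"}),
    2: (now - 8 * DAY_MS, {"b.parquet", "c.parquet"}),
    3: (now - 1 * DAY_MS, {"c.parquet", "d.parquet"}),
}
kept, deletable = expire_snapshots(snapshots, now)
print(sorted(kept))       # [3]
print(sorted(deletable))  # ['a.parquet', 'b.parquet']
```

Here c.parquet was written by an expired snapshot but survives, because the remaining snapshot still references it.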

Some Iceberg Basics for Beginners (CDP).pdf
