Some Iceberg Basics for Beginners (CDP).pdf

© 2022 Cloudera, Inc. All rights reserved.
Recommended Iceberg Workflow
1. Create Iceberg tables
   a. Bring your own datasets by converting your Hive external tables, OR
   b. Use the sample airline datasets
   Tools: CDW: Hive; CDE: Spark SQL
2. Batch insert data
   To prepare the Time Travel scenario, insert more data into Iceberg tables with Hive or Spark
   Tools: CDE: Spark SQL
3. Create security policy
   Create a Ranger policy to mask a column for Fine Grained Access Control (FGAC)
   Tools: SDX: Ranger
4. Build BI query
   Create SQL queries for standard operational reporting
   Tools: CDW: Impala SQL
5. Build visualizations
   Create data sets and visuals from a query
   Tools: CDV: create data set from query and build visuals
6. Perform Time Travel
   Create Time Travel queries and execute them to audit what has changed
   Tools: CDW: Hive/Impala SQL; CDE: Spark Scala API
7. Partition evolution
   Optimize the partition schema to improve query performance
   Tools: CDW: Hive/Impala SQL; CDE: Spark SQL
8. Table maintenance
   Manage / expire snapshots
   Tools: CDE: Spark SQL
Workflow phases: CREATE, INGEST / PREP, GOVERN, SERVE, OPERATION / MAINTENANCE

SQL Commands (Hive, Spark, Impala)

SQL Commands
[Diagram: Iceberg Tables with the SQL command categories Table Conversion, Time Travel, DDL, Query, and DML]
Ease of Use through consistent SQL Syntax across compute engines
A rich set of SQL commands has been developed for Hive, Impala, and Spark to:
• Create and manipulate database objects
• Run Queries
• Load data into tables
• Modify data in tables
• Perform Time Travel operations
• Convert to Iceberg tables

Snapshot of Iceberg SQL Commands
Command                                              Hive  Impala  Spark
Select                                               ⬤     ⬤       ⬤
DML (INSERT INTO, INSERT OVERWRITE)                  ⬤     ⬤       ⬤
Create Table                                         ⬤     ⬤       ⬤
Alter Table                                          ⬤     ⬤       ⬤
Drop Table                                           ⬤     ⬤       ⬤
Truncate Table                                       ⬤     ⬤       NA
Create-Table-As-Select                               ⬤     ⬤       ⬤
Replace-Table-As-Select                              NA    NA      ⬤
Partition Evolution                                  ⬤     ⬤       ⬤
Partition Transformation                             ⬤     ⬤       ⬤
Schema Evolution                                     ⬤     ⬤       ⬤
Table Metadata (DESCRIBE TABLE, SHOW CREATE TABLE)   ⬤     ⬤       ⬤
Time Travel                                          ⬤     ⬤       Scala API now, SQL is planned
Table Migration                                      ⬤     NA      ⬤
Table Maintenance                                    NA    NA      ⬤
Legend: ⬤ General Availability or Tech Preview (distinguished by color on the slide), NA not available

Compute Engines Interoperability &
Fine Grained Access Control

Compute Engine Interoperability & FGAC
❏ Consistent Iceberg table access and processing with SQL using Hive, Spark and Impala (reads and writes)
❏ No partial reads
❏ No adapters needed
❏ Iceberg FGAC support through Ranger integration with Hive / Impala
❏ FGAC support for Spark is planned
❏ Compatible with existing workflows
❏ Optimized for performance, cost and developer efficiency
[Diagram labels: Iceberg Tables, Apache Impala]

Table Conversion SQL commands /
Utility [Tech Preview]

Table Conversion from Hive External to Iceberg Tables
1. Hive table migration:
ALTER TABLE tbl SET TBLPROPERTIES
('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
2. Spark 3:
a. Import Hive tables into Iceberg (creates a new Iceberg table <dest> and leaves <src> unchanged)
spark.sql("CALL <catalog>.system.snapshot('<src>', '<dest>')")
b. Migrate Hive tables to Iceberg tables (replaces <src> in place)
spark.sql("CALL <catalog>.system.migrate('<src>')")

Time Travel
Time Travel is the ability to make a query reproducible at a given snapshot and/or time.
Time Travel operations:
● SELECT … AS OF …
Standard SQL operations:
● Queries
● DDL
● DML
[Diagram: timeline t starting at T0, with snapshots from Snapshot A to Snapshot Z]
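As a rough illustration of what "reproducible at a given snapshot and/or time" means, the Python sketch below models a table as an ordered snapshot log and resolves a read either by snapshot ID or by timestamp. The `Snapshot` class, the function names, and the sample data are hypothetical and only mimic the idea; this is not Iceberg's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int          # hypothetical ID, not a real Iceberg snapshot ID
    committed_at_ms: int      # commit time in milliseconds since the Unix epoch
    rows: Tuple[str, ...]     # table contents as of this snapshot (toy model)


def as_of_version(log: List[Snapshot], snapshot_id: int) -> Snapshot:
    """Return the snapshot with the given ID (reading a specific version)."""
    for snap in log:
        if snap.snapshot_id == snapshot_id:
            return snap
    raise KeyError(f"unknown snapshot id {snapshot_id}")


def as_of_time(log: List[Snapshot], ts_ms: int) -> Snapshot:
    """Return the latest snapshot committed at or before ts_ms."""
    candidates = [s for s in log if s.committed_at_ms <= ts_ms]
    if not candidates:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(candidates, key=lambda s: s.committed_at_ms)


# Sample snapshot log: each commit adds one row.
LOG = [
    Snapshot(101, 1_000, ("order-1",)),
    Snapshot(102, 2_000, ("order-1", "order-2")),
    Snapshot(103, 3_000, ("order-1", "order-2", "order-3")),
]

print(as_of_version(LOG, 102).rows)
print(as_of_time(LOG, 2_500).rows)
```

Both reads return the same rows no matter how many snapshots are committed later, which is what makes the query reproducible.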

Time Travel Operations
Engine                SQL Examples
Hive / Impala query   SELECT * FROM table FOR SYSTEM_TIME AS OF '2021-08-09 10:35:57';
                      SELECT * FROM table FOR SYSTEM_VERSION AS OF 1234567;
Spark Scala API       // time travel to snapshot with ID 10963874102873L
                      spark.read
                        .option("snapshot-id", 10963874102873L)
                        .format("iceberg")
                        .load("path/to/table")
                      // time travel to 1985-10-26 08:21:00 UTC (499162860000 ms since the Unix epoch)
                      spark.read
                        .option("as-of-timestamp", "499162860000")
                        .format("iceberg")
                        .load("path/to/table")
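The `as-of-timestamp` option takes milliseconds since the Unix epoch rather than a formatted date. The Python sketch below shows one way to convert between the two; the helper names `to_epoch_millis` and `from_epoch_millis` are illustrative and not part of any Spark or Iceberg API. It also shows that the value 499162860000 corresponds to 1985-10-26 08:21:00 UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid ambiguity")
    # Integer timedelta division avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


as_of = to_epoch_millis(datetime(1985, 10, 26, 8, 21, 0, tzinfo=timezone.utc))
print(as_of)                            # value to pass as "as-of-timestamp"
print(from_epoch_millis(499162860000))  # the same instant, back in UTC
```

Requiring a timezone-aware input is a deliberate choice here: a naive local time would silently shift the snapshot that gets selected.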

In-place Partition Evolution
❏ Existing big data solutions don't support in-place partition evolution. The entire table must be completely rewritten with the new partition column.
❏ With Iceberg's hidden partitioning, which separates the physical layout from the logical one, users are not required to maintain partition columns.
❏ Iceberg tables can evolve partition schemas over time as data volume changes.
❏ Benefits:
  ❏ No costly table rewrites or table migration
  ❏ No query rewrites
  ❏ Reduce downtime and improve SLA
[Figure: timeline t for table SALES_ORDER. Data before 2022-01-01 is partitioned by Month(date) (2021-10-01, 2021-11-01, 2021-12-01); data from 2022-01-01 onward is partitioned by Day(date) (days 1 to 31). The partitions included in the query plan are highlighted, and the query below is planned as Split plan 1 and Split plan 2, one per partition layout.]
SELECT * FROM SALES_ORDER
WHERE
DATE > '2021-11-23' AND
DATE < '2022-01-19'
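The figure's idea, one query planned separately against each partition layout, can be sketched in Python as below. This is a toy model under stated assumptions: partitions are represented as date ranges, the labels and the `plan` function are hypothetical, and the real planner works on table metadata rather than on lists like these.

```python
import calendar
from datetime import date, timedelta


def month_partition(year: int, month: int):
    """A monthly partition as (label, first day, last day)."""
    last_day = calendar.monthrange(year, month)[1]
    return (f"month={year:04d}-{month:02d}",
            date(year, month, 1), date(year, month, last_day))


def day_partition(d: date):
    """A daily partition as (label, first day, last day)."""
    return (f"day={d.isoformat()}", d, d)


# Old layout: partitioned by month. New layout (from 2022-01-01): by day.
MONTHLY = [month_partition(2021, m) for m in (10, 11, 12)]
DAILY = [day_partition(date(2022, 1, 1) + timedelta(days=i)) for i in range(31)]


def plan(partitions, lower_exclusive: date, upper_exclusive: date):
    """Select partitions overlapping lower_exclusive < DATE < upper_exclusive."""
    lo = lower_exclusive + timedelta(days=1)
    hi = upper_exclusive - timedelta(days=1)
    return [label for (label, first, last) in partitions
            if first <= hi and last >= lo]


# The same predicate is applied to each layout; the query text does not change.
split_plan_1 = plan(MONTHLY, date(2021, 11, 23), date(2022, 1, 19))
split_plan_2 = plan(DAILY, date(2021, 11, 23), date(2022, 1, 19))
print(split_plan_1)
print(len(split_plan_2), split_plan_2[0], split_plan_2[-1])
```

The October partition is pruned from the monthly layout, and only the days before 2022-01-19 are read from the daily layout.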

Partition Evolution SQL examples
Engine           SQL Examples
Hive / Impala    -- Partition evolution to hour
                 ALTER TABLE t SET PARTITION SPEC (hour(ts))
Spark SQL        -- Partition evolution to hour
                 ALTER TABLE t ADD PARTITION FIELD (hour(ts))
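The `hour(ts)` transform in the examples above derives the partition value from the `ts` column, so no separate partition column has to be maintained. The Python sketch below approximates the time transforms as defined in the Iceberg table specification (hours, days, or months counted from the Unix epoch); the function names are illustrative and this is not the engines' implementation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_transform(ts: datetime) -> int:
    """Whole hours elapsed since 1970-01-01 00:00:00 UTC."""
    return (ts - EPOCH) // timedelta(hours=1)


def day_transform(ts: datetime) -> int:
    """Whole days elapsed since 1970-01-01 UTC."""
    return (ts - EPOCH) // timedelta(days=1)


def month_transform(ts: datetime) -> int:
    """Whole months elapsed since 1970-01 (ts is assumed to be in UTC)."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


# Two rows written in the same hour land in the same partition.
row_a = datetime(2021, 8, 9, 10, 5, 0, tzinfo=timezone.utc)
row_b = datetime(2021, 8, 9, 10, 35, 57, tzinfo=timezone.utc)
row_c = datetime(2021, 8, 9, 11, 0, 0, tzinfo=timezone.utc)

print(hour_transform(row_a) == hour_transform(row_b))
print(hour_transform(row_c) - hour_transform(row_b))
```

Switching a table from `day` to `hour` only changes which of these functions is applied to newly written rows; existing data keeps the partition values it was written with.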

Table Maintenance [ Tech Preview ]

Table Maintenance [ Tech Preview ]
Engine                Examples
Hive / Impala query   -- Tentative, proposed syntax, not in GA
                      -- Expires snapshots that are older than 7 days
                      ALTER TABLE test_table EXECUTE expire_snapshots_lt (now() - interval 7 days);
Spark Scala API       // Not in GA
                      // Expires snapshots that are older than 7 days
                      val test_table: Table = …
                      val tsToExpire: Long = System.currentTimeMillis() - (1000L * 60 * 60 * 24 * 7)
                      test_table.expireSnapshots()
                        .expireOlderThan(tsToExpire)
                        .commit()
Expiring old snapshots removes them from metadata, so they are no longer available for time travel operations. Data files are not deleted until they are no longer referenced by a snapshot that may be used for time travel. Regularly expiring snapshots deletes unused data files.
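The relationship between expired snapshots and deleted data files can be sketched in Python as follows. This is a toy model, not the Iceberg API: the `Snapshot` class, the `expire_older_than` function, and the file names are hypothetical, and the sketch assumes the most recent snapshot is always retained.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

DAY_MS = 24 * 60 * 60 * 1000


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int
    data_files: FrozenSet[str]  # files referenced by this snapshot


def expire_older_than(snapshots: List[Snapshot],
                      cutoff_ms: int) -> Tuple[List[Snapshot], Set[str]]:
    """Return (retained snapshots, data files no longer referenced)."""
    if not snapshots:
        return [], set()
    # Assumption: the current (latest) snapshot is never expired.
    current = max(snapshots, key=lambda s: s.committed_at_ms)
    kept = [s for s in snapshots
            if s.committed_at_ms >= cutoff_ms or s is current]
    expired = [s for s in snapshots if s not in kept]
    still_referenced = set().union(*(s.data_files for s in kept))
    candidates = set().union(*(s.data_files for s in expired))
    # A file is removable only if no retained snapshot still points to it.
    return kept, candidates - still_referenced


now_ms = 30 * DAY_MS
history = [
    Snapshot(1, now_ms - 10 * DAY_MS, frozenset({"a.parquet"})),
    Snapshot(2, now_ms - 8 * DAY_MS, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, now_ms - 1 * DAY_MS, frozenset({"b.parquet", "c.parquet"})),
]

kept, removable = expire_older_than(history, now_ms - 7 * DAY_MS)
print([s.snapshot_id for s in kept])
print(sorted(removable))
```

In this example `b.parquet` survives even though it first appeared in an expired snapshot, because the retained snapshot still references it.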
