Apache Iceberg
Scott Shaw
© 2021 Cloudera, Inc. All rights reserved.
What is Apache Iceberg?
• Efficient Table Format
– Hidden Partitioning
– Schema Evolution
– Time Travel
• Presto, Hive, Spark
• Created at Netflix (2017).
• Used at Adobe, Apple, LinkedIn, Experian
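Hidden partitioning can be pictured with a toy Python sketch. This is not Iceberg's API: the names day_transform, write_rows, and plan_scan are hypothetical, and the sketch assumes a day-granularity transform on a timestamp column. The point it illustrates is that the partition value is derived from the column by the table, so a query filters on the timestamp itself and never names a partition column.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Derive the partition value from the source column (assumed: one partition per day)."""
    return ts.date()


def write_rows(rows):
    """Group rows into per-partition 'files' keyed by the derived partition value."""
    files = defaultdict(list)
    for row in rows:
        files[day_transform(row["ts"])].append(row)
    return dict(files)


def plan_scan(files, start: datetime, end: datetime):
    """Return only the partitions that can hold rows with start <= ts <= end."""
    return sorted(p for p in files if start.date() <= p <= end.date())


rows = [
    {"id": 1, "ts": datetime(2021, 5, 1, 9, 30)},
    {"id": 2, "ts": datetime(2021, 5, 1, 17, 0)},
    {"id": 3, "ts": datetime(2021, 5, 2, 8, 15)},
    {"id": 4, "ts": datetime(2021, 5, 3, 12, 0)},
]
files = write_rows(rows)

# The filter is expressed on ts only; partition pruning follows from the transform.
selected = plan_scan(files, datetime(2021, 5, 2, 0, 0), datetime(2021, 5, 2, 23, 59))
print(selected)
```

The writer and the reader both go through the same transform, which is why the user never has to maintain a separate partition column.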
What are the Challenges?
• Data Scalability
• Atomicity
• Performance Degradation
• Complexity
• Object Stores
• Storage and Compute
• File System (Listing)
ARCHITECTURE
Architecture
[Diagram: query engines (Spark, Presto), storage (HDFS, Object Store), and Iceberg]
Architecture
[Diagram: Snapshot (01) and Snapshot (02) each point to a Manifest List; each Manifest List points to Manifests; each Manifest points to data Files]
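The hierarchy in this diagram can be modelled in a few lines of Python. This is a simplified sketch rather than Iceberg's actual metadata format: the class names mirror the diagram labels, the file paths are made up, and it assumes an append that adds one new manifest while reusing the existing one. It shows why time travel is cheap: an older snapshot still resolves to its own set of files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Manifest:
    data_files: Tuple[str, ...]  # paths of the data files this manifest tracks


@dataclass(frozen=True)
class ManifestList:
    manifests: Tuple[Manifest, ...]  # one manifest list per snapshot


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: ManifestList


def files_in(snapshot: Snapshot) -> List[str]:
    """Walk snapshot -> manifest list -> manifests -> data files."""
    return [f for m in snapshot.manifest_list.manifests for f in m.data_files]


# Snapshot (01): one manifest tracking two files (hypothetical paths).
m1 = Manifest(("data/file-a.parquet", "data/file-b.parquet"))
snap1 = Snapshot(1, ManifestList((m1,)))

# Snapshot (02): an append writes a new manifest; the new manifest list reuses m1.
m2 = Manifest(("data/file-c.parquet",))
snap2 = Snapshot(2, ManifestList((m1, m2)))

history = {s.snapshot_id: s for s in (snap1, snap2)}
current = files_in(history[2])      # the table as of the latest snapshot
as_of_first = files_in(history[1])  # "time travel" to the first snapshot
print(current, as_of_first)
```

Because every object is immutable, a reader holding Snapshot (01) is unaffected by the commit that produced Snapshot (02).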
WORKING WITH ICEBERG
Initial Setup
• Catalogs
– Working with SQL
– System Information
Spark
Adding a Catalog
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
Creating a Table
CREATE TABLE local.db.table (id bigint, data string) USING iceberg;
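The settings on this slide define two catalogs: spark_catalog, which wraps Spark's session catalog backed by Hive, and local, a path-based Hadoop catalog rooted at a warehouse directory. As a rough illustration, the same settings can be kept as data and rendered into the command line. The helper names iceberg_conf and spark_sql_command are hypothetical; only the property keys and values come from the slide.

```python
import os
import shlex

# Package coordinate taken from the slide.
PACKAGE = "org.apache.iceberg:iceberg-spark3-runtime:0.11.0"


def iceberg_conf(warehouse: str) -> dict:
    """Spark properties from the slide, with the warehouse path as a parameter."""
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.local.type": "hadoop",
        "spark.sql.catalog.local.warehouse": warehouse,
    }


def spark_sql_command(conf: dict) -> list:
    """Render the properties as spark-sql arguments, one --conf per property."""
    args = ["spark-sql", "--packages", PACKAGE]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = iceberg_conf(os.path.join(os.getcwd(), "warehouse"))
command = spark_sql_command(conf)
print(shlex.join(command))  # a shell-quoted command line, ready to paste
```

The same dictionary could also be applied key by key to a PySpark session builder instead of the command line; that variant is not shown because it requires a Spark installation.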
Hive
Add the jar file
add jar /path/to/iceberg-hive-runtime.jar;
Create an External Table
CREATE EXTERNAL TABLE table_a
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://some_bucket/some_path/table_a';
REFERENCES
References
Apache Iceberg: https://iceberg.apache.org/
Project Nessie: https://projectnessie.org/
Hive/Iceberg Integration: https://github.com/ExpediaGroup/hiveberg
Partitioning: https://developer.ibm.com/technologies/artificial-intelligence/articles/the-why-and-how-of-partitioning-in-apache-iceberg/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform
Iceberg Explained: https://thenewstack.io/apache-iceberg-a-different-table-design-for-big-data/
Apache Iceberg Presentation for the St. Louis Big Data IDEA
