Batch Processing at Scale
with Flink & Iceberg
Andreas Hailu
Vice President, Goldman Sachs
Goldman Sachs Data Lake
● Platform allowing users to generate batch data pipelines without writing any code
● Data producers register datasets, making metadata available
○ Dataset schema, source and access, batch frequency, etc.
○ Flink batch applications generated dynamically
● Datasets subscribed for updates by consumers in warehouses
● Producers and consumers decoupled
● Scale
○ 162K unique datasets
○ 140K batches/day
○ 4.2MM batches/month
[Architecture diagram: producer source data is ingested via ETL into the Lake (HDFS, S3), registered with the Registry Service and exposed through a Browseable Catalog; warehousing exports feed Redshift, SAP IQ/ASE, and Snowflake.]
Batch Data Strategy
● Lake operates using copy-on-write enumerated batches
● Extracted data merged with existing data to create a new batch
● Support both milestoned and append merges
○ Milestoned merge builds out records such that records themselves contain the as-of data
■ No time-travel required
■ Done per key, “linked-list” of time-series records
■ Immutable, retained forever
○ Append merge simply appends incoming data to existing data
● Merged data is stored as Parquet/Avro, with snapshots and deltas generated per batch
○ Data exported to warehouse on batch completion as either snapshot or incremental loads
● Consumers always read data from last completed batch
● Last 3 batches of merged data are retained for recovery purposes
Milestoning Example
Staging Data
First Name | Last Name | Profession | Date
Art        | Vandelay  | Importer   | May-31-1990

Merged Data (Batch 1)
lake_in_id | lake_out_id | lake_from   | lake_thru  | First Name | Last Name | Profession | Date
1          | 999999999   | May-31-1990 | 11/30/9999 | Art        | Vandelay  | Importer   | May-31-1990
Milestoning Example
Staging Data
First Name | Last Name | Profession        | Date
Art        | Vandelay  | Importer-Exporter | June-30-1990

Merged Data (Batch 2)
lake_in_id | lake_out_id | lake_from    | lake_thru    | First Name | Last Name | Profession        | Date
1          | 1           | May-31-1990  | 11/30/9999   | Art        | Vandelay  | Importer          | May-31-1990
2          | 999999999   | May-31-1990  | June-30-1990 | Art        | Vandelay  | Importer          | May-31-1990
2          | 999999999   | June-30-1990 | 11/30/9999   | Art        | Vandelay  | Importer-Exporter | June-30-1990
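To make the mechanics above concrete, here is a minimal, illustrative Java sketch of the per-key milestoning step. It assumes simplified field names and types (the real pipeline works on Avro GenericRecords inside a Flink CoGroup) and assumes a closed record's lake_out_id is set to the batch preceding the superseding one, as in the Batch 2 table.

```java
// Hedged sketch of per-key milestoning, mirroring the tables above: the previously
// live record is closed, and new records carrying the as-of time series are inserted.
import java.util.ArrayList;
import java.util.List;

public class MilestoneSketch {
    static final long LIVE = 999_999_999L;       // lake_out_id of a live record
    static final String INFINITY = "11/30/9999"; // open-ended lake_thru

    // Simplified stand-in for an Avro GenericRecord.
    record Row(long inId, long outId, String from, String thru, String profession) {}

    /** Merge one staged change into the existing rows for a single key. */
    static List<Row> merge(List<Row> existing, String changeDate, String newProfession, long batchId) {
        List<Row> out = new ArrayList<>();
        for (Row r : existing) {
            if (r.outId() == LIVE) {
                // Close the previously live record (assumption: out id = prior batch, as in the example)...
                out.add(new Row(r.inId(), batchId - 1, r.from(), r.thru(), r.profession()));
                // ...and rebuild the key's time series under the new batch id.
                out.add(new Row(batchId, LIVE, r.from(), changeDate, r.profession()));
                out.add(new Row(batchId, LIVE, changeDate, INFINITY, newProfession));
            } else {
                out.add(r); // dead records are carried forward untouched
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> batch1 = List.of(new Row(1, LIVE, "May-31-1990", INFINITY, "Importer"));
        // Reproduces the three merged rows of the Batch 2 table above.
        merge(batch1, "June-30-1990", "Importer-Exporter", 2).forEach(System.out::println);
    }
}
```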
Job Graph - Extract
[Job graph diagram: a DataSource extracts the source data; Map/FlatMap operators transform it into Avro, enrich it, and validate data quality; a DataSink writes the Batch N staging directory, while a side path accumulates Bloom filters, partition statistics, etc. into an empty sink.]
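A minimal sketch of that extract job shape in Flink's DataSet API follows. The source path, transform, and validation helpers are hypothetical placeholders; the real generated jobs read producer-specific sources, produce Avro GenericRecords, and also accumulate Bloom filters and partition statistics.

```java
// Hedged sketch of the extract job: DataSource -> Map/FlatMap -> DataSink (staging directory).
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ExtractJobSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // DataSource: read the producer's source data for this batch (path is illustrative).
        DataSet<String> source = env.readTextFile("hdfs:///source/dataset/batch-N");

        // Map/Filter: transform into Avro-style records, enrich, validate data quality.
        DataSet<String> staged = source
                .map(ExtractJobSketch::transformToAvro)       // hypothetical transform
                .filter(ExtractJobSketch::passesDataQuality); // hypothetical validation

        // DataSink: write staged records to the Batch N staging directory.
        staged.writeAsText("hdfs:///staging/dataset/batch-N");

        env.execute("extract-dataset-batch-N");
    }

    private static String transformToAvro(String line) { return line; }
    private static boolean passesDataQuality(String record) { return record != null && !record.isEmpty(); }
}
```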
Job Graph - Merge
[Job graph diagram: two filtered DataSources read the Batch N-1 merged data, splitting it into live records that are in the Bloom filter and dead records or records not in the Bloom filter; a third DataSource reads the Batch N staging data. Live and staging records are keyBy()'d and merged in a CoGroup, while the dead/not-in-filter records bypass the CoGroup; a DataSink writes the Batch N merge directory (snapshot & delta).]
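The sketch below shows the CoGroup shape of the merge job, using (key, payload) tuples as stand-ins for keyed Avro records. The per-key logic is deliberately simplified to "staging supersedes existing"; the real CoGroup applies the milestoning rewrite shown earlier and produces snapshot and delta outputs.

```java
// Hedged sketch of the merge job's CoGroup between previous live records and staging data.
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class MergeJobSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (key, payload) pairs; illustrative inline data instead of the Batch N-1 / staging reads.
        DataSet<Tuple2<String, String>> previousLive =
                env.fromElements(Tuple2.of("art-vandelay", "Importer"));
        DataSet<Tuple2<String, String>> staging =
                env.fromElements(Tuple2.of("art-vandelay", "Importer-Exporter"));

        DataSet<Tuple2<String, String>> merged = previousLive
                .coGroup(staging)
                .where(0)   // key of existing live records
                .equalTo(0) // key of incoming staging records
                .with(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>,
                        Tuple2<String, String>>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, String>> existing,
                                        Iterable<Tuple2<String, String>> incoming,
                                        Collector<Tuple2<String, String>> out) {
                        // Per key: incoming staging records supersede the existing live record;
                        // keys with no change are carried forward unchanged.
                        boolean changed = false;
                        for (Tuple2<String, String> in : incoming) { out.collect(in); changed = true; }
                        if (!changed) {
                            for (Tuple2<String, String> ex : existing) { out.collect(ex); }
                        }
                    }
                });

        merged.writeAsText("hdfs:///merged/dataset/batch-N");
        env.execute("merge-dataset-batch-N");
    }
}
```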
Merge Details
● Staging data is merged with existing live records
○ Some niche exceptions for certain use cases
● Updates result in the closure of the existing record and insertion of a new record
○ lake_out_id < 999999999 - “dead”
● Live records are typically what consumers query, as they contain the time-series data
○ lake_out_id = 999999999 - “live”
● Over time, serializing records that never need to go through the CoGroup hinders runtime fitness
○ Dead records and records filtered out by the Bloom filter must still be written to the new batch’s merge directory
○ More time is spent rewriting records than actually merging them in CoGroup
● Dead and live records bucketed by file: live records are read, dead files are simply copied (Bloom filter routing sketched after this list)
○ Substantial runtime reduction as data volume grows for patterns where ≥ 50% of the data is composed of dead records
● Append merges copy data from previous batch
● Both optimizations require periodic compaction to tame overall file count
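As referenced above, here is a small illustration of the Bloom-filter routing idea, using Guava's BloomFilter as a stand-in for whatever filter the extract job actually accumulates; the keys and sizing are made up.

```java
// Hedged sketch: keys present in the staging batch go through the CoGroup merge path,
// while previously merged records whose keys are definitely absent are copied forward.
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BloomRoutingSketch {
    public static void main(String[] args) {
        // Built during the extract job from the staging batch's record keys.
        BloomFilter<String> stagedKeys =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
        stagedKeys.put("art-vandelay");

        // During the merge job, previously merged records are routed by key.
        for (String key : List.of("art-vandelay", "h-e-pennypacker")) {
            if (stagedKeys.mightContain(key)) {
                System.out.println(key + " -> CoGroup merge path");
            } else {
                System.out.println(key + " -> copy forward unchanged (no staging change)");
            }
        }
    }
}
```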
Partitioning
● Can substantially improve batch turnover time
○ Data merged against its own partition, reducing overall volume of data written in batch
● Dataset must have a field that meets the partitioning requirements
○ Date, timestamp, or integer
● Changes how data is stored
○ Different underlying directory structure, consumers must be aware
○ Registry service stores metadata about latest batch for a partition
● Merge end result can be different
○ Partition fields can’t be changed once set
● Not all datasets have a field to partition on
Challenges
● Change set volumes per batch tend to stay consistent over time, but overall data volume increases
● Data producer & consumer SLAs tend to be static
○ Data must be made available 30 minutes after batch begins
○ Data must be available by 14:30 EST in order to fulfill EOD reporting
● Own the implementation, not the data
○ Same code runs for every dataset
○ No control over fields, types, batch size, partitioning strategy, etc.
● Support different use cases
○ Daily batch to 100+ batches/day
○ Milestoned & append batches
○ Snapshot feeds, incremental loads
● Merge optimizations so far only help ingest apps
○ Data consumed in many ways once ingested
○ User-written Spark code, internal processes exporting snapshot and incremental loads to warehouses
Iceberg
● Moving primary storage from HDFS → S3 offered a chance to review the batch data strategy
● Iceberg’s metadata layer offers interesting features
○ Manifest files recording statistics
○ Hidden partitioning
■ Reading data looks the same client-side, regardless of whether or how the table is partitioned
■ Tracking of partition metadata no longer required
■ Filtering out blocks with Parquet predicates is good; not reading them at all is better
● Not all datasets use Parquet
■ Consumers benefit in addition to ingest apps
○ V2 table format
■ Performant merge-on-read potential
● Batch retention managed with Snapshots
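For example, the "retain the last N batches" policy maps naturally onto Iceberg snapshot expiry. The sketch below uses the Iceberg Java API with an assumed Hadoop catalog location and table name.

```java
// Hedged sketch of batch retention via snapshot expiry (keep the last 3 batches, as in
// the file-based lake); warehouse path and table identifier are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class RetentionSketch {
    public static void main(String[] args) {
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3://lake/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("lake", "dataset"));

        // Each completed batch commits a snapshot; retain only the most recent ones.
        table.expireSnapshots()
                .expireOlderThan(System.currentTimeMillis())
                .retainLast(3) // keep the last 3 batches for recovery
                .commit();
    }
}
```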
Iceberg - Partitioning
● Tables maintain metadata files that facilitate query planning
● Determines which files are required for a query
○ Unnecessary files are not read; a single lookup rather than multiple IOPs
● Milestoned tables partitioned by record liveness
○ Live records bucketed together, dead records bucketed together
○ “select distinct(Profession) from dataset where lake_out_id = 999999999 and lake_from >= 7/1/1990 and lake_thru < 8/29/1990”
○ Ingest app no longer responsible for implementation
● Can further be partitioned by producer-specified field in schema
● Table implementation can change while consumption patterns don’t
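Below is a sketch of what such a partition spec could look like with the Iceberg Java API. The schema, field IDs, table identifier, and the choice of an identity transform on lake_out_id plus a hidden month() transform on a producer-specified date field are all illustrative, not the exact production layout.

```java
// Hedged sketch: partition by record liveness (lake_out_id) so live and dead records land
// in separate partitions, plus a hidden date transform; queries never reference the transform.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class PartitionSpecSketch {
    public static void main(String[] args) {
        Schema schema = new Schema(
                Types.NestedField.required(1, "lake_in_id", Types.LongType.get()),
                Types.NestedField.required(2, "lake_out_id", Types.LongType.get()),
                Types.NestedField.required(3, "lake_from", Types.DateType.get()),
                Types.NestedField.required(4, "lake_thru", Types.DateType.get()),
                Types.NestedField.optional(5, "profession", Types.StringType.get()),
                Types.NestedField.optional(6, "event_date", Types.DateType.get())); // producer field (assumed)

        PartitionSpec spec = PartitionSpec.builderFor(schema)
                .identity("lake_out_id")  // liveness-based pruning
                .month("event_date")      // hidden partitioning on the producer date field
                .build();

        new HadoopCatalog(new Configuration(), "s3://lake/warehouse")
                .createTable(TableIdentifier.of("lake", "dataset"), schema, spec);
    }
}
```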
Iceberg - V2 Tables
● V2 tables support a merge-on-read strategy
○ Deltas applied to main table in lieu of rewriting files every batch
● Traditional ingest CoGroup step already marked records for insert, update, delete, and unchanged
● Read only required records for CoGroup
○ Output becomes a bounded changelog DataStream
○ Unchanged records no longer emitted
● GenericRecord transformed to RowData and given the delta-appropriate RowKind association when written to the Iceberg table (see the conversion sketch after this list)
○ RowKind.INSERT for new records
○ RowKind.DELETE + RowKind.INSERT for updates
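Here is an illustrative conversion of CoGroup decisions into RowKind-tagged RowData. The ChangeType enum, field layout, and helper names are assumptions; the real job maps Avro GenericRecords field by field.

```java
// Hedged sketch: each merge decision becomes RowData with the matching RowKind before
// it is handed to the Iceberg sink; unchanged records are never emitted.
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.types.RowKind;

public class ChangelogConversionSketch {
    enum ChangeType { INSERT, UPDATE, DELETE } // assumption: how the CoGroup marks records

    record Change(ChangeType type, String recordId, String oldProfession, String newProfession) {}

    static List<RowData> toRowData(Change change) {
        List<RowData> rows = new ArrayList<>();
        switch (change.type()) {
            case INSERT -> rows.add(row(RowKind.INSERT, change.recordId(), change.newProfession()));
            case DELETE -> rows.add(row(RowKind.DELETE, change.recordId(), change.oldProfession()));
            case UPDATE -> { // update = delete of the old image + insert of the new one
                rows.add(row(RowKind.DELETE, change.recordId(), change.oldProfession()));
                rows.add(row(RowKind.INSERT, change.recordId(), change.newProfession()));
            }
        }
        return rows;
    }

    private static RowData row(RowKind kind, String recordId, String profession) {
        GenericRowData row = GenericRowData.of(StringData.fromString(recordId),
                StringData.fromString(profession));
        row.setRowKind(kind);
        return row;
    }
}
```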
Iceberg - V2 Tables
● Iceberg Flink connector uses equality deletes
○ Identifies deleted rows by the values of ≥ 1 columns
○ A data row is deleted if its values equal those of the delete columns
○ Doesn’t require knowing where the rows are
○ Deleted rows are physically removed when files are compacted
○ Positional deletes, by contrast, require knowing where the row to delete is located
● Records enriched with an internal field carrying a unique identifier for deletes (see the sink sketch after this list)
○ Random 32-bit alphanumeric ID created during the extract phase
○ Consumers only read data with the schema in the registry
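A sketch of attaching the changelog to an Iceberg v2 table through the Flink connector, using a single internal ID column as the equality-delete field. The column name lake_record_id and the table location are assumptions.

```java
// Hedged sketch of the Iceberg v2 sink wiring for the bounded changelog stream.
import java.util.Collections;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class IcebergSinkSketch {
    /** Attach the Iceberg v2 sink to a bounded changelog of RowData. */
    static void writeToIceberg(DataStream<RowData> changelog) {
        TableLoader tableLoader =
                TableLoader.fromHadoopTable("s3://lake/warehouse/lake/dataset"); // assumed location

        FlinkSink.forRowData(changelog)
                .tableLoader(tableLoader)
                // Equality deletes match rows on this single internal ID column,
                // keeping read-time delete application cheap.
                .equalityFieldColumns(Collections.singletonList("lake_record_id"))
                .upsert(false) // explicit DELETE + INSERT rows come from the changelog itself
                .append();
    }
}
```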
Iceberg - V2 Tables Maintenance
● Over time, inserts and deletes can lead to many small data and delete files
○ Small files problem, and more metadata stored in manifest files
● Periodically compact files during downtime
○ Downtime determined from ingestion schedule metadata in Registry
○ Creates a new snapshot, reads not impacted
○ Deletes applied to data files
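A sketch of the compaction step using the Iceberg Flink rewrite action; the catalog location, table name, and target file size are assumptions, and in production this would be triggered from the Registry's ingestion-schedule metadata.

```java
// Hedged sketch of scheduled compaction: rewrite small data files into larger ones and
// apply accumulated deletes; this commits a new snapshot, so concurrent reads keep working.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.flink.actions.Actions;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class CompactionSketch {
    public static void main(String[] args) {
        Table table = new HadoopCatalog(new Configuration(), "s3://lake/warehouse")
                .loadTable(TableIdentifier.of("lake", "dataset"));

        Actions.forTable(table)
                .rewriteDataFiles()
                .targetSizeInBytes(512L * 1024 * 1024) // assumed target file size
                .execute();
    }
}
```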
Iceberg - V2 Tables Performance Testing
● Milestoning
○ Many updates and deletes
○ 10 million records over 8 batches
■ ~1.2GB staging data/batch
○ 10GB Snappy compressed data in total
○ 51% observed reduction in overall runtime over 8 batches when compared to traditional file-based storage
○ Compaction runtime 51% faster than traditional merge runtime
● Append
○ Data is only appended, no updates/deletes
○ 500K records over 5 batches
○ 1TB Snappy compressed data in total
○ 63% observed reduction in overall runtime over 5 batches
○ Compaction runtime 24% faster than average traditional merge runtime
Summary
● Select equality delete fields wisely
○ Using just 1 field minimizes read overhead
● Compaction approach needs to be thought through early
○ Scheduling built as part of the application
● Partition to facilitate query patterns
Q&A
Thanks!
Learn more at GS.com/Engineering
The term ‘engineer’ in this section does not denote a licensed engineer or an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not
represent that it is accurate, complete, and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon
them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition of Goldman Sachs presenting the materials to you, you agree to
treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2022 the Goldman Sachs Group, Inc. All rights
reserved.
