ADVANCED DATABASE SYSTEMS
Andy Pavlo // 15-721 // Spring 2023
Modern OLAP Databases
Lecture #02
15-721 (Spring 2023)
COURSE OUTLINE
Storage
→ Columnar Storage
→ Compression
→ Indexes
Query Execution:
→ Processing Models
→ Scheduling
→ Vectorization
→ Compilation
→ Joins
→ Materialized Views
Query Optimization
Network Interfaces
[Figure: DBMS layer stack: Client Interface, Optimization, Query Execution, Storage]
TODAY’S AGENDA
Query Execution
Distributed System Architectures
OLAP Commoditization
DISTRIBUTED QUERY EXECUTION
Executing an OLAP query in a distributed DBMS is
roughly the same as on a single-node DBMS.
→ Query plan is a DAG of physical operators.
For each operator, the DBMS considers where
input is coming from and where to send output.
→ Table Scans
→ Joins
→ Aggregations
→ Sorting
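As a rough illustration of the slide above, the following sketch models a query plan as a DAG of physical operators, each annotated with where its input comes from and where its output goes. The `Operator` class, the `topo_order` helper, and the location labels are hypothetical names chosen for this example, not any DBMS's API.

```python
from dataclasses import dataclass, field


@dataclass
class Operator:
    """A physical operator plus the placement decisions the DBMS must make."""
    name: str
    input_from: str   # where this operator reads its input (assumed label)
    output_to: str    # where this operator sends its output (assumed label)
    children: list = field(default_factory=list)


def topo_order(root):
    """Return operators so that every child appears before its parent."""
    seen, order = set(), []

    def visit(op):
        if id(op) in seen:
            return
        seen.add(id(op))
        for child in op.children:
            visit(child)
        order.append(op)

    visit(root)
    return order


scan_a = Operator("scan(A)", "persistent storage", "shuffle")
scan_b = Operator("scan(B)", "persistent storage", "shuffle")
join = Operator("join(A,B)", "shuffle", "worker-local memory", [scan_a, scan_b])
agg = Operator("aggregate", "worker-local memory", "client", [join])

plan = [op.name for op in topo_order(agg)]
for op in topo_order(agg):
    print(f"{op.name}: reads from {op.input_from}, writes to {op.output_to}")
```

The same DAG would describe a single-node plan; only the `input_from` and `output_to` annotations change when the plan is distributed.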
[Figure: Persistent Data → Worker Nodes → Intermediate Data → Shuffle Nodes (optional) → Worker Nodes → Final Result]
DATA CATEGORIES
Persistent Data:
→ The "source of record" for the database (e.g., tables).
→ Modern systems assume that these data files are immutable
but can support updates by rewriting them.
Intermediate Data:
→ Short-lived artifacts produced by query operators during
execution and then consumed by other operators.
→ The amount of intermediate data that a query generates
has little to no correlation with the amount of persistent data
that it reads or with its execution time.
BUILDING AN ELASTIC QUERY ENGINE ON DISAGGREGATED STORAGE
NSDI 2020
DISTRIBUTED SYSTEM ARCHITECTURE
A distributed DBMS's system architecture specifies
the location of the database's persistent data files.
This affects how nodes coordinate with each other
and where they retrieve/store objects in the
database.
Two approaches (not mutually exclusive):
→ Push Query to Data
→ Pull Data to Query
THE CASE FOR SHARED NOTHING
HPTS 1985
PUSH VS. PULL
Approach #1: Push Query to Data
→ Send the query (or a portion of it) to the node that
contains the data.
→ Perform as much filtering and processing as possible where
data resides before transmitting over network.
Approach #2: Pull Data to Query
→ Bring the data to the node that is executing a query that
needs it for processing.
→ This is necessary when there are no compute resources
available where persistent data files are located.
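The difference between the two approaches can be sketched by counting how many rows cross the network in each case. This is a toy model under stated assumptions: the table, the `predicate`, and both function names are made up for illustration, and "rows sent" stands in for network traffic.

```python
# Toy table: every fourth row belongs to region "EU".
rows = [{"id": i, "region": "EU" if i % 4 == 0 else "US"} for i in range(1000)]


def predicate(row):
    return row["region"] == "EU"


def push_query_to_data(storage_rows):
    """The node holding the data filters first; only matches are shipped."""
    shipped = [r for r in storage_rows if predicate(r)]
    return shipped, len(shipped)


def pull_data_to_query(storage_rows):
    """All rows are shipped; the node executing the query filters them."""
    shipped = list(storage_rows)
    result = [r for r in shipped if predicate(r)]
    return result, len(shipped)


pushed_result, pushed_rows_sent = push_query_to_data(rows)
pulled_result, pulled_rows_sent = pull_data_to_query(rows)

print(pushed_rows_sent, pulled_rows_sent)
```

Both approaches return the same answer; they differ only in how much data moves before the filter is applied.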
SHARED NOTHING
Each DBMS instance has its own
CPU, memory, and locally-attached disk.
→ Nodes only communicate with each other
via network.
Database is partitioned into disjoint
subsets across nodes.
→ Adding a new node requires physically
moving data between nodes.
Since data is local, the DBMS can
access it via the POSIX API.
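To see why adding a node forces data movement, consider a minimal sketch that assumes naive modulo partitioning on an integer key. The scheme and the `partition` helper are assumptions for illustration; real systems may use range partitioning, consistent hashing, or other schemes that move less data.

```python
def partition(keys, num_nodes):
    """Assign each key to exactly one node (disjoint subsets)."""
    return {k: k % num_nodes for k in keys}


keys = range(12)
before = partition(keys, 3)   # cluster with 3 nodes
after = partition(keys, 4)    # same keys after adding a 4th node

# Any key whose owner changed must be physically copied to another node.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys change nodes: {moved}")
```

In this toy case most keys change owners, which is the "harder to scale capacity" cost attributed to shared-nothing systems.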
[Figure: DBMS nodes connected by a network]
SHARED DISK
Each node accesses a single logical
disk via an interconnect, but also has
its own private memory and
ephemeral storage.
→ Nodes must send messages to each other to
learn about their current state.
Instead of a POSIX API, the DBMS
accesses disk using a userspace API.
[Figure: a compute layer and a storage layer, connected over the network]
SYSTEM ARCHITECTURE
Choice #1: Shared Nothing
→ Harder to scale capacity (data movement).
→ Potentially better performance & efficiency.
→ Apply filters where the data resides before transferring.
Choice #2: Shared Disk
→ Scale compute layer independently from the storage layer.
→ Easy to shut down idle compute layer resources.
→ May need to pull uncached persistent data from storage
layer to compute layer before applying filters.
SHARED DISK
Traditionally, the storage layer in shared-disk
DBMSs was a dedicated on-prem NAS.
→ Example: Oracle Exadata
Cloud object stores are now the prevailing storage
target for modern OLAP DBMSs because they are
"infinitely" scalable.
→ Examples: Amazon S3, Azure Blob, Google Cloud Storage
OBJECT STORES
Partition the database's tables (persistent data) into
large, immutable files stored in an object store.
→ All attributes for a tuple are stored in the same file in a
columnar layout (PAX).
→ Header (or footer) contains meta-data about columnar
offsets, compression schemes, indexes, and zone maps.
The DBMS retrieves a block's header to determine
what byte ranges it needs to retrieve (if any).
Each cloud vendor provides its own proprietary
API to access data (PUT, GET, DELETE).
→ Some vendors support predicate pushdown (S3).
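The header lookup described above can be sketched as follows: the DBMS reads per-chunk min/max statistics (a zone map) from the file's metadata and requests only the byte ranges that might contain matching values. The `ColumnChunk` layout, the offsets, and `byte_ranges_for` are hypothetical; actual file formats define their own metadata structures.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Metadata for one chunk of a column, as stored in a header or footer."""
    offset: int      # byte offset of the chunk within the file
    length: int      # chunk size in bytes
    min_value: int   # zone map: smallest value in the chunk
    max_value: int   # zone map: largest value in the chunk


# Assumed metadata for a single column split into three chunks.
footer = [
    ColumnChunk(0, 4096, 1, 100),
    ColumnChunk(4096, 4096, 101, 200),
    ColumnChunk(8192, 4096, 201, 300),
]


def byte_ranges_for(chunks, low, high):
    """Return (start, end) ranges, end exclusive, for chunks whose
    [min, max] interval overlaps the predicate range [low, high]."""
    return [
        (c.offset, c.offset + c.length)
        for c in chunks
        if c.max_value >= low and c.min_value <= high
    ]


ranges = byte_ranges_for(footer, 150, 250)   # WHERE col BETWEEN 150 AND 250
nothing = byte_ranges_for(footer, 400, 500)  # no chunk can match
print(ranges, nothing)
```

Each returned range would then be fetched with a ranged read against the object store; an empty list means the file can be skipped entirely, which is the "(if any)" case.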
ADDITIONAL TOPICS
File Formats
Table Partitioning
Data Ingestion / Updates / Discovery
Scheduling / Adaptivity
OBSERVATION
Snowflake is a monolithic system composed of
components built entirely in-house.
Most of the non-academic DBMSs we will cover
this semester will have a similar overall architecture.
But this means that multiple organizations are
writing the same DBMS software…
OLAP COMMODITIZATION
One recent trend of the last decade is the breakout of
OLAP engine sub-systems into standalone open-source
components.
→ This is typically done by organizations not in the business
of selling DBMS software.
Examples:
→ System Catalogs
→ Query Optimizers
→ File Format / Access Libraries
→ Execution Engines
SYSTEM CATALOGS
A DBMS tracks a database's schema (tables, columns)
and data files in its catalog.
→ If the DBMS is on the data ingestion path, then it can
maintain the catalog incrementally.
→ If an external process adds data files, then it also needs to
update the catalog so that the DBMS is aware of them.
Notable implementations:
→ HCatalog
→ Google Data Catalog
→ AWS Glue Data Catalog
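A minimal sketch of the two ingestion paths, assuming a toy in-memory catalog. The `Catalog` class, its methods, and the file paths are hypothetical and do not correspond to any of the implementations listed above.

```python
class Catalog:
    """Toy catalog: maps each table to its columns and its data files."""

    def __init__(self):
        self.schemas = {}
        self.files = {}

    def create_table(self, table, columns):
        self.schemas[table] = list(columns)
        self.files[table] = []

    def register_file(self, table, path):
        if table not in self.schemas:
            raise KeyError(f"unknown table: {table}")
        self.files[table].append(path)

    def files_for(self, table):
        return list(self.files[table])


catalog = Catalog()
catalog.create_table("orders", ["id", "amount"])

# Path 1: the DBMS ingests the data itself and updates the catalog as it goes.
catalog.register_file("orders", "orders/part-0000.bin")

# Path 2: an external process wrote this file, so it must also register it;
# otherwise a scan of "orders" would never see the new data.
catalog.register_file("orders", "orders/part-0001.bin")

print(catalog.files_for("orders"))
```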
QUERY OPTIMIZERS
Extensible search engine framework for heuristic-
and cost-based query optimization.
→ DBMS provides transformation rules and cost estimates.
→ Framework returns either a logical or physical query plan.
This is the hardest part to build in any DBMS.
Notable implementations:
→ Greenplum Orca
→ Apache Calcite
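The division of labor above (the DBMS supplies cost estimates, the framework searches for a plan) can be sketched with a toy join-ordering problem. Everything here is an assumption for illustration: the cardinalities, the fixed 1% join selectivity, the cost function, and the exhaustive enumeration, which stands in for the far more sophisticated search strategies that Orca and Calcite actually use.

```python
from itertools import permutations

# "DBMS-provided" estimates: table cardinalities (made-up numbers).
cardinality = {"A": 1000, "B": 10, "C": 100}


def plan_cost(order):
    """Cost of a left-deep join order: sum of intermediate result sizes,
    assuming every join keeps 1% of the cross product."""
    size = cardinality[order[0]]
    cost = 0
    for table in order[1:]:
        size = size * cardinality[table] // 100
        cost += size
    return cost


def search(tables):
    """'Framework' side: enumerate candidate plans and keep the cheapest."""
    return min(permutations(sorted(tables)), key=plan_cost)


best = search(cardinality)
print(best, plan_cost(best))
```

Under these assumptions, joining the two small tables first is cheapest because it keeps the first intermediate result small.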
ORCA: A MODULAR QUERY OPTIMIZER ARCHITECTURE FOR BIG DATA
SIGMOD 2014
APACHE CALCITE: A FOUNDATIONAL FRAMEWORK FOR OPTIMIZED QUERY PROCESSING OVER HETEROGENEOUS DATA SOURCES
SIGMOD 2018
FILE FORMATS
Most DBMSs use a proprietary on-disk binary file
format for their databases. The only way to share
data between systems is to convert data into a
common text-based format.
→ Examples: CSV, JSON, XML
There are open-source binary file formats that make
it easier to access data across systems, along with
libraries for extracting data from those files.
→ Libraries provide an iterator interface to retrieve (batched)
columns from files.
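The iterator interface can be sketched with a generator that yields one batch of the requested columns at a time. The in-memory `file_columns` dictionary stands in for a decoded file, and `read_batches` is a hypothetical name, not a function from any real format library.

```python
def read_batches(columns, wanted, batch_size):
    """Yield dicts mapping each wanted column name to a list of at most
    batch_size values. Assumes all columns have the same length and that
    `wanted` is non-empty."""
    num_rows = len(columns[wanted[0]])
    for start in range(0, num_rows, batch_size):
        yield {name: columns[name][start:start + batch_size] for name in wanted}


# Stand-in for a columnar file with three columns and five rows.
file_columns = {
    "id": [1, 2, 3, 4, 5],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "note": ["a", "b", "c", "d", "e"],
}

# Only the projected columns are materialized; "note" is never touched.
batches = list(read_batches(file_columns, ["id", "price"], batch_size=2))
for batch in batches:
    print(batch)
```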
UNIVERSAL FORMATS
Apache Parquet (2013)
→ Compressed columnar storage from
Cloudera/Twitter
Apache ORC (2013)
→ Compressed columnar storage from
Apache Hive.
Apache CarbonData (2013)
→ Compressed columnar storage with
indexes from Huawei.
Apache Iceberg (2017)
→ Flexible data format that supports
schema evolution from Netflix.
HDF5 (1998)
→ Multi-dimensional arrays for
scientific workloads.
Apache Arrow (2016)
→ In-memory compressed columnar
storage from Pandas/Dremio.
EXECUTION ENGINES
Standalone libraries for executing vectorized query
operators on columnar data.
→ Input is a DAG of physical operators.
→ Require external scheduling and orchestration.
Notable implementations:
→ Velox
→ DataFusion
→ Intel OAP
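As a rough idea of what "vectorized operators on columnar data" means, the sketch below processes a batch of column values per operator call and passes a selection vector between operators. Plain Python lists stand in for column vectors, and `filter_gt` and `sum_selected` are hypothetical names; libraries such as those listed above implement this in native code with SIMD-friendly layouts.

```python
def filter_gt(column, threshold):
    """Filter operator: return a selection vector holding the positions
    of values greater than threshold."""
    return [i for i, value in enumerate(column) if value > threshold]


def sum_selected(column, selection):
    """Aggregate operator: sum only the selected positions of a column."""
    return sum(column[i] for i in selection)


# Two batches of a table with columns qty and price (made-up data).
batches = [
    {"qty": [5, 12, 7], "price": [1.0, 2.0, 3.0]},
    {"qty": [20, 1], "price": [4.0, 5.0]},
]

# SELECT SUM(price) WHERE qty > 6, evaluated one batch at a time.
total = 0.0
for batch in batches:
    selection = filter_gt(batch["qty"], 6)
    total += sum_selected(batch["price"], selection)

print(total)
```

The loop over batches is the part a standalone engine leaves to the caller, matching the slide's note that these libraries require external scheduling and orchestration.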
VLDB 2022
CONCLUSION
Today was about understanding the high-level
context of what modern OLAP DBMSs look like.
→ Fundamentally, these new DBMSs are no different from
previous distributed/parallel DBMSs except for the
prevalence of a cloud-based object store for shared disk.
Our focus for the rest of the semester will be on
state-of-the-art implementations of these systems'
components.
NEXT CLASS
Storage Models
Data Representation
Partitioning
Catalogs