ADVANCED DATABASE SYSTEMS
Andy Pavlo // 15-721 // Spring 2023
Storage Models & Data Layout
Lecture #03
15-721 (Spring 2023)
OBSERVATION
Today's lecture is about the lowest physical
representation of data in a database.
What data "looks" like determines almost all of a DBMS's
system architecture.
→ Processing Model
→ Tuple Materialization Strategy
→ Operator Algorithms
→ Data Ingestion / Updates
→ Concurrency Control (we will ignore this)
→ Query Optimization
TODAY’S AGENDA
Storage Models
Type Representation
Partitioning
STORAGE MODELS
A DBMS's storage model specifies how it
physically organizes tuples on disk and in memory.
Choice #1: N-ary Storage Model (NSM)
Choice #2: Decomposition Storage Model (DSM)
Choice #3: Hybrid Storage Model (PAX)
COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? (SIGMOD 2008)
N-ARY STORAGE MODEL (NSM)
The DBMS stores (almost) all the attributes for a
single tuple contiguously in a single page.
Ideal for OLTP workloads, where transactions tend to access
individual entities, and for insert-heavy workloads.
→ Use the tuple-at-a-time iterator processing model.
NSM database page sizes are typically some constant
multiple of 4 KB hardware pages.
→ Example: Oracle (4 KB), Postgres (8 KB), MySQL (16 KB)
NSM: PHYSICAL ORGANIZATION
A disk-oriented NSM system stores a tuple's fixed-length and
variable-length attributes contiguously in a single slotted page.
The tuple's record id (page#, slot#) is how the DBMS uniquely
identifies a physical tuple.
[Figure: a table with columns A, B, C and rows #0 to #5 is stored in a database page that holds a page header, a slot array, and the tuples; each tuple is a header followed by its a, b, and c values.]
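The slotted page and the (page#, slot#) record id can be sketched in C as follows. This is only an illustration under stated assumptions: the 4 KB page size, the slot array growing from the front while tuple bytes grow from the back, and the names `page_t`, `slot_t`, `rid_t`, `page_insert`, and `page_get` are all hypothetical and not taken from any particular DBMS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumed page size for this sketch */

/* One slot: where a tuple starts inside the page and how long it is. */
typedef struct { uint16_t offset; uint16_t length; } slot_t;

/* Page header fields followed by the raw page bytes. The slot array grows
 * from the front of data[], tuple bytes grow from the back. */
typedef struct {
    uint16_t num_slots;
    uint16_t free_end;          /* offset of the lowest byte used by tuples */
    uint8_t  data[PAGE_SIZE];
} page_t;

/* Record id: (page#, slot#). */
typedef struct { uint32_t page_no; uint16_t slot_no; } rid_t;

void page_init(page_t *p) { p->num_slots = 0; p->free_end = PAGE_SIZE; }

/* Copies a tuple into the page; returns its slot number, or -1 if full. */
int page_insert(page_t *p, const void *tuple, uint16_t len) {
    assert(p != NULL && tuple != NULL);
    size_t slots_end = (size_t)(p->num_slots + 1) * sizeof(slot_t);
    if (slots_end + len > p->free_end) return -1;
    p->free_end = (uint16_t)(p->free_end - len);
    memcpy(p->data + p->free_end, tuple, len);
    slot_t s = { p->free_end, len };
    memcpy(p->data + (size_t)p->num_slots * sizeof(slot_t), &s, sizeof s);
    return p->num_slots++;
}

/* Resolves a slot number to the tuple bytes; NULL if the slot is unused. */
const void *page_get(const page_t *p, uint16_t slot_no, uint16_t *len) {
    if (slot_no >= p->num_slots) return NULL;
    slot_t s;
    memcpy(&s, p->data + (size_t)slot_no * sizeof(slot_t), sizeof s);
    *len = s.length;
    return p->data + s.offset;
}
```

Because the record id names a slot rather than a byte offset, the page can move tuple bytes around internally and only the slot array has to change.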
SELECT SUM(colA), AVG(colC)
FROM xxx
WHERE colA > 1000
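A minimal sketch of how this query runs over an NSM layout, assuming three 32-bit integer attributes per tuple. The names `row_t`, `agg_t`, and `nsm_scan` are hypothetical. The point is that every tuple, including the unused colB, is pulled through the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One NSM tuple: all three attributes are stored next to each other. */
typedef struct { int32_t a; int32_t b; int32_t c; } row_t;

typedef struct { int64_t sum_a; double avg_c; size_t matched; } agg_t;

/* SELECT SUM(colA), AVG(colC) FROM xxx WHERE colA > 1000, one tuple at a
 * time. colB is never used, yet it is read along with colA and colC.
 * Simplification: avg_c is 0.0 when no row matches (SQL would return NULL). */
agg_t nsm_scan(const row_t *rows, size_t n) {
    assert(rows != NULL || n == 0);
    agg_t out = { 0, 0.0, 0 };
    int64_t sum_c = 0;
    for (size_t i = 0; i < n; i++) {
        if (rows[i].a > 1000) {
            out.sum_a += rows[i].a;
            sum_c += rows[i].c;
            out.matched++;
        }
    }
    if (out.matched > 0) out.avg_c = (double)sum_c / (double)out.matched;
    return out;
}
```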
N-ARY STORAGE MODEL (NSM)
Advantages
→ Fast inserts, updates, and deletes.
→ Good for queries that need the entire tuple (OLTP).
→ Can use index-oriented physical storage for clustering.
Disadvantages
→ Not good for scanning large portions of the table and/or a
subset of the attributes.
→ Terrible memory locality in access patterns.
→ Not ideal for compression because of multiple value
domains within a single page.
DECOMPOSITION STORAGE MODEL (DSM)
The DBMS stores a single attribute for all tuples
contiguously in a block of data.
Ideal for OLAP workloads where read-only queries
perform large scans over a subset of the table’s
attributes.
→ Use a batched vectorized processing model.
File sizes are larger (100s of MBs), but the DBMS may still
organize tuples within a file into smaller groups.
DSM: PHYSICAL ORGANIZATION
Store attributes and meta-data (e.g., nulls) in separate arrays of
fixed-length values.
→ Most systems identify unique physical tuples using offsets into
these arrays.
→ Need to handle variable-length values…
Maintain a separate file per attribute with a dedicated header area
for meta-data about the entire column.
[Figure: a table with columns A, B, C and rows #0 to #5 is split into one file per column (File #1: a0 to a5, File #2: b0 to b5, File #3: c0 to c5); each file starts with a header and a null bitmap.]
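A minimal sketch of the columnar layout, assuming 32-bit integer attributes and a null bitmap in which a set bit means NULL. The bit order and the names `column_t`, `col_is_null`, and `dsm_sum_a_avg_c` are hypothetical. It evaluates SUM(colA), AVG(colC) WHERE colA > 1000 while touching only columns A and C, and the array offset serves as the tuple id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One DSM column: a dense array of fixed-length values plus a null bitmap.
 * The array offset doubles as the tuple id. */
typedef struct {
    const int32_t *values;
    const uint8_t *null_bitmap;   /* bit i set => value i is NULL (assumed) */
    size_t         num_values;
} column_t;

int col_is_null(const column_t *c, size_t i) {
    return (c->null_bitmap[i / 8] >> (i % 8)) & 1;
}

/* Returns SUM(colA) and writes AVG(colC) for tuples with colA > 1000.
 * Column B is never read. avg_c is 0.0 if no non-NULL colC value matches. */
int64_t dsm_sum_a_avg_c(const column_t *a, const column_t *c, double *avg_c) {
    assert(a->num_values == c->num_values);
    int64_t sum_a = 0, sum_c = 0;
    size_t cnt_c = 0;
    for (size_t i = 0; i < a->num_values; i++) {
        if (col_is_null(a, i) || a->values[i] <= 1000) continue;
        sum_a += a->values[i];
        /* Offset i in column C belongs to the same tuple as offset i in A. */
        if (!col_is_null(c, i)) { sum_c += c->values[i]; cnt_c++; }
    }
    *avg_c = cnt_c ? (double)sum_c / (double)cnt_c : 0.0;
    return sum_a;
}
```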
DSM: TUPLE IDENTIFICATION
Choice #1: Fixed-length Offsets
→ Each value is the same length for an attribute.
Choice #2: Embedded Tuple Ids
→ Each value is stored with its tuple id in a column.
[Figure: with fixed-length offsets, columns A, B, C, D share the offsets 0 to 3; with embedded ids, each of the columns A, B, C, D stores the ids 0 to 3 next to its values.]
DSM: VARIABLE-LENGTH DATA
Padding variable-length fields to ensure they are
fixed-length is wasteful, especially for large
attributes.
A better approach is to use dictionary compression to
convert repetitive variable-length data into fixed-
length values (typically 32-bit integers).
→ More on this next week.
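A minimal sketch of dictionary encoding, which turns repetitive strings into fixed-length 32-bit codes. The fixed capacity, the linear search, and the names `dict_t`, `dict_encode`, `dict_decode`, and `dict_free` are assumptions made to keep the example short; a real system would more likely use a hash table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DICT_CAPACITY 1024   /* assumed fixed capacity for this sketch */

typedef struct {
    char    *entries[DICT_CAPACITY];  /* code -> string */
    uint32_t size;
} dict_t;

/* Returns the 32-bit code for s, adding s to the dictionary if it is new. */
uint32_t dict_encode(dict_t *d, const char *s) {
    for (uint32_t i = 0; i < d->size; i++)
        if (strcmp(d->entries[i], s) == 0) return i;
    assert(d->size < DICT_CAPACITY);
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, s, n);
    d->entries[d->size] = copy;
    return d->size++;
}

/* Maps a code back to its string; NULL for an unknown code. */
const char *dict_decode(const dict_t *d, uint32_t code) {
    return code < d->size ? d->entries[code] : NULL;
}

void dict_free(dict_t *d) {
    for (uint32_t i = 0; i < d->size; i++) free(d->entries[i]);
    d->size = 0;
}
```

The column then stores only the codes, so every value has the same width and offset-based tuple identification keeps working.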
DSM: SYSTEM HISTORY
1970s: Cantor DBMS
1980s: DSM Proposal
1990s: SybaseIQ (in-memory only)
2000s: Vertica, Vectorwise, MonetDB
2010s: Everyone
DECOMPOSITION STORAGE MODEL (DSM)
Advantages
→ Reduces the amount of wasted I/O per query because the
DBMS only reads the data that it needs.
→ Faster query processing because of increased locality and
cached data reuse.
→ Better data compression (more on this later)
Disadvantages
→ Slow for point queries, inserts, updates, and deletes
because of tuple splitting/stitching/reorganization.
OBSERVATION
OLAP queries almost never access a single column
in a table by itself.
→ At some point during query execution, the DBMS must get
other columns and stitch the original tuple back together.
But we still need to store data in a columnar format
to get the storage + execution benefits.
We need a columnar scheme that still stores
attributes separately but keeps the data for each
tuple physically close together…
PAX STORAGE MODEL
Partition Attributes Across (PAX) is a hybrid
storage model that vertically partitions attributes
within a database page.
→ This is what Parquet and Orc use.
The goal is to get the benefit of faster processing on
columnar storage while retaining the spatial locality
benefits of row storage.
DATA PAGE LAYOUTS FOR RELATIONAL DATABASES ON DEEP MEMORY HIERARCHIES (VLDB JOURNAL 2002)
PAX: PHYSICAL ORGANIZATION
Horizontally partition rows into groups. Then vertically partition
their attributes into columns.
The global header contains a directory with the offsets to the
file's row groups.
→ This is stored in the footer if the file is immutable (Parquet, Orc).
Each row group contains its own meta-data header about its contents.
[Figure: a table with columns A, B, C and rows #0 to #5 is stored in a PAX file with a file header and two row groups; each row group has its own header, the first holding a0 a1 a2, b0 b1 b2, c0 c1 c2 and the second a3 a4 a5, b3 b4 b5, c3 c4 c5.]
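The PAX placement rule can be sketched as index arithmetic. This assumes fixed-width 32-bit values, a row count that is a multiple of the row-group size, and no headers; the names `pax_index` and `nsm_to_pax` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index (in values, not bytes) of attribute `col` of tuple `row` in a PAX
 * buffer with `ncols` fixed-width columns and `group_rows` rows per group. */
size_t pax_index(size_t row, size_t col, size_t ncols, size_t group_rows) {
    assert(col < ncols && group_rows > 0);
    size_t group      = row / group_rows;            /* horizontal partition */
    size_t in_group   = row % group_rows;            /* position in the group */
    size_t group_base = group * group_rows * ncols;  /* start of the row group */
    /* Inside a row group, each column's values are stored contiguously. */
    return group_base + col * group_rows + in_group;
}

/* Rewrites a row-major (NSM-like) array into the PAX layout. */
void nsm_to_pax(const int32_t *rows, int32_t *pax,
                size_t nrows, size_t ncols, size_t group_rows) {
    assert(nrows % group_rows == 0);
    for (size_t r = 0; r < nrows; r++)
        for (size_t c = 0; c < ncols; c++)
            pax[pax_index(r, c, ncols, group_rows)] = rows[r * ncols + c];
}
```

With six rows, three columns, and three rows per group, this reproduces the order a0 a1 a2, b0 b1 b2, c0 c1 c2, followed by a3 a4 a5, b3 b4 b5, c3 c4 c5.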
MEMORY PAGES
An OLAP DBMS uses the buffer pool manager
methods that we discussed in the intro course.
OS maps physical pages to virtual memory pages.
The CPU's MMU maintains a TLB that contains the
physical address of a virtual memory page.
→ The TLB resides in the CPU caches.
→ It obviously cannot store every possible entry for a large
memory machine.
When you allocate a block of memory, the allocator
keeps it aligned to page boundaries.
TRANSPARENT HUGE PAGES (THP)
Instead of always allocating memory in 4 KB pages,
Linux supports creating larger pages (2 MB to 1 GB).
→ Each page must be a contiguous block of memory.
→ Greatly reduces the # of TLB entries.
With THP, the OS reorganizes pages in the
background to keep things compact.
→ Split larger pages into smaller pages.
→ Combine smaller pages into larger pages.
→ Can cause the DBMS process to stall on memory access.
Source: Alexandr Nikitin
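A minimal sketch of how a process might request huge-page-friendly memory, assuming 2 MB huge pages. It uses C11 `aligned_alloc` and, on Linux builds where `MADV_HUGEPAGE` is defined, the `madvise` hint; the kernel is free to ignore the hint. The name `alloc_huge_aligned` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef __linux__
#include <sys/mman.h>
#endif

#define HUGE_PAGE_SIZE ((size_t)2 * 1024 * 1024)  /* assumed 2 MB huge pages */

/* Allocates a buffer aligned to a 2 MB boundary and, where available, asks
 * the kernel to back it with transparent huge pages (best effort only).
 * Writes the rounded-up size to *actual. Release the buffer with free(). */
void *alloc_huge_aligned(size_t bytes, size_t *actual) {
    assert(bytes > 0 && actual != NULL);
    /* Round up so the size is a multiple of the alignment. */
    size_t rounded = (bytes + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
    void *p = aligned_alloc(HUGE_PAGE_SIZE, rounded);
    if (p == NULL) return NULL;
#if defined(__linux__) && defined(MADV_HUGEPAGE)
    (void)madvise(p, rounded, MADV_HUGEPAGE);  /* a hint; failure is ignored */
#endif
    *actual = rounded;
    return p;
}
```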
TRANSPARENT HUGE PAGES (THP)
Historically, every DBMS advises you to disable
THP on Linux:
→ Oracle, SingleStore, NuoDB, MongoDB, Sybase, TiDB.
→ Vertica says to enable THP only for newer Linux distros.
Recent research from Google suggests that huge
pages improved their data center workload by 7%.
→ 6.5% improvement in Spanner's throughput
Source: Evan Jones
DATA REPRESENTATION
INTEGER/BIGINT/SMALLINT/TINYINT
→ C/C++ Representation
FLOAT/REAL vs. NUMERIC/DECIMAL
→ IEEE-754 Standard / Fixed-point Decimals
TIME/DATE/TIMESTAMP
→ 32/64-bit int of (micro/milli)seconds since Unix epoch
VARCHAR/VARBINARY/TEXT/BLOB
→ Pointer to another location if the value is ≥ 64 bits
→ Header with length and address to next location (if
segmented), followed by data bytes.
→ Most DBMSs use dictionary compression for these.
VARIABLE PRECISION NUMBERS
Inexact, variable-precision numeric type that uses
the "native" C/C++ types.
Store directly as specified by IEEE-754.
→ Example: FLOAT, REAL/DOUBLE
These types are typically faster than fixed precision
numbers because CPU ISAs (x86, Arm) have
instructions / registers to support them.
But they do not guarantee exact values…
VARIABLE PRECISION NUMBERS
Rounding Example
#include <stdio.h>
int main(int argc, char* argv[]) {
    float x = 0.1;
    float y = 0.2;
    /* The default %f precision (6 digits) hides the rounding error. */
    printf("x+y = %f\n", x+y);
    printf("0.3 = %f\n", 0.3);
}
Output
x+y = 0.300000
0.3 = 0.300000
#include <stdio.h>
int main(int argc, char* argv[]) {
    float x = 0.1;
    float y = 0.2;
    /* Printing 20 digits exposes the inexact binary representation. */
    printf("x+y = %.20f\n", x+y);
    printf("0.3 = %.20f\n", 0.3);
}
Output
x+y = 0.30000001192092895508
0.3 = 0.29999999999999998890
FIXED PRECISION NUMBERS
Numeric data types with (potentially) arbitrary
precision and scale. Used when rounding errors are
unacceptable.
→ Example: NUMERIC, DECIMAL
Many different implementations.
→ Example: Store in an exact, variable-length binary
representation with additional meta-data.
→ Can be less expensive if the DBMS does not provide
arbitrary precision (i.e., the decimal point cannot be in a
different position per value).
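A minimal sketch of the cheaper variant, where the decimal point sits at the same position for every value. It assumes a scale of four digits after the point stored in a 64-bit integer and does no overflow checking. The names `dec4_t`, `dec4_make`, `dec4_add`, and `dec4_equal` are hypothetical and are not how any particular DBMS implements NUMERIC.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point decimal with an assumed scale of 4 digits after the point:
 * the value 0.1 is stored as the integer 1000. */
#define DEC_SCALE 10000

typedef struct { int64_t units; } dec4_t;

/* Builds a decimal from a whole part and a fraction given in 1/10000ths.
 * Limitation of this sketch: values in (-1, 0) cannot be expressed here. */
dec4_t dec4_make(int64_t whole, int64_t frac_10000ths) {
    assert(frac_10000ths >= 0 && frac_10000ths < DEC_SCALE);
    dec4_t d = { whole * DEC_SCALE + (whole < 0 ? -frac_10000ths : frac_10000ths) };
    return d;
}

/* Addition is plain integer addition, so it introduces no rounding error. */
dec4_t dec4_add(dec4_t x, dec4_t y) {
    dec4_t r = { x.units + y.units };
    return r;
}

int dec4_equal(dec4_t x, dec4_t y) { return x.units == y.units; }
```

With this representation 0.1 + 0.2 compares equal to 0.3, unlike the float version in the rounding example.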
POSTGRES: NUMERIC
typedef unsigned char NumericDigit;
typedef struct {
    int ndigits;           /* # of digits */
    int weight;            /* weight of 1st digit */
    int scale;             /* scale factor */
    int sign;              /* positive/negative/NaN */
    NumericDigit *digits;  /* digit storage */
} numeric;
MYSQL: NUMERIC
typedef int32 decimal_digit_t;
struct decimal_t {
    int intg;              /* # of digits before point */
    int frac;              /* # of digits after point */
    int len;               /* length (bytes) */
    bool sign;             /* positive/negative */
    decimal_digit_t *buf;  /* digit storage */
};
NULL DATA TYPES
Choice #1: Special Values
→ Designate a value to represent NULL for a data type (e.g.,
INT32_MIN).
Choice #2: Null Column Bitmap Header
→ Store a bitmap in a centralized header that specifies what
attributes are null.
Choice #3: Per Attribute Null Flag
→ Store a flag that marks that a value is null.
→ Must use more space than just a single bit because a
lone bit would break word alignment.
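The three choices can be sketched side by side for a 32-bit integer attribute. The bit order of the bitmap and all names (`INT32_NULL`, `special_is_null`, `bitmap_set_null`, `bitmap_is_null`, `flagged_int32_t`) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Choice #1: reserve one value of the domain to mean NULL. */
#define INT32_NULL INT32_MIN
int special_is_null(int32_t v) { return v == INT32_NULL; }

/* Choice #2: a bitmap kept apart from the values; bit i set => value i NULL. */
void bitmap_set_null(uint8_t *bitmap, size_t i) {
    assert(bitmap != NULL);
    bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}
int bitmap_is_null(const uint8_t *bitmap, size_t i) {
    return (bitmap[i / 8] >> (i % 8)) & 1;
}

/* Choice #3: a flag stored with every value. Alignment padding usually makes
 * this struct larger than the 5 bytes its fields need. */
typedef struct { uint8_t is_null; int32_t value; } flagged_int32_t;
```

Comparing `sizeof(flagged_int32_t)` with `sizeof(int32_t)` on a given platform shows how much space the per-attribute flag costs there.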
OBSERVATION
Data is "hot" when it enters the database.
→ A newly inserted tuple is more likely to be updated again
in the near future.
As a tuple ages, it is updated less frequently.
→ At some point, a tuple is only accessed in read-only queries
along with other tuples.
HYBRID STORAGE MODEL
Use separate execution engines that are optimized
for either NSM or DSM databases.
→ Store new data in NSM for fast OLTP.
→ Migrate data to DSM for more efficient OLAP.
→ Combine query results from both engines to appear as a
single logical database to the application.
Choice #1: Fractured Mirrors
→ Examples: Oracle, IBM DB2 Blu, Microsoft SQL Server
Choice #2: Delta Store
→ Examples: SAP HANA, Vertica, SingleStore, Databricks,
Google Napa
FRACTURED MIRRORS
Store a second copy of the database in a DSM layout
that is automatically updated.
→ All updates are first entered in NSM then eventually copied
into DSM mirror.
→ If the DBMS supports updates, it must invalidate tuples in
the DSM mirror.
A CASE FOR FRACTURED MIRRORS (VLDB 2002)
[Figure: transactions run against the NSM primary and analytical queries run against the DSM mirror.]
DELTA STORE
Stage updates to the database in an NSM table.
A background thread migrates updates from the delta
store and applies them to the DSM data.
→ Batch large chunks and then write them out as a PAX file.
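A minimal single-threaded sketch of the delta store idea: inserts land in a small row-oriented buffer, and a migration step moves them in a batch into column arrays. Here the migration runs inline when the buffer is full instead of in a background thread. The capacities and the names `table_t`, `table_insert`, `table_migrate`, and `table_sum_a` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELTA_CAPACITY 4     /* assumed tiny batch size for illustration */
#define MAX_ROWS       1024  /* assumed capacity of the columnar store */

typedef struct { int32_t a; int32_t b; } row_t;

typedef struct {
    row_t   delta[DELTA_CAPACITY];  /* NSM staging area for new tuples */
    size_t  delta_count;
    int32_t col_a[MAX_ROWS];        /* DSM historical data, one array per column */
    int32_t col_b[MAX_ROWS];
    size_t  col_count;
} table_t;

/* Moves every staged row into the column arrays in one batch. */
void table_migrate(table_t *t) {
    assert(t->col_count + t->delta_count <= MAX_ROWS);
    for (size_t i = 0; i < t->delta_count; i++) {
        t->col_a[t->col_count] = t->delta[i].a;
        t->col_b[t->col_count] = t->delta[i].b;
        t->col_count++;
    }
    t->delta_count = 0;
}

/* Inserts go to the delta store; a full delta store triggers a migration. */
void table_insert(table_t *t, row_t r) {
    if (t->delta_count == DELTA_CAPACITY) table_migrate(t);
    t->delta[t->delta_count++] = r;
}

/* A scan combines both stores so they appear as one logical table. */
int64_t table_sum_a(const table_t *t) {
    int64_t sum = 0;
    for (size_t i = 0; i < t->col_count; i++) sum += t->col_a[i];
    for (size_t i = 0; i < t->delta_count; i++) sum += t->delta[i].a;
    return sum;
}
```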
[Figure: transactions write to the NSM delta store; historical data is kept in DSM.]
DATABASE PARTITIONING
Split database across multiple resources:
→ Disks, nodes, processors.
→ Often called "sharding" in NoSQL systems.
The DBMS executes query fragments on each
partition and then combines the results to produce a
single answer.
The DBMS can partition a database physically
(shared nothing) or logically (shared disk).
HORIZONTAL PARTITIONING
Split a table's tuples into disjoint subsets based on
some partitioning key and scheme.
→ Choose column(s) that divide the database equally in
terms of size, load, or usage.
Partitioning Schemes:
→ Hashing
→ Ranges
→ Predicates
HORIZONTAL PARTITIONING
Ideal Query:
SELECT * FROM table
WHERE partitionKey = ?
Table1 (the second column is the partitioning key), split across partitions P1 to P4:
101 a XXX 2022-11-29 → hash(a)%4 = P2
102 b XXY 2022-11-28 → hash(b)%4 = P4
103 c XYZ 2022-11-29 → hash(c)%4 = P3
104 d XYX 2022-11-27 → hash(d)%4 = P2
105 e XYY 2022-11-29 → hash(e)%4 = P1
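A minimal sketch of two of the partitioning schemes, hashing and ranges. FNV-1a stands in for whatever hash function a DBMS actually uses, so the partition numbers it yields will not match the ones on the slide. Partitions are numbered from 0 here, and the names `fnv1a`, `hash_partition`, and `range_partition` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PARTITIONS 4

/* FNV-1a (64-bit), used only as a stand-in hash function. */
uint64_t fnv1a(const char *key) {
    uint64_t h = 14695981039346656037ULL;
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h ^= *p;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Hash partitioning: the same key always maps to the same partition, so a
 * query with "partitionKey = ?" only has to visit one partition. */
unsigned hash_partition(const char *key) {
    assert(key != NULL);
    return (unsigned)(fnv1a(key) % NUM_PARTITIONS);
}

/* Range partitioning on an integer key with ascending upper bounds; keys
 * above the last bound fall into one extra, final partition. */
unsigned range_partition(int32_t key, const int32_t *upper, size_t nbounds) {
    size_t p = 0;
    while (p < nbounds && key > upper[p]) p++;
    return (unsigned)p;
}
```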
LOGICAL PARTITIONING
[Figure: an application server sends requests (Get Id=1, Get Id=3, Get Id=2) to two nodes that sit in front of shared storage holding Id=1 through Id=4.]
PHYSICAL PARTITIONING
[Figure: an application server sends requests (Get Id=1, Get Id=3) to two nodes; the tuples Id=1 through Id=4 are divided between the nodes.]
PARTING THOUGHTS
Every modern OLAP system is using some variant
of PAX storage. The key idea is that all data must be
fixed-length.
Real-world tables contain mostly numeric attributes
(int/float), but their occupied storage consists mostly
of string data.
Modern columnar systems are so fast that most
people do not denormalize data warehouse schemas.
NEXT CLASS
How to accelerate OLAP queries on columnar data
with auxiliary data structures.
→ Zone Maps
→ Bitmap Indexes
→ Sketches
We will also discuss Project #1.
Storage Systems and Data Layout CMU Advanced Databases

  • 1. ADVANCED DATABASE SYSTEMS Andy Pavlo // 15-721 // Spring 2023 Storage Models & Data Layout Lecture #03
  • 2. 15-721 (Spring 2023) OBSERVATION Today's lecture is about the lowest physical representation of data in a database. What data "looks" like determines almost a DBMS's entire system architecture. → Processing Model → Tuple Materialization Strategy → Operator Algorithms → Data Ingestion / Updates → Concurrency Control (we will ignore this) → Query Optimization 2
  • 3. 15-721 (Spring 2023) TODAY’S AGENDA Storage Models Type Representation Partitioning 3
  • 4. 15-721 (Spring 2023) STORAGE MODELS A DBMS's storage model specifies how it physically organizes tuples on disk and in memory. Choice #1: N-ary Storage Model (NSM) Choice #2: Decomposition Storage Model (DSM) Choice #3: Hybrid Storage Model (PAX) 4 COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? SIGMOD 2008
  • 5. 15-721 (Spring 2023) N-ARY STORAGE MODEL (NSM) The DBMS stores (almost) all the attributes for a single tuple contiguously in a single page. Ideal for OLTP workloads where txns tend to access individual entities and insert-heavy workloads. → Use the tuple-at-a-time iterator processing model. NSM database page sizes are typically some constant multiple of 4 KB hardware pages. → Example: Oracle (4 KB), Postgres (8 KB), MySQL (16 KB) 5
  • 6. 15-721 (Spring 2023) Database Page NSM: PHYSICAL ORGANIZATION A disk-oriented NSM system stores a tuple's fixed-length and variable- length attributes contiguously in a single slotted page. The tuple's record id (page#, slot#) is how the DBMS uniquely identifies a physical tuple. 6 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header b0 a0 c0 header Slot Array
  • 7. 15-721 (Spring 2023) Database Page NSM: PHYSICAL ORGANIZATION A disk-oriented NSM system stores a tuple's fixed-length and variable- length attributes contiguously in a single slotted page. The tuple's record id (page#, slot#) is how the DBMS uniquely identifies a physical tuple. 6 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header b0 a0 c0 header b1 c1 a1 header Slot Array
  • 8. 15-721 (Spring 2023) Database Page NSM: PHYSICAL ORGANIZATION A disk-oriented NSM system stores a tuple's fixed-length and variable- length attributes contiguously in a single slotted page. The tuple's record id (page#, slot#) is how the DBMS uniquely identifies a physical tuple. 6 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header b0 a0 c0 header b1 c1 a1 header Slot Array b2 a2 c2 header b3 a3 c3 header b4 c4 a4 header b5 a5 c5 header
  • 9. 15-721 (Spring 2023) Database Page NSM: PHYSICAL ORGANIZATION 6 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header b0 a0 c0 header b1 c1 a1 header Slot Array b2 a2 c2 header b3 a3 c3 header b4 c4 a4 header b5 a5 c5 header SELECT SUM(colA), AVG(colC) FROM xxx WHERE colA > 1000
  • 10. 15-721 (Spring 2023) Database Page NSM: PHYSICAL ORGANIZATION 6 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header b0 a0 c0 header b1 c1 a1 header Slot Array b2 a2 c2 header b3 a3 c3 header b4 c4 a4 header b5 a5 c5 header SELECT SUM(colA), AVG(colC) FROM xxx WHERE colA > 1000
  • 11. 15-721 (Spring 2023) Database Page NSM: PHYSICAL ORGANIZATION 6 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header b0 a0 c0 header b1 c1 a1 header Slot Array b2 a2 c2 header b3 a3 c3 header b4 c4 a4 header b5 a5 c5 header SELECT SUM(colA), AVG(colC) FROM xxx WHERE colA > 1000
  • 12. 15-721 (Spring 2023) N-ARY STORAGE MODEL (NSM) Advantages → Fast inserts, updates, and deletes. → Good for queries that need the entire tuple (OLTP). → Can use index-oriented physical storage for clustering. Disadvantages → Not good for scanning large portions of the table and/or a subset of the attributes. → Terrible memory locality in access patterns. → Not ideal for compression because of multiple value domains within a single page. 7
  • 13. 15-721 (Spring 2023) DECOMPOSITION STORAGE MODEL (DSM) The DBMS stores a single attribute for all tuples contiguously in a block of data. Ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes. → Use a batched vectorized processing model. File sizes are larger (100s of MBs), but it may still organize tuples within the file into smaller groups. 8
  • 14. 15-721 (Spring 2023) DSM: PHYSICAL ORGANIZATION Store attributes and meta-data (e.g., nulls) in separate arrays of fixed- length values. → Most systems identify unique physical tuples using offsets into these arrays. → Need to handle variable-length values… Maintain a separate file per attribute with a dedicated header area for meta- data about entire column. 9 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header null bitmap a0 a1 a2 a3 a4 a5 File #1
  • 15. 15-721 (Spring 2023) DSM: PHYSICAL ORGANIZATION Store attributes and meta-data (e.g., nulls) in separate arrays of fixed- length values. → Most systems identify unique physical tuples using offsets into these arrays. → Need to handle variable-length values… Maintain a separate file per attribute with a dedicated header area for meta- data about entire column. 9 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header null bitmap a0 a1 a2 a3 a4 a5 File #1 header null bitmap b0 b1 b2 b3 b4 b5 File #2
  • 16. 15-721 (Spring 2023) DSM: PHYSICAL ORGANIZATION Store attributes and meta-data (e.g., nulls) in separate arrays of fixed- length values. → Most systems identify unique physical tuples using offsets into these arrays. → Need to handle variable-length values… Maintain a separate file per attribute with a dedicated header area for meta- data about entire column. 9 b0 b1 b2 b3 b4 b5 a0 a1 a2 a3 a4 a5 c0 c1 c2 c3 c4 c5 Row #0 Row #1 Row #2 Row #3 Row #4 Row #5 Col A Col B Col C header null bitmap a0 a1 a2 a3 a4 a5 File #1 header null bitmap b0 b1 b2 b3 b4 b5 File #2 header null bitmap c5 c0 c1 c2 c3 c4 File #3
  • 17. 15-721 (Spring 2023) DSM: TUPLE IDENTIFICATION Choice #1: Fixed-length Offsets → Each value is the same length for an attribute. Choice #2: Embedded Tuple Ids → Each value is stored with its tuple id in a column. 10 Offsets 0 1 2 3 A B C D Embedded Ids A 0 1 2 3 B 0 1 2 3 C 0 1 2 3 D 0 1 2 3
  • 18. 15-721 (Spring 2023) DSM: VARIABLE-LENGTH DATA Padding variable-length fields to ensure they are fixed-length is wasteful, especially for large attributes. A better approach is to use dictionary compression to convert repetitive variable-length data into fixed- length values (typically 32-bit integers). → More on this next week. 11
  • 19. DSM: SYSTEM HISTORY 1970s: Cantor DBMS; 1980s: DSM Proposal; 1990s: SybaseIQ (in-memory only); 2000s: Vertica, Vectorwise, MonetDB; 2010s: Everyone
  • 20. DECOMPOSITION STORAGE MODEL (DSM) Advantages: → Reduces the amount of wasted I/O per query because the DBMS only reads the data that it needs. → Faster query processing because of increased locality and cached data reuse. → Better data compression (more on this later). Disadvantages: → Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching/reorganization.
  • 21. OBSERVATION OLAP queries almost never access a single column in a table by itself. → At some point during query execution, the DBMS must get other columns and stitch the original tuple back together. But we still need to store data in a columnar format to get the storage + execution benefits. We need a columnar scheme that still stores attributes separately but keeps the data for each tuple physically close to each other…
  • 22. PAX STORAGE MODEL Partition Attributes Across (PAX) is a hybrid storage model that vertically partitions attributes within a database page. → This is what Parquet and ORC use. The goal is to get the benefit of faster processing on columnar storage while retaining the spatial locality benefits of row storage. Reference: Data Page Layouts for Relational Databases on Deep Memory Hierarchies, VLDB Journal 2002.
  • 24. PAX: PHYSICAL ORGANIZATION Horizontally partition rows into groups. Then vertically partition their attributes into columns. Global header contains a directory with the offsets to the file's row groups. → This is stored in the footer if the file is immutable (Parquet, ORC). Each row group contains its own meta-data header about its contents. [Figure: a PAX file with a global header and two row groups; after its own header the first row group stores a0 a1 a2, b0 b1 b2, c0 c1 c2, and the second stores a3 a4 a5, b3 b4 b5, c3 c4 c5.]
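The ordering in the figure can be reproduced with a short C sketch. It assumes a table of 32-bit integers held row-major in memory and leaves out the global header and the per-group headers; `pax_layout` and its parameters are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy a row-major table (num_rows x num_cols) into PAX order: rows are cut
 * into groups of `group_size`, and within each group all values of column 0
 * come first, then column 1, and so on. Headers are omitted in this sketch.
 */
void pax_layout(const int32_t *rows, size_t num_rows, size_t num_cols,
                size_t group_size, int32_t *out) {
    assert(group_size > 0);
    size_t pos = 0;
    for (size_t start = 0; start < num_rows; start += group_size) {
        size_t end = start + group_size < num_rows ? start + group_size : num_rows;
        for (size_t c = 0; c < num_cols; c++) {        /* one column at a time */
            for (size_t r = start; r < end; r++) {     /* rows of this group */
                out[pos++] = rows[r * num_cols + c];
            }
        }
    }
}
```

With six rows, three columns, and a group size of three, the output follows the same a0 a1 a2, b0 b1 b2, c0 c1 c2 pattern per row group as the slide.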
  • 26. MEMORY PAGES An OLAP DBMS uses the buffer pool manager methods that we discussed in the intro course. The OS maps physical pages to virtual memory pages. The CPU's MMU maintains a TLB that contains the physical address of a virtual memory page. → The TLB resides in the CPU caches. → It obviously cannot store every possible entry for a large memory machine. When you allocate a block of memory, the allocator keeps it aligned to page boundaries.
  • 27. TRANSPARENT HUGE PAGES (THP) Instead of always allocating memory in 4 KB pages, Linux supports creating larger pages (2 MB to 1 GB). → Each page must be a contiguous block of memory. → Greatly reduces the # of TLB entries. With THP, the OS reorganizes pages in the background to keep things compact. → Split larger pages into smaller pages. → Combine smaller pages into larger pages. → Can cause the DBMS process to stall on memory access. Source: Alexandr Nikitin
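To make the reduction in TLB entries concrete, the small C helper below counts how many pages are needed to cover a region of memory at a given page size. It is plain arithmetic and not an OS interface; `pages_needed` is a name made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to cover `bytes`; each page needs its own translation. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size) {
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;  /* round up to whole pages */
}
```

For a 1 GB region, 4 KB pages require 262,144 translations, 2 MB pages require 512, and a single 1 GB page requires one.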
  • 28. TRANSPARENT HUGE PAGES (THP) Historically, every DBMS advises you to disable THP on Linux: → Oracle, SingleStore, NuoDB, MongoDB, Sybase, TiDB. → Vertica says to enable THP only for newer Linux distros. Recent research from Google suggests that huge pages improved their data center workload by 7%. → 6.5% improvement in Spanner's throughput. Source: Evan Jones
  • 30. DATA REPRESENTATION INTEGER/BIGINT/SMALLINT/TINYINT → C/C++ Representation. FLOAT/REAL vs. NUMERIC/DECIMAL → IEEE-754 Standard / Fixed-point Decimals. TIME/DATE/TIMESTAMP → 32/64-bit int of (micro/milli)seconds since Unix epoch. VARCHAR/VARBINARY/TEXT/BLOB → Pointer to another location if the type is ≥ 64 bits. → Header with length and address to next location (if segmented), followed by data bytes. → Most DBMSs use dictionary compression for these.
  • 31. VARIABLE PRECISION NUMBERS Inexact, variable-precision numeric type that uses the "native" C/C++ types. Store directly as specified by IEEE-754. → Example: FLOAT, REAL/DOUBLE. These types are typically faster than fixed precision numbers because CPU ISAs (Xeon, Arm) have instructions / registers to support them. But they do not guarantee exact values…
  • 33. VARIABLE PRECISION NUMBERS
      Rounding example (default precision):
          #include <stdio.h>
          int main(int argc, char* argv[]) {
              float x = 0.1;
              float y = 0.2;
              printf("x+y = %f\n", x+y);
              printf("0.3 = %f\n", 0.3);
          }
      Output:
          x+y = 0.300000
          0.3 = 0.300000
      Same program printing 20 decimal places:
          #include <stdio.h>
          int main(int argc, char* argv[]) {
              float x = 0.1;
              float y = 0.2;
              printf("x+y = %.20f\n", x+y);
              printf("0.3 = %.20f\n", 0.3);
          }
      Output:
          x+y = 0.30000001192092895508
          0.3 = 0.29999999999999998890
  • 34. FIXED PRECISION NUMBERS Numeric data types with (potentially) arbitrary precision and scale. Used when rounding errors are unacceptable. → Example: NUMERIC, DECIMAL. Many different implementations. → Example: Store in an exact, variable-length binary representation with additional meta-data. → Can be less expensive if the DBMS does not provide arbitrary precision (e.g., decimal point can be in a different position per value).
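One of the cheaper implementations hinted at above is a scaled integer with a fixed decimal point. The C sketch below is a hypothetical `decimal2` type with exactly two fractional digits, limited to non-negative values for brevity; it is not how Postgres or MySQL store NUMERIC.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-point type: the stored integer is the value times 100. */
typedef struct { int64_t units; } decimal2;

/* Build a value from its whole part and its two fractional digits. */
decimal2 decimal2_from_parts(int64_t whole, int64_t hundredths) {
    assert(whole >= 0 && hundredths >= 0 && hundredths < 100);
    decimal2 d = { whole * 100 + hundredths };
    return d;
}

/* Addition is plain integer addition, so it is exact unless int64 overflows. */
decimal2 decimal2_add(decimal2 a, decimal2 b) {
    decimal2 sum = { a.units + b.units };
    return sum;
}

/* Write the value as text, e.g. units == 30 becomes "0.30". */
int decimal2_format(decimal2 d, char *buf, size_t buf_len) {
    return snprintf(buf, buf_len, "%" PRId64 ".%02" PRId64,
                    d.units / 100, d.units % 100);
}
```

Adding 0.10 and 0.20 in this representation yields exactly 0.30, because the sum is computed on the integers 10 and 20 rather than on binary fractions.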
  • 36. POSTGRES: NUMERIC
          typedef unsigned char NumericDigit;
          typedef struct {
              int ndigits;           /* # of digits */
              int weight;            /* weight of 1st digit */
              int scale;             /* scale factor */
              int sign;              /* positive/negative/NaN */
              NumericDigit *digits;  /* digit storage */
          } numeric;
  • 38. MYSQL: NUMERIC
          typedef int32 decimal_digit_t;
          struct decimal_t {
              int intg;              /* # of digits before point */
              int frac;              /* # of digits after point */
              int len;               /* length (bytes) */
              bool sign;             /* positive/negative */
              decimal_digit_t *buf;  /* digit storage */
          };
  • 40. NULL DATA TYPES Choice #1: Special Values → Designate a value to represent NULL for a data type (e.g., INT32_MIN). Choice #2: Null Column Bitmap Header → Store a bitmap in a centralized header that specifies what attributes are null. Choice #3: Per Attribute Null Flag → Store a flag that marks that a value is null. → Must use more space than just a single bit because a lone bit would break word alignment.
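Choice #1 can be sketched in C as below. The `INT32_NULL` macro and the `sum_skip_nulls` function are names invented for the example; the point is that operators must check for the reserved value, and that INT32_MIN can no longer be stored as ordinary data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT32_NULL INT32_MIN  /* sentinel reserved to mean NULL (Choice #1) */

/* Sum a column while skipping NULLs, as a SQL SUM() aggregate would. */
int64_t sum_skip_nulls(const int32_t *values, size_t n) {
    assert(values != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (values[i] != INT32_NULL) sum += values[i];
    }
    return sum;
}
```

The sentinel approach needs no extra storage, which is its main appeal over a bitmap or a per-attribute flag.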
  • 42. OBSERVATION Data is "hot" when it enters the database. → A newly inserted tuple is more likely to be updated again in the near future. As a tuple ages, it is updated less frequently. → At some point, a tuple is only accessed in read-only queries along with other tuples.
  • 43. HYBRID STORAGE MODEL Use separate execution engines that are optimized for either NSM or DSM databases. → Store new data in NSM for fast OLTP. → Migrate data to DSM for more efficient OLAP. → Combine query results from both engines to appear as a single logical database to the application. Choice #1: Fractured Mirrors → Examples: Oracle, IBM DB2 BLU, Microsoft SQL Server. Choice #2: Delta Store → Examples: SAP HANA, Vertica, SingleStore, Databricks, Google Napa.
  • 44. FRACTURED MIRRORS Store a second copy of the database in a DSM layout that is automatically updated. → All updates are first entered in NSM and then eventually copied into the DSM mirror. → If the DBMS supports updates, it must invalidate tuples in the DSM mirror. [Figure: transactions go to the NSM primary copy; analytical queries go to the DSM mirror.] Reference: A Case for Fractured Mirrors, VLDB 2002.
  • 45. DELTA STORE Stage updates to the database in an NSM table. A background thread migrates updates from the delta store and applies them to DSM data. → Batch large chunks and then write them out as a PAX file. [Figure: transactions write to the NSM delta store, which feeds the DSM historical data.]
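A toy version of the delta store idea is sketched below in C. Everything here is an assumption for illustration: the two-attribute `row`, the tiny capacities, and above all the synchronous migration, which stands in for the background thread described on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELTA_CAPACITY 4   /* assumed batch size for the sketch */
#define MAX_ROWS 64        /* assumed capacity of the columnar store */

typedef struct { int32_t id; int32_t amount; } row;  /* NSM tuple */

typedef struct {
    row delta[DELTA_CAPACITY];  /* row-oriented staging area */
    size_t delta_count;
    int32_t ids[MAX_ROWS];      /* column-oriented historical data */
    int32_t amounts[MAX_ROWS];
    size_t column_count;
} table;

/* Move every staged row into the column arrays, then empty the delta store. */
void table_migrate(table *t) {
    assert(t->column_count + t->delta_count <= MAX_ROWS);
    for (size_t i = 0; i < t->delta_count; i++) {
        t->ids[t->column_count] = t->delta[i].id;
        t->amounts[t->column_count] = t->delta[i].amount;
        t->column_count++;
    }
    t->delta_count = 0;
}

/* Insert into the delta store, migrating first if it is full. */
void table_insert(table *t, row r) {
    if (t->delta_count == DELTA_CAPACITY) table_migrate(t);
    t->delta[t->delta_count++] = r;
}

/* A scan must combine both stores to see every tuple. */
int64_t table_sum_amount(const table *t) {
    int64_t sum = 0;
    for (size_t i = 0; i < t->column_count; i++) sum += t->amounts[i];
    for (size_t i = 0; i < t->delta_count; i++) sum += t->delta[i].amount;
    return sum;
}
```

Inserts stay cheap because they append whole rows, while the aggregate reads mostly from the contiguous `amounts` array and only touches the small delta buffer for recent tuples.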
  • 46. DATABASE PARTITIONING Split database across multiple resources: → Disks, nodes, processors. → Often called "sharding" in NoSQL systems. The DBMS executes query fragments on each partition and then combines the results to produce a single answer. The DBMS can partition a database physically (shared nothing) or logically (shared disk).
  • 47. HORIZONTAL PARTITIONING Split a table's tuples into disjoint subsets based on some partitioning key and scheme. → Choose column(s) that divide the database equally in terms of size, load, or usage. Partitioning Schemes: → Hashing → Ranges → Predicates
  • 50. HORIZONTAL PARTITIONING
      Ideal query: SELECT * FROM table WHERE partitionKey = ?
      Table1 (the second column is the partitioning key):
          101  a  XXX  2022-11-29    hash(a)%4 = P2
          102  b  XXY  2022-11-28    hash(b)%4 = P4
          103  c  XYZ  2022-11-29    hash(c)%4 = P3
          104  d  XYX  2022-11-27    hash(d)%4 = P2
          105  e  XYY  2022-11-29    hash(e)%4 = P1
      Partitions: P1, P2, P3, P4
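Hash partitioning as shown on this slide can be sketched in C as follows. The slide does not say which hash function is used, so FNV-1a is swapped in here purely as a simple stand-in; `partition_for` is a hypothetical name, and the partition numbers it produces are 0-based and will not match the P1 to P4 labels in the example table.

```c
#include <assert.h>
#include <stdint.h>

/* FNV-1a, 32-bit: a simple stand-in for whatever hash the DBMS really uses. */
uint32_t fnv1a(const char *key) {
    uint32_t h = 2166136261u;                  /* FNV offset basis */
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h ^= *p;
        h *= 16777619u;                        /* FNV prime */
    }
    return h;
}

/* Map a partitioning key to one of `num_partitions` partitions (0-based). */
uint32_t partition_for(const char *key, uint32_t num_partitions) {
    assert(num_partitions > 0);
    return fnv1a(key) % num_partitions;
}
```

A query with an equality predicate on the partitioning key can call `partition_for` once and visit a single partition, which is why the slide calls it the ideal query.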
  • 52. LOGICAL PARTITIONING [Figure: an application server issues Get Id=1, Get Id=3, and Get Id=2 requests to two nodes that sit in front of a shared storage layer holding tuples Id=1 through Id=4.]
  • 55. PHYSICAL PARTITIONING [Figure: an application server issues Get Id=1 and Get Id=3 requests to two nodes, which between them hold tuples Id=1 through Id=4.]
  • 57. PARTING THOUGHTS Every modern OLAP system is using some variant of PAX storage. The key idea is that all data must be fixed-length. Real-world tables contain mostly numeric attributes (int/float), but their occupied storage is mostly composed of string data. Modern columnar systems are so fast that most people do not denormalize data warehouse schemas.
  • 58. NEXT CLASS How to accelerate OLAP queries on columnar data with auxiliary data structures. → Zone Maps → Bitmap Indexes → Sketches. We will also discuss Project #1.