2. 15-721 (Spring 2023)
OBSERVATION
Today's lecture is about the lowest physical
representation of data in a database.
What data "looks" like determines almost a DBMS's
entire system architecture.
→ Processing Model
→ Tuple Materialization Strategy
→ Operator Algorithms
→ Data Ingestion / Updates
→ Concurrency Control (we will ignore this)
→ Query Optimization
2
4. 15-721 (Spring 2023)
STORAGE MODELS
A DBMS's storage model specifies how it
physically organizes tuples on disk and in memory.
Choice #1: N-ary Storage Model (NSM)
Choice #2: Decomposition Storage Model (DSM)
Choice #3: Hybrid Storage Model (PAX)
4
COLUMN-STORES VS. ROW-STORES: HOW
DIFFERENT ARE THEY REALLY?
SIGMOD 2008
5. 15-721 (Spring 2023)
N-ARY STORAGE MODEL (NSM)
The DBMS stores (almost) all the attributes for a
single tuple contiguously in a single page.
Ideal for OLTP workloads where txns tend to access
individual entities and insert-heavy workloads.
→ Use the tuple-at-a-time iterator processing model.
NSM database page sizes are typically some constant
multiple of 4 KB hardware pages.
→ Example: Oracle (4 KB), Postgres (8 KB), MySQL (16 KB)
5
6. 15-721 (Spring 2023)
Database
Page
NSM: PHYSICAL ORGANIZATION
A disk-oriented NSM system stores a
tuple's fixed-length and variable-
length attributes contiguously in a
single slotted page.
The tuple's record id (page#, slot#) is
how the DBMS uniquely identifies a
physical tuple.
6
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
b0
a0 c0
header
Slot Array
7. 15-721 (Spring 2023)
Database
Page
NSM: PHYSICAL ORGANIZATION
A disk-oriented NSM system stores a
tuple's fixed-length and variable-
length attributes contiguously in a
single slotted page.
The tuple's record id (page#, slot#) is
how the DBMS uniquely identifies a
physical tuple.
6
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
b0
a0 c0
header
b1 c1
a1
header
Slot Array
8. 15-721 (Spring 2023)
Database
Page
NSM: PHYSICAL ORGANIZATION
A disk-oriented NSM system stores a
tuple's fixed-length and variable-
length attributes contiguously in a
single slotted page.
The tuple's record id (page#, slot#) is
how the DBMS uniquely identifies a
physical tuple.
6
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
b0
a0 c0
header
b1 c1
a1
header
Slot Array
b2
a2 c2
header
b3
a3 c3
header
b4 c4
a4
header
b5
a5 c5
header
9. 15-721 (Spring 2023)
Database
Page
NSM: PHYSICAL ORGANIZATION
6
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
b0
a0 c0
header
b1 c1
a1
header
Slot Array
b2
a2 c2
header
b3
a3 c3
header
b4 c4
a4
header
b5
a5 c5
header
SELECT SUM(colA), AVG(colC)
FROM xxx
WHERE colA > 1000
10. 15-721 (Spring 2023)
Database
Page
NSM: PHYSICAL ORGANIZATION
6
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
b0
a0 c0
header
b1 c1
a1
header
Slot Array
b2
a2 c2
header
b3
a3 c3
header
b4 c4
a4
header
b5
a5 c5
header
SELECT SUM(colA), AVG(colC)
FROM xxx
WHERE colA > 1000
11. 15-721 (Spring 2023)
Database
Page
NSM: PHYSICAL ORGANIZATION
6
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
b0
a0 c0
header
b1 c1
a1
header
Slot Array
b2
a2 c2
header
b3
a3 c3
header
b4 c4
a4
header
b5
a5 c5
header
SELECT SUM(colA), AVG(colC)
FROM xxx
WHERE colA > 1000
12. 15-721 (Spring 2023)
N-ARY STORAGE MODEL (NSM)
Advantages
→ Fast inserts, updates, and deletes.
→ Good for queries that need the entire tuple (OLTP).
→ Can use index-oriented physical storage for clustering.
Disadvantages
→ Not good for scanning large portions of the table and/or a
subset of the attributes.
→ Terrible memory locality in access patterns.
→ Not ideal for compression because of multiple value
domains within a single page.
7
13. 15-721 (Spring 2023)
DECOMPOSITION STORAGE MODEL (DSM)
The DBMS stores a single attribute for all tuples
contiguously in a block of data.
Ideal for OLAP workloads where read-only queries
perform large scans over a subset of the table’s
attributes.
→ Use a batched vectorized processing model.
File sizes are larger (100s of MBs), but it may still
organize tuples within the file into smaller groups.
8
14. 15-721 (Spring 2023)
DSM: PHYSICAL ORGANIZATION
Store attributes and meta-data (e.g.,
nulls) in separate arrays of fixed-
length values.
→ Most systems identify unique physical
tuples using offsets into these arrays.
→ Need to handle variable-length values…
Maintain a separate file per attribute
with a dedicated header area for meta-
data about entire column.
9
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header null bitmap
a0 a1 a2 a3 a4 a5
File
#1
15. 15-721 (Spring 2023)
DSM: PHYSICAL ORGANIZATION
Store attributes and meta-data (e.g.,
nulls) in separate arrays of fixed-
length values.
→ Most systems identify unique physical
tuples using offsets into these arrays.
→ Need to handle variable-length values…
Maintain a separate file per attribute
with a dedicated header area for meta-
data about entire column.
9
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header null bitmap
a0 a1 a2 a3 a4 a5
File
#1
header null bitmap
b0 b1 b2 b3 b4 b5
File
#2
16. 15-721 (Spring 2023)
DSM: PHYSICAL ORGANIZATION
Store attributes and meta-data (e.g.,
nulls) in separate arrays of fixed-
length values.
→ Most systems identify unique physical
tuples using offsets into these arrays.
→ Need to handle variable-length values…
Maintain a separate file per attribute
with a dedicated header area for meta-
data about entire column.
9
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header null bitmap
a0 a1 a2 a3 a4 a5
File
#1
header null bitmap
b0 b1 b2 b3 b4 b5
File
#2
header null bitmap
c5
c0 c1 c2 c3 c4
File
#3
17. 15-721 (Spring 2023)
DSM: TUPLE IDENTIFICATION
Choice #1: Fixed-length Offsets
→ Each value is the same length for an attribute.
Choice #2: Embedded Tuple Ids
→ Each value is stored with its tuple id in a column.
10
Offsets
0
1
2
3
A B C D
Embedded Ids
A
0
1
2
3
B
0
1
2
3
C
0
1
2
3
D
0
1
2
3
18. 15-721 (Spring 2023)
DSM: VARIABLE-LENGTH DATA
Padding variable-length fields to ensure they are
fixed-length is wasteful, especially for large
attributes.
A better approach is to use dictionary compression to
convert repetitive variable-length data into fixed-
length values (typically 32-bit integers).
→ More on this next week.
11
19. 15-721 (Spring 2023)
DSM: SYSTEM HISTORY
1970s: Cantor DBMS
1980s: DSM Proposal
1990s: SybaseIQ (in-memory only)
2000s: Vertica, Vectorwise, MonetDB
2010s: Everyone
12
20. 15-721 (Spring 2023)
DECOMPOSITION STORAGE MODEL (DSM)
Advantages
→ Reduces the amount wasted I/O per query because the
DBMS only reads the data that it needs.
→ Faster query processing because of increased locality and
cached data reuse.
→ Better data compression (more on this later)
Disadvantages
→ Slow for point queries, inserts, updates, and deletes
because of tuple splitting/stitching/reorganization.
13
21. 15-721 (Spring 2023)
OBSERVATION
OLAP queries almost never access a single column
in a table by itself.
→ At some point during query execution, the DBMS must get
other columns and stitch the original tuple back together.
But we still need to store data in a columnar format
to get the storage + execution benefits.
We need columnar scheme that still stores
attributes separately but keeps the data for each
tuple physically close to each other…
14
22. 15-721 (Spring 2023)
PAX STORAGE MODEL
Partition Attributes Across (PAX) is a hybrid
storage model that vertically partitions attributes
within a database page.
→ This is what Paraquet and Orc use.
The goal is to get the benefit of faster processing on
columnar storage while retaining the spatial locality
benefits of row storage.
15
DATA PAGE LAYOUTS FOR RELATIONAL DATABASES
ON DEEP MEMORY HIERARCHIES
VLDB JOURNAL 2002
23. 15-721 (Spring 2023)
PAX: PHYSICAL ORGANIZATION
Horizontally partition rows into
groups. Then vertically partition their
attributes into columns.
Global header contains directory with
the offsets to the file's row groups.
→ This is stored in the footer if the file is
immutable (Parquet, Orc).
Each row group contains its own
meta-data header about its contents.
16
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
PAX
File
a0 a1 a2 b0 b1 b2
c0 c1 c2
header
Row
Group
24. 15-721 (Spring 2023)
PAX: PHYSICAL ORGANIZATION
Horizontally partition rows into
groups. Then vertically partition their
attributes into columns.
Global header contains directory with
the offsets to the file's row groups.
→ This is stored in the footer if the file is
immutable (Parquet, Orc).
Each row group contains its own
meta-data header about its contents.
16
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
PAX
File
a0 a1 a2 b0 b1 b2
c0 c1 c2
header
Row
Group
a3 a4 a5 b3 b4 b5
c3 c4 c5
header
Row
Group
25. 15-721 (Spring 2023)
PAX: PHYSICAL ORGANIZATION
Horizontally partition rows into
groups. Then vertically partition their
attributes into columns.
Global header contains directory with
the offsets to the file's row groups.
→ This is stored in the footer if the file is
immutable (Parquet, Orc).
Each row group contains its own
meta-data header about its contents.
16
b0
b1
b2
b3
b4
b5
a0
a1
a2
a3
a4
a5
c0
c1
c2
c3
c4
c5
Row #0
Row #1
Row #2
Row #3
Row #4
Row #5
Col A Col B Col C
header
PAX
File
a0 a1 a2 b0 b1 b2
c0 c1 c2
header
Row
Group
a3 a4 a5 b3 b4 b5
c3 c4 c5
header
Row
Group
26. 15-721 (Spring 2023)
MEMORY PAGES
An OLAP DBMS uses the buffer pool manager
methods that we discussed in the intro course.
OS maps physical pages to virtual memory pages.
The CPU's MMU maintains a TLB that contains the
physical address of a virtual memory page.
→ The TLB resides in the CPU caches.
→ It cannot obviously store every possible entry for a large
memory machine.
When you allocate a block of memory, the allocator
keeps that it aligned to page boundaries.
18
27. 15-721 (Spring 2023)
TRANSPARENT HUGE PAGES (THP)
Instead of always allocating memory in 4 KB pages,
Linux supports creating larger pages (2MB to 1GB)
→ Each page must be a contiguous blocks of memory.
→ Greatly reduces the # of TLB entries
With THP, the OS reorganizes pages in the
background to keep things compact.
→ Split larger pages into smaller pages.
→ Combine smaller pages into larger pages.
→ Can cause the DBMS process to stall on memory access.
19
Source: Alexandr Nikitin
28. 15-721 (Spring 2023)
TRANSPARENT HUGE PAGES (THP)
Historically, every DBMS advises you to disable this
THP on Linux:
→ Oracle, SingleStore, NuoDB, MongoDB, Sybase, TiDB.
→ Vertica says to enable THP only for newer Linux distros.
Recent research from Google suggests that huge
pages improved their data center workload by 7%.
→ 6.5% improvement in Spanner's throughput
20
Source: Evan Jones
29. 15-721 (Spring 2023)
TRANSPARENT HUGE PAGES (THP)
Historically, every DBMS advises you to disable this
THP on Linux:
→ Oracle, SingleStore, NuoDB, MongoDB, Sybase, TiDB.
→ Vertica says to enable THP only for newer Linux distros.
Recent research from Google suggests that huge
pages improved their data center workload by 7%.
→ 6.5% improvement in Spanner's throughput
20
Source: Evan Jones
30. 15-721 (Spring 2023)
DATA REPRESENTATION
INTEGER/BIGINT/SMALLINT/TINYINT
→ C/C++ Representation
FLOAT/REAL vs. NUMERIC/DECIMAL
→ IEEE-754 Standard / Fixed-point Decimals
TIME/DATE/TIMESTAMP
→ 32/64-bit int of (micro/milli)seconds since Unix epoch
VARCHAR/VARBINARY/TEXT/BLOB
→ Pointer to other location if type is ≥64-bits
→ Header with length and address to next location (if
segmented), followed by data bytes.
→ Most DBMSs use dictionary compression for these.
21
31. 15-721 (Spring 2023)
VARIABLE PRECISION NUMBERS
Inexact, variable-precision numeric type that uses
the "native" C/C++ types.
Store directly as specified by IEEE-754.
→ Example: FLOAT, REAL/DOUBLE
These types are typically faster than fixed precision
numbers because CPU ISA's (Xeon, Arm) have
instructions / registers to support them.
But they do not guarantee exact values…
22
34. 15-721 (Spring 2023)
FIXED PRECISION NUMBERS
Numeric data types with (potentially) arbitrary
precision and scale. Used when rounding errors are
unacceptable.
→ Example: NUMERIC, DECIMAL
Many different implementations.
→ Example: Store in an exact, variable-length binary
representation with additional meta-data.
→ Can be less expensive if the DBMS does not provide
arbitrary precision (e.g., decimal point can be in a different
position per value).
24
35. 15-721 (Spring 2023)
FIXED PRECISION NUMBERS
Numeric data types with (potentially) arbitrary
precision and scale. Used when rounding errors are
unacceptable.
→ Example: NUMERIC, DECIMAL
Many different implementations.
→ Example: Store in an exact, variable-length binary
representation with additional meta-data.
→ Can be less expensive if the DBMS does not provide
arbitrary precision (e.g., decimal point can be in a different
position per value).
24
36. 15-721 (Spring 2023)
POSTGRES: NUMERIC
25
typedef unsigned char NumericDigit;
typedef struct {
int ndigits;
int weight;
int scale;
int sign;
NumericDigit *digits;
} numeric;
# of Digits
Weight of 1st Digit
Scale Factor
Positive/Negative/NaN
Digit Storage
37. 15-721 (Spring 2023)
POSTGRES: NUMERIC
25
typedef unsigned char NumericDigit;
typedef struct {
int ndigits;
int weight;
int scale;
int sign;
NumericDigit *digits;
} numeric;
# of Digits
Weight of 1st Digit
Scale Factor
Positive/Negative/NaN
Digit Storage
38. 15-721 (Spring 2023)
MYSQL: NUMERIC
26
typedef int32 decimal_digit_t;
struct decimal_t {
int intg, frac, len;
bool sign;
decimal_digit_t *buf;
};
# of Digits Before Point
# of Digits After Point
Length (Bytes)
Positive/Negative
Digit Storage
39. 15-721 (Spring 2023)
MYSQL: NUMERIC
26
typedef int32 decimal_digit_t;
struct decimal_t {
int intg, frac, len;
bool sign;
decimal_digit_t *buf;
};
# of Digits Before Point
# of Digits After Point
Length (Bytes)
Positive/Negative
Digit Storage
40. 15-721 (Spring 2023)
NULL DATA TYPES
Choice #1: Special Values
→ Designate a value to represent NULL for a data type (e.g.,
INT32_MIN).
Choice #2: Null Column Bitmap Header
→ Store a bitmap in a centralized header that specifies what
attributes are null.
Choice #3: Per Attribute Null Flag
→ Store a flag that marks that a value is null.
→ Must use more space than just a single bit because this
messes up with word alignment.
27
41. 15-721 (Spring 2023)
NULL DATA TYPES
Choice #1: Special Values
→ Designate a value to represent NULL for a data type (e.g.,
INT32_MIN).
Choice #2: Null Column Bitmap Header
→ Store a bitmap in a centralized header that specifies what
attributes are null.
Choice #3: Per Attribute Null Flag
→ Store a flag that marks that a value is null.
→ Must use more space than just a single bit because this
messes up with word alignment.
27
42. 15-721 (Spring 2023)
OBSERVATION
Data is "hot" when it enters the database
→ A newly inserted tuple is more likely to be updated again
the near future.
As a tuple ages, it is updated less frequently.
→ At some point, a tuple is only accessed in read-only queries
along with other tuples.
28
43. 15-721 (Spring 2023)
HYBRID STORAGE MODEL
Use separate execution engines that are optimized
for either NSM or DSM databases.
→ Store new data in NSM for fast OLTP.
→ Migrate data to DSM for more efficient OLAP.
→ Combine query results from both engines to appear as a
single logical database to the application.
Choice #1: Fractured Mirrors
→ Examples: Oracle, IBM DB2 Blu, Microsoft SQL Server
Choice #2: Delta Store
→ Examples: SAP HANA, Vertica, SingleStore, Databricks,
Google Napa
29
44. 15-721 (Spring 2023)
FRACTURED MIRRORS
Store a second copy of the database in a DSM layout
that is automatically updated.
→ All updates are first entered in NSM then eventually copied
into DSM mirror.
→ If the DBMS supports updates, it must invalidate tuples in
the DSM mirror.
30
A CASE FOR FRACTURED MIRRORS
VLDB 2002
NSM
(Primary)
DSM
(Mirror)
Transactions
Analytical
Queries
45. 15-721 (Spring 2023)
DELTA STORE
Stage updates to the database in an NSM table.
A background thread migrates updates from delta
store and applies them to DSM data.
→ Batch large chunks and then write them out as a PAX file.
31
NSM
Delta Store
DSM
Historical Data
Transactions
46. 15-721 (Spring 2023)
DATABASE PARTITIONING
Split database across multiple resources:
→ Disks, nodes, processors.
→ Often called "sharding" in NoSQL systems.
The DBMS executes query fragments on each
partition and then combines the results to produce a
single answer.
The DBMS can partition a database physically
(shared nothing) or logically (shared disk).
32
47. 15-721 (Spring 2023)
HORIZONTAL PARTITIONING
Split a table's tuples into disjoint subsets based on
some partitioning key and scheme.
→ Choose column(s) that divides the database equally in
terms of size, load, or usage.
Partitioning Schemes:
→ Hashing
→ Ranges
→ Predicates
33
48. 15-721 (Spring 2023)
HORIZONTAL PARTITIONING
34
SELECT * FROM table
WHERE partitionKey = ?
Ideal Query:
Partitions
Table1
101 a XXX 2022-11-29
102 b XXY 2022-11-28
103 c XYZ 2022-11-29
104 d XYX 2022-11-27
105 e XYY 2022-11-29
hash(a)%4 = P2
hash(b)%4 = P4
hash(c)%4 = P3
hash(d)%4 = P2
hash(e)%4 = P1
Partitioning Key
49. 15-721 (Spring 2023)
HORIZONTAL PARTITIONING
34
SELECT * FROM table
WHERE partitionKey = ?
Ideal Query:
Partitions
Table1
101 a XXX 2022-11-29
102 b XXY 2022-11-28
103 c XYZ 2022-11-29
104 d XYX 2022-11-27
105 e XYY 2022-11-29
hash(a)%4 = P2
hash(b)%4 = P4
hash(c)%4 = P3
hash(d)%4 = P2
hash(e)%4 = P1
Partitioning Key
50. 15-721 (Spring 2023)
HORIZONTAL PARTITIONING
34
SELECT * FROM table
WHERE partitionKey = ?
Ideal Query:
Partitions
Table1
101 a XXX 2022-11-29
102 b XXY 2022-11-28
103 c XYZ 2022-11-29
104 d XYX 2022-11-27
105 e XYY 2022-11-29
P3 P4
P1 P2
hash(a)%4 = P2
hash(b)%4 = P4
hash(c)%4 = P3
hash(d)%4 = P2
hash(e)%4 = P1
Partitioning Key
51. 15-721 (Spring 2023)
HORIZONTAL PARTITIONING
34
SELECT * FROM table
WHERE partitionKey = ?
Ideal Query:
Partitions
Table1
101 a XXX 2022-11-29
102 b XXY 2022-11-28
103 c XYZ 2022-11-29
104 d XYX 2022-11-27
105 e XYY 2022-11-29
P3 P4
P1 P2
hash(a)%4 = P2
hash(b)%4 = P4
hash(c)%4 = P3
hash(d)%4 = P2
hash(e)%4 = P1
Partitioning Key
57. 15-721 (Spring 2023)
PARTING THOUGHTS
Every modern OLAP system is using some variant
of PAX storage. The key idea is that all data must be
fixed-length.
Real-world tables contain mostly numeric attributes
(int/float), but their occupied storage is mostly
comprised of string data.
Modern columnar systems are so fast that most
people do not denormalize data warehouse schemas.
37
58. 15-721 (Spring 2023)
NEXT CLASS
How to accelerate OLAP queries on columnar data
with auxiliary data structures.
→ Zone Maps
→ Bitmap Indexes
→ Sketches
We will also discuss Project #1.
38