What a Modern Database Enables
Srini Srinivasan
CTO and Founder
Aerospike
All rights reserved. © 2023 Aerospike, Inc.
Our Driving Design Centers
Optimizations for Modern System Architectures
• CPU and NUMA pinning
• Storage tiers (DRAM, NVMe)
• Hybrid Memory Architecture
• Network to application alignment
Massive Parallelism with Indexing
• Multi-threaded NUMA architecture
• Data distribution across disks, nodes
• Client accesses server in single hop
• AI/ML Processing
Strongly Consistent Transactions
• Zero Data Loss
• Linearizable reads (tunable)
• Read one, write all scheme
• Roster concept maximizes availability
Geo-Distributed Active-Active System
• Uniform data partitioning
• Mixed workload handling
• Self-managed, rack-aware clusters
• Synchronous and asynchronous replication
Strongly Consistent Transactions
Aerospike Strong Consistency with 33% Less Hardware

Failure support – big hardware savings
• Tolerating 1 failure => 2 copies
• Tolerating 2 failures => 3 copies

When is data consistent?
• Once all nodes respond

Aerospike consensus is non-quorum and roster-based.

How is consistency maintained?
• With a roster
• The roster determines cluster health

Heartbeats
• Exchanged by nodes
• CPU load is unaffected as data per node increases

[Diagram: an application writing to node A (leader) and node B (follower).]
Aerospike passes Jepsen tests: https://jepsen.io/analyses/aerospike-3-99-0-3
Strong Consistency (SC) – Write Logic
Writes go to all replicas before returning to the client, committing with minimum friction:

1. Client sends the write request to the master.
2. Master writes locally.
3. Master replicates the write to Replica 1 and Replica 2.
4. Each replica responds to the master.
5. Master advises the replicas that the write is replicated.*
6. Master returns success to the client.

*"Advise replicated" is one-way and is sent only when there is more than one copy.
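The master-side flow above can be sketched in C. This is only an illustration of the message order; the stub functions (write_local, send_replicate, wait_ack, send_advise) are hypothetical stand-ins, not Aerospike's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for storage and replica networking; they only print
 * the step they represent so the message order is visible. */
static void write_local(const char *key, const char *val)           { printf("2. write local %s=%s\n", key, val); }
static void send_replicate(int r, const char *key, const char *val) { printf("3. replicate %s=%s -> replica %d\n", key, val, r); }
static bool wait_ack(int r)                                          { printf("4. response from replica %d\n", r); return true; }
static void send_advise(int r, const char *key)                      { printf("5. advise replicated %s -> replica %d (one-way)\n", key, r); }

/* Master-side strong-consistency write: local write, replicate to every
 * replica, wait for every response, then the one-way "advise replicated". */
static int sc_write(int n_replicas, const char *key, const char *val) {
    write_local(key, val);
    for (int r = 0; r < n_replicas; r++)
        send_replicate(r, key, val);
    for (int r = 0; r < n_replicas; r++)
        if (!wait_ack(r))
            return -1;                   /* the write cannot succeed without every replica */
    for (int r = 0; r < n_replicas; r++)
        send_advise(r, key);             /* only sent when there is more than one copy */
    return 0;                            /* 6. success goes back to the client */
}

int main(void) { return sc_write(2, "user:42", "{...}"); }
```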
SC – Linearizable Read Logic
1. Client sends the read request to the master.
2. Master reads locally.
3. Master sends a status request to Replica 1 and Replica 2.
4. Each replica returns a status response.
5. Master returns the response to the client.

No stale reads are possible, at the cost of an extra network round trip. Reading from the master or a replica alone is sufficient for sequential consistency, which skips the status round trip.
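A matching C sketch of the read side, again with hypothetical stubs: under linearizability the master checks its replicas' status before answering, while sequential consistency returns the local read directly.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for local storage and the replica status round trip. */
static const char *read_local(const char *key) { printf("2. read local %s\n", key); return "{...}"; }
static bool replica_in_sync(int r, const char *key) {
    printf("3/4. status round trip with replica %d for %s\n", r, key);
    return true;                          /* replica confirms it holds the same version */
}

/* Linearizable read: answer only after every replica confirms the master's copy
 * is current. Sequential consistency (the cheaper mode) skips the round trip. */
static const char *sc_read(int n_replicas, const char *key, bool linearizable) {
    const char *value = read_local(key);
    if (linearizable)
        for (int r = 0; r < n_replicas; r++)
            if (!replica_in_sync(r, key))
                return NULL;              /* never return a potentially stale value */
    return value;                         /* 5. response to the client */
}

int main(void) { printf("%s\n", sc_read(2, "user:42", true)); return 0; }
```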
High Availability in an RF2 Strong Consistency System
[Diagram: one cluster spanning Rack 1 in Zone 1 (nodes A, B, C, …, Z) and Rack 2 in Zone 2. For each partition, the master (e.g. 3M) and replica (e.g. 3R) are pegged to different racks; applications write to the master with automatic, synchronous replication to the replica and can read from either copy.]

• RF2 (replication factor 2): each rack holds a complete copy of the data
• Writes < 10 ms, reads < 1 ms, automatic sync
• Rack awareness pegs data copies to racks distributed across zones or datacenters within a cluster
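To illustrate what rack awareness buys the application, here is a small C sketch of a client preferring the copy in its own rack for reads; the node_t structure and pick_read_node helper are hypothetical, not the Aerospike client API.

```c
#include <stdio.h>

/* Hypothetical view of where a partition's two copies (RF2) live. */
typedef struct { const char *name; int rack_id; } node_t;

/* Reads prefer a copy in the caller's rack (sub-millisecond, no cross-zone hop);
 * writes always go to the master and replicate synchronously to the other rack. */
static const node_t *pick_read_node(const node_t *master, const node_t *replica, int local_rack) {
    if (replica->rack_id == local_rack) return replica;
    return master;                        /* master is local, or no local copy exists */
}

int main(void) {
    node_t master  = { "C (3M, rack 1)", 1 };
    node_t replica = { "B (3R, rack 2)", 2 };
    printf("app in rack 2 reads from node %s\n", pick_read_node(&master, &replica, 2)->name);
    return 0;
}
```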
Data Availability During Split Brain Events
[Diagram: a five-node roster (A–E) split into sub-clusters; RM = Roster Master, RR = Roster Replica, M = master, R = replica for a given partition.]

In a healthy cluster the Roster Master is the same as the master, and the Roster Replica is the same as the replica. During a split, each sub-cluster decides per partition whether it stays active:

Rule 1: A sub-cluster is active if it has the Roster Master and all Roster Replicas, and at least one of them is full.

Rule 2: A sub-cluster is active if it has a majority of the roster nodes and at least one full Roster Master or Roster Replica, OR exactly half the roster nodes including the Roster Master with the partition full.

Rule 3: A sub-cluster is active if it is a super-majority cluster and the partitions are full or subsets.
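The three rules can be written out as a single predicate. This is a minimal C sketch under the definitions above; the subcluster_t fields (and treating "super majority" as a supplied flag) are hypothetical simplifications, not Aerospike's code.

```c
#include <stdbool.h>
#include <stdio.h>

/* What one sub-cluster knows about itself for a single partition after a split.
 * Hypothetical fields that mirror the terms used in the three rules. */
typedef struct {
    int  nodes_present;                  /* nodes in this sub-cluster                       */
    int  roster_size;                    /* nodes in the full roster                        */
    bool has_roster_master;              /* Roster Master is here                           */
    bool has_all_roster_replicas;        /* every Roster Replica is here                    */
    bool rm_or_rr_full;                  /* at least one Roster Master/Replica copy is full */
    bool partition_full;                 /* this sub-cluster holds a full partition copy    */
    bool is_super_majority;              /* rule 3 precondition                             */
    bool partitions_full_or_subsets;     /* rule 3 condition                                */
} subcluster_t;

static bool partition_active(const subcluster_t *s) {
    /* Rule 1: Roster Master plus all Roster Replicas, at least one full. */
    if (s->has_roster_master && s->has_all_roster_replicas && s->rm_or_rr_full)
        return true;
    /* Rule 2: strict majority with a full Roster Master/Replica, or exactly half
     * the roster including the Roster Master with a full partition. */
    if (2 * s->nodes_present > s->roster_size && s->rm_or_rr_full)
        return true;
    if (2 * s->nodes_present == s->roster_size && s->has_roster_master && s->partition_full)
        return true;
    /* Rule 3: super-majority sub-cluster with partitions full or subsets. */
    return s->is_super_majority && s->partitions_full_or_subsets;
}

int main(void) {
    subcluster_t three_of_five = { 3, 5, true, false, true, true, false, false };
    printf("partition active: %s\n", partition_active(&three_of_five) ? "yes" : "no");
    return 0;
}
```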
Geo-Distributed Active-Active System
Node Add/Remove/Update without Disruption
Self-healing, auto-sharding, algorithmic cluster management
[Diagram: nodes A, B, C, …, Z each holding an even share of the cluster data (25% per node in a four-node cluster), before and after a node is added or removed.]

High uptime
• "Shared nothing" architecture with no single point of failure
• Self-healing capability
• Automatic rebalance upon node add/remove
• Data migrates automatically and evenly

Set-and-forget DevOps
• Automatic sharding of data
• No re-tuning of the cluster for use-case changes
Global Transactions – Sync Active-Active
Geographically distributed, strongly consistent transactions at scale.

[Diagram: a single cluster with Racks 1, 2 and 3 – Rack 1 in USA West (nodes 1–3), Rack 2 in USA East (nodes 4–6) and Rack 3 in the United Kingdom (nodes 7–9). A partition's master (M) and replicas (R1) sit in different racks, and local apps at each site read and write against the cluster.]

• Roster-membership based
• Synchronous active-active replication with automatic sync
• Strong consistency (linearizable), no data loss
• Conflict avoidance
• Automatic recovery on single-site failure
• Low-latency reads from the local rack
• Writes ~ 200 ms, reads < 1 ms
Distributed Data Hub – Async Active-Active
Multiple clusters connected via XDR (cross-datacenter replication):
› Asynchronous active-active replication
› Dynamic, fine-grained data routing
› Relaxed consistency (lag ~ 100 ms)

[Diagram: edge clusters at Locations A, B and C (SOE; millisecond latency; terabytes) replicate over XDR into a core real-time system of record – the single source of truth (millisecond latency; petabytes) – which in turn feeds a query-and-reporting store, streaming AI/ML engines, predictive analytics and a legacy data store in the warehouse tier (seconds-to-minutes latency; hundreds of petabytes). Data sources include web, social, mobile, IoT, gaming, streaming video, enterprise applications, features and 3rd parties.]
Optimizations for Modern System Architectures
Real-time Read Access to Data in SSD
The patented Hybrid Memory Architecture™ (HMA) places data on SSD and indexes only in DRAM. The software is written in C to talk natively to the hardware, not through an API layer.

[Diagram: with HMA, Aerospike accesses NVMe SSDs directly through the block interface; other databases go through the OS file system and page cache before reaching the block interface and SSDs.]

• Direct SSD device access
• Highly parallelized
• Large-block writes to SSD
• SSD vendor-optimized
• Continuous, non-disruptive defragmentation
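A minimal C sketch of the kind of direct, large-block device access HMA relies on: open the raw NVMe device with O_DIRECT to bypass the OS page cache and write one large, aligned block. The device path, block size and alignment are illustrative assumptions, and this is not Aerospike's storage engine.

```c
#define _GNU_SOURCE                      /* O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_BLOCK_SIZE (1024 * 1024)   /* illustrative 1 MiB write block */

int main(void) {
    /* Open the raw device directly, bypassing the OS file system and page cache. */
    int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);
    if (fd < 0) return 1;

    /* O_DIRECT requires the buffer to be aligned to the device's block size. */
    void *block;
    if (posix_memalign(&block, 4096, WRITE_BLOCK_SIZE) != 0) { close(fd); return 1; }
    memset(block, 0xAB, WRITE_BLOCK_SIZE);          /* stand-in for packed records */

    /* One large sequential write; the engine streams such blocks and defragments
     * partially empty ones continuously and non-disruptively. */
    ssize_t written = pwrite(fd, block, WRITE_BLOCK_SIZE, 0);

    free(block);
    close(fd);
    return written == WRITE_BLOCK_SIZE ? 0 : 1;
}
```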
Storage Tier Configurations
All DRAM
› Index and data in DRAM
› Sub-millisecond reads and writes

Hybrid DRAM/Flash
› Index in DRAM, data in flash
› Sub-millisecond reads and writes
› 5-10x lower server footprint

All Flash
› Index and data in flash
› Sub-5-millisecond reads and writes
› Lower DRAM usage than HMA
› Suitable for lots of small objects
› Server footprint reduction similar to HMA

[Diagram: in every configuration an index entry carries operations, expiry, digest and tree info, record metadata and a storage pointer. Record bins (1–3) are stored in DRAM or written to flash through a write queue, with continuous defragmentation and reads served from the data-in-flash storage. The index itself sits in DRAM (hybrid) or in flash (all flash).]
SLAs versus Scale on Storage Tiers
Tier            SLA            Scale
In-Memory       99% < 1 ms     Terabytes
Hybrid Memory   99% < 1 ms     Petabytes
All-Flash       99% < 10 ms    Petabytes

Example: 20 TB of data
• Memory Optimized (r6in.16xlarge): 512 GiB memory, 2 x 1900 GB SSD; addressable memory space 512 GiB/node → 37 nodes (performance + cost)
• Storage Optimized (im4gn.8xlarge): 128 GiB memory, 2 x 7500 GB SSD; addressable memory space 15 TB/node → 6 nodes (affordable scale)
C-Based DB Kernel – Optimizations for CPU, Memory, Network
➤ Multi-threaded data structures (NUMA pinned)
➤ Nested locking model for synchronization
➤ Lockless data structures
➤ Partitioned, single-threaded data structures
➤ Index entries aligned to the cache line (64 bytes)
➤ Custom memory management (arenas)
[Diagram: memory arena assignment and a multi-core architecture – NIC receive queues with their IRQs bound to specific cores on each CPU socket.]
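A small C sketch of three of the techniques listed above: a 64-byte, cache-line-aligned index entry, a bump-pointer arena that allocates entries out of a pre-allocated slab, and pinning a service thread to a core so that it stays aligned with that core's NIC IRQ binding. The field layout, sizes and helper names are illustrative assumptions, not Aerospike's data structures.

```c
#define _GNU_SOURCE                      /* CPU_SET / pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

/* One primary-index entry padded to exactly one 64-byte cache line, so touching
 * an entry never pulls in a neighbour's cache line. */
typedef struct {
    alignas(64) uint8_t digest[20];      /* key digest and tree info (illustrative) */
    uint32_t expiry;                     /* expiry / TTL metadata                   */
    uint64_t storage_ptr;                /* device id + block offset on flash       */
    uint8_t  pad[32];                    /* pad the struct out to 64 bytes          */
} index_entry_t;

/* Bump-pointer arena: index entries come out of large slabs instead of
 * per-record malloc, keeping them dense and cheap to release in bulk. */
typedef struct { index_entry_t *slab; size_t used, cap; } arena_t;

static index_entry_t *arena_alloc(arena_t *a) {
    return a->used < a->cap ? &a->slab[a->used++] : NULL;
}

/* Pin the calling thread to one core, matching that core's NIC IRQ binding. */
static int pin_to_core(int core) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
}

int main(void) {
    arena_t arena = { aligned_alloc(64, 1 << 20), 0, (1 << 20) / sizeof(index_entry_t) };
    pin_to_core(0);
    index_entry_t *entry = arena_alloc(&arena);
    (void)entry;
    free(arena.slab);
    return 0;
}
```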
Massive Parallelism with Indexing
Intelligent Data Partitioning Eliminates Hotspots

Data distribution is deterministic, uniform and algorithmic:
• An even amount of data on every node and on every flash device
• Load balanced continually and automatically on all servers, even while scaling up/down or with cluster reconfigurations
• No retuning for new use cases (same scheme and algorithms)

Partition map across nodes A–E:

Partition Id | Leader | Replica 1 | Replica 2 | Replica 3 | Replica 4
0            | B      | D         | E         | A         | C
1            | E      | C         | A         | D         | B
2            | C      | B         | E         | A         | D
…            | …      | …         | …         | …         | …
4095         | A      | E         | B         | D         | C
Smart Client™ – Direct Path to Data (Single Hop)

Remove bottlenecks: the same low latency from the 1st GB to the 1st PB. With the Smart Client™, every client knows where all data resides.

• The client is a first-class participant in the architecture and data fabric
• Continuously updates its view of the cluster
• Calculates the partition ID to determine the node ID
• Cluster-spanning operations (scan, query, batch) are sent to all nodes for parallel processing
• Executes operation APIs (e.g. CRUD+)
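A minimal C sketch of the single-hop lookup: hash the key to a digest, derive a partition ID in the 0–4095 range (the partition count shown in the map above), and look the owning node up in the client's locally cached partition map. The toy FNV-1a hash and the map layout are illustrative; the actual client derives the partition from a 20-byte RIPEMD-160 key digest.

```c
#include <stdint.h>
#include <stdio.h>

#define N_PARTITIONS 4096                  /* partitions 0..4095, as in the partition map */

/* Toy digest for illustration (FNV-1a); the real client uses a RIPEMD-160 key digest. */
static uint32_t toy_digest(const char *key) {
    uint32_t h = 2166136261u;
    for (; *key; key++) h = (h ^ (uint8_t)*key) * 16777619u;
    return h;
}

/* Locally cached partition map (partition id -> master node), which the Smart
 * Client keeps up to date as the cluster changes. */
static const char *partition_map[N_PARTITIONS];

/* Single hop: compute the partition from the key and go straight to its node. */
static const char *node_for_key(const char *key) {
    uint32_t pid = toy_digest(key) % N_PARTITIONS;
    return partition_map[pid];
}

int main(void) {
    static const char *nodes[] = { "A", "B", "C" };
    for (int p = 0; p < N_PARTITIONS; p++)
        partition_map[p] = nodes[p % 3];
    printf("key user:42 -> node %s (one network hop)\n", node_for_key("user:42"));
    return 0;
}
```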
Secondary Indexes – Parallel Query Execution
[Diagram: per-partition secondary index entries (b1:r1, b2:r4, …) for partitions P1 … Px point into the primary index in DRAM; the records themselves live on SSD.]

Query
• Value-based lookup via a secondary index
• Similar to a SQL "select"

Parallel execution
• Per partition
• Scatter-gather scheme
• Multiple threads across nodes

Parallel access is efficient for "low-selectivity" indexes.
Supports equality matches and range queries: integer, double, string, blob.
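The scatter-gather execution pattern can be sketched in C with one thread per partition: each thread evaluates the same range predicate against its partition's slice of the index, and the results are gathered at the end. The data and structures are synthetic stand-ins, not the Aerospike query engine or client API.

```c
#include <pthread.h>
#include <stdio.h>

#define N_PARTITIONS 8                     /* a handful of partitions for illustration */

/* One scatter unit: a partition id, the query range, and a slot for the result. */
typedef struct { int partition_id; long lo, hi, matches; } shard_query_t;

/* Evaluate the range predicate against one partition's (synthetic) index slice. */
static void *query_partition(void *arg) {
    shard_query_t *q = arg;
    q->matches = 0;
    for (long value = q->partition_id; value < 1000; value += N_PARTITIONS)
        if (value >= q->lo && value <= q->hi)
            q->matches++;
    return NULL;
}

int main(void) {
    pthread_t threads[N_PARTITIONS];
    shard_query_t shards[N_PARTITIONS];
    long total = 0;

    /* Scatter: every partition evaluates the same predicate on its own thread. */
    for (int p = 0; p < N_PARTITIONS; p++) {
        shards[p] = (shard_query_t){ .partition_id = p, .lo = 100, .hi = 200 };
        pthread_create(&threads[p], NULL, query_partition, &shards[p]);
    }
    /* Gather: join the threads and combine the per-partition results. */
    for (int p = 0; p < N_PARTITIONS; p++) {
        pthread_join(threads[p], NULL);
        total += shards[p].matches;
    }
    printf("values in range [100, 200]: %ld\n", total);
    return 0;
}
```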
Massively Parallel Architecture

Data distribution is deterministic, uniform and algorithmic:
• An even amount of data on every node and on every flash device
• Load balanced continually and automatically on all servers, even while scaling up/down or with cluster reconfigurations
• No retuning for new use cases (same scheme and algorithms)
• No hot spots with intelligent auto-sharding

[Diagram: a client connected to nodes A, B and C, each holding 33% of the cluster data and spreading its share evenly across its SSDs (11% of the cluster data on each of SSD 1–3).]
Summary
Optimizations for Modern System Architectures
• CPU and NUMA pinning
• Storage tiers (DRAM, NVMe)
• Hybrid Memory Architecture
• Network to application alignment
Massive Parallelism with Indexing
• Multi-threaded NUMA architecture
• Data distribution across disks, nodes
• Client accesses server in single hop
• AI/ML Processing
Strongly Consistent Transactions
• Zero Data Loss
• Linearizable reads (tunable)
• Read one, write all scheme
• Roster concept maximizes availability
Geo-Distributed Active-Active System
• Uniform data partitioning
• Mixed workload handling
• Self-managed, rack-aware clusters
• Synchronous and asynchronous replication
Thank You