Schema Design
(and its performance implications)
Jay Runkel
Principal Solutions Architect
jay.runkel@mongodb.com
@jayrunkel
Agenda
1. Today’s Example
2. MongoDB Schema Design vs. Relational
3. Modeling Relationships
4. Schema Design and Performance
Today’s Example
Medical Records
• Collects all patient information in a central repository
• Provides a central point of access for
– Patients
– Care providers: physicians, nurses, etc.
– Billing
– Insurance reconciliation
• Hospitals, physicians, patients, procedures, records
Patient
Records
Medications
Lab Results
Procedures
Hospital
Records
Physicians
Patients
Nurses
Billing
Medical Record Data
• Hospitals
– Have physicians
• Physicians
– Have patients
– Perform procedures
– Belong to hospitals
• Patients
– Have physicians
– Are the subject of procedures
• Procedures
– Associated with a patient
– Associated with a physician
– Have a record
– Variable metadata
• Records
– Associated with a procedure
– Binary data
– Variable fields
A Lot of Variability
Relational View
Schema Design:
MongoDB vs. Relational
MongoDB versus Relational
MongoDB                      Relational
Collections                  Tables
Documents                    Rows
Data Use                     Data Storage
What questions do I have?    What answers do I have?

Attribute       MongoDB                 Relational
Storage         N-dimensional           Two-dimensional
Field Values    0, 1, many, or embed    Single value
Query           Any field or level      Any field
Schema          Flexible                Very structured
Complex Normalized Schemas
Documents are Rich Data Structures
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: '+447557505611',
  city: 'London',
  location: [45.123, 47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
Fields can contain an array of sub-documents
Fields
Typed field values
Fields can contain arrays
Relationships
Modeling One-to-One Relationships
Referencing
Procedure
• patient
• date
• type
• physician
• type
Results
• dataType
• size
• content: {…}
Use two collections with a reference
Similar to relational
Procedure
• patient
• date
• type
• results
• equipmentId
• data1
• data2
• physician
• Results
• type
• size
• content: {…}
Embedding
Document Schema
Referencing
Procedure
{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : 134
}
Results
{
  "_id" : 134,
  "type" : "txt",
  "size" : NumberInt(12),
  "content" : {
    value1: 343,
    value2: "abc",
    …
  }
}
Embedding
Procedure
{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : {
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : {
      value1: 343,
      value2: "abc",
      …
    }
  }
}
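The difference between the two read paths can be sketched without a database. The snippet below is a minimal illustration in plain JavaScript, assuming in-memory stand-ins for the two collections; the function names and the Map-based "collection" are hypothetical and only mirror the Procedure and Results documents shown above.

```javascript
// In-memory stand-ins for the collections (illustrative only; no database involved).
const results = new Map([
  [134, { _id: 134, type: "txt", size: 12, content: { value1: 343, value2: "abc" } }],
]);

// Referencing: the procedure stores only the _id of its result.
const referencedProcedure = { _id: 333, type: "Chest X-ray", result: 134 };

// Embedding: the result lives inside the procedure document.
const embeddedProcedure = {
  _id: 333,
  type: "Chest X-ray",
  result: { type: "txt", size: 12, content: { value1: 343, value2: "abc" } },
};

// Referencing: the application performs the "join" with a second lookup.
function resultTypeByReference(procedure, resultsById) {
  const result = resultsById.get(procedure.result); // a second query in a real system
  return result ? result.type : undefined;
}

// Embedding: the result arrives with the procedure, so one read is enough.
function resultTypeEmbedded(procedure) {
  return procedure.result.type;
}
```

Both functions return the same value for these documents; the referencing version simply needs the extra lookup to get there.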
Embedding
• Advantages
– Retrieve all relevant information in a single query/document
– Avoid implementing joins in application code
– Update related information as a single atomic operation
• MongoDB did not offer multi-document transactions when this talk was given (they were added in MongoDB 4.0)
• Limitations
– Large documents mean more overhead if most fields are not relevant
– 16 MB document size limit
Atomicity
• Document operations are atomic
db.patients.update({_id: 12345},
  {$inc : {numProcedures : 1},
   $push : {procedures : "proc123"},
   $set : {"addr.state" : "TX"}})
• No multi-document transactions (as of this talk; MongoDB 4.0 later added them)
Pseudocode for what was not available (these calls are not a real API):
db.beginTransaction();
db.patients.update({_id: 12345}, …);
db.procedure.insert({_id: "proc123", …});
db.records.insert({_id: "rec123", …});
db.endTransaction();
Referencing
• Advantages
– Smaller documents
– Less likely to reach 16 MB document limit
– Infrequently accessed information not accessed on every query
– No duplication of data
• Limitations
– Two queries required to retrieve information
– Cannot update related information atomically
One to One: General Recommendations
• Embed
– No additional data duplication
– Can query or index on embedded field
• e.g., "result.type"
• Exceptional cases…
– Embedding results in large documents
– Set of infrequently accessed fields
{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : {
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : {
      value1: 343,
      value2: "abc",
      …
    }
  }
}
Modeling One-to-Many Relationships
One-to-Many Relationships
Modeled in 2 possible ways
Embed
Patients
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    { id: 12345,
      date: ISODate("2015-02-15"),
      type: "Cat scan",
      … },
    { id: 12346,
      date: ISODate("2015-02-15"),
      type: "blood test",
      … }]
}
Reference
Patients
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [12345, 12346]
}
Procedures
{
  _id: 12345,
  date: ISODate("2015-02-15"),
  type: "Cat scan",
  … }
{
  _id: 12346,
  date: ISODate("2015-02-15"),
  type: "blood test",
  … }
One to Many: General Recommendations
• Embed, when possible
– Access all information in a single query
– Take advantage of update atomicity
– No additional data duplication
– Can query or index on any field
• e.g., { "phones.type": "mobile" }
• Exceptional cases:
– 16 MB document size limit
– Large number of infrequently accessed fields
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    { id: 12345,
      date: ISODate("2015-02-15"),
      type: "Cat scan",
      … },
    { id: 12346,
      date: ISODate("2015-02-15"),
      type: "blood test",
      … }]
}
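To make "can query or index on any field" concrete, here is a small sketch of what a filter such as { "procedures.type": "Cat scan" } selects when procedures are embedded as an array. It is plain JavaScript over in-memory objects rather than a real query; the second patient and the function name are made up for illustration.

```javascript
// In-memory stand-in for the patients collection (second patient is hypothetical).
const patients = [
  {
    _id: 2, first: "Joe", last: "Patient",
    procedures: [
      { id: 12345, date: "2015-02-15", type: "Cat scan" },
      { id: 12346, date: "2015-02-15", type: "blood test" },
    ],
  },
  {
    _id: 3, first: "Ann", last: "Example",
    procedures: [{ id: 12347, date: "2015-03-01", type: "blood test" }],
  },
];

// Mimics the filter { "procedures.type": wanted }: a document matches
// when at least one element of its procedures array has that type.
function patientsWithProcedure(docs, wanted) {
  return docs.filter((p) => p.procedures.some((proc) => proc.type === wanted));
}
```

Calling patientsWithProcedure(patients, "Cat scan") selects only the first patient, and no second collection is consulted.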
Modeling Many-to-Many Relationships
Many to Many
Traditional Relational Association
Physicians: name, specialty, phone
Hospitals: name
Join table HosPhysicanRel: hospitalId, physicianId
Use arrays instead of the join table
Many-to-Many Relationships
Embedding physicians in hospitals collection
{
  _id: 1,
  name: "Oak Valley Hospital",
  city: "New York",
  beds: 131,
  physicians: [
    { id: 12345,
      name: "Joe Doctor",
      address: {…},
      … },
    { id: 12346,
      name: "Mary Well",
      address: {…},
      … }]
}
{
  _id: 2,
  name: "Plainmont Hospital",
  city: "Omaha",
  beds: 85,
  physicians: [
    { id: 63633,
      name: "Harold Green",
      address: {…},
      … },
    { id: 12345,
      name: "Joe Doctor",
      address: {…},
      … }]
}
Data Duplication (Joe Doctor is embedded in both hospitals)
Many-to-Many Relationships
Referencing
Hospitals
{
  _id: 1,
  name: "Oak Valley Hospital",
  city: "New York",
  beds: 131,
  physicians: [12345, 12346]
}
{
  _id: 2,
  name: "Plainmont Hospital",
  city: "Omaha",
  beds: 85,
  physicians: [63633, 12345]
}
Physicians
{
  _id: 63633,
  name: "Harold Green",
  address: {…},
  … }
{
  _id: 12345,
  name: "Joe Doctor",
  address: {…},
  … }
{
  _id: 12346,
  name: "Mary Well",
  address: {…},
  … }
Many to Many
General Recommendation
• Use case determines whether to reference or embed:
1. Data Duplication
• Embedding may result in data duplication
• Duplication may be okay if reads dominate updates
2. Referencing may be required if there are many related items
3. Hybrid approach
• Potentially do both
Reference
Hospitals
{
  _id: 1,
  name: "Oak Valley Hospital",
  city: "New York",
  beds: 131,
  physicians: [12345, 12346]
}
Physicians
{
  _id: 12345,
  name: "Joe Doctor",
  address: {…},
  … }
{
  _id: 12346,
  name: "Mary Well",
  address: {…},
  … }
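With references, the application resolves the many-to-many relationship itself in both directions. The following is a rough sketch over in-memory data, assuming the hospital and physician documents from the referencing slides; the helper names are hypothetical and no database driver is involved.

```javascript
// In-memory stand-ins for the hospitals and physicians collections.
const hospitals = [
  { _id: 1, name: "Oak Valley Hospital", physicians: [12345, 12346] },
  { _id: 2, name: "Plainmont Hospital", physicians: [63633, 12345] },
];
const physicians = new Map([
  [12345, { _id: 12345, name: "Joe Doctor" }],
  [12346, { _id: 12346, name: "Mary Well" }],
  [63633, { _id: 63633, name: "Harold Green" }],
]);

// Hospital -> physicians: follow each id in the array (a second query in practice).
function physicianNames(hospital, physiciansById) {
  return hospital.physicians
    .map((id) => physiciansById.get(id))
    .filter((doc) => doc !== undefined) // skip dangling references
    .map((doc) => doc.name);
}

// Physician -> hospitals: match on the array field, like { physicians: physicianId }.
function hospitalsFor(physicianId, allHospitals) {
  return allHospitals
    .filter((h) => h.physicians.includes(physicianId))
    .map((h) => h.name);
}
```

Each physician is stored once, so changing a physician's name touches one document; the cost is the extra lookup when a hospital's staff list is displayed.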
What If I Want to Store Large Files in MongoDB?
GridFS
[Diagram: the driver's GridFS API splits doc.jpg into chunks stored in fs.chunks and keeps the file's metadata in fs.files]
mongofiles utility provides command line GridFS interface
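The chunking idea behind GridFS can be sketched in a few lines. This is not the driver's implementation: it is a simplified illustration that only mirrors the fs.files / fs.chunks split, assuming the 255 KiB default chunk size and a reduced set of fields. The function name toGridFsDocs is hypothetical.

```javascript
// Default GridFS chunk size in bytes (255 KiB).
const DEFAULT_CHUNK_SIZE = 255 * 1024;

// Split a Buffer into one fs.files-style document and N fs.chunks-style documents.
// Only a subset of the real fields is modeled here.
function toGridFsDocs(filesId, filename, data, chunkSize = DEFAULT_CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n += 1) {
    chunks.push({
      files_id: filesId, // points back at the fs.files document
      n,                 // sequence number of this chunk
      data: data.subarray(offset, offset + chunkSize),
    });
  }
  const file = { _id: filesId, filename, length: data.length, chunkSize };
  return { file, chunks };
}
```

A 600,000-byte buffer, for example, yields three chunk documents, which is how a file larger than the 16 MB document limit can still be stored.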
Schema Design and Performance
Two Examples
Example 1: Hybrid Approach
Embed and Reference
Healthcare Example
patients
procedures
Tailor Schema to Queries (cont.)
Patient
{
  "_id" : 593340651,
  "first" : "Gregorio",
  "last" : "Lang",
  "addr" : {
    "street" : "623 Flowers Rd",
    "city" : "Groton",
    "state" : "NH",
    "zip" : 3266
  },
  "physicians" : [10387, 33456],
  "procedures" : ["551ac", "343fs"]
}
Procedure
{
  "_id" : "551ac",
  "date" : "2000-04-26",
  "hospital" : 161,
  "patient" : 593340651,
  "physician" : 10387,
  "type" : "Chest X-ray",
  "records" : ["67bc6"]
}
Find all patients from NH that have had chest x-rays
Tailor Schema to Queries (cont.)
Patient
{
  "_id" : 593340651,
  "first" : "Gregorio",
  "last" : "Lang",
  "addr" : {
    "street" : "623 Flowers Rd",
    "city" : "Groton",
    "state" : "NH",
    "zip" : 3266
  },
  "physicians" : [10387, 33456],
  "procedures" : [
    {id : "551ac",
     type : "Chest X-ray"},
    {id : "343fs",
     type : "Blood Test"}]
}
Procedure
{
  "_id" : "551ac",
  "date" : "2000-04-26",
  "hospital" : 161,
  "patient" : 593340651,
  "physician" : 10387,
  "type" : "Chest X-ray",
  "records" : ["67bc6"]
}
Find all patients from NH that have had chest x-rays
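The payoff of the hybrid schema shows up in how this query is answered. Below is a hedged sketch in plain JavaScript over in-memory copies of the documents above (trimmed to the relevant fields); the function names are hypothetical, and the filter in the comment is what the hybrid version would roughly correspond to in a real query.

```javascript
// In-memory stand-ins, trimmed to the fields the query needs.
const procedures = [
  { _id: "551ac", patient: 593340651, type: "Chest X-ray" },
  { _id: "343fs", patient: 593340651, type: "Blood Test" },
];
const referencingPatients = [
  { _id: 593340651, last: "Lang", addr: { state: "NH" }, procedures: ["551ac", "343fs"] },
];
const hybridPatients = [
  {
    _id: 593340651, last: "Lang", addr: { state: "NH" },
    procedures: [
      { id: "551ac", type: "Chest X-ray" },
      { id: "343fs", type: "Blood Test" },
    ],
  },
];

// Pure referencing: two steps. First collect patient ids from procedures,
// then filter patients by state and by membership in that id set.
function findByReference(patients, allProcedures, state, type) {
  const ids = new Set(allProcedures.filter((p) => p.type === type).map((p) => p.patient));
  return patients.filter((p) => p.addr.state === state && ids.has(p._id));
}

// Hybrid: one pass over patients, roughly { "addr.state": state, "procedures.type": type }.
function findHybrid(patients, state, type) {
  return patients.filter(
    (p) => p.addr.state === state && p.procedures.some((proc) => proc.type === type)
  );
}
```

Both functions find the same patient; the hybrid version never touches the procedures collection, at the price of duplicating each procedure's type inside the patient document.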
Example 2: Time Series Data
Medical Devices
Vital Sign Monitoring Device
Vital Signs Measured:
• Blood Pressure
• Pulse
• Blood Oxygen Levels
Produces data at regular intervals
• Once per minute
We have one or more hospitals full of devices
Data From Vital Signs Monitoring Device
{
deviceId: 123456,
spO2: 88,
pulse: 74,
bp: [128, 80],
ts: ISODate("2013-10-16T22:07:00.000-0500")
}
• One document per minute per device
• Relational approach
Document Per Hour (By minute)
{
deviceId: 123456,
spO2: { 0: 88, 1: 90, …, 59: 92},
pulse: { 0: 74, 1: 76, …, 59: 72},
bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]},
ts: ISODate("2013-10-16T22:00:00.000-0500")
}
• Store per-minute data at the hourly level
• Update-driven workload
• 1 document per device per hour
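One way to drive this update-heavy workload is to turn each per-minute reading into a filter and a $set on the hourly document. The sketch below only builds those two documents in plain JavaScript; it assumes hours are bucketed in UTC and that the write is issued as an upsert, neither of which the slides specify, and the function name is hypothetical.

```javascript
// Build the filter/update pair that records one reading in its hourly document.
// Assumptions: buckets are aligned to UTC hours, and the caller applies the
// update with upsert enabled so the first reading of an hour creates the document.
function hourlyBucketUpdate(reading) {
  const ts = new Date(reading.ts);
  const minute = ts.getUTCMinutes();

  const hour = new Date(ts);
  hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour

  return {
    filter: { deviceId: reading.deviceId, ts: hour },
    update: {
      $set: {
        [`spO2.${minute}`]: reading.spO2,
        [`pulse.${minute}`]: reading.pulse,
        [`bp.${minute}`]: reading.bp,
      },
    },
    options: { upsert: true },
  };
}
```

For the 22:07 reading shown earlier, the update sets spO2.7, pulse.7, and bp.7 on the document for that hour, so each minute touches three keys of a single document.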
Characterizing Write Differences
• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:
Document Per Event: 60 inserts
Document Per Hour: 1 insert, 59 updates
Characterizing Read Differences
• Want to graph 24 hours of vital signs for a patient:
Document Per Event: 1440 reads
Document Per Hour: 24 reads
• Read performance is greatly improved
Characterizing Memory and Storage Differences
• 100K devices
• 1 year's worth of data
                        Document Per Minute               Document Per Hour
Number of Documents     52.6 B                            876 M
                        (100000 * 365 * 24 * 60)          (100000 * 365 * 24)
Total Index Size        6364 GB                           106 GB
                        (100000 * 365 * 24 * 60 * 130)    (100000 * 365 * 24 * 130)
  _id index             1468 GB                           24.5 GB
  {ts: 1, deviceId: 1}  4895 GB                           81.6 GB
Document Size           92 bytes                          758 bytes
Database Size           4503 GB                           618 GB
                        (100000 * 365 * 24 * 60 * 92)     (100000 * 365 * 24 * 758)
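The figures in the table follow directly from the formulas on the slide. As a quick check, the arithmetic can be reproduced as below, assuming "GB" means 1024^3 bytes and that 130 is the combined index bytes per document (both inferred from the numbers, not stated in the deck); the function name sizing is made up.

```javascript
const DEVICES = 100000;
const HOURS_PER_YEAR = 365 * 24;
const GIB = 1024 ** 3; // assumption: the slide's "GB" is 2^30 bytes

// Document count, index size, and data size for one year of readings.
function sizing(docsPerDeviceHour, docBytes, indexBytesPerDoc) {
  const documents = DEVICES * HOURS_PER_YEAR * docsPerDeviceHour;
  return {
    documents,
    indexGB: Math.round((documents * indexBytesPerDoc) / GIB),
    dataGB: Math.round((documents * docBytes) / GIB),
  };
}

const perMinute = sizing(60, 92, 130); // one document per reading
const perHour = sizing(1, 758, 130);   // one document per device per hour
```

Although each hourly document is about eight times larger, there are sixty times fewer of them, which is why both the index and the data shrink.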
Summary
• Relationships can be modeled by embedding or references
• Decision should be made in context of application data and query workload
– Tailor schema to application workload
• It is okay, even recommended, to violate RDBMS schema design principles
– No duplication of data
– Normalization
• Different schemas may result in dramatically different
– Query performance
– Hardware requirements
Questions?
jay.runkel@mongodb.com
@jayrunkel
