MongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep Dive
Atlas Data Lake Technical Deep-Dive
Himanshumali, Solutions Architect-APAC, MongoDB
State of Affairs
Why are we building this?
• Businesses have a humongous amount of data
o IDC predicts that by 2025 global data will reach 175 Zettabytes
o 49% of it will reside in the public cloud.
• Cloud storage is cost-effective
• Stored data is hard to operationalize
A New Service Offered by MongoDB Atlas
Atlas Data Lake allows you to...
• Access long-term data
• Query long-term data
• Analyze long-term data
Product Guidelines
Every product has requirements!
• Look and act like MongoDB
• Access customer’s data securely
• Handle queries over vast amounts of data
• Handle long-running queries
• Efficient use of resources
Emulating MongoDB
Behaviour
• Communicates with our existing drivers; the service itself is written in Go
• Implemented a TCP server, using mongo-go-driver’s wire protocol package
• Implemented commands for a read-only server
• Used the server’s aggregation engine
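To make the "TCP server speaking the wire protocol" point concrete, here is a minimal sketch of the standard 16-byte header that precedes every MongoDB wire protocol message. It is written in Python with the standard library only and is not the service's Go code; the helper names `pack_header` and `unpack_header` are made up for this illustration.

```python
import struct

OP_MSG = 2013  # opcode modern drivers use to send commands

# Standard message header: four little-endian int32 values.
HEADER = struct.Struct("<iiii")


def pack_header(body_length: int, request_id: int, response_to: int, op_code: int) -> bytes:
    """Build the header; messageLength counts the header plus the body."""
    return HEADER.pack(HEADER.size + body_length, request_id, response_to, op_code)


def unpack_header(data: bytes) -> dict:
    """Read the fields a server inspects before it decodes the body."""
    length, request_id, response_to, op_code = HEADER.unpack_from(data)
    return {
        "messageLength": length,
        "requestID": request_id,
        "responseTo": response_to,
        "opCode": op_code,
    }


header = pack_header(body_length=100, request_id=7, response_to=0, op_code=OP_MSG)
print(unpack_header(header))
```

A server that wants to look like MongoDB to existing drivers has to read this header off the socket, decode the command that follows, and reply with a message whose `responseTo` echoes the request's `requestID`.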
Security
Must have the same security as MongoDB.
• Users configured in Atlas
• Implemented MongoDB’s security model
• Authentication
• Authorization
• Require the use of TLS + SNI
• SNI = Server Name Indication
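SNI is the TLS extension in which the client names the host it wants in its first handshake message. The sketch below uses only Python's standard `ssl` module and a placeholder hostname (`datalake.example.com`, an assumption) to build that first message in memory, with no network access, and check that the name is carried in it. It illustrates SNI in general, not how Atlas Data Lake uses the value.

```python
import ssl


def client_hello_for(hostname: str) -> bytes:
    """Return the first TLS bytes a client would send when asked for `hostname`."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    # server_hostname is what ends up in the SNI extension.
    tls = context.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client now waits for the server's reply
    return outgoing.read()


hello = client_hello_for("datalake.example.com")
print(b"datalake.example.com" in hello)
```

Drivers normally set this for you whenever TLS is enabled and you connect by hostname rather than by IP address.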
Control and Configuration
Controlled by Customer
Customers have complete control
• Provide us with an IAM Role
• Configure your buckets
• Configure your users in Atlas
Configuration
Customers control their data layout.
• Stores
• Databases, Collections
• DataSources
Configuration: Store
{
  "stores" : [{
    s3 : {
      name: <string>,
      bucket: <string>,
      region: <string>,
      prefix: <string>
    }
  }],
  "databases" : {
    "<database_name>" : {
      "<collection_name>" : [{
        "store" : "<string>",
        "definition" : "<string>"
      }]
    }
  }
}
Configuration (S3 Bucket): ent-archive
/archive/customers
  - a-m.json
  - n-z.json
/archive/invoices
  - 2019
    - 1.parquet
    - 2.parquet
  - 2018
    - 1.parquet
  - 2017.json.gz
  - 2016.json.gz
Configuration: Store
s3 : {
  name: "ent-archive",
  bucket: "ent-archive",
  region: "us-east-1",
  prefix: "/archive/"
}
Configuration: Data
databases : {
  history: {
    customers: [{
      store: "ent-archive",
      definition: "/customers/*"
    }],
    invoices: [{
      store: "ent-archive",
      definition: "/invoices/{year int}/*"
    }, {
      store: "ent-archive",
      definition: "/invoices/{year int}.json.gz"
    }]
  }
}
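One way to read the `definition` templates is as path patterns: `*` matches anything, and `{year int}` captures part of the path as a typed field. The sketch below is an assumption about those semantics, not the service's implementation; the function names, the regex translation, and the way the store prefix is stripped are all hypothetical. It resolves the example bucket's keys against the `invoices` and `customers` definitions.

```python
import re

# A template token is either "{name type}" or "*".
TOKEN = re.compile(r"(\{\w+ \w+\}|\*)")


def compile_definition(definition):
    """Translate a definition template into a regex plus per-field casts."""
    pieces, casts = [], {}
    for part in TOKEN.split(definition):
        if part == "*":
            pieces.append(".*")
        elif TOKEN.fullmatch(part):
            name, kind = part[1:-1].split()
            casts[name] = int if kind == "int" else str
            pieces.append(rf"(?P<{name}>\d+)" if kind == "int" else rf"(?P<{name}>[^/]+)")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces)), casts


def files_for(keys, prefix, definitions):
    """Yield (key, fields) for each key under prefix matched by a definition."""
    root = prefix.rstrip("/")
    compiled = [compile_definition(d) for d in definitions]
    for key in keys:
        if not key.startswith(root + "/"):
            continue
        relative = key[len(root):]
        for pattern, casts in compiled:
            match = pattern.fullmatch(relative)
            if match:
                yield key, {n: casts[n](v) for n, v in match.groupdict().items()}
                break


KEYS = [
    "/archive/customers/a-m.json",
    "/archive/customers/n-z.json",
    "/archive/invoices/2019/1.parquet",
    "/archive/invoices/2019/2.parquet",
    "/archive/invoices/2018/1.parquet",
    "/archive/invoices/2017.json.gz",
    "/archive/invoices/2016.json.gz",
]

invoices = list(files_for(KEYS, "/archive/", [
    "/invoices/{year int}/*",
    "/invoices/{year int}.json.gz",
]))
customers = list(files_for(KEYS, "/archive/", ["/customers/*"]))

for key, fields in invoices:
    print(key, fields)
```

Under this reading, every invoice file carries a `year` value taken from its path, so a query that filters on `year` could in principle skip files whose path already rules them out.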
Configuration: Data (Future)
history: {
  invoices: [{
    store: "ent-archive",
    definition: "/invoices/{year int}/*"
  }, {
    store: "ent-archive",
    definition: "/invoices/{year int}.json.gz"
  }, {
    store: "atlas",
    cluster: "my-cluster",
    db: "customers",
    collection: "invoices"
  }]
}
Configuration: File Formats
Each file has a format
• BSON (gzipped)
• JSON (gzipped)
• Avro (gzipped)
• CSV/TSV (gzipped)
• Parquet
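The slide does not say how a file's format is determined. As a sketch only, assuming the format is guessed from the file extension (as the names in the example bucket suggest), a lookup could work like this; the extension table and `detect_format` are hypothetical.

```python
import os

# Assumed mapping from extension to the formats listed on the slide.
FORMATS = {
    ".bson": "BSON",
    ".json": "JSON",
    ".avro": "Avro",
    ".csv": "CSV",
    ".tsv": "TSV",
    ".parquet": "Parquet",
}


def detect_format(key: str):
    """Return (format, gzipped) guessed from the file name, or None if unknown."""
    stem, ext = os.path.splitext(key)
    gzipped = ext == ".gz"
    if gzipped:
        # Look at the extension underneath the .gz suffix.
        stem, ext = os.path.splitext(stem)
    fmt = FORMATS.get(ext)
    return (fmt, gzipped) if fmt else None


print(detect_format("/archive/invoices/2017.json.gz"))
print(detect_format("/archive/invoices/2019/1.parquet"))
```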
Processing
MQL : Distributed MQL
• Parse
• Distribute & Parallelize
Architecture
Query Example: $limit
{ $match: { year: { $gt: 2000 } } }
{ $limit: 10 }
Map:
  { $match: { year: { $gt: 2000 } } }
  { $limit: 10 }
Reduce:
  { $limit: 10 }
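The split can be simulated in a few lines of plain Python. This is a sketch of the idea only, with lists standing in for files and functions standing in for workers; `map_stage` and `reduce_stage` are names chosen for the illustration. Each partition filters and limits locally, then a single reduce step applies the limit once more.

```python
from itertools import islice


def match_year(doc):
    """Equivalent of { $match: { year: { $gt: 2000 } } }."""
    return doc["year"] > 2000


def map_stage(partition, limit=10):
    """Runs per partition: filter, then keep at most `limit` documents."""
    return list(islice(filter(match_year, partition), limit))


def reduce_stage(partial_results, limit=10):
    """Runs once: concatenate the partial results and limit again."""
    merged = (doc for part in partial_results for doc in part)
    return list(islice(merged, limit))


partitions = [
    [{"year": y} for y in range(1990, 2020)],  # 19 matching documents
    [{"year": y} for y in range(2005, 2012)],  # 7 matching documents
    [{"year": 1999}],                          # no matching documents
]

partials = [map_stage(p) for p in partitions]
result = reduce_stage(partials)
print([len(p) for p in partials], len(result))
```

Repeating `$limit` in the map step means no partition ever ships more than 10 documents to the reducer, which is what keeps the merge cheap.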
Query Example: $group
{ $group: { _id: "$year", totalAvg: { $avg: "$amount" } } }
Map:
  { $group: { _id: "$year",
      totalAvg_sum: { $sum: "$amount" },
      totalAvg_count: { $sum: 1 } } }
Reduce:
  { $group: { _id: "$_id",
      totalAvg_sum: { $sum: "$totalAvg_sum" },
      totalAvg_count: { $sum: "$totalAvg_count" } } }
Finalize:
  { $project: { _id: "$_id", totalAvg: { $divide: ["$totalAvg_sum", "$totalAvg_count"] } } }
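An average cannot be merged directly: averaging the per-partition averages gives the wrong answer when partitions hold different numbers of documents. Carrying a sum and a count through the map and reduce steps avoids that. The sketch below simulates the three stages in plain Python with made-up data; the function names are chosen for the illustration.

```python
from collections import defaultdict


def map_stage(partition):
    """Per partition: sum and count of amount for each year."""
    acc = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for doc in partition:
        entry = acc[doc["year"]]
        entry["sum"] += doc["amount"]
        entry["count"] += 1
    return dict(acc)


def reduce_stage(partials):
    """Once: add up the partial sums and counts per year."""
    merged = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for partial in partials:
        for year, entry in partial.items():
            merged[year]["sum"] += entry["sum"]
            merged[year]["count"] += entry["count"]
    return dict(merged)


def finalize(merged):
    """Once: divide sum by count to get the average per year."""
    return {year: e["sum"] / e["count"] for year, e in merged.items()}


partitions = [
    [{"year": 2018, "amount": 10}, {"year": 2019, "amount": 30}],
    [{"year": 2018, "amount": 20}, {"year": 2018, "amount": 60}],
]

averages = finalize(reduce_stage(map_stage(p) for p in partitions))
print(averages)
```

For 2018 the per-partition averages are 10 and 40, whose mean is 25, while the true average of 10, 20 and 60 is 30; the sum-and-count decomposition yields the latter.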
Use Cases
Scenario
• Customer has been storing device data in Atlas for the last 2 years
• Actively requires only the last 2 years of data
• Might also need to query older data when a customer requests it
Future
More supported MongoDB operators.
• $out
• $merge
• $graphLookup
• Geo operators
• Full Text Search
Future
Optimizations!
• Indexes
• Statistics
Future
More File Formats!
• ORC
• Excel
• PDF
Future
Integrations!
• Microsoft Azure
• Google Cloud
Thank You!