Ken W. Alger – Developer Advocate, MongoDB
Exploring your MongoDB Data with Pirates (R) and
Snakes (Python)
@kenwalger
Ken W. Alger
Developer Advocate
Overview
§ The Document Model
§ Data Frames
§ R vs. Python
§ MongoDB to Data Frames
§ Array Consumption
§ The Power of MongoDB
The Document Model
Document Model Features
Naturally maps objects in code to JSON documents.
Represents data of any structure; the data model is highly flexible.
Strongly typed for ease of processing, with support for over twenty binary-encoded JSON (BSON) data types.
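As a minimal sketch of the "maps naturally to code" point, two plain Python dicts (hypothetical ship documents, standing in for stored BSON) can live in the same collection with different fields and nested structure:

```python
# Two hypothetical ship documents: same collection, different shapes.
gotheborg = {
    "Name": "Götheborg",
    "Year Completed": 1738,                   # integer
    "Sail Area": {"Lateen mizzen": 160},      # nested sub-document
}
batavia = {
    "Name": "Batavia",
    "Year Completed": 1628,
    "Sail Area": {"Lateen mizzen": 250, "Main topsail": 3500},
    "Cannons": 24,                            # a field the other document lacks
}

# Both map directly to and from code objects -- no fixed schema required.
for doc in (gotheborg, batavia):
    print(sorted(doc.keys()))
```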
Document Model for Analytics
Flexibility helps with feature engineering by allowing experimentation and iterative feature selection.
For deep learning, that flexibility allows for faster iteration.
Data can be pre-filtered on the server with the aggregation framework.
Flexibility is Great if You're a Python or a Pirate…
Pirate Ships
Island Packet 31
Götheborg & Batavia
Sail Data
{
"Name": "Götheborg",
"Year Completed": 1738,
"Sail Area": {
"Lateen mizzen": 160,
"Mizzen topsail": 2500,
…
},
…
}
{
"Name": "Batavia",
"Year Completed": 1628,
"Sail Area": {
"Lateen mizzen": 250,
"Main topsail": 3500,
…
},
…
}
…but some tasks require data that's rigidly structured.
Data Frames
Data Frame
A data frame is a list of vectors, factors,
and/or matrices all having the same length
(number of rows in the case of matrices).
Used for storing data tables.
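The definition above can be illustrated with pandas: each column is a same-length vector, and together the columns form a table (ship names and figures taken from the sample data on the next slide).

```python
import pandas as pd

# Each column is a same-length vector; together they form the data frame.
ships = pd.DataFrame({
    "length_ft": [186, 280, 190],
    "year_completed": [1628, 1869, 1738],
    "sail_area_sqft": [33000, 32000, 21140],
}, index=["Batavia", "Cutty Sark", "Götheborg"])

print(ships.shape)  # (3, 3)
```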
Data Frame Data
Length (ft) Year Completed Displacement Sail Area (sq. ft)
Batavia 186 1628 1200 33000
Cutty Sark 280 1869 2100 32000
Götheborg 190 1738 NaN 21140
HMS Endeavor 97.75 1764 NaN 29889
Kruzenshtern 375 1926 3064 NaN
HMS Victory 227.5 1765 3500 58556
Distributed Data Frames
Big data distributed across clusters, handled by dedicated distributed data frame tools.
R vs. Python
R Data Frame
The R data frame is more or less built into the language.
More functional in style than Python.
More statistical support in general.
Python Data Frame
More object-oriented.
Relies on packages (pandas, NumPy, scikit-learn).
As a language, it's great for additional tasks alongside analytics.
Language Usage
MongoDB to Data Frames
Sail Data
{
"Lateen mizzen": 120,
"Mizzen topsail": 400,
"Mainsail": 2500,
…
}
{
"Lateen mizzen": 120,
"Main topgallant": 600,
"Mainsail": 3500,
…
}
library(mongolite)
connection <- mongo(collection = "sails",
                    db = "ships",
                    url = "mongodb://localhost")
sails <- data.frame(connection$find())
MongoDB Data to Data Frames
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.ships
df = pd.DataFrame(list(db.sails.find()))
Data Frame
Python (pandas):
  Lateen mizzen Main topgallant Mainsail Mizzen topsail _id
0 120.0         NaN             2500     400.0          5cf53129984ae2b600701611
1 120.0         600.0           3500     NaN            5cf53141984ae2b600701612

R:
  Lateen mizzen Mizzen topsail Mainsail Main topgallant
1 120           400            2500     NA
2 120           NA             3500     600
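The `_id` column rarely belongs in an analytics frame. A minimal sketch of removing it (with a hard-coded list of documents standing in for the `find()` result, and string stand-ins for the ObjectId values):

```python
import pandas as pd

# Documents as pymongo's find() would return them (simulated here).
docs = [
    {"_id": "5cf53129984ae2b600701611", "Lateen mizzen": 120,
     "Mizzen topsail": 400, "Mainsail": 2500},
    {"_id": "5cf53141984ae2b600701612", "Lateen mizzen": 120,
     "Main topgallant": 600, "Mainsail": 3500},
]

# Drop the _id column after loading; alternatively, exclude it server-side
# with a projection: db.sails.find({}, {"_id": 0})
df = pd.DataFrame(docs).drop(columns="_id")
print(df)
```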
Array Consumption
Array Data
Pattern 1 – Arrays of Arrays
sail_area = [
[160, 2500, 1600, 450],
[250, 3500, 2800, 575],
[120, 3500, 295]
]
0 1 2 3
0 160 2500 1600 450
1 250 3500 2800 575
2 120 3500 295 NaN
Resulting Data Frame
pd.DataFrame(sail_area)
-or-
data.frame(sail_area)
pd.DataFrame(list(db.sail_area.find()))
-or-
data.frame(connection$find())
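For Pattern 1, note what happens when the inner arrays have unequal lengths: pandas pads the shorter rows with NaN, exactly as in the resulting frame above.

```python
import pandas as pd

sail_area = [
    [160, 2500, 1600, 450],
    [250, 3500, 2800, 575],
    [120, 3500, 295],          # one value short -- padded with NaN
]

df = pd.DataFrame(sail_area)
print(df)
```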
Array Data
Pattern 2
[
{"lateen mizzen": 160, "main topsail": 2500, "mainsail": 1600, "main topgallant": 450},
{"lateen mizzen": 250, "main topsail": 3500, "mainsail": 2800, "main topgallant": 575},
{"lateen mizzen": 120, "mainsail": 3500, "jib": 450}
]
_id jib lateen mizzen main topsail mainsail main topgallant
0 5cf5607b984ae2b600701613 NaN 160.0 2500.0 1600.0 450.0
1 5cf5609f984ae2b600701614 NaN 250.0 3500.0 2800.0 575.0
2 5cf560d4984ae2b600701615 450.0 120.0 NaN 3500.0 NaN
Resulting Data Frame
Array Data
Pattern 3
[
{"area": [160, 2500, 1600, 450]},
{"area": [250, 3500, 2800, 575]},
{"area": [120, 3500, 295]}
]
_id area
0 5cf57574984ae2b600701623 [160.0, 2500.0, 1600.0, 450.0]
1 5cf57584984ae2b600701624 [250.0, 3500.0, 2800.0, 575.0]
2 5cf5758f984ae2b600701625 [120.0, 3500.0, 295.0]
Resulting Data Frame
Are our hopes lost?
Moving Data from MongoDB Arrays
Array Data
[
{"name": "Batavia", "area": [160, 2500, 1600, 450]},
{"name": "Götheborg", "area": [250, 3500, 2800, 575]},
{"name": "HMS Endeavor", "area": [120, 3500, 295]}
]
library(mongolite)
connection <- mongo(collection = "sails",
                    db = "ships",
                    url = "mongodb://localhost")
sails <- data.frame(connection$find())
Working with MongoDB Arrays
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.ships

values = []
for ship in db.sailareas.find():
    values.append(ship['area'])
print(pd.DataFrame(values))
Array Data
[
{"name": "Batavia", "area": [160, 2500, 1600, 450]},
{"name": "Götheborg", "area": [250, 3500, 2800, 575]},
{"name": "HMS Endeavor", "area": [120, 3500, 295]}
]
Resulting Data Frame
0 1 2 3
0 160.0 2500.0 1600.0 450.0
1 250.0 3500.0 2800.0 575.0
2 120.0 3500.0 295.0 NaN
library(mongolite)
connection <- mongo(collection = "sails",
                    db = "ships",
                    url = "mongodb://localhost")
sails <- data.frame(connection$find())
Working with MongoDB Arrays
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.ships

values = []
seriesLabels = []
for ship in db.sailareas.find():
    values.append(ship['area'])
    seriesLabels.append(ship['name'])
print(pd.DataFrame(values, index=seriesLabels))
Array Data
[
{"name": "Batavia", "area": [160, 2500, 1600, 450]},
{"name": "Götheborg", "area": [250, 3500, 2800, 575]},
{"name": "HMS Endeavor", "area": [120, 3500, 295]}
]
Resulting Data Frame
0 1 2 3
Batavia 160.0 2500.0 1600.0 450.0
Götheborg 250.0 3500.0 2800.0 575.0
HMS Endeavor 120.0 3500.0 295.0 NaN
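The append loop above can also be collapsed into a single comprehension-based construction. A sketch, with a hard-coded list of documents standing in for the `find()` cursor:

```python
import pandas as pd

# Simulated documents, as db.sailareas.find() might return them.
docs = [
    {"name": "Batavia", "area": [160, 2500, 1600, 450]},
    {"name": "Götheborg", "area": [250, 3500, 2800, 575]},
    {"name": "HMS Endeavor", "area": [120, 3500, 295]},
]

# One-step alternative to the append loop: expand the list field
# directly, using the ship names as the index.
df = pd.DataFrame([d["area"] for d in docs],
                  index=[d["name"] for d in docs])
print(df)
```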
The Power of
MongoDB
Aggregation Framework
Aggregation Framework
• Pre-filter and/or pre-aggregate data on the server before
moving it across the network.
• Reduces the amount of data in the data frame.
• Improves performance.
Sample Data
             Country  Year Completed  Displacement  Individual Sail Areas (sq. ft)
Batavia NLD 1628 1200 [292, 2012, 990, 550, 403, 642, 1056, ...]
Cutty Sark GBR 1869 2100 [2408, 866, 155, 2041, 518, 1675, …]
Götheborg SWE 1738 NaN [315, 614, 314, 2451, 2096, 2477, …]
HMS Endeavor GBR 1764 NaN [1060, 2089, 1101, 420, 2320, 2245]
Kruzenshtern DEU 1926 3064 [1476, 1352, 2383, 1100, 1807, 448, 2415]
HMS Victory GBR 1765 3500 [1310, 2445, 1327, 1668, 2098, 2179, …]
from datetime import datetime, timezone

values = []
seriesLabels = []
for ship in db.ships.aggregate([
    {
        '$match': {
            'year_completed': {
                '$gte': datetime(1571, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
            }
        }
    },
Aggregation Pipeline 1
{
'$match': {
'year_completed': {
'$lt': datetime(1862, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
}
}
}, {
'$match': {
'country_of_origin': {
'$ne': 'USA'
}
}
},
Aggregation Pipeline 2
{
'$project': {
'name': 1,
'country_of_origin': 1,
'sail_areas': 1,
'total_sails': {
'$cond': {
'if': {
'$isArray': '$sail_areas'
},
'then': {
'$size': '$sail_areas'
},
'else': 'NA'
}
}}},
Aggregation Pipeline 3
    {
        '$project': {
            'name': 1,
            'country_of_origin': 1,
            'total_sails': 1,
            'total_area': {
                '$sum': '$sail_areas'
            }
        }
    }
]):
    values.append(ship['total_area'])
    seriesLabels.append(ship['name'])
dataframe = pd.DataFrame(values, index=seriesLabels)
Aggregation Pipeline 4
Results
0
La Amistad 19335
Batavia 31246
Götheborg 18464
HMS Endeavour 9235
Golden Hind 11749
Grand Turk 8405
Kalmar Nyckel 3710
Lady Nelson 24938
Pallada 20785
Shtandart 9363
HMS Sultana 35061
HMS Surprise 26272
HMS Trincomalee 21744
HMS Victory 14740
Other Sessions
Today
1:00pm Real-time Clinical Decision Support System – Prem Timisina & Arash Kia
2:00 Analytics with MongoDB – Stuart Shiell & Mark Clancy
2:00 A Complete Methodology to Data Modeling for MongoDB – Daniel Coupal
3:15 Unleash the Power of the MongoDB Aggregation Framework – Abhishek Bagga
Tomorrow
9:00am Best Practices for Working with IoT and Time-series Data – Robert Walters
3:00pm MongoDB in Data Science – Vigen Sahakyan
Takeaways
MongoDB's flexible data model is very powerful for data analytics.
Some analytic tools require a more structured approach.
When modeling your data, the schema design you choose can have a huge
impact on analytics.
Use MongoDB's Aggregation Framework to improve performance.
Thank You!
Ken W. Alger - @kenwalger
MongoDB World 2019: Exploring your MongoDB Data with Pirates (R) and Snakes (Python)
