Overnight to 60 Seconds
An IOT ETL Performance Case Study
Preventing Insanity
An IOT ETL Performance Case Study
Kevin Arhelger
Senior Technical Services Engineer
MongoDB
@kevarh
About Me
• At MongoDB since January 2016
• Senior Technical Services Engineer - I answer your support questions.
• Performance Driven - Software performance and benchmarking for the last decade
• New to MongoDB, but not performance
• Loves data
• Programming Polyglot
Disclaimer
• This is my personal journey
• I made lots of mistakes
• You are probably smarter than me
• (I’m hopefully smarter than I was two years ago)
My Project
• I’ve been collecting Water/Electric meter data since February 2015.
• Now that I work at a database company, maybe I should put this in a database?
• See what I can learn about my consumption.
• Get access to my meter data on the internet.
IOT
• Internet of things
• I want my things (meters) to be connected to the Internet
• This would let me remotely monitor my utilization
Utility Meter
• 900 MHz Radio
• Broadcasts consumption every few minutes
Radio
● Software Defined Radio
● Open source project rtlamr, written in Go by Douglas Hall
● Reads meter data and exports JSON
Single Board Computer
Odroid C2 - Ubuntu 16.04
Quad-core ARM at 1.5 GHz
More than enough horsepower
Complete Setup
ETL
• Extract, Transform, Load
• Not in the traditional sense (not already in another database)
• Many of the same characteristics
• Convert between formats
• Reading all the data quickly
• Inserting into another database
Tabular Schema
Time ID Type Tamper Consumption CRC
2017-06-14T... 20289211 3 00:00 5357 0xA409
2017-06-14T... 20289211 3 00:00 5358 0x777B
2017-06-14T... 20289211 3 00:00 5359 0x4132
2017-06-14T... 20289211 3 00:00 5360 0x8707
2017-06-14T... 20289211 3 00:00 5361 0x59FA
2017-06-14T... 20289211 3 00:00 5362 0x559E
2017-06-14T... 20289211 3 00:00 5363 0x8B63
The Plan: Simple Tools
mongoimport
Looks Like JSON but isn’t
{Time:2017-06-14T10:06:47.225
SCM:{ID:20289211 Type: 3
Tamper:{Phy:00 Enc:00}
Consumption: 53557
CRC:0xA409}}
Data Cleaning
#!/bin/bash
# Reshape rtlamr's almost-JSON output into JSON that mongoimport accepts.
# (gsed = GNU sed; backslashes restored where the slide export dropped them.)
cat - | grep -E '^{' |
  sed -e 's/Time.*Time/{Time/g' |
  sed -e 's/:00,/:@/g' |
  gsed -e 's/\s\+/ /g' |
  sed -e 's/[{}]//g' |
  sed -e 's/SCM://g' |
  sed -e 's/Tamper://g' |
  sed -e 's/^/{/g' |
  sed -e 's/$/}/g' |
  gsed -e 's/: \+/:/g' |
  sed -e 's/ /, /g' |
  sed -e 's/, }/}/g' |
  sed -e 's/Time:\([^,]*\),/Time:{"$date":"\1Z"},/g' |
  gsed -e 's/:0\+\([1-9][0-9]*,\)/:\1/g' |
  sed -e 's/\([^0-9]\):0\([^x]\)/\1:\2/g' |
  sed -e 's/Time/time/g' |
  sed -e 's/ID/id/g' |
  sed -e 's/Consumption/consumption/g' |
  sed -e 's/:@,/:0,/g' |
  sed -e 's/Type:,/Type:0,/g' |
  grep -v 'consumption:,'
Post Cleaning
{
time: {"$date": "2017-06-14T10:06:47.225"},
id: 20289211,
Type: 3,
Phy: 0,
Enc: 0,
consumption: 53557,
CRC: 0xA409
}
The Plan: Use Simple Tools
mongoimport
Redundant Data!
• The meters send readings every few minutes.
• Most readings simply repeat the previous value rather than carry new information.
• We only care about the first change.
2015-02-13T18:01:09.079 Consumption: 5048615
2015-02-13T18:02:11.272 Consumption: 5048621
2015-02-13T18:03:14.093 Consumption: 5048621
2015-02-13T18:04:13.155 Consumption: 5048621
2015-02-13T18:05:10.849 Consumption: 5048621
2015-02-13T18:06:11.668 Consumption: 5048623
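That first-change rule takes only a few lines. A minimal Go sketch, assuming a simple Reading type (the names here are illustrative, not the actual tool's):

package main

import "fmt"

type Reading struct {
	Time        string
	Consumption int64
}

// dedupe keeps the first reading of each new consumption value and
// drops the repeats that follow it.
func dedupe(in []Reading) []Reading {
	var out []Reading
	last := int64(-1) // sentinel: real consumption values are non-negative
	for _, r := range in {
		if r.Consumption != last {
			out = append(out, r)
			last = r.Consumption
		}
	}
	return out
}

func main() {
	sample := []Reading{
		{"2015-02-13T18:01:09.079", 5048615},
		{"2015-02-13T18:02:11.272", 5048621},
		{"2015-02-13T18:03:14.093", 5048621},
		{"2015-02-13T18:05:10.849", 5048621},
		{"2015-02-13T18:06:11.668", 5048623},
	}
	// Keeps only the 18:01, 18:02, and 18:06 readings.
	for _, r := range dedupe(sample) {
		fmt.Println(r.Time, r.Consumption)
	}
}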
The Plan: Use Simple Tools
mongoimport
It Works! (Sort of)
• Entire import process takes overnight (around four hours)
• Read 10.6GB
• Inserts 90,840,510 documents
Problem: Queries
• Queries for monthly, daily, and day-of-week consumption all look similar.
• Generate ranges, grab a pair of readings, calculate the difference.
• Aggregation isn’t a great match.

before = db.getSiblingDB("meters").mine.find(
        {"scm.id": myid, time: {$lte: begin}})
    .sort({time: -1}).limit(1).toArray()[0];
after = db.getSiblingDB("meters").mine.find(
        {"scm.id": myid, time: {$gte: end}})
    .sort({time: 1}).limit(1).toArray()[0];
consumption = after.consumption - before.consumption;
Problem: Missing Data
Missed Readings?
Power Outages?
Results could be far removed from actual usage.
Problem: Displaying Data
Last 24 hours
Problem: Displaying Data
• Requires multiple calls to the database
• Could be off depending on when we see readings

before = db.getSiblingDB("meters").mine.find(
        {"scm.id": myid, time: {$lte: begin}})
    .sort({time: -1}).limit(1).toArray()[0];
readings = db.getSiblingDB("meters").mine.find({
        "scm.id": myid,
        time: {$gte: before.time}})
    .sort({time: 1})
    .toArray();
var previous = readings.shift();
var hourly = [];
readings.forEach(reading => {
    if (hourly.length > 24) return;
    if (reading.time.getHours() != previous.time.getHours()) {
        hourly.push(reading.consumption - previous.consumption);
        previous = reading;
    }
});
Problems
Requirements
Cleaning Data is Easy
No Duplicates
Daily Consumption
Weekly Consumption
Compare Days
Calculate Utility Bill
Fast
Performance: Rewrite in Go
• More control over cleaning our data
• Driver allows easy batch insertion
• Split into multiple workers (goroutines) to distribute insertion load
• Take advantage of all our cores
Pipeline: Read File → Clean Lines → Lines to Documents → Batch Insertion Routines (several running in parallel)
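A sketch of the fan-out stage of that pipeline, assuming cleaned documents arrive on a channel and each worker flushes fixed-size batches (the Document type, sizes, and insertMany callback are placeholders, not the tool's real API):

package main

import (
	"fmt"
	"sync"
)

// Document stands in for one cleaned meter reading ready to insert.
type Document struct{ Line string }

// runInserters fans documents out to n worker goroutines. Each worker
// accumulates batchSize documents and hands the batch to insertMany,
// which in the real tool would be a driver bulk-insert call.
func runInserters(docs <-chan Document, n, batchSize int, insertMany func([]Document)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			batch := make([]Document, 0, batchSize)
			for d := range docs {
				batch = append(batch, d)
				if len(batch) == batchSize {
					insertMany(batch)
					batch = batch[:0]
				}
			}
			if len(batch) > 0 {
				insertMany(batch) // flush the final partial batch
			}
		}()
	}
	wg.Wait()
}

func main() {
	docs := make(chan Document, 1024)
	go func() {
		for i := 0; i < 10; i++ {
			docs <- Document{Line: fmt.Sprint("reading ", i)}
		}
		close(docs)
	}()
	runInserters(docs, 4, 3, func(b []Document) {
		fmt.Println("inserted batch of", len(b))
	})
}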
Taking a Step Back
Tabular Data
Time ID Type Tamper Consu... CRC
2017-06... 20289211 3 00:00 5357 0xA409
2017-06... 20289211 3 00:00 5358 0x777B
2017-06... 20289211 3 00:00 5359 0x4132
2017-06... 20289211 3 00:00 5360 0x8707
2017-06... 20289211 3 00:00 5361 0x59FA
2017-06... 20289211 3 00:00 5362 0x559E
2017-06... 20289211 3 00:00 5363 0x8B63
CHANGE THE SCHEMA!
• The schema I started with didn’t meet my requirements.
• I resisted this change because it required additional application work (writing my own ETL tool).
• Think about how you will use your data!
New Schema
{
"_id" : ObjectId("54de8229791e4b133c000052"),
"meter" : 29026302,
"date" : new Date("2015-02-13T17:00:57"),
"hour" : 17,
"weekday" : 5,
"day" : 13,
"month" : 2,
"year" : 2015,
"consumption" : 5.526729939432698,
"begin" : 50480.575901639124,
"end" : 50486.10263157856,
"before" : {...},
"after" : {...},
"readings" : [ ... ]
}
One document per hour
• This makes hourly, daily, and weekly calculations easier.
• Easy cutoff for insertion: wait until the hour has passed, then insert the document.
{
"_id" : ObjectId("54de8229791e4b133c000052"),
"meter" : 29026302,
"date" : ISODate("2015-02-13T23:00:57Z"),
"consumption" : 5.526729939432698,
"begin" : 50480.575901639124,
"end" : 50486.10263157856,
...
}
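One way to implement that cutoff, sketched in Go: buffer readings for the current hour and emit the bucket once a reading from a later hour arrives (type and function names are assumptions, not the tool's):

package main

import (
	"fmt"
	"time"
)

type Reading struct {
	Time        time.Time
	Consumption float64
}

// bucketByHour groups a time-ordered stream of readings into hourly
// buckets. A bucket is emitted only once a reading from a later hour
// arrives, which is the "wait until an hour passes" cutoff.
func bucketByHour(in <-chan Reading, out chan<- []Reading) {
	var bucket []Reading
	var current time.Time
	for r := range in {
		hour := r.Time.Truncate(time.Hour)
		if !hour.Equal(current) && len(bucket) > 0 {
			out <- bucket
			bucket = nil
		}
		current = hour
		bucket = append(bucket, r)
	}
	if len(bucket) > 0 {
		out <- bucket // flush the final, possibly partial, hour
	}
	close(out)
}

func main() {
	in := make(chan Reading)
	out := make(chan []Reading)
	go bucketByHour(in, out)
	go func() {
		start := time.Date(2015, 2, 13, 23, 0, 57, 0, time.UTC)
		for i := 0; i < 5; i++ {
			in <- Reading{Time: start.Add(time.Duration(i) * 20 * time.Minute)}
		}
		close(in)
	}()
	for b := range out {
		fmt.Println(len(b), "readings in hour", b[0].Time.Truncate(time.Hour))
	}
}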
Store a before and after reading
• Used in our ETL tool
• Linear interpolation from these values projects what the start and end readings would have been.
• Included for completeness, but otherwise unnecessary. These fields are never queried and could be omitted.
"before" : {
"date" :
ISODate("2015-02-13T22:59:56Z"),
"consumption" : 50480.57,
"delta" : 5.785714347892832
},
"after" : {
"date" :
ISODate("2015-02-14T00:00:11Z"),
"consumption" : 50486.12,
"delta" : 5.68421056066001
}
…
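The projection itself is plain linear interpolation between the surrounding readings. A sketch using the before/after values from this slide (the helper function is illustrative):

package main

import (
	"fmt"
	"time"
)

// interpolate estimates the meter value at time t on the straight line
// through two surrounding readings (t0, v0) and (t1, v1).
func interpolate(t0 time.Time, v0 float64, t1 time.Time, v1 float64, t time.Time) float64 {
	span := t1.Sub(t0).Seconds()
	if span == 0 {
		return v0
	}
	frac := t.Sub(t0).Seconds() / span
	return v0 + frac*(v1-v0)
}

func main() {
	before := time.Date(2015, 2, 13, 22, 59, 56, 0, time.UTC) // consumption 50480.57
	after := time.Date(2015, 2, 14, 0, 0, 11, 0, time.UTC)    // consumption 50486.12
	boundary := time.Date(2015, 2, 13, 23, 0, 0, 0, time.UTC)
	// Prints roughly 50480.576, the hour's projected "begin" value.
	fmt.Printf("begin ≈ %.3f\n", interpolate(before, 50480.57, after, 50486.12, boundary))
}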
Embed readings
• We may want to graph usage within the hour, so store the raw values.
• Store deltas to make our life easier later.
"readings" : [
{
"date" :
ISODate("2015-02-13T23:00:57Z"),
"consumption" : 50480.66,
"delta" : 5.311475388465158
},
{
"date" :
ISODate("2015-02-13T23:02:00Z"),
"consumption" : 50480.75,
"delta" : 5.142857168616757
} ...
Split out Time
• Splitting out the hour, day, month, year, and day of week makes for easy queries.
• Aggregation is easy and fast since a $dayOfMonth projection isn’t required.
• We can now use a simple aggregation to explore by year, month, week, day, and hour.
{
...
"date" : new Date("2015-02-13T17:00:57"),
"hour" : 17,
"weekday" : 5,
"day" : 13,
"month" : 2,
"year" : 2015,
…
}
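Deriving those fields in the ETL tool is a one-liner per field. A sketch with field names matching the schema above (the splitTime helper itself is an assumption):

package main

import (
	"fmt"
	"time"
)

// splitTime derives the denormalized time fields stored on each hourly
// document, so queries can match plain integers instead of running
// date operators like $dayOfMonth at query time. Go's Weekday counts
// Sunday = 0, which matches the weekday shown above (Friday = 5).
func splitTime(t time.Time) map[string]int {
	return map[string]int{
		"hour":    t.Hour(),
		"weekday": int(t.Weekday()),
		"day":     t.Day(),
		"month":   int(t.Month()),
		"year":    t.Year(),
	}
}

func main() {
	d := time.Date(2015, 2, 13, 17, 0, 57, 0, time.UTC)
	fmt.Println(splitTime(d)) // map[day:13 hour:17 month:2 weekday:5 year:2015]
}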
Split out Time: Benefits
Queries: Daily Consumption
• Grab the convenient fields
• Sum the consumption

daily = db.getSiblingDB("meters").mine.aggregate([
    {$match: {
        meter: myid,
        year: 2018,
        month: 8,
        day: 26
    }},
    {$group: {_id: 1, total: {$sum: "$consumption"}}}
]).toArray()[0].total;
Queries: 24 Hour Graph
• Filter by the meter’s id
• Sort based on date
• 24 documents returned for graphing
• Already binned on hour boundaries

last24 = db.getSiblingDB("meters").mine.find(
        {meter: 29026302},
        {consumption: 1, date: 1})
    .sort({date: -1})
    .limit(24)
    .toArray();
Problems Revisited
Requirements
Cleaning Data is Easy
No Duplicates
Daily Consumption
Weekly Consumption
Compare Days
Calculate Utility Bill
Fast
PERFORMANCE!
Changing schema was
the single biggest
performance win
Performance by the numbers
• 4 hours to 3 minutes
• Deduplication process eliminates 202 minutes
• Data cleaning process eliminates 24 minutes
• Parallel insertion eliminates 11 minutes
• 90,840,510 Readings to 436,477
• 90,840,510 Docs to 31,396
• 10.6 GB file to 13 MB compressed WiredTiger data (31 MB uncompressed)
Getting from 180 to 60 seconds
• Buffer input heavily so we are never waiting on IO
• Perform simple checks to avoid unnecessary whitespace stripping
• Use fixed-string parsing instead of regexes
• Tune batch sizes and worker counts to keep the system busy
• Optimistically encode documents to reduce encoding overhead
• Batch Go channel sends to reduce per-send overhead
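Two of these points are easy to show concretely. A Go sketch of heavy input buffering plus batched channel sends (the file name, buffer size, and batch size are assumptions, not the tool's real values):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// readBatched illustrates two optimizations from the list above: a
// large scanner buffer so reads rarely wait on IO, and sending lines
// down the channel in slices instead of one at a time to amortize
// channel overhead.
func readBatched(path string, out chan<- []string) error {
	defer close(out)
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // 1 MiB buffer
	batch := make([]string, 0, 256)
	for sc.Scan() {
		batch = append(batch, sc.Text())
		if len(batch) == cap(batch) {
			out <- batch
			batch = make([]string, 0, 256)
		}
	}
	if len(batch) > 0 {
		out <- batch // flush the final partial batch
	}
	return sc.Err()
}

func main() {
	lines := make(chan []string, 8)
	go func() {
		if err := readBatched("meters.log", lines); err != nil { // hypothetical input file
			log.Println(err)
		}
	}()
	total := 0
	for batch := range lines {
		total += len(batch)
	}
	fmt.Println("read", total, "lines")
}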
Complexity Vs. Performance
Flame Graph: Before
Flame Graph: After
Key Takeaways
• Follow best practices
• Batch writes improve throughput by reducing roundtrips
• Multiple insertion workers remove roundtrip bottleneck
• Design your schema so you can easily access your data
• Understand the big picture
• You can treat database performance like any other software problem.
• Tabular data isn’t a great way to represent many problems.
What have I learned?
• My household consumes a lot of water
• Changed shower heads (30% savings)
• Changed water heater ($50 a month savings)
• When certain people are home, energy consumption rises
• Replaced light bulbs (a few dollars a month)
The Document Model Unleashes Data