Building a Monitoring
Framework Using DTrace
and MongoDB
Dan Kimmel
Software Engineer, Delphix
dan.kimmel@delphix.com
Background
● Building a performance monitoring
framework on illumos using DTrace
● It's monitoring our data virtualization engine
○ That means "database storage virtualization and
rigorous administration automation" for those who
didn't have time to study up on our marketing lingo
● Our users are mostly DBAs
● The monitoring framework itself is not
released yet
What to collect?
● DBAs have one performance metric they
care about for their database storage
○ I/O latency, because it translates to database I/O
latency, which translates to end-user happiness
● But to make the performance data
actionable, they usually need more than that
single measurement
○ Luckily, DTrace always has more data
Virtualized Database Storage*
* as most people imagine it
Database I/O path (diagram):
● Database Process (Oracle, SQLServer, others on the way)
● Network
● Storage Appliance (the Delphix Engine)
Virtualized Database Storage
Database I/O path (diagram):
● Database Process (Oracle, SQLServer, others on the way)
● Network-Mounted Storage Layer (NFS/iSCSI)
● Network
● Delphix FS
● Storage
Hosting layers shown around the path:
● Database Host OS (Windows, Linux, Solaris, *BSD, HP-UX, AIX)
● Delphix OS
● Hypervisor*
* Sometimes the DB host is running on a hypervisor too, or even on the same hypervisor
Latency can come from anywhere
Database I/O path (diagram, same layers as the previous slide):
● Database Process (Oracle, SQLServer, others on the way)
● Network-Mounted Storage Layer (NFS/iSCSI)
● Network
● Delphix FS
● Storage
Hosting layers shown around the path:
● Database Host OS (Windows, Linux, Solaris, *BSD, HP-UX, AIX)
● Delphix OS
● Hypervisor
Bottlenecks (left side of the diagram):
● Out of memory? Out of CPU? (appears three times)
● Out of bandwidth?
● Out of IOPS? Out of bandwidth?
Sources of latency (right side of the diagram):
● NFS client latency
● Network latency
● Queuing latency
● FS latency
● Device latency
Investigation Requirements
We want users to be able to dig deeper during a
performance investigation.
● Show many different sources of latency and
show many possible bottlenecks
○ i.e. collect data from all levels of the I/O stack
○ This is something that we're still working on, and
sadly, not all levels of the stack have DTrace
● Allow users to narrow down the cause within
one layer
○ Concepts were inspired by other DTrace-based
analytics tools from Sun and Joyent
Narrowing down the cause
After looking at a high-level view of the layers, a
user sees that NFS server latency has some slow
outliers.
1. NFS latency by client IP address
○ The client at 187.124.26.12 looks slowest
2. NFS latency for 187... by operation
○ Writes look like the slow operation
3. NFS write latency for 187... by whether the write is synchronous
○ Synchronous writes are slower than normal
How that exercise helped
● The user just learned a lot about the problem
○ The user might be able to solve it themselves by (for
instance) upgrading or expanding the storage we sit
on top of to handle synchronous writes better
○ They can also submit a much more useful bug report
or speak effectively to our support staff
● Saves them time, saves us time!
DTrace is the perfect tool
● To split results on a variable, collect the
variable and use it as an additional key in
your aggregations.
● To narrow down a variable, add a condition.
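// Pseudocode alert! "probe" stands in for a real probe, and "start" is the
// timestamp saved when the operation began, so latency = timestamp - start.
// 0. Overall latency distribution
probe { @latency = quantize(timestamp - start); }
// 1. Split by client IP address
probe { @latency[ip] = quantize(timestamp - start); }
// 2. Narrow down to one client, split by operation
probe /ip == "187..."/ {
    @latency[operation] = quantize(timestamp - start);
}
// 3. Narrow down to that client's writes, split by synchronous
probe /ip == "187..." && operation == "write"/ {
    @latency[synchronous] = quantize(timestamp - start);
}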
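To make the two ideas above concrete outside of D, here is a minimal Python sketch of what a keyed quantize() aggregation computes: each latency lands in a power-of-two bucket, the aggregation key splits the results, and a predicate narrows them. The function names, the event fields, and the second client address are assumptions made for illustration; this is not the framework's code.

```python
from collections import Counter, defaultdict


def quantize_bucket(value_ns):
    """Return the largest power of two <= value_ns (0 stays 0), like a quantize() row."""
    if value_ns < 0:
        raise ValueError("latency should be end - start, never negative")
    if value_ns == 0:
        return 0
    return 1 << (value_ns.bit_length() - 1)


def aggregate(events, key_fields=(), where=None):
    """Build {key tuple: Counter(bucket -> count)}, like @latency[keys] = quantize(...)."""
    result = defaultdict(Counter)
    for ev in events:
        if where is not None and not where(ev):
            continue  # plays the role of a D predicate
        key = tuple(ev[field] for field in key_fields)
        result[key][quantize_bucket(ev["end"] - ev["start"])] += 1
    return dict(result)


# Hypothetical sample events, timestamps in nanoseconds.
events = [
    {"ip": "187.124.26.12", "operation": "read", "start": 0, "end": 900},
    {"ip": "187.124.26.12", "operation": "write", "start": 0, "end": 5000},
    {"ip": "10.0.0.2", "operation": "write", "start": 100, "end": 1200},
]

# Step 1: split by client IP.
by_ip = aggregate(events, key_fields=("ip",))
# Step 2: narrow down to one client, split by operation.
by_op = aggregate(events, ("operation",), where=lambda e: e["ip"] == "187.124.26.12")
```

Adding a field to key_fields corresponds to adding an aggregation key, and tightening the where function corresponds to adding a condition.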
How we built "narrowing down"
● Templated D scripts for collecting data
internal to Delphix OS
● Allow the user to specify constraints on
variables in each template
○ Translate these into DTrace conditions
● Allow the user to specify which variables
they want to display
● Fill out a template and run the resulting
script
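The template-filling step described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the Delphix implementation: the render function, the ALLOWED_VARIABLES whitelist, and the shape of the generated clause are all hypothetical, and "probe" and "start" remain placeholders exactly as in the slide's pseudocode.

```python
# Variables a (hypothetical) template exposes to the user.
ALLOWED_VARIABLES = {"ip", "operation", "synchronous"}


def d_string(value):
    """Quote a Python string as a D string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def render(probe, constraints, display):
    """Fill out a one-clause D template.

    constraints: {variable: required value}, translated into a D predicate.
    display: variables to split on, used as aggregation keys.
    """
    for name in list(constraints) + list(display):
        if name not in ALLOWED_VARIABLES:
            raise ValueError(f"unknown variable: {name}")
    conditions = " && ".join(
        f"{name} == {d_string(value)}" for name, value in constraints.items()
    )
    predicate = f"/{conditions}/ " if conditions else ""
    keys = f"[{', '.join(display)}]" if display else ""
    return (
        f"{probe} {predicate}{{\n"
        f"    @latency{keys} = quantize(timestamp - start);\n"
        f"}}\n"
    )


script = render(
    "probe",
    {"ip": "187.124.26.12", "operation": "write"},
    ["synchronous"],
)
print(script)
```

Checking names against a whitelist and quoting values keeps user input from injecting arbitrary D into the script. A real template would also name an actual probe and record the start time before the result could be passed to dtrace -s.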
Enhancing Supportability
Our support staff hears this question frequently:
We got reports of slow DB accesses last
Friday, but now everything is back to normal.
Can you help us debug what went wrong?
Historical data is important too
● We always read a few system-wide statistics
● We store all readings into MongoDB
○ We're not really concerned about ACID guarantees
○ We don't know exactly what variables we will be
collecting for each collector ahead of time
○ MongoDB has a couple of features that are
specifically made for logging that we use
○ It was easy to configure and use
Storing (lots of) historical data
The collected data piles up quickly!
● Don't collect data too frequently
● Compress readings into larger and larger
time intervals as the readings age
○ We implemented this in the caller, but could have
used MongoDB's MapReduce as well
● Eventually, delete them (after ~2 weeks)
○ We used MongoDB's "time-to-live indexes" to handle
this automatically; they work nicely
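As a rough illustration of compressing readings into larger intervals in the caller, the sketch below merges latency histograms whose timestamps fall in the same coarser time bucket. The reading layout (collector, ts, histogram) and the roll_up name are assumptions for this example; the slides do not show the actual schema.

```python
from collections import Counter


def roll_up(readings, interval_s):
    """Merge readings into interval_s-second buckets by adding histogram counts."""
    merged = {}
    for reading in readings:
        # Align the timestamp down to the start of its coarser interval.
        bucket_start = reading["ts"] - reading["ts"] % interval_s
        key = (reading["collector"], bucket_start)
        merged.setdefault(key, Counter()).update(reading["histogram"])
    return [
        {
            "collector": collector,
            "ts": bucket_start,
            "interval_s": interval_s,
            "histogram": dict(histogram),
        }
        for (collector, bucket_start), histogram in sorted(merged.items())
    ]


# Hypothetical readings: ts in seconds, histogram maps bucket (ns) -> count.
readings = [
    {"collector": "nfs", "ts": 0, "histogram": {512: 3, 4096: 1}},
    {"collector": "nfs", "ts": 30, "histogram": {512: 2}},
    {"collector": "nfs", "ts": 60, "histogram": {1024: 5}},
]
per_minute = roll_up(readings, 60)
```

Power-of-two histograms merge without loss because bucket counts simply add, so only time resolution is given up as readings age.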
Dealing with the Edge Cases
● If an investigation takes too long, the
performance data it depends on could be
compressed or deleted while it is still ongoing
● Users can prevent data from being
compressed or deleted by explicitly saving it
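One way the deletion half of this could work with a time-to-live index is sketched below. It relies on documented MongoDB behavior: a TTL index only expires documents whose indexed field holds a date, so removing that field exempts a document. The field names created_at and saved_at and both helper functions are assumptions for illustration, and the slides do not say how saving was actually implemented. The collection argument is expected to behave like a pymongo Collection.

```python
from datetime import datetime, timezone

TWO_WEEKS_S = 14 * 24 * 60 * 60


def ensure_ttl_index(collection):
    """Ask MongoDB to delete readings about two weeks after their created_at date."""
    return collection.create_index("created_at", expireAfterSeconds=TWO_WEEKS_S)


def save_reading(collection, reading_id):
    """Pin a reading: without the indexed date field, TTL expiry skips it."""
    return collection.update_one(
        {"_id": reading_id},
        {
            "$unset": {"created_at": ""},
            "$set": {"saved_at": datetime.now(timezone.utc)},
        },
    )
```

Because compression happens in the caller rather than in MongoDB, the compression code would also need to skip documents that carry the saved_at marker.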
Summary
● We used DTrace to allow customers to dig
deeper on performance issues
○ Customers will love it*
○ Our support staff will love it*
* at least, that's the hope!
Thanks!

#lspe Building a Monitoring Framework using DTrace and MongoDB
