Don't Break the Camel's Back: Running MongoDB as Hard as Possible
MongoDB World, June 2019
Presenting Today
Jon Hyman

CTO and Co-Founder, Braze
@jon_hyman
Time is money.
The value of data to your business starts
deteriorating as soon as it's generated.
The new digital economy is on-demand and the connected consumer is always-on.
[Timeline chart (source: "JavaScript: The Machine Language of the Ambient Computing Era," Allen Wirfs-Brock): societal impact of computing, 1980–2030, across three eras. Corporate Computing: computers empower and enhance enterprise tasks. Personal Computing: computers empower and enhance individuals' tasks (e.g., Google Analytics, Silverpop, Eloqua, ExactTarget, Neolane, Responsys, Omniture). Ambient Computing: computers empower and enhance our environment (e.g., iOS, App Store, Chromecast, Alexa, TensorFlow, tvOS, FB Messenger).]

WALT MOSSBERG, AMERICAN JOURNALIST & FORMER RECODE EDITOR AT LARGE:
"I expect that one end result of all this work will be that the technology, the computer inside all these things, will fade into the background. In some cases, it may entirely disappear... This is ambient computing, the transformation of the environment all around us with intelligence and capabilities that don't seem to be there at all."
Braze empowers you to humanize your
brand-customer relationships at scale.
• Tens of billions of messages sent monthly
• Global customer presence on six continents
• More than 1 billion MAU
Today (TOC)
• MongoDB at Braze
• Message Sending Pipeline
• Monitoring Application Impact
• Summary and Q&A

MongoDB at Braze
MongoDB at Braze
•Main database at Braze, used for most
application models
•Most documents at Braze are user profiles
• End users of mobile apps, websites, or mailing lists
• Nearly 11 billion user profiles
•Over 1,200 shards across more than 65
different clusters
• Scaling is entirely for read and write throughput,
storage size is only tens of terabytes
•Across clusters, performing over 350,000
MongoDB operations per second
Challenges at Braze
[Diagram: load from the Braze APIs and message sending]
Users Collection Example

{
  _id: 123,
  first_name: "Jon",                                    // demographic data
  email: "jon@braze.com",
  custom: {                                             // custom data
    favorite_color: "blue",
    shoe_size: 11,
    linked_credit_card: true
  },
  campaign_summaries: {                                 // aggregated interaction data
    CampaignA: {
      last_received: Date('2019-06-01T12:00:03Z'),
      last_opened_email: Date('2019-06-01T12:03:19Z')
    }
  }
}
UserCampaignInteractionData Collection Example
{                                                       // longer history of interaction data
  _id: 123,
  emails_received: [                                    // message receipt data
    {
      date: Date('2019-06-01T12:00:03Z'),
      campaign: "CampaignA",
      dispatch_id: "identifier-for-send"
    },
    …
  ],
  android_push_received: [ … ],
  emails_opened: [
    {
      date: Date('2019-06-01T12:03:19Z'),
      campaign: "CampaignA",
      dispatch_id: "identifier-for-send"
    },
    …
  ]
}
Event processing and message sending
Analytics
•Braze provides real-time analytics on
interactions from campaigns
•Conversion processing to attribute
revenue and actions to campaign receipt
•Determining influenced interaction rates
Message Sending
•Frequency capping: Limit customers to
only receive certain types of campaigns
a fixed number of times
•Intelligent send-time optimization: feed interaction data into a model that picks
the best time of day to send to someone
This is possible with accurate summarized event documents per user (a sketch of such a check follows below).
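To make the frequency-capping check concrete, here is a minimal sketch against the summarized interaction document shown above. It is illustrative only: the local connection string, the user_campaign_interaction_data collection name, and the cap of 3 emails per 7 days are assumptions, not Braze's implementation.

require 'mongo'

client = Mongo::Client.new('mongodb://localhost:27017/braze_example')

# Returns true if the user may still receive an email under a hypothetical
# "at most `cap` emails per `window_days` days" rule, using the summarized
# emails_received array from the interaction document.
def under_email_frequency_cap?(client, user_id, cap: 3, window_days: 7)
  since = Time.now.utc - window_days * 24 * 60 * 60
  doc = client[:user_campaign_interaction_data]
          .find(_id: user_id)
          .projection(emails_received: 1)
          .first
  return true if doc.nil?

  recent = (doc['emails_received'] || []).count { |e| e['date'] >= since }
  recent < cap
end

puts under_email_frequency_cap?(client, 123) ? 'OK to send' : 'Cap reached, skip'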
We summarize event data in other collections as well
{
  _id: 123,
  sessions: [                                           // recent history of usage data
    Date('2019-06-01T11:53:03Z'),
    Date('2019-05-31T08:14:44Z'),
    Date('2019-05-31T08:02:11Z')
  ],
  custom_events: {                                      // recent history of behavioral data
    watched_video: [
      Date('2019-06-01T11:53:03Z'),
      Date('2019-05-31T08:14:44Z'),
      Date('2019-05-31T08:02:11Z')
    ]
  },
  purchases: [ … ]
}
Message Sending Pipeline
Message Sending Pipeline at Braze
Audience Segmentation → Business Logic Application → Integrity Check & Frequency Cap → Render Payloads → MongoDB Write → Deliver Messages
How can Braze send messages as fast
as possible without breaking the bank?
Money is money.
MongoDB Deployment Model
Three tiers of databases:
• Small clients: shared databases
• Medium clients: dedicated databases, shared cluster
• Large clients: dedicated databases, dedicated cluster
MongoDB Deployment Model
Benefits:
• Isolation for security and compliance
• Scalability, read and write throughput
• Maintenance improvements: issues or maintenance affecting one database may not affect other customers
Worker servers are associated with specific database(s), which lets us take down individual databases and pause processing (see the routing sketch below).
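A minimal sketch of the routing idea behind this deployment model: map each customer to the cluster and database that hosts it, and memoize one client per customer so workers only touch their assigned clusters. The URI map, hostnames, and helper name are illustrative assumptions, not Braze's code.

require 'mongo'

# Hypothetical mapping from customer to the cluster/database hosting it:
# small customers share a cluster, large customers get a dedicated one.
CLUSTER_URI_FOR_COMPANY = {
  'company_a' => 'mongodb://shared-cluster-1.example.com/company_a',
  'company_b' => 'mongodb://shared-cluster-1.example.com/company_b',
  'company_c' => 'mongodb://dedicated-cluster-7.example.com/company_c'
}.freeze

# Memoize one client per company so workers reuse connections.
MONGO_CLIENTS = Hash.new do |cache, company_id|
  cache[company_id] = Mongo::Client.new(CLUSTER_URI_FOR_COMPANY.fetch(company_id))
end

def users_collection(company_id)
  MONGO_CLIENTS[company_id][:users]
end

# A worker processing company_c only ever touches company_c's cluster, so pausing
# or maintaining that cluster does not affect other customers.
users_collection('company_c').find(email: 'jon@braze.com').first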
Improving Read Throughput
•Limit concurrency of unindexed queries with large
expected totalDocsExamined
•Take advantage of multi-cluster deployment model
•Statistical analysis on customer campaigns to create
new indexes with partialFilterExpression
•Find slow queries in system.profile focused around
campaign sending
•Do frequency analysis of those queries
•Use aggregations to determine the selectivity of those fields (see the index and selectivity sketch below)
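As a rough illustration of the last two points, the sketch below creates a hypothetical partialFilterExpression index and estimates its selectivity with a simple count. The field names follow the earlier users document; the connection string and campaign name are assumptions, not Braze's actual tooling.

require 'mongo'

client = Mongo::Client.new('mongodb://localhost:27017/braze_example')
users  = client[:users]

# Hypothetical partial index: only index users who have ever received CampaignA,
# so segmentation queries filtering on that campaign scan a much smaller index.
users.indexes.create_one(
  { 'campaign_summaries.CampaignA.last_received' => 1 },
  partial_filter_expression: {
    'campaign_summaries.CampaignA' => { '$exists' => true }
  }
)

# Estimate selectivity: what fraction of user documents the partial index covers.
matching = users.count_documents(
  'campaign_summaries.CampaignA' => { '$exists' => true }
)
total = users.estimated_document_count
puts format('CampaignA index covers %.2f%% of users', 100.0 * matching / total)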
Improving Write Throughput
•Add a lot of small shards to increase write scopes
• Many instances are 30-50 shards with small
amounts of disk
•Spread writes out to multiple collections with summary documents, and prune old data from existing documents (see the pruning sketch below)
•Use the cluster distribution model: if a customer’s write throughput is affecting other clients, move them to a new cluster
•Limit concurrency as necessary
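The pruning idea above can be expressed as a single update. Here is a minimal sketch with assumed collection and field names and an invented 500-entry cap, appending a new receipt event while trimming the array in the same write.

require 'mongo'

client = Mongo::Client.new('mongodb://localhost:27017/braze_example')
interactions = client[:user_campaign_interaction_data]

new_event = {
  date: Time.now.utc,
  campaign: 'CampaignA',
  dispatch_id: 'identifier-for-send'
}

# $push with $each + $slice appends the event and keeps only the newest 500
# entries, pruning old data as part of the same write.
interactions.update_one(
  { _id: 123 },
  { '$push' => {
      'emails_received' => { '$each' => [new_event], '$slice' => -500 }
  } },
  upsert: true
)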
Monitoring Application Impact
Monitoring Application Throughput
•We care about the application impact of load on MongoDB: some queueing may be okay, depending on whether the application is affected
•We want to prevent the application from being blocked indefinitely by queued operations
•We want to understand if certain MongoDB
clusters or databases are causing issues
•We patched the MongoDB driver to do both
these things
Driver Patches
Useful for returning control to the application
Patch Mongo::Socket#read_from_socket to allow custom, per-operation timeouts
Unedited Ruby Driver Code
deadline = (Time.now + timeout) if timeout
begin
  while retrieved < length
    retrieve = length - retrieved
    if retrieve > buf_size
      retrieve = buf_size
    end
    chunk = @socket.read_nonblock(retrieve, buf)
    # If we read the entire wanted length in one operation,
    # return the data as is which saves one memory allocation and
    # one copy per read
    if retrieved == 0 && chunk.length == length
      return chunk
    end
    # If we are here, we are reading the wanted length in
    # multiple operations. Allocate the total buffer here rather
    # than up front so that the special case above won't be
    # allocating twice
    if data.nil?
      data = allocate_string(length)
    end
    # ... and we need to copy the chunks at this point
    data[retrieved, chunk.length] = chunk
    retrieved += chunk.length
  end
rescue IO::WaitReadable
  select_timeout = (deadline - Time.now) if deadline
  if (select_timeout && select_timeout <= 0) || !Kernel::select([@socket], nil, [@socket], select_timeout)
    raise Timeout::Error.new("Took more than #{timeout} seconds to receive data.")
  end
  retry
end
Per-server deadline
Selects on this deadline
First, support a custom timeout
deadline = (Time.now + timeout) if timeout
begin
  while retrieved < length
    retrieve = length - retrieved
    if retrieve > buf_size
      retrieve = buf_size
    end
    chunk = @socket.read_nonblock(retrieve, buf)
    # If we read the entire wanted length in one operation,
    # return the data as is which saves one memory allocation and
    # one copy per read
    if retrieved == 0 && chunk.length == length
      return chunk
    end
    # If we are here, we are reading the wanted length in
    # multiple operations. Allocate the total buffer here rather
    # than up front so that the special case above won't be
    # allocating twice
    if data.nil?
      data = allocate_string(length)
    end
    # ... and we need to copy the chunks at this point
    data[retrieved, chunk.length] = chunk
    retrieved += chunk.length
  end
rescue IO::WaitReadable
  select_timeout = (deadline - Time.now) if deadline
  if (select_timeout && select_timeout <= 0) || !Kernel::select([@socket], nil, [@socket], select_timeout)
    raise Timeout::Error.new("Took more than #{timeout} seconds to receive data.")
  end
  retry
end
Set deadline per operation
First, support a custom timeout
Mongo::Socket.with_read_timeout(5) do
  # any read or write in here can only take 5 seconds
end
class ReadMaxTimeoutError < ::Timeout::Error
end

# Allows you to create a block in which you can set a maximum time for a
# query to wait for data on the socket
# @param timeout Integer timeout for the socket read
# @param exception_class_to_raise The exception class that will be raised if the
#   timeout is hit; the default will be Mongo::Socket::ReadMaxTimeoutError
def self.with_read_timeout(timeout, exception_class_to_raise = nil)
  existing_value = Thread.current[:mongo_query_read_timeout]
  existing_exception_value = Thread.current[:mongo_query_read_timeout_exception]
  begin
    Thread.current[:mongo_query_read_timeout] = timeout
    Thread.current[:mongo_query_read_timeout_exception] = exception_class_to_raise
    yield
  ensure
    Thread.current[:mongo_query_read_timeout] = existing_value
    Thread.current[:mongo_query_read_timeout_exception] = existing_exception_value
  end
end
modified_timeout = Thread.current[:mongo_query_read_timeout] || timeout
deadline = (Time.now + modified_timeout) if modified_timeout
begin
  while retrieved < length
    retrieve = length - retrieved
    if retrieve > buf_size
      retrieve = buf_size
    end
    chunk = @socket.read_nonblock(retrieve, buf)
    # If we read the entire wanted length in one operation,
    # return the data as is which saves one memory allocation and
    # one copy per read
    if retrieved == 0 && chunk.length == length
      return chunk
    end
    ...
Driver Patches
Get metrics on application issues
Patch Mongo::Socket#read_from_socket to log to StatsD as it is waiting
Modify IO::WaitReadable
LOG_TO_STATS_D_AFTER_EACH_OF_THESE_SECONDS = [
  2, 5, 10, 30, 60, 120, 180, 240, 300, 360, 420, 480
].freeze

rescue IO::WaitReadable
  ...
  # Log duration to StatsD here
  seconds_since_start_time = Time.now - start_time
  LOG_TO_STATS_D_AFTER_EACH_OF_THESE_SECONDS.each do |seconds|
    begin
      # We want to log once after 5 seconds, then after 10 seconds, then after 30 seconds, etc.
      # so keep track of if we have already logged for a given value
      if !already_logged_to_stats_d_at_seconds[seconds] && seconds_since_start_time > seconds
        already_logged_to_stats_d_at_seconds[seconds] = true
        mongo_port = Thread.current[:current_mongo_port]
        company_id = Thread.current[:current_company_id]
        StatsDAdapter.increment(
          "platform.all.mongo_slow_query".freeze,
          1,
          { :port => mongo_port, :company_id => company_id, :seconds => seconds }
        )
      end
    rescue => e
      Mongo::Logger.logger.info { "Caught error logging to StatsD: #{e.inspect}" }
    end
  end
  ...
end
Long-running queries by MongoDB cluster & company
[Graph: during a hardware incident that affected a handful of MongoDB clusters]
[Graph: a regular day-to-day example]
How does Braze use this information?
Braze response to long-running operations
•Graphs can inform whether or not to add more shards to a cluster or move a customer off
•Graphs can point toward certain campaigns, at certain times of day, to evaluate for new partialFilterExpression indexes
•Used as input to alerts and incident handling tools
Braze incident handling tools
• Built throttling system for campaign sending (i.e., database writes)
• Can control rate at which messages are picked up to be sent
• Alerts set up for high load on MongoDB, slow application
queries
• Ideal long-term solution is to build a Governor
• Use machine learning to throttle database writes based on application response time (a toy sketch follows below)
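As a loose illustration of the throttling idea (and far simpler than the proposed Governor), the sketch below maps an observed read-latency signal, such as the StatsD metrics from the driver patch, to a message pickup rate. The class, thresholds, and rates are invented for this example.

# Toy throttle: slow down how fast campaign send jobs are picked up when
# MongoDB reads (as reported by the patched driver's metrics) get slow.
class SendThrottle
  def initialize(base_rate_per_sec: 1_000)
    @base_rate_per_sec = base_rate_per_sec
  end

  # p95_read_ms: recent 95th-percentile socket read time, in milliseconds.
  def allowed_rate(p95_read_ms)
    case p95_read_ms
    when 0...50     then @base_rate_per_sec        # healthy: full speed
    when 50...250   then @base_rate_per_sec / 2    # queueing: back off
    when 250...1000 then @base_rate_per_sec / 10   # struggling: trickle
    else 0                                         # incident: pause pickup
    end
  end
end

throttle = SendThrottle.new
puts throttle.allowed_rate(180) # => 500 messages per second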
Summary
• MongoDB’s flexible schemas with non-scalar types enable Braze to quickly fetch information about end users
• Braze’s MongoDB deployment model allows Braze to match
customer resources with the appropriate amount of hardware
and shards
• Metrics from the driver for how long reads and writes are taking give Braze visibility into how its application is performing
Q&A
Thank you! We are hiring!
braze.com/careers