Neil Dahlke, Engineer
2016 November 4
Real-Time Analytics with
MemSQL and Spark
About Me: Neil Dahlke
▪ Engineer
▪ MemSQL
• real-time database for transactions / analytics
▪ Formerly Globus
• high performance data transfer for research scientists
▪ Past talks
• Real-time, Geospatial, Maps
▪ Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke
WHAT WE
ARE SEEING
A WORLD OF CONNECTED
MACHINES AND PEOPLE
WHAT WE ARE SEEING:
Sensors. Applications. Machines. And us.
Generating more data every single day.
By 2020, over 20 billion connected things will
be in use across a range of industries.
REAL-TIME INPUTS: Sensors, Logs, Events, Streaming
Inserts, Upserts, Queries
LIVE OUTPUTS: Dashboards, Business Intelligence, Applications, Predictive Analytics
WHAT DO REAL-TIME BUSINESSES NEED?
FAST DATA INGEST: The volume of data that can be ingested into the database
LOW LATENCY QUERIES: The time it takes to execute queries and receive results
HIGH CONCURRENCY: The ability to scale simultaneous operations
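As a rough illustration of how these three properties might be measured, the sketch below times a stand-in workload. The helper names (`ingest_rate`, `query_latency`, `run_concurrently`) and the in-memory list standing in for a database table are assumptions made for this example; they are not part of MemSQL or of any benchmark tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ingest_rate(rows, write):
    """Fast data ingest: rows written per second."""
    start = time.perf_counter()
    for row in rows:
        write(row)
    elapsed = time.perf_counter() - start
    return len(rows) / elapsed if elapsed > 0 else float("inf")

def query_latency(query):
    """Low latency queries: seconds between issuing a query and getting its result."""
    start = time.perf_counter()
    result = query()
    return result, time.perf_counter() - start

def run_concurrently(query, clients):
    """High concurrency: issue the same query from several simultaneous clients."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [pool.submit(query) for _ in range(clients)]
        return [f.result() for f in futures]

# Hypothetical stand-in for a database table.
table = []
rate = ingest_rate(list(range(10_000)), table.append)
count, latency = query_latency(lambda: len(table))
answers = run_concurrently(lambda: len(table), clients=8)
print(f"{rate:,.0f} rows/sec, {latency * 1e6:.1f} us latency, {len(answers)} concurrent clients")
```

Against a real database, `write` and `query` would wrap driver calls, and the numbers would be dominated by network and storage rather than Python overhead.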
A massively scalable database and ingest solution allowed for
massive growth, real-time analytic applications and faster, targeted.
Before
▪ Kafka
• Component we kept
▪ S3
• Persisted all logs to cold storage for eventual analysis
▪ Hadoop
• Nightly map-reduce jobs
▪ Redshift
• Took a full day to load data from previous day
• Load times reaching the point of overlap caused a data crisis
Why was this bad for their business operations?
▪ No real-time access to analytics
▪ No SQL interface for analysts and data scientists
▪ Massive nightly Hadoop batch jobs (late data)
▪ Unfiltered and incomplete data (silos)
▪ Expensive
Why was this bad for their data operations?
▪ Too slow
▪ Not scalable
▪ No deduplication
• aka not exactly-once
▪ Low concurrency
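One common way to get exactly-once results from a stream that only guarantees at-least-once delivery is to make each write idempotent, keyed on a unique event ID, so that a redelivered event is ignored. The sketch below shows that general technique only; it is not taken from the talk. `sqlite3` stands in for the real database, and the table and column names (`repins`, `event_id`, `pin_id`) are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in for the real database; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repins (event_id TEXT PRIMARY KEY, pin_id INTEGER)")

def ingest(batch):
    """Write a batch of events, skipping any event_id that is already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO repins (event_id, pin_id) VALUES (?, ?)",
            batch,
        )

batch = [("e1", 10), ("e2", 11), ("e3", 10)]
ingest(batch)
ingest(batch[1:])  # simulated redelivery after a consumer restart

stored = conn.execute("SELECT COUNT(*) FROM repins").fetchone()[0]
print(stored)  # 3: the redelivered events were not double-counted
```

The primary key does the deduplication, so the ingest code needs no separate bookkeeping of which events it has already seen.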
FAST DATA INGEST
LOW LATENCY QUERIES
HIGH CONCURRENCY
How It Works Now
After
TECHNICAL BENEFITS
▪ Instant accuracy to the latest repin
▪ 1 GB/sec totaling 72 TB/day
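The two throughput figures can be related with a quick calculation. Assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), 72 TB/day works out to an average of about 0.83 GB/sec, while a constant 1 GB/sec would total 86.4 TB/day. Reading "1 GB/sec" as a rounded or peak rate is an assumption; the slides do not say which.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
BYTES_PER_GB = 10**9             # decimal units assumed
BYTES_PER_TB = 10**12

daily_tb = 72
avg_gb_per_sec = daily_tb * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_GB
sustained_tb_per_day = 1 * BYTES_PER_GB * SECONDS_PER_DAY / BYTES_PER_TB

print(f"{avg_gb_per_sec:.2f} GB/sec average for 72 TB/day")
print(f"{sustained_tb_per_day:.1f} TB/day at a constant 1 GB/sec")
```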
THE PINTEREST REAL-TIME ARCHITECTURE
RESULTS
REAL-TIME ANALYTICS
Accelerated ingest time by 200,000x
1 GB/sec totaling 72 TB/day
Visualizing The Data
▪ Demo built using
• Mapbox
• WebSockets
• Tornado web server
▪ When an image is repinned, the circles on the globe expand, showing higher-volume areas
▪ Reads data from MemSQL directly
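The slides do not show how the demo sizes its circles. One plausible approach, sketched here as an assumption, is to bucket recent repin coordinates into grid cells and scale each circle with the cell's count. The function names, the 1-degree grid, and the sample coordinates are all hypothetical. In the demo, the rows would come from a MemSQL query and the radii would be pushed to the Mapbox front end over WebSockets.

```python
import math
from collections import Counter

def cell(lat, lon, size=1.0):
    """Snap a coordinate to the south-west corner of a size-degree grid cell."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)

def circle_radii(events, base=2.0, size=1.0):
    """Map each grid cell to a radius that grows with its repin count."""
    counts = Counter(cell(lat, lon, size) for lat, lon in events)
    # Square-root scaling keeps circle area proportional to volume.
    return {c: base * math.sqrt(n) for c, n in counts.items()}

# Hypothetical (lat, lon) pairs for recent repins.
events = [(37.77, -122.42), (37.80, -122.27), (37.34, -121.89), (40.71, -74.01)]
radii = circle_radii(events)
for c, r in sorted(radii.items()):
    print(c, round(r, 2))
```

Here the two events that fall in the same cell produce one larger circle, which is the "higher-volume area" effect the slide describes.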
DEMO
Questions?
More Info
▪ http://www.odbms.org/blog/2015/04/powering-big-data-at-pinterest-interview-with-krishna-gade/
▪ https://gigaom.com/2015/02/18/pinterest-is-experimenting-with-memsql-for-real-time-data-analytics/
▪ https://www.infoq.com/news/2015/03/pinterest-memsql-spark-streaming
▪ http://blog.memsql.com/pinterest-apache-spark-use-case/
▪ https://engineering.pinterest.com/blog/real-time-analytics-pinterest
Resources
▪ https://github.com/memsql/memsql-spark-connector
▪ http://docs.memsql.com/docs/streamliner-administration
▪ http://docs.memsql.com/docs/pipelines-overview
▪ https://github.com/memsql/memsql-docker-quickstart
Thank You

Editor's Notes

  • #12:
      - Distributed in-memory database
      - Built for real-time analytics and transactions
      - Familiar SQL interface
      - Spark integration out-of-the-box
      - Native Kafka ingestion
    What did they want to do?
      - Highly scalable infrastructure that collects, stores and processes user engagement data in real time
      - Higher-performance event logging
      - Reliable log transport and storage
      - Ability to query real-time data
  • #13:
      - User clicks Pin or repin
      - Event is pushed to Apache Kafka
      - Storm, Spark and other custom-built log readers process these events in real time
      - Log persistence service called Secor reliably writes these events to Amazon S3 (zero data loss, overcoming its weak eventual consistency model)
      - Self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing
      - In-house tools Singer (logger) & Secor (replicator) asynchronously replicate local logs from app servers to a centralized S3 location, using Kafka for transport
      - Kafka was great for throughput, but they needed a way to derive value, e.g. run SQL against these datasets in real time
      - A few days later this data would hit Redshift and be queryable
  • #14:
      - Took several days to access analytics and make them available to the data science team (too late for A/B testing, advertising)
      - No SQL interface
      - 5.5 M rows / second for one topic, 1.7 M rows / second for another, with the lowest throughput being 132k rows / second
      - Data needs to be filtered as well as enriched
      - At-least-once semantics
  • #17: Same notes as #13.
  • #19: Goes both ways
  • #20:
      - Easily repeatable success
      - Days to seconds
      - Now has a source of record for sharing relevant user engagement data and metrics with their data analysts and with key brands
      - Pinterest and their partners can get a better understanding of user behavior and provide more value to the Pinner community
      - Cheaper
      - The ability to identify (and react to) developing trends as they happen
      - Provides insight into how users are engaging with Pins across the globe in real time
      - Helps Pinterest become a better recommendation engine
      - SQL interface for engineering and data science teams
      - Fast ad-hoc query execution on real-time data
      - Allows the execution of SQL queries on the real-time events as they arrive
  • #21: Same notes as #20.
  • #26:
      - Pull up Ops
      - Pull up a terminal and create the database
      - Deploy Spark
      - Create a Streamliner pipeline
      - Create a Pipeline pipeline
      - Expose the UI
      - Ad-hoc queries, Tableau, and custom reporting
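
Note #14 says the data needs to be filtered as well as enriched before it is useful for analytics. A minimal sketch of such a transform step over one micro-batch might look like the following. The event fields (`type`, `pin_id`, `ip`), the IP-to-country lookup, and the function name `transform` are hypothetical; in the talk's setup this logic would run inside Spark rather than in plain Python.

```python
def transform(batch, country_by_ip):
    """Filter out unwanted or malformed events and enrich the rest with a country code."""
    for event in batch:
        if event.get("type") != "repin" or "pin_id" not in event:
            continue  # filter: drop events the analytics tables do not need
        enriched = dict(event)  # copy so the input batch is left untouched
        enriched["country"] = country_by_ip.get(event.get("ip"), "unknown")
        yield enriched

batch = [
    {"type": "repin", "pin_id": 1, "ip": "203.0.113.7"},
    {"type": "click", "pin_id": 2, "ip": "203.0.113.7"},
    {"type": "repin", "ip": "198.51.100.9"},  # missing pin_id
    {"type": "repin", "pin_id": 3, "ip": "192.0.2.1"},
]
lookup = {"203.0.113.7": "US"}  # hypothetical IP-to-country table
rows = list(transform(batch, lookup))
print(rows)
```

Only the two well-formed repin events survive, each with a `country` field added, and the original batch is not modified.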