Storm @ Visual Revenue (an Outbrain Company)




Alex Poon
VP of Engineering
Who are we?
What do we do?
•  14B page views per month
•  At peak, 8000-10000 per sec
•  Deployed Storm to production ~1 month ago
•  Storm cluster of ~50 instances on AWS

[Pipeline diagram, labels in order: Customer Traffic, Web Servers, Kafka, Data Transform/Aggregation (Storm), Databases, Dashboard, Algo, Automation]
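
As a rough sanity check on the traffic figures on this slide, the monthly volume can be converted to an average per-second rate and compared with the stated peak. This is only a sketch: the 30-day month is an assumption, not something the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: the slide does not say how a "month" is counted


def average_rate_per_sec(views_per_month, days=DAYS_PER_MONTH):
    """Convert a monthly page-view count into an average per-second rate."""
    return views_per_month / (days * SECONDS_PER_DAY)


avg = average_rate_per_sec(14e9)          # 14B page views per month (from the slide)
peak_low, peak_high = 8000, 10000         # stated peak range per second (from the slide)

print(f"average: {avg:.0f} page views/sec")
print(f"peak is {peak_low / avg:.2f}x to {peak_high / avg:.2f}x the average")
```

Under that assumption the average is roughly 5,400 page views per second, so the stated peak is about 1.5 to 1.9 times the average load.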

Before Storm
•  Built our own distributed data processing
    •  ZMQ
    •  Batch based process
    •  Work partitioned by hashing on customer
•  Advantages
    •  Simple in-house system built from very basic components
    •  Well understood
•  Disadvantages
    •  Hard to scale; a constant battle to keep up with pings
    •  Machine management was clumsy
    •  Uneven distribution of traffic
    •  Multiple processes doing similar work, wasting resources
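
One way to see why partitioning by customer can lead to uneven traffic: if a single customer dominates the volume, whichever worker its hash lands on carries most of the load. The sketch below is illustrative only. The customer names, ping counts, and worker count are made up, and `zlib.crc32` is just a convenient stable hash, not necessarily what the in-house system used.

```python
import zlib
from collections import Counter


def worker_for(customer_id, num_workers):
    """Pick a worker by hashing the customer id (stable across runs)."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_workers


def load_per_worker(traffic, num_workers):
    """Sum pings per worker when every customer is pinned to one worker."""
    load = Counter({w: 0 for w in range(num_workers)})
    for customer_id, pings in traffic.items():
        load[worker_for(customer_id, num_workers)] += pings
    return load


# Hypothetical traffic: one large customer and several small ones.
traffic = {
    "big-publisher": 9000,
    "small-1": 300,
    "small-2": 250,
    "small-3": 250,
    "small-4": 200,
}
NUM_WORKERS = 4

load = load_per_worker(traffic, NUM_WORKERS)
fair_share = sum(traffic.values()) / NUM_WORKERS
print("fair share per worker:", fair_share)
print("actual load per worker:", dict(load))
```

However the hash falls, the worker that receives `big-publisher` handles at least 9000 pings against a fair share of 2500, which is the kind of imbalance the slide describes.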
Why Kafka/Storm?
•  Kafka
    •  Open-source, distributed publish-subscribe messaging system
•  Storm
    •  Open-source, real-time computation system for continuous computation
•  They are awesome
    •  Distributed, highly scalable, and fault tolerant
    •  High throughput
    •  Reliable
    •  Real-time
    •  Great at in-memory analytics, and real-time decision support
Data Aggregation
[Topology diagram. Legend: Spout, Bolt. Node labels: Customer, Position, Front Page, URL, Aggregate, Arrangement, Tweet, @Handle. Interval labels on the nodes: 15s and 5m.]
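
The exact wiring of the diagram is not recoverable from the slide, but the 15s and 5m labels suggest counts kept in short windows and rolled up into longer ones. The following is a minimal sketch of that idea in plain Python, not Storm code. The event shape `(timestamp_seconds, key)` and all function names are assumptions made for illustration.

```python
from collections import Counter

SHORT_WINDOW_S = 15       # "15s" label from the slide
LONG_WINDOW_S = 5 * 60    # "5m" label from the slide


def window_start(ts, size):
    """Start of the tumbling window of `size` seconds that contains `ts`."""
    return ts - (ts % size)


def aggregate(events, size):
    """Count events per (window_start, key); events are (timestamp_seconds, key)."""
    counts = Counter()
    for ts, key in events:
        counts[(window_start(ts, size), key)] += 1
    return counts


def roll_up(short_counts, size):
    """Merge short-window counts into longer windows without re-reading raw events."""
    rolled = Counter()
    for (start, key), n in short_counts.items():
        rolled[(window_start(start, size), key)] += n
    return rolled


# Hypothetical events keyed by URL.
events = [(0, "/home"), (14, "/home"), (15, "/home"), (299, "/sports"), (300, "/home")]

per_15s = aggregate(events, SHORT_WINDOW_S)
per_5m = roll_up(per_15s, LONG_WINDOW_S)
print(dict(per_15s))
print(dict(per_5m))
```

Because 5 minutes is a whole multiple of 15 seconds, rolling up the short windows gives the same counts as aggregating the raw events directly into 5-minute windows, so a longer-window stage only needs the output of the shorter one.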

Learning / Ideas
1. Kafka + ZooKeeper is extremely scalable and easy to set up. Check out the Brod library if you are using Python

2. Use the Storm UI (Ganglia based) to monitor your cluster

3. Shell Bolts were inefficient and hard to debug (at least for us)

4. Upgrade to at least Storm version 0.8.2, which gives you capacity metrics on top of other goodies

5. Storm’s anchoring/replay capability is awesome but comes with a visible overhead

6. Use a good framework to manage your cluster; we use SaltStack

7. Our unit tests are built in JUnit. Most built-in unit tests for Storm are available only in Clojure for now
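
On the capacity metric mentioned in item 4: as the Storm UI describes it, a bolt's capacity is the number of tuples executed multiplied by the average execute latency, divided by the length of the measurement window (10 minutes in the UI). A value near 1.0 suggests the bolt is running about as fast as it can. The sketch below reproduces that arithmetic only; the function name and the sample numbers are made up for illustration.

```python
TEN_MINUTES_MS = 10 * 60 * 1000


def bolt_capacity(executed, avg_execute_latency_ms, window_ms=TEN_MINUTES_MS):
    """Fraction of the window a bolt spent executing tuples (about 1.0 means saturated)."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return executed * avg_execute_latency_ms / window_ms


# Hypothetical bolt: 1.2M tuples in the last 10 minutes at 0.4 ms each.
capacity = bolt_capacity(executed=1_200_000, avg_execute_latency_ms=0.4)
print(f"capacity: {capacity:.2f}")
```

A bolt that stays close to 1.0 is a candidate for more parallelism, which is why this metric is handy when sizing a cluster.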
Thank You

 Alex Poon
 @alexpoon06
 @Outbrain

  Yes, it is true. We are hiring!!

     www.visualrevenue.com/jobs


