STORM as an ETL Engine to HADOOP
Apr 15, 2015
Yash Ranadive
Lookout Mobile Security
@yashranadive
etl.svbtle.com
Friday, April 24, 15
ABOUT
• Data Engineer at Lookout, San Francisco
• Work on
  • Analytics Infrastructure (Internal)
  • Data Ingestion in Hadoop
• Blog all things ETL
  • etl.svbtle.com
AGENDA
• When to use Storm?
• Architecture Alternatives
• Monitoring
• Questions
Over 60 million registered users
DEFINITION OF ETL
Moving and Transforming data so it can
be stored and analyzed
NEED
A general framework for event processing pipelines that can make processed data available for analytics as fast as possible
THE PROBLEM
Message Bus -> ETL Logic -> HADOOP
What to use?
What to use?
Depends on latency requirements
ANALYTICS LATENCY
Category          Latency
Batch             Hourly/Daily
Frequent Batch    10-15 mins
Near Real-time    < 1 minute
Sub Second        < 1 s
Flow interrupt
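The tiers in the latency table can be read as a lookup from a freshness requirement to a processing style. A minimal sketch in Python, assuming the boundaries fall at the values shown on the slide (1 second, 1 minute, 15 minutes); the function name and the exact boundary handling are illustrative and not from the talk:

```python
def latency_tier(max_delay_seconds: float) -> str:
    """Map a freshness requirement to one of the tiers named on the slide."""
    if max_delay_seconds < 1:
        return "Sub Second"
    if max_delay_seconds < 60:
        return "Near Real-time"
    if max_delay_seconds <= 15 * 60:
        return "Frequent Batch"
    return "Batch"


# A dashboard that may lag by at most 30 seconds falls in the near real-time tier.
tier = latency_tier(30)
```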
What to use?
Depends on complexity of reports
THE PROBLEM
Kafka -> ETL Logic -> HADOOP
What to use?
OFFLOADING AND PROCESSING
           Batch     Batch     Real-time  Real-time
           Offload   Process   Offload    Process
Camus      X
Storm                          X          X
Scalding             X
Spark      X         X         X          X
...
For Real-time Analytics
We use Storm
HOW WE SOLVED 2 PROBLEMS
1. User Gratifications
2. Device Connections
USER GRATIFICATIONS
A “gratification” is an event that adds value to the user
1. USER GRATIFICATIONS
• Need analytics on performance of “Scream”, “Lock”, “Locate”
• Events in Protobuf format
Kafka (“Scream”, “Lock”, “Locate” protobuf events) -> Monitor Throughput, Join Cohorts Table, Complex Reports
PIPELINE - LANDING DATA DIRECTLY
Kafka -> Storm -> HDFS
Storm: Kafka Spout -> Deserialize Protobuf bolt -> storm-hdfs bolt
HDFS: Landing Directory -> Hive Directory
• Bolt deserializes protobuf to a TSV
• Data lands on HDFS
• Files rotated to Hive external table folder
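The deserialize step in this pipeline turns each event into one tab-separated line that Hive can read. A minimal sketch in Python of that transform, assuming the protobuf has already been decoded into a plain mapping (real decoding needs the generated protobuf classes, which the talk does not show). The column names are hypothetical:

```python
from typing import Mapping, Sequence

# Hypothetical column order; the talk does not list the real event fields.
COLUMNS: Sequence[str] = ("event_type", "device_id", "timestamp_ms")


def clean(value: object) -> str:
    """Render one field, replacing characters that would break TSV parsing."""
    return str(value).replace("\t", " ").replace("\n", " ")


def to_tsv_line(event: Mapping[str, object], columns: Sequence[str] = COLUMNS) -> str:
    """Turn a decoded event into one tab-separated line.

    Missing fields become Hive's default NULL marker (backslash-N) so that
    every line has the same number of columns.
    """
    return "\t".join(clean(event[c]) if c in event else r"\N" for c in columns)


line = to_tsv_line({"event_type": "Lock", "device_id": "abc123", "timestamp_ms": 1429900000000})
```

Keeping the column order fixed in one place matters because a Hive external table over TSV files maps fields by position, not by name.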
TUNING OPTIONS
• Change storm-hdfs count-based hsync policy
• Change parallelism of storm-hdfs bolt
• Possibly change storm-hdfs time-based hsync policy
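To make the count-based and time-based options concrete, here is a minimal sketch in Python of a policy that asks for a sync after a number of tuples or after a time limit, whichever comes first. This is an illustration of the idea only, not the storm-hdfs API; the class name, its methods, and the injectable clock are assumptions:

```python
import time


class CountOrTimeSyncPolicy:
    """Signal a sync after `max_tuples` tuples or `max_seconds` seconds, whichever comes first."""

    def __init__(self, max_tuples: int, max_seconds: float, clock=time.monotonic):
        self.max_tuples = max_tuples
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so the policy can be tested without sleeping
        self.reset()

    def mark(self) -> bool:
        """Record one written tuple; return True when the caller should sync."""
        self.count += 1
        elapsed = self.clock() - self.last_sync
        return self.count >= self.max_tuples or elapsed >= self.max_seconds

    def reset(self) -> None:
        """Call after a sync has completed."""
        self.count = 0
        self.last_sync = self.clock()


# Example: sync every 3000 tuples, or at least every 10 seconds on a quiet stream.
policy = CountOrTimeSyncPolicy(max_tuples=3000, max_seconds=10.0)
```

A larger count means fewer syncs and higher throughput, at the cost of data becoming visible later; the time limit bounds that delay when traffic is low.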
THE GOOD
• Plain text of Protobufs viewable by tailing the landing file
• Real-time view of throughput via StatsD
• Data available in Hive for downstream analysis
CHALLENGES
• Possible duplicates if processing is not “exactly-once”
• storm-hdfs bolt has limitations
  • can’t rotate files when the topology is shut down
  • parameters need tweaking depending on throughput
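When delivery is not exactly-once, a replayed tuple can land in HDFS twice. One common mitigation is to deduplicate downstream on a unique key. The sketch below shows the idea in Python, assuming every event carries a hypothetical `event_id` field; the talk does not say how duplicates were actually handled:

```python
from typing import Iterable, Iterator, Mapping


def drop_duplicates(events: Iterable[Mapping[str, object]],
                    key: str = "event_id") -> Iterator[Mapping[str, object]]:
    """Yield each event once, keeping the first occurrence of every key value."""
    seen = set()
    for event in events:
        event_key = event[key]
        if event_key in seen:
            continue  # replayed tuple: already emitted
        seen.add(event_key)
        yield event


replayed = [
    {"event_id": 1, "type": "Lock"},
    {"event_id": 2, "type": "Scream"},
    {"event_id": 1, "type": "Lock"},  # same event delivered again after a replay
]
unique = list(drop_duplicates(replayed))
```

The in-memory set only suits a bounded batch; on a large table the same effect would typically come from grouping on the key in the query engine.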
BURSTY TRAFFIC
Bursty traffic can cause frequent hsync (Hadoop file system sync) calls and slow down throughput
DEVICE CONNECTIONS
2. DEVICE CONNECTIONS
• Report on counts of devices connecting
• JSON format
• Analyze all devices connecting to backend servers to measure engagement after new product feature rollouts
Device Connection JSON events -> Join Cohorts Table, Complex Reports
LANDING DATA ON HBASE
Storm -> HBase -> Hive
• Bolt writes to HBase
• HBase TTL => 3 days
• Hive table backed by HBase holds the last 3 days of data
• Daily job copies data from HBase to Hive table
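The daily copy has to pick up each day's rows before the 3-day TTL expires them. A minimal sketch in Python of the two time calculations involved, assuming the job copies the previous full UTC day; the day boundary and the function names are assumptions, not details from the talk:

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

TTL = timedelta(days=3)  # from the slide: HBase TTL => 3 days


def is_expired(written_at: datetime, now: datetime) -> bool:
    """True once a cell is older than the TTL and eligible for removal."""
    return now - written_at >= TTL


def previous_day_window(now: datetime) -> Tuple[datetime, datetime]:
    """Half-open [start, end) window covering the full UTC day before `now`."""
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return end - timedelta(days=1), end


# A job running at 02:00 UTC on April 24 copies all of April 23.
now = datetime(2015, 4, 24, 2, 0, tzinfo=timezone.utc)
start, end = previous_day_window(now)
```

With a 3-day TTL, the oldest row in yesterday's window is a little over one day old when the job runs, so a failed run can be retried for about two more days before data is lost.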
THE GOOD
• Can query HBase or Hive in real time
• Better stability than writing directly to HDFS
ANALYTICS
Kafka -> Storm -> StatsD
Kafka -> Storm -> Hive -> Tableau / ad hoc queries
OPERATIONAL STUFF
TOPOLOGY DEPLOYMENT
• Manually push Storm JARs
• After code review
  • JAR uploaded to Artifactory w/ version
  • JAR deployed to Storm box
• To start topology
  • Kill previous
  • Start new
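The kill-then-start step can be sketched as a small Python helper that builds the two `storm` CLI invocations. The topology name `my-topology` and the 30-second wait are hypothetical; the JAR and class names are the ones shown in the talk's run script. The helper only builds the commands and does not run them:

```python
from typing import List, Sequence


def redeploy_commands(topology_name: str, jar_path: str, main_class: str,
                      args: Sequence[str] = (), wait_secs: int = 30) -> List[List[str]]:
    """Return the two CLI invocations for a kill-then-start redeploy."""
    kill = ["storm", "kill", topology_name, "-w", str(wait_secs)]
    start = ["storm", "jar", jar_path, main_class, *args]
    return [kill, start]


commands = redeploy_commands(
    "my-topology",  # hypothetical topology name
    "data-storm-0.0.6.jar",
    "com.lookout.data.topology.MyTopoClass",
    ["-D", "hdfs.sync.tuple.count=3000"],
)
for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)`. Since `storm kill` returns before the old topology has fully stopped, a real script would likely need to wait between the two steps before submitting a topology with the same name.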
CONFIGURATION MANAGEMENT
$> cat run_topology.sh
storm jar data-storm-0.0.6.jar com.lookout.data.topology.MyTopoClass \
  -topologymaxtaskparallelism 8 \
  -D hdfs.sync.tuple.count=3000 \
  ...
  -D statsd.host=statsd.flexd-sf0.local
• Simple
• Config parameters in shell scripts
TRACKING METRICS
• Use StatsD and Graphite
• Storm Consumer Offsets in DataDog
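For readers unfamiliar with StatsD, a counter is a one-line text message sent over UDP. The sketch below is a minimal Python client, not the code used in the talk; the metric name is hypothetical, and the host and port default to a local StatsD on its standard port:

```python
import socket


def format_counter(metric: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD line format: <name>:<value>|c"""
    return f"{metric}:{value}|c".encode("ascii")


def send_counter(metric: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a counter over UDP; StatsD listens on port 8125 by default."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(metric, value), (host, port))


# Hypothetical metric name for tuples written by one bolt.
payload = format_counter("etl.gratifications.tuples_written", 250)
```

Because the transport is UDP, a bolt can emit counters without blocking on the metrics backend, which is why this style suits throughput monitoring inside a topology.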
OPERATIONAL MONITORING & ALERTING
• Ruby script hits Storm’s Thrift API
• Alert if topology is inactive
• No monitoring on bolt-level failures
• Alert on high-level metrics to prevent alert fatigue
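The inactive-topology check can be sketched as a pure function over the topology summaries that the cluster reports. The talk's script is written in Ruby and calls the Thrift API directly; the Python below uses plain dictionaries with `name` and `status` keys as a stand-in for those summaries, and the topology names are hypothetical:

```python
from typing import Iterable, List, Mapping


def inactive_alerts(expected: Iterable[str],
                    summaries: Iterable[Mapping[str, str]]) -> List[str]:
    """Return one alert message per expected topology that is missing or not ACTIVE."""
    status_by_name = {s["name"]: s["status"] for s in summaries}
    alerts = []
    for name in expected:
        status = status_by_name.get(name)
        if status is None:
            alerts.append(f"{name}: not running")
        elif status != "ACTIVE":
            alerts.append(f"{name}: status is {status}")
    return alerts


alerts = inactive_alerts(
    ["gratifications", "device-connections"],  # hypothetical topology names
    [{"name": "gratifications", "status": "INACTIVE"}],
)
```

Checking against a list of expected names catches the case where a topology was killed and never restarted, which a status-only check would miss.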
ENVIRONMENT
• Independent Storm cluster for data warehouse tasks
• 2 worker nodes
• 24 cores
• 48 GB memory
LESSONS LEARNED
• Use Storm only for real-time metrics
• Streaming data directly to HDFS has its challenges
• Better stability with ingesting first in HBase
Questions
OFFICIAL DESCRIPTION
Lookout's data team ingests several terabytes of data from various sources every day
using many techniques such as binlog parsing, Ruby daemons, and Storm topologies.
With an increasing use of distributed messaging like Kafka from upstream services,
ingestion needs to happen on a distributed ETL infrastructure that can horizontally
scale with the data.
This talk will be on Storm topology pipelines for data ingestion, transformation,
processing and ultimately consumption for interactive queries.
In addition, the talk will focus on
1. Storm topology deployment,
2. configuration management,
3. metric monitoring,
4. and finally storage on Hadoop.
