STORM as an ETL Engine to HADOOP

Apr 15, 2015
Yash Ranadive
Lookout Mobile Security
@yashranadive
etl.svbtle.com
Friday, April 24, 15

ABOUT
• Data Engineer at Lookout, San Francisco
• Work on
  • Analytics Infrastructure (Internal)
  • Data Ingestion in Hadoop
• Blog all things ETL
  • etl.svbtle.com

AGENDA
• When to use Storm?
• Architecture Alternatives
• Monitoring
• Questions

Over 60 million registered users

DEFINITION OF ETL
Moving and Transforming data so it can
be stored and analyzed

NEED
A general framework for event-processing pipelines
that can make processed data available for
analytics as fast as possible

THE PROBLEM
Message Bus -> ETL Logic -> HADOOP
What to use?

What to use?
Depends on latency requirements

ANALYTICS LATENCY
Category          Latency
Batch             Hourly/Daily
Frequent Batch    10-15 mins
Near Real-time    < 1 minute
Sub Second        < 1 s
Flow interrupt

What to use?
Depends on complexity of reports

THE PROBLEM
Kafka -> ETL Logic -> HADOOP
What to use?

OFFLOADING AND PROCESSING
            Batch     Batch     Real-time   Real-time
            Offload   Process   Offload     Process
Camus       X
Storm                           X           X
Scalding              X
Spark       X         X         X           X
...

For Real-time Analytics
We use Storm

HOW WE SOLVED 2 PROBLEMS
1. User Gratifications
2. Device Connections

USER GRATIFICATIONS
A “Gratification” is an event that adds value to the user

1. USER GRATIFICATIONS
• Need analytics on performance of “Scream”, “Lock”, “Locate”
• Events in Protobuf format
Kafka: “Scream”, “Lock”, “Locate” protobuf events
• Monitor throughput
• Join with cohorts table
• Complex reports

PIPELINE - LANDING DATA DIRECTLY
Kafka -> Storm -> HDFS
Kafka Spout -> Deserialize Protobuf -> storm-hdfs bolt -> Landing Directory -> Hive Directory
• Bolt deserializes protobuf to a TSV
• Data lands on HDFS
• Files rotated to HIVE external table folder
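
A minimal sketch of the "deserialize to TSV" step, assuming the protobuf message has already been decoded into a plain dict. The column names are hypothetical and the talk does not show the real schema. It only illustrates the row formatting a Hive external table over text files would need.

```python
# Hypothetical column order for the Hive external table; not from the talk.
COLUMNS = ["event_time", "device_id", "action"]


def to_tsv_line(event, columns=COLUMNS):
    """Render one decoded event (a dict) as a single TSV line."""
    cells = []
    for name in columns:
        value = event.get(name)
        # Hive's default text format reads \N as NULL.
        text = "\\N" if value is None else str(value)
        # Tabs or newlines inside a value would break the row layout.
        text = text.replace("\t", " ").replace("\n", " ")
        cells.append(text)
    return "\t".join(cells) + "\n"
```

Missing fields become NULL rather than shifting later columns, which keeps the file readable both by Hive and by tailing the landing file.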

TUNING OPTIONS
• Change the storm-hdfs count-based hsync policy
• Change the parallelism of the storm-hdfs bolt
• Possibly change the storm-hdfs time-based hsync policy
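
To make the count-based and time-based options concrete, here is a rough sketch of a policy that asks for a sync after N tuples or T seconds, whichever comes first. This is not the storm-hdfs implementation. The class name and parameters are made up for illustration.

```python
import time


class SyncPolicy:
    """Hypothetical policy: sync after max_tuples writes or max_seconds elapsed."""

    def __init__(self, max_tuples, max_seconds, clock=time.monotonic):
        self.max_tuples = max_tuples
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so the policy can be tested
        self.pending = 0
        self.last_sync = clock()

    def mark(self):
        """Record one written tuple; return True when a sync is due."""
        self.pending += 1
        now = self.clock()
        due = (self.pending >= self.max_tuples
               or now - self.last_sync >= self.max_seconds)
        if due:
            # Reset both triggers once a sync is requested.
            self.pending = 0
            self.last_sync = now
        return due
```

A larger tuple count means fewer syncs and higher throughput, at the cost of data becoming visible later. The time bound caps that delay when traffic is slow.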

THE GOOD
• Plain-text view of protobufs by tailing the landing file
• Real-time view of throughput via StatsD
• Data available in HIVE for downstream analysis

CHALLENGES
• Possible duplicates if not “exactly-once”
• storm-hdfs bolt has limitations
  • can’t rotate files when the topology shuts down
  • parameter tweaking depending on throughput
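
The talk does not say how duplicates were handled. One common option, sketched here as an assumption, is to drop events whose ID was seen recently. It requires each event to carry a unique ID, and the class name is hypothetical.

```python
from collections import OrderedDict


class RecentIdFilter:
    """Remember the most recent event IDs and flag repeats (bounded memory)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.seen = OrderedDict()

    def is_new(self, event_id):
        """Return True the first time an ID is seen within the window."""
        if event_id in self.seen:
            # Refresh the ID so frequently replayed events stay remembered.
            self.seen.move_to_end(event_id)
            return False
        self.seen[event_id] = True
        if len(self.seen) > self.capacity:
            # Evict the least recently seen ID.
            self.seen.popitem(last=False)
        return True
```

This only catches replays that arrive within the window. Duplicates that land further apart would still need a downstream de-duplication step, for example in Hive.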

BURSTY TRAFFIC
Bursty traffic can cause frequent hsync (Hadoop file system sync)
and slow down throughput

DEVICE CONNECTIONS

2. DEVICE CONNECTIONS
• Report on counts of devices connecting
• JSON format
• Analyze all devices connecting to backend servers to measure engagement after new product feature rollouts
Device Connection JSON events
• Join with cohorts table
• Complex reports

LANDING DATA ON HBASE
Storm -> HBase -> HIVE
• Bolt writes to HBase
• Daily job copies data from HBase to Hive table
• Hive table backed by HBase (TTL => 3 days) holds the last 3 days of data

THE GOOD
• Can query HBase or Hive in real time
• Better Stability than writing directly to HDFS

ANALYTICS
Kafka -> Storm -> StatsD
Kafka -> Storm -> HIVE -> Tableau / AdHoc

OPERATIONAL STUFF

TOPOLOGY DEPLOYMENT
• Manually push Storm JARs
  • After code review
  • JAR uploaded to Artifactory w/ version
  • JAR deployed to Storm box
• To start topology
  • Kill previous
  • Start new
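
The kill-then-start step could be scripted roughly as below. This is a sketch, not the team's tooling: the function name and the wait time are assumptions. It only builds the `storm kill` and `storm jar` command lines and leaves running them to the caller.

```python
def redeploy_commands(topology_name, jar_path, main_class, wait_secs=30):
    """Return the two CLI invocations for a kill-then-start redeploy."""
    # `storm kill <name> -w <secs>` deactivates the topology, waits, then removes it.
    kill = ["storm", "kill", topology_name, "-w", str(wait_secs)]
    # `storm jar <jar> <main-class>` submits the new version.
    start = ["storm", "jar", jar_path, main_class]
    return [kill, start]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Because the kill is asynchronous, a real script would likely wait until the old topology is gone before submitting a new one under the same name.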

CONFIGURATION MANAGEMENT
$> cat run_topology.sh
storm jar data-storm-0.0.6.jar com.lookout.data.topology.MyTopoClass \
  -topologymaxtaskparallelism 8 \
  -D hdfs.sync.tuple.count=3000 \
  ... \
  -D statsd.host=statsd.flexd-sf0.local
• Simple
• Config parameters in shell scripts
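
How the topology class reads these flags is not shown in the talk. As an illustration only, `-D key=value` pairs like the ones in the script could be collected into a config map along these lines (the function name is hypothetical):

```python
def parse_defines(argv):
    """Collect `-D key=value` pairs from an argument list into a dict."""
    config = {}
    i = 0
    while i < len(argv):
        is_define = (argv[i] == "-D"
                     and i + 1 < len(argv)
                     and "=" in argv[i + 1])
        if is_define:
            # Split on the first "=" only, so values may contain "=".
            key, value = argv[i + 1].split("=", 1)
            config[key] = value
            i += 2
        else:
            # Skip anything that is not a -D pair.
            i += 1
    return config
```

Values stay strings here. A real topology would convert entries such as `hdfs.sync.tuple.count` to integers before use.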

TRACKING METRICS
• Use StatsD and Graphite
• Storm Consumer Offsets in DataDog
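
For reference, a StatsD counter is a small text datagram of the form `name:value|c`, normally sent over UDP. The sketch below shows how a bolt-side helper might emit one. The metric name, host, and port are placeholders, not values from the talk.

```python
import socket


def counter_packet(name, value=1):
    """Build the StatsD wire format for a counter increment."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Fire-and-forget a counter to a StatsD daemon over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(counter_packet(name, value), (host, port))
    finally:
        sock.close()
```

UDP keeps the metric call cheap and non-blocking, so a slow or missing StatsD daemon does not stall the topology.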

OPERATIONAL MONITORING & ALERTING
• Ruby script hits Storm’s thrift API
  • Alert if topology is inactive
• No monitoring on bolt-level failures
• Alert on high-level metrics to prevent alert fatigue
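
The inactive-topology check might look roughly like this. The original is a Ruby script and is not shown, so this is a sketch. It assumes the topology summaries have already been fetched from Nimbus and reduced to dicts with `name` and `status` keys. That shape is an assumption.

```python
def topologies_to_alert(summaries, expected_names):
    """Return one alert message per expected topology that is not ACTIVE."""
    status_by_name = {s["name"]: s["status"] for s in summaries}
    alerts = []
    for name in sorted(expected_names):
        status = status_by_name.get(name)
        if status is None:
            # The topology is not in the cluster summary at all.
            alerts.append(f"{name}: not running")
        elif status != "ACTIVE":
            alerts.append(f"{name}: status {status}")
    return alerts
```

Checking against a list of expected names also catches a topology that was killed and never restarted, which a status-only check would miss.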

ENVIRONMENT
• Independent Storm Cluster for Data Warehouse Tasks
• 2 worker nodes
• 24 Cores
• 48GB Memory

LESSONS LEARNED
• Use Storm only for real-time metrics
• Streaming data directly to HDFS has its challenges
• Better stability when ingesting into HBase first

Questions

OFFICIAL DESCRIPTION
Lookout's data team ingests several terabytes of data from various sources every day
using many techniques such as binlog parsing, Ruby daemons, and Storm topologies.
With increasing use of distributed messaging systems like Kafka by upstream services,
ingestion needs to happen on a distributed ETL infrastructure that can scale
horizontally with the data.
This talk covers Storm topology pipelines for data ingestion, transformation,
processing, and ultimately consumption for interactive queries.
In addition, the talk will focus on
1. Storm topology deployment,
2. configuration management,
3. metric monitoring,
4. and finally storage on Hadoop.
