Real-time Streams & Logs
Andrew Montalenti, CTO
Keith Bourgoin, Backend Lead
PyData SV 2014
1 of 47
Agenda
Parse.ly problem space
Aggregating the stream (Storm)
Organizing around logs (Kafka)
2 of 47
Admin
Our presentations and code:
http://parse.ly/code
This presentation's slides:
http://parse.ly/slides/logs
This presentation's notes:
http://parse.ly/slides/logs/notes
3 of 47
What is Parse.ly?
4 of 47
What is Parse.ly?
Web content analytics for digital storytellers.
5 of 47
Velocity
Average post has <48-hour shelf life.
6 of 47
Volume
Top publishers write 1000's of posts per day.
7 of 47
Time series data
8 of 47
Summary data
9 of 47
Ranked data
10 of 47
Benchmark data
11 of 47
Information radiators
12 of 47
Architecture evolution
13 of 47
Queues and workers
Queues: RabbitMQ => Redis => ZeroMQ
Workers: Cron Jobs => Celery
14 of 47
Workers and databases
15 of 47
Lots of moving parts
16 of 47
In short: it started to get messy
17 of 47
Introducing Storm
Storm is a distributed real-time computation system.
Hadoop provides a set of general primitives for doing batch
processing.
Storm provides a set of general primitives for doing
real-time computation.
Perfect as a replacement for ad-hoc workers-and-queues
systems.
18 of 47
Storm features
Speed
Fault tolerance
Parallelism
Guaranteed Messages
Easy Code Management
Local Dev
19 of 47
Storm primitives
Streaming Data Set, typically from Kafka.
ZeroMQ used for inter-process communication.
Bolts & Spouts; Storm's Topology is a DAG.
Nimbus & Workers manage execution.
Tuneable parallelism + built-in fault tolerance.
20 of 47
Wired Topology
21 of 47
Tuple Tree
Tuple tree, anchoring, and retries.
22 of 47
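The tuple tree's anchoring-and-retry behavior can be sketched as a toy, pure-Python simulation of at-least-once delivery (the `ToySpout` and `run` names are illustrative, not Storm's actual API): the spout "anchors" each emitted tuple by remembering it until acked, and a failed tuple is put back on the queue and replayed.

```python
class ToySpout:
    def __init__(self, items):
        self.queue = list(items)
        self.pending = {}            # tuple id -> value, awaiting ack
        self.next_id = 0

    def next_tuple(self):
        if not self.queue:
            return None
        self.next_id += 1
        value = self.queue.pop(0)
        self.pending[self.next_id] = value   # "anchor" the tuple
        return self.next_id, value

    def ack(self, tup_id):
        del self.pending[tup_id]             # fully processed: forget it

    def fail(self, tup_id):
        # replay: put the tuple back on the queue for another try
        self.queue.append(self.pending.pop(tup_id))


def run(spout, process):
    processed = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        tup_id, value = emitted
        try:
            process(value)
            spout.ack(tup_id)
            processed.append(value)
        except Exception:
            spout.fail(tup_id)
    return processed
```

Storm does this bookkeeping per tuple tree across a whole cluster: if any bolt downstream of the anchor fails or times out, the tuple is replayed from the spout.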
Word Stream Spout (Storm)
;; spout configuration
{"word-spout" (shell-spout-spec
    ;; Python Spout implementation:
    ;; - fetches words (e.g. from Kafka)
    ["python" "words.py"]
    ;; - emits (word,) tuples
    ["word"]
    )
}
23 of 47
Word Stream Spout in Python
import itertools

from streamparse import storm


class WordSpout(storm.Spout):

    def initialize(self, conf, ctx):
        self.words = itertools.cycle(['dog', 'cat',
                                      'zebra', 'elephant'])

    def next_tuple(self):
        word = next(self.words)
        storm.emit([word])

WordSpout().run()
24 of 47
Word Count Bolt (Storm)
;; bolt configuration
{"count-bolt" (shell-bolt-spec
    ;; Bolt input: Spout and field grouping on word
    {"word-spout" ["word"]}
    ;; Python Bolt implementation:
    ;; - maintains a Counter of words
    ;; - increments as new words arrive
    ["python" "wordcount.py"]
    ;; Emits latest word count for most recent word
    ["word" "count"]
    ;; parallelism = 2
    :p 2
    )
}
25 of 47
Word Count Bolt in Python
from collections import Counter

from streamparse import storm


class WordCounter(storm.Bolt):

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        storm.emit([word, self.counts[word]])
        storm.log('%s: %d' % (word, self.counts[word]))

WordCounter().run()
26 of 47
streamparse
sparse provides a CLI front-end to streamparse, a
framework for creating Python projects for running,
debugging, and submitting Storm topologies for data
processing (still in development).
After installing lein (the only dependency), you can run:
pip install streamparse
This installs a command-line tool, sparse. Use:
sparse quickstart
27 of 47
Running and debugging
You can then run the local Storm topology using:
$ sparse run
Running wordcount topology...
Options: {:spec "topologies/wordcount.clj", ...}
#<StormTopology StormTopology(spouts:{word-spout=...
storm.daemon.nimbus - Starting Nimbus with conf {...
storm.daemon.supervisor - Starting supervisor with id 4960ac74...
storm.daemon.nimbus - Received topology submission with conf {...
... lots of output as topology runs...
Interested? Lightning talk!
28 of 47
Organizing around logs
29 of 47
Not all logs are application logs
A "log" could be any stream of structured data:
Web logs
Raw data waiting to be processed
Partially processed data
Database operations (e.g. MongoDB's oplog)
A series of timestamped facts about a given system.
30 of 47
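Concretely, one such "timestamped fact" might be serialized as a line of JSON, ready to append to a log file or a Kafka topic (the field names here are illustrative, not Parse.ly's actual schema):

```python
import json
import time


def fact(event, **data):
    """One timestamped fact about the system, as a line of JSON."""
    return json.dumps({'ts': time.time(), 'event': event, 'data': data})

# e.g. a single pageview event
line = fact('pageview', url='/post/123', referrer='twitter.com')
```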
LinkedIn's lattice problem
31 of 47
Enter the unified log
32 of 47
Log-centric is simpler
33 of 47
Parse.ly is log-centric, too
34 of 47
Introducing Apache Kafka
Log-centric messaging system developed at LinkedIn.
Designed for throughput; efficient resource use.
Persists to disk; in-memory for recent data.
Little to no overhead for new consumers.
Scalable to 10,000's of messages per second.
As of 0.8, full replication of topic data.
35 of 47
Kafka concepts
Concept          Description
Cluster          An arrangement of Brokers & ZooKeeper nodes
Broker           An individual node in the Cluster
Topic            A group of related messages (a stream)
Partition        Part of a topic, used for replication
Producer         Publishes messages to a stream
Consumer Group   Group of related processes reading a topic
Offset           Point in a topic that the consumer has read to
36 of 47
What's the catch?
Replication isn't perfect. Network partitions can cause
problems.
No out-of-order acknowledgement:
The "offset" is just a marker of where the consumer is
in the log; nothing more.
On a restart, you know where to start reading, but not
whether individual messages before the stored offset
were fully processed.
In practice, this is not as much of a problem as it sounds.
37 of 47
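The offset caveat above can be made concrete with a toy sketch (the `consume` helper is hypothetical, not the Kafka API): the consumer commits its offset only every N messages, so a crash between commits means some messages are re-read on restart, and may therefore be processed twice (duplicates, not data loss).

```python
log = ['m%d' % i for i in range(10)]     # the topic, as a simple list


def consume(log, start_offset, commit_every, crash_after=None):
    """Read from start_offset; return (processed, committed_offset)."""
    processed = []
    committed = start_offset
    for offset in range(start_offset, len(log)):
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed          # crash before next commit
        processed.append(log[offset])
        if (offset + 1) % commit_every == 0:
            committed = offset + 1               # e.g. commit to ZooKeeper
    return processed, committed
```

This is the at-least-once trade-off in miniature: the re-read window is bounded by the commit interval, and downstream logic can be made idempotent to tolerate it.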
Kafka is a "distributed log"
Topics are logs, not queues.
Consumers track their own offsets into the log.
Logs are maintained for a configurable period of time.
Messages can be "replayed".
Consumers can share identical logs easily.
38 of 47
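"Topics are logs, not queues" can be sketched in a few lines (the `Topic` and `Consumer` classes are hypothetical, not the Kafka API): the broker keeps one append-only list, and each consumer merely tracks its own position in it, so consumers never contend with each other and any one of them can rewind and replay.

```python
class Topic:
    def __init__(self):
        self.log = []                 # append-only record of messages

    def append(self, msg):
        self.log.append(msg)


class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0               # per-consumer position in the log

    def poll(self):
        msgs = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return msgs

    def rewind(self, offset=0):
        self.offset = offset          # "replay" from an older point
```

Adding a consumer costs the broker almost nothing: it is just one more offset into the same shared log.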
Multi-consumer
Even if Kafka's availability and scalability story isn't
interesting to you, the multi-consumer story should be.
39 of 47
Queue problems, revisited
Traditional queues (e.g. RabbitMQ / Redis):
not distributed / highly available at core
not persistent ("overflows" easily)
more consumers mean more queue server load
Kafka solves all of these problems.
40 of 47
Kafka + Storm
Good fit for at-least-once processing.
No need for out-of-order acks.
Community work is ongoing for at-most-once processing.
Able to keep up with Storm's high-throughput processing.
Great for handling backpressure during traffic spikes.
41 of 47
Kafka in Python (1)
python-kafka (0.8+)
https://github.com/mumrah/kafka-python
import time

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kafka = KafkaClient('localhost:9092')
consumer = SimpleConsumer(kafka, 'test_consumer', 'raw_data')

count = 0
start = time.time()
for msg in consumer:
    count += 1
    if count % 1000 == 0:
        dur = time.time() - start
        print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
        start = time.time()
42 of 47
Kafka in Python (2)
samsa (0.7x)
https://github.com/getsamsa/samsa
import time

from kazoo.client import KazooClient
from samsa.cluster import Cluster

zk = KazooClient()
zk.start()
cluster = Cluster(zk)
queue = cluster.topics['raw_data'].subscribe('test_consumer')

count = 0
start = time.time()
for msg in queue:
    count += 1
    if count % 1000 == 0:
        dur = time.time() - start
        print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
        queue.commit_offsets()  # commit to ZK every 1k msgs
        start = time.time()
43 of 47
Other Log-Centric Companies
Company      Logs     Workers
LinkedIn     Kafka*   Samza
Twitter      Kafka    Storm*
Pinterest    Kafka    Storm
Spotify      Kafka    Storm
Wikipedia    Kafka    Storm
Outbrain     Kafka    Storm
LivePerson   Kafka    Storm
Netflix      Kafka    ???
44 of 47
Conclusion
45 of 47
What we've learned
There is no silver bullet data processing technology.
Log storage is very cheap, and getting cheaper.
"Timestamped facts" is rawest form of data available.
Storm and Kafka allow you to develop atop those facts.
Organizing around real-time logs is a wise decision.
46 of 47
Questions?
Go forth and stream!
Parse.ly:
http://parse.ly/code
http://twitter.com/parsely
Andrew & Keith:
http://twitter.com/amontalenti
http://twitter.com/kbourgoin
47 of 47
