Real-time Streams & Logs
Andrew Montalenti, CTO
Keith Bourgoin, Backend Lead
PyData SV 2014
1 of 47
Agenda
Parse.ly problem space
Aggregating the stream (Storm)
Organizing around logs (Kafka)
2 of 47
Admin
Our presentations and code:
http://parse.ly/code
This presentation's slides:
http://parse.ly/slides/logs
This presentation's notes:
http://parse.ly/slides/logs/notes
3 of 47
What is Parse.ly?
4 of 47
What is Parse.ly?
Web content analytics for digital storytellers.
5 of 47
Velocity
Average post has <48-hour shelf life.
6 of 47
Volume
Top publishers write 1000's of posts per day.
7 of 47
Time series data
8 of 47
Summary data
9 of 47
Ranked data
10 of 47
Benchmark data
11 of 47
Information radiators
12 of 47
Architecture evolution
13 of 47
Queues and workers
Queues: RabbitMQ => Redis => ZeroMQ
Workers: Cron Jobs => Celery
14 of 47
Workers and databases
15 of 47
Lots of moving parts
16 of 47
In short: it started to get messy
17 of 47
Introducing Storm
Storm is a distributed real-time computation system.
Hadoop provides a set of general primitives for doing batch
processing.
Storm provides a set of general primitives for doing
real-time computation.
Perfect as a replacement for ad-hoc workers-and-queues
systems.
18 of 47
Storm features
Speed
Fault tolerance
Parallelism
Guaranteed Messages
Easy Code Management
Local Dev
19 of 47
Storm primitives
Streaming Data Set, typically from Kafka.
ZeroMQ used for inter-process communication.
Bolts & Spouts; Storm's Topology is a DAG.
Nimbus & Workers manage execution.
Tuneable parallelism + built-in fault tolerance.
20 of 47
Wired Topology
21 of 47
Tuple Tree
Tuple tree, anchoring, and retries.
22 of 47
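The tuple tree's anchoring-and-retry behavior can be sketched as a toy, pure-Python simulation of at-least-once delivery (the `ToySpout` and `run` names are illustrative, not Storm's actual API): the spout "anchors" each emitted tuple by remembering it until acked, and a failed tuple is put back on the queue and replayed.

```python
class ToySpout:
    def __init__(self, items):
        self.queue = list(items)
        self.pending = {}            # tuple id -> value, awaiting ack
        self.next_id = 0

    def next_tuple(self):
        if not self.queue:
            return None
        self.next_id += 1
        value = self.queue.pop(0)
        self.pending[self.next_id] = value   # "anchor" the tuple
        return self.next_id, value

    def ack(self, tup_id):
        del self.pending[tup_id]             # fully processed: forget it

    def fail(self, tup_id):
        # replay: put the tuple back on the queue for another try
        self.queue.append(self.pending.pop(tup_id))


def run(spout, process):
    processed = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        tup_id, value = emitted
        try:
            process(value)
            spout.ack(tup_id)
            processed.append(value)
        except Exception:
            spout.fail(tup_id)
    return processed
```

Storm does this bookkeeping per tuple tree across a whole cluster: if any bolt downstream of the anchor fails or times out, the tuple is replayed from the spout.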
Word Stream Spout (Storm)
;; spout configuration
{"word-spout" (shell-spout-spec
    ;; Python Spout implementation:
    ;; - fetches words (e.g. from Kafka)
    ["python" "words.py"]
    ;; - emits (word,) tuples
    ["word"]
    )
}
23 of 47
Word Stream Spout in Python
import itertools

from streamparse import storm


class WordSpout(storm.Spout):

    def initialize(self, conf, ctx):
        self.words = itertools.cycle(['dog', 'cat',
                                      'zebra', 'elephant'])

    def next_tuple(self):
        word = next(self.words)
        storm.emit([word])

WordSpout().run()
24 of 47
Word Count Bolt (Storm)
;; bolt configuration
{"count-bolt" (shell-bolt-spec
    ;; Bolt input: Spout and field grouping on word
    {"word-spout" ["word"]}
    ;; Python Bolt implementation:
    ;; - maintains a Counter of words
    ;; - increments as new words arrive
    ["python" "wordcount.py"]
    ;; Emits latest word count for most recent word
    ["word" "count"]
    ;; parallelism = 2
    :p 2
    )
}
25 of 47
Word Count Bolt in Python
from collections import Counter

from streamparse import storm


class WordCounter(storm.Bolt):

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        storm.emit([word, self.counts[word]])
        storm.log('%s: %d' % (word, self.counts[word]))

WordCounter().run()
26 of 47
streamparse
sparse provides a CLI front-end to streamparse, a
framework for creating Python projects for running,
debugging, and submitting Storm topologies for data
processing (still in development).
After installing lein (the only dependency), you can run:
pip install streamparse
This installs a command-line tool, sparse. Use:
sparse quickstart
27 of 47
Running and debugging
You can then run the local Storm topology using:
$ sparse run
Running wordcount topology...
Options: {:spec "topologies/wordcount.clj", ...}
#<StormTopology StormTopology(spouts:{word-spout=...
storm.daemon.nimbus - Starting Nimbus with conf {...
storm.daemon.supervisor - Starting supervisor with id 4960ac74...
storm.daemon.nimbus - Received topology submission with conf {...
... lots of output as topology runs...
Interested? Lightning talk!
28 of 47
Organizing around logs
29 of 47
Not all logs are application logs
A "log" could be any stream of structured data:
Web logs
Raw data waiting to be processed
Partially processed data
Database operations (e.g. MongoDB's oplog)
A series of timestamped facts about a given system.
30 of 47
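Concretely, one such "timestamped fact" might be serialized as a line of JSON, ready to append to a log file or a Kafka topic (the field names here are illustrative, not Parse.ly's actual schema):

```python
import json
import time


def fact(event, **data):
    """One timestamped fact about the system, as a line of JSON."""
    return json.dumps({'ts': time.time(), 'event': event, 'data': data})

# e.g. a single pageview event
line = fact('pageview', url='/post/123', referrer='twitter.com')
```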
LinkedIn's lattice problem
31 of 47
Enter the unified log
32 of 47
Log-centric is simpler
33 of 47
Parse.ly is log-centric, too
34 of 47
Introducing Apache Kafka
Log-centric messaging system developed at LinkedIn.
Designed for throughput; efficient resource use.
Persists to disk; in-memory for recent data.
Little to no overhead for new consumers.
Scalable to 10,000's of messages per second.
As of 0.8, full replication of topic data.
35 of 47
Kafka concepts
Concept          Description
Cluster          An arrangement of Brokers & ZooKeeper nodes
Broker           An individual node in the Cluster
Topic            A group of related messages (a stream)
Partition        Part of a topic, used for replication
Producer         Publishes messages to a stream
Consumer Group   Group of related processes reading a topic
Offset           Point in a topic that the consumer has read to
36 of 47
What's the catch?
Replication isn't perfect. Network partitions can cause
problems.
No out-of-order acknowledgement:
The "offset" is just a marker of where the consumer is
in the log; nothing more.
On a restart, you know where to start reading, but not
whether individual messages before the stored offset
were fully processed.
In practice, this is not as much of a problem as it sounds.
37 of 47
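The offset caveat above can be made concrete with a toy sketch (the `consume` helper is hypothetical, not the Kafka API): the consumer commits its offset only every N messages, so a crash between commits means some messages are re-read on restart, and may therefore be processed twice (duplicates, not data loss).

```python
log = ['m%d' % i for i in range(10)]     # the topic, as a simple list


def consume(log, start_offset, commit_every, crash_after=None):
    """Read from start_offset; return (processed, committed_offset)."""
    processed = []
    committed = start_offset
    for offset in range(start_offset, len(log)):
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed          # crash before next commit
        processed.append(log[offset])
        if (offset + 1) % commit_every == 0:
            committed = offset + 1               # e.g. commit to ZooKeeper
    return processed, committed
```

This is the at-least-once trade-off in miniature: the re-read window is bounded by the commit interval, and downstream logic can be made idempotent to tolerate it.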
Kafka is a "distributed log"
Topics are logs, not queues.
Consumers track their own offsets into the log.
Logs are maintained for a configurable period of time.
Messages can be "replayed".
Consumers can share identical logs easily.
38 of 47
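"Topics are logs, not queues" can be sketched in a few lines (the `Topic` and `Consumer` classes are hypothetical, not the Kafka API): the broker keeps one append-only list, and each consumer merely tracks its own position in it, so consumers never contend with each other and any one of them can rewind and replay.

```python
class Topic:
    def __init__(self):
        self.log = []                 # append-only record of messages

    def append(self, msg):
        self.log.append(msg)


class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0               # per-consumer position in the log

    def poll(self):
        msgs = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return msgs

    def rewind(self, offset=0):
        self.offset = offset          # "replay" from an older point
```

Adding a consumer costs the broker almost nothing: it is just one more offset into the same shared log.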
Multi-consumer
Even if Kafka's availability and scalability story isn't
interesting to you, the multi-consumer story should be.
39 of 47
Queue problems, revisited
Traditional queues (e.g. RabbitMQ / Redis):
not distributed / highly available at core
not persistent ("overflows" easily)
more consumers mean more queue server load
Kafka solves all of these problems.
40 of 47
Kafka + Storm
Good fit for at-least-once processing.
No need for out-of-order acks.
Community work is ongoing for at-most-once processing.
Able to keep up with Storm's high-throughput processing.
Great for handling backpressure during traffic spikes.
41 of 47
Kafka in Python (1)
python-kafka (0.8+)
https://github.com/mumrah/kafka-python
import time

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kafka = KafkaClient('localhost:9092')
consumer = SimpleConsumer(kafka, 'test_consumer', 'raw_data')

count = 0
start = time.time()
for msg in consumer:
    count += 1
    if count % 1000 == 0:
        dur = time.time() - start
        print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
        start = time.time()
42 of 47
Kafka in Python (2)
samsa (0.7x)
https://github.com/getsamsa/samsa
import time

from kazoo.client import KazooClient
from samsa.cluster import Cluster

zk = KazooClient()
zk.start()
cluster = Cluster(zk)
queue = cluster.topics['raw_data'].subscribe('test_consumer')

count = 0
start = time.time()
for msg in queue:
    count += 1
    if count % 1000 == 0:
        dur = time.time() - start
        print 'Reading at {:.2f} messages/sec'.format(1000 / dur)
        queue.commit_offsets()  # commit to ZK every 1k msgs
        start = time.time()
43 of 47
Other Log-Centric Companies
Company      Logs     Workers
LinkedIn     Kafka*   Samza
Twitter      Kafka    Storm*
Pinterest    Kafka    Storm
Spotify      Kafka    Storm
Wikipedia    Kafka    Storm
Outbrain     Kafka    Storm
LivePerson   Kafka    Storm
Netflix      Kafka    ???
44 of 47
Conclusion
45 of 47
What we've learned
There is no silver bullet data processing technology.
Log storage is very cheap, and getting cheaper.
"Timestamped facts" is rawest form of data available.
Storm and Kafka allow you to develop atop those facts.
Organizing around real-time logs is a wise decision.
46 of 47
Questions?
Go forth and stream!
Parse.ly:
http://parse.ly/code
http://twitter.com/parsely
Andrew & Keith:
http://twitter.com/amontalenti
http://twitter.com/kbourgoin
47 of 47
