Data Processing with Python /
Celery and RabbitMQ
for the
New England Regional Developers (NERD) Summit
Jeff Peck
9/11/2015
Introduction
Jeff Peck
Senior Software Engineer
Code Ninja
jpeck@esperdyne.com
www.esperdyne.com
Esperdyne Technologies, LLC
245 Russell Street, Suite 23
Hadley, MA 01035-9558
The Goal of this Presentation
● Understand the challenges of
real-life data processing
scenarios
● Consider the possible solutions
● Describe an approach using
Python / Celery and RabbitMQ
● Discover how you can process
data with Celery, from scratch,
by walking through a real
example
Agenda
● Background
● The Challenge
● Approaches Considered
● About Celery / Task Queues
● Practical Example: Processing Emails
● Questions
Background
● We process data for ~5 million industrial parts
each week
● Data comes from different sources
● Some structured / some unstructured
● Multiple deploy targets: MySQL / FAST ESP
● Database deploys non-item-specific data (e.g.
catalog data or taxonomy data)
● Metadata processing
● Various dependencies before processing and
pushing to production
Background
[Diagram: structured catalog data, unstructured PDF data, and metadata all flow into the database and the search index]
The Challenge
● Efficiently process data from multiple sources
● Consider all dependencies
● Deploy to multiple targets in parallel
● Capture the success/failure of each item to be
able to generate a report
● Build a process that can be easily triggered to
handle all aspects of data processing on a
weekly basis
Approaches
● Process everything in separate batches
– Fine for small amounts of data
– Lots of manual steps
– Almost no parallel processing
– Would take approximately one week to process all data
● Pypes
– Flow-based programming paradigm
– “Components” and “Packets”
– Lacked flexibility to spawn multiple jobs from a single
component
“This Calls for Some Celery!”
● Celery: Distributed Task Queue
● Written in Python
● Integrates with RabbitMQ and Redis
● Supports task chaining
● Extremely Flexible
● Distributed
– Can manage multiple queues
● Very active community
– (over 10k downloads per day)
Celery
● “Celery is an asynchronous task queue/job
queue based on distributed message passing. It
is focused on real-time operation, but supports
scheduling as well.”
● http://www.celeryproject.org/
● pip install -U Celery
● Supports callbacks or task chaining
● Ideal for processing data from different sources,
and deploying to multiple targets, while
collecting status of individual items
What is a Distributed Task Queue?
● A message queue passes, holds, and delivers
messages across a system or application
● A task queue is a type of message queue that
deals with tasks, such as processing some data
● A distributed task queue combines multiple
task queues across systems
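The three definitions above can be made concrete with a toy, in-process sketch: a queue holds task messages, and worker threads pull and execute them. This is only an illustration of the concept using the standard library; Celery distributes the same pattern across processes and machines via a broker.

```python
import queue
import threading

# A toy task queue: the queue holds "messages" (plain callables plus
# arguments), and worker threads pull and execute them. In Celery, the
# broker plays the role of the queue, and worker processes (possibly on
# other machines) play the role of these threads.
task_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = task_queue.get()
        if item is None:              # sentinel: shut this worker down
            task_queue.task_done()
            break
        func, args = item
        outcome = func(*args)
        with results_lock:
            results.append(outcome)
        task_queue.task_done()

def square(n):
    return n * n

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

for n in range(5):                    # enqueue five tasks
    task_queue.put((square, (n,)))
for _ in workers:                     # one shutdown sentinel per worker
    task_queue.put(None)

task_queue.join()                     # wait until every message is handled
for w in workers:
    w.join()

print(sorted(results))                # [0, 1, 4, 9, 16]
```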
Workers, Brokers, and Backends
● In Celery, a worker executes tasks that are
passed to it from the message broker
● The message broker is the service that sends
and receives the messages (i.e. the message
queue). Celery is compatible with many
different brokers, such as Redis, MongoDB,
IronMQ, etc. We use RabbitMQ.
● A backend is necessary if you want to store the
results of tasks or send the states somewhere
(e.g. when executing a “group” of tasks)
Practical Example: Processing Emails
● 500k emails recovered from Enron
● Goal is to parse each email and load it into
ElasticSearch and MySQL
● We could do this manually in stages, but we want to take
full advantage of our resources and minimize our
interaction with the process
● We will use Celery, RabbitMQ, and Redis
● All of the source code for this example is available here:
https://github.com/esperdyne
Email Processing
[Diagram: emails flow into a parse step, which deploys to ElasticSearch and MySQL]
Email Processing: Setup
● Install:
– RabbitMQ
– Redis
– Celery
– Fabric
– MySQL
– ElasticSearch
Install RabbitMQ:
$ sudo apt-get install rabbitmq-server
Install Redis:
$ sudo apt-get install redis-server
$ sudo pip install redis
Install Celery:
$ sudo pip install celery
Install Fabric:
$ sudo pip install fabric
Install ElasticSearch:
$ sudo apt-get install openjdk-7-jre
$ wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list
$ sudo apt-get update && sudo apt-get install elasticsearch
$ sudo update-rc.d elasticsearch defaults 95 10
$ sudo pip install elasticsearch
$ sudo service elasticsearch start
Install MySQL:
$ sudo apt-get install mysql-server
$ sudo apt-get build-dep python-mysqldb
$ sudo pip install MySQL_python
$ sudo pip install sqlalchemy
Make “messages” database:
$ mysql -u root -e "CREATE DATABASE messages"
Email Processing: Setup
● Create a new directory for the project
● Create the proj directory and put an empty
__init__.py file in it.
● Download the raw Enron emails
$ mkdir celery-message-processing
$ cd celery-message-processing
$ mkdir proj
$ touch proj/__init__.py
$ wget http://www.cs.cmu.edu/~enron/enron_mail_20150507.tgz
$ tar -xvf enron_mail_20150507.tgz
Email Processing: The Celery file
● Inside the proj dir, create a file called
celery.py and open it with your favorite text
editor (e.g. emacs proj/celery.py )
from __future__ import absolute_import

from celery import Celery

app = Celery('proj',
             broker='amqp://',
             backend='redis://localhost',
             include=['proj.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
Email Processing: The Tasks File
● Now, create another file inside the proj
directory called tasks.py and open it for
editing.
● Write the following imports:
from __future__ import absolute_import
import email
from sqlalchemy import *
from elasticsearch import Elasticsearch
from celery import Task
from proj.celery import app
Email Processing: Tasks File (cont)
class MessagesTask(Task):
    """This is a celery abstract base class that contains all of the logic
    for parsing and deploying content."""
    abstract = True
    _messages_table = None
    _elasticsearch = None

    def _init_database(self):
        """Set up the MySQL database"""
        db = create_engine('mysql://root@localhost/messages')
        metadata = MetaData(db)
        messages_table = Table('messages', metadata,
                               Column('message_id', String(255), primary_key=True),
                               Column('subject', String(255)),
                               Column('to', String(255)),
                               Column('x_to', String(255)),
                               Column('from', String(255)),
                               Column('x_from', String(255)),
                               Column('cc', String(255)),
                               Column('x_cc', String(255)),
                               Column('bcc', String(255)),
                               Column('x_bcc', String(255)),
                               Column('payload', Text()))
        messages_table.create(checkfirst=True)
        self._messages_table = messages_table

    def _init_elasticsearch(self):
        """Set up the ElasticSearch instance"""
        self._elasticsearch = Elasticsearch()

    ...
Email Processing: Tasks File (cont)
    ...

    def parse_message_file(self, filename):
        """Parse an email file. Return as dictionary"""
        with open(filename) as f:
            message = email.message_from_file(f)
            return {'subject': message.get("Subject"),
                    'to': message.get("To"),
                    'x_to': message.get("X-To"),
                    'from': message.get("From"),
                    'x_from': message.get("X-From"),
                    'cc': message.get("Cc"),
                    'x_cc': message.get("X-cc"),
                    'bcc': message.get("Bcc"),
                    'x_bcc': message.get("X-bcc"),
                    'message_id': message.get("Message-ID"),
                    'payload': message.get_payload()}

    def database_insert(self, message_dict):
        """Insert a message into the MySQL database"""
        if self._messages_table is None:
            self._init_database()
        ins = self._messages_table.insert(values=message_dict)
        ins.execute()

    def elasticsearch_index(self, id, message_dict):
        """Insert a message into the ElasticSearch index"""
        if self._elasticsearch is None:
            self._init_elasticsearch()
        self._elasticsearch.index(index="messages", doc_type="message",
                                  id=id, body=message_dict)
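The parse step above relies only on the standard library's email module, so it can be tried in isolation. A minimal standalone sketch, using an invented sample message (the headers and body here are made up for illustration) and a string instead of a file:

```python
import email

raw = """\
Message-ID: <12345.JavaMail@example.com>
From: sender@example.com
To: recipient@example.com
Subject: Test message

Hello from the mail archive.
"""

# Same approach as parse_message_file, but parsing from a string
message = email.message_from_string(raw)
parsed = {'subject': message.get("Subject"),
          'to': message.get("To"),
          'from': message.get("From"),
          'message_id': message.get("Message-ID"),
          'payload': message.get_payload()}

print(parsed['subject'])
```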
Email Processing: Tasks File (cont)
@app.task(base=MessagesTask, queue="parse")
def parse(filename):
    """Parse an email file. Return as dictionary"""
    # Call the method in the base task and return the result
    return parse.parse_message_file(filename)

@app.task(base=MessagesTask, queue="db_deploy", ignore_result=True)
def deploy_db(message_dict):
    """Deploys the message dictionary to the MySQL database table"""
    # Call the method in the base task
    deploy_db.database_insert(message_dict)

@app.task(base=MessagesTask, queue="es_deploy", ignore_result=True)
def deploy_es(message_dict):
    """Deploys the message dictionary to the Elastic Search instance"""
    # Call the method in the base task
    deploy_es.elasticsearch_index(message_dict['message_id'], message_dict)
Email Processing: Fabric Script
● I use Fabric to start/stop the Celery workers and
to pass the raw emails to be processed
● Make a fabfile.py in the base directory and
open it for editing
import os
from fabric.api import local
from celery import chain, group
from celery.task.control import inspect
from proj.tasks import parse, deploy_db, deploy_es
Email Processing: Fabric (cont)
def workers(action):
    """Issue command to start, restart, or stop celery workers"""
    # Prepare the directories for pids and logs
    local("mkdir -p celery-pids celery-logs")
    # Launch 4 celery workers for 4 queues (parse, db_deploy, es_deploy, and default)
    # Each has a concurrency of 2, except the default, which has a concurrency of 1
    # More info on the format of this command can be found here:
    # http://docs.celeryproject.org/en/latest/reference/celery.bin.multi.html
    local("celery multi {} parse db_deploy es_deploy celery "
          "-Q:parse parse -Q:db_deploy db_deploy -Q:es_deploy es_deploy -Q:celery celery "
          "-c 2 -c:celery 1 "
          "-l info -A proj "
          "--pidfile=celery-pids/%n.pid --logfile=celery-logs/%n.log".format(action))
● Start/stop the workers with fabric
Usage example:
$ fab workers:start
$ fab workers:stop
$ fab workers:restart
Email Processing: Fabric (cont)
● Task Chaining
def process_one(filename=None):
    """Enqueues a mail file for processing"""
    res = chain(parse.s(filename), group(deploy_db.s(), deploy_es.s()))()
    print "Enqueued mail file for processing: {} ({})".format(filename, res)

def process(path=None):
    """Enqueues a mail file for processing. Optionally, submitting a
    directory will enqueue all files in that directory"""
    if os.path.isfile(path):
        process_one(path)
    elif os.path.isdir(path):
        for subpath, subdirs, files in os.walk(path):
            for name in files:
                process_one(os.path.join(subpath, name))
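Stripped of Celery, the chain/group wiring above expresses "parse first, then fan the result out to both deploy tasks." A synchronous sketch with stand-in functions (not the real tasks) shows the data flow that Celery runs asynchronously across queues:

```python
def parse(filename):
    # Stand-in for the real parse task
    return {'message_id': filename, 'payload': 'body'}

def deploy_db(message_dict):
    # Stand-in for the real MySQL deploy task
    return ('db', message_dict['message_id'])

def deploy_es(message_dict):
    # Stand-in for the real ElasticSearch deploy task
    return ('es', message_dict['message_id'])

def process_one(filename):
    # chain: the output of parse feeds the group that follows it
    message_dict = parse(filename)
    # group: both deploy tasks receive the same parsed message;
    # Celery runs them concurrently on separate queues
    return [deploy_db(message_dict), deploy_es(message_dict)]

print(process_one("mail_001"))  # [('db', 'mail_001'), ('es', 'mail_001')]
```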
Email Processing: Usage
● To start a build cycle, this is all that you need to
do:
$ fab workers:start
$ fab process:maildir
Email Processing: What next?
● Implement a “chord”:
– Trigger a task to update an email's status after
successfully being processed and deployed to MySQL
and ElasticSearch
● Handle errors:
– Write to a special log file every time an error occurs with
a custom error handler
● Reporting:
– Detect the completion of processing with a scheduled
task that confirms that all tasks are complete, and email
a report automatically with the number of successful /
failed messages
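A chord is a group plus a callback that fires once every task in the group has completed; Celery coordinates this through the result backend. The control flow can be sketched with the standard library (an illustration of the concept, not Celery's chord API):

```python
from concurrent.futures import ThreadPoolExecutor

def deploy(message_id):
    # Stand-in for a deploy task; returns the message id and a success flag
    return (message_id, True)

def report(results):
    # Chord callback: runs only after every task in the group finishes
    succeeded = sum(1 for _, ok in results if ok)
    return "processed {} messages, {} succeeded".format(len(results), succeeded)

with ThreadPoolExecutor(max_workers=4) as pool:
    # The "group": deploy three messages concurrently
    results = list(pool.map(deploy, ["m1", "m2", "m3"]))

# The "callback": fires with the collected results
summary = report(results)
print(summary)  # processed 3 messages, 3 succeeded
```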
Email Processing: Try it yourself
● All of the source code and instructions for this demo are
available here:
https://github.com/esperdyne/celery-message-processing
● Can be used as boilerplate for an unrelated Celery
project
● Fork, experiment, ask questions, etc.
One More Thing: Celery Flower
● There is a tool that provides real-time monitoring for your
Celery instance, called “Flower”:
https://github.com/mher/flower
Any Questions?
(Can you spare a guess as to why that question mark isn't made out of celery?)