Resilient Data Pipelines with Docker and Docker-Compose
Heeren Sharma
Software Engineer @Cliqz
heeren@cliqz.com
@heerensharma
80+ - Team size
500,000 - DAU
3 Million+ - Downloads (Germany only)
1 billion+ - Indexed pages (We do not believe in indexing the web.)
5 TB - In-Memory indexed (Based on open source and in-house built NoSQL stores.)
Storyfile
FROM Introduction
RUN Data-pipelines
RUN whats-docker
ADD use-case ./wrap-up
CMD ./demo
“Data and data everywhere”
“Design and development of Data Pipelines”
“Cloud deployment has its own needs”
One’s processed Data is another
system’s input data
What’s that whale named Docker?
"Docker allows you to package an application with all of its dependencies into a
standardised unit for software development."
A little bit more Docker
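• Dockerfile
FROM ubuntu:14.04
# The base image does not ship pip, so install Python and pip first
RUN apt-get update && apt-get install -y python python-pip
RUN pip install requests
ADD . /awesome-code
WORKDIR /awesome-code
CMD python revolutionary_app.py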
$ docker build -t your-awesome-image:v1 .
$ docker run -d -p 80:5000 --name container-name your-awesome-image:v1
$ docker tag your-awesome-image:v1 myregistry.com:8080/your-awesome-image:v1
$ docker push myregistry.com:8080/your-awesome-image:v1
$ docker pull myregistry.com:8080/your-awesome-image:v1
$ docker ps
$ docker inspect <container-name/ID>
$ docker logs <container-name/ID>
$ docker stop <container-name/ID>
$ docker exec -it <container-name/ID> your_command
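The `-p 80:5000` mapping publishes container port 5000 on host port 80, so whatever `CMD` starts has to listen on port 5000 on all interfaces. The slides do not show `revolutionary_app.py`; the following is only a minimal standard-library sketch of such an app, and the handler, the `make_server` and `serve` helpers, and the response text are all hypothetical.

```python
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer  # Python 3
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2

CONTAINER_PORT = 5000  # must match the right-hand side of `-p 80:5000`


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello from the container\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the sketch quiet; a real app would log requests.
        pass


def make_server(port=CONTAINER_PORT, host="0.0.0.0"):
    # Bind to 0.0.0.0: a server bound only to 127.0.0.1 inside the
    # container is not reachable through the published port.
    return HTTPServer((host, port), HelloHandler)


def serve():
    # Blocks until the process is stopped, e.g. by `docker stop`.
    make_server().serve_forever()
```

Calling `serve()` at the end of the file would make the `CMD` line start the server; after `docker run -d -p 80:5000 ...`, a request to port 80 on the host should then reach this handler.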
Simplified Use Case
• Streaming data from different sources - Twitter, FB, feeds, and a customised scraping engine
• Different processing engines
• Fast iteration on new requirements
• Resilient system with a focus on easy deployment
Use case - News Articles
• Trending news articles from different domains
• News categorisation
• Relevant news for a search query
• News traffic fluctuates and is highly dynamic
• There is no universal right answer
System Design (1)
[Diagram: Data Stream, Processing Engine, Processed Data (Redis)]
$ docker build -t redismaster:v1 .
$ docker run -d --name processed_data redismaster:v1 redis-server
$ docker build -t datastream:twitter .
$ docker run -d --name twitter_queue datastream:twitter python /code/format_data.py
$ docker build -t processing_engine .
$ docker run -d --name magic_powerhouse --link processed_data:db processing_engine python /code/magic_script.py
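With `--link processed_data:db`, the processing container can reach Redis under the alias `db`. Legacy links also inject environment variables such as `DB_PORT_6379_TCP_ADDR`, provided the Redis image exposes port 6379. The slides do not show `magic_script.py`; this is a sketch of how such a script might resolve the Redis address, with a hypothetical function name and fallback values.

```python
import os


def redis_address(env=None, alias="db", port=6379):
    """Return (host, port) for the linked Redis container.

    Prefers the variables injected by `--link <container>:<alias>` and
    falls back to the alias itself, which Docker adds to /etc/hosts.
    """
    env = os.environ if env is None else env
    # Legacy link variables look like DB_PORT_6379_TCP_ADDR / _PORT.
    prefix = "%s_PORT_%d_TCP" % (alias.upper(), port)
    host = env.get(prefix + "_ADDR", alias)
    resolved_port = int(env.get(prefix + "_PORT", port))
    return host, resolved_port
```

The returned pair can be handed to any Redis client; with the redis-py package installed, for example, `redis.Redis(host=host, port=port)`.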
New State of design
System Design (2)
[Diagram: Data Stream, Processing Engine, Processed Data (Redis), News Fetcher, Requests]
System Design (3)
[Diagram: Data Stream, Processing Engine, Processed Data (Redis), FB Engine, News Fetcher, Requests]
System Design (4)
[Diagram: Data Stream, Processing Engine, Processed Data (Redis), FB Engine, News Fetcher, Requests, NewsLetters Engine, 3rd Party]
docker-compose to the rescue
queue:
  build: .
  command: python /news-swimlane/twitter-queue/read_queue.py
  volumes:
    - /ebs/data:/data
  environment:
    DATA_DIR: /data
    PYTHONPATH: /news-swimlane
  restart: always
update:
  build: .
  command: python /news-swimlane/server-redis/build_redis_exc.py
  volumes:
    - /ebs/data:/data
  links:
    - redismaster
    - fetcher
  environment:
    REDIS_HOST: redismaster_1
    DATA_DIR: /data
    PYTHONPATH: /news-swimlane
  restart: always
redismaster:
  build: redis3-container
  ports:
    - "6379:6379"
  volumes:
    - /ebs/backup:/data
  restart: always
  command: /usr/local/bin/redis-server /master.conf
fetcher:
  build: .
  command: python /news-swimlane/fetcher/server_fetcher.py
  ports:
    - "80:5000"
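The `environment:` entries in the compose file reach each container as ordinary environment variables. The service scripts themselves are not shown in the slides, so the following is only a sketch of how a service such as `update` might read them. `Settings`, `load_settings`, the default values, and the `REDIS_PORT` variable are hypothetical; only `REDIS_HOST` and `DATA_DIR` appear in the compose file.

```python
import os
from collections import namedtuple

Settings = namedtuple("Settings", ["redis_host", "redis_port", "data_dir"])


def load_settings(env=None):
    """Read the values that docker-compose injects via `environment:`."""
    env = os.environ if env is None else env
    return Settings(
        # REDIS_HOST and DATA_DIR are set in the compose file.
        redis_host=env.get("REDIS_HOST", "localhost"),
        # REDIS_PORT is a hypothetical extra variable; 6379 matches the
        # port published by the redismaster service.
        redis_port=int(env.get("REDIS_PORT", "6379")),
        data_dir=env.get("DATA_DIR", "/data"),
    )
```

Keeping configuration in the environment means the same image can run locally and in production; only the compose file changes.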
Key points: Design & Development
• Microservice-oriented design, and hence development
• Old and new components can be realised as Docker containers
• Different containers can readily interact with each other
• Easy to test in a local environment, and no worries if it breaks in production
Deployment
• Just install Docker on the remote machine (to be done with care)
• Docker images can be pushed to a remote repository (or rather, a registry)
• Make your instances autoscale in any cloud IaaS
• If an instance goes down, a new instance just pulls the Docker images and starts the Docker containers
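The recovery step in the last bullet can be sketched as a small boot script for a fresh instance. This is an assumption-laden illustration, not the deployment shown in the talk: `bootstrap_commands` and `bootstrap` are hypothetical helpers, and the sketch assumes the instance already has Docker, docker-compose, and a compose file whose services reference the pulled images.

```python
import subprocess


def bootstrap_commands(images, compose_file="docker-compose.yml"):
    """Build the commands a fresh instance would run at boot."""
    # One `docker pull` per image, in the order given.
    commands = [["docker", "pull", image] for image in images]
    # `up -d` starts all services from the compose file in the background.
    commands.append(["docker-compose", "-f", compose_file, "up", "-d"])
    return commands


def bootstrap(images, run=subprocess.check_call):
    # check_call raises CalledProcessError on a non-zero exit status,
    # so a failed pull aborts the boot instead of starting stale containers.
    for command in bootstrap_commands(images):
        run(command)
```

Hooked into the instance's start-up mechanism (for example a cloud-init or user-data script), a call such as `bootstrap(["myregistry.com:8080/your-awesome-image:v1"])` would bring a replacement instance to the same state as the one that went down.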
Resources
• Docker - https://www.docker.com/whatisdocker
• Boot2docker - https://docs.docker.com/installation/mac/
• Docker-compose - https://docs.docker.com/compose/
• Very short video about Docker (American style) - https://www.youtube.com/watch?v=aLipr7tTuA4
• https://www.youtube.com/watch?v=FdkNAjjO5yQ - Good resource for deeper insight into containers
http://www.cliqz.com/en
THANKQZ

PyconUK-2015
