#1 Data Science Club
sli.do/exponea
17/18 Summer
Batch (Spark) and Streaming (Kafka)
Data Preprocessing
Matus Cimerman
What do we do
FULL-STACK MARKETING CLOUD
SOME NUMBERS
● 150+ employees from 12 countries
● 29 average age
● 8 offices in 5 countries on 4 continents
● 1000%+ growth over the last 2 years
AI TEAM
● Recommendations,
● Propensity to buy during the session,
● Optimal time to send an email,
● Tech stack: Python, Gensim, Kubernetes, Spark, Go, TF, ...
How do we collect data
We don't scrape any websites
Data Sources
Events
● Arbitrary schema-less JSON objects
● Storage: IMF, MongoDB, HDFS
Products
● Columnar storage (CSV, SQL DB, …)
● Storage: Elastic, HDFS
Customers
● Static attributes (age, location, …)
● Dynamic attributes (# page visits, …)
● Storage: MongoDB, IMF, HDFS (parquet)
Long-term storage (archive)
● Events: Newline delimited JSON
● Customers: Parquet file
● Products: CSV, soon BigTable
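As a small illustration of the events archive format, the sketch below reads newline-delimited JSON one record per line. The field names (customer_id, type, properties) and the sample records are assumptions made up for this example, not the actual Exponea schema.

```python
import io
import json


def read_ndjson(stream):
    """Yield one dict per non-empty line of a newline-delimited JSON stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"bad JSON on line {line_number}: {exc}") from exc


# Hypothetical archive content: two events with different shapes.
archive = io.StringIO(
    '{"customer_id": "c1", "type": "page_visit", "properties": {"url": "/home"}}\n'
    "\n"
    '{"customer_id": "c2", "type": "purchase", "properties": {"total": 19.9, "items": 2}}\n'
)
events = list(read_ndjson(archive))
```

Because every line is a complete JSON document, a file in this format can be split on line boundaries and processed in parallel, which suits both batch jobs and replaying the archive as a stream.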
Why this talk at all?
Data preparation & preprocessing make up at least
80% of a data scientist's job
ML Algorithms don't eat raw data
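One common way to turn a schema-less JSON event into something a model can consume is the hashing trick. The sketch below is only an illustration of that general technique, not the preprocessing used in the talk. The event fields and the bucket count are assumptions.

```python
import zlib


def flatten(obj, prefix=""):
    """Flatten nested dicts into {'a.b': value} pairs."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def hash_features(event, n_buckets=16):
    """Map a schema-less event to a fixed-length numeric vector."""
    vector = [0.0] * n_buckets
    for name, value in flatten(event).items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            # Categorical value: hash "name=value" and count it once.
            token, weight = f"{name}={value}", 1.0
        else:
            # Numeric value: hash the field name and add the value itself.
            token, weight = name, float(value)
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[bucket] += weight
    return vector


# Hypothetical event; any nesting of dicts with scalar leaves works.
event = {"type": "purchase", "properties": {"total": 19.9, "items": 2}}
vector = hash_features(event)
```

The vector length stays fixed no matter which fields an event carries, at the cost of occasional collisions between fields that land in the same bucket.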
We expect to grow 10-20x in the upcoming
1-2 years, so scale-out is critical
Batch (Spark)
Batch processing is done mostly by Apache
Spark jobs, sadly written in Python
EASY, 12 LINES OF CODE
+ 500 LOC of boilerplate
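The code shown on this slide is not part of the transcript. As a stand-in, here is a plain-Python sketch of the kind of aggregation a batch job typically performs: grouping a finite set of events by customer and counting event types. It does not use the Spark API, and the field names are assumptions. In Spark the same shape would usually be expressed as a groupBy followed by an aggregation.

```python
from collections import defaultdict


def consolidate(events):
    """Group a finite batch of events by customer and count event types."""
    per_customer = defaultdict(lambda: defaultdict(int))
    for event in events:
        per_customer[event["customer_id"]][event["type"]] += 1
    # Convert to plain dicts so the result is easy to serialize.
    return {cid: dict(counts) for cid, counts in per_customer.items()}


# Hypothetical batch of events.
batch = [
    {"customer_id": "c1", "type": "page_visit"},
    {"customer_id": "c1", "type": "page_visit"},
    {"customer_id": "c1", "type": "purchase"},
    {"customer_id": "c2", "type": "page_visit"},
]
daily = consolidate(batch)
```

Counts like these correspond to the dynamic customer attributes (# page visits, …) mentioned earlier.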
Batch processing and Spark
things to consider first
1. Try Kappa first.
2. If that is not possible, try to implement the Lambda architecture.
3. Combine them all!
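The difference between the two architectures can be sketched in a few lines. In Kappa there is a single processing path and reprocessing means replaying the log. In Lambda a batch view over older data is merged with a speed view over recent data. The functions and the sample log below are hypothetical and only illustrate the idea.

```python
from collections import Counter


def update(view, event):
    """Shared processing logic: count events per customer."""
    view[event["customer_id"]] += 1
    return view


def kappa_view(log):
    """Kappa: one code path; reprocessing means replaying the whole log."""
    view = Counter()
    for event in log:
        update(view, event)
    return view


def lambda_view(batch_events, recent_events):
    """Lambda: merge a batch view over old data with a speed view over recent data."""
    # Batch layer: recomputed periodically over the full history.
    batch_view = Counter(e["customer_id"] for e in batch_events)
    # Speed layer: updated incrementally as new events arrive.
    speed_view = Counter()
    for event in recent_events:
        update(speed_view, event)
    # Serving layer: combine both views at query time.
    return batch_view + speed_view


# Hypothetical log of five events.
log = [{"customer_id": c} for c in ["c1", "c2", "c1", "c3", "c1"]]
kappa = kappa_view(log)
lam = lambda_view(log[:3], log[3:])
```

Both approaches give the same answer here. The practical difference is that Lambda keeps two implementations of the logic in sync, while Kappa needs a log that can be replayed from the start.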
Streaming (Kafka)
Streaming ETL
EASY, 8 LINES OF CODE
+ 500 LOC of boilerplate
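The code shown on this slide is not part of the transcript. The sketch below shows the general shape of a streaming ETL step using plain Python generators, so that each message flows through extract, transform and load one at a time. No Kafka client is used: the messages list stands in for a topic, the sink list stands in for a downstream store, and the field names are assumptions.

```python
import json


def parse(lines):
    """Extract: decode each raw message, skipping ones that are not valid JSON."""
    for raw in lines:
        try:
            yield json.loads(raw)
        except json.JSONDecodeError:
            continue  # a real pipeline might route these elsewhere for inspection


def transform(events):
    """Transform: keep purchases and project them to a flat record."""
    for event in events:
        if event.get("type") == "purchase":
            yield {
                "customer_id": event["customer_id"],
                "total": float(event.get("properties", {}).get("total", 0.0)),
            }


def load(records, sink):
    """Load: append each record to the sink as it arrives."""
    for record in records:
        sink.append(record)


# Hypothetical messages, including one malformed record.
messages = [
    '{"customer_id": "c1", "type": "page_visit"}',
    "not json",
    '{"customer_id": "c2", "type": "purchase", "properties": {"total": "19.9"}}',
]
sink = []
load(transform(parse(iter(messages))), sink)
```

Because the stages are generators, nothing is buffered between them, which mirrors how a record-at-a-time stream processor handles an unbounded input.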
Infrastructure
Legacy bare-metal
Google Cloud Platform for the Win
Google Cloud Platform
more moving parts
● Inputs for ML algorithms from Storage.
● Training WORM¹ to Storage.
● Kubernetes for orchestration.
● Large scale models deployment, K8s.
● Easy monitoring thanks to Stackdriver.
¹ Write once, read many
Lessons learned
PROS
● Unified
● Scale-out
● No vendor lock-in
● Python
CONS
● Distributed debugging
● Cluster size guessing game
● Python
Future
● Spark Streaming + for all transformations.
● Batch in Spark for daily data consolidation.
● Kafka Streams for real-time use-cases.
That's all folks
Matus Cimerman
matus.cimerman@exponea.com
www.exponea.com/internship
PALO ALTO, CA
456 University Ave
Palo Alto, CA 94301
+1 (650) 440-7297
PRAGUE, CZ
Rohanské nábřeží 687/29,
186 00 Prague, Czechia
+420 601 372 909
BRATISLAVA, SK
Twin City B, Mlynské Nivy 12,
821 09, Bratislava, Slovakia
+421 948 127 332
WARSAW, PL
Postępu 14, 02-676
Warsaw, PL
+48 603 663 766
MOSCOW, RU
10c1 Kozhevnicheskaya Street
115114, Moscow, RU
+7 (495) 120 26 53
LONDON, UK
41 Corsham Street
London N1 6DR, UK
+44 (0) 203 086 8894
MANCHESTER, UK
1 Spinningfields, Quay Street
Manchester M3 3JE
+44 (0) 203 086 8894
EDINBURGH, UK
20/6 Fountainhall Road
Edinburgh EH9 2NN
+44 (0) 203 086 8894
www.exponea.com
