Get rid of traditional ETL,
Move to Spark!
Bas Geerdink
ING
Bas Geerdink
• Chapter Lead in Analytics area at ING
• Master's degree in Artificial Intelligence and
Informatics
• Spark Certified Developer
• @bgeerdink
• https://www.linkedin.com/in/geerdink
Who am I?
Spark Summit EU talk by Bas Geerdink
Definition of ETL
“A repeatable programmed data movement”
Extract: get data from source systems
Transform: filter/map/enrich/combine/validate/sort/…
Load: store data in a data warehouse or data mart
Use cases:
– Data loading
– Data migration
– Data ingestion
– …
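To make the three steps concrete, here is a minimal sketch in plain Python. It assumes a small in-memory CSV as the source system and an in-memory SQLite table as the data mart; the column names, the `sales` table and the function names are illustrative assumptions, not taken from the talk.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from a source system.
SOURCE_CSV = """customer,amount
alice,10.50
bob,-3.00
carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: map to typed values, filter invalid rows, sort."""
    cleaned = [
        {"customer": r["customer"].title(), "amount": float(r["amount"])}
        for r in rows
    ]
    valid = [r for r in cleaned if r["amount"] > 0]  # drop non-positive amounts
    return sorted(valid, key=lambda r: r["amount"], reverse=True)


def load(rows, conn):
    """Load: store the result in a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer, :amount)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order and return what ended up in the target."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT customer, amount FROM sales ORDER BY amount DESC"
    ).fetchall()
```

Calling `run_etl(SOURCE_CSV, sqlite3.connect(":memory:"))` runs the whole pipeline once. Each step is an ordinary function, so it can be unit-tested on its own.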
ETL Tools
• IBM InfoSphere DataStage
• Oracle Warehouse Builder
• Pervasive Data Integrator
• PowerCenter Informatica
• SAS Data Management
• Talend Open Studio
• SAP Data Services
• Microsoft SSIS
• Syncsort DMX
• CloverETL
• Jaspersoft
• Pentaho
• NiFi
What has changed?
Business Intelligence → Big Data
Data Warehouse → Data Lake
Applications → Microservices
ETL → …
The Future of ETL Tools
• Only develop connectors for integration?
• Rebuild entire back-end to Hadoop/Spark/Flink?
• Provide a GUI with code generation?
A Quiz!
What is the most difficult part for developers?
What is the most resource-intensive part?
E / T / L
ETL Hell
• Data getting out of sync
• Performance issues
• Waste of server resources (peak performance)
• Plain-text code in hidden stages
• Click, click, click, click, click (RSI danger!)
• CSV files are not type-safe
• All-or-nothing approach in batch jobs
• Legacy code
• …
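Two of these pain points, untyped CSV input and all-or-nothing batches, can be softened in code. A minimal sketch in plain Python, assuming a hypothetical `Payment` record with `account` and `amount` columns: each line is parsed into a typed record, and lines that fail validation are collected as rejects instead of aborting the whole batch.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class Payment:
    """Typed record; the field names are assumptions for illustration."""
    account: str
    amount: Decimal


def parse_payments(text):
    """Return (good_records, rejected_rows) for CSV text with a header line."""
    good, rejected = [], []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        try:
            account = row["account"].strip()
            if not account:
                raise ValueError("empty account")
            good.append(Payment(account, Decimal(row["amount"])))
        except (KeyError, ValueError, TypeError, AttributeError, InvalidOperation) as err:
            # Keep the bad row and the reason; the rest of the batch continues.
            rejected.append((line_no, row, str(err)))
    return good, rejected
```

The rejects can be written to a separate location for inspection, so one malformed line does not block the other records.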
Is NO-ETL The Future?
• Why move data around?
• Alternative: keep data at the source, make it available through APIs
(microservices architecture)
• ETL is an intermediary step, and at each ETL step you can
introduce errors and risk:
– ETL can lose data
– ETL can duplicate data after failover
– ETL tools can cost millions of dollars
– ETL decreases throughput
– ETL increases the complexity of the pipeline
(source: noetl.org)
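Where data does get copied, the duplication risk after a failover is commonly reduced by making the load step idempotent, so that replaying a batch overwrites rows instead of appending them. A minimal sketch, assuming each record carries a unique key (here a hypothetical `event_id`) and using an in-memory SQLite table as a stand-in for the real target:

```python
import sqlite3


def load_idempotent(conn, events):
    """Upsert (event_id, payload) pairs; replaying a batch adds no duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (event_id TEXT PRIMARY KEY, payload TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO events (event_id, payload) VALUES (?, ?)",
        events,
    )
    conn.commit()


def count_events(conn):
    """Number of rows currently in the target table."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Loading the same batch twice leaves the row count unchanged, which is the property a retry after failover relies on.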
Intermediate: use Spark for ETL
• Parallel processing is built-in
• Runs on top of Hadoop, which is probably your data
source anyway
• It’s just Scala code (or Python, or Java)
• Machine learning can be thrown in to do more interesting
things
• Good support for security, unit testing, performance
measurement, exception handling, monitoring, etc.
Code example #1: EXTRACT
Get data from HDFS
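One way this extract step could look in PySpark (a sketch, not the code from the talk). The HDFS path, the CSV format and the function name are assumptions for illustration; the function takes an existing `SparkSession`, so it can be called from a job or a notebook.

```python
def extract_csv(spark, path="hdfs:///data/landing/transactions.csv"):
    """Extract: read a CSV file with a header row from HDFS into a DataFrame.

    `spark` is an existing pyspark.sql.SparkSession; the default path is
    hypothetical.
    """
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # let Spark guess the column types
        .csv(path)
    )

# Usage, assuming PySpark is installed and the cluster can reach HDFS:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("etl-extract").getOrCreate()
#   transactions = extract_csv(spark)
```

Schema inference costs an extra pass over the data; for recurring jobs an explicit schema is usually the safer choice.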
Code example #2: TRANSFORM
Filter, Map, Join
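A hedged PySpark sketch of a filter, map and join step (not the code from the talk). It assumes two DataFrames, `transactions` and `customers`, that share a `customer_id` column; the other column names are also assumptions. SQL expression strings are used so the function needs no imports of its own.

```python
def transform(transactions, customers):
    """Transform: filter, map and join two Spark DataFrames.

    The column names (amount, currency, customer_id) are hypothetical.
    """
    return (
        transactions
        .filter("amount > 0")              # filter: keep rows with a positive amount
        .selectExpr(                       # map: derive the output columns
            "customer_id",
            "amount",
            "upper(currency) AS currency",
        )
        .join(customers, on="customer_id", how="inner")  # join: enrich with customer data
    )
```

Spark evaluates these steps lazily, so nothing runs until an action such as a write is triggered.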
Code example #3: LOAD
Store transformed data in Cassandra
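A possible shape for this load step in PySpark (a sketch, not the code from the talk). It assumes the DataStax Spark Cassandra Connector is on the Spark classpath and that the target table already exists with matching columns; the keyspace and table names are hypothetical.

```python
def load_to_cassandra(df, keyspace="analytics", table="transactions"):
    """Load: append the rows of a Spark DataFrame to a Cassandra table.

    Requires the spark-cassandra-connector package; the default keyspace
    and table names are assumptions for illustration.
    """
    (
        df.write
        .format("org.apache.spark.sql.cassandra")  # data source provided by the connector
        .options(keyspace=keyspace, table=table)
        .mode("append")                            # add rows, keep existing data
        .save()
    )
```

Because Cassandra writes are upserts on the primary key, re-running the same load overwrites rows with the same key rather than duplicating them.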
Code example #4: Continuous ETL
Stream from file or message bus
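A sketch of continuous ETL using Spark Structured Streaming, which may differ from the streaming API shown in the talk. It assumes Spark 2.3 or later (for the DDL schema string), CSV files arriving in an input directory, and Parquet output; the directory arguments, the column names and the function name are illustrative.

```python
def start_continuous_etl(spark, in_dir, out_dir, checkpoint_dir):
    """Continuously pick up new CSV files, filter them, and write Parquet.

    `spark` is an existing pyspark.sql.SparkSession. Returns the running
    StreamingQuery; the schema below is hypothetical.
    """
    stream = (
        spark.readStream
        .schema("customer_id STRING, amount DOUBLE, currency STRING")  # file streams need a schema
        .option("header", "true")
        .csv(in_dir)
    )
    cleaned = stream.filter("amount > 0")  # transform each micro-batch as it arrives
    return (
        cleaned.writeStream
        .format("parquet")
        .option("path", out_dir)
        .option("checkpointLocation", checkpoint_dir)  # lets the query resume after a restart
        .outputMode("append")
        .start()
    )

# Usage, assuming a running SparkSession named `spark`:
#   query = start_continuous_etl(spark, "/data/in", "/data/out", "/data/chk")
#   query.awaitTermination()
```

For a message bus such as Kafka, the source would instead be `spark.readStream.format("kafka")` with the `kafka.bootstrap.servers` and `subscribe` options; the rest of the pipeline keeps the same shape.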
What to choose?
• Technology is just… technology
• Choose a mindset / culture / way of working
• Do you really need a full Hadoop/Spark cluster for your
average ETL?
• Do you really need an expensive vendor enterprise tool
for your ETL?
Considerations…
• Testing (unit, functional, performance)
• Need for visualization and explanation
• Flexibility: Continuous Delivery, Automation, Reusability
• Simplification leads to fewer errors
• Tool vs framework
• A hybrid solution? E.g. code generation tools
Key takeaways
1. Pick one: ETL, ELT, ELTL, …
2. Treat all data equally: batch and stream
3. Continuous ETL: don’t wait for a phase to complete
4. Don’t just transform; enrich, alert, predict
5. Build for scale: distribute data and logic
6. Automate everything
7. Think about NoETL: each copy is a risk!
THANK YOU!
ING is hiring
#SparkSummit
