Spark Summit EU talk by Bas Geerdink

Get rid of traditional ETL,
Move to Spark!
Bas Geerdink
ING

Bas Geerdink
• Chapter Lead in Analytics area at ING
• Master's degree in Artificial Intelligence and Informatics
• Spark Certified Developer
• @bgeerdink
• https://www.linkedin.com/in/geerdink
Who am I?

Definition of ETL
“A repeatable programmed data movement”
Extract: get data from source systems
Transform: filter/map/enrich/combine/validate/sort/…
Load: store data in a data warehouse or data mart
Use cases:
– Data loading
– Data migration
– Data ingestion
– …

ETL Tools
• IBM InfoSphere DataStage
• Oracle Warehouse Builder
• Pervasive Data Integrator
• PowerCenter Informatica
• SAS Data Management
• Talend Open Studio
• SAP Data Services
• Microsoft SSIS
• Syncsort DMX
• CloverETL
• Jaspersoft
• Pentaho
• NiFi

What has changed?
Business Intelligence → Big Data
Data Warehouse → Data Lake
Applications → Microservices
ETL → …

The Future of ETL Tools
• Only develop connectors for integration?
• Rebuild entire back-end to Hadoop/Spark/Flink?
• Provide a GUI with code generation?

A Quiz!
What is the most difficult part for developers?
What is the most resource intensive part?
E / T / L

ETL Hell
• Data getting out of sync
• Performance issues
• Waste of server resources (peak performance)
• Plain-text code in hidden stages
• Click, click, click, click, click (RSI danger!)
• CSV files are not type-safe
• All-or-nothing approach in batch jobs
• Legacy code
• …
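Two of the points above (CSV files are not type-safe, and batch jobs are all-or-nothing) can be illustrated with a small standard-library Python sketch. It is not from the talk: the `Transaction` record, its column names, and the sample data are assumptions made for illustration. Each row is converted to a typed record, and rows that fail conversion are collected as rejects instead of aborting the whole batch.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record layout, chosen only for this example
    customer_id: int
    amount: float
    booked_on: date


def parse_rows(csv_text):
    """Parse CSV text into typed records, collecting bad rows instead of failing the batch."""
    good, rejected = [], []
    # start=2 because line 1 of the file is the header row
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        try:
            good.append(Transaction(
                customer_id=int(row["customer_id"]),
                amount=float(row["amount"]),
                booked_on=date.fromisoformat(row["booked_on"]),
            ))
        except (KeyError, TypeError, ValueError) as err:
            # Keep the line number and reason so the reject can be inspected later
            rejected.append((line_no, row, str(err)))
    return good, rejected


sample = (
    "customer_id,amount,booked_on\n"
    "1,9.95,2016-10-26\n"
    "x,12.00,2016-10-26\n"
    "2,abc,2016-10-27\n"
    "3,4.50,2016-10-27\n"
)
good, rejected = parse_rows(sample)
print(len(good), "parsed;", len(rejected), "rejected")
```

The rejects list keeps the good rows flowing while still making every bad row traceable to its line in the source file.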

Is NO-ETL The Future?
• Why move data around?
• Alternative: keep data at the source, make it available in APIs (microservices architecture)
• ETL is an intermediary step, and at each ETL step you can
introduce errors and risk:
– ETL can lose data
– ETL can duplicate data after failover
– ETL tools can cost millions of dollars
– ETL decreases throughput
– ETL increases the complexity of the pipeline
(source: noetl.org)
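The duplication risk listed above typically arises when a load step is retried after a failure. One common mitigation, shown here as a sketch and not taken from the talk or from noetl.org, is to key every write on a primary key so that replaying a batch overwrites rows instead of appending them. The table name, columns, and in-memory SQLite database are assumptions for illustration.

```python
import sqlite3


def load_idempotent(conn, rows):
    """Write (id, amount) rows keyed on id, so replaying a batch cannot duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions (id INTEGER PRIMARY KEY, amount REAL)"
    )
    with conn:  # one transaction per batch: commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO transactions (id, amount) VALUES (?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
batch = [(1, 9.95), (2, 4.50)]
load_idempotent(conn, batch)
load_idempotent(conn, batch)  # simulated retry after a failover
count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(count)
```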

Intermediate: use Spark for ETL
• Parallel processing is built-in
• Runs on top of Hadoop, which is probably your data
source anyway
• It’s just Scala code (or Python, or Java)
• Machine learning can be thrown in to do more interesting
things
• Good support for security, unit testing, performance
measurement, exception handling, monitoring, etc.

Code example #1: EXTRACT
Get data from HDFS
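The code shown on this slide did not survive in the text version. As a stand-in, here is a minimal PySpark sketch of an extract step; it is not the speaker's original code. It assumes a CSV file with a header row, and the function name `extract_transactions` and the HDFS path are hypothetical.

```python
def extract_transactions(spark, path="hdfs:///data/raw/transactions.csv"):
    """Read a CSV file from HDFS into a Spark DataFrame.

    `spark` is an existing SparkSession; the default path is a placeholder.
    """
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # let Spark infer column types (costs an extra pass)
        .csv(path)
    )
```

Given a session created with `SparkSession.builder.getOrCreate()`, calling `extract_transactions(spark)` returns a DataFrame that the transform step can work on.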

Code example #2: TRANSFORM
Filter, Map, Join
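The original code for this slide is also missing from the text version. The sketch below shows what a filter, map, and join could look like on PySpark DataFrames; it is an illustration under assumptions, not the speaker's code. The column names (`customer_id`, `amount`) and the function name `transform` are hypothetical.

```python
def transform(transactions, customers, min_amount=0):
    """Filter, map and join two Spark DataFrames.

    Column names (customer_id, amount) are illustrative assumptions.
    """
    # Filter: keep only rows above the threshold
    valid = transactions.filter(f"amount > {min_amount}")
    # Map: derive a new column from an existing one
    mapped = valid.selectExpr(
        "customer_id",
        "amount",
        "CAST(amount * 100 AS BIGINT) AS amount_cents",
    )
    # Join: enrich each transaction with the matching customer row
    return mapped.join(customers, on="customer_id", how="inner")
```

Because the function only takes DataFrames in and returns a DataFrame, it can be unit tested on small in-memory DataFrames without touching HDFS.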

Code example #3: LOAD
Store transformed data in Cassandra
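The slide's code is not present in the text version. A possible load step, sketched here with the DataFrame writer, assumes the DataStax spark-cassandra-connector package is on the classpath, that `spark.cassandra.connection.host` is configured on the session, and that the target table already exists. The keyspace and table names are hypothetical.

```python
def load_to_cassandra(df, keyspace="analytics", table="transactions_enriched"):
    """Append a Spark DataFrame to an existing Cassandra table.

    Assumes the spark-cassandra-connector is available; names are placeholders.
    """
    (
        df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .mode("append")  # add rows; Cassandra overwrites rows with the same primary key
        .save()
    )
```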

Code example #4: Continuous ETL
Stream from file or message bus
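The code for this slide is missing as well. The sketch below uses Spark Structured Streaming with Kafka as the message bus, which is one possible reading of the slide and not necessarily the API the speaker used. It assumes the `spark-sql-kafka` package is available; the broker address, topic name, and function name are hypothetical, and the console sink stands in for a real target.

```python
def start_continuous_etl(spark, servers="localhost:9092", topic="transactions"):
    """Start a streaming query that reads from Kafka and prints decoded messages."""
    # Extract: subscribe to a topic on the message bus
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .load()
    )
    # Transform: Kafka delivers bytes, so decode the message value to a string
    decoded = events.selectExpr("CAST(value AS STRING) AS payload")
    # Load: write each micro-batch to the console as it arrives
    return (
        decoded.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )
```

The function returns a streaming query handle; calling `awaitTermination()` on it keeps the job running until it is stopped.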

What to choose?
• Technology is just… technology
• Choose a mindset / culture / way of working
• Do you really need a full Hadoop/Spark cluster for your
average ETL?
• Do you really need an expensive vendor enterprise tool
for your ETL?

Considerations…
• Testing (unit, functional, performance)
• Need for visualization and explanation
• Flexibility: Continuous Delivery, Automation, Reusability
• Simplification leads to fewer errors
• Tool vs framework
• A hybrid solution? E.g. code generation tools

Key takeaways
1. Pick one: ETL, ELT, ELTL, …
2. Treat all data equally: batch and stream
3. Continuous ETL: don’t wait for a phase to complete
4. Don’t just transform; enrich, alert, predict
5. Build for scale: distribute data and logic
6. Automate everything
7. Think about NoETL: each copy is a risk!

THANK YOU!
ING is hiring
#SparkSummit
