Pentaho Data Integration/Kettle
● Dan Moore
● 8z Real Estate
● Kettle user for two years
Questions
● Who has used a relational database?
● Who has written scripts or java code to munge
data from one source and load it to another?
– What did you use?
– Scripts
– Custom java code
– ETL tool
What is Kettle?
● Batch data integration and processing tool
written in Java
● Exists to retrieve, process and load data
● ETL
– Extract, transform and load
● Synonymous with PDI (Pentaho Data Integration)
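
For readers who have not written ETL by hand, here is a minimal sketch of the extract, transform and load stages as a plain script. It is not Kettle code and not from the talk: the CSV contents, the `product` table and every function name are assumptions made up for illustration, and SQLite stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or another database.
SOURCE_CSV = """id,name,price
1, Widget ,9.99
2,Gadget,
3,Gizmo,24.50
"""


def extract(text):
    """Extract: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: trim names, convert types, drop rows that have no price."""
    for row in rows:
        if not row["price"].strip():
            continue
        yield (int(row["id"]), row["name"].strip(), float(row["price"]))


def load(conn, records):
    """Load: write the cleaned records to the (hypothetical) product table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product "
        "(id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(conn, text=SOURCE_CSV):
    """Chain the three stages and return what ended up in the target."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT id, name, price FROM product ORDER BY id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads the two rows that have a price. A script like this has to hand-write the table DDL, type conversion and connection handling, which are among the chores the later slides argue a tool takes over.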
What is Kettle good for
● Mirroring data from master to slave
● Syncing two data sources
● Processing data retrieved from multiple sources
and pushed to multiple destinations
● Loading data to RDBMS
● Datamart/data warehouse
– Dimension lookup/update step
● Graphical manipulation of data
Alternatives
● Code
– Custom java
– Spring batch
● Scripts
– perl, python, shell, etc
– Possibly + db loader tool and cron
● Commercial ETL tools
– Oracle Warehouse Builder
– Datastage
– Informatica
– SQL Server Integration services
● Open source ETL tools:
– Talend
– KETL
– Clover.ETL
● Special case tools
– SymmetricDS
– Db replication
Why Kettle is better
● Higher level than code
– Graphical interface
– No connection pooling to worry about
– No DDL to write
– Validation/business rules
● Well tested full suite of components
● Data analysis tools
– Preview
– Data profiling with data cleaner (add on)
● Free (as in beer and speech)
– Two editions
– GPLv2
● Performant?
– Developer vs computer performant
– Depends, right?
– Sitemailsame job 10k rows/second for 125M rows
● Leverage java
– jvm tuning skills
– java libraries and logic (in jars)
Data sources
● Files
● Databases
● No SQL
● REST
● XML
● Hadoop/HBase
● JSON
● Excel
● EDI
● RSS
● Google Analytics
Kettle concepts
● Repository
● Rows/Stream
● Steps
● Job
● Transformation
Demo 1: one way sync
● Sync tables
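
The demo itself is not captured in these slides. As a rough sketch of what a one-way table sync has to do, the script below compares source and target by key and then inserts, updates and deletes on the target only. The `customer` table, its columns and the helper names are assumptions for illustration, and this is hand-written Python rather than a Kettle transformation.

```python
import sqlite3


def sync_table(source, target, table="customer"):
    """Make target.<table> match source.<table>, keyed on id (one-way)."""
    src = dict(source.execute(f"SELECT id, name FROM {table}"))
    dst = dict(target.execute(f"SELECT id, name FROM {table}"))

    # Classify every key: new in source, changed, or gone from source.
    inserts = [(k, v) for k, v in src.items() if k not in dst]
    updates = [(v, k) for k, v in src.items() if k in dst and dst[k] != v]
    deletes = [(k,) for k in dst if k not in src]

    target.executemany(f"INSERT INTO {table} (id, name) VALUES (?, ?)", inserts)
    target.executemany(f"UPDATE {table} SET name = ? WHERE id = ?", updates)
    target.executemany(f"DELETE FROM {table} WHERE id = ?", deletes)
    target.commit()
    return len(inserts), len(updates), len(deletes)


def make_db(rows):
    """Build an in-memory database with a hypothetical customer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    return conn
```

This version reads both tables fully into memory, which is fine for a sketch but would not suit the 125M-row jobs mentioned earlier.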
Demo 2: processing
● Process data from one table and replace some
values, filter some values
● Lookup table
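
One way to picture this demo, and the rows/stream/steps concepts behind it, is as a chain of generators where each function plays the role of a step and rows flow through one at a time. This is only an analogy in Python: the field names, the lookup table and the function names are hypothetical, and they do not correspond to Kettle's own step implementations.

```python
# Hypothetical lookup table mapping state codes to names.
STATE_NAMES = {"CO": "Colorado", "WY": "Wyoming"}


def replace_values(rows, field, old, new):
    """Step 1: replace one value of a field with another."""
    for row in rows:
        if row[field] == old:
            yield {**row, field: new}
        else:
            yield row


def filter_rows(rows, predicate):
    """Step 2: pass along only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def lookup(rows, key_field, table, out_field, default=None):
    """Step 3: add a field by looking up a key in a reference table."""
    for row in rows:
        yield {**row, out_field: table.get(row[key_field], default)}


def run(rows):
    """Wire the steps together; rows stream through lazily."""
    stream = replace_values(rows, "status", "N/A", "unknown")
    stream = filter_rows(stream, lambda r: r["price"] > 0)
    stream = lookup(stream, "state", STATE_NAMES, "state_name", default="?")
    return list(stream)
```

Because each step is a generator, no step waits for the previous one to finish the whole data set, which is roughly the idea behind a row stream.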
Demo 3: log file processing
● Load apache logs for analysis
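
The slides do not say which log format the demo used. Assuming Apache's Common Log Format, the parsing half of such a load could be sketched as below. The regular expression and the `parse_line` name are illustrative assumptions, and lines in other formats (for example the combined format with referrer and user agent) would need a longer pattern.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one log line into a dict of typed fields, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    row = m.groupdict()
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["status"] = int(row["status"])
    # Apache writes "-" when no bytes were sent.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row
```

Returning `None` for unparseable lines lets the caller decide whether to skip them or route them to an error file, which ties in with the error-handling topic later in the deck.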
What it is not good for
● User interfaces/user interaction
● Small data sets
– 500 (from experience)
● Web applications
● One off processes?
– One off becomes regular
Who uses
● Survey results
– ~20 people
● Downloads: 110K for Kettle 4.4
– Since Nov 2012
● Our specific use
– MLS data
● Different data source formats and types (jdbc, local csv, ftp)
– Public records data
● Fixed width files
Larger picture
● Kettle 10 years old
– joined Pentaho about 7 years ago
● Open source, at version 4.4
– GPLv2 license
– EE edition available
● BI suite
– Reporting
– Analytics
– Dashboards
– Machine Learning (weka)
Kettle tools
● Spoon
● Kitchen
● Pan
● Carte
– Clustering tool
Advanced topics
● Existing java logic
– Embedded
– Polygon example
– Demo 4
● Deployment
– Variables
– Config files are your friend
● Mapping/Parameterization
– Subroutines of logic
Advanced Topics Continued
● Testing
– Who tests
● Version control
– Who uses version control
● Error handling
– Email
– Log files
Getting started
● Download
– sourceforge
● Includes over 150 example transformations
– Mysql 3.14 jdbc driver
● Helpful sites
– Forums: http://forums.pentaho.com/forumdisplay.php?135-Pentaho-Data-Integration-Kettle
– Wiki: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps
– Testing: http://www.mooreds.com/wordpress/pentaho-kettle-testing
● Helpful books
– Pentaho Kettle Solutions: Casters, Bouman, van Dongen
● Barely scratched surface
● Don't like tools that turn me into a mechanic

An Introduction to Pentaho Kettle
