Lighthouse
an open-source toolkit to build data lakes
Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris Peeters
Wow, you nerds are really on to something.
How did you clean the input data? Can I see the logs?
Can you run that model every night?
Can I explore that data real quick?
Can you run the model live on our website?
Can we also start loading clickstream and IoT data?
Can you add this data source from marketing?
Can I reuse that data for another crazy idea I’m having?
What’s the uptime SLA on that thing?
A big data system needs the following capabilities
[Diagram components: data broker; batch processing; real-time processing; insight; serving layer (database, API, alerts); data exploration & data science; data lake]
Putting some technologies on each capability
Capabilities: data exploration & data science, batch processing, real-time processing, serving layer
Technologies shown: S3, Athena, Python, RDS, Lambda, API Gateway
Store first, ask questions later
Reuse data across use cases
Do data analytics
Lighthouse is an open-source toolkit to build data lakes
"That was an order!" (DAS WAR EIN BEFEHL!)
Your code is really poor quality, sir
Your test coverage is low, sir
Configuration in JSON? What is this? 1923?
You didn’t even define clear interfaces, sir
Can I at least help build the new version?
No. No, no. Nope nope nope nope nope nope nope nope nope
Lighthouse is an open-source toolkit to build data lakes
- Define data lake
- Construct data pipelines
- Batteries included*
*pull requests welcome
90% of big data projects
Welcome to my swamp
/
/clean /master /metrics /models
Clickstream
CRM
Customer Service
ERP
Invoicing
Social media
Customer360
Asset utilisation
Factory efficiency
Product revenue
Web sessions
Customer Loyalty
Stock levels
Churn prediction
Next product to buy
Asset optimisation
Preventative maintenance
I have a bunch of raw data sources, in this case CSV.
I will clean them and store them in the default format, ORC.
Finally, I will build a single view of airplanes, and I will make it available in Hive.
Uniquely identifying and reusing data sources (from HDFS, Hive, JDBC, …)
Structured storage formats such as ORC
Consistent reads and writes to the data lake
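
The three points above can be pictured with a minimal sketch in plain Python. This is not Lighthouse's actual API, which these slides do not show: the names `InMemoryCsvLink` and `DataLake`, and the keys `raw.airplanes` and `clean.airplanes`, are hypothetical and chosen only for illustration. An in-memory CSV string stands in for a source on HDFS, Hive, or JDBC.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]


class InMemoryCsvLink:
    """Hypothetical data link: knows how to read and write one data source.

    A CSV string held in memory stands in for a real storage backend.
    """

    def __init__(self, initial_text: str = "") -> None:
        self._text = initial_text

    def read(self) -> List[Row]:
        return list(csv.DictReader(io.StringIO(self._text)))

    def write(self, rows: List[Row]) -> None:
        if not rows:
            self._text = ""
            return
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        self._text = buffer.getvalue()


class DataLake:
    """Hypothetical registry: every data source is reachable by one unique key."""

    def __init__(self) -> None:
        self._links: Dict[str, InMemoryCsvLink] = {}

    def register(self, uid: str, link: InMemoryCsvLink) -> None:
        # Refusing duplicates keeps each identifier unique across the lake.
        if uid in self._links:
            raise ValueError(f"duplicate data source id: {uid}")
        self._links[uid] = link

    def read(self, uid: str) -> List[Row]:
        return self._links[uid].read()

    def write(self, uid: str, rows: List[Row]) -> None:
        self._links[uid].write(rows)


lake = DataLake()
lake.register("raw.airplanes", InMemoryCsvLink("tail_number,model\nN123, B737 \n"))
lake.register("clean.airplanes", InMemoryCsvLink())

# Every read and write goes through the registry, never through ad hoc paths.
cleaned = [{k: v.strip() for k, v in row.items()} for row in lake.read("raw.airplanes")]
lake.write("clean.airplanes", cleaned)
```

Because callers only ever see the key, the storage behind `clean.airplanes` could be swapped (for example to ORC) without touching the code that reads it.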
Clean each data source
Combine the two weather data sources
Build a single view of airplanes, including weather data
Structured way of building data pipelines
Each individual transform is easy to test
Separate how to get the data from what you do with it
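
A minimal sketch of that separation, in plain Python rather than Spark, and not taken from Lighthouse itself. The function names, column names, and sample rows are all hypothetical. The transforms are pure functions, so each can be tested alone, and the pipeline receives a `read` callable so that the data source can be swapped for test fixtures.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Rows = List[Row]


# What you do with the data: pure functions that are easy to unit test.

def clean(rows: Rows) -> Rows:
    """Strip whitespace from every value and drop rows that contain empty values."""
    stripped = [{k: v.strip() for k, v in row.items()} for row in rows]
    return [row for row in stripped if all(row.values())]


def combine_weather(stations: Rows, daily: Rows) -> Rows:
    """Join the two weather sources on the (assumed) 'station' column."""
    by_station = {row["station"]: row for row in stations}
    return [
        {**by_station[row["station"]], **row}
        for row in daily
        if row["station"] in by_station
    ]


def airplane_view(airplanes: Rows, weather: Rows) -> Rows:
    """Single view of airplanes, enriched with weather at their (assumed) airport."""
    by_airport = {row["airport"]: row for row in weather}
    return [
        {**plane, "temperature": by_airport[plane["airport"]]["temperature"]}
        for plane in airplanes
        if plane["airport"] in by_airport
    ]


# How you get the data: injected, so production and tests can differ.

def run_pipeline(read: Callable[[str], Rows]) -> Rows:
    airplanes = clean(read("airplanes"))
    weather = combine_weather(
        clean(read("weather_stations")), clean(read("daily_weather"))
    )
    return airplane_view(airplanes, weather)


# In a test, a dictionary lookup replaces HDFS, Hive, or JDBC.
fake_sources = {
    "airplanes": [{"tail_number": " N123 ", "airport": "BRU"}],
    "weather_stations": [{"station": "S1", "airport": "BRU"}],
    "daily_weather": [
        {"station": "S1", "temperature": "12"},
        {"station": "S9", "temperature": ""},
    ],
}
result = run_pipeline(fake_sources.__getitem__)
```

The same `run_pipeline` could be handed a reader backed by real storage; none of the transforms would change.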
Batteries included
Parameter store
- Store credentials and config parameters
- Use AWS Parameter Store or Filesystem Vault
Testing functions
- Compare datasets
- Pretty-print data frames
- Compare columns
- Spark unit tests
Demo project
- Simple setup
- Data lake & data pipelines
- Copy & paste for your needs
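
To give a feel for what "compare datasets" and "pretty-print" helpers do in a test, here is a minimal sketch in plain Python. It does not reproduce Lighthouse's testing functions, whose signatures are not shown in these slides. `diff_datasets` and `pretty_print` are hypothetical names, and lists of dictionaries stand in for Spark data frames.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Dict[str, object]


def _freeze(row: Row) -> Tuple:
    """Turn a row into a hashable, column-order-independent key."""
    return tuple(sorted(row.items(), key=lambda item: item[0]))


def diff_datasets(actual: List[Row], expected: List[Row]) -> Dict[str, List[Row]]:
    """Compare two datasets while ignoring row order but respecting duplicates."""
    actual_counts = Counter(_freeze(row) for row in actual)
    expected_counts = Counter(_freeze(row) for row in expected)
    missing = expected_counts - actual_counts
    unexpected = actual_counts - expected_counts
    return {
        "missing": [dict(row) for row in missing.elements()],
        "unexpected": [dict(row) for row in unexpected.elements()],
    }


def pretty_print(rows: List[Row]) -> str:
    """Render rows as an aligned text table for readable test failure messages."""
    if not rows:
        return "(empty)"
    columns = sorted(rows[0])
    widths = {
        col: max(len(col), *(len(str(row.get(col, ""))) for row in rows))
        for col in columns
    }
    header = " | ".join(col.ljust(widths[col]) for col in columns)
    separator = "-+-".join("-" * widths[col] for col in columns)
    body = [
        " | ".join(str(row.get(col, "")).ljust(widths[col]) for col in columns)
        for row in rows
    ]
    return "\n".join([header, separator, *body])


actual = [{"id": 2, "city": "Leuven"}, {"id": 1, "city": "Gent"}]
expected = [{"id": 1, "city": "Gent"}, {"id": 2, "city": "Brussel"}]

diff = diff_datasets(actual, expected)
report = pretty_print(diff["unexpected"])
```

Distributed results come back in no guaranteed order, which is why the comparison counts rows instead of comparing lists position by position.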
Use Lighthouse:
- If you have more than one Spark project that you need to implement AND that has to run in production
- If you have business analysts and data scientists who want to explore and play with all the data that is available in the organisation
- If you don’t have a lot of experience yet
Don’t use Lighthouse:
- If you only have to run one very particular use case. Lighthouse will generate more overhead
- If you don’t have a large volume or variety of data. Python + Postgres will do just fine
- If you are in prototyping / experimentation mode, and there is no need yet to put stuff in production
Lighthouse
an open-source toolkit to build data lakes
Kris Peeters
https://github.com/datamindedbe/lighthouse

