Univalence
DATALAB 101
Jonathan WINANDY
About me
Jonathan WINANDY
Data Engineer / Entrepreneur
@AHOY_JON
Presentation
What are Datalabs?
Projects to transform an organisation based on its existing data.
Why?
Data is a lever for economic growth.
But?
Data has no value by itself.
Is data the new oil?
How do we start?
By building a Data platform?
Data Platform
Awesome pipelines + Big Data technologies
Rex > Data Platform > Target schema
[Figure: target schema. Labels: Logs, Events, other; Staging; DWH; Business Views; sql1, sql2, sql3; cube, sql; Serving; Metadata.]
Staging: storage space used to decouple from upstream sources.
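A minimal sketch of the staging idea, assuming upstream data arrives as files: each delivery is copied unchanged into a dated staging folder, so downstream jobs read from staging and never from the source. The function name `stage_file` and the `<staging_root>/<source>/<date>/` layout are hypothetical, not taken from the talk.

```python
import shutil
from datetime import date
from pathlib import Path


def stage_file(upstream_path, staging_root, source_name, day=None):
    """Copy one upstream file, unchanged, into <staging_root>/<source>/<YYYY-MM-DD>/."""
    day = day or date.today()
    target_dir = Path(staging_root) / source_name / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(upstream_path).name
    # Raw copy only: no parsing and no cleaning happens at this stage.
    shutil.copy2(upstream_path, target)
    return target
```

Keeping the copy byte-for-byte identical is what provides the decoupling: a later job can be rerun from staging even if the upstream source has changed or is unavailable.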
Rex > Data Platform > Data Warehouse > ETL
ETL workflow (Hadoop):
[Figure: ETL workflow. Labels: API 1 (file); API 2 (file); Ref (file), DB; API adapter; DB adapter (x2); Files (x3); process (x3); result; serving DB.]
Rex > Data Platform > Business views & Reporting
● Creation of business dimensions
● Data visualisation
[Figure: labels: DWH; BV (x3); DB; SQL; self-service data visualisation.]
Objectives:
Storage / Warehousing.
Reduce access time.
Elasticity.
Collaboration.
Reuse.
But?
Building a data platform is a BIG project with no clear return on investment.
“The Datalab as an infrastructure.”
How to grow a Datalab?
Start small with an end-to-end business case.
Rex > Datalab
[Figure: labels: CoGroup, Map.]
Rex > Datalab > Recipe
Sprint:
1. Stage the data
2. Source mapping
3. CoGroup
4. Enrich
5. Make it accessible
Marathon:
A. Cardinality study
B. Technical mapping
C. Business-oriented model
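The cardinality study (step A) can be sketched as follows, assuming each source is a list of dict records sharing a join key. It reports how many keys are shared or unmatched and the largest number of records per key on each side, which hints at whether the relation is 1:1 or 1:N before any CoGroup is attempted. The function name and the report fields are illustrative, not from the talk.

```python
from collections import Counter


def cardinality_study(left, right, key):
    """Summarise how two sources relate on `key` (records are dicts)."""
    left_counts = Counter(rec[key] for rec in left)
    right_counts = Counter(rec[key] for rec in right)
    return {
        # Keys present in only one of the two sources.
        "left_only_keys": len(left_counts.keys() - right_counts.keys()),
        "right_only_keys": len(right_counts.keys() - left_counts.keys()),
        # Keys present in both sources.
        "shared_keys": len(left_counts.keys() & right_counts.keys()),
        # A value above 1 means the key is not unique in that source.
        "max_per_key_left": max(left_counts.values(), default=0),
        "max_per_key_right": max(right_counts.values(), default=0),
    }
```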
Rex > Datalab > CoGroup
{
  "group": 123,
  "V": [{"c2": true, "c1": 123}],
  "R": [{"c3": "DIRECT", "c2": "boeuf bourguignon", "c1": 123},
        {"c3": "DIRECT", "c2": "nouilles de riz", "c1": 123},
        {"c3": "INDIRECT", "c2": "soupe au melon d’hiver", "c1": 123},
        {"c3": "INDIRECT", "c2": "nouilles de riz", "c1": 123}]
}
group int
v     array<struct<c1:int, c2:boolean>>
r     array<struct<c1:int, c2:string, c3:string>>
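To make the CoGroup output above concrete, here is a small in-memory sketch in Python. The talk performs this step with Pig; the `cogroup` helper below is a hypothetical stand-in that only reproduces the shape of the result: one row per key, with one bag of records per named source, and an empty bag when a source has no record for that key.

```python
from collections import defaultdict


def cogroup(key, **sources):
    """Group several record lists by `key`; one output row per key value."""
    # Every key starts with an empty bag for each named source.
    groups = defaultdict(lambda: {name: [] for name in sources})
    for name, records in sources.items():
        for rec in records:
            groups[rec[key]][name].append(rec)
    return [{"group": k, **bags} for k, bags in sorted(groups.items())]


V = [{"c1": 123, "c2": True}]
R = [
    {"c1": 123, "c2": "boeuf bourguignon", "c3": "DIRECT"},
    {"c1": 123, "c2": "nouilles de riz", "c3": "DIRECT"},
]
rows = cogroup("c1", V=V, R=R)  # a single row with "group", "V" and "R"
```

Unlike a join, this keeps each source's records in separate bags, so no rows are multiplied when several sources have many records for the same key.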
Rex > Datalab > Ex
f: G => Visiteur
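One way to read `f: G => Visiteur` is as a pure function from one CoGroup row to a business-level visitor object. The sketch below assumes the row shape of the CoGroup example; the meaning given to the fields (`c2` in V as a robot flag, `c3` and `c2` in R as a channel and a label) is a guess made for illustration, since the slides do not name them.

```python
from dataclasses import dataclass, field


@dataclass
class Visitor:
    visitor_id: int
    is_robot: bool  # assumed meaning of V.c2
    direct: list = field(default_factory=list)    # R.c2 where R.c3 == "DIRECT"
    indirect: list = field(default_factory=list)  # R.c2 where R.c3 == "INDIRECT"


def to_visitor(row):
    """f: G => Visitor. Maps one CoGroup row to a Visitor, with no side effects."""
    v = row["V"][0] if row["V"] else {}
    return Visitor(
        visitor_id=row["group"],
        is_robot=bool(v.get("c2", False)),
        direct=[r["c2"] for r in row["R"] if r["c3"] == "DIRECT"],
        indirect=[r["c2"] for r in row["R"] if r["c3"] == "INDIRECT"],
    )
```

Because the function depends only on its input row, it can be applied independently to each group, which is what makes it suitable for a distributed `map` step.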
Rex > Datalab > Query
Query for nested data (Impala):
select count(*)
from visitor,
     visitor.session session,
     session.page page
where visitor.is_robot = false
  and page.type = 'product'
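To make the semantics of the nested query concrete, here is the same count over in-memory Python objects: each additional entry in the FROM clause unnests one level, so the result counts pages, not visitors. The field names (`session`, `page`, `type`, `is_robot`) follow the query; the record layout as nested dicts and lists is an assumption.

```python
def count_product_pages(visitors):
    """Count pages of type 'product' seen by non-robot visitors."""
    return sum(
        1
        for visitor in visitors
        if not visitor["is_robot"]
        for session in visitor["session"]
        for page in session["page"]
        if page["type"] == "product"
    )
```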
Rex > Datalab > Summary
CoGroup all your inputs with Pig.
Map the data with Spark.
Store in ElasticSearch.
Conclusion
Questions?

Datalab 101 (Hadoop, Spark, ElasticSearch) by Jonathan Winandy - Paris Spark Meetup
