ETL Practices for Better or Worse
Opportunities and challenges along with the rise of machines
Agenda
1. Bring the data together and then transform
2. Data modeling for pragmatism
3. Let the datasets talk to each other
4. Beef up sandbox and backfill
5. Performance tuning
6. Examples, questions and comments
ELT and RIT
● ELT =
○ Extract from Source
○ Load to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
● RIT =
○ Replicate message/log to Queue (JMS/Kafka/CDC)
○ Stream from Queue and Ingest to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
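The ELT steps above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the HDFS/MPP target, and the table names (`stg_orders`, `customer_totals`) and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw rows, standing in for an extract from a source system.
SOURCE_ROWS = [
    ("o1", "alice", "12.50"),
    ("o2", "bob", "7.25"),
    ("o3", "alice", "n/a"),  # dirty value, landed as-is in staging
]

def run_elt(rows):
    conn = sqlite3.connect(":memory:")  # assumption: stand-in for HDFS/MPP
    # Load: land the raw data untouched, every column as text.
    conn.execute(
        "CREATE TABLE stg_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)
    # Transform: cleanse and aggregate inside the target engine, in SQL.
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer, SUM(CAST(amount AS REAL)) AS total
        FROM stg_orders
        WHERE amount GLOB '[0-9]*'
        GROUP BY customer
        """
    )
    return conn

conn = run_elt(SOURCE_ROWS)
totals = dict(conn.execute("SELECT customer, total FROM customer_totals"))
staged = conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
```

The dirty row stays queryable in staging while the transform step filters it out, so the cleansing rule can be changed later without going back to the source.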
Why ELT and RIT (instead of ETL)
● Store related raw data together for better leverage
● A big, queryable staging area is quite useful
● Reduce the workload impact on source systems
● Write data cleansing and business logic in languages/scripts similar to those used by BI
Is Data Modeling Still Important?
● Do we still need to model the data in the era of NoSQL and Big Data?
● Should we de-normalize/pre-join everything?
● Should we use hierarchical JSON/XML and/or key-value pairs for everything?
● Balance and trade-offs: analytics, reusability, metadata-driven, size vs. easy-to-query
Is Data Modeling Still Important? (continued)
● Cluster all attributes and child objects into a tree structure (thinking the NoSQL way)
● Can we live without the JOIN operator?
● Are mutable/updatable datasets still useful?
● Why not snapshot everything? Why SCD2?
● Is the relational model outdated?
● Model it in the source or fix it in the report?
● All or none: Index, Hash, and Full Scan?
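To make the "snapshot everything vs. SCD2" question concrete, here is a minimal sketch of a Type 2 slowly changing dimension kept in plain Python. The `DimRow` class and the function names are hypothetical. The sketch stores one row per change, whereas a daily snapshot would store one row per key per day.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DimRow:
    key: str
    attr: str
    valid_from: str           # ISO date, inclusive
    valid_to: Optional[str]   # ISO date, exclusive; None marks the current row

def apply_scd2(history: List[DimRow], key: str, attr: str, as_of: str) -> None:
    """Close the current version of `key` if `attr` changed, then append a new one."""
    current = next(
        (r for r in history if r.key == key and r.valid_to is None), None
    )
    if current is not None:
        if current.attr == attr:
            return  # unchanged, nothing to record
        current.valid_to = as_of
    history.append(DimRow(key, attr, as_of, None))

def lookup(history: List[DimRow], key: str, on_date: str) -> Optional[str]:
    """Return the attribute value that was valid for `key` on `on_date`."""
    for r in history:
        if r.key != key or on_date < r.valid_from:
            continue
        if r.valid_to is None or on_date < r.valid_to:
            return r.attr
    return None

history: List[DimRow] = []
apply_scd2(history, "c1", "bronze", "2024-01-01")
apply_scd2(history, "c1", "bronze", "2024-02-01")  # no change, no new row
apply_scd2(history, "c1", "gold", "2024-03-01")
```

Three loads produce only two rows here, and point-in-time lookups still work. The trade-off is that readers need a range predicate instead of a simple equality filter on a snapshot date.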
Integration Brings the True Value
● Like the idea of SOA, be careful with data quality (DQ)
● Data producers are loosely coupled for the sake of scalability
● Integration and cross-referencing are deferred to the DW/BI layer, yet someone has to do it
● How can unique identifiers help here?
● Replicate dimension/reference/lookup data (dim/ref/lkp) and federate transaction/event data (tx/event)
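One way a shared unique identifier helps: two loosely coupled producers can be cross-referenced later in the DW/BI layer, and the rows that fail to match become a data-quality signal. The sketch below assumes a hypothetical `session_id` field and at most one transaction per identifier.

```python
def cross_reference(clicks, transactions, key="session_id"):
    """Join two independently produced datasets on a shared identifier.

    Returns (matched pairs, clicks with no transaction, transactions with no click).
    Assumes at most one transaction per identifier.
    """
    tx_by_key = {t[key]: t for t in transactions}
    matched, orphan_clicks, seen = [], [], set()
    for c in clicks:
        t = tx_by_key.get(c[key])
        if t is None:
            orphan_clicks.append(c)
        else:
            matched.append((c, t))
            seen.add(c[key])
    orphan_tx = [t for t in transactions if t[key] not in seen]
    return matched, orphan_clicks, orphan_tx

# Hypothetical sample data from two producers.
clicks = [{"session_id": "s1", "page": "/cart"}, {"session_id": "s2", "page": "/home"}]
transactions = [{"session_id": "s1", "amount": 30}, {"session_id": "s3", "amount": 12}]

matched, orphan_clicks, orphan_tx = cross_reference(clicks, transactions)
```

The two orphan lists are worth tracking over time: a growing count usually points at a producer that dropped or changed the identifier.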
Profile, Prototype, Deploy, Backfill
● Profile the data to understand it
● Prototype with real data
● Enrich and harden the derived data in a sandbox before deploying to production
● Be ready to backfill the data because it will happen (easier to produce or to consume?)
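A backfill is easier to live with when each run overwrites whole partitions, so re-running it is harmless. The following is a minimal sketch of that idea, assuming daily partitions keyed by ISO date. The function names and the in-memory `derived` dict are hypothetical stand-ins for real tables.

```python
from datetime import date, timedelta

def daterange(start: date, end: date):
    """Yield each day from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def backfill(raw_events, derived, start: date, end: date):
    """Recompute one derived partition per day and overwrite it (idempotent)."""
    for day in daterange(start, end):
        key = day.isoformat()
        derived[key] = sum(amount for d, amount in raw_events if d == key)
    return derived

# Hypothetical raw events as (ISO date, amount) and one stale derived partition.
raw = [("2024-01-01", 5), ("2024-01-01", 3), ("2024-01-03", 2)]
derived = {"2024-01-01": 999}

backfill(raw, derived, date(2024, 1, 1), date(2024, 1, 3))
first_run = dict(derived)
backfill(raw, derived, date(2024, 1, 1), date(2024, 1, 3))  # safe to repeat
```

Because each partition is replaced rather than appended to, the stale value disappears and a second run changes nothing.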
Self-service is Cool, but
● Strong automation tools must be built first
● Software can monitor and throttle
● If a user’s job gets killed, there needs to be enough info/clues to explain why and how to (try to) fix it
● Education and knowledge sharing are essential; is a wiki page/runbook good enough?
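As a sketch of "monitor and throttle, then explain", the snippet below reviews a job's statistics and returns a message that says why it was killed and what to try next. The `JobStats` fields, the limits, and the suggested fixes are all hypothetical; real thresholds depend on the cluster.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobStats:
    job_id: str
    user: str
    runtime_s: int
    spilled_mb: int

# Hypothetical limits for illustration only.
MAX_RUNTIME_S = 3600
MAX_SPILL_MB = 10_000

def review(job: JobStats) -> Optional[str]:
    """Return None if the job may continue, else a message explaining the kill."""
    if job.spilled_mb > MAX_SPILL_MB:
        return (
            f"{job.job_id}: killed, spilled {job.spilled_mb} MB to disk "
            f"(limit {MAX_SPILL_MB} MB); try filtering earlier or joining less data"
        )
    if job.runtime_s > MAX_RUNTIME_S:
        return (
            f"{job.job_id}: killed after {job.runtime_s} s "
            f"(limit {MAX_RUNTIME_S} s); try a smaller date range"
        )
    return None

ok = review(JobStats("j1", "ann", runtime_s=10, spilled_mb=0))
spill_msg = review(JobStats("j2", "bob", runtime_s=10, spilled_mb=20_000))
slow_msg = review(JobStats("j3", "cy", runtime_s=7200, spilled_mb=0))
```

Putting the measured value, the limit, and a suggested next step in the same message saves the user a trip to the wiki.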
Performance Matters
● Good instrumentation/logging will pay off big time in performance tuning
● Can we run complex OLAP reports on top of operational metadata? Where is tuning needed the most?
● Swapping (spill to disk) can be a big issue
● Detect and kill the bad jobs early
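A small illustration of reporting on operational metadata: if each job logs its stages with elapsed time and spill volume, a simple rollup shows where tuning is needed most. The record layout and values in `RUN_LOG` are hypothetical.

```python
from collections import defaultdict

# Hypothetical operational metadata: (job, stage, seconds, spilled_mb).
RUN_LOG = [
    ("daily_sales", "extract", 120, 0),
    ("daily_sales", "join", 900, 4000),
    ("clicks", "extract", 200, 0),
    ("clicks", "join", 1500, 9000),
    ("clicks", "load", 60, 0),
]

def tuning_candidates(records, top=2):
    """Rank stages by total seconds, carrying total spill alongside."""
    seconds = defaultdict(int)
    spill = defaultdict(int)
    for _job, stage, secs, spilled_mb in records:
        seconds[stage] += secs
        spill[stage] += spilled_mb
    ranked = sorted(seconds, key=seconds.get, reverse=True)
    return [(s, seconds[s], spill[s]) for s in ranked[:top]]

report = tuning_candidates(RUN_LOG)
```

In this made-up log the join stage dominates both runtime and spill, which is the kind of signal that justifies tuning it first.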
Examples
1. data exploration in MPP/Hadoop instead of in source systems (don’t let the brain wait)
2. web click stream backend transaction
3. replicate/synchronize rollup hierarchy (mapping lookup) to multiple data systems; then produce near-real-time aggregation in each system; finally federate the aggregates
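The third example can be sketched as follows, assuming a hypothetical city-to-region rollup that is replicated to every system. Each system aggregates its own events locally, and only the small aggregates are combined.

```python
from collections import Counter

# Hypothetical rollup hierarchy (mapping lookup), replicated to each system.
ROLLUP = {"sf": "us-west", "la": "us-west", "nyc": "us-east"}

def local_aggregate(events, rollup):
    """Aggregate one system's (city, amount) events up to the rollup level."""
    totals = Counter()
    for city, amount in events:
        totals[rollup[city]] += amount
    return totals

def federate(*partials):
    """Combine per-system aggregates without moving the raw events."""
    combined = Counter()
    for p in partials:
        combined.update(p)  # Counter.update adds values for matching keys
    return dict(combined)

# Hypothetical events held in two separate systems.
system_a = [("sf", 10), ("nyc", 5)]
system_b = [("la", 7), ("nyc", 1)]

result = federate(
    local_aggregate(system_a, ROLLUP),
    local_aggregate(system_b, ROLLUP),
)
```

This only works cleanly for additive measures such as sums and counts. Distinct counts or medians would need a different kind of partial result.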
