Crossing Analytics Systems: Case for
Integrated Provenance in Data Lakes
Isuru Suriarachchi and Beth Plale
School of Informatics and Computing
Indiana University
IEEE E-science 2016 : Hot Topics
The Data Lake has arisen within the last couple of years as a
conceptualization of a data management framework with the
flexibility to support the multiple data processing tools needed for
truly Big Data analytics.
Data Warehouse
• Supports multidimensional analytical processing
– Online Analytical Processing (OLAP) or
Multidimensional OLAP
• Numeric facts (measures) categorized by
dimensions creating vector space (OLAP cube).
• Interface is matrix interface like Pivot tables
• Schema is star schema, snowflake schema
• Storage is largely relational database
Credit: https://www.linkedin.com/topic/data-warehouse-architecture
Data Warehouse Architecture
• ETL: Extraction, Transformation, Load
Challenging the Warehouse: Big Data
• From numerous sources
– social media, sensor data, IoT devices, server logs,
clickstream etc.
• Not all numeric (quantitative) thus differently
structured
– Structured, semi-structured, unstructured
• Continuously generated or archived
Suitability of Data Warehouse for
Today’s Big Data
• ETL imposes burden
– Schema on write
– Inflexibility/inefficiency at ingest time
– Information loss upon schema translation
• Weak fit for popular Big Data analytical tools
(e.g., Spark, Hadoop) and data serving
platforms (e.g., HDFS, S3)
Data Lake
• A scalable storage infrastructure with no schema
enforcement at ingest
• Data ingested in raw form: no loss
• Schema-on-read
• Integrated Transformations
– With e.g., Hadoop, Spark
[Figure: Data Lake overview. Labels: sources (Clickstream, Sensor data, IoT Devices, Social Media, Cloud Platforms, Server Logs); Ingest API; Data Lake holding Data, Metadata, and Lineage; a chain of Transform steps leading to Analysis; Big Data Processing Frameworks (Ex: Hadoop, Spark, Storm).]
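To make the schema-on-read idea concrete, here is a minimal Python sketch. It assumes raw records are kept as JSON strings; the record contents and the helper name `read_with_schema` are hypothetical and do not come from the slides.

```python
import json

# Raw records are stored exactly as they arrived; nothing is enforced at ingest.
raw_lake = [
    '{"user": "a", "text": "go #nba", "lang": "en"}',
    '{"sensor": "t1", "celsius": 21.5}',
]

def read_with_schema(raw_records, fields):
    """Apply a schema at read time: keep records that carry every requested field."""
    for line in raw_records:
        record = json.loads(line)
        if all(field in record for field in fields):
            yield {field: record[field] for field in fields}

# The reader, not the ingest path, decides which shape it needs.
tweets = list(read_with_schema(raw_lake, ["user", "text"]))
```

The sensor record stays in the lake untouched; it is simply not selected by this particular read.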
Data Lake Challenges
• Increased flexibility leads to harder
manageability
– Differently typed data can be easily dumped into
the Data Lake
– Data products can be in different stages of their
lifecycle: raw, half processed, processed etc.
– Can easily turn into “data swamps”
• Requires traceability
– Provenance can help
Data Provenance
• Information about the activities, entities and people
involved in producing a data product
• Standards
– OPM
– PROV
• If a Data Lake ensures that every data product’s
provenance is recorded from the product’s origin
onward, critical traceability follows
What could a provenance perspective
bring to a Data Lake?
• Track origins of data, chained transformations
• Contribute to reuse determinations of trust
and quality
• React: minimally constrain what enters a
Lake?
Challenges in Provenance Capturing
• Chains of Transformations
– Different analytics systems: Hadoop, Spark etc.
• The need is for end-to-end integrated provenance across
transformations
• System-specific provenance collection methods are less useful
– Integration/stitching problems
– E.g.: RAMP, HadoopProv for Hadoop
Solution to minimal lake governance
• All components in lake stream provenance to
central provenance subsystem
– Stores provenance for long term queries
– Monitors provenance stream in real time
• Event in stream represented by edge in
provenance graph
• Global lake wide policy: Uniform Persistent ID
(PID) (Handle, UUIDs, DOIs) attached to all
data objects in Data Lake
– required to guarantee integrated provenance
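A minimal Python sketch of this policy follows, assuming UUIDs as PIDs and an in-memory list standing in for the central provenance subsystem. The class and function names are hypothetical; the relation names `used` and `wasGeneratedBy` are borrowed from PROV.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvEvent:
    """One event in the provenance stream, i.e. one edge in the provenance graph."""
    subject: str   # PID of a data object, or the id of a transformation
    relation: str  # PROV-style relation name, e.g. "used" or "wasGeneratedBy"
    target: str    # the node the subject depends on

def new_pid() -> str:
    """Assign a persistent ID to a data object (a UUID in this sketch)."""
    return str(uuid.uuid4())

class ProvenanceStore:
    """Stand-in for the central provenance subsystem: it only collects edges."""
    def __init__(self):
        self.edges = []

    def receive(self, event: ProvEvent) -> None:
        self.edges.append(event)

store = ProvenanceStore()
pid_in, pid_out = new_pid(), new_pid()
# Transformation "T1" reads one object and writes another; each fact is one edge.
store.receive(ProvEvent("T1", "used", pid_in))
store.receive(ProvEvent(pid_out, "wasGeneratedBy", "T1"))
```

Because every component reports against the same PIDs, edges emitted by different frameworks can be joined without any stitching step.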
Model
• PID assigned to all data objects
– granularity
• Transformations T1, T2, and T3
– Distributed
– May use different frameworks
[Figure: data objects d1 to d8 linked by transformations T1, T2, and T3. Captions: "Chain of transformations sharing IDs"; "Backward provenance from central provenance store" (showing d1, d3, d4, d6, d7, d8).]
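A backward provenance query over such a chain can be sketched in Python as a graph traversal. The edge list below is an assumed topology, chosen only to be consistent with the labels on the slide (d1 to d8, T1 to T3); the actual wiring of the figure is not recoverable from the text.

```python
from collections import deque

# (subject, relation, target) edges; every edge points from a node to what it depends on.
EDGES = [
    ("T1", "used", "d1"),
    ("d2", "wasGeneratedBy", "T1"),
    ("d3", "wasGeneratedBy", "T1"),
    ("d4", "wasGeneratedBy", "T1"),
    ("T2", "used", "d3"),
    ("T2", "used", "d4"),
    ("d5", "wasGeneratedBy", "T2"),
    ("d6", "wasGeneratedBy", "T2"),
    ("d7", "wasGeneratedBy", "T2"),
    ("T3", "used", "d6"),
    ("T3", "used", "d7"),
    ("d8", "wasGeneratedBy", "T3"),
]

def backward_provenance(edges, pid):
    """Return every node that the object `pid` transitively depends on."""
    depends_on = {}
    for subject, _relation, target in edges:
        depends_on.setdefault(subject, []).append(target)
    seen, pending = set(), deque([pid])
    while pending:  # breadth-first walk toward the origin
        node = pending.popleft()
        for parent in depends_on.get(node, []):
            if parent not in seen:
                seen.add(parent)
                pending.append(parent)
    return seen

ancestors = backward_provenance(EDGES, "d8")
data_ancestors = sorted(n for n in ancestors if n.startswith("d"))
```

Under this assumed topology the query crosses all three transformations even if each ran in a different framework, because the edges share PIDs.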
Provenance traces integrate across systems of
Data Lake
Reference Architecture
[Figure: reference architecture. Labels: Raw Data from various sources; Ingest API; Data Import; Data Export; Data Lake; Transformations (Batch Processing, Ex: Hadoop, Spark; Stream Processing, Ex: Storm, Spark; Workflow Engines, Ex: Kepler; Legacy Scripts); Lineage; Prov Stream; Messaging System; Provenance Subsystem (Ingest API, Prov Stream Processing, Prov Storage, Query API); uses: Monitoring, Debugging, Reproducing, Data Quality, Queries, Visualization.]
• Real-time provenance stream processing
• Stored provenance for long term usage
Prototype Use Case
• Different frameworks used
– Flume: Captures tweets and writes them into HDFS
– Hadoop Job: Computes hashtag counts
– Spark Job: Computes category counts
Central provenance store
• Uses Komadu
– A distributed provenance collection tool
– Visualization, custom queries
I. Suriarachchi, Q. Zhou and B. Plale (2015). Komadu: A Capture and Visualization System for Scientific Data
Provenance. Journal of Open Research Software 3(1):e4
Client Library
• Log4j-like API for provenance capture
• Dedicated thread pool in provenance layer
• Batching to minimize network overhead
[Figure: client library layers. Labels: Application Layer API (call shown: client.addGeneration(A, E)); Komadu Client Layer (batching, prov thread pool); RabbitMQ Client Layer; RabbitMQ Server; Komadu.]
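The batching idea can be sketched in Python as below. This is not the Komadu client API: the class name, the `send_batch` callback (a stand-in for publishing to RabbitMQ), and the use of a single worker thread instead of a thread pool are all simplifying assumptions. Only the method name echoes the `client.addGeneration(A, E)` call shown on the slide.

```python
import queue
import threading

class BatchingProvClient:
    """Queue provenance events and ship them in batches from a background thread."""
    _STOP = object()  # sentinel that tells the worker to flush and exit

    def __init__(self, send_batch, batch_size=3):
        self._send_batch = send_batch  # hypothetical transport callback
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def add_generation(self, activity, entity):
        # Returns immediately; the application thread never waits on the network.
        self._queue.put(("wasGeneratedBy", entity, activity))

    def close(self):
        self._queue.put(self._STOP)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._send_batch(batch)
                batch = []
        if batch:  # flush whatever is left at shutdown
            self._send_batch(batch)

sent = []
client = BatchingProvClient(sent.append, batch_size=3)
for i in range(7):
    client.add_generation("T1", f"d{i}")
client.close()
batch_sizes = [len(batch) for batch in sent]
```

Seven events with a batch size of three leave the client as three messages rather than seven, which is the network saving the batching bullet refers to.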
Use case evaluation
• Flume, Hadoop and Spark jobs instrumented
using Komadu client libraries
• Jobs stream provenance events into central
provenance store (Komadu)
• Persistent IDs (UUID) assigned for each data
object at entry to data lake; PID persists
thereafter with data object
Use case evaluation: experimental
environment
• 5 small VM instances: two 2.5 GHz cores, 4 GB
RAM, 50 GB local storage
• 4 VM instances used for HDFS cluster
• 3.23 GB Twitter data collected over 5 days
running Flume on master node
• Hadoop and Spark set up on top of HDFS
cluster
• Separate instance for RabbitMQ and Komadu
Use case evaluation: Metrics
• Batch size:
– Impact of batch size on provenance capture efficiency,
measured by total execution time for Hadoop using the
provenance event batching mechanism in the Komadu
library
• Overhead of provenance capture:
– Measured against total tool-specific execution time
– Overhead of the customized value field (in the key-value
pair)
– Overhead of provenance capture for Hadoop
and Spark
Batch Size Test
• Hadoop job execution times with varying
batch sizes
• Optimal batch size: ~5000 KB
Overhead: Hadoop
• custom val: emits PID with key value pair
as (#nba, <2, id>) instead of (#nba, 2)
• data prov HDFS: writes provenance into HDFS,
used by HadoopProv and RAMP
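The "custom val" variant can be illustrated with a plain-Python map and reduce, with no Hadoop involved. The slide only shows the shape `(#nba, <2, id>)`; carrying a list of input PIDs in the reduced value, and the function names used here, are assumptions made for illustration.

```python
from collections import defaultdict

def map_tweet(pid, text):
    """Emit (hashtag, (1, pid)) instead of (hashtag, 1) so the PID travels with the value."""
    for token in text.split():
        if token.startswith("#"):
            yield token.lower(), (1, pid)

def reduce_counts(pairs):
    """Sum counts per hashtag and keep the PIDs of the inputs that contributed."""
    totals = defaultdict(int)
    sources = defaultdict(set)
    for tag, (count, pid) in pairs:
        totals[tag] += count
        sources[tag].add(pid)
    return {tag: (totals[tag], sorted(sources[tag])) for tag in totals}

# Fixed, made-up PIDs keep the example deterministic.
tweets = [("pid-1", "Go #NBA finals"), ("pid-2", "#nba tonight")]
pairs = [pair for pid, text in tweets for pair in map_tweet(pid, text)]
result = reduce_counts(pairs)
```

The larger value tuple is the source of the "custom val" overhead: every intermediate pair now carries an identifier in addition to the count.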
Overhead: Spark
• Higher provenance capture overhead
compared to Hadoop
Future Work
• Performance overhead is prohibitively high
– Decouple PID assignment from execution?
– Examine granularity
• Live provenance stream processing for real
time monitoring/reaction
• Explore minimal provenance at online rates
and more comprehensive provenance at offline rates
Work funded in part by National Science
Foundation OCI-0940824