Ensuring quality in a data lake environment with lakeFS
August 18, 2021
By Paul Singman
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
The Data Lake Advantage
1. Scalable and cost effective
2. Highly accessible
3. High throughput
4. Rich application ecosystem
Data Lake Architecture Reference
[Diagram] Events data and operations data land in object storage, which feeds data processors, analytics engines, data visualization, and data exploration tools.
The Data Lake Challenge
o Inability to experiment, compare, and reproduce
Example: testing a version or algorithm upgrade
o Difficult to enforce best practices
Example: schema and format enforcement
o Hard to recover from production errors
Example: deleted files in the object store
In a perfect world
An atomic, versioned data lake on top of object storage
How it works
lakeFS adds a branch name to every object path, so the same logical collection can exist in multiple isolated versions:
s3://data-bucket/collections/foo (before lakeFS)
s3://data-bucket/main/collections/foo (with lakeFS: branch "main")
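Because the branch is just a path prefix, any S3-compatible client can read through lakeFS. A minimal sketch using boto3, assuming a hypothetical lakeFS installation at lakefs.example.com and lakeFS-issued credentials:

import boto3

# Hypothetical endpoint and credentials, for illustration only
s3 = boto3.client(
    's3',
    endpoint_url='https://lakefs.example.com',  # the lakeFS S3 Gateway, not AWS
    aws_access_key_id='<lakeFS access key>',
    aws_secret_access_key='<lakeFS secret key>',
)

# The first path component after the bucket is the branch name
resp = s3.list_objects_v2(Bucket='data-bucket', Prefix='main/collections/foo')
for obj in resp.get('Contents', []):
    print(obj['Key'])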
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
from lakefs_client import models
from lakefs_client.client import LakeFSClient

client = LakeFSClient(config)

# Create an isolated branch to experiment on, without copying any data
client.branches.create_branch(
    repository='my-repo',
    branch_creation=models.BranchCreation(name='experiment-2', source='experiment-1'),
)

df1 = spark.read.parquet("s3a://my-repo/experiment-1/events/by-date")
df2 = spark.read.parquet("s3a://my-repo/experiment-2/events/by-date")

df1.groupBy("...").count()
df2.groupBy("...").count()  # now we can compare the properties of the data itself
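Once the experiment produces results worth keeping, the same client can commit them, turning the branch's current state into an immutable, referenceable version. A minimal sketch continuing the lakefs_client session above; the commit message is illustrative:

# Commit the experiment's output on its branch
client.commits.commit(
    repository='my-repo',
    branch='experiment-2',
    commit_creation=models.CommitCreation(message='Aggregate events by date'),
)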
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
from lakefs_client.client import LakeFSClient

client = LakeFSClient(config)

# Atomically merge the experiment branch into main
client.refs.merge_into_branch(
    repository='my-repo',
    source_ref='experiment-1',
    destination_branch='main',
)
[Diagram] Merging branch new-data-1 into main. The changeset shows 001.parquet ✓ and 002.parquet ✓; random.csv is unchecked.
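Before merging, the changeset can also be inspected programmatically, so only intended changes reach main. A sketch assuming the diff_refs call of the same lakefs_client SDK:

# List what the experiment branch changed relative to main
diff = client.refs.diff_refs(
    repository='my-repo',
    left_ref='main',
    right_ref='experiment-1',
)
for entry in diff.results:
    print(entry.type, entry.path)  # e.g. "added 001.parquet"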
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
# Commit the current state of the branch
lakectl branch commit lakefs://example-repo@stream-1-branch

# Roll the branch back to a known-good commit
lakectl branch revert lakefs://example-repo@stream-1-branch --commit dd8a60d5ef70809
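Because every commit ID is a stable ref, the reverted state can also be read directly, which makes reruns reproducible. A sketch reusing the repository and commit ID above; the events/by-date path is illustrative:

# Any ref works in the path: a branch name or a commit ID
commit_id = 'dd8a60d5ef70809'
df = spark.read.parquet(f's3a://example-repo/{commit_id}/events/by-date')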
Integrates with your existing tools
[Diagram] lakeFS forms a manageability & resilience layer on top of the object store, between data sources (streaming data, batch jobs) and data consumption (query engines, data visualization, MLOps, data quality tools).
lakeFS Architecture
[Diagram] Spark/Presto/EMR and other data consumers/producers reach lakeFS either through its S3 Gateway or through the lakeFS Hadoop Client. The lakeFS Metadata Manager keeps RocksDB-format metadata in the object store (S3), alongside the deduplicated data objects.
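To route Spark through the S3 Gateway, it is enough to point the s3a filesystem at the lakeFS endpoint. A minimal sketch, assuming a hypothetical endpoint and lakeFS-issued credentials:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('lakefs-example')
    # Send s3a:// traffic to the lakeFS S3 Gateway instead of AWS S3
    .config('spark.hadoop.fs.s3a.endpoint', 'https://lakefs.example.com')
    .config('spark.hadoop.fs.s3a.path.style.access', 'true')
    # lakeFS-issued credentials, not AWS credentials
    .config('spark.hadoop.fs.s3a.access.key', '<lakeFS access key>')
    .config('spark.hadoop.fs.s3a.secret.key', '<lakeFS secret key>')
    .getOrCreate()
)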
Data Format
[Diagram] A branch (e.g. master) points to a commit ID, and each commit points to a meta-range ID. The meta-range is an .sst file pointing to range files: 8-12 MB .sst files containing contiguous runs of sorted keys.
Further Reading
Additional Resources
Getting started
Check out the docs
Join the lakeFS Slack Channel
Contribute and star the repo
Thanks!
Join the community: einat.orr@treeverse.io
GitHub
