Ensuring quality in a data lake environment with lakeFS
August 18, 2021
By Paul Singman
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
The Data Lake Advantage
1. Scalable and cost effective
2. Highly accessible
3. High throughput
4. Rich application ecosystem
Data Lake Architecture Reference
[Diagram] Events data and operations data land in object storage, which feeds data processors, analytics engines, data visualization, and data exploration tools.
The Data Lake Challenge
o Inability to experiment, compare, and reproduce
Example: testing a version or algorithm upgrade
o Difficult to enforce best practices
Example: schema and format enforcement
o Hard to recover from production errors
Example: deleted files in the object store
In a perfect world
An atomic, versioned data lake on top of object storage
How it works
lakeFS adds a branch name to every object path, so the same logical collection can exist in multiple isolated versions:
s3://data-bucket/collections/foo (before lakeFS)
s3://data-bucket/main/collections/foo (with lakeFS: branch "main")
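Because the branch is just a path prefix, any S3-compatible client can read through lakeFS. A minimal sketch using boto3, assuming a hypothetical lakeFS installation at lakefs.example.com and lakeFS-issued credentials:

import boto3

# Hypothetical endpoint and credentials, for illustration only
s3 = boto3.client(
    's3',
    endpoint_url='https://lakefs.example.com',  # the lakeFS S3 Gateway, not AWS
    aws_access_key_id='<lakeFS access key>',
    aws_secret_access_key='<lakeFS secret key>',
)

# The first path component after the bucket is the branch name
resp = s3.list_objects_v2(Bucket='data-bucket', Prefix='main/collections/foo')
for obj in resp.get('Contents', []):
    print(obj['Key'])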
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
from lakefs_client import models
from lakefs_client.client import LakeFSClient

client = LakeFSClient(config)

# Create an isolated branch to experiment on, without copying any data
client.branches.create_branch(
    repository='my-repo',
    branch_creation=models.BranchCreation(name='experiment-2', source='experiment-1'),
)

df1 = spark.read.parquet("s3a://my-repo/experiment-1/events/by-date")
df2 = spark.read.parquet("s3a://my-repo/experiment-2/events/by-date")

df1.groupBy("...").count()
df2.groupBy("...").count()  # now we can compare the properties of the data itself
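Once the experiment produces results worth keeping, the same client can commit them, turning the branch's current state into an immutable, referenceable version. A minimal sketch continuing the lakefs_client session above; the commit message is illustrative:

# Commit the experiment's output on its branch
client.commits.commit(
    repository='my-repo',
    branch='experiment-2',
    commit_creation=models.CommitCreation(message='Aggregate events by date'),
)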
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
from lakefs_client.client import LakeFSClient

client = LakeFSClient(config)

# Atomically merge the experiment branch into main
client.refs.merge_into_branch(
    repository='my-repo',
    source_ref='experiment-1',
    destination_branch='main',
)
[Diagram] Merging branch new-data-1 into main. The changeset shows 001.parquet ✓ and 002.parquet ✓; random.csv is unchecked.
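Before merging, the changeset can also be inspected programmatically, so only intended changes reach main. A sketch assuming the diff_refs call of the same lakefs_client SDK:

# List what the experiment branch changed relative to main
diff = client.refs.diff_refs(
    repository='my-repo',
    left_ref='main',
    right_ref='experiment-1',
)
for entry in diff.results:
    print(entry.type, entry.path)  # e.g. "added 001.parquet"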
Important Factors for Quality
o Fast Development & Effective Collaboration
o Intelligent Data Deployment
o Straightforward Error Correction
# Commit the current state of the branch
lakectl branch commit lakefs://example-repo@stream-1-branch

# Roll the branch back to a known-good commit
lakectl branch revert lakefs://example-repo@stream-1-branch --commit dd8a60d5ef70809
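Because every commit ID is a stable ref, the reverted state can also be read directly, which makes reruns reproducible. A sketch reusing the repository and commit ID above; the events/by-date path is illustrative:

# Any ref works in the path: a branch name or a commit ID
commit_id = 'dd8a60d5ef70809'
df = spark.read.parquet(f's3a://example-repo/{commit_id}/events/by-date')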
Integrates with your existing tools
[Diagram] lakeFS forms a manageability & resilience layer on top of the object store, between data sources (streaming data, batch jobs) and data consumption (query engines, data visualization, MLOps, data quality tools).
lakeFS Architecture
[Diagram] Spark/Presto/EMR and other data consumers/producers reach lakeFS either through its S3 Gateway or through the lakeFS Hadoop Client. The lakeFS Metadata Manager keeps RocksDB-format metadata in the object store (S3), alongside the deduplicated data objects.
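To route Spark through the S3 Gateway, it is enough to point the s3a filesystem at the lakeFS endpoint. A minimal sketch, assuming a hypothetical endpoint and lakeFS-issued credentials:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('lakefs-example')
    # Send s3a:// traffic to the lakeFS S3 Gateway instead of AWS S3
    .config('spark.hadoop.fs.s3a.endpoint', 'https://lakefs.example.com')
    .config('spark.hadoop.fs.s3a.path.style.access', 'true')
    # lakeFS-issued credentials, not AWS credentials
    .config('spark.hadoop.fs.s3a.access.key', '<lakeFS access key>')
    .config('spark.hadoop.fs.s3a.secret.key', '<lakeFS secret key>')
    .getOrCreate()
)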
Data Format
[Diagram] A branch (e.g. master) points to a commit ID, and each commit points to a meta-range ID. The meta-range is an .sst file pointing to range files: 8-12 MB .sst files containing contiguous runs of sorted keys.
Further Reading
Additional Resources
Getting started
Check out the docs
Join the lakeFS Slack Channel
Contribute and star the repo
Thanks!
Join the community: einat.orr@treeverse.io
GitHub
