Behind the Scenes | 23 May 2019
Same same but different
Data
Engineering
Data Systems
Our mission
To deliver a Data Platform
that empowers both Data Creators and Data Consumers,
maximising the capability of Coolblue
to keep our Customers smiling
Data Engineering | 23 May 2019
The Jenga champion
Soumya Patra
● Data Engineer @Coolblue since October 2017
● Masters from TU Eindhoven
● Interests: AWS, Data Engineering
First things first
Agenda
● ETL challenges in a fast-growing company
● Self service platform
● Key takeaways
Restaurant kitchen
Data warehouse
Growth
Sales · Customers · Stores
Growth
Dev teams · Analysts & Data scientists · Reporting
Challenges
Scalability
Support & maintenance
Agility & innovation
Team 1 · Team 2 · Team 3 → Data Engineering
Centralized domain knowledge
Fully shipped vs Partially shipped
Promoter vs Neutral
Data variety
Recap
Scalability · Support · Data variety · Agility & innovation · Centralised knowledge
Overcoming challenges
Google BigQuery → Scalability, Support
Self-servicing teams (Team 1 · Team 2 · Team 3) → Agility & Innovation
Domain experts in each team → Centralized domain knowledge
Need for a strong platform → Data variety
Self service platform
Expectations from platform
Simplicity · Observability · Security · Scalability · Resiliency
Hello Airflow
Coffee time
Select option → Pour coffee + Pour milk → Add sugar → Your drink is ready!
(runs at 8 AM every day)
Operators
8 AM every day · Data processing · Templates
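The coffee workflow above maps naturally onto an Airflow DAG. A minimal configuration-as-code sketch (DAG and task names are made up for this illustration, using Airflow 1.10-era imports, not Coolblue's actual pipeline code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='coffee_time',            # hypothetical name
    schedule_interval='0 8 * * *',   # 8 AM every day
    start_date=datetime(2019, 1, 1),
    catchup=False,
)

select_option = DummyOperator(task_id='select_option', dag=dag)
pour_coffee = DummyOperator(task_id='pour_coffee', dag=dag)
pour_milk = DummyOperator(task_id='pour_milk', dag=dag)
add_sugar = DummyOperator(task_id='add_sugar', dag=dag)
drink_ready = DummyOperator(task_id='drink_ready', dag=dag)

# Fan out after selecting, then join again before the drink is ready.
select_option >> [pour_coffee, pour_milk]
[pour_coffee, pour_milk] >> add_sugar >> drink_ready
```

In a real pipeline the DummyOperators would be replaced by operators that do actual data processing, with templated arguments.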
Observability
Role based access control
AWS
Airflow
Simplicity · Observability · Security · Scalability · Resiliency
Impact
Domain Teams · Business Analysts
Key takeaways
Focus on big data problems
Data Engineering | 23 May 2019
Cookie master chef
Cindy Cressot
● From France
● Has a little daughter
● PhD in applied sciences
● Interests: Big Data, Spark
● Data Engineer at Coolblue since April 2018
First things first
Agenda
● Context
● Challenges
● Method (PySpark)
● A step further with Big Data
● Key takeaways
Why deduplicate customers?
Customer data-driven
Customer journey
● Prospect: prospective customer ID 1, email john.doe@coolblue.nl
● Customer: places an order (customer ID 1, order ID 1)
● Updates: the email address changes to john.doe@yahoo.com
● A second account with email john.doe@coolblue.nl places an order (customer ID 2, order ID 2)
● Deduplication: customer ID 2 is recognized as customer ID 1
A complex problem
Why?
● Each customer has 1 Coolblue account
● 1 account is linked to 1 email address
● 1 email address represents 1 person
○ However, customers can update their email address
○ And customers can share the same account (households)
● All the different scenarios with different attributes (email, address, phone, etc.) make it harder to recognize the customer.
Let's start with customers and emails
customer_id | email
1           | a@a
2           | a@a
3           | b@b
4           | c@c
Step 1
● PARTITION customers BY email
● ORDER BY registration_date
● promote the FIRST customer as MASTER
Group customers by email

customer_id | email | master_customer_id
1           | a@a   | 1   (master)
2           | a@a   | 1
3           | b@b   | 3
4           | c@c   | 4
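Step 1 can be sketched with plain Python on the toy data above (the real job would use a window function in SQL or PySpark; the registration dates here are made up for the illustration):

```python
# Sketch of Step 1: PARTITION BY email, ORDER BY registration_date,
# and promote the FIRST customer in each partition as the master.
from collections import defaultdict

rows = [  # (customer_id, email, registration_date) -- illustrative values
    (1, 'a@a', '2017-01-01'),
    (2, 'a@a', '2018-05-01'),
    (3, 'b@b', '2017-06-01'),
    (4, 'c@c', '2018-09-01'),
]

by_email = defaultdict(list)
for customer_id, email, registered in rows:
    by_email[email].append((registered, customer_id))

# The earliest-registered customer per email becomes that email's master.
master = {email: min(group)[1] for email, group in by_email.items()}
print(master)  # {'a@a': 1, 'b@b': 3, 'c@c': 4}
```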
Dealing with updates
What if customers 1 and 3 update their email addresses?
● We would need to keep track of all updates to their email addresses.
E.g. customer 1 changes a@a to b@b, and customer 3 changes b@b to c@c.
Historical updates (new rows)

customer_id | email | master_customer_id
1           | a@a   | 1
1           | b@b   |     (current)
2           | a@a   | 1
3           | b@b   | 3
3           | c@c   |     (current)
4           | c@c   | 4
Step 1 (again)
● PARTITION customers BY email
● ORDER BY registration_date
● promote the FIRST customer as MASTER
Group customers by email

customer_id | email | master_customer_id
2           | a@a   | 1
1           | a@a   | 1
1           | b@b   | 1
3           | b@b   | 1
3           | c@c   | 3
4           | c@c   | 3

(master_customer_id = the earliest customer_id per email)

What about customer 3, which now has two different master_customer_ids?
Step 2 (a bit different)
● PARTITION master_customer_ids BY customer_id
● ORDER BY master_customer_id
● promote the FIRST master_customer_id as deduplicated_customer_id
Group Master by Customer

customer_id | email | master_customer_id | deduplicated_customer_id
2           | a@a   | 1                  | 1
1           | a@a   | 1                  | 1
1           | b@b   | 1                  | 1
3           | b@b   | 1                  | 1
3           | c@c   | 3                  | 1
4           | c@c   | 3                  | 3
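Step 2 on the same toy data, again as a stdlib sketch (taking the smallest master_customer_id per customer stands in for "ORDER BY master_customer_id, take the FIRST"):

```python
# Sketch of Step 2: PARTITION BY customer_id, ORDER BY master_customer_id,
# promote the FIRST (i.e. smallest) master as deduplicated_customer_id.
rows = [  # (customer_id, master_customer_id) pairs after Step 1
    (2, 1), (1, 1), (1, 1), (3, 1), (3, 3), (4, 3),
]

dedup = {}
for customer_id, master_id in rows:
    dedup[customer_id] = min(dedup.get(customer_id, master_id), master_id)

print(dedup)  # {2: 1, 1: 1, 3: 1, 4: 3}
```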
Can we dedup more?
Yes! - Group Dedup by Email

customer_id | email | master (per email) | dedup (iteration 1) | master (per email, iteration 2)
2           | a@a   | 1                  | 1                   | 1
1           | a@a   | 1                  | 1                   | 1
1           | b@b   | 1                  | 1                   | 1
3           | b@b   | 1                  | 1                   | 1
3           | c@c   | 3                  | 1                   | 1
4           | c@c   | 3                  | 3                   | 1
Are we there yet?
Can we dedup more? - No

customer_id | email | master (per email) | dedup (iteration 1) | master (iteration 2) | dedup (final)
2           | a@a   | 1                  | 1                   | 1                    | 1
1           | a@a   | 1                  | 1                   | 1                    | 1
1           | b@b   | 1                  | 1                   | 1                    | 1
3           | b@b   | 1                  | 1                   | 1                    | 1
3           | c@c   | 3                  | 1                   | 1                    | 1
4           | c@c   | 3                  | 3                   | 1                    | 1

Deduplicated!
● We only saw 2 iterations
○ But there could be multiple depths of relationships.
○ And more attributes, like address or phone number.
● We had to deal with multiple columns
○ customer_id, master_id, dedup_id
● Are you confused?
○ It was already hard to explain and to understand.
● Is there a different way to see it?
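The two steps can be repeated until nothing changes. A compact sketch of that fixpoint loop on the toy data, using the smallest id as a stand-in for the earliest registration (illustrative, not the production PySpark job):

```python
# Repeat Step 1 (master per email) and Step 2 (smallest master per
# customer) until the assignments stop changing.
rows = [(1, 'a@a'), (2, 'a@a'), (1, 'b@b'), (3, 'b@b'), (3, 'c@c'), (4, 'c@c')]
dedup = {c: c for c, _ in rows}  # start: every customer is its own master

while True:
    # Step 1: the smallest dedup id per email becomes that email's master.
    email_master = {}
    for c, e in rows:
        email_master[e] = min(email_master.get(e, dedup[c]), dedup[c])
    # Step 2: every customer takes the smallest master over its emails.
    new = dict(dedup)
    for c, e in rows:
        new[c] = min(new[c], email_master[e])
    if new == dedup:  # fixpoint reached
        break
    dedup = new

print(dedup)  # {1: 1, 2: 1, 3: 1, 4: 1}
```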
Conclusion
Network
[Graph: customers 1, 2, 3 and 4 connected through the emails a@a, b@b and c@c, forming a single component]
Connected components
● Email address
● Invoice address
● Full name + invoice address
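The connected-components view can be sketched with a small union-find over customer and email nodes (toy data from the slides; a production job would use a graph library such as GraphX/GraphFrames on Spark):

```python
# Customers and emails become graph nodes; sharing an email links customers.
edges = [  # (customer_id, email) pairs, i.e. graph edges
    (1, 'a@a'), (2, 'a@a'), (1, 'b@b'), (3, 'b@b'), (3, 'c@c'), (4, 'c@c'),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for customer_id, email in edges:
    union('customer:%d' % customer_id, 'email:%s' % email)

components = {find('customer:%d' % c) for c, _ in edges}
print(len(components))  # 1: all four customers collapse into one person
```

Extra attributes (invoice address, full name) just add more edge types to the same graph, which is what keeps this approach simple.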
Website traffic
Deduplicating visitors
● Cross device tracking
● Complete customer journey
Key takeaways
Recognize your customer
Keep it simple - use graphs!
Focus on customer journey
Break: time for some wakey juice
Data Systems | Migrating to the Cloud: Our Journey | 23-05-2019
Flipping tables
Data Systems Team
● We work with data, a lot of data!
● Keep data clean and centralized in our Data Warehouse
● Create data pipelines in Airflow
● Support the semantic layer using OLAP cubes
The Python Jedi
Gwildor Sok
● Data Engineer at Coolblue since July 2017
● Started with Python in 2012
● Game and full-stack web development before moving to Data Engineering
They call me the cube guy
André Santos
● Business Intelligence Engineer at Coolblue since April 2018
● Experienced in the Microsoft BI stack
● Started learning how to love open source
● OLAP! OLAP everywhere!
Data Center Architecture

Data Source 1, Data Source 2, Data Source 3, External Systems (...)
→ Staging → Data Warehouse → Data Mart 1, Data Mart 2 → OLAP 1, OLAP 2

Azkaban orchestrates each step of this flow.
Migration Steps
● Move OLAP Server from data center to cloud
● Move SQL Server from data center to cloud
● Move away from Azkaban and adopt Airflow
● Create data validation mechanisms
Azkaban vs Airflow

Azkaban
● Task configuration separate from code
● Interface is not great to browse historical runs
● Hard to rerun individual tasks

Airflow
● Templated arguments, like SQL queries
● A lot of building blocks designed for data engineering work
● Dates as first-class citizens; easily trigger historical runs
What do we need to migrate?
Big daily process
Identifying and splitting up
Introducing checkpoints
Organizing it in Airflow
● Multiple pipelines instead of one
● Interdependent communication
● Multiple “checkpoints”
Our new daily process(es!)
Load to staging area (325 tasks) → Heavy calculations (50 tasks) → Loading data warehouse tables (140 tasks) → Process semantic layer (20 tasks)
Checkpoints approach

Advantages
● Easier to read and reason about
● Maintainable
● Separate logical units
● Less dependency management
● Easier to test

Disadvantages
● Interdependency checks can fail, blocking the next step
● More code because of the interdependency checks
● Generally slower
Ignoring the black box
Observing behaviour: Source → ? → Staging
Old code
Most code looked like this
def main():
    result = run_query('some_query.sql')
    filename = create_csv(result)
    upload_to_gcs(filename, GCS_FILENAME)
    load_to_mssql(GCS_FILENAME, MSSQL_TABLE)
New code
Now we have this
OracleToGCSOperator(
    sql='some_query.sql',
    gcs_location=GCS_FILENAME)
GCSToMSSQLOperator(
    gcs_location=GCS_FILENAME,
    mssql_table=MSSQL_TABLE)
Advantages
Now we have this
● Configuration as code
○ Easier to read
○ Very easy to test
● Less code to maintain
○ Written and maintained by Airflow contributors
○ Custom code is the exception rather than the default
● Quicker to create new pipelines
Summary
Airflow
● All configuration now in code
● Building blocks for faster pipeline development
● A lot less code
● Manageable daily process
SQL Server(less)
SQL Server from Data Center to Cloud
Data Center (physical server) → Cloud
SQL Server(less)
Amazon Relational Database Service (RDS)
● Simple to set up and configure
● Supports multiple database providers
● Patching the database software, backing up databases and some other DBA tasks are managed by AWS itself
SQL Server(less)
Step by Step - SQL Server Migration to Cloud
New SQL Server Instance on RDS
Deploy DW onto new Instance
Populate historical tables
Configure daily ETL in Airflow
Data Validation tools
SQL Server(less)
Step by Step - SQL Server Migration to Cloud

MyDB:
  Properties:
    AllocatedStorage: "100"
    DBInstanceClass: db.m1.small
    Engine: sqlserver-se
    EngineVersion: "14.00.3015.40.v1"
  Type: "AWS::RDS::DBInstance"
SQL Server(less)
Step by Step - SQL Server Migration to Cloud
● Deploy DW onto new instance: Team City deployment
● Daily ETL: Data Source 1, Data Source 2, Data Source 3 → ETL
● Data validation tools: Apache Beam and NBi
Summary
SQL Server on RDS
● We can easily scale our instance
● No server maintenance
● All configuration in code (CloudFormation), which facilitates maintenance
● The backup mechanism offered by AWS has some limitations
OLAP Server
What is an OLAP database?
● OLAP stands for OnLine Analytical Processing
● An OLAP database is a multi-dimensional array of data, commonly referred to as a “cube”
● This technology is used to facilitate query processing on top of the data warehouse
OLAP Server
OLAP on top of the data warehouse
Data warehouse → OLAP Server (SSAS) → Report 1, Report 2, ... Report N
OLAP Server
How to migrate our OLAP Server?
OLAP Server
Main Challenges

No support for our OLAP technology
● Owning and supporting our own VM (EC2)
● Configuring the VM using “code” (no UI on Windows Server Core)

Weekly recycling (wipe)
● Keep the same machine configuration after recycling
● Keep data in the OLAP Server after recycling
OLAP Server
1st step - AMI (basebox)
2nd step - CloudFormation (AWS architecture)
3rd step - Configurations and backups
4th step - Integration with our ETL pipeline
OLAP Server
Integrate OLAP Server with Airflow (and... USERS)
Create Partition → Process Partition
Partitions: 2019W04, 2019W03, 2019W02, ...
Summary
OLAP Server on EC2
● We can easily scale our instance
● Infrastructure as code facilitates maintenance
● Easy to rebuild the machine if it gets corrupted
● A lot of overhead cost on training upfront (really)
Automated validation
Same result set is important
PK | value        PK | value
1  | A            4  | D
2  | B            3  | C
3  | C            2  | B

SELECT TOP 3 * FROM Foo ORDER BY PK
Automated validation
Getting the hashes

Source:                 Target:
PK | hash               PK | hash
1  | c4ca4238           1  | c4ca4238
2  | c81e728d           2  | c81e728d
3  | eccbc87e           3  | a87ff679
Automated validation
Comparing hashes

Source hashes + Target hashes → Apache Beam → Not in source / Not in target / Different
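The hashing and comparison steps can be sketched with the standard library (toy rows; the real pipeline computes and compares the hashes at scale in Apache Beam):

```python
# Hash each row, then diff source vs target per primary key.
import hashlib

def row_hash(row):
    # Hash the concatenated column values of one row.
    return hashlib.md5('|'.join(map(str, row)).encode('utf-8')).hexdigest()

source = {1: row_hash(('A',)), 2: row_hash(('B',)), 3: row_hash(('C',))}
target = {1: row_hash(('A',)), 2: row_hash(('B',)), 3: row_hash(('X',))}

not_in_target = source.keys() - target.keys()
not_in_source = target.keys() - source.keys()
different = {pk for pk in source.keys() & target.keys()
             if source[pk] != target[pk]}

print(sorted(different))  # [3]
```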
Automated validation
Grouping the output

Table | Type          | Count
A     | not_in_target | 0
A     | not_in_source | 5
A     | different     | 1000
B     | not_in_target | 20
B     | not_in_source | 0
B     | different     | 500
Automated validation
Daily report

Table | Difference | Difference yesterday
A     | 5000       | 0
B     | 300        | 300
C     | 20         | 10,000
Automated validation
What’s different?

Table | Primary Key | Type
A     | 1           | not_in_target
A     | 2           | not_in_source
A     | 3           | different
Automated validation
Automated validation steps
1. Get result set from source and target
2. Calculate hashes
3. Compare hashes, track differences
4. Store counts of differences in tracking tables
5. Talk through differences every day
Custom validation
NBi
● Unit testing for Business Intelligence, based on NUnit
● For tables where the logic changed and therefore need custom validation
● For validating the OLAP Server output
Summary
Validation
● Automated validation for most of our data
● Custom validation for tables that changed
● Custom validation for important parts of the OLAP Server
Apache Beam · NBi
What we gained
What we learned
Lessons learned from this migration (1 / 2)
● Not everything you have in the data center will be supported by AWS as-is
● Fewer monitoring capabilities compared to the data center; no superuser privileges on RDS
● Doing two migrations in parallel (Azkaban → Airflow, data center → AWS) might not be such a smart idea
What we learned
Lessons learned from this migration (2 / 2)
● You should get extra training on AWS/DevOps upfront
● Think infrastructure as code, both for Airflow pipelines and the weekly OLAP recycling: everything is in code now, less in documentation or manual changes
● AWS's flexibility allows you to scale your infrastructure with ease
Tour Time
Beer Time!!!
careers@coolblue.nl