Behind the Scenes | 23 May 2019
Same same but different
Data
Engineering
Data Systems
Our mission
To deliver a Data Platform
that empowers both Data Creators and Data Consumers,
maximising the capability of Coolblue
to keep our Customers smiling
Data Engineering | 23 May 2019
The Jenga champion
Soumya Patra
● Data Engineer @Coolblue since October 2017
● Masters from TU Eindhoven
● Interests: AWS, Data Engineering
First things first
Agenda
● ETL challenges in a fast-growing company
● Self service platform
● Key takeaways
Restaurant kitchen
Data warehouse
Growth
Sales · Customers · Stores
Growth
Dev teams · Analysts & Data scientists · Reporting
Challenges
Scalability
Support & maintenance
Agility & innovation
Team 1 · Team 2 · Team 3 → Data Engineering
Centralized domain knowledge
Fully shipped vs Partially shipped
Promoter vs Neutral
Data variety
Recap
Scalability · Support · Data variety · Agility & innovation · Centralised knowledge
Overcoming challenges
Google BigQuery → Scalability, Support
Self-servicing teams (Team 1 · Team 2 · Team 3) → Agility & Innovation
Domain experts in each team → Centralized domain knowledge
Need for a strong platform → Data variety
Self service platform
Expectations from platform
Simplicity · Observability · Security · Scalability · Resiliency
Hello Airflow
Coffee time
Select option → Pour coffee + Pour milk → Add sugar → Your drink is ready!
(runs at 8 AM every day)
Operators
8 AM every day · Data processing · Templates
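The coffee workflow above maps naturally onto an Airflow DAG. A minimal configuration-as-code sketch (DAG and task names are made up for this illustration, using Airflow 1.10-era imports, not Coolblue's actual pipeline code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='coffee_time',            # hypothetical name
    schedule_interval='0 8 * * *',   # 8 AM every day
    start_date=datetime(2019, 1, 1),
    catchup=False,
)

select_option = DummyOperator(task_id='select_option', dag=dag)
pour_coffee = DummyOperator(task_id='pour_coffee', dag=dag)
pour_milk = DummyOperator(task_id='pour_milk', dag=dag)
add_sugar = DummyOperator(task_id='add_sugar', dag=dag)
drink_ready = DummyOperator(task_id='drink_ready', dag=dag)

# Fan out after selecting, then join again before the drink is ready.
select_option >> [pour_coffee, pour_milk]
[pour_coffee, pour_milk] >> add_sugar >> drink_ready
```

In a real pipeline the DummyOperators would be replaced by operators that do actual data processing, with templated arguments.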
Observability
Role based access control
AWS
Airflow
Simplicity · Observability · Security · Scalability · Resiliency
Impact
Domain Teams · Business Analysts
Key takeaways
Focus on big data problems
Data Engineering | 23 May 2019
Cookie master chef
Cindy Cressot
● From France
● Has a little daughter
● PhD in applied sciences
● Interests: Big Data, Spark
● Data Engineer at Coolblue since April 2018
First things first
Agenda
● Context
● Challenges
● Method (PySpark)
● A step further with Big Data
● Key takeaways
Why deduplicate customers?
Customer data-driven
Customer journey
● Prospect: prospective customer ID 1, email john.doe@coolblue.nl
● Customer: places an order (customer ID 1, order ID 1)
● Updates: the email address changes to john.doe@yahoo.com
● A second account with email john.doe@coolblue.nl places an order (customer ID 2, order ID 2)
● Deduplication: customer ID 2 is recognized as customer ID 1
A complex problem
Why?
● Each customer has 1 Coolblue account
● 1 account is linked to 1 email address
● 1 email address represents 1 person
○ However, customers can update their email address
○ And customers can share the same account (households)
● All the different scenarios with different attributes (email, address, phone, etc.) make it harder to recognize the customer.
Let's start with customers and emails
customer_id | email
1           | a@a
2           | a@a
3           | b@b
4           | c@c
Step 1
● PARTITION customers BY email
● ORDER BY registration_date
● promote the FIRST customer as MASTER
Group customers by email

customer_id | email | master_customer_id
1           | a@a   | 1   (master)
2           | a@a   | 1
3           | b@b   | 3
4           | c@c   | 4
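Step 1 can be sketched with plain Python on the toy data above (the real job would use a window function in SQL or PySpark; the registration dates here are made up for the illustration):

```python
# Sketch of Step 1: PARTITION BY email, ORDER BY registration_date,
# and promote the FIRST customer in each partition as the master.
from collections import defaultdict

rows = [  # (customer_id, email, registration_date) -- illustrative values
    (1, 'a@a', '2017-01-01'),
    (2, 'a@a', '2018-05-01'),
    (3, 'b@b', '2017-06-01'),
    (4, 'c@c', '2018-09-01'),
]

by_email = defaultdict(list)
for customer_id, email, registered in rows:
    by_email[email].append((registered, customer_id))

# The earliest-registered customer per email becomes that email's master.
master = {email: min(group)[1] for email, group in by_email.items()}
print(master)  # {'a@a': 1, 'b@b': 3, 'c@c': 4}
```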
Dealing with updates
What if customers 1 and 3 update their email addresses?
● We would need to keep track of all updates to their email addresses.
E.g. customer 1 changes a@a to b@b, and customer 3 changes b@b to c@c.
Historical updates (new rows)

customer_id | email | master_customer_id
1           | a@a   | 1
1           | b@b   |     (current)
2           | a@a   | 1
3           | b@b   | 3
3           | c@c   |     (current)
4           | c@c   | 4
Step 1 (again)
● PARTITION customers BY email
● ORDER BY registration_date
● promote the FIRST customer as MASTER
Group customers by email

customer_id | email | master_customer_id
2           | a@a   | 1
1           | a@a   | 1
1           | b@b   | 1
3           | b@b   | 1
3           | c@c   | 3
4           | c@c   | 3

(master_customer_id = the earliest customer_id per email)

What about customer 3, which now has two different master_customer_ids?
Step 2 (a bit different)
● PARTITION master_customer_ids BY customer_id
● ORDER BY master_customer_id
● promote the FIRST master_customer_id as deduplicated_customer_id
Group Master by Customer

customer_id | email | master_customer_id | deduplicated_customer_id
2           | a@a   | 1                  | 1
1           | a@a   | 1                  | 1
1           | b@b   | 1                  | 1
3           | b@b   | 1                  | 1
3           | c@c   | 3                  | 1
4           | c@c   | 3                  | 3
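Step 2 on the same toy data, again as a stdlib sketch (taking the smallest master_customer_id per customer stands in for "ORDER BY master_customer_id, take the FIRST"):

```python
# Sketch of Step 2: PARTITION BY customer_id, ORDER BY master_customer_id,
# promote the FIRST (i.e. smallest) master as deduplicated_customer_id.
rows = [  # (customer_id, master_customer_id) pairs after Step 1
    (2, 1), (1, 1), (1, 1), (3, 1), (3, 3), (4, 3),
]

dedup = {}
for customer_id, master_id in rows:
    dedup[customer_id] = min(dedup.get(customer_id, master_id), master_id)

print(dedup)  # {2: 1, 1: 1, 3: 1, 4: 3}
```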
Can we dedup more?
Yes! - Group Dedup by Email

customer_id | email | master (per email) | dedup (iteration 1) | master (per email, iteration 2)
2           | a@a   | 1                  | 1                   | 1
1           | a@a   | 1                  | 1                   | 1
1           | b@b   | 1                  | 1                   | 1
3           | b@b   | 1                  | 1                   | 1
3           | c@c   | 3                  | 1                   | 1
4           | c@c   | 3                  | 3                   | 1
Are we there yet?
Can we dedup more? - No

customer_id | email | master (per email) | dedup (iteration 1) | master (iteration 2) | dedup (final)
2           | a@a   | 1                  | 1                   | 1                    | 1
1           | a@a   | 1                  | 1                   | 1                    | 1
1           | b@b   | 1                  | 1                   | 1                    | 1
3           | b@b   | 1                  | 1                   | 1                    | 1
3           | c@c   | 3                  | 1                   | 1                    | 1
4           | c@c   | 3                  | 3                   | 1                    | 1

Deduplicated!
● We only saw 2 iterations
○ But there could be multiple depths of relationships.
○ And more attributes, like address or phone number.
● We had to deal with multiple columns
○ customer_id, master_id, dedup_id
● Are you confused?
○ It was already hard to explain and to understand.
● Is there a different way to see it?
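The two steps can be repeated until nothing changes. A compact sketch of that fixpoint loop on the toy data, using the smallest id as a stand-in for the earliest registration (illustrative, not the production PySpark job):

```python
# Repeat Step 1 (master per email) and Step 2 (smallest master per
# customer) until the assignments stop changing.
rows = [(1, 'a@a'), (2, 'a@a'), (1, 'b@b'), (3, 'b@b'), (3, 'c@c'), (4, 'c@c')]
dedup = {c: c for c, _ in rows}  # start: every customer is its own master

while True:
    # Step 1: the smallest dedup id per email becomes that email's master.
    email_master = {}
    for c, e in rows:
        email_master[e] = min(email_master.get(e, dedup[c]), dedup[c])
    # Step 2: every customer takes the smallest master over its emails.
    new = dict(dedup)
    for c, e in rows:
        new[c] = min(new[c], email_master[e])
    if new == dedup:  # fixpoint reached
        break
    dedup = new

print(dedup)  # {1: 1, 2: 1, 3: 1, 4: 1}
```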
Conclusion
Network
[Graph: customers 1, 2, 3 and 4 connected through the emails a@a, b@b and c@c, forming a single component]
Connected components
● Email address
● Invoice address
● Full name + invoice address
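The connected-components view can be sketched with a small union-find over customer and email nodes (toy data from the slides; a production job would use a graph library such as GraphX/GraphFrames on Spark):

```python
# Customers and emails become graph nodes; sharing an email links customers.
edges = [  # (customer_id, email) pairs, i.e. graph edges
    (1, 'a@a'), (2, 'a@a'), (1, 'b@b'), (3, 'b@b'), (3, 'c@c'), (4, 'c@c'),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for customer_id, email in edges:
    union('customer:%d' % customer_id, 'email:%s' % email)

components = {find('customer:%d' % c) for c, _ in edges}
print(len(components))  # 1: all four customers collapse into one person
```

Extra attributes (invoice address, full name) just add more edge types to the same graph, which is what keeps this approach simple.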
Website traffic
Deduplicating visitors
● Cross device tracking
● Complete customer journey
Key takeaways
Recognize your customer
Keep it simple - use graphs!
Focus on customer journey
Break: time for some wakey juice
Data Systems | Migrating to the Cloud: Our Journey | 23-05-2019
Flipping tables
Data Systems Team
● We work with data, a lot of data!
● Keep data clean and centralized in our Data Warehouse
● Create data pipelines in Airflow
● Support the semantic layer using OLAP cubes
The Python Jedi
Gwildor Sok
● Data Engineer at Coolblue since July 2017
● Started with Python in 2012
● Game and full-stack web development before moving to Data Engineering
They call me the cube guy
André Santos
● Business Intelligence Engineer at Coolblue since April 2018
● Experienced in the Microsoft BI stack
● Started learning how to love open source
● OLAP! OLAP everywhere!
Data Center Architecture

Data Source 1, Data Source 2, Data Source 3, External Systems (...)
→ Staging → Data Warehouse → Data Mart 1, Data Mart 2 → OLAP 1, OLAP 2

Azkaban orchestrates each step of this flow.
Migration Steps
● Move OLAP Server from data center to cloud
● Move SQL Server from data center to cloud
● Move away from Azkaban and adopt Airflow
● Create data validation mechanisms
Azkaban vs Airflow

Azkaban
● Task configuration separate from code
● Interface is not great to browse historical runs
● Hard to rerun individual tasks

Airflow
● Templated arguments, like SQL queries
● A lot of building blocks designed for data engineering work
● Dates as first-class citizens; easily trigger historical runs
What do we need to migrate?
Big daily process
Identifying and splitting up
Introducing checkpoints
Organizing it in Airflow
● Multiple pipelines instead of one
● Interdependent communication
● Multiple “checkpoints”
Our new daily process(es!)
Load to staging area (325 tasks) → Heavy calculations (50 tasks) → Loading data warehouse tables (140 tasks) → Process semantic layer (20 tasks)
Checkpoints approach

Advantages
● Easier to read and reason about
● Maintainable
● Separate logical units
● Less dependency management
● Easier to test

Disadvantages
● Interdependency checks can fail, blocking the next step
● More code because of the interdependency checks
● Generally slower
Ignoring the black box
Observing behaviour: Source → ? → Staging
Old code
Most code looked like this
def main():
    result = run_query('some_query.sql')
    filename = create_csv(result)
    upload_to_gcs(filename, GCS_FILENAME)
    load_to_mssql(GCS_FILENAME, MSSQL_TABLE)
New code
Now we have this
OracleToGCSOperator(
    sql='some_query.sql',
    gcs_location=GCS_FILENAME)
GCSToMSSQLOperator(
    gcs_location=GCS_FILENAME,
    mssql_table=MSSQL_TABLE)
Advantages
Now we have this
● Configuration as code
○ Easier to read
○ Very easy to test
● Less code to maintain
○ Written and maintained by Airflow contributors
○ Custom code is the exception rather than the default
● Quicker to create new pipelines
Summary
Airflow
● All configuration now in code
● Building blocks for faster pipeline development
● A lot less code
● Manageable daily process
SQL Server(less)
SQL Server from Data Center to Cloud
Data Center (physical server) → Cloud
SQL Server(less)
Amazon Relational Database Service (RDS)
● Simple to set up and configure
● Supports multiple database providers
● Patching the database software, backing up databases and some other DBA tasks are managed by AWS itself
SQL Server(less)
Step by Step - SQL Server Migration to Cloud
New SQL Server Instance on RDS
Deploy DW onto new Instance
Populate historical tables
Configure daily ETL in Airflow
Data Validation tools
SQL Server(less)
Step by Step - SQL Server Migration to Cloud

MyDB:
  Properties:
    AllocatedStorage: "100"
    DBInstanceClass: db.m1.small
    Engine: sqlserver-se
    EngineVersion: "14.00.3015.40.v1"
  Type: "AWS::RDS::DBInstance"
SQL Server(less)
Step by Step - SQL Server Migration to Cloud
● Deploy DW onto new instance: Team City deployment
● Daily ETL: Data Source 1, Data Source 2, Data Source 3 → ETL
● Data validation tools: Apache Beam and NBi
Summary
SQL Server on RDS
● We can easily scale our instance
● No server maintenance
● All configuration in code (CloudFormation), which facilitates maintenance
● The backup mechanism offered by AWS has some limitations
OLAP Server
What is an OLAP database?
● OLAP stands for OnLine Analytical Processing
● An OLAP database is a multi-dimensional array of data, commonly referred to as a “cube”
● This technology is used to facilitate query processing on top of the data warehouse
OLAP Server
OLAP on top of the data warehouse
Data warehouse → OLAP Server (SSAS) → Report 1, Report 2, ... Report N
OLAP Server
How to migrate our OLAP Server?
OLAP Server
Main Challenges

No support for our OLAP technology
● Owning and supporting our own VM (EC2)
● Configuring the VM using “code” (no UI on Windows Server Core)

Weekly recycling (wipe)
● Keep the same machine configuration after recycling
● Keep data in the OLAP Server after recycling
OLAP Server
1st step - AMI (basebox)
2nd step - CloudFormation (AWS architecture)
3rd step - Configurations and backups
4th step - Integration with our ETL pipeline
OLAP Server
Integrate OLAP Server with Airflow (and... USERS)
Create Partition → Process Partition
Partitions: 2019W04, 2019W03, 2019W02, ...
Summary
OLAP Server on EC2
● We can easily scale our instance
● Infrastructure as code facilitates maintenance
● Easy to rebuild the machine if it gets corrupted
● A lot of overhead cost on training upfront (really)
Automated validation
Same result set is important
PK | value        PK | value
1  | A            4  | D
2  | B            3  | C
3  | C            2  | B

SELECT TOP 3 * FROM Foo ORDER BY PK
Automated validation
Getting the hashes

Source:                 Target:
PK | hash               PK | hash
1  | c4ca4238           1  | c4ca4238
2  | c81e728d           2  | c81e728d
3  | eccbc87e           3  | a87ff679
Automated validation
Comparing hashes

Source hashes + Target hashes → Apache Beam → Not in source / Not in target / Different
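The hashing and comparison steps can be sketched with the standard library (toy rows; the real pipeline computes and compares the hashes at scale in Apache Beam):

```python
# Hash each row, then diff source vs target per primary key.
import hashlib

def row_hash(row):
    # Hash the concatenated column values of one row.
    return hashlib.md5('|'.join(map(str, row)).encode('utf-8')).hexdigest()

source = {1: row_hash(('A',)), 2: row_hash(('B',)), 3: row_hash(('C',))}
target = {1: row_hash(('A',)), 2: row_hash(('B',)), 3: row_hash(('X',))}

not_in_target = source.keys() - target.keys()
not_in_source = target.keys() - source.keys()
different = {pk for pk in source.keys() & target.keys()
             if source[pk] != target[pk]}

print(sorted(different))  # [3]
```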
Automated validation
Grouping the output

Table | Type          | Count
A     | not_in_target | 0
A     | not_in_source | 5
A     | different     | 1000
B     | not_in_target | 20
B     | not_in_source | 0
B     | different     | 500
Automated validation
Daily report

Table | Difference | Difference yesterday
A     | 5000       | 0
B     | 300        | 300
C     | 20         | 10,000
Automated validation
What’s different?

Table | Primary Key | Type
A     | 1           | not_in_target
A     | 2           | not_in_source
A     | 3           | different
Automated validation
Automated validation steps
1. Get result set from source and target
2. Calculate hashes
3. Compare hashes, track differences
4. Store counts of differences in tracking tables
5. Talk through differences every day
Custom validation
NBi
● Unit testing for Business Intelligence, based on NUnit
● For tables where the logic changed and therefore need custom validation
● For validating the OLAP Server output
Summary
Validation
● Automated validation for most of our data
● Custom validation for tables that changed
● Custom validation for important parts of the OLAP Server
Apache Beam · NBi
What we gained
What we learned
Lessons learned from this migration (1 / 2)
● Not everything you have in the data center will be supported by AWS as-is
● Fewer monitoring capabilities compared to the data center; no superuser privileges on RDS
● Doing two migrations in parallel (Azkaban → Airflow, data center → AWS) might not be such a smart idea
What we learned
Lessons learned from this migration (2 / 2)
● You should get extra training on AWS/DevOps upfront
● Think infrastructure as code, both for Airflow pipelines and the weekly OLAP recycling: everything is in code now, less in documentation or manual changes
● AWS's flexibility allows you to scale your infrastructure with ease
Tour Time
Beer Time!!!
careers@coolblue.nl