How to build
a data warehouse?
Dmytro Popovych, SE @ Tubular
Theory vs practice
Quote #441422
While engineers in white lab coats are bolting a beautiful engine onto a perfect wing, a crew of dishevelled oddballs led by a mad adventurer flies over them on a contraption built from a minibus, a fence, and two industrial hair dryers, heading for their second round of investment.
Beautiful projects never take off, because they run out of time to take off.
Agenda
• Problem statement:
• Data Ingestion
• Data Normalisation
• Data Access
• How we solve the problem :)
About us
• Video intelligence for the cross-platform world
• 30+ video platforms including YouTube, Facebook, Instagram
• 7M creators
• 3B videos
• 2 TB of newly ingested data a day
• 150 TB of data in the warehouse
What is a data warehouse?
A central repository of data collected from disparate sources.
[Diagram: analysts, engineers, and services all work with the data warehouse.]
Key features
Ingestion
Store raw data extracted from disparate data sources
Normalisation
Cleanup / combine raw data
Access
Help users retrieve data
What problems does it solve in Tubular?
• For engineers / analysts:
• data discovery
• prototyping / analysis
• For services:
• data exchange
Data Ingestion
Data Ingestion Problems
• Real time data:
• tweets, comments, shares, views
• Periodical snapshots:
• dump of real time data
• results of the data analysis
• databases from internal services (in some cases)
Real time data
DATABUS / event log / message queue
Powered by KAFKA
Data serialised with AVRO
Keeps all events for the last N days
[Diagram: services publish events to the databus; from there they flow to permanent storage.]
Why did we choose Kafka?
• Stores streams of records in a fault-tolerant way
• Designed to serve multiple consumers per topic
• Allows keeping the last N days of records
• Battle-tested at very large companies: LinkedIn, Twitter, Uber, Airbnb, ...
Why did we choose Avro?
• Strict schema definition
• Safe schema evolution
• Compact (binary serialisation format)
• Cross-technology format (Java, Python, …)
• Has an ecosystem around it (Schema Registry, CLI consumers, …)
• Hadoop-friendly
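Not from the slides: a minimal sketch of what publishing an Avro-serialised event to a databus topic could look like in Python, assuming the kafka-python and fastavro libraries; the topic name, schema, and broker address are hypothetical, and the Schema Registry mentioned above is omitted here.

```python
import io

from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer

# Hypothetical schema for a "video view" event; field names are illustrative.
VIEW_EVENT_SCHEMA = parse_schema({
    "type": "record",
    "name": "VideoView",
    "fields": [
        {"name": "video_id", "type": "string"},
        {"name": "platform", "type": "string"},
        {"name": "views", "type": "long"},
        {"name": "observed_at", "type": "long"},  # unix timestamp, milliseconds
    ],
})


def serialize(event: dict) -> bytes:
    """Serialise an event with the Avro schema (schemaless body, no container header)."""
    buffer = io.BytesIO()
    schemaless_writer(buffer, VIEW_EVENT_SCHEMA, event)
    return buffer.getvalue()


producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],  # assumed broker address
    value_serializer=serialize,
)

producer.send("video-views", {
    "video_id": "abc123",
    "platform": "youtube",
    "views": 42,
    "observed_at": 1_500_000_000_000,
})
producer.flush()
```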
Periodical Snapshots
[Diagram: a data import tool (powered by Spark) pulls snapshots from the databus and from services backed by Elastic, Cassandra, and MySQL, and writes them to permanent storage on S3, serialised as Parquet.]
Why did we choose S3?
• Nothing for us to operate or maintain (fully managed)
• Compatible with Hadoop ecosystem
• Relatively stable & cheap
Why did we choose Parquet?
• Column-oriented format (perfect for analytics and partial reads)
• Supports complex data structures
• Compatible with Hadoop ecosystem
Why did we choose Spark?
• Scalable data processing engine
• Faster than Hadoop MapReduce
• Has connectors for all popular storage systems: JDBC, Elastic, Cassandra, Kafka
• Has Python bindings
• Built-in support of Parquet
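To make the import path concrete, here is a minimal sketch of a snapshot import job in PySpark, assuming the stock JDBC reader, a MySQL driver on the classpath, and an s3a-backed bucket; the database, table, credentials, and bucket names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snapshot-import").getOrCreate()

# Pull a snapshot of a table from an internal MySQL-backed service (hypothetical DSN and table).
creators = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://internal-db:3306/creators")
    .option("dbtable", "creators")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Write the snapshot to permanent storage on S3 as Parquet, under a dated prefix.
(
    creators
    .write
    .mode("overwrite")
    .parquet("s3a://warehouse/raw/creators/dt=2017-01-01/")
)
```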
Data Normalisation
Data Normalisation Problems
• Clean up duplicates
• Partition by year / month / day / hour
• Join various data sources
Normalisation of real time data (example)
[Diagram: Service #1 (powered by Elastic) consumes streams from the databus and serves the UI; the databus also feeds permanent storage.]
The service joins multiple data streams by sending
partial updates to Elastic.
Note: this isn't the only way to implement a real-time join; a more generic solution could be built with Apache Samza.
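As an illustration of the partial-update approach (not code from the talk), a minimal sketch using the elasticsearch Python client; the cluster address, index name, and document fields are hypothetical.

```python
from elasticsearch import Elasticsearch

# Assumed cluster address and index layout; both are illustrative.
es = Elasticsearch(["http://elastic:9200"])


def apply_view_count(video_id: str, views: int) -> None:
    """Merge one stream's fields into the video document via a partial update.

    Other streams (comments, shares, ...) update their own fields the same way,
    so the document effectively becomes the join of all streams.
    """
    es.update(
        index="videos",
        id=video_id,
        body={
            "doc": {"views": views},
            "doc_as_upsert": True,  # create the document if it doesn't exist yet
        },
    )


apply_view_count("abc123", 42)
```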
Why did we choose Elastic?
• Provides real time search and analytics
• Has relatively cheap partial updates
• Easy to scale
Normalisation of previously imported data
[Diagram: a data normalisation tool (powered by Spark) reads from and writes back to permanent storage.]
The tool:
• joins various datasets
• removes duplicates
• creates partitions by time-range buckets
Why did we choose Spark?
• Scalable data processing engine
• Has a built-in SQL API to transform data (perfect for joins and deduplication)
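A minimal sketch of such a normalisation pass with Spark's DataFrame API, assuming a raw Parquet dataset with video_id, observed_at (timestamp), and ingested_at columns; all paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("normalise-views").getOrCreate()

raw = spark.read.parquet("s3a://warehouse/raw/video-views/")

# Deduplicate: keep only the most recently ingested record per (video_id, observed_at) key.
window = Window.partitionBy("video_id", "observed_at").orderBy(F.col("ingested_at").desc())
deduped = (
    raw.withColumn("rank", F.row_number().over(window))
    .filter(F.col("rank") == 1)
    .drop("rank")
)

# Derive time-range bucket columns from the event timestamp.
normalised = (
    deduped
    .withColumn("year", F.year("observed_at"))
    .withColumn("month", F.month("observed_at"))
    .withColumn("day", F.dayofmonth("observed_at"))
)

# Write back to permanent storage, partitioned by the time buckets.
(
    normalised.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("s3a://warehouse/normalised/video-views/")
)
```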
Data Access
Data Access Problems
• Dataset discovery
• Unified data access interface
Metadata Storage
[Diagram: files in permanent storage (Parquet, Avro, CSV, ...) are registered as tables in the metadata storage, powered by Hive Metastore.]
Why did we choose Hive Metastore?
• Supported by Hadoop ecosystem
• Simple (a Thrift API on top of MySQL tables)
• Supported by Hue (UI to access metadata)
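To illustrate how a dataset in permanent storage becomes a discoverable table (a sketch under assumed names, not the exact Tubular setup), Spark with Hive support can register and query it through the metastore; the table name, columns, and location are hypothetical.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the Hive Metastore as its table catalog.
spark = (
    SparkSession.builder
    .appName("register-table")
    .enableHiveSupport()
    .getOrCreate()
)

# Register a Parquet dataset in permanent storage as a table in the metastore.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS video_views (
        video_id STRING,
        platform STRING,
        views BIGINT
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    LOCATION 's3a://warehouse/normalised/video-views/'
""")

# Discover partitions already present under the table location.
spark.sql("MSCK REPAIR TABLE video_views")

# Anyone with access to the metastore (Spark, Hue, ...) can now find and query the table.
spark.sql("SELECT platform, SUM(views) FROM video_views GROUP BY platform").show()
```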
Let's summarise...
System Overview
[Diagram: the warehouse comprises the databus, permanent storage, metadata storage, the import tool, and the normalisation tool; services feed data in and consume it, while analysts and engineers work on top.]
* Data flows for the metadata storage are explained verbally, too many arrows...
Thanks! Questions?
Check this out: https://github.com/Tubular/sparkly