LinkedIn2

Scaling out to 10 Clusters, 1000 Users,
and 10,000 Flows:
The Dali Experience at LinkedIn
Carl Steinbach
Senior Staff Software Engineer
Data Analytics Infrastructure Group
LinkedIn

Hadoop @ LinkedIn: Circa 2008
1 cluster
20 nodes
10 users
10 production workflows
MapReduce, Pig

Hadoop @ LinkedIn: NOW
> 10 clusters
> 10,000 nodes
> 1,000 users
Hundreds of production workflows, thousands of
development flows and ad-hoc Qs
MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark,
Presto, …

What did we learn along the way?
Scaling Hardware Infrastructure is Hard

What did we learn along the way?
Scaling Human Infrastructure is Harder

Dali Motivations: Data Consumers
Data consumers have to manage too many details
What data is available, and who produces it?
Where is the data located (cluster, path)?
How is the data partitioned (logical → physical)?
How do I read the data (format)?

Dali Motivations: Data Producers
Managing contracts with consumers is hard
Change Management
Who depends on this dataset?
Public versus Private APIs
Can’t unpublish new fields
Schemas are too weak
Physical types are nice, but we want semantic types

Dali Motivations: Infra Providers
This mess makes things really hard for infrastructure providers!
Lots of optimizations are impossible because producer/consumer logic locks us
into what should be backend decisions
Storage format (Avro)
Physical partitioning scheme (Date)
Data location (Specific directory, cluster, FS)
Lots of redundant code paths to support

Hidden, constantly changing dependencies linking producers and consumers
Dali Vision and Mission
Vision: Make analytics infrastructure invisible
Mission: Make data on HDFS easier to manage
Filesystem: multi-version, multi-cluster
Datasets: tables not files
Views: contract management for producers and consumers
Lineage and Discovery: map datasets to producers, consumers, and track
changes over time

Dali Central Dogma
Separate logical and physical concerns!
GUIDs for logical entities are good!
Versions are good!

“All problems in computer science can be solved by another level of indirection”
David Wheeler (?)

Dali Dataset API: Catalog Service
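The catalog idea can be illustrated with a small sketch of the indirection it implies: a consumer asks for a logical name and gets back the physical details. This is not Dali's API. The `PhysicalLocation` type, the `resolve` function, and the cluster name and path below are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalLocation:
    """Physical details a consumer should not have to hard-code."""
    cluster: str
    path: str
    data_format: str


# Hypothetical in-memory catalog keyed by (database, table).
# The cluster name and path are made-up placeholders.
CATALOG = {
    ("prod_identity", "profile"): PhysicalLocation(
        cluster="cluster-a",
        path="/data/prod_identity/profile",
        data_format="avro",
    ),
}


def resolve(db, table, catalog=CATALOG):
    """Map a logical dataset name to its current physical location."""
    try:
        return catalog[(db, table)]
    except KeyError:
        raise LookupError(f"unknown dataset: {db}.{table}") from None
```

Because callers only ever name `prod_identity.profile`, the entry behind it could change cluster, path, or format without touching consumer code.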

Is a Dataset API Enough?
Some use cases at LinkedIn
Structural transformations (flattening and nesting)
Muxing and de-muxing data (unions)
Patching bad data
Backward incompatible changes
Code reuse
What we need
Ability to decouple the API from the dataset
Producer control over public and private APIs
Tooling and process to support safe evolution of these APIs

A sample view
CREATE VIEW profile_flattened
TBLPROPERTIES(
'functions' =
'get_profile_section:isb.GetProfileSections',
'dependencies' =
'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
AS SELECT
get_profile_section(...)
FROM
prod_identity.profile;

Reading a Dali View from Pig
register ivy://com.linkedin.dali:dali-all:2.3.52;
a = load 'dali:///prod_identity/profile_flattened'
    using dali.data.pig.DaliStorage();

Versioning for Dali Views
All views and UDFs must be versioned
Declare and version view dependencies
Deploy multiple versions of the same view
Incremental pull upgrades replace monolithic push upgrades

Managing views as Gradle artifacts
Git is the source of truth for view definitions
1:1 mapping between views and published artifacts
Mapping view artifacts to view names
db.view → ${groupId}.${artifactId}_x_y_z
${groupId}.${artifactId} is an alias for the most recent view definition
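The naming rule above can be sketched in a few lines. This assumes that `x_y_z` is the artifact version with dots replaced by underscores, which the slide suggests but does not state outright. The function names are hypothetical.

```python
import re


def versioned_view_name(group_id, artifact_id, version):
    """Build the versioned view name, e.g. db.view_0_0_5 (assumed format)."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"expected a version like x.y.z, got {version!r}")
    return f"{group_id}.{artifact_id}_" + version.replace(".", "_")


def alias_name(group_id, artifact_id):
    """Unversioned alias that points at the most recent view definition."""
    return f"{group_id}.{artifact_id}"


def latest_version(versions):
    """Pick the highest version, comparing components numerically."""
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))
```

For example, `versioned_view_name("prod_identity", "profile_flattened", "0.0.5")` yields `prod_identity.profile_flattened_0_0_5`, while the alias stays `prod_identity.profile_flattened`. Comparing version components as integers avoids ranking `0.0.9` above `0.0.10`.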

Leveraging existing LI tools INFRA
Query view/UDF version dependency graph
who-depends-on-me?
Deprecate, EOL, and purge a specific view/UDF version
Plug into existing global namespace management provided by LI developer tools
Enforce referential integrity for views at deployment time
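A who-depends-on-me query and a deploy-time integrity check can both be sketched over a plain dependency map. This is an illustration only, not the LinkedIn tooling. The `reports.profile_summary_1_0_0` view and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical declarations: each artifact maps to what it depends on.
DEPENDS_ON = {
    "prod_identity.profile_flattened_0_0_5": [
        "com.linkedin.dali-udfs:get-profile-sections:0.0.5",
    ],
    "reports.profile_summary_1_0_0": [
        "prod_identity.profile_flattened_0_0_5",
    ],
}


def who_depends_on_me(target, depends_on):
    """Return every artifact that depends on target, directly or transitively."""
    reverse = defaultdict(set)
    for node, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(node)
    seen, stack = set(), [target]
    while stack:
        current = stack.pop()
        for dependent in reverse.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen


def missing_dependencies(view, depends_on, deployed):
    """List declared dependencies of view that are not deployed yet."""
    return [dep for dep in depends_on.get(view, []) if dep not in deployed]
```

A deployment step could refuse to publish a view whenever `missing_dependencies` returns a non-empty list, and a purge step could refuse to remove a version while `who_depends_on_me` is non-empty.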

Contract Law for Datasets
Vague, poorly defined contracts bind data producers to consumers
Physical types don’t tell us much
STRING or URI?
STRING or ENUM?
Semantic types help, but what about other types of relationships?
X IS NOT NULL
a_time is in seconds, b_time is in millis
Attributes of a good contract
Easy to find
Easy to understand
Easy to change

Hijacking an existing process
Express contracts as logical constraints against the fields of a view
Make the contract easy to find by storing it in the view’s Git repo
Contract negotiation follows an existing process
Data producer (view owner) controls the ACL on the view repo
Data consumer requests a contract change via ReviewBoard request
View owner either accepts or rejects the review request
If accepted, view version is bumped to notify downstream consumers
If rejected, consumer still has the option of committing the constraint to their own repo
Contract  Constraint based testing for views
Contract  Data Quality tests

Why Dali?
Consumers
Make data stable, predictable, discoverable
Producers
Explicit, manageable contracts with consumers
Frictionless, familiar process for modifying existing contracts
Infra Providers
Freedom to optimize
Flow portability → DR, multi-DC scheduling

csteinbach@linkedin.com
linkedin.com/in/carlsteinbach
@cwsteinbach
