LinkedIn2
Scaling out to 10 Clusters, 1000 Users,
and 10,000 Flows:
The Dali Experience at LinkedIn
Carl Steinbach
Senior Staff Software Engineer
Data Analytics Infrastructure Group
LinkedIn
Hadoop @ LinkedIn: Circa 2008
1 cluster
20 nodes
10 users
10 production workflows
MapReduce, Pig
Hadoop @ LinkedIn: NOW
> 10 clusters
> 10,000 nodes
> 1,000 users
Hundreds of production workflows, thousands of
development flows and ad-hoc queries
MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark,
Presto, …
What did we learn along the way?
Scaling Hardware Infrastructure is Hard
What did we learn along the way?
Scaling Human Infrastructure is Harder
Dali Motivations: Data Consumers
Data consumers have to manage too many details
What data is available, and who produces it?
Where is the data located (cluster, path)?
How is the data partitioned (logical → physical)?
How do I read the data (format)?
Dali Motivations: Data Producers
Managing contracts with consumers is hard
Change Management
Who depends on this dataset?
Public versus Private APIs
Can’t unpublish new fields
Schemas are too weak
Physical types are nice, but we want semantic types
Dali Motivations: Infra Providers
This mess makes things really hard for infrastructure providers!
Lots of optimizations are impossible because producer/consumer logic locks us
into what should be backend decisions
Storage format (Avro)
Physical partitioning scheme (Date)
Data location (Specific directory, cluster, FS)
Lots of redundant code paths to support
Hidden, constantly changing dependencies linking producers and consumers
Dali Vision and Mission
Vision: Make analytics infrastructure invisible
Mission: Make data on HDFS easier to manage
Filesystem: multi-version, multi-cluster
Datasets: tables not files
Views: contract management for producers and consumers
Lineage and Discovery: map datasets to producers, consumers, and track
changes over time
Dali Central Dogma
Separate logical and physical concerns!
GUIDs for logical entities are good!
Versions are good!
“All problems in computer science can be solved by another level of indirection”
David Wheeler (?)
Dali Dataset API: Catalog Service
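As a rough illustration of the indirection a catalog service can provide, the sketch below resolves a logical dataset name to its physical details (cluster, path, storage format). Everything in it is hypothetical: `DatasetLocation`, `Catalog`, the cluster name, and the path are invented for this example and are not Dali's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetLocation:
    """Physical details a consumer should not have to hard-code (hypothetical)."""
    cluster: str
    path: str
    storage_format: str


class Catalog:
    """Toy in-memory catalog mapping a logical name to a physical location."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetLocation] = {}

    def register(self, logical_name: str, location: DatasetLocation) -> None:
        # Re-registering a name models the infra team moving or reformatting data
        # without consumers changing the name they read from.
        self._entries[logical_name] = location

    def resolve(self, logical_name: str) -> DatasetLocation:
        try:
            return self._entries[logical_name]
        except KeyError:
            raise LookupError(f"unknown dataset: {logical_name}") from None


catalog = Catalog()
catalog.register(
    "prod_identity.profile",
    DatasetLocation(cluster="cluster-a", path="/data/identity/profile", storage_format="avro"),
)
location = catalog.resolve("prod_identity.profile")
```

Consumers only ever hold the logical name, so the storage format or cluster can change behind `resolve` without touching their code.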
Is a Dataset API Enough?
Some use cases at LinkedIn
Structural transformations (flattening and nesting)
Muxing and de-muxing data (unions)
Patching bad data
Backward incompatible changes
Code reuse
What we need
Ability to decouple the API from the dataset
Producer control over public and private APIs
Tooling and process to support safe evolution of these APIs
Dali View
A sample view
CREATE VIEW profile_flattened
TBLPROPERTIES(
'functions' =
'get_profile_section:isb.GetProfileSections',
'dependencies' =
'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
AS SELECT
get_profile_section(...)
FROM
prod_identity.profile;
Reading a Dali View from Pig
register ivy://com.linkedin.dali:dali-all:2.3.52;
a = load 'dali:///prod_identity/profile_flattened'
using dali.data.pig.DaliStorage();
Versioning for Dali Views
All views and UDFs must be versioned
Declare and version view dependencies
Deploy multiple versions of the same view
Incremental pull upgrades replace monolithic push upgrades
Managing views as Gradle artifacts
Git is the source of truth for view definitions
1:1 mapping between views and published artifacts
Mapping view artifacts to view names
db.view → ${groupId}.${artifactId}_x_y_z
${groupId}.${artifactId} is an alias for the most recent view definition
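The naming scheme `${groupId}.${artifactId}_x_y_z` can be sketched in a few lines. This is only an illustration: the function names are invented, the sample group and artifact IDs are hypothetical, and treating "most recent" as "highest x.y.z version" is an assumption rather than something the slides state.

```python
def versioned_view_name(group_id: str, artifact_id: str, version: str) -> str:
    """Build the db.view name for one published version, e.g. 0.0.5 -> _0_0_5."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected an x.y.z version, got {version!r}")
    return f"{group_id}.{artifact_id}_" + "_".join(parts)


def alias_name(group_id: str, artifact_id: str) -> str:
    """Unversioned name that points at the most recent view definition."""
    return f"{group_id}.{artifact_id}"


def most_recent(versions: list[str]) -> str:
    """Pick the highest version numerically (assumed meaning of 'most recent')."""
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


published = ["0.0.5", "0.0.10", "0.0.9"]
pinned = versioned_view_name("prod_identity", "profile_flattened", "0.0.5")
alias_target = versioned_view_name(
    "prod_identity", "profile_flattened", most_recent(published)
)
```

A consumer that wants stability reads the pinned name; one that wants to follow upgrades reads the alias, which here would resolve to the `0.0.10` definition.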
Leveraging existing LI tools infrastructure
Query view/UDF version dependency graph
who-depends-on-me?
Deprecate, EOL, and purge a specific view/UDF version
Plug into existing global namespace management provided by LI developer tools
Enforce referential integrity for views at deployment time
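A "who-depends-on-me?" query amounts to walking the dependency graph in reverse. The sketch below assumes the graph is available as a plain mapping from each artifact to its direct dependencies. Only `get-profile-sections:0.0.5` comes from the slides; the other artifact names and versions are made up, and this is not the LinkedIn tooling itself.

```python
from collections import defaultdict

# artifact -> artifacts it directly depends on (illustrative data)
DEPENDS_ON = {
    "profile_flattened:1.0.0": ["get-profile-sections:0.0.5"],
    "profile_summary:2.1.0": ["profile_flattened:1.0.0"],
    "ad_hoc_report:0.3.0": ["profile_summary:2.1.0"],
}


def who_depends_on_me(target: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every artifact that depends on target, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its consumers.
    reverse: dict[str, set[str]] = defaultdict(set)
    for consumer, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(consumer)

    seen: set[str] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for consumer in reverse.get(node, ()):
            if consumer not in seen:
                seen.add(consumer)
                stack.append(consumer)
    return seen


impacted = who_depends_on_me("get-profile-sections:0.0.5", DEPENDS_ON)
```

An empty result for a given version is the signal that it can be deprecated and purged without breaking anyone.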
Contract Law for Datasets
Vague, poorly defined contracts bind data producers to consumers
Physical types don’t tell us much
STRING or URI?
STRING or ENUM?
Semantic types help, but what about other types of relationships?
X IS NOT NULL
a_time is in seconds, b_time is in milliseconds
Attributes of a good contract
Easy to find
Easy to understand
Easy to change
Hijacking an existing process
Express contracts as logical constraints against the fields of a view
Make the contract easy to find by storing it in the view’s Git repo
Contract negotiation follows an existing process
Data producer (view owner) controls the ACL on the view repo
Data consumer requests a contract change via ReviewBoard request
View owner either accepts or rejects the pull request
If accepted, view version is bumped to notify downstream consumers
If rejected, consumer still has the option of committing the constraint to their own repo
Contract → constraint-based testing for views
Contract → data quality tests
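One way to picture contracts as logical constraints against the fields of a view is a set of named predicates that double as data quality tests. The sketch below is an assumption-laden illustration, not Dali's constraint format: `member_id` is a hypothetical field, and the seconds/milliseconds checks are a crude magnitude heuristic chosen only for the example.

```python
from typing import Callable

Row = dict
Constraint = Callable[[Row], bool]

# Epoch seconds for present-day dates have about 10 digits, milliseconds about 13,
# so 10**11 separates the two (heuristic, for illustration only).
SECONDS_UPPER_BOUND = 10**11

CONTRACT: dict[str, Constraint] = {
    "member_id IS NOT NULL": lambda row: row.get("member_id") is not None,
    "a_time is in seconds": lambda row: row["a_time"] < SECONDS_UPPER_BOUND,
    "b_time is in millis": lambda row: row["b_time"] >= SECONDS_UPPER_BOUND,
}


def violations(rows: list[Row], contract: dict[str, Constraint]) -> list[tuple[int, str]]:
    """Return (row index, constraint name) for every failed check."""
    failed = []
    for index, row in enumerate(rows):
        for name, check in contract.items():
            if not check(row):
                failed.append((index, name))
    return failed


rows = [
    {"member_id": 1, "a_time": 1_400_000_000, "b_time": 1_400_000_000_000},
    {"member_id": None, "a_time": 1_400_000_000_000, "b_time": 1_400_000_000_000},
]
report = violations(rows, CONTRACT)
```

Because the contract is just data stored alongside the view definition, the same predicates can run as unit tests against sample rows and as quality checks against production output.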
Why Dali?
Consumers
Make data stable, predictable, discoverable
Producers
Explicit, manageable contracts with consumers
Frictionless, familiar process for modifying existing contracts
Infra Providers
Freedom to optimize
Flow portability → DR, multi-DC scheduling
Simplifying with Views
csteinbach@linkedin.com
linkedin.com/in/carlsteinbach
@cwsteinbach
©2014 LinkedIn Corporation. All Rights Reserved.



Editor's Notes

  • #4: Since People You May Know is long, we call it PYMK at LinkedIn. The original version ran on Oracle. It worked by attempting to find overlaps between any pair of people: did they share the same school? Did they work at the same company? One big indicator was common connections, and we used something called triangle closing.