The Big Data Ecosystem at LinkedIn
Jay Kreps
Me
  • Background in data not infrastructure
  • LinkedIn’s SNA team
  • Original co-author of some LinkedIn open source projects (Voldemort, Azkaban, Kafka)
This Talk
  • We are in a renaissance of data infrastructure.
  • How do all these pieces fit together?
Why the current obsession with “Big Data”?
The goal of modern data infrastructure is to make many small computers act like one big one.
The Old Picture
The New Picture
Polyglot persistence?
Infrastructure Icebergs
  • 90k lines of tooling and monitoring, 30k lines of logic
  • Dedicated engineers, operations
  • Training
  • First three nines come from operations
This is (still) a very immature space. Which systems should we have?
Infrastructure is sculpted by applications and constraints
Projects are defined by trade-offs
Constraints
  • Hardware
      • Jeff Dean: Numbers everyone should know
      • David Patterson: Latency lags bandwidth
  • $$$
  • Other
      • Path dependence
      • Complexity
      • Resources
Applications
Common categories of non-CRUD
  • Recommendations & Matching
  • Graphs
  • Search
  • Data Normalization
  • News feed
  • Analysis & Monitoring
Social Graph
Search
Recommendations: People
Recommendations: Jobs
Recommendations: Newsfeed
Data Normalization
Analytics
Infrastructure
  • Search
      • Lucene
      • Bobo (facets), Zoie (real-time indexing), Sensei (distribution)
  • Social Graph
  • Storage
      • Oracle
      • Voldemort
      • Espresso
  • Streams
      • Databus
      • Kafka
  • Offline
      • Hadoop & friends (Pig, Hive, Azkaban, etc)
Three Major Paradigms
  • Request/Response
      • Search
      • Social Graph
      • Storage
  • Streams
      • Kafka
  • Batch
      • Hadoop
Most features are multi-paradigm
Request/Response
  • Search
  • Social Graph
  • Storage
      • Voldemort
      • Espresso
Request/Response Patterns
  • Broker, scatter-gather
  • Storage systems: only Partitioning strategy
  • Latency oriented
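The broker and scatter-gather patterns named on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not code from any LinkedIn system: `Partition`, `Broker`, and `partition_for` are hypothetical names, the partitions are plain Python objects rather than remote nodes, and hash-modulo partitioning is only one possible partitioning strategy.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: a given key always maps to the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class Partition:
    """Hypothetical stand-in for one storage or search node."""

    def __init__(self):
        self.docs = {}

    def put(self, key, text):
        self.docs[key] = text

    def search(self, term):
        # Each partition only knows about its own slice of the data.
        return [(k, t) for k, t in self.docs.items() if term in t]


class Broker:
    """Routes keyed requests to one partition; fans searches out to all."""

    def __init__(self, num_partitions=4):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _owner(self, key):
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key, text):
        self._owner(key).put(key, text)

    def get(self, key):
        # Keyed lookup: only the partitioning strategy is needed, no fan-out.
        return self._owner(key).docs.get(key)

    def search(self, term):
        # Scatter: query every partition in parallel.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            partials = list(pool.map(lambda p: p.search(term), self.partitions))
        # Gather: merge the partial results into one response.
        return sorted(hit for part in partials for hit in part)


broker = Broker()
broker.put("doc1", "kafka streams")
broker.put("doc2", "hadoop batch")
broker.put("doc3", "kafka and hadoop")
results = broker.search("kafka")
```

In a scatter-gather design the response can only be as fast as the slowest partition, which is one reason such systems are described as latency oriented.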
Batch: Hadoop
  • Uses
      • Ad hoc
      • Production batch
  • Ecosystem
      • Hive, Pig
      • Azkaban (workflow)
      • Avro data
      • Data in: Kafka
      • Data out: Voldemort, Kafka
Why do batch if you have real-time?
  • Batch advantages
      • Safety
      • Easy
      • Throughput
      • Simplicity
      • Economics
  • Tricky bit: engineering the data cycle
Why do streaming?
  • You have to glue all these systems together
  • Throughput as good as batch
  • Latency much better
  • Metaphor more natural for low latency than Hadoop
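One way to picture the "glue" role of a stream is an append-only log that each downstream system reads at its own pace. The sketch below is a toy in-memory illustration under that assumption; `Log` and `Consumer` are hypothetical names and this is not Kafka's API.

```python
class Log:
    """Toy append-only log: records are never modified, only appended."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=100):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own offset, independent of the others."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
low_latency = Consumer(log)   # e.g. a system that polls after every event
batch_loader = Consumer(log)  # e.g. a system that loads everything later

log.append("profile_view")
first = low_latency.poll()

log.append("connection_added")
log.append("job_applied")
second = low_latency.poll()

# The batch-style consumer reads the same stream in one pass.
everything = batch_loader.poll()
```

Because each consumer keeps its own position, a low-latency reader and a batch-style reader can share the same stream without coordinating with each other.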
What makes successful infrastructure systems?
  • Operability and Operations
  • Monitoring
  • Simplicity
  • Documentation
  • Broad adoption
  • Lazy users
  • Open source
Open Source
  • Data > Infrastructure
  • Open source creates better code, even with few outside contributors
  • Commercial infrastructure not interesting
Open Source Projects
  • We made
      • Voldemort: Key/Value storage
      • Sensei, Bobo, Zoie: Elastic, faceted, real-time search with Lucene
      • Kafka: Persistent, distributed data streams
      • Norbert: Cluster aware RPC, load balancing, and group membership
      • And others…
  • We stole
      • Hadoop, Pig, Hive
      • Lucene
      • Netty, Jetty
      • Zookeeper
      • Avro
      • Apache Traffic Server
The End
  • jay.kreps@gmail.com
  • http://www.linkedin.com/in/jaykreps
  • http://twitter.com/jaykreps
  • http://sna-projects.com

Editor's Notes

  • #11: Good news for users, bad news for distributed systems nerds. Filesystems take a decade to mature. Don’t expect this will be easier.