Using Scalding for Data-Driven
Product Development
Sasha Ovsankin
LinkedIn
Presented to Scala By The Bay
Aug 9, 2014
/summary
Data-Driven
Product
Development
Scalding =
Hadoop + Scala
/data-driven
Your
Amazing
Service
Value Data
/data-driven/linkedin
“Online” World
Web Applications
NoSQL Data
Stores
“Offline” World (Hadoop)
HDFS
Hadoop Jobs
Tracking/logging
Analytics
Data
Products
Messaging
Message delivery
Databases
/linkedin/big-data/links
• “LinkedIn Big Data Ecosystem”
– http://lnkd.in/big-data-ecosystem
• Grid Operations
– http://lnkd.in/gridops2013
/scalding
http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop
framework
• Uses API similar to Scala collections:
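import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    // split each input line on runs of whitespace
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    // count the occurrences of each word
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )
}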
• Succinct and powerful
• High level of abstraction
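To make the "similar to Scala collections" point concrete, here is a minimal sketch of the same word count written against ordinary in-memory Scala collections. It assumes no Hadoop or Scalding dependency; the names `LocalWordCount` and `countWords` are illustrative and not part of Scalding.

```scala
// Word count on plain Scala collections, mirroring the flatMap/groupBy
// shape of the Scalding job above. All names here are illustrative.
object LocalWordCount {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+""").toList) // split each line on whitespace
      .filter(_.nonEmpty)                            // drop empty tokens (e.g. from leading spaces)
      .groupBy(identity)                             // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = countWords(Seq("to be or not", "to be"))
    // Print one "word<TAB>count" pair per line, sorted by word
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The pipeline has the same shape as the job above; what Scalding adds is running those steps as Hadoop jobs over files rather than over a collection held in memory.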
/data-driven/problem/scaling
• Problem: Scaling
• Solution
– Distributed processing
– High-level description of algorithms
– Functional programming
…/solution/scalding
../problem/complexity
• Problem: Complexity
• Solution
– Consistent way of organizing data
• Self-describing data formats (Avro)
• File organization
– Type safety
– Modularization
…/solution/scalding
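As a rough illustration of the "type safety" and "modularization" points, the sketch below models records as a case class and packages each processing step as a small reusable function. It uses plain Scala collections rather than Scalding or Avro, and the `PageView` record, its fields, and the function names are all hypothetical.

```scala
object TypedSteps {
  // A typed record (hypothetical schema): misspelling a field name is a
  // compile error rather than a failure at job run time.
  final case class PageView(memberId: Long, page: String, millis: Long)

  // Step 1: keep only views that lasted at least minMillis.
  def longViews(minMillis: Long)(views: Seq[PageView]): Seq[PageView] =
    views.filter(_.millis >= minMillis)

  // Step 2: count views per page.
  def viewsPerPage(views: Seq[PageView]): Map[String, Int] =
    views.groupBy(_.page).map { case (page, vs) => page -> vs.size }

  // The steps compose into a larger flow; each can be tested on its own.
  def longViewsPerPage(views: Seq[PageView]): Map[String, Int] =
    viewsPerPage(longViews(minMillis = 1000L)(views))
}
```

Because each step has an explicit input and output type, steps can be recombined into other flows and the compiler checks that they fit together.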
/linkedin/hadoop/practices
• All online data end up in HDFS
– Avro encoding is standard
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem
../solution/scala/killer-argument
• Map & reduce -- primitives
scala> (1 to 1000) map { x => x * x } reduce { _ + _ }
res20: Int = 333833500
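One way to read this example: because addition is associative, the same map and reduce can run on separate chunks of the input and the partial results can be combined afterwards, which is roughly what a Hadoop job does across machines. The sketch below simulates that on a single machine; the chunk size of 100 and the name `PartitionedSum` are arbitrary choices for illustration.

```scala
object PartitionedSum {
  // Map step: square each number. Reduce step: sum the squares.
  // Note: reduce throws on empty input, so xs must be non-empty.
  def sumOfSquares(xs: Seq[Int]): Int = xs.map(x => x * x).reduce(_ + _)

  // Simulate distribution: split the input into chunks ("partitions"),
  // map and reduce each chunk independently, then reduce the partial sums.
  def partitionedSumOfSquares(xs: Seq[Int], chunkSize: Int): Int =
    xs.grouped(chunkSize)
      .map(chunk => sumOfSquares(chunk))
      .reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val direct = sumOfSquares(1 to 1000)
    val partitioned = partitionedSumOfSquares(1 to 1000, chunkSize = 100)
    println(s"direct=$direct partitioned=$partitioned")
  }
}
```

Both computations give the same total no matter how the input is chunked, which is the property that lets a framework spread the work over many nodes.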
/linkedin/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by
our team
– Pretty happy with readability, maintainability and tooling
support
• Dozens of flows are currently in production, and
counting
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]
/linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them
become more productive and successful
• We are looking for amazing people interested in
Software Engineering and Data Science
– http://linkedin.com/careers
Questions?
