SCALDING
Introduction and usage
What is Scalding?
• Scalding is a Scala-based API for MapReduce applications
• Scalding is built on top of Cascading
• Cascading is a flow-oriented processing framework that acts as an abstraction layer over MapReduce
What is Cascading?
• Cascading introduces the concept of source taps (input), sink taps (output) and pipes to connect them, essentially abstracting away the key/value scheme of MapReduce
• Within a pipe, users define the transformation of data by applying operations such as GroupBy, Every and others
WordCount!
• WordCount in Cascading:
In comes Scalding
• Scalding was created by Twitter as a Scala DSL for Cascading
• The goal is to offer functions that operate on the data flow, as opposed to constructing objects with embedded operations
• Scalding applications feel and behave like scripts, ideally replacing Pig
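To make the "functions on the data flow" idea concrete, here is a minimal word count sketch in the Field API, modelled on the example in the Scalding README. The class name `WordCountJob` and the `input`/`output` argument names are illustrative, and the `scalding-core` dependency is assumed to be on the classpath.

```scala
import com.twitter.scalding._

// Illustrative job name; run with --input <path> --output <path>.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                        // source tap: one 'line field per input line
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }                   // adds a 'size field holding the count per word
    .write(Tsv(args("output")))                  // sink tap: tab-separated (word, size) rows

  // Lower-case the line and split it on whitespace, dropping empty tokens.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty)
}
```

The whole flow is a chain of function calls on the pipe; no operation objects are constructed by hand.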
Scalding APIs
• Scalding offers three different APIs:
• Field API – a simple, symbol-based, function-oriented API; the first choice for most use cases
• Type safe API – a lower-level, typed API with closer access to Cascading; used for more complex inputs, such as Avro
• Matrix API – applies matrix and vector operations to pipes, but only for the types Int, Long and String (due to comparator ops)
• The Field and Typed APIs are designed to offer the same kinds of functions and can be converted into one another, i.e. (Field) Pipe instances convert to TypedPipe and vice versa
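The conversion between the two APIs can be sketched as follows. This assumes a Scalding version that provides `TypedPipe.from(pipe, fields)` and `TypedPipe#toPipe(fields)` (present in the 0.12-era releases as far as I know); the job name, field names and argument names are made up for illustration.

```scala
import cascading.pipe.Pipe
import com.twitter.scalding._

// Illustrative job: reads (word, count) rows, doubles the count, writes them back.
class ConvertJob(args: Args) extends Job(args) {
  // Field API: a Cascading Pipe with the named fields 'word and 'count.
  val fieldsPipe: Pipe = Tsv(args("input"), ('word, 'count)).read

  // Field API -> Typed API: select the fields and give them Scala types.
  val typed: TypedPipe[(String, Int)] =
    TypedPipe.from[(String, Int)](fieldsPipe, ('word, 'count))

  val doubled: TypedPipe[(String, Int)] =
    typed.map { case (word, count) => (word, count * 2) }

  // Typed API -> Field API: name the tuple positions again.
  doubled
    .toPipe(('word, 'doubled))
    .write(Tsv(args("output")))
}
```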
Functions
• Scalding has map-like functions, such as:
• map
• flatMap
• filter and filterNot
• collect
• Grouping / Joining functions:
• groupBy
• groupAll
• join (left, right, outer etc.)
• Reduce functions:
• reduce (DUH!)
• foldLeft
• average, sum
Documentation: https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
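The functions listed above are named after their Scala collections counterparts. The sketch below uses plain Scala collections, not Scalding pipes, so it runs without Hadoop or Scalding; it only illustrates what each operation does to the data. In the Field API the same names are applied to named fields, and the reduce-style functions are typically called inside a groupBy block. The object name and sample data are illustrative.

```scala
object FunctionsDemo {
  val lines: List[String] = List("to be or not", "to be")

  // flatMap: each input line yields several words.
  val words: List[String] = lines.flatMap(_.split("\\s+"))

  // filter keeps matching elements, filterNot drops them.
  val longWords: List[String]  = words.filter(_.length > 2)
  val shortWords: List[String] = words.filterNot(_.length > 2)

  // collect: filter and map in one step via a partial function.
  val lengthsOfB: List[Int] = words.collect { case w if w.startsWith("b") => w.length }

  // groupBy, followed by a per-group aggregation (here: size).
  val counts: Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // foldLeft: aggregate with an explicit start value.
  val totalChars: Int = words.foldLeft(0)(_ + _.length)

  // reduce: aggregate without a start value (longest word wins; ties keep the first).
  val longest: String = words.reduce((a, b) => if (b.length > a.length) b else a)

  // sum and average over the word lengths.
  val sumOfLengths: Int = words.map(_.length).sum
  val averageLength: Double = sumOfLengths.toDouble / words.size
}
```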
Example – Field API
Simple map and filter with the Field API
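The slide's original code is not reproduced here. As a stand-in, the following is a minimal sketch of a map and a filter in the Field API; the job name, the field names `'name`, `'age`, `'upperName` and the argument names are assumptions made for illustration.

```scala
import com.twitter.scalding._

// Illustrative job: keeps adults and upper-cases their names.
class MapFilterJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('name, 'age))              // TSV rows read as the fields 'name and 'age
    .read
    .filter('age) { age: Int => age >= 18 }      // keep rows whose 'age is at least 18
    .map('name -> 'upperName) { name: String => name.toUpperCase }  // append a new field
    .project('upperName, 'age)                   // keep only the fields to be written
    .write(Tsv(args("output")))
}
```

Fields are addressed by symbol, and the Scala type of each field is stated only in the function literal that consumes it.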
Example – Typed API
Simple mapping with Avro and TypedAPI
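The slide's Avro example is not reproduced here. The sketch below shows the same kind of simple mapping in the Typed API, but reads plain text lines instead of Avro records so that it needs nothing beyond `scalding-core`; the job and argument names are illustrative.

```scala
import com.twitter.scalding._

// Illustrative job: pairs every non-empty line with its length.
class TypedLengthJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))            // TypedPipe[String]
    .map { line => (line, line.length) }             // TypedPipe[(String, Int)]
    .filter { case (_, length) => length > 0 }       // element types are checked at compile time
    .write(TypedTsv[(String, Int)](args("output")))  // sink must match the element type
}
```

Unlike the Field API, there are no field names here: every step is typed by the element it carries.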
Example – Configuring and running
Configuration uses the Hadoop configuration
Running follows the Job / ToolRunner scheme:
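The slide's original code is not reproduced here. A minimal sketch of the scheme, assuming Hadoop's `ToolRunner` and Scalding's `com.twitter.scalding.Tool`: the job class name `com.example.WordCountJob` and the queue setting are placeholders, not values from the original deck.

```scala
import com.twitter.scalding.Tool
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner

object JobRunner {
  // Scalding's Tool expects: <job class name> <mode flag> <job arguments...>
  def toolArgs(jobClass: String, userArgs: Array[String]): Array[String] =
    Array(jobClass, "--hdfs") ++ userArgs

  def main(args: Array[String]): Unit = {
    // Plain Hadoop configuration; the queue name is only an example setting.
    val conf = new Configuration()
    conf.set("mapreduce.job.queuename", "default")

    // ToolRunner parses generic Hadoop options, then hands the rest to Scalding's Tool,
    // which instantiates the named Job class and runs it.
    val exitCode = ToolRunner.run(conf, new Tool, toolArgs("com.example.WordCountJob", args))
    sys.exit(exitCode)
  }
}
```

Replacing `--hdfs` with `--local` runs the same job in Cascading's local mode, which is convenient for trying a job on a small file.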
Flow Listener
• You can monitor the execution progress with Cascading listeners.
1. Define Scalding Stat objects (case classes for Hadoop counters)
2. Increment them within your operations by calling incBy(Int)
3. Implement the FlowListener interface and add an instance of your implementation (here called MyFlowListener as a placeholder) to your Job's listeners:
override def listeners = super.listeners ++ List(new MyFlowListener)
Example: Flow Listener
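The slide's original code is not reproduced here. The three steps can be sketched as below, assuming Cascading 2.x's `FlowListener` interface and a Scalding version in which `Stat(name, group)` can be created inside a `Job`. The class names, the counter name `empty-lines` and the group `demo` are illustrative.

```scala
import cascading.flow.{Flow, FlowListener}
import com.twitter.scalding._

// Illustrative listener that only logs the flow's lifecycle events.
class LoggingFlowListener extends FlowListener {
  override def onStarting(flow: Flow[_]): Unit  = println(s"starting ${flow.getName}")
  override def onStopping(flow: Flow[_]): Unit  = println(s"stopping ${flow.getName}")
  override def onCompleted(flow: Flow[_]): Unit = println(s"completed ${flow.getName}")
  // Returning false tells Cascading that the exception was not handled here.
  override def onThrowable(flow: Flow[_], t: Throwable): Boolean = false
}

class CountingJob(args: Args) extends Job(args) {
  // Step 1: a Stat backed by a Hadoop counter.
  val emptyLines = Stat("empty-lines", "demo")

  TypedPipe.from(TextLine(args("input")))
    .filter { line =>
      val keep = line.trim.nonEmpty
      if (!keep) emptyLines.incBy(1)   // Step 2: increment inside an operation
      keep
    }
    .write(TypedTsv[String](args("output")))

  // Step 3: register the listener in addition to any inherited ones.
  override def listeners: List[FlowListener] =
    super.listeners ++ List(new LoggingFlowListener)
}
```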
Example: Flow Listener
Accessing stats values:
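The slide's original code is not reproduced here. One way to read counter values is from the flow's statistics once the flow has completed. The sketch below assumes Cascading 2.x, where `Flow#getFlowStats` exposes `getCounterValue(group, counter)`; the listener class and its constructor parameters are hypothetical.

```scala
import cascading.flow.{Flow, FlowListener}

// Hypothetical listener: reads one counter when the flow has completed.
class CounterReportingListener(group: String, counter: String) extends FlowListener {
  // Last value seen, kept so that calling code can inspect it after the run.
  @volatile var lastValue: Option[Long] = None

  override def onStarting(flow: Flow[_]): Unit = ()
  override def onStopping(flow: Flow[_]): Unit = ()

  override def onCompleted(flow: Flow[_]): Unit = {
    // Counter values are aggregated over all steps of the flow.
    val value = flow.getFlowStats.getCounterValue(group, counter)
    lastValue = Some(value)
    println(s"$group/$counter = $value")
  }

  // Returning false leaves exception handling to Cascading.
  override def onThrowable(flow: Flow[_], t: Throwable): Boolean = false
}
```

The group and counter passed to the listener have to match the ones used when the Stat was defined.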
Resources
Scalding home and docs on GitHub:
https://github.com/twitter/scalding
Excellent intro and advanced topics:
http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014

Scalding intro 20141125
