Scala and Hadoop @ eBay
What we will cover
• Polymorphic Function Values
• Higher Kinded/Recursive Types
• Cokleislis Star Operators
• Scala Macros
I have no clue what those things are
What we will ACTUALLY cover
• Why Scala
• Why Hadoop
• How we use Scala with Hadoop
• Lots of CODE!
Why Scala?
• JVM
• **Functional**
• Expressive
• How to convince your boss?
Someone on Hacker News said
Scala sucks
• Compile Times
• You changed List again?
• Complicated
• Leads to Madness
Madness?
trait Lazy[+T, P] {
  var creationParameters: P = None.asInstanceOf[P]
  lazy val lazyThing: Either[Throwable, T] =
    try { Right(create(creationParameters)) }
    catch { case e => Left(e) }
  def get(createParams: P): Either[Throwable, T] = {
    creationParameters = createParams
    lazyThing
  }
  def create(params: P): T
}
Madness?
def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = {
  lazyCreator.get(params) match {
    case Right(successValue) => successValue
    case Left(exception) => throw new StackException(exception)
  }
}
This is used by ONE client class
• Show some self-restraint
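For contrast, a restrained version of the same idea might look like the sketch below. It is not the code from the talk: it assumes the creation parameters can be fixed up front (the original passes them in on first use), and `Connection`, `SingleInstance` and the URL are hypothetical names chosen for illustration.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for an expensive resource; not from the slides.
final case class Connection(url: String)

object SingleInstance {
  // Counts how often the factory body runs, so the caching is observable.
  var creations = 0

  // A lazy val is initialised on first access. Try captures a non-fatal
  // failure as a value, so either outcome is computed once and cached.
  lazy val connection: Try[Connection] = Try {
    creations += 1
    Connection("jdbc:example://localhost")
  }

  // Unwraps the cached result, wrapping the original failure if there was one.
  def get: Connection = connection match {
    case Success(c) => c
    case Failure(e) => throw new IllegalStateException("creation failed", e)
  }
}
```

Calling `SingleInstance.get` twice returns the same instance and runs the factory body once.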
Scala and Hadoop @ eBay
Hadoop
• void map(K1 key, V1 value,
OutputCollector<K2, V2> output, Reporter
reporter)
• void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3, V3> output, Reporter
reporter)
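The two signatures above can be modelled in memory with plain Scala collections. This is only a sketch of the data flow (map, group by key, reduce), not Hadoop's API: `mapReduce` and `wordCount` are hypothetical helpers, and the `OutputCollector` and `Reporter` arguments are replaced by ordinary return values.

```scala
object MapReduceSketch {
  // In-memory model of one MapReduce pass: map every input record to
  // zero or more (key, value) pairs, group the pairs by key (the
  // "shuffle"), then reduce each key's values to output pairs.
  def mapReduce[K1, V1, K2, V2, K3, V3](
      input: Seq[(K1, V1)]
  )(
      map: (K1, V1) => Seq[(K2, V2)],
      reduce: (K2, Seq[V2]) => Seq[(K3, V3)]
  ): Seq[(K3, V3)] = {
    val mapped: Seq[(K2, V2)] = input.flatMap { case (k, v) => map(k, v) }
    val shuffled: Map[K2, Seq[V2]] =
      mapped.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
    shuffled.toSeq.flatMap { case (k, vs) => reduce(k, vs) }
  }

  // Word count: the input key is a line offset, the value is the line's text.
  def wordCount(lines: Seq[(Long, String)]): Map[String, Int] =
    mapReduce[Long, String, String, Int, String, Int](lines)(
      (_, line) => line.split(" ").toSeq.filter(_.nonEmpty).map(w => (w, 1)),
      (word, ones) => Seq((word, ones.sum))
    ).toMap
}
```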
BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items 
How should I use Map Reduce?
• Raw map reduce 
• Pig
• Hive
• Cascading
• Scoobi
• Scalding 
Decision Time
• “And every one that heareth these sayings of
mine (great software engineers of the past),
and doeth them not, shall be likened unto a
foolish man, which built his house upon the
sand.”
• “And the rain descended, and the floods
came, and the winds blew, and beat upon that
house; and it fell: and great was the fall of it.”
I believe!
• Scalding combines the best of PIG and
Cascading
Good Pig
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
// do joins and group by also
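For readers who think in Scala collections, the filter and projection in the script above correspond roughly to the sketch below. The `Row` type and its field types are assumptions (the Pig script leaves them untyped), and `LOAD`, `DUMP` and `STORE` are left out.

```scala
object PigEquivalentSketch {
  // Hypothetical row type for the (x, y, z) relation loaded on the slide.
  final case class Row(x: Int, y: String, z: String)

  def run(a: Seq[Row]): Seq[(String, String)] = {
    val b = a.filter(_.x > 5)      // B = FILTER A BY x > 5;
    b.map(row => (row.y, row.z))   // C = FOREACH B GENERATE y, z;
  }
}
```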
Bad Pig
DEFINE NV_terms `perl nv_terms2.pl`
ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray,
name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as
name,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as
name1;
Other Pig Issues
• Scheduling and DAG creation
Cascading Rocks!
• What is it?
• Supports large workflows and reusable
components
– DAG generation
– Parallel Executions
Cascading code in Scala
val masterPipe = new
FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new
FilterInappropriateQueries(masterPipe, "sqr")
masterPipe = new GroupBy(masterPipe,
CFields("user_id", "epoch_ts", "sqr"),
sortFields)
Someone should really code review this
Cascading Issues
This page intentionally left blank
Scalding Time
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
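The same pipeline can be tried without a cluster using plain Scala collections. This is a sketch for experimenting with the tokenizer, not Scalding code; `WordCountSketch` and `countWords` are hypothetical names, and the extra `trim` and `filter` (which drop empty tokens) are additions not on the slide.

```scala
object WordCountSketch {
  // Lowercase the text, strip everything except letters, digits and
  // whitespace, then split on runs of whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase
      .replaceAll("[^a-z0-9\\s]", "")
      .trim
      .split("\\s+")
      .filter(_.nonEmpty)

  // flatMap then groupBy, mirroring the shape of the job above.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```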
Scalding @ eBay
• Boilerplate reduction
• Extensibility
• New hires
Practical Scalding Use
• Pimp my pimp
• Code generated boilerplate
• Cascades
• Traps
• Testing!
class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
  implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)
  class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions
  trait CommonFunctions {
    import Dsl._
    import RichPipe.assignName
    def pipe: Pipe
    def reallyComplexFunction(field: Fields, param: Long) = {
      // mind blowing code here
    }
  }
}
CheckoutTransactionsPipe(//default path logic)
.project(//fields I need)
.countUserInteractions(//params)
.doScoreCalculation(//params)
.doConfidenceCalculation(//params)
Seems a bit too readable for Scala
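The readable chaining comes from the enrich-my-library pattern: an implicit wrapper adds domain methods to an existing type. The sketch below shows the pattern on a plain `Seq` rather than a Scalding `Pipe`; `Interaction`, `minWeight` and `scorePerUser` are hypothetical and are not the eBay functions named on the slide.

```scala
object EnrichSketch {
  // Hypothetical record type standing in for a tuple in a pipe.
  final case class Interaction(userId: String, itemId: String, weight: Double)

  // The implicit class adds domain methods to any Seq[Interaction],
  // much as an implicit conversion to a rich pipe adds them to Pipe.
  implicit class RichInteractions(rows: Seq[Interaction]) {
    // Keeps only interactions at or above the given weight.
    def minWeight(threshold: Double): Seq[Interaction] =
      rows.filter(_.weight >= threshold)

    // Sums the weights of each user's interactions.
    def scorePerUser: Map[String, Double] =
      rows.groupBy(_.userId).map { case (user, rs) => (user, rs.map(_.weight).sum) }
  }
}
```

With `import EnrichSketch._` in scope, `rows.minWeight(0.5).scorePerUser` reads like the pipeline above.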
Collaborative Filtering
• Typically hard to run on large datasets
Structured Data Importance
• Do people shop by brand?
[Chart: Supply, Handbags and Purses]
Markov Chains
• Investigation of buying patterns in ~50 lines of
code
val purchases = "firsttime" :: x.take(500).toList
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy(x => x._1.toString + "-" + x._2.toString)
val sizes = grouped.map { x => x._1 -> x._2.size }.toList
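A self-contained version of the fragment above, extended with one step the slide does not show: normalising the counts into transition probabilities. It assumes a purchase history is a sequence of category names; `MarkovSketch` and both function names are hypothetical.

```scala
object MarkovSketch {
  // Counts transitions between consecutive states, with a synthetic
  // "firsttime" start state prepended as on the slide.
  def transitionCounts(history: Seq[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" +: history
    val pairs = purchases.zip(purchases.tail)
    pairs.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }
  }

  // Normalises counts into P(next | current): each count divided by
  // the total number of transitions leaving the same state.
  def transitionProbabilities(
      counts: Map[(String, String), Int]
  ): Map[(String, String), Double] = {
    val totals: Map[String, Int] = counts
      .groupBy { case ((from, _), _) => from }
      .map { case (from, group) => (from, group.values.sum) }
    counts.map { case ((from, to), n) => ((from, to), n.toDouble / totals(from)) }
  }
}
```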
Mining Search Queries
• 20+ billion user queries - give me the top ones
per user
De-Dupe, Rank, Validate, Sample Data
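The ranking step ("give me the top ones per user") can be sketched on a small in-memory log. This is an illustration of the logic only, not the production job: it assumes the log is a sequence of (user, query) pairs, and `topQueriesPerUser` is a hypothetical name. Ties are broken alphabetically so the output is deterministic.

```scala
object TopQueriesSketch {
  // For each user, counts how often each query appears and keeps the
  // n most frequent queries, most frequent first.
  def topQueriesPerUser(log: Seq[(String, String)], n: Int): Map[String, Seq[String]] =
    log.groupBy(_._1).map { case (user, entries) =>
      val counts = entries.groupBy(_._2).map { case (query, hits) => (query, hits.size) }
      val ranked = counts.toSeq.sortBy { case (query, count) => (-count, query) }.map(_._1)
      (user, ranked.take(n))
    }
}
```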
Automation
Hadoop Proxy
Batch Database Load
Machines
Cassandra
Jenkins
MySql
Mongo
Questions?
www.ebaynyc.com

Editor's Notes

  • #2: Introduce myself and eBay NYC.
  • #4: Laugh.
  • #5: We are starting to use Scala for live site recs.
  • #6: Mention Option and Either. First-class functions. Mention how great traits are. I feel like Haskell will never break into the corporation; this is a great draft. All my life I've wanted a type-safe build system, and NOW I have it.
  • #7: They break backward compatibility. Weak IDE support: debugging, refactoring, etc. Explain the madness.
  • #8: Tell them about the example.
  • #11: The most complicated system for counting words (insert meme here). Explain why we use Hadoop: the data is huge. I can't say when you want to make the jump to map reduce, but I see growth in making it THE platform.
  • #14: Say why raw map reduce stinks. Mention what Hive and Scoobi are.
  • #16: Explain why we didn't go with Scoobi even though it's all Scala.
  • #18: Scheduling and DAG creation. Where is my SOURCE?
  • #19: Mention Azkaban.
  • #20: Can do parallel executions of tasks that don't depend on each other. Supports static dependencies via cascades.
  • #23: Verbose. You still need to write a bunch of code.
  • #24: Mention Scoobi and how it's not super stable. Remind them how it combines the best of Pig and Cascading.
  • #28: This is actual code to compute a user's preferences. Explain a bit about user preferences.
  • #29: Mahout has some functions for this, but they are hard to set up and get going. Less precise than other state-of-the-art methods but still accurate. Scala Days talk with Chris Severs.
  • #30: Linear model. Talk about concept extraction. Use SQLite for ad hoc queries.
  • #32: Talk about the use of cascades. Talk about traps and counters.
  • #33: Scalding makes this 100% times easier because of cascades and flows