Ohai Hadoop!
Build your first MapReduce with
Hadoop & Ruby
Tweet@_swanand
GitHub@swanandp
StackOverflow@18678
Work@Kaverisoft
Make { DispatchTrack }
mailto:swanand@pagnis.in
Who am I?
Ruby, Coffeescript,
Java, Rails, Sinatra,
Android, TextMate,
Emacs, Minitest,
MySQL, Cassandra,
Hadoop, Mountain
Lion, Curl, Zsh, GMail,
Solarized, Oscar
Wilde, Robert Jordan,
Quentin Tarantino,
Charlize Theron
Tell 'em what you're going to tell 'em
● MapReduce! Wait, what?
● Enter the Hadoop. *gong*
● Convention over Configuration? You wish.
● Instant Gratification. Now you're talkin'
● Further Reading. Go forth and read!
MapReduce! Wait, what?
● Map: Given a set of values (or key-values),
output another set of values (or key-values)
● [K1, V1] -> map -> [K2, V2]
● Map each value into a new value
MapReduce! Wait, what?
● Reduce: Given a set of values for a key,
come up with a summarized version
● K1[V1, V2 ... Vn] -> reduce -> K1[Y]
● Reduce given values into 1 value
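The two signatures above can be sketched in plain Ruby. This is only an illustration, not Hadoop code: the method names `map_pair` and `reduce_values` and the sample data are made up for this example.

```ruby
# Map: one [K1, V1] pair in, one [K2, V2] pair out. No state is kept
# between calls, so every pair can be mapped independently.
def map_pair(title, text)
  [title.downcase, text.split.size]
end

# Reduce: all the values collected for one key collapse into one value.
def reduce_values(key, values)
  [key, values.sum]
end

p map_pair("Dracula", "the quick brown fox")   # ["dracula", 4]
p reduce_values("dracula", [6, 4, 9])          # ["dracula", 19]
```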
MapReduce! Um.. hmm..
Q: What is the single biggest takeaway from
mapping?
A: The map operation is stateless, i.e. one iteration
doesn't depend on the previous iteration.
Q: What is the single biggest takeaway from
reducing?
A: A reduce operation works on the values of one
particular key.
Enter the Hadoop. *gong*
"The really interesting thing I want you to notice,
here, is that as soon as you think of map and
reduce as functions that everybody can use, and
they use them, you only have to get one
supergenius to write the hard code to run map and
reduce on a global massively parallel array of
computers, and all the old code that used to work
fine when you just ran a loop still works only it's a
zillion times faster which means it can be used to
tackle huge problems in an instant."
- Joel Spolsky
MapReduce! Oh, yeah!
1. Convert raw data into readable format
2. Iterate over data chunks, converting each
chunk into meaningful key-value pairs
3. Do this for all your data using massive
parallelization
4. Group all the keys and their respective
values
5. Take values for a key and convert into
desired meaningful format
6. Step 2 is called mapper
7. Step 5 is called reducer
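Steps 2 to 5 can be tried out in a single Ruby process before any cluster is involved. The sketch below is an in-memory stand-in, assuming the data already fits in an array; `map_reduce`, `WORD_MAPPER` and `SUM_REDUCER` are names invented for this example, and nothing here runs in parallel.

```ruby
# Runs mapper, grouping and reducer in one process. A real framework
# would spread the mapper and reducer calls over many machines.
def map_reduce(chunks, mapper, reducer)
  pairs = chunks.flat_map { |chunk| mapper.call(chunk) }       # steps 2 and 3
  grouped = pairs.group_by { |key, _value| key }               # step 4
  grouped.map do |key, key_pairs|
    values = key_pairs.map { |_key, value| value }
    [key, reducer.call(key, values)]                           # step 5
  end.to_h
end

# Mapper: turn one line of text into [word, 1] pairs.
WORD_MAPPER = ->(line) { line.downcase.scan(/[a-z]+/).map { |word| [word, 1] } }
# Reducer: add up the values gathered for one word.
SUM_REDUCER = ->(_key, values) { values.sum }

p map_reduce(["Ohai Hadoop", "ohai Ruby"], WORD_MAPPER, SUM_REDUCER)
```

The mapper and reducer are passed in as lambdas so that the grouping step in the middle stays the same whatever the job is.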
Enter the Hadoop. *gong*
Same process has now become:
1. Put data into Hadoop
2. Define your mapper
3. Define your reducer
4. Run your jobs
5. Read processed data from Hadoop
Other advantages:
● Encapsulations over common problems
like large files, process management, disk
/ node failure
Convention over Configuration? You wish.
● Job: top level descriptor
● Task: job has_many tasks
● NameNode: HDFS boss (core-site.xml)
● DataNode: HDFS slaves (slaves)
● JobTracker: MapReduce boss (mapred-site.xml)
● TaskTracker: MapReduce slave (mapred-site.xml)
● Client: user's window into Hadoop, through the
command hadoop
Convention over Configuration? You wish.
● Configuration in XML & Shell scripts. Yuck!
● Respite:
○ Option for specifying a configuration directory
○ Shell script configuration is mostly ENV variables
● Which means:
○ Configuration can be written in YML or JSON or
Ruby and exported in XML
○ ENV variables can be set using rake, thor or just
plain Ruby
● Caveats:
○ No standard wrapper to do this (Go write one!)
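One way such a wrapper could start is sketched below: settings are kept in YAML and rendered into the `<configuration>` / `<property>` layout that Hadoop's *-site.xml files use. The helper names `xml_escape` and `hadoop_site_xml` are invented for this example, and the host and port in the sample value are placeholders.

```ruby
require "yaml"

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Escape the characters that are unsafe inside XML text.
def xml_escape(text)
  text.to_s.gsub(/[&<>]/, XML_ESCAPES)
end

# Render a flat name => value hash as a Hadoop configuration document.
def hadoop_site_xml(settings)
  properties = settings.map do |name, value|
    "  <property>\n" \
    "    <name>#{xml_escape(name)}</name>\n" \
    "    <value>#{xml_escape(value)}</value>\n" \
    "  </property>\n"
  end
  %(<?xml version="1.0"?>\n<configuration>\n#{properties.join}</configuration>\n)
end

# Sample settings; the value is a placeholder for a real NameNode address.
settings = YAML.safe_load(<<~YAML)
  fs.default.name: "hdfs://localhost:9000"
YAML

puts hadoop_site_xml(settings)
```

The resulting string would then be written into whichever configuration directory is handed to Hadoop.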
Convention over Configuration? You wish.
● Default mappers and reducers are defined in
Java
● Other languages supported using Streaming
API
● Streaming API makes use of STDIN and
STDOUT to read and output data and
executable binaries for processing
● Caveats
○ No dependency management, we are on our own
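Since the scripts are plain executables, launching a streaming job comes down to one `hadoop jar` call. The sketch below only builds that command line in Ruby. The jar location, the input and output names and the script paths are all placeholders (the streaming jar sits in different places depending on the Hadoop version), and `streaming_command` is a name invented for this example.

```ruby
require "shellwords"

# Build the argument list for a Hadoop Streaming job.
def streaming_command(jar:, input:, output:, mapper:, reducer:)
  [
    "hadoop", "jar", jar,
    "-input", input,
    "-output", output,
    "-mapper", File.basename(mapper),
    "-reducer", File.basename(reducer),
    "-file", mapper,    # ship the mapper script along with the job
    "-file", reducer    # ship the reducer script along with the job
  ]
end

command = streaming_command(
  jar: "/path/to/hadoop-streaming.jar",  # placeholder path
  input: "books",
  output: "letter-counts",
  mapper: "scripts/mapper.rb",
  reducer: "scripts/reducer.rb"
)

puts Shellwords.join(command)
# system(*command) would run it on a machine where Hadoop is installed.
```

Only the scripts themselves are shipped this way. Any gems they require have to be present on every node already, which is the caveat above.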
Instant Gratification. Now you're talkin'
GOAL:
1. Take a couple of books in txt format
2. Find out the total usage of each character in
the English alphabet.
3. Establish that e is the most used.
4. Why this example?
a. Perfect use case for MapReduce.
b. Algorithm is simple.
c. Results are simple to analyze.
d. Txt formatted books are easily available in Project
Gutenberg.
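A minimal sketch of a mapper and reducer for this goal follows. It assumes tab-separated "key, tab, value" lines between the two phases and reducer input sorted by key. The method names are invented for this example, and this is not necessarily the code shown in the talk.

```ruby
require "stringio"

# Mapper: emit "<letter>\t1" for every letter a-z found in the input.
def map_letters(input, output)
  input.each_line do |line|
    line.downcase.scan(/[a-z]/) { |letter| output.puts "#{letter}\t1" }
  end
end

# Reducer: input arrives sorted by key, so equal letters sit next to
# each other and a running total per letter is enough.
def reduce_letters(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    letter, count = line.chomp.split("\t", 2)
    if letter != current
      output.puts "#{current}\t#{total}" if current
      current = letter
      total = 0
    end
    total += count.to_i
  end
  output.puts "#{current}\t#{total}" if current
end

# Local dry run: the sort stands in for the grouping done between phases.
mapped = StringIO.new
map_letters(StringIO.new("Enter the Hadoop\n"), mapped)
shuffled = StringIO.new(mapped.string.lines.sort.join)
reduce_letters(shuffled, $stdout)
```

As streaming scripts, the mapper file would end with `map_letters($stdin, $stdout)` and the reducer file with `reduce_letters($stdin, $stdout)`.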
Further Reading and Watching
● Official Documentation
● Wiki: http://wiki.apache.org/hadoop/
● Hadoop examples that ship with Hadoop
● http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop
● http://www.youtube.com/watch?v=d2xeNpfzsYI
Questions?
Thank you!

MapReduce with Hadoop and Ruby
