Meet-up: Tackling “Big
Data” with Hadoop and
Python
Sam Kamin, VP Data Engineering
NYC Data Science Academy
sam.kamin@nycdatascience.com
NYC Data Science Academy
● We’re a company that does training and
consulting in the Data Science area.
● I’m Sam Kamin. I just joined NYCDSA as
VP of Data Engineering (a new area for us).
I was formerly a professor at the U. of Illinois
(CS) and a Software Engineer at Google.
What this meet-up is about
● Wikipedia: “Data Science is the extraction of
knowledge from large volumes of data.”
● My goal tonight: Show you how you can
handle large volumes of data with simple
Python programming, using the Hadoop
streaming interface.
Outline of talk
● Brief overview of Hadoop
● Introduction to parallelism via MapReduce
● Examples of applying MapReduce
● Implementing MapReduce in Python
You can do some programming at the end if you want!
Big Data: What’s the problem?
Too much data!
o Web contains about 5 billion web pages. According
to Wikipedia, its total size is about 4 zettabytes -
that’s 4 × 10^21 bytes, or four thousand billion
gigabytes.
o Google’s datacenters store about 15 exabytes (15 ×
10^18 bytes).
Big Data: What’s the solution?
● Parallel computing: Use multiple,
cooperating computers.
Parallelism
● Parallelism = dividing up a problem so that
multiple computers can all work on it:
o Break the data into pieces
o Send the pieces to different computers for
processing.
o Send the results back and process the combination
to get the final result.
Cloud computing
● Amazon, Google, Microsoft, and many other
companies operate huge clusters: Racks of
(basically) off-the-shelf computers with
(basically) standard network connections.
● The computers in these clusters run Linux -
use them like any other computer...
Cloud computing
● But: getting them to work together is really
hard:
o Management: machine/disk failure; efficient data
placement; debugging, monitoring, logging, auditing.
o Algorithms: decomposing your problem so it can be
solved in parallel can be hard.
That’s what Hadoop is here to help with.
Hadoop
● A collection of services in a cluster:
o Distributed, reliable file system (HDFS)
o Scheduler to run jobs in correct order, monitor,
restart on failure, etc.
o MapReduce to help you decompose your problem
for parallel execution
o A variety of other components (mostly based on
MapReduce), e.g. databases, application-focused
libraries
How to use Hadoop
● Hadoop is open source (free!)
● It is hosted on Apache: hadoop.apache.org
● Download it and run it standalone (for
debugging)
● Buy a cluster or rent time on one, e.g. AWS,
GCE, Azure. (All offer some free time for
new users.)
MapReduce
● The main, and original, parallel-processing
system of Hadoop.
● Developed by Google to simplify parallel
processing. Hadoop started as an open-
source implementation of Google’s idea.
● With Hadoop’s streaming interface, it’s really
easy to use MapReduce in Python.
MapReduce - The Big Idea
● Calculations on large data sets often have
this form: Start by aggregating the data
(possibly in a different order from the
“natural order”), then perform a summarizing
calculation on the aggregated groups.
● The idea of MapReduce: If your calculation
is explicitly structured like this, it can be
automatically parallelized.
Computing with MapReduce
A MapReduce computation has three stages:
Map: A function called map is applied to each record in
your input. It produces zero or more records as output,
each with a key and value. Keys may be repeated.
Shuffle: The output from step 1 is sorted and combined: All
records with the same key are combined into one.
Reduce: A function called reduce is applied to each record
(key + values) from step 2 to produce the final output.
As the programmer, you only write map and reduce.
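The three stages are easy to simulate in ordinary Python. This toy, single-machine sketch (names like `map_reduce` are mine, not Hadoop's) shows exactly what the framework does with your two functions:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    # Map: apply mapper to every record; collect all (key, value) pairs.
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle: sort by key, then group the values for each key together.
    pairs.sort(key=itemgetter(0))
    groups = ((k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0)))
    # Reduce: apply reducer to each (key, [values]) group.
    return [reducer(k, vs) for k, vs in groups]

# The upcoming slide's example: an identity map, and a reduce that keeps the groups.
records = [("A", 7), ("C", 5), ("B", 23), ("B", 12), ("A", 18)]
result = map_reduce(records, lambda rec: [rec], lambda k, vs: (k, vs))
# result == [("A", [7, 18]), ("B", [23, 12]), ("C", [5])]
```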
Computing with MapReduce
map output:            shuffle output:
A, 7                   A, [18, 7]
C, 5                   B, [23, 12]
B, 23                  C, [5]
B, 12
A, 18

Input → map → shuffle → reduce → Output

Note: map is record-oriented, meaning the output of the
map stage is strictly a combination of the outputs from
each record. That allows us to calculate in parallel...
Parallelism via MapReduce
Because map and reduce are record-oriented, MR can
divide inputs into arbitrary chunks:
o Distribute data: split the input; each chunk goes to its
own map, running on its own machine.
o Combine/shuffle: route all records with the same key
to one reduce.
o Each reduce writes its own part of the output.
MapReduce example: Stock prices
● Input: list of daily opening and closing prices for
thousands of stocks over thousands of days.
● Desired output: The biggest-ever one-day
percentage price increase for each stock.
● Solution using MR:
o map: (stock, open, close) =>
(stock, (close - open) / open) (if pos)
o reduce: (stock, [%c0, %c1, …]) =>
(stock, max [%c0, %c1, …]).
MapReduce example - map
Goog, 230, 240  →  Goog, 4.3%
Apple, 100, 98  →  (nothing)
MS, 300, 250    →  (nothing)
MS, 250, 260    →  MS, 4%
MS, 270, 280    →  MS, 3.7%
Goog, 220, 215  →  (nothing)
Goog, 300, 350  →  Goog, 16.6%
IBM, 80, 90     →  IBM, 12.5%
IBM, 90, 85     →  (nothing)
You supply map: Output stock with % increase, or
nothing if decrease.
MapReduce example - shuffle/sort
Goog, 4.3%                       Goog, [4.3%, 16.6%]
MS, 4%         shuffle/sort      IBM, [12.5%]
MS, 3.7%      ────────────→      MS, [3.7%, 4%]
Goog, 16.6%
IBM, 12.5%

MapReduce supplies shuffle/sort: Combine all
records for each stock.
MapReduce example - reduce
Goog, [4.3%, 16.6%]              Goog, 16.6%
IBM, [12.5%]        reduce       IBM, 12.5%
MS, [3.7%, 4%]     ───────→      MS, 4%

You supply reduce: Output the max of the percentages for
each input record.
Wait, why did that help?
I could have just written a loop to read every
line and put the percentages in a table!
● Suppose you have a terabyte of data, and
1000 computers in your cluster.
● MapReduce can automatically split the data
into 1000 1GB chunks. You write two simple
functions and get a 1000x speed-up!
Modelling problems using MR
● We’re going to look at a variety of problems
and see how we can fit them into the MR
structure.
● The question for each problem is: What are
the types of map and reduce, and what do
they do?
Example: Word count
Input: Lines of text.
Desired output: # of occurrences of each
word (i.e. each sequence of non-space chars)
E.g. Input: Roses are red, violets are blue
Output: are, 2
blue, 1
red, 1 etc.
Example: Word count
Solution:
● map: “w1 w2 … wk” → (w1, 1), (w2, 1), …, (wk, 1)
● reduce: (w, [1, 1, …]) → (w, n), where the input
list contains n 1’s
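As a sketch (function names are mine), the word-count map and reduce are each a line or two of Python:

```python
def wc_map(line):
    # Emit (word, 1) for every whitespace-separated word on the line.
    return [(w, 1) for w in line.split()]

def wc_reduce(word, ones):
    # The shuffle delivers all the 1's for one word together; count them.
    return (word, sum(ones))

pairs = wc_map("Roses are red, violets are blue")
# pairs contains ("are", 1) twice; after the shuffle, wc_reduce("are", [1, 1]) == ("are", 2)
```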
Example: Word count frequency
Input: Output of word count
Desired output: For any number of
occurrences c, the number of different words
that occur c times.
E.g. Input: Roses are red, violets are blue
Output: 1, 4
2, 1
Example: Word count frequency
Solution:
● map: (w, c) → (c, 1)
● reduce: (c, [1, 1, …]) → (c, n), where the input
list contains n 1’s
Example: Page Rank
● Famous algorithm used by Google to rank
pages. (Comes down to matrix-vector
multiplication, as we’ll see…)
● Based on two ideas:
o Importance of a page depends upon how many
pages link to it.
o However, if a page has lots of links going out, the
value of each link is reduced.
Example: Page Rank
With those two ideas, calculate rank of page:
pagerank(p) = Σ_{q→p} pagerank(q) / out-degree(q)

Note: Because the web has cycles - page p can
have a link to page q, which has a link to p -
this formula requires an iterative solution.
Example: Page Rank
Consider pages and their links as a graph
(page A has links to B, C, and D, etc.):
pr(A) = pr(B)/2 + pr(D)/2
pr(B) = pr(A)/3 + pr(D)/2
pr(C) = pr(A)/3 + pr(B)/2
pr(D) = pr(A)/3 + pr(C)
Example: Page Rank
● Represent the graph as a weighted
adjacency matrix:
31
0 1/2 0 1/2
1/3 0 0 1/2
1/3 1/2 0 0
1/3 0 1 0
M =
links to
links from
A
B
C
D
B DA C
Example: Page Rank
● Now, if we put the page rank of each page in
a vector v, then multiplying M by v calculates
the pagerank formula for all nodes:
32
0 1/2 0 1/2
1/3 0 0 1/2
1/3 1/2 0 0
1/3 0 1 0
pr(A)
pr(B)
pr(C)
pr(D)
pr(B)/2 + pr(D)/2
pr(A)/3 + pr(D)/2
pr(A)/3 + pr(B)/2
pr(A)/3 + pr(C)
X =
Example: Page Rank
● So, to calculate page ranks, start with an
initial guess of all page ranks and multiply.
● After one multiplication:
[  0   1/2   0   1/2 ]   [ 1/4 ]   [ 1/4  ]
[ 1/3   0    0   1/2 ] × [ 1/4 ] = [ 5/24 ]
[ 1/3  1/2   0    0  ]   [ 1/4 ]   [ 5/24 ]
[ 1/3   0    1    0  ]   [ 1/4 ]   [ 1/3  ]
Example: Page Rank
● After two multiplications:
[  0   1/2   0   1/2 ]   [ 1/4  ]   [ .27  ]
[ 1/3   0    0   1/2 ] × [ 5/24 ] = [ .25  ]
[ 1/3  1/2   0    0  ]   [ 5/24 ]   [ .188 ]
[ 1/3   0    1    0  ]   [ 1/3  ]   [ .29  ]
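The iteration is easy to check in plain Python (a sketch, with the matrix hard-coded from the slides):

```python
# M[i][j] = 1/out-degree(j) if page j links to page i, else 0 (order A, B, C, D).
M = [[0,   1/2, 0, 1/2],
     [1/3, 0,   0, 1/2],
     [1/3, 1/2, 0, 0  ],
     [1/3, 0,   1, 0  ]]

def mat_vec(M, v):
    # One pagerank step: multiply the link matrix by the current rank vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [1/4, 1/4, 1/4, 1/4]   # initial guess: equal rank everywhere
v = mat_vec(M, v)          # first step gives [1/4, 5/24, 5/24, 1/3]
for _ in range(50):        # keep multiplying until the ranks settle
    v = mat_vec(M, v)
```

Because each column of M sums to 1, the total rank stays 1 at every step.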
Example: Page Rank
● Thus, page rank = matrix-vector product.
● Can we express matrix-vector multiplication
as a MapReduce?
o Assume v is copied (magically) to each node.
o M, being much bigger, needs to be partitioned, i.e. M
is the main input file.
o How shall we represent M and define map and
reduce?
Example: Page Rank
● A solution:
o Represent M using one record for each link:
(p, q, out-degree(p)) for every link p→q.
o map: (p, q, d) ↦ (q, v[p]/d)
reduce: p, [c1, c2, …] ↦ p, c1+c2+...
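In Python, that solution is a few lines (a sketch; here the rank vector `v` is a dict that, per the assumption above, every node can read):

```python
v = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}  # current ranks, copied everywhere

def pr_map(record):
    # record = (p, q, d): page p links to page q, and p has out-degree d.
    p, q, d = record
    return [(q, v[p] / d)]

def pr_reduce(q, contributions):
    # New rank of q: sum the contributions from all pages linking to it.
    return (q, sum(contributions))
```

For the link A→B with out-degree(A) = 3, `pr_map(("A", "B", 3))` emits `("B", v["A"]/3)`.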
MapReduce: Summary
● Nowadays, MapReduce powers the internet:
o Google, Amazon, and Facebook use it extensively for
everything from page ranking to error-log analysis.
o The NIH uses it to analyze gene sequences.
o NASA uses it to analyze data from probes.
o etc., etc.
● Next question: How can we implement a
MapReduce?
Writing map and reduce in Python
● Easy using the streaming interface:
o map and reduce: stdin → stdout. Each should
iterate over stdin and output a result for each line.
o Inputs and outputs are text files. In map and reduce
output, tab character separates key from value.
o Shuffle just sorts the files on the key.
 Instead of a line with a key and list of values, we
get consecutive lines with the same key.
Example: stock prices
● Recall the output of the shuffle stage:
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]

● The only difference is that this becomes:

Goog	4.3%
Goog	16.6%
IBM	12.5%
MS	3.7%
MS	4%
Example: stock prices
● On the next two slides, we show the map
and reduce functions in Python.
● Both of them are just stand-alone programs
that read stdin and write stdout.
● In fact, we can test our pipeline without using
MapReduce:
cat input-file | ./map.py | sort | ./reduce.py
Example: stock prices - map.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    record = line.strip().split(",")
    opening = int(record[1])
    closing = int(record[2])
    if closing > opening:
        change = float(closing - opening) / opening
        print('%s\t%s' % (record[0], change))
Example: stock prices - reduce.py
#!/usr/bin/env python
import sys

stock = None
max_increase = 0
for line in sys.stdin:
    next_stock, increase = line.strip().split('\t')
    increase = float(increase)
    if next_stock == stock:  # another line for the same stock
        if increase > max_increase:
            max_increase = increase
    else:  # new stock; output result for previous stock
        if stock:  # only false on the very first line of input
            print("%s\t%f" % (stock, max_increase))
        stock = next_stock
        max_increase = increase
# print the last stock
print("%s\t%f" % (stock, max_increase))
Invoking Hadoop
● Now we just have to run Hadoop. (Here we
are running locally. To run in a cluster, you
need to move the data into HDFS first.)
If you want to run code on our servers, I’ll
give instructions at the end of the talk.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt -output output \
    -mapper map.py -reducer reduce.py
Brief history of Hadoop
● 2004: Two Google engineers (Jeffrey Dean and
Sanjay Ghemawat) published a paper on MapReduce
o Doug Cutting was working on an open-source web
crawler; saw that MapReduce solved his biggest
problem: coordinating lots of computers; decided to
implement an open-source version of MR.
o Yahoo hired Cutting and continued and expanded
the Hadoop project.
Brief history of Hadoop (cont.)
● Today: Hadoop includes its own scheduler,
lock mechanism, many database systems,
MapReduce, a non-MapReduce parallelism
system called Spark, and more.
● Demand for “data engineers” who can
manage huge datasets using Hadoop keeps
increasing.
Summary
● We discussed the easiest way (that I know)
to use Hadoop to process large datasets.
● Hadoop provides MapReduce, which can
exploit massive parallelism by automatically
breaking up inputs and processing the
pieces separately, as long as the user
supplies map and reduce functions.
Summary (cont.)
● Your problem as a programmer is to figure
out how to write map and reduce functions
that will solve your problem. This is
sometimes really easy.
● Using Python streaming, map and reduce
are just Python scripts that read from stdin
and write to stdout - no need to learn special
Hadoop APIs or anything!
So is that all there is to MapReduce?
● If only! For more complex cases and for higher
efficiency:
o Write your jobs in Java rather than streaming scripts
o Store data in the cluster (HDFS), for capacity,
reliability, and efficiency
o Tune your application, e.g. by placing computations
near their data
o Use some of the many Hadoop components that can
make programs easier to write and more efficient
Next steps
● If you want to learn more, there are many books and
online tutorials.
o Hadoop: The Definitive Guide, by Tom White, is the
definitive guide. (You’ll need to know Java.)
● We’ll be giving a five-Saturday lecture/lab class
expanding on this meet-up starting this Saturday, and a
twelve-evening class starting August 3.
● We’ll be giving a six-week, full-time bootcamp on
Hadoop+Python starting in late August.
Running examples
● For those of you who want to run examples:
o Log in to the server per the given instructions
o Directory streaming-examples has code for stock
prices, wordcount, and word frequencies.
o In each directory, enter: source run-hadoop.sh
o Output in output/part-00000 should match file
expected-output.
o If you want to edit and re-run, you need to delete
output directories: rm -r output (and rm -r output0 in
count-freq).
Running examples (cont.)
● Please let us know if you want to continue
working on this tomorrow; we’ll leave the
accounts live until Friday if you request it.
● Some suggestions:
o Word count variants
 Ignore case
 Ignore punctuation
 Find number of words of each length
 Create sorted list of words of each length
Running examples (cont.)
● Some suggestions:
o Stock prices
 Produce both max and min increases
o Matrix-vector multiplication - you’ll be starting from
scratch on this one.
 Implement the method we described.
 Suppose the input is in the form p, q1, q2, …, qn,
i.e. a page and all of its outgoing links.
Combiners
● Obvious source of inefficiency in wordcount:
Suppose a word occurs twice on one line;
we should output one line of ‘w, 2’ instead of
two lines of ‘w, 1’.
● In fact, this applies to the entire file: Instead
of ‘w, 1’ for each occurrence of a word,
output ‘w, n’ if w occurs n times.
Combiners
● Or, to put this differently: We should apply
reduce to each file before the shuffle stage.
● Can do this by specifying a combiner
function (which in this case is just reduce).
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt \
    -output output \
    -mapper map.py \
    -reducer reduce.py -combiner reduce.py
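A related trick (not shown in the deck) is in-mapper combining: instead of relying on a separate combiner, the mapper itself counts the duplicates in its own input, e.g. with `collections.Counter`:

```python
from collections import Counter

def wc_map_combined(line):
    # Emit (w, n) once per distinct word instead of (w, 1) n times.
    return sorted(Counter(line.split()).items())

def wc_reduce(word, counts):
    # The reducer now sums partial counts rather than 1's.
    return (word, sum(counts))

wc_map_combined("to be or not to be")
# -> [("be", 2), ("not", 1), ("or", 1), ("to", 2)]
```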

More Related Content

PPTX
Introduction of Xgboost
PDF
Gradient boosting in practice: a deep dive into xgboost
PDF
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
PDF
PDF
Nyc open-data-2015-andvanced-sklearn-expanded
PDF
PPTX
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
PDF
Gradient Boosted Regression Trees in scikit-learn
Introduction of Xgboost
Gradient boosting in practice: a deep dive into xgboost
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Nyc open-data-2015-andvanced-sklearn-expanded
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Gradient Boosted Regression Trees in scikit-learn

What's hot (20)

PDF
Data mining with caret package
PDF
Ensembling & Boosting 概念介紹
PDF
XGBoost @ Fyber
PDF
Introduction of Feature Hashing
PDF
Support Vector Machines (SVM)
 
PDF
3 R Tutorial Data Structure
PDF
8. R Graphics with R
 
PDF
Multiclass Logistic Regression: Derivation and Apache Spark Examples
PDF
Data Wrangling For Kaggle Data Science Competitions
PDF
Bringing Algebraic Semantics to Mahout
PPTX
Gbm.more GBM in H2O
PDF
Vectors data frames
 
PPTX
Fundamentals of Image Processing & Computer Vision with MATLAB
PDF
Image processing
PDF
How to use SVM for data classification
PPTX
Writing Fast MATLAB Code
PDF
Mapreduce Algorithms
PPT
R studio
PDF
Build your own Convolutional Neural Network CNN
PDF
Linear models
 
Data mining with caret package
Ensembling & Boosting 概念介紹
XGBoost @ Fyber
Introduction of Feature Hashing
Support Vector Machines (SVM)
 
3 R Tutorial Data Structure
8. R Graphics with R
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Data Wrangling For Kaggle Data Science Competitions
Bringing Algebraic Semantics to Mahout
Gbm.more GBM in H2O
Vectors data frames
 
Fundamentals of Image Processing & Computer Vision with MATLAB
Image processing
How to use SVM for data classification
Writing Fast MATLAB Code
Mapreduce Algorithms
R studio
Build your own Convolutional Neural Network CNN
Linear models
 
Ad

Viewers also liked (11)

PDF
A Hybrid Recommender with Yelp Challenge Data
PDF
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
PDF
Using Machine Learning to aid Journalism at the New York Times
PDF
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
PDF
Max Kuhn's talk on R machine learning
PDF
Wikipedia: Tuned Predictions on Big Data
PDF
We're so skewed_presentation
PDF
Introducing natural language processing(NLP) with r
PDF
Bayesian models in r
PDF
Winning data science competitions, presented by Owen Zhang
PDF
Tips for data science competitions
A Hybrid Recommender with Yelp Challenge Data
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Using Machine Learning to aid Journalism at the New York Times
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Max Kuhn's talk on R machine learning
Wikipedia: Tuned Predictions on Big Data
We're so skewed_presentation
Introducing natural language processing(NLP) with r
Bayesian models in r
Winning data science competitions, presented by Owen Zhang
Tips for data science competitions
Ad

Similar to Streaming Python on Hadoop (20)

PDF
MapReduce Algorithm Design
PDF
My mapreduce1 presentation
PDF
Introduction to Machine Learning with Spark
PDF
Benchmarking Tool for Graph Algorithms
PDF
Benchmarking tool for graph algorithms
PDF
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
PDF
Machine Learning and GraphX
PPTX
Large-scale Recommendation Systems on Just a PC
PPTX
Embarrassingly/Delightfully Parallel Problems
PPT
Behm Shah Pagerank
PPTX
ch02-mapreduce.pptx
PDF
Sparse matrix computations in MapReduce
PDF
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
PPTX
CPP Homework Help
PDF
MapReduce: teoria e prática
PPT
Big Data, a space adventure - Mario Cartia - Codemotion Milan 2014
PDF
An Introduction to MapReduce
PPTX
mapreduce.pptx
PPT
Download It
PDF
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
MapReduce Algorithm Design
My mapreduce1 presentation
Introduction to Machine Learning with Spark
Benchmarking Tool for Graph Algorithms
Benchmarking tool for graph algorithms
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Machine Learning and GraphX
Large-scale Recommendation Systems on Just a PC
Embarrassingly/Delightfully Parallel Problems
Behm Shah Pagerank
ch02-mapreduce.pptx
Sparse matrix computations in MapReduce
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
CPP Homework Help
MapReduce: teoria e prática
Big Data, a space adventure - Mario Cartia - Codemotion Milan 2014
An Introduction to MapReduce
mapreduce.pptx
Download It
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf

More from Vivian S. Zhang (14)

PDF
Why NYC DSA.pdf
PPTX
Career services workshop- Roger Ren
PDF
Nycdsa wordpress guide book
PDF
Nycdsa ml conference slides march 2015
PDF
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
PDF
Natural Language Processing(SupStat Inc)
PPTX
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
PPTX
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
PPTX
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
PDF
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
PPTX
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
PPTX
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
PPTX
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
PPTX
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
Why NYC DSA.pdf
Career services workshop- Roger Ren
Nycdsa wordpress guide book
Nycdsa ml conference slides march 2015
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
Natural Language Processing(SupStat Inc)
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...

Recently uploaded (20)

PDF
Business Ethics Teaching Materials for college
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
master seminar digital applications in india
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
Basic Mud Logging Guide for educational purpose
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
Week 4 Term 3 Study Techniques revisited.pptx
PDF
Insiders guide to clinical Medicine.pdf
Business Ethics Teaching Materials for college
O7-L3 Supply Chain Operations - ICLT Program
Module 4: Burden of Disease Tutorial Slides S2 2025
Abdominal Access Techniques with Prof. Dr. R K Mishra
master seminar digital applications in india
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
human mycosis Human fungal infections are called human mycosis..pptx
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Pharmacology of Heart Failure /Pharmacotherapy of CHF
O5-L3 Freight Transport Ops (International) V1.pdf
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
Supply Chain Operations Speaking Notes -ICLT Program
STATICS OF THE RIGID BODIES Hibbelers.pdf
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
Anesthesia in Laparoscopic Surgery in India
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Basic Mud Logging Guide for educational purpose
VCE English Exam - Section C Student Revision Booklet
Week 4 Term 3 Study Techniques revisited.pptx
Insiders guide to clinical Medicine.pdf

Streaming Python on Hadoop

  • 1. 1
  • 2. Meet-up: Tackling “Big Data” with Hadoop and Python Sam Kamin, VP Data Engineering NYC Data Science Academy sam.kamin@nycdatascience.com 2
  • 3. NYC Data Science Academy ● We’re a company that does training and consulting in the Data Science area. ● I’m Sam Kamin. I just joined NYCDSA as VP of Data Engineering (a new area for us). I was formerly a professor at the U. of Illinois (CS) and a Software Engineer at Google. 3
  • 4. What this meet-up is about ● Wikipedia: “Data Science is the extraction of knowledge from large volumes of data.” ● My goal tonight: Show you how you can handle large volumes of data with simple Python programming, using the Hadoop streaming interface. 4
  • 5. Outline of talk ● Brief overview of Hadoop ● Introduction to parallelism via MapReduce ● Examples of applying MapReduce ● Implementing MapReduce in Python You can do some programming at the end if you want! 5
  • 6. Big Data: What’s the problem? Too much data! o Web contains about 5 billion web pages. According to Wikipedia, its total size in bytes is about 4 zettabytes - that’s 1021, or four thousand billion gigabytes. o Google’s datacenters store about 15 exabytes (15 x 1018 bytes). 6
  • 7. Big Data: What’s the solution? ● Parallel computing: Use multiple, cooperating computers. 7
  • 8. Parallelism ● Parallelism = dividing up a problem so that multiple computers can all work on it: o Break the data into pieces o Send the pieces to different computers for processing. o Send the results back and process the combination to get the final result. 8
  • 9. Cloud computing ● Amazon, Google, Microsoft, and many other companies operate huge clusters: Racks of (basically) off-the-shelf computers with (basically) standard network connections. ● The computers in these clusters run Linux - use them like any other computer... 9
  • 10. Cloud computing ● But: getting them to work together is really hard: o Management: machine/disk failure; efficient data placement; debugging, monitoring, logging, auditing. o Algorithms: decomposing your problem so it can be solved in parallel can be hard. That’s what Hadoop is here to help with. 10
  • 11. ● A collection of services in a cluster: o Distributed, reliable file system (HDFS) o Scheduler to run jobs in correct order, monitor, restart on failure, etc. o MapReduce to help you decompose your problem for parallel execution o A variety of other components (mostly based on MapReduce), e.g. databases, application-focused libraries 11
  • 12. How to use Hadoop ● Hadoop is open source (free!) ● It is hosted on Apache: hadoop.apache.org ● Download it and run it standalone (for debugging) ● Buy a cluster or rent time on one, e.g. AWS, GCE, Azure. (All offer some free time for new users.) 12
  • 13. MapReduce ● The main, and original, parallel-processing system of Hadoop. ● Developed by Google to simplify parallel processing. Hadoop started as an open- source implementation of Google’s idea. ● With Hadoop’s streaming interface, it’s really easy to use MapReduce in Python. 13
  • 14. MapReduce - The Big Idea ● Calculations on large data sets often have this form: Start by aggregating the data (possibly in a different order from the “natural order”), then perform a summarizing calculation on the aggregated groups. ● The idea of MapReduce: If your calculation is explicitly structured like this, it can be automatically parallelized. 14
  • 15. Computing with MapReduce A MapReduce computation has three stages: Map: A function called map is applied to each record in your input. It produces zero or more records as output, each with a key and value. Keys may be repeated. Shuffle: The output from step 1 is sorted and combined: All records with the same key are combined into one. Reduce: A function called reduce is applied to each record (key + values) from step 2 to produce the final output. As the programmer, you only write map and reduce. 15
  • 16. Computing with MapReduce 16 Input A, 7 C, 5 B, 23 B, 12 A, 18 A, [18, 7] B, [23, 12] C, [5] Outputmap reduceshuffle Note: map is record-oriented, meaning the output of the map stage is strictly a combination of the outputs from each record. That allows us to calculate in parallel...
  • 17. Parallelism via MapReduce 17 Input A, [18, 7] B, [23, 12] C, [5] map reduce Because map and reduce are record-oriented, MR can divide inputs into arbitrary chunks: map map map reduce reduce reduce Output Output Output Outputdistribute data distribute data combine/ shuffle
  • 18. MapReduce example: Stock prices ● Input: list of daily opening and closing prices for thousands of stocks over thousands of days. ● Desired output: The biggest-ever one-day percentage price increase for each stock. ● Solution using MR: o map: (stock, open, close) => (stock, (close - open) / open) (if pos) o reduce: (stock, [%c0, %c1, …]) => (stock, max [%c0, %c1, …]). 18
  • 19. MapReduce example - map Goog, 230, 240 Apple, 100, 98 MS, 300, 250 MS, 250, 260 MS, 270, 280 Goog, 220, 215 Goog, 300, 350 IBM, 80, 90 IBM, 90, 85 Goog, 4.3% MS, 4% MS, 3.7% Goog, 16.6% IBM, 12.5% map You supply map: Output stock with % increase, or nothing if decrease. 19
  • 20. MapReduce example - shuffle/sort Goog, 4.3% MS, 4% MS, 3.7% Goog, 16.6% IBM, 12.5% shuffle /sort Goog, [4.3%, 16.6%] IBM, [12.5%] MS, [3.7%, 4%] Goog, 4.3% MS, 4% MS, 3.7% Goog, 16.6% IBM, 12.5% MapReduce supplies shuffle/sort: Combine all records for each stock. 20
  • 21. MapReduce example - reduce reduceGoog, [4.3%, 16.6%] IBM, [12.5%] MS, [3.7%, 4%] Goog, 16.6% IBM, 12.5% MS, 4% You supply reduce: Output max of percentages for each input record. 21
  • 22. Wait, why did that help? I could have just written a loop to read every line and put the percentages in a table! ● Suppose you have a terabyte of data, and 1000 computers in your cluster. ● MapReduce can automatically split the data into 1000 1GB chunks. You write two simple functions and get a 1000x speed-up! 22
  • 23. Modelling problems using MR ● We’re going to look at a variety of problems and see how we can fit them into the MR structure. ● The question for each problem is: What are the types of map and reduce, and what do they do? 23
  • 24. Example: Word count Input: Lines of text. Desired output: # of occurrences of each word (i.e. each sequence of non-space chars) E.g. Input: Roses are red, violets are blue Output: are, 2 blue, 1 red, 1 etc. 24
  • 25. Example: Word count Solution: ● map: “w1 w2 … wk” → w1, 1 w2, 1 ... wk, 1 ● reduce: (w, [1, 1, …]) → (w, n) n 1’s 25
  • 26. Example: Word count frequency Input: Output of word count Desired output: For any number of occurrences c, the number of different words that occur c times. E.g. Input: Roses are red, violets are blue Output: 1, 4 2, 1 26
  • 27. Example: Word count frequency Solution: ● map: w, c → c, 1 ● reduce: (c, [1, 1, …]) → (c, n) n 1’s 27
  • 28. Example: Page Rank ● Famous algorithm used by Google to rank pages. (Comes down to matrix-vector multiplication, as we’ll see…) ● Based on two ideas: o Importance of a page depends upon how many pages link to it. o However, if a page has lots of links going out, the value of each link is reduced. 28
  • 29. Example: Page Rank With those two ideas, calculate rank of page: Note: Because the web has cycles - page p can have a link to page q, which has a link to p - this formula requires an iterative solution. pagerank(p) = Σq→p 29 pagerank(q) out-degree(q)
  • 30. Example: Page Rank Consider pages and their links as a graph (page A has links to B, C, and D, etc.): 30 pr(A) = pr(B)/2 + pr(D)/2 pr(B) = pr(A)/3 + pr(D)/2 pr(C) = pr(A)/3 + pr(B)/2 pr(D) = pr(A)/3 + pr(C)
  • 31. Example: Page Rank ● Represent the graph as a weighted adjacency matrix: 31 0 1/2 0 1/2 1/3 0 0 1/2 1/3 1/2 0 0 1/3 0 1 0 M = links to links from A B C D B DA C
  • 32. Example: Page Rank ● Now, if we put the page rank of each page in a vector v, then multiplying M by v calculates the pagerank formula for all nodes: 32 0 1/2 0 1/2 1/3 0 0 1/2 1/3 1/2 0 0 1/3 0 1 0 pr(A) pr(B) pr(C) pr(D) pr(B)/2 + pr(D)/2 pr(A)/3 + pr(D)/2 pr(A)/3 + pr(B)/2 pr(A)/3 + pr(C) X =
  • 33. Example: Page Rank ● So, to calculate page ranks, start with an initial guess of all page ranks and multiply. ● After one multiplication: 33 0 1/2 0 1/2 1/3 0 0 1/2 1/3 1/2 0 0 1/3 0 1 0 1/4 1/4 1/4 1/4 1/4 5/24 5/24 1/3 X =
  • 34. Example: Page Rank ● After two multiplications: M × (1/4, 5/24, 5/24, 1/3)ᵀ = (13/48, 1/4, 3/16, 7/24)ᵀ ≈ (.27, .25, .19, .29)ᵀ 34
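These two iterations are easy to check in plain Python. Here is a minimal sketch (the helper name `mat_vec` is mine, not from the slides) that uses exact fractions so no rounding creeps in:

```python
#!/usr/bin/env python
# Check the slides' two PageRank iterations exactly, with no rounding.
from fractions import Fraction as F

# The weighted adjacency matrix from the slides: M[i][j] = 1/out-degree(j)
# if page j links to page i, else 0.  Pages are ordered A, B, C, D.
M = [
    [F(0),    F(1, 2), F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
    [F(1, 3), F(0),    F(1), F(0)],
]

def mat_vec(M, v):
    """Ordinary matrix-vector product: one pagerank iteration."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

v = [F(1, 4)] * 4      # initial guess: all ranks equal
v1 = mat_vec(M, v)     # -> [1/4, 5/24, 5/24, 1/3]
v2 = mat_vec(M, v1)    # -> [13/48, 1/4, 3/16, 7/24]
```

Note that the ranks always sum to 1, since each column of M sums to 1.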
  • 35. Example: Page Rank ● Thus, page rank = matrix-vector product. ● Can we express matrix-vector multiplication as a MapReduce? o Assume v is copied (magically) to each node. o M, being much bigger, needs to be partitioned, i.e. M is the main input file. o How shall we represent M and define map and reduce? 35
  • 36. Example: Page Rank ● A solution: o Represent M using one record for each link: (p, q, out-degree(p)) for every link p→q. o map: (p, q, d) ↦ (q, v[p]/d) reduce: p, [c1, c2, …] ↦ p, c1+c2+... 36
  • 37. MapReduce: Summary ● Nowadays, MapReduce powers the internet: o Google, Amazon, and Facebook use it extensively for everything from page ranking to error log analysis. o The NIH uses it to analyze gene sequences. o NASA uses it to analyze data from probes. o etc., etc. ● Next question: How can we implement a MapReduce? 37
  • 38. Writing map and reduce in Python ● Easy using the streaming interface: o map and reduce : stdin → stdout. Each should iterate over stdin and output result for each line. o Inputs and outputs are text files. In map and reduce output, tab character separates key from value. o Shuffle just sorts the files on the key.  Instead of a line with a key and list of values, we get consecutive lines with the same key. 38
  • 39. Example: stock prices ● Recall the output of the shuffle stage: Goog, [4.3%, 16.6%] IBM, [12.5%] MS, [3.7%, 4%] ● The only difference is that this becomes consecutive tab-separated lines: Goog 4.3% Goog 16.6% IBM 12.5% MS 3.7% MS 4% 39
  • 40. Example: stock prices ● On the next two slides, we show the map and reduce functions in Python. ● Both of them are just stand-alone programs that read stdin and write stdout. ● In fact, we can test our pipeline without using MapReduce: cat input-file | ./map.py | sort | ./reduce.py 40
  • 41. Example: stock prices - map.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    record = line.split(",")
    opening = float(record[1])
    closing = float(record[2])
    if closing > opening:
        change = (closing - opening) / opening
        print("%s\t%s" % (record[0], change))
41
  • 42. Example: stock prices - reduce.py
#!/usr/bin/env python
import sys

stock = None
max_increase = 0
for line in sys.stdin:
    next_stock, increase = line.split('\t')
    increase = float(increase)
    if next_stock == stock:
        # another line for the same stock
        if increase > max_increase:
            max_increase = increase
    else:
        # new stock; output result for previous stock
        if stock:  # only false before the very first stock
            print("%s\t%f" % (stock, max_increase))
        stock = next_stock
        max_increase = increase
# print the last stock
if stock:
    print("%s\t%f" % (stock, max_increase))
42
  • 43. Invoking Hadoop ● Now we just have to run Hadoop. (Here we are running locally. To run in a cluster, you need to move the data into HDFS first.) If you want to run code on our servers, I’ll give instructions at the end of the talk. 43 hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input input.txt -output output -mapper map.py -reducer reduce.py
  • 44. Brief history of Hadoop ● 2004: Two engineers from Google published a paper on MapReduce o Doug Cutting was working on an open-source web crawler; saw that MapReduce solved his biggest problem: coordinating lots of computers; decided to implement an open-source version of MR. o Yahoo hired Cutting and continued and expanded the Hadoop project. 44
  • 45. Brief history of Hadoop (cont.) ● Today: Hadoop includes its own scheduler, lock mechanism, many database systems, MapReduce, a non-MapReduce parallelism system called Spark, and more. ● Demand for “data engineers” who can manage huge datasets using Hadoop keeps increasing. 45
  • 46. Summary ● We discussed the easiest way (that I know) to use Hadoop to process large datasets. ● Hadoop provides MapReduce, which can exploit massive parallelism by automatically breaking up inputs and processing the pieces separately, as long as the user supplies map and reduce functions. 46
  • 47. Summary (cont.) ● Your problem as a programmer is to figure out how to write map and reduce functions that will solve your problem. This is sometimes really easy. ● Using Python streaming, map and reduce are just Python scripts that read from stdin and write to stdout - no need to learn special Hadoop APIs or anything! 47
  • 48. So is that all there is to MapReduce? ● If only! For more complex cases and for higher efficiency: o Use Java for higher efficiency o Store data in the cluster, for capacity, reliability, and efficiency o Tune your application for higher efficiency, e.g. placing computations near data o Use some of many Hadoop components that can make programs easier to write and more efficient 48
  • 49. Next steps ● If you want to learn more, there are many books and online tutorials. o Hadoop: The Definitive Guide, by Tom White, is the definitive guide. (You’ll need to know Java.) ● We’ll be giving a five-Saturday lecture/lab class expanding on this meet-up starting this Saturday, and a twelve-evening class starting August 3. ● We’ll be giving a six-week, full-time bootcamp on Hadoop+Python starting in late August. 49
  • 50. Running examples ● For those of you who want to run examples: o Login to server per given instructions o Directory streaming-examples has code for stock prices, wordcount, and word frequencies. o In each directory, enter: source run-hadoop.sh o Output in output/part-00000 should match file expected-output. o If you want to edit and re-run, you need to delete output directories: rm -r output (and rm -r output0 in count-freq). 50
  • 51. Running examples (cont.) ● Please let us know if you want to continue working on this tomorrow; we’ll leave the accounts live until Friday if you request it. ● Some suggestions: o Word count variants  Ignore case  Ignore punctuation  Find number of words of each length  Create sorted list of words of each length 51
  • 52. Running examples (cont.) ● Some suggestions: o Stock prices  Produce both max and min increases o Matrix-vector multiplication - you’ll be starting from scratch on this one.  Implement the method we described.  Suppose the input is in the form p, q1, q2, …, qn, i.e. a page and all of its outgoing links. 52
  • 53. Combiners ● Obvious source of inefficiency in wordcount: Suppose a word occurs twice on one line; we should output one line of ‘w, 2’ instead of two lines of ‘w, 1’. ● In fact, this applies to the entire file: Instead of ‘w, 1’ for each occurrence of a word, output ‘w, n’ if w occurs n times. 53
  • 54. Combiners ● Or, to put this differently: We should apply reduce to each file before the shuffle stage. ● Can do this by specifying a combiner function (which in this case is just reduce). 54 hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input input.txt -output output -mapper map.py -reducer reduce.py -combiner reduce.py
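For word count, the effect of a combiner can be sketched with "in-mapper combining", a close cousin of Hadoop's -combiner flag: a Counter collapses duplicates within a mapper's split before anything is emitted (the function name is hypothetical):

```python
#!/usr/bin/env python
# Sketch of combining in word count: emit 'w, n' per mapper split
# instead of n separate 'w, 1' lines, shrinking the shuffle.
from collections import Counter

def map_with_combiner(lines):
    """Count words locally across the mapper's whole split, then emit once."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sorted(counts.items())

if __name__ == "__main__":
    for word, n in map_with_combiner(["the cat and the hat"]):
        print("%s\t%d" % (word, n))
```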