Apache Spark
CS240A
Winter 2016, T. Yang
Some slides are based on P. Wendell's Spark slides
Parallel Processing using Spark+Hadoop
• Hadoop: distributed file system that connects machines.
• MapReduce: parallel programming style built on a Hadoop cluster.
• Spark: Berkeley's design of the MapReduce programming model.
• Given a file treated as a big list:
 – A file may be divided into multiple parts (splits).
• Each record (line) is processed by a Map function,
 – which produces a set of intermediate key/value pairs.
• Reduce: combine the set of values for the same key.
Python Examples and List Comprehension
>>> lst = [3, 1, 4, 1, 5]
>>> lst[0]                  # => 3
>>> len(lst)                # => 5
>>> lst.append(2)
>>> lst.sort()
>>> lst.insert(4, "Hello")
>>> [1] + [2]               # => [1, 2]

Python tuples
>>> num = (1, 2, 3, 4)
>>> num + (5,)              # => (1, 2, 3, 4, 5)

for i in [5, 4, 3, 2, 1]:
    print(i)
print('Blastoff!')

>>> S = [x**2 for x in range(10)]    # => [0, 1, 4, 9, 16, ..., 81]
>>> M = [x for x in S if x % 2 == 0]

>>> words = 'hello lazy dog'.split()
>>> stuff = [(w.upper(), len(w)) for w in words]
# => [('HELLO', 5), ('LAZY', 4), ('DOG', 3)]

>>> words = 'The quick brown fox jumps over the lazy dog'.split()

>>> numset = set([1, 2, 3, 2])       # duplicate entries are removed
>>> numset = frozenset([1, 2, 3])    # a frozenset cannot be modified
Python map/reduce
a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]
f = lambda x: len(x)
L = map(f, [a, b, c])        # => [3, 4, 5]
g = lambda x, y: x + y
reduce(g, [47, 11, 42, 13])  # => 113
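Note: the example above assumes Python 2, where map returns a list and reduce is a builtin. A rough Python 3 equivalent:

from functools import reduce            # reduce moved to functools in Python 3

L = list(map(f, [a, b, c]))             # map returns an iterator, so wrap it in list()
total = reduce(g, [47, 11, 42, 13])     # => 113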
MapReduce programming with Spark: key concept
RDD: Resilient Distributed Datasets
• Like a big list: collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Make sure input/output match
Write programs in terms of operations on implicitly distributed datasets (RDDs).
[Diagram: a chain of RDDs built through successive transformations]
MapReduce vs Spark
• MapReduce: map and reduce tasks operate on key-value pairs.
• Spark: operates on RDDs.
[Diagram: a chain of RDDs transformed one into the next]
Language Support
• Standalone programs: Python, Scala, & Java
• Interactive shells: Python & Scala
• Performance: Java & Scala are faster due to static typing, but Python is often fine
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Spark Context and Creating RDDs
# Start with sc: the SparkContext is the main entry point to Spark functionality
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load a text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
Spark Architecture
[Figure: Spark architecture diagram]
Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x)             # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Read a text file and count the number of lines containing ERROR
> lines = sc.textFile("file.log")
> lines.filter(lambda s: "ERROR" in s).count()
Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect() # => [1, 2, 3]
# Return first K elements
> nums.take(2) # => [1, 2]
# Count number of elements
> nums.count() # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations
operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
Some Key-Value Operations
> pets = sc.parallelize(
      [("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}
> pets.groupByKey()   # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()    # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements
combiners on the map side
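As a rough illustration (a sketch using the pets RDD above), the same per-key sum can be written either way; reduceByKey combines values per key on each map-side partition before the shuffle, while groupByKey ships every value across the network first:

pets.reduceByKey(lambda x, y: x + y).collect()   # partial sums computed before the shuffle
pets.groupByKey().mapValues(sum).collect()       # all values shuffled, then summed
# both yield ('cat', 3) and ('dog', 1); output order may vary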
Example: Word Count

> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda x, y: x + y)

Input lines:        "to be or"             "not to be"
After flatMap:      to, be, or             not, to, be
After map:          (to,1) (be,1) (or,1)   (not,1) (to,1) (be,1)
After shuffle:      (be,1)(be,1)   (not,1)   (or,1)   (to,1)(to,1)
After reduceByKey:  (be,2)   (not,1)   (or,1)   (to,2)
Other Key-Value Operations
> visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                            ("about.html", "3.4.5.6"),
                            ("index.html", "1.3.3.1") ])
> pageNames = sc.parallelize([ ("index.html", "Home"),
                               ("about.html", "About") ])
> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))
> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
Under The Hood: DAG Scheduler
• General task graphs
• Automatically pipelines functions
• Data-locality aware
• Partitioning aware, to avoid shuffles
[Figure: example task DAG with RDDs A..F combined by map, filter, groupBy, and join; the graph is split into Stage 1, Stage 2, and Stage 3; legend marks cached partitions and RDDs]
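One rough way to see the resulting stages (a sketch; the exact output format of toDebugString varies across Spark versions) is to dump an RDD's lineage:

pairs = sc.textFile("file.log") \
          .map(lambda line: (line.split(" ")[0], 1)) \
          .reduceByKey(lambda x, y: x + y)
# The map is pipelined into the same stage as the read; reduceByKey adds a shuffle boundary.
print(pairs.toDebugString())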
Setting the Level of Parallelism
All the pair RDD operations take an optional second
parameter for number of tasks
> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
More RDD Operators
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...
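A few of these in action (a rough sketch; the exact elements returned by sample depend on the random seed):

nums  = sc.parallelize([1, 2, 3, 4, 5])
other = sc.parallelize([6, 7, 8, 9, 10])

nums.union(other).count()           # => 10
nums.take(3)                        # => [1, 2, 3]
nums.first()                        # => 1
nums.sample(False, 0.4).collect()   # roughly 40% of the elements, without replacement
nums.zip(other).collect()           # => [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]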
Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster…
• OR can run locally
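For example, from a Spark installation directory the shells can typically be launched like this (a sketch; flags vary a bit across Spark versions, and the cluster URL below is illustrative):

./bin/pyspark                              # Python shell, local mode by default
./bin/spark-shell --master local[4]        # Scala shell using 4 local cores
./bin/pyspark --master spark://host:7077   # attach to an existing standalone cluster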
… or a Standalone Application

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])
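If the script above were saved as, say, wordcount.py (an illustrative name), it could be launched with spark-submit, roughly:

spark-submit wordcount.py input.txt output_dir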
Create a SparkContext

Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Java:
import org.apache.spark.api.java.JavaSparkContext;
JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:
from pyspark import SparkContext
sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

Constructor arguments: the cluster URL (or local / local[N]), the app name, the Spark install path on the cluster, and a list of JARs/files with app code to ship to the cluster.
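In later Spark versions the same setup is usually written with SparkConf instead of the positional constructor (a minimal sketch):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("WordCount")
sc = SparkContext(conf=conf)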
Administrative GUIs
http://<Standalone Master>:8080
(by default)
EXAMPLE APPLICATION: PAGERANK
Google PageRank
Give pages ranks (scores) based on links to them
• Links from many pages → high rank
• Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
PageRank (one definition)
Model page reputation on the web:

  PR(x) = (1 - d) + d * sum_{i=1..n} PR(t_i) / C(t_i)

• t_1, ..., t_n are the parents of page x (the pages linking to x).
• PR(x) is the PageRank of page x.
• C(t) is the out-degree of t.
• d is a damping factor.
[Figure: small example graph annotated with rank/contribution values 0.4, 0.4, 0.2, 0.2, 0.2, 0.2, 0.4]
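For example (a made-up case with d = 0.85): if page x has two parents t1 and t2 with PR(t1) = 1.0, C(t1) = 2 and PR(t2) = 0.5, C(t2) = 1, then PR(x) = (1 - 0.85) + 0.85 * (1.0/2 + 0.5/1) = 0.15 + 0.85 * 1.0 = 1.0.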
Computing PageRank Iteratively
• The effect of each iteration is local: the (i+1)-th iteration depends only on the i-th iteration.
• At iteration i, the PageRank of individual nodes can be computed independently.
PageRank using MapReduce
Map: distribute PageRank "credit" to link targets
Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value
Iterate until convergence
Source of image: Lin 2008
Algorithm demo
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Figure: 4-page example graph; every page starts with rank 1.0]
[Figure: iteration 1; each page sends rank/|outdegree| (0.5 or 1) along each out-link]
[Figure: ranks after iteration 1: 0.58, 1.0, 1.85, 0.58]
[Figure: iteration 2; contributions are sent again from the updated ranks]
[Figure: ranks after iteration 2: 0.39, 1.72, 1.31, 0.58]
. . .
[Figure: final state; ranks converge to 0.46, 1.37, 1.44, 0.73]
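A minimal PySpark sketch of this loop (a hypothetical 3-page graph; the full Spark example also caches the links RDD and handles dangling pages more carefully):

links = sc.parallelize([("a", ["b", "c"]), ("b", ["a", "c"]), ("c", ["a"])])
ranks = links.mapValues(lambda neighbors: 1.0)           # 1. start every page at rank 1

for i in range(10):
    # 2. each page contributes rank / outdegree to each of its neighbors
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    # 3. new rank = 0.15 + 0.85 * (sum of received contributions)
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())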
HW: SimplePageRank
Random surfer model behind the algorithm:
• Stay on the page: 0.05 * weight
• Randomly follow a link: pass 0.85 * weight / out-degree to each child
 – If a node has no children, give that portion to the other nodes evenly.
• Randomly go to another page: 0.10
 – Meaning: each node contributes 10% of its weight to the others, who share it evenly. Repeating this for everybody, and since the sum of all weights is num-nodes, each node receives 10% * num-nodes / num-nodes = 0.1.
R(x) = 0.1 + 0.05 * R(x) + incoming contributions
Initial weight is 1 for everybody.
One update on a 4-node example. Entry (To, From) is the weight node "To" receives from node "From"; each row sums to that node's new weight.

To \ From      0      1      2      3    Random factor   New weight
    0        0.05   0.283   0.0   0.283      0.10          0.716
    1        0.425  0.05    0.0   0.283      0.10          0.858
    2        0.425  0.283   0.05  0.283      0.10          1.141
    3        0.00   0.283   0.85  0.05       0.10          1.283
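A small Python sketch of one update step under this model (not the HW's actual data structures; it uses the link structure implied by the table above, 0→{1,2}, 1→{0,2,3}, 2→{3}, 3→{0,1,2}):

# graph: node -> list of children (out-links); weights start at 1.0 for everybody
graph = {0: [1, 2], 1: [0, 2, 3], 2: [3], 3: [0, 1, 2]}
weights = {node: 1.0 for node in graph}

def one_iteration(graph, weights):
    # random jump (0.10) + stay on the page (0.05 * own weight)
    new = {x: 0.10 + 0.05 * weights[x] for x in graph}
    for x, children in graph.items():
        # a node with no children spreads its 0.85 share over all other nodes
        targets = children if children else [y for y in graph if y != x]
        for child in targets:
            new[child] += 0.85 * weights[x] / len(targets)
    return new

print(one_iteration(graph, weights))
# => {0: 0.716..., 1: 0.858..., 2: 1.141..., 3: 1.283...}  (matches the table's "New weight" column)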
Data structure in SimplePageRank
Editor's Notes
• #1: Preamble: Excited to kick off the first day of training. This first tutorial is about using Spark Core. We've got a curriculum jam-packed with material, so let's go ahead and get started.
• #5: RDDs: colloquially referred to as RDDs (e.g. caching in RAM). Lazy operations build RDDs from other RDDs; actions return a result or write it to storage.
• #6: (same note as #5)
• #9: NOT a modified version of Hadoop
• #10: NOT a modified version of Hadoop
• #11: All lazy
• #12: Launch computations
• #17: NOT a modified version of Hadoop
• #20: The barrier to entry for working with the Spark API is minimal
• #22: Mention the assembly JAR in Maven etc.; use the shell with your own JAR
• #24: To skip to the next section, go to slide 51