Data science bootcamp day 3

Presented by: Chetan Khatri, Volunteer Teaching Assistant,
Data Science lab, University of Kachchh
Guidance by: Prof. Devji D. Chhanga, University of Kachchh.

Agenda
An Introduction to Apache Spark
Apache Spark single node configuration
MapReduce Program on Spark Cluster
An Introduction to Apache Kafka
Apache Kafka single node configuration
Create Topic, Push Messages to Topic

Spark Terminology
» Spark and SQL contexts: a Spark program first creates a SparkContext object
» SparkContext tells Spark how and where to access a cluster
» The program next creates a sqlContext object
» Use sqlContext to create DataFrames
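
The sequence above can be sketched as a small standalone program. This is a minimal sketch assuming the Spark 1.x API used in these slides (SparkContext plus SQLContext); the application name, the local[*] master, and the sample rows are illustrative assumptions. In spark-shell the SparkContext is already available as sc, and in Spark 2.x and later SparkSession is the usual entry point.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ContextsSketch {
  def main(args: Array[String]): Unit = {
    // SparkConf + SparkContext tell Spark how and where to reach a cluster.
    // "local[*]" (an assumption here) runs everything inside this JVM.
    val conf = new SparkConf().setAppName("ContextsSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The SQLContext is built on top of the SparkContext.
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables .toDF on RDDs and local Seqs

    // Use the SQL context's implicits to create a DataFrame (sample rows are made up).
    val df = sc.parallelize(Seq(("spark", 1), ("kafka", 2))).toDF("name", "id")
    df.show()

    sc.stop()
  }
}
```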

Review: DataFrames
The primary abstraction in Spark
» Immutable once constructed.
» Track lineage information to efficiently recompute lost data.
» Enable operations on collection of elements in parallel.
You construct DataFrames
» by parallelizing existing Scala collections (lists)
» by transforming existing Spark DataFrames
» from files in HDFS or any other storage system
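
The three construction routes above might look like this in spark-shell. This is a sketch that assumes sc and sqlContext are already defined by the shell (Spark 1.x style); the column names and sample rows are invented, and a temporary local JSON file stands in for a file in HDFS.

```scala
import java.nio.file.Files
import sqlContext.implicits._

// 1) From an existing Scala collection.
val fromList = Seq(("spark", 3), ("kafka", 1)).toDF("word", "count")

// 2) By transforming an existing DataFrame. fromList itself is unchanged,
//    because DataFrames are immutable; filter returns a new DataFrame.
val frequent = fromList.filter($"count" > 1)

// 3) From a file. A temp file is written here only so the example is
//    self-contained; an hdfs:// URI would be passed the same way.
val tmp = Files.createTempFile("words", ".json")
Files.write(tmp, """{"word":"spark","count":3}""".getBytes("UTF-8"))
val fromFile = sqlContext.read.json(tmp.toUri.toString)

fromFile.printSchema()
```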

Review: DataFrames
Two types of operations: transformations and actions.
Transformations are lazy (not computed immediately).
A transformed DataFrame is computed only when an action runs on it.
Persist (cache) DataFrames in memory or on disk.
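
A short spark-shell sketch of lazy transformations, caching, and actions, assuming sc and sqlContext are provided by the shell; the data and column names are made up for illustration.

```scala
import sqlContext.implicits._

// Ten made-up rows: (n, n squared).
val squares = (1 to 10).map(i => (i, i * i)).toDF("n", "square")

// Transformation: only recorded in the query plan, nothing is computed yet.
val evens = squares.filter($"n" % 2 === 0)

// Marks the DataFrame for caching; this call is lazy as well.
evens.cache()

// Actions: the first one triggers the computation and fills the cache,
// the second one can reuse the cached rows.
val howMany = evens.count() // 5 rows: n = 2, 4, 6, 8, 10
evens.show()
```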

Resilient Distributed Datasets
Untyped Spark abstraction underneath DataFrames:
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel
You construct RDDs
» by parallelizing existing Scala collections (lists)
» by transforming existing RDDs or DataFrames
» from files in HDFS or any other storage system
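
The same three routes for RDDs, sketched for spark-shell under the assumption that sc and sqlContext are predefined. The sample values and the file path are hypothetical.

```scala
import sqlContext.implicits._

// 1) By parallelizing an existing Scala collection.
val fromList = sc.parallelize(List(1, 2, 3, 4))

// 2a) By transforming an existing RDD (map returns a new RDD).
val doubled = fromList.map(_ * 2)
val result = doubled.collect() // Array(2, 4, 6, 8)

// 2b) From an existing DataFrame: .rdd exposes the underlying RDD of Row objects.
val df = Seq(("spark", 3), ("kafka", 1)).toDF("word", "count")
val rows = df.rdd

// 3) From a file. The path is a placeholder; textFile is lazy, so the file
//    is not read until an action is called on `lines`.
val lines = sc.textFile("file:///path/to/input.txt")
```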

When to use DataFrames?
Need high-level transformations and actions, and want high-level
control over your dataset.
Have typed (structured or semi-structured) data.
You want DataFrame optimization and performance benefits
» Catalyst Optimization Engine
• 75% reduction in execution time
» Project Tungsten off-heap memory management
• 75+% reduction in memory usage (less GC)
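
One way to see what Catalyst does with a query is to print its plans. The sketch below assumes a spark-shell with sqlContext available and uses invented sample data; it only displays the plans and does not measure the execution-time or memory figures quoted above.

```scala
import sqlContext.implicits._

val words = Seq(("spark", 3), ("kafka", 1), ("hadoop", 2)).toDF("word", "count")

// A high-level query: keep rows with count > 1, then project one column.
val query = words.filter($"count" > 1).select("word")

// explain(true) prints the logical plans and the physical plan chosen
// by the optimizer, without running the query.
query.explain(true)
```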

Apache Spark MapReduce
1) Start Apache Spark Shell
./bin/spark-shell
2) Let's Read the text file
scala> val textFile = sc.textFile("file:///home/chetan306/inputfile.txt")
3) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s
start with a few actions:
scala> textFile.count()
scala> textFile.first()
4) Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file.
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Get transformation output.
linesWithSpark.collect()

Apache Spark MapReduce
5) We can chain together transformations and actions:
textFile.filter(line => line.contains("Spark")).count()
6) One common data flow pattern is MapReduce, as popularized by Hadoop. Spark
can implement MapReduce flows easily:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
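
To see what each stage of the word count contributes, the same flow can be traced on a plain Scala collection. This is a local stand-in, not Spark code: groupBy plus a sum plays the role of reduceByKey, and the two input lines are made up.

```scala
// Two sample lines standing in for the contents of the input file.
val lines = Seq("spark is fast", "kafka feeds spark")

// Map phase: split lines into words, then emit a (word, 1) pair per word.
val pairs = lines.flatMap(line => line.split(" ")).map(word => (word, 1))

// Shuffle + reduce phase: group the pairs by word and sum the ones.
val counts = pairs
  .groupBy { case (word, _) => word }
  .map { case (word, ps) => (word, ps.map(_._2).sum) }

println(counts("spark")) // 2
println(counts("kafka")) // 1
```

On a cluster, reduceByKey does the grouping and summing across partitions, combining partial counts on each node before shuffling them.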
