Data Science Bootcamp Day-3
Presented by: Chetan Khatri, Volunteer Teaching Assistant,
Data Science lab, University of Kachchh
Guidance by: Prof. Devji D. Chhanga, University of Kachchh.
Agenda
An Introduction to Apache Spark
Apache Spark single node configuration
MapReduce Program on Spark Cluster
An Introduction to Apache Kafka
Apache Kafka single node configuration
Create Topic, Push Messages to Topic
Spark Terminology
» Spark and SQL contexts: a Spark program first creates a SparkContext object
» SparkContext tells Spark how and where to access a cluster
» The program next creates a sqlContext object
» Use sqlContext to create DataFrames
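The two contexts can be sketched as below. This is a minimal sketch, not the presenter's code: the object name, the app name "bootcamp-day3" and the "local[*]" master are assumptions chosen for a single-node setup. In spark-shell the `sc` object is already created for you, and in Spark 2.0 and later `SparkSession` is the usual entry point, where the `SQLContext` constructor is deprecated.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ContextsSketch {
  // Assumed values: a hypothetical app name, and "local[*]" to run
  // Spark in-process on all local cores (no cluster required).
  def buildConf(): SparkConf =
    new SparkConf().setAppName("bootcamp-day3").setMaster("local[*]")

  def main(args: Array[String]): Unit = {
    // SparkContext tells Spark how and where to access a cluster.
    val sc = new SparkContext(buildConf())
    // The SQLContext wraps the SparkContext and is used to create DataFrames.
    val sqlContext = new SQLContext(sc)
    println(s"master = ${sc.master}, app = ${sqlContext.sparkContext.appName}")
    sc.stop()
  }
}
```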
Review: DataFrames
The primary abstraction in Spark
» Immutable once constructed.
» Track lineage information to efficiently recompute lost data.
» Enable operations on collection of elements in parallel.
You construct DataFrames
» by parallelizing existing Scala collections (lists)
» by transforming existing Spark DFs
» from files in HDFS or any other storage system
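The three ways of constructing a DataFrame might look like the sketch below. The object name, the column names ("word", "count"), the sample rows and the JSON input format are all assumptions made for illustration. The `path` argument is a placeholder for a file you supply.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object DataFrameConstruction {
  // 1) From an existing Scala collection (sample rows are made up).
  def fromCollection(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._ // brings toDF into scope
    Seq(("spark", 3), ("kafka", 1)).toDF("word", "count")
  }

  // 2) By transforming an existing DataFrame; the input is left unchanged.
  def frequent(df: DataFrame): DataFrame =
    df.filter(df("count") > 1)

  // 3) From a file in HDFS or another storage system (JSON assumed here).
  def fromJson(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read.json(path)
}
```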
Review: DataFrames
Two types of operations: transformations and actions.
Transformations are lazy (not computed immediately).
A transformed DF is executed only when an action runs on it.
Persist (cache) DFs in memory or on disk.
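A small sketch of lazy evaluation and caching follows, assuming a hypothetical two-column DataFrame built from made-up rows. The filter only records a plan, and nothing runs until the `count()` action is called.

```scala
import org.apache.spark.sql.SQLContext

object LazyEvaluationSketch {
  def countEvens(sqlContext: SQLContext): Long = {
    import sqlContext.implicits._
    // Sample data is invented for the example.
    val df = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")).toDF("n", "label")

    // Transformation: lazy, returns a new DataFrame without computing it.
    val evens = df.filter(df("n") % 2 === 0)

    // Marks the DataFrame for caching; this is lazy too.
    evens.cache()

    // Action: triggers execution and fills the cache as a side effect.
    evens.count()
  }
}
```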
Resilient Distributed Datasets
Untyped Spark abstraction underneath DataFrames:
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel
You construct RDDs
» by parallelizing existing Scala collections (lists)
» by transforming existing RDDs or DataFrames
» from files in HDFS or any other storage system
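The same three construction routes for RDDs, as a hedged sketch. The object and method names are illustrative, and `path` stands for whatever file you want to read.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object RddConstruction {
  // 1) By parallelizing an existing Scala collection.
  def fromCollection(sc: SparkContext): RDD[Int] =
    sc.parallelize(List(1, 2, 3, 4))

  // 2a) By transforming an existing RDD.
  def squares(rdd: RDD[Int]): RDD[Int] =
    rdd.map(n => n * n)

  // 2b) By taking the RDD of rows underneath a DataFrame.
  def fromDataFrame(df: DataFrame): RDD[Row] =
    df.rdd

  // 3) From a text file in HDFS or another storage system.
  def fromFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)
}
```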
When to use DataFrames?
You need high-level transformations and actions, and want high-level control over your dataset.
You have typed (structured or semi-structured) data.
You want DataFrame optimization and performance benefits
» Catalyst Optimization Engine
• 75% reduction in execution time
» Project Tungsten off-heap memory management
• 75+% reduction in memory usage (less GC)
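One way to see what the Catalyst optimizer does with a query is to print its plans. The sketch below assumes a DataFrame with hypothetical "word" and "count" columns. It only displays the plans and does not measure the reductions quoted above.

```scala
import org.apache.spark.sql.DataFrame

object PlanInspection {
  // Assumes df has columns named "word" and "count" (illustrative names).
  def showPlans(df: DataFrame): Unit = {
    val query = df.filter(df("count") > 1).select("word")
    // explain(true) prints the parsed, analyzed and optimized logical plans
    // followed by the physical plan, without running the query.
    query.explain(true)
  }
}
```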
Apache Spark MapReduce
1) Start Apache Spark Shell
./bin/spark-shell
2) Read the text file
scala> val textFile = sc.textFile("file:///home/chetan306/inputfile.txt")
3) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:
scala> textFile.count()
scala> textFile.first()
4) Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Get transformation output.
linesWithSpark.collect()
Apache Spark MapReduce
5) We can chain together transformations and actions:
textFile.filter(line => line.contains("Spark")).count()
6) One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
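To check the word-count logic without a Spark installation, the same flatMap / map / reduce-by-key flow can be mirrored on plain Scala collections. This is an illustrative stand-in, not Spark code: `groupBy` followed by a per-key `reduce` plays the role of `reduceByKey`, and the object name is made up.

```scala
object LocalWordCount {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))          // split each line into words
      .map(word => (word, 1))                    // emit (word, 1) pairs
      .groupBy { case (word, _) => word }        // group the pairs by key
      .map { case (word, pairs) =>
        // sum the 1s for each word, as reduceByKey would
        word -> pairs.map(_._2).reduce((a, b) => a + b)
      }
}
```

For example, `LocalWordCount.wordCounts(Seq("spark is fast", "spark is fun"))` yields a map in which "spark" and "is" have count 2 and "fast" and "fun" have count 1.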
