INTRODUCTION TO APACHE SPARK
JUGBD MEETUP #5.0
MAY 23, 2015
MUKTADIUR RAHMAN
TEAM LEAD, M&H INFORMATICS(BD) LTD.
OVERVIEW
• Apache Spark is a cluster computing framework that provides:
• a fast and general engine for large-scale data processing
• programs that run up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
• simple APIs in Scala, Java, and Python
• This talk will cover:
• Components of the Spark Stack
• Resilient Distributed Datasets (RDD)
• Programming with Spark
A BRIEF HISTORY OF SPARK
• Spark was started by Matei Zaharia in 2009 as a research project
in the UC Berkeley RAD Lab, which later became the AMPLab.
• Spark was first open sourced in March 2010 and transferred
to the Apache Software Foundation in June 2013.
• Spark had over 465 contributors in 2014, making it the most
active project in the Apache Software Foundation and
among Big Data open source projects.
• Spark 1.3.1 was released on April 17, 2015
(http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz)
SPARK STACK
Resilient Distributed Datasets (RDD)
An RDD in Spark is simply an immutable distributed collection
of objects. Each RDD is split into multiple partitions, which
may be computed on different nodes of the cluster.
RDDs can be created in two ways:
• by loading an external dataset
scala> val reads = sc.textFile("README.md")
• by distributing a collection of objects
scala> val data = sc.parallelize(1 to 100000)
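Conceptually, `parallelize` splits the collection into partitions that worker nodes can process independently. A plain-Python sketch of that partitioning idea (an illustration only, not Spark code):

```python
# Plain-Python sketch of splitting a collection into partitions,
# as sc.parallelize does conceptually. Not Spark API.
def partition(data, num_partitions):
    """Split `data` into `num_partitions` roughly equal chunks."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

chunks = partition(list(range(1, 101)), 4)
print(len(chunks))    # number of partitions
print(chunks[0][:3])  # first elements of the first partition
```

In real Spark the number of partitions defaults to the cluster configuration and can be passed as a second argument, e.g. `sc.parallelize(1 to 100000, 8)`.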
RDD
Once created, RDDs offer two types of operations:
• transformations
• actions
Example:
Step 1: Create an RDD
scala> val data = sc.textFile("README.md")
Step 2: Transformation
scala> val lines = data.filter(line => line.contains("Spark"))
Step 3: Action
scala> lines.count()
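Transformations like `filter` are lazy: they only describe a computation, and Spark does no work until an action such as `count()` runs. Python generators give a rough analogy (a sketch of the idea, not Spark internals):

```python
# Rough analogy for lazy transformations: nothing is computed
# until the "action" (here, sum) consumes the generator.
lines = ["Spark is fast", "Hadoop MapReduce", "Learning Spark"]

# "Transformation": builds a lazy pipeline, does no work yet.
spark_lines = (line for line in lines if "Spark" in line)

# "Action": forces evaluation, like lines.count() in Spark.
count = sum(1 for _ in spark_lines)
print(count)  # 2
```

This laziness is what lets Spark plan a whole chain of transformations before touching the data.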
RDD
Persisting an RDD in memory
Example:
Step 1: Create an RDD
scala> val data = sc.textFile("README.md")
Step 2: Transformation
scala> val lines = data.filter(line => line.contains("Spark"))
Step 3: Persist in memory
scala> lines.cache() // or lines.persist()
Step 4: Unpersist
scala> lines.unpersist()
Step 5: Action
scala> lines.count()
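`cache()`/`persist()` matter because, without them, every action recomputes the whole transformation chain from the source data. A small Python sketch of that recompute-vs-cache trade-off (an analogy, not Spark's implementation):

```python
# Counting how often the "transformation" runs shows why caching helps.
compute_calls = 0

def expensive_filter(lines):
    """Stand-in for a transformation chain that is costly to recompute."""
    global compute_calls
    compute_calls += 1
    return [l for l in lines if "Spark" in l]

lines = ["Spark is fast", "plain text", "Spark again"]

# Without caching: two "actions" trigger the computation twice.
len(expensive_filter(lines))
len(expensive_filter(lines))
assert compute_calls == 2

# With "caching": compute once, then reuse the materialized result.
cached = expensive_filter(lines)
len(cached)
len(cached)
assert compute_calls == 3  # only one additional computation
```

In Spark, `cache()` is shorthand for `persist()` at the default storage level (memory only); `persist()` also accepts other storage levels such as memory-and-disk.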
SPARK EXAMPLE : WORD COUNT
Scala>>
val data = sc.textFile("README.md")
val counts = data.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/output")
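The same flatMap → map → reduceByKey pipeline can be traced step by step in plain Python (a sketch of the semantics on a tiny in-memory dataset, not PySpark):

```python
# Word count traced in plain Python, mirroring the Spark pipeline.
lines = ["to be or", "not to be"]

# flatMap: split each line into words and flatten the results.
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with the count 1.
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the counts for each distinct word.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same steps run per partition across the cluster, and `reduceByKey` shuffles pairs so that all counts for a word land on the same node before being summed.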
SPARK EXAMPLE : WORD COUNT
Java 8>>
JavaRDD<String> data = sc.textFile("README.md");
JavaRDD<String> words =
data.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
.reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("/tmp/output");
RESOURCES
• https://spark.apache.org/docs/latest/
• http://shop.oreilly.com/product/0636920028512.do
• https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
• https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• https://www.facebook.com/groups/898580040204667/
Q/A
Thank YOU!