INTRODUCTION TO APACHE SPARK
JUGBD MEETUP #5.0
MAY 23, 2015
MUKTADIUR RAHMAN
TEAM LEAD, M&H INFORMATICS(BD) LTD.
OVERVIEW
• Apache Spark is a cluster computing framework that provides:
• a fast and general engine for large-scale data processing
• programs that run up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
• a simple API in Scala, Java, and Python
• This talk will cover:
• Components of the Spark Stack
• Resilient Distributed Datasets (RDD)
• Programming with Spark
A BRIEF HISTORY OF SPARK
• Spark was started by Matei Zaharia in 2009 as a research project
in the UC Berkeley RAD Lab, which later became the AMPLab.
• Spark was first open sourced in March 2010 and was transferred
to the Apache Software Foundation in June 2013.
• Spark had over 465 contributors in 2014, making it the most
active project in the Apache Software Foundation and
among Big Data open source projects.
• Spark 1.3.1 was released on April 17, 2015
(http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz)
SPARK STACK
• Spark Core, plus higher-level components built on top of it: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)
• Runs on a cluster manager: the Standalone Scheduler, Hadoop YARN, or Apache Mesos
Resilient Distributed Datasets (RDD)
An RDD in Spark is simply an immutable distributed collection
of objects. Each RDD is split into multiple partitions, which
may be computed on different nodes of the cluster.
RDDs can be created in two ways (see the sketch below):
• by loading an external dataset
scala> val reads = sc.textFile("README.md")
• by distributing a collection of objects
scala> val data = sc.parallelize(1 to 100000)
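A minimal spark-shell sketch of the same two creation paths (assuming a README.md file in the working directory; the partition counts are illustrative):
scala> // sc is the SparkContext that the spark-shell creates for you
scala> val readme = sc.textFile("README.md", 2)       // external dataset, at least 2 partitions
scala> val numbers = sc.parallelize(1 to 100000, 4)   // in-memory collection, 4 partitions
scala> readme.partitions.size                         // partitions may be computed on different nodes
scala> numbers.partitions.size                        // res: Int = 4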
RDD
Once created, RDDs offer two types of operations:
• transformations
• actions
Example (see the expanded sketch below):
Step 1: Create an RDD
scala> val data = sc.textFile("README.md")
Step 2: Transformation
scala> val lines = data.filter(line => line.contains("Spark"))
Step 3: Action
scala> lines.count()
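An expanded sketch of the same pipeline, showing that transformations are lazy (they only record lineage) and actions trigger the actual computation (variable names are illustrative):
scala> val data = sc.textFile("README.md")
scala> val lines = data.filter(line => line.contains("Spark"))   // transformation: nothing runs yet
scala> val sizes = lines.map(line => line.length)                // another transformation, still lazy
scala> lines.count()    // action: reads the file and filters it
scala> lines.first()    // action: first matching line
scala> sizes.take(5)    // action: lengths of the first 5 matching lines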
RDD
Persisting an RDD in memory
Example (see the storage-level sketch below):
Step 1: Create an RDD
scala> val data = sc.textFile("README.md")
Step 2: Transformation
scala> val lines = data.filter(line => line.contains("Spark"))
Step 3: Persist in memory
scala> lines.cache()   // equivalent to lines.persist()
Step 4: Action
scala> lines.count()
Step 5: Unpersist when no longer needed
scala> lines.unpersist()
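A sketch with an explicit storage level; an action has to run while the RDD is persisted for the cached partitions to be reused (StorageLevel.MEMORY_AND_DISK is just one example level):
scala> import org.apache.spark.storage.StorageLevel
scala> val lines = sc.textFile("README.md").filter(line => line.contains("Spark"))
scala> lines.persist(StorageLevel.MEMORY_AND_DISK)   // cache() is shorthand for persist(MEMORY_ONLY)
scala> lines.count()       // first action computes the RDD and caches its partitions
scala> lines.count()       // second action reuses the cached partitions
scala> lines.unpersist()   // release the cached partitions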
SPARK EXAMPLE : WORD COUNT
Scala>>
val data = sc.textFile("README.md")
val counts = data.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/output")
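An optional follow-up sketch: instead of writing straight to disk, sort by count and inspect the most frequent words (sortBy is a transformation; take(10) is the action that runs the job):
scala> counts.sortBy(pair => pair._2, ascending = false)
             .take(10)
             .foreach(println)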
SPARK EXAMPLE : WORD COUNT
Java 8>>
JavaRDD<String> data = sc.textFile("README.md");
JavaRDD<String> words =
    data.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
    words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
         .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("/tmp/output");
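The examples above assume the interactive shell, where sc already exists. A minimal sketch of the same word count as a standalone Scala application (the Java version would build a JavaSparkContext from the same SparkConf; the app name, master, and paths are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // In a standalone application the SparkContext must be created explicitly.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("README.md")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("/tmp/output")

    sc.stop()   // release the context when the job finishes
  }
}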
RESOURCES
• https://spark.apache.org/docs/latest/
• http://shop.oreilly.com/product/0636920028512.do
• https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
• https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• https://www.facebook.com/groups/898580040204667/
Q/A
Thank YOU!