INTRODUCING
RDDs
Frank Kane
RDD
■ Resilient
■ Distributed
■ Dataset
The SparkContext
■ Created by your driver program
■ Is responsible for making RDDs resilient and distributed!
■ Creates RDDs
■ The Spark shell creates an "sc" object for you
Creating RDDs
■ nums = sc.parallelize([1, 2, 3, 4])
■ sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
– or s3n:// , hdfs://
■ hiveCtx = HiveContext(sc)
  rows = hiveCtx.sql("SELECT name, age FROM users")
■ Can also create from:
– JDBC
– Cassandra
– HBase
– Elasticsearch
– JSON, CSV, sequence files, object files, various compressed formats
Transforming RDDs
■ map
■ flatMap
■ filter
■ distinct
■ sample
■ union, intersection, subtract, cartesian
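To get a feel for what each of these transformations does to the elements, here is a rough sketch in plain Python rather than Spark. Lists stand in for RDDs and every variable name is made up for illustration. Real RDD methods run distributed and lazily, and sample is left out because its result is random.

```python
from itertools import chain, product

lines = ["to be", "or not", "to be"]   # hypothetical stand-in for an RDD of text lines
a = [1, 2, 3, 4]
b = [3, 4, 5]

# map: exactly one output element per input element
lengths = [len(s) for s in lines]

# flatMap: each input may produce zero or more outputs, flattened into one collection
words = list(chain.from_iterable(s.split() for s in lines))

# filter: keep only the elements that pass a test
evens = [x for x in a if x % 2 == 0]

# distinct: drop duplicates (sorted here only to make the result easy to read)
unique_lines = sorted(set(lines))

# union keeps duplicates; intersection keeps what both share; subtract removes b's elements from a
union = a + b
intersection = sorted(set(a) & set(b))
subtract = [x for x in a if x not in b]

# cartesian: every pairing of an element of a with an element of b
pairs = list(product(a, b))

print(lengths, words, evens, unique_lines)
print(union, intersection, subtract, len(pairs))
```

Note the difference between map and flatMap: three lines go in either way, but map gives back three lengths while the flatMap analogue gives back six words.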
map example
■ rdd = sc.parallelize([1, 2, 3, 4])
■ squaredRDD = rdd.map(lambda x: x*x)
■ This yields 1, 4, 9, 16
What’s that lambda thing?
Many RDD methods accept a function as a parameter
rdd.map(lambda x: x*x)
Is the same thing as
def squareIt(x):
    return x*x
rdd.map(squareIt)
There, you now understand functional programming.
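The same equivalence can be checked without a Spark cluster. This small sketch uses Python's built-in map in place of rdd.map, which is an assumption made only so the example runs anywhere; the function name square_it is illustrative.

```python
def square_it(x):
    # Named function: identical behavior to the lambda below
    return x * x

nums = [1, 2, 3, 4]

# Pass an anonymous function...
with_lambda = list(map(lambda x: x * x, nums))

# ...or pass a named function; the results are the same
with_def = list(map(square_it, nums))

print(with_lambda)
print(with_lambda == with_def)
```

A lambda is convenient for one-line logic; a named function is usually easier to read and test once the logic grows.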
RDD actions
■ collect
■ count
■ countByValue
■ take
■ top
■ reduce
■ … and more ...
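As a rough guide to what these actions hand back to the driver, here is a plain-Python sketch, not Spark code. A list stands in for the RDD and the standard library supplies the closest equivalents; the variable names are hypothetical.

```python
from collections import Counter
from functools import reduce
import heapq

data = [3, 1, 2, 3, 3]   # hypothetical stand-in for an RDD's contents

collected = list(data)                    # collect: every element, as a local list
n = len(data)                             # count: number of elements
by_value = Counter(data)                  # countByValue: how often each value occurs
first_two = data[:2]                      # take(2): the first two elements
top_two = heapq.nlargest(2, data)         # top(2): the two largest elements
total = reduce(lambda x, y: x + y, data)  # reduce: combine elements pairwise into one value

print(n, dict(by_value), first_two, top_two, total)
```

On a real cluster, collect pulls the whole dataset into the driver's memory, so take or top is the safer choice for peeking at large RDDs.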
Lazy evaluation
■ Nothing is actually computed until an action is called! Transformations only record what should be done.
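Python's built-in map happens to be lazy too, so it makes a handy analogy for this behavior. The sketch below is plain Python, not Spark: the calls list is a hypothetical device for recording when the function really runs, and list(...) plays the role of an action.

```python
calls = []   # records each time the function is really invoked

def square_it(x):
    calls.append(x)
    return x * x

# "Transformation": this only sets up the work, nothing runs yet
squared = map(square_it, [1, 2, 3, 4])
calls_before = list(calls)

# "Action": asking for the results forces the computation
result = list(squared)
calls_after = list(calls)

print(calls_before)
print(result)
print(calls_after)
```

Spark goes further than this analogy: because it sees the whole chain of transformations before running anything, it can plan the work across the cluster when the action finally arrives.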
