Introduction to SparkR
OLGUN AYDIN – LEAD ANALYST
olgun.aydin@zingat.com
info@olgunaydin.com
Outline
• Installation and Creating a SparkContext
• Getting Data
• SQL queries in SparkR
• DataFrames
• Application Examples
• Correlation Analysis
• Time Series Analysis
• K-Means
Power of Spark
• Fast
• Powerful
• Scalable
Power of R
• Effective
• Number of Packages
• One of the most preferred languages for statistical analysis
Power of Spark + R
• Effective
• Powerful
• Statistical Power
• Fast
• Scalable
• SparkR is an R package that provides a frontend to Apache Spark and uses Spark’s distributed computation engine to enable large-scale data analysis from the R shell.
• Data analysis in R is limited by the amount of memory available on a single machine; moreover, because R is single-threaded, it is often impractical to use it on large datasets.
• SparkR scales R programs while keeping them easy to use and deploy across a range of workloads. SparkR is an R frontend for Apache Spark, a widely deployed cluster computing engine, and there are a number of benefits to designing an R frontend that is tightly integrated with Spark.
• SparkR is built as an R package and requires no changes to R. The central component of SparkR is a distributed data frame that enables structured data processing with a syntax familiar to R users.
• To improve performance over large datasets, SparkR performs lazy evaluation on data frame operations and uses Spark’s relational query optimizer to optimize execution.
• SparkR was initially developed at the AMPLab, UC Berkeley and has been a part of the Apache Spark project for the past eight months.
Introduction to SparkR
• The central component of SparkR is a distributed data frame implemented on top of Spark.
• SparkR DataFrames have an API similar to dplyr or local R data frames, but scale to large datasets using Spark’s execution engine and relational query optimizer.
• SparkR’s read.df method integrates with Spark’s data source API, which enables users to load data from systems like HBase and Cassandra. Having loaded the data, users can then use a familiar syntax to perform relational operations such as selections, projections, aggregations and joins.
• Further, SparkR supports more than 100 pre-defined functions on DataFrames, including string manipulation methods, statistical functions and date-time operations. Users can also execute SQL queries directly on SparkR DataFrames using the sql command. SparkR also makes it easy for users to chain commands using existing R libraries.
• Finally, SparkR DataFrames can be converted to a local R data frame using the collect operator, which is useful for the "big data, small learning" scenarios described earlier.
• SparkR’s architecture consists of two main components: an R-to-JVM binding on the driver that allows R programs to submit jobs to a Spark cluster, and support for running R on the Spark executors.
Installation and Creating a SparkContext
• Step 1: Download Spark
• http://spark.apache.org/
Installation and Creating a SparkContext
• Step 1: Download Spark
http://spark.apache.org/
• Step 2: Run in Command Prompt
Start your favorite command shell and change directory to your Spark folder.
• Step 3: Run in RStudio
Set the system environment. Once you have opened RStudio, you need to point your R session to the installed version of SparkR. Use code like the snippet below, replacing the SPARK_HOME value with the path to your Spark folder, e.g. "C:/Apache/Spark-1.4.1".
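A minimal sketch of this step; the path is only an example and should be replaced with your own Spark installation directory:

# Point the R session at the local Spark installation
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")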
Installation and Creating a SparkContext
• Step 4: Set the Library Paths
Add the SparkR library that ships with Spark to R's library path.
• Step 5: Load the SparkR library
Load the SparkR library by using library(SparkR).
• Step 6: Initialize the Spark Context and SQL Context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
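Putting steps 4-6 together, a sketch of the full initialization (assuming SPARK_HOME was set as in Step 3):

# Step 4: add the SparkR package bundled with Spark to the library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Step 5: load SparkR
library(SparkR)
# Step 6: create a local Spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)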
Getting Data
• From local data frames
• The simplest way to create a DataFrame is to convert a local R data frame into a SparkR DataFrame. Specifically, we can use createDataFrame and pass in the local R data frame to create a SparkR DataFrame. As an example, the following creates a DataFrame using the faithful dataset from R.
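A sketch of this conversion using the built-in faithful dataset (SparkR 1.x API, where the SQL context is passed explicitly):

# Convert a local R data frame into a distributed SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)
# Inspect the first rows
head(df)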
Getting Data
• From Data Sources
• SparkR supports operating on a variety of data sources through the DataFrame interface. This section describes the general methods for loading and saving data using data sources. You can check the Spark SQL programming guide for more specific options that are available for the built-in data sources.
• The general method for creating DataFrames from data sources is read.df.
• This method takes in the SQLContext, the path of the file to load and the type of data source.
• SparkR supports reading JSON and Parquet files natively, and through Spark Packages you can find data source connectors for popular file formats like CSV and Avro.
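As a sketch of a non-native source, the CSV connector from Spark Packages can be loaded when the context is created and then used with read.df; the package version and file name below are assumptions:

# Restart the context with the spark-csv package available
sc <- sparkR.init(master = "local", sparkPackages = "com.databricks:spark-csv_2.10:1.0.3")
sqlContext <- sparkRSQL.init(sc)
# Read a CSV file through the connector ("cars.csv" is a placeholder path)
carsDF <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", header = "true")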
Getting Data
• We can see how to use data sources using an example JSON input file. Note that the file used here is not a typical JSON file: each line in the file must contain a separate, self-contained valid JSON object.
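A sketch assuming a hypothetical people.json file in which each line holds one self-contained object, e.g. {"name":"Andy", "age":30}:

# Load the newline-delimited JSON file as a SparkR DataFrame
people <- read.df(sqlContext, "people.json", "json")
# Show the inferred schema and a few rows
printSchema(people)
head(people)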
Getting Data
u From Hive tables
u You can also create SparkR DataFrames from Hive tables. To do this we will need
to create a HiveContext which can access tables in the Hive MetaStore. Note
that Spark should have been built with Hive support and more details on the
difference between SQLContext and HiveContext can be found in the SQL
programming guide.
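A minimal sketch, assuming a Hive-enabled Spark build; the table name and query are illustrative only:

# Create a HiveContext on top of the existing Spark context
hiveContext <- sparkRHive.init(sc)
# Run HiveQL against tables in the Hive MetaStore
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
results <- sql(hiveContext, "FROM src SELECT key, value")
head(results)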
SQL queries in SparkR
• A SparkR DataFrame can also be registered as a temporary table in Spark SQL; registering a DataFrame as a table allows you to run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a DataFrame.
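A sketch using the people DataFrame loaded earlier (table and column names are assumptions):

# Register the DataFrame as a temporary table
registerTempTable(people, "people")
# Query it with SQL; the result is returned as another SparkR DataFrame
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)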
DataFrames
• SparkR DataFrames support a number of functions to do structured data processing. Here we include some basic examples; a complete list can be found in the API docs.
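For instance, selecting columns and filtering rows on the faithful DataFrame created earlier (a sketch):

# Select only the "eruptions" column
head(select(df, df$eruptions))
# Keep only the rows with a waiting time below 50 minutes
head(filter(df, df$waiting < 50))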
DataFrames
• SparkR DataFrames support a number of commonly used functions to aggregate data after grouping. For example, we can compute a histogram of the waiting time in the faithful dataset as shown below.
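A sketch of the grouped count on the faithful DataFrame:

# Count how often each waiting time occurs
waitingCounts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
# Show the most frequent waiting times first
head(arrange(waitingCounts, desc(waitingCounts$count)))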
DataFrames
• SparkR also provides a number of functions that can be applied directly to columns for data processing and during aggregation. The example below shows the use of basic arithmetic functions.
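A sketch of column arithmetic on the same DataFrame, converting the waiting time from minutes to seconds:

# Derive a new column from an existing one
df$waiting_secs <- df$waiting * 60
head(df)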
Application
• http://www.transtats.bts.gov/Tables.asp?DB_ID=120
