INTRODUCTION
Know your Instructor
● Author "R for Business Analytics"
● Author "R for Cloud Computing"
● Founder "Decisionstats.com"
● MS, University of Tennessee, Knoxville (courses in statistics and computer science)
● MBA (IIM Lucknow, India, 2003)
● B.Engineering (DCE, 2001)
http://linkedin.com/in/ajayohri
Classroom Rules
• From Instructor
• From Audience
– mobile phones should be kindly switched off
• Yes, this includes Whatsapp
– Ask Questions at end of session
– Take Notes
– Please Take Notes
What is data science ?
Hacking ( Programming) + Maths/Statistics + Domain Knowledge = Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Oh really, is this a Data Scientist ?
a data scientist is simply a person who can
write code = in R,Python,Java, SQL, Hadoop (Pig,HQL,MR) etc
= for data storage, querying, summarization, visualization
= how efficiently, and in time (fast results?)
= where on databases, on cloud, servers
and understand enough statistics
to derive insights from data
so business can make decisions
Data Science with R
A popular language
in Data Science
What Is R
https://www.r-project.org/about.html
R is an integrated suite of software facilities for data manipulation, calculation
and graphical display. It includes
● an effective data handling and storage facility,
● a suite of operators for calculations on arrays, in particular matrices,
● a large, coherent, integrated collection of intermediate tools for data
analysis,
● graphical facilities for data analysis and display either on-screen or on
hardcopy, and
● a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and
output facilities.
Install R
https://cran.r-project.org/bin/windows/base/
Install RStudio
https://www.rstudio.com/products/rstudio/download/
Statistical Software Landscape
SAS
Python (Pandas)
IBM SPSS
R
Julia
Clojure
Octave
Matlab
JMP
EViews
Using R with other software
https://rforanalytics.wordpress.com/useful-links-for-r/using-r-from-other-software/
Tableau http://www.tableausoftware.com/new-features/r-integration
Qlik http://qliksolutions.ru/qlikview/add-ons/r-connector-eng/
Oracle R http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html
Rapid Miner https://rapid-i.com/content/view/202/206/lang,en/#r
JMP http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html
SAS/IML http://www.sas.com/technologies/analytics/statistics/iml/index.html
Teradata http://developer.teradata.com/applications/articles/in-database-analytics-with-teradata-r
Pentaho http://bigdatatechworld.blogspot.in/2013/10/integration-of-rweka-with-pentaho-data.html
IBM SPSS https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=ibm-analytics&S_PKG=ov18855&S_TACT=M161003W&dynform=127&lang=en_US
TIBCO TERR http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr
Some Advantages of R
open source
free
large number of algorithms and packages, especially for statistics
flexible
very good for data visualization
superb community
rapidly growing
can be used with other software
Some Disadvantages of R
in memory (RAM) usage
steep learning curve
some IT departments frown on open source
verbose documentation
tech support
evolving ecosystem for corporates
Solutions for Disadvantages of R
in memory (RAM) usage: specialized packages, in-database computing
steep learning curve: TRAINING !!!
some IT departments frown on open source: TRAINING and education!
verbose documentation: CRAN Views, R Documentation
tech support: expanding pool of resources
evolving ecosystem for corporates: getting better with MS et al
R used by Government
● In the early days of the Deepwater Horizon disaster, NIST used uncertainty analysis in R to harmonize spill
estimates from various sources, and to provide ranges of estimates to other agencies and the media.
● Before new drugs are allowed on the market, the FDA works with pharmaceutical companies to verify safety
and efficacy through clinical trials. Despite a false perception that only commercial software may be used,
many pharmaceutical companies are now using open-source R to analyze data from clinical trials.
● The National Weather Service uses R for research and development of models to predict river flooding.
● The newly-formed Consumer Financial Protection Bureau -- freed from the restrictions of a legacy IT
infrastructure -- is championing the use of open-source technologies in government.
● Local governments are also building data-based applications. The SF Estuary Institute uses R and Google
Maps to provide a tool to track pollution in the San Francisco Bay area.
http://gsnmagazine.com/node/26483?c=cyber_security
R used by Telecom
● Churn using Social Network Analysis
http://www.slideshare.net/dataspora/social-network-analysis-for-telecoms
R used by Insurance
a few more insurance related packages:
● ChainLadder – Reserving methods in R. The package provides Mack-, Munich-, Bootstrap, and Multivariate-chain-ladder
methods, as well as the LDF Curve Fitting methods of Dave Clark and GLM-based reserving models.
● cplm – Monte Carlo EM algorithms and Bayesian methods for fitting Tweedie compound Poisson linear models
● lossDev – A Bayesian time series loss development model. Features include skewed-t distribution with time-varying scale
parameter, Reversible Jump MCMC for determining the functional form of the consumption path, and a structural break in
this path; by Christopher W. Laws and Frank A. Schmid
● actuar – Loss distribution modelling, risk theory (including ruin theory), simulation of compound hierarchical models and
credibility theory; by C. Dutang, V. Goulet and M. Pigeon.
● favir – Formatted Actuarial Vignettes in R. FAViR lowers the learning curve of the R environment. It is a series of
peer-reviewed Sweave papers that use a consistent style.
● mondate – R package to keep track of dates in terms of months
● lifecontingencies – Package to perform actuarial evaluation of life contingencies
Also see Introduction to R for Actuaries by Nigel de Silva
and http://www.rininsurance.com/
R in Finance
http://www.rinfinance.com/
R in Finance
http://cran.r-project.org/web/views/Finance.html
This CRAN Task View contains a list of packages useful for empirical work in Finance, grouped by topic.
● The Rmetrics suite of packages comprises fArma, fAsianOptions, fAssets, fBasics, fBonds, timeDate (formerly: fCalendar),
fCopulae, fExoticOptions, fExtremes, fGarch, fImport, fNonlinear, fOptions, fPortfolio, fRegression, timeSeries (formerly:
fSeries), fTrading, fUnitRoots and contains a very large number of relevant functions for different aspects of empirical and
computational finance.
● The RQuantLib package provides several option-pricing functions as well as some fixed-income functionality from the
QuantLib project to R.
● The quantmod package offers a number of functions for quantitative modelling in finance as well as data acquisition,
plotting and other utilities.
● The portfolio package contains classes for equity portfolio management; the portfolioSim builds a related simulation
framework. The backtest offers tools to explore portfolio-based hypotheses about financial instruments. The stockPortfolio
package provides functions for single index, constant correlation and multigroup models. The pa package offers
performance attribution functionality for equity portfolios.
● The PerformanceAnalytics package contains a large number of functions for portfolio performance calculations and risk
management.
R in Pharma
http://blog.revolutionanalytics.com/2013/08/r-drug-development-and-the-fda.html
● Opening the Doors to Open Source Programming in Drug Development
● R: Regulatory Compliance and Validation Issues: A Guidance Document for the Use of R in Regulated Clinical Trial Environments
● At useR 2012, FDA statistician Jea Brodsky presented a poster describing how FDA scientists “use R on a daily basis” and have
themselves written R packages for use at various stages in the drug submission process.
● Open Source Software in the Biopharma Industry: Challenges and Opportunities
R in Pharma
http://web.quanticate.com/bid/102741/Using-the-Statistical-Programming-Language-R-in-the-Pharma-Industry
R in Pharma
http://cran.r-project.org/web/views/ClinicalTrials.html
This task view gathers information on specific R packages for design, monitoring and analysis of data from clinical trials. It focuses
on including packages for clinical trial design and monitoring in general plus data analysis packages for a specific type of design.
Companies using R
from http://www.revolutionanalytics.com/companies-using-r
ANZ, the fourth largest bank in Australia, uses R for credit risk analysis.
Bank of America uses R for reporting.
The Consumer Financial Protection Bureau uses R for data analysis.
Facebook
Facebook and R:
● Analysis of Facebook Status Updates
● Facebook's Social Network Graph
● How Google and Facebook are using R
● Predicting Colleague Interactions with R
Refresher in Statistics
Mean
Arithmetic Mean- the sum of the values divided by the number of values.
The geometric mean is an average that is useful for sets of positive numbers that are interpreted according to their product and
not their sum (as is the case with the arithmetic mean) e.g. rates of growth.
Median
The median is the number separating the higher half of a data sample, a population, or a probability distribution from the lower
half.
Mode
The "mode" is the value that occurs most often.
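A minimal sketch of these measures in R, using a made-up vector of growth factors. Base R has no built-in function for the geometric mean or the statistical mode, so `geometric_mean` and `stat_mode` are hypothetical helper names written only for this illustration.

```r
# Illustrative data: yearly growth factors (made up for this example)
x <- c(1.10, 1.20, 1.10, 0.90, 1.30)

arithmetic_mean <- mean(x)     # sum of the values divided by their count
middle_value    <- median(x)   # value separating the lower and upper halves

# Geometric mean: nth root of the product, computed through logs
geometric_mean <- function(v) exp(mean(log(v)))

# Mode: the value that occurs most often
# (on ties, the smallest of the tied values is returned)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

print(arithmetic_mean)
print(geometric_mean(x))
print(middle_value)
print(stat_mode(x))
```

For growth rates the geometric mean is the better summary, and it never exceeds the arithmetic mean of the same positive numbers.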
Refresher in Statistics
Range
the range of a set of data is the difference between the largest and smallest values.
Variance
mean of squares of differences of values from mean
Standard Deviation
square root of its variance
Frequency
a frequency distribution is a table that displays the frequency of various outcomes in a sample.
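The definitions above can be sketched in R as follows, on a made-up sample. One point to watch: the variance described above divides by n, while R's built-in `var()` and `sd()` divide by n - 1 (the sample versions). Also, R's `range()` returns the smallest and largest values rather than their difference.

```r
scores <- c(4, 8, 6, 5, 3, 8)   # made-up sample

# Range: largest value minus smallest value
value_range <- max(scores) - min(scores)

# Variance as defined above: mean of squared differences from the mean
pop_variance <- mean((scores - mean(scores))^2)
pop_sd       <- sqrt(pop_variance)

# Built-in versions divide by (n - 1) instead of n
sample_variance <- var(scores)
sample_sd       <- sd(scores)

# Frequency distribution: how often each value occurs
freq <- table(scores)
print(freq)
```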
Distributions
Normal
The simplest case of a normal distribution is known as the standard normal distribution. This is a special case where μ=0 and σ=1.
Refresher in Statistics
Probability Distribution
The probability density function (pdf) of the normal distribution, also called the Gaussian or "bell curve", describes the most
important continuous probability distribution. The probability of an interval of values corresponds to the area
under the curve over that interval.
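A short sketch of "probability as area under the curve" for the standard normal distribution, using base R's `dnorm()` (density) and `pnorm()` (cumulative probability). The variable names are illustrative only.

```r
# Height of the standard normal density (mu = 0, sigma = 1) at its peak
peak_density <- dnorm(0, mean = 0, sd = 1)

# Probability of an interval = area under the curve between its end points
p_within_1sd <- pnorm(1) - pnorm(-1)
p_within_2sd <- pnorm(2) - pnorm(-2)

# The same area, obtained by numerically integrating the density
area_1sd <- integrate(dnorm, lower = -1, upper = 1)$value

print(c(p_within_1sd, p_within_2sd, area_1sd))
```

About 68% of values fall within one standard deviation of the mean and about 95% within two.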
Pre Requisites
• Installation of R
http://cran.rstudio.com/bin/windows/base/
– Rtools
– http://cran.rstudio.com/bin/windows/Rtools/
• R Studio
http://www.rstudio.com/products/rstudio/download/
• R Packages
About eight packages are supplied with the R distribution, and many more are available
through the CRAN family of Internet sites, covering a very wide range of modern statistics.
CRAN
107 sites in 49 regions
Non CRAN Repositories
http://www.rdocumentation.org/
github
https://github.com/trending?l=R
bioconductor
http://www.bioconductor.org/
Pre Requisites
• R Packages
install.packages() INSTALLS
update.packages() UPDATES
library() LOADS
• Packages are installed once, updated periodically, but loaded every time
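A sketch of the install / update / load cycle. The install and update lines are commented out because they need an internet connection; `data.table` is used only as an example package name. The `stats` package ships with R, so it is safe to load anywhere.

```r
# install.packages("data.table")   # once per machine (needs internet)
# update.packages(ask = FALSE)     # periodically, to refresh installed packages

library(stats)                     # every session: attach the package

# Check that the package is now on the search path
loaded <- "package:stats" %in% search()
print(loaded)
```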
Interfaces to R
• Console
Default
Customization
• IDE
• GUI
Graphical Interfaces to R
• R Commander
• Rattle
• Deducer
Installation of R Commander
Overview of R Commander
Demo
R Commander – 3D Graphs
Installation of Rattle
• GTK+ Installation Necessary
• Install other packages when prompted
Overview of Rattle
Demo Rattle
RStudio
RStudio Desktop offers the following advantages over the native R console:
● Syntax highlighting, code completion, and smart indentation
● Execute R code directly from the source editor
● Quickly jump to function definitions
● Easily manage multiple working directories using projects
● Integrated R help and documentation
● Interactive debugger to diagnose and fix errors quickly
● Extensive package development tools
http://www.rstudio.com/products/
RStudio
RStudio Server enables you to provide a browser based interface (the RStudio IDE) to a version of R running on a
remote Linux server. Deploying R and RStudio on a server has a number of benefits, including:
● The ability to access your R workspace from any computer in any location;
● Easy sharing of code, data, and other files with colleagues;
● Allowing multiple users to share access to the more powerful compute resources (memory,
processors, etc.) available on a well equipped server; and
● Centralized installation and configuration of R, R packages, TeX, and other supporting libraries.
RStudio - Interface
R Landscape
R Documentation
Manuals
http://cran.r-project.org/manuals.html
R Documentation
Vignettes
CRAN Views
http://cran.r-project.org/web/views/
R Community
● email groups http://www.r-project.org/mail.html
R-announce
R-help
R-package-devel
R-devel
R-packages
Special Interest Groups
● Stack Overflow [r]
● Twitter #rstats
● Blogs at http://www.r-bloggers.com/ (573 blogs)
Stack Overflow
http://stackoverflow.com/questions/tagged/r
Twitter
https://twitter.com/search?q=rstats&src=sprv
Help within R
?keyword (opens the help page for that exact name)
??keyword (searches all installed help pages for the keyword)
Example-
> ?kmeans
> ??kmeans
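A few related ways to look things up, as a sketch. The interactive help calls are commented out because they open a help viewer; `apropos()` and `args()` print straight to the console.

```r
# help("kmeans")          # same as ?kmeans
# help.search("kmeans")   # same as ??kmeans
# example("kmeans")       # runs the examples from the help page

# apropos() lists objects whose names match a pattern
matches <- apropos("^kmeans$")
print(matches)

# args() shows a function's arguments without opening the help page
print(args(kmeans))
```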
Functions Used in this Lesson
function(x)
for
library
install.packages
update.packages
ls
rm
print
Citations and References
> citation()
To cite R in publications use:
R Core Team (2015). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Introductory R
> Sys.Date()
[1] "2015-05-10"
> Sys.time()
[1] "2015-05-10 18:28:32 IST"
R as a Calculator
Demo- Basic Math on R Console
• +
• -
• *
• /
• ()
• log
• exp
• mean
• sum
• sd
• median
Hint- Ctrl+L clears the screen
Hint- Up arrow gives you the last typed command
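A sketch of a calculator session. Function names are case sensitive, so it is `log()` and `exp()`, not `Log()` and `Exp()`. The vector `marks` is made up for illustration.

```r
2 + 3          # addition
7 - 10         # subtraction
4 * 5          # multiplication
9 / 2          # division
(2 + 3) * 4    # brackets control the order of evaluation
log(100)       # natural log; log10(100) gives the base-10 log
exp(1)         # e raised to the power 1

marks <- c(10, 20, 30, 40)   # a made-up vector of numbers
sum(marks)
mean(marks)
median(marks)
sd(marks)
```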
Functions-
ls() – lists the objects in the workspace
rm("foo") – removes the object named foo
Assignment
Using = or <- assigns values to object names
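A sketch of assignment and workspace housekeeping. Note the direction of the arrows: `<-` puts the name on the left, while `->` puts the name on the right. The object names are made up.

```r
x <- 5           # the usual assignment operator
y = 10           # '=' also works at the top level
x + y -> total   # rightward assignment: the value is on the left

print(total)

ls()             # lists the objects in the workspace
rm("y")          # removes y; rm(y) works too
y_still_there <- exists("y", inherits = FALSE)   # is y still in the workspace?
print(y_still_there)
```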
Functions and Loops
• Loops
for (number in 1:5) { print(number) }
• Function
functionajay = function(a) (a^2 + 2*a + 1)
Hint: Always match brackets
Each ( deserves a )
Each { deserves a }
Each [ deserves a ]
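The loop and function from the slides, written out over several lines and then combined. This is a sketch; `squares` is a name introduced here for illustration.

```r
# Loop: print the numbers 1 to 5
for (number in 1:5) {
  print(number)
}

# Function: computes a^2 + 2a + 1, which equals (a + 1)^2
functionajay <- function(a) {
  a^2 + 2 * a + 1
}

functionajay(3)      # a single value
functionajay(1:5)    # works on a whole vector at once

# Combining both: fill a vector inside a loop
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- functionajay(i)
}
print(squares)
```

Every opening bracket in this block has its matching closing bracket, which is the habit the hint above asks for.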
Other sources to learn R
swirlstats
http://swirlstats.com/
datacamp
https://www.datacamp.com/
codeschool
http://tryr.codeschool.com/
coursera
https://www.coursera.org/course/compdata
https://www.coursera.org/course/rprog
Good coding practices
• Use # for comment
• Use git for version control
• Use RStudio for multiple lines of code
Functions in R
• custom functions
• source code for a function
• Understanding help ? , ??
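A sketch of a custom function and of viewing a function's source code. `percent_change` is a hypothetical function written for this example.

```r
# A custom function with a default argument
percent_change <- function(old, new, digits = 1) {
  round(100 * (new - old) / old, digits)
}
percent_change(80, 100)

# Typing a function's name without brackets prints its source code
percent_change
sd                         # works for many built-in functions too

# Inspect the pieces of a function
formals(percent_change)    # its arguments and their defaults
body(percent_change)       # the code it runs
```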
HOMEWORK TIME !
Learning Objectives
● how to input data in R using various ways
● how to check for correct data input
● how to use special packages for fast data
input
● how to input data from statistical file formats
● how to input data from databases
● how to input data from web (web scraping)
What will you learn from this lesson
- data input from various kinds of format
- efficient data input via various packages
- sql to R
- web scraping
- piping in R
- using json in R
Environment
ls() -lists objects
rm()-removes an object
gc() -does garbage collection and frees up
memory
File System
getwd()- get working directory
setwd()- set or change working directory
dir() - lists files in working directory
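A sketch of moving around the file system. It works in R's temporary folder so nothing on your machine is changed; the file name `demo_note.txt` is made up.

```r
old_dir <- getwd()             # remember where we started

demo_dir <- tempdir()          # a scratch folder that R creates per session
setwd(demo_dir)                # change the working directory

writeLines("hello", "demo_note.txt")   # create a small file there
files_here <- dir()            # list files in the working directory
print(files_here)

file.exists("demo_note.txt")   # check for one specific file

setwd(old_dir)                 # go back to the original directory
```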
Assigning
objectname=read.csv(filepath,parameters)
OR
objectname<-read.csv(filepath,parameters)
Data Input
read.table() or read.csv()
read.spss()
read.sas7bdat()
read.table()
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
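A sketch of reading a delimited file and checking the result. The data and the temporary file are made up so the example is self-contained; in practice you would pass your own file path.

```r
# Create a small CSV file to read (made-up data, temporary location)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,city",
             "Asha,31,Delhi",
             "Ravi,45,Mumbai",
             "Meena,NA,Pune"), csv_path)

# read.csv() is read.table() with defaults suited to comma-separated files
people <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# The same file through read.table(), with the options spelled out
people2 <- read.table(csv_path, header = TRUE, sep = ",",
                      stringsAsFactors = FALSE)

str(people)     # check column types after input
head(people)    # look at the first rows
```

Running `str()` straight after input is a quick way to catch columns that were read with the wrong type.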
Statistical formats
• read.spss from foreign package
• read.sas7bdat from sas7bdat package
Manual Entry
Manual Editing
readr from Hadley
The goal of readr is to provide a fast and friendly way to read tabular data into R. The most important functions are:
● Read delimited files: read_delim(), read_csv(), read_tsv(), read_csv2().
● Read fixed width files: read_fwf(), read_table().
● Read lines: read_lines().
● Read whole file: read_file().
● Re-parse existing data frame: type_convert().
https://github.com/hadley/readr
Source Data - https://bit.ly/dsdata
readxl from Hadley
readxl supports both the legacy .xls format and the modern xml-based .xlsx format. .xls support is made possible
with the libxls C library, which abstracts away many of the complexities of the underlying binary format. To parse
.xlsx, we use the RapidXML C++ library.
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
read_excel("my-spreadsheet.xls", na = "NA")
https://github.com/hadley/readxl
data.table
fread is one of the fastest ways to read delimited text files into R, usually much quicker than read.csv on large files
Some learnings
1. Multiple packages can do the same thing faster or slower in R
2. Knowing the right package is the essential difference as a data scientist
3. Putting code within system.time() helps measure speed
also see http://adv-r.had.co.nz/Profiling.html for advanced ways to speed up code
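A sketch of timing code with system.time(). It compares two ways of adding up the same numbers using only base R; the same pattern can be used to compare read.csv with fread on your own files. Actual timings depend on the machine.

```r
n <- 1e5
values <- runif(n)   # random numbers, used only for timing

# Slower: add the numbers one at a time in a loop
loop_time <- system.time({
  total_loop <- 0
  for (v in values) total_loop <- total_loop + v
})

# Faster: the built-in vectorised sum()
vector_time <- system.time({
  total_vec <- sum(values)
})

print(loop_time["elapsed"])
print(vector_time["elapsed"])
all.equal(total_loop, total_vec)   # checks that both give the same answer
```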
Creating DSN (Optional)
Creating DSN (in Windows)
A Data Source Name (DSN) is the logical name that is used by Open Database Connectivity (ODBC) to refer to the driver and other information
that is required to access data. The name is used by Internet Information Services for a connection to an ODBC data source, such as a Microsoft
SQL Server database.
https://support.microsoft.com/en-us/kb/kbview/300596
1. Click Start, point to Control Panel, double-click Administrative Tools, and then double-click Data Sources(ODBC).
2. Click the System DSN tab, and then click Add.
3. Click the database driver that corresponds with the database type to which you are connecting, and then click Finish.
4. Type the data source name. Make sure that you choose a name that you can remember. You will need to use this name
later.
5. Click Select.
6. Click the correct database, and then click OK.
7. Click OK, and then click OK.
RODBC
> library(RODBC)
> odbcDataSources()
> ajay=odbcConnect("MySQL",uid="root",pwd="XX")
> ajay
> sqlTables(ajay)
> tested=sqlFetch(ajay,"host")
From Databases
The RODBC package provides access to databases through
an ODBC interface.
The primary functions are
• odbcConnect(dsn, uid="", pwd="") Open a connection
to an ODBC database
• sqlFetch(channel, sqltable) Read a table from an ODBC
database into a data frame
Hint- a good site to revise R
http://www.statmethods.net
sqlite
RSQLite (http://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf) embeds the SQLite database engine in R and provides an interface
compliant with the DBI package.
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL
database engine.
SQLite is the most widely deployed database engine in the world
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db")
# read csv file into sql database
dbWriteTable(con, name="sample_data", value="sample_data.csv", row.names=FALSE, header=TRUE, sep = ",")
http://cran.r-project.org/web/packages/sqldf/index.html Manipulate R data frames using SQL
read.csv.sql in the sqldf package imports data into a temporary SQLite database and then reads it into R.
A Detour to SQL Joins (Optional)
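The four common SQL joins can be sketched in base R with merge(), without any database. The two data frames are made up for illustration.

```r
customers <- data.frame(id = c(1, 2, 3),
                        name = c("Asha", "Ravi", "Meena"))
orders    <- data.frame(id = c(2, 3, 3, 4),
                        amount = c(250, 100, 75, 60))

# INNER JOIN: only ids present in both tables
inner <- merge(customers, orders, by = "id")

# LEFT JOIN: every customer, with NA where there is no order
left <- merge(customers, orders, by = "id", all.x = TRUE)

# RIGHT JOIN: every order, with NA where there is no customer
right <- merge(customers, orders, by = "id", all.y = TRUE)

# FULL OUTER JOIN: everything from both tables
full <- merge(customers, orders, by = "id", all = TRUE)

print(full)
```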
RMySQL
install.packages("RMySQL")
library(RMySQL)
mydb = dbConnect(MySQL(), user='user', password='password', dbname='database_name', host='host')
dbListTables(mydb)
dbListFields(mydb, 'some_table')
dbSendQuery(mydb, 'drop table if exists some_table, some_other_table')
dbWriteTable(mydb, name='table_name', value=data.frame.name)
Other databases
Teradata https://github.com/Teradata/teradataR
PostgreSQL http://cran.r-project.org/web/packages/RPostgreSQL/
MongoDB http://cran.r-project.org/web/packages/mongolite/index.html
couchDB http://cran.r-project.org/web/packages/couchDB/index.html
MonetDB http://cran.r-project.org/web/packages/MonetDB.R/index.html
Other data sources
Cassandra with R http://cran.r-project.org/web/packages/RCassandra/RCassandra.pdf
Neo4j with R
http://things-about-r.tumblr.com/post/47392314578/venue-recommendation-a-simple-use-case-connecting-r
R with Hadoop Stack https://github.com/RevolutionAnalytics/RHadoop/wiki
● NEW! ravro - read and write files in avro format
● plyrmr - higher level plyr-like data processing for structured data, powered by rmr
● rmr - functions providing Hadoop MapReduce functionality in R
● rhdfs - functions providing file management of the HDFS from within R
● rhbase - functions providing database management for the HBase distributed database from within R
https://amplab-extras.github.io/SparkR-pkg/ SparkR is an R package to use Spark from R.
Web Scraping
Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from
websites.
example - python (scrapy and beautiful soup)
Web Scraping
• readLines
Hint : R is case sensitive
readlines is not the same as readLines
Hint : Use head() and tail() to inspect objects
Other packages are XML and RCurl
Case Study- http://decisionstats.com/2013/04/14/using-r-for-cricket-analysis-rstats/
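A sketch of the readLines approach to scraping. To stay runnable offline it reads a tiny made-up HTML page from a temporary file; readLines() also accepts a URL in place of the file path. The example.com links are placeholders.

```r
# A tiny HTML page saved locally so the example runs offline
page_path <- tempfile(fileext = ".html")
writeLines(c("<html><body>",
             "<h1>Scores</h1>",
             "<a href=\"http://example.com/a\">Match A</a>",
             "<a href=\"http://example.com/b\">Match B</a>",
             "</body></html>"), page_path)

# Read the page as a character vector, one element per line
page <- readLines(page_path)
head(page)    # inspect the first lines

# Keep the lines that contain links, then pull out the link targets
link_lines <- grep("href=", page, value = TRUE)
links <- sub('.*href="([^"]*)".*', "\\1", link_lines)
print(links)
```

Regular expressions are fine for simple pages; for real sites a parser such as the XML or rvest package is usually more robust.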
curl
cURL is a computer software project providing a library and command-line tool for transferring data using various
protocols. The cURL project produces two products, libcurl and cURL.
The RCurl package is an R-interface to the libcurl library that provides HTTP facilities. This allows us to download
files from Web servers, post forms, use HTTPS (the secure HTTP), use persistent connections, upload files, use
binary content, handle redirects, password authentication, etc.
The primary top-level entry points are
● getURL()
● getURLContent()
● getForm()
● postForm()
http://www.omegahat.org/RCurl/RCurlJSS.pdf
RCurl
XML
json format
jsonlite for json data
http://arxiv.org/abs/1403.2805
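A sketch of a round trip between an R data frame and JSON with jsonlite's toJSON() and fromJSON(). The block only runs if jsonlite is already installed, and the data frame is made up.

```r
if (requireNamespace("jsonlite", quietly = TRUE)) {
  players <- data.frame(player = c("A", "B"),
                        runs = c(42, 17),
                        stringsAsFactors = FALSE)

  # R data frame -> JSON text (one JSON object per row)
  json_text <- jsonlite::toJSON(players)
  print(json_text)

  # JSON text -> R data frame
  back <- jsonlite::fromJSON(json_text)
  str(back)
}
```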
Using APIs for data
https://ropensci.org/
ff package
http://cran.r-project.org/web/packages/ff/index.html
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently
mapping only a section (pagesize) in main memory - the effective virtual memory consumption per ff object.
http://cran.r-project.org/web/packages/ffbase/index.html
Basic (statistical) functionality for package ff
Example- http://www.bnosac.be/index.php/blog/22-if-you-are-into-large-data-and-work-a-lot-package-ff
> require(ffbase)
> hhp <- read.table.ffdf(file="/home/jan/Work/RForgeBNOSAC/github/RBelgium_HeritageHealthPrize/Data/Claims.csv",
+     FUN = "read.csv", na.strings = "")
Also see http://cran.r-project.org/web/packages/bigmemory/index.html
Create, store, access, and manipulate massive matrices. Matrices are allocated to shared memory and may use memory-mapped
files. Packages biganalytics, bigtabulate, synchronicity, and bigalgebra provide advanced functionality
RevoScaleR package
RevoScaleR has its own file format, XDF, which is able to rapidly access data by row or by column and to read some
data sequentially. XDF file data is stored in the same binary format used in memory, which eliminates the need for
conversion when it is brought into memory.
http://www.revolutionanalytics.com/revolution-r-enterprise-scaler
rhdf5
This R/Bioconductor package provides an interface between HDF5 and R. HDF5's main features are the ability to store and
access very large and/or complex datasets and a wide variety of metadata on mass storage (disk) through a completely portable
file format.
http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is
designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing
applications to evolve in their use of HDF5.
https://www.hdfgroup.org/HDF5/
HDF5 simplifies the file structure to include only two major types of object:
● Datasets, which are multidimensional arrays of a homogeneous type
● Groups, which are container structures which can hold datasets and other groups
Functions Used in this lesson
● toJSON and fromJSON
Packages
● data.table
● jsonlite
● rvest
Revision
getwd
setwd
dir
ls
rm
install.packages
library
fread vs read.csv
df[i,j]
df$column
str
summary
table
citation
help
Revision
mean
sd
median
length
vector
data.frame
Indexing
class
nrow
ncol
head
tail
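A short revision sketch that exercises most of the functions listed above on a small made-up data frame.

```r
# A small made-up data frame
df <- data.frame(name  = c("Asha", "Ravi", "Meena", "John"),
                 dept  = c("Sales", "IT", "Sales", "IT"),
                 score = c(72, 85, 90, 64),
                 stringsAsFactors = FALSE)

class(df); nrow(df); ncol(df)   # what it is and how big it is
str(df)                         # structure: column names and types
summary(df$score)               # minimum, quartiles, mean, maximum
head(df, 2); tail(df, 2)        # first and last rows

df[2, 3]                        # indexing: row 2, column 3
df[df$score > 70, "name"]       # names of the rows meeting a condition
df$score                        # one column as a vector
table(df$dept)                  # counts per department

mean(df$score); median(df$score); sd(df$score); length(df$score)
```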
Citations and References
M Dowle, T Short, S Lianoglou, A Srinivasan with contributions from R Saporta and E Antonyan
(2014) data.table: Extension of data.frame. R package version 1.9.4.
http://CRAN.R-project.org/package=data.table
Jeroen Ooms (2014). The jsonlite Package: A Practical and Consistent Mapping Between JSON Data
and R Objects. arXiv:1403.2805 [stat.CO] URL http://arxiv.org/abs/1403.2805
Hadley Wickham (2015). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.2.0.
http://CRAN.R-project.org/package=rvest

More Related Content

PDF
Download Python for R Users pdf for free
PPTX
Introduction to R
PPTX
R for data analytics
PDF
Open source analytics
PPTX
LSESU a Taste of R Language Workshop
PDF
Cheat sheets for data scientists
PDF
Introduction To R
DOCX
Ajay ohri Resume
Download Python for R Users pdf for free
Introduction to R
R for data analytics
Open source analytics
LSESU a Taste of R Language Workshop
Cheat sheets for data scientists
Introduction To R
Ajay ohri Resume

What's hot (19)

PPTX
Introduction to r
PPTX
Introduction to statistical software R
PDF
R tutorial
PPTX
R programming
PDF
2 it unit-1 start learning r
PPTX
Reason To learn & use r
PDF
Data analytics using the cloud challenges and opportunities for india
PPTX
R Programming
PPTX
R programming
PPT
R programming
PPTX
R language tutorial
PDF
A Data Science Tutorial in Python
PDF
Turbocharge your data science with python and r
PPTX
R and Rcmdr Statistical Software
PDF
Introduction To Data Science With Python
PPTX
The use of R statistical package in controlled infrastructure. The case of Cl...
PPTX
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
PPTX
Financial Risk Mgt - Lec 1 by dr. syed muhammad ali tirmizi
PDF
Weka tutorial
Introduction to r
Introduction to statistical software R
R tutorial
R programming
2 it unit-1 start learning r
Reason To learn & use r
Data analytics using the cloud challenges and opportunities for india
R Programming
R programming
R programming
R language tutorial
A Data Science Tutorial in Python
Turbocharge your data science with python and r
R and Rcmdr Statistical Software
Introduction To Data Science With Python
The use of R statistical package in controlled infrastructure. The case of Cl...
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 1 by dr. syed muhammad ali tirmizi
Weka tutorial
Ad

Similar to Introduction to R ajay Ohri (20)

PPTX
Realtime usage and Applications of R.pptx
PPT
An introduction to R is a document useful
PDF
Data analytics using R programming
PPTX
R programming presentation
PDF
Executive Intro to R
PPTX
Big data analytics with R tool.pptx
PDF
R Programming Overview
PPTX
R_L1-Aug-2022.pptx
PDF
Learn Business Analytics with R at edureka!
PPTX
BIG DATA ANALYTICS USING R
PPTX
R as supporting tool for analytics and simulation
PDF
Introduction to R
PPTX
Intro to data science module 1 r
PDF
Unit1_Introduction to R.pdf
PDF
In-Database Analytics Deep Dive with Teradata and Revolution
PPTX
A Workshop on R
PDF
R Intro
PPTX
Introduction to basic statistics
 
PPTX
R programming for psychometrics
PDF
R - the language
Realtime usage and Applications of R.pptx
An introduction to R is a document useful
Data analytics using R programming
R programming presentation
Executive Intro to R
Big data analytics with R tool.pptx
R Programming Overview
R_L1-Aug-2022.pptx
Learn Business Analytics with R at edureka!
BIG DATA ANALYTICS USING R
R as supporting tool for analytics and simulation
Introduction to R
Intro to data science module 1 r
Unit1_Introduction to R.pdf
In-Database Analytics Deep Dive with Teradata and Revolution
A Workshop on R
R Intro
Introduction to basic statistics
 
R programming for psychometrics
R - the language
Ad

Introduction to R ajay Ohri

  • 2. Know your Instructor ● Author "R for Business Analytics" ● Author “ R for Cloud Computing” ● Founder "Decisionstats.com" ● University of Tennessee, Knoxville MS (courses in statistics and computer science) ● MBA (IIM Lucknow,India-2003) ● B.Engineering (DCE 2001) http://linkedin.com/in/ajayohri
  • 3. Classroom Rules • From Instructor • From Audience – mobile phones should be kindly switched off • Yes, this includes Whatsapp – Ask Questions at end of session – Take Notes – Please Take Notes
  • 4. What is data science ? Hacking ( Programming) + Maths/Statistics + Domain Knowledge = Data Science http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  • 5. Oh really, is this a Data Scientist ? a data scientist is simply a person who can write code = in R,Python,Java, SQL, Hadoop (Pig,HQL,MR) etc = for data storage, querying, summarization, visualization = how efficiently, and in time (fast results?) = where on databases, on cloud, servers and understand enough statistics to derive insights from data so business can make decisions
  • 6. Data Science with R A popular language in Data Science
  • 7. What Is R https://www.r-project.org/about.html R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes ● an effective data handling and storage facility, ● a suite of operators for calculations on arrays, in particular matrices, ● a large, coherent, integrated collection of intermediate tools for data analysis, ● graphical facilities for data analysis and display either on-screen or on hardcopy, and ● a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
  • 10. Statistical Software Landscape SAS Python (Pandas) IBM SPSS R Julia Clojure Octave MATLAB JMP EViews
  • 11. Using R with other software https://rforanalytics.wordpress.com/useful-links-for-r/using-r-from-other-software/ Tableau http://www.tableausoftware.com/new-features/r-integration Qlik http://qliksolutions.ru/qlikview/add-ons/r-connector-eng/ Oracle R http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html Rapid Miner https://rapid-i.com/content/view/202/206/lang,en/#r JMP http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html
  • 12. Using R with other software https://rforanalytics.wordpress.com/useful-links-for-r/using-r-from-other-software/ SAS/IML http://www.sas.com/technologies/analytics/statistics/iml/index.html Teradata http://developer.teradata.com/applications/articles/in-database-analytics-with-teradata-r Pentaho http://bigdatatechworld.blogspot.in/2013/10/integration-of-rweka-with-pentaho-data.html IBM SPSS https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=ibm-analytics&S_PKG=ov18855&S_TACT=M161003W&dynform=127&lang=en_US TIBCO TERR http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr
  • 13. Some Advantages of R open source free large number of algorithms and packages esp for statistics flexible very good for data visualization superb community rapidly growing can be used with other software
  • 14. Some Disadvantages of R in memory (RAM) usage steep learning curve some IT departments frown on open source verbose documentation tech support evolving ecosystem for corporates
  • 15. Solutions for Disadvantages of R in memory (RAM) usage specialized packages, in database computing steep learning curve TRAINING !!! some IT departments frown on open source TRAINING and education! verbose documentation CRAN View , R Documentation tech support expanding pool of resources evolving ecosystem for corporates getting better with MS et al
  • 16. R used by Government ● In the early days of the Deepwater Horizon disaster, NIST used uncertainty analysis in R to harmonize spill estimates from various sources, and to provide ranges of estimates to other agencies and the media. ● Before new drugs are allowed on the market, the FDA works with pharmaceutical companies to verify safety and efficacy through clinical trials. Despite a false perception that only commercial software may be used, many pharmaceutical companies are now using open-source R to analyze data from clinical trials. ● The National Weather Service uses R for research and development of models to predict river flooding. ● The newly-formed Consumer Financial Protection Bureau -- freed from the restrictions of a legacy IT infrastructure -- is championing the use of open-source technologies in government. ● Local governments are also building data-based applications. The SF Estuary Institute uses R and Google Maps to provide a tool to track pollution in the San Francisco Bay area. http://gsnmagazine.com/node/26483?c=cyber_security
  • 17. R used by Telecom ● Churn using Social Network Analysis http://www.slideshare.net/dataspora/social-network-analysis-for-telecoms
  • 18. R used by Insurance a few more insurance related packages: ● ChainLadder – Reserving methods in R. The package provides Mack-, Munich-, Bootstrap, and Multivariate-chain-ladder methods, as well as the LDF Curve Fitting methods of Dave Clark and GLM-based reserving models. ● cplm – Monte Carlo EM algorithms and Bayesian methods for fitting Tweedie compound Poisson linear models ● lossDev – A Bayesian time series loss development model. Features include skewed-t distribution with time-varying scale parameter, Reversible Jump MCMC for determining the functional form of the consumption path, and a structural break in this path; by Christopher W. Laws and Frank A. Schmid ● actuar: Loss distributions modelling, risk theory (including ruin theory), simulation of compound hierarchical models and credibility theory; check out the actuar package by C. Dutang, V. Goulet and M. Pigeon. ● favir: Formatted Actuarial Vignettes in R. FAViR lowers the learning curve of the R environment. It is a series of peer-reviewed Sweave papers that use a consistent style. ● mondate: R package to keep track of dates in terms of months ● lifecontingencies – Package to perform actuarial evaluation of life contingencies and Introduction to R for Actuaries by Nigel de Silva and http://www.rininsurance.com/
  • 20. R in Finance http://cran.r-project.org/web/views/Finance.html This CRAN Task View contains a list of packages useful for empirical work in Finance, grouped by topic. ● The Rmetrics suite of packages comprises fArma, fAsianOptions, fAssets, fBasics, fBonds, timeDate (formerly: fCalendar), fCopulae, fExoticOptions, fExtremes, fGarch, fImport, fNonlinear, fOptions, fPortfolio, fRegression, timeSeries (formerly: fSeries), fTrading, fUnitRoots and contains a very large number of relevant functions for different aspects of empirical and computational finance. ● The RQuantLib package provides several option-pricing functions as well as some fixed-income functionality from the QuantLib project to R. ● The quantmod package offers a number of functions for quantitative modelling in finance as well as data acquisition, plotting and other utilities. ● The portfolio package contains classes for equity portfolio management; the portfolioSim builds a related simulation framework. The backtest offers tools to explore portfolio-based hypotheses about financial instruments. The stockPortfolio package provides functions for single index, constant correlation and multigroup models. The pa package offers performance attribution functionality for equity portfolios. ● The PerformanceAnalytics package contains a large number of functions for portfolio performance calculations and risk management.
  • 21. R in Pharma http://blog.revolutionanalytics.com/2013/08/r-drug-development-and-the-fda.html Opening the Doors to Open Source Programming in Drug Development. R: Regulatory Compliance and Validation Issues, A Guidance Document for the Use of R in Regulated Clinical Trial Environments. At useR 2012, FDA statistician Jea Brodsky presented a poster describing how FDA scientists “use R on a daily basis” and have themselves written R packages for use at various stages in the drug submission process. Open Source Software in the Biopharma Industry: Challenges and Opportunities.
  • 23. R in Pharma http://cran.r-project.org/web/views/ClinicalTrials.html This task view gathers information on specific R packages for design, monitoring and analysis of data from clinical trials. It focuses on including packages for clinical trial design and monitoring in general plus data analysis packages for a specific type of design.
  • 24. Companies using R from http://www.revolutionanalytics.com/companies-using-r ANZ, the fourth largest bank in Australia, using R for credit risk analysis Bank of America uses R for reporting. The Consumer Financial Protection Bureau uses R for data analysis. Facebook Facebook and R: ● Analysis of Facebook Status Updates ● Facebook's Social Network Graph ● How Google and Facebook are using R ● Predicting Colleague Interactions with R
  • 25. Refresher in Statistics Mean Arithmetic Mean- the sum of the values divided by the number of values. The geometric mean is an average that is useful for sets of positive numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean) e.g. rates of growth. Median- the median is the number separating the higher half of a data sample, a population, or a probability distribution, from the lower half. Mode- The "mode" is the value that occurs most often.
  • 26. Refresher in Statistics Range the range of a set of data is the difference between the largest and smallest values. Variance mean of squares of differences of values from mean Standard Deviation square root of its variance Frequency a frequency distribution is a table that displays the frequency of various outcomes in a sample.
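The refresher definitions above map directly onto base R. The following is a minimal sketch on a made-up vector `x` (an assumption for illustration, not data from the slides). Two points are worth noting: R's `var()` and `sd()` divide by n - 1 (the sample versions), whereas the definition on the slide divides by n; and base R has no built-in function for the statistical mode, so it is derived here from `table()`.

```r
# Hypothetical sample data for illustration only
x <- c(2, 4, 4, 5, 7, 8)

arith_mean <- mean(x)                    # sum(x) / length(x)
geo_mean   <- exp(mean(log(x)))          # geometric mean, positive values only
med        <- median(x)                  # middle value of the sorted data
freq       <- table(x)                   # frequency distribution of the values
mode_val   <- as.numeric(names(freq)[which.max(freq)])  # most frequent value

data_range <- max(x) - min(x)            # largest minus smallest
pop_var    <- mean((x - mean(x))^2)      # variance as defined on the slide (divide by n)
samp_var   <- var(x)                     # R's var() divides by n - 1
samp_sd    <- sd(x)                      # square root of var(x)
```

For this vector the slide's definition gives a variance of 4, while `var(x)` returns 4.8; the gap shrinks as the sample grows.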
  • 27. Distributions Normal The simplest case of a normal distribution is known as the standard normal distribution. This is a special case where μ=0 and σ=1,
  • 28. Refresher in Statistics Probability Distribution The probability density function (pdf) of the normal distribution, also called Gaussian or "bell curve", the most important continuous random distribution. As notated on the figure, the probabilities of intervals of values correspond to the area under the curve.
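The statement that interval probabilities correspond to areas under the curve can be checked in base R. This sketch uses `dnorm()` for the density and `pnorm()` for the cumulative probability of the standard normal (μ=0, σ=1); the interval end points are chosen only for illustration.

```r
# Height of the standard normal density at its peak
peak <- dnorm(0, mean = 0, sd = 1)       # equals 1 / sqrt(2 * pi)

# Probability of an interval = difference of cumulative probabilities
p_within_1sd <- pnorm(1) - pnorm(-1)     # about 0.683
p_within_2sd <- pnorm(2) - pnorm(-2)     # about 0.954

# The same probability obtained as an area, by integrating the density
area <- integrate(dnorm, lower = -1, upper = 1)$value
```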
  • 29. Pre Requisites • Installation of R http://cran.rstudio.com/bin/windows/base/ • R Studio • R Packages
  • 30. Pre Requisites • Installation of R – Rtools – http://cran.rstudio.com/bin/windows/Rtools/ • R Studio • R Packages
  • 31. Pre Requisites • Installation of R – RTools • R Studio http://www.rstudio.com/products/rstudio/download/ • R Packages
  • 32. Pre Requisites • Installation of R – RTools • R Studio http://www.rstudio.com/products/rstudio/download/ • R Packages: about eight packages are supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics.
  • 33. CRAN 107 sites in 49 regions
  • 39. Pre Requisites • R Packages install.packages() INSTALLS update.packages() UPDATES library() LOADS • Packages are installed once, updated periodically, but loaded every time
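The install-once, load-every-time pattern can be sketched as follows. The install and update calls are left commented out because they need an internet connection; the package name `data.table` is only an example. The `stats` package is used for the load step because it ships with R.

```r
# One-time install from CRAN (needs internet), left commented out:
# install.packages("data.table")

# Periodic update of everything already installed, also commented out:
# update.packages(ask = FALSE)

# Loading must be repeated in every new R session
library(stats)                      # 'stats' ships with R, so no install is needed

# Check what is installed and what is currently attached
is_installed <- "stats" %in% rownames(installed.packages())
is_loaded    <- "package:stats" %in% search()
```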
  • 40. Interfaces to R • Console Default Customization • IDE • GUI
  • 41. Graphical Interfaces to R • R Commander • Rattle • Deducer
  • 42. Installation of R Commander
  • 43. Overview of R Commander
  • 44. Demo R Commander – 3D Graphs
  • 49. Installation of Rattle • GTK+ Installation Necessary • Install other packages when prompted
  • 53. RStudio RStudio Desktop offers the following advantages over the native R console ● Syntax highlighting, code completion, and smart indentation ● Execute R code directly from the source editor ● Quickly jump to function definitions ● Easily manage multiple working directories using projects ● Integrated R help and documentation ● Interactive debugger to diagnose and fix errors quickly ● Extensive package development tools http://www.rstudio.com/products/
  • 54. RStudio RStudio Server enables you to provide a browser based interface (the RStudio IDE) to a version of R running on a remote Linux server. Deploying R and RStudio on a server has a number of benefits, including: ● The ability to access your R workspace from any computer in any location; ● Easy sharing of code, data, and other files with colleagues; ● Allowing multiple users to share access to the more powerful compute resources (memory, processors, etc.) available on a well equipped server; and ● Centralized installation and configuration of R, R packages, TeX, and other supporting libraries.
  • 60. R Community ● email groups http://www.r-project.org/mail.html R-announce R-help R-package-devel R-devel R-packages Special Interest Groups ● Stack Overflow [r] ● Twitter #rstats ● Blogs at http://www.r-bloggers.com/ (573 blogs)
  • 64. Functions Used in this Lesson function(x) for library install.packages update.packages ls rm print
  • 65. Citations and References > citation() To cite R in publications use: R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
  • 66. Introductory R > Sys.Date() [1] "2015-05-10" > Sys.time() [1] "2015-05-10 18:28:32 IST"
  • 67. R as a Calculator Basic Math on R Console • + • - • * • / • () • log • exp • mean • sum • sd • median
  • 68. Demo- Basic Math on R Console • + • - • log • exp • * • / Hint- Ctrl+L clears screen
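A short console session covering the operators and functions listed above might look like this. The numbers are arbitrary and chosen only for illustration.

```r
# Arithmetic with operators and parentheses
a <- (2 + 3) * 4 / 5 - 1     # parentheses are evaluated first

# log() is the natural logarithm; exp() is its inverse
b <- log(exp(2))

# Summary functions work on vectors created with c()
v <- c(10, 20, 30, 40)       # hypothetical values
total   <- sum(v)
average <- mean(v)
middle  <- median(v)
spread  <- sd(v)             # sample standard deviation
```

Function names are case sensitive, so `log` and `exp` must be typed in lower case.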
  • 69. Demo- Basic Objects on R Console • + • - • log • exp • * • / • () Hint- Up arrow gives you last typed command Functions- ls() – what objects are here rm("foo") removes object named foo Assignment- using = or <- assigns values to object names
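The object basics on this slide can be sketched in a few lines. The object names `foo`, `bar` and `baz` are placeholders. R accepts `=`, the more common `<-`, and the rightward form `->`, which puts the value on the left and the name on the right.

```r
foo = 5              # assignment with =
bar <- foo * 2       # assignment with <- (the more common style)
foo + bar -> baz     # rightward assignment: value on the left, name on the right

objs_before <- ls()  # names of the objects in the current environment
rm("foo")            # remove the object named foo
foo_exists <- exists("foo", inherits = FALSE)   # FALSE once foo is removed
```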
  • 70. Functions and Loops • Loops for (number in 1:5){ print (number) }
  • 71. Functions and Loops • Function functionajay=function(a)(a^2+2*a+1) Hint: Always match brackets Each ( deserves a ) Each { deserves a } Each [ deserves a ]
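The loop and the function from the last two slides can be combined. This sketch reuses the slide's `functionajay` (which computes a^2 + 2a + 1, the same as (a + 1)^2) and stores each result in a vector named `results`, a name introduced here for illustration.

```r
# The slide's function: a^2 + 2*a + 1
functionajay <- function(a) (a^2 + 2 * a + 1)

# The slide's loop, extended to store and print the function's results
results <- numeric(5)
for (number in 1:5) {
  results[number] <- functionajay(number)
  print(results[number])
}

# Most R functions are vectorised, so the loop can be replaced by one call
results_vec <- functionajay(1:5)
```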
  • 72. Other sources to learn R swirlstats http://swirlstats.com/ datacamp https://www.datacamp.com/ codeschool http://tryr.codeschool.com/ coursera https://www.coursera.org/course/compdata https://www.coursera.org/course/rprog
  • 73. Good coding practices • Use # for comment • Use git for version control • Use RStudio for multiple lines of code
  • 74. Functions in R • custom functions • source code for a function • Understanding help ? , ??
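As a sketch of these three points: a custom function (the name `scale_to_unit` is hypothetical), two ways to inspect a function's definition programmatically, and the help operators. The help calls are commented out because they open the help viewer.

```r
# A custom function with a default argument
scale_to_unit <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])     # rescale values to the interval [0, 1]
}

scaled <- scale_to_unit(c(5, 10, 15))

# Typing a function's name without parentheses prints its source;
# formals() and body() return the pieces programmatically
fn_args <- names(formals(scale_to_unit))
fn_body <- body(scale_to_unit)

# Help lookups:
# ?mean          # help page for a known function name
# ??regression   # search installed help pages for a topic
```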
  • 76. Learning Objectives ● how to input data in R using various ways ● how to check for correct data input ● how to use special packages for fast data input ● how to input data from statistical file formats ● how to input data from databases ● how to input data from web (web scraping)
  • 77. What will you learn from this lesson - data input from various kinds of format - efficient data input via various packages - sql to R - web scraping - piping in R - using json in R
  • 78. Environment ls() -lists objects rm()-removes an object gc() -does garbage collection and frees up memory
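The effect of these three functions is easiest to see with a deliberately large object. In this sketch `big` is a hypothetical one-million-element numeric vector.

```r
big <- numeric(1e6)                       # hypothetical large object, roughly 8 MB
size_bytes <- as.numeric(object.size(big))

had_big <- "big" %in% ls()                # ls() lists object names
rm(big)                                   # rm() removes the object
has_big <- "big" %in% ls()

mem <- gc()                               # gc() runs a collection and returns a usage matrix
```

R normally collects garbage on its own; calling `gc()` by hand is mainly useful for seeing the memory report after removing large objects.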
  • 81. File System getwd()- get working directory setwd()- set or change working directory dir() - lists files in working directory
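A minimal sketch of the working-directory functions, using R's temporary directory so that nothing outside the session is touched. The file name `example.txt` is made up for the demonstration.

```r
old_wd <- getwd()                     # remember where we started

tmp <- tempdir()                      # a scratch directory that always exists
setwd(tmp)                            # change the working directory

writeLines("hello", "example.txt")    # hypothetical file, created for the demo
files <- dir()                        # list files in the working directory

unlink("example.txt")                 # clean up the demo file
still_there <- file.exists("example.txt")
setwd(old_wd)                         # restore the original working directory
```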
  • 86. Data Input read.table() or read.csv() read.spss() read.sas7bdat()
  • 88. Statistical formats • read.spss from foreign package • read.sas7bdat from sas7bdat package
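For plain delimited text, base R is enough. The sketch below writes a small made-up CSV to a temporary file so that it runs anywhere, reads it back with `read.csv()` and `read.table()`, and then checks that the input arrived as expected. The column names and values are assumptions for illustration.

```r
# Write a small hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,city",
             "Asha,31,Delhi",
             "Ravi,27,Mumbai",
             "Meena,,Chennai"), csv_path)

# read.csv() is read.table() with header = TRUE and sep = "," preset
people  <- read.csv(csv_path, stringsAsFactors = FALSE)
people2 <- read.table(csv_path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

# Check for correct data input
str(people)                            # column names and types
head(people)                           # first rows
n_missing <- sum(is.na(people$age))    # empty numeric fields are read as NA
```

`read.spss()` and `read.sas7bdat()` are called the same way, with a file path as the first argument, once the foreign and sas7bdat packages are loaded.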
  • 92. readr from Hadley The goal of readr is to provide a fast and friendly way to read tabular data into R. The most important functions are: ● Read delimited files: read_delim(), read_csv(), read_tsv(), read_csv2(). ● Read fixed width files: read_fwf(), read_table(). ● Read lines: read_lines(). ● Read whole file: read_file(). ● Re-parse existing data frame: type_convert(). https://github.com/hadley/readr
  • 93. readr from Hadley Source Data - https://bit.ly/dsdata https://github.com/hadley/readr
  • 94. readxl from Hadley Readxl supports both the legacy .xls format and the modern xml-based .xlsx format. .xls support is made possible with the libxls C library, which abstracts away many of the complexities of the underlying binary format. To parse .xlsx, we use the RapidXML C++ library. read_excel("my-old-spreadsheet.xls") read_excel("my-new-spreadsheet.xlsx") read_excel("my-spreadsheet.xls", sheet = "data") read_excel("my-spreadsheet.xls", sheet = 2) read_excel("my-spreadsheet.xls", na = "NA") https://github.com/hadley/readxl
  • 95. data.table fread is the fastest way to read data
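A hedged sketch of `fread()` next to base `read.csv()`. The data.table call is guarded with `requireNamespace()` so that the script still runs when the package is not installed. The file is a generated temporary CSV, not data from the slides, and no timing claim is made here.

```r
# Generate a hypothetical CSV in a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = rnorm(1000)),
          csv_path, row.names = FALSE)

base_df <- read.csv(csv_path)               # base R reader, returns a data.frame

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(csv_path)         # returns a data.table
  same_shape <- identical(dim(dt), dim(base_df))
}
```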
  • 98. Some learnings 1. Multiple packages can do the same thing faster or slower in R 2. Knowing the right package is the essential difference as a data scientist 3. Putting code within system.time() helps measure speed; also see http://adv-r.had.co.nz/Profiling.html for advanced ways to speed up code
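Point 3 can be illustrated without extra packages. This sketch times two base R ways of computing the same sum of squares, an explicit loop and a vectorised call; the same `system.time()` wrapper works for comparing functions from different packages. The value of `n` is arbitrary.

```r
n <- 1e5

# Approach 1: accumulate the result inside an explicit loop
slow <- system.time({
  total_loop <- 0
  for (i in seq_len(n)) total_loop <- total_loop + i^2
})

# Approach 2: the vectorised equivalent
fast <- system.time({
  total_vec <- sum(seq_len(n)^2)
})

# "elapsed" is wall-clock time in seconds
timings <- c(loop = slow[["elapsed"]], vectorised = fast[["elapsed"]])
```

Timings vary by machine and by run, so it is worth repeating the measurement a few times before drawing conclusions.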
  • 100. Creating DSN (in Windows) A Data Source Name (DSN) is the logical name that is used by Open Database Connectivity (ODBC) to refer to the drive and other information that is required to access data. The name is used by Internet Information Services for a connection to an ODBC data source, such as a Microsoft SQL Server database. https://support.microsoft.com/en-us/kb/kbview/300596
  • 101. Creating DSN (in Windows) 1. Click Start, point to Control Panel, double-click Administrative Tools, and then double-click Data Sources(ODBC). 2. Click the System DSN tab, and then click Add. 3. Click the database driver that corresponds with the database type to which you are connecting, and then click Finish. 4. Type the data source name. Make sure that you choose a name that you can remember. You will need to use this name later. 5. Click Select. 6. Click the correct database, and then click OK. 7. Click OK, and then click OK. https://support.microsoft.com/en-us/kb/kbview/300596
  • 106. RODBC > library(RODBC) > odbcDataSources() > ajay=odbcConnect("MySQL",uid="root",pwd="XX") > ajay > sqlTables(ajay) > tested=sqlFetch(ajay,"host")
  • 107. From Databases The RODBC package provides access to databases through an ODBC interface. The primary functions are • odbcConnect(dsn, uid="", pwd="") Open a connection to an ODBC database • sqlFetch(channel, sqltable) Read a table from an ODBC database into a data frame Hint- a good site to revise R http://www.statmethods.net
  • 108. sqlite http://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf RSQLite embeds the SQLite database engine in R and provides an interface compliant with the DBI package. SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed database engine in the world. library(RSQLite) con <- dbConnect(SQLite(), dbname = "sample_db") # read csv file into sql database dbWriteTable(con, name="sample_data", value="sample_data.csv", row.names=FALSE, header=TRUE, sep = ",") http://cran.r-project.org/web/packages/sqldf/index.html Manipulate R data frames using SQL. read.csv.sql in the sqldf package imports data into a temporary SQLite database and then reads it into R.
  • 109. A Detour to SQL Joins (Optional)
  • 110. RMySQL install.packages("RMySQL") library(RMySQL) mydb = dbConnect(MySQL(), user='user', password='password', dbname='database_name', host='host') dbListTables(mydb) dbListFields(mydb, 'some_table') dbSendQuery(mydb, 'drop table if exists some_table, some_other_table') dbWriteTable(mydb, name='table_name', value=data.frame.name)
  • 111. Other databases Teradata https://github.com/Teradata/teradataR PostgreSQL http://cran.r-project.org/web/packages/RPostgreSQL/ MongoDB http://cran.r-project.org/web/packages/mongolite/index.html couchDB http://cran.r-project.org/web/packages/couchDB/index.html MonetDB http://cran.r-project.org/web/packages/MonetDB.R/index.html
  • 112. Other data sources Cassandra with R http://cran.r-project.org/web/packages/RCassandra/RCassandra.pdf Neo4j with R http://things-about-r.tumblr.com/post/47392314578/venue-recommendation-a-simple-use-case-connecting-r R with Hadoop Stack https://github.com/RevolutionAnalytics/RHadoop/wiki ● NEW! ravro - read and write files in avro format ● plyrmr - higher level plyr-like data processing for structured data, powered by rmr ● rmr - functions providing Hadoop MapReduce functionality in R ● rhdfs - functions providing file management of the HDFS from within R ● rhbase - functions providing database management for the HBase distributed database from within R https://amplab-extras.github.io/SparkR-pkg/ SparkR is an R package to use Spark from R.
  • 113. Web Scraping Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. example - python (scrapy and beautiful soup)
  • 114. Web Scraping • readLines Hint: R is case sensitive, so readlines is not the same as readLines Hint: Use head() and tail() to inspect objects Other packages are XML and RCurl Case Study- http://decisionstats.com/2013/04/14/using-r-for-cricket-analysis-rstats/
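A minimal offline sketch of the `readLines()` approach. A tiny made-up HTML page is written to a temporary file so that the example runs without a network connection; passing a URL string to `readLines()` instead of a file path reads a live page in the same way. The page content and the regular expression are assumptions for illustration only.

```r
# A tiny hypothetical HTML page saved locally
page_path <- tempfile(fileext = ".html")
writeLines(c("<html><body>",
             "<h1>Scores</h1>",
             "<a href=\"/match/1\">Match 1</a>",
             "<a href=\"/match/2\">Match 2</a>",
             "</body></html>"), page_path)

html <- readLines(page_path)          # one character string per line
head(html, 2)                         # inspect the start of the object
tail(html, 1)                         # inspect the end

# Pull out the link targets with a regular expression
link_lines <- grep("href=", html, value = TRUE)
links <- sub('.*href="([^"]*)".*', "\\1", link_lines)
```

Regular expressions are fragile on real pages; a parser-based package such as XML or rvest is usually the safer choice beyond small cases.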
  • 115. curl cURL is a computer software project providing a library and command-line tool for transferring data using various protocols. The cURL project produces two products, libcurl and cURL. The RCurl package is an R-interface to the libcurl library that provides HTTP facilities. This allows us to download files from Web servers, post forms, use HTTPS (the secure HTTP), use persistent connections, upload files, use binary content, handle redirects, password authentication, etc. The primary top-level entry points are ● getURL() ● getURLContent() ● getForm() ● postForm() http://www.omegahat.org/RCurl/RCurlJSS.pdf
  • 116. RCurl
  • 117. XML
  • 118. json format jsonlite for json data http://arxiv.org/abs/1403.2805
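A small round-trip sketch with jsonlite's `toJSON()` and `fromJSON()`. The data frame is made up for illustration, and the calls are guarded so that the script still runs when jsonlite is not installed.

```r
if (requireNamespace("jsonlite", quietly = TRUE)) {
  # Hypothetical data frame
  df <- data.frame(name = c("a", "b"), score = c(1.5, 2),
                   stringsAsFactors = FALSE)

  json_text <- jsonlite::toJSON(df)          # data frame -> JSON array of records
  df_back   <- jsonlite::fromJSON(json_text) # JSON -> data frame again

  round_trip_ok <- isTRUE(all.equal(df, df_back))
}
```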
  • 120. Using APIs for data https://ropensci.org/
  • 121. ff package http://cran.r-project.org/web/packages/ff/index.html The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory - the effective virtual memory consumption per ff object. http://cran.r-project.org/web/packages/ffbase/index.html Basic (statistical) functionality for package ff Example- http://www.bnosac.be/index.php/blog/22-if-you-are-into-large-data-and-work-a-lot-package-ff > require(ffbase) > hhp <- read.table.ffdf(file="/home/jan/Work/RForgeBNOSAC/github/RBelgium_HeritageHealthPrize/Data/Claims.csv", FUN = "read.csv", na.strings = "") Also see http://cran.r-project.org/web/packages/bigmemory/index.html Create, store, access, and manipulate massive matrices. Matrices are allocated to shared memory and may use memory-mapped files. Packages biganalytics, bigtabulate, synchronicity, and bigalgebra provide advanced functionality
  • 122. RevoScaleR package RevoScaleR has its own file format, XDF, which is able to rapidly access data by row or by column and to read some data sequentially. XDF file data is stored in the same binary format used in memory, which eliminates the need for conversion when it is brought into memory. http://www.revolutionanalytics.com/revolution-r-enterprise-scaler
  • 123. rhdf5 This R/Bioconductor package provides an interface between HDF5 and R. HDF5's main features are the ability to store and access very large and/or complex datasets and a wide variety of metadata on mass storage (disk) through a completely portable file format. http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. https://www.hdfgroup.org/HDF5/ HDF5 simplifies the file structure to include only two major types of object: ● Datasets, which are multidimensional arrays of a homogeneous type ● Groups, which are container structures which can hold datasets and other groups
  • 125. Functions Used in this lesson ● toJSON and fromJSON
  • 129. Citations and References M Dowle, T Short, S Lianoglou, A Srinivasan with contributions from R Saporta and E Antonyan (2014) data.table: Extension of data.frame. R package version 1.9.4. http://CRAN.R-project.org/package=data.table Jeroen Ooms (2014). The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects. arXiv:1403.2805 [stat.CO] URL http://arxiv.org/abs/1403.2805 Hadley Wickham (2015). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.2.0. http://CRAN.R-project.org/package=rvest