Basic Analysis Using R
SECTION 1
Descriptive Statistics
Summarizing Data
Data Snapshot
basic_salary_R3
The data has 41 rows and 7 columns
Data Description
Variable     Description
First_Name   First Name
Last_Name    Last Name
Grade        Grade
Location     Location
Function     Department
ba           Basic Allowance
ms           Management Supplements
Describing Variable
#Importing Data
salary<-read.csv(file.choose(), header=TRUE, stringsAsFactors=TRUE)
#Checking the variable features using summary function
summary(salary)
   First_Name    Last_Name     Grade     Location      Function
 Kavita : 2    Joshi  : 2    GR1 :23   DELHI :17   FINANCE  :13
 Mahesh : 2    Shah   : 2    GR2 :17   MUMBAI:21   SALES    :15
 Nishi  : 2    Singh  : 2    NA's: 1   NA's  : 3   TECHNICAL:11
 Priya  : 2    Arora  : 1                          NA's     : 2
 Ajit   : 1    Bhide  : 1
 Ameet  : 1    Bhutala: 1
 (Other):31    (Other):32
       ba              ms
 Min.   :10940   Min.   : 2700
 1st Qu.:13785   1st Qu.:10450
 Median :16230   Median :12420
 Mean   :17210   Mean   :11939
 3rd Qu.:19305   3rd Qu.:14200
 Max.   :29080   Max.   :16970
 NA's   :2       NA's   :4
summary() gives descriptive measures for numeric variables and category counts for factor variables, including the number of missing values (NA's).
From R 4.0.0 onwards read.csv() reads text columns as character by default, so stringsAsFactors=TRUE is needed to obtain the category counts shown above.
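Since summary() reports NA's column by column, it can help to count them directly before computing any measure. The lines below are a minimal sketch on a small invented data frame (toy and its values are assumptions, not the salary data).

```r
# Hypothetical stand-in for the salary data (values invented for illustration)
toy <- data.frame(
  Grade = c("GR1", "GR2", NA, "GR1"),
  ba    = c(12000, NA, 15000, 18000),
  stringsAsFactors = TRUE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# summary() on a single numeric column returns the measures as a named vector
ba_summary <- summary(toy$ba)
ba_summary
```

Individual measures can then be picked out by name, for example ba_summary[["Median"]].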
Measures of Central Tendency
# Mean
mean(salary$ba)
[1] NA
mean(salary$ba,na.rm=TRUE)
[1] 17209.74
mean() gives the mean of the variable. Because ba contains missing values, mean() returns NA unless na.rm=TRUE is used to exclude them.
# Median
median(salary$ba,na.rm=TRUE)
[1] 16230
median() gives the median of the variable.
# Trimmed Mean
trimmed_mean<-mean(salary$ba,trim=0.10,na.rm=TRUE)
trimmed_mean
[1] 16879.09
The trimmed mean is a useful measure in the presence of outliers.
Using trim=0.10 in mean() drops 10% of the observations from each end of the sorted data before the mean is taken.
# Mode
names(sort(-table(salary$ba)))[1]
[1] "10940"
There is no direct function in R for the mode. Hence, we first obtain counts for each value using table() and use sort() on the negated counts to bring the highest frequency to the front.
names() then returns the value of the first (highest frequency) entry, as a character string.
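The one-line mode above returns a single value even when several values share the highest frequency. A possible helper that returns all tied modes is sketched below; the name stat_mode and the sample values are assumptions made for illustration.

```r
# stat_mode is a hypothetical helper, not a base R function
stat_mode <- function(x) {
  x <- x[!is.na(x)]                              # drop missing values
  counts <- table(x)                             # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]  # keep every value tied for the top
  if (is.numeric(x)) as.numeric(modes) else modes
}

stat_mode(c(10940, 12000, 10940, NA, 15000))  # one mode
stat_mode(c(1, 1, 2, 2, 3))                   # two values tied
```

On the salary data this would be called as stat_mode(salary$ba).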
Measures of Variation
# Range / Inter Quartile Range
r<-range(salary$ba,na.rm=TRUE)
r
[1] 10940 29080
range() gives the minimum and maximum values of the variable
diff(r)
[1] 18140
diff() calculates the differences between consecutive values of a vector; applied to the two values returned by range(), it gives maximum minus minimum
IQR(salary$ba,na.rm=TRUE)
[1] 9070
IQR() gives the Inter-Quartile range of the variable
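The inter-quartile range is the third quartile minus the first quartile, so IQR() can be cross-checked against quantile(). A minimal sketch on invented values (x is an assumption, not the salary data):

```r
x <- c(10, 12, 15, 18, 20, 25, 40, NA)   # invented values with one missing entry

# First and third quartiles, ignoring the missing value
q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
q

# Inter-quartile range by definition and with the built-in function
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand
IQR(x, na.rm = TRUE)
```

The same check can be run on any variable to confirm that the quartiles reported by summary() and the value returned by IQR() agree.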
# Standard Deviation / Variance
sd(salary$ba,na.rm=TRUE)
[1] 4159.515
var(salary$ba,na.rm=TRUE)
[1] 17301567
sd() gives the standard deviation of the variable
var() gives the variance of the variable
# Coefficient of Variation
cv<-sd(salary$ba,na.rm=TRUE)/mean(salary$ba,na.rm=TRUE)
cv
[1] 0.2416954
There is no standard function for the coefficient of variation (CV) in base R. Hence we calculate it by definition, as the standard deviation divided by the mean.
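If the coefficient of variation is needed for several variables, the definition can be wrapped in a small function. This is only a sketch; the name coef_var and the sample values are assumptions.

```r
# coef_var is a hypothetical helper: standard deviation divided by mean
coef_var <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

x <- c(10, 20, 30, NA)   # invented values: sd is 10 and mean is 20
coef_var(x)              # CV as a ratio
coef_var(x) * 100        # CV as a percentage
```

On the salary data, coef_var(salary$ba) would reproduce the cv value computed above.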
Skewness and Kurtosis
library(e1071)
The "e1071" package offers a convenient way to find skewness and kurtosis. It has to be installed once with install.packages("e1071") before it can be loaded.
# Skewness
skewness(salary$ba,na.rm=TRUE,type=2)
[1] 0.9033507
skewness() gives the skewness of the variable.
type=2 uses the sample-size adjusted formula (the one used by SAS and SPSS); type=1 is the plain moment based formula.
# Kurtosis
kurtosis(salary$ba,na.rm=TRUE,type=2)
[1] 0.4996513
kurtosis() gives the excess kurtosis of the variable.
type=2 again selects the sample-size adjusted formula.
# Skewness and Kurtosis by Grade
f<-function(x)
c(skew=skewness(x,na.rm=TRUE,type=2),kurt=kurtosis(x,na.rm=TRUE,type=2))
aggregate(ba~Grade,data=salary,FUN=f)
  Grade    ba.skew    ba.kurt
1   GR1 0.85500651 0.62939005
2   GR2 0.08682743 0.34852383
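To see what the type= argument changes, skewness can be computed by hand in base R. The sketch below assumes the formulas given in the e1071 documentation (type 1: m3/m2^(3/2); type 2: type 1 multiplied by sqrt(n(n-1))/(n-2)); the function names and the sample values are invented for illustration.

```r
# Hypothetical helpers; formulas assumed from the e1071 documentation
skew_type1 <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)   # second central moment
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / m2^(3/2)
}

skew_type2 <- function(x) {
  n <- sum(!is.na(x))
  skew_type1(x) * sqrt(n * (n - 1)) / (n - 2)   # sample-size adjustment
}

x <- c(1, 2, 3, 4, 10)   # invented, right-skewed values
skew_type1(x)
skew_type2(x)
```

Both values are positive for this right-skewed sample, and the adjusted (type 2) value is the larger of the two for small n.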
Generating Frequency Tables
# Frequency Tables
freq<-table(salary$Location, salary$Grade)
freq
         GR1 GR2
  DELHI   11   6
  MUMBAI  10  10
table() gives the frequency counts for each combination of the two variables mentioned.
# Percentage Frequency Tables
prop.table(freq)
               GR1       GR2
  DELHI  0.2972973 0.1621622
  MUMBAI 0.2702703 0.2702703
prop.table() expresses each frequency as a proportion of the total count.
prop.table(freq,1) expresses the frequency as a proportion of the row count whereas prop.table(freq,2) would express it as a proportion of the column count. Multiply by 100 to obtain percentages.
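The effect of the margin argument is easiest to see side by side. The sketch below rebuilds the Location by Grade table from the counts printed above rather than from the data file, which is an assumption made so that the lines run on their own.

```r
# Rebuild the 2 x 2 frequency table from the counts shown above
freq <- as.table(matrix(c(11, 10, 6, 10), nrow = 2,
                        dimnames = list(c("DELHI", "MUMBAI"), c("GR1", "GR2"))))

# Percentages within each row (each Location sums to 100)
row_pct <- round(prop.table(freq, 1) * 100, 1)
row_pct

# Percentages within each column (each Grade sums to 100)
col_pct <- round(prop.table(freq, 2) * 100, 1)
col_pct

# Counts with row and column totals added
addmargins(freq)
```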
# Three Way Frequency Table
table1 <- table(salary$Location, salary$Grade, salary$Function)
ftable(table1)
            FINANCE SALES TECHNICAL
DELHI  GR1        4     4         3
       GR2        2     1         3
MUMBAI GR1        2     4         2
       GR2        3     4         3
ftable() flattens the frequency counts of the three variables mentioned into a single table
Generating Cross Tables
# Installing package "gmodels"
install.packages("gmodels")
library(gmodels)
# Frequency Table using "gmodels" package
CrossTable(salary$Grade,salary$Location, prop.r=FALSE, prop.c=FALSE)
CrossTable() gives the cross tabulation (frequency counts) of the two variables mentioned
prop.r=FALSE removes the row proportions from the output and prop.c=FALSE removes the column proportions
SECTION 2
Bivariate Analysis
Data Snapshot
job_proficiency_R3
The data has 25 rows and 6 columns
Data Description
Variable    Description
empno       Employee Number
aptitude    Aptitude Score of the Employee
testofen    Test of English
tech_       Technical Score
g_k_        General Knowledge Score
job_prof    Job Proficiency Score
Scatter Plot
• Study the correlation between Aptitude and Job Proficiency.
job<-read.csv(file.choose(),header=TRUE)
# scatterplot
plot(job$aptitude,job$job_prof,col="red")
plot() gives a scatterplot of the two variables mentioned, with the first on the X axis and the second on the Y axis.
col= sets the color of the points
Pearson Correlation Coefficient
# correlation
cor(job$aptitude,job$job_prof)
[1] 0.5144107
cor() gives the Pearson Correlation Coefficient of the two variables mentioned
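The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this against cor() and also runs cor.test() from base R to get a p-value; the scores are invented and are not the job proficiency data.

```r
# Invented scores for five employees (not the job proficiency data)
aptitude <- c(50, 60, 70, 80, 90)
job_prof <- c(55, 58, 72, 70, 85)

# Pearson correlation by definition and with the built-in function
r_by_hand <- cov(aptitude, job_prof) / (sd(aptitude) * sd(job_prof))
r_builtin <- cor(aptitude, job_prof)
r_by_hand
r_builtin

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(aptitude, job_prof)
ct$p.value
```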
#Scatterplot with Regression Line
attach(job)
plot(aptitude,job_prof, main="ScatterPlot with Regression Line",
xlab="Aptitude", ylab="Job Proficiency", pch=19)
abline(lm(job_prof~aptitude), col="darkorange")
attach() adds the data frame to the search path, so that its columns can be referred to by name without repeating job$ in every call.
plot() in base R yields different types of plots; with two numeric variables it draws a scatterplot
aptitude is the variable plotted on the X axis
job_prof is the variable plotted on the Y axis
main= provides the user defined title of the chart. It has to be put in quotes
xlab= provides a user defined label for the variable on X axis
ylab= provides a user defined label for the variable on Y axis
pch= sets the shape of the data points on the plot
abline() in base R adds a straight line to the plot
lm() fits the linear regression of the first variable mentioned (job_prof) on the second (aptitude); abline() draws the fitted line
col= provides the color of the line plotted
Output: ScatterPlot with Regression Line
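attach() can be avoided by passing the data frame through the data= argument or with(), which keeps it clear where each variable comes from. The lines below are a sketch on an invented data frame (job_toy and its values are assumptions).

```r
# Invented stand-in for the job data
job_toy <- data.frame(aptitude = c(50, 60, 70, 80, 90),
                      job_prof = c(55, 58, 72, 70, 85))

# Fit the regression of job_prof on aptitude without attach()
fit <- lm(job_prof ~ aptitude, data = job_toy)
coef(fit)

# The slope equals cov(x, y) / var(x); with() evaluates inside the data frame
slope_by_hand <- with(job_toy, cov(aptitude, job_prof) / var(aptitude))
slope_by_hand
```

The plotting calls accept the same style: plot(job_prof ~ aptitude, data = job_toy) followed by abline(fit) draws the scatterplot and the fitted line.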
Scatter Plot Matrix
#ScatterPlot Matrix
pairs(~job_prof+aptitude+testofen+tech_+g_k_,data=job,
main="ScatterPlot Matrix",col="darkorange")
pairs() in base R plots every pair of the listed variables against each other
~ starts a one-sided formula in which the variables to be plotted are listed, separated by "+" signs
data= names the data frame that contains the variables
main= provides the user defined title of the chart
SECTION 3
Data Visualisation
Graphs in R
Data Snapshot
telecom_R3
The data has 1000 rows and 10 columns
Data Description
Variable    Description
CustID      Customer ID
Age         Age
Gender      Gender
PinCode     PinCode
Active      Whether the customer was active in past 24 weeks or not
Calls       Number of Calls made
Minutes     Number of minutes spoken
Amt         Amount charged
AvgTime     Mean Time per call
Age_Group   Age Group of the Customer
Simple Bar Chart
#Importing Data
telecom<-read.csv(file.choose(), header=TRUE)
#Creating the Data to be Plotted
telecom1<-aggregate(Calls~Age_Group,data = telecom, FUN=sum)
telecom1
  Age_Group  Calls
1       >45 128870
2     18-30 943187
3     30-45 798721
For plotting a bar chart in R, it is important that the values to be plotted are in a vector or a matrix form.
aggregate() computes the total Calls for each Age_Group and returns a data frame; its Calls column is the vector that is passed to barplot().
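An alternative to aggregate() is tapply(), which returns the totals directly as a named vector that barplot() can label on its own. The sketch below uses a small invented data frame (telecom_toy is an assumption) and also shows one way to put the age groups in their natural order.

```r
# Invented stand-in for the telecom data
telecom_toy <- data.frame(
  Age_Group = c("18-30", "30-45", ">45", "18-30", "30-45"),
  Calls     = c(100, 200, 50, 300, 150)
)

# Total calls per age group as a named vector
calls_by_age <- tapply(telecom_toy$Calls, telecom_toy$Age_Group, sum)
calls_by_age

# Reorder by name so the groups run from youngest to oldest
age_order <- c("18-30", "30-45", ">45")
calls_ordered <- calls_by_age[age_order]
calls_ordered
```

barplot(calls_ordered) would then use the names as bar labels, so names.arg= is not needed.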
#Simple Bar Chart - Total Calls
barplot(telecom1$Calls, main="SIMPLE BAR CHART (Total Calls - Age Group)",
names.arg=telecom1$Age_Group, xlab="Age Group", ylab="Total Calls", col="darkorange")
barplot() in base R yields different types of bar charts
main= provides the user defined title of the chart. It has to be put in quotes
names.arg= takes a vector of names to be plotted below each bar or group of bars
xlab= provides a user defined label for the variable on X axis
ylab= provides a user defined label for the variable on Y axis
col= can be used to input your choice of color for the bars
Stacked Bar Chart
#Stacked Bar Chart
telecom2<-table(telecom$Gender,telecom$Age_Group)
telecom2
    >45 18-30 30-45
  F  32   256   221
  M  39   245   207
table() with two variables gives a matrix of the counts in each combination of categories
barplot(telecom2,main="STACKED BAR CHART",
xlab="Age Group",ylab="No. of Customers",
col=c("cadetblue","orange"),legend=rownames(telecom2))
barplot() draws one bar per column of a matrix and, by default, stacks the rows within each bar
legend=rownames(telecom2) displays a legend labelled with the row names (legend= is matched to the legend.text argument of barplot())
Percentage Bar Chart
#Percentage Bar Chart
telecom3<-prop.table(telecom2,2)
telecom3
          >45     18-30     30-45
  F 0.4507042 0.5109780 0.5163551
  M 0.5492958 0.4890220 0.4836449
prop.table() creates a table of proportions
The second argument, 2, computes the proportions within each column, so every Age Group sums to 1
barplot(telecom3*100, main="PERCENTAGE BAR CHART",
xlab="Age Group",ylab="% of Customers",
col=c("cadetblue","orange"),legend=rownames(telecom3))
telecom3*100 is the matrix for which the bar chart needs to be plotted; multiplying by 100 displays a percentage scale on the Y axis
Multiple Bar Chart
# Multiple Bar Chart
telecom4<-xtabs(Calls~Gender+Age_Group,telecom)
telecom4
      Age_Group
Gender    >45  18-30  30-45
     F  58310 480235 408184
     M  70560 462952 390537
xtabs() cross tabulates the categories of two or more variables and, when a numeric variable is given on the left of the formula, returns its total in each category combination
barplot(telecom4,main="MULTIPLE BAR CHART (Total Calls - Gender & Age Group)",
xlab="Age Group",ylab="No. of Calls",
col=c("cadetblue","orange"),legend=rownames(telecom4), beside=TRUE)
beside=TRUE places the bars for the different classes side by side within each group instead of stacking them
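The difference between totals and counts in xtabs() comes from the left-hand side of the formula. A minimal sketch on an invented data frame (telecom_toy and its values are assumptions):

```r
# Invented stand-in for the telecom data
telecom_toy <- data.frame(
  Gender    = c("F", "M", "F", "M", "F"),
  Age_Group = c("18-30", "18-30", "30-45", "30-45", "18-30"),
  Calls     = c(100, 200, 50, 300, 150)
)

# Numeric variable on the left: totals of Calls in each cell
totals <- xtabs(Calls ~ Gender + Age_Group, data = telecom_toy)
totals

# Nothing on the left: number of rows (customers) in each cell
counts <- xtabs(~ Gender + Age_Group, data = telecom_toy)
counts
```

Either result can be passed to barplot() with beside=TRUE.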
Pie Chart
#Pie Chart
telecom5<-aggregate(Calls~Age_Group,data = telecom, FUN=sum)
telecom5$pct <- round(telecom5$Calls/sum(telecom5$Calls)*100)
telecom5
  Age_Group  Calls pct
1       >45 128870   7
2     18-30 943187  50
3     30-45 798721  43
Here, we calculate the percentage share of each category as its total divided by the overall total, multiplied by 100 and rounded
pie(telecom5$Calls,labels = paste(telecom5$Age_Group,"(",telecom5$pct,"%)"),
col=c("darkcyan","orange","yellowgreen"),
main="Pie Chart with Percentage")
pie() in base R yields a pie chart
labels= provides a user defined label for each slice
paste() builds each label by joining the comma separated pieces into one string, separated by spaces
col= can be used to input your choice of color for the slices
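Because paste() puts a space between every piece, the labels come out as ">45 ( 7 %)". paste0() joins the pieces without a separator, which gives tighter labels. The sketch below uses the three totals printed above instead of the data file, which is an assumption made so that it runs on its own.

```r
# Total calls per age group, copied from the output above
calls <- c(">45" = 128870, "18-30" = 943187, "30-45" = 798721)

# Percentage share of each group, rounded to whole numbers
pct <- round(calls / sum(calls) * 100)

# paste() inserts a space between every piece
labels_spaced <- paste(names(calls), "(", pct, "%)")
labels_spaced

# paste0() inserts nothing, so the spacing is controlled explicitly
labels_tight <- paste0(names(calls), " (", pct, "%)")
labels_tight
```

labels_tight can be supplied to the labels= argument of pie() in place of the paste() call.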
Box Plot
#BoxPlot - Total Calls
boxplot(telecom$Calls, main="BOX PLOT (Total Calls)",
ylab="Total Calls",col="cadetblue3")
boxplot() in base R yields different types of box plots
telecom$Calls is the variable to be plotted. A data= argument is needed only when the variable is given as a formula, for example boxplot(Calls~Gender,data=telecom)
main= provides the user defined title of the chart. It has to be put in quotes
ylab= provides a user defined label for the variable on Y axis
col= can be used to input your choice of color for the box
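The numbers behind a box plot can be inspected with boxplot.stats() from base R, which returns the whisker ends, hinges and median along with the points that would be drawn as outliers. A minimal sketch on invented values (calls is an assumption, not the telecom data):

```r
calls <- c(10, 12, 15, 18, 20, 22, 25, 95)   # invented values with one extreme point

bs <- boxplot.stats(calls)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Points lying more than 1.5 times the box length beyond the hinges
bs$out
```

For these values the single extreme point, 95, is reported in bs$out and the upper whisker stops at 25.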
Histogram
#Histogram - Average Call Time
hist(telecom$AvgTime, breaks=12, main="HISTOGRAM - Average Call Time",
xlab="Average Call Time", ylab="No. of Customers",
col="darkorange")
hist() in base R yields a histogram
breaks= given as a single number suggests the number of bars; R treats it as a suggestion and picks convenient break points
main= provides the user defined title of the chart. It is to be put in quotes
xlab= provides a user defined label for the variable on X axis
ylab= provides a user defined label for the variable on Y axis
col= can be used to input your choice of color for the bars
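To control the bar width exactly, breaks= can be given as a vector of break points instead of a single number. The sketch below uses simulated values (x is an assumption) and plot=FALSE so that only the computed bins are returned.

```r
set.seed(1)
x <- runif(200, min = 0, max = 10)   # simulated values between 0 and 10

# A single number is only a suggestion for the number of bars
h_suggest <- hist(x, breaks = 12, plot = FALSE)
length(h_suggest$counts)   # number of bars actually used

# A vector of break points fixes the bars exactly (width 2 here)
h_exact <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)
h_exact$breaks
h_exact$counts
```

Dropping plot=FALSE draws the histogram with the same bins.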
Density Plot
#Density Plot - Amount
telecom6<-density(telecom$Amt)
plot(telecom6, main="DENSITY PLOT - Amount",xlab="Amount")
polygon(telecom6, col="yellowgreen")
density() returns the estimated density values of the variable
plot() plots the density curve as a line graph
main= provides the user defined title of the chart. It is to be put in quotes
xlab= provides a user defined label for the variable on X axis
polygon() fills the area under the curve
col= can be used to input your choice of color for the polygon
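density() returns the x and y coordinates of the estimated curve, and the area under that curve should be close to 1. The sketch below checks this with the trapezoidal rule on simulated values (amt is an assumption, not the telecom data).

```r
set.seed(1)
amt <- rnorm(500, mean = 100, sd = 15)   # simulated amounts

d <- density(amt)

# Bandwidth chosen by density() and number of evaluation points
d$bw
length(d$x)

# Area under the curve by the trapezoidal rule; expected to be close to 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

If the variable contains missing values, density() stops with an error unless na.rm=TRUE is passed.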
Dot Plot
#DotPlot
dotchart(telecom$Calls,cex=.7,main="DOT PLOT",
xlab="Total Calls",col="red")
dotchart() in base R yields a dot chart
cex= specifies the amount by which plotting text and symbols should be scaled relative to the default of 1
main= provides the user defined title of the chart. It has to be put in quotes
xlab= provides a user defined label for the variable on X axis
col= can be used to input your choice of color for the dot chart
Stem and Leaf Plot
#Stem and Leaf Plot in R
stem(telecom$Calls)
stem() in base R prints a stem and leaf plot of the variable in the console
Heat Map
#Installing and loading the package
install.packages("pheatmap")
library(pheatmap)
#Creating a Matrix
heatmap<-xtabs(Calls~Age_Group+Gender,data=telecom)
# Heat Map
pheatmap(heatmap, display_numbers=TRUE,main="HEAT MAP",
fontsize=12, fontsize_row=15, number_color="black")
display_numbers=TRUE prints the value of each cell (here, the total calls) on the map
main= provides the user defined title of the chart
fontsize= sets the base font size of the plot
fontsize_row= sets the font size of the row labels
number_color= sets the color of the numbers shown in each cell
pheatmap is a convenient package for plotting an effective Heat Map in R
THANK YOU!