Basic Analysis using R

SECTION 1
Descriptive Statistics
Summarizing Data

Data Snapshot
Data: basic_salary_R3
The data has 41 rows and 7 columns.
Data Description
Column       Description
First_Name   First Name
Last_Name    Last Name
Grade        Grade
Location     Location
Function     Department
ba           Basic Allowance
ms           Management Supplements

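Describing Variable
# Importing Data
salary<-read.csv(file.choose(), header=TRUE, stringsAsFactors=TRUE)
file.choose() opens a dialog to select the CSV file. From R 4.0.0 onwards, stringsAsFactors=TRUE is needed so that text columns are read as factors and summary() reports their counts as shown here.
# Checking the variable features using summary function
summary(salary)
 First_Name    Last_Name     Grade      Location     Function
 Kavita : 2    Joshi  : 2    GR1 :23    DELHI :17    FINANCE  :13
 Mahesh : 2    Shah   : 2    GR2 :17    MUMBAI:21    SALES    :15
 Nishi  : 2    Singh  : 2    NA's: 1    NA's  : 3    TECHNICAL:11
 Priya  : 2    Arora  : 1                            NA's     : 2
 Ajit   : 1    Bhide  : 1
 Ameet  : 1    Bhutala: 1
 (Other):31    (Other):32
       ba               ms
 Min.   :10940    Min.   : 2700
 1st Qu.:13785    1st Qu.:10450
 Median :16230    Median :12420
 Mean   :17210    Mean   :11939
 3rd Qu.:19305    3rd Qu.:14200
 Max.   :29080    Max.   :16970
 NA's   :2        NA's   :4
summary() gives descriptive measures (minimum, quartiles, median, mean and maximum) for numeric variables and category counts for factor variables. Missing values are reported as NA's.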

Measures of Central Tendency
# Mean
mean(salary$ba )
[1] NA
mean(salary$ba,na.rm=T)
[1] 17209.74
mean() gives the mean of the variable. ba contains missing values, so mean() returns NA unless na.rm=T is used to exclude them.
# Median
median(salary$ba,na.rm=T)
[1] 16230
median() gives the median of the variable.
# Trimmed Mean
trimmed_mean<-mean(salary$ba,0.10,na.rm=T)
trimmed_mean
[1] 16879.09
The trimmed mean is a useful measure in the presence of outliers.
Using 0.10 as the trim argument of mean() excludes 10% of the observations from each end of the sorted data before the mean is calculated.
# Mode
names(sort(-table(salary$ba)))[1]
[1] "10940"
There is no direct function in R for the mode. Hence, we first obtain counts for each value using table() and then use sort() to put the highest frequency first.
names() returns the value of the first (highest frequency) entry as a character string. If several values share the highest frequency, only one of them is returned.
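The one-line mode returns a single value as a character string. A minimal sketch of a reusable alternative is given here; the helper name stat_mode and the sample values are hypothetical and are not taken from the salary data.

```r
# Hypothetical helper: returns every value tied for the highest frequency.
stat_mode <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values first
  counts <- table(x)                      # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]
  as.numeric(modes)                       # table() keeps values as character names
}

ba_demo <- c(12000, 15000, 15000, 18000, NA, 18000, 21000)  # made-up values
mode_demo <- stat_mode(ba_demo)
mode_demo   # 15000 and 18000 both occur twice
```

Returning all tied values makes it visible when the data has no single mode, which the one-liner hides.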

Measures of Variation
# Range / Inter Quartile Range
r<-range(salary$ba,na.rm=T)
r
[1] 10940 29080
range() gives the minimum and maximum values of the variable.
diff(r)
[1] 18140
diff() calculates the differences between consecutive elements of a vector. Applied to the two values returned by range(), it gives the range as a single number.
IQR(salary$ba,na.rm=T)
[1] 9070
IQR() gives the inter-quartile range of the variable.

Measures of Variation
# Standard Deviation / Variance
sd(salary$ba,na.rm=T)
[1] 4159.515
sd() gives the standard deviation of the variable.
var(salary$ba,na.rm=T)
[1] 17301567
var() gives the variance of the variable.
# Coefficient of Variation
cv<-sd(salary$ba,na.rm=T)/mean(salary$ba,na.rm=T)
cv
[1] 0.2416954
There is no standard function for CV in R. Hence we calculate it by definition, as the standard deviation divided by the mean.
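Since the coefficient of variation has to be computed by definition, it can be convenient to wrap the calculation in a small function. This is only a sketch; the name coef_var and the sample values are assumptions made for illustration.

```r
# Hypothetical helper: coefficient of variation, by definition sd / mean.
coef_var <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

ba_demo <- c(10000, 12000, NA, 14000, 16000)  # made-up values with one NA
cv_demo <- coef_var(ba_demo)
round(100 * cv_demo, 1)                       # CV is often quoted as a percentage

# The variance is the square of the standard deviation.
var_matches_sd <- all.equal(var(ba_demo, na.rm = TRUE),
                            sd(ba_demo, na.rm = TRUE)^2)
var_matches_sd
```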

Skewness and Kurtosis
library(e1071)
Using the package "e1071" is an easy way to find skewness and kurtosis in R.
# Skewness
skewness(salary$ba,na.rm=T,type=2)
[1] 0.9033507
skewness() gives the skewness of the variable.
type=2 uses the sample-size-adjusted formula (the one used by SAS and SPSS); type=1 is the plain moment based formula.
# Kurtosis
kurtosis(salary$ba,na.rm=T,type=2)
[1] 0.4996513
kurtosis() gives the excess kurtosis of the variable.
type=2 again uses the sample-size-adjusted formula.
# Skewness and Kurtosis by Grade
f<-function(x)
c(skew=skewness(x,na.rm=T,type=2),kurt=kurtosis(x,na.rm=T,type=2))
aggregate(ba~Grade,data=salary,FUN=f)
  Grade    ba.skew    ba.kurt
1   GR1 0.85500651 0.62939005
2   GR2 0.08682743 0.34852383
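To see what the type=2 measures compute, the sketch here reproduces them in base R from the central moments, following the formulas given in the e1071 documentation. The function name moments_type2 and the sample vectors are hypothetical.

```r
# Hypothetical helper: sample-size-adjusted (type 2) skewness and excess kurtosis.
moments_type2 <- function(x) {
  x <- x[!is.na(x)]                 # same effect as na.rm = TRUE
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)                   # second central moment
  g1 <- mean(d^3) / m2^1.5          # moment based (type 1) skewness
  g2 <- mean(d^4) / m2^2 - 3        # moment based (type 1) excess kurtosis
  c(skew = g1 * sqrt(n * (n - 1)) / (n - 2),
    kurt = ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3)))
}

symmetric_demo <- moments_type2(c(1, 2, 3, 4, 5))      # symmetric: skewness is 0
skewed_demo <- moments_type2(c(1, 1, 2, 2, 3, 10, NA)) # long right tail: positive skewness
symmetric_demo
skewed_demo
```

The formulas need at least four non-missing observations, because of the (n - 2) and (n - 3) terms in the denominators.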

Generating Frequency Tables
# Frequency Tables
freq<-table(salary$Location, salary$Grade)
freq
         GR1 GR2
  DELHI   11   6
  MUMBAI  10  10
table() with two variables gives the count of observations for each combination of their categories.
# Percentage Frequency Tables
prop.table(freq)
               GR1       GR2
  DELHI  0.2972973 0.1621622
  MUMBAI 0.2702703 0.2702703
prop.table() gives each frequency expressed as a proportion of the total count.
prop.table(freq,1) expresses the frequency as a proportion of the row count, whereas prop.table(freq,2) expresses it as a proportion of the column count.
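The effect of the margin argument is easier to see side by side. The sketch here rebuilds the Location by Grade counts shown in the frequency table by hand (the object name freq_demo is an assumption) and prints row and column percentages.

```r
# Rebuild the 2 x 2 table of counts by hand so the example runs without the CSV.
freq_demo <- as.table(matrix(c(11, 10, 6, 10), nrow = 2,
                             dimnames = list(c("DELHI", "MUMBAI"),
                                             c("GR1", "GR2"))))

row_share <- prop.table(freq_demo, 1)   # each row adds up to 1
col_share <- prop.table(freq_demo, 2)   # each column adds up to 1

round(100 * row_share, 1)               # percentages within each location
round(100 * col_share, 1)               # percentages within each grade
addmargins(freq_demo)                   # counts with row and column totals
```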

Generating Frequency Tables
# Three Way Frequency Table
table1 <- table(salary$Location, salary$Grade, salary$Function)
ftable(table1)
            FINANCE SALES TECHNICAL
DELHI  GR1        4     4         3
       GR2        2     1         3
MUMBAI GR1        2     4         2
       GR2        3     4         3
ftable() gives the counts for the three variables mentioned in one flat table.

Generating Cross Tables
# Installing package "gmodels"
install.packages("gmodels")
library(gmodels)
# Frequency Table using "gmodels" package
CrossTable(salary$Grade,salary$Location, prop.r=FALSE, prop.c=FALSE)
CrossTable() gives the cross-tabulated counts of the two variables mentioned.
prop.r=FALSE removes the row proportions from the output and prop.c=FALSE removes the column proportions.

SECTION 2
Bivariate Analysis

Data Snapshot
Data: job_proficiency_R3
The data has 25 rows and 6 columns.
Data Description
Column     Description
empno      Employee Number
aptitude   Aptitude Score of the Employee
testofen   Test of English
tech_      Technical Score
g_k_       General Knowledge Score
job_prof   Job Proficiency Score

Scatter Plot
• Study the correlation between Aptitude and Job Proficiency.
# Importing Data
job<-read.csv(file.choose(),header=T)
# Scatterplot
plot(job$aptitude,job$job_prof,col="red")
plot() gives a scatterplot of the two variables mentioned, with the first on the X axis and the second on the Y axis.
col= provides color to the points

Pearson Correlation Coefficient
# Correlation
cor(job$aptitude,job$job_prof)
[1] 0.5144107
cor() gives the Pearson correlation coefficient of the two variables mentioned.
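As a rough illustration of what cor() computes, the sketch here works out the Pearson coefficient from its definition and compares it with cor(). The two score vectors are made-up values, not the job proficiency data. cor.test() is a base R function that additionally reports a p-value for the correlation.

```r
# Made-up scores for six employees (not from the job proficiency file).
aptitude_demo <- c(55, 62, 70, 48, 81, 66)
job_prof_demo <- c(60, 64, 75, 52, 78, 73)

# Pearson r by definition: sum of cross-deviations over the root of the
# product of the two sums of squared deviations.
dx <- aptitude_demo - mean(aptitude_demo)
dy <- job_prof_demo - mean(job_prof_demo)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_builtin <- cor(aptitude_demo, job_prof_demo)
c(manual = r_manual, builtin = r_builtin)

# cor.test() adds a significance test for the same coefficient.
r_test <- cor.test(aptitude_demo, job_prof_demo)
r_test$p.value
```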

ScatterPlot with Regression Line
# Scatterplot with Regression Line
attach(job)
plot(aptitude,job_prof, main="ScatterPlot with Regression Line",
xlab="Aptitude", ylab="Job Proficiency", pch=19)
abline(lm(job_prof~aptitude), col="darkorange")
attach() adds the data to R's search path so that its columns can be referred to by name. This avoids specifying the data repeatedly in later code.
plot() in base R yields different types of plots
aptitude is the variable plotted on the X axis
job_prof is the variable plotted on the Y axis
main= provides the user defined name of the chart. It has to be put in double quotes
xlab= provides a user defined label for the variable on X axis
ylab= provides a user defined label for the variable on Y axis
pch= sets the shape of the data points on the plot (19 is a solid circle)
abline() in base R adds different types of straight lines to a plot
lm() fits the linear regression of the first variable mentioned (job_prof) on the second (aptitude)
col= provides the color of the line plotted
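An alternative to attach() is to pass the data frame through the data= argument, which keeps the workspace unchanged. The sketch here uses a small made-up data frame named job_demo (an assumption, standing in for the job data) and also checks that the fitted slope equals the covariance divided by the variance of the predictor.

```r
# Made-up stand-in for the job data.
job_demo <- data.frame(
  aptitude = c(55, 62, 70, 48, 81, 66),
  job_prof = c(60, 64, 75, 52, 78, 73)
)

# data= tells lm() and plot() where to find the variables; no attach() needed.
fit <- lm(job_prof ~ aptitude, data = job_demo)
coef(fit)                                   # intercept and slope

# For simple regression the slope is cov(x, y) / var(x).
slope_manual <- with(job_demo, cov(aptitude, job_prof) / var(aptitude))

plot(job_prof ~ aptitude, data = job_demo, pch = 19,
     xlab = "Aptitude", ylab = "Job Proficiency")
abline(fit, col = "darkorange")             # draws the fitted line
```

Storing the model in fit rather than nesting lm() inside abline() makes the coefficients available for inspection as well as plotting.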

Output: [Figure: ScatterPlot with Regression Line]

Scatter Plot Matrix
# ScatterPlot Matrix
pairs(~job_prof+aptitude+testofen+tech_+g_k_,data=job,
main="ScatterPlot Matrix",col="darkorange")
pairs() in base R plots every pair of the listed variables against each other
~ starts the formula; the names of the variables to be plotted follow it, separated by "+" signs
main= provides the user defined name of the chart.
Output: [Figure: ScatterPlot Matrix]

SECTION 3
Data Visualisation
Graphs in R

Data Snapshot
Data: telecom_R3
The data has 1000 rows and 10 columns.
Data Description
Column      Description
CustID      Customer ID
Age         Age
Gender      Gender
PinCode     PinCode
Active      Whether the customer was active in past 24 weeks or not
Calls       Number of Calls made
Minutes     Number of minutes spoken
Amt         Amount charged
AvgTime     Mean Time per call
Age_Group   Age Group of the Customer

Simple Bar Chart
# Importing Data
telecom<-read.csv(file.choose(), header=TRUE)
# Aggregating Calls by Age Group
telecom1<-aggregate(Calls~Age_Group,data = telecom, FUN=sum)
telecom1
  Age_Group  Calls
1       >45 128870
2     18-30 943187
3     30-45 798721
For plotting a bar chart in R, it is important that the values to be plotted are in a vector or a matrix form.
aggregate() returns a data frame with the total Calls for each Age_Group; its Calls column is the vector passed to barplot().

Simple Bar Chart
# Simple Bar Chart - Total Calls
barplot(telecom1$Calls, main= "SIMPLE BAR CHART (Total Calls - Age Group)",
names.arg = telecom1$Age_Group, xlab = "Age Group",ylab="Total Calls",col=
"darkorange")
barplot() in base R yields different types of bar charts
main= provides the user defined name of the chart. It has to be put in double quotes
names.arg= takes a vector of names to be plotted below each bar or group of bars.
col= can be used to input your choice of color to the bars
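The bars are drawn in the row order of the aggregated data, where ">45" sorts before "18-30". If the age groups should appear from youngest to oldest, the rows can be reordered before plotting. This is a sketch only: telecom1_demo re-enters the totals from the aggregated output by hand, and the object names are assumptions.

```r
# Totals from the aggregated output, entered by hand so the example runs alone.
telecom1_demo <- data.frame(
  Age_Group = c(">45", "18-30", "30-45"),
  Calls     = c(128870, 943187, 798721)
)

# Desired display order, youngest group first.
age_order <- c("18-30", "30-45", ">45")
telecom1_sorted <- telecom1_demo[match(age_order, telecom1_demo$Age_Group), ]

barplot(telecom1_sorted$Calls, names.arg = telecom1_sorted$Age_Group,
        main = "SIMPLE BAR CHART (Total Calls - Age Group)",
        xlab = "Age Group", ylab = "Total Calls", col = "darkorange")
```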

Stacked Bar Chart
# Stacked Bar Chart
telecom2<-table(telecom$Gender,telecom$Age_Group)
telecom2
    >45 18-30 30-45
  F  32   256   221
  M  39   245   207
table() with two variables as input gives a matrix of their counts in each category
barplot(telecom2,main="STACKED BAR CHART",
xlab ="Age Group",ylab="No.of Customers",
col=c("cadetblue","orange"),legend=rownames(telecom2))
legend=rownames(telecom2) displays the legend on the graph output, labelled with the row names of the table (F and M)

Percentage Bar Chart
# Percentage Bar Chart
telecom3<-prop.table(telecom2,2)
telecom3
          >45     18-30     30-45
  F 0.4507042 0.5109780 0.5163551
  M 0.5492958 0.4890220 0.4836449
prop.table() converts the table of counts into a table of proportions
(,2) gives each count as a proportion of its column total, so every age group adds up to 1
barplot(telecom3*100, main= "PERCENTAGE BAR CHART",
xlab = "Age Group",ylab="Percentage of Customers",
col = c("cadetblue","orange"),legend=rownames(telecom3))
telecom3*100 is the matrix for which the bar chart is plotted. Multiplying by 100 displays a percentage scale on the Y axis


Multiple Bar Chart
# Multiple Bar Chart
telecom4<-xtabs(Calls~Gender+Age_Group,telecom)
telecom4
      Age_Group
Gender    >45  18-30  30-45
     F  58310 480235 408184
     M  70560 462952 390537
xtabs() cross tabulates the categories of one or more variables. With a numeric variable on the left of the formula, each cell holds the total of that variable for the category instead of a count.
barplot(telecom4,main="MULTIPLE BAR CHART (Total Calls - Gender & Age Group)",
xlab ="Age Group",ylab="No.of Calls",
col=c("cadetblue","orange"),legend=rownames(telecom4), beside = TRUE)
beside=TRUE shows the bars for the different classes of a group one beside the other instead of stacking them
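The difference between totals and counts in xtabs() can be checked on a tiny example. The data frame telecom_demo here is made up for illustration and is not the telecom file.

```r
# Six made-up customers.
telecom_demo <- data.frame(
  Gender    = c("F", "M", "F", "M", "F", "M"),
  Age_Group = c("18-30", "18-30", "30-45", "30-45", "18-30", ">45"),
  Calls     = c(10, 20, 30, 40, 50, 60)
)

# Numeric variable on the left of ~ : cells hold the sum of Calls.
totals <- xtabs(Calls ~ Gender + Age_Group, data = telecom_demo)

# Nothing on the left of ~ : cells hold the number of rows.
counts <- xtabs(~ Gender + Age_Group, data = telecom_demo)

totals
counts
totals["F", "18-30"]   # 10 + 50 = 60 calls from 2 customers
```

Combinations with no rows, such as F and ">45" here, appear as 0 in both tables.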

Pie Chart
# Pie Chart
telecom5<-aggregate(Calls~Age_Group,data = telecom, FUN=sum)
telecom5$pct <- round(telecom5$Calls/sum(telecom5$Calls)*100)
telecom5
  Age_Group  Calls pct
1       >45 128870   7
2     18-30 943187  50
3     30-45 798721  43
Here, we calculate the percentage share of each category as its total calls divided by the overall total, multiplied by 100 and rounded.
pie(telecom5$Calls,labels = paste(telecom5$Age_Group,"(",telecom5$pct,"%)"),
col=c("darkcyan","orange","yellowgreen"),
main="Pie Chart with Percentage")
pie() in base R yields a pie chart
labels= provides a user defined label for each slice of the pie
paste() builds each label by joining the age group and its percentage into one string
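paste() inserts a space between every piece, so the labels come out as "18-30 ( 50 %)". One possible way to get tighter labels is sprintf(), sketched here with the totals from the aggregated output entered by hand; the object names are assumptions.

```r
# Totals by age group, copied from the aggregated output.
groups <- c(">45", "18-30", "30-45")
calls  <- c(128870, 943187, 798721)

pct <- round(calls / sum(calls) * 100)

# %s inserts the group name, %d the whole-number percentage, %% a literal percent sign.
pie_labels <- sprintf("%s (%d%%)", groups, as.integer(pct))
pie_labels

pie(calls, labels = pie_labels,
    col = c("darkcyan", "orange", "yellowgreen"),
    main = "Pie Chart with Percentage")
```

Rounded percentages do not always add up to exactly 100, so it is worth checking sum(pct) before presenting the chart.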

Box Plot
# BoxPlot - Total Calls
boxplot(telecom$Calls, main="BOX PLOT (Total Calls)",
ylab="Total Calls",col="cadetblue3")
boxplot() in base R yields different types of box plots
data= is needed only with the formula interface, for example boxplot(Calls~Age_Group,data=telecom); it is left out when the variable is passed as telecom$Calls
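The numbers behind a box plot can be inspected without drawing it, using boxplot.stats() from base R. The vector calls_demo here is made up for illustration.

```r
# Made-up call counts with one unusually large value.
calls_demo <- c(12, 15, 18, 20, 22, 25, 28, 30, 95)

box_numbers <- boxplot.stats(calls_demo)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
box_numbers$stats

# out: points beyond 1.5 times the box height from the hinges, drawn as dots.
box_numbers$out
```

Here the value 95 lies beyond the upper fence, so the upper whisker stops at 30 and 95 is reported as an outlier.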

Histogram
# Histogram - Average Call Time
hist(telecom$AvgTime, breaks=12, main = "HISTOGRAM - Average Call Time",
xlab = "Average Call Time", ylab = "No. of Customers",
col="darkorange")
hist() in base R yields a histogram
breaks= given as a single number suggests the number of bars (cells); R treats it as a suggestion and may adjust it to get round break points
main= provides the user defined name of the chart. It is to be put in double quotes
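To see how breaks= behaves, hist() can be called with plot=FALSE, which returns the break points and counts without drawing anything. The vector avg_time_demo here is made up; a single number is only a suggestion, whereas a vector of break points is used exactly as given.

```r
# Made-up average call times.
avg_time_demo <- c(1, 2, 2, 3, 5, 6, 7, 9, 10)

# A single number is a suggestion; inspect the break points R actually chose.
h_suggested <- hist(avg_time_demo, breaks = 4, plot = FALSE)
h_suggested$breaks

# A vector of break points is used exactly: intervals (0,2], (2,4], ... (8,10].
h_exact <- hist(avg_time_demo, breaks = seq(0, 10, by = 2), plot = FALSE)
h_exact$counts
```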

Density Plot
# Density Plot - Amount
telecom6<-density(telecom$Amt)
plot(telecom6, main="DENSITY PLOT - Amount",xlab="Amount")
polygon(telecom6, col="yellowgreen")
density() computes a kernel density estimate of the variable
plot() plots the estimated density as a curve
main= provides the user defined name of the chart. It is to be put in double quotes
polygon() fills the area under the curve
col= can be used to input your choice of color in the polygon

Dot Plot
# DotPlot
dotchart(telecom$Calls,cex=.7,main="DOT PLOT",
xlab="Total Calls",col="red")
dotchart() in base R yields a dot chart
cex= specifies the amount by which plotting text and symbols should be scaled relative to the default of 1
col= can be used to input your choice of color in the dot chart

Stem and Leaf Plot
# Stem and Leaf Plot in R
stem(telecom$Calls)
stem() in base R yields a stem and leaf chart, printed as text in the console

Heat Map
# Installing and loading the package
install.packages("pheatmap")
library(pheatmap)
# Creating a Matrix
heatmap<-xtabs(Calls~Age_Group+Gender,data=telecom)
# Heat Map
pheatmap(heatmap, display_numbers=TRUE,main="HEAT MAP",
fontsize=12, fontsize_row=15, number_color="black")
display_numbers=TRUE shows the value of each cell (here, the total calls)
main= provides the user defined name of the chart
fontsize= inputs the base font size of the plot
fontsize_row= inputs the font size of the row labels
number_color= inputs the color of the numbers shown in each cell
pheatmap is a convenient package for plotting an effective heat map in R
