Exploratory data analysis in R - Data Science Club

DATA EXPLORATION METHODS &
PRACTICES
Martin Bago | Instarea
8.10.2018
2nd Data Science Club, 18/19 Winter

TABLE OF CONTENTS
INTRO
FIRST DIVE INTO THE DATASET
GOING DEEPER
CORRELATIONS
BONUS
DATA SCIENCE CLUB

Martin Bago
Data Scientist | Instarea
Ing. @ Process Automation and Informatization in Industry (2016, MTF STU BA)
Bc. @ Applied Informatics (2014, FEI STU BA)
2017- now Data Scientist, Instarea s.r.o., Market Locator
2015-2016 Head of Analyst, News and Media Holding a.s.
2014-2015 SEO Analyst, Centrum Holdings a.s.
2011-2014 Automix.sk, Centrum Holdings a.s.
2010-2013 Editor-in-chief OKO Casopis (FEI STU BA)
Passionate driver, beer&coffee&football lover

Something for you
Download this presentation +
source code here:
http://bit.ly/2QybvNV

The data journey… always the same

Dataset
How to find & use
>> library(datasets) # the datasets package ships with base R, so no install.packages() call is needed
For study purposes, R includes a library of many real-life example datasets (from Monthly
Airline Passenger Numbers, through Weight versus Age of Chicks on Different Diets, to Monthly Deaths from
Lung Diseases in the UK).
This presentation uses the mtcars dataset.
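A minimal sketch of how the bundled datasets can be listed before picking one. It uses only base R; the variable name available is chosen here for illustration.

```r
# List the datasets bundled with the 'datasets' package (part of base R)
available <- data(package = "datasets")$results[, "Item"]
length(available)            # how many example datasets are on offer
head(available)              # first few names

# mtcars is ready to use without any explicit loading step
class(mtcars)                # "data.frame"
# ?mtcars                    # opens the help page describing each column
```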

Baby steps
head(), tail(), nrow() and ncol()
To understand what you are working with, first check the dimensions of the dataset: the number of rows
and columns.
>> head(mtcars)
>> tail(mtcars)
>> head(mtcars, 25)
>> nrow(mtcars)
>> ncol(mtcars)
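As a small addition, the same information can be read in one call with dim(). The sketch below uses only base R functions.

```r
# Dimensions of mtcars in one call: rows first, then columns
dim(mtcars)                  # 32 11
names(mtcars)                # column (variable) names
head(rownames(mtcars), 3)    # car models are stored as row names
```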

Deeper insight
str(), summary()
For a deeper understanding of the dataset, use str() for the structure and data types and
summary() for per-column summary statistics.
>> str(mtcars)
>> summary(mtcars)
Always check data types!!!
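One possible way to check the data types column by column, sketched with base R. The threshold of 6 distinct values is an arbitrary choice for this example, not a rule.

```r
# Class of every column: all eleven mtcars columns are stored as numeric,
# even those that are really categories (cyl, vs, am, gear, carb)
types <- sapply(mtcars, class)
table(types)

# Columns with only a handful of distinct values are candidates for factors
n_distinct <- sapply(mtcars, function(col) length(unique(col)))
n_distinct[n_distinct <= 6]
```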

Unique and missing values
unique(), is.na()
It is crucial to find out how many values are missing from the dataset. If two thirds of the values are missing,
you have the wrong dataset.
>> unique(mtcars$cyl)
>> is.na(mtcars)
If something is missing, you can
use a good old method to treat it:
filling with the mean (smt is a placeholder for any numeric column of your data).
>> mtcars$smt[is.na(mtcars$smt)] <-
mean(mtcars$smt, na.rm = TRUE)
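A hedged sketch of the whole workflow. mtcars itself contains no missing values, so the example works on a copy (cars_na, a name made up here) with a few gaps injected purely for illustration.

```r
# mtcars has no missing values, so work on a copy with NAs injected
cars_na <- mtcars
cars_na$hp[c(3, 10, 25)] <- NA      # hypothetical gaps, for illustration only

colSums(is.na(cars_na))             # missing values per column
mean(is.na(cars_na$hp))             # share of hp values that are missing

# Mean imputation: replace each NA with the mean of the observed values
hp_mean <- mean(cars_na$hp, na.rm = TRUE)
cars_na$hp[is.na(cars_na$hp)] <- hp_mean

anyNA(cars_na)                      # FALSE: nothing is missing any more
```

Mean imputation keeps the column mean unchanged but reduces its variance, so treat it as a quick fix rather than a default.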

Histograms
hist()
The best way to learn and understand is visually.
>> hist(mtcars$mpg)
>> hist(mtcars$hp)
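A small sketch of the same histogram with an explicit bin suggestion and labels. The title and label texts are chosen here for illustration.

```r
# Same histogram with a suggested bin count, a title and an axis label
h <- hist(mtcars$mpg, breaks = 10,
          main = "Fuel economy in mtcars",
          xlab = "Miles per (US) gallon")

# hist() also returns the binning invisibly, which is handy for checks
h$breaks                     # bin edges actually used
sum(h$counts)                # 32: every car falls into exactly one bin
```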

Transforming and recalculating
Often you need to calculate your own metrics. In R, it’s really
easy.
>> mtcars2 <- mtcars
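>> mtcars2$disp_l <- mtcars$disp/61.024 # displacement: cubic inches to liters
>> mtcars2$kml <- 235/mtcars$mpg # fuel consumption: miles per US gallon to liters per 100 km (approx.)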
>> hist(mtcars2$disp_l)
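The conversions can also be wrapped in small helper functions, which keeps the constants in one place. This is a sketch; the function and column names (cu_in_to_liters, mpg_to_l_per_100km, cars_metric, l_100km) are made up for this example.

```r
# Conversion constants: 1 liter = 61.024 cubic inches;
# 235.215 / mpg (US gallons) gives liters per 100 km
cu_in_to_liters <- function(cu_in) cu_in / 61.024
mpg_to_l_per_100km <- function(mpg) 235.215 / mpg

# transform() adds the derived columns to a copy of the data frame
cars_metric <- transform(mtcars,
                         disp_l = cu_in_to_liters(disp),
                         l_100km = mpg_to_l_per_100km(mpg))
head(cars_metric[, c("mpg", "l_100km", "disp", "disp_l")])
```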

Understand the scope of variables
boxplot()
>> boxplot(mtcars)
>> boxplot(mtcars2$disp_l, mtcars2$kml)
>> boxplot(mtcars2$kml, main = "mtcars dataset",
xlab = "Consumption per 100 km", ylab = "Liters")

How to read a boxplot?
boxplot()
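The numbers behind a boxplot can be inspected directly. The sketch below uses base R on mtcars$mpg. Note that boxplot() works with hinges, which can differ slightly from the quartiles returned by quantile().

```r
x <- mtcars$mpg

# Five numbers drawn by boxplot(): lower whisker, lower hinge (about Q1),
# median, upper hinge (about Q3), upper whisker
stats <- boxplot.stats(x)
stats$stats

# Whiskers reach the most extreme points within 1.5 * IQR of the box;
# anything beyond is drawn as an individual outlier point
stats$out

# The same idea expressed with quantiles
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])
c(lower_fence = unname(q[1]) - 1.5 * iqr,
  upper_fence = unname(q[3]) + 1.5 * iqr)
```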

Does it correlate?
library(corrplot), cor()
>> install.packages("corrplot")
>> library(corrplot)
>> #cor(x, method = "pearson", use = "complete.obs")
>> cor(mtcars)
Output: Not very intuitive…
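One way to make the raw matrix easier to read without any extra package is to flatten it into a ranked list of variable pairs. This is a sketch in base R; pairs_df and its column names are made up for this example.

```r
res <- cor(mtcars)

# Keep each pair once by blanking the diagonal and the lower triangle
res[lower.tri(res, diag = TRUE)] <- NA
pairs_df <- as.data.frame(as.table(res), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]

# Strongest relationships first, regardless of sign
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 5)
```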

Does it correlate?
>> res <- cor(mtcars)
>> round(res, 2)
>> corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 25)
! Be careful !
Correlation is not causation
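Before reading anything into a coefficient, it helps to look at a single pair in detail. A minimal base R sketch for weight and fuel economy:

```r
# Pearson correlation between weight and fuel economy
r <- cor(mtcars$wt, mtcars$mpg)
round(r, 2)                         # -0.87: heavier cars drive fewer miles per gallon

# Rank-based alternative, less sensitive to outliers and non-linearity
cor(mtcars$wt, mtcars$mpg, method = "spearman")

# Significance test and confidence interval for the Pearson coefficient
ct <- cor.test(mtcars$wt, mtcars$mpg)
ct$estimate
ct$conf.int
```

Even a strong and significant coefficient only describes association; it does not say which variable drives the other.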

Does it correlate?
Heatmap via base R heatmap()
>> # heatmap() comes from the stats package and colorRampPalette() from grDevices; corrplot is not needed here
>> col <- colorRampPalette(c("blue", "white", "red"))(20)
>> heatmap(x = res, col = col, symm = TRUE)

Or even deeper insight…
>> require(graphics)
>> pairs(mtcars2, main = "mtcars2 data", gap = 1/4)
>> coplot(kml ~ disp_l | as.factor(cyl), data = mtcars2,
panel = panel.smooth, rows = 1)
>> ## possibly more meaningful, e.g., for summary() or bivariate plots:
>> mtcars2 <- within(mtcars2, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb)
})
>> summary(mtcars2)

Or even deeper insight…
>> install.packages("PerformanceAnalytics")
>> library(PerformanceAnalytics)
>> chart.Correlation(mtcars, histogram=TRUE, pch=19)
>> mtcars_small <- mtcars[,1:4]
>> chart.Correlation(mtcars_small, histogram=TRUE, pch=19)
Library: PerformanceAnalytics

Bonus - anomaly detection
AnomalyDetectionTs()
Takes a time series or a vector as input, covering at least two periods.
Made by Twitter
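The sketch below is not the Twitter package and does not use its API. It is a simplified base R stand-in for the general idea of removing seasonality and trend first and then flagging unusual leftovers. The built-in AirPassengers series and the cutoff of 3 are assumptions made for this example.

```r
# Built-in monthly series: 12 years, so well over two seasonal periods
series <- log(AirPassengers)

# Split the series into seasonal, trend and remainder components
fit <- stl(series, s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Robust z-score based on median and MAD; the cutoff of 3 is an arbitrary choice
z <- (remainder - median(remainder)) / mad(remainder)
is_anomaly <- abs(z) > 3

# Time stamps (in fractional years) of the flagged months, if any
as.numeric(time(series))[is_anomaly]
```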

What next?
To create customizable dashboards, try Shiny.
For Tableau-like drag-and-drop GUI visualization in R, use esquisse.

Stay in touch
Instarea s.r.o.
29. Augusta 36/A
811 09 Bratislava
www.instarea.com
Martin Bago
Data Scientist
Instarea
martin.bago@instarea.com
+421 905 255 852
https://www.linkedin.com/in/martinbago/
Thank you!
