DATA EXPLORATION METHODS & PRACTICES
Martin Bago | Instarea
8.10.2018
2nd Data Science Club, 18/19 Winter
TABLE OF CONTENTS
INTRO
FIRST DIVE INTO THE DATASET
GOING DEEPER
CORRELATIONS
BONUS
Martin Bago
Data Scientist | Instarea
Ing. @ Process Automation and Informatization in Industry (2016, MTF STU BA)
Bc. @ Applied Informatics (2014, FEI STU BA)
2017- now Data Scientist, Instarea s.r.o., Market Locator
2015-2016 Head of Analyst, News and Media Holding a.s.
2014-2015 SEO Analyst, Centrum Holdings a.s.
2011-2014 Automix.sk, Centrum Holdings a.s.
2010-2013 Editor-in-chief OKO Casopis (FEI STU BA)
Passionate driver, beer&coffee&football lover
Something for you
Download this presentation +
source code here:
http://bit.ly/2QybvNV
The data journey… always the same
Dataset
How to find & use
>> library(datasets) # ships with base R and is attached by default, so install.packages() is not needed
For learning, R comes with a library of many real-life example datasets (from Monthly Airline Passenger Numbers, through Weight versus Age of Chicks on Different Diets, to Monthly Deaths from Lung Diseases in the UK).
This presentation uses the mtcars dataset.
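To browse what the datasets package offers, something like the following sketch may help. It relies only on base R; the variable names (available, has_mtcars, mtcars_title) are made up for this example.

```r
# List the datasets that ship with the 'datasets' package.
available <- data(package = "datasets")$results
n_datasets <- nrow(available)
print(n_datasets)

# Confirm that mtcars is among them and look up its short description.
has_mtcars <- "mtcars" %in% available[, "Item"]
mtcars_title <- available[available[, "Item"] == "mtcars", "Title"]
print(has_mtcars)
print(mtcars_title)
```

Typing ?mtcars in the console opens the help page with the meaning and units of every column.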
Baby steps
head(), tail(), nrow() and ncol()
To understand what you are working with, it is important to see the dimensions of the dataset and a sample of its values.
>> head(mtcars)
>> tail(mtcars)
>> head(mtcars, 25)
>> nrow(mtcars)
>> ncol(mtcars)
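A small sketch of the same first look in a slightly more compact form: dim() returns the row and column counts in one call. The variable name shape is only for illustration.

```r
# dim() returns the number of rows and columns in one call.
shape <- dim(mtcars)
print(shape)

# Row names hold the car models; column names are the measured variables.
print(head(rownames(mtcars), 3))
print(colnames(mtcars))
```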
Deeper insight
str(), summary()
For a deeper understanding of the dataset, use detailed views of its metrics and dimensions.
>> str(mtcars)
>> summary(mtcars)
Always check data types!
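One possible way to check data types column by column, as a sketch in base R. The threshold of 6 distinct values is an arbitrary choice for this example, not a rule from the talk.

```r
# Storage type of every column; str() shows the same information in a denser form.
col_types <- vapply(mtcars, typeof, character(1))
print(col_types)

# Columns with only a handful of distinct values are stored as numbers here,
# but they may really be categories (candidates for factor()).
n_distinct <- vapply(mtcars, function(x) length(unique(x)), integer(1))
print(n_distinct[n_distinct <= 6])
```

In mtcars every column is stored as a number, even the yes/no style columns such as am and vs, which is exactly why the types are worth a look.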
Unique and missing values
unique(), is.na()
It is crucial to find out how many values are missing from the dataset. If two thirds of them are missing, you have the wrong dataset.
>> unique(mtcars$cyl)
>> is.na(mtcars)
If something is missing, you can use a tried and tested method to treat it: filling the gaps with the mean.
>> mtcars$smt[is.na(mtcars$smt)] <- mean(mtcars$smt, na.rm = TRUE) # smt stands in for your own column; mtcars has no column named smt
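A runnable sketch of the same idea. mtcars itself has no missing values, so the example below makes a copy (cars_na, a made-up name) and removes two hp values on purpose.

```r
# Work on a copy and knock out two hp values on purpose (illustration only;
# the real mtcars has no missing values).
cars_na <- mtcars
cars_na$hp[c(3, 10)] <- NA

# Count missing values per column instead of reading a TRUE/FALSE matrix.
missing_per_col <- colSums(is.na(cars_na))
print(missing_per_col[missing_per_col > 0])

# Fill the gaps with the mean of the observed values.
hp_mean <- mean(cars_na$hp, na.rm = TRUE)
cars_na$hp[is.na(cars_na$hp)] <- hp_mean
print(anyNA(cars_na))
```

Mean filling keeps the column mean unchanged but shrinks its variance, so it is best treated as a quick fix rather than a default.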
Histograms
hist()
The best way to learn and understand data is visually.
>> hist(mtcars$mpg)
>> hist(mtcars$hp)
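To see the numbers behind the picture, hist() can also be asked not to draw anything. A minimal sketch; the title, axis label and breaks = 10 are choices made for this example only.

```r
# Compute the histogram without drawing it, to inspect the bins.
h <- hist(mtcars$mpg, plot = FALSE)
print(h$breaks)   # bin edges
print(h$counts)   # number of cars in each bin

# Draw it with a title, an axis label and a finer binning.
# breaks is only a suggestion; R picks "pretty" bin edges close to it.
hist(mtcars$mpg, breaks = 10, main = "Fuel economy in mtcars",
     xlab = "Miles per (US) gallon")
```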
Transforming and recalculating
Often you need to calculate your own metrics. In R, it’s really
easy.
>> mtcars2 <- mtcars
>> mtcars2$disp_l <- mtcars$disp/61.024 # displacement: cubic inches to litres
>> mtcars2$kml <- 235/mtcars$mpg # consumption: miles per US gallon to litres per 100 km
>> hist(mtcars2$disp_l)
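Where the two constants come from, as a short worked sketch. The unit definitions are exact (1 inch = 2.54 cm, 1 mile = 1.609344 km, 1 US gallon = 3.785411784 L); the variable names are made up for this example.

```r
# Exact definitions: 1 inch = 2.54 cm, 1 mile = 1.609344 km,
# 1 US gallon = 3.785411784 L.
cu_in_per_l  <- 1000 / 2.54^3                 # cubic inches in one litre
l100km_const <- 100 * 3.785411784 / 1.609344  # (L/100 km) multiplied by mpg

print(round(cu_in_per_l, 3))
print(round(l100km_const, 3))

# Apply both conversions to the first car in mtcars (Mazda RX4).
rx4 <- mtcars["Mazda RX4", ]
rx4_litres <- rx4$disp / cu_in_per_l
rx4_l100km <- l100km_const / rx4$mpg
print(c(litres = rx4_litres, l_per_100km = rx4_l100km))
```

The rounded 235 used on the slide is close enough for exploration; the difference from the exact constant is under 0.1 %.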
Understand the scope of variables
boxplot()
>> boxplot(mtcars)
>> boxplot(mtcars2$disp_l, mtcars2$kml)
>> boxplot(mtcars2$kml, main = "mtcars dataset", xlab = "Consumption per 100 km", ylab = "Liters")
How to read a boxplot?
boxplot()
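One way to read a boxplot is to look at the numbers it is drawn from. The sketch below uses boxplot.stats() from base R on the hp column; the choice of hp is only an example.

```r
# boxplot.stats() returns the five numbers a boxplot draws:
# lower whisker, lower hinge (box bottom), median, upper hinge (box top),
# upper whisker. The hinges are close to the 1st and 3rd quartiles.
bs <- boxplot.stats(mtcars$hp)
print(bs$stats)

# Points further than 1.5 times the box height away from the box
# are drawn individually as outliers.
print(bs$out)
```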
Does it correlate?
library(corrplot), cor()
>> install.packages("corrplot")
>> library(corrplot)
>> #cor(x, method = "pearson", use = "complete.obs")
>> cor(mtcars)
The raw correlation matrix is not very intuitive…
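Before plotting, one row of the matrix can already be made readable by sorting it. A minimal sketch that ranks the other variables by the strength of their correlation with mpg; picking mpg as the variable of interest is an assumption for this example.

```r
# Correlation of every other variable with mpg, strongest first.
res <- cor(mtcars)
mpg_cor <- res["mpg", colnames(res) != "mpg"]
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor, 2))
```

Sorting by absolute value keeps strong negative relationships (heavier cars use more fuel) at the top next to strong positive ones.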
Does it correlate?
library(corrplot), cor()
>> res <- cor(mtcars)
>> round(res, 2)
>> corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 25)
! Be careful !
Correlation is not causation
Does it correlate?
Heatmap of the correlation matrix
>> res <- cor(mtcars) # heatmap() itself is base R (stats); corrplot is not needed for it
>> col <- colorRampPalette(c("blue", "white", "red"))(20)
>> heatmap(x = res, col = col, symm = TRUE)
Or even deeper insight…
>> require(graphics)
>> pairs(mtcars2, main = "mtcars2 data", gap = 1/4)
>> coplot(kml ~ disp_l | as.factor(cyl), data = mtcars2, panel = panel.smooth, rows = 1)
>> ## possibly more meaningful, e.g., for summary() or bivariate plots:
>> mtcars2 <- within(mtcars2, {
     vs <- factor(vs, labels = c("V", "S"))
     am <- factor(am, labels = c("automatic", "manual"))
     cyl <- ordered(cyl)
     gear <- ordered(gear)
     carb <- ordered(carb)
   })
>> summary(mtcars2)
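The coplot() call conditions on the number of cylinders; the same comparison can be made numerically. A small sketch with aggregate() from base R, using the original mtcars columns so that it runs on its own (by_cyl is a made-up name).

```r
# Mean fuel economy and horsepower for each cylinder count.
by_cyl <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
print(by_cyl)
```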
Or even deeper insight…
>> install.packages("PerformanceAnalytics")
>> library(PerformanceAnalytics)
>> chart.Correlation(mtcars, histogram=TRUE, pch=19)
>> mtcars_small <- mtcars[,1:4]
>> chart.Correlation(mtcars_small, histogram=TRUE, pch=19)
Library: PerformanceAnalytics
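If installing an extra package is not an option, the significance of a single correlation can be checked with cor.test() from base R. This is a sketch for one pair of variables (mpg and wt, chosen only as an example), not a replacement for the full chart.

```r
# Test whether the correlation between mpg and wt differs from zero.
ct <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
print(ct$estimate)   # the correlation coefficient
print(ct$p.value)    # small values: a true correlation of zero is unlikely
print(ct$conf.int)   # 95% confidence interval for the coefficient
```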
Bonus - anomaly detection
AnomalyDetectionTs()
The input is a time series or a vector that covers at least two periods.
Made by Twitter
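The sketch below does not use the Twitter package and is not its algorithm. It swaps in a much simpler base R technique, a robust z-score built from the median and the MAD, just to show what flagging an anomaly in a vector looks like. The data, the threshold of 3.5 and all variable names are assumptions made for this example.

```r
# Synthetic series: two weeks of a daily pattern (period = 7) with one spike
# planted at position 10. All values are made up for illustration.
pattern <- c(10, 12, 11, 13, 12, 20, 22)
x <- rep(pattern, times = 2)
x[10] <- 60

# Robust z-score: distance from the median in units of MAD
# (median absolute deviation), which outliers barely influence.
robust_z <- (x - median(x)) / mad(x)

# Flag points further than 3.5 robust deviations from the median
# (a common rule of thumb, assumed here).
anomalies <- which(abs(robust_z) > 3.5)
print(anomalies)
```

Unlike a seasonal method, this ignores the weekly pattern, so regular weekend peaks sit close to the threshold; for real time series a period-aware approach is the better fit.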
What next?
To create customizable dashboards, try Shiny.
For Tableau-like drag-and-drop GUI visualization in R, use esquisse.
Stay in touch
Instarea s.r.o.
29. Augusta 36/A
811 09 Bratislava
www.instarea.com
Martin Bago
Data Scientist
Instarea
martin.bago@instarea.com
+421 905 255 852
https://www.linkedin.com/in/martinbago/
Thank you!

Exploratory data analysis in R - Data Science Club
