Working on data (cleaning, filtering, transformation, sampling, visualization)
K K Singh, Dept. of CSE, RGUKT Nuzvid
8/19/2017
Exploring DATA
- cd <- read.table('custData.csv', sep=',', header=TRUE)
- Once we've loaded the data into R, we'll want to examine it.
- class(): tells you what type of R object you have. In our case, read.table() returns a data frame.
- summary(): gives a summary of almost any R object.
- str(): shows the structure of a data table/frame (column names, types, and the first few values).
- names(): gives the column names of a data table/frame.
- dim(): gives the number of rows and columns of the data.
- Data exploration uses a combination of summary statistics (means and medians, variances, and counts) and visualization. You can spot some problems just by using summary statistics; other problems are easier to find visually.
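The inspection functions listed above can be tried on a small stand-in data frame. This is only a sketch: custData.csv is not available here, so the column names and values below are illustrative assumptions rather than the real customer data.

```r
# Toy stand-in for the customer data (columns and values are made up for illustration)
cd <- data.frame(
  custid = 1:5,
  age    = c(23, 45, 31, NA, 67),
  income = c(22000, 58000, 41000, 35000, -500)
)

class(cd)    # type of the object
dim(cd)      # number of rows, then number of columns
names(cd)    # column names
str(cd)      # compact structure: one line per column with its type
summary(cd)  # per-column summary statistics, including NA counts
```

With the real file, the same calls would be applied to the object returned by read.table().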
OTHER DATA FORMATS
- .csv is not the only common data file format you'll encounter. Other formats include:
  - .tsv (tab-separated values),
  - pipe-separated files,
  - Microsoft Excel workbooks,
  - JSON data,
  - and XML.
- R's built-in read.table() command can be made to read most separated-value formats.
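As a rough illustration of the last point, the sep argument of read.table() selects the delimiter. The sketch below reads inline text through the text argument so that it runs without any files; the column names and values are invented for the example.

```r
# The same two-row table, once tab-separated and once pipe-separated (made-up data)
tsv_text  <- "custid\tstate\n1\tOhio\n2\tTexas"
pipe_text <- "custid|state\n1|Ohio\n2|Texas"

# Only the sep argument changes between the two formats
tsv  <- read.table(text = tsv_text,  sep = "\t", header = TRUE, stringsAsFactors = FALSE)
pipe <- read.table(text = pipe_text, sep = "|",  header = TRUE, stringsAsFactors = FALSE)

tsv
pipe
```

For a file on disk, the file name would be passed as the first argument in place of text. Excel, JSON, and XML are not separated-value formats and are typically read with add-on packages instead.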
- library(data.table)  # fread() comes from the data.table package
- custdata <- fread("custData.csv")
- summary(custdata)
Typical problems revealed by data summaries
- Missing values
- Invalid values and outliers
- Data range
- Units
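A minimal sketch of how these problems can show up in practice. The data frame below is a made-up example (not the real custdata), built so that each problem is visible.

```r
# Made-up data containing missing values, an invalid income, and implausible ages
custdata <- data.frame(
  age         = c(0, 34, 52, 146.7, 29),
  income      = c(-8700, 35000, NA, 61000, 48000),
  is.employed = c(TRUE, NA, NA, FALSE, TRUE)
)

summary(custdata)                          # NA counts and min/max appear per column

na_counts   <- colSums(is.na(custdata))    # missing values per column
age_range   <- range(custdata$age)         # data range: are 0 and 146.7 plausible ages?
neg_incomes <- sum(custdata$income < 0, na.rm = TRUE)  # invalid values: negative income

na_counts
age_range
neg_incomes
```

Unit problems (for example, income recorded in dollars versus thousands of dollars) cannot be detected by code alone; the minimum, maximum, and median from summary() are the usual clues.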
Data Cleaning
- Fundamentally, there are two things you can do with missing values: drop the rows that contain them, or convert the missing values to a meaningful value.
- If the missing data represents a fairly small fraction of the dataset, it's probably safe just to drop these customers from your analysis. But if the fraction is significant, what do you do then?
- The most straightforward solution is to create a new category for the variable, called missing.
- f <- ifelse(is.na(custdata$is.employed), "missing", ifelse(custdata$is.employed, "employed", "not_employed"))
- summary(as.factor(f))
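Both options described above can be sketched on toy data. The values are invented for illustration, and the column name is.employed.fix is a hypothetical name chosen for this example.

```r
# Toy data: two of five customers have a missing employment status
custdata <- data.frame(
  custid      = 1:5,
  is.employed = c(TRUE, FALSE, NA, TRUE, NA)
)

# How much is missing? The fraction guides the choice between the two options
frac_missing <- mean(is.na(custdata$is.employed))

# Option 1: drop the rows that contain missing values
dropped <- custdata[complete.cases(custdata), ]

# Option 2: keep every row and recode NA as its own category
custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed), "missing",
                            ifelse(custdata$is.employed, "employed", "not_employed"))
table(custdata$is.employed.fix)
```

Here 40% of the rows would be lost by dropping, which is the kind of situation where the extra category is the safer choice.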
Data transformations
The purpose of data transformation is to make data easier to model and easier to understand. For example, the cost of living varies from state to state, so what would be a high salary in one region could be barely enough to scrape by in another. If you want to use income as an input to your insurance model, it might be more meaningful to normalize a customer's income by the typical income in the area where they live.
custdata <- merge(custdata, medianincome, by.x="state.of.res", by.y="State")
summary(custdata[, c("state.of.res", "income", "Median.Income")])
custdata$income.norm <- with(custdata, income/Median.Income)
# or, if custdata is a data.table (for example, loaded with fread()):
custdata$income.norm <- custdata[, income/Median.Income]
summary(custdata$income.norm)
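The merge-then-divide step can be run on a small example. The two tables below are made-up stand-ins for custdata and medianincome; only the column names are taken from the listing above.

```r
# Made-up customers and made-up state medians, for illustration only
custdata <- data.frame(
  state.of.res = c("Ohio", "Texas", "Ohio"),
  income       = c(50000, 30000, 25000),
  stringsAsFactors = FALSE
)
medianincome <- data.frame(
  State         = c("Ohio", "Texas"),
  Median.Income = c(50000, 60000),
  stringsAsFactors = FALSE
)

# Attach each customer's state median, then express income relative to it
custdata <- merge(custdata, medianincome, by.x = "state.of.res", by.y = "State")
custdata$income.norm <- with(custdata, income / Median.Income)
custdata
```

A value of 1 means the customer earns exactly the median for their state; values below 1 are below the local median. Note that merge() may reorder the rows, so results should be matched by key rather than by position.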
CONVERTING CONTINUOUS VARIABLES TO DISCRETE
- For some variables, the exact value matters less than the range it falls in. In these cases, you might want to convert the continuous age and income variables into ranges, or discrete variables.
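One common way to do this in base R is cut() for ranges and a simple comparison for a two-way split. The break points, labels, and the 20000 threshold below are arbitrary choices made for the sketch, not values from the lecture.

```r
age    <- c(22, 40, 70, 25)
income <- c(15000, 52000, 18000, 90000)

# Ranges: (0,25], (25,65], (65,Inf]; include.lowest also admits age 0
age.range <- cut(age,
                 breaks = c(0, 25, 65, Inf),
                 labels = c("0-25", "25-65", "65+"),
                 include.lowest = TRUE)

# A two-level discretization: is the income below the (assumed) threshold?
income.lt.20K <- income < 20000

data.frame(age, age.range, income, income.lt.20K)
```

Because intervals are closed on the right by default, an age of exactly 25 falls in the "0-25" group.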
NORMALIZATION AND RESCALING
Normalization is useful when absolute quantities are less meaningful than relative ones.
- For example, you might be less interested in a customer's absolute age than in how old or young they are relative to a "typical" customer. Let's take the mean age of your customers to be the typical age. You can normalize by that, as shown in the following listing.
- summary(custdata$age)
- meanage <- mean(custdata$age)
- custdata$age.normalized <- custdata$age/meanage
- summary(custdata$age.normalized)
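One caveat with the listing above: mean() returns NA if the column still contains missing values, which would make every normalized value NA. A small sketch on made-up ages, assuming the missing values should simply be ignored when computing the mean:

```r
age <- c(20, 40, NA, 60)   # made-up ages with one missing value

mean(age)                              # NA, because of the missing value
meanage <- mean(age, na.rm = TRUE)     # ignore NA when computing the typical age
age.normalized <- age / meanage        # the missing entry stays NA

meanage
age.normalized
```

A normalized value well below 1 indicates an unusually young customer, and a value well above 1 an unusually old one.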
Data Sampling
- Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.
- It's easier to test and debug code on small subsamples before training the model on the entire dataset. Visualization can also be easier with a subsample of the data.
- The other reason to sample your data is to create test and training splits.
- A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif function. You can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the sample group column.
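The sample group column can be sketched as follows. The column name gp, the seed, the 10% threshold, and the toy data frame are all assumptions made for this example.

```r
set.seed(25643)                        # arbitrary seed, so the split is reproducible

custdata <- data.frame(custid = 1:1000)        # toy data: 1000 customers
custdata$gp <- runif(nrow(custdata))           # one uniform draw in [0, 1] per row

# Thresholding the group column gives a roughly 10% / 90% split
testSet     <- subset(custdata, gp <= 0.1)
trainingSet <- subset(custdata, gp >  0.1)

nrow(testSet)
nrow(trainingSet)
```

Because the group value is stored with each row, the same split can be reproduced later, and a different sample size only needs a different threshold.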
Data visualization (refer to the lecture on graph plotting)
- Visually checking distributions for a single variable:
  - What is the peak value of the distribution?
  - How many peaks are there in the distribution (unimodality versus bimodality)?
  - How normal (or lognormal) is the data?
  - How much does the data vary? Is it concentrated in a certain interval or in a certain category?
- Is there a relationship between the two inputs age and income in my data?
Graph types and their uses
1. Line plot: shows the relationship between two continuous variables. Best when that relationship is functional.
2. Scatter plot: shows the relationship between two continuous variables. Best when the relationship is too loose or cloud-like to be seen on a line plot.
3. Bar chart: shows the relationship between two categorical variables (var1 and var2). Highlights the frequencies of each value of var1.
4. Bar chart with faceting: shows the relationship between two categorical variables (var1 and var2). Best for comparing the relative frequencies of each value of var2 within each value of var1 when var2 takes on more than two values.
5. Histogram or density plot: examines the data range, checks the number of modes, checks whether the distribution is normal/lognormal, and checks for anomalies and outliers. (Use a log scale to visualize data that is heavily skewed.)
6. Box-and-whisker plot (boxplot): presents information from a five-number summary. Useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers). Very useful when large numbers of observations are involved and when two or more data sets are being compared.
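The graph types above can be sketched with base R graphics on simulated data. Everything here is an assumption for illustration: the variables are randomly generated, and the side-by-side bar chart (beside = TRUE) is used as a base-graphics stand-in for faceting, which is not the same thing as true facets.

```r
set.seed(1)
age     <- round(runif(200, 18, 80))
income  <- 1000 * age + rnorm(200, sd = 10000)          # simulated, loosely tied to age
marital <- sample(c("married", "single"), 200, replace = TRUE)
housing <- sample(c("own", "rent", "other"), 200, replace = TRUE)

pdf(NULL)   # null device so the sketch runs without a display; omit when working interactively

x <- seq(0, 10, by = 0.5)
plot(x, x^2, type = "l")                          # 1. line plot (functional relationship)
plot(age, income)                                 # 2. scatter plot
barplot(table(marital))                           # 3. bar chart of one categorical variable
barplot(table(marital, housing), beside = TRUE)   # 4. grouped bars for two categorical variables
hist(income)                                      # 5. histogram
plot(density(income))                             # 5. density plot
boxplot(income ~ marital)                         # 6. boxplot, one box per group

dev.off()

five <- fivenum(income)   # the five-number summary that a boxplot draws
five
```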
Assignments
- Load the nycflights data set.
- 1. Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many such records are there?
- 2. Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the highest IQR of arrival delays?
- 3. Considering the data from all the NYC airports, which month has the highest average departure delay?
- 4. What was the worst day to fly out of NYC in 2013 if you dislike delayed flights?
- 5. Make a histogram and calculate appropriate summary statistics for arrival delays of sfo_feb_flights. Which of the following is false?
5. Working on data using R: cleaning, filtering, transformation, sampling
