5. Working on data using R: cleaning, filtering, transformation, sampling

Working on data (cleaning, filtering, transformation, sampling, visualization)
K K Singh, Dept. of CSE, RGUKT Nuzvid
8/19/2017

Exploring DATA
 cd <- read.table('custData.csv', sep=',', header=TRUE)
 Once we’ve loaded the data into R, we’ll want to examine it.
 class(): Tells us what type of R object we have. In our case, class(cd) returns "data.frame".
 summary(): Gives a summary of almost any R object.
 str(): Gives the structure of the data table/frame (column names, types, and the first few values).
 names(): Gives the column names of the data table/frame.
 dim(): Gives the number of rows and columns of the data.
 Data exploration uses a combination of summary statistics (means and
medians, variances, and counts) and visualization. You can spot some
problems just by using summary statistics; other problems are easier to
find visually.
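The exploration functions listed above can be tried on a tiny table. This is a minimal sketch: the three-row CSV text is a hypothetical stand-in for custData.csv, and its column names (custid, age, income) are assumptions made for illustration.

```r
# Hypothetical stand-in for custData.csv: three customers, three columns.
csv_text <- "custid,age,income
1,34,52000
2,51,78000
3,27,31000"

# read.table() accepts a text= argument, so no file is needed for this sketch.
cd <- read.table(text = csv_text, sep = ",", header = TRUE)

class(cd)    # type of the R object
dim(cd)      # number of rows, then number of columns
names(cd)    # column names
str(cd)      # compact structure: one line per column with its type
summary(cd)  # per-column summary statistics
```

With real data, the same five calls are usually the first thing to run after loading.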
2

OTHER DATA FORMATS
 .csv is not the only common data file format you’ll encounter. Other formats include
 .tsv (tab-separated values),
 pipe-separated files,
 Microsoft Excel workbooks,
 JSON data,
 and XML.
 R’s built-in read.table() command can be made to read most separated value formats.
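As a rough illustration of how read.table() handles other separated-value formats, the sketch below writes the same small table as a tab-separated file and as a pipe-separated file, then reads both back. The file contents and the variable names (tsv_file, pipe_file) are hypothetical.

```r
# Hypothetical data written to temporary files so the sketch is self-contained.
tsv_file  <- tempfile(fileext = ".tsv")
pipe_file <- tempfile(fileext = ".txt")
writeLines(c("custid\tstate", "1\tOhio", "2\tTexas"), tsv_file)
writeLines(c("custid|state", "1|Ohio", "2|Texas"), pipe_file)

# Only the sep argument changes between the two formats.
tsv_data  <- read.table(tsv_file,  sep = "\t", header = TRUE)
pipe_data <- read.table(pipe_file, sep = "|",  header = TRUE)

# Compare the two results: both files describe the same table.
identical(tsv_data, pipe_data)
```

Excel workbooks, JSON, and XML are not separated-value formats and need other tools; they are outside this sketch.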

 library(data.table)  # fread() comes from the data.table package
 custdata <- fread("custData.csv")
 summary(custdata)

Typical problems revealed by data summaries
 MISSING VALUES
 INVALID VALUES AND OUTLIERS

Typical problems revealed by data summaries
 DATA RANGE
 UNITS
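A small sketch of how these problems show up in a summary. The data frame below is hypothetical, with problems planted on purpose: missing employment status, a negative income, an age of zero, and an implausibly high age.

```r
# Hypothetical customer data with deliberately planted problems.
custdata <- data.frame(
  is.employed = c(TRUE, NA, FALSE, NA, TRUE),
  income      = c(52000, -8700, 31000, 615000, 40000),
  age         = c(34, 0, 51, 146.7, 27)
)

# The summary shows NA counts and the min/max of each numeric column.
summary(custdata)

# Missing values: count NAs per column.
na_counts <- colSums(is.na(custdata))

# Invalid values: negative income and an age of zero cannot be right.
n_negative_income <- sum(custdata$income < 0)
n_zero_age        <- sum(custdata$age == 0)

# Data range: a very wide range hints at outliers or mixed units.
income_range <- range(custdata$income)
```

Whether a suspicious value is an error, a sentinel code, or a real outlier usually has to be checked against how the data was collected.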

Data Cleaning
 Fundamentally, there are two things you can do with missing values: drop the
rows with missing values, or convert the missing values to a meaningful value.
 If the missing data represents a fairly small fraction of the dataset, it’s probably safe
just to drop these customers from your analysis. But if it is significant, what do you
do then?
 For a categorical variable, the most straightforward solution is just to create a new
category for the variable, called missing.
 f <- ifelse(is.na(custdata$is.employed), "missing",
ifelse(custdata$is.employed == TRUE, "employed", "not_employed"))
 summary(as.factor(f))
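The other option, dropping rows with missing values, can be sketched as follows. The six-row data frame is hypothetical; only the is.employed column name comes from the slide.

```r
# Hypothetical data: two of six customers have no employment status.
custdata <- data.frame(
  custid      = 1:6,
  is.employed = c(TRUE, NA, FALSE, TRUE, NA, TRUE),
  income      = c(52000, 18000, 31000, 64000, 22000, 40000)
)

# Fraction of rows with a missing value in this column.
frac_missing <- mean(is.na(custdata$is.employed))

# Drop rows where this one column is missing.
employed_known <- custdata[!is.na(custdata$is.employed), ]

# Or drop rows with a missing value in any column.
complete_rows <- na.omit(custdata)

nrow(employed_known)
nrow(complete_rows)
```

Checking frac_missing first makes the "small fraction" judgment explicit before any rows are discarded.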


Data transformations
The purpose of data transformation is to make data easier to model—and easier to
understand. For example, the cost of living will vary from state to state, so what would
be a high salary in one region could be barely enough to scrape by in another. If you
want to use income as an input to your insurance model, it might be more meaningful
to normalize a customer’s income by the typical income in the area where they live.
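# Add each state's median income to every customer row
custdata <- merge(custdata, medianincome, by.x="state.of.res", by.y="State")
summary(custdata[, c("state.of.res", "income", "Median.Income")])
# Normalize each income by the median income of the customer's state
custdata$income.norm <- with(custdata, income/Median.Income)
# OR, if custdata is a data.table:
custdata$income.norm <- custdata[, income/Median.Income]
summary(custdata$income.norm)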
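The merge-and-normalize steps can be run end to end on made-up numbers. In this sketch the customer rows and the median incomes are hypothetical; the column names follow the listing above, and plain data frames are used so no extra package is needed.

```r
# Hypothetical customers and hypothetical state median incomes.
custdata <- data.frame(
  custid       = 1:4,
  state.of.res = c("Ohio", "Ohio", "Texas", "Texas"),
  income       = c(30000, 60000, 45000, 90000)
)
medianincome <- data.frame(
  State         = c("Ohio", "Texas"),
  Median.Income = c(50000, 60000)
)

# Attach the state median to each customer, matching on the state name.
custdata <- merge(custdata, medianincome,
                  by.x = "state.of.res", by.y = "State")

# A value above 1 means the customer earns more than the state median.
custdata$income.norm <- with(custdata, income / Median.Income)

custdata[, c("custid", "state.of.res", "income", "income.norm")]
```

Note that merge() keeps only customers whose state appears in medianincome, so a mismatch in state spelling silently drops rows.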

CONVERTING CONTINUOUS VARIABLES TO DISCRETE
 For some continuous variables, the exact value matters less than the range it
falls into. In these cases, you might want to convert the continuous age and income
variables into ranges, or discrete variables.
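One way to do this conversion in base R is cut() for ranges and a simple comparison for a yes/no threshold. The break points (25 and 65 years) and the 20000 income threshold below are assumptions chosen for illustration, not values from the slides.

```r
# Hypothetical ages and incomes.
age    <- c(18, 25, 40, 65, 80)
income <- c(12000, 30000, 75000)

# cut() assigns each age to an interval; intervals are closed on the right,
# and include.lowest = TRUE also closes the first interval on the left.
age.range <- cut(age, breaks = c(0, 25, 65, Inf), include.lowest = TRUE)
table(age.range)

# A threshold turns a continuous variable into a logical (two-level) one.
income.lt.20K <- income < 20000
income.lt.20K
```

Because intervals are closed on the right by default, an age of exactly 25 or 65 falls into the lower of the two neighboring ranges.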

NORMALIZATION AND RESCALING
Normalization is useful when absolute quantities are less meaningful than relative ones.
 For example, you might be less interested in a customer’s absolute age than in how old or young
they are relative to a “typical” customer. Let’s take the mean age of your customers to be the typical
age. You can normalize by that, as shown in the following listing.
 summary(custdata$age)
 meanage <- mean(custdata$age)
 custdata$age.normalized <- custdata$age/meanage
 summary(custdata$age.normalized)
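A worked version of the listing on made-up ages, followed by one common form of rescaling that the slide does not show: centering on the mean and dividing by the standard deviation. The age values are hypothetical.

```r
# Hypothetical ages.
age <- c(20, 30, 40, 50, 60)

# Normalize by the mean: 1 is "typical", below 1 is younger, above 1 is older.
meanage <- mean(age)
age.normalized <- age / meanage
age.normalized

# Rescaling sketch (an addition, not from the slide): distance from the mean
# measured in standard deviations.
stdage <- sd(age)
age.rescaled <- (age - meanage) / stdage
age.rescaled
```

The first form answers "how many times the typical age", the second "how unusual is this age given the spread of the data".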

Data Sampling
 Sampling is the process of selecting a subset of a population to
represent the whole, during analysis and modeling.
 One reason to sample: it’s easier to test and debug the code on small
subsamples before training the model on the entire dataset. Visualization
can also be easier with a subsample of the data.
 The other reason to sample your data is to create test and training
splits.

 A convenient way to manage random sampling is to add a sample group column to the data frame. The
sample group column contains a number generated uniformly from zero to one, using the runif function. You
can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the
sample group column.
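The sample group column approach might look like the sketch below. The column name gp, the 1000-row data frame, and the 10% threshold are assumptions for illustration.

```r
# Fix the seed so the same rows are selected on every run.
set.seed(42)

# Hypothetical data frame with 1000 customers.
custdata <- data.frame(custid = 1:1000)

# Sample group column: one uniform random number in [0, 1] per row.
custdata$gp <- runif(nrow(custdata))

# A threshold of 0.1 selects roughly (not exactly) 10% of the rows.
sample10 <- subset(custdata, gp <= 0.1)

# The same column gives a test/training split with no overlap.
testSet     <- subset(custdata, gp <= 0.1)
trainingSet <- subset(custdata, gp >  0.1)

c(test = nrow(testSet), training = nrow(trainingSet))
```

Storing the random number in the data means the split is repeatable and can be re-thresholded later, for example to pull a 50% sample from the same column.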

Data visualization (refer to the lecture on graph plotting)
 Visually checking distributions for a single variable
 What is the peak value of the distribution?
 How many peaks are there in the distribution (unimodality versus bimodality)?
 How normal (or lognormal) is the data?
 How much does the data vary? Is it concentrated in a certain interval or in a certain
category?
 Is there a relationship between the two inputs age and income in my data?
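The single-variable questions above can be explored with base R graphics. This is a sketch on simulated data: the income values are drawn from a lognormal distribution with arbitrary parameters, purely for illustration.

```r
# Simulated, hypothetical income data (lognormal, so heavily right-skewed).
set.seed(1)
income <- rlnorm(500, meanlog = 10.5, sdlog = 0.8)

# Histogram: shows the range, the number of peaks, and where values concentrate.
h <- hist(income, breaks = 30, main = "Income", xlab = "income")

# Density plot: a smoothed view; the x position of its maximum estimates the peak.
d <- density(income)
plot(d, main = "Income density")
peak <- d$x[which.max(d$y)]

# On a log scale, lognormal data should look roughly bell-shaped.
hist(log10(income), breaks = 30, main = "Income (log10)", xlab = "log10(income)")

# Numeric companions to the plots: range and spread.
range(income)
sd(income)
```

For the two-variable question (age versus income), a scatter plot of one against the other is the usual starting point.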

Graph type: Uses
1. Line plot: Shows the relationship between two continuous variables. Best when
that relationship is functional.
2. Scatter plot: Shows the relationship between two continuous variables. Best when the
relationship is too loose or cloud-like to be seen on a line plot.
3. Bar chart: Shows the relationship between two categorical variables (var1 and var2).
Highlights the frequencies of each value of var1.
4. Bar chart with faceting: Shows the relationship between two categorical variables
(var1 and var2). Best for comparing the relative frequencies of each value of var2 within
each value of var1 when var2 takes on more than two values.
5. Histogram or density plot: Examines data range, checks number of modes, checks if
distribution is normal/lognormal, checks for anomalies and outliers. (Use a log scale to
visualize data that is heavily skewed.)
6. Box-and-whisker plot (boxplot): Presents information from a five-number summary.
Useful for indicating whether a distribution is skewed and whether there are potential
unusual observations (outliers). Very useful when large numbers of observations are
involved and when two or more data sets are being compared.
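A rough sketch of these graph types using only base R graphics. All data is simulated, and the variable names (age, income, marital, housing) are hypothetical; the lecture on graph plotting may use a different plotting package.

```r
# Simulated, hypothetical data.
set.seed(7)
x       <- seq(0, 10, length.out = 50)
age     <- round(runif(200, 20, 70))
income  <- 1000 * age + rnorm(200, sd = 15000)
marital <- sample(c("married", "single"), 200, replace = TRUE)
housing <- sample(c("own", "rent", "other"), 200, replace = TRUE)

# 1. Line plot: a functional relationship between two continuous variables.
plot(x, sin(x), type = "l")

# 2. Scatter plot: a loose, cloud-like relationship.
plot(age, income)

# 3. Bar chart of two categorical variables (one bar per housing type,
#    stacked by marital status).
counts <- table(marital, housing)
barplot(counts, legend.text = TRUE)

# 4. Faceting approximated with one panel per marital status.
old_par <- par(mfrow = c(1, 2))
for (m in rownames(counts)) barplot(counts[m, ], main = m)
par(old_par)

# 5. Histogram of a single variable.
hist(income)

# 6. Boxplot per group; fivenum() gives the five-number summary it is based on.
boxplot(income ~ marital)
five <- fivenum(income)
```

Run non-interactively, R sends these plots to its default graphics device rather than to the screen.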

Assignments
 data(nycflights)  # requires a package that provides the nycflights data set
 1. Create a new data frame that includes flights headed to SFO in February,
and save this data frame as sfo_feb_flights. How many such records are
there?
 2. Calculate the median and interquartile range for arr_delays of flights in
the sfo_feb_flights data frame, grouped by carrier. Which carrier has the
highest IQR of arrival delays?
 3. Considering the data from all the NYC airports, which month has the
highest average departure delay?
 4. What was the worst day to fly out of NYC in 2013 if you dislike delayed
flights?
 5. Make a histogram and calculate appropriate summary statistics for
arrival delays of sfo_feb_flights. Which of the following is false?
