Unit I - introduction to r language 2.pptx

Unit I
Data Manipulations
Data Already in R – Reading Data – Reading and Formatting Data Sets –
Manipulating Data with dplyr – Tidying Data with tidyr

Data Manipulation
• Data science is as much about manipulating data as it is about fitting
models to data.
• Data rarely arrives in a form that we can directly feed into the
statistical models or machine learning algorithms.
• The first stages of data analysis are almost always figuring out how to
load the data into R and then how to transform it into a shape you can
readily analyze.

Data Already in R:
• Some data sets are already built into R or available in R packages.
• datasets is a package distributed with R that contains a collection of
example data sets. (It should not be confused with the separate CRAN
package dataset, whose aim is to make tidy data sets easier to release,
exchange and reuse.)
• We can load the package into R using the library() function.
library(datasets)
If we want a list of the data sets in the package, together with a short
description of each, then use
library(help = "datasets")
It also shows the package name, version, title, author, etc.

To load an actual data set into R's memory, use the data() function.
For example,
data("CO2")

To display the first 6 rows of a data set:
head(CO2)

To plot a graph:
plot(conc ~ uptake, data = CO2)

• Another package with several useful data sets is mlbench.
• It contains data sets for machine learning benchmarks, so these
data sets are aimed at testing how new methods perform on
known data sets.
• This package is not distributed together with R, but you can
install it, load it, and get a list of the data sets within it like this:
install.packages("mlbench")
library(mlbench)
library(help = "mlbench")

Quickly Reviewing Data
• head(data_frame, n) -> displays the first n rows (6 by default).
• head(CO2, 3) -> displays the first 3 rows of the CO2 data set.
• Similarly,
• tail(CO2, 3) -> displays the last 3 rows.
• summary(CO2) -> displays summary statistics for all the columns in a
data frame.
• str(CO2) -> displays the type of each column.
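
R is case-sensitive, so CO2 and co2 name two different built-in data sets. The short sketch below, which uses only base R, checks which one is the data frame before reviewing it.

```r
# CO2 (upper case) is a data frame of plant CO2 uptake measurements;
# co2 (lower case) is a separate time series from the same package.
data("CO2")

is.data.frame(CO2)  # TRUE
is.ts(co2)          # TRUE: co2 is a time series, not a data frame

dim(CO2)            # number of rows and columns
first_rows <- head(CO2, 3)   # first 3 rows
last_rows  <- tail(CO2, 3)   # last 3 rows
summary(CO2)        # summary statistics per column
str(CO2)            # type of each column
```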

Reading Data
• There are several packages for reading data in different file formats,
from Excel to JSON to XML and so on.
• For data stored as plain-text tables, R also has plenty of built-in
functions. To see them, use
?read.table
• read.table() is a function in R that reads a file in table format and creates
a data frame from it.
Example
my_data <- read.table("data.txt", header = TRUE, sep = "\t")

Arguments in read.table():
• header: This is a boolean value telling the function whether it should
consider the first line in the input file a header line.
• col.names: If the first line is not used to specify the header, you can
use this option to name the columns.
• dec: This is the character used as the decimal point in numbers ("." by
default).
• comment.char: By default, the function assumes that “#” is the start of
a comment and ignores the rest of a line when it sees it.
• colClasses: This lets you specify which type each column should have,
so here you can specify that some columns should be factors, and
others should be strings
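
The arguments above can be combined in one call. The following is a minimal sketch using base R only; the input text, the column names and the chosen column types are made up for illustration.

```r
# Hypothetical input: semicolon-separated, comma as decimal mark,
# no header line, and one comment line starting with "#".
raw_text <- paste(
  "# made-up measurements for illustration",
  "1;a;2,5",
  "2;b;3,75",
  sep = "\n"
)

my_data <- read.table(
  text = raw_text,          # read from a string instead of a file
  header = FALSE,           # the first data line is not a header
  sep = ";",                # field separator
  dec = ",",                # decimal mark used in the numbers
  comment.char = "#",       # lines starting with "#" are ignored
  col.names = c("id", "group", "score"),
  colClasses = c("integer", "factor", "numeric")
)

str(my_data)
```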

Installing a Package in RStudio
• To install the `mlbench` package in RStudio, you can follow these steps:
1. Open RStudio and click on the **Packages** tab in the bottom right
pane.
2. Click on the **Install** button.
3. In the **Install Packages** dialog box, type `mlbench` in the
**Packages** field.
4. Select the **Install dependencies** option.
5. Click on the **Install** button to install the package.

Examples of Reading and Formatting Data Sets
Breast Cancer Data set :
• As a first example of reading data from a text file, we consider the
BreastCancer data set from mlbench.
• Because a formatted version is available in mlbench, we have something to
compare our results with.
library(mlbench)
data(BreastCancer)
head(BreastCancer, 3)

• The URL to the actual data is
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
• To get the data, we could either
• go to the URL and save the file, or
• read the data directly from the URL.
• We can read the data and get it as a vector of lines using the readLines()
function, with the URL stored in the variable data_url:
lines <- readLines(data_url)
lines[1:5]

• For this data, it seems to be a comma-separated values file without a
header line. So save the data with the “.csv” suffix.
• Boston Housing Data Set: For the second example of loading data,
we take another data set from the mlbench package. (Refer to the textbook.)
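
For a comma-separated file without a header line, such as the breast cancer file described above, reading and formatting could look like the sketch below. To stay runnable offline it writes a few made-up rows to a temporary .csv file. The rows, the shortened column list and the coding of the class column (2 = benign, 4 = malignant) are assumptions for illustration, not the real records.

```r
# Made-up rows in a header-less, comma-separated layout (NOT real data).
sample_lines <- c("1001,5,1,2", "1002,3,4,4", "1003,8,10,4")
csv_path <- tempfile(fileext = ".csv")
writeLines(sample_lines, csv_path)

# Inspect the raw lines first, as with readLines(data_url).
lines <- readLines(csv_path)
lines[1:2]

# No header line, so name the columns ourselves (illustrative names).
raw_data <- read.csv(
  csv_path,
  header = FALSE,
  col.names = c("Id", "Cl.thickness", "Cell.size", "Class")
)

# Format the numeric class code as a labelled factor (assumed coding).
raw_data$Class <- factor(raw_data$Class,
                         levels = c(2, 4),
                         labels = c("benign", "malignant"))
str(raw_data)
```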

The readr package:
• readr is an R package that provides a fast and friendly way to read
rectangular data from delimited files, such as comma-separated values
(CSV) and tab-separated values (TSV)
• It is designed to parse many types of data.
• To install readr,
• you can either install the whole tidyverse by running
install.packages("tidyverse") or
• install just readr by running install.packages("readr")
• Once installed, you can load readr with library(readr)

• readr supports the following file formats with these read_*() functions:
• read_csv(): comma-separated values (CSV)
• read_tsv(): tab-separated values (TSV)
• read_csv2(): semicolon-separated values with , as the decimal mark
• read_delim(): delimited files (CSV and TSV are important special
cases)
• read_fwf(): fixed-width files
• read_table(): whitespace-separated files
• read_log(): web log files
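
As a small sketch of two of these functions, the example below reads the same made-up values once with read_csv() and once with read_csv2(). It assumes readr is installed; the file contents and column names are hypothetical. The compact col_types string "id" asks for an integer column followed by a double column.

```r
library(readr)

# Comma-separated file with "." as the decimal mark.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,2.5", "2,3.75"), csv_path)
scores <- read_csv(csv_path, col_types = "id")

# Semicolon-separated file with "," as the decimal mark.
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("id;score", "1;2,5", "2;3,75"), csv2_path)
scores2 <- read_csv2(csv2_path, col_types = "id")

# Both calls return tibbles holding the same numbers.
scores
scores2
```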

Manipulating Data with dplyr
• Data frames are ideal for representing tabular data
• Nearly all packages that implement statistical models or machine
learning algorithms in R work on data frames.
• But to actually manipulate a data frame, you often have to write a lot
of code to filter data, rearrange data, summarize it in various ways,
and such.
• A few years ago, manipulating data frames required a lot more
programming than actually analyzing data.
• That has improved dramatically with the dplyr package (pronounced
“d plier” where “plier” is pronounced as “pliers”).
• dplyr is not part of base R, so it has to be installed separately.
• It helps to resolve the most frequent data manipulation hurdles.
• It provides simple “verbs”, functions that each handle one common
data manipulation task, so that thoughts can be translated into code
faster.
• This package provides a number of convenient functions that let you
modify data frames in various ways and string them together in pipes
using the %>% or |> operator.
• If you load dplyr, you get a large selection of functions that let you
build pipelines for data frame manipulation.
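
To make the idea of a pipeline concrete, the sketch below writes the same two-step manipulation three ways: as nested function calls, with the %>% operator, and with the native |> operator. It assumes dplyr is installed and R is version 4.1 or later (needed for |>).

```r
library(dplyr)

# Nested calls: read from the inside out.
nested <- head(select(iris, Species, Petal.Length), 3)

# The same steps as a pipeline: read from left to right.
piped_magrittr <- iris %>% select(Species, Petal.Length) %>% head(3)
piped_native   <- iris |> select(Species, Petal.Length) |> head(3)

# All three forms give the same data frame.
identical(nested, piped_magrittr)
identical(nested, piped_native)
```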

Some Useful dplyr Functions
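• dplyr can work with several representations of a data frame. as_tibble()
converts a data frame into a tibble, one such representation.
• (illustrate with output)
• Convert to a tibble, using either pipe operator:
iris %>% as_tibble()
iris |> as_tibble()
• select() picks columns:
iris %>% as_tibble() %>% select(Petal.Width, Petal.Length) %>% head(3)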

• select() can also drop columns by name pattern:
iris %>% as_tibble() %>% select(-starts_with("Petal")) %>% head(3)
• mutate() adds columns computed from existing ones:
iris %>% as_tibble() %>%
  mutate(Petal.Width.plus.Length = Petal.Width + Petal.Length) %>%
  select(Species, Petal.Width.plus.Length) %>% head(3)
iris %>% as_tibble() %>%
  mutate(Petal.Width.plus.Length = Petal.Width + Petal.Length,
         Sepal.Width.plus.Length = Sepal.Width + Sepal.Length) %>%
  select(Petal.Width.plus.Length, Sepal.Width.plus.Length) %>% head(3)

• arrange() sorts rows (use desc() for descending order):
iris %>% as_tibble() %>% arrange(Sepal.Length) %>% head(3)
iris %>% as_tibble() %>% arrange(desc(Sepal.Length)) %>% head(3)
• group_by() groups rows, and summarise() computes per-group statistics:
iris %>% as_tibble() %>% group_by(Species) %>% head(3)
iris %>% group_by(Species) %>%
  summarise(Mean.Petal.Length = mean(Petal.Length))

• Breast Cancer Data Manipulation – Refer to the textbook

Tidying Data with tidyr
Tidy data is a standard way of mapping the meaning of a data set
to its structure. A data set is messy or tidy depending on how rows,
columns and tables are matched up with observations, variables
and types.
Hadley Wickham

• Tidy data can be plotted or summarized efficiently.
• It mostly comes down to what data is represented as columns in a data frame
and what is not.
• For example, if I want to look at the iris data set and see how the Petal.Length
varies among species, then I can look at the Species column against the
Petal.Length column:
iris |>
as_tibble() |>
select(Species, Petal.Length) |>
head(3)
• The same two columns can also be plotted as a graph.

• This works because we have a column for the x-axis and another for
the y-axis.
• But suppose we want to plot the different measurements of the irises to
see how they are related. Each measurement is a separate column, so there
is no single column to use for the measurement type.
• In such a case we can use the tidyr package.
library(tidyr)
• It has a function, pivot_longer(), that reshapes the data frame so that
the names of selected columns are stored in one new column and their
values in another.

• The pivot_longer() function is designed to reshape data from a wider
format to a longer format.
• It makes the data easier to analyze and visualize.
• Its first argument, data, is the data frame or tibble to be reshaped.

• What it does is essentially transforming the data frame such that you
get one column containing the name of your original columns and
another column containing the values in those columns.
• In the iris data set, we have observations for sepal length and sepal
width.
• If we want to examine Species vs. Sepal.Length or Sepal.Width, we can
readily do this.
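
A minimal sketch of the reshaping described above, assuming dplyr and tidyr are installed and R is version 4.1 or later. The new column names Attribute and Measurement are illustrative choices, not names fixed by tidyr.

```r
library(dplyr)
library(tidyr)

long_iris <- iris |>
  as_tibble() |>
  select(Species, Sepal.Length, Sepal.Width) |>
  pivot_longer(
    cols = c(Sepal.Length, Sepal.Width),  # columns to stack
    names_to = "Attribute",               # holds the old column names
    values_to = "Measurement"             # holds the old cell values
  )

# Each original row now appears twice, once per measurement.
head(long_iris, 4)

# With one value column, a per-group summary covers both measurements.
long_iris |>
  group_by(Species, Attribute) |>
  summarise(Mean = mean(Measurement), .groups = "drop")
```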

• pivot_wider() – the inverse of pivot_longer()
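
The round trip can be sketched as follows, assuming dplyr and tidyr are installed. The id column is an added helper (not part of iris) so that each original row can be identified when the long data is widened again.

```r
library(dplyr)
library(tidyr)

# Start from a wide table with a row identifier (helper column).
wide <- iris |>
  as_tibble() |>
  mutate(id = row_number()) |>
  select(id, Species, Sepal.Length, Sepal.Width)

# Wide -> long.
long <- wide |>
  pivot_longer(c(Sepal.Length, Sepal.Width),
               names_to = "Attribute", values_to = "Measurement")

# Long -> wide again: names come from Attribute, values from Measurement.
back <- long |>
  pivot_wider(names_from = Attribute, values_from = Measurement)

# The round trip recovers the original table.
all.equal(as.data.frame(wide), as.data.frame(back))
```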
