Unit I
Data Manipulations
Data Already in R – Reading data – Reading and formatting datasets –
Manipulating Data with dplyr – Tidying Data with tidyr
Data Manipulation
• Data science is as much about manipulating data as it is about fitting
models to data.
• Data rarely arrives in a form that we can directly feed into the
statistical models or machine learning algorithms.
• The first stages of data analysis are almost always figuring out how to
load the data into R, and then figuring out how to transform it into a
shape you can readily analyze.
Data Already in R:
• There are some data sets already built into R or available in R packages.
• datasets is a package that is distributed with R.
• It is a collection of example data sets (data frames, time series, and
other R objects) that can be used for practice and illustration.
• We can load the package into R using the library() function.
library(datasets)
If we want a list of the data sets in the package, together with a short
description of each, then use
library(help = "datasets")
It also shows the package name, version, title, author, etc.
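The same listing can also be obtained as an R object rather than as a help page. The following is a minimal sketch that assumes only base R; the variable names info, items and titles are illustrative.

```r
# data(package = ...) describes the data sets in a package
# without loading any of them.
info <- data(package = "datasets")

# The "results" component is a character matrix with one row per data set.
items <- info$results[, "Item"]
titles <- info$results[, "Title"]

# Show the first few data set names with their short descriptions.
print(head(data.frame(Item = items, Title = titles), 5))
```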
Unit I - introduction to r language 2.pptx
To load an actual data set into R's memory, use the data() function.
For example,
data("CO2")
To display the first 6 rows of a data set:
head(CO2)
To plot a graph:
plot(conc ~ uptake, data = CO2)
• Another package with several useful data sets is mlbench.
• It contains data sets for machine learning benchmarks, so these
data sets are aimed at testing how new methods perform on
known data sets.
• This package is not distributed together with R, but you can
install it, load it, and get a list of the data sets within it like this:
install.packages("mlbench")
library(mlbench)
library(help = "mlbench")
Quickly Reviewing Data
• head(dataset/dataframe, no.ofrows)
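• head(CO2, 3) -> displays the first 3 rows of the CO2 data set.
• Similarly,
• tail(CO2, 3) -> displays the last 3 rows.
• summary(CO2) -> displays summary statistics for all the columns in a
data frame.
• str(CO2) -> displays the structure of the data frame, including the type
of each column.
• R is case-sensitive: CO2 (a data frame) and co2 (a time series) are two
different built-in data sets.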
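The review functions above can be tried together on the built-in CO2 data frame. This is a small sketch using only base R; first3 and last3 are illustrative names.

```r
library(datasets)

# First and last three rows of the data frame.
first3 <- head(CO2, 3)
last3 <- tail(CO2, 3)
print(first3)
print(last3)

# Number of rows and columns, per-column summaries, and column types.
print(dim(CO2))
print(summary(CO2))
str(CO2)
```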
Reading Data
• There are several packages for reading data in different file formats,
from Excel to JSON to XML and so on.
• R also has plenty of built-in functions for reading text data. For their
documentation, use
?read.table
• read.table() is a function in R that reads a file in table format and creates
a data frame from it.
Example (reading a tab-separated file with a header line):
my_data <- read.table("data.txt", header = TRUE, sep = "\t")
Arguments in read.table():
• header: This is a boolean value telling the function whether it should
consider the first line in the input file a header line.
• col.names: If the first line is not used to specify the header, you can
use this option to name the columns.
• dec: This is the decimal point used in numbers.
• comment.char: By default, the function assumes that “#” is the start of
a comment and ignores the rest of a line when it sees it
• colClasses: This lets you specify which type each column should have,
so here you can specify that some columns should be factors, and
others should be strings
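To see these arguments working together, here is a hedged sketch that writes a small tab-separated file to a temporary location and reads it back. The file contents and column names are made up for illustration; only base R is assumed.

```r
# Write a small tab-separated file (made-up values) to a temporary path.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tgroup\tvalue",
  "# this comment line is skipped",
  "1\ta\t2,5",
  "2\tb\t3,75"
), path)

my_data <- read.table(
  path,
  header = TRUE,          # the first line holds the column names
  sep = "\t",             # columns are separated by tabs
  dec = ",",              # numbers use a comma as the decimal mark
  comment.char = "#",     # everything after "#" is ignored
  colClasses = c("integer", "factor", "numeric")  # one type per column
)
str(my_data)
```

If the file had no header line, header = FALSE together with col.names would name the columns instead.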
Installing a Package in RStudio
• To install the `mlbench` package in RStudio, you can follow these steps:
1. Open RStudio and click on the **Packages** tab in the bottom right
pane.
2. Click on the **Install** button.
3. In the **Install Packages** dialog box, type `mlbench` in the
**Packages** field.
4. Select the **Install dependencies** option.
5. Click on the **Install** button to install the package.
Examples of Reading and Formatting Data Sets
Breast Cancer Data set :
• As a first example of reading data from a text file, we consider the
BreastCancer data set from mlbench.
• Then we have something to compare our results with.
library(mlbench)
data(BreastCancer)
head(BreastCancer, 3)
• The URL to the actual data is
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
• To get the data,
• we could go to the URL and save the file, or
• read the data directly from the URL.
• We can read the data and get it as a vector of lines using the readLines()
function.
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
lines <- readLines(data_url)
lines[1:5]
• For this data, it seems to be a comma-separated values file without a
header line. So save the data with the “.csv” suffix.
• Boston Housing Data Set: For the second example of loading data,
we take another data set from the mlbench package (refer to the textbook).
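For a comma-separated file without a header line, as in the breast cancer example, the reading step might look like the sketch below. To keep it runnable offline, it uses a temporary file with a few made-up lines in place of the downloaded data; the column names and the use of "?" for missing values are assumptions for illustration.

```r
# Hypothetical stand-in for the saved file: made-up, comma-separated
# lines without a header row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("1001,5,1,2", "1002,3,4,4", "1003,8,?,2"), csv_path)

# Inspect the raw text first, as with the URL.
lines <- readLines(csv_path)
print(lines[1:2])

# header = FALSE because the first line is data, not column names.
# na.strings = "?" treats "?" as a missing value (assumed convention).
raw_data <- read.csv(
  csv_path,
  header = FALSE,
  col.names = c("Id", "Score.A", "Score.B", "Class"),
  na.strings = "?"
)
str(raw_data)
```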
The readr package:
• readr is an R package that provides a fast and friendly way to read
rectangular data from delimited files, such as comma-separated values
(CSV) and tab-separated values (TSV)
• It is designed to parse many types of data.
• To install readr,
• you can either install the whole tidyverse by running
install.packages("tidyverse") or
• install just readr by running install.packages("readr")
• Once installed, you can load readr with library(readr)
• readr supports the following file formats with these read_*() functions:
• read_csv(): comma-separated values (CSV)
• read_tsv(): tab-separated values (TSV)
• read_csv2(): semicolon-separated values with , as the decimal mark
• read_delim(): delimited files (CSV and TSV are important special
cases)
• read_fwf(): fixed-width files
• read_table(): whitespace-separated files
• read_log(): web log files
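As a brief illustration of two of these functions, the sketch below writes a tiny CSV file (made-up contents) to a temporary path and reads it with read_csv() and with read_delim(). It assumes the readr package is installed; the column names are illustrative.

```r
library(readr)

# A tiny CSV file with made-up contents.
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1.5", "b,2.5"), path)

# Declaring the column types up front avoids type guessing.
types <- cols(name = col_character(), score = col_double())

# read_csv() assumes commas as the delimiter.
scores <- read_csv(path, col_types = types)

# read_delim() reads the same file once the delimiter is given explicitly.
scores2 <- read_delim(path, delim = ",", col_types = types)

print(scores)
```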
Manipulating Data with dplyr
• Data frames are ideal for representing tabular data
• Nearly all packages that implement statistical models or machine
learning algorithms in R work on data frames.
• But to actually manipulate a data frame, you often have to write a lot
of code to filter data, rearrange data, summarize it in various ways,
and such.
• A few years ago, manipulating data frames required a lot more
programming than actually analyzing data.
• That has improved dramatically with the dplyr package (pronounced
“d plier” where “plier” is pronounced as “pliers”). This pack
• The dplyr package is not part of base R and has to be installed separately.
• It helps to resolve the most frequent data manipulation hurdles.
• It provides simple "verbs", functions for tackling each common data
manipulation task, so that thoughts can be translated into code faster.
• This package provides a number of convenient functions that let you
modify data frames in various ways and string them together in pipes
using the %>% or |> operator.
• If you load dplyr, you get a large selection of functions that let you
build pipelines for data frame manipulation.
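To make the idea of a pipeline concrete, the sketch below writes the same steps three ways: as nested calls, with %>%, and with the native |> pipe (which needs R 4.1 or later). It assumes dplyr is installed and uses filter(), the dplyr verb for keeping rows that satisfy a condition.

```r
library(dplyr)

# Nested calls: read from the inside out.
nested <- head(filter(as_tibble(iris), Species == "setosa"), 3)

# The same steps as a pipeline: read from left to right.
piped <- iris %>%
  as_tibble() %>%
  filter(Species == "setosa") %>%
  head(3)

# The native pipe gives the same result here.
native <- iris |>
  as_tibble() |>
  filter(Species == "setosa") |>
  head(3)

print(piped)
```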
Some Useful dplyr Functions
• dplyr works with several representations of data frames, such as tibbles;
as_tibble() converts a data frame to a tibble.
• (illustrate with output)
• iris %>% as_tibble()
• iris |> as_tibble()
• iris %>% as_tibble() %>% select(Petal.Width,Petal.Length) %>% head(3)
• iris %>% as_tibble() %>% select(Sepal.Length:Petal.Length) %>%
head(3)
• iris |> as_tibble() |> select(starts_with("Petal")) |> head(3)
• iris |> as_tibble() |> select(ends_with("Width")) |> head(3)
• iris |> as_tibble() |> select(contains("etal")) |> head(3)
• iris |> as_tibble() |> select(matches(".t.")) |> head(3)
• iris %>% as_tibble() %>% select(-starts_with("Petal")) %>% head(3)
• iris %>% as_tibble() %>%
mutate(Petal.Width.plus.Length = Petal.Width + Petal.Length) %>%
select(Species, Petal.Width.plus.Length) %>% head(3)
• iris %>% as_tibble() %>%
mutate(Petal.Width.plus.Length = Petal.Width + Petal.Length,
Sepal.Width.plus.Length = Sepal.Width + Sepal.Length) %>%
select(Petal.Width.plus.Length, Sepal.Width.plus.Length) %>%
head(3)
• iris %>% as_tibble() %>% arrange(Sepal.Length) %>% head(3)
• iris %>% as_tibble() %>% arrange(desc(Sepal.Length)) %>% head(3)
• iris %>% as_tibble() %>% group_by(Species) %>% head(3)
• iris %>% group_by(Species) %>% summarise(Mean.Petal.Length =
mean(Petal.Length))
• Breast Cancer Data Manipulation – Refer Text book
Tidying Data with tidyr
Tidy data is a standard way of mapping the meaning of a data set
to its structure. A data set is messy or tidy depending on how rows,
columns and tables are matched up with observations, variables
and types.
Hadley Wickham
• Tidy data can be plotted or summarized efficiently.
• It mostly comes down to what data is represented as columns in a data frame
and what is not.
• For example, if I want to look at the iris data set and see how the Petal.Length
varies among species, then I can look at the Species column against the
Petal.Length column:
iris |>
as_tibble() |>
select(Species, Petal.Length) |>
head(3)
• We can also plot a graph of the same data.
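One possible plot of Petal.Length against Species is a box plot. The sketch below uses base R's boxplot(), which is a choice made here for illustration and not necessarily the plot shown on the original slide.

```r
# Draw to a null PDF device so the sketch also runs without a display;
# drop this line (and dev.off()) to see the plot interactively.
pdf(NULL)

# One box per species, with Petal.Length on the y-axis.
bp <- boxplot(Petal.Length ~ Species, data = iris,
              xlab = "Species", ylab = "Petal.Length")

dev.off()
```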
• This works because we have a column for the x-axis and another for
the y-axis.
• But if we want to plot the different measurements of the irises to see
how they are related, we have a problem, because each measurement is a
separate column.
• In such a case we can use the tidyr package.
library(tidyr)
• It has a function, pivot_longer(), that modifies the data frame so that
the names of selected columns become values in one new column and
their contents become values in another.
• The pivot_longer() function is designed to reshape data from a wider
format to a longer format.
• This makes the data easier to analyze and visualize.
• Its first argument is the data frame or tibble to be reshaped.
• Essentially, it transforms the data frame so that you get one column
containing the names of your original columns and another column
containing the values in those columns.
• In the iris data set, we have observations for sepal length and sepal
width.
• If we want to examine Species vs. Sepal.Length or Sepal.Width, we can
readily do this.
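The reshaping described above can be sketched on the two sepal columns of iris. This assumes dplyr and tidyr are installed; the new column names Attribute and Measurement are illustrative choices.

```r
library(dplyr)
library(tidyr)

iris_long <- iris |>
  as_tibble() |>
  pivot_longer(
    cols = c(Sepal.Length, Sepal.Width),  # columns to stack
    names_to = "Attribute",               # holds the old column names
    values_to = "Measurement"             # holds the old column values
  ) |>
  select(Species, Attribute, Measurement)

print(head(iris_long, 4))
```

Each original row now contributes two rows, one per measurement, so both sepal measurements can be mapped to a single axis or grouped by Attribute.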
• pivot_wider() is the inverse of pivot_longer(): it reshapes data from a
longer format back to a wider format.
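A small round trip shows the two functions undoing each other. The data frame below is made up for illustration and includes an id column so that the rows can be matched up again; tidyr is assumed to be installed.

```r
library(tidyr)

# Made-up wide data: one row per id, one column per measurement.
wide <- data.frame(id = 1:2, a = c(1.5, 2.5), b = c(3.5, 4.5))

# Wide to long: the column names a and b become values in "name".
long <- pivot_longer(wide, cols = c(a, b),
                     names_to = "name", values_to = "value")

# Long back to wide: the values in "name" become column names again.
back <- pivot_wider(long, names_from = name, values_from = value)

print(long)
print(back)
```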
