R Text-Based Data I/O
R Data Frame Access and Manipulation
Ian M. Cook
September 29, 2010
R Data I/O, Access, and Manipulation September 29, 2010 Background Information

Data Types
R has several important data types:
  • numeric (stores integers and floating-point real numbers)
  • character (stores strings of characters, not single characters)
  • logical (stores TRUE or FALSE)

Data Containers
  • The most basic data storage container in R is a scalar, a 1x1 unit of data. A scalar might contain a unit of numeric, character, or logical data. (R stores a scalar as a vector of length 1.)
  • A 1-dimensional array of scalars in R is a vector.
  • A 2-dimensional array of scalars in R can be a matrix or a data frame. (The focus here is on data frames. Matrices are often less useful and less accessible, so they are not covered in this presentation.)
  • R also has other data containers, including lists, which are important to know about but are often less useful for data analysis purposes.

Data Containers
A vector can be created in R using the function c(). To create several vectors of various lengths containing numerical, character, and logical data, we can enter
    v1 <- c(1, 3, 9, 3.14159, -88.1, 0)
    v2 <- c("abc", "def", "ghi")
    v3 <- c(TRUE, FALSE, TRUE, TRUE)
Data types cannot be mixed within a vector. When mixed data types are entered into a vector, c() converts all entries to one common type: if any entry is a character string, every non-character entry is converted into its character representation; if only logical and numeric values are mixed, the logical values are converted to numbers.
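The coercion rule above can be checked directly. This is a minimal sketch; the variable names mixed_chr and mixed_num are made up for illustration and do not come from the slides.

```r
# Mixing types in c() coerces every entry to one common type.
mixed_chr <- c(1, "abc", TRUE)    # one character entry forces character
mixed_num <- c(1.5, TRUE, FALSE)  # logical mixed with numeric becomes numeric

print(class(mixed_chr))  # "character"
print(mixed_chr)         # "1" "abc" "TRUE"
print(class(mixed_num))  # "numeric"
print(mixed_num)         # 1.5 1.0 0.0
```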

Data Frames
  • A data frame is a rectangular array, with each column representing a variable.
  • Different columns in a data frame may have different data types. (E.g. a data frame might have character strings in column 1, numerical values in column 2, and logical values in column 3.)
  • A data frame can be created in R using the function data.frame(), but it is often more useful to input a data frame from an external data file or database.
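As a small illustration of data.frame(), the sketch below builds a three-column data frame with one column of each type. The column names and values are invented for the example; stringsAsFactors=FALSE is passed explicitly so the character column stays character on every R version.

```r
# Build a small data frame with character, numeric, and logical columns.
ds <- data.frame(
  name  = c("abc", "def", "ghi"),  # character column
  SIDD  = c(1.5, 3, 9),            # numeric column
  valid = c(TRUE, FALSE, TRUE),    # logical column
  stringsAsFactors = FALSE
)

str(ds)  # prints the type of each column
```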

Data Frame Input/Output

Basic CSV Data Input
To read the contents of a CSV file into an R data frame named ds, use the command
    ds <- read.csv(file, header, …)
  • file is the name of the file, enclosed in single or double quotes.
  • header is TRUE by default, indicating that the first row of the CSV file contains the column names.
Example:
    ds <- read.csv("C:/data/file.csv", header=TRUE)

Important Tips
When specifying file paths, use forward slashes or double backslashes. (The single backslash is the escape character in R strings.)
    Works: ds <- read.csv("C:/data/file.csv")
    Works: ds <- read.csv("C:\\data\\file.csv")
    Fails: ds <- read.csv("C:\data\file.csv")

Other Delimited Text Files
To input a text data table delimited with characters other than commas, use the command
    ds <- read.table(file, header, sep, …)
sep specifies the delimiter:
  • "," indicates a comma
  • "\t" indicates the tab character
Unlike read.csv(), read.table() uses header=FALSE by default.
For example:
    ds <- read.table("C:/file.txt", sep="\t")
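To try read.table() without an existing data file, the sketch below first writes a tiny tab-delimited file to a temporary location and then reads it back. The file contents and column names are made up for illustration, and header=TRUE is passed explicitly because the sample file has a header row.

```r
# Create a small tab-delimited file in a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tSIDD", "a\t1.5", "b\t3"), path)

# Read it back, telling read.table() about the header row and the delimiter.
ds <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(ds)

unlink(path)  # remove the temporary file
```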

Important Tips
  • The logical values TRUE and FALSE must be all caps.
  • If a data frame named ds already exists, the command ds <- read.csv(…), or any other command using ds on the left side of the assignment operator <-, will overwrite ds if it executes successfully.
  • Refer to the R Documentation page on read.table(…) for more detailed information and for other options such as ignoring comment headers and using special quotation characters.

CSV Data Output
To write the contents of a data frame named ds to a CSV file, use the command
    write.csv(ds, file, …)
For example:
    write.csv(ds, "C:/data/file.csv")
To output a file delimited by a character other than the comma, use the command
    write.table(ds, file, … , sep)

Important Tips
  • The functions write.csv(…) and write.table(…) have many options, including col.names and row.names, which let users choose whether column names and row names (row numbers by default) are written to the file.
  • Refer to the R Documentation on write.table(…) for more information.
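A CSV round trip ties the input and output commands together. The following is a sketch under assumed data: the data frame contents and the temporary file are invented for the example, and row.names=FALSE is used so that no extra column of row numbers is written.

```r
# A small data frame to export (values are illustrative only).
ds <- data.frame(id = c("a", "b", "c"), SIDD = c(1.5, 3, 9),
                 stringsAsFactors = FALSE)

# Write it to a temporary CSV file without the row-number column.
path <- tempfile(fileext = ".csv")
write.csv(ds, path, row.names = FALSE)

# Read the file back into a second data frame and inspect it.
ds2 <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
print(ds2)

unlink(path)  # remove the temporary file
```

Without row.names=FALSE, the file would gain an unnamed first column holding the row numbers, which read.csv() would then import as an extra column.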

Databases
  • R has simple facilities for querying databases and filling a data frame with the results of your query.
  • R can query MySQL databases using the R package RMySQL.
  • R can query Oracle databases using the R package ROracle.
  • Queries to either database type require the R package DBI.

MySQL Databases
To fill a data frame ds with the results of a SQL query against a MySQL database, use the following template R code:
    library(DBI)
    library(RMySQL)
    # Connection details: replace the placeholders with real values
    db_name <- "database_name"
    db_node <- "database_node"
    db_user <- "username"
    db_pw <- "password"
    mysql <- dbDriver("MySQL")
    sql_statement <- "select … from …"
    # Open the connection, run the query, then close the connection
    con <- dbConnect(mysql, user=db_user, password=db_pw,
                     dbname=db_name, host=db_node)
    ds <- dbGetQuery(con, sql_statement)
    dbDisconnect(con)

Oracle Databases
To fill a data frame ds with the results of a SQL query against an Oracle database, use the following template R code:
    library(DBI)
    library(ROracle)
    # Connection details: replace the placeholders with real values
    db_name <- "database_name"
    db_user <- "username"
    db_pw <- "password"
    ora <- dbDriver("Oracle")
    sql_statement <- "select … from …"
    # Open the connection, run the query, then close the connection
    con <- dbConnect(ora, user=db_user, password=db_pw,
                     dbname=db_name)
    ds <- dbGetQuery(con, sql_statement)
    dbDisconnect(con)

Data Frame Access and Manipulation

Accessing Columns in a Data Frame
Each column in a data frame represents a variable. Different columns may have different data types (e.g. character strings in column 1, numerical values in column 2, logical values in column 3).
Columns inside a data frame can be accessed by any of three basic methods:
  • Dollar sign extraction operator $
  • Square brackets extraction operator []
  • subset() function

Dollar Sign Extraction Operator
A single column from a data frame can be accessed using the dollar sign operator $ as follows. To return a vector containing the data in the column named SIDD in the data frame named ds, issue the command
    ds$SIDD
Do not surround the name of the column in quotes when using the $ operator.

Square Brackets Extraction Operator
A single column from a data frame may also be accessed using the square brackets operator [] as follows. To return a vector containing the column named SIDD in the data frame named ds, issue the command
    ds[,"SIDD"]
You must surround the name of the column in double or single quotes when using the [] operator.
The comma before the column name is important, as you will see several slides ahead.

subset() Function
A third way to access a single column in a data frame utilizes R's subset() function. To return the column named SIDD in the data frame named ds, issue the command
    subset(ds, select="SIDD")
Unlike $ and [], this returns a one-column data frame rather than a vector.

Numerical Indices
R indexes data containers with integers, beginning at 1. This is unlike many programming languages, in which indices begin at 0.
The square brackets extraction operator also accepts the number of the column. If the third column in the data frame ds is named SIDD, then
    ds[,"SIDD"]   and   ds[,3]
are equivalent commands.
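The column-access methods can be compared side by side. This is a sketch on an invented data frame whose third column is named SIDD, matching the example above; the other column names and all values are hypothetical.

```r
# Illustrative data frame; SIDD is deliberately the third column.
ds <- data.frame(id = c("a", "b", "c"),
                 flag = c(TRUE, FALSE, TRUE),
                 SIDD = c(10, 20, 30),
                 stringsAsFactors = FALSE)

by_dollar <- ds$SIDD                      # vector, name unquoted
by_name   <- ds[, "SIDD"]                 # vector, name quoted
by_number <- ds[, 3]                      # vector, column number
by_subset <- subset(ds, select = "SIDD")  # one-column data frame

print(identical(by_dollar, by_name))    # TRUE
print(identical(by_name, by_number))    # TRUE
print(class(by_subset))                 # "data.frame"
```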

Accessing Rows in a Data Frame
The rows of a data frame are not generally named, but are numbered beginning at 1.
The rows of a data frame can be accessed by either of two methods:
  • Square brackets extraction operator []
  • subset() function

Square Brackets Extraction Operator
To return the nth row of a data frame ds (as a one-row data frame), issue the command
    ds[n,]
The comma after the row number is important. The square brackets expect a row number before the comma, and a column name or number after the comma.

Square Brackets Extraction Operator
Square brackets can also be used to return multiple rows of a data frame. To return a smaller data frame containing the nth through (n+m)th rows of a data frame ds, issue the command
    ds[n:(n+m),]
The above command also demonstrates the colon operator :, which is used to create sequences of integers, in this case beginning with n and ending with n+m.

subset() Function
The subset() function is sometimes useful in returning multiple rows of a data frame. It is more complicated to use than the square brackets.
For example, to extract the 2nd, 4th, and 5th rows of a data frame with 5 rows, we could issue the commands:
    index <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
    subset(ds, subset=index)
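The row-access commands above can be tried on a small example. The data frame below is invented for illustration. The last line goes one step beyond the slides: it passes subset() a condition on a column instead of a prepared logical vector, which is a common way to use the function.

```r
# Illustrative five-row data frame.
ds <- data.frame(id = letters[1:5],
                 SIDD = c(10, 20, 30, 40, 50),
                 stringsAsFactors = FALSE)

one_row   <- ds[2, ]    # 2nd row, returned as a one-row data frame
some_rows <- ds[2:4, ]  # rows 2 through 4

# Logical index with one entry per row: keep rows 2, 4, and 5.
index  <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
picked <- subset(ds, subset = index)

# subset() also accepts a condition written in terms of column names.
by_value <- subset(ds, SIDD > 25)

print(picked)
print(by_value)
```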

Square Brackets Extraction Operator
An individual scalar entry within a data frame can be returned by using the square bracket operators, with numbers on both sides of the comma.
To return the scalar value in the mth row and nth column of a data frame ds, issue the command
    ds[m,n]
To return the scalar value in the mth row of the data frame ds, in the column named SIDD, issue the command
    ds[m,"SIDD"]

Assignment with [] and $
The square brackets and dollar sign can also be used to assign values within a data frame. If the column SIDD in the data frame ds contains numerical data, we can multiply the 5th entry in the SIDD column by two by issuing the command
    ds[5,"SIDD"] <- 2 * ds[5,"SIDD"]
We could create a new column (or replace the values within the column) named TWICE_SIDD in the data frame ds, and fill it with values twice those in the column SIDD, by issuing the command
    ds$TWICE_SIDD <- 2 * ds$SIDD
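Both assignments can be run in sequence on a small data frame to see their combined effect. The starting values are made up for illustration.

```r
# Illustrative data frame with a single numeric column.
ds <- data.frame(SIDD = c(1, 2, 3, 4, 5))

# Double the 5th entry of SIDD in place.
ds[5, "SIDD"] <- 2 * ds[5, "SIDD"]

# Add a new column holding twice the (already updated) SIDD values.
ds$TWICE_SIDD <- 2 * ds$SIDD

print(ds)
```

Because the second command runs after the first, the last entry of TWICE_SIDD is computed from the doubled value.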

Dimensions
Commands to return the dimensions of a data frame ds are
    dim(ds)
    nrow(ds)
    ncol(ds)
dim(ds) returns a vector of length two containing the number of rows in position 1 and the number of columns in position 2.
The command to return the length of a vector v is
    length(v)

Factors
  • In R versions before 4.0.0, character vector columns in data frames are stored as factors by default. Since R 4.0.0 they stay character unless stringsAsFactors=TRUE is specified. In R, a factor is an indexed vector.
  • To factor a vector, R identifies the unique entries in the vector and makes them the levels of the factor. Each vector entry is then represented by an integer index into the factor levels. This can save memory when the entries in a vector are not all unique.
  • There are several functions to handle factors. Refer to the R Documentation or Help pages about factors.
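The relationship between a factor's levels and its integer codes can be seen in a short example. The input vector is invented for illustration; factor(), levels(), as.integer(), and as.character() are standard base R functions.

```r
# A character vector with repeated entries.
v <- c("low", "high", "low", "mid", "high")

f <- factor(v)

lv    <- levels(f)        # unique entries, sorted: "high" "low" "mid"
codes <- as.integer(f)    # index of each entry into the levels: 2 1 2 3 1
back  <- as.character(f)  # recovers the original strings

print(lv)
print(codes)
print(back)
```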

Connections and Line-by-Line Text Input/Output

Connections
  • In some cases, it is preferable to import or export data line-by-line.
  • Line-by-line data input/output reduces R's memory usage and is useful when dealing with very large delimited text datasets.
  • Line-by-line text input/output can be useful for reading and writing log files.
  • The first step in reading line-by-line is opening a file connection.

Connections for Input
R can open a text file connection conn for input using the command
    conn <- file(filename, open="rt")
If the specified file exists and is accessible, then a connection is created and opened for text reading.
Example:
    conn <- file("C:/data/in.txt", open="rt")
("rt" indicates "read text")

Line-by-Line Input
Once a text file input connection is open, we can use one of R's line-by-line text input functions:
    readLines(conn, n)
    scan(…)
  • The scan(…) function is useful for importing delimited data files (e.g. CSV) line by line. The scan(…) function has many arguments. Refer to its lengthy R Documentation page for details.
  • The readLines(…) function is simpler and is useful for reading unstructured lines of text.

Line-by-Line Input
To read one line of text from a file into the character variable str, we could use the following series of commands
    conn <- file("C:/data/in.txt", open="rt")
    str <- readLines(conn, n=1)
    close(conn)
The close(conn) command closes the connection, leaving the file intact, and leaving str in the R workspace.
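Reading a whole file one line at a time extends the single readLines() call into a loop. The sketch below assumes a small temporary file created just for the example, and uses the variable name line rather than str to avoid masking R's built-in str() function. It relies on readLines() returning a zero-length character vector once the end of the file is reached.

```r
# Create a small text file to read (contents are illustrative).
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

conn <- file(path, open = "rt")
n_read <- 0
repeat {
  line <- readLines(conn, n = 1)
  if (length(line) == 0) break  # character(0) signals end of file
  n_read <- n_read + 1
  cat(n_read, ":", line, "\n")  # process the line; here we just print it
}
close(conn)

unlink(path)  # remove the temporary file
```

Only one line is held in memory at a time, which is the point of the line-by-line approach for very large files.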

Connections for Output
R can create a text file connection conn for output using the command
    conn <- file(filename, open="wt")
If the file does not exist, it is created. If the file already exists, its contents are erased!
Example:
    conn <- file("C:/data/out.txt", open="wt")
("wt" indicates "write text")

Output to a Connection
Once a text file output connection is open, we can write text to the connection by making one or more calls to the R function
    write("text to write", file=conn, append=TRUE)
(The append argument only matters when file is a file name; it has no effect on an open connection, where successive calls always continue from the current position.)
Once finished writing text to the connection, close it using the command
    close(conn)

Output to a File
R can also write directly to a file without creating a connection. In this example, we retain the contents of an existing text file and append new text.
To write the contents of the character string str to a file, issue the command
    write(str, file=filename, append=TRUE)
Example:
    str <- "some text to output \nline 2"
    write(str, file="C:/data/out.txt", append=TRUE)

Output to a File
  • If the specified file does not exist, the write() command will create it.
  • Be sure to use the append=TRUE option when appending to an existing text file, or the file's contents will be cleared!
  • There is no need to use the close() command after writing to a file without using a connection, because no persistent connection has been opened.
  • Use the newline character \n to create line breaks in text output.
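The two output styles can be combined in one short session: first writing through a connection, then appending directly by file name. This is a sketch that uses a temporary file and made-up text.

```r
path <- tempfile(fileext = ".txt")

# Write two lines through an open connection.
conn <- file(path, open = "wt")
write("line 1", file = conn)
write("line 2", file = conn)
close(conn)

# Append two more lines directly by file name; \n creates the line break.
write("line 3\nline 4", file = path, append = TRUE)

result <- readLines(path)
print(result)  # "line 1" "line 2" "line 3" "line 4"

unlink(path)   # remove the temporary file
```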

Gzip Connections
R provides facilities for line-by-line reading and writing of files compressed by the gzip utility.
To create a connection to a gzip file for reading, issue the command
    conn <- gzfile(filename, open="rt")
To create a connection to a gzip file for writing, issue the command
    conn <- gzfile(filename, open="wt")
The readLines(), write(), and close() functions can be used in the same way as with text file connections.
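A gzip round trip looks the same as the plain-text case, with gzfile() in place of file(). The sketch below writes two lines to a temporary compressed file and reads them back; the file name and contents are invented for the example.

```r
path <- tempfile(fileext = ".gz")

# Write two lines to a gzip-compressed file.
conn <- gzfile(path, open = "wt")
write(c("alpha", "beta"), file = conn)
close(conn)

# Read the lines back through a gzip connection.
conn <- gzfile(path, open = "rt")
lines_in <- readLines(conn)
close(conn)

print(lines_in)  # "alpha" "beta"

unlink(path)     # remove the temporary file
```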

R Text-Based Data I/O and Data Frame Access and Manupulation

  • 1. R Text-Based Data I/O R Data Frame Access and Manipulation Ian M. Cook September 29, 2010
  • 2. R Data I/O, Access, and Manipulation September 29, 2010 Background Information
  • 3. Data Types R has several important data types: numeric (stores integers and floating point real numbers) character (stores strings of characters, not single characters) logical (stores TRUE or FALSE) R Data I/O, Access, and Manipulation September 29, 2010
  • 4. Data Containers The most basic data storage container in R is a scalar , a 1x1 unit of data. A scalar might contain a unit of numeric, character, or logical data. A 1-dimensional array of scalars in R is a vector . A 2-dimensional array of scalars in R can be a matrix or a data frame . (The focus here is on data frames. Matrices are often less useful and less accessible so are not covered in this presentation.) R also has other data containers, including lists , which are important to know about but are often less useful for data analysis purposes. R Data I/O, Access, and Manipulation September 29, 2010
  • 5. Data Containers A vector can be created in R using the function c() . To create several vectors of various lengths containing numerical, character, and logical data, we can enter v1 <- c(1, 3, 9, 3.14159, -88.1, 0) v2 <- c(&quot;abc&quot;,&quot;def&quot;,&quot;ghi&quot;) v3 <- c(TRUE, FALSE, TRUE, TRUE) Data types cannot be mixed within a vector. Entering mixed data types into a vector using the c() function converts all non-character entries into character representations. R Data I/O, Access, and Manipulation September 29, 2010
  • 6. Data Frames A data frame is a rectangular array, with each column representing a variable. Different columns in a data frame may have different data types. (E.g. a data frame might have character strings in column 1, numerical values in column 2, and logical values in column 3.) A data frame can be created in R using the function data.frame(), but it is often more useful to input a data frame from an external data file or database. R Data I/O, Access, and Manipulation September 29, 2010
  • 7. R Data I/O, Access, and Manipulation September 29, 2010 Data Frame Input/Output
  • 8. Basic CSV Data Input To read the contents of a CSV file into an R data frame named ds , use the command ds <- read.csv(file, header, …) header is TRUE by default, indicating that the first row of the CSV file contains the row names. file is the name of the file, enclosed in single or double quotes. Example: ds <- read.csv(&quot;C:/data/file.csv&quot;, header=TRUE) R Data I/O, Access, and Manipulation September 29, 2010
  • 9. Important Tips When specifying file paths, use front slashes or double backslashes. (The single backslash is a special character in R.) Works: ds <- read.csv(&quot;C: / data / file.csv&quot;) Works: ds <- read.csv(&quot;C: \\ data \\ file.csv&quot;) Fails: ds <- read.csv(&quot;C: \ data \ file.csv&quot;) R Data I/O, Access, and Manipulation September 29, 2010
  • 10. Other Delimited Text Files To input a text data table delimited with characters other than commas, use the command ds <- read.table(file, header, sep, …) sep specifies the delimiter: &quot;,&quot; indicates a comma &quot;\t&quot; indicates the tab character For example: ds <- read.table(&quot;C:/file.txt&quot;, sep=&quot;\t&quot;) R Data I/O, Access, and Manipulation September 29, 2010
  • 11. Important Tips The logical values TRUE and FALSE must be all caps. If a data frame with named ds already exists, the command ds <- read.csv(…) or any other command using ds on the left side of the assignment operator <- will overwrite ds if it executes successfully. Refer to the R Documentation page on read.table(…) for more detailed information and for other options such as ignoring comment headers and using special quotation characters. R Data I/O, Access, and Manipulation September 29, 2010
  • 12. CSV Data Output To write the contents of a data frame named ds to a CSV file, use the command write.csv(ds, file, …) For example: write.csv(ds, &quot;C:/data/file.csv&quot;) To output a file delimited by a character other than the comma, use the command write.table(ds, file, … , sep) R Data I/O, Access, and Manipulation September 29, 2010
  • 13. Important Tips The functions write.csv(…) and write.table(…) have many options, including col.names and row.names , which allow users to choose whether to use column naming and/or row numbering. Refer to the R Documentation on write.table(…) for more information. R Data I/O, Access, and Manipulation September 29, 2010
  • 14. Databases R has simple facilities for querying databases and filling a data frame with the results of your query. R can query MySQL databases using the R package RMySQL . R can query Oracle databases using the R package ROracle . Queries to either database type require the R package DBI . R Data I/O, Access, and Manipulation September 29, 2010
  • 15. MySQL Databases To fill a data frame ds with the results of a SQL query against a MySQL database, use the following template R code: library(DBI) library(RMySQL) db_name <- &quot;database_name&quot; db_node <- &quot;database_node&quot; db_user <- &quot;username&quot; db_pw <- &quot;password&quot; mysql <- dbDriver(&quot;MySQL&quot;) sql_statement <- &quot;select … from …&quot; con <- dbConnect(mysql, user=db_user, password=db_pw, dbname=db_name, host=db_node) ds <- dbGetQuery(con, sql_statement) mysqlCloseConnection(con) R Data I/O, Access, and Manipulation September 29, 2010
  • 16. Oracle Databases To fill a data frame ds with the results of a SQL query against an Oracle database, use the following template R code: library(DBI) library(ROracle) db_name <- &quot;database_name&quot; db_user <- &quot;username&quot; db_pw <- &quot;password&quot; ora <- dbDriver(&quot;Oracle&quot;) sql_statement <- &quot;select … from …&quot; con <- dbConnect(ora, user=db_user, password=db_pw, dbname=db_name) ds <- dbGetQuery(con, sql_statement) dbDisconnect(con) R Data I/O, Access, and Manipulation September 29, 2010
  • 17. R Data I/O, Access, and Manipulation September 29, 2010 Data Frame Access and Manipulation
  • 18. Accessing Columns in a Data Frame Each column in a data frame represents a variable. Different columns may have different data types (e.g. character strings in column 1, numerical values in column 2, logical values in column 3). Columns inside a data frame can be accessed in any of three basic methods: Dollar sign extraction operator $ Square brackets extraction operator [] subset() function R Data I/O, Access, and Manipulation September 29, 2010
  • 19. Dollar Sign Extraction Operator A single column from a data frame can be accessed using the dollar sign operator $ as follows. To return a vector containing the data in the column named SIDD in the data frame named ds , issue the command ds$SIDD Do not surround the name of the column in quotes when using the $ operator. R Data I/O, Access, and Manipulation September 29, 2010
  • 20. Square Brackets Extraction Operator A single column from a data frame may also be accessed using the square brackets operator [] as follows. To return a vector containing the column named SIDD in the data frame named ds , issue the command ds[,&quot;SIDD&quot;] You must surround the name of the column in double or single quotes when using the [] operator. The comma before the column name is important, as you will see several slides ahead. R Data I/O, Access, and Manipulation September 29, 2010
  • 21. subset() Function A third way to access a single column in a data frame utilizes R’s subset() function. To return a vector containing the column named SIDD in the data frame named ds , issue the command subset(ds, select=&quot;SIDD&quot;) R Data I/O, Access, and Manipulation September 29, 2010
  • 22. Numerical Indices R indexes data containers with integers, beginning at 1 . This is unlike most programming languages, in which indices begin at 0. The square brackets extraction operator also accepts the number of the column. If the third column in the data frame ds is named SIDD , then ds[,&quot;SIDD&quot;] and ds[,3] are equivalent commands. R Data I/O, Access, and Manipulation September 29, 2010
  • 23. Accessing Rows in a Data Frame The rows of a data frame are not generally named, but are numbered beginning at 1. The rows of a data frame can be accessed by either of two methods: Square brackets extraction operator [] subset() function R Data I/O, Access, and Manipulation September 29, 2010
  • 24. Square Brackets Extraction Operator To return a vector containing the n th row of a data frame ds , issue the command ds[n,] The comma after the column name is important. The square brackets expect a row number before the comma, and a column name or number after the comma. R Data I/O, Access, and Manipulation September 29, 2010
  • 25. Square Brackets Extraction Operator Square brackets can also be used to return multiple rows of a data frame. To return a smaller data frame containing the n th through n+m th rows of a data frame ds , issue the command ds[n:(n+m),] The above command also demonstrates the colon operator : , which is used to create sequences of integer numbers, in this case beginning with n and ending with n+m . R Data I/O, Access, and Manipulation September 29, 2010
  • 26. subset() Function The subset() function is sometimes useful in returning multiple rows of a data frame. It is more complicated to use than the square brackets. For example, to extract the 2 nd , 4 th , and 5 th rows of a data frame with 5 rows, we could issue the commands: index <- c(FALSE, TRUE, FALSE, TRUE, TRUE) subset(ds, subset=index) R Data I/O, Access, and Manipulation September 29, 2010
  • 27. Square Brackets Extraction Operator An individual scalar entry within a data frame can be returned by using the square bracket operators, with numbers on both sides of the comma. To return the scalar value in the m th row and n th column of a data frame ds , issue the command ds[m,n] To return the scalar value in the m th row of the data frame ds , in the column named SIDD , issue the command ds[m,&quot;SIDD&quot;] R Data I/O, Access, and Manipulation September 29, 2010
  • 28. Assignment with [] and $ The square brackets and dollar sign can also be used to assign values within a data frame. If the column SIDD in the data frame ds contains numerical data, we can multiply the 5 th entry in the SIDD column by two by issuing the command ds[5,&quot;SIDD&quot;] <- 2 * ds[5,&quot;SIDD&quot;] We could create a new column (or replace the values within the column) named TWICE_SIDD in the data frame ds , and fill it with values twice those in the column SIDD , by issuing the command ds$TWICE_SIDD <- 2 * ds$SIDD R Data I/O, Access, and Manipulation September 29, 2010
  • 29. Dimensions Commands to return the dimensions of a data frame ds are dim(ds), nrow(ds), and ncol(ds). dim(ds) returns a vector of length two containing the number of rows in position 1 and the number of columns in position 2. The command to return the length of a vector v is length(v)
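A brief sketch of the dimension functions on an illustrative data frame. The last line is an added observation: calling length() on a data frame itself returns the number of columns, because a data frame is stored as a list of columns.

```r
# Illustrative data frame; the values are invented for this sketch.
ds <- data.frame(ID = c("a", "b", "c", "d", "e"),
                 SIDD = c(10, 20, 30, 40, 50))

d      <- dim(ds)          # rows in position 1, columns in position 2
n_rows <- nrow(ds)
n_cols <- ncol(ds)
v_len  <- length(ds$SIDD)  # length of one column vector
df_len <- length(ds)       # number of columns, not rows
```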
  • 30. Factors In R versions before 4.0.0, character columns in data frames are stored as factors by default; since R 4.0.0 they stay as character vectors unless stringsAsFactors=TRUE is set or the column is converted with factor(). In R, a factor is an indexed vector. To factor a vector, R identifies the unique entries in the vector and makes them the levels of the factor. Each vector entry is then represented by an integer index pointing to one of the factor levels. This saves memory when the entries in a vector are not all unique. There are several functions to handle factors. Refer to the R Documentation or Help pages about factors.
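The level-and-index structure of a factor can be inspected directly. This is a small sketch with a made-up character vector.

```r
# Illustrative character vector with repeated entries.
v <- c("low", "high", "low", "medium", "high")

f <- factor(v)

lev  <- levels(f)        # the unique entries, sorted
idx  <- as.integer(f)    # integer index of each entry into the levels
back <- as.character(f)  # recover the original strings
```

Each entry of idx points into lev, so lev[idx] reproduces the original vector.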
  • 31. Connections and Line-by-Line Text Input/Output
  • 32. Connections In some cases, it is preferable to import or export data line-by-line. Line-by-line data input/output reduces R’s memory usage and is useful when dealing with very large delimited text datasets. Line-by-line text input/output can be useful for reading and writing log files. The first step in reading line-by-line is opening a file connection.
  • 33. Connections for Input R can open a text file connection conn for input using the command conn <- file(filename, open="rt") If the specified file exists and is accessible, then a connection is created and opened for text reading. Example: conn <- file("C:/data/in.txt", open="rt") ("rt" indicates “read text”)
  • 34. Line-by-Line Input Once a text file input connection is open, we can use one of R’s line-by-line text input functions: readLines(conn, n) scan(…) The scan(…) function is useful for importing delimited data files (e.g. CSV) line by line. The scan(…) function has many arguments. Refer to its lengthy R Documentation page for details. The readLines(…) function is simpler and is useful for reading unstructured lines of text.
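As a rough sketch of reading delimited data one line at a time with scan(), the example below writes a tiny comma-separated file to a temporary path and reads it back. The file contents and the argument choices are assumptions for illustration. Here what = numeric() declares the field type, nlines = 1 limits each call to one line, and quiet = TRUE suppresses the item-count message.

```r
# Create a small comma-separated file in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5,6"), path)

conn <- file(path, open = "rt")

# Each call reads one line, continuing from where the last call stopped.
first  <- scan(conn, what = numeric(), sep = ",", nlines = 1, quiet = TRUE)
second <- scan(conn, what = numeric(), sep = ",", nlines = 1, quiet = TRUE)

close(conn)
```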
  • 35. Line-by-Line Input To read one line of text from a file into the character variable str (a character vector of length one), we could use the following series of commands conn <- file("C:/data/in.txt", open="rt") str <- readLines(conn, n=1) close(conn) The close(conn) command closes the connection, leaving the file intact, and leaving str in the R workspace.
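To process a whole file one line at a time, the single readLines() call can be placed in a loop. This is a minimal sketch using a temporary file with made-up contents. It relies on readLines() returning a zero-length character vector once the end of the file is reached.

```r
# Create a small text file in a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

conn <- file(path, open = "rt")
n_lines <- 0
total_chars <- 0

repeat {
  line <- readLines(conn, n = 1)
  if (length(line) == 0) break   # end of file reached
  # Only one line is held in memory at a time.
  n_lines <- n_lines + 1
  total_chars <- total_chars + nchar(line)
}

close(conn)
```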
  • 36. Connections for Output R can create a text file connection conn for output using the command conn <- file(filename, open="wt") If the file does not exist, it is created. If the file already exists, its contents are erased! Example: conn <- file("C:/data/out.txt", open="wt") ("wt" indicates “write text”)
  • 37. Output to a Connection Once a text file output connection is open, we can write text to the connection by making one or more calls to the R function write("text to write", file=conn, append=TRUE) (For an already open connection the append argument is ignored; successive writes simply continue from the current position.) Once finished writing text to the connection, close it using the command close(conn)
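The open, write, and close sequence can be sketched with a temporary file. The log messages are made up for illustration.

```r
path <- tempfile(fileext = ".log")

# Open for text writing; any existing contents would be erased.
conn <- file(path, open = "wt")

# Each write() call adds one line at the current position.
write("job started", file = conn)
write("job finished", file = conn)

close(conn)

# Read the file back to check what was written.
written <- readLines(path)
```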
  • 38. Output to a File R can also write directly to a file without creating a connection. In this example, we retain the contents of an existing text file and append new text. To write the contents of the character string str to a file, issue the command write(str, file=filename, append=TRUE) Example: str <- "some text to output \nline 2" write(str, file="C:/data/out.txt", append=TRUE)
  • 39. Output to a File If the specified file does not exist, the write() command will create it. Be sure to use the append=TRUE option when appending to an existing text file, or the file’s contents will be cleared! There is no need to use the close() command after writing to a file without using a connection, because no persistent connection has been opened. Use the newline character \n to create line breaks in text output.
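The difference between appending and overwriting can be seen in a small sketch that writes to a temporary file by name. The text written is made up for illustration.

```r
path <- tempfile(fileext = ".txt")

# The file does not exist yet, so write() creates it.
write("first entry", file = path)

# append=TRUE keeps the existing contents; \n splits the text into two lines.
write("second entry\nthird entry", file = path, append = TRUE)
after_append <- readLines(path)

# Without append=TRUE, the existing contents are replaced.
write("replacement", file = path)
after_overwrite <- readLines(path)
```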
  • 40. Gzip Connections R provides facilities for line-by-line reading and writing of files compressed by the gzip utility. To create a connection to a gzip file for reading, issue the command conn <- gzfile(filename, open="rt") To create a connection to a gzip file for writing, issue the command conn <- gzfile(filename, open="wt") The readLines(), write(), and close() functions can be used in the same way as with text file connections.
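A minimal round-trip sketch with a gzip connection, using a temporary file and made-up text: write two lines to a compressed file, then reopen it for reading.

```r
path <- tempfile(fileext = ".gz")

# Write two lines through a gzip connection.
conn <- gzfile(path, open = "wt")
write("compressed line 1", file = conn)
write("compressed line 2", file = conn)
close(conn)

# Reopen the same file for reading; decompression is handled by the connection.
conn <- gzfile(path, open = "rt")
first_line <- readLines(conn, n = 1)
rest <- readLines(conn)   # remaining lines
close(conn)
```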