4.6 Data Frames

The data frame is a tabular data structure whose columns can hold data of different types. It is conceptually similar to an Excel spreadsheet and is by far the most important data structure in R programming.

In a data frame, each column is a vector. That is, all elements within a column are of the same data type. However, different columns can be of different data types.

Here’s a data frame with a handful of writers, their birth years, and whether or not they were poets.

writers <- data.frame(
  Name = c("Plath", "Tolstoy", "Milton", "Woolf", "Farid ud-Din Attar"),
  BirthYear = c(1932, 1828, 1608, 1882, 1145),
  Poet = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

print(writers)
##                 Name BirthYear  Poet
## 1              Plath      1932  TRUE
## 2            Tolstoy      1828 FALSE
## 3             Milton      1608  TRUE
## 4              Woolf      1882 FALSE
## 5 Farid ud-Din Attar      1145  TRUE
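To confirm that each column is a vector with its own type, we can ask R for the class of every column. The following is a small sketch; the variable name column_types is only illustrative, and stringsAsFactors = FALSE is spelled out as an assumption so that the Name column is stored as text on older versions of R as well.

```r
writers <- data.frame(
  Name = c("Plath", "Tolstoy", "Milton", "Woolf", "Farid ud-Din Attar"),
  BirthYear = c(1932, 1828, 1608, 1882, 1145),
  Poet = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE # Already the default in R 4.0 and later
)

# Apply class() to each column in turn
column_types <- sapply(writers, class)
print(column_types)

# Number of rows and number of columns
dim(writers)
```

Here Name is a character vector, BirthYear is numeric, and Poet is logical.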

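Note that when we set the names of our columns, we must use the equals sign. Inside a function call, = attaches a name to an argument, whereas <- is used for variable assignment, so writing Poet <- c(TRUE, FALSE, TRUE, FALSE, TRUE) inside data.frame() would create a separate variable called Poet and would not give the column the name we intended.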
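The difference between = and <- inside data.frame() can be seen in a short experiment. This is only a sketch; good and bad are illustrative names that do not appear elsewhere in this chapter.

```r
# Using = names the column
good <- data.frame(Poet = c(TRUE, FALSE))
names(good)

# Using <- assigns a variable called Poet as a side effect,
# and the column ends up with an automatically generated name instead
bad <- data.frame(Poet <- c(TRUE, FALSE))
names(bad)

# The stray variable now exists outside the data frame
exists("Poet")
```

The second call still runs without an error, which is why this mistake can be easy to miss.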

4.6.1 Subsetting Data Frames

When we indexed vectors, we used the bracket notation vector[index] to extract information. We can do the same for data frames, but now we must provide two values, one for the row index and one for the column index, so the syntax is dataFrame[row, column]. For example, to pull out the value 1608 from writers, we would do:

writers[3,2]
## [1] 1608

As with vectors, we can extract multiple elements at once:

writers[c(2,3), c(1, 2)]
##      Name BirthYear
## 2 Tolstoy      1828
## 3  Milton      1608

What if we want to subset the rows, but keep all the columns of our data frame? We can leave a field blank to not subset it at all. For example, pulling out all columns for rows 2 and 3:

writers[c(2,3), ]
##      Name BirthYear  Poet
## 2 Tolstoy      1828 FALSE
## 3  Milton      1608  TRUE
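The column position in the brackets can also be given as column names, and the row position can be left blank in the same way. The sketch below also shows one thing worth knowing: selecting a single column with brackets returns a plain vector by default, and the drop = FALSE argument keeps the result as a data frame. The variable names are illustrative.

```r
writers <- data.frame(
  Name = c("Plath", "Tolstoy", "Milton", "Woolf", "Farid ud-Din Attar"),
  BirthYear = c(1932, 1828, 1608, 1882, 1145),
  Poet = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Rows 2 and 3, columns selected by name rather than by position
name_and_poet <- writers[c(2, 3), c("Name", "Poet")]

# All rows, second column only: the result is a numeric vector
birth_years <- writers[, 2]
is.data.frame(birth_years)

# drop = FALSE keeps the single column as a one-column data frame
birth_years_df <- writers[, 2, drop = FALSE]
is.data.frame(birth_years_df)
```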

However, there is an easier way of extracting information from a data frame: we can take advantage of column names. We can pull out individual vectors from a data frame using the syntax dataFrame$columnName. For example, we can extract the Name vector from writers using:

writers$Name
## [1] "Plath"              "Tolstoy"            "Milton"            
## [4] "Woolf"              "Farid ud-Din Attar"

And then we can index the Name vector as we would any other vector:

writers$Name[2]
## [1] "Tolstoy"
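The double-bracket syntax dataFrame[["columnName"]] extracts the same vector as $. It can be handy when the column name is stored in a variable. A brief sketch, where column_to_use is a hypothetical variable introduced only for this example:

```r
writers <- data.frame(
  Name = c("Plath", "Tolstoy", "Milton", "Woolf", "Farid ud-Din Attar"),
  BirthYear = c(1932, 1828, 1608, 1882, 1145),
  Poet = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# $ and [[ ]] return the same vector
same_vector <- identical(writers$Name, writers[["Name"]])

# [[ ]] also accepts a column name held in a variable
column_to_use <- "BirthYear"
third_birth_year <- writers[[column_to_use]][3]
```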

4.6.2 Logical Indexing

As with vectors, we can use logic and comparison operators to subset data frames. For example, we can subset our data frame just to writers who are poets:

poetsVector <- writers$Poet == TRUE # Logical vector: TRUE for each writer who is a poet
writers[poetsVector, ] # Subset our data
##                 Name BirthYear Poet
## 1              Plath      1932 TRUE
## 3             Milton      1608 TRUE
## 5 Farid ud-Din Attar      1145 TRUE

Notice how here, we save the logical vector as its own variable (poetsVector). We’re doing the equivalent of writers[writers$Poet == TRUE, ] (note the comma, which leaves the column index blank so that all columns are kept), but you may find dividing this process into multiple lines easier, especially as logic gets more complex.
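As one example of more complex logic, conditions can be combined with & (and) or | (or) before subsetting. The sketch below picks out poets born before 1900; the cutoff year and the variable names are chosen purely for illustration. It also shows that, because Poet is already a logical vector, it can be used as the row index directly.

```r
writers <- data.frame(
  Name = c("Plath", "Tolstoy", "Milton", "Woolf", "Farid ud-Din Attar"),
  BirthYear = c(1932, 1828, 1608, 1882, 1145),
  Poet = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Build each condition separately, then combine them
isPoet <- writers$Poet
bornBefore1900 <- writers$BirthYear < 1900
earlyPoets <- writers[isPoet & bornBefore1900, ]
print(earlyPoets)

# Poet is already logical, so comparing it to TRUE is optional
samePoets <- identical(writers[writers$Poet, ], writers[writers$Poet == TRUE, ])
```

Only Milton and Farid ud-Din Attar satisfy both conditions, since Plath was born in 1932.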