4.6 Data Frames
The Data Frame is a tabular data structure which can contain data of multiple types. It is conceptually similar to an Excel spreadsheet and is by far the most important data structure in R programming.
In a dataframe, each column is a vector. This is to say, all elements within a column will be of the same data type. However, different columns can be of different data types.
Here’s a data frame with a handful of writers, their birth years, and whether or not they were poets.
writers <- data.frame(
Name = c("Plath", "Tolstoy", "Milton", "Woolf", "Farid ud-Din Attar"),
BirthYear = c(1932, 1828, 1608, 1882, 1145),
Poet = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)
print(writers)## Name BirthYear Poet
## 1 Plath 1932 TRUE
## 2 Tolstoy 1828 FALSE
## 3 Milton 1608 TRUE
## 4 Woolf 1882 FALSE
## 5 Farid ud-Din Attar 1145 TRUE
Note that when we set the names of our columns, we must use the equals sign - we cannot use the assignment <- as that is only used for variable assignment (i.e. we cannot do Poet <- c(TRUE, FALSE, TRUE, FALSE, TRUE))
4.6.1 Subsetting Data Frames
When we indexed vectors, we used the bracket notation vector[index] to extract information. We can do the same for data frames, but now we must provide two values - one for the row index and one for the column index, so the syntax is dataFrame[row, column]. For example, to pull out the value 1608 from writers, we would do:
## [1] 1608
As with vectors, we can extract multiple elements at once:
## Name BirthYear
## 2 Tolstoy 1828
## 3 Milton 1608
What if we want to subset the rows, but keep all the columns of our data frame? We can leave a field blank to not subset it at all. For example, pulling out all columns for rows 2 and 3:
## Name BirthYear Poet
## 2 Tolstoy 1828 FALSE
## 3 Milton 1608 TRUE
However, there is an easier way of extracting infromation from a data frame - we can take advantage of row names. We can pull out individual vectors from a data frame using the syntax dataFrame$columnName. For example, we can extract the Name vector from writers using:
## [1] "Plath" "Tolstoy" "Milton"
## [4] "Woolf" "Farid ud-Din Attar"
And then we can index the Name as we would any other vector:
## [1] "Tolstoy"
4.6.2 Logical Indexing
As with vectors, we can use logic and comparison operators to subset data frames. For example, we can subset our data frame just to writers who are poets:
poetsVector <- writers$Poet == TRUE # Get the positions of writers who are poets
writers[poetsVector, ] # Subset our data## Name BirthYear Poet
## 1 Plath 1932 TRUE
## 3 Milton 1608 TRUE
## 5 Farid ud-Din Attar 1145 TRUE
Notice how here, we save the logical vector as its own variable (poetsVector). We’re doing the equivalent of writers[writers$Poet == TRUE], but you may find dividing this process into multiple lines easier, especially as logic gets more complex.