Learning Objectives
- Extract values from vectors and data frames.
- Perform operations on columns in a data frame.
- Append columns to a data frame.
- Create subsets of a data frame.
In this lesson you will learn how to extract and manipulate data stored in data frames in R. We will work with the E. coli metadata file that we used previously. Be sure to read this file into a data frame named metadata, if you haven’t already done so.
metadata <- read.csv('data/metadata.csv', header = TRUE)
Because the columns of a data frame are vectors, we will first learn how to extract elements from vectors and then learn how to apply this concept to select rows and columns from a data frame.
Let’s create a vector containing the first ten letters of the alphabet.
ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')
To extract one or several values from a vector, we provide one or several indices in square brackets. R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
So, to extract the 2nd element of ten_letters we type:
ten_letters[2]
We can extract multiple elements at a time by specifying multiple indices inside the square brackets as a vector. Notice how you can use : to make a vector of all integers between two numbers.
ten_letters[c(1,7)]
ten_letters[3:6]
ten_letters[10:1]
ten_letters[c(2, 8:10)]
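As a quick check on the four expressions above, the following sketch repeats them with the values they return shown as comments. The names idx and picked are introduced here only for illustration.

```r
ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')

ten_letters[c(1, 7)]      # "a" "g"
ten_letters[3:6]          # "c" "d" "e" "f"
ten_letters[10:1]         # the whole vector in reverse order
ten_letters[c(2, 8:10)]   # "b" "h" "i" "j"

# The index is itself an ordinary vector, so it can be stored and reused.
idx <- c(2, 8:10)         # hypothetical name, for illustration only
picked <- ten_letters[idx]
```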
Quick exercise / formative assessment: Select every other element in ten_letters.
What if we were dealing with a much longer vector? We can use the seq() function to quickly create sequences of numbers.
seq(1, 10, by = 2)
seq(20, 4, by = -3)
Exercise
Fill in the blank to select the even-numbered elements of ten_letters using the seq() function.
ten_letters[____________]
Solution
ten_letters[seq(2, 10, by = 2)]
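The same pattern extends to longer vectors if the upper bound comes from length() instead of being typed by hand. This is a minimal sketch using letters, a constant built into R that holds the 26 lowercase letters; the names odd_positions and every_other are made up for this example.

```r
# `letters` is a built-in R constant: "a", "b", ..., "z"
odd_positions <- seq(1, length(letters), by = 2)   # 1, 3, 5, ..., 25
every_other <- letters[odd_positions]

length(every_other)   # 13
every_other[1:3]      # "a" "c" "e"
```

Because the end of the sequence is computed with length(), the code keeps working if the vector grows or shrinks.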
The metadata data frame has rows and columns (it has 2 dimensions). To extract specific data from it, we need to specify the “coordinates” we want. Row numbers come first, followed by column numbers (i.e. [row, column]).
metadata[1, 2] # 1st element in the 2nd column
metadata[1, 6] # 1st element in the 6th column
metadata[1:3, 7] # First three elements in the 7th column
metadata[3, ] # 3rd row, all columns
metadata[, 7] # Entire 7th column
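To see what each form of [row, column] indexing returns without needing the metadata file, here is a small sketch on a made-up data frame. The data frame toy, its column names, and its values are assumptions for illustration and are not taken from metadata.csv.

```r
# Hypothetical stand-in for metadata; the real file's columns will differ.
toy <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  weight = c(21.5, 19.8, 23.1, 20.4),
  stringsAsFactors = FALSE
)

dim(toy)          # 4 2  (rows, then columns)

toy[2, 2]         # a single value: 19.8
toy[1:3, 2]       # a numeric vector: 21.5 19.8 23.1
toy[3, ]          # a one-row data frame holding every column of row 3
toy[, 2]          # the whole 2nd column, returned as a vector

class(toy[3, ])   # "data.frame"
class(toy[, 2])   # "numeric"
```

Note that selecting a row gives back a data frame, while selecting a single column gives back a plain vector.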
Exercise
The function nrow() on a data.frame returns the number of rows. For example, try typing nrow(metadata). Use nrow() and seq() to create a new data frame called meta_by_2 that includes all even-numbered rows of metadata.
Solution
meta_by_2 <- metadata[seq(2, nrow(metadata), by = 2), ]
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. Sometimes the column number for a particular variable can change if your analysis adds or removes columns. The best practice when working with columns in a data frame is to refer to them by name. This also makes your code easier to read and your intentions clearer.
There are two ways to select a column by name from a data frame:
dataframe[ , "column_name"]
dataframe$column_name
You can do operations on a particular column by selecting it with the $ sign. In this case, the entire column is a vector. You can use names(metadata) or colnames(metadata) to remind yourself of the column names. For instance, to extract all the genotype information from our dataset:
# Select the Genotype column from metadata
metadata[ , "Genotype"]
# Alternatively...
metadata$Genotype
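The sketch below illustrates, on a made-up data frame, that both selection styles return the same vector and that the result can be passed straight to ordinary vector operations. The data frame toy, the weight column, and all values are hypothetical; only the column name Genotype is borrowed from the example above.

```r
# Hypothetical data; not taken from metadata.csv.
toy <- data.frame(
  Genotype = c("WT", "KO", "WT", "KO"),
  weight = c(21.5, 19.8, 23.1, 20.4),
  stringsAsFactors = FALSE
)

names(toy)                          # "Genotype" "weight"

by_bracket <- toy[, "weight"]       # select by name inside brackets
by_dollar <- toy$weight             # select by name with $
identical(by_bracket, by_dollar)    # TRUE

# A selected column is a vector, so vector operations apply to it directly.
avg_weight <- mean(toy$weight)      # average of the column
weight_mg <- toy$weight * 1000      # element-wise arithmetic
```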
The first method allows you to select multiple columns at once. Suppose we wanted genotype and sex information:
metadata[, c("Genotype", "Sex")]
You can also combine column names with specific rows of interest. For example, if we wanted the genotype and sex of just rows 4 through 7, we could do:
metadata[4:7, c("Genotype", "Sex")]
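As a small illustration of combining a row range with named columns, the following sketch uses a made-up data frame. The data frame toy, the weight column, and all values are assumptions; the name subset_rows is also introduced only for this example.

```r
# Hypothetical data; not taken from metadata.csv.
toy <- data.frame(
  Genotype = c("WT", "KO", "WT", "KO", "WT"),
  Sex = c("F", "M", "M", "F", "F"),
  weight = c(21.5, 19.8, 23.1, 20.4, 22.0),
  stringsAsFactors = FALSE
)

# Rows 2 through 4, keeping only the two named columns.
subset_rows <- toy[2:4, c("Genotype", "Sex")]

dim(subset_rows)        # 3 2
names(subset_rows)      # "Genotype" "Sex"
rownames(subset_rows)   # "2" "3" "4"  (row names carry over from toy)
```

The result is itself a data frame, so it can be indexed again or assigned to a new name for later use.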
Data Carpentry, 2017.