Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.
Enter dplyr
.
dplyr
is a package for making data manipulation easier.
Packages in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str()
, come built into R; packages give you access to more functions. You need to install a package and then load it to be able to use it.
install.packages("dplyr") ## install
You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; I’d recommend choosing the RStudio mirror.
library("dplyr") ## load
You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.
The package dplyr
is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr
which has been in use for some time but suffered from being slow in some cases.dplyr
addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.
This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.
We’re going to learn some of the most common dplyr
functions: select()
, filter()
, mutate()
, group_by()
, and summarize()
. To select columns of a data frame, use select()
. The first argument to this function is the data frame (metadata
), and the subsequent arguments are the columns to keep.
select(metadata, SampleID, Sex, Genotype)
## SampleID Sex Genotype
## 1 100CHE6KO F chemerin_KO
## 2 101CHE6WT F WT
## 3 102CHE6WT M WT
## 4 103CHE6KO M chemerin_KO
## 5 104CHE6KO M chemerin_KO
## 6 105CHE6WT WT
## 7 106CHE6WT F WT
## 8 107CHE6KO M chemerin_KO
## 9 108CHE6KO F chemerin_KO
## 10 109CHE6WT M WT
## 11 10CMK6KO F CMKLR1_KO
## 12 110CHE6WT F WT
## 13 111CHE6KO F chemerin_KO
## 14 112CHE6KO F chemerin_KO
## 15 113CHE6WT F WT
## 16 114CHE6WT M WT
## 17 115CHE6KO M chemerin_KO
## 18 116CHE6KO M chemerin_KO
## 19 11CMK6WT F WT
## 20 12CMK6WT M WT
## 21 13CMK6HET F CMKLR1_HE
## 22 14CMK6KO F CMKLR1_KO
## 23 15CMK6HET M CMKLR1_HE
## 24 16CMK6KO M CMKLR1_KO
## 25 17CMK6WT M WT
## 26 18CMK6KO M CMKLR1_KO
## 27 19CMK6WT F WT
## 28 1CMK6WT M WT
## 29 20CMK6KO M CMKLR1_KO
## 30 21CMK6WT M WT
## 31 22CMK6KO F CMKLR1_KO
## 32 23CMK6WT F WT
## 33 24CMK6KO F CMKLR1_KO
## 34 25CMK6KO CMKLR1_KO
## 35 26CMK6WT WT
## 36 27CMK6WT M WT
## 37 28CMK6KO M CMKLR1_KO
## 38 29CMK6WT M WT
## 39 2CMK6KO F CMKLR1_KO
## 40 30CMK6KO F CMKLR1_KO
## 41 31CMK6WT M WT
## 42 32CMK6KO F CMKLR1_KO
## 43 33CMK6WT M WT
## 44 34CMK6KO M CMKLR1_KO
## 45 35CMK6KO F CMKLR1_KO
## 46 36CMK6WT F WT
## 47 37CMK6WT M WT
## 48 38CMK6KO F CMKLR1_KO
## 49 39CMK6KO M CMKLR1_KO
## 50 3CMK6WT WT
## 51 40CMK6WT F WT
## 52 41CMK8WT F WT
## 53 42CMK8KO M CMKLR1_KO
## 54 43CMK8WT M WT
## 55 44CMK8KO F CMKLR1_KO
## 56 45CMK8WT F WT
## 57 46CMK8WT M WT
## 58 47CMK8KO F CMKLR1_KO
## 59 48CMK8KO CMKLR1_KO
## 60 49CMK8KO M CMKLR1_KO
## 61 4CMK6KO M CMKLR1_KO
## 62 50CMK8KO F CMKLR1_KO
## 63 51CMK8WT F WT
## 64 52CMK8WT M WT
## 65 53CMK8HET M CMKLR1_HE
## 66 54CMK8KO F CMKLR1_KO
## 67 55CMK8HET F CMKLR1_HE
## 68 56CMK8KO F CMKLR1_KO
## 69 57CMK8WT F WT
## 70 58CMK8KO M CMKLR1_KO
## 71 59CMK8WT M WT
## 72 5CMK6WT M WT
## 73 60CMK8KO F CMKLR1_KO
## 74 61CMK8WT F WT
## 75 62CMK8KO F CMKLR1_KO
## 76 63CMK8WT F WT
## 77 64CMK8KO M CMKLR1_KO
## 78 65CMK8KO M CMKLR1_KO
## 79 66CMK8WT WT
## 80 67CMK8WT F WT
## 81 68CMK8KO F CMKLR1_KO
## 82 69CMK8WT M WT
## 83 6CMK6WT F WT
## 84 70CMK8KO M CMKLR1_KO
## 85 71CMK8WT F WT
## 86 72CMK8KO M CMKLR1_KO
## 87 73CMK8WT F WT
## 88 74CMK8KO M CMKLR1_KO
## 89 75CMK8KO M CMKLR1_KO
## 90 76CMK8WT F WT
## 91 77CMK8WT F WT
## 92 78CMK8KO M CMKLR1_KO
## 93 79CMK8KO M CMKLR1_KO
## 94 7CMK6KO F CMKLR1_KO
## 95 80CMK8WT M WT
## 96 81CHE6WT F WT
## 97 82CHE6WT M WT
## 98 83CHE6KO F chemerin_KO
## 99 84CHE6KO M chemerin_KO
## 100 85CHE6WT F WT
## 101 86CHE6WT M WT
## 102 87CHE6KO F chemerin_KO
## 103 88CHE6KO F chemerin_KO
## 104 89CHE6WT F WT
## 105 8CMK6KO M CMKLR1_KO
## 106 90CHE6WT M WT
## 107 91CHE6KO F chemerin_KO
## 108 92CHE6KO F chemerin_KO
## 109 93CHE6WT M WT
## 110 94CHE6WT F WT
## 111 95CHE6KO chemerin_KO
## 112 96CHE6KO M chemerin_KO
## 113 97CHE6WT F WT
## 114 98CHE6WT M WT
## 115 99CHE6KO M chemerin_KO
## 116 9CMK6KO F CMKLR1_KO
To choose rows based on specific criteria, use filter()
:
filter(metadata, Genotype == "WT")
## SampleID BarcodeSequence LinkerPrimerSequence
## 1 101CHE6WT NA NA
## 2 102CHE6WT NA NA
## 3 105CHE6WT NA NA
## 4 106CHE6WT NA NA
## 5 109CHE6WT NA NA
## 6 110CHE6WT NA NA
## 7 113CHE6WT NA NA
## 8 114CHE6WT NA NA
## 9 11CMK6WT NA NA
## 10 12CMK6WT NA NA
## 11 17CMK6WT NA NA
## 12 19CMK6WT NA NA
## 13 1CMK6WT NA NA
## 14 21CMK6WT NA NA
## 15 23CMK6WT NA NA
## 16 26CMK6WT NA NA
## 17 27CMK6WT NA NA
## 18 29CMK6WT NA NA
## 19 31CMK6WT NA NA
## 20 33CMK6WT NA NA
## 21 36CMK6WT NA NA
## 22 37CMK6WT NA NA
## 23 3CMK6WT NA NA
## 24 40CMK6WT NA NA
## 25 41CMK8WT NA NA
## 26 43CMK8WT NA NA
## 27 45CMK8WT NA NA
## 28 46CMK8WT NA NA
## 29 51CMK8WT NA NA
## 30 52CMK8WT NA NA
## 31 57CMK8WT NA NA
## 32 59CMK8WT NA NA
## 33 5CMK6WT NA NA
## 34 61CMK8WT NA NA
## 35 63CMK8WT NA NA
## 36 66CMK8WT NA NA
## 37 67CMK8WT NA NA
## 38 69CMK8WT NA NA
## 39 6CMK6WT NA NA
## 40 71CMK8WT NA NA
## 41 73CMK8WT NA NA
## 42 76CMK8WT NA NA
## 43 77CMK8WT NA NA
## 44 80CMK8WT NA NA
## 45 81CHE6WT NA NA
## 46 82CHE6WT NA NA
## 47 85CHE6WT NA NA
## 48 86CHE6WT NA NA
## 49 89CHE6WT NA NA
## 50 90CHE6WT NA NA
## 51 93CHE6WT NA NA
## 52 94CHE6WT NA NA
## 53 97CHE6WT NA NA
## 54 98CHE6WT NA NA
## FileInput Source Mouse Sex Weight Genotype
## 1 101CHE6WT_S368_L001_R1_001.fastq BZ 101 F 20.6 WT
## 2 102CHE6WT_S291_L001_R1_001.fastq BZ 102 M NA WT
## 3 105CHE6WT_S325_L001_R1_001.fastq BZ 105 NA WT
## 4 106CHE6WT_S336_L001_R1_001.fastq BZ 106 F NA WT
## 5 109CHE6WT_S369_L001_R1_001.fastq BZ 109 M 21.6 WT
## 6 110CHE6WT_S292_L001_R1_001.fastq BZ 110 F 24.1 WT
## 7 113CHE6WT_S326_L001_R1_001.fastq BZ 113 F 19.5 WT
## 8 114CHE6WT_S337_L001_R1_001.fastq BZ 114 M 19.2 WT
## 9 11CMK6WT_S243_L001_R1_001.fastq CJS 11 F 21.8 WT
## 10 12CMK6WT_S255_L001_R1_001.fastq CJS 12 M NA WT
## 11 17CMK6WT_S220_L001_R1_001.fastq CJS 17 M 22.6 WT
## 12 19CMK6WT_S244_L001_R1_001.fastq CJS 19 F 21.4 WT
## 13 1CMK6WT_S218_L001_R1_001.fastq CJS 1 M NA WT
## 14 21CMK6WT_S268_L001_R1_001.fastq CJS 21 M 23.2 WT
## 15 23CMK6WT_S197_L001_R1_001.fastq CJS 23 F 18.4 WT
## 16 26CMK6WT_S233_L001_R1_001.fastq CJS 26 NA WT
## 17 27CMK6WT_S245_L001_R1_001.fastq CJS 27 M 20.6 WT
## 18 29CMK6WT_S269_L001_R1_001.fastq CJS 29 M 23.6 WT
## 19 31CMK6WT_S198_L001_R1_001.fastq CJS 31 M 22.6 WT
## 20 33CMK6WT_S222_L001_R1_001.fastq CJS 33 M 21.5 WT
## 21 36CMK6WT_S258_L001_R1_001.fastq CJS 36 F NA WT
## 22 37CMK6WT_S270_L001_R1_001.fastq CJS 37 M 18.6 WT
## 23 3CMK6WT_S242_L001_R1_001.fastq CJS 3 21.8 WT
## 24 40CMK6WT_S211_L001_R1_001.fastq CJS 40 F 21.9 WT
## 25 41CMK8WT_S223_L001_R1_001.fastq CJS 41 F 22.5 WT
## 26 43CMK8WT_S247_L001_R1_001.fastq CJS 43 M 21.6 WT
## 27 45CMK8WT_S271_L001_R1_001.fastq CJS 45 F 19.7 WT
## 28 46CMK8WT_S283_L001_R1_001.fastq CJS 46 M 18.6 WT
## 29 51CMK8WT_S248_L001_R1_001.fastq CJS 51 F 23.5 WT
## 30 52CMK8WT_S260_L001_R1_001.fastq CJS 52 M 24.1 WT
## 31 57CMK8WT_S225_L001_R1_001.fastq CJS 57 F 18.2 WT
## 32 59CMK8WT_S249_L001_R1_001.fastq CJS 59 M 21.8 WT
## 33 5CMK6WT_S266_L001_R1_001.fastq CJS 5 M 23.5 WT
## 34 61CMK8WT_S273_L001_R1_001.fastq CJS 61 F 22.6 WT
## 35 63CMK8WT_S202_L001_R1_001.fastq CJS 63 F 23.5 WT
## 36 66CMK8WT_S238_L001_R1_001.fastq CJS 66 21.4 WT
## 37 67CMK8WT_S250_L001_R1_001.fastq CJS 67 F 21.1 WT
## 38 69CMK8WT_S274_L001_R1_001.fastq CJS 69 M NA WT
## 39 6CMK6WT_S278_L001_R1_001.fastq CJS 6 F 19.5 WT
## 40 71CMK8WT_S203_L001_R1_001.fastq CJS 71 F 19.8 WT
## 41 73CMK8WT_S227_L001_R1_001.fastq CJS 73 F 21.8 WT
## 42 76CMK8WT_S263_L001_R1_001.fastq CJS 76 F 23.6 WT
## 43 77CMK8WT_S275_L001_R1_001.fastq CJS 77 F 21.5 WT
## 44 80CMK8WT_S216_L001_R1_001.fastq CJS 80 M 18.6 WT
## 45 81CHE6WT_S228_L001_R1_001.fastq BZ 81 F 19.4 WT
## 46 82CHE6WT_S240_L001_R1_001.fastq BZ 82 M 20.5 WT
## 47 85CHE6WT_S276_L001_R1_001.fastq BZ 85 F 22.5 WT
## 48 86CHE6WT_S289_L001_R1_001.fastq BZ 86 M 23.4 WT
## 49 89CHE6WT_S323_L001_R1_001.fastq BZ 89 F 19.7 WT
## 50 90CHE6WT_S334_L001_R1_001.fastq BZ 90 M 20.4 WT
## 51 93CHE6WT_S367_L001_R1_001.fastq BZ 93 M 22.4 WT
## 52 94CHE6WT_S290_L001_R1_001.fastq BZ 94 F 21.6 WT
## 53 97CHE6WT_S324_L001_R1_001.fastq BZ 97 F 20.8 WT
## 54 98CHE6WT_S335_L001_R1_001.fastq BZ 98 M 19.5 WT
## SamplingWeek Description
## 1 6 101CHE6WT
## 2 6 102CHE6WT
## 3 6 105CHE6WT
## 4 6 106CHE6WT
## 5 6 109CHE6WT
## 6 6 110CHE6WT
## 7 6 113CHE6WT
## 8 6 114CHE6WT
## 9 6 11CMK6WT
## 10 6 12CMK6WT
## 11 6 17CMK6WT
## 12 6 19CMK6WT
## 13 6 1CMK6WT
## 14 6 21CMK6WT
## 15 6 23CMK6WT
## 16 6 26CMK6WT
## 17 6 27CMK6WT
## 18 6 29CMK6WT
## 19 6 31CMK6WT
## 20 6 33CMK6WT
## 21 6 36CMK6WT
## 22 6 37CMK6WT
## 23 6 3CMK6WT
## 24 6 40CMK6WT
## 25 8 41CMK8WT
## 26 8 43CMK8WT
## 27 8 45CMK8WT
## 28 8 46CMK8WT
## 29 8 51CMK8WT
## 30 8 52CMK8WT
## 31 8 57CMK8WT
## 32 8 59CMK8WT
## 33 8 5CMK6WT
## 34 8 61CMK8WT
## 35 8 63CMK8WT
## 36 8 66CMK8WT
## 37 8 67CMK8WT
## 38 8 69CMK8WT
## 39 8 6CMK6WT
## 40 8 71CMK8WT
## 41 8 73CMK8WT
## 42 8 76CMK8WT
## 43 8 77CMK8WT
## 44 8 80CMK8WT
## 45 6 81CHE6WT
## 46 6 82CHE6WT
## 47 6 85CHE6WT
## 48 6 86CHE6WT
## 49 6 89CHE6WT
## 50 6 90CHE6WT
## 51 6 93CHE6WT
## 52 6 94CHE6WT
## 53 6 97CHE6WT
## 54 6 98CHE6WT
But what if you wanted to select and filter? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as the process from inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set. Pipes in R look like %>%
and are made available via the magrittr
package installed as part of dplyr
.
metadata %>%
filter(Genotype == "WT") %>%
select(SampleID, Sex, Genotype)
## SampleID Sex Genotype
## 1 101CHE6WT F WT
## 2 102CHE6WT M WT
## 3 105CHE6WT WT
## 4 106CHE6WT F WT
## 5 109CHE6WT M WT
## 6 110CHE6WT F WT
## 7 113CHE6WT F WT
## 8 114CHE6WT M WT
## 9 11CMK6WT F WT
## 10 12CMK6WT M WT
## 11 17CMK6WT M WT
## 12 19CMK6WT F WT
## 13 1CMK6WT M WT
## 14 21CMK6WT M WT
## 15 23CMK6WT F WT
## 16 26CMK6WT WT
## 17 27CMK6WT M WT
## 18 29CMK6WT M WT
## 19 31CMK6WT M WT
## 20 33CMK6WT M WT
## 21 36CMK6WT F WT
## 22 37CMK6WT M WT
## 23 3CMK6WT WT
## 24 40CMK6WT F WT
## 25 41CMK8WT F WT
## 26 43CMK8WT M WT
## 27 45CMK8WT F WT
## 28 46CMK8WT M WT
## 29 51CMK8WT F WT
## 30 52CMK8WT M WT
## 31 57CMK8WT F WT
## 32 59CMK8WT M WT
## 33 5CMK6WT M WT
## 34 61CMK8WT F WT
## 35 63CMK8WT F WT
## 36 66CMK8WT WT
## 37 67CMK8WT F WT
## 38 69CMK8WT M WT
## 39 6CMK6WT F WT
## 40 71CMK8WT F WT
## 41 73CMK8WT F WT
## 42 76CMK8WT F WT
## 43 77CMK8WT F WT
## 44 80CMK8WT M WT
## 45 81CHE6WT F WT
## 46 82CHE6WT M WT
## 47 85CHE6WT F WT
## 48 86CHE6WT M WT
## 49 89CHE6WT F WT
## 50 90CHE6WT M WT
## 51 93CHE6WT M WT
## 52 94CHE6WT F WT
## 53 97CHE6WT F WT
## 54 98CHE6WT M WT
In the above we use the pipe to send the metadata
data set first through filter
, to keep rows where Genotype
was equal to ‘WT’, and then through select
to keep the SampleID
and Sex
and Genotype
columns. When the data frame is being passed to the filter()
and select()
functions through a pipe, we don’t need to include it as an argument to these functions anymore.
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
meta_WT <- metadata %>%
filter(Genotype == "WT") %>%
select(SampleID, Sex, Genotype)
meta_WT
## SampleID Sex Genotype
## 1 101CHE6WT F WT
## 2 102CHE6WT M WT
## 3 105CHE6WT WT
## 4 106CHE6WT F WT
## 5 109CHE6WT M WT
## 6 110CHE6WT F WT
## 7 113CHE6WT F WT
## 8 114CHE6WT M WT
## 9 11CMK6WT F WT
## 10 12CMK6WT M WT
## 11 17CMK6WT M WT
## 12 19CMK6WT F WT
## 13 1CMK6WT M WT
## 14 21CMK6WT M WT
## 15 23CMK6WT F WT
## 16 26CMK6WT WT
## 17 27CMK6WT M WT
## 18 29CMK6WT M WT
## 19 31CMK6WT M WT
## 20 33CMK6WT M WT
## 21 36CMK6WT F WT
## 22 37CMK6WT M WT
## 23 3CMK6WT WT
## 24 40CMK6WT F WT
## 25 41CMK8WT F WT
## 26 43CMK8WT M WT
## 27 45CMK8WT F WT
## 28 46CMK8WT M WT
## 29 51CMK8WT F WT
## 30 52CMK8WT M WT
## 31 57CMK8WT F WT
## 32 59CMK8WT M WT
## 33 5CMK6WT M WT
## 34 61CMK8WT F WT
## 35 63CMK8WT F WT
## 36 66CMK8WT WT
## 37 67CMK8WT F WT
## 38 69CMK8WT M WT
## 39 6CMK6WT F WT
## 40 71CMK8WT F WT
## 41 73CMK8WT F WT
## 42 76CMK8WT F WT
## 43 77CMK8WT F WT
## 44 80CMK8WT M WT
## 45 81CHE6WT F WT
## 46 82CHE6WT M WT
## 47 85CHE6WT F WT
## 48 86CHE6WT M WT
## 49 89CHE6WT F WT
## 50 90CHE6WT M WT
## 51 93CHE6WT M WT
## 52 94CHE6WT F WT
## 53 97CHE6WT F WT
## 54 98CHE6WT M WT
Challenge
Using pipes, subset the metadata to include rows where the Sex is male, and retain only the columns
SampleID
andGenotype.
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate()
.
To create a new column of weight in kg:
metadata %>%
mutate(Weight_kg = Weight / 1000)
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head()
of the data (pipes work with non-dplyr functions too, as long as the dplyr
or magrittr
packages are loaded).
metadata %>%
mutate(Weight_kg = Weight / 1000) %>%
head
Note that we don’t include parentheses at the end of our call to head()
above. When piping into a function with no additional arguments, you can call the function with or without parentheses (e.g. head
or head()
).
The first few rows of the output are full of NA
s, so if we wanted to remove those we could insert a filter()
in the chain:
metadata %>%
filter(!is.na(Weight)) %>%
mutate(Weight_kg = Weight / 1000) %>%
head
is.na()
is a function that determines whether something is or is not an NA
. The !
symbol negates it, so we’re asking for everything that is not an NA
.
Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function, which splits the data into groups. When the data is grouped in this way summarize()
can be used to collapse each group into a single-row summary. summarize()
does this by applying an aggregating or summary function to each group. For example, if we wanted to group by genotype and find the number of rows of data for each genotype, we would do:
metadata %>%
group_by(Genotype) %>%
summarize(n())
## # A tibble: 4 x 2
## Genotype `n()`
## <fctr> <int>
## 1 chemerin_KO 18
## 2 CMKLR1_HE 4
## 3 CMKLR1_KO 40
## 4 WT 54
Here the summary function used was n()
to find the count for each group. We can also apply many other functions to individual columns to get other summary statistics. For example, in the R base package we can use built-in functions like mean
, median
, min
, and max
. By default, all R functions operating on vectors that contains missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA
(the missing data) is to use na.rm=TRUE
(rm
stands for remove).
So to view mean weight
by sex:
metadata %>%
group_by(Sex) %>%
summarize(mean_weight = mean(Weight, na.rm = TRUE))
## # A tibble: 3 x 2
## Sex mean_weight
## <fctr> <dbl>
## 1 21.74000
## 2 F 21.47925
## 3 M 21.30435
You can group by multiple columns too:
metadata %>%
group_by(Sex, Genotype) %>%
summarize(mean_weight = mean(Weight, na.rm = TRUE))
## # A tibble: 11 x 3
## # Groups: Sex [?]
## Sex Genotype mean_weight
## <fctr> <fctr> <dbl>
## 1 chemerin_KO 23.50000
## 2 CMKLR1_KO 21.00000
## 3 WT 21.60000
## 4 F chemerin_KO 22.48750
## 5 F CMKLR1_HE 20.95000
## 6 F CMKLR1_KO 21.45263
## 7 F WT 21.20833
## 8 M chemerin_KO 20.47143
## 9 M CMKLR1_HE 22.35000
## 10 M CMKLR1_KO 21.41765
## 11 M WT 21.39500
When grouping both by Sex
and Genotype
, the first rows are for individuals for which their Weight was not recorded. We can remove the missing values for weight before we attempt to calculate the summary statistics. Because the missing values are removed, we can omit na.rm = TRUE
when computing the mean:
metadata %>%
filter(!is.na(Weight)) %>%
group_by(Sex, Genotype) %>%
summarize(mean_weight = mean(Weight))
## # A tibble: 11 x 3
## # Groups: Sex [?]
## Sex Genotype mean_weight
## <fctr> <fctr> <dbl>
## 1 chemerin_KO 23.50000
## 2 CMKLR1_KO 21.00000
## 3 WT 21.60000
## 4 F chemerin_KO 22.48750
## 5 F CMKLR1_HE 20.95000
## 6 F CMKLR1_KO 21.45263
## 7 F WT 21.20833
## 8 M chemerin_KO 20.47143
## 9 M CMKLR1_HE 22.35000
## 10 M CMKLR1_KO 21.41765
## 11 M WT 21.39500
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum Weight for each Genotype for each Sex:
metadata %>%
filter(!is.na(Weight)) %>%
group_by(Sex, Genotype) %>%
summarize(mean_weight = mean(Weight),
min_weight = min(Weight))
## # A tibble: 11 x 4
## # Groups: Sex [?]
## Sex Genotype mean_weight min_weight
## <fctr> <fctr> <dbl> <dbl>
## 1 chemerin_KO 23.50000 23.5
## 2 CMKLR1_KO 21.00000 20.5
## 3 WT 21.60000 21.4
## 4 F chemerin_KO 22.48750 20.5
## 5 F CMKLR1_HE 20.95000 19.5
## 6 F CMKLR1_KO 21.45263 18.5
## 7 F WT 21.20833 18.2
## 8 M chemerin_KO 20.47143 18.2
## 9 M CMKLR1_HE 22.35000 20.8
## 10 M CMKLR1_KO 21.41765 18.4
## 11 M WT 21.39500 18.6
Looks like for some of these samples, the Sex was also not recorded. We could then discard those rows using filter()
:
metadata %>%
group_by(Sex, Genotype) %>%
summarize(mean_weight = mean(Weight, na.rm = TRUE)) %>%
filter(Sex != "")
Much of this lesson was copied or adapted from Jeff Hollister’s materials
Data Carpentry,
2017. License. Contributing.
Questions? Feedback?
Please file
an issue on GitHub.
On
Twitter: @datacarpentry