Data manipulation using dplyr

Learning Objectives

Describe what the dplyr package in R is used for.
Apply common dplyr functions to manipulate data in R.
Employ the ‘pipe’ operator to link together a sequence of functions.
Employ the ‘mutate’ function to apply other chosen functions to existing columns and create new columns of data.
Employ the ‘split-apply-combine’ concept to split the data into groups, apply analysis to each group, and combine the results.

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.

Enter dplyr.

dplyr is a package for making data manipulation easier.

Packages in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str(), come built into R; packages give you access to more functions. You need to install a package and then load it to be able to use it.

install.packages("dplyr") ## install

You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; I’d recommend choosing the RStudio mirror.

library("dplyr")          ## load

You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.

What is dplyr?

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases.dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.

This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.

Selecting columns and filtering rows

We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame (metadata), and the subsequent arguments are the columns to keep.

select(metadata, SampleID, Sex, Genotype)

##      SampleID Sex    Genotype
## 1   100CHE6KO   F chemerin_KO
## 2   101CHE6WT   F          WT
## 3   102CHE6WT   M          WT
## 4   103CHE6KO   M chemerin_KO
## 5   104CHE6KO   M chemerin_KO
## 6   105CHE6WT              WT
## 7   106CHE6WT   F          WT
## 8   107CHE6KO   M chemerin_KO
## 9   108CHE6KO   F chemerin_KO
## 10  109CHE6WT   M          WT
## 11   10CMK6KO   F   CMKLR1_KO
## 12  110CHE6WT   F          WT
## 13  111CHE6KO   F chemerin_KO
## 14  112CHE6KO   F chemerin_KO
## 15  113CHE6WT   F          WT
## 16  114CHE6WT   M          WT
## 17  115CHE6KO   M chemerin_KO
## 18  116CHE6KO   M chemerin_KO
## 19   11CMK6WT   F          WT
## 20   12CMK6WT   M          WT
## 21  13CMK6HET   F   CMKLR1_HE
## 22   14CMK6KO   F   CMKLR1_KO
## 23  15CMK6HET   M   CMKLR1_HE
## 24   16CMK6KO   M   CMKLR1_KO
## 25   17CMK6WT   M          WT
## 26   18CMK6KO   M   CMKLR1_KO
## 27   19CMK6WT   F          WT
## 28    1CMK6WT   M          WT
## 29   20CMK6KO   M   CMKLR1_KO
## 30   21CMK6WT   M          WT
## 31   22CMK6KO   F   CMKLR1_KO
## 32   23CMK6WT   F          WT
## 33   24CMK6KO   F   CMKLR1_KO
## 34   25CMK6KO       CMKLR1_KO
## 35   26CMK6WT              WT
## 36   27CMK6WT   M          WT
## 37   28CMK6KO   M   CMKLR1_KO
## 38   29CMK6WT   M          WT
## 39    2CMK6KO   F   CMKLR1_KO
## 40   30CMK6KO   F   CMKLR1_KO
## 41   31CMK6WT   M          WT
## 42   32CMK6KO   F   CMKLR1_KO
## 43   33CMK6WT   M          WT
## 44   34CMK6KO   M   CMKLR1_KO
## 45   35CMK6KO   F   CMKLR1_KO
## 46   36CMK6WT   F          WT
## 47   37CMK6WT   M          WT
## 48   38CMK6KO   F   CMKLR1_KO
## 49   39CMK6KO   M   CMKLR1_KO
## 50    3CMK6WT              WT
## 51   40CMK6WT   F          WT
## 52   41CMK8WT   F          WT
## 53   42CMK8KO   M   CMKLR1_KO
## 54   43CMK8WT   M          WT
## 55   44CMK8KO   F   CMKLR1_KO
## 56   45CMK8WT   F          WT
## 57   46CMK8WT   M          WT
## 58   47CMK8KO   F   CMKLR1_KO
## 59   48CMK8KO       CMKLR1_KO
## 60   49CMK8KO   M   CMKLR1_KO
## 61    4CMK6KO   M   CMKLR1_KO
## 62   50CMK8KO   F   CMKLR1_KO
## 63   51CMK8WT   F          WT
## 64   52CMK8WT   M          WT
## 65  53CMK8HET   M   CMKLR1_HE
## 66   54CMK8KO   F   CMKLR1_KO
## 67  55CMK8HET   F   CMKLR1_HE
## 68   56CMK8KO   F   CMKLR1_KO
## 69   57CMK8WT   F          WT
## 70   58CMK8KO   M   CMKLR1_KO
## 71   59CMK8WT   M          WT
## 72    5CMK6WT   M          WT
## 73   60CMK8KO   F   CMKLR1_KO
## 74   61CMK8WT   F          WT
## 75   62CMK8KO   F   CMKLR1_KO
## 76   63CMK8WT   F          WT
## 77   64CMK8KO   M   CMKLR1_KO
## 78   65CMK8KO   M   CMKLR1_KO
## 79   66CMK8WT              WT
## 80   67CMK8WT   F          WT
## 81   68CMK8KO   F   CMKLR1_KO
## 82   69CMK8WT   M          WT
## 83    6CMK6WT   F          WT
## 84   70CMK8KO   M   CMKLR1_KO
## 85   71CMK8WT   F          WT
## 86   72CMK8KO   M   CMKLR1_KO
## 87   73CMK8WT   F          WT
## 88   74CMK8KO   M   CMKLR1_KO
## 89   75CMK8KO   M   CMKLR1_KO
## 90   76CMK8WT   F          WT
## 91   77CMK8WT   F          WT
## 92   78CMK8KO   M   CMKLR1_KO
## 93   79CMK8KO   M   CMKLR1_KO
## 94    7CMK6KO   F   CMKLR1_KO
## 95   80CMK8WT   M          WT
## 96   81CHE6WT   F          WT
## 97   82CHE6WT   M          WT
## 98   83CHE6KO   F chemerin_KO
## 99   84CHE6KO   M chemerin_KO
## 100  85CHE6WT   F          WT
## 101  86CHE6WT   M          WT
## 102  87CHE6KO   F chemerin_KO
## 103  88CHE6KO   F chemerin_KO
## 104  89CHE6WT   F          WT
## 105   8CMK6KO   M   CMKLR1_KO
## 106  90CHE6WT   M          WT
## 107  91CHE6KO   F chemerin_KO
## 108  92CHE6KO   F chemerin_KO
## 109  93CHE6WT   M          WT
## 110  94CHE6WT   F          WT
## 111  95CHE6KO     chemerin_KO
## 112  96CHE6KO   M chemerin_KO
## 113  97CHE6WT   F          WT
## 114  98CHE6WT   M          WT
## 115  99CHE6KO   M chemerin_KO
## 116   9CMK6KO   F   CMKLR1_KO

To choose rows based on specific criteria, use filter():

filter(metadata, Genotype == "WT")

##     SampleID BarcodeSequence LinkerPrimerSequence
## 1  101CHE6WT              NA                   NA
## 2  102CHE6WT              NA                   NA
## 3  105CHE6WT              NA                   NA
## 4  106CHE6WT              NA                   NA
## 5  109CHE6WT              NA                   NA
## 6  110CHE6WT              NA                   NA
## 7  113CHE6WT              NA                   NA
## 8  114CHE6WT              NA                   NA
## 9   11CMK6WT              NA                   NA
## 10  12CMK6WT              NA                   NA
## 11  17CMK6WT              NA                   NA
## 12  19CMK6WT              NA                   NA
## 13   1CMK6WT              NA                   NA
## 14  21CMK6WT              NA                   NA
## 15  23CMK6WT              NA                   NA
## 16  26CMK6WT              NA                   NA
## 17  27CMK6WT              NA                   NA
## 18  29CMK6WT              NA                   NA
## 19  31CMK6WT              NA                   NA
## 20  33CMK6WT              NA                   NA
## 21  36CMK6WT              NA                   NA
## 22  37CMK6WT              NA                   NA
## 23   3CMK6WT              NA                   NA
## 24  40CMK6WT              NA                   NA
## 25  41CMK8WT              NA                   NA
## 26  43CMK8WT              NA                   NA
## 27  45CMK8WT              NA                   NA
## 28  46CMK8WT              NA                   NA
## 29  51CMK8WT              NA                   NA
## 30  52CMK8WT              NA                   NA
## 31  57CMK8WT              NA                   NA
## 32  59CMK8WT              NA                   NA
## 33   5CMK6WT              NA                   NA
## 34  61CMK8WT              NA                   NA
## 35  63CMK8WT              NA                   NA
## 36  66CMK8WT              NA                   NA
## 37  67CMK8WT              NA                   NA
## 38  69CMK8WT              NA                   NA
## 39   6CMK6WT              NA                   NA
## 40  71CMK8WT              NA                   NA
## 41  73CMK8WT              NA                   NA
## 42  76CMK8WT              NA                   NA
## 43  77CMK8WT              NA                   NA
## 44  80CMK8WT              NA                   NA
## 45  81CHE6WT              NA                   NA
## 46  82CHE6WT              NA                   NA
## 47  85CHE6WT              NA                   NA
## 48  86CHE6WT              NA                   NA
## 49  89CHE6WT              NA                   NA
## 50  90CHE6WT              NA                   NA
## 51  93CHE6WT              NA                   NA
## 52  94CHE6WT              NA                   NA
## 53  97CHE6WT              NA                   NA
## 54  98CHE6WT              NA                   NA
##                           FileInput Source Mouse Sex Weight Genotype
## 1  101CHE6WT_S368_L001_R1_001.fastq     BZ   101   F   20.6       WT
## 2  102CHE6WT_S291_L001_R1_001.fastq     BZ   102   M     NA       WT
## 3  105CHE6WT_S325_L001_R1_001.fastq     BZ   105         NA       WT
## 4  106CHE6WT_S336_L001_R1_001.fastq     BZ   106   F     NA       WT
## 5  109CHE6WT_S369_L001_R1_001.fastq     BZ   109   M   21.6       WT
## 6  110CHE6WT_S292_L001_R1_001.fastq     BZ   110   F   24.1       WT
## 7  113CHE6WT_S326_L001_R1_001.fastq     BZ   113   F   19.5       WT
## 8  114CHE6WT_S337_L001_R1_001.fastq     BZ   114   M   19.2       WT
## 9   11CMK6WT_S243_L001_R1_001.fastq    CJS    11   F   21.8       WT
## 10  12CMK6WT_S255_L001_R1_001.fastq    CJS    12   M     NA       WT
## 11  17CMK6WT_S220_L001_R1_001.fastq    CJS    17   M   22.6       WT
## 12  19CMK6WT_S244_L001_R1_001.fastq    CJS    19   F   21.4       WT
## 13   1CMK6WT_S218_L001_R1_001.fastq    CJS     1   M     NA       WT
## 14  21CMK6WT_S268_L001_R1_001.fastq    CJS    21   M   23.2       WT
## 15  23CMK6WT_S197_L001_R1_001.fastq    CJS    23   F   18.4       WT
## 16  26CMK6WT_S233_L001_R1_001.fastq    CJS    26         NA       WT
## 17  27CMK6WT_S245_L001_R1_001.fastq    CJS    27   M   20.6       WT
## 18  29CMK6WT_S269_L001_R1_001.fastq    CJS    29   M   23.6       WT
## 19  31CMK6WT_S198_L001_R1_001.fastq    CJS    31   M   22.6       WT
## 20  33CMK6WT_S222_L001_R1_001.fastq    CJS    33   M   21.5       WT
## 21  36CMK6WT_S258_L001_R1_001.fastq    CJS    36   F     NA       WT
## 22  37CMK6WT_S270_L001_R1_001.fastq    CJS    37   M   18.6       WT
## 23   3CMK6WT_S242_L001_R1_001.fastq    CJS     3       21.8       WT
## 24  40CMK6WT_S211_L001_R1_001.fastq    CJS    40   F   21.9       WT
## 25  41CMK8WT_S223_L001_R1_001.fastq    CJS    41   F   22.5       WT
## 26  43CMK8WT_S247_L001_R1_001.fastq    CJS    43   M   21.6       WT
## 27  45CMK8WT_S271_L001_R1_001.fastq    CJS    45   F   19.7       WT
## 28  46CMK8WT_S283_L001_R1_001.fastq    CJS    46   M   18.6       WT
## 29  51CMK8WT_S248_L001_R1_001.fastq    CJS    51   F   23.5       WT
## 30  52CMK8WT_S260_L001_R1_001.fastq    CJS    52   M   24.1       WT
## 31  57CMK8WT_S225_L001_R1_001.fastq    CJS    57   F   18.2       WT
## 32  59CMK8WT_S249_L001_R1_001.fastq    CJS    59   M   21.8       WT
## 33   5CMK6WT_S266_L001_R1_001.fastq    CJS     5   M   23.5       WT
## 34  61CMK8WT_S273_L001_R1_001.fastq    CJS    61   F   22.6       WT
## 35  63CMK8WT_S202_L001_R1_001.fastq    CJS    63   F   23.5       WT
## 36  66CMK8WT_S238_L001_R1_001.fastq    CJS    66       21.4       WT
## 37  67CMK8WT_S250_L001_R1_001.fastq    CJS    67   F   21.1       WT
## 38  69CMK8WT_S274_L001_R1_001.fastq    CJS    69   M     NA       WT
## 39   6CMK6WT_S278_L001_R1_001.fastq    CJS     6   F   19.5       WT
## 40  71CMK8WT_S203_L001_R1_001.fastq    CJS    71   F   19.8       WT
## 41  73CMK8WT_S227_L001_R1_001.fastq    CJS    73   F   21.8       WT
## 42  76CMK8WT_S263_L001_R1_001.fastq    CJS    76   F   23.6       WT
## 43  77CMK8WT_S275_L001_R1_001.fastq    CJS    77   F   21.5       WT
## 44  80CMK8WT_S216_L001_R1_001.fastq    CJS    80   M   18.6       WT
## 45  81CHE6WT_S228_L001_R1_001.fastq     BZ    81   F   19.4       WT
## 46  82CHE6WT_S240_L001_R1_001.fastq     BZ    82   M   20.5       WT
## 47  85CHE6WT_S276_L001_R1_001.fastq     BZ    85   F   22.5       WT
## 48  86CHE6WT_S289_L001_R1_001.fastq     BZ    86   M   23.4       WT
## 49  89CHE6WT_S323_L001_R1_001.fastq     BZ    89   F   19.7       WT
## 50  90CHE6WT_S334_L001_R1_001.fastq     BZ    90   M   20.4       WT
## 51  93CHE6WT_S367_L001_R1_001.fastq     BZ    93   M   22.4       WT
## 52  94CHE6WT_S290_L001_R1_001.fastq     BZ    94   F   21.6       WT
## 53  97CHE6WT_S324_L001_R1_001.fastq     BZ    97   F   20.8       WT
## 54  98CHE6WT_S335_L001_R1_001.fastq     BZ    98   M   19.5       WT
##    SamplingWeek Description
## 1             6   101CHE6WT
## 2             6   102CHE6WT
## 3             6   105CHE6WT
## 4             6   106CHE6WT
## 5             6   109CHE6WT
## 6             6   110CHE6WT
## 7             6   113CHE6WT
## 8             6   114CHE6WT
## 9             6    11CMK6WT
## 10            6    12CMK6WT
## 11            6    17CMK6WT
## 12            6    19CMK6WT
## 13            6     1CMK6WT
## 14            6    21CMK6WT
## 15            6    23CMK6WT
## 16            6    26CMK6WT
## 17            6    27CMK6WT
## 18            6    29CMK6WT
## 19            6    31CMK6WT
## 20            6    33CMK6WT
## 21            6    36CMK6WT
## 22            6    37CMK6WT
## 23            6     3CMK6WT
## 24            6    40CMK6WT
## 25            8    41CMK8WT
## 26            8    43CMK8WT
## 27            8    45CMK8WT
## 28            8    46CMK8WT
## 29            8    51CMK8WT
## 30            8    52CMK8WT
## 31            8    57CMK8WT
## 32            8    59CMK8WT
## 33            8     5CMK6WT
## 34            8    61CMK8WT
## 35            8    63CMK8WT
## 36            8    66CMK8WT
## 37            8    67CMK8WT
## 38            8    69CMK8WT
## 39            8     6CMK6WT
## 40            8    71CMK8WT
## 41            8    73CMK8WT
## 42            8    76CMK8WT
## 43            8    77CMK8WT
## 44            8    80CMK8WT
## 45            6    81CHE6WT
## 46            6    82CHE6WT
## 47            6    85CHE6WT
## 48            6    86CHE6WT
## 49            6    89CHE6WT
## 50            6    90CHE6WT
## 51            6    93CHE6WT
## 52            6    94CHE6WT
## 53            6    97CHE6WT
## 54            6    98CHE6WT

Pipes

But what if you wanted to select and filter? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as the process from inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set. Pipes in R look like %>% and are made available via the magrittr package installed as part of dplyr.

metadata %>%
  filter(Genotype == "WT") %>%
  select(SampleID, Sex, Genotype)

##     SampleID Sex Genotype
## 1  101CHE6WT   F       WT
## 2  102CHE6WT   M       WT
## 3  105CHE6WT           WT
## 4  106CHE6WT   F       WT
## 5  109CHE6WT   M       WT
## 6  110CHE6WT   F       WT
## 7  113CHE6WT   F       WT
## 8  114CHE6WT   M       WT
## 9   11CMK6WT   F       WT
## 10  12CMK6WT   M       WT
## 11  17CMK6WT   M       WT
## 12  19CMK6WT   F       WT
## 13   1CMK6WT   M       WT
## 14  21CMK6WT   M       WT
## 15  23CMK6WT   F       WT
## 16  26CMK6WT           WT
## 17  27CMK6WT   M       WT
## 18  29CMK6WT   M       WT
## 19  31CMK6WT   M       WT
## 20  33CMK6WT   M       WT
## 21  36CMK6WT   F       WT
## 22  37CMK6WT   M       WT
## 23   3CMK6WT           WT
## 24  40CMK6WT   F       WT
## 25  41CMK8WT   F       WT
## 26  43CMK8WT   M       WT
## 27  45CMK8WT   F       WT
## 28  46CMK8WT   M       WT
## 29  51CMK8WT   F       WT
## 30  52CMK8WT   M       WT
## 31  57CMK8WT   F       WT
## 32  59CMK8WT   M       WT
## 33   5CMK6WT   M       WT
## 34  61CMK8WT   F       WT
## 35  63CMK8WT   F       WT
## 36  66CMK8WT           WT
## 37  67CMK8WT   F       WT
## 38  69CMK8WT   M       WT
## 39   6CMK6WT   F       WT
## 40  71CMK8WT   F       WT
## 41  73CMK8WT   F       WT
## 42  76CMK8WT   F       WT
## 43  77CMK8WT   F       WT
## 44  80CMK8WT   M       WT
## 45  81CHE6WT   F       WT
## 46  82CHE6WT   M       WT
## 47  85CHE6WT   F       WT
## 48  86CHE6WT   M       WT
## 49  89CHE6WT   F       WT
## 50  90CHE6WT   M       WT
## 51  93CHE6WT   M       WT
## 52  94CHE6WT   F       WT
## 53  97CHE6WT   F       WT
## 54  98CHE6WT   M       WT

In the above we use the pipe to send the metadata data set first through filter, to keep rows where Genotype was equal to ‘WT’, and then through select to keep the SampleID and Sex and Genotype columns. When the data frame is being passed to the filter() and select() functions through a pipe, we don’t need to include it as an argument to these functions anymore.

If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:

meta_WT <- metadata %>%
  filter(Genotype == "WT") %>%
  select(SampleID, Sex, Genotype)

meta_WT

##     SampleID Sex Genotype
## 1  101CHE6WT   F       WT
## 2  102CHE6WT   M       WT
## 3  105CHE6WT           WT
## 4  106CHE6WT   F       WT
## 5  109CHE6WT   M       WT
## 6  110CHE6WT   F       WT
## 7  113CHE6WT   F       WT
## 8  114CHE6WT   M       WT
## 9   11CMK6WT   F       WT
## 10  12CMK6WT   M       WT
## 11  17CMK6WT   M       WT
## 12  19CMK6WT   F       WT
## 13   1CMK6WT   M       WT
## 14  21CMK6WT   M       WT
## 15  23CMK6WT   F       WT
## 16  26CMK6WT           WT
## 17  27CMK6WT   M       WT
## 18  29CMK6WT   M       WT
## 19  31CMK6WT   M       WT
## 20  33CMK6WT   M       WT
## 21  36CMK6WT   F       WT
## 22  37CMK6WT   M       WT
## 23   3CMK6WT           WT
## 24  40CMK6WT   F       WT
## 25  41CMK8WT   F       WT
## 26  43CMK8WT   M       WT
## 27  45CMK8WT   F       WT
## 28  46CMK8WT   M       WT
## 29  51CMK8WT   F       WT
## 30  52CMK8WT   M       WT
## 31  57CMK8WT   F       WT
## 32  59CMK8WT   M       WT
## 33   5CMK6WT   M       WT
## 34  61CMK8WT   F       WT
## 35  63CMK8WT   F       WT
## 36  66CMK8WT           WT
## 37  67CMK8WT   F       WT
## 38  69CMK8WT   M       WT
## 39   6CMK6WT   F       WT
## 40  71CMK8WT   F       WT
## 41  73CMK8WT   F       WT
## 42  76CMK8WT   F       WT
## 43  77CMK8WT   F       WT
## 44  80CMK8WT   M       WT
## 45  81CHE6WT   F       WT
## 46  82CHE6WT   M       WT
## 47  85CHE6WT   F       WT
## 48  86CHE6WT   M       WT
## 49  89CHE6WT   F       WT
## 50  90CHE6WT   M       WT
## 51  93CHE6WT   M       WT
## 52  94CHE6WT   F       WT
## 53  97CHE6WT   F       WT
## 54  98CHE6WT   M       WT

Challenge

Using pipes, subset the metadata to include rows where the Sex is male, and retain only the columns SampleID and Genotype.

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of weight in kg:

metadata %>%
  mutate(Weight_kg = Weight / 1000)

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data (pipes work with non-dplyr functions too, as long as the dplyr or magrittr packages are loaded).

metadata %>%
  mutate(Weight_kg = Weight / 1000) %>%
  head

Note that we don’t include parentheses at the end of our call to head() above. When piping into a function with no additional arguments, you can call the function with or without parentheses (e.g. head or head()).

The first few rows of the output are full of NAs, so if we wanted to remove those we could insert a filter() in the chain:

metadata %>%
  filter(!is.na(Weight)) %>%
  mutate(Weight_kg = Weight / 1000) %>%
  head

is.na() is a function that determines whether something is or is not an NA. The ! symbol negates it, so we’re asking for everything that is not an NA.

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function, which splits the data into groups. When the data is grouped in this way summarize() can be used to collapse each group into a single-row summary. summarize() does this by applying an aggregating or summary function to each group. For example, if we wanted to group by genotype and find the number of rows of data for each genotype, we would do:

metadata %>%
  group_by(Genotype) %>%
  summarize(n())

## # A tibble: 4 x 2
##      Genotype `n()`
##        <fctr> <int>
## 1 chemerin_KO    18
## 2   CMKLR1_HE     4
## 3   CMKLR1_KO    40
## 4          WT    54

Here the summary function used was n() to find the count for each group. We can also apply many other functions to individual columns to get other summary statistics. For example, in the R base package we can use built-in functions like mean, median, min, and max. By default, all R functions operating on vectors that contains missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove).

So to view mean weight by sex:

metadata %>%
  group_by(Sex) %>%
  summarize(mean_weight = mean(Weight, na.rm = TRUE))

## # A tibble: 3 x 2
##      Sex mean_weight
##   <fctr>       <dbl>
## 1           21.74000
## 2      F    21.47925
## 3      M    21.30435

You can group by multiple columns too:

metadata %>%
  group_by(Sex, Genotype) %>%
  summarize(mean_weight = mean(Weight, na.rm = TRUE))

## # A tibble: 11 x 3
## # Groups:   Sex [?]
##       Sex    Genotype mean_weight
##    <fctr>      <fctr>       <dbl>
##  1        chemerin_KO    23.50000
##  2          CMKLR1_KO    21.00000
##  3                 WT    21.60000
##  4      F chemerin_KO    22.48750
##  5      F   CMKLR1_HE    20.95000
##  6      F   CMKLR1_KO    21.45263
##  7      F          WT    21.20833
##  8      M chemerin_KO    20.47143
##  9      M   CMKLR1_HE    22.35000
## 10      M   CMKLR1_KO    21.41765
## 11      M          WT    21.39500

When grouping both by Sex and Genotype, the first rows are for individuals for which their Weight was not recorded. We can remove the missing values for weight before we attempt to calculate the summary statistics. Because the missing values are removed, we can omit na.rm = TRUE when computing the mean:

metadata %>%
  filter(!is.na(Weight)) %>%
  group_by(Sex, Genotype) %>%
  summarize(mean_weight = mean(Weight))

## # A tibble: 11 x 3
## # Groups:   Sex [?]
##       Sex    Genotype mean_weight
##    <fctr>      <fctr>       <dbl>
##  1        chemerin_KO    23.50000
##  2          CMKLR1_KO    21.00000
##  3                 WT    21.60000
##  4      F chemerin_KO    22.48750
##  5      F   CMKLR1_HE    20.95000
##  6      F   CMKLR1_KO    21.45263
##  7      F          WT    21.20833
##  8      M chemerin_KO    20.47143
##  9      M   CMKLR1_HE    22.35000
## 10      M   CMKLR1_KO    21.41765
## 11      M          WT    21.39500

Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum Weight for each Genotype for each Sex:

metadata %>%
  filter(!is.na(Weight)) %>%
  group_by(Sex, Genotype) %>%
  summarize(mean_weight = mean(Weight),
            min_weight = min(Weight))

## # A tibble: 11 x 4
## # Groups:   Sex [?]
##       Sex    Genotype mean_weight min_weight
##    <fctr>      <fctr>       <dbl>      <dbl>
##  1        chemerin_KO    23.50000       23.5
##  2          CMKLR1_KO    21.00000       20.5
##  3                 WT    21.60000       21.4
##  4      F chemerin_KO    22.48750       20.5
##  5      F   CMKLR1_HE    20.95000       19.5
##  6      F   CMKLR1_KO    21.45263       18.5
##  7      F          WT    21.20833       18.2
##  8      M chemerin_KO    20.47143       18.2
##  9      M   CMKLR1_HE    22.35000       20.8
## 10      M   CMKLR1_KO    21.41765       18.4
## 11      M          WT    21.39500       18.6

Looks like for some of these samples, the Sex was also not recorded. We could then discard those rows using filter():

metadata %>%
  group_by(Sex, Genotype) %>%
  summarize(mean_weight = mean(Weight, na.rm = TRUE)) %>%
  filter(Sex != "")

Handy dplyr cheatsheet

Much of this lesson was copied or adapted from Jeff Hollister’s materials

Data Carpentry, 2017. License. Contributing.
Questions? Feedback? Please file an issue on GitHub.
On Twitter: @datacarpentry

Aggregating and analyzing data with dplyr

Data Carpentry contributors