Learning Objectives

  • Load external tabular data from a .csv file into R.
  • Describe what an R data frame is.
  • Summarize the contents of a data frame in R.
  • Manipulate categorical data in R using factors.

Looking at Metadata

We will be looking at a modified version of the metadata from a gene knockout study. The aim of this study was to determine whether knocking out the protein ‘chemerin’ affects gut microbial composition. In total, 116 mouse samples were acquired from two different facilities. Metadata associated with each sample is recorded in the mapping file (mouse_mapfile.csv), which we will download using RStudio. The mapping file lists the genotypes of interest: wildtype (WT), chemerin knockout (chemerin_KO), chemerin receptor knockout (CMKLR1_KO) and a heterozygote for the receptor knockout (CMKLR1_HE, abbreviated HET in the sample IDs). Also of importance are the two source facilities: “BZ” and “CJS”. It is generally a good idea to record as much metadata as possible, since it can easily be explored later on.

The metadata mapping file of this dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single mouse, and the columns represent:

Column                Description
SampleID              Unique id for each sample
BarcodeSequence       Barcode sequence used for each sample (NA)
LinkerPrimerSequence  Linker/primer sequence used to amplify the sample (NA)
FileInput             Name of .fastq input file
Source                BZ or CJS
Mouse                 Unique number for each mouse
Sex                   Sex of each mouse
Weight                Weight of each mouse in grams
Cage                  Cage number (NA)
Genotype              WT, chemerin_KO, CMKLR1_HE, or CMKLR1_KO
SamplingWeek          Week in which mouse was sampled
Description           Unique id for each mouse

We are going to use the R function download.file() to download the CSV file that contains the metadata, and we will use read.csv() to load into memory the content of the CSV file as an object of class data.frame.
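Note that download.file() does not create missing directories, so the data/ subdirectory has to exist before the download is run. A minimal sketch, assuming the current working directory is your project folder and that you want data/ created inside it:

```r
# Create the data/ subdirectory if it is not there yet.
# Assumption: the working directory is the project folder for this lesson.
if (!dir.exists("data")) {
  dir.create("data")
}

# Confirm that the folder is now available as a download target
dir.exists("data")
```

Running this more than once is harmless, because the folder is only created when it is missing.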

To download the metadata into the data/ subdirectory, run the following:

download.file("https://raw.githubusercontent.com/BinxiePeterson/introduction-to-R-for-16S/gh-pages/mouse_mapfile.csv",
              "data/metadata.csv")

You are now ready to load the data. We are going to use the R function read.csv() to load the data file into memory (as a data.frame):

metadata <- read.csv('data/metadata.csv', header = TRUE)

This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value by typing its name:

metadata

Alternatively, wrapping an assignment in parentheses will perform the assignment and display it at the same time.

(metadata <- read.csv('data/metadata.csv'))
##      SampleID BarcodeSequence LinkerPrimerSequence
## 1   100CHE6KO              NA                   NA
## 2   101CHE6WT              NA                   NA
## 3   102CHE6WT              NA                   NA
## 4   103CHE6KO              NA                   NA
## 5   104CHE6KO              NA                   NA
## 6   105CHE6WT              NA                   NA
## 7   106CHE6WT              NA                   NA
## 8   107CHE6KO              NA                   NA
## 9   108CHE6KO              NA                   NA
## 10  109CHE6WT              NA                   NA
## 11   10CMK6KO              NA                   NA
## 12  110CHE6WT              NA                   NA
## 13  111CHE6KO              NA                   NA
## 14  112CHE6KO              NA                   NA
## 15  113CHE6WT              NA                   NA
## 16  114CHE6WT              NA                   NA
## 17  115CHE6KO              NA                   NA
## 18  116CHE6KO              NA                   NA
## 19   11CMK6WT              NA                   NA
## 20   12CMK6WT              NA                   NA
## 21  13CMK6HET              NA                   NA
## 22   14CMK6KO              NA                   NA
## 23  15CMK6HET              NA                   NA
## 24   16CMK6KO              NA                   NA
## 25   17CMK6WT              NA                   NA
## 26   18CMK6KO              NA                   NA
## 27   19CMK6WT              NA                   NA
## 28    1CMK6WT              NA                   NA
## 29   20CMK6KO              NA                   NA
## 30   21CMK6WT              NA                   NA
## 31   22CMK6KO              NA                   NA
## 32   23CMK6WT              NA                   NA
## 33   24CMK6KO              NA                   NA
## 34   25CMK6KO              NA                   NA
## 35   26CMK6WT              NA                   NA
## 36   27CMK6WT              NA                   NA
## 37   28CMK6KO              NA                   NA
## 38   29CMK6WT              NA                   NA
## 39    2CMK6KO              NA                   NA
## 40   30CMK6KO              NA                   NA
## 41   31CMK6WT              NA                   NA
## 42   32CMK6KO              NA                   NA
## 43   33CMK6WT              NA                   NA
## 44   34CMK6KO              NA                   NA
## 45   35CMK6KO              NA                   NA
## 46   36CMK6WT              NA                   NA
## 47   37CMK6WT              NA                   NA
## 48   38CMK6KO              NA                   NA
## 49   39CMK6KO              NA                   NA
## 50    3CMK6WT              NA                   NA
## 51   40CMK6WT              NA                   NA
## 52   41CMK8WT              NA                   NA
## 53   42CMK8KO              NA                   NA
## 54   43CMK8WT              NA                   NA
## 55   44CMK8KO              NA                   NA
## 56   45CMK8WT              NA                   NA
## 57   46CMK8WT              NA                   NA
## 58   47CMK8KO              NA                   NA
## 59   48CMK8KO              NA                   NA
## 60   49CMK8KO              NA                   NA
## 61    4CMK6KO              NA                   NA
## 62   50CMK8KO              NA                   NA
## 63   51CMK8WT              NA                   NA
## 64   52CMK8WT              NA                   NA
## 65  53CMK8HET              NA                   NA
## 66   54CMK8KO              NA                   NA
## 67  55CMK8HET              NA                   NA
## 68   56CMK8KO              NA                   NA
## 69   57CMK8WT              NA                   NA
## 70   58CMK8KO              NA                   NA
## 71   59CMK8WT              NA                   NA
## 72    5CMK6WT              NA                   NA
## 73   60CMK8KO              NA                   NA
## 74   61CMK8WT              NA                   NA
## 75   62CMK8KO              NA                   NA
## 76   63CMK8WT              NA                   NA
## 77   64CMK8KO              NA                   NA
## 78   65CMK8KO              NA                   NA
## 79   66CMK8WT              NA                   NA
## 80   67CMK8WT              NA                   NA
## 81   68CMK8KO              NA                   NA
## 82   69CMK8WT              NA                   NA
## 83    6CMK6WT              NA                   NA
## 84   70CMK8KO              NA                   NA
## 85   71CMK8WT              NA                   NA
## 86   72CMK8KO              NA                   NA
## 87   73CMK8WT              NA                   NA
## 88   74CMK8KO              NA                   NA
## 89   75CMK8KO              NA                   NA
## 90   76CMK8WT              NA                   NA
## 91   77CMK8WT              NA                   NA
## 92   78CMK8KO              NA                   NA
## 93   79CMK8KO              NA                   NA
## 94    7CMK6KO              NA                   NA
## 95   80CMK8WT              NA                   NA
## 96   81CHE6WT              NA                   NA
## 97   82CHE6WT              NA                   NA
## 98   83CHE6KO              NA                   NA
## 99   84CHE6KO              NA                   NA
## 100  85CHE6WT              NA                   NA
## 101  86CHE6WT              NA                   NA
## 102  87CHE6KO              NA                   NA
## 103  88CHE6KO              NA                   NA
## 104  89CHE6WT              NA                   NA
## 105   8CMK6KO              NA                   NA
## 106  90CHE6WT              NA                   NA
## 107  91CHE6KO              NA                   NA
## 108  92CHE6KO              NA                   NA
## 109  93CHE6WT              NA                   NA
## 110  94CHE6WT              NA                   NA
## 111  95CHE6KO              NA                   NA
## 112  96CHE6KO              NA                   NA
## 113  97CHE6WT              NA                   NA
## 114  98CHE6WT              NA                   NA
## 115  99CHE6KO              NA                   NA
## 116   9CMK6KO              NA                   NA
##                            FileInput Source Mouse Sex Weight    Genotype
## 1   100CHE6KO_S357_L001_R1_001.fastq     BZ   100   F     NA chemerin_KO
## 2   101CHE6WT_S368_L001_R1_001.fastq     BZ   101   F   20.6          WT
## 3   102CHE6WT_S291_L001_R1_001.fastq     BZ   102   M     NA          WT
## 4   103CHE6KO_S302_L001_R1_001.fastq     BZ   103   M   19.7 chemerin_KO
## 5   104CHE6KO_S313_L001_R1_001.fastq     BZ   104   M     NA chemerin_KO
## 6   105CHE6WT_S325_L001_R1_001.fastq     BZ   105         NA          WT
## 7   106CHE6WT_S336_L001_R1_001.fastq     BZ   106   F     NA          WT
## 8   107CHE6KO_S347_L001_R1_001.fastq     BZ   107   M   22.8 chemerin_KO
## 9   108CHE6KO_S358_L001_R1_001.fastq     BZ   108   F   22.4 chemerin_KO
## 10  109CHE6WT_S369_L001_R1_001.fastq     BZ   109   M   21.6          WT
## 11   10CMK6KO_S231_L001_R1_001.fastq    CJS    10   F   23.5   CMKLR1_KO
## 12  110CHE6WT_S292_L001_R1_001.fastq     BZ   110   F   24.1          WT
## 13  111CHE6KO_S303_L001_R1_001.fastq     BZ   111   F   20.8 chemerin_KO
## 14  112CHE6KO_S314_L001_R1_001.fastq     BZ   112   F   20.5 chemerin_KO
## 15  113CHE6WT_S326_L001_R1_001.fastq     BZ   113   F   19.5          WT
## 16  114CHE6WT_S337_L001_R1_001.fastq     BZ   114   M   19.2          WT
## 17  115CHE6KO_S348_L001_R1_001.fastq     BZ   115   M   18.2 chemerin_KO
## 18  116CHE6KO_S359_L001_R1_001.fastq     BZ   116   M   20.7 chemerin_KO
## 19   11CMK6WT_S243_L001_R1_001.fastq    CJS    11   F   21.8          WT
## 20   12CMK6WT_S255_L001_R1_001.fastq    CJS    12   M     NA          WT
## 21  13CMK6HET_S267_L001_R1_001.fastq    CJS    13   F   22.4   CMKLR1_HE
## 22   14CMK6KO_S279_L001_R1_001.fastq    CJS    14   F   22.6   CMKLR1_KO
## 23  15CMK6HET_S196_L001_R1_001.fastq    CJS    15   M   23.9   CMKLR1_HE
## 24   16CMK6KO_S208_L001_R1_001.fastq    CJS    16   M   23.5   CMKLR1_KO
## 25   17CMK6WT_S220_L001_R1_001.fastq    CJS    17   M   22.6          WT
## 26   18CMK6KO_S232_L001_R1_001.fastq    CJS    18   M   21.8   CMKLR1_KO
## 27   19CMK6WT_S244_L001_R1_001.fastq    CJS    19   F   21.4          WT
## 28    1CMK6WT_S218_L001_R1_001.fastq    CJS     1   M     NA          WT
## 29   20CMK6KO_S256_L001_R1_001.fastq    CJS    20   M   22.3   CMKLR1_KO
## 30   21CMK6WT_S268_L001_R1_001.fastq    CJS    21   M   23.2          WT
## 31   22CMK6KO_S280_L001_R1_001.fastq    CJS    22   F   19.5   CMKLR1_KO
## 32   23CMK6WT_S197_L001_R1_001.fastq    CJS    23   F   18.4          WT
## 33   24CMK6KO_S209_L001_R1_001.fastq    CJS    24   F   19.8   CMKLR1_KO
## 34   25CMK6KO_S221_L001_R1_001.fastq    CJS    25       20.5   CMKLR1_KO
## 35   26CMK6WT_S233_L001_R1_001.fastq    CJS    26         NA          WT
## 36   27CMK6WT_S245_L001_R1_001.fastq    CJS    27   M   20.6          WT
## 37   28CMK6KO_S257_L001_R1_001.fastq    CJS    28   M   21.7   CMKLR1_KO
## 38   29CMK6WT_S269_L001_R1_001.fastq    CJS    29   M   23.6          WT
## 39    2CMK6KO_S230_L001_R1_001.fastq    CJS     2   F   22.4   CMKLR1_KO
## 40   30CMK6KO_S281_L001_R1_001.fastq    CJS    30   F   21.5   CMKLR1_KO
## 41   31CMK6WT_S198_L001_R1_001.fastq    CJS    31   M   22.6          WT
## 42   32CMK6KO_S210_L001_R1_001.fastq    CJS    32   F   23.5   CMKLR1_KO
## 43   33CMK6WT_S222_L001_R1_001.fastq    CJS    33   M   21.5          WT
## 44   34CMK6KO_S234_L001_R1_001.fastq    CJS    34   M   20.8   CMKLR1_KO
## 45   35CMK6KO_S246_L001_R1_001.fastq    CJS    35   F   19.5   CMKLR1_KO
## 46   36CMK6WT_S258_L001_R1_001.fastq    CJS    36   F     NA          WT
## 47   37CMK6WT_S270_L001_R1_001.fastq    CJS    37   M   18.6          WT
## 48   38CMK6KO_S282_L001_R1_001.fastq    CJS    38   F   19.4   CMKLR1_KO
## 49   39CMK6KO_S199_L001_R1_001.fastq    CJS    39   M   20.5   CMKLR1_KO
## 50    3CMK6WT_S242_L001_R1_001.fastq    CJS     3       21.8          WT
## 51   40CMK6WT_S211_L001_R1_001.fastq    CJS    40   F   21.9          WT
## 52   41CMK8WT_S223_L001_R1_001.fastq    CJS    41   F   22.5          WT
## 53   42CMK8KO_S235_L001_R1_001.fastq    CJS    42   M   23.4   CMKLR1_KO
## 54   43CMK8WT_S247_L001_R1_001.fastq    CJS    43   M   21.6          WT
## 55   44CMK8KO_S259_L001_R1_001.fastq    CJS    44   F   28.5   CMKLR1_KO
## 56   45CMK8WT_S271_L001_R1_001.fastq    CJS    45   F   19.7          WT
## 57   46CMK8WT_S283_L001_R1_001.fastq    CJS    46   M   18.6          WT
## 58   47CMK8KO_S200_L001_R1_001.fastq    CJS    47   F   20.4   CMKLR1_KO
## 59   48CMK8KO_S212_L001_R1_001.fastq    CJS    48       21.5   CMKLR1_KO
## 60   49CMK8KO_S224_L001_R1_001.fastq    CJS    49   M   22.8   CMKLR1_KO
## 61    4CMK6KO_S254_L001_R1_001.fastq    CJS     4   M   22.4   CMKLR1_KO
## 62   50CMK8KO_S236_L001_R1_001.fastq    CJS    50   F   21.6   CMKLR1_KO
## 63   51CMK8WT_S248_L001_R1_001.fastq    CJS    51   F   23.5          WT
## 64   52CMK8WT_S260_L001_R1_001.fastq    CJS    52   M   24.1          WT
## 65  53CMK8HET_S272_L001_R1_001.fastq    CJS    53   M   20.8   CMKLR1_HE
## 66   54CMK8KO_S284_L001_R1_001.fastq    CJS    54   F   20.5   CMKLR1_KO
## 67  55CMK8HET_S201_L001_R1_001.fastq    CJS    55   F   19.5   CMKLR1_HE
## 68   56CMK8KO_S213_L001_R1_001.fastq    CJS    56   F   19.2   CMKLR1_KO
## 69   57CMK8WT_S225_L001_R1_001.fastq    CJS    57   F   18.2          WT
## 70   58CMK8KO_S237_L001_R1_001.fastq    CJS    58   M   20.7   CMKLR1_KO
## 71   59CMK8WT_S249_L001_R1_001.fastq    CJS    59   M   21.8          WT
## 72    5CMK6WT_S266_L001_R1_001.fastq    CJS     5   M   23.5          WT
## 73   60CMK8KO_S261_L001_R1_001.fastq    CJS    60   F   22.4   CMKLR1_KO
## 74   61CMK8WT_S273_L001_R1_001.fastq    CJS    61   F   22.6          WT
## 75   62CMK8KO_S285_L001_R1_001.fastq    CJS    62   F   23.9   CMKLR1_KO
## 76   63CMK8WT_S202_L001_R1_001.fastq    CJS    63   F   23.5          WT
## 77   64CMK8KO_S214_L001_R1_001.fastq    CJS    64   M   22.6   CMKLR1_KO
## 78   65CMK8KO_S226_L001_R1_001.fastq    CJS    65   M   21.8   CMKLR1_KO
## 79   66CMK8WT_S238_L001_R1_001.fastq    CJS    66       21.4          WT
## 80   67CMK8WT_S250_L001_R1_001.fastq    CJS    67   F   21.1          WT
## 81   68CMK8KO_S262_L001_R1_001.fastq    CJS    68   F   22.3   CMKLR1_KO
## 82   69CMK8WT_S274_L001_R1_001.fastq    CJS    69   M     NA          WT
## 83    6CMK6WT_S278_L001_R1_001.fastq    CJS     6   F   19.5          WT
## 84   70CMK8KO_S286_L001_R1_001.fastq    CJS    70   M   18.4   CMKLR1_KO
## 85   71CMK8WT_S203_L001_R1_001.fastq    CJS    71   F   19.8          WT
## 86   72CMK8KO_S215_L001_R1_001.fastq    CJS    72   M   20.5   CMKLR1_KO
## 87   73CMK8WT_S227_L001_R1_001.fastq    CJS    73   F   21.8          WT
## 88   74CMK8KO_S239_L001_R1_001.fastq    CJS    74   M   20.6   CMKLR1_KO
## 89   75CMK8KO_S251_L001_R1_001.fastq    CJS    75   M     NA   CMKLR1_KO
## 90   76CMK8WT_S263_L001_R1_001.fastq    CJS    76   F   23.6          WT
## 91   77CMK8WT_S275_L001_R1_001.fastq    CJS    77   F   21.5          WT
## 92   78CMK8KO_S287_L001_R1_001.fastq    CJS    78   M   20.8   CMKLR1_KO
## 93   79CMK8KO_S204_L001_R1_001.fastq    CJS    79   M   19.5   CMKLR1_KO
## 94    7CMK6KO_S195_L001_R1_001.fastq    CJS     7   F   18.5   CMKLR1_KO
## 95   80CMK8WT_S216_L001_R1_001.fastq    CJS    80   M   18.6          WT
## 96   81CHE6WT_S228_L001_R1_001.fastq     BZ    81   F   19.4          WT
## 97   82CHE6WT_S240_L001_R1_001.fastq     BZ    82   M   20.5          WT
## 98   83CHE6KO_S252_L001_R1_001.fastq     BZ    83   F   21.8 chemerin_KO
## 99   84CHE6KO_S264_L001_R1_001.fastq     BZ    84   M   21.9 chemerin_KO
## 100  85CHE6WT_S276_L001_R1_001.fastq     BZ    85   F   22.5          WT
## 101  86CHE6WT_S289_L001_R1_001.fastq     BZ    86   M   23.4          WT
## 102  87CHE6KO_S300_L001_R1_001.fastq     BZ    87   F   21.6 chemerin_KO
## 103  88CHE6KO_S311_L001_R1_001.fastq     BZ    88   F   28.5 chemerin_KO
## 104  89CHE6WT_S323_L001_R1_001.fastq     BZ    89   F   19.7          WT
## 105   8CMK6KO_S207_L001_R1_001.fastq    CJS     8   M     NA   CMKLR1_KO
## 106  90CHE6WT_S334_L001_R1_001.fastq     BZ    90   M   20.4          WT
## 107  91CHE6KO_S345_L001_R1_001.fastq     BZ    91   F   21.5 chemerin_KO
## 108  92CHE6KO_S356_L001_R1_001.fastq     BZ    92   F   22.8 chemerin_KO
## 109  93CHE6WT_S367_L001_R1_001.fastq     BZ    93   M   22.4          WT
## 110  94CHE6WT_S290_L001_R1_001.fastq     BZ    94   F   21.6          WT
## 111  95CHE6KO_S301_L001_R1_001.fastq     BZ    95       23.5 chemerin_KO
## 112  96CHE6KO_S312_L001_R1_001.fastq     BZ    96   M   21.5 chemerin_KO
## 113  97CHE6WT_S324_L001_R1_001.fastq     BZ    97   F   20.8          WT
## 114  98CHE6WT_S335_L001_R1_001.fastq     BZ    98   M   19.5          WT
## 115  99CHE6KO_S346_L001_R1_001.fastq     BZ    99   M   18.5 chemerin_KO
## 116   9CMK6KO_S219_L001_R1_001.fastq    CJS     9   F   18.6   CMKLR1_KO
##     SamplingWeek Description
## 1              6   100CHE6KO
## 2              6   101CHE6WT
## 3              6   102CHE6WT
## 4              6   103CHE6KO
## 5              6   104CHE6KO
## 6              6   105CHE6WT
## 7              6   106CHE6WT
## 8              6   107CHE6KO
## 9              6   108CHE6KO
## 10             6   109CHE6WT
## 11             6    10CMK6KO
## 12             6   110CHE6WT
## 13             6   111CHE6KO
## 14             6   112CHE6KO
## 15             6   113CHE6WT
## 16             6   114CHE6WT
## 17             6   115CHE6KO
## 18             6   116CHE6KO
## 19             6    11CMK6WT
## 20             6    12CMK6WT
## 21             6   13CMK6HET
## 22             6    14CMK6KO
## 23             6   15CMK6HET
## 24             6    16CMK6KO
## 25             6    17CMK6WT
## 26             6    18CMK6KO
## 27             6    19CMK6WT
## 28             6     1CMK6WT
## 29             6    20CMK6KO
## 30             6    21CMK6WT
## 31             6    22CMK6KO
## 32             6    23CMK6WT
## 33             6    24CMK6KO
## 34             6    25CMK6KO
## 35             6    26CMK6WT
## 36             6    27CMK6WT
## 37             6    28CMK6KO
## 38             6    29CMK6WT
## 39             6     2CMK6KO
## 40             6    30CMK6KO
## 41             6    31CMK6WT
## 42             6    32CMK6KO
## 43             6    33CMK6WT
## 44             6    34CMK6KO
## 45             6    35CMK6KO
## 46             6    36CMK6WT
## 47             6    37CMK6WT
## 48             6    38CMK6KO
## 49             6    39CMK6KO
## 50             6     3CMK6WT
## 51             6    40CMK6WT
## 52             8    41CMK8WT
## 53             8    42CMK8KO
## 54             8    43CMK8WT
## 55             8    44CMK8KO
## 56             8    45CMK8WT
## 57             8    46CMK8WT
## 58             8    47CMK8KO
## 59             8    48CMK8KO
## 60             8    49CMK8KO
## 61             8     4CMK6KO
## 62             8    50CMK8KO
## 63             8    51CMK8WT
## 64             8    52CMK8WT
## 65             8   53CMK8HET
## 66             8    54CMK8KO
## 67             8   55CMK8HET
## 68             8    56CMK8KO
## 69             8    57CMK8WT
## 70             8    58CMK8KO
## 71             8    59CMK8WT
## 72             8     5CMK6WT
## 73             8    60CMK8KO
## 74             8    61CMK8WT
## 75             8    62CMK8KO
## 76             8    63CMK8WT
## 77             8    64CMK8KO
## 78             8    65CMK8KO
## 79             8    66CMK8WT
## 80             8    67CMK8WT
## 81             8    68CMK8KO
## 82             8    69CMK8WT
## 83             8     6CMK6WT
## 84             8    70CMK8KO
## 85             8    71CMK8WT
## 86             8    72CMK8KO
## 87             8    73CMK8WT
## 88             8    74CMK8KO
## 89             8    75CMK8KO
## 90             8    76CMK8WT
## 91             8    77CMK8WT
## 92             8    78CMK8KO
## 93             8    79CMK8KO
## 94             6     7CMK6KO
## 95             8    80CMK8WT
## 96             6    81CHE6WT
## 97             6    82CHE6WT
## 98             6    83CHE6KO
## 99             6    84CHE6KO
## 100            6    85CHE6WT
## 101            6    86CHE6WT
## 102            6    87CHE6KO
## 103            6    88CHE6KO
## 104            6    89CHE6WT
## 105            6     8CMK6KO
## 106            6    90CHE6WT
## 107            6    91CHE6KO
## 108            6    92CHE6KO
## 109            6    93CHE6WT
## 110            6    94CHE6WT
## 111            6    95CHE6KO
## 112            6    96CHE6KO
## 113            6    97CHE6WT
## 114            6    98CHE6WT
## 115            6    99CHE6KO
## 116            6     9CMK6KO

Let’s check the top (the first 6 lines) of this data.frame using the function head():

head(metadata)
##    SampleID BarcodeSequence LinkerPrimerSequence
## 1 100CHE6KO              NA                   NA
## 2 101CHE6WT              NA                   NA
## 3 102CHE6WT              NA                   NA
## 4 103CHE6KO              NA                   NA
## 5 104CHE6KO              NA                   NA
## 6 105CHE6WT              NA                   NA
##                          FileInput Source Mouse Sex Weight    Genotype
## 1 100CHE6KO_S357_L001_R1_001.fastq     BZ   100   F     NA chemerin_KO
## 2 101CHE6WT_S368_L001_R1_001.fastq     BZ   101   F   20.6          WT
## 3 102CHE6WT_S291_L001_R1_001.fastq     BZ   102   M     NA          WT
## 4 103CHE6KO_S302_L001_R1_001.fastq     BZ   103   M   19.7 chemerin_KO
## 5 104CHE6KO_S313_L001_R1_001.fastq     BZ   104   M     NA chemerin_KO
## 6 105CHE6WT_S325_L001_R1_001.fastq     BZ   105         NA          WT
##   SamplingWeek Description
## 1            6   100CHE6KO
## 2            6   101CHE6WT
## 3            6   102CHE6WT
## 4            6   103CHE6KO
## 5            6   104CHE6KO
## 6            6   105CHE6WT
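Both head() and its counterpart tail() accept an optional n argument that controls how many rows are shown. The sketch below uses a small made-up data frame called toy (a hypothetical stand-in holding a few values copied from the output above) so that it runs without the downloaded file:

```r
# Toy data frame standing in for metadata (values copied from the rows above)
toy <- data.frame(
  SampleID = c("100CHE6KO", "101CHE6WT", "102CHE6WT", "103CHE6KO"),
  Source   = c("BZ", "BZ", "BZ", "BZ"),
  Weight   = c(NA, 20.6, NA, 19.7)
)

head(toy, n = 2)   # first two rows only
tail(toy, n = 2)   # last two rows
```

With the real data, head(metadata, n = 10) would show the first ten samples in the same way.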

We’ve just done two very useful things:

  1. We’ve read our data into R, so now we can work with it in R.
  2. We’ve created a data frame (with the read.csv() command), the standard way R works with data.

What are data frames?

data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting.

A data.frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.
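To make the "collection of vectors" idea concrete, here is a minimal sketch that builds a small data frame by hand. The object names (sample_id, mouse, weight, toy) are made up for illustration; the values are taken from the table printed earlier.

```r
# Three vectors of the same length, each with its own data type
sample_id <- c("101CHE6WT", "103CHE6KO", "107CHE6KO")   # character
mouse     <- c(101L, 103L, 107L)                        # integer
weight    <- c(20.6, 19.7, 22.8)                        # numeric (double)

# Combine them column-wise into a data.frame and inspect the column types
toy <- data.frame(SampleID = sample_id, Mouse = mouse, Weight = weight)
str(toy)

# Lengths that do not match (and cannot be recycled evenly) are rejected
result <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(result, "try-error")
```

The last two lines show that data.frame() refuses columns of 3 and 2 elements, which is why every column in a data frame has the same number of rows.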

A data.frame can be created by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web).

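In R versions before 4.0.0, read.csv() and read.table() by default convert (= coerce) columns that contain characters (i.e., text) into the factor data type; the output shown in this lesson was produced that way. From R 4.0.0 onward, the default is to keep such columns as character. Depending on what you want to do with the data, you can choose either behaviour explicitly with the argument stringsAsFactors, which both read.csv() and read.table() accept: set it to FALSE to keep text columns as character, or to TRUE to convert them to factors.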
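As a sketch of the stringsAsFactors argument in action, the example below reads a few CSV lines from a string instead of a file. The text argument of read.csv() is used here only so that the example is self-contained; csv_text, as_char and as_factor are made-up names.

```r
# A few lines in CSV form, standing in for data/metadata.csv
csv_text <- "SampleID,Source,Weight
101CHE6WT,BZ,20.6
10CMK6KO,CJS,23.5"

# Same input, read once with text kept as character and once as factor
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$Source)     # "character"
class(as_factor$Source)   # "factor"
```

For the lesson's file, the equivalent call would be read.csv('data/metadata.csv', stringsAsFactors = FALSE). The rest of this lesson assumes the text columns were read as factors.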

Let’s now check the structure (str) of this data.frame in more detail with the function str():

str(metadata)
## 'data.frame':    116 obs. of  11 variables:
##  $ SampleID            : Factor w/ 116 levels "100CHE6KO","101CHE6WT",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ BarcodeSequence     : logi  NA NA NA NA NA NA ...
##  $ LinkerPrimerSequence: logi  NA NA NA NA NA NA ...
##  $ FileInput           : Factor w/ 116 levels "100CHE6KO_S357_L001_R1_001.fastq",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Source              : Factor w/ 2 levels "BZ","CJS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Mouse               : int  100 101 102 103 104 105 106 107 108 109 ...
##  $ Sex                 : Factor w/ 3 levels "","F","M": 2 2 3 3 3 1 2 3 2 3 ...
##  $ Weight              : num  NA 20.6 NA 19.7 NA NA NA 22.8 22.4 21.6 ...
##  $ Genotype            : Factor w/ 4 levels "chemerin_KO",..: 1 4 4 1 1 4 4 1 1 4 ...
##  $ SamplingWeek        : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ Description         : Factor w/ 116 levels "100CHE6KO","101CHE6WT",..: 1 2 3 4 5 6 7 8 9 10 ...

Inspecting data.frame objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

  • Size:
    • dim() - returns a vector with the number of rows as the first element and the number of columns as the second element (the dimensions of the object)
    • nrow() - returns the number of rows
    • ncol() - returns the number of columns
  • Content:
    • head() - shows the first 6 rows
    • tail() - shows the last 6 rows
  • Names:
    • names() - returns the column names (synonym of colnames() for data.frame objects)
    • rownames() - returns the row names
  • Summary:
    • str() - structure of the object and information about the class, length and content of each column
    • summary() - summary statistics for each column

Note: most of these functions are “generic”, meaning they can be used on other types of objects besides data.frame.
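The functions listed above can be tried out on a small example. This is a sketch on a made-up data frame called toy (values copied from the first rows of the metadata), not on the full data set, so the numbers differ from what you will see for metadata:

```r
# Toy data frame with four rows and three columns
toy <- data.frame(
  Mouse  = c(100L, 101L, 102L, 103L),
  Sex    = c("F", "F", "M", "M"),
  Weight = c(NA, 20.6, NA, 19.7)
)

dim(toy)        # 4 3 : rows first, then columns
nrow(toy)       # 4
ncol(toy)       # 3
names(toy)      # "Mouse" "Sex" "Weight"
rownames(toy)   # "1" "2" "3" "4"
summary(toy)    # per-column summary; Weight reports 2 NA's
```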

Challenge

Based on the list of functions above for assessing data structure, can you answer the following questions?

  • What is the class of the object metadata?
  • How many rows and how many columns are in this object?
  • How many wild type strains have been used in this study?

As you can see, many of the columns in our data frame are of a special class called factor. Before we learn more about the data.frame class, we are going to talk about factors. They are very useful but not necessarily intuitive, and therefore require some attention.

Factors

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
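The integer storage can be made visible with a short sketch. The vectors sex and week below are made up for illustration; the second one shows the classic pitfall of converting a factor whose labels look like numbers.

```r
# A small factor: two levels, stored as integer codes
sex <- factor(c("F", "M", "M", "F"))
as.integer(sex)      # 1 2 2 1 : the integer codes
levels(sex)          # "F" "M" : the labels the codes point to
as.character(sex)    # "F" "M" "M" "F"

# Pitfall: a factor whose labels look like numbers
week <- factor(c("6", "8", "8", "6"))
as.numeric(week)                   # 1 2 2 1 : the codes, not the weeks
as.numeric(as.character(week))     # 6 8 8 6 : the actual values
```

Converting through as.character() first is the safe way to recover numbers from such a factor.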

Let’s look again at the structure of the data frame we just imported:

str(metadata)
## 'data.frame':    116 obs. of  11 variables:
##  $ SampleID            : Factor w/ 116 levels "100CHE6KO","101CHE6WT",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ BarcodeSequence     : logi  NA NA NA NA NA NA ...
##  $ LinkerPrimerSequence: logi  NA NA NA NA NA NA ...
##  $ FileInput           : Factor w/ 116 levels "100CHE6KO_S357_L001_R1_001.fastq",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Source              : Factor w/ 2 levels "BZ","CJS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Mouse               : int  100 101 102 103 104 105 106 107 108 109 ...
##  $ Sex                 : Factor w/ 3 levels "","F","M": 2 2 3 3 3 1 2 3 2 3 ...
##  $ Weight              : num  NA 20.6 NA 19.7 NA NA NA 22.8 22.4 21.6 ...
##  $ Genotype            : Factor w/ 4 levels "chemerin_KO",..: 1 4 4 1 1 4 4 1 1 4 ...
##  $ SamplingWeek        : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ Description         : Factor w/ 116 levels "100CHE6KO","101CHE6WT",..: 1 2 3 4 5 6 7 8 9 10 ...

We can see the names of the columns, and that several of them are described as, for example, Factor w/ 4 levels.

When we read in a file with read.csv() in R versions before 4.0.0 (or with stringsAsFactors = TRUE in later versions), any column that contains text is automatically converted to a factor. Once created, a factor can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order.

For instance, we see that Genotype is a Factor w/ 4 levels, chemerin_KO, CMKLR1_HE, CMKLR1_KO and WT.

You can check this by using the function levels(), and check the number of levels using nlevels():

levels(metadata$Genotype)
nlevels(metadata$Genotype)
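If the file has not been loaded yet, the same functions can be tried on a hand-made factor. The sketch below uses a made-up vector called genotype with the four genotype labels from this study; table() is added to count how many observations fall into each level.

```r
# Hand-made factor using the genotype labels from the mapping file
genotype <- factor(c("WT", "chemerin_KO", "WT", "CMKLR1_KO", "CMKLR1_HE", "WT"))

levels(genotype)     # the four distinct genotypes
nlevels(genotype)    # 4
table(genotype)      # number of observations per level
```

The counts here describe only the toy vector; running table(metadata$Genotype) gives the counts for the real data.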

Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, specifying the order of the levels (and marking the factor as ordered) makes it possible to compare levels:

expression <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(expression)   ## default order is alphabetical: "high" "low" "medium"
expression <- factor(expression, levels = c("low", "medium", "high"))
levels(expression)   ## now "low" "medium" "high"
min(expression)      ## doesn't work: raises an error for an unordered factor
expression <- factor(expression, levels = c("low", "medium", "high"), ordered = TRUE)
levels(expression)
min(expression)      ## works! returns low
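Once a factor is ordered, the usual comparison operators work on it as well. A minimal sketch, using a made-up vector expr_ord with the same three levels:

```r
# Ordered factor: low < medium < high
expr_ord <- factor(c("low", "high", "medium", "high"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

is.ordered(expr_ord)          # TRUE
expr_ord[1] < expr_ord[2]     # TRUE: low < high
expr_ord >= "medium"          # FALSE TRUE TRUE TRUE
max(expr_ord)                 # high
```

The comparison against "medium" is handy for filtering, for example expr_ord[expr_ord >= "medium"].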

In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than simple integer labels because factors are self-describing: "low", "medium", and "high" are more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the genotypes in our example data set).

Challenge

The function table() tabulates observations and can be used to create bar plots quickly. For instance:

## Question: How can you recreate this plot with "control"
## listed last instead of first?
exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control",
                   "treat1", "treat2", "treat3"))
table(exprmt)
## exprmt
## control  treat1  treat2  treat3 
##       1       4       2       2
barplot(table(exprmt))

## One possible answer: set the levels explicitly, with "control" last
exprmt <- factor(exprmt, levels=c("treat1", "treat2", "treat3", "control"))
barplot(table(exprmt))
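Typing out every level by hand becomes tedious when there are many of them. As a sketch of one alternative (not the only way to do it), the new order can be computed from the existing levels; new_order is a made-up name:

```r
exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control",
                   "treat1", "treat2", "treat3"))

# Keep every other level in its current order and append "control" at the end
new_order <- c(setdiff(levels(exprmt), "control"), "control")
exprmt <- factor(exprmt, levels = new_order)

levels(exprmt)   # "treat1" "treat2" "treat3" "control"
table(exprmt)    # same counts as before, in the new order
```

Passing table(exprmt) to barplot() then draws the bars in this order.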

Data Carpentry, 2017. License. Contributing.