Import Data

Data can be imported using functions from base R (such as read.csv() and read.table()) or with functions from readr (such as read_csv() and read_tsv()). There are subtle differences in the default behavior of these functions, including how they treat dashes and spaces in column names and whether headers and row names are expected by default. For this workshop, we will use read.csv() and read.table().
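
As one illustration of these default differences, here is a minimal sketch that uses only base R and a small made-up file. The column names `sample-id` and `tissue type` are hypothetical and are not taken from the workshop data. It shows how read.csv() rewrites column names that contain dashes or spaces unless check.names = FALSE is set.

```r
# Write a tiny, made-up CSV to a temporary file (not one of the workshop files)
tmp <- tempfile(fileext = ".csv")
writeLines(c("sample-id,tissue type,count",
             "GTEX-1117F,Heart,10",
             "GTEX-145ME,Muscle,12"), tmp)

# Default: names are passed through make.names(), so "-" and " " become "."
default_names <- names(read.csv(tmp))

# check.names = FALSE keeps the names exactly as written in the file
raw_names <- names(read.csv(tmp, check.names = FALSE))

print(default_names)
print(raw_names)

unlink(tmp)  # remove the temporary file
```

This renaming is worth keeping in mind when sample IDs that contain dashes are used as column names, because identifiers that matched before import may no longer match afterwards.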

Files

Today, I will show you how to import the following files:

  1. data/samples.csv
  2. data/GTEx_Heart_20-29_vs_50-59.tsv
  3. data/colData.HEART.csv
  4. data/countData.HEART.csv.gz

Later, you can practice on your own using the following files:

  1. data/GTEx_Muscle_20-29_vs_50-59.tsv
  2. data/colData.MUSCLE.csv
  3. data/countData.MUSCLE.csv.gz

The samples.csv file in ./data/ contains information about all the samples in the GTEx portal v8. Let’s import this file using read.csv().

read.csv()

samples <- read.csv("./data/samples.csv")

head() and tail()

After importing a file, there are multiple ways to view the data. Use head() and tail() to view the first and last 6 rows of a data frame.

head(samples)
##       SUBJID    SEX   AGE    DTHHRDY                   SAMPID           SMTS
## 1 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0226-SM-5GZZ7 Adipose Tissue
## 2 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0426-SM-5EGHI         Muscle
## 3 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0526-SM-5EGHJ   Blood Vessel
## 4 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0626-SM-5N9CS   Blood Vessel
## 5 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0726-SM-5GIEN          Heart
## 6 GTEX-1117F Female 60-69 Slow death GTEX-1117F-1326-SM-5EGHH Adipose Tissue
##   SMNABTCH  SMNABTCHD SMGEBTCHT SMAFRZE SMCENTER SMRIN SMATSSCR
## 1 BP-43693 2013-09-17 TruSeq.v1  RNASEQ       B1   6.8        0
## 2 BP-43495 2013-09-12 TruSeq.v1  RNASEQ       B1   7.1        0
## 3 BP-43495 2013-09-12 TruSeq.v1  RNASEQ       B1   8.0        0
## 4 BP-43956 2013-09-25 TruSeq.v1  RNASEQ       B1   6.9        1
## 5 BP-44261 2013-10-03 TruSeq.v1  RNASEQ       B1   6.3        1
## 6 BP-43495 2013-09-12 TruSeq.v1  RNASEQ       B1   5.9        1
tail(samples)
##          SUBJID    SEX   AGE         DTHHRDY                   SAMPID
## 1523 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-0926-SM-5O9AR
## 1524 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1026-SM-5O9B4
## 1525 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1126-SM-5SIAT
## 1526 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1226-SM-5SIB6
## 1527 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1326-SM-5O98Q
## 1528 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1426-SM-5RQJS
##                 SMTS SMNABTCH  SMNABTCHD SMGEBTCHT SMAFRZE SMCENTER SMRIN
## 1523 Small Intestine BP-47675 2013-12-19 TruSeq.v1  RNASEQ       B1   7.4
## 1524         Stomach BP-47675 2013-12-19 TruSeq.v1  RNASEQ       B1   7.4
## 1525           Colon BP-47616 2013-12-18 TruSeq.v1  RNASEQ       B1   6.9
## 1526           Ovary BP-47616 2013-12-18 TruSeq.v1  RNASEQ       B1   7.3
## 1527          Uterus BP-47675 2013-12-19 TruSeq.v1  RNASEQ       B1   8.5
## 1528          Vagina BP-48437 2014-01-17 TruSeq.v1  RNASEQ       B1   7.2
##      SMATSSCR
## 1523        1
## 1524        1
## 1525        1
## 1526        1
## 1527        1
## 1528        1
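
Both functions accept an `n` argument when 6 rows is too many or too few. The sketch below uses a made-up data frame (`toy` is a stand-in, not the workshop's `samples` object) to show this, along with square-bracket indexing for looking at one corner of a wide table.

```r
# Toy data frame standing in for `samples` (values are made up)
toy <- data.frame(id = paste0("S", 1:10), score = 1:10)

first3 <- head(toy, n = 3)   # first 3 rows instead of the default 6
last2  <- tail(toy, n = 2)   # last 2 rows
corner <- toy[1:3, 1:2]      # rows 1-3 and columns 1-2

print(first3)
print(last2)
print(corner)
```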

str() and summary()

str() compactly displays the internal structure of an object. summary() computes summary statistics for each column.

str(samples)
## 'data.frame':    1528 obs. of  13 variables:
##  $ SUBJID   : chr  "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" ...
##  $ SEX      : chr  "Female" "Female" "Female" "Female" ...
##  $ AGE      : chr  "60-69" "60-69" "60-69" "60-69" ...
##  $ DTHHRDY  : chr  "Slow death" "Slow death" "Slow death" "Slow death" ...
##  $ SAMPID   : chr  "GTEX-1117F-0226-SM-5GZZ7" "GTEX-1117F-0426-SM-5EGHI" "GTEX-1117F-0526-SM-5EGHJ" "GTEX-1117F-0626-SM-5N9CS" ...
##  $ SMTS     : chr  "Adipose Tissue" "Muscle" "Blood Vessel" "Blood Vessel" ...
##  $ SMNABTCH : chr  "BP-43693" "BP-43495" "BP-43495" "BP-43956" ...
##  $ SMNABTCHD: chr  "2013-09-17" "2013-09-12" "2013-09-12" "2013-09-25" ...
##  $ SMGEBTCHT: chr  "TruSeq.v1" "TruSeq.v1" "TruSeq.v1" "TruSeq.v1" ...
##  $ SMAFRZE  : chr  "RNASEQ" "RNASEQ" "RNASEQ" "RNASEQ" ...
##  $ SMCENTER : chr  "B1" "B1" "B1" "B1" ...
##  $ SMRIN    : num  6.8 7.1 8 6.9 6.3 5.9 6.6 6.3 6.5 5.8 ...
##  $ SMATSSCR : int  0 0 0 1 1 1 1 1 2 1 ...
summary(samples)
##     SUBJID              SEX                AGE              DTHHRDY         
##  Length:1528        Length:1528        Length:1528        Length:1528       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     SAMPID              SMTS             SMNABTCH          SMNABTCHD        
##  Length:1528        Length:1528        Length:1528        Length:1528       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   SMGEBTCHT           SMAFRZE            SMCENTER             SMRIN       
##  Length:1528        Length:1528        Length:1528        Min.   : 3.200  
##  Class :character   Class :character   Class :character   1st Qu.: 6.300  
##  Mode  :character   Mode  :character   Mode  :character   Median : 7.000  
##                                                           Mean   : 7.058  
##                                                           3rd Qu.: 7.700  
##                                                           Max.   :10.000  
##     SMATSSCR     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.8534  
##  3rd Qu.:1.0000  
##  Max.   :3.0000
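
In the output above, summary() reports only Length, Class, and Mode for the character columns. One way to get more useful summaries for such columns is to tabulate them or convert them to factors. The sketch below uses a made-up data frame; the values are illustrative only.

```r
# Toy data frame with one character column and one numeric column (made up)
toy <- data.frame(
  SEX   = c("Female", "Male", "Male", "Female", "Male"),
  SMRIN = c(6.8, 7.1, 8.0, 6.9, 6.3)
)

# For a character vector, summary() reports only Length, Class, and Mode
print(summary(toy$SEX))

# table() counts how often each value occurs
sex_counts <- table(toy$SEX)
print(sex_counts)

# After conversion to a factor, summary() reports the count per level
toy$SEX <- factor(toy$SEX)
print(summary(toy))
```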

Count files can be very long and wide, so it is a good idea to view only the first (or last) few rows and columns. Typically, a gene identifier (like an Ensembl ID) is used as the row names. We can use dim() to see how many rows and columns are in the data frame.

counts <- read.csv("./data/countData.HEART.csv", row.names = 1)
dim(counts)
## [1] 63811   306
head(counts)[1:5]
##                 GTEX.12ZZX.0726.SM.5EGKA.1 GTEX.13D11.1526.SM.5J2NA.1
## ENSG00000278704                          0                          0
## ENSG00000277400                          0                          0
## ENSG00000274847                          0                          0
## ENSG00000277428                          0                          0
## ENSG00000276256                          0                          0
## ENSG00000278198                          0                          0
##                 GTEX.ZAJG.0826.SM.5PNVA.1 GTEX.11TT1.1426.SM.5EGIA.1
## ENSG00000278704                         0                          0
## ENSG00000277400                         0                          0
## ENSG00000274847                         0                          0
## ENSG00000277428                         0                          0
## ENSG00000276256                         0                          0
## ENSG00000278198                         0                          0
##                 GTEX.13VXT.1126.SM.5LU3A.1
## ENSG00000278704                          0
## ENSG00000277400                          0
## ENSG00000274847                          0
## ENSG00000277428                          0
## ENSG00000276256                          0
## ENSG00000278198                          0
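
Count files are often stored compressed, like the .csv.gz files listed earlier. read.csv() can read a gzip-compressed file directly, because base R detects the compression when it opens the file. The sketch below writes a tiny made-up count table to a temporary .csv.gz file and reads it back; the gene and sample names are hypothetical.

```r
# Write a tiny, made-up count table to a gzip-compressed temporary file
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
writeLines(c(",sampleA,sampleB,sampleC",
             "gene1,0,5,2",
             "gene2,3,0,1"), con)
close(con)

# read.csv() decompresses on the fly; row.names = 1 uses column 1 as row names
toy_counts <- read.csv(tmp, row.names = 1)

print(dim(toy_counts))        # number of rows (genes) and columns (samples)
print(toy_counts[1:2, 1:2])   # view only one corner of the table

unlink(tmp)  # remove the temporary file
```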

This “countData” was generated using recount3, as described in the file scripts/recount3.Rmd. It comes from a RangedSummarizedExperiment (rse), which contains quantitative information about read counts as well as quality control information and sample descriptions. The “colData” from an rse can also be obtained. This information should match the information in our samples file, but there can be subtle differences in formatting. We will read the colData in a later section.

read.table()

Very large tabular files are often saved as .tsv (tab-separated values) files. These can be imported with read.table() or read_tsv(). You can import a file using the default parameters, or you can specify the delimiter, the header, and the row names yourself. In these .tsv files, the first column (the gene names) does not have a column name, so the header line has one fewer field than the data rows. In that case, read.table() by default treats the first line as the header and imports the first column as the row names. When sep = "\t" and header = TRUE are specified, the first column is instead imported as an ordinary column and given the column name X.

results <- read.table("./data/GTEx_Heart_20-29_vs_50-59.tsv")
head(results)
##               logFC    AveExpr         t    P.Value adj.P.Val         B
## A1BG     0.67408600  1.6404652 2.1740238 0.03283291 0.1536518 -3.617093
## A1BG-AS1 0.23168690 -0.1864802 1.0403316 0.30150123 0.5316030 -4.984225
## A2M      0.02453974  9.8251848 0.1948624 0.84602333 0.9215696 -5.783835
## A2M-AS1  0.38115436  2.4535892 2.4839630 0.01520646 0.1033370 -3.067127
## A2ML1    0.58865741 -1.0412696 1.8263856 0.07173966 0.2328150 -4.065276
## A2MP1    0.31631081 -0.8994146 1.4061454 0.16377753 0.3730822 -4.583435
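
The two behaviors described above can be reproduced on a small made-up file. This sketch assumes that the header line starts with a tab, meaning the first column name is empty. That is an assumption about the file layout rather than something verified against the workshop files, and the values are illustrative only.

```r
# Write a tiny, made-up .tsv whose header has an empty first field
tmp <- tempfile(fileext = ".tsv")
writeLines(c("\tlogFC\tAveExpr",
             "A1BG\t0.67\t1.64",
             "A2M\t0.02\t9.83"), tmp)

# Defaults: the header has one fewer field than the data rows,
# so the first column becomes the row names
by_default <- read.table(tmp)

# Explicit tab delimiter and header: the unnamed first column is kept
# as a regular column and named "X"
explicit <- read.table(tmp, sep = "\t", header = TRUE)

# Adding row.names = 1 turns that first column back into row names
explicit_rn <- read.table(tmp, sep = "\t", header = TRUE, row.names = 1)

print(by_default)
print(explicit)
print(explicit_rn)

unlink(tmp)  # remove the temporary file
```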

Exercise

What commands could you use to read the following files?

  1. The GTEx results comparing the muscles of 20-29 year olds to 70-79 year olds
  2. The csv file describing the muscle samples

read.table("./data/GTEx_Muscle_20-29_vs_70-79.tsv") 
read.csv("./data/colData.MUSCLE.csv")

dim()

You have now seen a variety of options for importing files. You may use many more in your R-based RNA-seq workflow, but these basics will get you started. Let’s now use the functions summary(), length(), dim(), and count() to quickly summarize and compare data frames and answer the following questions.

How many samples do we have? Over 1500!

dim(samples)
## [1] 1528   13
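
dim() has a few close relatives. The sketch below uses a made-up data frame to show nrow(), ncol(), and length(). Note that length() of a data frame returns the number of columns, while length() of a single column returns the number of rows.

```r
# Toy data frame (values are made up)
toy <- data.frame(
  SAMPID = c("a", "b", "c", "d"),
  SMTS   = c("Heart", "Heart", "Muscle", "Lung")
)

n_rows <- nrow(toy)                     # number of rows (samples)
n_cols <- ncol(toy)                     # number of columns (variables)
len_df <- length(toy)                   # for a data frame: number of columns
len_col <- length(toy$SMTS)             # for a column: number of rows
n_tissues <- length(unique(toy$SMTS))   # number of distinct tissues

print(c(rows = n_rows, cols = n_cols, tissues = n_tissues))
```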

count()

How many samples are there per tissue?

dplyr::count(samples, SMTS) 
##               SMTS   n
## 1   Adipose Tissue 134
## 2    Adrenal Gland  20
## 3     Blood Vessel 139
## 4            Brain  82
## 5           Breast  50
## 6            Colon  78
## 7        Esophagus 144
## 8            Heart 106
## 9           Kidney   6
## 10           Liver  28
## 11            Lung  78
## 12          Muscle 104
## 13           Nerve  71
## 14           Ovary  16
## 15        Pancreas  30
## 16       Pituitary  37
## 17        Prostate  24
## 18  Salivary Gland  22
## 19            Skin 160
## 20 Small Intestine  23
## 21          Spleen  16
## 22         Stomach  29
## 23          Testis  37
## 24         Thyroid  69
## 25          Uterus  14
## 26          Vagina  11

How many samples are there per tissue and sex? Can we test the effect of sex on gene expression in all tissues? For many tissues, yes, but not all tissues were sampled from both males and females.

head(dplyr::count(samples, SMTS, SEX))
##             SMTS    SEX  n
## 1 Adipose Tissue Female 40
## 2 Adipose Tissue   Male 94
## 3  Adrenal Gland Female  7
## 4  Adrenal Gland   Male 13
## 5   Blood Vessel Female 48
## 6   Blood Vessel   Male 91

How many samples are there per tissue, sex, age, and Hardy scale? Do you have enough samples to test the effects of sex, age, and Hardy scale in the heart?

head(dplyr::count(samples, SMTS, SEX, AGE, DTHHRDY ))
##             SMTS    SEX   AGE                      DTHHRDY n
## 1 Adipose Tissue Female 20-29              Ventilator Case 3
## 2 Adipose Tissue Female 30-39              Ventilator Case 2
## 3 Adipose Tissue Female 40-49 Fast death of natural causes 1
## 4 Adipose Tissue Female 40-49              Ventilator Case 5
## 5 Adipose Tissue Female 40-49       Violent and fast death 2
## 6 Adipose Tissue Female 50-59 Fast death of natural causes 3
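
To answer the question for a single tissue, one option is to subset the data frame first and then flag the groups that fall below a minimum size. The sketch below uses base R on a made-up data frame. The threshold `min_n` is a hypothetical value chosen for illustration, not a recommendation.

```r
# Toy data frame (values are made up)
toy <- data.frame(
  SMTS = c("Heart", "Heart", "Heart", "Heart", "Muscle", "Heart"),
  SEX  = c("Female", "Female", "Male", "Male", "Male", "Male"),
  AGE  = c("20-29", "20-29", "20-29", "50-59", "50-59", "50-59")
)

# Keep only the heart samples
heart <- subset(toy, SMTS == "Heart")

# Count samples for every sex-by-age combination, including empty ones
groups <- as.data.frame(table(SEX = heart$SEX, AGE = heart$AGE),
                        responseName = "n")
print(groups)

# Flag groups smaller than a hypothetical minimum group size
min_n <- 2
small <- groups[groups$n < min_n, ]
print(small)
```

Unlike dplyr::count(), which by default lists only the combinations that occur in the data, table() also reports combinations with zero samples, which makes gaps in the design easier to spot.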

Now you have successfully imported data using multiple methods. Let's complete an exercise.

Exercise

What series of commands would you use to import data/colData.MUSCLE.csv and count the number of muscle samples per sex and age?

How many female muscle samples are there from age group 30-39?

Hint: use head() or names() after importing a file to verify the variable names.

df <- read.csv("./data/colData.MUSCLE.csv") 
dplyr::count(df, SMTS, SEX, AGE) 
# 3 samples are in the female group age 30-39 

Key functions

| Function | Description |
|---|---|
| read.csv() | A base R function for importing comma-separated tabular data |
| read_csv() | A readr function for importing .csv files as tibbles |
| read.table() | A base R function for importing tabular data with any delimiter |
| read_tsv() | A readr function for importing .tsv files as tibbles |
| head() and tail() | Print the first or last 6 rows of an object |
| dim() | A function that prints the dimensions of an object |
| length() | Calculate the length of an object |
| count() | A dplyr function that counts the number of samples per group |
| str() | A function that prints the internal structure of an object |
| summary() | A function that summarizes each variable |

Last update: June 22, 2022