Import Data
Data can be imported using functions from base R (such as read.csv() and read.table()) or with functions from readr (such as read_csv() and read_tsv()).
There are subtle differences in the default behavior of these functions, including how they treat dashes and spaces in column names and whether headers and row names are read by default. For this workshop, we will use read.csv() and read.table().
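One of these default behaviors can be sketched with a small made-up file (the file name and its contents are invented for illustration). By default, base R runs column names through make.names(), so dashes and spaces become dots; setting check.names = FALSE keeps the names as written.

```r
# Write a tiny CSV whose column names contain a dash and a space.
# The file and its contents are made up for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c("gene,GTEX-1117F,read count",
             "A1BG,10,200",
             "A2M,0,35"),
           path)

# Default: names are passed through make.names(), so "-" and " " become "."
default_names <- names(read.csv(path))
print(default_names)

# check.names = FALSE keeps the names exactly as written in the file
raw_names <- names(read.csv(path, check.names = FALSE))
print(raw_names)
```

This is likely why sample IDs written with dashes can show up with dots when they are used as column names in a count table.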
Files
Today, I will show you how to import the following files:
- data/samples.csv
- data/GTEx_Heart_20-29_vs_50-59.tsv
- data/colData.HEART.csv
- data/countData.HEART.csv.gz
Later, you can practice on your own using the following files:
- data/GTEx_Muscle_20-29_vs_50-59.tsv
- data/colData.MUSCLE.csv
- data/countData.MUSCLE.csv.gz
The samples.csv file in ./data/ contains information about all the samples in the GTEx portal v8. Let’s import this file using read.csv().
read.csv()
samples <- read.csv("./data/samples.csv")
head() and tail()
After importing a file, there are multiple ways to view the data. Use head() and tail() to view the first and last 6 rows of a data frame.
head(samples)
## SUBJID SEX AGE DTHHRDY SAMPID SMTS
## 1 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0226-SM-5GZZ7 Adipose Tissue
## 2 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0426-SM-5EGHI Muscle
## 3 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0526-SM-5EGHJ Blood Vessel
## 4 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0626-SM-5N9CS Blood Vessel
## 5 GTEX-1117F Female 60-69 Slow death GTEX-1117F-0726-SM-5GIEN Heart
## 6 GTEX-1117F Female 60-69 Slow death GTEX-1117F-1326-SM-5EGHH Adipose Tissue
## SMNABTCH SMNABTCHD SMGEBTCHT SMAFRZE SMCENTER SMRIN SMATSSCR
## 1 BP-43693 2013-09-17 TruSeq.v1 RNASEQ B1 6.8 0
## 2 BP-43495 2013-09-12 TruSeq.v1 RNASEQ B1 7.1 0
## 3 BP-43495 2013-09-12 TruSeq.v1 RNASEQ B1 8.0 0
## 4 BP-43956 2013-09-25 TruSeq.v1 RNASEQ B1 6.9 1
## 5 BP-44261 2013-10-03 TruSeq.v1 RNASEQ B1 6.3 1
## 6 BP-43495 2013-09-12 TruSeq.v1 RNASEQ B1 5.9 1
tail(samples)
## SUBJID SEX AGE DTHHRDY SAMPID
## 1523 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-0926-SM-5O9AR
## 1524 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1026-SM-5O9B4
## 1525 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1126-SM-5SIAT
## 1526 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1226-SM-5SIB6
## 1527 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1326-SM-5O98Q
## 1528 GTEX-145ME Female 40-49 Ventilator Case GTEX-145ME-1426-SM-5RQJS
## SMTS SMNABTCH SMNABTCHD SMGEBTCHT SMAFRZE SMCENTER SMRIN
## 1523 Small Intestine BP-47675 2013-12-19 TruSeq.v1 RNASEQ B1 7.4
## 1524 Stomach BP-47675 2013-12-19 TruSeq.v1 RNASEQ B1 7.4
## 1525 Colon BP-47616 2013-12-18 TruSeq.v1 RNASEQ B1 6.9
## 1526 Ovary BP-47616 2013-12-18 TruSeq.v1 RNASEQ B1 7.3
## 1527 Uterus BP-47675 2013-12-19 TruSeq.v1 RNASEQ B1 8.5
## 1528 Vagina BP-48437 2014-01-17 TruSeq.v1 RNASEQ B1 7.2
## SMATSSCR
## 1523 1
## 1524 1
## 1525 1
## 1526 1
## 1527 1
## 1528 1
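Both functions accept an n argument when 6 rows is too many or too few. A minimal sketch, using a made-up data frame in place of an imported file:

```r
# A made-up data frame with 10 rows, standing in for an imported file
toy <- data.frame(id = 1:10, value = letters[1:10])

first_three      <- head(toy, n = 3)   # first 3 rows instead of the default 6
last_two         <- tail(toy, n = 2)   # last 2 rows
all_but_last_two <- head(toy, n = -2)  # a negative n drops rows from the end

print(first_three)
print(last_two)
```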
str() and summary()
str() compactly displays the internal structure of an object. summary() computes summary statistics for each variable.
str(samples)
## 'data.frame': 1528 obs. of 13 variables:
## $ SUBJID : chr "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" "GTEX-1117F" ...
## $ SEX : chr "Female" "Female" "Female" "Female" ...
## $ AGE : chr "60-69" "60-69" "60-69" "60-69" ...
## $ DTHHRDY : chr "Slow death" "Slow death" "Slow death" "Slow death" ...
## $ SAMPID : chr "GTEX-1117F-0226-SM-5GZZ7" "GTEX-1117F-0426-SM-5EGHI" "GTEX-1117F-0526-SM-5EGHJ" "GTEX-1117F-0626-SM-5N9CS" ...
## $ SMTS : chr "Adipose Tissue" "Muscle" "Blood Vessel" "Blood Vessel" ...
## $ SMNABTCH : chr "BP-43693" "BP-43495" "BP-43495" "BP-43956" ...
## $ SMNABTCHD: chr "2013-09-17" "2013-09-12" "2013-09-12" "2013-09-25" ...
## $ SMGEBTCHT: chr "TruSeq.v1" "TruSeq.v1" "TruSeq.v1" "TruSeq.v1" ...
## $ SMAFRZE : chr "RNASEQ" "RNASEQ" "RNASEQ" "RNASEQ" ...
## $ SMCENTER : chr "B1" "B1" "B1" "B1" ...
## $ SMRIN : num 6.8 7.1 8 6.9 6.3 5.9 6.6 6.3 6.5 5.8 ...
## $ SMATSSCR : int 0 0 0 1 1 1 1 1 2 1 ...
summary(samples)
## SUBJID SEX AGE DTHHRDY
## Length:1528 Length:1528 Length:1528 Length:1528
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## SAMPID SMTS SMNABTCH SMNABTCHD
## Length:1528 Length:1528 Length:1528 Length:1528
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## SMGEBTCHT SMAFRZE SMCENTER SMRIN
## Length:1528 Length:1528 Length:1528 Min. : 3.200
## Class :character Class :character Class :character 1st Qu.: 6.300
## Mode :character Mode :character Mode :character Median : 7.000
## Mean : 7.058
## 3rd Qu.: 7.700
## Max. :10.000
## SMATSSCR
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.8534
## 3rd Qu.:1.0000
## Max. :3.0000
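For character columns, summary() reports only Length, Class, and Mode, which says little about the values. One possible way to get per-category counts, sketched on an invented data frame, is to summarize the column as a factor or to use table():

```r
# Toy stand-in for two columns of `samples`; the values are invented
toy <- data.frame(
  SEX   = c("Female", "Male", "Male", "Female", "Male"),
  SMRIN = c(6.8, 7.1, 8.0, 6.9, 6.3)
)

# As a character vector, summary() reports only Length, Class, and Mode
print(summary(toy$SEX))

# As a factor, summary() reports how many rows fall in each level
sex_counts <- summary(factor(toy$SEX))
print(sex_counts)

# table() gives the same counts without converting the column
print(table(toy$SEX))
```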
Count files can be very long and wide, so it is a good idea to view only the first (or last) few rows and columns. Typically, a gene identifier (like an Ensembl ID) will be used as the row names. We can use dim() to see how many rows and columns are in the file.
counts <- read.csv("./data/countData.HEART.csv", row.names = 1)
dim(counts)
## [1] 63811 306
head(counts)[1:5]
## GTEX.12ZZX.0726.SM.5EGKA.1 GTEX.13D11.1526.SM.5J2NA.1
## ENSG00000278704 0 0
## ENSG00000277400 0 0
## ENSG00000274847 0 0
## ENSG00000277428 0 0
## ENSG00000276256 0 0
## ENSG00000278198 0 0
## GTEX.ZAJG.0826.SM.5PNVA.1 GTEX.11TT1.1426.SM.5EGIA.1
## ENSG00000278704 0 0
## ENSG00000277400 0 0
## ENSG00000274847 0 0
## ENSG00000277428 0 0
## ENSG00000276256 0 0
## ENSG00000278198 0 0
## GTEX.13VXT.1126.SM.5LU3A.1
## ENSG00000278704 0
## ENSG00000277400 0
## ENSG00000274847 0
## ENSG00000277428 0
## ENSG00000276256 0
## ENSG00000278198 0
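The same steps can be tried on a small invented count table (the gene IDs and sample names below are placeholders, not real data). The sketch also assumes the file is gzip-compressed, as a .csv.gz file would be; read.csv() can read such a file directly from its path.

```r
# Invented 3 x 4 count table with placeholder gene IDs as row names
toy_counts <- data.frame(
  sampleA = c(0, 12, 5), sampleB = c(3, 0, 7),
  sampleC = c(1, 1, 0),  sampleD = c(9, 4, 2),
  row.names = c("ENSG01", "ENSG02", "ENSG03")
)

# Save it gzip-compressed through an explicit connection
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, "w")
write.csv(toy_counts, con)
close(con)

# read.csv() detects gzip compression when reading from a file path;
# row.names = 1 turns the first column (the gene IDs) into row names
counts_toy <- read.csv(path, row.names = 1)

print(dim(counts_toy))         # number of rows, then number of columns
print(counts_toy[1:2, 1:3])    # top-left corner: first 2 genes, first 3 samples
```

Indexing with [rows, columns] is one way to look at a corner of a wide table without printing every column.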
This “countData” was generated using recount3 as described in the file scripts/recount3.Rmd. It comes from a RangedSummarizedExperiment (rse), which contains quantitative information about read counts as well as quality control information and sample descriptions. The “colData” from an rse can also be obtained. This information should match the information in our samples file, but there can be subtle differences in formatting. We will read the colData in a later section.
read.table()
Very large tabular files are often saved as .tsv files. These can be imported with read.table() or read_tsv(). You can import files using the default parameters, or you can change them, for example by specifying the tab delimiter or the row and column names. Because the first column in the .tsv files does not have a column name, read.table() with its default settings imports the first column as the row names. When sep = "\t", header = TRUE is specified, the first column is instead imported as an ordinary column and given the column name X.
results <- read.table("./data/GTEx_Heart_20-29_vs_50-59.tsv")
head(results)
## logFC AveExpr t P.Value adj.P.Val B
## A1BG 0.67408600 1.6404652 2.1740238 0.03283291 0.1536518 -3.617093
## A1BG-AS1 0.23168690 -0.1864802 1.0403316 0.30150123 0.5316030 -4.984225
## A2M 0.02453974 9.8251848 0.1948624 0.84602333 0.9215696 -5.783835
## A2M-AS1 0.38115436 2.4535892 2.4839630 0.01520646 0.1033370 -3.067127
## A2ML1 0.58865741 -1.0412696 1.8263856 0.07173966 0.2328150 -4.065276
## A2MP1 0.31631081 -0.8994146 1.4061454 0.16377753 0.3730822 -4.583435
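The difference between the two calls can be checked on a small made-up file whose header line starts with a tab, so that the first column has no name. The file contents are invented and only loosely mimic the results table.

```r
# A tiny tab-separated file; the header starts with a tab, so the
# first column (gene names) has no column name. Values are invented.
path <- tempfile(fileext = ".tsv")
writeLines(c("\tlogFC\tAveExpr",
             "A1BG\t0.67\t1.64",
             "A2M\t0.02\t9.83"),
           path)

# Defaults: the header has one fewer field than the data rows,
# so the first column becomes the row names
by_default <- read.table(path)
print(by_default)

# Explicit tab delimiter and header: the unnamed first column is
# kept as a regular column and named "X"
tabbed <- read.table(path, sep = "\t", header = TRUE)
print(tabbed)
```

If the gene names are needed as row names with an explicit delimiter, adding row.names = 1 to the second call is one option.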
Exercise
What commands could you use to read the following files?
1. GTEx results comparing the muscles of 20-29 year olds to 70-79 year olds
2. The csv file containing information describing the muscle samples
read.table("./data/GTEx_Muscle_20-29_vs_70-79.tsv")
read.csv("./data/colData.MUSCLE.csv", row.names = 1)
dim()
You have now seen a variety of options for importing files. You may use many more in your R-based RNA-seq workflow, but these basics will get you started. Let’s now use the functions summary(), length(), dim(), and count() to quickly summarize and compare data frames and answer the following questions.
How many samples do we have? Over 1500!
dim(samples)
## [1] 1528 13
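dim() returns the number of rows followed by the number of columns. The related base R functions nrow(), ncol(), and length() can be compared on a small invented data frame; note that length() of a data frame counts its columns, not its rows.

```r
# Invented data frame with 4 rows and 3 columns
toy <- data.frame(
  id     = 1:4,
  tissue = c("Heart", "Muscle", "Heart", "Liver"),
  rin    = c(6.8, 7.1, 8.0, 6.9)
)

toy_dim <- dim(toy)           # c(rows, columns)
n_rows  <- nrow(toy)          # rows only
n_cols  <- ncol(toy)          # columns only
n_len   <- length(toy)        # for a data frame, the number of columns
n_vals  <- length(toy$tissue) # for a single column, the number of rows

print(toy_dim)
print(c(n_rows, n_cols, n_len, n_vals))
```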
count()
How many samples are there per tissue?
dplyr::count(samples, SMTS)
## SMTS n
## 1 Adipose Tissue 134
## 2 Adrenal Gland 20
## 3 Blood Vessel 139
## 4 Brain 82
## 5 Breast 50
## 6 Colon 78
## 7 Esophagus 144
## 8 Heart 106
## 9 Kidney 6
## 10 Liver 28
## 11 Lung 78
## 12 Muscle 104
## 13 Nerve 71
## 14 Ovary 16
## 15 Pancreas 30
## 16 Pituitary 37
## 17 Prostate 24
## 18 Salivary Gland 22
## 19 Skin 160
## 20 Small Intestine 23
## 21 Spleen 16
## 22 Stomach 29
## 23 Testis 37
## 24 Thyroid 69
## 25 Uterus 14
## 26 Vagina 11
How many samples are there per tissue and sex? Can we test the effect of sex on gene expression in all tissues? For many tissues, yes, but not all tissues were sampled from both males and females.
head(dplyr::count(samples, SMTS, SEX))
## SMTS SEX n
## 1 Adipose Tissue Female 40
## 2 Adipose Tissue Male 94
## 3 Adrenal Gland Female 7
## 4 Adrenal Gland Male 13
## 5 Blood Vessel Female 48
## 6 Blood Vessel Male 91
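One possible way to spot tissues that were sampled from only one sex is a base R cross-tabulation with table(), which keeps zero counts that count() omits. The sketch below uses an invented data frame with the same column names as samples.

```r
# Invented stand-in for the SMTS and SEX columns of `samples`
toy <- data.frame(
  SMTS = c("Heart", "Heart", "Ovary", "Ovary", "Testis", "Heart"),
  SEX  = c("Female", "Male", "Female", "Female", "Male", "Male")
)

# Cross-tabulate tissue (rows) by sex (columns); missing
# combinations appear as 0 rather than being left out
tab <- table(toy$SMTS, toy$SEX)
print(tab)

# Tissues with a zero in either column were sampled from one sex only
one_sex_only <- rownames(tab)[rowSums(tab == 0) > 0]
print(one_sex_only)
```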
How many samples are there per tissue, sex, age, and Hardy scale? Do you have enough samples to test the effects of sex, age, and Hardy scale in the heart?
head(dplyr::count(samples, SMTS, SEX, AGE, DTHHRDY))
## SMTS SEX AGE DTHHRDY n
## 1 Adipose Tissue Female 20-29 Ventilator Case 3
## 2 Adipose Tissue Female 30-39 Ventilator Case 2
## 3 Adipose Tissue Female 40-49 Fast death of natural causes 1
## 4 Adipose Tissue Female 40-49 Ventilator Case 5
## 5 Adipose Tissue Female 40-49 Violent and fast death 2
## 6 Adipose Tissue Female 50-59 Fast death of natural causes 3
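Because head() shows only the first six groups, the heart rows are not visible in the output above. One way to focus on a single tissue, sketched here on invented data, is to subset the data frame before counting. The column names match samples, but the rows are made up.

```r
# Invented stand-in for a few columns of `samples`
toy <- data.frame(
  SMTS = c("Heart", "Heart", "Heart", "Muscle", "Heart"),
  SEX  = c("Female", "Male", "Male", "Male", "Female"),
  AGE  = c("20-29", "20-29", "50-59", "20-29", "20-29")
)

# Keep only the heart samples, then count per sex and age group
heart <- subset(toy, SMTS == "Heart")
heart_counts <- dplyr::count(heart, SEX, AGE)
print(heart_counts)
```

Groups with very small n are a sign that there may be too few samples to test that combination of variables.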
Now you have successfully imported data using multiple methods. Let's complete an exercise.
Exercise
What series of commands would you use to import data/colData.MUSCLE.csv and count the number of muscle samples per sex and age?
How many female muscle samples are there from age group 30-39?
Hint: use head() or names() after importing a file to verify the variable names.
df <- read.csv("./data/colData.MUSCLE.csv")
dplyr::count(df, SMTS, SEX, AGE)
# 3 samples are in the female group age 30-39
Key functions
Function | Description |
---|---|
read.csv() | A base R function for importing comma-separated tabular data |
read_csv() | A readr function for importing .csv files as tibbles |
read.table() | A base R function for importing tabular data with any delimiter |
read_tsv() | A readr function for importing .tsv files as tibbles |
head() and tail() | Print the first or last 6 rows of an object |
dim() | A function that prints the dimensions of an object |
length() | Calculate the length of an object |
count() | A dplyr function that counts the number of samples per group |
str() | A function that prints the internal structure of an object |
summary() | A function that summarizes each variable |