The data needed for this exercise is taken from the RNA-seq exercise. In particular it comes from this publication. We will work with the output from the differential expression analysis of KO vs WT. The data files should be in a data/
subdirectory in your R working directory. You can download the data/
directory with files here. You can use the files results_DE.txt
and tableCounts
that you generated in the RNA-seq exercise, or use the ones availabel in the data.zip
file. They should be the same :-)
The idea is that you should go through all the steps in this tutorial and keep track of your code and plots using Rmarkdown. That way, you can include comments and answers to the questions and document your progress.
Make a new directory for this exercise and add the data/
directory to it. Start R and use setwd()
to change the working directory to the one you created.
We will use the following R packages in this exercise:
library(knitr)
library(topGO)
library(biomaRt)
library(piano)
library(NMF)
library(org.Mm.eg.db)
library(Rgraphviz)
library(edgeR)
If you are not able to load any of these packages, try to install them, either using biocLite()
(first you need to run source("http://www.bioconductor.org/biocLite.R")
) or install.packages()
. If something fails, try to understand the error message and fix it. If you get stuck, ask for help :)
diffExpRes <- read.delim("data/results_DE.txt", stringsAsFactors=F)
head(diffExpRes[,c(1,3,5,8,9)]) # skip some columns
## ensembl_gene_id mgi_symbol logFC PValue FDR
## 1 ENSMUSG00000034810 Scn7a -2.861294 3.433028e-45 4.822375e-41
## 2 ENSMUSG00000024799 Tm7sf2 -2.167481 7.022796e-44 4.932461e-40
## 3 ENSMUSG00000027200 Sema6d -1.837139 7.826289e-42 3.664530e-38
## 4 ENSMUSG00000022231 Sema5a -3.039879 2.030104e-41 7.129218e-38
## 5 ENSMUSG00000022483 Col2a1 1.880197 3.485428e-41 9.791960e-38
## 6 ENSMUSG00000023832 Acat2 -1.935112 1.067795e-40 2.499886e-37
Question 1: How many genes are significant at a cutoff of FDR<0.001?
Answer:
|
We will start by performing overrepresentation analysis (a.k.a. list enrichment analysis, …) by using Enrichr and DAVID. Both use a scoring method similar to the Hypergeomtric/Fisher’s exact test. To do such a test, we need to have a list of interesting (e.g. differentially expressed) genes and a list of the background (also known as universe). However, both Enrichr and DAVID have their own background lists, so we do not need to specify them explicitly. First, let’s make a list of interesting genes!
write.table
function to print the top significant genes to copy paste into some web browser tools:sel <- diffExpRes$ensembl_gene_id[diffExpRes$FDR<0.001]
write.table(sel, quote=F, row.names=F, col.names=F)
Question 2: What is the alternative to the Fisher Exact P-Value used by DAVID?
Answer: DAVID uses the so-called EASE Score. In the contingency table, 1 is subtracted from the number of genes that are both in the gene-list and in the gene-set. |
Question 3: Were all gene Ensembl IDs recognized? How many were unmapped?
Answer: 706 were recognized. A few (~19) genes were not in the database. |
Question 4: What seems to be the main functions of the top significant genes (i.e. in summary what are the significantly enriched gene-sets)? Does this make sense considering the experimental design?
Answer: Lipid, fatty-acid and sterol metabolism. Membrane, ER. Muscle contraction. Synapse. |
sel
object. Also, DAVID will probably warn you at some point that the genes map to multiple species. Select the correct species and also go to the background tab and select the correct background.Question 5: Were all gene symbols recognized? How many were unmapped?
Answer:
Select OFFICIAL_GENE_SYMBOL. Multiple species were mapped. Select mouse. 720 gene symbols were matched. 4 were not recognized.
|
Question 6: How many genes in our list overlap with the GO-term Lipid biosynthesis? (Hint: check e.g. the Functional Annotation Clustering)
Answer: 42 genes. |
Question 7: Does this list give the same results as using the Ensembl IDs?
Answer: Not 100% exactly the same, since the recognized gene-list is slightly different. But very similar results. |
As an alternative to DAVID we can try Enrichr. This page takes human gene symbols as input, so first we need to translate our mouse Ensembl IDs to that. This can be done in many ways, one way is to use the biomaRt
package (find a user guide here or type vignette("biomaRt")
):
# Use biomaRt to translate the mouse genes into human orthologs:
# Select the Ensembl BioMart database
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
# Select the mouse dataset and update the Mart object:
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_ensembl_gene"), # this is what we want to extract
filters="ensembl_gene_id", # this determines the filter
values=sel, # this are the values to filter for (get only the genes in our list)
mart=ensembl)
head(bm)
## ensembl_gene_id hsapiens_homolog_ensembl_gene
## 1 ENSMUSG00000000093 ENSG00000121068
## 2 ENSMUSG00000000184 ENSG00000118971
## 3 ENSMUSG00000000223 ENSG00000102385
## 4 ENSMUSG00000000253 ENSG00000137198
## 5 ENSMUSG00000000983
## 6 ENSMUSG00000001467 ENSG00000001630
As you can see, the above code gave us a map between mouse and human ensembl gene IDs. But this is not exactly what we wanted, we need human gene symbols.
attributes
argument to the proper setting.Use the function listAttributes(ensembl)
to see all possible options. Hint: if there are too many options to go through you can use grep
to pull out the relevant ones, e.g.:
tmp <- listAttributes(ensembl)
tmp[grep("Human",tmp[,2]),]
If you are working in RStudio you can also use the convenient command View
and then search for human, e.g.:
View(listAttributes(ensembl))
Try it also for the bm
object. For future reference, note that you can also use listDatasets(ensembl)
to see which datasetes you can choose from, try it!
Question 8: How many of the genes in our selected list have gene symbols?
Answer:
|
duplicated
and unique
functions:# A vector to test stuff on:
tmp <- c(1,2,2,3,4,4,4,5)
# Get unique elements:
unique(tmp)
## [1] 1 2 3 4 5
# Which elements are duplicated?
duplicated(tmp)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
# How many duplicates are there:
sum(duplicated(tmp))
## [1] 3
# Which are the duplicated elements:
tmp[duplicated(tmp)]
## [1] 2 4 4
# Which are the unique duplicated elements:
unique(tmp[duplicated(tmp)])
## [1] 2 4
Make sure you understand how these two functions work and what they do, then move ahead with the following steps and questions:
Question 9: Are there duplicated Ensembl IDs or gene symbols in the list for Enrichr (
bm
)? How can this affect the results that we get?
Answer:
This means that some Ensembl IDs map to more than one gene symbol, and that multiple Ensemble IDs can map to the same gene symbol. |
getBM
again, this time also adding information about % identity between mouse and human genes.?boxplot
if you are unsure about this. Try a histogram as well (hist()
)!Question 10: Looking at the boxplot, are there any reasons to be concerned?
Answer:
|
Question 11: Which gene in the list has the lowest % identity?
Answer:
|
Question 12: Are the results similar to those we got from DAVID?
Answer:
Yes, fairly similar. There is muscle contraction, sterol and lipid biosynthesis, and myelination.
|
Question 13: What do the edges (links) in the network view on the Enrichr webpage denote? (Note: The network view is not available for some gene-set collections, try to click on a different tab and gene set collection to find an example of the network view.)
Answer: See http://amp.pharm.mssm.edu/Enrichr/help#basics&q=5. The edges denote that there is a gene overlap between gene-sets. |
Question 14: We have this far used a cutoff of FDR<0.001. Try a few different cutoffs and rerun DAVID and Enrichr. How different are the results?
Answer: They differ a bit depending on the list size. Overall conclusions are similar. |
Question 15: For the Enrichr and DAVID results so far, can we know whether the identified functions are active/inactive or up/down-regulated in KO vs control? If not, how can we go about to get a clue about that?
Answer: The list we have used contains both up- and down-regulated genes, so we have not taken into account direction. One could subset the list to up- and down-regulated genes separately. Or use a method that takes directionality into account. |
Above, we have looked at some different types of gene-set collections, but a lot of focus has been on GO-terms.
Question 16: What are the three main domains/ontologies?
Answer: Biological process, Molecular function, Cellular compartment. |
Question 17: By whom and how are GO-terms maintained/defined/updated?
Answer: The Gene Ontology Consortium. But edits can and are submitted by the research community. |
Question 18: What is the evidence code?
Answer: A description denoting the type of evidence underlying a specific annotation. |
Question 19: How are GO-terms connected to each other?
Answer: GO-terms are connected in a hierarchical structure with increasing specificity. A term can however have more than one parent term. |
We can use e.g. the topGO
package to visualize gene-set analysis results on the GO hierarchy network (topGO manual).
# Format for topgo - remove genes without ensembl id:
d_topgo <- diffExpRes[!is.na(diffExpRes$ensembl_gene_id),]
rownames(d_topgo) <- d_topgo$ensembl_gene_id
# Create the topGOdata object:
GOobj <- new("topGOdata",
description = "Simple session", # just a name, can be changed
ontology = "BP", # set to BP, MF, or CC
allGenes = setNames(diffExpRes$FDR, diffExpRes$ensembl_gene_id), # named numeric vector
geneSel = function(x) x<0.01, # a function to select genes from the allGenes vector
nodeSize = 10, # remove GO-terms with less than this number of genes
mapping = "org.Mm.eg.db", # annotation database
ID = "ensembl", # gene ID used as names in allGenes vector
annot = annFUN.org)
# Run GO-term enrichment analysis:
resultFisher <- runTest(GOobj, algorithm = "classic", statistic = "fisher")
# Plot results using the GO hierarchical DAG:
showSigOfNodes(GOobj, score(resultFisher), firstSigNode=15, useInfo ='all')
If you save the plot as a PDF it will be easier to zoom in and read the node text. (Sometimes the plotting misbehaves in RStudio, try running grid.newpage()
, plot.new()
, and/or par(mfcol=c(1,1))
if plotting looks weird.)
Question 20: What are the circles and squares denoting?
Answer: Rectangles are the significant terms, circles are connected terms. Color reflects significance. |
Question 21: Which is the most significant BP GO-term?
Answer:
Muscle contraction.
|
Answer:
|
Question 22: Which is the most significant CC GO-term?
Answer:
The 4 first terms in the table are the most significant.
|
Piano is an R-package for carrying out gene-set analysis (see more info here).
gscGO <- loadGSC("data/GSC/c5.bp.v5.2.symbols.gmt")
gscGO
## First 50 (out of 4653) gene set names:
## [1] "GO_REGULATION_O..." "GO_LACTATE_TRAN..." "GO_POSITIVE_REG..."
## [4] "GO_DETECTION_OF..." "GO_CARDIAC_CHAM..." "GO_CALCIUM_ION_..."
## [7] "GO_DNA_DEPENDEN..." "GO_TRNA_AMINOAC..." "GO_CIRCADIAN_RH..."
## [10] "GO_PHOSPHATIDYL..." "GO_SPINAL_CORD_..." "GO_REGULATION_O..."
## [13] "GO_PLATELET_DER..." "GO_GLUCURONATE_..." "GO_CELLULAR_RES..."
## [16] "GO_REGULATION_O..." "GO_POSITIVE_REG..." "GO_CEREBRAL_COR..."
## [19] "GO_POSITIVE_REG..." "GO_NEGATIVE_REG..." "GO_POTASSIUM_IO..."
## [22] "GO_L_PHENYLALAN..." "GO_REGULATION_O..." "GO_CARDIAC_MUSC..."
## [25] "GO_NEGATIVE_REG..." "GO_MOVEMENT_IN_..." "GO_REGULATION_O..."
## [28] "GO_APICAL_PROTE..." "GO_REGULATION_O..." "GO_FOREBRAIN_NE..."
## [31] "GO_POSITIVE_REG..." "GO_NEUROMUSCULA..." "GO_MITOTIC_CYTO..."
## [34] "GO_NEGATIVE_REG..." "GO_SMAD_PROTEIN..." "GO_CYTOPLASMIC_..."
## [37] "GO_MEIOTIC_CHRO..." "GO_POSITIVE_REG..." "GO_REGULATION_O..."
## [40] "GO_RNA_DEPENDEN..." "GO_REGULATION_O..." "GO_REGULATION_O..."
## [43] "GO_DENDRITE_DEV..." "GO_REGULATION_O..." "GO_G_PROTEIN_CO..."
## [46] "GO_MEMORY" "GO_NEURON_DEVEL..." "GO_REGULATION_O..."
## [49] "GO_ENDOTHELIAL_..." "GO_POSITIVE_REG..."
##
## First 50 (out of 15578) gene names:
## [1] "PDE1B" "NR4A2" "MAOB" "DRD4" "TACR3" "PARK2"
## [7] "COMT" "HTR1A" "GPR37" "HPRT1" "CHRNB2" "SNCA"
## [13] "SLC6A3" "ABAT" "DRD1" "PNKD" "PARK7" "SLC16A8"
## [19] "SLC16A5" "SLC16A4" "SLC5A12" "EMB" "SLC16A1" "SLC16A7"
## [25] "SLC16A12" "SLC16A13" "SLC16A6" "SLC16A3" "SLC16A11" "POLR2C"
## [31] "POLR2J" "CTDP1" "RDBP" "COBRA1" "RSF1" "POLR2B"
## [37] "POLR2D" "POLR2G" "POLR2F" "POLR2A" "SNW1" "CHD1"
## [43] "POLR2K" "CDK9" "JUN" "POLR2E" "CCNT1" "LEF1"
## [49] "POLR2H" "GTF2F2"
##
## Gene set size summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 17.0 34.0 111.5 92.0 1984.0
##
## First 10 gene sets with additional info:
## Gene set
## [1,] "GO_REGULATION_OF_DOP..."
## [2,] "GO_LACTATE_TRANSPORT"
## [3,] "GO_POSITIVE_REGULATI..."
## [4,] "GO_DETECTION_OF_LIGH..."
## [5,] "GO_CARDIAC_CHAMBER_D..."
## [6,] "GO_CALCIUM_ION_TRANS..."
## [7,] "GO_DNA_DEPENDENT_DNA..."
## [8,] "GO_TRNA_AMINOACYLATI..."
## [9,] "GO_CIRCADIAN_RHYTHM"
## [10,] "GO_PHOSPHATIDYLSERIN..."
## Additional info
## [1,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_REGULATIO..."
## [2,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_LACTATE_T..."
## [3,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_POSITIVE_..."
## [4,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_DETECTION..."
## [5,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CARDIAC_C..."
## [6,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CALCIUM_I..."
## [7,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_DNA_DEPEN..."
## [8,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_TRNA_AMIN..."
## [9,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CIRCADIAN..."
## [10,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_PHOSPHATI..."
The printed info is always good to check. Did the genes get read as genes and the gene-sets as gene-sets? Do the gene-set sizes seem reasonable? This is a way to sanity check that the loading from a file worked as expected. Usually, reading data from a file is a weak point in any workflow and it is good to carefully check that the data in the file got read as intended. These gene-sets are annotated with human gene symbols, which we currently don’t have in our data.
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name"),
filters="ensembl_gene_id",
values=diffExpRes$ensembl_gene_id, # note that now we use all genes here!
mart=ensembl)
bm <- bm[bm[,2]%in%unique(unlist(gscGO)),] # filter for genes in the GO gene-set collection
head(bm)
## ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000093 TBX2
## 2 ENSMUSG00000000184 CCND2
## 3 ENSMUSG00000000223 DRP2
## 4 ENSMUSG00000000253 GMPR
## 6 ENSMUSG00000001467 CYP51A1
## 7 ENSMUSG00000001508 SGCA
nrow(bm)
## [1] 10332
Question 23: Are there any duplicates in
bm
? How many? What does this mean?
Answer:
This means that some Ensembl IDs map to more than one gene symbol, and that multiple Ensemble IDs can map to the same gene symbol.
|
bm
:bm <- bm[!duplicated(bm[,1]),]
nrow(bm)
Question 24: How many rows did we remove?
Answer:
|
Question 25: How many unique mouse Ensembl IDs could we map to human orthologs? How many unique human orhtologs do we map to?
Answer:
|
Note that multiple ENSMUS IDs still can map to the same gene symbol!
# merge diffExpRes and bm:
d_piano <- merge(diffExpRes, bm, by="ensembl_gene_id", all.x=T, sort=F)
# avoid duplicating gene-level statistics due to Ensembl ID mapping to multiple symbols:
d_piano <- d_piano[!duplicated(d_piano[,-ncol(d_piano)]),]
# get geneNames, pvals and logfc:
geneNames <- d_piano$hsapiens_homolog_associated_gene_name
pvals <- d_piano$FDR
names(pvals) <- geneNames
logfc <- d_piano$logFC
names(logfc) <- geneNames
head()
, View()
and length()
) so that you know how they look and what they contain.We now have the gene-level data in a convenient format and we are ready to continue with the analysis.
First, let’s use piano to perform a hypergeometric test, similar to what we did with Enrichr, DAVID, and topGO.
# Get the gene-list:
sel <- names(pvals)[pvals<0.001]
sel <- unique(sel[!is.na(sel)])
# Run the analysis:
res <- runGSAhyper(genes=sel, universe=unique(names(pvals)), gsc=gscGO, gsSizeLim=c(20, 200))
# Visualize results:
networkPlot(res, class="non", adjusted=T, significance=0.0001, ncharLabel=Inf,
nodeSize=c(3,20), edgeWidth=c(1,15), cexLabel=0.7, overlap=20,
scoreColors=c("greenyellow","darkgreen"))
Here, we use the
networkPlot
function which draws a network for the most significant gene-sets.
Question 26: What does
gsSizeLim=c(20, 200)
in the code above mean?
Answer: To only include gene-sets with between 20 and 200 genes in the analysis (after filtering out genes not in the expression data). |
Question 27: What do the colors, node-sizes, and edges denote?
Answer: Node size corresponds to gene-set size, i.e. number of genes. Edges denote the number of shared genes between gene-sets. Note that gene-sets without edges can also share a few genes, depending on the setting overlap. The color corresponds to the significance. |
Question 28: Are the results in line with those from topGO, DAVID, and Enrichr?
Answer: The results show muscle contraction related terms, synapse organization, and lipid/sterol related processes, so the results appear to be fairly similar. |
Next, we will use piano to perform gene-set analysis, i.e. not the enrichment of a gene-list based on a cut-off, but using all available gene-level statistics. We will use signed (to keep track of fold-change) -log10-pvalues to rank the gene-sets, i.e. -log10(pvals)*sign(logfc)
.
gsaRes <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="fgsea", gsc=gscGO, gsSizeLim=c(20, 200))
networkPlot(gsaRes, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
nodeSize=c(3,15), edgeWidth=c(1,8), cexLabel=0.7, overlap=20,
scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))
Question 29: Look at the network plot, could gene overlap between you gene-sets bias the result interpretation in any way?
Answer: There are clusters of gene-sets that share a large portion of genes, i.e. they are in fact quite similar and in extreme cases basically measure the same thing. If these sets are many it is easy to focus the conclusions on the process that those sets represent, in contrast to the processes represented only by single or a few gene-sets. However, this maybe is due to the fact that one process is more fine-grained annotated by e.g. GO. The network plot helps in visualizing this, and one can try to see each cluster as an affected process, regardless of the number of gene-sets it contains. |
The network plot can be nice for visualizing your results, but the function GSAsummaryTable
can be used to get a table of the all the GSA results.
View()
command:View(GSAsummaryTable(gsaRes))
Question 30: What is the top significant GO-term affected by down-regulation and up-regulation, respectively?
Answer: Up: GO_SARCOMERE_ORGANIZATION Down: GO_STEROID_METABOLIC_PROCESS |
data/GSC/c2.cp.v5.2.symbols.gmt
, which contains pathway gene-sets from MSigDB, as a gene-set collection (load it with loadGSC()
).geneSetStat="mean"
) instead of fgsea.For example it could look something similar to this:
gscCP <- loadGSC("data/GSC/c2.cp.v5.2.symbols.gmt")
gsaResCP <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="mean", gsc=gscCP, gsSizeLim=c(10,100))
networkPlot(gsaResCP, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
nodeSize=c(5,25), edgeWidth=c(1,8), cexLabel=0.7, overlap=5,
scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))
Question 31: Are these results in line with previous results that we have seen in this exercise? Do you feel there is a consistent pattern?
Answer: Running the piano gene-set analysis on GO-terms show that muscle related processes are affected by up-regulation and that sterol/steroid metabolism is affected by down-regulation. The pathway analysis shows that muscle contraction is affected by up-regulation and that sterol/cholesterol metabolism is affected by down-regulation. This seems to be quite similar to what we saw previously. |
It is always good to go back to the gene-level while looking at your final GSA results. Here we will use a heatmap for visualizing gene-level data for selected gene-sets (using one of many heatmap functions available for R).
# Get genes for a specific gene-set:
selGenes <- names(geneSetSummary(gsaRes, "GO_STEROID_BIOSYNTHETIC_PROCESS")$geneLevelStats)
# Merge with the diffexp result table:
selTab <- d_piano[d_piano$hsapiens_homolog_associated_gene_name%in%selGenes, c(1,3,5,9,10)]
selTab$log10FDR <- -log10(selTab$FDR)*sign(selTab$logFC)
# Display this table:
kable(selTab)
ensembl_gene_id | mgi_symbol | logFC | FDR | hsapiens_homolog_associated_gene_name | log10FDR | |
---|---|---|---|---|---|---|
2 | ENSMUSG00000024799 | Tm7sf2 | -2.1674811 | 0.0000000 | TM7SF2 | -39.3069364 |
11 | ENSMUSG00000031604 | Msmo1 | -1.9232681 | 0.0000000 | MSMO1 | -30.1658228 |
24 | ENSMUSG00000021670 | Hmgcr | -1.9047857 | 0.0000000 | HMGCR | -22.9287137 |
25 | ENSMUSG00000021273 | Fdft1 | -1.4725972 | 0.0000000 | FDFT1 | -22.7646930 |
30 | ENSMUSG00000093930 | Hmgcs1 | -2.1353849 | 0.0000000 | HMGCS1 | -21.3508866 |
31 | ENSMUSG00000001467 | Cyp51 | -1.6536521 | 0.0000000 | CYP51A1 | -20.9697120 |
34 | ENSMUSG00000045294 | Insig1 | -1.4729905 | 0.0000000 | INSIG1 | -20.1736652 |
53 | ENSMUSG00000020917 | Acly | -1.0106441 | 0.0000000 | ACLY | -14.8880627 |
54 | ENSMUSG00000031349 | Nsdhl | -1.3312848 | 0.0000000 | NSDHL | -14.8880627 |
58 | ENSMUSG00000022351 | Sqle | -1.2507425 | 0.0000000 | SQLE | -14.1926825 |
76 | ENSMUSG00000041939 | Mvk | -1.0538597 | 0.0000000 | MVK | -12.0939358 |
108 | ENSMUSG00000027952 | Pmvk | -1.0461615 | 0.0000000 | PMVK | -10.2176927 |
121 | ENSMUSG00000023963 | Cyp39a1 | -1.0528371 | 0.0000000 | CYP39A1 | -9.8008369 |
126 | ENSMUSG00000026675 | Hsd17b7 | -1.4294868 | 0.0000000 | HSD17B7 | -9.6091437 |
149 | ENSMUSG00000058454 | Dhcr7 | -0.8589378 | 0.0000000 | DHCR7 | -8.7776193 |
179 | ENSMUSG00000058258 | Idi1 | -1.8503612 | 0.0000000 | IDI1 | -7.7828797 |
276 | ENSMUSG00000033105 | Lss | -2.3822023 | 0.0000019 | LSS | -5.7273752 |
364 | ENSMUSG00000059743 | Fdps | -2.0813679 | 0.0000231 | FDPS | -4.6355084 |
371 | ENSMUSG00000034926 | Dhcr24 | -1.7835910 | 0.0000266 | DHCR24 | -4.5744417 |
556 | ENSMUSG00000046027 | Stard5 | 0.5669507 | 0.0006520 | STARD5 | 3.1857624 |
632 | ENSMUSG00000027195 | Hsd17b12 | -0.4651350 | 0.0014378 | HSD17B12 | -2.8423134 |
641 | ENSMUSG00000036856 | Wnt4 | 1.8707597 | 0.0015159 | WNT4 | 2.8193167 |
1029 | ENSMUSG00000031168 | Ebp | -0.5237848 | 0.0205263 | EBP | -1.6876888 |
1058 | ENSMUSG00000028518 | Prkaa2 | 1.0584440 | 0.0227853 | PRKAA2 | 1.6423454 |
1097 | ENSMUSG00000027875 | Hmgcs2 | 0.4015057 | 0.0262726 | HMGCS2 | 1.5804976 |
1315 | ENSMUSG00000037936 | Scarb1 | -0.3771259 | 0.0504584 | SCARB1 | -1.2970664 |
1359 | ENSMUSG00000028944 | Prkag2 | -0.5884059 | 0.0555413 | PRKAG2 | -1.2553837 |
1412 | ENSMUSG00000024112 | Cacna1h | 0.6243337 | 0.0635828 | CACNA1H | 1.1966606 |
1474 | ENSMUSG00000036880 | Acaa2 | 0.3221681 | 0.0741393 | ACAA2 | 1.1299516 |
1647 | ENSMUSG00000026456 | Cyb5r1 | 0.4542855 | 0.1048128 | CYB5R1 | 0.9795859 |
1930 | ENSMUSG00000031891 | Hsd11b2 | 1.0349272 | 0.1505277 | HSD11B2 | 0.8223835 |
2014 | ENSMUSG00000018042 | Cyb5r3 | -0.2810385 | 0.1700817 | CYB5R3 | -0.7693423 |
2101 | ENSMUSG00000018167 | Stard3 | -0.2933683 | 0.1865191 | STARD3 | -0.7292768 |
2530 | ENSMUSG00000061758 | Akr1b10 | 0.4704163 | 0.2585394 | AKR1B15 | 0.5874733 |
2732 | ENSMUSG00000029311 | Hsd17b11 | 0.2579175 | 0.2965381 | HSD17B11 | 0.5279195 |
2933 | ENSMUSG00000050697 | Prkaa1 | 0.7667416 | 0.3340304 | PRKAA1 | 0.4762140 |
3270 | ENSMUSG00000004880 | Lbr | -0.2605653 | 0.3946105 | LBR | -0.4038313 |
3585 | ENSMUSG00000018861 | Fdxr | -0.2721958 | 0.4506118 | FDXR | -0.3461974 |
3626 | ENSMUSG00000032051 | Fdx1 | 0.3157853 | 0.4590777 | FDX1 | 0.3381138 |
4027 | ENSMUSG00000032323 | Cyp11a1 | 0.6364368 | 0.5269210 | CYP11A1 | 0.2782545 |
4080 | ENSMUSG00000021214 | Akr1c18 | 0.5195843 | 0.5371843 | AKR1C4 | 0.2698766 |
4539 | ENSMUSG00000018160 | Med1 | 0.1443773 | 0.5986866 | MED1 | 0.2228004 |
5060 | ENSMUSG00000033715 | Akr1c14 | -0.3009562 | 0.6533384 | AKR1C4 | -0.1848618 |
5117 | ENSMUSG00000034308 | Sdr42e1 | 0.3779597 | 0.6597789 | SDR42E1 | 0.1806016 |
5237 | ENSMUSG00000021259 | Cyp46a1 | -0.4382325 | 0.6730181 | CYP46A1 | -0.1719732 |
5695 | ENSMUSG00000021302 | Ggps1 | -0.1886793 | 0.7181029 | GGPS1 | -0.1438133 |
5752 | ENSMUSG00000030825 | Hsd17b14 | -0.4391568 | 0.7242179 | HSD17B14 | -0.1401307 |
6796 | ENSMUSG00000041736 | Tspo | 0.0871980 | 0.8089995 | TSPO | 0.0920518 |
6798 | ENSMUSG00000030057 | Cnbp | 0.0843414 | 0.8090640 | CNBP | 0.0920171 |
6943 | ENSMUSG00000017307 | Acot8 | 0.1457815 | 0.8210293 | ACOT8 | 0.0856413 |
7041 | ENSMUSG00000026499 | Acbd3 | -0.2124423 | 0.8294557 | ACBD3 | -0.0812068 |
7150 | ENSMUSG00000028470 | Hint2 | 0.1065509 | 0.8376562 | HINT2 | 0.0769342 |
7442 | ENSMUSG00000003721 | Insig2 | -0.0982338 | 0.8560106 | INSIG2 | -0.0675208 |
7562 | ENSMUSG00000022244 | Amacr | 0.1082852 | 0.8643204 | AMACR | 0.0633252 |
7761 | ENSMUSG00000028603 | Scp2 | -0.2133358 | 0.8753502 | SCP2 | -0.0578182 |
7834 | ENSMUSG00000030670 | Cyp2r1 | -0.2499409 | 0.8829956 | CYP2R1 | -0.0540414 |
7934 | ENSMUSG00000039519 | Cyp7b1 | -0.0509951 | 0.8898487 | CYP7B1 | -0.0506838 |
8170 | ENSMUSG00000026170 | Cyp27a1 | 0.1069738 | 0.9068471 | CYP27A1 | 0.0424659 |
8482 | ENSMUSG00000024507 | Hsd17b4 | -0.0399601 | 0.9246019 | HSD17B4 | -0.0340452 |
8856 | ENSMUSG00000029233 | Srd5a3 | 0.0349583 | 0.9415716 | SRD5A3 | 0.0261466 |
8993 | ENSMUSG00000031982 | Arv1 | -0.0708507 | 0.9469453 | ARV1 | -0.0236751 |
9255 | ENSMUSG00000052534 | Pbx1 | 0.0231469 | 0.9578520 | PBX1 | 0.0187016 |
9984 | ENSMUSG00000031574 | Star | 0.0146487 | 0.9846159 | STAR | 0.0067332 |
Question 32: Does the table look like you expect it to (did the human to mouse gene symbol matching work, are the signed log10-FDRs consistent with the logFC and FDR columns)?
Answer: Yes, a visual inspection shows that the mapping is correct. |
# Read in the count data:
tableCounts <- read.table(file="data/tableCounts", sep="\t", header=TRUE, skip=1)
rownames(tableCounts) <- tableCounts[,1]
tableCounts <- tableCounts[,7:12]
colnames(tableCounts) <- c("KO_1","KO_2", "KO_3", "WT_1", "WT_2", "WT_3");
# Merge count data with the selTab:
selTab <- merge(selTab, tableCounts, by.x="ensembl_gene_id", by.y=0, all.x=T)
# Plot a heatmap:
# Row annotation:
rowann <- list(Significant=ifelse(selTab$FDR<0.05,"yes","no"), Regulation=ifelse(selTab$logFC>0,"Up","Down"))
rowannCol <- list(Significant=c("black","green"), Regulation=c("blue","red"))
# Row labels:
rowlabs <- selTab$mgi_symbol
rowlabs[duplicated(rowlabs)] <- paste(rowlabs[duplicated(rowlabs)]," ")
# Plot:
aheatmap(selTab[,7:12], scale="row", annRow=rowann, annColors=rowannCol,
labRow=rowlabs,
txt=selTab[,7:12])
Note that here we scale the rows so the colors will be visualizing the difference for each gene specifically, but is not comparable across genes. Therefore we also include the actual counts in each heatmap cell. Also, note that these are raw counts. As an alternative we could have plotted normalized CPMs (counts per million) using the command:
cpms <- cpm(calcNormFactors(DGEList(tableCounts), method='TMM'))
Try this if you have time!
Question 33: Why are the rows and columns reordered?
Answer: A clustering is performed on both the columns and rows in order to group genes and samples with similar patterns. |
Question 34: Does it look “better”, what is the drawback/benefit?
Answer:
|
Answer:
|
Question 35: Given the count data for these genes, does the differential expression results make sense, as indicated by the column annotation columns? Any specific genes that you are concerned about?
Answer: These are the genes belonging to GO_STEROID_BIOSYNTHETIC_PROCESS, which was shown to be affected by down-regulation in the gene-set analysis. Here we can see that a high portion of the genes are signifcant, and that the vast majority of significant genes are in fact down-regulated. Looking at the count data (the actual heatmap) one can also see a pattern of blue to the left (KO) and red to the right (WT), indicating a lower expression of these genes in the KO:s. |
Question 36: Would you say from looking at the heatmap, that in general steroid biosynthesis is down-regulated or up-regulated? Does your answer reflect the GSA results as seen in the network plot above?
Answer: Down-regulated. It does reflect the network plot results. |
Question 37: Are these genes actually involved in steroid biosynthesis? Are you confident enough from the data and analysis results to say anything about how YAP/TAZ knockout affects this biological process?
Answer: I spot checked ACAA2 (involved in fatty acid beta-oxidation), SCARB1 (receptor for high density lipoprotein cholesterol, HDL), HMGCR (rate-limiting enzyme for cholesterol synthesis). One could check more, but at least from these it seems correct. The paper from where this data is taken also reports that YAP/TAZ regulate expression of lipid synthetic enzymes and found down-regulation of lipid and sterol biosynthesis in the KO:s (see e.g. Figure 6). |
This page was generated using the following R session:
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel methods stats graphics grDevices
## [8] utils datasets base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 edgeR_3.20.1 limma_3.34.2
## [4] Rgraphviz_2.22.0 org.Mm.eg.db_3.5.0 NMF_0.20.6
## [7] cluster_2.0.6 rngtools_1.2.4 pkgmaker_0.22
## [10] registry_0.3 piano_1.18.0 biomaRt_2.34.0
## [13] topGO_2.30.0 SparseM_1.77 GO.db_3.5.0
## [16] AnnotationDbi_1.40.0 IRanges_2.12.0 S4Vectors_0.16.0
## [19] Biobase_2.38.0 graph_1.56.0 BiocGenerics_0.24.0
## [22] knitr_1.17 BiocInstaller_1.28.0
##
## loaded via a namespace (and not attached):
## [1] foreach_1.4.3 bit64_0.9-7 gtools_3.5.0
## [4] assertthat_0.2.0 highr_0.6 blob_1.1.0
## [7] yaml_2.1.14 progress_1.1.2 slam_0.1-40
## [10] RSQLite_2.0 backports_1.1.1 lattice_0.20-35
## [13] digest_0.6.12 colorspace_1.3-2 htmltools_0.3.6
## [16] plyr_1.8.4 XML_3.98-1.9 pkgconfig_2.0.1
## [19] xtable_1.8-2 relations_0.6-7 scales_0.5.0
## [22] gdata_2.18.0 BiocParallel_1.12.0 tibble_1.3.4
## [25] ggplot2_2.2.1 lazyeval_0.2.1 magrittr_1.5
## [28] memoise_1.1.0 evaluate_0.10.1 doParallel_1.0.11
## [31] gplots_3.0.1 tools_3.4.2 data.table_1.10.4-3
## [34] prettyunits_1.0.2 gridBase_0.4-7 matrixStats_0.52.2
## [37] stringr_1.2.0 locfit_1.5-9.1 munsell_0.4.3
## [40] compiler_3.4.2 caTools_1.17.1 rlang_0.1.2
## [43] RCurl_1.95-4.8 iterators_1.0.8 marray_1.56.0
## [46] igraph_1.1.2 bitops_1.0-6 rmarkdown_1.6
## [49] gtable_0.2.0 codetools_0.2-15 DBI_0.7
## [52] sets_1.0-17 reshape2_1.4.2 R6_2.2.2
## [55] gridExtra_2.3 bit_1.1-12 fastmatch_1.1-0
## [58] fgsea_1.4.0 rprojroot_1.2 KernSmooth_2.23-15
## [61] stringi_1.1.5 Rcpp_0.12.13