Setup and load data

The data needed for this exercise is taken from the RNA-seq exercise. In particular it comes from this publication. We will work with the output from the differential expression analysis of KO vs WT. The data files should be in a data/ subdirectory in your R working directory. You can download the data/ directory with files here. You can use the files results_DE.txt and tableCounts that you generated in the RNA-seq exercise, or use the ones availabel in the data.zip file. They should be the same :-)

The idea is that you should go through all the steps in this tutorial and keep track of your code and plots using Rmarkdown. That way, you can include comments and answers to the questions and document your progress.

  1. Make a new directory for this exercise and add the data/ directory to it. Start R and use setwd() to change the working directory to the one you created.

  2. We will use the following R packages in this exercise:

library(knitr)
library(topGO)
library(biomaRt)
library(piano)
library(NMF)
library(org.Mm.eg.db)
library(Rgraphviz)
library(edgeR)

If you are not able to load any of these packages, try to install them, either using biocLite() (first you need to run source("http://www.bioconductor.org/biocLite.R")) or install.packages(). If something fails, try to understand the error message and fix it. If you get stuck, ask for help :)

  1. Start by reading in the differential expression results from the previous exercise:
diffExpRes <- read.delim("data/results_DE.txt", stringsAsFactors=F)
head(diffExpRes[,c(1,3,5,8,9)]) # skip some columns
##      ensembl_gene_id mgi_symbol     logFC       PValue          FDR
## 1 ENSMUSG00000034810      Scn7a -2.861294 3.433028e-45 4.822375e-41
## 2 ENSMUSG00000024799     Tm7sf2 -2.167481 7.022796e-44 4.932461e-40
## 3 ENSMUSG00000027200     Sema6d -1.837139 7.826289e-42 3.664530e-38
## 4 ENSMUSG00000022231     Sema5a -3.039879 2.030104e-41 7.129218e-38
## 5 ENSMUSG00000022483     Col2a1  1.880197 3.485428e-41 9.791960e-38
## 6 ENSMUSG00000023832      Acat2 -1.935112 1.067795e-40 2.499886e-37

Question 1: How many genes are significant at a cutoff of FDR<0.001?

Answer:

sum(diffExpRes$FDR<0.001)
## [1] 733

Web-based enrichment analysis

We will start by performing overrepresentation analysis (a.k.a. list enrichment analysis, …) by using Enrichr and DAVID. Both use a scoring method similar to the Hypergeomtric/Fisher’s exact test. To do such a test, we need to have a list of interesting (e.g. differentially expressed) genes and a list of the background (also known as universe). However, both Enrichr and DAVID have their own background lists, so we do not need to specify them explicitly. First, let’s make a list of interesting genes!

  1. You can use the write.table function to print the top significant genes to copy paste into some web browser tools:
sel <- diffExpRes$ensembl_gene_id[diffExpRes$FDR<0.001]
write.table(sel, quote=F, row.names=F, col.names=F)

DAVID

  1. Select and copy the gene list given by the above command and head over to DAVID.
  2. Locate Functional Annotation in the left menu and click the more link and briefly scroll through that page to get an overview of what the Functional Annotation Tool at DAVID does.

Question 2: What is the alternative to the Fisher Exact P-Value used by DAVID?

Answer:

DAVID uses the so-called EASE Score. In the contingency table, 1 is subtracted from the number of genes that are both in the gene-list and in the gene-set.

  1. Now go to the Functional Annotation page and make sure the Upload tab is visible.
  2. Paste the copied gene-list, select the correct identifier, and select whether this is a gene list or background (discuss with other students if you are not sure, or ask the instructors). Submit list.

Question 3: Were all gene Ensembl IDs recognized? How many were unmapped?

Answer: 706 were recognized. A few (~19) genes were not in the database.

  1. Explore the results. For instance, click on Functional Annotation Clustering at the bottom of the page. This shows related gene-sets clustered together in larger groups for a nicer overview.

Question 4: What seems to be the main functions of the top significant genes (i.e. in summary what are the significantly enriched gene-sets)? Does this make sense considering the experimental design?

Answer: Lipid, fatty-acid and sterol metabolism. Membrane, ER. Muscle contraction. Synapse.

  1. Now, rerun DAVID but use gene symbols as input instead. Hint: you need to modify the command above creating the sel object. Also, DAVID will probably warn you at some point that the genes map to multiple species. Select the correct species and also go to the background tab and select the correct background.

Question 5: Were all gene symbols recognized? How many were unmapped?

Answer:

sel2 <- diffExpRes$mgi_symbol[diffExpRes$FDR<0.001]
write.table(sel2, quote=F, row.names=F, col.names=F)
Select OFFICIAL_GENE_SYMBOL. Multiple species were mapped. Select mouse. 720 gene symbols were matched. 4 were not recognized.

Question 6: How many genes in our list overlap with the GO-term Lipid biosynthesis? (Hint: check e.g. the Functional Annotation Clustering)

Answer: 42 genes.

Question 7: Does this list give the same results as using the Ensembl IDs?

Answer: Not 100% exactly the same, since the recognized gene-list is slightly different. But very similar results.

Enrichr

As an alternative to DAVID we can try Enrichr. This page takes human gene symbols as input, so first we need to translate our mouse Ensembl IDs to that. This can be done in many ways, one way is to use the biomaRt package (find a user guide here or type vignette("biomaRt")):

# Use biomaRt to translate the mouse genes into human orthologs:
# Select the Ensembl BioMart database
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
# Select the mouse dataset and update the Mart object: 
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_ensembl_gene"), # this is what we want to extract 
            filters="ensembl_gene_id", # this determines the filter
            values=sel, # this are the values to filter for (get only the genes in our list)
            mart=ensembl)
head(bm)
##      ensembl_gene_id hsapiens_homolog_ensembl_gene
## 1 ENSMUSG00000000093               ENSG00000121068
## 2 ENSMUSG00000000184               ENSG00000118971
## 3 ENSMUSG00000000223               ENSG00000102385
## 4 ENSMUSG00000000253               ENSG00000137198
## 5 ENSMUSG00000000983                              
## 6 ENSMUSG00000001467               ENSG00000001630

As you can see, the above code gave us a map between mouse and human ensembl gene IDs. But this is not exactly what we wanted, we need human gene symbols.

  1. Change the attributes argument to the proper setting.

Use the function listAttributes(ensembl) to see all possible options. Hint: if there are too many options to go through you can use grep to pull out the relevant ones, e.g.:

tmp <- listAttributes(ensembl)
tmp[grep("Human",tmp[,2]),]

If you are working in RStudio you can also use the convenient command View and then search for human, e.g.:

View(listAttributes(ensembl))

Try it also for the bm object. For future reference, note that you can also use listDatasets(ensembl) to see which datasetes you can choose from, try it!

Question 8: How many of the genes in our selected list have gene symbols?

Answer:

# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name"), # this is what we want to extract 
            filters="ensembl_gene_id", # this determines the filter
            values=sel, # this are the values to filter for (get only the genes in our list)
            mart=ensembl)
head(bm)
##      ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000093                                  TBX2
## 2 ENSMUSG00000000184                                 CCND2
## 3 ENSMUSG00000000223                                  DRP2
## 4 ENSMUSG00000000253                                  GMPR
## 5 ENSMUSG00000000983                                      
## 6 ENSMUSG00000001467                               CYP51A1
# subset to the rows were the gene-symbol is >1 characters:
tmp <- bm[nchar(bm[,2])>1,]
# How many unique Ensembl IDs?
length(unique(tmp$ensembl_gene_id))
## [1] 698

  1. Get familiar with the duplicated and unique functions:
# A vector to test stuff on:
tmp <- c(1,2,2,3,4,4,4,5)
# Get unique elements:
unique(tmp)
## [1] 1 2 3 4 5
# Which elements are duplicated?
duplicated(tmp)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# How many duplicates are there:
sum(duplicated(tmp))
## [1] 3
# Which are the duplicated elements:
tmp[duplicated(tmp)]
## [1] 2 4 4
# Which are the unique duplicated elements:
unique(tmp[duplicated(tmp)])
## [1] 2 4

Make sure you understand how these two functions work and what they do, then move ahead with the following steps and questions:

Question 9: Are there duplicated Ensembl IDs or gene symbols in the list for Enrichr (bm)? How can this affect the results that we get?

Answer:

# Duplicated Ensembl IDs:
sum(duplicated(bm$ensembl_gene_id))
## [1] 17
# Duplicated gene symbols:
sum(duplicated(bm$hsapiens_homolog_associated_gene_name))
## [1] 41

This means that some Ensembl IDs map to more than one gene symbol, and that multiple Ensemble IDs can map to the same gene symbol.


  1. Now, run getBM again, this time also adding information about % identity between mouse and human genes.
  2. Make a boxplot of the % identity for our gene list. Hint: try ?boxplot if you are unsure about this. Try a histogram as well (hist())!

Question 10: Looking at the boxplot, are there any reasons to be concerned?

Answer:

# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name",
                         "hsapiens_homolog_perc_id"), # this is what we want to extract 
            filters="ensembl_gene_id", # this determines the filter
            values=sel, # this are the values to filter for (get only the genes in our list)
            mart=ensembl)
head(bm)
##      ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000093                                  TBX2
## 2 ENSMUSG00000000184                                 CCND2
## 3 ENSMUSG00000000223                                  DRP2
## 4 ENSMUSG00000000253                                  GMPR
## 5 ENSMUSG00000000983                                      
## 6 ENSMUSG00000001467                               CYP51A1
##   hsapiens_homolog_perc_id
## 1                  95.0774
## 2                  92.0415
## 3                  95.0888
## 4                  95.3623
## 5                       NA
## 6                  91.2525
boxplot(bm$hsapiens_homolog_perc_id)

hist(bm$hsapiens_homolog_perc_id, breaks=50)
There are some genes with very low % identity, one could consider to remove these. The majority of genes have high similarity however, which is good.

Question 11: Which gene in the list has the lowest % identity?

Answer:

idx <- order(bm$hsapiens_homolog_perc_id)
bm[idx[1],]
##        ensembl_gene_id hsapiens_homolog_associated_gene_name
## 708 ENSMUSG00000071317                                POPDC3
##     hsapiens_homolog_perc_id
## 708                        0

  1. Paste your list of human gene symbols on the Enrichr website. Look at Ontologies > GO Biological Process.

Question 12: Are the results similar to those we got from DAVID?

Answer:

write.table(bm$hsapiens_homolog_associated_gene_name, col.names=F, row.names=F, quote=F)
Yes, fairly similar. There is muscle contraction, sterol and lipid biosynthesis, and myelination.

  1. Look around on the Enrichr page and get familiar with the different results. Look also on the different views, e.g. Table is good to see the actual p-values of the results.

Question 13: What do the edges (links) in the network view on the Enrichr webpage denote? (Note: The network view is not available for some gene-set collections, try to click on a different tab and gene set collection to find an example of the network view.)

Answer: See http://amp.pharm.mssm.edu/Enrichr/help#basics&q=5. The edges denote that there is a gene overlap between gene-sets.

Question 14: We have this far used a cutoff of FDR<0.001. Try a few different cutoffs and rerun DAVID and Enrichr. How different are the results?

Answer: They differ a bit depending on the list size. Overall conclusions are similar.

Question 15: For the Enrichr and DAVID results so far, can we know whether the identified functions are active/inactive or up/down-regulated in KO vs control? If not, how can we go about to get a clue about that?

Answer: The list we have used contains both up- and down-regulated genes, so we have not taken into account direction. One could subset the list to up- and down-regulated genes separately. Or use a method that takes directionality into account.

Enrichment analysis in R

Visualizing GO-terms with TopGO

Above, we have looked at some different types of gene-set collections, but a lot of focus has been on GO-terms.

  1. Use the internet to learn about what GO-terms are (e.g. Wikipedia or here).

Question 16: What are the three main domains/ontologies?

Answer: Biological process, Molecular function, Cellular compartment.

Question 17: By whom and how are GO-terms maintained/defined/updated?

Answer: The Gene Ontology Consortium. But edits can and are submitted by the research community.

Question 18: What is the evidence code?

Answer: A description denoting the type of evidence underlying a specific annotation.

Question 19: How are GO-terms connected to each other?

Answer: GO-terms are connected in a hierarchical structure with increasing specificity. A term can however have more than one parent term.

We can use e.g. the topGO package to visualize gene-set analysis results on the GO hierarchy network (topGO manual).

  1. Run the code below, it takes a short while to complete. Meanwhile, go through the code and understand what it does.
# Format for topgo - remove genes without ensembl id:
d_topgo <- diffExpRes[!is.na(diffExpRes$ensembl_gene_id),]
rownames(d_topgo) <- d_topgo$ensembl_gene_id

# Create the topGOdata object:
GOobj <- new("topGOdata",
             description = "Simple session", # just a name, can be changed
             ontology = "BP", # set to BP, MF, or CC
             allGenes = setNames(diffExpRes$FDR, diffExpRes$ensembl_gene_id), # named numeric vector
             geneSel = function(x) x<0.01, # a function to select genes from the allGenes vector
             nodeSize = 10, # remove GO-terms with less than this number of genes
             mapping = "org.Mm.eg.db", # annotation database
             ID = "ensembl", # gene ID used as names in allGenes vector
             annot = annFUN.org) 
# Run GO-term enrichment analysis:
resultFisher <- runTest(GOobj, algorithm = "classic", statistic = "fisher")
# Plot results using the GO hierarchical DAG:
showSigOfNodes(GOobj, score(resultFisher), firstSigNode=15, useInfo ='all')

If you save the plot as a PDF it will be easier to zoom in and read the node text. (Sometimes the plotting misbehaves in RStudio, try running grid.newpage(), plot.new(), and/or par(mfcol=c(1,1)) if plotting looks weird.)

Question 20: What are the circles and squares denoting?

Answer: Rectangles are the significant terms, circles are connected terms. Color reflects significance.

Question 21: Which is the most significant BP GO-term?

Answer:

GenTable(GOobj,resultFisher)
##         GO.ID                                 Term Annotated Significant
## 1  GO:0006941          striated muscle contraction       122          49
## 2  GO:0006936                   muscle contraction       216          65
## 3  GO:0003012                muscle system process       298          78
## 4  GO:0003008                       system process       987         163
## 5  GO:0061061         muscle structure development       517         105
## 6  GO:0055002     striated muscle cell development       140          47
## 7  GO:0051146 striated muscle cell differentiation       232          62
## 8  GO:0055001              muscle cell development       154          49
## 9  GO:0042692          muscle cell differentiation       336          76
## 10 GO:0030239                   myofibril assembly        53          26
##    Expected result1
## 1      9.91 1.5e-22
## 2     17.55 2.0e-21
## 3     24.21 2.7e-21
## 4     80.19 1.2e-19
## 5     42.00 3.5e-19
## 6     11.37 6.0e-18
## 7     18.85 1.2e-17
## 8     12.51 1.5e-17
## 9     27.30 7.9e-17
## 10     4.31 3.8e-15
Muscle contraction.

  1. Now make a topGO plot for the top 10 significant Cellular Compartments (CC), based on a gene list using a cutoff of FDR<0.001. It should look like the one below.

Answer:

# Create the topGOdata object:
GOobj <- new("topGOdata",
             description = "Simple session", # just a name, can be changed
             ontology = "CC", # set to BP, MF, or CC
             allGenes = setNames(diffExpRes$FDR, diffExpRes$ensembl_gene_id), # named numeric vector
             geneSel = function(x) x<0.001, # a function to select genes from the allGenes vector
             nodeSize = 10, # remove GO-terms with less than this number of genes
             mapping = "org.Mm.eg.db", # annotation database
             ID = "ensembl", # gene ID used as names in allGenes vector
             annot = annFUN.org) 
# Run GO-term enrichment analysis:
resultFisher <- runTest(GOobj, algorithm = "classic", statistic = "fisher")
# Plot results using the GO hierarchical DAG:
showSigOfNodes(GOobj, score(resultFisher), firstSigNode=10, useInfo ='all')

Question 22: Which is the most significant CC GO-term?

Answer:

GenTable(GOobj,resultFisher)
##         GO.ID                            Term Annotated Significant
## 1  GO:0030016                       myofibril       173          64
## 2  GO:0043292               contractile fiber       186          65
## 3  GO:0030017                       sarcomere       152          58
## 4  GO:0044449          contractile fiber part       165          59
## 5  GO:0031674                          I band       110          45
## 6  GO:0030018                          Z disc       100          38
## 7  GO:0099080          supramolecular complex       665          91
## 8  GO:0099081          supramolecular polymer       665          91
## 9  GO:0099512            supramolecular fiber       660          90
## 10 GO:0031224 intrinsic component of membrane      3085         257
##    Expected result1
## 1      9.34 < 1e-30
## 2     10.05 < 1e-30
## 3      8.21 < 1e-30
## 4      8.91 < 1e-30
## 5      5.94 1.3e-28
## 6      5.40 6.1e-23
## 7     35.92 7.3e-17
## 8     35.92 7.3e-17
## 9     35.65 1.4e-16
## 10   166.61 2.1e-15
The 4 first terms in the table are the most significant.

Piano

Piano is an R-package for carrying out gene-set analysis (see more info here).

  1. First we need to import some gene-sets. Load GO-terms from MSigDB:
gscGO <- loadGSC("data/GSC/c5.bp.v5.2.symbols.gmt")
gscGO
## First 50 (out of 4653) gene set names:
##  [1] "GO_REGULATION_O..." "GO_LACTATE_TRAN..." "GO_POSITIVE_REG..."
##  [4] "GO_DETECTION_OF..." "GO_CARDIAC_CHAM..." "GO_CALCIUM_ION_..."
##  [7] "GO_DNA_DEPENDEN..." "GO_TRNA_AMINOAC..." "GO_CIRCADIAN_RH..."
## [10] "GO_PHOSPHATIDYL..." "GO_SPINAL_CORD_..." "GO_REGULATION_O..."
## [13] "GO_PLATELET_DER..." "GO_GLUCURONATE_..." "GO_CELLULAR_RES..."
## [16] "GO_REGULATION_O..." "GO_POSITIVE_REG..." "GO_CEREBRAL_COR..."
## [19] "GO_POSITIVE_REG..." "GO_NEGATIVE_REG..." "GO_POTASSIUM_IO..."
## [22] "GO_L_PHENYLALAN..." "GO_REGULATION_O..." "GO_CARDIAC_MUSC..."
## [25] "GO_NEGATIVE_REG..." "GO_MOVEMENT_IN_..." "GO_REGULATION_O..."
## [28] "GO_APICAL_PROTE..." "GO_REGULATION_O..." "GO_FOREBRAIN_NE..."
## [31] "GO_POSITIVE_REG..." "GO_NEUROMUSCULA..." "GO_MITOTIC_CYTO..."
## [34] "GO_NEGATIVE_REG..." "GO_SMAD_PROTEIN..." "GO_CYTOPLASMIC_..."
## [37] "GO_MEIOTIC_CHRO..." "GO_POSITIVE_REG..." "GO_REGULATION_O..."
## [40] "GO_RNA_DEPENDEN..." "GO_REGULATION_O..." "GO_REGULATION_O..."
## [43] "GO_DENDRITE_DEV..." "GO_REGULATION_O..." "GO_G_PROTEIN_CO..."
## [46] "GO_MEMORY"          "GO_NEURON_DEVEL..." "GO_REGULATION_O..."
## [49] "GO_ENDOTHELIAL_..." "GO_POSITIVE_REG..."
## 
## First 50 (out of 15578) gene names:
##  [1] "PDE1B"    "NR4A2"    "MAOB"     "DRD4"     "TACR3"    "PARK2"   
##  [7] "COMT"     "HTR1A"    "GPR37"    "HPRT1"    "CHRNB2"   "SNCA"    
## [13] "SLC6A3"   "ABAT"     "DRD1"     "PNKD"     "PARK7"    "SLC16A8" 
## [19] "SLC16A5"  "SLC16A4"  "SLC5A12"  "EMB"      "SLC16A1"  "SLC16A7" 
## [25] "SLC16A12" "SLC16A13" "SLC16A6"  "SLC16A3"  "SLC16A11" "POLR2C"  
## [31] "POLR2J"   "CTDP1"    "RDBP"     "COBRA1"   "RSF1"     "POLR2B"  
## [37] "POLR2D"   "POLR2G"   "POLR2F"   "POLR2A"   "SNW1"     "CHD1"    
## [43] "POLR2K"   "CDK9"     "JUN"      "POLR2E"   "CCNT1"    "LEF1"    
## [49] "POLR2H"   "GTF2F2"  
## 
## Gene set size summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    17.0    34.0   111.5    92.0  1984.0 
## 
## First 10 gene sets with additional info:
##       Gene set                 
##  [1,] "GO_REGULATION_OF_DOP..."
##  [2,] "GO_LACTATE_TRANSPORT"   
##  [3,] "GO_POSITIVE_REGULATI..."
##  [4,] "GO_DETECTION_OF_LIGH..."
##  [5,] "GO_CARDIAC_CHAMBER_D..."
##  [6,] "GO_CALCIUM_ION_TRANS..."
##  [7,] "GO_DNA_DEPENDENT_DNA..."
##  [8,] "GO_TRNA_AMINOACYLATI..."
##  [9,] "GO_CIRCADIAN_RHYTHM"    
## [10,] "GO_PHOSPHATIDYLSERIN..."
##       Additional info                                                  
##  [1,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_REGULATIO..."
##  [2,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_LACTATE_T..."
##  [3,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_POSITIVE_..."
##  [4,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_DETECTION..."
##  [5,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CARDIAC_C..."
##  [6,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CALCIUM_I..."
##  [7,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_DNA_DEPEN..."
##  [8,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_TRNA_AMIN..."
##  [9,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CIRCADIAN..."
## [10,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_PHOSPHATI..."

The printed info is always good to check. Did the genes get read as genes and the gene-sets as gene-sets? Do the gene-set sizes seem reasonable? This is a way to sanity check that the loading from a file worked as expected. Usually, reading data from a file is a weak point in any workflow and it is good to carefully check that the data in the file got read as intended. These gene-sets are annotated with human gene symbols, which we currently don’t have in our data.

  1. Add these IDs to the data:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name"),
            filters="ensembl_gene_id",
            values=diffExpRes$ensembl_gene_id, # note that now we use all genes here!
            mart=ensembl)
bm <- bm[bm[,2]%in%unique(unlist(gscGO)),] # filter for genes in the GO gene-set collection
head(bm)
##      ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000093                                  TBX2
## 2 ENSMUSG00000000184                                 CCND2
## 3 ENSMUSG00000000223                                  DRP2
## 4 ENSMUSG00000000253                                  GMPR
## 6 ENSMUSG00000001467                               CYP51A1
## 7 ENSMUSG00000001508                                  SGCA
nrow(bm)
## [1] 10332

Question 23: Are there any duplicates in bm? How many? What does this mean?

Answer:

# Duplicated Ensembl IDs:
sum(duplicated(bm$ensembl_gene_id))
## [1] 220
# Duplicated gene symbols:
sum(duplicated(bm$hsapiens_homolog_associated_gene_name))
## [1] 165
This means that some Ensembl IDs map to more than one gene symbol, and that multiple Ensemble IDs can map to the same gene symbol.

  1. Let’s take a simple approach and remove all ENSMUS duplicates in bm:
bm <- bm[!duplicated(bm[,1]),]
nrow(bm)

Question 24: How many rows did we remove?

Answer:

10489 - 10243
## [1] 246

Question 25: How many unique mouse Ensembl IDs could we map to human orthologs? How many unique human orhtologs do we map to?

Answer:

# How many unique mouse Ensembl IDs could we map to human orthologs?
length(unique(bm$ensembl_gene_id))
## [1] 10112
# How many unique human orhtologs do we map to?
length(unique(bm$hsapiens_homolog_associated_gene_name))
## [1] 10167

Note that multiple ENSMUS IDs still can map to the same gene symbol!

  1. Extract the gene-level statistics that we will use:
# merge diffExpRes and bm:
d_piano <- merge(diffExpRes, bm, by="ensembl_gene_id", all.x=T, sort=F)
# avoid duplicating gene-level statistics due to Ensembl ID mapping to multiple symbols:
d_piano <- d_piano[!duplicated(d_piano[,-ncol(d_piano)]),]
# get geneNames, pvals and logfc:
geneNames <- d_piano$hsapiens_homolog_associated_gene_name
pvals <- d_piano$FDR
names(pvals) <- geneNames
logfc <- d_piano$logFC
names(logfc) <- geneNames
  1. Inspect the objects we just created (e.g. using head(), View() and length()) so that you know how they look and what they contain.

We now have the gene-level data in a convenient format and we are ready to continue with the analysis.

Overrepresentation analysis

First, let’s use piano to perform a hypergeometric test, similar to what we did with Enrichr, DAVID, and topGO.

  1. Run the following code and read the printed info to understand what it does:
# Get the gene-list:
sel <- names(pvals)[pvals<0.001]
sel <- unique(sel[!is.na(sel)])
# Run the analysis:
res <- runGSAhyper(genes=sel, universe=unique(names(pvals)), gsc=gscGO, gsSizeLim=c(20, 200))
# Visualize results:
networkPlot(res, class="non", adjusted=T, significance=0.0001, ncharLabel=Inf,
            nodeSize=c(3,20), edgeWidth=c(1,15), cexLabel=0.7, overlap=20, 
            scoreColors=c("greenyellow","darkgreen"))

Here, we use the networkPlot function which draws a network for the most significant gene-sets.

Question 26: What does gsSizeLim=c(20, 200) in the code above mean?

Answer: To only include gene-sets with between 20 and 200 genes in the analysis (after filtering out genes not in the expression data).

Question 27: What do the colors, node-sizes, and edges denote?

Answer: Node size corresponds to gene-set size, i.e. number of genes. Edges denote the number of shared genes between gene-sets. Note that gene-sets without edges can also share a few genes, depending on the setting overlap. The color corresponds to the significance.

Question 28: Are the results in line with those from topGO, DAVID, and Enrichr?

Answer: The results show muscle contraction related terms, synapse organization, and lipid/sterol related processes, so the results appear to be fairly similar.

Gene-set analysis

Next, we will use piano to perform gene-set analysis, i.e. not the enrichment of a gene-list based on a cut-off, but using all available gene-level statistics. We will use signed (to keep track of fold-change) -log10-pvalues to rank the gene-sets, i.e. -log10(pvals)*sign(logfc).

  1. Run piano and plot the results:
gsaRes <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="fgsea", gsc=gscGO, gsSizeLim=c(20, 200))

networkPlot(gsaRes, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
            nodeSize=c(3,15), edgeWidth=c(1,8), cexLabel=0.7, overlap=20,
            scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))

  1. Read the printed info to check that the numbers make sense!

Question 29: Look at the network plot, could gene overlap between you gene-sets bias the result interpretation in any way?

Answer: There are clusters of gene-sets that share a large portion of genes, i.e. they are in fact quite similar and in extreme cases basically measure the same thing. If these sets are many it is easy to focus the conclusions on the process that those sets represent, in contrast to the processes represented only by single or a few gene-sets. However, this maybe is due to the fact that one process is more fine-grained annotated by e.g. GO. The network plot helps in visualizing this, and one can try to see each cluster as an affected process, regardless of the number of gene-sets it contains.

The network plot can be nice for visualizing your results, but the function GSAsummaryTable can be used to get a table of the all the GSA results.

  1. Try it together with RStudios View() command:
View(GSAsummaryTable(gsaRes))

Question 30: What is the top significant GO-term affected by down-regulation and up-regulation, respectively?

Answer: Up: GO_SARCOMERE_ORGANIZATION
Down: GO_STEROID_METABOLIC_PROCESS

  1. Now run a second GSA, according to the following:
  • This time use the file data/GSC/c2.cp.v5.2.symbols.gmt, which contains pathway gene-sets from MSigDB, as a gene-set collection (load it with loadGSC()).
  • This time use the mean method (option geneSetStat="mean") instead of fgsea.
  • Use only gene-sets containing 10-100 genes.
  1. Plot the results using a network plot, play around with the options until you think it looks nice and presents the results from the GSA in a good way.

For example it could look something similar to this:

gscCP <- loadGSC("data/GSC/c2.cp.v5.2.symbols.gmt")
gsaResCP <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="mean", gsc=gscCP, gsSizeLim=c(10,100))
networkPlot(gsaResCP, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
            nodeSize=c(5,25), edgeWidth=c(1,8), cexLabel=0.7, overlap=5,
            scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))

Question 31: Are these results in line with previous results that we have seen in this exercise? Do you feel there is a consistent pattern?

Answer: Running the piano gene-set analysis on GO-terms show that muscle related processes are affected by up-regulation and that sterol/steroid metabolism is affected by down-regulation. The pathway analysis shows that muscle contraction is affected by up-regulation and that sterol/cholesterol metabolism is affected by down-regulation. This seems to be quite similar to what we saw previously.

It is always good to go back to the gene-level while looking at your final GSA results. Here we will use a heatmap for visualizing gene-level data for selected gene-sets (using one of many heatmap functions available for R).

  1. Let’s pull out the information we need for a specific gene-set:
# Get genes for a specific gene-set:
selGenes <- names(geneSetSummary(gsaRes, "GO_STEROID_BIOSYNTHETIC_PROCESS")$geneLevelStats)

# Merge with the diffexp result table:
selTab <- d_piano[d_piano$hsapiens_homolog_associated_gene_name%in%selGenes, c(1,3,5,9,10)]
selTab$log10FDR <- -log10(selTab$FDR)*sign(selTab$logFC)
# Display this table:
kable(selTab)
ensembl_gene_id mgi_symbol logFC FDR hsapiens_homolog_associated_gene_name log10FDR
2 ENSMUSG00000024799 Tm7sf2 -2.1674811 0.0000000 TM7SF2 -39.3069364
11 ENSMUSG00000031604 Msmo1 -1.9232681 0.0000000 MSMO1 -30.1658228
24 ENSMUSG00000021670 Hmgcr -1.9047857 0.0000000 HMGCR -22.9287137
25 ENSMUSG00000021273 Fdft1 -1.4725972 0.0000000 FDFT1 -22.7646930
30 ENSMUSG00000093930 Hmgcs1 -2.1353849 0.0000000 HMGCS1 -21.3508866
31 ENSMUSG00000001467 Cyp51 -1.6536521 0.0000000 CYP51A1 -20.9697120
34 ENSMUSG00000045294 Insig1 -1.4729905 0.0000000 INSIG1 -20.1736652
53 ENSMUSG00000020917 Acly -1.0106441 0.0000000 ACLY -14.8880627
54 ENSMUSG00000031349 Nsdhl -1.3312848 0.0000000 NSDHL -14.8880627
58 ENSMUSG00000022351 Sqle -1.2507425 0.0000000 SQLE -14.1926825
76 ENSMUSG00000041939 Mvk -1.0538597 0.0000000 MVK -12.0939358
108 ENSMUSG00000027952 Pmvk -1.0461615 0.0000000 PMVK -10.2176927
121 ENSMUSG00000023963 Cyp39a1 -1.0528371 0.0000000 CYP39A1 -9.8008369
126 ENSMUSG00000026675 Hsd17b7 -1.4294868 0.0000000 HSD17B7 -9.6091437
149 ENSMUSG00000058454 Dhcr7 -0.8589378 0.0000000 DHCR7 -8.7776193
179 ENSMUSG00000058258 Idi1 -1.8503612 0.0000000 IDI1 -7.7828797
276 ENSMUSG00000033105 Lss -2.3822023 0.0000019 LSS -5.7273752
364 ENSMUSG00000059743 Fdps -2.0813679 0.0000231 FDPS -4.6355084
371 ENSMUSG00000034926 Dhcr24 -1.7835910 0.0000266 DHCR24 -4.5744417
556 ENSMUSG00000046027 Stard5 0.5669507 0.0006520 STARD5 3.1857624
632 ENSMUSG00000027195 Hsd17b12 -0.4651350 0.0014378 HSD17B12 -2.8423134
641 ENSMUSG00000036856 Wnt4 1.8707597 0.0015159 WNT4 2.8193167
1029 ENSMUSG00000031168 Ebp -0.5237848 0.0205263 EBP -1.6876888
1058 ENSMUSG00000028518 Prkaa2 1.0584440 0.0227853 PRKAA2 1.6423454
1097 ENSMUSG00000027875 Hmgcs2 0.4015057 0.0262726 HMGCS2 1.5804976
1315 ENSMUSG00000037936 Scarb1 -0.3771259 0.0504584 SCARB1 -1.2970664
1359 ENSMUSG00000028944 Prkag2 -0.5884059 0.0555413 PRKAG2 -1.2553837
1412 ENSMUSG00000024112 Cacna1h 0.6243337 0.0635828 CACNA1H 1.1966606
1474 ENSMUSG00000036880 Acaa2 0.3221681 0.0741393 ACAA2 1.1299516
1647 ENSMUSG00000026456 Cyb5r1 0.4542855 0.1048128 CYB5R1 0.9795859
1930 ENSMUSG00000031891 Hsd11b2 1.0349272 0.1505277 HSD11B2 0.8223835
2014 ENSMUSG00000018042 Cyb5r3 -0.2810385 0.1700817 CYB5R3 -0.7693423
2101 ENSMUSG00000018167 Stard3 -0.2933683 0.1865191 STARD3 -0.7292768
2530 ENSMUSG00000061758 Akr1b10 0.4704163 0.2585394 AKR1B15 0.5874733
2732 ENSMUSG00000029311 Hsd17b11 0.2579175 0.2965381 HSD17B11 0.5279195
2933 ENSMUSG00000050697 Prkaa1 0.7667416 0.3340304 PRKAA1 0.4762140
3270 ENSMUSG00000004880 Lbr -0.2605653 0.3946105 LBR -0.4038313
3585 ENSMUSG00000018861 Fdxr -0.2721958 0.4506118 FDXR -0.3461974
3626 ENSMUSG00000032051 Fdx1 0.3157853 0.4590777 FDX1 0.3381138
4027 ENSMUSG00000032323 Cyp11a1 0.6364368 0.5269210 CYP11A1 0.2782545
4080 ENSMUSG00000021214 Akr1c18 0.5195843 0.5371843 AKR1C4 0.2698766
4539 ENSMUSG00000018160 Med1 0.1443773 0.5986866 MED1 0.2228004
5060 ENSMUSG00000033715 Akr1c14 -0.3009562 0.6533384 AKR1C4 -0.1848618
5117 ENSMUSG00000034308 Sdr42e1 0.3779597 0.6597789 SDR42E1 0.1806016
5237 ENSMUSG00000021259 Cyp46a1 -0.4382325 0.6730181 CYP46A1 -0.1719732
5695 ENSMUSG00000021302 Ggps1 -0.1886793 0.7181029 GGPS1 -0.1438133
5752 ENSMUSG00000030825 Hsd17b14 -0.4391568 0.7242179 HSD17B14 -0.1401307
6796 ENSMUSG00000041736 Tspo 0.0871980 0.8089995 TSPO 0.0920518
6798 ENSMUSG00000030057 Cnbp 0.0843414 0.8090640 CNBP 0.0920171
6943 ENSMUSG00000017307 Acot8 0.1457815 0.8210293 ACOT8 0.0856413
7041 ENSMUSG00000026499 Acbd3 -0.2124423 0.8294557 ACBD3 -0.0812068
7150 ENSMUSG00000028470 Hint2 0.1065509 0.8376562 HINT2 0.0769342
7442 ENSMUSG00000003721 Insig2 -0.0982338 0.8560106 INSIG2 -0.0675208
7562 ENSMUSG00000022244 Amacr 0.1082852 0.8643204 AMACR 0.0633252
7761 ENSMUSG00000028603 Scp2 -0.2133358 0.8753502 SCP2 -0.0578182
7834 ENSMUSG00000030670 Cyp2r1 -0.2499409 0.8829956 CYP2R1 -0.0540414
7934 ENSMUSG00000039519 Cyp7b1 -0.0509951 0.8898487 CYP7B1 -0.0506838
8170 ENSMUSG00000026170 Cyp27a1 0.1069738 0.9068471 CYP27A1 0.0424659
8482 ENSMUSG00000024507 Hsd17b4 -0.0399601 0.9246019 HSD17B4 -0.0340452
8856 ENSMUSG00000029233 Srd5a3 0.0349583 0.9415716 SRD5A3 0.0261466
8993 ENSMUSG00000031982 Arv1 -0.0708507 0.9469453 ARV1 -0.0236751
9255 ENSMUSG00000052534 Pbx1 0.0231469 0.9578520 PBX1 0.0187016
9984 ENSMUSG00000031574 Star 0.0146487 0.9846159 STAR 0.0067332

Question 32: Does the table look like you expect it to (did the human to mouse gene symbol matching work, are the signed log10-FDRs consistent with the logFC and FDR columns)?

Answer: Yes, a visual inspection shows that the mapping is correct.

  1. Now, let’s fetch the count data for these genes and make a heatmap of all the information we have collected:
# Read in the count data:
tableCounts <- read.table(file="data/tableCounts", sep="\t", header=TRUE, skip=1)
rownames(tableCounts) <- tableCounts[,1]
tableCounts <- tableCounts[,7:12]
colnames(tableCounts) <- c("KO_1","KO_2", "KO_3", "WT_1", "WT_2", "WT_3");

# Merge count data with the selTab:
selTab <- merge(selTab, tableCounts, by.x="ensembl_gene_id", by.y=0, all.x=T)

# Plot a heatmap:
# Row annotation:
rowann <- list(Significant=ifelse(selTab$FDR<0.05,"yes","no"), Regulation=ifelse(selTab$logFC>0,"Up","Down"))
rowannCol <- list(Significant=c("black","green"), Regulation=c("blue","red"))
# Row labels:
rowlabs <- selTab$mgi_symbol
rowlabs[duplicated(rowlabs)] <- paste(rowlabs[duplicated(rowlabs)]," ")
# Plot:
aheatmap(selTab[,7:12], scale="row", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])

Note that here we scale the rows so the colors will be visualizing the difference for each gene specifically, but is not comparable across genes. Therefore we also include the actual counts in each heatmap cell. Also, note that these are raw counts. As an alternative we could have plotted normalized CPMs (counts per million) using the command:

cpms <- cpm(calcNormFactors(DGEList(tableCounts), method='TMM'))

Try this if you have time!

Question 33: Why are the rows and columns reordered?

Answer: A clustering is performed on both the columns and rows in order to group genes and samples with similar patterns.

  1. Make a heatmap without the scaling.

Question 34: Does it look “better”, what is the drawback/benefit?

Answer:

aheatmap(selTab[,7:12], scale="none", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])

Since a few genes have very high counts, it becomes difficult to see the patterns for the remaining genes.
One can take the log of the counts to partly alleviate this:

aheatmap(log10(selTab[,7:12]), scale="none", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])


  1. Change the row-labels to mouse ENSEMBL IDs.

Answer:

# Row labels:
rowlabs <- selTab$ensembl_gene_id
rowlabs[duplicated(rowlabs)] <- paste(rowlabs[duplicated(rowlabs)]," ")
# Plot:
aheatmap(selTab[,7:12], scale="row", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])


Question 35: Given the count data for these genes, does the differential expression results make sense, as indicated by the column annotation columns? Any specific genes that you are concerned about?

Answer: These are the genes belonging to GO_STEROID_BIOSYNTHETIC_PROCESS, which was shown to be affected by down-regulation in the gene-set analysis. Here we can see that a high portion of the genes are signifcant, and that the vast majority of significant genes are in fact down-regulated. Looking at the count data (the actual heatmap) one can also see a pattern of blue to the left (KO) and red to the right (WT), indicating a lower expression of these genes in the KO:s.

Question 36: Would you say from looking at the heatmap, that in general steroid biosynthesis is down-regulated or up-regulated? Does your answer reflect the GSA results as seen in the network plot above?

Answer: Down-regulated. It does reflect the network plot results.

  1. Spot-check some of these genes to see what their functions are (Google search or similar).

Question 37: Are these genes actually involved in steroid biosynthesis? Are you confident enough from the data and analysis results to say anything about how YAP/TAZ knockout affects this biological process?

Answer: I spot checked ACAA2 (involved in fatty acid beta-oxidation), SCARB1 (receptor for high density lipoprotein cholesterol, HDL), HMGCR (rate-limiting enzyme for cholesterol synthesis). One could check more, but at least from these it seems correct.

The paper from where this data is taken also reports that YAP/TAZ regulate expression of lipid synthetic enzymes and found down-regulation of lipid and sterol biosynthesis in the KO:s (see e.g. Figure 6).

  1. If you have time, do a similar heatmap but for another gene-set of your choice.

Session info

This page was generated using the following R session:

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  methods   stats     graphics  grDevices
##  [8] utils     datasets  base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2   edgeR_3.20.1         limma_3.34.2        
##  [4] Rgraphviz_2.22.0     org.Mm.eg.db_3.5.0   NMF_0.20.6          
##  [7] cluster_2.0.6        rngtools_1.2.4       pkgmaker_0.22       
## [10] registry_0.3         piano_1.18.0         biomaRt_2.34.0      
## [13] topGO_2.30.0         SparseM_1.77         GO.db_3.5.0         
## [16] AnnotationDbi_1.40.0 IRanges_2.12.0       S4Vectors_0.16.0    
## [19] Biobase_2.38.0       graph_1.56.0         BiocGenerics_0.24.0 
## [22] knitr_1.17           BiocInstaller_1.28.0
## 
## loaded via a namespace (and not attached):
##  [1] foreach_1.4.3       bit64_0.9-7         gtools_3.5.0       
##  [4] assertthat_0.2.0    highr_0.6           blob_1.1.0         
##  [7] yaml_2.1.14         progress_1.1.2      slam_0.1-40        
## [10] RSQLite_2.0         backports_1.1.1     lattice_0.20-35    
## [13] digest_0.6.12       colorspace_1.3-2    htmltools_0.3.6    
## [16] plyr_1.8.4          XML_3.98-1.9        pkgconfig_2.0.1    
## [19] xtable_1.8-2        relations_0.6-7     scales_0.5.0       
## [22] gdata_2.18.0        BiocParallel_1.12.0 tibble_1.3.4       
## [25] ggplot2_2.2.1       lazyeval_0.2.1      magrittr_1.5       
## [28] memoise_1.1.0       evaluate_0.10.1     doParallel_1.0.11  
## [31] gplots_3.0.1        tools_3.4.2         data.table_1.10.4-3
## [34] prettyunits_1.0.2   gridBase_0.4-7      matrixStats_0.52.2 
## [37] stringr_1.2.0       locfit_1.5-9.1      munsell_0.4.3      
## [40] compiler_3.4.2      caTools_1.17.1      rlang_0.1.2        
## [43] RCurl_1.95-4.8      iterators_1.0.8     marray_1.56.0      
## [46] igraph_1.1.2        bitops_1.0-6        rmarkdown_1.6      
## [49] gtable_0.2.0        codetools_0.2-15    DBI_0.7            
## [52] sets_1.0-17         reshape2_1.4.2      R6_2.2.2           
## [55] gridExtra_2.3       bit_1.1-12          fastmatch_1.1-0    
## [58] fgsea_1.4.0         rprojroot_1.2       KernSmooth_2.23-15 
## [61] stringi_1.1.5       Rcpp_0.12.13