Setup and load data

The data needed for this exercise is taken from the RNA-seq exercise. In particular it comes from this publication. We will work with the output from the differential expression analysis of KO vs WT. The data files should be in a data/ subdirectory in your R working directory. You can download the data/ directory with files here. You can use the files results_DE.txt and tableCounts that you generated in the RNA-seq exercise, or use the ones availabel in the data.zip file. They should be the same :-)

The idea is that you should go through all the steps in this tutorial and keep track of your code and plots using Rmarkdown. That way, you can include comments and answers to the questions and document your progress.

Make a new directory for this exercise and add the data/ directory to it. Start R and use setwd() to change the working directory to the one you created.
We will use the following R packages in this exercise:

library(knitr)
library(topGO)
library(biomaRt)
library(piano)
library(NMF)
library(org.Mm.eg.db)
library(Rgraphviz)
library(edgeR)

If you are not able to load any of these packages, try to install them, either using biocLite() (first you need to run source("http://www.bioconductor.org/biocLite.R")) or install.packages(). If something fails, try to understand the error message and fix it. If you get stuck, ask for help :)

Start by reading in the differential expression results from the previous exercise:

diffExpRes <- read.delim("data/results_DE.txt", stringsAsFactors=F)
head(diffExpRes[,c(1,3,5,8,9)]) # skip some columns

##      ensembl_gene_id mgi_symbol     logFC       PValue          FDR
## 1 ENSMUSG00000034810      Scn7a -2.861294 3.433028e-45 4.822375e-41
## 2 ENSMUSG00000024799     Tm7sf2 -2.167481 7.022796e-44 4.932461e-40
## 3 ENSMUSG00000027200     Sema6d -1.837139 7.826289e-42 3.664530e-38
## 4 ENSMUSG00000022231     Sema5a -3.039879 2.030104e-41 7.129218e-38
## 5 ENSMUSG00000022483     Col2a1  1.880197 3.485428e-41 9.791960e-38
## 6 ENSMUSG00000023832      Acat2 -1.935112 1.067795e-40 2.499886e-37

Question 1: How many genes are significant at a cutoff of FDR<0.001?

Answer:

sum(diffExpRes$FDR<0.001)

## [1] 733

Web-based enrichment analysis

We will start by performing overrepresentation analysis (a.k.a. list enrichment analysis, …) by using Enrichr and DAVID. Both use a scoring method similar to the Hypergeomtric/Fisher’s exact test. To do such a test, we need to have a list of interesting (e.g. differentially expressed) genes and a list of the background (also known as universe). However, both Enrichr and DAVID have their own background lists, so we do not need to specify them explicitly. First, let’s make a list of interesting genes!

You can use the write.table function to print the top significant genes to copy paste into some web browser tools:

sel <- diffExpRes$ensembl_gene_id[diffExpRes$FDR<0.001]
write.table(sel, quote=F, row.names=F, col.names=F)

DAVID

Select and copy the gene list given by the above command and head over to DAVID.
Locate Functional Annotation in the left menu and click the more link and briefly scroll through that page to get an overview of what the Functional Annotation Tool at DAVID does.

Question 2: What is the alternative to the Fisher Exact P-Value used by DAVID?

Answer:

DAVID uses the so-called EASE Score. In the contingency table, 1 is subtracted from the number of genes that are both in the gene-list and in the gene-set.

Now go to the Functional Annotation page and make sure the Upload tab is visible.
Paste the copied gene-list, select the correct identifier, and select whether this is a gene list or background (discuss with other students if you are not sure, or ask the instructors). Submit list.

Question 3: Were all gene Ensembl IDs recognized? How many were unmapped?

Answer: 706 were recognized. A few (~19) genes were not in the database.

Explore the results. For instance, click on Functional Annotation Clustering at the bottom of the page. This shows related gene-sets clustered together in larger groups for a nicer overview.

Question 4: What seems to be the main functions of the top significant genes (i.e. in summary what are the significantly enriched gene-sets)? Does this make sense considering the experimental design?

Answer: Lipid, fatty-acid and sterol metabolism. Membrane, ER. Muscle contraction. Synapse.

Now, rerun DAVID but use gene symbols as input instead. Hint: you need to modify the command above creating the sel object. Also, DAVID will probably warn you at some point that the genes map to multiple species. Select the correct species and also go to the background tab and select the correct background.

Question 5: Were all gene symbols recognized? How many were unmapped?

Answer:

sel2 <- diffExpRes$mgi_symbol[diffExpRes$FDR<0.001]
write.table(sel2, quote=F, row.names=F, col.names=F)

Select OFFICIAL_GENE_SYMBOL. Multiple species were mapped. Select mouse. 720 gene symbols were matched. 4 were not recognized.

Question 6: How many genes in our list overlap with the GO-term Lipid biosynthesis? (Hint: check e.g. the Functional Annotation Clustering)

Answer: 42 genes.

Question 7: Does this list give the same results as using the Ensembl IDs?

Answer: Not 100% exactly the same, since the recognized gene-list is slightly different. But very similar results.

Enrichr

As an alternative to DAVID we can try Enrichr. This page takes human gene symbols as input, so first we need to translate our mouse Ensembl IDs to that. This can be done in many ways, one way is to use the biomaRt package (find a user guide here or type vignette("biomaRt")):

# Use biomaRt to translate the mouse genes into human orthologs:
# Select the Ensembl BioMart database
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
# Select the mouse dataset and update the Mart object: 
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_ensembl_gene"), # this is what we want to extract 
            filters="ensembl_gene_id", # this determines the filter
            values=sel, # this are the values to filter for (get only the genes in our list)
            mart=ensembl)
head(bm)

##      ensembl_gene_id hsapiens_homolog_ensembl_gene
## 1 ENSMUSG00000000093               ENSG00000121068
## 2 ENSMUSG00000000184               ENSG00000118971
## 3 ENSMUSG00000000223               ENSG00000102385
## 4 ENSMUSG00000000253               ENSG00000137198
## 5 ENSMUSG00000000983                              
## 6 ENSMUSG00000001467               ENSG00000001630

As you can see, the above code gave us a map between mouse and human ensembl gene IDs. But this is not exactly what we wanted, we need human gene symbols.

Change the attributes argument to the proper setting.

Use the function listAttributes(ensembl) to see all possible options. Hint: if there are too many options to go through you can use grep to pull out the relevant ones, e.g.:

tmp <- listAttributes(ensembl)
tmp[grep("Human",tmp[,2]),]

If you are working in RStudio you can also use the convenient command View and then search for human, e.g.:

View(listAttributes(ensembl))

Try it also for the bm object. For future reference, note that you can also use listDatasets(ensembl) to see which datasetes you can choose from, try it!

Question 8: How many of the genes in our selected list have gene symbols?

Answer:

# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name"), # this is what we want to extract 
            filters="ensembl_gene_id", # this determines the filter
            values=sel, # this are the values to filter for (get only the genes in our list)
            mart=ensembl)
head(bm)

##      ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000093                                  TBX2
## 2 ENSMUSG00000000184                                 CCND2
## 3 ENSMUSG00000000223                                  DRP2
## 4 ENSMUSG00000000253                                  GMPR
## 5 ENSMUSG00000000983                                      
## 6 ENSMUSG00000001467                               CYP51A1

# subset to the rows were the gene-symbol is >1 characters:
tmp <- bm[nchar(bm[,2])>1,]
# How many unique Ensembl IDs?
length(unique(tmp$ensembl_gene_id))

## [1] 698

Get familiar with the duplicated and unique functions:

# A vector to test stuff on:
tmp <- c(1,2,2,3,4,4,4,5)
# Get unique elements:
unique(tmp)

## [1] 1 2 3 4 5

# Which elements are duplicated?
duplicated(tmp)

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE

# How many duplicates are there:
sum(duplicated(tmp))

## [1] 3

# Which are the duplicated elements:
tmp[duplicated(tmp)]

## [1] 2 4 4

# Which are the unique duplicated elements:
unique(tmp[duplicated(tmp)])

## [1] 2 4

Make sure you understand how these two functions work and what they do, then move ahead with the following steps and questions:

Question 9: Are there duplicated Ensembl IDs or gene symbols in the list for Enrichr (bm)? How can this affect the results that we get?

Answer:

# Duplicated Ensembl IDs:
sum(duplicated(bm$ensembl_gene_id))

## [1] 17

# Duplicated gene symbols:
sum(duplicated(bm$hsapiens_homolog_associated_gene_name))

## [1] 41

This means that some Ensembl IDs map to more than one gene symbol, and that multiple Ensemble IDs can map to the same gene symbol.

Now, run getBM again, this time also adding information about % identity between mouse and human genes.
Make a boxplot of the % identity for our gene list. Hint: try ?boxplot if you are unsure about this. Try a histogram as well (hist())!

Question 10: Looking at the boxplot, are there any reasons to be concerned?

Answer:

# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name",
                         "hsapiens_homolog_perc_id"), # this is what we want to extract 
            filters="ensembl_gene_id", # this determines the filter
            values=sel, # this are the values to filter for (get only the genes in our list)
            mart=ensembl)
head(bm)

##      ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000093                                  TBX2
## 2 ENSMUSG00000000184                                 CCND2
## 3 ENSMUSG00000000223                                  DRP2
## 4 ENSMUSG00000000253                                  GMPR
## 5 ENSMUSG00000000983                                      
## 6 ENSMUSG00000001467                               CYP51A1
##   hsapiens_homolog_perc_id
## 1                  95.0774
## 2                  92.0415
## 3                  95.0888
## 4                  95.3623
## 5                       NA
## 6                  91.2525

boxplot(bm$hsapiens_homolog_perc_id)

hist(bm$hsapiens_homolog_perc_id, breaks=50)

There are some genes with very low % identity, one could consider to remove these. The majority of genes have high similarity however, which is good.

Question 11: Which gene in the list has the lowest % identity?

Answer:

idx <- order(bm$hsapiens_homolog_perc_id)
bm[idx[1],]

##        ensembl_gene_id hsapiens_homolog_associated_gene_name
## 708 ENSMUSG00000071317                                POPDC3
##     hsapiens_homolog_perc_id
## 708                        0

Paste your list of human gene symbols on the Enrichr website. Look at Ontologies > GO Biological Process.

Question 12: Are the results similar to those we got from DAVID?

Answer:

write.table(bm$hsapiens_homolog_associated_gene_name, col.names=F, row.names=F, quote=F)

Yes, fairly similar. There is muscle contraction, sterol and lipid biosynthesis, and myelination.

Look around on the Enrichr page and get familiar with the different results. Look also on the different views, e.g. Table is good to see the actual p-values of the results.

Question 13: What do the edges (links) in the network view on the Enrichr webpage denote? (Note: The network view is not available for some gene-set collections, try to click on a different tab and gene set collection to find an example of the network view.)

Answer: See http://amp.pharm.mssm.edu/Enrichr/help#basics&q=5. The edges denote that there is a gene overlap between gene-sets.

Question 14: We have this far used a cutoff of FDR<0.001. Try a few different cutoffs and rerun DAVID and Enrichr. How different are the results?

Answer: They differ a bit depending on the list size. Overall conclusions are similar.

Question 15: For the Enrichr and DAVID results so far, can we know whether the identified functions are active/inactive or up/down-regulated in KO vs control? If not, how can we go about to get a clue about that?

Answer: The list we have used contains both up- and down-regulated genes, so we have not taken into account direction. One could subset the list to up- and down-regulated genes separately. Or use a method that takes directionality into account.

Enrichment analysis in R

Visualizing GO-terms with TopGO

Above, we have looked at some different types of gene-set collections, but a lot of focus has been on GO-terms.

Use the internet to learn about what GO-terms are (e.g. Wikipedia or here).

Question 16: What are the three main domains/ontologies?

Answer: Biological process, Molecular function, Cellular compartment.

Question 17: By whom and how are GO-terms maintained/defined/updated?

Answer: The Gene Ontology Consortium. But edits can and are submitted by the research community.

Question 18: What is the evidence code?

Answer: A description denoting the type of evidence underlying a specific annotation.

Question 19: How are GO-terms connected to each other?

Answer: GO-terms are connected in a hierarchical structure with increasing specificity. A term can however have more than one parent term.

We can use e.g. the topGO package to visualize gene-set analysis results on the GO hierarchy network (topGO manual).

Run the code below, it takes a short while to complete. Meanwhile, go through the code and understand what it does.

# Format for topgo - remove genes without ensembl id:
d_topgo <- diffExpRes[!is.na(diffExpRes$ensembl_gene_id),]
rownames(d_topgo) <- d_topgo$ensembl_gene_id

# Create the topGOdata object:
GOobj <- new("topGOdata",
             description = "Simple session", # just a name, can be changed
             ontology = "BP", # set to BP, MF, or CC
             allGenes = setNames(diffExpRes$FDR, diffExpRes$ensembl_gene_id), # named numeric vector
             geneSel = function(x) x<0.01, # a function to select genes from the allGenes vector
             nodeSize = 10, # remove GO-terms with less than this number of genes
             mapping = "org.Mm.eg.db", # annotation database
             ID = "ensembl", # gene ID used as names in allGenes vector
             annot = annFUN.org) 
# Run GO-term enrichment analysis:
resultFisher <- runTest(GOobj, algorithm = "classic", statistic = "fisher")
# Plot results using the GO hierarchical DAG:
showSigOfNodes(GOobj, score(resultFisher), firstSigNode=15, useInfo ='all')

If you save the plot as a PDF it will be easier to zoom in and read the node text. (Sometimes the plotting misbehaves in RStudio, try running grid.newpage(), plot.new(), and/or par(mfcol=c(1,1)) if plotting looks weird.)

Question 20: What are the circles and squares denoting?

Answer: Rectangles are the significant terms, circles are connected terms. Color reflects significance.

Question 21: Which is the most significant BP GO-term?

Answer:

GenTable(GOobj,resultFisher)

##         GO.ID                                 Term Annotated Significant
## 1  GO:0006941          striated muscle contraction       122          49
## 2  GO:0006936                   muscle contraction       216          65
## 3  GO:0003012                muscle system process       298          78
## 4  GO:0003008                       system process       987         163
## 5  GO:0061061         muscle structure development       517         105
## 6  GO:0055002     striated muscle cell development       140          47
## 7  GO:0051146 striated muscle cell differentiation       232          62
## 8  GO:0055001              muscle cell development       154          49
## 9  GO:0042692          muscle cell differentiation       336          76
## 10 GO:0030239                   myofibril assembly        53          26
##    Expected result1
## 1      9.91 1.5e-22
## 2     17.55 2.0e-21
## 3     24.21 2.7e-21
## 4     80.19 1.2e-19
## 5     42.00 3.5e-19
## 6     11.37 6.0e-18
## 7     18.85 1.2e-17
## 8     12.51 1.5e-17
## 9     27.30 7.9e-17
## 10     4.31 3.8e-15

Muscle contraction.

Now make a topGO plot for the top 10 significant Cellular Compartments (CC), based on a gene list using a cutoff of FDR<0.001. It should look like the one below.

Answer:

# Create the topGOdata object:
GOobj <- new("topGOdata",
             description = "Simple session", # just a name, can be changed
             ontology = "CC", # set to BP, MF, or CC
             allGenes = setNames(diffExpRes$FDR, diffExpRes$ensembl_gene_id), # named numeric vector
             geneSel = function(x) x<0.001, # a function to select genes from the allGenes vector
             nodeSize = 10, # remove GO-terms with less than this number of genes
             mapping = "org.Mm.eg.db", # annotation database
             ID = "ensembl", # gene ID used as names in allGenes vector
             annot = annFUN.org) 
# Run GO-term enrichment analysis:
resultFisher <- runTest(GOobj, algorithm = "classic", statistic = "fisher")
# Plot results using the GO hierarchical DAG:
showSigOfNodes(GOobj, score(resultFisher), firstSigNode=10, useInfo ='all')

Question 22: Which is the most significant CC GO-term?

Answer:

GenTable(GOobj,resultFisher)

##         GO.ID                            Term Annotated Significant
## 1  GO:0030016                       myofibril       173          64
## 2  GO:0043292               contractile fiber       186          65
## 3  GO:0030017                       sarcomere       152          58
## 4  GO:0044449          contractile fiber part       165          59
## 5  GO:0031674                          I band       110          45
## 6  GO:0030018                          Z disc       100          38
## 7  GO:0099080          supramolecular complex       665          91
## 8  GO:0099081          supramolecular polymer       665          91
## 9  GO:0099512            supramolecular fiber       660          90
## 10 GO:0031224 intrinsic component of membrane      3085         257
##    Expected result1
## 1      9.34 < 1e-30
## 2     10.05 < 1e-30
## 3      8.21 < 1e-30
## 4      8.91 < 1e-30
## 5      5.94 1.3e-28
## 6      5.40 6.1e-23
## 7     35.92 7.3e-17
## 8     35.92 7.3e-17
## 9     35.65 1.4e-16
## 10   166.61 2.1e-15

The 4 first terms in the table are the most significant.

Piano

Piano is an R-package for carrying out gene-set analysis (see more info here).

First we need to import some gene-sets. Load GO-terms from MSigDB:

gscGO <- loadGSC("data/GSC/c5.bp.v5.2.symbols.gmt")
gscGO

## First 50 (out of 4653) gene set names:
##  [1] "GO_REGULATION_O..." "GO_LACTATE_TRAN..." "GO_POSITIVE_REG..."
##  [4] "GO_DETECTION_OF..." "GO_CARDIAC_CHAM..." "GO_CALCIUM_ION_..."
##  [7] "GO_DNA_DEPENDEN..." "GO_TRNA_AMINOAC..." "GO_CIRCADIAN_RH..."
## [10] "GO_PHOSPHATIDYL..." "GO_SPINAL_CORD_..." "GO_REGULATION_O..."
## [13] "GO_PLATELET_DER..." "GO_GLUCURONATE_..." "GO_CELLULAR_RES..."
## [16] "GO_REGULATION_O..." "GO_POSITIVE_REG..." "GO_CEREBRAL_COR..."
## [19] "GO_POSITIVE_REG..." "GO_NEGATIVE_REG..." "GO_POTASSIUM_IO..."
## [22] "GO_L_PHENYLALAN..." "GO_REGULATION_O..." "GO_CARDIAC_MUSC..."
## [25] "GO_NEGATIVE_REG..." "GO_MOVEMENT_IN_..." "GO_REGULATION_O..."
## [28] "GO_APICAL_PROTE..." "GO_REGULATION_O..." "GO_FOREBRAIN_NE..."
## [31] "GO_POSITIVE_REG..." "GO_NEUROMUSCULA..." "GO_MITOTIC_CYTO..."
## [34] "GO_NEGATIVE_REG..." "GO_SMAD_PROTEIN..." "GO_CYTOPLASMIC_..."
## [37] "GO_MEIOTIC_CHRO..." "GO_POSITIVE_REG..." "GO_REGULATION_O..."
## [40] "GO_RNA_DEPENDEN..." "GO_REGULATION_O..." "GO_REGULATION_O..."
## [43] "GO_DENDRITE_DEV..." "GO_REGULATION_O..." "GO_G_PROTEIN_CO..."
## [46] "GO_MEMORY"          "GO_NEURON_DEVEL..." "GO_REGULATION_O..."
## [49] "GO_ENDOTHELIAL_..." "GO_POSITIVE_REG..."
## 
## First 50 (out of 15578) gene names:
##  [1] "PDE1B"    "NR4A2"    "MAOB"     "DRD4"     "TACR3"    "PARK2"   
##  [7] "COMT"     "HTR1A"    "GPR37"    "HPRT1"    "CHRNB2"   "SNCA"    
## [13] "SLC6A3"   "ABAT"     "DRD1"     "PNKD"     "PARK7"    "SLC16A8" 
## [19] "SLC16A5"  "SLC16A4"  "SLC5A12"  "EMB"      "SLC16A1"  "SLC16A7" 
## [25] "SLC16A12" "SLC16A13" "SLC16A6"  "SLC16A3"  "SLC16A11" "POLR2C"  
## [31] "POLR2J"   "CTDP1"    "RDBP"     "COBRA1"   "RSF1"     "POLR2B"  
## [37] "POLR2D"   "POLR2G"   "POLR2F"   "POLR2A"   "SNW1"     "CHD1"    
## [43] "POLR2K"   "CDK9"     "JUN"      "POLR2E"   "CCNT1"    "LEF1"    
## [49] "POLR2H"   "GTF2F2"  
## 
## Gene set size summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    17.0    34.0   111.5    92.0  1984.0 
## 
## First 10 gene sets with additional info:
##       Gene set                 
##  [1,] "GO_REGULATION_OF_DOP..."
##  [2,] "GO_LACTATE_TRANSPORT"   
##  [3,] "GO_POSITIVE_REGULATI..."
##  [4,] "GO_DETECTION_OF_LIGH..."
##  [5,] "GO_CARDIAC_CHAMBER_D..."
##  [6,] "GO_CALCIUM_ION_TRANS..."
##  [7,] "GO_DNA_DEPENDENT_DNA..."
##  [8,] "GO_TRNA_AMINOACYLATI..."
##  [9,] "GO_CIRCADIAN_RHYTHM"    
## [10,] "GO_PHOSPHATIDYLSERIN..."
##       Additional info                                                  
##  [1,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_REGULATIO..."
##  [2,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_LACTATE_T..."
##  [3,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_POSITIVE_..."
##  [4,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_DETECTION..."
##  [5,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CARDIAC_C..."
##  [6,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CALCIUM_I..."
##  [7,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_DNA_DEPEN..."
##  [8,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_TRNA_AMIN..."
##  [9,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_CIRCADIAN..."
## [10,] "http://www.broadinstitute.org/gsea/msigdb/cards/GO_PHOSPHATI..."

The printed info is always good to check. Did the genes get read as genes and the gene-sets as gene-sets? Do the gene-set sizes seem reasonable? This is a way to sanity check that the loading from a file worked as expected. Usually, reading data from a file is a weak point in any workflow and it is good to carefully check that the data in the file got read as intended. These gene-sets are annotated with human gene symbols, which we currently don’t have in our data.

Add these IDs to the data:

bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name"),
            filters="ensembl_gene_id",
            values=diffExpRes$ensembl_gene_id, # note that now we use all genes here!
            mart=ensembl)
bm <- bm[bm[,2]%in%unique(unlist(gscGO)),] # filter for genes in the GO gene-set collection
head(bm)

##      ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000093                                  TBX2
## 2 ENSMUSG00000000184                                 CCND2
## 3 ENSMUSG00000000223                                  DRP2
## 4 ENSMUSG00000000253                                  GMPR
## 6 ENSMUSG00000001467                               CYP51A1
## 7 ENSMUSG00000001508                                  SGCA

nrow(bm)

## [1] 10332

Question 23: Are there any duplicates in bm? How many? What does this mean?

Answer:

# Duplicated Ensembl IDs:
sum(duplicated(bm$ensembl_gene_id))

## [1] 220

# Duplicated gene symbols:
sum(duplicated(bm$hsapiens_homolog_associated_gene_name))

## [1] 165

This means that some Ensembl IDs map to more than one gene symbol, and that multiple Ensemble IDs can map to the same gene symbol.

Let’s take a simple approach and remove all ENSMUS duplicates in bm:

bm <- bm[!duplicated(bm[,1]),]
nrow(bm)

Question 24: How many rows did we remove?

Answer:

10489 - 10243

## [1] 246

Question 25: How many unique mouse Ensembl IDs could we map to human orthologs? How many unique human orhtologs do we map to?

Answer:

# How many unique mouse Ensembl IDs could we map to human orthologs?
length(unique(bm$ensembl_gene_id))

## [1] 10112

# How many unique human orhtologs do we map to?
length(unique(bm$hsapiens_homolog_associated_gene_name))

## [1] 10167

Note that multiple ENSMUS IDs still can map to the same gene symbol!

Extract the gene-level statistics that we will use:

# merge diffExpRes and bm:
d_piano <- merge(diffExpRes, bm, by="ensembl_gene_id", all.x=T, sort=F)
# avoid duplicating gene-level statistics due to Ensembl ID mapping to multiple symbols:
d_piano <- d_piano[!duplicated(d_piano[,-ncol(d_piano)]),]
# get geneNames, pvals and logfc:
geneNames <- d_piano$hsapiens_homolog_associated_gene_name
pvals <- d_piano$FDR
names(pvals) <- geneNames
logfc <- d_piano$logFC
names(logfc) <- geneNames

Inspect the objects we just created (e.g. using head(), View() and length()) so that you know how they look and what they contain.

We now have the gene-level data in a convenient format and we are ready to continue with the analysis.

Overrepresentation analysis

First, let’s use piano to perform a hypergeometric test, similar to what we did with Enrichr, DAVID, and topGO.

Run the following code and read the printed info to understand what it does:

# Get the gene-list:
sel <- names(pvals)[pvals<0.001]
sel <- unique(sel[!is.na(sel)])
# Run the analysis:
res <- runGSAhyper(genes=sel, universe=unique(names(pvals)), gsc=gscGO, gsSizeLim=c(20, 200))
# Visualize results:
networkPlot(res, class="non", adjusted=T, significance=0.0001, ncharLabel=Inf,
            nodeSize=c(3,20), edgeWidth=c(1,15), cexLabel=0.7, overlap=20, 
            scoreColors=c("greenyellow","darkgreen"))

Here, we use the networkPlot function which draws a network for the most significant gene-sets.

Question 26: What does gsSizeLim=c(20, 200) in the code above mean?

Answer: To only include gene-sets with between 20 and 200 genes in the analysis (after filtering out genes not in the expression data).

Question 27: What do the colors, node-sizes, and edges denote?

Answer: Node size corresponds to gene-set size, i.e. number of genes. Edges denote the number of shared genes between gene-sets. Note that gene-sets without edges can also share a few genes, depending on the setting overlap. The color corresponds to the significance.

Question 28: Are the results in line with those from topGO, DAVID, and Enrichr?

Answer: The results show muscle contraction related terms, synapse organization, and lipid/sterol related processes, so the results appear to be fairly similar.

Gene-set analysis

Next, we will use piano to perform gene-set analysis, i.e. not the enrichment of a gene-list based on a cut-off, but using all available gene-level statistics. We will use signed (to keep track of fold-change) -log10-pvalues to rank the gene-sets, i.e. -log10(pvals)*sign(logfc).

Run piano and plot the results:

gsaRes <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="fgsea", gsc=gscGO, gsSizeLim=c(20, 200))

networkPlot(gsaRes, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
            nodeSize=c(3,15), edgeWidth=c(1,8), cexLabel=0.7, overlap=20,
            scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))

Read the printed info to check that the numbers make sense!

Question 29: Look at the network plot, could gene overlap between you gene-sets bias the result interpretation in any way?

Answer: There are clusters of gene-sets that share a large portion of genes, i.e. they are in fact quite similar and in extreme cases basically measure the same thing. If these sets are many it is easy to focus the conclusions on the process that those sets represent, in contrast to the processes represented only by single or a few gene-sets. However, this maybe is due to the fact that one process is more fine-grained annotated by e.g. GO. The network plot helps in visualizing this, and one can try to see each cluster as an affected process, regardless of the number of gene-sets it contains.

The network plot can be nice for visualizing your results, but the function GSAsummaryTable can be used to get a table of the all the GSA results.

Try it together with RStudios View() command:

View(GSAsummaryTable(gsaRes))

Question 30: What is the top significant GO-term affected by down-regulation and up-regulation, respectively?

Answer: Up: GO_SARCOMERE_ORGANIZATION
Down: GO_STEROID_METABOLIC_PROCESS

Now run a second GSA, according to the following:

This time use the file data/GSC/c2.cp.v5.2.symbols.gmt, which contains pathway gene-sets from MSigDB, as a gene-set collection (load it with loadGSC()).
This time use the mean method (option geneSetStat="mean") instead of fgsea.
Use only gene-sets containing 10-100 genes.

Plot the results using a network plot, play around with the options until you think it looks nice and presents the results from the GSA in a good way.

For example it could look something similar to this:

gscCP <- loadGSC("data/GSC/c2.cp.v5.2.symbols.gmt")
gsaResCP <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="mean", gsc=gscCP, gsSizeLim=c(10,100))
networkPlot(gsaResCP, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
            nodeSize=c(5,25), edgeWidth=c(1,8), cexLabel=0.7, overlap=5,
            scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))

Question 31: Are these results in line with previous results that we have seen in this exercise? Do you feel there is a consistent pattern?

Answer: Running the piano gene-set analysis on GO-terms show that muscle related processes are affected by up-regulation and that sterol/steroid metabolism is affected by down-regulation. The pathway analysis shows that muscle contraction is affected by up-regulation and that sterol/cholesterol metabolism is affected by down-regulation. This seems to be quite similar to what we saw previously.

It is always good to go back to the gene-level while looking at your final GSA results. Here we will use a heatmap for visualizing gene-level data for selected gene-sets (using one of many heatmap functions available for R).

Let’s pull out the information we need for a specific gene-set:

# Get genes for a specific gene-set:
selGenes <- names(geneSetSummary(gsaRes, "GO_STEROID_BIOSYNTHETIC_PROCESS")$geneLevelStats)

# Merge with the diffexp result table:
selTab <- d_piano[d_piano$hsapiens_homolog_associated_gene_name%in%selGenes, c(1,3,5,9,10)]
selTab$log10FDR <- -log10(selTab$FDR)*sign(selTab$logFC)
# Display this table:
kable(selTab)

	ensembl_gene_id	mgi_symbol	logFC	FDR	hsapiens_homolog_associated_gene_name	log10FDR
2	ENSMUSG00000024799	Tm7sf2	-2.1674811	0.0000000	TM7SF2	-39.3069364
11	ENSMUSG00000031604	Msmo1	-1.9232681	0.0000000	MSMO1	-30.1658228
24	ENSMUSG00000021670	Hmgcr	-1.9047857	0.0000000	HMGCR	-22.9287137
25	ENSMUSG00000021273	Fdft1	-1.4725972	0.0000000	FDFT1	-22.7646930
30	ENSMUSG00000093930	Hmgcs1	-2.1353849	0.0000000	HMGCS1	-21.3508866
31	ENSMUSG00000001467	Cyp51	-1.6536521	0.0000000	CYP51A1	-20.9697120
34	ENSMUSG00000045294	Insig1	-1.4729905	0.0000000	INSIG1	-20.1736652
53	ENSMUSG00000020917	Acly	-1.0106441	0.0000000	ACLY	-14.8880627
54	ENSMUSG00000031349	Nsdhl	-1.3312848	0.0000000	NSDHL	-14.8880627
58	ENSMUSG00000022351	Sqle	-1.2507425	0.0000000	SQLE	-14.1926825
76	ENSMUSG00000041939	Mvk	-1.0538597	0.0000000	MVK	-12.0939358
108	ENSMUSG00000027952	Pmvk	-1.0461615	0.0000000	PMVK	-10.2176927
121	ENSMUSG00000023963	Cyp39a1	-1.0528371	0.0000000	CYP39A1	-9.8008369
126	ENSMUSG00000026675	Hsd17b7	-1.4294868	0.0000000	HSD17B7	-9.6091437
149	ENSMUSG00000058454	Dhcr7	-0.8589378	0.0000000	DHCR7	-8.7776193
179	ENSMUSG00000058258	Idi1	-1.8503612	0.0000000	IDI1	-7.7828797
276	ENSMUSG00000033105	Lss	-2.3822023	0.0000019	LSS	-5.7273752
364	ENSMUSG00000059743	Fdps	-2.0813679	0.0000231	FDPS	-4.6355084
371	ENSMUSG00000034926	Dhcr24	-1.7835910	0.0000266	DHCR24	-4.5744417
556	ENSMUSG00000046027	Stard5	0.5669507	0.0006520	STARD5	3.1857624
632	ENSMUSG00000027195	Hsd17b12	-0.4651350	0.0014378	HSD17B12	-2.8423134
641	ENSMUSG00000036856	Wnt4	1.8707597	0.0015159	WNT4	2.8193167
1029	ENSMUSG00000031168	Ebp	-0.5237848	0.0205263	EBP	-1.6876888
1058	ENSMUSG00000028518	Prkaa2	1.0584440	0.0227853	PRKAA2	1.6423454
1097	ENSMUSG00000027875	Hmgcs2	0.4015057	0.0262726	HMGCS2	1.5804976
1315	ENSMUSG00000037936	Scarb1	-0.3771259	0.0504584	SCARB1	-1.2970664
1359	ENSMUSG00000028944	Prkag2	-0.5884059	0.0555413	PRKAG2	-1.2553837
1412	ENSMUSG00000024112	Cacna1h	0.6243337	0.0635828	CACNA1H	1.1966606
1474	ENSMUSG00000036880	Acaa2	0.3221681	0.0741393	ACAA2	1.1299516
1647	ENSMUSG00000026456	Cyb5r1	0.4542855	0.1048128	CYB5R1	0.9795859
1930	ENSMUSG00000031891	Hsd11b2	1.0349272	0.1505277	HSD11B2	0.8223835
2014	ENSMUSG00000018042	Cyb5r3	-0.2810385	0.1700817	CYB5R3	-0.7693423
2101	ENSMUSG00000018167	Stard3	-0.2933683	0.1865191	STARD3	-0.7292768
2530	ENSMUSG00000061758	Akr1b10	0.4704163	0.2585394	AKR1B15	0.5874733
2732	ENSMUSG00000029311	Hsd17b11	0.2579175	0.2965381	HSD17B11	0.5279195
2933	ENSMUSG00000050697	Prkaa1	0.7667416	0.3340304	PRKAA1	0.4762140
3270	ENSMUSG00000004880	Lbr	-0.2605653	0.3946105	LBR	-0.4038313
3585	ENSMUSG00000018861	Fdxr	-0.2721958	0.4506118	FDXR	-0.3461974
3626	ENSMUSG00000032051	Fdx1	0.3157853	0.4590777	FDX1	0.3381138
4027	ENSMUSG00000032323	Cyp11a1	0.6364368	0.5269210	CYP11A1	0.2782545
4080	ENSMUSG00000021214	Akr1c18	0.5195843	0.5371843	AKR1C4	0.2698766
4539	ENSMUSG00000018160	Med1	0.1443773	0.5986866	MED1	0.2228004
5060	ENSMUSG00000033715	Akr1c14	-0.3009562	0.6533384	AKR1C4	-0.1848618
5117	ENSMUSG00000034308	Sdr42e1	0.3779597	0.6597789	SDR42E1	0.1806016
5237	ENSMUSG00000021259	Cyp46a1	-0.4382325	0.6730181	CYP46A1	-0.1719732
5695	ENSMUSG00000021302	Ggps1	-0.1886793	0.7181029	GGPS1	-0.1438133
5752	ENSMUSG00000030825	Hsd17b14	-0.4391568	0.7242179	HSD17B14	-0.1401307
6796	ENSMUSG00000041736	Tspo	0.0871980	0.8089995	TSPO	0.0920518
6798	ENSMUSG00000030057	Cnbp	0.0843414	0.8090640	CNBP	0.0920171
6943	ENSMUSG00000017307	Acot8	0.1457815	0.8210293	ACOT8	0.0856413
7041	ENSMUSG00000026499	Acbd3	-0.2124423	0.8294557	ACBD3	-0.0812068
7150	ENSMUSG00000028470	Hint2	0.1065509	0.8376562	HINT2	0.0769342
7442	ENSMUSG00000003721	Insig2	-0.0982338	0.8560106	INSIG2	-0.0675208
7562	ENSMUSG00000022244	Amacr	0.1082852	0.8643204	AMACR	0.0633252
7761	ENSMUSG00000028603	Scp2	-0.2133358	0.8753502	SCP2	-0.0578182
7834	ENSMUSG00000030670	Cyp2r1	-0.2499409	0.8829956	CYP2R1	-0.0540414
7934	ENSMUSG00000039519	Cyp7b1	-0.0509951	0.8898487	CYP7B1	-0.0506838
8170	ENSMUSG00000026170	Cyp27a1	0.1069738	0.9068471	CYP27A1	0.0424659
8482	ENSMUSG00000024507	Hsd17b4	-0.0399601	0.9246019	HSD17B4	-0.0340452
8856	ENSMUSG00000029233	Srd5a3	0.0349583	0.9415716	SRD5A3	0.0261466
8993	ENSMUSG00000031982	Arv1	-0.0708507	0.9469453	ARV1	-0.0236751
9255	ENSMUSG00000052534	Pbx1	0.0231469	0.9578520	PBX1	0.0187016
9984	ENSMUSG00000031574	Star	0.0146487	0.9846159	STAR	0.0067332

Question 32: Does the table look like you expect it to (did the human to mouse gene symbol matching work, are the signed log10-FDRs consistent with the logFC and FDR columns)?

Answer: Yes, a visual inspection shows that the mapping is correct.

Now, let’s fetch the count data for these genes and make a heatmap of all the information we have collected:

# Read in the count data:
tableCounts <- read.table(file="data/tableCounts", sep="\t", header=TRUE, skip=1)
rownames(tableCounts) <- tableCounts[,1]
tableCounts <- tableCounts[,7:12]
colnames(tableCounts) <- c("KO_1","KO_2", "KO_3", "WT_1", "WT_2", "WT_3");

# Merge count data with the selTab:
selTab <- merge(selTab, tableCounts, by.x="ensembl_gene_id", by.y=0, all.x=T)

# Plot a heatmap:
# Row annotation:
rowann <- list(Significant=ifelse(selTab$FDR<0.05,"yes","no"), Regulation=ifelse(selTab$logFC>0,"Up","Down"))
rowannCol <- list(Significant=c("black","green"), Regulation=c("blue","red"))
# Row labels:
rowlabs <- selTab$mgi_symbol
rowlabs[duplicated(rowlabs)] <- paste(rowlabs[duplicated(rowlabs)]," ")
# Plot:
aheatmap(selTab[,7:12], scale="row", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])

Note that here we scale the rows so the colors will be visualizing the difference for each gene specifically, but is not comparable across genes. Therefore we also include the actual counts in each heatmap cell. Also, note that these are raw counts. As an alternative we could have plotted normalized CPMs (counts per million) using the command:

cpms <- cpm(calcNormFactors(DGEList(tableCounts), method='TMM'))

Try this if you have time!

Question 33: Why are the rows and columns reordered?

Answer: A clustering is performed on both the columns and rows in order to group genes and samples with similar patterns.

Make a heatmap without the scaling.

Question 34: Does it look “better”, what is the drawback/benefit?

Answer:

aheatmap(selTab[,7:12], scale="none", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])

Since a few genes have very high counts, it becomes difficult to see the patterns for the remaining genes.
One can take the log of the counts to partly alleviate this:

aheatmap(log10(selTab[,7:12]), scale="none", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])

Change the row-labels to mouse ENSEMBL IDs.

Answer:

# Row labels:
rowlabs <- selTab$ensembl_gene_id
rowlabs[duplicated(rowlabs)] <- paste(rowlabs[duplicated(rowlabs)]," ")
# Plot:
aheatmap(selTab[,7:12], scale="row", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])

Question 35: Given the count data for these genes, does the differential expression results make sense, as indicated by the column annotation columns? Any specific genes that you are concerned about?

Answer: These are the genes belonging to GO_STEROID_BIOSYNTHETIC_PROCESS, which was shown to be affected by down-regulation in the gene-set analysis. Here we can see that a high portion of the genes are signifcant, and that the vast majority of significant genes are in fact down-regulated. Looking at the count data (the actual heatmap) one can also see a pattern of blue to the left (KO) and red to the right (WT), indicating a lower expression of these genes in the KO:s.

Question 36: Would you say from looking at the heatmap, that in general steroid biosynthesis is down-regulated or up-regulated? Does your answer reflect the GSA results as seen in the network plot above?

Answer: Down-regulated. It does reflect the network plot results.

Spot-check some of these genes to see what their functions are (Google search or similar).

Question 37: Are these genes actually involved in steroid biosynthesis? Are you confident enough from the data and analysis results to say anything about how YAP/TAZ knockout affects this biological process?

Answer: I spot checked ACAA2 (involved in fatty acid beta-oxidation), SCARB1 (receptor for high density lipoprotein cholesterol, HDL), HMGCR (rate-limiting enzyme for cholesterol synthesis). One could check more, but at least from these it seems correct.

The paper from where this data is taken also reports that YAP/TAZ regulate expression of lipid synthetic enzymes and found down-regulation of lipid and sterol biosynthesis in the KO:s (see e.g. Figure 6).

If you have time, do a similar heatmap but for another gene-set of your choice.

Session info

This page was generated using the following R session:

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  methods   stats     graphics  grDevices
##  [8] utils     datasets  base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2   edgeR_3.20.1         limma_3.34.2        
##  [4] Rgraphviz_2.22.0     org.Mm.eg.db_3.5.0   NMF_0.20.6          
##  [7] cluster_2.0.6        rngtools_1.2.4       pkgmaker_0.22       
## [10] registry_0.3         piano_1.18.0         biomaRt_2.34.0      
## [13] topGO_2.30.0         SparseM_1.77         GO.db_3.5.0         
## [16] AnnotationDbi_1.40.0 IRanges_2.12.0       S4Vectors_0.16.0    
## [19] Biobase_2.38.0       graph_1.56.0         BiocGenerics_0.24.0 
## [22] knitr_1.17           BiocInstaller_1.28.0
## 
## loaded via a namespace (and not attached):
##  [1] foreach_1.4.3       bit64_0.9-7         gtools_3.5.0       
##  [4] assertthat_0.2.0    highr_0.6           blob_1.1.0         
##  [7] yaml_2.1.14         progress_1.1.2      slam_0.1-40        
## [10] RSQLite_2.0         backports_1.1.1     lattice_0.20-35    
## [13] digest_0.6.12       colorspace_1.3-2    htmltools_0.3.6    
## [16] plyr_1.8.4          XML_3.98-1.9        pkgconfig_2.0.1    
## [19] xtable_1.8-2        relations_0.6-7     scales_0.5.0       
## [22] gdata_2.18.0        BiocParallel_1.12.0 tibble_1.3.4       
## [25] ggplot2_2.2.1       lazyeval_0.2.1      magrittr_1.5       
## [28] memoise_1.1.0       evaluate_0.10.1     doParallel_1.0.11  
## [31] gplots_3.0.1        tools_3.4.2         data.table_1.10.4-3
## [34] prettyunits_1.0.2   gridBase_0.4-7      matrixStats_0.52.2 
## [37] stringr_1.2.0       locfit_1.5-9.1      munsell_0.4.3      
## [40] compiler_3.4.2      caTools_1.17.1      rlang_0.1.2        
## [43] RCurl_1.95-4.8      iterators_1.0.8     marray_1.56.0      
## [46] igraph_1.1.2        bitops_1.0-6        rmarkdown_1.6      
## [49] gtable_0.2.0        codetools_0.2-15    DBI_0.7            
## [52] sets_1.0-17         reshape2_1.4.2      R6_2.2.2           
## [55] gridExtra_2.3       bit_1.1-12          fastmatch_1.1-0    
## [58] fgsea_1.4.0         rprojroot_1.2       KernSmooth_2.23-15 
## [61] stringi_1.1.5       Rcpp_0.12.13

Functional annotation and interpretation

Setup and load data

Web-based enrichment analysis

DAVID

Enrichr

Enrichment analysis in R

Visualizing GO-terms with TopGO

Piano

Overrepresentation analysis

Gene-set analysis

Session info