Statistical analysis of RNA-seq data

NGS cycle

29 April 2026

Preamble

Practical information


At the bottom left, there is a menu to better navigate through the slides


The content of this training course was highly inspired by previous training courses given by Pierre Pericard (2023) and Guillemette Marot (2022 and years before).
They themselves drew inspiration from existing materials, written mainly by Hugo Varet and the biostatistics team from the Pasteur Institute Bioinfo & Biostat Hub ; and J. Aubert, C. Hennequet-Antier (Inrae), M.A. Dillies and H. Varet (Institut Pasteur Paris).

Bilille


Bilille is the Lille bioinformatics and biostatistics platform, within the UAR 2014 - US 41 “Plateformes Lilloises en Biologie et Santé”.


PLBS includes 8 platforms, providing access to expertise and equipment to support research in biology and health.

Bilille currently has 13 engineers, led by Jimmy Vandel (IR CNRS), Mamadou-Dia Sow (IR Univ. Lille) and Ségolène Caboche (IR Univ. Lille).


Our missions are to :

  • support scientific projects
  • organise training courses
  • provide access to cloud computing resources
  • ensure access to software resources
  • conduct scientific and technical animation

Presentation outline


  1. Introduction

  2. Experimental design

  3. Exploration

  4. Normalisation

  5. Differential analysis

  6. Enrichment analysis

1. Introduction

Differential analysis

A differential analysis is a comparison. This comparison may be between treatments, states, conditions…


Example :

  • ill vs healthy
  • smoker vs non-smoker
  • mutation A vs wild type

Particularities of NGS data :

  • Very few individuals
  • Many tests (one per variable)
  • Count data (statistical distributions different from the ones used for continuous data from microarrays)

Differential analysis


Obtaining a result from a statistical procedure does not mean that this result is reliable. If you do not know the assumptions behind the procedure, be careful with the interpretation or ask an expert to help you.


Most of the time, there is no unique solution ⇒ statisticians do not know every statistical procedure ever developed (for example, the Bioconductor project alone has more than 2000 R packages), but they have the skills to understand them.


“All models are wrong but some are useful” (G. Box, 1978)

Differential analysis

DGE : differential gene expression

DTE : differential transcript expression

DTU : differential transcript usage

This course focuses on DGE

Differential analysis

A gene is declared differentially expressed if the observed difference between two conditions is statistically significant, that is to say greater than what natural random variation alone would explain.

Key steps for statisticians :

  • Experimental design

  • Normalisation

  • Differential analysis

  • Multiple testing

Data example - Count

Here is an example of a count data table.
This dataset will be used to produce examples throughout the training session.
The raw data (fastq files) come from a publication, and an RNA-Seq pipeline (nf-core/rnaseq) was used to generate this count file.

Data example - Metadata


With count data, you need to know the characteristics of each sample, at least the condition that you want to use in the comparison.
You can also include any other information you have, such as gender, date of sequencing or age. This kind of information helps to explain your data.


Here, we have 9 samples and 1 condition with 3 different levels : wild type (WT), mutant A (Mutation_A) and mutant B (Mutation_B).

Mutation_A and Mutation_B are two different mutants of a transcription factor, so we can expect many expression changes.

2. Experimental design

Experimental design




“To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.”

Ronald A. Fisher, Indian Statistical Congress, 1938, vol. 4, p 17

Experimental design


Experimental design is defined as the process of organising an experiment to ensure the collection of appropriate and sufficient data to answer research questions clearly and efficiently. It involves creating a structured approach that reduces variation and enhances the validity of conclusions drawn from the data.

Here you may find a publication on experimental design


Some rules have to be followed when building an experimental design :

  • define a single, clear biological question
  • identify the factors of variation and collect metadata from the experiment and the sequencing
  • choose the bioinformatics and statistical analysis tools a priori
  • draw conclusions from the results

Garbage in - garbage out

Example


10 patients have the same disease.
5 of them received a treatment, while the other 5 patients did not receive any treatment.

We want to identify differentially expressed genes between the patients who received the treatment and the others.


1 - Is there a unique biological question ?

Yes !

2 - What biological factors could induce variations, other than taking a treatment?

Age, sex, weight, … any aspect linked to the studied organism.

3 - What technical factors could induce variations ?

Date of sequencing, the person who performed the sequencing, … any aspect linked to the technical experiment.


NB : We have no control over biological sources of variation, unlike technical variation, which we cannot eliminate but can at least keep to a minimum through a careful design.

Replicates

There are 2 kinds of replicates : biological and technical.


Biological replicate
Repetition of the same experimental protocol with independent data acquisition : several samples.


Technical replicate
Same biological material but independent replications of the technical steps : several extracts from the same sample.


Sequencing technology does not eliminate biological variability
Nature Biotechnology Correspondence, 2011

Replicates

Exercise

Sequencing design

Transcriptome differences between Cystic Fibrosis (CF) patients and healthy people.

Which of these design(s) is/are acceptable ?

NO

YES

YES

Exercise

Confounding effect

Transcriptome differences between Cystic Fibrosis (CF) patients and healthy people.

Can you identify confounding effects ?

A gene is detected as being differentially expressed between healthy and CF patients. Is it due to the disease ? The age effect ? The gender effect ? The date effect ? The technician effect ?

Example of good design

Transcriptome differences between Lung cancer (LC) patients and healthy people.

  • The age distribution is the same in both conditions
  • There are males and females in both conditions
  • The date is evenly distributed across conditions

Conclusion


Before the experiment :

  • ask a unique and well defined biological question
  • list all possible biological confounding effects (gender, age, ..)
  • collect the samples while balancing unwanted sources of variation across conditions (date and technician for sequencing, ..)
  • Include at least 3 biological replicates per condition. Technical replicates are not necessary.


What about increasing the number of biological replicates ?

It would allow us to generalise to the population level.

Illustration of sampling

Set of all mice we could measure

Illustration of sampling


Selection of 3 mice per condition

Sample 1

Sample 2

Illustration of sampling

Sample 3 : Non representative of the whole population


Increasing the number of biological replicates gives a more precise estimate of the within-condition variability and improves the detection of DE transcripts.

3. Exploration

SARTools

SARTools = Statistical Analysis of RNA-Seq Tools.


SARTools is available in R and Galaxy and allows you to :

  • perform a systematic quality control of the data
  • perform differential analysis including normalisation and multiple testing
  • facilitate the utilisation of DESeq2 and edgeR


It provides :

  • an HTML report which displays all the figures produced, explains the statistical methods and gives the results of the differential analysis
  • a list of all parameters used (reproducible research)
  • results exported into easily readable tab-delimited files

SARTools

https://github.com/PF2-pasteur-fr/SARTools

Sample-by-sample

Total number of counts

Sample-by-sample

Percent of zero counts

Sample-by-sample

Most expressed genes

The top three most expressed genes are not necessarily the same in every sample.
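The three sample-by-sample summaries above (total number of counts, percent of zero counts, most expressed gene) can be sketched in a few lines. This is only an illustration on a hypothetical toy table: the gene names, sample names and the `sample_summary` helper are assumptions, not the course dataset nor the SARTools implementation.

```python
# Hypothetical toy count table: gene -> raw counts, one value per sample.
counts = {
    "geneA": [120, 0, 95],
    "geneB": [0, 0, 3],
    "geneC": [560, 610, 480],
    "geneD": [15, 22, 0],
}
samples = ["S1", "S2", "S3"]


def sample_summary(counts, samples):
    """Return total counts, percent of zero counts and top gene per sample."""
    summary = {}
    n_genes = len(counts)
    for j, name in enumerate(samples):
        # Extract the column of counts belonging to sample j.
        column = {gene: values[j] for gene, values in counts.items()}
        n_zero = sum(1 for value in column.values() if value == 0)
        summary[name] = {
            "total": sum(column.values()),
            "pct_zero": 100.0 * n_zero / n_genes,
            "top_gene": max(column, key=column.get),
        }
    return summary


for name, stats in sample_summary(counts, samples).items():
    print(name, stats)
```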

Sample-by-sample

Troubleshooting

SERE

The SERE (Simple Error Ratio Estimate) coefficient :

  • assesses the similarity/dissimilarity between samples
  • is calculated between 2 samples
  • is more suited to RNAseq data than the Pearson and Spearman’s correlation coefficients
  • is not very easy to interpret with many samples

SERE

Example

As expected, we can note that :

  • the SERE coefficient between a sample and itself is 0
  • the SERE coefficients between 2 biological replicates are slightly greater than 1
  • the SERE coefficients between samples from different conditions are much greater
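To make these reference values concrete, here is a minimal sketch of a SERE computation between 2 samples. It assumes the usual definition (per-gene variance between observed counts and the counts expected if the samples only differed by library size, averaged over genes with at least one read); the function name `sere` is hypothetical and this is not the SARTools code.

```python
import math


def sere(sample_a, sample_b):
    """SERE between two samples given as lists of raw counts (same gene order)."""
    total_a, total_b = sum(sample_a), sum(sample_b)
    grand_total = total_a + total_b
    variance_sum, n_genes = 0.0, 0
    for y_a, y_b in zip(sample_a, sample_b):
        gene_total = y_a + y_b
        if gene_total == 0:
            continue  # genes never observed carry no information
        # Counts expected if the samples only differed by library size.
        exp_a = gene_total * total_a / grand_total
        exp_b = gene_total * total_b / grand_total
        # Per-gene variance estimate, with (2 - 1) = 1 degree of freedom.
        variance_sum += (y_a - exp_a) ** 2 / exp_a + (y_b - exp_b) ** 2 / exp_b
        n_genes += 1
    if n_genes == 0:
        raise ValueError("no gene with a non-zero count")
    return math.sqrt(variance_sum / n_genes)


print(sere([10, 20, 30], [10, 20, 30]))  # a sample compared with itself
print(sere([10, 20, 30], [30, 20, 10]))  # two clearly different samples
```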

SERE

Exercise

What conclusion could you make when looking at this table of SERE coefficients ?

  • What about MutB samples ?
  • What about MutA samples ?
  • What about WT samples ?


Samples MutA2 and WT1 may have been swapped : it’s worth investigating.

Multivariate representations


So far, we have seen representations of counts by sample or between pairs of samples.
Now, we would like to visualise the whole dataset, considering every sample.


The main goal of the multivariate exploratory data analysis is to explore the whole structure of the dataset, to better understand the proximity between samples and detect possible problems.
This is a quality control step.


Two main tools

  • Principal Component Analysis
  • Clustering

Both these methods depend on a notion of distance between samples.
To compute distances properly, the variance must be independent of the mean : data have to be homoscedastic.

Variance increases with intensity

Homoscedasticity is not satisfied : variance increases with mean.

-> It’s necessary to transform data before performing PCA and clustering.

Transformation

Several transformation methods exist, for example :

  • DESeq2 proposes VST (Variance Stabilising Transformation) and rlog (Regularised Log Transformation)
  • edgeR proposes a transformation of the count data as moderated log-counts-per-million


All these methods produce transformed data that we can use to perform PCA and clustering.
We usually use the vst method from the DESeq2 package, which runs faster than the rlog method.

PCA


It’s easy to visualise pair-wise relations, through scatter plots :


In the context of RNAseq, we have n individuals (samples) and p variables (genes).

Performing a PCA extends the visualisation to p variables, where p is much greater than 2.

PCA


Main goal : explore the structure of the dataset to better understand the proximity between samples and detect possible problems (often used as a quality control step)

  • summarise information and visualise points in a space of reduced dimension
  • describe links between variables and which ones explain most variability
  • highlight homogeneous subgroups
  • detect aberrant individuals

Principle : find axes on which one can project points to obtain a space of reduced dimension.
PCA uses a criterion based on variance to build new axes, also called components, in order to preserve variability. These new components are linear combinations of the initial variables.

PCA

Important scores


Percentage of inertia associated with an axis :

  • Proportion of the total information supported by this axis
  • Decreases with the axis rank (by construction)


Number of axes to interpret :

  • Such that the cumulative percentage of inertia is greater than a threshold
  • Elbow criterion
  • And many other methods
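The cumulative-threshold rule can be illustrated with a short sketch. It assumes the eigenvalues of the PCA (the variance carried by each axis) are already available; the function names and the 80% default threshold are hypothetical choices for the example.

```python
def inertia_percentages(eigenvalues):
    """Percentage of inertia carried by each axis."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


def axes_to_keep(eigenvalues, threshold=80.0):
    """Smallest number of leading axes whose cumulative inertia reaches threshold (%)."""
    ordered = sorted(eigenvalues, reverse=True)  # inertia decreases with axis rank
    cumulative = 0.0
    for rank, pct in enumerate(inertia_percentages(ordered), start=1):
        cumulative += pct
        if cumulative >= threshold:
            return rank
    return len(ordered)


eigenvalues = [6.0, 3.0, 1.0]  # hypothetical values
print(inertia_percentages(eigenvalues))
print(axes_to_keep(eigenvalues, threshold=80.0))
```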

PCA

PCA - example

PCA - examples with troubleshoot

What conclusion could you make when looking at this projection ?

Samples MutA2 and WT1 may have been swapped : it’s worth investigating.

PCA - examples with troubleshoot

What conclusion could you make when looking at this projection ?

Sample MutB1 behaves like an outlier : it’s worth investigating.

PCA - examples with troubleshoot

In this example, we aim to perform the differential analysis comparing Chow and HFD.

What conclusion could you make when looking at these projections ?

The effect of Sex is greater than the effect of Phenotype.
The variable Sex is then treated as a batch effect, and has to be controlled for in the differential analysis.

Clustering

Goal : build groups of samples such that :

  • samples within a group are similar
  • samples from distinct groups are different

Two main approaches :

K-means

  • initialise randomly the centers of clusters
  • compute the distances between each point and every center
  • assign each point to the nearest center
  • update the centers of clusters
  • repeat until the defined criterion is reached
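The K-means steps above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and "assignments no longer change" as the stopping criterion; the `init` and `seed` parameters are hypothetical conveniences so that a run can be reproduced.

```python
import math
import random


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (labels, centers) for a list of points given as tuples."""
    rng = random.Random(seed)
    # Random initialisation of the centers unless explicit ones are given.
    centers = [tuple(c) for c in (init if init is not None else rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # stopping criterion: assignments no longer change
        labels = new_labels
        # Update each center as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2, init=[points[0], points[3]])
print(labels, centers)
```

Because the default initialisation is random, two runs with different seeds may give different clusters, which is the drawback mentioned for this approach.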

Hierarchical clustering

  • compute the distances between each pair of points
  • gather the 2 nearest points into a unique cluster
  • compute the distances between this cluster and each point
  • gather the 2 nearest clusters/points
  • repeat until getting a single cluster

Hierarchical clustering

In RNAseq, we use the hierarchical clustering method. This method doesn’t depend on a random initialisation and gives the same results from one run to the next.
What’s more, it provides a graphical representation : the dendrogram.


Parameters

  • Distance between two samples
  • Aggregation criterion (ie distance between two elements including a cluster) :
    • average linkage : average distance between all the samples
    • single linkage : distance between the 2 closest samples
    • complete linkage : distance between the 2 furthest samples
    • Ward : merge the clusters that lead to the cluster with minimum variance

We usually use the Euclidean distance with the Ward criterion (option method="ward.D2" in the R hclust function).
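As an illustration of the agglomeration loop and of the linkage parameter, here is a naive sketch with Euclidean distance and single, complete or average linkage. The Ward criterion is deliberately not implemented here, and the function name is hypothetical; in practice the R hclust function does this work.

```python
import math


def hierarchical_clustering(points, linkage="average"):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    combine = {
        "single": min,                          # 2 closest samples
        "complete": max,                        # 2 furthest samples
        "average": lambda d: sum(d) / len(d),   # mean over all pairs
    }[linkage]
    clusters = [[i] for i in range(len(points))]  # each sample starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [math.dist(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]]
                d = combine(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # gather the 2 nearest clusters
        del clusters[b]
    return merges


# Hypothetical 1-dimensional samples; the merge order defines the dendrogram.
for merge in hierarchical_clustering([(0.0,), (1.0,), (10.0,), (12.0,)]):
    print(merge)
```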

HC - example

HC - examples with troubleshoot

What conclusion could you make when looking at this dendrogram ?

Samples MutA2 and WT1 may have been swapped : it’s worth investigating.

HC - examples with troubleshoot

What conclusion could you make when looking at this dendrogram ?

Sample MutB1 behaves like an outlier : it’s worth investigating.

4. Normalisation

Normalisation

Definition : normalisation is a process designed to identify and correct some technical biases while removing as little biological signal as possible. This step is technology- and platform-dependent.


Within-sample normalisation : Normalisation enabling comparisons of fragments (genes) from a unique sample. Not needed in a differential analysis context.

Between-sample normalisation : Normalisation enabling comparisons of fragments (genes) from different samples.

Sources of variability

Read counts are proportional to expression level, gene length and sequencing depth (same RNAs in equal proportions).


Within-sample :

  • Gene length
  • Sequence composition (GC content)

Between-sample :

  • Depth (total number of sequenced and mapped reads)
  • Sampling bias in library construction ?
  • Presence of majority fragments
  • Sequence composition due to the PCR-amplification step in library preparation (Pickrell et al., 2010 ; Risso et al., 2011)

What is a differentially expressed gene ?

Goal of DESeq2/edgeR normalisations

Correction for differences in library sizes

Assumptions:

  • The majority of genes is not differentially expressed
  • There are as many downregulated genes as upregulated genes

The first goal is to correct for differences in library size. After sequencing, the number of reads differs from one sample to another, and we want to limit this size effect.

In the example below, sample 2 has twice as many reads in total as sample 1. After normalisation, however, there is no difference left between the two samples.

What we witness
         Sample1   Sample2
SOX2          30        60
BRCA1         50       100
TP53          20        40
EGFR         100       200
Total        200       400

Reality
         Sample1   Sample2
SOX2          30        30
BRCA1         50        50
TP53          20        20
EGFR         100       100
Total        200       200

Goal of DESeq2/edgeR normalisations

Correction for differences in library compositions

Assumptions:

  • The majority of genes is not differentially expressed
  • There are as many downregulated genes as upregulated genes

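The second goal is to address differences in library composition. Two libraries may have the same size without reflecting reality : in the example below, EGFR captures most of the reads in sample 1, which makes the other genes look less expressed than in sample 2 although their real levels are identical.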

What we witness
         Sample1   Sample2
SOX2           6        30
BRCA1          6        30
TP53           6        30
EGFR          72         0
Total         90        90

Reality
         Sample1   Sample2
SOX2           2         2
BRCA1          2         2
TP53           2         2
EGFR          24         0
Total         30         6

DESeq2 normalisation

Assumptions:

  • The majority of the genes is not differentially expressed
  • As many down- as up-regulated genes

DESeq2 computes a size factor \(s_{j}\) per sample:

\({s}_{j} = \underset{i}{median}\frac{x_{ij}}{(\prod^{n}_{v=1}x_{iv})^{1/n}}\)


  • \(x_{ij}\) : number of reads for gene \(i\) in sample \(j\)
  • \(n\) : number of samples studied
  • \(s_{j}\) : normalisation factor for sample \(j\)


To normalise counts :

\(x'_{ij} = \frac{x_{ij}}{s_{j}}\)
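The size factor formula above can be sketched as a median of ratios to the per-gene geometric mean. The sketch below works on the log scale and skips genes with a zero count in any sample (their geometric mean is zero); the function names are hypothetical and this is only an illustration, not the DESeq2 code. The counts are those of the library composition example.

```python
import math
import statistics


def size_factors(counts):
    """counts: dict gene -> list of raw counts (one per sample)."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if min(values) == 0:
            continue  # geometric mean is zero: the gene is skipped
        # Log of the geometric mean of the gene across the n samples.
        log_geo_mean = sum(math.log(v) for v in values) / n_samples
        for j, value in enumerate(values):
            log_ratios[j].append(math.log(value) - log_geo_mean)
    # Median over genes, taken on the log scale then transformed back.
    return [math.exp(statistics.median(ratios)) for ratios in log_ratios]


def normalise(counts, factors):
    """Divide each count x_ij by the size factor s_j of its sample."""
    return {gene: [v / s for v, s in zip(values, factors)]
            for gene, values in counts.items()}


counts = {"SOX2": [6, 30], "BRCA1": [6, 30], "TP53": [6, 30], "EGFR": [72, 0]}
factors = size_factors(counts)
print(factors)
print(normalise(counts, factors))
```

With these counts, the three unchanged genes end up with identical normalised values in both samples, while EGFR remains clearly different.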

edgeR normalisation

Assumptions:

  • The majority of the genes is not differentially expressed
  • As many down- as up-regulated genes

edgeR computes a normalisation factor \(f_j\) per sample and normalises the total number of reads \(N_j\) in each sample:

\(N'_{j} = f_{j} \times N_{j}\)

We can calculate DESeq2-like size factors \(s_{j}\) in order to normalise the counts:

\(s_{j} = \frac{N'_{j}}{\frac{1}{n}\sum_{k} N'_{k}}\)

and then :

\(x'_{ij} = \frac{x_{ij}}{s_{j}}\)
  • \(x_{ij}\) : number of reads for gene \(i\) in sample \(j\)
  • \(n\) : number of samples studied
  • \(s_{j}\) and \(f_{j}\) : normalisation factors for sample \(j\)
  • \(N_{j}\) : total number of reads in sample \(j\) (library size)

Other normalisation methods

Total number of reads



\(s_{j} = \frac{N_{j}}{\frac{1}{n}\sum_{k}N_{k}}\)


Or


\(s_{j} = \frac{N_{j}}{\sqrt[n]{\prod_{k}N_{k}}}\)

\(n\) : number of samples studied

\(s_{j}\) : normalisation factor for sample \(j\)

\(N_{j}\) : total number of reads in sample \(j\) (library size)

  • Robustness issue if a gene catches a very high number of reads.
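A minimal sketch of the total number of reads method, using the first formula (library size divided by the mean library size) and the counts of the library size example. The function name is hypothetical.

```python
def total_count_factors(library_sizes):
    """s_j = N_j divided by the mean library size."""
    mean_size = sum(library_sizes) / len(library_sizes)
    return [size / mean_size for size in library_sizes]


# Counts per sample for SOX2, BRCA1, TP53 and EGFR.
sample_1 = [30, 50, 20, 100]
sample_2 = [60, 100, 40, 200]

factors = total_count_factors([sum(sample_1), sum(sample_2)])
normalised_1 = [x / factors[0] for x in sample_1]
normalised_2 = [x / factors[1] for x in sample_2]
print(factors)
print(normalised_1)
print(normalised_2)
```

Since every read contributes to the library size, a single gene capturing a very high number of reads shifts the factor for the whole sample, which is the robustness issue noted above.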

RPKM (Reads Per Kilobase Per Million mapped reads)



\(x'_{ij} = \frac{x_{ij}}{N_{j} \times L_{i}} \times 10^{6} \times 10^{3}\)

\(x_{ij}\) : number of reads for gene \(i\) in sample \(j\)

\(N_{j}\) : total number of reads in sample \(j\) (library size)

\(L_{i}\) : Length of gene \(i\)

  • Same issue as the total number of reads method
  • Introduces other biases
  • No need to correct for gene length in a differential analysis : the same gene is compared across samples, so its length is “fixed”
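The RPKM formula translates directly into code. The numbers in the usage line are hypothetical and only serve to show the units (reads, total reads, gene length in base pairs).

```python
def rpkm(count, library_size, gene_length):
    """Reads Per Kilobase per Million mapped reads.

    count: number of reads for the gene in the sample (x_ij)
    library_size: total number of reads in the sample (N_j)
    gene_length: length of the gene in base pairs (L_i)
    """
    return count / (library_size * gene_length) * 1e6 * 1e3


# Hypothetical gene: 500 reads, 20 million reads in the sample, 2000 bp long.
print(rpkm(500, 20_000_000, 2000))
```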

Size factor vs library size

We expect to see all the dots on the diagonal.

Effect of the normalisation (DESeq2 or edgeR)

After normalisation, you can see that the third quartiles are better aligned across samples.

5. Differential analysis

Differential analysis


Goal : Detect differentially expressed genes between two conditions.


Why replicates ?

In a perfect world : neither biological nor technical variability. So one sample from each condition would be enough to conclude.

In our world : each individual has its own behavior. So we need biological replicates to estimate within-condition variability.

Statistical test

  • state the null and the alternative hypotheses

    • \(H_0\) {the mean expression of the gene is identical between both conditions}
    • \(H_1\) {the mean expression of the gene is different between both conditions}
  • consider the statistical assumptions and distributions
  • calculate the appropriate test statistic T
  • derive the distribution of the test statistic under the null hypothesis
  • select a significance level \(\alpha\) : a probability threshold below which the null hypothesis will be rejected

\(H_0\) is retained by default : without sufficient evidence, there is no rejection. Failing to reject \(H_0\) does not mean that \(H_0\) is true.

Statistical test

p-value and conclusion of the test


p-value is the probability to obtain, under the null hypothesis, a test statistic at least as extreme as the one that we actually observed.

In other terms :

  • assume that the mean expression of the gene is the same in both conditions (ie we’re under the null hypothesis) : the p-value is then the probability of observing a test statistic at least as extreme as the actual T
  • rejecting \(H_0\) whenever the p-value is at most \(\alpha\) means that a true \(H_0\) is wrongly rejected with probability at most \(\alpha\)

Conclusion : if p-value \(\leqslant\) \(\alpha\) then we reject \(H_0\)

\(\Rightarrow\) We consider that the mean expression of the gene differs between the two conditions.

Errors

Let \(\mu_1\) and \(\mu_2\) be the mean expression of gene g in the first and second condition respectively.

We wish to test the hypothesis : \(H_0\) : \(\mu_1\) = \(\mu_2\) vs \(H_1\) : \(\mu_1\) \(\neq\) \(\mu_2\)


The risks can be summarised in :

Overdispersion

  • \(\pi_G\) the proportion of fragments from gene G
  • \(X_G\) the number of reads from gene G
  • \(N\) the total number of reads

\[ X_G \sim Binomial(N, \pi_G) \approx Poisson(N\pi_G) \]

But in practice the variance increases with intensity faster than the Poisson model allows, because of biological variability :

Technical variability is the main source of variability for low counts, whereas biological variability dominates for high counts.

If overdispersion is ignored, the type I error rate (probability of wrongly declaring a gene DE) increases.

Overdispersion

If \(X_G \sim Poisson(N\pi_G)\), then \(mean(X_G) = variance(X_G) = N\pi_G\).

This is not satisfied \(\rightarrow\) we need a distribution whose variance can differ from (and exceed) the mean.


  • \(x_{ij}\) the number of reads that align on gene i for sample j

\[ X_{ij} \sim Negative-Binomial(mean = \mu_{ij}, variance = \sigma_{ij}^2)\]


DESeq2 and edgeR use models based on the Negative Binomial distribution, but other methods exist.
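To see what overdispersion means in numbers, the sketch below assumes the common parameterisation variance = mean + dispersion × mean², and estimates the dispersion of one gene by a simple method of moments. The replicate counts and function names are hypothetical; DESeq2 and edgeR rely on more elaborate estimators that share information across genes.

```python
import statistics


def nb_variance(mu, alpha):
    """Negative Binomial variance for mean mu and dispersion alpha (assumed form)."""
    return mu + alpha * mu ** 2


def moment_dispersion(counts):
    """Naive method-of-moments dispersion estimate for one gene."""
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # sample variance
    return max(0.0, (variance - mean) / mean ** 2)


replicates = [80, 100, 150]  # hypothetical normalised counts of one gene
mean = statistics.mean(replicates)
variance = statistics.variance(replicates)
alpha = moment_dispersion(replicates)
print(mean, variance)            # a Poisson model would require mean == variance
print(alpha, nb_variance(mean, alpha))
```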

Generalised Linear Models


One generalised linear model per gene :

  • estimation of coefficients for each contrast
  • test for significance of coefficients using a Wald test


Coefficients

Coefficients of models correspond to the log2 Fold-Change computed for each contrast


Significance

Significance is expressed through the p-value

Fold-Change

  • \(\mu_1\) = \(mean\ of\ normalised\ values\ in\ condition\ 1\)

  • \(\mu_2\) = \(mean\ of\ normalised\ values\ in\ condition\ 2\)

  • Fold-Change = \(\frac{\mu_1}{\mu_2}\)

  • Log2 Fold-Change = \(\log_2\)(Fold-Change) = \(\log_2(\mu_1) - \log_2(\mu_2)\)


The Fold-Change is a measure describing how much a quantity changes.
Why isn’t it enough to use the Fold-Change to find differentially expressed genes ?


Fold-change does not take the variance of the samples into account.

For example, the difference between 102 and 100 is the same as between 4 and 2, but it does not have the same importance relative to the baseline value.
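The 102 vs 100 and 4 vs 2 example can be checked with a short sketch of the Fold-Change definitions given above. The function name is hypothetical.

```python
import math


def log2_fold_change(mean_1, mean_2):
    """log2(mu_1 / mu_2), equal to log2(mu_1) - log2(mu_2)."""
    return math.log2(mean_1 / mean_2)


# Same absolute difference (2), very different relative change.
print(102 - 100, log2_fold_change(102, 100))
print(4 - 2, log2_fold_change(4, 2))
```

Neither the difference nor the ratio says anything about how variable the replicates are, which is why a statistical test is still needed.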

Shrunken Fold-Change

It’s recommended to shrink the log2 Fold-Change :

  • it lowers the log2 Fold-Change of genes where the number of counts is low
  • it reduces the False Discovery Rate
Figure : unshrunken vs shrunken log2FC (source : ENS Lyon)

One of the most widely used shrinkage methods is Adaptive Shrinkage (ashr package), but you might also encounter Approximate posterior estimation for GLM coefficients (apeglm).

Differentially expressed genes

Practical importance and statistical significance have little to do with each other :

  • An effect can be important, but undetectable (statistically insignificant) because the data are few, irrelevant, or of poor quality
  • An effect can be statistically significant (detectable) even if it is small and unimportant, if the data are many and of high quality


Usually, we apply 2 thresholds, one on the p-value and one on the log2FC, to detect differentially expressed genes, in order to control respectively :

  • the statistical effect
  • the biological effect

DESeq2 and edgeR

DESeq2 and edgeR are the most used, but many other tools exist, for example NBPSeq, TSPM, baySeq, EBSeq, NOISeq, SAMseq, ShrinkSeq, voom(+limma), etc.


Similarities

  • Easy to use and well documented R packages

  • A 3-step analysis process :

    • normalisation
    • dispersion estimation
    • statistical tests
  • Negative Binomial distribution of counts

  • Generalised Linear Models (GLM)

Differences

  • low counts filtering
  • dispersion estimation
  • outlier detection and processing

Key points

With a small number of replicates (2-3) or low expression, be careful : the results may not be robust.

With a large number of replicates (10 or so) or very high expression : the method choice does not matter much.

  • Outlier counts affect different methods in different ways.
  • Results are more accurate and less variable between methods if DE genes are regulated in both directions.

Use p-value and log2FC to detect differentially expressed genes.

Histogram of p-value

Examples of expected overall distribution of raw p-values

Very low counts usually have a large p-value

Most of them are not kept after filtering (independent filtering)

This is the most desirable shape after removing low counts

Histogram of p-value

Examples of expected overall distribution of raw p-values

Do not expect positive tests after correction

You have a lot of low counts

Histogram of p-value

Examples of unexpected overall distribution of raw p-values

This kind of distribution is expected if you have a batch effect

Discrete distribution of p-values : unexpected

The statistical tests may be inappropriate, due to a strong correlation structure for instance

In these cases, you can’t interpret the results of the differential analysis because the assumptions of the model are not satisfied.

Description of results

MA Plot

Description of results

Volcanoplot

Multiple testing issue

Context:

We performed a large number of statistical tests for which we reject or not \(H_0\) (1 test per gene so \(N\) tests for \(N\) genes)

Possible conclusions:

Among all the genes declared differentially expressed, the False Discovery Rate (FDR) is :


\(FDR = \frac{FP}{FP+TP}\)

Example of the multiple testing issue

We perform 10000 statistical tests (\(N\) = 10000) and we get the following conclusions:

\(FDR = \frac{FP}{FP+TP} = \frac{450}{450+800} = 0.36\)


In this example, 36% of the genes declared differentially expressed are false discoveries!
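The arithmetic of this example, written as a tiny helper (the function name is hypothetical):

```python
def false_discovery_rate(false_positives, true_positives):
    """Proportion of false positives among all genes declared DE."""
    declared = false_positives + true_positives
    return false_positives / declared if declared else 0.0


print(false_discovery_rate(450, 800))
```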

Control of the FDR

Goal: Control the FDR among the list of differentially expressed genes.

(Very strong) assumption: all the \(N\) statistical tests are independent.

Procedure: The Benjamini and Hochberg (1995) algorithm transforms the \(N\) raw p-values into \(N\) adjusted p-values.

If adjusted p-value \(\leqslant\) \(\alpha\) then we reject \(H_0\), where \(\alpha\) is the chosen significance threshold (most commonly 0.05).
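A minimal sketch of the Benjamini and Hochberg adjustment: each raw p-value is multiplied by \(N\) divided by its rank, then the values are made monotone starting from the largest p-value. The function name and the four p-values are hypothetical; in R the same adjustment is obtained with p.adjust(p, method = "BH").

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values, in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by increasing p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down to the smallest.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
alpha = 0.05
for p, q in zip(raw, adjusted):
    print(p, round(q, 4), "reject H0" if q <= alpha else "do not reject H0")
```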

Results with thresholds - MA plot

The plots are drawn with shrunk log2 Fold-Change

\(\alpha\) = 0.05

Results with thresholds - Volcanoplot

Threshold on adjusted p-value

The plots are drawn with shrunk log2 Fold-Change

\(\alpha\) = 0.05

Results with thresholds - Volcanoplot

Threshold on adjusted p-value and log2 Foldchange

The plots are drawn with shrunk log2 Fold-Change

\(\alpha\) = 0.05

\(|log_{2}(foldChange)| > 1\)

Results with thresholds - Volcanoplot

Threshold on adjusted p-value and log2 Foldchange

The plots are drawn with shrunk log2 Fold-Change

\(\alpha\) = 0.05

\(log_{2}(foldChange) > 1 \rightarrow Up\) & \(log_{2}(foldChange) < -1 \rightarrow Down\)
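The two thresholds can be combined into a simple Up / Down / not significant label per gene. This is only a sketch: the function name, the "NS" label and the handling of a missing adjusted p-value (for example a gene removed by filtering) are assumptions.

```python
def classify_gene(log2_fc, padj, alpha=0.05, lfc_threshold=1.0):
    """Label a gene from its (shrunken) log2 Fold-Change and adjusted p-value."""
    if padj is None or padj > alpha:
        return "NS"  # not statistically significant (or not tested)
    if log2_fc > lfc_threshold:
        return "Up"
    if log2_fc < -lfc_threshold:
        return "Down"
    return "NS"  # significant but below the biological threshold


# Hypothetical results: (gene, log2FC, adjusted p-value).
results = [("g1", 2.3, 0.001), ("g2", -1.8, 0.02), ("g3", 0.4, 0.0001), ("g4", 3.0, 0.2)]
for gene, lfc, padj in results:
    print(gene, classify_gene(lfc, padj))
```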

Common thresholds

Some thresholds are pretty common:

\(\alpha\):

  • 0.05 \(\rightarrow\) most common for significance
  • 0.01 \(\rightarrow\) more stringent threshold

\(|log_{2}(foldChange)|\):

  • 1 \(\rightarrow\) most common for absolute value of \(log_{2}(foldChange)\) (Fold Change of 2)
  • 0.58 \(\rightarrow\) common for animal models (Fold Change of 1.5)
  • 0.26 \(\rightarrow\) common for human (Fold Change of 1.2)
  • 2 \(\rightarrow\) really stringent (Fold Change of 4)

6. Enrichment analysis

Purpose

If the experiment went well, and if there are differences between our conditions of interest, we might find differences in gene expression levels.

We want to see which gene sets are the most impacted by these differences.

Gene Annotations Databases

Controlled vocabulary (fixed terms) for annotating genes

Gene Ontology (GO)

  • Molecular Functions: Molecular-level activities performed by gene products
  • Cellular Components: Locations relative to cell compartments and structures
  • Biological Process: Larger processes accomplished by multiple molecular activities

Kyoto Encyclopedia of Genes and Genomes (KEGG)

  • Pathways: Curated maps of molecular interaction and reaction networks

Other

Enrichment analyses - ORA

Over-Representation Analysis (ORA)

  • Available through many web-services and many R packages
  • Takes as input a list of genes of interest (usually all DE genes) and tests if the list is enriched in specific gene sets (more than expected by chance), using Fisher’s exact tests
  • Based on a universe of genes, specified by the user

For a given gene set/annotation (eg. a given GO term):

  • GeneRatio: #genes of interest annotated to the gene set / #genes of interest found in the database
  • BgRatio: #annotated genes in universe / #genes in universe
  • RichFactor: #genes of interest annotated to the gene set / #annotated genes in universe

Most of the time, the universe comprises all the genes used in the initial analysis for alignment, that is all the genes that can be measured. (You can exclude the genes that did not pass the filter.)
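For one gene set, the over-representation test can be sketched with the hypergeometric distribution, which gives the one-sided Fisher’s exact test p-value. The symbols k, n, K and N and the function names are assumptions introduced for the example, as are the numbers in the usage lines.

```python
from math import comb


def ora_p_value(k, n, K, N):
    """P(X >= k): probability of at least k annotated genes in the list by chance.

    k: genes of interest annotated to the gene set
    n: genes of interest (present in the universe)
    K: genes of the universe annotated to the gene set
    N: genes in the universe
    """
    upper = min(n, K)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / comb(N, n)


def ora_ratios(k, n, K, N):
    """GeneRatio, BgRatio and RichFactor as defined above."""
    return {"GeneRatio": k / n, "BgRatio": K / N, "RichFactor": k / K}


# Hypothetical example: 200 DE genes, 15 of them in a gene set of 100 genes,
# with a universe of 10000 genes.
print(ora_ratios(15, 200, 100, 10000))
print(ora_p_value(15, 200, 100, 10000))
```

The choice of the universe changes N and K, and therefore the p-value, which is why it has to be stated explicitly.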

Enrichment analyses - ORA

Example - table

Enrichment analyses - ORA

Example - dotplot

Enrichment analyses - GSEA

Gene-Set Enrichment Analysis (GSEA) (Subramanian et al.)

  • Available as a desktop software, and implemented in multiple R packages
  • Takes as input a ranked list of genes (usually all genes in the study) and tests if specific gene sets are over-represented at the extremes (top or bottom) of the entire ranked list
  • Permutation tests

GSEA results are highly dependent on the chosen ranking factor / metric

Possible metrics

  • stat or |stat|
  • FoldChange
  • Log2 Fold-Change or |Log2 Fold-Change|
  • -log10 pvalue
  • -log10 adjusted pvalue
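To show why the ranking matters, here is a simplified sketch of a running enrichment score: the score goes up when a gene of the set is met while walking down the ranked list, and down otherwise. It is deliberately unweighted (every hit counts the same, whereas GSEA usually weights hits by the ranking metric) and it omits the permutation test; gene names and the function name are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of an unweighted running sum."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("the gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up for a gene of the set, step down for any other gene.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Genes ranked from most up-regulated to most down-regulated (hypothetical).
ranked = ["a", "b", "c", "d", "e", "f"]
print(enrichment_score(ranked, {"a", "b"}))  # set concentrated at the top
print(enrichment_score(ranked, {"e", "f"}))  # set concentrated at the bottom
print(enrichment_score(ranked, {"a", "f"}))  # set spread over both extremes
```

Changing the metric used to rank the genes changes their order, and therefore the score of every gene set.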

Enrichment analyses - GSEA

Example - table

Enrichment analyses - GSEA

Example - GSEA plot

Thank you for your attention !

Contacts

If you have any questions later

We can answer specific questions but not provide project follow-up.
If you need regular interactions to work on your data, you can contact Bilille using the bilille@univ-lille.fr email address and we will help you with biostatistics or bioinformatics needs.


We are physically present on 3 sites :

  • Cité Scientifique, ESPRIT building, 3rd floor
  • Campus Santé, Plateformes-Cancer building, R+1
  • Campus Pasteur, E.Roux building, 2nd floor

You will find more information on our website, and on that of our unit, PLBS.


One last thing : please fill in the following Framaforms to give your opinion on the training course.

The end ☺️