| Sample1 | Sample2 | |
|---|---|---|
| SOX2 | 30 | 60 |
| BRCA1 | 50 | 100 |
| TP53 | 20 | 40 |
| EGFR | 100 | 200 |
| Total | 200 | 400 |
NGS cycle
29 April 2026
At the bottom left, there is a menu to better navigate through the slides
The content of this training course was highly inspired by previous training courses given by Pierre Pericard (2023) and Guillemette Marot (2022 and years before).
They themselves drew inspiration from existing materials, written mainly by Hugo Varet and the biostatistics team from the Pasteur Institute Bioinfo & Biostat Hub ; and J. Aubert, C. Hennequet-Antier (Inrae), M.A. Dillies and H. Varet (Institut Pasteur Paris).
Bilille is the Lille bioinformatics and biostatistics platform, within the UAR 2014 - US 41 “Plateformes Lilloises en Biologie et Santé”.
PLBS includes 8 platforms, providing access to expertise and equipments to support research in biology and health.
In Bilille, we currently are 13 engineers, directed by Jimmy Vandel (IR CNRS), Mamadou-Dia Sow (IR Univ. Lille) and Ségolène Caboche (IR Univ. Lille).
Our missions are to :
Introduction
Experimental design
Exploration
Normalisation
Differential analysis
Enrichment analysis
A differential analysis is a comparaison. This comparaison may be between treatments, states, conditions…
Example :
Particularities of NGS data :
Obtaining a result using a statistical procedure does not mean that this result is reliable. If you do not know the assumptions behind, please be careful with interpretation or ask an expert to help you.
Most of the time, not a unique solution ⇒ statisticians do not know all statistical procedures developed (example of the Bioconductor project : more than 2000 R packages) but have competences to understand them.
”All models are wrong but some are useful” (G. Box, 1978)
DGE : differential gene expression
DTE : differential transcript expression
DTU : differential transcript usage
This course focuses on DGE
A gene is declared differentially expressed if the observed difference between two conditions is statistically significant, that is to say higher than some natural random variation.
Key steps for statisticians :
Experimental design
Normalisation
Differential analysis
Multiple testing
Here is an example of count data table.
This dataset will be used to produce examples throughout the training session.
The raw data (fastq files) originates from a publication, and an RNA-Seq pipeline (nfcore/rna-seq) was used to generate this count file.
With count data, you need to know the characteristics of each sample, at least the condition that you want to use in the comparison.
You can also include other information you have, such as gender, date of sequencing or age for example. This kind of information is useful to explain your data.
Here, we have 9 samples, with 1 condition of 3 differents modalities wild type (WT), mutant A(Mutation_A) and mutant B (Mutation_B).
Mutation_A and Mutation_B are two different mutants of a transcription factor, so we might witness a lot of changes.
“To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.”
Experimental design is defined as the process of organising an experiment to ensure the collection of appropriate and sufficient data to answer research questions clearly and efficiently. It involves creating a structured approach that reduces variation and enhances the validity of conclusions drawn from the data.
Some rules have to be followed to make an experimental design :
For example
10 patients have the same disease.
5 of them received a treatment, while the other 5 patients didn’t get any treatment.
We want to identify differentially expressed genes between the patients that receive a treatment and the others.
1 - Is there a unique biological question ?
Yes !
2 - What biological factors could induce variations, other than taking a treatment?
Age, sex, weight, … any aspect linked to the studied organism.
3 - What technical factors could induce variations ?
Date of sequencing, the person who performed the sequencing, … any aspect linked to the technical experiment.
NB : We have no control over biological variation factors, contrary to technical variations that we can’t eliminate but at least optimised.
There are 2 kinds of replicates : biological and technical.
Biological replicate
Repetition of the same experimental protocol but independant data acquisition : several samples.
Technical replicate
Same biological material but independant replications of the technical steps : several extracts from the same sample.
Sequencing design
Transcriptome differences between Cystic Fibrosis (CF) patients and healthy people.
Which of these design(s) is/are acceptable ?
NO
YES
YES
Confounding effect
Transcriptome differences between Cystic Fibrosis (CF) patients and healthy people.
Can you identify confunding effects ?
A gene is detected as being differentially expressed between healthy and CF patients. Is it due to the disease ? The age effect ? The gender effect ? The date effect ? The technician effect ?
Transcriptome differences between Lung cancer (LC) patients and healthy people.
Before the experiment :
What about increasing the number of biological replicates ?
It would allow to generalise to the population level.
Set of all mice we could measure
Selection of 3 mice per condition
Sample 1
Sample 2
Sample 3 : Non representative of the whole population
Increasing the number of biological replicates would allow to be more precise while estimating within-condition variability and to improve detection of DE transcripts.
SARTools = Statistical Analysis of RNA-Seq Tools.
SARTools is available on R and Galaxy and allows to :
It provides :
https://github.com/PF2-pasteur-fr/SARTools
Among the top three genes, it may happen that these genes are not the same across every sample.
The SERE (Simple Error Ratio Estimate) coefficient :
As expected, we can note that :
What conclusion could you make when looking at this table of SERE coefficients ?
MutB samples ?MutA samples ?WT samples ?Maybe Samples MutA2 and WT1 have been inverted : it’s worth investigating.
So far, we have seen representations of counts by sample or between samples 2-to-2.
Now, we would like to visualise the whole dataset, considering every sample.
The main goal of the multivariate exploratory data analysis is to explore the whole structure of the dataset, to better understand the proximity between samples and detect possible problems.
This is a quality control step.
Two main tools
Both these methods depend on a notion of distance between samples.
To compute distances properly, the variance must be independant of the mean : data have to be homoscedastic.
Homoscedasticity is not satisfied : variance increases with mean.
-> It’s necessary to transform data before performing PCA and clustering.
Several transformation methods exist, for example :
DESeq2 proposes VST (Variance stabilising Transformation) and rlog (Regularised Log Transformation)edgeR proposes a transformation of the count data as moderated log-counts-per-millionAll these methods allow to have transformed data that we can use to perform PCA and clustering.
We usually use the vst method from DESeq2 package, that runs faster than rlog method.
It’s easy to visualise pair-wise relations, through scatter plots :
In the context of RNAseq, we have n individuals (samples) and p variables (genes).
Performing a PCA will allow to extand the visualisation to p variables, which is much greater than 2.
Main goal : explore the structure of the dataset to better understand the proximity between samples and detect possible problems (often used as a quality control step)
Principle : find axes on which one can project points to obtain a space of reduced dimension.
PCA uses a criteron based on variance to build new axes, also called components, in order to preserve variability. These new components are a linear combination of the initial variables.
Percentage of inertia associated with an axis :
Number of axes to interpret :
What conclusion could you make when looking at this projection ?
Maybe Samples MutA2 and WT1 have been inverted : it’s worth investigating.
What conclusion could you make when looking at this projection ?
Sample MutB1 behaves like an outlier : it’s worth investigating.
In this example, we aim at performing the differential analysis comapring Chow and HFD.
What conclusion could you make when looking at these projections ?
The effect of Sex is greater than the effect of Phenotype.
Variable Sex is then considered as a batch effect, and has to be controled in the differential analysis.
Goal : build groups of samples such that :
Two main approaches :
K-means
Hierarchical clustering
In RNAseq, we used the Hierarchical clustering method. This method doesn’t depend on a randomly initialisation and gives the same results from one run to the next.
What’s more, it provides a graphical representation : the dendrogram.
We usually use the euclidian distance with Ward criterion (option method="Ward.D2" in R hclust function).
What conclusion could you make when looking at this dendrogram ?
Maybe Samples MutA2 and WT1 have been inverted : it’s worth investigating.
What conclusion could you make when looking at this dendrogram ?
Sample MutB1 behaves like an outlier : it’s worth investigating.
Definition : normalisation is a process designed to identify and correct some technical biases removing the least possible biological signal. This step is technology and platform-dependant.
Within-sample normalisation : Normalisation enabling comparisons of fragments (genes) from a unique sample. No need in a differential analysis context.
Between-sample normalisation : Normalisation enabling comparisons of fragments (genes) from different samples.
Read counts are proportional to expression level, gene length and sequencing depth (same RNAs in equal proportions).
Within-sample :
Between-sample :
Assumptions:
The first goal is to correct the differences of library sizes. After sequencing, the number of reads is different between each samples, we want to limit the size effect.
In the example below, we can see that sample 2 has twice the total number of reads of sample 1. However, after normalisation, there is no difference anymore.
| Sample1 | Sample2 | |
|---|---|---|
| SOX2 | 30 | 60 |
| BRCA1 | 50 | 100 |
| TP53 | 20 | 40 |
| EGFR | 100 | 200 |
| Total | 200 | 400 |
| Sample1 | Sample2 | |
|---|---|---|
| SOX2 | 30 | 30 |
| BRCA1 | 50 | 50 |
| TP53 | 20 | 20 |
| EGFR | 100 | 100 |
| Total | 200 | 200 |
Assumptions:
The second goal is to address the differences in library composition. While the library size may be the same, this does not reflect reality.
| Sample1 | Sample2 | |
|---|---|---|
| SOX2 | 6 | 30 |
| BRCA1 | 6 | 30 |
| TP53 | 6 | 30 |
| EGFR | 72 | 0 |
| Total | 90 | 90 |
| Sample1 | Sample2 | |
|---|---|---|
| SOX2 | 2 | 2 |
| BRCA1 | 2 | 2 |
| TP53 | 2 | 2 |
| EGFR | 24 | 0 |
| Total | 30 | 6 |
Assumptions:
DESeq2 computes a size factor \(s_{j}\) per sample:
\({s}_{j} = \underset{i}{median}\frac{x_{ij}}{(\prod^{n}_{v=1}x_{iv})^{1/n}}\)
To normalise counts :
\(x'_{ij} = \frac{x_{ij}}{s_{j}}\)
Assumptions:
edgeR computes a normalisation factor \(f_j\) per sample and normalises the total number of reads \(N_j\) in each sample:
\(N'_{j} = f_{j} \times N_{j}\)
We can calculate DESeq2-like size factors \(s_{j}\) in order to normalise the counts:
\(s_{j} = \frac{N'_{j}}{\frac{1}{n}\sum_{k} N'_{k}}\)
and then :
\(x'_{ij} = \frac{x_{ij}}{s_{j}}\)\(n\) : number of samples studied
\(s_{j}\) : normalisation factor for sample \(j\)
\(N_{j}\) : total number of reads in sample \(j\) (library size)
\(x'_{ij} = \frac{x_{ij}}{N_{j} \times L_{i}} \times 10^{6} \times 10^{3}\)
\(x_{ij}\) : number of reads for gene \(i\) in sample \(j\)
\(N_{j}\) : total number of reads in sample \(j\) (library size)
\(L_{i}\) : Length of gene \(i\)
After normalisation, you can see that the third quartile is more aligned.
Goal : Detect differentially expressed genes between two conditions.
Why replicates ?
In a perfect world : no biological nor technical variability. So one sample from each condition would be necessary to conclude.
In our world : each individual has its own behavior. So we need biological replicates to estimate within-condition variability.
state the null and the alternative hypotheses
\(H_0\) is always preferred. No sufficient proof \(\rightarrow\) no rejection. When we can’t reject \(H_0\), this doesn’t mean that \(H_0\) is true.
p-value is the probability to obtain, under the null hypothesis, a test statistic at least as extreme as the one that we actually observed.
In other terms :
Conclusion : if p-value \(\leqslant\) \(\alpha\) then we reject \(H_0\)
\(\Rightarrow\) We consider that the mean expression of the gene is different in both conditions.
Let \(\mu_1\) and \(\mu_2\) the mean expression of gene g for the first and second condition respectively.
We wish to test the hypothesis : \(H_0\) : \(\mu_1\) = \(\mu_2\) vs \(H_1\) : \(\mu_1\) \(\neq\) \(\mu_2\)
The risks can be summarised in :
\[ X_G \sim Binomial(N, \pi_G) \approx Poisson(N\pi_G) \]
But variance increases with intensity, due to biological variability :
Technical variability is the main source of variablility in low counts, whereas biological variability is dominant in high counts.
In case of overdispersion, increase of the type I error rate (probability to declare incorrectly gene DE).
If \(X_G \sim Poisson(N\pi_G)\), then \(mean(X_G) = variance(X_G) = N\pi_G\).
This is not satisfied \(\rightarrow\) we need a statistical law with variance \(\neq\) mean.
\[ X_{ij} \sim Negative-Binomial(mean = \mu_{ij}, variance = \sigma_{ij}^2)\]
DESeq2 and edgeR used models based on Negative binomial distribution, but other methods exist.
One linear model by gene :
Wald testCoefficients
Coefficients of models correspond to the log2 Fold-Change computed for each contrast
Significance
Significance is expressed through the p-value
\(\mu_1\) = \(mean\ of\ normalised\ values\ in\ condition\ 1\)
\(\mu_2\) = \(mean\ of\ normalised\ values\ in\ condition\ 2\)
Fold-Change = \(\frac{\mu_1}{\mu_2}\)
Log2 Fold-Change = log2(Fold-Change) = log2(\(\mu_1) -\) log2(\(\mu_2)\)
The Fold-Change is a measure describing how much a quantity changes.
Why is’nt it enough to use Fold-Change to find differentially expressed genes ?
Fold-change does not take the variance of the samples into account.
For example, the difference between 102 and 100 is the same as between 4 and 2, but does not seem to have the same importance, regarding the baseline value.
It’s recommended to shrink the log2 Fold-Change :
One of the most shrinkage method used is Adaptative SHRinkage(ashr package), but you might also encounter Approximate posterior estimation for GLM coefficients (apeglm).
Practical importance and statistical significance have little to do with each other :
Usually, we apply 2 thresholds on both p-value and log2FC to detect differentially expressed genes, in order to control respectively :
DESeq2 and edgeR are the most used, but many other tools exist, for example NBPSeq, TSPM, baySeq, EBSeq, NOISeq, SAMseq, ShrinkSeq, voom(+limma), etc.
Easy to use and well documented R packages
A 3-step analysis process :
Negative Binomial distribution of counts
Generalised Linear Models (GLM)
With a small number of replicates (2-3) or low expression, be careful : the results may not be robust.
With a large number of replicates (10 or so) or very high expression : the method choice does not matter much.
Use p-value and log2FC to detect differentially expressed genes.
Examples of expected overall distribution of raw p-values
Very low counts usually have a large p-value
Most of them are not kept after filtering (independant filtering)
This is the most desirable shape after removing low counts
Examples of expected overall distribution of raw p-values
Do not expect positive tests after correction
You have a lot of low counts
Examples of not expected overall distribution of raw p-values
This kind of distribution is expected if you have a batch effect
Descrete distribution of p-values : unexpected
The statistic tests may be inappropriate due to strong correlation structure for instance
In these cases, you can’t interpret the results of the differential analysis because the hypothesis of the model are not satisfied.
Context:
We performed a large number of statistical tests for which we reject or not \(H_0\) (1 test per gene so \(N\) tests for \(N\) genes)
Possible conclusions:
Among all the genes differentially expressed, the False Discovary Rate (FDR) is :
We perform 10000 statistical tests (\(N\) = 10000) and we get the following conclusions:
In this example, there is 36% of falsely discovered genes!
Goal: Control the FDR among the list of differentially expressed genes.
(Very strong) assumption: all the \(N\) statistical tests are independent.
Procedure: The Benjamini and Hochberg (1993) algorithm transforms the \(N\) raw p-values in \(N\) adjusted p-values.
If adjusted p-value \(\leqslant\) \(\alpha\) then we reject \(H_0\), with \(\alpha\) acceptance threshold (most common is 0.05).
The plots are drawn with shrunk log2 Fold-Change
\(\alpha\) = 0.05
The plots are drawn with shrunk log2 Fold-Change
\(\alpha\) = 0.05
The plots are drawn with shrunk log2 Fold-Change
\(\alpha < 0.05\)
\(|log_{2}(foldChange)| > 1\)
The plots are drawn with shrunk log2 Fold-Change
\(\alpha < 0.05\)
\(log_{2}(foldChange) > 1 \rightarrow Up\) & \(log_{2}(foldChange) < -1 \rightarrow Down\)
Some thresholds are pretty common:
\(\alpha\):
\(|log_{2}(foldChange)|\):
If the experiment went well, and if there are differences between our conditions of interest, we might find differences in gene expression levels.
We want to see which gene sets are the most impacted by these differences.
Controlled vocabulary (fixed terms) for annotating genes
For a given gene set/annotation (eg. a given GO term):
#annotated genes / #considered genes in all the database#annotated genes in universe / #genes in universe#annotated genes / #annotated genes in universeMost the time, the universe comprises all the genes used in the initial analysis for alignment. It includes all genes that can be measured. (You can exclude the genes that did not pass the filter).
GSEA results are highly dependent on the chosen ranking factor / metric
Possible metrics
If you have any questions later
We can answer specific questions but not provide project follow-up.
If you need regular interactions to work on your data, you can contact Bilille using the bilille@univ-lille.fr email address and we will help you with biostatistics or bioinformatics needs.
We are physically present on 3 sites :
You will find more information on our website, and the one of our unity, PLBS.
One last thing : please fill in the following Framaforms to give your opinion on the training course.