TP on proteomics mockup data

This TP was created in the context of the R Bilille training program in 2025.
The mockup data was created using R.

1 Abstract

In this training, we aim to import, manipulate and visualise proteomics data. The data was generated artificially, so the conclusions of this TP do not correspond to any biological effect.

Proteomics data here is in the form of a table with proteins as rows and samples as columns. The value in a cell is therefore the measured abundance of a protein (the row) in a sample (the column). When a protein has not been detected in a sample, the cell contains a missing value (NA).
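
As a minimal sketch of this layout, the toy table below has 3 proteins and 4 samples. All names and values are made up for illustration and are not taken from the training files.

```r
# Toy abundance table: 3 proteins (rows) x 4 samples (columns); values are made up
toy_abundance <- matrix(
  c(120.5,  98.2,    NA, 140.1,
    210.0, 185.4, 199.9,    NA,
     75.3,  80.1,  77.8,  79.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Prot_", 1:3), paste0("Sample_", 1:4))
)

# Abundance of Prot_2 (row) in Sample_3 (column)
toy_abundance["Prot_2", "Sample_3"]

# Prot_1 was not detected in Sample_3, so the cell is NA
is.na(toy_abundance["Prot_1", "Sample_3"])

# Total number of missing values in the table
sum(is.na(toy_abundance))
```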

We also have access to metadata, which is the information about the samples studied. For the sake of the training, this information is very limited here. For each sample, we know: its ID (name), the batch it belonged to (if samples were analysed at different times), and the Group it belongs to. The Group corresponds to our condition of interest. For example, the goal of our proteomics analysis could be to see if there is a difference between case samples and control samples.

In this training, we will start by importing and exploring the data. We will see how to filter out samples and proteins with missing values from our dataset. We will then create different visualisations: a barplot, a density plot and a boxplot by condition. Lastly, a Principal Component Analysis (PCA) will be conducted.

2 Importing and exploring the data

2.1 Importing data

We are going to use two tables for this analysis :

  • proteomic_data.csv contains information about the proteins and their abundances in each sample

  • sample_metadata.csv contains information about the samples studied (batch and case/control)

Question 1 :

1.1 Import and study the tables.

1.2 How many proteins were identified ?

1.3 On how many samples was the study conducted ?

1.4 How many case and control ?

Question 1
Your turn
Hints
# 1.1 To import the tables, use the import button or the read.csv() function. You can then explore them with the head() and summary() functions

# 1.2 The number of proteins identified is the number of rows in proteomic_data

# 1.3 The number of samples is the number of rows in sample_metadata

# 1.4 Here you can use the table() function of the Group of sample_metadata
Complete answer
# 1.1
proteomic_data <- read.csv("./proteomic_data.csv")
head(proteomic_data)
##   Proteine_ID Gene_Name    Function Mass_kDa Sample_1  Sample_2 Sample_3
## 1      Prot_1    Gene_1    Receptor     46.9 121.6470 100.36027 119.1426
## 2      Prot_2    Gene_2      Enzyme     14.9 134.4514  98.03269 164.7120
## 3      Prot_3    Gene_3  Structural     49.1 266.6354 145.50525 121.1320
## 4      Prot_4    Gene_4 Transporter     26.8 150.5016 138.94809 242.5193
## 5      Prot_5    Gene_5     Unknown     50.0 151.3885  53.43598 152.0930
## 6      Prot_6    Gene_6  Structural     90.2 284.5408 233.03854 118.3439
##    Sample_4  Sample_5  Sample_6 Sample_7  Sample_8  Sample_9 Sample_10
## 1 139.78971 161.48780 124.31637 113.8472  80.49437 188.58740  335.5857
## 2 132.05879 192.71823 246.25901 233.9253 182.34273 330.94095  122.1533
## 3  83.71415 196.04661  98.54651 118.2785 331.32807  78.23942  125.5227
## 4 114.65343  89.20677 283.22978 146.9143 194.86008 174.33630  102.1006
## 5 421.19308  66.99370 215.88462 202.8356 313.67663 142.17540  469.6861
## 6 145.06087 372.72768 182.39673 124.0830 150.99983 132.59189  113.8820
##   Sample_11 Sample_12 Sample_13 Sample_14 Sample_15 Sample_16 Sample_17
## 1        NA  213.0200  184.1431        NA  138.6391  103.9925  120.3741
## 2  184.3775  160.2631  254.2895        NA  208.9348  159.3205  114.0580
## 3  268.4671  219.1539  118.6289  240.7887  129.5348        NA  299.2542
## 4  151.0859  119.1441  215.6033  159.8170  219.8116  276.1021  269.2076
## 5  211.4392  264.4778  196.0366  170.3546  233.2495  233.8605  128.7948
## 6  295.5621  220.8819  262.5801  332.0887  226.7469  301.7992  203.2669
##   Sample_18 Sample_19 Sample_20
## 1 264.65098 202.14201  444.6929
## 2 180.93777  92.36344  100.1268
## 3 122.91136 183.42033  102.3379
## 4 158.59004 114.60753  171.6060
## 5 298.48229 146.05061  131.2460
## 6  99.61055 187.75987  115.7825
summary(proteomic_data)
##  Proteine_ID         Gene_Name           Function            Mass_kDa    
##  Length:1000        Length:1000        Length:1000        Min.   : 10.0  
##  Class :character   Class :character   Class :character   1st Qu.: 48.6  
##  Mode  :character   Mode  :character   Mode  :character   Median : 83.8  
##                                                           Mean   : 82.5  
##                                                           3rd Qu.:116.9  
##                                                           Max.   :149.9  
##                                                                          
##     Sample_1         Sample_2         Sample_3         Sample_4     
##  Min.   : 49.07   Min.   : 43.35   Min.   : 46.61   Min.   : 43.99  
##  1st Qu.:114.87   1st Qu.:113.90   1st Qu.:114.21   1st Qu.:115.09  
##  Median :148.60   Median :151.51   Median :145.82   Median :148.65  
##  Mean   :161.11   Mean   :163.69   Mean   :159.29   Mean   :159.25  
##  3rd Qu.:192.39   3rd Qu.:199.36   3rd Qu.:192.25   3rd Qu.:192.16  
##  Max.   :554.43   Max.   :576.71   Max.   :584.03   Max.   :423.96  
##  NA's   :33       NA's   :31       NA's   :22       NA's   :20      
##     Sample_5         Sample_6        Sample_7         Sample_8     
##  Min.   : 41.91   Min.   : 40.5   Min.   : 40.84   Min.   : 33.22  
##  1st Qu.:111.37   1st Qu.:120.8   1st Qu.:118.85   1st Qu.:118.52  
##  Median :147.27   Median :157.3   Median :152.80   Median :155.56  
##  Mean   :159.03   Mean   :170.9   Mean   :165.81   Mean   :169.75  
##  3rd Qu.:191.76   3rd Qu.:206.1   3rd Qu.:205.45   3rd Qu.:207.23  
##  Max.   :588.69   Max.   :680.5   Max.   :580.23   Max.   :491.69  
##  NA's   :15       NA's   :22      NA's   :15       NA's   :20      
##     Sample_9        Sample_10        Sample_11        Sample_12     
##  Min.   : 40.12   Min.   : 40.73   Min.   : 48.57   Min.   : 55.42  
##  1st Qu.:117.93   1st Qu.:117.45   1st Qu.:145.49   1st Qu.:143.70  
##  Median :156.29   Median :151.19   Median :188.26   Median :188.52  
##  Mean   :172.11   Mean   :165.56   Mean   :204.99   Mean   :205.33  
##  3rd Qu.:208.34   3rd Qu.:199.87   3rd Qu.:250.53   3rd Qu.:254.80  
##  Max.   :726.19   Max.   :528.59   Max.   :590.38   Max.   :620.51  
##  NA's   :26       NA's   :15       NA's   :24       NA's   :22      
##    Sample_13       Sample_14        Sample_15        Sample_16     
##  Min.   : 51.7   Min.   : 51.11   Min.   : 52.99   Min.   : 62.42  
##  1st Qu.:139.3   1st Qu.:149.15   1st Qu.:144.80   1st Qu.:145.50  
##  Median :181.7   Median :193.70   Median :188.03   Median :186.67  
##  Mean   :197.1   Mean   :208.38   Mean   :204.95   Mean   :203.85  
##  3rd Qu.:236.9   3rd Qu.:249.18   3rd Qu.:246.44   3rd Qu.:242.20  
##  Max.   :586.3   Max.   :607.46   Max.   :696.27   Max.   :695.61  
##  NA's   :19      NA's   :409      NA's   :14       NA's   :13      
##    Sample_17        Sample_18        Sample_19        Sample_20    
##  Min.   : 53.22   Min.   : 48.34   Min.   : 46.74   Min.   : 47.8  
##  1st Qu.:138.43   1st Qu.:140.71   1st Qu.:143.58   1st Qu.:137.0  
##  Median :181.22   Median :185.13   Median :186.63   Median :186.8  
##  Mean   :199.14   Mean   :202.10   Mean   :201.73   Mean   :200.3  
##  3rd Qu.:242.74   3rd Qu.:240.19   3rd Qu.:240.24   3rd Qu.:241.0  
##  Max.   :713.96   Max.   :642.52   Max.   :833.25   Max.   :731.6  
##  NA's   :20       NA's   :15       NA's   :15       NA's   :26
sample_metadata <- read.csv("./sample_metadata.csv")
head(sample_metadata)
##   Sample_ID   Batch   Group
## 1  Sample_1 Batch_1 Control
## 2  Sample_2 Batch_1 Control
## 3  Sample_3 Batch_1 Control
## 4  Sample_4 Batch_1 Control
## 5  Sample_5 Batch_1 Control
## 6  Sample_6 Batch_2 Control
summary(sample_metadata)
##   Sample_ID            Batch              Group          
##  Length:20          Length:20          Length:20         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
# 1.2 Number of identified proteins
n_proteins <- nrow(proteomic_data)
print(paste0("There are ", n_proteins, " identified proteins in our dataset."))
## [1] "There are 1000 identified proteins in our dataset."
# 1.3 Number of samples
n_samples <- nrow(sample_metadata)
print(paste0("There are ", n_samples, " samples in our dataset."))
## [1] "There are 20 samples in our dataset."
# 1.4
table(sample_metadata$Group)
## 
##    Case Control 
##      10      10
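
Before going further, it can be worth checking that the sample columns of the abundance table and the rows of the metadata describe the same samples. The sketch below does this on two small made-up tables that only mimic the structure of proteomic_data and sample_metadata.

```r
# Toy stand-ins for proteomic_data and sample_metadata (values are made up)
toy_proteomic <- data.frame(
  Proteine_ID = c("Prot_1", "Prot_2"),
  Sample_1 = c(10, 20),
  Sample_2 = c(11, NA),
  Sample_3 = c(12, 22)
)
toy_metadata <- data.frame(
  Sample_ID = c("Sample_1", "Sample_2", "Sample_3"),
  Group = c("Control", "Control", "Case")
)

# Sample columns are those whose name starts with "Sample_"
sample_cols <- grep("^Sample_", colnames(toy_proteomic), value = TRUE)

# TRUE if every sample column has a metadata row, and vice versa
setequal(sample_cols, toy_metadata$Sample_ID)

# Number of samples per group, as in question 1.4
group_counts <- table(toy_metadata$Group)
group_counts
```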

2.2 Filtering samples

In an analysis, samples are sometimes removed from the dataset when their measurements are considered unreliable.

2.2.1 Number of quantified proteins

The proteins identified in each sample of a proteomics study are not always the same. Our goal here is to look at the number of identified or quantified (non-NA) proteins for each sample.

Question 2 :

2.1 Create a table abundance_data from proteomic_data by keeping only the columns with abundance data. To keep the information about which protein is in which row, add the Protein ID as the row names of this abundance_data table.

2.2 Create another table that is the number of quantified proteins for each Sample.

2.3 Create a barplot of the number of quantified proteins per sample. Color the bars of this barplot depending on the batch the sample belongs to.

Question 2
Your turn
Hints
# Make sure to load the appropriate libraries for manipulation and plotting (dplyr, tidyr, ggplot2... anything you think necessary)

# 2.1 You can use the dplyr package with: %>% select(starts_with("Sample_"))

# 2.2 You can use colSums() with !is.na() to count the quantified values

# 2.3 Your goal is to have a table with columns Sample_ID, Proteins_quantified and Batch for the barplot. If you use ggplot(), look into geom_bar().
Complete answer
# Load the libraries
library(tidyr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# 2.1 Select only the samples data and add rownames
abundance_data <- proteomic_data %>% select(starts_with("Sample_"))
rownames(abundance_data) <- proteomic_data$Proteine_ID

# 2.2 Count the number of not NA values
counts <- colSums(!is.na(abundance_data))

# 2.3 Create the dataframe to use in ggplot
data_counts <- data.frame(
  Sample_ID = names(counts),
  Proteins_quantified = counts
) %>%
  left_join(sample_metadata, by = "Sample_ID")

# Sample_ID as a factor with levels ordered from Sample_1 to Sample_20: this puts the samples in the right order on the barplot
data_counts <- data_counts %>%
  mutate(Sample_ID = factor(Sample_ID, levels = paste0("Sample_", 1:n_samples)))

# Barplot
ggplot(data_counts,
       aes(x = Sample_ID, y = Proteins_quantified, fill = Batch)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(title = "Number of quantified proteins per sample", x = "Sample", y = "Number of proteins")

2.2.2 Filtering abnormal samples

On the barplot you just made, we can see that all samples have a similar number of quantified proteins, except for sample 14, which has a much lower number.
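
The same observation can be made numerically rather than by eye. The sketch below flags samples whose number of quantified proteins falls far below the median. The counts are made up, and the 80% cut-off is an arbitrary assumption chosen for illustration, not a rule from this training.

```r
# Toy counts of quantified proteins per sample (made-up values)
toy_counts <- c(Sample_1 = 970, Sample_2 = 975, Sample_3 = 981,
                Sample_4 = 590, Sample_5 = 985)

# Hypothetical rule: flag samples below 80% of the median count
threshold <- 0.8 * median(toy_counts)

# Names of the samples under the threshold
low_samples <- names(toy_counts)[toy_counts < threshold]
low_samples
```

With real data, the cut-off should be chosen after looking at the barplot, not before.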

Question 3 : Remove sample 14 from the dataset.

Question 3
Your turn
Hints
# You can either filter the existing data or create new objects to use 
# (new objects is recommended as you will keep the initial data and be sure what dataset you use)
# When naming your data use explicit names !
Complete answer
# Filter the sample 14 (a column in proteomic data and a row in the sample metadata)
filtered_proteomic_data <- proteomic_data
filtered_proteomic_data$Sample_14 <- NULL

filtered_abundance_data <- abundance_data %>% select(!("Sample_14"))

# Filter a row :
filtered_sample_metadata <- subset(sample_metadata, Sample_ID != "Sample_14")
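
A base R variant of the same filtering is sketched below on made-up tables, followed by a check that the abundance columns and the metadata rows still describe the same samples in the same order. The object names prefixed with toy_ are hypothetical.

```r
# Toy abundance table and metadata (values are made up)
toy_abundance <- data.frame(
  Sample_1 = c(10, 20), Sample_2 = c(11, NA), Sample_3 = c(12, 22),
  row.names = c("Prot_1", "Prot_2")
)
toy_metadata <- data.frame(
  Sample_ID = paste0("Sample_", 1:3),
  Batch = c("Batch_1", "Batch_1", "Batch_2")
)
sample_to_remove <- "Sample_2"

# Drop the column from the abundance table and the row from the metadata
filtered_toy_abundance <- toy_abundance[, colnames(toy_abundance) != sample_to_remove,
                                        drop = FALSE]
filtered_toy_metadata <- toy_metadata[toy_metadata$Sample_ID != sample_to_remove, ]

# TRUE if both objects list the same samples in the same order
identical(colnames(filtered_toy_abundance),
          as.character(filtered_toy_metadata$Sample_ID))
```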

2.3 Dealing with missing values

Here, we are aiming to have a table of abundance with no missing values.

Question 4 :

4.1 From your filtered proteomic data, create a new table that contains no rows with NAs.

4.2 How many proteins are in this new table ?

Question 4
Your turn
Hints
# 4.1 There are many ways to do this (complete.cases() or dplyr with drop_na() for example)

# 4.2 The number of proteins is the number of rows (like in question 1.2)
Complete answer
# 4.1 Delete all the rows with at least an NA

proteomic_data_noNA <- filtered_proteomic_data %>%
  drop_na()
# Or
proteomic_data_noNA <- filtered_proteomic_data[complete.cases(filtered_proteomic_data), ]
# Or
proteomic_data_noNA <- subset(filtered_proteomic_data,
                              complete.cases(filtered_proteomic_data))

abundance_data_noNA <- filtered_abundance_data %>%
  drop_na()


# 4.2 The dim() function gives the dimensions (number of rows and columns) of a table
dim(abundance_data_noNA)
## [1] 685  19
# Let's print a sentence with text and numbers with the paste() function
dim_ab_data <- dim(abundance_data_noNA) # this is now a vector with two values
print(paste(
  "There are",
  dim_ab_data[1],
  "proteins (rows) after filtering the missing values."
))
## [1] "There are 685 proteins (rows) after filtering the missing values."
print(paste(
  "There are",
  dim_ab_data[2],
  "samples (columns) left after filtering sample 14."
))
## [1] "There are 19 samples (columns) left after filtering sample 14."
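
To see what removing incomplete rows does, the sketch below counts the missing values per protein on a made-up table and then keeps only the complete rows with complete.cases(). A protein is dropped as soon as it has a single NA in any sample.

```r
# Toy abundance table with a few missing values (values are made up)
toy_abundance <- data.frame(
  Sample_1 = c(10, 20, 30, NA),
  Sample_2 = c(11, NA, 31, 41),
  Sample_3 = c(12, 22, 32, 42),
  row.names = paste0("Prot_", 1:4)
)

# Number of missing values for each protein (row)
na_per_protein <- rowSums(is.na(toy_abundance))
na_per_protein

# Keep only the proteins quantified in every sample
keep <- complete.cases(toy_abundance)
toy_noNA <- toy_abundance[keep, ]
rownames(toy_noNA)
```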

2.4 Density plot

Now that we have removed the rows containing NAs, we can look at the distribution of the remaining protein abundance values in each sample. We will also see how to log-transform the data in this part.

Question 5 : Our goal is to create a density plot for each sample colored by batch. To do so :

5.1 Create an object named data_density where each row corresponds to a unique value of abundance for one protein in one sample. You can use pivot_longer() function with :
data_density = proteomic_data_noNA %>% pivot_longer(starts_with("Sample_"), names_to = "Sample_ID", values_to = "Abundance")

5.2 Explore this new object data_density and check for any NA values

5.3 Add the information of the batch in data_density

5.4 Create a density plot of abundance for each sample of your data and color the density lines per batch

5.5 Create a density plot of the log transformed abundance

Question 5
Your turn
Hints
# 5.1 The code is given in the question

# 5.2 View() function or click button on the object in the Environment panel
#     For NAs, you can use summary() or anyNA()

# 5.3 You can use dplyr left_join() with filtered_sample_metadata to add information to data_density

# 5.4 Create your density plot using ggplot and geom_density(), 
#     colored by Batch and grouped by Sample_ID

# 5.5 This is the same plot but the abundances are log-transformed, for example with the log2() function
Complete answer
# 5.1 Create the long version of proteomic data with no NA values.
data_density <- proteomic_data_noNA %>%
  pivot_longer(starts_with("Sample_"),
               names_to = "Sample_ID",
               values_to = "Abundance")

# 5.2 Explore data_density
# View(data_density)
head(data_density)
## # A tibble: 6 × 6
##   Proteine_ID Gene_Name Function Mass_kDa Sample_ID Abundance
##   <chr>       <chr>     <chr>       <dbl> <chr>         <dbl>
## 1 Prot_2      Gene_2    Enzyme       14.9 Sample_1      134. 
## 2 Prot_2      Gene_2    Enzyme       14.9 Sample_2       98.0
## 3 Prot_2      Gene_2    Enzyme       14.9 Sample_3      165. 
## 4 Prot_2      Gene_2    Enzyme       14.9 Sample_4      132. 
## 5 Prot_2      Gene_2    Enzyme       14.9 Sample_5      193. 
## 6 Prot_2      Gene_2    Enzyme       14.9 Sample_6      246.
summary(data_density)
##  Proteine_ID         Gene_Name           Function            Mass_kDa     
##  Length:13015       Length:13015       Length:13015       Min.   : 10.00  
##  Class :character   Class :character   Class :character   1st Qu.: 46.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 82.60  
##                                                           Mean   : 81.51  
##                                                           3rd Qu.:116.60  
##                                                           Max.   :149.70  
##   Sample_ID           Abundance     
##  Length:13015       Min.   : 40.12  
##  Class :character   1st Qu.:126.43  
##  Mode  :character   Median :166.74  
##                     Mean   :182.16  
##                     3rd Qu.:220.76  
##                     Max.   :833.25
# 5.3 Add information from sample_metadata
data_density <- data_density %>%
  left_join(filtered_sample_metadata, by = "Sample_ID")

# N.B. We could have done questions 5.1 and 5.3 in one command with:
data_density <- proteomic_data_noNA %>%
  pivot_longer(starts_with("Sample_"),
               names_to = "Sample_ID",
               values_to = "Abundance") %>%
  left_join(filtered_sample_metadata, by = "Sample_ID")

# 5.4 Density plot with ggplot
ggplot(data_density, aes(
  x = (Abundance),
  color = Batch,
  group = Sample_ID
)) +
  geom_density() +
  theme_minimal() +
  labs(title = "Density of protein abundances per sample", x = "Abundance", y = "Density")

# 5.5 Density plot with log transformation of the abundance
ggplot(data_density, aes(
  x = log2(Abundance),
  color = Batch,
  group = Sample_ID
)) +
  geom_density() +
  theme_minimal() +
  labs(title = "Density of log2-transformed protein abundances per sample", x = "log2(Abundance)", y = "Density")
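
As a small worked example of what the log2 transformation does, the sketch below uses made-up abundance values that double at each step. On the log2 scale, each doubling becomes a step of 1, and a two-fold increase and a two-fold decrease become symmetric around 0.

```r
# Made-up abundances that double at each step
abundance <- c(50, 100, 200, 400, 800)

# On the log2 scale, the values are evenly spaced
log2_abundance <- log2(abundance)
log2_abundance

# Each doubling adds 1
steps <- diff(log2_abundance)
steps

# A two-fold increase and a two-fold decrease are symmetric around 0
fold_up <- log2(200 / 100)
fold_down <- log2(50 / 100)
c(fold_up, fold_down)
```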

3 Analysis

In this part, we will continue our analysis on our data now that it has been cleaned and filtered.

3.1 Case-Control investigation

The goal here is to check whether our data differ between the 2 groups defined in Group (case or control).

Question 6 : Using the data_density object you created previously :

6.1 Compute summary statistics (mean, median, max and min) of abundance per Group

6.2 Visualise the distribution of abundances per Group using a boxplot colored by Group.

Question 6
Your turn
Hints
# 6.1 There are many ways to do this: you can subset your data by Group and then use the mean(), median(), min() and max() functions
#      Or you can look into using group_by() and summarise() from dplyr

# 6.2 Using data_density with ggplot() + geom_boxplot() is a great start
Complete answer
# 6.1 
summary_stats <- data_density %>%
    group_by(Group) %>%
    summarise(mean_abundance = mean(Abundance, na.rm = TRUE),
              median_abundance = median(Abundance, na.rm = TRUE),
              max_abundance=max(Abundance, na.rm = TRUE),
              min_abundance=min(Abundance, na.rm = TRUE)
              )

# Note that na.rm=TRUE can be useful with these functions but here we know there is no NA in our data
summary_stats
## # A tibble: 2 × 5
##   Group   mean_abundance median_abundance max_abundance min_abundance
##   <chr>            <dbl>            <dbl>         <dbl>         <dbl>
## 1 Case              202.             186.          833.          46.7
## 2 Control           164.             152.          726.          40.1
# 6.2 Simple boxplot
ggplot(data_density, aes(x = Group, y = Abundance, fill = Group)) +
    geom_boxplot() +
    theme_minimal()

# We can improve our boxplot with a title, labels and by removing the legend which is useless here
ggplot(data_density, aes(x = Group, y = Abundance, fill = Group)) +
  geom_boxplot(outlier.shape = 21, alpha = 0.8) +
  labs(
    title = "Abundance distribution per condition",   # the title
    x = "Condition",                                  # the x label
    y = "Protein abundance"                       # the y label
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),    # Centering the title
    legend.position = "none"                   # Hiding the legend
  )

You can interpret the boxplot by observing whether the median (horizontal line in each box) and the spread (height of the boxes) differ between the Case and Control groups. If the boxes largely overlap, the difference between the groups may be small. A boxplot is not a statistical test, so it cannot tell us on its own whether a difference is significant.
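
The per-group statistics of question 6.1 can also be computed in base R with tapply(), which is close to the "subset, then summarise" approach mentioned in the hints. The sketch below uses a made-up long table with the same Group and Abundance columns as data_density.

```r
# Toy long-format table (values are made up)
toy_long <- data.frame(
  Group = c("Case", "Case", "Case", "Control", "Control", "Control"),
  Abundance = c(200, 180, 250, 150, 160, 140)
)

# Apply a summary function to Abundance within each Group
group_means <- tapply(toy_long$Abundance, toy_long$Group, mean)
group_medians <- tapply(toy_long$Abundance, toy_long$Group, median)

group_means
group_medians
```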

3.2 PCA

PCA (Principal Component Analysis) is a statistical technique that simplifies complex multidimensional datasets by reducing the number of variables while preserving as much information as possible.
Indeed, this method is based on transforming the original variables into new, uncorrelated variables, called principal components, which successively capture the greatest possible variance in the data. So, it is often used to make data easier to explore and visualize.
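
These two properties can be checked on a small simulated dataset. The sketch below builds three made-up variables, two of which are strongly correlated, runs prcomp(), and verifies that the components have decreasing standard deviations and are uncorrelated with each other. The variable names and the seed are arbitrary.

```r
set.seed(1)
n <- 30
x1 <- rnorm(n)

# Three simulated variables: var_b is strongly correlated with var_a
toy <- cbind(
  var_a = x1,
  var_b = x1 + rnorm(n, sd = 0.3),
  var_c = rnorm(n)
)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Standard deviations of the components, in decreasing order
toy_pca$sdev

# Correlation between the component scores: close to the identity matrix
round(cor(toy_pca$x), 10)
```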

Question 7 : Here we will conduct a PCA and plot the results.

7.1 For the PCA, we will use the prcomp() function to create pca_res. This function expects a table with no NA, with rows as “individuals” (the samples) and columns as variables (the proteins). Use the t() function to transpose the abundance data without NAs (abundance_data_noNA) and pass this transposed table to prcomp() with the scale. argument set to TRUE.

7.2 Explore the PCA object created

7.3 Extract the x from the PCA object as a dataframe. Add information about the batch and the group in this dataframe.

7.4 Visualise the samples in PC1 and PC2 in a plot, color the points by batch

7.5 Visualise the samples in PC3 and PC4 in a plot, color the points by batch

7.6 Visualise the samples in PC1 and PC2 in a plot, color the points by Group

7.7 You can compute the proportion of variance explained by each PC with
\(\text{var_explained} = \frac{(\text{pca_res\$sdev})^2}{\sum (\text{pca_res\$sdev})^2}\). Add this information to the x and y axis labels of the plots you made before. Round the values for readability.

7.8 A usual graphical representation of the percentage of variance explained by each PC is the scree plot : it is simply a barplot of the values per PC. Try to create this graph with the var_explained object you created.

Question 7
Your turn
Hints
# 7.1 Create abundance_t from abundance_data_noNA with t() function
#     Use the prcomp() function and the abundance_t object to create pca_res 

# 7.2 Look at the Environment panel, try to access the elements in pca_res

# 7.3 The x object has samples as rows, 
#     you can then easily add information from the filtered sample metadata


# 7.4 Use ggplot() and geom_point() with x = PC1, y = PC2, color = Batch

# 7.5 and 7.6 same as 7.4


# 7.7 Create var_explained as explained in the question, 
#     explore this object (vector), 
#     the first element of the vector is the % of variance explained by 
#     the first PC : PC1. the 2nd element of the vector is the % of variance
#     explained by the 2nd PC : PC2. 
#     From this, add this information in the x and y parameters from the labs() of the ggplot 


# 7.8 Create a dataframe var_df with the variance information data.frame().
#     Add the PC name information to var_df. 
#     Order the var_df$PC column as factor (factor()) with levels being 
#     the order of the PC from 1 to 19. Then, create the barplot.
Complete answer
# 7.1 Transpose data and create your PCA object with scaling
abundance_t <- t(abundance_data_noNA)
pca_res <- prcomp(abundance_t, scale. = TRUE)
# A PCA is sensitive to the scale of the data, meaning that a variable with high
# values will have more impact in the creation of the PCs.
# To give the same importance to every variable (protein), we standardize the data:
# we center and scale each variable so that it has a mean of 0
# and a standard deviation of 1.

# 7.2 pca_res is the PCA object created, you can see in the Environment panel that it
# is a list of 5 objects (x, center, scale, sdev, and rotation). You can look at the
# help of the function (section "Value") to have more information on these elements.


# 7.3 Extract the x element from pca_res: the coordinates of the samples (PCA scores)
pca_df <- as.data.frame(pca_res$x)
# Add the metadata (its rows are in the same order as the rows of pca_df)
pca_df$Batch <- filtered_sample_metadata$Batch
pca_df$Group <- filtered_sample_metadata$Group


# 7.4 Create a plot of PC1 and PC2 colored by Batch
ggplot(pca_df, aes(x = PC1, y = PC2, color = Batch)) +
  geom_point(size = 3) +
  labs(title = "PCA - colored by Batch") +
  theme_minimal()

# 7.5 Create a plot of PC3 and PC4 colored by Batch
ggplot(pca_df, aes(x = PC3, y = PC4, color = Batch)) +
  geom_point(size = 3) +
  labs(title = "PCA - colored by Batch") +
  theme_minimal()

# 7.6 Create a plot of PC1 and PC2 colored by Group
ggplot(pca_df, aes(x = PC1, y = PC2, color = Group)) +
  geom_point(size = 3) +
  labs(title = "PCA - colored by Group") +
  theme_minimal()

# 7.7 Variance explained by each PC
var_explained <- (pca_res$sdev^2) / sum(pca_res$sdev^2)

# This is useful information to display on your graph
pc1_var <- round(var_explained[1] * 100, 1)
pc2_var <- round(var_explained[2] * 100, 1)

ggplot(pca_df, aes(x = PC1, y = PC2, color = Group)) +
  geom_point(size = 3) +
  labs(
    title = "PCA of proteomic data",
    x = paste0("PC1 (", pc1_var, "%)"),
    y = paste0("PC2 (", pc2_var, "%)")
  ) +
  theme_minimal()

# 7.8 Scree plot : add the PC name information to var_explained
var_df <- data.frame(
  PC = paste0("PC", 1:length(var_explained)),
  Variance = var_explained * 100
)
# Order them as factors to have them in the right order
var_df$PC=factor(var_df$PC, levels=var_df$PC)

# Create the barplot
ggplot(var_df, aes(x = PC, y = Variance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(
    title = "Variance explained by each component",
    x = "Principal components",
    y = "Percentage of variance explained (%)"
  ) +
  theme_minimal()

We could then interpret these graphs. For instance, we can see that PC1 explains 11.1% of the total variance of our data. It separates the samples of the 2 groups (Case and Control). It may also reveal a batch effect: the samples from batches 1 and 2 belong to the same group but are still slightly separated along PC1.
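
To make the variance computation of question 7.7 concrete, the sketch below applies it to a small simulated matrix (10 made-up samples, 4 made-up proteins). It also computes the cumulative proportion, which shows how many PCs are needed to reach a given share of the total variance.

```r
set.seed(42)

# Simulated table: 10 samples (rows) x 4 proteins (columns)
toy <- matrix(rnorm(10 * 4), nrow = 10,
              dimnames = list(paste0("Sample_", 1:10), paste0("Prot_", 1:4)))

toy_pca <- prcomp(toy, scale. = TRUE)

# Proportion of variance explained by each PC: the proportions sum to 1
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(100 * var_explained, 1)

# Cumulative proportion: the last value is 100%
cum_var <- cumsum(var_explained)
round(100 * cum_var, 1)

# summary() reports the same proportions, rounded
toy_summary <- summary(toy_pca)$importance["Proportion of Variance", ]
toy_summary
```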

4 Session information

It is always important to have access to the version of R and of the packages we used to generate results in a report.

Question 8 : Do you remember which function gives you this information in R ?

Question 8
Your turn
Complete answer

You can check the packages displayed to see if there are some you recognise. Our session info might be longer than yours.

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.2 (2024-10-31)
##  os       Ubuntu 24.04.1 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  fr_FR.UTF-8
##  ctype    fr_FR.UTF-8
##  tz       Europe/Paris
##  date     2025-10-10
##  pandoc   3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
##  quarto   1.5.57 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/quarto
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package         * version date (UTC) lib source
##  bslib             0.9.0   2025-01-30 [1] CRAN (R 4.4.2)
##  cachem            1.1.0   2024-05-16 [1] CRAN (R 4.4.2)
##  cli               3.6.5   2025-04-23 [1] CRAN (R 4.4.2)
##  devtools          2.4.6   2025-10-03 [1] CRAN (R 4.4.2)
##  digest            0.6.37  2024-08-19 [1] CRAN (R 4.4.2)
##  dplyr           * 1.1.4   2023-11-17 [1] CRAN (R 4.4.2)
##  ellipsis          0.3.2   2021-04-29 [1] CRAN (R 4.4.2)
##  evaluate          1.0.5   2025-08-27 [1] CRAN (R 4.4.2)
##  farver            2.1.2   2024-05-13 [1] CRAN (R 4.4.2)
##  fastmap           1.2.0   2024-05-15 [1] CRAN (R 4.4.2)
##  fs                1.6.6   2025-04-12 [1] CRAN (R 4.4.2)
##  generics          0.1.4   2025-05-09 [1] CRAN (R 4.4.2)
##  ggplot2         * 4.0.0   2025-09-11 [1] CRAN (R 4.4.2)
##  glue              1.8.0   2024-09-30 [1] CRAN (R 4.4.2)
##  gtable            0.3.6   2024-10-25 [1] CRAN (R 4.4.2)
##  htmltools         0.5.8.1 2024-04-04 [1] CRAN (R 4.4.2)
##  jquerylib         0.1.4   2021-04-26 [1] CRAN (R 4.4.2)
##  jsonlite          2.0.0   2025-03-27 [1] CRAN (R 4.4.2)
##  knitr             1.50    2025-03-16 [1] CRAN (R 4.4.2)
##  labeling          0.4.3   2023-08-29 [1] CRAN (R 4.4.2)
##  lifecycle         1.0.4   2023-11-07 [1] CRAN (R 4.4.2)
##  magrittr          2.0.4   2025-09-12 [1] CRAN (R 4.4.2)
##  memoise           2.0.1   2021-11-26 [1] CRAN (R 4.4.2)
##  pillar            1.11.1  2025-09-17 [1] CRAN (R 4.4.2)
##  pkgbuild          1.4.8   2025-05-26 [1] CRAN (R 4.4.2)
##  pkgconfig         2.0.3   2019-09-22 [1] CRAN (R 4.4.2)
##  pkgload           1.4.1   2025-09-23 [1] CRAN (R 4.4.2)
##  purrr             1.1.0   2025-07-10 [1] CRAN (R 4.4.2)
##  R6                2.6.1   2025-02-15 [1] CRAN (R 4.4.2)
##  RColorBrewer      1.1-3   2022-04-03 [1] CRAN (R 4.4.2)
##  remotes           2.5.0   2024-03-17 [1] CRAN (R 4.4.2)
##  rlang             1.1.6   2025-04-11 [1] CRAN (R 4.4.2)
##  rmarkdown         2.30    2025-09-28 [1] CRAN (R 4.4.2)
##  rstudioapi        0.17.1  2024-10-22 [1] CRAN (R 4.4.2)
##  S7                0.2.0   2024-11-07 [1] CRAN (R 4.4.2)
##  sass              0.4.10  2025-04-11 [1] CRAN (R 4.4.2)
##  scales            1.4.0   2025-04-24 [1] CRAN (R 4.4.2)
##  sessioninfo       1.2.3   2025-02-05 [1] CRAN (R 4.4.2)
##  templatebilille   0.1.0   2025-02-11 [1] local
##  tibble            3.3.0   2025-06-08 [1] CRAN (R 4.4.2)
##  tidyr           * 1.3.1   2024-01-24 [1] CRAN (R 4.4.2)
##  tidyselect        1.2.1   2024-03-11 [1] CRAN (R 4.4.2)
##  usethis           3.2.1   2025-09-06 [1] CRAN (R 4.4.2)
##  utf8              1.2.6   2025-06-08 [1] CRAN (R 4.4.2)
##  vctrs             0.6.5   2023-12-01 [1] CRAN (R 4.4.2)
##  withr             3.0.2   2024-10-28 [1] CRAN (R 4.4.2)
##  xfun              0.53    2025-08-19 [1] CRAN (R 4.4.2)
##  yaml              2.3.10  2024-07-26 [1] CRAN (R 4.4.2)
## 
##  [1] /home/oriane/R/x86_64-pc-linux-gnu-library/4.4
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
##  * ── Packages attached to the search path.
## 
## ──────────────────────────────────────────────────────────────────────────────
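
If devtools is not installed, base R provides sessionInfo(), which reports similar information. The sketch below shows how to print it and how to read a few of its elements. The output depends on your own installation, so none is shown here.

```r
# Base R equivalent: no extra package needed
info <- sessionInfo()

# Version of R used for the session
info$R.version$version.string

# Names of the attached non-base packages (NULL if there are none)
names(info$otherPkgs)

# Version of a single package, here one that ships with R
stats_version <- packageVersion("stats")
stats_version
```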