

TP on clinical data

This TP was created in the context of the R Bilille training program in 2025.

1 Abstract

This document aims to propose an analysis based on clinical data.
We retrieved a dataset from https://archive.ics.uci.edu and explored it with R.

To complete this practical session, you can write either a basic R script or an Rmd document that will produce a PDF or HTML report.
Throughout the session, feel free to manipulate the data as you like, even if that means departing from the questions. Try things and ask questions!

2 Importing and exploring the data

2.1 Importing data

We are going to use one table for this analysis :

  • obesity_with_na.csv which contains clinical data mainly on obese individuals

Details of the different variables are described here.
There were no missing values in the original data table; we added some for this exercise.

Question 1 :

1.1 Import the table. NB: all character variables must be imported as factors.

1.2 How many individuals and variables are in the table?

1.3 Give an overall summary of the table.

1.4 What is the distribution of gender? And of smokers? And of smokers across genders?

Hints
#' 1.1 To import tables, use the import button or the `read.csv()` function.
#'     Don't forget to tick the 'Strings as factors' box

#' 1.2 The number of individuals is the number of rows, and the number of variables is the number of columns in the table.

#' 1.3 Use the `summary` function.

#' 1.4 Here you can use the `table()` function on Gender, on SMOKE, and on the combination of both variables.
Complete answer
## 1.1
# The `stringsAsFactors` option makes R treat every character variable as a factor
obesity_with_na = read.csv("./obesity_with_na.csv", 
                           stringsAsFactors=TRUE)    


## 1.2 
dim(obesity_with_na) 
## [1] 2111   17
# Let's print a sentence mixing text and numbers using the paste() function
# First, save the dimensions into a vector
dimensions = dim(obesity_with_na) # this is now a vector with two values
# Second, pass them to paste(), which concatenates its arguments into one character string
print(paste("There are", dimensions[1], "individuals and", dimensions[2], "variables in our dataset."))
## [1] "There are 2111 individuals and 17 variables in our dataset."
## 1.3 
summary(obesity_with_na)
##     Gender          Age            Height          Weight      
##  Female:1040   Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Male  :1068   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.62  
##  NA's  :   3   Median :22.78   Median :1.701   Median : 83.00  
##                Mean   :24.32   Mean   :1.702   Mean   : 86.62  
##                3rd Qu.:26.00   3rd Qu.:1.769   3rd Qu.:107.54  
##                Max.   :61.00   Max.   :1.980   Max.   :173.00  
##                NA's   :6       NA's   :10      NA's   :5       
##  family_history_with_overweight   FAVC           FCVC            NCP       
##  no  : 383                      no  : 245   Min.   :1.000   Min.   :1.000  
##  yes :1719                      yes :1860   1st Qu.:2.000   1st Qu.:2.659  
##  NA's:   9                      NA's:   6   Median :2.381   Median :3.000  
##                                             Mean   :2.419   Mean   :2.685  
##                                             3rd Qu.:3.000   3rd Qu.:3.000  
##                                             Max.   :3.000   Max.   :4.000  
##                                             NA's   :8       NA's   :2      
##          CAEC       SMOKE           CH2O         SCC            FAF        
##  Always    :  52   no  :2060   Min.   :1.000   no  :2014   Min.   :0.0000  
##  Frequently: 242   yes :  44   1st Qu.:1.585   yes :  95   1st Qu.:0.1245  
##  no        :  51   NA's:   7   Median :2.000   NA's:   2   Median :1.0000  
##  Sometimes :1761               Mean   :2.008               Mean   :1.0099  
##  NA's      :   5               3rd Qu.:2.480               3rd Qu.:1.6667  
##                                Max.   :3.000               Max.   :3.0000  
##                                NA's   :4                   NA's   :4       
##       TUE                 CALC                        MTRANS    
##  Min.   :0.0000   Always    :   1   Automobile           : 457  
##  1st Qu.:0.0000   Frequently:  69   Bike                 :   7  
##  Median :0.6253   no        : 635   Motorbike            :  11  
##  Mean   :0.6576   Sometimes :1395   Public_Transportation:1571  
##  3rd Qu.:1.0000   NA's      :  11   Walking              :  56  
##  Max.   :2.0000                     NA's                 :   9  
##  NA's   :5                                                      
##                NObeyesdad 
##  Obesity_Type_I     :349  
##  Obesity_Type_III   :324  
##  Obesity_Type_II    :296  
##  Overweight_Level_I :288  
##  Overweight_Level_II:288  
##  (Other)            :557  
##  NA's               :  9
## 1.4
table(obesity_with_na$Gender) # Contingency table for Gender variable
## 
## Female   Male 
##   1040   1068
table(obesity_with_na$SMOKE)  # Contingency table for SMOKE variable
## 
##   no  yes 
## 2060   44
table(obesity_with_na$Gender, obesity_with_na$SMOKE)  # Contingency table for Gender x SMOKE variables
##         
##            no  yes
##   Female 1022   15
##   Male   1035   29
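
The counts above do not always add up to 2111 because `table()` drops missing values by default. The following is a minimal sketch on a made-up five-row table (the CSV content and the name `toy` are illustrative assumptions, not part of the obesity dataset) showing the effect of `stringsAsFactors` and of the `useNA` argument:

```r
# Toy CSV content, invented for illustration; empty fields stand for missing values
csv_text <- "Gender,SMOKE,Age
Female,no,21
Male,yes,35
,no,28
Male,,40
Female,no,19"

toy <- read.csv(text = csv_text,
                stringsAsFactors = TRUE,  # character columns become factors
                na.strings = "")          # treat empty fields as NA

sapply(toy, class)                    # Gender and SMOKE are factors, Age is integer

table(toy$Gender)                     # NAs are dropped by default
table(toy$Gender, useNA = "ifany")    # adds an <NA> entry when NAs exist
table(toy$Gender, toy$SMOKE, useNA = "ifany")
```

With `useNA = "ifany"`, the totals match the number of rows again, which makes it easier to spot how many individuals are left out of a contingency table.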

2.2 Dealing with missing values

Here, we are aiming to have a table with no missing values.

Question 2 :

2.1 Load tidyverse package.

2.2 Create a new table from your table, keeping only the rows without NAs.

2.3 How many individuals are in this new table ?

Hints
#' 2.1 Use `library` to load a package.

#' 2.2 There are many ways to do this (`na.omit()`, `complete.cases()` or, in the tidyverse, `tidyr::drop_na()`, for example)

#' 2.3 The number of individuals is the number of rows (as in Q 1.2)
Complete answer
## 2.1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 2.2
obesity_without_na <- obesity_with_na %>% drop_na()
# Or
obesity_without_na <- obesity_with_na[complete.cases(obesity_with_na), ]
# Or
obesity_without_na <- subset(obesity_with_na,complete.cases(obesity_with_na))
# Or
obesity_without_na <- obesity_with_na %>% na.omit()


## 2.3
n_ind <- dim(obesity_without_na)[1]  # Using the first element of the vector returned by `dim`
# Or
n_ind <- nrow(obesity_without_na) # Using directly the function `nrow` which returns the number of rows

# To print the number of individuals in a sentence:
print(paste("There are", n_ind, "individuals with no missing values."))
## [1] "There are 2006 individuals with no missing values."
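
Before dropping rows, it can be useful to see where the missing values are. The sketch below uses a small made-up data frame (the values and the name `toy` are assumptions for illustration) to count NAs per variable and to check that two of the base R approaches keep the same rows:

```r
# Toy data frame with a few missing values (invented for illustration)
toy <- data.frame(
  Height = c(1.62, NA, 1.80, 1.75),
  Weight = c(64, 56, NA, 87),
  SMOKE  = factor(c("no", "yes", "no", NA))
)

colSums(is.na(toy))           # number of NAs in each variable
sum(!complete.cases(toy))     # number of rows with at least one NA

# Two base R ways of dropping incomplete rows
a <- na.omit(toy)
b <- toy[complete.cases(toy), ]

nrow(a)                               # rows left after filtering
identical(rownames(a), rownames(b))   # both approaches keep the same rows
```

Note that a single NA in any column removes the whole row, so a table can shrink quickly when missing values are spread over many variables.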

2.3 Add variables

Here we want to add a variable to our table.
We work on data without NAs.

Question 3 :

3.1 Calculate BMI and add it to the table.

3.2 Make a summary of the new variable.

3.3 Create a sub table with only patients whose BMI is greater than 25.

3.4 Make a summary of this new sub table.

Hints
#' 3.1 BMI = weight / (height ** 2)

#' 3.2 Use the `summary` function on the new variable

#' 3.3 You can use `subset()`, `dplyr::filter()` or boolean indexing

#' 3.4 Use the `summary` function on the new table
Complete answer
## 3.1 
# Using the $ to access the variable
obesity_without_na$BMI = obesity_without_na$Weight / (obesity_without_na$Height ** 2)
# Or
# Using the [,] to access the variable
obesity_without_na[,"BMI"] = obesity_without_na[,"Weight"] / (obesity_without_na[,"Height"] ** 2)
# Or
# Using the `dplyr` package
obesity_without_na = obesity_without_na %>% dplyr::mutate(BMI = Weight / (Height ** 2))


## 3.2
# The `summary` function can also be applied to a single variable
summary(obesity_without_na$BMI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   24.34   28.78   29.72   36.05   50.81
# Or
obesity_without_na$BMI %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   24.34   28.78   29.72   36.05   50.81
## 3.3
# Using `subset` function
obese_individuals = subset(obesity_without_na, BMI > 25)
# Or
# Using the `dplyr` package
obese_individuals = obesity_without_na %>% filter(BMI > 25)
# Or
# Using boolean indexing
obese_individuals = obesity_without_na[obesity_without_na$BMI > 25,]


## 3.4
summary(obese_individuals)
##     Gender         Age            Height          Weight      
##  Female:682   Min.   :15.00   Min.   :1.456   Min.   : 55.52  
##  Male  :782   1st Qu.:21.01   1st Qu.:1.641   1st Qu.: 80.13  
##               Median :24.00   Median :1.710   Median : 97.43  
##               Mean   :25.65   Mean   :1.710   Mean   : 98.01  
##               3rd Qu.:28.87   3rd Qu.:1.774   3rd Qu.:112.73  
##               Max.   :56.00   Max.   :1.980   Max.   :173.00  
##                                                               
##  family_history_with_overweight  FAVC           FCVC            NCP       
##  no :  95                       no : 108   Min.   :1.000   Min.   :1.000  
##  yes:1369                       yes:1356   1st Qu.:2.000   1st Qu.:2.570  
##                                            Median :2.356   Median :3.000  
##                                            Mean   :2.427   Mean   :2.639  
##                                            3rd Qu.:3.000   3rd Qu.:3.000  
##                                            Max.   :3.000   Max.   :4.000  
##                                                                           
##          CAEC      SMOKE           CH2O        SCC            FAF        
##  Always    :  14   no :1436   Min.   :1.000   no :1429   Min.   :0.0000  
##  Frequently:  36   yes:  28   1st Qu.:1.679   yes:  35   1st Qu.:0.1124  
##  no        :  37              Median :2.003              Median :0.9499  
##  Sometimes :1377              Mean   :2.074              Mean   :0.9304  
##                               3rd Qu.:2.580              3rd Qu.:1.4771  
##                               Max.   :3.000              Max.   :3.0000  
##                                                                          
##       TUE                   CALC                        MTRANS    
##  Min.   :0.000000   Always    :   0   Automobile           : 351  
##  1st Qu.:0.001628   Frequently:  49   Bike                 :   3  
##  Median :0.548920   no        : 388   Motorbike            :   5  
##  Mean   :0.612998   Sometimes :1027   Public_Transportation:1088  
##  3rd Qu.:0.999451                     Walking              :  17  
##  Max.   :2.000000                                                 
##                                                                   
##                NObeyesdad       BMI       
##  Insufficient_Weight:  0   Min.   :25.00  
##  Normal_Weight      :  0   1st Qu.:27.94  
##  Obesity_Type_I     :336   Median :32.42  
##  Obesity_Type_II    :286   Mean   :33.38  
##  Obesity_Type_III   :308   3rd Qu.:37.83  
##  Overweight_Level_I :263   Max.   :50.81  
##  Overweight_Level_II:271
# Or
obese_individuals %>% summary()
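
The threshold of 25 used above corresponds to the overweight cut-off in the usual WHO classification, in which obesity starts at 30. As a sketch, assuming those standard cut-offs (18.5, 25 and 30) and a few made-up individuals, `cut()` can turn a numeric BMI into classes:

```r
# Made-up individuals, for illustration only
toy <- data.frame(
  Weight = c(50, 68, 85, 110),        # kg
  Height = c(1.75, 1.70, 1.72, 1.68)  # m
)
toy$BMI <- toy$Weight / toy$Height^2

# Assumed cut-offs: standard WHO classes, not taken from the dataset
toy$BMI_class <- cut(
  toy$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE                      # intervals are [lower, upper)
)

toy
table(toy$BMI_class)
```

The labels here are hypothetical and simpler than the seven levels of `NObeyesdad`; the point is only to show how a numeric variable can be binned into a factor.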

3 Visualization

Here we want to visualize some variables through plots.
We work on data without NAs.

3.1 Histogram

Visualisation of one numeric variable.

Question 4 :

4.1 Plot the distribution of BMI as a histogram.

4.2 Change title and axes labels.

4.3 Add vertical lines for the threshold values 17 and 25.

4.4 Add the density curve to the plot.

Hints
#' 4.1 You can use the `hist` function, or `geom_histogram` with `ggplot`.
#'     You can also use Esquisse, which will generate the ggplot code.

#' 4.2 You can use the options of the `hist` function, or `labs` with `ggplot`.
#'     You can also use Esquisse, which will generate the ggplot code.

#' 4.3 You can use `abline` with `hist`, or `geom_vline` with `ggplot`.

#' 4.4 You can use `lines(density())` with `hist`, or `geom_density` with `ggplot`.
#'     Don't forget to set the parameter `freq` to FALSE when using `hist`
Complete answer
## 4.1
hist(obesity_without_na$BMI, 
     breaks = 30, col = "#D282E6")

# Or 

ggplot(obesity_without_na) +
  aes(x = BMI) +
  geom_histogram(bins = 30L, fill = "#D282E6")

## 4.2
hist(obesity_without_na$BMI, 
     breaks = 30, col = "#D282E6",
     main = "Distribution of BMI",
     xlab = "BMI")

# Or 

ggplot(obesity_without_na) +
  aes(x = BMI) +
  geom_histogram(bins = 30L, fill = "#D282E6") +
  labs(                                        # `labs()` lets you specify labels (title, axes, ...)
    x = "BMI",
    y = "Frequency",
    title = "Distribution of BMI"
  )

## 4.3
hist(obesity_without_na$BMI, 
     breaks = 30, col = "#D282E6",
     main = "Distribution of BMI",
     xlab = "BMI")
abline(v = c(17, 25),                   # Draw both vertical lines in a single call
       col = c("blue", "forestgreen"),  # Give the two lines different colors
       lwd = c(3, 5))                   # Give the two lines different widths

# Or

ggplot(obesity_without_na) +
  aes(x = BMI) +
  geom_histogram(bins = 30L, fill = "#D282E6") +
  labs(
    x = "BMI",
    y = "Frequency",
    title = "Distribution of BMI"
  ) +
  geom_vline(xintercept = c(17, 25),           # Draw both vertical lines in a single call
             colour = c("blue","forestgreen"), # Give the two lines different colors
             linewidth = c(2,3))               # Give the two lines different widths

## 4.4
hist(obesity_without_na$BMI, 
     breaks = 30, col = "#D282E6",
     freq = FALSE,                         # Required so that bars and density curve share the same scale
     main = "Distribution of BMI",
     xlab = "BMI")
abline(v = c(17, 25),                               
       col = c("blue", "forestgreen"),        
       lwd = c(3, 5))
lines(density(obesity_without_na$BMI),
      col = "brown", lwd = 3)

# Or

ggplot(obesity_without_na) +
  aes(x = BMI) +
  geom_histogram(aes(y = after_stat(density)),      # Add this command
                 bins = 30L, fill = "#D282E6") +
  labs(
    x = "BMI",
    y = "Density",
    title = "Distribution of BMI"
  ) +
  
  geom_vline(xintercept = c(17, 25),
             colour = c("blue", "forestgreen"), 
             linewidth = c(2,3)) +
  
  geom_density(colour = "brown", linewidth = 1.4)   # And this one
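
To see why `freq = FALSE` (or `after_stat(density)`) is needed before overlaying a density curve, here is a small sketch on simulated values (the vector `x` is a made-up stand-in for a BMI-like variable, not the real data). On the density scale, the bar areas add up to 1, just like the area under a density curve:

```r
set.seed(1)                               # reproducible simulated values
x <- rnorm(500, mean = 29, sd = 8)        # assumption: arbitrary BMI-like values

h <- hist(x, breaks = 30, plot = FALSE)   # compute the bins without drawing

sum(h$counts)                             # counts add up to the sample size
sum(h$density * diff(h$breaks))           # bar areas add up to 1 on the density scale
```

On the count scale, the bars would be far taller than the density curve, which would then look flat at the bottom of the plot.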

3.2 Scatter plot

Visualisation of two numeric variables.
We work on data without NAs.

Question 5 :

5.1 Use ggplot to create a variable fig containing a plot showing weight as a function of height, coloring the points by the factor NObeyesdad

5.2 Improve the previous fig variable with some labels and title

5.3 Find a way to calculate the centroids of the Height and Weight for each class of NObeyesdad. NB : it corresponds to the mean of Height and Weight in each class of NObeyesdad.

5.4 Add the centroids of each class of NObeyesdad on the fig ggplot created previously.

5.5 Create a new plot with these elements:

  • showing weight as a function of height
  • coloring points depending on BMI
  • shape depending on family_history_with_overweight
  • you can change titles, colors, shapes, etc. if you wish
Hints
#' 5.1 You can use Esquisse, which will generate the ggplot code.
#'     Plots can also be stored as objects, such as `fig = ggplot(...)`

#' 5.2 You can use the previous fig with the ggplot syntax (`fig = fig + labs(...)`)

#' 5.3 You can use the function `group_by()` to group the data according to a factor, then use `summarise(...)`
#'     This creates a new dataframe, which can be assigned to an object.
   
#' 5.4  You can then use your previous fig and add the new points with:
#'      `fig + geom_point()`

#' 5.5 You can use Esquisse, which will generate the ggplot code.
#'     But Esquisse doesn't let you map a shape to a variable.
#'     To map a shape to a variable, you will have to add it manually using the `shape` parameter in the `aes` definition.
#'     In `ggplot2`, the `color` parameter in the `aes` definition also accepts a numeric variable.
#'     You can set the title of the shape legend in the `labs()` command
#'     You can change the shapes using the `scale_shape_manual()` command
Complete answer
## 5.1
fig = ggplot(obesity_without_na) +
  aes(x = Height, y = Weight, color = NObeyesdad) +
  geom_point()


## 5.2
fig = fig + labs(x = "Height (m)", y = "Weight (kg)", color = "Class of obesity")
fig

## 5.3
centroids = obesity_without_na %>% group_by(NObeyesdad) %>% summarise(
  mean_by_categ_height = mean(Height),
  mean_by_categ_weight = mean(Weight)
)


## 5.4
fig = fig + geom_point(
  data = centroids,
  aes(x = mean_by_categ_height, y = mean_by_categ_weight),
  color = "black",     # Draw them in black for greater visibility
  size = 5,            # Increase the size for greater visibility
  shape = 8            # Change the shape for greater visibility
)


## 5.5
ggplot(obesity_without_na) +
  aes(
    x = Height,
    y = Weight,
    colour = BMI,
    shape = family_history_with_overweight  # Change shape depending on a variable
  ) +
  geom_point(size = 2.2) +                  # Change point size in general
  scale_color_distiller(palette = "OrRd", direction = 1) +
  scale_shape_manual(values = c(19, 18)) +  # Change shapes for both modalities of family history
  labs(title = "Weight according to height", shape = "Family history")
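
The centroids of question 5.3 can also be computed without `dplyr`. The sketch below uses base R's `aggregate()` on a made-up table (the data and the column name `class` are assumptions for illustration) to get the mean of two variables within each level of a factor:

```r
# Made-up measurements for two classes (illustration only)
toy <- data.frame(
  class  = factor(c("A", "A", "B", "B")),
  Height = c(1.60, 1.70, 1.80, 1.90),
  Weight = c(50, 60, 90, 100)
)

# Mean of Height and Weight within each class: one row per centroid
centroids_base <- aggregate(cbind(Height, Weight) ~ class, data = toy, FUN = mean)
centroids_base
```

The result is a regular data frame with one row per class, so it can be passed to the `data` argument of `geom_point()` in the same way as the `centroids` table built with `group_by()` and `summarise()`.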

Now what if we want to visualize more than 2 variables?
We’ll have to use statistical techniques such as PCA (Principal Component Analysis).

4 Some statistical analysis

4.1 PCA

PCA (Principal Component Analysis) is a statistical technique which enables us to simplify complex multidimensional datasets by reducing the number of variables while preserving as much information as possible.
Indeed, this method is based on transforming the original variables into new, uncorrelated variables, called principal components, which successively capture the greatest possible variance in the data. So, it is often used to make data easier to explore and visualize.
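
These two properties (uncorrelated components, decreasing variance) can be checked numerically. The sketch below is only an illustration on `USArrests`, a small numeric dataset shipped with R, and not on the obesity data; the object name `pca_demo` is an assumption:

```r
# USArrests ships with R: 50 rows, 4 numeric variables
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variance carried by each component, as a share of the total
variances <- pca_demo$sdev^2
round(variances / sum(variances), 3)

# Correlations between component scores: off-diagonal terms are numerically zero
round(cor(pca_demo$x), 10)
```

The shares of variance come out in decreasing order, which is why the first few components are usually enough for a visual exploration.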

Question 6 : Here we will run a PCA and plot the results.

To perform the PCA, we will use the prcomp() function to create pca_res. This function expects a table with no NA, with rows as “individuals” (the samples) and columns as numeric variables. Call prcomp() with the scale parameter set to TRUE in order to scale the variables and make them comparable.

6.1 Create a data table with no missing values and only numeric variables.
Quickly explore this table to check that only numeric data are present.

6.2 Compute the pca_res object, using the prcomp() function on the data table with no missing values and only numeric variables.

6.3 Explore the pca_res object created.

6.4 Install and load the factoextra package.

6.5 Use the fviz_eig() function to print the scree plot, which represents the percentage of variance explained by each component.

6.6 Use the fviz_pca_ind() function to print individuals. Print only points because there are too many individuals to print the labels.

6.7 On the individuals plot, color the individuals by NObeyesdad.

6.8 On the last plot, change the components to print (3,4) instead of (1,2).

6.9 Use the fviz_pca_var() function to print the graph of variables.

Hints
#' 6.1 To keep only numeric variables, you can combine `is.numeric` with the `map_lgl` function.
#'     Open the table in your environment or run `summary()` on it.

#' 6.2 Don't forget to set the `scale` parameter to `TRUE`

#' 6.3 Look at the Environment panel, try to access the elements in pca_res
#'     You can access the elements of this object using `$`

#' 6.4 You can use `install.packages()` function or click on Packages -> Install. Then use `library()`.

#' 6.6 You can use the `geom` parameter to change the geometry used

#' 6.7 You can use the `col.ind` parameter to change the coloring variable.

#' 6.8 You can use the `axes` parameter to change the printed components
Complete answer
## 6.1
data_for_pca = obesity_without_na[,map_lgl(obesity_without_na, is.numeric)]
View(data_for_pca)
summary(data_for_pca)
##       Age            Height          Weight            FCVC      
##  Min.   :14.00   Min.   :1.456   Min.   : 39.00   Min.   :1.000  
##  1st Qu.:19.90   1st Qu.:1.630   1st Qu.: 65.62   1st Qu.:2.000  
##  Median :22.80   Median :1.701   Median : 83.00   Median :2.378  
##  Mean   :24.34   Mean   :1.702   Mean   : 86.73   Mean   :2.417  
##  3rd Qu.:26.00   3rd Qu.:1.769   3rd Qu.:108.10   3rd Qu.:3.000  
##  Max.   :61.00   Max.   :1.980   Max.   :173.00   Max.   :3.000  
##       NCP             CH2O            FAF              TUE        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.657   1st Qu.:1.605   1st Qu.:0.1393   1st Qu.:0.0000  
##  Median :3.000   Median :2.000   Median :1.0000   Median :0.6221  
##  Mean   :2.685   Mean   :2.015   Mean   :1.0164   Mean   :0.6544  
##  3rd Qu.:3.000   3rd Qu.:2.495   3rd Qu.:1.6685   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :3.000   Max.   :3.0000   Max.   :2.0000  
##       BMI       
##  Min.   :13.00  
##  1st Qu.:24.34  
##  Median :28.78  
##  Mean   :29.72  
##  3rd Qu.:36.05  
##  Max.   :50.81
## 6.2
# `scale.` is the full argument name; `scale = TRUE` also works through partial matching
pca_res = prcomp(data_for_pca, scale. = TRUE)


## 6.3
head(pca_res$x)
##          PC1        PC2         PC3        PC4         PC5        PC6
## 1 -1.4833880 -0.3522440 -0.70065216  0.7180619  0.16048037  1.1810569
## 2 -1.1256790  0.7533163  1.07684458 -3.0984167  0.07254866  0.8296358
## 3 -0.6334807  1.6832944  0.60247341  0.5823536 -0.15601342 -0.4871111
## 4  0.6121253  0.6842018  1.49063830 -1.0252005  0.63352528 -0.6666564
## 5 -0.1367602 -1.0759814  0.02587933  0.3317372 -1.81954242 -0.6486798
## 6 -1.5675642 -1.3029831  1.19159441  0.3999047  0.12424497  1.3559510
##           PC7          PC8         PC9
## 1 -0.16626861 -0.003844692  0.03274228
## 2  0.43108038 -1.743759554  0.08418544
## 3  0.25573240  0.095977078 -0.05401156
## 4 -0.06985795  0.542883721 -0.01944557
## 5 -1.62643112  1.053804892 -0.03602851
## 6 -0.22657870  0.312509072  0.07287356
## 6.4
# install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## 6.5
fviz_eig(pca_res)

## 6.6
fviz_pca_ind(pca_res, geom = "point")

## 6.7
fviz_pca_ind(pca_res, geom = "point",
             col.ind = obesity_without_na$NObeyesdad)

## 6.8
fviz_pca_ind(pca_res, geom = "point",
             col.ind = obesity_without_na$NObeyesdad,
             axes = c(3,4))

## 6.9
fviz_pca_var(pca_res)

We could then interpret these graphs.
For instance, we can see that axes 1 and 2 explain 44.6% of the total variance.
In the individuals plot, the class of obesity appears to be associated with the first component, with low weights on the left and overweight individuals on the right.
This is confirmed by the variables plot, on which we can see that weight and BMI are highly correlated with the first component.

Note that :

  • plots produced with the factoextra package (such as fviz_pca_ind or fviz_pca_var) are built with ggplot2, so you can add options using + ... if you want
  • you can retrieve the results of the PCA in the pca_res object and make your own plots using plot() or ggplot() functions.
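
As a sketch of the second point, the following builds an individuals plot by hand with base `plot()`. It runs on the built-in `iris` dataset rather than on the obesity table, and the names `pca_demo`, `pct_var` and `scores` are illustrative assumptions:

```r
# iris ships with R: 4 numeric variables plus the Species factor
pca_demo <- prcomp(iris[, 1:4], scale. = TRUE)

# Percentage of variance explained by each component
pct_var <- 100 * pca_demo$sdev^2 / sum(pca_demo$sdev^2)
sum(pct_var[1:2])                    # share carried by the first two axes

# Coordinates of the individuals on the components, plus the grouping factor
scores <- data.frame(pca_demo$x, Species = iris$Species)

plot(scores$PC1, scores$PC2,
     col  = as.integer(scores$Species),   # one colour per species
     pch  = 19,
     xlab = sprintf("PC1 (%.1f%%)", pct_var[1]),
     ylab = sprintf("PC2 (%.1f%%)", pct_var[2]),
     main = "Individuals on the first two components")
legend("topright", legend = levels(scores$Species),
       col = seq_along(levels(scores$Species)), pch = 19)
```

The same `sum(pct_var[1:2])` computation, applied to `pca_res`, is one way to recover the share of variance carried by axes 1 and 2.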

4.2 Chi-squared test

To explore some qualitative variables, we’ll perform a \(\chi^2\) test.

A \(\chi^2\) independence test allows us to test, and possibly reject, the hypothesis of independence between two categorical variables.

This test computes a p-value, which tells us whether or not to reject the independence hypothesis: if the p-value is lower than a threshold, commonly set to 0.05, we conclude that the variables are significantly associated.

You can find some explanations of this test here and here.

Note that one condition for performing this test is an expected count of at least 5 in each cell of the contingency table.

Question 7 :

7.1 Print the distribution of Gender across family_history_with_overweight.

7.2 Now print the same table as proportions by Gender: among the Males / Females, what is the proportion of individuals with a family history?

7.3 Compute the chi-squared test comparing these two variables.

7.4 Save the result of this test in an object (called khi_result for example).

7.5 Explore the created object.

Hints
#' 7.1 You can use `table()` function, like in Q 1.4

#' 7.2 You can use the `prop.table()` function on the `table()` function
#'     You can also use the `round()` function to make it more readable

#' 7.3 You can use the `chisq.test()` function

#' 7.5 You can access the elements of this object using `$`
Complete answer
## 7.1
table(obesity_without_na$Gender, obesity_without_na$family_history_with_overweight)
##         
##           no yes
##   Female 218 774
##   Male   146 868
## 7.2 
round(prop.table(
  table(
    obesity_without_na$Gender,
    obesity_without_na$family_history_with_overweight
  ),
  1   # To specify that we want to calculate proportions by row
), 3)
##         
##             no   yes
##   Female 0.220 0.780
##   Male   0.144 0.856
## 7.3
chisq.test(x = obesity_without_na$Gender,
           y = obesity_without_na$family_history_with_overweight)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  obesity_without_na$Gender and obesity_without_na$family_history_with_overweight
## X-squared = 18.877, df = 1, p-value = 1.394e-05
## 7.4
khi_result = chisq.test(x = obesity_without_na$Gender,
                        y = obesity_without_na$family_history_with_overweight)


## 7.5
khi_result$p.value   # To access the p-value
## [1] 1.394052e-05
khi_result$observed  # To access the table of observed counts
##                          
## obesity_without_na$Gender  no yes
##                    Female 218 774
##                    Male   146 868
khi_result$expected  # To access the table of expected counts under the null hypothesis
##                          
## obesity_without_na$Gender      no     yes
##                    Female 180.004 811.996
##                    Male   183.996 830.004

The p-value is lower than 0.05, indicating that Gender and family history are significantly associated at the 5% threshold.
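
To make the test less of a black box, the expected counts and the statistic can be recomputed by hand from the observed table printed above. This is a sketch: the matrix is typed in manually from that output, and the object names are illustrative assumptions:

```r
# Observed counts copied from the Gender x family history table above
observed <- matrix(c(218, 146, 774, 868), nrow = 2,
                   dimnames = list(Gender = c("Female", "Male"),
                                   family_history = c("no", "yes")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 3)
all(expected >= 5)                 # validity condition of the test

# Statistic with Yates' continuity correction, applied by default to 2 x 2 tables
stat <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
c(statistic = stat, p.value = p_value)
```

The values agree with `khi_result$expected`, `X-squared = 18.877` and the p-value reported by `chisq.test()`.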

5 Session info

We’re now at the end of the practical session.
To ensure the reproducibility of the analyses, we would like to keep a record of the version of every package used.

Question 8 :

8.1 Which command would allow you to print the version of every package used in the project?

Hints
#' 8.1 We've talked about it in the slides, in the 'packages' part
Complete answer
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.2 (2024-10-31)
##  os       Ubuntu 20.04.6 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  fr_FR.UTF-8
##  ctype    fr_FR.UTF-8
##  tz       Europe/Paris
##  date     2025-10-11
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package         * version  date (UTC) lib source
##  abind             1.4-8    2024-09-12 [1] CRAN (R 4.4.1)
##  backports         1.5.0    2024-05-23 [1] CRAN (R 4.4.1)
##  broom             1.0.7    2024-09-26 [1] CRAN (R 4.4.1)
##  bslib             0.8.0    2024-07-29 [1] CRAN (R 4.4.1)
##  cachem            1.1.0    2024-05-16 [1] CRAN (R 4.4.1)
##  car               3.1-3    2024-09-27 [1] CRAN (R 4.4.1)
##  carData           3.0-5    2022-01-06 [1] CRAN (R 4.4.1)
##  cli               3.6.5    2025-04-23 [1] CRAN (R 4.4.2)
##  colorspace        2.1-1    2024-07-26 [1] CRAN (R 4.4.1)
##  devtools          2.4.5    2022-10-11 [1] CRAN (R 4.4.1)
##  digest            0.6.37   2024-08-19 [1] CRAN (R 4.4.1)
##  dplyr           * 1.1.4    2023-11-17 [1] CRAN (R 4.4.1)
##  ellipsis          0.3.2    2021-04-29 [1] CRAN (R 4.4.1)
##  evaluate          1.0.1    2024-10-10 [1] CRAN (R 4.4.1)
##  factoextra      * 1.0.7    2020-04-01 [1] CRAN (R 4.4.2)
##  fansi             1.0.6    2023-12-08 [1] CRAN (R 4.4.1)
##  farver            2.1.2    2024-05-13 [1] CRAN (R 4.4.1)
##  fastmap           1.2.0    2024-05-15 [1] CRAN (R 4.4.1)
##  forcats         * 1.0.0    2023-01-29 [1] CRAN (R 4.4.1)
##  Formula           1.2-5    2023-02-24 [1] CRAN (R 4.4.1)
##  fs                1.6.5    2024-10-30 [1] CRAN (R 4.4.1)
##  generics          0.1.3    2022-07-05 [1] CRAN (R 4.4.1)
##  ggplot2         * 3.5.1    2024-04-23 [1] CRAN (R 4.4.1)
##  ggpubr            0.6.0    2023-02-10 [1] CRAN (R 4.4.1)
##  ggrepel           0.9.6    2024-09-07 [1] CRAN (R 4.4.1)
##  ggsignif          0.6.4    2022-10-13 [1] CRAN (R 4.4.1)
##  glue              1.8.0    2024-09-30 [1] CRAN (R 4.4.1)
##  gtable            0.3.6    2024-10-25 [1] CRAN (R 4.4.1)
##  hms               1.1.3    2023-03-21 [1] CRAN (R 4.4.1)
##  htmltools         0.5.8.1  2024-04-04 [1] CRAN (R 4.4.1)
##  htmlwidgets       1.6.4    2023-12-06 [1] CRAN (R 4.4.1)
##  httpuv            1.6.15   2024-03-26 [1] CRAN (R 4.4.1)
##  jquerylib         0.1.4    2021-04-26 [1] CRAN (R 4.4.1)
##  jsonlite          1.8.9    2024-09-20 [1] CRAN (R 4.4.1)
##  knitr             1.49     2024-11-08 [1] CRAN (R 4.4.1)
##  labeling          0.4.3    2023-08-29 [1] CRAN (R 4.4.1)
##  later             1.3.2    2023-12-06 [1] CRAN (R 4.4.1)
##  lifecycle         1.0.4    2023-11-07 [1] CRAN (R 4.4.1)
##  lubridate       * 1.9.4    2024-12-08 [1] CRAN (R 4.4.2)
##  magrittr          2.0.3    2022-03-30 [1] CRAN (R 4.4.1)
##  memoise           2.0.1    2021-11-26 [1] CRAN (R 4.4.1)
##  mime              0.12     2021-09-28 [1] CRAN (R 4.4.1)
##  miniUI            0.1.1.1  2018-05-18 [1] CRAN (R 4.4.1)
##  munsell           0.5.1    2024-04-01 [1] CRAN (R 4.4.1)
##  pillar            1.9.0    2023-03-22 [1] CRAN (R 4.4.1)
##  pkgbuild          1.4.5    2024-10-28 [1] CRAN (R 4.4.1)
##  pkgconfig         2.0.3    2019-09-22 [1] CRAN (R 4.4.1)
##  pkgload           1.4.0    2024-06-28 [1] CRAN (R 4.4.1)
##  profvis           0.4.0    2024-09-20 [1] CRAN (R 4.4.1)
##  promises          1.3.0    2024-04-05 [1] CRAN (R 4.4.1)
##  purrr           * 1.0.2    2023-08-10 [1] CRAN (R 4.4.1)
##  R6                2.5.1    2021-08-19 [1] CRAN (R 4.4.1)
##  RColorBrewer      1.1-3    2022-04-03 [1] CRAN (R 4.4.1)
##  Rcpp              1.0.13-1 2024-11-02 [1] CRAN (R 4.4.1)
##  readr           * 2.1.5    2024-01-10 [1] CRAN (R 4.4.2)
##  remotes           2.5.0    2024-03-17 [1] CRAN (R 4.4.1)
##  rlang             1.1.6    2025-04-11 [1] CRAN (R 4.4.2)
##  rmarkdown         2.29     2024-11-04 [1] CRAN (R 4.4.1)
##  rstatix           0.7.2    2023-02-01 [1] CRAN (R 4.4.1)
##  rstudioapi        0.17.1   2024-10-22 [1] CRAN (R 4.4.2)
##  sass              0.4.9    2024-03-15 [1] CRAN (R 4.4.1)
##  scales            1.3.0    2023-11-28 [1] CRAN (R 4.4.1)
##  sessioninfo       1.2.2    2021-12-06 [1] CRAN (R 4.4.1)
##  shiny             1.9.1    2024-08-01 [1] CRAN (R 4.4.1)
##  stringi           1.8.4    2024-05-06 [1] CRAN (R 4.4.1)
##  stringr         * 1.5.1    2023-11-14 [1] CRAN (R 4.4.1)
##  templatebilille   0.1.0    2024-03-01 [1] local
##  tibble          * 3.2.1    2023-03-20 [1] CRAN (R 4.4.1)
##  tidyr           * 1.3.1    2024-01-24 [1] CRAN (R 4.4.1)
##  tidyselect        1.2.1    2024-03-11 [1] CRAN (R 4.4.1)
##  tidyverse       * 2.0.0    2023-02-22 [1] CRAN (R 4.4.2)
##  timechange        0.3.0    2024-01-18 [1] CRAN (R 4.4.2)
##  tzdb              0.5.0    2025-03-15 [1] CRAN (R 4.4.2)
##  urlchecker        1.0.1    2021-11-30 [1] CRAN (R 4.4.1)
##  usethis           3.0.0    2024-07-29 [1] CRAN (R 4.4.1)
##  utf8              1.2.4    2023-10-22 [1] CRAN (R 4.4.1)
##  vctrs             0.6.5    2023-12-01 [1] CRAN (R 4.4.1)
##  withr             3.0.2    2024-10-28 [1] CRAN (R 4.4.1)
##  xfun              0.49     2024-10-31 [1] CRAN (R 4.4.1)
##  xtable            1.8-4    2019-04-21 [1] CRAN (R 4.4.1)
##  yaml              2.3.10   2024-07-26 [1] CRAN (R 4.4.1)
## 
##  [1] /home/estelle/R/x86_64-pc-linux-gnu-library/4.4
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────
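
`devtools::session_info()` requires the devtools package to be installed. As a sketch of an alternative that needs no extra package, base R provides `sessionInfo()` and `packageVersion()` (the object name `si` is an assumption):

```r
# Base R alternative: no extra package required
si <- sessionInfo()
si$R.version$version.string        # R version used for the analysis

# Version of a single package (stats ships with R, so it is always available)
packageVersion("stats")

# Names of the add-on packages currently attached with library(), if any
names(si$otherPkgs)
```

The output is less detailed than the listing above (no installation date or source), but it is usually enough to record at the end of a report.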