Practical session on clinical data
This practical session was created in the context of the R Bilille training program in 2025.
This document aims to propose an analysis based on clinical
data.
We retrieved a dataset from the following website: https://archive.ics.uci.edu, and explored it with R.
To complete this practical session, you can either write a basic
R script or an Rmd document that will generate a
PDF or HTML report.
Throughout this session, feel free to manipulate the data as you wish, even if
that means straying from the questions! Try things out and ask questions.
We are going to use one table for this analysis:
obesity_with_na.csv, which contains clinical data mainly
on obese individuals. Details of the different variables are described here.
There were no missing values in the original data table; we added some
for this exercise.
Question 1:
1.1 Import the table. NB: all character variables in this table are to be treated as factors.
1.2 How many individuals and variables are in the table?
1.3 Give an overall summary of the table.
1.4 What is the distribution of
gender? And the distribution of smokers?
And the distribution of smokers across
genders?
#' 1.1 To import tables, use the import button or the `read.csv()` function.
#' Don't forget to tick the 'Strings as factors' box
#' 1.2 The number of individuals is the number of rows, and the number of variables is the number of columns in the table.
#' 1.3 Use the `summary` function.
#' 1.4 Here you can use the `table()` function on Gender, on SMOKE, and on the combination of both variables.
## 1.1
# The `stringsAsFactors` option converts every character variable to a factor
obesity_with_na = read.csv("./obesity_with_na.csv",
stringsAsFactors=TRUE)
## 1.2
dim(obesity_with_na)
## [1] 2111 17
# Let's print a sentence with text and numbers using the paste() function
# First, save the dimensions in a vector
dimensions = dim(obesity_with_na) # this is now a vector with two values
# Second, pass it to the paste() function, which concatenates character strings
print(paste("There are", dimensions[1], "individuals and", dimensions[2], "variables in our dataset."))
## [1] "There are 2111 individuals and 17 variables in our dataset."
## 1.3
summary(obesity_with_na)
## Gender Age Height Weight
## Female:1040 Min. :14.00 Min. :1.450 Min. : 39.00
## Male :1068 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.62
## NA's : 3 Median :22.78 Median :1.701 Median : 83.00
## Mean :24.32 Mean :1.702 Mean : 86.62
## 3rd Qu.:26.00 3rd Qu.:1.769 3rd Qu.:107.54
## Max. :61.00 Max. :1.980 Max. :173.00
## NA's :6 NA's :10 NA's :5
## family_history_with_overweight FAVC FCVC NCP
## no : 383 no : 245 Min. :1.000 Min. :1.000
## yes :1719 yes :1860 1st Qu.:2.000 1st Qu.:2.659
## NA's: 9 NA's: 6 Median :2.381 Median :3.000
## Mean :2.419 Mean :2.685
## 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000
## NA's :8 NA's :2
## CAEC SMOKE CH2O SCC FAF
## Always : 52 no :2060 Min. :1.000 no :2014 Min. :0.0000
## Frequently: 242 yes : 44 1st Qu.:1.585 yes : 95 1st Qu.:0.1245
## no : 51 NA's: 7 Median :2.000 NA's: 2 Median :1.0000
## Sometimes :1761 Mean :2.008 Mean :1.0099
## NA's : 5 3rd Qu.:2.480 3rd Qu.:1.6667
## Max. :3.000 Max. :3.0000
## NA's :4 NA's :4
## TUE CALC MTRANS
## Min. :0.0000 Always : 1 Automobile : 457
## 1st Qu.:0.0000 Frequently: 69 Bike : 7
## Median :0.6253 no : 635 Motorbike : 11
## Mean :0.6576 Sometimes :1395 Public_Transportation:1571
## 3rd Qu.:1.0000 NA's : 11 Walking : 56
## Max. :2.0000 NA's : 9
## NA's :5
## NObeyesdad
## Obesity_Type_I :349
## Obesity_Type_III :324
## Obesity_Type_II :296
## Overweight_Level_I :288
## Overweight_Level_II:288
## (Other) :557
## NA's : 9
## 1.4
table(obesity_with_na$Gender) # Contingency table for Gender variable
##
## Female Male
## 1040 1068
table(obesity_with_na$SMOKE) # Contingency table for SMOKE variable
##
## no yes
## 2060 44
table(obesity_with_na$Gender, obesity_with_na$SMOKE) # Contingency table for Gender x SMOKE variables
##
## no yes
## Female 1022 15
## Male 1035 29
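Note that the Female and Male counts above add up to 2108 rather than 2111, because `table()` leaves out missing values by default. The following minimal sketch, on a small made-up data frame (`toy` and its values are illustrative assumptions, not the real dataset), shows how the `useNA` argument makes the NAs visible:

```r
# Toy data frame standing in for obesity_with_na (values are invented)
toy <- data.frame(
  Gender = factor(c("Female", "Male", NA, "Male", "Female")),
  SMOKE  = factor(c("no", "yes", "no", NA, "no"))
)

# Default behaviour: rows with a missing Gender are not counted
counts_default <- table(toy$Gender)

# useNA = "ifany" adds an <NA> cell when missing values are present
counts_with_na <- table(toy$Gender, useNA = "ifany")

print(counts_default)
print(counts_with_na)

# The same argument works for a two-way table
print(table(toy$Gender, toy$SMOKE, useNA = "ifany"))
```

Using `useNA = "ifany"` while exploring a table is a cheap way to notice missing values before they silently disappear from the counts.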
Here, we aim to obtain a table with no missing values.
Question 2:
2.1 Load the tidyverse package.
2.2 Create a new table from your table, keeping only the rows without NAs.
2.3 How many individuals are in this new table?
#' 2.1 Use `library` to load a package.
#' 2.2 There are many ways to do this (`na.omit()`, `complete.cases()` or `tidyr::drop_na()`, for example)
#' 2.3 The number of individuals is the number of rows (as in question 1.2)
## 2.1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 2.2
obesity_without_na <- obesity_with_na %>% drop_na()
# Or
obesity_without_na <- obesity_with_na[complete.cases(obesity_with_na), ]
# Or
obesity_without_na <- subset(obesity_with_na,complete.cases(obesity_with_na))
# Or
obesity_without_na <- obesity_with_na %>% na.omit()
## 2.3
n_ind <- dim(obesity_without_na)[1] # Using the first element of the vector returned by `dim`
# Or
n_ind <- nrow(obesity_without_na) # Using directly the function `nrow` which returns the number of rows
# To print the number of individuals in a sentence:
print(paste("There are", n_ind, "individuals with no missing values."))
## [1] "There are 2006 individuals with no missing values."
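Before dropping rows, it can help to see where the missing values are. The sketch below uses a small invented data frame (`toy` is an assumption made for illustration) and base R only: `colSums(is.na())` counts the NAs per column, and filtering with `complete.cases()` keeps the same rows as `na.omit()`.

```r
# Toy data frame with one missing value per column (values are invented)
toy <- data.frame(
  Age    = c(21, NA, 35, 40),
  Weight = c(60, 72, NA, 88),
  Gender = factor(c("Female", "Male", "Male", NA))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Two equivalent ways of keeping only the complete rows
kept_cc   <- toy[complete.cases(toy), ]
kept_omit <- na.omit(toy)

print(nrow(kept_cc))
print(nrow(kept_omit))
```

Keep in mind that a row is dropped as soon as one of its variables is missing, so the number of rows removed can be larger than the number of NAs in any single column.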
Here we want to add a variable to our table.
We work on the data without NAs.
Question 3:
3.1 Calculate the BMI and add it to the
table.
3.2 Make a summary of the new variable.
3.3 Create a sub-table with only the patients whose
BMI is greater than 25.
3.4 Make a summary of this new sub-table.
#' 3.1 BMI = weight / (height ** 2), with weight in kg and height in m
#' 3.2 Use the `summary` function on the new variable
#' 3.3 You can use `subset()`, `dplyr::filter()` or boolean indexing
#' 3.4 Use the `summary` function on the new table
## 3.1
# Using the $ to access the variable
obesity_without_na$BMI = obesity_without_na$Weight / (obesity_without_na$Height ** 2)
# Or
# Using the [,] to access the variable
obesity_without_na[,"BMI"] = obesity_without_na[,"Weight"] / (obesity_without_na[,"Height"] ** 2)
# Or
# Using the `dplyr` package
obesity_without_na = obesity_without_na %>% dplyr::mutate(BMI = Weight / (Height ** 2))
## 3.2
# The `summary()` function can be applied to a single variable
summary(obesity_without_na$BMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 24.34 28.78 29.72 36.05 50.81
# Or
obesity_without_na$BMI %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 24.34 28.78 29.72 36.05 50.81
## 3.3
# Using `subset` function
obese_individuals = subset(obesity_without_na, BMI > 25)
# Or
# Using the `dplyr` package
obese_individuals = obesity_without_na %>% filter(BMI > 25)
# Or
# Using boolean indexing
obese_individuals = obesity_without_na[obesity_without_na$BMI > 25,]
## 3.4
summary(obese_individuals)
## Gender Age Height Weight
## Female:682 Min. :15.00 Min. :1.456 Min. : 55.52
## Male :782 1st Qu.:21.01 1st Qu.:1.641 1st Qu.: 80.13
## Median :24.00 Median :1.710 Median : 97.43
## Mean :25.65 Mean :1.710 Mean : 98.01
## 3rd Qu.:28.87 3rd Qu.:1.774 3rd Qu.:112.73
## Max. :56.00 Max. :1.980 Max. :173.00
##
## family_history_with_overweight FAVC FCVC NCP
## no : 95 no : 108 Min. :1.000 Min. :1.000
## yes:1369 yes:1356 1st Qu.:2.000 1st Qu.:2.570
## Median :2.356 Median :3.000
## Mean :2.427 Mean :2.639
## 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000
##
## CAEC SMOKE CH2O SCC FAF
## Always : 14 no :1436 Min. :1.000 no :1429 Min. :0.0000
## Frequently: 36 yes: 28 1st Qu.:1.679 yes: 35 1st Qu.:0.1124
## no : 37 Median :2.003 Median :0.9499
## Sometimes :1377 Mean :2.074 Mean :0.9304
## 3rd Qu.:2.580 3rd Qu.:1.4771
## Max. :3.000 Max. :3.0000
##
## TUE CALC MTRANS
## Min. :0.000000 Always : 0 Automobile : 351
## 1st Qu.:0.001628 Frequently: 49 Bike : 3
## Median :0.548920 no : 388 Motorbike : 5
## Mean :0.612998 Sometimes :1027 Public_Transportation:1088
## 3rd Qu.:0.999451 Walking : 17
## Max. :2.000000
##
## NObeyesdad BMI
## Insufficient_Weight: 0 Min. :25.00
## Normal_Weight : 0 1st Qu.:27.94
## Obesity_Type_I :336 Median :32.42
## Obesity_Type_II :286 Mean :33.38
## Obesity_Type_III :308 3rd Qu.:37.83
## Overweight_Level_I :263 Max. :50.81
## Overweight_Level_II:271
# Or
obese_individuals %>% summary()
## Gender Age Height Weight
## Female:682 Min. :15.00 Min. :1.456 Min. : 55.52
## Male :782 1st Qu.:21.01 1st Qu.:1.641 1st Qu.: 80.13
## Median :24.00 Median :1.710 Median : 97.43
## Mean :25.65 Mean :1.710 Mean : 98.01
## 3rd Qu.:28.87 3rd Qu.:1.774 3rd Qu.:112.73
## Max. :56.00 Max. :1.980 Max. :173.00
##
## family_history_with_overweight FAVC FCVC NCP
## no : 95 no : 108 Min. :1.000 Min. :1.000
## yes:1369 yes:1356 1st Qu.:2.000 1st Qu.:2.570
## Median :2.356 Median :3.000
## Mean :2.427 Mean :2.639
## 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000
##
## CAEC SMOKE CH2O SCC FAF
## Always : 14 no :1436 Min. :1.000 no :1429 Min. :0.0000
## Frequently: 36 yes: 28 1st Qu.:1.679 yes: 35 1st Qu.:0.1124
## no : 37 Median :2.003 Median :0.9499
## Sometimes :1377 Mean :2.074 Mean :0.9304
## 3rd Qu.:2.580 3rd Qu.:1.4771
## Max. :3.000 Max. :3.0000
##
## TUE CALC MTRANS
## Min. :0.000000 Always : 0 Automobile : 351
## 1st Qu.:0.001628 Frequently: 49 Bike : 3
## Median :0.548920 no : 388 Motorbike : 5
## Mean :0.612998 Sometimes :1027 Public_Transportation:1088
## 3rd Qu.:0.999451 Walking : 17
## Max. :2.000000
##
## NObeyesdad BMI
## Insufficient_Weight: 0 Min. :25.00
## Normal_Weight : 0 1st Qu.:27.94
## Obesity_Type_I :336 Median :32.42
## Obesity_Type_II :286 Mean :33.38
## Obesity_Type_III :308 3rd Qu.:37.83
## Overweight_Level_I :263 Max. :50.81
## Overweight_Level_II:271
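As a quick sanity check of the BMI formula used in question 3, the computation can be done by hand for one hypothetical individual (70 kg and 1.75 m; these values are illustrative and not taken from the dataset). The sketch also shows that `**` and `^` are interchangeable in R, since `**` is translated to `^` by the parser.

```r
# Hypothetical individual (not from the dataset)
weight_kg <- 70
height_m  <- 1.75

# BMI = weight / height squared
bmi <- weight_kg / height_m^2
print(round(bmi, 2))  # 22.86

# Same computation with the `**` operator
bmi_alt <- weight_kg / (height_m ** 2)

# Would this individual be kept by the filter BMI > 25?
is_above_25 <- bmi > 25
print(is_above_25)
```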
Here we want to visualize some variables through plots.
We work on the data without NAs.
Visualization of one numeric variable.
Question 4:
4.1 Plot the distribution of BMI using
a histogram.
4.2 Change the title and axis labels.
4.3 Add vertical lines for the threshold values 17 and 25.
4.4 Add the density line to the plot.
#' 4.1 You can use the `hist` function, or `geom_histogram` with `ggplot`.
#' You can also use Esquisse, which will generate the ggplot code.
#' 4.2 You can use the options of the `hist` function, or `labs` with `ggplot`.
#' You can also use Esquisse, which will generate the ggplot code.
#' 4.3 You can use `abline` with `hist`, or `geom_vline` with `ggplot`.
#' 4.4 You can use `lines(density())` with `hist`, or `geom_density` with `ggplot`
#' Don't forget to set the parameter `freq` to FALSE when using `hist`
## 4.1
hist(obesity_without_na$BMI,
breaks = 30, col = "#D282E6")
# Or
ggplot(obesity_without_na) +
aes(x = BMI) +
geom_histogram(bins = 30L, fill = "#D282E6")
## 4.2
hist(obesity_without_na$BMI,
breaks = 30, col = "#D282E6",
main = "Distribution of BMI",
xlab = "BMI")
# Or
ggplot(obesity_without_na) +
aes(x = BMI) +
geom_histogram(bins = 30L, fill = "#D282E6") +
labs( # `labs()` sets the labels (title, axes, ...)
x = "BMI",
y = "Frequency",
title = "Distribution of BMI"
)
## 4.3
hist(obesity_without_na$BMI,
breaks = 30, col = "#D282E6",
main = "Distribution of BMI",
xlab = "BMI")
abline(v = c(17, 25), # Draws both vertical lines in a single command
col = c("blue", "forestgreen"), # Gives each vertical line its own color
lwd = c(3, 5)) # Gives each vertical line its own width
# Or
ggplot(obesity_without_na) +
aes(x = BMI) +
geom_histogram(bins = 30L, fill = "#D282E6") +
labs(
x = "BMI",
y = "Frequency",
title = "Distribution of BMI"
) +
geom_vline(xintercept = c(17, 25), # Draws both vertical lines in a single command
colour = c("blue","forestgreen"), # Gives each vertical line its own color
linewidth = c(2,3)) # Gives each vertical line its own width
## 4.4
hist(obesity_without_na$BMI,
breaks = 30, col = "#D282E6",
freq = FALSE, # Essential to be able to print the density
main = "Distribution of BMI",
xlab = "BMI")
abline(v = c(17, 25),
col = c("blue", "forestgreen"),
lwd = c(3, 5))
lines(density(obesity_without_na$BMI),
col = "brown", lwd = 3)
# Or
ggplot(obesity_without_na) +
aes(x = BMI) +
geom_histogram(aes(y = after_stat(density)), # Add this command
bins = 30L, fill = "#D282E6") +
labs(
x = "BMI",
y = "Density", # The y axis now shows a density, not a count
title = "Distribution of BMI"
) +
geom_vline(xintercept = c(17, 25),
colour = c("blue", "forestgreen"),
linewidth = c(2,3)) +
geom_density(colour = "brown", linewidth = 1.4) # And this one
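The reason `freq = FALSE` (or `after_stat(density)` in ggplot2) is needed in question 4.4 is that a density curve has a total area of 1, so the histogram must be drawn on the same scale: the bar heights are rescaled so that the total bar area equals 1. Here is a minimal check on simulated values (the vector `x` is made up for illustration and is not the BMI variable):

```r
set.seed(42)
# Simulated values standing in for a numeric variable
x <- rnorm(500, mean = 30, sd = 8)

# Compute the histogram without drawing it
h <- hist(x, breaks = 30, plot = FALSE)

# On the density scale, (height x width) summed over all bars gives the total area
bar_area <- sum(h$density * diff(h$breaks))
print(bar_area)

# On the count scale, the bar heights sum to the number of observations
total_count <- sum(h$counts)
print(total_count)
```

With counts on the y axis, a density curve would appear almost flat at the bottom of the plot, which is why the two scales have to agree.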
Visualization of two numeric variables.
We work on the data without NAs.
Question 5:
5.1 Use ggplot to create a variable
fig containing a plot showing the distribution of
height across weight and coloring the points
by the factor NObeyesdad.
5.2 Improve the previous fig variable with some labels and a title.
5.3 Find a way to calculate the centroids of
Height and Weight for each class of
NObeyesdad. NB: a centroid corresponds to the
mean of Height and Weight in each class of
NObeyesdad.
5.4 Add the centroids of each class of
NObeyesdad to the fig ggplot created
previously.
5.5 Create a new plot with these elements:
height across weight, points colored by BMI, and point shape
depending on family_history_with_overweight.
#' 5.1 You can use Esquisse, which will generate the ggplot code.
#' Plots can also be stored as objects, such as `fig = ggplot(...)`
#' 5.2 You can build on the previous fig with the ggplot syntax (`fig = fig + labs(...)`)
#' 5.3 You can use the function `group_by()` to group the data according to a factor, then use `summarise(...)`
#' This creates a new data frame, which can be assigned to an object.
#' 5.4 You can then use your previous fig and add the new points with:
#' `fig + geom_point()`
#' 5.5 You can use Esquisse, which will generate the ggplot code.
#' But Esquisse doesn't allow you to set a shape depending on a variable.
#' To set a shape depending on a variable, you will have to add it manually using the `shape` parameter in the `aes` definition.
#' In `ggplot2` you can use the `color` parameter in the `aes` definition with a numeric variable.
#' You can set the title of the shape legend in the `labs()` command
#' You can change the shapes using the `scale_shape_manual()` command
## 5.1
fig = ggplot(obesity_without_na) +
aes(x = Height, y = Weight, color = NObeyesdad) +
geom_point()
## 5.2
fig = fig + labs(x = "Height (m)", y = "Weight (kg)", color = "Class of obesity")
fig
## 5.3
centroids = obesity_without_na %>% group_by(NObeyesdad) %>% summarise(
mean_by_categ_height = mean(Height),
mean_by_categ_weight = mean(Weight)
)
## 5.4
fig = fig + geom_point(
data = centroids,
aes(x = mean_by_categ_height, y = mean_by_categ_weight),
color = "black", # Draw them in black for greater visibility
size = 5, # Increase the size for greater visibility
shape = 8 # Change the shape for greater visibility
)
## 5.5
ggplot(obesity_without_na) +
aes(
x = Height,
y = Weight,
colour = BMI,
shape = family_history_with_overweight # Change shape depending on a variable
) +
geom_point(size = 2.2) + # Change point size in general
scale_color_distiller(palette = "OrRd", direction = 1) +
scale_shape_manual(values = c(19, 18)) + # Change shapes for both modalities of family history
labs(title = "Weight according to height", shape = "Family history")
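For comparison, the centroids of question 5.3 can also be computed without dplyr. The sketch below uses base R `aggregate()` on a tiny invented data frame (`toy` and its values are assumptions made for illustration); it is an alternative to the `group_by()` + `summarise()` approach used above, not the method of this practical.

```r
# Toy data frame with two classes (values are invented)
toy <- data.frame(
  NObeyesdad = factor(c("Normal_Weight", "Normal_Weight",
                        "Obesity_Type_I", "Obesity_Type_I")),
  Height = c(1.60, 1.70, 1.65, 1.75),
  Weight = c(55, 65, 90, 100)
)

# Mean of Height and Weight within each class of NObeyesdad
centroids_base <- aggregate(cbind(Height, Weight) ~ NObeyesdad,
                            data = toy, FUN = mean)
print(centroids_base)
```

The result is a data frame with one row per class, which could be passed to the `data` argument of `geom_point()` in the same way as the `centroids` object.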
Now, what if we want to plot more than two variables?
We’ll have to use statistical techniques such as PCA (Principal Component
Analysis).
PCA is a statistical technique
that enables us to simplify complex multidimensional datasets by
reducing the number of variables while preserving as much
information as possible.
This method transforms the original variables into
new, uncorrelated variables, called principal components, which
successively capture the greatest possible variance in the data. It
is therefore often used to make data easier to explore and visualize.
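The two properties described above (uncorrelated components, and decreasing variance from one component to the next) can be checked on a small example. This sketch uses R's built-in `USArrests` dataset rather than the obesity table, purely as an illustration:

```r
# PCA on a built-in dataset, with scaled variables
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Share of the total variance carried by each component
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
print(round(var_explained, 3))

# Correlation matrix of the principal components (coordinates of the individuals)
pc_cor <- cor(pca_demo$x)
print(round(pc_cor, 3))
```

The shares of variance sum to 1 and are sorted in decreasing order, and the off-diagonal correlations between components are zero up to numerical precision.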
Question 6: Here we will conduct a PCA and plot the results.
To perform the PCA, we will use the prcomp() function to
create pca_res. This function expects a table with no NA,
with rows as “individuals” (the samples) and columns as
numeric variables. Use prcomp() with the scale
parameter set to TRUE, in order to scale the variables and
make them comparable.
6.1 Create a data table with no missing values and
only numeric variables.
Quickly explore this table to check that only numeric data are
present.
6.2 Compute the pca_res object, using
the prcomp() function on the data table with no
missing values and only numeric variables.
6.3 Explore the pca_res object
created.
6.4 Install and load the factoextra
package.
6.5 Use the fviz_eig() function to
print the scree plot, which represents the percentage of variance
explained by each component.
6.6 Use the fviz_pca_ind() function to
plot the individuals. Draw only points, because there are too many
individuals to display the labels.
6.7 On the individuals plot, color the individuals
by NObeyesdad.
6.8 On the last plot, display components (3,4) instead of (1,2).
6.9 Use the fviz_pca_var() function to
draw the graph of variables.
#' 6.1 To keep only numeric variables, you can combine `is.numeric` with the `map_lgl` function.
#' Open the table in your environment or run `summary()` on it.
#' 6.2 Don't forget to set the `scale` parameter to `TRUE`
#' 6.3 Look at the Environment panel, try to access the elements in pca_res
#' You can access the elements of this object using `$`
#' 6.4 You can use the `install.packages()` function or click on Packages -> Install. Then use `library()`.
#' 6.6 You can use the `geom` parameter to change the geometry used
#' 6.7 You can use the `col.ind` parameter to change the coloring variable.
#' 6.8 You can use the `axes` parameter to change the displayed components
## 6.1
data_for_pca = obesity_without_na[,map_lgl(obesity_without_na, is.numeric)]
View(data_for_pca)
summary(data_for_pca)
## Age Height Weight FCVC
## Min. :14.00 Min. :1.456 Min. : 39.00 Min. :1.000
## 1st Qu.:19.90 1st Qu.:1.630 1st Qu.: 65.62 1st Qu.:2.000
## Median :22.80 Median :1.701 Median : 83.00 Median :2.378
## Mean :24.34 Mean :1.702 Mean : 86.73 Mean :2.417
## 3rd Qu.:26.00 3rd Qu.:1.769 3rd Qu.:108.10 3rd Qu.:3.000
## Max. :61.00 Max. :1.980 Max. :173.00 Max. :3.000
## NCP CH2O FAF TUE
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.657 1st Qu.:1.605 1st Qu.:0.1393 1st Qu.:0.0000
## Median :3.000 Median :2.000 Median :1.0000 Median :0.6221
## Mean :2.685 Mean :2.015 Mean :1.0164 Mean :0.6544
## 3rd Qu.:3.000 3rd Qu.:2.495 3rd Qu.:1.6685 3rd Qu.:1.0000
## Max. :4.000 Max. :3.000 Max. :3.0000 Max. :2.0000
## BMI
## Min. :13.00
## 1st Qu.:24.34
## Median :28.78
## Mean :29.72
## 3rd Qu.:36.05
## Max. :50.81
## 6.2
pca_res = prcomp(data_for_pca, scale. = TRUE) # The argument is named `scale.`; `scale = TRUE` also works through partial matching
## 6.3
head(pca_res$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -1.4833880 -0.3522440 -0.70065216 0.7180619 0.16048037 1.1810569
## 2 -1.1256790 0.7533163 1.07684458 -3.0984167 0.07254866 0.8296358
## 3 -0.6334807 1.6832944 0.60247341 0.5823536 -0.15601342 -0.4871111
## 4 0.6121253 0.6842018 1.49063830 -1.0252005 0.63352528 -0.6666564
## 5 -0.1367602 -1.0759814 0.02587933 0.3317372 -1.81954242 -0.6486798
## 6 -1.5675642 -1.3029831 1.19159441 0.3999047 0.12424497 1.3559510
## PC7 PC8 PC9
## 1 -0.16626861 -0.003844692 0.03274228
## 2 0.43108038 -1.743759554 0.08418544
## 3 0.25573240 0.095977078 -0.05401156
## 4 -0.06985795 0.542883721 -0.01944557
## 5 -1.62643112 1.053804892 -0.03602851
## 6 -0.22657870 0.312509072 0.07287356
## 6.4
# install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## 6.5
fviz_eig(pca_res)
## 6.6
fviz_pca_ind(pca_res, geom = "point")
## 6.7
fviz_pca_ind(pca_res, geom = "point",
col.ind = obesity_without_na$NObeyesdad)
## 6.8
fviz_pca_ind(pca_res, geom = "point",
col.ind = obesity_without_na$NObeyesdad,
axes = c(3,4))
## 6.9
fviz_pca_var(pca_res)
We could then interpret these graphs.
For instance, we can see that axes 1 and 2 together explain 44.6% of the total
variance.
In the individuals plot, the class of obesity appears to be associated with the
first component, with low weights on the left and high weights on the
right.
This is confirmed by the variables plot, in which weight
and BMI are highly correlated with the first component.
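Both quantities used in this interpretation (the cumulative percentage of variance and the correlation between variables and components) can also be read numerically from a `prcomp` object. The sketch below again uses the built-in `USArrests` dataset as a stand-in, so its numbers are unrelated to the 44.6% quoted above:

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Cumulative percentage of variance explained by the first k components
cum_pct <- cumsum(pca_demo$sdev^2) / sum(pca_demo$sdev^2) * 100
print(round(cum_pct, 1))

# Correlation between each original variable and each component
var_pc_cor <- cor(USArrests, pca_demo$x)
print(round(var_pc_cor[, 1:2], 2))

# With scaled variables, these correlations equal the loadings
# multiplied by the standard deviation of each component
loadings_scaled <- sweep(pca_demo$rotation, 2, pca_demo$sdev, "*")
print(max(abs(var_pc_cor - loadings_scaled)))
```

A variable whose correlation with the first component is close to 1 or -1 lies near the edge of the correlation circle along that axis in the variables plot.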
Note that:
the functions of the factoextra package (such as
fviz_pca_ind or fviz_pca_var) use the
ggplot2 syntax, so you can add options using
+ ... if you want;
you can also work directly on the pca_res
object and make your own plots using the plot() or
ggplot() functions.
Chi-squared test
To explore some qualitative variables, we’ll perform a \(\chi^2\) test.
A \(\chi^2\) independence test allows us to test, and possibly reject, the hypothesis of independence between two categorical variables.
This test computes a p-value, which allows us to reject or not the independence hypothesis: if the p-value is lower than a threshold, commonly set to 0.05, we conclude that the variables are significantly associated.
You can find some explanations of this test here and here.
Note that one condition for performing this test is to have at least 5 expected individuals in each crossed group.
Question 7:
7.1 Print the distribution of Gender
across family_history_with_overweight.
7.2 Now print the same table as proportions by Gender: among the males / females, what is the proportion of individuals with a family history?
7.3 Compute the chi-squared test comparing
these two variables.
7.4 Save the result of this test in an object
(called khi_result, for example).
7.5 Explore the created object.
#' 7.1 You can use the `table()` function, as in question 1.4
#' 7.2 You can use the `prop.table()` function on the result of `table()`
#' You can also use the `round()` function to make it more readable
#' 7.3 You can use the `chisq.test()` function
#' 7.5 You can access the elements of this object using `$`
## 7.1
table(obesity_without_na$Gender, obesity_without_na$family_history_with_overweight)
##
## no yes
## Female 218 774
## Male 146 868
## 7.2
round(prop.table(
table(
obesity_without_na$Gender,
obesity_without_na$family_history_with_overweight
),
1 # margin = 1: proportions are computed within each row
), 3)
##
## no yes
## Female 0.220 0.780
## Male 0.144 0.856
## 7.3
chisq.test(x = obesity_without_na$Gender,
y = obesity_without_na$family_history_with_overweight)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: obesity_without_na$Gender and obesity_without_na$family_history_with_overweight
## X-squared = 18.877, df = 1, p-value = 1.394e-05
## 7.4
khi_result = chisq.test(x = obesity_without_na$Gender,
y = obesity_without_na$family_history_with_overweight)
## 7.5
khi_result$p.value # To access the p-value
## [1] 1.394052e-05
khi_result$observed # To access the table of observed counts
##
## obesity_without_na$Gender no yes
## Female 218 774
## Male 146 868
khi_result$expected # To access the table of expected counts under the null hypothesis
##
## obesity_without_na$Gender no yes
## Female 180.004 811.996
## Male 183.996 830.004
The p-value is lower than 0.05, indicating that
Gender and family history are significantly
associated at the 5% threshold.
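The expected counts under independence can be recomputed by hand from the observed table: for each cell, (row total x column total) / grand total. The sketch below rebuilds the observed table from the counts printed in 7.1 (the object names `observed`, `expected_by_hand` and `res` are chosen here for illustration) and checks the condition of at least 5 expected individuals per cell:

```r
# Observed counts copied from the output of question 7.1
observed <- matrix(c(218, 146, 774, 868), nrow = 2,
                   dimnames = list(Gender = c("Female", "Male"),
                                   family_history = c("no", "yes")))

# Expected counts under independence: row total * column total / grand total
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected_by_hand, 3))

# Validity condition: every expected count should be at least 5
print(all(expected_by_hand >= 5))

# chisq.test() also accepts a table of counts and stores the same expected values
res <- chisq.test(observed)
print(max(abs(res$expected - expected_by_hand)))
print(res$p.value)
```

The values should match the `khi_result$expected` table shown above, which confirms that the test compares the observed counts with what independence alone would produce.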
We’re now at the end of the practical session.
To ensure the reproducibility of the analyses, we would like to keep a record of
the versions of every package used.
Question 8:
8.1 Which command would allow you to print the versions of every package used in the project?
#' 8.1 We've talked about it in the slides, in the 'packages' part
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.2 (2024-10-31)
## os Ubuntu 20.04.6 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate fr_FR.UTF-8
## ctype fr_FR.UTF-8
## tz Europe/Paris
## date 2025-10-11
## pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-8 2024-09-12 [1] CRAN (R 4.4.1)
## backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.1)
## broom 1.0.7 2024-09-26 [1] CRAN (R 4.4.1)
## bslib 0.8.0 2024-07-29 [1] CRAN (R 4.4.1)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.1)
## car 3.1-3 2024-09-27 [1] CRAN (R 4.4.1)
## carData 3.0-5 2022-01-06 [1] CRAN (R 4.4.1)
## cli 3.6.5 2025-04-23 [1] CRAN (R 4.4.2)
## colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.1)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.1)
## digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.1)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.1)
## evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.4.1)
## factoextra * 1.0.7 2020-04-01 [1] CRAN (R 4.4.2)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.1)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.1)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.1)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.1)
## Formula 1.2-5 2023-02-24 [1] CRAN (R 4.4.1)
## fs 1.6.5 2024-10-30 [1] CRAN (R 4.4.1)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.1)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.1)
## ggpubr 0.6.0 2023-02-10 [1] CRAN (R 4.4.1)
## ggrepel 0.9.6 2024-09-07 [1] CRAN (R 4.4.1)
## ggsignif 0.6.4 2022-10-13 [1] CRAN (R 4.4.1)
## glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
## gtable 0.3.6 2024-10-25 [1] CRAN (R 4.4.1)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.1)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.1)
## httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.1)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.1)
## jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)
## knitr 1.49 2024-11-08 [1] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.1)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.4.1)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.1)
## lubridate * 1.9.4 2024-12-08 [1] CRAN (R 4.4.2)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.1)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.1)
## mime 0.12 2021-09-28 [1] CRAN (R 4.4.1)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.1)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.1)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.1)
## pkgbuild 1.4.5 2024-10-28 [1] CRAN (R 4.4.1)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.1)
## pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
## profvis 0.4.0 2024-09-20 [1] CRAN (R 4.4.1)
## promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.1)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.1)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.1)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.4.1)
## Rcpp 1.0.13-1 2024-11-02 [1] CRAN (R 4.4.1)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.2)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.1)
## rlang 1.1.6 2025-04-11 [1] CRAN (R 4.4.2)
## rmarkdown 2.29 2024-11-04 [1] CRAN (R 4.4.1)
## rstatix 0.7.2 2023-02-01 [1] CRAN (R 4.4.1)
## rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.4.2)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.1)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.1)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.1)
## shiny 1.9.1 2024-08-01 [1] CRAN (R 4.4.1)
## stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.1)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.1)
## templatebilille 0.1.0 2024-03-01 [1] local
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.1)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.1)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.1)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.2)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.2)
## tzdb 0.5.0 2025-03-15 [1] CRAN (R 4.4.2)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.1)
## usethis 3.0.0 2024-07-29 [1] CRAN (R 4.4.1)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.1)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.1)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1)
## xfun 0.49 2024-10-31 [1] CRAN (R 4.4.1)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.1)
## yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.1)
##
## [1] /home/estelle/R/x86_64-pc-linux-gnu-library/4.4
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────
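If devtools is not installed, base R provides similar information through `sessionInfo()`, and `packageVersion()` returns the version of a single package. A minimal sketch (the choice of the `stats` package is only an example):

```r
# Base R equivalent of a session report
info <- sessionInfo()

# R version string stored in the session information
r_version <- info$R.version$version.string
print(r_version)

# Version of one specific package
stats_version <- packageVersion("stats")
print(stats_version)
```

Saving this output alongside the report makes it possible to rebuild the same software environment later.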