Preamble

Practical information

Schedule :

  • 13th & 14th October 2025
  • 9am to 5pm

Breaks :

  • every half-day
  • lunch for 1 hour, around 12:30


At the bottom left, there is a menu to better navigate through the slides

Lunch :

  • microwaves are available if needed
  • you can buy sandwiches or hot dishes at the hospital (Huriez)
  • you can eat in the training room

Are R and RStudio successfully installed for everyone ?

Access to Nextcloud files ?
password : formation_r_2025

Bilille


Bilille is the Lille bioinformatics and biostatistics platform, within the UAR 2014 - US 41 “Plateformes Lilloises en Biologie et Santé”.


PLBS includes 8 platforms, providing access to expertise and equipment to support research in biology and health.


At Bilille, we are currently 10 full-time engineers, led by Jimmy Vandel (IR CNRS).

Our missions are to :

  • support scientific projects
  • organise training courses
  • provide access to cloud computing resources
  • ensure access to software resources
  • run scientific and technical outreach activities

Quick presentation


Us


What about you ?

  • name
  • profile
  • labs
  • experience with R (in a few words) : have you already tried using it ?

Introduction

What is data science ?

Data Science is the bridge between raw data (after experimentation) and meaningful insight. It is often divided into several stages (which do not always have a fixed order) :

  • Importing the data
  • Cleaning & manipulating it
  • Transforming it into useful formats
  • Visualizing for exploration or communication
  • Modeling to understand or predict
  • … (possibly returning to previous steps)
  • And ultimately, communicating the results clearly

The programming language R enables us to do all these steps !

Presentation outline

First day

  1. Understanding R and Rstudio

  2. Programming with R

  3. Importing data and basic manipulation

  4. The importance of packages

  5. Basic visualization

Second day

  1. Programming with Tidyverse

  2. Reports with Rmarkdown

  3. Manipulation of (your ?) data

1. Understanding R and Rstudio

R

R is a programming language for statistical computing and graphics. It originated in 1993 and has been widely adopted since.

It is free software (like Python and Julia, for example) and runs on a wide variety of systems (UNIX, Windows, macOS).

  • Widely used for bioinformatics/statistics/data science

  • Relatively “easy” to understand

  • Open source so everybody can contribute

  • Very large number of packages developed by a community of contributors

  • Over 15,000 packages listed on the Comprehensive R Archive Network (CRAN), GitHub and Bioconductor

  • The Bioconductor project alone includes more than 1,000 packages for the analysis of biological data

R

R scripts have the file extension .R.

They can be executed in a terminal using the command : Rscript yourfile.R

R

You can also open the R console in a terminal with the command R.

NB : This example and the previous one come from a Unix system, but equivalent commands exist on Windows and macOS.

Rstudio

RStudio is an integrated development environment for R, whose first version was released in 2011. It allows you to :

  • work on your R scripts and write reports

  • execute your code and scripts

  • visualize the environment and variables

  • visualize plots, install new packages, consult help

  • etc

Code execution

Ctrl + Enter executes :

  • the code that is currently selected
  • or, if nothing is selected, the line where the text cursor is placed

It is equivalent to the Run button at the top right of the script pane in RStudio.


The whole script can be executed using the Source button, also at the top right of the script pane.

TP 1

2. Programming with R

Programming with R

In R, you can manipulate :

  • Values :
    • 69 (number)
    • "Name" (character)
    • "2025-04-14" (a date written as a character string ; as.Date() converts it to a real date)
    • c("a","b","c") (vector)
  • Objects :
    • x <- 21 (numeric variable)
    • y <- c(21/7, 42, 0.99) (vector variable)
  • Functions :
    • round(x, digits = 3)

Values : types

Main types of R values :

  • integer (note the L suffix : without it, 420 is stored as numeric)
420L
[1] 420
  • numeric
3.141591
[1] 3.141591
  • character
"Bilille"
[1] "Bilille"
  • logical (boolean)
TRUE
[1] TRUE
  • vector
c(1, 2, 3, 4)
[1] 1 2 3 4
  • list
list("info", 2)
[[1]]
[1] "info"

[[2]]
[1] 2
  • matrix
matrix(c(1,2, 3,4), nrow = 2, ncol = 2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
  • dataframe
data.frame(Student = c("John_Doe", "Jane_Doe"),
           Note = c(10, 14))
   Student Note
1 John_Doe   10
2 Jane_Doe   14

Those values can be assigned to objects with <- like my_var <- 21
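
To check which type R has given to a value or an object, class() can be used. The lines below are only a small illustration ; my_table is a made-up object name.

```r
# Check the type of a few values with class()
class(420L)        # "integer"   (the L suffix forces an integer)
class(420)         # "numeric"   (numbers are numeric by default)
class("Bilille")   # "character"
class(TRUE)        # "logical"

# The same check works on objects
my_table <- data.frame(Student = c("John_Doe", "Jane_Doe"),
                       Note = c(10, 14))
class(my_table)    # "data.frame"
```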

Exercise 1

Which of these are numbers in R ?



      1                  "1"                  "one"                  one


  • 1 is a number


  • "1" and "one" are characters


  • one is the name of an object (it only works if the object has been created first, with one <-)

Object assignment

To store the results and elements of your analysis, you need to use objects.

For example :

2+2
[1] 4

only prints the result, and doesn’t store it.

x <- 2+2

doesn’t print the result, but assigns its value to the object called x.

Now you can see x in the environment (top right).

You can print the result like this

print(x)
[1] 4

or this

x
[1] 4

Object assignment


To name your object :

  • Use specific and explicit names

  • Use _ if needed (no spaces !)

  • Use lowercase (convention) : R is case-sensitive, so my_var is not the same as My_Var

  • Start your variable name with a letter

To assign a value to your object :

  • Use <- : the [less than] symbol followed by a dash (= also works for assignment)

Exercise 2

Which of these will work ?

Let one <- 1,


1 + 1          "1" + "1"          "one" + "one"          one + one


  • 1+1 and one+one will both give us the addition of 1 with 1 which is 2


  • "1"+"1" and "one"+"one" will not work : R returns an error, because + only works on numbers

Manipulating vectors

A vector is an object containing one or more values of the same type.

To create a vector, use the command c() :

my_vector <- c(1,2,4,8)

To access the elements of a vector, use [] with the desired index.

One value :

my_vector[3]
[1] 4

returns the third element of my_vector.

Several values :

my_vector[1:3]
[1] 1 2 4

returns elements 1 to 3 of my_vector.


my_vector[c(1,3)]
[1] 1 4

returns elements 1 and 3 of my_vector.

  • In a vector definition, commas "," separate the values, while dots "." are the decimal separator in numbers.
  • Indexing starts at 1, not at 0 as in some other programming languages.
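
As a small illustration of 1-based indexing, the lines below reuse my_vector from this slide. Asking for a position that does not exist returns NA rather than an error.

```r
my_vector <- c(1, 2, 4, 8)

length(my_vector)              # number of elements : 4
my_vector[1]                   # first element : 1 (indexing starts at 1)
my_vector[length(my_vector)]   # last element : 8
my_vector[5]                   # position 5 does not exist : NA
```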

Exercise 3

Will both these instructions produce the same result ?

my_vect_1 <- c(1, 1,25, 2,48, 3)
my_vect_2 <- c(1, 1.25, 2.48, 3)

No : my_vect_1 will be a vector of length 6 containing only whole numbers (1, 1, 25, 2, 48, 3)
my_vect_2 will be a vector of length 4 containing decimal numbers (1, 1.25, 2.48, 3)


  • how to access the 5th value of my_vect_1 ?
my_vect_1[5]
  • how to access the values 2 to 4 of my_vect_2 ?
my_vect_2[2:4]
  • how to access the values 1, 3, 4 and 5 of my_vect_1 ?
my_vect_1[c(1,3,4,5)]
my_vect_1[c(1,3:5)]
  • how to store the second element of my_vect_2 into an object called vect_pos_2 ?
vect_pos_2 <- my_vect_2[2]

Missing or Empty Values in R

In R, there are two values that indicate missing or empty data :


NA (Not Available)

  • indicates missing data within a structure (e.g. vectors, data frames, …)

  • commonly appears when a value is unknown, not recorded, or undefined.

my_vector <-  c(95, NA, 87)

NULL

  • represents the absence of a value or object.

  • used for optional parameters or to indicate no content at all.

my_f <- function(x, y, title = NULL) {
}
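
The difference between NA and NULL can be seen with a few base R functions. This is only a sketch ; scores and nothing are made-up object names.

```r
scores <- c(95, NA, 87)

is.na(scores)               # FALSE  TRUE FALSE : locates the missing value
length(scores)              # 3 : NA still occupies a position
mean(scores)                # NA : the result is unknown because one value is unknown
mean(scores, na.rm = TRUE)  # 91 : the NA is ignored

nothing <- NULL
is.null(nothing)            # TRUE
length(c(95, NULL, 87))     # 2 : NULL takes no room at all
```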

Classical operations

Operations between single values \(\in \mathbb{R}\) :

  • arithmetic operators :
    • addition, subtraction : +, -
    • multiplication, division : *, /
    • exponent : ** or ^
    • remainder of the Euclidean division (modulo) : %%
  • logical (comparison) operators, which return TRUE or FALSE :
    • less than, greater than : <, >
    • less than or equal to, greater than or equal to : <=, >=
    • equal, not equal : ==, !=


== compares two values, whereas = assigns a value to an object, like <-.


These operators also work element by element on two vectors of the same length.
For example :

With x <- c(1, 0.1, 0) and y <- c(3, 1.9, 0), x + y equals c(4, 2, 0).

Exercise 4

What will be the outputs for the following cases :

  • "a" == "a"

  • a == a

  • "a" == "b"

TRUE

Error (object a doesn’t exist)

FALSE

Let a = 2 and b <- 2.

  • a == a
  • a == b
  • a != b

TRUE

TRUE

FALSE

Exercise 5

What will be the outputs for the following classical and logical operations :

  • 4**2 or 4^2
4 ** 2
[1] 16
4 ^ 2
[1] 16
  • 5 / 2
5 / 2
[1] 2.5
  • 5 %% 2
5 %% 2
[1] 1
  • (6, 2, 9) - (4, 8, 9)
c(6, 2, 9) - c(4, 8, 9)
[1]  2 -6  0
  • (6, 420) == (6.1, 420)
c(6, 420) == c(6.1, 420)
[1] FALSE  TRUE
  • (1, 2, 9) < (4, 1.25, 9)
c(1, 2, 9) < c(4, 1.25, 9)
[1]  TRUE FALSE FALSE
  • (1, 2, 9) <= (4, 1.25, 9)
c(1, 2, 9) <= c(4, 1.25, 9)
[1]  TRUE FALSE  TRUE

Functions

A function is a very useful object in R. It is a reusable set of instructions that performs a specific task.

Functions are great tools to save time, avoid copy/paste when running the same task several times, and ensure reproducibility.

You can create your own functions or use existing functions depending on your needs.

The structure of a function is the following :

# Creating the function
my_function = function(necessary_input_parameter_1, 
                       optional_input_parameter_2=NULL){

  code_instructions
  
  return(result)
}

# Calling the function
my_result = my_function(3,5)
  • input parameters can be of different types
  • code instructions are a succession of code lines ; the objects created inside a function are temporary and are NOT added to your environment
  • the function returns a result, which can be of any type and can be stored in your environment

Like any object, a function has to exist (be created) before it is used.
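
To make the structure above concrete, here is a minimal sketch of a function with one necessary and one optional parameter. The function format_measure and its parameters are hypothetical and only serve as an illustration.

```r
# Hypothetical function : formats a measurement, with an optional unit
format_measure <- function(value, unit = NULL) {
  if (is.null(unit)) {
    formatted_label <- as.character(value)   # no unit given
  } else {
    formatted_label <- paste(value, unit)    # value followed by its unit
  }
  return(formatted_label)
}

format_measure(3)                 # "3"
format_measure(3, unit = "cm")    # "3 cm"
exists("formatted_label")         # FALSE : the object created inside the function is temporary
```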

Creating functions

Creating a function is quite straightforward when following the structure (input, inside code, output).

Here, we create a function computing the addition of two numbers :

# Creating the function
addition = function(number_1, number_2){
  
  add = number_1 + number_2
  
  return(add)
}
# Calling the function
addition(2, 5.3)
[1] 7.3
  • there are two input parameters that are numbers
  • the code instructions compute the sum of these numbers
  • it returns the result of the addition


For this training, we will not create any functions, as this can be considered advanced R.

Instead, we are going to use functions that have already been created to make our lives easier !

Existing functions

Many basic functions are built into R. They cover common needs and you can call them directly.
A few examples are listed here :

  • round()

  • print()

  • paste0()

round(3.141592, digits = 2)
[1] 3.14
print("Hello world")
[1] "Hello world"
paste0("subject_", 1+1)
[1] "subject_2"


It is very important to understand what a function is for, which input parameters it expects and what it returns.

Documentation for functions

Every function in base R (and in packages) has a help page that will help you understand it.

To access it, you can either go to the Help panel and type the name of your function, or use ?function as in this example :

?print

There is a lot of information in the documentation. We mainly look at the sections :

  • Description : the function’s goal
  • Usage : how to call the function, the input parameters and their order
  • Arguments : the expected type of each of the input parameters
  • Examples : code examples of usage of the function

Input parameters

The input parameters of a function are described in the documentation.

They can be :
- necessary : you have to provide them in the function call
- or optional : these parameters are assigned a default value in the function’s definition. The function uses this default if you don’t specify another value in the call.


Pay attention to the order in which the function expects its input parameters.

To be sure that you assign correct values to the input parameters of a function, you can use the explicit names of the input parameters.

For example, the print() function expects an input parameter named x :

print(x="Hello world")
[1] "Hello world"

Let’s try with an example !

An example : the round() function

Let’s look at the help for this function.

?round

Using the documentation, what will the outputs for the following calls be :

  • round(7.123, digits=2)

  • round(7.123)

  • round(2, 7)

  • round(digits=2, 7.123)

  • round(7.123, digits="1")

  • round(2, 3, 1)

7.12

7 (digits=0 in the function definition)

2 (the order !)

7.12 (digits is named explicitly, so the order of the arguments doesn’t matter)

ERROR : digits has to be numeric

ERROR : round() accepts at most 2 arguments

Be careful : when assigning values to the parameters in a function call, use = (not <-).
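
The calls of this exercise can be checked directly in the console. In the sketch below, tryCatch() is only used to display the error messages without stopping the script ; it is not needed when typing the calls one by one.

```r
round(7.123, digits = 2)   # 7.12
round(7.123)               # 7
round(2, 7)                # 2
round(digits = 2, 7.123)   # 7.12

# tryCatch() captures the error message instead of stopping the script
tryCatch(round(7.123, digits = "1"), error = function(e) conditionMessage(e))
tryCatch(round(2, 3, 1),             error = function(e) conditionMessage(e))
```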

TP 2

3. Importing data and basic manipulation

Importing data

Understanding basic path architecture…

getwd()                                    # Get the working directory
 ~/projets/2025_formation_r  
setwd("~/projets/2025_formation_r/Data")   # Set the working directory
getwd()                                    # The working directory has now changed
 ~/projets/2025_formation_r/Data  

To import data we can use functions like read.csv() with the path to our data, relative to the working directory :

great_data <- read.csv("my_great_data.csv") # my_great_data.csv is in ~/projets/2025_formation_r/Data

Or with the absolute path :

great_data <- read.csv("/home/user/Documents/projets/2025_formation_r/Data/my_great_data.csv")

Be careful while importing data :
- name your files and folders so that the data is easy to import
- use appropriate column names in your data file (avoid blank spaces and special characters like "(") and keep a consistent value type within each column
- if your data is too large, your computer may not have enough RAM for R to import and manipulate it
- Windows uses \ in paths instead of / : in R, write paths with / (or with a doubled \\), because a single \ is an escape character in R strings
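
A way to try importing without touching your own files is to write a small CSV into a temporary folder and read it back. The sketch below assumes made-up data (region, soil_ph) ; file.path() builds a path that works on every operating system.

```r
# Hypothetical example data, written to a temporary folder
data_dir  <- tempdir()
data_file <- file.path(data_dir, "my_great_data.csv")   # file.path() builds the path with /
write.csv(data.frame(region = c("north", "south"), soil_ph = c(6.5, 7.2)),
          data_file, row.names = FALSE)

file.exists(data_file)               # TRUE : check the path before importing
great_data <- read.csv(data_file)
great_data
```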

Importing data

You can also import data using the RStudio interface. It works for various types of data files and gives you the equivalent command, in case you want to import your data from your script.

You can import:

  • text files (csv, tsv, table)
  • excel files (xlsx, xls)
  • files from specific statistical tools (SPSS, SAS, Stata)

Form of data

Data is often presented in data frames, which are tables with rows and columns.
It is possible to specify names for rows and columns.

Variables (columns) can be of different types :

  • numeric / integers
  • factors
  • ordered factors
  • characters…

str(my_dataframe)     # structure of the object : for a data frame, the type and first values of each column
summary(my_dataframe) # summary of each column (distribution, number of NAs)
  

Make sure each column contains a single type of value across all rows.
For instance, only characters in the region column or only numeric in the soil_ph column.
Checking this will help you greatly for downstream analyses !
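
The effect of a mixed column can be seen on a small example. The data below are invented for the illustration ; stringsAsFactors = FALSE is written explicitly so that the result does not depend on the R version.

```r
soil <- data.frame(region  = c("north", "south", "east"),
                   soil_ph = c(6.5, 7.2, 5.9),
                   stringsAsFactors = FALSE)
sapply(soil, class)          # region : "character", soil_ph : "numeric"

# A single text entry turns the whole column into characters
soil_mixed <- data.frame(region  = c("north", "south", "east"),
                         soil_ph = c(6.5, "unknown", 5.9),
                         stringsAsFactors = FALSE)
sapply(soil_mixed, class)    # both columns are now "character"
```

With soil_mixed, a call such as mean(soil_mixed$soil_ph) no longer returns a number, which is why checking column types early is worthwhile.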

Data manipulations

  • Accessing column names :
colnames(great_data)
  • Accessing a column :
great_data$column_name
great_data[, 1]                                              # will access the first column in the dataframe
great_data[, c(1,3,5)]                                       # will access the columns 1, 3 and 5     
great_data[, "column_name"]                                 
great_data[["column_name"]]
  • Creating a column and removing a column :
great_data$a_new_column <- great_data$existing_col1 + great_data$existing_col2
great_data[,"a_new_column"] <- great_data[,"existing_col1"] + great_data[,"existing_col2"]
great_data <- cbind(great_data, a_new_column)              # Binding a vector

# To remove a column named "useless" placed in second position, any of these will work
great_data$useless <- NULL
great_data[2] <- NULL
great_data[[2]] <- NULL
great_data <- great_data[,-2] 
great_data <- great_data[-2]

When using cbind, check that the rows of the data frame are in the same order as the values of the new vector. Otherwise, values will be assigned to the wrong rows.

Data manipulations

  • Accessing row names :
rownames(great_data)
  • Accessing a row :
great_data[1,]                                              # will access the first row in the dataframe
great_data[c(3,9,18),]                                      # will access the rows 3, 9 and 18 in the dataframe
great_data["row_name",]
  • Removing a row :
# To remove a row placed in second position :
great_data <- great_data[-2,] 
  • Adding rows to a data frame :
# If great_data contains 3 columns : ("name", "first_name", "year_of_birth")
# list() keeps the type of each value, whereas c() would convert everything to character
great_data[nrow(great_data)+1,] <- list("Descartes", "René", 1596)
great_data <- rbind(great_data, list("Descartes", "René", 1596)) # Binding a list of length 3

When using rbind, check that the columns of the data frame are in the same order as the values in the new row. Otherwise, values will be assigned to the wrong columns.
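
One way to avoid both the ordering problem and unwanted type conversions is to bind a one-row data frame with the same column names, since rbind() then matches the columns by name. The sketch below uses a made-up people table.

```r
people <- data.frame(name = "Pascal", first_name = "Blaise",
                     year_of_birth = 1623, stringsAsFactors = FALSE)

# New row as a one-row data frame with the same column names
new_row <- data.frame(name = "Descartes", first_name = "René",
                      year_of_birth = 1596, stringsAsFactors = FALSE)
people <- rbind(people, new_row)

class(people$year_of_birth)   # "numeric" : the years are still numbers
mean(people$year_of_birth)    # 1609.5
```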

Data manipulations

  • Filtering the dataframe depending on a column with the subset() function :
filtered_data_adults <- subset(great_data, age > 18)         # Filter great_data, keeping only rows where great_data$age > 18
  • Manage missing data (NA) with different functions :
anyNA(c("a","b",NA))                                         # checks whether the object contains any NA
great_data_no_na <- na.omit(great_data)                      # deletes rows with at least one NA
great_data_no_na <- great_data[complete.cases(great_data),]  # same as above

Some functions accept NA values, but you have to specify how to handle them, for example :

sum(c(1,2,3,NA))
[1] NA
sum(c(1,2,3,NA), na.rm = TRUE)
[1] 6

Some other functions can’t be computed on data with missing values.
When you try to use them, an error message is displayed.
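
As an illustration, quantile() from base R refuses missing values by default. In the sketch below, tryCatch() is only there to show the message without stopping the script.

```r
values <- c(1, 2, 3, NA)

# quantile() stops with an error when NA values are present ;
# tryCatch() is used here only to display the message without stopping
tryCatch(quantile(values), error = function(e) conditionMessage(e))

quantile(values, na.rm = TRUE)   # works once the NA values are removed
```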

Export Data

To save and export the data created, we can use different approaches.

  • To save the data into a .csv file we can use the write.csv() function :
write.csv(my_data_to_save, "path/to/where/I/want/to/save/my_data_to_save.csv")
  • We can save data objects into an R Data format (.rds or .RData) :
saveRDS(my_data_to_save, file = "path/to/my_data.rds")  # Save a single object in a .rds file
my_data_again <- readRDS(file = "path/to/my_data.rds")  # Restore the object in R


save(data1, file = "data.RData")                        # Save an object in a .RData file
save(data1, data2, file = "data_1_2.RData")             # Save multiple objects in a .RData file

load("data_1_2.RData")                                  # To load the data again
  • We can also save the whole workspace (all objects created) into an .RData file :
save.image(file = "path/to/my_work_space_today.RData")  # Save the workspace
load("path/to/my_work_space_today.RData")               # Load the workspace

When you quit RStudio, it offers to save the current workspace, which is equivalent to calling the save.image function.
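
A quick round trip shows that an .rds file restores the object as it was. This is a minimal sketch using a temporary file and made-up data.

```r
my_data_to_save <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

rds_path <- tempfile(fileext = ".rds")     # temporary file, for the example only
saveRDS(my_data_to_save, file = rds_path)
my_data_again <- readRDS(rds_path)

identical(my_data_to_save, my_data_again)  # TRUE : types and values are preserved
```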

TP 3

4. The importance of packages

Packages

R packages are extensions to the R statistical programming language. They follow a standardized format that can be installed and used in R.
They contain:

  • code, such as functions created for a specific context
  • data for training or reference
  • documentation

Packages are very useful : they help you optimise your code and make it easier to manage. In R, a library is the folder in which installed packages are stored.


The large number of packages/libraries available for R, and the ease of installing and using them, has been cited as a major factor driving the widespread adoption of this language in data science.


Installing a package

To install a package there are (again) different ways to proceed :

  • We can directly use the Rstudio interface in the Packages tab, with the Install button for packages available on CRAN or with a package archive file .tar.gz.

  • This is equivalent to using the install.packages() function in the Console.
install.packages("ggplot2")                # notice the " " 
  • Or we can use the Console to download and install packages from other repositories, for instance from Bioconductor using the package BiocManager.
install.packages("BiocManager")
BiocManager::install("simplifyEnrichment") # A package from bioconductor
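
Before installing, it can be useful to check which packages are already present. The sketch below relies on requireNamespace() from base R ; the list needed_packages is only an example and nothing is installed by this code.

```r
needed_packages <- c("stats", "ggplot2")   # example list ; adapt it to your script

# requireNamespace() returns TRUE when the package is installed, FALSE otherwise
is_installed <- sapply(needed_packages, requireNamespace, quietly = TRUE)
is_installed

missing_packages <- needed_packages[!is_installed]
missing_packages   # these could then be passed to install.packages()
```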

Loading and using a package

To use a package, we first have to load it by calling library(name_of_the_package).

library(ggplot2) # notice there is no need for "" here

We can then use the functions and data contained in the package in our script.

Packages are updated regularly, which can break your code.
Make sure you know which packages and which versions you use in your script to ensure reproducibility.

The information of the packages and versions used can be accessed with the session_info() function of the devtools package.

devtools::session_info() # devtools is also a package : notice how we called its function session_info with ::

All the packages are open-source and can be used for free. Nevertheless, you must cite the packages used in your research.

citation("ggplot2") # gives the way to cite the package
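
If devtools is not installed, base R offers similar information. The lines below are a small sketch using only functions shipped with R.

```r
packageVersion("stats")   # version of one installed package
R.version.string          # version of R itself
sessionInfo()             # R version, operating system, loaded packages and their versions
```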

Warnings and errors associated with packages

Sometimes, a function exists with the same name in two different packages. For example, the function filter exists in both packages stats and dplyr, and does not do the same thing in each.

If you want to be sure which function you are using, you can call it explicitly with package::function(). This also works without loading the package with library().

dplyr::filter()
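
To see how different the two functions are, here is a small sketch. The dplyr call is left as a comment so that the example runs even when dplyr is not installed.

```r
x <- 1:5

# stats::filter() applies a linear filter to a series of numbers,
# here a moving average over 3 values
stats::filter(x, rep(1/3, 3))    # NA 2 3 4 NA

# dplyr::filter() keeps the rows of a data frame that match a condition.
# The call is left as a comment so that the example runs without dplyr :
# dplyr::filter(iris, Species == "setosa")
```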
  • If you try using a package that is not installed or not loaded, an error message will be displayed.
  • Sometimes, there are issues when trying to install a new package.
    They can be caused by conflicts between package versions or dependencies, or by missing system libraries (for example on Ubuntu).

TP 4

5. Visualization

Data generation

To illustrate different basic visualisations, we will create the dataframe trees_data with 100 individuals for 3 variables :

  • size_m : trees sizes in meters

  • circumference_cm : trees circumferences in centimeters

  • type : trees types

# We use the MASS package to generate correlated data
sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)            # Covariance matrix (unit variances, correlation 0.8)
mv_norm <- MASS::mvrnorm(100, mu = c(20, 50),           # Multivariate normal distribution
                      Sigma = sigma) 
# Mean of the first variable before rescaling = 20
# Mean of the second variable before rescaling = 50

trees_data <- data.frame(
  size_m = abs(mv_norm[, 1] * 5 + 10),                  # Rescaled sizes, centred around 110 (20 * 5 + 10)
  
  circumference_cm = abs(mv_norm[, 2] * 2 + 20),        # Rescaled circumferences, centred around 120 (50 * 2 + 20)
  type = sample(c("Maple", "Oak", "Willow"),            
                100,
                replace = TRUE)  
)
head(trees_data)
    size_m circumference_cm   type
1 113.8421         120.0436 Willow
2 111.7442         119.6686 Willow
3 110.5724         120.4254  Maple
4 110.3417         119.6122 Willow
5 112.8848         120.4722  Maple
6 113.0304         122.5207  Maple
summary(trees_data)
     size_m       circumference_cm     type          
 Min.   : 99.23   Min.   :115.4    Length:100        
 1st Qu.:106.35   1st Qu.:118.6    Class :character  
 Median :110.35   Median :120.1    Mode  :character  
 Mean   :110.01   Mean   :120.0                      
 3rd Qu.:113.27   3rd Qu.:121.0                      
 Max.   :129.17   Max.   :127.1                      
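
Because these data are drawn at random, the values change at every run. A common way to get the same values each time is to call set.seed() before generating the data. The sketch below only illustrates the idea ; the seed value 2025 is arbitrary.

```r
set.seed(2025)                                  # arbitrary seed value
first_draw  <- sample(c("Maple", "Oak", "Willow"), 5, replace = TRUE)

set.seed(2025)                                  # same seed again
second_draw <- sample(c("Maple", "Oak", "Willow"), 5, replace = TRUE)

identical(first_draw, second_draw)              # TRUE : same seed, same "random" values
```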

Scatter plot

A scatter plot is a basic plot that displays the values of two quantitative variables against each other.

plot(trees_data$size_m, trees_data$circumference_cm)

plot(trees_data$size_m, trees_data$circumference_cm,
     main = "Tree size by circumference",              # Add a title
     xlab = "Size (m)", ylab = "Circumference (cm)",   # Add labels for axis
     pch = 17,                                         # Change the shape of points
     col = "deeppink2")                                # Change the color of points           

par(mfrow = c(1,2))                                                   # Split the window into two parts
plot(trees_data$size_m, trees_data$circumference_cm)
abline(h = mean(trees_data$circumference_cm ), col = "grey", lty = 5) # Add horizontal line
abline(v = mean(trees_data$size_m), col = "red", lty = 7)             # Add vertical line

plot(trees_data$size_m, trees_data$circumference_cm)
abline(lm(trees_data$circumference_cm ~ trees_data$size_m),           # Add regression line
       col = "blue")            
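
par(mfrow = ...) stays active for all following plots until it is changed again. One possible habit, sketched below with made-up data, is to store the previous settings and restore them afterwards.

```r
old_par <- par(mfrow = c(1, 2))   # split the window and remember the previous settings
plot(1:10, main = "Left panel")
plot(10:1, main = "Right panel")
par(old_par)                      # restore the previous layout
```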

Histogram

A histogram shows the distribution of a quantitative variable.

hist(trees_data$size_m)

par(mfrow = c(1,2))      # Split the window into two parts
hist(trees_data$size_m,
     freq = TRUE,        # Represent frequencies in the first hist
     col="deeppink2",    # Choose the color
     breaks = 50)        # Suggest the number of bars

hist(trees_data$size_m,
     freq = FALSE,       # Probability densities are plotted in the second hist
     col="deeppink4",    # Change the color
     breaks = 20)        # Use fewer bars

par(mfrow = c(1,1))
hist(trees_data$size_m,
     col="deeppink3",                  
     breaks = 50,                       
     main = "Distribution of trees sizes", # Choose title name
     xlab = "Sizes",                       # Choose x axis label
     ylab = "Frequency")                   # Choose y axis label

hist(trees_data$size_m, 
     freq =FALSE,                     
     col="darkseagreen",               
     breaks = 50,                     
     main = "Distribution of trees sizes", 
     xlab = "Sizes",                   
     ylab = "Density")               
abline(v = mean(trees_data$size_m), col = "red")  # Add a vertical red line at the mean of the distribution
lines(density(trees_data$size_m), col = "black")  # Add the estimated density curve

Boxplot

A boxplot represents a quantitative variable through its minimum, first quartile, median, third quartile and maximum values.

par(mfrow = c(1,2))
boxplot(trees_data$size_m)
boxplot(trees_data$size_m, frame = FALSE)           # remove frame      

par(mfrow = c(1,2))
boxplot(trees_data$size_m ~ trees_data$type, frame = FALSE)  # Boxplots by groups
boxplot(trees_data$size_m ~ trees_data$type, frame = FALSE,
        horizontal = TRUE)                                   # Horizontal boxplots

boxplot(trees_data$size_m ~ trees_data$type, 
        frame = FALSE,
        names = c("Maple tree", "Oak tree", "Willow tree"), # Change group names
        border = c("red", "darkgreen", "blue"),             # Change border colors (by group)
        col = "white",                                      # Change fill color (all the same)
        main = "Plot of trees sizes by types",              # Change main title
        xlab = "Trees types",                               # Change x-axis title
        ylab = "Sizes")                                     # Change y-axis title
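
The five values summarised by a boxplot can also be computed directly. The sketch below uses a small made-up vector. Note that, by default, boxplot() limits the whiskers to 1.5 times the interquartile range and draws more extreme values as individual points.

```r
sizes <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

fivenum(sizes)          # 1 3 5 7 9 : minimum, lower hinge, median, upper hinge, maximum
summary(sizes)          # similar summary, with the mean added
```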

Pie chart

A pie chart shows the proportions of the categories of a qualitative variable.

# Count the trees of each type
effectives <- table(trees_data$type)
print(effectives)

 Maple    Oak Willow 
    29     35     36 
pie(effectives)

my_labels <- c("Maple tree", "Oak tree", "Willow tree")
my_color <- RColorBrewer::brewer.pal(5, "Set2")
pie(
  effectives ,
  main="Pie Chart of trees types",      # add title
  labels = my_labels,                   # add labels
  border = "white",                     # color of border
  col = my_color                        # vector of colors to be used in filling
)
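
The proportions drawn by the pie chart can also be obtained as numbers with prop.table(). The sketch below uses a small made-up sample rather than trees_data.

```r
tree_types <- c("Maple", "Oak", "Oak", "Willow")   # small made-up sample

counts <- table(tree_types)
counts                                  # Maple 1, Oak 2, Willow 1
prop.table(counts)                      # 0.25, 0.50, 0.25
round(100 * prop.table(counts), 1)      # the same, as percentages
```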

Barplot

A bar plot shows how the observations of a qualitative variable are distributed across its categories.

# Count the trees of each type
effectives <- table(trees_data$type)

par(mfrow=c(1,2))
barplot(effectives)
barplot(effectives, horiz = TRUE)      # horizontal barplot

par(mfrow=c(1,2))
barplot(effectives, 
        names.arg = c("Maple tree", "Oak tree", "Willow tree"), # Change group names
        main = "Distribution of tree types",                    # Add main title
        xlab = "Trees types")                                   # Add x-axis label
barplot(effectives, border = "grey",                            # Change border color (single color)
        col = c("red", "cyan", "coral"))                        # Change filled color (different color for each group)

y <- rbind(effectives,
           old_effectives = c(8, 7, 10),
           new_effectives = c(7, 3, 5))
par(mfrow=c(1, 2))
barplot(y, legend = rownames(y))    # Stacked barplot
barplot(y, legend = rownames(y),    
        beside = TRUE)              # Grouped barplot

TP 5

Conclusion

Conclusion

Today we worked on :

  • The Rstudio IDE for R programming
  • The basics of R, functions, importing and manipulating data
  • The importance of packages
  • Simple visualisation with R base



Tomorrow will be about improving your basics with :

  • dplyr for improved manipulation
  • ggplot2 for better visualisation
  • R markdown for beautiful reports

Conclusion



Remember that many online resources are available to help you code in R.

A more advanced use (day 2)

Presentation outline

First day

  1. Understanding R and Rstudio

  2. Programming with R

  3. Importing data and basic manipulation

  4. The importance of packages

  5. Basic visualization

Second day

  1. Tidyverse : dplyr and ggplot2 for visualisation

  2. Reports with Rmarkdown

  3. Manipulation of (your ?) data

1. Tidyverse

Tidyverse

tidyverse is one of the most widely used R packages. It is a collection of packages, each with its own specific use, designed to make data science tasks easier and code more readable.


Some packages included in Tidyverse :

  • dplyr for data manipulation
  • ggplot2 for data visualization
  • readr for importing text data
  • stringr for string manipulation

dplyr

dplyr for data manipulation

dplyr is a package used to manipulate data.

dplyr syntax relies on the pipe %>% (provided by the magrittr package and made available by dplyr), a powerful tool that passes an object through several functions in succession.

Example :

library(tidyverse)  # or just library(dplyr)

grades <- c(12, 8, 14, 7, 19)

grades %>%          # Vector containing value 12, 8, 14, 7, 19
  mean() %>%        # Mean of this vector
  sqrt() %>%        # square root of the mean
  round(digits = 2) # round the value to 2 digits
[1] 3.46

This is equivalent to, but more readable than :

grades <- c(12, 8, 14, 7, 19)

round(sqrt(mean(grades)), digits = 2)
[1] 3.46
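
Since version 4.1, R also has a native pipe, written |>, which needs no package. Assuming R 4.1 or later is installed, the same chain can be sketched as :

```r
grades <- c(12, 8, 14, 7, 19)

grades |>
  mean() |>
  sqrt() |>
  round(digits = 2)   # 3.46
```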

dplyr for data manipulation

Here is an example that uses dplyr to select multiple columns to create a new data frame, using the select function.

Print the first rows of the whole dataset :

library(tidyverse)

iris %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Print the first rows of the same dataset, after selecting some columns :

library(tidyverse)

iris %>% 
  select(Sepal.Length,
         Sepal.Width,
         Species) %>% 
  head()
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa
6          5.4         3.9  setosa

dplyr for data manipulation

Here is an example that uses dplyr to apply a filter on rows, using the filter function.

Plot the whole dataset.

plot(iris$Sepal.Length, 
     iris$Sepal.Width,
     col = iris$Species,
     pch = 19)

Same plot after filtering on species.

library(tidyverse)

iris_filtered <- iris %>%
  filter(Species %in% c("virginica","setosa"))

plot(iris_filtered$Sepal.Length, 
     iris_filtered$Sepal.Width,
     col = iris_filtered$Species,  # colors must come from the filtered data
     pch = 19)
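
A quick way to check what a filter kept is to count the remaining rows per group. The sketch below does this with count(), and also shows that several conditions separated by commas are combined with a logical AND. The object names species_counts and large_setosa are chosen for the example.

```r
library(dplyr)

iris_filtered <- iris %>%
  filter(Species %in% c("virginica", "setosa"))

# Check how many rows remain per species after filtering
species_counts <- iris_filtered %>% count(Species)
species_counts
#     Species  n
# 1    setosa 50
# 2 virginica 50

# Several conditions separated by commas must all be true for a row to be kept
large_setosa <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5)
```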

dplyr for data manipulation

Here is an example that uses dplyr to add a new column, using the mutate function.

Just select some columns and print the first rows of the result.

library(tidyverse)

iris %>%
  select(Sepal.Length, Sepal.Width) %>%
  head()
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

Create a new variable Sepal.Area, then select some variables and print the first rows of the result.

library(tidyverse)

iris %>% 
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>% 
  select(Sepal.Area, Sepal.Length, Sepal.Width) %>%
  head()
  Sepal.Area Sepal.Length Sepal.Width
1      17.85          5.1         3.5
2      14.70          4.9         3.0
3      15.04          4.7         3.2
4      14.26          4.6         3.1
5      18.00          5.0         3.6
6      21.06          5.4         3.9
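
The verbs seen so far can be chained in a single pipeline. The sketch below combines filter, mutate and select, and adds arrange(), a dplyr verb not covered above that sorts rows (desc() requests decreasing order). The object name largest_areas is chosen for the example.

```r
library(dplyr)

largest_areas <- iris %>%
  filter(Species == "setosa") %>%                        # keep one species
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%    # add the new column
  arrange(desc(Sepal.Area)) %>%                          # sort, largest area first
  select(Species, Sepal.Area) %>%                        # keep two columns
  head(3)                                                # first three rows

largest_areas
```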

ggplot2

ggplot2 for data visualization

ggplot2 is a tidyverse package for producing elegant graphics.
It provides many options for customizing every detail of a plot.

A ggplot2 plot is built from 3 fundamental parts :

  • data which is a data frame
  • Aesthetics which contains :
    • x and y variables
    • the colors
    • the size or shape of points
    • etc
  • Geometry which defines the type of desired graphic
ggplot2::ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue")   # scatter plot; a fixed color is set outside aes()
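
Here is a concrete sketch of these three parts using the iris dataset. It also shows the difference between mapping color to a variable (inside aes()) and setting one fixed color (outside aes()). The object names p_mapped and p_fixed are chosen for the example.

```r
library(ggplot2)

# Color mapped to a variable: goes inside aes(), one color per species plus a legend
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Fixed color: goes outside aes(), every point is blue and no legend is drawn
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue")

p_mapped
p_fixed
```

Writing color = "blue" inside aes() would instead treat "blue" as a data value, so the points would get a default palette color and a legend entry labelled "blue".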

ggplot2 for data visualization

Geometry terms (geom_xxx) are used to determine how data are displayed.  For example :

  • geom_point for scatterplot
  • geom_bar for barplot
  • geom_text to add text to the existing plots.

Scale terms (scale_*_xxx) are used to modify the scales of a plot.
For example :

  • scale_x_log10() converts the linear x scale to a logarithmic scale

Coordinate terms (coord_xxx) are used to modify the coordinate system of a plot.

Facet terms (facet_xxx) are used to divide a plot into multiple subplots.

Themes (theme_xxx and theme()) are used to change the non-data elements of the graphic, such as the background, grid lines, fonts and legend position.
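
The sketch below stacks one term of each kind on the same plot, so that the role of each layer can be seen by commenting lines in and out. The y limits passed to coord_cartesian() are arbitrary values picked for the example.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +                          # geometry: scatter plot
  scale_x_log10() +                       # scale: logarithmic x axis
  coord_cartesian(ylim = c(1.5, 4.5)) +   # coordinates: zoom on the y axis
  facet_wrap(~ Species) +                 # facets: one panel per species
  theme_bw() +                            # theme: white background, grey grid
  theme(legend.position = "bottom")       # theme(): adjust a single element

p
```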

ggplot2 : examples

library(tidyverse)

ggplot() +
  geom_point(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  theme_bw()

ggplot2 : examples

library(tidyverse)

classroom <- data.frame(
  Course = c("Mathematics", "Biology", "Physics", "Chemistry"),
  Estelle = c(18, 13, 14, 15),
  Benjamin = c(10, 19, 2, 17),
  Clement = c(15, 9, 19, 11),
  Oriane = c(14, 14, 10, 14)
)

classroom %>% head()
       Course Estelle Benjamin Clement Oriane
1 Mathematics      18       10      15     14
2     Biology      13       19       9     14
3     Physics      14        2      19     10
4   Chemistry      15       17      11     14
# We use the melt function from the reshape2 package
# to convert the data.frame to the long format expected by ggplot
classroom <- classroom %>%
  reshape2::melt(id.vars = "Course",
                 value.name = "Marks",
                 variable.name = "Student")   
classroom %>% head()
       Course  Student Marks
1 Mathematics  Estelle    18
2     Biology  Estelle    13
3     Physics  Estelle    14
4   Chemistry  Estelle    15
5 Mathematics Benjamin    10
6     Biology Benjamin    19
classroom %>% dim()
[1] 16  3
# We calculate the means for each student
means <- classroom %>% 
  group_by(Student) %>% 
  summarise(mean_marks = mean(Marks))       
means
# A tibble: 4 × 2
  Student  mean_marks
  <fct>         <dbl>
1 Estelle        15  
2 Benjamin       12  
3 Clement        13.5
4 Oriane         13  
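
The same reshaping can be done without leaving the tidyverse, using pivot_longer() from the tidyr package. This is a sketch of an alternative, not the method used in the slides. The object names classroom_wide and classroom_long are chosen for the example.

```r
library(dplyr)
library(tidyr)

classroom_wide <- data.frame(
  Course = c("Mathematics", "Biology", "Physics", "Chemistry"),
  Estelle = c(18, 13, 14, 15),
  Benjamin = c(10, 19, 2, 17),
  Clement = c(15, 9, 19, 11),
  Oriane = c(14, 14, 10, 14)
)

classroom_long <- classroom_wide %>%
  pivot_longer(cols = -Course,          # reshape every column except Course
               names_to = "Student",    # former column names go here
               values_to = "Marks")     # cell values go here

dim(classroom_long)
# [1] 16  3
```

Unlike melt, pivot_longer() returns Student as a character column rather than a factor, and orders the rows course by course; the per-student means are unchanged.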

ggplot2 : examples

ggplot(classroom, aes(x = Student, y = Marks, fill = Course)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_hline(data = means, 
             aes(yintercept = mean_marks, color = Student),
             linetype = "dashed",
             linewidth = 1) +   # 'size' for lines is deprecated since ggplot2 3.4.0
  labs(title = "Marks for students in different courses",
       x = "Students",
       y = "Marks", 
       fill = "Course",
       color = "Student") +
  scale_color_brewer(palette = "BrBG") +   
  theme_minimal()

ggplot2 : Esquisse

esquisse is an RStudio addin that lets you create ggplot2 graphics interactively.

You can create graphics without coding, and retrieve the code generating the graph at the end.

Here is a tutorial explaining how to use this addin.

TP 6

2. R markdown

R markdown


R Markdown documents provide quick, reproducible reporting from R.
They are designed to be used with the rmarkdown package, which comes pre-installed with the RStudio IDE.


You write your document in markdown (an easy-to-write plain text format) and can add chunks of embedded code (R code but also other languages !).
The extension of an R markdown file is .Rmd.


The document can be regenerated at any time : knitting runs the code chunks and converts the result into a final document in a common format such as HTML, PDF, MS Word or even an HTML5/PDF slideshow.

(This presentation is actually made with Quarto, whose format is closely based on R Markdown)


Rmarkdown structure

This is an example of an R Markdown document with its 3 main elements :

  • The YAML header
  • The text parts (with titles and text)
  • The chunks (containing code)

YAML

The YAML header is used to control how rmarkdown renders your .Rmd file. It is a section of key: value pairs surrounded by --- marks.

The output: value determines what type of output to convert the file into when you call rmarkdown::render().

Possible values for output include :
- html_document (default)
- pdf_document
- word_document
- ioslides_presentation
- beamer_presentation
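
To see how these key: value pairs are read, the sketch below writes a small header to a temporary .Rmd file and parses it with rmarkdown::yaml_front_matter(). The title and author values are placeholders invented for the example.

```r
library(rmarkdown)

# Hypothetical header; the title and author values are placeholders
header <- c(
  "---",
  "title: \"My first report\"",
  "author: \"Your name\"",
  "output: html_document",
  "---",
  "",
  "Some text."
)

rmd_file <- tempfile(fileext = ".Rmd")
writeLines(header, rmd_file)

# yaml_front_matter() parses the key: value pairs between the --- marks
front_matter <- yaml_front_matter(rmd_file)
front_matter$output
# [1] "html_document"
```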

Text

  • Headers : Place one or more # at the start of a line that will be a header (or sub-header). One # for a first level header, two ## for a second level header, and so on.


  • Italics : Surround the text with single asterisks, like this : *this is italicized*.


  • Bold : Surround the text with double asterisks, like this : **this will be in bold**.


  • Lists : Group lines into bullet points that begin with asterisks. Leave a blank line before the first bullet.


  • Hyperlinks : Surround links with brackets, and then provide the target link in parentheses : [Bilille](https://bilille.univ-lille.fr/) will give : Bilille.

Chunks

The chunks are where you will write your code in a chosen language (r, python, bash for example).


A chunk is delimited by ``` and starts with a pair of braces {} containing the language, an optional chunk name and the chunk options.

The options let you, for example :

  • not stop on error,
  • hide the code or the output,
  • choose the format of the output text

For more information on these options and on Rmd in general, you can visit the very useful rmarkdown cookbook.
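
As a small sketch of a chunk option in action, the code below builds a tiny document as text and knits it with knitr::knit(). The chunk is named total and uses echo = FALSE, so only its output appears in the result. The three-backtick delimiter is built with strrep() purely to keep this example readable; in a real .Rmd file you type it directly.

```r
library(knitr)

# Build the chunk delimiter programmatically (three backticks)
fence <- strrep("`", 3)

rmd_text <- c(
  "The result is computed below.",
  "",
  paste0(fence, "{r total, echo = FALSE}"),   # language, chunk name, option
  "total <- sum(c(12, 8, 14))",
  "total",
  fence
)

# knit() runs the chunks and returns the resulting markdown as a string
md <- knit(text = rmd_text, quiet = TRUE)
cat(md)
```

The printed markdown contains the value 34 but not the sum() call, because echo = FALSE hides the source code.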

Rendering : creating your report

To create your report in the output format specified in the YAML header, render the R Markdown file with rmarkdown::render() :

rmarkdown::render(
  'code/my_document.Rmd',
  output_file = 'reports/my_first_report.html',
  clean = TRUE
)

You can also render with a single click on the “Knit” icon that appears above your file in the RStudio editor. A drop-down menu lets you select the type of output that you want.

If you use the RStudio IDE knit button to render your file, you do not need to specify output in the YAML, because the selection made by clicking will override any output set in the YAML header.
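
The output format can also be chosen from code, through the output_format argument of rmarkdown::render(), which takes precedence over the YAML header. The sketch below uses a throwaway document in a temporary folder and only renders if pandoc (bundled with RStudio) is found; the document content is made up for the example.

```r
library(rmarkdown)

# Hypothetical minimal document written to a temporary file
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c("---", "title: \"Demo\"", "---", "", "Hello from R Markdown."), rmd_file)

# render() needs pandoc; skip the rendering step if it is not available
if (pandoc_available()) {
  out_file <- render(rmd_file,
                     output_format = "html_document",  # overrides any output set in the YAML
                     output_dir = tempdir(),
                     quiet = TRUE)
  file.exists(out_file)
}
```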

TP 7

Conclusion theory day 2

We hope that this express training helped you discover the joys of programming in R.
You should now know the basics of :

  • The usage of Rstudio IDE and R programming
  • Creating objects, using functions, doing basic operations
  • Importing, cleaning, manipulating and exporting data tables
  • The importance of packages and how to use them to your advantage (like with the useful dplyr and ggplot2 packages)
  • Creating colourful and meaningful visualisations
  • Using R markdown to create beautiful reports


Don’t forget to use online resources whenever you need them ! One big advantage of R lies in its community !

Now what’s next ?

TP 8 : manipulating data from A to Z

On your own data (with our occasional help)

OR choose between 2 TPs provided with solutions :
- Proteomics abundance table data
- Clinical obesity data

General conclusion

Debriefing



  • Share your experience of TP 8 : any volunteers ?


  • The corrected report generated with Rmarkdown is available in the files we sent you.

Contacts

If you have any questions later

We can answer specific questions but not provide project follow-up.
If you need regular interactions to work on your data, you can contact Bilille using the bilille@univ-lille.fr address.


We are physically present at 3 sites :

  • Cité Scientifique, ESPRIT building, 3rd floor
  • Campus Santé, Plateformes-Cancer building, R+1
  • Campus Pasteur, E.Roux building, 2nd floor

You’ll find more information on our website, and on that of our unit, PLBS.


One last thing : please fill in the following Framaforms to give your opinion on the training course.

The end ☺️