Introduction to R for data manipulation and visualization
October 2025
Schedule :
Breaks :
Lunch :
R and Rstudio installed successfully for everyone ?
Access to Nextcloud files ?
password : formation_r_2025
Bilille is the Lille bioinformatics and biostatistics platform, within the UAR 2014 - US 41 “Plateformes Lilloises en Biologie et Santé”.
PLBS includes 8 platforms, providing access to expertise and equipment to support research in biology and health.
In Bilille, we are currently 10 full-time engineers, directed by Jimmy Vandel (IR CNRS).
Our missions are to :
Us
What about you ?
Data Science is the bridge between raw data (after experimentation) and meaningful insight. It is often divided into several stages (which do not always have a fixed order) :
The programming language R enables us to do all these steps !
First day
Understanding R and Rstudio
Programming with R
Importing data and basic manipulation
The importance of packages
Basic visualization
Second day
Programming with Tidyverse
Reports with Rmarkdown
Manipulation of (your ?) data
R is a programming language for statistical computing and graphics. It originated in 1993 and has been widely adopted since.
It is free software (like Python and Julia, for example) and runs on a wide variety of systems (UNIX, Windows, macOS).
Widely used for bioinformatics/statistics/data science
Relatively “easy” to understand
Open source so everybody can contribute
Very large number of packages developed by a community of contributors
Over 15,000 packages listed on the Comprehensive R Archive Network (CRAN), GitHub and Bioconductor
The Bioconductor project alone includes more than 1,000 packages for the analysis of biological data
R scripts have the file extension .R.
They can be executed in a terminal using the command : Rscript yourfile.R
You can also open the R console in a terminal with the command R :
NB : This example and the previous one come from a Unix system, but there are equivalent commands for Windows and macOS.
RStudio is an integrated development environment for R, whose first version was released in 2011. It allows you to :
work on your R scripts and write reports
execute your code and scripts
visualize the environment and variables
visualize plots, install new packages, consult help
etc
CTRL + enter key executes :
It is equivalent to the Run button at the top right of the script pane of RStudio.
The whole script can be executed using the Source button, also at the top right of the script pane.
In R, you can manipulate :
- 69 (integer)
- "Name" (character)
- "2025-04-14" (date)
- c("a","b","c") (vector)
- x <- 21 (variable integer)
- y <- c(21/7, 42, 0.99) (variable vector)
- round(x, digits = 3)
Main types of R values :
Those values can be assigned to objects with <- like my_var <- 21
Which of these are numbers in R ?
1 "1" "one" one
- 1 is a number
- "1" and "one" are characters
- one is an object (created with one <-)
To store the results and elements of your analysis, you need to use objects.
For example :
Now you can see x in the environment (top right).
You can print the result with print(x) or by typing x in the console.
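As a minimal sketch of creating and printing an object (the value 21 is only an illustration):

```r
# Assign the value 21 to an object named x
x <- 21

# Both lines below display the content of x in the console
print(x)
x
```

After running the first line, x appears in the Environment pane.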
To name your object :
Use specific and explicit names
Use _ if needed (not spaces !)
Use lowercase (convention) because my_var is not the same as My_Var
Start your variable name with a letter
To assign a value to your object :
<- : [less than] symbol + dash (it is also possible to use = for assignment)
Which of these will work ?
Let one <- 1,
1 + 1 "1" + "1" "one" + "one" one + one
- 1+1 and one+one both add 1 to 1, which gives 2
- "1"+"1" and "one"+"one" will not work and you’ll get an error message
A vector is an object containing one or more single values.
To access the elements of a vector, use square brackets [] with the desired index.
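A short sketch of creating a vector and accessing its elements. The object names ages and second_age are hypothetical and chosen only for this illustration:

```r
# Hypothetical vector of three values, created with c()
ages <- c(21, 35, 42)

length(ages)     # number of elements: 3
ages[1]          # first element (R indexing starts at 1): 21
ages[c(1, 3)]    # first and third elements: 21 42
ages[-2]         # everything except the second element: 21 42

# Store the second element in a new object
second_age <- ages[2]
second_age       # 35
```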
Commas “,” are used to separate values, while points “.” are used as the decimal separator in numeric values.
Indexing starts at 1, not 0 as in some other programming languages.
Will both these instructions produce the same result ?
my_vect_1 will be a vector of length 6 with only integers
my_vect_2 will be a vector of length 4 with integers and numerics
my_vect_1 ?
my_vect_2 ?
my_vect_1 ?
my_vect_2 into an object called vect_pos_2 ?
In R, there are two values that indicate missing or empty data :
NA (Not Available)
Operations between single values \(\in \mathbb{R}\) :
TRUE or FALSE :
== compares two values, whereas = assigns a value to an object, like <-.
Also, these operators can be used on two vectors of the same length.
For example :
x <- c(1, 0.1, 0) and y <- c(3, 1.9, 0). Then x + y equals c(4, 2, 0)
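The same example can be run directly. This sketch reuses x and y from the text and adds a few other element-wise operations for illustration:

```r
x <- c(1, 0.1, 0)
y <- c(3, 1.9, 0)

x + y     # element-wise addition: 4 2 0
x * 2     # every element multiplied by 2: 2 0.2 0
x == y    # element-wise comparison: FALSE FALSE TRUE
x < y     # TRUE TRUE FALSE
```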
What will be the outputs for the following cases :
"a" == "a"
a == a
"a" == "b"
TRUE
Error (object a doesn’t exist)
FALSE
Let a = 2 and b <- 2.
TRUE
TRUE
FALSE
What will be the outputs for the following classical and logical operations :
A function is a very useful object in R. It is a reusable set of instructions that performs a specific task.
Functions save a lot of time, avoid copy/paste when running the same task several times and help ensure reproducibility.
You can create your own functions or use existing functions depending on your needs.
temporary and NOT in your environment
Like any object, a function has to exist (be created) before it is used.
Creating a function is quite straightforward when following the structure (input, inside code, output).
Here, we create a function computing the addition of two numbers :
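The original code is not shown here. A minimal sketch following the (input, inside code, output) structure could look like this, where the function name add_two_numbers and the parameter names a and b are hypothetical:

```r
# Input: two numbers a and b
add_two_numbers <- function(a, b) {
  result <- a + b   # inside code: result only exists while the function runs
  return(result)    # output
}

add_two_numbers(2, 3)   # 5
```

The object result is created inside the function, so it does not appear in your environment after the call.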
For this training, we will not be creating any function, as it can be considered advanced R.
Instead, we are going to use functions that have already been created to make our lives easier !
Many basic functions already exist and are available in R. They were created in the R programming language to tackle basic needs and you can directly call them.
A few examples are listed here :
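The original list is not reproduced here. As an illustration, a few functions from base R can be called as follows (the vector values is made-up data):

```r
values <- c(4, 8, 15, 16, 23, 42)   # illustrative data

length(values)            # number of elements: 6
sum(values)               # 108
mean(values)              # 18
min(values)               # 4
max(values)               # 42
sqrt(16)                  # 4
paste("R", "is", "fun")   # "R is fun"
```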
It is very important to understand what a function is for, which input parameters it expects and what it returns.
Every function in base R (and in packages) has a help page that will help you understand it.
To access it, you can either go to the Help panel and type the name of your function or use ?function like in this example :
There is a lot of information in the documentation. We mainly look at these sections :
- Description : the function’s goal
- Usage : how to call the function, the input parameters and their order
- Arguments : the expected type of each of the input parameters
- Examples : code examples of usage of the function
The input parameters of a function are described in the documentation.
They can be :
- necessary : you have to provide them in the function call
- or optional : these parameters were already assigned a value by default in the function’s definition. The function will take this value for the parameter if you don’t specify another value in the call.
Pay attention to the order in which the function expects its input parameters.
To be sure that you assign correct values to the input parameters of a function, you can use the explicit names of the input parameters.
For example, the print() function expects an input parameter named x :
Let’s try with an example !
Let’s look at the help for this function.
Using the documentation, what will the outputs for the following calls be :
round(7.123, digits=2)
round(7.123)
round(2, 7)
round(digits=2, 7.123)
round(7.123, digits="1")
round(2, 3, 1)
7.12
7 (digits=0 in the function definition)
2 (the order !)
7.12 (explicit naming of digits, so the order no longer matters)
ERROR : digits has to be numerical
ERROR : expects only 2 arguments
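These calls can be checked in the console. In the sketch below, the two failing calls are wrapped in tryCatch() so that the script keeps running. The object names msg_character_digits and msg_too_many_args are hypothetical:

```r
round(7.123, digits = 2)   # 7.12
round(7.123)               # 7 (digits defaults to 0)
round(2, 7)                # 2 (by position: x = 2, digits = 7)
round(digits = 2, 7.123)   # 7.12 (named arguments can be given in any order)

# Capture the error messages instead of stopping the script
msg_character_digits <- tryCatch(round(7.123, digits = "1"),
                                 error = function(e) conditionMessage(e))
msg_too_many_args <- tryCatch(round(2, 3, 1),
                              error = function(e) conditionMessage(e))
msg_character_digits
msg_too_many_args
```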
Be careful : when assigning values to the parameters in a function call, use = (not <-).
Understanding basic path architecture…
~/projets/2025_formation_r
~/projets/2025_formation_r/Data
To import data we can use functions like read.csv() with the relative path to our data :
great_data <- read.csv("my_great_data.csv") # my_great_data.csv is in ~/projets/2025_formation_r/Data
Or with the absolute path :
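The original absolute-path example is not shown. The sketch below is self-contained: it first writes a small made-up CSV file into a temporary folder (an assumption made only so the code can run anywhere), then reads it back through its absolute path:

```r
# Create a small CSV file in a temporary folder (hypothetical content)
data_dir <- file.path(tempdir(), "Data")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "my_great_data.csv")
write.csv(data.frame(region = c("North", "South"), soil_ph = c(6.5, 7.1)),
          file = csv_path, row.names = FALSE)

getwd()                            # folder against which relative paths are resolved
great_data <- read.csv(csv_path)   # csv_path is an absolute path
str(great_data)                    # check the column names and types after import
```

Building paths with file.path() avoids having to type the separator by hand.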
Be careful while importing data :
- name your files and folders so that the data is easy to import
- use appropriate column names in your data file (avoid blank spaces, special characters like "(", keep a coherent value type in a column…)
- if your data is too large, your computer may not have enough RAM for R to import and manipulate it
- Windows uses \ in paths instead of / ; in R, write such paths with / or with a doubled backslash \\
You can import data using the RStudio interface. It works for various types of data files and gives you the equivalent command if you want to import your data from your script.
You can import:
Data is often presented in data frames, which are tables with rows and columns.
It is possible to specify names for rows and columns.
Variables (columns) can be of different types :
Make sure each column holds a single type of value across all rows.
For instance, only characters in the region column or only numeric in the soil_ph column.
Checking this will help you greatly for downstream analyses !
great_data$a_new_column <- great_data$existing_col1 + great_data$existing_col2
great_data[,"a_new_column"] <- great_data[,"existing_col1"] + great_data[,"existing_col2"]
great_data <- cbind(great_data, a_new_column) # Binding a vector
# To remove a column named "useless" placed in second position, any of these will work
great_data$useless <- NULL
great_data[2] <- NULL
great_data[[2]] <- NULL
great_data <- great_data[,-2]
great_data <- great_data[-2]
When using cbind, check that the rows of the data frame are sorted in the same order as the values in the new vector. Otherwise, values will be assigned to the wrong rows.
# If great_data contains 3 columns : ("Name", "first name", "year of birth")
# Note : c() turns 1596 into the character "1596", so a numeric column will become character
great_data[nrow(great_data)+1,] = c("Descartes", "René", 1596)
great_data = rbind(great_data, c("Descartes", "René", 1596)) # Binding a vector of length 3
When using rbind, check that the columns of the data frame are in the same order as the values in the new vector. Otherwise, values will be assigned to the wrong columns.
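One way to keep column types intact is to bind a one-row data frame instead of a vector, since rbind then matches columns by name. This is a sketch with a hypothetical data frame and made-up column names, not the training's own dataset:

```r
# Hypothetical data frame with three columns
great_data <- data.frame(
  name = c("Curie", "Pasteur"),
  first_name = c("Marie", "Louis"),
  year_of_birth = c(1867, 1822)
)

# Add a row as a one-row data frame: columns are matched by name
new_row <- data.frame(name = "Descartes", first_name = "René", year_of_birth = 1596)
great_data <- rbind(great_data, new_row)

# Add a column computed from an existing one
great_data$years_since_birth <- 2025 - great_data$year_of_birth

str(great_data)   # year_of_birth is still numeric
```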
subset() function :
Some functions accept NA values, but you have to specify how to deal with them, for example :
Some other functions can’t be computed on data with missing values.
When trying to use them, an error message will be displayed, for example :
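The original examples are not shown. As a sketch of both situations, using a made-up vector named measurements: mean() returns NA unless na.rm = TRUE is given, while quantile() stops with an error when missing values are present:

```r
measurements <- c(2.5, NA, 4.1, 3.4)   # illustrative data with one missing value

is.na(measurements)                # FALSE TRUE FALSE FALSE
mean(measurements)                 # NA: the missing value propagates
mean(measurements, na.rm = TRUE)   # mean of the 3 observed values

# quantile() fails when NA is present and na.rm is FALSE (the default);
# tryCatch() captures the error message so the script keeps running
quantile_message <- tryCatch(quantile(measurements),
                             error = function(e) conditionMessage(e))
quantile_message
quantile(measurements, na.rm = TRUE)
```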
To save and export the data created, we can use different approaches.
- In a .csv file, we can use the write.csv() function :
- In an R-specific format (.rds or .RData) :
saveRDS(my_data_to_save, file = "path/to/my_data.rds") # Save a single object in a .rds file
my_data_again <- readRDS(file = "path/to/my_data.rds") # Restore the object in R
save(data1, file = "data.RData") # Save an object in a .RData file
save(data1, data2, file = "data_1_2.RData") # Save multiple objects in a .RData file
load("data_1_2.RData") # To load the data again
- The whole workspace, in a .RData object :
When leaving RStudio, you will be asked whether to save the current workspace, which is equivalent to calling the save.image() function.
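A round trip can be sketched as follows. The file locations come from tempfile() and the data frame is made up, both assumptions made only to keep the example self-contained:

```r
my_data_to_save <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

# Export to CSV (readable by other software) and read it back
write.csv(my_data_to_save, file = csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

# Export to RDS (R-specific, keeps the object as it is) and read it back
saveRDS(my_data_to_save, file = rds_file)
from_rds <- readRDS(rds_file)

all.equal(from_csv, my_data_to_save)   # TRUE
all.equal(from_rds, my_data_to_save)   # TRUE
```

Setting row.names = FALSE prevents write.csv() from adding an extra column of row numbers to the file.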
R packages are extensions to the R statistical programming language. They are created in a standardized collection format that can be installed and used in R.
They contain:
Packages are very useful : they help you optimise your code and make it easier to manage. In R, the library is the folder in which installed packages are stored, which is why a package is loaded with library().
The large number of packages/libraries available for R, and the ease of installing and using them, has been cited as a major factor driving the widespread adoption of this language in data science.
To install a package there are (again) different ways to proceed :
- Using the Packages tab, with the Install button, for packages available on CRAN or from a package archive file (.tar.gz).
- Using the install.packages() function in the Console.
- Using the Console to download and install packages from the internet, for instance from Bioconductor using the package BiocManager.
To use a package, we first have to load it by calling library(name_of_the_package).
We can then use the functions and data contained in the package in our script.
Packages are updated regularly, which can break your code or change its behaviour.
Make sure you know which packages and which versions you use in your script to ensure reproducibility.
The information of the packages and versions used can be accessed with the session_info() function of the devtools package.
Sometimes, a function exists with the same name in two different packages. For example, the function filter exists in both packages stats and dplyr, and doesn’t do exactly the same thing in both cases.
For example :
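The original example is not shown. One common way to remove the ambiguity is the package::function notation, sketched here with stats::filter() on a made-up vector. The dplyr call is left as a comment so that the block runs without extra packages:

```r
signal <- c(1, 2, 3, 4, 5)   # illustrative data

# stats::filter() applies a linear filter, here a moving average over 3 values
stats::filter(signal, rep(1/3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
# (not run here, it requires the dplyr package):
# dplyr::filter(iris, Species == "setosa")
```

Writing the package name explicitly also documents where each function comes from.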
To illustrate different basic visualisations, we will create the dataframe trees_data with 100 individuals for 3 variables :
size_m : trees sizes in meters
circumference_cm : trees circumferences in centimeters
type : trees types
# We use the MASS package to generate correlated data
sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2) # Covariance matrix (unit variances, correlation 0.8)
mv_norm <- MASS::mvrnorm(100, mu = c(20, 50), # Multivariate normal distribution
                         Sigma = sigma)
# Mean of the first simulated variable = 20
# Mean of the second simulated variable = 50
trees_data <- data.frame(
  size_m = abs(mv_norm[, 1] * 5 + 10), # Rescaled sizes, centred around 110 (20 * 5 + 10)
  circumference_cm = abs(mv_norm[, 2] * 2 + 20), # Rescaled circumferences, centred around 120 (50 * 2 + 20)
  type = sample(c("Maple", "Oak", "Willow"),
                100,
                replace = TRUE)
)
A scatter plot is a basic plot that displays the values of two quantitative variables.
par(mfrow = c(1,2)) # Split the window into two parts
plot(trees_data$size_m, trees_data$circumference_cm)
abline(h = mean(trees_data$circumference_cm ), col = "grey", lty = 5) # Add horizontal line
abline(v = mean(trees_data$size_m), col = "red", lty = 7) # Add vertical line
plot(trees_data$size_m, trees_data$circumference_cm)
abline(lm(trees_data$circumference_cm ~ trees_data$size_m), # Add regression line
col = "blue")
A histogram shows the distribution of a quantitative variable.
par(mfrow = c(1,2)) # Split the window into two parts
hist(trees_data$size_m,
freq = TRUE, # Represent frequencies in the first hist
col="deeppink2", # Choose the color
breaks = 50) # Suggested number of bars
hist(trees_data$size_m,
freq = FALSE, # Probability densities are plotted in the second hist
col="deeppink4", # Change the color
breaks = 20) # Use fewer bars
hist(trees_data$size_m,
freq =FALSE,
col="darkseagreen",
breaks = 50,
main = "Distribution of trees sizes",
xlab = "Sizes",
ylab = "Density") # With freq = FALSE, the y-axis shows densities
abline(v = mean(trees_data$size_m), col = "red") # Add a vertical red line at the mean of the distribution
lines(density(trees_data$size_m), col = "black") # Add the estimated density curve
A boxplot represents a quantitative variable through its minimum, first quartile, median, third quartile and maximum values.
boxplot(trees_data$size_m ~ trees_data$type,
frame = FALSE,
names = c("Maple tree", "Oak tree", "Willow tree"), # Change group names
border = c("red", "darkgreen", "blue"), # Change border colors (by group)
col = "white", # Change fill color (all the same)
main = "Plot of trees sizes by types", # Change main title
xlab = "Trees types", # Change x-axis title
ylab = "Sizes") # Change y-axis title
The pie chart shows the proportions of the categories of a qualitative variable.
The bar plot shows the distribution of the categories of a qualitative variable.
par(mfrow=c(1,2))
effectives <- table(trees_data$type) # Counts per tree type (assumed definition : the original line creating effectives is not shown)
barplot(effectives,
names.arg = c("Maple tree", "Oak tree", "Willow tree"), # Change group names
main = "Distribution of trees types", # Add main title
xlab = "Trees types") # Add x-axis label
barplot(effectives, border = "grey", # Change border color (single color)
col = c("red", "cyan", "coral")) # Change fill color (different color for each group)
Today we worked on :
Tomorrow will be about improving your basics with :
Remember that many online resources are available to help you code in R :
First day
Understanding R and Rstudio
Programming with R
Importing data and basic manipulation
The importance of packages
Basic visualization
Second day
Tidyverse : dplyr and ggplot2 for visualisation
Reports with Rmarkdown
Manipulation of (your ?) data
tidyverse is one of the most used packages. It’s a collection of R packages, each with its own specific use, designed to make data science tasks easier and more readable.
Some packages included in Tidyverse :
- dplyr for data manipulation
- ggplot2 for data visualization
- readr for importing text data
- stringr for string manipulation
dplyr is a package used to manipulate data.
dplyr syntax uses the pipe %>%, a powerful tool in R that applies several functions to the same object in succession.
Example :
library(tidyverse) # or just library(dplyr)
grades <- c(12, 8, 14, 7, 19)
grades %>% # Vector containing value 12, 8, 14, 7, 19
mean() %>% # Mean of this vector
sqrt() %>% # square root of the mean
round(digits = 2) # round the value to 2 digits
[1] 3.46
This is equivalent to, but more readable than :
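The original nested version is not shown. With base R only, the same computation can be sketched as nested calls or with intermediate objects (the names nested_result, grades_mean and grades_root are hypothetical):

```r
grades <- c(12, 8, 14, 7, 19)

# Nested calls are read from the inside out: mean, then sqrt, then round
nested_result <- round(sqrt(mean(grades)), digits = 2)
nested_result   # 3.46

# Same steps with intermediate objects
grades_mean <- mean(grades)
grades_root <- sqrt(grades_mean)
round(grades_root, digits = 2)   # 3.46
```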
Here is an example that uses dplyr to select multiple columns to create a new data frame, using the select function.
Here is an example that uses dplyr to apply a filter on rows, using the filter function.
Here is an example that uses dplyr to add a new column, using the mutate function.
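The original examples are not shown. The three verbs can be sketched on the built-in iris dataset, assuming the dplyr package is installed. The object names and the Petal.Ratio variable are hypothetical:

```r
library(dplyr)

# select(): keep only some columns
iris_small <- iris %>%
  select(Species, Petal.Length, Petal.Width)

# filter(): keep only the rows matching a condition
setosa_only <- iris_small %>%
  filter(Species == "setosa")

# mutate(): add a column computed from existing ones
setosa_ratio <- setosa_only %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width)

head(setosa_ratio)
```

Each step returns a new data frame, so the original iris object is left unchanged.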
Create a new variable Sepal.Area, then select some variables and print the head of the result.
ggplot2 is a package from the tidyverse collection that produces elegant graphics.
It provides many options to fine-tune every detail of a plot.
The base of ggplot2 plots is divided into 3 fundamental parts :
x and y variables
Geometry terms (geom_xxx) are used to determine how data are displayed. For example :
- geom_point for scatter plots
- geom_bar for bar plots
- geom_text to add text to existing plots.
Scale terms (scale_*_xxx) are used to modify the scales of plots.
For example :
- scale_x_log10() converts a linear scale to a logarithmic scale
Coordinate terms (coord_xxx) are used to modify the coordinates of plots.
Facet terms (facet_xxx) are used to divide a plot into multiple subplots.
Themes are used to change the overall look of the graphic, such as its background.
library(tidyverse)
classroom <- data.frame(
Course = c("Mathematics", "Biology", "Physics", "Chemistry"),
Estelle = c(18, 13, 14, 15),
Benjamin = c(10, 19, 2, 17),
Clement = c(15, 9, 19, 11),
Oriane = c(14, 14, 10, 14)
)
classroom %>% head()
       Course Estelle Benjamin Clement Oriane
1 Mathematics      18       10      15     14
2     Biology      13       19       9     14
3     Physics      14        2      19     10
4   Chemistry      15       17      11     14
# We use the function melt of the package reshape2
# to format the data.frame in a usable format for ggplot
classroom <- classroom %>%
reshape2::melt(id.vars = "Course",
value.name = "Marks",
variable.name = "Student")
classroom %>% head()
       Course  Student Marks
1 Mathematics  Estelle    18
2     Biology  Estelle    13
3     Physics  Estelle    14
4   Chemistry  Estelle    15
5 Mathematics Benjamin    10
6     Biology Benjamin    19
[1] 16 3
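# Mean mark per student (assumed definition : the original code creating means is not shown)
means <- classroom %>%
  group_by(Student) %>%
  summarise(mean_marks = mean(Marks))
ggplot(classroom, aes(x = Student, y = Marks, fill = Course)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_hline(data = means,
             aes(yintercept = mean_marks, color = Student),
             linetype = "dashed",
             linewidth = 1) + # linewidth replaces size for lines since ggplot2 3.4.0
  labs(title = "Marks for students in different courses",
       x = "Students",
       y = "Marks",
       fill = "Course",
       color = "Student") +
  scale_color_brewer(palette = "BrBG") +
  theme_minimal()
esquisse is an RStudio add-in for creating ggplot2 graphics interactively.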
You can create graphics without coding, and retrieve the code generating the graph at the end.
Here is a tutorial explaining how to use this addin.
R Markdown documents provide quick, reproducible reporting from R.
They are designed to be used with the rmarkdown package. This package comes already installed with the RStudio IDE.
You write your document in markdown (an easy-to-write plain text format) and can add chunks of embedded code (R code but also other languages !).
The extension of an R markdown file is .Rmd.
The R Markdown document can be re-rendered at any time : knitting runs the code chunks and converts the result into a final document in common formats such as HTML, PDF, MS Word or even HTML5/PDF slideshows.
(This presentation is actually made with Quarto, which builds on the R Markdown format)
This is an example of an R Markdown document with its 3 main elements :
The YAML header is used to control how rmarkdown renders your .Rmd file. It is a section of key: value pairs surrounded by --- marks.
The output: value determines what type of output to convert the file into when you call rmarkdown::render().
The possible values for output are :
- html_document (default)
- pdf_document
- word_document
- ioslides_presentation
- beamer_presentation
- Headers : put # at the start of a line that will be a header (or sub-header). One # for a first level header, two ## for a second level header, and so on.
- Italics : surround the text with a single *, like this *this is italicized*.
- Bold : put two * around the text like : **this will be in bold**.
- Links : [Bilille](https://bilille.univ-lille.fr/) will give : Bilille.
The chunks are where you will write your code in a chosen language (r, python, bash for example).
The chunk is delimited with ``` and gives information and options about the chunk in between {}.
The different options enable the chunk to :
For more information on these options and on Rmd in general, you can visit the very useful rmarkdown cookbook.
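To see the three elements together (YAML header, markdown text, code chunk), the sketch below writes a minimal .Rmd file from R. The file name, the title and the chunk label cars-summary are hypothetical. The chunk delimiter is built with strrep() only to avoid typing three backticks inside this example:

```r
fence <- strrep("`", 3)   # three backticks, used to delimit a chunk

rmd_lines <- c(
  "---",                                           # YAML header starts
  "title: \"My first report\"",
  "output: html_document",
  "---",                                           # YAML header ends
  "",
  "# A first level header",                        # markdown text
  "",
  "Some *italic* and **bold** text.",
  "",
  paste0(fence, "{r cars-summary, echo = TRUE}"),  # chunk starts, with its options
  "summary(cars)",
  fence                                            # chunk ends
)

rmd_file <- file.path(tempdir(), "my_document.Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering is left commented out because it needs the rmarkdown package and pandoc
# rmarkdown::render(rmd_file)
```

In practice the file is written directly in the RStudio editor. Generating it from R is only a way to show its content here.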
To create your report file in the chosen output format, you can render an R Markdown file with rmarkdown::render() and by specifying the output chosen in the YAML section :
rmarkdown::render(
'code/my_document.Rmd',
output_file = 'reports/my_first_report.html',
clean = TRUE
)
You can also render by clicking the “Knit” icon that appears above your file in the RStudio editor. A drop-down menu lets you select the type of output that you want.
If you use the RStudio IDE knit button to render your file, you do not need to specify output in the YAML, because the selection made by clicking will override any output set in the YAML header.
We hope that this express training helped you discover the joys of programming in R.
You should now know the basics of :
Don’t forget to use online resources anytime you need ! One big advantage of R lies in its community !
Now what’s next ?
On your own data (with our occasional help)
OR choose between 2 practical exercises with solutions :
- Abundance table proteomic data
- Clinical obesity data
Rmarkdown is available in the files we sent you.
If you have any questions later
We can answer specific questions but not provide project follow-up.
If you need regular interactions to work on your data, you can contact Bilille using the bilille@univ-lille.fr address.
We are physically present on 3 sites :
You’ll find more information on our website, and on that of our unit, PLBS.
One last thing : please fill in the following Framaforms to give your opinion on the training course.