The first step is the setup chunk, which configures the global options of your R Markdown document.
# The following call sets echo = TRUE for every chunk, so the code is displayed in the HTML output (unless a chunk explicitly hides it)
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = "allow")
Note: chunks can be named, but R Markdown does not allow
several chunks to share the same name. In the setup
chunk you can specify
options(knitr.duplicate.label = "allow") to allow this.
You could also have written the chunk header as
{r setup, include = FALSE, message = FALSE} to keep the
chunk out of the rendered document and to suppress all messages.
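As a minimal sketch of what a fuller setup chunk might contain, the block below sets a few more defaults and reads one back. The options `message`, `warning` and `fig.width` are illustrative additions, not settings used in this report.

```r
# Sketch only: the extra options (message, warning, fig.width) are illustrative.
library(knitr)

# Defaults applied to every chunk that does not override them
opts_chunk$set(
  echo = TRUE,      # show the code in the output
  message = FALSE,  # hide messages such as package start-up notes
  warning = FALSE,  # hide warnings
  fig.width = 7     # default figure width, in inches
)

# Read a default back to confirm it was registered
echo_default <- opts_chunk$get("echo")
echo_default
```

Any single chunk can still override these defaults in its own header, for example with `echo = FALSE`.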
The second step is to load the libraries you will use throughout the R Markdown report.
When writing an R Markdown document, you may use different packages
at different stages.
A good practice is to load all of them at the beginning of the
report.
library(tidyverse)
# You may install the DT package if not in your library
# install.packages("DT")
library(DT)
In the library chunk, you can specify message = FALSE to
suppress messages, as library messages often include advice on how to
cite the package in publications or describe the package’s purpose.
The next step is to import your data. In this
example, the data are located in the folder Data.
You should adjust the file path according to the location of your data to ensure the report can be knitted successfully.
crops_data = read.csv("./../Data/crops_data.csv", stringsAsFactors = TRUE)
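A wrong relative path is a common reason for a failed knit, since by default paths are resolved from the folder that contains the .Rmd file. The sketch below wraps the import in a small check; `read_csv_checked` is a hypothetical helper introduced here for illustration, and the demonstration uses a temporary file with made-up values rather than the real dataset.

```r
# Hypothetical helper (not part of the original report):
# read a CSV only if the file exists, otherwise fail with a clear message.
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", normalizePath(path, mustWork = FALSE))
  }
  read.csv(path, stringsAsFactors = TRUE)
}

# Self-contained demonstration with a temporary file (values are made up)
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(crop_type = c("Rice", "Maize"), humidity = c(55.2, 61.0)),
  tmp,
  row.names = FALSE
)
demo <- read_csv_checked(tmp)
str(demo)
```

In the report itself, the call could look like `read_csv_checked(file.path("..", "Data", "crops_data.csv"))`, assuming the same folder layout as above.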
Before using the dataset for plotting, you first need to
inspect it and deal with any missing values. To inspect
it, you can use the function summary().
summary(crops_data)
## farm_id region crop_type soil_moisture
## FARM0001: 1 Central USA:108 Cotton :107 Min. : 7.154
## FARM0002: 1 East Africa:107 Maize :111 1st Qu.:17.128
## FARM0003: 1 North India: 99 Rice : 80 Median :24.672
## FARM0004: 1 South India: 91 Soybean:108 Mean :27.922
## FARM0005: 1 South USA : 93 Wheat : 92 3rd Qu.:36.675
## FARM0006: 1 NA's : 2 NA's : 2 Max. :67.395
## (Other) :494 NA's :3
## soil_pH temperature_C rainfall_mm humidity
## Min. :5.510 Min. :15.01 Min. : 35.12 Min. :40.23
## 1st Qu.:6.030 1st Qu.:20.30 1st Qu.:111.43 1st Qu.:51.76
## Median :6.530 Median :24.70 Median :185.80 Median :65.61
## Mean :6.525 Mean :24.70 Mean :189.01 Mean :65.17
## 3rd Qu.:7.040 3rd Qu.:29.09 3rd Qu.:244.23 3rd Qu.:77.96
## Max. :7.500 Max. :34.84 Max. :444.17 Max. :90.00
## NA's :2 NA's :1 NA's :1 NA's :3
## sunlight_hours irrigation_type fertilizer_type pesticide_usage_ml
## Min. : 4.010 Drip :111 Inorganic:166 Min. : 5.05
## 1st Qu.: 5.668 Manual :118 Mixed :166 1st Qu.:14.95
## Median : 6.995 None :150 Organic :166 Median :25.98
## Mean : 7.030 Sprinkler:121 NA's : 2 Mean :26.59
## 3rd Qu.: 8.470 3rd Qu.:38.01
## Max. :10.000 Max. :49.94
##
## sowing_date harvest_date total_days yield_kg_per_hectare
## 03-05-24: 15 06-04-24: 10 Min. : 90.0 Min. :2024
## 03-07-24: 11 06-28-24: 10 1st Qu.:105.8 1st Qu.:2995
## 01-27-24: 9 06-02-24: 9 Median :119.0 Median :4071
## 02-02-24: 9 06-10-24: 9 Mean :119.5 Mean :4032
## 03-04-24: 9 06-23-24: 9 3rd Qu.:134.0 3rd Qu.:5066
## (Other) :446 06-27-24: 9 Max. :150.0 Max. :5998
## NA's : 1 (Other) :444 NA's :1
## sensor_id timestamp latitude longitude NDVI_index
## SENS0001: 1 04-14-24: 10 Min. :10.00 Min. :70.02 Min. :0.3000
## SENS0002: 1 04-02-24: 9 1st Qu.:16.26 1st Qu.:75.38 1st Qu.:0.4475
## SENS0003: 1 02-25-24: 7 Median :21.98 Median :80.67 Median :0.6100
## SENS0004: 1 03-21-24: 7 Mean :22.44 Mean :80.40 Mean :0.6021
## SENS0005: 1 05-06-24: 7 3rd Qu.:28.53 3rd Qu.:85.66 3rd Qu.:0.7500
## (Other) :494 05-14-24: 7 Max. :34.98 Max. :89.99 Max. :0.9000
## NA's : 1 (Other) :453 NA's :1
## crop_disease_status
## Mild :125
## Moderate:112
## None :130
## Severe :133
##
##
##
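The NA's entries in the summary can also be counted directly. The following sketch uses a small made-up data frame as a stand-in for crops_data; the column names mirror the real ones, but the values are invented.

```r
# Toy data frame standing in for crops_data (values are made up)
toy <- data.frame(
  region   = factor(c("North India", NA, "South USA", "East Africa")),
  soil_pH  = c(6.1, 7.0, NA, NA),
  humidity = c(55, 60, 72, 81)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows without any missing value
n_complete <- sum(complete.cases(toy))
n_complete
```

Comparing the number of complete rows with `nrow()` shows how many rows `na.omit()` would drop.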
You can observe several missing values, which we will remove using
na.omit().
We will also create a smaller dataset containing only the relevant columns, which will be easier to work with. Specifically, we will use the solution from Question 3 of TP6 to generate a reduced dataset and then perform some basic analysis.
This new dataset, called crops_data_less, will
exclude missing values and include only the columns we are interested
in.
Creating a copy of the dataset allows us to retain the
original crops_data in case we need to use it
later.
crops_data_less = crops_data %>%
  na.omit() %>%
  dplyr::select(
    irrigation_type,
    region,
    crop_type,
    soil_moisture,
    rainfall_mm,
    humidity,
    pesticide_usage_ml
  )
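Note: We called select() with its package prefix (dplyr::select) to avoid potential issues. Other packages, MASS for example, also export a function named select, and the one loaded last masks the others. It is considered good practice to call functions this way so that you do not accidentally use a function from another package without realizing it.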
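To illustrate why the prefix matters, the sketch below uses another pair of functions that share a name: `filter` exists both in dplyr and in the stats package that ships with R. The data are made up for the example.

```r
library(dplyr)

df <- data.frame(x = 1:5)

# dplyr::filter keeps the rows that satisfy a condition
kept <- dplyr::filter(df, x > 3)
kept

# stats::filter applies a linear filter (here a 3-point moving average)
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

With the prefix, each call does what is intended regardless of the order in which the packages were loaded.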
Warning: In the previous code,
filtering with na.omit() is performed before
selecting columns. If the order were reversed, the number of rows might
change, because na.omit() would then only drop rows with missing values
in the selected columns. It’s up to you whether to apply filtering to the
full dataset or only to a subset of selected columns, but be aware that
this choice can lead to different results.
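The effect of the order can be seen on a small made-up data frame. This is only an illustration with invented values; the column names `a`, `b` and `c` are hypothetical.

```r
library(dplyr)

# Toy data: row 2 has a missing value in b, row 3 has one in a
toy <- data.frame(
  a = c(1, 2, NA),
  b = c("x", NA, "z"),
  c = c(10, 20, 30)
)

# Remove incomplete rows first, then keep columns a and c
omit_then_select <- toy %>%
  na.omit() %>%
  dplyr::select(a, c)

# Keep columns a and c first, then remove incomplete rows
select_then_omit <- toy %>%
  dplyr::select(a, c) %>%
  na.omit()

nrow(omit_then_select)  # 1: rows 2 and 3 are dropped
nrow(select_then_omit)  # 2: only row 3 is dropped, since b is no longer considered
```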
We can inspect the data again to check whether any missing values remain.
summary(crops_data_less)
## irrigation_type region crop_type soil_moisture
## Drip :102 Central USA:103 Cotton :106 Min. : 7.154
## Manual :115 East Africa:105 Maize :105 1st Qu.:17.234
## None :146 North India: 96 Rice : 76 Median :24.696
## Sprinkler:117 South India: 87 Soybean:104 Mean :27.934
## South USA : 89 Wheat : 89 3rd Qu.:36.634
## Max. :67.395
## rainfall_mm humidity pesticide_usage_ml
## Min. : 35.12 Min. :40.23 Min. : 5.05
## 1st Qu.:111.98 1st Qu.:52.09 1st Qu.:14.76
## Median :186.02 Median :65.69 Median :25.82
## Mean :189.92 Mean :65.15 Mean :26.46
## 3rd Qu.:243.87 3rd Qu.:77.96 3rd Qu.:37.88
## Max. :444.17 Max. :90.00 Max. :49.94
You can show the head of the dataset and use the function
datatable from the DT package to display it.
This function lets readers sort and filter the table directly in the
report, which is not possible with more basic functions
(knitr::kable, for example).
# The option `scrollX = TRUE` enables horizontal scrolling if the table is too wide.
crops_data_less %>%
  head() %>%
  DT::datatable(options = list(scrollX = TRUE))
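As a sketch of a few further settings, the block below adds a filter box above each column and limits the number of rows per page. It uses the built-in iris dataset as a stand-in, and the values chosen for `filter` and `pageLength` are illustrative.

```r
library(DT)

# Built-in dataset used as a stand-in for crops_data_less
table_widget <- DT::datatable(
  head(iris, 20),
  filter = "top",        # one filter box above each column
  options = list(
    pageLength = 5,      # rows shown per page
    scrollX = TRUE       # horizontal scrolling for wide tables
  )
)
table_widget
```

The widget is interactive only in HTML output, so this approach suits reports knitted to HTML.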
Now we will use the package ggplot2 in order to make
plots. We will recreate the plots from TP6.
The first step was to create a histogram showing the distribution of soil moisture.
You can use Esquisse to generate the main code. But
Esquisse has its limits: you can’t add vertical
lines, for example. To do so, you can copy and paste the code from
Esquisse, then add + geom_vline() to draw the vertical
line.
Note that you can’t use Esquisse while an
R Markdown report is being generated. If you want to use it, you
have to retrieve the code, put it in a chunk, and then render your
document.
ggplot(crops_data_less) +
  aes(x = soil_moisture) +
  geom_histogram(bins = 30L, fill = "#91ABDA") +
  theme_minimal() +
  geom_vline(xintercept = mean(crops_data_less$soil_moisture),
             colour = "red")
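The mean() call above works because crops_data_less no longer contains missing values. If the mean were computed on a column that still has NAs, the result would be NA and there would be no valid position for the line. A minimal sketch with made-up values:

```r
# Made-up values with one missing observation
values <- c(12.5, NA, 30.1, 24.4)

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(values)
mean_with_na

# With na.rm = TRUE, the mean is taken over the observed values only
mean_without_na <- mean(values, na.rm = TRUE)
mean_without_na
```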
You can also create a boxplot showing the pesticide usage per region.
ggplot(crops_data_less) +
  aes(x = region, y = pesticide_usage_ml, fill = region) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  labs(
    x = "Region",
    y = "Pesticide usage (mL)",
    title = "Pesticide usage by region",
    fill = "Region"
  ) +
  theme_minimal() +
  theme(legend.position = "none") # This argument is used to remove the legend
You can display multiple pieces of information on the same plot.
For example, we can create a new data frame containing the mean rainfall and moisture values for each region, in order to display their centroids on the scatterplot.
# We create a data.frame holding the centroid coordinates of each region
centroids = crops_data_less %>%
  dplyr::group_by(region) %>%
  dplyr::summarise(
    mean_rainfall = mean(rainfall_mm),
    mean_soil_moisture = mean(soil_moisture)
  )
ggplot(crops_data_less) +
  aes(x = soil_moisture, y = rainfall_mm, colour = region) +
  geom_point(size = 2.3, shape = "triangle") +
  # To draw the centroids, pass `data = centroids` explicitly; otherwise
  # ggplot2 looks for mean_soil_moisture and mean_rainfall in crops_data_less
  geom_point(
    data = centroids,
    aes(x = mean_soil_moisture, y = mean_rainfall, color = region),
    shape = 3,
    size = 8
  ) +
  scale_color_manual(
    values = c(
      `Central USA` = "#6D7BF8",
      `East Africa` = "#44D051",
      `North India` = "#111111",
      `South India` = "#E85C3C",
      `South USA` = "#FF61C3"
    )
  ) +
  labs(
    x = "Soil moisture",
    y = "Rainfall (mm)",
    title = "Soil moisture according to rainfall, colored by region",
    color = "Region"
  ) +
  theme_minimal()
It is useful to show the session info at the end of the report to indicate which packages have been used and their versions.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.34.0 lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2
## [5] dplyr_1.1.4 purrr_1.1.0 readr_2.1.5 tidyr_1.3.1
## [9] tibble_3.3.0 ggplot2_4.0.0 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.2 tidyselect_1.2.1
## [5] jquerylib_0.1.4 scales_1.4.0 yaml_2.3.10 fastmap_1.2.0
## [9] R6_2.6.1 labeling_0.4.3 generics_0.1.4 knitr_1.50
## [13] htmlwidgets_1.6.4 bslib_0.9.0 pillar_1.11.1 RColorBrewer_1.1-3
## [17] tzdb_0.5.0 rlang_1.1.6 stringi_1.8.7 cachem_1.1.0
## [21] xfun_0.53 sass_0.4.10 S7_0.2.0 viridisLite_0.4.2
## [25] timechange_0.3.0 cli_3.6.5 withr_3.0.2 magrittr_2.0.4
## [29] crosstalk_1.2.2 digest_0.6.37 grid_4.4.2 rstudioapi_0.17.1
## [33] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.5
## [37] glue_1.8.0 farver_2.1.2 rmarkdown_2.30 tools_4.4.2
## [41] pkgconfig_2.0.3 htmltools_0.5.8.1
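When the full session info is more than you need, the versions of a few packages can be reported individually. This is a minimal sketch; the package names below are illustrative and were chosen because they ship with R, so you would replace them with the packages your report actually uses.

```r
# Illustrative package names; "stats" and "utils" ship with every R installation
pkgs <- c("stats", "utils")

# Version of each package, as a named character vector
versions <- vapply(
  pkgs,
  function(p) as.character(packageVersion(p)),
  character(1)
)

R.version.string  # version of R itself
versions
```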