Setup

Global options

The first step is the setup chunk, which configures the global options of your R Markdown document.

# Set echo = TRUE as the default for every chunk, so the code is displayed in the HTML output (unless a chunk overrides it)
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = "allow")

Note: chunks can be named, but R Markdown does not allow several chunks to share the same name. Setting options(knitr.duplicate.label = "allow") in the setup chunk lifts this restriction.

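You could also have written {r setup, include = FALSE, message = FALSE} as the chunk header: include = FALSE still runs the chunk but leaves its code and output out of the rendered document, and message = FALSE suppresses its messages.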
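As a minimal sketch, the global defaults can be set in one place and read back to confirm what later chunks will inherit. The values `message = FALSE` and `warning = FALSE` are assumptions chosen for illustration, not settings used in this report, and the knitr call is only run if knitr is installed.

```r
# Allow duplicate chunk labels; this is an ordinary R option, so base R can read it back
options(knitr.duplicate.label = "allow")
duplicate_policy <- getOption("knitr.duplicate.label")
print(duplicate_policy)

# Set several chunk defaults at once (illustrative values), only if knitr is available
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
  # opts_chunk$get() returns the current default for one option
  print(knitr::opts_chunk$get("message"))
}
```

Options given in an individual chunk header take precedence over these defaults for that chunk only.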

Packages

The second step is to load the libraries you will use throughout the R Markdown report.

When writing an R Markdown document, you may use different packages at different stages.
A good practice is to load all of them at the beginning of the report.

library(tidyverse)
# Install the DT package first if it is not in your library
# install.packages("DT")
library(DT)

In the library chunk, you can specify message = FALSE to suppress messages, as library messages often include advice on how to cite the package in publications or describe the package’s purpose.
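The following sketch shows one possible way to check that the packages are installed before attaching them, so that knitting stops with a readable hint rather than a failed library() call. The variable names are hypothetical, and `suppressPackageStartupMessages()` is shown as an alternative to the `message = FALSE` chunk option, not as what this report does.

```r
# Packages the report needs (taken from the library() calls above)
required_pkgs <- c("tidyverse", "DT")

# requireNamespace() returns FALSE instead of stopping when a package is absent
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install before knitting: ", paste(missing_pkgs, collapse = ", "))
} else {
  # suppressPackageStartupMessages() hides start-up text even without message = FALSE
  for (pkg in required_pkgs) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  }
}
```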

Data import

The next step is to import your data. In this example, the data are located in the folder Data.

You should adjust the file path according to the location of your data to ensure the report can be knitted successfully.

crops_data = read.csv("./../Data/crops_data.csv", stringsAsFactors = TRUE)
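When a report is knitted, relative paths are resolved from the directory of the .Rmd file by default, which is a common cause of "file not found" errors. The sketch below checks the path before reading. It uses a throwaway CSV in a temporary directory (hypothetical toy data, not the crops dataset) so that it runs anywhere.

```r
# Build a throwaway CSV so the example runs anywhere (hypothetical toy data)
csv_path <- file.path(tempdir(), "toy_crops.csv")
write.csv(
  data.frame(region = c("North", "South", "North"),
             soil_moisture = c(12.5, NA, 30.1)),
  csv_path,
  row.names = FALSE
)

# Fail early with a readable message if the path is wrong
if (!file.exists(csv_path)) {
  stop("File not found: ", csv_path, " (working directory: ", getwd(), ")")
}

# stringsAsFactors = TRUE converts text columns to factors, as in the import above
toy_crops <- read.csv(csv_path, stringsAsFactors = TRUE)
str(toy_crops)
```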

Data inspection and modification

Before using the dataset for plotting, you need to inspect it and remove any missing values. The function summary() handles the inspection: for each column it reports the distribution of values and the number of missing values (NA's).

summary(crops_data)
##      farm_id            region      crop_type   soil_moisture   
##  FARM0001:  1   Central USA:108   Cotton :107   Min.   : 7.154  
##  FARM0002:  1   East Africa:107   Maize  :111   1st Qu.:17.128  
##  FARM0003:  1   North India: 99   Rice   : 80   Median :24.672  
##  FARM0004:  1   South India: 91   Soybean:108   Mean   :27.922  
##  FARM0005:  1   South USA  : 93   Wheat  : 92   3rd Qu.:36.675  
##  FARM0006:  1   NA's       :  2   NA's   :  2   Max.   :67.395  
##  (Other) :494                                   NA's   :3       
##     soil_pH      temperature_C    rainfall_mm        humidity    
##  Min.   :5.510   Min.   :15.01   Min.   : 35.12   Min.   :40.23  
##  1st Qu.:6.030   1st Qu.:20.30   1st Qu.:111.43   1st Qu.:51.76  
##  Median :6.530   Median :24.70   Median :185.80   Median :65.61  
##  Mean   :6.525   Mean   :24.70   Mean   :189.01   Mean   :65.17  
##  3rd Qu.:7.040   3rd Qu.:29.09   3rd Qu.:244.23   3rd Qu.:77.96  
##  Max.   :7.500   Max.   :34.84   Max.   :444.17   Max.   :90.00  
##  NA's   :2       NA's   :1       NA's   :1        NA's   :3      
##  sunlight_hours    irrigation_type  fertilizer_type pesticide_usage_ml
##  Min.   : 4.010   Drip     :111    Inorganic:166    Min.   : 5.05     
##  1st Qu.: 5.668   Manual   :118    Mixed    :166    1st Qu.:14.95     
##  Median : 6.995   None     :150    Organic  :166    Median :25.98     
##  Mean   : 7.030   Sprinkler:121    NA's     :  2    Mean   :26.59     
##  3rd Qu.: 8.470                                     3rd Qu.:38.01     
##  Max.   :10.000                                     Max.   :49.94     
##                                                                       
##    sowing_date    harvest_date   total_days    yield_kg_per_hectare
##  03-05-24: 15   06-04-24: 10   Min.   : 90.0   Min.   :2024        
##  03-07-24: 11   06-28-24: 10   1st Qu.:105.8   1st Qu.:2995        
##  01-27-24:  9   06-02-24:  9   Median :119.0   Median :4071        
##  02-02-24:  9   06-10-24:  9   Mean   :119.5   Mean   :4032        
##  03-04-24:  9   06-23-24:  9   3rd Qu.:134.0   3rd Qu.:5066        
##  (Other) :446   06-27-24:  9   Max.   :150.0   Max.   :5998        
##  NA's    :  1   (Other) :444                   NA's   :1           
##     sensor_id      timestamp      latitude       longitude       NDVI_index    
##  SENS0001:  1   04-14-24: 10   Min.   :10.00   Min.   :70.02   Min.   :0.3000  
##  SENS0002:  1   04-02-24:  9   1st Qu.:16.26   1st Qu.:75.38   1st Qu.:0.4475  
##  SENS0003:  1   02-25-24:  7   Median :21.98   Median :80.67   Median :0.6100  
##  SENS0004:  1   03-21-24:  7   Mean   :22.44   Mean   :80.40   Mean   :0.6021  
##  SENS0005:  1   05-06-24:  7   3rd Qu.:28.53   3rd Qu.:85.66   3rd Qu.:0.7500  
##  (Other) :494   05-14-24:  7   Max.   :34.98   Max.   :89.99   Max.   :0.9000  
##  NA's    :  1   (Other) :453                   NA's   :1                       
##  crop_disease_status
##  Mild    :125       
##  Moderate:112       
##  None    :130       
##  Severe  :133       
##                     
##                     
## 

You can observe several missing values, which we will remove using na.omit().
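For a more compact view than summary(), missing values can be counted per column. This is a small sketch on a hypothetical toy data frame (not the crops dataset) using base R only.

```r
# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  region = factor(c("North", NA, "South", "South")),
  soil_moisture = c(12.5, 20.1, NA, 30.4),
  humidity = c(55, 60, 65, NA)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with no missing value at all: these are the rows na.omit() keeps
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Here each column has a single missing value, but they fall on three different rows, so only one of the four rows survives na.omit().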

We will also create a smaller dataset containing only the relevant columns, which will be easier to work with. Specifically, we will use the solution from Question 3 of TP6 to generate a reduced dataset and then perform some basic analysis.

This new dataset, called crops_data_less, will exclude missing values and include only the columns we are interested in.

Creating a copy of the dataset allows us to retain the original crops_data in case we need to use it later.

crops_data_less = crops_data %>%
  na.omit() %>%
  dplyr::select(
    irrigation_type,
    region,
    crop_type,
    soil_moisture,
    rainfall_mm,
    humidity,
    pesticide_usage_ml
  )

Note: we used the explicit package::function() notation here (dplyr::select()) to avoid potential issues. Calling functions this way is good practice, because several loaded packages can export a function with the same name, and you might otherwise use a function from another package without realizing it.

Warning: in the previous code, filtering with na.omit() is performed before selecting columns, so a row is dropped if it has a missing value in any column, including columns that are not kept afterwards. If the order were reversed, only missing values in the selected columns would remove a row, and the number of rows might differ. It is up to you whether to filter the full dataset or only the selected columns, but be aware that this choice can lead to different results.
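The effect of the order can be seen on a hypothetical toy data frame. The sketch uses base R subsetting in place of dplyr::select() so that it runs without extra packages; the logic is the same.

```r
# Hypothetical toy data: soil_pH is a column we do not intend to keep
toy <- data.frame(
  region = c("North", "South", "East", "West"),
  rainfall_mm = c(100, NA, 150, 200),
  soil_pH = c(6.5, 7.0, NA, 6.8)
)
kept_columns <- c("region", "rainfall_mm")

# Filter first: a row is dropped if ANY column is missing, even soil_pH
filter_then_select <- na.omit(toy)[, kept_columns]

# Select first: only missing values in the kept columns drop a row
select_then_filter <- na.omit(toy[, kept_columns])

print(nrow(filter_then_select))  # 2
print(nrow(select_then_filter))  # 3
```

The "East" row is lost in the first version only because of a missing value in a column that is discarded anyway.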

We can inspect the data again to check whether any missing values are left.

summary(crops_data_less)
##   irrigation_type         region      crop_type   soil_moisture   
##  Drip     :102    Central USA:103   Cotton :106   Min.   : 7.154  
##  Manual   :115    East Africa:105   Maize  :105   1st Qu.:17.234  
##  None     :146    North India: 96   Rice   : 76   Median :24.696  
##  Sprinkler:117    South India: 87   Soybean:104   Mean   :27.934  
##                   South USA  : 89   Wheat  : 89   3rd Qu.:36.634  
##                                                   Max.   :67.395  
##   rainfall_mm        humidity     pesticide_usage_ml
##  Min.   : 35.12   Min.   :40.23   Min.   : 5.05     
##  1st Qu.:111.98   1st Qu.:52.09   1st Qu.:14.76     
##  Median :186.02   Median :65.69   Median :25.82     
##  Mean   :189.92   Mean   :65.15   Mean   :26.46     
##  3rd Qu.:243.87   3rd Qu.:77.96   3rd Qu.:37.88     
##  Max.   :444.17   Max.   :90.00   Max.   :49.94

You can show the first rows of the dataset with head() and display them with the datatable() function of the DT package. The resulting table can be sorted and searched directly in the report, which is not possible with static tables such as those produced by knitr::kable().

# The option `scrollX = TRUE` adds horizontal scrolling if the table is too wide.
crops_data_less %>%
  head() %>%
  DT::datatable(options = list(scrollX = TRUE))

Data visualisation

Now we will use the package ggplot2 in order to make plots. We will recreate the plots from TP6.

Histogram

The first step was to draw a histogram showing the distribution of soil moisture.

You can use Esquisse to generate the main ggplot2 code. However, Esquisse has its limits: for example, it cannot add vertical lines. To add one, copy and paste the code generated by Esquisse, then append + geom_vline().
Note that you cannot use Esquisse while an R Markdown report is being rendered. If you want to use it, you have to retrieve the generated code, put it in a chunk, and then render your document.

ggplot(crops_data_less) +
  aes(x = soil_moisture) +
  geom_histogram(bins = 30L, fill = "#91ABDA") +
  theme_minimal() +
  geom_vline(xintercept = mean(crops_data_less$soil_moisture),
             colour = "red")
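The vertical line relies on mean() returning a number. That holds here because crops_data_less contains no missing values, but on a column that still has NA's the mean is NA and geom_vline() would have no usable position. A small sketch with a hypothetical vector:

```r
# Hypothetical values with one missing observation
moisture <- c(12.5, 20.1, NA, 30.4)

# Without na.rm, a single NA makes the whole mean NA
mean_default <- mean(moisture)
print(mean_default)

# na.rm = TRUE drops missing values before averaging
mean_no_na <- mean(moisture, na.rm = TRUE)
print(mean_no_na)
```

If you plot a column from the unfiltered crops_data, passing `na.rm = TRUE` to mean() avoids this problem.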

Boxplot

You can also create a boxplot showing the pesticide usage per region.

ggplot(crops_data_less) +
  aes(x = region, y = pesticide_usage_ml, fill = region) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "viridis", direction = 1) +
  labs(
    x = "Region",
    y = "Pesticide usage (mL)",
    title = "Pesticide usage by region",
    fill = "Region"
  ) +
  theme_minimal() +
  theme(legend.position = "none") # This argument is used to remove the legend

Scatterplot

You can display multiple pieces of information on the same plot.

For example, we can create a new data frame containing the mean rainfall and moisture values for each region, in order to display their centroids on the scatterplot.

# Create a data frame holding the centroid coordinates of each region
centroids = crops_data_less %>%
  dplyr::group_by(region) %>%
  dplyr::summarise(
    mean_rainfall = mean(rainfall_mm),
    mean_soil_moisture = mean(soil_moisture)
  )

ggplot(crops_data_less) +
  aes(x = soil_moisture, y = rainfall_mm, colour = region) +
  geom_point(size = 2.3, shape = "triangle") +
  # Pass `data = centroids` explicitly so this layer uses the centroids data frame
  # instead of inheriting crops_data_less from ggplot()
  geom_point(data = centroids,
             aes(x = mean_soil_moisture, y = mean_rainfall, color = region),
             shape = 3,
             size = 8
  ) +
  scale_color_manual(
    values = c(
      `Central USA` = "#6D7BF8",
      `East Africa` = "#44D051",
      `North India` = "#111111",
      `South India` = "#E85C3C",
      `South USA` = "#FF61C3"
    )
  ) +
  labs(
    x = "Soil moisture",
    y = "Rainfall (mm)",
    title = "Rainfall versus soil moisture, colored by region",
    color = "Region"
  ) +
  theme_minimal()
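To check what the centroid computation produces, the same per-region means can be obtained with base R's aggregate(). This is a sketch on a hypothetical toy data frame, shown as a cross-check and not as a replacement for the dplyr code above.

```r
# Hypothetical toy data: two points in region A, one in region B
toy <- data.frame(
  region = c("A", "A", "B"),
  rainfall_mm = c(100, 200, 50),
  soil_moisture = c(10, 30, 40)
)

# One row per region, one mean per measured column
centroids_check <- aggregate(
  cbind(rainfall_mm, soil_moisture) ~ region,
  data = toy,
  FUN = mean
)
print(centroids_check)
```

Region A gets the centroid (soil_moisture = 20, rainfall_mm = 150), the average of its two points, while region B's centroid coincides with its single point.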

Session info

It is useful to show the session info at the end of the report to indicate which packages have been used and their versions.

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Paris
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.34.0       lubridate_1.9.4 forcats_1.0.1   stringr_1.5.2  
##  [5] dplyr_1.1.4     purrr_1.1.0     readr_2.1.5     tidyr_1.3.1    
##  [9] tibble_3.3.0    ggplot2_4.0.0   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.4.2     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    scales_1.4.0       yaml_2.3.10        fastmap_1.2.0     
##  [9] R6_2.6.1           labeling_0.4.3     generics_0.1.4     knitr_1.50        
## [13] htmlwidgets_1.6.4  bslib_0.9.0        pillar_1.11.1      RColorBrewer_1.1-3
## [17] tzdb_0.5.0         rlang_1.1.6        stringi_1.8.7      cachem_1.1.0      
## [21] xfun_0.53          sass_0.4.10        S7_0.2.0           viridisLite_0.4.2 
## [25] timechange_0.3.0   cli_3.6.5          withr_3.0.2        magrittr_2.0.4    
## [29] crosstalk_1.2.2    digest_0.6.37      grid_4.4.2         rstudioapi_0.17.1 
## [33] hms_1.1.3          lifecycle_1.0.4    vctrs_0.6.5        evaluate_1.0.5    
## [37] glue_1.8.0         farver_2.1.2       rmarkdown_2.30     tools_4.4.2       
## [41] pkgconfig_2.0.3    htmltools_0.5.8.1
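If the full listing is more than you need, individual pieces can be extracted from the object returned by sessionInfo(). A minimal sketch; the variable names are hypothetical, and "stats" is used only because it ships with every R installation.

```r
info <- sessionInfo()

# R version string, e.g. for a methods section
print(info$R.version$version.string)

# Names of attached non-base packages (NULL when none are attached)
print(names(info$otherPkgs))

# Version of a single package
stats_version <- as.character(utils::packageVersion("stats"))
print(stats_version)
```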