Preamble

Practical information

Schedule :

  • March 30th to April 1st
  • 9am to 5pm

Breaks :

  • every half-day
  • lunch for 1 hour, around 12:30

Lunch :

  • microwaves available if needed
  • possibility to buy sandwiches or hot dishes on the campus
  • possibility to eat in the building’s cafeteria

At the bottom left, there is a menu to better navigate through the slides

Bilille


Bilille is the Lille bioinformatics and biostatistics platform, within the UAR 2014 - US 41 “Plateformes Lilloises en Biologie et Santé”.


PLBS includes 8 platforms, providing access to expertise and equipment to support research in biology and health.


In Bilille, we are currently 10 full-time engineers, led by Jimmy Vandel (research engineer CNRS), Ségolène Caboche (research engineer University of Lille) and Mamadou-Dia Sow (research engineer University of Lille).


Our missions are to :

  • support scientific projects
  • organise training courses
  • provide access to cloud computing resources
  • ensure access to software resources
  • run scientific and technical outreach and networking activities

Quick presentation


Us


What about you ?

  • name
  • profile
  • labs
  • experience with programming (in a few words) : have you already tried using Python or another language?
  • your expectations regarding this training ?

Introduction

Python in a few words

  • Open-source interpreted programming language, first released in 1991
  • Very large number of libraries developed by a community of contributors
  • The Python Package Index (PyPI) is a repository of software for the Python programming language, with currently > 600,000 projects
  • The current major version is Python 3, but many scripts are still written in Python 2
  • This training is based on Python 3

Installation

  • Windows: Follow this link and download the latest version of Python 3.
  • Linux: Python 3 is usually already installed; do not try to update the system version unless you are an experienced user.
  • Mac: Follow this link and download the latest version of Python 3.

Integrated development environment

  • Programs are written as scripts.
  • Python script files end with the .py extension.
  • Code can also be written in a notebook, but notebooks make it harder to develop larger projects.
  • Integrated development environments (IDEs) help with project management and scripting in many languages, such as Python, R, Julia, C…
  • One of the most popular IDEs is VS Code, which you are invited to download for use during the course.

Visual Studio Code

  • VS Code offers extensions and utility modules that are necessary or helpful for development.
  • Extensions can be downloaded from the corresponding tab.
  • Development package for Python: demystifying-javascript.python-extensions-pack
  • Extension for Jupyter: ms-toolsai.jupyter
  • You are free to install other extensions (for example: indent-rainbow…).

Extensions help you write scripts, but too many packages can slow down your IDE! Use sparingly.

Let’s get down to business

Variables

Variables presentation

  • Variables are used to store values in memory for later use.
  • A variable can contain any object type.
    • 5 : integer
    • 3.1415 : real number (float)
    • "abc" : string
    • True : boolean (a boolean is a variable that can only take values True or False)
    • print : function (called with parentheses, as in print("..."))
    • and so many other types

Variables are fundamental in programming. You must understand their purpose and how they work in order to obtain the desired results.

Regarding float variables, please note that the separator for decimals is a period (.), not a comma (,).

Variable assignment

  • To assign a value to a variable, you use the assignment operator (equals sign, =) after the variable name.

  • a = 5 : this command assigns the value 5 to the variable a.

  • a = 7 : if the same variable is used again, the previous value is overwritten. The object type can change if the variable is reused / overwritten.

  • It is possible to have as many variables as memory space allows.
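
To make reassignment concrete, here is a minimal sketch. The name sample_count is only an illustration and not part of the course material. It shows that a name always holds the latest value assigned to it, and that the type follows the value:

```python
# Hypothetical variable name, chosen for illustration only.
sample_count = 5
first_type = type(sample_count).__name__
print(sample_count, first_type)

# Reusing the same name overwrites the previous value.
sample_count = 7
print(sample_count)

# The type can change when a value of another type is assigned.
sample_count = "seven"
second_type = type(sample_count).__name__
print(sample_count, second_type)
```

After the last assignment, the values 5 and 7 are no longer reachable through sample_count.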

Variable naming conventions

  • It is important to give meaningful names to variables so that you know at a glance what they are used for.
    It is preferable to have long variable names rather than short, meaningless names.
  • Variable names must start with a letter or an underscore and contain only letters, digits and underscores. Avoid accented letters (é à ö), and note that special characters such as & are not allowed.
  • Variable names are case-sensitive, which means that upper and lower case letters are distinguished.
    myvariable and MYVARIABLE are different objects.
  • Main conventions for variables with several words:
    • snake case: average_car_speed = 50
    • camel case: averageCarSpeed = 50
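
If you are unsure whether a name is acceptable, one possible check (a sketch, not something the course requires) is the string method isidentifier(), combined with the standard keyword module to spot reserved words. The candidate names below are illustrative:

```python
import keyword

# isidentifier() tells whether a string follows the naming rules.
valid_snake = "average_car_speed".isidentifier()
valid_camel = "averageCarSpeed".isidentifier()
starts_with_digit = "2nd_variable".isidentifier()
contains_hyphen = "my-variable".isidentifier()

# Reserved words follow the naming rules but still cannot be used as names.
is_reserved = keyword.iskeyword("class")

print(valid_snake, valid_camel, starts_with_digit, contains_hyphen, is_reserved)
```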

Brainstorming time

What do you expect to be displayed with the following examples?

"my_variable"
'my_variable'

It is not a variable name but only an object (a string). Variable names are not written between quotes.

b
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 b

NameError: name 'b' is not defined

It raises a NameError because the variable b hasn’t been assigned yet.

b = "my_variable"
b
'my_variable'
c = "my_variable"
c = 3.1415
c
3.1415

Brainstorming time

What do you expect to be displayed with the following examples?

2nd_variable = "my_second_variable"
2nd_variable
  Cell In[5], line 1
    2nd_variable = "my_second_variable"
    ^
SyntaxError: invalid decimal literal

It raises a SyntaxError because the variable name starts with a digit.

second_variable = "my_second_variable"
second_variable
'my_second_variable'

String concatenation

  • Strings can be concatenated using the plus sign (+).
'patient_' + '1'
'patient_1'
  • A string can only be concatenated with another string.
'patient_' + 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 'patient_' + 1

TypeError: can only concatenate str (not "int") to str
  • You can concatenate two variables together if they both contain strings.
a = "patient_"
b = "1"
a + b
'patient_1'
  • If you want to repeat a string a certain number of times, you can use the asterisk (*) with an integer.
laugh = "ha" * 3
laugh
'hahaha'

Variable conversion

  • Integers and floats can be converted to string with str(variable).
a = 1
'patient_' + str(a)
'patient_1'
  • In some cases, strings can be converted to integer or float with int(variable) or float(variable).
int('3')
3
float('3.5')
3.5
int('3.5')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 1
----> 1 int('3.5')

ValueError: invalid literal for int() with base 10: '3.5'
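
When a string holds a decimal number and an integer is needed, one possible workaround (a sketch with illustrative variable names) is to convert to float first and then to int. Note that int() drops the decimal part rather than rounding:

```python
text_value = '3.5'

# Step 1: the string becomes a float.
as_float = float(text_value)

# Step 2: the float becomes an integer; the decimal part is discarded.
as_int = int(as_float)

print(as_float, as_int)
```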

Variable display

  • The print() function can be used to display what is between the parentheses.
a = 3.1415
print(a)
3.1415
  • A quick way to combine strings with integers or floats is to use formatted string literals, also called f-strings.
a = 1
print(f'patient_{a}')
patient_1

Do not forget the f before the quotes.

The variable or operation result between the {} will automatically be converted to a string.

You can also separate strings and variables with commas. This adds spaces automatically.

number = 2
price = 12
print("She purchased", number, "ice creams for", price, "euros.")
print(f"She purchased {number} ice creams for {price} euros.")
She purchased 2 ice creams for 12 euros.
She purchased 2 ice creams for 12 euros.
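
As a small illustration of putting an operation between the {} (the values and the .2f format specifier are additions for this sketch, not taken from the slides), the result is computed first and then converted to text:

```python
number = 3
price = 2.5

# The multiplication is evaluated inside the braces.
total_line = f"Total: {number * price} euros"

# An optional format specifier after a colon limits the number of decimals.
ratio_line = f"One third is about {1 / 3:.2f}"

print(total_line)
print(ratio_line)
```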

Variable display - backslash

Single ('') or double ("") quotes can be used around a string, but the opening and closing quotes must be of the same type.

If you need additional quotes inside a string, you can use the other type of quotes, or escape them with backslash (\).

print("When I arrive in the morning, I say 'good morning' to everyone.")
print('When I arrive in the morning, I say "good morning" to everyone.')
print("When I arrive in the morning, I say \"good morning\" to everyone.")
When I arrive in the morning, I say 'good morning' to everyone.
When I arrive in the morning, I say "good morning" to everyone.
When I arrive in the morning, I say "good morning" to everyone.

Variable display - backslash

  • You can also use a raw string (r-string) to print exactly what is between the quotes. It comes in handy when writing text with lots of backslashes (\), such as Windows paths.
print(r"C:\Users\Georges\Documents\test.txt")
C:\Users\Georges\Documents\test.txt
  • You can also use two backslashes (\\). The first backslash (\) escapes the second one, so it is interpreted as a literal backslash.
print("C:\\Users\\Georges\\Documents\\test.txt")
C:\Users\Georges\Documents\test.txt
  • If you don’t use one of these methods, backslash sequences are interpreted as escapes. Here \U starts a Unicode escape, so you get an error.
print("C:\Users\Georges\Documents\test.txt")
  Cell In[21], line 1
    print("C:\Users\Georges\Documents\test.txt")
          ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
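
Not every backslash problem raises an error. In this sketch (the path is made up for illustration), \t and \n are valid escapes, so Python silently turns them into a tab and a new line, whereas the raw string keeps every character:

```python
# \t becomes a tab and \n a new line: no error, but the path is wrong.
plain = "C:\temp\new.txt"

# The r prefix keeps the backslashes as they are typed.
raw = r"C:\temp\new.txt"

print(len(plain), len(raw))
print(raw)
```

Comparing the two lengths shows that the plain string lost two characters to the escapes.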

Arithmetic operations

The following arithmetic operators are available in Python:

number1 = 5
number2 = 3
  • addition:
addition = number1 + number2
print(f'{number1} + {number2} = {addition}')
5 + 3 = 8
  • subtraction:
subtraction = number1 - number2
print(f'{number1} - {number2} = {subtraction}')
5 - 3 = 2
  • multiplication:
multiplication = number1 * number2
print(f'{number1} * {number2} = {multiplication}')
5 * 3 = 15
  • division:
division = number1 / number2
print(f'{number1} / {number2} = {division}')
5 / 3 = 1.6666666666666667

Arithmetic operations

  • integer division (quotient of the division):
integer_division = number1 // number2
print(f'{number1} // {number2} = {integer_division}')
5 // 3 = 1
  • modulo (remainder from integer division):
modulo = number1 % number2
print(f'{number1} % {number2} = {modulo}')
5 % 3 = 2

  • power:
power = number1 ** number2
print(f'{number1} ** {number2} = {power}')
5 ** 3 = 125
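
Integer division and modulo are linked: multiplying the quotient by the divisor and adding the remainder gives back the original number. A short sketch with the same values as above (the built-in divmod() returns both results at once):

```python
number1 = 5
number2 = 3

quotient = number1 // number2
remainder = number1 % number2

# quotient * divisor + remainder rebuilds the original number.
rebuilt = quotient * number2 + remainder

# divmod() returns the quotient and the remainder together, as a tuple.
both = divmod(number1, number2)

print(quotient, remainder, rebuilt, both)
```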

A few caveats

  • A simple division will always return a float even if the result is an integer.
print(f"8/2 = {8/2} and its type is {type(8/2)}")
print(f"8//2 = {8//2} and its type is {type(8//2)}")
8/2 = 4.0 and its type is <class 'float'>
8//2 = 4 and its type is <class 'int'>
  • The spaces before and after the operator are optional but helpful for readability.

Variable operators

  • Operations can be performed directly on variables by putting the operator in front of the equal (=) symbol. The two following syntaxes are equivalent:
counter = 0
counter += 1
print(counter)
1
counter2 = 0
counter2 = counter2 + 1
print(counter2)
1
  • This can be done with all operators.
counter -= -2
print(counter)
3
counter2 = counter2 - (-2)
print(counter2)
3
counter *= 6
print(counter)
18
counter2 = counter2 * 6
print(counter2)
18
counter /= 4.5
print(counter)
4.0
counter2 = counter2 / 4.5
print(counter2)
4.0
counter //= 2
print(counter)
2.0
counter2 = counter2 // 2
print(counter2)
2.0
counter %= 2
print(counter)
0.0
counter2 = counter2 % 2
print(counter2)
0.0

Comparison operators

To compare values we can use the following operators:

  • > : strictly greater than
  • < : strictly less than
  • >= : greater than or equal to
  • <= : less than or equal to
  • == : equal to
  • != : not equal to

The result of a comparison is a boolean value.

3 < 9
True
1/3 < 1/4
False

Do not confuse '==' (test equality) and '=' (assign a value to a variable).

Comparison operators

  • They can be used to compare numbers (int or float) in numerical order, or strings in lexicographical order (character by character, based on Unicode code points, so uppercase ASCII letters sort before lowercase ones).
'HELLO' == 'hello'
False
'a' < 'b'
True
'a' < 'B'
False
'Ben' < 'Benjamin'
True
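
To see why 'a' < 'B' is False, the built-in ord() gives the numeric code of a character. The sketch below also shows one common way (an assumption about what you may want, not a rule) to compare without regard to case, by lowercasing both sides first:

```python
code_a = ord('a')
code_upper_b = ord('B')
print(code_a, code_upper_b)

# Uppercase letters have smaller codes than lowercase ones.
case_sensitive = 'a' < 'B'

# Lowercasing both strings gives an alphabetical comparison.
case_insensitive = 'a'.lower() < 'B'.lower()

print(case_sensitive, case_insensitive)
```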

Try to guess the answer :

745 >= 3.1415
True
"Sun" == "Moon"
False
"cat" = "dog"
  Cell In[51], line 1
    "cat" = "dog"
    ^
SyntaxError: cannot assign to literal here. Maybe you meant '==' instead of '='?

Methods

  • Methods are associated with a variable type.
  • They can create new objects that can be assigned to a variable, or modify existing objects.
  • Each type of variable has its own set of methods.
  • Syntax: variable.method(optional_parameters).

Methods

Here are some examples of useful methods for strings.

Consider the following string:

my_str = "       THIS IS a string  "
print(f'*{my_str}*')
*       THIS IS a string  *
  • Convert all characters to uppercase:
my_str_upper = my_str.upper()
print(f'*{my_str_upper}*')
*       THIS IS A STRING  *
  • Convert all characters to lowercase:
my_str_lower = my_str.lower()
print(f'*{my_str_lower}*')
*       this is a string  *
  • Remove extra spaces at the beginning and end:
my_str_strip = my_str.strip()
print(f'*{my_str_strip}*')
*THIS IS a string*
  • Replace part of the string with other characters:
my_str_replace = my_str.replace("string", "sentence")
print(f'*{my_str_replace}*')
*       THIS IS a sentence  *

There are methods for other types of variables, which we will cover in another chapter.

Methods

  • Applying a method to a string does not change the string itself; the result must be assigned to a variable (the same variable name can be reused).
  • Without reassignment:
my_str = "this is another string"
print(my_str)
my_str.upper()
print(my_str)
this is another string
this is another string
  • With reassignment:
print(my_str)
my_str_upper = my_str.upper()
print(my_str_upper)
this is another string
THIS IS ANOTHER STRING
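
Because each string method returns a new string, several methods can be chained on one line. A minimal sketch, with an illustrative label that is not part of the course files:

```python
raw_label = "  Patient_A  "

# strip() runs first, then lower() is applied to its result.
clean_label = raw_label.strip().lower()

print(f'*{raw_label}*')
print(f'*{clean_label}*')
```

The original raw_label is left untouched; only clean_label holds the transformed text.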

Tips before going further: Comments

Comments

  • Comments can be written in your script to help you describe a difficult part for instance.
  • Comments are not executed.
  • Comments are unnecessary and in fact distracting if they state the obvious.
    Only write relevant comments.

Comments

  • Inline comments are written after a hash sign (#).
    You can write code before the #, but everything after it on the same line is ignored.
# this is an inline comment
a = 5 # the code before the "#" will be executed normally
  • Block comments
# Block comments generally apply to some (or all) code that follows them,
# and are indented to the same level as that code.
# Each line of a block comment starts with a # and a single space 
  • Documentation strings are written between triple quotes:
"""
Documentation strings (a.k.a. “docstrings”) are used to
write a description for all public modules, functions, classes, and methods.
This is often used to write a function description.
"""

If written on several lines, the triple quotes should be on lines by themselves; for one-line descriptions, they stay on the same line as the text.

""" This is a one-liner docstring. """

Docstrings are usually used when writing a function.
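
Functions are covered later, but as a preview, here is a sketch of a docstring in its usual place. The function name and its content are hypothetical. The docstring is stored in the __doc__ attribute of the function:

```python
def celsius_to_kelvin(temperature_celsius):
    """Return the temperature in kelvin for a value in degrees Celsius."""
    # Adding the offset between the two scales.
    return temperature_celsius + 273.15


description = celsius_to_kelvin.__doc__
print(description)
print(celsius_to_kelvin(0))
```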

Summary of the variables section

  • The most common object types are integer, float, string, boolean, function, …
  • To assign a value to a variable, use the equal sign (=).
  • To display a variable or some text, use the print() function.
  • Do not mix single quotes (') and double quotes (") around the same string, and do not forget the f prefix for f-strings!
  • Mathematical operations can be performed on variables with an operation sign (+, -, *, /, //, %, **):

    Examples: my_variable *= 5 or my_variable = my_variable * 5.

  • You can compare values with a comparison operator (>, <, >=, <=, ==, !=).
  • Do not confuse = (variable assignment) and == (test equality)!
  • To comment lines, you can use # before your comment or add """ around it.

Let’s practise

Please open file 001_practical_variables.py

Lists

Lists presentation

  • You can use lists to store multiple values, in a defined order, in the same variable.
  • An empty list can be initialised with [] or list().
  • A list can also be initialised with values:
numbers = [1, 3, 5, 7, 9]
print(numbers)
[1, 3, 5, 7, 9]
  • It is possible to create a list from a string. In this case, each element of the list will contain a single character.
a_string = 'I have two cute cats.'
a_list_from_a_string = list(a_string)
print(a_list_from_a_string)
['I', ' ', 'h', 'a', 'v', 'e', ' ', 't', 'w', 'o', ' ', 'c', 'u', 't', 'e', ' ', 'c', 'a', 't', 's', '.']
  • You can store different types of data in the same list.
my_list = ["Mr_Pi", 3.1415, 5, True]

Lists - Indexing

  • Each item of a list can be accessed by giving its index, from 0 to n-1, with n the number of items in the list.

  • The number of items in the list is given by len(numbers).
n = len(numbers)

print(f'numbers = {numbers}')
print(f'There are {n} elements in the numbers list.')

print(f'First item is: {numbers[0]}')
print(f'Second item is: {numbers[1]}')
print(f'Last item is: {numbers[n-1]}')
numbers = [1, 3, 5, 7, 9]
There are 5 elements in the numbers list.
First item is: 1
Second item is: 3
Last item is: 9

Lists - Indexing

  • Each item of a list can also be accessed in reverse order, from -1 (last item) to -n (first item).

print(f'Another way to get last item is: {numbers[-1]}')
print(f'Second to last item is: {numbers[-2]}')
print(f'Last item is: {numbers[len(numbers)-1]}')
print(f'Another way to get first item is: {numbers[-len(numbers)]}')
Another way to get last item is: 9
Second to last item is: 7
Last item is: 9
Another way to get first item is: 1
  • If you give an index that does not exist, it will raise an error.
numbers[6]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[66], line 1
----> 1 numbers[6]

IndexError: list index out of range

Brainstorming time

  • Please consider the following list:
amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
  • What does the following element contain? amino_acids[1]

‘Ala’

‘Arg’

‘Gln’

print(amino_acids[0])
Ala
print(amino_acids[1])
Arg
print(amino_acids[-1])
Gln

Remember that list numbering starts at zero and that the index “-1” allows you to access the last item in the list.

  • How can you access the following element? 'Glu'

amino_acids[6]

amino_acids[5]

amino_acids[-2]

print(amino_acids[6])
Gln
print(amino_acids[5])
Glu
print(amino_acids[-2])
Glu

Converting strings to lists (and vice versa)

  • In the previous chapter, we discovered a few methods for strings.
  • There are string methods that work with lists.
  • .join() turns a list of strings into a single string:
my_list = ["I", "love", "Python", "!"]
str_spaces = " ".join(my_list)
print(str_spaces)
I love Python !
my_list = ["I", "love", "Python", "!"]
str_underscores = "_".join(my_list)
print(str_underscores)
I_love_Python_!

You can use any separator with the .join() method.
It just needs to be a string.

  • .split() turns a string into a list:
split_spaces_list = str_spaces.split()
print(split_spaces_list)
['I', 'love', 'Python', '!']
split_underscores_list = str_underscores.split(sep = "_")
print(split_underscores_list)
['I', 'love', 'Python', '!']

If no separator is given in .split(), the string is split on any run of whitespace: spaces ( ), new lines (\n), carriage returns (\r), tabs (\t) or form feeds (\f).
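
The difference between the default behaviour and an explicit separator shows up with repeated spaces. In this sketch (the sample text is made up), the default call merges consecutive whitespace, while an explicit " " separator splits on every single space and ignores tabs:

```python
# Two spaces between "I" and "love", then a tab before "Python".
text = "I  love\tPython"

# No separator: any run of whitespace counts as one split point.
default_split = text.split()

# Explicit separator: each space splits, so an empty string appears.
space_split = text.split(" ")

print(default_split)
print(space_split)
```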

Some operations on lists

  • Modify an item of the list using its index:
numbers = [1, 3, 7, 7, 9]
numbers[2] = 5
print(numbers)
[1, 3, 5, 7, 9]
  • Add a new item to the end of a list:
numbers.append(11)
print(numbers)
[1, 3, 5, 7, 9, 11]
  • Add a new item at a specific position:
numbers.insert(2, 7)
print(numbers)
[1, 3, 7, 5, 7, 9, 11]
  • Remove an item of a list and return it: removed_item = numbers.pop()
    If no index is given, the removed item is the last one.
    You can also provide the index of the item to be removed.
removed_item = numbers.pop(1)
print(removed_item)
print(numbers)
3
[1, 7, 5, 7, 9, 11]

After using pop(), the list items are renumbered.

numbers[1]
7

Some operations on lists

  • Remove an item of a list by using its value (not the index). Only the first item encountered will be removed; if the value exists several times in the list, the process has to be repeated.
print(numbers)
numbers.remove(7)
print(numbers)
[1, 7, 5, 7, 9, 11]
[1, 5, 7, 9, 11]
  • Reverse a list:
numbers.reverse()
print(numbers)
[11, 9, 7, 5, 1]
  • Copy a list:
odds = numbers.copy()
print(odds)
[11, 9, 7, 5, 1]

Lists are mutable objects which means you can modify them directly.

Brainstorming time

Consider the following list:

numbers = [1, 2, 3, 4, 5]

We would like to create the same list called values:

values = numbers

Then we need to remove the second element from the numbers list:

numbers.pop(1)
2
print(numbers)
[1, 3, 4, 5]

Now let’s check the content of values. What do you expect to get?

print(values)
[1, 3, 4, 5]

Here values just refers to the same list as numbers, so the elements are shared.

The method copy is required when a copy has to be made.
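
One way to check whether two names share the same list is the is operator, which tests identity rather than equality. A minimal sketch, where alias and independent are illustrative names:

```python
numbers = [1, 2, 3, 4, 5]

alias = numbers               # second name for the same list
independent = numbers.copy()  # new list with the same content

same_object = alias is numbers
copied_object = independent is numbers

# Changing numbers is visible through alias, not through the copy.
numbers.append(6)

print(same_object, copied_object)
print(alias)
print(independent)
```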

Brainstorming time

Let’s try again.

numbers = [1, 2, 3, 4, 5]
values = numbers.copy()
print(values)
[1, 2, 3, 4, 5]

This time we would like to remove the second-to-last element from the values list.
Which command(s) will work ? :

values.pop(-2)

values.pop(3)

values.remove(3)

values.pop(-2)
print(values)
[1, 2, 3, 5]
values = numbers.copy()
values.pop(3)
print(values)
[1, 2, 3, 5]
values = numbers.copy()
values.remove(3)
print(values)
[1, 2, 4, 5]

Now let’s check the content of numbers.

print(numbers)
[1, 2, 3, 4, 5]

The numbers list has not been affected by the changes made to the values list.

Nested lists

  • A list can contain any Python variable so it can also contain other lists.
    A single list may contain numbers, strings, and anything else.
numbers = [1, 3, 5, 7, 9, [11, 13, 15, 17, 19]]

numbers is a nested list.
numbers[5] is a simple list.
numbers[5][0] is an integer.

  • A matrix can be stored in a nested list.
matrix = [[1, 2, 3], 
          [4, 5, 6], 
          [7, 8, 9]]

matrix is a nested list.
matrix[0], matrix[1] and matrix[2] are simple lists.
matrix[0][0] is an integer.

  • Another way to assign a nested list to a variable is to write it on a single line.
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

Brainstorming time

Consider the following nested list:

animals = [
  ["eagle", "pigeon", "owl", "seagull"],
  ["shark", "whale", "seahorse", "clownfish"],
  ["rabbit", "giraffe", "cat", "sheep"]
]

What command would you write to get :

  • the eagle ?
animals[0][0]
'eagle'
  • the clownfish ?
animals[1][3]
'clownfish'
  • the giraffe ?
animals[2][1]
'giraffe'

Which animal will you get if you type :

  • animals[2][2] ?
animals[2][2]
'cat'
  • animals[0][1] ?
animals[0][1]
'pigeon'
  • animals[1][0] ?
animals[1][0]
'shark'

Slicing

  • You can get a subset of a list by specifying ranges of values with a colon (:) in brackets.
  • Syntax: my_list[start:end:step]: will slice my_list from start to end (excluded) with a step of step (default value 1 if not provided).
  • Some examples with the following list:
amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

Returns the first two elements of the list.

amino_acids[0:2]
['Ala', 'Arg']

Returns every other element, from second element to fourth element (excluded).

amino_acids[1:4:2]
['Arg', 'Asn']

Returns every other element from the complete list, starting with the first element.

amino_acids[::2]
['Ala', 'Asp', 'Cys', 'Gln']

Slicing

print(amino_acids)
['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

Returns the complete list except for the first element.

amino_acids[1:]
['Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

Returns the complete list except for the last element.

amino_acids[:-1]
['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu']

Returns the complete list in reverse order.

amino_acids[::-1]
['Gln', 'Glu', 'Cys', 'Asn', 'Asp', 'Arg', 'Ala']

The slicing [::-1] returns a new list in reverse order and leaves the original unchanged, while the method .reverse() reverses the list in place.

print(amino_acids)
print(amino_acids[::-1])
print(amino_acids)
['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
['Gln', 'Glu', 'Cys', 'Asn', 'Asp', 'Arg', 'Ala']
['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
print(amino_acids)
amino_acids.reverse()
print(amino_acids)
['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
['Gln', 'Glu', 'Cys', 'Asn', 'Asp', 'Arg', 'Ala']
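
Negative indexes also work in slices. The sketch below, using the same list, takes the first and last three items and shows that a slice going past the end does not raise an error, unlike a single out-of-range index:

```python
amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

first_three = amino_acids[:3]
last_three = amino_acids[-3:]

# The end of the slice is beyond the list: the slice simply stops early.
past_the_end = amino_acids[5:100]

print(first_three)
print(last_three)
print(past_the_end)
```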

Delete a list

  • If you don’t need a list anymore, you can delete it with the del keyword:
amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
del(amino_acids)
print(amino_acids)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[118], line 3
      1 amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
      2 del(amino_acids)
----> 3 print(amino_acids)

NameError: name 'amino_acids' is not defined
  • It is also possible to delete part of the list using slicing:
amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
del(amino_acids[:-3])
print(amino_acids)
['Cys', 'Glu', 'Gln']

Tuples

  • Tuples are similar to lists but they cannot be modified. They are immutable objects.
  • An empty tuple can be initialised with () or tuple().
  • A tuple can also be initialised with values:
values = (2.0, 7.5, 8.4, 3.1)
print(f"values = {values} and its type is {type(values)}")
values = (2.0, 7.5, 8.4, 3.1) and its type is <class 'tuple'>
  • If you try to modify a tuple, Python won’t let you.
colours = ['red', 'orange', 'yello']
colours[2] = "yellow"
print(colours)
['red', 'orange', 'yellow']
colours = ('red', 'orange', 'yello')
colours[2] = "yellow"
print(colours)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[122], line 2
      1 colours = ('red', 'orange', 'yello')
----> 2 colours[2] = "yellow"
      3 print(colours)

TypeError: 'tuple' object does not support item assignment

Make sure to use [] or list(), not (), when you need a list that can be modified.
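
If the data already sits in a tuple, one possible approach (a sketch, not the only option) is to convert it to a list, make the change, and convert it back:

```python
colours = ('red', 'orange', 'yello')

# A list built from the tuple can be modified.
colours_list = list(colours)
colours_list[2] = 'yellow'

# tuple() builds a new tuple from the corrected list.
colours = tuple(colours_list)

print(colours)
```

The original tuple is not edited; the name colours is reassigned to a new tuple.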

Summary of the lists section

  • A list is a variable that can store multiple values in a defined order.
  • To initialise an empty list, you can use [] or list().
  • List indexing starts at 0 from the left and starts at -1 from the right.

  • To access (or update) the element at position i, use my_list[i] (or my_list[i] = elt).
  • To add an element :
    • at the end : my_list.append(elt)
    • at position i : my_list.insert(i, elt)
  • To remove an element :
    • at the end : removed = my_list.pop()
    • at position i : removed = my_list.pop(i)
    • by its value : my_list.remove(elt)
  • To get a subset of your list, you can use my_list[start:end:step].
  • To delete your list, use del(my_list).

Let’s practise

Please open file 002_practical_lists.py

Dictionaries

Dictionaries presentation

  • Dictionaries are used to store data in the form of key:value pairs. Items are accessed by key, not by position (since Python 3.7, dictionaries keep insertion order).
  • Each key is unique. If a key is reused, its contents will be overwritten.
  • An empty dictionary can be initialised with {} or dict().
  • A dictionary can also be initialised directly with data:
animal_sounds = {'cat': 'meow', 'dog':'woof', 'cow':'moo'}

Accessing dictionaries

Each item of a dictionary can be accessed by giving its key:

  • with the key in brackets:
print(f"Cat says {animal_sounds['cat']}.")
print(f"Fox says {animal_sounds['fox']}.")
Cat says meow.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[124], line 2
      1 print(f"Cat says {animal_sounds['cat']}.")
----> 2 print(f"Fox says {animal_sounds['fox']}.")

KeyError: 'fox'

If you give a key that is not present in the dictionary it will raise an error.

  • with get() you can provide a default value in case the key is not in the dictionary:
    This method only allows you to access an item; it does not allow you to modify it.
    Syntax: my_dict.get(key, default_value)
animal_sounds.get('fox', 'This sound is not registered.')
'This sound is not registered.'

Dictionaries are not indexed as lists are.
my_dict[1] will raise an error unless there is a key called 1.
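
To avoid a KeyError, you can also test whether a key exists with the in operator. This sketch additionally shows that get() returns None when no default value is given:

```python
animal_sounds = {'cat': 'meow', 'dog': 'woof', 'cow': 'moo'}

# The in operator checks the keys, not the values.
has_cat = 'cat' in animal_sounds
has_fox = 'fox' in animal_sounds

# Without a default value, get() returns None for a missing key.
missing = animal_sounds.get('fox')

print(has_cat, has_fox, missing)
```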

Accessing dictionaries

  • Dict keys must be immutable objects like strings, numbers or tuples.
    You can get all the keys with .keys().
animal_sounds.keys()
dict_keys(['cat', 'dog', 'cow'])
  • Dict values can contain items of different types, including other dictionaries.
    You can get all the values with .values().
animal_sounds.values()
dict_values(['meow', 'woof', 'moo'])
  • You can get all the key:value pairs as tuples using .items():
animal_sounds.items()
dict_items([('cat', 'meow'), ('dog', 'woof'), ('cow', 'moo')])
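
The objects returned by .keys(), .values() and .items() are views, not lists. If you need indexing, a simple option is to wrap them in list(), as in this sketch:

```python
animal_sounds = {'cat': 'meow', 'dog': 'woof', 'cow': 'moo'}

animals = list(animal_sounds.keys())
sounds = list(animal_sounds.values())
pairs = list(animal_sounds.items())

# Each pair is a tuple that can be unpacked into two variables.
first_animal, first_sound = pairs[0]

print(animals)
print(sounds)
print(first_animal, first_sound)
```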

Some operations on dictionaries

  • Get the number of key:value pairs:
len(animal_sounds)
3
  • Add an element or update an existing one:
animal_sounds['lion'] = 'roar'
print(animal_sounds)
{'cat': 'meow', 'dog': 'woof', 'cow': 'moo', 'lion': 'roar'}
animal_sounds.update({'rooster':'cock-a-doodle-doo'})
print(animal_sounds)
{'cat': 'meow', 'dog': 'woof', 'cow': 'moo', 'lion': 'roar', 'rooster': 'cock-a-doodle-doo'}

Some operations on dictionaries

The pop method can be used to delete a key:value pair and store the value in a variable.

was_removed = animal_sounds.pop('dog')
print(f"The removed value is {was_removed}.")
print(f"The dictionary contains {animal_sounds}.")
The removed value is woof.
The dictionary contains {'cat': 'meow', 'cow': 'moo', 'lion': 'roar', 'rooster': 'cock-a-doodle-doo'}.

If we want to remove a key:value pair from a dictionary, we can use the del keyword:

  • for a single key:
del(animal_sounds['cow'])
animal_sounds
{'cat': 'meow', 'lion': 'roar', 'rooster': 'cock-a-doodle-doo'}
  • for the whole dictionary:
del(animal_sounds)
animal_sounds
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[134], line 2
      1 del(animal_sounds)
----> 2 animal_sounds

NameError: name 'animal_sounds' is not defined
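
Like get(), the pop() method of dictionaries accepts an optional default value, which avoids a KeyError when the key is missing. A minimal sketch with a shortened dictionary:

```python
animal_sounds = {'cat': 'meow', 'cow': 'moo'}

# 'fox' is not a key: the default value is returned, nothing is removed.
removed_fox = animal_sounds.pop('fox', None)

# 'cat' is a key: its value is returned and the pair is removed.
removed_cat = animal_sounds.pop('cat', None)

print(removed_fox, removed_cat)
print(animal_sounds)
```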

Brainstorming time

In this example, we want to create a dictionary named fruits_shop, with fruits as keys and numbers as values. These numbers represent the quantity of each fruit in the shop.

We received 10 apples, 5 pears and 1 banana.

How would you implement it ?

fruits_shop = {}
fruits_shop["apple"] = 10
fruits_shop["pear"] = 5
fruits_shop["banana"] = 1

print(fruits_shop)
{'apple': 10, 'pear': 5, 'banana': 1}

With this syntax, we must first initialise the dictionary and then add each element.

fruits_shop = {
  "apple":10,
  "pear": 5,
  "banana": 1
}
print(fruits_shop)
{'apple': 10, 'pear': 5, 'banana': 1}

With this syntax, the dictionary is initialised and populated at the same time.

Nice! But in the meantime, we received 45 more bananas and 10 grapes… and then someone ate an apple (oops).

fruits_shop["banana"] += 45
fruits_shop["grape"] = 10
fruits_shop["apple"] -= 1
print(fruits_shop)
{'apple': 9, 'pear': 5, 'banana': 46, 'grape': 10}
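# Equivalent syntax without += and -= (an alternative to the block above, starting from the same dictionary):
fruits_shop["banana"] = fruits_shop["banana"] + 45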
fruits_shop["grape"] = 10
fruits_shop["apple"] = fruits_shop["apple"] - 1
print(fruits_shop)
{'apple': 9, 'pear': 5, 'banana': 46, 'grape': 10}

Brainstorming time

Pears are now prohibited worldwide, but we get 2 apples in exchange for each pear.

pears = fruits_shop.pop("pear")
fruits_shop["apple"] += 2 * pears
print(fruits_shop)
{'apple': 19, 'banana': 46, 'grape': 10}
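# Equivalent syntax without += (an alternative to the block above, starting from the same dictionary):
pears = fruits_shop.pop("pear")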
fruits_shop["apple"] = fruits_shop["apple"] + 2 * pears
print(fruits_shop)
{'apple': 19, 'banana': 46, 'grape': 10}

Unfortunately, we should remove the fruits_shop, as it has become useless and we need the space for something else. How would you proceed?

del(fruits_shop)
print(fruits_shop)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[143], line 2
      1 del(fruits_shop)
----> 2 print(fruits_shop)

NameError: name 'fruits_shop' is not defined

Summary of the dictionaries section

  • A dictionary is a variable that stores data as key:value pairs. Values are retrieved by key rather than by position (since Python 3.7, the insertion order of the pairs is preserved).
  • An empty dictionary can be initialised with {} or dict().
  • To access the value corresponding to the key k, you can use :
    • my_dict[k]
    • my_dict.get(k, default_value)
  • Dictionaries are not indexed as lists are.
  • To add or update an element, use my_dict[k] = new_value.
  • To delete a key:value pair, use removed_value = my_dict.pop(k) (which also returns the value) or del(my_dict[k]).
  • To delete the dictionary, use del(my_dict).
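
The points of this summary can be sketched in a few lines. This is only an illustration; the dictionary content and the variable names are invented.

```python
my_dict = {"apple": 10, "pear": 5}

# Square brackets raise a KeyError for a missing key; get() returns a default.
print(my_dict["apple"])
print(my_dict.get("mango", 0))

# pop() removes the pair and gives the value back; del only removes the pair.
removed_value = my_dict.pop("pear")
del my_dict["apple"]

print(removed_value)
print(my_dict)
```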

Let’s practise

Please open file 003_practical_dictionaries.py

Conditional statements

Conditional statements presentation

  • An if / elif / else statement determines which part of the code is executed, according to one or several conditions.
  • if, elif and else lines end with a colon (:).
    The blocks of code to be executed are indented.

Do not mix spaces and tabs.
Python best practices recommend using 4 spaces.

  • elif and else are optional. If they are not provided, nothing will be executed if the if statement is not true.
  • You can write as many elif statements as needed.
    elif is short for else if.
  • As soon as one condition is true, its block is executed and the remaining conditions are not tested.

Examples

  • Example 1:
limit = 50
if current_speed > limit + 30:
    print('Slow down! You are going to kill someone!')
elif current_speed > limit:
    print('Slow down! You are going to get a fine!')
else:
    print('You are not exceeding the speed limit.')

What should this code print with these values:

- `current_speed = 60` ?
Slow down! You are going to get a fine!
- `current_speed = 160` ?
Slow down! You are going to kill someone!
- `current_speed = 30` ?
You are not exceeding the speed limit.

Examples

  • Example 2:
    Note: Instructions are executed in the order in which they are written.
  • What difference(s) do you see between these two examples ?
  • What change(s) should we expect with this code ?
limit = 50
current_speed = 100
if current_speed > limit + 30:
    print('Slow down! You are going to kill someone!')

elif current_speed > limit:
    print('Slow down! You are going to get a fine!')

else:
    print('You are not exceeding the speed limit.')
limit = 50
current_speed = 100
if current_speed > limit:
    print('Slow down! You are going to get a fine!')

elif current_speed > limit + 30:
    print('Slow down! You are going to kill someone!')
    
else:
    print('You are not exceeding the speed limit.')
Slow down! You are going to kill someone!
Slow down! You are going to get a fine!

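In the second example, we will never enter the current_speed > limit + 30 block: any speed above limit + 30 is also above limit, so the first condition is true and the elif is not tested.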

Logical operators

  • We can combine expressions using and or or.
  • if A and B will be executed only if the 2 expressions are true.
admission = None
if age >= 18 and age < 65:
    admission = "full_price"


Note: there is a simpler syntax for checking whether a number is within a range.

if 18 <= age < 65:
    admission = "full_price"
  • if A or B will be executed if at least one of the 2 expressions is true.
if age < 18 or age >= 65:
    admission = "reduced_price"

You can notice that we initialised the variable admission before the conditional statement. This is a good practice, because if all conditions fail and you try to use an uninitialised variable, an error will occur and stop the execution of your script.
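
As a minimal sketch of this good practice, the code below initialises admission to None and checks it after the conditions. The age value and the messages are invented for the illustration.

```python
age = 10  # hypothetical visitor

admission = None
if 18 <= age < 65:
    admission = "full_price"
elif age >= 65:
    admission = "reduced_price"

# admission always exists here, even though no condition was true for age = 10.
if admission is None:
    print("No price could be determined.")
else:
    print(f"Admission: {admission}")
```

Without the initialisation line, the last if would stop the script with a NameError for this visitor.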

More complex conditions

There is no limit to the number of conditions, but it may be useful to use parentheses to indicate priorities.

Example 1:

age = 16
nb_available_seats = 0
if (age < 18 or age >= 65) and nb_available_seats > 0:
    print("You may enter at a reduced rate.")

Example 2:

age = 16
nb_available_seats = 0
if age < 18 or age >= 65 and nb_available_seats > 0:
    print("You may enter at a reduced rate.")
You may enter at a reduced rate.

The logical operator and has higher precedence than the logical operator or.
This means that when both and and or operators appear in the same expression, and is evaluated first.
If you are not sure of the priority, use parentheses!
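
To see the precedence rule at work, the sketch below stores the result of each condition in a variable (the variable names are chosen for this illustration only).

```python
age = 16
nb_available_seats = 0

# Without parentheses, Python evaluates "and" before "or".
without_parentheses = age < 18 or age >= 65 and nb_available_seats > 0
and_first = age < 18 or (age >= 65 and nb_available_seats > 0)
or_first = (age < 18 or age >= 65) and nb_available_seats > 0

print(without_parentheses, and_first, or_first)
```

The first two results are identical, while the third one differs: this is why Example 2 prints the message although there are no seats left.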

Nested conditions

You can nest multiple conditions.

age = 16
nb_available_seats = 5

if nb_available_seats > 0:
    if age < 18 or age >= 65:
        admission = "reduced"
    else:
        admission = "full"
    print(f"You may enter with a {admission} price.")
else:
    print("There are no more seats available.")
You may enter with a reduced price.


Please mind the indentation!

Brainstorming time

Before leaving home, you should take an accessory depending on the weather.
Consider the following code:

temperature = 20
rain = False

if rain == True:
    print("Take an umbrella.")
else:
    if temperature >= 25:
        print("Wear a hat and sunglasses.")
    elif temperature >= 15:
        print("Wear sunglasses.")
    elif temperature >= 0:
        print("Wear a scarf.")
    else:
        print("Wear a scarf and gloves.")
  • Question 1: What does this code print?

This code prints: Wear sunglasses.

  • Question 2: Give an example of variables to obtain the message: Wear a scarf.

We must have rain == False and a temperature of at least 0°C and strictly below 15°C.

  • Question 3: When should you wear gloves?

When the temperature is strictly below 0°C.

  • Question 4: If it rains and the temperature is 0°C, should you take an umbrella or a scarf?

You should take an umbrella.

  • Question 5: Tomorrow, it is supposed to be 28°C and sunny.
    Which accessory or accessories will you take or wear?

Tomorrow, I will wear a hat and sunglasses.

Inverting conditions

Sometimes it is easier to check whether a condition is not true.
We can do this with the operator not.

if "banana" not in ["apple", "pear", "hazelnut"]:
    print("Banana not found in list.")
Banana not found in list.

This is equivalent to the following syntax:

if "banana" != "apple" and "banana" != "pear" and "banana" != "hazelnut":
    print("Banana not found in list.")
Banana not found in list.

Summary of the conditionals section

  • An if / elif / else statement determines which part of the code is executed, according to one or several conditions.
  • elif is short for else if.
  • if, elif and else lines end with a colon (:).
    The blocks of code to be executed are indented.
  • elif and else are optional.
  • It is possible to combine expressions with and, or and add parentheses () to indicate priorities.
  • You can use in and not in keywords to check if an element is in a list.

Loops

Loops presentation

  • Loops are used to repeat the execution of a part of the program several times.
  • There are two ways to use loops in Python:
    • for loops are generally used when we know how many times to repeat the action.
    • while loops are generally preferred when we don’t know the number of repetitions in advance.

For

  • The for loop performs an action for each element of a collection such as a list, a dictionary or a string.
  • The line with for instruction must end with a colon (:) and the code that will run inside the for loop must be indented.
  • General syntax:
for element in collection:
    # Perform some action(s) on element.  
    # These actions can spread on several lines
    # which must all be indented.  

For: examples on lists

  • Example 1:
odds = [1, 3, 5, 7]
for element in odds:
    print(f"element contains {element}.")
element contains 1.
element contains 3.
element contains 5.
element contains 7.
  • Example 2:
odds = [1, 3, 5, 7]
numbers_power2 = list()
for i in odds:
    numbers_power2.append(i**2)
    print(f"i contains {i} and numbers_power2 contains {numbers_power2}.")
i contains 1 and numbers_power2 contains [1].
i contains 3 and numbers_power2 contains [1, 9].
i contains 5 and numbers_power2 contains [1, 9, 25].
i contains 7 and numbers_power2 contains [1, 9, 25, 49].
  • Example 3:
odds = [1, 3, 5, 7]
for index, element in enumerate(odds):
    print(f"The {index}-th item in list contains {element}.")
The 0-th item in list contains 1.
The 1-th item in list contains 3.
The 2-th item in list contains 5.
The 3-th item in list contains 7.

The enumerate function is useful for iterating through a list and finding out the position of each element in the list.
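
By default enumerate starts counting at 0, like list indexes. As a small sketch, its optional start argument can be used when a count starting at 1 is easier to read. The positions list is only there to keep track of the counter.

```python
odds = [1, 3, 5, 7]
positions = []

# start=1 changes the counter only, not the elements that are visited.
for position, element in enumerate(odds, start=1):
    positions.append(position)
    print(f"Item number {position} contains {element}.")
```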

For: examples on dictionaries

fruits_shop = {"apple":10, "pear": 5, "banana": 1}
  • Iterate over the keys:
for key in fruits_shop.keys():
    print(f'Key {key} is associated with value {fruits_shop[key]}.')
Key apple is associated with value 10.
Key pear is associated with value 5.
Key banana is associated with value 1.
  • Iterate over the values:
for value in fruits_shop.values():
    print(value)
10
5
1
  • Iterate over both keys and values:
for key, value in fruits_shop.items():
    print(f'Key {key} is associated with value {value}.')
Key apple is associated with value 10.
Key pear is associated with value 5.
Key banana is associated with value 1.

While

  • The while loop repeats an action as long as an expression is true.
  • The line with while instruction must end with a colon (:) and the code that will run inside the while loop must be indented.

WARNING! If the expression evaluated by the while loop is never modified, you might end up with an infinite loop!

  • Example 1:
odds = [1, 3, 5, 7, 9]
i = 0
while i < len(odds):
    print(f"The list item with index {i} is {odds[i]}.")
    i += 1
The list item with index 0 is 1.
The list item with index 1 is 3.
The list item with index 2 is 5.
The list item with index 3 is 7.
The list item with index 4 is 9.
  • Example 2:
odds = [1, 3, 5, 7, 9]
numbers_power_2 = list()
i = 0
while i < len(odds):
    odd_number2 = odds[i]**2
    print(f"The list item with index {i} is {odds[i]}.")
    numbers_power_2.append(odd_number2)
    i += 1
print(numbers_power_2)
The list item with index 0 is 1.
The list item with index 1 is 3.
The list item with index 2 is 5.
The list item with index 3 is 7.
The list item with index 4 is 9.
[1, 9, 25, 49, 81]
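
The warning about infinite loops can be illustrated with a short countdown. This is only a sketch: the max_iterations safety limit is a hypothetical precaution, not something Python requires.

```python
countdown = 3
nb_iterations = 0
max_iterations = 100  # hypothetical safety limit

while countdown > 0 and nb_iterations < max_iterations:
    print(f"countdown = {countdown}")
    countdown -= 1      # without this line, countdown > 0 would stay true forever
    nb_iterations += 1

print("Lift-off!")
```

If the line countdown -= 1 were forgotten, the safety limit would still stop the loop after 100 iterations.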

Break loops

  • Sometimes you may need to end a loop prematurely.
  • With the break statement we can stop the loop even if the while condition is still true or if we are not done with the for iteration.
numbers = [2, 4, 6, 7, 8]
even_numbers = list()
for i in numbers:
    if i % 2 == 1:
        print(f"An odd number has been found ({i})")
        break
    else:
        even_numbers.append(i)
print("The consecutive even numbers are", even_numbers)
An odd number has been found (7)
The consecutive even numbers are [2, 4, 6]
numbers = [2, 4, 6, 7, 8]
even_numbers = list()
i = 0
while i < len(numbers):
    if numbers[i] % 2 == 1:
        print(f"An odd number has been found ({numbers[i]})")
        break
    else:
        even_numbers.append(numbers[i])
    i += 1
print("The consecutive even numbers are", even_numbers)
An odd number has been found (7)
The consecutive even numbers are [2, 4, 6]

Continue loops

  • With the continue statement we can go directly to the next iteration without executing the code in the loop for the current iteration.
for number in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    print(f"number: {number}")
    if number != 5:
        continue
    print(f"Number {number} has been found!")
number: 0
number: 1
number: 2
number: 3
number: 4
number: 5
Number 5 has been found!
number: 6
number: 7
number: 8
number: 9

Combination

  • As you could see in previous slides, it is possible to combine multiple loops and conditions within the same block of code.

In this case, you should pay attention to the code indentation. If you get it wrong, the code may still run, but it will not produce the expected result.

Combination: example

Let’s generate all possible pairs of fruits among orange, mango, and lemon.

fruits = ["orange", "mango", "lemon"]
comb1 = list()

for my_first_fruit in fruits:
    print(f'Here, {my_first_fruit} is the first fruit.')
    for my_second_fruit in fruits:
        print(f'- {my_first_fruit} and {my_second_fruit}')
        comb1.append([my_first_fruit, my_second_fruit])
fruits = ["orange","mango","lemon"]
comb2 = list()

for my_first_fruit in fruits:
    print(f'Here, {my_first_fruit} is the first fruit.')
    for my_second_fruit in fruits:
        print(f'- {my_first_fruit} and {my_second_fruit}')
    comb2.append([my_first_fruit,my_second_fruit])
Here, orange is the first fruit.
- orange and orange
- orange and mango
- orange and lemon
Here, mango is the first fruit.
- mango and orange
- mango and mango
- mango and lemon
Here, lemon is the first fruit.
- lemon and orange
- lemon and mango
- lemon and lemon
Here, orange is the first fruit.
- orange and orange
- orange and mango
- orange and lemon
Here, mango is the first fruit.
- mango and orange
- mango and mango
- mango and lemon
Here, lemon is the first fruit.
- lemon and orange
- lemon and mango
- lemon and lemon
print(comb1)
[['orange', 'orange'], ['orange', 'mango'], ['orange', 'lemon'], ['mango', 'orange'], ['mango', 'mango'], ['mango', 'lemon'], ['lemon', 'orange'], ['lemon', 'mango'], ['lemon', 'lemon']]
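print(comb2)
[['orange', 'lemon'], ['mango', 'lemon'], ['lemon', 'lemon']]

In the second version, the append line is indented at the level of the outer loop. It therefore runs only once per first fruit, after the inner loop has finished, when my_second_fruit still contains the last fruit of the list (lemon).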
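
For information, the standard library module itertools can build such pairs without writing the nested loops by hand (imports are presented in the Packages section). This is a sketch of an alternative, not the method used in this slide.

```python
from itertools import combinations, product

fruits = ["orange", "mango", "lemon"]

# product(..., repeat=2) gives the same 9 ordered pairs as the nested loops (comb1).
all_pairs = [list(pair) for pair in product(fruits, repeat=2)]

# combinations(..., 2) keeps each pair of two different fruits only once.
unique_pairs = [list(pair) for pair in combinations(fruits, 2)]

print(len(all_pairs))
print(unique_pairs)
```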

Brainstorming time

We have a list of integers from 0 to 12.

We want to classify them in a dictionary with keys odd and even. Each key in the dictionary has a list of numbers as its value.

How would you do that?

First, we initialise our list and dictionary:

my_int_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
my_number_dict = {"odd" : list(), "even" : list()}

Then we will iterate over my_int_list. For each element we will test if it is even or odd, and add the element to the list of the appropriate key.

for i in my_int_list:
    if i % 2 == 0:
        my_number_dict["even"].append(i)
    else:
        my_number_dict["odd"].append(i)
print(my_number_dict)
{'odd': [1, 3, 5, 7, 9, 11], 'even': [0, 2, 4, 6, 8, 10, 12]}

Summary of the loops section

  • Loops are used to repeat the execution of a part of the program several times.
    • for loops: the number of repetitions is known in advance.
    • while loops: the number of repetitions is not known in advance.
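  • Syntax of a for loop: the keyword for, a loop variable, the keyword in, a collection (list, dictionary, string) and a colon (:).
for elt in my_dict:
  # action
  • Syntax of a while loop: the keyword while, a condition and a colon (:).
while condition:
  # action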
  • The code inside a loop must be indented.
  • If the expression evaluated by a while loop is not modified, you will get an infinite loop.
  • With the break statement, the loop will stop prematurely.
  • With the continue statement, the loop will go to the next iteration prematurely.
  • It is possible to add a loop inside another loop and to add conditional statements inside a loop.

Let’s practise

Please open file 004_practical_conditionals_loops.py

Jupyter notebook

Introduction

  • Jupyter notebooks are interactive programming environments that allow you to combine text, images, mathematical formulas, tables, graphs and executable computer code in a single document. They can be manipulated in a web browser.

  • Jupyter notebooks support nearly 40 different languages, including Python.

  • The cell is the basic element of a Jupyter notebook. It can contain formatted text or computer code that can be executed.

  • A web browser can be used to open a notebook, but VSCode can also do so as long as the Jupyter notebook extension has been installed.

Notebook presentation

  • In this training we will focus on 2 types of cells:
    • Markdown cells: to write text (titles, mathematical formulas, tables, …)
    • Python cells: to write Python code
  • To create a Jupyter notebook, go to the Explorer menu on the top-left and click on New file.

The file extension for a Jupyter file is .ipynb.

Select kernel

  • Click on Select Kernel on the top-right of the tab to choose a Python version to run your code.

Markdown cells

  • To create a markdown cell, click on + Markdown.
  • You can write anything you want in this cell: it won’t be interpreted as code.
  • You can run a markdown cell to convert raw text to markdown format by clicking on the right-pointing arrow on the right.

Markdown cells

  • This is what it looks like after being executed.

Edit and delete a cell

  • To edit a Markdown cell after it has been executed, double-click on it.
  • To delete a Markdown cell, click on the dustbin on the right.

Python cells

  • To create a Python cell, click on + Code.
  • To run a Python cell, click on the right-pointing arrow on the left.

Python cells

  • This is what it looks like after being executed.

Other ways to execute cells

  • There are other types of Python cell execution.
  • Execute Above Cells: Runs every cell above the current cell.
    It is useful if you have modified your variables and want to revert to a previous state.

  • Execute Cell and Below: Runs the current cell and all of the cells below this one.
    It is useful if you have modified your variables and want to refresh the resulting code.

Other ways to execute cells

  • Run All: Runs every cell in the notebook.
    It is useful if you know you are going to execute all cells.

Other ways to execute cells

  • Restart: Empties the memory (restarts the kernel).
    It is useful if you use it before a Run All to check if your code works correctly before giving it to someone.

Delete a cell

  • To delete a Python cell, click on the dustbin on the right of the cell.

Let’s practise

Please open file 005_practical_jupyter.ipynb

Functions

Functions presentation

  • Functions are useful for performing an operation multiple times within a program.

  • A few functions have been introduced during this training.

    • print() which displays what is between the parentheses
    • len() which returns the number of items in a list or dictionary
  • Basically, any function works like this:

    • Variables of any type are sent to the function.
    • One or many actions are processed by the function.
    • The function returns value(s) or object(s).

Functions definition

  • A function is built with the keyword def to start the definition of the function.

  • It is followed by the function name, parentheses () optionally containing arguments, and a colon (:).

  • Like for and while loops, the code that will run inside must be indented.

  • General syntax:
def function_name():
    # Perform some action(s).
    # These actions can spread on several lines
    # which must all be indented.
  • Example:
def hello():
    print("Hi !")

Functions with arguments

  • Arguments can be passed to a function.

  • Some operations can be performed within the function using one or several arguments given in parentheses.

def square(x):
    sqr=x**2
    print(f"The square of {x} is {sqr}.")

square(2)
The square of 2 is 4.
  • Multiple arguments can be passed to the function.

  • They must be separated by commas (,) and can be of any type (str, int, float, list, dict, etc.).

def repeat_sequence(x, y):
    long_chain=x*y
    print(long_chain)

repeat_sequence("AT", 5)
ATATATATAT

Functions returning results

  • Function variables are specific to the code within the function block.
def square(x):
    sqr=x**2

square(2)
print(f"The square of 2 is {sqr}.")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[176], line 5
      2     sqr=x**2
      4 square(2)
----> 5 print(f"The square of 2 is {sqr}.")

NameError: name 'sqr' is not defined
  • The NameError occurred because variables defined inside a function are local: they do not exist in the global code.

  • To use a value computed inside a function outside of that function, we have to use return.

  • The return statement ends the execution of the function and sends back one or several values, which can be of any type.

def square(x):
    sqr=x**2
    return sqr

sqr_val = square(2)
print(f"The square of 2 is {sqr_val}.")
The square of 2 is 4.

Functions returning results

  • The return statement can be inserted several times in a function.

  • However, the first return encountered will stop the function execution and return back to the global code.

  • This is useful when combined with conditional statements to exit the function when the condition is fulfilled.

  • Example:
def speed_limit(x):
    limit = 50
    if x > limit:
        return "Too fast !"
    else:
        return "Perfect !"

print("You are driving at 51km/h.")
result = speed_limit(51)
print(result)

print("You are driving at 30km/h.")
result = speed_limit(30)
print(result)
You are driving at 51km/h.
Too fast !
You are driving at 30km/h.
Perfect !

Default argument

  • You can add a default value to your arguments, but these arguments must be placed at the end of the argument list in the function.
def power(x, n = 2) :
    return x**n

pow_value = power(2)

print(f"2**2 = {pow_value}")
2**2 = 4

The second argument was omitted, so the function used the default value n = 2.

  • Indeed, the default value is used when no value has been passed to this argument. If you provide a value, the function will use it instead of the default value:
def power(x, n = 2) :
    return x**n

pow_value = power(2,3)

print(f"2**3 = {pow_value}")
2**3 = 8
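
Arguments can also be passed by name (keyword arguments), which makes calls with default values easier to read. Here is a short sketch with the same power function; the results are stored in variables invented for the illustration.

```python
def power(x, n=2):
    return x**n

default_used = power(3)        # n takes its default value
by_name = power(3, n=3)        # n is passed by name
any_order = power(n=4, x=2)    # with names, the order does not matter

print(default_used, by_name, any_order)
```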

Good practices

  • Function names should be lowercase, with words separated by underscores (_) for better readability.

Function names should not be the same as Python built-in functions or keywords.

  • You can specify the expected type of your arguments and of the returned value (type hints). This is not mandatory, but it helps to remember which type of value the function expects.
def square(x : int|float) -> int|float:
    return x**2
sqr_val = square(2)

print(f"The square of 2 is {sqr_val}.")
The square of 2 is 4.

Type hints exist since Python 3.5, but the int|float syntax used here (the | operator between types) requires Python 3.10 or later.
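
Keep in mind that type hints are indications for the reader and for external tools: Python itself does not check them when the code runs. The sketch below, with a simplified version of square, shows what happens when a value of another type is passed.

```python
def square(x: float) -> float:
    return x**2

# The hints are stored with the function...
print(square.__annotations__)

# ...but they are not checked at run time: the list is accepted by the call,
# and the error only appears when ** is applied to it.
try:
    square([1, 2])
except TypeError as error:
    print(f"TypeError: {error}")
```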

Brainstorming time

Imagine you want to create a function called ‘enzyme’, which takes a string as an argument and returns a split list. It splits every time there is a serine (S) residue (we are in a wonderful world where enzymes cut every time and there are no steric hindrances…).

How would you do that ?

We define the name of the function.

def enzyme():

We can then add the argument(s) :

def enzyme(my_string):

We can then add the instructions (beware of indentation):

def enzyme(my_string):
    my_string.split("S")

Then, we want to see the result!

def enzyme(my_string):
    res = my_string.split("S")
    print(res)

Brainstorming time

Now let’s try !

def enzyme(my_string):
    res = my_string.split("S")
    print(res)

enzyme("AGESMKT")
['AGE', 'MKT']

Great, it works ! I want to see it in a variable.

def enzyme(my_string):
    res = my_string.split("S")
    print(res)

answer = enzyme("AGESMKT")
print(answer)
['AGE', 'MKT']
None

Oops, I forgot to include the return in the function.

def enzyme(my_string):
    res = my_string.split("S")
    return res

answer = enzyme("AGESMKT")
print(f'answer contains: {answer}')
answer_2 = enzyme("agesmkt")
print(f'answer_2 contains: {answer_2}')
answer contains: ['AGE', 'MKT']
answer_2 contains: ['agesmkt']

Brainstorming time

OK, now let’s enhance our function! Currently it cuts only on uppercase S but we want to be able to accept sequences in upper and lower case letters.

def enzyme(my_string):
    res = my_string.upper().split("S")
    return res

answer = enzyme("AGESMKT")
print(answer)
answer_2 = enzyme("agesmkt")
print(answer_2)
['AGE', 'MKT']
['AGE', 'MKT']

That’s pretty good, but now we want to add the ability to cut according to another amino acid, while keeping Serine as the default value.

def enzyme(my_string, catalytic_site = "S"):
    res = my_string.upper().split(catalytic_site)
    return res

pept = "AGESMKT"
answer = enzyme(pept)
print(answer)
answer_2 = enzyme(pept, "T")
print(answer_2)
answer_3 = enzyme(pept, "ES")
print(answer_3)
['AGE', 'MKT']
['AGESMK', '']
['AG', 'MKT']

Brainstorming time

You may have noticed that the catalytic site is no longer in the list. In reality, an enzyme can cut before or after the catalytic site, but the recognised amino acid should always be present. How would you approach this? (Tip: add a boolean argument named before, True by default, which makes the enzyme cut before the catalytic site.)

def enzyme(my_string, catalytic_site = "S", before = True):
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
    else:
        for my_peptide in range(0, len(res) - 1):
            res[my_peptide] = res[my_peptide] + catalytic_site
    return res
pept = "AGESMKT"
answer = enzyme(pept)
print(answer)
answer = enzyme(pept, "T")
print(answer)
answer = enzyme(pept, "T", before=False)
print(answer)
answer = enzyme(pept, "A")
print(answer)
['AGE', 'SMKT']
['AGESMK', 'T']
['AGESMKT', '']
['', 'AGESMKT']

We can see that if the peptide begins or ends with the catalytic site, the split produces an unexpected empty string. We do not want this empty string.
How would you do this?

Brainstorming time

def enzyme(my_string, catalytic_site = "S", before = True):
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
        if res[0] == "":
            res.pop(0)
    else:
        for my_peptide in range(0, len(res) - 1):
            res[my_peptide] = res[my_peptide] + catalytic_site
        if res[-1] == "":
            res.pop(-1)
    return res

pept = "AGESMKT"
print(enzyme(pept))
print(enzyme(pept, "T"))
print(enzyme(pept, "O"))
print(enzyme(pept, "T", before=False))
print(enzyme(pept, "A"))
['AGE', 'SMKT']
['AGESMK', 'T']
['AGESMKT']
['AGESMKT']
['AGESMKT']

Well played! You’re almost there with this beautiful function! Adding documentation within docstrings will be helpful if in two years you want to remember what the function does, or if you give your code to someone else.

Brainstorming time

def enzyme(my_string : str, catalytic_site = "S", before = True) -> list:
    """
    Simulate an enzyme cleavage using a catalytic site. The cleavage can occur before or after the catalytic site.

    Arguments:
    my_string: string
        The protein to be digested.
    catalytic_site: string - optional
        The cleavage site used to split the protein.
    before: boolean - optional
        Whether the enzyme cuts before or after the cleavage site.
        If `before` is True, the enzyme cuts before the catalytic site, otherwise it cuts after the catalytic site.
    """
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
        if res[0] == "":
            res.pop(0)
    else:
        for my_peptide in range(0, len(res) - 1):
            res[my_peptide] = res[my_peptide] + catalytic_site
        if res[-1] == "":
            res.pop(-1)
    return res

short_sab = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEP"
res = enzyme(short_sab)
print(res)
['MKWVTFI', 'SLLFLF', 'S', 'SAY', 'SRGVFRRDAHK', 'SEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADE', 'SAENCDK', 'SLHTLFGDKLCTVATLRETYGEMADCCAKQEP']

Although not mandatory, docstrings are highly recommended!
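
A docstring is not just a comment: Python stores it with the function, so it can be displayed later. Here is a minimal sketch with a shortened version of enzyme (the one-line docstring is written for this illustration).

```python
def enzyme(my_string, catalytic_site="S"):
    """Split a protein sequence at each catalytic site (simplified version)."""
    return my_string.upper().split(catalytic_site)

# The docstring is available in the __doc__ attribute of the function.
print(enzyme.__doc__)
```

In an interactive session or a notebook, help(enzyme) displays the same text together with the function signature.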

Summary of the functions section

  • Functions are useful for performing an operation multiple times within a program.
  • Syntax: the keyword def, the function name, parentheses () optionally containing arguments, and a colon (:).
def function_name(argument1, argument2):
  # action
  • The function name should not be the same as a Python built-in function or keyword.
  • The code inside a function must be indented.
  • The return statement ends the function and sends a result where it is called.
  • There can be multiple return statements in a function if you use conditional statements but the first return encountered will stop the function execution and return back to the global code.

Let’s practise

Please open file 006_practical_functions.ipynb

Packages

Packages presentation

  • Standard Python is a powerful language that can do many things, and developers may help the community with “ready-to-use” functions bundled in packages.

  • Packages contain collections of functions developed to accomplish common tasks.

  • The Python community is really active and has developed many packages providing functions for almost any purpose.

Some examples of useful packages:

  • biopython : tools for computational molecular biology.
  • pandas : dataframe and data analysis toolkit.
  • scipy : toolkit for mathematics, statistics and various scientific processes.

Packages utilisation

  • Use import followed by the package name to load a package in Python.

  • Once imported, you can call a function from the package by writing package_name.function_name.

  • Example :
import random
print(random.randint(0,10))
7

Here we have imported the package random to use the function randint, which draws a random integer between 0 and 10 (both bounds included).

  • Another common way to import functions from a package is to use the keyword from.

  • from is useful to import one or several functions and call them without prefixing them with the package name.

from random import randint
print(randint(0,10))
1

Packages utilisation

  • All functions of a package can be imported at once using *.
  • In the following example, all random functions have been imported and can be used directly by naming them like randint or choices.
from random import *
print("Random number:", randint(0,10))
print("Random name:", choices(['Binjamain','Izabèl','Toma','Lauraine']))
Random number: 7
Random name: ['Binjamain']

Be careful when using * with multiple packages. Some packages might have functions with the same name, and this can cause conflicts in Python. In fact, it is greatly recommended to not use * to import everything from a package.
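
Here is a small sketch of such a conflict, using two standard library modules, math and cmath, which both provide a function named sqrt. To keep the example short, the functions are imported by name, but the same overwrite happens silently with *.

```python
from math import sqrt
first_result = sqrt(4)     # math.sqrt returns a float

from cmath import sqrt
second_result = sqrt(4)    # cmath.sqrt has silently replaced math.sqrt

print(first_result, second_result)

# Importing the modules themselves keeps both functions available.
import math
import cmath
print(math.sqrt(4), cmath.sqrt(4))
```

The second call returns a complex number although the code of the call itself did not change.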

Packages utilisation

  • It is also possible to define an alias for a module:
import random as rand
print("Random number:", rand.randint(0,10))
Random number: 9
  • Aliases can be useful when the packages or function names are long. Using them can make your code more readable.
    They also prevent name conflicts, because each function is called through its alias.
  • Like every variable, aliases can be overwritten if you specify something else with it.
import numpy as np
# np here calls numpy ...
import seaborn as np
# But here, np is overwritten by seaborn
np = "no problem"
# And now becomes "no problem"

Choose your aliases wisely!

Packages utilisation

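Be careful! A function imported with from, with or without an alias, overwrites any existing function that has the same name. Below, the alias pprint keeps the built-in print intact, but it would hide the pprint function of the standard pprint module if that one had been imported earlier.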

print("[bold red]This sentence is different[/bold red]")
from rich import print as pprint
pprint("[bold red]From this sentence[/bold red]")
[bold red]This sentence is different[/bold red]
From this sentence

Packages download

  • All packages available in The Python Package Index (PyPI) can be installed.
  • These can be downloaded easily through the package installer for Python pip.
pip install biopython
  • If the package is only available on GitHub, you can use:
pip install git+https://github.com/pseudo/repo-name.git

Installing a large number of packages, or certain combinations of packages, might result in conflicts. For advanced usage, an environment manager such as conda is recommended.
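
Before installing a package, it can be useful to check whether it is already there. The sketch below relies on the standard module importlib.metadata (Python 3.8 or later). The helper installed_version is a hypothetical name written for this example, and the result depends on your own installation.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

print(installed_version("biopython"))
print(installed_version("a-package-that-does-not-exist"))
```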

Summary of the packages section

  • Packages contain collections of “ready-to-use” functions developed to accomplish common tasks.
  • They are useful because they save you from coding complicated functions yourself.
  • To use a package, you need to install it first.
    • pip install package_name or
    • pip install git+https://github.com/pseudo/repo-name.git
  • Then you need to import it with one of these methods:
    • To import the whole package: import package_name, then call package_name.function_name.
    • To import only one function: from package_name import function_name, then call function_name.
    • To import every function (not recommended): from package_name import *, then call function_name.
  • If you find the package name too long, you can give it an alias:
    • import a_package_with_a_long_name as pack then pack.function_name
  • Be careful not to overwrite another function with an alias!

Let’s practise

Please open file 007_practical_packages.ipynb

Reading and writing files

Input/Output (I/O) presentation

  • The main operations that you can perform on files are: reading a file and writing to a file.

  • When you access a file on an operating system, a file path is required, which represents the location of a file. It is broken up into three major parts:

    • Folder Path: the file folder location on the file system where subsequent folders are separated by a forward slash / (Unix) or backslash \ (Windows)
    • File Name: the actual name of the file
    • Extension: the end of the file name, preceded by a period (.), used to indicate the file type
  • The path can be:

    • absolute (the full path from the root of the computer)
    • relative (relative to the working directory).

Path example

.
└── home
    └── Toto
        ├── Desktop
        ├── Documents
        │   └── Trainings
        │       └── Python
        │           ├── practical_work/
        │           │   ├── Data
        │           │   │   └── sequences.fasta
        │           │   └── exercises.py
        │           └── Python_slides.html
        ├── Images
        ├── Downloads
        └── Videos
  • /home/Toto/Documents/Trainings/Python/ is the absolute path of the Python folder.

  • /home/Toto/Documents/Trainings/Python/practical_work/exercises.py is the absolute path of exercises.py.

  • ./Data/sequences.fasta is the relative path of sequences.fasta (relative to the folder that contains exercises.py)

  • exercises is the file name.

  • py is the file extension.
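As an illustration, the parts of a path listed above can be extracted with the standard pathlib module. This is a minimal sketch using the example tree; PurePosixPath only manipulates the text of a Unix-style path, so the file does not need to exist on your computer.

```python
from pathlib import PurePosixPath

# PurePosixPath never touches the disk: it only splits the text of the path.
script = PurePosixPath("/home/Toto/Documents/Trainings/Python/practical_work/exercises.py")

print(script.parent)          # folder path
print(script.name)            # file name with its extension
print(script.stem)            # file name without the extension
print(script.suffix)          # extension, including the period
print(script.is_absolute())   # True because the path starts from the root
```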


About relative paths

  • ./ means the current directory.
    ./Data/sequences.fasta and Data/sequences.fasta are equivalent.
  • ../ means the parent directory.
    If you need to access the Python_slides.html file from exercises.py, use ../Python_slides.html
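To see how ./ and ../ are resolved, the relative paths can be combined with the working directory. This is a minimal sketch: it uses posixpath (the Unix flavour of os.path) so that the result is the same on every operating system, and it assumes the working directory is the practical_work folder of the example tree.

```python
import posixpath

# Folder containing exercises.py in the example tree.
working_dir = "/home/Toto/Documents/Trainings/Python/practical_work"

# Join the working directory and a relative path, then simplify "./" and "../".
same_dir = posixpath.normpath(posixpath.join(working_dir, "./Data/sequences.fasta"))
no_dot = posixpath.normpath(posixpath.join(working_dir, "Data/sequences.fasta"))
parent = posixpath.normpath(posixpath.join(working_dir, "../Python_slides.html"))

print(same_dir)
print(no_dot)
print(parent)
```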

File handlers

  • To manage a file, we use a file handler, which can be created with function open().
  • Open a file for reading with:
open('/home/Toto/Data/example.txt', 'r')
  • Open a file for writing with:
open('/home/Toto/Data/example.txt', 'w')
  • Open a file for appending with:
open('/home/Toto/Data/example.txt', 'a')
  • There are two syntaxes for managing a file:
f = open('/home/Toto/Data/example.txt', 'r')
# do stuff with file
f.close()
with open('/home/Toto/Data/example.txt', 'r') as f:
    # do stuff with file
    # /!\ do not forget the indentation!

The syntax using with is recommended for most cases.
Note the as f part: it means that f is the file example.txt opened in r (read) mode.
The file handler is automatically closed when you exit the with block.

If you open a file in writing mode without using with and forget to close the file handler, your changes may not be saved.
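The automatic closing can be checked with the closed attribute of the file handler. This is a minimal sketch; the file is created in a temporary folder (a hypothetical location chosen so that the example runs anywhere).

```python
import os
import tempfile

# Hypothetical file created in a temporary folder so the sketch can run anywhere.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as f:
    f.write("hello\n")
    inside = f.closed      # the file is still open inside the block
outside = f.closed         # the file was closed automatically after the block

print(inside, outside)

# Without with, the file stays open until close() is called.
g = open(path, "r")
first_line = g.readline()
g.close()
print(g.closed)
```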

Read a file

  • You can read a file all at once with the readlines() method, or line by line.
with open('/home/Toto/Data/example.txt', 'r') as f:
    content = f.readlines()
    print(content)

content is a list containing one string per line.

The whole file is read in one go. This can be useful for files with few lines.

The whole file is stored in a list, in memory. This should not be done with big files.

with open('/home/Toto/Data/example.txt', 'r') as f:
    for line in f:
        print(line)


The file is read line by line. This is the most appropriate method for large files.
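Both reading methods keep the newline character at the end of each line. The sketch below shows this on a small hypothetical file written to a temporary folder, and uses rstrip() to remove the newline while reading line by line.

```python
import os
import tempfile

# Hypothetical two-record file written to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write(">seq1\nATGC\n>seq2\nGGCC\n")

# readlines(): every element of the list ends with the newline character.
with open(path, "r") as f:
    content = f.readlines()
print(content)

# Line by line: rstrip() removes the trailing newline, otherwise print()
# would add a second one and blank lines would appear in the output.
headers = []
with open(path, "r") as f:
    for line in f:
        line = line.rstrip("\n")
        if line.startswith(">"):
            headers.append(line)
print(headers)
```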

Write to a file

  • You can write to a file with the write() method.
with open('output.txt', 'w') as f:
    f.write('Something I want to write to my file.\n')
with open('output.txt', 'a') as f:
    f.write('Something I want to write to my file.\n')


Be careful which mode you choose in open(), “a” or “w”:
- in writing mode, any previous content is deleted.
- in appending mode, the text is added to the end of the file.


Unlike the print() function, the write() method does not automatically add a new line (\n).
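The difference between the two modes, and the absence of an automatic new line, can be observed on a small file. This is a minimal sketch; the file is written to a temporary folder (a hypothetical location) so that nothing on your computer is overwritten.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, "w") as f:
    f.write("old content\n")

# Opening again in "w" mode erases the previous content.
with open(path, "w") as f:
    f.write("A")       # no "\n": the next write continues on the same line
    f.write("B\n")

# Opening in "a" mode keeps the content and adds text at the end.
with open(path, "a") as f:
    f.write("C\n")

with open(path, "r") as f:
    result = f.read()
print(repr(result))
```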

Summary of the I/O section

  • “I/O” stands for “Input/Output”.
  • To manage a file, we use a file handler, which can be created with function open().
  • You may open a file for:
    • reading with: open('/home/Toto/Data/example.txt', 'r')
    • writing with: open('/home/Toto/Data/example.txt', 'w') → overwrites the file
    • appending with: open('/home/Toto/Data/example.txt', 'a') → adds text at the end of the file
  • There are 2 syntaxes for managing a file:
f = open('/home/Toto/Data/example.txt', 'r')
# do stuff with file
f.close() # do not forget to close the file!
with open('/home/Toto/Data/example.txt', 'r') as f:
    # do stuff with file
    # /!\ do not forget the indentation!
  • To read a file, you can use:
with open('/home/Toto/Data/example.txt', 'r') as f:
    content = f.readlines() # store all file content at once
    print(content)
with open('/home/Toto/Data/example.txt', 'r') as f:
    for line in f: # read file line by line
        print(line)
  • To write to a file, you can use:
with open('output.txt', 'w') as f:
    f.write('Something I want to write to my file.\n')
with open('output.txt', 'a') as f:
    f.write('Something I want to write to my file.\n')

Let’s practise

Please open file 008_practical_io.ipynb

Dataframes

Dataframes presentation

  • DataFrames are objects used to store tables of data, such as Excel tables.
  • In Python, several libraries can be used to manipulate dataframes; the most popular are pandas and Polars. In this training we will focus on pandas.
  • You can install pandas using pip:
pip install pandas
  • Then you may load it with an alias:
import pandas as pd
  • pd is a common alias used for pandas, but you could also simply write import pandas then just use the functions by calling pandas.function.

Dataframe creation

  • To create a dataframe, you can use pd.DataFrame, which creates a DataFrame object with various methods.
  • A simple way to initialise a dataframe is to use a dictionary.
import pandas as pd
grades_dict = {
  'names': ['Alphonse', 'Germaine', 'Célestine'],
  'math': [14, 17, 12],
  'history': [8, 14, 19],
  'music': [16, 15, 13]
}
school = pd.DataFrame(grades_dict)
print(school)
       names  math  history  music
0   Alphonse    14        8     16
1   Germaine    17       14     15
2  Célestine    12       19     13
  • The dictionary keys will become the column names in the dataframe.
  • The dictionary values are lists, each of which will become a column in the dataframe. They must all have the same length.

Please note that a column containing numbers starting from zero has been added on the left. This column is called the index: it labels each row.
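The size, the column names and the index of a dataframe can be inspected with the shape, columns and index attributes. This is a minimal sketch that rebuilds the school dataframe from the dictionary shown in this slide.

```python
import pandas as pd

grades_dict = {
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
}
school = pd.DataFrame(grades_dict)

print(school.shape)           # (number of rows, number of columns)
print(list(school.columns))   # column names, taken from the dictionary keys
print(list(school.index))     # row labels created automatically
```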

Dataframe creation

  • You can also load a dataframe from a CSV / TSV or an Excel file. With a large dataframe, printing everything is not practical, so you will need dedicated functions and methods to inspect and manipulate it.
my_data = pd.read_csv("./Data/crops_data.csv")
print(my_data)
      farm_id       region crop_type  soil_moisture  soil_pH  temperature_C  \
0    FARM0001  North India     Wheat          35.95     5.99          17.79   
1    FARM0002    South USA   Soybean          19.74     7.24          30.18   
2    FARM0003    South USA     Wheat          29.32     7.16          27.37   
3    FARM0004  Central USA     Maize          17.33     6.03          33.73   
4    FARM0005  Central USA    Cotton          19.37     5.92          33.86   
..        ...          ...       ...            ...      ...            ...   
495  FARM0496  Central USA      Rice          42.85     6.70          30.85   
496  FARM0497  North India   Soybean          34.22     6.75          17.46   
497  FARM0498  North India    Cotton          15.93     5.72          17.03   
498  FARM0499          NaN   Soybean          38.61     6.20          17.08   
499  FARM0500  North India     Wheat          30.22     7.42          20.57   

     rainfall_mm  humidity  sunlight_hours irrigation_type  ... sowing_date  \
0          75.62     77.03            7.27             NaN  ...    01-08-24   
1          89.91     61.13            5.67       Sprinkler  ...    02-04-24   
2         265.43     68.87            8.23            Drip  ...    02-03-24   
3         212.01     70.46            5.03       Sprinkler  ...    02-21-24   
4         269.09     55.73            7.93             NaN  ...    02-05-24   
..           ...       ...             ...             ...  ...         ...   
495        52.35     79.58            7.25          Manual  ...    01-16-24   
496       256.23     45.14            5.78             NaN  ...    01-01-24   
497       288.96     57.87            7.69            Drip  ...    01-02-24   
498       279.06     73.09            9.60            Drip  ...    01-25-24   
499        72.61     89.74            5.09             NaN  ...    02-16-24   

     harvest_date total_days yield_kg_per_hectare  sensor_id  timestamp  \
0        05-09-24        122              4408.07   SENS0001   03-19-24   
1        05-26-24        112              5389.98   SENS0002   04-21-24   
2        06-26-24        144              2931.16   SENS0003   02-28-24   
3        07-04-24        134              4227.80   SENS0004   05-14-24   
4        05-20-24        105              4979.96   SENS0005   04-13-24   
..            ...        ...                  ...        ...        ...   
495      06-02-24        138              4251.40   SENS0496   05-08-24   
496      04-14-24        104              3708.54   SENS0497   01-19-24   
497      05-09-24        128              2604.41   SENS0498   04-20-24   
498      06-04-24        131              2586.36   SENS0499   03-02-24   
499      06-29-24        134              5891.40   SENS0500   05-11-24   

      latitude  longitude  NDVI_index  crop_disease_status  
0    14.970941  82.997689        0.63                 Mild  
1    16.613022  70.869009        0.58                  NaN  
2    19.503156  79.068206        0.80                 Mild  
3    31.071298  85.519998        0.44                  NaN  
4    16.568540  81.691720        0.84               Severe  
..         ...        ...         ...                  ...  
495  30.386623  76.147700        0.59                 Mild  
496  18.832748  75.736924        0.85               Severe  
497  23.262016  81.992230        0.71                 Mild  
498  19.764989  84.426869        0.77               Severe  
499  13.455532  88.880605        0.85               Severe  

[500 rows x 22 columns]
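If you do not have the CSV file at hand, read_csv can be tried on a few lines of text. This is a minimal sketch: the three rows are a hypothetical miniature inspired by the table above, and io.StringIO makes the text behave like a file. For a TSV file, the sep parameter is set to a tab.

```python
import io
import pandas as pd

# Hypothetical miniature of a CSV file; io.StringIO lets read_csv treat
# the text as if it were a file on disk.
csv_text = """farm_id,region,crop_type,soil_moisture
FARM0001,North India,Wheat,35.95
FARM0002,South USA,Soybean,19.74
FARM0003,,Wheat,29.32
"""
mini = pd.read_csv(io.StringIO(csv_text))
print(mini)    # the empty region of the last row is read as NaN

# A TSV file uses tabs instead of commas: pass sep="\t".
tsv_text = csv_text.replace(",", "\t")
mini_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(mini_tsv.shape)
```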

Dataframe structure

  • A dataframe is a two-dimensional object.
  • A column in a dataframe is an object of type Series. It is a one-dimensional object.
    Thus, a dataframe is a collection of Series.
  • A Series can only contain one type of data, whereas a dataframe can contain columns of different types: a column of integers, a column of decimal numbers, etc.
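The structure described above can be checked on a small dataframe. This is a minimal sketch; the average column and its values are hypothetical and only serve to show a column of decimal numbers next to a column of integers.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'average': [12.7, 15.3, 14.7],   # hypothetical column of decimal numbers
})

math_column = school['math']
print(type(math_column).__name__)   # a single column is a Series
print(school.dtypes)                # one data type per column
```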

Dataframe visualisation - head

  • In order to inspect your dataframe, you may need to see some rows. dataframe.head(n) returns the first n rows of the dataframe.
    If n is not provided, the first 5 rows are returned.
import pandas as pd
my_data = pd.read_csv("./Data/crops_data.csv")
my_data.head(6)
farm_id region crop_type soil_moisture soil_pH temperature_C rainfall_mm humidity sunlight_hours irrigation_type ... sowing_date harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status
0 FARM0001 North India Wheat 35.95 5.99 17.79 75.62 77.03 7.27 NaN ... 01-08-24 05-09-24 122 4408.07 SENS0001 03-19-24 14.970941 82.997689 0.63 Mild
1 FARM0002 South USA Soybean 19.74 7.24 30.18 89.91 61.13 5.67 Sprinkler ... 02-04-24 05-26-24 112 5389.98 SENS0002 04-21-24 16.613022 70.869009 0.58 NaN
2 FARM0003 South USA Wheat 29.32 7.16 27.37 265.43 68.87 8.23 Drip ... 02-03-24 06-26-24 144 2931.16 SENS0003 02-28-24 19.503156 79.068206 0.80 Mild
3 FARM0004 Central USA Maize 17.33 6.03 33.73 212.01 70.46 5.03 Sprinkler ... 02-21-24 07-04-24 134 4227.80 SENS0004 05-14-24 31.071298 85.519998 0.44 NaN
4 FARM0005 Central USA Cotton 19.37 5.92 33.86 269.09 55.73 7.93 NaN ... 02-05-24 05-20-24 105 4979.96 SENS0005 04-13-24 16.568540 81.691720 0.84 Severe
5 FARM0006 Central USA Rice 44.91 5.78 24.87 238.95 83.06 4.92 Sprinkler ... 01-13-24 05-06-24 114 4383.55 SENS0006 03-12-24 23.227859 89.421568 0.82 NaN

6 rows × 22 columns

  • It allows you to view a sample of the data more clearly than printing the entire dataset (as you may have noticed in the previous slide, printing a whole dataframe can be unreadable).

Dataframe visualisation - tail

  • Similarly, dataframe.tail(n) returns the last n rows of the dataframe.
    If n is not provided, the last 5 rows are returned.
my_data.tail()
farm_id region crop_type soil_moisture soil_pH temperature_C rainfall_mm humidity sunlight_hours irrigation_type ... sowing_date harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status
495 FARM0496 Central USA Rice 42.85 6.70 30.85 52.35 79.58 7.25 Manual ... 01-16-24 06-02-24 138 4251.40 SENS0496 05-08-24 30.386623 76.147700 0.59 Mild
496 FARM0497 North India Soybean 34.22 6.75 17.46 256.23 45.14 5.78 NaN ... 01-01-24 04-14-24 104 3708.54 SENS0497 01-19-24 18.832748 75.736924 0.85 Severe
497 FARM0498 North India Cotton 15.93 5.72 17.03 288.96 57.87 7.69 Drip ... 01-02-24 05-09-24 128 2604.41 SENS0498 04-20-24 23.262016 81.992230 0.71 Mild
498 FARM0499 NaN Soybean 38.61 6.20 17.08 279.06 73.09 9.60 Drip ... 01-25-24 06-04-24 131 2586.36 SENS0499 03-02-24 19.764989 84.426869 0.77 Severe
499 FARM0500 North India Wheat 30.22 7.42 20.57 72.61 89.74 5.09 NaN ... 02-16-24 06-29-24 134 5891.40 SENS0500 05-11-24 13.455532 88.880605 0.85 Severe

5 rows × 22 columns

Dataframe visualisation - describe

  • One of the most powerful pandas methods is describe, which gives a statistical summary of all numeric variables.

As shown in the summary below, only quantitative variables are described by default.

my_data.describe()
soil_moisture soil_pH temperature_C rainfall_mm humidity sunlight_hours pesticide_usage_ml total_days yield_kg_per_hectare latitude longitude NDVI_index
count 497.000000 498.000000 499.000000 499.000000 497.000000 500.00000 500.000000 500.000000 499.000000 500.000000 499.000000 500.000000
mean 26.754789 6.525181 24.695130 181.872886 65.169618 7.03014 26.586980 119.496000 4032.258818 22.442473 80.403927 0.602060
std 10.122341 0.585128 5.336647 72.244299 14.655248 1.69167 13.202429 16.798046 1175.516477 7.283492 5.910818 0.175402
min 10.160000 5.510000 15.010000 50.170000 40.230000 4.01000 5.050000 90.000000 2023.560000 10.004243 70.020021 0.300000
25% 17.900000 6.030000 20.305000 119.760000 51.760000 5.66750 14.945000 105.750000 2994.750000 16.263202 75.380396 0.447500
50% 25.890000 6.530000 24.700000 192.360000 65.610000 6.99500 25.980000 119.000000 4070.970000 21.981743 80.669355 0.610000
75% 35.950000 7.040000 29.090000 239.120000 77.960000 8.47000 38.005000 134.000000 5066.060000 28.528948 85.656333 0.750000
max 44.980000 7.500000 34.840000 298.960000 90.000000 10.00000 49.940000 150.000000 5998.290000 34.981531 89.991901 0.900000
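The same method can be tried on a small dataframe to see which columns are kept. This is a minimal sketch using the school grades from the dataframe creation slide; include='all' is an optional argument of describe that also summarises non-numeric columns.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
})

# By default, the text column 'names' is left out of the summary.
summary = school.describe()
print(summary)

# include='all' adds the non-numeric columns (with NaN where a statistic
# does not apply).
full_summary = school.describe(include='all')
print(list(full_summary.columns))
```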

Dataframe visualisation - display one column

  • In a dataframe, each column is explicitly named, allowing you to access a specific column by its name.
  • The syntax for accessing a single column is: my_data['column_name'].
my_data['farm_id'].head()
0    FARM0001
1    FARM0002
2    FARM0003
3    FARM0004
4    FARM0005
Name: farm_id, dtype: object

Please note that selecting a single column returns a Series rather than a dataframe. A Series shares many methods with dataframes (like head()), so you can apply them to it.

Dataframe visualisation - display several columns

  • The syntax for accessing several columns at once is:
    my_data[['column_name_1', 'column_name_2']].
my_data[['farm_id', 'region', 'crop_type']].head()
    farm_id       region crop_type
0  FARM0001  North India     Wheat
1  FARM0002    South USA   Soybean
2  FARM0003    South USA     Wheat
3  FARM0004  Central USA     Maize
4  FARM0005  Central USA    Cotton

Please note the double brackets [[ ]] when selecting several columns: the inner pair defines a list of column names.
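The type of the result depends on the brackets used. This is a minimal sketch on the school grades from the dataframe creation slide: single brackets with one name give a Series, double brackets always give a dataframe, even with a single column.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
})

one_column = school['math']               # single brackets: Series
one_column_frame = school[['math']]       # double brackets: dataframe with 1 column
two_columns = school[['names', 'math']]   # double brackets: dataframe with 2 columns

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```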

Dataframe visualisation - selection via the index with .iloc

  • The .iloc indexer allows you to select a subset of your dataframe based on integer positions. It is used with square brackets, not parentheses.
  • You must specify which rows and which columns you want to select, in this order and separated with a comma.
    Syntax: my_data.iloc[row_index, column_index]. You can use a colon (:) to select a range.
my_data.iloc[0:5,1:10]    # selects rows 0 to 4 and columns 1 to 9
region crop_type soil_moisture soil_pH temperature_C rainfall_mm humidity sunlight_hours irrigation_type
0 North India Wheat 35.95 5.99 17.79 75.62 77.03 7.27 NaN
1 South USA Soybean 19.74 7.24 30.18 89.91 61.13 5.67 Sprinkler
2 South USA Wheat 29.32 7.16 27.37 265.43 68.87 8.23 Drip
3 Central USA Maize 17.33 6.03 33.73 212.01 70.46 5.03 Sprinkler
4 Central USA Cotton 19.37 5.92 33.86 269.09 55.73 7.93 NaN
  • As with lists, you can use ::n to specify a step of n.
my_data.iloc[0::150,0::3]    # selects every 150th row and every 3rd column, starting from position 0
farm_id soil_moisture rainfall_mm irrigation_type sowing_date yield_kg_per_hectare latitude crop_disease_status
0 FARM0001 35.95 75.62 NaN 01-08-24 4408.07 14.970941 Mild
150 FARM0151 28.82 69.76 Sprinkler 03-21-24 5338.11 17.754237 Mild
300 FARM0301 28.32 207.67 Manual 02-18-24 2043.13 22.816578 Severe
450 FARM0451 10.22 74.22 NaN 02-06-24 3498.61 13.358302 NaN
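The same selections are easier to follow on a very small table. This is a minimal sketch on the school grades from the dataframe creation slide (3 rows, 4 columns).

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
})

first_block = school.iloc[0:2, 0:2]   # rows 0 to 1 and columns 0 to 1
every_other = school.iloc[::2, ::2]   # every 2nd row and every 2nd column
single_cell = school.iloc[1, 0]       # row 1, column 0: one value only

print(first_block)
print(every_other)
print(single_cell)
```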

Brainstorming time

Which code selects the last 5 rows of the first 3 columns of a dataframe?

my_data.iloc[5:, :3]

my_data.iloc[-5:, :3]

my_data.iloc[-5:, :4]

This will display the first 3 columns for all rows except the first 5.

my_data.iloc[5:, :3]
farm_id region crop_type
5 FARM0006 Central USA Rice
6 FARM0007 North India Soybean
7 FARM0008 East Africa Maize
8 FARM0009 Central USA Soybean
9 FARM0010 East Africa Rice
... ... ... ...
495 FARM0496 Central USA Rice
496 FARM0497 North India Soybean
497 FARM0498 North India Cotton
498 FARM0499 NaN Soybean
499 FARM0500 North India Wheat

495 rows × 3 columns

This is the right answer.

my_data.iloc[-5:, :3]
farm_id region crop_type
495 FARM0496 Central USA Rice
496 FARM0497 North India Soybean
497 FARM0498 North India Cotton
498 FARM0499 NaN Soybean
499 FARM0500 North India Wheat

This will display the last 5 rows of the first 4 columns.

my_data.iloc[-5:, :4]
farm_id region crop_type soil_moisture
495 FARM0496 Central USA Rice 42.85
496 FARM0497 North India Soybean 34.22
497 FARM0498 North India Cotton 15.93
498 FARM0499 NaN Soybean 38.61
499 FARM0500 North India Wheat 30.22

Brainstorming time

Which code allows access to all rows of the third, fourth and fifth columns?

my_data.iloc[:, 2:5]

my_data.iloc[:, 3:5]

my_data.iloc[3:6, :]

This is the right answer. Don’t forget that the numbering starts at zero!

my_data.iloc[:, 2:5]
crop_type soil_moisture soil_pH
0 Wheat 35.95 5.99
1 Soybean 19.74 7.24
2 Wheat 29.32 7.16
3 Maize 17.33 6.03
4 Cotton 19.37 5.92
... ... ... ...
495 Rice 42.85 6.70
496 Soybean 34.22 6.75
497 Cotton 15.93 5.72
498 Soybean 38.61 6.20
499 Wheat 30.22 7.42

500 rows × 3 columns

This will only display columns 3 and 4 (5 is excluded).

my_data.iloc[:, 3:5]
soil_moisture soil_pH
0 35.95 5.99
1 19.74 7.24
2 29.32 7.16
3 17.33 6.03
4 19.37 5.92
... ... ...
495 42.85 6.70
496 34.22 6.75
497 15.93 5.72
498 38.61 6.20
499 30.22 7.42

500 rows × 2 columns

This will display all columns for rows 3 to 5.

my_data.iloc[3:6, :]
farm_id region crop_type soil_moisture soil_pH temperature_C rainfall_mm humidity sunlight_hours irrigation_type ... sowing_date harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status
3 FARM0004 Central USA Maize 17.33 6.03 33.73 212.01 70.46 5.03 Sprinkler ... 02-21-24 07-04-24 134 4227.80 SENS0004 05-14-24 31.071298 85.519998 0.44 NaN
4 FARM0005 Central USA Cotton 19.37 5.92 33.86 269.09 55.73 7.93 NaN ... 02-05-24 05-20-24 105 4979.96 SENS0005 04-13-24 16.568540 81.691720 0.84 Severe
5 FARM0006 Central USA Rice 44.91 5.78 24.87 238.95 83.06 4.92 Sprinkler ... 01-13-24 05-06-24 114 4383.55 SENS0006 03-12-24 23.227859 89.421568 0.82 NaN

3 rows × 22 columns

Dataframe visualisation - selection via the labels with .loc

  • The .loc indexer allows you to select a subset of your dataframe based on labels (row or column names).
  • You must specify which rows and which columns you want to select, in this order and separated with a comma.
    Syntax: my_data.loc[row_names, column_names].
  • To select only rows that meet a certain condition on the content of a column, use the condition as the row selector:
    my_data['column'] <operator> value

Here, you have to replace <operator> with a comparison operator like ==, >=, !=, etc.

  • To select some columns, write the names of the columns of interest in a list.

Dataframe visualisation - selection via the labels with .loc

  • Example: To select all rows (or all columns), use a colon (:) in the first (or second) position as an argument given to loc.
my_data.loc[:,["farm_id","region","soil_moisture"]].head()
    farm_id       region  soil_moisture
0  FARM0001  North India          35.95
1  FARM0002    South USA          19.74
2  FARM0003    South USA          29.32
3  FARM0004  Central USA          17.33
4  FARM0005  Central USA          19.37
  • Example: select columns “farm_id” and “crop_type” for all rows where crop_type is “Wheat”.
my_data.loc[my_data["crop_type"] == "Wheat", ["farm_id", "crop_type"]].head()
     farm_id crop_type
0   FARM0001     Wheat
2   FARM0003     Wheat
10  FARM0011     Wheat
17  FARM0018     Wheat
40  FARM0041     Wheat
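A condition on a column produces a Series of True/False values (often called a mask), and .loc keeps the rows where the value is True. This is a minimal sketch on the school grades from the dataframe creation slide; the variable names mask and good_at_math are only illustrative.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
})

# The condition alone gives one True/False value per row.
mask = school['math'] >= 14
print(mask)

# Used as the row selector in .loc, it keeps only the rows marked True.
good_at_math = school.loc[mask, ['names', 'math']]
print(good_at_math)
```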

Brainstorming time

  • Which code selects the columns “soil_moisture”, “soil_pH” and “temperature_C” for all rows where the region is “North India”?

ANSWER:

my_data.loc[my_data['region'] == 'North India', ["soil_moisture","soil_pH","temperature_C"]]
soil_moisture soil_pH temperature_C
0 35.95 5.99 17.79
6 36.28 7.04 21.80
13 12.80 5.87 26.90
20 16.25 7.43 20.31
31 39.76 6.70 17.42
... ... ... ...
491 32.14 7.44 21.49
494 12.52 5.99 33.18
496 34.22 6.75 17.46
497 15.93 5.72 17.03
499 30.22 7.42 20.57

99 rows × 3 columns

Dataframe manipulation - modifying a column

  • The syntax my_data['column_name'] not only allows you to access a column in a dataframe, but also to modify it.
my_data['humidity'] = my_data['humidity'] / 100    # converts humidity from a percentage (0-100) to a fraction (0-1)
my_data.iloc[0:5, 0:10]    # checks that the dataframe has been modified in-place
farm_id region crop_type soil_moisture soil_pH temperature_C rainfall_mm humidity sunlight_hours irrigation_type
0 FARM0001 North India Wheat 35.95 5.99 17.79 75.62 0.7703 7.27 NaN
1 FARM0002 South USA Soybean 19.74 7.24 30.18 89.91 0.6113 5.67 Sprinkler
2 FARM0003 South USA Wheat 29.32 7.16 27.37 265.43 0.6887 8.23 Drip
3 FARM0004 Central USA Maize 17.33 6.03 33.73 212.01 0.7046 5.03 Sprinkler
4 FARM0005 Central USA Cotton 19.37 5.92 33.86 269.09 0.5573 7.93 NaN

Dataframe manipulation - adding a column

  • If my_data['column_name'] does not already exist, it will be created on the fly.
my_data['temperature_F'] = my_data['temperature_C'] * 9/5 + 32
my_data.loc[0:5, ['temperature_C', 'temperature_F']]
   temperature_C  temperature_F
0          17.79         64.022
1          30.18         86.324
2          27.37         81.266
3          33.73         92.714
4          33.86         92.948
5          24.87         76.766
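The output above has six rows (0 to 5) because .loc slices by label and includes the end label, whereas .iloc slices by position and excludes the end position. This is a minimal sketch of the difference on the school grades from the dataframe creation slide.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
})

by_label = school.loc[0:1, ['names']]   # labels 0 and 1: end label included
by_position = school.iloc[0:1, [0]]     # position 0 only: end position excluded

print(len(by_label), len(by_position))
```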

Dataframe manipulation - deleting a column

  • There are three common ways to delete a column.
my_data['id'] = my_data['farm_id']    # create a column that will be deleted on the next line
my_data = my_data.drop(columns='id')

Unlike the following two options, the drop method does not modify the existing dataframe; it returns a copy of the dataframe with the changes applied. You need to assign the result back to your variable to keep the change.

my_data['id'] = my_data['farm_id']    # create a column that will be deleted on the next line
del my_data['id']
my_data['id'] = my_data['farm_id']    # create a column that will be deleted on the next line
my_data.pop('id')

The pop method returns the deleted column, which can be assigned to a variable with col = my_data.pop('id').

my_data.head(0)    # checks that the 'id' column has been deleted
farm_id region crop_type soil_moisture soil_pH temperature_C rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_F

0 rows × 23 columns
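The three ways of deleting a column can be compared side by side on a small table. This is a minimal sketch on the school grades from the dataframe creation slide; the variable names without_music and history are only illustrative.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
})

# drop returns a new dataframe; the original keeps its 'music' column.
without_music = school.drop(columns='music')
print('music' in school.columns, 'music' in without_music.columns)

# del removes the column from the dataframe itself.
del school['music']

# pop also removes the column from the dataframe, and returns it as a Series.
history = school.pop('history')
print(list(school.columns))
print(history.tolist())
```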

Dataframe manipulation - renaming a column

  • The rename method allows you to rename one or more columns at a time using the following syntax:
    my_dataframe.rename(columns={'old name': 'new name'})

The rename method does not modify the existing dataframe, unless the inplace = True argument is used.
The two following syntaxes are equivalent:
- my_dataframe.rename(columns={'old name': 'new name'}, inplace = True)
- my_dataframe = my_dataframe.rename(columns={'old name': 'new name'})

my_data.rename(columns={'temperature_C': 'temperature_Celsius', 'temperature_F': 'temperature_Fahrenheit'}, inplace = True)
my_data.head(0)    # checks that the columns have been renamed
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit

0 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to one column

  • The sort_values method allows you to sort a dataframe according to one or more columns specified in parentheses.
my_data.sort_values('temperature_Celsius', inplace = True)
my_data.head(10)
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
22 FARM0023 East Africa Soybean 20.53 6.60 15.01 121.73 0.6149 7.48 Manual ... 07-07-24 122 3892.74 SENS0023 06-30-24 33.995800 84.719229 0.70 Moderate 59.018
478 FARM0479 Central USA Maize 26.91 6.03 15.04 207.79 0.5968 9.54 Drip ... 05-24-24 112 2023.56 SENS0479 03-31-24 18.213795 77.077855 0.30 Mild 59.072
24 FARM0025 South USA Cotton 18.54 6.81 15.11 237.74 0.7850 4.64 NaN ... 07-14-24 119 2200.87 SENS0025 04-17-24 32.936750 72.427172 0.38 Severe 59.198
419 FARM0420 South USA Rice 38.91 5.51 15.20 139.47 0.6773 4.85 NaN ... 05-03-24 91 2796.49 SENS0420 03-09-24 14.353665 87.707645 0.73 Moderate 59.360
435 FARM0436 Central USA Cotton 39.95 6.29 15.21 78.67 0.8586 5.96 NaN ... 05-29-24 132 2969.17 SENS0436 05-09-24 13.506394 86.408534 0.80 Mild 59.378
197 FARM0198 South India Soybean 41.22 6.73 15.23 283.59 0.6528 6.82 NaN ... 05-26-24 105 3323.58 SENS0198 03-17-24 11.258768 74.454130 0.69 Severe 59.414
323 FARM0324 South USA Cotton 18.42 6.62 15.25 232.95 0.8750 4.80 Manual ... 04-30-24 120 4676.14 SENS0324 01-21-24 27.582612 87.158442 0.75 NaN 59.450
58 FARM0059 South India Wheat 33.14 5.55 15.30 247.50 0.5190 5.94 Sprinkler ... 07-15-24 123 2454.60 SENS0059 03-22-24 21.906149 85.560341 0.61 NaN 59.540
29 FARM0030 Central USA Cotton 18.83 5.66 15.39 184.85 0.9000 6.10 Drip ... 04-19-24 102 5356.92 SENS0030 03-27-24 13.809559 72.524419 0.70 Mild 59.702
442 FARM0443 East Africa Cotton 32.68 6.08 15.47 261.73 0.5656 5.45 Drip ... 08-06-24 136 2889.78 SENS0443 06-28-24 23.036798 73.670909 0.68 NaN 59.846

10 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to one column

  • By default, the column is sorted in ascending order.
    Use ascending = False to sort in descending order.
my_data.sort_values('rainfall_mm', ascending = False, inplace = True)
my_data.head(10)
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
274 FARM0275 East Africa Wheat 25.81 7.15 15.85 298.96 0.6594 6.37 Drip ... 06-12-24 102 3164.72 SENS0275 04-21-24 32.109939 85.473540 0.47 Mild 60.530
332 FARM0333 Central USA Cotton 41.36 7.44 30.08 298.52 0.7334 8.80 NaN ... 08-10-24 147 2160.32 SENS0333 08-04-24 12.921902 70.495912 0.67 Mild 86.144
186 FARM0187 East Africa Maize 24.46 7.24 18.02 298.09 0.5713 9.92 NaN ... 07-23-24 139 2323.25 SENS0187 05-30-24 25.775819 73.536485 0.68 NaN 64.436
266 FARM0267 East Africa Soybean 36.26 6.60 27.46 298.08 0.7475 8.01 NaN ... 06-16-24 106 2681.28 SENS0267 04-07-24 15.017401 83.930534 0.46 Severe 81.428
347 FARM0348 North India Maize 44.13 6.18 26.90 297.67 0.4614 9.03 NaN ... 07-04-24 107 5025.21 SENS0348 06-29-24 26.095779 78.004711 0.59 Severe 80.420
7 FARM0008 East Africa Maize 27.10 5.72 22.26 296.33 0.8034 5.44 Sprinkler ... 05-24-24 121 5264.09 SENS0008 04-30-24 23.317654 72.515210 0.70 Mild 72.068
230 FARM0231 South India Maize 12.80 5.58 22.69 296.11 0.7070 7.13 Drip ... 05-13-24 102 5402.27 SENS0231 05-13-24 22.953832 73.894930 0.77 Mild 72.842
31 FARM0032 North India Maize 39.76 6.70 17.42 295.96 0.7913 6.08 NaN ... 07-10-24 111 2050.61 SENS0032 05-13-24 30.558273 72.110777 0.88 Severe 63.356
408 FARM0409 East Africa Maize 23.54 7.18 31.24 295.95 0.4624 6.22 Sprinkler ... 07-17-24 138 3124.54 SENS0409 05-31-24 14.787792 86.325616 0.68 Mild 88.232
259 FARM0260 Central USA Cotton 25.66 6.29 29.53 295.74 0.6979 7.11 Manual ... 05-30-24 144 3259.62 SENS0260 03-17-24 32.977802 80.225430 0.64 Mild 85.154

10 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to several columns

  • Use the following syntax to sort by column A and then column B:
    my_data.sort_values(['column A', 'column B']).
my_data.sort_values(['region', 'crop_type'], inplace = True)
my_data.head(10)
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
332 FARM0333 Central USA Cotton 41.36 7.44 30.08 298.52 0.7334 8.80 NaN ... 08-10-24 147 2160.32 SENS0333 08-04-24 12.921902 70.495912 0.67 Mild 86.144
259 FARM0260 Central USA Cotton 25.66 6.29 29.53 295.74 0.6979 7.11 Manual ... 05-30-24 144 3259.62 SENS0260 03-17-24 32.977802 80.225430 0.64 Mild 85.154
28 FARM0029 Central USA Cotton 35.35 7.18 33.39 295.18 0.6671 9.44 Drip ... 05-26-24 119 2726.92 SENS0029 03-01-24 19.477597 74.233206 0.50 Severe 92.102
4 FARM0005 Central USA Cotton 19.37 5.92 33.86 269.09 0.5573 7.93 NaN ... 05-20-24 105 4979.96 SENS0005 04-13-24 16.568540 81.691720 0.84 Severe 92.948
132 FARM0133 Central USA Cotton 13.71 5.70 19.44 236.71 0.6790 8.13 Sprinkler ... 07-07-24 133 4354.36 SENS0133 07-02-24 13.768623 89.954055 0.59 Moderate 66.992
288 FARM0289 Central USA Cotton 41.12 5.71 30.32 236.39 0.4112 8.55 Sprinkler ... 06-12-24 124 3276.60 SENS0289 04-05-24 26.778101 75.453084 0.39 NaN 86.576
217 FARM0218 Central USA Cotton 15.90 6.13 30.71 228.05 0.7204 5.66 Manual ... 06-14-24 119 3781.43 SENS0218 05-25-24 17.636795 81.033437 0.41 NaN 87.278
458 FARM0459 Central USA Cotton 41.86 6.99 29.50 213.48 0.7925 9.80 NaN ... 04-14-24 95 2445.53 SENS0459 02-22-24 28.514530 88.744213 0.75 NaN 85.100
191 FARM0192 Central USA Cotton 33.16 6.82 20.40 201.41 0.4686 8.98 Drip ... 04-16-24 97 5139.04 SENS0192 03-13-24 14.966167 73.994988 0.46 Mild 68.720
37 FARM0038 Central USA Cotton 13.99 5.63 24.83 194.26 0.7432 4.91 Manual ... 06-04-24 138 3664.70 SENS0038 03-15-24 29.392338 77.607561 0.85 Moderate 76.694

10 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to several columns

  • If you want to sort one column in ascending order and the other in descending order, you must provide a list of Booleans to the ascending parameter.
my_data.sort_values(['region', 'crop_type'], ascending = [True, False], inplace = True)
my_data.head(10)
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
216 FARM0217 Central USA Wheat 18.77 5.89 26.61 287.88 0.5786 8.03 Sprinkler ... 04-26-24 110 3943.44 SENS0217 02-18-24 25.408561 76.113510 0.65 Moderate 79.898
54 FARM0055 Central USA Wheat 33.62 6.44 27.39 285.79 0.5640 7.66 Drip ... 07-19-24 131 3633.18 SENS0055 04-14-24 11.133670 70.744243 0.90 NaN 81.302
376 FARM0377 Central USA Wheat 39.12 6.53 24.79 271.35 0.6382 7.38 NaN ... 05-31-24 101 3736.42 SENS0377 04-14-24 12.323687 80.266829 0.88 Mild 76.622
296 FARM0297 Central USA Wheat 30.40 6.72 25.21 261.91 0.8263 4.37 Drip ... 06-28-24 145 3128.84 SENS0297 02-11-24 15.881029 84.044438 0.54 Moderate 77.378
251 FARM0252 Central USA Wheat 15.86 6.05 17.39 247.29 0.4045 4.25 NaN ... 05-30-24 95 2994.89 SENS0252 04-28-24 12.285039 82.372897 0.86 NaN 63.302
492 FARM0493 Central USA Wheat 28.81 7.46 30.56 245.13 0.4532 8.47 NaN ... 07-27-24 128 4203.51 SENS0493 07-12-24 15.515976 75.375870 0.65 Severe 87.008
111 FARM0112 Central USA Wheat 16.25 6.57 25.58 231.96 0.5113 4.02 Drip ... 07-13-24 117 4127.73 SENS0112 07-01-24 15.741602 79.212506 0.39 Mild 78.044
481 FARM0482 Central USA Wheat 24.74 6.60 31.00 228.58 0.5624 8.59 NaN ... 08-16-24 142 3555.39 SENS0482 04-24-24 33.941965 85.854259 0.38 Moderate 87.800
315 FARM0316 Central USA Wheat 14.23 5.78 23.30 224.07 0.6767 6.63 Drip ... 07-12-24 114 5110.65 SENS0316 03-22-24 31.990674 71.614452 0.30 NaN 73.940
81 FARM0082 Central USA Wheat 22.50 5.64 19.82 214.28 0.4518 7.49 Manual ... 07-31-24 142 4571.18 SENS0082 07-07-24 34.520480 79.570623 0.41 Mild 67.676

10 rows × 23 columns
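
The same mixed sort can be checked on a much smaller table. This is a minimal sketch on a hypothetical toy dataframe (the values are invented and are not taken from the course dataset). It does not use inplace = True, so the original dataframe is left untouched and the sorted copy is stored in a new variable.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "region": ["South", "Central", "South", "Central"],
    "crop_type": ["Cotton", "Wheat", "Wheat", "Cotton"],
    "rainfall_mm": [113.7, 287.9, 285.7, 298.5],
})

# region in ascending order, then crop_type in descending order
toy_sorted = toy.sort_values(["region", "crop_type"], ascending=[True, False])
print(toy_sorted)

# The original index labels are kept, which shows how rows were reordered
print(toy_sorted.index.tolist())   # [1, 3, 2, 0]
```

Without inplace = True, sort_values() returns a new dataframe, so toy still has its rows in the initial order.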

Dataframe filtering - simple condition

  • Conditional statements can be applied on a dataframe to select rows and columns that fulfill a condition.
  • Several syntaxes are possible. Some examples are presented below.
  • Print only the rows with temperatures higher than 20°C:
my_data[my_data["temperature_Celsius"] > 20]
my_data.loc[my_data['temperature_Celsius'] > 20, :]    # the two syntaxes will return the same result
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
216 FARM0217 Central USA Wheat 18.77 5.89 26.61 287.88 0.5786 8.03 Sprinkler ... 04-26-24 110 3943.44 SENS0217 02-18-24 25.408561 76.113510 0.65 Moderate 79.898
54 FARM0055 Central USA Wheat 33.62 6.44 27.39 285.79 0.5640 7.66 Drip ... 07-19-24 131 3633.18 SENS0055 04-14-24 11.133670 70.744243 0.90 NaN 81.302
376 FARM0377 Central USA Wheat 39.12 6.53 24.79 271.35 0.6382 7.38 NaN ... 05-31-24 101 3736.42 SENS0377 04-14-24 12.323687 80.266829 0.88 Mild 76.622
296 FARM0297 Central USA Wheat 30.40 6.72 25.21 261.91 0.8263 4.37 Drip ... 06-28-24 145 3128.84 SENS0297 02-11-24 15.881029 84.044438 0.54 Moderate 77.378
492 FARM0493 Central USA Wheat 28.81 7.46 30.56 245.13 0.4532 8.47 NaN ... 07-27-24 128 4203.51 SENS0493 07-12-24 15.515976 75.375870 0.65 Severe 87.008
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
61 FARM0062 South USA Cotton 35.54 6.13 21.25 113.72 0.8768 8.12 NaN ... 04-29-24 117 2386.26 SENS0062 04-06-24 19.316171 72.602309 0.34 Moderate 70.250
462 FARM0463 South USA Cotton 20.47 7.13 33.25 89.80 0.4108 6.35 NaN ... 06-19-24 115 3564.25 SENS0463 05-02-24 28.168865 75.647282 0.85 Severe 91.850
396 FARM0397 South USA Cotton 14.53 6.91 32.27 79.51 0.5063 6.84 Sprinkler ... 06-12-24 144 3634.48 SENS0397 02-19-24 19.239649 75.791812 0.61 NaN 90.086
310 FARM0311 South USA Cotton 24.14 6.96 31.25 67.52 0.5129 6.31 NaN ... 06-23-24 133 3211.31 SENS0311 02-25-24 25.806813 89.176478 0.35 Moderate 88.250
449 FARM0450 NaN Rice 39.04 6.01 21.04 291.92 0.7292 6.30 Manual ... 07-01-24 95 2437.10 SENS0450 06-09-24 29.417278 76.887856 0.39 Mild 69.872

380 rows × 23 columns

  • Print only the rows where the region is ‘South USA’:
my_data[my_data['region'] == 'South USA']
my_data.loc[my_data['region'] == 'South USA', :]    # the two syntaxes will return the same result
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
460 FARM0461 South USA Wheat 28.84 6.42 17.89 285.72 0.8186 7.00 NaN ... 06-04-24 132 5396.51 SENS0461 02-01-24 24.972008 76.177829 0.89 NaN 64.202
443 FARM0444 South USA Wheat 43.38 5.60 34.84 284.57 0.4628 5.04 Sprinkler ... 06-22-24 119 3245.85 SENS0444 05-26-24 14.938407 78.480336 0.66 Moderate 94.712
127 FARM0128 South USA Wheat 20.21 6.28 16.69 275.28 0.8526 9.87 Sprinkler ... 04-27-24 109 3073.63 SENS0128 01-27-24 11.581679 78.693525 0.55 Severe 62.042
2 FARM0003 South USA Wheat 29.32 7.16 27.37 265.43 0.6887 8.23 Drip ... 06-26-24 144 2931.16 SENS0003 02-28-24 19.503156 79.068206 0.80 Mild 81.266
276 FARM0277 South USA Wheat 18.75 6.88 33.14 249.12 0.7592 4.74 Drip ... 06-05-24 123 4829.12 SENS0277 05-06-24 29.776665 80.233329 0.87 Severe 91.652
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
61 FARM0062 South USA Cotton 35.54 6.13 21.25 113.72 0.8768 8.12 NaN ... 04-29-24 117 2386.26 SENS0062 04-06-24 19.316171 72.602309 0.34 Moderate 70.250
462 FARM0463 South USA Cotton 20.47 7.13 33.25 89.80 0.4108 6.35 NaN ... 06-19-24 115 3564.25 SENS0463 05-02-24 28.168865 75.647282 0.85 Severe 91.850
340 FARM0341 South USA Cotton 21.91 7.32 17.05 88.64 0.5106 5.13 Drip ... 05-25-24 142 2524.93 SENS0341 01-08-24 21.987956 76.231469 0.52 Severe 62.690
396 FARM0397 South USA Cotton 14.53 6.91 32.27 79.51 0.5063 6.84 Sprinkler ... 06-12-24 144 3634.48 SENS0397 02-19-24 19.239649 75.791812 0.61 NaN 90.086
310 FARM0311 South USA Cotton 24.14 6.96 31.25 67.52 0.5129 6.31 NaN ... 06-23-24 133 3211.31 SENS0311 02-25-24 25.806813 89.176478 0.35 Moderate 88.250

93 rows × 23 columns
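
To see what happens behind a filter, it can help to look at the condition on its own. The sketch below uses a hypothetical toy dataframe (invented values) and shows that the condition is a Series of True/False values, one per row, and that both syntaxes keep the rows where it is True.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "farm_id": ["F1", "F2", "F3", "F4"],
    "temperature_Celsius": [18.5, 26.6, 20.0, 31.0],
})

# The condition alone returns a boolean Series (one value per row)
mask = toy["temperature_Celsius"] > 20
print(mask.tolist())   # [False, True, False, True]

# Both syntaxes keep only the rows where the mask is True
hot_bracket = toy[mask]
hot_loc = toy.loc[mask, :]
print(hot_bracket.equals(hot_loc))   # True
```

Note that 20.0 is excluded because the comparison is strict (>); use >= to keep it.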

Dataframe filtering - complex conditions

  • Several conditions can be combined using & (meaning and), or | (meaning or).
  • Print only the rows with temperatures higher than 20°C and a sunlight time higher than 7 hours.
my_data[ (my_data["temperature_Celsius"] > 20 ) & (my_data["sunlight_hours"] > 7) ].head()
my_data.loc[(my_data['temperature_Celsius'] > 20) & (my_data["sunlight_hours"] > 7), :].head()
# the two syntaxes will return the same result
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
216 FARM0217 Central USA Wheat 18.77 5.89 26.61 287.88 0.5786 8.03 Sprinkler ... 04-26-24 110 3943.44 SENS0217 02-18-24 25.408561 76.113510 0.65 Moderate 79.898
54 FARM0055 Central USA Wheat 33.62 6.44 27.39 285.79 0.5640 7.66 Drip ... 07-19-24 131 3633.18 SENS0055 04-14-24 11.133670 70.744243 0.90 NaN 81.302
376 FARM0377 Central USA Wheat 39.12 6.53 24.79 271.35 0.6382 7.38 NaN ... 05-31-24 101 3736.42 SENS0377 04-14-24 12.323687 80.266829 0.88 Mild 76.622
492 FARM0493 Central USA Wheat 28.81 7.46 30.56 245.13 0.4532 8.47 NaN ... 07-27-24 128 4203.51 SENS0493 07-12-24 15.515976 75.375870 0.65 Severe 87.008
481 FARM0482 Central USA Wheat 24.74 6.60 31.00 228.58 0.5624 8.59 NaN ... 08-16-24 142 3555.39 SENS0482 04-24-24 33.941965 85.854259 0.38 Moderate 87.800

5 rows × 23 columns
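
The sketch below illustrates & and | on a hypothetical toy dataframe (invented values). Storing each condition in its own variable avoids forgetting the parentheses, which are required when the comparisons are written inline because & and | are evaluated before > and <. The ~ operator (meaning not) is an extra that is not covered in the slides above.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "temperature_Celsius": [18.5, 26.6, 22.0, 31.0],
    "sunlight_hours": [8.0, 6.5, 7.4, 9.1],
})

# Each condition is a boolean Series
warm = toy["temperature_Celsius"] > 20
sunny = toy["sunlight_hours"] > 7

both = toy[warm & sunny]     # rows that are warm AND sunny
either = toy[warm | sunny]   # rows that are warm OR sunny
not_warm = toy[~warm]        # rows that are NOT warm

print(both.index.tolist())      # [2, 3]
print(either.index.tolist())    # [0, 1, 2, 3]
print(not_warm.index.tolist())  # [0]
```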

Dataframe filtering and modifying

  • It is possible to modify only certain cells in a column, depending on their value.
  • This must be done using .loc.
  • Examples:
my_data.loc[my_data['region'] == 'North India', ['region']] = 'India_North'
my_data.loc[my_data['region'] == 'India_North', :].head()
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
366 FARM0367 India_North Wheat 42.31 6.79 27.53 276.71 0.8871 5.19 Sprinkler ... 06-05-24 106 2597.00 SENS0367 04-07-24 14.253072 81.344858 0.31 Severe 81.554
260 FARM0261 India_North Wheat 26.11 5.81 20.30 272.41 0.5249 5.54 Manual ... 07-21-24 136 2308.81 SENS0261 05-29-24 29.822605 73.458050 0.80 NaN 68.540
112 FARM0113 India_North Wheat 38.33 6.34 30.32 270.94 0.4078 5.24 Drip ... 08-04-24 135 5488.85 SENS0113 05-23-24 28.513527 78.045307 0.44 Mild 86.576
392 FARM0393 India_North Wheat 28.81 6.28 29.38 269.97 0.6602 7.24 Sprinkler ... 06-07-24 111 5028.19 SENS0393 03-08-24 10.585544 87.806387 0.62 Moderate 84.884
314 FARM0315 India_North Wheat 27.40 7.10 19.41 251.11 0.6131 8.87 NaN ... 07-21-24 124 2549.32 SENS0315 07-04-24 34.117310 74.264637 0.33 Mild 66.938

5 rows × 23 columns

my_data.loc[my_data['region'] == 'South India', ['region']] = 'India_South'
my_data.loc[my_data['region'] == 'India_South', :].head()
farm_id region crop_type soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
298 FARM0299 India_South Wheat 14.80 7.11 32.20 273.40 0.7477 9.58 Sprinkler ... 07-23-24 118 3538.86 SENS0299 04-24-24 14.644776 82.091465 0.65 Mild 89.960
198 FARM0199 India_South Wheat 26.07 7.10 23.96 264.15 0.6235 4.71 NaN ... 05-08-24 116 2143.33 SENS0199 03-23-24 10.004243 71.817911 0.66 Moderate 75.128
278 FARM0279 India_South Wheat 31.79 6.01 24.17 263.85 0.6718 4.03 NaN ... 07-13-24 119 3640.61 SENS0279 03-26-24 25.030414 70.131460 0.83 Moderate 75.506
58 FARM0059 India_South Wheat 33.14 5.55 15.30 247.50 0.5190 5.94 Sprinkler ... 07-15-24 123 2454.60 SENS0059 03-22-24 21.906149 85.560341 0.61 NaN 59.540
69 FARM0070 India_South Wheat 15.13 5.89 27.05 240.05 0.7278 5.06 NaN ... 07-26-24 133 5696.62 SENS0070 06-05-24 31.606172 82.544348 0.39 NaN 80.690

5 rows × 23 columns
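
The same replacement can be followed step by step on a small table. This is a minimal sketch on a hypothetical toy dataframe (invented values). Here the column is given as a plain string rather than a one-element list, which should have the same effect for a single column.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "region": ["North India", "South USA", "North India"],
    "rainfall_mm": [276.7, 113.7, 272.4],
})

# Rows to modify: those where region is 'North India'
mask = toy["region"] == "North India"

# .loc[rows, column] = new value modifies the dataframe itself
toy.loc[mask, "region"] = "India_North"

print(toy["region"].tolist())   # ['India_North', 'South USA', 'India_North']
```

Only the rows selected by the mask are changed; the 'South USA' row keeps its value.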

Aggregation

  • The .groupby() method splits the rows of a dataframe into groups that share the same value in one or several columns. You can then perform a calculation on each group in order to return a single value per group.
  • Example 1: Calculate the sum of rainfall for each region in the dataframe:
my_data.groupby('region')['rainfall_mm'].sum()
region
Central USA    19014.33
East Africa    19734.47
India_North    18636.89
India_South    16972.94
South USA      15824.96
Name: rainfall_mm, dtype: float64
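
To make the grouping logic visible, the sketch below computes the same kind of sum on a hypothetical toy dataframe (invented values) and compares one group total with the value obtained by filtering the rows manually.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "region": ["A", "B", "A", "B", "A"],
    "rainfall_mm": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One sum per distinct value of 'region'
totals = toy.groupby("region")["rainfall_mm"].sum()
print(totals)

# Same result for region A, obtained by filtering first
manual_total_a = toy.loc[toy["region"] == "A", "rainfall_mm"].sum()
print(manual_total_a)   # 90.0
```

The result of groupby(...).sum() is a Series whose index contains the group names, so a single total can be read with totals["A"].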

Aggregation

  • Example 2: Count the number of farms growing each type of crop in each region in the dataframe.
my_data.groupby(['region', 'crop_type']).count()
farm_id soil_moisture soil_pH temperature_Celsius rainfall_mm humidity sunlight_hours irrigation_type fertilizer_type pesticide_usage_ml ... harvest_date total_days yield_kg_per_hectare sensor_id timestamp latitude longitude NDVI_index crop_disease_status temperature_Fahrenheit
region crop_type
Central USA Cotton 26 26 26 26 26 26 26 17 26 26 ... 26 26 26 26 26 26 26 26 21 26
Maize 21 20 21 21 20 20 21 17 21 21 ... 21 21 21 21 21 21 21 21 14 21
Rice 18 18 18 18 18 18 18 13 17 18 ... 18 18 18 18 18 18 18 18 13 18
Soybean 26 26 25 26 26 26 26 20 26 26 ... 26 26 26 26 26 26 26 26 17 26
Wheat 17 17 17 17 17 17 17 12 17 17 ... 17 17 17 17 17 17 17 17 13 17
East Africa Cotton 24 24 24 24 24 24 24 17 24 24 ... 24 24 24 24 24 24 24 24 20 24
Maize 24 24 24 24 24 24 24 16 24 24 ... 24 24 24 24 24 24 24 24 15 24
Rice 20 20 20 20 20 20 20 15 20 20 ... 20 20 19 20 20 20 20 20 17 20
Soybean 24 24 24 24 24 23 24 18 24 24 ... 24 24 24 24 24 24 24 24 20 24
Wheat 15 15 15 15 15 15 15 11 15 15 ... 15 15 15 15 15 15 15 15 11 15
India_North Cotton 18 17 18 18 18 18 18 9 18 18 ... 18 18 18 18 18 18 18 18 15 18
Maize 24 24 24 24 24 24 24 15 24 24 ... 24 24 24 24 24 24 24 24 19 24
Rice 18 17 18 18 18 18 18 14 18 18 ... 18 18 18 18 18 18 18 18 14 18
Soybean 18 18 18 18 18 18 18 14 18 18 ... 18 18 18 18 18 18 18 18 13 18
Wheat 20 20 20 20 20 20 20 9 20 20 ... 20 20 20 20 20 20 20 20 16 20
India_South Cotton 20 20 20 20 20 20 20 16 20 20 ... 20 20 20 20 20 20 20 20 10 20
Maize 21 21 21 21 21 21 21 17 20 21 ... 21 21 21 21 21 21 21 21 14 21
Rice 6 6 6 6 6 6 6 2 6 6 ... 6 6 6 6 6 6 6 6 6 6
Soybean 22 22 22 21 22 22 22 14 22 22 ... 22 22 22 22 22 22 22 22 18 21
Wheat 21 21 21 21 21 20 21 13 21 21 ... 21 21 21 21 21 21 21 21 16 21
South USA Cotton 19 19 19 19 19 19 19 14 19 19 ... 19 19 19 19 19 19 19 19 15 19
Maize 21 21 20 21 21 21 21 15 21 21 ... 21 21 21 20 21 21 21 21 13 21
Rice 17 17 17 17 17 17 17 10 17 17 ... 17 17 17 17 17 17 17 17 9 17
Soybean 17 17 17 17 17 17 17 12 17 17 ... 17 17 17 17 17 17 17 17 13 17
Wheat 19 19 19 19 19 19 19 16 19 19 ... 19 19 19 19 19 19 18 19 15 19

25 rows × 21 columns

Aggregation

  • In this example you didn’t really need to print the values in all columns.
    You can simply print a limited number of columns of interest.
  • Please note that, in the previous table, some columns contain lower counts than others. This is because .count() does not take missing values (None, NaN) into account.
my_data.groupby(['region', 'crop_type'])['farm_id'].count()
region       crop_type
Central USA  Cotton       26
             Maize        21
             Rice         18
             Soybean      26
             Wheat        17
East Africa  Cotton       24
             Maize        24
             Rice         20
             Soybean      24
             Wheat        15
India_North  Cotton       18
             Maize        24
             Rice         18
             Soybean      18
             Wheat        20
India_South  Cotton       20
             Maize        21
             Rice          6
             Soybean      22
             Wheat        21
South USA    Cotton       19
             Maize        21
             Rice         17
             Soybean      17
             Wheat        19
Name: farm_id, dtype: int64

Aggregation

  • You can even apply different aggregate methods depending on the column, or even apply multiple aggregate methods to the same column.
  • Example 3: For each region, find the minimum and maximum temperature (Celsius) and the sum of rainfall.
my_data.groupby('region').agg({'temperature_Celsius':['min', 'max'], 'rainfall_mm': 'sum'})
            temperature_Celsius         rainfall_mm
                            min    max          sum
region
Central USA               15.04  34.09     19014.33
East Africa               15.01  34.33     19734.47
India_North               15.64  34.52     18636.89
India_South               15.23  33.78     16972.94
South USA                 15.11  34.84     15824.96
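
The dictionary syntax above returns columns with a two-level header (column name, then method). As a possible alternative that is not used in this course, pandas also accepts "named aggregation", where each output column is given a name of your choice. The sketch below uses a hypothetical toy dataframe, and the output names temp_min, temp_max and rain_sum are arbitrary.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "temperature_Celsius": [15.0, 30.0, 20.0, 25.0],
    "rainfall_mm": [100.0, 50.0, 10.0, 40.0],
})

# Named aggregation: output_name = (input column, method)
summary = toy.groupby("region").agg(
    temp_min=("temperature_Celsius", "min"),
    temp_max=("temperature_Celsius", "max"),
    rain_sum=("rainfall_mm", "sum"),
)
print(summary)
print(summary.columns.tolist())   # ['temp_min', 'temp_max', 'rain_sum']
```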

What methods can be applied after aggregation?

  • The complete list is available here.
  • Some common methods:
    • .min(): compute min of group values
    • .max(): compute max of group values
    • .mean(): compute mean of group values
    • .count(): compute count of group, excluding missing values
    • .describe(): generate descriptive statistics for each numeric column
    • .head(n): return the first n rows in each group
    • .tail(n): return the last n rows in each group
    • .size(): compute group sizes
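
The difference between .count() and .size() only shows up when values are missing. The sketch below illustrates it on a hypothetical toy dataframe (invented values) in which two soil_pH values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing soil_pH values
toy = pd.DataFrame({
    "crop_type": ["Rice", "Rice", "Wheat", "Wheat", "Wheat"],
    "soil_pH": [6.1, np.nan, 5.9, 6.4, np.nan],
})

grouped = toy.groupby("crop_type")["soil_pH"]

counts = grouped.count()   # number of non-missing values per group
sizes = grouped.size()     # number of rows per group, missing or not
means = grouped.mean()     # missing values are ignored in the mean

print(counts.tolist())   # [1, 2]
print(sizes.tolist())    # [2, 3]
print(means)
```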

Dataframe export

  • You can use to_csv() to export a dataframe to a delimited text file (comma-separated by default).
  • Syntax: my_data.to_csv('path_to_output_file')
  • Example:
my_data.to_csv('my_dataframe.csv', header = True, index = False, sep = ',')
  • Some common options:
    • header = True: the header will be printed
    • index = False : the index will not be printed
    • sep = ',' : the separator that will be used to separate the columns will be the comma (,)
  • The only mandatory parameter is the output file path.
    Please read the documentation to see the complete list of parameters.
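
As an illustration of the sep and index options, the sketch below writes a hypothetical toy dataframe to a tab-separated file in a temporary folder, then reads it back. The file name toy.tsv and the use of a temporary folder are choices made for this example only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({"farm_id": ["F1", "F2"], "rainfall_mm": [10.5, 20.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy.tsv")

    # Tab as separator, header written (default), index not written
    toy.to_csv(out_path, sep="\t", index=False)

    # Read the first line of the file to check the header
    with open(out_path) as handle:
        first_line = handle.readline().rstrip("\n")

    # Read the file back with the same separator
    reloaded = pd.read_csv(out_path, sep="\t")

print(repr(first_line))   # 'farm_id\trainfall_mm'
print(reloaded)
```

The separator used in read_csv() must match the one used in to_csv(), otherwise all columns end up in a single one.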

Summary of the dataframes section (1/3)

  • DataFrames are objects used to store tables of data. They can be initialised:

    • with a dictionary: pandas.DataFrame(my_dict)
    • from a delimited text file: pandas.read_csv("my_tabulated_file")
  • Unlike nested lists, columns are identified by a name and must contain only one data type.

  • There are ways that allow you to view a subset of the data:

    • first lines with my_df.head(), last lines with my_df.tail(), generate statistics with my_df.describe()
    • display one column with my_df['column_1'] or several columns with my_df[['column_1', 'column_2']]
    • select data via the index with my_df.iloc[row_index, column_index]
    • select data via the labels with my_df.loc[my_df['column_1'] == value, ['column_2', 'column_3']]

Summary of the dataframes section (2/3)

  • There are ways that allow you to modify a subset of the data:
    • create or modify a column: my_df['column_name'] = value
    • delete a column: my_df = my_df.drop(columns='column_name'), del my_df['column_name'] or my_df.pop('column_name')
    • rename one or several columns with
      • my_df.rename(columns={'old name': 'new name'}, inplace = True) or
      • my_df = my_df.rename(columns={'old name': 'new name'})
    • sorting a dataframe according to one or several columns with
      • my_df.sort_values('column_1', ascending = True, inplace = True) or
      • my_df = my_df.sort_values(['column_1', 'column_2'], ascending = [True, False])

Summary of the dataframes section (3/3)

  • To filter a dataframe you can use:

    • my_df[my_df['column_1'] > value]
    • my_df.loc[my_df['column_1'] > value, ['column_3', 'column_4']]
    • my_df[ (my_df['column_1'] > value) & (my_df['column_2'] < other_value) ]
    • my_df.loc[(my_df['column_1'] > value) & (my_df['column_2'] < other_value), ['column_3', 'column_4']]
  • To modify certain cells in a column depending on their value, you can do:
    my_df.loc[my_df['column_1'] == old_value, ['column_1']] = new_value

  • An aggregation allows you to group your data according to one or several columns and perform one or several operations on other columns. For instance:

    • my_df.groupby('column_1')['column_2'].sum()
    • my_df.groupby(['column_1', 'column_2']).count()
    • my_df.groupby('column_1').agg({'column_2':['min', 'max'], 'column_3': 'sum'})

Let’s practise

Please open file 009_practical_dataframes.ipynb

Plots

Plots presentation

  • There are several packages to create plots in Python.
  • In this training we will present matplotlib and seaborn.
    matplotlib is one of the most used Python data visualisation libraries.
    seaborn is based on matplotlib and provides new features.
  • matplotlib can be installed with pip install matplotlib.
  • seaborn can be installed with pip install seaborn.

Plots presentation

  • In this section we will see some of the most used types of plots:
    • line plot
    • scatterplot
    • pie plot
    • barplot
    • histogram
    • boxplot
    • violin plot
    • heatmap
    • pairplot

Line plot

  • A line plot is used to display the relationship between two numerical variables.
    In particular, this type of plot is best used for displaying trends over time.

A very basic line plot

import seaborn as sns
import matplotlib.pyplot as plt # please note that you must import matplotlib.pyplot and not simply matplotlib

import random

x = range(1, 11)
y = [100 * round(random.random(), 2) for i in range(1, 11)] # creates a list of 10 random numbers between 0 and 100
plt.plot(x, y)
plt.show()

A slightly more customised plot

import random

x = range(1, 11)
y = [100 * round(random.random(), 2) for i in range(1, 11)]
z = [100 * round(random.random(), 2) for i in range(1, 11)]
plt.figure(figsize=(10, 3))    # configure plot size
plt.plot(x, y, label='y list', linewidth=4)    # add a label and change the default line width
plt.plot(x, z, label='z list', linewidth=4, linestyle='--', color='purple')  # change the default type and color
plt.xlabel('Title for x axis', fontsize=12)  # add a label for x axis
plt.ylabel('Title for y axis', fontsize=12)  # add a label for y axis
plt.legend(loc='upper right')    # add a legend and fix its position in upper right corner
plt.grid(color='gray', linewidth=0.5)    # add a grid
plt.title('A more customised line plot') # add a title
plt.show()

Before going further: the penguins dataset

  • The penguins dataset is a good dataset for data exploration and visualisation.
  • It can be imported directly with seaborn.
import seaborn as sns
penguins = sns.load_dataset('penguins')
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Before going further: the penguins dataset

  • Each individual in this dataset is a penguin.
    For each penguin, the available data are:
    • the species
    • the island where it lives
    • the bill length (mm)
    • the bill depth (mm)
    • the flipper length (mm)
    • the body mass (g)
    • the sex




Artwork by @allison_horst

Scatterplots

  • A scatterplot is used to display the relationship between two numerical variables.
  • Unlike a line plot, with a scatterplot, a value on the x-axis can be associated with several values on the y-axis.

A simple scatterplot with matplotlib

import seaborn as sns
import matplotlib.pyplot as plt

# configure plot size
plt.figure(figsize=(10, 4))
plt.scatter(penguins['flipper_length_mm'], penguins['body_mass_g'])
# label for x axis
plt.xlabel('Flipper length (mm)', fontsize=12)
# label for y axis
plt.ylabel('Body mass (g)', fontsize=12)
# plot title
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()






All species are mixed together.

Several scatterplots on the same plot with matplotlib

for species in penguins['species'].unique():
    df = penguins.loc[penguins['species'] == species, :]
    plt.scatter(df['flipper_length_mm'], df['body_mass_g'], label=species)
plt.xlabel('Flipper length (mm)', fontsize=12)
plt.ylabel('Body mass (g)', fontsize=12)
plt.legend()      # add a legend based on 'label' parameter in plt.scatter
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()






With matplotlib, we have to loop over the species to plot each one with its own color and label.

A nice scatterplot with seaborn

plt.figure(figsize=(9, 3.5))
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species')
plt.xlabel('Flipper length (mm)', fontsize=12)
plt.ylabel('Body mass (g)', fontsize=12)
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()

By specifying the variable via the hue argument, seaborn automatically creates a color for each existing value.

Pie plots

  • A pie plot shows data as a percentage of a whole.
  • This kind of visualisation uses a circle to represent the whole, and slices of the circle to represent the specific categories that compose the whole.

Pie plot with matplotlib

island = penguins['island'].value_counts()                    # island is a Series
# values can be accessed with island.values, indexes with island.index
plt.pie(x=island.values, labels=island.index, autopct='%.2f')
plt.title('Islands', size=16, color='#DAA520')
plt.show()

seaborn does not provide a function for pie plots.

To create one, we must therefore use matplotlib’s pie function, to which seaborn’s graphic styles (themes) can be applied.

Bar plots

  • A bar plot shows the relationship between a numeric and a categorical variable.
  • Each category of the categorical variable is represented as a bar.
  • The size of the bar represents its numeric value.
  • A bar plot can represent exactly the same information as a pie plot but from a different perspective.

Bar plot with matplotlib

flipper_mean = penguins.groupby('species')['flipper_length_mm'].mean()     # flipper_mean is a Series
# values can be accessed with flipper_mean.values, indexes with flipper_mean.index
plt.bar(height=flipper_mean.values, x=flipper_mean.index)
plt.title('Flipper Length for 3 Penguin Species', size=16, color='orange')
plt.show()

Bar plot with seaborn

sns.barplot(x ='species', y='flipper_length_mm', data=penguins)
plt.title('Flipper Length for 3 Penguin Species', size=16, color='orange')
plt.show()

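seaborn automatically calculates the mean of the y variable for each category. The vertical line on each bar is an error bar (by default, a 95% confidence interval).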

Histograms

  • Histograms are particularly useful when you want to get an idea of the distribution of a variable.
  • You can see roughly where the peaks of the distribution are, whether the distribution is skewed or symmetric, and if there are any outliers.

Histogram with matplotlib

plt.hist(penguins['flipper_length_mm'])
plt.title('Flipper Length', size=16, color='green')
plt.xlabel('Flipper length (mm)')
plt.show()

Basic histogram with seaborn

sns.set_theme()    # use default theme (grey background with white grid lines)
sns.histplot(x = 'flipper_length_mm', data = penguins)
plt.title('Flipper Length', size=16, color='green')
plt.show()

Histogram with kde with seaborn

sns.set_theme()
sns.histplot(x = 'flipper_length_mm', data = penguins, hue = 'species', kde = True)
plt.title('Flipper Length', size=16, color='green')
plt.show()
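
To understand what a histogram draws, it can help to look at the numbers behind the bars. In matplotlib, plt.hist() returns the count in each bin and the bin edges in addition to drawing the plot. The sketch below uses a short hypothetical list of values (not the penguins dataset) and the bins parameter to ask for 4 bins.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements, invented for this illustration
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# plt.hist draws the histogram and returns counts, bin edges and the bars
counts, bin_edges, bars = plt.hist(values, bins=4)

print(bin_edges.tolist())   # [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts.tolist())      # [3.0, 5.0, 1.0, 1.0]
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. This is why the value 3 is counted in the second bin and the value 9 in the last one. As in the examples above, plt.show() can be called afterwards to display the figure.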

Boxplots

  • Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups.
  • They are built to provide high-level information at a glance, offering general information about a group of data’s symmetry, skew, variance, and outliers.

A simple boxplot with seaborn

sns.boxplot(x = 'species', y = 'flipper_length_mm', data = penguins, palette=['#FBB613','#38D4D6','#8A38D6'])
plt.title('Flipper Length for 3 Penguin Species', size=16, color='#00BFFF')
plt.show()

A more elaborate boxplot with seaborn

sns.set_theme()
sns.boxplot(x = 'species', y = 'flipper_length_mm', data = penguins, hue = 'sex')
plt.title('Flipper Length for 3 Penguin Species by Sex', size=16, color='#00BFFF')
plt.show()

Violin plots

  • You can think of a violin plot as a box plot combined with a density curve, which shows the shape of the distribution.
  • This plot is used to compare the distribution of numerical values among categorical variables.
  • The peaks, valleys, and tails of each group’s density curve can be compared to see where groups are similar or different.

Violin plot with seaborn

sns.set_theme()
sns.violinplot(x = 'species', y = 'body_mass_g', data = penguins, hue = 'sex')
plt.title('Body mass for penguins by sex and species', size=20, color='blue')
plt.show()

Heatmaps

  • A heatmap shows how values vary across a grid using colors.
  • It’s often used to quickly spot patterns, trends, or areas of high and low activity in data.
  • In a correlation heatmap, colors show how strongly variables are related.

Heatmap with seaborn

# extract numeric columns from penguins dataframe
penguins_numeric = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
# corr() calculates the correlation between variables
sns.heatmap(penguins_numeric.corr(), annot = True)
plt.title('Correlation between numeric variables', size=16, color='darkviolet')
plt.show()
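
The heatmap only colors the table returned by corr(). To see what this table contains, the sketch below computes it on a hypothetical toy dataframe in which the relationships are known in advance: double_x always increases with x, and minus_x always decreases when x increases.

```python
import pandas as pd

# Hypothetical toy data with known relationships
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],   # increases with x
    "minus_x": [5, 4, 3, 2, 1],     # decreases when x increases
})

# Square table: one row and one column per variable
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

A value close to 1 means that two variables increase together, a value close to -1 means that one decreases when the other increases, and a value close to 0 means no linear relationship. The diagonal is always 1 because each variable is compared with itself.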

Pairplots

  • You can use the pairplot() function to see the pairwise relationships between variables.
  • This function draws one plot for each pair of numeric variables in the dataset, and the distribution of each variable on the diagonal.
  • Several options are available to choose the plot types.

Pairplot with seaborn

sns.pairplot(penguins, hue = "species", height=1.5)
plt.show()

Going further

  • Teasing: seaborn gallery:

Summary of the plots section

  • matplotlib and seaborn are among the most widely used Python packages for plotting graphs.
  • The data to be plotted should generally be stored in a list or a dataframe.
  • There are many different types of graphs: line plots, scatter plots, pie charts, bar charts, histograms, box plots, violin plots, heatmaps, pair plots…
  • The way these different functions are used and the options available are often very similar.
  • There are many customisation options available, and these are often the same across different types of charts (xlabel(), ylabel(), legend(), title()…)
  • Please refer to the documentation for instructions on how to use the relevant functions.

Let’s practise

Please open file 010_practical_plots.ipynb

Common errors

Introduction

  • When coding, you will certainly run into errors. Some are more common than others. Learning to identify errors will help you fix them quickly.
  • When you encounter an error, Python tells you which line causes the problem, gives the error name and briefly explains what is wrong.

Common errors (1/6)

  • NameError: You may have forgotten to define a variable and you are trying to access it.
    • How to debug: check if you initialised it or deleted it by mistake.
    • Example:
print(my_variable)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[253], line 1
----> 1 print(my_variable)

NameError: name 'my_variable' is not defined
  • SyntaxError: You may have forgotten a character such as a parenthesis, a comma or a colon.
    • How to debug: The error should indicate the position in the problematic line using a ^.
    • Example:
print my_variable
  Cell In[254], line 1
    print my_variable
    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?

Common errors (2/6)

  • TypeError: You may be trying to perform an operation on, or apply a function to, an object of the wrong type.
    • How to debug: Check your variables and/or what kind of objects are accepted.
    • Example:
my_variable = "w"*1.2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[255], line 1
----> 1 my_variable = "w"*1.2

TypeError: can't multiply sequence by non-int of type 'float'
  • ValueError: You may have given a function an object of the right type, but with an invalid value.
    • How to debug: Check the value you are trying to give to the function.
    • Example:
my_variable = float("variable")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[256], line 1
----> 1 my_variable = float("variable")

ValueError: could not convert string to float: 'variable'
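
When one of these two errors appears, printing the type of the variables involved is often enough to find the cause. The sketch below is only an illustration: it shows valid versions of the two failing examples above, with hypothetical variable names.

```python
my_text = "w"
my_number = 1.2

# type() shows what kind of object each variable contains
print(type(my_text))     # <class 'str'>
print(type(my_number))   # <class 'float'>

# A string can be repeated with an int, not with a float:
# converting the float to an int first avoids the TypeError
repeated = my_text * int(my_number)   # int(1.2) is 1
print(repeated)                       # w

# float() accepts a string only if it looks like a number
converted = float("1.2")
print(converted)                      # 1.2
```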

Common errors (3/6)

  • IndexError: You may be trying to access an element in a list that is outside the valid range.
    • How to debug: Check the length of your list.
    • Example:
my_list = [0,1,2,3]
print(my_list[5])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[257], line 2
      1 my_list = [0,1,2,3]
----> 2 print(my_list[5])

IndexError: list index out of range
  • KeyError: You may be trying to access an element in a dictionary that doesn’t exist.
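    • How to debug: Check the spelling of the key and list the existing keys with .keys(). The .get() method returns None instead of raising an error when the key does not exist.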
    • Example:
my_dict = {"Laurène":0, "Thomas":1, "Isabelle":2, "Benjamin":3}
print(my_dict["Lauraine"])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[258], line 2
      1 my_dict = {"Laurène":0, "Thomas":1, "Isabelle":2, "Benjamin":3}
----> 2 print(my_dict["Lauraine"])

KeyError: 'Lauraine'
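
Both errors can be avoided by checking before accessing the element. The sketch below reuses the list and the dictionary from the examples above; the default value -1 passed to .get() is an arbitrary choice for this illustration.

```python
my_list = [0, 1, 2, 3]
position = 5

# Valid positions go from 0 to len(my_list) - 1
if position < len(my_list):
    print(my_list[position])
else:
    print(f"Position {position} does not exist: the list has {len(my_list)} elements")

my_dict = {"Laurène": 0, "Thomas": 1, "Isabelle": 2, "Benjamin": 3}

# 'in' tests whether a key exists
print("Lauraine" in my_dict)     # False

# .keys() lists the existing keys, which helps to spot a typo
print(list(my_dict.keys()))

# .get() returns None (or a default value) instead of raising a KeyError
print(my_dict.get("Lauraine"))   # None
value = my_dict.get("Lauraine", -1)
print(value)                     # -1
```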

Common errors (4/6)

  • IndentationError: You may have forgotten to indent a part of your code.
    • How to debug: Check that the code inside each block (for, if, def, etc.) is indented, and that you did not mix tabs and spaces.
    • Example:
for i in [0,1,2,3]:
print(i)
  Cell In[259], line 2
    print(i)
    ^
IndentationError: expected an indented block after 'for' statement on line 1
  • AttributeError: You may have used the wrong method for an object.
    • How to debug: Check your variable type and the method documentation.
    • Example:
my_variable = 1
my_variable.upper()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[260], line 2
      1 my_variable = 1
----> 2 my_variable.upper()

AttributeError: 'int' object has no attribute 'upper'
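As an illustration, here is a corrected version of the loop, followed by a sketch of how to inspect an object before calling a method on it. The variable my_text is an assumption added for the example.

```python
# Corrected loop: the body of the 'for' statement is indented with 4 spaces
for i in [0, 1, 2, 3]:
    print(i)

my_variable = 1

# type() tells you which kind of object you are working with
print(type(my_variable))  # <class 'int'>

# hasattr() checks whether a method exists without raising AttributeError
print(hasattr(my_variable, "upper"))       # False
print(hasattr(str(my_variable), "upper"))  # True

# upper() is a string method, so it works on a string
my_text = "acgt"
print(my_text.upper())  # ACGT
```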

Common errors (5/6)

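  • FileNotFoundError: The file you are trying to access does not exist, is in a different folder, or the file path is wrong.
    • How to debug: Check where your file is and compare it with the path you gave. A relative path such as "my_file.txt" is searched for in the current working directory.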
    • Example:
my_file = open("my_file.txt", "r")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[261], line 1
----> 1 my_file = open("my_file.txt", "r")

File /usr/lib/python3/dist-packages/IPython/core/interactiveshell.py:310, in _modified_open(file, *args, **kwargs)
    303 if file in {0, 1, 2}:
    304     raise ValueError(
    305         f"IPython won't let you open fd={file} by default "
    306         "as it is likely to crash IPython. If you know what you are doing, "
    307         "you can use builtins' open."
    308     )
--> 310 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'my_file.txt'
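One possible way to check where Python is looking for the file, sketched with the standard pathlib module. The file name is the one from the example above; the variable my_path is introduced for illustration.

```python
from pathlib import Path

# Path.cwd() is the folder from which relative paths are resolved
print(Path.cwd())

my_path = Path("my_file.txt")

# Test whether the file exists before opening it
if my_path.exists():
    with open(my_path, "r") as my_file:
        print(my_file.read())
else:
    print(f"{my_path} was not found in {Path.cwd()}")
```

If the file is in another folder, give its full path or move the file next to your script or notebook.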

Common errors (6/6)

  • ModuleNotFoundError: You may have forgotten to install the package before importing it, or you may have made a mistake when typing its name.
    • How to debug: Check the spelling of the module name; if the package is missing, install it with pip install.
    • Example:
import sqlfactory
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[262], line 1
----> 1 import sqlfactory

ModuleNotFoundError: No module named 'sqlfactory'
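To check whether a module is available before importing it, a sketch like the following may help. It relies on importlib.util.find_spec from the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Hypothetical helper: True if Python can find the top-level module."""
    # find_spec() returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


print(is_installed("math"))        # True: math is part of the standard library
print(is_installed("sqlfactory"))  # depends on what is installed on your machine
```

Note that the name used with pip install is sometimes different from the name used in the import statement, so check the package documentation.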

How to debug?

  • What should you do if an error occurs?
    1. Read the message: it explains what is wrong.
    2. Try to debug your code to get a better understanding (or to fix it if you can!).
    3. Type the error in Google: look for Stack Overflow links, they are helpful.
    4. If the 3 tips above do not work, you may ask an AI chatbot for help. If you give it the full error message, it will most likely tell you what is wrong.
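Steps 1 and 2 can be sketched as follows, under the assumption that you want to keep the program running while you investigate. The variable error_name is introduced for illustration.

```python
my_list = [0, 1, 2, 3]
index = 5

try:
    print(my_list[index])
except Exception as error:
    # Step 1: the exception type and its message are the first things to read
    error_name = type(error).__name__
    print(f"{error_name}: {error}")  # IndexError: list index out of range
    # Step 2: printing the variables involved often reveals the problem
    print(f"index = {index}, len(my_list) = {len(my_list)}")
```

The printed line "IndexError: list index out of range" is also the text to search for online in step 3.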

Bring Your Own Project

Suggested exercises

If you don’t have any ideas for a program or analysis to implement, you can choose from the following options:

  • write a program (using basic Python concepts: lists, dictionaries, conditionals, functions, plots, etc.)
    • code the Game of Life
    • process and extract information from a dataset of non-coding RNAs
  • analyse a dataset (manipulating dataframes and making plots)

Conclusion

Take home message

  • Read the doc!

  • Practise!

  • Do not reinvent the wheel: use existing tools

  • Use AI assistants with caution! (copy-pasting will not work every time)

Special thanks


Fabien KON-SUN-TACK
Former Bilille engineer who worked on this training.

Satisfaction survey

In order to help us improve our training, we would be grateful if you could take a few minutes to complete the following satisfaction survey.

(You can answer in English or French.)