Introduction to Python

Laurène AUBRY

laurene.aubry@univ-lille.fr

Thomas BINET

thomas.binet@univ-lille.fr

Isabelle GUIGON

isabelle.guigon@univ-lille.fr

Benjamin MARSAC

benjamin.marsac@univ-lille.fr

April 8, 2026

Preamble

Practical information

Schedule :

March 30th to April 1st
9am to 5pm

Breaks :

every half-day
lunch for 1 hour, around 12:30

Lunch :

microwaves available if needed
possibility to buy sandwiches or hot dishes on the campus
possibility to eat in the building’s cafeteria


Bilille

Bilille is the Lille bioinformatics and biostatistics platform, within the UAR 2014 - US 41 “Plateformes Lilloises en Biologie et Santé”.

PLBS includes 8 platforms, providing access to expertise and equipment to support research in biology and health.

Bilille currently has 10 full-time engineers, led by Jimmy Vandel (research engineer CNRS), Ségolène Caboche (research engineer University of Lille) and Mamadou-Dia Sow (research engineer University of Lille).

Our missions are to :

support scientific projects
organise training courses
provide access to cloud computing resources
ensure access to software resources
lead scientific and technical outreach activities

Quick presentation

What about you ?

name
profile
labs
experience with programming (in a few words): have you already tried using Python or another language?
your expectations regarding this training ?

Introduction

Python in a few words

Open-source interpreted programming language, first released in 1991
Very large number of libraries developed by a community of contributors
The Python Package Index (PyPI) is a repository of software for the Python programming language, with currently > 600,000 projects
Current major version is python3, but lots of scripts are still in python2
This training is based on python3

Installation

Windows: Follow this link and download the latest release of Python 3.
Linux: Python 3 is usually already installed; do not try to update the system version unless you are an experienced user.
Mac: Follow this link and download the latest release of Python 3.

Integrated development environment

Programs are written as scripts.

Python code files end with the .py extension.

Code can also be written in a notebook, but notebooks make it harder to develop larger projects.

Integrated development environments (IDEs) help with project management and scripting in many languages, such as Python, R, Julia, C…

One of the most popular IDEs is VScode, which you are invited to download for use during the course.

Visual Studio Code

VScode offers extensions and utility modules that are necessary or helpful for development.
Extensions can be downloaded from the corresponding tab.
Development package for Python: demystifying-javascript.python-extensions-pack
Extension for Jupyter: ms-toolsai.jupyter
You are free to install other extensions (for example: indent-rainbow…).

Extensions help you write scripts, but too many extensions can slow down your IDE! Use them sparingly.

Let’s get down to business

Variables

Variables presentation

Variables are used to store values in memory for later use.
A variable can contain any object type.
- 5 : integer
- 3.1415 : real number (float)
- "abc" : string
- True : boolean (a boolean is a variable that can only take values True or False)
- print : function
- and so many other types

Variables are fundamental in programming. You must understand their purpose and how they work in order to obtain the desired results.

Regarding float variables, please note that the separator for decimals is a period (.), not a comma (,).

Variable assignment

To assign a value to a variable, you use the assignment operator (equals sign, =) after the variable name.
a = 5 : this command assigns the value 5 to the variable a.
a = 7 : if the same variable is used again, the previous value is overwritten. The object type can change if the variable is reused / overwritten.
It is possible to have as many variables as memory space allows.
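
As a minimal sketch of the points above, the built-in type() function shows that the type of a variable follows the value it currently holds. The variable name a is reused from the slide; the values are only illustrative.

```python
a = 5
type_before = type(a)   # int

a = "five"              # the previous value (5) is overwritten
type_after = type(a)    # str

print(type_before, type_after)
```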

Variable naming conventions

It is important to give meaningful names to variables so that you know at a glance what they are used for.
It is preferable to have long variable names rather than short, meaningless names.
Variable names must start with a letter or an underscore, and should contain only unaccented letters, digits and underscores. Avoid accented characters like é à ö; symbols such as & or - are not allowed.
Variable names are case-sensitive, which means that upper and lower case letters are distinguished.
myvariable and MYVARIABLE are different objects.
Main conventions for variables with several words:
- snake case: average_car_speed = 50
- camel case: averageCarSpeed = 50
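
One possible way to check whether a name is valid is the string method isidentifier(), together with the standard keyword module. This is only a sketch; the candidate names are examples.

```python
import keyword

valid_snake = "average_car_speed".isidentifier()
starts_with_digit = "2nd_variable".isidentifier()
contains_hyphen = "my-variable".isidentifier()
print(valid_snake, starts_with_digit, contains_hyphen)   # True False False

# A reserved word is a valid identifier but still cannot be used as a variable name
print("for".isidentifier(), keyword.iskeyword("for"))    # True True
```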

Brainstorming time

What do you expect to be displayed with the following examples?

"my_variable"

'my_variable'

This is not a variable name but a string object. Variable names are not written between quotes.

b

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 b

NameError: name 'b' is not defined

It raises a NameError because the variable b hasn’t been assigned yet.

b = "my_variable"
b

'my_variable'

c = "my_variable"
c = 3.1415
c

3.1415

Brainstorming time

What do you expect to be displayed with the following examples?

2nd_variable = "my_second_variable"
2nd_variable

  Cell In[5], line 1
    2nd_variable = "my_second_variable"
    ^
SyntaxError: invalid decimal literal

It raises a SyntaxError because the variable name starts with a digit.

second_variable = "my_second_variable"
second_variable

'my_second_variable'

String concatenation

Strings can be concatenated using the plus sign (+).

'patient_' + '1'

'patient_1'

A string can only be concatenated with another string.

'patient_' + 1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 'patient_' + 1

TypeError: can only concatenate str (not "int") to str

You can concatenate two variables together if they both contain strings.

a = "patient_"
b = "1"
a + b

'patient_1'

If you want to repeat a string a certain number of times, you can use the asterisk (*) with an integer.

laugh = "ha" * 3
laugh

'hahaha'

Variable conversion

Integers and floats can be converted to string with str(variable).

a = 1
'patient_' + str(a)

'patient_1'

In some cases, strings can be converted to integer or float with int(variable) or float(variable).

int('3')

3

float('3.5')

3.5

int('3.5')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 1
----> 1 int('3.5')

ValueError: invalid literal for int() with base 10: '3.5'

Variable display

The print() function can be used to display what is between the parentheses.

a = 3.1415
print(a)

3.1415

A quick way to combine a string with an integer or a float is to use formatted string literals, also called f-strings.

a = 1
print(f'patient_{a}')

patient_1

Do not forget the f before the quotes.

The variable or operation result between the {} will automatically be converted to a string.

You can also separate strings and variables with commas. This adds spaces automatically.

number = 2
price = 12
print("She purchased", number, "ice creams for", price, "euros.")
print(f"She purchased {number} ice creams for {price} euros.")

She purchased 2 ice creams for 12 euros.
She purchased 2 ice creams for 12 euros.
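
f-strings can also contain an expression and a format specification after a colon. A short sketch reusing number and price from the example above; the .2f specification (two decimal places) and the weekly split are just illustrative choices.

```python
number = 2
price = 12
total = number * price

# {total / 7:.2f} evaluates the division, then formats it with two decimals
message = f"Total: {total} euros, or {total / 7:.2f} euros per day over a week."
print(message)
```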

Variable display - backslash

Single ('') or double ("") quotes can be used around a string, but the opening and closing quotes must be of the same type.

If you need additional quotes inside a string, you can use the other type of quotes, or escape them with backslash (\).

print("When I arrive in the morning, I say 'good morning' to everyone.")
print('When I arrive in the morning, I say "good morning" to everyone.')
print("When I arrive in the morning, I say \"good morning\" to everyone.")

When I arrive in the morning, I say 'good morning' to everyone.
When I arrive in the morning, I say "good morning" to everyone.
When I arrive in the morning, I say "good morning" to everyone.

Variable display - backslash

You can also use a raw string (r-string) to print exactly what is between the quotes. This comes in handy when a string contains many backslashes (\), as in Windows paths.

print(r"C:\Users\Georges\Documents\test.txt")

C:\Users\Georges\Documents\test.txt

You can also use two backslashes (\\). The first backslash (\) escapes the second one, so it is interpreted as a literal backslash.

print("C:\\Users\\Georges\\Documents\\test.txt")

C:\Users\Georges\Documents\test.txt

If you don’t use one of these methods, sequences such as \U or \t are interpreted as escape sequences, and you may get an error.

print("C:\Users\Georges\Documents\test.txt")

  Cell In[21], line 1
    print("C:\Users\Georges\Documents\test.txt")
          ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Arithmetic operations

The following arithmetic operators are available in Python:

number1 = 5
number2 = 3

addition:

addition = number1 + number2
print(f'{number1} + {number2} = {addition}')

5 + 3 = 8

subtraction:

subtraction = number1 - number2
print(f'{number1} - {number2} = {subtraction}')

5 - 3 = 2

multiplication:

multiplication = number1 * number2
print(f'{number1} * {number2} = {multiplication}')

5 * 3 = 15

division:

division = number1 / number2
print(f'{number1} / {number2} = {division}')

5 / 3 = 1.6666666666666667

Arithmetic operations

integer division (quotient of the division):

integer_division = number1 // number2
print(f'{number1} // {number2} = {integer_division}')

5 // 3 = 1

modulo (remainder from integer division):

modulo = number1 % number2
print(f'{number1} % {number2} = {modulo}')

5 % 3 = 2

power:

power = number1 ** number2
print(f'{number1} ** {number2} = {power}')

5 ** 3 = 125

A few caveats

A simple division will always return a float even if the result is an integer.

print(f"8/2 = {8/2} and its type is {type(8/2)}")
print(f"8//2 = {8//2} and its type is {type(8//2)}")

8/2 = 4.0 and its type is <class 'float'>
8//2 = 4 and its type is <class 'int'>

The spaces before and after the operator are optional but helpful for readability.
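
Integer division and modulo are linked: for integers a and b, a == (a // b) * b + a % b. The sketch below checks this relation and uses the built-in divmod(), which returns both results at once. Note that with a negative operand, // rounds down (toward minus infinity), not toward zero.

```python
number1 = 5
number2 = 3

quotient, remainder = divmod(number1, number2)    # same as (5 // 3, 5 % 3)
print(quotient, remainder)                        # 1 2
print(quotient * number2 + remainder == number1)  # True

# With a negative number, // rounds down, not toward zero
print(-5 // 3, -5 % 3)                            # -2 1
```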

Variable operators

Operations can be performed directly on variables by putting the operator in front of the equal (=) symbol. The two following syntaxes are equivalent:

counter = 0
counter += 1
print(counter)

1

counter2 = 0
counter2 = counter2 + 1
print(counter2)

1

This can be done with all arithmetic operators.

counter -= -2
print(counter)

3

counter2 = counter2 - (-2)
print(counter2)

3

counter *= 6
print(counter)

18

counter2 = counter2 * 6
print(counter2)

18

counter /= 4.5
print(counter)

4.0

counter2 = counter2 / 4.5
print(counter2)

4.0

counter //= 2
print(counter)

2.0

counter2 = counter2 // 2
print(counter2)

2.0

counter %= 2
print(counter)

0.0

counter2 = counter2 % 2
print(counter2)

0.0

Comparison operators

To compare values we can use the following operators:

> : strictly greater than

< : strictly less than

>= : greater than or equal to

<= : less than or equal to

== : equal to

!= : not equal to

The result of a comparison is a boolean value.

3 < 9

True

1/3 < 1/4

False

Do not confuse '==' (test equality) and '=' (assign a value to a variable).

Comparison operators

They can be used to compare numbers (int or float) in numerical order, or strings in lexicographical order (based on the Unicode code point of each character, so uppercase letters come before lowercase letters).

'HELLO' == 'hello'

False

'a' < 'b'

True

'a' < 'B'

False

'Ben' < 'Benjamin'

True
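
The built-in ord() function returns the Unicode code point of a character, which is what string comparison relies on. A short sketch that explains the results above:

```python
code_a = ord('a')
code_b = ord('b')
code_upper_b = ord('B')
print(code_a, code_b, code_upper_b)   # 97 98 66

# 'a' (97) is greater than 'B' (66), so 'a' < 'B' is False
print('a' < 'B')

# Converting both sides to the same case gives a case-insensitive comparison
print('a'.lower() < 'B'.lower())
```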

Try to guess the answer :

745 >= 3.1415

True

"Sun" == "Moon"

False

"cat" = "dog"

  Cell In[51], line 1
    "cat" = "dog"
    ^
SyntaxError: cannot assign to literal here. Maybe you meant '==' instead of '='?

Methods

Methods are associated with a variable type.

They can create new objects that can be assigned to a variable, or modify existing objects.

Each type of variable has its own set of methods.

Syntax: variable.method(*optional parameters*).

Methods

Here are some examples of useful methods for strings.

Consider the following string:

my_str = "       THIS IS a string  "
print(f'*{my_str}*')

*       THIS IS a string  *

Convert all characters to uppercase:

my_str_upper = my_str.upper()
print(f'*{my_str_upper}*')

*       THIS IS A STRING  *

Convert all characters to lowercase:

my_str_lower = my_str.lower()
print(f'*{my_str_lower}*')

*       this is a string  *

Remove extra spaces at the beginning and end:

my_str_strip = my_str.strip()
print(f'*{my_str_strip}*')

*THIS IS a string*

Replace part of the string with other characters:

my_str_replace = my_str.replace("string", "sentence")
print(f'*{my_str_replace}*')

*       THIS IS a sentence  *

There are methods for other types of variables, which we will cover in another chapter.

Methods

Applying a method to a string does not change the string itself; it must be reassigned to a variable (but the same variable name can be reused).

Without reassignment:

my_str = "this is another string"
print(my_str)
my_str.upper()
print(my_str)

this is another string
this is another string

With reassignment:

print(my_str)
my_str_upper = my_str.upper()
print(my_str_upper)

this is another string
THIS IS ANOTHER STRING
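
Because each string method returns a new string, calls can be chained. A minimal sketch, where raw_label and clean_label are hypothetical names:

```python
raw_label = "   Patient_One  "

# strip() runs first, then lower(), then replace(); each works on the previous result
clean_label = raw_label.strip().lower().replace("_", " ")

print(f"*{clean_label}*")   # *patient one*
print(f"*{raw_label}*")     # the original string is unchanged
```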

Tips before going further: Comments

Comments

Comments can be written in your script to help you describe a difficult part for instance.
Comments are not executed.
Comments are unnecessary and in fact distracting if they state the obvious.
Only write relevant comments.

Comments

Inline comments are written after a hash sign (#).
Code placed before the # is executed normally; everything after the # on the same line is ignored.

# this is an inline comment
a = 5 # the code before the "#" will be executed normally

Block comments

# Block comments generally apply to some (or all) code that follows them,
# and are indented to the same level as that code.
# Each line of a block comment starts with a # and a single space

Documentation strings are written between triple quotes:

"""
Documentation strings (a.k.a. “docstrings”) are used to
write a description for all public modules, functions, classes, and methods.
This is often used to write a function description.
"""

For a multi-line docstring, the triple quotes are written on lines by themselves; for a one-line docstring, they are written on the same line as the text.

""" This is a one-liner docstring. """

Docstrings are usually used when writing a function.
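
As an illustration only (functions are covered later), here is a hypothetical function with a docstring and an inline comment. The docstring is stored in the __doc__ attribute of the function, and it is the text that help() displays.

```python
def celsius_to_kelvin(temperature):
    """Return the temperature converted from Celsius to Kelvin."""
    # 0 degrees Celsius corresponds to 273.15 Kelvin
    return temperature + 273.15


print(celsius_to_kelvin.__doc__)
print(celsius_to_kelvin(25))
```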

Summary of the variables section

The most common object types are integer, float, string, boolean, function, …

To assign a value to a variable, use the equal sign (=).

To display a variable or some text, use the print() function.

Do not mix single quotes (') and double quotes (") around the same string, and do not forget the f prefix for f-strings!

Mathematical operations can be performed on variables with an operation sign (+, -, *, /, //, %, **):

Examples: my_variable *= 5 or my_variable = my_variable * 5.

You can compare values with a comparison operator (>, <, >=, <=, ==, !=).

Do not confuse = (variable assignment) and == (test equality)!

To comment a line, write # before your comment; triple quotes (""") are used for docstrings.

Let’s practise

Please open file 001_practical_variables.py

Lists

Lists presentation

You can use lists to store multiple values, in a defined order, in the same variable.

An empty list can be initialised with [] or list().

A list can also be initialised with values:

numbers = [1, 3, 5, 7, 9]
print(numbers)

[1, 3, 5, 7, 9]

It is possible to create a list from a string. In this case, each element of the list will contain a single character.

a_string = 'I have two cute cats.'
a_list_from_a_string = list(a_string)
print(a_list_from_a_string)

['I', ' ', 'h', 'a', 'v', 'e', ' ', 't', 'w', 'o', ' ', 'c', 'u', 't', 'e', ' ', 'c', 'a', 't', 's', '.']

You can store different types of data in the same list.

my_list = ["Mr_Pi", 3.1415, 5, True]

Lists - Indexing

Each item of a list can be accessed by giving its index, starting from 0 to n-1, with n the number of items in the list.

The number of items in the list is given by len(numbers).

n = len(numbers)

print(f'numbers = {numbers}')
print(f'There are {n} elements in the numbers list.')

print(f'First item is: {numbers[0]}')
print(f'Second item is: {numbers[1]}')
print(f'Last item is: {numbers[n-1]}')

numbers = [1, 3, 5, 7, 9]
There are 5 elements in the numbers list.
First item is: 1
Second item is: 3
Last item is: 9

Lists - Indexing

Each item of a list can also be accessed in reverse order, from -1 (last item) to -n (first item).

print(f'Another way to get last item is: {numbers[-1]}')
print(f'Second to last item is: {numbers[-2]}')
print(f'Last item is: {numbers[len(numbers)-1]}')
print(f'Another way to get first item is: {numbers[-len(numbers)]}')

Another way to get last item is: 9
Second to last item is: 7
Last item is: 9
Another way to get first item is: 1

If you give an index that does not exist in the list, an IndexError is raised.

numbers[6]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[66], line 1
----> 1 numbers[6]

IndexError: list index out of range

Brainstorming time

Please consider the following list:

amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

What does the following element contain? amino_acids[1]

‘Ala’

‘Arg’

‘Gln’

print(amino_acids[0])

Ala

print(amino_acids[1])

Arg

print(amino_acids[-1])

Gln

Remember that list numbering starts at zero and that the index “-1” allows you to access the last item in the list.

How can you access the following element? 'Glu'

amino_acids[6]

amino_acids[5]

amino_acids[-2]

print(amino_acids[6])

Gln

print(amino_acids[5])

Glu

print(amino_acids[-2])

Glu

Converting strings to lists (and vice versa)

In the previous chapter, we discovered a few methods for strings.

There are string methods that work with lists.

.join() turns a list of strings into a single string:

my_list = ["I", "love", "Python", "!"]
str_spaces = " ".join(my_list)
print(str_spaces)

I love Python !

my_list = ["I", "love", "Python", "!"]
str_underscores = "_".join(my_list)
print(str_underscores)

I_love_Python_!

You can use any separator with the .join() method.
It just needs to be a string.

.split() turns a string into a list:

split_spaces_list = str_spaces.split()
print(split_spaces_list)

['I', 'love', 'Python', '!']

split_underscores_list = str_underscores.split(sep = "_")
print(split_underscores_list)

['I', 'love', 'Python', '!']

If no separator is given in .split(), the string is split on any run of whitespace: spaces ( ), tabs (\t), new lines (\n), carriage returns (\r) or form feeds (\f).
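
A sketch of a typical use: splitting a comma-separated line into a list, then joining it back with another separator. The content of line is made up for the example.

```python
line = "Ala,Arg,Asp,Asn"
residues = line.split(",")
print(residues)                 # ['Ala', 'Arg', 'Asp', 'Asn']

tab_separated = "\t".join(residues)
print(tab_separated)

# Without a separator, any run of whitespace is treated as a single delimiter
words = "  I \t love\n Python ".split()
print(words)                    # ['I', 'love', 'Python']
```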

Some operations on lists

Modify an item of the list using its index:

numbers = [1, 3, 7, 7, 9]
numbers[2] = 5
print(numbers)

[1, 3, 5, 7, 9]

Add a new item to the end of a list:

numbers.append(11)
print(numbers)

[1, 3, 5, 7, 9, 11]

Add a new item at a specific position:

numbers.insert(2, 7)
print(numbers)

[1, 3, 7, 5, 7, 9, 11]

Remove an item of a list and return it: removed_item = numbers.pop()
If no index is given, the removed item is the last one.
You can also provide the index of the item to be removed.

removed_item = numbers.pop(1)
print(removed_item)
print(numbers)

3
[1, 7, 5, 7, 9, 11]

After using pop(), the list items are renumbered.

numbers[1]

7

Some operations on lists

Remove an item of a list by using its value (not the index). Only the first matching item is removed; if the value exists several times in the list, the process has to be repeated.

print(numbers)
numbers.remove(7)
print(numbers)

[1, 7, 5, 7, 9, 11]
[1, 5, 7, 9, 11]

Reverse a list:

numbers.reverse()
print(numbers)

[11, 9, 7, 5, 1]

Copy a list:

odds = numbers.copy()
print(odds)

[11, 9, 7, 5, 1]

Lists are mutable objects which means you can modify them directly.

Brainstorming time

Consider the following list:

numbers = [1, 2, 3, 4, 5]

We would like to create a copy of this list, called values:

values = numbers

Then we need to remove the second element from the numbers list:

numbers.pop(1)

print(numbers)

[1, 3, 4, 5]

Now let’s check the content of values. What do you expect to get?

print(values)

[1, 3, 4, 5]

Here, values is just another name for the same list object as numbers, so any change made through one name is visible through the other.

Use the .copy() method when an independent copy is needed.

Brainstorming time

Let’s try again.

numbers = [1, 2, 3, 4, 5]
values = numbers.copy()
print(values)

[1, 2, 3, 4, 5]

This time we would like to remove the second-to-last element from the values list.
Which command(s) will work ? :

values.pop(-2)

values.pop(3)

values.remove(3)

values.pop(-2)
print(values)

[1, 2, 3, 5]

values = numbers.copy()

values.pop(3)
print(values)

[1, 2, 3, 5]

values = numbers.copy()

values.remove(3)
print(values)

[1, 2, 4, 5]

Now let’s check the content of numbers.

print(numbers)

[1, 2, 3, 4, 5]

The numbers list has not been affected by the changes made to the values list.
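
The difference between the two brainstorming examples can be checked with the is operator, which tests whether two names refer to the same object, whereas == only compares contents. A minimal sketch (alias and independent are hypothetical names):

```python
numbers = [1, 2, 3, 4, 5]
alias = numbers                # second name for the same list
independent = numbers.copy()   # new list with the same items

print(alias is numbers)         # True
print(independent is numbers)   # False
print(independent == numbers)   # True: same contents

numbers.pop(1)
print(alias)         # [1, 3, 4, 5]
print(independent)   # [1, 2, 3, 4, 5]
```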

Nested lists

A list can contain any Python variable so it can also contain other lists.
A single list may contain numbers, strings, and anything else.

numbers = [1, 3, 5, 7, 9, [11, 13, 15, 17, 19]]

numbers is a nested list.
numbers[5] is a simple list.
numbers[5][0] is an integer.

A matrix can be stored in a nested list.

matrix = [[1, 2, 3], 
          [4, 5, 6], 
          [7, 8, 9]]

matrix is a nested list.
matrix[0], matrix[1] and matrix[2] are simple lists.
matrix[0][0] is an integer.

Another way to assign a nested list to a variable is to write it on a single line.

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
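
A short sketch of reading and updating one cell of the matrix above: the first index selects the row, the second selects the column. The variable names n_rows and n_cols are only illustrative.

```python
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

n_rows = len(matrix)        # number of inner lists
n_cols = len(matrix[0])     # length of the first row
print(n_rows, n_cols)       # 3 3

print(matrix[1][2])         # row 1, column 2 -> 6
matrix[1][2] = 60           # update a single cell
print(matrix[1])            # [4, 5, 60]
```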

Brainstorming time

Consider the following nested list:

animals = [
  ["eagle", "pigeon", "owl", "seagull"],
  ["shark", "whale", "seahorse", "clownfish"],
  ["rabbit", "giraffe", "cat", "sheep"]
]

What command would you write to get :

the eagle ?

animals[0][0]

'eagle'

the clownfish ?

animals[1][3]

'clownfish'

the giraffe ?

animals[2][1]

'giraffe'

Which animal will you get if you type :

animals[2][2] ?

animals[2][2]

'cat'

animals[0][1] ?

animals[0][1]

'pigeon'

animals[1][0] ?

animals[1][0]

'shark'

Slicing

You can get a subset of a list by specifying ranges of values with a colon (:) in brackets.

Syntax: my_list[start:end:step]: will slice my_list from start to end (excluded) with a step of step (default value 1 if not provided).

Some examples with the following list:

amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

Returns the first two elements of the list.

amino_acids[0:2]

['Ala', 'Arg']

Returns every other element, from the second element up to the fourth element (index 4 is excluded).

amino_acids[1:4:2]

['Arg', 'Asn']

Returns every other element from the complete list, starting with the first element.

amino_acids[::2]

['Ala', 'Asp', 'Cys', 'Gln']

Slicing

print(amino_acids)

['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

Returns the complete list except for the first element.

amino_acids[1:]

['Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

Returns the complete list except for the last element.

amino_acids[:-1]

['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu']

Returns the complete list in reverse order.

amino_acids[::-1]

['Gln', 'Glu', 'Cys', 'Asn', 'Asp', 'Arg', 'Ala']

The slicing [::-1] returns a new list in reverse order and leaves the original list unchanged, while the method .reverse() reverses the list in place.

print(amino_acids)
print(amino_acids[::-1])
print(amino_acids)

['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
['Gln', 'Glu', 'Cys', 'Asn', 'Asp', 'Arg', 'Ala']
['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

print(amino_acids)
amino_acids.reverse()
print(amino_acids)

['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
['Gln', 'Glu', 'Cys', 'Asn', 'Asp', 'Arg', 'Ala']
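
Two further properties of slicing, sketched on the original amino_acids list: negative bounds count from the end, and a slice bound beyond the end of the list does not raise an IndexError, unlike a single index.

```python
amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']

last_three = amino_acids[-3:]
print(last_three)             # ['Cys', 'Glu', 'Gln']

beyond = amino_acids[5:100]   # the end bound is clipped to the list length
print(beyond)                 # ['Glu', 'Gln']
```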

Delete a list

If you don’t need a list anymore, you can delete it with the del keyword:

amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
del(amino_acids)
print(amino_acids)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[118], line 3
      1 amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
      2 del(amino_acids)
----> 3 print(amino_acids)

NameError: name 'amino_acids' is not defined

It is also possible to delete part of the list using slicing:

amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
del(amino_acids[:-3])
print(amino_acids)

['Cys', 'Glu', 'Gln']

Tuples

Tuples are similar to lists but they cannot be modified. They are immutable objects.

An empty tuple can be initialised with () or tuple().

A tuple can also be initialised with values:

values = (2.0, 7.5, 8.4, 3.1)
print(f"values = {values} and its type is {type(values)}")

values = (2.0, 7.5, 8.4, 3.1) and its type is <class 'tuple'>

If you try to modify a tuple, Python won’t let you.

colours = ['red', 'orange', 'yello']
colours[2] = "yellow"
print(colours)

['red', 'orange', 'yellow']

colours = ('red', 'orange', 'yello')
colours[2] = "yellow"
print(colours)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[122], line 2
      1 colours = ('red', 'orange', 'yello')
----> 2 colours[2] = "yellow"
      3 print(colours)

TypeError: 'tuple' object does not support item assignment

If you need to modify the items, make sure to create a list with [] or list(), not a tuple.
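
If the content of an existing tuple has to change, one option is to convert it to a list, modify the list, and convert it back. A sketch reusing the colours example (colours_list is a hypothetical name):

```python
colours = ('red', 'orange', 'yello')

colours_list = list(colours)     # mutable copy of the tuple
colours_list[2] = 'yellow'
colours = tuple(colours_list)    # new tuple built from the corrected list

print(colours)                   # ('red', 'orange', 'yellow')
```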

Summary of the lists section

A list is a variable that can store multiple values in a defined order.

To initialise an empty list, you can use [] or list().

List indexing starts at 0 from the left and starts at -1 from the right.

To access (or update) the element at position i, use my_list[i] (or my_list[i] = elt).

To add an element :
- at the end : my_list.append(elt)
- at position i : my_list.insert(i, elt)

To remove an element :
- at the end : removed = my_list.pop()
- at position i : removed = my_list.pop(i)
- by its value : my_list.remove(elt)

To get a subset of your list, you can use my_list[start:end:step].

To delete your list, use del(my_list).

Let’s practise

Please open file 002_practical_lists.py

Dictionaries

Dictionaries presentation

Dictionaries are used to store data in the form of key:value pairs. Values are accessed by key rather than by position (since Python 3.7, dictionaries do preserve insertion order).

Each key is unique. If a key is reused, its value is overwritten.

An empty dictionary can be initialised with {} or dict().

A dictionary can also be initialised directly with data:

animal_sounds = {'cat': 'meow', 'dog':'woof', 'cow':'moo'}

Accessing dictionaries

Each item of a dictionary can be accessed by giving its key:

with the key in brackets:

print(f"Cat says {animal_sounds['cat']}.")
print(f"Fox says {animal_sounds['fox']}.")

Cat says meow.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[124], line 2
      1 print(f"Cat says {animal_sounds['cat']}.")
----> 2 print(f"Fox says {animal_sounds['fox']}.")

KeyError: 'fox'

If you give a key that is not present in the dictionary it will raise an error.

with get(), which lets you provide a default value in case the key is not in the dictionary:
This method only reads an item; it cannot modify it.
Syntax: my_dict.get(key, default_value)

animal_sounds.get('fox', 'This sound is not registered.')

'This sound is not registered.'

Dictionaries are not indexed as lists are.
my_dict[1] will raise an error unless there is a key called 1.
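
To test whether a key exists before accessing it, the in operator can be used. A short sketch with the animal_sounds dictionary; note that get() without a default value returns None instead of raising a KeyError.

```python
animal_sounds = {'cat': 'meow', 'dog': 'woof', 'cow': 'moo'}

has_cat = 'cat' in animal_sounds
has_fox = 'fox' in animal_sounds
print(has_cat, has_fox)           # True False

fox_sound = animal_sounds.get('fox')
print(fox_sound)                  # None
```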

Accessing dictionaries

Dict keys must be immutable objects like strings, numbers or tuples.
You can get all the keys with .keys().

animal_sounds.keys()

dict_keys(['cat', 'dog', 'cow'])

Dict values can contain items of different types, including other dictionaries.
You can get all the values with .values().

animal_sounds.values()

dict_values(['meow', 'woof', 'moo'])

You can get all the key:value pairs as tuples using .items():

animal_sounds.items()

dict_items([('cat', 'meow'), ('dog', 'woof'), ('cow', 'moo')])

Some operations on dictionaries

Get the number of key:value pairs:

len(animal_sounds)

3

Add an element or update an existing one:

animal_sounds['lion'] = 'roar'
print(animal_sounds)

{'cat': 'meow', 'dog': 'woof', 'cow': 'moo', 'lion': 'roar'}

animal_sounds.update({'rooster':'cock-a-doodle-doo'})
print(animal_sounds)

{'cat': 'meow', 'dog': 'woof', 'cow': 'moo', 'lion': 'roar', 'rooster': 'cock-a-doodle-doo'}

Some operations on dictionaries

The pop method can be used to delete a key:value pair and store the value in a variable.

was_removed = animal_sounds.pop('dog')
print(f"The removed value is {was_removed}.")
print(f"The dictionary contains {animal_sounds}.")

The removed value is woof.
The dictionary contains {'cat': 'meow', 'cow': 'moo', 'lion': 'roar', 'rooster': 'cock-a-doodle-doo'}.

If we want to remove a key:value pair without keeping its value, or the whole dictionary, we can use the del keyword:

for a single key:

del(animal_sounds['cow'])
animal_sounds

{'cat': 'meow', 'lion': 'roar', 'rooster': 'cock-a-doodle-doo'}

for the whole dictionary:

del(animal_sounds)
animal_sounds

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[134], line 2
      1 del(animal_sounds)
----> 2 animal_sounds

NameError: name 'animal_sounds' is not defined

Brainstorming time

In this example, we want to create a dictionary named fruits_shop, with fruits as keys and numbers as values. These numbers represent the quantity of each fruit in the shop.

We received 10 apples, 5 pears and 1 banana.

How would you implement it ?

fruits_shop = {}
fruits_shop["apple"] = 10
fruits_shop["pear"] = 5
fruits_shop["banana"] = 1

print(fruits_shop)

{'apple': 10, 'pear': 5, 'banana': 1}

With this syntax, we must first initialise the dictionary and then add each element.

fruits_shop = {
  "apple":10,
  "pear": 5,
  "banana": 1
}
print(fruits_shop)

{'apple': 10, 'pear': 5, 'banana': 1}

With this syntax, the dictionary is initialised and populated at the same time.

Nice! But in the meantime, we received 45 more bananas and 10 grapes… and then someone ate an apple (oops).

fruits_shop["banana"] += 45
fruits_shop["grape"] = 10
fruits_shop["apple"] -= 1
print(fruits_shop)

{'apple': 9, 'pear': 5, 'banana': 46, 'grape': 10}

fruits_shop["banana"] = fruits_shop["banana"] + 45
fruits_shop["grape"] = 10
fruits_shop["apple"] = fruits_shop["apple"] - 1
print(fruits_shop)

{'apple': 9, 'pear': 5, 'banana': 46, 'grape': 10}
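
When a fruit may or may not already be in the shop, get() with a default of 0 avoids a KeyError. A minimal sketch; the kiwi and grape deliveries are invented for the example.

```python
fruits_shop = {'apple': 9, 'pear': 5, 'banana': 46, 'grape': 10}

# 'kiwi' is not a key yet: get() returns 0, so the new stock is 0 + 3
fruits_shop['kiwi'] = fruits_shop.get('kiwi', 0) + 3
# 'grape' already exists: get() returns 10, so the new stock is 10 + 5
fruits_shop['grape'] = fruits_shop.get('grape', 0) + 5

print(fruits_shop)
```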

Brainstorming time

Pears are now prohibited worldwide, but we get 2 apples in exchange for each pear.

pears = fruits_shop.pop("pear")
fruits_shop["apple"] += 2 * pears
print(fruits_shop)

{'apple': 19, 'banana': 46, 'grape': 10}

pears = fruits_shop.pop("pear")
fruits_shop["apple"] = fruits_shop["apple"] + 2 * pears
print(fruits_shop)

{'apple': 19, 'banana': 46, 'grape': 10}

Unfortunately, we have to delete the fruits_shop dictionary, as it has become useless and we need the space for something else. How would you proceed?

del(fruits_shop)
print(fruits_shop)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[143], line 2
      1 del(fruits_shop)
----> 2 print(fruits_shop)

NameError: name 'fruits_shop' is not defined

Summary of the dictionaries section

A dictionary is a variable that stores data in the form of key:value pairs, accessed by key rather than by position.

An empty dictionary can be initialised with {} or dict().

To access the value corresponding to the key k, you can use :
- my_dict[k]
- my_dict.get(k, default_value)

Dictionaries are not indexed as lists are.

To add or update an element, use my_dict[k] = new_value.

To delete a key:value pair, use removed_value = my_dict.pop(k), which also returns the deleted value, or del(my_dict[k]).

To delete the dictionary, use del(my_dict).
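The operations summarised above can be sketched in a few lines. This is only an illustration: the dictionary stock and its content are made up for the example.

```python
# Hypothetical stock, for illustration only.
stock = {"apple": 9, "banana": 46}

# stock["kiwi"] would raise a KeyError because the key is missing;
# get() returns the default value instead.
nb_kiwis = stock.get("kiwi", 0)
print(nb_kiwis)

# Add a new key, then update an existing one.
stock["grape"] = 10
stock["apple"] = 12

# pop() removes the pair and returns its value; del only removes the pair.
removed = stock.pop("banana")
del stock["grape"]
print(removed, stock)
```

Using get() with a default value is a convenient way to avoid an error when you are not sure that a key exists.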

Let’s practise

Please open file 003_practical_dictionaries.py

Conditional statements

Conditional statements presentation

An if / elif / else statement determines which part of the code is executed, according to one or several conditions.

The if, elif and else lines end with a colon (:).
The blocks of code to be executed are indented.

Do not mix spaces and tabs.
Python best practices recommend using 4 spaces.

elif and else are optional. Without them, nothing is executed when the if condition is false.

You can write as many elif statements as needed.
elif is short for else if.

Conditions are tested from top to bottom: as soon as one of them is true, its block is executed and the remaining ones are not tested.

Examples

Example 1:

limit = 50
if current_speed > limit + 30:
    print('Slow down! You are going to kill someone!')
elif current_speed > limit:
    print('Slow down! You are going to get a fine!')
else:
    print('You are not exceeding the speed limit.')

What does this code print with these values:

- `current_speed = 60` ?

Slow down! You are going to get a fine!

- `current_speed = 160` ?

Slow down! You are going to kill someone!

- `current_speed = 30` ?

You are not exceeding the speed limit.

Examples

Example 2:
Note: Instructions are executed in the order in which they are written.

What difference(s) do you see between these two examples ?
What change(s) should we expect with this code ?

limit = 50
current_speed = 100
if current_speed > limit + 30:
    print('Slow down! You are going to kill someone!')

elif current_speed > limit:
    print('Slow down! You are going to get a fine!')

else:
    print('You are not exceeding the speed limit.')

limit = 50
current_speed = 100
if current_speed > limit:
    print('Slow down! You are going to get a fine!')

elif current_speed > limit + 30:
    print('Slow down! You are going to kill someone!')
    
else:
    print('You are not exceeding the speed limit.')

Slow down! You are going to kill someone!

Slow down! You are going to get a fine!

In the second example, we will never enter the current_speed > limit + 30 block: any speed above limit + 30 is also above limit, so the first condition always matches first.

Logical operators

We can combine expressions using and or or.

if A and B will be executed only if the 2 expressions are true.

admission = None
if age >= 18 and age < 65:
    admission = "full_price"

Note: there is a simpler syntax for checking whether a number is within a range.

if 18 <= age < 65:
    admission = "full_price"

if A or B will be executed if at least one of the 2 expressions is true.

if age < 18 or age >= 65:
    admission = "reduced_price"

You can notice that we initialised the variable admission before the conditional statement. This is a good practice, because if all conditions fail and you try to use an uninitialised variable, an error will occur and stop the execution of your script.
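Here is a minimal sketch of why this initialisation helps. The value of age is chosen for the example, and the test admission is None is one possible way (not shown in the slides above) to check that no condition has matched.

```python
age = 70            # example value, outside the 18-65 range
admission = None    # initialised before the conditional statement

if 18 <= age < 65:
    admission = "full_price"

# No condition matched, but admission exists and can still be tested.
if admission is None:
    print("No price has been assigned yet.")
else:
    print(f"Admission: {admission}")
```

Without the line admission = None, the last test would stop the script with a NameError.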

More complex conditions

There is no limit to the number of conditions, but it may be useful to use parentheses to indicate priorities.

Example 1:

age = 16
nb_available_seats = 0
if (age < 18 or age >= 65) and nb_available_seats > 0:
    print("You may enter at a reduced rate.")

Example 2:

age = 16
nb_available_seats = 0
if age < 18 or age >= 65 and nb_available_seats > 0:
    print("You may enter at a reduced rate.")

You may enter at a reduced rate.

The logical operator and has higher precedence than the logical operator or.
This means that when both and and or operators appear in the same expression, and is evaluated first.
If you are not sure of the priority, use parentheses!
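The priority rule can be checked by storing each condition in a variable. This is a small sketch; the three variable names are introduced only for the example.

```python
age = 16
nb_available_seats = 0

# Without parentheses, `and` is evaluated first ...
without_parentheses = age < 18 or age >= 65 and nb_available_seats > 0
# ... so Python reads the line above exactly like this one:
and_first = age < 18 or (age >= 65 and nb_available_seats > 0)
# Parentheses around the `or` change the result:
or_first = (age < 18 or age >= 65) and nb_available_seats > 0

print(without_parentheses, and_first, or_first)
```

The first two values are True and the last one is False, which matches the behaviour of Example 2 and Example 1 respectively.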

Nested conditions

You can nest multiple conditions.

age = 16
nb_available_seats = 5

if nb_available_seats > 0:
    if age < 18 or age >= 65:
        admission = "reduced"
    else:
        admission = "full"
    print(f"You may enter with a {admission} price.")
else:
    print("There are no more seats available.")

You may enter with a reduced price.

Please mind the indentation!

Brainstorming time

Before leaving home, you should take an accessory depending on the weather.
Consider the following code:

temperature = 20
rain = False

if rain == True:
    print("Take an umbrella.")
else:
    if temperature >= 25:
        print("Wear a hat and sunglasses.")
    elif temperature >= 15:
        print("Wear sunglasses.")
    elif temperature >= 0:
        print("Wear a scarf.")
    else:
        print("Wear a scarf and gloves.")

Question 1: What does this code print?

This code prints: Wear sunglasses.

Question 2: Give an example of variables to obtain the message: Wear a scarf.

We must have rain == False and temperature between 0 and 14°C.

Question 3: When should you wear gloves?

When the temperature is strictly below 0°C.

Question 4: If it rains and the temperature is 0°C, should you take an umbrella or a scarf?

You should take an umbrella.

Question 5: Tomorrow, it is supposed to be 28°C and sunny.
Which accessory or accessories will you take or wear?

Tomorrow, I will wear a hat and sunglasses.

Inverting conditions

Sometimes it is easier to check whether a condition is not true.
We can do this with the operator not.

if "banana" not in ["apple", "pear", "hazelnut"]:
    print("Banana not found in list.")

Banana not found in list.

This is equivalent to the following syntax:

if "banana" != "apple" and "banana" != "pear" and "banana" != "hazelnut":
    print("Banana not found in list.")

Banana not found in list.
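The operator not can also be placed in front of any boolean value or expression. The sketch below reuses the rain variable and the list from the previous examples; the variables banana_missing and same_result are introduced for the illustration.

```python
rain = False
fruits = ["apple", "pear", "hazelnut"]

# not turns False into True (and True into False).
if not rain:
    print("No umbrella needed.")

# "x not in y" gives the same result as "not (x in y)".
banana_missing = "banana" not in fruits
same_result = not ("banana" in fruits)
print(banana_missing, same_result)
```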

Summary of the conditionals section

An if / elif / else statement determines which part of the code is executed, according to one or several conditions.

elif is short for else if.

The if, elif and else lines end with a colon (:).
The blocks of code to be executed are indented.

elif and else are optional.

It is possible to combine expressions with and, or and add parentheses () to indicate priorities.

You can use the in and not in keywords to check whether an element is in a list.

Loops

Loops presentation

Loops are used to repeat the execution of a part of the program several times.
There are two kinds of loops in Python:
- for loops are generally used when we know how many times to repeat the action.
- while loops are generally preferred when we don’t know the number of repetitions in advance.

For

The for loop performs an action for each element of a collection such as a list, a dictionary, a string…

The line with the for instruction must end with a colon (:) and the code that runs inside the for loop must be indented.

General syntax:

for element in collection:
    # Perform some action(s) on element.  
    # These actions can spread on several lines
    # which must all be indented.

For: examples on lists

Example 1:

odds = [1, 3, 5, 7]
for element in odds:
    print(f"element contains {element}.")

element contains 1.
element contains 3.
element contains 5.
element contains 7.

Example 2:

odds = [1, 3, 5, 7]
numbers_power2 = list()
for i in odds:
    numbers_power2.append(i**2)
    print(f"i contains {i} and numbers_power2 contains {numbers_power2}.")

i contains 1 and numbers_power2 contains [1].
i contains 3 and numbers_power2 contains [1, 9].
i contains 5 and numbers_power2 contains [1, 9, 25].
i contains 7 and numbers_power2 contains [1, 9, 25, 49].

Example 3:

odds = [1, 3, 5, 7]
for index, element in enumerate(odds):
    print(f"The {index}-th item in list contains {element}.")

The 0-th item in list contains 1.
The 1-th item in list contains 3.
The 2-th item in list contains 5.
The 3-th item in list contains 7.

The enumerate function is useful for iterating through a list and finding out the position of each element in the list.
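By default enumerate counts from 0, like list indices. If you prefer to count from 1, enumerate accepts an optional start argument, as in this short sketch (the list positions is only there to keep track of the counter).

```python
odds = [1, 3, 5, 7]
positions = list()

# start=1 makes the counter begin at 1 instead of 0.
for position, element in enumerate(odds, start=1):
    positions.append(position)
    print(f"Item number {position} is {element}.")
```

Keep in mind that position is then no longer the index of the element in the list: odds[position] would point to the next element.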

For: examples on dictionaries

fruits_shop = {"apple":10, "pear": 5, "banana": 1}

Iterate over the keys:

for key in fruits_shop.keys():
    print(f'Key {key} is associated with value {fruits_shop[key]}.')

Key apple is associated with value 10.
Key pear is associated with value 5.
Key banana is associated with value 1.

Iterate over the values:

for value in fruits_shop.values():
    print(value)

10
5
1

Iterate over both keys and values:

for key, value in fruits_shop.items():
    print(f'Key {key} is associated with value {value}.')

Key apple is associated with value 10.
Key pear is associated with value 5.
Key banana is associated with value 1.
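Looping directly over a dictionary, without calling keys(), also iterates over its keys. The sketch below uses this to compute the total number of fruits; keys_found and total are names chosen for the example.

```python
fruits_shop = {"apple": 10, "pear": 5, "banana": 1}
keys_found = list()
total = 0

# Looping over the dictionary itself is the same as looping over .keys().
for key in fruits_shop:
    keys_found.append(key)
    total += fruits_shop[key]

print(keys_found)
print(f"There are {total} fruits in the shop.")
```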

While

The while loop performs an action as long as an expression is true.

The line with the while instruction must end with a colon (:) and the code that runs inside the while loop must be indented.

WARNING! If the expression evaluated by the while loop never becomes false, you will end up with an infinite loop!

Example 1:

odds = [1, 3, 5, 7, 9]
i = 0
while i < len(odds):
    print(f"The list item with index {i} is {odds[i]}.")
    i += 1

The list item with index 0 is 1.
The list item with index 1 is 3.
The list item with index 2 is 5.
The list item with index 3 is 7.
The list item with index 4 is 9.

Example 2:

odds = [1, 3, 5, 7, 9]
numbers_power_2 = list()
i = 0
while i < len(odds):
    odd_number2 = odds[i]**2
    print(f"The list item with index {i} is {odds[i]}.")
    numbers_power_2.append(odd_number2)
    i += 1
print(numbers_power_2)

The list item with index 0 is 1.
The list item with index 1 is 3.
The list item with index 2 is 5.
The list item with index 3 is 7.
The list item with index 4 is 9.
[1, 9, 25, 49, 81]
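In the two examples above, the number of repetitions could have been known in advance (it is the length of the list). Here is a sketch of a case where it is not: we halve a volume until it falls to 10 or below. The starting value and the variable names are made up for the example.

```python
volume = 100.0    # hypothetical starting volume, in mL
nb_halvings = 0

# We do not know in advance how many halvings are needed.
while volume > 10:
    volume = volume / 2   # the tested value changes at each iteration
    nb_halvings += 1

print(f"{nb_halvings} halvings were needed, final volume: {volume} mL")
```

Because volume is modified inside the loop, the condition volume > 10 eventually becomes false and the loop stops.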

Break loops

Sometimes you may need to end a loop prematurely.

With the break statement we can stop the loop even if the while condition is still true or if we are not done with the for iteration.

numbers = [2, 4, 6, 7, 8]
even_numbers = list()
for i in numbers:
    if i % 2 == 1:
        print(f"An odd number has been found ({i})")
        break
    else:
        even_numbers.append(i)
print("The consecutive even numbers are", even_numbers)

An odd number has been found (7)
The consecutive even numbers are [2, 4, 6]

numbers = [2, 4, 6, 7, 8]
even_numbers = list()
i = 0
while i < len(numbers):
    if numbers[i] % 2 == 1:
        print(f"An odd number has been found ({numbers[i]})")
        break
    else:
        even_numbers.append(numbers[i])
    i += 1
print("The consecutive even numbers are", even_numbers)

An odd number has been found (7)
The consecutive even numbers are [2, 4, 6]

Continue loops

With the continue statement we can go directly to the next iteration without executing the code in the loop for the current iteration.

for number in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    print(f"number: {number}")
    if number != 5:
        continue
    print(f"Number {number} has been found!")

number: 0
number: 1
number: 2
number: 3
number: 4
number: 5
Number 5 has been found!
number: 6
number: 7
number: 8
number: 9
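To compare continue with break, here is the list used in the break example, this time with continue. It is only a sketch of the difference: odd numbers are skipped, but the loop goes on until the end of the list.

```python
numbers = [2, 4, 6, 7, 8]
even_numbers = list()

for i in numbers:
    if i % 2 == 1:
        continue   # skip odd numbers and move on to the next element
    even_numbers.append(i)

print("The even numbers are", even_numbers)
```

With break the result was [2, 4, 6]; with continue the number 8 is also collected.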

Combination

As you could see in previous slides, it is possible to combine multiple loops and conditions within the same block of code.

In this case, you should pay attention to the code indentation. If you get it wrong, the code may still run, but it will not produce the expected result.

Combination: example

Let’s generate all possible pairs of fruits among orange, mango, and lemon.

fruits = ["orange", "mango", "lemon"]
comb1 = list()

for my_first_fruit in fruits:
    print(f'Here, {my_first_fruit} is the first fruit.')
    for my_second_fruit in fruits:
        print(f'- {my_first_fruit} and {my_second_fruit}')
        comb1.append([my_first_fruit, my_second_fruit])

fruits = ["orange","mango","lemon"]
comb2 = list()

for my_first_fruit in fruits:
    print(f'Here, {my_first_fruit} is the first fruit.')
    for my_second_fruit in fruits:
        print(f'- {my_first_fruit} and {my_second_fruit}')
    comb2.append([my_first_fruit,my_second_fruit])

Here, orange is the first fruit.
- orange and orange
- orange and mango
- orange and lemon
Here, mango is the first fruit.
- mango and orange
- mango and mango
- mango and lemon
Here, lemon is the first fruit.
- lemon and orange
- lemon and mango
- lemon and lemon

Here, orange is the first fruit.
- orange and orange
- orange and mango
- orange and lemon
Here, mango is the first fruit.
- mango and orange
- mango and mango
- mango and lemon
Here, lemon is the first fruit.
- lemon and orange
- lemon and mango
- lemon and lemon

print(comb1)

[['orange', 'orange'], ['orange', 'mango'], ['orange', 'lemon'], ['mango', 'orange'], ['mango', 'mango'], ['mango', 'lemon'], ['lemon', 'orange'], ['lemon', 'mango'], ['lemon', 'lemon']]

print(comb2)

[['orange', 'lemon'], ['mango', 'lemon'], ['lemon', 'lemon']]

In the second version, the line comb2.append(...) is indented at the level of the outer loop only. It therefore runs once per first fruit, after the inner loop has finished, when my_second_fruit still holds its last value (lemon). This is why the printed lines are identical while comb2 contains only three pairs.

Brainstorming time

We have a list of integers from 0 to 12.

We want to classify them in a dictionary with keys odd and even. Each key in the dictionary has a list of numbers as its value.

How would you do that?

First, we initialise our list and dictionary:

my_int_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
my_number_dict = {"odd" : list(), "even" : list()}

Then we will iterate over my_int_list. For each element we will test if it is even or odd, and add the element to the list of the appropriate key.

for i in my_int_list:
    if i % 2 == 0:
        my_number_dict["even"].append(i)
    else:
        my_number_dict["odd"].append(i)
print(my_number_dict)

{'odd': [1, 3, 5, 7, 9, 11], 'even': [0, 2, 4, 6, 8, 10, 12]}

Summary of the loops section

Loops are used to repeat the execution of a part of the program several times.
- for loops: the number of repetitions is known in advance.
- while loops: the number of repetitions is not known in advance.

Syntax of a for loop: the keyword for, a loop variable, the keyword in, a collection (list, dictionary, …) and a colon (:).

for elt in my_dict:
  # action

Syntax of a while loop: the keyword while, a condition and a colon (:).

while condition:
  # action

The code inside a loop must be indented.

If the expression evaluated by a while loop never becomes false, you will get an infinite loop.

With the break statement, the loop will stop prematurely.

With the continue statement, the loop will go to the next iteration prematurely.

It is possible to add a loop inside another loop and to add conditional statements inside a loop.

Let’s practise

Please open file 004_practical_conditionals_loops.py

Jupyter notebook

Introduction

Jupyter notebooks are interactive programming environments that allow you to combine text, images, mathematical formulas, tables, graphs and executable computer code in a single document. They can be manipulated in a web browser.
Jupyter notebooks support more than 40 programming languages, including Python.
The cell is the basic element of a Jupyter notebook. It can contain formatted text or computer code that can be executed.
A web browser can be used to open a notebook, but VSCode can also do so as long as the Jupyter notebook extension has been installed.

Notebook presentation

In this training we will focus on 2 types of cells:
- Markdown cells: to write text (titles, mathematical formulas, tables, …)
- Python cells: to write Python code

To create a Jupyter notebook, go to the Explorer menu on the top-left and click on New file.

The file extension for a Jupyter file is .ipynb.

Select kernel

Click on Select Kernel on the top-right of the tab to choose a Python version to run your code.

Markdown cells

To create a markdown cell, click on + Markdown.

You can write anything you want in this cell: it won’t be interpreted as code.

You can run a markdown cell to render the raw text as formatted Markdown by clicking on the right-pointing arrow on the right.

Markdown cells

This is what it looks like after being executed.

Edit and delete a cell

To edit a Markdown cell after it has been executed, double-click on it.

To delete a Markdown cell, click on the dustbin on the right.

Python cells

To create a Python cell, click on + Code.

To run a Python cell, click on the right-pointing arrow on the left.

Python cells

This is what it looks like after being executed.

Other ways to execute cells

There are other ways to execute Python cells.

Execute Above Cells: runs every cell above the current cell.
It is useful if you have modified your variables and want to revert to a previous state.

Execute Cell and Below: runs the current cell and all the cells below it.
It is useful if you have modified your variables and want to refresh the results that depend on them.

Other ways to execute cells

Run All: Runs every cell in the notebook.
It is useful if you know you are going to execute all cells.

Other ways to execute cells

Restart: restarts the kernel, which empties the memory (all variables are lost).
It is useful before a Run All, to check that your notebook runs correctly from a clean state before giving it to someone.

Delete a cell

To delete a Python cell, click on the dustbin on the right of the cell.

Let’s practise

Please open file 005_practical_jupyter.ipynb

Functions

Functions presentation

Functions are useful for performing an operation multiple times within a program.
A few functions have been introduced during this training.
- print() which displays what is between the parentheses
- len() which returns the number of items in a list or dictionary
Basically, any function works like this:
- Values of any type are passed to the function.
- One or several actions are performed by the function.
- The function returns one or several values or objects.

Functions definition

A function definition starts with the keyword def.
It is followed by the function name, parentheses () that optionally contain arguments, and a colon (:).
As with for and while loops, the code that runs inside the function must be indented.

General syntax:

def function_name():
    # Perform some action(s).
    # These actions can spread on several lines
    # which must all be indented.

Example:

def hello():
    print("Hi !")
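Defining a function does not run it. The short sketch below repeats the definition of hello and then calls it twice, to show that the code inside the function is only executed when the function is called.

```python
def hello():
    print("Hi !")

# Nothing has been printed so far: the function is only defined.
# Each call below runs the code inside the function once.
hello()
hello()
```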

Functions with arguments

Arguments can be passed to a function.
Some operations can be performed within the function using one or several arguments given in parentheses.

def square(x):
    sqr=x**2
    print(f"The square of {x} is {sqr}.")

square(2)

The square of 2 is 4.

Several arguments can be passed to a function.
They are separated by commas (,) and can be of any type (str, int, float, list, dict, etc.).

def repeat_sequence(x, y):
    long_chain=x*y
    print(long_chain)

repeat_sequence("AT", 5)

ATATATATAT

Functions returning results

Variables defined inside a function are local: they only exist within the function block.

def square(x):
    sqr=x**2

square(2)
print(f"The square of 2 is {sqr}.")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[176], line 5
      2     sqr=x**2
      4 square(2)
----> 5 print(f"The square of 2 is {sqr}.")

NameError: name 'sqr' is not defined

The NameError happened because variables defined inside a function are not accessible from the global code.
To make the result of a function usable outside of that function, we have to use return.
The return statement ends the execution of the function and sends back a value, which can be of any type.

def square(x):
    sqr=x**2
    return sqr

sqr_val = square(2)
print(f"The square of 2 is {sqr_val}.")

The square of 2 is 4.

Functions returning results

The return statement can appear several times in a function.
However, the first return encountered stops the function execution and goes back to the code that called the function.
This is useful when combined with conditional statements, to exit the function as soon as a condition is fulfilled.

Example:

def speed_limit(x):
    limit = 50
    if x > limit:
        return "Too fast !"
    else:
        return "Perfect !"

print("You are driving at 51km/h.")
result = speed_limit(51)
print(result)

print("You are driving at 30km/h.")
result = speed_limit(30)
print(result)

You are driving at 51km/h.
Too fast !
You are driving at 30km/h.
Perfect !

Default argument

You can add a default value to your arguments, but these arguments must be placed at the end of the argument list in the function.

def power(x, n = 2) :
    return x**n

pow_value = power(2)

print(f"2**2 = {pow_value}")

2**2 = 4

The second argument was omitted in the call, so the default value (n = 2) was used.

The default value is only used when no value is passed for this argument. If you provide a value, the function uses it instead of the default value:

def power(x, n = 2) :
    return x**n

pow_value = power(2,3)

print(f"2**3 = {pow_value}")

2**3 = 8
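Arguments can also be passed by name, which is convenient for arguments that have a default value. The sketch below reuses the power function; the three result variables are named for the example.

```python
def power(x, n = 2):
    return x**n

by_position = power(2, 3)        # x = 2, n = 3
by_name = power(2, n = 3)        # same call, n is named explicitly
swapped = power(n = 2, x = 3)    # with names, the order no longer matters

print(by_position, by_name, swapped)
```

Naming the argument makes the call easier to read, especially when a function has several optional arguments.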

Good practices

Function names should be lowercase, with words separated by underscores (_) for better readability.

Function names should not be the same as Python built-in functions or keywords.

You can specify the type of your arguments and of the returned value (type hints). This is not mandatory, but it helps to remember which type of value the function expects.

def square(x : int|float) -> int|float:
    return x**2
sqr_val = square(2)

print(f"The square of 2 is {sqr_val}.")

The square of 2 is 4.

The int|float syntax used here requires Python 3.10 or later. Type hints themselves are available in earlier Python 3 versions.
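Type hints are indications for the reader (and for code editors): Python itself does not check them when the code runs. The sketch below uses the simpler hint float, an assumption made to keep the example short, and shows that an integer is accepted without any error.

```python
def square(x: float) -> float:
    return x**2

# The hint says float, but Python does not check it at run time.
from_int = square(3)        # an int is accepted
from_float = square(2.5)

print(from_int, from_float)
```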

Brainstorming time

Imagine you want to create a function called ‘enzyme’, which takes a string as an argument and returns a split list. It splits every time there is a serine (S) residue (we are in a wonderful world where enzymes cut every time and there are no steric hindrances…).

How would you do that ?

We define the name of the function.

def enzyme():

We can then add the argument(s) :

def enzyme(my_string):

We can then add the instructions (beware of indentation):

def enzyme(my_string):
    my_string.split("S")

Then, we want to see the result!

def enzyme(my_string):
    res = my_string.split("S")
    print(res)

Brainstorming time

Now let’s try !

def enzyme(my_string):
    res = my_string.split("S")
    print(res)

enzyme("AGESMKT")

['AGE', 'MKT']

Great, it works! Now I want to store the result in a variable.

def enzyme(my_string):
    res = my_string.split("S")
    print(res)

answer = enzyme("AGESMKT")
print(answer)

['AGE', 'MKT']
None

Oops, I forgot to include the return in the function.

def enzyme(my_string):
    res = my_string.split("S")
    return res

answer = enzyme("AGESMKT")
print(f'answer contains: {answer}')
answer_2 = enzyme("agesmkt")
print(f'answer_2 contains: {answer_2}')

answer contains: ['AGE', 'MKT']
answer_2 contains: ['agesmkt']

Brainstorming time

OK, now let’s enhance our function! Currently it cuts only on uppercase S but we want to be able to accept sequences in upper and lower case letters.

def enzyme(my_string):
    res = my_string.upper().split("S")
    return res

answer = enzyme("AGESMKT")
print(answer)
answer_2 = enzyme("agesmkt")
print(answer_2)

['AGE', 'MKT']
['AGE', 'MKT']

That’s pretty good, but now we want to add the ability to cut according to another amino acid, while keeping Serine as the default value.

def enzyme(my_string, catalytic_site = "S"):
    res = my_string.upper().split(catalytic_site)
    return res

pept = "AGESMKT"
answer = enzyme(pept)
print(answer)
answer_2 = enzyme(pept, "T")
print(answer_2)
answer_3 = enzyme(pept, "ES")
print(answer_3)

['AGE', 'MKT']
['AGESMK', '']
['AG', 'MKT']

Brainstorming time

You may have noticed that the catalytic site is no longer in the resulting list. In reality, an enzyme can cut before or after the catalytic site, but the recognised amino acid should always be kept. How would you approach this? (Tip: add a boolean argument before, True by default, so that the enzyme cuts before the catalytic site unless told otherwise.)

def enzyme(my_string, catalytic_site = "S", before = True):
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
    else:
        for my_peptide in range(0,(len(res)-1)):
            res[my_peptide] = res[my_peptide] + catalytic_site
    return res
pept = "AGESMKT"
answer = enzyme(pept)
print(answer)
answer = enzyme(pept, "T")
print(answer)
answer = enzyme(pept, "T", before=False)
print(answer)
answer = enzyme(pept, "A")
print(answer)

['AGE', 'SMKT']
['AGESMK', 'T']
['AGESMKT', '']
['', 'AGESMKT']

We can see that if the peptide begins or ends with the catalytic site, the split produces an unexpected empty string. We don’t want this empty string.
How would you do this?

Brainstorming time

def enzyme(my_string, catalytic_site = "S", before = True):
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
        if res[0] == "":
            res.pop(0)
    else:
        for my_peptide in range(0,(len(res)-1)):
            res[my_peptide] = res[my_peptide] + catalytic_site
        if res[-1] == "":
            res.pop(-1)
    return res

pept = "AGESMKT"
print(enzyme(pept))
print(enzyme(pept, "T"))
print(enzyme(pept, "O"))
print(enzyme(pept, "T", before=False))
print(enzyme(pept, "A"))

['AGE', 'SMKT']
['AGESMK', 'T']
['AGESMKT']
['AGESMKT']
['AGESMKT']

Well played! You’re almost there with this beautiful function! Adding documentation within docstrings will be helpful if in two years you want to remember what the function does, or if you give your code to someone else.

Brainstorming time

def enzyme(my_string : str, catalytic_site : str = "S", before : bool = True) -> list:
    """
    Simulate an enzyme cleavage using a catalytic site. The cleavage can occur before or after the catalytic site.

    Arguments:
    my_string: string
      The protein to be digested.
    catalytic_site: string - optional
      The cleavage site used to split the protein.
    before: boolean - optional
      Whether the enzyme cuts before or after the cleavage site.
      If `before` is True, the enzyme cuts before the catalytic site, otherwise it cuts after the catalytic site.

    Returns:
    list of strings
      The peptides obtained after the cleavage.
    """
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
        if res[0] == "":
            res.pop(0)
    else:
        for my_peptide in range(0,(len(res)-1)):
            res[my_peptide] = res[my_peptide] + catalytic_site
        if res[-1] == "":
            res.pop(-1)
    return res

short_sab = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEP"
res = enzyme(short_sab)
print(res)

['MKWVTFI', 'SLLFLF', 'S', 'SAY', 'SRGVFRRDAHK', 'SEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADE', 'SAENCDK', 'SLHTLFGDKLCTVATLRETYGEMADCCAKQEP']

Although not mandatory, docstrings are highly recommended!
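Once a docstring is written, it can be displayed without opening the source code. The sketch below uses a deliberately tiny function (not the enzyme function) to show two standard ways of reading a docstring: the __doc__ attribute and the built-in help() function.

```python
def square(x):
    """Return the square of x."""
    return x**2

# The docstring is stored with the function ...
print(square.__doc__)
# ... and is displayed by the built-in help() function.
help(square)
```

The same works for functions coming from packages, which is a quick way to check how to use them.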

Summary of the functions section

Functions are useful for performing an operation multiple times within a program.

Syntax: the keyword def, the function name, parentheses () that optionally contain arguments, and a colon (:).

def function_name(argument1, argument2):
  # action

The function name should not be the same as a Python built-in function or keyword.

The code inside a function must be indented.

The return statement ends the function and sends a result where it is called.

There can be multiple return statements in a function if you use conditional statements, but the first return encountered stops the function execution and goes back to the code that called the function.

Let’s practise

Please open file 006_practical_functions.ipynb

Packages

Packages presentation

Standard Python can already do many things, and developers extend it by sharing “ready-to-use” functions bundled in packages.
Packages contain collections of functions developed to accomplish common tasks.
The Python community is really active and has developed many packages providing functions for almost any purpose.

Some examples of useful packages:

biopython : tools for computational molecular biology.
pandas : dataframe and data analysis toolkit.
scipy : toolkit for mathematics, statistics and various scientific processes.

Packages utilisation

Use import followed by the package name to load a package in Python.
Once imported, you can call a function from the package by writing package_name.function_name.

Example :

import random
print(random.randint(0,10))

Here we have imported the package random to use the function randint, which draws a random integer between 0 and 10 (both bounds included).

Another common way to import functions from a package is to use the keyword from.
from lets you import one or several functions and call them without repeating the package’s name.

from random import randint
print(randint(0,10))
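Since random gives a different result at each run, the two import styles are easier to compare with a function whose result never changes. The sketch below uses sqrt from the standard math package, which is an example chosen here and not part of the slides above.

```python
import math
from math import sqrt

# Both lines call the same function; only the way of naming it differs.
with_package_name = math.sqrt(16)
without_package_name = sqrt(16)

print(with_package_name, without_package_name)
```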

Packages utilisation

All functions of a package can be imported at once using *.

In the following example, all random functions have been imported and can be used directly by naming them like randint or choices.

from random import *
print("Random number:", randint(0,10))
print("Random name:", choices(['Binjamain','Izabèl','Toma','Lauraine']))

Random number: 7
Random name: ['Binjamain']

Be careful when using * with multiple packages. Some packages have functions with the same name, and the last one imported silently replaces the others. In fact, it is strongly recommended not to use * to import everything from a package.
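Here is a sketch of such a conflict, using two standard packages chosen for the example: math and cmath both provide a function called sqrt. The functions are imported by name here, but importing everything with * from both packages would lead to the same situation.

```python
from math import sqrt
first_result = sqrt(4)     # math.sqrt returns a float

from cmath import sqrt     # same name: the previous sqrt is replaced
second_result = sqrt(4)    # cmath.sqrt returns a complex number

print(first_result, second_result)
```

No error or warning is raised: the only visible sign is that the result has changed type.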

Packages utilisation

It is also possible to define an alias for a module:

import random as rand
print("Random number:", rand.randint(0,10))

Random number: 9

Aliases are useful when package or function names are long: they can make your code more readable.
Furthermore, they help to prevent function overwrites, since each function is called through its alias.

Like any variable, an alias is overwritten if you assign something else to the same name.

import numpy as np
# np here calls numpy ...
import seaborn as np
# But here, np is overwritten by seaborn
np = "no problem"
# And now becomes "no problem"

Choose your aliases wisely!

Packages utilisation

Be careful! A function imported with from replaces any existing function that has the same name, and so does an alias. Below, the alias pprint keeps the built-in print unchanged.

print("[bold red]This sentence is different[/bold red]")
from rich import print as pprint
pprint("[bold red]From this sentence[/bold red]")

[bold red]This sentence is different[/bold red]

From this sentence

Packages download

All packages available in the Python Package Index (PyPI) can be installed.

We recommend doing a quick search for the desired package on https://pypi.org/.

Packages can be installed easily with pip, the package installer for Python.

pip install biopython

If the package is only available on GitHub, you can use:

pip install git+https://github.com/pseudo/repo-name.git

Installing a large number of packages, or certain combinations of packages, might result in conflicts. For advanced usage, we recommend using an environment manager such as conda.

Summary of the packages section

Packages contain collections of “ready-to-use” functions developed to accomplish common tasks.
They are useful because you do not have to write complicated functions yourself.
To use a package, you need to install it first.
- pip install package_name or
- pip install git+https://github.com/pseudo/repo-name.git
Then you need to import it with one of these methods:
- To import the whole package: import package_name, then call package_name.function_name.
- To import only one function: from package_name import function_name, then call function_name.
- To import every function: from package_name import *, then call function_name.
If you find the package name too long, you can give it an alias:
- import a_package_with_a_long_name as pack, then call pack.function_name.
Be careful not to overwrite another function with an alias!

Let’s practise

Please open file 007_practical_packages.ipynb

Reading and writing files

Input/Output (I/O) presentation

The main operations that you can perform on files are: reading a file and writing to a file.
When you access a file on an operating system, a file path is required, which represents the location of a file. It is broken up into three major parts:
- Folder Path: the file folder location on the file system where subsequent folders are separated by a forward slash / (Unix) or backslash \ (Windows)
- File Name: the actual name of the file
- Extension: the end of the file path, preceded by a period (.), used to indicate the file type
The path can be:
- absolute (the full path from the root of the computer)
- relative (relative to the working directory).

Path example

.
└── home
    └── Toto
        ├── Desktop
        ├── Documents
        │   └── Trainings
        │       └── Python
        │           ├── practical_work/
        │           │   ├── Data
        │           │   │   └── sequences.fasta
        │           │   └── exercises.py
        │           └── Python_slides.html
        ├── Images
        ├── Downloads
        └── Videos

/home/Toto/Documents/Trainings/Python/ is the absolute path of the Python folder.
/home/Toto/Documents/Trainings/Python/practical_work/exercises.py is the absolute path of exercises.py.
./Data/sequences.fasta is the relative path of sequences.fasta (relative to the folder containing exercises.py).
exercises is the file name.
py is the file extension.
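As an illustration, the three parts of a path can be extracted with the standard pathlib module. This is only a sketch: PurePosixPath handles the path as text, so the file from the example tree does not need to exist on your computer.

```python
from pathlib import PurePosixPath

# PurePosixPath only manipulates the text of a Unix-style path
path = PurePosixPath("/home/Toto/Documents/Trainings/Python/practical_work/exercises.py")

folder = str(path.parent)        # folder path
name = path.stem                 # file name without the extension
extension = path.suffix          # extension, including the period
absolute = path.is_absolute()    # True for a path starting at the root

print(folder, name, extension, absolute)
```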

About relative paths

./ means the current directory.
./Data/sequences.fasta and Data/sequences.fasta refer to the same file.
../ means the parent directory.
If you need to refer to the Python_slides.html file from exercises.py, use ../Python_slides.html
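The sketch below shows how these relative paths translate into absolute ones. It assumes that the working directory is the practical_work folder of the example tree, and it uses the standard posixpath module, which only manipulates text (no file is accessed).

```python
import posixpath

# Assumption: the working directory is the folder that contains exercises.py
working_dir = "/home/Toto/Documents/Trainings/Python/practical_work"

# join() glues the pieces together, normpath() resolves "./" and "../"
with_dot = posixpath.normpath(posixpath.join(working_dir, "./Data/sequences.fasta"))
without_dot = posixpath.normpath(posixpath.join(working_dir, "Data/sequences.fasta"))
parent_file = posixpath.normpath(posixpath.join(working_dir, "../Python_slides.html"))

print(with_dot)
print(without_dot)
print(parent_file)
```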

File handlers

To manage a file, we use a file handler, which can be created with the open() function.

Open a file for reading with:

open('/home/Toto/Data/example.txt', 'r')

Open a file for:

writing with:

open('/home/Toto/Data/example.txt', 'w')

appending with:

open('/home/Toto/Data/example.txt', 'a')

There are two syntaxes for managing a file:

f = open('/home/Toto/Data/example.txt', 'r')
# do stuff with file
f.close()

with open('/home/Toto/Data/example.txt', 'r') as f:
    # do stuff with file
    # /!\ do not forget the indentation!

The syntax using with is recommended for most cases.
Note the alias as f: it means that f is the file example.txt opened in r (read) mode.
The file handler is automatically closed when you exit the with block.

If you open a file in ‘writing’ mode without using with and forget to close the file handler, your changes may not be saved.
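You can check the automatic closing yourself with the closed attribute of the file handler. The sketch below writes to a hypothetical file created in a temporary folder, so that it can run on any computer.

```python
import os
import tempfile

# Hypothetical file in a temporary folder, used only for this example
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as f:
    f.write("first line\n")
    closed_inside = f.closed   # False: the file is still open inside the block

closed_after = f.closed        # True: leaving the with block closed the file

print(closed_inside, closed_after)
```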

Read a file

You can read a file all at once with the readlines() method, or line by line.

with open('/home/Toto/Data/example.txt', 'r') as f:
    content = f.readlines()
    print(content)

content is a list.

The whole file is read in one go. This can be useful for files with few lines.

The whole file is stored in a list. This should not be done with big files.

with open('/home/Toto/Data/example.txt', 'r') as f:
    for line in f:
        print(line)

The file is read line by line. This is the most appropriate method for large files.
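With both approaches, each line keeps its final newline character (\n), which is why print(line) shows an empty line between two lines. A minimal sketch, using a hypothetical two-line file created in a temporary folder, shows how rstrip() can remove it:

```python
import os
import tempfile

# Create a small hypothetical file so that the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("ATGC\nGGCC\n")

with open(path, "r") as f:
    all_lines = f.readlines()                  # each element still ends with \n

clean_lines = []
with open(path, "r") as f:
    for line in f:
        clean_lines.append(line.rstrip("\n"))  # remove the final newline

print(all_lines)
print(clean_lines)
```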

Write to a file

You can write to a file with the write() method.

with open('output.txt', 'w') as f:
    f.write('Something I want to write to my file.\n')

with open('output.txt', 'a') as f:
    f.write('Something I want to write to my file.\n')

Be careful which parameter you choose in open(), “a” or “w”:
- in writing mode, any previous content is deleted.
- in appending mode, the text is added to the end of the file.

The .write() method does not automatically add a newline (\n), unlike the print() function.
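The sketch below combines both points on a hypothetical file in a temporary folder: opening in "w" mode erases what was written before, "a" mode adds to the end, and two calls to write() without \n end up on the same line. The final read() call simply returns the whole file as one string.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, "w") as f:
    f.write("first\n")

with open(path, "w") as f:      # "w" again: "first" is erased
    f.write("second\n")

with open(path, "a") as f:      # "a": the text is added at the end
    f.write("third")            # no \n here...
    f.write("fourth\n")         # ...so this text lands on the same line

with open(path, "r") as f:
    content = f.read()

print(content)
```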

Summary of the I/O section

“I/O” stands for “Input/Output”.
To manage a file, we use a file handler, which can be created with the open() function.
You may open a file for:
- reading with: open('/home/Toto/Data/example.txt', 'r')
- writing with: open('/home/Toto/Data/example.txt', 'w') → overwrites the file
- appending with: open('/home/Toto/Data/example.txt', 'a') → adds text at the end of the file
There are 2 syntaxes for managing a file:

f = open('/home/Toto/Data/example.txt', 'r')
# do stuff with file
f.close() # do not forget to close the file!

with open('/home/Toto/Data/example.txt', 'r') as f:
    # do stuff with file
    # /!\ do not forget the indentation!

To read a file, you can use:

with open('/home/Toto/Data/example.txt', 'r') as f:
    content = f.readlines() # store all file content at once
    print(content)

with open('/home/Toto/Data/example.txt', 'r') as f:
    for line in f: # read file line by line
        print(line)

To write to a file, you can use:

with open('output.txt', 'w') as f:
    f.write('Something I want to write to my file.\n')

with open('output.txt', 'a') as f:
    f.write('Something I want to write to my file.\n')

Let’s practise

Please open file 008_practical_io.ipynb

Dataframes

Dataframes presentation

DataFrames are objects used to store tables of data, such as Excel tables.

In Python, several libraries can be used to manipulate dataframes; the most popular are pandas and Polars. In this training we will focus on pandas.

You can install pandas using pip:

pip install pandas

Then you may load it with an alias:

import pandas as pd

pd is a common alias used for pandas, but you could also simply write import pandas then just use the functions by calling pandas.function.

Dataframe creation

To create a dataframe, you can use pd.DataFrame, which creates a DataFrame object with various methods.

A simple way to initialise a dataframe is to use a dictionary.

import pandas as pd
grades_dict = {
  'names': ['Alphonse', 'Germaine', 'Célestine'],
  'math': [14, 17, 12],
  'history': [8, 14, 19],
  'music': [16, 15, 13]
}
school = pd.DataFrame(grades_dict)
print(school)

       names  math  history  music
0   Alphonse    14        8     16
1   Germaine    17       14     15
2  Célestine    12       19     13

The dictionary keys will become the column names in the dataframe.

The dictionary values are lists, each of which will become a column in the dataframe. They must all have the same length.

Please note that a column containing numbers starting from zero has been added. This column is called an index.
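These three points can be checked directly on the dataframe. The sketch below rebuilds the same school dataframe and looks at its columns, its shape and its index.

```python
import pandas as pd

grades_dict = {
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13]
}
school = pd.DataFrame(grades_dict)

column_names = list(school.columns)   # the dictionary keys
n_rows, n_columns = school.shape      # (number of rows, number of columns)
index_values = list(school.index)     # the automatically added index

print(column_names, n_rows, n_columns, index_values)
```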

Dataframe creation

You can also load a dataframe from a CSV / TSV or an Excel file. But with a large dataframe, you will need some functions and methods in order to manipulate it properly.

my_data = pd.read_csv("./Data/crops_data.csv")
print(my_data)

      farm_id       region crop_type  soil_moisture  soil_pH  temperature_C  \
0    FARM0001  North India     Wheat          35.95     5.99          17.79   
1    FARM0002    South USA   Soybean          19.74     7.24          30.18   
2    FARM0003    South USA     Wheat          29.32     7.16          27.37   
3    FARM0004  Central USA     Maize          17.33     6.03          33.73   
4    FARM0005  Central USA    Cotton          19.37     5.92          33.86   
..        ...          ...       ...            ...      ...            ...   
495  FARM0496  Central USA      Rice          42.85     6.70          30.85   
496  FARM0497  North India   Soybean          34.22     6.75          17.46   
497  FARM0498  North India    Cotton          15.93     5.72          17.03   
498  FARM0499          NaN   Soybean          38.61     6.20          17.08   
499  FARM0500  North India     Wheat          30.22     7.42          20.57   

     rainfall_mm  humidity  sunlight_hours irrigation_type  ... sowing_date  \
0          75.62     77.03            7.27             NaN  ...    01-08-24   
1          89.91     61.13            5.67       Sprinkler  ...    02-04-24   
2         265.43     68.87            8.23            Drip  ...    02-03-24   
3         212.01     70.46            5.03       Sprinkler  ...    02-21-24   
4         269.09     55.73            7.93             NaN  ...    02-05-24   
..           ...       ...             ...             ...  ...         ...   
495        52.35     79.58            7.25          Manual  ...    01-16-24   
496       256.23     45.14            5.78             NaN  ...    01-01-24   
497       288.96     57.87            7.69            Drip  ...    01-02-24   
498       279.06     73.09            9.60            Drip  ...    01-25-24   
499        72.61     89.74            5.09             NaN  ...    02-16-24   

     harvest_date total_days yield_kg_per_hectare  sensor_id  timestamp  \
0        05-09-24        122              4408.07   SENS0001   03-19-24   
1        05-26-24        112              5389.98   SENS0002   04-21-24   
2        06-26-24        144              2931.16   SENS0003   02-28-24   
3        07-04-24        134              4227.80   SENS0004   05-14-24   
4        05-20-24        105              4979.96   SENS0005   04-13-24   
..            ...        ...                  ...        ...        ...   
495      06-02-24        138              4251.40   SENS0496   05-08-24   
496      04-14-24        104              3708.54   SENS0497   01-19-24   
497      05-09-24        128              2604.41   SENS0498   04-20-24   
498      06-04-24        131              2586.36   SENS0499   03-02-24   
499      06-29-24        134              5891.40   SENS0500   05-11-24   

      latitude  longitude  NDVI_index  crop_disease_status  
0    14.970941  82.997689        0.63                 Mild  
1    16.613022  70.869009        0.58                  NaN  
2    19.503156  79.068206        0.80                 Mild  
3    31.071298  85.519998        0.44                  NaN  
4    16.568540  81.691720        0.84               Severe  
..         ...        ...         ...                  ...  
495  30.386623  76.147700        0.59                 Mild  
496  18.832748  75.736924        0.85               Severe  
497  23.262016  81.992230        0.71                 Mild  
498  19.764989  84.426869        0.77               Severe  
499  13.455532  88.880605        0.85               Severe  

[500 rows x 22 columns]

Dataframe structure

A dataframe is a two-dimensional object.

A column in a dataframe is an object of type Series. It is a one-dimensional object.
Thus, a dataframe is a collection of Series.

A Series can only contain one type of data, whereas a dataframe can contain columns of different types: a column of integers, a column of decimal numbers, etc.

Dataframe visualisation - head

In order to inspect your dataframe, you may need to see some rows. dataframe.head(n) returns the first n rows of the dataframe.
If n is not provided, the first 5 rows are returned.

import pandas as pd
my_data = pd.read_csv("./Data/crops_data.csv")
my_data.head(6)

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_C	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	sowing_date	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status
0	FARM0001	North India	Wheat	35.95	5.99	17.79	75.62	77.03	7.27	NaN	...	01-08-24	05-09-24	122	4408.07	SENS0001	03-19-24	14.970941	82.997689	0.63	Mild
1	FARM0002	South USA	Soybean	19.74	7.24	30.18	89.91	61.13	5.67	Sprinkler	...	02-04-24	05-26-24	112	5389.98	SENS0002	04-21-24	16.613022	70.869009	0.58	NaN
2	FARM0003	South USA	Wheat	29.32	7.16	27.37	265.43	68.87	8.23	Drip	...	02-03-24	06-26-24	144	2931.16	SENS0003	02-28-24	19.503156	79.068206	0.80	Mild
3	FARM0004	Central USA	Maize	17.33	6.03	33.73	212.01	70.46	5.03	Sprinkler	...	02-21-24	07-04-24	134	4227.80	SENS0004	05-14-24	31.071298	85.519998	0.44	NaN
4	FARM0005	Central USA	Cotton	19.37	5.92	33.86	269.09	55.73	7.93	NaN	...	02-05-24	05-20-24	105	4979.96	SENS0005	04-13-24	16.568540	81.691720	0.84	Severe
5	FARM0006	Central USA	Rice	44.91	5.78	24.87	238.95	83.06	4.92	Sprinkler	...	01-13-24	05-06-24	114	4383.55	SENS0006	03-12-24	23.227859	89.421568	0.82	NaN

6 rows × 22 columns

It allows you to view a sample of the data more clearly than printing the entire dataset (as you may have noticed in the previous slide, printing a whole dataframe can be unreadable).

Dataframe visualisation - tail

For the same purpose, dataframe.tail(n) returns the last n rows of the dataframe.
If n is not provided, the last 5 rows are returned.

my_data.tail()

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_C	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	sowing_date	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status
495	FARM0496	Central USA	Rice	42.85	6.70	30.85	52.35	79.58	7.25	Manual	...	01-16-24	06-02-24	138	4251.40	SENS0496	05-08-24	30.386623	76.147700	0.59	Mild
496	FARM0497	North India	Soybean	34.22	6.75	17.46	256.23	45.14	5.78	NaN	...	01-01-24	04-14-24	104	3708.54	SENS0497	01-19-24	18.832748	75.736924	0.85	Severe
497	FARM0498	North India	Cotton	15.93	5.72	17.03	288.96	57.87	7.69	Drip	...	01-02-24	05-09-24	128	2604.41	SENS0498	04-20-24	23.262016	81.992230	0.71	Mild
498	FARM0499	NaN	Soybean	38.61	6.20	17.08	279.06	73.09	9.60	Drip	...	01-25-24	06-04-24	131	2586.36	SENS0499	03-02-24	19.764989	84.426869	0.77	Severe
499	FARM0500	North India	Wheat	30.22	7.42	20.57	72.61	89.74	5.09	NaN	...	02-16-24	06-29-24	134	5891.40	SENS0500	05-11-24	13.455532	88.880605	0.85	Severe

5 rows × 22 columns

Dataframe visualisation - describe

One of the most powerful pandas methods is describe, which gives a statistical summary of all numeric variables.

As shown in the summary below, only quantitative variables are described by default.

my_data.describe()

	soil_moisture	soil_pH	temperature_C	rainfall_mm	humidity	sunlight_hours	pesticide_usage_ml	total_days	yield_kg_per_hectare	latitude	longitude	NDVI_index
count	497.000000	498.000000	499.000000	499.000000	497.000000	500.00000	500.000000	500.000000	499.000000	500.000000	499.000000	500.000000
mean	26.754789	6.525181	24.695130	181.872886	65.169618	7.03014	26.586980	119.496000	4032.258818	22.442473	80.403927	0.602060
std	10.122341	0.585128	5.336647	72.244299	14.655248	1.69167	13.202429	16.798046	1175.516477	7.283492	5.910818	0.175402
min	10.160000	5.510000	15.010000	50.170000	40.230000	4.01000	5.050000	90.000000	2023.560000	10.004243	70.020021	0.300000
25%	17.900000	6.030000	20.305000	119.760000	51.760000	5.66750	14.945000	105.750000	2994.750000	16.263202	75.380396	0.447500
50%	25.890000	6.530000	24.700000	192.360000	65.610000	6.99500	25.980000	119.000000	4070.970000	21.981743	80.669355	0.610000
75%	35.950000	7.040000	29.090000	239.120000	77.960000	8.47000	38.005000	134.000000	5066.060000	28.528948	85.656333	0.750000
max	44.980000	7.500000	34.840000	298.960000	90.000000	10.00000	49.940000	150.000000	5998.290000	34.981531	89.991901	0.900000
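Text columns are left out of this summary unless you ask for them. As a small sketch on a reduced version of the school dataframe from the creation example, the include='all' argument of describe adds the non-numeric columns to the summary.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
})

numeric_summary = school.describe()             # numeric columns only
full_summary = school.describe(include='all')   # text columns are summarised too

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```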

Dataframe visualisation - display one column

In a dataframe, each column is explicitly named, allowing you to access a specific column by its name.

The syntax for accessing a single column is: my_data['column_name'].

my_data['farm_id'].head()

0    FARM0001
1    FARM0002
2    FARM0003
3    FARM0004
4    FARM0005
Name: farm_id, dtype: object

Please note that selecting a single column returns a Series rather than a dataframe. A Series shares many of the usual dataframe methods (like head()), so you can still apply them to it.

Dataframe visualisation - display several columns

The syntax for accessing several columns at once is:
my_data[['column_name_1', 'column_name_2']].

my_data[['farm_id', 'region', 'crop_type']].head()

	farm_id	region	crop_type
0	FARM0001	North India	Wheat
1	FARM0002	South USA	Soybean
2	FARM0003	South USA	Wheat
3	FARM0004	Central USA	Maize
4	FARM0005	Central USA	Cotton

Please note the double pairs of brackets [[]] when displaying several columns.
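The number of brackets also changes the type of object you get back. A minimal sketch on a reduced version of the school dataframe from the creation example:

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
})

one_column = school['names']              # single brackets: a Series
one_column_df = school[['names']]         # double brackets: a dataframe with one column
two_columns = school[['names', 'math']]   # double brackets: a dataframe with two columns

print(type(one_column).__name__, type(one_column_df).__name__, two_columns.shape)
```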

Dataframe visualisation - selection via the index with `.iloc`

The .iloc method allows you to select a subset of your dataframe based on positions.

You must specify which rows and which columns you want to select, in this order and separated with a comma.
Syntax: my_data.iloc[row_index, column_index]. You can use a colon (:) to select a range.

my_data.iloc[0:5,1:10]    # selects rows 0 to 4 and columns 1 to 9

	region	crop_type	soil_moisture	soil_pH	temperature_C	rainfall_mm	humidity	sunlight_hours	irrigation_type
0	North India	Wheat	35.95	5.99	17.79	75.62	77.03	7.27	NaN
1	South USA	Soybean	19.74	7.24	30.18	89.91	61.13	5.67	Sprinkler
2	South USA	Wheat	29.32	7.16	27.37	265.43	68.87	8.23	Drip
3	Central USA	Maize	17.33	6.03	33.73	212.01	70.46	5.03	Sprinkler
4	Central USA	Cotton	19.37	5.92	33.86	269.09	55.73	7.93	NaN

As with lists, you can use ::n to specify a step of n.

my_data.iloc[0::150,0::3]    # selects all rows with a step of 150 and all columns with a step of 3

	farm_id	soil_moisture	rainfall_mm	irrigation_type	sowing_date	yield_kg_per_hectare	latitude	crop_disease_status
0	FARM0001	35.95	75.62	NaN	01-08-24	4408.07	14.970941	Mild
150	FARM0151	28.82	69.76	Sprinkler	03-21-24	5338.11	17.754237	Mild
300	FARM0301	28.32	207.67	Manual	02-18-24	2043.13	22.816578	Severe
450	FARM0451	10.22	74.22	NaN	02-06-24	3498.61	13.358302	NaN

Brainstorming time

Which code allows access to the last 5 lines of the first 3 columns of a dataframe?

my_data.iloc[5:, :3]

my_data.iloc[-5:, :3]

my_data.iloc[-5:, :4]

This will display the first 3 columns for all rows except the first 5.

my_data.iloc[5:, :3]

	farm_id	region	crop_type
5	FARM0006	Central USA	Rice
6	FARM0007	North India	Soybean
7	FARM0008	East Africa	Maize
8	FARM0009	Central USA	Soybean
9	FARM0010	East Africa	Rice
...	...	...	...
495	FARM0496	Central USA	Rice
496	FARM0497	North India	Soybean
497	FARM0498	North India	Cotton
498	FARM0499	NaN	Soybean
499	FARM0500	North India	Wheat

495 rows × 3 columns

This is the right answer.

my_data.iloc[-5:, :3]

	farm_id	region	crop_type
495	FARM0496	Central USA	Rice
496	FARM0497	North India	Soybean
497	FARM0498	North India	Cotton
498	FARM0499	NaN	Soybean
499	FARM0500	North India	Wheat

This will display the last 5 lines of the first 4 columns.

my_data.iloc[-5:, :4]

	farm_id	region	crop_type	soil_moisture
495	FARM0496	Central USA	Rice	42.85
496	FARM0497	North India	Soybean	34.22
497	FARM0498	North India	Cotton	15.93
498	FARM0499	NaN	Soybean	38.61
499	FARM0500	North India	Wheat	30.22

Brainstorming time

Which code allows access to all rows of the third, fourth and fifth columns?

my_data.iloc[:, 2:5]

my_data.iloc[:, 3:5]

my_data.iloc[3:6, :]

This is the right answer. Don’t forget that the numbering starts at zero!

my_data.iloc[:, 2:5]

	crop_type	soil_moisture	soil_pH
0	Wheat	35.95	5.99
1	Soybean	19.74	7.24
2	Wheat	29.32	7.16
3	Maize	17.33	6.03
4	Cotton	19.37	5.92
...	...	...	...
495	Rice	42.85	6.70
496	Soybean	34.22	6.75
497	Cotton	15.93	5.72
498	Soybean	38.61	6.20
499	Wheat	30.22	7.42

500 rows × 3 columns

This will only display the columns at positions 3 and 4, i.e. the fourth and fifth columns (position 5 is excluded).

my_data.iloc[:, 3:5]

	soil_moisture	soil_pH
0	35.95	5.99
1	19.74	7.24
2	29.32	7.16
3	17.33	6.03
4	19.37	5.92
...	...	...
495	42.85	6.70
496	34.22	6.75
497	15.93	5.72
498	38.61	6.20
499	30.22	7.42

500 rows × 2 columns

This will display all columns for the rows at positions 3 to 5.

my_data.iloc[3:6, :]

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_C	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	sowing_date	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status
3	FARM0004	Central USA	Maize	17.33	6.03	33.73	212.01	70.46	5.03	Sprinkler	...	02-21-24	07-04-24	134	4227.80	SENS0004	05-14-24	31.071298	85.519998	0.44	NaN
4	FARM0005	Central USA	Cotton	19.37	5.92	33.86	269.09	55.73	7.93	NaN	...	02-05-24	05-20-24	105	4979.96	SENS0005	04-13-24	16.568540	81.691720	0.84	Severe
5	FARM0006	Central USA	Rice	44.91	5.78	24.87	238.95	83.06	4.92	Sprinkler	...	01-13-24	05-06-24	114	4383.55	SENS0006	03-12-24	23.227859	89.421568	0.82	NaN

3 rows × 22 columns
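To practise these rules without the crops file, the same kinds of selection can be tried on a small dataframe. The sketch below reuses a reduced version of the school dataframe from the creation example.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
})

last_two = school.iloc[-2:, :2]     # last 2 rows, first 2 columns
every_other = school.iloc[::2, :]   # every second row, all columns
single_value = school.iloc[0, 1]    # one cell: first row, second column

print(last_two)
print(every_other)
print(single_value)
```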

Dataframe visualisation - selection via the labels with `.loc`

The .loc method allows you to select a subset of your dataframe based on labels (row or column names).

You must specify which rows and which columns you want to select, in this order and separated with a comma.
Syntax: my_data.loc[row_names, column_names].

To select only rows that meet a certain condition on the content of a column:
my_data['column'] <operator> value

Here, <operator> stands for a comparison operator like ==, >=, !=, etc.

To select some columns, write the names of the columns of interest in a list.

Dataframe visualisation - selection via the labels with `.loc`

Example: To select all rows (or all columns), use a colon (:) in the first (or second) position as an argument given to loc.

my_data.loc[:,["farm_id","region","soil_moisture"]].head()

	farm_id	region	soil_moisture
0	FARM0001	North India	35.95
1	FARM0002	South USA	19.74
2	FARM0003	South USA	29.32
3	FARM0004	Central USA	17.33
4	FARM0005	Central USA	19.37

The following example selects the columns “farm_id” and “crop_type” for all rows where crop_type is “Wheat”:

my_data.loc[my_data["crop_type"] == "Wheat", ["farm_id", "crop_type"]].head()

	farm_id	crop_type
0	FARM0001	Wheat
2	FARM0003	Wheat
10	FARM0011	Wheat
17	FARM0018	Wheat
40	FARM0041	Wheat

Brainstorming time

Which code allows access to the columns “soil_moisture”, “soil_pH” and “temperature_C” for all farms in the “North India” region?

ANSWER:

my_data.loc[my_data['region'] == 'North India', ["soil_moisture","soil_pH","temperature_C"]]

	soil_moisture	soil_pH	temperature_C
0	35.95	5.99	17.79
6	36.28	7.04	21.80
13	12.80	5.87	26.90
20	16.25	7.43	20.31
31	39.76	6.70	17.42
...	...	...	...
491	32.14	7.44	21.49
494	12.52	5.99	33.18
496	34.22	6.75	17.46
497	15.93	5.72	17.03
499	30.22	7.42	20.57

99 rows × 3 columns

Dataframe manipulation - modifying a column

The syntax my_data['column_name'] not only allows you to access a column in a dataframe, but also to modify it.

my_data['humidity'] = my_data['humidity'] / 100    # converts humidity from a percentage to a proportion between 0 and 1
my_data.iloc[0:5, 0:10]    # checks that the dataframe has been modified in-place

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_C	rainfall_mm	humidity	sunlight_hours	irrigation_type
0	FARM0001	North India	Wheat	35.95	5.99	17.79	75.62	0.7703	7.27	NaN
1	FARM0002	South USA	Soybean	19.74	7.24	30.18	89.91	0.6113	5.67	Sprinkler
2	FARM0003	South USA	Wheat	29.32	7.16	27.37	265.43	0.6887	8.23	Drip
3	FARM0004	Central USA	Maize	17.33	6.03	33.73	212.01	0.7046	5.03	Sprinkler
4	FARM0005	Central USA	Cotton	19.37	5.92	33.86	269.09	0.5573	7.93	NaN

Dataframe manipulation - adding a column

If my_data['column_name'] does not already exist, it will be created on the fly.

my_data['temperature_F'] = my_data['temperature_C'] * 9/5 + 32
my_data.loc[0:5, ['temperature_C', 'temperature_F']]

	temperature_C	temperature_F
0	17.79	64.022
1	30.18	86.324
2	27.37	81.266
3	33.73	92.714
4	33.86	92.948
5	24.87	76.766
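Note that the output above contains six rows: with .loc, a range of labels includes its end, whereas with .iloc the end position is excluded. A short sketch on a reduced version of the school dataframe from the creation example shows the difference.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
})

with_loc = school.loc[0:1, ['names']]   # labels 0 and 1: the end is included
with_iloc = school.iloc[0:1, [0]]       # position 0 only: the end is excluded

print(len(with_loc), len(with_iloc))
```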

Dataframe manipulation - deleting a column

There are three common ways to delete a column.

my_data['id'] = my_data['farm_id']    # create a column that will be deleted on the next line
my_data = my_data.drop(columns='id')

Unlike the following two options, the drop method does not modify the existing dataframe; it returns a copy of the dataframe with the changes applied. You therefore need to assign the result back to your variable, as in the example above.

my_data['id'] = my_data['farm_id']    # create a column that will be deleted on the next line
del my_data['id']

my_data['id'] = my_data['farm_id']    # create a column that will be deleted on the next line
my_data.pop('id')

The pop method returns the deleted column, which can be assigned to a variable with col = my_data.pop('id').

my_data.head(0)    # checks that the 'id' column has been deleted

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_C	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_F

0 rows × 23 columns
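The difference between the three options is easier to see on a small dataframe. The sketch below rebuilds the school dataframe from the creation example; the variable names dropped, still_there and remaining are chosen for this example.

```python
import pandas as pd

school = pd.DataFrame({
    'names': ['Alphonse', 'Germaine', 'Célestine'],
    'math': [14, 17, 12],
    'history': [8, 14, 19],
    'music': [16, 15, 13],
})

dropped = school.drop(columns='music')     # returns a new dataframe
still_there = 'music' in school.columns    # school itself is unchanged

del school['music']                        # modifies school directly
history = school.pop('history')            # removes the column and returns it
remaining = list(school.columns)

print(still_there, remaining, history.tolist())
```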

Dataframe manipulation - renaming a column

The rename method allows you to rename one or more columns at a time using the following syntax:
my_dataframe.rename(columns={'old name': 'new name'})

The rename method does not modify the existing dataframe, unless the inplace = True argument is used.
The two following syntaxes are equivalent:
- my_dataframe.rename(columns={'old name': 'new name'}, inplace = True)
- my_dataframe = my_dataframe.rename(columns={'old name': 'new name'})

my_data.rename(columns={'temperature_C': 'temperature_Celsius', 'temperature_F': 'temperature_Fahrenheit'}, inplace = True)

my_data.head(0)    # checks that the columns have been renamed

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit

0 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to one column

The sort_values method allows you to sort a dataframe according to one or more columns specified in parentheses.

my_data.sort_values('temperature_Celsius', inplace = True)
my_data.head(10)

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
22	FARM0023	East Africa	Soybean	20.53	6.60	15.01	121.73	0.6149	7.48	Manual	...	07-07-24	122	3892.74	SENS0023	06-30-24	33.995800	84.719229	0.70	Moderate	59.018
478	FARM0479	Central USA	Maize	26.91	6.03	15.04	207.79	0.5968	9.54	Drip	...	05-24-24	112	2023.56	SENS0479	03-31-24	18.213795	77.077855	0.30	Mild	59.072
24	FARM0025	South USA	Cotton	18.54	6.81	15.11	237.74	0.7850	4.64	NaN	...	07-14-24	119	2200.87	SENS0025	04-17-24	32.936750	72.427172	0.38	Severe	59.198
419	FARM0420	South USA	Rice	38.91	5.51	15.20	139.47	0.6773	4.85	NaN	...	05-03-24	91	2796.49	SENS0420	03-09-24	14.353665	87.707645	0.73	Moderate	59.360
435	FARM0436	Central USA	Cotton	39.95	6.29	15.21	78.67	0.8586	5.96	NaN	...	05-29-24	132	2969.17	SENS0436	05-09-24	13.506394	86.408534	0.80	Mild	59.378
197	FARM0198	South India	Soybean	41.22	6.73	15.23	283.59	0.6528	6.82	NaN	...	05-26-24	105	3323.58	SENS0198	03-17-24	11.258768	74.454130	0.69	Severe	59.414
323	FARM0324	South USA	Cotton	18.42	6.62	15.25	232.95	0.8750	4.80	Manual	...	04-30-24	120	4676.14	SENS0324	01-21-24	27.582612	87.158442	0.75	NaN	59.450
58	FARM0059	South India	Wheat	33.14	5.55	15.30	247.50	0.5190	5.94	Sprinkler	...	07-15-24	123	2454.60	SENS0059	03-22-24	21.906149	85.560341	0.61	NaN	59.540
29	FARM0030	Central USA	Cotton	18.83	5.66	15.39	184.85	0.9000	6.10	Drip	...	04-19-24	102	5356.92	SENS0030	03-27-24	13.809559	72.524419	0.70	Mild	59.702
442	FARM0443	East Africa	Cotton	32.68	6.08	15.47	261.73	0.5656	5.45	Drip	...	08-06-24	136	2889.78	SENS0443	06-28-24	23.036798	73.670909	0.68	NaN	59.846

10 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to one column

By default, the column is sorted in ascending order.
Use the ascending = False argument to sort in descending order.

my_data.sort_values('rainfall_mm', ascending = False, inplace = True)
my_data.head(10)

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
274	FARM0275	East Africa	Wheat	25.81	7.15	15.85	298.96	0.6594	6.37	Drip	...	06-12-24	102	3164.72	SENS0275	04-21-24	32.109939	85.473540	0.47	Mild	60.530
332	FARM0333	Central USA	Cotton	41.36	7.44	30.08	298.52	0.7334	8.80	NaN	...	08-10-24	147	2160.32	SENS0333	08-04-24	12.921902	70.495912	0.67	Mild	86.144
186	FARM0187	East Africa	Maize	24.46	7.24	18.02	298.09	0.5713	9.92	NaN	...	07-23-24	139	2323.25	SENS0187	05-30-24	25.775819	73.536485	0.68	NaN	64.436
266	FARM0267	East Africa	Soybean	36.26	6.60	27.46	298.08	0.7475	8.01	NaN	...	06-16-24	106	2681.28	SENS0267	04-07-24	15.017401	83.930534	0.46	Severe	81.428
347	FARM0348	North India	Maize	44.13	6.18	26.90	297.67	0.4614	9.03	NaN	...	07-04-24	107	5025.21	SENS0348	06-29-24	26.095779	78.004711	0.59	Severe	80.420
7	FARM0008	East Africa	Maize	27.10	5.72	22.26	296.33	0.8034	5.44	Sprinkler	...	05-24-24	121	5264.09	SENS0008	04-30-24	23.317654	72.515210	0.70	Mild	72.068
230	FARM0231	South India	Maize	12.80	5.58	22.69	296.11	0.7070	7.13	Drip	...	05-13-24	102	5402.27	SENS0231	05-13-24	22.953832	73.894930	0.77	Mild	72.842
31	FARM0032	North India	Maize	39.76	6.70	17.42	295.96	0.7913	6.08	NaN	...	07-10-24	111	2050.61	SENS0032	05-13-24	30.558273	72.110777	0.88	Severe	63.356
408	FARM0409	East Africa	Maize	23.54	7.18	31.24	295.95	0.4624	6.22	Sprinkler	...	07-17-24	138	3124.54	SENS0409	05-31-24	14.787792	86.325616	0.68	Mild	88.232
259	FARM0260	Central USA	Cotton	25.66	6.29	29.53	295.74	0.6979	7.11	Manual	...	05-30-24	144	3259.62	SENS0260	03-17-24	32.977802	80.225430	0.64	Mild	85.154

10 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to several columns

Use the following syntax to sort by column A and then column B:
my_data.sort_values(['column A', 'column B']).

my_data.sort_values(['region', 'crop_type'], inplace = True)
my_data.head(10)

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
332	FARM0333	Central USA	Cotton	41.36	7.44	30.08	298.52	0.7334	8.80	NaN	...	08-10-24	147	2160.32	SENS0333	08-04-24	12.921902	70.495912	0.67	Mild	86.144
259	FARM0260	Central USA	Cotton	25.66	6.29	29.53	295.74	0.6979	7.11	Manual	...	05-30-24	144	3259.62	SENS0260	03-17-24	32.977802	80.225430	0.64	Mild	85.154
28	FARM0029	Central USA	Cotton	35.35	7.18	33.39	295.18	0.6671	9.44	Drip	...	05-26-24	119	2726.92	SENS0029	03-01-24	19.477597	74.233206	0.50	Severe	92.102
4	FARM0005	Central USA	Cotton	19.37	5.92	33.86	269.09	0.5573	7.93	NaN	...	05-20-24	105	4979.96	SENS0005	04-13-24	16.568540	81.691720	0.84	Severe	92.948
132	FARM0133	Central USA	Cotton	13.71	5.70	19.44	236.71	0.6790	8.13	Sprinkler	...	07-07-24	133	4354.36	SENS0133	07-02-24	13.768623	89.954055	0.59	Moderate	66.992
288	FARM0289	Central USA	Cotton	41.12	5.71	30.32	236.39	0.4112	8.55	Sprinkler	...	06-12-24	124	3276.60	SENS0289	04-05-24	26.778101	75.453084	0.39	NaN	86.576
217	FARM0218	Central USA	Cotton	15.90	6.13	30.71	228.05	0.7204	5.66	Manual	...	06-14-24	119	3781.43	SENS0218	05-25-24	17.636795	81.033437	0.41	NaN	87.278
458	FARM0459	Central USA	Cotton	41.86	6.99	29.50	213.48	0.7925	9.80	NaN	...	04-14-24	95	2445.53	SENS0459	02-22-24	28.514530	88.744213	0.75	NaN	85.100
191	FARM0192	Central USA	Cotton	33.16	6.82	20.40	201.41	0.4686	8.98	Drip	...	04-16-24	97	5139.04	SENS0192	03-13-24	14.966167	73.994988	0.46	Mild	68.720
37	FARM0038	Central USA	Cotton	13.99	5.63	24.83	194.26	0.7432	4.91	Manual	...	06-04-24	138	3664.70	SENS0038	03-15-24	29.392338	77.607561	0.85	Moderate	76.694

10 rows × 23 columns

Dataframe manipulation - sorting a dataframe according to several columns

If you want to sort one column in ascending order and the other in descending order, you must provide a list of Booleans to the ascending parameter.

my_data.sort_values(['region', 'crop_type'], ascending = [True, False], inplace = True)
my_data.head(10)

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
216	FARM0217	Central USA	Wheat	18.77	5.89	26.61	287.88	0.5786	8.03	Sprinkler	...	04-26-24	110	3943.44	SENS0217	02-18-24	25.408561	76.113510	0.65	Moderate	79.898
54	FARM0055	Central USA	Wheat	33.62	6.44	27.39	285.79	0.5640	7.66	Drip	...	07-19-24	131	3633.18	SENS0055	04-14-24	11.133670	70.744243	0.90	NaN	81.302
376	FARM0377	Central USA	Wheat	39.12	6.53	24.79	271.35	0.6382	7.38	NaN	...	05-31-24	101	3736.42	SENS0377	04-14-24	12.323687	80.266829	0.88	Mild	76.622
296	FARM0297	Central USA	Wheat	30.40	6.72	25.21	261.91	0.8263	4.37	Drip	...	06-28-24	145	3128.84	SENS0297	02-11-24	15.881029	84.044438	0.54	Moderate	77.378
251	FARM0252	Central USA	Wheat	15.86	6.05	17.39	247.29	0.4045	4.25	NaN	...	05-30-24	95	2994.89	SENS0252	04-28-24	12.285039	82.372897	0.86	NaN	63.302
492	FARM0493	Central USA	Wheat	28.81	7.46	30.56	245.13	0.4532	8.47	NaN	...	07-27-24	128	4203.51	SENS0493	07-12-24	15.515976	75.375870	0.65	Severe	87.008
111	FARM0112	Central USA	Wheat	16.25	6.57	25.58	231.96	0.5113	4.02	Drip	...	07-13-24	117	4127.73	SENS0112	07-01-24	15.741602	79.212506	0.39	Mild	78.044
481	FARM0482	Central USA	Wheat	24.74	6.60	31.00	228.58	0.5624	8.59	NaN	...	08-16-24	142	3555.39	SENS0482	04-24-24	33.941965	85.854259	0.38	Moderate	87.800
315	FARM0316	Central USA	Wheat	14.23	5.78	23.30	224.07	0.6767	6.63	Drip	...	07-12-24	114	5110.65	SENS0316	03-22-24	31.990674	71.614452	0.30	NaN	73.940
81	FARM0082	Central USA	Wheat	22.50	5.64	19.82	214.28	0.4518	7.49	Manual	...	07-31-24	142	4571.18	SENS0082	07-07-24	34.520480	79.570623	0.41	Mild	67.676

10 rows × 23 columns

Dataframe filtering - simple condition

Conditions can be applied to a dataframe to select the rows that fulfil them, optionally together with a subset of columns.

Several syntaxes are possible. Some examples are presented below.

Print only the rows with temperatures higher than 20°C:

my_data[my_data["temperature_Celsius"] > 20]
my_data.loc[my_data['temperature_Celsius'] > 20, :]    # the two syntaxes will return the same result

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
216	FARM0217	Central USA	Wheat	18.77	5.89	26.61	287.88	0.5786	8.03	Sprinkler	...	04-26-24	110	3943.44	SENS0217	02-18-24	25.408561	76.113510	0.65	Moderate	79.898
54	FARM0055	Central USA	Wheat	33.62	6.44	27.39	285.79	0.5640	7.66	Drip	...	07-19-24	131	3633.18	SENS0055	04-14-24	11.133670	70.744243	0.90	NaN	81.302
376	FARM0377	Central USA	Wheat	39.12	6.53	24.79	271.35	0.6382	7.38	NaN	...	05-31-24	101	3736.42	SENS0377	04-14-24	12.323687	80.266829	0.88	Mild	76.622
296	FARM0297	Central USA	Wheat	30.40	6.72	25.21	261.91	0.8263	4.37	Drip	...	06-28-24	145	3128.84	SENS0297	02-11-24	15.881029	84.044438	0.54	Moderate	77.378
492	FARM0493	Central USA	Wheat	28.81	7.46	30.56	245.13	0.4532	8.47	NaN	...	07-27-24	128	4203.51	SENS0493	07-12-24	15.515976	75.375870	0.65	Severe	87.008
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
61	FARM0062	South USA	Cotton	35.54	6.13	21.25	113.72	0.8768	8.12	NaN	...	04-29-24	117	2386.26	SENS0062	04-06-24	19.316171	72.602309	0.34	Moderate	70.250
462	FARM0463	South USA	Cotton	20.47	7.13	33.25	89.80	0.4108	6.35	NaN	...	06-19-24	115	3564.25	SENS0463	05-02-24	28.168865	75.647282	0.85	Severe	91.850
396	FARM0397	South USA	Cotton	14.53	6.91	32.27	79.51	0.5063	6.84	Sprinkler	...	06-12-24	144	3634.48	SENS0397	02-19-24	19.239649	75.791812	0.61	NaN	90.086
310	FARM0311	South USA	Cotton	24.14	6.96	31.25	67.52	0.5129	6.31	NaN	...	06-23-24	133	3211.31	SENS0311	02-25-24	25.806813	89.176478	0.35	Moderate	88.250
449	FARM0450	NaN	Rice	39.04	6.01	21.04	291.92	0.7292	6.30	Manual	...	07-01-24	95	2437.10	SENS0450	06-09-24	29.417278	76.887856	0.39	Mild	69.872

380 rows × 23 columns

Print only the rows where the region is ‘South USA’:

my_data[my_data['region'] == 'South USA']
my_data.loc[my_data['region'] == 'South USA', :]    # the two syntaxes will return the same result

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
460	FARM0461	South USA	Wheat	28.84	6.42	17.89	285.72	0.8186	7.00	NaN	...	06-04-24	132	5396.51	SENS0461	02-01-24	24.972008	76.177829	0.89	NaN	64.202
443	FARM0444	South USA	Wheat	43.38	5.60	34.84	284.57	0.4628	5.04	Sprinkler	...	06-22-24	119	3245.85	SENS0444	05-26-24	14.938407	78.480336	0.66	Moderate	94.712
127	FARM0128	South USA	Wheat	20.21	6.28	16.69	275.28	0.8526	9.87	Sprinkler	...	04-27-24	109	3073.63	SENS0128	01-27-24	11.581679	78.693525	0.55	Severe	62.042
2	FARM0003	South USA	Wheat	29.32	7.16	27.37	265.43	0.6887	8.23	Drip	...	06-26-24	144	2931.16	SENS0003	02-28-24	19.503156	79.068206	0.80	Mild	81.266
276	FARM0277	South USA	Wheat	18.75	6.88	33.14	249.12	0.7592	4.74	Drip	...	06-05-24	123	4829.12	SENS0277	05-06-24	29.776665	80.233329	0.87	Severe	91.652
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
61	FARM0062	South USA	Cotton	35.54	6.13	21.25	113.72	0.8768	8.12	NaN	...	04-29-24	117	2386.26	SENS0062	04-06-24	19.316171	72.602309	0.34	Moderate	70.250
462	FARM0463	South USA	Cotton	20.47	7.13	33.25	89.80	0.4108	6.35	NaN	...	06-19-24	115	3564.25	SENS0463	05-02-24	28.168865	75.647282	0.85	Severe	91.850
340	FARM0341	South USA	Cotton	21.91	7.32	17.05	88.64	0.5106	5.13	Drip	...	05-25-24	142	2524.93	SENS0341	01-08-24	21.987956	76.231469	0.52	Severe	62.690
396	FARM0397	South USA	Cotton	14.53	6.91	32.27	79.51	0.5063	6.84	Sprinkler	...	06-12-24	144	3634.48	SENS0397	02-19-24	19.239649	75.791812	0.61	NaN	90.086
310	FARM0311	South USA	Cotton	24.14	6.96	31.25	67.52	0.5129	6.31	NaN	...	06-23-24	133	3211.31	SENS0311	02-25-24	25.806813	89.176478	0.35	Moderate	88.250

93 rows × 23 columns

Dataframe filtering - complex conditions

Several conditions can be combined using & (meaning and) or | (meaning or). Each condition must be wrapped in its own parentheses.

Print only the rows with temperatures higher than 20°C and more than 7 hours of sunlight.

my_data[ (my_data["temperature_Celsius"] > 20 ) & (my_data["sunlight_hours"] > 7) ].head()
my_data.loc[(my_data['temperature_Celsius'] > 20) & (my_data["sunlight_hours"] > 7), :].head()
# the two syntaxes will return the same result

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
216	FARM0217	Central USA	Wheat	18.77	5.89	26.61	287.88	0.5786	8.03	Sprinkler	...	04-26-24	110	3943.44	SENS0217	02-18-24	25.408561	76.113510	0.65	Moderate	79.898
54	FARM0055	Central USA	Wheat	33.62	6.44	27.39	285.79	0.5640	7.66	Drip	...	07-19-24	131	3633.18	SENS0055	04-14-24	11.133670	70.744243	0.90	NaN	81.302
376	FARM0377	Central USA	Wheat	39.12	6.53	24.79	271.35	0.6382	7.38	NaN	...	05-31-24	101	3736.42	SENS0377	04-14-24	12.323687	80.266829	0.88	Mild	76.622
492	FARM0493	Central USA	Wheat	28.81	7.46	30.56	245.13	0.4532	8.47	NaN	...	07-27-24	128	4203.51	SENS0493	07-12-24	15.515976	75.375870	0.65	Severe	87.008
481	FARM0482	Central USA	Wheat	24.74	6.60	31.00	228.58	0.5624	8.59	NaN	...	08-16-24	142	3555.39	SENS0482	04-24-24	33.941965	85.854259	0.38	Moderate	87.800

5 rows × 23 columns
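
As a complement, here is a minimal sketch of the | (or) operator, together with two related tools that are not shown in the slides above: the ~ (not) operator and the .isin() method. The small toy dataframe is invented for illustration; only its column names are borrowed from the course dataset.

```python
import pandas as pd

# Hypothetical miniature table, not the course dataset
toy = pd.DataFrame({
    "region": ["South USA", "Central USA", "East Africa", "South USA"],
    "temperature_Celsius": [18.5, 26.6, 24.0, 33.2],
    "sunlight_hours": [7.5, 8.0, 4.2, 6.3],
})

# OR: keep rows that are hot OR sunny (each condition in its own parentheses)
hot_or_sunny = toy[(toy["temperature_Celsius"] > 30) | (toy["sunlight_hours"] > 7)]

# NOT: keep rows whose region is not 'South USA'
not_south = toy[~(toy["region"] == "South USA")]

# isin(): keep rows whose region belongs to a list of accepted values
usa_only = toy[toy["region"].isin(["South USA", "Central USA"])]

print(hot_or_sunny)
print(not_south)
print(usa_only)
```

Without the parentheses, Python evaluates & and | before the comparisons, which usually ends in an error.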

Dataframe filtering and modifying

It is possible to modify only certain cells in a column, depending on their value.

This must be done using .loc.

Examples:

my_data.loc[my_data['region'] == 'North India', ['region']] = 'India_North'
my_data.loc[my_data['region'] == 'India_North', :].head()

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
366	FARM0367	India_North	Wheat	42.31	6.79	27.53	276.71	0.8871	5.19	Sprinkler	...	06-05-24	106	2597.00	SENS0367	04-07-24	14.253072	81.344858	0.31	Severe	81.554
260	FARM0261	India_North	Wheat	26.11	5.81	20.30	272.41	0.5249	5.54	Manual	...	07-21-24	136	2308.81	SENS0261	05-29-24	29.822605	73.458050	0.80	NaN	68.540
112	FARM0113	India_North	Wheat	38.33	6.34	30.32	270.94	0.4078	5.24	Drip	...	08-04-24	135	5488.85	SENS0113	05-23-24	28.513527	78.045307	0.44	Mild	86.576
392	FARM0393	India_North	Wheat	28.81	6.28	29.38	269.97	0.6602	7.24	Sprinkler	...	06-07-24	111	5028.19	SENS0393	03-08-24	10.585544	87.806387	0.62	Moderate	84.884
314	FARM0315	India_North	Wheat	27.40	7.10	19.41	251.11	0.6131	8.87	NaN	...	07-21-24	124	2549.32	SENS0315	07-04-24	34.117310	74.264637	0.33	Mild	66.938

5 rows × 23 columns

my_data.loc[my_data['region'] == 'South India', ['region']] = 'India_South'
my_data.loc[my_data['region'] == 'India_South', :].head()

	farm_id	region	crop_type	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
298	FARM0299	India_South	Wheat	14.80	7.11	32.20	273.40	0.7477	9.58	Sprinkler	...	07-23-24	118	3538.86	SENS0299	04-24-24	14.644776	82.091465	0.65	Mild	89.960
198	FARM0199	India_South	Wheat	26.07	7.10	23.96	264.15	0.6235	4.71	NaN	...	05-08-24	116	2143.33	SENS0199	03-23-24	10.004243	71.817911	0.66	Moderate	75.128
278	FARM0279	India_South	Wheat	31.79	6.01	24.17	263.85	0.6718	4.03	NaN	...	07-13-24	119	3640.61	SENS0279	03-26-24	25.030414	70.131460	0.83	Moderate	75.506
58	FARM0059	India_South	Wheat	33.14	5.55	15.30	247.50	0.5190	5.94	Sprinkler	...	07-15-24	123	2454.60	SENS0059	03-22-24	21.906149	85.560341	0.61	NaN	59.540
69	FARM0070	India_South	Wheat	15.13	5.89	27.05	240.05	0.7278	5.06	NaN	...	07-26-24	133	5696.62	SENS0070	06-05-24	31.606172	82.544348	0.39	NaN	80.690

5 rows × 23 columns
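
The same pattern can be tried on a very small table. This is only a sketch on invented values: the condition selects the rows, the second argument of .loc selects the column to modify, and that column does not have to be the one used in the condition. The 250 mm cap is an arbitrary threshold chosen for the example.

```python
import pandas as pd

# Hypothetical miniature table, not the course dataset
toy = pd.DataFrame({
    "region": ["North India", "South India", "North India"],
    "rainfall_mm": [120.0, 80.0, 300.0],
})

# Rename a value in the column that is used for the condition
toy.loc[toy["region"] == "North India", "region"] = "India_North"

# Modify a column based on its own values: cap rainfall at 250 mm
toy.loc[toy["rainfall_mm"] > 250, "rainfall_mm"] = 250.0

print(toy)
```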

Aggregation

The .groupby() method splits the rows of a dataframe into groups, according to the values of one or several columns. You can then perform a calculation on each group in order to return a single value per group.

Example 1: Calculate the sum of rainfall for each region in the dataframe:

my_data.groupby('region')['rainfall_mm'].sum()

region
Central USA    19014.33
East Africa    19734.47
India_North    18636.89
India_South    16972.94
South USA      15824.96
Name: rainfall_mm, dtype: float64
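
To make the grouping step more concrete, the sketch below compares .groupby() with a manual loop that filters the rows of each group and sums them. The toy values and the region names 'A' and 'B' are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature table, not the course dataset
toy = pd.DataFrame({
    "region": ["A", "B", "A", "B", "A"],
    "rainfall_mm": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One line with groupby: one sum per region
totals = toy.groupby("region")["rainfall_mm"].sum()

# The same result computed by hand, one group at a time
manual = {}
for region in toy["region"].unique():
    rows_of_region = toy.loc[toy["region"] == region, "rainfall_mm"]
    manual[region] = float(rows_of_region.sum())

print(totals)
print(manual)
```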

Aggregation

Example 2: Count the number of farms growing each type of crop in each region in the dataframe.

my_data.groupby(['region', 'crop_type']).count()

		farm_id	soil_moisture	soil_pH	temperature_Celsius	rainfall_mm	humidity	sunlight_hours	irrigation_type	fertilizer_type	pesticide_usage_ml	...	harvest_date	total_days	yield_kg_per_hectare	sensor_id	timestamp	latitude	longitude	NDVI_index	crop_disease_status	temperature_Fahrenheit
region	crop_type
Central USA	Cotton	26	26	26	26	26	26	26	17	26	26	...	26	26	26	26	26	26	26	26	21	26
	Maize	21	20	21	21	20	20	21	17	21	21	...	21	21	21	21	21	21	21	21	14	21
	Rice	18	18	18	18	18	18	18	13	17	18	...	18	18	18	18	18	18	18	18	13	18
	Soybean	26	26	25	26	26	26	26	20	26	26	...	26	26	26	26	26	26	26	26	17	26
	Wheat	17	17	17	17	17	17	17	12	17	17	...	17	17	17	17	17	17	17	17	13	17
East Africa	Cotton	24	24	24	24	24	24	24	17	24	24	...	24	24	24	24	24	24	24	24	20	24
	Maize	24	24	24	24	24	24	24	16	24	24	...	24	24	24	24	24	24	24	24	15	24
	Rice	20	20	20	20	20	20	20	15	20	20	...	20	20	19	20	20	20	20	20	17	20
	Soybean	24	24	24	24	24	23	24	18	24	24	...	24	24	24	24	24	24	24	24	20	24
	Wheat	15	15	15	15	15	15	15	11	15	15	...	15	15	15	15	15	15	15	15	11	15
India_North	Cotton	18	17	18	18	18	18	18	9	18	18	...	18	18	18	18	18	18	18	18	15	18
	Maize	24	24	24	24	24	24	24	15	24	24	...	24	24	24	24	24	24	24	24	19	24
	Rice	18	17	18	18	18	18	18	14	18	18	...	18	18	18	18	18	18	18	18	14	18
	Soybean	18	18	18	18	18	18	18	14	18	18	...	18	18	18	18	18	18	18	18	13	18
	Wheat	20	20	20	20	20	20	20	9	20	20	...	20	20	20	20	20	20	20	20	16	20
India_South	Cotton	20	20	20	20	20	20	20	16	20	20	...	20	20	20	20	20	20	20	20	10	20
	Maize	21	21	21	21	21	21	21	17	20	21	...	21	21	21	21	21	21	21	21	14	21
	Rice	6	6	6	6	6	6	6	2	6	6	...	6	6	6	6	6	6	6	6	6	6
	Soybean	22	22	22	21	22	22	22	14	22	22	...	22	22	22	22	22	22	22	22	18	21
	Wheat	21	21	21	21	21	20	21	13	21	21	...	21	21	21	21	21	21	21	21	16	21
South USA	Cotton	19	19	19	19	19	19	19	14	19	19	...	19	19	19	19	19	19	19	19	15	19
	Maize	21	21	20	21	21	21	21	15	21	21	...	21	21	21	20	21	21	21	21	13	21
	Rice	17	17	17	17	17	17	17	10	17	17	...	17	17	17	17	17	17	17	17	9	17
	Soybean	17	17	17	17	17	17	17	12	17	17	...	17	17	17	17	17	17	17	17	13	17
	Wheat	19	19	19	19	19	19	19	16	19	19	...	19	19	19	19	19	19	18	19	15	19

25 rows × 21 columns

Aggregation

In the previous example, you did not really need to print the counts for all columns.
You can restrict the output to one or a few columns of interest.

Please note that some columns show lower counts than others. This is because .count() ignores missing values (None or NaN).

my_data.groupby(['region', 'crop_type'])['farm_id'].count()

region       crop_type
Central USA  Cotton       26
             Maize        21
             Rice         18
             Soybean      26
             Wheat        17
East Africa  Cotton       24
             Maize        24
             Rice         20
             Soybean      24
             Wheat        15
India_North  Cotton       18
             Maize        24
             Rice         18
             Soybean      18
             Wheat        20
India_South  Cotton       20
             Maize        21
             Rice          6
             Soybean      22
             Wheat        21
South USA    Cotton       19
             Maize        21
             Rice         17
             Soybean      17
             Wheat        19
Name: farm_id, dtype: int64
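
The note above about missing values can be checked on a toy example. In this sketch (invented values), .count() only counts the non-missing cells of a column, whereas .size() counts all the rows of each group.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with two missing pH values
toy = pd.DataFrame({
    "crop_type": ["Rice", "Rice", "Wheat", "Wheat", "Wheat"],
    "soil_pH": [6.1, np.nan, 5.9, 6.4, np.nan],
})

counts = toy.groupby("crop_type")["soil_pH"].count()  # non-missing values only
sizes = toy.groupby("crop_type").size()               # all rows of each group

print(counts)
print(sizes)
```

If you want the number of rows per group regardless of missing values, .size() is therefore the safer choice.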

Aggregation

You can also apply different aggregation methods to different columns, or several aggregation methods to the same column.

Example 3: For each region, find the minimum and maximum temperature (Celsius) and the sum of rainfall.

my_data.groupby('region').agg({'temperature_Celsius':['min', 'max'], 'rainfall_mm': 'sum'})

	temperature_Celsius		rainfall_mm
	min	max	sum
region
Central USA	15.04	34.09	19014.33
East Africa	15.01	34.33	19734.47
India_North	15.64	34.52	18636.89
India_South	15.23	33.78	16972.94
South USA	15.11	34.84	15824.96
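
The result above has two levels of column names. As an alternative that is not covered in these slides, pandas also accepts a "named aggregation" syntax, which gives one flat, explicit name to each output column. The sketch below uses invented values, and the output names temp_min, temp_max and rain_total are arbitrary choices.

```python
import pandas as pd

# Hypothetical miniature table, not the course dataset
toy = pd.DataFrame({
    "region": ["A", "A", "B"],
    "temperature_Celsius": [15.0, 30.0, 22.0],
    "rainfall_mm": [100.0, 50.0, 70.0],
})

# new_column_name=(source_column, aggregation_method)
summary = toy.groupby("region").agg(
    temp_min=("temperature_Celsius", "min"),
    temp_max=("temperature_Celsius", "max"),
    rain_total=("rainfall_mm", "sum"),
)

print(summary)
```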

What methods can be applied after aggregation?

The complete list is available in the pandas GroupBy documentation.
Some common methods:
- .min(): compute min of group values
- .max(): compute max of group values
- .mean(): compute mean of group values
- .count(): compute count of group, excluding missing values
- .describe(): generate descriptive statistics for each numeric column
- .head(n): return the first n rows in each group
- .tail(n): return the last n rows in each group
- .size(): compute group sizes
- …

Dataframe export

You can use to_csv() to export a dataframe to a tabulated file.

Syntax: my_data.to_csv('path_to_output_file')

Example:

my_data.to_csv('my_dataframe.csv', header = True, index = False, sep = ',')

Some common options:
- header = True: the column names are written as the first line
- index = False: the index is not written
- sep = ',': the comma (,) is used to separate the columns

The only mandatory parameter is the output file path.
Please read the documentation to see the complete list of parameters.
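
A quick way to check an export is to read the file back. The sketch below writes a toy dataframe (invented values) to a tab-separated file in a temporary folder and reloads it with read_csv(). The file name toy.tsv is hypothetical, and the same separator must be given when writing and when reading.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature table, not the course dataset
toy = pd.DataFrame({"farm_id": ["FARM0001", "FARM0002"], "soil_pH": [6.5, 7.1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy.tsv")
    # Export without the index, using a tabulation as separator
    toy.to_csv(out_path, sep="\t", index=False)
    # Read the file back with the same separator
    reloaded = pd.read_csv(out_path, sep="\t")

print(reloaded)
```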

Summary of the dataframes section (1/3)

DataFrames are objects used to store tables of data. They can be initialised:
- with a dictionary: pandas.DataFrame(my_dict)
- from a tabulated file: pandas.read_csv("my_tabulated_file")
Unlike nested lists, columns are identified by a name and each column has a single data type.
There are ways that allow you to view a subset of the data:
- first lines with my_df.head(), last lines with my_df.tail(), generate statistics with my_df.describe()
- display one column with my_df['column_1'] or several columns with my_df[['column_1', 'column_2']]
- select data via the index with my_df.iloc[row_index, column_index]
- select data via the labels with my_df.loc[my_df['column_1'] == value, ['column_2', 'column_3']]

Summary of the dataframes section (2/3)

There are ways that allow you to modify a subset of the data:
- create or modify a column: my_df['column_name'] = value
- delete a column: my_df = my_df.drop(columns='column_name'), del my_df['column_name'] or my_df.pop('column_name')
- rename one or several columns with
  - my_df.rename(columns={'old name': 'new name'}, inplace = True) or
  - my_df = my_df.rename(columns={'old name': 'new name'})
- sort a dataframe according to one or several columns with
  - my_df.sort_values('column_1', ascending = True, inplace = True) or
  - my_df = my_df.sort_values(['column_1', 'column_2'], ascending = [True, False])

Summary of the dataframes section (3/3)

To filter a dataframe you can use:
- my_df[my_df['column_1'] > value]
- my_df.loc[my_df['column_1'] > value, ['column_3', 'column_4']]
- my_df[ (my_df['column_1'] > value) & (my_df['column_2'] < other_value) ]
- my_df.loc[(my_df['column_1'] > value) & (my_df['column_2'] < other_value), ['column_3', 'column_4']]
To modify certain cells in a column depending on their value, you can do:
my_df.loc[my_df['column_1'] == old_value, ['column_1']] = new_value
An aggregation allows you to group your data according to one or several columns and perform one or several operations on other columns. For instance:
- my_df.groupby('column_1')['column_2'].sum()
- my_df.groupby(['column_1', 'column_2']).count()
- my_df.groupby('column_1').agg({'column_2':['min', 'max'], 'column_3': 'sum'})

Let’s practise

Please open file 009_practical_dataframes.ipynb

Plots

Plots presentation

There are several packages to create plots in Python.

In this training we will present matplotlib and seaborn.
matplotlib is one of the most used Python data visualisation libraries.
seaborn is based on matplotlib and provides new features.

matplotlib can be installed with pip install matplotlib.

seaborn can be installed with pip install seaborn.

Plots presentation

In this section we will see some of the most used types of plots:
- line plot
- scatterplot
- pie plot
- barplot
- histogram
- boxplot
- violin plot
- heatmap
- pairplot

Line plot

A line plot is used to display the relationship between two numerical variables.
In particular, this type of plot is best used for displaying trends over time.

A very basic line plot

import seaborn as sns
import matplotlib.pyplot as plt # please note that you must import matplotlib.pyplot and not simply matplotlib

import random

x = range(1, 11)
y = [100 * round(random.random(), 2) for i in range(1, 11)] # creates a list of 10 random values between 0 and 100
plt.plot(x, y)
plt.show()
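
The values returned by random.random() change at each run, so this plot looks different every time. If you need reproducible values, one option is to fix the seed of the generator with random.seed(). In the sketch below, the seed 42 is an arbitrary choice, and round(100 * random.random()) is used to obtain true integers between 0 and 100.

```python
import random

random.seed(42)  # any fixed integer works; 42 is an arbitrary choice
first_run = [round(100 * random.random()) for _ in range(10)]

random.seed(42)  # same seed again: the generator restarts from the same point
second_run = [round(100 * random.random()) for _ in range(10)]

print(first_run)
print(first_run == second_run)
```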

A slightly more customised plot

import random

x = range(1, 11)
y = [100 * round(random.random(), 2) for i in range(1, 11)]
z = [100 * round(random.random(), 2) for i in range(1, 11)]
plt.figure(figsize=(10, 3))    # configure plot size
plt.plot(x, y, label='y list', linewidth=4)    # add a label and change the default line width
plt.plot(x, z, label='z list', linewidth=4, linestyle='--', color='purple')  # change the default type and color
plt.xlabel('Title for x axis', fontsize=12)  # add a label for x axis
plt.ylabel('Title for y axis', fontsize=12)  # add a label for y axis
plt.legend(loc='upper right')    # add a legend and fix its position in upper right corner
plt.grid(color='gray', linewidth=0.5)    # add a grid
plt.title('A more customised line plot') # add a title
plt.show()

Before going further: the penguins dataset

The penguins dataset is a good dataset for data exploration and visualisation.

It can be imported directly with seaborn.

import seaborn as sns
penguins = sns.load_dataset('penguins')
penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

Before going further: the penguins dataset

Each individual in this dataset is a penguin.
For each penguin, the available data are:
- the species
- the island where it lives
- the bill length (mm)
- the bill depth (mm)
- the flipper length (mm)
- the body mass (g)
- the sex

Artwork by @allison_horst

Scatterplots

A scatterplot is used to display the relationship between two numerical variables.

Unlike a line plot, with a scatterplot, a value on the x-axis can be associated with several values on the y-axis.

A simple scatterplot with `matplotlib`

import seaborn as sns
import matplotlib.pyplot as plt

# configure plot size
plt.figure(figsize=(10, 4))
plt.scatter(penguins['flipper_length_mm'], penguins['body_mass_g'])
# label for x axis
plt.xlabel('Flipper length (mm)', fontsize=12)
# label for y axis
plt.ylabel('Body mass (g)', fontsize=12)
# plot title
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()

All species are mixed together.

Several scatterplots on the same plot with `matplotlib`

for species in penguins['species'].unique():
    df = penguins.loc[penguins['species'] == species, :]
    plt.scatter(df['flipper_length_mm'], df['body_mass_g'], label=species)
plt.xlabel('Flipper length (mm)', fontsize=12)
plt.ylabel('Body mass (g)', fontsize=12)
plt.legend()      # add a legend based on 'label' parameter in plt.scatter
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()

We have to loop over all species.

A nice scatterplot with `seaborn`

plt.figure(figsize=(9, 3.5))
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species')
plt.xlabel('Flipper length (mm)', fontsize=12)
plt.ylabel('Body mass (g)', fontsize=12)
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()

By specifying the variable via the hue argument, seaborn automatically creates a color for each existing value.

Pie plots

A pie plot shows data as a percentage of a whole.

This kind of visualisation uses a circle to represent the whole, and slices of the circle to represent the specific categories that compose the whole.

Pie plot with `matplotlib`

island = penguins['island'].value_counts()                    # island is a Series
# values can be accessed with island.values, indexes with island.index
plt.pie(x=island.values, labels=island.index, autopct='%.2f')
plt.title('Islands', size=16, color='#DAA520')
plt.show()

seaborn does not provide a function to draw pie plots.

To create one, we must therefore use matplotlib’s pie function, to which we can apply seaborn’s various graphic styles (themes).

Bar plots

A bar plot shows the relationship between a numeric and a categorical variable.

Each category of the categorical variable is represented as a bar.

The size of the bar represents its numeric value.

A bar plot can represent exactly the same information as a pie plot but from a different perspective.

Bar plot with `matplotlib`

flipper_mean = penguins.groupby('species')['flipper_length_mm'].mean()     # flipper_mean is a Series
# values can be accessed with flipper_mean.values, indexes with flipper_mean.index
plt.bar(height=flipper_mean.values, x=flipper_mean.index)
plt.title('Flipper Length for 3 Penguin Species', size=16, color='orange')
plt.show()

Bar plot with `seaborn`

sns.barplot(x ='species', y='flipper_length_mm', data=penguins)
plt.title('Flipper Length for 3 Penguin Species', size=16, color='orange')
plt.show()

seaborn automatically calculates the mean of the y variable for each category and, by default, draws an error bar around it.

Histograms

Histograms are particularly useful when you want to get an idea of the distribution of a variable.

You can see roughly where the peaks of the distribution are, whether the distribution is skewed or symmetric, and if there are any outliers.

Histogram with `matplotlib`

plt.hist(penguins['flipper_length_mm'])
plt.title('Flipper Length', size=16, color='green')
plt.xlabel('Flipper length (mm)')
plt.show()
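
Behind a histogram, the values are split into intervals of equal width (bins) and the number of values in each bin is counted. matplotlib uses 10 bins by default, which can be changed with the bins parameter of plt.hist(). The sketch below reproduces this counting step with numpy.histogram() on a few invented flipper lengths (not the penguins dataset).

```python
import numpy as np

# Hypothetical flipper lengths in mm, invented for illustration
flipper_lengths = np.array([181, 186, 195, 193, 190, 210, 215, 222, 230, 198])

# Split the range of values into 5 bins of equal width and count the values in each bin
counts, bin_edges = np.histogram(flipper_lengths, bins=5)

print(counts)     # number of values in each bin
print(bin_edges)  # 6 edges delimit 5 bins
```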

Basic histogram with `seaborn`

sns.set_theme()    # use default theme (grey background with white grid lines)
sns.histplot(x = 'flipper_length_mm', data = penguins)
plt.title('Flipper Length', size=16, color='green')
plt.show()

Histogram with a kernel density estimate (kde) with `seaborn`

sns.set_theme()
sns.histplot(x = 'flipper_length_mm', data = penguins, hue = 'species', kde = True)
plt.title('Flipper Length', size=16, color='green')
plt.show()

Boxplots

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups.

They are built to provide high-level information at a glance, offering general information about a group of data’s symmetry, skew, variance, and outliers.

A simple boxplot with `seaborn`

sns.boxplot(x = 'species', y = 'flipper_length_mm', data = penguins, palette=['#FBB613','#38D4D6','#8A38D6'])
plt.title('Flipper Length for 3 Penguin Species', size=16, color='#00BFFF')
plt.show()

A more elaborate boxplot with `seaborn`

sns.set_theme()
sns.boxplot(x = 'species', y = 'flipper_length_mm', data = penguins, hue = 'sex')
plt.title('Flipper Length for 3 Penguin Species by Sex', size=16, color='#00BFFF')
plt.show()
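
To read a boxplot, it helps to know what is drawn: the box goes from the first quartile (Q1) to the third quartile (Q3), the line inside the box is the median and, by default, the whiskers extend to the most extreme values located within 1.5 times the interquartile range (IQR = Q3 - Q1) of the box. Points beyond the whiskers are drawn as outliers. The sketch below computes these quantities with pandas on invented values.

```python
import pandas as pd

# Hypothetical flipper lengths in mm, invented for illustration
values = pd.Series([180, 185, 190, 195, 200, 205, 210, 215, 260])

q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Limits beyond which a point is drawn as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```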

Violin plots

You can think of the violin plot as a box plot combined with a density curve.

This plot is used to compare the distribution of a numerical variable between the categories of a categorical variable.

The peaks, valleys, and tails of each group’s density curve can be compared to see where groups are similar or different.

Violin plot with `seaborn`

sns.set_theme()
sns.violinplot(x = 'species', y = 'body_mass_g', data = penguins, hue = 'sex')
plt.title('Body mass for penguins by sex and species', size=20, color='blue')
plt.show()

Heatmaps

A heatmap shows how values vary across a grid using colors.

It’s often used to quickly spot patterns, trends, or areas of high and low activity in data.

In a correlation heatmap, colors show how strongly variables are related.

Heatmap with `seaborn`

# extract numeric columns from penguins dataframe
penguins_numeric = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
# corr() calculates the correlation between variables
sns.heatmap(penguins_numeric.corr(), annot = True)
plt.title('Correlation between numeric variables', size=16, color='darkviolet')
plt.show()
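
The table passed to sns.heatmap() is simply the dataframe returned by .corr(): a square table with one row and one column per numeric variable, values between -1 and 1, and 1 on the diagonal. The sketch below shows it on three invented columns named x, y and z. Depending on your pandas version, .corr() may raise an error if the dataframe still contains text columns, which is one reason to select the numeric columns first.

```python
import pandas as pd

# Hypothetical numeric table: y grows with x, z decreases when x grows
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
})

correlations = toy.corr()  # Pearson correlation by default
print(correlations)
```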

Pairplots

You can use the pairplot function to see the pairwise relationships between the variables.

This function creates a grid of plots, one for each pair of numeric variables in the dataset.

Several options are available to choose the plot types.

Pairplot with `seaborn`

sns.pairplot(penguins, hue = "species", height=1.5)
plt.show()

Going further

matplotlib gallery: https://matplotlib.org/stable/gallery/index.html

seaborn official website: https://seaborn.pydata.org/

seaborn gallery: https://seaborn.pydata.org/examples/index.html

Summary of the plots section

matplotlib and seaborn are among the most widely used Python packages for plotting graphs.

The data to be plotted should generally be stored in a list or a dataframe.

There are many different types of graphs: line plots, scatter plots, pie charts, bar charts, histograms, box plots, violin plots, heatmaps, pair plots…

The way these different functions are used and the options available are often very similar.

There are many customisation options available, and these are often the same across different types of charts (xlabel(), ylabel(), legend(), title()…)

Please refer to the documentation for instructions on how to use the relevant functions.

Let’s practise

Please open file 010_practical_plots.ipynb

Common errors

Introduction

When coding, you will certainly run into errors. Some are more common than others. Learning to identify errors will help you fix them quickly.

When you encounter an error, Python tells you which line caused the problem, gives the name of the error and briefly explains what is wrong.
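
The name and the message of an error can also be read from inside your code. The sketch below uses a try / except block, a construct that may not have been covered earlier in this training: the code in the try part is executed and, if the named error occurs, the except part runs instead of stopping the program.

```python
try:
    value = float("variable")  # this conversion cannot succeed
except ValueError as error:
    error_name = type(error).__name__  # the error name, here 'ValueError'
    error_message = str(error)         # the short explanation given by Python
    print(f"{error_name}: {error_message}")
```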

Common errors (1/6)

NameError: You may have forgotten to define a variable and you are trying to access it.
- How to debug: check if you initialised it or deleted it by mistake.
- Example:

print(my_variable)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[253], line 1
----> 1 print(my_variable)

NameError: name 'my_variable' is not defined

SyntaxError: You may have forgotten a character such as a parenthesis, a comma or a colon.
- How to debug: The error message indicates the position of the problem in the line with a ^.
- Example:

print my_variable

  Cell In[254], line 1
    print my_variable
    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?

Common errors (2/6)

TypeError: You may be trying to perform an operation on, or apply a function to, an object of the wrong type.
- How to debug: Check your variables and/or what kind of objects are accepted.
- Example:

my_variable = "w"*1.2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[255], line 1
----> 1 my_variable = "w"*1.2

TypeError: can't multiply sequence by non-int of type 'float'

ValueError: You may have passed an object of the right type to a function, but its value is invalid.
- How to debug: Check the value you are trying to give to the function.
- Example:

my_variable = float("variable")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[256], line 1
----> 1 my_variable = float("variable")

ValueError: could not convert string to float: 'variable'

Common errors (3/6)

IndexError: You may be trying to access an element in a list that is outside the valid range.
- How to debug: Check the length of your list.
- Example:

my_list = [0,1,2,3]
print(my_list[5])

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[257], line 2
      1 my_list = [0,1,2,3]
----> 2 print(my_list[5])

IndexError: list index out of range
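
One possible way to avoid this error, sketched below, is to compare the position with the length of the list before accessing the element. The variable names position and result are chosen for this example only.

```python
my_list = [0, 1, 2, 3]
position = 5

# Valid positions go from 0 to len(my_list) - 1
if 0 <= position < len(my_list):
    result = my_list[position]
else:
    result = None
    print(f"Position {position} does not exist: valid positions are 0 to {len(my_list) - 1}")
```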

KeyError: You may be trying to access an element in a dictionary that doesn’t exist.
- How to debug: Check the spelling of the key and the list of existing keys, or use the .get() method, which returns None instead of raising an error.
- Example:

my_dict = {"Laurène":0, "Thomas":1, "Isabelle":2, "Benjamin":3}
print(my_dict["Lauraine"])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[258], line 2
      1 my_dict = {"Laurène":0, "Thomas":1, "Isabelle":2, "Benjamin":3}
----> 2 print(my_dict["Lauraine"])

KeyError: 'Lauraine'
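
The sketch below shows a few ways to check a key before using it, on the same dictionary: .get() with or without a default value, the in operator, and the list of existing keys. The default value -1 is an arbitrary choice.

```python
my_dict = {"Laurène": 0, "Thomas": 1, "Isabelle": 2, "Benjamin": 3}

missing = my_dict.get("Lauraine")        # no error: returns None
fallback = my_dict.get("Lauraine", -1)   # returns the default value given as 2nd argument
is_present = "Laurène" in my_dict        # True if the key exists
available_keys = list(my_dict)           # useful to spot a typo in a key

print(missing, fallback, is_present)
print(available_keys)
```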

Common errors (4/6)

IndentationError: You may have forgotten to indent a part of your code, or indented it inconsistently.
- How to debug: Check that each block is indented and that you did not mix tabs with spaces.
- Example:

for i in [0,1,2,3]:
print(i)

  Cell In[259], line 2
    print(i)
    ^
IndentationError: expected an indented block after 'for' statement on line 1

AttributeError: You may have used a method or an attribute that does not exist for this type of object.
- How to debug: Check your variable type and the method documentation.
- Example:

my_variable = 1
my_variable.upper()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[260], line 2
      1 my_variable = 1
----> 2 my_variable.upper()

AttributeError: 'int' object has no attribute 'upper'
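
A minimal sketch of how to investigate this error: type() tells you what kind of object you really have, and hasattr() tells you whether a given method exists for it. Here the fix is to convert the integer to a string before calling .upper().

```python
my_variable = 1

variable_type = type(my_variable).__name__  # name of the object type
has_upper = hasattr(my_variable, "upper")   # does this object have an 'upper' method?
print(variable_type, has_upper)

# .upper() exists for strings, so convert the value first
as_text = str(my_variable).upper()
print(as_text)
```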

Common errors (5/6)

FileNotFoundError: The file you are trying to access either does not exist or is in a different folder or the file path is wrong.
- How to debug: Check where your file is.
- Example:

my_file = open("my_file.txt", "r")

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[261], line 1
----> 1 my_file = open("my_file.txt", "r")

File /usr/lib/python3/dist-packages/IPython/core/interactiveshell.py:310, in _modified_open(file, *args, **kwargs)
    303 if file in {0, 1, 2}:
    304     raise ValueError(
    305         f"IPython won't let you open fd={file} by default "
    306         "as it is likely to crash IPython. If you know what you are doing, "
    307         "you can use builtins' open."
    308     )
--> 310 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'my_file.txt'
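
One possible way to check where Python is looking for the file, sketched with the standard pathlib module. The file name is the one from the example, and the names file_path, current_folder and content are illustrative.

```python
from pathlib import Path

file_path = Path("my_file.txt")  # same file name as in the example above

# Relative paths are resolved from the current working directory
current_folder = Path.cwd()
print("Current working directory:", current_folder)

if file_path.exists():
    # The file is there: open it and read its content
    with open(file_path, "r") as my_file:
        content = my_file.read()
else:
    content = None
    print(f"{file_path} was not found in {current_folder}")
```

If the file is in another folder, give the full path or move the file next to your script or notebook.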

Common errors (6/6)

ModuleNotFoundError: You may have forgotten to install the package before importing it, or you may have made a mistake when typing its name.
- How to debug: Check the spelling of the module name, then install the package with pip install if it is missing.
- Example:

import sqlfactory

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[262], line 1
----> 1 import sqlfactory

ModuleNotFoundError: No module named 'sqlfactory'
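
As a sketch, you can test whether a module is available before importing it, using importlib from the standard library. The names module_name, is_installed and json_found are illustrative. Note that the name used with pip install is sometimes different from the name used with import, so the suggested command is only a first guess.

```python
import importlib.util

module_name = "sqlfactory"  # module name from the example above

# find_spec() returns None when a top-level module cannot be found
is_installed = importlib.util.find_spec(module_name) is not None

if not is_installed:
    print(f"'{module_name}' was not found. Try: pip install {module_name}")

# For comparison, json is part of the standard library and is always found
json_found = importlib.util.find_spec("json") is not None
```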

How to debug?

What should you do if an error occurs?
1. Read the message: it explains what is wrong.
2. Try to debug your code to get a better understanding (or to fix it if you can!).
3. Search for the error message on Google: look for Stack Overflow links, they are often helpful.
4. If the three steps above do not work, you may ask an AI chatbot for help. If you give it the full error message, it will most likely tell you what is wrong.
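
To illustrate step 1, here is a minimal sketch that catches an error with try/except and displays its two most useful parts: the error type and the message. The names error_type and error_message are chosen for illustration.

```python
my_list = [0, 1, 2, 3]

try:
    print(my_list[5])
except IndexError as error:
    error_type = type(error).__name__  # name of the error
    error_message = str(error)         # explanation given by Python
    print(f"{error_type}: {error_message}")
```

These two parts are also what you would paste into a search engine in step 3.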

Bring Your Own Project

Suggested exercises

If you don’t have any ideas for a program or analysis to implement, you can choose from the following options:

write a program (using basic Python concepts: lists, dictionaries, conditionals, functions, plots, etc.)
- code the Game of Life
- process and extract information from a dataset of non-coding RNAs
analyse a dataset (manipulation of dataframes and plots)

Conclusion

Links and references

https://docs.python.org/3/: starting point for all official Python documentation.
https://realpython.com/: basic and advanced tutorials.
https://stackoverflow.com/questions: question-and-answer site to ask about code and troubleshooting.
https://peps.python.org/pep-0000/#: index of the Python Enhancement Proposals (PEPs), including PEP 8, the style guide for good practices in Python coding.
https://pythontutor.com/visualize.html#: execute code and visualize execution for debugging.
https://www.w3schools.com/python/: examples and simple tutorials.

Take home message

Read the doc!
Practise!
Do not reinvent the wheel: use existing tools
Use AI assistants with caution! (copy-pasting will not work every time)

Special thanks

Fabien KON-SUN-TACK
Former Bilille engineer who worked on this training.

Satisfaction survey

In order to help us improve our training, we would be grateful if you could take a few minutes to complete the following satisfaction survey.

(You can answer in English or French.)