April 8, 2026
Schedule :
Breaks :
Lunch :
At the bottom left, there is a menu to better navigate through the slides
Bilille is the Lille bioinformatics and biostatistics platform, within the UAR 2014 - US 41 “Plateformes Lilloises en Biologie et Santé”.
PLBS includes 8 platforms, providing access to expertise and equipment to support research in biology and health.
In Bilille, we are currently 10 full-time engineers, directed by Jimmy Vandel (research engineer CNRS), Ségolène Caboche (research engineer University of Lille) and Mamadou-Dia Sow (research engineer University of Lille).
Our missions are to :
Us
What about you ?
.py extension.
- demystifying-javascript.python-extensions-pack
- ms-toolsai.jupyter
- indent-rainbow
- …).
Extensions help you write scripts, but too many extensions can slow down your IDE! Use sparingly.
- 5 : integer
- 3.1415 : real number (float)
- "abc" : string
- True : boolean (a boolean is a variable that can only take values True or False)
- print("...") : function
Variables are fundamental in programming. You must understand their purpose and how they work in order to obtain the desired results.
Regarding float variables, please note that the separator for decimals is a period (.), not a comma (,).
To assign a value to a variable, you use the assignment operator (equals sign, =) after the variable name.
a = 5 : this command assigns the value 5 to the variable a.
a = 7 : if the same variable is used again, the previous value is overwritten. The object type can change if the variable is reused / overwritten.
It is possible to have as many variables as memory space allows.
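The assignment rules above can be tried out with a short sketch. The built-in `type()` function is not part of the text above; it is used here only as an assumption to show that the type follows the value.

```python
# Assign the value 5 to the variable a
a = 5
first_type = type(a).__name__   # name of the type of the object stored in a

# Reusing the same variable name overwrites the previous value
a = 7

# The type can change when the variable is overwritten
a = "seven"
last_type = type(a).__name__

print(a, first_type, last_type)
```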
é à ö &…
myvariable and MYVARIABLE are different objects.
average_car_speed = 50
averageCarSpeed = 50
What do you expect to be displayed with the following examples?
'my_variable'
It is not a variable name but only an object (a string). Variable names are not written between quotes.
NameError Traceback (most recent call last)
Cell In[2], line 1
----> 1 b
NameError: name 'b' is not defined
It raises a NameError because the variable b hasn’t been assigned yet.
'my_variable'
3.1415
What do you expect to be displayed with the following examples?
Cell In[5], line 1 2nd_variable = "my_second_variable" ^ SyntaxError: invalid decimal literal
It raises a SyntaxError because a variable name cannot start with a digit.
'my_second_variable'
A string can only be concatenated with another string.
strings.
Integers and floats can be converted to string with str(variable).
Strings can be converted to integer or float with int(variable) or float(variable).
The print() function can be used to display what is between the parentheses.
integer or float is to use the formatted string literals syntax also called f-string syntax.
Do not forget the f before the quotes.
The variable or operation result between the {} will automatically be converted to a string.
You can also separate strings and variables with commas. This adds spaces automatically.
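As a minimal sketch of the three ways of displaying numbers with text described above (conversion with `str()`, f-strings, and commas in `print()`); the variable names and values are hypothetical.

```python
age = 30        # hypothetical example values
height = 1.75

# Concatenation only works between strings, so the number is converted with str()
sentence_concat = "I am " + str(age) + " years old."

# With an f-string, what is between {} is converted to a string automatically
sentence_f = f"I am {age} years old and {height} m tall."

# Strings can be converted back to numbers with int() or float()
age_next_year = int("30") + 1
half = float("3.5") / 2

print(sentence_concat)
print(sentence_f)
# With commas, print() inserts a space between the elements
print("I am", age, "years old.")
```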
Single ('') or double ("") quotes can be used around a string, but the opening and closing quotes must be of the same type.
If you need additional quotes inside a string, you can use the other type of quotes, or escape them with backslash (\).
print("When I arrive in the morning, I say 'good morning' to everyone.")
print('When I arrive in the morning, I say "good morning" to everyone.')
print("When I arrive in the morning, I say \"good morning\" to everyone.")
When I arrive in the morning, I say 'good morning' to everyone.
When I arrive in the morning, I say "good morning" to everyone.
When I arrive in the morning, I say "good morning" to everyone.
\) (cf. Windows paths)
\\). The first backslash (\) escapes the second one, so it is interpreted as a literal backslash.
The following arithmetic operators are available in Python:
float even if the result is an integer.
=) symbol. The two following syntaxes are equivalent:
To compare values we can use the following operators:
int or float) in numerical order, or strings in lexicographical order (based on the Unicode code point of each character).
variable.method(*optional parameters*).
Here are some examples of useful methods for strings.
Consider the following string:
There are methods for other types of variables, which we will cover in another chapter.
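The slides' own string and method list are not reproduced here, so the following is only a sketch with a hypothetical string and a few standard `str` methods (`upper`, `count`, `replace`, `startswith`), all called with the variable.method() syntax.

```python
sequence = "atgcatgc"   # hypothetical example string

upper_sequence = sequence.upper()              # uppercase copy of the string
n_a = sequence.count("a")                      # number of occurrences of "a"
rna = sequence.replace("t", "u")               # copy with every "t" replaced by "u"
starts_with_atg = sequence.startswith("atg")   # boolean test on the beginning

# None of these calls modified the original string
print(sequence, upper_sequence, n_a, rna, starts_with_atg)
```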
Calling a method on a string does not change the string itself; the result must be assigned to a variable (but the same variable name can be reused).
#).
# but you cannot write code after the comment.
If written on several lines, the triple quotes should be written on a line by themselves; for one-line descriptions, they go on the same line as the comment itself.
Docstrings are usually used when writing a function.
integer, float, string, boolean, function, …
=).
print() function.
'), double quotes (") and f-strings!
Mathematical operations can be performed on variables with an operation sign (+, -, *, /, //, %, **):
Examples: my_variable *= 5 or my_variable = my_variable * 5.
>, <, >=, <=, ==, !=).
= (variable assignment) and == (test equality)!
# before your comment or add """ around it.
Please open file 001_practical_variables.py
lists to store multiple values in an orderly manner in the same variable.
A list can be initialised with [] or list().
A list can also be initialised with values:
list from a string. In this case, each element of the list will contain a single character.
Each item of a list can be accessed by giving its index, starting from 0 to n-1, with n the number of items in the list.
The length of a list is given by len(numbers).
n = len(numbers)
print(f'numbers = {numbers}')
print(f'There are {n} elements in the numbers list.')
print(f'First item is: {numbers[0]}')
print(f'Second item is: {numbers[1]}')
print(f'Last item is: {numbers[n-1]}')
numbers = [1, 3, 5, 7, 9]
There are 5 elements in the numbers list.
First item is: 1
Second item is: 3
Last item is: 9
Each item of a list can also be accessed in reverse order, from -1 (last item) to -n (first item).
print(f'Another way to get last item is: {numbers[-1]}')
print(f'Second to last item is: {numbers[-2]}')
print(f'Last item is: {numbers[len(numbers)-1]}')
print(f'Another way to get first item is: {numbers[-len(numbers)]}')
Another way to get last item is: 9
Second to last item is: 7
Last item is: 9
Another way to get first item is: 1
amino_acids[1]
- 'Ala'
- 'Arg'
- 'Gln'
Remember that list numbering starts at zero and that the index “-1” allows you to access the last item in the list.
'Glu'
- amino_acids[6]
- amino_acids[5]
- amino_acids[-2]
strings.
string methods that work with lists.
.join() turns a list of strings into a single string:
You can use any separator with the .join() method.
It just needs to be a string.
.split() turns a string into a list:
If no separator is given, .split() splits the string on any run of whitespace: new lines (\n), carriage returns (\r), tabs (\t), form feeds (\f) or spaces.
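A short sketch of `.join()` and `.split()` as described above; the list content and the separator are hypothetical choices.

```python
amino_acids = ["Ala", "Arg", "Asp"]

# .join() is called on the separator and takes the list of strings
joined = "-".join(amino_acids)

# .split() with an explicit separator gives the list back
split_on_dash = joined.split("-")

# Without a separator, .split() cuts on any run of whitespace
words = "one  two\tthree\nfour".split()

print(joined)
print(split_on_dash)
print(words)
```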
list using its index:
list and return it: removed_item = numbers.pop()
list by using its value (not the index). Only the first item encountered will be removed; if the value exists several times in the list, the process has to be repeated.
Lists are mutable objects, which means you can modify them directly.
Consider the following list:
We would like to create the same list called values:
Then we need to remove the second element from the numbers list:
Now let’s check the content of values. What do you expect to get?
Here, values is just another name referencing the same list as numbers, so the elements are shared.
Use the .copy() method when an independent copy is needed.
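The difference between a shared reference and a copy can be sketched as follows. The list content is an assumption, and the `is` operator (which tests whether two names point to the same object) is not introduced in the text above.

```python
numbers = [1, 3, 5, 7, 9]

# values is only a second name for the same list
values = numbers
numbers.pop(1)                    # removes the second element (3)
same_object = values is numbers   # both names reference one list
print(values)                     # the change is visible through values too

# With .copy(), values becomes an independent list
numbers = [1, 3, 5, 7, 9]
values = numbers.copy()
numbers.pop(1)
print(numbers)
print(values)
```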
Let’s try again.
This time we would like to remove the second-to-last element from the values list.
Which command(s) will work ? :
values.pop(-2)
values.pop(3)
values.remove(3)
Now let’s check the content of numbers.
The numbers list has not been affected by the changes made to the values list.
A list can contain any Python variable so it can also contain other lists.
A list may contain numbers, strings, and anything else.
numbers is a nested list.
numbers[5] is a simple list.
numbers[5][0] is an integer.
list.
matrix is a nested list.
matrix[0], matrix[1] and matrix[2] are simple lists.
matrix[0][0] is an integer.
Consider the following nested list:
What command would you write to get :
Which animal will you get if you type :
list by specifying ranges of values with a colon (:) in brackets.
my_list[start:end:step]: will slice my_list from start to end (excluded) with a step of step (default value 1 if not provided).
list:
Returns every other element, from second element to fourth element (excluded).
Returns the complete list except for the first element.
Returns the complete list except for the last element.
Returns the complete list in reverse order.
The slicing [::-1] simply displays the list in reverse order, while the method .reverse() changes the order within the list.
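The slides' own slicing commands are not reproduced here; the sketch below uses a hypothetical list to illustrate the my_list[start:end:step] syntax and the difference between [::-1] and .reverse().

```python
letters = ["a", "b", "c", "d", "e"]   # hypothetical example list

every_other = letters[1:4:2]   # from index 1 to index 4 (excluded), step 2
without_first = letters[1:]    # everything except the first element
without_last = letters[:-1]    # everything except the last element
reversed_copy = letters[::-1]  # new list in reverse order

# Slicing leaves the original list untouched
unchanged = letters.copy()

# .reverse() changes the order within the list itself
letters.reverse()

print(every_other, without_first, without_last, reversed_copy, letters)
```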
list anymore, you can delete it with the del keyword:
NameError Traceback (most recent call last)
Cell In[118], line 3
1 amino_acids = ['Ala', 'Arg', 'Asp', 'Asn', 'Cys', 'Glu', 'Gln']
2 del(amino_acids)
----> 3 print(amino_acids)
NameError: name 'amino_acids' is not defined
list using slicing:
Tuples are similar to lists but they cannot be modified. They are immutable objects.
A tuple can be initialised with () or tuple().
A tuple can also be initialised with values:
tuple, Python won’t let you.
TypeError Traceback (most recent call last)
Cell In[122], line 2
1 colours = ('red', 'orange', 'yello')
----> 2 colours[2] = "yellow"
3 print(colours)
TypeError: 'tuple' object does not support item assignment
Make sure to use [] or list() to create a list.
A list is a variable that can store multiple values in an orderly manner.
To initialise a list, you can use [] or list().
List indexing starts at 0 from the left and starts at -1 from the right.
my_list[i] (or my_list[i] = elt).
my_list.append(elt)
my_list.insert(i, elt)
removed = my_list.pop()
removed = my_list.pop(i)
my_list.remove(elt)
my_list[start:end:step].
del(my_list).
Please open file 002_practical_lists.py
Dictionaries are used to store data in the form of key:value pairs, which are accessed by key rather than by position.
Each key is unique. If a key is reused, its contents will be overwritten.
A dictionary can be initialised with {} or dict().
A dictionary can also be initialised directly with data:
Each item of a dictionary can be accessed by giving its key:
key in brackets:
Cat says meow.
KeyError Traceback (most recent call last)
Cell In[124], line 2
1 print(f"Cat says {animal_sounds['cat']}.")
----> 2 print(f"Fox says {animal_sounds['fox']}.")
KeyError: 'fox'
If you give a key that is not present in the dictionary it will raise an error.
With get() you can provide a default value in case the key is not in the dictionary:
my_dict.get(key, default_value)
Dictionaries are not indexed as lists are.
my_dict[1] will raise an error unless there is a key called 1.
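A minimal sketch of the two ways of reading a dictionary described above. Only the 'cat' entry comes from the example output; the 'dog' entry and the default value are assumptions.

```python
animal_sounds = {"cat": "meow", "dog": "woof"}   # 'dog' is a hypothetical entry

# Access with the key in brackets: raises a KeyError if the key is missing
cat_sound = animal_sounds["cat"]

# Access with get(): returns the default value if the key is missing
fox_sound = animal_sounds.get("fox", "unknown")

# The in keyword tests whether a key exists
has_fox = "fox" in animal_sounds

print(f"Cat says {cat_sound}.")
print(f"Fox says {fox_sound}.")
```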
keys must be immutable objects like strings, numbers or tuples.
.keys().
values can contain items of different types, including other dictionaries.
.values().
The pop method can be used to delete a key:value pair and store the value in a variable.
If we want to remove a key:value pair from a dictionary without keeping the value, we can use the del keyword:
key:
In this example, we want to create a dictionary named fruits_shop, with fruits as keys and numbers as values. These numbers represent the quantity of each fruit in the shop.
We received 10 apples, 5 pears and 1 banana.
How would you implement it ?
With this syntax, we must first initialise the dictionary and then add each element.
Nice! But in the meantime, we received 45 more bananas and 10 grapes… and then someone ate an apple (oops).
Pears are now prohibited worldwide, but we get 2 apples in exchange for each pear.
Unfortunately, we should remove the fruits_shop, as it has become useless and we need the space for something else. How would you proceed?
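One possible way to work through the whole fruits_shop exercise is sketched below. It is not necessarily the solution shown in the slides; the key names and the `final_stock` variable are assumptions.

```python
# Initialise the dictionary, then add each element
fruits_shop = {}
fruits_shop["apple"] = 10
fruits_shop["pear"] = 5
fruits_shop["banana"] = 1

# 45 more bananas, 10 grapes, and one apple eaten
fruits_shop["banana"] += 45
fruits_shop["grape"] = 10
fruits_shop["apple"] -= 1

# Pears are removed; each pear is exchanged for 2 apples
pears = fruits_shop.pop("pear")
fruits_shop["apple"] += 2 * pears

# Keep a copy of the final state for display (hypothetical helper variable)
final_stock = fruits_shop.copy()
print(final_stock)

# Delete the dictionary to free the space
del fruits_shop
```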
A dictionary is a variable that can store data in the form of key:value pairs.
A dictionary can be initialised with {} or dict().
To get the value corresponding to the key k, you can use:
my_dict[k]
my_dict.get(k, default_value)
Dictionaries are not indexed as lists are.
my_dict[k] = new_value.
To remove a key:value pair, use removed = my_dict.pop(k) or del(my_dict[k]).
del(my_dict).
Please open file 003_practical_dictionaries.py
The if / elif / else statement lets you determine which part of the code is executed, according to one or several conditions.
if, elif and else lines end with a colon (:).
Do not mix spaces and tabs.
Python best practices recommend using 4 spaces.
elif and else are optional. If they are not provided, nothing will be executed if the if statement is not true.
elif statements as needed.
elif is short for else if.
What should this code return with these values:
- `current_speed = 60` ?
Slow down! You are going to get a fine!
- `current_speed = 160` ?
Slow down! You are going to kill someone!
- `current_speed = 30` ?
You are not exceeding the speed limit.
Slow down! You are going to kill someone!
Slow down! You are going to get a fine!
In the example on the right we will never enter the current_speed > limit + 50 block.
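The effect of the order of the conditions can be sketched as follows. The messages come from the outputs above; the value limit = 50 is an assumption, as the original code is not reproduced here.

```python
limit = 50            # assumed speed limit
current_speed = 160

# Most restrictive condition first: every block can be reached
if current_speed > limit + 50:
    message = "Slow down! You are going to kill someone!"
elif current_speed > limit:
    message = "Slow down! You are going to get a fine!"
else:
    message = "You are not exceeding the speed limit."

# Conditions in the other order: the elif block can never be reached,
# because any speed above limit + 50 is also above limit
if current_speed > limit:
    message_reversed = "Slow down! You are going to get a fine!"
elif current_speed > limit + 50:
    message_reversed = "Slow down! You are going to kill someone!"
else:
    message_reversed = "You are not exceeding the speed limit."

print(message)
print(message_reversed)
```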
and or or.
if A and B will be executed only if the 2 expressions are true.
Note: there is a simpler syntax for checking whether a number is within a range.
if A or B will be executed if at least one of the 2 expressions is true.
You can notice that we initialised the variable admission before the conditional statement. This is a good practice, because if all conditions fail and you try to use an uninitialised variable, an error will occur and stop the execution of your script.
There is no limit to the number of conditions, but it may be useful to use parentheses to indicate priorities.
The logical operator and has higher precedence than the logical operator or.
This means that when both and and or operators appear in the same expression, and is evaluated first.
If you are not sure of the priority, use parentheses!
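A small sketch of how precedence changes a result, together with the chained comparison that checks whether a number is within a range. All variable names and values are hypothetical.

```python
age = 15            # hypothetical example values
has_ticket = False
is_staff = True

# and is evaluated first: is_staff or (has_ticket and age >= 18)
without_parentheses = is_staff or has_ticket and age >= 18

# Parentheses force the or to be evaluated first
with_parentheses = (is_staff or has_ticket) and age >= 18

# Chained comparison: same as (0 <= age) and (age < 18)
is_minor = 0 <= age < 18

print(without_parentheses, with_parentheses, is_minor)
```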
You can nest multiple conditions.
Please mind the indentation!
Before leaving home, you should take an accessory depending on the weather.
Consider the following code:
This code prints: Wear sunglasses.
Wear a scarf.
We must have rain == False and temperature between 0 and 14°C.
When the temperature is strictly below 0°C.
Before leaving home, you should take an accessory depending on the weather.
Consider the following code:
You should take an umbrella.
Tomorrow, I will wear a hat and sunglasses.
Sometimes it is easier to check whether a condition is not true.
We can do this with the operator not.
This is equivalent to the following syntax:
The if / elif / else statement lets you determine which part of the code is executed, according to one or several conditions.
elif is short for else if.
if, elif and else lines end with a colon (:).
elif and else are optional.
and, or and add parentheses () to indicate priorities.
in and not in keywords to check if an element is in a list.
for loops are generally used when we know how many times to repeat the action.
while loops are generally preferred when we don’t know the number of repetitions in advance.
A for loop performs an action for each element in a group like a list, a dictionary, a string…
The for instruction must end with a colon (:) and the code that will run inside the for loop must be indented.
odds = [1, 3, 5, 7]
numbers_power2 = list()
for i in odds:
    numbers_power2.append(i**2)
    print(f"i contains {i} and numbers_power2 contains {numbers_power2}.")
i contains 1 and numbers_power2 contains [1].
i contains 3 and numbers_power2 contains [1, 9].
i contains 5 and numbers_power2 contains [1, 9, 25].
i contains 7 and numbers_power2 contains [1, 9, 25, 49].
The enumerate function is useful for iterating through a list and finding out the position of each element in the list.
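A minimal sketch of `enumerate()`, which yields the index and the element at each iteration; the list content and the `positions` variable are hypothetical.

```python
amino_acids = ["Ala", "Arg", "Asp"]
positions = list()

# enumerate() gives both the position and the element
for index, amino_acid in enumerate(amino_acids):
    print(f"The item with index {index} is {amino_acid}.")
    positions.append((index, amino_acid))
```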
A while loop performs an action as long as an expression is true.
The while instruction must end with a colon (:) and the code that will run inside the while loop must be indented.
WARNING! If the expression evaluated by the while loop never changes, you might end up with an infinite loop!
odds = [1, 3, 5, 7, 9]
numbers_power_2 = list()
i = 0
while i < len(odds):
    odd_number2 = odds[i]**2
    print(f"The list item with index {i} is {odds[i]}.")
    numbers_power_2.append(odd_number2)
    i += 1
print(numbers_power_2)
The list item with index 0 is 1.
The list item with index 1 is 3.
The list item with index 2 is 5.
The list item with index 3 is 7.
The list item with index 4 is 9.
[1, 9, 25, 49, 81]
With the break statement we can stop the loop even if the while condition is still true or if we are not done with the for iteration.
numbers = [2, 4, 6, 7, 8]
even_numbers = list()
i = 0
while i < len(numbers):
    if numbers[i] % 2 == 1:
        print(f"An odd number has been found ({numbers[i]})")
        break
    else:
        even_numbers.append(numbers[i])
    i += 1
print("The consecutive even numbers are", even_numbers)
An odd number has been found (7)
The consecutive even numbers are [2, 4, 6]
With the continue statement we can go directly to the next iteration without executing the rest of the code in the loop for the current iteration.
loops and conditions within the same block of code.
In this case, you should pay attention to the code indentation. If you get it wrong, the code may still run, but it will not produce the expected result.
Let’s generate all possible pairs of fruits among orange, mango, and lemon.
Here, orange is the first fruit.
- orange and orange
- orange and mango
- orange and lemon
Here, mango is the first fruit.
- mango and orange
- mango and mango
- mango and lemon
Here, lemon is the first fruit.
- lemon and orange
- lemon and mango
- lemon and lemon
Here, orange is the first fruit.
- orange and orange
- orange and mango
- orange and lemon
Here, mango is the first fruit.
- mango and orange
- mango and mango
- mango and lemon
Here, lemon is the first fruit.
- lemon and orange
- lemon and mango
- lemon and lemon
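The output above can be obtained with two nested for loops. This is only a sketch of one possible version, since the original code is not reproduced here; the `pairs` list is a hypothetical addition used to keep the result.

```python
fruits = ["orange", "mango", "lemon"]
pairs = list()

for first_fruit in fruits:
    # Runs once per first fruit (outer loop)
    print(f"Here, {first_fruit} is the first fruit.")
    for second_fruit in fruits:
        # Runs once per pair (inner loop)
        print(f"- {first_fruit} and {second_fruit}")
        pairs.append((first_fruit, second_fruit))
```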
We have a list of integers from 0 to 12.
We want to classify them in a dictionary with keys odd and even. Each key in the dictionary has a list of numbers as its value.
How would you do that?
First, we initialise our list and dictionary:
Then we will iterate over my_int_list. For each element we will test if it is even or odd, and add the element to the list of the appropriate key.
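The two steps can be sketched as follows. The dictionary name `parity` is an assumption, and `list(range(13))` is used here as a shortcut to build the list of integers from 0 to 12.

```python
# Initialise the list (0 to 12) and the dictionary with two empty lists
my_int_list = list(range(13))
parity = {"odd": [], "even": []}   # hypothetical dictionary name

# Test each element and add it to the list of the appropriate key
for number in my_int_list:
    if number % 2 == 0:
        parity["even"].append(number)
    else:
        parity["odd"].append(number)

print(parity)
```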
for loops: the number of repetitions is known in advance.
while loops: the number of repetitions is not known in advance.
for/while, an iterator, the keyword in, a list/dictionary and a colon (:).
while loop is not modified, you will get an infinite loop.
With the break statement, the loop will stop prematurely.
With the continue statement, the loop will go to the next iteration prematurely.
Please open file 004_practical_conditionals_loops.py
Jupyter notebooks are interactive programming environments that allow you to combine text, images, mathematical formulas, tables, graphs and executable computer code in a single document. They can be manipulated in a web browser.
Jupyter notebooks support nearly 40 different languages, including Python.
The cell is the basic element of a Jupyter notebook. It can contain formatted text or computer code that can be executed.
A web browser can be used to open a notebook, but VSCode can also do so as long as the Jupyter notebook extension has been installed.
The file extension for a Jupyter file is .ipynb.
Execute Above Cells: Runs every cell above the current cell.
Execute Cell and Below: Runs the current cell and all of the cells below this one.
Run All: Runs every cell in the notebook.
Restart: Empties the memory (restarts the kernel).
Run All to check if your code works correctly before giving it to someone.
Please open file 005_practical_jupyter.ipynb
Functions are useful for performing an operation multiple times within a program.
A few functions have been introduced during this training.
- print(), which displays what is between the parentheses
- len(), which returns the number of items in a list or dictionary
Basically, any function works like this:
function.
function.
function returns value(s) or object(s).
A function is built with the keyword def to start the definition of the function.
It has to be followed by the function name, parentheses () optionally containing arguments, and a colon (:).
Like for and while loops, the code that will run inside must be indented.
Arguments can be passed to a function.
Some operations can be performed within the function using one or several arguments given in parentheses.
Multiple arguments can be passed to the function.
They must be separated by commas (,) and can be of any type (str, int, float, list, dict, etc.).
function block.
NameError Traceback (most recent call last)
Cell In[176], line 5
2 sqr=x**2
4 square(2)
----> 5 print(f"The square of 2 is {sqr}.")
NameError: name 'sqr' is not defined
The NameError happened because variables defined inside a function are local: they are not accessible from the global code.
To make a function's result usable outside of that function, we have to use return.
The return statement sends a termination signal to the function block and returns values, which can be of any type.
The return statement can be inserted several times in a function.
However, the first return encountered will stop the function execution and return back to the global code.
This is useful when combined with conditional statements to exit the function when the condition is fulfilled.
def speed_limit(x):
    limit = 50
    if x > limit:
        return "Too fast !"
    else:
        return "Perfect !"
print("You are driving at 51km/h.")
result = speed_limit(51)
print(result)
print("You are driving at 30km/h.")
result = speed_limit(30)
print(result)
You are driving at 51km/h.
Too fast !
You are driving at 30km/h.
Perfect !
function.
The second argument has been left empty since we wanted to apply the default value to the function.
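Default values can be illustrated with a short sketch; the `power` function and its `exponent` argument are hypothetical and are not taken from the slides.

```python
def power(x, exponent=2):
    # exponent takes the value 2 when the caller does not provide it
    return x ** exponent

square_of_3 = power(3)               # second argument left empty: default is used
cube_of_3 = power(3, 3)              # default replaced by position
cube_by_name = power(3, exponent=3)  # default replaced by name

print(square_of_3, cube_of_3, cube_by_name)
```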
Function names should be lowercase, with words separated by underscores (_) for better readability.
Function names should not be the same as Python built-in functions or keywords.
Type hints have been available since Python 3.5; some newer notations (for example the X | Y union syntax) require Python 3.10 or later.
Imagine you want to create a function called ‘enzyme’, which takes a string as an argument and returns a split list. It splits every time there is a serine (S) residue (we are in a wonderful world where enzymes cut every time and there are no steric hindrances…).
How would you do that ?
We can then add the argument(s) :
We can then add the instructions (beware of indentation):
Then, we want to see the result!
Now let’s try !
Great, it works ! I want to see it in a variable.
Oops, I forgot to include the return in the function.
OK, now let’s enhance our function! Currently it cuts only on uppercase S but we want to be able to accept sequences in upper and lower case letters.
That’s pretty good, but now we want to add the ability to cut according to another amino acid, while keeping Serine as the default value.
You may have noticed that… the catalytic site is not in the list anymore… In reality, an enzyme can cut before or after the catalytic site, but the recognised amino acid should always be present. How would you approach this? (tips: before will be a boolean which, by default, performs an enzyme cut before a catalytic site).
def enzyme(my_string, catalytic_site = "S", before = True):
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
    else:
        for my_peptide in range(0,(len(res)-1)):
            res[my_peptide] = res[my_peptide] + catalytic_site
    return res
pept = "AGESMKT"
answer = enzyme(pept)
print(answer)
answer = enzyme(pept, "T")
print(answer)
answer = enzyme(pept, "T", before=False)
print(answer)
answer = enzyme(pept, "A")
print(answer)
['AGE', 'SMKT']
['AGESMK', 'T']
['AGESMKT', '']
['', 'AGESMKT']
We can see that if our peptide begins or ends with the catalytic site, the split produces an unexpected empty string. We don’t want this empty string.
How would you do this?
def enzyme(my_string, catalytic_site = "S", before = True):
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
        if res[0] == "":
            res.pop(0)
    else:
        for my_peptide in range(0,(len(res)-1)):
            res[my_peptide] = res[my_peptide] + catalytic_site
        if res[-1] == "":
            res.pop(-1)
    return res
pept = "AGESMKT"
print(enzyme(pept))
print(enzyme(pept, "T"))
print(enzyme(pept, "O"))
print(enzyme(pept, "T", before=False))
print(enzyme(pept, "A"))
['AGE', 'SMKT']
['AGESMK', 'T']
['AGESMKT']
['AGESMKT']
['AGESMKT']
Well played! You’re almost there with this beautiful function! Adding documentation within docstrings will be helpful if in two years you want to remember what the function does, or if you give your code to someone else.
def enzyme(my_string : str, catalytic_site = "S", before = True) -> list:
    """
    Simulate an enzyme cleavage using a catalytic site. The cleavage can occur before or after the catalytic site.
    Arguments:
        my_string: string
            The protein to be digested.
        catalytic_site: string - optional
            The cleavage site used to split the protein.
        before: boolean - optional
            Whether the enzyme cuts before or after the cleavage site.
            If `before` is True, the enzyme cuts before the catalytic site, otherwise it cuts after the catalytic site.
    """
    res = my_string.upper().split(catalytic_site)
    if before == True:
        for my_peptide in range(1, len(res)):
            res[my_peptide] = catalytic_site + res[my_peptide]
        if res[0] == "":
            res.pop(0)
    else:
        for my_peptide in range(0,(len(res)-1)):
            res[my_peptide] = res[my_peptide] + catalytic_site
        if res[-1] == "":
            res.pop(-1)
    return res
short_sab = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEP"
res = enzyme(short_sab)
print(res)
['MKWVTFI', 'SLLFLF', 'S', 'SAY', 'SRGVFRRDAHK', 'SEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADE', 'SAENCDK', 'SLHTLFGDKLCTVATLRETYGEMADCCAKQEP']
Although not mandatory, docstrings are highly recommended!
Functions are useful for performing an operation multiple times within a program.
A function is defined with def, the function name, parentheses () optionally containing arguments, and a colon (:).
The code inside a function must be indented.
The return statement ends the function and sends a result where it is called.
There can be several return statements in a function if you use conditional statements, but the first return encountered will stop the function execution and return to the global code.
Please open file 006_practical_functions.ipynb
Standard Python is a powerful language that can do many things, and developers may help the community with “ready-to-use” functions bundled in packages.
Packages contain collections of functions developed to accomplish common tasks.
The Python community is really active and has developed many packages providing functions for almost any purpose.
Some examples of useful packages:
Use import followed by the package name to load a package in Python.
Once imported, you can call a function from the package by writing package_name.function_name.
Here we have imported the package random to use the function randint which draws a random integer between 0 and 10.
Another common way to import functions from a package is to use the keyword from.
from is useful to import one or several functions without recalling the package’s name.
*.
All random functions have been imported and can be used directly by naming them like randint or choices.
Be careful when using * with multiple packages. Some packages might have functions with the same name, and this can cause conflicts in Python. In fact, it is strongly recommended not to use * to import everything from a package.
Choose your aliases wisely!
Be careful! Importing a function with from and using an alias may overwrite another function!
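The import styles described above can be compared side by side using the standard `random` package. The alias `rd` and the list of nucleotides are hypothetical choices.

```python
import random                  # whole package, functions called as random.name
from random import randint     # a single function, called without the package name
import random as rd            # whole package under a shorter alias

# Draw a random integer between 0 and 10 (both included)
first_draw = random.randint(0, 10)
second_draw = randint(0, 10)

# Same package through its alias: pick one element of a list
nucleotide = rd.choice(["A", "C", "G", "T"])

print(first_draw, second_draw, nucleotide)
```

Giving an imported function an alias that is already the name of another function would hide the original one for the rest of the script, which is why aliases should be chosen with care.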
A large number of packages, or certain combinations, might result in conflicts. For advanced usage, using a conda environment is recommended.
Packages contain collections of “ready-to-use” functions developed to accomplish common tasks.
package, you need to install it first.
pip install package_name or
pip install git+https://github.com/pseudo/repo-name.git
import package_name then package_name.function_name.
from package_name import function_name then function_name.
from package_name import * then function_name.
alias:
import a_package_with_a_long_name as pack then pack.function_name
alias!
Please open file 007_practical_packages.ipynb
The main operations that you can perform on files are: reading a file and writing to a file.
When you access a file on an operating system, a file path is required, which represents the location of a file. It is broken up into three major parts:
/ (Unix) or backslash \ (Windows)
.) used to indicate the file type
The path can be:
/home/Toto/Documents/Trainings/Python/ is the absolute path of the folder.
/home/Toto/Documents/Trainings/Python/practical_work/exercises.py is the absolute path of exercises.py.
./Data/sequences.fasta is the relative path of sequences.fasta (relative to the exercises.py file).
exercises is the file name.
py is the file extension.
About relative paths
./ means the same directory.
./Data/Sequences.fasta and Data/Sequences.fasta should work the same.
../ means the parent directory.
To access the Python_slides.html file from exercises.py you will use ../Python_slides.html
open().
The syntax using with is recommended for most cases.
Note the alias as f: f refers to the file example.txt opened in r (read) mode.
The file handler is automatically closed when you exit the with block.
If you open a file in ‘writing’ mode without using with and forget to close the file handler, your changes may not be saved.
readlines(), or line by line.
content is a list.
The whole file is read in one go. It can be useful for files with few lines.
The whole file is stored in a list. This should not be done with big files.
write() method.
Be careful which parameter you choose in open(), “a” or “w”:
- in writing mode, any previous content is deleted.
- in appending mode, the text is added to the end of the file.
Unlike the print() function, the .write() method does not automatically add a new line (\n).
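The reading and writing modes can be tried safely with the sketch below. It writes to a temporary folder created with the standard `tempfile` module, which is an assumption made here so that no real file is overwritten; the file name and its content are hypothetical.

```python
import os
import tempfile

# Temporary folder so that the example does not touch existing files
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")

# "w": creates the file (or deletes any previous content)
with open(path, "w") as f:
    f.write("first line\n")      # the new line must be added explicitly

# "a": adds text at the end of the file
with open(path, "a") as f:
    f.write("second line\n")

# Read the whole file into a list of lines
with open(path, "r") as f:
    content = f.readlines()

# Read line by line, removing the trailing new line
stripped = list()
with open(path, "r") as f:
    for line in f:
        stripped.append(line.strip())

# "w" again: the previous content is lost
with open(path, "w") as f:
    f.write("new content\n")
with open(path, "r") as f:
    overwritten = f.read()

print(content, stripped, overwritten)
```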
open().
open('/home/Toto/Data/example.txt', 'r')
open('/home/Toto/Data/example.txt', 'w') → overwrites the file
open('/home/Toto/Data/example.txt', 'a') → adds text at the end of the file
Please open file 008_practical_io.ipynb
pandas and Polars. In this training we will focus on pandas.
pd is a common alias used for pandas, but you could also simply write import pandas then just use the functions by calling pandas.function.
pd.DataFrame, which creates an object DataFrame with various methods.
keys will become the column names in the dataframe.
values are lists, each of which will become a column in the dataframe. They must all have the same length.
Please note that a column containing numbers starting from zero has been added. This column is called an index.
farm_id region crop_type soil_moisture soil_pH temperature_C \
0 FARM0001 North India Wheat 35.95 5.99 17.79
1 FARM0002 South USA Soybean 19.74 7.24 30.18
2 FARM0003 South USA Wheat 29.32 7.16 27.37
3 FARM0004 Central USA Maize 17.33 6.03 33.73
4 FARM0005 Central USA Cotton 19.37 5.92 33.86
.. ... ... ... ... ... ...
495 FARM0496 Central USA Rice 42.85 6.70 30.85
496 FARM0497 North India Soybean 34.22 6.75 17.46
497 FARM0498 North India Cotton 15.93 5.72 17.03
498 FARM0499 NaN Soybean 38.61 6.20 17.08
499 FARM0500 North India Wheat 30.22 7.42 20.57
rainfall_mm humidity sunlight_hours irrigation_type ... sowing_date \
0 75.62 77.03 7.27 NaN ... 01-08-24
1 89.91 61.13 5.67 Sprinkler ... 02-04-24
2 265.43 68.87 8.23 Drip ... 02-03-24
3 212.01 70.46 5.03 Sprinkler ... 02-21-24
4 269.09 55.73 7.93 NaN ... 02-05-24
.. ... ... ... ... ... ...
495 52.35 79.58 7.25 Manual ... 01-16-24
496 256.23 45.14 5.78 NaN ... 01-01-24
497 288.96 57.87 7.69 Drip ... 01-02-24
498 279.06 73.09 9.60 Drip ... 01-25-24
499 72.61 89.74 5.09 NaN ... 02-16-24
harvest_date total_days yield_kg_per_hectare sensor_id timestamp \
0 05-09-24 122 4408.07 SENS0001 03-19-24
1 05-26-24 112 5389.98 SENS0002 04-21-24
2 06-26-24 144 2931.16 SENS0003 02-28-24
3 07-04-24 134 4227.80 SENS0004 05-14-24
4 05-20-24 105 4979.96 SENS0005 04-13-24
.. ... ... ... ... ...
495 06-02-24 138 4251.40 SENS0496 05-08-24
496 04-14-24 104 3708.54 SENS0497 01-19-24
497 05-09-24 128 2604.41 SENS0498 04-20-24
498 06-04-24 131 2586.36 SENS0499 03-02-24
499 06-29-24 134 5891.40 SENS0500 05-11-24
latitude longitude NDVI_index crop_disease_status
0 14.970941 82.997689 0.63 Mild
1 16.613022 70.869009 0.58 NaN
2 19.503156 79.068206 0.80 Mild
3 31.071298 85.519998 0.44 NaN
4 16.568540 81.691720 0.84 Severe
.. ... ... ... ...
495 30.386623 76.147700 0.59 Mild
496 18.832748 75.736924 0.85 Severe
497 23.262016 81.992230 0.71 Mild
498 19.764989 84.426869 0.77 Severe
499 13.455532 88.880605 0.85 Severe
[500 rows x 22 columns]
- Series. It is a one-dimensional object.
- Series.
- Series can only contain one type of data, whereas a dataframe can contain columns of different types: a column of integers, a column of decimal numbers, etc.
- dataframe.head(n) prints the first n rows of the dataframe.
- If n is not provided, the first 5 lines are printed.

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_C | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | sowing_date | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FARM0001 | North India | Wheat | 35.95 | 5.99 | 17.79 | 75.62 | 77.03 | 7.27 | NaN | ... | 01-08-24 | 05-09-24 | 122 | 4408.07 | SENS0001 | 03-19-24 | 14.970941 | 82.997689 | 0.63 | Mild |
| 1 | FARM0002 | South USA | Soybean | 19.74 | 7.24 | 30.18 | 89.91 | 61.13 | 5.67 | Sprinkler | ... | 02-04-24 | 05-26-24 | 112 | 5389.98 | SENS0002 | 04-21-24 | 16.613022 | 70.869009 | 0.58 | NaN |
| 2 | FARM0003 | South USA | Wheat | 29.32 | 7.16 | 27.37 | 265.43 | 68.87 | 8.23 | Drip | ... | 02-03-24 | 06-26-24 | 144 | 2931.16 | SENS0003 | 02-28-24 | 19.503156 | 79.068206 | 0.80 | Mild |
| 3 | FARM0004 | Central USA | Maize | 17.33 | 6.03 | 33.73 | 212.01 | 70.46 | 5.03 | Sprinkler | ... | 02-21-24 | 07-04-24 | 134 | 4227.80 | SENS0004 | 05-14-24 | 31.071298 | 85.519998 | 0.44 | NaN |
| 4 | FARM0005 | Central USA | Cotton | 19.37 | 5.92 | 33.86 | 269.09 | 55.73 | 7.93 | NaN | ... | 02-05-24 | 05-20-24 | 105 | 4979.96 | SENS0005 | 04-13-24 | 16.568540 | 81.691720 | 0.84 | Severe |
| 5 | FARM0006 | Central USA | Rice | 44.91 | 5.78 | 24.87 | 238.95 | 83.06 | 4.92 | Sprinkler | ... | 01-13-24 | 05-06-24 | 114 | 4383.55 | SENS0006 | 03-12-24 | 23.227859 | 89.421568 | 0.82 | NaN |
6 rows × 22 columns
- dataframe.tail(n) is used to show the last n rows of the dataframe.
- If n is not provided, the last 5 lines are printed.

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_C | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | sowing_date | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 495 | FARM0496 | Central USA | Rice | 42.85 | 6.70 | 30.85 | 52.35 | 79.58 | 7.25 | Manual | ... | 01-16-24 | 06-02-24 | 138 | 4251.40 | SENS0496 | 05-08-24 | 30.386623 | 76.147700 | 0.59 | Mild |
| 496 | FARM0497 | North India | Soybean | 34.22 | 6.75 | 17.46 | 256.23 | 45.14 | 5.78 | NaN | ... | 01-01-24 | 04-14-24 | 104 | 3708.54 | SENS0497 | 01-19-24 | 18.832748 | 75.736924 | 0.85 | Severe |
| 497 | FARM0498 | North India | Cotton | 15.93 | 5.72 | 17.03 | 288.96 | 57.87 | 7.69 | Drip | ... | 01-02-24 | 05-09-24 | 128 | 2604.41 | SENS0498 | 04-20-24 | 23.262016 | 81.992230 | 0.71 | Mild |
| 498 | FARM0499 | NaN | Soybean | 38.61 | 6.20 | 17.08 | 279.06 | 73.09 | 9.60 | Drip | ... | 01-25-24 | 06-04-24 | 131 | 2586.36 | SENS0499 | 03-02-24 | 19.764989 | 84.426869 | 0.77 | Severe |
| 499 | FARM0500 | North India | Wheat | 30.22 | 7.42 | 20.57 | 72.61 | 89.74 | 5.09 | NaN | ... | 02-16-24 | 06-29-24 | 134 | 5891.40 | SENS0500 | 05-11-24 | 13.455532 | 88.880605 | 0.85 | Severe |
5 rows × 22 columns
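
The behaviour of head and tail can be checked on a small toy dataframe. The sketch below assumes made-up data with 8 rows; the names df, first_rows, first_two and last_three are chosen for illustration.

```python
import pandas as pd

# Toy dataframe with 8 rows (values made up for illustration).
df = pd.DataFrame({"sample": list("abcdefgh"), "value": range(8)})

first_rows = df.head()   # no argument: the first 5 rows
first_two = df.head(2)   # the first 2 rows
last_three = df.tail(3)  # the last 3 rows

print(first_two)
print(last_three)
```

Both methods return a new dataframe and keep the original index labels, which is why the rows returned by tail are still labelled 5, 6 and 7.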
- pandas methods is describe, which gives a statistical summary of all numeric variables.

As shown in the summary below, only quantitative variables are described by default.

| | soil_moisture | soil_pH | temperature_C | rainfall_mm | humidity | sunlight_hours | pesticide_usage_ml | total_days | yield_kg_per_hectare | latitude | longitude | NDVI_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 497.000000 | 498.000000 | 499.000000 | 499.000000 | 497.000000 | 500.00000 | 500.000000 | 500.000000 | 499.000000 | 500.000000 | 499.000000 | 500.000000 |
| mean | 26.754789 | 6.525181 | 24.695130 | 181.872886 | 65.169618 | 7.03014 | 26.586980 | 119.496000 | 4032.258818 | 22.442473 | 80.403927 | 0.602060 |
| std | 10.122341 | 0.585128 | 5.336647 | 72.244299 | 14.655248 | 1.69167 | 13.202429 | 16.798046 | 1175.516477 | 7.283492 | 5.910818 | 0.175402 |
| min | 10.160000 | 5.510000 | 15.010000 | 50.170000 | 40.230000 | 4.01000 | 5.050000 | 90.000000 | 2023.560000 | 10.004243 | 70.020021 | 0.300000 |
| 25% | 17.900000 | 6.030000 | 20.305000 | 119.760000 | 51.760000 | 5.66750 | 14.945000 | 105.750000 | 2994.750000 | 16.263202 | 75.380396 | 0.447500 |
| 50% | 25.890000 | 6.530000 | 24.700000 | 192.360000 | 65.610000 | 6.99500 | 25.980000 | 119.000000 | 4070.970000 | 21.981743 | 80.669355 | 0.610000 |
| 75% | 35.950000 | 7.040000 | 29.090000 | 239.120000 | 77.960000 | 8.47000 | 38.005000 | 134.000000 | 5066.060000 | 28.528948 | 85.656333 | 0.750000 |
| max | 44.980000 | 7.500000 | 34.840000 | 298.960000 | 90.000000 | 10.00000 | 49.940000 | 150.000000 | 5998.290000 | 34.981531 | 89.991901 | 0.900000 |
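
A minimal sketch of describe on a toy dataframe (made-up values, one text column and two numeric columns) shows that the text column is left out of the default summary.

```python
import pandas as pd

# Toy data for illustration: one text column, two numeric columns.
df = pd.DataFrame({
    "crop_type": ["Wheat", "Soybean", "Wheat", "Maize"],
    "soil_pH": [5.99, 7.24, 7.16, 6.03],
    "total_days": [122, 112, 144, 134],
})

summary = df.describe()  # numeric columns only by default

print(summary)
print(summary.loc["mean", "total_days"])  # 128.0
```

The result is itself a dataframe, so individual statistics can be read with .loc as in the last line. Passing include="all" to describe also summarises non-numeric columns.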
- my_data['column_name'].

Please note that selecting a single column this way returns a Series rather than a dataframe; most of the usual functions (like head()) can still be applied to it.

- my_data[['column_name_1', 'column_name_2']].

Please note the double pair of brackets [[]] when selecting several columns; the result is a dataframe.
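
The difference between single and double brackets can be verified with a short sketch on toy data (the dataframe below is made up for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    "farm_id": ["FARM0001", "FARM0002", "FARM0003"],
    "region": ["North India", "South USA", "South USA"],
    "soil_pH": [5.99, 7.24, 7.16],
})

one_column = df["soil_pH"]                # single brackets: a Series
two_columns = df[["farm_id", "soil_pH"]]  # double brackets: a DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```

Double brackets around a single name, as in df[["soil_pH"]], also return a dataframe, with one column.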
.iloc
- The iloc method allows you to select a subset of your dataframe based on positions.
- my_data.iloc[row_index, column_index]. You can use a colon (:) to select a range.

| | region | crop_type | soil_moisture | soil_pH | temperature_C | rainfall_mm | humidity | sunlight_hours | irrigation_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | North India | Wheat | 35.95 | 5.99 | 17.79 | 75.62 | 77.03 | 7.27 | NaN |
| 1 | South USA | Soybean | 19.74 | 7.24 | 30.18 | 89.91 | 61.13 | 5.67 | Sprinkler |
| 2 | South USA | Wheat | 29.32 | 7.16 | 27.37 | 265.43 | 68.87 | 8.23 | Drip |
| 3 | Central USA | Maize | 17.33 | 6.03 | 33.73 | 212.01 | 70.46 | 5.03 | Sprinkler |
| 4 | Central USA | Cotton | 19.37 | 5.92 | 33.86 | 269.09 | 55.73 | 7.93 | NaN |
- Use ::n to specify a step of n.

| | farm_id | soil_moisture | rainfall_mm | irrigation_type | sowing_date | yield_kg_per_hectare | latitude | crop_disease_status |
|---|---|---|---|---|---|---|---|---|
| 0 | FARM0001 | 35.95 | 75.62 | NaN | 01-08-24 | 4408.07 | 14.970941 | Mild |
| 150 | FARM0151 | 28.82 | 69.76 | Sprinkler | 03-21-24 | 5338.11 | 17.754237 | Mild |
| 300 | FARM0301 | 28.32 | 207.67 | Manual | 02-18-24 | 2043.13 | 22.816578 | Severe |
| 450 | FARM0451 | 10.22 | 74.22 | NaN | 02-06-24 | 3498.61 | 13.358302 | NaN |
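
Position-based selection with ranges and steps can be sketched on a toy dataframe with 10 rows and 4 columns (made-up values; the column names a to d are placeholders).

```python
import pandas as pd

# Toy dataframe: 10 rows, 4 columns.
df = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "c": range(20, 30),
    "d": range(30, 40),
})

top_left = df.iloc[0:3, 0:2]   # rows 0 to 2, columns 0 and 1 (the end is excluded)
every_third = df.iloc[::3, :]  # rows 0, 3, 6, 9 and all columns
one_value = df.iloc[4, 1]      # a single cell: row 4, column 1

print(top_left)
print(every_third)
print(one_value)  # 14
```

As with Python lists, positions start at zero and the end of a range is not included.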
Which code allows access to the last 5 lines of the first 3 columns of a dataframe?
- my_data.iloc[5:, :3]
- my_data.iloc[-5:, :3]
- my_data.iloc[-5:, :4]

my_data.iloc[5:, :3]: this will display the first 3 columns for all rows except the first 5.

| | farm_id | region | crop_type |
|---|---|---|---|
| 5 | FARM0006 | Central USA | Rice |
| 6 | FARM0007 | North India | Soybean |
| 7 | FARM0008 | East Africa | Maize |
| 8 | FARM0009 | Central USA | Soybean |
| 9 | FARM0010 | East Africa | Rice |
| ... | ... | ... | ... |
| 495 | FARM0496 | Central USA | Rice |
| 496 | FARM0497 | North India | Soybean |
| 497 | FARM0498 | North India | Cotton |
| 498 | FARM0499 | NaN | Soybean |
| 499 | FARM0500 | North India | Wheat |
495 rows × 3 columns
my_data.iloc[-5:, :3]: this is the right answer.
my_data.iloc[-5:, :4]: this will display the last 5 lines of the first 4 columns.
Which code allows access to all rows of the third, fourth and fifth columns?
- my_data.iloc[:, 2:5]
- my_data.iloc[:, 3:5]
- my_data.iloc[3:6, :]

my_data.iloc[:, 2:5]: this is the right answer. Don’t forget that the numbering starts at zero!
my_data.iloc[:, 3:5]: this will only display the columns at positions 3 and 4, i.e. the fourth and fifth columns (position 5 is excluded).
my_data.iloc[3:6, :]: this will display all columns for the rows at positions 3 to 5.

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_C | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | sowing_date | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | FARM0004 | Central USA | Maize | 17.33 | 6.03 | 33.73 | 212.01 | 70.46 | 5.03 | Sprinkler | ... | 02-21-24 | 07-04-24 | 134 | 4227.80 | SENS0004 | 05-14-24 | 31.071298 | 85.519998 | 0.44 | NaN |
| 4 | FARM0005 | Central USA | Cotton | 19.37 | 5.92 | 33.86 | 269.09 | 55.73 | 7.93 | NaN | ... | 02-05-24 | 05-20-24 | 105 | 4979.96 | SENS0005 | 04-13-24 | 16.568540 | 81.691720 | 0.84 | Severe |
| 5 | FARM0006 | Central USA | Rice | 44.91 | 5.78 | 24.87 | 238.95 | 83.06 | 4.92 | Sprinkler | ... | 01-13-24 | 05-06-24 | 114 | 4383.55 | SENS0006 | 03-12-24 | 23.227859 | 89.421568 | 0.82 | NaN |
3 rows × 22 columns
.loc
- The loc method allows you to select a subset of your dataframe based on labels (row or column names).
- my_data.loc[row_names, column_names].
- my_data['column'] ** condition

Here, you have to replace ** with a comparison operator like ==, >=, !=, etc.
.loc
- To select all rows (or all columns), use a colon (:) in the first (or second) position as an argument given to loc.

ANSWER:
| | soil_moisture | soil_pH | temperature_C |
|---|---|---|---|
| 0 | 35.95 | 5.99 | 17.79 |
| 6 | 36.28 | 7.04 | 21.80 |
| 13 | 12.80 | 5.87 | 26.90 |
| 20 | 16.25 | 7.43 | 20.31 |
| 31 | 39.76 | 6.70 | 17.42 |
| ... | ... | ... | ... |
| 491 | 32.14 | 7.44 | 21.49 |
| 494 | 12.52 | 5.99 | 33.18 |
| 496 | 34.22 | 6.75 | 17.46 |
| 497 | 15.93 | 5.72 | 17.03 |
| 499 | 30.22 | 7.42 | 20.57 |
99 rows × 3 columns
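
Label-based selection with a condition can be sketched as follows. The toy dataframe and the chosen condition are assumptions for illustration and are not necessarily those used in the exercise.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North India", "South USA", "South USA", "Central USA"],
    "crop_type": ["Wheat", "Soybean", "Wheat", "Maize"],
    "soil_pH": [5.99, 7.24, 7.16, 6.03],
})

# A comparison on a column gives one True/False value per row.
mask = df["region"] == "South USA"

subset = df.loc[mask, ["crop_type", "soil_pH"]]  # matching rows, two columns
all_rows = df.loc[:, ["soil_pH"]]                # colon in first position: every row
first_three = df.loc[0:2, "region"]              # label ranges include the end label

print(subset)
print(len(first_three))  # 3
```

Unlike iloc, a range given to loc includes its end label, which is why 0:2 returns three rows here.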
- my_data['column_name'] not only allows you to access a column in a dataframe, but also to modify it.

my_data['humidity'] = my_data['humidity'] / 100 # converts the humidity from a percentage into a fraction between 0 and 1
my_data.iloc[0:5, 0:10] # checks that the dataframe has been modified in-place

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_C | rainfall_mm | humidity | sunlight_hours | irrigation_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FARM0001 | North India | Wheat | 35.95 | 5.99 | 17.79 | 75.62 | 0.7703 | 7.27 | NaN |
| 1 | FARM0002 | South USA | Soybean | 19.74 | 7.24 | 30.18 | 89.91 | 0.6113 | 5.67 | Sprinkler |
| 2 | FARM0003 | South USA | Wheat | 29.32 | 7.16 | 27.37 | 265.43 | 0.6887 | 8.23 | Drip |
| 3 | FARM0004 | Central USA | Maize | 17.33 | 6.03 | 33.73 | 212.01 | 0.7046 | 5.03 | Sprinkler |
| 4 | FARM0005 | Central USA | Cotton | 19.37 | 5.92 | 33.86 | 269.09 | 0.5573 | 7.93 | NaN |
- If my_data['column_name'] does not already exist, it will be created on the fly.

Unlike the following two options, the drop method does not modify the existing dataframe; it returns a copy of the dataframe with the changes applied. To keep the result, assign it back to your variable.
The pop method returns the deleted column, which can be assigned to a variable with col = my_data.pop('id').
| farm_id | region | crop_type | soil_moisture | soil_pH | temperature_C | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 23 columns
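
Creating and deleting columns can be sketched on a toy dataframe. The data is made up, the id and comment columns are hypothetical, and the Celsius to Fahrenheit formula is the standard one (how the course computed its own temperature_F column is not shown here).

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "temperature_C": [17.79, 30.18, 27.37],
    "comment": ["ok", "ok", "check"],
})

# Assigning to a column name that does not exist creates the column.
df["temperature_F"] = df["temperature_C"] * 9 / 5 + 32

# drop returns a new dataframe; df itself is unchanged until it is reassigned.
dropped = df.drop(columns=["comment"])
still_there = "comment" in df.columns  # True
df = dropped

# pop removes the column from df and returns it as a Series.
ids = df.pop("id")

print(list(df.columns))  # ['temperature_C', 'temperature_F']
print(list(ids))         # [1, 2, 3]
```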
- The rename method allows you to rename one or more columns at a time using the following syntax: my_dataframe.rename(columns={'old name': 'new name'})

The rename method does not modify the existing dataframe, unless the inplace = True argument is used.
The two following syntaxes are equivalent:
- my_dataframe.rename(columns={'old name': 'new name'}, inplace = True)
- my_dataframe = my_dataframe.rename(columns={'old name': 'new name'})
| farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 23 columns
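
A short sketch of rename on toy data, renaming two columns at once and showing that the original dataframe keeps its names when the result is stored in a new variable (the values are made up for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    "temperature_C": [17.79, 30.18],
    "temperature_F": [64.022, 86.324],
})

# Several columns can be renamed in one call: {old name: new name, ...}
renamed = df.rename(columns={
    "temperature_C": "temperature_Celsius",
    "temperature_F": "temperature_Fahrenheit",
})

print(list(df.columns))       # the original names: df is unchanged
print(list(renamed.columns))  # the new names
```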
- The sort_values method allows you to sort a dataframe according to one or more columns specified in parentheses.

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | FARM0023 | East Africa | Soybean | 20.53 | 6.60 | 15.01 | 121.73 | 0.6149 | 7.48 | Manual | ... | 07-07-24 | 122 | 3892.74 | SENS0023 | 06-30-24 | 33.995800 | 84.719229 | 0.70 | Moderate | 59.018 |
| 478 | FARM0479 | Central USA | Maize | 26.91 | 6.03 | 15.04 | 207.79 | 0.5968 | 9.54 | Drip | ... | 05-24-24 | 112 | 2023.56 | SENS0479 | 03-31-24 | 18.213795 | 77.077855 | 0.30 | Mild | 59.072 |
| 24 | FARM0025 | South USA | Cotton | 18.54 | 6.81 | 15.11 | 237.74 | 0.7850 | 4.64 | NaN | ... | 07-14-24 | 119 | 2200.87 | SENS0025 | 04-17-24 | 32.936750 | 72.427172 | 0.38 | Severe | 59.198 |
| 419 | FARM0420 | South USA | Rice | 38.91 | 5.51 | 15.20 | 139.47 | 0.6773 | 4.85 | NaN | ... | 05-03-24 | 91 | 2796.49 | SENS0420 | 03-09-24 | 14.353665 | 87.707645 | 0.73 | Moderate | 59.360 |
| 435 | FARM0436 | Central USA | Cotton | 39.95 | 6.29 | 15.21 | 78.67 | 0.8586 | 5.96 | NaN | ... | 05-29-24 | 132 | 2969.17 | SENS0436 | 05-09-24 | 13.506394 | 86.408534 | 0.80 | Mild | 59.378 |
| 197 | FARM0198 | South India | Soybean | 41.22 | 6.73 | 15.23 | 283.59 | 0.6528 | 6.82 | NaN | ... | 05-26-24 | 105 | 3323.58 | SENS0198 | 03-17-24 | 11.258768 | 74.454130 | 0.69 | Severe | 59.414 |
| 323 | FARM0324 | South USA | Cotton | 18.42 | 6.62 | 15.25 | 232.95 | 0.8750 | 4.80 | Manual | ... | 04-30-24 | 120 | 4676.14 | SENS0324 | 01-21-24 | 27.582612 | 87.158442 | 0.75 | NaN | 59.450 |
| 58 | FARM0059 | South India | Wheat | 33.14 | 5.55 | 15.30 | 247.50 | 0.5190 | 5.94 | Sprinkler | ... | 07-15-24 | 123 | 2454.60 | SENS0059 | 03-22-24 | 21.906149 | 85.560341 | 0.61 | NaN | 59.540 |
| 29 | FARM0030 | Central USA | Cotton | 18.83 | 5.66 | 15.39 | 184.85 | 0.9000 | 6.10 | Drip | ... | 04-19-24 | 102 | 5356.92 | SENS0030 | 03-27-24 | 13.809559 | 72.524419 | 0.70 | Mild | 59.702 |
| 442 | FARM0443 | East Africa | Cotton | 32.68 | 6.08 | 15.47 | 261.73 | 0.5656 | 5.45 | Drip | ... | 08-06-24 | 136 | 2889.78 | SENS0443 | 06-28-24 | 23.036798 | 73.670909 | 0.68 | NaN | 59.846 |
10 rows × 23 columns
- Use ascending = False to sort in descending order.

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 274 | FARM0275 | East Africa | Wheat | 25.81 | 7.15 | 15.85 | 298.96 | 0.6594 | 6.37 | Drip | ... | 06-12-24 | 102 | 3164.72 | SENS0275 | 04-21-24 | 32.109939 | 85.473540 | 0.47 | Mild | 60.530 |
| 332 | FARM0333 | Central USA | Cotton | 41.36 | 7.44 | 30.08 | 298.52 | 0.7334 | 8.80 | NaN | ... | 08-10-24 | 147 | 2160.32 | SENS0333 | 08-04-24 | 12.921902 | 70.495912 | 0.67 | Mild | 86.144 |
| 186 | FARM0187 | East Africa | Maize | 24.46 | 7.24 | 18.02 | 298.09 | 0.5713 | 9.92 | NaN | ... | 07-23-24 | 139 | 2323.25 | SENS0187 | 05-30-24 | 25.775819 | 73.536485 | 0.68 | NaN | 64.436 |
| 266 | FARM0267 | East Africa | Soybean | 36.26 | 6.60 | 27.46 | 298.08 | 0.7475 | 8.01 | NaN | ... | 06-16-24 | 106 | 2681.28 | SENS0267 | 04-07-24 | 15.017401 | 83.930534 | 0.46 | Severe | 81.428 |
| 347 | FARM0348 | North India | Maize | 44.13 | 6.18 | 26.90 | 297.67 | 0.4614 | 9.03 | NaN | ... | 07-04-24 | 107 | 5025.21 | SENS0348 | 06-29-24 | 26.095779 | 78.004711 | 0.59 | Severe | 80.420 |
| 7 | FARM0008 | East Africa | Maize | 27.10 | 5.72 | 22.26 | 296.33 | 0.8034 | 5.44 | Sprinkler | ... | 05-24-24 | 121 | 5264.09 | SENS0008 | 04-30-24 | 23.317654 | 72.515210 | 0.70 | Mild | 72.068 |
| 230 | FARM0231 | South India | Maize | 12.80 | 5.58 | 22.69 | 296.11 | 0.7070 | 7.13 | Drip | ... | 05-13-24 | 102 | 5402.27 | SENS0231 | 05-13-24 | 22.953832 | 73.894930 | 0.77 | Mild | 72.842 |
| 31 | FARM0032 | North India | Maize | 39.76 | 6.70 | 17.42 | 295.96 | 0.7913 | 6.08 | NaN | ... | 07-10-24 | 111 | 2050.61 | SENS0032 | 05-13-24 | 30.558273 | 72.110777 | 0.88 | Severe | 63.356 |
| 408 | FARM0409 | East Africa | Maize | 23.54 | 7.18 | 31.24 | 295.95 | 0.4624 | 6.22 | Sprinkler | ... | 07-17-24 | 138 | 3124.54 | SENS0409 | 05-31-24 | 14.787792 | 86.325616 | 0.68 | Mild | 88.232 |
| 259 | FARM0260 | Central USA | Cotton | 25.66 | 6.29 | 29.53 | 295.74 | 0.6979 | 7.11 | Manual | ... | 05-30-24 | 144 | 3259.62 | SENS0260 | 03-17-24 | 32.977802 | 80.225430 | 0.64 | Mild | 85.154 |
10 rows × 23 columns
- To sort by several columns, pass a list of column names: my_data.sort_values(['column A', 'column B']).

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 332 | FARM0333 | Central USA | Cotton | 41.36 | 7.44 | 30.08 | 298.52 | 0.7334 | 8.80 | NaN | ... | 08-10-24 | 147 | 2160.32 | SENS0333 | 08-04-24 | 12.921902 | 70.495912 | 0.67 | Mild | 86.144 |
| 259 | FARM0260 | Central USA | Cotton | 25.66 | 6.29 | 29.53 | 295.74 | 0.6979 | 7.11 | Manual | ... | 05-30-24 | 144 | 3259.62 | SENS0260 | 03-17-24 | 32.977802 | 80.225430 | 0.64 | Mild | 85.154 |
| 28 | FARM0029 | Central USA | Cotton | 35.35 | 7.18 | 33.39 | 295.18 | 0.6671 | 9.44 | Drip | ... | 05-26-24 | 119 | 2726.92 | SENS0029 | 03-01-24 | 19.477597 | 74.233206 | 0.50 | Severe | 92.102 |
| 4 | FARM0005 | Central USA | Cotton | 19.37 | 5.92 | 33.86 | 269.09 | 0.5573 | 7.93 | NaN | ... | 05-20-24 | 105 | 4979.96 | SENS0005 | 04-13-24 | 16.568540 | 81.691720 | 0.84 | Severe | 92.948 |
| 132 | FARM0133 | Central USA | Cotton | 13.71 | 5.70 | 19.44 | 236.71 | 0.6790 | 8.13 | Sprinkler | ... | 07-07-24 | 133 | 4354.36 | SENS0133 | 07-02-24 | 13.768623 | 89.954055 | 0.59 | Moderate | 66.992 |
| 288 | FARM0289 | Central USA | Cotton | 41.12 | 5.71 | 30.32 | 236.39 | 0.4112 | 8.55 | Sprinkler | ... | 06-12-24 | 124 | 3276.60 | SENS0289 | 04-05-24 | 26.778101 | 75.453084 | 0.39 | NaN | 86.576 |
| 217 | FARM0218 | Central USA | Cotton | 15.90 | 6.13 | 30.71 | 228.05 | 0.7204 | 5.66 | Manual | ... | 06-14-24 | 119 | 3781.43 | SENS0218 | 05-25-24 | 17.636795 | 81.033437 | 0.41 | NaN | 87.278 |
| 458 | FARM0459 | Central USA | Cotton | 41.86 | 6.99 | 29.50 | 213.48 | 0.7925 | 9.80 | NaN | ... | 04-14-24 | 95 | 2445.53 | SENS0459 | 02-22-24 | 28.514530 | 88.744213 | 0.75 | NaN | 85.100 |
| 191 | FARM0192 | Central USA | Cotton | 33.16 | 6.82 | 20.40 | 201.41 | 0.4686 | 8.98 | Drip | ... | 04-16-24 | 97 | 5139.04 | SENS0192 | 03-13-24 | 14.966167 | 73.994988 | 0.46 | Mild | 68.720 |
| 37 | FARM0038 | Central USA | Cotton | 13.99 | 5.63 | 24.83 | 194.26 | 0.7432 | 4.91 | Manual | ... | 06-04-24 | 138 | 3664.70 | SENS0038 | 03-15-24 | 29.392338 | 77.607561 | 0.85 | Moderate | 76.694 |
10 rows × 23 columns
- The sort order can be set for each column by passing a list to the ascending parameter.

my_data.sort_values(['region', 'crop_type'], ascending = [True, False], inplace = True)
my_data.head(10)

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 216 | FARM0217 | Central USA | Wheat | 18.77 | 5.89 | 26.61 | 287.88 | 0.5786 | 8.03 | Sprinkler | ... | 04-26-24 | 110 | 3943.44 | SENS0217 | 02-18-24 | 25.408561 | 76.113510 | 0.65 | Moderate | 79.898 |
| 54 | FARM0055 | Central USA | Wheat | 33.62 | 6.44 | 27.39 | 285.79 | 0.5640 | 7.66 | Drip | ... | 07-19-24 | 131 | 3633.18 | SENS0055 | 04-14-24 | 11.133670 | 70.744243 | 0.90 | NaN | 81.302 |
| 376 | FARM0377 | Central USA | Wheat | 39.12 | 6.53 | 24.79 | 271.35 | 0.6382 | 7.38 | NaN | ... | 05-31-24 | 101 | 3736.42 | SENS0377 | 04-14-24 | 12.323687 | 80.266829 | 0.88 | Mild | 76.622 |
| 296 | FARM0297 | Central USA | Wheat | 30.40 | 6.72 | 25.21 | 261.91 | 0.8263 | 4.37 | Drip | ... | 06-28-24 | 145 | 3128.84 | SENS0297 | 02-11-24 | 15.881029 | 84.044438 | 0.54 | Moderate | 77.378 |
| 251 | FARM0252 | Central USA | Wheat | 15.86 | 6.05 | 17.39 | 247.29 | 0.4045 | 4.25 | NaN | ... | 05-30-24 | 95 | 2994.89 | SENS0252 | 04-28-24 | 12.285039 | 82.372897 | 0.86 | NaN | 63.302 |
| 492 | FARM0493 | Central USA | Wheat | 28.81 | 7.46 | 30.56 | 245.13 | 0.4532 | 8.47 | NaN | ... | 07-27-24 | 128 | 4203.51 | SENS0493 | 07-12-24 | 15.515976 | 75.375870 | 0.65 | Severe | 87.008 |
| 111 | FARM0112 | Central USA | Wheat | 16.25 | 6.57 | 25.58 | 231.96 | 0.5113 | 4.02 | Drip | ... | 07-13-24 | 117 | 4127.73 | SENS0112 | 07-01-24 | 15.741602 | 79.212506 | 0.39 | Mild | 78.044 |
| 481 | FARM0482 | Central USA | Wheat | 24.74 | 6.60 | 31.00 | 228.58 | 0.5624 | 8.59 | NaN | ... | 08-16-24 | 142 | 3555.39 | SENS0482 | 04-24-24 | 33.941965 | 85.854259 | 0.38 | Moderate | 87.800 |
| 315 | FARM0316 | Central USA | Wheat | 14.23 | 5.78 | 23.30 | 224.07 | 0.6767 | 6.63 | Drip | ... | 07-12-24 | 114 | 5110.65 | SENS0316 | 03-22-24 | 31.990674 | 71.614452 | 0.30 | NaN | 73.940 |
| 81 | FARM0082 | Central USA | Wheat | 22.50 | 5.64 | 19.82 | 214.28 | 0.4518 | 7.49 | Manual | ... | 07-31-24 | 142 | 4571.18 | SENS0082 | 07-07-24 | 34.520480 | 79.570623 | 0.41 | Mild | 67.676 |
10 rows × 23 columns
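
The sorting options can be compared on a toy dataframe of four rows (made-up values). The sketch keeps the original dataframe intact by storing each sorted result in a new variable instead of using inplace = True.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["South USA", "Central USA", "Central USA", "South USA"],
    "crop_type": ["Wheat", "Cotton", "Wheat", "Cotton"],
    "rainfall_mm": [65.43, 269.09, 287.88, 113.72],
})

# One column, descending order.
by_rain = df.sort_values("rainfall_mm", ascending=False)

# Two columns: region from A to Z, then crop_type from Z to A within each region.
by_region_crop = df.sort_values(["region", "crop_type"], ascending=[True, False])

print(by_rain)
print(by_region_crop)
```

Sorted rows keep their original index labels, which is why the index column no longer reads 0, 1, 2, ... after sorting.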
| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 216 | FARM0217 | Central USA | Wheat | 18.77 | 5.89 | 26.61 | 287.88 | 0.5786 | 8.03 | Sprinkler | ... | 04-26-24 | 110 | 3943.44 | SENS0217 | 02-18-24 | 25.408561 | 76.113510 | 0.65 | Moderate | 79.898 |
| 54 | FARM0055 | Central USA | Wheat | 33.62 | 6.44 | 27.39 | 285.79 | 0.5640 | 7.66 | Drip | ... | 07-19-24 | 131 | 3633.18 | SENS0055 | 04-14-24 | 11.133670 | 70.744243 | 0.90 | NaN | 81.302 |
| 376 | FARM0377 | Central USA | Wheat | 39.12 | 6.53 | 24.79 | 271.35 | 0.6382 | 7.38 | NaN | ... | 05-31-24 | 101 | 3736.42 | SENS0377 | 04-14-24 | 12.323687 | 80.266829 | 0.88 | Mild | 76.622 |
| 296 | FARM0297 | Central USA | Wheat | 30.40 | 6.72 | 25.21 | 261.91 | 0.8263 | 4.37 | Drip | ... | 06-28-24 | 145 | 3128.84 | SENS0297 | 02-11-24 | 15.881029 | 84.044438 | 0.54 | Moderate | 77.378 |
| 492 | FARM0493 | Central USA | Wheat | 28.81 | 7.46 | 30.56 | 245.13 | 0.4532 | 8.47 | NaN | ... | 07-27-24 | 128 | 4203.51 | SENS0493 | 07-12-24 | 15.515976 | 75.375870 | 0.65 | Severe | 87.008 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | FARM0062 | South USA | Cotton | 35.54 | 6.13 | 21.25 | 113.72 | 0.8768 | 8.12 | NaN | ... | 04-29-24 | 117 | 2386.26 | SENS0062 | 04-06-24 | 19.316171 | 72.602309 | 0.34 | Moderate | 70.250 |
| 462 | FARM0463 | South USA | Cotton | 20.47 | 7.13 | 33.25 | 89.80 | 0.4108 | 6.35 | NaN | ... | 06-19-24 | 115 | 3564.25 | SENS0463 | 05-02-24 | 28.168865 | 75.647282 | 0.85 | Severe | 91.850 |
| 396 | FARM0397 | South USA | Cotton | 14.53 | 6.91 | 32.27 | 79.51 | 0.5063 | 6.84 | Sprinkler | ... | 06-12-24 | 144 | 3634.48 | SENS0397 | 02-19-24 | 19.239649 | 75.791812 | 0.61 | NaN | 90.086 |
| 310 | FARM0311 | South USA | Cotton | 24.14 | 6.96 | 31.25 | 67.52 | 0.5129 | 6.31 | NaN | ... | 06-23-24 | 133 | 3211.31 | SENS0311 | 02-25-24 | 25.806813 | 89.176478 | 0.35 | Moderate | 88.250 |
| 449 | FARM0450 | NaN | Rice | 39.04 | 6.01 | 21.04 | 291.92 | 0.7292 | 6.30 | Manual | ... | 07-01-24 | 95 | 2437.10 | SENS0450 | 06-09-24 | 29.417278 | 76.887856 | 0.39 | Mild | 69.872 |
380 rows × 23 columns
| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 460 | FARM0461 | South USA | Wheat | 28.84 | 6.42 | 17.89 | 285.72 | 0.8186 | 7.00 | NaN | ... | 06-04-24 | 132 | 5396.51 | SENS0461 | 02-01-24 | 24.972008 | 76.177829 | 0.89 | NaN | 64.202 |
| 443 | FARM0444 | South USA | Wheat | 43.38 | 5.60 | 34.84 | 284.57 | 0.4628 | 5.04 | Sprinkler | ... | 06-22-24 | 119 | 3245.85 | SENS0444 | 05-26-24 | 14.938407 | 78.480336 | 0.66 | Moderate | 94.712 |
| 127 | FARM0128 | South USA | Wheat | 20.21 | 6.28 | 16.69 | 275.28 | 0.8526 | 9.87 | Sprinkler | ... | 04-27-24 | 109 | 3073.63 | SENS0128 | 01-27-24 | 11.581679 | 78.693525 | 0.55 | Severe | 62.042 |
| 2 | FARM0003 | South USA | Wheat | 29.32 | 7.16 | 27.37 | 265.43 | 0.6887 | 8.23 | Drip | ... | 06-26-24 | 144 | 2931.16 | SENS0003 | 02-28-24 | 19.503156 | 79.068206 | 0.80 | Mild | 81.266 |
| 276 | FARM0277 | South USA | Wheat | 18.75 | 6.88 | 33.14 | 249.12 | 0.7592 | 4.74 | Drip | ... | 06-05-24 | 123 | 4829.12 | SENS0277 | 05-06-24 | 29.776665 | 80.233329 | 0.87 | Severe | 91.652 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | FARM0062 | South USA | Cotton | 35.54 | 6.13 | 21.25 | 113.72 | 0.8768 | 8.12 | NaN | ... | 04-29-24 | 117 | 2386.26 | SENS0062 | 04-06-24 | 19.316171 | 72.602309 | 0.34 | Moderate | 70.250 |
| 462 | FARM0463 | South USA | Cotton | 20.47 | 7.13 | 33.25 | 89.80 | 0.4108 | 6.35 | NaN | ... | 06-19-24 | 115 | 3564.25 | SENS0463 | 05-02-24 | 28.168865 | 75.647282 | 0.85 | Severe | 91.850 |
| 340 | FARM0341 | South USA | Cotton | 21.91 | 7.32 | 17.05 | 88.64 | 0.5106 | 5.13 | Drip | ... | 05-25-24 | 142 | 2524.93 | SENS0341 | 01-08-24 | 21.987956 | 76.231469 | 0.52 | Severe | 62.690 |
| 396 | FARM0397 | South USA | Cotton | 14.53 | 6.91 | 32.27 | 79.51 | 0.5063 | 6.84 | Sprinkler | ... | 06-12-24 | 144 | 3634.48 | SENS0397 | 02-19-24 | 19.239649 | 75.791812 | 0.61 | NaN | 90.086 |
| 310 | FARM0311 | South USA | Cotton | 24.14 | 6.96 | 31.25 | 67.52 | 0.5129 | 6.31 | NaN | ... | 06-23-24 | 133 | 3211.31 | SENS0311 | 02-25-24 | 25.806813 | 89.176478 | 0.35 | Moderate | 88.250 |
93 rows × 23 columns
- Conditions can be combined with & (meaning and), or | (meaning or).

my_data[ (my_data["temperature_Celsius"] > 20 ) & (my_data["sunlight_hours"] > 7) ].head()
my_data.loc[(my_data['temperature_Celsius'] > 20) & (my_data["sunlight_hours"] > 7), :].head()
# the two syntaxes will return the same result

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 216 | FARM0217 | Central USA | Wheat | 18.77 | 5.89 | 26.61 | 287.88 | 0.5786 | 8.03 | Sprinkler | ... | 04-26-24 | 110 | 3943.44 | SENS0217 | 02-18-24 | 25.408561 | 76.113510 | 0.65 | Moderate | 79.898 |
| 54 | FARM0055 | Central USA | Wheat | 33.62 | 6.44 | 27.39 | 285.79 | 0.5640 | 7.66 | Drip | ... | 07-19-24 | 131 | 3633.18 | SENS0055 | 04-14-24 | 11.133670 | 70.744243 | 0.90 | NaN | 81.302 |
| 376 | FARM0377 | Central USA | Wheat | 39.12 | 6.53 | 24.79 | 271.35 | 0.6382 | 7.38 | NaN | ... | 05-31-24 | 101 | 3736.42 | SENS0377 | 04-14-24 | 12.323687 | 80.266829 | 0.88 | Mild | 76.622 |
| 492 | FARM0493 | Central USA | Wheat | 28.81 | 7.46 | 30.56 | 245.13 | 0.4532 | 8.47 | NaN | ... | 07-27-24 | 128 | 4203.51 | SENS0493 | 07-12-24 | 15.515976 | 75.375870 | 0.65 | Severe | 87.008 |
| 481 | FARM0482 | Central USA | Wheat | 24.74 | 6.60 | 31.00 | 228.58 | 0.5624 | 8.59 | NaN | ... | 08-16-24 | 142 | 3555.39 | SENS0482 | 04-24-24 | 33.941965 | 85.854259 | 0.38 | Moderate | 87.800 |
5 rows × 23 columns
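
Combining two conditions can be sketched on a toy dataframe of five rows (made-up values). The intermediate variables warm and sunny are names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "temperature_Celsius": [17.79, 30.18, 27.37, 33.73, 15.00],
    "sunlight_hours": [7.27, 5.67, 8.23, 5.03, 5.00],
})

warm = df["temperature_Celsius"] > 20  # one True/False value per row
sunny = df["sunlight_hours"] > 7

both = df[warm & sunny]    # rows where both conditions are True
either = df[warm | sunny]  # rows where at least one condition is True

print(both)
print(either)
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because & and | are evaluated before comparison operators such as >.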
- Values in the rows matching a condition can be modified with .loc.

my_data.loc[my_data['region'] == 'North India', ['region']] = 'India_North'
my_data.loc[my_data['region'] == 'India_North', :].head()

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 366 | FARM0367 | India_North | Wheat | 42.31 | 6.79 | 27.53 | 276.71 | 0.8871 | 5.19 | Sprinkler | ... | 06-05-24 | 106 | 2597.00 | SENS0367 | 04-07-24 | 14.253072 | 81.344858 | 0.31 | Severe | 81.554 |
| 260 | FARM0261 | India_North | Wheat | 26.11 | 5.81 | 20.30 | 272.41 | 0.5249 | 5.54 | Manual | ... | 07-21-24 | 136 | 2308.81 | SENS0261 | 05-29-24 | 29.822605 | 73.458050 | 0.80 | NaN | 68.540 |
| 112 | FARM0113 | India_North | Wheat | 38.33 | 6.34 | 30.32 | 270.94 | 0.4078 | 5.24 | Drip | ... | 08-04-24 | 135 | 5488.85 | SENS0113 | 05-23-24 | 28.513527 | 78.045307 | 0.44 | Mild | 86.576 |
| 392 | FARM0393 | India_North | Wheat | 28.81 | 6.28 | 29.38 | 269.97 | 0.6602 | 7.24 | Sprinkler | ... | 06-07-24 | 111 | 5028.19 | SENS0393 | 03-08-24 | 10.585544 | 87.806387 | 0.62 | Moderate | 84.884 |
| 314 | FARM0315 | India_North | Wheat | 27.40 | 7.10 | 19.41 | 251.11 | 0.6131 | 8.87 | NaN | ... | 07-21-24 | 124 | 2549.32 | SENS0315 | 07-04-24 | 34.117310 | 74.264637 | 0.33 | Mild | 66.938 |
5 rows × 23 columns
my_data.loc[my_data['region'] == 'South India', ['region']] = 'India_South'
my_data.loc[my_data['region'] == 'India_South', :].head()

| | farm_id | region | crop_type | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | FARM0299 | India_South | Wheat | 14.80 | 7.11 | 32.20 | 273.40 | 0.7477 | 9.58 | Sprinkler | ... | 07-23-24 | 118 | 3538.86 | SENS0299 | 04-24-24 | 14.644776 | 82.091465 | 0.65 | Mild | 89.960 |
| 198 | FARM0199 | India_South | Wheat | 26.07 | 7.10 | 23.96 | 264.15 | 0.6235 | 4.71 | NaN | ... | 05-08-24 | 116 | 2143.33 | SENS0199 | 03-23-24 | 10.004243 | 71.817911 | 0.66 | Moderate | 75.128 |
| 278 | FARM0279 | India_South | Wheat | 31.79 | 6.01 | 24.17 | 263.85 | 0.6718 | 4.03 | NaN | ... | 07-13-24 | 119 | 3640.61 | SENS0279 | 03-26-24 | 25.030414 | 70.131460 | 0.83 | Moderate | 75.506 |
| 58 | FARM0059 | India_South | Wheat | 33.14 | 5.55 | 15.30 | 247.50 | 0.5190 | 5.94 | Sprinkler | ... | 07-15-24 | 123 | 2454.60 | SENS0059 | 03-22-24 | 21.906149 | 85.560341 | 0.61 | NaN | 59.540 |
| 69 | FARM0070 | India_South | Wheat | 15.13 | 5.89 | 27.05 | 240.05 | 0.7278 | 5.06 | NaN | ... | 07-26-24 | 133 | 5696.62 | SENS0070 | 06-05-24 | 31.606172 | 82.544348 | 0.39 | NaN | 80.690 |
5 rows × 23 columns
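The conditional replacement above can be checked on a small table. This is a minimal sketch: the `farms` dataframe and its values are made up for illustration and are not the course dataset.

```python
import pandas as pd

# Toy data: the farm_id and region values are invented for this example
farms = pd.DataFrame({
    "farm_id": ["F1", "F2", "F3", "F4"],
    "region": ["South India", "India_North", "South India", "East Africa"],
})

# Boolean mask: True for the rows whose region must be renamed
mask = farms["region"] == "South India"

# Overwrite the 'region' cell of the selected rows only
farms.loc[mask, "region"] = "India_South"

# Count the rows per region to check that the old label is gone
region_counts = farms["region"].value_counts()
print(region_counts)
```

Counting the values before and after the assignment is a quick way to confirm that only the intended rows were changed.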
The .groupby() method splits the rows into groups. You can then perform a calculation on each group in order to return a single value per group.
| region | crop_type | farm_id | soil_moisture | soil_pH | temperature_Celsius | rainfall_mm | humidity | sunlight_hours | irrigation_type | fertilizer_type | pesticide_usage_ml | ... | harvest_date | total_days | yield_kg_per_hectare | sensor_id | timestamp | latitude | longitude | NDVI_index | crop_disease_status | temperature_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Central USA | Cotton | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 17 | 26 | 26 | ... | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 21 | 26 |
| | Maize | 21 | 20 | 21 | 21 | 20 | 20 | 21 | 17 | 21 | 21 | ... | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 14 | 21 |
| | Rice | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 13 | 17 | 18 | ... | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 13 | 18 |
| | Soybean | 26 | 26 | 25 | 26 | 26 | 26 | 26 | 20 | 26 | 26 | ... | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 17 | 26 |
| | Wheat | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 12 | 17 | 17 | ... | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 13 | 17 |
| East Africa | Cotton | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 17 | 24 | 24 | ... | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 20 | 24 |
| | Maize | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 16 | 24 | 24 | ... | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 15 | 24 |
| | Rice | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 15 | 20 | 20 | ... | 20 | 20 | 19 | 20 | 20 | 20 | 20 | 20 | 17 | 20 |
| | Soybean | 24 | 24 | 24 | 24 | 24 | 23 | 24 | 18 | 24 | 24 | ... | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 20 | 24 |
| | Wheat | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 11 | 15 | 15 | ... | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 11 | 15 |
| India_North | Cotton | 18 | 17 | 18 | 18 | 18 | 18 | 18 | 9 | 18 | 18 | ... | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 15 | 18 |
| | Maize | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 15 | 24 | 24 | ... | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 19 | 24 |
| | Rice | 18 | 17 | 18 | 18 | 18 | 18 | 18 | 14 | 18 | 18 | ... | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 14 | 18 |
| | Soybean | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 14 | 18 | 18 | ... | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 13 | 18 |
| | Wheat | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 9 | 20 | 20 | ... | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 16 | 20 |
| India_South | Cotton | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 16 | 20 | 20 | ... | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 10 | 20 |
| | Maize | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 17 | 20 | 21 | ... | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 14 | 21 |
| | Rice | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 2 | 6 | 6 | ... | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| | Soybean | 22 | 22 | 22 | 21 | 22 | 22 | 22 | 14 | 22 | 22 | ... | 22 | 22 | 22 | 22 | 22 | 22 | 22 | 22 | 18 | 21 |
| | Wheat | 21 | 21 | 21 | 21 | 21 | 20 | 21 | 13 | 21 | 21 | ... | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 16 | 21 |
| South USA | Cotton | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 14 | 19 | 19 | ... | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 15 | 19 |
| | Maize | 21 | 21 | 20 | 21 | 21 | 21 | 21 | 15 | 21 | 21 | ... | 21 | 21 | 21 | 20 | 21 | 21 | 21 | 21 | 13 | 21 |
| | Rice | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 10 | 17 | 17 | ... | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 9 | 17 |
| | Soybean | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 12 | 17 | 17 | ... | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 13 | 17 |
| | Wheat | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 16 | 19 | 19 | ... | 19 | 19 | 19 | 19 | 19 | 19 | 18 | 19 | 15 | 19 |
25 rows × 21 columns
region crop_type
Central USA Cotton 26
Maize 21
Rice 18
Soybean 26
Wheat 17
East Africa Cotton 24
Maize 24
Rice 20
Soybean 24
Wheat 15
India_North Cotton 18
Maize 24
Rice 18
Soybean 18
Wheat 20
India_South Cotton 20
Maize 21
Rice 6
Soybean 22
Wheat 21
South USA Cotton 19
Maize 21
Rice 17
Soybean 17
Wheat 19
Name: farm_id, dtype: int64
| region | temperature_Celsius (min) | temperature_Celsius (max) | rainfall_mm (sum) |
|---|---|---|---|
| Central USA | 15.04 | 34.09 | 19014.33 |
| East Africa | 15.01 | 34.33 | 19734.47 |
| India_North | 15.64 | 34.52 | 18636.89 |
| India_South | 15.23 | 33.78 | 16972.94 |
| South USA | 15.11 | 34.84 | 15824.96 |
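The aggregated tables above can be reproduced in miniature. The sketch below uses a made-up `weather` dataframe (hypothetical values, with one missing rainfall measurement) to show how a dictionary passed to `.agg()` applies different operations to different columns, and why per-column counts can differ when values are missing.

```python
import pandas as pd

# Toy data: values are invented; None becomes a missing value (NaN)
weather = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "temperature_Celsius": [15.0, 30.0, 20.0, 25.0, 18.0],
    "rainfall_mm": [100.0, 50.0, 10.0, 20.0, None],
})

# One dictionary entry per column: a list of operations or a single one
summary = weather.groupby("region").agg(
    {"temperature_Celsius": ["min", "max"], "rainfall_mm": "sum"}
)
print(summary)

# The result has two-level column names: (column, operation)
max_temp_a = summary.loc["A", ("temperature_Celsius", "max")]

# count() ignores missing values, size() counts every row of the group
rain_counts = weather.groupby("region")["rainfall_mm"].count()
group_sizes = weather.groupby("region").size()
print(rain_counts["B"], group_sizes["B"])
```

In region B, the missing rainfall value is skipped by both `sum` and `count()`, while `size()` still reports three rows.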
- .min(): compute min of group values
- .max(): compute max of group values
- .mean(): compute mean of group values
- .count(): compute count of group, excluding missing values
- .describe(): generate descriptive statistics for each numeric column
- .head(n): return the first n rows in each group
- .tail(n): return the last n rows in each group
- .size(): compute group sizes
Use to_csv() to export a dataframe to a tabulated file.
my_data.to_csv('path_to_output_file')
- header = True: the header will be printed
- index = False: the index will not be printed
- sep = ',': the columns will be separated by a comma (,)
DataFrames are objects used to store tables of data. They can be initialised:
- from a dictionary: pandas.DataFrame(my_dict)
- from a file: pandas.read_csv("my_tabulated_file")
Unlike nested lists, columns are identified by a name and must contain only one data type.
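The export options and the two ways of creating a dataframe can be tried together in a short round trip. This is a sketch under assumptions: the dictionary content is invented, and an in-memory text buffer (`io.StringIO`) stands in for a real file path so that nothing is written to disk.

```python
import io
import pandas as pd

# Create a dataframe from a dictionary (keys become column names)
my_dict = {"sample": ["s1", "s2"], "value": [1.5, 2.5]}
my_df = pd.DataFrame(my_dict)

# Export with a header, without the index, using a comma as separator
buffer = io.StringIO()  # stand-in for 'path_to_output_file'
my_df.to_csv(buffer, header=True, index=False, sep=",")
csv_text = buffer.getvalue()
print(csv_text)

# Read the exported text back into a new dataframe
reloaded = pd.read_csv(io.StringIO(csv_text), sep=",")
print(reloaded)
```

With `index = False`, the row numbers are not written, so the reloaded table has the same columns as the original.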
You can view a subset of the data in several ways:
- view the first lines with my_df.head(), last lines with my_df.tail(), generate statistics with my_df.describe()
- select one column with my_df['column_1'] or several columns with my_df[['column_1', 'column_2']]
- select cells by position with my_df.iloc[row_index, column_index]
- select cells by condition with my_df.loc[my_df['column_1'] == value, ['column_2', 'column_3']]
Add or overwrite a column with my_df['column_name'] = value
Remove a column with my_data = my_df.drop(columns='column_name'), del my_df['column_name'] or my_df.pop('column_name')
Rename a column with my_df.rename(columns={'old name': 'new name'}, inplace = True) or my_df = my_df.rename(columns={'old name': 'new name'})
Sort rows with my_df.sort_values('column_1', ascending = True, inplace = True) or my_df = my_df.sort_values(['column_1', 'column_2'], ascending = [True, False])
To filter a dataframe you can use:
- my_df[my_df['column_1'] > value]
- my_df.loc[my_df['column_1'] > value, ['column_3', 'column_4']]
- my_df[ (my_df['column_1'] > value) & (my_df['column_2'] < other_value) ]
- my_df.loc[(my_df['column_1'] > value) & (my_df['column_2'] < other_value), ['column_3', 'column_4']]
To modify certain cells in a column depending on their value, you can do:
my_df.loc[my_df['column_1'] == old_value, ['column_1']] = new_value
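The filtering patterns summarised above can be sketched on a toy table. The column names and values below are hypothetical; the point is the shape of the expressions, in particular the parentheses around each condition, which are needed because `&` is evaluated before `>` and `<`.

```python
import pandas as pd

# Toy data invented for this example
my_df = pd.DataFrame({
    "crop": ["Wheat", "Rice", "Maize", "Wheat"],
    "soil_pH": [6.8, 5.5, 7.1, 6.0],
    "yield_kg": [2500, 4100, 3900, 5200],
})

# One condition: keep the rows with a yield above 3000
high_yield = my_df[my_df["yield_kg"] > 3000]

# Two conditions combined with & (and), keeping only two columns
selected = my_df.loc[
    (my_df["yield_kg"] > 3000) & (my_df["soil_pH"] < 7),
    ["crop", "yield_kg"],
]
print(selected)
```

The filtered dataframes keep the original row labels, so `selected` still refers to rows 1 and 3 of `my_df`.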
An aggregation allows you to group your data according to one or several columns and perform one or several operations on other columns. For instance:
- my_df.groupby('column_1')['column_2'].sum()
- my_df.groupby(['column_1', 'column_2']).count()
- my_df.groupby('column_1').agg({'temperature_Celsius':['min', 'max'], 'column_3': 'sum'})
Please open file 009_practical_dataframes.ipynb
Two plotting libraries are presented here: matplotlib and seaborn.
- matplotlib is one of the most used Python data visualisation libraries.
- seaborn is based on matplotlib and provides additional, higher-level features.
- matplotlib can be installed with pip install matplotlib.
- seaborn can be installed with pip install seaborn.
import random
import matplotlib.pyplot as plt
x = range(1, 11)
y = [100 * round(random.random(), 2) for i in range(1, 11)]
z = [100 * round(random.random(), 2) for i in range(1, 11)]
plt.figure(figsize=(10, 3)) # configure plot size
plt.plot(x, y, label='y list', linewidth=4) # add a label and change the default line width
plt.plot(x, z, label='z list', linewidth=4, linestyle='--', color='purple') # change the default type and color
plt.xlabel('Title for x axis', fontsize=12) # add a label for x axis
plt.ylabel('Title for y axis', fontsize=12) # add a label for y axis
plt.legend(loc='upper right') # add a legend and fix its position in upper right corner
plt.grid(color='gray', linewidth=0.5) # add a grid
plt.title('A more customised plot line') # add a title
plt.show()
The penguins dataset is a good dataset for data exploration and visualisation.
It is available through seaborn.
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Artwork by @allison_horst
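With matplotlib:
import seaborn as sns
# assumed loading step (not shown here): penguins must be defined before plotting
penguins = sns.load_dataset('penguins')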
import matplotlib.pyplot as plt
# configure plot size
plt.figure(figsize=(10, 4))
plt.scatter(penguins['flipper_length_mm'], penguins['body_mass_g'])
# label for x axis
plt.xlabel('Flipper length (mm)', fontsize=12)
# label for y axis
plt.ylabel('Body mass (g)', fontsize=12)
# plot title
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()
All species are mixed together.
With matplotlib:
for species in penguins['species'].unique():
df = penguins.loc[penguins['species'] == species, :]
plt.scatter(df['flipper_length_mm'], df['body_mass_g'], label=species)
plt.xlabel('Flipper length (mm)', fontsize=12)
plt.ylabel('Body mass (g)', fontsize=12)
plt.legend() # add a legend based on 'label' parameter in plt.scatter
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()
We have to loop over all species.
With seaborn:
plt.figure(figsize=(9, 3.5))
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species')
plt.xlabel('Flipper length (mm)', fontsize=12)
plt.ylabel('Body mass (g)', fontsize=12)
plt.title('Body mass as a function of flipper length', size=16, color='red')
plt.show()
By passing a column name to the hue argument, seaborn automatically assigns a color to each distinct value of that column.
With matplotlib:
island = penguins['island'].value_counts() # island is a Series
# values can be accessed with island.values, indexes with island.index
plt.pie(x=island.values, labels=island.index, autopct='%.2f')
plt.title('Islands', size=16, color='#DAA520')
plt.show()
seaborn does not provide a pie chart function.
To create one, we must therefore use matplotlib’s pie function, to which seaborn’s graphic styles (themes) can be applied.
With matplotlib:
flipper_mean = penguins.groupby('species')['flipper_length_mm'].mean() # flipper_mean is a Series
# values can be accessed with flipper_mean.values, indexes with flipper_mean.index
plt.bar(height=flipper_mean.values, x=flipper_mean.index)
plt.title('Flipper Length for 3 Penguin Species', size=16, color='orange')
plt.show()
With seaborn:
sns.barplot(x='species', y='flipper_length_mm', data=penguins)
plt.title('Flipper Length for 3 Penguin Species', size=16, color='orange')
plt.show()
seaborn automatically calculates the mean of the y variable for each category.
With seaborn:
# extract numeric columns from penguins dataframe
penguins_numeric = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
# corr() calculates the correlation between variables
sns.heatmap(penguins_numeric.corr(), annot = True)
plt.title('Correlation between numeric variables', size=16, color='darkviolet')
plt.show()
- matplotlib gallery: https://matplotlib.org/stable/gallery/index.html
- seaborn official website: https://seaborn.pydata.org/
- seaborn gallery:
matplotlib and seaborn are the most widely used Python packages for plotting graphs.
Plots can be customised with functions such as xlabel(), ylabel(), legend(), title()…
Please open file 010_practical_plots.ipynb
NameError: You may have forgotten to define a variable and you are trying to access it.
SyntaxError: You may have forgotten a character such as a parenthesis (), a comma (,) or a colon (:).
The ^ symbol in the error message points to where the problem was detected.
TypeError: You may be trying to perform an operation on, or apply a function to, an object of the wrong type.
ValueError: You may have passed an object of the right type to a function, but its value is invalid.
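The difference between the last two errors is easier to see side by side. The sketch below is only an illustration: it uses try / except to catch each error so that the script keeps running, which is an addition for demonstration purposes.

```python
# Wrong type: a string and an integer cannot be added together
try:
    result = "abc" + 5
except TypeError as error:
    type_error_message = str(error)
    print("TypeError:", type_error_message)

# Right type (a string), but its content cannot be read as an integer
try:
    number = int("abc")
except ValueError as error:
    value_error_message = str(error)
    print("ValueError:", value_error_message)

# Same function, same type, valid value: no error
number = int("42")
print(number)
```

In both failing cases the last line of the error message names the error type, which tells you whether to check the type of the object or its content.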
IndexError: You may be trying to access an element in a list that is outside the valid range.
KeyError: You may be trying to access an element in a dictionary that doesn’t exist.
Use the .get() method to check your keys.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[258], line 2
      1 my_dict = {"Laurène":0, "Thomas":1, "Isabelle":2, "Benjamin":3}
----> 2 print(my_dict["Lauraine"])

KeyError: 'Lauraine'
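As a minimal sketch of that advice, the same dictionary can be queried with .get(), which returns None (or a default value of your choice) instead of raising a KeyError. The default value -1 is an arbitrary choice for this example.

```python
my_dict = {"Laurène": 0, "Thomas": 1, "Isabelle": 2, "Benjamin": 3}

# .get() returns None when the key does not exist
missing = my_dict.get("Lauraine")

# A second argument sets the value returned for a missing key
fallback = my_dict.get("Lauraine", -1)

# An existing key returns its value as usual
present = my_dict.get("Laurène")
print(missing, fallback, present)

# The 'in' operator tests for a key before accessing it
if "Lauraine" not in my_dict:
    print("Unknown key; available keys:", list(my_dict))
```

Printing the available keys often reveals a simple spelling difference, as with "Lauraine" and "Laurène" here.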
IndentationError: You may have forgotten to indent a part of your code.
AttributeError: You may have used the wrong method for an object.
Check the method documentation.
FileNotFoundError: The file you are trying to access does not exist, is in a different folder, or the file path is wrong.
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[261], line 1
----> 1 my_file = open("my_file.txt", "r")

File /usr/lib/python3/dist-packages/IPython/core/interactiveshell.py:310, in _modified_open(file, *args, **kwargs)
    303 if file in {0, 1, 2}:
    304     raise ValueError(
    305         f"IPython won't let you open fd={file} by default "
    306         "as it is likely to crash IPython. If you know what you are doing, "
    307         "you can use builtins' open."
    308     )
--> 310 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'my_file.txt'
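One way to investigate this error is to check where Python is looking for the file. The sketch below assumes the standard pathlib module and reuses the file name from the traceback; it prints the current working directory, which is the folder against which relative paths are resolved.

```python
from pathlib import Path

file_path = Path("my_file.txt")  # same file name as in the traceback

# Relative paths are resolved from the current working directory
print("Working directory:", Path.cwd())

if file_path.exists():
    # The file is there: open and read it
    with open(file_path, "r") as my_file:
        content = my_file.read()
else:
    # The file is missing: report where it was expected
    content = None
    print(f"{file_path} was not found in {Path.cwd()}")
```

If the working directory is not the one you expected, either move the file or give its full path.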
ModuleNotFoundError: You may have forgotten to install the package before importing it, or you may have made a mistake when typing its name.
Install the missing package with pip install.
If you don’t have any ideas for a program or analysis to implement, you can choose from the following options:
Read the doc!
Practise!
Do not reinvent the wheel: use existing tools
Use AI assistants with caution! (copy-pasting will not always work)
Fabien KON-SUN-TACK
Former Bilille engineer who worked on this training.
In order to help us improve our training, we would be grateful if you could take a few minutes to complete the following satisfaction survey.
(You can answer in English or French.)
Comments
Only write relevant comments.