The objective of this introductory course to Machine Learning is to provide students with a fundamental understanding and an extensive practical experience of how to extract knowledge from data.
Course goals:
Understand the basic principles behind Machine Learning.
Gain practical experience with the most relevant Machine Learning algorithms.
Have well-formed criteria to choose the most appropriate technique for a given application.
The following diagram attempts to describe the major problem families addressed by Machine Learning. We will next briefly describe some of these families that we will meet in the course.
Supervised Learning
The aim of Supervised Learning is to learn an input-output mapping from a “labelled” dataset. \[Y \sim f(X_1,\ldots, X_n)\] where
\(Y\) is the output or response variable.
\(X_i\) are the input or explanatory variables.
\(f\) is a function that approximates the output as a function of the inputs.
Applications:
Classification
Regression
Forecasting
Supervised Learning: Classification
In classification problems we have a categorical output variable \(Y\). That is, \(Y\) is a factor that divides the data into classes (levels).
Supervised Learning: Regression
In regression problems we have a numerical output variable \(Y\).
Supervised Learning: Forecasting
In forecasting problems we use past values of the output variable \(Y\) to predict its future values. \[\underbrace{Y_t}_{\text{Output at time }t} \sim \underbrace{f(Y_{t-1}, Y_{t-2},\,\ldots)}_{\text{Output at previous time values}}\]
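To make the forecasting formula concrete, the following minimal sketch builds the lagged inputs \(Y_{t-1}\) and \(Y_{t-2}\) for a short series with the pandas shift method. The numbers and the column names y_lag1 and y_lag2 are made up for illustration only.

```python
import pandas as pd

# Made-up series of six consecutive observations of Y
y = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0], name="y")

# Each row pairs Y_t with its two previous values Y_{t-1} and Y_{t-2}
lagged = pd.DataFrame({
    "y": y,
    "y_lag1": y.shift(1),
    "y_lag2": y.shift(2),
})

# The first two rows lack a complete history, so they are dropped
lagged = lagged.dropna()
print(lagged)
```

Each remaining row is one training example: the lag columns play the role of the inputs and y is the output to predict.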
Note: If you have started there, and you have completed the setup, then you can continue below. Just make sure that you use the select kernel button (top right corner of this window) to select the correct Python environment for this notebook, as described in the setup guide.
Python Setup Checks
To test that your Python setup works, please execute the remaining Python cells in this document. Begin by checking the output of the following commands. It should be /opt/conda/envs/mlmiin/bin/python.
import sys
print(sys.executable)
/opt/conda/envs/mlmiin/bin/python
Check the Basic Libraries for Data Science with Python
Check the versions and make sure that no errors appear in the output of the code cells below.
First check the Python version you are using (it should be 3.11.14).
print("Python version:", sys.version)
Python version: 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:39:18) [GCC 14.3.0]
Note about standard imports
Many Python libraries have standard import names. These are not official names, but we will require their use, to make your code readable and compatible. For example, NumPy should always be imported as np.
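As a small sketch of this convention, the cell below imports NumPy and pandas under their usual aliases np and pd and prints their versions. The version numbers you see depend on your environment, so none are shown here.

```python
# Conventional aliases for NumPy and pandas
import numpy as np
import pandas as pd

print("NumPy version:", np.__version__)
print("pandas version:", pd.__version__)
```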
# The standard import for Matplotlib is
import matplotlib as mpl
# But we will more frequently use this import sequence
import matplotlib.pyplot as plt
mpl.__version__
'3.10.8'
# Standard import for Seaborn
import seaborn as sns
sns.__version__
# Standard import for Scikit-learn
import sklearn as sk
sk.__version__
'1.7.2'
Working with Pandas DataFrame Objects.
The Titanic Dataset
For our first example we will use a data set called titanic, available in the Seaborn library that you should have already installed. The data set contains information about the passengers of the Titanic, such as their age, gender, the class in which they were traveling, whether they survived the sinking of the ship, etc. We will soon see how to use Python to read data from different sources: CSV and Excel files, URLs, databases, APIs, etc. But for now we just want to run some tests and get an overview of the data structures we will be working with.
As you can see, titanic is now a Pandas DataFrame. Alongside NumPy arrays, this is the type of data object that we will meet most frequently when working with data.
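Seaborn provides this table through sns.load_dataset("titanic"), which downloads the data the first time it is called and therefore needs an internet connection. As an offline sketch, the cell below instead builds a small stand-in by hand from the first four passengers, keeping only some of the columns. The name titanic_small is an illustrative assumption, not part of the course code.

```python
import pandas as pd

# With Seaborn and an internet connection the full table can be loaded with:
#   import seaborn as sns
#   titanic = sns.load_dataset("titanic")

# Hand-built stand-in: first four passengers, a subset of the columns
titanic_small = pd.DataFrame({
    "survived": [0, 1, 1, 1],
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "female"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

print(type(titanic_small))
```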
First Look at the Dataset
To see the first lines of the titanic data set we can use the head method. The optional n argument determines the number of rows in the output (the default is n = 5). Note: you will notice some NaN values in the deck column of the table. These are missing data. We will see how to deal with missing data in the course lectures.
titanic.head(4)
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
Exercise 001.
Try running the above with other values of n. Also run it as print(titanic.head(4)). Use the Help menu to open the Reference link and look up the information about the head method for Pandas DataFrames.
Basic properties of a DataFrame
How many rows and columns of data are there in this data set? We get the answer with the shape attribute of the DataFrame:
titanic.shape
(891, 15)
Exercise 002.
What happens if you just use the name of the DataFrame in a notebook?
Apply the Pandas methods info and describe to the Titanic DataFrame.
Also run the commands titanic.size and len(titanic).
Note that size, just like shape, is an attribute, not a method. Also note that len is a function. Make sure that you understand how these concepts differ from each other.
The column names of a DataFrame are stored in an Index object, which we can inspect with titanic.columns. We will learn more about indices later in the course, but we can start with the idea that sometimes the rows of the DataFrame (or groups of rows) can carry information as well. This information is stored in the index, which can be accessed as follows.
titanic.index
RangeIndex(start=0, stop=891, step=1)
In this case the answer indicates that the index is just the sequence of row numbers.
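The two kinds of labels can be compared side by side. The following is a minimal sketch on a hand-built two-column table (toy is a hypothetical stand-in, not the course data): columns holds the column names and index holds the row labels.

```python
import pandas as pd

# Hypothetical two-row stand-in for the titanic table
toy = pd.DataFrame({"age": [22.0, 38.0], "fare": [7.25, 71.2833]})

print(toy.columns)        # an Index object holding the column names
print(toy.index)          # a RangeIndex holding the row numbers
print(list(toy.columns))  # the column names as a plain Python list
```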
Accessing the data
We can get any element in the data table using brackets, the iloc indexer and row/column number pairs.
For example, to get the element in the second row and fourth column we use:
titanic.iloc[1, 3]
np.float64(38.0)
Always keep in mind that Python counts are zero-based!!
The zero index corresponds to the first element in an ordered set.
In the above example, the fourth column corresponds to the age variable of the Titanic passengers. It is often better to refer to variables by their names. We can do so with the loc indexer:
titanic.loc[1, 'age']
np.float64(38.0)
In this dataframe the row indices are numeric (in fact, consecutive zero-based integers), so we can use 1 with loc or iloc to select a row.
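The difference between the two becomes visible when the row labels are not numbers. The sketch below assumes a tiny table with made-up row labels p1, p2, p3: iloc still works by position, while loc needs the label.

```python
import pandas as pd

# Tiny table whose row labels are strings (hypothetical passenger labels)
ages = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0]},
    index=["p1", "p2", "p3"],
)

by_position = ages.iloc[1, 0]     # second row, first column
by_label = ages.loc["p2", "age"]  # row labelled "p2", column "age"
print(by_position, by_label)

# With string labels, the integer 1 is not a valid row label for loc
try:
    ages.loc[1, "age"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```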
Accessing complete columns
Columns in a dataframe can be accessed with brackets and the quoted variable name, or with loc. But with loc we must use a colon : to include all the rows.
The age column can also be accessed as an attribute of the table, with titanic.age. The bracket notation is required if the variable name contains special characters (a space, for example), but the attribute notation can be handy when we want to shorten our code.
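A minimal sketch of the three notations, run on a hand-built stand-in rather than the full titanic table. The column name "fare paid" is hypothetical and was chosen only because it contains a space.

```python
import pandas as pd

# Hand-built stand-in for a few titanic rows
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "female"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare paid": [7.25, 71.2833, 7.925, 53.1],
})

age_brackets = toy["age"]    # brackets and the quoted name
age_loc = toy.loc[:, "age"]  # loc, with : to keep all the rows
age_attr = toy.age           # attribute notation

# All three expressions return the same Series
print(age_brackets.equals(age_loc) and age_loc.equals(age_attr))

# A name with a space cannot be written as an attribute, so brackets are needed
fare_column = toy["fare paid"]
print(fare_column.max())
```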
Dataframe Columns and Pandas Series
What kind of object is one such extracted column? It is not a dataframe with a single column (those exist too) but a Pandas Series object. In Pandas a DataFrame can be considered as a collection of Series (columns) with a common (row) index. Missing this tricky point leads to many mistakes.
type(titanic['age'])
pandas.core.series.Series
We have just said that dataframes with a single column exist. In this case you can turn the resulting Series into one such dataframe as follows.
The loc and iloc indexers can also be used to modify elements of the table. Let us modify the age of that passenger and check the result with head:
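One way to do this conversion, shown here as a sketch on a hand-built stand-in, is the to_frame method of a Series. Whether the course notebook uses this exact call is an assumption.

```python
import pandas as pd

# Hand-built stand-in for a few titanic rows
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

age_series = toy["age"]              # a Series
age_frame = age_series.to_frame()    # a DataFrame with a single column

print(type(age_series).__name__)
print(type(age_frame).__name__)
print(age_frame.shape)
```

The Series has only a length, while the single-column DataFrame has both a number of rows and a number of columns.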
titanic.iloc[1, 3] = 19.0
titanic.head(4)
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  19.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
Exercise 003.
Now use loc instead of iloc to return the age of the passenger to the original 38.0 value and check your work with head.
Run the command type(titanic[['age']]). Note the difference with type(titanic['age']).
Accessing Larger Portions of the DataFrame Through Indexing
If we want to access several rows and/or columns we can use explicit indexing with lists (note that the order is relevant).
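A short sketch of list-based indexing on a hand-built stand-in table (toy is hypothetical). The rows and columns come back in the order given in the lists, with loc using labels and iloc using positions.

```python
import pandas as pd

# Hand-built stand-in for the first titanic rows (subset of columns)
toy = pd.DataFrame({
    "survived": [0, 1, 1, 1],
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "female"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

# Rows 2 and 0 (in that order), columns age and sex (in that order)
subset = toy.loc[[2, 0], ["age", "sex"]]

# The same selection by position: age is column 3 and sex is column 2
same_by_position = toy.iloc[[2, 0], [3, 2]]

print(subset)
print(subset.equals(same_by_position))
```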
In slicing we use the colon : to get consecutive rows or columns.
titanic.loc[:4 , 'survived':'age']
   survived  pclass     sex   age
0         0       3    male  22.0
1         1       1  female  19.0
2         1       3  female  26.0
3         1       1  female  35.0
4         0       3    male  35.0
Warning
Contrary to the usual Python (and NumPy) convention, in Pandas the final values of the indices are included in the output for loc.
You can also use slicing with iloc. However, in this case the final values of the numeric indexes are excluded in the output for iloc.
titanic.iloc[:2 , 0:2]
   survived  pclass
0         0       3
1         1       1
Condition based filtering
One of the most powerful tools in data analysis is the ability to filter only those rows of a data table that meet some condition, usually expressed as a boolean condition. In Pandas we can use loc for this task.
For example, we can filter the titanic DataFrame, keeping only the rows corresponding to female passengers aged 25 or older.
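A minimal sketch of such a filter, assuming a hand-built five-row stand-in instead of the full titanic table. The two conditions are combined with & and passed to loc, and the last line uses iloc to show only the first four columns.

```python
import pandas as pd

# Hand-built stand-in for the first five titanic rows (subset of columns)
toy = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 19.0, 26.0, 35.0, 35.0],
    "class": ["Third", "First", "Third", "First", "Third"],
})

# Each comparison returns a boolean Series with one value per row
is_female = toy["sex"] == "female"
is_25_or_older = toy["age"] >= 25

# Keep the rows where both conditions hold
selected = toy.loc[is_female & is_25_or_older]

# Show only the first four columns of the filtered rows
print(selected.iloc[:, 0:4])
```

Row 1 is excluded because of the age condition and row 4 because of the sex condition, so only rows 2 and 3 remain.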
Note that in the last line of the previous code chunk we have used iloc to select the columns that appear in the output. Always be mindful of the output of your commands! Communication skills are an essential tool in Data Science and Machine Learning!
Grouping
Together with conditional selection, grouping observations provides an essential tool to study the distribution of the variables in the dataset and answer questions about them.
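For example, what was the average fare paid by the female passengers under 25, grouped by the class they were traveling in? Note: the observed=True argument restricts the result to the categories that actually appear in the data. Recent versions of pandas warn when it is omitted for a categorical grouping column such as class, because the default value is scheduled to change, so the argument will probably become unnecessary in future versions.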
class
First     104.930704
Second     20.199113
Third      16.705341
Name: fare, dtype: float64
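The same pattern of filtering, grouping and averaging can be sketched on a hand-built stand-in (toy is hypothetical and far smaller than titanic, so its numbers differ from the output above). The class column is built as a categorical with three levels to show what observed=True changes.

```python
import pandas as pd

# Hand-built stand-in; class is categorical with a level ("Second") that
# does not appear in these four rows
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male"],
    "age": [38.0, 26.0, 35.0, 22.0],
    "fare": [71.2833, 7.925, 53.1, 7.25],
    "class": pd.Categorical(
        ["First", "Third", "First", "Third"],
        categories=["First", "Second", "Third"],
    ),
})

# Filter first, then group by class and average the fare within each group
women = toy.loc[toy["sex"] == "female"]
mean_fare = women.groupby("class", observed=True)["fare"].mean()
print(mean_fare)

# With observed=False every category is listed, even the empty one
all_levels = women.groupby("class", observed=False)["fare"].mean()
print(all_levels)
```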
Plots
We will have plenty of opportunities to learn how to use different types of plots to explore and visualize our data. But to introduce the subject and close this session, we show below how to obtain a simple graph from a pandas DataFrame.
In fact, let us plot a histogram of the age variable (using Matplotlib).
titanic.hist('age', edgecolor='black', linewidth=1.2, color="tan", figsize=(6, 3))
In the next session we will start with the first family of supervised learning models: classification models. We will spend several sessions studying such models, using them to lay the foundations of Machine Learning in general, as well as the Python tools we will use throughout the course.