Chapter 1: Introduction

Machine Learning

Author

F.San Segundo & N.Rodríguez

Published

January 2026

Course Description and Grading

Course description:

The objective of this introductory course to Machine Learning is to provide students with a fundamental understanding and an extensive practical experience of how to extract knowledge from data.

Course goals:

Understand the basic principles behind Machine Learning.
Gain practical experience with the most relevant Machine Learning algorithms.
Have well-formed criteria to choose the most appropriate technique for a given application.

Grading:

This is a summary; see the Course Guide for the official version.

15% Mid-term exam
35% Final exam (\(\geq 4\))
50% Lab
Retake

Schedule.

Abridged version: due dates for tests and assignments to be confirmed.

January to February, before the Midterm (scheduled for the first week of March):

Ch. 1: Introduction
Ch. 2: Supervised Learning: Classification \(\rightarrow\) Assignment 1
Ch. 3: Supervised Learning: Regression

Late February to late March:

Ch. 4: Supervised Learning: Forecasting \(\rightarrow\) Assignment 2

April (Easter in the first week, last lecture day April 22nd)

Ch. 5: Unsupervised Learning

Final Exam (late April or early May)

Final Exam Retake (June)

Introduction to Machine Learning

Motivation for ML

We are in a data-rich but information-poor situation.
In general, decision makers do not have the tools to extract the valuable knowledge embedded in the vast amounts of data.
Data-driven decisions increasingly make the difference between keeping up with competitors or falling further behind.

Tentative Definitions of Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed.
Making computers modify or adapt their actions so that these actions become more accurate.
A machine is said to learn if it is able to take experience and use it so that its performance improves on similar experiences in the future.
TED Talk by Jeremy Howard: The wonderful and terrifying implications of computers that can learn.

Machine Learning Timeline

Source: Machine Learning Timeline (from DEEP LEARNING EXPLAINED by NVIDIA)

The Learning Process

Source: https://imarticus.org/what-is-machine-learning-and-does-it-matter/

Types of Learning

The following diagram attempts to describe the major problem families addressed by Machine Learning. We will next briefly describe some of these families that we will meet in the course.

Supervised Learning

The aim of Supervised Learning is to learn an input-output mapping from a “labelled” dataset. \[Y \sim f(X_1,\ldots, X_n)\] where

\(Y\) is the output or response variable.
\(X_i\) are the input or explanatory variables.
\(f\) is a function that approximates output as a function of inputs.

Applications:

Classification
Regression
Forecasting
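
To make the notation \(Y \sim f(X_1,\ldots, X_n)\) more concrete, here is a minimal sketch with synthetic data. It assumes that \(f\) is linear and fits it by ordinary least squares with NumPy; the data, the coefficient values and the variable names are invented for illustration and are not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, so the run is reproducible

# Synthetic "labelled" dataset: two inputs X_1, X_2 and one output Y
n_obs = 200
X = rng.normal(size=(n_obs, 2))
true_coefs = np.array([2.0, -1.0])   # hypothetical values chosen for this sketch
Y = 0.5 + X @ true_coefs + rng.normal(scale=0.1, size=n_obs)

# Assume f is linear: f(X) = b0 + b1 * X_1 + b2 * X_2, fitted by least squares
design = np.column_stack([np.ones(n_obs), X])
b_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)

# The fitted f can now be used to predict the output for new inputs
x_new = np.array([1.0, 1.0, 0.0])    # intercept term, X_1 = 1, X_2 = 0
y_pred = x_new @ b_hat
print("Estimated coefficients:", b_hat.round(2))
print("Prediction for X_1 = 1, X_2 = 0:", round(float(y_pred), 2))
```

The estimated coefficients should be close to the values used to generate the data, which is what "learning the mapping from labelled data" means in this simple setting.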

Supervised Learning: Classification

In classification problems we have a categorical output variable \(Y\). That is, \(Y\) is a factor that divides the data into classes/levels.

Supervised Learning: Regression

In regression problems we have a numerical output variable \(Y\).

Supervised Learning: Forecasting

In forecasting problems we use past values of the output variable \(Y\) to predict its future values. \[\underbrace{Y_t}_{\text{Output at time }t} \sim \underbrace{f(Y_{t-1}, Y_{t-2},\,\ldots)}_{\text{Output at previous time values}}\]
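
As a sketch of how past values can become the inputs of a forecasting problem, the code below builds the lagged variables \(Y_{t-1}\) and \(Y_{t-2}\) with the Pandas shift method. The series values and the column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical time series: six consecutive observations of Y
y = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0, 15.0], name="Y_t")

# shift(k) moves the values k positions down, so row t holds the value at t - k
lagged = pd.DataFrame({
    "Y_t": y,
    "Y_t-1": y.shift(1),
    "Y_t-2": y.shift(2),
})

# The first two rows have no complete history, so we drop them
lagged = lagged.dropna()
print(lagged)
```

Each remaining row pairs an output value with the two values that preceded it, which is the tabular form that a supervised learning model expects.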

Software setup

Detailed Setup Guide

Follow the instructions here.

Note: If you have started there, and you have completed the setup, then you can continue below. Just make sure that you use the select kernel button (top right corner of this window) to select the correct Python environment for this notebook, as described in the setup guide.

Python Setup Checks

To test that your Python setup is working, please execute the remaining Python cells in this document. Begin by checking the output of the following commands; it should be /opt/conda/envs/mlmiin/bin/python.

import sys
print(sys.executable)

/opt/conda/envs/mlmiin/bin/python

Check the Basic Libraries for Data Science with Python

Check the versions and make sure that no errors appear in the output of the code cells below.

First check the Python version you are using (it should be 3.11.14).

print("Python version:", sys.version)

Python version: 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:39:18) [GCC 14.3.0]

Note about standard imports

Many Python libraries have standard import names. These are not official names, but we will require their use, to make your code readable and compatible. For example, NumPy should always be imported as np.

NumPy

NumPy Official Website and Documentation: https://numpy.org/

# Standard import for NumPy
import numpy as np
np.__version__

'2.3.5'

A minimal performance check (Read here about %timeit and other built-in magic commands to use in your notebooks)

a = np.random.rand(1000)
%timeit a @ a

630 ns ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
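
The %timeit command is an IPython magic, so it only works in notebooks and IPython sessions. In a plain Python script a similar check can be sketched with the timeit module from the standard library. The number of repetitions below is an arbitrary choice, and the reported time includes the small overhead of calling the lambda.

```python
import timeit

import numpy as np

a = np.random.rand(1000)

# Run the dot product 10,000 times and report the average time per call
n_runs = 10_000
total_seconds = timeit.timeit(lambda: a @ a, number=n_runs)
print(f"{total_seconds / n_runs * 1e9:.0f} ns per loop")
```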

Pandas

Pandas Official Website and Documentation: https://pandas.pydata.org/

# Standard import for Pandas
import pandas as pd
pd.__version__

'2.3.3'

MatplotLib and Seaborn

MatplotLib Official Website and Documentation: https://matplotlib.org/
Seaborn Official Website and Documentation: https://seaborn.pydata.org/

# The Standard import for Matplotlib is
import matplotlib as mpl 
# But we will more frequently use this import sequence
import matplotlib.pyplot as plt
mpl.__version__

'3.10.8'

# Standard import for Seaborn
import seaborn as sns
sns.__version__

'0.13.2'

Scikit-learn

scikit-learn Official Website and Documentation: https://scikit-learn.org/

# Standard import for Scikit-learn
import sklearn as sk 
sk.__version__

'1.7.2'

Working with Pandas DataFrame Objects.

The Titanic Dataset

For our first example we will use a data set called titanic, available in the Seaborn library that you should have already installed. The data set contains information about the passengers of the Titanic, such as their age, gender, class in which they were traveling, whether they survived the sinking of the ship, etc. We will soon see how to use Python to read data from different sources: CSV and Excel files, URLs, databases, APIs, etc. But for now we just want to run some tests and get an overview of the data structures we will be working with.

titanic = sns.load_dataset('titanic')
type(titanic)

pandas.core.frame.DataFrame

As you can see, titanic is now a Pandas DataFrame. This, alongside NumPy arrays, is the type of data object that we will most frequently meet when working with data.

First Look at the Dataset

To see the first lines of the titanic data set we can use the head method. The optional n argument determines the number of rows in the output (the default is n = 5).
Note: you will notice some NaN values in the deck column of the table. These are missing data. We will see how to deal with missing data throughout the course lectures.

titanic.head(4)

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False

Exercise 001.

Try running the above with other values of n. Also run it as print(titanic.head(4)). Use the Help menu to open the Reference link and look up the information about the head method for Pandas DataFrames.

Basic properties of a DataFrame

How many rows and columns of data are there in this data set? We get the answer with the shape attribute in Pandas:

titanic.shape

(891, 15)

Exercise 002.

What happens if you just use the name of the DataFrame in a notebook?
Apply the Pandas methods info and describe to the Titanic Dataframe.
Also run the command titanic.size and len(titanic).
Note that size, just like shape, is an attribute, not a method. Also note that len is a function. Make sure that you understand how these concepts are different from each other.

len(titanic)
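
The difference between attributes, methods and functions can be sketched on a small table. The DataFrame below is hypothetical and only serves to illustrate the syntax.

```python
import pandas as pd

# A small hypothetical table, used only for illustration
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

print(df.shape)    # attribute: stored information, no parentheses
print(df.size)     # attribute: number of rows times number of columns
print(len(df))     # function: takes the DataFrame as its argument
print(df.head(2))  # method: a function attached to the object, called with ()

# A method is callable, an attribute such as shape is not
print(callable(df.head), callable(df.shape))
```

Writing df.head without parentheses does not return a table: it returns the method itself, which is a common source of confusing output.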

Columns of the DataFrame

To see the column names we use, quite naturally:

titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

Index of a DataFrame

As we see, the column names are stored in an Index object. We will learn more about indices later in the course, but we can start with the idea that sometimes the rows of the DataFrame (or groups of rows) can contain information as well. This information is stored in the index, which can be accessed as follows.

titanic.index

RangeIndex(start=0, stop=891, step=1)

In this case the answer indicates that the index is just the sequence of row numbers.
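
To see what an index that does carry information might look like, here is a minimal sketch with invented data. The ticket codes and the column names are hypothetical; set_index is the Pandas method that turns a column into the row index.

```python
import pandas as pd

# Hypothetical data: three passengers identified by a ticket code
passengers = pd.DataFrame({
    "ticket": ["A10", "B22", "C35"],
    "age": [22.0, 38.0, 26.0],
})
print(passengers.index)             # default: row numbers 0, 1, 2

# set_index returns a new DataFrame whose row labels are the ticket codes
by_ticket = passengers.set_index("ticket")
print(by_ticket.index)
print(by_ticket.loc["B22", "age"])  # rows can now be selected by label
```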

Accessing the data

We can get any element in the data table using brackets, the iloc method and row/column number pairs.

For example, to get the element in the second row and fourth column we use:

titanic.iloc[1, 3]

np.float64(38.0)

Always keep in mind that Python indexing is zero-based!!

The zero index corresponds to the first element in an ordered set.

Tidy Data

In a tidy data table:

the columns correspond to variables
the rows correspond to observations.

And we should have a very good reason to do otherwise!! Wikipedia page (Spanish version with links to English References)

In the above example, the fourth column corresponds to the age variable of the Titanic passengers. It is often better to refer to variables by their names. We can do it with the loc method:

titanic.loc[1, 'age']

np.float64(38.0)

In this dataframe the row indices are numeric (in fact, consecutive zero-based integers), so we can use 1 with loc or iloc to select a row.
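
When the row labels are not consecutive zero-based integers, loc and iloc no longer agree. The following sketch uses a hypothetical table with labels 10, 20 and 30 to show the difference.

```python
import pandas as pd

# Hypothetical table whose row labels are not 0, 1, 2
df = pd.DataFrame({"age": [22.0, 38.0, 26.0]}, index=[10, 20, 30])

print(df.iloc[1, 0])       # position 1, that is, the second row
print(df.loc[20, "age"])   # label 20, which is also the second row

# Label 1 does not exist in this index, so loc raises a KeyError
try:
    df.loc[1, "age"]
except KeyError as err:
    print("KeyError:", err)
```

In short, iloc always counts positions, while loc always looks up labels.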

Accessing complete columns

Columns in a dataframe can be accessed with brackets and the quoted variable name, or with loc. But with loc we must use a colon : to include all the rows.

To extract the age column

titanic['age'] # or also titanic.loc[:, "age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

Warning

The age column can also be accessed as an attribute of the table, with titanic.age. The bracket notation is required if the variable name contains special characters, but the attribute notation can be handy when we want to shorten our code.
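
The sketch below illustrates two situations in which the attribute notation fails or misleads: a column name that contains a space, and a column name that coincides with an existing DataFrame attribute. The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical table with one "safe" name and two problematic ones
df = pd.DataFrame({"age": [22, 38], "size": [1, 2], "embark town": ["S", "C"]})

print(df.age.tolist())             # works: "age" is a valid, unused name
print(df["size"].tolist())         # brackets return the column
print(df.size)                     # the attribute is the number of cells, not the column
print(df["embark town"].tolist())  # names with spaces need brackets
```

For this reason the bracket notation is the safer default, especially in code that must work with arbitrary column names.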

Dataframe Columns and Pandas Series

What kind of object is one such extracted column? It is not a dataframe with a single column (those exist too) but a Pandas Series object. In Pandas a DataFrame can be considered as a collection of Series (columns) with a common (row) index. Missing this tricky point leads to many mistakes.

type(titanic['age'])

pandas.core.series.Series

We have just said that dataframes with a single column exist. In this case you can turn the resulting Series into one such dataframe as follows.

age_df = pd.DataFrame(titanic['age'])
age_df.head(4)

	age
0	22.0
1	38.0
2	26.0
3	35.0

Modifying data

The loc and iloc methods can also be used to modify elements of the table. Let us modify the age of that passenger and check the result with head:

titanic.iloc[1, 3] = 19.0
titanic.head(4)

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	19.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False

Exercise 003.

Now use loc instead of iloc to return the age of the passenger to the original 38.0 value and check your work with head.
Run the command type(titanic[['age']]). Note the difference with type(titanic['age']).

Accessing Larger Portions of the DataFrame Through Indexing

If we want to access several rows and/or columns we can use explicit indexing with lists (note that the order is relevant).

titanic.loc[[0, 1, 2, 3, 4] , ['sex', 'age', 'survived', 'pclass']]

	sex	age	survived	pclass
0	male	22.0	0	3
1	female	19.0	1	1
2	female	26.0	1	3
3	female	35.0	1	1
4	male	35.0	0	3

Slicing

In slicing we use the colon : to get consecutive rows or columns.

titanic.loc[:4 , 'survived':'age']

	survived	pclass	sex	age
0	0	3	male	22.0
1	1	1	female	19.0
2	1	3	female	26.0
3	1	1	female	35.0
4	0	3	male	35.0

Warning

Contrary to the usual Python (and NumPy) convention, in Pandas the final values of the indices are included in the output for loc.
You can also use slicing with iloc. However, in this case the final values of the numeric indexes are excluded in the output for iloc.

titanic.iloc[:2 , 0:2]

	survived	pclass
0	0	3
1	1	1
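
The two slicing conventions can be compared side by side on a small table. This is a minimal sketch; the DataFrame and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical table with the default zero-based integer index
df = pd.DataFrame({"a": range(5), "b": range(5, 10), "c": range(10, 15)})

by_label = df.loc[:2, "a":"b"]   # labels 0, 1, 2 and columns a, b (ends included)
by_position = df.iloc[:2, 0:1]   # positions 0, 1 and column 0 only (ends excluded)

print(by_label.shape)
print(by_position.shape)
```

Even though both slices are written with the same row limit, loc returns three rows and iloc returns two.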

Condition based filtering

One of the most powerful tools in data analysis is the ability to filter only those rows of a data table that meet some condition, usually expressed as a boolean condition. In Pandas we can use loc for this task.

For example, we can filter the titanic DataFrame, keeping only the rows corresponding to female passengers aged 25 or older.

ttnc_female_25plus = titanic.loc[(titanic["age"] >= 25) &  (titanic["sex"] == "female")]
ttnc_female_25plus.iloc[:, 0:8].head(4)

	survived	pclass	sex	age	sibsp	parch	fare	embarked
2	1	3	female	26.0	0	0	7.9250	S
3	1	1	female	35.0	1	0	53.1000	S
8	1	3	female	27.0	0	2	11.1333	S
11	1	1	female	58.0	0	0	26.5500	S

Controlling the Output

Note that in the last line of the previous code chunk we have used iloc to select the columns that appear in the output. Always be mindful of the output of your commands! Communication skills are an essential tool in Data Science and Machine Learning!
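
As a complement to the condition-based filter used above, the following sketch shows what the boolean conditions look like on their own and how they can be combined. The mini table is hypothetical; the operators &, | and ~ are the element-wise versions of and, or and not for Pandas Series.

```python
import pandas as pd

# Hypothetical mini version of the passenger table
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 19.0, 54.0],
})

# Each comparison produces a boolean Series with one True/False per row
is_female = df["sex"] == "female"
is_25plus = df["age"] >= 25

# Combine the masks element-wise; Python's and/or/not do not work on Series
both = df.loc[is_female & is_25plus]     # female and at least 25
either = df.loc[is_female | is_25plus]   # female or at least 25
not_female = df.loc[~is_female]          # not female

print(both)
print(either)
print(not_female)
```

When the comparisons are written inline, as in the titanic example, each one must be wrapped in parentheses because & and | bind more tightly than == and >=.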

Grouping

Together with conditional selection, grouping observations provides an essential tool to study the distribution of the variables in the dataset and answer questions about them.

For example, what was the average fare paid by the female passengers aged 25 or older, grouped by the class they were traveling in?
Note: we pass observed=True explicitly because the default value of this argument is changing in pandas; it will probably become unnecessary in future versions.

ttnc_female_25plus.groupby(by="class", observed=True )["fare"].mean()

class
First     104.930704
Second     20.199113
Third      16.705341
Name: fare, dtype: float64
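
Grouping is not limited to a single summary value. As a sketch with invented fares, the agg method can compute several statistics per group in one call; the table below is hypothetical and its class column holds plain strings, so the observed argument is not needed here.

```python
import pandas as pd

# Hypothetical fares, invented for illustration
df = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third"],
    "fare": [80.0, 100.0, 8.0, 10.0, 12.0],
})

# agg computes several summaries per group in one call
summary = df.groupby(by="class")["fare"].agg(["count", "mean", "max"])
print(summary)
```

The result is a DataFrame with one row per group and one column per statistic.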

Plots

We will have plenty of opportunities to learn how to use different types of plots to explore and visualize our data. But to introduce the subject and close this session, we show below how to obtain a simple graph from a Pandas DataFrame.

In fact, let us plot a histogram of the age variable (using MatplotLib).

titanic.hist('age', edgecolor='black', linewidth=1.2, color = "tan", figsize=(6, 3))

array([[<Axes: title={'center': 'age'}>]], dtype=object)

Next Session

In the next session we will start with the first family of supervised learning models: classification models. We will spend several sessions studying such models, using them to lay the foundations of Machine Learning in general, as well as the Python tools we will use throughout the course.