Chapter 1: Introduction

Machine Learning

Author

F.San Segundo & N.Rodríguez

Published

January 2026

Course Description and Grading

Course description:

The objective of this introductory Machine Learning course is to give students a fundamental understanding of, and extensive practical experience in, extracting knowledge from data.

Course goals:
  • Understand the basic principles behind Machine Learning.
  • Gain practical experience with the most relevant Machine Learning algorithms.
  • Have well-formed criteria to choose the most appropriate technique for a given application.

Grading:

This is a summary; see the Course Guide for the official version.

  • 15% Mid-term exam
  • 35% Final exam (\(\geq 4\))
  • 50% Lab
  • Retake
Schedule.

Abridged version: due dates for tests and assignments to be confirmed.

January to February, before the Midterm (scheduled for the first week of March):

  • Ch. 1: Introduction
  • Ch. 2: Supervised Learning: Classification \(\rightarrow\) Assignment 1
  • Ch. 3: Supervised Learning: Regression

Late February to late March:

  • Ch. 4: Supervised Learning: Forecasting \(\rightarrow\) Assignment 2

April (Easter in the first week, last lecture day April 22nd)

  • Ch. 5: Unsupervised Learning

Final Exam (late April or early May)

Final Exam Retake (June)


Introduction to Machine Learning

Motivation for ML
  • We are in a data-rich but information-poor situation.
  • In general, decision makers do not have the tools to extract the valuable knowledge embedded in the vast amounts of data.
  • Data-driven decisions increasingly make the difference between keeping up with competitors or falling further behind.

Harvard Paper Sexiest Job 21st Century


Machine Learning Definition


Tentative Definitions of Machine Learning

With vs without Machine Learning


Machine Learning Timeline


Data Abstraction Generalization Process

Steps to Apply ML to Data


Types of Learning

The following diagram attempts to describe the major problem families addressed by Machine Learning. We will next briefly describe some of these families that we will meet in the course.

Machine Learning Types


Supervised Learning

The aim of Supervised Learning is to learn an input-output mapping from a “labelled” dataset. \[Y \sim f(X_1,\ldots, X_n)\] where

  • \(Y\) is the output or response variable.
  • \(X_i\) are the input or explanatory variables.
  • \(f\) is a function that approximates output as a function of inputs.

Applications:

  • Classification
  • Regression
  • Forecasting
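As a minimal sketch of what "learning an input-output mapping" means, the example below fits a model to a small synthetic labelled dataset. The data, the coefficients 3, -2 and 1, and the choice of a linear model are assumptions made only for illustration; the models used in the course are introduced in later chapters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic labelled data (illustrative only): two inputs X_1, X_2 and an output Y
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0  # the "true" mapping, unknown in practice

# Learn an approximation f of the mapping from the labelled examples
model = LinearRegression()
model.fit(X, y)

# Use the learned f to predict the output for a new input
y_hat = model.predict(np.array([[1.0, 1.0]]))
print(model.coef_, model.intercept_, y_hat)
```

Since this toy output is an exact linear function of the inputs, the fitted coefficients recover the ones used to generate the data. With real data the mapping is only approximated.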

Supervised Learning: Classification

In classification problems we have a categorical output variable \(Y\). That is, \(Y\) is a factor that divides the data into classes/levels.

Machine Learning Timeline


Supervised Learning: Regression
  • In regression problems we have a numerical output variable \(Y\).

Regression Example


Supervised Learning: Forecasting

In forecasting problems we use past values of the output variable \(Y\) to predict its future values. \[\underbrace{Y_t}_{\text{Output at time }t} \sim \underbrace{f(Y_{t-1}, Y_{t-2},\,\ldots)}_{\text{Output at previous time values}}\]
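The formula above can be turned into a table of inputs and outputs by pairing each value with its previous values. The sketch below does this with the pandas shift method on a made-up series; the numbers and the column names are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Toy time series (made-up values, only for illustration)
y = pd.Series([10.0, 12.0, 13.0, 15.0, 18.0], name="y")

# Each row pairs the output at time t with its two previous values
lagged = pd.DataFrame({
    "y_t": y,
    "y_t_minus_1": y.shift(1),  # value one time step earlier
    "y_t_minus_2": y.shift(2),  # value two time steps earlier
})

# The first two rows lack a complete history, so we drop them
lagged = lagged.dropna()
print(lagged)
```

After this step the lagged columns play the role of the inputs and y_t is the output, so the problem looks like the other supervised learning problems.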


Software setup

Detailed Setup Guide

Follow the instructions here.

Note: If you have started there, and you have completed the setup, then you can continue below. Just make sure that you use the select kernel button (top right corner of this window) to select the correct Python environment for this notebook, as described in the setup guide.


Python Setup Checks

To test that your Python setup is working, please execute the remaining Python cells in this document. Begin by checking the output of the following commands. It should be /opt/conda/envs/mlmiin/bin/python.

import sys
print(sys.executable)
/opt/conda/envs/mlmiin/bin/python

Check the Basic Libraries for Data Science with Python

Check the versions and make sure that no errors appear in the output of the code cells below.

First check the Python version you are using (it should be 3.11.14).

print("Python version:", sys.version)
Python version: 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:39:18) [GCC 14.3.0]
Note about standard imports

Many Python libraries have standard import names. These are not official names, but we will require their use, to make your code readable and compatible. For example, NumPy should always be imported as np.

# Standard import for NumPy
import numpy as np
np.__version__
'2.3.5'

A minimal performance check (Read here about %timeit and other built-in magic commands to use in your notebooks)

a = np.random.rand(1000)
%timeit a @ a
630 ns ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# Standard import for Pandas
import pandas as pd
pd.__version__
'2.3.3'

# The Standard import for Matplotlib is
import matplotlib as mpl 
# But we will more frequently use this import sequence
import matplotlib.pyplot as plt
mpl.__version__
'3.10.8'
# Standard import for Seaborn
import seaborn as sns
sns.__version__
'0.13.2'
# Standard import for Scikit-learn
import sklearn as sk
sk.__version__
'1.7.2'

Working with Pandas DataFrame Objects.

The Titanic Dataset

For our first example we will use a data set called titanic, available in the Seaborn library that you should have already installed. The data set contains information about the passengers of the Titanic, such as their age, gender, the class in which they were traveling, whether they survived the sinking of the ship, etc. We will soon see how to use Python to read data from different sources: CSV and Excel files, URLs, databases, APIs, etc. But for now we just want to run some tests and get an overview of the data structures we will be working with.

titanic = sns.load_dataset('titanic')
type(titanic)
pandas.core.frame.DataFrame

As you can see, titanic is now a Pandas DataFrame. Together with NumPy arrays, this is the type of data object that we will most frequently meet when working with data.


First Look at the Dataset

To see the first lines of the titanic data set we can use the head method. The optional n argument determines the number of rows in the output (the default is n = 5).
Note: you will notice some NaN values in the deck column of the table. These are missing data. We will see how to deal with missing data throughout the course lectures.

titanic.head(4)
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False     C  Southampton    yes  False
Exercise 001.

Try running the above with other values of n. Also run it as print(titanic.head(4)). Use the Help menu to open the Reference link and look up the information about the head method for Pandas DataFrames.
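As a quick way to see how many values are missing, the sketch below counts the NaN entries per column with isna followed by sum. It uses a small hand-made table (a hypothetical stand-in for titanic, so that it runs without downloading anything); the same call should work on the real DataFrame.

```python
import numpy as np
import pandas as pd

# Small hand-made table with some missing values (illustrative stand-in for titanic)
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "deck": [np.nan, "C", np.nan, "C"],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```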


Basic properties of a DataFrame

How many rows and columns of data are there in this data set? We get the answer with the shape attribute of the DataFrame:

titanic.shape
(891, 15)
Exercise 002.
  • What happens if you just use the name of the DataFrame in a notebook?
  • Apply the Pandas methods info and describe to the Titanic Dataframe.
  • Also run the command titanic.size and len(titanic).
  • Note that size, just like shape, is an attribute, not a method. Also note that len is a function. Make sure that you understand how these concepts are different from each other.
len(titanic)
891
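The difference between attributes, methods and functions can be sketched on a tiny hand-made DataFrame (the table and the variable names below are hypothetical): attributes are read without parentheses, methods are called on the object with parentheses, and functions take the object as an argument.

```python
import pandas as pd

# Tiny table with 3 rows and 2 columns (illustrative only)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape  # attribute: no parentheses, returns (rows, columns)
n_cells = toy.size          # attribute: total number of cells (rows * columns)
n_len = len(toy)            # built-in function: the object is passed as an argument
first_rows = toy.head(2)    # method: called on the object, with parentheses

print(n_rows, n_cols, n_cells, n_len)
```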

Columns of the DataFrame

To see the column names we use, quite naturally:

titanic.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')
Index of a DataFrame

As we see, the column names are stored in an Index object. We will learn more about indices later in the course, but we can start with the idea that sometimes the rows of the DataFrame (or groups of rows) have labels that carry information as well. This information is stored in the index, which can be accessed as follows.

titanic.index
RangeIndex(start=0, stop=891, step=1)

In this case the answer indicates that the index is just the sequence of row numbers.


Accessing the data

We can get any element in the data table using brackets, the iloc method and row/column number pairs.

For example, to get the element in the second row and fourth column we use:

titanic.iloc[1, 3]
np.float64(38.0)
Always keep in mind that Python indexing is zero-based!

The zero index corresponds to the first element in an ordered set.


Tidy Data

In a tidy data table:

  • the columns correspond to variables
  • the rows correspond to observations.

And we should have a very good reason to do otherwise!! Wikipedia page (Spanish version with links to English References)

In the above example, the fourth column corresponds to the age variable of the Titanic passengers. It is often better to refer to variables by their names. We can do it with the loc method:

titanic.loc[1, 'age']
np.float64(38.0)

In this dataframe the row indices are numeric (in fact, consecutive zero-based integers), so we can use 1 with loc or iloc to select a row.
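When the row index is not the default sequence of row numbers, loc and iloc no longer agree. The sketch below uses a hypothetical three-row table whose rows are labelled 10, 20 and 30 to show the difference: iloc works with positions, loc with labels.

```python
import pandas as pd

# Toy table whose row labels are NOT consecutive zero-based integers
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0]}, index=[10, 20, 30])

by_position = toy.iloc[1, 0]   # second row, first column (positions)
by_label = toy.loc[20, "age"]  # row labelled 20, column named 'age' (labels)
print(by_position, by_label)

# Using a position with loc fails here, because no row is labelled 1
try:
    toy.loc[1, "age"]
except KeyError:
    print("no row is labelled 1")
```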


Accessing complete columns

Columns in a dataframe can be accessed with brackets and the quoted variable name, or with loc. But with loc we must use a colon : to include all the rows.

To extract the age column

titanic['age'] # or also titanic.loc[:, "age"]
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64
Warning

The age column can also be accessed as an attribute of the table, with titanic.age. The bracket method is required if the variable name contains special characters, but this attribute notation can be handy when we want to shorten our code sentences.


Dataframe Columns and Pandas Series

What kind of object is one such extracted column? It is not a dataframe with a single column (those exist too) but a Pandas Series object. In Pandas a DataFrame can be considered as a collection of Series (columns) with a common (row) index. Missing this tricky point leads to many mistakes.

type(titanic['age'])
pandas.core.series.Series

We have just said that dataframes with a single column exist. You can turn the resulting Series into one such dataframe as follows.

age_df = pd.DataFrame(titanic['age'])
age_df.head(4)
    age
0  22.0
1  38.0
2  26.0
3  35.0
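An alternative way to perform the same conversion is the to_frame method of a Series. The sketch below applies it to a small hand-made Series (the values and the name age_frame are hypothetical); the name of the Series becomes the column name.

```python
import pandas as pd

# A small named Series (illustrative values)
ages = pd.Series([22.0, 38.0, 26.0], name="age")

# to_frame() wraps the Series in a one-column DataFrame
age_frame = ages.to_frame()

print(type(ages).__name__, type(age_frame).__name__)
print(age_frame.columns.tolist())
```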

Modifying data

The loc and iloc methods can also be used to modify elements of the table. Let us modify the age of that passenger and check the result with head:

titanic.iloc[1, 3] = 19.0
titanic.head(4)
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  19.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False     C  Southampton    yes  False
Exercise 003.
  1. Now use loc instead of iloc to return the age of the passenger to the original 38.0 value and check your work with head.
  2. Run the command type(titanic[['age']]). Note the difference with type(titanic['age']).

Accessing Larger Portions of the DataFrame Through Indexing

If we want to access several rows and/or columns we can use explicit indexing with lists (note that the order is relevant).

titanic.loc[[0, 1, 2, 3, 4] , ['sex', 'age', 'survived', 'pclass']]
      sex   age  survived  pclass
0    male  22.0         0       3
1  female  19.0         1       1
2  female  26.0         1       3
3  female  35.0         1       1
4    male  35.0         0       3

Slicing

In slicing we use the colon : to get consecutive rows or columns.

titanic.loc[:4 , 'survived':'age']
   survived  pclass     sex   age
0         0       3    male  22.0
1         1       1  female  19.0
2         1       3  female  26.0
3         1       1  female  35.0
4         0       3    male  35.0
Warning
  • Contrary to the usual Python (and NumPy) convention, in Pandas the final values of the indices are included in the output for loc.
  • You can also use slicing with iloc. However, in this case the final values of the numeric indexes are excluded in the output for iloc.
titanic.iloc[:2 , 0:2]
   survived  pclass
0         0       3
1         1       1
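The two slicing conventions can be compared side by side on a small hand-made table (hypothetical data, chosen so that row labels and row positions coincide):

```python
import pandas as pd

# Toy table with default row labels 0, 1, 2, 3 (illustrative only)
toy = pd.DataFrame({
    "a": [0, 1, 2, 3],
    "b": [10, 11, 12, 13],
    "c": [20, 21, 22, 23],
})

# loc slices by label and INCLUDES the end labels: rows 0, 1, 2 and columns a, b
by_label = toy.loc[:2, "a":"b"]

# iloc slices by position and EXCLUDES the end positions: rows 0, 1 and column a only
by_position = toy.iloc[:2, 0:1]

print(by_label.shape, by_position.shape)
```

Even though both slices are written with the same row limit :2, loc returns three rows and iloc returns two.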

Condition based filtering

One of the most powerful tools in data analysis is the ability to filter only those rows of a data table that meet some condition, usually expressed as a boolean condition. In Pandas we can use loc for this task.

For example, we can filter the titanic DataFrame, keeping only the rows corresponding to female passengers aged 25 or older.

ttnc_female_25plus = titanic.loc[(titanic["age"] >= 25) &  (titanic["sex"] == "female")]
ttnc_female_25plus.iloc[:, 0:8].head(4)
    survived  pclass     sex   age  sibsp  parch     fare  embarked
 2         1       3  female  26.0      0      0   7.9250         S
 3         1       1  female  35.0      1      0  53.1000         S
 8         1       3  female  27.0      0      2  11.1333         S
11         1       1  female  58.0      0      0  26.5500         S
Controlling the Output

Note that in the last line of the previous code chunk we have used iloc to select the columns that appear in the output. Always be mindful of the output of your commands! Communication skills are an essential tool in Data Science and Machine Learning!
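To see what happens inside a filter like the one used above, it can help to build the boolean condition step by step. The sketch below does so on a small hand-made table (hypothetical data): each comparison produces a boolean Series, the two are combined element-wise with &, and loc keeps the rows where the result is True. The parentheses around each comparison in the one-line version are needed because & binds more tightly than >= and ==.

```python
import numpy as np
import pandas as pd

# Toy passenger table (illustrative only); one age is missing
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "female"],
    "age": [22.0, 38.0, 19.0, np.nan],
})

is_female = toy["sex"] == "female"  # boolean Series, one value per row
is_25_plus = toy["age"] >= 25       # comparisons with NaN evaluate to False
mask = is_female & is_25_plus       # element-wise "and"

selected = toy.loc[mask]            # keep only the rows where mask is True
print(mask.tolist())
print(selected)
```

Note that the row with a missing age is silently excluded, since the condition cannot be verified for it.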


Grouping

Together with conditional selection, grouping observations provides an essential tool to study the distribution of the variables in the dataset and answer questions about them.

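For example, what was the average fare paid by the female passengers aged 25 or older, grouped by the class they were traveling in?
Note: class is a categorical column, and pandas is changing the default value of the observed argument of groupby for such columns. Passing observed=True explicitly shows only the categories that are actually present and avoids a deprecation warning; it will probably become unnecessary in future versions.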

ttnc_female_25plus.groupby(by="class", observed=True )["fare"].mean()
class
First     104.930704
Second     20.199113
Third      16.705341
Name: fare, dtype: float64
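The effect of the observed argument can be seen in a small sketch with a hand-made categorical column (hypothetical data, in which no passenger travels in Second class). With observed=True only the categories present in the data appear in the result; with observed=False every declared category appears, and the empty one gets a missing value.

```python
import pandas as pd

# Toy table: 'class' is categorical with three levels, but 'Second' never occurs
toy = pd.DataFrame({
    "class": pd.Categorical(
        ["First", "Third", "Third"],
        categories=["First", "Second", "Third"],
    ),
    "fare": [80.0, 8.0, 10.0],
})

# Only the categories that actually appear in the data
observed_only = toy.groupby("class", observed=True)["fare"].mean()

# All declared categories, including the empty one (its mean is NaN)
all_categories = toy.groupby("class", observed=False)["fare"].mean()

print(observed_only)
print(all_categories)
```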

Plots

We will have plenty of opportunities to learn how to use different types of plots to explore and visualize our data. But to introduce the subject and close this session, we show below how to obtain a simple graph from a Pandas DataFrame.

In fact, let us plot a histogram of the age variable (using Matplotlib).

titanic.hist('age', edgecolor='black', linewidth=1.2, color = "tan", figsize=(6, 3))
array([[<Axes: title={'center': 'age'}>]], dtype=object)

Next Session

In the next session we will start with the first family of supervised learning models: classification models. We will spend several sessions studying such models, using them to lay the foundations of Machine Learning in general, as well as the Python tools we will use throughout the course.