Chapter 2, Part 5: Ensemble Methods and Boosting

Machine Learning

Author

F.San Segundo & N.Rodríguez

Published

January 2026

COPY THIS NOTEBOOK FIRST

Checklist

  • Have you started Docker Desktop?
  • Have you launched Docker from the MLMIIN repository folder?
  • Have you connected VS Code to the running container?

If you have missed any of these steps you may need to restart VS Code after completing them.
Also if Python seems unresponsive at first, try restarting the kernel.

IMPORTANT

  • Remember to make a copy of this notebook (in the same folder) before starting to work on it.
  • If you make changes to the original notebook, save it with another name, and use Git to undo the changes in the original notebook (ask for help with this if you need it).
Setting the working directory

We begin by using cd to make sure we are in the right folder.

Hide the code
%cd 2_5_EnsembleMethods_Boosting
/wd/2_5_EnsembleMethods_Boosting

Introduction to Ensemble Methods: the wisdom of the (wise!) crowds.

Fighting Overfitting as a Consequence of High Model Variance

The major problem with decision trees is their high variance, which makes them prone to overfitting: training the same model on different training sets leads to highly variable, volatile decision boundaries. A possible way to reduce the variance of the predictions is to use different training sets to train a family or ensemble of predictors and then average their results.

Of course, we often do not have enough data to build several independent training sets. A compromise solution consists in creating resampled (bootstrapped) training sets by sampling the training data with replacement. Each resampled set has the same size as the original training set, but some points are repeated and others are missing (roughly one third of them, as we will see). We then train a model on each resampled dataset and average their results to obtain predictions. For binary classification this means averaging the probability predicted by each of the models. The models for each resample need not be of the same type, but they usually share a common base model type. This method is called bagging, and it is usually applied with decision trees as base models.
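The resample-train-average recipe can be sketched by hand before turning to scikit-learn's implementation (a minimal illustration with made-up sizes and settings, not the code we will use below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=500, random_state=1)

models = []
for _ in range(25):
    # Bootstrap resample: same size as the training set, drawn with replacement
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    models.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

# Average the predicted probability of the positive class over the ensemble
avg_prob = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
bagged_pred = (avg_prob >= 0.5).astype(int)
```

Each tree sees a slightly different bootstrapped view of the data, and the final prediction is the averaged probability thresholded at 0.5.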

We will begin by exploring bagging to discover some shortcomings of this method. Then we move on to Random Forests, which address the limitations of bagging. Finally we will discuss gradient boosting methods, which approach the problem from a different perspective.

Starling murmuration

Recommended reading:

See References section at the end for details.


A Higher Dimensional Data Set for this Session

Some of the finer points of the discussion that follows only apply to datasets with a higher number of predictors than the examples we have been using. So we begin by creating a new dataset that meets our requirements.

Let us begin by loading the basic libraries.

Hide the code
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as col
import seaborn as sns

import pandas as pd

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# This is a custom module that contains some plotting utilities, the source code is in the file 
# plot_utils.py in the same directory as this notebook
from plot_utils import plot_2d_data, plot_2d_classifier

from sklearn.datasets import make_moons
from sklearn.datasets import make_circles

from sklearn.model_selection import train_test_split

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

We use the make_classification function to build the example. This is a utility function that you will use when you need to test your ideas, benchmark methods, etc.

Hide the code
N = 1000

from sklearn.datasets import make_classification
X, Y = make_classification(n_classes=2, n_samples=N, n_informative=5, random_state=1)
Hide the code
inputs = ["X" + str(k) for k in range(X.shape[1])]
output = "Y"

df = pd.DataFrame(X, columns = inputs)
df[output] = Y
df.iloc[:,:12].head()
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
0 -3.327193 0.832043 1.316756 -0.660973 0.597524 2.027022 -0.159786 -2.702438 0.984458 1.116811 -1.251926 0.202324
1 1.685766 -0.403897 -0.357680 0.172492 -0.639168 -1.151757 -1.219221 -1.012974 0.689497 0.177965 -0.785904 1.199356
2 2.140472 0.377846 -0.969882 -0.366210 0.106640 -0.595217 -1.234516 -0.813707 0.702014 -2.371404 -0.683127 0.188425
3 0.344467 -0.698053 2.175077 0.426490 -0.703525 -0.436798 -0.757212 2.192373 0.022481 -0.623113 -1.003478 -0.160083
4 -2.148952 -0.387040 3.550959 -1.488384 2.618837 -0.323193 -0.860368 -4.625203 0.015528 0.705384 -1.093792 -0.698836
Hide the code
from sklearn.model_selection import train_test_split
XTR, XTS, YTR, YTS = train_test_split(df[inputs], df[output],
                                      test_size=0.2,  # percentage preserved as test data
                                      random_state=1, # seed for replication
                                      stratify = df[output])   # Preserves distribution of y

Bagging

Constructing a Bagging Classifier

Let us use this dataset to explore bagging, with decision trees as base models. We begin by choosing the number of trees and their maximum depth in the next cell. Later you can return here and modify these values to see their impact on the results below.

Hide the code
n_trees = 2500
md = 6

Bagging with a fixed base classifier is easily accomplished with BaggingClassifier. All the parameters of the BaggingClassifier have self-describing names, except for the oob_score parameter. We will explain what this score means below.

Hide the code
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
BagDT = BaggingClassifier(estimator= DecisionTreeClassifier(
                          max_depth=md),
                          n_estimators=n_trees,
                          oob_score=True,
                          random_state=1)

Fitting the model and getting the scores follows the usual steps:

Hide the code
BagDT.fit(XTR, YTR)
BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                  n_estimators=2500, oob_score=True, random_state=1)
Hide the code
BagDT.score(XTR, YTR), BagDT.score(XTS, YTS), BagDT.oob_score_
(0.9775, 0.875, 0.8875)

Datasets for Predictions

As usual we create datasets to store the predictions of the different models we train.

Hide the code
check_datasets = ['dfTR' in globals(), 'dfTR_eval' in globals(), 'dfTS' in globals(), 'dfTS_eval' in globals()]
print('Checking existence of dfTR, dfTR_eval and dfTS, dfTS_eval:', check_datasets)

if not('dfTR' in globals()):
    # Dataset for Training Predictions
    dfTR = XTR.copy()
    dfTR['Y'] = YTR
    print("Created dfTR")

if not('dfTS' in globals()):
    # Dataset for Test Predictions
    dfTS = XTS.copy()
    dfTS['Y'] = YTS
    print("Created dfTS")

if not('dfTR_eval' in globals()):
    # Dataset for Training Predictions
    dfTR_eval = dfTR.copy()
    print("Created dfTR_eval")

if not('dfTS_eval' in globals()):
    # Dataset for Test Predictions
    dfTS_eval = dfTS.copy()
    print("Created dfTS_eval")
Checking existence of dfTR, dfTR_eval and dfTS, dfTS_eval: [False, False, False, False]
Created dfTR
Created dfTS
Created dfTR_eval
Created dfTS_eval
Hide the code
model = BagDT
model_name = "BagDT"
The insert function in pandas

The code below uses the insert function in pandas to insert a new column in a dataframe. In this case this is not really necessary, as the result is equivalent to the following code we have been using so far:

    df['column_name'] = values

But using insert will be convenient when you want to insert a column in a specific position. The syntax is:

    df.insert(loc, column, value)
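For instance, inserting a column at an interior position (a toy example, unrelated to our dataset):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "c": [5, 6]})
toy.insert(loc=1, column="b", value=[3, 4])   # place "b" between "a" and "c"
print(list(toy.columns))  # → ['a', 'b', 'c']
```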
Hide the code
# Store the actual predictions
newCol = 'Y_'+ model_name +'_prob_neg'; 
dfTR_eval.insert(loc=dfTR_eval.shape[1], column=newCol, value=model.predict_proba(XTR)[:, 0])
newCol = 'Y_'+ model_name +'_prob_pos'; 
dfTR_eval.insert(loc=dfTR_eval.shape[1], column=newCol, value=model.predict_proba(XTR)[:, 1])
newCol = 'Y_'+ model_name +'_pred'; 
dfTR_eval.insert(loc=dfTR_eval.shape[1], column=newCol, value=model.predict(XTR))
The filter function in pandas

The code below uses the filter function in pandas to display a subset of columns (axis = 1) of the dataframe, whose names contain a given string (specified by the like parameter):

Hide the code
dfTR_eval.filter(like="BagDT", axis=1).head()
Y_BagDT_prob_neg Y_BagDT_prob_pos Y_BagDT_pred
31 0.462173 0.537827 1
856 0.032527 0.967473 1
802 0.660871 0.339129 0
205 0.967945 0.032055 0
680 0.915283 0.084717 0

We do the same for the test set (this time we do not use insert):

Hide the code
# Test predictions dataset
dfTS_eval = XTS.copy()
dfTS_eval['Y'] = YTS 
newCol = 'Y_'+ model_name +'_prob_neg'; 
dfTS_eval[newCol] = model.predict_proba(XTS)[:, 0]
newCol = 'Y_'+ model_name +'_prob_pos'; 
dfTS_eval[newCol] = model.predict_proba(XTS)[:, 1]
newCol = 'Y_'+ model_name +'_pred'; 
dfTS_eval[newCol] = model.predict(XTS)
Model Dictionary

We add this as the first model to our model dictionary for comparison. Remember the code from the previous session to see how to use such a dictionary of models.

Hide the code
modelDict = {"BagDT" : {"model" : model, "inputs" : inputs}}

Exercise 001
  • Try setting a number of trees ten times bigger and get the scores for that model. What happens?
  • Set back the number of trees to 200 and change the max depth value to 6. What happens now?
  • Play a bit more with the number of trees in the bag and their depths.
  • Now open the code from the previous session, and using it fit the best decision tree model that you can find to this dataset. Find the scores and store the predictions of this model in dfTR, dfTS.
Hide the code
#%load "../exclude/MLMIINprv/exercises/2_5_Exercise001.py"
# %run -i "../exclude/MLMIINprv/exercises/2_5_Exercise001.py"

Understanding the Structure of the Bagging Classifier

This bagging classifier is, at its core, a collection of resamples of the training set and decision trees fitted to each such resample. The code below shows how to examine these components.

For example, the first decision tree in our bag of trees is obtained with:

Hide the code
BagDT.estimators_[0]
DecisionTreeClassifier(max_depth=6, random_state=1028862084)

This tree was trained using a subsample whose indices you can see here (only the first few are shown).

Hide the code
BagDT.estimators_samples_[0][0:50]
array([132, 220, 266, 791, 594, 249, 300, 365, 292,  74, 244, 316, 472,
       363, 770, 191, 615,  65, 499,  53, 466, 412, 670, 228, 413,  60,
        13, 487, 331,  83, 721, 376,  69,  33, 345,  86, 303, 359, 549,
       412, 742, 387, 344, 774, 129, 643, 465, 395, 488, 612])
Exercise 002

We said before that there was about one third of the elements of the training set missing from this resample. Can you check that statement?

Bonus points from the Maths Department: can you guess why?

Hide the code
# %load "../exclude/MLMIINprv/exercises/2_5_Exercise002.py"
# %run -i "../exclude/MLMIINprv/exercises/2_5_Exercise002.py"
The out-of-bag score (oob_score) and the bootstrap samples.

We have just said that about one third of the elements of the training set are missing from each resample. This means that we can use precisely those elements as a validation set to measure the model's generalization performance. This is the idea behind the oob_score parameter. In a way, the bagging classifier comes with a built-in cross-validation mechanism (note that there is no cv parameter in BaggingClassifier).

Hide the code
BagDT.score(XTR, YTR), BagDT.score(XTS, YTS), BagDT.oob_score_
(0.9775, 0.875, 0.8875)

The Problem with Bagging

We have a biodiversity problem lurking…

We turned to bagging hoping to reduce the variance associated with different samples.

Exercise 003

Here is an even simpler proposal compared to bagging: take a large number of identical copies of the training set (not resamples!), train a decision tree in each copy, and average the predictions. Is this a good idea?

Variance Reduction and Independence

A basic principle in Statistics is that to decrease variance by averaging you need independent measurements (or as close to independent as possible). That is why the identical-copies idea is just a waste of time and resources. What about our bag of trees? How independent are they? The following fragment of code plots the first splits of the first 12 trees in the bag. Run the code and look at them closely (you can see them here as well).

Hide the code
from sklearn.tree import plot_tree

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12, 16), dpi = 300)

for i in range(12):    
    DT = BagDT.estimators_[i]
    plot_tree(DT, max_depth=2, ax=axes[i//3, i%3])

It feels as if we were aiming for this jungle-like rich diversity

A jungle of diverse trees.

but we stumbled upon this not at all diverse plantation of quasi-identical trees

A non-diverse plantation of quasi-identical trees.
Source: Wikimedia Commons


Our Bag of Trees Looks Like a Palm Plantation, not Like a Rain Forest.

Even though they are trained on different resamples, these trees seem to be asking very similar questions, at least in the first splits. How does this show up in their predictions, specifically in terms of probabilities?

Keep in mind that these models are trained on different resamples. To be able to make a sensible comparison we need a common set, and we will use the test set for this.

We create an array of the right size beforehand and use it to store the probabilities.

Hide the code
test_bag_probs_array = np.zeros([YTS.shape[0], n_trees])

for i in range(n_trees):
    probs_i = BagDT.estimators_[i].predict_proba(XTS.values)[:,1]
    test_bag_probs_array[:, i] = probs_i

This is what the first few look like:

Hide the code
test_bag_probs_array[0:5, 0:5]
array([[0.01149425, 0.        , 0.        , 0.        , 0.        ],
       [1.        , 0.98760331, 1.        , 1.        , 0.99530516],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.88235294, 0.98760331, 0.89830508, 0.82882883, 0.95698925],
       [0.88235294, 0.98760331, 0.89830508, 1.        , 0.99530516]])
Exercise 004

What is the size of this array? What does a column represent? And a row?

Hide the code
# %load "../exclude/MLMIINprv/exercises/2_5_Exercise004.py"

test_bag_probs_array.shape

# A row represents the predictions of the n_trees trees for a single observation
# A column represents the predictions of a single tree for all 200 test set observations

Correlated Predictions.

A simple way to check the similarity of these predictions is to compute the correlation of the columns (read the docs!). As you can see below, the correlations are really high! It seems like the trees are essentially making very similar predictions, far from the independence that we were hoping would reduce variance.

Hide the code

pd.DataFrame(test_bag_probs_array, columns=["T"+ str(i) for i in range(n_trees) ]).corr().iloc[0:10, 0:10]
print(f'The typical correlation between the predictions for two of these trees is {pd.DataFrame(test_bag_probs_array, columns=["T"+ str(i) for i in range(n_trees) ]).corr().median().median():.{2}f}')
The typical correlation between the predictions for two of these trees is 0.65
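We can quantify why these correlations hurt: for n equally noisy predictions with pairwise correlation ρ, the variance of their average is σ²(1/n + ρ(n−1)/n), so the correlation puts a floor on the achievable variance reduction. A small simulation, separate from the trees above and with made-up numbers, illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 20000   # 50 "predictors" averaged, 20000 simulated repetitions

# Independent measurements: the variance of the mean shrinks like 1/n
indep = rng.standard_normal((reps, n))
print(indep.mean(axis=1).var())   # close to 1/50 = 0.02

# Correlated measurements: a shared component gives pairwise correlation 0.8,
# and the variance of the mean stays near 0.8 no matter how large n is
shared = rng.standard_normal((reps, 1))
corr = np.sqrt(0.8) * shared + np.sqrt(0.2) * rng.standard_normal((reps, n))
print(corr.mean(axis=1).var())    # close to 0.8 + 0.2/50 = 0.804
```

With ρ around 0.65, as measured above, averaging ever more trees cannot push the ensemble variance much below 0.65σ².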
Exercise 005

As we did before, play with the number of trees and their depths and see how this impacts on the correlations above.

Exercise 006

Before we move on, suppose that you use a bag of ten trees with a fixed max depth. What is the difference (or differences) between doing this and using 10-fold cross-validation with decision trees of the same fixed depth?

Exercise 007

Do a performance analysis for the bagging classifier with n_trees = 200 and md = 4 (include confusion matrices, ROC curves, calibration and probability histograms for both training and test sets). Make an initial global assessment of this model's performance (before we compare it to other models).

Hide the code
# %load "../exclude/MLMIINprv/exercises/2_5_Exercise007.py"
# %run -i "../exclude/MLMIINprv/exercises/2_5_Exercise007.py"

Random Forests


Decorrelating the Predictions.

Why did this palm tree plantation phenomenon happen with bagging? One possible explanation, discussed in (James et al. 2023, sec. 8.2.2), is that the presence of strong predictors among the inputs drives the trees in too similar ways. Another factor is the bootstrapping (resampling) process itself.

Random Forest classifiers are based on the same idea as bagging, but they try to lead us back to the jungle by tweaking the split selection procedure used by the collection of trees. This is done by:

  • continue to use resamples as training sets
  • force each tree to use a small, randomly selected subset of the inputs at every split.
Random Forests in Scikit.

The code below shows how to implement a random forest classifier with scikit. For this part of the discussion we are keeping the depth of the trees equal to the one we used in bagging, so that you can focus on the difference caused by the random selection of inputs at the splits.

Parallelism

Ask us about the n_jobs=-1 option in this code.

Hide the code
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators=n_trees,
                            max_depth=md,
                            max_features="sqrt",
                            random_state=1,
                            n_jobs=-1)

Let us fit the model.

Hide the code
RF.fit(XTR, YTR)
RandomForestClassifier(max_depth=6, n_estimators=2500, n_jobs=-1,
                       random_state=1)

The scores in training and test are:

Hide the code
RF.score(XTR, YTR), RF.score(XTS, YTS)
(0.96125, 0.86)

This does not seem to be a major improvement over bagging.

Hide the code
BagDT.score(XTR, YTR), BagDT.score(XTS, YTS)
(0.9775, 0.875)

But the Trees Look Much More Promising

The estimators (individual trees) of the random forest can be accessed like we did before. And so we use similar code to check the trees:

Hide the code
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12, 16), dpi = 300)

for i in range(12):
    # fig, axes = plt.subplots(figsize=(3, 4), dpi = 300)
    DT = RF.estimators_[i]
    plot_tree(DT, max_depth=1, ax=axes[i//3, i%3])

What about the probabilities predicted?

Again, this is the same as we did before.

Hide the code
test_RF_probs_array = np.zeros([YTS.shape[0], n_trees])

for i in range(n_trees):
    probs_i = RF.estimators_[i].predict_proba(XTS.values)[:,1]
    test_RF_probs_array[:, i] = probs_i
Hide the code
test_RF_probs_array[0:5, 0:5]
array([[0.1025641 , 1.        , 0.        , 0.01980198, 0.        ],
       [1.        , 1.        , 1.        , 0.63265306, 1.        ],
       [0.17073171, 0.84552846, 0.        , 0.        , 0.        ],
       [0.96551724, 0.84552846, 0.94615385, 0.01980198, 0.54166667],
       [0.        , 1.        , 1.        , 0.95454545, 1.        ]])

And the correlation matrix confirms that we have succeeded in decreasing the correlations significantly.

Hide the code
pd.DataFrame(test_RF_probs_array, columns=["T"+ str(i) for i in range(n_trees) ]).corr().iloc[0:10, 0:10]
print(f'The typical correlation between the predictions for two of these trees is {pd.DataFrame(test_RF_probs_array, columns=["T"+ str(i) for i in range(n_trees) ]).corr().median().median():.{2}f}')
The typical correlation between the predictions for two of these trees is 0.49

Hyperparameter Selection for Random Forest

The main idea behind the random forest method is that the combined work of sufficiently diverse and sufficiently many decision trees improves the performance and robustness of the resulting ensemble classifier. In general, the more trees in the forest, the more robust the classifier is, at the expense of higher training times. Here we will settle for a number of trees that keeps training time moderate and we will play with some of the hyperparameters of the individual trees. In many cases the default values for random forests are good enough, but we do this to give you another example of grid search for hyperparameter selection.

We use the minimum number of samples that a terminal node should contain and the total depth of the tree as hyperparameters. We include a special use of the n_estimators parameter (the number of trees in the forest), in which we provide a single fixed value of that hyperparameter. By commenting and uncommenting lines in the cell below you can explore the impact of the number of trees on the performance of the random forest.

Hide the code
hyp_grid = {'RF__min_samples_leaf':range(5, 8),
            # 'RF__n_estimators': range(100, 1001, 100),
            'RF__max_depth': range(1, 6)}
Fitting the Grid Search

Constructing and fitting the grid search is as usual. Note that we set here the number of features and the maximum number of variables to select randomly at each split.

Hide the code
RF = RandomForestClassifier(random_state=1, max_features="sqrt", n_estimators=500)

from sklearn.pipeline import Pipeline
RF_pipe = Pipeline(steps=[('RF', RF)]) 

num_folds = 10

from sklearn.model_selection import GridSearchCV

RF_gridCV = GridSearchCV(estimator=RF_pipe, 
                        param_grid=hyp_grid, 
                        cv=num_folds,
                        return_train_score=True,
                        n_jobs=-1)

Give the model a name and fit it.

Hide the code
model_name = "RF"
model = RF_gridCV
model.fit(XTR, YTR)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('RF',
                                        RandomForestClassifier(n_estimators=500,
                                                               random_state=1))]),
             n_jobs=-1,
             param_grid={'RF__max_depth': range(1, 6),
                         'RF__min_samples_leaf': range(5, 8)},
             return_train_score=True)
Model Dictionary

We add this to our model dictionary for comparison.

Hide the code
modelDict[model_name] = {"model" : model, "inputs" : inputs}

Comparison with Bagging

The results below show an improvement of the predictive performance, while keeping overfitting under control.

Hide the code
model.score(XTR, YTR), model.score(XTS, YTS)
(0.93, 0.87)
Hide the code
BagDT.score(XTR, YTR), BagDT.score(XTS, YTS)
(0.9775, 0.875)

Examining the Grid Search

Let us see the selected hyperparameter values.

Hide the code
print(hyp_grid)
for param in hyp_grid.keys():
    print(f' Best value for {param}  = {model.best_params_[param]}')
{'RF__min_samples_leaf': range(5, 8), 'RF__max_depth': range(1, 6)}
 Best value for RF__min_samples_leaf  = 6
 Best value for RF__max_depth  = 5

The following script plots a graph that illustrates the hyperparameter search behind this choice.

Hide the code
%run -i "./2_5_GridSearch_Plot.py"

The diagram shows that the highest accuracy was obtained with the deepest trees in the grid. We may therefore suspect that growing deeper trees can lead to further accuracy improvements. But we have to keep in mind the risk of overfitting that comes with deeper trees. To keep it in check we should probably increase the number of estimators, which would in turn result in higher training times. Hyperparameter tuning is always a tradeoff between several aspects of modeling.

Exercise 008

Do a performance analysis for this random forest model like we did before for bagging. And begin thinking about model comparison. We will return to this after we discuss some other models.

Python expertise exercise: examine the code that produced that plot. Below we will see an example of grid search with a single hyperparameter. Does this code work in that case? Can you fix it?

Hide the code
# %load "../exclude/MLMIINprv/exercises/2_5_Exercise008.py"
# %run -i "../exclude/MLMIINprv/exercises/2_5_Exercise008.py"

Variable Importance

Just like the tree models they are based on, random forests provide a natural way to measure variable importance.

Hide the code
var_importances = pd.DataFrame({'var':XTR.columns, 
                                'importance': model.best_estimator_.named_steps["RF"].feature_importances_}
                                ).sort_values(by="importance", ascending = False).set_index('var')

var_importances
importance
var
X0 0.281304
X16 0.138379
X18 0.129207
X17 0.119175
X7 0.112898
X14 0.064446
X2 0.036064
X1 0.014357
X6 0.011714
X3 0.010854
X15 0.010737
X11 0.009104
X5 0.009035
X13 0.008465
X12 0.007889
X19 0.007722
X8 0.007435
X10 0.007238
X4 0.007041
X9 0.006937
Exercise 009

Use this table to find out which are the most important variables that account for 75% of the total importance.

Hide the code
# %load "../exclude/MLMIINprv/exercises/2_5_Exercise009.py"
# %run -i "../exclude/MLMIINprv/exercises/2_5_Exercise009.py"

Other Variants of Bagging

The bagging method uses resampling of the data (row resampling) to train the models, in order to improve performance in terms of the bias-variance tradeoff. As we have seen, the main limitation of this method is that the trees in the bag tend to be very similar. In contrast, in random forests we resample both the data (rows) and the features (columns). And, importantly, we do the feature resampling at the split level: each split considers a different random subset of features.

There are other variants of bagging that use different resampling strategies that can be considered midway between bagging and random forests. One of them is the Random Subspaces method, in which we take a random resample of the features for every tree in the bag, but we then use this set of variables for all the splits in the tree. Another one is the Random Patches method, in which we resample both the data and the variables, but again the random resample of the features is fixed for all the splits in each tree we train.

Bagging Variants.

The BaggingClassifier class in scikit-learn allows you to use these variants as follows, by selecting the values of max_features and max_samples.

  • max_features parameter: the fraction of features used by a particular tree in the ensemble (the same set of features is used for all splits of that tree). The default value is 1, which means that every tree uses all the features.
  • max_samples parameter: the fraction of samples used by a particular tree in the ensemble. The default value is 1, which means that every tree uses all the samples.

That means that if you set max_samples to 1 while setting max_features to a value between 0 and 1, you will be using the Random Subspaces method. If you set both max_samples and max_features to a value between 0 and 1, you will be using the Random Patches method, as illustrated above. The basic bagging method is obtained by setting max_features to 1 while setting max_samples to a value between 0 and 1.

There are even more variants of bagging that can be implemented in scikit-learn. For example, in the preceding paragraphs we are assuming that sampling is done with replacement. But you can also use sampling without replacement, or even use a different sampling strategy. To sample without replacement (data or features) you can set the bootstrap and/or bootstrap_features parameters to False. The choice of the sampling strategy is usually guided by the size of the dataset and the number of features, and by the computational resources available. These resampling strategies can also be incorporated as hyperparameters in a grid search, to see if they improve the model performance. We recommend reading Chapter 2 of (Kunapuli 2023) for a detailed discussion of these methods.

Where does the Random Forests method fit in this picture? Keep in mind that the Bagging, Random Subspaces and Random Patches methods can be applied for base learners (estimators in scikit parlance) other than decision trees. Random Forests, on the other hand, extend the Random Patches method by taking resampling to the split level. In particular, that requires decision trees as base learners.

Exercise 010

You may have noticed that we did not perform a grid search to select hyperparameter values for BagDT.

  • Do it now. In fact, take this opportunity to do a grid search that explores the above variants of bagging.
  • What kind of ensemble method is used by the final selected model: bagging, random subspaces, random patches?
  • How does this model compare to the random forest model?

Keep in mind that one drawback of this approach is that the BaggingClassifier does not directly provide variable importance measures. They can be obtained by examining the individual trees in the bag, but this is not as straightforward as in the case of random forests.

Hide the code
# %run -i "../exclude/MLMIINprv/exercises/2_5_Exercise010.py"
# %load "../exclude/MLMIINprv/exercises/2_5_Exercise010.py"



Introduction to Boosting Methods.



Sequential Ensemble Methods

The bagging and random forest methods that we have discussed in this session are examples of parallel ensemble methods: both train many learners on different resamples of the training data and combine/average their results to obtain a stronger model. The difference between them is the additional injection of randomness in the splitting achieved by random forests. In this section we introduce sequential ensemble methods, in which the models are trained one after another and each new model tries to correct the errors of the previous ones. These methods use the boosting technique, a general strategy for improving the performance of a weak learner by combining it with other weak learners. Boosting models also obtain their predictions from an ensemble of simple models, often called weak models (typically shallow decision trees), but instead of resampling the training set we grow the models in the ensemble sequentially, so that each model focuses on, and tries to correct, the mistakes made by the preceding one. The main difference between boosting methods lies in how this error correction is implemented, and they are accordingly divided into two major categories:

  • Adaptive boosting: AdaBoost and related methods, the first generation of boosting.
  • Gradient boosting: LightGBM, XGBoost, CatBoost, etc. that represent a newer approach to boosting.

The Ensemble Taxonomy figure below (also from (Kunapuli 2023)) shows the difference between parallel and sequential ensemble methods.

Ensemble Methods for Machine Learning

Reference

Our discussion in this notebook follows closely the exposition in Chapters 4 to 6 of (Kunapuli 2023). In fact, we will use some of the examples in this book to illustrate the concepts. We strongly recommend Kunapuli's book if you are looking for a recent, thorough and clearly written introduction to ensemble methods.


Sequential Adaptive Boosting. AdaBoost.

AdaBoost guided example

AdaBoost was introduced by Freund and Schapire in 1995 (see Freund and Schapire (1995)). It works as follows: after training each model we modify the way that the next model sees the training set. We do that by making the misclassified samples bigger in a sense, by giving them more weight. That size or weight drives the next model to pay extra attention to them. The following example (Kunapuli 2023, section 4.2.1) illustrates this idea. The weak learners in this case are decision stumps, i.e., decision trees with only one split: a single yes/no question that divides the feature space into two regions, either horizontally or vertically.

Let us begin by loading the basic libraries.

Important

In Boosting methods (and in some other Machine Learning methods) it is customary to use 1 and -1 as the target values for binary classification problems. This is because these methods use the sign of the predictions to classify the samples (you may think of 0 as defining the decision boundary that separates the two classes; this is not entirely true, as we will see later, but it is a useful guiding principle).

This is the reason why we convert the 0/1 labels in our dataset to -1/1 in the following code.

Hide the code
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.05, random_state=13)

y = 2 * y - 1  # Convert labels from [0, 1] to [-1, 1]
Hide the code

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as col

n_samples, n_features = X.shape
ensemble = []                                   # Initialize an empty ensemble

# cm.get_cmap('Blues') was deprecated; use the colormap registry instead
cmap = mpl.colormaps.get_cmap('Blues')
colors = cmap(np.linspace(0, 0.5, num=2))

fig, ax = plt.subplots(1, 1, figsize=(12, 6))

ax.scatter(X[y <= 0, 0], X[y <= 0, 1], marker='o', c=col.rgb2hex(colors[0]), edgecolors='k', alpha=0.5)
ax.scatter(X[y > 0, 0], X[y > 0, 1], marker='s', c=col.rgb2hex(colors[1]), edgecolors='k', alpha=0.5)
ax.set_title('Initial classification problem for Adaboost')
ax.set_xticks([])
ax.set_yticks([])
plt.show();plt.close()

In this example we will train an AdaBoost ensemble model with only three stumps, using sklearn. Keep in mind that in practice you will use more stumps, and you will need to tune the hyperparameters of the model. Also it is worth mentioning that the AdaBoostClassifier in sklearn uses decision trees as weak learners by default.

Hide the code
from sklearn.tree import DecisionTreeClassifier

n_estimators = 3
h = DecisionTreeClassifier(max_depth=1)     # Initialize a decision tree stump

Next we create an array to store the weights assigned by the AdaBoost iterations to the samples. Initially the weight of all samples is the same, \(1/n\), where \(n\) is the number of samples in the training set. We also create an empty list to hold the stumps (weak learners) as we train them.

Hide the code
D = np.ones((n_samples, ))        
D = D / np.sum(D)                           

ensemble = []
First step: Fit the first weak learner (stump)

Now we fit the first weak learner, a stump, using these sample weights to drive the model to pay more attention to the misclassified samples. In this first step, of course, all samples are equally weighted, so the model makes no distinctions yet. We will plot the stump and the samples in the feature space. We will also use the predictions of the stump to compute the weighted error of the model, which is the sum of the weights of the misclassified samples. In this first step, this is still the same as the unweighted error rate of the model.

Hide the code
from sklearn.metrics import accuracy_score

h.fit(X, y, sample_weight=D)                        # Train the stump using the sample weights

ypred = h.predict(X)                                # Predictions on the training set

e = 1 - accuracy_score(y, ypred, sample_weight=D)   # Weighted error of the weak learner

The stump is also assigned a weight of its own. Then the stump and its weight are added to the ensemble. When the AdaBoost iterations are completed (all stumps trained and weighted) we will use the ensemble to make predictions on the test set by adding up the predictions of all stumps, each one weighted by its own weight (so more reliable stumps will have a bigger say in the final prediction).

The weight of the stump is computed as \[a = \dfrac{1}{2} \log \left( \dfrac{1 - \text{weighted error of the samples}}{\text{weighted error of the samples}} \right).\] We will return to this formula later.
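
A quick numerical sketch (ours, not part of the book's code) makes the behavior of this formula tangible: a stump that is much better than random gets a large positive weight, a coin-flip stump (weighted error 0.5) gets weight zero, and a worse-than-random stump gets a negative weight, so its vote is effectively inverted.

```python
import numpy as np

# Stump weight a = 0.5 * log((1 - e) / e) for several weighted errors e.
for e in [0.1, 0.3, 0.5, 0.7, 0.9]:
    a = 0.5 * np.log((1 - e) / e)
    print(f"e = {e:.1f}  ->  a = {a:+.3f}")
```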

Hide the code
a = 0.5 * np.log((1 - e) / e)     # Weak learner weight
print(a)
0.8673005276940532

The AdaBoost algorithm next updates the weights of the samples. The weight of the misclassified samples is increased, and the weight of the correctly classified samples is decreased, according to the following rule:

  1. Increase the weight of misclassified examples to \(D_i e^{\alpha_t}\).
  2. Decrease the weight of correctly classified examples to \(\dfrac{D_i}{e^{\alpha_t}}\).

In the code below m is used to identify correctly classified and misclassified points and assign them 1 or -1 labels. After the weights of the stump and the samples are updated, we can add the stump to the ensemble.

Hide the code
m = (y == ypred) * 1 + (y != ypred) * -1    

D = D * np.exp(- a * m)   # Update the sample weights

ensemble.append((a, h))    # Add the weak learner to the ensemble 

Let us visualize the decision boundary corresponding to the first stump and the weights assigned by this first iteration to the samples. We will use the following code to assign different-sized markers to the samples according to their weights. We also compute the error (1 - accuracy) of the predictions of the first stump (as a percent).

Hide the code
s = D / np.max(D)
s[(0.00 <= s) & (s < 0.25)] = 16
s[(0.25 <= s) & (s < 0.50)] = 32
s[(0.50 <= s) & (s < 0.75)] = 64
s[(0.75 <= s) & (s <= 1.00)] = 128

err = (1 - accuracy_score(y, ypred)) * 100

Now we can plot the decision boundary and the samples with their updated weights:

Hide the code
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 6))

ax.scatter(X[y <= 0, 0], X[y <= 0, 1], s=s[y <= 0], marker='o', c=col.rgb2hex(colors[0]), edgecolors='k', alpha=0.5)
ax.scatter(X[y > 0, 0], X[y > 0, 1], s=s[y > 0], marker='s', c=col.rgb2hex(colors[1]), edgecolors='k', alpha=0.5)
ax.set_xticks([])
ax.set_yticks([])
title = f'First Weak Learner (error = {err:3.1f}%, weight of this stump = {a:3.2f})'
plot_2d_classifier(ax, X, y, predict_function=h.predict, 
                    alpha=0.25, xlabel=None, ylabel=None, 
                    title=title, colormap='Blues')

pos_err = (y > 0) & (y != ypred)
pos_cor = (y > 0) & (y == ypred)
neg_err = (y <= 0) & (y != ypred)
neg_cor = (y <= 0) & (y == ypred)
ax.scatter(X[neg_err, 0], X[neg_err, 1], marker='o', c=col.rgb2hex(colors[0]), edgecolors='k', s=80)
ax.scatter(X[pos_err, 0], X[pos_err, 1], marker='s', c=col.rgb2hex(colors[1]), edgecolors='k', s=80)
ax.set_xticks([])
ax.set_yticks([])
plt.show();plt.close()

Next iterations: fit the following weak learners (stumps)

Now we proceed to fit the next stumps, iterating through the following steps:

  1. Train a weak learner \(h_k(x)\) using the weights assigned to the samples in previous steps.
  2. Compute the training error \(e\) of this weak learner and use it to compute the weight \(a\) of the weak learner.
  3. Update the weights \(D\) of the training examples, with the rule we have seen above (increase the weight of misclassified examples, decrease the weight of well-classified ones).

The following code (adapted from Kunapuli (2023)) shows the next two steps of the AdaBoost algorithm. We train two more stumps (recall we set n_estimators = 3 above) and plot the decision boundaries and the weights of the samples after each iteration. Notice that the classification error of the stumps can be quite high. Still, the ensemble model will be able to classify the samples correctly thanks to the weights assigned to the stumps.

Hide the code
for k in range(1, n_estimators):
    
    # -- Plot the training examples in different sizes proportional to their weights
    s = D / np.max(D)
    s[(0.00 <= s) & (s < 0.25)] = 16
    s[(0.25 <= s) & (s < 0.50)] = 32
    s[(0.50 <= s) & (s < 0.75)] = 64
    s[(0.75 <= s) & (s <= 1.00)] = 128

    h = DecisionTreeClassifier(max_depth=1)  # Initialize a decision stump
    h.fit(X, y, sample_weight=D)                # Train a weak learner using sample weights
    ypred = h.predict(X)                        # Predict using the weak learner

    e = 1 - accuracy_score(y, ypred, sample_weight=D)   # Weighted error of the weak learner
    a = 0.5 * np.log((1 - e) / e)                       # Weak learner weight
    m = (y == ypred) * 1 + (y != ypred) * -1    # Identify correctly classified and misclassified points
    D *= np.exp(-a * m)                         # Update the sample weights

    # -- Plot the (first and last, no more than 10) individual weak learner
    if ((k < 5) or (n_estimators - k < 5)):
        fig, ax = plt.subplots(figsize=(12, 6))

        err = (1 - accuracy_score(y, ypred)) * 100
        title = f'Iteration {k + 1}: Weak Learner (error = {err:3.1f}%, weight of this stump = {a:3.2f})'
        plot_2d_classifier(ax, X, y, predict_function=h.predict, 
                        alpha=0.25, xlabel=None, ylabel=None, 
                        title=title, colormap='Blues')

        pos_err = (y > 0) & (y != ypred)
        pos_cor = (y > 0) & (y == ypred)
        neg_err = (y <= 0) & (y != ypred)
        neg_cor = (y <= 0) & (y == ypred)
        ax.scatter(X[neg_err, 0], X[neg_err, 1], marker='o', c=col.rgb2hex(colors[0]), edgecolors='k', s=80)
        ax.scatter(X[pos_err, 0], X[pos_err, 1], marker='s', c=col.rgb2hex(colors[1]), edgecolors='k', s=80)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.show();plt.close()
        # --


    ensemble.append((a, h))                     # Save the weighted weak hypothesis

Combining the stumps to get the ensemble model predictions

Finally, when we stop training weak learners, their predictions are combined, taking into account the quality of each of the models: the opinion of a good classifier gets a high positive score, the opinion of a random classifier counts zero, and a worse-than-random one gets a negative score (so that it is still actually helpful!). The precise formulation of this combination is given by the formula

\(f(x) = \text{sign} \left( F(x) \right) = \text{sign}\left(\sum_{k=1}^{K} a_k h_k(x)\right),\)

where \(f(x)\) is the prediction of the ensemble model (either -1 or 1), \(a_k\) is the weight of the \(k\)-th weak learner and \(h_k(x)\) is the prediction of the \(k\)-th weak learner. Here \(F(x)\) represents the confidence score of the ensemble model. The larger \(|F(x)|\) is, the higher the confidence in the classification:

  • When \(F(x) \gg 0\) the model is very confident that \(x\) belongs to class +1 and, conversely, when \(F(x) \ll 0\) the model is very confident that \(x\) belongs to class -1.
  • If \(F(x) \approx 0\), the model is uncertain about the classification.

The code below does precisely this: it computes the predictions of the ensemble model on the test set and plots the decision boundaries of the ensemble model.

Hide the code
def predict_boosting(X, estimators):
    pred = np.zeros((X.shape[0], ))

    for a, h in estimators:
        pred += a * h.predict(X)
    y = np.sign(pred)

    return y

ypred = predict_boosting(X, ensemble)
err = (1 - accuracy_score(y, ypred)) * 100

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
plot_2d_classifier(ax, X, y, predict_function=predict_boosting, predict_args=(ensemble), 
                   boundary_level=[0.0], alpha=0.25, xlabel='$x_1$', ylabel='$x_2$', s=80,
                   title=title, colormap='Blues')

fig.tight_layout()
ax.set_title(f'Overall ensemble (error = {err:3.1f}%)', fontsize=12)
Text(0.5, 1.0, 'Overall ensemble (error = 9.0%)')

Exercise 011
  • Go back to the top of the Adaboost code and set the number of stumps to 10. What happens now?
  • Set it to 100 and look at the decision boundary and classification error of the ensemble model. Any thoughts? What is going on?
AdaBoost and outliers. LogitBoost.

AdaBoost is well known to be highly sensitive to noisy data and outliers. The reason why outliers are a problem is that they are often misclassified by the weak learners, and AdaBoost will try to correct these mistakes by giving them more weight. This can also be a cause of overfitting in AdaBoost models.

This discussion is connected with the weak learner weight formula that we saw before \[a = \dfrac{1}{2} \log \left( \dfrac{1 - \text{weighted error of the samples}}{\text{weighted error of the samples}} \right).\] This formula is in turn based on the exponential loss function, one of the possible loss functions for classification used to fit Machine Learning models. There are other adaptive boosting ensemble methods that use different loss functions, like the logistic loss function, which is used in the LogitBoost method. We refer to (Section 4.5 of Kunapuli 2023) for a detailed discussion of LogitBoost and related methods.
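
The practical difference between these two losses can be sketched numerically (our illustration, using the margin \(m = y\,F(x)\)): for a badly misclassified point, such as an outlier, the exponential loss \(e^{-m}\) explodes, while the logistic loss \(\log(1 + e^{-m})\) grows only roughly linearly, which is why LogitBoost is much gentler with outliers.

```python
import numpy as np

# Margin m = y * F(x): large negative values correspond to badly
# misclassified points (e.g. outliers).
margins = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
exp_loss = np.exp(-margins)                # exponential loss (AdaBoost)
log_loss = np.log(1 + np.exp(-margins))    # logistic loss (LogitBoost)

for m, el, ll in zip(margins, exp_loss, log_loss):
    print(f"m = {m:+.0f}:  exponential = {el:8.3f}   logistic = {ll:6.3f}")
```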

AdaBoost and overfitting. Learning rate and early stopping.

Using AdaBoost allows the combined force of weak learners to result in a strong model. Even though the stumps can only produce very simple linear boundaries (horizontal or vertical), the ensemble model is able to learn complex nonlinear boundaries, and it is less prone to overfitting than other models. That is not to say that AdaBoost is immune to overfitting, which can eventually appear as we increase the number of stumps.

One way to keep overfitting under control is to add a new hyperparameter to the model, the learning rate \(\eta\). This parameter controls the contribution of each weak learner to the ensemble model. The predictions of the weak learners are multiplied by \(\eta\) before being added to the ensemble model. This way, the learning rate can be used to slow down the learning process, and to prevent the model from overfitting. The precise way in which this is done is given by the formula

\[F_{k + 1}(x) = F_k(x) + \eta\,a_k\,h_k(x)\]

where \(h_k(x)\) is the (-1, 1) prediction of the \(k\)-th stump, \(a_k\) is the weight of that stump and \(F_k(x)\) is the confidence score of the ensemble model after the \(k\)-th iteration. This iterative expression replaces the basic computation of \(F(x)\) we saw before. The learning rate, as usual with hyperparameters, needs to be tuned. Note that the effect of \(\eta\) is multiplicative: later stumps in the algorithm are affected by higher powers of \(\eta\). That means that smaller values of \(\eta\) will slow down the learning process, trying to prevent the model from overfitting the training data.

Another strategy to prevent overfitting is to use early stopping. This means that we stop the training process as soon as we detect that the classification rate has not improved in the last \(q\) iterations.

Both the use of learning rate and early stopping belong to the regularization set of tools that are used with many Machine Learning models.

AdaBoost in Scikit.

The example above showed a manual implementation of AdaBoost to illustrate the inner workings of the algorithm. In real problems we would use the AdaBoostClassifier class in sklearn. In this section we will see how to use it to train AdaBoost-based ensemble models. Let us see how this is done with the same dataset that we used for bagging and random forests.

Hide the code
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

N = 1000


X, Y = make_classification(n_classes=2, n_samples=N, n_informative=5, random_state=1)

inputs = ["X" + str(k) for k in range(X.shape[1])]
output = "Y"

df = pd.DataFrame(X, columns = inputs)
df[output] = Y
df.iloc[:,:12].head()


XTR, XTS, YTR, YTS = train_test_split(df[inputs], df[output],
                                      test_size=0.2,  # percentage preserved as test data
                                      random_state=1, # seed for replication
                                      stratify = df[output])   # Preserves distribution of y

Now we create and fit the grid search for this model with learning rate as hyperparameter. Note the way that the stumps are defined.

Hide the code
hyp_grid = {'AdB__learning_rate': np.linspace(0.01, 1, 11)} 
hyp_grid

n_trees = 100

stump = DecisionTreeClassifier(max_depth=1)

AdB = AdaBoostClassifier(estimator= stump,                        
                         n_estimators=n_trees,
                         algorithm='SAMME', # to prevent a deprecation warning
                         random_state=1)

AdB_pipe = Pipeline(steps=[('AdB', AdB)]) 

num_folds = 10

AdB_gridCV = GridSearchCV(estimator=AdB_pipe, 
                        param_grid=hyp_grid, 
                        cv=num_folds,
                        return_train_score=True,
                        n_jobs=-1)

model_name = "AdB"
model = AdB_gridCV
model.fit(XTR, YTR)

# Datasets for model predictions
dfTR_eval = XTR.copy()
dfTR_eval['Y'] = YTR 

dfTS_eval = XTS.copy()
dfTS_eval['Y'] = YTS 

We can now add the model to our dictionary as usual.

Hide the code
modelDict = {model_name : {"model" : model, "inputs" : inputs}}
First Checks for the AdaBoost Results

Let us see the scores achieved by this model.

Hide the code
model.score(XTR, YTR), model.score(XTS, YTS)
(0.85875, 0.845)

As you can see, it is a decent performance without much tuning. And no overfitting seems to be happening. Let us visualize the hyperparameter search.

Hide the code
param_name = list(hyp_grid.keys())[0]
param_values = hyp_grid[param_name]

mean_train_scores = model.cv_results_['mean_train_score']
plt.plot(param_values, 1 - mean_train_scores, marker='o', label='Mean Train Score')

mean_test_scores = model.cv_results_['mean_test_score']
plt.plot(param_values, 1 - mean_test_scores, marker='*', label='Mean Validation Score')

plt.xlabel('Learning Rate')
plt.ylabel('Error (1 - accuracy)')
plt.title('Hyperparameter Tuning Results')
plt.legend()  # Add a legend to the plot
plt.show()
plt.close()

The selected learning rate is:

Hide the code
model.best_params_
{'AdB__learning_rate': np.float64(0.30700000000000005)}
Exercise 012

In AdaBoost the learning rate and the number of stumps are related, as there is an interplay between the two. Add n_trees as a new hyperparameter to the grid search (you only need to modify the hyp_grid for this) and try to find the best combination of these two hyperparameters.
Note: after doing that the grid search plot code will not work as it is. You will need to modify it if you want to plot the results of the new grid search.

Visualizing the stumps

Remember that this is an ensemble of stumps. Run the code below to visualize the first few of them.

Hide the code
from sklearn.tree import plot_tree

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(4, 3), dpi = 300)

for i in range(6):    
    DT = AdB_gridCV.best_estimator_.named_steps["AdB"].estimators_[i]
    plot_tree(DT, max_depth=2, ax=axes[i//3, i%3])

Exercise 013

Do a performance analysis for this new model. We have already seen many examples so running the code for this should be easy!

Hide the code
# %load "../exclude/MLMIINprv/exercises/2_5_Exercise013.py"
# %run -i "../exclude/MLMIINprv/exercises/2_5_Exercise013.py"

Gradient Boosting

Introduction to Gradient Boosting

Gradient boosting methods are similar to AdaBoost in that they try to sequentially improve the basic models they train. But they do this in a different way, inspired by the gradient descent method from optimization theory. In gradient boosting methods the models are trained to sequentially minimize the loss function, mimicking the idea of gradient descent. It turns out that in most cases the gradient of the loss function depends on the residuals of the model, i.e., the differences between the labels and the current predictions \(F_k(x)\) of the model, which in boosting is chosen to be a soft classification model: \[\text{true value} - F_k(x)\] This implies that misclassified samples will have a large residual, and well-classified samples will have a small residual. This is reminiscent of the way that AdaBoost assigns weights to the samples, but in this case we are not using weights but residuals. This is illustrated in Fig. 5-8 from (Kunapuli 2023).

Kunapali Fig 5.8

So the basic idea is to fit a model to the residuals of the preceding model, much like every model in AdaBoost updated the weights it received from the previous model. But the criterion in AdaBoost was the misclassification of the samples, while in gradient boosting what we have are the residuals of the model. The key insight is this: gradient boosting models fit a weak learner to these residuals in order to approximate the gradient of the loss function. Each sample residual can be considered an approximate loss gradient, and the weak model combines them to take the next step in gradient descent. So instead of a weight update we have an approximate-gradient update. The model update equation is now analogous to what we have seen for AdaBoost:

\[F_{k + 1}(x) = F_k(x) + \eta\,a_k\,h_k(x)\]

only now \(h_k(x)\) is the prediction of the \(k\)-th weak learner on the residuals of the model and the “weight” \(a_k\) is not really a weight but a measure of the step length in the gradient direction that we take in order to minimize the loss as much as possible (see (Kunapuli 2023), section 5.2.1 for a detailed discussion). The learning rate \(\eta\) is again used to control the step size in the approximate gradient descent.

From a technical point of view, stumps are used as weak learners to approximate the loss gradient. But since gradients are real numbers, these are regression stumps instead of classification stumps. The values returned by a regression tree are obtained as averages of the output variable over all the data points in a leaf, instead of using a majority vote as in classification.
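
To make the residual-fitting idea concrete, here is a minimal sketch of gradient boosting for a toy one-dimensional regression problem (our illustration, with squared loss, for which the negative gradient is exactly the residual \(y - F_k(x)\)):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

eta = 0.3                            # learning rate
F = np.zeros_like(y)                 # F_0: start from the zero model
for _ in range(50):
    residuals = y - F                # negative gradient of 0.5 * (y - F)^2
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    F = F + eta * stump.predict(X)   # F_{k+1} = F_k + eta * h_k

print("training MSE after 50 rounds:", np.mean((y - F) ** 2))
```

Each regression stump is a very crude model, but the sequence of \(\eta\)-damped corrections steadily reduces the loss, exactly as in the update formula above.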

Gradient Boosting in Scikit Learn: HistGradientBoostingClassifier

Scikit provides two classes to train gradient boosting classification ensembles: GradientBoostingClassifier and HistGradientBoostingClassifier. The first one is a general implementation of gradient boosting, while the second one is a more efficient implementation, discussed below, that uses histograms to speed up the training process. But first let us see how to apply the basic GradientBoostingClassifier model to the dataset we have been using.

Hide the code
from sklearn.ensemble import GradientBoostingClassifier

GradBoost = GradientBoostingClassifier(max_depth=1, 
                                      n_estimators=20, 
                                      learning_rate=0.75)

GradBoost_pipe = Pipeline(steps=[('GradBoost', GradBoost)]) 

num_folds = 10

hyp_grid = {'GradBoost__learning_rate': 10.**np.arange(-6, 1, 1), 
            'GradBoost__n_estimators': [20, 50, 100, 200, 500]}           

GradBoost_gridCV = GridSearchCV(estimator=GradBoost_pipe, 
                        param_grid=hyp_grid, 
                        cv=num_folds,
                        return_train_score=True,
                        n_jobs=-1)

model_name = "GradBoost"
model = GradBoost_gridCV
model.fit(XTR, YTR)

modelDict[model_name] = {"model" : model, "inputs" : inputs}
Hide the code
model.score(XTR, YTR), model.score(XTS, YTS)
(0.86875, 0.865)
Hide the code
model.best_params_
{'GradBoost__learning_rate': np.float64(0.1), 'GradBoost__n_estimators': 100}
Histogram Based Gradient Boosting in Scikit Learn: HistGradientBoostingClassifier

The problem with a naive approach to gradient boosting is that for large datasets the process of fitting many regression stumps to the model residuals in order to estimate the loss gradient can be computationally expensive (the model residuals need to be computed for each sample in the training set). The HistGradientBoostingClassifier in sklearn is a more efficient implementation of gradient boosting that uses histograms to speed up the training process. The idea is to bin the values of each feature into histograms and fit the weak learners to the histograms instead of the individual samples. This reduces the number of computations needed to fit the weak learners and makes the training process faster. The HistGradientBoostingClassifier is a good choice for large datasets with many samples and features. See Figure 5.17 of (Kunapuli 2023) for a nice illustration of the use of histograms in gradient boosting.
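
The effect of binning can be illustrated with a small, hypothetical numpy example (this is just the idea, not HistGradientBoostingClassifier's actual internals): instead of evaluating a split threshold between every pair of consecutive feature values, the histogram version only has to consider the bin edges.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)          # one continuous feature

max_bins = 10
edges = np.quantile(x, np.linspace(0, 1, max_bins + 1))[1:-1]  # internal edges
binned = np.digitize(x, edges)       # each value replaced by its bin index

print("candidate splits on raw values:", np.unique(x).size - 1)
print("candidate splits after binning:", np.unique(binned).size - 1)
```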

The implementation is very similar to the GradientBoostingClassifier class, but it has an additional hyperparameter, max_bins, that controls the number of bins used in the histograms.

Hide the code
from sklearn.ensemble import HistGradientBoostingClassifier

hyp_grid = {'HGB__learning_rate': 10.**np.arange(-6, 1, 1), 
            'HGB__max_iter':np.arange(25, 150, 25),
            'HGB__max_bins':[10, 25, 50, 100]} 

HGB = HistGradientBoostingClassifier(random_state=1)

HGB_pipe = Pipeline(steps=[('HGB', HGB)]) 

num_folds = 10

from sklearn.model_selection import GridSearchCV

HGB_gridCV = GridSearchCV(estimator=HGB_pipe, 
                        param_grid=hyp_grid, 
                        cv=num_folds,
                        return_train_score=True,
                        n_jobs=-1)

HGB_gridCV.fit(XTR, YTR)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('HGB',
                                        HistGradientBoostingClassifier(random_state=1))]),
             n_jobs=-1,
             param_grid={'HGB__learning_rate': array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00]),
                         'HGB__max_bins': [10, 25, 50, 100],
                         'HGB__max_iter': array([ 25,  50,  75, 100, 125])},
             return_train_score=True)
Exercise 014

Finish the job with this model:

  • Add the model to the dictionary.
  • Find the scores for training and test sets.
  • What are the best parameters selected by the model?
  • Visualize the hyperparameter grid search. Hint: run the same script we used before.
  • Do a performance analysis of the model.

And remember, for this and all subsequent exercises: this is not about copy-pasting and running code and getting plots. Your interpretation of the results is the really important part of the task.

Exercise 015

In this and the preceding session we have now trained several different classification models for this dataset. Compare the models! Which one would you choose and why? What are the pros and cons of each model?

LightGBM and XGBoost

Outside of scikit there are some very powerful gradient boosting libraries, like LightGBM and XGBoost. These libraries are optimized for speed and performance and are widely used in practice. They are based on the same principles as the gradient boosting models we have seen here, but they have additional features and optimizations that make them more efficient and powerful. We will return to these libraries when we deal with regression problems. Meanwhile, you can check the LightGBM and XGBoost documentation for more information; below we include a basic example of how to use LightGBM with the same dataset.

LightGBM setup

In order to run lightgbm you need to install the library. Remember to back up your environment first!

For Windows machines it should be enough to activate the MLMIC 25 environment in an Anaconda prompt and then run:

conda install -c conda-forge lightgbm

For Mac machines you need to make sure that you have gcc installed with brew. You can do this by running the following command in your terminal. Ask for help if you need it.

brew install cmake libomp gcc

Then you can activate the MLMIC 25 environment in a terminal and install lightgbm with pip:

pip install lightgbm
The next code cell can take > 10 minutes to run
Hide the code
# import lightgbm as lgb

# lgb_pipeline = Pipeline([
#     ('lgb', lgb.LGBMClassifier())  # First and only step
# ])

# lgb_hyp_grid = {
#     'lgb__num_leaves': [20, 31, 40],         # Number of leaves in one tree
#     'lgb__learning_rate': [0.01, 0.05, 0.1], # Step size shrinkage
#     'lgb__n_estimators': [100, 500, 1000],   # Number of boosting iterations
#     'lgb__max_depth': [-1, 10, 20],          # Maximum tree depth
#     'lgb__min_child_samples': [10, 20, 30]   # Minimum data points in a leaf
# }

# lgb_gridCV = GridSearchCV(estimator=lgb_pipeline, 
#                            param_grid=lgb_hyp_grid,
#                            cv=5, scoring='accuracy', verbose=3, n_jobs=-1)

# lgb_gridCV.fit(XTR, YTR)

# # Step 7: Get the best hyperparameters
# best_params = lgb_gridCV.best_params_
Hide the code
# import lightgbm as lgb
# from sklearn.experimental import enable_halving_search_cv  # noqa
# from sklearn.model_selection import HalvingGridSearchCV


# lgb_pipeline = Pipeline([
#     ('lgb', lgb.LGBMClassifier())  # First and only step
# ])

# lgb_hyp_grid = {
#     'lgb__num_leaves': [20, 31, 40],         # Number of leaves in one tree
#     'lgb__learning_rate': [0.01, 0.05, 0.1], # Step size shrinkage
#     'lgb__n_estimators': [100, 500, 1000],   # Number of boosting iterations
#     'lgb__max_depth': [-1, 10, 20],          # Maximum tree depth
#     'lgb__min_child_samples': [10, 20, 30]   # Minimum data points in a leaf
# }

# # This is significantly faster and gives updates on each "round"
# lgb_halvingCV = HalvingGridSearchCV(estimator=lgb_pipeline, 
#                                      param_grid=lgb_hyp_grid,
#                                      scoring='accuracy',
#                                      factor=3, verbose=1, n_jobs=-1)

# lgb_halvingCV.fit(XTR, YTR)

# # Step 7: Get the best hyperparameters
# best_params = lgb_halvingCV.best_params_
In the Next Session

We will meet another top performing model for many classification problems, the Support Vector Machine.

References

Freund, Yoav, and Robert E. Schapire. 1995. “A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting.” In Proceedings of the Second European Conference on Computational Learning Theory, 23–37. Springer. https://doi.org/10.1007/3-540-59119-2_166.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning with Applications in Python. Springer International Publishing. https://doi.org/10.1007/978-3-031-38747-0.
Kunapuli, Gautam. 2023. Ensemble Methods for Machine Learning. Manning Publications. https://www.manning.com/books/ensemble-methods-for-machine-learning.
Muller, Andreas C, and Sarah Guido. 2017. Introduction to Machine Learning with Python. O’Reilly.
Serrano, Luis G. 2021. Grokking Machine Learning. Shelter Island, NY, USA: Manning Publications. https://www.manning.com/books/grokking-machine-learning.