Chapter 3, Part 2: Feature Selection and Regularization

Machine Learning

Author: F. San Segundo & N. Rodríguez

Published: February 2026


Setting the working directory

We begin by using the %cd magic to make sure we are in the right folder.

Hide the code
%cd 3_2_FeatureSelection_Regularization
/wd/3_2_FeatureSelection_Regularization

Session Setup

Libraries

Let us begin by loading the libraries we will use.

Hide the code
### Load necessary modules -------------------------------
# interactive plotting
# %matplotlib inline
%config InlineBackend.figure_format = 'png' # ‘png’, ‘retina’, ‘jpeg’, ‘svg’, ‘pdf’

# plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
# sns.set()

# Data management libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

# Machine learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

# Feature selection libraries
from sklearn.feature_selection import SequentialFeatureSelector, RFECV
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# statsmodels for classical linear models and diagnostics
import statsmodels.api as sm
from statsmodels.api import OLS
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Connecting statsmodels with sklearn via sklearn2pmml and patsy
from sklearn2pmml.statsmodels import StatsModelsRegressor

# Others
import warnings

# patsy for linear model formula language
import patsy as ps
from patsy_utils.patsy_formulaic_transformer import FormulaTransformer
Hide the code
import warnings
import os

# Ignore all warnings
warnings.filterwarnings("ignore")

# Specifically target Scikit-Learn/FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Some environments also require setting an environment variable
os.environ['PYTHONWARNINGS'] = 'ignore'

Variable Selection Methods

The Parsimony Principle

This principle states that we should prefer simpler models over more complex ones. That is, we always prefer the simplest model that still achieves good predictive performance and generalization. In Linear Regression Models we can pursue that goal by removing variables from the model, or by shrinking their coefficients to reduce their influence on the model’s predictions. We do that while watching the impact of these changes on the model’s performance.
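The shrinkage route can be illustrated with a minimal sketch (on made-up synthetic data, using scikit-learn’s Ridge; the coefficients and seed are arbitrary choices for the illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# only the first three inputs actually matter
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(size=200)

# as the penalty alpha grows, the coefficient vector is shrunk towards zero
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in (0.01, 1.0, 100.0)]
print([round(v, 3) for v in norms])  # a decreasing sequence
```

Lasso and ElasticNet, imported above for later use, behave similarly but can shrink some coefficients exactly to zero, performing variable selection at the same time.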

A brute force approach to this problem consists in trying all possible subsets of the input variables. If there are p of them that means that we need to train \[1 + p + \binom{p}{2} + \cdots + \binom{p}{p} = 2^p\] different models in order to select the best one. For a moderate size example dataset, like the \(p=11\) that we will see below, that means training \(2^{11} = 2048\) models. And if you need to use cross validation to get robust results then we end up training quite a lot of models. For large size datasets this method will be too costly in terms of computation.
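To make the count concrete, here is a brute force sketch on a tiny synthetic problem (illustrative only; the data and the helper `r2` are made up for this example):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
p = 4
X = rng.normal(size=(100, p))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=100)

# all subsets of the p inputs, from the empty set to the full set
subsets = [list(c) for k in range(p + 1) for c in combinations(range(p), k)]
print(len(subsets))  # 2**4 = 16 models, intercept-only model included

def r2(cols):
    # training R^2 of the OLS model using only the given columns
    return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)

best = max((s for s in subsets if s), key=r2)
print(best)  # training R^2 always favors the full model: [0, 1, 2, 3]
```

Even at this toy scale the exponential growth is visible: doubling p to 8 already means 256 fits, before any cross validation.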

Heuristic methods: strategies for variable selection

Instead of doing that we can choose to explore only a part of the whole collection of possible models, each one with a different choice of variables. How do we select that partial collection of models to explore? There are two natural strategies that you can easily come up with:

  • Forward selection: Start with a model without input variables (only the intercept, equal to \(\bar Y\)). Then for each value \(k\) from 1 to \(p\) (the number of inputs) take your current model and add one variable that the model does not yet use. Do that for every possible such variable (there are \(p - k + 1\) of them), fit the corresponding model (now with \(k\) inputs) and select the best model among those.

  • Backward selection: Start with a model that uses all the \(p\) input variables. Then for each value \(k\) from 1 to \(p\) take your current model and remove one variable that the model uses. Do that for every possible such variable (there are \(p - k + 1\) of them), fit the corresponding model (now with \(p - k\) inputs) and select the best model among those.

Either of these two methods results in a collection of \(p\) models, using from 1 to \(p\) inputs. Our final model is the best one of them, unless several have very similar performance; in that case we may prefer a simpler, more interpretable model at the expense of a little predictive performance. More on choosing the best model in a moment.
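The forward strategy can be sketched in a few lines of plain numpy (an illustration only, not the scikit-learn implementation we use later; `rss` and `forward_path` are hypothetical helper names):

```python
import numpy as np

def rss(X, y, cols):
    # least squares fit with intercept on the chosen columns
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def forward_path(X, y):
    remaining, chosen, path = set(range(X.shape[1])), [], []
    while remaining:
        # horizontal comparison: among same-size candidates, keep the lowest RSS
        best = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        remaining.discard(best)
        chosen = chosen + [best]
        path.append((list(chosen), rss(X, y, chosen)))
    return path

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
y = 3 * X[:, 1] + rng.normal(size=150)

path = forward_path(X, y)
print(path[0][0])  # the truly relevant variable enters first: [1]
```

Note that the loop only resolves the horizontal comparisons; choosing among the \(p\) models in `path` is exactly the vertical comparison problem discussed next.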

Both forward and backward strategies use two types of model comparisons.

  • In horizontal comparisons we need to compare models with the same number of inputs. This is a simpler problem and any performance measure such as \(R^2_{adj}\) or \(SS_{residual}\) can be used here to select the best one.

  • But in vertical comparisons we need to compare models with a different number of predictors, and that is a harder problem. For example, using variable selection in a model with, say, \(p = 100\) predictors means that we may need to compare a model with 2 predictors against a model with 95 predictors. And this is a problem because adding more variables always increases the value of \(R^2\). In particular, the model with all the inputs is bound to have the largest value of \(R^2\). So we need a different strategy to compare these different-size models.
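The claim that \(R^2\) never decreases when an input is added is easy to check numerically (a sketch on synthetic data; `adj_r2` is a hypothetical helper implementing the usual adjusted \(R^2\) formula):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 2))
y = X[:, 0] + rng.normal(size=n)
X_noise = np.column_stack([X, rng.normal(size=n)])  # add a pure-noise column

r2_small = LinearRegression().fit(X, y).score(X, y)
r2_big = LinearRegression().fit(X_noise, y).score(X_noise, y)
print(r2_big >= r2_small)  # True: training R^2 cannot go down when adding inputs

def adj_r2(r2, n, p):
    # adjusted R^2 penalizes the number of inputs p
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adj_r2(r2_small, n, 2), adj_r2(r2_big, n, 3))
```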


Vertical comparison: models with different number of input variables

There are two frequently used approaches for this task:

  • You can use a performance measure that takes the number of inputs into account. A traditional choice is the adjusted \(R^2\). But its theoretical foundation is not as solid as in the case of more modern alternatives, such as the Mallows \(C_p\) \[C_p = \dfrac{1}{n}(SS_{residual} + 2\,p\,\hat\sigma^2)\quad\text{ where }\hat\sigma^2\text{ is an estimate of the variance of the error}\] or the Akaike Information Criterion (AIC) \[AIC = \dfrac{1}{n\hat\sigma^2}(SS_{residual} + 2\,p\,\hat\sigma^2) + \text{constant}\] or the Bayesian Information Criterion (BIC) \[BIC = \dfrac{1}{n\hat\sigma^2}(SS_{residual} + \log(n)\,p\,\hat\sigma^2)\] As you can see, these measures are similar, and they all include the number of inputs \(p\) of the model. In all three cases the formulas shown above correspond to linear models fitted using least squares. And in all of them a lower value of the measure indicates a better model.

  • A second approach is to use cross validation to compare the models. For each fold, we train models of all sizes on the remaining folds, getting a validation score for each size and each fold. Then we compute the average of these scores across folds and use it to select the best size; call it \(p_0\). The problem here is that the set of \(p_0\) variables may differ from fold to fold. So our final step is to select the best model of size \(p_0\) using the full training set.
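As an illustration of the first approach, the information criteria can be computed directly from the residual sum of squares (a sketch on synthetic data, following the formulas as written above rather than any library’s exact convention; \(\hat\sigma^2\) is estimated from the full model’s residuals, a common choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# OLS fit with intercept
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
rss = np.sum((y - A @ beta) ** 2)
sigma2 = rss / (n - p - 1)  # error variance estimated from this (full) model

cp  = (rss + 2 * p * sigma2) / n
aic = (rss + 2 * p * sigma2) / (n * sigma2)           # up to an additive constant
bic = (rss + np.log(n) * p * sigma2) / (n * sigma2)
print(cp, aic, bic)
print(bic > aic)  # True here: log(200) > 2, so BIC penalizes model size harder
```

Since only the penalty term differs, BIC favors smaller models than AIC whenever \(\log(n) > 2\), that is, for any sample with more than 7 observations.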


Variable Selection in Python

A Synthetic Example Dataset

We will use a dataset with 11 numeric input variables to practice these variable selection methods in Python. We begin by loading the data and taking a look at the first few rows of the table. The output variable Y is in the last column of the table.

Hide the code
df = pd.read_csv("./3_2_data01.csv")
df.head()
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 Y
0 0.375265 1.584437 -2.601827 0.022447 -1.742627 -0.687230 -2.332911 -0.522880 -0.470547 0.169020 1.378568 3.401781
1 -2.628640 -6.398784 1.584658 -0.046849 0.202249 1.630798 3.696776 0.383786 0.137692 0.422389 0.797784 4.195411
2 0.968450 3.151469 3.797902 -0.615858 -0.992719 0.032080 2.594655 0.416864 0.698120 0.177978 0.846438 2.289632
3 1.451148 1.620372 -0.555819 0.005467 0.445291 1.496196 0.673055 0.779617 0.010274 -0.876502 -1.149223 6.289907
4 -0.207422 1.286931 -1.179233 1.834709 1.260945 0.036981 -1.982970 1.794626 -0.431231 0.659964 0.130603 -0.784177
Standard Names and Train/Test Split

We proceed as usual to benefit from the already available code.

Hide the code
output = "Y"
num_inputs = ["X" + str(i) for i  in range(1, df.shape[1])]
cat_inputs = []
inputs = num_inputs + cat_inputs
X = df[inputs]
Y = df[output]

Next we create the train/test split.

Hide the code
XTR, XTS, YTR, YTS = train_test_split(
    X, Y,
    test_size = 0.2,  
    random_state = 1) 

dfTR = pd.DataFrame(XTR, columns=inputs)
dfTR[output] = YTR
dfTS = pd.DataFrame(XTS, columns=inputs)
dfTS[output] = YTS
EDA: pairplot

This synthetic dataset contains no missing data and requires next to no processing. We will just make a pairplot to get a picture of some relations between variables.

Hide the code
sns.set_style("white")
sns.pairplot(dfTR, corner=True, height=1, aspect=1);

Hide the code
XTR.corr()
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
X1 1.000000 0.800172 -0.017546 -0.003350 -0.032179 -0.010736 -0.036970 -0.029899 -0.020021 -0.011684 -0.056911
X2 0.800172 1.000000 -0.002804 -0.004936 -0.024655 -0.001665 -0.052179 0.008793 -0.007839 -0.053868 -0.057638
X3 -0.017546 -0.002804 1.000000 -0.052395 -0.028096 -0.066732 -0.022051 -0.013028 0.968926 -0.018638 -0.006310
X4 -0.003350 -0.004936 -0.052395 1.000000 -0.016979 0.028705 0.052662 -0.041596 -0.058323 0.004816 -0.001033
X5 -0.032179 -0.024655 -0.028096 -0.016979 1.000000 -0.040766 -0.050736 -0.014336 -0.020022 -0.027139 -0.046829
X6 -0.010736 -0.001665 -0.066732 0.028705 -0.040766 1.000000 0.596393 0.005231 -0.065437 0.065676 -0.034175
X7 -0.036970 -0.052179 -0.022051 0.052662 -0.050736 0.596393 1.000000 -0.004117 -0.025989 0.045109 -0.069177
X8 -0.029899 0.008793 -0.013028 -0.041596 -0.014336 0.005231 -0.004117 1.000000 -0.026976 0.001250 -0.002293
X9 -0.020021 -0.007839 0.968926 -0.058323 -0.020022 -0.065437 -0.025989 -0.026976 1.000000 -0.021060 0.000606
X10 -0.011684 -0.053868 -0.018638 0.004816 -0.027139 0.065676 0.045109 0.001250 -0.021060 1.000000 -0.007986
X11 -0.056911 -0.057638 -0.006310 -0.001033 -0.046829 -0.034175 -0.069177 -0.002293 0.000606 -0.007986 1.000000

There are some obvious high correlations in the plot. The correlation matrix can be used to confirm them. We can also use a heatmap to help visualize the matrix. This is especially useful when the number of variables is large.

Hide the code
plt.figure(figsize=(8, 6))
sns.heatmap(XTR.corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

Looking at the pair plot or the heatmap it is clear that we can probably drop one of X1 and X2 and also one of X3 and X9. But what about e.g. X6 and X11? Are these variables part of the best model? And of course: how can we apply these ideas to models with hundreds of input variables, where manual inspection of plots and matrices is not feasible?
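As a first automatic step in that direction, we can let the computer scan the correlation matrix for highly correlated pairs (a sketch; `high_corr_pairs` is a hypothetical helper and the 0.8 threshold is an arbitrary choice):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    # keep only the upper triangle so each pair is reported once
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return (upper.stack()
                 .loc[lambda s: s > threshold]
                 .sort_values(ascending=False))

# tiny illustration with two nearly duplicated columns
rng = np.random.default_rng(5)
a = rng.normal(size=100)
demo = pd.DataFrame({"A": a, "B": a + 0.1 * rng.normal(size=100),
                     "C": rng.normal(size=100)})
print(high_corr_pairs(demo))  # reports only the (A, B) pair
```

Applied to XTR, this helper should flag the (X1, X2) and (X3, X9) pairs visible in the heatmap, but it says nothing about which member of each pair to keep, nor about variables like X6 and X11.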


Sequential Feature Selection in Python

The forward and backward selection strategies described above are implemented (in basic form) in the SequentialFeatureSelector class of scikit-learn. Let us see how to apply the forward selection strategy to our dataset. We will do this inside the framework that we have met in the previous session, using a pipeline that includes the formula for the model that uses the whole set of input variables.

In this dataset there are no categorical inputs and so we use passthrough in the preprocessing step of the pipeline. Uncomment the last line of this cell if you want to see the effect of preprocessing on the training dataset.

Hide the code
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler().set_output(transform="pandas"))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_inputs),
        ('cat', 'passthrough', cat_inputs)
        ])

preprocessor.set_output(transform='pandas')
# preprocessor.fit_transform(XTR)
ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('scaler', StandardScaler())]),
                                 ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7',
                                  'X8', 'X9', 'X10', 'X11']),
                                ('cat', 'passthrough', [])])

Next we define the model formula. Note that we have added num__ to the names so that the formula fits the naming conventions used by the pipeline. Note also that the formula uses -1 to remove the intercept from the design matrix. This is because the sfs step in the pipeline below has fit_intercept = True; keeping the formula's intercept would mix it in with the inputs and complicate the processing of the selected inputs in the code below.

Hide the code
model_Formula = " + ".join(["num__"+ item for item in inputs])
model_Formula = model_Formula + " - 1"
model_Formula
'num__X1 + num__X2 + num__X3 + num__X4 + num__X5 + num__X6 + num__X7 + num__X8 + num__X9 + num__X10 + num__X11 - 1'

Now we are ready to create the pipeline for SequentialFeatureSelector. This class requires us to choose the number of features (and the forward or backward direction) in advance. Let us use it to select a model with 6 features.

Hide the code
num_features = 6

num_folds = 10

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    SFS_pipe = Pipeline([
        ('preprocessor', preprocessor),
        ("formula", FormulaTransformer(model_Formula)),
        ('sfs', SequentialFeatureSelector(StatsModelsRegressor(OLS, fit_intercept = True,
                                                                scoring='neg_mean_squared_error'), 
                                                                n_features_to_select= num_features,
                                                                direction='forward',
                                                                cv=num_folds))])    

Now we fit the pipeline.

Hide the code
SFS_pipe.fit(XTR, YTR)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                                   'X10', 'X11']),
                                                 ('cat', 'passthrough', [])])),
                ('formula',
                 FormulaTransformer(formula='num__X1 + num__X2 + num__X3 + '
                                            'num__X4 + num__X5 + num__X6 + '
                                            'num__X7 + num__X8 + num__X9 + '
                                            'num__X10 + num__X11 - 1')),
                ('sfs',
                 SequentialFeatureSelector(cv=10,
                                           estimator=StatsModelsRegressor(model_class=<class 'statsmodels.regression.linear_model.OLS'>),
                                           n_features_to_select=6))])

To see the selected features run the code cell below. The first line makes sure that the names of the input variables are kept through the pipeline. Then we look at the support of the selected model, which is a boolean vector that indicates which variables were used in the model. And finally we show these variables in varSelection.

Hide the code
SFS_pipe.named_steps["sfs"].feature_names_in_ = np.array(inputs)
isVarSelected = SFS_pipe.named_steps["sfs"].support_.tolist()
isVarSelected
[True, False, True, True, True, True, False, False, False, True, False]
Hide the code
varSelection = SFS_pipe.named_steps["sfs"].feature_names_in_[isVarSelected].tolist()
varSelection
['X1', 'X3', 'X4', 'X5', 'X6', 'X10']

Examining the Selected Model

The model itself can be accessed with:

Hide the code
SFS_pipe_model = SFS_pipe.named_steps["sfs"].estimator
SFS_pipe_model
StatsModelsRegressor(model_class=<class 'statsmodels.regression.linear_model.OLS'>)
Model coefficients and statistics

If you want to examine the coefficients of the model and associated metrics and performance measures you can run the next code cell.

Technical notes about the Python code: First we apply the transform method of the pipeline to the training data, as the model operates on transformed inputs (in particular, this produces a design matrix that contains only the selected inputs). We use a pandas dataframe because otherwise the names of the input variables get lost in the pipeline. Finally, we need to reset (remove) the index of YTR because it does not match the (newly created) index of the transformed training data.

Hide the code
XTR_selected_pd = pd.DataFrame(SFS_pipe.named_steps["sfs"].transform(XTR), columns=varSelection)
print(SFS_pipe_model.fit(XTR_selected_pd, YTR.reset_index(drop=True)).results_.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.700
Model:                            OLS   Adj. R-squared:                  0.698
Method:                 Least Squares   F-statistic:                     385.4
Date:                Tue, 24 Feb 2026   Prob (F-statistic):          2.91e-255
Time:                        14:29:15   Log-Likelihood:                -1880.0
No. Observations:                1000   AIC:                             3774.
Df Residuals:                     993   BIC:                             3808.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.8657      0.050     37.064      0.000       1.767       1.965
X1             0.6628      0.050     13.157      0.000       0.564       0.762
X3            -0.0132      0.008     -1.587      0.113      -0.029       0.003
X4             0.0988      0.049      2.034      0.042       0.003       0.194
X5            -1.4770      0.050    -29.413      0.000      -1.576      -1.378
X6             1.5067      0.052     29.167      0.000       1.405       1.608
X10           -1.0373      0.051    -20.243      0.000      -1.138      -0.937
==============================================================================
Omnibus:                      409.300   Durbin-Watson:                   2.054
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4638.791
Skew:                           1.551   Prob(JB):                         0.00
Kurtosis:                      13.085   Cond. No.                         6.44
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/opt/conda/envs/mlmiin/lib/python3.11/site-packages/statsmodels/base/model.py:130: ValueWarning: unknown kwargs ['scoring']
  warnings.warn(msg, ValueWarning)
Remarks about these Results

Note first that the resulting model contains non significant coefficients for some variables. This is the case because the model has been selected among other models containing 6 variables using cross validation (with \(R^2\) as horizontal metric) and no attention has been paid to statistical significance at any point of this process.

As you can see, the problem with this approach is that we need to choose in advance the number of features that the final model contains. The next logical step would be to treat the number of features as a hyperparameter in a grid search. Unfortunately, SequentialFeatureSelector does not play nicely with GridSearchCV (the mlxtend library provides an alternative version of SequentialFeatureSelector that does). However, the summary table above provides several vertical metrics, like the AIC and BIC, that we can use to compare models with different numbers of inputs.


Searching for the Best Number of Inputs with a For Loop.

We will next use the AIC metric to perform the search manually with a for loop. The loop below displays how variables are added to the model as we increase the number of inputs. The results of the loop are stored in the list SFS_pipe_search. You will also notice that the highest number of inputs is one less than the total; that is a requirement of SequentialFeatureSelector.

Change track_progress to True if you want to see the loop results as they appear.

Hide the code
SFS_pipe_search = [] 
num_folds = 10
track_progress = False

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    for nf in range(1, len(inputs)):
                
        SFS_pipe = Pipeline(steps=[
                ('preprocessor', preprocessor),
                ("formula", FormulaTransformer(model_Formula)),
                ('sfs', SequentialFeatureSelector(StatsModelsRegressor(OLS,
                                                                        fit_intercept = True,
                                                                        scoring='neg_mean_squared_error'),
                                                                        n_features_to_select= nf,
                                                                        direction='forward',
                                                                        cv=num_folds))])
                                                                        # cv=cv_splitter))])
        
        SFS_pipe.fit(XTR, YTR)
        SFS_pipe.named_steps["sfs"].feature_names_in_ = np.array(inputs)        
        
        isVarSelected = SFS_pipe.named_steps["sfs"].support_.tolist()
        varSelection = SFS_pipe.named_steps["sfs"].feature_names_in_[isVarSelected].tolist()                
    
        SFS_pipe_XTR = SFS_pipe.fit_transform(XTR, YTR)
        SFS_pipe_XTR_pd = pd.DataFrame(SFS_pipe.fit_transform(XTR, YTR), columns=varSelection)
        SFS_pipe_model = SFS_pipe.named_steps["sfs"].estimator
        aic = SFS_pipe_model.fit(SFS_pipe_XTR_pd, YTR.reset_index(drop=True)).results_.aic
        
        SFS_pipe_search.append({'nf':nf, 
                                'support':SFS_pipe.named_steps["sfs"].support_,
                                'varSelection':varSelection, 
                                'model':SFS_pipe, 
                                'SFS_pipe_XTR_pd':SFS_pipe_XTR_pd.reset_index(drop=True),
                                # 'cv_scores':cv_scores,
                                # 'cv_scores_mean':cv_scores_mean,
                                'aic':aic}) 
        
        if track_progress:
            print("--"*25+"\n")
            print("nf = ", nf)
            print('varSelection = ', varSelection)
            # print(f'cv_scores_mean = {cv_scores_mean:.3f})'
            print(f"aic = {aic:4.1f}")

print("Feature selection completed.")
Feature selection completed.
Hide the code
whichBest = np.argmin(np.array([SFS_pipe_search[k]['aic'] for k in range(len(SFS_pipe_search))]))
selected_model_info = SFS_pipe_search[whichBest]
print(f"Selected model uses variables {selected_model_info['varSelection']}")
print(f"and has AIC = {selected_model_info['aic']:4.1f}")
Selected model uses variables ['X1', 'X3', 'X4', 'X5', 'X6', 'X9', 'X10']
and has AIC = 3770.7

And we can get the summary of the selected model with this code, similar to the one we used with the 6 inputs model above.

Hide the code
selected_XTR = selected_model_info["SFS_pipe_XTR_pd"] 
print(selected_model_info["model"].named_steps["sfs"].estimator.fit(selected_XTR, YTR.reset_index(drop=True)).results_.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.701
Model:                            OLS   Adj. R-squared:                  0.699
Method:                 Least Squares   F-statistic:                     332.6
Date:                Tue, 24 Feb 2026   Prob (F-statistic):          4.19e-255
Time:                        14:29:22   Log-Likelihood:                -1877.4
No. Observations:                1000   AIC:                             3771.
Df Residuals:                     992   BIC:                             3810.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.8862      0.050     37.563      0.000       1.788       1.985
X1             0.6641      0.050     13.212      0.000       0.565       0.763
X3            -0.5349      0.203     -2.633      0.009      -0.934      -0.136
X4             0.1060      0.050      2.107      0.035       0.007       0.205
X5            -1.4869      0.050    -29.530      0.000      -1.586      -1.388
X6             1.4760      0.050     29.232      0.000       1.377       1.575
X9             0.4695      0.203      2.310      0.021       0.071       0.868
X10           -1.0201      0.050    -20.260      0.000      -1.119      -0.921
==============================================================================
Omnibus:                      399.009   Durbin-Watson:                   2.065
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4482.638
Skew:                           1.504   Prob(JB):                         0.00
Kurtosis:                      12.927   Cond. No.                         8.00
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/opt/conda/envs/mlmiin/lib/python3.11/site-packages/statsmodels/base/model.py:130: ValueWarning: unknown kwargs ['scoring']
  warnings.warn(msg, ValueWarning)

Searching for the Best Number of Inputs with a fixed improvement threshold (tolerance).

Instead of using a for loop to search for the best number of inputs we can use the “auto” mode of the n_features_to_select parameter of SequentialFeatureSelector. This parameter must be combined with the tol parameter (for tolerance), which is the threshold for the improvement in the performance metric that we are using to select the best model. That is, if the improvement in the metric is less than tol then the search stops. This method can be used both for forward and backward selection.

Hide the code
SFS2_pipe = Pipeline([
            ('preprocessor', preprocessor),
            ("formula", FormulaTransformer(model_Formula)),
            ('sfs', SequentialFeatureSelector(StatsModelsRegressor(OLS, 
                                                                        fit_intercept = True,
                                                                        scoring='neg_mean_squared_error'), 
                                                                        n_features_to_select= "auto",
                                                                        tol=0.005,
                                                                        direction='forward',
                                                                        cv=num_folds))])    
Hide the code
SFS2_pipe.fit(XTR, YTR)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                                   'X10', 'X11']),
                                                 ('cat', 'passthrough', [])])),
                ('formula',
                 FormulaTransformer(formula='num__X1 + num__X2 + num__X3 + '
                                            'num__X4 + num__X5 + num__X6 + '
                                            'num__X7 + num__X8 + num__X9 + '
                                            'num__X10 + num__X11 - 1')),
                ('sfs',
                 SequentialFeatureSelector(cv=10,
                                           estimator=StatsModelsRegressor(model_class=<class 'statsmodels.regression.linear_model.OLS'>),
                                           tol=0.005))])

The selected inputs are obtained as before.

Hide the code
SFS2_pipe.named_steps["sfs"].feature_names_in_ = np.array(inputs)
isVarSelected_2 = SFS2_pipe.named_steps["sfs"].support_.tolist()
varSelection_2 = SFS2_pipe.named_steps["sfs"].feature_names_in_[isVarSelected_2].tolist()
varSelection_2
['X1', 'X5', 'X6', 'X10']

Validation and Test Scores of the Model

In this particular example the model selected in the for loop has significant coefficients for all the input variables it uses (some are very significant, but the following argument holds anyway). Even if we expect that to usually be the case, you should keep in mind that this is not how the models are being compared and selected. We are using cross validation with measures such as \(R^2_{adj}\) and AIC that do not take the statistical significance of the coefficients into account. We now compute the cross validation scores of the selected model and its score on the test set, storing them for later comparison with other models.

Hide the code
selected_XTR = selected_model_info["SFS_pipe_XTR_pd"] 
SFS_val_scores = cross_val_score(selected_model_info["model"].named_steps["sfs"].estimator, selected_XTR, YTR.reset_index(drop=True), cv=10, scoring="neg_mean_squared_error")
SFS_val_scores
array([-1.98538507, -2.30128123, -1.51890765, -1.57197237, -3.10608577,
       -1.88150072, -2.32083721, -2.46768177, -4.18192065, -4.14466876])
Hide the code
selected_XTS = pd.DataFrame(selected_model_info["model"].transform(XTS), columns=selected_model_info['varSelection'])
Y_SFS_pred = selected_model_info["model"].named_steps["sfs"].estimator.predict(selected_XTS)
SFS_test_score = -np.sqrt(mean_squared_error(Y_SFS_pred, YTS.reset_index(drop=True)))
SFS_test_score
np.float64(-2.2762748400986776)
Hide the code
modelDict ={'SFS':{'val_scores':SFS_val_scores, 'test_score':SFS_test_score}}

You can uncomment the code in the next cell to see the residual diagnostic plots for this model.

Hide the code
%run -i "../3_1_LinearRegression/3_1_ResidualPlots.py"

XTR_for_res_plt = SFS_pipe.named_steps["preprocessor"].transform(XTR)
XTR_for_res_plt.columns = inputs
ResidualPlots(model= selected_model_info["model"].named_steps["sfs"].estimator,
             data=XTR_for_res_plt, num_inputs=inputs)
--------------------------------------------------
Density Curve and QQ-plot of Residuals: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11']
--------------------------------------------------
[<Axes: > <Axes: >]

--------------------------------------------------
Fitted Values vs Residuals:
--------------------------------------------------

--------------------------------------------------
Numerical inputs: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11']
--------------------------------------------------

--------------------------------------------------
No categorical inputs exist.
--------------------------------------------------

Analysis of the Model Selection Process

In this particular example the selected model has significant coefficients for all the input variables it uses. Even if we expect that to usually be the case, you should keep in mind that this is not how the models are being compared and selected. We are using cross validation with measures such as \(R^2\) and AIC that do not take significance into account.

The following plot shows the evolution of AIC in the loop. The horizontal dashed line indicates the value that corresponds to the selected model.

Hide the code
aic_pd = pd.DataFrame({'features':range(1,len(inputs)),
              'aic':[SFS_pipe_search[k]["aic"] for k in range(len(SFS_pipe_search))]})
sns.scatterplot(aic_pd, x="features", y="aic")
sns.lineplot(aic_pd, x="features", y="aic")
plt.axvline(x=len(selected_model_info['varSelection']), ls = '--')
plt.axhline(y=SFS_pipe_search[whichBest]["aic"], ls = '--')
plt.title(f'Selected model uses {selected_model_info["varSelection"]} as features.')
plt.show();

Exercise 001

Change the selection direction to backward and examine the results. Do not let the output of the for loop confuse you! Even though each iteration increases the number of inputs, the internal SequentialFeatureSelector process will proceed from a model with all inputs down to the desired number of inputs.
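As a starting point for the exercise, here is a minimal sketch of switching the direction, on synthetic data and using scikit-learn's plain LinearRegression rather than the StatsModelsRegressor wrapper from this chapter:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 inputs, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=8,
                       n_informative=3, random_state=0)

# direction="backward" starts from the full model and removes one
# input per step until n_features_to_select inputs remain.
sfs_back = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3,
    direction="backward", cv=5,
    scoring="neg_mean_squared_error")
sfs_back.fit(X, y)

print(sfs_back.support_)        # boolean mask of the retained inputs
print(sfs_back.support_.sum())  # 3
```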


Recursive Feature Elimination

RFE and the Importance of the Predictors

Recursive feature elimination (RFE) is an alternative approach to the forward and backward strategies that we have just seen. Those strategies are guided by global properties of the models: either cross validation scores or metrics such as AIC. In RFE we instead focus on the importance of each individual feature, indicated in the case of a linear model by the size (absolute value) of its coefficient (after scaling of the dataset).

To perform RFE, we begin by training a model using the complete set of inputs. The trained model is used to assign an importance value to each feature. Then the least important variable (or variables) is removed and a new model is trained using the remaining ones. This method can be combined with a cross-validation strategy (using one of the vertical performance measures) to select an optimal number of features for the model.

The RFE process should remind you of the backward strategy, but the two differ in the way the next model is selected: in the backward strategy we use a horizontal measure to compare all models of a given size, while in RFE the feature importance is used to prune the model.
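The pruning loop can be sketched by hand. This simplified illustration (synthetic data, scikit-learn's LinearRegression, feature counts chosen arbitrarily) drops the input with the smallest absolute coefficient at each step; in practice you would use the RFE class discussed next.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=6,
                       n_informative=3, random_state=1)
X = StandardScaler().fit_transform(X)  # scaling makes |coef| comparable

remaining = list(range(X.shape[1]))
n_target = 3
while len(remaining) > n_target:
    model = LinearRegression().fit(X[:, remaining], y)
    # importance = absolute coefficient; drop the least important input
    weakest = remaining[np.argmin(np.abs(model.coef_))]
    remaining.remove(weakest)

print(sorted(remaining))  # indices of the surviving features
```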

RFE in scikit

RFE is implemented in Python through the RFE class and also in a particularly useful class called RFECV, which combines RFE with cross validation for the automatic selection of the optimal number of features.
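Before applying it to our data, a self-contained sketch of RFECV on synthetic data (with a plain scikit-learn LinearRegression, not this chapter's pipeline) shows the basic mechanics:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=400, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# RFECV removes `step` features per iteration and keeps the number of
# features with the best mean cross-validation score.
selector = RFECV(LinearRegression(), step=1, cv=5,
                 scoring="neg_mean_squared_error")
selector.fit(X, y)

print(selector.n_features_)  # chosen number of features
print(selector.support_)     # boolean mask over the 10 inputs
```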

Let us see how to apply it to our previous example. Since the training and test sets and the formula for the complete model are already available, we move directly to the modeling pipeline.

Recall that our model formula has no intercept:

Hide the code
model_Formula
'num__X1 + num__X2 + num__X3 + num__X4 + num__X5 + num__X6 + num__X7 + num__X8 + num__X9 + num__X10 + num__X11 - 1'

We define and fit the RFECV pipeline. Things are simpler here, as we get automatic selection of the optimal number of features (no for loop is needed):

Hide the code
num_folds = 10

RFECV_pipe = Pipeline([
            ('preprocessor', preprocessor),
            ("formula", FormulaTransformer(model_Formula)),
            ('rfecv', RFECV(StatsModelsRegressor(OLS, fit_intercept = True, scoring='neg_mean_squared_error'), 
            step=1, 
            cv=num_folds)) 
        ])

RFECV_pipe.fit(XTR, YTR) 
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                                   'X10', 'X11']),
                                                 ('cat', 'passthrough', [])])),
                ('formula',
                 FormulaTransformer(formula='num__X1 + num__X2 + num__X3 + '
                                            'num__X4 + num__X5 + num__X6 + '
                                            'num__X7 + num__X8 + num__X9 + '
                                            'num__X10 + num__X11 - 1')),
                ('rfecv',
                 RFECV(cv=10,
                       estimator=StatsModelsRegressor(model_class=<class 'statsmodels.regression.linear_model.OLS'>)))])

Now we can get the selected features as before:

Hide the code
RFECV_pipe.named_steps["rfecv"].feature_names_in_ = np.array(inputs)
isVarSelected = RFECV_pipe.named_steps["rfecv"].support_
varSelection = RFECV_pipe.named_steps["rfecv"].feature_names_in_[isVarSelected]
varSelection
array(['X1', 'X3', 'X5', 'X6', 'X9', 'X10'], dtype='<U3')

As we can see, the result includes all the inputs selected by the forward and backward strategies, together with some additional features such as X3. Let us examine the fitted model, which is again a Statsmodels object:

Hide the code
RFECV_pipe_model = RFECV_pipe.named_steps["rfecv"].estimator_
RFECV_pipe_model
StatsModelsRegressor(model_class=<class 'statsmodels.regression.linear_model.OLS'>)

And we get the model summary as before:

Hide the code
XTR_selected_pd = pd.DataFrame(RFECV_pipe.named_steps["rfecv"].transform(XTR), columns=varSelection)
print(RFECV_pipe_model.fit(XTR_selected_pd, YTR.reset_index(drop=True)).results_.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.700
Model:                            OLS   Adj. R-squared:                  0.698
Method:                 Least Squares   F-statistic:                     385.9
Date:                Tue, 24 Feb 2026   Prob (F-statistic):          1.86e-255
Time:                        14:29:23   Log-Likelihood:                -1879.6
No. Observations:                1000   AIC:                             3773.
Df Residuals:                     993   BIC:                             3808.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.8681      0.050     37.110      0.000       1.769       1.967
X1             0.6636      0.050     13.179      0.000       0.565       0.762
X3            -0.0866      0.033     -2.594      0.010      -0.152      -0.021
X5            -1.4820      0.050    -29.518      0.000      -1.580      -1.383
X6             1.5094      0.052     29.241      0.000       1.408       1.611
X9             0.4650      0.207      2.244      0.025       0.058       0.872
X10           -1.0359      0.051    -20.222      0.000      -1.136      -0.935
==============================================================================
Omnibus:                      398.904   Durbin-Watson:                   2.067
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4484.057
Skew:                           1.503   Prob(JB):                         0.00
Kurtosis:                      12.929   Cond. No.                         25.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis of the Selected Model

Note that the coefficients of some variables in the selected model could be non-significant. Again, this is a consequence of the feature selection procedure using feature importance rather than statistical significance. The AIC value is almost identical to the one we got using forward selection. Therefore, in this example forward selection seems to be doing a better job at selecting features; the advantage of RFECV is ease of use, as the for loop is not needed. The following plot illustrates the cross-validation scores that lead to the selection of the number of features (the dashed lines indicate the selected number of features and the corresponding score).

Hide the code
scores_pd = pd.DataFrame({'features':range(1,len(inputs)+1),
              'scores':RFECV_pipe.named_steps["rfecv"].cv_results_['mean_test_score']})
sns.scatterplot(scores_pd, x="features", y="scores")
sns.lineplot(scores_pd, x="features", y="scores")
plt.axhline(y=scores_pd["scores"].max(), ls = '--')
plt.axvline(x=varSelection.shape[0], ls = '--')
plt.show();

Hide the code
scores_pd["scores"]
0     0.230291
1     0.503592
2     0.637702
3     0.692652
4     0.693207
5     0.694266
6     0.692366
7     0.694090
8     0.692541
9     0.691855
10    0.691518
Name: scores, dtype: float64

Validation and Test Scores of the Model

In this particular example the selected model has significant coefficients for all the input variables it uses. Even if we expect that to usually be the case, you should keep in mind that this is not how the models are being compared and selected. We are using cross validation with measures such as \(R^2\) and AIC that do not take significance into account.

Hide the code
RFE_val_scores = cross_val_score(RFECV_pipe_model, XTR_selected_pd, YTR.reset_index(drop=True), cv=10, scoring="neg_mean_squared_error")
RFE_val_scores
array([-1.99511628, -2.33834018, -1.53777678, -1.60669992, -3.07963132,
       -1.89906504, -2.30713765, -2.45906988, -4.22282167, -4.12782631])
Hide the code
RFECV_pipe_model
StatsModelsRegressor(model_class=<class 'statsmodels.regression.linear_model.OLS'>)
Hide the code
XTS_selected_pd = pd.DataFrame(RFECV_pipe.named_steps["rfecv"].transform(XTS), columns=varSelection)
XTS_selected_pd
X1 X3 X5 X6 X9 X10
0 -1.623552 -5.340823 -0.670501 0.980370 -0.782473 0.540350
1 -2.023386 19.906353 1.241473 -1.080460 3.016985 1.946878
2 2.204650 5.981740 -0.711047 -0.310462 0.789065 0.631742
3 1.112091 -0.597322 -0.990069 -0.917676 -0.290526 0.976322
4 -0.927588 -2.073156 -1.777529 1.213873 -0.044390 0.265506
... ... ... ... ... ... ...
245 -0.219244 -13.247836 -0.135511 -0.256402 -2.012584 0.168967
246 2.500051 -0.446588 -0.346454 0.244533 -0.026462 0.357007
247 0.501914 1.119308 1.389443 0.496103 0.161849 2.135842
248 0.306868 -6.196531 -0.202906 -0.280527 -0.992225 0.779191
249 0.738184 -3.942354 1.363966 -1.279606 -0.496515 0.713708

250 rows × 6 columns

Hide the code
Y_RFE_pred = RFECV_pipe_model.predict(XTS_selected_pd[varSelection])
Y_RFE_pred
0      2.803276
1     -5.283380
2      3.110660
3      1.593541
4      5.602998
         ...   
245    1.573137
246    4.066295
247   -1.343169
248    1.217297
249   -2.223504
Length: 250, dtype: float64
Hide the code
RFE_test_score = -np.sqrt(mean_squared_error(Y_RFE_pred, YTS.reset_index(drop=True)))
RFE_test_score
np.float64(-2.279758648927032)
Hide the code
modelDict['RFE'] = {'val_scores':RFE_val_scores, 'test_score':RFE_test_score}
Residual Plots

Uncomment and run the next cells to get the residual plots.

Hide the code
%run -i "../3_1_LinearRegression/3_1_ResidualPlots.py"
XTR_for_res_plt = RFECV_pipe.named_steps["preprocessor"].transform(XTR)
XTR_for_res_plt.columns = inputs
ResidualPlots(model= RFECV_pipe_model,
             data=XTR_for_res_plt, num_inputs=inputs)
--------------------------------------------------
Density Curve and QQ-plot of Residuals: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11']
--------------------------------------------------
[<Axes: > <Axes: >]

--------------------------------------------------
Fitted Values vs Residuals:
--------------------------------------------------

--------------------------------------------------
Numerical inputs: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11']
--------------------------------------------------

--------------------------------------------------
No categorical inputs exist.
--------------------------------------------------


Model Comparison

The following code is similar to previous model comparisons. We compute the scores and use boxplots to illustrate the cross validation results for the models being compared.

Hide the code
modelDict.keys()
dict_keys(['SFS', 'RFE', 'Lasso'])
Hide the code
val_scores = {ml:(modelDict[ml])['val_scores'] for ml in modelDict.keys()}
Hide the code
model_val_scores = pd.DataFrame(val_scores)
model_val_scores
SFS RFE Lasso
0 -1.985385 -1.995116 -1.964394
1 -2.301281 -2.338340 -2.332525
2 -1.518908 -1.537777 -1.544335
3 -1.571972 -1.606700 -1.634485
4 -3.106086 -3.079631 -3.105553
5 -1.881501 -1.899065 -1.985617
6 -2.320837 -2.307138 -2.306664
7 -2.467682 -2.459070 -2.555914
8 -4.181921 -4.222822 -4.250270
9 -4.144669 -4.127826 -4.179896
Hide the code
metric = "negRMSE"
fig = plt.figure(figsize=(10, 4))
sns.boxplot(model_val_scores.melt(var_name="model", value_name=metric), x=metric, y ="model");

Hide the code
test_scores = {ml:(modelDict[ml])['test_score'] for ml in modelDict.keys()}
test_scores
{'SFS': np.float64(-2.2762748400986776),
 'RFE': np.float64(-2.279758648927032),
 'Lasso': -5.167256008319807}
Exercise 004

What is your interpretation of these plots?


Recommended reading:

See References section at the end for details.

References

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics. New York, NY, USA: Springer Science+Business Media. https://doi.org/10.1007/978-0-387-84858-7.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning with Applications in Python. Springer International Publishing. https://doi.org/10.1007/978-3-031-38747-0.
Müller, Andreas C., and Sarah Guido. 2017. Introduction to Machine Learning with Python. O’Reilly.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/.
VanderPlas, J. 2016. Python Data Science Handbook: Essential Tools for Working with Data. O’Reilly. https://jakevdp.github.io/PythonDataScienceHandbook/.