We begin by using cd to make sure we are in the right folder.
%cd 3_2_FeatureSelection_Regularization
/wd/3_2_FeatureSelection_Regularization
Session Setup
Libraries
Let us begin by loading the libraries we will use.
### Load necessary modules -------------------------------

# interactive plotting
# %matplotlib inline
%config InlineBackend.figure_format = 'png'  # 'png', 'retina', 'jpeg', 'svg', 'pdf'

# plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
# sns.set()

# Data management libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

# Machine learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression

# Feature selection libraries
from sklearn.feature_selection import SequentialFeatureSelector, RFECV
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Dimensionality reduction libraries
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

# Connecting statsmodels with sklearn via sklearn2pmml and patsy
import statsmodels.api as sm
from statsmodels.api import OLS
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn2pmml.statsmodels import StatsModelsRegressor

# Others
import warnings

# patsy for linear model formula language
import patsy as ps
from patsy_utils.patsy_formulaic_transformer import FormulaTransformer
import warnings
import os

# Ignore all warnings
warnings.filterwarnings("ignore")
# Specifically target Scikit-Learn FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)
# Some environments also require setting an environment variable
os.environ['PYTHONWARNINGS'] = 'ignore'
Variable Selection Methods
The Parsimony Principle
This principle states that we should prefer simpler models over more complex ones: among models with comparably good predictive performance and generalization, we choose the simplest. In linear regression models we can pursue that goal by removing variables from the model or by shrinking their coefficients to reduce their influence on the model’s predictions, while monitoring the impact of those changes on the model’s performance.
A brute force approach to this problem consists in trying all possible subsets of the input variables. If there are \(p\) of them that means that we need to train \[1 + p + \binom{p}{2} + \cdots + \binom{p}{p} = 2^p\] different models in order to select the best one. For a moderate size example dataset, like the \(p=11\) that we will see below, that means training \(2^{11} = 2048\) models. And if you need to use cross validation to get robust results then we end up training quite a lot of models. For large datasets this method is too costly in terms of computation.
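The count above can be checked directly: enumerating every subset of \(p\) inputs with `itertools` reproduces the \(2^p\) total (a tiny sketch, just to make the combinatorial explosion concrete):

```python
from itertools import combinations

def count_subsets(p):
    # One candidate model per subset of the p inputs
    # (including the empty, intercept-only model).
    return sum(1 for k in range(p + 1) for _ in combinations(range(p), k))

print(count_subsets(10))  # 2**10 = 1024 models for just 10 inputs
```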
Heuristic methods: strategies for variable selection
Instead of doing that we can choose to explore only a part of the whole collection of possible models, each one with a different choice of variables. How do we select that partial collection of models to explore? There are two natural strategies that you can easily come up with:
Forward selection: Start with a model without input variables (only the intercept, equal to \(\bar Y\)). Then for each value \(k\) from 1 to \(p\) (the number of inputs) take your current model and add one variable that the model does not use. Do that for every possible such variable (there are \(p - k + 1\) of them), fit the corresponding model (now with \(k\) inputs) and select the best model among those.
Backward selection: Start with a model that uses all the \(p\) input variables. Then for each value \(k\) from 1 to \(p\) take your current model and remove one variable that the model uses. Do that for every possible such variable (there are \(p - k + 1\) of them), fit the corresponding model (now with \(p - k\) inputs) and select the best model among those .
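The forward strategy can be sketched in a few lines of plain Python (synthetic data and training \(R^2\) as the horizontal comparison; variable names here are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=200)

selected, remaining, path = [], list(range(X.shape[1])), []
while remaining:
    # Horizontal comparison: among all models of the current size,
    # keep the candidate variable that gives the best R^2.
    scores = {j: LinearRegression()
                 .fit(X[:, selected + [j]], y)
                 .score(X[:, selected + [j]], y)
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    path.append((list(selected), scores[best]))

for vars_, r2 in path:
    print(vars_, round(r2, 3))
```

The first two variables added are the truly informative ones; the vertical comparison across the resulting models of sizes 1 to \(p\) still has to be done with one of the criteria discussed next.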
Either of these two strategies results in a collection of \(p\) models, using from 1 to \(p\) inputs. Our final model is the best one of them, unless several have very similar performance; in that case we may consider a simpler, more interpretable model at the expense of a little predictive performance. More about choosing the best model in a moment.
Both forward and backward strategies use two types of model comparisons.
In horizontal comparisons we need to compare models with the same number of inputs. This is a simpler problem and any performance measure such as \(R^2_{adj}\) or \(SS_{residual}\) can be used here to select the best one.
But in vertical comparisons we need to compare models with a different number of predictors, and that is a harder problem. For example, variable selection in a model with, say, \(p = 100\) predictors means that we may need to compare a model with 2 predictors against a model with 95 predictors. And this is a problem because adding more variables always increases the value of \(R^2\). In particular the model with all the inputs is bound to have the largest value of \(R^2\). So we need a different strategy to compare these different-size models.
Vertical comparison: models with different number of input variables
There are two frequently used approaches for this task:
You can use a performance measure that takes the number of inputs into account. A traditional choice is the adjusted \(R^2\). But its theoretical foundation is not as solid as that of more modern alternatives, such as Mallows' \(C_p\)
\[C_p = \dfrac{1}{n}\left(SS_{residual} + 2\,p\,\hat\sigma^2\right)\quad\text{ where }\hat\sigma^2\text{ is an estimate of the variance of the error,}\]
the Akaike Information Criterion (AIC)
\[AIC = \dfrac{1}{n\hat\sigma^2}\left(SS_{residual} + 2\,p\,\hat\sigma^2\right) + \text{constant},\]
or the Bayesian Information Criterion (BIC)
\[BIC = \dfrac{1}{n\hat\sigma^2}\left(SS_{residual} + \log(n)\,p\,\hat\sigma^2\right).\]
As you can see, these measures are similar, and they all include the number of inputs \(p\) of the model. In all three cases the formulas shown above correspond to linear models fitted using least squares. And in all of them a model with a lower value of the measure is considered a better model.
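The AIC-style formula above can be computed by hand on synthetic data (a sketch, using the error-variance estimate from the full model, which is the usual convention; all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # X3, X4 are irrelevant

def aic_like(X_sub, y, sigma2):
    """AIC-style measure from the formula above (up to the additive constant)."""
    n, p = X_sub.shape
    resid = y - LinearRegression().fit(X_sub, y).predict(X_sub)
    ss_res = float(resid @ resid)
    return (ss_res + 2 * p * sigma2) / (n * sigma2)

# sigma2 is usually estimated from the model that uses all the inputs
full_resid = y - LinearRegression().fit(X, y).predict(X)
sigma2 = float(full_resid @ full_resid) / (n - X.shape[1] - 1)

# Lower is better: dropping a truly relevant input is penalized through the
# residual sum of squares, extra inputs through the 2*p*sigma2 penalty term.
print(aic_like(X[:, :2], y, sigma2))   # both informative inputs
print(aic_like(X[:, 1:2], y, sigma2))  # missing the strongest input
```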
A second approach is to use cross validation to compare the models. For each fold, we use the remaining folds to train models of all sizes, getting a validation score for each size and each fold. Then we compute the average of these scores across folds and use it to select the best size; call it \(p_0\). The problem here is that the set of variables selected for each fold may be different. So our final step is to select the best model of size \(p_0\) using the full training set.
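This cross-validation approach can be sketched as follows (synthetic data; the first \(k\) columns stand in for the best size-\(k\) model that a forward or backward search would return):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# Average CV score for nested models of increasing size.
cv_means = []
for k in range(1, X.shape[1] + 1):
    scores = cross_val_score(LinearRegression(), X[:, :k], y, cv=5, scoring="r2")
    cv_means.append(scores.mean())

# The size with the best average validation score is our p_0.
best_size = int(np.argmax(cv_means)) + 1
print(best_size)
```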
Variable Selection in Python
A Synthetic Example Dataset
We will use a dataset with 11 numeric input variables to practice these variable selection methods in Python. We begin by loading the data and taking a look at the first few rows of the table. The output variable Y is in the last column of the table.
df = pd.read_csv("./3_2_data01.csv")
df.head()
         X1        X2        X3        X4        X5        X6        X7        X8        X9       X10       X11         Y
0  0.375265  1.584437 -2.601827  0.022447 -1.742627 -0.687230 -2.332911 -0.522880 -0.470547  0.169020  1.378568  3.401781
1 -2.628640 -6.398784  1.584658 -0.046849  0.202249  1.630798  3.696776  0.383786  0.137692  0.422389  0.797784  4.195411
2  0.968450  3.151469  3.797902 -0.615858 -0.992719  0.032080  2.594655  0.416864  0.698120  0.177978  0.846438  2.289632
3  1.451148  1.620372 -0.555819  0.005467  0.445291  1.496196  0.673055  0.779617  0.010274 -0.876502 -1.149223  6.289907
4 -0.207422  1.286931 -1.179233  1.834709  1.260945  0.036981 -1.982970  1.794626 -0.431231  0.659964  0.130603 -0.784177
Standard Names and Train/Test Split
We proceed as usual to benefit from the already available code.
output = "Y"
num_inputs = ["X" + str(i) for i in range(1, df.shape[1])]
cat_inputs = []
inputs = num_inputs + cat_inputs
X = df[inputs]
Y = df[output]
This synthetic dataset contains no missing data and requires next to no processing. We will just make a pairplot to get a picture of some relations between variables.
There are some obvious high correlations in the plot. The correlation matrix can be used to confirm them. We can also use a heatmap to help visualize the matrix. This is especially useful when the number of variables is large.
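The heatmap step can be sketched like this (with a small synthetic stand-in for df, since the point here is the plotting pattern, not the actual data):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for df: X2 is built from X1, so the two columns
# are strongly correlated and show up clearly in the heatmap.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
demo = pd.DataFrame({
    "X1": x1,
    "X2": 2 * x1 + rng.normal(scale=0.2, size=300),
    "X3": rng.normal(size=300),
})

corr = demo.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix heatmap")
plt.show()
```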
Looking at the pair plot or the heatmap it is clear that we can probably drop one of X1 and X2 and also one of X3 and X9. But what about e.g. X6 and X11? Are these variables part of the best model? And of course: how can we apply these ideas to models with hundreds of input variables, where manual inspection of plots and matrices is not feasible?
Sequential Feature Selection in Python
The forward and backward selection strategies described above are implemented (in basic form) in the SequentialFeatureSelector class of scikit. Let us see how to apply the backward selection strategy to our dataset. We will do this inside the framework that we have met in the previous session, using a pipeline that includes the formula for the model that uses the whole set of input variables.
In this dataset there are no categorical inputs and so we use passthrough in the preprocessing step of the pipeline. Uncomment the last line of this cell if you want to see the effect of preprocessing on the training dataset.
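A minimal sketch of the preprocessing step just described (num_inputs is redefined here so that the cell is self-contained; in the notebook it comes from the standard-names cell above):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Numeric inputs are scaled; since there are no categorical inputs,
# anything else is simply passed through.
num_inputs = ["X" + str(i) for i in range(1, 11)]
preprocessing = ColumnTransformer(
    [("num", StandardScaler(), num_inputs)],
    remainder="passthrough",
)
```

Note that the transformer name "num" is what produces the num__ prefixes used in the model formula below.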
Next we define the model formula. Note that we have added num__ to the names so that the formula fits nicely inside the naming conventions used by the pipeline. Note also that the formula uses -1 to remove the intercept from the design matrix. This is because the sfs step in the pipeline below has fit_intercept = True, which simplifies the processing of the selected inputs in the code below (otherwise the intercept coming from the formula gets mixed in with the inputs and things get complicated).
model_Formula = " + ".join(["num__" + item for item in inputs])
model_Formula = model_Formula + " - 1"
model_Formula
Now we are ready to create the pipeline for SequentialFeatureSelector. This function requires us to select a number of features (and the backward or forward direction) in advance. Let us use it to select a model with 6 features.
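A sketch of such a pipeline (self-contained, on synthetic stand-in data and without the patsy formula step used in the notebook; XTR and YTR here are hypothetical placeholders for the training split):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the training data.
rng = np.random.default_rng(0)
XTR = pd.DataFrame(rng.normal(size=(200, 10)),
                   columns=[f"X{i}" for i in range(1, 11)])
YTR = XTR["X1"] + 2 * XTR["X3"] - XTR["X5"] + rng.normal(size=200)

SFS_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("sfs", SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=6,   # chosen in advance, as discussed
        direction="backward",     # or "forward"
        cv=5,
    )),
    ("lm", LinearRegression()),
])
SFS_pipe.fit(XTR, YTR)

# The support of the fitted selector tells us which variables survived.
support = SFS_pipe.named_steps["sfs"].get_support()
varSelection = XTR.columns[support]
print(list(varSelection))
```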
To see the selected features we first make sure that the names of the input variables are kept through the pipeline. Then we look at the support of the selected model, which is a boolean vector that indicates which variables were used in the model. And finally we store these variables in varSelection.
Model coefficients and statistics
If you want to examine the coefficients of the model and associated metrics and performance measures you can run the next code cell.
Technical notes about the Python code: first we apply the transform method of the pipeline to the training data, as the model operates on transformed inputs (in particular this results in a design matrix that only contains the selected inputs). We use a pandas dataframe because otherwise the names of the input variables get lost in the pipeline. Finally, we need to reset (remove) the index values of YTR because they do not match the (newly created) index in the transformed training data.
Remarks about these Results
Note first that the resulting model contains non-significant coefficients for some variables. This happens because the model has been selected among other models containing 6 variables using cross validation (with \(R^2\) as the horizontal metric), and no attention has been paid to statistical significance at any point of this process.
As you see, the problem with this approach is that we need to choose in advance the final number of features that the model contains. The next logical step would be to use the number of features as a hyperparameter in a grid search. Unfortunately SequentialFeatureSelector does not play nicely with GridSearchCV (the mlxtend library provides an alternative version of SequentialFeatureSelector that does). However, the summary table above provides several vertical metrics, like AIC and BIC, that we can use to compare models with different numbers of inputs.
Searching for the Best Number of Inputs with a For Loop.
We will next use the AIC metric to perform the grid search manually using a for loop. The loop below will display the way that variables are added to the model as we increase the number of inputs. The results of the loop are stored in the list SFS_pipe_search. You will also notice that the highest number of inputs is one less than the original one; that is a requirement of SequentialFeatureSelector.
Change track_progress to True if you want to see the loop results as they appear.
whichBest = np.argmin(np.array([SFS_pipe_search[k]['aic'] for k in range(len(SFS_pipe_search))]))
selected_model_info = SFS_pipe_search[whichBest]
print(f"Selected model uses variables {selected_model_info['varSelection']}")
print(f"and has AIC = {selected_model_info['aic']:4.1f}")
Selected model uses variables ['X1', 'X3', 'X4', 'X5', 'X6', 'X9', 'X10']
and has AIC = 3770.7
And we can get the summary of the selected model with this code, similar to the one we used with the 6 inputs model above.
Searching for the Best Number of Inputs with a fixed improvement threshold (tolerance).
Instead of using a for loop to search for the best number of inputs we can use the “auto” mode of the n_features_to_select parameter of SequentialFeatureSelector. This parameter must be combined with the tol parameter (for tolerance), which is the threshold for the improvement in the performance metric that we are using to select the best model. That is, if the improvement in the metric is less than tol then the search stops. This method can be used both for forward and backward selection.
In this particular example the model selected in the for loop has significant coefficients for all the input variables it uses (some are very significant, but the following argument holds anyway). Even if we expect that to be usually the case, you should keep in mind that this is not the way that the models are being compared and selected. We are using cross validation with measures such as \(R^2_{adj}\) and AIC that do not take the statistical significance of the coefficients into account.
--------------------------------------------------
No categorical inputs exist.
--------------------------------------------------
Analysis of the Model Selection Process
In this particular example the selected model has significant coefficients for all the input variables it uses. Even if we expect that to be usually the case, you should keep in mind that this is not the way that the models are being compared and selected. We are using cross validation with measures such as \(R^2\) and AIC that do not take significance into account.
The following plot shows the evolution of AIC in the loop. The horizontal dashed line indicates the value that corresponds to the selected model.
aic_pd = pd.DataFrame({
    'features': range(1, len(inputs)),
    'aic': [SFS_pipe_search[k]["aic"] for k in range(len(SFS_pipe_search))]
})
sns.scatterplot(aic_pd, x="features", y="aic")
sns.lineplot(aic_pd, x="features", y="aic")
plt.axvline(x=len(selected_model_info['varSelection']), ls='--')
plt.axhline(y=SFS_pipe_search[whichBest]["aic"], ls='--')
plt.title(f'Selected model uses {selected_model_info["varSelection"]} as features.')
plt.show()
Exercise 001
Change the selection direction to backward and examine the results. Do not let the output of the for loop confuse you! Even though each loop increases the number of inputs, the internal SequentialFeatureSelector process will proceed from a model with all inputs to the desired number of inputs.
Recursive Feature Elimination
RFE and the Importance of the Predictors
Recursive feature elimination (RFE) is an alternative approach to the forward and backward strategies that we have just seen. Those strategies are guided by global properties of the models: either cross validation scores or metrics such as AIC. In RFE we instead focus on the importance of each individual feature, as indicated by the size (absolute value) of their coefficients in the case of a linear model (after scaling of the dataset).
To perform RFE, we begin by training a model using the complete set of inputs. The trained model is used to assign an importance value to each feature. Then the least important variable (or variables) are removed and a new model is trained using the remaining ones. This method can be combined with a cross-validation strategy (using one of the vertical performance measures) to select an optimal number of features for the model.
The RFE process should remind you of the backward strategy. But the difference between them is in the way that the next model is selected: in the backward strategy we use a horizontal measure to compare all models of a given size, while in RFE the feature importance is used to prune the model.
RFE in scikit
RFE is implemented in Python through the RFE class, and also in a particularly useful class called RFECV which combines RFE with cross validation for the automatic selection of the optimal number of features.
Let us see how to apply it to our previous example. Since the training and test sets and the formula for the complete model are all already available we move directly to the modeling pipeline.
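A sketch of such an RFECV pipeline (self-contained, on synthetic stand-in data; XTR and YTR are hypothetical placeholders for the training split):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
XTR = pd.DataFrame(rng.normal(size=(200, 8)),
                   columns=[f"X{i}" for i in range(1, 9)])
YTR = 2 * XTR["X1"] - XTR["X4"] + rng.normal(size=200)

# RFECV ranks features by the absolute size of the (scaled) coefficients,
# drops the least important one at each step, and uses cross validation
# to pick the best number of features automatically.
RFECV_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rfecv", RFECV(LinearRegression(), step=1, cv=5, scoring="r2")),
])
RFECV_pipe.fit(XTR, YTR)

support = RFECV_pipe.named_steps["rfecv"].support_
varSelection = XTR.columns[support]
print(list(varSelection))
```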
As we can see, the result includes all the inputs selected by the forward and backward strategies, but there is an additional feature X3. Let us examine the fitted model which is again a Statsmodels object:
Note that the coefficient of some variables in the selected model could be non significant. Again this is a consequence of the feature selection procedure not using statistical significance (but feature importance). The AIC value is almost identical to the one we got using forward selection. Therefore, in this example forward selection seems to be doing a better job at selecting features. The advantage of RFECV is in terms of ease of use, as the for loop is not needed. The following plot illustrates the cv scores that lead to the selection of the number of features (the dashed lines indicate the number of features and the corresponding score for that selection).
scores_pd = pd.DataFrame({
    'features': range(1, len(inputs) + 1),
    'scores': RFECV_pipe.named_steps["rfecv"].cv_results_['mean_test_score']
})
sns.scatterplot(scores_pd, x="features", y="scores")
sns.lineplot(scores_pd, x="features", y="scores")
plt.axhline(y=scores_pd["scores"].max(), ls='--')
plt.axvline(x=varSelection.shape[0], ls='--')
plt.show()
--------------------------------------------------
No categorical inputs exist.
--------------------------------------------------
Regularization: Lasso, Ridge and Related Shrinkage Methods.
Regularization and Shrinkage Methods
We have already met a first concept of regularization when we discussed the support vector machines and introduced a penalty term for trespassing the margin. The same idea can be applied to linear models as follows. Remember that in least squares we find the estimates of the linear model coefficients by minimizing the sum of squared residuals (the usual loss function for this method). That is, we minimize: \[
SS_{residual} = (\bar Y - {\mathbf X}\hat\beta)^T (\bar Y - {\mathbf X}\hat\beta)
\] In a regularization method we choose a value of a hyperparameter \(\lambda\) and then add a penalty term to the above expression, to get \[
(\bar Y - {\mathbf X}\hat\beta)^T (\bar Y - {\mathbf X}\hat\beta) + \lambda\cdot\text{penalty}
\] The model coefficient estimates are found as before by miinizing this quantity for the selected value of \(\lambda\).
A good choice of penalty term should be related to the complexity of the model. In linear models the complexity of the model is related to its coefficients, and therefore the typical penalty terms provide a measure of the number and/or size of the model’s coefficients. Then, as the value of \(\lambda\) increases the penalty applied forces the model to search for a model with less and/or smaller coefficients. This makes the coefficients shrink and that is the reason why these are called shrinkage methods.
Let us see two frequently used shrinkage methods, that differ on the kind of penalty term they use.
The Lasso
Let us suppose that we are trying to fit a linear model with \(p\) input variables, and so \[
f(\bar X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
Just like we did in the session about support vector machines, assume that we take a fixed complexity budget \(S\) that we are allowed to spend on the coefficients of the model. To make this precise we introduce the Lasso penalty as the sum of the absolute values of the model coefficients (Lasso = Least absolute shrinkage and selection operator): \[\displaystyle\sum_{i = 1}^p|\beta_i|\] This penalty term is also called the \(L_1\) norm of the vector of input coefficients \((\beta_1,\ldots,\beta_p)\) (the intercept is excluded). Therefore in the Lasso we fit the model by trying to solve the optimization problem \[
\operatorname{argmin}_{(\beta_1,\ldots,\beta_p)}\left(\sum_{i = 1}^n\left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2\right)\qquad
\text{ subject to }\quad \displaystyle\sum_{i = 1}^p|\beta_i|\leq S
\] That is, find the values of \((\beta_1,\ldots,\beta_p)\) that minimize the function inside the parentheses, while keeping the coefficient sizes small enough to fulfill the complexity requirement.
Another formulation for the Lasso.
When you learnt Calculus you may have come across the idea of Lagrange Multipliers for constrained optimization problems. Applying that idea to the above formulation of the Lasso as one such problem, we can obtain the following equivalent dual formulation: \[
\operatorname{argmin}_{(\beta_1,\ldots,\beta_p)}\left(\sum_{i = 1}^n\left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2 + \lambda\,\underbrace{\displaystyle\sum_{i = 1}^p|\beta_i|}_{\text{Lasso penalty}}\right)
\] for a given fixed \(\lambda\). This version of the Lasso shows that it is indeed a regularization method, because it can be expressed as: \[
(\bar Y - {\mathbf X}\hat\beta)^T (\bar Y - {\mathbf X}\hat\beta) + \lambda\,\underbrace{\displaystyle\sum_{i = 1}^p|\beta_i|}_{\text{Lasso penalty}}
\]
Geometric Interpretation of the Lasso.
Let us return to the first version of the Lasso in a problem with two inputs, so that we can visualize what is going on. Let us further suppose that the data has been centered so that \(\beta_0 = 0\). Then the Lasso asks for the solution of: \[
\operatorname{argmin}_{(\beta_1, \beta_2)}\left(\sum_{i = 1}^n\left(y_i - \beta_1 x_{i1} - \beta_2 x_{i2} \right)^2\right)\qquad
\text{ subject to }\quad |\beta_1| + |\beta_2| \leq S
\] If we ignore the condition \(|\beta_1| + |\beta_2| \leq S\) and just try to minimize the residual sum of squares: \[
F(\beta_1, \beta_2) = \sum_{i = 1}^n\left(y_i - \beta_1 x_{i1} - \beta_2 x_{i2} \right)^2
the solution will be the coefficients \((\beta_1^{OLS}, \beta_2^{OLS})\) from ordinary least squares (OLS). The plots below represent points like this in the \((\beta_1, \beta_2)\) plane. The level curves of \(F(\beta_1, \beta_2)\) (the points where \(F\) takes a given fixed value) can be shown to be ellipses in this plane. The Lasso condition \(|\beta_1| + |\beta_2| \leq S\), on the other hand, defines a diamond-shaped region in the plane. The Lasso optimization requires us to find the point in the diamond region where the value of \(F\) is minimal. The plot below illustrates a typical solution:
This plot is produced by a dynamic geometric construction (made with GeoGebra) (included here, in the cell below this one), that we will try to use in this session to illustrate how the Lasso works. In case you can not get it to work the static images below can help follow the discussion. The blue diamond is the boundary of the region of points \((\beta_1, \beta_2)\) with \(|\beta_1| + |\beta_2| \leq S\) for a particular choice of \(S\). The point labelled \(LS_{solution}\) corresponds to the OLS fitted model without using the Lasso. For this value of \(S\) the point is outside the diamond, so the model can not use it. The red ellipse represents the level curve of \(F\) with the minimum value of the function that still touches the diamond region. The point \((\beta_1, \beta_2)\) where the diamond and the ellipse meet corresponds to the Lasso solution for this choice of \(S\).
Of course, with a bigger value of \(S\) the diamond region becomes larger. And when it reaches the \(LS_{solution}\) (which is the absolute minimum of \(F\)) the Lasso restriction becomes irrelevant.
Feature Selection with the Help of the Hyperparameter S.
On the other hand if we reduce the complexity budget by making S smaller something interesting happens when we reach the situation in the plot below:
For this value of \(S\) the point where the diamond and the ellipse meet is in the horizontal axis. That means that \(\beta_2 = 0\) and so the Lasso fitted model does not use the second input. Feature selection happens automatically in the Lasso as we decrease \(S\).
Now we can reduce the budget even further. A lower value of \(S\) is represented by the smaller green diamond below. The black ellipse illustrates that once the problem reaches the axis, the contact point between diamonds and ellipses will remain in the axis, keeping \(\beta_2 = 0\). As \(S\) decreases it is now the turn of \(\beta_1\) to become smaller until it also becomes zero (in a typical implementation where \(\beta_0\) has not been forced to be 0, the optimization will continue adjusting the value of the model intercept as we decrease \(S\) / increase \(\lambda\)).
This example illustrates a general automatic feature selection property of the Lasso that holds in higher dimensions: as the regularization parameter \(\lambda\) increases, the coefficients of the model shrink until eventually they all become equal to zero. If we combine this property with cross validation using hyperparameter search we can use the Lasso and find the number of variables (remaining non zero coefficients) where the model performance is best.
The Lasso in Python
The Lasso method is implemented in scikit through the Lasso class. It provides a regularization parameter called alpha which is the analogue of the \(\lambda\) above: smaller values of alpha correspond to more complexity (a bigger budget for coefficients of the model). And alpha = 0 corresponds to basic linear regression modeling with no regularization.
We will use the same pipeline and patsy formulas (with no intercept) framework to inspect the resulting fitted models.
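A minimal sketch of such a Lasso pipeline (on synthetic stand-in data and without the patsy formula step; the value of alpha here is arbitrary, just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
XTR = pd.DataFrame(rng.normal(size=(200, 5)),
                   columns=[f"X{i}" for i in range(1, 6)])
YTR = 3 * XTR["X1"] + XTR["X2"] + rng.normal(size=200)

Lasso_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=0.5)),   # larger alpha = stronger shrinkage
])
Lasso_pipe.fit(XTR, YTR)

# Irrelevant inputs are shrunk all the way to zero at this alpha.
coefs = Lasso_pipe.named_steps["lasso"].coef_
print(dict(zip(XTR.columns, np.round(coefs, 3))))
```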
We want to be able to explore the evolution of the coefficients of the model as alpha changes. Therefore we will use a for loop and we will keep a record of the coefficients in the list called coefs.
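Such a loop can be sketched as follows (synthetic stand-in data; the alpha grid is a hypothetical choice):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=200)
Xs = StandardScaler().fit_transform(X)

Lasso_alphas = np.logspace(-3, 1, 30)   # hypothetical grid of alpha values
coefs = []
for a in Lasso_alphas:
    coefs.append(Lasso(alpha=a).fit(Xs, y).coef_)
coefs = np.array(coefs)

# For a large enough alpha every coefficient has been shrunk to zero.
print(coefs[-1])
```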
And now we can make a plot that shows how the absolute value of the coefficients decreases as alpha increases, until they all eventually become equal to 0. The dashed line indicates the value with the best validation score.
coefs_pd = pd.DataFrame(coefs, columns=inputs)
coefs_pd["alpha"] = Lasso_alphas
fig, ax = plt.subplots()
coefs_pd.plot(x="alpha", y=inputs, logx=logScale, ax=ax)
fig.set_figwidth(12)
fig.set_figheight(6)
plt.legend(loc='upper right', title='Inputs', prop={'size': 8})
plt.title('Lasso coefficients as a function of the regularization')
for k in range(len(inputs)):
    plt.text(Lasso_alphas[0], coefs_pd.iloc[1, k], s=inputs[k], fontsize='xx-small')
plt.axvline(x=Lasso_alphas[np.argmax(scores)], ls='--')
plt.show()
A Grid for the Lasso
Usually we would directly use a grid search with cross validation for the lasso (instead of the for loop above) like this:
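A sketch of that grid search (self-contained, on synthetic stand-in data; the scoring choice and alpha grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=200)

Lasso_alphas = np.logspace(-3, 1, 20)
Lasso_grid = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("lasso", Lasso())]),
    param_grid={"lasso__alpha": Lasso_alphas},
    cv=5,
    scoring="neg_mean_squared_error",   # higher (less negative) is better
)
Lasso_grid.fit(X, y)

best_alpha = Lasso_grid.best_params_["lasso__alpha"]
print(best_alpha)
```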
and we can also plot the evolution of the cross validation error as alpha changes. Remember that this is minus the error, and so in this plot higher is better. The dashed lines indicate the selected value of alpha and the corresponding error.
sns.scatterplot(x=Lasso_alphas, y=Lasso_grid.cv_results_["mean_test_score"])
sns.lineplot(x=Lasso_alphas, y=Lasso_grid.cv_results_["mean_test_score"])
plt.axhline(y=Lasso_grid.cv_results_["mean_test_score"].max(), ls='--')
plt.axvline(x=best_alpha, ls='--')
if logScale:
    plt.xscale('log')
Exercise 002
Change the value of logScale to False to zoom in and explore closely the region of alpha values close to the selected one. Rerun the code up to this exercise to do that.
The coefficients of the model selected by the Lasso are all different from 0 (no feature selection for this example):
Ridge Regression
Ridge regression is another shrinkage method that differs from the Lasso in the use of the \(L_2\) norm instead of the \(L_1\). That means that in Ridge regression we ask for the solution of: \[
\operatorname{argmin}_{(\beta_1,\ldots,\beta_p)}\left(\sum_{i = 1}^n\left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2\right)\qquad
\text{ subject to }\quad \underbrace{\displaystyle\sum_{i = 1}^p\beta_i^2\leq S}_{\textbf{Note the squares here!}}
\] This in turn implies that the geometry of the Ridge regression is defined by a ball in the coefficient space, like the blue ball in the 2d picture below. And therefore as \(S\) decreases (the complexity decreases and) we do not expect any individual coefficient to become actually zero. But all of them as a group become smaller and smaller with \(S\).
In new code cells below this one take the code cells for the Lasso and modify them to run the Ridge method, to get a plot of the Ridge coefficients as a function of the regularization. Do you see the difference?
The following code is similar to other previous model comparisons. We compute the scores and use boxplots to illustrate the cross validation results for the models being compared.
modelDict.keys()
dict_keys(['SFS', 'RFE', 'Lasso'])
val_scores = {ml: modelDict[ml]['val_scores'] for ml in modelDict.keys()}