diff --git a/src/ridge_regression_and_lasso.ipynb b/src/ridge_regression_and_lasso.ipynb index bf079e7..35bd6db 100644 --- a/src/ridge_regression_and_lasso.ipynb +++ b/src/ridge_regression_and_lasso.ipynb @@ -1,450 +1,452 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ridge Regression and the Lasso" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "New R commands:\n", "* `glmnet(x, y, alpha = , lambda = )`\n", "* `model.matrix(formula, data)`\n", "\n", "Useful R commands to remember:\n", "* `seq(from, to, length = )` creates a vector of `length` linearly spaced numbers between `from` and `to`.\n", "\n", "This section follows Lab 2 in Chapter 6 of the textbook.\n", "\n", "We will use the `glmnet` package in order to perform ridge regression and\n", "the lasso. The main function in this package is `glmnet()`, which can be used\n", "to fit ridge regression models, lasso models, and more. This function has\n", "slightly different syntax from other model-fitting functions that we have\n", "encountered thus far in this book. In particular, we must pass in an x\n", "matrix as well as a y vector, and we do not use the y ∼ x syntax. We will\n", "now perform ridge regression and the lasso in order to predict Salary on\n", "the Hitters data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ridge regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(ISLR)\n", "names(Hitters)\n", "Hitters = na.omit(Hitters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `model.matrix()` function is particularly useful for creating x; not only\n", "does it produce a matrix corresponding to the 19 predictors but it also\n", "automatically transforms any qualitative variables into dummy variables.\n", "The latter property is important because `glmnet()` can only take numerical,\n", "quantitative inputs. We remove the first column returned by `model.matrix()`:\n", "it contains only ones, which correspond to the intercept that `glmnet()` fits automatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = model.matrix(Salary~., Hitters)[,-1]\n", "y = Hitters$Salary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `glmnet()` function has an `alpha` argument that determines what type\n", "of model is fit. If `alpha=0` then a ridge regression model is fit, and if `alpha=1`\n", "then a lasso model is fit. We first fit a ridge regression model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(glmnet)\n", "grid = 10^seq(10, -2, length = 100)\n", "ridge.mod = glmnet(x, y, alpha = 0, lambda = grid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default the `glmnet()` function performs ridge regression for an automatically selected range of λ values. However, here we have chosen to implement\n", "the function over a grid of values ranging from $\lambda = 10^{10}$ to $\lambda = 10^{-2}$.\n", "Note that by default, the `glmnet()` function standardizes the\n", "variables so that they are on the same scale. To turn off this default setting,\n", "use the argument `standardize=FALSE`.\n", "\n", "Associated with each value of λ is a vector of ridge regression coefficients,\n", "stored in a matrix that can be accessed by `coef()`. In this case, it is a 20×100\n", "matrix, with 20 rows (one for each predictor, plus an intercept) and 100\n", "columns (one for each value of λ)." ] },
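{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (this cell is an addition to the lab), we can confirm\n", "these dimensions directly with `dim()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The coefficient matrix has one row per coefficient (19 predictors\n", "# plus the intercept) and one column per value of lambda.\n", "dim(coef(ridge.mod))" ] },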
{ "cell_type": "markdown", "metadata": {}, "source": [ "We expect the coefficient estimates to be much smaller, in terms of $\ell_2$ norm,\n", "when a large value of λ is used, as compared to when a small value of λ is\n", "used.\n", "\n", "Check that this is the case by repeating the following steps for an index other than 50." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge.mod$lambda[50]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coef(ridge.mod)[,50]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we exclude the intercept and compute the $\ell_2$ norm of the coefficients." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sqrt(sum(coef(ridge.mod)[-1,50]^2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `predict()` function for a number of purposes. For instance,\n", "we can obtain the ridge regression coefficients for a new value of λ, say 50:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predict(ridge.mod, s = 50, type = \"coefficients\")[1:20,]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now split the samples into a training set and a test set in order\n", "to estimate the test error of ridge regression and the lasso.\n", "There are two\n", "common ways to randomly split a data set. The first is to produce a random\n", "vector of TRUE, FALSE elements and select the observations corresponding to\n", "TRUE for the training data. The second is to randomly choose a subset of\n", "numbers between 1 and n; these can then be used as the indices for the\n", "training observations. The two approaches work equally well. We use the\n", "second approach below; a sketch of the first follows the split." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(1)\n", "train = sample(1:nrow(x), nrow(x)/2)\n", "test = (-train)\n", "y.test = y[test]" ] },
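{ "cell_type": "markdown", "metadata": {}, "source": [ "For completeness, here is a sketch of the first approach; this cell is an\n", "addition to the lab, and its result is not used below. Note that it yields a\n", "split of roughly, rather than exactly, half the observations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustration only: draw TRUE/FALSE for each row; rows marked TRUE\n", "# would form the training set, e.g. x[train.logical, ].\n", "train.logical = sample(c(TRUE, FALSE), nrow(x), replace = TRUE)\n", "sum(train.logical)  # number of training observations, roughly half" ] },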
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next we fit a ridge regression model on the training set, and evaluate\n", "its MSE on the test set, using λ = 4. Note the use of the `predict()`\n", "function again. This time we get predictions for a test set, by replacing\n", "`type=\"coefficients\"` with the `newx` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge.mod = glmnet(x[train,], y[train], alpha = 0, lambda = grid)\n", "ridge.pred = predict(ridge.mod, s = 4, newx = x[test,])\n", "mean((ridge.pred - y.test)^2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that in the extreme case of λ approaching infinity, all coefficients are shrunk to zero,\n", "which corresponds to fitting a model with only an intercept.\n", "Let us check that this is the case for our fit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge.pred = predict(ridge.mod, s = 10^200, newx = x[test,])\n", "mean((ridge.pred - y.test)^2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean((mean(y[train]) - y.test)^2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So fitting a ridge regression model with λ = 4 leads to a much lower test\n", "MSE than fitting a model with just an intercept.\n", "\n", "We now check whether\n", "there is any benefit to performing ridge regression with λ = 4 instead of\n", "just performing least squares regression. Recall that least squares is simply\n", "ridge regression with λ = 0. The two methods yield very similar results;\n", "the small differences that remain arise because `glmnet()` fits the model\n", "with an iterative algorithm up to a convergence tolerance, whereas `lm()`\n", "solves the least squares problem exactly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge.pred = predict(ridge.mod, s=0, newx=x[test,], exact=TRUE, x=x[train,], y=y[train])\n", "mean((ridge.pred-y.test)^2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lm.pred = predict(lm(Salary ~ ., data = Hitters, subset=train), Hitters[test,])\n", "mean((lm.pred-y.test)^2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, instead of arbitrarily choosing λ = 4, it would be better to\n", "use cross-validation to choose the tuning parameter λ. We can do this using\n", "the built-in cross-validation function, `cv.glmnet()`. By default, the function\n", "performs ten-fold cross-validation, though this can be changed using the\n", "argument `nfolds`. Note that we set a random seed first so our results will\n", "be reproducible, since the choice of the cross-validation folds is random." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(1)\n", "cv.out=cv.glmnet(x[train,],y[train],alpha=0)\n", "plot(cv.out)\n", "bestlam=cv.out$lambda.min\n", "bestlam" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take the best λ, stored in `bestlam`, and compute the mean squared error on the test set.\n", "\n", "Is there a further improvement in test error over the arbitrary choice of λ = 4?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we refit our ridge regression model on the full data set,\n", "using the value of λ chosen by cross-validation, and examine the coefficient\n", "estimates. As expected, none of the coefficients are exactly zero:\n", "ridge regression does not perform variable selection." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "out = glmnet(x, y, alpha=0)\n", "predict(out, type=\"coefficients\", s=bestlam)[1:20,]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Lasso\n", "\n", "We saw that ridge regression with a wise choice of λ can outperform least\n", "squares as well as the null model on the Hitters data set. We now ask\n", "whether the lasso can yield either a more accurate or a more interpretable\n", "model than ridge regression. In order to fit a lasso model, we once again\n", "use the `glmnet()` function; however, this time we use the argument `alpha=1`.\n", "Other than that change, we proceed just as we did in fitting a ridge model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)\n", "plot(lasso.mod)" ] },
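{ "cell_type": "markdown", "metadata": {}, "source": [ "The coefficient plot shows that, depending on the choice of λ, some of the\n", "coefficients are exactly equal to zero. As an added check (this cell is not\n", "part of the original lab), we can count the nonzero coefficients along the\n", "grid; the count should grow as λ decreases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of nonzero coefficients (excluding the intercept) at a few\n", "# lambda values, from the largest (index 1) to the smallest (index 100).\n", "nnz = colSums(coef(lasso.mod)[-1,] != 0)\n", "nnz[c(1, 50, 75, 100)]" ] },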
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now perform cross-validation and compute the associated test error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, refit the lasso on the full data set,\n", "using the value of λ chosen by cross-validation, examine the coefficient\n", "estimates and compare them to the coefficients found by ridge regression.\n", - "Use again the seed 1 to get reproducible results for the cross-validation." + "You do not need to create a new training/test split,\n", + "but make sure to set the seed to 1 again just before running the\n", + "cross-validation to get reproducible results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you should be ready to do the following quiz." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian Interpretation for Ridge Regression and the Lasso (Optional)\n", "\n", "Read about the nice Bayesian interpretation of ridge regression and the lasso in the textbook\n", "on page 226, and have a look at exercise 7 on page 262." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }