diff --git a/src/model_selection.ipynb b/src/model_selection.ipynb index 259864a..42a6c90 100644 --- a/src/model_selection.ipynb +++ b/src/model_selection.ipynb @@ -1,525 +1,535 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Working with Notebooks:\n", "* Press `Shift + Enter` to evaluate a cell.\n", "* Only the return value of the last line of each cell is shown in the `Out` cells. If you want to see the result of an intermediate step, insert a new cell (with the `+` symbol in the toolbar) and paste that intermediate step into that cell.\n", "\n", "New R commands:\n", "* `regsubsets`: used for subset selection; `method` can be e.g. `\"exhaustive\"` (best subset), `\"forward\"`, or `\"backward\"`\n", "\n", "Useful R commands to remember:\n", "* `which.min` and `which.max` return the index of the smallest or largest element in a vector.\n", "* `sample(range, number_of_samples, replace = TRUE/FALSE)`\n", "* `apply(data, i, function)` where `i = 1` if the function should be applied to all rows and `i = 2` if the function should be applied to all columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Contrived High-Dimensional Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Set\n", "First we create an artificial data set with n = 50 observations of p = 1000 random predictors and one response that does not depend on the predictors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T15:18:03.937719Z", "start_time": "2019-10-14T15:18:03.836Z" } }, "outputs": [], "source": [ "set.seed(932)\n", "data = data.frame(matrix(rnorm(50*1000), 50, 1000))\n", "data$y = rnorm(50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forward Selection with Adjusted Training Loss\n", "Now we load the `leaps` package and use the `regsubsets` function to do forward selection with up to `nvmax = 30` predictors.\n", "We use the formula `y ~ .` to indicate that the response `y` should be regressed on all other columns of `data`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T15:27:04.754554Z", "start_time": "2019-10-14T15:27:04.295Z" } }, "outputs": [], "source": [ "library(leaps)\n", "regfit.fwd = regsubsets(y ~ ., data = data, method = \"forward\", nvmax = 30)\n", "regfit.fwd.summary = summary(regfit.fwd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us plot the adjusted R^2 and mark the optimal (maximal) value in red." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:21:36.439367Z", "start_time": "2019-10-14T16:21:36.356Z" } }, "outputs": [], "source": [ "plot(regfit.fwd.summary$adjr2, type = \"l\", xlab = \"Number of Variables\", ylab = \"adjusted R^2\")\n", "adjr2.max = which.max(regfit.fwd.summary$adjr2)\n", "points(adjr2.max, regfit.fwd.summary$adjr2[adjr2.max], col = \"red\", cex = 2, pch = 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now plot the BIC curve and mark the optimal BIC point in red. Write your answer in the following cell.\n", "\n", "Hints:\n", "1. Use `names(regfit.fwd.summary)` to see under which name the BIC is stored in `regfit.fwd.summary`.\n", "2. What does optimal mean for the BIC?"
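, "\n", "\n", "The general pattern is the same as for the adjusted R^2 above. As a sketch (shown here for Mallows' Cp, which, like the BIC, is better when it is smaller, so that the BIC version is left for you):\n", "```r\n", "names(regfit.fwd.summary)  # lists the stored criteria, e.g. \"adjr2\", \"cp\", \"bic\"\n", "plot(regfit.fwd.summary$cp, type = \"l\", xlab = \"Number of Variables\", ylab = \"Cp\")\n", "cp.min = which.min(regfit.fwd.summary$cp)  # Cp is minimized, hence which.min\n", "points(cp.min, regfit.fwd.summary$cp[cp.min], col = \"red\", cex = 2, pch = 20)\n", "```"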
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forward Selection with Cross-Validation (the Wrong Way)\n", "Now we do model selection with cross-validation. But we still use the full data set for training (which is wrong!!!).\n", "\n", "Since there is no predict function for `regsubsets`, we define our own.\n", "You don't need to fully understand this code now. If you are motivated to dive deeper, you can have a look at the bottom of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T15:42:10.442196Z", "start_time": "2019-10-14T15:42:10.427Z" } }, "outputs": [], "source": [ "predict.regsubsets = function(object, newdata, id, ...) {\n", " form = as.formula(object$call[[2]])  # extract the formula used in the regsubsets() call\n", " mat = model.matrix(form, newdata)  # build the design matrix for the new data\n", " coefi = coef(object, id = id)  # coefficients of the best model with id predictors\n", " xvars = names(coefi)\n", " mat[,xvars] %*% coefi  # predictions: selected columns times coefficients\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we do 5-fold cross-validation. The variable `folds` will be a vector of indices telling us in which of the `k` folds data point `i` lies. We will then loop over all folds and all models with 1 to 30 predictors (each model also contains an intercept) and save the validation errors in the matrix `cv.errors.wrong`, whose name indicates that these are the errors of the wrong CV approach.\n", "\n", "The next cell will take some time to run. The star in `[*]` on the left indicates that the cell is still running; alternatively, the bottom pane shows that R is busy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = 5\n", "folds = sample(1:k, nrow(data), replace = T)\n", "cv.errors.wrong = matrix(NA, k, 30)\n", "for (j in 1:k) {\n", " for (i in 1:30) {\n", " pred = predict(regfit.fwd, data[folds==j,], id = i)\n", " cv.errors.wrong[j,i] = mean( (data$y[folds==j] - pred)^2 )\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we average over the folds to get the mean CV error for each number of predictors. The `2` in the `apply` call says that the `mean` should be computed for each column (the rows, which we average over, correspond to the folds).\n", "\n", "Then we plot the result and add, as a dashed line, the standard deviation of the response, which should be a lower bound for our estimated test error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean.cv.errors.wrong = apply(cv.errors.wrong, 2, mean)\n", "plot(mean.cv.errors.wrong, type = \"l\", ylim = c(0, 1.3), xlab = \"Number of Variables\", ylab = \"5-fold CV wrong\")\n", "cv.wrong.min = which.min(mean.cv.errors.wrong)\n", "points(cv.wrong.min, mean.cv.errors.wrong[cv.wrong.min], col = \"red\", cex = 2, pch = 20)\n", "abline(sd(data$y), 0, lty = \"dashed\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forward Selection with Cross-Validation (the Right Way)\n", "The right way is almost the same as the wrong way, except that the training sets in the forward selection are the complements of the validation sets. Compare line 3 of the following cell to the for loop in the section above."
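, "\n", "\n", "Concretely, the only change is that `regsubsets()` is refit inside the fold loop on the training folds `data[folds != j, ]`, so the held-out fold never influences which variables are selected. A minimal sketch of the split for a single fold (assuming the objects `data` and `folds` created above):\n", "```r\n", "j = 1  # look at the first fold\n", "dim(data[folds != j, ])  # training folds: used for forward selection and fitting\n", "dim(data[folds == j, ])  # held-out fold: used only to compute the validation error\n", "```"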
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv.errors = matrix(NA, k, 30)\n", "for (j in 1:k) {\n", " regfit.fwd = regsubsets(y ~ ., data = data[folds!=j,], method = \"forward\", nvmax = 30)\n", " for (i in 1:30) {\n", " pred = predict(regfit.fwd, data[folds==j,], id = i)\n", " cv.errors[j,i] = mean( (data$y[folds==j] - pred)^2 )\n", " }\n", "}\n", "mean.cv.errors = apply(cv.errors, 2, mean)\n", "plot(mean.cv.errors, type = \"l\", ylim = c(1, 3),\n", " xlab = \"Number of Variables\", ylab = \"5-fold CV right\")\n", "cv.min = which.min(mean.cv.errors)\n", "points(cv.min, mean.cv.errors[cv.min], col = \"red\", cex = 2, pch = 20)\n", - "abline(sd(y), 0, lty = \"dashed\")" + "abline(sd(data$y), 0, lty = \"dashed\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Selection Applied to the Hitters Data\n", "Here we will apply model selection approaches to the Hitters data. We\n", "wish to predict a baseball player’s Salary on the basis of various statistics\n", "associated with performance in the previous year. This section follows Lab 1 in Chapter 6 of the textbook.\n", "\n", "First of all, we note that the Salary variable is missing for some of the\n", "players. The `is.na()` function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE\n", "for any elements that are missing, and a FALSE for non-missing elements.\n", "The `sum()` function can then be used to count all of the missing elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:15:16.280613Z", "start_time": "2019-10-14T16:15:16.258Z" } }, "outputs": [], "source": [ "library(ISLR)\n", "names(Hitters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:15:16.848659Z", "start_time": "2019-10-14T16:15:16.820Z" } }, "outputs": [], "source": [ "dim(Hitters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:15:17.454300Z", "start_time": "2019-10-14T16:15:17.434Z" } }, "outputs": [], "source": [ "sum(is.na(Hitters$Salary))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence we see that Salary is missing for 59 players. The `na.omit()` function\n", "removes all of the rows that have missing values in any variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:15:19.053026Z", "start_time": "2019-10-14T16:15:19.035Z" } }, "outputs": [], "source": [ "Hitters = na.omit(Hitters)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:15:38.210475Z", "start_time": "2019-10-14T16:15:38.187Z" } }, "source": [ "Write in the following cell the code to perform best subset selection with up to 19 predictors and plot the adjusted R^2 against the number of predictors.\n", "\n", "Hint: What argument do you need to set in `regsubsets()` to do best subset selection?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Determine the optimal number of predictors according to the adjusted R^2 criterion and the BIC criterion."
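, "\n", "\n", "Hint: as in the high-dimensional example above, `which.max` locates the optimum of a criterion that is maximized (adjusted R^2) and `which.min` that of a criterion that is minimized (BIC). A generic sketch, where `regfit.full.summary` is a hypothetical name standing in for your own `summary()` object:\n", "```r\n", "which.max(regfit.full.summary$adjr2)  # adjusted R^2: larger is better\n", "which.min(regfit.full.summary$bic)    # BIC: smaller is better\n", "```"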
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write in the following cell the code to perform best subset selection with 10-fold cross-validation (the right way).\n", "\n", "Once you have written the code, you can run the same cell multiple times to get the results for different random seeds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you are convinced about the right number of variables, you can perform best subset selection on the full data set to obtain the best fit for the parameters of that model. To do so, assign your choice of the number of predictors to `i_best` and run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "i_best = \n", "reg.best = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)\n", "coef(reg.best, i_best)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you are ready to do the following quiz. If you see `You do not have access to this content. Try logging in.`, you have to log into your Moodle account in a separate browser window and then return to this page." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 86, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T19:27:31.332705Z", "start_time": "2019-10-14T19:27:31.315Z" }, - "hide_input": true, - "scrolled": true + "hide_input": false, + "scrolled": false }, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ - "IRdisplay::display_html('')" + "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "## Validation Set Approach (optional)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now perform best subset selection with the validation set approach. For this, we first split the data into a training and a validation set.\n", "Then, we apply `regsubsets()` to the training set in order to perform best\n", "subset selection." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:34:42.188507Z", "start_time": "2019-10-14T16:34:42.162Z" } }, "outputs": [], "source": [ "set.seed(1)\n", "train = sample(c(TRUE,FALSE), nrow(Hitters), rep = TRUE)\n", "test = (!train)\n", "regfit.best = regsubsets(Salary ~ ., data = Hitters[train,], nvmax = 19)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:34:47.159461Z", "start_time": "2019-10-14T16:34:47.142Z" } }, "outputs": [], "source": [ "test.mat = model.matrix(Salary ~ ., data = Hitters[test,])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `model.matrix()` function is used in many regression packages for building an “X” matrix from data. Now we run a loop, and for each size `i`, we extract the coefficients from `regfit.best` for the best model of that size,\n", "multiply them into the appropriate columns of the test model matrix to\n", "form the predictions, and compute the test MSE. The symbol `%*%` stands for matrix multiplication."
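, "\n", "\n", "As a tiny standalone illustration of this mechanism (the numbers below are made up and have nothing to do with the Hitters data): indexing a matrix by a vector of column names picks out exactly those columns, and `%*%` then multiplies them with the coefficient vector.\n", "```r\n", "A = matrix(1:6, 2, 3, dimnames = list(NULL, c(\"x1\", \"x2\", \"x3\")))  # a 2 x 3 \"model matrix\"\n", "b = c(x1 = 1, x3 = 2)  # coefficients of a model that uses only x1 and x3\n", "A[, names(b)] %*% b  # 2 x 1 matrix: one \"prediction\" per row\n", "```"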
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:39:19.843578Z", "start_time": "2019-10-14T16:39:19.790Z" } }, "outputs": [], "source": [ "val.errors = rep(NA, 19)\n", "for (i in 1:19) {\n", " coefi = coef(regfit.best, id = i)\n", " pred = test.mat[, names(coefi)] %*% coefi\n", " val.errors[i] = mean((Hitters$Salary[test] - pred)^2)\n", "}\n", "val.errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-10-14T16:39:34.117682Z", "start_time": "2019-10-14T16:39:34.098Z" } }, "outputs": [], "source": [ "which.min(val.errors)" ] }, { "cell_type": "markdown", "metadata": { "hide_input": false }, "source": [ "This was a little tedious, partly because there is no `predict()` method\n", "for `regsubsets()`. This is why we wrote the function `predict.regsubsets` above. Our function pretty much mimics what we just did. The only complex part is how we extracted the formula used in the call to `regsubsets()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }