diff --git a/src/lecture10.ipynb b/src/lecture10.ipynb
new file mode 100644
index 0000000..328f77f
--- /dev/null
+++ b/src/lecture10.ipynb
@@ -0,0 +1,805 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Lecture 10\n",
+ "\n",
+ "## Bagging and Random Forests"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The code in the following cell generates the figure you saw on the first slide."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# generate data\n",
+ "set.seed(123)\n",
+ "data <- data.frame(X = 2*runif(30) - 1)\n",
+ "data$Y <- sin(4*data$X) + .2*rnorm(30)\n",
+ "\n",
+ "# load libraries\n",
+ "library(tree)\n",
+ "library(keras)\n",
+ "library(splines)\n",
+ "\n",
+ "# fit\n",
+ "f1 <- tree(Y ~ X, data, subset = 1:20, minsize = 2)\n",
+ "f2 <- keras_model_sequential() %>%\n",
+ " layer_dense(50, activation = \"relu\", input_shape = c(1)) %>%\n",
+ " layer_dense(50, activation = \"relu\", input_shape = c(1)) %>%\n",
+ " layer_dense(1) %>%\n",
+ " compile(optimizer = \"adam\", loss = \"mean_squared_error\")\n",
+ "fit(f2, data$X[10:30], data$Y[10:30], epochs = 2*10^3, verbose = 0)\n",
+ "f3 <- smooth.spline(data$X[5:25], data$Y[5:25], df = 15)\n",
+ "\n",
+ "# predict\n",
+ "grid <- seq(-1, 1, length = 100)\n",
+ "yhat1 <- predict(f1, data.frame(X = grid))\n",
+ "yhat2 <- predict(f2, grid)\n",
+ "yhat3 <- predict(f3, grid)$y\n",
+ "yhat.ensemble <- 1/3 * (yhat1 + yhat2 + yhat3)\n",
+ "\n",
+ "# plot\n",
+ "plot(data)\n",
+ "curve(sin(4*x), from = -1, to = 1, add = T, lwd = 3)\n",
+ "lines(grid, yhat1, col = \"blue\")\n",
+ "lines(grid, yhat2, col = \"red\")\n",
+ "lines(grid, yhat3, col = \"darkgreen\")\n",
+ "lines(grid, yhat.ensemble, col = \"orange\", lwd = 3)\n",
+ "legend(-.6, 1.1, c(\"tree\", \"neural net\", \"spline\", \"ensemble\",\n",
+ " \"true func.\"), lty = 1, col = c(\"blue\", \"red\",\n",
+ " \"darkgreen\", \"orange\", \"black\"))\n",
+ "\n",
+ "# compare predications\n",
+ "sapply(list(yhat1, yhat2, yhat3, yhat.ensemble),\n",
+ " function(yhat) mean((yhat - sin(4*grid))^2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "To understand bagging and random forests a bit better we will apply it to the\n",
+ "XOR-problem. The following cell creates the XOR-data, fits a single tree and\n",
+ "plots the result."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create data\n",
+ "library(MASS)\n",
+ "set.seed(123)\n",
+ "x1 <- mvrnorm(30, c(1.5, 1), .2*diag(2))\n",
+ "x2 <- mvrnorm(30, c(-1, -1), .2*diag(2))\n",
+ "x3 <- mvrnorm(30, c(-1, .8), .2*diag(2))\n",
+ "x4 <- mvrnorm(30, c(.9, -1), .2*diag(2))\n",
+ "data <- data.frame(X = rbind(x1, x2, x3, x4), Y = c(rep(1, 60), rep(0, 60)))\n",
+ "\n",
+ "# fit tree\n",
+ "library(tree)\n",
+ "t <- tree(as.factor(Y) ~ ., data, minsize = 30)\n",
+ "print(t)\n",
+ "\n",
+ "# visualize tree\n",
+ "par(mfrow = c(1, 2))\n",
+ "plot(t)\n",
+ "text(t)\n",
+ "plot(data$X.1, data$X.2, col = data$Y + 2)\n",
+ "abline(v = 1.50088, col = 'blue')\n",
+ "lines(c(-3, 1.50088), c(-.460662, -.460662), col = 'blue')\n",
+ "lines(c(-0.0517341, -0.0517341), c(-3, -.460662), col = 'blue')\n",
+ "lines(c(0.950088, 0.950088), c(3, -.460662), col = 'blue')\n",
+ "lines(c(-0.545005, -0.545005), c(3, -.460662), col = 'blue')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Now we use the library `randomForest`.\n",
+ "Recall that bagging is simply a special case of a random forest with $m = p$.\n",
+ "We set $m$ in the code with `mtry`, i.e. to use bagging we set `mtry = 2`,\n",
+ "which is the number of predictors for the XOR data.\n",
+ "We perform bagging with $B$ = `ntree = 10` trees of at most 4 leaf nodes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "library(randomForest)\n",
+ "bag <- randomForest(as.factor(Y) ~ ., data, mtry = 2, ntree = 10, maxnodes = 4)\n",
+ "for (tree in 1:10) print(getTree(bag, tree, labelVar = T))"
+ ]
+ },
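+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Instead of reading through all ten printouts, we can extract the root split of\n",
+ "each tree directly. This is a small sketch using `getTree()`; the same code\n",
+ "works for the random forest `rf` fitted below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# root split variable and split point of each of the 10 bagged trees\n",
+ "sapply(1:10, function(i) {\n",
+ " root <- getTree(bag, i, labelVar = T)[1, ]\n",
+ " paste0(root[, \"split var\"], \" @ \", round(root[, \"split point\"], 2))\n",
+ "})"
+ ]
+ },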
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "As you can see in the output of the cell above, each one of the 10 trees of this\n",
+ "bag performs its first split along the `X.1` direction, roughly at value 1.5,\n",
+ "similarly as we have seen it for the single tree above. This indicates that all\n",
+ "split candidates in the `X.2` direction led to a smaller decrease in the loss.\n",
+ "\n",
+ "Now let's see what happens with a random forest, where for each split only one\n",
+ "of the two predictors is considered as split candidate (`mtry = 1`)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "rf <- randomForest(as.factor(Y) ~ ., data, mtry = 1, ntree = 10, maxnodes = 4)\n",
+ "for (tree in 1:10) print(getTree(rf, tree, labelVar = T))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the output above you can see that some of the trees have their first split in\n",
+ "`X.1` direction, while others have it in `X.2` direction. In this simple\n",
+ "setting, where at each split only one of the predictors is randomly picked as\n",
+ "split candidate, the direction of the first split is entirely determined by this\n",
+ "random selection; no matter what the maximal decrease of a split candidate in\n",
+ "the other direction would have been. Note that within each tree the different\n",
+ "splits can be on different predictors, because the split candidate directions\n",
+ "are randomly selected per node, and not per entire tree.\n",
+ "\n",
+ "Next, we apply bagging and random forests to the Heart data.\n",
+ "By default, `randomForest()` uses `p/3` variables when building a random forest\n",
+ "of regression trees, and `√p` variables when building a random forest of\n",
+ "classification trees. In the cell below we use the `randomForest()` function to\n",
+ "perform both random forests and bagging."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "Heart <-read.csv(\"http://faculty.marshall.usc.edu/gareth-james/ISL/Heart.csv\")[,-1]\n",
+ "Heart <- na.omit(Heart)\n",
+ "Heart$AHD <- as.factor(Heart$AHD)\n",
+ "Heart$ChestPain <- as.factor(Heart$ChestPain)\n",
+ "Heart$Thal <- as.factor(Heart$Thal)\n",
+ "Heart$Sex <- as.factor(Heart$Sex)\n",
+ "library(randomForest)\n",
+ "set.seed(2)\n",
+ "bag <- randomForest(AHD ~ ., Heart, mtry = 13)\n",
+ "rforest <- randomForest(AHD ~ ., Heart)"
+ ]
+ },
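+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "As a quick check, the fitted objects store the number of split candidates they\n",
+ "used in the component `mtry`. For the Heart data with 13 predictors, the\n",
+ "classification default is $\\lfloor\\sqrt{13}\\rfloor = 3$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# m used by bagging (all 13 predictors) and by the random forest (default)\n",
+ "c(bagging = bag$mtry, random.forest = rforest$mtry)"
+ ]
+ },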
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Let us now split the data an run bagging and random forests only on the test\n",
+ "set and compute the predictions on the test set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "train <- sample(nrow(Heart), nrow(Heart)/2)\n",
+ "bag <- randomForest(AHD ~ ., Heart, subset = train, mtry = 13)\n",
+ "rf <- randomForest(AHD ~ ., Heart, subset = train)\n",
+ "pred.bag <- data.frame(predict(bag, Heart[-train,], predict.all = T))\n",
+ "pred.rf <- data.frame(predict(rf, Heart[-train,], predict.all = T))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The prediction data frames `pred.bag` and `pred.rf` return the predictions using\n",
+ "the majority vote in the first column (`aggregrate`) and the predictions of each\n",
+ "of the 500 trees in the other columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pred.bag"
+ ]
+ },
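+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "As a small sanity check (a sketch), the aggregated prediction for, say, the\n",
+ "first test observation should equal the majority vote over the individual tree\n",
+ "predictions in the remaining columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "votes <- pred.bag[1, -1] == \"Yes\" # votes of the individual trees\n",
+ "c(aggregate = as.character(pred.bag[1, 1]),\n",
+ " majority = ifelse(mean(votes) > 0.5, \"Yes\", \"No\"))"
+ ]
+ },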
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "We define the `error.rate` function to compute the test error based on 2 to 500\n",
+ "trees of all the trees fitted. Instead we could also run the fits with the\n",
+ "`randomForest` function multiple times with the argument `ntree = 2` up to\n",
+ "`ntree = 500` and use the first column of the `predict` function above. Having\n",
+ "fitted already 500 trees, we can instead simply use them and our function\n",
+ "`error.rate` to compute the test error rate for different numbers of trees."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "error.rate <- function(ntree, y, pred) {\n",
+ " pred <- pred[,seq(2, ntree + 1)] == \"Yes\"\n",
+ " maj.vote <- rowMeans(pred) > 0.5\n",
+ " y <- y == \"Yes\"\n",
+ " mean(maj.vote != y)\n",
+ "}\n",
+ "test.err.bag <- sapply(2:500, error.rate, Heart[-train, 'AHD'], pred.bag)\n",
+ "test.err.rf <- sapply(2:500, error.rate, Heart[-train, 'AHD'], pred.rf)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Let us plot the out-of-bag (OOB) and the test error rates."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "plot(bag$err.rate[,1], type = 'l', ylim = c(.15, .35), col = 'darkgreen', ylab = \"Error Rate\", xlab = \"Number of Trees\")\n",
+ "lines(rf$err.rate[,1], col = 'orange')\n",
+ "lines(2:500, test.err.bag, col = 'blue')\n",
+ "lines(2:500, test.err.rf, col = 'red')\n",
+ "legend(\"topright\", c(\"OOB: Bagging\",\"Test: Bagging\", \"OOB: Random Forest\",\n",
+ " \"Test: Random Forest\"),\n",
+ " col = c('darkgreen', 'blue', 'orange', 'red'), lty = 1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "We see that the random forest has a slightly lower error rate than bagging.\n",
+ "\n",
+ "Using the `importance()` function, we can view the importance of each\n",
+ "variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "importance(rf)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Plots can be produced with the `varImpPlot()` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "varImpPlot(rf)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We see that the Gini index decreases most with splits in variable `Thal`.\n",
+ "(A little bit of background: the variable `Thal` reports the outcome of a\n",
+ "Thallium Stress Test, also known as Nuclear Heart Scan. The outcome of this test\n",
+ "can be \"normal\", meaning that blood flows normally at rest and during exercise.\n",
+ "The outcome \"fixed\" means that some area of the heart does not get\n",
+ "enough blood at rest and under stress, whereas the outcome \"reversable\" means that\n",
+ "some area gets enough blood at rest but not under stress.)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once you are ready, answer the question on the first page of\n",
+ "[this quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1114498).\n",
+ "\n",
+ "## Boosting\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To run boosting we use the library `xgboost`.\n",
+ "Because `xgboost` does not accept data frames we will first convert the data into ordinary matrices."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(xgboost)\n",
+ "library(Matrix)\n",
+ "heart.train.x <- sparse.model.matrix(AHD ~ . -1, data = Heart[train,])\n",
+ "heart.test.x = sparse.model.matrix(AHD ~ . -1, data = Heart[-train,])\n",
+ "heart.train.y = Heart[train, \"AHD\"] == \"Yes\"\n",
+ "heart.test.y = Heart[-train, \"AHD\"] == \"Yes\""
+ ]
+ },
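+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The model matrix one-hot encodes the factor variables, so it has more columns\n",
+ "than the `Heart` data frame has predictors. A quick look (a small sketch):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dim(heart.train.x) # observations x one-hot encoded predictors\n",
+ "head(colnames(heart.train.x), 10)"
+ ]
+ },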
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we train with the standard binary classification loss `\"binary:logistic\"`.\n",
+ "For other objective functions see the documentation `?xgboost`.\n",
+ "The arguments of `xgboost` are related to the terminology in the slides as\n",
+ "follows: number of trees `nround` = $B$,\n",
+ "learning rate (or shrinkage parameter) `eta` = $\\lambda$,\n",
+ "maximal tree size `max_depth` = $d$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "boost.heart = xgboost(heart.train.x, label = heart.train.y,\n",
+ " objective = \"binary:logistic\",\n",
+ " eta = 0.001,\n",
+ " max_depth = 4,\n",
+ " nround = 10000)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "yhat.boost = predict(boost.heart, heart.test.x)\n",
+ "table(yhat.boost > 0.5, heart.test.y)\n",
+ "mean((yhat.boost > 0.5) != heart.test.y)"
+ ]
+ },
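+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "For reference, we can recompute the test error of the random forest that was\n",
+ "fitted on the same training set above (a small comparison sketch)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# test error rate of the random forest from the earlier cell\n",
+ "mean(predict(rf, Heart[-train, ]) != Heart[-train, \"AHD\"])"
+ ]
+ },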
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The result is comparable to the one with Random Forests.\n",
+ "\n",
+ "Also in xgboost we can look at the importance plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xgb.plot.importance(xgb.importance(model = boost.heart))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "In the following you see the code of the comparison between the different\n",
+ "methods. We use the `Boston` housing data set. The function `getdata` in the\n",
+ "following cell returns reproducible training and test splits that we will use\n",
+ "later."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(MASS)\n",
+ "getdata <- function (seed = 1) {\n",
+ " set.seed(seed)\n",
+ " train = sample(1:nrow(Boston), nrow(Boston)/2)\n",
+ " list(train = train,\n",
+ " boston.test = Boston[-train,\"medv\"],\n",
+ " boston.train.x = as.matrix(Boston[train, names(Boston) != \"medv\"]),\n",
+ " boston.test.x = as.matrix(Boston[-train, names(Boston) != \"medv\"])\n",
+ " )\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "In the next cell we fit bagging, random forests and linear regression on 50\n",
+ "different splits of the data. Have a look at `?with`, if you wonder what it\n",
+ "does."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(randomForest)\n",
+ "fit.and.evaluate <- function(seed = 1, method = randomForest, ...) {\n",
+ " with(getdata(seed), {\n",
+ " rf <- method(medv ~ ., Boston[train,], ...)\n",
+ " pred <- predict(rf, Boston[-train,])\n",
+ " mean((pred - Boston[-train,\"medv\"])^2)\n",
+ " })\n",
+ "}\n",
+ "rf.res <- sapply(1:50, fit.and.evaluate)\n",
+ "bag.res <- sapply(1:50, fit.and.evaluate, mtry = 13)\n",
+ "lm.res <- sapply(1:50, fit.and.evaluate, method = lm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Next we do the same for boosting..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(xgboost)\n",
+ "boost.func <- function(seed = 1) {\n",
+ " with(getdata(seed), {\n",
+ " boston.train.xgb <- xgb.DMatrix(boston.train.x,\n",
+ " label = Boston[train, \"medv\"])\n",
+ " boston.test.xgb <- xgb.DMatrix(boston.test.x,\n",
+ " label = Boston[-train, \"medv\"])\n",
+ " boost.boston <- xgb.train(data = boston.train.xgb,\n",
+ " nround = 2000,\n",
+ " max_depth = 4,\n",
+ " eta = 0.005,\n",
+ " verbose = 0,\n",
+ " watchlist = list(train = boston.train.xgb,\n",
+ " test = boston.test.xgb),\n",
+ " objective = \"reg:squarederror\")\n",
+ " yhat.boston <- predict(boost.boston, boston.test.xgb)\n",
+ " mean((yhat.boston - boston.test)^2)\n",
+ " })\n",
+ "}\n",
+ "boost.res <- sapply(1:50, boost.func)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "... and for neural networks. Here we go additionally through the hassle of\n",
+ "scaling the data appropriately."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(keras)\n",
+ "get.scale <- function(scaled) {\n",
+ " if (\"scaled:center\" %in% names(attributes(scaled))) {\n",
+ " center <- attr(scaled, \"scaled:center\")\n",
+ " } else {\n",
+ " center <- rep(0, ncol(scaled))\n",
+ " }\n",
+ " if (\"scaled:scale\" %in% names(attributes(scaled))) {\n",
+ " list(center, attr(scaled, \"scaled:scale\"))\n",
+ " } else {\n",
+ " list(center, rep(1., length(center)))\n",
+ " }\n",
+ "}\n",
+ "boston.x.scale <- function(x, scaled) {\n",
+ " s <- get.scale(scaled)\n",
+ " centered <- sweep(x, 2, s[[1]])\n",
+ " sweep(centered, 2, s[[2]], FUN = \"/\")\n",
+ "}\n",
+ "boston.y.scale <- function(y, scaled) {\n",
+ " s <- get.scale(scaled)\n",
+ " (y - s[[1]])/s[[2]]\n",
+ "}\n",
+ "boston.y.unscale <- function(y, scaled) {\n",
+ " s <- get.scale(scaled)\n",
+ " y * s[[2]] + s[[1]]\n",
+ "}\n",
+ "keras.func <- function (seed = 1) {\n",
+ " with(getdata(seed), {\n",
+ " boston.train.x.prep <- scale(boston.train.x, center = T, scale = T)\n",
+ " boston.train.y.prep <- scale(Boston[train, \"medv\"], center = T, scale = T)\n",
+ " nn <- keras_model_sequential()\n",
+ " nn <- nn %>%\n",
+ " layer_dense(200, activation = \"relu\", input_shape = c(13)) %>%\n",
+ " layer_dropout(rate = .25) %>%\n",
+ " layer_dense(200, activation = \"relu\") %>%\n",
+ " layer_dropout(rate = .25) %>%\n",
+ " layer_dense(1, activation = \"linear\")\n",
+ " nn %>% compile(optimizer = \"adam\", loss = \"mean_squared_error\")\n",
+ " history <- nn %>% fit(boston.train.x.prep,\n",
+ " boston.train.y.prep,\n",
+ " verbose = 0,\n",
+ " batch_size = length(boston.train.y.prep),\n",
+ " validation_data = list(boston.x.scale(boston.test.x, boston.train.x.prep),\n",
+ " boston.y.scale(boston.test, boston.train.y.prep)),\n",
+ " epochs = 500)\n",
+ " nn.pred <- predict(nn, boston.x.scale(boston.test.x, boston.train.x.prep))\n",
+ " mean((boston.y.unscale(nn.pred, boston.train.y.prep) - boston.test)^2)\n",
+ " })\n",
+ "}\n",
+ "keras.res <- sapply(1:50, keras.func)"
+ ]
+ },
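+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The scaling helpers can be checked on toy data: unscaling after scaling should\n",
+ "recover the original values (a quick sketch)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y <- c(1, 2, 3)\n",
+ "ys <- scale(y, center = T, scale = T)\n",
+ "all.equal(as.numeric(boston.y.unscale(boston.y.scale(y, ys), ys)), y)"
+ ]
+ },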
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "After all this, let us produce the summary plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "res <- data.frame(boost = boost.res, keras = keras.res,\n",
+ " rforest = rf.res, bag = bag.res, linreg = lm.res)\n",
+ "boxplot(res, ylab = \"test error\")"
+ ]
+ },
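+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Besides the boxplot, we can compare the average test errors over the 50 splits\n",
+ "directly (an optional sketch)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "round(colMeans(res), 2) # mean test MSE per method"
+ ]
+ },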
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now you can answer the question on the second page of\n",
+ "[this quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1114498).\n",
+ "\n",
+ "## Beyond Supervised Learning"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There is no code for this section.\n",
+ "\n",
+ "## Exercises\n",
+ "## Conceptual\n",
+ "\n",
+ "**Q1.**\n",
+ "We want to classify red versus blue points.\n",
+ "The numbers next to the data point indicate the index of the data point in the\n",
+ "training set, e.g. the 5th point of the training set has $X_1 = 2, X_2 = 2, Y =\n",
+ "\\mbox{\"blue\"}$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "echo": false
+ },
+ "outputs": [],
+ "source": [
+ "plot(c(1, 2, 3, 1, 2, 3, 1, 2, 3), c(1, 1, 1, 2, 2, 2, 3, 3, 3),\n",
+ " col = c(\"red\", \"blue\")[c(1, 1, 2, 1, 2, 2, 2, 1, 1)],\n",
+ " xlab = \"X1\", ylab = \"X2\")\n",
+ "text(c(1, 2, 3, 1, 2, 3, 1, 2, 3), c(1, 1, 1, 2, 2, 2, 3, 3, 3),\n",
+ " seq(1, 9), pos = 4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "a) Compute the entropy loss of the whole data set without considering a split.\n",
+ "Use the logarithm $\\log_2$ with base 2 to compute the entropy.\n",
+ "\n",
+ "b) Compute the Gini index for the same setting.\n",
+ "\n",
+ "c) Find the split that maximally reduces the entropy loss and compute the\n",
+ "reduction in loss and the value of the leaf nodes.\n",
+ "Make sure to weight the contributions to the reduction in loss of the different\n",
+ "regions with the number of data points they contain.\n",
+ "\n",
+ "d) Consider bagging with bootstrap training sets $b_1 = \\{1, 2, 3, 4, 5, 6, 7,\n",
+ "8, 8\\}$ and $b_2 = \\{1, 2, 3, 4, 5, 6, 6, 7, 8\\}$. Where are the first splits of\n",
+ "the trees fitted to $b_1$ and $b_2$, respectively?\n",
+ "\n",
+ "e) Compute the out-of-bag test error estimate for the bagged trees in d)\n",
+ "\n",
+ "f) Consider a Random Forest with $m=1$ and the same bootstrap training sets as\n",
+ "in d). Are the first splits of the two trees in this Random Forest at the same\n",
+ "place as in bagging? Justify your answer.\n",
+ "\n",
+ "**Q2.**\n",
+ "We want to apply boosting to the following regression problem.\n",
+ "This time, the numbers next to the data points indicate the value of the\n",
+ "response, e.g. our training set contains a data point with $X_1 = 1, X_2 = 1, Y = 1$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "echo": false
+ },
+ "outputs": [],
+ "source": [
+ "plot(c(1, 2, 3, 4), c(1, 4, 3, 2), xlab = \"X1\", ylab = \"X2\")\n",
+ "text(c(1, 2, 3, 4), c(1, 4, 3, 2), c(1, 3, 5, 7), pos = 4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We choose $d = 1$, $B = 2$ and $\\lambda = 0.5$. Follow the boosting algorithm\n",
+ "outlined in the slides of the lecture (it differs in the initialization from\n",
+ "algorithm 8.2 from the book).\n",
+ "\n",
+ "a) Compute $f_0(X_1,X_2)$\n",
+ "\n",
+ "b) Compute the residuals and determine the splits and leaf values of the first\n",
+ "tree and write $f_1(X_1,X_2)$ explicitly.\n",
+ "\n",
+ "c) Compute the residuals and determine the splits and leaf values of the second\n",
+ "tree and write $f_2(X_1,X_2)$ explicitly.\n",
+ "\n",
+ "## Applied\n",
+ "\n",
+ "**Q3.**\n",
+ "We now use boosting to predict `Salary` in the `Hitters` data set.\n",
+ "The `Hitters` data is in `library(ISRL)`.\n",
+ "\n",
+ "a) Remove the observations for whom the salary information is unknown, and then\n",
+ "log-transform the salaries.\n",
+ "\n",
+ "b) Create a training set consisting of the first 200 observations, and a test\n",
+ "set consisting of the remaining observations.\n",
+ "\n",
+ "c) Perform boosting on the training set with 1000 trees for a range of values of\n",
+ "the learning rate $\\lambda$. Produce a plot with different learning rate\n",
+ "values on the $x$-axis and the corresponding training set MSE on the $y$-axis.\n",
+ "\n",
+ "d) Produce a plot with different learning rate values on the $x$-axis and the\n",
+ "corresponding test set MSE on the $y$-axis.\n",
+ "\n",
+ "e) Compare the test MSE of boosting to the test MSE that results from applying\n",
+ "linear regression.\n",
+ "\n",
+ "f) Which variables appear to be the most important predictors in the boosted\n",
+ "model?\n",
+ "\n",
+ "g) Now apply bagging to the training set. What is the test set MSE for this\n",
+ "approach?\n",
+ "\n",
+ "**Q4.**\n",
+ "Use boosting to classify images in the Histopathalogic Cancer Detection data set\n",
+ "that we studied in the last exercise of sheet 7* - part 2. Try different\n",
+ "parameter values for $B$, $\\lambda$ and $d$. To tell R that the 0s and 1s in\n",
+ "`PCaml_y` should be treated as values of a categorical response, you may use\n",
+ "`PCaml_y <- as.factor(PCaml_y)`. Compare your results to the ones obtained with\n",
+ "logistic regression and convolutional networks. Which are the important factors (pixels)?\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "R",
+ "language": "R",
+ "name": "ir"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}