diff --git a/src/lecture6.ipynb b/src/lecture6.ipynb new file mode 100644 index 0000000..00aaac1 --- /dev/null +++ b/src/lecture6.ipynb @@ -0,0 +1,602 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "# Lecture 6\n", + "\n", + "## Discussion of Student Feedback" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "IRdisplay::display_html('')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "## Gradient Descent" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "IRdisplay::display_html('')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "In the following cell we define the function `gradient_descent` as discussed in\n", + "the video. We will make use of the function `auto_diff` from the `ADtools`\n", + "library to compute all partial derivatives with respect to (`wrt`) the\n", + "parameters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2 + }, + "outputs": [], + "source": [ + "library(ADtools)\n", + "gradient_descent <- function(f, params, fix = list(),\n", + " learning_rate = 0.01,\n", + " tol = 1e-6,\n", + " maxsteps = 10^3,\n", + " show = F) {\n", + " history <- rep(0, maxsteps)\n", + " for (i in 1:maxsteps) {\n", + " df <- auto_diff(f, at = append(params, fix), wrt = names(params))\n", + " if (show) print(df@x)\n", + " history[i] <- df@x\n", + " delta <- learning_rate * as.numeric(df@dx)\n", + " params <- relist(unlist(params) - delta, params)\n", + " if (max(abs(delta)) < tol) break\n", + " }\n", + " append(params, list(history = history))\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "### A Very Simple 2-Dimensional Example\n", + "\n", + "Let us test our gradient descent function with the following simple example of a\n", + "function $f(\\beta) = (\\beta_1 - 1)^2 + (\\beta_2 - 2)^2$ which has its minimum at\n", + "$\\hat\\beta_1 = 1$ and $\\hat\\beta_2 = 2$ and can be written in\n", + "code as `sum((params - c(1, 2))^2)`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 0 + }, + "outputs": [], + "source": [ + "f <- function(params) sum((params - c(1, 2))^2)\n", + "result <- gradient_descent(f, params = list(params = c(1, 1)))\n", + "result$params" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "We see that gradient descent found roughly the minimum approximately.\n", + "It may not have found the minimum perfectly, because gradient descent typically\n", + "fluctuates around the minimum, if the learning rate remains constant.\n", + "\n", + "### Linear Regression\n", + "Let us now prepare a linear regression example. We will generate the data and\n", + "perform the fit with 3 predictors but without an intercept." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2 + }, + "outputs": [], + "source": [ + "n <- 10^3\n", + "p <- 3\n", + "X <- matrix(rnorm(n * p), nrow = n)\n", + "params <- rnorm(p)\n", + "Y <- X %*% params + 0.1*rnorm(n)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "Our training set consists of `X` and `Y`.\n", + "\n", + "In the following cell we define our loss function for linear regression.\n", + "We do not include the variance term $\\frac1{2\\sigma^2}$ in the scaling, because\n", + "this would only change the scaling of the loss function, but not the point where\n", + "it has its minimum. Note, how we implement the sum over training examples with\n", + "vectors and matrices, i.e. $\\frac1n\\sum_{i=1}^n(y_i - x_i^T\\beta)^2 =\n", + "\\mathrm{mean}(Y - X \\beta)$, where $Y$ is the vector of responses and $X$ is a\n", + "matrix with rows containing the input points." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2 + }, + "outputs": [], + "source": [ + "lm_loss <- function(params, X, Y) mean((Y - X %*% params)^2)\n", + "result <- gradient_descent(f = lm_loss,\n", + " params = list(params = rnorm(p)),\n", + " fix = list(X = X, Y = Y),\n", + " show = T)\n", + "result$params" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "Because we used the `show = T` argument, our `gradient_descent` function shows\n", + "how the loss function is decreasing as gradient descent progresses. You may need\n", + "to scroll down in the output to see the result.\n", + "\n", + "Let us compare the result we obtained with the standard function of R.\n", + "We use `-1` in the formula to exclude the intercept." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 0 + }, + "outputs": [], + "source": [ + "coef(lm(Y ~ X - 1))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "Pretty close to our gradient descent result...\n", + "\n", + "### Logistic Regression\n", + "\n", + "We do the same now for logistic regression. We fit again without intercept.\n", + "\n", + "First we prepare the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "logistic <- function(x) 1/(1 + exp(-x))\n", + "Y <- logistic(-X %*% params) > runif(n)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "The response `Y` above uses 0/1 coding, which we transform with `(2*Y - 1)`\n", + "to -1/1 coding to use the simple formula from the slides (see also Exercise\n", + "Q1.)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lr_loss <- function(params, X, Y) -sum(log(logistic((2*Y - 1) * X %*% params)))\n", + "result <- gradient_descent(f = lr_loss, learning_rate = 1e-3,\n", + " params = list(params = rnorm(p, 1)),\n", + " fix = list(Y = Y, X = X))\n", + "result$params" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "Again we can compare to the result obtained with the standard method to see that\n", + "we got pretty close." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "coef(glm(Y ~ X - 1, family = \"binomial\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "### Multinomial Logistic Regression\n", + "In the following cell I use again the multivariate normal distribution from the\n", + "`MASS` library to sample some input points from multivariate normal\n", + "distributions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set.seed(2)\n", + "library(MASS)\n", + "x1 <- mvrnorm(30, c(.5, 1), .3*diag(2))\n", + "x2 <- mvrnorm(30, c(.5, -1), .3*diag(2))\n", + "x3 <- mvrnorm(30, c(-1, 0), .3*diag(2))\n", + "x <- rbind(x1, x2, x3)\n", + "y <- c(rep(1, 30), rep(2, 30), rep(3, 30))\n", + "plot(x, col = y, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "In the following cell we define first the conditional probability distribution\n", + "of multinomial logistic regression. Both `X` and `params` can be matrices. The\n", + "matrix `X` would be our standard matrix with different input points as rows.\n", + "The matrix `params` is constructed from the vectors $\\beta_1, \\ldots, \\beta_K$ as\n", + "columns. Thus, the matrix product `X %*% params` returns a matrix from which we can\n", + "compute the conditional probabilities of the different classes (in the different\n", + "columns) for every input vector (in the different rows).\n", + "In the second line we define the Bayes classifier function for input `X` and\n", + "parameters `params`, which computes the conditional probabilities for all inputs\n", + "and than iterates over all rows to pick the index with the maximal value.\n", + "Finally, we define the loss function for multiple logistic regression.\n", + "We compute the first term $\\sum_{i=1}^n x_i^T\\beta_{y_i}$ of the loss function in a\n", + "somewhat funny way: $\\sum_{i=1}^n \\sum_{k=1}^Kx_i^T\\beta_kI(y_i = k)$ where\n", + "$I(y_i = k) = 1$ if the $i$-th training sample is in class $k$ and $I(y_i = k) =\n", + "0$ otherwise, because\n", + "we can write this easily in matrix notation.\n", + "The function `model.matrix(~factor(y) - 1)` computes all the necessary zeros and ones.\n", + "You may want to look at the output of `model.matrix(~ factor(c(1, 2, 3, 2)) - 1)`\n", + "to understand this better." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mr_conditional <- function(X, params) exp(X %*% params)/rowSums(exp(X %*% params))\n", + "mr_bayes_classifier <- function(X, params) apply(mr_conditional(X, params), 1, function(row) which.max(row))\n", + "mr_loss <- function(params, X, Y) {\n", + " x_times_params <- X %*% params\n", + " label_matrix <- model.matrix(~ factor(Y) - 1)\n", + " 1/length(Y) * (-sum(x_times_params * label_matrix) + sum(log(rowSums(exp(x_times_params)))))\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "Now we define an initial parameter value for the 3 beta vectors and run 10\n", + "times 5 steps of gradient descent and plot the intermediate values of the\n", + "parameters vectors and the classification results to see how gradient descent\n", + "progresses." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2 + }, + "outputs": [], + "source": [ + "params <- matrix(rnorm(3*2), nrow = 2)\n", + "for (i in 1:10) {\n", + " params <- gradient_descent(f = mr_loss, learning_rate = 1e-1,\n", + " maxsteps = 5,\n", + " params = list(params = params),\n", + " fix = list(X = x, Y = y))$params\n", + " pred <- mr_bayes_classifier(x, params)\n", + " plot(x, col = pred, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))\n", + " points(x[y != pred,], pch = 4)\n", + " arrows(rep(0, 3), rep(0, 3), params[1,], params[2,], col = 'red')\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "After only 50 steps, gradient descent will not have converged, as we can see\n", + "when we plot the learning curve of 500 iterations of gradient descent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "params <- matrix(rnorm(3*2), nrow = 2)\n", + "result <- gradient_descent(f = mr_loss, learning_rate = 1e-1,\n", + " maxsteps = 500,\n", + " params = list(params = params),\n", + " fix = list(X = x, Y = y))\n", + "plot(result$history, ylab = \"training loss\", xlab = \"iteration\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "The library `nnet` computes multinomial logistic regression in a different way\n", + "and finds different values for the coefficients, but it finds the same\n", + "predictions as the ones we found with gradient descent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "library(nnet)\n", + "pred2 <- predict(multinom(as.factor(y) ~ x - 1), type = \"class\")\n", + "mr_bayes_classifier(x, result$params) == pred2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "### Regularization\n", + "\n", + "In the following cell we define a function that takes any loss function and\n", + "returns a regularized loss function. Note that this function is somewhat\n", + "special. In contrast to other functions that return a value or a data frame,\n", + "this function returns a new function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "regularized_loss <- function(loss, lambda0 = 0, lambda1 = 0) {\n", + " function(params, X, Y) {\n", + " loss(params, X, Y) + lambda0 * sum(sqrt(params^2)) + lambda1 * sum(params^2)\n", + " }\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "We will look at regularization again in a linear regression setting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "n <- 10^3\n", + "p <- 3\n", + "X <- matrix(rnorm(n * p), nrow = n)\n", + "params <- rnorm(p)\n", + "Y <- X %*% params + 0.1*rnorm(n)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "Play in the following cell with different values for the regularization\n", + "hyper-parameters `lambda0` and `lambda1` and observe how the results are\n", + "changing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 0 + }, + "outputs": [], + "source": [ + "result <- gradient_descent(regularized_loss(lm_loss, lambda1 = 5),\n", + " learning_rate = 1e-4,\n", + " params = list(params = rnorm(p)),\n", + " fix = list(X = X, Y = Y))\n", + "result$params" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0 + }, + "source": [ + "You are now ready to do the\n", + "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1104947).\n", + "\n", + "## Exercises\n", + "\n", + "### Conceptual\n", + "\n", + "**Q1.** The formulas of logistic regression can be written in different forms\n", + "depending on how the two response classes are encoded. We will use the notation\n", + "$\\sigma(x) = \\frac1{1 + e^{-x}}$ for the logistic function.\n", + "First, we look at the case where $y \\in \\{0, 1\\}$.\n", + "\n", + "(a) Prove that $\\mathrm{Pr}(Y = 0 | X = x, \\beta) = \\frac1{1 + e^{x^T\\beta}} =\n", + "\\sigma(-x^T\\beta)$ if $\\mathrm{Pr}(Y = 1 | X = x, \\beta) = \\sigma(x^T\\beta)$.\n", + "\n", + "(b) Prove that we can write the negative log-likelihood loss as $\\mathcal\n", + "L(\\beta) = -\\sum_{i = 1}^ny_i\\log(\\sigma(x_i^T\\beta)) + (1 - y_i)\\log(1 -\n", + "\\sigma(x_i^T\\beta))$\n", + "\n", + "(c) Now we study the case where $y\\in\\{-1, 1\\}$. Prove that we can write in this\n", + "case $\\mathrm{Pr}(Y = y | X = x, \\beta) = \\sigma(yx^T\\beta)$.\n", + "\n", + "(d) Prove that we can write in this case the negative log-likelihood loss as\n", + "$\\mathcal L(\\beta) = -\\sum_{i = 1}^n\\log(\\sigma(y_ix_i^T\\beta))$.\n", + "\n", + "### Applied\n", + "\n", + "**Q2.** Assume the noise in a linear regression setting comes from a Laplace\n", + "distribution, i.e. the conditional probability of the response is given by\n", + "$\\mathrm{Pr}(Y = y | X = x, \\beta) = \\frac1{2s}\\exp(-\\frac{|y - x^T\\beta|}{s})$.\n", + "For simplicity we assume throughout this exercise that the intercept is 0 and\n", + "does not need to be fitted.\n", + "\n", + "(a) Generate a training set with 30 points, 3 predictors with values uniformly\n", + "distributed in $[0, 1]$ and responses with Laplace distributed noise with scale\n", + "parameters $s = 0.3$.\n", + "You can sample `n` noise values with mean zero and scale parameters $s$ of the\n", + "Laplace distribution with the following function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rlapl <- function(n, s = 1) {\n", + " p <- runif(n) - .5\n", + " s*log(1 - 2*sqrt(p^2))*sign(p)\n", + "}\n", + "error <- rlapl(30, s = 0.3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(b) Compute with paper and pencil the negative log-likelihood loss.\n", + "\n", + "(c) Write the R-code to compute the loss on the training set for a given\n", + "parameter vector. You can complete the function in the cell below.\n", + "Use `sqrt(x^2)` to compute $|x|$." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "laplace_lr_loss <- function(params, X, Y) {\n", + "\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(d) Perform gradient descent on the training set. Plot the learning curve to see\n", + "whether gradient descent has converged. If you see large fluctuations at the end\n", + "of training, decrease the learning rate. If the learning curve is not flat at\n", + "the end, increase the maximal number of steps.\n", + "\n", + "(e) Estimate the coefficients with the `lm` function. Hint: do not forget that\n", + "we fit without intercept.\n", + "\n", + "(f) Compare the test error of the results in (d) and (e) for a very large test\n", + "set, e.g. 10^6 samples." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "4.0.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}