diff --git a/src/lecture6.ipynb b/src/lecture6.ipynb
new file mode 100644
index 0000000..00aaac1
--- /dev/null
+++ b/src/lecture6.ipynb
@@ -0,0 +1,602 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "# Lecture 6\n",
+ "\n",
+ "## Discussion of Student Feedback"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "## Gradient Descent"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "In the following cell we define the function `gradient_descent` as discussed in\n",
+ "the video. We will make use of the function `auto_diff` from the `ADtools`\n",
+ "library to compute all partial derivatives with respect to (`wrt`) the\n",
+ "parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
+ "outputs": [],
+ "source": [
+ "library(ADtools)\n",
+ "gradient_descent <- function(f, params, fix = list(),\n",
+ " learning_rate = 0.01,\n",
+ " tol = 1e-6,\n",
+ " maxsteps = 10^3,\n",
+ " show = F) {\n",
+ " history <- rep(0, maxsteps)\n",
+ " for (i in 1:maxsteps) {\n",
+ " df <- auto_diff(f, at = append(params, fix), wrt = names(params))\n",
+ " if (show) print(df@x)\n",
+ " history[i] <- df@x\n",
+ " delta <- learning_rate * as.numeric(df@dx)\n",
+ " params <- relist(unlist(params) - delta, params)\n",
+ " if (max(abs(delta)) < tol) break\n",
+ " }\n",
+ " append(params, list(history = history))\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "### A Very Simple 2-Dimensional Example\n",
+ "\n",
+ "Let us test our gradient descent function with the following simple example of a\n",
+ "function $f(\\beta) = (\\beta_1 - 1)^2 + (\\beta_2 - 2)^2$ which has its minimum at\n",
+ "$\\hat\\beta_1 = 1$ and $\\hat\\beta_2 = 2$ and can be written in\n",
+ "code as `sum((params - c(1, 2))^2)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "f <- function(params) sum((params - c(1, 2))^2)\n",
+ "result <- gradient_descent(f, params = list(params = c(1, 1)))\n",
+ "result$params"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "We see that gradient descent found roughly the minimum approximately.\n",
+ "It may not have found the minimum perfectly, because gradient descent typically\n",
+ "fluctuates around the minimum, if the learning rate remains constant.\n",
+ "\n",
+ "### Linear Regression\n",
+ "Let us now prepare a linear regression example. We will generate the data and\n",
+ "perform the fit with 3 predictors but without an intercept."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
+ "outputs": [],
+ "source": [
+ "n <- 10^3\n",
+ "p <- 3\n",
+ "X <- matrix(rnorm(n * p), nrow = n)\n",
+ "params <- rnorm(p)\n",
+ "Y <- X %*% params + 0.1*rnorm(n)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Our training set consists of `X` and `Y`.\n",
+ "\n",
+ "In the following cell we define our loss function for linear regression.\n",
+ "We do not include the variance term $\\frac1{2\\sigma^2}$ in the scaling, because\n",
+ "this would only change the scaling of the loss function, but not the point where\n",
+ "it has its minimum. Note, how we implement the sum over training examples with\n",
+ "vectors and matrices, i.e. $\\frac1n\\sum_{i=1}^n(y_i - x_i^T\\beta)^2 =\n",
+ "\\mathrm{mean}(Y - X \\beta)$, where $Y$ is the vector of responses and $X$ is a\n",
+ "matrix with rows containing the input points."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
+ "outputs": [],
+ "source": [
+ "lm_loss <- function(params, X, Y) mean((Y - X %*% params)^2)\n",
+ "result <- gradient_descent(f = lm_loss,\n",
+ " params = list(params = rnorm(p)),\n",
+ " fix = list(X = X, Y = Y),\n",
+ " show = T)\n",
+ "result$params"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Because we used the `show = T` argument, our `gradient_descent` function shows\n",
+ "how the loss function is decreasing as gradient descent progresses. You may need\n",
+ "to scroll down in the output to see the result.\n",
+ "\n",
+ "Let us compare the result we obtained with the standard function of R.\n",
+ "We use `-1` in the formula to exclude the intercept."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "coef(lm(Y ~ X - 1))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Pretty close to our gradient descent result...\n",
+ "\n",
+ "### Logistic Regression\n",
+ "\n",
+ "We do the same now for logistic regression. We fit again without intercept.\n",
+ "\n",
+ "First we prepare the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "logistic <- function(x) 1/(1 + exp(-x))\n",
+ "Y <- logistic(-X %*% params) > runif(n)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The response `Y` above uses 0/1 coding, which we transform with `(2*Y - 1)`\n",
+ "to -1/1 coding to use the simple formula from the slides (see also Exercise\n",
+ "Q1.)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lr_loss <- function(params, X, Y) -sum(log(logistic((2*Y - 1) * X %*% params)))\n",
+ "result <- gradient_descent(f = lr_loss, learning_rate = 1e-3,\n",
+ " params = list(params = rnorm(p, 1)),\n",
+ " fix = list(Y = Y, X = X))\n",
+ "result$params"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Again we can compare to the result obtained with the standard method to see that\n",
+ "we got pretty close."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "coef(glm(Y ~ X - 1, family = \"binomial\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "### Multinomial Logistic Regression\n",
+ "In the following cell I use again the multivariate normal distribution from the\n",
+ "`MASS` library to sample some input points from multivariate normal\n",
+ "distributions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "set.seed(2)\n",
+ "library(MASS)\n",
+ "x1 <- mvrnorm(30, c(.5, 1), .3*diag(2))\n",
+ "x2 <- mvrnorm(30, c(.5, -1), .3*diag(2))\n",
+ "x3 <- mvrnorm(30, c(-1, 0), .3*diag(2))\n",
+ "x <- rbind(x1, x2, x3)\n",
+ "y <- c(rep(1, 30), rep(2, 30), rep(3, 30))\n",
+ "plot(x, col = y, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "In the following cell we define first the conditional probability distribution\n",
+ "of multinomial logistic regression. Both `X` and `params` can be matrices. The\n",
+ "matrix `X` would be our standard matrix with different input points as rows.\n",
+ "The matrix `params` is constructed from the vectors $\\beta_1, \\ldots, \\beta_K$ as\n",
+ "columns. Thus, the matrix product `X %*% params` returns a matrix from which we can\n",
+ "compute the conditional probabilities of the different classes (in the different\n",
+ "columns) for every input vector (in the different rows).\n",
+ "In the second line we define the Bayes classifier function for input `X` and\n",
+ "parameters `params`, which computes the conditional probabilities for all inputs\n",
+ "and than iterates over all rows to pick the index with the maximal value.\n",
+ "Finally, we define the loss function for multiple logistic regression.\n",
+ "We compute the first term $\\sum_{i=1}^n x_i^T\\beta_{y_i}$ of the loss function in a\n",
+ "somewhat funny way: $\\sum_{i=1}^n \\sum_{k=1}^Kx_i^T\\beta_kI(y_i = k)$ where\n",
+ "$I(y_i = k) = 1$ if the $i$-th training sample is in class $k$ and $I(y_i = k) =\n",
+ "0$ otherwise, because\n",
+ "we can write this easily in matrix notation.\n",
+ "The function `model.matrix(~factor(y) - 1)` computes all the necessary zeros and ones.\n",
+ "You may want to look at the output of `model.matrix(~ factor(c(1, 2, 3, 2)) - 1)`\n",
+ "to understand this better."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mr_conditional <- function(X, params) exp(X %*% params)/rowSums(exp(X %*% params))\n",
+ "mr_bayes_classifier <- function(X, params) apply(mr_conditional(X, params), 1, function(row) which.max(row))\n",
+ "mr_loss <- function(params, X, Y) {\n",
+ " x_times_params <- X %*% params\n",
+ " label_matrix <- model.matrix(~ factor(Y) - 1)\n",
+ " 1/length(Y) * (-sum(x_times_params * label_matrix) + sum(log(rowSums(exp(x_times_params)))))\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Now we define an initial parameter value for the 3 beta vectors and run 10\n",
+ "times 5 steps of gradient descent and plot the intermediate values of the\n",
+ "parameters vectors and the classification results to see how gradient descent\n",
+ "progresses."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
+ "outputs": [],
+ "source": [
+ "params <- matrix(rnorm(3*2), nrow = 2)\n",
+ "for (i in 1:10) {\n",
+ " params <- gradient_descent(f = mr_loss, learning_rate = 1e-1,\n",
+ " maxsteps = 5,\n",
+ " params = list(params = params),\n",
+ " fix = list(X = x, Y = y))$params\n",
+ " pred <- mr_bayes_classifier(x, params)\n",
+ " plot(x, col = pred, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))\n",
+ " points(x[y != pred,], pch = 4)\n",
+ " arrows(rep(0, 3), rep(0, 3), params[1,], params[2,], col = 'red')\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "After only 50 steps, gradient descent will not have converged, as we can see\n",
+ "when we plot the learning curve of 500 iterations of gradient descent."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "params <- matrix(rnorm(3*2), nrow = 2)\n",
+ "result <- gradient_descent(f = mr_loss, learning_rate = 1e-1,\n",
+ " maxsteps = 500,\n",
+ " params = list(params = params),\n",
+ " fix = list(X = x, Y = y))\n",
+ "plot(result$history, ylab = \"training loss\", xlab = \"iteration\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The library `nnet` computes multinomial logistic regression in a different way\n",
+ "and finds different values for the coefficients, but it finds the same\n",
+ "predictions as the ones we found with gradient descent."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(nnet)\n",
+ "pred2 <- predict(multinom(as.factor(y) ~ x - 1), type = \"class\")\n",
+ "mr_bayes_classifier(x, result$params) == pred2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "### Regularization\n",
+ "\n",
+ "In the following cell we define a function that takes any loss function and\n",
+ "returns a regularized loss function. Note that this function is somewhat\n",
+ "special. In contrast to other functions that return a value or a data frame,\n",
+ "this function returns a new function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "regularized_loss <- function(loss, lambda0 = 0, lambda1 = 0) {\n",
+ " function(params, X, Y) {\n",
+ " loss(params, X, Y) + lambda0 * sum(sqrt(params^2)) + lambda1 * sum(params^2)\n",
+ " }\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "We will look at regularization again in a linear regression setting."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "n <- 10^3\n",
+ "p <- 3\n",
+ "X <- matrix(rnorm(n * p), nrow = n)\n",
+ "params <- rnorm(p)\n",
+ "Y <- X %*% params + 0.1*rnorm(n)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Play in the following cell with different values for the regularization\n",
+ "hyper-parameters `lambda0` and `lambda1` and observe how the results are\n",
+ "changing."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "result <- gradient_descent(regularized_loss(lm_loss, lambda1 = 5),\n",
+ " learning_rate = 1e-4,\n",
+ " params = list(params = rnorm(p)),\n",
+ " fix = list(X = X, Y = Y))\n",
+ "result$params"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "You are now ready to do the\n",
+ "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1104947).\n",
+ "\n",
+ "## Exercises\n",
+ "\n",
+ "### Conceptual\n",
+ "\n",
+ "**Q1.** The formulas of logistic regression can be written in different forms\n",
+ "depending on how the two response classes are encoded. We will use the notation\n",
+ "$\\sigma(x) = \\frac1{1 + e^{-x}}$ for the logistic function.\n",
+ "First, we look at the case where $y \\in \\{0, 1\\}$.\n",
+ "\n",
+ "(a) Prove that $\\mathrm{Pr}(Y = 0 | X = x, \\beta) = \\frac1{1 + e^{x^T\\beta}} =\n",
+ "\\sigma(-x^T\\beta)$ if $\\mathrm{Pr}(Y = 1 | X = x, \\beta) = \\sigma(x^T\\beta)$.\n",
+ "\n",
+ "(b) Prove that we can write the negative log-likelihood loss as $\\mathcal\n",
+ "L(\\beta) = -\\sum_{i = 1}^ny_i\\log(\\sigma(x_i^T\\beta)) + (1 - y_i)\\log(1 -\n",
+ "\\sigma(x_i^T\\beta))$\n",
+ "\n",
+ "(c) Now we study the case where $y\\in\\{-1, 1\\}$. Prove that we can write in this\n",
+ "case $\\mathrm{Pr}(Y = y | X = x, \\beta) = \\sigma(yx^T\\beta)$.\n",
+ "\n",
+ "(d) Prove that we can write in this case the negative log-likelihood loss as\n",
+ "$\\mathcal L(\\beta) = -\\sum_{i = 1}^n\\log(\\sigma(y_ix_i^T\\beta))$.\n",
+ "\n",
+ "### Applied\n",
+ "\n",
+ "**Q2.** Assume the noise in a linear regression setting comes from a Laplace\n",
+ "distribution, i.e. the conditional probability of the response is given by\n",
+ "$\\mathrm{Pr}(Y = y | X = x, \\beta) = \\frac1{2s}\\exp(-\\frac{|y - x^T\\beta|}{s})$.\n",
+ "For simplicity we assume throughout this exercise that the intercept is 0 and\n",
+ "does not need to be fitted.\n",
+ "\n",
+ "(a) Generate a training set with 30 points, 3 predictors with values uniformly\n",
+ "distributed in $[0, 1]$ and responses with Laplace distributed noise with scale\n",
+ "parameters $s = 0.3$.\n",
+ "You can sample `n` noise values with mean zero and scale parameters $s$ of the\n",
+ "Laplace distribution with the following function"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rlapl <- function(n, s = 1) {\n",
+ " p <- runif(n) - .5\n",
+ " s*log(1 - 2*sqrt(p^2))*sign(p)\n",
+ "}\n",
+ "error <- rlapl(30, s = 0.3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(b) Compute with paper and pencil the negative log-likelihood loss.\n",
+ "\n",
+ "(c) Write the R-code to compute the loss on the training set for a given\n",
+ "parameter vector. You can complete the function in the cell below.\n",
+ "Use `sqrt(x^2)` to compute $|x|$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "laplace_lr_loss <- function(params, X, Y) {\n",
+ "\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(d) Perform gradient descent on the training set. Plot the learning curve to see\n",
+ "whether gradient descent has converged. If you see large fluctuations at the end\n",
+ "of training, decrease the learning rate. If the learning curve is not flat at\n",
+ "the end, increase the maximal number of steps.\n",
+ "\n",
+ "(e) Estimate the coefficients with the `lm` function. Hint: do not forget that\n",
+ "we fit without intercept.\n",
+ "\n",
+ "(f) Compare the test error of the results in (d) and (e) for a very large test\n",
+ "set, e.g. 10^6 samples."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "R",
+ "language": "R",
+ "name": "ir"
+ },
+ "language_info": {
+ "codemirror_mode": "r",
+ "file_extension": ".r",
+ "mimetype": "text/x-r-source",
+ "name": "R",
+ "pygments_lexer": "r",
+ "version": "4.0.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}