diff --git a/src/lecture6.ipynb b/src/lecture6.ipynb
new file mode 100644
index 0000000..00aaac1
--- /dev/null
+++ b/src/lecture6.ipynb
@@ -0,0 +1,602 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "# Lecture 6\n",
+    "\n",
+    "## Discussion of Student Feedback"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "IRdisplay::display_html('<iframe width=\"640\" height=\"360\" src=\"https://tube.switch.ch/embed/88263ccb\" frameborder=\"0\" allow=\"fullscreen\" allowfullscreen></iframe>')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "## Gradient Descent"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "IRdisplay::display_html('<iframe width=\"640\" height=\"360\" src=\"https://tube.switch.ch/embed/4cad6014\" frameborder=\"0\" allow=\"fullscreen\" allowfullscreen></iframe>')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "In the following cell we define the function `gradient_descent` as discussed in\n",
+    "the video. We will make use of the function `auto_diff` from the `ADtools`\n",
+    "library to compute all partial derivatives with respect to (`wrt`) the\n",
+    "parameters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "library(ADtools)\n",
+    "gradient_descent <- function(f, params, fix = list(),\n",
+    "                             learning_rate = 0.01,\n",
+    "                             tol = 1e-6,\n",
+    "                             maxsteps = 10^3,\n",
+    "                             show = F) {\n",
+    "  history <- rep(0, maxsteps)\n",
+    "  for (i in 1:maxsteps) {\n",
+    "    df <- auto_diff(f, at = append(params, fix), wrt = names(params))\n",
+    "    if (show) print(df@x)\n",
+    "    history[i] <- df@x\n",
+    "    delta <- learning_rate * as.numeric(df@dx)\n",
+    "    params <- relist(unlist(params) - delta, params)\n",
+    "    if (max(abs(delta)) < tol) break\n",
+    "  }\n",
+    "  append(params, list(history = history))\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "### A Very Simple 2-Dimensional Example\n",
+    "\n",
+    "Let us test our gradient descent function with the following simple example of a\n",
+    "function $f(\\beta) = (\\beta_1 - 1)^2 + (\\beta_2 - 2)^2$ which has its minimum at\n",
+    "$\\hat\\beta_1 = 1$ and $\\hat\\beta_2 = 2$ and can be written in\n",
+    "code as `sum((params - c(1, 2))^2)`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "f <- function(params) sum((params - c(1, 2))^2)\n",
+    "result <- gradient_descent(f, params = list(params = c(1, 1)))\n",
+    "result$params"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "We see that gradient descent found roughly the minimum approximately.\n",
+    "It may not have found the minimum perfectly, because gradient descent typically\n",
+    "fluctuates around the minimum, if the learning rate remains constant.\n",
+    "\n",
+    "### Linear Regression\n",
+    "Let us now prepare a linear regression example. We will generate the data and\n",
+    "perform the fit with 3 predictors but without an intercept."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "n <- 10^3\n",
+    "p <- 3\n",
+    "X <- matrix(rnorm(n * p), nrow = n)\n",
+    "params <- rnorm(p)\n",
+    "Y <- X %*% params + 0.1*rnorm(n)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "Our training set consists of `X` and `Y`.\n",
+    "\n",
+    "In the following cell we define our loss function for linear regression.\n",
+    "We do not include the variance term $\\frac1{2\\sigma^2}$ in the scaling, because\n",
+    "this would only change the scaling of the loss function, but not the point where\n",
+    "it has its minimum. Note, how we implement the sum over training examples with\n",
+    "vectors and matrices, i.e. $\\frac1n\\sum_{i=1}^n(y_i - x_i^T\\beta)^2 =\n",
+    "\\mathrm{mean}(Y - X \\beta)$, where $Y$ is the vector of responses and $X$ is a\n",
+    "matrix with rows containing the input points."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "lm_loss <- function(params, X, Y) mean((Y - X %*% params)^2)\n",
+    "result <- gradient_descent(f = lm_loss,\n",
+    "                           params = list(params = rnorm(p)),\n",
+    "                           fix = list(X = X, Y = Y),\n",
+    "                           show = T)\n",
+    "result$params"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "Because we used the `show = T` argument, our `gradient_descent` function shows\n",
+    "how the loss function is decreasing as gradient descent progresses. You may need\n",
+    "to scroll down in the output to see the result.\n",
+    "\n",
+    "Let us compare the result we obtained with the standard function of R.\n",
+    "We use `-1` in the formula to exclude the intercept."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "coef(lm(Y ~ X - 1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "Pretty close to our gradient descent result...\n",
+    "\n",
+    "### Logistic Regression\n",
+    "\n",
+    "We do the same now for logistic regression. We fit again without intercept.\n",
+    "\n",
+    "First we prepare the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "logistic <- function(x) 1/(1 + exp(-x))\n",
+    "Y <- logistic(-X %*% params) > runif(n)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "The response `Y` above uses 0/1 coding, which we transform with `(2*Y - 1)`\n",
+    "to -1/1 coding to use the simple formula from the slides (see also Exercise\n",
+    "Q1.)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lr_loss <- function(params, X, Y) -sum(log(logistic((2*Y - 1) * X %*% params)))\n",
+    "result <- gradient_descent(f = lr_loss, learning_rate = 1e-3,\n",
+    "                           params = list(params = rnorm(p, 1)),\n",
+    "                           fix = list(Y = Y, X = X))\n",
+    "result$params"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "Again we can compare to the result obtained with the standard method to see that\n",
+    "we got pretty close."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "coef(glm(Y ~ X - 1, family = \"binomial\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "### Multinomial Logistic Regression\n",
+    "In the following cell I use again the multivariate normal distribution from the\n",
+    "`MASS` library to sample some input points from multivariate normal\n",
+    "distributions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "set.seed(2)\n",
+    "library(MASS)\n",
+    "x1 <- mvrnorm(30, c(.5, 1), .3*diag(2))\n",
+    "x2 <- mvrnorm(30, c(.5, -1), .3*diag(2))\n",
+    "x3 <- mvrnorm(30, c(-1, 0), .3*diag(2))\n",
+    "x <- rbind(x1, x2, x3)\n",
+    "y <- c(rep(1, 30), rep(2, 30), rep(3, 30))\n",
+    "plot(x, col = y, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "In the following cell we define first the conditional probability distribution\n",
+    "of multinomial logistic regression. Both `X` and `params` can be matrices. The\n",
+    "matrix `X` would be our standard matrix with different input points as rows.\n",
+    "The matrix `params` is constructed from the vectors $\\beta_1, \\ldots, \\beta_K$ as\n",
+    "columns. Thus, the matrix product `X %*% params` returns a matrix from which we can\n",
+    "compute the conditional probabilities of the different classes (in the different\n",
+    "columns) for every input vector (in the different rows).\n",
+    "In the second line we define the Bayes classifier function for input `X` and\n",
+    "parameters `params`, which computes the conditional probabilities for all inputs\n",
+    "and than iterates over all rows to pick the index with the maximal value.\n",
+    "Finally, we define the loss function for multiple logistic regression.\n",
+    "We compute the first term $\\sum_{i=1}^n x_i^T\\beta_{y_i}$ of the loss function in a\n",
+    "somewhat funny way: $\\sum_{i=1}^n \\sum_{k=1}^Kx_i^T\\beta_kI(y_i = k)$ where\n",
+    "$I(y_i = k) = 1$ if the $i$-th training sample is in class $k$ and $I(y_i = k) =\n",
+    "0$ otherwise, because\n",
+    "we can write this easily in matrix notation.\n",
+    "The function `model.matrix(~factor(y) - 1)` computes all the necessary zeros and ones.\n",
+    "You may want to look at the output of `model.matrix(~ factor(c(1, 2, 3, 2)) - 1)`\n",
+    "to understand this better."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mr_conditional <- function(X, params) exp(X %*% params)/rowSums(exp(X %*% params))\n",
+    "mr_bayes_classifier <- function(X, params) apply(mr_conditional(X, params), 1, function(row) which.max(row))\n",
+    "mr_loss <- function(params, X, Y) {\n",
+    "    x_times_params <- X %*% params\n",
+    "    label_matrix <- model.matrix(~ factor(Y) - 1)\n",
+    "    1/length(Y) * (-sum(x_times_params * label_matrix) + sum(log(rowSums(exp(x_times_params)))))\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "Now we define an initial parameter value for the 3 beta vectors and run 10\n",
+    "times 5 steps of gradient descent and plot the intermediate values of the\n",
+    "parameters vectors and the classification results to see how gradient descent\n",
+    "progresses."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "params <- matrix(rnorm(3*2), nrow = 2)\n",
+    "for (i in 1:10) {\n",
+    "    params <- gradient_descent(f = mr_loss, learning_rate = 1e-1,\n",
+    "                               maxsteps = 5,\n",
+    "                               params = list(params = params),\n",
+    "                               fix = list(X = x, Y = y))$params\n",
+    "    pred <- mr_bayes_classifier(x, params)\n",
+    "    plot(x, col = pred, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))\n",
+    "    points(x[y != pred,], pch = 4)\n",
+    "    arrows(rep(0, 3), rep(0, 3), params[1,], params[2,], col = 'red')\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "After only 50 steps, gradient descent will not have converged, as we can see\n",
+    "when we plot the learning curve of 500 iterations of gradient descent."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "params <- matrix(rnorm(3*2), nrow = 2)\n",
+    "result <- gradient_descent(f = mr_loss, learning_rate = 1e-1,\n",
+    "                            maxsteps = 500,\n",
+    "                            params = list(params = params),\n",
+    "                            fix = list(X = x, Y = y))\n",
+    "plot(result$history, ylab = \"training loss\", xlab = \"iteration\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "The library `nnet` computes multinomial logistic regression in a different way\n",
+    "and finds different values for the coefficients, but it finds the same\n",
+    "predictions as the ones we found with gradient descent."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "library(nnet)\n",
+    "pred2 <- predict(multinom(as.factor(y) ~ x - 1), type = \"class\")\n",
+    "mr_bayes_classifier(x, result$params) == pred2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "### Regularization\n",
+    "\n",
+    "In the following cell we define a function that takes any loss function and\n",
+    "returns a regularized loss function. Note that this function is somewhat\n",
+    "special. In contrast to other functions that return a value or a data frame,\n",
+    "this function returns a new function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "regularized_loss <- function(loss, lambda0 = 0, lambda1 = 0) {\n",
+    "    function(params, X, Y) {\n",
+    "        loss(params, X, Y) + lambda0 * sum(sqrt(params^2)) + lambda1 * sum(params^2)\n",
+    "    }\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "We will look at regularization again in a linear regression setting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "n <- 10^3\n",
+    "p <- 3\n",
+    "X <- matrix(rnorm(n * p), nrow = n)\n",
+    "params <- rnorm(p)\n",
+    "Y <- X %*% params + 0.1*rnorm(n)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "Play in the following cell with different values for the regularization\n",
+    "hyper-parameters `lambda0` and `lambda1` and observe how the results are\n",
+    "changing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "result <- gradient_descent(regularized_loss(lm_loss, lambda1 = 5),\n",
+    "                           learning_rate = 1e-4,\n",
+    "                           params = list(params = rnorm(p)),\n",
+    "                           fix = list(X = X, Y = Y))\n",
+    "result$params"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "source": [
+    "You are now ready to do the\n",
+    "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1104947).\n",
+    "\n",
+    "## Exercises\n",
+    "\n",
+    "### Conceptual\n",
+    "\n",
+    "**Q1.** The formulas of logistic regression can be written in different forms\n",
+    "depending on how the two response classes are encoded. We will use the notation\n",
+    "$\\sigma(x) = \\frac1{1 + e^{-x}}$ for the logistic function.\n",
+    "First, we look at the case where $y \\in \\{0, 1\\}$.\n",
+    "\n",
+    "(a) Prove that $\\mathrm{Pr}(Y = 0 | X = x, \\beta) = \\frac1{1 + e^{x^T\\beta}} =\n",
+    "\\sigma(-x^T\\beta)$ if $\\mathrm{Pr}(Y = 1 | X = x, \\beta) = \\sigma(x^T\\beta)$.\n",
+    "\n",
+    "(b) Prove that we can write the negative log-likelihood loss as $\\mathcal\n",
+    "L(\\beta) = -\\sum_{i = 1}^ny_i\\log(\\sigma(x_i^T\\beta)) + (1 - y_i)\\log(1 -\n",
+    "\\sigma(x_i^T\\beta))$\n",
+    "\n",
+    "(c) Now we study the case where $y\\in\\{-1, 1\\}$. Prove that we can write in this\n",
+    "case $\\mathrm{Pr}(Y = y | X = x, \\beta)  = \\sigma(yx^T\\beta)$.\n",
+    "\n",
+    "(d) Prove that we can write in this case the negative log-likelihood loss as\n",
+    "$\\mathcal L(\\beta) = -\\sum_{i = 1}^n\\log(\\sigma(y_ix_i^T\\beta))$.\n",
+    "\n",
+    "### Applied\n",
+    "\n",
+    "**Q2.** Assume the noise in a linear regression setting comes from a Laplace\n",
+    "distribution, i.e. the conditional probability of the response is given by\n",
+    "$\\mathrm{Pr}(Y = y | X = x, \\beta) = \\frac1{2s}\\exp(-\\frac{|y - x^T\\beta|}{s})$.\n",
+    "For simplicity we assume throughout this exercise that the intercept is 0 and\n",
+    "does not need to be fitted.\n",
+    "\n",
+    "(a) Generate a training set with 30 points, 3 predictors with values uniformly\n",
+    "distributed in $[0, 1]$ and responses with Laplace distributed noise with scale\n",
+    "parameters $s = 0.3$.\n",
+    "You can sample `n` noise values with mean zero and scale parameters $s$ of the\n",
+    "Laplace distribution with the following function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rlapl <- function(n, s = 1) {\n",
+    "    p <- runif(n) - .5\n",
+    "    s*log(1 - 2*sqrt(p^2))*sign(p)\n",
+    "}\n",
+    "error <- rlapl(30, s = 0.3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "(b) Compute with paper and pencil the negative log-likelihood loss.\n",
+    "\n",
+    "(c) Write the R-code to compute the loss on the training set for a given\n",
+    "parameter vector. You can complete the function in the cell below.\n",
+    "Use `sqrt(x^2)` to compute $|x|$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "laplace_lr_loss <- function(params, X, Y) {\n",
+    "\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "(d) Perform gradient descent on the training set. Plot the learning curve to see\n",
+    "whether gradient descent has converged. If you see large fluctuations at the end\n",
+    "of training, decrease the learning rate. If the learning curve is not flat at\n",
+    "the end, increase the maximal number of steps.\n",
+    "\n",
+    "(e) Estimate the coefficients with the `lm` function. Hint: do not forget that\n",
+    "we fit without intercept.\n",
+    "\n",
+    "(f) Compare the test error of the results in (d) and (e) for a very large test\n",
+    "set, e.g. 10^6 samples."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R",
+   "language": "R",
+   "name": "ir"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r",
+   "version": "4.0.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}