{ "cells": [ { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "# Lecture 7\n", "\n", "## Solving the XOR Problem without Feature Engineering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us create again some XOR-problem data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(MASS)\n", "x1 <- mvrnorm(30, c(1.5, 1), .2*diag(2))\n", "x2 <- mvrnorm(30, c(-1, -1), .2*diag(2))\n", "x3 <- mvrnorm(30, c(-1, .8), .2*diag(2))\n", "x4 <- mvrnorm(30, c(.9, -1), .2*diag(2))\n", "data <- data.frame(X = rbind(x1, x2, x3, x4), Y = c(rep(1, 60), rep(0, 60)))\n", "plot(data$X.1, data$X.2, col = data$Y + 2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "This is how we solved it two weeks ago." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "relu <- function(x) ifelse(x > 0, x, 0)\n", "features <- t(matrix(c(1, -1, 1, -1,\n", " 1, 1, -1, -1), ncol = 2))\n", "newdata <- data.frame(H = relu(as.matrix(data[, c(\"X.1\", \"X.2\")]) %*% features))\n", "newdata$Y <- data$Y\n", "fit <- glm(Y ~ ., newdata, family = \"binomial\")\n", "pred <- predict(fit, type = \"response\") > .5\n", "plot(data$X.1, data$X.2, col = pred + 2)\n", "misclassified <- pred != data$Y\n", "points(data[misclassified, 1:2], pch = 4)\n", "arrows(rep(0, 4), rep(0, 4), features[1,], features[2,], col = 'red')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "And last week we defined gradient descent in the following way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(ADtools)\n", "gradient_descent <- function(f, params, fix = list(),\n", " learning_rate = 0.01,\n", " tol = 1e-6,\n", " maxsteps = 10^3,\n", " show = F) {\n", " history <- rep(0, maxsteps)\n", " for (i in 1:maxsteps) {\n", " df <- auto_diff(f, at = append(params, fix), wrt = names(params))\n", " if (show) print(df@x)\n", " history[i] <- df@x\n", " delta <- learning_rate * as.numeric(df@dx)\n", " params <- relist(unlist(params) - delta, params)\n", " if (max(abs(delta)) < tol) break\n", " }\n", " append(params, list(history = history))\n", "}" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Although [the relu activiation\n", "function](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) is more\n", "commonly used in practice, we will use the soft-relu function here (also known\n", "as the softplus or smooth-relu). The reason is simply that ADtools cannot handle\n", "the relu function but it can handle its soft version. Later in this notebook we\n", "will also look at other libraries to fit neural networks that can handle the\n", "relu function. In the following plot you can see that the relu and the soft-relu\n", "are pretty similar to each other." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soft_relu <- function(x) log(1 + exp(x))\n", "plot(c(), ylim = c(-.1, 5), xlim = c(-5, 5))\n", "curve(relu, from = -5, to = 5, add = T)\n", "curve(soft_relu, from = -5, to = 5, col = 'blue', add = T)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us define now the neural network `nn` that takes `X` as input, computes the\n", "feature representation `soft_relu(X %*% w)` and produces the output by\n", "multiplying the feature representation with the coefficients beta." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nn <- function(w, beta, X) soft_relu(X %*% w) %*% beta" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "To run logistic regression with this neural network we define very similar\n", "functions as last week, e.g. the loss function `nn_loss` differs from last\n", "week's logistic regression loss only by the fact that the input to the logistic\n", "function is `(2*Y - 1) * nn(w, beta, X)` instead of `(2*Y - 1) * X %*% params`.\n", "Similarly we can define the Bayes classifier for the neural network." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logistic <- function(x) 1/(1 + exp(-x))\n", "nn_loss <- function(w, beta, X, Y) -mean(log(logistic((2*Y - 1) * nn(w, beta, X))))\n", "nn_bayes_classifier <- function(w, beta, X) logistic(nn(w, beta, X)) > .5" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "In the video it was mentioned that the initialization matters when fitting\n", "neural networks with gradient descent. A common choice for initial guesses is\n", "the \"Glorot normal initialization\" (named after student Xavier Glorot, who\n", "published with his supervisor an\n", "[article](http://proceedings.mlr.press/v9/glorot10a.html) on this topic). The\n", "recipe is as simple as constructing the initial guesses from matrices with\n", "normally distributed entries with mean zero and standard deviation\n", "$\\sqrt{\\frac2{n_\\mathrm{in} + n_\\mathrm{out}}}$, where $n_\\mathrm{in}$ is the\n", "number of rows of the matrix and $n_\\mathrm{out}$ is the number of columns of\n", "the matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "glorot_normal <- function(n_in, n_out) matrix(rnorm(n_in * n_out)*sqrt(2/(n_in + n_out)), ncol = n_out)\n", "glorot_normal(2, 4)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "With this we are ready to apply our first neural network to the XOR problem.\n", "It learns four 2-dimensional feature vectors and performs at the same time\n", "logistic regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X <- as.matrix(data[,1:2])\n", "params <- list(w = glorot_normal(2, 4), beta = glorot_normal(4, 1))\n", "result <- gradient_descent(nn_loss, params, learning_rate = 0.1,\n", " fix = list(X = X, Y = data$Y),\n", " show = T)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us plot the results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "pred <- nn_bayes_classifier(result$w, result$beta, X)\n", "plot(X, col = pred + 2)\n", "misclassified <- pred != data$Y\n", "points(X[misclassified,], pch = 4)\n", "arrows(rep(0, 4), rep(0, 4), result$w[1,], result$w[2,], col = 'red')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the results are just as good as with our hand-crafted features.\n", "The learned features are not exactly the same as our hand-crafted ones, but they\n", "point in similar directions." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "## From Artificial Neurons to Neural Networks" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no code in this section. Before you move over to the quiz, I recommend\n", "you to think about the neural network we used to solve the XOR problem in terms\n", "of number of neurons, number of layers, number of weights and number of biases.\n", "You can now solve the first page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1107127)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression with Neural Networks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "To include the bias in the matrix notation, we define in the following cell the\n", "function `bias_input` that prepends a column of 1s to any matrix of activities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_input <- function(X) cbind(rep(1, ifelse(is.null(nrow(X)), length(X), nrow(X))), X)\n", "bias_input(matrix(rnorm(20), nrow = 5))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We load again the life expectancy data.\n", "This time we shift and scale the data, because this tends to give better results\n", "with the `glorot_normal` initialization.\n", "You can see the effect of the `scale` function by looking at the mean of the\n", "resulting vector - which is up to numerical errors equal to 0 - and the standard\n", "deviation, which is 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data <- na.omit(read.csv(file.path(\"..\", \"data\", \"life_expectancy.csv\")))\n", "X <- scale(data$GDP)\n", "Y <- scale(data$LifeExpectancy)\n", "print(mean(X))\n", "print(sd(X))\n", "head(X)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We define now a neural network with 1 input neuron, 8 soft-relu hidden neurons\n", "and 1 output neuron. All neural activity matrices get a column of 1s for the\n", "biases with the `bias_input` function. Because we are in a regression setting we\n", "take this time the mean squared error as loss function." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nn_predict <- function(w1, w2, X) bias_input(soft_relu(bias_input(X) %*% w1)) %*% w2\n", "nn_loss <- function(w1, w2, X, Y) mean((Y - nn_predict(w1, w2, X))^2)\n", "params <- list(w1 = glorot_normal(2, 8), w2 = glorot_normal(9, 1))\n", "result <- gradient_descent(nn_loss, params, fix = list(X = X, Y = Y),\n", " learning_rate = 0.1, show = T)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us plot everything." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "grid <- seq(-1, 6, length = 100)\n", "pred <- nn_predict(result$w1, result$w2, grid)\n", "plot(X, Y, xlab = \"Scaled GDP\", ylab = \"Scaled LifeExpectancy\")\n", "lines(grid, pred, col = 'red', lwd = 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the result looks a bit like the one from a spline regression.\n", "The difference to spline regression is that we did not need to select the knots,\n", "but the useful features `soft_relu(bias_input(X) %*% w1)` were found by\n", "adjusting the weights and biases in the matrix `w1`.\n", "\n", "You can now solve the second page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1107127).\n", "\n", "## Stochastic Gradient Descent" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Stochastic gradient descent has as additional hyper-parameter the `batchsize`,\n", "which determines how many training samples are processed for every estimate of\n", "the gradient. The indices of the training samples in the cell below are computed\n", "by sliding through the data set of size `n`. After the last index `n` the first\n", "index of the training set is used again. For example, in a training set with 100\n", "data points and batchsize 32, you can see below the indices of the first,\n", "second and forth batch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "indices <- function(i, batchsize, n) (seq((i - 1) * batchsize, i * batchsize - 1) %% n) + 1\n", "print(indices(1, 32, 100))\n", "print(indices(2, 32, 100))\n", "print(indices(4, 32, 100))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Here is the full code for stochastic gradient descent." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stochastic_gradient_descent <- function(f, params, X, Y,\n", " learning_rate = 0.01,\n", " tol = 1e-6,\n", " batchsize = 32,\n", " maxsteps = 10^3,\n", " show = F) {\n", " history <- rep(0, maxsteps)\n", " for (i in 1:maxsteps) {\n", " idxs <- indices(i, batchsize, length(Y))\n", " at <- append(params, list(X = X[idxs,], Y = Y[idxs])) # evaluate only at idxs\n", " df <- auto_diff(f, at = at, wrt = names(params))\n", " if (show) print(df@x)\n", " history[i] <- df@x\n", " delta <- learning_rate * as.numeric(df@dx)\n", " params <- relist(unlist(params) - delta, params)\n", " if (max(abs(delta)) < tol) break\n", " }\n", " append(params, list(history = history))\n", "}" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us now run fit the life expectancy data with stochastic gradient descent.\n", "We run for more steps and with a smaller learning rate. Because every step uses\n", "now only a fraction of the full data set and takes thus less time than in full\n", "gradient descent, we can afford to run it for more steps. A smaller learning\n", "rate may be useful to compensate for the larger fluctuations in the estimate of\n", "the gradient." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result2 <- stochastic_gradient_descent(nn_loss, params, X = X, Y = Y,\n", " learning_rate = 0.02, show = T,\n", " maxsteps = 3000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compare gradient descent and stochastic gradient descent, we plot the\n", "histories as a function of number of training samples processed. Because\n", "gradient descent processes the full data set in every step we multiply the step\n", "number with the length of the data set `length(X) * seq(10^3)`. In SGD every\n", "step processes only 32 samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(length(X) * seq(10^3), result$history, log = \"y\", type = \"l\", lwd = 2, xlab = \"training data processed\", ylab = \"training loss\")\n", "lines(seq(3000)*32, result2$history, col = 'blue')\n", "moving_average <- function(x, n = 5) filter(x, rep(1 / n, n))\n", "lines(seq(3000)*32, moving_average(result2$history, 50), col = 'red')\n", "legend(\"topright\", c(\"gradient descent\", \"stochastic gradient descent\", \"moving average\"), lty = 1, col = c(\"black\", \"blue\", \"red\"))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "You can now solve the third page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1107127).\n", "\n", "## Keras\n", "There are nowadays great libraries to solve machine learning problems with\n", "artificial neural networks. Our approach with ADtools is great for small scale\n", "problems, but these libraries work also very well for large scale problems.\n", "Very popular are\n", "[pytorch](https://cran.r-project.org/web/packages/rTorch/index.html),\n", "[tensorflow](https://tensorflow.rstudio.com/) and\n", "[keras](https://keras.rstudio.com/).\n", "In the following you will see how to use keras to solve the XOR problem.\n", "We define once again some training data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(MASS)\n", "x1 <- mvrnorm(30, c(1.5, 1), .2*diag(2))\n", "x2 <- mvrnorm(30, c(-1, -1), .2*diag(2))\n", "x3 <- mvrnorm(30, c(-1, .8), .2*diag(2))\n", "x4 <- mvrnorm(30, c(.9, -1), .2*diag(2))\n", "data <- data.frame(X = rbind(x1, x2, x3, x4), Y = c(rep(1, 60), rep(0, 60)))\n", "plot(data$X.1, data$X.2, col = data$Y + 2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Now load the library, define the network model and append some layers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "library(keras)\n", "\n", "nn <- keras_model_sequential() %>%\n", " layer_dense(units = 4, activation = 'relu', input_shape = c(2)) %>%\n", " layer_dense(units = 1, activation = 'sigmoid') # append layers\n", "summary(nn)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We made use again of the pipe operator `%>%` that you encountered already in\n", "week 5.\n", "\n", "Now we use the function `compile` that produces efficient code that can be used\n", "for fitting afterwards with the loss function `binary_crossentropy` and the\n", "stochastic gradient descent optimizer `optimizer_sgd` with learning rate `lr =\n", "0.1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nn %>% compile(\n", " loss = 'binary_crossentropy', # the same as our standard loss for logistic regression\n", " optimizer = optimizer_sgd(lr = .1) # stochastic gradient descent\n", ") # specify loss function and optimization method" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Now we fit the neural network to the data.\n", "To compare with our results above, we use here SGD in a funny way: we define the\n", "batchsize to be equal to the size of the training set (= 60); hence we perform\n", "gradient descent." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "history <- nn %>% fit(\n", " as.matrix(data[,1:2]),\n", " data$Y,\n", " batch_size = 60,\n", " epochs = 1000,\n", " validation_split = 0, # use all data for training, none for validation.\n", ")\n", "plot(history)\n", "\n", "pred <- predict(nn, as.matrix(data[,1:2])) > 0.5\n", "plot(data[,1:2], col = pred+2)\n", "misclassified <- pred != data$Y\n", "points(X[misclassified,], pch = 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You see that we got comparable results with `keras`.\n", "Because it has so much useful built-in functionality, you will use again `keras`\n", "in the upcoming lectures.\n", "\n", "## Exercises\n", "\n", "## Conceptual\n", "**Q1**\n", "To get a feeling for the kind of functions that can be fitted with neural networks, we will draw $y$ as a function of $x$ for some values of the weights. It may be helpful to sketch this neural network with the input neuron, the hidden neurons and the output neuron and label the connections with the weights." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(a) Draw in the same figure $a_1^{(1)} = g(w_{10}^{(1)} + w_{11}^{(1)} x)$, $a_2^{(1)}=g(w_{20}^{(1)} + w_{21}^{(1)} x)$ and $\\bar y = w_0^{(2)} + w_1^{(2)}a_1^{(1)} + w_2^{(2)}a_2^{(1)}$ as a function of $x$. 
Use $w_{10}^{(1)} = 0$, $w_{11}^{(1)} = 1$, $w_{20}^{(1)} = - 2$, $w_{21}^{(1)} = 2$, $w_0^{(2)} = 1$, $w_1^{(2)} = 2$, $w_2^{(2)} = 1$ and use the rectified linear activation function $g = \\mbox{relu}$. At which $x$-values does the slope change? Give the answer in terms of the weights.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) Draw a similar graph for $w_{11}^{(1)} < 0$ and $w_{10}^{(1)}/w_{11}^{(1)}