"- `nn = keras_model_sequential()` creates a new neural network. \n",
"\n",
"The following examples make use of the pipe operator `%>%`. See <a href=\"https://stackoverflow.com/questions/24536154/what-does-mean-in-r\" target=\"_blank\">here</a> for what it does.\n",
" lines(age.grid, spline.pred, lwd = 2, col = \"blue\")\n",
"}\n",
"plot.baseline()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it gets more interesting: we will create our first neural network."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"library(keras)\n",
"nn = keras_model_sequential()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far the network is empty. With the following function we attach some layers.\n",
"\n",
"We do not need to attach the input layer. Instead we tell the first hidden layer that there is only one predictor by setting `input_shape = c(1)`. We would set the `input_shape = c(5)`, if we had 5 predictors.\n",
"\n",
"For reproducibility we set the seed of the function that specifies the initial weights.\n",
"The `kernel_initializer` determines how all weights except the biases are initialized.\n",
"The `bias_initializer` sets the biases to zero by default."
"Since each neuron has as many weights as there are neurons in the previous layer plus one bias parameter, there are $20(1 + 1) = 40$ parameters in the first layer, $20(20 + 1) = 420$ parameters in the next layers and $1(20 + 1)$ parameters in the last layer.\n",
"\n",
"Next we choose the mean squared error `mse` as our loss function and the `adam` optimizer, which automatically determines the learning rates for our gradient descent fitting procedure.\n"
]
},
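{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the layer definition, consistent with the parameter counts above; the number of hidden layers (two here) and the seed value are assumptions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hedged sketch: hidden layers of 20 relu neurons, as implied by the parameter counts\n",
"nn %>%\n",
"  layer_dense(units = 20, activation = \"relu\", input_shape = c(1),\n",
"              kernel_initializer = initializer_random_normal(seed = 1)) %>%\n",
"  layer_dense(units = 20, activation = \"relu\",\n",
"              kernel_initializer = initializer_random_normal(seed = 1)) %>%\n",
"  layer_dense(units = 1,\n",
"              kernel_initializer = initializer_random_normal(seed = 1))\n",
"summary(nn)  # should report 40, 420 and 21 parameters for the three layers"
]
},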
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"nn %>% compile(\n",
" loss = 'mse',\n",
" optimizer = 'adam'\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We run the optimizer for 20 epochs.\n",
"In each epoch the parameters are updated 1000 times in direction of the gradient.\n",
"This will take some time; instead of running the cell yourself, you can directly have a look at the output.\n"
"To compare to the other methods we transform the mean squared error at the end of training to the residual standard error.\n"
]
},
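{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the fitting call, assuming the `Wage` data set and `batch_size = 3` (3000 observations / 3 = 1000 parameter updates per epoch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hedged sketch: 20 epochs with 1000 updates each on the 3000 observations\n",
"history <- nn %>% fit(x = Wage$age, y = Wage$wage,\n",
"                      epochs = 20, batch_size = 3, verbose = 0)"
]
},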
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>40.9290706890088</li>\n",
"\t<li>39.9130599754512</li>\n",
"\t<li>39.8196855504681</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 40.9290706890088\n",
"\\item 39.9130599754512\n",
"\\item 39.8196855504681\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 40.9290706890088\n",
"2. 39.9130599754512\n",
"3. 39.8196855504681\n",
"\n",
"\n"
],
"text/plain": [
"[1] 40.92907 39.91306 39.81969"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"final.rse <- function(history) {\n",
" sqrt(3000/2998 * tail(history$metric$loss, n = 1))\n",
"}\n",
"c(linear.summary$sigma,\n",
"spline.summary$sigma,\n",
"final.rse(history))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the neural network has a slightly lower training error than the benchmark methods, but we see also in the plot that the difference to the spline fit is very small.\n",
"Note that in contrast to the spline fit we did not need to specify the position of the knots for the neural network.\n"
"plot.nn.pred <- function(nn, col = \"red\") {\n",
" nn.pred <- predict(nn, x = age.grid)\n",
" lines(age.grid, nn.pred, lwd = 2, col = col)\n",
"}\n",
"plot.baseline()\n",
"plot.nn.pred(nn)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An attentive reader may wonder, if it really is necessary to use such a large neural network to fit the wage data.\n",
"In fact, we saw that a neural network with one layer of five hidden neurons and relu-nonlinearity has the same expressiveness as a linear spline with four knots.\n",
"So let us see, if a much smaller model also does the job.\n"
"You may get slightly different results for every initial conditions, but typically we find that this smaller model is not better than linear regression.\n",
"Looking at the weights of the network is instructive here.\n",
"The first row below are the weights of the hidden layer.\n",
"The second row are the biases of the hidden neurons.\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<ol>\n",
"\t<li><table>\n",
"<caption>A matrix: 1 × 5 of type dbl</caption>\n",
"we see that the transition from zero to positive or negative slope is for all neurons outside of the interesting age region between 20 and 80 years, and thus the neural network is just linear in this region!\n",
"\n",
"Why did the fitting procedure not find the points where the prediction should change the slope?\n",
"The most probable answer is: wrong initial conditions.\n",
"In fact, if we manually set the inital weights, we can get a very good fit.\n"
"The dependence on the initial condition is unsatisfactory.\n",
"Fortunately, however, the initialization functions of popular deep learning libraries like `keras` work often quite well, if the predictors and responses are properly preprocessed, i.e. centered and scaled.\n",
"If we do this for the `Wage` data set we get for example the following.\n"
"This time the kinks lie in the interesting region between 20 and 80 years of age and the final RSE is between the one of the linear spline fit and the one of the large neural network.\n",
"\n",
"Why did the large neural network still get great performance even without proper preprocessing of the data?\n",
"The larger the network the higher the chance that some of the neurons still have some initial weights and biases that can become useful.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prediction Intervals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The approach to obtain prediction intervals that we discuss here is by no means standard in the deep learning community.\n",
"Prediction intervals are not obvious to obtain for neural networks.\n",
"We look at this approach here, because it illustrates, how problems in deep learning are often addressed by creating a custom loss function and fitting a generic neural network.\n",
"Here is the custom loss function to get prediction intervals.\n",
"\n",
"Don't worry about the details here; what matters is that this is just a custom loss function that takes true and predicted values as input and returns some positive scalar."