diff --git a/src/lecture1.ipynb b/src/lecture1.ipynb index 1c18b31..bb24ae1 100644 --- a/src/lecture1.ipynb +++ b/src/lecture1.ipynb @@ -1,1003 +1,1003 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 1\n", "## Machine Learning Examples" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Organization of this Course" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "## The Life Expectancy Dataset\n", "Run the following cell with `Shift + Enter` to watch the video." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "In this section you will load the life expectancy dataset,\n", "look at it and produce some plots.\n", "You will learn about the `R` functions `file.path`, `read.csv`, `str`, `?`,\n", "`<-`, `$`, `:`, `c`, `pdf`, `par`, `plot`.\n", "\n", "Let us load the life expectancy dataset from the csv file.\n", "You can run the following cell by clicking it and pressing `Shift + Enter`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data <- read.csv(file.path(\"..\", \"data\", \"life_expectancy.csv\"))\n", "str(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We used the function `file.path` to generate a valid path\n", " in an operating system independent way.\n", "- We used the function `read.csv` and assigned the output to the variable `data`.\n", "- In `R` it is common to use `<-` for (left-)assignment,\n", " but `=` can also be used.\n", "- On the second line we used the function `str`\n", " to look at the names and data types of the columns of `data`;\n", " if you just want to extract the names, you can use the function `names`.\n", "- In `R` the dot `.` does not have any special meaning\n", " and can be used in any variable or function name.\n", "- `R` usually has excellent documentation. You can access it with `?`, e.g." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "?read.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?\"<-\"" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us actually look at the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data$LifeExpectancy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The output is a vector with the life expectancies\n", " measured in different countries and in different years.\n", "- We can access the data in the columns of `data` by using the extraction\n", " operator `$`. Type `?\"$\"` in the empty field below and have a look at the\n", " examples at the bottom of the documentation.\n", "- We could have also accessed this data with `data[,6]`. Try it out."
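,
    "\n",
    "For example, the following snippet shows three equivalent ways to extract the same column (a minimal sketch; it assumes `LifeExpectancy` is the sixth column, as in the `str` output above):\n",
    "\n",
    "```r\n",
    "head(data$LifeExpectancy)        # extraction by name with the $ operator\n",
    "head(data[[\"LifeExpectancy\"]])   # extraction by name with [[\n",
    "head(data[, 6])                  # extraction by position\n",
    "```\n",
    "\n",
    "All three return the same vector; extraction by name is usually preferable,\n",
    "because it keeps working if the column order changes."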
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us continue to explore the data.\n", "First we look at rows 30 to 40." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data[30:40,]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Do you wonder what `NA` means? Look it up with `?NA`.\n", "\n", "With the following command we look at rows 13, 33, 41 and 72." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[c(13, 33, 41, 72),]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The combine function `c` is important to know.\n", "You may want to have a look at its documentation\n", "or play with some examples, like" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- c(1, 2, 3)\n", "y <- c(4, 5, 6)\n", "x + y\n", "c(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before you move on to the next video, you will find in the cell below\n", "the code to generate the figures used in the slides.\n", "The first and the last line are commented out,\n", "so that the plots are shown in this notebook\n", "instead of being written to a PDF file.\n", "Use the documentation if you want to know more\n", "about the usage of the functions `pdf`, `par` and `plot`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pdf(\"life_expectancy_example_plots.pdf\", width = 5.8, height = 2.8)\n", "par(mfcol = c(1, 3), cex = .7)\n", "plot(data$GDP, data$LifeExpectancy, xlab = \"GDP per capita [USD]\",\n", " ylab = \"Life Expectancy [Years]\",\n", " xlim = c(0, 100000))\n", "plot(data$BMI, data$LifeExpectancy, xlab = \"BMI [kg/m^2]\",\n", " ylab = \"Life Expectancy [Years]\")\n", "plot(data$Year, data$LifeExpectancy, ylab = \"Life Expectancy [Years]\",\n", " xlab = \"Year\", xlim = c(1999, 2016))\n", "# dev.off()" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "## Error Decomposition and Parametric versus Non-parametric Methods" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "### Artificial Data Generation Process\n", "\n", "With the following code we define the custom function `f`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "f <- function(x) {x^2}" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "You may want to look at the documentation `?\"function\"`,\n", "evaluate `f` at different points or create your own function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f(3)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "In the slides you saw data generated with the following function `myfunc`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myfunc <- function(x) {sin(2*x) + 2*(x - 0.5)^3 - 0.5*x}" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We will generate `N = 60` data points." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "set.seed(12)\n", "N <- 60\n", "x <- sort(runif(N))\n", "error <- .06*rnorm(N)\n", "y <- myfunc(x) + error\n", "par(mfcol = c(1, 1))\n", "plot(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first line we set the pseudo-random number generator seed to 12.\n", "This means that we will obtain the same pseudo-random numbers\n", "every time we run the code above.\n", "The functions `runif` and `rnorm` generate uniformly and normally distributed\n", "pseudo-random numbers, respectively. And the function `sort`, well, does the\n", "obvious :)\n", "\n", "You may want to convince yourself that\n", "running the cell above always gives the same data.\n", "What happens if you remove the first line or replace it with `set.seed(123)`?\n", "\n", "If you feel you don't understand something in the code above,\n", "it would be a good idea to insert a cell below\n", "(you can use e.g. the + button above) and experiment a bit." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "### Parametric Method\n", "\n", "As an example of a parametric method to estimate the function `myfunc`,\n", "we will fit a linear function to the data."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "linear.fit <- lm(y ~ x)\n", "plot(x, y)\n", "curve(myfunc, 0, 1, col = 'blue', add = TRUE, lwd = 2)\n", "abline(linear.fit, col = \"dark green\", lwd = 2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The cryptic-looking `linear.fit <- lm(y ~ x)` simply means:\n", "\"fit a linear model with response `y` and predictor `x`\n", "and assign the result to the variable `linear.fit`\".\n", "The functions `curve` and `abline` plot the true function\n", "and the fitted line.\n", "\n", "### Non-Parametric Method\n", "\n", "As an example of a non-parametric method to estimate the function `myfunc`,\n", "we define the `kNN` method with three mandatory arguments and one optional\n", "argument with default value `k = 2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "kNN <- function(x0, x, y, k = 2) { # test point x0, predictor values x, response values y, default k = 2 nearest neighbours\n", " d <- abs(x - x0) # compute all distances between the test point and the data\n", " o <- order(d) # compute the order of the distances (smallest to largest)\n", " mean(y[o[1:k]]) # take the average response of the k nearest neighbours\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you feel comfortable with the `kNN` function you can skip this paragraph and\n", "move to the next cell. Otherwise, create a new cell below and experiment a bit,\n", "e.g. `tmp.x <- c(5, 2, 3, 1)`, `tmp.x0 <- 1.4`, `abs(tmp.x - tmp.x0)`,\n", "`order(tmp.x)`, `kNN(tmp.x0, tmp.x, c(1, 2, 3, 4), k = 1)`."
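,
    "\n",
    "The suggested experiments can also be collected into one snippet (we repeat the `kNN` definition from above so that the snippet runs on its own; the expected results are given as comments):\n",
    "\n",
    "```r\n",
    "kNN <- function(x0, x, y, k = 2) {\n",
    "  d <- abs(x - x0)   # distances between the test point x0 and the data\n",
    "  o <- order(d)      # indices sorted from smallest to largest distance\n",
    "  mean(y[o[1:k]])    # average response of the k nearest neighbours\n",
    "}\n",
    "tmp.x <- c(5, 2, 3, 1)\n",
    "tmp.x0 <- 1.4\n",
    "abs(tmp.x - tmp.x0)   # 3.6 0.6 1.6 0.4\n",
    "order(tmp.x)          # 4 2 3 1\n",
    "kNN(tmp.x0, tmp.x, c(1, 2, 3, 4), k = 1)   # 4, since x = 1 (index 4) is closest to 1.4\n",
    "```"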
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "grid <- seq(0, 1, length.out = 1000)\n", "y.hat <- sapply(grid, kNN, x, y, k = 1)\n", "plot(x, y)\n", "curve(myfunc, 0, 1, col = 'blue', add = TRUE, lwd = 2)\n", "lines(grid, y.hat, col = 'red', lwd = 2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "With the `seq` function we generated 1000 evenly spaced points in the interval\n", "[0, 1]. The `sapply` function lets us apply the function `kNN` to all values of\n", "`grid`; from the third argument onward, the `sapply` function passes the arguments\n", "to the function `kNN`, i.e. you can change the `k` to 50, for example, if you\n", "want to see the figure of the slide on the curse of dimensionality.\n", "It is highly recommended to execute the cell above for different values of `k`\n", "between 1 and 50 and observe how well the curve fits the data.\n", "We call kNN with `k = 1` a flexible method, because it can fit very rough\n", "data with many jumps, and kNN with `k = 50` an inflexible method, because it can\n", "fit only rather smooth data.\n", "We will later assess the kNN method with different values of `k` more formally.\n", "If you want to better understand the `sapply` function, it may be worthwhile to\n", "experiment a bit, e.g. 
`tmp.f <- function(x, y) { x + y }; tmp.x <- c(1, 2, 3);\n", "sapply(tmp.x, tmp.f, 2)` or look at its documentation.\n", "\n", "Take a little moment to think about the definitions of the reducible and the\n", "irreducible error as well as parametric and non-parametric methods.\n", - "When you are ready, move over to [the quiz](https://moodle.epfl.ch).\n", + "When you are ready, move over to [the quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1088128).\n", "\n", "## Assessing Model Accuracy" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "IRdisplay::display_html(' ')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "### Assessing Model Accuracy with Artificial Data\n", "Let us generate a test set with the same generative process as above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(42)\n", "N.test <- 10^3\n", "x.test <- sort(runif(N.test))\n", "error.test <- 0.06 * rnorm(N.test)\n", "y.test <- myfunc(x.test) + error.test" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Now we compute the training error for the linear model. To do so we first\n", "compute the predicted responses `y.pred`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "y.pred <- predict(linear.fit, data.frame(x = x))\n", "1/length(y) * sum((y - y.pred)^2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "and the test error for the linear model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "y.test.pred <- predict(linear.fit, data.frame(x = x.test))\n", "1/length(y.test) * sum((y.test - y.test.pred)^2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The `predict` function takes a fitted model as its first argument, like the\n", "`linear.fit` we obtained above. As its second argument it expects a `data.frame`\n", "with values in a column called `x`.\n", "\n", "Let us do the same with the kNN method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "y.pred.kNN <- sapply(x, kNN, x, y, k = 5)\n", "1/length(y) * sum((y - y.pred.kNN)^2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y.test.pred.kNN <- sapply(x.test, kNN, x, y, k = 5)\n", "1/length(y.test) * sum((y.test - y.test.pred.kNN)^2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "As expected, the kNN method has a much lower training and test error than the\n", "linear method.\n", "\n", "Let us now investigate how the training and test errors depend on the choice of\n", "`k`. To do so, we will define the following `assess.kNN` function and evaluate\n", "it with different `k`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "assess.kNN <- function(k, data.train, data.test) {\n", " x <- data.train$x\n", " y <- data.train$y\n", " x.test <- data.test$x\n", " y.test <- data.test$y\n", " y.pred <- sapply(x, kNN, x, y, k = k)\n", " error.train <- 1/length(y) * sum((y - y.pred)^2)\n", " y.test.pred <- sapply(x.test, kNN, x, y, k = k)\n", " error.test <- 1/length(y.test) * sum((y.test - y.test.pred)^2)\n", " c(error.train, error.test)\n", "}" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "This function takes `k`, a training set and a test set as input and returns both\n", "the training and the test error.\n", "\n", "We will now evaluate this function for different values of `k`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "ks <- seq(1, 20)\n", "errors.kNN <- sapply(ks, assess.kNN, data.frame(x, y), data.frame(x = x.test, y = y.test))\n", "plot(ks, errors.kNN[1,], col = \"red\", ylab = \"MSE\", xlab = \"k\", type = \"b\")\n", "points(ks, errors.kNN[2,], col = \"blue\", type = \"b\")\n", "abline(h = 0.0036, lty = 2)\n", "legend(\"bottomright\", legend = c(\"training error\", \"test error\", \"irreducible error\"),\n", " col = c(\"red\", \"blue\", \"black\"), lty = c(1, 1, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We repeat here three important observations we already made in the video:\n", "1. The test error is always above the irreducible error.\n", "2. The very flexible method with `k = 1` can perfectly fit the training data\n", "(zero training error) but its test error is higher than that of a less\n", "flexible method, with `k = 5` for example.\n", "3. The training error increases with decreasing flexibility of the method;\n", "beyond the optimal flexibility, the test error increases as well."
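,
    "\n",
    "The second observation can be checked directly: with `k = 1` each training point is its own nearest neighbour, so the training predictions reproduce the training responses exactly (a quick sketch, using the objects `x`, `y` and `kNN` defined above; it assumes all values in `x` are distinct, which is virtually certain for `runif`):\n",
    "\n",
    "```r\n",
    "y.pred.k1 <- sapply(x, kNN, x, y, k = 1)\n",
    "1/length(y) * sum((y - y.pred.k1)^2)   # 0: zero training error\n",
    "```"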
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "### Assessing Model Accuracy with Real Data\n", "\n", "For real datasets we typically cannot easily generate an additional test set.\n", "Common practice is therefore to split the dataset into two parts.\n", "We will do this for the life expectancy dataset.\n", "For now, we will only look at the GDP as input and the life expectancy as\n", "output." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data1 <- na.omit(data[, c(\"GDP\", \"LifeExpectancy\")])" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The function `na.omit` removes all rows where either the GDP or the life\n", "expectancy is not available (`NA`).\n", "\n", "Now we split the data into a training and a test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "set.seed(199)\n", "idx.train <- sample(nrow(data1), nrow(data1)/2)\n", "data1.train <- data1[idx.train,]\n", "data1.test <- data1[-idx.train,]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "There are `n = nrow(data1)` data points in total. The second line above samples\n", "`n/2` indices from the indices 1 to n (without replacement). In the third and\n", "fourth lines we extract every row with index occurring in `idx.train` or its\n", "complement `-idx.train` to form the training set `data1.train` and the test set\n", "`data1.test`.\n", "\n", "Next we fit a linear model to the training set, define a function to compute\n", "the MSE and compute the training and test error."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit <- lm(LifeExpectancy ~ GDP, data1.train)\n", "mse <- function(fit, data) {\n", " 1/nrow(data) * sum((data$LifeExpectancy - predict(fit, data))^2)\n", "}\n", "c(mse(fit, data1.train), mse(fit, data1.test))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Interestingly, the training error is higher than the test error. This is an\n", "indication that the model is not flexible enough. We can also see this in the\n", "plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(data1)\n", "abline(fit, col = 'dark green', lwd = 2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Instead of a linear fit, we could use a quadratic fit." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "q.fit <- lm(LifeExpectancy ~ poly(GDP, 2), data1.train)\n", "c(mse(q.fit, data1.train), mse(q.fit, data1.test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We used the function `poly(GDP, 2)` to form a polynomial of degree 2, i.e.\n", "$$\\beta_0 + \\beta_1 \\mathrm{GDP} + \\beta_2 \\mathrm{GDP}^2.$$\n", "The training error is still higher than the test error, but both decreased.\n", "What does the plot look like?"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "plot(data1)\n", "grid <- seq(min(data1$GDP), max(data1$GDP), length.out = 1000)\n", "lines(grid, predict(q.fit, data.frame(GDP = grid)), col = 'orange', lwd = 2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Hm, a quadratic model does not seem ideal either.\n", "Let us move on to polynomials of arbitrary degree.\n", "To do so we create the function `poly.fit` that takes the degree `d` and\n", "training and test sets as input, and returns the training error, the test error\n", "and the fit object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "poly.fit <- function(d, data.train, data.test) {\n", " fit <- lm(LifeExpectancy ~ poly(GDP, d), data.train)\n", " c(mse(fit, data.train), mse(fit, data.test), fit)\n", "}\n", "ds <- 1:14\n", "results.poly <- sapply(ds, poly.fit, data1.train, data1.test)\n", "plot(ds, results.poly[1,], col = \"red\", ylab = \"MSE\", xlab = \"d\", type = \"b\")\n", "points(ds, results.poly[2,], col = \"blue\", type = \"b\")\n", "legend(\"topright\", legend = c(\"training error\", \"test error\"),\n", " col = c(\"red\", \"blue\"), lty = c(1, 1))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Here, a polynomial of degree `d = 1` is the least flexible method and it\n", "underfits the data, while the polynomial with degree `d = 14` is the most\n", "flexible method we considered and it overfits the data. We see again the\n", "typical U-shaped curve of the test error: first it decreases with increasing\n", "flexibility, but at some point it starts to increase again. 
In contrast, the\n", "training error decreases continually.\n", "\n", "Let us look at how well the best performing polynomial (with degree `d = 11`)\n", "fits the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "plot(data1)\n", "grid <- seq(min(data1$GDP), max(data1$GDP), length.out = 1000)\n", "lines(grid, predict(lm(LifeExpectancy ~ poly(GDP, 11), data1.train),\n", " data.frame(GDP = grid)), col = 'orange', lwd = 2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "In my opinion, this does not come very close to the true generating\n", "function. The downswing for very high GDPs looks wrong, and the wiggles at\n", "high GDPs, where there is only little data, do not look convincing either. Maybe we\n", "have to try our luck with other methods.\n", "\n", "Take a moment to think about the definitions of the test and training\n", "error and the flexibility of methods.\n", - "When you are ready, move over to [the quiz](https://moodle.epfl.ch).\n", + "When you are ready, move over to [the quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1088139).\n", "\n", "## Exercises\n", "\n", "**Q1.** For each of parts (a) through (d), indicate whether we would generally\n", "expect the performance of a flexible statistical learning method to be better or\n", "worse than an inflexible method. Justify your answer.\n", "\n", "a) The sample size $n$ is extremely large, and the number of predictors $p$ is small?\n", "\n", "b) The number of predictors $p$ is extremely large, and the number of\n", "observations $n$ is small?\n", "\n", "c) The relationship between the predictors and response is highly non-linear?\n", "\n", "d) The variance of the error terms, i.e. $\\sigma^2 = \\mathrm{var}(\\epsilon)$, is extremely high?\n", "\n", "**Q2.** Describe the differences between a parametric and a non-parametric\n", "machine learning approach. 
What are the advantages of a parametric approach (as\n", "opposed to a non-parametric approach)? What are its disadvantages?\n", "\n", "**Q3.** In this exercise you will look at a [Parkinsons Telemonitoring Data\n", "Set](https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring).\n", "Navigate to that page to find some information about the dataset (if you click\n", "on the link in the previous sentence it will typically open a new browser tab).\n", "You can load the data with the following command." ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data.parkinsons <- read.csv(\"https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to make progress in programming in R, I recommend that you refrain from\n", "copy-pasting as much as possible.\n", "\n", "We will only focus on the PPE feature and the total UPDRS as a response here.\n", "\n", "a) Plot the PPE feature against the total UPDRS response.\n", "\n", "b) Create a training and a test set.\n", "\n", "c) Fit kNN to the training data for different values of k and compute the\n", "training and the test error. For which value of k do you neither see\n", "underfitting nor overfitting?\n", "\n", "d) Estimate an upper bound for the irreducible error of this dataset." ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.0.2" } }, "nbformat": 4, "nbformat_minor": 4 }