diff --git a/src/lecture11.ipynb b/src/lecture11.ipynb
new file mode 100644
index 0000000..9e8932e
--- /dev/null
+++ b/src/lecture11.ipynb
@@ -0,0 +1,787 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Lecture 11\n",
+ "\n",
+ "## Principal Component Analysis: Directions of Highest Variance"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Let us first load the wine data and look at it with the `pairs` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wine <- read.csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\", sep = \",\", header = FALSE) # the file contains no header row\n",
+ "colnames(wine) <- c(\"Cultivar\",\"Alcohol\",\"Malic acid\",\"Ash\",\"Alcalinity of ash\", \"Magnesium\", \"Total phenols\", \"Flavanoids\", \"Nonflavanoid phenols\", \"Proanthocyanins\", \"Color intensity\", \"Hue\", \"OD280/OD315 of diluted wines\", \"Proline\")\n",
+ "pairs(wine[,-1], cex = .5) # the first column contains the label of the cultivar; we pretend here that we don't know it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "We first scale the data with `scale`, then perform a principal component\n",
+ "analysis on this data set with the function `prcomp` and plot the scores of the\n",
+ "data on the first two components."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "wine.sc <- scale(wine[,-1])\n",
+ "pca <- prcomp(wine.sc, scale = F)\n",
+ "# pca <- prcomp(wine[,-1], scale = T) # is an alternative where we don't scale manually\n",
+ "plot(pca$x[,1:2])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The function `prcomp` returns the scores in the variable `x` and the loadings in\n",
+ "the variable `rotation`. We can verify that there are $p$ loading vectors of length\n",
+ "$p$ that have norm 1 and are pairwise orthogonal."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dim(pca$rotation)\n",
+ "norm(pca$rotation[,1], \"2\") # use the standard Euclidean 2-norm\n",
+ "sum(pca$rotation[,1] * pca$rotation[,2]) # the scalar product is approximately 0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Let us now produce the `biplot` from the slides."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "biplot(pca, col = c('gray', 'red'), scale = 0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "You can verify by looking at `pca$x[,1:2]` and `pca$rotation[,1:2]` that the\n",
+ "labels are placed at the right positions.\n",
+ "\n",
+ "To produce a figure similar to the one in the slides without scaling, we can use the code"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "biplot(prcomp(wine[,-1], scale = F), col = c('gray', 'red'), scale = 0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Let us check the relation between Singular Value Decomposition and PCA."
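+ , "\n",
+ "Recall the algebra behind this check: if $X = U\Sigma V^T$ with $V^T V = I$, then $X^T X = V \Sigma^2 V^T$, so the columns of $V$ are the eigenvectors of $X^T X$, i.e. the loadings, and the scores are $X V = U\Sigma V^T V = U\Sigma$. The next cell verifies both identities numerically."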
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "s <- svd(wine.sc)\n",
+ "sum((s$v - pca$rotation)^2)\n",
+ "sum((s$u %*% diag(s$d) - pca$x)^2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see in the output above that the loadings matrix `pca$rotation` is the\n",
+ "same as the $V$ matrix of the SVD (up to numerical errors) and the scores `pca$x`\n",
+ "are the same as $U\Sigma$, where $\Sigma$ is given by `diag(s$d)`.\n",
+ "\n",
+ "You can now answer the questions on the first page of the\n",
+ "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).\n",
+ "\n",
+ "## Principal Component Analysis: Closest Low-Dimensional Subspace"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "In the next cell you can see that we can perfectly reconstruct the original data\n",
+ "$X$ by computing either $Z\Phi^T$ or $X\Phi\Phi^T$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wine.sc.rec1 <- pca$x %*% t(pca$rotation)\n",
+ "wine.sc.rec2 <- wine.sc %*% pca$rotation %*% t(pca$rotation)\n",
+ "sum((wine.sc - wine.sc.rec1)^2)\n",
+ "sum((wine.sc - wine.sc.rec2)^2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Now we reconstruct the data using only the first two principal components and\n",
+ "compare the reconstruction error to the one we obtain by using the first and the\n",
+ "third principal component. The reconstruction error is higher in the latter\n",
+ "case. In fact, no matter which other pair of principal components we use to\n",
+ "reconstruct the original data, the reconstruction error will always be higher\n",
+ "than with the first two principal components: the first two principal\n",
+ "components indeed span the two-dimensional plane in this 13-dimensional space\n",
+ "that is closest to the original data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wine.sc.rec.2 <- wine.sc %*% pca$rotation[,1:2] %*% t(pca$rotation[,1:2])\n",
+ "sum((wine.sc - wine.sc.rec.2)^2)\n",
+ "wine.sc.rec.2b <- wine.sc %*% pca$rotation[,c(1, 3)] %*% t(pca$rotation[,c(1, 3)])\n",
+ "sum((wine.sc - wine.sc.rec.2b)^2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can now answer the questions on the second page of the\n",
+ "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).\n",
+ "\n",
+ "## Proportion Variance Explained"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The standard deviations of the scores of each component are in `pca$sdev`; we\n",
+ "get the variances by squaring these numbers and the proportions of variance\n",
+ "explained by normalizing with the total variance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pca.var <- pca$sdev^2\n",
+ "pca.vare <- pca.var / sum(pca.var)\n",
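+ "# summary(pca) reports the same quantities in the rows \"Proportion of Variance\"\n",
+ "# and \"Cumulative Proportion\" of its importance table; a quick cross-check\n",
+ "summary(pca)\n",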
+ "plot(pca.vare, xlab = \"Principal Component\", ylab = \"Prop. Variance Explained\", col = \"blue\", type = \"b\")\n",
+ "plot(cumsum(pca.vare), xlab = \"Principal Component\", ylab = \"Cumulative Prop. Variance Explained\", col = \"blue\", type = \"b\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "In the following cell we download and plot the image of the boat you saw in the\n",
+ "video."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(tiff)\n",
+ "fname <- tempfile()\n",
+ "download.file(\"http://sipi.usc.edu/database/download.php?vol=misc&img=boat.512\", fname, mode = \"wb\") # binary mode, needed on Windows\n",
+ "img <- readTIFF(fname)\n",
+ "image(t(img)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Now, let us scale the data, perform PCA and plot the variance explained, the\n",
+ "original image and the image reconstructed with the first 25 components. Note\n",
+ "that each of the 512 pixel rows of the image is treated as one data point with\n",
+ "512 features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "img.s <- scale(img)\n",
+ "pZ <- prcomp(img.s)\n",
+ "library(DMwR) # used for the unscale function\n",
+ "newimg <- unscale(pZ$x[,1:25] %*% t(pZ$rotation[,1:25]), img.s)\n",
+ "par(mfrow = c(1, 3), oma = rep(1, 4), cex = 1)\n",
+ "plot(pZ$sdev^2/sum(pZ$sdev^2), ylab = 'Prop. Variance Explained')\n",
+ "image(t(img)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)\n",
+ "image(t(newimg)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can now answer the questions on the third page of the\n",
+ "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).\n",
+ "\n",
+ "## Limitations of PCA"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Let us generate below an artificial dataset with 8 point clouds."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "set.seed(517)\n",
+ "library(MASS)\n",
+ "x1 <- mvrnorm(100, c(1, 1, 1), .1*diag(3))\n",
+ "x2 <- mvrnorm(100, c(-1, 1, 1), .1*diag(3))\n",
+ "x3 <- mvrnorm(100, c(1, -1, 1), .1*diag(3))\n",
+ "x4 <- mvrnorm(100, c(-1, -1, 1), .1*diag(3))\n",
+ "x5 <- mvrnorm(100, c(1, 1, -.8), .1*diag(3))\n",
+ "x6 <- mvrnorm(100, c(-1, 1, -.8), .1*diag(3))\n",
+ "x7 <- mvrnorm(100, c(1, -1, -.8), .1*diag(3))\n",
+ "x8 <- mvrnorm(100, c(-1, -1, -.8), .1*diag(3))\n",
+ "x <- rbind(x1, x2, x3, x4, x5, x6, x7, x8)\n",
+ "cols <- c(rep(1, 400), rep(2, 400))\n",
+ "library(plotly)\n",
+ "plot_ly(x = x[,1], y = x[,2], z = x[,3], type = 'scatter3d', mode = 'markers', marker = list(size = 2), color = cols)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "Without scaling the data, you get an image very similar to the one shown in the\n",
+ "slides."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pca <- prcomp(x)\n",
+ "order <- sample(1:800)\n",
+ "plot(pca$x[order, 1:2], col = cols[order], xlab = \"PC1\", ylab = \"PC2\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "If we scale the data before applying PCA, the result looks a bit better, but the\n",
+ "point clouds still overlap after the projection to the lower-dimensional\n",
+ "space."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x <- scale(x)\n",
+ "pca <- prcomp(x)\n",
+ "order <- sample(1:800)\n",
+ "plot(pca$x[order, 1:2], col = cols[order], xlab = \"PC1\", ylab = \"PC2\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The result is very different when we apply t-SNE instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "library(Rtsne)\n",
+ "tsne <- Rtsne(x)\n",
+ "plot(tsne$Y, col = cols)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Principal Component Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IRdisplay::display_html('')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The code in this section may look complicated. Don't worry if you don't\n",
+ "understand the details. I included the code for transparency: you can see\n",
+ "how I produced the figures in the slides. It is totally fine if you do not want\n",
+ "to spend time reading it. The only thing that you may want to remember is\n",
+ "that the `pls` library contains the function `pcr` to do\n",
+ "principal component regression; however, the Lasso or ridge regression is\n",
+ "usually preferable.\n",
+ "\n",
+ "In the following cell we define a data generator function that samples\n",
+ "potentially low-dimensional, hidden predictors `z` and linearly transforms them\n",
+ "to a higher-dimensional representation `x` using the transformation matrix `P`.\n",
+ "For the first dataset we leave the transformation matrix `P` the identity\n",
+ "matrix, i.e. `z` is as high-dimensional as `x`, but we will use a sparse `beta`,\n",
+ "i.e. a `beta` with some zero entries. For the second artificial dataset we will\n",
+ "use a low-dimensional `z` and a dense `beta`.\n",
+ "The response `y` is then sampled following the usual assumptions of multiple\n",
+ "linear regression.\n",
+ "The functions `bias.and.variance`, `glm.fit` and `pcr.fit` will be used to\n",
+ "compute the bias and variance of the different methods and to fit the data with\n",
+ "the Lasso or PCR, respectively."
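+ , "\n",
+ "If you want to see how `pcr` is called on its own, here is a minimal sketch (for illustration only; predicting `Alcohol` from the other scaled wine variables is an arbitrary choice):\n",
+ "\n",
+ "```r\n",
+ "library(pls)\n",
+ "wine.df <- data.frame(wine.sc) # the scaled wine data from the beginning of this notebook\n",
+ "wine.pcr <- pcr(Alcohol ~ ., data = wine.df, validation = \"CV\") # 10-fold cross-validation\n",
+ "validationplot(wine.pcr, val.type = \"MSEP\") # estimated MSE vs. number of components\n",
+ "```"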
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data.generator <- function(beta, n = 50, p = 45, P = diag(rep(1, p))) {\n",
+ " # sample hidden predictors z, map them to x = z P and draw y from the linear model\n",
+ " z <- matrix(rnorm(n * nrow(P)), nrow = n)\n",
+ " x <- z %*% P\n",
+ " y <- x %*% beta + rnorm(n)\n",
+ " list(x = x, y = y, data = data.frame(X = x, Y = y))\n",
+ "}\n",
+ "# estimate test error, squared bias and variance of the estimates beta.hats at test inputs x0\n",
+ "bias.and.variance <- function(beta, beta.hats, x0) {\n",
+ " m.beta.hats <- rowMeans(beta.hats)\n",
+ " list(test.error = mean(colMeans((x0 %*% as.matrix(sweep(beta.hats, 1, beta)))^2)),\n",
+ " bias = mean(rowSums(x0 %*% (beta - m.beta.hats))^2),\n",
+ " variance = mean(colMeans(((x0 %*% as.matrix(sweep(beta.hats, 1, m.beta.hats))))^2)))\n",
+ "}\n",
+ "library(glmnet)\n",
+ "# Lasso coefficients without the intercept (note: this masks stats::glm.fit, which is harmless here)\n",
+ "glm.fit <- function(data, ...) as.matrix(coef(glmnet(data$x, data$y, intercept = F, ...)))[seq(2, ncol(data$x)+1)]\n",
+ "library(pls)\n",
+ "pcr.fit <- function(data, ...) as.matrix(coef(pcr(Y ~ ., data = data$data, ...)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "In this cell we fit multiple training sets with the Lasso for different values\n",
+ "of the regularization parameter `lambda`, and we repeat the same for PCR with\n",
+ "different numbers of components. Here we use data with 45-dimensional\n",
+ "predictors and a sparse `beta`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "set.seed(123)\n",
+ "beta <- rnorm(45)\n",
+ "beta[sample(45, 10)] <- 0 # like this we obtain a sparse beta, where 10 entries are 0\n",
+ "x0 <- data.generator(beta, n = 500)$x # test data\n",
+ "lambdas <- 10^seq(-5, 0, length = 20) # lambdas to try\n",
+ "res1 <- sapply(lambdas, function(lambda) {\n",
+ " bias.and.variance(beta, data.frame(replicate(500, glm.fit(data.generator(beta), alpha = 1, lambda = lambda))), x0) })\n",
+ "res1 <- data.frame(res1)\n",
+ "ncomps <- seq(1, 45, 2)\n",
+ "res2 <- sapply(ncomps, function(ncomp) {\n",
+ " bias.and.variance(beta, data.frame(replicate(500, pcr.fit(data.generator(beta), ncomp = ncomp))), x0) })\n",
+ "res2 <- data.frame(res2)\n",
+ "\n",
+ "par(mfrow = c(1, 2))\n",
+ "plot(log10(lambdas), unlist(res1[1,]), main = \"Lasso\", type = 'l', col = 'red', ylim = c(0, 20), ylab = 'MSE')\n",
+ "lines(log10(lambdas), unlist(res1[2,]), col = 'blue')\n",
+ "lines(log10(lambdas), unlist(res1[3,]))\n",
+ "legend(\"topleft\", c(\"bias^2\", \"variance\", \"test\"),\n",
+ " col = c(\"blue\", \"black\", \"red\"), lty = 1)\n",
+ "\n",
+ "plot(ncomps, unlist(res2[1,]), main = \"PCR\", type = 'l', col = 'red', ylim = c(0, 20), ylab = \"MSE\")\n",
+ "lines(ncomps, unlist(res2[2,]), col = 'blue')\n",
+ "lines(ncomps, unlist(res2[3,]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "The plots above show typical curves for bias, variance and test error as a\n",
+ "function of the flexibility of the method. The Lasso is most flexible for the\n",
+ "smallest `lambda` (here `1e-5`, which is almost unregularised linear regression)\n",
+ "and we see that this is the case where the bias is the smallest but the variance\n",
+ "is highest. PCR is most flexible\n",
+ "when all components are used (= unregularised linear regression) and decreases\n",
+ "in flexibility with decreasing number of components. The minimal test error is\n",
+ "lower for the Lasso than for PCR.\n",
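+ "\n",
+ "As a side note on how to read these curves: writing $\bar\beta$ for the average of the estimates $\hat\beta$ over the 500 simulated training sets, `bias.and.variance` averages $(x_0^T(\hat\beta - \beta))^2$, $(x_0^T(\bar\beta - \beta))^2$ and $(x_0^T(\hat\beta - \bar\beta))^2$ over the test points $x_0$, so the three curves satisfy test error $=$ bias$^2$ $+$ variance exactly.\n",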
+ "\n",
+ "Next, we run the same analysis for data with 10-dimensional predictors embedded\n",
+ "in a 45-dimensional space (to visualize this situation, think of data\n",
+ "points that lie in a plane that is arbitrarily oriented in 3D space)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "outputs": [],
+ "source": [
+ "set.seed(123)\n",
+ "beta <- rnorm(45)\n",
+ "P <- matrix(rnorm(10*45), nrow = 10) # this transforms the predictors from 10 to 45 dimensions.\n",
+ "x0 <- data.generator(beta, n = 500, P = P)$x\n",
+ "lambdas <- 10^seq(-5, 0, length = 20)\n",
+ "res1 <- sapply(lambdas, function(lambda) {\n",
+ " bias.and.variance(beta, data.frame(replicate(500, glm.fit(data.generator(beta, P = P), alpha = 1, lambda = lambda))), x0) })\n",
+ "res1 <- data.frame(res1)\n",
+ "ncomps <- seq(1, 45, 3)\n",
+ "res2 <- sapply(ncomps, function(ncomp) {\n",
+ " bias.and.variance(beta, data.frame(replicate(500, pcr.fit(data.generator(beta, P = P), ncomp = ncomp))), x0) })\n",
+ "res2 <- data.frame(res2)\n",
+ "\n",
+ "par(mfrow = c(1, 2))\n",
+ "plot(log10(lambdas), unlist(res1[1,]), main = \"Lasso\", type = 'l', col = 'red', ylim = c(0, 5), ylab = 'MSE')\n",
+ "lines(log10(lambdas), unlist(res1[2,]), col = 'blue')\n",
+ "lines(log10(lambdas), unlist(res1[3,]))\n",
+ "legend(\"topleft\", c(\"bias^2\", \"variance\", \"test\"),\n",
+ " col = c(\"blue\", \"black\", \"red\"), lty = 1)\n",
+ "\n",
+ "plot(ncomps, unlist(res2[1,]), main = \"PCR\", type = 'l', col = 'red', ylim = c(0, 5), ylab = \"MSE\")\n",
+ "lines(ncomps, unlist(res2[2,]), col = 'blue')\n",
+ "lines(ncomps, unlist(res2[3,]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As expected, PCR has zero bias as soon as at least 10 components are included.\n",
+ "Including more than 10 components leads to higher variance.\n",
+ "The Lasso has the same minimal test error for many values of `lambda`.\n",
+ "\n",
+ "## Exercises\n",
+ "\n",
+ "### Conceptual\n",
+ "\n",
+ "**Q1.**\n",
+ "\n",
+ "(a) Estimate from the figure below the principal component loadings $\phi_{11}, \phi_{21}, \phi_{12}, \phi_{22}$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "set.seed(1)\n",
+ "data <- data.frame(as.matrix(data.frame(X1 = rnorm(400, sd = 6), X2 = rnorm(400, sd = 4))) %*% matrix(rnorm(4), 2, 2))\n",
+ "data <- scale(data, scale = F)\n",
+ "plot(data, xlim = c(-30, 30), ylim = c(-15, 15))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(b) In the figure below you see the scores of $n = 40$ measurements and the loadings of a principal component analysis. How many features $p$ were measured in this data set?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "set.seed(913)\n",
+ "data <- data.frame(as.matrix(data.frame(X1 = rnorm(40, sd = 6), X2 = rnorm(40, sd = 4), X3 = rnorm(40, sd = 2), X4 = rnorm(40, sd = 2))) %*% matrix(rnorm(16), 4, 4))\n",
+ "pca <- prcomp(data)\n",
+ "biplot(pca, scale = 0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(c) Estimate from the figure in (b) the loadings of the first two principal components and the scores of data point 28.\n",
+ "\n",
+ "(d) In which direction does the data in the figure in (b) vary the most?\n",
+ "\n",
+ "**Q2.** In this exercise you will explore the connection between PCA and SVD.\n",
+ "\n",
+ "(a) In the lecture we said that the first principal component is the eigenvector $\phi_1$ with the largest eigenvalue $\lambda_1$ of the matrix $X^T X$. From linear algebra you know that a real, symmetric matrix $A$ can be diagonalized such that $A = W \Lambda W^T$ holds, where $\Lambda$ is a diagonal matrix that contains the eigenvalues and $W$ is an orthogonal matrix that satisfies $W^T W = I$, where $I$ is the identity matrix. The columns of $W$ are the eigenvectors of $A$. Let us assume we have found the diagonalization $X^T X = W \Lambda W^T$. Multiply this equation from the right with $W$ and show that the columns of $W$ contain the eigenvectors of $X^T X$.\n",
+ "\n",
+ "(b) The singular value decomposition (SVD) of $X$ is $X = U\Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. Use the singular value decomposition to express the right-hand side of $X^TX = ...$ in terms of $U$, $V$ and $\Sigma$, show how $U, V, \Sigma, W$ and $\Lambda$ relate to each other and prove the statements on slide 11 of the lecture. *Hint*: the product of two diagonal matrices is again a diagonal matrix."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Applied\n",
+ "\n",
+ "**Q3.** Look at the following three artificial data sets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data1 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*3), nrow = 2))\n",
+ "data2 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*3), nrow = 2) + 0.1*rnorm(50*3))\n",
+ "data3 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*10), nrow = 2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "(a) Plot the datasets `data1` and `data2`. Hint: you can find an example in\n",
+ "section \"Limitations of PCA\" of this notebook.\n",
+ "\n",
+ "(b) Write down your expectations for the curves of the proportion of variance\n",
+ "explained for the three datasets.\n",
+ "\n",
+ "(c) Perform PCA and plot the proportion of variance explained.\n",
+ "\n",
+ "**Q4.** In this exercise we look at a food consumption dataset with the relative\n",
+ "consumption of certain food items in European and Scandinavian countries. The\n",
+ "numbers represent the percentage of the population consuming that food type.\n",
+ "You can load the data with"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data <- read.csv(url(\"https://openmv.net/file/food-consumption.csv\"))\n",
+ "data.sc1 <- scale(na.omit(data[,2:21]))\n",
+ "data.sc2 <- scale(data[,2:21])\n",
+ "data.sc2[is.na(data.sc2)] <- 0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "(a) Perform PCA on this dataset and produce two biplots: one with PC1 versus\n",
+ "PC2 and another with PC1 versus PC3. Hint: see `?biplot`.\n",
+ "\n",
+ "(b) Which are the important variables in the first three components?\n",
+ "\n",
+ "(c) What does it mean for a country to have a high score in the first component?\n",
+ "Pick one country that has a high score in the first component and verify that\n",
+ "this country does indeed have this interpretation.\n",
+ "\n",
+ "**Q5.** (optional) PCA has some nice history in human face recognition research\n",
+ "(see for example [here](https://en.wikipedia.org/wiki/Eigenface)). In this\n",
+ "exercise you will look at a small dataset of face images and perform an\n",
+ "analysis similar to the one we did with the image of the boat in section\n",
+ "\"Proportion Variance Explained\". However, instead of treating the rows of a\n",
+ "single image as individual data points, each face image is treated as a single\n",
+ "data point by transforming the two-dimensional images to one vector\n",
+ "(concatenating the rows to one long vector). You can load the dataset with the\n",
+ "following code."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "download.file('https://lcnwww.epfl.ch/bio322/att_faces.zip', 'faces.zip', mode = 'wb')\n",
+ "unzip('faces.zip')\n",
+ "library(png)\n",
+ "faces <- sapply(list.files('att_faces', full.names = T), readPNG)\n",
+ "data <- t(faces)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "lines_to_next_cell": 0
+ },
+ "source": [
+ "To look at a single image you can use the following function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "show.image <- function(x) image(t(matrix(x, nrow = 112))[,112:1], col = gray.colors(256), axes = F, asp = 1)\n",
+ "show.image(data[101,])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(a) Perform PCA on this dataset and plot the first few principal components by\n",
+ "using the `show.image` function above.\n",
+ "\n",
+ "(b) Plot the curve of the proportion of variance explained.\n",
+ "\n",
+ "(c) Plot some of the original images together with their low-dimensional\n",
+ "reconstruction based on a reasonable number of principal components."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "R",
+ "language": "R",
+ "name": "ir"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}