diff --git a/src/lecture11.ipynb b/src/lecture11.ipynb index 9e8932e..623c74d 100644 --- a/src/lecture11.ipynb +++ b/src/lecture11.ipynb @@ -1,787 +1,787 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 11\n", "\n", "## Principal Component Analysis: Directions of Highest Variance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "IRdisplay::display_html('')" + "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us first load the wine data and look at it with the `pairs` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine <- read.csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\", sep=\",\")\n", "colnames(wine) <- c(\"Cultivar\",\"Alcohol\",\"Malic acid\",\"Ash\",\"Alcalinity of ash\", \"Magnesium\", \"Total phenols\", \"Flavanoids\", \"Nonflavanoid phenols\", \"Proanthocyanins\", \"Color intensity\", \"Hue\", \"OD280/OD315 of diluted wines\", \"Proline\")\n", "pairs(wine[,-1], cex = .5) # the first column contains the label of the cultivar; we pretend here that we don't know it." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We will scale the data first with `scale` and perform a principal component\n", "analysis on this data set with the function `prcomp` and plot the scores of the\n", "data onto the first two components." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "wine.sc <- scale(wine[,-1])\n", "pca <- prcomp(wine.sc, scale = F)\n", "# pca <- prcomp(wine[,-1], scale = T) # is an alternative, where we don't scale manually\n", "plot(pca$x[,1:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `prcomp` returns the scores in the variable `x` and the loadings in\n", "variable `rotation`. We can verify that there are $p$ loading vectors of length\n", "$p$ that have norm 1 and are orthogonal." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dim(pca$rotation)\n", "norm(pca$rotation[,1], \"2\") # use the standard Euclidean 2-norm\n", "sum(pca$rotation[,1] * pca$rotation[,2]) # the scalar product is approximately 0" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us now produce the `biplot` from the slides" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "biplot(pca, col = c('gray', 'red'), scale = 0)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "You can verify by looking at `pca$x[,1:2]` and `pca$rotation[,1:2]` that the\n", "texts are placed at the right position.\n", "\n", "To produce a similar figure to the one in the slides without scaling we can use the code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "biplot(prcomp(wine[,-1], scale = F), col = c('gray', 'red'), scale = 0)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us check the relation between Singular Value Decomposition and PCA." 
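] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Before the comparison with `svd` in the next cell, here is a small sanity check (a sketch, assuming `wine.sc` and `pca` from the cells above): the loadings should also appear as the eigenvectors of $X^TX$. Since eigenvectors are only defined up to a sign flip, we compare absolute values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ev <- eigen(t(wine.sc) %*% wine.sc) # eigenvectors of X^T X, sorted by decreasing eigenvalue\n", "sum((abs(ev$vectors) - abs(pca$rotation))^2) # approximately 0: same directions up to sign"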
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "s <- svd(wine.sc)\n", "sum((s$v - pca$rotation)^2)\n", "sum((s$u %*% diag(s$d) - pca$x)^2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see in the output above that the loadings matrix `pca$rotation` is the\n", "same as the $V$ matrix of SVD (up to numerical errors) and the scores `pca$x`\n", "are the same as $U\\Sigma$ where $\\Sigma$ is given by `diag(s$d)`.\n", "\n", "You can now answer the questions on the first page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).\n", "\n", "## Principal Component Analysis: Closest Low-Dimensional Subspace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "In the next cell you can see that we can perfectly reconstruct the original data\n", "$X$ by computing either $Z\\Phi^T$ or $X\\Phi\\Phi^T$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine.sc.rec1 <- pca$x %*% t(pca$rotation)\n", "wine.sc.rec2 <- wine.sc %*% pca$rotation %*% t(pca$rotation)\n", "sum((wine.sc - wine.sc.rec1)^2)\n", "sum((wine.sc - wine.sc.rec2)^2)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Now we reconstruct the data using only the first two principal components and\n", "compare the reconstruction error to the one we obtain by using the first and the\n", "third principal component. You can see that the reconstruction error is higher\n", "in the latter case. No matter which pair of principal components we use to\n", "reconstruct the original data: the reconstruction error will always be higher\n", "than the one with the first two principal components. The first two principal\n", "components find indeed the two-dimensional plane in this 13-dimensional space\n", "that is closest to the original data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine.sc.rec.2 <- wine.sc %*% pca$rotation[,1:2] %*% t(pca$rotation[,1:2])\n", "sum((wine.sc - wine.sc.rec.2)^2)\n", "wine.sc.rec.2b <- wine.sc %*% pca$rotation[,c(1, 3)] %*% t(pca$rotation[,c(1, 3)])\n", "sum((wine.sc - wine.sc.rec.2b)^2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now answer the questions on the second page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).\n", "\n", "## Proportion Variance Explained" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The standard deviation of the scores of each component are in `pca$sdev`; we can\n", "get the variance by squaring these numbers and the proportion of variance\n", "explained by normalizing with the total variance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca.var <- pca$sdev^2\n", "pca.vare <- pca.var / sum(pca.var)\n", "plot(pca.vare, xlab = \"Principal Component\", ylab = \"Prop. Variance Explained\", col = \"blue\", type = \"b\")\n", "plot(cumsum(pca.vare), xlab = \"Principal Component\", ylab = \"Cumulative Prop. 
Variance Explained\", col = \"blue\", type = \"b\")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "In the following cell we download and plot the image of the boat you saw in the\n", "video." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(tiff)\n", "fname <- tempfile()\n", "download.file(\"http://sipi.usc.edu/database/download.php?vol=misc&img=boat.512\", fname)\n", "img <- readTIFF(fname)\n", "image(t(img)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Now, let us scale the data, perform PCA and plot the variance explained, the\n", "original image and the reconstructed image with the first 25 components." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "img.s <- scale(img)\n", "pZ <- prcomp(img.s)\n", "library(DMwR) # used for the unscale function\n", "newimg <- unscale(pZ$x[,1:25] %*% t(pZ$rotation[,1:25]), img.s)\n", "par(mfrow = c(1, 3), oma = rep(1, 4), cex = 1)\n", "plot(pZ$sdev^2/sum(pZ$sdev^2), ylab = 'Prop. Variance Explained')\n", "image(t(img)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)\n", "image(t(newimg)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now answer the questions on the third page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).\n", "\n", "## Limitations of PCA" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us generate below an artificial dataset with 8 point clouds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(517)\n", "library(MASS)\n", "x1 <- mvrnorm(100, c(1, 1, 1), .1*diag(3))\n", "x2 <- mvrnorm(100, c(-1, 1, 1), .1*diag(3))\n", "x3 <- mvrnorm(100, c(1, -1, 1), .1*diag(3))\n", "x4 <- mvrnorm(100, c(-1, -1, 1), .1*diag(3))\n", "x5 <- mvrnorm(100, c(1, 1, -.8), .1*diag(3))\n", "x6 <- mvrnorm(100, c(-1, 1, -.8), .1*diag(3))\n", "x7 <- mvrnorm(100, c(1, -1, -.8), .1*diag(3))\n", "x8 <- mvrnorm(100, c(-1, -1, -.8), .1*diag(3))\n", "x <- rbind(x1, x2, x3, x4, x5, x6, x7, x8)\n", "cols <- c(rep(1, 400), rep(2, 400))\n", "library(plotly)\n", "plot_ly(x = x[,1], y = x[,2], z = x[,3], type = 'scatter3d', mode = 'markers', marker = list(size = 2), color = cols)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Without scaling the data you get a very similar image as the one shown in the\n", "slides." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca <- prcomp(x)\n", "order <- sample(1:800)\n", "plot(pca$x[order, 1:2], col = cols[order], xlab = \"PC1\", ylab = \"PC2\")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "With scaling before applying PCA the result looks a bit better, but there are\n", "still overlaps of the point clouds after projection to the lower-dimensional\n", "space." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- scale(x)\n", "pca <- prcomp(x)\n", "order <- sample(1:800)\n", "plot(pca$x[order, 1:2], col = cols[order], xlab = \"PC1\", ylab = \"PC2\")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The result is very different when we apply t-SNE instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(Rtsne)\n", "tsne <- Rtsne(x)\n", "plot(tsne$Y, col = cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Principal Component Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The code in this section may look complicated. Don't worry, if you don't\n", "understand the details. I included the code for transparency: you can see,\n", "how I produced the figures in the slides. It is totally fine if you do not want\n", "to spend time on reading it. The only thing that you may want to remember is\n", "that there is the `library(pls)` that contains the function `pcr` to do\n", "principal component regression; however, the Lasso or Ridge Regression is\n", "usually preferable.\n", "\n", "In the following cell we define a data generator function that samples\n", "potentially low-dimensional, hidden predictors `z` and linearly transforms them\n", "to a higher dimensional representation `x` using transformation matrix `P`.\n", "For the first dataset we leave the transformation matrix `P` the identity\n", "matrix, i.e. `z` is as high-dimensional as `x`, but we will use a sparse `beta`,\n", "i.e. as `beta` with some zero entries. For the second artificial dataset we will\n", "use a low-dimensional `z` and a dense `beta`.\n", "The response `y` is then sampled following the ordinary assumption of multiple\n", "linear regression.\n", "The functions `bias.and.variance`, `glm.fit` and `pcr.fit` will be used to\n", "compute the bias and variance of the different methods and fit the data with the\n", "Lasso or PCR, respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.generator <- function(beta, n = 50, p = 45, P = diag(rep(1, p))) {\n", " z <- matrix(rnorm(n * nrow(P)), nrow = n)\n", " x <- z %*% P\n", " y = x %*% beta + rnorm(n)\n", " list(x = x, y = y, data = data.frame(X = x, Y = y))\n", "}\n", "bias.and.variance <- function(beta, beta.hats, x0) {\n", " m.beta.hats <- rowMeans(beta.hats)\n", " list(test.error = mean(colMeans((x0 %*% as.matrix(sweep(beta.hats, 1, beta)))^2)),\n", " bias = mean(rowSums(x0 %*% (beta - m.beta.hats))^2),\n", " variance = mean(colMeans(((x0 %*% as.matrix(sweep(beta.hats, 1, m.beta.hats))))^2)))\n", "}\n", "library(glmnet)\n", "glm.fit <- function(data, ...) as.matrix(coef(glmnet(data$x, data$y, intercept = F, ...)))[seq(2, ncol(data$x)+1)]\n", "library(pls)\n", "pcr.fit <- function(data, ...) as.matrix(coef(pcr(Y ~ ., data = data$data, ...)))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "In this cell we fit multiple training sets with the Lasso and different values\n", "for the regularization parameter `lambda` and we repeat the same for PCR and\n", "different number of components. Here we use data with 45-dimensional\n", "predictors and sparse `beta`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "set.seed(123)\n", "beta <- rnorm(45)\n", "beta[sample(45, 10)] <- 0 # like this we obtain a sparse beta, where 10 entries are 0\n", "x0 <- data.generator(beta, n = 500)$x # test data\n", "lambdas <- 10^seq(-5, 0, length = 20) # lambdas to try\n", "res1 <- sapply(lambdas, function(lambda) {\n", " bias.and.variance(beta, data.frame(replicate(500, glm.fit(data.generator(beta), alpha = 1, lambda = lambda))), x0) })\n", "res1 <- data.frame(res1)\n", "ncomps <- seq(1, 45, 2)\n", "res2 <- sapply(ncomps, function(ncomp) {\n", " bias.and.variance(beta, data.frame(replicate(500, pcr.fit(data.generator(beta), ncomp = ncomp))), x0) })\n", "res2 <- data.frame(res2)\n", "\n", "par(mfrow = c(1, 2))\n", "plot(log10(lambdas), unlist(res1[1,]), main = \"Lasso\", type = 'l', col = 'red', ylim = c(0, 20), ylab = 'MSE')\n", "lines(log10(lambdas), unlist(res1[2,]), col = 'blue')\n", "lines(log10(lambdas), unlist(res1[3,]))\n", "legend(\"topleft\", c(\"bias^2\", \"variance\", \"test\"),\n", " col = c(\"blue\", \"black\", \"red\"), lty = 1)\n", "\n", "plot(ncomps, unlist(res2[1,]), main = \"PCR\", type = 'l', col = 'red', ylim = c(0, 20), ylab = \"MSE\")\n", "lines(ncomps, unlist(res2[2,]), col = 'blue')\n", "lines(ncomps, unlist(res2[3,]))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The plots above show typical curves for bias, variance, test error as a function\n", "of the flexibility of the method. The Lasso is most flexible for `lambda = 1` (=\n", "unregularised linear regression) and we see that this is the case where the bias\n", "is the smallest but the variance is highest. PCR is most flexible\n", "when all components are used (= unregularised linear regression) and decreases\n", "in flexibility with decreasing number of components. The lowest test error is\n", "lower for the Lasso than for PCR.\n", "\n", "Next, we run the same analysis for data with 10-dimensional predictors embedded\n", "in a 45-dimensional space (if you want to visualize this situation think of data\n", "points that lie in a plane that is arbitrarily oriented in 3D space)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "set.seed(123)\n", "beta <- rnorm(45)\n", "P <- matrix(rnorm(10*45), nrow = 10) # this transforms the predictors from 10 to 45 dimensions.\n", "x0 <- data.generator(beta, n = 500, P = P)$x\n", "lambdas <- 10^seq(-5, 0, length = 20)\n", "res1 <- sapply(lambdas, function(lambda) {\n", " bias.and.variance(beta, data.frame(replicate(500, glm.fit(data.generator(beta, P = P), alpha = 1, lambda = lambda))), x0) })\n", "res1 <- data.frame(res1)\n", "ncomps <- seq(1, 45, 3)\n", "res2 <- sapply(ncomps, function(ncomp) {\n", " bias.and.variance(beta, data.frame(replicate(500, pcr.fit(data.generator(beta, P = P), ncomp = ncomp))), x0) })\n", "res2 <- data.frame(res2)\n", "\n", "par(mfrow = c(1, 2))\n", "plot(log10(lambdas), unlist(res1[1,]), main = \"Lasso\", type = 'l', col = 'red', ylim = c(0, 5), ylab = 'MSE')\n", "lines(log10(lambdas), unlist(res1[2,]), col = 'blue')\n", "lines(log10(lambdas), unlist(res1[3,]))\n", "legend(\"topleft\", c(\"bias^2\", \"variance\", \"test\"),\n", " col = c(\"blue\", \"black\", \"red\"), lty = 1)\n", "\n", "plot(ncomps, unlist(res2[1,]), main = \"PCR\", type = 'l', col = 'red', ylim = c(0, 5), ylab = \"MSE\")\n", "lines(ncomps, unlist(res2[2,]), col = 'blue')\n", "lines(ncomps, unlist(res2[3,]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, PCR has zero bias as soon as at least 10 components are included.\n", "Including more than 10 components leads to higher variance.\n", "The Lasso has the same minimal test error for many values of lambda.\n", "\n", "## Exercises\n", "\n", "### Conceptual\n", "\n", "**Q1.**\n", "\n", "(a) Estimate from the figure below the principal component loadings $\\phi_{11}, \\phi_{21}, \\phi_{12}, \\phi_{22}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(1)\n", "data = data.frame(as.matrix(data.frame(X1 = rnorm(400, sd = 6), X2 = rnorm(400, sd = 4))) %*% matrix(rnorm(4), 2, 2))\n", "data = scale(data, scale = F)\n", "plot(data, xlim = c(-30, 30), ylim = c(-15, 15))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) In the figure below you see the scores of $n = 40$ measuments and the loadings of a principal component analysis. How many features $p$ were measured in this data set?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(913)\n", "data = data.frame(as.matrix(data.frame(X1 = rnorm(40, sd = 6), X2 = rnorm(40, sd = 4), X3 = rnorm(40, sd = 2), X4 = rnorm(40, sd = 2))) %*% matrix(rnorm(16), 4, 4))\n", "pca = prcomp(data)\n", "biplot(pca, scale = 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Estimate from the figure in (b) the loadings of the first two principal components and the scores of data point 28.\n", "\n", "(d) In which direction does the data in the figure in (b) vary the most?\n", "\n", "**Q2.** In this exercise you will explore the connection between PCA and SVD.\n", "\n", "(a) In the lecture we said that the first principal component is the eigenvector $\\phi_1$ with the largest eigenvalue $\\lambda_1$ of the matrix $X^T X$. From linear algebra you know that a real, symmetric matrix $A$ can be diagonalized such that $A = W \\Lambda W^T$ holds, where $\\Lambda$ is a diagonal matrix that contains the eigenvalues and $W$ is an orthogonal matrix that satisfies $W^T W = I$, where $I$ is the identity matrix. 
The columns of $W$ are the eigenvectors of $A$. Let us assume we have found the diagonalization $X^T X = W \\Lambda W^T$. Multiply this equation from the right by $W$ and show that the columns of $W$ contain the eigenvectors of $X^T X$.\n", "\n", "(b) The singular value decomposition (SVD) of $X$ is $X = U\\Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\\Sigma$ is a diagonal matrix. Use the singular value decomposition to express the right-hand side of $X^TX = ...$ in terms of $U$, $V$ and $\\Sigma$, show how $U, V, \\Sigma, W$ and $\\Lambda$ relate to each other, and prove the statements on slide 11 of the lecture. *Hint*: the product of two diagonal matrices is again a diagonal matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applied\n", "\n", "**Q3.** Look at the following three artificial data sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data1 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*3), nrow = 2))\n", "data2 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*3), nrow = 2) + 0.1*rnorm(50*3))\n", "data3 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*10), nrow = 2))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "(a) Plot the datasets `data1` and `data2`. Hint: you can find an example in\n", "section \"Limitations of PCA\" of this notebook.\n", "\n", "(b) Write down your expectations for the curve of the proportion of variance\n", "explained for the three datasets.\n", "\n", "(c) Perform PCA and plot the proportion of variance explained.\n", "\n", "**Q4.** In this exercise we look at a food consumption dataset with the relative\n", "consumption of certain food items in European and Scandinavian countries. The\n", "numbers represent the percentage of the population consuming that food type.\n", "You can load the data with" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data <- read.csv(url(\"https://openmv.net/file/food-consumption.csv\"))\n", "data.sc1 <- scale(na.omit(data[,2:21]))\n", "data.sc2 <- scale(data[,2:21])\n", "data.sc2[is.na(data.sc2)] <- 0" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "(a) Perform PCA on this dataset and produce two biplots: one with PC1 versus\n", "PC2 and another with PC1 versus PC3. Hint: see `?biplot`.\n", "\n", "(b) Which are the important variables in the first three components?\n", "\n", "(c) What does it mean for a country to have a high score in the first component?\n", "Pick one country that has a high score in the first component and verify that\n", "this interpretation does indeed hold for it.\n", "\n", "**Q5.** (optional) PCA has some nice history in human face recognition research\n", "(see for example [here](https://en.wikipedia.org/wiki/Eigenface)). In this\n", "exercise you will look at a small dataset of face images and perform an analysis\n", "similar to the one we did with the image of the boat in section \"Proportion\n", "Variance Explained\". However, instead of treating the rows of a single image\n", "as individual data points, each face image is treated as a single data point by\n", "transforming the two-dimensional image into one vector (concatenating the rows to\n", "one long vector). You can load the dataset with the following code."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "download.file('https://lcnwww.epfl.ch/bio322/att_faces.zip', 'faces.zip')\n", "unzip('faces.zip')\n", "library(png)\n", "faces <- sapply(list.files('att_faces', full.names = T), readPNG)\n", "data <- t(faces)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "To look at a single image you can use the following function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show.image <- function(x) image(t(matrix(x, nrow = 112))[,112:1], col = gray.colors(256), axes = F, asp = 1)\n", "show.image(data[101,])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(a) Perform PCA on this dataset and plot the first few principal components by\n", "using the `show.image` function above.\n", "\n", "(b) Plot the curve of the proportion of variance explained.\n", "\n", "(c) Plot some of the original images together with their low-dimensional\n", "reconstruction based on a reasonable number of principal components." ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" } }, "nbformat": 4, "nbformat_minor": 4 }