diff --git a/src/lecture12.ipynb b/src/lecture12.ipynb index fa64c07..edf0762 100644 --- a/src/lecture12.ipynb +++ b/src/lecture12.ipynb @@ -1,493 +1,493 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 12" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-Means Clustering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `kmeans` performs K-means clustering in R . We begin with\n", "a simple simulated example in which there truly are two clusters in the\n", "data: the first 25 observations have a mean shift relative to the next 25\n", "observations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "set.seed(2)\n", "x <- matrix(rnorm(50*2), ncol=2)\n", "x[1:25,1] <- x[1:25,1] + 3 # move the first 25 points to another center\n", "x[1:25,2] <- x[1:25,2] - 4 # move the first 25 points to another center\n", "plot(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that we have two clusters.\n", "\n", "We now perform K-means clustering with `K = 2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "km.out <- kmeans(x, 2, nstart = 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cluster assignments of the 50 observations are contained in\n", "km.out$cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "km.out$cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to `kmeans()`. We\n", "can plot the data, with each observation colored according to its cluster\n", "assignment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(x, col = (km.out$cluster+1), main = \"K-Means Clustering Results with K=2\", xlab=\"\", ylab=\"\", pch=20, cex=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the observations can be easily plotted because they are two-dimensional.\n", "If there were more than two variables then we could instead perform PCA\n", "and plot the first two principal components score vectors.\n", "\n", "In this example, we knew that there really were two clusters because\n", "we generated the data. However, for real data, in general we do not know\n", "the true number of clusters. We could instead have performed K-means\n", "clustering on this example with `K = 3`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(4)\n", "km.out <- kmeans(x,3,nstart=20)\n", "km.out" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(x, col=(km.out$cluster+1), main=\"K-Means Clustering Results with K=3\", xlab=\"\", ylab=\"\", pch=20, cex=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `K = 3`, K-means clustering splits up the two clusters.\n", "\n", "To run the `kmeans()` function in R with multiple initial cluster assignments, we use the `nstart` argument. 
If a value of `nstart` greater than one\n", "is used, then K-means clustering will be performed using multiple random\n", "assignments in Step 1 of the Algorithm on slide 35 (Algorithm 10.1 in the book), and the `kmeans()` function will\n", "report only the best result. Here we compare `nstart=1` with `nstart=20`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set.seed(311)\n", "km.out <- kmeans(x, 3, nstart = 1)\n", "km.out$tot.withinss" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "km.out <- kmeans(x, 3, nstart = 20)\n", "km.out$tot.withinss" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Note that `km.out$tot.withinss` is the total within-cluster sum of squares,\n", "which we seek to minimize by performing K-means clustering.\n", "The individual within-cluster sums of squares are contained in the\n", "vector `km.out$withinss`.\n", "\n", "We strongly recommend always running K-means clustering with a large\n", "value of `nstart`, such as 20 or 50, since otherwise an undesirable local\n", "optimum may be obtained.\n", "\n", "In the following cell we again load and scale the wine data to prepare\n", "it for K-means clustering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine <- read.csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\", sep=\",\")\n", "colnames(wine) <- c(\"Cultivar\",\"Alcohol\",\"Malic acid\",\"Ash\",\"Alcalinity of ash\", \"Magnesium\", \"Total phenols\", \"Flavanoids\", \"Nonflavanoid phenols\", \"Proanthocyanins\", \"Color intensity\", \"Hue\", \"OD280/OD315 of diluted wines\", \"Proline\")\n", "wine.sc <- scale(wine[,-1])" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us now run K-means clustering with `K = 3` on the wine data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "set.seed(912)\n", "km <- kmeans(wine.sc, 3, nstart = 20)\n", "km$cluster" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "For clustering we did not use the fact that the wines come from different types\n", "of cultivated wine grapes (cultivars). However, the type of grape has such a strong\n", "influence on the properties of the wine that clustering based on these properties\n", "recovers the three cultivars almost perfectly, as we can see in the following cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean(km$cluster == wine$Cultivar)" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Note that comparing the cluster numbers directly to the cultivar labels with `mean()`\n", "only works because the numbering returned by `kmeans()` happens to line up with the\n", "labels here; in general the numbering of the clusters is arbitrary."
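] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "As a quick check that does not depend on the numbering (a minimal sketch reusing the\n", "objects `km` and `wine` defined above), we can cross-tabulate the cluster assignments\n", "against the cultivars with `table()`:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "# rows: K-means clusters, columns: true cultivars\n", "table(km$cluster, wine$Cultivar)" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Let us now use PCA to plot the data in a low-dimensional space."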
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "pca <- prcomp(wine.sc)\n", "plot(pca$x[,1:2], col = km$cluster)\n", "wrong <- km$cluster != wine$Cultivar\n", "points(pca$x[wrong,1:2], pch = 4, col = wine$Cultivar[wrong], cex = 2)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The crosses mark the wines that were assigned to the \"wrong\" cluster; their color\n", "indicates the true cultivar. There are only six such wines, even though no label\n", "information was used when learning the clusters.\n", "\n", "You can now answer the questions on the first page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1118676).\n", "\n", "## Hierarchical Clustering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IRdisplay::display_html('')" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "We use the function `dist` to compute the pairwise distances between all data\n", "points in a given dataset. Let us look at an extremely simple example with\n", "three one-dimensional data points." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data <- c(-1, 4, 7)\n", "dist(data)" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "As expected, the distance between data points 1 and 2 is 5, between 1 and 3 it is\n", "8, and between 2 and 3 it is 3.\n", "\n", "We can then apply the function `hclust` to\n", "perform hierarchical clustering." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "hc <- hclust(dist(data))\n", "plot(hc)" ] },
{ "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "The function `hclust` computes hierarchical clustering with complete linkage by\n", "default. You can see that the first link (between data points 2 and 3) is at\n", "height 3, corresponding to the distance between these two points. The second\n", "link, between \"cluster\" {1} and cluster {2, 3}, is at height 8, because this is the\n", "maximal distance between point 1 and any point in the cluster {2, 3}.\n", "\n", "Let us now apply hierarchical clustering to the wine data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "hc.complete <- hclust(dist(wine.sc), method = \"complete\")\n", "hc.average <- hclust(dist(wine.sc), method = \"average\")\n", "hc.single <- hclust(dist(wine.sc), method = \"single\")\n", "hc.centroid <- hclust(dist(wine.sc), method = \"centroid\")\n", "par(mfrow = c(1, 4))\n", "plot(hc.complete, main = \"Complete Linkage\", xlab = \"\", sub = \"\")\n", "plot(hc.average, main = \"Average Linkage\", xlab = \"\", sub = \"\")\n", "plot(hc.single, main = \"Single Linkage\", xlab = \"\", sub = \"\")\n", "plot(hc.centroid, main = \"Centroid Linkage\", xlab = \"\", sub = \"\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the four linkage methods lead to quite different results.\n", "For centroid linkage you can see the typical inversions: the height (cluster\n", "dissimilarity) does not increase monotonically when moving from a leaf node\n", "upwards.\n", "\n", "To cut the tree at a particular height, we can use the function `cutree`, with\n", "the second argument specifying the desired number of clusters."
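] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "As a small illustration (a sketch reusing the toy dendrogram `hc` from above), cutting\n", "the three-point tree into two groups separates point 1 from points 2 and 3:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "# cut the toy dendrogram into k = 2 groups\n", "cutree(hc, 2)\n", "# equivalently, cut at a height between 3 and 8\n", "cutree(hc, h = 5)"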
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "plot(pca$x[,1:2], main = \"Complete Linkage\", col = cutree(hc.complete, 3))\n", "plot(pca$x[,1:2], main = \"Average Linkage\", col = cutree(hc.average, 3))\n", "plot(pca$x[,1:2], main = \"Single Linkage\", col = cutree(hc.single, 3))\n", "plot(pca$x[,1:2], main = \"Centroid Linkage\", col = cutree(hc.centroid, 3))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Average, single and centroid linkage do not lead to the result we would have expected.\n", "With complete linkage the result looks \"better\".\n", "\n", "If you are interested in an example of hierarchical clustering applied to genome\n", "data, I recommend looking at Lab 3 in the textbook on page 407.\n", "\n", "You can now answer the questions on the second page of the\n", "[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1118676).\n", "\n", "## Exercises\n", "\n", "**Q1.** In this problem, you will perform clustering manually, with $K = 2$, on a small example with $n = 6$ observations and $p = 2$ features. The observations are as follows.\n", " $$\\begin{array}{c|cc}\n", " \\hline\n", " \\mbox{Obs.} &X_1 &X_2 \\cr\n", " \\hline\n", " 1 &1 &4 \\cr\n", " 2 &1 &3 \\cr\n", " 3 &0 &4 \\cr\n", " 4 &5 &1 \\cr\n", " 5 &6 &2 \\cr\n", " 6 &4 &0 \\cr\n", " \\hline\n", " \\end{array}$$\n", "\n", "(a) Plot the observations.\n", "\n", "(b) Randomly assign a cluster label to each observation. You can use the `sample()` command in `R` to do this, if you do not want to do this manually. Report the cluster labels for each observation.\n", "\n", "(c) Compute the centroid for each cluster.\n", "\n", "(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.\n", "\n", "(e) Repeat (c) and (d) until the answers obtained stop changing.\n", "\n", "(f) In your plot from (a), color the observations according to the cluster labels obtained.\n", "\n", "(g) Apply hierarchical clustering with complete linkage to the same data set.\n", "Draw the dendrogram with the links at the correct heights.\n", "\n", "## Applied\n", "\n", "**Q2.** In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.\n", "\n", "(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.\n", "\n", "*Hint*: There are a number of functions in `R` that you can use to generate data, like `rnorm` or `runif` or `mvrnorm` from `library(MASS)`. Be sure to create the data such that there are three distinct classes.\n", "\n", "(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors. Compute and plot the cumulative proportion of variance explained and comment on your result.\n", "\n", "(c) Explain how you could change your simulated data such that the cumulative proportion of variance explained approaches 1 faster/slower than in (b).\n", "\n", "(d) Perform K-means clustering of the observations with $K = 3$. 
How well do the clusters that you obtained in K-means clustering compare to the true class labels?\n", "\n", "*Hint*: You can use the `table()` function in `R` to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.\n", "\n", "(e) Perform K-means clustering with $K = 2$. Describe your results.\n", "\n", "(f) Now perform K-means clustering with $K = 4$, and describe your results.\n", "\n", "(g) Now perform K-means clustering with $K = 3$ on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60x2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.\n", "\n", - "(h) Using the `scale()` function, perform K-means clustering with $K = 3$ on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.\n", + "(h) Using the `scale()` function, perform K-means clustering with $K = 3$ on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (d)? Explain.\n", "\n", "(i) Perform hierarchical clustering on the data. By choosing different cuts, do\n", "you get similar results as with K-means clustering with $K = 2, 3, 4$?\n" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" } }, "nbformat": 4, "nbformat_minor": 4 }