diff --git a/src/unsupervised_learning.ipynb b/src/unsupervised_learning.ipynb new file mode 100644 index 0000000..c060bde --- /dev/null +++ b/src/unsupervised_learning.ipynb @@ -0,0 +1,563 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Unsupervised Learning" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "New R commands\n", + "- `prcomp(x, scale = TRUE)` performs PCA on data `x`\n", + "- `biplot(pr.out, scale=0)` plots scores and loadings of a PCA `pr.out` in one figure.\n", + "- `cumsum(x)` computes the cumulative sum of the elements of a numeric vector.\n", + "- `kmeans(x,k,nstart=)` run k-means clustering on data `x` with `nstart` restarts." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## PCA" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this lab, we perform PCA on the USArrests data set, which is part of\n", + "the base R package. The rows of the data set contain the 50 states, in\n", + "alphabetical order." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "states=row.names(USArrests)\n", + "states" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The columns of the data set contain the four variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "names(USArrests)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We first briefly examine the data. We notice that the variables have vastly\n", + "different means." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "apply(USArrests, 2, mean)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the apply() function allows us to apply a function—in this case,\n", + "the mean() function—to each row or column of the data set. The second\n", + "input here denotes whether we wish to compute the mean of the rows, 1,\n", + "or the columns, 2. We see that there are on average three times as many\n", + "rapes as murders, and more than eight times as many assaults as rapes.\n", + "We can also examine the variances of the four variables using the apply()\n", + "function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "apply(USArrests, 2, var)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Not surprisingly, the variables also have vastly different variances: the\n", + "UrbanPop variable measures the percentage of the population in each state\n", + "living in an urban area, which is not a comparable number to the number of rapes in each state per 100,000 individuals. If we failed to scale the\n", + "variables before performing PCA, then most of the principal components\n", + "that we observed would be driven by the Assault variable, since it has by\n", + "far the largest mean and variance. Thus, it is important to standardize the\n", + "variables to have mean zero and standard deviation one before performing\n", + "PCA.\n", + "\n", + "We now perform principal components analysis using the prcomp() function, which is one of several functions in R that perform PCA." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.out=prcomp(USArrests, scale=TRUE)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, the prcomp() function centers the variables to have mean zero.\n", + "By using the option scale=TRUE , we scale the variables to have standard\n", + "deviation one. The output from prcomp() contains a number of useful quantities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "names(pr.out)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.out$center" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.out$scale" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component\n", + "loading vector." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.out$rotation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that there are four distinct principal components. This is to be\n", + "expected because there are in general min(n − 1, p) informative principal\n", + "components in a data set with n observations and p variables.\n", + "\n", + "This function names the loadings the rotation matrix, because when we matrix-multiply the\n", + "X matrix by pr.out$rotation, it gives us the coordinates of the data in the rotated\n", + "coordinate system. These coordinates are the principal component scores." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.out$x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To see that we actually recover these scores by multiplying the data X with the loading matrix pr.out$rotation we scale the data and perform the matrix multiplication." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "USArrests.scaled = scale(USArrests)\n", + "USArrests.scaled %*% pr.out$rotation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can plot the first two principal components as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "biplot(pr.out, scale=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The scale=0 argument to biplot() ensures that the arrows are scaled to\n", + "represent the loadings; other values for scale give slightly different biplots\n", + "with different interpretations.\n", + "\n", + "Alternatively we could directly plot the scores alone.\n", + "We see clearly the scores of the states California, Nevada and Florida on the left." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot(pr.out$x[,1:2], xlim = c(-3, 3), ylim = c(-3, 3))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The principal components are only unique up to a sign change, so we could have also obtained the following picture with equivalent information (this figure can be found in the text book as Figure 10.1):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.out$rotation=-pr.out$rotation\n", + "pr.out$x=-pr.out$x\n", + "biplot(pr.out, scale=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The prcomp() function also outputs the standard deviation of each principal component. For instance, on the USArrests data set, we can access\n", + "these standard deviations as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.out$sdev" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The variance explained by each principal component is obtained by squar-\n", + "ing these:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pr.var=pr.out$sdev^2\n", + "pr.var" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component\n", + "by the total variance explained by all four principal components:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pve=pr.var/sum(pr.var)\n", + "pve" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that the first principal component explains 62.0 % of the variance\n", + "in the data, the next principal component explains 24.7 % of the variance,\n", + "and so forth. We can plot the PVE explained by each component, as well\n", + "as the cumulative PVE, as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot(pve, xlab=\"Principal Component\", ylab=\"Proportion of Variance Explained\", ylim=c(0,1),type='b')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot(cumsum(pve), xlab=\"Principal Component\", ylab=\"Cumulative Proportion of Variance Explained\", ylim=c(0,1),type='b')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the function cumsum() computes the cumulative sum of the elements of a numeric vector. For instance:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a=c(1,2,8,-3)\n", + "cumsum(a)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## K-Means Clustering" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The function kmeans() performs K-means clustering in R . We begin with\n", + "a simple simulated example in which there truly are two clusters in the\n", + "data: the first 25 observations have a mean shift relative to the next 25\n", + "observations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set.seed(2)\n", + "x=matrix(rnorm(50*2), ncol=2)\n", + "x[1:25,1]=x[1:25,1]+3\n", + "x[1:25,2]=x[1:25,2]-4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now perform K-means clustering with K = 2." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "km.out=kmeans(x,2,nstart=20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The cluster assignments of the 50 observations are contained in\n", + "km.out$cluster." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "km.out$cluster" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans() . We\n", + "can plot the data, with each observation colored according to its cluster\n", + "assignment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot(x, col=(km.out$cluster+1), main=\"K-Means Clustering Results with K=2\", xlab=\"\", ylab=\"\", pch=20, cex=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here the observations can be easily plotted because they are two-dimensional.\n", + "If there were more than two variables then we could instead perform PCA\n", + "and plot the first two principal components score vectors.\n", + "\n", + "In this example, we knew that there really were two clusters because\n", + "we generated the data. However, for real data, in general we do not know\n", + "the true number of clusters. We could instead have performed K-means\n", + "clustering on this example with K = 3." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set.seed(4)\n", + "km.out=kmeans(x,3,nstart=20)\n", + "km.out" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot(x, col=(km.out$cluster+1), main=\"K-Means Clustering Results with K=3\", xlab=\"\", ylab=\"\", pch=20, cex=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When K = 3, K-means clustering splits up the two clusters.\n", + "\n", + "To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one\n", + "is used, then K-means clustering will be performed using multiple random\n", + "assignments in Step 1 of the Algorithm on slide 35 (Algorithm 10.1 in the book), and the kmeans() function will\n", + "report only the best results. Here we compare using nstart=1 to nstart=20 ." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set.seed(311)\n", + "km.out=kmeans(x,3,nstart=1)\n", + "km.out$tot.withinss" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "km.out=kmeans(x,3,nstart=20)\n", + "km.out$tot.withinss" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that km.out$tot.withinss is the total within-cluster sum of squares,\n", + "which we seek to minimize by performing K-means clustering (Equation\n", + "10.11). The individual within-cluster sum-of-squares are contained in the\n", + "vector km.out$withinss.\n", + "\n", + "We strongly recommend always running K-means clustering with a large\n", + "value of nstart, such as 20 or 50, since otherwise an undesirable local\n", + "optimum may be obtained.\n", + "\n", + "When performing K-means clustering, in addition to using multiple ini-\n", + "tial cluster assignments, it is also important to set a random seed using the\n", + "set.seed() function. This way, the initial cluster assignments in Step 1 can\n", + "be replicated, and the K-means output will be fully reproducible." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "3.6.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}