diff --git a/src/Intro to R.ipynb b/src/Intro to R.ipynb new file mode 100644 index 0000000..65450af --- /dev/null +++ b/src/Intro to R.ipynb @@ -0,0 +1,1069 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Basic Commands" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "R uses functions to perform operations. To run a function called funcname,\n", + "we type funcname(input1, input2) , where the inputs (or arguments) input1\n", + "and input2 tell R how to run the function. A function can have any number\n", + "of inputs. For example, to create a vector of numbers, we use the function\n", + "c() (for concatenate). Any numbers inside the parentheses are joined together. \n", + "The following command instructs R to join together the numbers\n", + "1, 3, 2, and 5, and to save them as a vector named x . When we type x , it\n", + "gives us back the vector." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x <- c(1,3,2,5)\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that at an interactive R console each command follows a > prompt; the >\n", + "is not part of the command but is printed by R to indicate that it is ready\n", + "for another command. We can also save things using = rather than <- :" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = c(1,6,2)\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y = c(1,4,3)\n", + "length(y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can tell R to add two sets of numbers together. It will then add the\n", + "first number from x to the first number from y , and so on. However, x and\n", + "y should be the same length. We can check their length using the length()\n", + "function."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x + y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The ls() function allows us to look at a list of all of the objects, such\n", + "as data and functions, that we have saved so far. The rm() function can be\n", + "used to delete any that we don’t want." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ls()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rm(x, y)\n", + "ls()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It’s also possible to remove all objects at once:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rm(list=ls())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The matrix() function can be used to create a matrix of numbers. Before\n", + "we use the matrix() function, we can learn more about it:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "?matrix" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The help file reveals that the matrix() function takes a number of inputs,\n", + "but for now we focus on the first three: the data (the entries in the matrix),\n", + "the number of rows, and the number of columns. First, we create a simple\n", + "matrix."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x=matrix(data=c(1,2,3,4), nrow=2, ncol=2)\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that we could just as well omit typing data= , nrow= , and ncol= in the\n", + "matrix() command above: that is, we could just type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x=matrix(c(1,2,3,4),2,2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "and this would have the same effect. However, it can sometimes be useful to\n", + "specify the names of the arguments passed in, since otherwise R will assume\n", + "that the function arguments are passed into the function in the same order\n", + "that is given in the function’s help file. As this example illustrates, by\n", + "default R creates matrices by successively filling in columns. Alternatively,\n", + "the byrow=TRUE option can be used to populate the matrix in order of the\n", + "rows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "matrix(c(1,2,3,4),2,2,byrow=TRUE)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that in the above command we did not assign the matrix to a variable\n", + "such as x . In this case the matrix is printed to the screen but is not saved\n", + "for future calculations. The sqrt() function returns the square root of each\n", + "element of a vector or matrix. The command x^2 raises each element of x\n", + "to the power 2; any powers are possible, including fractional or negative\n", + "powers."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sqrt(x)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x^2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The rnorm() function generates a vector of random normal variables,\n", + "with first argument n the sample size. Each time we call this function, we\n", + "will get a different answer. Here we create two correlated sets of numbers,\n", + "x and y , and use the cor() function to compute the correlation between\n", + "them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x=rnorm(50)\n", + "y=x+rnorm(50,mean=50,sd=.1)\n", + "cor(x,y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, rnorm() creates standard normal random variables with a mean\n", + "of 0 and a standard deviation of 1. However, the mean and standard\n", + "deviation can be altered using the mean and sd arguments, as illustrated above.\n", + "Sometimes we want our code to reproduce the exact same set of random\n", + "numbers; we can use the set.seed() function to do this. The set.seed()\n", + "function takes an (arbitrary) integer argument. Evaluate the following cell\n", + "multiple times to see the reproducibility." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set.seed(1303)\n", + "rnorm(50)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The mean() and var() functions can be used to compute the mean and\n", + "variance of a vector of numbers. Applying sqrt() to the output of var()\n", + "will give the standard deviation. Or we can simply use the sd() function."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "set.seed(3)\n", + "y=rnorm(100)\n", + "mean(y)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "var(y)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sqrt(var(y))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sd(y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Graphics" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The plot() function is the primary way to plot data in R . For instance,\n", + "plot(x,y) produces a scatterplot of the numbers in x versus the numbers\n", + "in y . There are many additional options that can be passed in to the plot()\n", + "function. For example, passing in the argument xlab will result in a label\n", + "on the x-axis. To find out more information about the plot() function,\n", + "type ?plot ." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x=rnorm(100)\n", + "y=rnorm(100)\n", + "plot(x,y)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot(x,y,xlab=\"this is the x-axis\",ylab=\"this is the y-axis\",main=\"Plot of X vs Y\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will often want to save the output of an R plot. The command that we\n", + "use to do this will depend on the file type that we would like to create. For\n", + "instance, to create a pdf, we use the pdf() function, and to create a jpeg,\n", + "we use the jpeg() function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pdf(\"Figure.pdf\")\n", + "plot(x,y,col=\"green\")\n", + "dev.off()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The function dev.off() indicates to R that we are done creating the plot.\n", + "Alternatively, we can simply copy the plot window and paste it into an\n", + "appropriate file type, such as a Word document." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The function seq() can be used to create a sequence of numbers. For\n", + "instance, seq(a,b) makes a vector of integers between a and b . There are\n", + "many other options: for instance, seq(0,1,length=10) makes a sequence of\n", + "10 numbers that are equally spaced between 0 and 1 . Typing 3:11 is a\n", + "shorthand for seq(3,11) for integer arguments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x=seq(1,10)\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x=1:10\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x=seq(-pi,pi,length=50)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will now create some more sophisticated plots. The contour() function \n", + "produces a contour plot in order to represent three-dimensional data;\n", + "it is like a topographical map. It takes three arguments:\n", + "1. A vector of the x values (the first dimension),\n", + "2. A vector of the y values (the second dimension), and\n", + "3. 
A matrix whose elements correspond to the z value (the third dimension) for each pair of ( x , y ) coordinates.\n", + "\n", + "As with the plot() function, there are many other inputs that can be used\n", + "to fine-tune the output of the contour() function. To learn more about\n", + "these, take a look at the help file by typing ?contour ." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y=x\n", + "f=outer(x,y,function(x,y)cos(y)/(1+x^2))\n", + "contour(x,y,f)\n", + "contour(x,y,f,nlevels=45,add=TRUE)\n", + "fa=(f-t(f))/2\n", + "contour(x,y,fa,nlevels=15)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The image() function works the same way as contour() , except that it\n", + "produces a color-coded plot whose colors depend on the z value. This is\n", + "known as a heatmap, and is sometimes used to plot temperature in weather\n", + "forecasts. Alternatively, persp() can be used to produce a three-dimensional\n", + "plot. The arguments theta and phi control the angles at which the plot is\n", + "viewed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "image(x,y,fa)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "persp(x,y,fa)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "persp(x,y,fa,theta=30)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "persp(x,y,fa,theta=30,phi=20)\n", + "persp(x,y,fa,theta=30,phi=70)\n", + "persp(x,y,fa,theta=30,phi=40)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For details on plotting it is useful to study the help pages."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "?plot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "?par" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Indexing Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We often wish to examine part of a set of data. Suppose that our data is\n", + "stored in the matrix A." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A=matrix(1:16,4,4)\n", + "A" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, typing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[2,3]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "will select the element corresponding to the second row and the third\n", + "column. The first number after the open-bracket symbol [ always refers to\n", + "the row, and the second number always refers to the column. We can also\n", + "select multiple rows and columns at a time, by providing vectors as the\n", + "indices." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[c(1,3),c(2,4)]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[1:3,2:4]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[1:2,]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[,1:2]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The last two examples include either no index for the columns or no index\n", + "for the rows. 
These indicate that R should include all columns or all rows,\n", + "respectively. R treats a single row or column of a matrix as a vector." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[1,]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The use of a negative sign - in the index tells R to keep all rows or columns\n", + "except those indicated in the index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[-c(1,3),]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "A[-c(1,3),-c(1,3,4)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The dim() function outputs the number of rows followed by the number of\n", + "columns of a given matrix." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dim(A)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data Frames" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Oftentimes it is useful to store data in data frames.\n", + "Think of a data frame as an Excel sheet with named columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = data.frame(x = rnorm(3), y = c(\"a\", \"b\", \"c\"), z = 1:3)\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To list the column names of a data frame we can use the names() function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "names(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To access a single column, say column 'x', we use the $ sign."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df$x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we want to make all columns available to the workspace, we can attach the data frame.\n", + "Before we do that, we cannot access a column, say 'z', directly:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "z" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "attach(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But after attaching, we can access 'z' directly. If you see a message saying that objects x and y are masked, this means that x and y already exist in the workspace. Attaching the data frame does not overwrite them: typing x or y still returns the workspace values, and the data frame columns remain available as df$x and df$y." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "z" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes you may not know the value of one measurement. Say you measured the height and the weight of different individuals, but you forgot to measure the weight of the third one. Then you can use the special value `NA` (not available) to indicate the missing value and your data frame could look like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = data.frame(name = c(\"John\", \"Mary\", \"Vincent\"), height = c(1.75, 1.73, 1.83), weight = c(84, 65, NA))\n", + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you now want to do some analysis with this data, but without the rows that contain missing values, you can use the function na.omit() to remove all rows that contain 'NA' entries."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = na.omit(data)\n", + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Data sets that ship with libraries often contain missing values. For example, loading the ISLR library also makes the Hitters data frame available." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "library(ISLR)\n", + "names(Hitters)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dim(Hitters)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that this data frame contains 322 rows and 20 columns, i.e. information about 322 baseball hitters.\n", + "If we now remove all the rows that contain missing values we are left with 263 rows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dim(na.omit(Hitters))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Writing Functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we have seen, R comes with many useful functions, and still more \n", + "functions are available by way of R libraries. However, we will often be \n", + "interested in performing an operation for which no function is available. In this\n", + "setting, we may want to write our own function. For instance, below we\n", + "provide a simple function that loads the ISLR and MASS libraries, called\n", + "LoadLibraries() . Before we have created the function, R returns an error if\n", + "we try to call it."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "LoadLibraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "LoadLibraries()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now create the function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "LoadLibraries=function(){\n", + " library(ISLR)\n", + " library(MASS)\n", + " print(\"The libraries have been loaded.\")\n", + " }" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "LoadLibraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "LoadLibraries()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above function does not have any arguments.\n", + "To write a function that accepts arguments we can do the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "f = function(x) {\n", + " x^2 + 3\n", + " }" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can call this function in two ways:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "f(4)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "f(x = 4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also provide default values to some of the function arguments." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "func = function(x = 7, y = 3, z = 1) {\n", + " x + y + z\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we have multiple possibilities to call the function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "func()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "func(y = -8)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "func(1, 2, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "func(z = 0, x = 2, y = 1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "3.6.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}