diff --git a/src/ensembles.ipynb b/src/ensembles.ipynb
new file mode 100644
index 0000000..a815e79
--- /dev/null
+++ b/src/ensembles.ipynb
@@ -0,0 +1,295 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Bagging, Random Forests and Boosting"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "New R commands:\n",
+    "- `randomForest(formula, data = , subset = , mtry = m, importance = TRUE)` fits a random forest, where `mtry` is the number of predictors considered for each split.\n",
+    "- `xgboost(data = x, label = y, nrounds = B, eta = λ, max_depth = d)` fits a boosted ensemble of B trees of depth d with learning rate λ."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Bagging and Random Forests"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we apply bagging and random forests to the Boston data, using the\n",
+    "randomForest package in R. Recall that bagging is simply a special case of\n",
+    "a random forest with m = p. Therefore, the randomForest() function can be\n",
+    "used to perform both random forests and bagging. We perform bagging as\n",
+    "follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "library(randomForest)\n",
+    "library(MASS)  # provides the Boston data set\n",
+    "set.seed(1)\n",
+    "# split the observations into a training half and a test half\n",
+    "train = sample(1:nrow(Boston), nrow(Boston)/2)\n",
+    "boston.test = Boston[-train, \"medv\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bag.boston = randomForest(medv ~ ., data = Boston, subset = train, mtry = 13, importance = TRUE)\n",
+    "bag.boston"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The argument mtry = 13 indicates that all 13 predictors should be considered\n",
+    "for each split of the tree; in other words, bagging should be done. How\n",
+    "well does this bagged model perform on the test set?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "yhat.bag = predict(bag.boston, newdata = Boston[-train, ])\n",
+    "plot(yhat.bag, boston.test)\n",
+    "abline(0, 1)  # perfect predictions would fall on this line\n",
+    "mean((yhat.bag - boston.test)^2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The test set MSE associated with the bagged regression tree is considerably\n",
+    "lower than that obtained using an optimally-pruned single tree. We could\n",
+    "change the number of trees grown by randomForest() using the ntree argument:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bag.boston = randomForest(medv ~ ., data = Boston, subset = train, mtry = 13, ntree = 25)\n",
+    "yhat.bag = predict(bag.boston, newdata = Boston[-train, ])\n",
+    "mean((yhat.bag - boston.test)^2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Growing a random forest proceeds in exactly the same way, except that we\n",
+    "use a smaller value of the mtry argument. By default, randomForest() uses\n",
+    "p/3 variables when building a random forest of regression trees, and √p\n",
+    "variables when building a random forest of classification trees. Here we\n",
+    "use mtry = 6."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "set.seed(1)\n",
+    "rf.boston = randomForest(medv ~ ., data = Boston, subset = train, mtry = 6, importance = TRUE)\n",
+    "yhat.rf = predict(rf.boston, newdata = Boston[-train, ])\n",
+    "mean((yhat.rf - boston.test)^2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The test set MSE is a further improvement over bagging."
+   ]
+  },
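+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sketch (our addition, not part of the original lab), we can refit\n",
+    "the forest for every value of mtry from 1 to 13 and compare the resulting\n",
+    "test MSEs; mtry = 13 again corresponds to bagging. The object names mtry.mse,\n",
+    "fit and pred below are our own."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# sketch: test MSE as a function of mtry (mtry = 13 is bagging)\n",
+    "mtry.mse = rep(0, 13)\n",
+    "for (m in 1:13) {\n",
+    "  set.seed(1)\n",
+    "  fit = randomForest(medv ~ ., data = Boston, subset = train, mtry = m)\n",
+    "  pred = predict(fit, newdata = Boston[-train, ])\n",
+    "  mtry.mse[m] = mean((pred - boston.test)^2)\n",
+    "}\n",
+    "plot(1:13, mtry.mse, type = \"b\", xlab = \"mtry\", ylab = \"Test MSE\")"
+   ]
+  },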
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using the importance() function, we can view the importance of each\n",
+    "variable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "importance(rf.boston)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Two measures of variable importance are reported. The first (%IncMSE) is\n",
+    "based upon the mean decrease of accuracy in predictions on the out-of-bag\n",
+    "samples when the values of a given variable are randomly permuted. The second\n",
+    "(IncNodePurity) is a measure of the total decrease in node impurity that\n",
+    "results from splits over that variable, averaged over all trees. In the\n",
+    "case of regression trees, the node impurity is measured by the training\n",
+    "RSS, and for classification trees by the deviance. Plots of these importance\n",
+    "measures can be produced using the varImpPlot() function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "varImpPlot(rf.boston)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The results indicate that across all of the trees considered in the random\n",
+    "forest, the wealth level of the community (lstat) and the house size (rm)\n",
+    "are by far the two most important variables."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Boosting"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since xgboost does not accept data frames, we first convert the data to\n",
+    "ordinary matrices, separating the predictors from the response medv."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "library(xgboost)\n",
+    "boston.train.x = as.matrix(Boston[train, names(Boston) != \"medv\"])\n",
+    "boston.test.x = as.matrix(Boston[-train, names(Boston) != \"medv\"])\n",
+    "boston.train.y = Boston[train, \"medv\"]\n",
+    "boston.test.y = Boston[-train, \"medv\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we train with the standard regression loss \"reg:squarederror\". For other\n",
+    "objective functions see\n",
+    "[https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "boost.boston = xgboost(data = boston.train.x, label = boston.train.y,\n",
+    "                       eta = 0.001, objective = \"reg:squarederror\",\n",
+    "                       max_depth = 4,\n",
+    "                       nrounds = 5000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "yhat.boost = predict(boost.boston, boston.test.x)\n",
+    "plot(yhat.boost, boston.test.y)\n",
+    "abline(0, 1)\n",
+    "mean((yhat.boost - boston.test.y)^2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The result is comparable to the one obtained with random forests. With such a\n",
+    "small learning rate, 5000 rounds may not be enough for the fit to converge,\n",
+    "so we try a higher learning rate:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "boost.boston = xgboost(data = boston.train.x, label = boston.train.y,\n",
+    "                       eta = 0.02, objective = \"reg:squarederror\",\n",
+    "                       max_depth = 4,\n",
+    "                       nrounds = 5000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "yhat.boost = predict(boost.boston, boston.test.x)\n",
+    "mean((yhat.boost - boston.test.y)^2)"
+   ]
+  },
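+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Instead of fixing the number of rounds in advance, we could let\n",
+    "cross-validation choose it. The sketch below (our addition, not part of the\n",
+    "original lab) uses xgb.cv() with early stopping, halting once the held-out\n",
+    "RMSE has not improved for 50 rounds; the object name cv.boston is our own."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# sketch: pick the number of rounds by 5-fold CV with early stopping\n",
+    "cv.boston = xgb.cv(data = boston.train.x, label = boston.train.y,\n",
+    "                   eta = 0.02, objective = \"reg:squarederror\",\n",
+    "                   max_depth = 4, nrounds = 5000, nfold = 5,\n",
+    "                   early_stopping_rounds = 50, verbose = 0)\n",
+    "cv.boston$best_iteration"
+   ]
+  },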
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "xgboost also provides variable importance measures, which we can plot:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "xgb.plot.importance(xgb.importance(model = boost.boston))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R",
+   "language": "R",
+   "name": "ir"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r",
+   "version": "3.6.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}