{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " CS411 - Autumn 2019-2020
\n", " Patrick Jermann

\n", " How to use this notebook?\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The MOOC dataset\n", "The data we use in this document to demonstrate the use of basic statistics is about students' performance in academic exams (e.g. the algebra course taken at the university) and their activitiy online in a related Massive Open Online Course (e.g. the algebra MOOC). The general question we try to answer is: what is the relation between the use of the MOOC by students and their academic achievement. \n", "\n", "**Question**: Does the use of MOOCs help succeed in courses ? Do only good students use the MOOCs ? Are MOOCs more helpful for better students ?\n", "\n", "As a general remark, we should note that because we did not conduct a controlled experiment with random assignment of subjects to experimental groups, we cannot establish any causal relationships between MOOC use and academic performance. Rather, we use statistical tools to \"explore\" data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading libraries\n", "Some extra libraries are needed to perform some operations. We collect the commands to load the libraries in this cell. Executing this cell first ensures that all the necessary libraries are loaded. If one library is missing it can be installed via the install.packages() command. Simply add the line to the cell and re-execute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment and run once if these packages are not installed yet.\n", "# install.packages(\"gdata\")\n", "# install.packages(\"gplots\")\n", "# install.packages(\"plyr\")\n", "# install.packages(\"dplyr\")\n", "# install.packages(\"ggplot2\")\n", "# install.packages(\"ggpubr\")\n", "# install.packages(\"car\")\n", "\n", "library(gdata) # reorder\n", "library(gplots) # plotmeans\n", "library(plyr) # ddply\n", "library(dplyr) # summarySE\n", "library(ggplot2) # plotting library\n", "library(ggpubr) # plotting\n", "library(car) # Anova\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Loading Data\n", "Let's load data. The first step is to read a CSV file into a dataframe, which is the data structure used by R to store data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# TODO: Adapt the path to your own installation.\n", "# Set the working directory\n", "setwd(\"./\")\n", "\n", "# Read a csv file (header = T signals that the first line of the file contains the names of the variables).\n", "# The <- operator assigns the result of read.csv to the variable moocs.\n", "# moocs is a dataframe, similar to an excel worksheet or a DataFrame in pandas.\n", "moocs <- read.csv(file=\"moocs.basic.csv\", header = T)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our dataset contains variables which name starts with *EPFL* and other variables starting with *MOOC*. The variables with EPFL come from the academic database and describe student's performance in their EPFL exams. The variables starting with MOOC conern their online activity in the MOOC." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List the names of the variables\n", "names(moocs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Descriptive Statistics\n", "\n", "## Summary for variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For each variable, print some information about the distribution (if the variable is quantitative) \n", "# or about the distinct values taken by the variable (if the variable is categorical). \n", "\n", "summary(moocs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To get information about only one variable, use the $ sign to separate the name of \n", "# the dataframe and the name of the variable like so: $\n", "\n", "# For nominal variables we get the count for the different categories \n", "summary(moocs$EPFL_StudentSection)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For a quantitative variable we get the ususal central tendency and dispersion indicators\n", "summary(moocs$MOOC_PercentageVideosWatched)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can also compute these separately.\n", "mean(moocs$MOOC_PercentageVideosWatched, na.rm=T) # na.rm removes missing values\n", "sd(moocs$MOOC_PercentageVideosWatched, na.rm=T) # standard deviation\n", "median(moocs$MOOC_PercentageVideosWatched, na.rm=T) # median (50% observations below and 50% observations above)\n", "max(moocs$MOOC_PercentageVideosWatched, na.rm=T) # maximum\n", "min(moocs$MOOC_PercentageVideosWatched, na.rm=T) # minimum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " TODO : What is the mean and the standard deviation of the EPFL_CourseGrade ?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "################\n", "# Begin solution\n", "\n", "\n", "# End solution\n", "###############" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining new categorical variables\n", "It is sometimes useful to transform a quantitative variable into a categorical variable. \n", "\n", "For example, below we create a variable that reflects whether students used the MOOC or not based on the number of videos they watched and the number of problems they attempted in the MOOC. In R, categorical variables are created with the function `factor`. The variable `didMOOC` is given the value `NO.MOOC` whenever the values for the number of videos watched and the number of problems attempted is missin. In other cases, the value is `MOOC`. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "moocs$didMOOC = factor(\n", " ifelse(\n", " is.na(moocs$MOOC_NVideosWatched) & \n", " is.na(moocs$MOOC_NProblemsSolved), \n", " \"NO.MOOC\", \n", " \"MOOC\")\n", ")\n", "\n", "summary(moocs$didMOOC)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We consider that not taking the MOOC is like not watching videos\n", "moocs$watched_videos = factor(\n", " ifelse(\n", " moocs$MOOC_PercentageVideosWatched == 0 |\n", " is.na(moocs$MOOC_PercentageVideosWatched),\n", " \"NO.VIDEO\",\n", " \"SOME.VIDEO\")\n", " )\n", "\n", "summary(moocs$watched_videos)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We consider that not taking the MOOC is like not solving any problems\n", "moocs$solved_problems = factor(\n", " ifelse(\n", " moocs$MOOC_PercentageProblemsSolved == 0 |\n", " is.na(moocs$MOOC_PercentageProblemsSolved),\n", " \"NO.PROBLEM\",\n", " \"SOME.PROBLEM\")\n", " )\n", "\n", "summary(moocs$solved_problems)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking activity types\n", "Do people who solve problems also watch videos ? When using a MOOC, some students only watch videos and don't do exercices. But most of the students who do exercices also watch videos. It appears that almost nobody is *only* solving problems. \n", "> **Problem**: If we use these variables as two factors to analyse our data set we'll have very uneven groups. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# With two arguments, the table command cross-tabulates the number of observations per category.\n", "table(moocs$watched_videos, moocs$solved_problems)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " TODO : To fix this problem we qualify MOOC usage with three categories: \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `NONE` corresponds to no use of MOOCs, \n", "* `WATCH` corresponds to students mainly watching videos and \n", "* `DO` corresponds to students doing some exercices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "#################\n", "# Begin Solution:\n", "\n", "# End Solution \n", "###############\n", "\n", "summary(moocs$MOOC)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The order of the categories is a bit 'unlogical'. We'd like to have them appear in outputs and graphs sorted by increasing level of involvement: NONE, WATCH and DO. Let's fix this problem by reordering the levels of the factor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "library(gdata)\n", "# put the levels into a convenient order (for graphing, etc.)\n", "moocs$MOOC <- reorder(moocs$MOOC, new.order=c(\"NONE\", \"WATCH\", \"DO\"))\n", "table(moocs$MOOC)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Do people who use the MOOC get better grades ?\n", "\n", "## Looking at the means.\n", "Let's start by looking at the mean grade for the three groups of our variable `MOOC`. The `ddply` function allows to apply a function to a subset of the data defined by a categorical variable (`MOOC` in our case). The ddply returns a new data frame with the category and the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "mu <- ddply(moocs, \"MOOC\", summarise, grp.mean=mean(EPFL_CourseGrade))\n", "mu\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at the variance ?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "sigma <- ddply(moocs, \"MOOC\", summarise, grp.sd=sd(EPFL_CourseGrade))\n", "#head(sigma)\n", "\n", "info <- merge(sigma, mu)\n", "info\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting\n", "To further investigate the difference between the ghree groups, let's plot the grade of the students `EPFL_CourseGrade` against the categorical variable `MOOC` that we just defined. We'll use three different plots: \n", "\n", "* boxplot, \n", "* histogram and \n", "* mean plots.\n", "\n", "### Boxplots\n", "\n", "Boxplots allow to visualise the distribution of the variable. The central line in the box shows the median and the edges of the boxes show the interquartiles. A simple tutorial for variations of this type of plot is available here: http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization.\n", "\n", "From the plot below, it seems that the distribution of grades for people who did not use the MOOC (`NONE`) tends to be lower in the grade scale than the grades of people who `WATCH` videos or people who `DO` assignments. From the shape of the boxesm we also see that the distributions for `NONE` and `WATCH` are more symmetric than for `DO`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Setting the size of plots generated with ggplot below. \n", "options(repr.plot.width = 5, repr.plot.height = 5)\n", "\n", "# Quick and simple boxplot\n", "boxplot(moocs$EPFL_CourseGrade ~ moocs$MOOC, main=\"Grade by Activity Level\", xlab=\"Activity Level\", ylab=\"Grade\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Setting the size of plots generated with ggplot below. \n", "options(repr.plot.width = 8, repr.plot.height = 5)\n", "\n", "# More sophisticated plots with GGPLOT.\n", "# Tutorial: http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization\n", "ggplot(moocs, aes(x=MOOC, y=EPFL_CourseGrade, fill=MOOC)) +\n", " geom_boxplot(notch=TRUE) + coord_flip() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histograms\n", "\n", "Histograms give a more precise representation of the actual grade distribution. Below, we define bins of 0.5 to represent the frequency of the different grades. The vertical dashed lines represent the means of the distributions. Notice that we reuse the dataframe `mu` that we defined above. We see indeed that the `DO` group is less symmetric than the others." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Tutorial about generating such a figure\n", "# http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization\n", "options(repr.plot.width = 8, repr.plot.height = 5)\n", "\n", "ggplot(moocs, aes(x=EPFL_CourseGrade, fill=MOOC)) +\n", " geom_histogram(binwidth=.5, alpha=.5, position=\"dodge\") + \n", " geom_vline(data=mu, aes(xintercept=grp.mean, color=MOOC), linetype=\"dashed\") \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mean plots\n", "Mean plots represent the mean along with the confidence interval for the mean. There is a 95% chance that the confidence interval contains the true mean of `EPFL_CourseGrade` for the population. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "options(repr.plot.width = 5, repr.plot.height = 5)\n", "\n", "# plotmeans is defined in the gplots library \n", "# make sure to load the library by executing > library(gplots) before. \n", "plotmeans(moocs$EPFL_CourseGrade ~ moocs$MOOC, main=\"Grade given MOOC activity\", ylab=\"Grade\", xlab=\"Activity Level\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "options(repr.plot.width = 5, repr.plot.height = 5)\n", "\n", "# Alternate form using the package ggpubr\n", "# See examples here: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/79-plot-meansmedians-and-error-bars/\n", "\n", "ggline(moocs, x = \"MOOC\", y = \"EPFL_CourseGrade\", \n", " add = c(\"mean_ci\"))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ANOVA - Analysis of Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Does the grade of students (EPFL_CourseGrade) depend on students' type of MOOC interaction (MOOC, a categorical variable).\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Test: ANOVA a.k.a. ANalysis Of VAriance \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANOVA allows comparing whether one variable has the same mean in different groups ? The formula to specify the test is done in tow steps. First we define a linear model with `m <- lm(quantitative ~  categorical)` and then we do the Anova of the model `Anova(m)`.\n", "\n", "* H0: mean_1 = mean_2 = ...= Mean_n\n", "* H1: mean_1 != mean_2 != ... != mean_n\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the Analysis of Variance (ANOVA) to compare the mean of a variable across several modalities of a categorical variable. In the example below, we compare the course grade (`EPFL_CourseGrade`) across different modalities of MOOC usage (the `MOOC` variable takes three possible values: `NONE`, `WATCH` and `DO`). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# needs library(car)\n", "\n", "m <- lm(moocs$EPFL_CourseGrade ~ moocs$MOOC) \n", "Anova(m)\n", "\n", "r <- residuals(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Reporting\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This is the way you report the ANOVA in an article.\n", "\n", "> We conducted a one-way ANOVA to test whether the course grades differ for different categories of MOOC usage. There is a significant difference of course grades across different types of MOOC use (F[2,9311]=291.6, p<.001). \n", "\n", "Beware, that since we did not randomly assign subjects to experimental groups, but rather students chose whether they use the MOOC or not, we can't conclude about any causal effect. It is possible that better students are inclided to use MOOCs more than weaker students. Hence, the \"effect of\" the MOOC would be due to this hidden variable which actually explains the difference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Advanced topic : There are different ways to run anovas in R. They differ with regards to the type of Sum of Square that they report. \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following http://goanna.cs.rmit.edu.au/~fscholer/anova.php the Type III Sum of Square is more adapted to interpret the contributions of the differnt variables. R and anova report Type I sums of square.\n", "\n", "Here is an explanation about how to obtain Type III sums of squares: https://www.r-bloggers.com/anova-%E2%80%93-type-iiiiii-ss-explained/\n", "And some explanations here: http://prometheus.scp.rochester.edu/zlab/sites/default/files/InteractionsAndTypesOfSS.pdf\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Which groups differ among each other ? Now that the ANOVA tells us that the three types of MOOC useage do not lead to the same grade, we want to know more precisely which categories are different from others.\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Test: Pair-wise t-test\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A pairwise test allows to know which groups differ from each other. Because of repeated testing, we have to adjust the p-value by dividing the .05 value by the number of tests. In our case, the tests show that all groups differ from each other when taken 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Checks which group is different from what other group in doing 2 by 2 tests.\n", "pairwise.t.test(moocs$EPFL_CourseGrade, moocs$MOOC, p.adj = \"bonf\") \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "cat(\"Means\")\n", "tapply(moocs$EPFL_CourseGrade, moocs$MOOC, mean)\n", "\n", "cat(\"Standard Deviations\")\n", "tapply(moocs$EPFL_CourseGrade, moocs$MOOC, sd)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Reporting\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here is the wording for the article. \n", "\n", "> A series of pair-wise t-tests with bonferonni adjusment for repeated testing, shows that the average grade for students who did not use the MOOC (m=3.5, sd=2.5) is smaller than the average grade of students who watched videos (m=3.8, sd=1.3) and smaller than the average grade of students who did the assignments (m=4.4, sd=1.2)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assumptions\n", "Is it ok to run an ANOVA blindly as we did just before ? \n", "\n", "Not really, we should have checked the assumptions for ANOVA first.\n", "\n", "* Normality of the residuals\n", "* Homoscedasticity of the residuals = the error distribution should be the same regardless of the category.\n", "\n", "### Visually inspecting normality Assumption\n", "Testing normality visually with quantile plots. If the variable is normally distributed, the observations would follow the line drawn by qqline. We see from the plots below that the distribution of the residuals of our model used for the Anova departs very strongly from a normal distribution.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# split the plotting area into 2 lines and 2 columns\n", "par(mfrow=c(2,2))\n", "\n", "# Setting the size of plots below. \n", "options(repr.plot.width = 7.5, repr.plot.height = 7.5)\n", "\n", "# we generate 1000 data points from a normally distributed variable (by default average = 0 and standard deviation = 1)\n", "v <- rnorm(1000) \n", "hist(v, main=\"Normal variable (generated)\")\n", "qqnorm(v, main=\"Normal variable (generated)\");\n", "qqline(v);\n", "\n", "# The residuals of the model we used to run the Anova\n", "# r <- residuals(m)\n", "hist(r, main=\"Residuals from ANOVA\")\n", "qqnorm(r, main=\"Residuals from ANOVA\");\n", "qqline(r)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Are the residuals normally distributed ?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Test: Shapiro-Wilks Test\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Normality can also be tested statistically, however the normality tests are rather sensitive to small deviations, especially when sample sizes are large. For example, the Shapiro-Wilks test, is testing the null hypothesis that the variable is normally distributed. In this case the adequate p.value is .1. Hence, with p <<.1 we have to reject the null hypothesis and conclude that the variable is not normally distributed. This test work with samples smaller than 5000 observations. \n", "\n", "* H0: variable is normally distributed\n", "* H1: variable is not normally distributed\n", "\n", "We see below that for our variable `v` that we generated to be normally distributed, p = 0.47, far greater than 0.1, hence we cannot reject H0.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "shapiro.test(v)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "However, for the residuals of our Anova, we have to reject H0, and conclude that they are not normally distributed.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# taking a sample of 1500 residuals for illustrating the output of the shapiro test\n", "shapiro.test(sample(r,1500)) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Are the residuals homoscedastic ? In other terms, is the variance of the residuals the same for the different groups ?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Test: Bartlett test\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* H0: variances are the same\n", "* H1: variances are different\n", "\n", "We see below, that the p-value is much smaller than .05 and therefore we reject H0 (and conclude that variances are different). \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "bartlett.test(r ~ moocs$MOOC)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "The difference in variuance is also appararent in a boxplot, where we see that the variance for the DO group is smaller. We also see that the residuals for WATCH tend to be slightly more positive than negative (their median is above 0) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "options(repr.plot.width = 5, repr.plot.height = 5)\n", "boxplot(r ~ moocs$MOOC)\n", "abline(h=0)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## If variances are unequal or residuals are not normally distributed\n", "\n", "If variances are unequal we can either \n", "\n", "* adapt the oneway test by stating that `var.equal = F` \n", "* or use a non parametric alternative to ANOVA like the Kruskal Wallis test. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# adapt the caluclations to take unequal variances into account \n", "oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, var.equal = F)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Are samples from the different groups stemming from the same distribution ?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Test: Kruskal test, a non-parametric alternative to ANOVA which does not pressupose normally distributed residuals.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tests the null hypothesis that the location parameters are the same in each of the groups. This test does not make assumptions about the distribution of the residuals, nor about the variances.\n", "\n", "* H0: the mean ranks of observations are the same in all groups\n", "* H1: the mean ranks of observations are different." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "kruskal.test(moocs$EPFL_CourseGrade ~ moocs$MOOC)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Reporting\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Because the assumption of normality of residuals was not respected (W = 0.94655, p-value < .001) in our sample, we conducted a non-parametric Kruskal-Wallis test that showed that the mean rank of course grades differs significantly across groups (X2[2]=608.7, p < .001). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise : analyse the effect of prior knowlege on the grades.\n", "Students' baccalaureat grade is given by the variable `EPFL_BacGrade`. In the plot below we see the distribution of the grades at the baccalaureat and display the median and quartiles as vertical bars.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width = 6, repr.plot.height = 4)\n", "\n", "# Remove observations for which we do not have a EPFL_BacGrade\n", "moocs.bac <- subset(moocs, ! is.na(moocs$EPFL_BacGrade))\n", "\n", "# Draw a histogram with 50 bins\n", "hist(moocs.bac$EPFL_BacGrade, breaks=50)\n", "\n", "# Draw a vertical line at the median\n", "abline(v=median(moocs.bac$EPFL_BacGrade, na.rm=T), col=\"red\", lwd=3) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 1 : Define two categorical variables named prior and prior4 that distinguish students given their baccalaureat grade. \n", "
    \n", "
  • prior : The values of the variable are HI for strong students and LO for weakers students. use median() to define high and low levels of ability
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Define two categorical variables named moocs$prior and moocs$prior4 that distinguishstudents given at their baccalaureat grade.\n", "\n", "# 1) moocs$prior. \n", "# The values of the variable are \"HI\"\" for strong students and \"LO\"\" for weakers students.\n", "# => use median() to define high and low levels of ability\n", "\n", "################\n", "# Begin solution\n", "\n", "# end solution\n", "##############\n", "\n", "\n", "# count the number of observations in LO and HI \n", "table(moocs.bac$prior)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 2 : Plot the means for the EPFL_CourseGrade given the prior grade level. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "################\n", "# Begin Solution\n", "\n", "\n", "\n", "# end Solution\n", "################\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 3 : Compare the means of EPFL_CourseGrade. \n", "
    \n", "
  • Check assumptions to run an ANOVA
  • \n", "
  • Run an ANOVA or a replacement if the assumptions are not set
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "################\n", "# Begin Solution\n", "\n", "\n", "\n", "# end Solution\n", "################\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Two factor ANOVA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Do different types of MOOC usage affect grades differently ? Is this effect the same for strong and weak students ? \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can add more than one categorical predictor and also model the interaction of two factors. When there is an interaction, the effect of one variable is conditional on the effect of another variable \n", "\n", "Interaction between two terms is represented by a `:`. A syntaxic equivalent for the model that contains single and interaction effects is `A * B == A + B + A:B`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "m3 <- lm(moocs.bac$EPFL_CourseGrade ~ moocs.bac$MOOC + moocs.bac$prior + moocs.bac$MOOC:moocs.bac$prior)\n", "print(Anova(m3, type=3))\n", "\n", "r3 <- residuals(m3)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Reporting\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> We conducted a 3 x 2 factorial ANOVA with prior performance and MOOC useage as factors. The results indicate a main effect of MOOC usage (F[2,374]=102.40, p<.001), as well as a main effect of prior knowledge (F[1,1179]=645.78, p<.001). There is also an interaction effect between the two factors (F[20,2]=5.3, p<.01). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting\n", "\n", "The ggline package comes in handy to plot interactions between variables. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Reporting\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> In our case here, we see that the relation between the different types of MOOC activity is not the same from strong or weaker students. It seems that for strong students, watching videos does not increase the score compared to doing nothing. However, doing exercices is beneficial for both strong and weak students. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggline(moocs.bac, x = \"MOOC\", y = \"EPFL_CourseGrade\", \n", " add = c(\"mean_ci\"),\n", " color = \"prior\", palette = \"jco\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing Assumptions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Advanced topic : With relatively large samples, normality and variance tests tend to produce small p-values and force us to reject the null hypotheses for normality and homoscedasticity. Instead of systematically (and somewhat blindly) using normality and homoscedasticity tests, we suggest that you explore visually the distribution of residuals to judge whether there are strong deviations from normality and whether there are large variance disparities.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width = 8, repr.plot.height = 4)\n", "par(mfrow=c(1,3))\n", "qqnorm(r3)\n", "qqline(r3)\n", "\n", "boxplot(r3 ~ moocs.bac$MOOC)\n", "boxplot(r3 ~ moocs.bac$prior)\n", "\n", "bartlett.test(r3 ~ moocs.bac$MOOC)\n", "bartlett.test(r3 ~ moocs.bac$prior)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercice " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Do different usage types of MOOCs affect grades ? Are MOOCs more beneficial for repeating students ?\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We define a new categorical variable called \"repeating\" that reflects whether a student is repeating the year. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "moocs.bac$repeating <- factor(ifelse(moocs.bac$EPFL_IsRepeating,\"REPEAT\",\"NO.REPEAT\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 1 : Plot the means for the EPFL_CourseGrade given the repeating variable and the type of usage. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##########\n", "# Begin solution\n", "\n", "\n", "\n", "# End solution\n", "############" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 2 : Do a two factor ANOVA to test for the three effects.\n", "
    \n", "
  • What are the three effects ?
  • \n", "
  • Interpret them
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##########\n", "# Begin solution\n", "\n", "\n", "# End solution\n", "##########\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 3 : Check the quality of the model (the assumptions)\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##########\n", "# Begin solution\n", "\n", "\n", "# End solution\n", "##########" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear regression\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: In our example data set, we want to know whether the EPFL course grade can be predicted by the percentage of videos that were watched by students.\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Linear Model: The null model\n", "
The resulting model indicates that the average EPFL_Course_Grade is 3.74. \n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Test: t-test on Estimate\n", "
The resulting model indicates that the average EPFL_Course_Grade is 3.74. The t-test in addition tests whether this value can be considered larger than zero. From observing the statistics reported for the residuals, it appears that the median is larger than zerp (we would expext a median of zero if the residuals were normally distributed and if there were as many positive as negative error terms). \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lm estimates the parameters of y = ax + b so as to minimise the Error. \n", "# In this model, we use a = 0 and estimate only b (represented as the constant term ~1)\n", "m0 <- lm(EPFL_CourseGrade ~ 1, data=moocs.bac)\n", "\n", "summary(m0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Linear Model: The model with one predictor\n", "
The resulting model indicates that the average EPFL_Course_Grade is 3.62 + 0,01 for each percentage of videos watched. \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tests: Again, the t-tests check whether the Estimates can be considered larger than zero. Additionally, a F-statistic is computed to compare the new model with the null model. This test checker whether we have reduced the residuals, and augmented the variance explained by the model by adding a term (in our case MOOC_PercentageVideosWatched. This indicates that this model m1 is better (accounts for more variance of the response variable) than model m0. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lm estimates the parameters of y = ax + b so as to minimise the Error. \n", "m1 <- lm(EPFL_CourseGrade ~ MOOC_PercentageVideosWatched, data=moocs.bac)\n", "\n", "# The adjusted R^2 is interpreted as the proportion of variance explained.# \n", "# The F-statistic indicates whereh this model is better than the null model that would include only one parameter (the mean).\n", "summary(m1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Is the model satisfying the assumptions for regression ? \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assumptions for linear regression :\n", " * Linearity: The relationship between X and the mean of Y is linear.\n", " * Homoscedasticity: The variance of residual is the same for any value of X.\n", " * Independence: Observations are independent of each other.\n", " * Normality: For any fixed value of X, Y, and the residuals are normally distributed.\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Linear Model: Checking linearity\n", "
A simple way to check whether the relation between the dependent variable and predictors is linear consists of plotting one against he other. \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the plot below, we see there is clearly a problem. because the cloud of points does not seem to be well summarised by a line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width = 5, repr.plot.height = 5)\n", "\n", "# Simple plot with basic R plotting. Use the command plot(y ~ x). \n", "# pch means plotting character, 20 is a small dot, 1 is a circle, etc.\n", "# jitter allows to add a little bit of noise so that the points do not overlap in the plot.\n", "\n", "plot(jitter(EPFL_CourseGrade) ~ jitter(MOOC_PercentageVideosWatched), pch = 20, data=moocs.bac)\n", "\n", "# also plot the regression line that corresponds to model m0\n", "abline(m0, col=\"blue\", lwd=3)\n", "\n", "# also plot the regression line that corresponds to model m1\n", "abline(m1, col=\"red\", lwd=3)\n", "\n", "# See https://rpkgs.datanovia.com/ggpubr/reference/ggscatter.html for more sophisticated plotting.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Linear Model: Checking normality and homoscedasticity of the residuals.\n", "
A simple way to check these assumptions consists of plotting a QQ plot for residuals and investigating whether there is a \"trend\" in th residuals with th fitted values. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width = 6, repr.plot.height = 6)\n", "par(mfrow=c(2,2))\n", "\n", "plot(m1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part of our trouble comes from thre asymetric distribution of the variable MOOC_PercentageVideosWatched" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "par(mfrow=c(1,2))\n", "options(repr.plot.width = 8, repr.plot.height = 4)\n", "hist(moocs.bac$MOOC_PercentageVideosWatched, main=\"Percentage Videos Watched\", xlab=\"Percentage\")\n", "hist(moocs.bac$EPFL_CourseGrade, main=\"Course Grade\", xlab=\"Grade\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Anscombe's dataset\n", "Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi: 10.2307/2682899." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# library(datasets)\n", "anscombe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# x and y are correlated with r = 0.81\n", "cor.test(anscombe$x1, anscombe$y1)\n", "cor.test(anscombe$x2, anscombe$y2)\n", "cor.test(anscombe$x3, anscombe$y3)\n", "cor.test(anscombe$x4, anscombe$y4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the parameters for a linear model y~x are the same\n", "m1 <- lm(anscombe$y1 ~ anscombe$x1)\n", "m1\n", "\n", "m2 <- lm(anscombe$y2 ~ anscombe$x2)\n", "m2\n", "\n", "m3 <- lm(anscombe$y3 ~ anscombe$x3)\n", "m3\n", "\n", "m4 <- lm(anscombe$y4 ~ anscombe$x4)\n", "m4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the relation between y and x for the four datasets (y1~x1; y2~x2; y3~x3 and y4~x4) \n", "# What do you observe ?\n", "options(repr.plot.width = 7, repr.plot.height = 7)\n", "par(mfrow=c(2,2))\n", "\n", "plot(anscombe$y1~anscombe$x1, pch=19)\n", "abline(m1)\n", "\n", "plot(anscombe$y2~anscombe$x2, pch=19)\n", "abline(m2)\n", "\n", "plot(anscombe$y3~anscombe$x3, pch=19)\n", "abline(m3)\n", "\n", "plot(anscombe$y4~anscombe$x4, pch=19)\n", "abline(m4)\n", "par(mfrow=c(1,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Homoscedasticity" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# linearity (there should not be huge residuals) \n", "# and homoscedasticity (there should not be a trend for residuals as we increase the fitted values)\n", "par(mfrow=c(2,2))\n", "plot(m1, which=1, main=\"Model 1\") # \n", "plot(m2, which=1, main=\"Model 2\") # Looks like there is a pattern\n", "plot(m3, which=1, main=\"Model 3\") # There is a trend downwards, the error becomes larger and more negative as we predict larger values. This means we surestimate more and more the predicted values. \n", "plot(m4, which=1, main=\"Model 4\")\n", "par(mfrow=c(1,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normality" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assumption: normality of residuals. \n", "par(mfrow=c(2,2))\n", "plot(m1, which=2, main=\"Model 1\") # \n", "plot(m2, which=2, main=\"Model 2\") # Looks like there is a pattern\n", "plot(m3, which=2, main=\"Model 3\") # There is an outlier (observation 3) \n", "plot(m4, which=2, main=\"Model 4\")\n", "par(mfrow=c(1,1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fitted agains standardised residuals. there shoudl be no trend in the data \n", "par(mfrow=c(2,2))\n", "plot(m1, which=3, main=\"Model 1\") # \n", "plot(m2, which=3, main=\"Model 2\") # Looks like there is a pattern\n", "plot(m3, which=3, main=\"Model 3\") # There is an outlier (observation 3) \n", "plot(m4, which=3, main=\"Model 4\")\n", "par(mfrow=c(1,1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cook's distance. Measures the influence of one particular observation on the parameters of the regression. Values close and above 1 have a too large influence and should be observed. \n", "par(mfrow=c(2,2))\n", "plot(m1, which=4, main=\"Model 1\") # \n", "plot(m2, which=4, main=\"Model 2\") # Looks like observations 6 and 8 are quite large\n", "plot(m3, which=4, main=\"Model 3\") # There is an outlier (observation 3) \n", "plot(m4, which=4, main=\"Model 4\")\n", "par(mfrow=c(1,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise : analyse the effect of prior knowlege on the grades using a linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 1 : Build a regression model that predicts the results of students in the EPFL course (EPFL_CourseGrade) with their score in highschool (EPFL_BacGrade).\n", "
    \n", "
  • Is the high school grade contributing to better course grades ?
  • \n", "
  • What can you say about the the intercept, and how can you interpret the value of the Estimates ?
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "################\n", "# Begin Solution\n", "\n", "\n", "# End Solution\n", "################\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 2 : Check normality and homoscedasticity assumptions\n", "
    \n", "
  • What can you say about the normality of the residuals ?
  • \n", "
  • What can you say about the variance of the error ?
  • \n", "
  • Are there outliers ?
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "################\n", "# Begin Solution\n", "\n", "\n", "\n", "# End Solution\n", "################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 3 : Check assumptions\n", "
    \n", "
  • Are there outliers ? What observations could we get rid of to get a better model ?
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "################\n", "# Begin Solution\n", "\n", "\n", "# End Solution\n", "################\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Task 4 : Check linear dependence\n", "
    \n", "
  • Plot the variables against each other
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "################\n", "# Begin Solution\n", "\n", "\n", "# End Solution\n", "################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question: Do people with higher prior ability (the prior variable) use the MOOC differently (the MOOC variable) ? This question asks about the relationship of two categorical variables (prior and MOOC both have a set of distinct nominal values). \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", " Test: Chi-square test\n", "
\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The chi-square tests whether the observations in two categorical variables can be considered independent. It does this by comparing the observed number of cases and the expected number of cases under the assumption of independence. \n", "\n", "* H0: the variables are independent.\n", "* H1: the variables are not independent.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tt <- table( moocs.bac$MOOC, moocs.bac$prior)\n", "print(tt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tt.chisq <- chisq.test(tt)\n", "print(tt.chisq)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(tt.chisq$obs)\n", "print(tt.chisq$exp)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pearson residuals larger than 1.96 are significantly different from 0 with p<.05\n", "# the Pearson residuals, (observed - expected) / sqrt(expected).\n", "print(tt.chisq$res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute residual for cell 1,1\n", "(3235-3030.8894) / sqrt(3030.8894)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Reporting\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> A test of independence shows that the students with high prior ability do not have the same pattern of MOOC useage compared to the students of low ability ($\\chi^2$[2]=116.57, p<.001). A closer inspection of the Pearson residuals shows that low ability students are more likely to not use the MOOC and that high ability students are more likey to do the exercices. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Setting the size of plots generated with ggplot below. \n", "options(repr.plot.width = 5, repr.plot.height = 5)\n", "\n", "mosaicplot(tt, shade = TRUE, las=2, main = \"MOOC Usage and Prior Ability\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names(moocs.bac)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }