diff --git a/DataAnalysis.ipynb b/DataAnalysis.ipynb index 516ece0..173aedf 100644 --- a/DataAnalysis.ipynb +++ b/DataAnalysis.ipynb @@ -1,5637 +1,5679 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Teaching and Learning differently with Jupyter Notebooks - HEG, December 12, 2019
\n", " C. Hardebolle, CC BY-NC-SA 4.0 Int.

\n", " How to use this notebook?
\n", " This notebook is made of text cells and code cells. The code cells have to be executed to see the result of the program. To execute a cell, simply select it and click on the \"play\" button in the toolbar just above the notebook, or press Shift + Enter. It is important to execute the code cells in their order of appearance in the notebook.\n", "
" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "### TODO\n", "\n", "Work on the text\n", "Integrate exercises\n", "Fake some more data?\n", "\n", "DONE (keep ?) Test normality and homoscedasticity\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How can we know if a teaching method produces the impact we hope for?\n", "\n", "Let's imagine that you teach a course for the users of your library and this year you are trying a new teaching method. \n", "**How can you know whether this teaching method is actually \"better\" than the previous one?** \n", "One solution is to collect data from your class and analyze it to determine how the method is influencing the way your students learn in the course.\n", "\n", "In this notebook, you are going to analyze data which has been collected in the course you are actually attending, \"Formation des usagers en bibliothèques (765-21n)\". \n", "\n", "## Context\n", "\n", "As you know from being a student in this course, one of the requirements to pass the course is to submit a number of assignments on which you work individually. \n", "These assignments get validated, and the **success rate** of each student is the number of accepted assignments divided by the total number of assignments that student submitted. \n", "Your teacher collects the success rate of students attending the course each year for normal grading purposes.\n", "\n", "## Question\n", "\n", "Let's imagine that one year, your teacher modifies the course with the goal of helping students submit assignments of better quality. \n", "One way to know whether students' assignments are really of better quality is to look at students' success rate, which should be higher after the modification. 
\n", "\n", "The activities in this notebook will guide you in the *detective work* of analyzing the data from the course to answer this question: \n", "**How does the success rate of students after the modification (year 2) compare to the success rate of students the previous year (year 1)?** \n", "\n", "\n", "## Learning goals\n", "\n", "After using this notebook, you should be able to:\n", "* Compare two samples using a statistical test called the \"t-test\" and interpret its results\n", "* Evaluate the magnitude of the difference between two samples using a measure of the effect size called \"Cohen's d\" and interpret the results\n", "* Use Python to perform statistical analyses on a dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instructions \n", "\n", "1. Read the notebook and execute the code cells. \n", "2. Answer the questions.\n", "\n", "*You can check your answers with the solution available [in this file](solution/DataAnalysis-solution.ipynb).*\n", "\n", "\n", " \n", "\n", "--- \n", "\n", "## Getting started\n", "\n", "### Python tools for stats\n", "Python comes with a number of libraries for processing data and computing statistics.\n", "To use these tools, you first have to load them using the `import` keyword. \n", "The role of the code cell just below is to load the tools that we use in the rest of the notebook. It is important to execute this cell *prior to executing any other cell in the notebook*." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# plotting and display tools\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt \n", "plt.style.use('seaborn-whitegrid') # global style for plotting\n", "\n", "from IPython.display import display, set_matplotlib_formats\n", "set_matplotlib_formats('svg') # vector format for graphs\n", "\n", "# data computation tools\n", "import numpy as np \n", "import pandas as pan\n", "\n", "# statistics tools\n", "import scipy.stats as stats\n", "import pingouin as pg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data available\n", "\n", "You have the raw data from the last two years of the class, stored in the CSV file `students_success_rate.csv`, which you can see in the file explorer in the left pane. \n", "If you double-click on the file, it will open in a new tab and you can look at what is inside.\n", "\n", "To analyze the data using Python, you first have to read the file:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " student year rate\n", "0 #101 Year1 0.727273\n", "1 #102 Year1 0.583333\n", "2 #103 Year1 0.750000\n", "3 #104 Year1 0.818182\n", "4 #105 Year1 0.800000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the sample data from the CSV file\n", "sample_data = pan.read_csv('students_success_rate.csv')\n", "\n", "# Display the first few lines of the dataset\n", "sample_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After reading the CSV file, its content is stored in the variable `sample_data`, which is a kind of table. Each line of the table is given an index number to identify it. We see above that the table contains one column called `student` with identifiers for the students, one column called `year` indicating which cohort the student is from and a column `rate` with the success rate of each student.\n", "\n", "To get the list of all the values for the success rate stored in the `rate` column, you can use the following syntax: `sample_data[\"rate\"]`." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.727273\n", "1 0.583333\n", "2 0.750000\n", "3 0.818182\n", "4 0.800000\n", "5 0.777778\n", "6 0.500000\n", "7 0.692308\n", "8 0.714286\n", "9 0.888889\n", "10 0.750000\n", "11 0.615385\n", "12 0.769231\n", "13 0.700000\n", "14 0.818182\n", "15 1.000000\n", "16 0.818182\n", "17 1.000000\n", "18 0.900000\n", "19 0.818182\n", "20 1.000000\n", "21 0.909091\n", "22 0.750000\n", "23 1.000000\n", "24 0.750000\n", "25 0.777778\n", "26 0.700000\n", "27 0.571429\n", "28 0.777778\n", "29 0.727273\n", "30 0.642857\n", "31 0.888889\n", "32 0.857143\n", "33 0.615385\n", "34 0.888889\n", "35 0.857143\n", "36 0.777778\n", "37 0.600000\n", "38 0.777778\n", "39 0.777778\n", "Name: rate, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# All values stored in the \"rate\" column of the \"sample_data\" table\n", "sample_data[\"rate\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you access the list of the values stored in the `year` column? \n", "Type your code in the following cell and execute it to check the result." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Year1\n", "1 Year1\n", "2 Year1\n", "3 Year1\n", "4 Year1\n", "5 Year1\n", "6 Year1\n", "7 Year1\n", "8 Year1\n", "9 Year1\n", "10 Year1\n", "11 Year1\n", "12 Year1\n", "13 Year1\n", "14 Year1\n", "15 Year1\n", "16 Year1\n", "17 Year1\n", "18 Year1\n", "19 Year1\n", "20 Year1\n", "21 Year1\n", "22 Year1\n", "23 Year2\n", "24 Year2\n", "25 Year2\n", "26 Year2\n", "27 Year2\n", "28 Year2\n", "29 Year2\n", "30 Year2\n", "31 Year2\n", "32 Year2\n", "33 Year2\n", "34 Year2\n", "35 Year2\n", "36 Year2\n", "37 Year2\n", "38 Year2\n", "39 Year2\n", "Name: year, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data[\"year\"] # TODO delete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter data according to the year\n", "\n", "Now we want to access the data of students who attended the course in year 1 only. For this, we select, from the whole table, only the lines for which the column `year` contains the value `\"Year1\"`. This can be done by putting a condition such as `sample_data[\"year\"]==\"Year1\"` between the brackets that we otherwise use for selecting columns. The condition plays the role of a filter, as shown below:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " student year rate\n", "0 #101 Year1 0.727273\n", "1 #102 Year1 0.583333\n", "2 #103 Year1 0.750000\n", "3 #104 Year1 0.818182\n", "4 #105 Year1 0.800000\n", "5 #106 Year1 0.777778\n", "6 #107 Year1 0.500000\n", "7 #108 Year1 0.692308\n", "8 #109 Year1 0.714286\n", "9 #110 Year1 0.888889\n", "10 #111 Year1 0.750000\n", "11 #112 Year1 0.615385\n", "12 #113 Year1 0.769231\n", "13 #114 Year1 0.700000\n", "14 #115 Year1 0.818182\n", "15 #116 Year1 1.000000\n", "16 #117 Year1 0.818182\n", "17 #118 Year1 1.000000\n", "18 #119 Year1 0.900000\n", "19 #120 Year1 0.818182\n", "20 #121 Year1 1.000000\n", "21 #122 Year1 0.909091\n", "22 #123 Year1 0.750000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use a condition to select only the line of the table for which the column \"year\" contains the value \"Year1\"\n", "sample1_data = sample_data[sample_data[\"year\"]==\"Year1\"]\n", "\n", "# Display the result\n", "sample1_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you get the data regarding students from year 2 only? \n", "Type your code in the following cell and execute it to check the result." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " student year rate\n", "23 #201 Year2 1.000000\n", "24 #202 Year2 0.750000\n", "25 #203 Year2 0.777778\n", "26 #204 Year2 0.700000\n", "27 #205 Year2 0.571429\n", "28 #206 Year2 0.777778\n", "29 #207 Year2 0.727273\n", "30 #208 Year2 0.642857\n", "31 #209 Year2 0.888889\n", "32 #210 Year2 0.857143\n", "33 #211 Year2 0.615385\n", "34 #212 Year2 0.888889\n", "35 #213 Year2 0.857143\n", "36 #214 Year2 0.777778\n", "37 #215 Year2 0.600000\n", "38 #216 Year2 0.777778\n", "39 #217 Year2 0.777778" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter data from year 2 only\n", "sample2_data = sample_data[sample_data[\"year\"]==\"Year2\"] # TODO delete\n", "\n", "# Display the result\n", "sample2_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First look at the data\n", "\n", "Now that we have loaded the data and we know how to access it, we perform our first detective tasks to get an idea of what the data looks like: we compute some descriptive statistics and we visualize the data using plots.\n", "\n", "### Descriptive statistics\n", "\n", "Let's compute some simple descriptive statistics on this sample data. The `describe()` function immediately gives us a number of useful descriptive stats. In the code cell below, we compute the descriptive statistics for the students from year 1:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rate\n", "count 23.000000\n", "mean 0.786970\n", "std 0.128173\n", "min 0.500000\n", "25% 0.720779\n", "50% 0.777778\n", "75% 0.853535\n", "max 1.000000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the descriptive stats\n", "sample1_stats = sample1_data.describe()\n", "\n", "# Display the result\n", "sample1_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you get the same descriptive statistics for the students from year 2? \n", "Type your code in the following cell and execute it to check the result." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rate\n", "count 17.000000\n", "mean 0.763994\n", "std 0.114993\n", "min 0.571429\n", "25% 0.700000\n", "50% 0.777778\n", "75% 0.857143\n", "max 1.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the descriptive stats\n", "sample2_stats = sample2_data.describe() # TODO delete\n", "\n", "# Display the result\n", "sample2_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting specific statistics\n", "You can access individual elements of the `sample1_stats` table using the corresponding names for the line and column of the value. \n", "The following cell illustrates how to get the mean success rate of students in year 1:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7869695521913044" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract the sample mean from the descriptive stats\n", "sample1_mean = sample1_stats.loc[\"mean\",\"rate\"]\n", "\n", "# Display the result\n", "sample1_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you get the mean success rate of students in year 2? \n", "Type your code in the following cell and execute it to check the result." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7639938492941176" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract the sample mean from the descriptive stats\n", "sample2_mean = sample2_stats.loc[\"mean\",\"rate\"]\n", "\n", "# Display the result\n", "sample2_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizations\n", "\n", - "To have a more precise idea of what the data looks like, an important step is to generate visualisations of this data. One interesting element is to look at how the success rates are distributed for each year. The Python code below creates two graphs which represent how the samples are distributed: a bar graph and boxplots.\n", + "To have a more precise idea of what the data looks like, an important step is to generate visualisations of this data. One interesting element is to look at how the success rates are distributed for each year. The Python code below creates two graphs which represent how the success rate of students is distributed: a bar graph and boxplots.\n", "\n", "**Execute the code cell below** to generate the graphs.\n", "\n", "*Note:* \n", "Python code for plotting can be quite verbose and not particularly interesting to look at unless you want to learn how to generate plots in Python. \n", "The good news is that you can **hide** a code cell from the notebook by selecting it and clicking on the blue bar which appears on its left. \n", "To make the cell visible again, just click again on the blue bar, or on the three \"dots\" which represent the collapsed cell." 
] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "[SVG markup omitted: histogram and boxplots of the success rate for Year 1 and Year 2]\n" ], "text/plain": [ - "
" + "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create visualisation\n", - "fig = plt.figure(figsize=(8, 4))\n", + "fig = plt.figure(figsize=(10, 4))\n", + "fig.suptitle(\"Distribution of the success rate\", y=1.03)\n", "\n", "### 1. Plot the distribution of the samples\n", "ax1 = plt.subplot(121)\n", "n1, bins1, patches1 = ax1.hist(sample1_data[\"rate\"], color='green', edgecolor='black', alpha=0.3)\n", "n2, bins2, patches2 = ax1.hist(sample2_data[\"rate\"], bins=bins1, color='blue', edgecolor='black', alpha=0.3)\n", "\n", "# Customize the plot\n", "ax1.set_xlabel('success rate')\n", - "ax1.set_ylabel('number of samples')\n", - "ax1.set_title(\"Distribution\")\n", + "ax1.set_ylabel('number of students')\n", "\n", "# Add the means\n", "mean1line = ax1.axvline(x=sample1_mean, color='green', alpha=0.5, linestyle='-.', linewidth=1.5)\n", "mean2line = ax1.axvline(x=sample2_mean, color='blue', alpha=0.5, linestyle='-.', linewidth=1.5)\n", "\n", "### 2. Create a boxplot in which we can see the quartiles\n", "ax2 = plt.subplot(122)\n", "box1 = ax2.boxplot(sample1_data[\"rate\"], positions=[1], sym='k+', patch_artist=True, boxprops=dict(facecolor=\"green\", alpha=0.3))\n", "box2 = ax2.boxplot(sample2_data[\"rate\"], positions=[2], sym='k+', patch_artist=True, boxprops=dict(facecolor=\"blue\", alpha=0.3))\n", "\n", "# Customize the plot\n", "plt.setp(box1['medians'], color='black')\n", "plt.setp(box2['medians'], color='black')\n", "ax2.set_ylabel('success rate')\n", "ax2.set_xticklabels([\"Year1\", \"Year2\"])\n", - "ax2.set_title(\"Quartiles\")\n", "\n", "# Add the means\n", "ax2.axhline(y=sample1_mean, color='green', alpha=0.5, linestyle='-.', linewidth=1.5)\n", "ax2.axhline(y=sample2_mean, color='blue', alpha=0.5, linestyle='-.', linewidth=1.5)\n", "\n", "### Add a general legend at the top of the figure\n", "fig.legend([patches1[0], mean1line, patches2[0], mean2line], \n", " ['Year 1', 'Mean 1', 'Year 2', 'Mean 2'],\n", - " loc='upper center', 
ncol=4, borderaxespad=0)\n", + " loc='upper center', ncol=4, bbox_to_anchor=(.4, 1))\n", "\n", "# Display the graph\n", - "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpretation of the results\n", "\n", "The simple analyses we have made so far give us a preliminary idea of how the success rates of students in the two years compare to each other. \n", "One feature to look at for the comparison is their respective means. We see above that the mean success rate of year 1 and the mean success rate of year 2 are really close to each other but that **the mean of year 2 is actually lower than that of year 1**.\n", "\n", "Can we say right away that the modification to the course has been ineffective since we see a decrease in the mean success rate? \n", "Actually **no**, that would be concluding much too fast!

\n", - "First, we see not only that the difference is quite small but also there is quite a variation in the success rate of students in both years and we have to take this variation into account when estimating the *size* of the difference. In addition, as with any experiment in real world, we don't know how much this result is only due to random (bad?) luck or if we are looking only at the effect of the modification on the course.\n", + "First, we see not only that the difference is quite small but also that there is quite a variation in the success rate of students in both years, and we have to take this variation into account when estimating the *size* of the difference. In addition, as with any experiment in the real world, we don't know how much of this result is only due to random (bad?) luck and how much truly reflects the effect of the modification to the course.\n", "\n", "In the next section, we will see how **statistical tools** can help us:\n", "* determine the probability of getting such results by chance: this test is called the **t-test**,\n", "* evaluate how large the difference is: this measure is called the **effect size**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "## Using statistical tools\n", "\n", "### Formulating hypotheses\n", "\n", "Before using any statistical tool, we need to clarify **what we want to test**, and what we want to test is called a **hypothesis**.\n", "\n", - "The hypothesis that we can make based on our observations from the previous section is that: the mean $m_1$ of the year 1 sample is similar to the mean $m_2$ of year 2 sample, which is noted $m_1 = m_2$. \n", + "Based on our observations above, we would like to know whether the mean success rate of year 2 is really lower than the mean success rate of year 1. 
\n", + "To test this idea, we have to proceed as in a trial or in a mathematical proof by contradiction: we start from the \"opposite\" hypothesis and look for evidence that it is false.\n", + "\n", + "The \"opposite\" hypothesis is that the mean $m_1$ of the year 1 sample is actually similar to the mean $m_2$ of the year 2 sample, which is noted $m_1 = m_2$. \n", "This hypothesis is noted $H_0$ and called the \"null\" hypothesis because it states that there is no difference between the two samples. \n", - "The \"alternate\" hypothesis $H_a$ is that the mean of year 1 is not similar to the mean of year 2, $m_1 \\neq m_2$.\n", + "The \"alternate\" hypothesis $H_a$ is that the mean of year 1 is not similar to the mean of year 2, $m_1 \\neq m_2$, and this is what we really want to know.\n", + "\n", + "Our goal now is to see whether the results of the statistical tests allow us to **reject the null hypothesis $H_0$** or not.\n", "\n", - "### Choosing a cut-off point for \"being sure\"\n", + "### Choosing a threshold for \"being sure\"\n", "\n", - "Statistical tools never give exact answers, they give probabilities. Therefore there is a chance that we make wrong conclusions based on the result of a statistical test. For instance, it can happen that we reject an hypothesis based on the result of a test when actually the hypothesis was true.\n", + "Statistical tools never give exact answers; they give probabilities. Therefore it is important to decide in advance **how sure** we want to be of the results. \n", + "In the case of the t-test, the result indicates the probability of getting a difference between $m_1$ and $m_2$ at least as large as the one we observe by chance alone, if $H_0$ were true.\n", + "We need to choose a probability threshold below which we think it is reasonable to say that the difference between $m_1$ and $m_2$ is not just luck.\n", "\n", - "Therefore it is important to decide in advance **how sure** we want to be of the results, i.e. 
what is the acceptable probability that we make a mistake in the interpretation. \n", - "This probability is usually noted $\\alpha$ and is called the **significance level**. \n", + "This threshold is usually denoted $\\alpha$ and is called the **significance level**. \n", + "Researchers generally use fixed significance levels such as $\\alpha = .05$, $\\alpha = .01$ or $\\alpha = .001$. \n", "\n", - "Researchers generally use $\\alpha = .05$, $\\alpha = .01$ or $\\alpha = .001$. \n", - "Choosing a significance level of $\\alpha = .05$ means that we decide that it is acceptable for us if our conclusions are wrong in 5% of the cases. This is the level we are going to choose.\n", + "In the following we are going to use $\\alpha = .05$.\n", + "This means that if the t-test gives us a probability lower than $.05$, we will consider that the difference we observe between $m_1$ and $m_2$ is unlikely to be just luck.\n", "\n", - "**Execute the cell below** so that our choice of confidence level is defined in Python." + "**Execute the cell below** so that our significance level is defined in Python." ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0.05" + "0.01" ] }, - "execution_count": 12, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define alpha\n", - "alpha = 0.05\n", + "alpha = 0.01\n", "\n", "# Display alpha\n", "alpha" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Computing the results\n", + "### Computing the statistical test\n", "\n", - "Python comes with a number of libraries with interesting statistical tools. 
\n", - "In particular, the `pingouin` library includes a function for doing a **two-sample t-test**, which also gives a measure of the **effect size**.\n", - "\n", - "In the following, we use this function to test our hypothesis $H_0$ that the mean success rate of students in year 1 is equivalent to the mean success rate of students in year 2.\n", + "Python comes with a number of libraries with interesting statistical tools. \n", + "In particular, the `pingouin` library includes a function for doing a **two-sample t-test**, which also gives a measure of the **effect size**.
\n", + "In the following, we use this function to compare the success rate of students in year 1 to the success rate of students in year 2.\n", "\n", "**Execute the cell below** to see the result of the test." ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
        T      dof   tail       p-val    CI95%         cohen-d  BF10   power
T-test  0.595  36.5  two-sided  0.55565  [-0.06, 0.1]  0.187    0.358  0.088
\n", "
" ], "text/plain": [ " T dof tail p-val CI95% cohen-d BF10 power\n", "T-test 0.595 36.5 two-sided 0.55565 [-0.06, 0.1] 0.187 0.358 0.088" ] }, - "execution_count": 13, + "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# Compute the t-test\n", - "ttest_result = pg.ttest(sample1_data[\"rate\"], sample2_data[\"rate\"], correction=True)\n", + "# Compute the t-test to compare the success rate of year 1 to the success rate of year 2\n", + "ttest_result = pg.ttest(sample1_data[\"rate\"], sample2_data[\"rate\"])\n", "\n", "# Display the result\n", "ttest_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function returns a table which contains different information for interpreting the test. \n", - "In this exercise we are going to focus only on two of these informations:\n", - "* `p-val`: the \"p-value\" is the probability to get a more extreme sample mean than the one we observe ; `p` has to be compared to `alpha` (0.05) to know if our sample mean is in the most extremes 5%.\n", - "* `cohen-d`: the \"effect-size\" \n", - "\n", + "In this exercise we are going to focus on two elements:\n", + "* `p-val`: this is the \"p-value\", denoted $p$, which represents the probability of observing a difference at least as large as the one between $m_1$ and $m_2$ just by luck. \n", + "The smaller the p-value, the stronger the evidence is that $m_1$ really is different from $m_2$. The p-value has to be compared to our significance level $\\alpha$.\n", + "* `cohen-d`: this is a measure of the \"effect-size\", denoted $d$, which represents the size of the difference between $m_1$ and $m_2$. \n", + "The bigger the effect size, the larger the difference between $m_1$ and $m_2$ relative to the variation in the data. The effect size has to be compared to thresholds from the literature as we will see below." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "### Interpreting the \"p-value\": what is the probability of getting such results by chance?\n", - "* evaluate the size of the difference: this measure is called the **effect size**. of the p-value\n", - "\n", - "We see above that `p > alpha`, which means that the probability of getting more extreme sample mean than the one we observe is higher than 5% so it cannot be considered as one of the extreme possible values.\n", "\n", - "The test tells us that the difference between the mean success rates of year 1 and year 2 is **not statistically significant**, that is to say that we cannot exclude that ..." + "We see in the results table above that the p-value is quite large, $p = 0.556$, which is much higher than our significance level of $.05$. \n", + "The fact that $p > \\alpha$ means that the difference between the mean success rates of year 1 and year 2 is **not statistically significant**, which means we cannot actually reject our hypothesis $H_0$ as we wanted to. \n", + "In other words, we cannot rule out that the difference that we see between $m_1$ and $m_2$ could be just random luck. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Using Python we can visualize what that means graphically by plotting the t-distribution of all the possible sample means that would be drawn from a population with the same mean as Anderson's population and showing where `t` is in the distribution compared to the zone defined by our $\\alpha$ of 5%:" + "Using Python we can visualize the p-value graphically. In the graph below, the blue zone represents the significance level we have chosen. The green hatched zone represents the p-value. When this green hatched zone is bigger than the blue zone, the difference between $m_1$ and $m_2$ is not statistically significant.\n", + "\n", + "**Execute the cell below** to see the graph." 
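Before looking at the graph, the idea behind the p-value can also be illustrated by simulation. The sketch below is a permutation test, not the t-test computed by `pingouin` in the notebook: it shuffles the year labels many times and counts how often a difference in means at least as large as the observed one appears purely by chance. The function name `permutation_p_value` and the two lists of success rates are hypothetical, made up for illustration only.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference between two means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values simulates "no real difference between years"
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical success rates, made up for illustration (not the course data)
year1 = [0.82, 0.75, 0.91, 0.68, 0.79, 0.85, 0.73, 0.88]
year2 = [0.78, 0.70, 0.86, 0.74, 0.81, 0.69]

alpha = 0.05
p = permutation_p_value(year1, year2)
if p < alpha:
    print(f"p = {p:.3f} < alpha: reject H0")
else:
    print(f"p = {p:.3f} >= alpha: cannot reject H0")
```

The decision rule at the end is the same comparison of $p$ with $\alpha$ that the text describes; only the way the probability is estimated differs from the t-test.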
] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "[Figure: t-distribution with the zone for the chosen significance level in blue and the zone for the p-value hatched in green; SVG path data omitted]" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lib.dataanalysis import * \n", "build_ttest_visualization(ttest_result, alpha)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Question
\n", + "How does the blue zone on the graph change when you change the value of $\\alpha$? \n", + "Go back to the code cell in which we have defined $\\alpha$ and change its value to $.01$. Describe how the blue zone changes:" + ] + }, + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "Type your answer here." + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Interpreting the \"effect-size\": how large is the difference?\n", "\n", - "So far we have tested our hypothesis regarding the similarity of means between our sample and the population and the results show us that there is probably no difference.\n", - "However whether this statistical test finds a difference depends on the sample size: with small samples it is harder to find a statistically significant difference.\n", - "We need therefore to **distinguish between a difference being statistically significant and a difference being large**.\n", - "The t-test is used to assess whether the difference is statistically significant. To assess whether the difference is large we use a measure called the effect size.\n", + "So far we have seen with the p-value that we cannot conclude that there is a difference between $m_1$ and $m_2$. \n", + "However, whether the t-test finds a difference depends on the sample size (here the number of students): with small samples ($N < 30$) it is harder to find a statistically significant difference.\n", "\n", - "The effect size represents the size of the difference between the sample mean and the population mean taking into account the variation inside the sample (and inside the population, if known). \n", - "Cohen's d is one of the existing measures of effect size. 
\n", + "According to the descriptive statistics computed earlier, we have 23 students in year 1 and 17 students in year 2, so our samples can be considered small.\n", + "We therefore need to **distinguish between the difference being statistically significant and the difference being large** and look at the effect size before drawing any conclusions.\n", "\n", - "To interpret the effect size we have to **compare it to thresholds from the litterature**. Cohen suggested that $d=0.2$ was a small effect size, $0.5$ represents a medium effect size and $0.8$ represents a large effect size.\n", - "For our Vullierens sample the effect size is therefore small. In this case, the difference is also not statistically significant. However, with a larger sample size, it would be possible to have a statistically significant difference which nonetheless would be so small as to be trivial.\n", + "We see in the result table above that the effect size is $d = 0.187$.\n", + "To interpret the effect size we have to **compare it to thresholds from the literature**. Cohen suggested that $d = 0.2$ represents a small effect size, $0.5$ a medium effect size and $0.8$ a large effect size, so $d = 0.187$ is just below the threshold for a small effect. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the result graphically. \n", - "The graph below represents the theoretical distributions of our two samples in blue and in green. We see below that the two groups largely overlap, which is representative of the size of the difference being trivial." + "The graph below represents the theoretical distributions of the success rate of students of year 1 in blue and of year 2 in green. We see below that the two groups overlap to a very large extent, which is representative of a small difference." 
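As a complement to the graph, the effect size can be recomputed by hand from summary statistics. The sketch below assumes the usual pooled-standard-deviation form of Cohen's $d$ and Cohen's conventional thresholds of 0.2, 0.5 and 0.8; it is not necessarily how `pingouin` computes the `cohen-d` column. The function names are hypothetical. The means and standard deviations are the ones hard-coded in the scratch notebook of this changeset, and the sample sizes are the 23 and 17 students quoted in the text.

```python
from math import sqrt


def cohens_d(mean1, std1, n1, mean2, std2, n2):
    """Cohen's d from summary statistics, using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)


def interpret_cohens_d(d):
    """Label an effect size with Cohen's conventional thresholds (0.2, 0.5, 0.8)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


# Assumption: the standard deviations are sample standard deviations
d = cohens_d(0.7869695521913044, 0.128173, 23, 0.7639938492941176, 0.114993, 17)
print(f"d = {d:.3f} ({interpret_cohens_d(d)})")
```

Under these assumptions the value should come out close to the $d = 0.187$ reported in the result table, which falls just under Cohen's threshold for a small effect.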
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "[Figure: theoretical distributions of the success rate of students of year 1 (blue) and year 2 (green); SVG path data omitted]" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lib.dataanalysis import * \n", "build_effectsize_visualization(ttest_result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "## Summary\n", "\n", "To summarize, to compare the mean of a sample to a reference value from a population, you have to proceed in four main steps:\n", "1. formulate the hypothese you want to test: the null hypothesis $H_0: m = \\mu$ and its alternate $H_a: m \\neq \\mu$ \n", "1. choose a cut-off point for being sure, usually $\\alpha = .05$, $\\alpha = .01$ or $\\alpha = .001$ \n", "1. compute the result of the t-test and interpret the result - in particular if the p-value is *below* the significance level you have chosen, $p \\lt \\alpha$, then it means $H_0$ should probably be rejected\n", "1. compute the effect size and interpret the result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", - "## Exercise\n", + "## Exercise: \"what if?\"\n", "\n", "**Analyzing another dataset.**\n", "\n", "A researcher from Tokyo sends you the results of a series of measurements she has done on the Irises of the [Meiji Jingu Imperial Gardens](http://www.meijijingu.or.jp/english/nature/2.html). The dataset can be found in the `iris-sample2-meiji.csv` file. \n", "How similar (or different) is the Meiji sample compared to the Iris virginica population documented by Edgar Anderson? \n", "The following questions are designed to guide you in analyzing this new dataset using this notebook.\n", "\n", "1. Which of the code cells above loads the data from the file containing the Vullierens dataset? Modify it to load the Meiji dataset.\n", "1. Do you need to modify anything else in the code to analyze this new dataset?\n", "1. What can you conclude about the Meiji sample from this analysis?\n", "\n", "**C. Going a bit further in the interpretation of the t-test.**\n", "1. 
In the code cells above, where is the cut-off point $\\alpha$ defined? Change its value to 0.01 and re-execute the notebook. \n", "1. How does this affect the result of the t-test for the Meiji sample?\n", "\n", " \n", "\n", "*You can check your answer with the solution available [in this file](solution/DataAnalysis-solution.ipynb).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "

Other resources

\n", "\n", "* Really well made [video on how to interpret the p-value](https://www.youtube.com/watch?v=eyknGvncKLw)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 } diff --git a/Untitled.ipynb b/Untitled.ipynb new file mode 100644 index 0000000..9c4e8b8 --- /dev/null +++ b/Untitled.ipynb @@ -0,0 +1,232 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.76571978, 0.65437484, 0.86286965, 0.82264655, 0.87496524,\n", + " 0.65167681, 0.71852321, 0.85367579, 0.82386823, 0.79332654,\n", + " 0.83566853, 0.81235828, 0.73133742, 0.74396695, 0.91243463,\n", + " 0.92900505, 0.80218026, 0.77502595, 0.91283284, 0.93668933])" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "mean1 = 0.7869695521913044\n", + "std1 = 0.128173\n", + "mean2 = 0.7639938492941176\n", + "std2 = 0.114993\n", + "\n", + "x_data = np.random.normal(loc=mean2,scale=std2,size=20) + (np.random.rand()/(mean2+mean2)) \n", + "\n", + "x_data" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "np.savetxt(\"students_success_rate2.csv\", x_data, delimiter=\",\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Click here to see the solution\n", + "
\n", + "
\n", + "\n", + "$\\lvert\\vec{T}\\rvert = 1686\\,\\mathrm{N}$\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Click here to see solution 2\n", + "
\n", + "
\n", + "\n", + "Here is solution 2 *in markdown* with some LaTeX $x_0 = \\alpha$.\n", + "\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import display, Math, Markdown, Latex, HTML\n", + "js = \"\"\"\n", + "\n", + "\"\"\"\n", + "display(HTML(js));" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from jupyter_core.paths import jupyter_config_dir\n", + "jupyter_dir = jupyter_config_dir()\n", + "import os.path\n", + "custom_js_path = os.path.join(jupyter_dir, 'custom', 'custom.js')\n", + "print(\"searching for custom.js in \", custom_js_path)\n", + "# my custom js\n", + "if os.path.isfile(custom_js_path):\n", + " with open(custom_js_path) as f:\n", + " print(f.read())\n", + "else:\n", + " print(\"You don't have a custom.js file\")" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "ename": "ZeroDivisionError", + "evalue": "division by zero", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mZeroDivisionError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;36m1\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mZeroDivisionError\u001b[0m: division by zero" + ] + } + ], + "source": [ + "1/0" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "ename": "ZeroDivisionError", + "evalue": "float division by zero", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + 
"\u001b[0;31mZeroDivisionError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;36m1.2\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mZeroDivisionError\u001b[0m: float division by zero" + ] + } + ], + "source": [ + "1.2/0" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in true_divide\n", + " \n" + ] + }, + { + "data": { + "text/plain": [ + "inf" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "np.divide(1.2, 0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/figs/diagram-normalcurve.png b/figs/diagram-normalcurve.png new file mode 100644 index 0000000..4211516 Binary files /dev/null and b/figs/diagram-normalcurve.png differ diff --git a/figs/diagram-normalcurveAB.png b/figs/diagram-normalcurveAB.png new file mode 100644 index 0000000..b3b92b5 Binary files /dev/null and b/figs/diagram-normalcurveAB.png differ diff --git a/figs/diagram-normalcurveAlpha.png b/figs/diagram-normalcurveAlpha.png new file mode 100644 index 0000000..ff72970 Binary files /dev/null and 
b/figs/diagram-normalcurveAlpha.png differ diff --git a/figs/diagram-samples.png b/figs/diagram-samples.png new file mode 100644 index 0000000..5fc20f3 Binary files /dev/null and b/figs/diagram-samples.png differ diff --git a/solution/DataAnalysis-solution.ipynb b/solution/DataAnalysis-solution.ipynb index e5cea1e..a02090d 100644 --- a/solution/DataAnalysis-solution.ipynb +++ b/solution/DataAnalysis-solution.ipynb @@ -1,131 +1,158 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " How People Learn - Autumn 2019-2020
\n", " C. Hardebolle, R. Tormey - CC BY-NC-SA 4.0 Int.
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Solution of the questions\n", "\n", " \n", "\n", "--- \n", "\n", "## Getting started\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you access the list of the values stored in the `year` column? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_data[\"year\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you get the data regarding students from year 2 only? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample2_data = sample_data[sample_data[\"year\"]==\"Year2\"]\n", "sample2_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "## First look at the data\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you get the same descriptive statistics for the students from year 2? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample2_stats = sample2_data.describe()\n", "sample2_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question
\n", "How can you get the mean success rate of students in year 2? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample2_mean = sample2_stats.loc[\"mean\",\"rate\"]\n", "sample2_mean" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Using statistical tools\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Question
\n", + "How does the blue zone on the graph change when you change the value of $\\alpha$? \n", + "Go back to the code cell in which we have defined $\\alpha$ and change its value to $.01$. Describe how the blue zone changes:" + ] + }, + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "The blue zone gets reduced and becomes quite small actually. \n", + "Since this zone represents the probability that we find acceptable for saying the difference between m1 and m2 cannot be just luck, it means we want to be super super super sure that it is not an effect of luck." + ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.8" + "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 } diff --git a/students_success_rate2.csv b/students_success_rate2.csv new file mode 100644 index 0000000..e7738ac --- /dev/null +++ b/students_success_rate2.csv @@ -0,0 +1,20 @@ +7.657197761584620954e-01 +6.543748366873820554e-01 +8.628696486506030050e-01 +8.226465508253819614e-01 +8.749652427714347258e-01 +6.516768078457561009e-01 +7.185232067119213806e-01 +8.536757898183831017e-01 +8.238682292333517898e-01 +7.933265391446522319e-01 +8.356685301737689642e-01 +8.123582764892773866e-01 +7.313374239085163042e-01 +7.439669501544037278e-01 +9.124346268191283471e-01 +9.290050457160961006e-01 +8.021802590759877782e-01 +7.750259488858373125e-01 +9.128328367522325903e-01 +9.366893335956023581e-01