\n",
" Teaching and Learning differently with Jupyter Notebooks - HEG, December 12, 2019 \n",
" C. Hardebolle, CC BY-NC-SA 4.0 Int.
\n",
" How to use this notebook? \n",
" This notebook is made of text cells and code cells. The code cells have to be executed to see the result of the program. To execute a cell, simply select it and click on the \"play\" button (►) in the toolbar just above the notebook, or press Shift + Enter. It is important to execute the code cells in their order of appearance in the notebook.\n",
"
"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"### TODO\n",
"\n",
"Work on the text\n",
"Integrate exercises\n",
"Fake some more data?\n",
"\n",
"DONE (keep ?) Test normality and homoscedasticity\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How can we know if a teaching method produces the impact we hope for?\n",
"\n",
"Let's imagine that you teach a course for the users of your library and this year you are trying a new teaching method. \n",
"**How can you know whether this teaching method is actually \"better\" than the previous one?** \n",
"One solution is to collect data from your class and analyze it to determine how the method is influencing the way your students learn in the course.\n",
"\n",
"In this notebook, you are going to analyze data which has been collected in the course you are actually attending, \"Formation des usagers en bibliothèques (765-21n)\". \n",
"\n",
"## Context\n",
"\n",
"As you know from being a student in this course, one of the requirements to pass the course is to submit a number of assignments on which you work individually. \n",
"These assignments get validated and the **success rate** is, for each student, the number of accepted assignments divided by the total number of assignments submitted. \n",
"Your teacher collects the success rate of students attending the course each year for normal grading purposes.\n",
"\n",
"## Question\n",
"\n",
"Let's imagine that one year, your teacher modifies the course with the goal of helping students submit assignments of better quality. \n",
"One way to know whether students' assignments are really of better quality is to look at students' success rate, which should be higher after the modification. \n",
"\n",
"The activities in this notebook will guide you in the *detective work* of analyzing the data from the course to answer this question: \n",
"**How does the success rate of students after the modification (year 2) compare to the success rate of students the previous year (year 1)?** \n",
"\n",
"\n",
"## Learning goals\n",
"\n",
"After using this notebook, you should be able to:\n",
"* Compare two samples using a statistical test called the \"t-test\" and interpret its results\n",
"* Evaluate the magnitude of the difference between two samples using a measure of the effect size called \"Cohen's d\" and interpret the results\n",
"* Use Python to make statistical analyses on a dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instructions \n",
"\n",
"1. Read the notebook and execute the code cells. \n",
"2. Answer the questions\n",
"\n",
"*You can check your answers with the solution available [in this file](solution/DataAnalysis-solution.ipynb).*\n",
"\n",
"\n",
" \n",
"\n",
"--- \n",
"\n",
"## Getting started\n",
"\n",
"### Python tools for stats\n",
"Python comes with a number of libraries for processing data and computing statistics.\n",
"To use these tools you first have to load them using the `import` keyword. \n",
"The role of the code cell just below is to load the tools that we use in the rest of the notebook. It is important to execute this cell *prior to executing any other cell in the notebook*."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# plotting and display tools\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt \n",
"plt.style.use('seaborn-whitegrid') # global style for plotting\n",
"\n",
"from IPython.display import display, set_matplotlib_formats\n",
"set_matplotlib_formats('svg') # vector format for graphs\n",
"\n",
"# data computation tools\n",
"import numpy as np \n",
"import pandas as pan\n",
"\n",
"# statistics tools\n",
"import scipy.stats as stats\n",
"import pingouin as pg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data available\n",
"\n",
"You have the raw data coming from the last two years of the class, which is stored in the CSV file `students_success_rate.csv` that you can see in the file explorer in the left pane. \n",
"If you double click on the file it will open in a new tab and you can look at what is inside.\n",
"\n",
"Now to analyze the data using Python you have to read the file:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>student</th><th>year</th><th>rate</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>#101</td><td>Year1</td><td>0.727273</td></tr>\n",
"    <tr><th>1</th><td>#102</td><td>Year1</td><td>0.583333</td></tr>\n",
"    <tr><th>2</th><td>#103</td><td>Year1</td><td>0.750000</td></tr>\n",
"    <tr><th>3</th><td>#104</td><td>Year1</td><td>0.818182</td></tr>\n",
"    <tr><th>4</th><td>#105</td><td>Year1</td><td>0.800000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" student year rate\n",
"0 #101 Year1 0.727273\n",
"1 #102 Year1 0.583333\n",
"2 #103 Year1 0.750000\n",
"3 #104 Year1 0.818182\n",
"4 #105 Year1 0.800000"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read the sample data from the CSV file\n",
"sample_data = pan.read_csv('students_success_rate.csv')\n",
"\n",
"# Display the first few lines of the dataset\n",
"sample_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After reading the CSV file, its content is stored in the variable `sample_data`, which is a kind of table. Each line of the table is given an index number to identify it. We see above that the table contains one column called `student` with identifiers for the students, one column called `year` indicating which cohort the student is from and a column `rate` with the success rate of each student.\n",
"\n",
"To get the list of all the values for the success rate stored in the `rate` column, you can use the following syntax: `sample_data[\"rate\"]`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.727273\n",
"1 0.583333\n",
"2 0.750000\n",
"3 0.818182\n",
"4 0.800000\n",
"5 0.777778\n",
"6 0.500000\n",
"7 0.692308\n",
"8 0.714286\n",
"9 0.888889\n",
"10 0.750000\n",
"11 0.615385\n",
"12 0.769231\n",
"13 0.700000\n",
"14 0.818182\n",
"15 1.000000\n",
"16 0.818182\n",
"17 1.000000\n",
"18 0.900000\n",
"19 0.818182\n",
"20 1.000000\n",
"21 0.909091\n",
"22 0.750000\n",
"23 1.000000\n",
"24 0.750000\n",
"25 0.777778\n",
"26 0.700000\n",
"27 0.571429\n",
"28 0.777778\n",
"29 0.727273\n",
"30 0.642857\n",
"31 0.888889\n",
"32 0.857143\n",
"33 0.615385\n",
"34 0.888889\n",
"35 0.857143\n",
"36 0.777778\n",
"37 0.600000\n",
"38 0.777778\n",
"39 0.777778\n",
"Name: rate, dtype: float64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# All values stored in the \"rate\" column of the \"sample_data\" table\n",
"sample_data[\"rate\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question \n",
"How can you access the list of the values stored in the `year` column? \n",
"Type your code in the following cell and execute it to check the result."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Year1\n",
"1 Year1\n",
"2 Year1\n",
"3 Year1\n",
"4 Year1\n",
"5 Year1\n",
"6 Year1\n",
"7 Year1\n",
"8 Year1\n",
"9 Year1\n",
"10 Year1\n",
"11 Year1\n",
"12 Year1\n",
"13 Year1\n",
"14 Year1\n",
"15 Year1\n",
"16 Year1\n",
"17 Year1\n",
"18 Year1\n",
"19 Year1\n",
"20 Year1\n",
"21 Year1\n",
"22 Year1\n",
"23 Year2\n",
"24 Year2\n",
"25 Year2\n",
"26 Year2\n",
"27 Year2\n",
"28 Year2\n",
"29 Year2\n",
"30 Year2\n",
"31 Year2\n",
"32 Year2\n",
"33 Year2\n",
"34 Year2\n",
"35 Year2\n",
"36 Year2\n",
"37 Year2\n",
"38 Year2\n",
"39 Year2\n",
"Name: year, dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_data[\"year\"] # TODO delete"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter data according to the year\n",
"\n",
"Now we want to access the data of students who attended the course in year 1 only. For this, we select only the lines of the table for which the column `year` contains the value `\"Year1\"`. This can be done by putting a condition such as `sample_data[\"year\"]==\"Year1\"` between brackets, as we do for selecting columns. The condition then plays the role of a filter, as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>student</th><th>year</th><th>rate</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>#101</td><td>Year1</td><td>0.727273</td></tr>\n",
"    <tr><th>1</th><td>#102</td><td>Year1</td><td>0.583333</td></tr>\n",
"    <tr><th>2</th><td>#103</td><td>Year1</td><td>0.750000</td></tr>\n",
"    <tr><th>3</th><td>#104</td><td>Year1</td><td>0.818182</td></tr>\n",
"    <tr><th>4</th><td>#105</td><td>Year1</td><td>0.800000</td></tr>\n",
"    <tr><th>5</th><td>#106</td><td>Year1</td><td>0.777778</td></tr>\n",
"    <tr><th>6</th><td>#107</td><td>Year1</td><td>0.500000</td></tr>\n",
"    <tr><th>7</th><td>#108</td><td>Year1</td><td>0.692308</td></tr>\n",
"    <tr><th>8</th><td>#109</td><td>Year1</td><td>0.714286</td></tr>\n",
"    <tr><th>9</th><td>#110</td><td>Year1</td><td>0.888889</td></tr>\n",
"    <tr><th>10</th><td>#111</td><td>Year1</td><td>0.750000</td></tr>\n",
"    <tr><th>11</th><td>#112</td><td>Year1</td><td>0.615385</td></tr>\n",
"    <tr><th>12</th><td>#113</td><td>Year1</td><td>0.769231</td></tr>\n",
"    <tr><th>13</th><td>#114</td><td>Year1</td><td>0.700000</td></tr>\n",
"    <tr><th>14</th><td>#115</td><td>Year1</td><td>0.818182</td></tr>\n",
"    <tr><th>15</th><td>#116</td><td>Year1</td><td>1.000000</td></tr>\n",
"    <tr><th>16</th><td>#117</td><td>Year1</td><td>0.818182</td></tr>\n",
"    <tr><th>17</th><td>#118</td><td>Year1</td><td>1.000000</td></tr>\n",
"    <tr><th>18</th><td>#119</td><td>Year1</td><td>0.900000</td></tr>\n",
"    <tr><th>19</th><td>#120</td><td>Year1</td><td>0.818182</td></tr>\n",
"    <tr><th>20</th><td>#121</td><td>Year1</td><td>1.000000</td></tr>\n",
"    <tr><th>21</th><td>#122</td><td>Year1</td><td>0.909091</td></tr>\n",
"    <tr><th>22</th><td>#123</td><td>Year1</td><td>0.750000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" student year rate\n",
"0 #101 Year1 0.727273\n",
"1 #102 Year1 0.583333\n",
"2 #103 Year1 0.750000\n",
"3 #104 Year1 0.818182\n",
"4 #105 Year1 0.800000\n",
"5 #106 Year1 0.777778\n",
"6 #107 Year1 0.500000\n",
"7 #108 Year1 0.692308\n",
"8 #109 Year1 0.714286\n",
"9 #110 Year1 0.888889\n",
"10 #111 Year1 0.750000\n",
"11 #112 Year1 0.615385\n",
"12 #113 Year1 0.769231\n",
"13 #114 Year1 0.700000\n",
"14 #115 Year1 0.818182\n",
"15 #116 Year1 1.000000\n",
"16 #117 Year1 0.818182\n",
"17 #118 Year1 1.000000\n",
"18 #119 Year1 0.900000\n",
"19 #120 Year1 0.818182\n",
"20 #121 Year1 1.000000\n",
"21 #122 Year1 0.909091\n",
"22 #123 Year1 0.750000"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Use a condition to select only the lines of the table for which the column \"year\" contains the value \"Year1\"\n",
"sample1_data = sample_data[sample_data[\"year\"]==\"Year1\"]\n",
"\n",
"# Display the result\n",
"sample1_data"
]
},
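{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the condition acts as a filter, it can help to look at what the condition produces on its own. The sketch below is only an illustration: the table `toy` and its values are made up for the example and are not part of the course data.\n",
"\n",
"```python\n",
"import pandas as pan\n",
"\n",
"# Hypothetical mini-table, for illustration only (not the course data)\n",
"toy = pan.DataFrame({\n",
"    \"student\": [\"#A\", \"#B\", \"#C\"],\n",
"    \"year\": [\"Year1\", \"Year2\", \"Year1\"],\n",
"    \"rate\": [0.5, 0.75, 1.0],\n",
"})\n",
"\n",
"# The condition alone yields one True/False value per line of the table\n",
"mask = toy[\"year\"] == \"Year1\"\n",
"print(mask.tolist())                  # [True, False, True]\n",
"\n",
"# Between brackets, the condition keeps only the lines marked True\n",
"print(toy[mask][\"student\"].tolist())  # ['#A', '#C']\n",
"```\n",
"\n",
"The lines that are kept retain their original index numbers, which is why a filtered table does not necessarily start at index 0."
]
},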
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question \n",
"How can you get the data regarding students from year 2 only? \n",
"Type your code in the following cell and execute it to check the result."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>student</th><th>year</th><th>rate</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>23</th><td>#201</td><td>Year2</td><td>1.000000</td></tr>\n",
"    <tr><th>24</th><td>#202</td><td>Year2</td><td>0.750000</td></tr>\n",
"    <tr><th>25</th><td>#203</td><td>Year2</td><td>0.777778</td></tr>\n",
"    <tr><th>26</th><td>#204</td><td>Year2</td><td>0.700000</td></tr>\n",
"    <tr><th>27</th><td>#205</td><td>Year2</td><td>0.571429</td></tr>\n",
"    <tr><th>28</th><td>#206</td><td>Year2</td><td>0.777778</td></tr>\n",
"    <tr><th>29</th><td>#207</td><td>Year2</td><td>0.727273</td></tr>\n",
"    <tr><th>30</th><td>#208</td><td>Year2</td><td>0.642857</td></tr>\n",
"    <tr><th>31</th><td>#209</td><td>Year2</td><td>0.888889</td></tr>\n",
"    <tr><th>32</th><td>#210</td><td>Year2</td><td>0.857143</td></tr>\n",
"    <tr><th>33</th><td>#211</td><td>Year2</td><td>0.615385</td></tr>\n",
"    <tr><th>34</th><td>#212</td><td>Year2</td><td>0.888889</td></tr>\n",
"    <tr><th>35</th><td>#213</td><td>Year2</td><td>0.857143</td></tr>\n",
"    <tr><th>36</th><td>#214</td><td>Year2</td><td>0.777778</td></tr>\n",
"    <tr><th>37</th><td>#215</td><td>Year2</td><td>0.600000</td></tr>\n",
"    <tr><th>38</th><td>#216</td><td>Year2</td><td>0.777778</td></tr>\n",
"    <tr><th>39</th><td>#217</td><td>Year2</td><td>0.777778</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" student year rate\n",
"23 #201 Year2 1.000000\n",
"24 #202 Year2 0.750000\n",
"25 #203 Year2 0.777778\n",
"26 #204 Year2 0.700000\n",
"27 #205 Year2 0.571429\n",
"28 #206 Year2 0.777778\n",
"29 #207 Year2 0.727273\n",
"30 #208 Year2 0.642857\n",
"31 #209 Year2 0.888889\n",
"32 #210 Year2 0.857143\n",
"33 #211 Year2 0.615385\n",
"34 #212 Year2 0.888889\n",
"35 #213 Year2 0.857143\n",
"36 #214 Year2 0.777778\n",
"37 #215 Year2 0.600000\n",
"38 #216 Year2 0.777778\n",
"39 #217 Year2 0.777778"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Filter data from year 2 only\n",
"sample2_data = sample_data[sample_data[\"year\"]==\"Year2\"] # TODO delete\n",
"\n",
"# Display the result\n",
"sample2_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## First look at the data\n",
"\n",
"Now that we have loaded the data and we know how to access it, we perform our first detective tasks to get an idea of what the data looks like: we compute some descriptive statistics and we visualize the data using plots.\n",
"\n",
"### Descriptive statistics\n",
"\n",
"Let's compute some simple descriptive statistics on this sample data. The `describe()` function gives us right away a number of useful descriptive stats. In the code cell below we compute the descriptive statistics for the students from year 1:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>rate</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>23.000000</td></tr>\n",
"    <tr><th>mean</th><td>0.786970</td></tr>\n",
"    <tr><th>std</th><td>0.128173</td></tr>\n",
"    <tr><th>min</th><td>0.500000</td></tr>\n",
"    <tr><th>25%</th><td>0.720779</td></tr>\n",
"    <tr><th>50%</th><td>0.777778</td></tr>\n",
"    <tr><th>75%</th><td>0.853535</td></tr>\n",
"    <tr><th>max</th><td>1.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" rate\n",
"count 23.000000\n",
"mean 0.786970\n",
"std 0.128173\n",
"min 0.500000\n",
"25% 0.720779\n",
"50% 0.777778\n",
"75% 0.853535\n",
"max 1.000000"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute the descriptive stats\n",
"sample1_stats = sample1_data.describe()\n",
"\n",
"# Display the result\n",
"sample1_stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question \n",
"How can you get the same descriptive statistics for the students from year 2? \n",
"Type your code in the following cell and execute it to check the result."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>rate</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>17.000000</td></tr>\n",
"    <tr><th>mean</th><td>0.763994</td></tr>\n",
"    <tr><th>std</th><td>0.114993</td></tr>\n",
"    <tr><th>min</th><td>0.571429</td></tr>\n",
"    <tr><th>25%</th><td>0.700000</td></tr>\n",
"    <tr><th>50%</th><td>0.777778</td></tr>\n",
"    <tr><th>75%</th><td>0.857143</td></tr>\n",
"    <tr><th>max</th><td>1.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" rate\n",
"count 17.000000\n",
"mean 0.763994\n",
"std 0.114993\n",
"min 0.571429\n",
"25% 0.700000\n",
"50% 0.777778\n",
"75% 0.857143\n",
"max 1.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute the descriptive stats\n",
"sample2_stats = sample2_data.describe() # TODO delete\n",
"\n",
"# Display the result\n",
"sample2_stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting specific statistics\n",
"You can access individual elements of the `sample1_stats` table using the corresponding names for the line and column of the value. \n",
"The following cell illustrates how to get the mean success rate of students in year 1:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7869695521913044"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract the sample mean from the descriptive stats\n",
"sample1_mean = sample1_stats.loc[\"mean\",\"rate\"]\n",
"\n",
"# Display the result\n",
"sample1_mean"
]
},
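{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check, a value read from the descriptive statistics table can also be computed directly from the column. The sketch below is only an illustration and uses a made-up table `toy`, not the course data.\n",
"\n",
"```python\n",
"import pandas as pan\n",
"\n",
"# Hypothetical values, for illustration only (not the course data)\n",
"toy = pan.DataFrame({\"rate\": [0.5, 0.75, 1.0]})\n",
"toy_stats = toy.describe()\n",
"\n",
"# Mean read from the descriptive statistics table (line \"mean\", column \"rate\")\n",
"print(toy_stats.loc[\"mean\", \"rate\"])  # 0.75\n",
"\n",
"# Same value computed directly from the column\n",
"print(toy[\"rate\"].mean())             # 0.75\n",
"```\n",
"\n",
"Reading the value with `.loc` is convenient when you need several statistics at once, while calling `mean()` on the column is shorter when you only need that one number."
]
},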
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question \n",
"How can you get the mean success rate of students in year 2? \n",
"Type your code in the following cell and execute it to check the result."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7639938492941176"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract the sample mean from the descriptive stats\n",
"sample2_mean = sample2_stats.loc[\"mean\",\"rate\"]\n",
"\n",
"# Display the result\n",
"sample2_mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizations\n",
"\n",
- "To have a more precise idea of what the data looks like, an important step is to generate visualisations of this data. One interesting element is to look at how the success rates are distributed for each year. The Python code below creates two graphs which represent how the samples are distributed: a bar graph and boxplots.\n",
+ "To have a more precise idea of what the data looks like, an important step is to generate visualisations of this data. One interesting element is to look at how the success rates are distributed for each year. The Python code below creates two graphs which represent how the success rate of students is distributed: a bar graph and boxplots.\n",
"\n",
"**Execute the code cell below** to generate the graphs.\n",
"\n",
"*Note:* \n",
"Python code for plotting can be quite verbose and not particularly interesting to look at unless you want to learn how to generate plots in Python. \n",
"The good news is that you can **hide** a code cell from the notebook by selecting it and clicking on the blue bar which appears on its left. \n",
"To make the cell visible again, just click again on the blue bar, or on the three \"dots\" which represent the collapsed cell."
]
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
- "