diff --git a/HPLstats_notebook.ipynb b/HPLstats_notebook.ipynb index 88859e9..034ac12 100644 --- a/HPLstats_notebook.ipynb +++ b/HPLstats_notebook.ipynb @@ -1,521 +1,5417 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " How People Learn - Autumn 2019-2020
\n", " C. Hardebolle, R. Tormey - CC BY NC SA 4.0 International

\n", " How to use this notebook?
\n", " This notebook is made of text cells and code cells. The code cells have to be executed to see the result of the program.
To execute a cell, simply select it and click on the \"play\" button in the toolbar at the top of the notebook. It is important to execute the code cells in their order of appearance in the notebook.
\n", " Alternatively, you can use the Run all cells option in the Run menu at the top left of the window.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to hypothesis testing\n", "\n", "In this notebook we look at data on a type of flower called Iris and we analyze it using a programming language called Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning goals\n", "\n", "This notebook is designed for you to learn:\n", "* How to distinguish between \"population\" datasets and \"sample\" datasets when dealing with experimental data\n", "* How to compare a sample dataset to a population using a statistical test called the \"t-test\" and interpret its results\n", "* How to quantify the difference between a sample and a population using a measure of the effect size called \"Cohen's d\" and interpret the results\n", "* How to use Python scripts to make statistical analyses on a dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "
\n", " \"iris\n", "\n", "###### Iris virginica (Credit: Frank Mayfield CC BY-SA 2.0)\n", "\n", "
\n", "\n", "In 1935, an American botanist called Edgar Anderson worked on quantifying the morphological variation of Iris flowers of three related species, Iris setosa, Iris virginica and Iris versicolor [[1]](#Bibliography). He recorded a series of measurements of the petal length, petal width, sepal length and sepal width of each flower, together with its species.\n", "Based on the combination of these four features, a British statistician and biologist named Ronald Fisher developed a model to distinguish the species from each other [[2]](#Bibliography).\n", "\n", "\n", "**Question** \n", "A recent series of measurements has been carried out at the [Iris Garden of the Vullierens Castle](https://chateauvullierens.ch/en/) near Lausanne, on a sample of $n=50$ flowers of the Iris virginica species. \n", "**How similar (or different) is the Iris sample from the Vullierens Castle compared to the Iris virginica population documented by Edgar Anderson?**\n", "\n", "\n", "**Instructions** \n", "A. Read the notebook and execute the code cells. To check your understanding, ask yourself the following questions:\n", "* Can I explain how the concept of sample differs from the concept of population?\n", "* What does a t-test tell me about my sample?\n", "* What does Cohen's d tell me about my sample?\n", "\n", "B. Do the exercise at the end of the notebook. \n", "\n", "## Getting started\n", "\n", "**Python tools for stats** \n", "Python comes with a number of libraries for processing data and computing statistics.\n", "To use these tools you first have to load them using the `import` keyword. \n", "The role of the code cell just below is to load the tools that we use in the rest of the notebook. It is important to execute this cell *prior to executing any other cell in the notebook*."
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# plotting and display tools\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt \n", "plt.style.use('seaborn-whitegrid') # global style for plotting\n", "\n", "from IPython.display import display, set_matplotlib_formats\n", "set_matplotlib_formats('svg') # high-definition format for graphs\n", "\n", "# data computation tools\n", "import numpy as np \n", "import pandas as pan\n", "\n", "# statistics tools\n", "import scipy.stats as stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Data available** \n", + "**Data available on the Anderson dataset** \n", "Anderson has published summary statistics of his dataset. \n", "You have access to the mean petal length of the Iris virginica species: $\\mu = 5.552 cm$" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "5.552" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Define mean petal length of iris virginica species from Anderson\n", "mu = 5.552\n", "\n", "# Display mu\n", "mu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "**Data available on the Vullierens sample** \n", "You have access to the raw data collected on the petal length of the Vullierens sample, which is stored in the CSV file `iris-sample1-vullierens.csv` that you can see in the file explorer in the left pane. \n", "If you double-click on the file it will open in a new tab and you can look at what is inside.\n", "\n", "Now to analyze the data using Python you have to read the file:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "<div>
\n", + " (HTML table markup stripped; see the text table below)\n", + "</div>
" + ], + "text/plain": [ + " petal_length\n", + "0 6.0\n", + "1 5.9\n", + "2 5.5\n", + "3 5.7\n", + "4 5.4" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Read the Vullierens sample data from the CSV file\n", "sample_data = pan.read_csv('iris-sample1-vullierens.csv')\n", "\n", "# Display the first few lines of the dataset\n", "sample_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now the dataset is stored in the variable `sample_data`, which is a kind of table. You see above that there is only one column in this table. \n", - "`sample_data[\"petal_length\"]` gives you the list of all the values stored in that column." + "After reading the CSV file, its content is stored in the variable `sample_data`, which is a kind of table. Each line of the table is given an index number to identify it. We see above that, apart from the index, the table contains only one column, called `\"petal_length\"`, which contains all the measurements made on the Vullierens Irises. To get the list of all the values stored in that column, you can use the following syntax: `sample_data[\"petal_length\"]`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data analysis\n", "\n", "### 1. First look at the data\n", "\n", "Let's compute some simple descriptive statistics on this sample data:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 21, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "<div>
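As a self-contained aside, the column access and descriptive statistics used here can be sketched on a handful of made-up values (these are illustrative numbers, not the actual Vullierens measurements):

```python
import pandas as pan

# A tiny made-up table standing in for the CSV contents (illustrative values only)
df = pan.DataFrame({"petal_length": [6.0, 5.9, 5.5, 5.7, 5.4]})

# Select the column by its name, as with sample_data["petal_length"]
lengths = df["petal_length"]

# describe() returns a table of summary statistics (count, mean, std, quartiles, ...)
stats_table = df.describe()

# Individual cells of that table are read with .loc["row name", "column name"]
mean_length = stats_table.loc["mean", "petal_length"]
print(mean_length)  # 5.7 for these five made-up values
```

The same `.loc["mean", "petal_length"]` pattern is used below on the real `sample_stats` table.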
\n", + " (HTML table markup stripped; see the text table below)\n", + "</div>
" + ], + "text/plain": [ + " petal_length\n", + "count 50.000000\n", + "mean 5.658000\n", + "std 0.603084\n", + "min 4.400000\n", + "25% 5.225000\n", + "50% 5.700000\n", + "75% 6.000000\n", + "max 6.800000" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Compute the descriptive stats\n", "sample_stats = sample_data.describe()\n", "\n", "# Display the result\n", "sample_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access individual elements of the `sample_stats` table using the corresponding row and column names of the value. \n", "The following cell illustrates how to get the mean value of the petal length in the sample:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 22, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "5.6579999999999995" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Extract the sample mean from the descriptive stats\n", "sample_petal_length_mean = sample_stats.loc[\"mean\",\"petal_length\"]\n", "\n", "# Display the result\n", "sample_petal_length_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's make some simple visualisations of this data:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 23, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/svg+xml": [ + " (SVG plot markup stripped)\n" + ], + "text/plain": [ + "<Figure
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# Create visualisation\n", "fig = plt.figure(figsize=(16, 4))\n", "\n", "# Plot the sample values\n", "ax1 = plt.subplot(131)\n", "ax1.set_xlabel('index of sample')\n", "ax1.set_ylabel('petal length')\n", "ax1.set_title(\"Samples\")\n", "ax1.plot(sample_data[\"petal_length\"], 'go')\n", "# Add the means\n", "ax1.axhline(y=sample_petal_length_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean\")\n", "ax1.axhline(y=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax1.legend()\n", "\n", "# Plot the distribution of the samples\n", "ax2 = plt.subplot(132)\n", "ax2.set_xlabel('petal length')\n", "ax2.set_ylabel('number of samples')\n", "ax2.set_title(\"Distribution\")\n", "ax2.hist(sample_data[\"petal_length\"], color='green')\n", "# Add the means\n", "ax2.axvline(x=sample_petal_length_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean\")\n", "ax2.axvline(x=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax2.legend()\n", "\n", "# Box plot with quartiles\n", "ax3 = plt.subplot(133, sharey = ax1)\n", "box = ax3.boxplot(sample_data[\"petal_length\"], sym='k+', patch_artist=True)\n", "ax3.set_ylabel('petal length')\n", "ax3.set_title(\"Quartiles\")\n", "ax3.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)\n", "plt.setp(box['medians'], color='black')\n", "for patch in box['boxes']:\n", " patch.set(facecolor='green')\n", "# Add the means\n", "ax3.axhline(y=sample_petal_length_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean\")\n", "ax3.axhline(y=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax3.legend()\n", "\n", + "# Display the graph\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These simple analyses allow us to have a rough idea about how much the Irises from Vullierens differ (or not) from those observed 
by Anderson. \n", "But there are two questions that we cannot answer just by looking at the data like that:\n", "* How **significant** is the difference between the two populations, i.e. how sure are we that the difference we observe between the two populations is not just an artefact of the sampling?\n", "* How small (or big) is the difference between the two populations taking into account the amount of variation occurring in our sample, i.e. what is the **relative size** of the difference?\n", "\n", "In the following, we use some **statistical tests** to explore these questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Statistical significance of the difference between the two populations: t-test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference that we observe between the two populations may very well occur just because of the samples we have chosen. This is an issue in any study involving sampling. \n", "The *t-test*, or Student's t-test, can be used to evaluate the statistical significance of the difference between the means of our two datasets, i.e. the likelihood of the results occurring by chance.\n", "\n", "To use the t-test you have to proceed in 3 steps:\n", "1. formulate the hypotheses you want to test\n", "1. choose a threshold for the statistical significance\n", "1. compute the result of the t-test and interpret the result\n", "\n", "\n", "**Formulation of the hypotheses to test:**\n", "* Null hypothesis ($H_0$): the sample mean is similar to the mean of the reference population, $mean = \\mu$\n", "* Alternate hypothesis ($H_a$): the sample mean is not similar to the mean of the reference population, $mean \\neq \\mu$\n", "\n", "**Choice of the threshold:** \n", "The threshold $\\alpha$ represents the level of probability under which we think it is \"safe enough\" to reject $H_0$. \n", "Usual thresholds are $\\alpha=.05$, $\\alpha=.01$ or $\\alpha=.001$. 
\n", "An $\\alpha$ of $.05$ means that we may be wrong in rejecting $H_0$ in 5% of the cases." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 24, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "0.05" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "alpha = 0.05" + "# Define alpha\n", + "alpha = 0.05\n", + "\n", + "# Display alpha\n", + "alpha" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Computation of the t-test:**" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "t = 1.243\n", + "p = 0.220\n" + ] + } + ], "source": [ "# Compute the t-test\n", "t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu)\n", "\n", "# Display the result\n", "print(\"t = {:.3f}\".format(t))\n", "print(\"p = {:.3f}\".format(p))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Intepretation of the t-test:** \n", "Let's assume that $H_0$ is true, so there is actually no difference between our sample and the reference population. \n", "The Student distribution represents the respective probability of all the possible values for the mean of the sample under this assumption. \n", "The t-statistics tells us where is our sample mean in this distribution. \n", "Then, the p-value gives us the probability of getting this sample mean.\n", "\n", "If the t-statistics falls into the rejection zone, it means that the probability of getting this sample mean simply by chance is very low - or at least lower that the threshold we have chosen. 
\n", "When that happens, we can reject $H_0$ with relative confidence (which depends on our choice of $\\alpha$).\n", "\n", "A visualisation can greatly help understand these concepts:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 26, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/svg+xml": [ + " (SVG plot markup stripped)\n" + ], + "text/plain": [ + "<Figure
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# Extract the sample size from the descriptive stats\n", "sample_petal_length_size = sample_stats.loc[\"count\",\"petal_length\"]\n", "\n", "# Create the t-test visualization\n", "fig, ax = plt.subplots(figsize=(8, 4))\n", "ax.set_title(\"Probability distribution of all possible sample means if $H_0$ is true\")\n", "\n", "# Let's plot the T distribution for this sample size (n - 1 degrees of freedom)\n", "tdist = stats.t(df=sample_petal_length_size - 1, loc=0, scale=1)\n", "x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100)\n", "y = tdist.pdf(x) \n", "ax.plot(x, y, color='blue', linewidth=1)\n", "ax.axvline(x=0, color='blue', linestyle='dashed', linewidth=1)\n", "\n", "# Plot the rejection zone two tailed\n", "x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100)\n", "x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100)\n", "y_zone_1 = tdist.pdf(x_zone_1) \n", "y_zone_2 = tdist.pdf(x_zone_2) \n", "ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='blue', label = r'rejection of $H_0$ with $\\alpha={}$'.format(alpha))\n", "ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='blue')\n", "\n", "# Plot the t-test stats\n", "ax.axvline(x=t, color='green', linestyle='dashed', linewidth=1)\n", "ax.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle=\"round\", facecolor = \"white\", edgecolor = \"green\", alpha = 0.8))\n", "\n", "# Plot the p-value\n", "if t >= 0:\n", "    x_t = np.linspace(t, tdist.ppf(0.9999), 100)\n", "else:\n", "    x_t = np.linspace(tdist.ppf(0.0001), t, 100)\n", "y_t = tdist.pdf(x_t) \n", "ax.fill_between(x_t, y_t, 0, facecolor=\"none\", edgecolor=\"green\", hatch=\"///\", linewidth=0.0, label = 'p-value $p$={:.3f}'.format(p))\n", "\n", + "# Add a legend\n", "ax.legend()\n", "\n", + "# Display the graph\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
Quantification of the difference between the two populations: effect size\n", "\n", "The effect size represents the size of the difference between the sample mean and the population mean taking into account the variation inside the sample. \n", "Cohen's d is one of the existing measures of effect size. \n", "With $m$ and $s$ respectively the mean and standard deviation of the sample and $\\mu$ the population mean, Cohen's d can be calculated as follows: \n", "\n", "$\n", "\\begin{align}\n", "d = \\frac{m - \\mu}{s}\n", "\\end{align}\n", "$\n", "\n", "Let's compute the effect size of the difference between the Vullierens sample and Anderson's population." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 27, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "d = 0.176\n" + ] + } + ], "source": [ "# Extract the sample standard deviation from the descriptive stats generated earlier\n", "sample_petal_length_std = sample_stats.loc[\"std\",\"petal_length\"]\n", "\n", "# Compute Cohen's d\n", "d = (sample_petal_length_mean - mu)/sample_petal_length_std\n", "\n", "# Display the result\n", "print(\"d = {:.3f}\".format(d))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the result graphically!"
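The formula can be checked by hand on a few made-up values (illustrative numbers, not the actual dataset); note that $s$ is the *sample* standard deviation, computed with $n-1$ in the denominator as in `describe()`:

```python
import numpy as np

# Five made-up petal lengths (illustration only)
sample = np.array([6.0, 5.9, 5.5, 5.7, 5.4])
mu = 5.552  # population mean from Anderson

m = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# Cohen's d: difference of means in units of the sample standard deviation
d = (m - mu) / s
print("d = {:.3f}".format(d))
```

A common (though rough) convention due to Cohen reads values around 0.2, 0.5 and 0.8 as small, medium and large effects respectively; by that convention the $d \approx 0.176$ obtained above for the Vullierens sample is a small effect.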
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 28, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + 
" \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n" + ], + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# Create the vizualisation for the effect size\n", "fig, ax = plt.subplots(figsize=(8, 4))\n", "\n", "# Plot the normal distribution\n", "norm = stats.norm(loc=0, scale=1)\n", "x = np.linspace(norm.ppf(0.0001), norm.ppf(0.9999), 100)\n", "y = norm.pdf(x) \n", "ax.plot(x, y, color='blue', linewidth=1)\n", "ax.axvline(x=0, color='blue', linestyle='dashed', linewidth=1)\n", "\n", "# Plot the distribution of the sample\n", "norm_d = stats.norm(loc=d, scale=1)\n", "x_d = np.linspace(norm_d.ppf(0.0001), norm_d.ppf(0.9999), 100)\n", "y_d = norm_d.pdf(x_d) \n", "ax.plot(x_d, y_d, color='green', linewidth=1)\n", "ax.axvline(x=d, color='green', linestyle='dashed', linewidth=1)\n", "\n", "# Display the value of Cohen's d\n", "max_y = np.max(y)+.02\n", "ax.plot([0,d], [max_y, max_y], color='red', linewidth=1, marker=\".\")\n", "ax.annotate(\"effect size $d$={:.3f}\".format(d), xy=(d, max_y), xytext=(15, -5), textcoords='offset points', bbox=dict(boxstyle=\"round\", facecolor = \"white\", edgecolor = \"red\", alpha = 0.8))\n", "\n", + "# Display the graph\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "**A. Getting familiar with the code.**\n", "1. In the code cells above, where is the t-test computed?\n", "1. What are the two parameters that the t-test function takes as input?\n", "1. How could you compare the Vullierens sample to another population mean? \n", "1. What is the result of the t-test if you compare the Vullierens sample to a population mean of $\\mu = 5.4$?\n", "\n", "*Change the value of $\\mu$ back to 5.552 before working on the following questions.*\n", "\n", "**B. Analyzing another dataset.**\n", "\n", "A researcher from Tokyo sends you the results of a series of measurements she has done on the Irises of the [Meiji Jingu Imperial Gardens](http://www.meijijingu.or.jp/english/nature/2.html). 
The dataset can be found in the `iris-sample2-meiji.csv` file. \n", "**How similar (or different) is the Meiji sample compared to the Iris virginica population documented by Edgar Anderson?**\n", "\n", "The following questions are designed to guide you in analyzing this new dataset using this notebook.\n", "\n", "1. Which of the code cells above loads the data from the file containing the Vullierens dataset? Modify it to load the Meiji dataset.\n", "1. Do you need to modify anything else in the code to analyze this new dataset? Why?\n", "1. What can you conclude about the Meiji sample from this analysis?\n", "\n", "**C. Going a bit further in the interpretation of the t-test.**\n", "1. In the code cells above, where is the statistical significance threshold $\alpha$ defined? Change its value to 0.01 and re-execute the notebook. \n", "1. How does this affect the result of the t-test for the Meiji sample?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bibliography\n", "\n", "##### [1] E. Anderson (1935). \"The Irises of the Gaspé Peninsula.\" Bulletin of the American Iris Society 59: 2–5.\n", "##### [2] R. A. Fisher (1936). \"The use of multiple measurements in taxonomic problems\". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x\n", "##### More about the Iris Dataset on Wikipedia: https://en.wikipedia.org/wiki/Iris_flower_data_set\n", "##### Please note that the datasets used in this notebook have been generated using a random generator; they do not come from real measurements and cannot be used for any research purpose." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }