diff --git a/HPLstats_notebook.ipynb b/HPLstats_notebook.ipynb index 319cfc1..a656c07 100644 --- a/HPLstats_notebook.ipynb +++ b/HPLstats_notebook.ipynb @@ -1,4496 +1,4395 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " How People Learn - Autumn 2019-2020

\n", " How to use this notebook?
\n", " This notebook is made of text cells and code cells. The code cells have to be executed to see the result of the program. To execute a cell, simply select it and click on the \"play\" button in the tool bar at the top of the notebook. It is important to execute the code cells in their order of appearance in the notebook.
\n", " Alternatively, you can use the Run all cells option in the Run menu at the top left of the window.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to hypothesis testing" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import numpy as np # scientific computation tools\n", "import pandas # data management tools\n", "\n", "from statistics import * # standard stats tools\n", "import scipy.stats as stats # advanced stats tools\n", "import researchpy # dataframe stats\n", "import math # basic math\n", "\n", "import matplotlib.pyplot as plt # plotting tools\n", "plt.style.use('seaborn-whitegrid') # global style for plotting\n", "from IPython.display import set_matplotlib_formats\n", "set_matplotlib_formats('svg') # high def format from graphs\n", "\n", "from IPython.display import display, Math, Markdown # pretty printing in LaTeX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning goals\n", "\n", "TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "
\n", " \"iris\n", "\n", "###### Iris virginica (Credit: Frank Mayfield CC BY-SA 2.0)\n", "\n", "
\n", "\n", "In 1935, an american botanist called Edgar Anderson worked on quantifying the morphologic variation of Iris flowers of three related species, Iris setosa, Iris virginica and Iris versicolor [[1]](#Bibliography). He realized a series of measures of the petal length, petal width, sepal length, sepal width and species.\n", "Based on the combination of these four features, a British statistician and biologist named Ronald Fisher developed a model to distinguish the species from each other [[2]](#Bibliography).\n", "\n", "## Question\n", "\n", "A recent series of measurements has been carried out at the [Iris Garden of the Vullierens Castle](https://chateauvullierens.ch/en/) near Lausanne, on a sample of $n=50$ flowers of the Iris virginica species. \n", "**How similar (or different) is the Iris sample from the Vullierens Castle compared to the Iris virginica population documented by Edgar Anderson?**\n", "\n", "## Data available\n", "You have access to the following published summary statistics of the Anderson dataset: mean petal length of the Iris virginica species $\\mu = 5.552 cm$" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# mean petal length of iris virginica species from Anderson\n", "mu = 5.552" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have access to the raw data collected on the petal length of the Vullierens sample:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
petal_length
06.0
15.9
25.5
35.7
45.4
\n", "
" ], "text/plain": [ " petal_length\n", "0 6.0\n", "1 5.9\n", "2 5.5\n", "3 5.7\n", "4 5.4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the Vullierens sample data from the CSV file\n", "sample_data = pandas.read_csv('iris-sample-vullierens.csv')\n", "\n", "# Display the first few lines of the dataset\n", "sample_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. First look at the data\n", "\n", "Let's compute some simple descriptive statistics on this sample data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
petal_length
count50.000000
mean5.658000
std0.603084
min4.400000
25%5.225000
50%5.700000
75%6.000000
max6.800000
\n", "
" ], "text/plain": [ " petal_length\n", "count 50.000000\n", "mean 5.658000\n", "std 0.603084\n", "min 4.400000\n", "25% 5.225000\n", "50% 5.700000\n", "75% 6.000000\n", "max 6.800000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Simple descriptive stats\n", "sample_stats = sample_data.describe()\n", "sample_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make some simple visualisations of this data:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Get the mean from the descriptive stats\n", "sample_petal_length_mean = sample_stats.loc[\"mean\",\"petal_length\"]\n", "\n", "# Create visualisation\n", "fig = plt.figure(figsize=(16, 4))\n", "\n", "# Plot the sample values\n", "ax1 = plt.subplot(131)\n", "ax1.set_xlabel('index of sample')\n", "ax1.set_ylabel('petal length')\n", "ax1.set_title(\"Samples\")\n", "ax1.plot(sample_data[\"petal_length\"], 'go')\n", "# Add the means\n", "ax1.axhline(y=sample_petal_length_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean\")\n", "ax1.axhline(y=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax1.legend()\n", "\n", "# Plot the distribution of the samples\n", "ax2 = plt.subplot(132)\n", "ax2.set_xlabel('petal length')\n", "ax2.set_ylabel('number of samples')\n", "ax2.set_title(\"Distribution\")\n", "ax2.hist(sample_data[\"petal_length\"], color='green')\n", "# Add the means\n", "ax2.axvline(x=sample_petal_length_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean\")\n", "ax2.axvline(x=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax2.legend()\n", "\n", "# Box plot with quartiles\n", "ax3 = plt.subplot(133, sharey = ax1)\n", "box = ax3.boxplot(sample_data[\"petal_length\"], sym='k+', patch_artist=True)\n", "ax3.set_ylabel('petal length')\n", "ax3.set_title(\"Quartiles\")\n", "ax3.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)\n", "plt.setp(box['medians'], color='black')\n", "for patch in box['boxes']:\n", " patch.set(facecolor='green')\n", "# Add the means\n", "ax3.axhline(y=sample_petal_length_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean\")\n", "ax3.axhline(y=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax3.legend()\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These simple 
analyses give us a rough idea of how much the Irises from Vullierens differ (or not) from those observed by Anderson. \n", "But there are two questions that we cannot answer just by looking at the data like that:\n", "* How **significant** is the difference between the two populations, i.e. how sure are we that the difference we observe between the two populations is not just an artefact of the sampling?\n", "* How small (or big) is the difference between the two populations taking into account the amount of variation occurring in our population, i.e. what is the **relative size** of the difference?\n", "\n", "In the following, we use some **statistical tests** to explore these questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Statistical significance of the difference between the two populations: t-test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference that we observe between the two populations may very well occur just because of the samples we have chosen. This is an issue in any study involving sampling. \n", "The *t-test*, or Student's t-test, can be used to evaluate the statistical significance of the difference between the means of our two datasets, i.e. the likelihood of the results occurring by chance.\n", "\n", "To use the t-test, you proceed in three steps:\n", "1. formulate the hypotheses you want to test\n", "1. choose a threshold for the statistical significance\n", "1. compute the result of the t-test and interpret the result\n", "\n", "\n", "**Formulation of the hypotheses to test:**\n", "* Null hypothesis ($H_0$): the sample mean is similar to the mean of the reference population, $mean = \mu$\n", "* Alternate hypothesis ($H_a$): the sample mean is not similar to the mean of the reference population, $mean \neq \mu$\n", "\n", "**Choice of the threshold:** \n", "The threshold $\alpha$ represents the level of probability under which we think it is \"safe enough\" to reject $H_0$. 
\n", "Usual thresholds are $\\alpha=.05$, $\\alpha=.01$ or $\\alpha=.001$. \n", "An $\\alpha$ of $.05$ means that we may be wrong in rejecting $H_0$ in 5% of the cases." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "alpha = 0.05" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Computation of the t-test:**" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "t = 1.243\n", "p = 0.220\n" ] } ], "source": [ "t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu)\n", "\n", "print(\"t = {:.3f}\".format(t))\n", "print(\"p = {:.3f}\".format(p))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Intepretation of the t-test:** \n", "Let's assume that $H_0$ is true, so there is actually no difference between our sample and the reference population. \n", "The Student distribution represents the respective probability of all the possible values for the mean of the sample under this assumption. \n", "The t-statistics tells us where is our sample mean in this distribution. \n", "Then, the p-value gives us the probability of getting this sample mean.\n", "\n", "If the t-statistics falls into the rejection zone, it means that the probability of getting this sample mean simply by chance is very low - or at least lower that the threshold we have chosen. 
\n", "When that happens, we can reject $H_0$ with a relative confidence (which depends on our choice of $\\alpha$).\n", "\n", "A visualisation can greatly help understand these concepts:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Get the sample size from the descriptive stats\n", "sample_petal_length_size = sample_stats.loc[\"count\",\"petal_length\"]\n", "\n", "# Plot the t-test\n", "fig, ax = plt.subplots(figsize=(8, 4))\n", "ax.set_title(\"Probability distribution of all possible sample means if $H_0$ is true\")\n", "\n", "# Let's plot the T distribution for this sample size\n", "rv = stats.t(df=sample_petal_length_size, loc=0, scale=1)\n", "x = np.linspace(rv.ppf(0.0001), rv.ppf(0.9999), 100)\n", "y = rv.pdf(x) \n", "ax.plot(x, y, color='blue', linewidth=1)\n", "ax.axvline(x=0, color='blue', linestyle='dashed', linewidth=1)\n", "\n", "# Plot the rejection zone two tailed\n", "x_zone_1 = np.linspace(rv.ppf(0.0001), rv.ppf(alpha/2), 100)\n", "x_zone_2 = np.linspace(rv.ppf(1-alpha/2), rv.ppf(0.9999), 100)\n", "y_zone_1 = rv.pdf(x_zone_1) \n", "y_zone_2 = rv.pdf(x_zone_2) \n", "ax.fill_between(x_zone_1,y_zone_1,0, alpha=0.3, color='blue', label = r'rejection of $H_0$ with $\\alpha={}$'.format(alpha))\n", "ax.fill_between(x_zone_2,y_zone_2,0, alpha=0.3, color='blue')\n", "\n", "\n", "# Plot the t-test stats\n", "ax.axvline(x=t, color='green', linestyle='dashed', linewidth=1)\n", "ax.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle=\"round\", facecolor = \"white\", edgecolor = \"green\", alpha = 0.8))\n", "\n", "# Plot the p-value\n", "x_t = np.linspace(t, rv.ppf(0.9999), 100)\n", "y_t = rv.pdf(x_t) \n", "ax.fill_between(x_t, y_t, 0, facecolor=\"none\", edgecolor=\"green\", hatch=\"///\", linewidth=0.0, label = 'p-value $p$={:.3f}'.format(p))\n", "\n", "ax.legend()\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Quantification of the difference between the two populations: effect size" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# access the standard deviation in the stats generated earlier\n", "sample_petal_length_std = sample_stats.loc[\"std\",\"petal_length\"]\n", "\n", "# compute Cohen's d relative to the reference mean mu\n", "cohens_d = (sample_petal_length_mean - mu)/sample_petal_length_std\n", "print(\"d = {:.3f}\".format(cohens_d))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 4))\n", "\n", "# plot the normal distribution\n", "norm = stats.norm(loc=0, scale=1)\n", "x = np.linspace(norm.ppf(0.0001), norm.ppf(0.9999), 100)\n", "y = norm.pdf(x) \n", "ax.plot(x, y, color='blue', linewidth=1)\n", "ax.axvline(x=0, color='blue', linestyle='dashed', linewidth=1)\n", "\n", "# plot the distribution of the sample\n", "norm_d = stats.norm(loc=cohens_d, scale=1)\n", "x_d = np.linspace(norm_d.ppf(0.0001), norm_d.ppf(0.9999), 100)\n", "y_d = norm_d.pdf(x_d) \n", "ax.plot(x_d, y_d, color='green', linewidth=1)\n", "ax.axvline(x=cohens_d, color='green', linestyle='dashed', linewidth=1)\n", "\n", "# make Cohen's d visible\n", "max_y = np.max(y)+.02\n", "ax.plot([0,cohens_d], [max_y, max_y], color='red', linewidth=1, marker=\".\")\n", "ax.annotate(\"effect size $d$={:.3f}\".format(cohens_d), xy=(cohens_d, max_y), xytext=(15, -5), textcoords='offset points', bbox=dict(boxstyle=\"round\", facecolor = \"white\", edgecolor = \"red\", alpha = 0.8))\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Z-score\n", - "To determine this, we will use a simple statistical test called the **z-score**.\n", - "\n", - "First let's 
formulate two hypotheses that we will have to test:\n", - "* Null hypothesis ($H_0$) : The sample belongs to our population, `sample_mean = x_pop_mean`\n", - "* Alternate hypothesis ($H_a$) : The sample does not belongs to our population, `sample_mean != x_pop_mean`" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We first compute the z-score" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If the p-value is < .05 we conclude that either:\n", - "\n", - "* The null hypothesis is false, thus the sample probably does NOT belong to our population, it is probably from a DIFFERENT population\n", - "* The null hypothesis is true, and it's just bad luck if we obtained these values. We cannot make a conclusion.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "### Let's compute the Z-SCORE\n", - "z = (sample_petal_length_mean - tools.iv_petal_length_mean)/(tools.iv_petal_length_std / np.sqrt(tools.sample_size))\n", - "p = 2*(1-stats.norm.cdf(abs(z))) # two tailed test\n", - "print(\"z = {:.3f}\\np = {:.3f}\".format(z, p))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "### Let's visualize the Z-SCORE\n", - "fig, ax = plt.subplots(figsize=(8, 4))\n", - "\n", - "# Let's generate a dataset from the standard normal distribution\n", - "z_norm = stats.norm(loc=0, scale=1)\n", - "x = np.linspace(z_norm.ppf(0.0001), z_norm.ppf(0.9999), 100)\n", - "y = z_norm.pdf(x) \n", - "ax.plot(x, y, color='blue', linewidth=1)\n", - "ax.axvline(x=0, color='blue', linestyle='dashed', linewidth=1)\n", - "\n", - "# Confidence zone 5% two tailed\n", - "x_zone = np.linspace(z_norm.ppf(0.025), z_norm.ppf(0.975), 100) # 5% confidence two tailed\n", - "y_zone = z_norm.pdf(x_zone) \n", - "ax.fill_between(x_zone,y_zone,0, alpha=0.3, color='blue')\n", - "\n", - "# Z-Score of the data and p-value\n", - "ax.axvline(x=z, 
color='green', linestyle='dashed', linewidth=1)\n", - "ax.annotate('z-score={:.3f}\\np-value={:.3f}'.format(z, p), xy=(z, 0), xytext=(-10, 100), textcoords='offset points', bbox=dict(boxstyle=\"round\", facecolor = \"white\", edgecolor = \"green\", alpha = 0.8))\n", - "\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Advanced use" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tools.iris_virginica.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "researchpy.ttest(sample_data[\"petal_length\"], tools.iris_virginica[\"petal_length\"])" - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Please give us your opinion on this notebook!\n", "\n", "Execute the cell below and fill out this very short survey. \n", "Your feedback will help us improve this notebook. Thank you in advance!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame('https://www.surveymonkey.com/r/NOTOSURVEY?notebook_set=HPLnotebooks&notebook_id=HPLstats_notebook', 600, 800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bibliography\n", "\n", "###### [1] E. Anderson (1935). \"The Irises of the Gaspe Peninsula.\" Bulletin of the American Iris Society 59: 2–5.\n", "###### [2] R. A. Fisher (1936). \"The use of multiple measurements in taxonomic problems\". Annals of Eugenics. 7 (2): 179–188. 
doi:10.1111/j.1469-1809.1936.tb02137.x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Other resources\n", "\n", "Hypothesis testing with Python:\n", "\n", "https://www.kaggle.com/jgroff/unit-3-hypothesis-testing\n", "\n", "https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/\n", "\n", "The Iris dataset:\n", "\n", "https://en.wikipedia.org/wiki/Iris_flower_data_set\n", "\n", "https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html\n", "\n", "https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }