diff --git a/HPLstats_notebook.ipynb b/HPLstats_notebook.ipynb index 7703da7..543a372 100644 --- a/HPLstats_notebook.ipynb +++ b/HPLstats_notebook.ipynb @@ -1,5603 +1,789 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " How People Learn - Autumn 2019-2020
\n", " C. Hardebolle, R. Tormey - CC BY NC SA 4.0 International

\n", " How to use this notebook?\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to hypothesis testing\n", "\n", "In this notebook we look at data on a type of flower called Iris and we analyze it using a programming language called Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning goals\n", "\n", "This notebook is designed for you to learn:\n", "* How to distinguish between \"population\" datasets and \"sample\" datasets when dealing with experimental data\n", "* How to compare a sample to a population using a statistical test called the \"t-test\" and interpret its results\n", "* How to evaluate the magnitude of the difference between a sample and a population using a measure of the effect size called \"Cohen's d\" and interpret the results\n", "* How to use Python scripts to make statistical analyses on a dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "
\n", " \"iris\n", "\n", "###### Iris virginica (Credit: Frank Mayfield CC BY-SA 2.0)\n", "\n", "
\n", "\n", "In 1935, an American botanist called Edgar Anderson worked on quantifying the morphological variation of Iris flowers of three related species, Iris setosa, Iris virginica and Iris versicolor [[1]](#Bibliography). For each flower he recorded four measurements (petal length, petal width, sepal length and sepal width) together with its species.\n", "Based on the combination of these four features, a British statistician and biologist named Ronald Fisher developed a model to distinguish the species from each other [[2]](#Bibliography).\n", "\n", "\n", "### Question\n", "A recent series of measurements has been carried out at the [Iris Garden of the Vullierens Castle](https://chateauvullierens.ch/en/) near Lausanne, on a sample of $n=50$ flowers of the Iris virginica species. \n", "**How similar (or different) is the Iris sample from the Vullierens Castle compared to the Iris virginica population documented by Edgar Anderson?**\n", "\n", "\n", "### Instructions \n", "\n", "**1. Read the notebook and execute the code cells**. To check your understanding, ask yourself the following questions:\n", "* Can I explain how the concept of sample differs from the concept of population?\n", "* What does a t-test tell me about my sample?\n", "* What does Cohen's d tell me about my sample?\n", "\n", "**2. Do the exercise** at the end of the notebook. \n", "\n", " \n", "\n", "--- \n", "\n", "## Getting started\n", "\n", "### Python tools for stats\n", "Python comes with a number of libraries for processing data and computing statistics.\n", "To use these tools you first have to load them using the `import` keyword. \n", "The role of the code cell just below is to load the tools that we use in the rest of the notebook. It is important to execute this cell *prior to executing any other cell in the notebook*." 
] }, { "cell_type": "code", - "execution_count": 39, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plotting and display tools\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt \n", "plt.style.use('seaborn-whitegrid') # global style for plotting\n", "\n", "from IPython.display import display, set_matplotlib_formats\n", "set_matplotlib_formats('svg') # vector format for graphs\n", "\n", "# data computation tools\n", "import numpy as np \n", "import pandas as pan\n", "import math\n", "\n", "# statistics tools\n", "import scipy.stats as stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data available on the Anderson dataset\n", "\n", "Anderson has published summary statistics of his dataset. \n", "You have the mean petal length of the Iris virginica species: $\\mu = 5.552 cm$" ] }, { "cell_type": "code", - "execution_count": 40, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "5.552" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# Define mu as mean petal length of iris virginica species from Anderson\n", "mu = 5.552\n", "\n", "# Display mu\n", "mu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data available on the Vullierens sample\n", "\n", "You have the raw data collected on the petal length of the Vullierens sample, which is stored in the CSV file `iris-sample1-vullierens.csv` that you can see in the file explorer in the left pane. \n", "If you double click on the file it will open in a new tab and you can look at what is inside.\n", "\n", "Now to analyze the data using Python you have to read the file:" ] }, { "cell_type": "code", - "execution_count": 41, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table omitted: column petal_length, rows 0 to 4, same values as the text/plain output below]" - ], - "text/plain": [ - " petal_length\n", - "0 5.7\n", - "1 4.5\n", - "2 5.8\n", - "3 5.1\n", - "4 6.2" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# Read the Vullierens sample data from the CSV file\n", "sample_data = pan.read_csv('iris-sample1-vullierens.csv')\n", "\n", "# Display the first few lines of the dataset\n", "sample_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After reading the CSV file, its content is stored in the variable `sample_data`, which is a kind of table. Each line of the table is given an index number to identify it. We see above that, apart from the index, the table contains only one column, called `\"petal_length\"`, which contains all the measurements made on the Vullierens Irises. To get the list of all the values stored in that column, you can use the following syntax: `sample_data[\"petal_length\"]`." ] }, { "cell_type": "code", - "execution_count": 42, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 5.7\n", - "1 4.5\n", - "2 5.8\n", - "3 5.1\n", - "4 6.2\n", - "5 6.1\n", - "6 5.3\n", - "7 5.7\n", - "8 6.4\n", - "9 5.2\n", - "10 5.7\n", - "11 6.7\n", - "12 5.8\n", - "13 5.6\n", - "14 5.6\n", - "15 5.3\n", - "16 6.1\n", - "17 5.7\n", - "18 5.4\n", - "19 5.7\n", - "20 5.0\n", - "21 5.4\n", - "22 5.6\n", - "23 5.8\n", - "24 5.6\n", - "25 5.1\n", - "26 5.7\n", - "27 4.9\n", - "28 5.5\n", - "29 6.2\n", - "30 5.6\n", - "31 6.8\n", - "32 6.1\n", - "33 4.5\n", - "34 6.5\n", - "35 5.8\n", - "36 5.8\n", - "37 5.9\n", - "38 5.5\n", - "39 5.7\n", - "40 6.6\n", - "41 6.0\n", - "42 5.7\n", - "43 6.3\n", - "44 5.8\n", - "45 6.0\n", - "46 5.3\n", - "47 5.4\n", - "48 6.2\n", - "49 5.9\n", - "Name: petal_length, dtype: float64" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# All values stored in the \"petal_length\" 
column of the \"sample_data\" table\n", "sample_data[\"petal_length\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First look at the data\n", "\n", "### Descriptive statistics\n", "\n", "Let's compute some simple descriptive statistics on this sample data. The `describe()` function gives us right away a number of useful descriptive stats:" ] }, { "cell_type": "code", - "execution_count": 43, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table omitted: descriptive statistics of petal_length, same values as the text/plain output below]" - ], - "text/plain": [ - " petal_length\n", - "count 50.000000\n", - "mean 5.716000\n", - "std 0.492954\n", - "min 4.500000\n", - "25% 5.425000\n", - "50% 5.700000\n", - "75% 6.000000\n", - "max 6.800000" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# Compute the descriptive stats\n", "sample_stats = sample_data.describe()\n", "\n", "# Display the result\n", "sample_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access individual elements of the `sample_stats` table using the corresponding names for the line and column of the value. \n", "The following cell illustrates how to get the mean value of the petal length in the sample:" ] }, { "cell_type": "code", - "execution_count": 44, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "5.716" - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# Extract the sample mean from the descriptive stats\n", "sample_mean = sample_stats.loc[\"mean\",\"petal_length\"]\n", "\n", "# Display the result\n", "sample_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizations\n", "\n", "Now let's make some simple visualisations of this data:" ] }, { "cell_type": "code", - "execution_count": 45, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "[SVG markup omitted: figure with three panels titled Samples, Distribution and Quartiles]\n" - ], - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# Create visualisation\n", "fig = plt.figure(figsize=(16, 4))\n", "\n", "# Plot the sample values\n", "ax1 = plt.subplot(131)\n", "ax1.set_xlabel('index of sample')\n", "ax1.set_ylabel('petal length')\n", "ax1.set_title(\"Samples\")\n", "ax1.plot(sample_data[\"petal_length\"], 'go')\n", "# Add the means\n", "ax1.axhline(y=sample_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean $m$\")\n", "ax1.axhline(y=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax1.legend()\n", "\n", "# Plot the distribution of the samples\n", "ax2 = plt.subplot(132)\n", "ax2.set_xlabel('petal length')\n", "ax2.set_ylabel('number of samples')\n", "ax2.set_title(\"Distribution\")\n", "ax2.hist(sample_data[\"petal_length\"], color='green')\n", "# Add the means\n", "ax2.axvline(x=sample_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean $m$\")\n", "ax2.axvline(x=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax2.legend()\n", "\n", "# Box plot with quartiles\n", "ax3 = plt.subplot(133, sharey = ax1)\n", "box = ax3.boxplot(sample_data[\"petal_length\"], sym='k+', patch_artist=True)\n", "ax3.set_ylabel('petal length')\n", "ax3.set_title(\"Quartiles\")\n", "ax3.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)\n", "plt.setp(box['medians'], color='black')\n", "for patch in box['boxes']:\n", " patch.set(facecolor='green')\n", "# Add the means\n", "ax3.axhline(y=sample_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean $m$\")\n", "ax3.axhline(y=mu, color='black', linestyle=':', linewidth=1, label=\"$\\mu$\")\n", "ax3.legend()\n", "\n", "# Display the graph\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the Python code necessary to generate the graphs above is quite long. 
Hiding this particular code cell can improve the readability of the notebook. To hide it, select it and then click on the blue bar at its left. If you want to make the code visible again, simply click on the three \"dots\" or on the blue bar.\n", "\n", "\n", "### Interpretation and hypothesis\n", "\n", "The simple analyses we have made so far allow us to have a preliminary idea about how the Irises from Vullierens compare to those observed by Anderson. In particular, one feature to look at for the comparison is their respective mean petal length and we see above that the mean petal length $m$ of the Vullierens sample is quite close to the mean $\\mu$ reported by Anderson.\n", "\n", "Let's formulate this as a **hypothesis** which we state as: the sample mean $m$ is similar to the mean of the reference population $\\mu$, i.e. $m = \\mu$. This hypothesis is denoted $H_0$ and called the \"null\" hypothesis because it states that there is no difference between the sample and the population. \n", "The \"alternative\" hypothesis $H_a$ is that the sample mean is not similar to the mean of the reference population, $m \\neq \\mu$.\n", "\n", "How can we test our hypothesis? In the following, we use a **statistical test** to answer this question." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "## Testing our hypothesis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our hypothesis we compare the mean of one sample to a reference value. To test this hypothesis we can use a statistical test called a **one-sample t-test**. \n", "\n", "But what does it mean when we test the hypothesis that a sample mean is potentially equal to a given value? \n", "\n", "### Sample versus population\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "To understand this, it is useful to start by thinking about a population, in this case our population of Irises which has a mean petal length of $\\mu = 5.552$ cm.\n", "\n", "Now imagine you take a sample of, say, 50 flowers from this population. The mean petal length of this sample is $m_1 = 6.234$ cm. You then take a second sample of 50, which ends up having a mean petal length of $m_2 = 5.874$ cm. You then take a third sample of 50 which gives you a mean petal length of $m_3 = 5.349$ cm.\n", "\n", "If you keep taking samples from this population, you will start to notice a pattern: while some of the samples will give a mean petal length which is not at all close to the population mean, most of the sample means are reasonably close to the population mean of 5.552 cm. Furthermore, the average of all these sample means will be the same as the mean of the population as a whole, i.e. 5.552 cm. \n", "\n", "In fact, if we keep taking samples from this population, it turns out that the distribution of the sample means follows a very particular pattern that looks like a normal curve. If you take bigger sample sizes (say 130 instead of 50), the distribution gets closer and closer to a normal curve whose mean is equal to the mean of the population. For smaller samples, the distribution is called the **[Student's t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)** (actually it is a family of distributions, whose shape depends on the sample size).\n", "\n", "\n", "This is useful because it allows us to rephrase our question as to how similar or different our sample from Vullierens Castle is to the population of Irises as described by Edgar Anderson. \n", "**What we have from the Vullierens Castle is a sample**. We want to know if it is a sample that might have come from a population like that described by Edgar Anderson. 
We now know the shape (more or less a normal distribution) and the mean (5.552 cm) of all of the samples that could be taken from the population described by Edgar Anderson. **So our question becomes \"where does our sample fall on the distribution of all such sample means?\"**. \n", "If our mean is in position A on the figure on the right, then it is plausible that our sample came from a population like that of Edgar Anderson. If our mean is in position B, then it is less plausible to believe that our sample came from a population like Anderson’s.\n", "\n", "### Cut-off point\n", "\n", "You might be wondering, how far away is far enough away for us to think it is implausible that our sample comes from a population like Anderson’s. The answer is, it depends on how sure you want to be. \n", "\n", "One common answer to this question is to be 95% sure - meaning that a sample mean would need to be in the most extreme 5% of cases before we would think it is implausible that our sample comes from a population like Anderson’s ($\\alpha=0.05$). These most extreme 5% cases are represented by the zones in light blue on the figure on the right. If the sample mean falls into these most extreme zones, we say that *the difference is \"statistically significant\"*.\n", "\n", "A second, common answer is 99% sure meaning that a sample mean would need to be in the most extreme 1% of cases before we would think it is implausible that our sample comes from a population like Anderson’s ($\\alpha=0.01$). \n", "\n", "In the following, **we will work on the basis of being 95% sure**. 
Let's define our $\\alpha=0.05$:" ] }, { "cell_type": "code", - "execution_count": 46, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.05" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# Define alpha\n", "alpha = 0.05\n", "\n", "# Display alpha\n", "alpha" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If our distribution of sample means is a normal curve then we know that the most extreme 5% of sample means are found more than 1.96 standard deviations above or below the mean. In our case, because our sample size is less than 130 (it is 50), our distribution is close to normal but not quite normal. \n", "Still, we can find the relevant cutoff point by looking it up in statistical tables: for a sample size of 50 (49 degrees of freedom), the most extreme 5% of cases are found more than 2.01 standard deviations above or below the mean. Let's define our cutoff point:" ] }, { "cell_type": "code", - "execution_count": 47, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "2.01" - ] - }, - "execution_count": 47, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# Define the cutoff point\n", "cutoff = 2.01\n", "\n", "# Display cutoff\n", "cutoff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Error in the distribution of means\n", "\n", "So far we know a lot that will help us to test the hypothesis that our sample mean is similar to Anderson’s population mean. 
We know:\n", "* Our sample mean $m$\n", "* The population mean $\\mu$\n", "* The shape of the distribution of the mean of all samples that would come from this population (a normal curve, centred on the population mean)\n", "* Our cut off point defined by $\\alpha$ (the most extreme 5% of cases, above or below 2.01 standard deviations from the mean)\n", "\n", "The last piece of information missing that would enable us to test this hypothesis is the size of the standard deviation of the distribution of sample means from Anderson’s population. \n", "It turns out that a good guess for the size of this standard deviation can be obtained from knowing the standard deviation of our sample.\n", "If $s$ is the sample standard deviation of our sample and $n$ is the sample size, then the standard deviation of the distribution of sample means is:\n", "\n", "$\n", "\\begin{align}\n", "\\sigma_{\\overline{X}} = \\frac{s}{\\sqrt{n}}\n", "\\end{align}\n", "$ \n", "\n", "This standard deviation of the distribution of sample means is called the \"standard error of the mean\" (also noted SEM). 
\n", "We can compute it by retrieving the sample size and standard deviation from the descriptive stats we have computed earlier: " ] }, { "cell_type": "code", - "execution_count": 48, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.06971428571428571" - ] - }, - "execution_count": 48, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# Extract the sample size from the descriptive stats generated earlier\n", "sample_size = sample_stats.loc[\"count\",\"petal_length\"]\n", "\n", "# Extract the sample standard deviation from the descriptive stats\n", "sample_std = sample_stats.loc[\"std\",\"petal_length\"]\n", "\n", "# Compute the estimation of the standard deviation of sample means from Anderson's population (standard error)\n", "sem = sample_std / math.sqrt(sample_size)\n", "\n", "# Display the standard error\n", "sem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now restate our question in more precise terms: **\"is our sample mean in the most extreme 5% of samples that would be drawn from a population with the same mean as Anderson’s population?\"**. \n", "Or to be even more precise, **\"is the gap between our sample mean and Anderson’s population mean greater than 2.01 times the standard error?\"**. \n", "\n", "This is equivalent to making the following comparison:\n", "$\n", "\\begin{align}\n", "\\left|\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\\right| \\gt 2.01 \n", "\\end{align}\n", "$\n", ". The ratio inside the absolute value is the definition of the **t** statistic: the value $t = $\n", "$\n", "\\begin{align}\n", "\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", "\\end{align}\n", "$ \n", " has to be compared, in absolute value, to the cutoff point we have chosen to determine whether the sample mean falls into the most extreme zones, i.e. whether the difference is statistically significant.\n", "Let's compute $t$:" ] }, { "cell_type": "code", - "execution_count": 49, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2.352459016393451\n", - "The difference is statistically significant.\n" - ] - } - ], + "outputs": [], "source": [ "# Compute the t statistic:\n", "t = (sample_mean - mu) / sem\n", "\n", "# Display t\n", "print(t)\n", "\n", "# Compare the absolute value of t with our cutoff point (two-sided test)\n", "if abs(t) > cutoff: \n", " print(\"The difference is statistically significant.\")\n", "else: \n", " print(\"The difference is not statistically significant.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that $|t| < 2.01$, therefore the difference between the two means is not greater than 2.01 times the standard error. In other words, our sample mean **is not in the most extreme 5%** of samples that would be drawn from a population with the same mean as Anderson's population. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "What can we conclude from this? What the one-sample t-test tells us is that we don't have evidence which would lead us to think that the sample doesn't come from an Anderson-like population. Therefore we **cannot reject our hypothesis $H_0$**. However, this is not the same as saying that the sample IS from the same population as Anderson's. This is one of the limits of the t-test: like many other statistical tests, **it can be used only to reject a hypothesis**, not to confirm it." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "## Testing our hypothesis using a predefined Python function\n", "\n", "So far we have made the computations by hand but Python comes with a number of libraries with interesting statistical tools. \n", "In particular, the `scipy.stats` module (imported above as `stats`) includes a function for doing a **one-sample t-test** as we have done above. \n", "\n", "### Computation of the test\n", "\n", "Let's now use it and then look at what information it gives us." ] }, { "cell_type": "code", - "execution_count": 50, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "t = 2.352\n", - "p = 0.023\n" - ] - } - ], + "outputs": [], "source": [ "# Compute the t-test\n", "t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu)\n", "\n", "# Display the result\n", "print(\"t = {:.3f}\".format(t))\n", "print(\"p = {:.3f}\".format(p))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function returns two values `t` and `p` which say the same thing but in two different ways:\n", "* `t` tells us where our sample mean falls on the distribution of all the possible sample means for the Anderson population; the absolute value of `t` has to be compared to the `cutoff` value (2.01) to know if our sample mean is in the most extreme 5%.\n", "* `p` (called the \"p-value\") is the probability of getting a sample mean at least as extreme as the one we observe if $H_0$ is true; `p` has to be compared to `alpha` (0.05) to know if our sample mean is in the most extreme 5%.\n", "\n", "### Interpretation of the results\n", "\n", "We see above that `abs(t) < cutoff`, which means that the difference between the two means is smaller than 2.01 times the standard error, and `p > alpha`, which means that the probability of getting a sample mean at least as extreme as ours is higher than 5%, so it cannot be considered as one of the extreme possible values. 
Because they convey the same information, you can use either `t` or `p` to interpret the result of the t-test. In practice, most people use the p-value. \n", "\n", "As expected from the calculations we have made above, the test confirms that the difference between the mean petal length of the Vullierens sample and the mean petal length of Anderson's population is **not statistically significant**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualization\n", "\n", "Using Python we can visualize what that means graphically by plotting the t-distribution of all the possible sample means that would be drawn from a population with the same mean as Anderson's population and showing where `t` is in the distribution compared to the zone defined by our $\\alpha$ of 5%:" ] }, { "cell_type": "code", - "execution_count": 51, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n" - ], - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# Create the t-test visualization\n", "fig, ax = plt.subplots(figsize=(12, 4))\n", "ax.set_title(\"Probability distribution of all possible sample means if $H_0$ is true\")\n", "\n", "# Let's plot the t distribution for this sample size (n - 1 degrees of freedom)\n", "tdist = stats.t(df=sample_size - 1, loc=0, scale=1)\n", "x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100)\n", "y = tdist.pdf(x) \n", "ax.plot(x, y, color='blue', linewidth=1)\n", "\n", "# Polish the look of the graph\n", "ax.get_yaxis().set_visible(False) # hide the y axis\n", "ax.set_ylim(bottom=0) \n", "ax.grid(False) # hide the grid \n", "ax.spines['top'].set_visible(False) # hide the frame except bottom line\n", "ax.spines['right'].set_visible(False)\n", "ax.spines['left'].set_visible(False)\n", "\n", "# Plot the two-tailed rejection zone\n", "x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100)\n", "x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100)\n", "y_zone_1 = tdist.pdf(x_zone_1) \n", "y_zone_2 = tdist.pdf(x_zone_2) \n", "ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='blue', label = r'rejection of $H_0$ with $\\alpha={}$'.format(alpha))\n", "ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='blue')\n", "\n", "# Plot the t-test stats\n", "ax.axvline(x=t, color='green', linestyle='dashed', linewidth=1)\n", "ax.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle=\"round\", facecolor = \"white\", edgecolor = \"green\", alpha = 0.8))\n", "\n", "# Plot the p-value\n", "if t >= 0: x_t = np.linspace(t, tdist.ppf(0.9999), 100)\n", "else: x_t = np.linspace(tdist.ppf(0.0001), t, 100)\n", "y_t = tdist.pdf(x_t) \n", "ax.fill_between(x_t, y_t, 0, facecolor=\"none\", edgecolor=\"green\", hatch=\"///\", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p))\n", "\n", "# Add a legend\n", "ax.legend()\n", "\n", "# 
Display the graph\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "To summarize, to compare the mean of a sample to a reference value from a population, you have to proceed in four main steps:\n", "1. formulate the hypotheses you want to test: the null hypothesis $H_0: m = \\mu$ and its alternative $H_a: m \\neq \\mu$ \n", "1. choose a cut-off point (the significance level), usually $\\alpha = .05$, $\\alpha = .01$ or $\\alpha = .001$ \n", "1. compute the result of the t-test \n", "1. interpret the result: if the p-value is *below* the cut-off point you have chosen, $p \\lt \\alpha$, then $H_0$ should probably be rejected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "## Quantification of the difference between two means: effect size\n", "\n", "So far we have tested our hypothesis regarding the similarity of means between our sample and the population, and the results give us no evidence of a difference.\n", "However, whether this statistical test finds a difference depends on the sample size: with small samples it is harder to find a statistically significant difference.\n", "We therefore need to **distinguish between a difference being statistically significant and a difference being large**.\n", "The t-test is used to assess whether the difference is statistically significant. To assess whether the difference is large we use a measure called the effect size.\n", "\n", "### Computation of the effect size\n", "\n", "The effect size represents the size of the difference between the sample mean and the population mean, taking into account the variation inside the sample. \n", "Cohen's d is one of the existing measures of effect size. 
\n", "With $m$ and $s$ respectively the mean and standard deviation of the sample and $\\mu$ the population mean, Cohen's d can be calculated as follows: \n", "\n", "$\n", "\\begin{align}\n", "d = \\frac{m - \\mu}{s}\n", "\\end{align}\n", "$\n", "\n", "Let's compute the effect size of the difference between the Vullierens sample and Anderson's population." ] }, { "cell_type": "code", - "execution_count": 52, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "d = 0.333\n" - ] - } - ], + "outputs": [], "source": [ "# Compute Cohen's d\n", "d = (sample_mean - mu)/sample_std\n", "\n", "# Display the result\n", "print(\"d = {:.3f}\".format(d))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpretation of the results\n", "\n", "To interpret this result we have to **compare it to thresholds from the literature**. Cohen suggested that $d=0.2$ represents a small effect size, $0.5$ a medium effect size and $0.8$ a large effect size.\n", "For our Vullierens sample the effect size is therefore small. In this case, the difference is also not statistically significant. However, with a larger sample size, it would be possible to have a statistically significant difference which nonetheless would be so small as to be trivial.\n", "\n", "\n", "### Visualization\n", "\n", "Let's visualize the result graphically (again, you can hide the lengthy code by clicking on the blue bar at its left). \n", "The graph below represents the theoretical distributions of the population in blue and of the sample in green. We see below that the two groups largely overlap, which reflects how small the difference is." 
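As a complement to the thresholds above, here is a minimal sketch of how they could be turned into a small helper. The function names `cohens_d` and `label_effect_size` are ours, and the "negligible" label for values below 0.2 is an assumption, since Cohen's thresholds only name small, medium and large effects. The loop at the end uses the fact that the standard error is $s/\sqrt{n}$, so that $t = d \sqrt{n}$ when the same $s$ is used in both formulas, with a hypothetical effect size of 0.1.

```python
import math
import statistics


def cohens_d(values, mu):
    """Cohen's d of a sample against a reference mean: (m - mu) / s."""
    return (statistics.mean(values) - mu) / statistics.stdev(values)


def label_effect_size(d):
    """Map |d| to Cohen's conventional labels (0.2 small, 0.5 medium, 0.8 large).

    The "negligible" label below 0.2 is our own addition.
    """
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical effect size: the same d gives a larger t as the sample size grows
d_example = 0.1
for n in (50, 500, 5000):
    t_example = d_example * math.sqrt(n)
    print("n = {:>4}: d = {:.1f} ({}), t = {:.2f}".format(
        n, d_example, label_effect_size(d_example), t_example))
```

The value of `d_example` never changes, yet `t_example` grows with the square root of the sample size. This is the situation described above: with a large enough sample, a trivial difference can still become statistically significant.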
] }, { "cell_type": "code", - "execution_count": 53, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - 
" \n" - ], - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# Create the visualization for the effect size\n", "fig, ax = plt.subplots(figsize=(8, 4))\n", "\n", "# Plot the normal distribution\n", "norm = stats.norm(loc=0, scale=1)\n", "x = np.linspace(norm.ppf(0.0001), norm.ppf(0.9999), 100)\n", "y = norm.pdf(x) \n", "ax.plot(x, y, color='blue', linewidth=1)\n", "ax.axvline(x=0, color='blue', linestyle='dashed', linewidth=1)\n", "\n", "# Plot the distribution of the sample\n", "norm_d = stats.norm(loc=d, scale=1)\n", "x_d = np.linspace(norm_d.ppf(0.0001), norm_d.ppf(0.9999), 100)\n", "y_d = norm_d.pdf(x_d) \n", "ax.plot(x_d, y_d, color='green', linewidth=1)\n", "ax.axvline(x=d, color='green', linestyle='dashed', linewidth=1)\n", "\n", "# Display the value of Cohen's d\n", "max_y = np.max(y)+.02\n", "ax.plot([0,d], [max_y, max_y], color='red', linewidth=1, marker=\".\")\n", "ax.annotate(\"effect size $d$={:.3f}\".format(d), xy=(d, max_y), xytext=(15, -5), textcoords='offset points', bbox=dict(boxstyle=\"round\", facecolor = \"white\", edgecolor = \"red\", alpha = 0.8))\n", "\n", "# Display the graph\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "## Exercise\n", "\n", "**A. Getting familiar with the code.**\n", "1. In the code cells above, where is the t-test computed using the predefined Python function?\n", "1. What are the two parameters that the t-test function takes as input?\n", "1. If you wanted to change the population mean to a different value, like $\\mu = 5.4$ cm for instance, in which cell would you change it? \n", "1. What is the result of the t-test if you compare the Vullierens sample to a population mean of $\\mu = 5.4$ cm?\n", "\n", "*Change the value of $\\mu$ back to 5.552 before working on the following questions.*\n", "\n", "**B. 
Analyzing another dataset.**\n", "\n", "A researcher from Tokyo sends you the results of a series of measurements she has done on the Irises of the [Meiji Jingu Imperial Gardens](http://www.meijijingu.or.jp/english/nature/2.html). The dataset can be found in the `iris-sample2-meiji.csv` file. \n", "How similar (or different) is the Meiji sample compared to the Iris virginica population documented by Edgar Anderson?\n", "\n", "The following questions are designed to guide you in analyzing this new dataset using this notebook.\n", "\n", "1. Which of the code cells above loads the data from the file containing the Vullierens dataset? Modify it to load the Meiji dataset.\n", "1. Do you need to modify anything else in the code to analyze this new dataset?\n", "1. What can you conclude about the Meiji sample from this analysis?\n", "\n", "**C. Going a bit further in the interpretation of the t-test.**\n", "1. In the code cells above, where is the cut-off point $\\alpha$ defined? Change its value to 0.01 and re-execute the notebook. \n", - "1. How does this affect the result of the t-test for the Meiji sample?" + "1. How does this affect the result of the t-test for the Meiji sample?\n", + "\n", + " \n", + "\n", + "*A solution of the exercise is available [in this file](solution/HPLstats_notebook-solution.ipynb).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "

Bibliography

\n", "\n", "[1] E. Anderson (1935). \"The Irises of the Gaspe Peninsula\". Bulletin of the American Iris Society. 59: 2–5.\n", "\n", "[2] R. A. Fisher (1936). \"The use of multiple measurements in taxonomic problems\". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x\n", "\n", "More about the Iris Dataset on Wikipedia: https://en.wikipedia.org/wiki/Iris_flower_data_set\n", "\n", "*Please note that the datasets used in this notebook have been generated using a random generator; they do not come from real measurements and cannot be used for any research purpose.*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }