diff --git a/StatisticsNotebook-group_B.ipynb b/StatisticsNotebook-group_B.ipynb index 00049fb..dc7b1bd 100644 --- a/StatisticsNotebook-group_B.ipynb +++ b/StatisticsNotebook-group_B.ipynb @@ -1,1438 +1,1952 @@ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Jupyter Notebooks for Teaching and Learning
\n", " C. Hardebolle, P. Jermann, R. Tormey, CC BY-NC-SA 4.0 Int.
\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Introduction to hypothesis testing

\n", "\n", "An important part of the scientific process is to make hypotheses about the world or about the results of experiments. These hypotheses need then to be checked by collecting evidence and making comparisons. Hypothesis testing is a step in this process where statistical tools are used to test hypotheses using data.\n", "\n", "**This notebook is designed for you to learn**:\n", "* How to distinguish between \"population\" datasets and \"sample\" datasets when dealing with experimental data\n", "* How to compare a sample to a population, test a hypothesis using a statistical test called the \"t-test\" and interpret its results\n", "* How to use Python scripts to make statistical analyses on a dataset\n", "\n", "In the following, we will use an example dataset representing series of measurements on a type of flower called Iris." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "
\n", " \"iris\n", "\n", "###### Iris Virginica (Credit: Frank Mayfield CC BY-SA 2.0)\n", "\n", "
\n", "\n", "In 1935, an american botanist called Edgar Anderson worked on quantifying the morphologic variation of Iris flowers of three related species, Iris Setosa, Iris Virginica and Iris Versicolor [[1]](#Bibliography). He realized a series of measures of the petal length, petal width, sepal length, sepal width and species.\n", "Based on the combination of these four features, a British statistician and biologist named Ronald Fisher developed a model to distinguish the species from each other [[2]](#Bibliography).\n", "\n", "## Question\n", "\n", "A recent series of measurements has been carried out at the [Iris Garden of the Vullierens Castle](https://chateauvullierens.ch/en/) near Lausanne, on a sample of 50 flowers of the Iris Virginica species. \n", "**How similar (or different) is the Iris sample from the Vullierens Castle compared to the Iris Virginica population documented by Edgar Anderson?**\n", "\n", "## Instructions\n", "\n", "This notebook will guide you in the use of Python tools for analyzing this experimental dataset and perform statistical tests which are widely used in hypothesis testing. \n", "It includes:\n", - "* **explanations to read** about how to analyze experimental data to answer a research question,\n", + "* **explanations to read** about how to analyze experimental data to answer a research question.\n", "* **code to execute** to illustrate how to perform data analysis using Python.\n", "* **questions** to help you think about what you learn along the way.\n", "\n", "\n", - "**Solutions**. We recommend you to **check your answer** after each question, before moving to the next piece of content." + "**Solutions**. We recommend you to **check your answer** after each question, before moving to the next piece of content. The solutions are made visible by clicking on the \"...\" below a question." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " How to use this notebook?
\n", " \n", + " As a first task, edit the cell below and write 3*4 in it below the comment (#). Once this is done press shift + enter or to execute the cell

\n", + "

You can check your answer by clicking on the \"...\" below.

\n", "
\n", "
\n", "\n", - "While using the notebook, you can also **take notes on a piece of paper** if you feel this is helpful.\n", - "\n", - " \n", - "\n", - "\n", - "--- " + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Edit and execute this cell by writing below this line\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "\n", + "
\n", + " Solution
\n", + " You have written a your first line of pyhon in the cell by typing 3*4 and upon executing the cell the notebook should display the result, the number 12.
\n", + " The lines starting with a hash (#) are considered as comments and are not executed.
\n", + " Get a piece of paper and a pencil and move on to the next cell by clicking shift + enter
\n", + "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python tools for stats\n", "Python comes with a number of libraries for processing data and computing statistics.\n", "To use these tool you first have to load them using the `import` keyword. \n", + "\n", "The role of the code cell just below is to load the tools that we use in the rest of the notebook. It is important to execute this cell *prior to executing any other cell in the notebook*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plotting and display tools\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('seaborn-whitegrid') # global style for plotting\n", "\n", "from IPython.display import display, set_matplotlib_formats\n", "set_matplotlib_formats('svg') # vector format for graphs\n", "\n", "# data computation tools\n", "import numpy as np \n", "import pandas as pan\n", "import math\n", "\n", "# statistics tools\n", "import scipy.stats as stats\n", - "from lib.dataanalysis import * " + "from scipy.stats import t as tdist\n", + "from lib.dataanalysis import * \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data available on the Anderson population\n", "\n", "Anderson has published summary statistics of his dataset. \n", "You have the **mean petal length of the Iris Virginica species** documented by Anderson: $\\mu = 5.552$ cm, which we define in the code below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define mu as mean petal length of Iris Virginica species from Anderson\n", "mu = 5.552\n", "\n", "# Display mu\n", "mu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question
\n", " What does the first line of code above do? And what is the role of the second line of code?
\n", - " How would you do to define another value in the code, for instance the mean petal length of Iris Versicolor $\\mu_{versicolor}= 4.26$ cm?
\n", - " Type your code using the cell below and execute it to test the result. \n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define mu_versicolor here\n" - ] - }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", " The first line of code defines a variable called mu and sets its value to 5.552.
\n", - " The role of the second line of code is to display the value of mu
\n", - " Based on the same model, below is the code to define mu_versicolor with a value of 4.26 and display it. \n", - "
\n", - "\n", - "
\n",
-    "# Define mu_versicolor here\n",
-    "mu_versicolor = 4.26\n",
-    "\n",
-    "# Display beta\n",
-    "mu_versicolor\n",
-    "
" + " The role of the second line of code is to display the value of mu.
\n", + "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data available on the Vullierens sample\n", "\n", - "You have the raw data collected on the petal length and petal width of the Vullierens sample, which is stored in the file `iris-sample-vullierens.csv` that you can see in the file explorer in the left pane. \n", - "If you double click on the file it will open in a new tab and you can look at what is inside.\n", + "You have the raw data collected on the petal length and petal width of the Vullierens sample, which is stored in the file `iris-sample-vullierens-1.csv` that you can see in the file explorer in the left pane. If you double click on the file it will open in a new tab and you can look at what is inside.\n", "\n", "Now to analyze the data using Python you have to read the file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the Vullierens sample data from the CSV file\n", - "sample_data = pan.read_csv('iris-sample-vullierens.csv')\n", + "sample_data = pan.read_csv('iris-sample-vullierens-1.csv')\n", "\n", "# Display the first few lines of the dataset\n", "sample_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "After reading the file, its content is stored in the variable `sample_data`, which is a kind of table. The output above shows us an extract of the table, limited to the first 5 lines. We see above that each line of the table is given an index number to identify it. We also see that, appart from the index, the table contains two columns, called `\"petal_length\"` and `\"petal_width\"`, which contains all the measurements made on the Vullierens Irises.\n", + "After reading the file, its content is stored in the variable `sample_data`, which is a kind of table. The output above shows us an extract of the table, limited to the first 5 lines. We see above that each line of the table is given an index number to identify it. We also see that, appart from the index, the table contains a column, called `\"petal_length_sample1\"`, which contains all the measurements made on the Vullierens Irises.\n", "\n", - "To get the complete list of all the values stored in one specific column such as `\"petal_length\"`, you can use the following syntax: `sample_data[\"petal_length\"]`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# All values stored in the \"petal_length\" column of the \"sample_data\" table\n", - "sample_data[\"petal_length\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Question
\n", - " How would you access the data stored in the other column of this table, named \"petal_width\"?
\n", - " Type and test your code using the cell below.\n", - "

You can check your answer by clicking on the \"...\" below.

\n", - "
" + "To get the complete list of all the values stored in one specific column such as `\"petal_length_sample1\"`, you can use the following syntax: `sample_data[\"petal_length_sample1\"]`. We save these values in a variable called `vuillerens_sample` and then display the values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Access the values stored in the \"petal_width\" column of the \"sample_data\" table\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "jupyter": { - "source_hidden": true - } - }, - "source": [ - "\n", - "
\n", - " Solution
\n", - " Below is the code to access the data stored in the \"petal_width\" column of the table: we simply change the name of the column we want to access. \n", - "
\n", + "# Create a variable called \"first_sample\" which contain\n", + "# the values stored in the \"petal_length_sample1\" \n", + "# of the \"sample_data\" table\n", "\n", - "
\n",
-    "# Access the values stored in the \"petal_width\" column of the \"sample_data\" table\n",
-    "sample_data[\"petal_width\"]\n",
-    "
" + "vuillerens_sample = sample_data[\"petal_length_sample1\"]\n", + "\n", + "# Display the values of vuillerens_sample\n", + "vuillerens_sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# First look at the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Descriptive statistics\n", + "### Descriptive statistics\n", "\n", "A first important step in analyzing data is to get an idea of its basic characteristics using **descriptive statistics** such as the **mean** (i.e. the average value or \"moyenne\" in French) and the **standard deviation** (\"écart-type\" in French, generally abreviated std in English). \n", - "So let's compute some simple descriptive statistics on the Vullierens sample data. The `describe()` function gives us right away a number of useful descriptive statistics for all the columns in our data table:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Compute the descriptive stats\n", - "sample_stats = sample_data.describe()\n", "\n", - "# Display the result\n", - "sample_stats" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Question
\n", - " From the table above, what is the mean value of the petal length in the Vullierens sample?
\n", - " And the standard deviation (std) of the petal length in the Vullierens sample?\n", - "

You can check your answer by clicking on the \"...\" below.

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "jupyter": { - "source_hidden": true - } - }, - "source": [ - "\n", - "
\n", - " Solution
\n", - " From the table above, we can read in the first column, second line that the mean value of the petal length of the Vullierens sample is 5.713045 cm.
\n", - " We can read in the first column, third line that the standard deviation of the petal length is 0.518940 cm.\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can access individual elements of the `sample_stats` table using the corresponding names for the line and column of the value. \n", - "The following cell illustrates how to get the **sample size** (named `count` in the table above):" + "The \"numpy\" library (available as `np`) provides simple functions to obtain the mean `np.mean()`, standard devition `np.std()` and number of observations `np.size()`.\n", + "\n", + "For memory, the mean is defined as:\n", + "\n", + "$\\large \\bar{x} = \\frac{\\sum_{i=1}^{n}{x_i}}{n}$ where n is the number of observations.\n", + "\n", + "And the standard deviation for a sample as:\n", + "\n", + "$\\large s = \\sqrt{\\frac{\\sum_{i=1}^{n}{(x_i-\\bar{x})^2}}{(n-1)}}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Extract the sample mean from the descriptive stats\n", - "sample_size = sample_stats.loc[\"count\",\"petal_length\"]\n", + "# Descriptive statistics\n", + "vuillerens_sample_mean = np.mean(vuillerens_sample) # mean\n", + "vuillerens_sample_std = np.std(vuillerens_sample, ddof=1) # standard deviation, ddof=1 is for (n-1).\n", + "vuillerens_sample_size = np.size(vuillerens_sample) # number of observations\n", "\n", - "# Display the result\n", - "sample_size" + "# in these print statements, the {:.3f} is replaced in the output \n", + "# with the variable that comes in the .format() function\n", + "print(\"Vuillerens sample mean = {:.3f}\".format(vuillerens_sample_mean))\n", + "print(\"Vuillerens sample standard deviation = {:.3f}\".format(vuillerens_sample_std))\n", + "print(\"Vuillerens sample size = {:}\".format(vuillerens_sample_size))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Another interesting information to extract from these descriptive statistics is the **mean value of the petal length** in the sample:" + "\n", + "### You collect a second sample.\n", + "Imagine you go to Vuillerens and get another sample of 50 flowers from the farm. The code in the cell below simulates such a sampling with the function `draw_sample`. It returns 50 new measurements of petal length and saves them in a variable called `my_sample`. \n", + "\n", + "The cell also computes computes the mean, standard deviation and sample size for the second sample.\n", + "\n", + "Finally it displays a histogram of the Vuillerens sample and your own sample side by side. Histograms are useful to visualize the [frequency distribution](https://en.wikipedia.org/wiki/Frequency_distribution) of the sample values: the horizontal axis displays intervals of the variable we are looking at, in our case the petal length, and the vertical axis indicates the number of samples in each interval." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Extract the sample mean of the petal length from the descriptive stats\n", - "sample_mean = sample_stats.loc[\"mean\",\"petal_length\"]\n", + "# You draw a sample of size 50 from the Vuillerens gardens.\n", + "my_sample = draw_sample(50)\n", + "my_sample\n", "\n", - "# Display the result\n", - "sample_mean" + "print(\"My sample measurments:\\n\")\n", + "print(my_sample)\n", + "\n", + "# Compute descriptive statistics\n", + "my_sample_mean = np.mean(my_sample)\n", + "my_sample_std = np.std(my_sample, ddof=1)\n", + "my_sample_size = np.size(my_sample)\n", + "\n", + "print(\"\\nDescriptive statistics:\\n\") \n", + "print(\"My sample mean = {:.3f}\".format(my_sample_mean))\n", + "print(\"My sample standard deviation = {:.3f}\".format(my_sample_std))\n", + "print(\"My sample size = {:}\".format(my_sample_size))\n", + "print(\"\\n\") # an emtpy line\n", + "\n", + "# Plot histograms for Vuillerens and My sample side by side\n", + "plt.figure(figsize=(10, 4))\n", + "\n", + "# The Vuillerens sample\n", + "plt.subplot(1, 2, 1)\n", + "plot_sample_histogram(vuillerens_sample, mu, \"Vuillerens Sample\", \"green\")\n", + "\n", + "# Your sample\n", + "plt.subplot(1, 2, 2)\n", + "plot_sample_histogram(my_sample, mu, \"My Sample\", \"blue\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", - " Question
\n", - " How could you access the value of the standard deviation of the petal length in the sample_stats table?
\n", - " Type and test your code using the cell below.\n", + " Activity
\n", + " Does your sample (in blue) differ from the Vuillerens sample (in green) that was collected first ?

\n", + " Execute the cell above to generate a few samples until you get a sample (my_sample) which has a mean below the population mean. For memory, the population mean was $\\mu=5.552$ and is represented by a black dashed line.

\n", + " How it possible that you get a mean petal length different from the Vuillerens sample, since you are taking flowers from the same place ? And how come both are different from the population mean ($\\mu=5.552$) represented with a dashed black line.
\n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Extract the sample standard deviation of the petal length from the descriptive stats\n" - ] - }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", - " Solution
\n", - " Below is the code to access the value of the standard deviation of the petal length in the sample_stats table: we use the name of the line containing the value, std, and we store the result in a variable called sample_std.\n", - "
\n", - "\n", - "
\n",
-    "# Extract the sample standard deviation of the petal length from the descriptive stats\n",
-    "sample_std = sample_stats.loc[\"std\",\"petal_length\"]\n",
-    "\n",
-    "# Display the result\n",
-    "sample_std\n",
-    "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Visualization\n", - "\n", - "After having looked at simple descriptive statistics, another important step is to **visualize the data**, to better identify its characteristics. \n", - "Histograms are useful to visualize the [frequency distribution](https://en.wikipedia.org/wiki/Frequency_distribution) of the sample values: the horizontal axis displays intervals of the variable we are looking at, in our case the petal length, and the vertical axis indicates the number of samples in each interval." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Plot the histogram representing the distribution of the samples\n", - "plt.hist(sample_data[\"petal_length\"], color=\"green\")\n", - "plt.xticks(np.arange(4.6, 7.2, 0.2))\n", - "\n", - "# Add a vertical line for the sample mean\n", - "plt.axvline(x=sample_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean $m$\")\n", - "\n", - "# Add a vertical line for the population mean\n", - "plt.axvline(x=mu, color='black', linestyle=':', linewidth=1, label=\"population mean $\\mu$\")\n", - "\n", - "# Add a legend\n", - "plt.legend();" + " Answer
\n", + " Everytime we draw a sample from the Iris population (like the Vuillerens sample and the samples you have just picked), we only get a subset of the flowers in the population. The randomness of the sampling causes some variability so that not all samples have exactly a sample mean equal to the population mean $\\mu$. The distribution of petal lengths we have observed with the histograms also is a bit different for each sample. Sometimes, there are extreme values included, some times the distribution looks more symmetrical than others.\n", + "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question
\n", - " From the graph above, how many irises from the Vullierens sample have a petal length between 4.7 and 4.95 cm?
\n", - " How is the mean petal length of the Vullierens sample represented? And the mean of the Anderson population?
\n", + " From the graph above on the left, how many irises from the Vuillerens sample have a petal length between 4.7 and 4.95 cm?
\n", + " How is the mean petal length of the Vuillerens sample represented? And the mean of the Anderson population?
\n", " How close are they to each other?\n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", - " Solution
\n", + " Solution
\n", " The irises with a petal length between 4.7 and 4.95 cm are represented by the first bar of the histogram (counting from the left) and we can read on the vertical axis that there are 5 irises represented in this bar.
\n", - " According to the legend, the mean petal length of the Vullierens sample is represented by a vertical dash-dotted line (-·-·-) and the mean of the Anderson population by a vertical dotted line (·····).
\n", + " According to the legend, the mean petal length of the Vullierens sample is represented by a green vertical dash-dotted line (-·-·-) and the mean of the Anderson population by a black vertical dash-dotted line (-·-·-).
\n", " These two means seem to be quite close to each other, with a difference of around 0.15 cm.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Interpretation and hypothesis\n", + "# Statistical test and hypothesis\n", "\n", "The simple analyses we have made so far allow us to have a preliminary idea about how the Irises from Vullierens compare to those observed by Anderson. One feature to look at for the comparison is their respective mean petal length. We see above that the mean petal length $m$ of the Vullierens sample is quite close to the mean $\\mu$ reported by Anderson. However, we also see that there is some variability in our sample, meaning that some irises in our sample actually have a petal length quite far from that of the Anderson population. So are the two means really that close to each other?\n", "\n", "Let's formulate this as an **hypothesis** which we state as: the sample mean $m$ is similar to the mean of the reference population $\\mu$, which we will note $m = \\mu$ (in this notation, the equal symbol should not be interpreted literally). This hypothesis is noted $H_0$ and called the \"null\" hypothesis because it states that there is no difference between the sample and the population. \n", "The \"alternate\" hypothesis $H_a$ is that the sample mean is not similar to the mean of the reference population, $m \\neq \\mu$.\n", "\n", - "How can we test our hypothesis? In the following, we use a **statistical test** to answer this question.\n", + "$H_0: m = \\mu$\n", "\n", + "$H_a: m \\neq \\mu$\n", + "\n", + "How can we test our hypothesis? In the following, we use a **statistical test** to answer this question.\n", " \n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Testing our hypothesis" + "## Testing our hypothesis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our hypothesis we compare the mean of one sample to a reference value. To test this hypothesis we can use a statistical test called a **one-sample t-test**. \n", "\n", - "But what does it mean when we test the hypothesis that a sample mean is potentially equal to a given value? " + "But what does it mean when we test the hypothesis that a sample mean is potentially equal to a the mean of a *reference population*? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample versus population\n", "\n", "
\n", - "\n", + "\n", "Figure 1. Population and samples.\n", - "\n", - "Figure 2. Distribution of the means of all possible samples coming from a given population (Anderson's population in this case).\n", "
\n", "\n", - "To understand this, it is useful to start by thinking about a population, in this case our population of Irises which has a mean petal length of $\\mu = 5.552$ cm, illustrated by the big black circle on Figure 1 on the right.\n", - "\n", - "Now imagine you take a sample of (i.e. a subset of), say, 50 flowers from this population, represented by the green circle on Figure 1. The mean petal length of this sample is $m_1 = 6.234$ cm. You then take a second sample of 50 flowers (another subset, in blue on Figure 1), which ends up having a mean petal length of $m_2 = 5.874$ cm. You then take a third sample of 50 which gives you a mean petal length of $m_3 = 5.349$ cm, in yellow on Figure 1.\n", - "\n", - "If you keep taking samples from this population, you will start to notice a pattern: while some of the samples will give a mean petal length which is not at all close to the population mean, most of the mean petal lengths are reasonably close to the population mean of 5.552 cm. Furthermore, the mean of the mean petal length of the samples will be the same as that of the population as a whole i.e. 5.552 cm. \n", - "\n", - "In fact, if we keep taking samples from this population, it turns out that the distribution of the mean of these samples will take a very particular pattern that looks like a normal curve, as illustrated by Figure 2 on the right. Actually, if you take bigger sample sizes (say 130 instead of 50) the distribution will get closer and closer to being a normal curve for which the mean is equal to the mean of the population. For these smaller samples, the distribution is called the **[Student's t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)** (actually it is a family of distributions, which depend on the sample size).\n", - "\n", + "To understand this, it is useful to start by thinking about a population, in this case our population of Virginical Irises which has a mean petal length of $\\mu = 5.552$ cm, illustrated by the big black circle on Figure 1 on the right.\n", "\n", - "This is useful because it allows us to rephrase our question as to how similar or different our sample from Vullierens Castle is to the population of Irises as described by Edgar Anderson. \n", - "**What we have from the Vullierens Castle is a sample**. We want to know if it is a sample that might have come from a population like that described by Edgar Anderson. We now know the shape (more or less a normal distribution) and the mean (5.552 cm) of all of the samples that could be taken from the population described by Edgar Anderson. **So our question becomes \"where does our sample fall on the distribution of all such sample means?\"**. \n", - "If our mean is in position A on the figure on the right, then it is plausible that our sample came from a population like that of Edgar Anderson. If our mean is in position B, then it is less plausible to believe that our sample came from a population like Anderson’s.\n", + "Now imagine you take a sample of (i.e. a subset of), say, 50 flowers from this population, represented by the yellow circle on Figure 1. The mean petal length of this sample is $m_1 = 5.713$ cm. You then take a second sample of 50 flowers (another subset, in green on Figure 1), which ends up having a mean petal length of $m_2 = 5.349$ cm. The sample of 50 you have taken just before by running the python code above gives you yet another mean petal length of $m_3 = ???$ cm, in blue on Figure 1.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Significance level and cutoff point\n", - "\n", - "
\n", - "\n", - "Figure 3. Distribution of the means of of all possible samples coming from Anderson's population with zones defined by the significance level $\\alpha=0.05$.
\n", - "
\n", - "\n", - "\n", - "You might be wondering, how far away is far enough away for us to think it is implausible that our sample comes from a population like Anderson’s. The answer is, it depends on how sure you want to be. \n", - "\n", - "One common answer to this question is to be 95% sure - meaning that a sample mean would need to be in the most extreme 5% of cases before we would think it is implausible that our sample comes from a population like Anderson’s. This value of 5% is called **significance level** and it is noted $\\alpha$, with $\\alpha=0.05$. These most extreme 5% cases are represented by the zones in light blue on Figure 3. If the sample mean falls into these most extreme zones, we say that *the difference is \"statistically significant\"*.\n", + "### Distribution of sample means. \n", "\n", - "A second, common answer is 99% sure meaning that a sample mean would need to be in the most extreme 1% of cases before we would think it is implausible that our sample comes from a population like Anderson’s ($\\alpha=0.01$). \n", + "If you keep taking samples from this population, you will start to notice a pattern: while some of the samples will give a mean petal length which is not at all close to the population mean, most of the mean petal lengths are reasonably close to the population mean of 5.552 cm. Furthermore, the mean of the mean petal length of the samples will be the same as that of the population as a whole i.e. 5.552 cm.\n", "\n", - "In the following, **we will work on the basis of being 95% sure**.
\n", - "Let's define our significance level $\\alpha=0.05$:" + "In the cell below we do exactly that, we draw 5000 samples (as defined by `n_samples`) of 50 flowers (as defined by `sample_size`) and for each of them compute the mean. We then collect these means in a list and plot their distribution. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Define alpha at 0.05\n", - "alpha05 = 0.05\n", + "# Let's pick 5000 samples of 50 flowers\n", + "# Where we draw flowers with draw_sample(50)\n", + "\n", + "n_samples = 5000\n", + "sample_size = 50\n", + "\n", + "# We will collect the means for each sample in a list []\n", + "means = []\n", + "samples = []\n", + "\n", + "# For 5000 times in a row\n", + "for i in range(n_samples):\n", + " \n", + " # We go to the garden and pick a sample of 50 flowers and measure the petal length\n", + " # This is the same function that we used before to draw my_sample.\n", + " sample = draw_sample(sample_size)\n", + "\n", + " # We compute the mean petal length\n", + " m = np.mean(sample)\n", + " \n", + " # We append the mean for this sample to the list of means\n", + " means.append(m)\n", + " \n", + " # We keep the samples for later analysis\n", + " samples.append(sample)\n", + "\n", + "# We now have 5000 sample means\n", + "# We print the 20 first ones (printing 5000 would take a lot of space ...)\n", + "\n", + "print(\"\\nEach number is the mean petal length of a sample of 50 flowers.\")\n", + "print (\"We display only the first 20 sample means (out of 5000). :\\n\")\n", + "print(means[0:20])\n", + "print(\"\\nThe mean of the 5000 means is: {:.3f}\\n\\n\".format(np.mean(means)))\n", "\n", - "# Display alpha\n", - "alpha05" + "# We plot a histogram of the 5000 means from samples of 50 flowers \n", + "# that we generated in the cell above\n", + "plot_mean_distribution(mu, means=means, sample_size=sample_size)\n", + " \n", + "# In green the Vuillerens sample\n", + "plt.axvline(x=vuillerens_sample_mean, color='green', linestyle='-.', linewidth=1, label=\"(Vuillerens sample)\")\n", + "\n", + "# In blue your own sample\n", + "plt.axvline(x=my_sample_mean, color='blue', linestyle='-.', linewidth=1, label=\"(My Sample)\");\n", + "\n", + "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", - " Question
\n", - " In the code cell below, create another variable called alpha01 to define a significance level of $\\alpha = 0.01$.\n", + " Activity
\n", + " Run the cell above with different values of nb_samples = (e.g 50, 500, 5000, 50000) and observe the distribution of means.
\n", + " Is the mean of means different from $\\mu$ ?\n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define alpha at 0.01\n" - ] - }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", - " The cell below defines the variable alpha01 and displays it.\n", - "
\n", + " The more samples we take, the more the distribution of their means perfectly matches a bell shaped curve. Also the more samples we take, the closer the mean of their means approaches the population mean $\\mu$.
\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In fact, it turns out that the distribution of the mean of these samples will take a very particular pattern that looks like a normal curve, as illustrated by the figure above. Actually, if you take bigger sample sizes (say 130 instead of 50) the distribution will get closer and closer to being a normal curve for which the mean is equal to the mean of the population. For these smaller samples, the distribution is called **[Student's t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)** (actually it is a family of distributions, which depend on the sample size).\n", "\n", - "
\n",
-    "# Define alpha at 0.01\n",
-    "alpha01 = 0.01\n",
+    "This is useful because it allows us to rephrase our question as to how similar or different our sample from Vullierens Castle is to the population of Irises as described by Edgar Anderson. \n",
+    "**What we have from the Vullierens Castle is a sample**. We want to know if it is a sample that might have come from a population like that described by Edgar Anderson. We now know the shape (more or less a normal distribution) and the mean ($\\mu$=5.552 cm) of all of the samples that could be taken from the population described by Edgar Anderson. \n",
     "\n",
-    "# Display alpha\n",
-    "alpha01\n",
-    "
" + "**So our question becomes \"where does our sample fall on the distribution of all such sample means?\"**. \n", + "\n", + "* If a sample mean is close to the center of the distribution, then it is plausible that the sample came from a population like that of Edgar Anderson. \n", + "* If a sample mean is in to the far right or far left of the distribition (like the Vuillerens sample in green), then it is less plausible to believe that the sample came from a population like Anderson’s." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "If our distribution of sample means is a normal curve then we know that the most extreme 5% of sample means are found above or below ±1.96 standard deviations above and below the mean. In our case, because our sample size is less than 130 (it is 50), our distribution is close to normal but not quite normal. \n", - "In this case, it is possible to find out the relevant cut off point from [looking it up in statistical tables](https://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values): for a sample size of 50, the most extreme 5% of cases are found above or below approximately 2.01 standard deviations from the mean. \n", + "## Significance level and cutoff point\n", + "\n", + "You might be wondering, how far away is far enough away for us to think it is implausible that our sample comes from a population like Anderson’s. The answer is, it depends on how sure you want to be. \n", + "\n", + "One common answer to this question is to be 95% sure - meaning that a sample mean would need to be in the most extreme 5% of cases before we would think it is implausible that our sample comes from a population like Anderson’s. This value of 5% is called **significance level** and it is noted $\\alpha$, with $\\alpha=0.05$. \n", "\n", - "The good news is that **Python gives us automatically the value of the cutoff point** based on the value of the significance level $\\alpha$ chosen and the sample size, thanks to the `stats` library which offers useful functions related to many statistical distributions such as Student's t:" + "We now plot in red the area in the means distribution that corresponds to the proportion $\\alpha$ of most extreme cases. There are $\\frac{\\alpha}{2}=2.5\\%$ cases on the left of the distribution and $\\frac{\\alpha}{2}=2.5\\%$ cases on the right of the distribution. \n", + "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Get the cutoff point for alpha at 0.05\n", - "cutoff05 = stats.t.isf(alpha05 / 2, sample_size)\n", + "alpha = 0.05 # significance level\n", + "plot_mean_distribution(mu, alpha=alpha, means=means)\n", "\n", - "# Display cutoff\n", - "cutoff05" + "# In green the Vuillerens sample\n", + "plt.axvline(x=vuillerens_sample_mean, color='green', linestyle='-.', linewidth=1, label=\"(Vuillerens)\")\n", + "\n", + "# In blue your own sample\n", + "plt.axvline(x=my_sample_mean, color='blue', linestyle='-.', linewidth=1, label=\"(My Sample)\");\n", + "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", - " Question
\n", - " How would you get the value of the cutoff point for the significance level $\\alpha = 0.01$?
\n", - " Type and test your code using the cell below.\n", + " Activity
\n", + " Run the cell above with different values of alpha = (e.g 0.01, 0.10, 0.50) and see how the zones change. You can remove the means=means argument to see only the theoretical distribution without the histogram of our 5000 samples.\n", "

You can check your answer by clicking on the \"...\" below.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Get the cutoff point for alpha at 0.01\n" + "\n" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", - " To get the value of the cutoff point for the significance level of $\\alpha = 0.01$, we can copy the first line of code stats.t.isf(alpha05 / 2, sample_size) and replace alpha05 with the variable alpha01 that we have previously defined.
\n", - " To save the value of this new cutoff point for later, it is good to store it in a new variable cutoff01.
\n", - " See the solution code below.\n", - "
\n", + " The smaller the alpha you choose, the smaller the red zone on the graph. The surface of the red zone corresponds to the proportion $\\alpha$ of sample means that are the most extreme.
\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These most extreme 5% cases are represented by the zones in light red in the figure above. If the sample mean falls into these most extreme zones, we say that *the difference is \"statistically significant\"*.\n", "\n", - "
\n",
-    "# Get the cutoff point for alpha at 0.01\n",
-    "cutoff01 = stats.t.isf(alpha01 / 2, sample_size)\n",
+    "A second, common answer is 99% sure meaning that a sample mean would need to be in the most extreme 1% of cases before we would think it is implausible that our sample comes from a population like Anderson’s ($\\alpha=0.01$).   \n",
     "\n",
-    "# Display cutoff\n",
-    "cutoff01\n",
-    "
" + "In the following, **we will work on the basis of being 95% sure**.
\n", + "Let's define our significance level $\\alpha=0.05$:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Error in the distribution of means\n", + "## Definition of the t value\n", "\n", - "So far we know a lot that will help us to test the hypothesis that our sample mean is similar to Anderson’s population mean. We know:\n", - "* Our sample mean $m$\n", - "* The population mean $\\mu$\n", - "* The shape of the distribution of the mean of all samples that would come from this population (a normal curve, centred on the population mean)\n", - "* Our cut off point defined by $\\alpha$ (the most extreme 5% of cases, above or below 2.01 standard deviations from the mean)\n", + "### Scaling and centering the distribution of means\n", "\n", - "The last piece of information missing that would enable us to test this hypothesis is the size of the standard deviation of the distribution of sample means from Anderson’s population. \n", - "It turns out that a good guess for the size of this standard deviation can be obtained from knowing the standard deviation of our sample.\n", - "If $s$ is the sample standard deviation of our sample and $n$ is the sample size, then the standard deviation of the distribution of sample means is:\n", + "To facilitate the statistical comparison of our sample means with with the population mean, and to measure how far away from the population mean we are in terms of \"standard deviations\" we have to center and scale the distribution of means. This operation consists of substracting the mean and dividing by the standard deviation. \n", "\n", + "This would be equivalent to compute:\n", "$\n", "\\begin{align}\n", - "\\sigma_{\\overline{X}} = \\frac{s}{\\sqrt{n}}\n", + "t = \\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", + "\\end{align}\n", + "$\n", + "\n", + "In our case, we know the mean of our sample $m$, as well as the mean of the population $\\mu$ but we don't know the population's standard deviation. \n", + "\n", + "### Standard error of the mean\n", + "The last piece of information missing that would enable us to compute how many standard deviations separate the population mean $\\mu$ and the sample mean from Vuillerens is the size of the standard deviation of the distribution of sample means from Anderson’s population.\n", + "\n", + "It turns out that a good guess for the size of this standard deviation can be obtained from knowing the standard deviation of our sample. If $s$ is the sample standard deviation of our sample and $n$ is the sample size, then the standard deviation of the distribution of sample means is the **\"standard error of the mean\" (also noted SEM)**:\n", + "\n", + "$\n", + "\\begin{align}\n", + "{SEM} = \\frac{s}{\\sqrt{n}}\n", "\\end{align}\n", "$ \n", "\n", - "This standard deviation of the distribution of sample means is called the **\"standard error of the mean\" (also noted SEM)**. \n", - "We can compute it by using the sample size and the standard deviation from the descriptive stats we have computed earlier: " + "We can compute it by using the sample size and the standard deviation from the descriptive statistics we have computed earlier: \n", + "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Extract the sample standard deviation from the descriptive stats\n", - "sample_std = sample_stats.loc[\"std\",\"petal_length\"]\n", - "\n", "# Compute the estimation of the standard deviation of sample means from Anderson's population (standard error)\n", - "sem = sample_std / math.sqrt(sample_size)\n", "\n", - "# Display the standard error\n", - "sem" + "# In python, the square root of x is obtained with the function np.sqrt(x)\n", + "vuillerens_sem = vuillerens_sample_std / np.sqrt(vuillerens_sample_size)\n", + "\n", + "# Display the standard error of the mean\n", + "vuillerens_sem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", - " Question
\n", - " In the code above, what function is used to compute the square root of the sample size, $\\sqrt{n}$?
\n", - " How would you compute the square root of 2?
\n", - " Type and test your code using the cell below.\n", + " Activity
\n", + " Compute the Standard Error of the Mean for your sample in the cell below and save it in a variable called my_sem.\n", "

You can check your answer by clicking on the \"...\" below.

\n", - "
" + "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Compute the square root of 2 and display the result\n" + "# Compute the sem for your sample\n", + "my_sem = ...\n", + "\n", + "# Display the standard error of the mean\n", + "my_sem" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", - " We see in the code making the calculation of the standard error of the mean (sem) above that the way to get $\\sqrt{n}$ in Python is math.sqrt(sample_size).
\n", - " Therefore we can replace sample_size by 2 to get $\\sqrt{2}$.
\n", - " See the solution code below.\n", + " See the solution code below.
\n", "
\n", "\n", "
\n",
-    "# Compute the square root of 2\n",
-    "sqrt2 = math.sqrt(2)\n",
+    "# Compute the sem for your sample\n",
+    "my_sem = my_sample_std / np.sqrt(my_sample_size)\n",
     "\n",
-    "# Display the result\n",
-    "sqrt2\n",
+    "# Display the standard error of the mean\n",
+    "my_sem\n",
     "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Comparison, and definition of the *t* statistics\n", - "\n", - "We can now restate our question in more precise terms: **\"is our sample mean in the most extreme 5% of samples that would be drawn from a population with the same mean as Anderson’s population?\"**. \n", - "Or to be even more precise, **\"is the gap between our sample mean and Anderson’s population mean greater than 2.01 times the standard error of the mean?\"**. \n", - "\n", - "This would be equivalent to compare\n", - "$\n", - "\\begin{align}\n", - "\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", - "\\end{align}\n", - "$\n", - "to our cutoff point of 2.01. \n", + "### t value\n", + "Now that we have an estimate for the standard deviation of the means $\\sigma_{\\overline{X}}$ we can compute the **t value** for our samples.\n", "\n", - "That is the **definition of the *t* statistics**: the value $t = $\n", "$\n", "\\begin{align}\n", - "\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", + "t = \\frac{m - \\mu}{SEM}\n", "\\end{align}\n", - "$ \n", - " has to be compared to the cutoff point we have chosen to determine if the sample mean falls into the most extreme zones and to be able to say whether the difference is statistically significant or not.
\n", - "Let's compute $t$:" + "$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Compute the t statistics:\n", - "t = (sample_mean - mu) / sem\n", + "# Compute the t value:\n", + "vuillerens_t = (vuillerens_sample_mean - mu) / vuillerens_sem\n", "\n", "# Display t\n", - "t" + "vuillerens_t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now we can compare $t$ to our cutoff point. \n", - "\n", - "One issue here is that **when $m$ is smaller than $\\mu$, the value of $t$ can be negative**. This is because, just like for the Normal distribution, Student's t-distribution is symmetrical and centred on zero, zero meaning there is no difference between the mean of the sample and the mean of the population. So when comparing $t$ to the cutoff point, either we take its absolute value, which is what we do below, or if $t$ is negative we compare it to the negative value of the cutoff point (i.e. -2.01 for a significance level of 0.05)." + "
\n", + " Activity
\n", + " Compute the t value for your sample in the cell below and save it in a variable called my_t.
\n", + " What is the sign of my_t ?\n", + "

You can check your answer by clicking on the \"...\" below.

\n", + "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Compare t with our cutoff point\n", - "if abs(t) > cutoff05: \n", - " print(\"The difference IS statistically significant.\")\n", - "else: \n", - " print(\"The difference is NOT statistically significant.\")" + "# Compute my_t\n", + "my_t = ...\n", + "my_t" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "\n", + "
\n", + " Solution
\n", + " The t-value for your sample (my_t) is negative because the mean of your sample is smaller than $\\mu$, hence the numerator for the t-value is negative.
\n", + " See the solution code below.
\n", + "
\n", + "\n", + "
\n",
+    "# Compute my_t\n",
+    "my_t = (my_sample_mean - mu) / (my_sample_std / np.sqrt(my_sample_size))\n",
+    "my_t\n",
+    "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We see in the results above that for our Vullierens sample $|t| > 2.01$, therefore the difference between the two means is greater than 2.01 times the standard error. In other words, **our sample mean IS in the most extremes 5%** of samples that would be drawn from a population with the same mean as Anderson's population. " + "### Cut-off point and $\\alpha$\n", + "With a normal distribition for example, we know that the most extreme 5% of observations are found above or below ±1.96 standard deviations above and below the mean. In our case, because our sample size is less than 130 (it is 50), our distribution is close to normal but not quite normal.\n", + "\n", + "In this case, it is possible to find out the relevant cut off point from [looking it up in statistical tables](https://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values) for a Student's t distribution. The corresponding t distribution has a different shape for different samples sizes. The parameter used to determine the shape of the t distribution is called *degrees of freedom* and is equal to $n-1$, in our case 50-1 = 49.\n", + "\n", + "The most extreme 5% of cases are found above or below approximately 2.01 standard deviations from the mean. Because there are both positive and negative extreme cases, the cutoff point we are looking for is $t_{\\frac{\\alpha}{2}=0.025} = -2.01$ for the 2.5% negative extremes and $t_{1-\\frac{\\alpha}{2}=0.975} = 2.01$ for the 2.5% positive cases. The cutoff point 2.01 corresponds to most extreme 5% of possible values of |t| (positive and negative).\n", + "\n", + "The good news is that **Python gives us automatically the value of the cutoff point** based on the value of the significance level $\\alpha$ chosen and the sample size, thanks to the `stats` library which offers useful functions related to many statistical distributions such as Student's t.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define alpha and sample size\n", + "alpha = 0.05\n", + "sample_size = 50\n", + "\n", + "# Get the cutoff point for alpha at 0.05 and sample size of 50\n", + "cutoff05 = stats.t.ppf(1 - alpha / 2, sample_size-1)\n", + "\n", + "# Print the cutoff point\n", + "print(\"\\ncutoff05 is the value of t for alpha = 1-({:.3f} / 2) => {:.3f}\\n\".format(alpha, cutoff05))\n", + "\n", + "# Plot the t distribution with cutoff points\n", + "plot_t_distribution(df=49, alpha=alpha);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", - " Question
\n", - " How would you compare $|t|$ to the cutoff point corresponding to a significance level of $\\alpha = 0.01$?
\n", - " Type and test your code using the cell below.\n", + " Activity
\n", + " How would you get the value of the cutoff point for the significance level $\\alpha = 0.01$?
\n", + " How many standard deviations away from the mean will you find the most extreme positive and negative cases ?
\n", + " Type and test your code using the cell below.
\n", + " What happens for other values of $\\alpha$, e.g. 10%, 20% ?\n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Compare t to the cutoff point for alpha=0.01\n" + "# Get the cutoff point for alpha at 0.01\n", + "alpha = ...\n", + "cutoff01 = ...\n", + "\n", + "# Display cutoff\n", + "print(\"\\ncutoff01 is the value of t for alpha ({:.3f} / 2) => {:.3f}\\n\".format(alpha, cutoff01))\n", + "\n", + "# Plot the t distribution with cutoff points\n", + "plot_t_distribution(df=49, alpha=alpha);" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", - " To compare the absolute value of $t$ to the cutoff point corresponding to $\\alpha = 0.01$, we can simply replace cutoff05 in the code above by the variable cutoff01 we have defined earlier with the appropriate value for the cutoff point. See the solution code below.
\n", - " In this case, the comparison would tell us if our sample mean is in the most extremes 1% of samples that would be drawn from a population with the same mean as Anderson's population. We already know that our sample mean IS in the most extremes 5%, but the result here shows that it is NOT in the most extremes 1%. \n", + " To get the value of the cutoff point for the significance level of $\\alpha = 0.01$, we can simply change the value of alpha by editing alpha = 0.01.
\n", + " Smaller values of $\\alpha$ result in larger cutoff values while the red area which corresponds to the most extreme cases becomes smaller.
\n", + " For a large $\\alpha$, e.g. alpha=0.2, the cutoff point is smaller and the area in red is larger.
\n", + " See the solution code below.
\n", "
\n", "\n", "
\n",
-    "# Compare t to the cutoff point for alpha=0.01\n",
-    "if abs(t) > cutoff01: \n",
-    "    print(\"The difference IS statistically significant.\")\n",
-    "else: \n",
-    "    print(\"The difference is NOT statistically significant.\")\n",
+    "alpha = 0.01\n",
+    "cutoff01 = stats.t.isf(alpha / 2, sample_size-1)\n",
+    "\n",
+    "# Display cutoff\n",
+    "print(\"\\ncutoff01 is the value of t for alpha ({:.3f} / 2) => {:.3f}\\n\".format(alpha, cutoff01))\n",
+    "\n",
+    "plot_t_distribution(df=49, alpha=alpha)\n",
     "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The statistical test we have just performed here, where we compare our sample mean to the mean of a population, is called a **one-sample t-test**: *one-sample* because we compare a sample to the mean of a population, and *t-test* because the distribution of all the possible sample means of the population follows a distribution called *Student's t-distribution*. " + "## t-test\n", + "\n", + "We can now restate our question: **\"is our sample mean in the most extreme 5% of samples that would be drawn from a population with the same mean as Anderson’s population?\"** in terms of a t-test: **\"is our t value greater than 2.01 times the standard error of the mean?\"**. \n", + "\n", + "This is equivalent to compare the **t value**\n", + "$\n", + "\\begin{align}\n", + "t = \\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", + "\\end{align}\n", + "$\n", + "to the cutoff point 2.01 (or -2.01).\n", + "\n", + "One issue here is that **when $m$ is smaller than $\\mu$, the value of $t$ can be negative**. This is because, just like for the Normal distribution, Student's t-distribution is symmetrical and centred on zero, zero meaning there is no difference between the mean of the sample and the mean of the population. \n", + "\n", + "So when comparing $t$ to the cutoff point, either we take its absolute value $|t|$, which is what we do below, or if $t$ is negative we compare it to the negative value of the cutoff point (i.e. -2.01 for a significance level of 0.05).\n", + "\n", + "If $|t| > $ cutoff$_\\alpha$ we say:\n", + "* the t-test is statistically significant at the level $\\alpha$\n", + "* we can reject $H_0: m = \\mu$ and accept $H_a: m \\neq \\mu$\n", + "* the mean from our sample is different from the population mean $\\mu$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compare t with our cutoff point\n", + "if abs(vuillerens_t) > cutoff05: \n", + " print(\"The difference IS statistically significant \"+\n", + " \"because the t value |{:.3f}| > {:.3f}\".format(vuillerens_t, cutoff05))\n", + "else: \n", + " print(\"The difference is NOT statistically significant \"+\n", + " \"because the t value |{:.3f}| < {:.3f}\".format(vuillerens_t, cutoff05))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see in the results above that for our Vullierens sample $|t| > 2.01$, therefore the difference between the two means is greater than 2.01 times the standard error. In other words, **our sample mean IS in the most extremes 5%** of samples that would be drawn from a population with the same mean as Anderson's population. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization of *t*\n", "\n", - "Using Python we can visualize what the t-test means graphically by plotting the t-distribution of all the possible sample means that would be drawn from a population with the same mean as Anderson's population and showing where `t` is in the distribution compared to the zone defined by our $\\alpha$ of 5%.\n", + "Using Python we can visualize what the t-test means graphically by plotting the t-distribution of all the possible sample means that would be drawn from a population with the same mean as Anderson's population and showing where `t` is in the distribution compared to the zone defined by our $\\alpha$ of 5%. \n", + "\n", + "### Rejection zones\n", + "\n", + "* If the *t* value falls outside of the rejection zone defined by $\\alpha$, then that means that the difference between our sample mean and the population mean is **not statistically significant**. \n", + "* If it falls into the rejection zone, then the difference is **statistically significant** and the sample should not be considered as coming from the Anderson population under the significance level we have chosen.\n", + "\n", + "The cell below uses an external library to generate a graphical visualization of the result of the t-test for the 2 samples we have used so far." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_t_distribution(df=49, alpha=0.05)\n", + "\n", + "# In green the t-value for the Vuillerens sample\n", + "plt.axvline(x=vuillerens_t, color='green', linestyle='-.', linewidth=1, label=\"t value for Vuillerens\")\n", "\n", - "It the *t* statistics falls outside of the rejection zone defined by $\\alpha$, then that means that the difference between our sample mean and the population mean is not statistically significant. If it falls into the rejection zone, then the difference is statistically significant and the sample should not be considered as coming from the Anderson population under the significance level we have chosen.\n", + "# In blue your own sample\n", + "plt.axvline(x=my_t, color='blue', linestyle='-.', linewidth=1, label=\"t value for My sample\")\n", "\n", - "The cell below uses an external library to generate a graphical visualization of the result of the t-test." + "plt.legend();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Testing for your own sample\n", + "Let's now check for your own sample whether the the t value falls inside the rejection zone." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Visualize graphically the result of the t-test with alpha at 0.05\n", - "visualize_ttest(sample_size, alpha05, t)" + "# Compare your sample t (the second sample t) with the cutoff point\n", + "if abs(my_t) > cutoff05: \n", + " print(\"The difference IS statistically significant because\" + \n", + " \"the t value |{:.3f}| > {:.3f}\".format(my_t, cutoff05))\n", + "else: \n", + " print(\"The difference is NOT statistically significant because\" +\n", + " \" the t value |{:.3f}| < {:.3f}\".format(my_t, cutoff05))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", - " Question
\n", - " What happens to the rejection zone in red on the figure when we choose an $\\alpha$ of 1%?
\n", + " Activity
\n", + " How would you compare $|t|$ from the Vuillerens sample to the cutoff point corresponding to a significance level of $\\alpha = 0.01$?
\n", " Type and test your code using the cell below.\n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Visualize graphically the result of the t-test with alpha at 0.01\n" + "# Compare t to the cutoff point for alpha=0.01\n", + "if ... : \n", + " print(\"The difference IS statistically significant because\" + \n", + " \"the t value |{:.3f}| > {:.3f}\".format(vuillerens_t, cutoff01))\n", + "else: \n", + " print(\"The difference is NOT statistically significant because\" +\n", + " \" the t value |{:.3f}| < {:.3f}\".format(vuillerens_t, cutoff01))\n" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", - " To visualize the rejection zone for an $\\alpha$ of 1%, we can simply replace alpha05 in the code above by the variable alpha01 we have defined earlier. See the solution code below.
\n", - " By comparing the two visualizations, we see that the rejection zone for $\\alpha=0.01$ is much smaller than for $\\alpha=0.05$, which means we want to reject only samples that have a mean extremely different from the mean of the Anderson population.\n", + " To compare the absolute value of $t$ to the cutoff point corresponding to $\\alpha = 0.01$, we can simply replace cutoff05 in the code above by the variable cutoff01 we have defined earlier with the appropriate value for the cutoff point. See the solution code below.

\n", + " With the result you obtained, the comparison tells us that the Vuillerens sample mean is not the most extremes 1% of samples that would be drawn from a population with the same mean as Anderson's population. We already know that our sample mean IS in the most extremes 5%, but the result here shows that it is NOT in the most extremes 1%.\n", "
\n", "\n", "
\n",
-    "# Visualize graphically the result of the t-test with alpha at 0.01\n",
-    "visualize_ttest(sample_size, alpha01, t)\n",
+    "if abs(vuillerens_t) > cutoff01: \n",
+    "    print(\"The difference IS statistically significant because\" + \n",
+    "          \"the t value |{:.3f}| > {:.3f}\".format(my_t, cutoff01))\n",
+    "else: \n",
+    "    print(\"The difference is NOT statistically significant because\" +\n",
+    "          \" the t value |{:.3f}| < {:.3f}\".format(my_t, cutoff01))\n",
     "
" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The statistical test we have just performed here, where we compare our sample mean to the mean of a population, is called a **one-sample t-test**: *one-sample* because we compare a sample to the mean of a population, and *t-test* because the distribution of all the possible sample means of the population follows a distribution called *Student's t-distribution*. " + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", - "What can we conclude from there? What the one sample t-test tells us is that we have evidence which would lead us to think that the sample doesn't come from an Anderson like population. Therefore we **can reject our hypothesis $H_0$**. \n", + "What can we conclude from there? What the one sample t-test tells us is that when we have evidence (a t value greater than a cutoff value) which would lead us to think that the sample doesn't come from an Anderson like population we **can reject our hypothesis $H_0$** and accept the **alternative hypothesis** $H_a$. The $H_a$ states that the sample does not come from an Anderson lilke population. \n", + "\n", + "$H_0: m = \\mu$\n", + "\n", + "$H_a: m \\neq \\mu$\n", + "\n", "\n", "Now there are some limitations to keep in mind when using the one sample t-test, that we will explore in the section below.\n", "\n", - " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Influence of the sample size\n", "\n", "Above, we have seen that $t = $\n", "$\n", "\\begin{align}\n", - "\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", + "\\frac{m - \\mu}{SEM}\n", "\\end{align}\n", - "$ and that $\\sigma_{\\overline{X}} = $\n", + "$ with the standard error of the mean $SEM = $\n", "$\n", "\\begin{align}\n", "\\frac{s}{\\sqrt{n}}\n", "\\end{align}\n", "$.\n", "\n", "Therefore we can rewrite the *t* statistics as:\n", "\n", "$\n", "\\begin{align}\n", "t = \\frac{m - \\mu}{\\frac{s}{\\sqrt{n}}}\n", "\\end{align}\n", "$\n", "\n", "This means that *t* is actually:\n", "\n", "$\n", "\\begin{align}\n", "t = \\frac{m - \\mu}{s}\\sqrt{n}\n", "\\end{align}\n", "$\n", "\n", "From there, we see that the **sample size $n$ influences the value of $t$**: all else being equal (i.e. sample mean, sample standard deviation and population mean), **a larger sample would result in a higher value of $t$** and therefore more chances to find a significant result for the t-test.\n", - "\n" + "\n", + "The shape of the t distribution varies a bit depending on the sample size (for small sample sizes), hence the cutoff point also depends on the sample size. To simplify, we will use a cutoff value of 2.00 to illustrate the relationship of t and n. What sample size would make the value of $t$ reach 2.00, all else being equal (i.e. with identical sample mean, sample standard deviation and population mean)?
\n", + "\n", + "In other words, we are looking for the value of $n$ such as:\n", + "$\n", + "\\begin{align}\n", + "\\frac{m - \\mu}{s}\\sqrt{n} = 2.00\n", + "\\end{align}\n", + "$.\n", + "\n", + "We can rewrite this expression to find $n$, which gives: \n", + "$\n", + "\\begin{align}\n", + "n = \\left(\\frac{2.00 s}{m - \\mu}\\right)^2\n", + "\\end{align}\n", + "$ with $s$ the sample standard deviation, $m$ the sample mean and $\\mu$ the population mean.
\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the Vuillerens sample, we can compute this number as shown below and find out the sample size that would have allowed us to find a statistically significant different petal length compared to the Anderson population mean. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# in Python the x**y notation allows to obtain x to the power of y (in french: x à la puissance y)\n", + "n = ((2.0 * vuillerens_sample_std) / (vuillerens_sample_mean - mu)) ** 2 \n", + "n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So for instance, for our irises from the Vullierens Castle, **a sample of 41 flowers instead of 50** with exactly the same mean and standard deviation for the petal length would be considered as statistically different at a level of $\\alpha=0.05$ from the Anderson population. \n", + "\n", + "This is why when doing experiments, researchers generally try to get samples as large as possible - but of course this has a cost and is not always possible!\n", + "\n", + "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", - " Question
\n", - " For our irises from the Vullierens Castle, which sample size would make the value of $t$ reach our cutoff point of 2.01, all else being equal (i.e. with identical sample mean, sample standard deviation and population mean)?
\n", - " Use the code cell below to write your answer in Python and test it.\n", + " Activity
\n", + " For the irises from your own sample, which sample size would make the value of $t$ reach a t of 2.00, all else being equal (i.e. with identical sample mean, sample standard deviation and population mean)?
\n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Make your calculation in Python here\n" + "# What sample size would you need for the Second sample so that the t reaches the cutoff point of 2.01\n", + "n = ...\n", + "n" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", - "We want to find the value of $n$ that would make $t$ at least equal to 2.01.
\n", - "We know that $t = $\n", - "$\n", - "\\begin{align}\n", - "\\frac{m - \\mu}{s}\\sqrt{n}\n", - "\\end{align}\n", - "$.\n", - "
In other words, we are looking for the value of $n$ such as:\n", - "$\n", - "\\begin{align}\n", - "\\frac{m - \\mu}{s}\\sqrt{n} = 2.01\n", - "\\end{align}\n", - "$.
\n", - "We can rewrite this expression to find $n$, which gives: \n", - "$\n", - "\\begin{align}\n", - "n = \\left(\\frac{2.01 s}{m - \\mu}\\right)^2\n", - "\\end{align}\n", - "$ with $s$ the sample standard deviation, $m$ the sample mean and $\\mu$ the population mean.
\n", - "
\n", - "Then we have to write this in Python using the variables we have defined earlier and display the result. For squaring, either we just replace by a multiplication, which is what we have done below, or we use the Python operator ** for power raising ($x^2$ is then written x ** 2). See the solution code below, in which we have used the variable cutoff05 instead of the raw number 2.01 but that would work too.
\n", - "We obtain n = 41.88, which means that a sample size of 42 flowers or more with the same mean and standard deviation for the petal length would make the t-statistic above our cutoff point.\n", + " To obtain the sample size needed for $t$ to reach the cutoff point corresponding to $\\alpha = 0.05$ with your own sample, we can simply replace sample_mean and sample_std in the code above by the variables my_sample_mean and my_sample_std. See the solution code below.
\n", "
\n", "\n", "
\n",
-    "# Make your calculation in Python here\n",
-    "n = ((cutoff05 * sample_std) / (sample_mean - mu)) * ((cutoff05 * sample_std) / (sample_mean - mu))\n",
+    "n = ((2.0 * my_sample_std) / (my_sample_mean - mu)) ** 2 \n",
+    "n\n",
+    "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Activity
\n", + " How many flowers would you need in your sample to make the cutoff for $\\alpha=0.01$ which is approximately 2.66 ?\n", + "
\n", + "

You can check your answer by clicking on the \"...\" below.

\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compute the number of flowers you would need for your sample's t value \n", + "# to be larger than the cutoff point for alpha = 0.01\n", + "n = ...\n", + "n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "\n", + "
\n", + " Solution
\n", + " To obtain the sample size needed for $t$ to reach the cutoff point corresponding to $\\alpha = 0.01$ with your own sample, we can simply replace cutoff05 in the code above by the variables cutoff01. See the solution code below.
\n", + "
\n", "\n", - "# Display the result\n", + "
\n",
+    "n = ((cutoff01 * my_sample_std) / (my_sample_mean - mu)) ** 2 \n",
     "n\n",
     "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "So for instance, for our irises from the Vullierens Castle, **a sample of 144 flowers instead of 50** with exactly the same mean and standard deviation for the petal length would be considered as statistically different from the Anderson population. \n", + "### Plotting t against sample size\n", + "We now investigate the relationship between the sample size and the corresponding |t| values with a plot that varies the sample size on the x-axis and the corresponding |t| values on the y-axis. The cell below illustrates this for the Vuillerens sample.\n", "\n", - "This is why when doing experiments, researchers generally try to get samples as large as possible - but of course this has a cost and is not always possible!\n", + "The function `plot_n_and_t` is increasing `n` from `from_n` to `to_n` in increments of `step_n`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_n_and_t(vuillerens_sample_mean, vuillerens_sample_std, mu, from_n=10, to_n=60, step_n=5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The plot above illustrates how the t-value changes as a function of the sample size. In the Vuillerens sample, we need **n=41** flowers in the sample to reach a t-value equal to the cutoff point for $\\alpha=0.05$: $t[49]=2.01$." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Activity
\n", + " What smallest possible sample size would be needed to reach a t value close to the cutoff value for $\\alpha=0.01$ which is |2.66| ?
\n", + " Modify the sample size by modifying to_n in the cell above.\n", + "
\n", + "

You can check your answer by clicking on the \"...\" below.

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "\n", + "
\n", + " Solution
\n", + " We see that we would need a sample size of n=69 flowers to reach a t-value approximately equal to the cutoff point for $\\alpha=0.01$: $t \\approx 2.66$.
\n", + "See the solution code below.
\n", + "
\n", "\n", - " \n", + "
\n",
+    "plot_n_and_t(vuillerens_sample_mean, vuillerens_sample_std, mu, from_n=10, to_n=80, step_n=5)\n",
+    "
\n", "\n", - "---" + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Activity
\n", + " For your own sample, modify the values of the sample size (by changing to_n) to find the smallest possible sample size that would be needed to reach a cutoff value of |2.00| and |2.66|.\n", + "
\n", + "

You can check your answer by clicking on the \"...\" below.

\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_n_and_t(my_sample_mean, my_sample_std, mu, from_n=10, to_n=30, step_n=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the *p-value*\n", "\n", "In scientific studies, researchers use frequently the t-test but they generally report not only the t-statistic but also **another result of the t-test which is called the p-value**. In the following, we explore what is the p-value, how it relates to the t-statistic and how it can be used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Testing our hypothesis using a predefined Python function\n", "\n", "So far we have made the computations by hand but Python comes with a number of libraries with interesting statistical tools. \n", "In particular, the `stats` library includes a function for doing a **one-sample t-test** as we have done above. \n", "\n", "Let's now use it and then look at what information it gives us." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute the t-test\n", - "t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu)\n", + "t, p = stats.ttest_1samp(vuillerens_sample, mu)\n", "\n", "# Display the result\n", "print(\"t = {:.3f}\".format(t))\n", - "print(\"p = {:.3f}\".format(p))" + "print(\"p = {:.3f}\".format(p))\n", + "\n", + "print(\"\\nWe had computed by hand vuillerens_t = {:.3f}\".format(vuillerens_t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We see that the predefined Python function for doing the one-sample t-test gives us the same value for the $t$ statistic as the calculations we have made by hand: $t = 2.194$. \n", + "We see that the predefined Python function for doing the one-sample t-test gives us the same value for the $t$ statistic as the calculations we have made by hand for the Vuillerens sample: $t = 2.194$. \n", "In addition, we see that it also returns another value, $p = 0.033$. \n", "\n", "Actually, the two values `t` and `p` returned by the function say the same thing but in two different ways:\n", "* `t` tells us where our sample mean falls on the distribution of all the possible sample means for the Anderson population ;
\n", " `t` has to be compared to the cutoff value (2.01) to know if our sample mean is in the most extremes 5%.\n", "* `p` is **called the \"p-value\"** and is the **probability to get a more extreme sample mean** than the one we observe ;
\n", " `p` has to be compared to $\\alpha$ (0.05) to know if our sample mean is in the most extremes 5%.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Question
\n", " How does t compare to the cutoff value (2.01)?
\n", " And how does p compare to $\\alpha$ (0.05)?
\n", "

You can check your answer by clicking on the \"...\" below.

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", " \n", "We see above that:\n", - "* $t = 2.294$ therefore $|t| > 2.01$, which means that the difference between the two means is larger than 2.01 times the standard error \n", + "* $t = 2.194$ therefore $|t| > 2.01$, which means that the difference between the two means is larger than 2.01 times the standard error \n", "* and $p = 0.033$ therefore $p < 0.05$, which means that the probability of getting more extreme sample mean than the one we observe is smaller than 5% so our sample mean can be considered as one of the 5% most extreme possible values. \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "As expected from the calculations we have made by hand above, the test using the predefined Python function confirms that the difference between the mean petal length of the Vullierens sample and the mean petal length of Anderson's population is **not statistically significant**.\n", + "As expected from the calculations we have made by hand above, the test using the predefined Python function confirms that the difference between the mean petal length of the Vullierens sample and the mean petal length of Anderson's population is **statistically significant** at the $\\alpha=0.05$ level.\n", "\n", "As we have just seen, **you can use either `t` or `p` to interpret the result of the t-test.** In practice, most people use the p-value because it can be directly compared to $\\alpha$ without having to look for the cutoff value in tables. However, as we will see more in details below, **`t` and `p` do not provide exactly the same information about the result of the test**, and it is important to understand how they differ." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Activity
\n", + " Run the t test for your own sample and interpret the results.
\n", + " How large is |t| ?
\n", + " What is the sign of t ?
\n", + " How small is the p-value ?
\n", + " Is the test significant ?\n", + "
\n", + "

You can check your answer by clicking on the \"...\" below.

\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compute the t-test for your own sample (my_sample).\n", + "my_t, my_p = stats.ttest_1samp(my_sample, mu)\n", + "\n", + "# Display the result\n", + "print(\"t = {:.3f}\".format(my_t))\n", + "print(\"p = {:.3f}\".format(my_p))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "\n", + "
\n", + " Solution
\n", + " \n", + "The test is significant at the alpha=0.05 level: \n", + "* If $|t| > 2.01$, it means that the difference between the two means is larger than 2.01 times the standard error \n", + "* and if $p < 0.05$, it means that the probability of getting more extreme sample mean than the one we observe is smaller than 5% so our sample mean can be considered as one of the 5% most extreme possible values. \n", + "
\n", + "
\n", + "\n", + "The test is not significant: \n", + "* If $|t| < 2.01$, it means that the difference between the two means is smaller than 2.01 times the standard error \n", + "* and if $p > 0.05$, it means that the probability of getting more extreme sample mean than the one we observe is larger than 5% so our sample mean cannot be considered as one of the 5% most extreme possible values. \n", + "\n", + "In addition:\n", + "* If $t$ is positive, it means the sample mean is larger than the population mean $\\mu$. \n", + "* If $t$ is negative, the sample mean is smaller than the population mean $\\mu$.\n", + "\n", + "
\n", + "\n", + "
\n",
+    "# Compute the t-test for your own sample (my_sample).\n",
+    "my_t, my_p = stats.ttest_1samp(my_sample, mu)\n",
+    "\n",
+    "# Display the result\n",
+    "print(\"t = {:.3f}\".format(my_t))\n",
+    "print(\"p = {:.3f}\".format(my_p))\n",
+    "
" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization of the p-value\n", "\n", - "Using Python we can visualize what the t-test graphically by plotting the t-distribution of all the possible sample means that would be drawn from a population with the same mean as Anderson's population and showing where `t` is in the distribution compared to the zone defined by our $\\alpha$ of 5%.\n", + "Using Python we can visualize the t-test graphically by plotting the t-distribution of all the possible sample means that would be drawn from a population with the same mean as Anderson's population and showing where the `t values` from our samples are in the distribution compared to the zone defined by our $\\alpha$ of 5%.\n", "\n", - "In addition to displaying the value of *t*, the visualization below also **shows the *p-value*** (represented by the hatched zone), which is the **area under the curve of the t-distribution** representing the probability of getting a more extreme sample mean than the one we observe. When this area is larger than the rejection zone defined by the $\\alpha$ we have chosen, then that means the difference between the sample mean and the population mean is not statistically significant." + "In addition to displaying the value of *t*, the visualization below also **shows the *p-value*** (represented by the hatched zones left and right), which is the **area under the curve of the t-distribution** representing the probability of getting a more extreme sample mean than the one we observe. When this area is larger than the rejection zone defined by the $\\alpha$ we have chosen, then that means the difference between the sample mean and the population mean is not statistically significant." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize graphically the result of the t-test and the p-value with alpha at 0.05\n", - "visualize_ttest_pvalue(sample_size, alpha05, t, p)" + "plot_t_test(vuillerens_sample, mu, alpha=0.05)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "
\n", - " Question
\n", - " What values does the function visualize_ttest_pvalue need to generate a visualization of the result of a t-test?
\n", - " In the code cell below, use this function to generate the visualization of a value of $t=-1.702$ and $p=0.095$ with the same sample size and same value for $\\alpha$ as in the example above.
\n", - " What can you observe with this negative value for $t$?
\n", - " How would you say that the p-value (the hatched zone) evolves when $|t|$ gets bigger?
\n", - "

You can check your answer by clicking on the \"...\" below.

\n", - "
" + "We see that:\n", + "* p < $\\alpha$ \n", + "* the t value is larger than the cutoff value for $\\alpha=0.05$\n", + "* the hatched zone is included in the red zone. \n", + "* the test is statistically **signifiant**. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Importance of the choice of $\\alpha$\n", + "\n", + "In the cell below we illustrate the **influence of the choice of $\\alpha$** by let looking at the rejection zones and the cutoff values for $\\alpha=0.01$." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize graphically the result of the t-test and the p-value with alpha at 0.01\n", + "plot_t_test(vuillerens_sample, mu, alpha=0.01)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that:\n", + "* p > $\\alpha$ \n", + "* the t value is smaller than the cutoff value for $\\alpha=0.01$\n", + "* the hatched zone is not included in the red zone. \n", + "* the test is statistically **not signifiant**. \n", + "\n", + "It is striking to see that the same Vuillerens sample leads to a t-test which is sgnificant (at $\\alpha=0.05$) or not significant (at $\\alpha=0.01$) depending on the choice of $\\alpha$. Depending on \"how certain do we want to be that the sample mean is different from the population mean $\\mu$\", the conclusion of the test is different. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Your own sample\n", + "Let's now visualise the t-test for your own sample with $\\alpha=0.05$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Visualize graphically the result of the t-test and the p-value with alpha at 0.01\n" + "plot_t_test(my_sample, mu, alpha=0.05)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Activity
\n", + " Is the test significant ?
\n", + " How can you tell, how does p compare to $\\alpha$ ?
\n", + " How does t compare to the cutoff t ?\n", + "
\n", + "
" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "
\n", " Solution
\n", - " The visualize_ttest_pvalue function needs 4 different values to generate the visualization: the sample size, the significance level $\\alpha$, the value of $t$ and the value of $p$.
\n", - " To generate a new visualization, we can simply copy-paste the visualize_ttest_pvalue function of the code cell above and replace alpha05 by alpha01, see the solution code below.
\n", - " When comparing with the previous graph, we see that the hatched zone representing $p$ is larger than the red zone that represents alpha.\n", - "\n", - "
\n", - "\n", - "
\n",
-    "# Visualize graphically the result of a t-test of t=-1.702 and p=0.095\n",
-    "visualize_ttest_pvalue(sample_size,alpha01, t, p)\n",
-    "
\n" + " \n", + "The t-test is statistically **not signifiant** if:\n", + "* p > $\\alpha$ \n", + "* the t value is smaller than the cutoff value for $\\alpha=0.05$\n", + "* the hatched zone is not included in the red zone. \n", + "\n", + "The t-test is statistically **signifiant** if:\n", + "* p < $\\alpha$ \n", + "* the t value is larger than the cutoff value for $\\alpha=0.05$\n", + "* the hatched zone is included in the red zone. \n", + " \n", + "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Thanks to the visualization above, we see that one important difference between the t-statistic and the p-value is that that $|t|$ and $p$ evolve in opposite directions: the bigger $|t|$ is, the smaller$p$ is.\n", + "### What the t value and p value tell us\n", + "Thanks to the visualizations above, we see that one important difference between the t-statistic and the p-value is that that $|t|$ and $p$ evolve in opposite directions: the bigger $|t|$ is, the smaller$p$ is.\n", "\n", "Another important difference, is that **the t-statistic tells us whether the sample mean $m$ is greater or smaller than the population mean $\\mu$** whereas this is impossible to know with the p-value only: since the p-value corresponds to the area under the curve of the t-distribution, it is always positive. \n", "As we have seen earlier, the t-distribution is centred on zero, with zero meaning $m = \\mu$ and:\n", "* when $t > 0$ (i.e. $t$ is on the *right* side of the distribution on the visualization above) it means that $m > \\mu$ ;\n", "* when $t < 0$ (i.e. $t$ is on the *left* side of the distribution on the visualization above) it means that $m < \\mu$.\n", "\n", "\n", "\n", " \n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Importance of the choice of $\\alpha$\n", + "# Two tailed and one tailed t-test\n", + "\n", + "So far we have discussed **two tailed tests**. They are called this way because we were testing whether the t values computed from samples fall in the **upper** or **lower** tails of a Student's t distribution. Hence, two tails. The hypothesis we were testing so far was:\n", + "\n", + "$H_0: m = \\mu$\n", + "\n", + "$H_a: m \\neq \\mu$\n", + "\n", + "With **one tailed tests** we have a more precise hypothesis about the sign of the difference between $m$ and $\\mu$. We have good reasons to think that m will be either **smaller** or **greater** than $\\mu$ and that the t values will fall in either the upper or the lower tail of the t distribution.\n", + "\n", + "For an upper tail test:\n", + "\n", + "$H_0: m \\leq \\mu$\n", + "\n", + "$H_a: m > \\mu$\n", + "\n", + "\n", + "For a lower tail test:\n", "\n", - "So far we have seen two important points to keep in mind when using the t-test to compare a sample to a population: first the size of the sample matters and second the t-test provides us with two pieces of information, the t-statistic and the p-value, which are both useful but in different ways. In this section, we look at a third important point to keep in mind when doing statistical testing: the **influence of the choice of $\\alpha$**." + "$H_0: m \\geq \\mu$\n", + "\n", + "$H_a: m < \\mu$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Let's compare our Vullierens sample to another population\n", - "\n", - "
\n", - " \"Iris\n", + "## Upper tailed test\n", + "With a one tailed test, the rejection zone for the test is situated on one side of the t distribution only. For an **upper tailed** test it is situated **on the right** and corresponds to the proportion of $\\alpha$ most extreme **positive** cases as shown in the graph below. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_t_distribution(50, alpha=0.05, tail=\"upper\");\n", "\n", - "###### Iris Ensata (Credit: Laitche CC BY-SA 3.0)\n", + "# In green the Vuillerens sample\n", + "plt.axvline(x=vuillerens_t, color='green', linestyle='-.', linewidth=1, label=\"(Vuillerens)\")\n", + "plt.legend();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we were doing two-tailed tests, the cutoff value for $\\alpha = 5\\%$ was 2.01 (or -2.01). Now, with a one-tailed test all the 5% extremes are on the same side and the cutoff value for $\\alpha=5\\%$ is **1.67**. This smaller cutoff value will be **easier to beat** in comparison with a two-tailed test." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# We look up the value of the t distribution for alpha = 0.05 to find the cutoff value\n", + "onesided_cutoff05 = stats.t.ppf(1-0.05, sample_size-1)\n", + "onesided_cutoff05 " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Testing for the efficiency of a fertilizer\n", "\n", + "
\n", + " \"Miracle\n", "
\n", "\n", - "Let's imagine we want to know how our Vullierens sample compares to another iris population, for instance a Japanese iris species called Iris Ensata with a mean petal length of $\\mu_{ensata} = 5.832$ cm. We can apply the hypothesis testing approach that we have just learned and use a one-sample t-test to do the comparison, which we do in 4 steps:\n", "\n", - "1. Get an idea about how the sample compares to the population: \n", - " At first sight, the sample petal mean of our Vullierens sample, $m= 5.646$ cm, is again quite close to the mean petal length of the Ensata population, $\\mu_{ensata} = 5.832$ cm. \n", + "Imagine that the Vuillerens sample was picked in a field where the gardeners have used a new type of fertilizer that boosts the growth of the flowers. We now have good reasons to believe that the fertilizer will increase the flower size and hence we can make a one tailed null hypothesis that our sample mean will be smaller than the population mean $\\mu$. \n", + "\n", + "This may sound a bit contradictory, but even though we believe the flowers will be larger, we can only reject null hypotheses and hence we state the null hypothesis that the sample mean will be smaller or equal to $\\mu$. \n", + "\n", + "$H_0: m_{Vuillerens} \\leq \\mu$\n", "\n", - "2. Formulate the hypotheses we want to test:\n", - " * First let's state our null hypothesis, which is that the Vullierens sample is similar to the Iris Ensata population, $H_0: m = \\mu_{ensata}$. \n", - " This is the hypothesis we want to know whether we can reject or not.\n", - " * And then state the alternate hypothesis $H_a: m \\neq \\mu_{ensata}$.\n", + "$H_a: m_{Vuillerens} > \\mu$\n", "\n", + "The calculations for the test are very similar to the two-tailed situation. \n", "\n", - "3. Choose a significance level: \n", - " Let's choose $\\alpha=0.05$ as previously, which means we want to be 95% sure.\n", + "We compare the **t value**\n", + "$\n", + "\\begin{align}\n", + "t = \\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", + "\\end{align}\n", + "$\n", + "to the cutoff point $t_{\\alpha=0.05}$=1.67.\n", "\n", - "4. Compute the result of the t-test by using the predefined Python function stats.ttest_1samp :" + "If the t value is larger than the cutoff value, then we can conclude: \n", + "* that we can reject $H_0$ and accept $H_a$\n", + "* that our sample falls in the 5% most extreme **positive** cases of the t distribution\n", + "* that the sample mean is larger than $\\mu$\n", + "* and that the test is statistically significant at the level $\\alpha=0.05$ \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Doing the test\n", + "We proceed as with the two-tailed test, except that we do not need to take the absolute value of |t| before comparing to the cutoff point. We know that we expect a positive t value since we conduct an upper tailed test." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Define the mean petal length of the Ensata population\n", - "mu_ensata = 5.832\n", - "\n", - "# Compute the t-test comparing the Vullierens sample petal length to the Ensata population mean\n", - "t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu_ensata)\n", - "\n", - "# Display the result\n", - "print(\"t = {:.3f}\".format(t))\n", - "print(\"p = {:.3f}\".format(p))" + "if vuillerens_t > onesided_cutoff05: \n", + " print(\"The difference IS statistically significant because \" + \n", + " \"the t value {:.3f} > {:.3f}\".format(vuillerens_t, onesided_cutoff05))\n", + "else: \n", + " print(\"The difference is NOT statistically significant because \" +\n", + " \" the t value {:.3f} < {:.3f}\".format(vuillerens_t, onesided_cutoff05))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_t_test(vuillerens_sample, mu, alpha=0.05, tail=\"upper\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The result of the t-test gives $t = -1.621$ and $p = 0.111$.\n", - "\n", - "With $\\alpha=0.05$, the cutoff value is 2.01. We see that $|t| < 2.01$ and $p > 0.05$. Therefore, the test tells us that the difference between the mean petal length of the Vullierens sample and the mean petal length of the Ensata population IS NOT statistically significant. In other words, we cannot reject the hypothesis that the Vullierens sample is similar to the Ensata population." + "We see from the plot above that the t-test is **significant**:\n", + "* the t value (2.194) is larger than the cutoff value (1.677).\n", + "* the p value (0.016) is smaller than the $\\alpha$ level (0.05).\n", + "* the hatched area that corresponds to the p-value is inside the rejection zone in red." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## What is the role of $\\alpha$ in this result?\n", + "## Lower tailed test\n", + "For a **lower tailed** test the rejection zone is situated **on the left** and corresponds to the proportion of $\\alpha$ most extreme **negative** cases as shown in the graph below.\n", + "\n", + "The corresponding hypothesis could be for example that the sample you have collected yourself was picked from a field that is in the shade and therefore could lead to smaller flowers. We state the null hypothesis $H_0$ that the sample mean is greater or equal to the population mean $\\mu$.\n", + "\n", + "$H_0: m_{YourSample} \\geq \\mu$\n", + "\n", + "$H_a: m_{YourSample} < \\mu$\n", "\n", - "Now let's ask ourselves **what would have been the conclusion of the test if we had chosen a significance level of $\\alpha=0.01$**, i.e. if we wanted to be 99% sure?" + "Since we do a lower tailed test, we actually are testing whether the t value for our sample falls **below** the cutoff on the left side of the distribution. We compare the t value to the the negative cutoff point $t_{\\alpha=0.05}$= -1.67.\n", + "\n", + "If the t value is smaller than the cutoff value, then we can conclude: \n", + "* that we can reject $H_0$\n", + "* that our sample falls in the 5% most extreme **positive** cases of the t distribution\n", + "* and that the difference is statistically significant at the level $\\alpha=0.05$ \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "For $\\alpha = 0.01$, the cutoff value which we get from the tables is 2.67. With this choice of $\\alpha$, we see that $|t| < 2.67$ and $p > 0.01$. This means that when choosing $\\alpha = 0.01$, the test tells us that the difference between the mean petal length of the Vullierens sample and the mean petal length the Ensata population is NOT statistically significant either. This is quite obvious, since the $t$ we have to \"beat\" is event larger with $\\alpha = 0.01$ than for $\\alpha = 0.05$." + "
\n", + " Activity
\n", + " Implement the python code needed to test the null hypothesis for your own sample my_sample for which you had computed my_t.
\n", + " Is the test significant ? How can you tell, how does p compare to $\\alpha$ ?
\n", + " How does t compare to the cutoff t ?\n", + "
\n", + "

You can check your answer by clicking on the \"...\" below.

\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if ... : \n", + " print(\"The difference IS statistically significant because \" + \n", + " \"the t value {:.3f} < {:.3f}\".format(my_t, -onesided_cutoff05))\n", + "else: \n", + " print(\"The difference is NOT statistically significant because \" +\n", + " \" the t value {:.3f} > {:.3f}\".format(my_t, -onesided_cutoff05))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot the t-test\n", + "plot_t_test(my_sample, mu, alpha=0.05, tail=\"lower\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "\n", + "
\n", + " Solution
\n", + "The test is significant at the alpha=0.05 level: \n", + "\n", + "* If $t < -1.677$, it means that the difference between the two means is larger than 1.67 times the standard error \n", + "* and if $p < 0.05$, it means that the probability of getting more a extreme negative sample mean than the one we observe is smaller than 5% so our sample mean can be considered as one of the 5% most extreme possible negative values. \n", + "
\n", + "
\n", + "\n", + "The test is not significant: \n", + "* If $t > -1.677$, it means that the difference between the two means is smaller than 1.677 times the standard error \n", + "* and if $p > 0.05$, it means that the probability of getting more extreme negative sample mean than the one we observe is larger than 5% so our sample mean cannot be considered as one of the 5% most extreme possible negative values. \n", + "\n", + "See the solution code below.\n", + "
\n", + "\n", + "\n", + "
\n",
+    "if my_t < -onesided_cutoff05: \n",
+    "    print(\"The difference IS statistically significant because \" + \n",
+    "          \"the t value {:.3f} < {:.3f}\".format(my_t, -onesided_cutoff05))\n",
+    "else: \n",
+    "    print(\"The difference is NOT statistically significant because \" +\n",
+    "          \" the t value {:.3f} > {:.3f}\".format(my_t, -onesided_cutoff05))\n",
+    "          \n",
+    "plot_t_test(my_sample, mu, alpha=0.05, tail=\"lower\")\n",
+    "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "\n", "In this notebook, you have seen how to compare a sample to a population using an approach called **hypothesis testing** and using a statistical test called a **one-sample t-test**.\n", "\n", "To summarize, to compare the mean of a sample to a reference value from a population, you have to proceed in four main steps:\n", "1. Look at descriptive statistics and visualizations of the sample you have to get an idea about how it compares to the population\n", - "1. Formulate the hypothese you want to test: the null hypothesis $H_0: m = \\mu$ and its alternate $H_a: m \\neq \\mu$ \n", + "1. Formulate the hypothese you want to test: \n", + " * For two-tailed tests the null hypothesis $H_0: m = \\mu$ and its alternate $H_a: m \\neq \\mu$\n", + " * For upper-tailed tests the null hypothesis $H_0: m \\leq \\mu$ and its alternative $H_a: m > \\mu$\n", + " * For lower-tailed tests the null hypothesis $H_0: m \\geq \\mu$ and its alternative $H_a: m < \\mu$\n", "1. Choose a significance level for being sure, usually $\\alpha = 0.05$ or $\\alpha = 0.01$, or even $\\alpha = 0.001$ \n", - "1. Compute the result of the t-test and interpret the result - in particular if the p-value is *below* the significance level you have chosen, $p \\lt \\alpha$, then it means $H_0$ should probably be rejected" + "1. Determine the cutoff value for your given $\\alpha$ level.\n", + "1. Compute the result of the t-test and interpret the result\n", + " * if the |t| value is *larger* than the cutoff value for the the given $\\alpha$ level, then $H_0$ should probably be rejected. \n", + " * if the p-value is *below* the significance level you have chosen, $p \\lt \\alpha$, then it means $H_0$ should probably be rejected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---\n", "\n", "

Bibliography

\n", "\n", "[1] E. Anderson (1935). \"The Irises of the Gaspe Peninsula.\" Bulletin of the American Iris Society 59: 2–5.\n", "\n", "[2] R. A. Fisher (1936). \"The use of multiple measurements in taxonomic problems\". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x\n", "\n", "More about the Iris Dataset on Wikipedia: https://en.wikipedia.org/wiki/Iris_flower_data_set\n", "\n", "*Please note that the datasets used in this notebook have been generated using a random generator, it does not come from real measurement and cannot be used for any research purpose.*" ] } ], "metadata": { + "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.9" + "version": "3.8.3" }, "toc-autonumbering": true }, "nbformat": 4, "nbformat_minor": 4 } diff --git a/figs/miracle-gro-plant-flower-fertilizer.jpg b/figs/miracle-gro-plant-flower-fertilizer.jpg new file mode 100644 index 0000000..5364478 Binary files /dev/null and b/figs/miracle-gro-plant-flower-fertilizer.jpg differ diff --git a/figs/sample-plot-e.png b/figs/sample-plot-e.png new file mode 100644 index 0000000..8e5f63f Binary files /dev/null and b/figs/sample-plot-e.png differ diff --git a/iris-sample-vullierens-1.csv b/iris-sample-vullierens-1.csv new file mode 100644 index 0000000..d6c3da0 --- /dev/null +++ b/iris-sample-vullierens-1.csv @@ -0,0 +1,51 @@ +petal_length_sample1 +5.090980681 +5.224431282 +7.25161967 +5.607932251 +6.118801442 +6.35250702 +4.896926283 +5.220963626 +6.235352422 +6.200243713 +5.42281245 +5.296982791 +4.694441479 +5.911687455 +5.958682935 +5.764168553 +6.035652869 +6.848298609 +6.286982307 +5.117292272 +4.918408193 +5.663513771 +6.056574275 +6.075640695 +5.619982249 +6.091000498 +5.621478087 +5.20792709 +5.410301802 +5.714093194 +5.601680818 +5.706328904 +5.536061308 +5.742188216 +5.496693462 +5.520261845 +4.736357129 +5.445665795 +5.818556633 +6.115244705 +6.010444064 +5.692230766 +5.477746447 +5.620406098 +5.936959746 +6.194875692 +6.349759588 +4.781600864 +5.692977066 +6.260550251 \ No newline at end of file diff --git a/lib/dataanalysis.py b/lib/dataanalysis.py index eff6e9b..36ba204 100644 --- a/lib/dataanalysis.py +++ b/lib/dataanalysis.py @@ -1,201 +1,601 @@ import numpy as np from ipywidgets import interact, interactive, fixed, interact_manual from ipywidgets import HBox, VBox, Label, Layout import ipywidgets as widgets from IPython.display import IFrame from IPython.display import set_matplotlib_formats, display, Math, Markdown, Latex, HTML set_matplotlib_formats('svg') # Enable interactive backend for matplotlib from IPython import get_ipython get_ipython().run_line_magic('matplotlib', 'inline') import matplotlib.pyplot as plt import matplotlib.patches as pat import matplotlib.ticker as ticker plt.style.use('seaborn-whitegrid') # global style for plotting +from matplotlib.ticker import MultipleLocator + import scipy.stats as stats def visualize_ttest(sample_size, alpha, t): # Create the t-test visualization fig, ax = plt.subplots(figsize=(12, 4)) ax.set_title("Probability distribution of all possible sample means if $H_0$ is true") # Let's plot the T distribution for this sample size tdist = stats.t(df=sample_size, loc=0, scale=1) x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100) y = tdist.pdf(x) ax.plot(x, y, color='black', linewidth=1) # Polish the look of the graph ax.get_yaxis().set_visible(False) # hide the y axis ax.set_ylim(bottom=0) ax.grid(False) # hide the grid ax.spines['top'].set_visible(False) # hide the frame except bottom line ax.spines['right'].set_visible(False) ax.spines['left'].set_visible(False) # Plot the rejection zone two tailed x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100) x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100) y_zone_1 = tdist.pdf(x_zone_1) y_zone_2 = tdist.pdf(x_zone_2) ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', label = r'rejection of $H_0$ with $\alpha={}$'.format(alpha)) ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red') # Plot the t-test stat ax.axvline(x=t, color='firebrick', linestyle='dashed', linewidth=1) ax.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) # Add a legend ax.legend() # Display the graph plt.show() - def visualize_ttest_pvalue(sample_size, alpha, t, p): # Create the t-test visualization fig, ax = plt.subplots(figsize=(12, 4)) ax.set_title("Probability distribution of all possible sample means if $H_0$ is true") # Let's plot the T distribution for this sample size - tdist = stats.t(df=sample_size, loc=0, scale=1) + tdist = stats.t(df=sample_size-1, loc=0, scale=1) x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100) y = tdist.pdf(x) ax.plot(x, y, color='black', linewidth=1) # Polish the look of the graph ax.get_yaxis().set_visible(False) # hide the y axis ax.set_ylim(bottom=0) ax.grid(False) # hide the grid ax.spines['top'].set_visible(False) # hide the frame except bottom line ax.spines['right'].set_visible(False) ax.spines['left'].set_visible(False) # Plot the rejection zone two tailed x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100) x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100) y_zone_1 = tdist.pdf(x_zone_1) y_zone_2 = tdist.pdf(x_zone_2) ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', label = r'rejection of $H_0$ with $\alpha={}$'.format(alpha)) ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red') # Plot the t-test stats ax.axvline(x=t, color='firebrick', linestyle='dashed', linewidth=1) ax.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) # Plot the p-value if t >= 0: x_t = np.linspace(t, tdist.ppf(0.9999), 100) else: x_t = np.linspace(tdist.ppf(0.0001), t, 100) y_t = tdist.pdf(x_t) ax.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p)) # Add a legend ax.legend() # Display the graph plt.show() +def visualize_ttest2(): + return True + + +def draw_sample(sample_size, mu=5.552, sigma=0.56068): + # sigma = 0.56068 + # mu = 5.552 + sample_data = sigma * np.random.randn(sample_size) + mu + return sample_data + +def plot_sample_histogram(sample, mu, title, color="grey"): + + plt.title("Histogram of " + title) + plt.hist(sample, color="lightgrey") + #plt.xticks(np.arange(4.6, 7.2, 0.2)) + + # Add a vertical line for the population mean + plt.axvline(x=mu, color='black', linestyle='-.', linewidth=2, + label="population mean $\mu$") + # Add a vertical line for the Vuillerens sample mean + plt.axvline(x=np.mean(sample), color=color, linestyle='-.', linewidth=2, + label= title + " mean $m$") + plt.legend() + + +def plot_t_distribution(df, alpha=None, tail="two", loc=0, scale=1): + + fig, ax = plt.subplots(figsize=(10, 4)) + + plt.title('t distribution for {:} degrees of freedom'.format(df)) + plt.xlabel('t') + plt.ylabel('Probability density') + + tdist = stats.t(df=df, loc=loc, scale=scale) + + # Get 100 values along the x axis from the least probable t value (0.0001) to the most probable t value (0.9999) + x = (np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100)) + + # Plot the corresponding probabilities to get these t-values + ax.plot(x, tdist.pdf(x), color="firebrick",linestyle='-', lw=1, alpha=1) # label='t[{:}]'.format(df) + ax.grid(b=None, which = 'major', axis='y') + ax.set_ylim(bottom=0) + + if (not alpha == None): + + if (tail=="two"): + + low_cutoff = tdist.ppf(alpha/2) + low_p = tdist.pdf(low_cutoff) + + ax.axvline(x=low_cutoff, color='firebrick', linestyle='-.', linewidth=1) + ax.annotate("Cutoff $t$={:.3f}\nfor $\\alpha/2$={:.3f}".format(low_cutoff,alpha/2), + xy=(low_cutoff, low_p), + xytext=(-80, 10), + textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + high_cutoff = tdist.ppf(1-alpha/2) + high_p = tdist.pdf(high_cutoff) + ax.axvline(x=high_cutoff, color='firebrick', linestyle='-.', linewidth=1) + ax.annotate("Cutoff $t$={:.3f} \nfor $1-\\alpha/2$={:.3f}".format(high_cutoff,1-alpha/2), + xy=(high_cutoff, high_p), + xytext=(5, 10), + textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100) + x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100) + y_zone_1 = tdist.pdf(x_zone_1) + y_zone_2 = tdist.pdf(x_zone_2) + ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', zorder=10) + ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red', zorder=10) + + elif (tail == "lower"): + low_cutoff = tdist.ppf(alpha) + low_p = tdist.pdf(low_cutoff) + + ax.axvline(x=low_cutoff, color='firebrick', linestyle='-.', linewidth=1) + ax.annotate("Cutoff $t$={:.3f} \nfor $\\alpha$={:.3f}".format(low_cutoff,alpha), + xy=(low_cutoff, low_p), + xytext=(-80, 10), + textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha), 100) + y_zone_1 = tdist.pdf(x_zone_1) + ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', zorder=10) + + + elif (tail == "upper"): + high_cutoff = tdist.ppf(1-alpha) + high_p = tdist.pdf(high_cutoff) + ax.axvline(x=high_cutoff, color='firebrick', linestyle='-.', linewidth=1) + ax.annotate("Cutoff $t$={:.3f}\nfor $1-\\alpha$={:.3f}".format(high_cutoff,1-alpha), + xy=(high_cutoff, high_p), + xytext=(5, 10), + textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + x_zone_2 = np.linspace(tdist.ppf(1-alpha), tdist.ppf(0.9999), 100) + y_zone_2 = tdist.pdf(x_zone_2) + ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red', zorder=10) + + return ax + +def plot_t_test(sample, mu, alpha=None, tail="two"): + + sample_size = np.size(sample) + df = sample_size - 1 + + ax = plot_t_distribution(df, alpha=alpha, tail=tail) + t, p = stats.ttest_1samp(sample, mu) + + tdist = stats.t(df=df, loc=0, scale=1) + + if (tail == "two"): + cutoff = tdist.ppf(alpha/2) + else: + cutoff = tdist.ppf(alpha) + + cutoff_y = tdist.pdf(cutoff) + + + if (not tail == "two"): + if (tail=="lower" and t < 0): + p = p/2 + elif (tail == "lower" and t > 0): + p = 1 - p/2 + elif(tail == "upper" and t < 0): + p = 1 - p/2 + elif(tail == "upper" and t > 0): + p = p/2 + + + + # Plot the p-value + #if t >= 0: x_t = np.linspace(t, tdist.ppf(0.9999), 100) + #else: x_t = np.linspace(tdist.ppf(0.0001), t, 100) + + x_t = np.linspace(t, tdist.ppf(0.9999), 100) + + if (tail =="two" and t < 0): + x_t = np.linspace(-t, tdist.ppf(0.9999), 100) + + y_t = tdist.pdf(x_t) + if (tail == "two" or tail == "upper"): + ax.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p)) + + x_t = np.linspace(tdist.ppf(0.0001), t, 100) + + if (tail=="two" and t >=0): + x_t = np.linspace(tdist.ppf(0.0001), -t, 100) + + y_t = tdist.pdf(x_t) + if (tail == "two" or tail == "lower"): + ax.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p)) + + ax.axvline(x=t, color='firebrick', linestyle='-', linewidth=1) + + if (tail =="two"): + ## Add a small vertical segment for the side of the hatching which is not on the side of t + point1 = [-t, 0] + point2 = [-t, tdist.pdf(-t)] + + x_values = [point1[0], point2[0]] + y_values = [point1[1], point2[1]] + ax.plot(x_values, y_values, color="firebrick", linestyle='-', linewidth=1) + + + + if (p < 0.3): + annotation_y = p + else: + annotation_y = 0.3 + + annotation_offset = 0 + # are they overlapping ? + if (np.abs(cutoff_y - annotation_y) < 0.05): + if (annotation_y > cutoff_y): + annotation_offset = 25 + else: + annotation_offset = -5 + + ax.annotate('t-test $t$={:.3f}, p={:.3f}'.format(t,p), + xy=(t, annotation_y), + xytext=(5, annotation_offset), + textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + + + +def plot_mean_distribution(mu, alpha=None, means=None, sample_size=50): + + plt.figure(figsize=(8, 4)) + + plt.title('Distribution of sample means for a population with $\mu$= {:}'.format(mu)) + plt.xlabel('Mean of samples') + plt.ylabel('Count') + + if (means == None): + + x = np.linspace(5.2, 5.9, 100) + + loc = mu + scale = 0.0796 + df = sample_size - 1 + + tdist = stats.t.pdf(x, sample_size-1, loc=mu, scale=0.0796) ## * (prop_tot * (n_samples / len(nn))) + scale_factor = 1.0 + frame1 = plt.gca() + frame1.plot(x, tdist, color='black', linestyle='-', lw=1, alpha=1); + frame1.set_ylim([0,6]) + frame1.axes.get_yaxis().set_visible(False) + + # In black the theoretical population mean mu + plt.axvline(x=mu, color='black', linestyle='-.', linewidth=2) + + # Display mu + frame1.annotate("$\mu$={:.3f}".format(mu), + xy=(mu, 5.5), + xytext=(-20, 0), + textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "black", alpha = 0.8)) + + else: + # We get a list of means + + # A histogram of the means + (nn,bb,pp) = plt.hist(means, color="lightgrey", bins=100, density=False); + + ### Control + #(nn,bb) = np.histogram(means, bins=100, density=False) + + # Add the t-distribution + # The positions in the histogram + x = np.linspace(bb[1], bb[len(bb)-1], len(bb-1)) + + frame1 = plt.gca() + + loc = mu + scale = np.std(means, ddof=1) + df = sample_size - 1 + + tdist = stats.t.pdf(x, df, loc=mu, scale=scale) ## * (prop_tot * (n_samples / len(nn))) + + scale_factor = np.size(means) / np.sum(tdist) + tdist = tdist * scale_factor + + plt.plot(x, tdist, color='black', linestyle='-', lw=1, alpha=1); + + frame1.axes.get_yaxis().set_visible(False) + frame1.set_ylim(bottom=0) + + # In black the theoretical population mean mu + plt.axvline(x=mu, color='black', linestyle='-.', linewidth=2) + + # Display mu + frame1.annotate("$\mu$={:.3f}".format(mu), + xy=(mu, np.max(nn)), + xytext=(-20, 0), + textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "black", alpha = 0.8)) + + if (not alpha == None): + df = sample_size - 1 + add_rejection_zones(frame1, loc, scale, df, alpha, scale_factor) + +def add_rejection_zones(ax, loc, scale, df, alpha, scale_factor): + + mytdist = stats.t(df=df, loc=loc, scale=scale) + # Plot the rejection zone two tailed + x_zone_1 = np.linspace(mytdist.ppf(0.0001), mytdist.ppf(alpha/2), 100) + x_zone_2 = np.linspace(mytdist.ppf(1-alpha/2), mytdist.ppf(0.9999), 100) + y_zone_1 = mytdist.pdf(x_zone_1) * scale_factor + y_zone_2 = mytdist.pdf(x_zone_2) * scale_factor + ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', zorder=10) + ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red', zorder=10) + + #ax.axvline(x=t, color='firebrick', linestyle='dashed', linewidth=1) + #ax.annotate('cut-off $t$={:.3f}'.format(mytdist.ppf(alpha/2)), + # xy=(mytdist.ppf(alpha/2), 0), + # xytext=(mytdist.ppf(alpha/2), mytdist.pdf(alpha/2)), textcoords='offset points', + # bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + +def plot_n_and_t(sample_mean, sample_std, mu, from_n, to_n, step_n, plotp=False, df=49, alpha=0.05): + + ###### + # Compute t-values (t) for different values of sample size (n) + # by using the Vuillerens sample_mean and sample_std. + + nvalues = [] + tvalues = [] + pvalues = [] + + # For sample sizes ranging from 10 to 60 in steps of 5 + for n in np.arange(from_n, to_n, step_n): + + # Compute the t-value if the sample size was n and with the + # sample_mean and sample_std from the Vuillerens sample + t = (sample_mean - mu) / (sample_std/np.sqrt(n)) + + # Collect the n and t values + + df = n-1 + p = stats.t.pdf(t, n-1) + + pvalues.append(p) + nvalues.append(n) + tvalues.append(t) + + ###### + # Plot the relation between sample size and t-value and p-value + fig, ax = plt.subplots(figsize=(8, 4)) + + # Set the axes + plt.title('$|t|$ as a function of sample size (from {:} to {:})'.format(from_n, to_n)) + + ax.set_xlabel('Sample size', color="black") + ax.set_ylabel('|t|', color="firebrick") + ax.tick_params(axis='y', labelcolor="firebrick") + + spacing = step_n # This can be your user specified spacing. + minorLocator = MultipleLocator(spacing) + # Set minor tick locations. + ax.xaxis.set_minor_locator(minorLocator) + # Set grid to use minor tick locations. + ax.grid(which = 'minor', axis = 'x', linestyle='dotted') + ax.grid(which = 'major', axis='x', linestyle='-') + ax.grid(b=None, which = 'major', axis='y') + + + # Plot the t-values that correspond to different samples sizes + ax.plot(nvalues, np.abs(tvalues), color="firebrick") + + tdist = stats.t(df=50-1, loc=0, scale=1) + cutoff = tdist.ppf(1-alpha/2) + + + # Draw a horizontal line for the cutoff point at alpha = 0.025 (upper tail) + # If our t is negative, the cutoff point can also be negative (lower tail) + #if (tvalues[0] < 0): + # cutoff = -cutoff + + plt.rcParams['path.sketch'] = (1, 100, 2) + ax.axhline(y=2.01, color='firebrick', linestyle='-.', linewidth=1); + plt.rcParams['path.sketch'] = None + ax.annotate('cutoff $t_{\\alpha=0.05} \\approx 2.00$', xy=(10, 2.00), xytext=(10, 0), textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + + plt.rcParams['path.sketch'] = (1, 100, 2) + ax.axhline(y=2.61, color='firebrick', linestyle='-.', linewidth=1); + plt.rcParams['path.sketch'] = None + ax.annotate('cutoff $t_{\\alpha=0.01} \\approx 2.66$', xy=(10, 2.66), xytext=(10, 0), textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) + + + # Draw a vertical grey line at the sample size that corresponds to the cutoff point for alpha=0.05 + #n = ((cutoff * sample_std) / (sample_mean - mu)) ** 2 + #ax.axvline(x=n, color='grey', linestyle='-.', linewidth=1); + + if (plotp): + + ax2 = ax.twinx() + + # Plot the p-values that correspond to different sample sizes + ax2.plot(nvalues, pvalues, color="blue") + + ax2.set_xlabel("Sample Size") + ax2.set_ylabel('p-value', color="blue") + ax2.tick_params(axis='y', labelcolor="blue") + + ax2.grid(which = 'major', axis='x', linestyle='-') + + + ax2.axhline(y=0.05, color='blue', linestyle='-.', linewidth=1); + ax2.annotate('$\\alpha = 0.05$', xy=(to_n, 0.05), xytext=(-40, 0), textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "blue", alpha = 0.8)) + + ax2.axhline(y=0.01, color='blue', linestyle='-.', linewidth=1); + ax2.annotate('$\\alpha = 0.01$', xy=(to_n, 0.01), xytext=(-40, 0), textcoords='offset points', + bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "blue", alpha = 0.8)) + + + #ax2.axhline(y=alpha, color='blue', linestyle='-.', linewidth=2); + + + +def one_sample_one_tailed(sample_data, mu, alpha=0.05, alternative='greater'): + t, p = stats.ttest_1samp(sample_data, mu) + p = p*2 + print ('t:',t) + print ('p:',p) + if alternative == 'greater' and (p < alpha) and t > 0: + print ('Reject Null Hypothesis for greater-than test') + if alternative == 'less' and (p < alpha) and t < 0: + print ('Reject Null Hypothesis for less-thane test') ###### TO DELETE def build_ttest_visualization(ttest_result, alpha): # Extract information from the result of the t-test n = round(ttest_result.loc["T-test","dof"]) t = ttest_result.loc["T-test","T"] p = ttest_result.loc["T-test","p-val"] d = ttest_result.loc["T-test","cohen-d"] # Create the figure fig = plt.figure(figsize=(14, 4)) ### 1. Create the t-test visualization ax1 = plt.subplot(121) ax1.set_title("Result of the t-test") # Let's plot the T distribution for this sample size tdist = stats.t(df=n, loc=0, scale=1) x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100) y = tdist.pdf(x) ax1.plot(x, y, color='black', linewidth=1) # Polish the look of the graph ax1.get_yaxis().set_visible(False) # hide the y axis ax1.set_ylim(bottom=0) ax1.grid(False) # hide the grid ax1.spines['top'].set_visible(False) # hide the frame except bottom line ax1.spines['right'].set_visible(False) ax1.spines['left'].set_visible(False) # Plot the rejection zone two tailed x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100) x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100) y_zone_1 = tdist.pdf(x_zone_1) y_zone_2 = tdist.pdf(x_zone_2) ax1.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', label = r'threshold $\alpha={}$'.format(alpha)) ax1.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red') # Plot the t-test stats ax1.axvline(x=t, color='firebrick', linestyle='dashed', linewidth=1) ax1.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8)) # Plot the p-value if t >= 0: x_t = np.linspace(t, tdist.ppf(0.9999), 100) else: x_t = np.linspace(tdist.ppf(0.0001), t, 100) y_t = tdist.pdf(x_t) ax1.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p)) # Add a legend ax1.legend(loc='upper right') ### 2. Create the effect size visualization ax2 = plt.subplot(122) ax2.set_title("Effect size") # Plot the theoretical distribution of first sample norm = stats.norm(loc=0, scale=1) x = np.linspace(norm.ppf(0.0001), norm.ppf(0.9999), 100) y = norm.pdf(x) ax2.plot(x, y, color='black', alpha=0.3, linewidth=1) ax2.fill_between(x, y, 0, color='blue', alpha=0.3, label = 'Year 1 (theoretical)') ax2.axvline(x=0, color='blue', alpha=0.5, linestyle='dashed', linewidth=1) # Plot the theoretical distribution of second sample (if t > 0 means 2 < 1 so we plot the second sample on the left) loc_d = -d if t > 0 else d norm_d = stats.norm(loc=loc_d, scale=1) x_d = np.linspace(norm_d.ppf(0.0001), norm_d.ppf(0.9999), 100) y_d = norm_d.pdf(x_d) ax2.plot(x_d, y_d, color='black', alpha=0.3, linewidth=1) ax2.fill_between(x_d, y_d, 0, color='green', alpha=0.3, label = 'Year 2 (theoretical)') ax2.axvline(x=loc_d, color='green', alpha=0.5, linestyle='dashed', linewidth=1) # Display the value of Cohen's d max_y = np.max(y)+.02 ax2.plot([0,loc_d], [max_y, max_y], color='red', linewidth=1, marker=".") ax2.annotate("effect size $d$={:.3f}".format(d), xy=(loc_d, max_y), xytext=(15, -5), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "red", alpha = 0.8)) # Polish the look of the graph ax2.get_yaxis().set_visible(False) # hide the y axis ax2.set_ylim(bottom=0) ax2.grid(False) # hide the grid ax2.spines['top'].set_visible(False) # hide the frame except bottom line ax2.spines['right'].set_visible(False) ax2.spines['left'].set_visible(False) # Add a legend ax2.legend(loc='upper left') # Display the graph plt.subplots_adjust(wspace=.1) plt.show() # EOF