{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
Introduction to hypothesis testing
\n", "\n", "An important part of the scientific process is to make hypotheses about the world or about the results of experiments. These hypotheses then need to be checked by collecting evidence and making comparisons. Hypothesis testing is a step in this process where statistical tools are used to test hypotheses using data.\n", "\n", "**This notebook is designed for you to learn**:\n", "* How to distinguish between \"population\" datasets and \"sample\" datasets when dealing with experimental data\n", "* How to compare a sample to a population, test a hypothesis using a statistical test called the \"t-test\" and interpret its results\n", "* How to use Python scripts to perform statistical analyses on a dataset\n", "\n", "In the following, we will use an example dataset representing a series of measurements on a type of flower called Iris." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "shift + enter
.You can check your answer by clicking on the \"...\" below.
\n", "mu
and sets its value to 5.552
.mu
mu_versicolor
with a value of 4.26
and display it. \n",
"\n", "# Define mu_versicolor here\n", "mu_versicolor = 4.26\n", "\n", "# Display beta\n", "mu_versicolor\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data available on the Vullierens sample\n", "\n", "You have the raw data collected on the petal length and petal width of the Vullierens sample, which is stored in the file `iris-sample-vullierens.csv` that you can see in the file explorer in the left pane. \n", "If you double click on the file it will open in a new tab and you can look at what is inside.\n", "\n", "Now to analyze the data using Python you have to read the file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the Vullierens sample data from the CSV file\n", "sample_data = pan.read_csv('iris-sample-vullierens.csv')\n", "\n", "# Display the first few lines of the dataset\n", "sample_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After reading the file, its content is stored in the variable `sample_data`, which is a kind of table. The output above shows us an extract of the table, limited to the first 5 lines. We see above that each line of the table is given an index number to identify it. We also see that, appart from the index, the table contains two columns, called `\"petal_length\"` and `\"petal_width\"`, which contains all the measurements made on the Vullierens Irises.\n", "\n", "To get the complete list of all the values stored in one specific column such as `\"petal_length\"`, you can use the following syntax: `sample_data[\"petal_length\"]`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# All values stored in the \"petal_length\" column of the \"sample_data\" table\n", "sample_data[\"petal_length\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\"petal_width\"
?You can check your answer by clicking on the \"...\" below.
\n", "\"petal_width\"
column of the table: we simply change the name of the column we want to access. \n",
"\n", "# Access the values stored in the \"petal_width\" column of the \"sample_data\" table\n", "sample_data[\"petal_width\"]\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# First look at the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Descriptive statistics\n", "\n", "A first important step in analyzing data is to get an idea of its basic characteristics using **descriptive statistics** such as the **mean** (i.e. the average value or \"moyenne\" in French) and the **standard deviation** (\"écart-type\" in French, generally abreviated std in English). \n", "So let's compute some simple descriptive statistics on the Vullierens sample data. The `describe()` function gives us right away a number of useful descriptive statistics for all the columns in our data table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute the descriptive stats\n", "sample_stats = sample_data.describe()\n", "\n", "# Display the result\n", "sample_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
You can check your answer by clicking on the \"...\" below.
\n", "5.713045 cm
.0.518940 cm
.\n",
"sample_stats
table?You can check your answer by clicking on the \"...\" below.
\n", "sample_stats
table: we use the name of the line containing the value, std
, and we store the result in a variable called sample_std
.\n",
"\n", "# Extract the sample standard deviation of the petal length from the descriptive stats\n", "sample_std = sample_stats.loc[\"std\",\"petal_length\"]\n", "\n", "# Display the result\n", "sample_std\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "\n", "After having looked at simple descriptive statistics, another important step is to **visualize the data**, to better identify its characteristics. \n", "Histograms are useful to visualize the [frequency distribution](https://en.wikipedia.org/wiki/Frequency_distribution) of the sample values: the horizontal axis displays intervals of the variable we are looking at, in our case the petal length, and the vertical axis indicates the number of samples in each interval." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the histogram representing the distribution of the samples\n", "plt.hist(sample_data[\"petal_length\"], color=\"green\")\n", "plt.xticks(np.arange(4.6, 7.2, 0.2))\n", "\n", "# Add a vertical line for the sample mean\n", "plt.axvline(x=sample_mean, color='black', linestyle='-.', linewidth=1, label=\"sample mean $m$\")\n", "\n", "# Add a vertical line for the population mean\n", "plt.axvline(x=mu, color='black', linestyle=':', linewidth=1, label=\"population mean $\\mu$\")\n", "\n", "# Add a legend\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
You can check your answer by clicking on the \"...\" below.
\n", "alpha01
to define a significance level of $\\alpha = 0.01$.\n",
" You can check your answer by clicking on the \"...\" below.
\n", "alpha01
and displays it.\n",
"\n", "# Define alpha at 0.01\n", "alpha01 = 0.01\n", "\n", "# Display alpha\n", "alpha01\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If our distribution of sample means is a normal curve then we know that the most extreme 5% of sample means are found above or below ±1.96 standard deviations above and below the mean. In our case, because our sample size is less than 130 (it is 50), our distribution is close to normal but not quite normal. \n", "In this case, it is possible to find out the relevant cut off point from [looking it up in statistical tables](https://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values): for a sample size of 50, the most extreme 5% of cases are found above or below approximately 2.01 standard deviations from the mean. \n", "\n", "The good news is that **Python gives us automatically the value of the cutoff point** based on the value of the significance level $\\alpha$ chosen and the sample size, thanks to the `stats` library which offers useful functions related to many statistical distributions such as Student's t:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the cutoff point for alpha at 0.05\n", "cutoff05 = stats.t.isf(alpha05 / 2, sample_size)\n", "\n", "# Display cutoff\n", "cutoff05" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
You can check your answer by clicking on the \"...\" below.
\n", "stats.t.isf(alpha05 / 2, sample_size)
and replace alpha05
with the variable alpha01
that we have previously defined.cutoff01
.\n", "# Get the cutoff point for alpha at 0.01\n", "cutoff01 = stats.t.isf(alpha01 / 2, sample_size)\n", "\n", "# Display cutoff\n", "cutoff01\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Error in the distribution of means\n", "\n", "So far we know a lot that will help us to test the hypothesis that our sample mean is similar to Anderson’s population mean. We know:\n", "* Our sample mean $m$\n", "* The population mean $\\mu$\n", "* The shape of the distribution of the mean of all samples that would come from this population (a normal curve, centred on the population mean)\n", "* Our cut off point defined by $\\alpha$ (the most extreme 5% of cases, above or below 2.01 standard deviations from the mean)\n", "\n", "The last piece of information missing that would enable us to test this hypothesis is the size of the standard deviation of the distribution of sample means from Anderson’s population. \n", "It turns out that a good guess for the size of this standard deviation can be obtained from knowing the standard deviation of our sample.\n", "If $s$ is the sample standard deviation of our sample and $n$ is the sample size, then the standard deviation of the distribution of sample means is:\n", "\n", "$\n", "\\begin{align}\n", "\\sigma_{\\overline{X}} = \\frac{s}{\\sqrt{n}}\n", "\\end{align}\n", "$ \n", "\n", "This standard deviation of the distribution of sample means is called the **\"standard error of the mean\" (also noted SEM)**. 
\n", "We can compute it by using the sample size and the standard deviation from the descriptive stats we have computed earlier: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the sample standard deviation from the descriptive stats\n", "sample_std = sample_stats.loc[\"std\",\"petal_length\"]\n", "\n", "# Compute the estimation of the standard deviation of sample means from Anderson's population (standard error)\n", "sem = sample_std / math.sqrt(sample_size)\n", "\n", "# Display the standard error\n", "sem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
You can check your answer by clicking on the \"...\" below.
\n", "sem
) above that the way to get $\\sqrt{n}$ in Python is math.sqrt(sample_size)
.sample_size
by 2
to get $\\sqrt{2}$.\n", "# Compute the square root of 2\n", "sqrt2 = math.sqrt(2)\n", "\n", "# Display the result\n", "sqrt2\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparison, and definition of the *t* statistic\n", "\n", "We can now restate our question in more precise terms: **\"is our sample mean in the most extreme 5% of samples that would be drawn from a population with the same mean as Anderson’s population?\"**. \n", "Or to be even more precise, **\"is the gap between our sample mean and Anderson’s population mean greater than 2.01 times the standard error of the mean?\"**. \n", "\n", "This would be equivalent to comparing\n", "$\n", "\\begin{align}\n", "\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", "\\end{align}\n", "$\n", "to our cutoff point of 2.01. \n", "\n", "That is the **definition of the *t* statistic**: the value $t = $\n", "$\n", "\\begin{align}\n", "\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", "\\end{align}\n", "$ \n", " has to be compared to the cutoff point we have chosen to determine if the sample mean falls into the most extreme zones and to be able to say whether the difference is statistically significant or not.
You can check your answer by clicking on the \"...\" below.
\n", "cutoff05
in the code above by the variable cutoff01
we have defined earlier with the appropriate value for the cutoff point. See the solution code below.\n", "# Compare t to the cutoff point for alpha=0.01\n", "if abs(t) > cutoff01: \n", "    print(\"The difference IS statistically significant.\")\n", "else: \n", "    print(\"The difference is NOT statistically significant.\")\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The statistical test we have just performed here, where we compare our sample mean to the mean of a population, is called a **one-sample t-test**: *one-sample* because we compare a sample to the mean of a population, and *t-test* because the distribution of all the possible sample means of the population follows a distribution called *Student's t-distribution*. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization of *t*\n", "\n", "Using Python we can visualize what the t-test means graphically by plotting the t-distribution of all the possible sample means that would be drawn from a population with the same mean as Anderson's population and showing where `t` is in the distribution compared to the zone defined by our $\\alpha$ of 5%.\n", "\n", "If the *t* statistic falls outside of the rejection zone defined by $\\alpha$, then that means that the difference between our sample mean and the population mean is not statistically significant. If it falls into the rejection zone, then the difference is statistically significant and the sample should not be considered as coming from the Anderson population under the significance level we have chosen.\n", "\n", "The cell below uses an external library to generate a graphical visualization of the result of the t-test." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize graphically the result of the t-test with alpha at 0.05\n", "visualize_ttest(sample_size, alpha05, t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
You can check your answer by clicking on the \"...\" below.
\n", "alpha05
in the code above by the variable alpha01
we have defined earlier. See the solution code below.\n", "# Visualize graphically the result of the t-test with alpha at 0.01\n", "visualize_ttest(sample_size, alpha01, t)\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "What can we conclude from this? What the one-sample t-test tells us is that we have evidence which would lead us to think that the sample doesn't come from an Anderson-like population. Therefore we **can reject our hypothesis $H_0$**. \n", "\n", "Now there are some limitations to keep in mind when using the one-sample t-test, which we will explore in the section below.\n", "\n", " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Influence of the sample size\n", "\n", "Above, we have seen that $t = $\n", "$\n", "\\begin{align}\n", "\\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n", "\\end{align}\n", "$ and that $\\sigma_{\\overline{X}} = $\n", "$\n", "\\begin{align}\n", "\\frac{s}{\\sqrt{n}}\n", "\\end{align}\n", "$.\n", "\n", "Therefore we can rewrite the *t* statistic as:\n", "\n", "$\n", "\\begin{align}\n", "t = \\frac{m - \\mu}{\\frac{s}{\\sqrt{n}}}\n", "\\end{align}\n", "$\n", "\n", "This means that *t* is actually:\n", "\n", "$\n", "\\begin{align}\n", "t = \\frac{m - \\mu}{s}\\sqrt{n}\n", "\\end{align}\n", "$\n", "\n", "From there, we see that the **sample size $n$ influences the value of $t$**: all else being equal (i.e. sample mean, sample standard deviation and population mean), **a larger sample would result in a higher value of $|t|$** and therefore a greater chance of finding a significant result for the t-test.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
You can check your answer by clicking on the \"...\" below.
\n", "**
for power raising ($x^2$ is then written x ** 2
). See the solution code below, in which we have used the variable cutoff05
instead of the raw number 2.01
but that would work too.n = 41.88
, which means that a sample size of 42 flowers or more with the same mean and standard deviation for the petal length would make the t-statistic above our cutoff point.\n",
"\n", "# Make your calculation in Python here\n", "n = ((cutoff05 * sample_std) / (sample_mean - mu)) * ((cutoff05 * sample_std) / (sample_mean - mu))\n", "\n", "# Display the result\n", "n\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So for instance, for our irises from the Vullierens Castle, **a sample of 144 flowers instead of 50** with exactly the same mean and standard deviation for the petal length would be considered as statistically different from the Anderson population. \n", "\n", "This is why when doing experiments, researchers generally try to get samples as large as possible - but of course this has a cost and is not always possible!\n", "\n", " \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the *p-value*\n", "\n", "In scientific studies, researchers use frequently the t-test but they generally report not only the t-statistic but also **another result of the t-test which is called the p-value**. In the following, we explore what is the p-value, how it relates to the t-statistic and how it can be used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Testing our hypothesis using a predefined Python function\n", "\n", "So far we have made the computations by hand but Python comes with a number of libraries with interesting statistical tools. \n", "In particular, the `stats` library includes a function for doing a **one-sample t-test** as we have done above. \n", "\n", "Let's now use it and then look at what information it gives us." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute the t-test\n", "t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu)\n", "\n", "# Display the result\n", "print(\"t = {:.3f}\".format(t))\n", "print(\"p = {:.3f}\".format(p))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the predefined Python function for doing the one-sample t-test gives us the same value for the $t$ statistic as the calculations we have made by hand: $t = 2.194$. \n", "In addition, we see that it also returns another value, $p = 0.033$. \n", "\n", "Actually, the two values `t` and `p` returned by the function say the same thing but in two different ways:\n", "* `t` tells us where our sample mean falls on the distribution of all the possible sample means for the Anderson population ;
t
compare to the cutoff value (2.01)?p
compare to $\\alpha$ (0.05)?You can check your answer by clicking on the \"...\" below.
\n", "visualize_ttest_pvalue
need to generate a visualization of the result of a t-test?You can check your answer by clicking on the \"...\" below.
\n", "visualize_ttest_pvalue
function needs 4 different values to generate the visualization: the sample size, the significance level $\\alpha$, the value of $t$ and the value of $p$.visualize_ttest_pvalue
function of the code cell above and replace alpha05
by alpha01
, see the solution code below.\n", "# Visualize graphically the result of the t-test with alpha at 0.01\n", "visualize_ttest_pvalue(sample_size, alpha01, t, p)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to the visualization above, we see that one important difference between the t-statistic and the p-value is that $|t|$ and $p$ evolve in opposite directions: the bigger $|t|$ is, the smaller $p$ is.\n", "\n", "Another important difference is that **the t-statistic tells us whether the sample mean $m$ is greater or smaller than the population mean $\\mu$** whereas this is impossible to know with the p-value only: since the p-value corresponds to the area under the curve of the t-distribution, it is always positive. \n", "As we have seen earlier, the t-distribution is centred on zero, with zero meaning $m = \\mu$ and:\n", "* when $t > 0$ (i.e. $t$ is on the *right* side of the distribution on the visualization above) it means that $m > \\mu$ ;\n", "* when $t < 0$ (i.e. $t$ is on the *left* side of the distribution on the visualization above) it means that $m < \\mu$.\n", "\n", "\n", "\n", " \n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Importance of the choice of $\\alpha$\n", "\n", "So far we have seen two important points to keep in mind when using the t-test to compare a sample to a population: first the size of the sample matters and second the t-test provides us with two pieces of information, the t-statistic and the p-value, which are both useful but in different ways. In this section, we look at a third important point to keep in mind when doing statistical testing: the **influence of the choice of $\\alpha$**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's compare our Vullierens sample to another population\n", "\n", "
stats.ttest_1samp
:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define the mean petal length of the Ensata population\n",
"mu_ensata = 5.832\n",
"\n",
"# Compute the t-test comparing the Vullierens sample petal length to the Ensata population mean\n",
"t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu_ensata)\n",
"\n",
"# Display the result\n",
"print(\"t = {:.3f}\".format(t))\n",
"print(\"p = {:.3f}\".format(p))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result of the t-test gives $t = -1.621$ and $p = 0.111$.\n",
"\n",
"With $\\alpha=0.05$, the cutoff value is 2.01. We see that $|t| < 2.01$ and $p > 0.05$. Therefore, the test tells us that the difference between the mean petal length of the Vullierens sample and the mean petal length of the Ensata population IS NOT statistically significant. In other words, we cannot reject the hypothesis that the Vullierens sample is similar to the Ensata population."
]
},
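{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, the value of $t$ reported above can be reproduced by hand from the definition of the *t* statistic. The sketch below assumes that the values $m = 5.713045$ and $s = 0.518940$ quoted earlier in this notebook are the mean and standard deviation of the petal length in the Vullierens sample, and that $n = 50$; the symbol $\\mu_{\\text{Ensata}}$ is introduced here only for illustration and stands for the value stored in `mu_ensata`:\n",
"\n",
"$\n",
"\\begin{align}\n",
"\\sigma_{\\overline{X}} &= \\frac{s}{\\sqrt{n}} = \\frac{0.518940}{\\sqrt{50}} \\approx 0.0734 \\\\\n",
"t &= \\frac{m - \\mu_{\\text{Ensata}}}{\\sigma_{\\overline{X}}} \\approx \\frac{5.713045 - 5.832}{0.0734} \\approx -1.621\n",
"\\end{align}\n",
"$\n",
"\n",
"The negative sign simply reflects that the sample mean is smaller than the Ensata population mean."
]
},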
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is the role of $\\alpha$ in this result?\n",
"\n",
"Now let's ask ourselves **what would have been the conclusion of the test if we had chosen a significance level of $\\alpha=0.01$**, i.e. if we wanted to be 99% sure?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For $\\alpha = 0.01$, the cutoff value which we get from the tables is 2.67. With this choice of $\\alpha$, we see that $|t| < 2.67$ and $p > 0.01$. This means that when choosing $\\alpha = 0.01$, the test tells us that the difference between the mean petal length of the Vullierens sample and the mean petal length the Ensata population is NOT statistically significant either. This is quite obvious, since the $t$ we have to \"beat\" is event larger with $\\alpha = 0.01$ than for $\\alpha = 0.05$."
]
},
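{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reasoning above can be written more generally. In the sketch below, $t_{\\alpha/2}$ is a notation introduced here (it is not used elsewhere in this notebook) for the two-sided cutoff point associated with a significance level $\\alpha$, for a fixed sample size. Since the cutoff grows when $\\alpha$ shrinks, for two significance levels $\\alpha' < \\alpha$ and the same sample we have:\n",
"\n",
"$\n",
"\\begin{align}\n",
"\\alpha' < \\alpha \\;&\\Rightarrow\\; t_{\\alpha'/2} > t_{\\alpha/2} \\\\\n",
"|t| < t_{\\alpha/2} \\;&\\Rightarrow\\; |t| < t_{\\alpha'/2} \\\\\n",
"p > \\alpha \\;&\\Rightarrow\\; p > \\alpha'\n",
"\\end{align}\n",
"$\n",
"\n",
"In other words, a difference that is not significant at a given $\\alpha$ cannot become significant at a stricter (smaller) $\\alpha$. The converse does not hold: a difference that is significant at $\\alpha = 0.05$ may or may not remain significant at $\\alpha = 0.01$."
]
},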
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary\n",
"\n",
"In this notebook, you have seen how to compare a sample to a population using an approach called **hypothesis testing** and using a statistical test called a **one-sample t-test**.\n",
"\n",
"To summarize, to compare the mean of a sample to a reference value from a population, you have to proceed in four main steps:\n",
"1. Look at descriptive statistics and visualizations of the sample you have to get an idea about how it compares to the population\n",
"1. Formulate the hypothese you want to test: the null hypothesis $H_0: m = \\mu$ and its alternate $H_a: m \\neq \\mu$ \n",
"1. Choose a significance level for being sure, usually $\\alpha = 0.05$ or $\\alpha = 0.01$, or even $\\alpha = 0.001$ \n",
"1. Compute the result of the t-test and interpret the result - in particular if the p-value is *below* the significance level you have chosen, $p \\lt \\alpha$, then it means $H_0$ should probably be rejected"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"\n",
"---\n",
"\n",
"