\n",
"\n",
"An important part of the scientific process is to make hypotheses about the world or about the results of experiments. These hypotheses need then to be checked by collecting evidence and making comparisons. Hypothesis testing is a step in this process where statistical tools are used to test hypotheses using data.\n",
"\n",
"**This notebook is designed for you to learn**:\n",
"* How to distinguish between \"population\" datasets and \"sample\" datasets when dealing with experimental data\n",
"* How to compare a sample to a population, test a hypothesis using a statistical test called the \"t-test\" and interpret its results\n",
"* How to use Python scripts to make statistical analyses on a dataset\n",
"\n",
"In the following, we will use an example dataset representing series of measurements on a type of flower called Iris."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"
\n",
" \n",
"\n",
"###### Iris Virginica (Credit: Frank Mayfield CC BY-SA 2.0)\n",
"\n",
"
\n",
"\n",
"In 1935, an american botanist called Edgar Anderson worked on quantifying the morphologic variation of Iris flowers of three related species, Iris Setosa, Iris Virginica and Iris Versicolor [[1]](#Bibliography). He realized a series of measures of the petal length, petal width, sepal length, sepal width and species.\n",
"Based on the combination of these four features, a British statistician and biologist named Ronald Fisher developed a model to distinguish the species from each other [[2]](#Bibliography).\n",
"\n",
"## Question\n",
"\n",
"A recent series of measurements has been carried out at the [Iris Garden of the Vullierens Castle](https://chateauvullierens.ch/en/) near Lausanne, on a sample of 50 flowers of the Iris Virginica species. \n",
"**How similar (or different) is the Iris sample from the Vullierens Castle compared to the Iris Virginica population documented by Edgar Anderson?**\n",
"\n",
"## Instructions\n",
"\n",
"This notebook will guide you in the use of Python tools for analyzing this experimental dataset and perform statistical tests which are widely used in hypothesis testing. \n",
"It includes:\n",
"* **explanations to read** about how to analyze experimental data to answer a research question,\n",
"* **code to execute** to illustrate how to perform data analysis using Python.\n",
"* **questions** to help you think about what you learn along the way.\n",
"\n",
"\n",
"**Solutions** of all the questions are available [in this file](./solution/StatisticsNotebook-solution.ipynb), we recommend you to **check your answer** after each question, before moving to the next piece of content."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" How to use this notebook? \n",
"
\n",
"
To execute the code in this notebook, simply click on the cell containing the code and then click on the \"play\" button (►) in the tool bar just above the notebook, or type shift + enter. It is important to execute the code cells in their order of appearance in the notebook.
\n",
"
You can change the content of all the code cells of this notebook, and also add new cells to the notebook by clicking on the \"plus\" button (+) in the tool bar just above the notebook. \n",
" By default, cells you add to the notebook are made to contain code. \n",
" If you want a new cell to contain text, select \"Markdown\" in the drop down menu in the same tool bar.
\n",
"
\n",
"
\n",
" \n",
"\n",
"While using the notebook, you can also **take notes on a piece of paper** if you feel this is helpful.\n",
"\n",
" \n",
"\n",
"\n",
"--- "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Python tools for stats\n",
"Python comes with a number of libraries for processing data and computing statistics.\n",
"To use these tool you first have to load them using the `import` keyword. \n",
"The role of the code cell just below is to load the tools that we use in the rest of the notebook. It is important to execute this cell *prior to executing any other cell in the notebook*."
]
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plotting and display tools\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('seaborn-whitegrid') # global style for plotting\n",
"\n",
"from IPython.display import display, set_matplotlib_formats\n",
"set_matplotlib_formats('svg') # vector format for graphs\n",
"\n",
"# data computation tools\n",
"import numpy as np \n",
"import pandas as pan\n",
"import math\n",
"\n",
"# statistics tools\n",
"import scipy.stats as stats\n",
"from lib.dataanalysis import * "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data available on the Anderson population\n",
"\n",
"Anderson has published summary statistics of his dataset. \n",
"You have the **mean petal length of the Iris Virginica species** documented by Anderson: $\\mu = 5.552$ cm, which we define in the code below."
]
},
{
"cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "5.552"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
"source": [
"# Define mu as mean petal length of Iris Virginica species from Anderson\n",
"mu = 5.552\n",
"\n",
"# Display mu\n",
"mu"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" Question \n",
" What does the first line of code above do? And what is the role of the second line of code? \n",
 How would you define another value in the code, for instance the mean petal length of Iris Versicolor $\\mu_{versicolor}= 4.26$ cm? \n",
" Type your code using the cell below and execute it to test the result. \n",
"
You can check your answer by clicking on the \"...\" below.
\n",
" Solution \n",
" The first line of code defines a variable called mu and sets its value to 5.552. \n",
 The role of the second line of code is to display the value of mu. \n",
 Following the same pattern, below is the code to define mu_versicolor with a value of 4.26 and display it. \n",
- "
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data available on the Vullierens sample\n",
"\n",
"You have the raw data collected on the petal length and petal width of the Vullierens sample, which is stored in the file `iris-sample-vullierens.csv` that you can see in the file explorer in the left pane. \n",
"If you double click on the file it will open in a new tab and you can look at what is inside.\n",
"\n",
"Now to analyze the data using Python you have to read the file:"
]
},
{
"cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
petal_length
\n",
- "
petal_width
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
5.090981
\n",
- "
1.787443
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
5.224431
\n",
- "
2.259538
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
7.251620
\n",
- "
2.055940
\n",
- "
\n",
- "
\n",
- "
3
\n",
- "
5.607932
\n",
- "
2.311074
\n",
- "
\n",
- "
\n",
- "
4
\n",
- "
6.118801
\n",
- "
1.997534
\n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " petal_length petal_width\n",
- "0 5.090981 1.787443\n",
- "1 5.224431 2.259538\n",
- "2 7.251620 2.055940\n",
- "3 5.607932 2.311074\n",
- "4 6.118801 1.997534"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
"source": [
"# Read the Vullierens sample data from the CSV file\n",
"sample_data = pan.read_csv('iris-sample-vullierens.csv')\n",
"\n",
"# Display the first few lines of the dataset\n",
"sample_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After reading the file, its content is stored in the variable `sample_data`, which is a kind of table. The output above shows us an extract of the table, limited to the first 5 lines. We see above that each line of the table is given an index number to identify it. We also see that, appart from the index, the table contains two columns, called `\"petal_length\"` and `\"petal_width\"`, which contains all the measurements made on the Vullierens Irises.\n",
"\n",
"To get the complete list of all the values stored in one specific column such as `\"petal_length\"`, you can use the following syntax: `sample_data[\"petal_length\"]`."
]
},
{
"cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 5.090981\n",
- "1 5.224431\n",
- "2 7.251620\n",
- "3 5.607932\n",
- "4 6.118801\n",
- "5 6.352507\n",
- "6 4.896926\n",
- "7 5.220964\n",
- "8 6.235352\n",
- "9 6.200244\n",
- "10 5.422812\n",
- "11 5.296983\n",
- "12 4.694441\n",
- "13 5.911687\n",
- "14 5.958683\n",
- "15 5.764169\n",
- "16 6.035653\n",
- "17 6.848299\n",
- "18 6.286982\n",
- "19 5.117292\n",
- "20 4.918408\n",
- "21 5.663514\n",
- "22 6.056574\n",
- "23 6.075641\n",
- "24 5.619982\n",
- "25 6.091000\n",
- "26 5.621478\n",
- "27 5.207927\n",
- "28 5.410302\n",
- "29 5.714093\n",
- "30 5.601681\n",
- "31 5.706329\n",
- "32 5.536061\n",
- "33 5.742188\n",
- "34 5.496693\n",
- "35 5.520262\n",
- "36 4.736357\n",
- "37 5.445666\n",
- "38 5.818557\n",
- "39 6.115245\n",
- "40 6.010444\n",
- "41 5.692231\n",
- "42 5.477746\n",
- "43 5.620406\n",
- "44 5.936960\n",
- "45 6.194876\n",
- "46 6.349760\n",
- "47 4.781601\n",
- "48 5.692977\n",
- "49 6.260550\n",
- "Name: petal_length, dtype: float64"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
"source": [
"# All values stored in the \"petal_length\" column of the \"sample_data\" table\n",
"sample_data[\"petal_length\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" Question \n",
" How would you access the data stored in the other column of this table, named \"petal_width\"? \n",
" Type and test your code using the cell below.\n",
"
You can check your answer by clicking on the \"...\" below.
\n",
"
"
]
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Access the values stored in the \"petal_width\" column of the \"sample_data\" table\n"
]
},
{
"cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "
\n",
- " Solution \n",
- " Below is the code to access the data stored in the \"petal_width\" column of the table: we simply change the name of the column we want to access. \n",
- "
\n",
+ " Solution \n",
+ " Below is the code to access the data stored in the \"petal_width\" column of the table: we simply change the name of the column we want to access. \n",
+ "
\n",
+ "\n",
+ "
\n",
"# Access the values stored in the \"petal_width\" column of the \"sample_data\" table\n",
- "sample_data[\"petal_width\"]"
+ "sample_data[\"petal_width\"]\n",
+ "
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# First look at the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Descriptive statistics\n",
"\n",
"A first important step in analyzing data is to get an idea of its basic characteristics using **descriptive statistics** such as the **mean** (i.e. the average value or \"moyenne\" in French) and the **standard deviation** (\"écart-type\" in French, generally abreviated std in English). \n",
"So let's compute some simple descriptive statistics on the Vullierens sample data. The `describe()` function gives us right away a number of useful descriptive statistics for all the columns in our data table:"
]
},
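As a minimal sketch of what `describe()` returns, the snippet below builds a small made-up table and computes its descriptive statistics. The names `toy_data` and `toy_stats` and all the values are illustrative assumptions, not the Vullierens measurements. Note that pandas reports `std` as the sample standard deviation, which divides by n - 1 rather than n.

```python
import pandas as pan

# Hypothetical toy table for illustration only (NOT the Vullierens data)
toy_data = pan.DataFrame({
    "petal_length": [5.0, 5.5, 6.0, 6.5],
    "petal_width": [1.8, 2.0, 2.2, 2.4],
})

# describe() returns a new table with one line per statistic
# (count, mean, std, min, quartiles, max) and one column per variable
toy_stats = toy_data.describe()

# Names of the lines available in the statistics table
print(toy_stats.index.tolist())

# Individual values are read with .loc[line_name, column_name]
print(toy_stats.loc["mean", "petal_length"])
print(toy_stats.loc["std", "petal_length"])
```

The same `.loc[line, column]` pattern applies to any table produced by `describe()`, whatever the number of columns in the original data.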
{
"cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
" Question \n",
" From the table above, what is the mean value of the petal length in the Vullierens sample? \n",
" And the standard deviation (std) of the petal length in the Vullierens sample?\n",
"
You can check your answer by clicking on the \"...\" below.
\n",
" Solution \n",
" From the table above, we can read in the first column, second line that the mean value of the petal length of the Vullierens sample is 5.713045 cm. \n",
" We can read in the first column, third line that the standard deviation of the petal length is 0.518940 cm.\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access individual elements of the `sample_stats` table using the corresponding names for the line and column of the value. \n",
"The following cell illustrates how to get the **sample size** (named `count` in the table above):"
]
},
{
"cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "50.0"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
"source": [
"# Extract the sample mean from the descriptive stats\n",
"sample_size = sample_stats.loc[\"count\",\"petal_length\"]\n",
"\n",
"# Display the result\n",
"sample_size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another interesting information to extract from these descriptive statistics is the **mean value of the petal length** in the sample:"
]
},
{
"cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "5.713045387181936"
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
"source": [
"# Extract the sample mean of the petal length from the descriptive stats\n",
"sample_mean = sample_stats.loc[\"mean\",\"petal_length\"]\n",
"\n",
"# Display the result\n",
"sample_mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" Question \n",
" How could you access the value of the standard deviation of the petal length in the sample_stats table? \n",
" Type and test your code using the cell below.\n",
"
You can check your answer by clicking on the \"...\" below.
\n",
"
"
]
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the sample standard deviation of the petal length from the descriptive stats\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [
"\n",
"
\n",
" Solution \n",
" Below is the code to access the value of the standard deviation of the petal length in the sample_stats table: we use the name of the line containing the value, std, and we store the result in a variable called sample_std.\n",
- "
\n",
"# Extract the sample standard deviation of the petal length from the descriptive stats\n",
"sample_std = sample_stats.loc[\"std\",\"petal_length\"]\n",
"\n",
"# Display the result\n",
- "sample_std"
+ "sample_std\n",
+ "
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualization\n",
"\n",
"After having looked at simple descriptive statistics, another important step is to **visualize the data**, to better identify its characteristics. \n",
"Histograms are useful to visualize the [frequency distribution](https://en.wikipedia.org/wiki/Frequency_distribution) of the sample values: the horizontal axis displays intervals of the variable we are looking at, in our case the petal length, and the vertical axis indicates the number of samples in each interval."
]
},
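The counting that lies behind a histogram can be sketched with NumPy alone. The snippet below uses a handful of made-up petal lengths (the array `toy_lengths`, the number of intervals and the range are all illustrative assumptions, not the Vullierens data) and counts how many values fall in each interval. A plotting function such as `plt.hist` draws these same counts as bars.

```python
import numpy as np

# Hypothetical toy values for illustration only (NOT the Vullierens data)
toy_lengths = np.array([4.7, 5.1, 5.2, 5.6, 5.7, 5.9, 6.0, 6.1, 6.3, 7.2])

# Split the range 4.5 to 7.5 cm into 3 intervals of equal width
# and count how many values fall in each interval
counts, bin_edges = np.histogram(toy_lengths, bins=3, range=(4.5, 7.5))

# Each interval goes from one edge to the next one
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} cm: {count} values")
```

Changing the number of intervals changes the shape of the histogram, so it is worth trying a few values before drawing conclusions from the picture.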
{
"cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/svg+xml": [
- "\n",
- "\n",
- "\n",
- "\n"
- ],
- "text/plain": [
- "