diff --git a/StatisticsNotebook-group_A.ipynb b/StatisticsNotebook-group_A.ipynb
index 725c208..5cdd4af 100644
--- a/StatisticsNotebook-group_A.ipynb
+++ b/StatisticsNotebook-group_A.ipynb
@@ -1,972 +1,1598 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
- "
\n",
+    "## Upper-tailed test\n",
+    "With a one-tailed test, the rejection zone is situated on one side of the t distribution only. For an **upper-tailed** test it is situated **on the right** and corresponds to the proportion $\\alpha$ of most extreme **positive** cases, as shown in the graph below. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "plot_t_distribution(sample_size-1, alpha=0.05, tail=\"upper\");\n",
"\n",
- "###### Iris Ensata (Credit: Laitche CC BY-SA 3.0)\n",
+ "# In green the Vuillerens sample\n",
+ "plt.axvline(x=vuillerens_t, color='green', linestyle='-.', linewidth=1, label=\"(Vuillerens)\")\n",
+ "plt.legend();"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "When we were doing two-tailed tests, the cutoff value for $\\alpha = 5\\%$ was 2.01 (or -2.01). With a one-tailed test, all of the 5% most extreme cases are on the same side, so the cutoff value for $\\alpha=5\\%$ is **1.67**. This smaller cutoff value is **easier to beat** than the two-tailed one."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# We look up the value of the t distribution for alpha = 0.05 to find the cutoff value\n",
+ "onesided_cutoff05 = stats.t.ppf(1-0.05, sample_size-1)\n",
+ "onesided_cutoff05 "
+ ]
+ },
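+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick check of this claim, we can compute the two-sided and one-sided cutoff values side by side (using the same `sample_size` as above):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Two-sided cutoff: alpha is split over both tails (alpha/2 on the right)\n",
+    "print(\"two-sided cutoff: {:.3f}\".format(stats.t.ppf(1-0.05/2, sample_size-1)))\n",
+    "# One-sided cutoff: all of alpha is on the right\n",
+    "print(\"one-sided cutoff: {:.3f}\".format(stats.t.ppf(1-0.05, sample_size-1)))"
+   ]
+  },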
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Testing for the efficiency of a fertilizer\n",
"\n",
+ "
\n",
+ "
\n",
"
\n",
"\n",
- "Let's imagine we want to know how our Vullierens sample compares to another iris population, for instance a Japanese iris species called Iris Ensata with a mean petal length of $\\mu_{ensata} = 5.832$ cm. We can apply the hypothesis testing approach that we have just learned and use a one-sample t-test to do the comparison, which we do in 4 steps:\n",
"\n",
- "1. Get an idea about how the sample compares to the population: \n",
- " At first sight, the sample petal mean of our Vullierens sample, $m= 5.646$ cm, is again quite close to the mean petal length of the Ensata population, $\\mu_{ensata} = 5.832$ cm. \n",
+    "Imagine that the Vuillerens sample was picked in a field where the gardeners have used a new type of fertilizer that boosts the growth of the flowers. We now have good reasons to believe that the fertilizer will increase the flower size, and hence we can formulate a one-tailed test. \n",
+    "\n",
+    "This may sound a bit contradictory, but even though we believe the flowers will be larger, we can only reject null hypotheses, and hence we state the null hypothesis that the sample mean is smaller than or equal to $\\mu$. \n",
+ "\n",
+ "$H_0: m_{Vuillerens} \\leq \\mu$\n",
"\n",
- "2. Formulate the hypotheses we want to test:\n",
- " * First let's state our null hypothesis, which is that the Vullierens sample is similar to the Iris Ensata population, $H_0: m = \\mu_{ensata}$. \n",
- " This is the hypothesis we want to know whether we can reject or not.\n",
- " * And then state the alternate hypothesis $H_a: m \\neq \\mu_{ensata}$.\n",
+ "$H_a: m_{Vuillerens} > \\mu$\n",
"\n",
+ "The calculations for the test are very similar to the two-tailed situation. \n",
"\n",
- "3. Choose a significance level: \n",
- " Let's choose $\\alpha=0.05$ as previously, which means we want to be 95% sure.\n",
+ "We compare the **t value**\n",
+ "$\n",
+ "\\begin{align}\n",
+ "t = \\frac{m - \\mu}{\\sigma_{\\overline{X}}}\n",
+ "\\end{align}\n",
+ "$\n",
+ "to the cutoff point $t_{\\alpha=0.05}$=1.67.\n",
"\n",
- "4. Compute the result of the t-test by using the predefined Python function
stats.ttest_1samp
:"
+ "If the t value is larger than the cutoff value, then we can conclude: \n",
+ "* that we can reject $H_0$ and accept $H_a$\n",
+ "* that our sample falls in the 5% most extreme **positive** cases of the t distribution\n",
+ "* that the sample mean is larger than $\\mu$\n",
+ "* and that the test is statistically significant at the level $\\alpha=0.05$ \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Doing the test\n",
+    "We proceed as with the two-tailed test, except that we do not take the absolute value of t before comparing it to the cutoff point: since we conduct an upper-tailed test, we expect a positive t value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
- "# Define the mean petal length of the Ensata population\n",
- "mu_ensata = 5.832\n",
- "\n",
- "# Compute the t-test comparing the Vullierens sample petal length to the Ensata population mean\n",
- "t, p = stats.ttest_1samp(sample_data[\"petal_length\"], mu_ensata)\n",
- "\n",
- "# Display the result\n",
- "print(\"t = {:.3f}\".format(t))\n",
- "print(\"p = {:.3f}\".format(p))"
+ "if vuillerens_t > onesided_cutoff05: \n",
+ " print(\"The difference IS statistically significant because \" + \n",
+ " \"the t value {:.3f} > {:.3f}\".format(vuillerens_t, onesided_cutoff05))\n",
+ "else: \n",
+ " print(\"The difference is NOT statistically significant because \" +\n",
+ " \" the t value {:.3f} < {:.3f}\".format(vuillerens_t, onesided_cutoff05))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plot_t_test(vuillerens_sample, mu, alpha=0.05, tail=\"upper\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "The result of the t-test gives $t = -1.621$ and $p = 0.111$.\n",
- "\n",
- "With $\\alpha=0.05$, the cutoff value is 2.01. We see that $|t| < 2.01$ and $p > 0.05$. Therefore, the test tells us that the difference between the mean petal length of the Vullierens sample and the mean petal length of the Ensata population
IS NOT statistically significant. In other words, we
cannot reject the hypothesis that the Vullierens sample is similar to the Ensata population."
+ "We see from the plot above that the t-test is **significant**:\n",
+ "* the t value (2.194) is larger than the cutoff value (1.677).\n",
+ "* the p value (0.016) is smaller than the $\\alpha$ level (0.05).\n",
+ "* the hatched area that corresponds to the p-value is inside the rejection zone in red."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## What is the role of $\\alpha$ in this result?\n",
+    "## Lower-tailed test\n",
+    "For a **lower-tailed** test the rejection zone is situated **on the left** and corresponds to the proportion $\\alpha$ of most extreme **negative** cases as shown in the graph below.\n",
"\n",
- "Now let's ask ourselves **what would have been the conclusion of the test if we had chosen a significance level of $\\alpha=0.01$**, i.e. if we wanted to be 99% sure?"
+    "The corresponding hypothesis could be, for example, that the second sample was picked from a field that is in the shade, which could lead to smaller flowers. We state the null hypothesis $H_0$ that the sample mean is greater than or equal to the population mean $\\mu$.\n",
+ "\n",
+ "$H_0: m_{SecondSample} \\geq \\mu$\n",
+ "\n",
+ "$H_a: m_{SecondSample} < \\mu$\n",
+ "\n",
+    "Since we do a lower-tailed test, we test whether the t value for our sample falls **below** the cutoff on the left side of the distribution. We compare the t value to the negative cutoff point $t_{\\alpha=0.05}$= -1.67.\n",
+ "\n",
+ "If the t value is smaller than the cutoff value, then we can conclude: \n",
+ "* that we can reject $H_0$\n",
+    "* that our sample falls in the 5% most extreme **negative** cases of the t distribution\n",
+ "* and that the difference is statistically significant at the level $\\alpha=0.05$ \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "if second_t < -onesided_cutoff05: \n",
+ " print(\"The difference IS statistically significant because \" + \n",
+ " \"the t value {:.3f} < {:.3f}\".format(second_t, -onesided_cutoff05))\n",
+ "else: \n",
+ " print(\"The difference is NOT statistically significant because \" +\n",
+ " \" the t value {:.3f} > {:.3f}\".format(second_t, -onesided_cutoff05))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plot_t_test(second_sample, mu, alpha=0.05, tail=\"lower\")"
]
},
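+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that `stats.ttest_1samp` returns a *two-tailed* p-value. For a one-tailed test we can derive the one-sided p-value ourselves, as sketched below (assuming `second_sample` and `mu` are defined as earlier in the notebook):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "t, p_two = stats.ttest_1samp(second_sample, mu)\n",
+    "# Lower-tailed test: halve the two-tailed p-value if t is on the\n",
+    "# expected (negative) side, otherwise the one-sided p-value is 1 - p_two/2\n",
+    "p_lower = p_two/2 if t < 0 else 1 - p_two/2\n",
+    "print(\"t = {:.3f}, one-sided p = {:.3f}\".format(t, p_lower))"
+   ]
+  },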
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "For $\\alpha = 0.01$, the cutoff value which we get from the tables is 2.67. With this choice of $\\alpha$, we see that $|t| < 2.67$ and $p > 0.01$. This means that when choosing $\\alpha = 0.01$, the test tells us that the difference between the mean petal length of the Vullierens sample and the mean petal length the Ensata population
is NOT statistically significant either. This is quite obvious, since the $t$ we have to \"beat\" is event larger with $\\alpha = 0.01$ than for $\\alpha = 0.05$."
+ "We see from the plot above that the t-test is **not significant**:\n",
+ "* the negative t value (-1.325) is larger than the negative cutoff value (-1.677).\n",
+ "* the p value (0.096) is larger than the $\\alpha$ level (0.05).\n",
+ "* the hatched area that corresponds to the p-value is not inside the rejection zone in red.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary\n",
"\n",
"In this notebook, you have seen how to compare a sample to a population using an approach called **hypothesis testing** and using a statistical test called a **one-sample t-test**.\n",
"\n",
"To summarize, to compare the mean of a sample to a reference value from a population, you have to proceed in five main steps:\n",
"1. Look at descriptive statistics and visualizations of the sample you have to get an idea about how it compares to the population\n",
- "1. Formulate the hypothese you want to test: the null hypothesis $H_0: m = \\mu$ and its alternate $H_a: m \\neq \\mu$ \n",
+    "1. Formulate the hypotheses you want to test: \n",
+    "    * For two-tailed tests the null hypothesis $H_0: m = \\mu$ and its alternative $H_a: m \\neq \\mu$\n",
+ " * For upper-tailed tests the null hypothesis $H_0: m \\leq \\mu$ and its alternative $H_a: m > \\mu$\n",
+ " * For lower-tailed tests the null hypothesis $H_0: m \\geq \\mu$ and its alternative $H_a: m < \\mu$\n",
"1. Choose a significance level for being sure, usually $\\alpha = 0.05$ or $\\alpha = 0.01$, or even $\\alpha = 0.001$ \n",
- "1. Compute the result of the t-test and interpret the result - in particular if the p-value is *below* the significance level you have chosen, $p \\lt \\alpha$, then it means $H_0$ should probably be rejected"
+ "1. Determine the cutoff value for your given $\\alpha$ level.\n",
+ "1. Compute the result of the t-test and interpret the result\n",
+    "    * if the |t| value is *larger* than the cutoff value for the given $\\alpha$ level, then $H_0$ should probably be rejected. \n",
+ " * if the p-value is *below* the significance level you have chosen, $p \\lt \\alpha$, then it means $H_0$ should probably be rejected."
]
},
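+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The steps above can be sketched end-to-end for an upper-tailed test. The code below is a minimal example on simulated data (the sample and the values of `mu` and `alpha` are hypothetical):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 1. Descriptive statistics of a simulated sample\n",
+    "sample = np.random.normal(5.7, 0.5, 50)\n",
+    "print(\"sample mean m = {:.3f}\".format(np.mean(sample)))\n",
+    "\n",
+    "# 2. Hypotheses (upper-tailed): H0: m <= mu, Ha: m > mu\n",
+    "mu = 5.552\n",
+    "\n",
+    "# 3. Significance level\n",
+    "alpha = 0.05\n",
+    "\n",
+    "# 4. Cutoff value for this alpha level\n",
+    "cutoff = stats.t.ppf(1-alpha, len(sample)-1)\n",
+    "\n",
+    "# 5. Compute and interpret the t-test\n",
+    "t, p_two = stats.ttest_1samp(sample, mu)\n",
+    "p_one = p_two/2 if t > 0 else 1 - p_two/2\n",
+    "print(\"t = {:.3f}, cutoff = {:.3f}, one-sided p = {:.3f}\".format(t, cutoff, p_one))\n",
+    "print(\"Reject H0\" if t > cutoff else \"Cannot reject H0\")"
+   ]
+  },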
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"\n",
"---\n",
"\n",
"
Bibliography
\n",
"\n",
"[1] E. Anderson (1935). \"The Irises of the Gaspe Peninsula.\" Bulletin of the American Iris Society 59: 2–5.\n",
"\n",
"[2] R. A. Fisher (1936). \"The use of multiple measurements in taxonomic problems\". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x\n",
"\n",
"More about the Iris Dataset on Wikipedia: https://en.wikipedia.org/wiki/Iris_flower_data_set\n",
"\n",
"*Please note that the datasets used in this notebook have been generated using a random generator, it does not come from real measurement and cannot be used for any research purpose.*"
]
}
],
"metadata": {
+ "celltoolbar": "Edit Metadata",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.9"
+ "version": "3.8.3"
},
"toc-autonumbering": true
},
"nbformat": 4,
"nbformat_minor": 4
}
diff --git a/figs/miracle-gro-plant-flower-fertilizer.jpg b/figs/miracle-gro-plant-flower-fertilizer.jpg
new file mode 100644
index 0000000..5364478
Binary files /dev/null and b/figs/miracle-gro-plant-flower-fertilizer.jpg differ
diff --git a/figs/sample-plot-c.png b/figs/sample-plot-c.png
new file mode 100644
index 0000000..d06cb4c
Binary files /dev/null and b/figs/sample-plot-c.png differ
diff --git a/figs/sample-plot-e.png b/figs/sample-plot-e.png
new file mode 100644
index 0000000..8e5f63f
Binary files /dev/null and b/figs/sample-plot-e.png differ
diff --git a/iris-sample-vullierens-1.csv b/iris-sample-vullierens-1.csv
new file mode 100644
index 0000000..d6c3da0
--- /dev/null
+++ b/iris-sample-vullierens-1.csv
@@ -0,0 +1,51 @@
+petal_length_sample1
+5.090980681
+5.224431282
+7.25161967
+5.607932251
+6.118801442
+6.35250702
+4.896926283
+5.220963626
+6.235352422
+6.200243713
+5.42281245
+5.296982791
+4.694441479
+5.911687455
+5.958682935
+5.764168553
+6.035652869
+6.848298609
+6.286982307
+5.117292272
+4.918408193
+5.663513771
+6.056574275
+6.075640695
+5.619982249
+6.091000498
+5.621478087
+5.20792709
+5.410301802
+5.714093194
+5.601680818
+5.706328904
+5.536061308
+5.742188216
+5.496693462
+5.520261845
+4.736357129
+5.445665795
+5.818556633
+6.115244705
+6.010444064
+5.692230766
+5.477746447
+5.620406098
+5.936959746
+6.194875692
+6.349759588
+4.781600864
+5.692977066
+6.260550251
\ No newline at end of file
diff --git a/lib/dataanalysis.py b/lib/dataanalysis.py
index eff6e9b..36ba204 100644
--- a/lib/dataanalysis.py
+++ b/lib/dataanalysis.py
@@ -1,201 +1,601 @@
import numpy as np
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import HBox, VBox, Label, Layout
import ipywidgets as widgets
from IPython.display import IFrame
from IPython.display import set_matplotlib_formats, display, Math, Markdown, Latex, HTML
set_matplotlib_formats('svg')
# Enable interactive backend for matplotlib
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import matplotlib.patches as pat
import matplotlib.ticker as ticker
plt.style.use('seaborn-whitegrid') # global style for plotting
+from matplotlib.ticker import MultipleLocator
+
import scipy.stats as stats
def visualize_ttest(sample_size, alpha, t):
# Create the t-test visualization
fig, ax = plt.subplots(figsize=(12, 4))
ax.set_title("Probability distribution of all possible sample means if $H_0$ is true")
# Let's plot the T distribution for this sample size
tdist = stats.t(df=sample_size, loc=0, scale=1)
x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100)
y = tdist.pdf(x)
ax.plot(x, y, color='black', linewidth=1)
# Polish the look of the graph
ax.get_yaxis().set_visible(False) # hide the y axis
ax.set_ylim(bottom=0)
ax.grid(False) # hide the grid
ax.spines['top'].set_visible(False) # hide the frame except bottom line
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
# Plot the rejection zone two tailed
x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100)
x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100)
y_zone_1 = tdist.pdf(x_zone_1)
y_zone_2 = tdist.pdf(x_zone_2)
ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', label = r'rejection of $H_0$ with $\alpha={}$'.format(alpha))
ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red')
# Plot the t-test stat
ax.axvline(x=t, color='firebrick', linestyle='dashed', linewidth=1)
ax.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
# Add a legend
ax.legend()
# Display the graph
plt.show()
-
def visualize_ttest_pvalue(sample_size, alpha, t, p):
# Create the t-test visualization
fig, ax = plt.subplots(figsize=(12, 4))
ax.set_title("Probability distribution of all possible sample means if $H_0$ is true")
# Let's plot the T distribution for this sample size
- tdist = stats.t(df=sample_size, loc=0, scale=1)
+ tdist = stats.t(df=sample_size-1, loc=0, scale=1)
x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100)
y = tdist.pdf(x)
ax.plot(x, y, color='black', linewidth=1)
# Polish the look of the graph
ax.get_yaxis().set_visible(False) # hide the y axis
ax.set_ylim(bottom=0)
ax.grid(False) # hide the grid
ax.spines['top'].set_visible(False) # hide the frame except bottom line
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
# Plot the rejection zone two tailed
x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100)
x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100)
y_zone_1 = tdist.pdf(x_zone_1)
y_zone_2 = tdist.pdf(x_zone_2)
ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', label = r'rejection of $H_0$ with $\alpha={}$'.format(alpha))
ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red')
# Plot the t-test stats
ax.axvline(x=t, color='firebrick', linestyle='dashed', linewidth=1)
ax.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
# Plot the p-value
if t >= 0: x_t = np.linspace(t, tdist.ppf(0.9999), 100)
else: x_t = np.linspace(tdist.ppf(0.0001), t, 100)
y_t = tdist.pdf(x_t)
ax.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p))
# Add a legend
ax.legend()
# Display the graph
plt.show()
+def visualize_ttest2():
+ return True
+
+
+def draw_sample(sample_size, mu=5.552, sigma=0.56068):
+ sample_data = sigma * np.random.randn(sample_size) + mu
+ return sample_data
+
+def plot_sample_histogram(sample, mu, title, color="grey"):
+
+ plt.title("Histogram of " + title)
+ plt.hist(sample, color="lightgrey")
+ #plt.xticks(np.arange(4.6, 7.2, 0.2))
+
+ # Add a vertical line for the population mean
+ plt.axvline(x=mu, color='black', linestyle='-.', linewidth=2,
+ label="population mean $\mu$")
+ # Add a vertical line for the Vuillerens sample mean
+ plt.axvline(x=np.mean(sample), color=color, linestyle='-.', linewidth=2,
+ label= title + " mean $m$")
+ plt.legend()
+
+
+def plot_t_distribution(df, alpha=None, tail="two", loc=0, scale=1):
+
+ fig, ax = plt.subplots(figsize=(10, 4))
+
+ plt.title('t distribution for {:} degrees of freedom'.format(df))
+ plt.xlabel('t')
+ plt.ylabel('Probability density')
+
+ tdist = stats.t(df=df, loc=loc, scale=scale)
+
+ # Get 100 values along the x axis from the least probable t value (0.0001) to the most probable t value (0.9999)
+ x = (np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100))
+
+ # Plot the corresponding probabilities to get these t-values
+ ax.plot(x, tdist.pdf(x), color="firebrick",linestyle='-', lw=1, alpha=1) # label='t[{:}]'.format(df)
+ ax.grid(b=None, which = 'major', axis='y')
+ ax.set_ylim(bottom=0)
+
+    if alpha is not None:
+
+ if (tail=="two"):
+
+ low_cutoff = tdist.ppf(alpha/2)
+ low_p = tdist.pdf(low_cutoff)
+
+ ax.axvline(x=low_cutoff, color='firebrick', linestyle='-.', linewidth=1)
+ ax.annotate("Cutoff $t$={:.3f}\nfor $\\alpha/2$={:.3f}".format(low_cutoff,alpha/2),
+ xy=(low_cutoff, low_p),
+ xytext=(-80, 10),
+ textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
+
+ high_cutoff = tdist.ppf(1-alpha/2)
+ high_p = tdist.pdf(high_cutoff)
+ ax.axvline(x=high_cutoff, color='firebrick', linestyle='-.', linewidth=1)
+ ax.annotate("Cutoff $t$={:.3f} \nfor $1-\\alpha/2$={:.3f}".format(high_cutoff,1-alpha/2),
+ xy=(high_cutoff, high_p),
+ xytext=(5, 10),
+ textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
+
+ x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100)
+ x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100)
+ y_zone_1 = tdist.pdf(x_zone_1)
+ y_zone_2 = tdist.pdf(x_zone_2)
+ ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', zorder=10)
+ ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red', zorder=10)
+
+ elif (tail == "lower"):
+ low_cutoff = tdist.ppf(alpha)
+ low_p = tdist.pdf(low_cutoff)
+
+ ax.axvline(x=low_cutoff, color='firebrick', linestyle='-.', linewidth=1)
+ ax.annotate("Cutoff $t$={:.3f} \nfor $\\alpha$={:.3f}".format(low_cutoff,alpha),
+ xy=(low_cutoff, low_p),
+ xytext=(-80, 10),
+ textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
+
+ x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha), 100)
+ y_zone_1 = tdist.pdf(x_zone_1)
+ ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', zorder=10)
+
+
+ elif (tail == "upper"):
+ high_cutoff = tdist.ppf(1-alpha)
+ high_p = tdist.pdf(high_cutoff)
+ ax.axvline(x=high_cutoff, color='firebrick', linestyle='-.', linewidth=1)
+ ax.annotate("Cutoff $t$={:.3f}\nfor $1-\\alpha$={:.3f}".format(high_cutoff,1-alpha),
+ xy=(high_cutoff, high_p),
+ xytext=(5, 10),
+ textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
+
+ x_zone_2 = np.linspace(tdist.ppf(1-alpha), tdist.ppf(0.9999), 100)
+ y_zone_2 = tdist.pdf(x_zone_2)
+ ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red', zorder=10)
+
+ return ax
+
+def plot_t_test(sample, mu, alpha=None, tail="two"):
+
+ sample_size = np.size(sample)
+ df = sample_size - 1
+
+ ax = plot_t_distribution(df, alpha=alpha, tail=tail)
+ t, p = stats.ttest_1samp(sample, mu)
+
+ tdist = stats.t(df=df, loc=0, scale=1)
+
+ if (tail == "two"):
+ cutoff = tdist.ppf(alpha/2)
+ else:
+ cutoff = tdist.ppf(alpha)
+
+ cutoff_y = tdist.pdf(cutoff)
+
+
+ if (not tail == "two"):
+ if (tail=="lower" and t < 0):
+ p = p/2
+ elif (tail == "lower" and t > 0):
+ p = 1 - p/2
+ elif(tail == "upper" and t < 0):
+ p = 1 - p/2
+ elif(tail == "upper" and t > 0):
+ p = p/2
+
+
+
+ # Plot the p-value
+
+ x_t = np.linspace(t, tdist.ppf(0.9999), 100)
+
+ if (tail =="two" and t < 0):
+ x_t = np.linspace(-t, tdist.ppf(0.9999), 100)
+
+ y_t = tdist.pdf(x_t)
+ if (tail == "two" or tail == "upper"):
+ ax.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p))
+
+ x_t = np.linspace(tdist.ppf(0.0001), t, 100)
+
+ if (tail=="two" and t >=0):
+ x_t = np.linspace(tdist.ppf(0.0001), -t, 100)
+
+ y_t = tdist.pdf(x_t)
+ if (tail == "two" or tail == "lower"):
+ ax.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p))
+
+ ax.axvline(x=t, color='firebrick', linestyle='-', linewidth=1)
+
+ if (tail =="two"):
+ ## Add a small vertical segment for the side of the hatching which is not on the side of t
+ point1 = [-t, 0]
+ point2 = [-t, tdist.pdf(-t)]
+
+ x_values = [point1[0], point2[0]]
+ y_values = [point1[1], point2[1]]
+ ax.plot(x_values, y_values, color="firebrick", linestyle='-', linewidth=1)
+
+
+
+ if (p < 0.3):
+ annotation_y = p
+ else:
+ annotation_y = 0.3
+
+ annotation_offset = 0
+ # are they overlapping ?
+ if (np.abs(cutoff_y - annotation_y) < 0.05):
+ if (annotation_y > cutoff_y):
+ annotation_offset = 25
+ else:
+ annotation_offset = -5
+
+ ax.annotate('t-test $t$={:.3f}, p={:.3f}'.format(t,p),
+ xy=(t, annotation_y),
+ xytext=(5, annotation_offset),
+ textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
+
+
+
+
+def plot_mean_distribution(mu, alpha=None, means=None, sample_size=50):
+
+ plt.figure(figsize=(8, 4))
+
+ plt.title('Distribution of sample means for a population with $\mu$= {:}'.format(mu))
+ plt.xlabel('Mean of samples')
+ plt.ylabel('Count')
+
+    if means is None:
+
+ x = np.linspace(5.2, 5.9, 100)
+
+ loc = mu
+ scale = 0.0796
+ df = sample_size - 1
+
+ tdist = stats.t.pdf(x, sample_size-1, loc=mu, scale=0.0796) ## * (prop_tot * (n_samples / len(nn)))
+ scale_factor = 1.0
+ frame1 = plt.gca()
+ frame1.plot(x, tdist, color='black', linestyle='-', lw=1, alpha=1);
+ frame1.set_ylim([0,6])
+ frame1.axes.get_yaxis().set_visible(False)
+
+ # In black the theoretical population mean mu
+ plt.axvline(x=mu, color='black', linestyle='-.', linewidth=2)
+
+ # Display mu
+ frame1.annotate("$\mu$={:.3f}".format(mu),
+ xy=(mu, 5.5),
+ xytext=(-20, 0),
+ textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "black", alpha = 0.8))
+
+ else:
+ # We get a list of means
+
+ # A histogram of the means
+ (nn,bb,pp) = plt.hist(means, color="lightgrey", bins=100, density=False);
+
+
+ # Add the t-distribution
+ # The positions in the histogram
+        x = np.linspace(bb[1], bb[len(bb)-1], len(bb)-1)
+
+ frame1 = plt.gca()
+
+ loc = mu
+ scale = np.std(means, ddof=1)
+ df = sample_size - 1
+
+ tdist = stats.t.pdf(x, df, loc=mu, scale=scale) ## * (prop_tot * (n_samples / len(nn)))
+
+ scale_factor = np.size(means) / np.sum(tdist)
+ tdist = tdist * scale_factor
+
+ plt.plot(x, tdist, color='black', linestyle='-', lw=1, alpha=1);
+
+ frame1.axes.get_yaxis().set_visible(False)
+ frame1.set_ylim(bottom=0)
+
+ # In black the theoretical population mean mu
+ plt.axvline(x=mu, color='black', linestyle='-.', linewidth=2)
+
+ # Display mu
+ frame1.annotate("$\mu$={:.3f}".format(mu),
+ xy=(mu, np.max(nn)),
+ xytext=(-20, 0),
+ textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "black", alpha = 0.8))
+
+    if alpha is not None:
+ df = sample_size - 1
+ add_rejection_zones(frame1, loc, scale, df, alpha, scale_factor)
+
+def add_rejection_zones(ax, loc, scale, df, alpha, scale_factor):
+
+ mytdist = stats.t(df=df, loc=loc, scale=scale)
+ # Plot the rejection zone two tailed
+ x_zone_1 = np.linspace(mytdist.ppf(0.0001), mytdist.ppf(alpha/2), 100)
+ x_zone_2 = np.linspace(mytdist.ppf(1-alpha/2), mytdist.ppf(0.9999), 100)
+ y_zone_1 = mytdist.pdf(x_zone_1) * scale_factor
+ y_zone_2 = mytdist.pdf(x_zone_2) * scale_factor
+ ax.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', zorder=10)
+ ax.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red', zorder=10)
+
+
+
+def plot_n_and_t(sample_mean, sample_std, mu, from_n, to_n, step_n, plotp=False, df=49, alpha=0.05):
+
+ ######
+ # Compute t-values (t) for different values of sample size (n)
+ # by using the Vuillerens sample_mean and sample_std.
+
+ nvalues = []
+ tvalues = []
+ pvalues = []
+
+ # For sample sizes ranging from 10 to 60 in steps of 5
+ for n in np.arange(from_n, to_n, step_n):
+
+ # Compute the t-value if the sample size was n and with the
+ # sample_mean and sample_std from the Vuillerens sample
+ t = (sample_mean - mu) / (sample_std/np.sqrt(n))
+
+        # Compute the corresponding two-tailed p-value
+        df = n-1
+        p = stats.t.sf(np.abs(t), df) * 2
+
+        # Collect the n, t and p values
+
+ pvalues.append(p)
+ nvalues.append(n)
+ tvalues.append(t)
+
+ ######
+ # Plot the relation between sample size and t-value and p-value
+ fig, ax = plt.subplots(figsize=(8, 4))
+
+ # Set the axes
+ plt.title('$|t|$ as a function of sample size (from {:} to {:})'.format(from_n, to_n))
+
+ ax.set_xlabel('Sample size', color="black")
+ ax.set_ylabel('|t|', color="firebrick")
+ ax.tick_params(axis='y', labelcolor="firebrick")
+
+ spacing = step_n # This can be your user specified spacing.
+ minorLocator = MultipleLocator(spacing)
+ # Set minor tick locations.
+ ax.xaxis.set_minor_locator(minorLocator)
+ # Set grid to use minor tick locations.
+ ax.grid(which = 'minor', axis = 'x', linestyle='dotted')
+ ax.grid(which = 'major', axis='x', linestyle='-')
+ ax.grid(b=None, which = 'major', axis='y')
+
+
+ # Plot the t-values that correspond to different samples sizes
+ ax.plot(nvalues, np.abs(tvalues), color="firebrick")
+
+ tdist = stats.t(df=50-1, loc=0, scale=1)
+ cutoff = tdist.ppf(1-alpha/2)
+
+
+ # Draw a horizontal line for the cutoff point at alpha = 0.025 (upper tail)
+ # If our t is negative, the cutoff point can also be negative (lower tail)
+
+ plt.rcParams['path.sketch'] = (1, 100, 2)
+ ax.axhline(y=2.01, color='firebrick', linestyle='-.', linewidth=1);
+ plt.rcParams['path.sketch'] = None
+    ax.annotate('cutoff $t_{\\alpha=0.05} \\approx 2.01$', xy=(10, 2.01), xytext=(10, 0), textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
+
+
+ plt.rcParams['path.sketch'] = (1, 100, 2)
+    ax.axhline(y=2.68, color='firebrick', linestyle='-.', linewidth=1);
+    plt.rcParams['path.sketch'] = None
+    ax.annotate('cutoff $t_{\\alpha=0.01} \\approx 2.68$', xy=(10, 2.68), xytext=(10, 0), textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
+
+
+
+ if (plotp):
+
+ ax2 = ax.twinx()
+
+ # Plot the p-values that correspond to different sample sizes
+ ax2.plot(nvalues, pvalues, color="blue")
+
+ ax2.set_xlabel("Sample Size")
+ ax2.set_ylabel('p-value', color="blue")
+ ax2.tick_params(axis='y', labelcolor="blue")
+
+ ax2.grid(which = 'major', axis='x', linestyle='-')
+
+
+ ax2.axhline(y=0.05, color='blue', linestyle='-.', linewidth=1);
+ ax2.annotate('$\\alpha = 0.05$', xy=(to_n, 0.05), xytext=(-40, 0), textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "blue", alpha = 0.8))
+
+ ax2.axhline(y=0.01, color='blue', linestyle='-.', linewidth=1);
+ ax2.annotate('$\\alpha = 0.01$', xy=(to_n, 0.01), xytext=(-40, 0), textcoords='offset points',
+ bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "blue", alpha = 0.8))
+
+
+
+
+
+def one_sample_one_tailed(sample_data, mu, alpha=0.05, alternative='greater'):
+    # ttest_1samp returns a two-tailed p-value: halve it for a one-tailed test
+    t, p = stats.ttest_1samp(sample_data, mu)
+    p = p/2
+    print('t:', t)
+    print('p:', p)
+    if alternative == 'greater' and (p < alpha) and t > 0:
+        print('Reject Null Hypothesis for greater-than test')
+    if alternative == 'less' and (p < alpha) and t < 0:
+        print('Reject Null Hypothesis for less-than test')
###### TO DELETE
def build_ttest_visualization(ttest_result, alpha):
# Extract information from the result of the t-test
n = round(ttest_result.loc["T-test","dof"])
t = ttest_result.loc["T-test","T"]
p = ttest_result.loc["T-test","p-val"]
d = ttest_result.loc["T-test","cohen-d"]
# Create the figure
fig = plt.figure(figsize=(14, 4))
### 1. Create the t-test visualization
ax1 = plt.subplot(121)
ax1.set_title("Result of the t-test")
# Let's plot the T distribution for this sample size
tdist = stats.t(df=n, loc=0, scale=1)
x = np.linspace(tdist.ppf(0.0001), tdist.ppf(0.9999), 100)
y = tdist.pdf(x)
ax1.plot(x, y, color='black', linewidth=1)
# Polish the look of the graph
ax1.get_yaxis().set_visible(False) # hide the y axis
ax1.set_ylim(bottom=0)
ax1.grid(False) # hide the grid
ax1.spines['top'].set_visible(False) # hide the frame except bottom line
ax1.spines['right'].set_visible(False)
ax1.spines['left'].set_visible(False)
# Plot the rejection zone two tailed
x_zone_1 = np.linspace(tdist.ppf(0.0001), tdist.ppf(alpha/2), 100)
x_zone_2 = np.linspace(tdist.ppf(1-alpha/2), tdist.ppf(0.9999), 100)
y_zone_1 = tdist.pdf(x_zone_1)
y_zone_2 = tdist.pdf(x_zone_2)
ax1.fill_between(x_zone_1, y_zone_1, 0, alpha=0.3, color='red', label = r'threshold $\alpha={}$'.format(alpha))
ax1.fill_between(x_zone_2, y_zone_2, 0, alpha=0.3, color='red')
# Plot the t-test stats
ax1.axvline(x=t, color='firebrick', linestyle='dashed', linewidth=1)
ax1.annotate('t-test $t$={:.3f}'.format(t), xy=(t, 0), xytext=(-10, 130), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "firebrick", alpha = 0.8))
# Plot the p-value
if t >= 0: x_t = np.linspace(t, tdist.ppf(0.9999), 100)
else: x_t = np.linspace(tdist.ppf(0.0001), t, 100)
y_t = tdist.pdf(x_t)
ax1.fill_between(x_t, y_t, 0, facecolor="none", edgecolor="firebrick", hatch="///", linewidth=0.0, label = r'p-value $p$={:.3f}'.format(p))
# Add a legend
ax1.legend(loc='upper right')
### 2. Create the effect size visualization
ax2 = plt.subplot(122)
ax2.set_title("Effect size")
# Plot the theoretical distribution of first sample
norm = stats.norm(loc=0, scale=1)
x = np.linspace(norm.ppf(0.0001), norm.ppf(0.9999), 100)
y = norm.pdf(x)
ax2.plot(x, y, color='black', alpha=0.3, linewidth=1)
ax2.fill_between(x, y, 0, color='blue', alpha=0.3, label = 'Year 1 (theoretical)')
ax2.axvline(x=0, color='blue', alpha=0.5, linestyle='dashed', linewidth=1)
# Plot the theoretical distribution of second sample (if t > 0 means 2 < 1 so we plot the second sample on the left)
loc_d = -d if t > 0 else d
norm_d = stats.norm(loc=loc_d, scale=1)
x_d = np.linspace(norm_d.ppf(0.0001), norm_d.ppf(0.9999), 100)
y_d = norm_d.pdf(x_d)
ax2.plot(x_d, y_d, color='black', alpha=0.3, linewidth=1)
ax2.fill_between(x_d, y_d, 0, color='green', alpha=0.3, label = 'Year 2 (theoretical)')
ax2.axvline(x=loc_d, color='green', alpha=0.5, linestyle='dashed', linewidth=1)
# Display the value of Cohen's d
max_y = np.max(y)+.02
ax2.plot([0,loc_d], [max_y, max_y], color='red', linewidth=1, marker=".")
ax2.annotate("effect size $d$={:.3f}".format(d), xy=(loc_d, max_y), xytext=(15, -5), textcoords='offset points', bbox=dict(boxstyle="round", facecolor = "white", edgecolor = "red", alpha = 0.8))
# Polish the look of the graph
ax2.get_yaxis().set_visible(False) # hide the y axis
ax2.set_ylim(bottom=0)
ax2.grid(False) # hide the grid
ax2.spines['top'].set_visible(False) # hide the frame except bottom line
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
# Add a legend
ax2.legend(loc='upper left')
# Display the graph
plt.subplots_adjust(wspace=.1)
plt.show()
# EOF