Hey, I’m Kay! This guide provides an introduction to the fundamental concepts of and relationships between hypothesis testing, effect size, and power analysis, using the one-sample z-test as a prime example. While the primary goal is to elucidate the idea behind hypothesis testing, this guide does try to carefully derive the math details behind the test in the hope that this makes the ideas clearer. DISCLAIMER: It’s important to mention that the one-sample z-test is rarely used due to its restrictive assumptions. As such, there are limited resources on the subject, compelling me to derive most of the formulas, particularly those related to power, on my own. This self-reliance might increase the likelihood of errors. If you detect any inaccuracies or inconsistencies, please don’t hesitate to let me know, and I’ll make the necessary updates. Happy learning! ;)
In a single-sample z-test, our data generating process (DGP) assumes that our observations of a random variable \(X\) are independent and identically distributed (i.i.d.), drawn from one distribution with mean \(\mu\) and variance \(\sigma^2\).
Important Notation:
Here we use the capital \(\bar{X}\) to denote the sample mean as a random variable, and \(X_i\) to denote each element of the sample, also as a random variable.
Later, when we have an actual observed sample, we use the lower-case \(x_i\) to denote each observation (realization) of the random variable \(X_i\), calculate the observed sample mean \(\bar{x}\), and treat it as a realization of the sample mean \(\bar{X}\).
The sample mean is defined below. As indicated in the previous guide, the sample mean is an unbiased estimator of the population expectation under the i.i.d. assumption.
\[\bar{X} = \frac{\sum^n_{i=1} X_i}{n}\]
The expectation of the sample mean should be:
\[ \begin{align*} E(\bar{X}) =& E(\frac{1}{n} \cdot \sum^n_{i=1}(X_i)) \\ =& \frac{1}{n} \cdot \sum^n_{i=1}E(X_i)\\ =&\frac{1}{n}\cdot n \cdot \mu\\ =& \mu \end{align*} \]
and the variance of the sample mean would be:
\[ \begin{align*} Var(\bar{X}) =& Var(\frac{1}{n} \cdot \sum^n_{i=1}(X_i))\\ =& \frac{1}{n^2} \cdot \sum^n_{i=1} Var(X_i)\\ =&\frac{1}{n^2} \cdot n \cdot \sigma^2\\ =& \frac{\sigma^2}{n}\\[2ex] *\text{Note: } & Var(X_1 +X_2) = Var(X_1) + Var(X_2) + 2 \cdot Cov(X_1, X_2)\\ &\text{As the observations are drawn independently, } Cov(X_1, X_2) =0, \\ &Var(X_1 +X_2) = Var(X_1) + Var(X_2)\\ \end{align*} \]
More importantly, according to the Central Limit Theorem (CLT), even though we did not specify the original distribution of \(X\), as long as that distribution has a finite variance, the distribution of \(\bar{X}\) becomes approximately normal as n becomes sufficiently large (rule of thumb: n > 30):
\[\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\]
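The mean and variance of \(\bar{X}\) derived above can be checked with a small simulation. The sketch below is only an illustration under assumed settings that are not part of this guide: an Exponential(1) population (so \(\mu = 1\) and \(\sigma^2 = 1\)), n = 50, 5000 replications, and a fixed seed. The function name `simulate_sample_means` is made up for this example.

```python
import random
from statistics import fmean, pvariance


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size n from Exponential(rate=1); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


# Exponential(1) is clearly non-normal, with mu = 1 and sigma^2 = 1.
n, reps = 50, 5000
means = simulate_sample_means(n, reps)

# The average of the sample means should be close to mu = 1,
# and their variance should be close to sigma^2 / n = 0.02.
print("mean of sample means:    ", round(fmean(means), 4))
print("variance of sample means:", round(pvariance(means), 4))
```

A histogram of `means` would look roughly bell-shaped even though the underlying population is skewed, which is the CLT at work.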
Given the nature of the normal distribution, we know the probability density function of \(\bar{X}\) would be
\[f_{pdf}(\bar{X}|\mu, \sigma, n) = \frac{1}{\left(\frac{\sigma}{\sqrt{n}}\right)\sqrt{2\pi}} \cdot \exp\left[-\frac{(\bar{X}-\mu)^2}{2 \cdot \left(\frac{\sigma^2}{n}\right)}\right]\]
This can be tedious to calculate so we could standardize the normal distribution to a standard normal distribution (\(N(0, 1)\)).
\[ Z = (\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}) = (\frac{\sqrt{n} \cdot (\bar{X} - \mu)}{\sigma})\sim N(0, 1)\\ \]
Important Notation: Similar to \(\bar{X}\) and \(\bar{x}\), we use \(Z\) to refer to the random variable and \(z\) to refer to the observation from a fixed sample.
We can also get the theoretical probability that \(Z\) falls within an interval from the distribution:
\[ Pr(z_{min} < Z < z_{max}) = \Phi(z_{max}) - \Phi(z_{min})\\[2ex] \text{where } \Phi(k) = \int^k_{-\infty} f_{pdf}(Z|\mu, \sigma,n)\ dZ\\[2ex] f_{pdf}(Z|\mu, \sigma,n) = \frac{1}{\sqrt{2\pi}} \cdot exp(-\frac{1}{2}Z^2)\\[2ex] Z|\mu, \sigma,n = \frac{\sqrt{n} \cdot (\bar{X} - \mu)}{\sigma} \]
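As a minimal sketch of the standardization and interval probability above, the snippet below uses `statistics.NormalDist` from the Python standard library for \(\Phi\). The function names and the example numbers (\(\bar{x} = 103\), \(\mu = 100\), \(\sigma = 15\), n = 25) are assumptions chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)  # the standard normal N(0, 1)


def z_statistic(x_bar, mu, sigma, n):
    """Standardize a sample mean: z = sqrt(n) * (x_bar - mu) / sigma."""
    return sqrt(n) * (x_bar - mu) / sigma


def interval_probability(z_min, z_max):
    """Pr(z_min < Z < z_max) = Phi(z_max) - Phi(z_min)."""
    return STD_NORMAL.cdf(z_max) - STD_NORMAL.cdf(z_min)


# Hypothetical numbers: sqrt(25) * (103 - 100) / 15 = 1.0
print(z_statistic(x_bar=103, mu=100, sigma=15, n=25))

# The familiar "about 95% within 1.96 standard errors" statement
print(interval_probability(-1.96, 1.96))
```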
For a one-sample Z-test, we assume we know the variance parameter \(\sigma^2\) of our data generating distribution (a very unrealistic assumption, but let’s stick with it for now).
Given a sample, we also know the sample size n and the observed sample mean \(\bar{x}\) (remember that we use lower case so that it doesn’t get confused with the sample mean \(\bar{X}\), which we view as a random variable in our DGP).
The aim of our hypothesis testing is then, given our knowledge of \(\sigma\), n, and \(\bar{x}\), to test a hypothesis about the population mean \(\mu\). Specifically, the null hypothesis (\(H_0\)) states one of the following:
\(\mu = \mu_{H_0}\) (a two-tailed test)
\(\mu \leq \mu_{H_0}\) (a right-tailed test, with alternative \(\mu > \mu_{H_0}\))
\(\mu \geq \mu_{H_0}\) (a left-tailed test, with alternative \(\mu < \mu_{H_0}\))
We decide whether to reject the null hypothesis using the following logic: if, given that the null hypothesis is true, the probability of getting a sample mean \(\bar{X}\) (or its corresponding test statistic \(Z\)) that is as extreme as or more extreme than the observed sample mean \(\bar{x}\) (or its corresponding test statistic \(z\)) is smaller than some threshold (\(\alpha\)), we would rather believe that the null hypothesis is not true.
The p-value represents the probability of observing a test statistic \(Z = \frac{\sqrt{n} \cdot (\bar{X} - \mu_0)}{\sigma}\) as extreme as, or more extreme than, the one computed from the sample \(z = \frac{\sqrt{n} \cdot (\bar{x} - \mu_0)}{\sigma}\), given that the null hypothesis is true (here \(\mu_0\) is shorthand for \(\mu_{H_0}\)).
The threshold we set is called the significance level, denoted as \(\alpha\). As we reject the null if the p-value is below \(\alpha\), this also means that, when the null is true, we have a probability of \(\alpha\) of falsely rejecting it because our observed sample happens to be extreme (known as a Type I error).
Moreover, given the distribution under the null, \(\alpha\) corresponds to specific value(s) of \(z\) called the critical value(s), which we can denote as \(z_c\).
There are two practical (and equivalent) ways to conduct this hypothesis test: we can either calculate the p-value and compare it to \(\alpha\), or compare the test statistic \(z\) with the critical value \(z_c\).
If we are concerned with the possibility that our actual \(\mu\) is different from (either larger or smaller than) \(\mu_{H_0}\), we are doing a two-tailed test.
For a two-tailed test, when we refer to values that are “as extreme or more extreme” than the observed test statistic, we’re considering deviations in both positive and negative directions from zero.
Therefore, the two-tailed p-value is:
\[ \begin{align*} \text{If}\ z > 0\ \text{and } & \text{alternative hypo: }\ \mu\neq \mu_{H_0}, \\[2ex] p\text{-value} =& P(Z > z) + P(Z < -z)\\ =& (1 - \Phi(z)) + \Phi(-z) =\int^{\infty}_{z} f_{pdf}(Z)\ dZ + \int^{-z}_{-\infty} f_{pdf}(Z)\ dZ,\\[2ex] & \text{As the distribution is symmetric about 0}\\[2ex] =& 2 \cdot P(Z > z) = 2 \cdot (1-\Phi(z)) = 2 \cdot \int^{\infty}_{z}f_{pdf}(Z)\ dZ\\[2ex] =& 2 \cdot P(Z < -z) = 2 \cdot \Phi(-z)= 2 \cdot \int^{-z}_{-\infty}f_{pdf}(Z)\ dZ\\[2ex] & \text{In the absolute sense: }\\[2ex] =& 2 \cdot P(Z > |z|) = 2 \cdot (1-\Phi(|z|)) = 2 \cdot \int^{\infty}_{|z|}f_{pdf}(Z)\ dZ\\[2ex] z = &\frac{\sqrt{n} \cdot (\bar{x} - \mu_0)}{\sigma}\ \text{is calculated from the observed sample} \end{align*} \]
\[ \begin{align*} \text{If}\ z < 0\ \text{and } & \text{alternative hypo: }\ \mu\neq \mu_{H_0}, \\[2ex] p\text{-value} =& P(Z < z) + P(Z > -z) = \Phi(z) + (1-\Phi(-z))=\int^{z}_{-\infty} f_{pdf}(Z)\ dZ + \int^{\infty}_{-z} f_{pdf}(Z)\ dZ,\\[2ex] & \text{As the distribution is symmetric about 0}\\[2ex] =& 2 \cdot P(Z < z) = 2 \cdot \Phi(z) = 2 \cdot \int^{z}_{-\infty}f_{pdf}(Z)\ dZ\\[2ex] =& 2 \cdot P(Z > -z) = 2 \cdot (1-\Phi(-z)) =2 \cdot \int^{\infty}_{-z}f_{pdf}(Z)\ dZ\\[2ex] & \text{In the absolute sense: }\\[2ex] =& 2 \cdot P(Z > |z|) = 2 \cdot (1-\Phi(|z|)) =2 \cdot \int^{\infty}_{|z|}f_{pdf}(Z)\ dZ\\[2ex] z = &\frac{\sqrt{n} \cdot (\bar{x} - \mu_0)}{\sigma}\ \text{is calculated from the observed sample} \end{align*} \]
\[ \text{Overall, for two-tailed test, alternative hypo: } \mu\neq \mu_{H_0}\\[2ex] p\text{-value} = 2 \cdot P(Z > |z|) = 2 \cdot (1-\Phi(|z|)) = 2 \cdot \int^{\infty}_{|z|}f_{pdf}(Z)dZ,\\[2ex] z = \frac{\sqrt{n} \cdot (\bar{x} - \mu_0)}{\sigma}\ \text{is calculated from the observed sample} \]
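The two-tailed p-value formula can be sketched in a few lines of Python. This is only an illustration: the function name `two_tailed_p_value` and the sample numbers (\(\bar{x} = 103\), \(\mu_{H_0} = 100\), \(\sigma = 15\), n = 100) are hypothetical and not taken from any real data set.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def two_tailed_p_value(x_bar, mu_0, sigma, n):
    """p-value = 2 * (1 - Phi(|z|)), with z = sqrt(n) * (x_bar - mu_0) / sigma."""
    z = sqrt(n) * (x_bar - mu_0) / sigma
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


# Hypothetical sample: z = sqrt(100) * (103 - 100) / 15 = 2.0
p = two_tailed_p_value(x_bar=103, mu_0=100, sigma=15, n=100)
print(round(p, 4))  # roughly 0.0455

# Because of the absolute value, a sample mean equally far below mu_0
# gives exactly the same two-tailed p-value.
print(round(two_tailed_p_value(x_bar=97, mu_0=100, sigma=15, n=100), 4))
```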
And if we are only concerned with the possibility that our actual \(\mu\) is larger (or smaller) than \(\mu_{H_0}\), we are doing a one-tailed test.
For a one-tailed test, when we refer to values that are “as extreme or more extreme” than the observed test statistic, we’re considering deviations only in one direction from zero.
Therefore, the one-tailed p-value is:
\[ p\text{-value}= \begin{cases} P(Z > z) = 1 - \Phi(z)=\int^{\infty}_{z} f_{pdf}(Z)\ dZ,\quad \text{alternative hypo: } \mu> \mu_{H_0}\\[2ex] P(Z < z) = \Phi(z)= \int^{z}_{-\infty} f_{pdf}(Z)\ dZ, \quad \text{alternative hypo: } \mu < \mu_{H_0}\\[2ex] \end{cases} \\[2ex] z = \frac{\sqrt{n} \cdot (\bar{x} - \mu_0)}{\sigma}\ \text{is calculated from the observed sample} \]
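A minimal sketch of the one-tailed p-values follows. The function name and the labels `"greater"` / `"less"` for the two alternatives are naming assumptions made for this example, and z = 2.0 is an arbitrary observed statistic.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def one_tailed_p_value(z, alternative):
    """One-tailed p-value for an observed z.

    alternative="greater" tests mu > mu_H0: p = 1 - Phi(z)
    alternative="less"    tests mu < mu_H0: p = Phi(z)
    """
    if alternative == "greater":
        return 1 - STD_NORMAL.cdf(z)
    if alternative == "less":
        return STD_NORMAL.cdf(z)
    raise ValueError(f"unknown alternative: {alternative!r}")


z = 2.0  # arbitrary observed statistic
print(round(one_tailed_p_value(z, "greater"), 4))  # small: z lies in the right tail
print(round(one_tailed_p_value(z, "less"), 4))     # large: z is on the "wrong" side
```

For the same observed \(z\), the two one-tailed p-values always sum to 1, so a result that is extreme in one direction is unremarkable in the other.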
If the p-value is smaller than our significance level \(\alpha\), we can reject the null.
That is:
\[p\text{-value}(z) < \alpha \Rightarrow \text{reject } H_0: \mu = \mu_{H_0}\]
Alternatively, we could choose not to calculate the p-value for our observed \(z\), but instead compare our \(z\) to the critical z value(s) corresponding to our \(\alpha\).
Under a two-tailed test, rejecting when the p-value is below \(\alpha\) means:
\[ Pr(Z > |z|) < \frac{1}{2}\alpha \]
The critical value \(z_{\alpha/2}\) is defined as:
\[ z_{\alpha/2}= arg_{z_i} \Big[Pr(Z > z_{i}) = \frac{ \alpha}{2} \Big] = \Phi^{-1} \Big(1 -\frac{ \alpha}{2} \Big) \]
Due to the symmetry of the standard normal distribution:
\[ -z_{\alpha/2} = arg_{z_i} \Big[Pr(Z < z_{i}) = \frac{ \alpha}{2} \Big] =\Phi^{-1} \Big(\frac{ \alpha}{2} \Big) \]
Our decision rule then implies:
\[ |z| > z_{\alpha/2},\ \text{if alternative hypo: } \mu \neq \mu_{H_0} \]
Similarly, for a one-tailed test, the critical value \(z_{\alpha}\) is:
\[ z_{\alpha} = \begin{cases} arg_{z_i}[Pr(Z > z_{i}) = \alpha] = \Phi^{-1}(1-\alpha), & \text{if alternative hypo: } \mu> \mu_{H_0}\\[2ex] arg_{z_i}[Pr(Z < z_{i}) = \alpha] = \Phi^{-1}(\alpha), & \text{if alternative hypo: } \mu < \mu_{H_0}\\[2ex] \end{cases} \]
Then, our conditions to reject the null hypothesis are equivalent to:
\[ \begin{cases} z > z_{\alpha}, & \text{if alternative hypo: } \mu> \mu_{H_0}\\[2ex] z < z_{\alpha}, & \text{if alternative hypo: } \mu < \mu_{H_0}\\[2ex] \end{cases}\\[2ex] \]
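The critical values and rejection rules above can be sketched as follows, using `NormalDist.inv_cdf` for \(\Phi^{-1}\). The function names and the string labels for the alternatives are assumptions made for this illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def critical_value(alpha, alternative):
    """Critical z value for 'two-sided', 'greater' or 'less' alternatives."""
    if alternative == "two-sided":
        return STD_NORMAL.inv_cdf(1 - alpha / 2)  # z_{alpha/2}, compared with |z|
    if alternative == "greater":
        return STD_NORMAL.inv_cdf(1 - alpha)      # positive z_alpha
    if alternative == "less":
        return STD_NORMAL.inv_cdf(alpha)          # negative z_alpha
    raise ValueError(f"unknown alternative: {alternative!r}")


def reject_null(z, alpha, alternative):
    """Apply the critical-value decision rule to an observed z."""
    z_c = critical_value(alpha, alternative)
    if alternative == "two-sided":
        return abs(z) > z_c
    if alternative == "greater":
        return z > z_c
    return z < z_c  # alternative == "less"


for alt in ("two-sided", "greater", "less"):
    print(alt, round(critical_value(0.05, alt), 3))

print(reject_null(2.0, 0.05, "two-sided"))  # True: |2.0| exceeds about 1.96
print(reject_null(1.5, 0.05, "greater"))    # False: 1.5 is below about 1.645
```

Comparing \(z\) with the critical value gives the same decision as comparing the p-value with \(\alpha\), since both are monotone transformations of the same tail probability.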
The idea behind effect size is to calculate a statistic that measures how large the difference actually is and that is comparable across different situations.
Our intuitive effect size in the single-sample Z-test might be \(\bar{x} - \mu_0 = \bar{x} - \mu_{H_0}\), given our hypothesized \(\mu_0 = \mu_{H_0}\).
But this statistic is not comparable across situations, as the same difference matters more when the population standard deviation is very small.
So to adjust for this, we could use Cohen’s d: the magnitude of the difference between the sample mean and the hypothetical population mean, relative to the population standard deviation.
\[\text{Cohen's } d = \frac{\bar{x}-\mu_{H_0}}{\sigma}, \ \text{given } H_0:\mu=\mu_{H_0}\] \[ \text{Cohen's } d = \frac{z}{\sqrt{n}},\ \text{if}\ H_0:\mu=\mu_{H_0}\\ \text{given}\ z = \frac{\bar{x} - \mu_{H_0}}{\sigma/\sqrt{n}} =\frac{(\bar{x} - \mu_{H_0})\cdot \sqrt{n}}{\sigma} \]
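To see that the two expressions for Cohen’s d agree, here is a short sketch with hypothetical numbers (\(\bar{x} = 103\), \(\mu_{H_0} = 100\), \(\sigma = 15\), n = 100). The function name `cohens_d` is chosen for illustration only.

```python
from math import sqrt


def cohens_d(x_bar, mu_0, sigma):
    """Cohen's d for a one-sample z-test: (x_bar - mu_0) / sigma."""
    return (x_bar - mu_0) / sigma


# Hypothetical numbers, not from a real data set
x_bar, mu_0, sigma, n = 103, 100, 15, 100

d_direct = cohens_d(x_bar, mu_0, sigma)   # (103 - 100) / 15
z = sqrt(n) * (x_bar - mu_0) / sigma      # observed z statistic
d_from_z = z / sqrt(n)                    # the same d, recovered from z

print(d_direct, d_from_z)
```

Unlike \(z\), the value of d does not grow with n: quadrupling the sample size doubles \(z\) but leaves d unchanged.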
The power is the probability that the Z-test correctly rejects the null (\(H_0: \mu = \mu_{H_0}\)) when it is false. In other words, if \(\mu \neq \mu_{H_0}\), what is our chance of detecting this difference?
Suppose the true expectation is \(\mu_{H_1}\), so the difference between the true expectation and our hypothetical expectation is:
\[ \Delta = \mu_{H_1} - \mu_{H_0} \\ \text{Thus } \mu_{H_0} = \mu_{H_1} - \Delta \] Our original test statistic can be written as:
\[ \begin{align*} Z =& \frac{\sqrt{n} \cdot (\bar{X} - \mu_{H_0})}{\sigma}\\ =& \frac{\sqrt{n} \cdot [\bar{X} - (\mu_{H_1} - \Delta)]}{\sigma}\\ =& \frac{\sqrt{n} \cdot (\bar{X} - \mu_{H_1} + \Delta)}{\sigma}\\ =& \frac{\sqrt{n} \cdot (\bar{X} - \mu_{H_1})}{\sigma} + \frac{\sqrt{n} \cdot \Delta}{\sigma}\\ \end{align*} \]
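The first term of \(Z\) can be seen as the z-statistic computed under the true expectation \(\mu_{H_1}\); let’s denote it as \(Z'\). When the true expectation is \(\mu_{H_1}\), \(Z'\) follows the standard normal distribution \(N(0, 1)\).
Let’s define \(\delta\) as below. \(\delta\) is referred to as the non-centrality parameter (NCP) because it measures how far the distribution of the test statistic \(Z\) under the true expectation is shifted away from the central (standard normal) distribution it follows under the null.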
\[ \delta = \frac{\Delta \sqrt{n}}{\sigma} \]
\[ Z = Z' + \delta \Rightarrow Z'=Z-\delta \]
When the true expectation is \(\mu_{H_1}\), \(Z'\) is standard normal, so the test statistic \(Z = Z' + \delta\) follows \(N(\delta, 1)\). The power is then the probability that \(Z\) falls in the rejection area, which we get by substituting \(Z = Z' + \delta\) into our decision rule above:
For two-tailed test:
\[ \begin{align*} Power =& Pr(|Z| > z_{\alpha/2})\\ =& Pr(Z > z_{\alpha/2}) + Pr(Z < -z_{\alpha/2})\\ =& Pr(Z' + \delta > z_{\alpha/2}) + Pr(Z' + \delta < -z_{\alpha/2})\\ =& Pr(Z' > z_{\alpha/2} - \delta) + Pr(Z' < -z_{\alpha/2} - \delta)\\ =& 1 -\Phi(z_{\alpha/2} - \delta) + \Phi(-z_{\alpha/2} - \delta)\\ & \text{As } \Phi(-k) = 1 - \Phi(k),\\ =& 1 -\Phi(\delta + z_{\alpha/2}) + \Phi(\delta - z_{\alpha/2})\\ & \text{if alternative hypo: } \mu \neq \mu_{H_0}\\ & \delta = \frac{\sqrt{n} \cdot (\mu_{H_1} - \mu_{H_0})}{\sigma} \end{align*} \]
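A minimal sketch of the two-tailed power formula \(1 - \Phi(\delta + z_{\alpha/2}) + \Phi(\delta - z_{\alpha/2})\) is given below. The function name and the scenario (\(\mu_{H_1} = 103\), \(\mu_{H_0} = 100\), \(\sigma = 15\), n = 100) are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def two_tailed_power(mu_1, mu_0, sigma, n, alpha=0.05):
    """Power of the two-tailed one-sample z-test when the true mean is mu_1."""
    delta = sqrt(n) * (mu_1 - mu_0) / sigma      # non-centrality parameter
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    return 1 - STD_NORMAL.cdf(delta + z_crit) + STD_NORMAL.cdf(delta - z_crit)


# Hypothetical scenario: delta = sqrt(100) * 3 / 15 = 2.0
print(round(two_tailed_power(mu_1=103, mu_0=100, sigma=15, n=100), 3))

# Sanity check: if the true mean equals mu_0 (delta = 0), the "power"
# collapses to the Type I error rate alpha.
print(round(two_tailed_power(mu_1=100, mu_0=100, sigma=15, n=100), 3))
```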
For one-tailed test (recall that \(Z'\) is standard normal under the true expectation and the test statistic is \(Z = Z' + \delta\)):
\[ \begin{align*} Power =& \begin{cases} Pr(Z > z_{\alpha}) =Pr(Z' + \delta > z_{\alpha}) =Pr(Z' > z_{\alpha} - \delta) = 1- \Phi(z_{\alpha} - \delta) = \Phi(\delta - z_{\alpha}),\ \text{if alternative hypo: } \mu> \mu_{H_0}\\[2ex] Pr(Z < z_{\alpha}) =Pr(Z' + \delta < z_{\alpha}) =Pr(Z' < z_{\alpha} - \delta) = \Phi(z_{\alpha} - \delta),\ \ \ \ \ \ \ \ \text{if alternative hypo: } \mu< \mu_{H_0}\\[2ex] \end{cases}\\[2ex] \text{with}\ \ \delta =& \frac{\sqrt{n} \cdot (\mu_{H_1} - \mu_{H_0})}{\sigma} \end{align*} \]
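One-tailed power can also be computed directly from its definition: when the true mean is \(\mu_{H_1}\), the test statistic is normal with mean \(\delta\) and unit variance, and power is the probability that it lands in the rejection region. The sketch below takes that route; the function name and the `"greater"` / `"less"` labels are assumptions made for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # distribution of the test statistic under the null


def one_tailed_power(delta, alpha, alternative):
    """Probability that a N(delta, 1) test statistic falls in the rejection region."""
    under_h1 = NormalDist(mu=delta, sigma=1.0)  # test statistic under the alternative
    if alternative == "greater":
        z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # positive critical value
        return 1 - under_h1.cdf(z_alpha)         # Pr(Z > z_alpha)
    if alternative == "less":
        z_alpha = STD_NORMAL.inv_cdf(alpha)      # negative critical value
        return under_h1.cdf(z_alpha)             # Pr(Z < z_alpha)
    raise ValueError(f"unknown alternative: {alternative!r}")


# With delta = 0 the null is true, so the rejection probability is alpha.
print(round(one_tailed_power(0.0, 0.05, "greater"), 3))

# A positive shift raises right-tailed power; the mirrored negative shift
# gives the same left-tailed power.
print(round(one_tailed_power(2.0, 0.05, "greater"), 3))
print(round(one_tailed_power(-2.0, 0.05, "less"), 3))
```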
The post-hoc power analysis estimates, supposing the null hypothesis is false, the probability that the one-sample Z-test would correctly reject the null hypothesis, based on the observed sample mean \(\bar{x}\). The logic behind using the sample mean \(\bar{x}\) is that we do not know the ‘true’ distribution parameter, and the sample mean is the best estimate we have.
\[ \text{When } \mu_{H_1} = \bar{x},\\ \delta = \frac{\sqrt{n} \cdot (\bar{x} - \mu_{H_0})}{\sigma} =z \]
Thus, for a one-sample Z-test, the NCP given observed sample mean \(\bar{x}\) actually is the same as the observed \(z\).
For two-tailed test:
\[ \begin{align*} Power =& Pr(Z' < -z_{\alpha/2} - z) + Pr(Z' > z_{\alpha/2} - z), \quad Z' \sim N(0, 1)\\ =& 1 -\Phi(z + z_{\alpha/2}) + \Phi(z - z_{\alpha/2})\\ \text{where }z &= \frac{\sqrt{n} \cdot (\bar{x} - \mu_{H_0})}{\sigma}\\ & \text{if alternative hypo: } \mu \neq \mu_{H_0}\\ \end{align*} \]
For one-tailed test:
\[ \begin{align*} Power =& \begin{cases} Pr(Z' > z_{\alpha} - z) = 1- \Phi(z_{\alpha} - z) = \Phi(z - z_{\alpha}),\ \text{if alternative hypo: } \mu> \mu_{H_0}\\[2ex] Pr(Z' < z_{\alpha} - z) = \Phi(z_{\alpha} - z),\ \ \ \ \ \ \ \ \text{if alternative hypo: } \mu< \mu_{H_0}\\[2ex] \end{cases}\\[2ex] \text{with}\ \ z &= \frac{\sqrt{n} \cdot (\bar{x} - \mu_{H_0})}{\sigma},\ \ Z' \sim N(0, 1) \end{align*} \]
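Because the NCP equals the observed \(z\) here, the two-tailed post-hoc power depends only on \(z\) and \(\alpha\). The sketch below illustrates this with a hypothetical function name and a few arbitrary \(z\) values.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def post_hoc_power_two_tailed(z, alpha=0.05):
    """Two-tailed post-hoc power, treating the observed z as the NCP."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return 1 - STD_NORMAL.cdf(z + z_crit) + STD_NORMAL.cdf(z - z_crit)


# Arbitrary observed z values, including one right at the critical value
z_crit = STD_NORMAL.inv_cdf(0.975)
for z in (0.5, 1.0, z_crit, 3.0):
    print(round(z, 3), round(post_hoc_power_two_tailed(z), 3))
```

A result sitting exactly at the significance threshold has post-hoc power of about 0.5, and any non-significant result has less. Post-hoc power computed this way is therefore a re-expression of the p-value rather than independent evidence.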
If the Z-test is already significant, a post-hoc power analysis may not be useful, as we have already rejected the null. But if the Z-test is non-significant, a low power may indicate that the null was not rejected only because the test had low power.
The a priori power analysis aims to estimate the sample size n needed to reach a desired power, given an assumed \(\alpha\) and effect size d. As no sample has been observed yet, d is defined from the assumed true expectation \(\mu_{H_1}\) (the value we expect \(\bar{X}\) to take on average).
\[ \text{Cohen's } d = \frac{\mu_{H_1} - \mu_{H_0}}{\sigma}\\ \delta = \frac{\sqrt{n} \cdot (\mu_{H_1} - \mu_{H_0})}{\sigma} = d \cdot \sqrt{n} \]
For a two-tailed test, recall that its power is:
\[ \begin{align*} Power =& Pr(Z' < -z_{\alpha/2} - \delta) + Pr(Z' > z_{\alpha/2} - \delta), \quad Z' \sim N(0, 1)\\ =& 1 -\Phi(\delta + z_{\alpha/2}) + \Phi(\delta - z_{\alpha/2})\\ & \text{if alternative hypo: } \mu \neq \mu_{H_0}\\ \end{align*} \]
Thus, to determine the sample size, we have:
\[ \Rightarrow \Phi(d \cdot \sqrt{n} + z_{\alpha/2}) - \Phi(d \cdot \sqrt{n} - z_{\alpha/2}) = 1 -Power\\ \text{as the cdf of the standard normal distribution is symmetric about the point (0, 0.5)}\\ \Rightarrow \Phi(z_{\alpha/2} + d \cdot \sqrt{n}) + \Phi(z_{\alpha/2} - d \cdot \sqrt{n}) = 2 -Power\\ \text{if alternative hypo: } \mu \neq \mu_{H_0}\\ \]
This equation is a transcendental equation that cannot be solved analytically (using standard algebraic techniques or in terms of elementary functions) but can be solved numerically, so we can rely on computation to solve for n.
The transcendental equation can also be hard to interpret, but some intuition helps. The two terms on the left are the cdf values at two points placed symmetrically around \(Z = z_{\alpha /2}\) (which lies to the right of 0), at a distance of \(d \cdot \sqrt{n}\) on either side. As \(\alpha\) is fixed, we can only decide how far these two points are spread from that center. Because the cdf increases more and more slowly on the right side, the wider the spread, the smaller the sum tends to be, and so the larger the power. If we fix the power, the product \(d \cdot \sqrt{n}\) is fixed, so as the desired effect size d decreases (we want to detect a smaller effect), the sample size needs to increase quadratically (changing d to \(k*d\) changes n to \((1/k^2)*n\)). Similarly, if we fix the effect size d we want to detect, then as the desired power increases (our test becomes more effective at rejecting a false null), the sample size n needs to increase roughly quadratically (not strictly, as \(\Phi^{-1}\) is not linear).
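One way to solve the two-tailed equation numerically is a bracketing search over integer sample sizes, which works because power increases with n for any fixed non-zero d. The sketch below is one possible approach, not the only one; the function names and the example inputs (d = 0.5, power = 0.8, \(\alpha\) = 0.05) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def two_tailed_power(d, n, alpha=0.05):
    """Two-tailed power for effect size d and sample size n (delta = d * sqrt(n))."""
    delta = d * sqrt(n)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 1 - STD_NORMAL.cdf(delta + z_crit) + STD_NORMAL.cdf(delta - z_crit)


def smallest_n_two_tailed(d, power, alpha=0.05):
    """Smallest integer n whose two-tailed power reaches the target."""
    if d == 0 or not 0 < power < 1:
        raise ValueError("need d != 0 and 0 < power < 1")
    lo, hi = 1, 1
    # Double hi until the target power is reached (lo trails one step behind).
    while two_tailed_power(d, hi, alpha) < power:
        lo, hi = hi, hi * 2
    # Bisect on integers: power is increasing in n, so the answer is in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if two_tailed_power(d, mid, alpha) >= power:
            hi = mid
        else:
            lo = mid + 1
    return lo


print(smallest_n_two_tailed(d=0.5, power=0.8))   # 32
print(smallest_n_two_tailed(d=0.25, power=0.8))  # about four times as many
```

Halving d roughly quadruples the required n, matching the quadratic relationship described above (the integer rounding makes it approximate).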
For one-tailed test:
\[ \begin{align*} Power =& \begin{cases} Pr(Z' > z_{\alpha} - \delta) = 1- \Phi(z_{\alpha} - \delta) = \Phi(\delta - z_{\alpha}),\ \text{if alternative hypo: } \mu> \mu_{H_0}\\[2ex] Pr(Z' < z_{\alpha} - \delta) = \Phi(z_{\alpha} - \delta),\ \ \ \ \ \ \ \ \text{if alternative hypo: } \mu< \mu_{H_0}\\[2ex] \end{cases}\\[2ex] \text{with}\ \ Z' \sim&\ N(0, 1) \end{align*} \]
Thus, for a right-tailed test, where we reject when \(Z' + \delta > z_{\alpha}\) with \(\delta = d \cdot \sqrt{n}\), the sample size needed is:
\[ Power = \Phi(d \cdot \sqrt{n} - z_{\alpha}) \\ \Rightarrow d \cdot \sqrt{n} = \Phi^{-1}(Power) + z_{\alpha}\\ \Rightarrow n= \bigg[\frac{\Phi^{-1}(Power)+z_{\alpha}}{d} \bigg]^2 = \frac{[\Phi^{-1}(Power)+z_{\alpha}]^2}{d^2}\\ \text{if alternative hypo: } \mu> \mu_{H_0}\\ \]
Similarly, for a left-tailed test, where we reject when \(Z' + \delta < z_{\alpha}\) (both \(d\) and \(z_{\alpha}\) are negative here), the sample size needed is:
\[ Power = \Phi(z_{\alpha} - d \cdot \sqrt{n}),\\ \Rightarrow d \cdot \sqrt{n} = z_{\alpha} - \Phi^{-1}(Power)\\ \Rightarrow n= \bigg[\frac{z_{\alpha} - \Phi^{-1}(Power)}{d} \bigg]^2 = \frac{[\Phi^{-1}(Power)-z_{\alpha}]^2}{d^2}\\ \text{if alternative hypo: } \mu< \mu_{H_0}\\ \]
These equations are more intuitive. As the effect size we aim to detect decreases, the sample size n needs to increase quadratically (if \(d\) is halved, i.e. becomes \(k * d\) with \(k = 1/2\), n becomes \(4 * n\), i.e. \((1/k^2) * n\)). As the desired power increases or the significance level \(\alpha\) becomes stricter (smaller, so that \(z_{\alpha}\) becomes more positive for a right-tailed test and more negative for a left-tailed test), the sample size n also increases roughly quadratically (not strictly, as \(\Phi^{-1}\) is not linear and the numerator is the square of a sum).
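The closed-form one-tailed sample sizes can be sketched as below. The function name and the example inputs are hypothetical. The sketch assumes d > 0 for the right-tailed case, d < 0 for the left-tailed case, and a target power above \(\alpha\). The result is a real number, so in practice it is rounded up to the next integer.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def one_tailed_sample_size(d, power, alpha=0.05, alternative="greater"):
    """Real-valued n from the closed-form one-tailed formulas.

    Assumes d > 0 for "greater", d < 0 for "less", and power > alpha.
    """
    if alternative == "greater":
        z_alpha = STD_NORMAL.inv_cdf(1 - alpha)            # positive
        numerator = STD_NORMAL.inv_cdf(power) + z_alpha
    elif alternative == "less":
        z_alpha = STD_NORMAL.inv_cdf(alpha)                # negative
        numerator = STD_NORMAL.inv_cdf(power) - z_alpha
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return (numerator / d) ** 2


n_right = one_tailed_sample_size(d=0.5, power=0.8, alternative="greater")
n_left = one_tailed_sample_size(d=-0.5, power=0.8, alternative="less")
print(round(n_right, 2), ceil(n_right))  # about 24.73, so 25 observations
print(round(n_left, 2), ceil(n_left))    # the mirrored case needs the same n

# Halving the effect size quadruples the required sample size.
print(one_tailed_sample_size(d=0.25, power=0.8) / n_right)
```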