Hypothesis testing is the engine of scientific inference. From drug trials that determine whether a new treatment works, to quality control on a production line, to opinion polling before an election — this rigorous framework lets us make principled decisions from data. Rather than guessing, we quantify exactly how surprising our observations are under a default assumption, then decide whether the data is surprising enough to overturn it.
H₀: parameter = value | H₁: parameter > / < / ≠ value
p-value = P(result as extreme as observed | H₀ true)
Reject H₀ if p-value ≤ significance level α
In this module you will develop a complete toolkit: setting up hypotheses correctly, choosing the right type of test, computing p-values from Binomial and Normal distributions, identifying critical regions, and understanding the two ways a test can go wrong.
Learning Objectives
Formulate the null hypothesis H₀ and alternative hypothesis H₁ correctly for a given context
Choose between a one-tail and two-tail test based on the wording of the question
State and interpret the significance level α of a hypothesis test
Calculate the test statistic or p-value for a given data set
Find the critical region for a test at a given significance level
Compare the p-value to α and write a conclusion in context
Understand Type I error: rejecting H₀ when it is actually true, with probability α
Understand Type II error: failing to reject H₀ when it is actually false, with probability β
Test a proportion p using the Binomial distribution B(n, p₀)
Test a population mean μ using the Normal distribution with known variance σ²
Topics in This Module
Null & Alternative Hypotheses
Formulating H₀ and H₁ from a problem context
One-Tail vs Two-Tail
Choosing the correct test direction
p-values
Computing and interpreting probability of observed result
Critical Regions
Finding the rejection region from tables
Type I & II Errors
False positives, false negatives, power of a test
Binomial & Normal Tests
Testing proportions and means from data
Learn 1 — Setting Up a Hypothesis Test
Every hypothesis test begins with two competing statements about a population parameter (such as a proportion p or a mean μ).
The Null Hypothesis H₀
H₀ is the default assumption — the status quo. It always contains an equality sign. We assume H₀ is true until the data gives us sufficient evidence to doubt it.
H₀: p = p₀ or H₀: μ = μ₀
The Alternative Hypothesis H₁
H₁ is the claim we are testing. It contains a strict inequality. There are three forms:
H₁: p > p₀ (one-tail test, right-tail). We suspect the parameter is larger than stated.
H₁: p < p₀ (one-tail test, left-tail). We suspect the parameter is smaller than stated.
H₁: p ≠ p₀ (two-tail test). We suspect the parameter is different (either direction).
Why We Can Never "Prove" H₀
A hypothesis test can only tell us whether our data is inconsistent with H₀. If the data is surprising under H₀ (p-value small), we reject H₀. If the data is not surprising, we fail to reject H₀ — but this is not the same as proving H₀ is true. Absence of evidence is not evidence of absence.
Language matters: Always write "fail to reject H₀" or "insufficient evidence to reject H₀". Never write "accept H₀" or "prove H₀".
Significance Level α
The significance level α is the threshold probability we set before seeing the data. If the p-value is at or below α, we say the result is statistically significant and reject H₀.
The significance level also equals the probability of a Type I error — rejecting H₀ when it is actually true. A smaller α reduces this risk but makes it harder to reject H₀.
Tip: The choice of α is made before the test. If the question says "test at the 5% level", then α = 0.05 throughout — do not adjust it after seeing the data.
Learn 2 — p-values and Critical Regions
The p-value
The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming H₀ is true. It measures how surprising the data is under the null hypothesis.
p-value = P(result as extreme as observed | H₀ true)
Decision rule:
If p-value ≤ α → Reject H₀. Significant evidence against H₀.
If p-value > α → Fail to reject H₀. Insufficient evidence against H₀.
One-tail vs Two-tail p-values
For a one-tail test, the p-value is the area in one tail only: P(X ≥ x_obs) when H₁: p > p₀, or P(X ≤ x_obs) when H₁: p < p₀.
For a two-tail test (H₁: p ≠ p₀), extreme values in either direction count as evidence, so the tail probability on the side of the observation is doubled: p-value = 2 × min(P(X ≤ x_obs), P(X ≥ x_obs)). Equivalently, at 5% significance each tail is allowed only 2.5%.
Common error: For a two-tail test at 5%, do not compare the one-tail probability to 0.05. Compare it to 0.025 (half of 5%), or double the probability and compare to 0.05.
The Critical Region
The critical region (or rejection region) is the set of values of the test statistic for which we would reject H₀. The critical value is the boundary of this region.
One-tail right: Reject H₀ if X ≥ c (where P(X ≥ c) ≤ α)
One-tail left: Reject H₀ if X ≤ c (where P(X ≤ c) ≤ α)
Two-tail: Reject H₀ if X ≤ c₁ or X ≥ c₂
For Binomial tests, the critical region is found by cumulating probabilities from the tail for as long as the total stays at or below α; the first value that would push the total above α is excluded. For Normal tests, we use the standard Normal table (Z-table) to find the critical Z-value.
Actual significance level: Because the Binomial is discrete, the actual probability of rejecting H₀ at the boundary is often slightly less than α. The "actual significance level" is P(being in the critical region | H₀ true).
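The search for a right-tail critical region, and the actual significance level it produces, can be sketched in a few lines of Python. This is only an illustration: the function names binom_pmf and upper_critical_value are hypothetical, and the numbers n = 20, p₀ = 0.5, α = 0.05 are chosen just as an example.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_critical_value(n, p0, alpha):
    """Smallest c with P(X >= c | H0) <= alpha, and that tail probability.

    Returns (None, 0.0) if even the single most extreme value exceeds alpha.
    """
    tail = 0.0
    c = None
    for k in range(n, -1, -1):          # walk inwards from the upper tail
        if tail + binom_pmf(k, n, p0) > alpha:
            break                        # adding k would exceed alpha
        tail += binom_pmf(k, n, p0)
        c = k
    return c, tail

c, actual_level = upper_critical_value(20, 0.5, 0.05)
print(f"Critical region: X >= {c}; actual significance level = {actual_level:.4f}")
```

The returned tail probability is the actual significance level, which for a discrete distribution is typically below the nominal α.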
Learn 3 — Binomial Hypothesis Test
When testing a proportion p, the test statistic X is the count of successes in n trials. Under H₀, X ~ B(n, p₀).
Setting Up
H₀: p = p₀ vs H₁: p > p₀ / p < p₀ / p ≠ p₀
X ~ B(n, p₀) under H₀
Computing the p-value
For H₁: p > p₀ → p-value = P(X ≥ x_obs) = 1 − P(X ≤ x_obs − 1)
For H₁: p < p₀ → p-value = P(X ≤ x_obs)
For H₁: p ≠ p₀ → p-value = 2 × min(P(X ≤ x_obs), P(X ≥ x_obs))
These probabilities are found using Binomial tables (Cambridge provides cumulative tables) or the formula:
P(X = k) = C(n,k) · p₀ᵏ · (1−p₀)ⁿ⁻ᵏ
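As a rough sketch, the three p-value rules above can be computed directly from the Binomial formula using only the Python standard library. The function names and the alternative labels "greater", "less" and "two-sided" are assumptions made for this illustration, as is capping the doubled value at 1.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p); returns 0 when x < 0."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

def binom_p_value(x_obs, n, p0, alternative):
    """p-value of a Binomial test for the given form of H1."""
    lower = binom_cdf(x_obs, n, p0)              # P(X <= x_obs)
    upper = 1.0 - binom_cdf(x_obs - 1, n, p0)    # P(X >= x_obs)
    if alternative == "greater":
        return upper
    if alternative == "less":
        return lower
    if alternative == "two-sided":
        # Doubling the smaller tail; capped at 1 (an added safeguard).
        return min(1.0, 2 * min(lower, upper))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

# Illustration: 15 successes in 20 trials, H0: p = 0.5, H1: p > 0.5
p_val = binom_p_value(15, 20, 0.5, "greater")
print(f"p-value = {p_val:.4f}")
```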
Worked Example: Testing a Biased Coin
A coin is suspected of being biased towards heads. It is tossed 20 times and 15 heads are observed. Test at the 5% significance level.
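H₀: p = 0.5 (fair coin), where p is the probability of heads
H₁: p > 0.5 (one-tail right, because we suspect more heads)
Under H₀: X ~ B(20, 0.5), where X is the number of heads in 20 tosses.
p-value = P(X ≥ 15) = 1 − P(X ≤ 14) ≈ 0.0207
Compare: 0.0207 < 0.05 = α
Conclusion: Reject H₀. There is significant evidence at the 5% level that the coin is biased towards heads.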
Always state: (1) H₀ and H₁ with the parameter defined, (2) the distribution under H₀, (3) the p-value calculation, (4) comparison to α, (5) conclusion in context.
Learn 4 — Normal Distribution Test (Testing a Mean)
When the population variance σ² is known and we have a sample of size n, the sample mean X̄ follows a Normal distribution under H₀.
Distribution of X̄ Under H₀
X̄ ~ N(μ₀, σ²/n) under H₀: μ = μ₀
Test Statistic
We standardise to get a Z-score, which follows N(0, 1):
Z = (X̄ − μ₀) / (σ / √n) ~ N(0, 1) under H₀
The denominator σ/√n is the standard error of the mean — not σ itself.
Decision Rule Using Z
One-tail right (H₁: μ > μ₀): Reject H₀ if Z > z_α (e.g., z₀.₀₅ = 1.645)
One-tail left (H₁: μ < μ₀): Reject H₀ if Z < −z_α
Two-tail (H₁: μ ≠ μ₀): Reject H₀ if |Z| > z_{α/2} (e.g., z₀.₀₂₅ = 1.96)
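The critical values quoted above come from the inverse of the standard Normal CDF. A minimal sketch using Python's statistics.NormalDist follows; the helper names critical_z and reject_h0 are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def critical_z(alpha, two_tail=False):
    """Positive critical value: z_alpha for one tail, z_{alpha/2} for two tails."""
    tail_area = alpha / 2 if two_tail else alpha
    return STD_NORMAL.inv_cdf(1.0 - tail_area)

def reject_h0(z, alpha, alternative):
    """Apply the decision rule for the given form of H1."""
    if alternative == "greater":
        return z > critical_z(alpha)
    if alternative == "less":
        return z < -critical_z(alpha)
    if alternative == "two-sided":
        return abs(z) > critical_z(alpha, two_tail=True)
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

print(round(critical_z(0.05), 3))                  # one-tail 5%
print(round(critical_z(0.05, two_tail=True), 3))   # two-tail 5%
```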
Worked Example
A machine produces bolts with mean length 50 mm and known standard deviation σ = 2 mm. A sample of n = 25 bolts gives x̄ = 50.8 mm. Test H₀: μ = 50 vs H₁: μ > 50 at 5%.
Under H₀: X̄ ~ N(50, 4/25) = N(50, 0.16), so SE = 2/√25 = 0.4
Test statistic: Z = (50.8 − 50) / 0.4 = 0.8 / 0.4 = 2.00
Critical value for 5% one-tail: z₀.₀₅ = 1.645
Since 2.00 > 1.645, Z falls in the critical region. Conclusion: Reject H₀. Significant evidence at 5% that the mean bolt length exceeds 50 mm.
Key mistake: The standard error is σ/√n, not σ. Always divide σ by √n before computing Z.
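The bolt calculation can be reproduced with a short function that also returns a p-value. This is a sketch under the stated assumption of known σ; the name z_test and its parameters are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alternative="greater"):
    """Return (Z, p-value) for a test of a mean with known sigma."""
    se = sigma / sqrt(n)                 # standard error, not sigma itself
    z = (sample_mean - mu0) / se
    phi = NormalDist().cdf(z)            # P(Z <= z) under H0
    if alternative == "greater":
        p = 1.0 - phi
    elif alternative == "less":
        p = phi
    elif alternative == "two-sided":
        p = 2 * min(phi, 1.0 - phi)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p

z, p = z_test(50.8, 50, 2, 25)           # bolts: x̄ = 50.8, μ₀ = 50, σ = 2, n = 25
print(f"Z = {z:.2f}, p-value = {p:.4f}")
```

Comparing the p-value with α = 0.05 leads to the same decision as comparing Z with 1.645.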
Learn 5 — Type I and Type II Errors
A hypothesis test can make two distinct types of error. Understanding these is essential for designing tests and interpreting their conclusions.
Type I Error (False Positive)
Reject H₀ when H₀ is actually true.
P(Type I error) = α (the significance level)
Example: Concluding a fair coin is biased when it isn't.
Type II Error (False Negative)
Fail to reject H₀ when H₀ is actually false.
P(Type II error) = β
Example: Failing to detect that a coin is biased when it is.
The Power of a Test
Power = 1 − β = P(reject H₀ | H₀ is false)
A powerful test is one that is good at detecting a false H₀. We want high power (low β).
The Trade-off
Reducing α (making the test more stringent) moves the critical boundary further into the tail, making it harder to reject H₀. This reduces Type I errors but increases Type II errors — more genuine effects go undetected.
Smaller α → fewer Type I errors → more Type II errors (higher β, lower power)
Larger α → more Type I errors → fewer Type II errors (lower β, higher power)
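To see the trade-off numerically, the sketch below computes β for a one-tail right Normal test at several values of α. The true mean μ₁ = 51 is a hypothetical value chosen purely for illustration (with μ₀ = 50 and SE = 0.4 as in the bolt example), and beta_one_tail_right is an assumed helper name.

```python
from statistics import NormalDist

def beta_one_tail_right(alpha, mu0, mu1, se):
    """P(Type II error) when the true mean is mu1 and H1 is mu > mu0."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    boundary = mu0 + z_alpha * se        # reject H0 when the sample mean exceeds this
    # Under the true mean mu1, the sample mean ~ N(mu1, se^2).
    return NormalDist(mu1, se).cdf(boundary)

betas = {alpha: beta_one_tail_right(alpha, 50, 51, 0.4) for alpha in (0.10, 0.05, 0.01)}
for alpha, beta in betas.items():
    print(f"alpha = {alpha:.2f}: beta = {beta:.3f}, power = {1 - beta:.3f}")
```

As α shrinks, the printed β grows and the power falls, matching the two arrows above.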
Calculating P(Type II error)
To find β for a specific alternative value p₁ or μ₁:
Step 1: Find the critical region under H₀ (e.g., reject if X ≥ c).
Step 2: Under the specific alternative H₁ value, find P(X < c). This is P(Type II error) — the probability of not landing in the critical region even though H₁ is true.
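Example: H₀: p = 0.5, specific alternative p = 0.7, n = 10. Suppose the critical region is X ≥ 8. (Its actual significance level is P(X ≥ 8 | p = 0.5) ≈ 5.47%, slightly above a nominal 5%; a test held strictly to 5% would use X ≥ 9.)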
β = P(X < 8 | p = 0.7) = P(X ≤ 7 | B(10, 0.7)) ≈ 0.617
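The two steps can be checked with a short script. This is a sketch only; binom_cdf is a hypothetical helper and the critical region X ≥ 8 is taken as given from the example.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

n, c = 10, 8                               # critical region X >= 8
level = 1.0 - binom_cdf(c - 1, n, 0.5)     # P(X >= 8 | p = 0.5), Type I probability
beta = binom_cdf(c - 1, n, 0.7)            # P(X <= 7 | p = 0.7), Type II probability
print(f"P(Type I) = {level:.4f}, beta = {beta:.3f}, power = {1 - beta:.3f}")
```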
Exam tip: Type I = reject true H₀ (probability = α, the significance level, always). Type II = miss a false H₀ (probability β, depends on true parameter value).
Worked Examples
Example 1 — Binomial One-Tail Test (Coin)
A coin is tossed 20 times. H₀: p = 0.5, H₁: p > 0.5. Observe 14 heads. Test at 5%.
Under H₀: X ~ B(20, 0.5). p-value = P(X ≥ 14) = 1 − P(X ≤ 13) ≈ 0.0577
0.0577 > 0.05 → Fail to reject H₀. M1A1
Insufficient evidence at 5% that the coin is biased towards heads.
Example 2 — Binomial Two-Tail Test (Spinner)
A spinner is claimed to give P(red) = 1/3. In 30 spins, 14 reds are observed. Test H₀: p = 1/3, H₁: p ≠ 1/3 at 5%.
Under H₀: X ~ B(30, 1/3). Expected value = 10.
Since 14 > 10, we look at the right tail. p-value = 2 × P(X ≥ 14)
P(X ≥ 14) = 1 − P(X ≤ 13). Using tables: P(X ≤ 13) ≈ 0.9102
One-tail prob ≈ 0.0898. Two-tail p-value ≈ 2 × 0.0898 = 0.180
0.180 > 0.05 → Fail to reject H₀. M1A1
Insufficient evidence that the probability of red differs from 1/3.
Example 3 — Normal Test (One-Tail, Bolts)
Machine mean μ = 50 mm, σ = 2 mm, n = 25, x̄ = 50.8. Test H₀: μ = 50, H₁: μ > 50 at 5%.
SE = σ/√n = 2/5 = 0.4
Z = (50.8 − 50)/0.4 = 2.00
Critical value: z₀.₀₅ = 1.645. Since 2.00 > 1.645, reject H₀. M1A1
Significant evidence that mean bolt length exceeds 50 mm.
Example 4 — Finding Critical Region (Binomial)
H₀: p = 0.3, H₁: p < 0.3, n = 20, α = 10%.
We need largest c such that P(X ≤ c) ≤ 0.10 under X ~ B(20, 0.3).
P(X ≤ 2) = P(0) + P(1) + P(2)
P(0) = 0.7²⁰ ≈ 0.0008, P(1) ≈ 0.0068, P(2) ≈ 0.0278. Sum ≈ 0.0355 ≤ 0.10 ✓
P(X ≤ 3): add P(3) ≈ 0.0716. Sum ≈ 0.107 > 0.10 ✗
Critical region: X ≤ 2 A1
Actual significance level = P(X ≤ 2 | p = 0.3) ≈ 3.55%
Example 5 — Normal Two-Tail Test
H₀: μ = 100, H₁: μ ≠ 100, σ = 15, n = 36, x̄ = 104. Test at 1%.
SE = 15/√36 = 15/6 = 2.5
Z = (104 − 100)/2.5 = 1.60
Two-tail 1%: critical value z₀.₀₀₅ = 2.576. Since |1.60| < 2.576, fail to reject H₀. M1A1
Insufficient evidence at 1% that the mean differs from 100.
Example 6 — P(Type I Error)
A test is carried out at 5% significance level. What is the probability of a Type I error?
By definition, P(Type I error) = P(reject H₀ | H₀ true) = α = 0.05 B1
Example 7 — Critical Region for Binomial (Left-tail, 5%)
H₀: p = 0.4, H₁: p < 0.4, n = 10. Find critical region at 5%.
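One way to check a left-tail critical region such as the one asked for here is to cumulate probabilities upwards from X = 0 while the total stays at or below α. The sketch below does this for B(10, 0.4); lower_critical_value is a hypothetical helper name.

```python
from math import comb

def lower_critical_value(n, p0, alpha):
    """Largest c with P(X <= c | H0) <= alpha, and that probability.

    Returns (None, 0.0) if even P(X = 0) exceeds alpha.
    """
    cum = 0.0
    c = None
    for k in range(0, n + 1):            # walk inwards from the lower tail
        pk = comb(n, k) * p0**k * (1 - p0)**(n - k)
        if cum + pk > alpha:
            break
        cum += pk
        c = k
    return c, cum

c, actual_level = lower_critical_value(10, 0.4, 0.05)
print(f"Critical region: X <= {c}; actual significance level = {actual_level:.4f}")
```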
Type II error
Fail to reject H₀ when H₀ is false. P(Type II) = β
Power
1 − β = P(reject H₀ | H₀ false)
Critical region
Set of test statistic values for which H₀ is rejected (e.g., X ≥ c or Z > z_α)
Two-tail critical region
X ≤ c₁ or X ≥ c₂, where each tail has probability ≤ α/2
Proof Bank
Proof 1 — Why P(Type I error) = α by construction
The significance level α is defined as the maximum probability of rejecting H₀ when H₀ is true. The critical region C is chosen precisely so that:
P(X ∈ C | H₀ true) ≤ α
For a continuous test statistic (like Z ~ N(0,1)), we choose C = {Z > z_α} where z_α is defined such that P(Z > z_α) = α. Therefore P(reject H₀ | H₀ true) = P(Z > z_α | Z ~ N(0,1)) = α exactly.
For a discrete test statistic (Binomial) and a right-tail test, we choose the smallest critical value c such that P(X ≥ c | H₀) ≤ α, which gives the largest critical region allowed. The actual significance level is P(X ≥ c | H₀), which may be strictly less than α due to discreteness. The critical region is still constructed so that Type I error probability does not exceed the stated α.
This is why P(Type I error) = α is not a coincidence — it is the definition of how the critical region is chosen.
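The continuous case can be illustrated by simulation: when H₀ is true, the long-run fraction of rejections should sit close to α. The sketch below is a Monte Carlo check, not a proof; the parameter values, seed and function name are arbitrary choices for the illustration.

```python
import random
from statistics import NormalDist

def simulated_type_i_rate(alpha, mu0, sigma, n, trials, seed=0):
    """Fraction of simulated one-tail right Z-tests that reject a true H0."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    se = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Draw the sample mean directly from N(mu0, se^2), i.e. with H0 true.
        sample_mean = rng.gauss(mu0, se)
        z = (sample_mean - mu0) / se
        if z > z_alpha:
            rejections += 1
    return rejections / trials

rate = simulated_type_i_rate(alpha=0.05, mu0=50, sigma=2, n=25, trials=20000)
print(f"Simulated Type I error rate: {rate:.4f}")
```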
Proof 2 — Derivation of the Normal Test Statistic Z = (X̄ − μ₀)/(σ/√n)
Given: X₁, X₂, …, Xₙ are i.i.d. (independent, identically distributed) with mean μ and known variance σ².
Step 1: Distribution of the sum.
By linearity of expectation: E(X₁ + X₂ + … + Xₙ) = nμ
By independence: Var(X₁ + … + Xₙ) = nσ²
Step 2: Distribution of the sample mean.
X̄ = (X₁ + … + Xₙ)/n
E(X̄) = μ
Var(X̄) = σ²/n (variance scales as 1/n² × nσ² = σ²/n)
So X̄ ~ N(μ, σ²/n) (by the Central Limit Theorem, or exactly if Xᵢ are Normal).
Step 3: Standardise under H₀ (μ = μ₀).
Under H₀: X̄ ~ N(μ₀, σ²/n).
Subtracting the mean and dividing by the standard deviation:
Z = (X̄ − μ₀) / (σ/√n) ~ N(0, 1)
This Z is the test statistic. Values far from 0 (in the relevant tail) give evidence against H₀.
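Step 2 of the derivation can be sanity-checked by simulation: the variance of many simulated sample means should be close to σ²/n. The values μ = 50, σ = 2, n = 25 and the seed are arbitrary choices for this sketch, and simulate_sample_means is a hypothetical helper.

```python
import random
from statistics import mean, pvariance

def simulate_sample_means(mu, sigma, n, reps, seed=1):
    """Return `reps` sample means, each from n independent N(mu, sigma^2) draws."""
    rng = random.Random(seed)
    return [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]

mu, sigma, n = 50, 2, 25
means = simulate_sample_means(mu, sigma, n, reps=5000)
print(f"mean of sample means     = {mean(means):.3f} (theory: {mu})")
print(f"variance of sample means = {pvariance(means):.4f} (theory: {sigma**2 / n})")
```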
Exam Style Questions
Question 1 [6 marks]
A supermarket claims that 40% of customers use self-checkout. A manager suspects the true proportion is lower. She surveys 15 randomly selected customers and finds that 3 used self-checkout.
(i) Write down H₀ and H₁, defining the parameter p. [1]
(ii) Using a 5% significance level, carry out a hypothesis test and state your conclusion in context. [4]
(iii) What is the probability of a Type I error in this test? [1]
(i) p = proportion of customers using self-checkout. H₀: p = 0.4, H₁: p < 0.4 [B1]
(ii) Under H₀: X ~ B(15, 0.4). p-value = P(X ≤ 3).
P(0) = 0.6¹⁵ ≈ 0.000470, P(1) ≈ 0.004702, P(2) ≈ 0.021942, P(3) ≈ 0.063388 [M1]
P(X ≤ 3) ≈ 0.0905 [A1]
0.0905 > 0.05, so fail to reject H₀. [M1]
Insufficient evidence at 5% that fewer than 40% of customers use self-checkout. [A1]
(iii) P(Type I error) = 0.05 (= α, the significance level) [B1]
Question 2 [7 marks]
The heights of adult males in a country are known to be Normally distributed with standard deviation 8 cm. A researcher believes the mean height μ has changed from the historical value of 175 cm. She measures a random sample of 64 males and finds a sample mean of 177.2 cm.
(i) State H₀ and H₁ for an appropriate test. [1]
(ii) Calculate the test statistic Z. [2]
(iii) State the critical region for a 5% significance level and conclude the test. [2]
(iv) Find the p-value for this test. [2]
(i) H₀: μ = 175, H₁: μ ≠ 175 (two-tail, as she suspects it has "changed") [B1]
(ii) SE = 8/√64 = 8/8 = 1 [M1]
Z = (177.2 − 175)/1 = 2.2 [A1]
(iii) Two-tail 5%: critical region |Z| > 1.960. Since |2.2| = 2.2 > 1.960, reject H₀. [B1, A1]
Significant evidence at 5% that the mean height has changed from 175 cm.
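For part (iv), the two-tail p-value is twice the upper-tail area beyond |Z|. A minimal sketch of that calculation with the standard library follows; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, mu0, sigma, n = 177.2, 175, 8, 64
z = (sample_mean - mu0) / (sigma / sqrt(n))

# Two-tail p-value: double the area beyond |z| in one tail.
p_value = 2 * (1.0 - NormalDist().cdf(abs(z)))
print(f"Z = {z:.2f}, two-tail p-value = {p_value:.4f}")
```

A p-value below 0.05 agrees with the rejection of H₀ in part (iii).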
Question 3 [5 marks]
A hypothesis test uses H₀: p = 0.3 and H₁: p > 0.3 with n = 20 and a 5% significance level.
(i) Find the critical region for this test. [3]
(ii) State the actual significance level of the test. [1]
(iii) If in fact p = 0.5, find P(Type II error). [1]
(i) X ~ B(20, 0.3). Need smallest c with P(X ≥ c) ≤ 0.05.
P(X ≥ 9) = 1 − P(X ≤ 8) ≈ 1 − 0.8867 = 0.1133 > 0.05
P(X ≥ 10) = 1 − P(X ≤ 9) ≈ 1 − 0.9520 = 0.0480 ≤ 0.05 ✓ [M1A1]
Critical region: X ≥ 10 [A1]
(ii) Actual significance level = P(X ≥ 10 | p = 0.3) ≈ 0.0480 = 4.80% [B1]
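Part (iii) follows the two-step method for P(Type II error): with critical region X ≥ 10, β is the probability of landing outside it when p = 0.5. The sketch below computes this; binom_cdf is a hypothetical helper.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

n, c = 20, 10                           # critical region X >= 10 from part (i)
level = 1.0 - binom_cdf(c - 1, n, 0.3)  # actual significance level under H0
beta = binom_cdf(c - 1, n, 0.5)         # P(X <= 9 | p = 0.5)
print(f"Actual level = {level:.4f}, P(Type II error) = {beta:.4f}")
```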
Question 4 [4 marks]
Explain the difference between a Type I error and a Type II error in the context of a test where H₀ states that a new drug has no effect.
Type I error: Concluding the drug has an effect when in fact it does not. [B1]
Probability of Type I error = α (the significance level). [B1]
Type II error: Failing to conclude that the drug has an effect when in fact it does. [B1]
A more stringent significance level (smaller α) reduces Type I errors but increases Type II errors. [B1]
Question 5 [6 marks]
A random variable X ~ B(25, p). It is required to test H₀: p = 0.2 against H₁: p ≠ 0.2 at the 10% significance level. The observed value is X = 9.
(i) Find the p-value for this test. [3]
(ii) State your conclusion. [2]
(iii) State the critical region for this two-tail test. [1]
(i) X ~ B(25, 0.2). Expected = 5. Since 9 > 5, we look at the right tail.
P(X ≥ 9) = 1 − P(X ≤ 8) ≈ 1 − 0.9532 = 0.0468 [M1A1]
Two-tail p-value = 2 × 0.0468 = 0.0936 [A1]
(ii) 0.0936 < 0.10, so reject H₀. [M1]
Significant evidence at 10% that p ≠ 0.2. [A1]
(iii) Critical region: X ≤ 1 or X ≥ 9 [B1]
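The two-tail critical region in part (iii) can be reproduced by allowing each tail at most α/2. The following is a sketch of that search; two_tail_region is an assumed name, and either bound is returned as None when that tail is empty.

```python
from math import comb

def two_tail_region(n, p0, alpha):
    """Return (c1, c2) so that H0 is rejected when X <= c1 or X >= c2,
    with each tail probability at most alpha / 2."""
    half = alpha / 2
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]

    c1, cum = None, 0.0
    for k in range(0, n + 1):            # lower tail, working upwards
        cum += pmf[k]
        if cum > half:
            break
        c1 = k

    c2, cum = None, 0.0
    for k in range(n, -1, -1):           # upper tail, working downwards
        cum += pmf[k]
        if cum > half:
            break
        c2 = k
    return c1, c2

c1, c2 = two_tail_region(25, 0.2, 0.10)
print(f"Critical region: X <= {c1} or X >= {c2}")
```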
Question 6 [5 marks]
A factory claims that the mean weight of a product is 500 g with standard deviation 12 g. A quality inspector takes a sample of 36 items and finds a sample mean of 496 g. Test at the 1% level whether the mean weight is less than 500 g.
H₀: μ = 500, H₁: μ < 500 (one-tail left) [B1]
SE = 12/√36 = 12/6 = 2 [M1]
Z = (496 − 500)/2 = −4/2 = −2.00 [A1]
Critical value for 1% one-tail (left): −2.326. Since −2.00 > −2.326, fail to reject H₀. [M1]
Insufficient evidence at 1% that the mean weight is less than 500 g. [A1]
Question 7 [4 marks]
A test has critical region X ≤ 2 where X ~ B(12, p). Given H₀: p = 0.35 and H₁: p < 0.35, find the actual significance level and P(Type II error) when p = 0.2.
Actual significance = P(X ≤ 2 | p = 0.35, n = 12)
P(0) = 0.65¹² ≈ 0.00569, P(1) ≈ 0.03675, P(2) ≈ 0.10885 [M1]
P(X ≤ 2) ≈ 0.151 → actual significance level ≈ 15.1% [A1]
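The remaining part, P(Type II error) when p = 0.2, is the probability of falling outside the critical region X ≤ 2 under B(12, 0.2). A short sketch of both calculations, with a hypothetical binom_cdf helper:

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

n, c = 12, 2                             # critical region X <= 2
level = binom_cdf(c, n, 0.35)            # P(X <= 2 | p = 0.35)
beta = 1.0 - binom_cdf(c, n, 0.2)        # P(X >= 3 | p = 0.2)
print(f"Actual level = {level:.4f}, P(Type II error) = {beta:.4f}")
```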
Question 8 [5 marks]
The lifetime T (in hours) of a type of battery is Normally distributed with σ = 5 hours. The manufacturer claims μ = 60 hours. A consumer group tests a random sample of 25 batteries and obtains x̄ = 57.8 hours. Test at the 5% level whether the mean lifetime is less than claimed.
H₀: μ = 60, H₁: μ < 60 [B1]
SE = 5/√25 = 5/5 = 1 [M1]
Z = (57.8 − 60)/1 = −2.2 [A1]
Critical value: −1.645 (one-tail left, 5%). Since −2.2 < −1.645, reject H₀. [M1]
Significant evidence at 5% that mean battery lifetime is less than 60 hours. [A1]
Past Paper Questions (Cambridge A-Level Style)
Past Paper 1 — 9709/62/O/N/19 style [6 marks]
A bag contains a large number of discs. The manufacturer states that 30% of the discs are red. James thinks the proportion of red discs is less than 30%. James takes a random sample of 20 discs and finds 3 are red.
(i) Test James's claim at the 5% significance level. [5]
(ii) Write down the probability of a Type I error. [1]
(i) H₀: p = 0.3, H₁: p < 0.3 where p = proportion of red discs [B1]
X ~ B(20, 0.3) under H₀. p-value = P(X ≤ 3) [M1]
P(X ≤ 3) = P(0)+P(1)+P(2)+P(3) ≈ 0.001 + 0.007 + 0.028 + 0.072 = 0.107 [A1]
0.107 > 0.05, fail to reject H₀ [M1]
Insufficient evidence at 5% that less than 30% of discs are red [A1]
(ii) P(Type I error) = 0.05 [B1]
Past Paper 2 — 9709/62/M/J/18 style [7 marks]
The random variable X has distribution B(n, p). A single observation x is used to test H₀: p = 0.45 against H₁: p < 0.45. With n = 20, the critical region is X ≤ 5.
(i) Find the actual significance level of the test. [2]
(ii) Find P(Type II error) when p = 0.3. [2]
(iii) State the effect on P(Type II error) if the significance level is increased. [1]
(iv) The observation is x = 4. State your conclusion. [2]
(iii) Increasing the significance level enlarges the critical region (e.g., to X ≤ 6), making it easier to reject H₀, so P(Type II error) decreases. [B1]
(iv) x = 4 ≤ 5, so x is in the critical region. Reject H₀. [M1]
Significant evidence that p < 0.45. [A1]
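Parts (i) and (ii) use the same two probabilities as any Type II calculation: the chance of landing in X ≤ 5 under H₀, and the chance of missing it when p = 0.3. The sketch below computes both; the helper name is an assumption.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

n, c = 20, 5                              # critical region X <= 5
actual_level = binom_cdf(c, n, 0.45)      # (i)  P(X <= 5 | p = 0.45)
beta = 1.0 - binom_cdf(c, n, 0.3)         # (ii) P(X >= 6 | p = 0.3)
print(f"(i) actual significance level = {actual_level:.4f}")
print(f"(ii) P(Type II error) = {beta:.4f}")
```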
Past Paper 3 — 9709/63/O/N/20 style [6 marks]
The masses of apples in an orchard have been Normally distributed for many years with mean 185 g and standard deviation 22 g. Following a change in growing conditions, a farmer believes the mean mass has increased. He takes a random sample of 30 apples and finds the mean mass is 191 g.
Carry out a hypothesis test at the 10% significance level. State your hypotheses and conclusion clearly. [6]
H₀: μ = 185, H₁: μ > 185 [B1]
SE = 22/√30 ≈ 4.017 [M1]
Z = (191 − 185)/4.017 ≈ 1.494 [A1]
Critical value for 10% one-tail: 1.282. Since 1.494 > 1.282, reject H₀. [M1A1]
Significant evidence at 10% that the mean mass of apples has increased following the change in conditions. [A1]
Past Paper 4 — 9709/62/O/N/21 style [5 marks]
A teacher claims that students score an average of 65% on a test. A student believes the actual mean is different. She collects data from a random sample of 40 students and calculates a sample mean of 62.4%. The population standard deviation is known to be 9%.
Test the student's belief at the 5% level. [5]
H₀: μ = 65, H₁: μ ≠ 65 (two-tail, "different") [B1]
SE = 9/√40 ≈ 1.423 [M1]
Z = (62.4 − 65)/1.423 ≈ −1.827 [A1]
Two-tail 5%: |Z| must exceed 1.960. |−1.827| = 1.827 < 1.960, fail to reject H₀. [M1]
Insufficient evidence at 5% that the mean score differs from 65%. [A1]
Past Paper 5 — 9709/61/M/J/22 style [7 marks]
In a large town, it is claimed that 55% of households recycle regularly. A council member suspects the true proportion is higher. She surveys a random sample of 18 households; let X be the number that recycle regularly.
(i) State suitable hypotheses. [1]
(ii) Find the critical region for a test at the 5% significance level. [3]
(iii) The council member finds 14 households that recycle. State and justify the conclusion. [2]
(iv) State the probability of a Type I error using your critical region. [1]
(i) H₀: p = 0.55, H₁: p > 0.55 where p = proportion of households that recycle [B1]
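The critical region asked for in part (ii) can be explored by cumulating B(18, 0.55) probabilities down from X = 18 while the total stays at or below 5%. This is a sketch of that search, not a mark-scheme answer; upper_critical_value is a hypothetical helper.

```python
from math import comb

def upper_critical_value(n, p0, alpha):
    """Smallest c with P(X >= c | H0) <= alpha, and that tail probability."""
    tail = 0.0
    c = None
    for k in range(n, -1, -1):            # walk inwards from the upper tail
        pk = comb(n, k) * p0**k * (1 - p0)**(n - k)
        if tail + pk > alpha:
            break
        tail += pk
        c = k
    return c, tail

c, actual_level = upper_critical_value(18, 0.55, 0.05)
observed = 14
print(f"Critical region: X >= {c}; actual significance level = {actual_level:.4f}")
print("Reject H0" if c is not None and observed >= c else "Fail to reject H0")
```

The actual significance level printed here is the probability of a Type I error for this critical region.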