This topic covers the statistical tools used to measure and model relationships between two variables. You will learn to quantify how strongly two variables are related, find the best-fit line, test whether a correlation is significant, and transform non-linear data into linear form.
Syllabus Coverage: Cambridge 9709 Statistics 2 — Regression and Correlation (typically 20–25% of S2 paper)
What You Will Learn
Learn 1 — Product Moment Correlation Coefficient (PMCC)
Compute r using Sxy, Sxx, Syy. Interpret strength and direction of linear correlation.
Learn 2 — Least Squares Regression Line
Find the equation ŷ = a + bx; understand interpolation vs extrapolation; analyse residuals.
Learn 3 — Spearman's Rank Correlation
Rank data, handle ties, apply rₛ = 1 − 6Σd²/[n(n²−1)]; know when to use it.
Learn 4 — Hypothesis Testing for Correlation
Set up H₀: ρ = 0, choose critical values from tables, state conclusions in context.
Learn 5 — Linearisation
Transform y = ab^x and y = ax^n into linear form using logarithms; read off constants from graphs.
Key Skills Checklist
✓ Calculate Sxy, Sxx, Syy from raw data or summary statistics
✓ Compute and interpret PMCC r
✓ Find regression line coefficients a and b
✓ Assign ranks and compute Spearman's rₛ
✓ Conduct a hypothesis test for correlation
✓ Linearise data and interpret log-transformed graphs
Learn 1: Product Moment Correlation Coefficient (PMCC)
The PMCC, denoted r, measures the strength and direction of the linear relationship between two variables x and y.
• r always lies in [−1, 1]
• r = +1: perfect positive linear correlation
• r = −1: perfect negative linear correlation
• r = 0: no linear correlation (there may still be a non-linear relationship)
• r is dimensionless — it doesn't depend on the units of x or y
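As a rough illustration of how r is built from the quantities Sxy, Sxx and Syy used throughout this topic, the Python sketch below computes them from raw data and then takes r = Sxy/√(Sxx·Syy). The function name `pmcc` and the height/mass values are invented for this sketch and are not part of the syllabus material.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient from paired raw data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    # Summary-statistic forms: Sxx = Σx² − (Σx)²/n, and similarly for Syy, Sxy
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for this sketch
heights_cm = [150, 160, 170, 180, 190]
masses_kg = [52, 58, 69, 75, 88]

r_cm = pmcc(heights_cm, masses_kg)
# Changing units (cm -> m) should leave r unchanged, because r is dimensionless
r_m = pmcc([h / 100 for h in heights_cm], masses_kg)
print(round(r_cm, 4), round(r_m, 4))
```

The two printed values should agree, which is one way to see that r does not depend on the units chosen.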
Interpreting the Strength
|r| range    Interpretation
0.9 – 1.0    Very strong linear correlation
0.7 – 0.9    Strong linear correlation
0.5 – 0.7    Moderate linear correlation
0.3 – 0.5    Weak linear correlation
0 – 0.3      Very weak / negligible correlation
Exam tip: Always state both the direction (positive/negative) and the strength (weak/moderate/strong) when interpreting r in context.
Positive vs Negative Correlation
• Positive correlation (r > 0): As x increases, y tends to increase.
• Negative correlation (r < 0): As x increases, y tends to decrease.
• No correlation (r ≈ 0): No consistent linear trend.
Correlation ≠ Causation: Even if |r| is close to 1, this does NOT prove that x causes y. There may be a confounding variable, or the relationship may be coincidental.
Learn 2: Least Squares Regression Line
The regression line of y on x gives the best-fit straight line minimising the sum of squared vertical distances from each data point to the line.
The Equation
ŷ = a + bx
b = Sxy / Sxx
a = ȳ − bx̄
Key Property
The regression line always passes through (x̄, ȳ) — the point of means.
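A minimal Python sketch of these formulas follows, assuming raw paired data is available. The function name `regression_y_on_x` and the data values are hypothetical, chosen only to show that the fitted line passes through the point of means.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sxy = Σxy − n·x̄·ȳ and Sxx = Σx² − n·x̄²
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b = sxy / sxx            # gradient
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, invented for this sketch
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = regression_y_on_x(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
# Substituting x̄ into the line should give back ȳ (up to rounding error)
print(round(a + b * x_bar, 6), round(y_bar, 6))
```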
Regression of x on y vs y on x
• y on x: use when predicting y from a given x. Minimises vertical residuals.
• x on y: use when predicting x from a given y. Minimises horizontal residuals.
• These are two different lines (unless r = ±1).
Interpolation vs Extrapolation
Interpolation (predicting within the range of data): Generally reliable — the model is supported by evidence in that region.
Extrapolation (predicting outside the range): Unreliable — the linear model may not hold beyond the observed data.
Residuals
Residual = y − ŷ (actual minus predicted)
• Positive residual: the actual value is above the regression line
• Negative residual: the actual value is below the regression line
• The sum of all residuals = 0 for a least squares line
• Residual plots: random scatter → model is appropriate; patterns → model is inappropriate
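The residual properties above can be checked numerically. The sketch below, which uses made-up data and hypothetical helper names `fit_line` and `residuals`, fits a least squares line and confirms that the residuals sum to zero apart from floating-point rounding.

```python
def fit_line(xs, ys):
    """Least squares line of y on x; returns (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def residuals(xs, ys):
    """Residual = actual y minus predicted y-hat, for each data point."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, invented for this sketch
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.3]

res = residuals(xs, ys)
print([round(e, 3) for e in res])   # a mix of positive and negative values
print(abs(sum(res)) < 1e-9)         # residuals of a least squares line sum to 0
```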
Common error: Using the regression line to predict y at an x value far outside the data range. Always check whether you are interpolating or extrapolating.
Interpreting Coefficients
b (gradient): For each unit increase in x, y is predicted to change by b units.
a (intercept): The predicted value of y when x = 0. This may not always have a meaningful real-world interpretation.
Learn 3: Spearman's Rank Correlation Coefficient
Spearman's rₛ measures the strength and direction of the monotonic relationship between two variables using ranked data.
The Formula
rₛ = 1 − 6Σd² / [n(n² − 1)]
where d = difference between the ranks of each paired observation, and n = number of pairs.
How to Rank
1. Rank the x values from smallest (rank 1) to largest (rank n).
2. Rank the y values from smallest (rank 1) to largest (rank n).
3. Find d = rank(x) − rank(y) for each pair.
4. Compute Σd² then apply the formula.
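The four steps can be sketched in Python as below. This is only an illustration under the assumption that there are no tied values; the function names and the hours/marks data are hypothetical.

```python
def ranks_no_ties(values):
    """Rank 1 = smallest value. Assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rank_x = ranks_no_ties(xs)          # step 1
    rank_y = ranks_no_ties(ys)          # step 2
    d = [rx - ry for rx, ry in zip(rank_x, rank_y)]   # step 3
    sum_d2 = sum(di ** 2 for di in d)                 # step 4
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical data, invented for this sketch
hours = [2, 5, 1, 8, 4]
marks = [35, 60, 30, 90, 62]

r_s = spearman(hours, marks)
print(round(r_s, 4))
```

For this data the ranks are [2, 4, 1, 5, 3] and [2, 3, 1, 5, 4], so Σd² = 2 and rₛ works out to 0.9.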
Handling Tied Ranks
When two or more values are equal, assign each the average of the ranks they would have occupied.
Example: If two values tie for ranks 3 and 4, both are given rank 3.5.
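One possible way to assign average ranks in code is sketched below. The function name `average_ranks` and the scores are made up for illustration.

```python
def average_ranks(values):
    """Rank 1 = smallest; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1     # lowest rank position occupied by v
        count = ordered.count(v)         # number of values tied with v (incl. itself)
        # Mean of the ranks first, first+1, ..., first+count-1
        ranks.append(first + (count - 1) / 2)
    return ranks

# Hypothetical scores: the two 17s tie for ranks 3 and 4
scores = [12, 15, 17, 17, 20]
print(average_ranks(scores))   # [1.0, 2.0, 3.5, 3.5, 5.0]
```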
When to Use Spearman's vs PMCC
Situation                                                                       Use
Data is clearly linear, both variables quantitative and approximately normal    PMCC (r)
Data is ordinal (e.g., rankings, scores)                                        Spearman's (rₛ)
Relationship is monotonic but not necessarily linear                            Spearman's (rₛ)
Outliers are present that would distort PMCC                                    Spearman's (rₛ)
Data is not normally distributed                                                Spearman's (rₛ)
Comparing rₛ and r
• Both lie in [−1, 1] and are interpreted similarly for direction and strength.
• rₛ is the PMCC applied to the ranks of the data (not the raw values).
• If data is bivariate normal, PMCC is more powerful; otherwise Spearman's is more robust.
Note: The formula rₛ = 1 − 6Σd²/[n(n²−1)] gives an exact result only when there are no tied ranks. With many ties, it is better to calculate the PMCC of the ranks directly (though the formula is still used in A-Level exams).
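The effect of ties on the formula can be seen in a small experiment. The sketch below uses invented data and hypothetical helper names; it compares the Σd² formula with the PMCC of the ranks, first without ties and then with one tied pair.

```python
from math import sqrt

def average_ranks(values):
    """Rank 1 = smallest; tied values share the mean of their rank positions."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def pmcc(xs, ys):
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)

def spearman_formula(rank_x, rank_y):
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def compare(xs, ys):
    """Return (formula value, PMCC of the ranks) for the same data."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    return spearman_formula(rx, ry), pmcc(rx, ry)

# Hypothetical data, invented for this sketch
no_ties = compare([3, 1, 4, 2, 5], [30, 25, 35, 12, 60])
with_ties = compare([1, 2, 2, 3, 4], [10, 20, 30, 40, 50])

print(no_ties)     # the two values agree when there are no ties
print(with_ties)   # the two values differ slightly when ranks are tied
```

With a single tied pair the discrepancy is small (it appears in the fourth decimal place here), which is consistent with the formula still being accepted in exam settings.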
Learn 4: Hypothesis Testing for Correlation
We can test whether there is evidence of correlation in the population, using the sample correlation coefficient r as the test statistic.
Setting Up the Test
H₀: ρ = 0 (no population correlation)
H₁ options:
• ρ ≠ 0 → two-tailed test (testing for any correlation)
• ρ > 0 → one-tailed test (testing for positive correlation)
• ρ < 0 → one-tailed test (testing for negative correlation)
Critical Values
The test statistic is the sample r (or rₛ for Spearman's). Critical values are given in the exam paper.
• If |r| > critical value (and, for a one-tailed test, r has the sign stated in H₁): reject H₀, as there is sufficient evidence of correlation
• If |r| ≤ critical value: do not reject H₀, as there is insufficient evidence of correlation
Stating Conclusions
Always state conclusions in context. Do not just say "reject H₀". Say:
"There is sufficient evidence at the X% significance level to conclude that there is a positive correlation between [variable 1] and [variable 2] in the population."
One-Tailed vs Two-Tailed
Significance level α    One-tailed critical region    Two-tailed critical region
10%                     Upper 10% of distribution     Upper and lower 5% each
5%                      Upper 5% of distribution      Upper and lower 2.5% each
1%                      Upper 1% of distribution      Upper and lower 0.5% each
Assumptions
When testing using PMCC: both variables should come from a bivariate normal distribution.
When testing using Spearman's: no distributional assumption required — it is non-parametric.
Common error: Using the wrong critical value table (e.g., using a one-tailed value for a two-tailed test). Read the question carefully to identify H₁ before looking up the table.
Step-by-Step Procedure
1. State H₀: ρ = 0 and H₁ (with direction if specified) B1
2. State the significance level α
3. Calculate r (or rₛ) from the data M1
4. Find the critical value from the table for the given n and α B1
5. Compare r with the critical value M1
6. State the conclusion in context A1
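The comparison in step 5 can be sketched as a small Python function. This is an illustrative helper only: the name `correlation_test` and the labels "two-sided", "greater" and "less" are assumptions of this sketch, and the critical value must still be looked up in the tables for the given n and α because the function does not compute it.

```python
def correlation_test(r, critical_value, alternative):
    """Return True if H0: rho = 0 is rejected.

    alternative is one of:
      "two-sided"  for H1: rho != 0
      "greater"    for H1: rho > 0
      "less"       for H1: rho < 0
    critical_value is the positive table value for the chosen n and alpha.
    """
    if alternative == "two-sided":
        return abs(r) > critical_value
    if alternative == "greater":
        return r > critical_value
    if alternative == "less":
        return r < -critical_value
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

# Values taken from Q5 in this topic: n = 12, r = -0.54, one-tail 5% value 0.4973
print(correlation_test(-0.54, 0.4973, "less"))     # True: reject H0
# A positive r of the same size would not support H1: rho < 0
print(correlation_test(0.54, 0.4973, "less"))      # False: do not reject H0
```

Handling the sign separately for one-tailed tests guards against rejecting H₀ when r is large but points in the opposite direction to H₁.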
Learn 5: Linearisation
When data follows a non-linear model, we can transform it into a linear form and then apply regression and correlation techniques.
Model 1: y = ab^x (Exponential)
Take log of both sides: log y = log a + x · log b
• Plot log y against x
• The graph should be approximately linear
• Gradient = log b → so b = 10^(gradient) [if log base 10] or b = e^(gradient) [if ln]
• Intercept = log a → so a = 10^(intercept) or a = e^(intercept)
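A minimal Python sketch of this procedure is shown below, assuming base-10 logarithms. The function name `fit_exponential` is hypothetical, and the data is generated exactly from y = 3 × 2^x so that the recovered constants can be checked.

```python
from math import log10

def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on the points (x, log10 y)."""
    log_ys = [log10(y) for y in ys]
    n = len(xs)
    x_bar = sum(xs) / n
    ly_bar = sum(log_ys) / n
    sxy = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    gradient = sxy / sxx                    # equals log10(b)
    intercept = ly_bar - gradient * x_bar   # equals log10(a)
    # Convert back from logs to the original constants
    return 10 ** intercept, 10 ** gradient

# Hypothetical data lying exactly on y = 3 * 2**x
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]

a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))   # close to 3 and 2
```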
Model 2: y = ax^n (Power Law)
Take log of both sides: log y = log a + n · log x
• Plot log y against log x
• The graph should be approximately linear
• Gradient = n
• Intercept = log a → so a = 10^(intercept) [if log base 10] or a = e^(intercept) [if ln]
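The power-law case can be sketched in the same way, again assuming base-10 logarithms. The function name `fit_power_law` is hypothetical, and the data is generated exactly from y = 5x² for checking purposes.

```python
from math import log10

def fit_power_law(xs, ys):
    """Fit y = a * x**n by least squares on the points (log10 x, log10 y)."""
    log_xs = [log10(x) for x in xs]
    log_ys = [log10(y) for y in ys]
    m = len(xs)
    lx_bar = sum(log_xs) / m
    ly_bar = sum(log_ys) / m
    sxy = sum((lx - lx_bar) * (ly - ly_bar) for lx, ly in zip(log_xs, log_ys))
    sxx = sum((lx - lx_bar) ** 2 for lx in log_xs)
    n = sxy / sxx                       # gradient is the power n directly
    intercept = ly_bar - n * lx_bar     # equals log10(a)
    return 10 ** intercept, n

# Hypothetical data lying exactly on y = 5 * x**2 (x must be positive to take logs)
xs = [1, 2, 3, 4, 5]
ys = [5, 20, 45, 80, 125]

a, n = fit_power_law(xs, ys)
print(round(a, 6), round(n, 6))   # close to 5 and 2
```

Note that only the intercept needs converting back from a logarithm; the gradient already is n.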
Which Transformation to Use?
Model       Transform                  Plot              Gradient    Intercept
y = ab^x    log y = log a + x log b    log y vs x        log b       log a
y = ax^n    log y = log a + n log x    log y vs log x    n           log a
Exam tip: Cambridge 9709 typically uses logarithms base 10 (log) or natural logarithms (ln). Be consistent throughout a question — check which the question specifies.
Reading Off Constants
1. Identify the linearised form and which variables are plotted
2. Read gradient and y-intercept from the graph or regression equation
3. Use gradient and intercept to recover a, b (or a, n)
4. State the model clearly: e.g., y = 2.3 × (1.5)^x
Common mistake: Forgetting to convert back from log a to a. If the intercept is 0.8 (log base 10), then a = 10^0.8 ≈ 6.31, not 0.8.
Example 1 — Computing PMCC from a Table
Given n = 5, Σx = 25, Σy = 40, Σx² = 145, Σy² = 340, Σxy = 218. Find r.
x̄ = 25/5 = 5, ȳ = 40/5 = 8 B1
Sxy = 218 − 5×5×8 = 218 − 200 = 18 M1
Sxx = 145 − 5×25 = 145 − 125 = 20 A1
Syy = 340 − 5×64 = 340 − 320 = 20 A1
r = 18/√(20×20) = 18/20 = 0.9 A1
r = 0.9 indicates a very strong positive linear correlation between x and y.
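The arithmetic in this example can be checked with a few lines of Python. The variable names are chosen for this sketch; the numbers are the summary statistics given in the question.

```python
from math import sqrt

# Summary statistics from Example 1
n = 5
sum_x, sum_y = 25, 40
sum_x2, sum_y2, sum_xy = 145, 340, 218

# Sxy = Σxy − (Σx)(Σy)/n, Sxx = Σx² − (Σx)²/n, Syy = Σy² − (Σy)²/n
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n

r = sxy / sqrt(sxx * syy)
print(sxy, sxx, syy, r)   # 18.0 20.0 20.0 0.9
```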
Example 2 — Finding the Regression Line
Using data from Example 1 (Sxy = 18, Sxx = 20, x̄ = 5, ȳ = 8), find the regression line of y on x.
b = Sxy/Sxx = 18/20 = 0.9 M1 A1
a = ȳ − bx̄ = 8 − 0.9×5 = 8 − 4.5 = 3.5 M1 A1
Regression line: ŷ = 3.5 + 0.9x A1
Check: when x = x̄ = 5, ŷ = 3.5 + 4.5 = 8 = ȳ ✓
Example 3 — Spearman's Rank Correlation
Ranks for 5 students in Maths (x) and Science (y): (1,2),(2,1),(3,4),(4,3),(5,5). Find rₛ.
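One way to carry out this calculation is sketched in Python below, applying rₛ = 1 − 6Σd²/[n(n²−1)] directly to the given rank pairs. The variable names are chosen for this sketch.

```python
# Rank pairs (Maths rank, Science rank) from Example 3
pairs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
n = len(pairs)

# d = rank(x) − rank(y) for each student
d = [rank_x - rank_y for rank_x, rank_y in pairs]
sum_d2 = sum(di ** 2 for di in d)

r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(d, sum_d2, r_s)   # [-1, 1, -1, 1, 0] 4 0.8
```

So Σd² = 4 and rₛ = 1 − 24/120 = 0.8, which on the strength scale used in this topic would be read as a strong positive agreement between the two sets of ranks.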
Q2 [6 marks]
Using the data from Q1, find the equation of the regression line of y on x. State the value of y predicted when x = 8 and comment on the reliability of this prediction.
x̄ = 42/7 = 6, ȳ = 63/7 = 9 [B1]
b = Sxy/Sxx = 20/28 = 5/7 ≈ 0.714 [M1 A1]
a = 9 − (5/7)×6 = 9 − 30/7 = 33/7 ≈ 4.71 [A1]
ŷ = 4.71 + 0.714×8 = 4.71 + 5.71 = 10.43 [A1]
Whether this prediction is reliable depends on whether x = 8 lies within the range of the original x values. If it does, the prediction is an interpolation and is likely to be reliable; if it does not, it is an extrapolation and is unreliable. [B1]
Q3 [6 marks]
Six students are ranked by a teacher and by a judge: Teacher ranks: 1,2,3,4,5,6. Judge ranks: 2,1,4,3,6,5. Calculate Spearman's rₛ and test at the 5% significance level whether there is positive agreement between the two sets of rankings (critical value = 0.8286).
Q4 [4 marks]
Data is believed to follow the model y = ab^x. When log y is plotted against x, the regression line is log y = 0.5 + 0.2x. Find a and b.
Comparing log y = log a + x·log b with log y = 0.5 + 0.2x: [M1]
log a = 0.5 → a = 10^0.5 ≈ 3.16 [A1]
log b = 0.2 → b = 10^0.2 ≈ 1.585 [A1]
Model: y = 3.16 × (1.585)^x [A1]
Q5 [5 marks]
A sample of n = 12 gives r = −0.54. Test at the 5% significance level whether there is negative correlation in the population. Critical value (one-tail 5%, n=12) = 0.4973.
H₀: ρ = 0; H₁: ρ < 0 (one-tailed) [B1]
Significance level: 5% [B1]
|r| = 0.54 > critical value 0.4973 [M1]
Since r is negative and |r| exceeds the critical value, reject H₀. [M1]
There is sufficient evidence at the 5% level of negative correlation in the population. [A1]
Q6 [5 marks]
Data follows y = ax^n. A log–log plot gives a straight line through (0.3, 1.1) and (0.9, 2.3). Find n and a.
Gradient = (2.3−1.1)/(0.9−0.3) = 1.2/0.6 = 2 → n = 2 [M1 A1]
Using point (0.3, 1.1): 1.1 = log a + 2×0.3 = log a + 0.6 [M1]
log a = 0.5 → a = 10^0.5 ≈ 3.16 [A1]
Model: y = 3.16 x² [A1]
Q7 [6 marks]
Bivariate data on temperature (x °C) and sales (y units) for 9 days gives Sxy = 156.2, Sxx = 210.4, Syy = 132.5. (i) Find r. (ii) State with a reason whether PMCC or Spearman's is more appropriate here.
(i) r = 156.2/√(210.4×132.5) = 156.2/√27878 = 156.2/166.97 ≈ 0.936 [M1 A1]
Very strong positive linear correlation between temperature and sales. [A1]
(ii) Since both variables are quantitative and the relationship appears linear (r close to 1), PMCC is more appropriate. [B1]
Spearman's would be used if data is ordinal or non-normal, which is not indicated here. [B1 B1]
Q8 [7 marks]
For n = 8: x̄ = 4, ȳ = 10, Sxy = 24, Sxx = 32, Syy = 20. (i) Find the regression line of y on x. (ii) A new value x = 10 is observed; comment on using the line to predict y. (iii) Find the residual for the data point (6, 13.5).
(i) b = 24/32 = 0.75; a = 10 − 0.75×4 = 7 [M1 A1 A1]
Regression line: ŷ = 7 + 0.75x [A1]
(ii) x = 10 may be outside the range of the original data; this would be extrapolation and is unreliable. [B1]
(iii) ŷ = 7 + 0.75×6 = 7 + 4.5 = 11.5 [M1]
Residual = 13.5 − 11.5 = 2.0 [A1]
Past Paper Questions (Adapted 9709 S2)
PP1 — 9709/62/O/N/18 (adapted) [6 marks]
The ages (x years) and blood pressure readings (y mmHg) for 8 patients give: Σx = 400, Σy = 1200, Σx² = 21000, Σy² = 181200, Σxy = 61400, n = 8.
(a) Calculate r and comment on what this value suggests.
(b) Find the equation of the regression line of y on x.
Seven items are ranked by two assessors. Assessor A: 1,2,3,4,5,6,7. Assessor B: 3,1,2,5,4,7,6. Calculate Spearman's rₛ and test at 10% significance level for agreement (critical value for n=7, one-tail 10% = 0.7143).
Data is thought to satisfy y = ab^x. Values of log₁₀y are recorded for x = 1,2,3,4,5 and give log y values: 1.3, 1.6, 1.9, 2.2, 2.5.
(a) Verify that a linear model is appropriate for log y vs x.
(b) Use the first and last point to find the gradient and intercept, then find a and b.
(a) The differences in log y are constant (0.3 each), confirming a perfect linear relationship — the model y = ab^x is appropriate. [B1 B1]
(b) Gradient = (2.5−1.3)/(5−1) = 1.2/4 = 0.3 = log b → b = 10^0.3 ≈ 1.995 ≈ 2.00 [M1 A1]
Using point (1, 1.3): 1.3 = log a + 0.3×1 → log a = 1.0 → a = 10 [M1 A1]
Model: y = 10 × 2^x
PP4 — 9709/62/M/J/21 (adapted) [5 marks]
A researcher collects n = 15 pairs of data and obtains r = 0.48. She wishes to test at the 5% significance level whether there is a positive correlation. The critical value is 0.4409.
(a) Write down H₀ and H₁. (b) Carry out the test and state your conclusion in context.
(a) H₀: ρ = 0; H₁: ρ > 0 [B1 B1]
(b) Test statistic r = 0.48; critical value = 0.4409 [B1]
0.48 > 0.4409 → reject H₀ [M1]
There is sufficient evidence at the 5% significance level of a positive correlation in the population. [A1]