Regression & Correlation | FractionRush A-Level Statistics 2

Welcome to Regression & Correlation

This topic covers the statistical tools used to measure and model relationships between two variables. You will learn to quantify how strongly two variables are related, find the best-fit line, test whether a correlation is significant, and transform non-linear data into linear form.

Syllabus Coverage: Cambridge 9709 Statistics 2 — Regression and Correlation (typically 20–25% of S2 paper)

What You Will Learn

Learn 1 — Product Moment Correlation Coefficient (PMCC)

Compute r using Sxy, Sxx, Syy. Interpret strength and direction of linear correlation.

Learn 2 — Least Squares Regression Line

Find the equation ŷ = a + bx; understand interpolation vs extrapolation; analyse residuals.

Learn 3 — Spearman's Rank Correlation

Rank data, handle ties, apply rₛ = 1 − 6Σd²/[n(n²−1)]; know when to use it.

Learn 4 — Hypothesis Testing for Correlation

Set up H₀: ρ = 0, choose critical values from tables, state conclusions in context.

Learn 5 — Linearisation

Transform y = ab^x and y = ax^n into linear form using logarithms; read off constants from graphs.

Key Skills Checklist

✓ Calculate Sxy, Sxx, Syy from raw data or summary statistics
✓ Compute and interpret PMCC r
✓ Find regression line coefficients a and b
✓ Assign ranks and compute Spearman's rₛ
✓ Conduct a hypothesis test for correlation
✓ Linearise data and interpret log-transformed graphs

Learn 1: Product Moment Correlation Coefficient (PMCC)

The PMCC, denoted r, measures the strength and direction of the linear relationship between two variables x and y.

Summary Statistics

Sxy = Σxy − nx̄ȳ | Sxx = Σx² − nx̄² | Syy = Σy² − nȳ²

These can also be written as:

Sxy = Σxy − (Σx)(Σy)/n | Sxx = Σx² − (Σx)²/n | Syy = Σy² − (Σy)²/n

The Formula

r = Sxy / √(Sxx · Syy)
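As a minimal sketch of this calculation, the Python function below computes Sxy, Sxx, Syy and r from summary totals. The function name `pmcc_from_sums` and its parameter names are illustrative assumptions, not syllabus notation; the totals passed in are the ones quoted in Example 1.

```python
import math

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (Sxy, Sxx, Syy, r) from summary totals (hypothetical helper)."""
    sxy = sum_xy - sum_x * sum_y / n      # Sxy = Σxy − (Σx)(Σy)/n
    sxx = sum_x2 - sum_x ** 2 / n         # Sxx = Σx² − (Σx)²/n
    syy = sum_y2 - sum_y ** 2 / n         # Syy = Σy² − (Σy)²/n
    r = sxy / math.sqrt(sxx * syy)        # square root of the product
    return sxy, sxx, syy, r

# Totals from Example 1: n = 5, Σx = 25, Σy = 40, Σx² = 145, Σy² = 340, Σxy = 218
sxy, sxx, syy, r = pmcc_from_sums(5, 25, 40, 145, 340, 218)
print(sxy, sxx, syy, round(r, 3))
```

Working from totals rather than raw data mirrors how exam questions usually present the information.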

Properties of r

• r always lies in [−1, 1]
• r = +1: perfect positive linear correlation
• r = −1: perfect negative linear correlation
• r = 0: no linear correlation (there may still be a non-linear relationship)
• r is dimensionless — it doesn't depend on the units of x or y

Interpreting the Strength

|r| range	Interpretation
0.9 – 1.0	Very strong linear correlation
0.7 – 0.9	Strong linear correlation
0.5 – 0.7	Moderate linear correlation
0.3 – 0.5	Weak linear correlation
0 – 0.3	Very weak / negligible correlation

Exam tip: Always state both the direction (positive/negative) and the strength (weak/moderate/strong) when interpreting r in context.

Positive vs Negative Correlation

Positive correlation (r > 0): As x increases, y tends to increase.
Negative correlation (r < 0): As x increases, y tends to decrease.
No correlation (r ≈ 0): No consistent linear trend.

Causation ≠ Correlation: Even if |r| is close to 1, this does NOT prove that x causes y. There may be a confounding variable or the relationship may be coincidental.

Learn 2: Least Squares Regression Line

The regression line of y on x gives the best-fit straight line minimising the sum of squared vertical distances from each data point to the line.

The Equation

ŷ = a + bx

b = Sxy / Sxx | a = ȳ − bx̄

Key Property

The regression line always passes through (x̄, ȳ) — the point of means.
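The two coefficient formulas and the point-of-means property can be checked numerically. The sketch below is one possible implementation; the name `regression_y_on_x` is a hypothetical helper, and the totals are those given in Example 8.

```python
def regression_y_on_x(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the line y-hat = a + b*x (hypothetical helper)."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    sxy = sum_xy - n * x_bar * y_bar
    sxx = sum_x2 - n * x_bar ** 2
    b = sxy / sxx              # gradient
    a = y_bar - b * x_bar      # intercept
    return a, b

# Totals from Example 8: n = 8, Σx = 64, Σy = 96, Σx² = 560, Σxy = 806
a, b = regression_y_on_x(8, 64, 96, 560, 806)
x_bar, y_bar = 64 / 8, 96 / 8

print(round(a, 3), round(b, 3))
# The fitted line evaluated at x-bar should reproduce y-bar
print(abs((a + b * x_bar) - y_bar) < 1e-9)
```

Substituting x̄ into the fitted line is a quick way to catch arithmetic slips in a or b.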

Regression of x on y vs y on x

• y on x: use when predicting y from a given x. Minimises vertical residuals.
• x on y: use when predicting x from a given y. Minimises horizontal residuals.
• These are two different lines (unless r = ±1).
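To see that the two lines differ, the sketch below compares their gradients. It assumes the standard result that the gradient of x on y is Sxy/Syy (obtained by swapping the roles of x and y; this formula is not stated in these notes). The function name `both_gradients` is hypothetical, and the values Sxy = 38, Sxx = 48, Syy = 96 come from Example 8.

```python
import math

def both_gradients(sxy, sxx, syy):
    """Return (gradient of y on x, gradient of x on y, r). Hypothetical helper."""
    b_y_on_x = sxy / sxx    # change in predicted y per unit of x
    d_x_on_y = sxy / syy    # change in predicted x per unit of y (assumed standard formula)
    r = sxy / math.sqrt(sxx * syy)
    return b_y_on_x, d_x_on_y, r

b, d, r = both_gradients(38, 48, 96)
print(round(b, 4), round(d, 4))
# The product of the two gradients equals r squared
print(abs(b * d - r ** 2) < 1e-12)
```

Because b × d = r², the two lines coincide when plotted on the same axes only if r² = 1, which matches the bullet above.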

Interpolation vs Extrapolation

Interpolation (predicting within the range of data): Generally reliable — the model is supported by evidence in that region.

Extrapolation (predicting outside the range): Unreliable — the linear model may not hold beyond the observed data.

Residuals

Residual = y − ŷ (actual minus predicted)

• Positive residual: the actual value is above the regression line
• Negative residual: the actual value is below the regression line
• The sum of all residuals = 0 for a least squares line
• Residual plots: random scatter → model is appropriate; patterns → model is inappropriate
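The residual properties listed above can be illustrated with a short sketch. The data set below is made up purely for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least squares line of y on x; returns (a, b). Hypothetical helper."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Illustrative data only (not taken from any exam question)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
# Residual = actual minus predicted
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print([round(e, 2) for e in residuals])
# For a least squares line the residuals should sum to zero (up to rounding)
print(abs(sum(residuals)) < 1e-9)
```

A positive entry in the list marks a point above the line and a negative entry a point below it.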

Common error: Using the regression line to predict y at an x value far outside the data range. Always check whether you are interpolating or extrapolating.

Interpreting Coefficients

b (gradient): For each unit increase in x, y is predicted to change by b units.
a (intercept): The predicted value of y when x = 0. This may not always have a meaningful real-world interpretation.

Learn 3: Spearman's Rank Correlation Coefficient

Spearman's rₛ measures the strength and direction of the monotonic relationship between two variables using ranked data.

The Formula

rₛ = 1 − 6Σd² / [n(n² − 1)]

where d = difference between the ranks of each paired observation, and n = number of pairs.

How to Rank

1. Rank the x values from smallest (rank 1) to largest (rank n).
2. Rank the y values from smallest (rank 1) to largest (rank n).
3. Find d = rank(x) − rank(y) for each pair.
4. Compute Σd² then apply the formula.
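Steps 3 and 4 can be sketched in a few lines of Python. The function name `spearman_from_ranks` is an illustrative assumption, and the ranks used are those from Example 3; the sketch assumes the data have already been ranked and contain no ties.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's coefficient from two lists of ranks (hypothetical helper)."""
    n = len(rank_x)
    # d = rank(x) − rank(y) for each pair, then sum the squares
    sum_d2 = sum((u - v) ** 2 for u, v in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks from Example 3: (1,2), (2,1), (3,4), (4,3), (5,5)
maths = [1, 2, 3, 4, 5]
science = [2, 1, 4, 3, 5]
rs = spearman_from_ranks(maths, science)
print(round(rs, 3))
```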

Handling Tied Ranks

When two or more values are equal, assign each the average of the ranks they would have occupied.

Example: If two values tie for ranks 3 and 4, both get rank 3.5.
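The averaging rule can be written as a small routine. This is a sketch only; `average_ranks` is a hypothetical name, and the values tested are the ones used in Example 4.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest, sharing the mean rank among ties."""
    # Indices of the values in ascending order
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Sorted positions pos..end would have taken ranks pos+1..end+1
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

print(average_ranks([3, 7, 7, 9, 12]))
```

The ranks are returned in the original order of the data, so they can be paired directly with the ranks of the other variable.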

When to Use Spearman's vs PMCC

Situation	Use
Data is clearly linear, both variables quantitative and approximately normal	PMCC (r)
Data is ordinal (e.g., rankings, scores)	Spearman's (rₛ)
Relationship is monotonic but not necessarily linear	Spearman's (rₛ)
Outliers are present that would distort PMCC	Spearman's (rₛ)
Data is not normally distributed	Spearman's (rₛ)

Comparing rₛ and r

• Both lie in [−1, 1] and are interpreted similarly for direction and strength.
• rₛ is the PMCC applied to the ranks of the data (not the raw values).
• If data is bivariate normal, PMCC is more powerful; otherwise Spearman's is more robust.

Note: The formula rₛ = 1 − 6Σd²/[n(n²−1)] gives an exact result only when there are no tied ranks. With many ties, it is better to calculate the PMCC of the ranks directly (though the formula is still used in A-Level exams).
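The effect of ties can be seen by computing both quantities on the same ranks. In the sketch below the rank lists are invented for illustration, and the helper names `pmcc` and `spearman_formula` are assumptions.

```python
import math

def pmcc(u, v):
    """PMCC of two equal-length lists (hypothetical helper)."""
    n = len(u)
    suv = sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n
    suu = sum(a * a for a in u) - sum(u) ** 2 / n
    svv = sum(b * b for b in v) - sum(v) ** 2 / n
    return suv / math.sqrt(suu * svv)

def spearman_formula(u, v):
    """The 1 − 6Σd²/[n(n²−1)] shortcut applied to ranks."""
    n = len(u)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Illustrative ranks: x has a tie for ranks 2 and 3, y has no ties
tied_x = [1, 2.5, 2.5, 4, 5]
untied_y = [1, 2, 3, 4, 5]

shortcut = spearman_formula(tied_x, untied_y)
direct = pmcc(tied_x, untied_y)
print(round(shortcut, 4), round(direct, 4))
```

With a single tie the two values differ only slightly, which is consistent with the shortcut still being accepted at A-Level.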

Learn 4: Hypothesis Testing for Correlation

We can test whether there is evidence of correlation in the population, using the sample correlation coefficient r as the test statistic.

Setting Up the Test

H₀: ρ = 0 (no population correlation)

H₁ options:
• ρ ≠ 0 → two-tailed test (testing for any correlation)
• ρ > 0 → one-tailed test (testing for positive correlation)
• ρ < 0 → one-tailed test (testing for negative correlation)

Critical Values

The test statistic is the sample r (or rₛ for Spearman's). Critical values are given in the exam paper.

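• Two-tailed test: reject H₀ if |r| > critical value.
• One-tailed test: reject H₀ only if r lies beyond the critical value in the direction stated in H₁ (r > critical value for H₁: ρ > 0; r < −critical value for H₁: ρ < 0).
• Otherwise do not reject H₀: there is insufficient evidence of correlation.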
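The decision rule can be sketched as a small function. The name `correlation_test` and the labels "greater", "less" and "two-sided" are illustrative assumptions; the critical value must still be looked up in the tables for the correct n, significance level and number of tails. The numbers used are those from Example 5.

```python
def correlation_test(r, critical_value, alternative):
    """Compare a sample coefficient with a tabulated critical value.

    alternative: "greater" (H1: rho > 0), "less" (H1: rho < 0)
                 or "two-sided" (H1: rho != 0). Hypothetical labels.
    """
    if alternative == "greater":
        reject = r > critical_value
    elif alternative == "less":
        reject = r < -critical_value
    elif alternative == "two-sided":
        reject = abs(r) > critical_value
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return "reject H0" if reject else "do not reject H0"

# Example 5: n = 10, r = 0.68, one-tailed 5% critical value 0.5494
decision = correlation_test(0.68, 0.5494, "greater")
print(decision)
```

The function only automates the comparison; the conclusion still has to be written in the context of the question.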

Stating Conclusions

Always state conclusions in context. Do not just say "reject H₀". Say:

"There is sufficient evidence at the X% significance level to conclude that there is a positive correlation between [variable 1] and [variable 2] in the population."

One-Tailed vs Two-Tailed

Significance level α	One-tailed critical region	Two-tailed critical region
10%	10% in the tail named by H₁	Upper and lower 5% each
5%	5% in the tail named by H₁	Upper and lower 2.5% each
1%	1% in the tail named by H₁	Upper and lower 0.5% each

Assumptions

When testing using PMCC: both variables should come from a bivariate normal distribution.
When testing using Spearman's: no distributional assumption required — it is non-parametric.

Common error: Using the wrong critical value table (e.g., using a one-tailed value for a two-tailed test). Read the question carefully to identify H₁ before looking up the table.

Step-by-Step Procedure

1. State H₀: ρ = 0 and H₁ (with direction if specified) B1

2. State the significance level α

3. Calculate r (or rₛ) from the data M1

4. Find the critical value from the table for the given n and α B1

5. Compare r with the critical value M1

6. State the conclusion in context A1

Learn 5: Linearisation

When data follows a non-linear model, we can transform it into a linear form and then apply regression and correlation techniques.

Model 1: y = ab^x (Exponential)

Take log of both sides: log y = log a + x · log b

• Plot log y against x
• The graph should be approximately linear
• Gradient = log b → so b = 10^(gradient) [if log base 10] or b = e^(gradient) [if ln]
• Intercept = log a → so a = 10^(intercept) or a = e^(intercept)
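Recovering a and b from the straight-line graph is a pair of anti-log steps, sketched below. The function name `exponential_constants` and its `base` parameter are illustrative assumptions; the gradient 0.301 and intercept 1.2 are the values used in Example 6.

```python
import math

def exponential_constants(gradient, intercept, base=10.0):
    """Recover (a, b) for y = a * b**x from a plot of log y against x.

    `base` is the base of the logarithm used for the plot:
    10.0 for log base 10, math.e for natural logarithms.
    """
    b = base ** gradient      # gradient = log b
    a = base ** intercept     # intercept = log a
    return a, b

a, b = exponential_constants(0.301, 1.2)
print(round(a, 2), round(b, 2))

# Same idea for a plot of ln y against x (illustrative numbers)
a_ln, b_ln = exponential_constants(0.5, 1.0, base=math.e)
```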

Model 2: y = ax^n (Power Law)

Take log of both sides: log y = log a + n · log x

• Plot log y against log x
• The graph should be approximately linear
• Gradient = n
• Intercept = log a → so a = 10^(intercept) [if log base 10] or a = e^(intercept) [if ln]
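The whole procedure for a power law (take logs, fit a line, convert back) can be sketched as follows. The data are generated from y = 3x² purely for illustration, so the fit should return a ≈ 3 and n ≈ 2; `fit_power_law` is a hypothetical helper name and base 10 logarithms are assumed.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**n by regressing log10(y) on log10(x); returns (a, n)."""
    u = [math.log10(x) for x in xs]   # horizontal axis: log x
    v = [math.log10(y) for y in ys]   # vertical axis: log y
    count = len(u)
    u_bar = sum(u) / count
    v_bar = sum(v) / count
    suv = sum(ui * vi for ui, vi in zip(u, v)) - count * u_bar * v_bar
    suu = sum(ui ** 2 for ui in u) - count * u_bar ** 2
    gradient = suv / suu                      # gradient = n
    intercept = v_bar - gradient * u_bar      # intercept = log a
    return 10 ** intercept, gradient

# Illustrative data generated from y = 3 * x**2
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

a, n = fit_power_law(xs, ys)
print(round(a, 3), round(n, 3))
```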

Which Transformation to Use?

Model	Transform	Plot	Gradient	Intercept
y = ab^x	log y = log a + x log b	log y vs x	log b	log a
y = ax^n	log y = log a + n log x	log y vs log x	n	log a

Exam tip: Cambridge 9709 typically uses logarithms base 10 (log) or natural logarithms (ln). Be consistent throughout a question — check which the question specifies.

Reading Off Constants

1. Identify the linearised form and which variables are plotted

2. Read gradient and y-intercept from the graph or regression equation

3. Use gradient and intercept to recover a, b (or a, n)

4. State the model clearly: e.g., y = 2.3 × (1.5)^x

Common mistake: Forgetting to convert back from log a to a. If the intercept is 0.8 (log base 10), then a = 10^0.8 ≈ 6.31, not 0.8.

Example 1 — Computing PMCC from a Table

Given n = 5, Σx = 25, Σy = 40, Σx² = 145, Σy² = 340, Σxy = 218. Find r.

x̄ = 25/5 = 5, ȳ = 40/5 = 8 B1

Sxy = 218 − 5×5×8 = 218 − 200 = 18 M1

Sxx = 145 − 5×25 = 145 − 125 = 20 A1

Syy = 340 − 5×64 = 340 − 320 = 20 A1

r = 18/√(20×20) = 18/20 = 0.9 A1

r = 0.9 indicates a very strong positive linear correlation between x and y.

Example 2 — Finding the Regression Line

Using data from Example 1 (Sxy = 18, Sxx = 20, x̄ = 5, ȳ = 8), find the regression line of y on x.

b = Sxy/Sxx = 18/20 = 0.9 M1 A1

a = ȳ − bx̄ = 8 − 0.9×5 = 8 − 4.5 = 3.5 M1 A1

Regression line: ŷ = 3.5 + 0.9x A1

Check: when x = x̄ = 5, ŷ = 3.5 + 4.5 = 8 = ȳ ✓

Example 3 — Spearman's Rank Correlation

Ranks for 5 students in Maths (x) and Science (y): (1,2),(2,1),(3,4),(4,3),(5,5). Find rₛ.

d values: 1−2=−1, 2−1=1, 3−4=−1, 4−3=1, 5−5=0 M1

d² values: 1, 1, 1, 1, 0 → Σd² = 4 A1

rₛ = 1 − 6×4/[5×(25−1)] = 1 − 24/120 = 1 − 0.2 = 0.8 A1

Strong positive agreement in rankings between Maths and Science.

Example 4 — Tied Ranks

x values: 3, 7, 7, 9, 12. Assign ranks.

3 → rank 1; two 7s would be ranks 2 and 3, so both get rank 2.5; 9 → rank 4; 12 → rank 5 M1 A1

Assigned ranks: 1, 2.5, 2.5, 4, 5

Always check for ties before computing Spearman's rₛ.

Example 5 — Hypothesis Test for Correlation

For n = 10 pairs, r = 0.68. Test at the 5% significance level whether there is positive correlation.

H₀: ρ = 0 | H₁: ρ > 0 (one-tailed) B1

Significance level: 5%, one-tailed. Critical value for n=10 at 5% = 0.5494 B1

Test statistic: r = 0.68 > 0.5494 M1

Reject H₀. There is sufficient evidence at the 5% level to conclude that there is a positive correlation in the population. A1

Example 6 — Linearisation: Exponential Model

Data follows y = ab^x. A plot of log y against x gives gradient 0.301 and intercept 1.2. Find a and b.

log b = 0.301 → b = 10^0.301 ≈ 2.0 M1 A1

log a = 1.2 → a = 10^1.2 ≈ 15.85 M1 A1

Model: y = 15.85 × 2^x (approximately) A1

Example 7 — Linearisation: Power Law

Data follows y = ax^n. A plot of log y against log x gives gradient 1.5 and intercept 0.6. Find a and n.

n = gradient = 1.5 B1

log a = 0.6 → a = 10^0.6 ≈ 3.98 M1 A1

Model: y ≈ 3.98 x^1.5 A1

Example 8 — Full Worked Example from Summary Stats

n=8, Σx=64, Σy=96, Σx²=560, Σy²=1248, Σxy=806. Find r and the regression line of y on x.

x̄ = 64/8 = 8, ȳ = 96/8 = 12 B1

Sxy = 806 − 8×8×12 = 806 − 768 = 38 M1 A1

Sxx = 560 − 8×64 = 560 − 512 = 48 A1

Syy = 1248 − 8×144 = 1248 − 1152 = 96 A1

r = 38/√(48×96) = 38/√4608 = 38/67.88 ≈ 0.560 A1

b = 38/48 ≈ 0.792; a = 12 − (38/48)×8 = 12 − 6.333 = 5.667 (use the unrounded b to avoid rounding error) M1 A1

Regression line: ŷ = 5.67 + 0.792x A1

r = 0.560 → moderate positive linear correlation.

Common Mistakes

Mistake 1: Wrong Sxy Formula

✗ Sxy = Σxy − x̄ȳ (missing the n)

✓ Sxy = Σxy − n·x̄·ȳ or equivalently Σxy − (Σx)(Σy)/n

Mistake 2: Forgetting to Square Root in PMCC

✗ r = Sxy / (Sxx · Syy)

✓ r = Sxy / √(Sxx · Syy) — you must square root the product

Mistake 3: Extrapolating Beyond the Data

✗ Using ŷ = 3.5 + 0.9x to predict y when x = 1000 (data only goes to x = 20)

✓ Only use the regression line to interpolate within the observed range of x values. State that extrapolation is unreliable.

Mistake 4: Confusing the Two Regression Lines

✗ Using the regression line of y on x to predict x from a given y

✓ To predict y from x → use y on x line. To predict x from y → use x on y line (different gradient and intercept).

Mistake 5: Forgetting to Convert Back from Log

✗ In y = ab^x, if intercept of log y vs x is 0.7, stating a = 0.7

✓ log a = 0.7 → a = 10^0.7 ≈ 5.01. Always anti-log the intercept to recover a.

Mistake 6: Wrong H₁ for Hypothesis Test

✗ Testing for positive correlation but writing H₁: ρ ≠ 0 (two-tailed)

✓ Read the question: "evidence of positive correlation" → H₁: ρ > 0 (one-tailed). The whole significance level then sits in one tail, so the critical value is smaller than the two-tailed value at the same level.

Mistake 7: Ignoring Tied Ranks in Spearman's

✗ Giving tied values different consecutive ranks (e.g., ranking two 7s as 2 and 3)

✓ Tied values get the average rank: both 7s get rank 2.5. Then continue with rank 4 for the next value.

Key Formulas Reference Sheet

Formula	Meaning / Notes
Sxy = Σxy − nx̄ȳ	Sum of cross-products (corrected)
Sxy = Σxy − (Σx)(Σy)/n	Equivalent form using totals
Sxx = Σx² − nx̄²	Sum of squares for x
Sxx = Σx² − (Σx)²/n	Equivalent form
Syy = Σy² − nȳ²	Sum of squares for y
r = Sxy / √(Sxx·Syy)	PMCC; r ∈ [−1, 1]
b = Sxy / Sxx	Gradient of regression line y on x
a = ȳ − bx̄	Intercept of regression line y on x
ŷ = a + bx	Equation of regression line
Residual = y − ŷ	Actual minus predicted value
rₛ = 1 − 6Σd²/[n(n²−1)]	Spearman's rank correlation coefficient
d = rank(x) − rank(y)	Rank difference for each pair
H₀: ρ = 0	Null hypothesis for correlation test
log y = log a + x·log b	Linearised form of y = ab^x (plot log y vs x)
log y = log a + n·log x	Linearised form of y = ax^n (plot log y vs log x)
Regression passes through (x̄, ȳ)	Key property — use to check working

Proof Bank

Proof 1: PMCC from Covariance Definition

The population PMCC is defined as:

ρ = Cov(X,Y) / [σ_X · σ_Y]

The sample analogue uses:

Cov(X,Y) = (1/n)Σ(xᵢ − x̄)(yᵢ − ȳ)

Now expand Σ(xᵢ − x̄)(yᵢ − ȳ):

= Σ(xᵢyᵢ − x̄yᵢ − xᵢȳ + x̄ȳ)

= Σxᵢyᵢ − x̄Σyᵢ − ȳΣxᵢ + nx̄ȳ

= Σxy − x̄(nȳ) − ȳ(nx̄) + nx̄ȳ = Σxy − nx̄ȳ − nx̄ȳ + nx̄ȳ

= Σxy − nx̄ȳ = Sxy

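Similarly, Σ(xᵢ − x̄)² = Σx² − nx̄² = Sxx, and Σ(yᵢ − ȳ)² = Syy, so the sample covariance is Sxy/n and the sample standard deviations are √(Sxx/n) and √(Syy/n).

The factors of 1/n cancel in the ratio, so the sample PMCC is: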

r = Sxy / √(Sxx · Syy) ∎

Proof 2: Least Squares Derivation (Minimising Σ(y − a − bx)²)

Let L = Σ(yᵢ − a − bxᵢ)². We minimise over a and b.

Partial derivative with respect to a:

∂L/∂a = −2Σ(yᵢ − a − bxᵢ) = 0

→ Σyᵢ = na + bΣxᵢ → a = ȳ − bx̄

Partial derivative with respect to b:

∂L/∂b = −2Σxᵢ(yᵢ − a − bxᵢ) = 0

→ Σxᵢyᵢ = aΣxᵢ + bΣxᵢ²

Substitute a = ȳ − bx̄:

Σxy = (ȳ − bx̄)Σx + bΣx² = ȳ·Σx − bx̄·Σx + bΣx²

Σxy − ȳ·Σx = b(Σx² − x̄·Σx)

Since ȳ·Σx = nȳ·x̄ = nx̄ȳ and x̄·Σx = nx̄²:

Sxy = b · Sxx → b = Sxy/Sxx ∎
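A numerical sanity check of this result is sketched below: for a small made-up data set, the sum of squares L at the derived (a, b) is compared with L at nearby values. The data and the helper name `sum_sq` are illustrative assumptions.

```python
def sum_sq(a, b, xs, ys):
    """L = sum of (y − a − b*x)² over the data (hypothetical helper)."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Illustrative data only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
sxx = sum(x * x for x in xs) - n * x_bar ** 2

b_hat = sxy / sxx              # b = Sxy/Sxx
a_hat = y_bar - b_hat * x_bar  # a = y-bar − b*x-bar

best = sum_sq(a_hat, b_hat, xs, ys)
# Evaluate L at the eight neighbouring points a_hat ± 0.1, b_hat ± 0.1
nearby = [
    sum_sq(a_hat + da, b_hat + db, xs, ys)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]
print(round(best, 3), all(value > best for value in nearby))
```

A check like this does not replace the proof, but it confirms that the stationary point found is a minimum for this data set.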

Proof 3: Spearman's rₛ is the PMCC of the Ranks

Let uᵢ = rank of xᵢ and vᵢ = rank of yᵢ, with no ties. Both u and v are permutations of {1, 2, …, n}.

For a set {1, 2, …, n}: ū = v̄ = (n+1)/2

Σuᵢ² = n(n+1)(2n+1)/6, so Suu = Σuᵢ² − nū² = n(n+1)(2n+1)/6 − n(n+1)²/4 = n(n²−1)/12

Similarly Svv = n(n²−1)/12.

Now dᵢ = uᵢ − vᵢ = (uᵢ − ū) − (vᵢ − v̄) because ū = v̄, so Σdᵢ² = Σ[(uᵢ − ū) − (vᵢ − v̄)]² = Suu + Svv − 2Suv

→ Suv = (Suu + Svv − Σd²)/2 = [n(n²−1)/12 + n(n²−1)/12 − Σd²]/2 = n(n²−1)/12 − Σd²/2

PMCC of ranks = Suv/√(Suu·Svv) = [n(n²−1)/12 − Σd²/2] / [n(n²−1)/12]

= 1 − Σd² / [n(n²−1)/6] = 1 − 6Σd²/[n(n²−1)] = rₛ ∎
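The identity can also be checked by brute force for a small n. The sketch below compares the two expressions over every permutation of the ranks 1 to 5; the helper names `pmcc` and `spearman_formula` are assumptions, and n = 5 is an arbitrary choice that keeps the loop small.

```python
import math
from itertools import permutations

def pmcc(u, v):
    """PMCC of two equal-length sequences (hypothetical helper)."""
    n = len(u)
    suv = sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n
    suu = sum(a * a for a in u) - sum(u) ** 2 / n
    svv = sum(b * b for b in v) - sum(v) ** 2 / n
    return suv / math.sqrt(suu * svv)

def spearman_formula(u, v):
    """1 − 6Σd²/[n(n²−1)] applied to untied ranks."""
    n = len(u)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

n = 5
u = list(range(1, n + 1))
# Largest disagreement between the two expressions over all 120 orderings of v
max_gap = max(abs(pmcc(u, v) - spearman_formula(u, v)) for v in permutations(u))
print(max_gap < 1e-9)
```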