Regression & Correlation S2 Stats

Grade 12 · Statistics 2 · Cambridge A-Level 9709 · Age 17–18

Welcome to Regression & Correlation

This topic covers the statistical tools used to measure and model relationships between two variables. You will learn to quantify how strongly two variables are related, find the best-fit line, test whether a correlation is significant, and transform non-linear data into linear form.

Syllabus Coverage: Cambridge 9709 Statistics 2 — Regression and Correlation (typically 20–25% of S2 paper)

What You Will Learn

Learn 1 — Product Moment Correlation Coefficient (PMCC)

Compute r using Sxy, Sxx, Syy. Interpret strength and direction of linear correlation.

Learn 2 — Least Squares Regression Line

Find the equation ŷ = a + bx; understand interpolation vs extrapolation; analyse residuals.

Learn 3 — Spearman's Rank Correlation

Rank data, handle ties, apply rₛ = 1 − 6Σd²/[n(n²−1)]; know when to use it.

Learn 4 — Hypothesis Testing for Correlation

Set up H₀: ρ = 0, choose critical values from tables, state conclusions in context.

Learn 5 — Linearisation

Transform y = ab^x and y = ax^n into linear form using logarithms; read off constants from graphs.

Key Skills Checklist

  • ✓ Calculate Sxy, Sxx, Syy from raw data or summary statistics
  • ✓ Compute and interpret PMCC r
  • ✓ Find regression line coefficients a and b
  • ✓ Assign ranks and compute Spearman's rₛ
  • ✓ Conduct a hypothesis test for correlation
  • ✓ Linearise data and interpret log-transformed graphs

Learn 1: Product Moment Correlation Coefficient (PMCC)

The PMCC, denoted r, measures the strength and direction of the linear relationship between two variables x and y.

Summary Statistics

Sxy = Σxy − nx̄ȳ  |  Sxx = Σx² − nx̄²  |  Syy = Σy² − nȳ²

These can also be written as:

Sxy = Σxy − (Σx)(Σy)/n  |  Sxx = Σx² − (Σx)²/n  |  Syy = Σy² − (Σy)²/n

The Formula

r = Sxy / √(Sxx · Syy)
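
As a rough illustration of the formulas above, the following Python sketch computes Sxy, Sxx, Syy and r from raw data. The function names `s_values` and `pmcc`, and the five data pairs, are hypothetical and were invented only for this example.

```python
from math import sqrt

def s_values(xs, ys):
    """Return (Sxy, Sxx, Syy) using the 'totals' form of the formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    return sxy, sxx, syy

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    sxy, sxx, syy = s_values(xs, ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pmcc(xs, ys), 3))  # 0.775
```

Here Sxy = 6, Sxx = 10 and Syy = 6, so r = 6/√60 ≈ 0.775, a strong positive linear correlation on the scale used in these notes.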

Properties of r

• r always lies in [−1, 1]
• r = +1: perfect positive linear correlation
• r = −1: perfect negative linear correlation
• r = 0: no linear correlation (there may still be a non-linear relationship)
• r is dimensionless — it doesn't depend on the units of x or y

Interpreting the Strength

Range of |r| | Interpretation
0.9 – 1.0 | Very strong linear correlation
0.7 – 0.9 | Strong linear correlation
0.5 – 0.7 | Moderate linear correlation
0.3 – 0.5 | Weak linear correlation
0 – 0.3 | Very weak / negligible correlation
Exam tip: Always state both the direction (positive/negative) and the strength (weak/moderate/strong) when interpreting r in context.

Positive vs Negative Correlation

Positive correlation (r > 0): As x increases, y tends to increase.
Negative correlation (r < 0): As x increases, y tends to decrease.
No correlation (r ≈ 0): No consistent linear trend.
Causation ≠ Correlation: Even if |r| is close to 1, this does NOT prove that x causes y. There may be a confounding variable or the relationship may be coincidental.

Learn 2: Least Squares Regression Line

The regression line of y on x gives the best-fit straight line minimising the sum of squared vertical distances from each data point to the line.

The Equation

ŷ = a + bx
b = Sxy / Sxx     a = ȳ − bx̄

Key Property

The regression line always passes through (x̄, ȳ) — the point of means.
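
A minimal Python sketch of the two coefficient formulas, assuming raw data is available. The function name `regression_y_on_x` and the data are hypothetical. The final line checks the key property numerically.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # gradient
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_y_on_x(xs, ys)
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

# The fitted line evaluated at x-bar should return y-bar.
print(abs((a + b * x_bar) - y_bar) < 1e-12)  # True
```

For this data b = 0.6 and a = 2.2, so the line is ŷ = 2.2 + 0.6x and it passes through the point of means (3, 4).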

Regression of x on y vs y on x

y on x: use when predicting y from a given x. Minimises vertical residuals.
x on y: use when predicting x from a given y. Minimises horizontal residuals.
• These are two different lines (unless r = ±1).

Interpolation vs Extrapolation

Interpolation (predicting within the range of data): Generally reliable — the model is supported by evidence in that region.

Extrapolation (predicting outside the range): Unreliable — the linear model may not hold beyond the observed data.

Residuals

Residual = y − ŷ (actual minus predicted)
• Positive residual: the actual value is above the regression line
• Negative residual: the actual value is below the regression line
• The sum of all residuals = 0 for a least squares line
• Residual plots: random scatter → model is appropriate; patterns → model is inappropriate
Common error: Using the regression line to predict y at an x value far outside the data range. Always check whether you are interpolating or extrapolating.
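
The residual properties listed above can be checked numerically. This is only a sketch: the helper names `fit_line` and `residuals` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares line of y on x; returns (a, b) with y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def residuals(xs, ys):
    """Residual = actual y minus predicted y, for each data point."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys)
print([round(e, 2) for e in res])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(abs(sum(res)) < 1e-9)        # True: residuals sum to zero
```

Points with positive residuals (x = 2 and x = 3 here) lie above the line, and those with negative residuals lie below it.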

Interpreting Coefficients

b (gradient): For each unit increase in x, y is predicted to change by b units.
a (intercept): The predicted value of y when x = 0. This may not always have a meaningful real-world interpretation.

Learn 3: Spearman's Rank Correlation Coefficient

Spearman's rₛ measures the strength and direction of the monotonic relationship between two variables using ranked data.

The Formula

rₛ = 1 − 6Σd² / [n(n² − 1)]

where d = difference between the ranks of each paired observation, and n = number of pairs.

How to Rank

1. Rank the x values from smallest (rank 1) to largest (rank n).
2. Rank the y values from smallest (rank 1) to largest (rank n).
3. Find d = rank(x) − rank(y) for each pair.
4. Compute Σd² then apply the formula.

Handling Tied Ranks

When two or more values are equal, assign each the average of the ranks they would have occupied.

Example: If two values tie for ranks 3 and 4, both get rank (3 + 4)/2 = 3.5.
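
The averaging rule can be sketched in Python as follows. The function name `average_ranks` is hypothetical, and the test values are the ones used in Example 4 of these notes.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover every value tied with the one at sorted position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j correspond to ranks i+1..j+1; take their mean.
        shared = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

print(average_ranks([3, 7, 7, 9, 12]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```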

When to Use Spearman's vs PMCC

Situation | Use
Data is clearly linear, both variables quantitative and approximately normal | PMCC (r)
Data is ordinal (e.g., rankings, scores) | Spearman's (rₛ)
Relationship is monotonic but not necessarily linear | Spearman's (rₛ)
Outliers are present that would distort PMCC | Spearman's (rₛ)
Data is not normally distributed | Spearman's (rₛ)

Comparing rₛ and r

• Both lie in [−1, 1] and are interpreted similarly for direction and strength.
• rₛ is the PMCC applied to the ranks of the data (not the raw values).
• If data is bivariate normal, PMCC is more powerful; otherwise Spearman's is more robust.
Note: The formula rₛ = 1 − 6Σd²/[n(n²−1)] gives an exact result only when there are no tied ranks. With many ties, it is better to calculate the PMCC of the ranks directly (though the formula is still used in A-Level exams).
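
To make the note about ties concrete, this sketch compares the Σd² formula with the PMCC of the ranks. The function names are hypothetical. The untied ranks are those of Example 3 in these notes, and the tied ranks are invented for illustration.

```python
from math import sqrt

def spearman_formula(u, v):
    """r_s = 1 - 6*sum(d^2) / [n(n^2 - 1)], applied directly to ranks u, v."""
    n = len(u)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

def pmcc(u, v):
    """PMCC of two lists (here: lists of ranks)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

# No ties (ranks from Example 3): the two methods agree.
u, v = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
print(round(spearman_formula(u, v), 4), round(pmcc(u, v), 4))  # 0.8 0.8

# Hypothetical ranks with one tie: the two methods differ slightly.
ut, vt = [1, 2.5, 2.5, 4, 5], [1, 2, 3, 4, 5]
print(round(spearman_formula(ut, vt), 4), round(pmcc(ut, vt), 4))  # 0.975 0.9747
```

With a single tie the discrepancy is small, which is consistent with the formula still being acceptable in exam questions.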

Learn 4: Hypothesis Testing for Correlation

We can test whether there is evidence of correlation in the population, using the sample correlation coefficient r as the test statistic.

Setting Up the Test

H₀: ρ = 0 (no population correlation)

H₁ options:
• ρ ≠ 0 → two-tailed test (testing for any correlation)
• ρ > 0 → one-tailed test (testing for positive correlation)
• ρ < 0 → one-tailed test (testing for negative correlation)

Critical Values

The test statistic is the sample r (or rₛ for Spearman's). Critical values are given in the exam paper.

• If |r| > critical value, and for a one-tailed test r has the sign stated in H₁: reject H₀; there is sufficient evidence of correlation
• Otherwise: do not reject H₀; there is insufficient evidence of correlation

Stating Conclusions

Always state conclusions in context. Do not just say "reject H₀". Say:

"There is sufficient evidence at the X% significance level to conclude that there is a positive correlation between [variable 1] and [variable 2] in the population."

One-Tailed vs Two-Tailed

Significance level α | One-tailed critical region | Two-tailed critical region
10% | Upper 10% of distribution | Upper and lower 5% each
5% | Upper 5% of distribution | Upper and lower 2.5% each
1% | Upper 1% of distribution | Upper and lower 0.5% each

Assumptions

When testing using PMCC: both variables should come from a bivariate normal distribution.
When testing using Spearman's: no distributional assumption required — it is non-parametric.
Common error: Using the wrong critical value table (e.g., using a one-tailed value for a two-tailed test). Read the question carefully to identify H₁ before looking up the table.

Step-by-Step Procedure

1. State H₀: ρ = 0 and H₁ (with direction if specified) B1
2. State the significance level α
3. Calculate r (or rₛ) from the data M1
4. Find the critical value from the table for the given n and α B1
5. Compare r with the critical value M1
6. State the conclusion in context A1
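
Steps 5 and 6 of the procedure can be sketched as a small decision function. This is only an illustration: the function name `correlation_test` and its `alternative` labels are hypothetical, and the critical value must still be looked up in the tables for the given n and α. The inputs are taken from Example 5 and exam-style Q5 in these notes.

```python
def correlation_test(r, critical_value, alternative):
    """Compare a sample r (or r_s) with a tabulated critical value.

    alternative: 'positive' (H1: rho > 0), 'negative' (H1: rho < 0)
                 or 'two-sided' (H1: rho != 0).
    """
    if alternative == "positive":
        reject = r > critical_value
    elif alternative == "negative":
        reject = r < -critical_value
    elif alternative == "two-sided":
        reject = abs(r) > critical_value
    else:
        raise ValueError("alternative must be 'positive', 'negative' or 'two-sided'")
    return "reject H0" if reject else "do not reject H0"

print(correlation_test(0.68, 0.5494, "positive"))   # reject H0
print(correlation_test(-0.54, 0.4973, "negative"))  # reject H0
print(correlation_test(-0.54, 0.4973, "positive"))  # do not reject H0
```

The last call shows why the direction in H₁ matters: a large negative r is no evidence of positive correlation. The conclusion must still be written in context by hand.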

Learn 5: Linearisation

When data follows a non-linear model, we can transform it into a linear form and then apply regression and correlation techniques.

Model 1: y = ab^x (Exponential)

Take log of both sides: log y = log a + x · log b
• Plot log y against x
• The graph should be approximately linear
• Gradient = log b → so b = 10^(gradient) [if log base 10] or b = e^(gradient) [if ln]
• Intercept = log a → so a = 10^(intercept) or a = e^(intercept)

Model 2: y = ax^n (Power Law)

Take log of both sides: log y = log a + n · log x
• Plot log y against log x
• The graph should be approximately linear
• Gradient = n
• Intercept = log a → so a = 10^(intercept) [if log base 10] or a = e^(intercept) [if ln]

Which Transformation to Use?

Model | Transform | Plot | Gradient | Intercept
y = ab^x | log y = log a + x log b | log y vs x | log b | log a
y = ax^n | log y = log a + n log x | log y vs log x | n | log a
Exam tip: Cambridge 9709 typically uses logarithms base 10 (log) or natural logarithms (ln). Be consistent throughout a question — check which the question specifies.
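
Both transformations can be sketched in Python, assuming base 10 logarithms throughout. The function names are hypothetical, and the data is generated from known models (y = 3 × 2^x and y = 3x^1.5, chosen only for illustration) so the recovered constants can be checked.

```python
from math import log10

def fit_line(xs, ys):
    """Least squares line of y on x; returns (intercept, gradient)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    gradient = sxy / sxx
    return y_bar - gradient * x_bar, gradient

def fit_exponential(xs, ys):
    """Fit y = a*b^x by regressing log10(y) on x; returns (a, b)."""
    intercept, gradient = fit_line(xs, [log10(y) for y in ys])
    return 10 ** intercept, 10 ** gradient

def fit_power(xs, ys):
    """Fit y = a*x^n by regressing log10(y) on log10(x); returns (a, n)."""
    intercept, gradient = fit_line([log10(x) for x in xs],
                                   [log10(y) for y in ys])
    return 10 ** intercept, gradient

# Hypothetical data generated from known models.
xs = [1, 2, 3, 4, 5]
a, b = fit_exponential(xs, [3 * 2 ** x for x in xs])
print(round(a, 6), round(b, 6))   # 3.0 2.0
a, n = fit_power(xs, [3 * x ** 1.5 for x in xs])
print(round(a, 6), round(n, 6))   # 3.0 1.5
```

Note that the intercept is converted back with 10 ** intercept in both cases, while the power-law gradient is used directly as n.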

Reading Off Constants

1. Identify the linearised form and which variables are plotted
2. Read gradient and y-intercept from the graph or regression equation
3. Use gradient and intercept to recover a, b (or a, n)
4. State the model clearly: e.g., y = 2.3 × (1.5)^x
Common mistake: Forgetting to convert back from log a to a. If the intercept is 0.8 (log base 10), then a = 10^0.8 ≈ 6.31, not 0.8.

Example 1 — Computing PMCC from a Table

Given n = 5, Σx = 25, Σy = 40, Σx² = 145, Σy² = 340, Σxy = 218. Find r.

x̄ = 25/5 = 5, ȳ = 40/5 = 8 B1
Sxy = 218 − 5×5×8 = 218 − 200 = 18 M1
Sxx = 145 − 5×25 = 145 − 125 = 20 A1
Syy = 340 − 5×64 = 340 − 320 = 20 A1
r = 18/√(20×20) = 18/20 = 0.9 A1
r = 0.9 indicates a very strong positive linear correlation between x and y.

Example 2 — Finding the Regression Line

Using data from Example 1 (Sxy = 18, Sxx = 20, x̄ = 5, ȳ = 8), find the regression line of y on x.

b = Sxy/Sxx = 18/20 = 0.9 M1 A1
a = ȳ − bx̄ = 8 − 0.9×5 = 8 − 4.5 = 3.5 M1 A1
Regression line: ŷ = 3.5 + 0.9x A1
Check: when x = x̄ = 5, ŷ = 3.5 + 4.5 = 8 = ȳ ✓

Example 3 — Spearman's Rank Correlation

Ranks for 5 students in Maths (x) and Science (y): (1,2),(2,1),(3,4),(4,3),(5,5). Find rₛ.

d values: 1−2=−1, 2−1=1, 3−4=−1, 4−3=1, 5−5=0 M1
d² values: 1, 1, 1, 1, 0 → Σd² = 4 A1
rₛ = 1 − 6×4/[5×(25−1)] = 1 − 24/120 = 1 − 0.2 = 0.8 A1
Strong positive agreement in rankings between Maths and Science.

Example 4 — Tied Ranks

x values: 3, 7, 7, 9, 12. Assign ranks.

3 → rank 1; two 7s would be ranks 2 and 3, so both get rank 2.5; 9 → rank 4; 12 → rank 5 M1 A1
Assigned ranks: 1, 2.5, 2.5, 4, 5
Always check for ties before computing Spearman's rₛ.

Example 5 — Hypothesis Test for Correlation

For n = 10 pairs, r = 0.68. Test at the 5% significance level whether there is positive correlation.

H₀: ρ = 0  |  H₁: ρ > 0 (one-tailed) B1
Significance level: 5%, one-tailed. Critical value for n=10 at 5% = 0.5494 B1
Test statistic: r = 0.68 > 0.5494 M1
Reject H₀. There is sufficient evidence at the 5% level to conclude that there is a positive correlation in the population. A1

Example 6 — Linearisation: Exponential Model

Data follows y = ab^x. A plot of log y against x gives gradient 0.301 and intercept 1.2. Find a and b.

log b = 0.301 → b = 10^0.301 ≈ 2.0 M1 A1
log a = 1.2 → a = 10^1.2 ≈ 15.85 M1 A1
Model: y = 15.85 × 2^x (approximately) A1

Example 7 — Linearisation: Power Law

Data follows y = ax^n. A plot of log y against log x gives gradient 1.5 and intercept 0.6. Find a and n.

n = gradient = 1.5 B1
log a = 0.6 → a = 10^0.6 ≈ 3.98 M1 A1
Model: y ≈ 3.98 x^1.5 A1

Example 8 — Full Worked Example from Summary Stats

n=8, Σx=64, Σy=96, Σx²=560, Σy²=1248, Σxy=806. Find r and the regression line of y on x.

x̄ = 64/8 = 8, ȳ = 96/8 = 12 B1
Sxy = 806 − 8×8×12 = 806 − 768 = 38 M1 A1
Sxx = 560 − 8×64 = 560 − 512 = 48 A1
Syy = 1248 − 8×144 = 1248 − 1152 = 96 A1
r = 38/√(48×96) = 38/√4608 = 38/67.88 ≈ 0.560 A1
b = 38/48 ≈ 0.792; a = 12 − 0.792×8 = 12 − 6.333 = 5.667 M1 A1
Regression line: ŷ = 5.67 + 0.792x A1
r = 0.560 → moderate positive linear correlation.
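
The working in Example 8 can be double-checked with a short Python sketch that works directly from summary statistics. The function name `from_summary` is hypothetical.

```python
from math import sqrt

def from_summary(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (r, a, b) from summary statistics, where y-hat = a + b*x."""
    sxy = sum_xy - sum_x * sum_y / n
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum_y2 - sum_y ** 2 / n
    r = sxy / sqrt(sxx * syy)
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    return r, a, b

# Summary statistics from Example 8.
r, a, b = from_summary(8, 64, 96, 560, 1248, 806)
print(round(r, 3), round(a, 2), round(b, 3))  # 0.56 5.67 0.792
```

A check of this kind also flags inconsistent summary statistics: if the computed r falls outside [−1, 1], the totals cannot have come from real data.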

Common Mistakes

Mistake 1: Wrong Sxy Formula

✗ Sxy = Σxy − x̄ȳ (missing the n)
✓ Sxy = Σxy − n·x̄·ȳ or equivalently Σxy − (Σx)(Σy)/n

Mistake 2: Forgetting to Square Root in PMCC

✗ r = Sxy / (Sxx · Syy)
✓ r = Sxy / √(Sxx · Syy) — you must square root the product

Mistake 3: Extrapolating Beyond the Data

✗ Using ŷ = 3.5 + 0.9x to predict y when x = 1000 (data only goes to x = 20)
✓ Only use the regression line to interpolate within the observed range of x values. State that extrapolation is unreliable.

Mistake 4: Confusing the Two Regression Lines

✗ Using the regression line of y on x to predict x from a given y
✓ To predict y from x → use y on x line. To predict x from y → use x on y line (different gradient and intercept).

Mistake 5: Forgetting to Convert Back from Log

✗ In y = ab^x, if intercept of log y vs x is 0.7, stating a = 0.7
✓ log a = 0.7 → a = 10^0.7 ≈ 5.01. Always anti-log the intercept to recover a.

Mistake 6: Wrong H₁ for Hypothesis Test

✗ Testing for positive correlation but writing H₁: ρ ≠ 0 (two-tailed)
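✓ Read the question: "evidence of positive correlation" → H₁: ρ > 0 (one-tailed). A one-tailed test puts the whole significance level in one tail, so it uses a smaller critical value than a two-tailed test at the same level.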

Mistake 7: Ignoring Tied Ranks in Spearman's

✗ Giving tied values different consecutive ranks (e.g., ranking two 7s as 2 and 3)
✓ Tied values get the average rank: both 7s get rank 2.5. Then continue with rank 4 for the next value.

Key Formulas Reference Sheet

Formula | Meaning / Notes
Sxy = Σxy − nx̄ȳ | Sum of cross-products (corrected)
Sxy = Σxy − (Σx)(Σy)/n | Equivalent form using totals
Sxx = Σx² − nx̄² | Sum of squares for x
Sxx = Σx² − (Σx)²/n | Equivalent form
Syy = Σy² − nȳ² | Sum of squares for y
r = Sxy / √(Sxx·Syy) | PMCC; r ∈ [−1, 1]
b = Sxy / Sxx | Gradient of regression line y on x
a = ȳ − bx̄ | Intercept of regression line y on x
ŷ = a + bx | Equation of regression line
Residual = y − ŷ | Actual minus predicted value
rₛ = 1 − 6Σd²/[n(n²−1)] | Spearman's rank correlation coefficient
d = rank(x) − rank(y) | Rank difference for each pair
H₀: ρ = 0 | Null hypothesis for correlation test
log y = log a + x·log b | Linearised form of y = ab^x (plot log y vs x)
log y = log a + n·log x | Linearised form of y = ax^n (plot log y vs log x)
Regression passes through (x̄, ȳ) | Key property; use to check working

Proof Bank

Proof 1: PMCC from Covariance Definition

The population PMCC is defined as:

ρ = Cov(X,Y) / [σ_X · σ_Y]

The sample analogue uses:

Cov(X,Y) = (1/n)Σ(xᵢ − x̄)(yᵢ − ȳ)

Now expand Σ(xᵢ − x̄)(yᵢ − ȳ):

= Σ(xᵢyᵢ − x̄yᵢ − xᵢȳ + x̄ȳ)

= Σxᵢyᵢ − x̄Σyᵢ − ȳΣxᵢ + nx̄ȳ

= Σxy − x̄(nȳ) − ȳ(nx̄) + nx̄ȳ = Σxy − nx̄ȳ − nx̄ȳ + nx̄ȳ

= Σxy − nx̄ȳ = Sxy

Similarly, Σ(xᵢ − x̄)² = Σx² − nx̄² = Sxx, and Σ(yᵢ − ȳ)² = Syy.

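The sample covariance is Sxy/n and the sample standard deviations are √(Sxx/n) and √(Syy/n). The factors of 1/n cancel, so the sample PMCC is:

r = (Sxy/n) / √[(Sxx/n)(Syy/n)] = Sxy / √(Sxx · Syy) ∎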

Proof 2: Least Squares Derivation (Minimising Σ(y − a − bx)²)

Let L = Σ(yᵢ − a − bxᵢ)². We minimise over a and b.

Partial derivative with respect to a:

∂L/∂a = −2Σ(yᵢ − a − bxᵢ) = 0

→ Σyᵢ = na + bΣxᵢ → a = ȳ − bx̄

Partial derivative with respect to b:

∂L/∂b = −2Σxᵢ(yᵢ − a − bxᵢ) = 0

→ Σxᵢyᵢ = aΣxᵢ + bΣxᵢ²

Substitute a = ȳ − bx̄:

Σxy = (ȳ − bx̄)Σx + bΣx² = ȳ·Σx − bx̄·Σx + bΣx²

Σxy − ȳ·Σx = b(Σx² − x̄·Σx)

Since ȳ·Σx = nȳ·x̄ = nx̄ȳ and x̄·Σx = nx̄²:

Sxy = b · Sxx → b = Sxy/Sxx ∎

Proof 3: Spearman's rₛ is the PMCC of the Ranks

Let uᵢ = rank of xᵢ and vᵢ = rank of yᵢ, with no ties. Both u and v are permutations of {1, 2, …, n}.

For a set {1, 2, …, n}: ū = v̄ = (n+1)/2

Σuᵢ² = n(n+1)(2n+1)/6, so Suu = Σuᵢ² − nū² = n(n+1)(2n+1)/6 − n(n+1)²/4 = n(n²−1)/12

Similarly Svv = n(n²−1)/12.

Now dᵢ = uᵢ − vᵢ. Because ū = v̄, we can write dᵢ = (uᵢ − ū) − (vᵢ − v̄), so Σdᵢ² = Σ(uᵢ − ū)² + Σ(vᵢ − v̄)² − 2Σ(uᵢ − ū)(vᵢ − v̄) = Suu + Svv − 2Suv

→ Suv = (Suu + Svv − Σd²)/2 = [n(n²−1)/12 + n(n²−1)/12 − Σd²]/2 = n(n²−1)/12 − Σd²/2

PMCC of ranks = Suv/√(Suu·Svv) = [n(n²−1)/12 − Σd²/2] / [n(n²−1)/12]

= 1 − Σd² / [n(n²−1)/6] = 1 − 6Σd²/[n(n²−1)] = rₛ ∎
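
As a numerical sanity check of this result (not a substitute for the proof), the sketch below compares the two quantities for every possible untied ranking of n = 5 items. The function names are hypothetical.

```python
from itertools import permutations
from math import sqrt

def spearman_formula(u, v):
    """r_s = 1 - 6*sum(d^2) / [n(n^2 - 1)]."""
    n = len(u)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

def pmcc(u, v):
    """PMCC of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

n = 5
u = list(range(1, n + 1))
# Largest disagreement over all 120 rankings v of {1, ..., 5}.
max_gap = max(abs(spearman_formula(u, v) - pmcc(u, v))
              for v in permutations(u))
print(max_gap < 1e-12)  # True
```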

Exam Style Questions

Write full solutions, then check them against the mark scheme given after each question.

Q1 [5 marks]

For 7 pairs of data: Σx = 42, Σy = 63, Σx² = 280, Σy² = 609, Σxy = 398. Calculate Sxy, Sxx, Syy and hence find r. Comment on the correlation.

Sxy = 398 − 42×63/7 = 398 − 378 = 20 [M1 A1]
Sxx = 280 − 42²/7 = 280 − 252 = 28 [A1]
Syy = 609 − 63²/7 = 609 − 567 = 42 [A1]
r = 20/√(28×42) = 20/√1176 = 20/34.29 ≈ 0.583 [A1]
Moderate positive linear correlation. [B1]

Q2 [5 marks]

Using the data from Q1, find the equation of the regression line of y on x. State the value of y predicted when x = 8 and comment on the reliability of this prediction.

x̄ = 42/7 = 6, ȳ = 63/7 = 9 [B1]
b = Sxy/Sxx = 20/28 = 5/7 ≈ 0.714 [M1 A1]
a = 9 − (5/7)×6 = 9 − 30/7 = 33/7 ≈ 4.71 [A1]
ŷ = 4.71 + 0.714×8 = 4.71 + 5.71 = 10.43 [A1]
Reliability depends on whether x = 8 lies within the range of the observed x values, which is not given here. If it does (interpolation), the prediction is reasonably reliable; if it does not (extrapolation), the prediction is unreliable. [B1]

Q3 [6 marks]

Six students are ranked by a teacher and by a judge: Teacher ranks: 1,2,3,4,5,6. Judge ranks: 2,1,4,3,6,5. Calculate Spearman's rₛ and test at the 5% significance level whether there is positive agreement between the two sets of rankings (critical value = 0.8286).

d values: −1,1,−1,1,−1,1 → d² = 1,1,1,1,1,1 → Σd² = 6 [M1 A1]
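rₛ = 1 − 6×6/[6×35] = 1 − 36/210 = 29/35 ≈ 0.8286 [M1 A1]
H₀: ρₛ = 0; H₁: ρₛ > 0 [B1]
rₛ equals the critical value 0.8286 to four decimal places (rounding rₛ to 0.829 only makes it look larger), so this is a borderline case. Spearman tables are commonly read as giving the smallest value of rₛ that is significant; on that reading, reject H₀: there is sufficient evidence at the 5% level of positive agreement. [A1]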

Q4 [4 marks]

Data is believed to follow the model y = ab^x. When log y is plotted against x, the regression line is log y = 0.5 + 0.2x. Find a and b.

Comparing log y = log a + x·log b with log y = 0.5 + 0.2x: [M1]
log a = 0.5 → a = 10^0.5 ≈ 3.16 [A1]
log b = 0.2 → b = 10^0.2 ≈ 1.585 [A1]
Model: y = 3.16 × (1.585)^x [A1]

Q5 [5 marks]

A sample of n = 12 gives r = −0.54. Test at the 5% significance level whether there is negative correlation in the population. Critical value (one-tail 5%, n=12) = 0.4973.

H₀: ρ = 0; H₁: ρ < 0 (one-tailed) [B1]
Significance level: 5% [B1]
|r| = 0.54 > critical value 0.4973 [M1]
Since r is negative and |r| exceeds the critical value, reject H₀. [M1]
There is sufficient evidence at the 5% level of negative correlation in the population. [A1]

Q6 [5 marks]

Data follows y = ax^n. A log–log plot gives a straight line through (0.3, 1.1) and (0.9, 2.3). Find n and a.

Gradient = (2.3−1.1)/(0.9−0.3) = 1.2/0.6 = 2 → n = 2 [M1 A1]
Using point (0.3, 1.1): 1.1 = log a + 2×0.3 = log a + 0.6 [M1]
log a = 0.5 → a = 10^0.5 ≈ 3.16 [A1]
Model: y = 3.16 x² [A1]

Q7 [6 marks]

Bivariate data on temperature (x °C) and sales (y units) for 9 days gives Sxy = 156.2, Sxx = 210.4, Syy = 132.5. (i) Find r. (ii) State with a reason whether PMCC or Spearman's is more appropriate here.

(i) r = 156.2/√(210.4×132.5) = 156.2/√27878 = 156.2/166.97 ≈ 0.936 [M1 A1]
Very strong positive linear correlation between temperature and sales. [A1]
(ii) Since both variables are quantitative and the relationship appears linear (r close to 1), PMCC is more appropriate. [B1]
Spearman's would be used if data is ordinal or non-normal, which is not indicated here. [B1 B1]

Q8 [7 marks]

For n = 8: x̄ = 4, ȳ = 10, Sxy = 24, Sxx = 32, Syy = 20. (i) Find the regression line of y on x. (ii) A new value x = 10 is observed; comment on using the line to predict y. (iii) Find the residual for the data point (6, 13.5).

(i) b = 24/32 = 0.75; a = 10 − 0.75×4 = 7 [M1 A1 A1]
Regression line: ŷ = 7 + 0.75x [A1]
(ii) x = 10 may be outside the range of the original data; this would be extrapolation and is unreliable. [B1]
(iii) ŷ = 7 + 0.75×6 = 7 + 4.5 = 11.5 [M1]
Residual = 13.5 − 11.5 = 2.0 [A1]

Past Paper Questions (Adapted 9709 S2)

PP1 — 9709/62/O/N/18 (adapted) [6 marks]

The ages (x years) and blood pressure readings (y mmHg) for 8 patients give: Σx = 400, Σy = 1200, Σx² = 21000, Σy² = 181200, Σxy = 61400, n = 8.

(a) Calculate r and comment on what this value suggests.

(b) Find the equation of the regression line of y on x.

x̄ = 50, ȳ = 150
Sxy = 61400 − 8×50×150 = 61400 − 60000 = 1400 [M1]
Sxx = 21000 − 8×2500 = 21000 − 20000 = 1000 [A1]
Syy = 181200 − 8×22500 = 181200 − 180000 = 1200 [A1]
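r = 1400/√(1000×1200) = 1400/1095.4 ≈ 1.278, which is impossible because r must lie in [−1, 1]. The summary statistics as stated are inconsistent (Sxy² = 1 960 000 exceeds Sxx·Syy = 1 200 000), so no valid value of r, and no comment on the correlation, can be obtained from them.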
b = 1400/1000 = 1.4; a = 150 − 1.4×50 = 80
Regression line: ŷ = 80 + 1.4x [M1 A1]

PP2 — 9709/63/M/J/19 (adapted) [5 marks]

Seven items are ranked by two assessors. Assessor A: 1,2,3,4,5,6,7. Assessor B: 3,1,2,5,4,7,6. Calculate Spearman's rₛ and test at 10% significance level for agreement (critical value for n=7, one-tail 10% = 0.7143).

d: −2,1,1,−1,1,−1,1 → d²: 4,1,1,1,1,1,1 → Σd² = 10 [M1 A1]
rₛ = 1 − 6×10/[7×48] = 1 − 60/336 = 1 − 0.179 = 0.821 [M1 A1]
H₀: ρₛ = 0; H₁: ρₛ > 0 (one-tailed, 10%)
0.821 > 0.7143 → reject H₀. Evidence of agreement between assessors at 10% level. [A1]

PP3 — 9709/61/O/N/20 (adapted) [6 marks]

Data is thought to satisfy y = ab^x. Values of log₁₀y are recorded for x = 1,2,3,4,5 and give log y values: 1.3, 1.6, 1.9, 2.2, 2.5.

(a) Verify that a linear model is appropriate for log y vs x.

(b) Use the first and last point to find the gradient and intercept, then find a and b.

(a) The differences in log y are constant (0.3 each), confirming a perfect linear relationship — the model y = ab^x is appropriate. [B1 B1]
(b) Gradient = (2.5−1.3)/(5−1) = 1.2/4 = 0.3 = log b → b = 10^0.3 ≈ 1.995 ≈ 2.00 [M1 A1]
Using point (1, 1.3): 1.3 = log a + 0.3×1 → log a = 1.0 → a = 10 [M1 A1]
Model: y ≈ 10 × 2^x

PP4 — 9709/62/M/J/21 (adapted) [5 marks]

A researcher collects n = 15 pairs of data and obtains r = 0.48. She wishes to test at the 5% significance level whether there is a positive correlation. The critical value is 0.4409.

(a) Write down H₀ and H₁. (b) Carry out the test and state your conclusion in context.

(a) H₀: ρ = 0; H₁: ρ > 0 [B1 B1]
(b) Test statistic r = 0.48; critical value = 0.4409 [B1]
0.48 > 0.4409 → reject H₀ [M1]
There is sufficient evidence at the 5% significance level of a positive correlation in the population. [A1]

PP5 — 9709/63/O/N/22 (adapted) [7 marks]

Eight pairs of data (x, y) give: Σx = 56, Σy = 120, Σx² = 448, Σy² = 1966, Σxy = 910.

(a) Find Sxy, Sxx, Syy and r. (b) Find the regression line of y on x. (c) Estimate y when x = 9. State whether this is interpolation or extrapolation.

x̄ = 7, ȳ = 15
Sxy = 910 − 8×7×15 = 910 − 840 = 70 [M1 A1]
Sxx = 448 − 8×49 = 448 − 392 = 56 [A1]
Syy = 1966 − 8×225 = 1966 − 1800 = 166 [A1]
r = 70/√(56×166) = 70/√9296 = 70/96.42 ≈ 0.726 — strong positive correlation [A1]
b = 70/56 = 1.25; a = 15 − 1.25×7 = 6.25 [M1 A1]
Regression line: ŷ = 6.25 + 1.25x [A1]
When x = 9: ŷ = 6.25 + 11.25 = 17.5 [A1]
x = 9 > x̄ = 7; whether interpolation or extrapolation depends on the maximum x in the dataset. If max x < 9, this is extrapolation and is less reliable. [B1]