
Data Analysis & Correlation

Grade 10 · Statistics · Cambridge IGCSE · Age 15–16

Welcome to Data Analysis & Correlation!

Scatter graphs and lines of best fit let us spot patterns in data and make predictions. Correlation tells us how strongly two variables are linked, but on its own it never shows that one causes the other. Comparing distributions goes beyond single numbers to describe both average and spread in context.

Correlation: r from −1 to +1  |  Line through mean point (x̄, ȳ)  |  Compare: average AND spread, in context

Scatter Graphs

Plotting, correlation type and strength

PMCC (r)

Pearson's coefficient from −1 to +1

Line of Best Fit

Through mean point, equation, prediction

Interpolation vs Extrapolation

When predictions are reliable vs unreliable

Comparing Distributions

Mean/median, IQR/range, in context

Outliers

Effect on line, PMCC, and comparisons

1. Scatter Graphs and Correlation

A scatter graph plots paired data as (x, y) points. Each point represents one observation of two variables. The pattern of points shows whether, and how strongly, the two variables are related.

Types of correlation:
Positive correlation — as x increases, y tends to increase
Negative correlation — as x increases, y tends to decrease
No correlation — no clear pattern; points scattered randomly

Strength of correlation:
Strong — points cluster tightly around a line
Moderate — general trend visible but with spread
Weak — barely any pattern discernible
Pearson's product moment correlation coefficient, PMCC (r):
r ranges from −1 to +1:
r = +1: perfect positive correlation  |  r = −1: perfect negative correlation  |  r = 0: no linear correlation

r = 0.9: strong positive  |  r = 0.7: moderate positive  |  r = 0.4: weak positive
r = −0.6: moderate negative  |  r = −0.9: strong negative
[Figure: PMCC number line from r = −1 to r = +1, with marks at ±0.4, ±0.6 and ±0.8]
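To see where a value of r comes from, here is a minimal Python sketch of the standard PMCC formula, r = Sxy / √(Sxx · Syy). The function name `pmcc` is chosen for this illustration, and the data are the five points used in Example 2 of this page.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's product moment correlation coefficient for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable does not vary")
    return s_xy / sqrt(s_xx * s_yy)

# Points from Example 2: (2,5), (4,9), (6,11), (8,15), (10,20)
xs = [2, 4, 6, 8, 10]
ys = [5, 9, 11, 15, 20]
r = pmcc(xs, ys)
print(round(r, 2))  # 0.99, a strong positive correlation
```

At IGCSE you will normally be given r or asked to judge it from the graph; the sketch is only meant to show that r depends on how the points vary together about the mean point.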

2. Causation vs Correlation

Correlation does NOT imply causation!
Just because two variables are correlated does not mean one causes the other. There may be a confounding variable (lurking variable) that affects both.

Classic example: Ice cream sales and drowning deaths are positively correlated. Neither causes the other — both are caused by hot weather (the confounding variable).
Outlier effect on PMCC:
A single outlier can dramatically change r. If a dataset has r = 0.92 but one anomalous point shifts it to r = 0.64, removing that outlier should be justified by context (data entry error, unusual event). Do not remove outliers just to improve r.

3. Line of Best Fit

A line of best fit (regression line) summarises the trend. It must pass through the mean point (x̄, ȳ).

Mean point: (x̄, ȳ) where x̄ = Σx/n and ȳ = Σy/n
Steps to draw a line of best fit:
1. Calculate x̄ and ȳ (the mean of all x-values and mean of all y-values)
2. Plot the mean point — your line MUST pass through it
3. Draw the line so roughly equal numbers of points lie on each side
4. Extend the line across the full range of data (but not beyond, unless extrapolating)
Finding the equation of the line of best fit:
Pick two clear points on your line (NOT original data points)
Gradient m = (y₂ − y₁)/(x₂ − x₁)
Then use y − ȳ = m(x − x̄), or substitute one point into y = mx + c to find c.

Example: Line passes through (10, 25) and (30, 55).
m = (55−25)/(30−10) = 30/20 = 1.5
y = 1.5x + c; using (10,25): 25 = 15 + c → c = 10 → y = 1.5x + 10
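The same calculation can be sketched in Python. The helper name `line_through` is made up for this illustration; it takes two points read off the line and returns the gradient m and intercept c.

```python
def line_through(p1, p2):
    """Return (m, c) for the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: gradient is undefined")
    m = (y2 - y1) / (x2 - x1)   # gradient = change in y / change in x
    c = y1 - m * x1             # rearranged from y1 = m*x1 + c
    return m, c

# The worked example above: line through (10, 25) and (30, 55)
m, c = line_through((10, 25), (30, 55))
print(f"y = {m}x + {c}")  # y = 1.5x + 10.0
```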

4. Interpolation vs Extrapolation

Interpolation — predicting within the range of the data. Generally reliable.
Extrapolation — predicting outside the range of the data. Unreliable — the trend may not continue.
Example: Data on height vs mass for students aged 14–16. Predicting the mass of a student whose height lies within the heights measured is interpolation, so the line of best fit gives a reasonably reliable estimate. Using the same line for a height outside the measured range, or for a different group such as 25-year-olds, is unreliable because the relationship may be different there.
In exams, if asked whether a prediction is reliable, always state whether it is interpolation or extrapolation AND give a reason. "Extrapolation is unreliable because we cannot assume the trend continues outside the data range."
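The check itself is simple enough to write as a short Python sketch. The function name `prediction_type` and the x-range of 1 to 12 are assumptions chosen for illustration.

```python
def prediction_type(x_new, xs):
    """Classify a prediction by whether x_new lies inside the observed x-range."""
    lo, hi = min(xs), max(xs)
    return "interpolation" if lo <= x_new <= hi else "extrapolation"

# Assumed data: x was recorded for every whole number from 1 to 12
observed_x = list(range(1, 13))
print(prediction_type(7, observed_x))    # interpolation
print(prediction_type(15, observed_x))   # extrapolation
```

The code only decides which word applies. You still need to give the reason in an exam answer.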

5. Comparing Distributions

When comparing two datasets, you must comment on both average and spread, and put your comparison in context.

Choosing your measure of average:
Mean — best for symmetric distributions with no outliers
Median — better for skewed distributions or when outliers are present

Choosing your measure of spread:
IQR (interquartile range) — robust to outliers, based on middle 50%
Range — simple but sensitive to extreme values
Standard deviation — average distance from the mean; useful for symmetric data
Writing a comparison — model answer:
"On average, Group A scored higher than Group B (mean 72 vs mean 65), suggesting Group A performed better overall. However, Group A also had a larger spread (IQR 18 vs IQR 10), meaning Group A results were more varied/inconsistent."
Always: comparison word + values + context.
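As a rough illustration of the "comparison word + values + context" pattern, the Python sketch below summarises two groups and prints both comparison sentences. The scores are hypothetical, invented for this sketch, and the quartiles use the "inclusive" method from Python's `statistics` module, which may differ slightly from the method your textbook uses.

```python
from statistics import mean, median, quantiles

def summarise(data):
    """Return mean, median and IQR for one dataset."""
    # Assumption: the "inclusive" quartile method; other textbook
    # conventions can give slightly different quartiles.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {"mean": mean(data), "median": median(data), "iqr": q3 - q1}

# Hypothetical test scores, invented for this sketch
group_a = [50, 60, 65, 70, 72, 75, 80, 85, 95]
group_b = [58, 60, 62, 64, 65, 66, 68, 70, 72]

a, b = summarise(group_a), summarise(group_b)
higher = "Group A" if a["median"] > b["median"] else "Group B"
wider = "Group A" if a["iqr"] > b["iqr"] else "Group B"

# Sentence 1: average, with values and context
print(f"On average, {higher} scored higher "
      f"(median {a['median']} vs {b['median']}).")
# Sentence 2: spread, with values and context
print(f"{wider} had the larger spread "
      f"(IQR {a['iqr']} vs {b['iqr']}), so its scores were less consistent.")
```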

6. Standard Deviation — The Concept

Standard deviation (σ) measures how far data points typically lie from the mean (formally, the square root of the mean squared distance from the mean).
Small σ → values clustered close to the mean.
Large σ → values spread far from the mean.

Rough comparison rule: If σ₁ > σ₂, Distribution 1 is more spread out than Distribution 2.
You do not need to calculate standard deviation from raw data at IGCSE, but you must be able to compare and interpret given values of σ.
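For interest only, this Python sketch compares two made-up datasets with the same mean but different spread. It uses `pstdev` from the standard `statistics` module, which treats the data as a whole population; the dataset names and values are assumptions for illustration.

```python
from statistics import mean, pstdev

# Two invented datasets with the same mean but different spread
consistent = [68, 69, 70, 71, 72]
varied = [50, 60, 70, 80, 90]

for label, data in [("consistent", consistent), ("varied", varied)]:
    # pstdev = population standard deviation
    print(label, "mean =", mean(data), "sd =", round(pstdev(data), 2))

# The larger sd belongs to the dataset that is more spread out
more_spread = "varied" if pstdev(varied) > pstdev(consistent) else "consistent"
print("More spread out:", more_spread)
```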

7. Effect of Removing an Outlier

On the line of best fit: Removing an outlier generally changes both the gradient and the intercept, so the line becomes steeper or shallower. The new line lies closer to the remaining data.

On PMCC (r): Removing an outlier usually increases |r| (stronger correlation) if the outlier was away from the trend. It may decrease |r| if the outlier happened to be on the trend line.

On average: Removing an outlier above the mean decreases the mean; removing one below increases it. Median is less affected.

On spread: Removing an extreme value decreases the range, and may slightly decrease IQR.
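The effect on average and spread can be checked with a short Python sketch. It uses the dataset 10, 12, 14, 16, 100 that appears in the exercises on this page; the helper name `summary` is made up for this illustration.

```python
from statistics import mean, median

def summary(data):
    """Mean, median and range of a dataset."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
    }

with_outlier = [10, 12, 14, 16, 100]
without_outlier = [v for v in with_outlier if v != 100]  # drop the outlier

before = summary(with_outlier)
after = summary(without_outlier)
print(before)  # mean 30.4, median 14, range 90
print(after)   # mean 13, median 13.0, range 6
```

The mean falls by more than 17 while the median moves by only 1, which is why the median is the safer average when outliers are present.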

Example 1 — Describe Correlation

Scatter graph: as revision hours increase, test scores increase; points close to a line.
Description: There is a strong positive correlation between revision hours and test scores. As revision hours increase, test scores tend to increase.

Example 2 — Compute Mean Point

Data: (2,5), (4,9), (6,11), (8,15), (10,20)
x̄ = (2+4+6+8+10)/5 = 30/5 = 6    ȳ = (5+9+11+15+20)/5 = 60/5 = 12
Mean point: (6, 12). Plot this and draw the line through it.
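The same arithmetic as a small Python sketch (the function name `mean_point` is chosen for this illustration):

```python
def mean_point(points):
    """Return (x-bar, y-bar) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return x_bar, y_bar

data = [(2, 5), (4, 9), (6, 11), (8, 15), (10, 20)]
print(mean_point(data))  # (6.0, 12.0)
```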

Example 3 — Find Line Equation and Predict

Line of best fit passes through (2, 8) and (10, 24). Find the equation.
m = (24−8)/(10−2) = 16/8 = 2
Using (2,8): 8 = 2(2)+c → c = 4 → y = 2x + 4
Predict y when x=7: y = 2(7)+4 = 18 (interpolation — reliable if 7 is within the data range)

Example 4 — Assess Reliability of Prediction

Data range: x from 1 to 12. Predict at x=15.
x=15 is outside the data range (1 to 12) → this is extrapolation.
The prediction is unreliable because we cannot assume the linear trend continues beyond x=12. The relationship may change.

Example 5 — Compare Two Distributions

Group A: mean=68, IQR=15    Group B: mean=74, IQR=8
Average: Group B has a higher mean (74 vs 68), so on average Group B scored higher on the test.
Spread: Group A has a larger IQR (15 vs 8), so Group A's scores were more spread out / less consistent than Group B's.

Example 6 — Outlier Effect on Line

Dataset: mostly points following y ≈ 2x. One outlier at (5, 25), at the upper end of the x-values, pulls the line upward.
Removing the outlier: the gradient decreases (line less steep), the line fits the remaining points better, and r increases (closer to 1).
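A minimal Python sketch of this example follows. It assumes the line is fitted by least squares (a calculator-style regression line rather than a line drawn by eye) and assumes four illustrative points lying exactly on y = 2x; only the outlier (5, 25) comes from the example itself.

```python
from math import sqrt

def gradient_and_r(points):
    """Least-squares gradient and PMCC for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    return s_xy / s_xx, s_xy / sqrt(s_xx * s_yy)

on_trend = [(1, 2), (2, 4), (3, 6), (4, 8)]   # assumed points on y = 2x
outlier = (5, 25)                              # the outlier from the example

m_with, r_with = gradient_and_r(on_trend + [outlier])
m_without, r_without = gradient_and_r(on_trend)

print(f"with outlier:    gradient {m_with:.2f}, r = {r_with:.2f}")
print(f"without outlier: gradient {m_without:.2f}, r = {r_without:.2f}")
```

Here the outlier has the largest x-value, so it tilts the line upward. An outlier near the middle of the x-values would mainly shift the line up or down instead of changing its gradient.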

Common Mistakes to Avoid

Mistake 1 — Line Not Through the Mean Point

A line of best fit MUST pass through (x̄, ȳ). If it doesn't, it is not the line of best fit. Always calculate and plot the mean point first, then draw the line through it with roughly equal points on each side.

Mistake 2 — Treating Extrapolation as Reliable

Predicting outside the data range (extrapolation) is unreliable. The trend shown in the data may not continue beyond the measured range. Always check whether the value you are predicting for lies within or outside the data range, and comment accordingly.

Mistake 3 — Correlation Implies Causation

A strong correlation between X and Y does NOT mean X causes Y. There could be a third confounding variable affecting both. In exam answers, never write "X causes Y" from a scatter graph alone — write "as X increases, Y tends to increase" (correlation language).

Mistake 4 — Comparison Without Context or Missing Spread

When comparing two distributions, you must (a) compare averages AND spreads, and (b) relate both to the context of the data. Saying "Group A has a higher mean" is incomplete. Add "so Group A performed better on average in the test" — the context matters for marks.

Mistake 5 — Ignoring the Sign of PMCC

r = −0.85 describes strong NEGATIVE correlation, not strong positive. The sign tells you the direction. When describing correlation, always state both direction AND strength. Never say "r = −0.85 means strong correlation" — say "strong negative correlation".

Key Formulas — Data Analysis

Mean:   x̄ = Σx / n    Mean point:   (x̄, ȳ)
Gradient of line:   m = (y₂ − y₁) / (x₂ − x₁)
Equation of line:   y = mx + c    or    y − ȳ = m(x − x̄)
PMCC scale:   −1 ≤ r ≤ +1  |  |r| > 0.8: strong  |  0.5–0.8: moderate  |  <0.5: weak
Correlation language guide:
r = 0.95 → strong positive correlation
r = 0.65 → moderate positive correlation
r = 0.2 → weak positive correlation
r = −0.7 → moderate negative correlation
r ≈ 0 → no linear correlation (or very weak)
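The language guide can be written as a small Python sketch using the PMCC scale above (strong if |r| > 0.8, moderate from 0.5 to 0.8, weak below 0.5). The cutoff of 0.1 for "r ≈ 0" is an assumption made for this sketch; this page does not give an exact value.

```python
def describe_r(r, near_zero=0.1):
    """Turn a PMCC value into correlation language (direction AND strength)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    # Assumption: |r| below near_zero is treated as "r is approximately 0"
    if size < near_zero:
        return "no linear correlation (or very weak)"
    if size > 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

for value in [0.95, 0.65, 0.2, -0.7, 0.0]:
    print(value, "->", describe_r(value))
```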
Comparison framework (use this in every comparison question):
1. Compare average (mean or median) + state which is greater + context
2. Compare spread (IQR or range) + state which is larger + context
Both points needed for full marks.
Choosing mean vs median:
• Data has outliers or is skewed → use MEDIAN and IQR
• Data is roughly symmetric → use MEAN and range or standard deviation

Scatter Plot Builder

Enter up to 10 (x,y) data points. The visualiser plots them, computes the mean point (pink cross), draws the regression line, gives an r description, and predicts y from x.
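How the visualiser computes its line is not stated on this page. As a hedged sketch, the Python below assumes an ordinary least-squares regression line, which always passes through the mean point, and uses the data from Example 2. The function name `least_squares_line` is made up for this illustration.

```python
def least_squares_line(points):
    """Return gradient m, intercept c and the mean point for (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    m = s_xy / s_xx
    c = y_bar - m * x_bar   # forces the line through (x_bar, y_bar)
    return m, c, (x_bar, y_bar)

points = [(2, 5), (4, 9), (6, 11), (8, 15), (10, 20)]
m, c, (x_bar, y_bar) = least_squares_line(points)

print(f"mean point: ({x_bar:.1f}, {y_bar:.1f})")   # mean point: (6.0, 12.0)
print(f"line: y = {m:.2f}x + {c:.2f}")             # line: y = 1.80x + 1.20
print(f"prediction at x = 7: {m * 7 + c:.2f}")     # prediction at x = 7: 13.80
```

Since x = 7 lies inside the data range of 2 to 10, this prediction is interpolation.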


Exercise 1 — Describe Correlation

For each PMCC value r, enter: 1=strong positive, 2=moderate positive, 3=weak positive, 4=no correlation, 5=weak negative, 6=moderate negative, 7=strong negative.

1. r = 0.92. Enter code (1–7).

2. r = −0.85. Enter code.

3. r = 0.05. Enter code.

4. r = 0.62. Enter code.

5. r = −0.55. Enter code.

6. r = 0.35. Enter code.

7. r = −0.97. Enter code.

8. r = −0.30. Enter code.

Exercise 2 — Compute Mean Point

Calculate the mean point (x̄, ȳ) for each dataset. Enter x̄ for the first question, ȳ for the second, and so on alternately.

1. Points: (1,3), (3,5), (5,7), (7,9). Find x̄.

2. Same points. Find ȳ.

3. Points: (2,10), (4,14), (6,18), (8,22), (10,26). Find x̄.

4. Same points. Find ȳ.

5. Points: (5,2), (10,4), (15,8), (20,10), (25,6). Find x̄.

6. Same points. Find ȳ.

7. Points: (0,12), (4,8), (8,4), (12,0). Find x̄.

8. Same points. Find ȳ.

Exercise 3 — Line Equation and Prediction

Find gradient, then predict y using the given equation.

1. Line through (2,6) and (10,22). Find the gradient.

2. Same line. Find the y-intercept c (y=mx+c).

3. y = 2x + 2. Predict y when x = 7.

4. Line through (0,5) and (8,21). Find gradient m.

5. y = 2x + 5, and the data range is x = 1 to x = 9. You predict y when x = 11. Is this interpolation (enter 1) or extrapolation (enter 2)?

6. Line through (4,10) and (12,34). Find gradient.

7. y = 3x + 1. Predict y when x = 5.

8. Line through (1,4) and (9,20). Find y-intercept c.

Exercise 4 — Comparing Distributions

Use the statistics given to answer each question numerically.

1. Group A: mean=55, median=52. Group B: mean=68, median=67. How much greater is Group B's mean? Enter the difference.

2. Group A: IQR=12. Group B: IQR=20. Which group is more spread? Enter 1 for A, 2 for B.

3. Dataset A: 3,5,7,9,11. Dataset B: 1,4,7,10,13. Both have median=7. Which has larger range? Enter 1=A, 2=B, 3=same.

4. Dataset: 10,12,14,16,100. Which is more representative, mean (enter 1) or median (enter 2)?

5. Group A: mean=70, sd=5. Group B: mean=70, sd=15. Group B is more spread (enter 1) or less spread (enter 2)?

6. Data: 4,6,8,10,12. Mean = ? Enter the mean.

7. Data: 4,6,8,10,12. IQR = Q3−Q1 = ? Enter IQR.

8. Removing outlier 100 from dataset {10,12,14,16,100}: does the mean increase (1) or decrease (2)?

Exercise 5 — Outliers & PMCC Interpretation

Use understanding of outlier effects and PMCC.

1. r = 0.94 with an outlier. Removing the outlier away from the trend makes r increase (1) or decrease (2)?

2. After removing the outlier in question 1, r = 0.98. The correlation became stronger (1) or weaker (2)?

3. Data range x: 5 to 25. Predicting at x=30 is interpolation (1) or extrapolation (2)?

4. Data range x: 5 to 25. Predicting at x=15 is interpolation (1) or extrapolation (2)?

5. r = 0.75 describes strong (1), moderate (2), or weak (3) correlation?

6. r = −0.92 — direction is positive (1) or negative (2)?

7. Outlier above the mean is removed. The mean decreases (1) or increases (2)?

8. Removing an outlier from a dataset generally: decreases the range (1) or increases the range (2)?

Practice — 25 Questions

Mixed data analysis questions. Enter numerical answers where asked; for codes use the key given in Exercise 1.

1. r = 0.88. Correlation code (1–7)?

2. r = −0.45. Correlation code?

3. Points: (1,4),(2,6),(3,8),(4,10). Find x̄.

4. Same points. Find ȳ.

5. Line through (3,9) and (9,21). Gradient m = ?

6. y = 2x + 3. Predict y at x=8.

7. y = 2x + 3, and the data range is x=2 to x=10. You predict y at x=20. Is this interpolation (1) or extrapolation (2)?

8. Group A mean=60, Group B mean=75. Difference (B−A)?

9. Group A IQR=8, Group B IQR=20. Which is more consistent (smaller spread)? Enter 1=A, 2=B.

10. Data: 2,4,6,8,10. Mean = ?

11. Same data. Median = ?

12. r = 0.12. Correlation code?

13. Line through (0,3) and (5,18). Gradient m = ?

14. y = 3x + 3. Predict y at x = 6.

15. Data: 5,7,9,11,13. x̄ = ?

16. Outlier above mean is removed. Mean decreases (1) or increases (2)?

17. After removing outlier, r increases from 0.70 to 0.91. Correlation got stronger (1) or weaker (2)?

18. r = −0.78. Correlation code?

19. Points: (2,14),(6,18),(10,22),(14,26). ȳ = ?

20. y = 4x − 2. Predict y at x = 5.

21. Data range x: 10 to 50. Predicting at x=35 is interpolation (1) or extrapolation (2)?

22. For skewed data with outliers, which average is more appropriate? Mean=1, Median=2.

23. Line through (2,5) and (8,17). y-intercept c = ?

24. r = 0.55. Correlation code?

25. Dataset: 3,5,7,9,200. Is mean (1) or median (2) more resistant to the outlier?

Challenge — 12 Questions

Harder multi-step data analysis problems.

1. Points: (1,2),(2,4),(3,6),(4,8),(5,10). Line of best fit y=mx+c. Find m.

2. Same data. Find c (y-intercept). Line passes through mean point.

3. Adding outlier (5,30) to data in Q1. Does the gradient of the line increase (1) or decrease (2)?

4. Points: (0,20),(5,15),(10,10),(15,5),(20,0). Find gradient m of line through mean point.

5. Same data. Find mean x̄.

6. Same data. Using y = mx + c with m from Q4, find c.

7. y = −x + 20. Predict y at x=12. Enter value.

8. r = −0.95 describes what type? Enter correlation code.

9. Data: 10,20,30,40,50. Mean=30. If 50 is replaced by 100, new mean = ?

10. Data A: mean=50, sd=4. Data B: mean=50, sd=12. Which has values closer to the mean? Enter 1=A, 2=B.

11. 8 data points: mean=15, total sum Σx = ?

12. Line: y = 1.5x + 4. At what x does y = 25? Enter x value.

Exam Style Questions — 5 Questions

Cambridge IGCSE style. Show working on paper where needed. Enter final answers.

Q1. The table shows data for 6 students: hours studied (x) and test score (y).
x: 1, 2, 3, 4, 5, 6    y: 30, 40, 45, 55, 65, 75
(a) Calculate the mean point (x̄, ȳ).
Enter x̄:

Q2. Using the same data: enter ȳ.

Q3. The line of best fit for Q1 data passes through (1, 28) and (6, 78).
Find the gradient of the line of best fit.

Q4. Using the line equation from Q3 (y=10x+18), predict the score for a student who studied 4.5 hours. Is this interpolation or extrapolation? Enter predicted score.

Q5. Class A: mean = 62, IQR = 10. Class B: mean = 58, IQR = 22.
Which class has more consistent (less spread) results? Enter 1 for Class A, 2 for Class B.