Scatter Diagrams | FractionRush

Welcome to Scatter Diagrams!

Scatter diagrams let us see whether two variables are related, for example whether revision time is linked to exam scores or whether a car's age is linked to its fuel efficiency. They are a core part of IGCSE Statistics and appear on most Extended papers.

Mean point: (x̄, ȳ) | Gradient: m = (y₂−y₁)/(x₂−x₁) | Line of best fit: y = mx + c

Learning Objectives

Plot scatter diagrams correctly, labelling axes with independent (x) and dependent (y) variables
Describe the type and strength of correlation (strong/weak positive/negative, no correlation)
Understand that correlation does not imply causation
Calculate the mean point (x̄, ȳ) and draw a line of best fit through it
Find the equation of the line of best fit in the form y = mx + c
Use the line to make predictions by interpolation
Explain why extrapolation can be unreliable

Plotting Points

Independent vs dependent, reading coordinates

Types of Correlation

Strong/weak, positive/negative, none

Mean Point

(x̄, ȳ) always lies on the line of best fit

Line of Best Fit

Equal points each side, drawn through mean

y = mx + c

Gradient from two points on the line

Interpolation

Predicting within data range — reliable

Learn 1 — Plotting Scatter Diagrams

What is a Scatter Diagram?

A scatter diagram (also called a scatter graph or scatter plot) displays pairs of numerical data as individual points. Each point represents one individual (person, object, trial) and shows the values of two measurements taken from that individual.

Example dataset: Hours of revision (x) and test score (y)
Hours: 1, 2, 3, 4, 5, 6, 7
Scores: 35, 42, 50, 58, 66, 74, 80
Each pair (1, 35), (2, 42), … is plotted as a point on the graph.
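
As a small illustration, the pairing of the two lists into plottable points can be sketched in Python. The names `hours`, `scores` and `points` are chosen here for illustration and are not part of the lesson.

```python
# Revision data from the example above.
hours = [1, 2, 3, 4, 5, 6, 7]
scores = [35, 42, 50, 58, 66, 74, 80]

# Each student contributes one (x, y) pair, so the lists must match in length.
assert len(hours) == len(scores)

# zip pairs the first hour with the first score, the second with the second, and so on.
points = list(zip(hours, scores))

for x, y in points:
    print(f"plot a cross at ({x}, {y})")
```

The first line printed is `plot a cross at (1, 35)`, matching the first pair in the example.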

Independent vs Dependent Variables

The independent variable is the one you control or choose, or the one that varies on its own (such as time or temperature). It goes on the x-axis (horizontal). The dependent variable is the one you measure and expect to respond to the independent variable. It goes on the y-axis (vertical).

Common exam examples:
• Hours of revision (x) → Test score (y): the score is expected to respond to the amount of revision
• Temperature (x) → Ice cream sales (y): sales are expected to respond to temperature
• Engine size (x) → Fuel consumption (y): fuel use is expected to depend on engine size
• Height (x) → Shoe size (y): neither is controlled, so here height is taken as the independent variable

Ask yourself: "Which variable would I set or choose?" That one goes on the x-axis. "Which variable responds?" That one goes on the y-axis.

How to Plot a Scatter Diagram

Follow these steps every time:

Step 1 — Draw x and y axes. Label each axis with the variable name and its unit (e.g. "Revision time (hours)").
Step 2 — Choose a suitable scale. Start scales from 0 unless the data makes this impractical. Make each axis cover all data values with a small margin.
Step 3 — Plot each (x, y) pair as a cross (×) or dot. Never join the points with a line (this is not a line graph).
Step 4 — Give the graph a title describing both variables.

Do NOT join the plotted points with a line. A scatter diagram deliberately shows individual, unconnected points so you can see the overall pattern.

Reading and Interpreting Scatter Plots

Once a scatter diagram and its line of best fit are drawn, you can read off predicted values. To find the expected y for a given x, go up from x to the line of best fit, then across to the y-axis. To find the expected x for a given y, go across from y to the line of best fit, then down to the x-axis.

Example: Reading a scatter plot
A scatter diagram shows revision hours (x) against test score (y). The line of best fit passes through (2, 44) and (6, 72).
Reading: When x = 4 hours, trace up to the line → y ≈ 58. So a student who revises 4 hours is predicted to score about 58.
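
The graph reading above can be checked by calculation. This is a minimal sketch, assuming the line passes exactly through the two stated points; the function name `read_from_line` is hypothetical.

```python
def read_from_line(p1, p2, x):
    """Return the y-value at x on the straight line through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    gradient = (y2 - y1) / (x2 - x1)
    # Start at p1 and move along the line to the requested x.
    return y1 + gradient * (x - x1)

predicted = read_from_line((2, 44), (6, 72), 4)
print(predicted)  # 58.0
```

In an exam the value is still read from the graph with construction lines; the calculation is only a cross-check.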

When asked to "use the line of best fit", draw or trace a construction line — show the examiner exactly how you read the graph. You may lose a mark if you just write the answer without showing the trace lines.

Learn 2 — Types of Correlation

What is Correlation?

Correlation describes whether and how strongly two variables are related. We use the pattern of the scatter diagram to decide.

The Six Types

Strong Positive

Weak Positive

No Correlation

Weak Negative

Strong Negative

Perfect Positive

Positive correlation: As x increases, y increases. Points run from bottom-left to top-right.
Negative correlation: As x increases, y decreases. Points run from top-left to bottom-right.
No correlation: Points are scattered randomly — no direction at all.
Strong: Points cluster tightly around an imaginary line.
Weak: Points show the general direction but are widely spread.
Perfect: All points lie exactly on a straight line (very rare in real data).

Describing Correlation in Context

In exams you must describe correlation in context, not just say "positive". Use the variable names.

Poor answer: "There is a positive correlation."
Good answer: "There is a strong positive correlation between revision hours and test score — students who revise more tend to score higher."

Poor answer: "There is a negative correlation."
Good answer: "There is a moderate negative correlation between car age and value — older cars tend to be worth less."

The exam mark scheme often awards 1 mark for the type (positive/negative/none) and 1 mark for the strength (strong/weak). Make sure you say both — and use the variable names for full marks.

Correlation Does NOT Mean Causation

Just because two variables are correlated does not mean one causes the other. There may be a third "lurking variable" that drives both, or the relationship could be coincidental.

Classic examples of spurious correlation:
• Ice cream sales and drowning rates are positively correlated — but ice cream does not cause drowning. Hot weather causes both.
• Countries with more TV sets per person have higher life expectancy — but TVs don't cause long life. Wealth drives both.
• Shoe size and reading ability correlate in primary school children — because both are driven by age.

Common exam trap: "The scatter diagram shows that eating breakfast causes better exam results." This is WRONG — it shows correlation only. We cannot conclude causation from a scatter diagram alone.

If asked "does the scatter diagram prove that X causes Y?", the answer is always NO. Scatter diagrams show association, not cause and effect.

Learn 3 — Line of Best Fit & Predictions

Drawing the Line of Best Fit

The line of best fit (also called the trend line or regression line) is a straight line that best summarises the trend in a scatter diagram.

Rules for drawing the line of best fit:
1. The line must pass through the mean point (x̄, ȳ). This is compulsory at IGCSE.
2. Roughly equal numbers of points should lie above and below the line (balance the points).
3. The line should follow the general direction of the data.
4. It does not need to pass through any actual data point (and often doesn't).
5. Use a ruler — a freehand curved line is wrong.

The Mean Point

Calculate x̄ (mean of all x-values) and ȳ (mean of all y-values). The mean point (x̄, ȳ) always lies on the line of best fit.

Mean point = (x̄, ȳ) where x̄ = Σx/n and ȳ = Σy/n

Example: Find the mean point
x values: 2, 4, 5, 7, 8, 10 → x̄ = (2+4+5+7+8+10)/6 = 36/6 = 6
y values: 15, 25, 30, 42, 48, 56 → ȳ = (15+25+30+42+48+56)/6 = 216/6 = 36
Mean point: (6, 36) — this point must lie on your line of best fit.
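
The same calculation can be sketched in Python as a check on the arithmetic. The function name `mean_point` is an illustrative choice, not something defined in the lesson.

```python
def mean_point(xs, ys):
    """Return (x-bar, y-bar) for paired data."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need the same non-zero number of x and y values")
    n = len(xs)
    return sum(xs) / n, sum(ys) / n

xs = [2, 4, 5, 7, 8, 10]
ys = [15, 25, 30, 42, 48, 56]
print(mean_point(xs, ys))  # (6.0, 36.0)
```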

Equation of the Line of Best Fit

The equation takes the form y = mx + c where m is the gradient and c is the y-intercept.

Gradient m = (y₂ − y₁) / (x₂ − x₁) using two points on the line

Example: Find the equation of a line of best fit
The line passes through (2, 20) and (8, 50).
Step 1 — Gradient: m = (50−20)/(8−2) = 30/6 = 5
Step 2 — Use y = mx + c with point (2, 20): 20 = 5×2 + c → c = 20 − 10 = 10
Step 3 — Equation: y = 5x + 10
Step 4 — Verify: at x = 8, y = 5×8+10 = 50 ✓
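
Steps 1 to 4 can be sketched as a short Python function. The name `line_through` is hypothetical, and exact fractions are used here (an implementation choice) so that gradients such as 10/3 are not rounded.

```python
from fractions import Fraction

def line_through(p1, p2):
    """Return (m, c) for the line y = mx + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    m = Fraction(y2 - y1, x2 - x1)  # Step 1: gradient
    c = y1 - m * x1                 # Step 2: substitute one point to find c
    return m, c

m, c = line_through((2, 20), (8, 50))
print(m, c)  # 5 10

# Step 4: verify with the other point.
assert m * 8 + c == 50
```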

Always choose two points that are FAR APART on the line (not close together) to reduce errors in calculating the gradient. Read coordinates carefully from the graph — ideally from grid intersections.

Interpolation vs Extrapolation

Interpolation means predicting a y-value for an x-value that lies within the range of the data. This is generally reliable.

Extrapolation means predicting a y-value for an x-value that lies outside the range of the data. This is unreliable — the pattern may not continue.

Example: Data collected for ages 15–25
Interpolation (reliable): Predicting for age 19 — this is within the data range.
Extrapolation (unreliable): Predicting for age 40 — this is well outside the range. The relationship may have changed.

Why extrapolation is unreliable: A scatter diagram only shows a pattern for the collected data range. Beyond that range, the relationship could become non-linear, could plateau, or could reverse entirely. For example, a child's height increases with age — but this doesn't mean a 50-year-old is still growing at the same rate.

If asked whether a prediction is reliable, check: (a) Is the x-value inside the data range? (b) Is the correlation strong? Both must be yes for the prediction to be reliable. If extrapolating, always say "unreliable because the x-value is outside the data range".
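
Check (a) can be sketched in Python. The function name `classify_prediction` and the list of ages are illustrative assumptions. The sketch covers only the range check; whether the correlation is strong (check (b)) still has to be judged from the diagram.

```python
def classify_prediction(x, data_xs):
    """Label a prediction at x as interpolation or extrapolation."""
    lowest, highest = min(data_xs), max(data_xs)
    if lowest <= x <= highest:
        return "interpolation"
    return "extrapolation"

ages = list(range(15, 26))  # hypothetical data: ages 15 to 25 inclusive
print(classify_prediction(19, ages))  # interpolation
print(classify_prediction(40, ages))  # extrapolation
```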

Making Predictions

Using the equation y = 5x + 10:
Predict y when x = 6: y = 5(6) + 10 = 40 (interpolation — reliable if data range includes x = 6)
Predict x when y = 35: 35 = 5x + 10 → 5x = 25 → x = 5
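
Both directions of prediction can be sketched in Python. The function names `predict_y` and `solve_x` are hypothetical.

```python
def predict_y(m, c, x):
    """Substitute x into y = mx + c."""
    return m * x + c

def solve_x(m, c, y):
    """Rearrange y = mx + c to x = (y - c) / m."""
    if m == 0:
        raise ValueError("a horizontal line cannot be solved for x")
    return (y - c) / m

print(predict_y(5, 10, 6))  # 40
print(solve_x(5, 10, 35))   # 5.0
```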

Example 1 — Plotting and describing correlation

Q: The table shows the number of hours of sunshine (x) and the number of sunburn cases at a clinic (y) on 6 days. Describe the correlation.

Hours sunshine (x): 2, 4, 5, 7, 8, 9
Cases (y): 5, 12, 16, 22, 27, 30

B1: Plot the points: (2,5), (4,12), (5,16), (7,22), (8,27), (9,30)

B1: As sunshine hours increase, clinic cases increase. Points cluster closely around an imaginary line from bottom-left to top-right.

A1: There is a strong positive correlation between sunshine hours and sunburn cases — days with more sunshine have more clinic cases.

Example 2 — Drawing the line of best fit

Q: Using the data from Example 1, find the mean point and draw the line of best fit.

M1: x̄ = (2+4+5+7+8+9)/6 = 35/6 ≈ 5.83

M1: ȳ = (5+12+16+22+27+30)/6 = 112/6 ≈ 18.67

A1: Mean point = (5.83, 18.67) — plot this point and draw a straight line through it with roughly equal points on each side.

Check: Count points above and below the drawn line — should be balanced (e.g. 3 above, 3 below).

Example 3 — Equation of line of best fit

Q: The line of best fit for the sunshine data passes through (2, 6) and (8, 26). Find its equation.

M1: Gradient m = (26−6)/(8−2) = 20/6 = 10/3 ≈ 3.33

M1: Using y = mx + c with (2, 6): 6 = (10/3)(2) + c → c = 6 − 20/3 = 18/3 − 20/3 = −2/3 ≈ −0.67

A1: Equation: y = (10/3)x − 2/3 or approximately y = 3.33x − 0.67
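
As a rough check on this line, the sketch below counts how many of the Example 1 data points lie above, below and on it, and how far the line is from the mean point. Exact fractions are used so that the comparison at x = 5 is not affected by rounding; the variable names are illustrative.

```python
from fractions import Fraction

points = [(2, 5), (4, 12), (5, 16), (7, 22), (8, 27), (9, 30)]
m, c = Fraction(10, 3), Fraction(-2, 3)

above = below = on_line = 0
for x, y in points:
    line_y = m * x + c
    if y > line_y:
        above += 1
    elif y < line_y:
        below += 1
    else:
        on_line += 1

print(above, below, on_line)  # 2 3 1

# Vertical gap between the line and the mean point (35/6, 112/6).
x_bar, y_bar = Fraction(35, 6), Fraction(112, 6)
gap = (m * x_bar + c) - y_bar
print(gap)  # 1/9
```

The points are roughly balanced, and the line passes about 0.11 above the mean point, which is within normal drawing accuracy for a line drawn by eye.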

Example 4 — Making a prediction (interpolation)

Q: Using the equation y = 3.33x − 0.67, estimate the number of cases when there are 6 hours of sunshine. Is this reliable?

M1: y = 3.33(6) − 0.67 = 19.98 − 0.67 = 19.31 ≈ 19 cases

B1: x = 6 lies within the data range (2 to 9), so this is interpolation — reliable.

Example 5 — Extrapolation warning

Q: Predict the number of cases on a day with 15 hours of sunshine. Comment on reliability.

M1: y = 3.33(15) − 0.67 = 49.95 − 0.67 = 49.28 ≈ 49 cases

B1: x = 15 lies outside the data range (max was 9). This is extrapolation.

A1: This estimate is unreliable — the linear trend may not continue beyond the range of the data. There may be a maximum number of cases, for example.

Example 6 — Correlation and causation

Q: A student says "the scatter diagram proves that sunshine causes sunburn cases." Is the student correct? Explain.

B1: The student is not fully correct.

A1: A scatter diagram shows correlation (association) between the two variables, but does not prove causation. There could be a third variable (e.g. the time of year) that is linked to both sunshine hours and sunburn cases. We need controlled experiments to establish cause and effect.

Common Mistakes in Scatter Diagrams

These are the errors that cost marks most often. Study each one carefully.

Mistake 1 — Drawing the line through the origin

✗ Wrong: "The line of best fit must start at (0, 0) because both variables must be 0 at the same time."

✓ Correct: The line of best fit goes through the mean point (x̄, ȳ), not through the origin. It can have any y-intercept.

There is no reason to force the line through the origin. The y-intercept (c in y = mx + c) is determined by the data, not assumed to be zero.

Mistake 2 — Extrapolating far beyond the data range

✗ Wrong: Data covers ages 10–15. Predicting for age 50 using the line of best fit and saying "this is a reliable estimate."

✓ Correct: Age 50 is far outside the data range. This is extrapolation and is unreliable — the trend may not continue.

Always check: is the x-value inside the range of the collected data? If not, say "unreliable — extrapolation beyond the data range."

Mistake 3 — Confusing correlation with causation

✗ Wrong: "Since the scatter diagram shows a positive correlation, more revision causes higher marks."

✓ Correct: The scatter diagram shows a correlation (association) only. We cannot conclude that revision causes higher marks from a scatter diagram alone — there may be other factors.

Correlation means two variables move together. Causation means one directly produces the other. These are different claims. Only controlled experiments can show causation.

Mistake 4 — Using two points that are too close to find gradient

✗ Wrong: Reading two points that are very close together (e.g. (5, 18) and (6, 21)) to find gradient = 3, then using this for the whole line.

✓ Correct: Use two points far apart on the line (e.g. (2, 10) and (10, 38)) to minimise reading errors: m = (38−10)/(10−2) = 28/8 = 3.5

Small errors in reading graph coordinates have a larger effect when the points are close together. Use widely spaced points for accuracy.
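
The effect can be illustrated with a short Python sketch. It assumes a true line y = 3.5x + 3 (the line through (2, 10) and (10, 38)) and an assumed reading error of 0.5 in one y-value; both the error size and the names used are hypothetical.

```python
def gradient(p1, p2):
    """Gradient of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

def true_y(x):
    """Assumed true line of best fit: y = 3.5x + 3."""
    return 3.5 * x + 3

misread = 0.5  # assumed reading error added to the second y-value

close = gradient((5, true_y(5)), (6, true_y(6) + misread))
far = gradient((2, true_y(2)), (10, true_y(10) + misread))

print(close)  # 4.0    (error of 0.5 in the gradient)
print(far)    # 3.5625 (error of 0.0625 in the gradient)
```

The same misreading changes the gradient by 0.5 when the points are 1 unit apart, but by only 0.0625 when they are 8 units apart.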

Mistake 5 — Not describing correlation in context

✗ Wrong: "There is a positive correlation." [just one mark, if any]

✓ Correct: "There is a strong positive correlation between temperature and ice cream sales — as temperature increases, ice cream sales tend to increase." [full marks]

IGCSE mark schemes expect: the type (positive/negative/none), the strength (strong/weak) AND the context (reference to the specific variables). Giving just "positive" often scores only 1 out of 2.

Mistake 6 — Drawing line of best fit that doesn't pass through the mean point

✗ Wrong: Drawing a line that looks balanced but misses the calculated mean point (x̄, ȳ).

✓ Correct: Always calculate and plot (x̄, ȳ) first, then draw the line through it. Balance points either side as a secondary check.

The mean point is a mathematical requirement, not optional. In IGCSE, the examiner's line is drawn through the mean point — your line must pass through (or very close to) it to earn the mark.
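
One way to see why the mean point matters: the least-squares regression line, which is a calculated line of best fit and not the by-eye method used in this lesson, always passes through (x̄, ȳ). The sketch below uses the standard least-squares formulas on the sunshine data from Example 1 and checks this. The function name `least_squares_line` is hypothetical.

```python
from fractions import Fraction

def least_squares_line(xs, ys):
    """Return (m, c) of the least-squares line y = mx + c."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x-values are equal, so no line can be fitted")
    m = Fraction(n * sum_xy - sum_x * sum_y, denom)
    c = Fraction(sum_y * sum_xx - sum_x * sum_xy, denom)
    return m, c

xs = [2, 4, 5, 7, 8, 9]
ys = [5, 12, 16, 22, 27, 30]
m, c = least_squares_line(xs, ys)
print(m, c)  # 68/19 -42/19

# The fitted line passes exactly through the mean point.
x_bar = Fraction(sum(xs), len(xs))
y_bar = Fraction(sum(ys), len(ys))
assert m * x_bar + c == y_bar
```

The calculated line, roughly y = 3.58x − 2.21, differs slightly from a line drawn by eye, but both should go through the mean point.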

Key Formulas — Scatter Diagrams

Formula / Concept	Details
Mean x̄	x̄ = (x₁ + x₂ + … + xₙ) / n = Σx / n
Mean ȳ	ȳ = (y₁ + y₂ + … + yₙ) / n = Σy / n
Mean point	(x̄, ȳ) — the line of best fit ALWAYS passes through this point
Gradient of line	m = (y₂ − y₁) / (x₂ − x₁) — use two well-separated points on the line
Equation of line	y = mx + c — substitute one point to find c
Find y from x	Substitute x into y = mx + c
Find x from y	Rearrange: x = (y − c) / m
Interpolation	Predicting within data range — generally reliable
Extrapolation	Predicting outside data range — unreliable, trend may not continue

Positive correlation: as x increases, y increases (gradient m > 0)

Negative correlation: as x increases, y decreases (gradient m < 0)

Correlation Strength — Quick Guide

Strong: Points cluster tightly near a line — easy to see the trend
Weak: Points show a general direction but are widely scattered
No correlation: Points show no direction at all — random scatter
Perfect: All points on a single straight line (correlation coefficient r = ±1, rare in real data)
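
The value r mentioned above is the product-moment correlation coefficient, which is not calculated by hand in this lesson. As a rough sketch of how it measures strength, the Python below applies its standard formula; the function name `pearson_r` and the choice of datasets are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Points lying exactly on a rising straight line: perfect positive correlation.
perfect = pearson_r([2, 4, 6, 8, 10], [14, 22, 30, 38, 46])
print(round(perfect, 3))  # 1.0

# Sunshine hours and sunburn cases: close to, but not exactly, 1.
strong = pearson_r([2, 4, 5, 7, 8, 9], [5, 12, 16, 22, 27, 30])
print(0.99 < strong < 1)  # True
```

Values near +1 or −1 correspond to strong correlation, and values near 0 to little or no linear correlation.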

Key Rules to Remember

1. Independent variable → x-axis (horizontal)
2. Dependent variable → y-axis (vertical)
3. Line of best fit must pass through the mean point (x̄, ȳ)
4. Correlation does NOT imply causation
5. Interpolation is generally reliable; extrapolation is not

Exercise 1 — Plotting Points & Reading Scatter Diagrams

1. On a scatter diagram, which axis should the independent variable (the one you control) be placed on? Enter 1 for x-axis, 2 for y-axis.

2. A researcher measures engine size (cm³) and CO₂ emissions (g/km). Which variable goes on the x-axis? Enter 1 for engine size, 2 for CO₂ emissions.

3. Eight students' revision hours: 1,2,3,4,5,6,7,8. Their scores: 30,38,46,54,62,70,78,86. Calculate x̄ (mean revision hours).

4. Using the same data as Q3, calculate ȳ (mean score).

5. A line of best fit passes through (1, 32) and (9, 88). What is the gradient (m)?

6. Using m = 7 and the point (4, 54), find the y-intercept c (from y = mx + c).

7. Using y = 7x + 26, predict y when x = 6.

8. Using y = 7x + 26, find x when y = 61.

Exercise 2 — Mean Point & Line of Best Fit

1. Data: x = 2, 4, 6, 8, 10. Find x̄.

2. Data: y = 14, 22, 30, 38, 46. Find ȳ.

3. Using (x̄, ȳ) from Q1–Q2, what is the mean point? Enter x̄ value.

4. A line of best fit passes through (2, 15) and (10, 47). Find the gradient.

5. Using m = 4 and point (2, 15), find c in y = mx + c.

6. Using y = 4x + 7, find y when x = 9.

7. x values: 3, 7, 9, 11, 15. y values: 8, 20, 26, 32, 44. Find x̄.

8. Using same data as Q7, find ȳ.

Exercise 3 — Types of Correlation

Enter: 1 = Strong Positive, 2 = Weak Positive, 3 = No Correlation, 4 = Weak Negative, 5 = Strong Negative

1. Hours studying vs exam marks — as study time increases, marks consistently increase with little spread.

2. Shoe size vs intelligence — points are randomly scattered with no pattern.

3. Age of car vs value — older cars tend to be worth less, but there is quite a spread.

4. Temperature vs heating bills — as temperature rises, heating bills drop significantly in a tight pattern.

5. Amount of rainfall vs amount of sunshine — more rain generally means less sun, but points vary widely.

6. Number of people in a car vs journey time on same route — no real pattern.

7. Height vs weight for adults — taller people tend to be heavier, fairly tight cluster.

8. Speed of a car vs fuel efficiency (mpg) — faster speeds strongly associated with much lower mpg, tight pattern.

Exercise 4 — Equation of the Line of Best Fit

1. A line passes through (0, 5) and (10, 45). Find the gradient.

2. Using gradient = 4 and point (0, 5), find c.

3. A line passes through (3, 20) and (9, 38). Find the gradient.

4. Using gradient = 3 and point (3, 20), find c.

5. Using y = 3x + 11, predict y when x = 7.

6. Using y = 3x + 11, find x when y = 29.

7. A line passes through (1, 50) and (11, 10). Find the gradient.

8. Using m = −4 and point (1, 50), find c. (y = −4x + c)

Exercise 5 — Interpolation, Extrapolation & Reliability

Enter: 1 = Interpolation (reliable), 2 = Extrapolation (unreliable)

1. Data collected for x in range 5–25. Predicting y when x = 14.

2. Data collected for x in range 5–25. Predicting y when x = 40.

3. Data collected for x in range 5–25. Predicting y when x = 2.

4. Using y = 2x + 3, predict y when x = 10 (data range: 1 to 20). Enter y value.

5. Using y = 2x + 3, predict y when x = 50 (data range: 1 to 20). Enter y value.

6. Data covers ages 10–16. Is predicting for age 13 reliable? Enter 1 for Yes, 2 for No.

7. Using y = −3x + 60 and data range x: 5–15, find y when x = 12.

8. The line of best fit has equation y = 5x + 2. Find x when y = 42.

Practice — 25 Mixed Questions

For correlation type: 1=Strong Pos, 2=Weak Pos, 3=No Corr, 4=Weak Neg, 5=Strong Neg | For reliability: 1=Reliable (interpolation), 2=Unreliable (extrapolation)

1. Find x̄ for the data: 4, 8, 10, 14, 19.

2. Find ȳ for: 12, 20, 26, 34, 48.

3. A line passes through (2, 10) and (8, 34). Find gradient m.

4. Using m = 4 and point (2, 10), find c.

5. Using y = 4x + 2, predict y when x = 7.

6. Using y = 4x + 2, find x when y = 26.

7. Correlation type: number of absences vs end-of-year grade (fewer absences = higher grade, tight pattern).

8. Correlation type: eye colour vs salary — no pattern.

9. Data range: x from 3 to 18. Is predicting for x = 10 reliable? (1=Yes, 2=No)

10. Data range: x from 3 to 18. Is predicting for x = 25 reliable? (1=Yes, 2=No)

11. x values: 1, 3, 5, 7, 9. Find x̄.

12. y values: 5, 11, 17, 23, 29. Find ȳ.

13. A line passes through (0, 3) and (5, 23). Find gradient m.

14. Using m = 4 and point (0, 3), find c.

15. Using y = 4x + 3, predict y when x = 6.

16. A line passes through (5, 40) and (15, 20). Find gradient.

17. Using m = −2 and point (5, 40), find c.

18. Using y = −2x + 50, predict y when x = 12.

19. Using y = −2x + 50, find x when y = 30.

20. Correlation type: hours of exercise per week vs resting heart rate (more exercise = lower heart rate, fairly scattered).

21. x: 6, 8, 10, 12, 14. y: 3, 7, 11, 15, 19. Find the gradient of the line of best fit using the two extreme points.

22. Using m = 2 and point (6, 3), find c.

23. Using y = 2x − 9, predict y when x = 11.

24. Data range x: 6–14. Is predicting y for x = 9 interpolation? (1=Yes, 2=No)

25. x: 2, 5, 8, 11, 14. y: 9, 18, 27, 36, 45. Find ȳ.

Challenge — 12 Questions (IGCSE Extended Level)

1. Six data points have x values: 3, 6, 9, 12, 15, 18 and y values: 11, 19, 27, 35, 43, 51. Find the gradient of the line of best fit using two extreme points.

2. Using the data from Q1, find x̄, the x-coordinate of the mean point.

3. Using the data from Q1, find ȳ.

4. Using gradient m = 8/3 ≈ 2.67 and the mean point (10.5, 31), find c to 2 decimal places.

5. Using y = 2.67x + 3.0 (approx), predict y when x = 7.5. Give your answer to 1 d.p.

6. Is predicting for x = 20 using the equation from Q4 reliable, given that the data in Q1 covers x from 3 to 18? (1=Yes / 2=No)

7. A line of best fit passes through (4, 58) and (16, 22). Find the gradient.

8. Using m = −3 and point (4, 58), find c.

9. Using y = −3x + 70, find x when y = 40.

10. A dataset gives Σx = 84 and n = 7. Find x̄.

11. The mean point of a dataset is (9, 30). The gradient of the line of best fit is 2.5. Find c.

12. Using y = 2.5x + 7.5, predict y when x = 14.

Exam Style Questions

Mark-scheme style. Show all working in your book. Enter final answers below for self-marking.

Question 1 — Reading and Describing Correlation [4 marks]

A scatter diagram shows the outside temperature (°C) on the x-axis and the number of hot drinks sold at a café on the y-axis. The data was collected over 8 days.

(a) Which variable is the independent variable? Enter 1 for temperature, 2 for drinks sold.

(b) The points show a trend from top-left to bottom-right, clustered fairly tightly. What type of correlation is this? Enter 1=Strong Pos, 2=Weak Pos, 3=None, 4=Weak Neg, 5=Strong Neg

(c) The temperatures are: 5, 8, 11, 14, 17, 20, 23, 26. Find the mean temperature x̄.

(d) The drinks sold are: 95, 84, 73, 62, 51, 40, 29, 18. Find ȳ.

Question 2 — Line of Best Fit Equation [5 marks]

The line of best fit for the café data passes through (5, 93) and (25, 21).

(a) Find the gradient m of the line of best fit.

(b) Using your gradient and the point (5, 93), find c (the y-intercept).

(c) Using your equation y = −3.6x + c, estimate the number of drinks sold when temperature = 16°C.

Question 3 — Reliability of Predictions [4 marks]

The equation of the line of best fit is y = −3.6x + 111. Data was collected for temperatures 5°C to 26°C.

(a) Predict the number of drinks sold when temperature = 10°C. Is this interpolation or extrapolation? Enter your prediction.

(b) Is predicting for 10°C reliable? Enter 1 for Yes, 2 for No.

(c) Predict the number of drinks when temperature = 35°C. Enter your prediction.

(d) Is predicting for 35°C reliable? Enter 1 for Yes, 2 for No.

Question 4 — Causation and Correlation [3 marks]

A student says: "The scatter diagram proves that hot weather causes people to buy fewer hot drinks."

(a) Is the student's claim correct? Enter 1 for Yes (correct), 2 for No (incorrect).

(b) A teacher suggests: "People are less likely to want hot drinks when it is warm." Does a scatter diagram prove this? Enter 1 for Yes, 2 for No.

Question 5 — Full Analysis [5 marks]

A scientist records the depth (cm) of a river (x) and the water speed (cm/s) (y) at 6 locations. Data: x = 10, 20, 30, 40, 50, 60 and y = 4, 10, 16, 22, 28, 34.

(a) Find x̄.

(b) Find ȳ.

(c) The line of best fit passes through (10, 4) and (60, 34). Find the gradient.

(d) Find c (the y-intercept of the line of best fit).

(e) Use your equation to predict water speed when depth = 45 cm.
