What does evaluating a model mean? Explained for kids

Why accuracy can lie, and what a confusion matrix really tells you about your model. For Class 7.

What's the most common mistake children make about this concept?

Believing that high accuracy always means a good model. In reality, accuracy can be badly misleading on imbalanced datasets unless you also look at per-class performance.

How to evaluate a machine learning model — accuracy isn't enough

Q: How does Dhee Learning teach this in a Class 7 session?

Dhee opens with a question — for example: "A model that detects whether a crop has a disease scores 95% accuracy. The farmer is thrilled. But only 3% of crops in the dataset actually have the disease. What is wrong with celebrating that 95%?" — listens to your child's answer, then probes the reasoning behind it. The session ends when the child can apply the idea to a brand-new situation, not just recall it.

What this concept actually says

Accuracy can be misleading: a model that predicts 'no disease' for everyone is 95% accurate in a population that is 95% healthy, even though it never finds a single sick case
A confusion matrix shows four outcomes: true positives, true negatives, false positives, and false negatives
Precision and recall measure different failure modes — which matters more depends on the cost of each error
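
To see the first point in numbers, here is a minimal sketch in Python. The population of 1,000 crops with 50 diseased ones is invented for illustration, and the variable names are our own, not part of any Dhee session.

```python
# Hypothetical data: 1000 crops, 50 diseased (1) and 950 healthy (0).
actual = [1] * 50 + [0] * 950

# A "lazy" model that predicts 'no disease' for every single crop.
predicted = [0] * len(actual)

# Accuracy: the share of crops where the prediction matches reality.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# Recall: of the crops that are really diseased, how many did the model catch?
caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = caught / sum(actual)

print(f"accuracy = {accuracy:.0%}")  # 95%
print(f"recall   = {recall:.0%}")    # 0%
```

The lazy model scores 95% accuracy and still misses every diseased crop, which is why accuracy alone cannot be trusted here.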

An analogy your child will recognise

Filtering rotten mangoes at a packing plant

If a machine that checks mangoes is set to reject anything suspicious, it might throw away good mangoes (false positive). If it is too lenient, rotten mangoes reach customers (false negative). The cost of each error is different — and your choice of threshold is an ethical decision about who bears the cost.

Umpire decisions in cricket

An umpire who gives every batsman out when unsure will be called unfair to batsmen (false positives). One who never gives anyone out will be called unfair to bowlers (false negatives). Good umpires minimise both error types — and different formats (Test vs. T20) might tolerate different error rates.

Common misconceptions to watch for

High accuracy always means a good model. In reality, accuracy can be badly misleading on imbalanced datasets unless you also look at per-class performance
You should always maximise both precision and recall — in reality, there is usually a trade-off, and which matters more is a domain and ethical question
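
The trade-off in the second misconception can be sketched with the mango machine from the analogy above. The scores and labels below are made up, and `precision_recall` is a hypothetical helper written for this example. It rejects a mango when its "rottenness" score reaches the threshold.

```python
# Hypothetical "rottenness" scores from a mango-checking machine,
# paired with the truth (1 = really rotten, 0 = really fresh).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.85, 0.95]
rotten = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]


def precision_recall(threshold):
    """Reject a mango when its score is at or above the threshold."""
    rejected = [s >= threshold for s in scores]
    tp = sum(1 for r, y in zip(rejected, rotten) if r and y == 1)
    fp = sum(1 for r, y in zip(rejected, rotten) if r and y == 0)
    fn = sum(1 for r, y in zip(rejected, rotten) if not r and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A strict machine (low threshold) versus a lenient one (high threshold).
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(t)
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 lifts precision from about 0.71 to 1.00 while recall falls from 1.00 to 0.60. Fewer good mangoes are thrown away, but more rotten ones slip through.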

Key facts in one breath

A confusion matrix is a 2x2 table (for binary classification) showing TP, FP, TN, and FN counts
Precision = TP / (TP + FP): of all positive predictions, how many were actually positive?
Recall = TP / (TP + FN): of all actual positives, how many did the model catch?
The F1 score is the harmonic mean of precision and recall, useful when you want a single balanced metric
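
The formulas above can be tried out in a few lines of Python. This is only a sketch: the four counts are invented for a hypothetical disease detector tested on 1,000 crops.

```python
# Hypothetical confusion-matrix counts for a disease detector.
tp, fp, tn, fn = 20, 5, 965, 10

precision = tp / (tp + fp)  # of all positive predictions, how many were right?
recall = tp / (tp + fn)     # of all actual positives, how many were caught?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(f"accuracy  = {accuracy:.3f}")   # 0.985
print(f"precision = {precision:.3f}")  # 0.800
print(f"recall    = {recall:.3f}")     # 0.667
print(f"F1        = {f1:.3f}")         # 0.727
```

Accuracy looks excellent at 98.5%, yet the model still misses one diseased crop in three. That gap is what precision, recall, and F1 make visible.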

How Dhee Learning teaches this — the 3-stage question loop

Every Dhee Learning session for this concept follows three stages. We share the questions Dhee actually asks, so you can hear what a session sounds like.

Stage 1 — Surface

A model that detects whether a crop has a disease scores 95% accuracy. The farmer is thrilled. But only 3% of crops in the dataset actually have the disease. What is wrong with celebrating that 95%?

Rote answer

"Because the data is imbalanced"

Understood

"A model that always predicts 'no disease' would get 97% accuracy without learning anything — the 95% model might actually be worse than doing nothing. Accuracy only means something when classes are roughly balanced."

Stage 2 — Reasoning

In a cancer screening model, which is more dangerous: a false positive (saying someone has cancer when they do not) or a false negative (saying someone is healthy when they have cancer)? Does the answer change for a spam filter?

Follow-up Dhee may use: So precision and recall measure these two different failure modes. Which one measures false negatives and which measures false positives?

Stage 3 — Application

Print the confusion matrix and classification report for the model you built last session. Find the false positive rate. Explain in plain language what it means for your specific problem — not in maths terms, in human terms.
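
For parents who want to follow along at home, here is a rough sketch of this task in plain Python. The ten labels are made up. In a real session the child would use the model from the previous lesson, and could use scikit-learn's `confusion_matrix` and `classification_report` from `sklearn.metrics` instead of counting by hand.

```python
from collections import Counter

# Hypothetical labels: 1 = diseased crop, 0 = healthy crop.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Count each (actual, predicted) pair to fill the four cells of the matrix.
counts = Counter(zip(actual, predicted))
tn, fp = counts[(0, 0)], counts[(0, 1)]
fn, tp = counts[(1, 0)], counts[(1, 1)]

print("            predicted 0  predicted 1")
print(f"actual 0    {tn:>11}  {fp:>11}")
print(f"actual 1    {fn:>11}  {tp:>11}")

# False positive rate: of the truly healthy crops, how many were wrongly flagged?
false_positive_rate = fp / (fp + tn)
print(f"false positive rate = {false_positive_rate:.0%}")  # 25%
```

In human terms, a 25% false positive rate means that one in four healthy crops would be treated as diseased, so the farmer would spend money treating plants that were never sick.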

Misconception Dhee watches for: Treating a high F1 score as automatically meaning the model is good for deployment, without asking what the consequences of its specific errors are

Related concepts

Using a pre-trained image model — transfer learning for Class 7

Unintended consequences of AI: the Cobra Effect (Class 7)