Class 7 · CBSE AI · Strand C — NLP, Vision, and LLMs Deep-Dive
How to evaluate LLM outputs — accuracy, safety and more
LLM answers must be judged on accuracy, relevance, coherence and safety — not just one score. For Class 7.
Judging a cooking competition
A cooking judge doesn't just ask 'is this food?' — they score taste, presentation, originality, technique, and whether the dish matched the brief. LLM evaluation works the same way: you need separate scores for different quality dimensions, because a technically accurate but unreadable response is as useless as a beautifully written wrong one.
Cricket umpiring vs. ball-tracking technology
An umpire makes subjective LBW calls based on experience and position. Ball-tracking technology (as used in DRS) measures the exact trajectory but can miss edge cases that an experienced umpire would catch. LLM evaluation has the same tension: automated metrics are like ball-tracking (precise on what they measure), while human evaluation is like the umpire (catches nuance but is slower and more expensive).
Every Dhee Learning session for this concept follows three stages. We share the questions Dhee actually asks, so you can hear what a session sounds like.
Stage 1 — Surface
You asked an AI to summarise a chapter from your history textbook. Here are two summaries. Summary A is beautifully written and uses all the right vocabulary. Summary B is plain but covers every key event in the correct order. Which is the better summary — and what question does this raise about how we evaluate AI outputs?
Rote answer
"We evaluate AI outputs by checking if they are accurate."
Understood
"It depends on what the summary is for. If it's for a student to revise quickly, B is better. If it's for a presentation, A might work better. This shows that 'good output' isn't one thing — evaluation criteria are task-specific, and any automatic scoring system has to choose what to measure."
Stage 2 — Reasoning
BLEU score measures how many word sequences in the AI output overlap with a human reference answer. Why would this metric give a misleadingly low score to a translation that captures the meaning perfectly but uses different synonyms throughout?
Follow-up Dhee may use: Can you think of a type of creative AI output where BLEU would be completely useless as an evaluation metric?
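The reasoning question above can be tried out with a small sketch. This is not the full BLEU formula (real BLEU combines several n-gram sizes and adds a brevity penalty); it is a simplified word-overlap score, and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def overlap_precision(candidate, reference, n=1):
    """Fraction of the candidate's n-grams that also appear in the reference.

    Simplified stand-in for BLEU: one n-gram size, no brevity penalty.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical example sentences (not from any real test set).
reference = "the big dog ran quickly to the house"
same_words = "the big dog ran quickly to the house"
synonyms = "a large hound sprinted rapidly toward the home"

print(overlap_precision(same_words, reference, n=1))  # identical wording
print(overlap_precision(synonyms, reference, n=1))    # same meaning, different words
print(overlap_precision(synonyms, reference, n=2))    # word pairs overlap even less
```

The synonym sentence means the same thing as the reference, yet only the word "the" is shared, so its score collapses. That is the gap between surface overlap and meaning that the question asks about.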
Stage 3 — Application
Design an evaluation rubric with four criteria for judging an LLM's output in this task: 'Explain the Water Cycle to a Class 5 student.' For each criterion, describe what a score of 1 (poor) and 5 (excellent) would look like.
Misconception Dhee watches for: Using a single overall score (1–10) without decomposing into criteria — this hides which specific aspect of quality failed and makes it impossible to improve the prompt systematically.
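A short sketch can show why a single overall score hides the problem. It assumes the four criteria named in this page's summary (accuracy, relevance, coherence, safety) and a 1 to 5 scale; the two sets of scores are invented for illustration, and a child's own rubric may well use different criteria.

```python
# Assumed criteria, taken from the page summary; a real rubric may differ.
CRITERIA = ("accuracy", "relevance", "coherence", "safety")


def check_scores(scores):
    """Make sure every criterion has a score between 1 and 5."""
    for name in CRITERIA:
        if name not in scores:
            raise ValueError(f"missing score for {name}")
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")


def overall(scores):
    """Collapse the rubric into one number (the approach the misconception describes)."""
    check_scores(scores)
    return sum(scores[name] for name in CRITERIA) / len(CRITERIA)


def weakest_criteria(scores):
    """Return the criterion (or criteria) with the lowest score."""
    check_scores(scores)
    lowest = min(scores[name] for name in CRITERIA)
    return [name for name in CRITERIA if scores[name] == lowest]


# Hypothetical scores for two explanations of the Water Cycle.
answer_a = {"accuracy": 5, "relevance": 5, "coherence": 1, "safety": 5}
answer_b = {"accuracy": 4, "relevance": 4, "coherence": 4, "safety": 4}

print(overall(answer_a), overall(answer_b))  # both averages are the same
print(weakest_criteria(answer_a))            # points at what to fix in the prompt
```

Both answers get the same overall score, but only the per-criterion view shows that the first one is correct yet confusing to read, which tells you what to change in the prompt.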
Dhee turns this concept into a 15-minute spoken session — asking, listening, and probing — so your child builds the idea themselves.
Common misconception: a high BLEU or ROUGE score means the output is good. In reality, these metrics measure surface overlap with a reference text, not truthfulness, reasoning quality, or usefulness.
Dhee opens with a question — for example: "You asked an AI to summarise a chapter from your history textbook. Here are two summaries. Summary A is beautifully written and uses all the right vocabulary. Summary B is plain but covers every key event in the correct order. Which is the better summary — and what question does this raise about how we evaluate AI outputs?" — listens to your child's answer, then probes the reasoning behind it. The session ends when the child can apply the idea to a brand-new situation, not just recall it.