Class 7 · CBSE AI · Strand B — Python for AI
Cleaning a messy dataset — the real work of data science
Data scientists are often said to spend 60–80% of their time cleaning data. This lesson explains why messy data is normal and how to fix it. For Class 7.
Sorting dal before cooking
Before cooking dal, you spread it on a plate and pick out stones, shrivelled grains, and husks. If you cook with the stones in, the dish is ruined and might break someone's tooth. Data cleaning is the same — bad data cooked into a model produces broken predictions.
Voter registration roll
A voter list with duplicate entries, misspelled names, and deceased people still listed can distort an election. Election authorities have to clean these rolls before voting begins. Data scientists do the same thing before training a model, and the stakes are similarly real.
Every Dhee Learning session for this concept follows three stages. We share the questions Dhee actually asks, so you can hear what a session sounds like.
Stage 1 — Surface
A dataset of patient health records has 15% of the 'income' column blank. What are three different things you could do about those blank cells — and what are the consequences of each choice?
Rote answer
"You can drop or fill missing values"
Understood
"Dropping removes those patients entirely, which biases the dataset if poor patients are more likely to have missing income data. Filling with the mean hides real variation. Each choice changes what the model learns about the world."
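The choices in this answer can be tried out on a few rows. The sketch below uses pandas and a made-up table (the patient_id values and incomes are invented for illustration). It compares dropping, filling with the mean, and a third option that is our own assumption rather than something named above: filling while keeping a flag column that records which values were blank.

```python
import pandas as pd

# Made-up patient records; None marks a blank 'income' cell.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "income": [20000.0, None, 50000.0, None, 80000.0],
})

# Option 1: drop every row with a blank income (those patients disappear).
dropped = patients.dropna(subset=["income"])

# Option 2: fill blanks with the mean (rows are kept, real variation is hidden).
mean_income = patients["income"].mean()
filled = patients.assign(income=patients["income"].fillna(mean_income))

# Option 3 (an assumption for illustration): fill, but keep a flag column
# so anyone using the data can still see which values were originally blank.
flagged = filled.assign(income_was_missing=patients["income"].isna())

print(len(patients), "rows before,", len(dropped), "rows after dropping")
print("Spread of income before filling:", patients["income"].std())
print("Spread of income after filling: ", filled["income"].std())
print(flagged)
```

The spread (standard deviation) shrinks after mean-filling, which is one concrete way to see that filling "hides real variation".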
Stage 2 — Reasoning
You find that the 'age' column contains values like '14', 14, 'fourteen', and 'N/A'. These are four different kinds of problem. Can you name each problem and say what cleaning step fixes it?
Follow-up Dhee may use: If you just drop all rows with any problem, what percentage of your data might you lose — and is that acceptable?
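One possible way to work through this question in plain Python is sketched below. The helper name clean_age, the word lookup, and the list of missing-value markers are all hypothetical choices made for this example; a real dataset would need a longer lookup and its own rules.

```python
# Tiny illustrative lookup, not a full number-word parser.
WORD_TO_NUMBER = {"fourteen": 14}
# Text markers we choose to treat as "no value recorded".
MISSING_MARKERS = {"", "n/a", "na", "none"}

def clean_age(raw):
    """Turn one raw 'age' value into an int, or None if it cannot be used."""
    if isinstance(raw, int):          # already a number: nothing to fix
        return raw
    text = str(raw).strip().lower()
    if text in MISSING_MARKERS:       # placeholder that means "missing"
        return None
    if text.isdigit():                # number stored as text, e.g. '14'
        return int(text)
    return WORD_TO_NUMBER.get(text)   # number spelled out; None if unknown

raw_ages = ["14", 14, "fourteen", "N/A"]
cleaned = [clean_age(value) for value in raw_ages]
print(cleaned)  # [14, 14, 14, None]

# The blunt alternative: keep only the values that were already integers.
kept = [value for value in raw_ages if isinstance(value, int)]
percent_lost = 100 * (len(raw_ages) - len(kept)) / len(raw_ages)
print(percent_lost)  # 75.0
```

In this toy list, dropping every imperfect value throws away three quarters of the data, while targeted cleaning recovers all but the genuinely missing entry.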
Stage 3 — Application
Take a deliberately messy CSV (I can give you one). Run df.info() to find the problems. Write three cleaning steps in code. After each step, run df.info() again. What changed, and did any cleaning step create a new problem?
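A minimal sketch of this exercise is shown below. The CSV is invented for the example and stands in for the one used in the session; the three cleaning steps are just one reasonable choice, not the only correct sequence.

```python
import io
import pandas as pd

# A tiny made-up messy CSV: one duplicate row, one spelled-out age,
# one 'N/A' age, and one blank city.
messy_csv = io.StringIO(
    "name,age,city\n"
    "Asha,14,Pune\n"
    "Asha,14,Pune\n"
    "Ravi,fourteen,Delhi\n"
    "Meena,N/A,\n"
)
df = pd.read_csv(messy_csv)
df.info()  # 'age' is listed as a text column, not a number

# Step 1: remove exact duplicate rows.
df = df.drop_duplicates()
df.info()

# Step 2: force 'age' to be numeric; anything unparseable becomes NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.info()  # 'age' is numeric now, but 'fourteen' has silently become a blank

# Step 3: drop rows that have no city.
df = df.dropna(subset=["city"])
df.info()
```

Step 2 is the one to watch: it fixes the column type but turns Ravi's 'fourteen' into a missing value, so the non-null count for 'age' goes down. That is an example of a cleaning step creating a new problem.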
Misconception Dhee watches for: Believing that cleaning is a mechanical step-by-step process with a single correct answer, rather than a series of judgement calls that must be justified
Dhee turns this concept into a 15-minute spoken session — asking, listening, and probing — so your child builds the idea themselves.
Misconception: cleaning data is a purely technical step with no ethical implications. In reality, how you handle missing data for certain groups can introduce or amplify bias.
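To make the bias point concrete, here is a small sketch with invented numbers. The group labels and incomes are assumptions for illustration only; the idea is that when blanks are more common in one group, dropping blank rows quietly shrinks that group's share of the data.

```python
import pandas as pd

# Made-up records: 'income' is blank more often for the rural group.
records = pd.DataFrame({
    "group": ["rural", "rural", "rural", "rural",
              "urban", "urban", "urban", "urban"],
    "income": [None, None, None, 12000.0,
               40000.0, 45000.0, None, 50000.0],
})

# Share of each group before any cleaning.
share_before = records["group"].value_counts(normalize=True)

# Drop rows with a blank income, then measure the shares again.
after_drop = records.dropna(subset=["income"])
share_after = after_drop["group"].value_counts(normalize=True)

print(share_before["rural"])  # 0.5
print(share_after["rural"])   # 0.25
```

The rural group starts as half of the table and ends up as a quarter of it, so a model trained on the cleaned data would see far fewer rural patients than actually exist.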
Dhee opens with a question — for example: "A dataset of patient health records has 15% of the 'income' column blank. What are three different things you could do about those blank cells — and what are the consequences of each choice?" — listens to your child's answer, then probes the reasoning behind it. The session ends when the child can apply the idea to a brand-new situation, not just recall it.