Class 7 · CBSE AI · Strand C — NLP, Vision, and LLMs Deep-Dive

What are tokens? How AI reads text — for Class 7

Before a model reads text, it splits the text into tokens. This page explains what tokens are and why modern LLMs use sub-word pieces, at a Class 7 level.

What this concept actually says

  • Text is split into tokens (words, sub-words, or characters) before a machine can process it
  • Tokenisation is not the same as splitting by spaces: punctuation marks usually become tokens of their own, and rare words are broken into sub-word pieces
  • Vocabulary size and tokenisation strategy affect how well a model handles a given text: words and scripts the vocabulary covers poorly get broken into many small pieces
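
For readers who like to see it run, here is a minimal Python sketch of why splitting by spaces is not enough. The sample sentence and the simple regular expression are our own assumptions for illustration; real tokenisers use a learned vocabulary, not a hand-written rule like this one.

```python
import re

sentence = "Don't panic, it's only tokenisation!"

# Naive approach: split on spaces only. Punctuation stays glued to the words.
by_spaces = sentence.split()

# Rule-based approach: runs of letters/digits become one piece,
# and every punctuation mark becomes a piece of its own.
pieces = re.findall(r"\w+|[^\w\s]", sentence)

print(len(by_spaces), by_spaces)
print(len(pieces), pieces)
```

Splitting by spaces leaves pieces such as 'panic,' and 'tokenisation!' with punctuation attached, so a machine would treat 'panic' and 'panic,' as two unrelated items. The second approach separates the punctuation, which is closer to what a real tokeniser does before it even starts on sub-words.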

An analogy your child will recognise

Indian cooking — making dough

Tokenising text is like breaking a big lump of atta into equal small balls before rolling chapatis. You can't cook the whole lump at once — it must be divided into manageable, consistent pieces first.

Train ticketing

A train booking system only understands city names it has on its list. If you type 'Bengaluru' but it only knows 'Bangalore', it may split your input into pieces it recognises — like 'Benga' and 'luru' — and give you a wrong result.

Common misconceptions to watch for

  • Tokens are always whole words — in reality sub-word and character-level tokens are common.
  • All languages tokenise equally efficiently — English-trained tokenisers are far less efficient on Indian languages, consuming more tokens for the same meaning.
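
One rough way to see the second point is to count UTF-8 bytes, because byte-level tokenisers start from bytes before merging them into larger pieces. The sketch below is only a proxy under that assumption: the sample words are ours, and the real token count depends on the merges a particular tokeniser has learned.

```python
def utf8_bytes(text):
    """Return how many bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


samples = {
    "English": "namaste",
    # The same greeting in Devanagari (नमस्ते), written with escapes for safety.
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
    # A single smiley emoji.
    "Emoji": "\U0001F600",
}

for label, text in samples.items():
    print(label, "characters:", len(text), "bytes:", utf8_bytes(text))
```

The English spelling needs one byte per letter, while each Devanagari character needs three and the emoji needs four. A tokeniser that has learned few merges for Devanagari therefore has more raw material to get through for the same greeting.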

Key facts in one breath

  • Most modern LLMs use sub-word tokenisation (e.g. Byte-Pair Encoding), not simple word splitting.
  • The word 'tokenisation' itself is typically split into 2–4 tokens by common tokenisers.
  • GPT-4's tokeniser has a vocabulary of roughly 100,000 tokens. For comparison, estimates of an adult's vocabulary vary widely but usually run to tens of thousands of words.
  • Emojis, numbers, and non-English scripts often consume more tokens than equivalent English text, making them more expensive to process.
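
The idea behind Byte-Pair Encoding can be sketched in a few lines of Python: start with single characters, then repeatedly glue together the pair of neighbouring symbols that occurs most often. This is a simplified toy under our own assumptions (a four-word corpus, no word frequencies, no end-of-word markers), not the training procedure of any particular model.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count neighbouring symbol pairs in all words; return the most common."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words


def learn_merges(corpus, num_merges):
    """Run a fixed number of merge steps, starting from single characters."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Hypothetical toy corpus chosen for illustration.
toy_corpus = ["chat", "chatting", "chats", "sitting"]
merges, words = learn_merges(toy_corpus, num_merges=3)
print(merges)
print(words)
```

Because the letters of 'chat' appear together in three of the four words, three merges are enough to turn 'chat' into a single piece, while the rarer ending of 'chatting' is still left as separate letters. Frequent chunks earn their own token; rare ones stay in smaller pieces.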

How Dhee Learning teaches this — the 3-stage question loop

Every Dhee Learning session for this concept follows three stages. We share the questions Dhee actually asks, so you can hear what a session sounds like.

Stage 1 — Surface

If I asked you to teach a baby robot to read the word 'unbelievable', what's the first problem you'd run into?

Rote answer

"A token is the smallest unit of text a machine reads."

Understood

"The robot doesn't know what 'unbelievable' means as a whole, so it might have to break it into 'un', 'believ', 'able' — pieces it has seen before."

Stage 2 — Reasoning

Why do you think a tokeniser might split 'chatting' into 'chat' and '##ting' instead of keeping it as one word?

Follow-up Dhee may use: Imagine you're making a dictionary with only 500 words. How would you handle a word you've never seen before?
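
The 'tiny dictionary' idea can be tried out in Python. The sketch below splits a word by always taking the longest piece it can find in its vocabulary, in the style of WordPiece tokenisers, where '##' marks a piece that continues a word. The six-entry vocabulary and the function name are hypothetical, chosen only to match the examples on this page.

```python
def split_into_pieces(word, vocab, unknown="[UNK]"):
    """Greedy longest-match split; pieces after the first carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining chunk first, then shrink it letter by letter.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: give up on the whole word.
            return [unknown]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical tiny vocabulary for illustration.
tiny_vocab = {"chat", "##ting", "##s", "un", "##believ", "##able"}

for word in ["chatting", "chats", "unbelievable", "zebra"]:
    print(word, "->", split_into_pieces(word, tiny_vocab))
```

With only six entries, the sketch can still handle 'chatting', 'chats', and 'unbelievable' by reusing pieces, while 'zebra' falls back to an unknown marker. That is the trade-off a child is reasoning about in this stage: a small list of reusable pieces covers far more words than a small list of whole words.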

Stage 3 — Application

An AI tutoring app is trained only on English text. A student types 'maths ka syllabus do'. What tokenisation problems might appear, and what could go wrong?

Misconception Dhee watches for: Assuming the AI simply 'reads words like we do' and the problem is just translation, not tokenisation.
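
A small Python sketch can make this scenario concrete. It assumes a toy English-only vocabulary of whole words and a fallback to single letters for anything unknown; both are our own simplifications, not how the app in the question (or any real model) is built.

```python
# Hypothetical English-only vocabulary of whole words.
english_vocab = {"maths", "syllabus", "do", "the", "give", "me"}


def tokenise(text, vocab):
    """Keep known words whole; break unknown words into single letters."""
    tokens = []
    for word in text.lower().split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(word)  # a string extends the list letter by letter
    return tokens


hinglish = tokenise("maths ka syllabus do", english_vocab)
english = tokenise("give me the maths syllabus", english_vocab)

print(hinglish)
print(english)
```

Two different problems show up in the Hinglish sentence. 'ka' is unknown, so it is broken into letters that carry no meaning on their own. 'do' is found in the vocabulary, but as the English verb, not the Hindi word for 'give', so the tokens look fine while the meaning is lost. Neither problem is about translation alone; both begin at the tokenisation step.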

Want your child to actually understand this?

Dhee turns this concept into a 15-minute spoken session — asking, listening, and probing — so your child builds the idea themselves.

Frequently asked questions

What are tokens, and how do machines read text?

Before a model reads text, it splits the text into tokens: words, sub-words, or characters. Most modern LLMs use sub-word pieces, so a rare word can still be built from parts the model has seen before.

What's the most common mistake children make about this concept?

Thinking that tokens are always whole words. In reality, sub-word and character-level tokens are common.

How does Dhee Learning teach this in a Class 7 session?

Dhee opens with a question — for example: "If I asked you to teach a baby robot to read the word 'unbelievable', what's the first problem you'd run into?" — listens to your child's answer, then probes the reasoning behind it. The session ends when the child can apply the idea to a brand-new situation, not just recall it.