What are tokens, and how do machines read text? Explained for kids.

Before a model reads text, it splits the text into tokens. This page explains what tokens are and why modern LLMs use sub-word pieces, at a Class 7 level.

What's the most common mistake children make about this concept?

Believing that tokens are always whole words. In reality, sub-word and character-level tokens are common.

How does Dhee Learning teach this in a Class 7 session?

Dhee opens with a question — for example: "If I asked you to teach a baby robot to read the word 'unbelievable', what's the first problem you'd run into?" — listens to your child's answer, then probes the reasoning behind it. The session ends when the child can apply the idea to a brand-new situation, not just recall it.

What are tokens? How AI reads text — for Class 7

What this concept actually says

Text is split into tokens (words, sub-words, or characters) before a machine can process it
Tokenisation is not the same as splitting by spaces: punctuation usually becomes its own token, and rare words are broken into sub-word tokens
The vocabulary size and tokenisation strategy affect how efficiently an AI represents text and how well it copes with rare words and other languages
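
To see the difference between these ideas, here is a minimal Python sketch comparing three ways of cutting up the same sentence. The function names and the splitting rule are illustrative assumptions, not how any particular AI model works; real tokenisers learn their pieces from large amounts of text.

```python
import re

def split_on_spaces(text):
    """Naive approach: cut the text wherever there is whitespace."""
    return text.split()

def split_words_and_punctuation(text):
    """Toy rule: a run of letters or digits is one token,
    and every punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_into_characters(text):
    """Character-level tokens: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]

sentence = "Don't panic, it's easy!"

print(split_on_spaces(sentence))
# ["Don't", 'panic,', "it's", 'easy!']

print(split_words_and_punctuation(sentence))
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'easy', '!']

print(split_into_characters("easy!"))
# ['e', 'a', 's', 'y', '!']
```

Notice that splitting on spaces glues the comma onto 'panic,', so a machine would treat 'panic' and 'panic,' as two unrelated items. That is one reason tokenisers do more than split on spaces.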

An analogy your child will recognise

Indian cooking — making dough

Tokenising text is like breaking a big lump of atta into equal small balls before rolling chapatis. You can't cook the whole lump at once — it must be divided into manageable, consistent pieces first.

Train ticketing

A train booking system only understands city names on its list. Type 'Bengaluru' when it only knows 'Bangalore' and it's stuck — unless, like an AI tokeniser, it could break the unknown word into familiar pieces and work from those.

Common misconceptions to watch for

Believing that tokens are always whole words. In reality, sub-word and character-level tokens are common.
Believing that all languages tokenise equally efficiently. Tokenisers trained mostly on English text are often far less efficient on Indian languages, using more tokens for the same meaning.
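
One rough way to see why scripts differ is to count bytes. Many modern tokenisers start from the UTF-8 bytes of the text and then merge frequent byte sequences into tokens. The sketch below only counts bytes, which is a proxy and not a real token count; the helper name `utf8_byte_count` and the sample words are assumptions chosen for illustration.

```python
def utf8_byte_count(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# The same idea ("water") written three ways.
samples = {
    "English": "water",
    "Hindi": "पानी",
    "Emoji": "💧",
}

for label, text in samples.items():
    print(label, text, "characters:", len(text), "bytes:", utf8_byte_count(text))

# English water characters: 5 bytes: 5
# Hindi पानी characters: 4 bytes: 12
# Emoji 💧 characters: 1 bytes: 4
```

The Hindi word has fewer characters than the English one but needs more than twice as many bytes. A tokeniser that has learned few merges for Devanagari has to spend more tokens to cover those bytes.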

Key facts in one breath

Most modern LLMs use sub-word tokenisation (e.g. Byte-Pair Encoding), not simple word splitting.
The word 'tokenisation' itself is typically split into 2–4 tokens by common tokenisers.
GPT-4's tokeniser has a vocabulary of roughly 100,000 tokens. For comparison, estimates of an adult's vocabulary vary widely but are typically in the tens of thousands of words.
Emojis, numbers, and non-English scripts often consume more tokens than equivalent English text, making them more expensive to process.
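
The Byte-Pair Encoding idea mentioned above can be sketched in a few lines: start with single characters, then repeatedly glue together the pair of neighbouring pieces that appears most often. This is a simplified toy version under stated assumptions (a made-up three-word corpus, characters instead of bytes, hypothetical function names), not the training code of any real model.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count neighbouring symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus_counts, num_merges):
    """Start from single characters and apply the most frequent merge num_merges times."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical tiny corpus: each word with how often it appears.
toy_corpus = {"chat": 5, "chatting": 3, "sitting": 2}
merges, words = learn_merges(toy_corpus, 3)
print(merges)
print(list(words))
```

With this toy corpus, three merges are enough for 'chat' to become a single piece, while the rarer endings stay as separate characters. Frequent strings earn their own token; rare ones are spelled out from smaller pieces.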

How Dhee Learning teaches this — the 3-stage question loop

Every Dhee Learning session for this concept follows three stages. We share the questions Dhee actually asks, so you can hear what a session sounds like.

Stage 1 — Surface

If I asked you to teach a baby robot to read the word 'unbelievable', what's the first problem you'd run into?

Rote answer

"A token is the smallest unit of text a machine reads."

Understood

"The robot doesn't know what 'unbelievable' means as a whole, so it might have to break it into 'un', 'believ', 'able' — pieces it has seen before."

Stage 2 — Reasoning

Why do you think a tokeniser might split 'chatting' into 'chat' and '##ting' instead of keeping it as one word?

Follow-up Dhee may use: Imagine you're making a dictionary with only 500 words. How would you handle a word you've never seen before?
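
For parents who want to try the idea themselves, here is a minimal sketch of a greedy "longest piece first" splitter in the style of WordPiece tokenisers, where '##' marks a piece that continues a word rather than starting one. The tiny vocabulary and the function name `wordpiece_style_split` are assumptions made up for this example; real vocabularies hold tens of thousands of pieces.

```python
def wordpiece_style_split(word, vocab, unknown="[UNK]"):
    """Greedily take the longest piece found in vocab, moving left to right.
    Pieces after the first are looked up with a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece fits here, so the whole word is marked unknown.
            return [unknown]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {"chat", "un", "##ting", "##believ", "##able"}

print(wordpiece_style_split("chatting", toy_vocab))
# ['chat', '##ting']
print(wordpiece_style_split("unbelievable", toy_vocab))
# ['un', '##believ', '##able']
print(wordpiece_style_split("syllabus", toy_vocab))
# ['[UNK]']
```

The last line shows what happens with a small dictionary and no fallback: a word with no familiar pieces cannot be read at all. Adding single letters to the vocabulary would let the splitter spell out any word, at the cost of using many more tokens.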

Stage 3 — Application

An AI tutoring app is trained only on English text. A student types 'maths ka syllabus do'. What tokenisation problems might appear, and what could go wrong?

Misconception Dhee watches for: Assuming the AI simply 'reads words like we do' and the problem is just translation, not tokenisation.
