Tokenization: How Text Becomes Numbers

Before an LLM can do anything with text, the text has to be broken into tokens — small units (whole words, word pieces, or individual characters) that each get mapped to a unique integer ID. Different tokenizers split the same sentence in different ways, and those differences matter: vocabulary size, handling of case and whitespace, and how unfamiliar words get broken apart all affect what the model actually “sees.”
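As a minimal sketch of the idea (not any real tokenizer), the toy mapping below gives each lowercase word an integer ID and falls back to a reserved ID for words it has never seen. The names `toy_vocab`, `UNK_ID`, and `encode` are hypothetical and exist only for this illustration.

```python
# Toy word-level tokenizer: every name here is illustrative, not a library API.
UNK_ID = 0  # reserved ID for words that are not in the vocabulary

toy_vocab = {"the": 1, "earth": 2, "is": 3, "spherical": 4}

def encode(text):
    """Lowercase, drop '.' and '?', split on whitespace, and map words to IDs."""
    words = text.lower().replace(".", "").replace("?", "").split()
    return [toy_vocab.get(word, UNK_ID) for word in words]

print(encode("The earth is spherical."))  # [1, 2, 3, 4]
print(encode("Is the earth flat?"))       # [3, 1, 2, 0]
```

The word "flat" is not in the toy vocabulary, so it collapses to the unknown ID. Subword tokenizers avoid this by splitting unfamiliar words into smaller known pieces.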

In this notebook we compare three tokenizers on the same short sentences: a classic Keras word tokenizer, the LLaMA tokenizer (as used by the Open LLaMA family), and the GPT-2 tokenizer (byte-level BPE, introduced with OpenAI's GPT-2). We finish with a quick look at legacy NLP tokenization (NLTK with stop-word removal and lemmatization) to contrast pre-LLM workflows, and then show how a single character is represented as bits in UTF-8.

from tensorflow.keras.preprocessing.text import Tokenizer
from transformers import LlamaTokenizer, AutoTokenizer

# 1. Define the input text
data = [
  "The earth is spherical.",
  "The earth is a planet."
]

# 2. Initialize Tokenizer (simulating the vocabulary build)
tokenizer = Tokenizer(num_words=15, lower=True, split=' ')
tokenizer.fit_on_texts(data)

# 3. Convert text to Sequence of Integers (Token IDs)
ID_sequences = tokenizer.texts_to_sequences(data)

# Output the dictionary (Vocabulary) and the IDs
print("ID dictionary:", tokenizer.word_index)
print("ID sequences:", ID_sequences)
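To make the vocabulary build less of a black box, here is a pure-Python sketch of a frequency-ranked word index over the same two sentences. It is an approximation written for this notebook, not the Keras implementation: `split_words` is a hypothetical helper that only strips periods, whereas the Keras tokenizer filters a wider set of punctuation. Like Keras, it numbers words from 1, most frequent first.

```python
from collections import Counter

data = [
    "The earth is spherical.",
    "The earth is a planet.",
]

def split_words(sentence):
    """Lowercase, drop periods, and split on single spaces."""
    return sentence.lower().replace(".", "").split(" ")

# Count every word, then rank by frequency; ties keep first-seen order.
counts = Counter(word for sentence in data for word in split_words(sentence))
word_index = {
    word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Replace each word with its rank to get the ID sequences.
id_sequences = [[word_index[w] for w in split_words(s)] for s in data]

print("ID dictionary:", word_index)
print("ID sequences:", id_sequences)
```

Words that appear in both sentences ("the", "earth", "is") get the smallest IDs, and the two sequences differ only where the sentences differ.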

# 1. Load a pre-trained tokenizer
tokenizer = LlamaTokenizer.from_pretrained('openlm-research/open_llama_3b_v2')

# 2. Define the input string
prompt = 'Is the earth spherical?'

# 3. Convert text to tokens (input_ids)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

print(input_ids)

# 1. Load a pre-trained tokenizer with a larger vocabulary (GPT-2, byte-level BPE)
tokenizer_larger = AutoTokenizer.from_pretrained('gpt2')

# 2. Define the input string
prompt = 'Is the earth spherical?'

# 3. Convert text to tokens (input_ids)
input_ids = tokenizer_larger(prompt, return_tensors="pt").input_ids

print(input_ids)

Observations:

• Case Sensitivity: “Hello” vs “hello” results in different tokens.
• Space Handling: “helloworld” (joined) vs “hello world” results in different token sequences. For instance, “hello world” might split into “hello” and " world" (with the leading space included in the token).
• Vocabulary Size: Tokenization acts as a form of compression that trades vocabulary size against sequence length. A larger vocabulary (e.g., 100k symbols) results in shorter sequences of integers.
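The vocabulary-size trade-off can be sketched with a deliberately simplified byte-pair-style merge loop on a single string. This is not the actual training procedure of GPT-2 or SentencePiece; the helper names `most_frequent_pair` and `merge_pair` and the choice of six merges are assumptions made for the demo. Each merge adds one symbol to the vocabulary and shortens the sequence.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair of symbols that occurs most often."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "the earth is spherical and the earth is a planet"
tokens = list(text)      # start from single characters
vocab = set(tokens)      # base vocabulary: the distinct characters
history = [(len(vocab), len(tokens))]

for _ in range(6):       # six merges, an arbitrary number for the demo
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
    vocab.add(pair[0] + pair[1])
    history.append((len(vocab), len(tokens)))

for vocab_size, seq_len in history:
    print(f"vocabulary size: {vocab_size:2d}  sequence length: {seq_len}")
```

The printed table starts at 13 symbols and 48 tokens (one per character); every later row has a larger vocabulary and a shorter sequence.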

For comparison, let us try legacy tokenization

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

# Download the necessary NLTK data if it is not already present (run once).
# Newer NLTK releases need 'punkt_tab' for word_tokenize; older ones use 'punkt'.
# 'punkt_tab' is a directory of tables, so we look for the directory itself.
for resource_path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/stopwords', 'stopwords'),
    ('corpora/wordnet', 'wordnet'),
]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)

# 1. Define the input text
text = 'Is the earth spherical?'

print(f"Original text: '{text}'")

# 2. Tokenization (splitting into words)
tokens = word_tokenize(text)
print(f"\nTokens: {tokens}")

# 3. Lowercasing
lower_tokens = [word.lower() for word in tokens]
print(f"Lowercase tokens: {lower_tokens}")

# 4. Remove punctuation
punctuation_free_tokens = [word for word in lower_tokens if word not in string.punctuation]
print(f"Punctuation-free tokens: {punctuation_free_tokens}")

# 5. Remove stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in punctuation_free_tokens if word not in stop_words]
print(f"Stop-word free tokens: {filtered_tokens}")

# 6. Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(f"Lemmatized tokens: {lemmatized_tokens}")

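Observations:

The legacy pipeline takes many more lines of code, and its output is not suitable for GenAI: lowercasing, punctuation removal, and stop-word removal all discard information that a generative model needs.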
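A small pure-Python sketch shows why that loss matters. It uses a hand-picked, hypothetical stop-word set (`TOY_STOP_WORDS`) rather than NLTK's much longer English list, and a hypothetical helper `legacy_clean`; the point is only that a question and a statement become indistinguishable after cleaning.

```python
import string

# Hypothetical, hand-picked stop words for illustration only.
TOY_STOP_WORDS = {"is", "the", "a"}

def legacy_clean(text):
    """Lowercase, strip punctuation characters, and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in TOY_STOP_WORDS]

question = legacy_clean("Is the earth spherical?")
statement = legacy_clean("The earth is spherical.")

print(question)               # ['earth', 'spherical']
print(statement)              # ['earth', 'spherical']
print(question == statement)  # True
```

Once both inputs reduce to the same two words, nothing downstream can tell whether the user asked something or asserted it.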

Before we conclude:

Just for context, how computers represent characters as bits:

# 1. Define the input text
character = "a"

# 2. Encode to UTF-8
byte_data = character.encode('utf-8')

# 3. Retrieve the integer value of the first byte (an ASCII character such as "a" fits in one byte)
integer_value = byte_data[0]

# 4. Convert to Binary string (The "1 0" representation)
bit_representation = format(integer_value, '08b')

print(f"Character: {character}")
print(f"Integer (0-255): {integer_value}")
print(f"UTF-8 Bits: {bit_representation}")
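The cell above reads only the first byte, which is enough for ASCII. As a short sketch of the general case, the hypothetical helper `utf8_bits` below prints every byte, since characters outside ASCII take two to four bytes in UTF-8.

```python
def utf8_bits(character):
    """Return each UTF-8 byte of a character as an 8-bit binary string."""
    return [format(byte, "08b") for byte in character.encode("utf-8")]

print(utf8_bits("a"))       # ['01100001']
print(utf8_bits("\u00e9"))  # e with acute accent: ['11000011', '10101001']
print(utf8_bits("\u20ac"))  # euro sign: ['11100010', '10000010', '10101100']
```

Byte-level tokenizers such as GPT-2's build their vocabulary on top of these bytes, which is why they can encode any string without an unknown token.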

Key takeaways

• Tokenization converts raw text into integer IDs, which is the required first step before any LLM can process language.
• Different tokenizers (Keras word-level, LLaMA SentencePiece, GPT-2 BPE) split the same sentence into different token sequences and have different vocabulary sizes.
• Case and whitespace matter: “Hello” vs “hello” and single vs double spaces yield different token IDs in modern subword tokenizers.
• Vocabulary size trades off against sequence length: a larger vocabulary produces shorter integer sequences but needs a larger embedding table, and therefore more memory.
• Legacy NLP pipelines (NLTK stop-word removal, lemmatization) strip information that generative models need and are unsuitable for LLM workflows.

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually, because the viewer may have truncated the link at a line break.)

Estimated run time: ~2 minutes on CPU

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/Tokenizer_simple_examples.ipynb