TF-IDF: Finding the Words That Actually Matter

Before deep learning and fancy embeddings, there was TF-IDF (Term Frequency-Inverse Document Frequency) -- and it is still remarkably useful today. Imagine you are searching through thousands of documents to find the most relevant ones. Common words like “the” and “is” appear everywhere and tell you nothing. TF-IDF solves this elegantly: it scores each word based on how important it is to a specific document relative to the entire collection. Words that are frequent in one document but rare across all documents get the highest scores. It remains a go-to baseline for text search, document classification, and keyword extraction.

from sklearn.feature_extraction.text import TfidfVectorizer

Step 1: Creating Our Document Collection

We start with three tiny “documents” -- just short phrases for clarity. In a real scenario, each document might be a product review, an email, or a news article. Notice that some words like “AI” and “cool” appear in multiple documents, while others like “am” appear in only one. TF-IDF will exploit exactly this pattern to determine which words are most distinctive for each document.

d0 = 'AI is cool'
d1 = 'I am cool'
d2 = 'AI'
string = [d0, d1, d2]

Step 2: Computing TF-IDF

Scikit-learn’s TfidfVectorizer does the work in two stages, which fit_transform runs in a single call. First, fit learns the vocabulary and computes the IDF (Inverse Document Frequency) values from the entire collection. Then, transform converts each document into a vector of TF-IDF scores. The result is a matrix where each row is a document, each column is a word, and each cell contains the TF-IDF score for that word in that document. The vectorizer lowercases text by default, which is why “AI” appears as “ai” in the output.

tfidf = TfidfVectorizer()
result = tfidf.fit_transform(string)
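To make the two stages concrete, the sketch below runs fit and transform separately and checks that the result matches the single fit_transform call. The variable names (docs, combined, separate) are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['AI is cool', 'I am cool', 'AI']

# One call: learn the vocabulary and IDF values, then score the documents.
combined = TfidfVectorizer().fit_transform(docs)

# The same work as two explicit stages.
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)                    # learns vocabulary and IDF values
separate = vectorizer.transform(docs)   # converts documents to TF-IDF vectors

print(combined.shape)  # (documents, vocabulary terms)
print(np.allclose(combined.toarray(), separate.toarray()))
```

Keeping the stages separate matters mostly when new documents arrive later: they can be passed to transform alone so that they are scored against the vocabulary and IDF values learned from the original collection.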

Step 3: Understanding IDF Values

The IDF (Inverse Document Frequency) score tells us how “rare” or distinctive a word is across the entire collection. Words that appear in many documents get lower IDF values, while words that appear in only one document get higher values. Here, “ai” and “cool” both appear in 2 out of 3 documents, so they get an IDF of about 1.29. Meanwhile, “am” and “is” each appear in only 1 document, earning a higher IDF of about 1.69. The rarer the word across documents, the more information it carries.

print('\nidf values:')
for term, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(term, ':', idf)

Step 4: The Full TF-IDF Matrix

Now we see the complete picture. The vocabulary index tells us which column corresponds to which word. The sparse matrix output shows only the non-zero entries (efficient for large datasets where most documents do not contain most words). The dense matrix view makes it easier to read: each row is a document, each column is a word, and the values are the final TF-IDF scores. Notice that the third document (d2, which contains only “AI”) gives “ai” a score of 1.0 -- it is the only word there, so after normalization it carries all the weight. The scores are L2-normalized per row, meaning each document vector has a length of 1, which makes the rows directly comparable with one another.

print('\nWord indexes:')
print(tfidf.vocabulary_)
print('\ntf-idf value:')
print(result)
print('\ntf-idf values in matrix form:')
print(result.toarray())

Bonus: Handling Single-Character Words

If you noticed that the word “I” was missing from our results -- good catch. By default, TfidfVectorizer’s token_pattern matches only tokens of two or more alphanumeric characters, so single-character words are dropped (punctuation is treated purely as a separator). Such short tokens are often noise, like stray letters, but sometimes they matter -- like the pronoun “I” in English. The code below overrides the default with a custom token_pattern that accepts tokens of one or more characters. This is a practical lesson: always check what your preprocessing tools are doing behind the scenes, because defaults can quietly throw away information you care about.

# The word 'I' is not listed as a term above because TfidfVectorizer's default
# token_pattern only matches tokens of two or more characters.
# The pattern below explicitly tells TfidfVectorizer to keep single-character words in its vocabulary.

tfidf_with_single_char = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
result_with_single_char = tfidf_with_single_char.fit_transform(string)

print('\nidf values (that also include single-character words):')
for term, idf in zip(tfidf_with_single_char.get_feature_names_out(), tfidf_with_single_char.idf_):
    print(term, ':', idf)

print('\nWord indexes:')
print(tfidf_with_single_char.vocabulary_)

print('\ntf-idf values in matrix form:')
print(result_with_single_char.toarray())
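The difference between the two patterns can also be seen without scikit-learn, by applying them directly with Python’s re module. This is only a sketch of the tokenization step: default_pattern is the documented default value of token_pattern, and the text is lowercased by hand to mirror the vectorizer’s default lowercase=True behavior.

```python
import re

default_pattern = r'(?u)\b\w\w+\b'  # scikit-learn's default: two or more word characters
custom_pattern = r'(?u)\b\w+\b'     # the override used above: one or more word characters

text = 'I am cool'.lower()

default_tokens = re.findall(default_pattern, text)
custom_tokens = re.findall(custom_pattern, text)

print(default_tokens)  # ['am', 'cool']
print(custom_tokens)   # ['i', 'am', 'cool']
```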

Key takeaways

TF-IDF scores each word by how frequent it is in one document relative to how rare it is across the whole collection, surfacing distinctive terms.
IDF values rise for words that appear in fewer documents, so “am” and “is” (in one doc each) score higher than “ai” and “cool” (in two).
Scikit-learn’s TfidfVectorizer produces an L2-normalized sparse matrix in two lines, making each document vector ready for similarity comparison.
The default token_pattern silently drops single-character words like “I” -- override it when those tokens carry meaning.
Classic baselines like TF-IDF remain practical for search, classification, and keyword extraction even in the era of deep embeddings.

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~1 minute on CPU

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/TFIDF_example.ipynb