Word Embeddings: Teaching Machines the Meaning of Words

How does a computer understand that “king” and “queen” are related, or that “Paris” is to “France” what “Tokyo” is to “Japan”? The answer lies in word embeddings -- a technique that converts words into numerical vectors (lists of numbers) that capture their meaning. Words that appear in similar contexts end up with similar vectors, which means we can actually measure how related two concepts are. In this notebook, we will train our own simple embeddings and then explore a powerful pre-trained model to see this in action.

!pip install gensim

Setting Up: Word2Vec with Gensim

We are using the gensim library, one of the most popular Python packages for working with word embeddings. Gensim gives us access to the Word2Vec algorithm (developed by Google in 2013) as well as a collection of pre-trained embedding models. Think of it as your toolkit for turning words into numbers that actually mean something.

from gensim.models import Word2Vec
import gensim.downloader as api

Step 1: Creating a Toy Corpus

Word2Vec learns by reading lots of text and figuring out which words tend to appear near each other. Below, we create a very small “corpus” (collection of sentences) for demonstration purposes. In the real world, you would feed Word2Vec millions of sentences -- think all of Wikipedia or years of news articles. With only five short sentences, our model will not learn much, but it is enough to see how the process works.

# 1. Prepare your data (a list of lists of words)
# This small corpus is used for demonstration; a real corpus would be much larger.
sentences = [
    ['AI', 'for', 'business', 'course', 'is', 'fun'],
    ['AI', 'is', 'cool'],
    ['I', 'am', 'cool'],
    ['AI', 'uses', 'computers'],
    ['I', 'like', 'to', 'cook'],
]
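If your text starts out as raw strings rather than a ready-made list of lists, it needs to be tokenized first. The snippet below is a minimal sketch of one way to do that with the standard library; the `tokenize` helper and its naive sentence-splitting rule are assumptions for illustration, not part of gensim, and a real pipeline would normally use a proper tokenizer.

```python
import re

def tokenize(text):
    """Split raw text into sentences, then each sentence into word tokens."""
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    corpus = []
    for sentence in raw_sentences:
        # Keep runs of letters, digits and apostrophes; drop punctuation.
        words = re.findall(r"[A-Za-z0-9']+", sentence)
        if words:
            corpus.append(words)
    return corpus

raw_text = "AI is cool. I am cool. AI uses computers."
toy_corpus = tokenize(raw_text)
print(toy_corpus)
```

The result has the same shape as `sentences` above: one inner list of word tokens per sentence.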

Step 2: Training the Word2Vec Model

Here we train Word2Vec on our tiny corpus. The key parameters control the learning process: vector_size=100 means each word will be represented as a list of 100 numbers, window=5 tells the model to look at words within 5 positions of each other, min_count=1 ensures we keep every word even if it only appears once, and workers=4 sets how many worker threads are used for training. The model slides a window across each sentence and learns to predict a word from its neighbors (gensim's default CBOW mode) -- that is how it discovers meaning.

# 2. Train the Word2Vec model
# vector_size: Dimensionality of the word vectors (e.g., 100)
# window: Maximum distance between the current and predicted word within a sentence
# min_count: Ignores all words with total frequency lower than this
# workers: Number of CPU cores to use for training
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
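To make the sliding window concrete, here is a small sketch that lists which words count as "nearby" for each position in a sentence. The `context_pairs` helper is hypothetical and written only for this illustration; gensim's real training loop adds refinements such as sampling, so treat this as a picture of the idea rather than the library's implementation.

```python
def context_pairs(sentence, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

example_pairs = context_pairs(['AI', 'is', 'cool'], window=1)
print(example_pairs)
```

With window=1, “is” is paired with both “AI” and “cool”, while “AI” and “cool” are never paired with each other. A larger window would link them as well.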

Step 3: Examining a Word Vector

Once trained, every word in our vocabulary has its own vector -- a list of 100 numbers (since we set vector_size=100). These numbers do not have an obvious human interpretation on their own, but their relative positions encode meaning. We are printing just the first 5 elements of the vector for “AI” to get a feel for what these look like. In a well-trained model, words with similar meanings would have vectors pointing in similar directions.

# 3. Access the word vectors
# The trained vectors are accessible via the 'wv' attribute
word_vector = model.wv['AI']
print(f"Vector for 'AI' (first 5 elements): {word_vector[:5]}")

Step 4: Measuring Similarity Between Words

This is where embeddings get exciting. Cosine similarity measures how closely two word vectors point in the same direction, on a scale from -1 (opposite directions) to +1 (same direction). If two words frequently appear in similar contexts, their vectors will be close and the cosine similarity will be high. With our tiny corpus, however, the similarity scores are not very meaningful: “AI” and “business” show roughly the same similarity as “AI” and “computers,” and the exact values can shift between runs. Word2Vec needs much more data to learn real-world relationships.

# 4. Perform similarity tasks
# Find words similar to a given word
similar_words = model.wv.most_similar('AI', topn=3)
print(f"Words similar to 'AI': {similar_words}")

# Calculate cosine similarity between two words
print(f"Similarity between 'AI' and 'business': {model.wv.similarity('AI', 'business')}")
print(f"Similarity between 'AI' and 'computers': {model.wv.similarity('AI', 'computers')}")
print(f"Similarity between 'AI' and 'cook': {model.wv.similarity('AI', 'cook')}")
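For reference, cosine similarity is just the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on a few made-up 2-D vectors (not taken from any model) to show the two ends of the scale and the middle; the `cosine_similarity` function here is our own illustration, not the gensim method of a similar name.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors chosen to hit the landmark values of the scale.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to +1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # close to 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Note that [1, 2] and [2, 4] score as practically identical even though one is twice as long: cosine similarity looks only at direction, not magnitude.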

Using Pre-Trained Embeddings: GloVe

Instead of training from scratch on limited data, we can use GloVe (Global Vectors for Word Representation), a model pre-trained on billions of words from Wikipedia and news articles. Pre-trained embeddings are like hiring an expert who has already read all of Wikipedia and years of newswire -- they come with rich, nuanced understanding of word relationships baked in. The download may take a moment since we are pulling down about 66 MB of learned word vectors.

# Load a pre-trained model (e.g., GloVe with 50 dimensions trained on Wikipedia)
# This download may take a few moments
wv2 = api.load('glove-wiki-gigaword-50')

Exploring GloVe Vectors

Now let us look at the vector for “computer” from the GloVe model. Notice that the numbers are much larger and more varied compared to our toy model. That is largely because the toy vectors started as small random values that five short sentences barely moved, whereas GloVe was trained on vastly more data and has had the chance to learn meaningful patterns. These 50-dimensional vectors pack a surprising amount of semantic information into a compact numerical representation.

# Access the vector for a word
vector = wv2['computer']
print(f"Vector for 'computer' (first 5 elements): {vector[:5]}")

Real-World Similarity Scores

Now we see cosine similarity with a properly trained model, and the results make much more intuitive sense. “Computer” and “business” score about 0.70 -- a strong relationship, since computers are widely used in business. “Cooking” and “computers” score only about 0.36 -- related in the sense that they are both common English words, but semantically quite distant. These scores are the foundation of semantic search, recommendation systems, and many other AI applications you encounter daily.

# Calculate cosine similarity between two words
print(f"Similarity between 'computer' and 'business': {wv2.similarity('computer', 'business')}")
print(f"Similarity between 'cooking' and 'computers': {wv2.similarity('cooking', 'computers')}")
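The analogy from the introduction (“king” is to “queen” as ...) rests on the same machinery: subtract one vector, add another, and look for the nearest word by cosine similarity. The sketch below shows the arithmetic on hand-made 2-D vectors. These toy values are invented for illustration and do not come from GloVe or any trained model, and the `analogy` helper is hypothetical.

```python
import math

# Hand-made 2-D toy vectors (hypothetical, NOT from any trained model):
# axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_vectors = {
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0,  1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, then pick the closest remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy('man', 'king', 'woman', toy_vectors)
print(answer)
```

On the loaded GloVe model, the equivalent query would be `wv2.most_similar(positive=['king', 'woman'], negative=['man'])`; real embeddings are far noisier than this toy, so the top result is not guaranteed to be the word you expect.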

A Gotcha: Vocabulary Limitations

The cell below will throw a KeyError -- and that is intentional. The GloVe model we loaded was trained on lowercase text, so it knows “ai” but not “AI” (uppercase). This is a common real-world issue: pre-trained models have a fixed vocabulary, and anything not in that vocabulary simply cannot be looked up. In practice, you would lowercase your input, check that a token is in the vocabulary before looking it up, or choose a model whose vocabulary covers your text. It is a good reminder that these tools, while powerful, have practical limitations you need to be aware of.

print(f"Similarity between 'AI' and 'cook': {wv2.similarity('AI', 'cook')}")
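One way to guard against this error is to normalize a token and test for membership before looking it up. The sketch below uses a plain Python set as a stand-in vocabulary so that it runs without the download; the `find_in_vocab` helper is hypothetical. With the real model you could pass `wv2` itself, since gensim's keyed vectors support the `in` operator.

```python
def find_in_vocab(token, vocab):
    """Return a form of `token` found in `vocab` (as-is or lowercased), else None."""
    for candidate in (token, token.lower()):
        if candidate in vocab:
            return candidate
    return None

# Stand-in vocabulary for illustration only (an assumption, not GloVe's real one).
demo_vocab = {'ai', 'cook', 'computer'}

found = find_in_vocab('AI', demo_vocab)          # falls back to the lowercase form
missing = find_in_vocab('Word2Vec', demo_vocab)  # not present in either form
print(found, missing)
```

A caller can then skip or report tokens for which the helper returns None instead of letting the lookup raise.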

Key takeaways

Embeddings turn words into dense numerical vectors where proximity in vector space reflects similarity of meaning.
Word2Vec learns these vectors by reading sentences and predicting which words appear near each other within a sliding context window.
Cosine similarity between vectors lets you quantify how semantically related two words are on a scale from -1 to +1.
Training data matters -- a toy corpus of five sentences produces noisy similarities, while GloVe trained on billions of tokens captures intuitive relationships.
Pre-trained models have fixed vocabularies, so out-of-vocabulary tokens (like uppercase “AI” in lowercase GloVe) raise errors and need preprocessing.

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~3 minutes on CPU

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/Embedding_example.ipynb