The Attention Mechanism: How AI Learns to Focus

The attention mechanism is one of the most important innovations behind modern AI language models like ChatGPT, Gemini, and Claude. Before attention, recurrent neural networks processed text like a conveyor belt -- one word at a time, left to right, hoping to remember everything along the way. Attention changed the game by letting the model look at all words simultaneously and decide which ones matter most for the task at hand. In this notebook, we build a simple sentiment classifier and use attention weights to see which words the model focuses on when making its prediction.

Building a Sentiment Classifier with Attention

The code below does quite a lot in one cell, so let us walk through the big picture. We start with a small labeled dataset of 10 movie reviews (positive = 1, negative = 0), repeated 10 times to give the model more training examples. The text goes through three stages: first, a TextVectorization layer converts each word to an integer ID and pads or truncates every review to 5 tokens; then, an Embedding layer turns those IDs into 16-dimensional vectors learned during training; and finally, an Attention layer lets the model weigh which words matter most. The attended vectors are averaged and passed to a sigmoid unit, so the output is a single number between 0 and 1 representing sentiment. We also capture the attention weights so we can visualize them later.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# 1. Dataset: 10 short reviews, repeated 10 times (100 samples in total)

### Try improving the data and see how the test prediction changes.
data = [
    ("the movie was great", 1), ("i loved the acting", 1), ("simply amazing", 1),
    ("a masterpiece of cinema", 1), ("really fun and exciting", 1),
    ("the plot was boring", 0), ("terrible acting script", 0), ("i hated it", 0),
    ("waste of time", 0), ("it was a bad experience", 0)
] * 10

texts = [item[0] for item in data]
labels = np.array([item[1] for item in data])

# 2. Text Preprocessing
max_tokens = 100
sequence_length = 5
vectorize_layer = layers.TextVectorization(max_tokens=max_tokens, output_sequence_length=sequence_length)
vectorize_layer.adapt(texts)
X_train = vectorize_layer(texts)

# 3. Build Model
inputs = layers.Input(shape=(sequence_length,), name="input_layer")
embedding = layers.Embedding(max_tokens, 16, name="embedding_layer")(inputs)

# Built-in Keras Attention
# We set return_attention_scores=True to get the weights for visualization
attention_output, weights = layers.Attention(name="attention_layer")(
    [embedding, embedding], return_attention_scores=True
)

# Pool the results and classify
flat = layers.GlobalAveragePooling1D()(attention_output)
outputs = layers.Dense(1, activation='sigmoid', name="sentiment_output")(flat)

model = models.Model(inputs=inputs, outputs=[outputs, weights])

# 4. Use a list for loss. 'None' for the attention weights output.
model.compile(
    optimizer='adam',
    loss=['binary_crossentropy', None],
    metrics=['accuracy', None]
)

# 5. Train
print("Training model...")
### Try changing the number of epochs and see how the attention on the test sentence changes.
model.fit(X_train, labels, epochs=100, verbose=0)
print("Training complete.\n")
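To make the attention step above more concrete, here is a minimal NumPy sketch of what unscaled dot-product self-attention computes when the query, key, and value are all the same embedding matrix, as in the `[embedding, embedding]` call. It assumes the layer's default dot-product scoring without scaling; the embedding values are random placeholders, and the function names are illustrative rather than part of Keras.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def dot_product_self_attention(embeddings):
    """Unscaled dot-product self-attention with query = key = value."""
    scores = embeddings @ embeddings.T    # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)    # each row (one query) sums to 1
    output = weights @ embeddings         # weighted mix of the value vectors
    return output, weights

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 16))  # 5 tokens, 16 dims (made-up values)
attended, attn = dot_product_self_attention(toy_embeddings)

print(attended.shape, attn.shape)  # (5, 16) (5, 5)
```

Each row of `attn` is one token's view of the whole sentence, which is why the notebook later averages over rows to get a single importance value per word.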

Interpreting the Attention Weights

Now for the payoff -- we feed a test sentence (“the movie was boring”) into our trained model and look at what the attention mechanism learned. The model correctly leans toward a negative sentiment. More interestingly, the attention weights suggest why: the word “boring” gets the highest weight (about 0.32), meaning the model focused most heavily on that word when making its decision. Words like “the” and “was” get much lower weights because they do not carry sentiment information. Training is not seeded, so the exact numbers will vary from run to run. This interpretability is one of attention’s superpowers -- unlike a black-box model, we can peek inside and see which words the model weighted most.

# 6. Predict and Interpret
test_sentence = ["the movie was boring"]
X_test = vectorize_layer(test_sentence)
prediction, attention_weights = model.predict(X_test)

# Map numbers back to words
vocab = vectorize_layer.get_vocabulary()
words = [vocab[idx] for idx in X_test[0].numpy() if idx != 0]

print(f"Sentence: '{test_sentence[0]}'")
print(f"Sentiment: {'Positive' if prediction[0] > 0.5 else 'Negative'} ({prediction[0][0]:.4f})")
print("-" * 40)

# Average weights across the query dimension to see overall word importance
avg_weights = np.mean(attention_weights[0], axis=0)

for word, weight in zip(words, avg_weights):
    bar = "█" * int(weight * 40)
    print(f"{word:<10} | {weight:.4f} {bar}")
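The averaging step in the cell above can be illustrated on a tiny hand-written matrix. This is only a sketch with hypothetical numbers: rows stand for query tokens, columns for key tokens, and averaging over axis 0 gives the mean attention each token receives.

```python
import numpy as np

# Hypothetical 3-token attention matrix: rows are queries, columns are keys.
toy_weights = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
])

# Each row is a softmax distribution, so it sums to 1.
row_sums = toy_weights.sum(axis=1)

# Averaging over axis 0 (the query axis) gives the mean attention each key receives.
received = toy_weights.mean(axis=0)

print(row_sums)
print(received)
print(int(np.argmax(received)))  # index of the most attended token
```

Because every row sums to 1, the averaged weights also sum to 1. A uniform spread over n positions would give 1/n each, which offers a rough baseline for reading a value such as 0.32.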

Does More Data Help? Training with a Richer Dataset

Our first model used only 10 unique reviews (duplicated). Now we scale up to 100 unique reviews -- 50 positive and 50 negative -- with much more varied language. This is closer to a real-world scenario where the model encounters diverse phrasing. After training on this richer dataset, notice how the sentiment prediction becomes more confident (further from 0.5) and the attention weights shift. The word “boring” still dominates, but the model now has a more nuanced understanding because it has seen many more examples of positive and negative language.

## Now try it with slightly better data:

# 50 Positive Samples
pos_reviews = [
    ("the movie was great", 1), ("i loved the acting", 1), ("simply amazing", 1),
    ("a masterpiece of cinema", 1), ("really fun and exciting", 1), ("incredible story line", 1),
    ("absolutely wonderful experience", 1), ("the best film ever", 1), ("highly recommended movie", 1),
    ("superb performance by all", 1), ("i enjoyed every minute", 1), ("brilliant directing", 1),
    ("a true classic film", 1), ("spectacular visuals", 1), ("i was impressed", 1),
    ("pure joy to watch", 1), ("it was very touching", 1), ("a beautiful story", 1),
    ("top notch acting", 1), ("very entertaining film", 1), ("the cinematography was stunning", 1),
    ("captivating from start to finish", 1), ("refreshing and original", 1), ("greatest movie of year", 1),
    ("it made me happy", 1), ("an emotional journey", 1), ("wonderful cast", 1),
    ("powerful and moving", 1), ("delightful cinema", 1), ("i really liked it", 1),
    ("fantastic plot", 1), ("breathtaking scenes", 1), ("perfectly executed", 1),
    ("charming movie", 1), ("excellent work", 1), ("it was so good", 1),
    ("lovely film", 1), ("bold and brave story", 1), ("sweet and funny", 1),
    ("a must watch", 1), ("i felt inspired", 1), ("outstanding production", 1),
    ("very well made", 1), ("smart and witty", 1), ("uplifting ending", 1),
    ("cool characters", 1), ("impressive quality", 1), ("honest and raw", 1),
    ("magical feeling", 1), ("i loved it", 1)
]

# 50 Negative Samples
neg_reviews = [
    ("the plot was boring", 0), ("terrible acting script", 0), ("i hated it", 0),
    ("waste of time", 0), ("it was a bad experience", 0), ("very dull and slow", 0),
    ("worst movie ever", 0), ("extremely disappointed", 0), ("poorly directed", 0),
    ("the script was weak", 0), ("did not like it", 0), ("annoying characters", 0),
    ("boring story", 0), ("pathetic attempt at drama", 0), ("it was so noisy", 0),
    ("waste of money", 0), ("terrible directing", 0), ("ugly visuals", 0),
    ("i fell asleep", 0), ("the ending was bad", 0), ("not worth watching", 0),
    ("failed to impress", 0), ("silly plot", 0), ("boring and predictable", 0),
    ("i was very frustrated", 0), ("bad script writing", 0), ("horrible acting", 0),
    ("nothing makes sense", 0), ("clunky dialogue", 0), ("very mediocre", 0),
    ("disaster of a film", 0), ("low quality production", 0), ("uninspired film", 0),
    ("cheesy and cheap", 0), ("too long and boring", 0), ("i regret watching", 0),
    ("terrible movie", 0), ("complete failure", 0), ("badly paced", 0),
    ("offensive and loud", 0), ("pointless story", 0), ("awful experience", 0),
    ("it was painful", 0), ("messy plot", 0), ("boring and dry", 0),
    ("zero stars", 0), ("worst film of year", 0), ("really bad", 0),
    ("uninteresting and plain", 0), ("it was terrible", 0)
]

data = pos_reviews + neg_reviews
np.random.shuffle(data) # Good practice to shuffle!

texts = [item[0] for item in data]
labels = np.array([item[1] for item in data])

# 2. Text Preprocessing
max_tokens = 100
sequence_length = 5
vectorize_layer = layers.TextVectorization(max_tokens=max_tokens, output_sequence_length=sequence_length)
vectorize_layer.adapt(texts)
X_train = vectorize_layer(texts)

# 3. Build Model
inputs = layers.Input(shape=(sequence_length,), name="input_layer")
embedding = layers.Embedding(max_tokens, 16, name="embedding_layer")(inputs)

# Built-in Keras Attention
# We set return_attention_scores=True to get the weights for visualization
attention_output, weights = layers.Attention(name="attention_layer")(
    [embedding, embedding], return_attention_scores=True
)

# Pool the results and classify
flat = layers.GlobalAveragePooling1D()(attention_output)
outputs = layers.Dense(1, activation='sigmoid', name="sentiment_output")(flat)

model = models.Model(inputs=inputs, outputs=[outputs, weights])

# 4. Use a list for loss. 'None' for the attention weights output.
model.compile(
    optimizer='adam',
    loss=['binary_crossentropy', None],
    metrics=['accuracy', None]
)

# 5. Train
print("Training model...")
### Try changing the number of epochs and see how the attention on the test sentence changes.
model.fit(X_train, labels, epochs=100, verbose=0)
print("Training complete.\n")

# 6. Predict and Interpret
test_sentence = ["the movie was boring"]
X_test = vectorize_layer(test_sentence)
prediction, attention_weights = model.predict(X_test)

# Map numbers back to words
vocab = vectorize_layer.get_vocabulary()
words = [vocab[idx] for idx in X_test[0].numpy() if idx != 0]

print(f"Sentence: '{test_sentence[0]}'")
print(f"Sentiment: {'Positive' if prediction[0] > 0.5 else 'Negative'} ({prediction[0][0]:.4f})")
print("-" * 40)

# Average weights across the query dimension to see overall word importance
avg_weights = np.mean(attention_weights[0], axis=0)

for word, weight in zip(words, avg_weights):
    bar = "█" * int(weight * 40)
    print(f"{word:<10} | {weight:.4f} {bar}")
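With more varied reviews, it can be worth checking how many distinct words fit under the `max_tokens` cap used above. The helper below is a hypothetical sketch, not part of Keras: it assumes that two slots are reserved for padding and out-of-vocabulary tokens (which matches TextVectorization's default integer mode as far as we understand it), and it uses a plain lowercase-and-split tokenizer as a simplification.

```python
from collections import Counter

def vocab_coverage(texts, max_tokens):
    """Estimate how many distinct words fit under a max_tokens cap.

    Assumption: two slots are reserved (padding and out-of-vocabulary).
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    kept = min(len(counts), max(max_tokens - 2, 0))
    return {
        "unique_words": len(counts),
        "kept": kept,
        "dropped": len(counts) - kept,
    }

sample_texts = ["the movie was great", "the plot was boring", "i loved it"]
roomy = vocab_coverage(sample_texts, max_tokens=100)
tight = vocab_coverage(sample_texts, max_tokens=6)

print(roomy)  # {'unique_words': 9, 'kept': 9, 'dropped': 0}
print(tight)  # {'unique_words': 9, 'kept': 4, 'dropped': 5}
```

Passing the notebook's own `texts` list to this function would show whether any words are being mapped to the out-of-vocabulary token, in which case raising `max_tokens` is one option to try.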

What Happens Without Attention?

To truly appreciate what attention does, let us remove it entirely. The cell below builds the same model architecture but skips the attention layer -- the embeddings go straight to pooling and classification. Without attention, the model treats every word as equally important. It can still learn sentiment to some degree (the prediction is still somewhat negative), but it loses the ability to focus on the words that matter most. Notice that we can no longer visualize per-word importance because there are no attention weights to inspect. This is why attention was such a breakthrough: it gives models both better performance and interpretability.

### Let us try without the attention layer:

# 50 Positive Samples
pos_reviews = [
    ("the movie was great", 1), ("i loved the acting", 1), ("simply amazing", 1),
    ("a masterpiece of cinema", 1), ("really fun and exciting", 1), ("incredible story line", 1),
    ("absolutely wonderful experience", 1), ("the best film ever", 1), ("highly recommended movie", 1),
    ("superb performance by all", 1), ("i enjoyed every minute", 1), ("brilliant directing", 1),
    ("a true classic film", 1), ("spectacular visuals", 1), ("i was impressed", 1),
    ("pure joy to watch", 1), ("it was very touching", 1), ("a beautiful story", 1),
    ("top notch acting", 1), ("very entertaining film", 1), ("the cinematography was stunning", 1),
    ("captivating from start to finish", 1), ("refreshing and original", 1), ("greatest movie of year", 1),
    ("it made me happy", 1), ("an emotional journey", 1), ("wonderful cast", 1),
    ("powerful and moving", 1), ("delightful cinema", 1), ("i really liked it", 1),
    ("fantastic plot", 1), ("breathtaking scenes", 1), ("perfectly executed", 1),
    ("charming movie", 1), ("excellent work", 1), ("it was so good", 1),
    ("lovely film", 1), ("bold and brave story", 1), ("sweet and funny", 1),
    ("a must watch", 1), ("i felt inspired", 1), ("outstanding production", 1),
    ("very well made", 1), ("smart and witty", 1), ("uplifting ending", 1),
    ("cool characters", 1), ("impressive quality", 1), ("honest and raw", 1),
    ("magical feeling", 1), ("i loved it", 1)
]

# 50 Negative Samples
neg_reviews = [
    ("the plot was boring", 0), ("terrible acting script", 0), ("i hated it", 0),
    ("waste of time", 0), ("it was a bad experience", 0), ("very dull and slow", 0),
    ("worst movie ever", 0), ("extremely disappointed", 0), ("poorly directed", 0),
    ("the script was weak", 0), ("did not like it", 0), ("annoying characters", 0),
    ("boring story", 0), ("pathetic attempt at drama", 0), ("it was so noisy", 0),
    ("waste of money", 0), ("terrible directing", 0), ("ugly visuals", 0),
    ("i fell asleep", 0), ("the ending was bad", 0), ("not worth watching", 0),
    ("failed to impress", 0), ("silly plot", 0), ("boring and predictable", 0),
    ("i was very frustrated", 0), ("bad script writing", 0), ("horrible acting", 0),
    ("nothing makes sense", 0), ("clunky dialogue", 0), ("very mediocre", 0),
    ("disaster of a film", 0), ("low quality production", 0), ("uninspired film", 0),
    ("cheesy and cheap", 0), ("too long and boring", 0), ("i regret watching", 0),
    ("terrible movie", 0), ("complete failure", 0), ("badly paced", 0),
    ("offensive and loud", 0), ("pointless story", 0), ("awful experience", 0),
    ("it was painful", 0), ("messy plot", 0), ("boring and dry", 0),
    ("zero stars", 0), ("worst film of year", 0), ("really bad", 0),
    ("uninteresting and plain", 0), ("it was terrible", 0)
]

data = pos_reviews + neg_reviews
np.random.shuffle(data) # Good practice to shuffle!

texts = [item[0] for item in data]
labels = np.array([item[1] for item in data])

# 2. Text Preprocessing
max_tokens = 100
sequence_length = 5
vectorize_layer = layers.TextVectorization(max_tokens=max_tokens, output_sequence_length=sequence_length)
vectorize_layer.adapt(texts)
X_train = vectorize_layer(texts)

# 3. Build Model
inputs = layers.Input(shape=(sequence_length,), name="input_layer")
embedding = layers.Embedding(max_tokens, 16, name="embedding_layer")(inputs)

# Built-in Keras Attention

### COMMENTED
# attention_output, weights = layers.Attention(name="attention_layer")(
#     [embedding, embedding], return_attention_scores=True
# )

# Pool the results and classify

## REPLACED this line to use embedding directly
# flat = layers.GlobalAveragePooling1D()(attention_output)
flat = layers.GlobalAveragePooling1D()(embedding)

outputs = layers.Dense(1, activation='sigmoid', name="sentiment_output")(flat)

## Weights not required
# model = models.Model(inputs=inputs, outputs=[outputs, weights])
# Build the new model before compiling and training it (see steps 4 and 5 below).
model = models.Model(inputs, outputs)

### model.compile is simpler without attention layer
# 4. Use a list for loss. 'None' for the attention weights output.
#model.compile(
#    optimizer='adam',
#    loss=['binary_crossentropy', None],
#    metrics=['accuracy', None]
#)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


# 5. Train
print("Training model...")
### Try changing the number of epochs and see how the test prediction changes.
model.fit(X_train, labels, epochs=100, verbose=0)
print("Training complete.\n")

# 6. Predict and Interpret
test_sentence = ["the movie was boring"]
X_test = vectorize_layer(test_sentence)

## Since we do not have attention_weights, this line also changes
## prediction, attention_weights = model.predict(X_test)
prediction = model.predict(X_test)

# Map numbers back to words
vocab = vectorize_layer.get_vocabulary()
words = [vocab[idx] for idx in X_test[0].numpy() if idx != 0]

print(f"Sentence: '{test_sentence[0]}'")
print(f"Sentiment: {'Positive' if prediction[0] > 0.5 else 'Negative'} ({prediction[0][0]:.4f})")
print("-" * 40)

### ATTENTION WEIGHTS NOT AVAILABLE FOR THIS MODEL
# Average weights across the query dimension to see overall word importance
#avg_weights = np.mean(attention_weights[0], axis=0)

#for word, weight in zip(words, avg_weights):
#    bar = "█" * int(weight * 40)
#    print(f"{word:<10} | {weight:.4f} {bar}")
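The contrast between the two models can be sketched in NumPy. Plain average pooling is a weighted sum with the same weight on every token, while attention followed by average pooling is a weighted sum whose weights are the column means of the attention matrix (the same quantity the earlier cells call `avg_weights`). The embeddings below are random placeholders and the attention is the unscaled dot-product form, assumed here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

rng = np.random.default_rng(1)
token_vectors = rng.normal(size=(5, 16))  # made-up embeddings for 5 tokens

# Without attention: average pooling puts weight 1/5 on every token.
uniform_pooled = token_vectors.mean(axis=0)
uniform_weights = np.full(5, 1 / 5)

# With attention followed by average pooling.
attn = softmax(token_vectors @ token_vectors.T, axis=-1)
attention_pooled = (attn @ token_vectors).mean(axis=0)
effective_weights = attn.mean(axis=0)  # per-token weight after pooling

print(np.allclose(uniform_pooled, uniform_weights @ token_vectors))        # True
print(np.allclose(attention_pooled, effective_weights @ token_vectors))    # True
```

Both pooled vectors are weighted sums of the same token vectors. The only difference is whether the weights are fixed at 1/n or computed from the data.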

Key takeaways

Attention lets a model look at every word in a sequence at once and learn which ones matter most for the task.
Attention weights are inspectable, so you can see exactly which words (like “boring”) drove the sentiment prediction.
Dataset size and diversity strongly affect confidence -- expanding from 10 repeated reviews to 100 varied ones sharpens both predictions and attention patterns.
Removing the attention layer collapses per-word importance into uniform pooling, sacrificing both interpretability and the ability to focus on signal-carrying tokens.
Keras provides a drop-in layers.Attention that integrates with embeddings and returns weights via return_attention_scores=True.

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~5 minutes on T4 GPU

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/Attention_simple_example.ipynb