Chief Idiot · 2 min read

ML 11: The Attention Seeker (Transformers)

Attention is all you need. Literally.

The Revolution

In 2017, researchers at Google published a paper called "Attention Is All You Need". It changed everything. It knocked RNNs off center stage for language tasks. Its architecture gave birth to BERT, GPT, and the AI boom we live in.

The Problem with RNNs

RNNs read sequentially. Word 1, then Word 2, then Word 3. Each step needs the result of the step before it, so the work is slow and hard to parallelize.
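To see why that is a bottleneck, here is a minimal sketch of a recurrent loop. It is a toy with a single number as the hidden state, and the names `rnn_step`, `run_rnn`, `w_h`, and `w_x` are made up for illustration, not taken from any library.

```python
import math

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One toy recurrent step: the new state depends on the previous state."""
    return math.tanh(w_h * hidden + w_x * x)

def run_rnn(inputs):
    """Process the inputs strictly in order, one at a time."""
    hidden = 0.0
    states = []
    for x in inputs:
        # Step t cannot start until step t-1 has produced `hidden`.
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states

# Hypothetical inputs standing in for Word 1, Word 2, Word 3.
states = run_rnn([0.1, 0.2, 0.3])
```

Because every pass through the loop waits on the previous one, throwing more hardware at it does not help much.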

The Rockstar

The Transformer

Transformers read the entire sentence at once. But how do they know which words relate to which?

The answer is Self-Attention. Imagine the sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to?

  • The Street? No.
  • The Animal? Yes.

The Transformer calculates an "Attention Score" between every word and every other word. "It" pays high attention to "Animal". "Tired" pays attention to "Animal".
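The scoring idea can be sketched in a few lines of plain Python. This is a simplified take on scaled dot-product attention, with two loud assumptions: the 2-number word vectors are hand-picked (not learned) so that "it" lines up with "animal", and the same vectors are used as queries, keys, and values, whereas a real Transformer first passes them through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Score every word against every other word, then mix the values."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One attention score per (this word, other word) pair.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Each output is a weighted average of all the value vectors.
        out = [sum(w_j * v[i] for w_j, v in zip(w, values))
               for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

tokens = ["animal", "street", "it"]
# Hypothetical vectors, chosen by hand for this illustration.
vectors = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.2]]

outputs, weights = self_attention(vectors, vectors, vectors)
it_weights = dict(zip(tokens, weights[2]))  # what "it" pays attention to
```

With these toy vectors, `it_weights` gives "animal" the largest share and "street" the smallest. Note that no row waits on any other row, which is what makes the whole thing easy to parallelize.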

[Interactive demo: "Self-Attention: Who Looks at Whom?" In the demo's example sentence, "it" puts 90% of its attention on "cat", so the model knows "it" refers to the cat. This is how Transformers understand that "it" = "cat" without reading left-to-right.]

Embeddings (Word Math)

Words are turned into lists of numbers (Vectors), and you can do math on them: King - Man + Woman ≈ Queen. The model learns that "Paris" to "France" is the same connection as "Tokyo" to "Japan".
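Here is the word math as a toy sketch. The embeddings below are invented by hand, with one axis loosely meaning "royalty" and the other "gender". Real embeddings are learned and have hundreds of dimensions. The helper names `cosine` and `analogy` are illustrative only.

```python
import math

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
embeddings = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

result = analogy("king", "man", "woman")
```

With these hand-made vectors, `result` is "queen". With learned embeddings the match is only approximate, which is why the nearest neighbor is used rather than an exact equality.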

GPT (Generative Pre-trained Transformer)

GPT is just a giant Transformer stack trained to predict the Next Token. It was trained on a huge chunk of the internet. It knows probability. It doesn't "know" facts the way a database does. It knows that after "The capital of France is", the most likely next token is "Paris".
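A rough sketch of that last step: the model produces a raw score for every candidate token, and a softmax turns the scores into probabilities. The scores below are made up for illustration. A real model computes them from the prompt and does so over a vocabulary of tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for what follows "The capital of France is".
logits = {"Paris": 9.0, "Lyon": 5.0, "London": 4.0, "banana": -2.0}

words = list(logits)
probs = dict(zip(words, softmax([logits[w] for w in words])))

# Greedy decoding: pick the single most likely next token.
next_token = max(probs, key=probs.get)
```

Here `next_token` comes out as "Paris" simply because it has the highest score. In practice, models often sample from `probs` instead of always taking the top token, which is why the same prompt can give different answers.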

Summary

Transformers are the current state of the art for language models. They are massively parallel processing machines that focus on relationships between data points.

Next up: Teaching a robot to walk by giving it cookies.
