Igor Alcantara

Attention and the foundation of modern AI


When I ask people what kind of knowledge they associate me with, I always get different responses. Those who do not know me well enough might say Qlik, or Data Science, Machine Learning, or maybe AI in general. The ones who know me better could mention science or podcasts, maybe even Star Wars. However, the ones who really know me would definitely say: The Beatles. Yes, Mark Costa, I will find a way to connect the Beatles to this.


The Beatles revolutionized music in the 1960s. If you like British rock and you have access to hundreds of great bands, thank Paul, John, George, and Ringo. If you like distorted guitars, echo on vocals, double albums (the White Album was one of the first of its kind), Eastern and Western music all mixed together, it is all because of these lads from Liverpool. Some authors even say that they, with the help of James Dean, helped consolidate the concept of adolescence as we know it today. Even with all their fame and prestige, I can still say they are underrated. They deserve even more. The Beatles are to pop culture and rock and roll what Mozart and Bach were to classical music.


Why am I talking about the Beatles in an article about Artificial Intelligence? Well, in 2017, a research paper did the same thing for AI. Just like Sgt. Pepper's Lonely Hearts Club Band changed the game forever, "Attention Is All You Need" came along and flipped the AI world upside down faster than you can say "mute the speaker" after Yoko Ono starts yelling.


As a data scientist who's watched this field evolve like rock 'n' roll through the decades, I figured it was time to break down this greatest hit of computer science. Just like how Kurt Cobain brought grunge to the mainstream in the '90s, this paper brought a whole new style of AI to the masses. And trust me, it's been topping the charts ever since.


You can find the original paper here. I also recorded a video where I briefly break down this research. The video is available at the end of this article.


The Opening Act: Why This Paper Matters


Imagine it's 2017. The iPhone X just dropped, "Despacito" is everywhere (sorry for the terrible choice of example), and a team of researchers from Google Brain and the University of Toronto is about to drop the hottest paper in AI since the perceptron or deep neural networks. Their creation, the Transformer model, wasn't just another one-hit wonder – it was more like Queen's "Bohemian Rhapsody," a complete game-changer that defied all conventions.


This paper wasn't just playing power chords; it was introducing a whole new instrument to the AI orchestra: attention. Not the kind of attention David Bowie commanded on stage, but a clever mechanism that helps machines understand language better than ever before.


The Pre-Transformer Era: When AI Was Still Playing in Garages


Before we had Transformers, AI was kind of like a garage band trying to make it big with limited equipment. We had two main acts: Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Let's break down why these models were more Nickelback than Nirvana. More Greta Van Fleet than Led Zeppelin.


Recurrent Neural Networks (RNNs): The One-Man Band with Memory Loss


Picture RNNs as that one-man band guy trying to play all the instruments at once while walking down the street. Sure, it works, but it's not exactly optimal. RNNs process information like a game of telephone – passing the message along one word at a time. And just like how you might forget the beginning of "The Lord of the Rings" by the time the hobbits leave the Shire, RNNs would start forgetting important context from the beginning of long sentences.


Let's say you're trying to process the sentence: "The guitar that was collecting dust in the attic since Woodstock finally got tuned." When an RNN gets to "got tuned," it might have forgotten all about that guitar, thinking maybe we're tuning a radio or worse – a banjo (no offense, banjo players).


Why are RNNs called that? Think of "recurrent" like a boomerang – it keeps coming back! The word comes from the Latin "recurrere," meaning "to run back." RNNs got this name because information literally runs back through the network in a loop, like a DJ playing the same vinyl record over and over, but each time adding a new layer to the mix.


Imagine you're listening to Pink Floyd's "The Wall" – each brick in the wall builds upon the previous ones. That's exactly how RNNs work: each piece of information gets processed and then fed back into the system, creating a kind of memory loop.
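
If you want to see that memory loop in code, here is a minimal, purely illustrative sketch in Python: toy sizes, random untrained weights, and made-up word vectors, just to show how each word gets mixed with the hidden state and fed back in for the next step.

```python
import numpy as np

# A minimal sketch of the recurrent "memory loop".
# Everything here is hypothetical: random weights, toy sizes, no training.
rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden (the loop itself)
W_xh = rng.normal(size=(hidden_size, embed_size))   # input  -> hidden
b_h = np.zeros(hidden_size)

# Pretend each word is already an embedding vector.
sentence = [rng.normal(size=embed_size) for _ in range(6)]

h = np.zeros(hidden_size)  # the "memory" starts empty
for x_t in sentence:
    # Each step mixes the new word with whatever the network remembers so far,
    # then feeds the result back in on the next step -- the boomerang.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h.shape)  # (8,) -- one hidden state summarizing the whole sentence
```

Notice that the sentence has to be walked word by word: there is no way to jump ahead, which is exactly why long sentences fade from memory.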


Convolutional Neural Networks (CNNs): The Speed Metal of AI


CNNs were like speed metal guitarists – incredibly fast but not great at subtle nuances. Originally designed for image processing, CNNs applied to language were like trying to play classical music on an electric guitar with the distortion turned all the way up. Sure, you could do it, but you might miss some of the finer points along the way.


"Convolution" comes from the Latin "convolvere" which means "to roll together." No, not like the Rolling Stones. In mathematics, it's about combining two functions to create a third function – kind of like when David Bowie and Queen rolled their talents together to create "Under Pressure"!


The "convolutional" part refers to how these networks process information in overlapping chunks, like a sliding window. Imagine you're trying to find a specific guitar riff in a long rock song. Instead of listening to the whole song at once, you slide a small window of time across the track, focusing on just a few seconds at a time. That's exactly what CNNs do with images or text – they slide a small filter across the data, looking for specific patterns.


Think of it like this: If you were looking at a huge crowd photo from Woodstock '69, a CNN would scan it piece by piece, like a spotlight moving across the crowd, looking for specific things like peace signs, guitars, or tie-dye shirts. Each "convolution" is like that spotlight moving to a new position, overlapping with where it just was, making sure it doesn't miss anything important.
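
Here is a tiny, hand-rolled sketch of that spotlight idea, assuming a made-up "sentence" of numbers and a single made-up filter (no real embeddings, no learning): the window slides across the sequence and scores each position.

```python
import numpy as np

# A rough sketch of the "sliding window" idea for text.
# The numbers and the filter are both hypothetical, chosen only for illustration.
words = np.array([0.2, 1.0, 0.9, 0.1, 1.1, 0.8, 0.0])  # toy "embedded" sentence
pattern = np.array([1.0, 1.0, 1.0]) / 3                # filter looking for a 3-word pattern

window = len(pattern)
responses = [float(words[i:i + window] @ pattern)      # slide, overlap, score
             for i in range(len(words) - window + 1)]

print(responses)  # higher values = the spots where the pattern "lights up"
```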


The "neural" part in both names (CNN and RNN) comes from their inspiration by biological neural networks in our brains. Just like how Keith Richards somehow still remembers all those guitar riffs, our brain neurons process information through complex networks – and these AI systems try to mimic that structure, just with a lot less rock 'n' roll lifestyle!


The Main Event: Attention Makes Its Stadium Debut


Then came attention, strutting onto the stage like Mick Jagger in his prime. The authors of "Attention Is All You Need", just like the authors of "All You Need Is Love", had a revolutionary idea: what if, instead of processing words like a slow vinyl record player, we could jump around the track like a CD player with skip buttons? (Remember those?)


Attention lets the model look at a sentence the way Jimmy Page looks at a guitar solo – taking in the whole picture and knowing exactly which notes (or in this case, words) matter most. It's like giving the AI model a backstage pass to every word in the sentence simultaneously.


Take this sentence: "The hippie van, painted with peace signs and flowers, cruised down Highway 61."


Traditional models would struggle like a rookie trying to play "Stairway to Heaven," but attention mechanisms can instantly connect "van" with its descriptive elements, even if they're scattered throughout the sentence like Pete Townshend's guitar pieces after a show.


Under the Hood: Query, Key, and Value (The Power Trio of AI)


The attention mechanism works like a well-oiled rock band, with three key members:


  1. Query: Think of this as the lead singer asking, "Who should I be harmonizing with?"

  2. Key: This is like the unique riff that identifies each band member

  3. Value: The actual musical contribution each member brings to the song


Let's break it down with our previous example:

"The hippie van, painted with peace signs and flowers, cruised down Highway 61."


  • Query: "painted" asks, "Who am I describing here?"

  • Key: "van" has a key that says, "I'm the star of this show!"

  • Value: The model connects these together faster than Eric Clapton can play "Layla".
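
For the more hands-on readers, here is a small NumPy sketch of the scaled dot-product attention formula from the paper, softmax(Q · K^T / sqrt(d_k)) · V, applied to random query, key, and value matrices (the numbers are made up, just to show the mechanics).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """The core formula from the paper: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax, one row per query
    return weights @ V                       # blend the values by those weights

# Toy example: 5 "words", each represented by a 4-dimensional vector.
rng = np.random.default_rng(42)
Q = rng.normal(size=(5, 4))  # queries: "who should I be harmonizing with?"
K = rng.normal(size=(5, 4))  # keys:    each word's identifying riff
V = rng.normal(size=(5, 4))  # values:  each word's actual contribution

print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4)
```

The key point is that every word gets to "listen" to every other word at once; nothing has to be passed along one step at a time.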


[Figure: The Attention Mechanism]


The Transformer: The Greatest Supergroup in AI History


Just as Led Zeppelin combined the talents of Jimmy Page, Robert Plant, John Paul Jones, and John Bonham, the Transformer combines multiple attention mechanisms into something greater than the sum of its parts. It's got two main sections:


  1. Encoder: The Input. The songwriter who takes in the original material

  2. Decoder: The Output. The producer who arranges it into something new


Each section is stacked like a Marshall amp wall at a Sepultura concert, with multiple layers of attention mechanisms working together to create the final mix.
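
If you want to see the two sections wired together, here is a minimal sketch using PyTorch's built-in nn.Transformer with the paper's base-model sizes (512-dimensional embeddings, 8 heads, 6 encoder and 6 decoder layers). The inputs are random tensors standing in for already-embedded sentences, so this is just the skeleton, not a trained translator.

```python
import torch
import torch.nn as nn

# Encoder-decoder stack with the base-model sizes from the paper.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

# Fake "already embedded" sequences: (sequence length, batch size, d_model).
src = torch.rand(10, 1, 512)  # the songwriter's input (source sentence)
tgt = torch.rand(7, 1, 512)   # the producer's output so far (target sentence)

out = model(src, tgt)
print(out.shape)  # torch.Size([7, 1, 512])
```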



Multi-Head Attention: The Ultimate Band Practice


Remember how The Beatles would sometimes layer multiple guitar tracks to create a richer sound? Multi-head attention works similarly. Instead of having just one attention mechanism looking at the sentence, you've got multiple "heads" analyzing it simultaneously, like a bunch of producers listening to the same track with different sets of headphones.


Each head might focus on something different:


  • Head 1 could be tracking the main melody (subject-verb relationships)

  • Head 2 might be following the bass line (contextual information)

  • Head 3 could be keeping time (temporal relationships)

  • Head 4 might be adding the harmonies (semantic nuances)


It's something like having Phil Spector's Wall of Sound, but for language processing! In other words, multi-head attention is like having a group of language experts analyzing the same sentence, each one focusing on a different aspect:


  • Head 1 could focus on who is doing what.

  • Head 2 might focus on descriptive details, like colors or emotions.

  • Head 3 might focus on time, or the weather.

  • Head 4 could focus on relationships between places and actions.
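
Here is a simplified sketch of the multi-head idea: split the embedding into a few slices, let each slice run its own self-attention, and glue the results back together. (Real Transformers project each head with learned weight matrices; the plain split below is a deliberate simplification to show the structure.)

```python
import numpy as np

def attention(Q, K, V):
    """Same softmax(Q K^T / sqrt(d_k)) V trick as before."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(seq_len, d_model))   # one toy sentence, already embedded

head_size = d_model // n_heads
heads_out = []
for h in range(n_heads):
    # This head only sees its own slice of the embedding -- its own set of headphones.
    slice_h = x[:, h * head_size:(h + 1) * head_size]
    heads_out.append(attention(slice_h, slice_h, slice_h))  # self-attention per head

multi_head = np.concatenate(heads_out, axis=-1)  # concatenate the heads back to (6, 16)
print(multi_head.shape)
```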



Positional Encoding: The Rhythm Section


Just as a song needs a steady beat to keep everything in time, Transformers need to know the order of words. Positional encoding is the drummer of our AI band. Because Transformers don’t read words in sequence, they need a way to know the order of the words. Otherwise, they would treat a sentence like a bag of words, with no beginning or end.


Positional encoding is like adding GPS coordinates to each word. These “coordinates” give each word a unique position in the sentence, so the model knows which word comes first, second, third, and so on. To achieve this, Transformers use a mathematical pattern (sine and cosine functions) that encodes each word’s position in a way the model can understand.
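
Here is a short sketch of those sine-and-cosine "GPS coordinates" as defined in the paper: each position gets a unique pattern across the embedding dimensions, which is then added to the word embeddings before they reach the attention layers.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions from the paper:
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]        # 0, 1, 2, ... (word order)
    dims = np.arange(0, d_model, 2)[None, :]       # the even embedding dimensions (2i)
    angle = positions / np.power(10000, dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)   # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16) -- one unique "beat marker" for each word position
```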


The Legacy: Transformers, More Than Meets the Eye


Today, Transformers are everywhere, like that one summer when you couldn't escape "Macarena." They're powering:


  • Google Translate (translating "Yellow Submarine" into every language imaginable)

  • Virtual assistants (though they're still not as cool as HAL 9000)

  • Chatbots like ChatGPT, Claude, and Gemini, which can actually hold a conversation better than some people at a cocktail party

  • Real applications that we use and love, like my dear Qlik Answers.


The Transformer architecture has inspired a whole new generation of AI models, each one building on its success like bands building on the foundations laid by Chuck Berry and Elvis. It "Ch-ch-ch-changed" the whole scene. Its true impact might only be fully appreciated in the future, but we can certainly feel it now. It reminds me of the first time I listened to Smashing Pumpkins or Nirvana. It was new, it was cool, and I knew it would have a big influence on generations to come, but only after a few years could I realize the size of that influence.


Conclusion: The Show Must Go On


Just as The Rolling Stones keep on rolling well into their golden years and (somehow) Keith Richards is still alive, the Transformer architecture continues to evolve and improve. Like Jim Morrison breaking on through to the other side of natural language processing, this technology has opened new doors (pun absolutely intended) that we never thought possible. It's not just another one-hit wonder – it's a timeless classic that keeps finding new audiences and applications faster than the Ramones can shout "Hey! Ho! Let's Go!"


The future of AI looks brighter than Eddie's eyes on an Iron Maiden album cover. Each new iteration of the Transformer architecture brings us closer to that perfect harmony between human and machine intelligence. And just like how Bruce Dickinson can hit those impossible high notes, Transformers keep pushing the boundaries of what's possible in AI.


Some skeptics might say AI will run to the hills, but the Transformer has proved them wrong. It's stayed true to its core principles while evolving faster than a Ramones song – quick, efficient, and straight to the point. From language translation to content generation, it's blazing through tasks like Joey Ramone blazed through power chords.


So whether you're a coding rookie or a seasoned tech veteran, remember that the Transformer, like Morrison's poetry, has layers of depth waiting to be explored. It's not the end, my only friend – it's just the beginning. As we stand at this technological crossroads, we can almost hear Jim Morrison whispering, "The future's uncertain and the end is always near," but one thing's for sure: the Transformer architecture, like rock 'n' roll itself, is here to stay.


As Iron Maiden would say, these are the "Hallowed Be Thy Name" moments in AI history – revolutionary, powerful, and destined to influence generations to come. So keep coding, keep experimenting, and remember: in the world of AI, much like in rock 'n' roll, the only way is up. And that's not just strange days ahead – that's the future of computing!


Gabba gabba hey! Let's Rock 🤘


Watch the Video


You can watch the video below, where I explain this concept in simpler terms. It does not have rock and roll references, but the design of the slides was 100% inspired by Pink Floyd. The paper I just explained is definitely not just another brick in the wall.



