From the course: Generative AI: Working with Large Language Models
Self-attention
- [Instructor] One of the key ingredients of transformers is self-attention. Consider this example text: "The monkey ate that banana because it was too hungry." How is the model able to determine that "it" refers to the monkey and not the banana? It does this using a mechanism called self-attention, which incorporates the embeddings of all the other words in the sentence. So when processing the word "it," self-attention takes a weighted average of the embeddings of the other context words. The darker the shade, the more weight that word is given, and every word is given some weight. You can see that both banana and monkey come up as likely candidates for the word "it," but monkey has the higher weighted average. So what's happening under the hood? As part of the self-attention mechanism, the authors of the original transformer take the word embeddings and project them into three vector spaces, which they called query, key, and value. Projecting word embeddings into new vector spaces is a tool that mathematicians use to get different representations of the same embeddings. Now, in order to calculate the attention weights, we take as input the query, key, and value vectors. We then calculate a score for each word to determine how much focus to place on the other words in the sentence. We want to figure out how the query and key vectors relate to each other, and this is done by taking the dot product of the query vector and the key vector, QKᵀ. Queries and keys that are similar will have a large dot product, while those that don't share much in common will have little to no overlap. Now, if you've forgotten your linear algebra from school, the T means that we're performing a transpose operation on the vector K. Here, N is the dimension of these vectors, and we divide the dot product by the square root of N to scale the attention scores and reduce their size. We now have the logits, and we can convert them to probabilities using the softmax function. We then multiply each value vector by its softmax score and sum up the weighted value vectors, which produces the self-attention output for that word. This process takes place for every single word in the sentence. And as we've seen in this video, self-attention allows us to apply a different weight to each word in a sentence.
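To make the calculation concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention described above. This is not the course's code: the sequence length, embedding size, query/key/value dimension, and the random values standing in for word embeddings and learned projection matrices are all made up for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sentence of word embeddings.

    X            : (seq_len, d_model) matrix, one embedding per word
    W_q, W_k, W_v: matrices that project embeddings into the query,
                   key, and value vector spaces
    Returns the attention output and the attention weights.
    """
    Q = X @ W_q                        # queries
    K = X @ W_k                        # keys
    V = X @ W_v                        # values

    N = K.shape[-1]                    # dimension of the query/key vectors
    scores = Q @ K.T / np.sqrt(N)      # dot products, scaled down by sqrt(N)

    # softmax turns each row of logits into probabilities that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # each output row is the weighted sum of the value vectors
    return weights @ V, weights

# Toy example: 10 "words" (as in the sentence above), embedding size 16,
# query/key/value size 8. Random values stand in for learned parameters.
rng = np.random.default_rng(42)
X = rng.normal(size=(10, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))  # row i: how much word i attends to every other word
```

In a real model the rows of the weight matrix printed at the end would correspond to the shading shown in the video, for example how strongly "it" attends to "monkey" versus "banana."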