From the course: Generative AI: Introduction to Large Language Models
The attention mechanism
- [Instructor] Introduced in the 2017 paper titled "Attention Is All You Need," transformers are an autoregressive encoder-decoder neural network architecture that makes use of a mechanism known as self-attention. As we learned in the previous course video, the encoding component of a transformer is made up of a stack of identical encoders. Each encoder has two main sublayers, the self-attention layer and the feed-forward layer. As input is fed to an encoder, it first passes through the self-attention layer, then to the feed-forward layer, which further processes the data. The feed-forward layer is a feed-forward neural network, which we previously learned about in the deep learning course video. The self-attention layer captures the importance of different words in relation to each other using a mechanism known as self-attention. This mechanism enables words to interact with each other so they can figure out which other words they should pay more attention to during the encoding and decoding process.

To understand the importance of self-attention, let's say we decide to solicit the help of a large language model to complete a sentence. We provide the model with a prompt that reads, "The fat cat sat on the mat because it." In order to generate the appropriate text, the model will need to understand what "it" in this sentence refers to. Does it refer to the mat, or does it refer to the cat? The answer to this question helps the transformer choose the right set of words to complete the sentence. Using the self-attention mechanism, the model is able to figure out which of the other words in the prompt to pay more attention to in order to figure out what "it" refers to. If it learns that "it" refers to the cat, then it could complete the sentence to say, "The fat cat sat on the mat because it wanted a comfortable spot for a nap." However, if it learns that "it" refers to the mat, then it could complete the sentence to say, "The fat cat sat on the mat because it was the closest thing around." How the model knows which words to attend to isn't necessarily obvious to a casual observer. It is done through a series of vector transformations, which I illustrate in the next course video.

Similar to the encoding component, the decoders in the decoding component are identical in structure. They are broken down into three main sublayers: a self-attention layer and a feed-forward layer, which are similar to those of an encoder, as well as a third layer known as an encoder-decoder attention layer. While similar to the self-attention layer, the encoder-decoder attention layer has a slightly different focus. Its primary objective is to help the model align the words in the input sequence with the words in the output sequence.

To better understand this difference, let's consider another example. Say we decide to use a large language model to translate the English sentence "The fat cat sat on the mat" to Spanish. If the model translates the sentence word for word, in the same order as it comes in, the Spanish output would keep the adjective "fat" in front of the noun "cat." This would be incorrect because in Spanish, adjectives typically come after the noun they modify. The encoder-decoder attention layer deals with these types of challenges. It learns to assign weights to different parts of an input sequence to inform how they are to be processed. So in this example, encoder-decoder attention will assign more importance to the word "cat" than to the word "fat" within the input sequence.
This tells the decoder to translate the word "cat" before it translates the word "fat" when going from English to Spanish.
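Although the vector math behind attention is illustrated in the next course video, here is a minimal sketch, in Python with NumPy, of the scaled dot-product attention described in "Attention Is All You Need." The token list, random embeddings, and projection matrices are illustrative stand-ins rather than values from a trained model, and the sketch leaves out details such as multiple heads, masking, and positional encodings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V, weights

# Toy embeddings for the prompt "the fat cat sat on the mat because it"
tokens = ["the", "fat", "cat", "sat", "on", "the", "mat", "because", "it"]
d_model = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d_model))   # stand-in word embeddings

# In a trained transformer, W_q, W_k, and W_v are learned projection matrices;
# here they are random placeholders for illustration only.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Self-attention: queries, keys, and values are all projections of the same
# sequence, so every word (including "it") attends to every other word.
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print("Attention weights for 'it':", list(zip(tokens, weights[-1].round(3))))

# Encoder-decoder (cross) attention reuses the same computation; only the
# sources of Q, K, and V change: queries come from the decoder's sequence,
# while keys and values come from the encoder's output, letting the decoder
# align each output word with the relevant input words, e.g.:
# dec_out, _ = scaled_dot_product_attention(Y @ W_q, enc_out @ W_k, enc_out @ W_v)
```

The only difference between self-attention and encoder-decoder attention in this sketch is where the queries, keys, and values come from: in self-attention all three are projections of the same sequence, while in encoder-decoder attention the queries come from the decoder and the keys and values come from the encoder's output, which is what lets the model learn alignments like translating "cat" before "fat."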