From the course: Advanced AI: Transformers for Computer Vision

History of transformers

- [Instructor] Transformers have taken the AI world by storm, and you can find them used in production in Google Search. You see, when transformers were first introduced, they were used for NLP applications and were very successful, so it's not surprising that researchers then tried to apply the same architecture to computer vision. The "Attention Is All You Need" paper from 2017, which introduced the transformer architecture, has been revolutionary. Models based on that original transformer paper have evolved over the years, and most large AI models since have been built on the transformer architecture.

Now in June 2018, OpenAI released GPT, short for Generative Pre-Training, the first pretrained transformer model. It was fine-tuned on various NLP tasks and achieved state-of-the-art results. A couple of months later, researchers at Google came up with BERT, or Bidirectional Encoder Representations from Transformers. Later, OpenAI released a bigger and better version of GPT called GPT-2. This made headlines because it was so good for its time that the OpenAI team initially withheld the full model, citing ethical concerns.

So it's not surprising that researchers who had great success with NLP then turned their attention to computer vision. With GPT, you generate text one word at a time; given part of an image, Image GPT can generate the rest of the image the same way, one pixel at a time (there's a small code sketch of this loop below). The Google team tried a similar approach from the encoder side: BERT was great at language understanding and tasks like text classification, so what if a similar architecture could be used for image classification? The Google Research team released the Vision Transformer in October 2020, and that's the architecture we'll be focusing on in this course.
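To make "one word (or pixel) at a time" concrete, here is a minimal sketch of the autoregressive loop that GPT-style models use. This is not OpenAI's actual code: `model` is a hypothetical function, assumed to map the tokens generated so far to a probability distribution over the next token. GPT runs this loop over word tokens; Image GPT runs the same loop over pixel values.

```python
import numpy as np

def generate(model, prompt_tokens, num_steps, seed=0):
    """Autoregressive generation: sample one token (word or pixel) at a time.

    `model` is a hypothetical callable assumed to take the token sequence
    so far and return a probability distribution (a 1-D array summing to 1)
    over the next token.
    """
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(num_steps):
        probs = model(tokens)                     # distribution over the next token
        next_token = int(rng.choice(len(probs), p=probs))
        tokens.append(next_token)                 # feed the sample back in
    return tokens
```

Each new token depends on everything generated before it, which is why GPT produces text left to right and Image GPT fills in an image pixel by pixel.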
