From the course: Advanced RAG Applications with Vector Databases

Introduction to multimodal embedding models

- [Instructor] Let's understand multimodal embedding models. The main highlight of multimodal embedding models is simple: they can embed multiple types of data. Typically, a multimodal embedding model has different functions internally to embed each type of data, because embedding each type of data is a different process in and of itself. Sometimes the data requires pre-processing, and often different parts of the model are trained on and for different types of data. The most common practice for training multimodal embedding models is to train them on pairs of data, and images plus text is the most common pairing used to train these models.

Some examples of multimodal embedding models include CLIP from OpenAI; large language models that have evolved to become multimodal, such as GPT-4o, also from OpenAI; and LLaVA, a state-of-the-art end-to-end large transformer model that combines an image encoder with Vicuna, an LLM. This model is not from OpenAI. In this chapter, we cover CLIP and GPT-4o, and we use CLIP for embedding because it's free and open source.

CLIP stands for Contrastive Language-Image Pretraining. The CLIP model has two encoders, one for encoding images and one for encoding language, or text. Since the model is open source, publicly available on Hugging Face, and state of the art, it is the most popular multimodal embedding model to date. Let's briefly understand how CLIP works and what the letters mean. C, contrastive. There are many machine learning methods for aligning two modalities, and contrastive learning is one of the most powerful and popular approaches to date. This technique takes pairs of data and trains both encoders to represent each pair as closely as possible in the same embedding space. At the same time, the model is incentivized to represent unpaired image/text combinations as far apart as possible. L-I, language-image. The CLIP model takes both text and image as input, and as we talked about earlier, it has a different encoder for each. One important point to note, though, is that although the encoders are separate, they both map into the same embedding space, and the vectors have the same dimensionality. P, pretraining. The model is pretrained on 400 million pairs of image and text data from the internet. Now that we have an understanding of multimodal embedding models, let's dive into the code.
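To make the "C, contrastive" part concrete, here is a minimal sketch of the kind of symmetric contrastive objective the transcript describes: matched image/text pairs in a batch are pulled together, and every unmatched combination is pushed apart. This is an illustrative approximation, not OpenAI's actual training code; the function name, the temperature value, and the random tensors standing in for encoder outputs are all assumptions for the example.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    Row i of image_emb and row i of text_emb are assumed to be a true pair;
    every other combination in the batch is treated as a negative.
    """
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct "class" for image i is text i, and vice versa.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)      # pull image i toward text i
    loss_t2i = F.cross_entropy(logits.t(), targets)  # pull text i toward image i
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random vectors standing in for the two encoders' outputs.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```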
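And here is a short sketch of using CLIP from Hugging Face to embed an image and some text, showing the two separate encoders projecting into one shared embedding space with the same dimensionality. It uses the `transformers` library; the checkpoint name `openai/clip-vit-base-patch32`, the image path, and the captions are placeholder assumptions, not values from the course.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP checkpoint from Hugging Face.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("cat.jpg")  # placeholder path; swap in any local image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Separate encoders, but both project into the same embedding space.
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

print(image_emb.shape, text_emb.shape)  # same embedding dimension for both

# Cosine similarity: the matching caption should score higher than the other.
print(torch.nn.functional.cosine_similarity(image_emb, text_emb))
```

Because image and text vectors share one space, either can be stored in a vector database and retrieved with a query from the other modality, which is what the rest of the chapter builds on.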
