From the course: Advanced RAG Applications with Vector Databases

Introduction to embeddings

- [Instructor] Now that we've wrapped up chunking, let's cover embeddings. Before vector embeddings, we didn't have a way to compare unstructured data. With embedding models, we do. Embedding models are machine learning models, almost always deep neural networks, that turn your text, images, videos, audio, or whatever kind of data you have into vectors, also called vector embeddings. Vectors are the tools we use to quantitatively compare unstructured data. Remember that it's important to use the correct embedding model for whatever data you have. In most contexts, that means an embedding model trained on your data type. For example, using ResNet50 for image embeddings, Sentence Transformers for your text, or Whisper for your audio. In this context, we are primarily concerned with embedding text. The rise in popularity of large language models in late 2022 and throughout 2023 showed us that text is one of the most important mediums for AI to work with. As such, there are now hundreds of embedding models specifically for text. You can find a list of these models on the Hugging Face MTEB (Massive Text Embedding Benchmark) leaderboard. That's MTEB, M-T-E-B. If you're working in an extremely specialized domain, though, even MTEB is not a comprehensive list. For example, CSV documents require their own embedding models. If you think about it, the way CSVs are structured is extremely different from the way regular text is structured. In CSVs, commas are used to separate entries or entities. In regular text, commas are used to signal a pause in thought or to separate phrases, clauses, or appositives.

When it comes to embedding your text for later use, there are many things to think about, but if you take care of these three, the rest often fall into line. The three critical considerations in embedding your data are the embedding model itself, what you want to embed, and how to compare your embeddings.

Let's look at the three pieces of picking the right model. The three pieces that go into picking an embedding model are embedding size, model size, and training data. First, embedding size. Embedding size is the size of the embedding vector. This is also referred to as the length or the dimensionality of the vector. Remember that vectors are just lists of numbers. These vectors are typically produced by a deep neural network; they're the output of the second-to-last layer of the network. When you put data into a neural net, each layer learns something about the data, and the final layer takes that information and makes a prediction or classification. The second-to-last layer contains all of that information without making a prediction on it. The size of the embedding affects the computational power needed to compare vectors when you use them. It is critical to remember here that only embeddings of the same size can be compared.

Second, model size. Much like the size of the embeddings, the size of the model you choose also has an effect on computational power. Smaller embedding models are less expensive overall: they're less expensive both when creating embeddings and when working with them afterward. Meanwhile, larger models can give you more fine-grained results, which may be necessary depending on what you're doing. One last thing to remember is that embedding models are not always LLMs. While LLMs can be used as embedding models, they are not the same thing.

Third is training data. The data that your model is trained on is always important.
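To make the embedding-size point concrete, here is a minimal sketch of embedding text chunks with the Sentence Transformers library mentioned above. It assumes the sentence-transformers package is installed, and the model name all-MiniLM-L6-v2 is just one small example model from the MTEB leaderboard, not a specific recommendation from this course.

# Minimal sketch: turning text chunks into vector embeddings.
# Assumes the sentence-transformers package is installed; the model name
# below is one small example model -- swap in whatever fits your data.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small text embedding model

chunks = [
    "Vectors let us compare unstructured data quantitatively.",
    "Use an embedding model trained on your data type.",
]

embeddings = model.encode(chunks)

# The embedding size (dimensionality) is fixed by the model -- 384 here.
# Only embeddings of the same size can be compared.
print(embeddings.shape)  # (2, 384)

The embedding size printed here is determined entirely by the model, which is why embeddings produced by different models generally cannot be compared directly.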
Different models on the MTEB leaderboard are trained on different datasets. Examples of how the data in the training set can change your model include language, structure, and data size. For example, models trained on Chinese can help you embed data in Chinese, but probably not Arabic. Models trained on chat-style data are better for embedding conversational input than for embedding essays.

Let's also look at algorithmic models. These are a special case: they are not neural nets. They are algorithms and typically produce a different type of embedding. Examples include TF-IDF (term frequency-inverse document frequency), SPLADE (sparse lexical and expansion model), and BM25 (where BM stands for best matching). These algorithms produce binary or sparse embeddings, as we talked about before. Measuring similarity between the embeddings they produce is also different from the others, but we'll talk about that later.

When it comes to picking what you want to embed, there are three main options. You can embed the chunked text, you can embed a portion of the chunk's text, or you can embed the larger chunk or section that your chunk's text is part of. At this point, a question naturally arises: if I spent so much time making my chunks good, why would I not just embed them? Once again, because we are working with programmatic methods, these techniques are there to enhance your chunking methods. Large-to-small refers to a technique where you embed large paragraphs but store the individual sentences as metadata, and small-to-large refers to a technique where you embed individual sentences but store the larger paragraph in the metadata.

Remember that vectors are just long lists of numbers. While there are many distance metrics that can be used to compare vectors, there are three main distance calculations: cosine, inner product, and Euclidean. Euclidean distance measures the spatial distance between two vectors. The best way to imagine vectors for Euclidean distance is to imagine the two points in hyperspace that the vectors point at, and then imagine the distance between those two points. The formula for Euclidean distance is the square root of the sum of the squared differences between the corresponding entries of the two vectors.

Cosine similarity measures the difference in orientation of two vectors. Unlike Euclidean distance, cosine similarity has us think of the vectors as arrows in hyperspace, where what we're measuring is the orientational difference between the two arrows at the origin. This is the most complicated and computationally expensive distance measure for dense vectors. Cosine similarity is the normalized dot product of two vectors. The formula is the dot product of A and B divided by the magnitude of A times the magnitude of B. Another way to express it is the sum of the products of corresponding entries in the vectors, divided by the product of the square roots of the sums of the squares of the entries in each vector.

Inner product, or dot product, is the simplest of these three measures of similarity. The way to think about this measure is to think of the vectors as arrows and then think about the projection of one vector onto another. We saw this formula earlier on the cosine similarity slide. The formula for inner product is the sum of the products of corresponding entries in the vectors. This is cosine similarity without dividing by the product of the magnitudes.

For sparse or binary vectors, there are two distance metrics that we should know: Hamming distance and Jaccard distance.
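As a quick illustration before moving to the binary metrics, here is a minimal sketch of the three dense-vector distance calculations described above. It uses NumPy purely as an assumption; any numeric library would work, and the example vectors are arbitrary.

# Minimal sketch of the three dense-vector distance metrics.
# NumPy is assumed here; the vectors are arbitrary examples.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

# Euclidean distance: square root of the sum of squared entry-wise differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Inner (dot) product: sum of the entry-wise products.
inner_product = np.dot(a, b)

# Cosine similarity: the dot product normalized by both magnitudes.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, inner_product, cosine)

Note how the cosine line is just the inner product line divided by the two magnitudes, which is exactly the relationship described above.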
Hamming distance is measured as the number of positions in which two binary vectors differ. Hamming distance can be computed by first taking an XOR of the two vectors and then summing all of the 1s in the result. The other binary distance metric that's good to know is Jaccard distance. Jaccard distance is 1 minus the Jaccard similarity, and the Jaccard similarity is the size of the intersection of the two vectors divided by the size of their union. Another way to calculate Jaccard distance is the difference between the union of A and B and the intersection of A and B, divided by the union of A and B. If both vectors are 1 in an entry, then that entry counts toward the intersection. If either vector has a 1 in an entry, then that entry is included in the union of A and B. A good way to think about Jaccard distance in terms of logical operators on vectors is the total of A OR B minus the total of A AND B, divided by the total of A OR B.
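Here is a minimal sketch of those two binary-vector metrics, again assuming NumPy and using arbitrary example vectors.

# Minimal sketch of Hamming and Jaccard distance for binary vectors.
# NumPy is assumed; the vectors are arbitrary examples.
import numpy as np

a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 0, 1, 0], dtype=bool)

# Hamming distance: XOR the vectors, then count the 1s in the result.
hamming = np.sum(a ^ b)

# Jaccard similarity: |A AND B| / |A OR B|; Jaccard distance is 1 minus that.
intersection = np.sum(a & b)
union = np.sum(a | b)
jaccard_distance = 1 - intersection / union
# Equivalently: (union - intersection) / union

print(hamming, jaccard_distance)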
