From the course: Vector Databases in Practice: Deep Dive
Chunk Wikipedia articles
- [Instructor] Now let's see how chunking and text processing is done in practice. Here I've put together some scripts to process a set of articles from Wikipedia. At the top, I'm using the same mediawikiapi library that you saw earlier, as well as a few other standard libraries. And this is a list of articles that I think are interesting, ranging from the history of computing to databases. For each of these Wikipedia articles, then, we need to download it, parse it to raw text, and chunk the body, as you just learned, so we can import it into the database. So I set up a couple of functions to break up these tasks.

The first task is to turn our text into a list of words, so we can use the word count for chunking. That's what this word-splitting function does. It takes a string of source text as input, uses a regular expression to collapse runs of whitespace into single spaces, and then splits the text on those spaces. If you're not sure what the regex syntax does, don't worry too much about it. All you need to know is that it takes a text body as input and produces a list of words. The next function then uses that list to build chunks of the provided size. We call it with a chunk_size_max parameter and an overlap, and it produces groups of words, in other words chunks, that we can ingest into the database. Then we use the functions defined above to go through each page, get the list of chunks that belongs to that page, and save those chunks into this JSON file here. It's a fairly simple piece of code, but it's quite useful, and it can of course be extended or modified to use section headers if you prefer.

Now let's have a look at a sample query that shows what you can do with chunked data like this. Let's say you want to know what our database says about how vector databases are different from relational databases. Hopefully this syntax looks familiar. We connect to the database, get the chunks collection, and perform a vector search to grab the 10 chunks most relevant to the query, and here's our prompt to the generative model. For this exercise, I'm going to print the generated text, of course, but we also want to know where those chunks are coming from. So here I loop through the returned objects and print the chunk_number as well as the title of the article that each chunk comes from.

Let's run it and see what happens. So here's our model's answer about how vector databases differ from relational databases. I won't go through the specifics, but they seem pretty good to me, and you're probably not surprised by that at this point. But crucially, take a look at the data that the RAG query used to produce this answer. The source chunks came from the Vector database article and the Database article on Wikipedia, and from different places throughout each document. Have a look at these lines for the Database article, for example. It's a long article, and the query used chunks from anywhere between the first and the 78th, which means it went through all of those chunks and found the relevant bits before producing the answer.
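For reference, a query along these lines might look roughly like the sketch below. This is not the exact code from the course files: it assumes the Weaviate Python client (v4 syntax), and the collection name "Chunk", the property names, and the connection details are illustrative.

```python
import weaviate

# Connect to a locally running Weaviate instance
# (a hosted cluster would use a different connect helper).
client = weaviate.connect_to_local()

try:
    # "Chunk" is an illustrative collection name holding the article chunks.
    chunks = client.collections.get("Chunk")

    # Retrieve the 10 most relevant chunks with a vector search, then hand
    # them to the generative model with a single grouped prompt.
    response = chunks.generate.near_text(
        query="How are vector databases different from relational databases?",
        limit=10,
        grouped_task=(
            "Using only the supplied text, explain how vector databases "
            "differ from relational databases."
        ),
    )

    # The generated answer, produced from the retrieved chunks.
    print(response.generated)

    # Show where each chunk came from: its position in the article
    # and the title of the source article.
    for obj in response.objects:
        print(obj.properties["chunk_number"], obj.properties["title"])
finally:
    client.close()
```

The grouped_task string plays the role of the prompt described above, and the final loop prints each chunk's chunk_number along with the title of the article it came from.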
And notably, it didn't use any chunks from the other articles, because they weren't relevant to the query, or at least not the most relevant. This is what makes chunking combined with RAG so powerful. This technique allows you to have as big a database as you can imagine, and from there, you retrieve just the relevant information before transforming it with the language model. As we've said before, this is a great way to combine the creative and reasoning capabilities of generative models with the facts or data that you can only retrieve from a database.
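To close the loop on the processing side, here is a minimal sketch of the kind of chunking script described at the start of this video. The function names, the chunk_size_max and overlap defaults, and the chunks.json filename are illustrative rather than the course's exact ones; the mediawikiapi calls follow that library's documented page-and-content interface.

```python
import json
import re

from mediawikiapi import MediaWikiAPI

# Articles to download and chunk; an illustrative subset of the list in the course.
page_titles = ["History of computing", "Database", "Vector database"]


def word_splitter(source_text: str) -> list[str]:
    """Collapse runs of whitespace into single spaces and split the text into words."""
    return re.sub(r"\s+", " ", source_text).split(" ")


def get_chunks(source_text: str, chunk_size_max: int = 100, overlap: int = 25) -> list[str]:
    """Split the text into chunks of at most chunk_size_max words,
    with each chunk overlapping the previous one by `overlap` words."""
    words = word_splitter(source_text)
    chunks = []
    for i in range(0, len(words), chunk_size_max - overlap):
        chunks.append(" ".join(words[i : i + chunk_size_max]))
    return chunks


mediawiki = MediaWikiAPI()
chunked_pages = []

for title in page_titles:
    # Download the article and parse it to plain text.
    page = mediawiki.page(title)
    # Record each chunk along with its source title and position in the article.
    for chunk_number, chunk in enumerate(get_chunks(page.content)):
        chunked_pages.append(
            {"title": title, "chunk_number": chunk_number, "chunk": chunk}
        )

# Save the chunks so they can be imported into the database later.
with open("chunks.json", "w") as f:
    json.dump(chunked_pages, f, indent=2)
```

With the overlap, consecutive chunks share a few words of context, which helps the vector search match a query even when the relevant passage straddles a chunk boundary.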