From the course: Vector Databases in Practice: Deep Dive
Pre-processing text for vector databases
- [Instructor] Let's take a look at a couple of pre-processing tools and examples that might be applicable for vector databases. Here, as we often do, we'll talk about some general scenarios before moving on to specific ones. We talked earlier about the need to extract semantic information and strip away the styling. That might seem very hard, but luckily for us there's a wide array of great tools available for this, like Beautiful Soup, pypdf, or even AI models. Beautiful Soup, for example, is a popular tool that extracts text from HTML or XML files. And pypdf lets you extract text from a PDF file, a format that is notoriously tricky to extract information from. In fact, these days, you can even extract text from other media, like videos or audio, quite easily. You can do this by using an AI model that transcribes audio to text. Current state-of-the-art open-source models, like Whisper, work remarkably well for this purpose. These models can produce text outputs from conversations, or even instructional videos, even when multiple languages are spoken consecutively. So, in many cases there's little need to reinvent the wheel.

And when it comes to popular data sources, you might even find specific tools written just for them, and Wikipedia is a classic example of this. Since Wikipedia is such a well-established and reputable source of information, there are lots of high-quality tools that let you extract information from it. Actually, Wikipedia even produces regular data dumps for its users. But for downloading just a few files, you can use API-based tools like the MediaWiki API. These tools fetch and parse Wikipedia articles, giving you items like the summaries, sections, headings, and so on, already parsed out. These, in turn, help us build sensible, context-rich data objects in no time. Let's take a look at how.

Here's an example of a script that you can use to inspect the contents of a Wikipedia page. Once you import this MediaWiki class, all you need to do to use it is instantiate it and retrieve an article like this by specifying its title. The data is then parsed by MediaWiki, so you can access attributes like the title, summary, different sections, and so on. Or, very simply, you can print out the whole page with plain, text-based formatting. And if we scroll through, you'll see that different levels of headings are helpfully pre-formatted differently, with these markers.

Here's one section of that page, comparing the rendered version in a browser against the extracted text. You can see that the headings are marked here using standardized formatting. And, exciting for us, this means that we can write code to extract this information. So, we now have just the text from a Wikipedia article, with natural breakpoints to split the text, and we can also find contextual information, such as section and subsection headings, to match with the text. Next up, we'll show you how real-life data like this might be imported into Weaviate.
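Before moving on, here are a few hedged reference sketches of the tools mentioned above. First, text extraction with Beautiful Soup and pypdf. This is a minimal sketch, not the course's own code; the file names are placeholders, and it assumes the beautifulsoup4 and pypdf packages are installed.

```python
from bs4 import BeautifulSoup
from pypdf import PdfReader

# Strip markup from an HTML file, keeping only the visible text.
with open("article.html", encoding="utf-8") as f:  # placeholder file name
    soup = BeautifulSoup(f, "html.parser")
html_text = soup.get_text(separator="\n", strip=True)

# Pull the text out of each page of a PDF.
reader = PdfReader("report.pdf")  # placeholder file name
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
```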
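Transcribing audio with an open-source model like Whisper can be similarly short. This sketch assumes the openai-whisper package and ffmpeg are installed; the audio file name is just an example.

```python
import whisper

# Load a small pre-trained Whisper model and transcribe an audio file.
model = whisper.load_model("base")
result = model.transcribe("lecture_recording.mp3")  # placeholder file name
print(result["text"])  # the transcribed text, ready for pre-processing
```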
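The course's own Wikipedia inspection script isn't reproduced here, but a minimal sketch along the same lines, using the MediaWiki class from the pymediawiki package, might look like this; the article title is only an example.

```python
from mediawiki import MediaWiki

# Instantiate the client and retrieve an article by specifying its title.
wiki = MediaWiki()
page = wiki.page("Vector database")  # example article title

print(page.title)     # the article title
print(page.summary)   # the lead summary text
print(page.sections)  # the list of section headings
print(page.content)   # the full page as plain text, with == Heading == markers
```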
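Because the headings in page.content follow the standardized == Heading == pattern, a short helper can split the page at those natural breakpoints and pair each chunk of text with its heading. The split_by_headings function below is a hypothetical illustration of that idea, continuing from the page fetched in the previous sketch.

```python
import re

def split_by_headings(content: str):
    """Split MediaWiki-style plain text into heading/text chunks.

    Heading lines look like '== History ==' or '=== Early work ==='.
    """
    chunks, heading, lines = [], "Summary", []  # "Summary" labels the lead text
    for line in content.splitlines():
        match = re.match(r"^(={2,})\s*(.+?)\s*\1$", line.strip())
        if match:
            if "".join(lines).strip():
                chunks.append({"section": heading, "text": "\n".join(lines).strip()})
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    if "".join(lines).strip():
        chunks.append({"section": heading, "text": "\n".join(lines).strip()})
    return chunks

# Example: turn the page from the previous sketch into context-rich chunks.
for chunk in split_by_headings(page.content):
    print(chunk["section"], "->", chunk["text"][:60], "...")
```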