From the course: The AI Ecosystem for Developers: Models, Datasets, and APIs
Multimodal architectures: CLIP and Flamingo
- [Instructor] Multimodal architectures are designed to process and integrate information from multiple data modalities, such as text, images, audio, and video, to achieve a more comprehensive understanding and generate richer outputs. These architectures are relatively recent, emerging in the 2020s. Some of the most prominent multimodal architectures include CLIP, Contrastive Language-Image Pre-training. CLIP is a neural network that learns to connect visual and textual representations. It's trained on large-scale internet data, allowing it to associate images and text in a highly flexible manner. It is very useful for zero-shot image classification and cross-modal retrieval. The components of CLIP include a text encoder, which processes text descriptions using a transformer-based architecture to extract semantic features, and an image encoder, which processes images to extract visual features. It uses a vision transformer, such as ViT, or a CNN to convert images into vector…
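To make the two-encoder idea concrete, here is a minimal sketch of zero-shot image classification with CLIP using the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint. The image path and candidate labels are placeholders, and this is only one way to run CLIP, not the instructor's own example.

# Minimal sketch: zero-shot image classification with CLIP
# (assumes the transformers, torch, and Pillow packages are installed;
#  "photo.jpg" and the label strings below are placeholders)
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# The processor tokenizes the text and preprocesses the image; the model's
# text and image encoders map both into a shared embedding space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate labels, with no task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")

Because the class names are supplied as free-form text prompts at inference time, the same model can be pointed at a new label set without retraining, which is what makes the classification "zero-shot".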
Contents
- Introduction to AI models and architecture (5m 11s)
- NLP architectures: RNNs and transformers (5m 49s)
- Computer vision architectures: CNNs and vision transformers (6m 25s)
- Generative architectures: Diffusion and GANs (6m 10s)
- Multimodal architectures: CLIP and Flamingo (5m 29s)
- Efficient architectures (7m 32s)