Vision-Language Model (VLM)
A multimodal AI model designed to jointly understand images and text, enabling tasks like image captioning, visual question answering, and document understanding.
Vision-language models combine computer vision with natural language understanding in a single architecture. They can look at an image and answer questions about it, describe its contents, extract structured data from documents, or follow visual instructions. Models such as GPT-4V, Claude with vision, and LLaVA are representative examples of the current state of the art.
The typical architecture pairs a vision encoder (like a Vision Transformer) with a language model, connected by a projection layer that translates visual features into the language model's embedding space. The vision encoder processes images into patch embeddings, and the language model reasons over these visual tokens alongside text tokens.
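The wiring described above can be sketched in a few lines. This is a toy illustration with random stand-in values, not any real model's implementation: the dimensions (`VISION_DIM`, `LM_DIM`, `NUM_PATCHES`) and the linear projection are hypothetical choices; real models learn the projection weights during training and some use small MLPs instead of a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from any specific model).
VISION_DIM = 768   # width of the vision encoder's patch embeddings
LM_DIM = 4096      # width of the language model's embedding space
NUM_PATCHES = 256  # e.g. a 224x224 image split into 16x16 patches of 14x14 pixels

def project_visual_tokens(patch_embeddings, W, b):
    """Projection layer: maps vision features into the LM's embedding space."""
    return patch_embeddings @ W + b

# Output of a (stand-in) vision encoder: one embedding per image patch.
patch_embeddings = rng.standard_normal((NUM_PATCHES, VISION_DIM))

# Projection parameters (learned in a real model; random here).
W = rng.standard_normal((VISION_DIM, LM_DIM)) * 0.01
b = np.zeros(LM_DIM)

visual_tokens = project_visual_tokens(patch_embeddings, W, b)

# Text tokens embedded by the LM's own embedding table (stand-in values).
text_tokens = rng.standard_normal((12, LM_DIM))

# The LM's input sequence: visual tokens followed by text tokens,
# so attention layers reason over both modalities jointly.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (268, 4096)
```

The key point the sketch makes concrete: after projection, visual tokens are just more rows in the language model's input sequence, which is why the same attention machinery handles both modalities.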
For growth applications, VLMs enable powerful features: automated product catalog enrichment from images, intelligent document processing that understands layouts and charts, visual search where users upload images to find similar products, content moderation that understands images in context, and accessibility tools that generate alt text. The main practical challenge is latency: processing images adds significant compute and token overhead compared to text-only tasks.
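The visual-search use case mentioned above reduces to nearest-neighbor lookup in an embedding space. This is a minimal sketch under stated assumptions: the embeddings are random stand-ins (a real system would obtain them from a vision encoder), and `catalog`, `cosine_similarity`, and the product names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)

# Stand-in catalog: 100 products, each with a 512-dim image embedding.
catalog = {f"product_{i}": rng.standard_normal(512) for i in range(100)}

# A user-uploaded query image whose embedding is close to product_42
# (simulated here as that embedding plus a little noise).
query = catalog["product_42"] + 0.1 * rng.standard_normal(512)

# Rank products by similarity to the query embedding.
ranked = sorted(catalog,
                key=lambda name: cosine_similarity(query, catalog[name]),
                reverse=True)
print(ranked[0])  # product_42
```

At catalog scale, the brute-force `sorted` scan here is what a vector database replaces with an approximate nearest-neighbor index.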
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with low-latency similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.