Vision-Language Model (VLM)

Also known as: VLM, multimodal model, multimodal LLM, MLLM

A model that processes both images and text in a unified way. Show it a screenshot, a chart, or a photo, and it can describe, analyze, or answer questions about what it sees.

What it means

A vision-language model takes images and text as input and produces text. The standard recipe: a vision encoder (often a ViT, a Vision Transformer) turns the image into a sequence of "image tokens" or patch embeddings, a projector maps those into the LLM's embedding space, and the LLM then treats them like any other tokens in its context. The model learns to attend across image patches and text tokens at the same time, so "what's in this image?" works just like any other prompt.

GPT-4V (now part of GPT-4o), Claude with vision, Gemini, and Llama 4 are all VLMs. So are open models like Qwen2-VL, Pixtral, and InternVL. The vision side has gotten remarkably good in 2025-2026: these models can read handwriting, parse complex charts, count objects, understand UI screenshots well enough to drive computer-use agents, and reason about diagrams. Quality varies by model: Gemini and GPT-4o tend to win on chart parsing, Claude on document understanding and code-from-screenshot.

"Multimodal" in 2026 usually means at least vision + text, often plus audio (GPT-4o's voice mode, Gemini Live). The frontier is unified models that handle text, images, audio, and sometimes video natively rather than stitching modality-specific encoders together. GPT-4o was a big step here: it handles voice without a separate speech-to-text pipeline, which is why its latency is so low.

What VLMs still struggle with: precise spatial reasoning (where exactly is X in the image?), small text in low-res images, counting many objects, and anything requiring pixel-level precision. They're great at semantic understanding, weaker at geometric and quantitative measurement. Don't trust a VLM to count items in a busy photo without verification.
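Here's a minimal PyTorch sketch of that recipe. Everything in it is made up for illustration: the dimensions are toy-sized, the names (TinyVisionEncoder, TinyVLM) are hypothetical, and a plain encoder stack stands in for both the pretrained ViT and the decoder-only LLM a real model would use. The projector is often an MLP or a cross-attention module rather than a single linear layer.

```python
import torch
import torch.nn as nn

# Toy dimensions -- real models are far larger.
PATCH = 16          # ViT patch size
IMG = 224           # input image side length
VISION_DIM = 384    # vision encoder hidden size
LLM_DIM = 1024      # LLM embedding size
VOCAB = 32000

class TinyVisionEncoder(nn.Module):
    """Stand-in for a ViT: patchify the image, embed and encode each patch."""
    def __init__(self):
        super().__init__()
        # A conv with stride == kernel == patch size is the standard
        # ViT patch-embedding trick: one output vector per image patch.
        self.patch_embed = nn.Conv2d(3, VISION_DIM, kernel_size=PATCH, stride=PATCH)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=VISION_DIM, nhead=6, batch_first=True),
            num_layers=2,
        )

    def forward(self, images):                 # (B, 3, 224, 224)
        x = self.patch_embed(images)           # (B, 384, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 384): one token per patch
        return self.encoder(x)

class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = TinyVisionEncoder()
        # The projector: maps image tokens into the LLM's embedding space.
        self.projector = nn.Linear(VISION_DIM, LLM_DIM)
        self.text_embed = nn.Embedding(VOCAB, LLM_DIM)
        # Stand-in for a decoder-only LLM (a real one would use causal masking).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, images, token_ids):
        img_tokens = self.projector(self.vision(images))   # (B, 196, 1024)
        txt_tokens = self.text_embed(token_ids)            # (B, T, 1024)
        # The key move: image patches and text tokens share one sequence,
        # so self-attention mixes the two modalities freely.
        seq = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm_head(self.llm(seq))                 # next-token logits

model = TinyVLM()
logits = model(torch.randn(1, 3, IMG, IMG), torch.randint(0, VOCAB, (1, 12)))
print(logits.shape)  # torch.Size([1, 208, 32000]): 196 image + 12 text positions
```

In many open training recipes (LLaVA-style, for example), the projector is trained first while the vision encoder and LLM stay frozen, then everything is fine-tuned together on image-text instruction data.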

Example

You paste a screenshot of a Stripe dashboard into Claude and ask "why did revenue dip on the 14th?" Claude reads the chart, sees the dip, and reasons about possible causes. That's a VLM combining vision encoding with LLM reasoning.
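Here's a minimal sketch of what that interaction looks like programmatically, using the Anthropic Python SDK's Messages API. The filename and model string are placeholders; check the current docs for available vision-capable models.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Encode the screenshot as base64 (placeholder filename).
with open("stripe_dashboard.png", "rb") as f:
    screenshot = base64.standard_b64encode(f.read()).decode("utf-8")

# Image and text go in as content blocks of a single user message --
# the image is just more context alongside the question.
message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot,
                },
            },
            {"type": "text", "text": "Why did revenue dip on the 14th?"},
        ],
    }],
)
print(message.content[0].text)
```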

Why it matters

VLMs unlock entire workflows that text-only models can't do: extracting data from screenshots, debugging UI issues from images, reading PDFs with diagrams, driving computer-use agents. By 2026, vision is the default for any serious frontier model — text-only feels limiting for real work.
