Multimodal Chain-of-Thought: reasoning over images and text
Multimodal CoT extends Chain-of-Thought reasoning to prompts that include images, charts, or diagrams. Learn the two-stage pattern (rationale then answer) and when it pays off over single-stage multimodal prompting.
Show GPT-4o a chart and ask: "Did Q4 revenue grow more than Q3?" You'll get a confident answer about half the time, even when the chart shows the opposite. The model glanced, pattern-matched, and moved on.
Now ask: "First describe what you see in the chart — axes, values, trends. Then, using only what you described, answer: did Q4 revenue grow more than Q3?" Suddenly the answer is correct, the reasoning is visible, and you can audit any step that went wrong.
That's Multimodal Chain-of-Thought: two-stage reasoning that extracts visual information explicitly before reasoning about it. The technique that turns multimodal models from confident guessers into careful analysts.
The whole idea in one line
Force the model to write down what it sees, then reason only from what it wrote.
The mental model: separate seeing from thinking
Multimodal models do both perception (what's in this image?) and reasoning (what does it mean?) in a single forward pass. When tasks require both, the model often shortcuts — glancing at the image and producing an answer that pattern-matches what such an image typically shows, without verifying the actual content.
Multimodal CoT splits the two jobs. Stage 1 forces explicit perception (write down what you see). Stage 2 reasons over the written description, where the model is much better behaved. The result: errors become visible, and the model has to ground its reasoning in what it actually saw.
The two-stage pattern
Stage 1: visual extraction
Force the model to convert the image to text — specific to what the downstream task needs. Generic descriptions don't help; targeted ones do.
Examine the chart in the image carefully. List, in this exact format:
- The x-axis label and units
- The y-axis label and units
- Each data point with its label and approximate value
- Any visible trends or notable features
Be exhaustive. If a value is hard to read precisely, give a range
(e.g., "around 40-50") rather than a single guess.

Output:
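Concretely, Stage 1 is just a vision call whose prompt is the targeted extraction above. Here's a minimal sketch using the OpenAI Python SDK; the model name and image URL are placeholder assumptions, not a prescription:

```python
# Stage 1: targeted visual extraction. A sketch, not a definitive implementation.
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Examine the chart in the image carefully. List, in this exact format:
- The x-axis label and units
- The y-axis label and units
- Each data point with its label and approximate value
- Any visible trends or notable features
Be exhaustive. If a value is hard to read precisely, give a range
(e.g., "around 40-50") rather than a single guess.
Output:"""


def extract_chart_description(image_url: str) -> str:
    """Convert the image into a targeted text description for downstream reasoning."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": EXTRACTION_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```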
Stage 2: reason from the extraction
Take the structured extraction from Stage 1 and use it as context for the actual question. The image is no longer in the prompt — only the model's own description.
Based on the chart description below, answer the question.
Show your reasoning before the final answer.
Chart description:
"""
{{stage_1_output}}
"""
Question: {{user_question}}
Reasoning:

The split is the trick. Stage 1 is bound to what the image actually shows. Stage 2 operates on text, where reasoning is more reliable.
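Stage 2 is a plain text call that receives only the Stage 1 output. A sketch continuing the example above (it reuses `client` and `extract_chart_description`); routing this stage to `gpt-4o-mini` is an illustrative choice, since it no longer needs vision:

```python
def answer_from_description(description: str, question: str) -> str:
    """Stage 2: reason over the Stage 1 text only; the image is deliberately not passed."""
    prompt = (
        "Based on the chart description below, answer the question.\n"
        "Show your reasoning before the final answer.\n\n"
        f'Chart description:\n"""\n{description}\n"""\n\n'
        f"Question: {question}\n\n"
        "Reasoning:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # text-only reasoning can run on a cheaper model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Chaining the two stages end to end:
description = extract_chart_description("https://example.com/q4-revenue.png")
answer = answer_from_description(description, "Did Q4 revenue grow more than Q3?")
```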
The single-prompt collapsed version
For latency-sensitive use, you can collapse both stages into one prompt with explicit phase labels:
Answer the question about the image below in two phases.
Phase 1 — Visual extraction:
List the relevant elements visible in the image (specific values,
labels, trends, features). Be exhaustive about what's relevant
to the question.
Phase 2 — Reasoning and answer:
Using only what you extracted in Phase 1, reason through the
question and provide the final answer.
Image: [the image]
Question: {{user_question}}
Phase 1:

This trades a small amount of reliability for roughly half the latency and cost. For most production tasks, this is the right call.
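In code, the collapsed version is a single vision call carrying the phased prompt. A sketch reusing the `client` from the Stage 1 example; the model name and placeholder URL are assumptions:

```python
PHASED_PROMPT = """Answer the question about the image below in two phases.

Phase 1 - Visual extraction:
List the relevant elements visible in the image (specific values,
labels, trends, features). Be exhaustive about what's relevant
to the question.

Phase 2 - Reasoning and answer:
Using only what you extracted in Phase 1, reason through the
question and provide the final answer.

Question: {question}

Phase 1:"""


def answer_single_call(image_url: str, question: str) -> str:
    """One vision call; the model produces both phases in a single response."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PHASED_PROMPT.format(question=question)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```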
When multimodal CoT pays off
When to use multimodal CoT
| If your situation is… | Reach for… | Why |
|---|---|---|
| Charts, graphs, dashboards | Yes — high lift | Reading specific values from charts is exactly where direct prompts fail |
| Diagrams (flowcharts, architecture, timelines) | Yes | Forces the model to enumerate elements before reasoning about relationships |
| Documents with mixed text + figures | Yes | Stage 1 extracts both modalities into a unified text representation |
| Complex scenes (multiple objects, spatial reasoning) | Yes | Spatial errors are easier to catch when described in writing |
| Simple image classification ("is this a cat?") | Skip — overkill | Direct multimodal prompts handle this fine |
| OCR-heavy tasks (extracting text) | Skip | Use a dedicated OCR call; multimodal CoT adds overhead without lift |
| Latency-critical interactive use | Single-prompt phased version | Two-call latency is too slow; phased single-prompt is the compromise |
Going further: production patterns
Targeted extraction prompts
Don't use a generic "describe the image" for Stage 1. Tailor the extraction to the downstream task. If you're going to ask about revenue trends, Stage 1 should specifically extract revenue values with dates. Generic extractions waste tokens and miss what matters.
Validate Stage 1 before reasoning
For high-stakes tasks, run a verifier on Stage 1's output before passing it to Stage 2. The verifier can be a simple regex (does the output contain the expected fields?), a schema check, or another LLM call asking "did the extraction miss anything obvious?" Errors caught at Stage 1 cost much less than errors that propagate through Stage 2.
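A lightweight verifier can be as simple as a few pattern checks before spending the Stage 2 call. A sketch assuming the chart-extraction format from earlier; the required patterns and retry policy are illustrative, not a fixed recipe:

```python
import re

# Fields the Stage 1 output is expected to mention; tune these to your extraction prompt.
REQUIRED_PATTERNS = [
    r"x-axis",
    r"y-axis",
    r"\d",  # at least one numeric value somewhere in the description
]


def extraction_looks_complete(description: str) -> bool:
    """Cheap pre-flight check: is the extraction plausibly usable for Stage 2?"""
    return all(re.search(p, description, re.IGNORECASE) for p in REQUIRED_PATTERNS)


description = extract_chart_description("https://example.com/q4-revenue.png")
if not extraction_looks_complete(description):
    # Retry Stage 1 (or escalate to an LLM verifier) before running Stage 2.
    description = extract_chart_description("https://example.com/q4-revenue.png")
```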
Mixed-modality stages
Stage 1 must be multimodal (it sees the image). Stage 2 doesn't have to be — it can run on a cheaper text-only model. Cost optimization: GPT-4o (vision) for extraction, GPT-4o-mini (text) for reasoning. Often a 3-5x cost saving with little to no quality loss.
Comparing across multiple images
For tasks comparing two or more images (before/after, A/B), do separate Stage 1 extractions per image, then a single Stage 2 that reasons across the extractions. Avoids the model conflating which feature came from which image — a classic multi-image failure mode.
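A sketch of that pattern, reusing `extract_chart_description` and `answer_from_description` from the earlier examples; the labels and URLs are placeholders:

```python
def compare_images(image_urls: dict[str, str], question: str) -> str:
    """Stage 1 per image, then one Stage 2 call reasoning across the labelled extractions."""
    sections = []
    for label, url in image_urls.items():
        sections.append(f"--- {label} ---\n{extract_chart_description(url)}")
    return answer_from_description("\n\n".join(sections), question)


answer = compare_images(
    {
        "Before": "https://example.com/dashboard-before.png",
        "After": "https://example.com/dashboard-after.png",
    },
    "Which version shows higher Q4 revenue?",
)
```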
Common mistakes
- Generic "describe the image" in Stage 1. Produces fluffy descriptions that don't help Stage 2. Tailor to the downstream task.
- Passing the image to Stage 2 too. The whole point is to ground reasoning in the explicit description. If Stage 2 sees the image directly, it shortcuts around the description.
- Using on tasks that don't need it. Simple classification, OCR, single-image description — direct prompts work fine. Save multimodal CoT for tasks where perception meets reasoning.
- Skipping eval. Multimodal CoT's lift is most visible on hard cases. Without an eval set, you can't tell if you're paying for a benefit you're not getting.
Quick reference
The 60-second summary
What it is: two-stage reasoning over images. Stage 1: extract what you see in writing. Stage 2: reason from the writing.
Why it works: separates perception from reasoning. Errors become visible at the extraction step instead of hidden in the answer.
When it shines: charts, diagrams, mixed text+figure documents, spatial reasoning.
When to skip: simple classification, OCR, latency-critical interactive use (use the collapsed single-prompt version instead).
What to read next
The text-only version of the same idea: Chain-of-Thought. For per-model multimodal capabilities, prompting Gemini covers the strongest multimodal model; GPT-4o and Claude also support vision. To chain extraction and reasoning across separate API calls, prompt chaining.