Multimodal Chain-of-Thought: reasoning over images and text
Multimodal CoT extends Chain-of-Thought reasoning to prompts that include images, charts, or diagrams. Learn the two-stage pattern (rationale then answer) and when it pays off over single-stage multimodal prompting.
Show GPT-4o a chart and ask: "Did Q4 revenue grow more than Q3?" You'll get a confident answer about half the time, even when the chart shows the opposite. The model glanced, pattern-matched, and moved on.
Now ask: "First describe what you see in the chart — axes, values, trends. Then, using only what you described, answer: did Q4 revenue grow more than Q3?" Suddenly the answer is correct, the reasoning is visible, and you can audit any step that went wrong.
That's Multimodal Chain-of-Thought: two-stage reasoning that extracts visual information explicitly before reasoning about it. The technique that turns multimodal models from confident guessers into careful analysts.
The whole idea in one line
Force the model to write down what it sees, then reason only from what it wrote.
The mental model: separate seeing from thinking
Multimodal models do both perception (what's in this image?) and reasoning (what does it mean?) in a single forward pass. When tasks require both, the model often shortcuts — glancing at the image and producing an answer that pattern-matches what such an image typically shows, without verifying the actual content.
Multimodal CoT splits the two jobs. Stage 1 forces explicit perception (write down what you see). Stage 2 reasons over the written description, where the model is much better behaved. The result: errors become visible, and the model has to ground its reasoning in what it actually saw.
The two-stage pattern
Stage 1: visual extraction
Force the model to convert the image to text — specific to what the downstream task needs. Generic descriptions don't help; targeted ones do.
Examine the chart in the image carefully. List, in this exact format:
- The x-axis label and units
- The y-axis label and units
- Each data point with its label and approximate value
- Any visible trends or notable features
Be exhaustive. If a value is hard to read precisely, give a range
(e.g., "around 40-50") rather than a single guess.

Output:
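Concretely, Stage 1 is just a vision call whose prompt is the targeted extraction above. Here's a minimal sketch using the OpenAI Python SDK; the model name and image URL are placeholder assumptions, not a prescription:

```python
# Stage 1: targeted visual extraction. A sketch, not a definitive implementation.
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Examine the chart in the image carefully. List, in this exact format:
- The x-axis label and units
- The y-axis label and units
- Each data point with its label and approximate value
- Any visible trends or notable features
Be exhaustive. If a value is hard to read precisely, give a range
(e.g., "around 40-50") rather than a single guess.
Output:"""


def extract_chart_description(image_url: str) -> str:
    """Convert the image into a targeted text description for downstream reasoning."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": EXTRACTION_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```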
Stage 2: reason from the extraction
Take the structured extraction from Stage 1 and use it as context for the actual question. The image is no longer in the prompt — only the model's own description.
Based on the chart description below, answer the question.
Show your reasoning before the final answer.
Chart description:
"""
{{stage_1_output}}
"""
Question: {{user_question}}
Reasoning:

The split is the trick. Stage 1 is bound to what the image actually shows. Stage 2 operates on text, where reasoning is more reliable.
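Stage 2 is a plain text call that receives only the Stage 1 output. A sketch continuing the example above (it reuses `client` and `extract_chart_description`); routing this stage to `gpt-4o-mini` is an illustrative choice, since it no longer needs vision:

```python
def answer_from_description(description: str, question: str) -> str:
    """Stage 2: reason over the Stage 1 text only; the image is deliberately not passed."""
    prompt = (
        "Based on the chart description below, answer the question.\n"
        "Show your reasoning before the final answer.\n\n"
        f'Chart description:\n"""\n{description}\n"""\n\n'
        f"Question: {question}\n\n"
        "Reasoning:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # text-only reasoning can run on a cheaper model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Chaining the two stages end to end:
description = extract_chart_description("https://example.com/q4-revenue.png")
answer = answer_from_description(description, "Did Q4 revenue grow more than Q3?")
```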
The single-prompt collapsed version
For latency-sensitive use, you can collapse both stages into one prompt with explicit phase labels:
Answer the question about the image below in two phases.
Phase 1 — Visual extraction:
List the relevant elements visible in the image (specific values,
labels, trends, features). Be exhaustive about what's relevant
to the question.
Phase 2 — Reasoning and answer:
Using only what you extracted in Phase 1, reason through the
question and provide the final answer.
Image: [the image]
Question: {{user_question}}
Phase 1:

This trades a small amount of reliability for roughly half the latency and cost. For most production tasks, this is the right call.
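In code, the collapsed version is a single vision call carrying the phased prompt. A sketch reusing the `client` from the Stage 1 example; the model name and placeholder URL are assumptions:

```python
PHASED_PROMPT = """Answer the question about the image below in two phases.

Phase 1 - Visual extraction:
List the relevant elements visible in the image (specific values,
labels, trends, features). Be exhaustive about what's relevant
to the question.

Phase 2 - Reasoning and answer:
Using only what you extracted in Phase 1, reason through the
question and provide the final answer.

Question: {question}

Phase 1:"""


def answer_single_call(image_url: str, question: str) -> str:
    """One vision call; the model produces both phases in a single response."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PHASED_PROMPT.format(question=question)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```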
When multimodal CoT pays off
When to use multimodal CoT
| If your situation is… | Reach for… | Why |
|---|---|---|
| Charts, graphs, dashboards | Yes — high lift | Reading specific values from charts is exactly where direct prompts fail |
| Diagrams (flowcharts, architecture, timelines) | Yes | Forces the model to enumerate elements before reasoning about relationships |
| Documents with mixed text + figures | Yes | Stage 1 extracts both modalities into a unified text representation |
| Complex scenes (multiple objects, spatial reasoning) | Yes | Spatial errors are easier to catch when described in writing |
| Simple image classification ("is this a cat?") | Skip — overkill | Direct multimodal prompts handle this fine |
| OCR-heavy tasks (extracting text) | Skip | Use a dedicated OCR call; multimodal CoT adds overhead without lift |
| Latency-critical interactive use | Single-prompt phased version | Two-call latency is too slow; phased single-prompt is the compromise |
Going further: production patterns
Targeted extraction prompts
Don't use a generic "describe the image" for Stage 1. Tailor the extraction to the downstream task. If you're going to ask about revenue trends, Stage 1 should specifically extract revenue values with dates. Generic extractions waste tokens and miss what matters.
Validate Stage 1 before reasoning
For high-stakes tasks, run a verifier on Stage 1's output before passing it to Stage 2. The verifier can be a simple regex (does the output contain the expected fields?), a schema check, or another LLM call asking "did the extraction miss anything obvious?" Errors caught at Stage 1 cost much less than errors that propagate through Stage 2.
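A lightweight verifier can be as simple as a few pattern checks before spending the Stage 2 call. A sketch assuming the chart-extraction format from earlier; the required patterns and retry policy are illustrative, not a fixed recipe:

```python
import re

# Fields the Stage 1 output is expected to mention; tune these to your extraction prompt.
REQUIRED_PATTERNS = [
    r"x-axis",
    r"y-axis",
    r"\d",  # at least one numeric value somewhere in the description
]


def extraction_looks_complete(description: str) -> bool:
    """Cheap pre-flight check: is the extraction plausibly usable for Stage 2?"""
    return all(re.search(p, description, re.IGNORECASE) for p in REQUIRED_PATTERNS)


description = extract_chart_description("https://example.com/q4-revenue.png")
if not extraction_looks_complete(description):
    # Retry Stage 1 (or escalate to an LLM verifier) before running Stage 2.
    description = extract_chart_description("https://example.com/q4-revenue.png")
```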
Mixed-modality stages
Stage 1 must be multimodal (it sees the image). Stage 2 doesn't have to be — it can run on a cheaper text-only model. Cost optimization: GPT-4o (vision) for extraction, GPT-4o-mini (text) for reasoning. Often a 3-5x cost saving with little to no quality loss.
Comparing across multiple images
For tasks comparing two or more images (before/after, A/B), do separate Stage 1 extractions per image, then a single Stage 2 that reasons across the extractions. Avoids the model conflating which feature came from which image — a classic multi-image failure mode.
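A sketch of that pattern, reusing `extract_chart_description` and `answer_from_description` from the earlier examples; the labels and URLs are placeholders:

```python
def compare_images(image_urls: dict[str, str], question: str) -> str:
    """Stage 1 per image, then one Stage 2 call reasoning across the labelled extractions."""
    sections = []
    for label, url in image_urls.items():
        sections.append(f"--- {label} ---\n{extract_chart_description(url)}")
    return answer_from_description("\n\n".join(sections), question)


answer = compare_images(
    {
        "Before": "https://example.com/dashboard-before.png",
        "After": "https://example.com/dashboard-after.png",
    },
    "Which version shows higher Q4 revenue?",
)
```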
Common mistakes
- Generic "describe the image" in Stage 1. Produces fluffy descriptions that don't help Stage 2. Tailor to the downstream task.
- Passing the image to Stage 2 too. The whole point is to ground reasoning in the explicit description. If Stage 2 sees the image directly, it shortcuts around the description.
- Using on tasks that don't need it. Simple classification, OCR, single-image description — direct prompts work fine. Save multimodal CoT for tasks where perception meets reasoning.
- Skipping eval. Multimodal CoT's lift is most visible on hard cases. Without an eval set, you can't tell if you're paying for a benefit you're not getting.
Quick reference
The 60-second summary
What it is: two-stage reasoning over images. Stage 1: extract what you see in writing. Stage 2: reason from the writing.
Why it works: separates perception from reasoning. Errors become visible at the extraction step instead of hidden in the answer.
When it shines: charts, diagrams, mixed text+figure documents, spatial reasoning.
When to skip: simple classification, OCR, latency-critical interactive use (use the collapsed single-prompt version instead).
What to read next
The text-only version of the same idea: Chain-of-Thought. For per-model multimodal capabilities, prompting Gemini covers the strongest multimodal model; GPT-4o and Claude also support vision. To chain extraction and reasoning across separate API calls, prompt chaining.