How to prompt Gemini (2.5 Pro, 2.5 Flash)
A practical guide to prompting Google's Gemini family: how its 1M+ context window changes the prompting game, and the structural patterns that work best for 2.5 Pro and 2.5 Flash.
A developer drops their entire 800,000-token codebase into a single Gemini Pro prompt. The whole thing. Every file. Then asks: "Find every place we're instantiating the payment client and check whether we're handling the timeout error consistently."
Gemini answers in one shot. With file paths. With line numbers. With a side note that `billing/legacy.py` handles it differently from everywhere else, and that the difference might be intentional but is worth checking.
That's the Gemini superpower. Most frontier models max out at 128K-200K tokens of context with degrading quality past 30K. Gemini Pro supports 1M-2M tokens with surprisingly stable quality across the whole window. It changes what you can put in a prompt — entire codebases, hundreds of pages of docs, a full conversation history with research excerpts attached.
But putting more in a prompt isn't the same as making it work better. This guide is about what Gemini is genuinely good at, where it struggles, and how to use the long context without drowning the model.
The whole idea in one line
Gemini's 1M+ context window flips prompting from deciding what to leave out to structuring what you put in so the model can navigate it.
The mental model: scale changes the playbook#
When the context window is small (8K, 32K, 128K), every prompting decision is about what to leave out. You compress, summarize, retrieve, prune. Information architecture becomes "what fits."
When the context window is 1M+, the calculus flips. You can include everything. Now the question is: how do you organize a million tokens so the model can navigate them? The answer is structure, structure, structure — Markdown headers, section dividers, table of contents, explicit signposting. Without structure, even Gemini gets lost in a million tokens of unstructured prose.
This is also where the "lost in the middle" effect matters. Even with a 1M window, attention to content in the middle of a long context is weaker than attention to the top and bottom. Critical instructions belong at the ends; reference material in the middle.
The Gemini family#
- Gemini 2.0 Pro / 2.5 Pro — the quality tier. 1-2M token context. Use for tasks requiring careful reasoning or long-document understanding.
- Gemini 2.5 Flash — fast and cheap. Surprisingly capable on most tasks; excellent cost-per-token. The default for high-volume work.
- Gemini 2.0 Flash-Thinking — Flash with extended internal reasoning. Sits between plain Flash and Pro on hard tasks; cheaper than Pro, smarter than Flash.
What Gemini is genuinely good at#
- Long-document understanding. Drop a 300-page PDF or a full repo into the context and ask questions. Gemini handles this better than any other frontier model.
- Multimodal tasks. Native handling of images, video, and audio. Image-to-text, video summarization, and audio transcription work impressively well in a single prompt.
- Cost-sensitive workloads. Flash is cheap enough to run at scales where GPT-4o or Claude would be cost-prohibitive.
- Structured outputs. Solid JSON compliance, especially with the official structured-output features; see the sketch after this list.
- Native tool use including code execution. Gemini's code-execution tool runs Python in a sandbox and feeds results back to the model — useful for math, data analysis, and verification.
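For the structured-outputs point, here's a minimal sketch using the google-genai Python SDK. The model name and exact config field names are assumptions; check the current SDK docs before relying on them.

```python
# Minimal sketch: schema-constrained JSON output via the google-genai SDK.
# Model name and config field names are assumptions; verify against the docs.
from google import genai
from google.genai import types
from pydantic import BaseModel

class Ticket(BaseModel):
    summary: str
    severity: str  # e.g. "low" | "medium" | "high"

client = genai.Client(api_key="...")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Turn this bug report into a ticket: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Ticket,  # the SDK derives a JSON schema from the model
    ),
)
print(response.parsed)  # a validated Ticket instance, no manual json.loads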
What Gemini struggles with#
- "Lost in the middle" on very long contexts. Even with 1M tokens available, attention to content in the middle of a giant context degrades. Important instructions belong at the top OR the bottom.
- Strict instruction following on edge cases. Slightly more likely than Claude to skip a constraint when constraints are listed densely.
- Subtle stylistic prose. Gemini's long-form writing is competent but tends toward safety. For polished prose, Claude usually edges ahead.
- Niche or low-resource topics. Like all models, weaker on long-tail facts — but combined with Gemini's confident default voice, hallucinations can be more persuasive.
How to actually use the long context window#
Stuffing a million tokens into a prompt isn't free — it costs latency, money, and quality. Three rules to use it wisely:
Structure long inputs explicitly#
Use section markers — Markdown headers, === DOCUMENT 1 === dividers, or XML-style tags. Long unstructured text is much harder for the model to navigate. A million tokens with no structure performs worse than 100K tokens of structured prose.
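One way to do that mechanically, as a plain-Python sketch. The divider and table-of-contents formats are just prompt conventions, not anything the API requires, so adapt them freely:

```python
# Sketch: assembling a structured long context from many documents.
def build_context(docs: list[tuple[str, str]]) -> str:
    """docs is a list of (title, body) pairs."""
    # A table of contents up front gives the model a map of what follows.
    toc = "\n".join(f"{i + 1}. {title}" for i, (title, _) in enumerate(docs))
    parts = [f"# Table of contents\n{toc}"]
    for i, (title, body) in enumerate(docs):
        parts.append(f"=== DOCUMENT {i + 1}: {title} ===\n{body}")
    parts.append("=== END OF DOCUMENTS ===")
    return "\n\n".join(parts)
```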
Put the question at the end, instructions at the top#
For long-context queries, the recommended structure: instructions and constraints at the top, the giant context in the middle, and the actual question at the very bottom. The model attends most strongly to both ends.
Don't pre-filter — let the model find#
If you're putting 200 pages in the context, you might be tempted to pre-summarize first. Often unnecessary — Gemini is genuinely good at finding the relevant passages. Skip the pre-processing for the first pass; only optimize if quality is insufficient.
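Putting the three rules together, a long-document extraction prompt can look like this: instructions and constraints at the top, the structured document in the middle, and the question restated at the very end.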
# Task
Find all places in the document below where the customer expressed
frustration. Return a JSON array of objects with keys:
- "quote": the exact frustrated passage
- "page": page or section number
- "intensity": one of "mild", "moderate", "strong"
# Constraints
- Do not include passages where frustration is expressed by anyone
other than the customer.
- If a passage is ambiguous, exclude it.
- Output only the JSON array. No prose.
# Document
=== PAGE 1 ===
{{page_1_content}}
=== PAGE 2 ===
{{page_2_content}}
... (potentially hundreds of pages)
=== END OF DOCUMENT ===
# Question
List every customer-frustration passage in the document, structured
as specified above.
Multimodal: images, video, audio#
Gemini natively understands images, video frames, and audio. A few practical tips:
- Be specific about what you want from the image. "Describe this image" gets generic; "list every product visible on the shelf, left to right" gets useful output.
- Video works in chunks. Long videos are processed as a series of frames; ask questions that span the whole clip rather than "what happens at minute 14".
- Audio transcription is nearly free. Pass an audio file with "transcribe" and Gemini handles it without a separate ASR pipeline.
- Mix modalities in one prompt. A document, an image, and an audio clip in one query — useful for QA over multimedia content.
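Here's a minimal sketch of the image case with the google-genai Python SDK. The Part constructor and field names are assumptions, so verify against the current docs:

```python
# Sketch: one request pairing an image with a specific, targeted question.
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

with open("shelf.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        # Specific beats generic: name exactly what you want from the image.
        "List every product visible on the shelf, left to right.",
    ],
)
print(response.text)
```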
Picking the right Gemini model#
Which Gemini for which task
| If your situation is… | Reach for… | Why |
|---|---|---|
| High-volume cheap workloads (classification, basic Q&A) | Gemini Flash | Cheapest frontier model; capable on most tasks |
| Most production tasks | Gemini Pro | The default — quality + reasonable cost |
| Hard reasoning, complex analysis | Gemini Flash-Thinking or Pro | Internal reasoning lifts hard tasks; pick by cost |
| Long-document analysis (>200K tokens) | Gemini Pro | Best long-context model in the field |
| Multimodal (image + text, video summarization) | Gemini Pro | Native multimodal; mature image/video support |
| Cost-sensitive RAG-replacement (dump corpus + ask) | Gemini Flash with long context | Often cheaper than maintaining a RAG pipeline |
| Polished long-form writing | Consider Claude Sonnet | Claude's writing voice tends to edge ahead |
Specific tactics that work on Gemini#
- Use Markdown structure aggressively. Headers, bullets, code blocks. Gemini respects them in inputs and produces them cleanly in outputs.
- For RAG-style work, consider using Gemini's long context as the retrieval system — instead of vector search + small context, dump the full corpus and let the model retrieve. Surprisingly competitive on cost when Flash is used.
- Put grounded sources inline, not in a separate "sources" block. Gemini cites more reliably when the source is right next to the relevant passage.
- Use Flash by default, Pro for hard tasks. Flash handles 80% of work well at a fraction of the cost. Reserve Pro for tasks where Flash visibly underperforms.
- Combine code execution with reasoning for math. Gemini's code-exec tool turns arithmetic-heavy questions into code-runs-deterministically questions.
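For that last tactic, a hedged sketch of turning on the code-execution tool with the google-genai SDK. Tool and field names are assumptions; confirm against the current docs:

```python
# Sketch: let arithmetic run as sandboxed Python instead of token-by-token math.
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What is $10,000 compounded at 4.3% annually for 7 years?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)  # the answer comes from executed code, not estimation
```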
Going further: production Gemini patterns#
Context caching for repeated long contexts#
Like Anthropic's prompt caching, Google's context caching lets you cache large stable parts of your prompt (a knowledge-base document, a long system prompt) so subsequent requests don't re-pay for processing. For long-context applications, this can drop per-query cost by 4-5×.
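A sketch of the flow with the google-genai Python SDK; the config class and field names are assumptions, so check the caching docs before relying on them. Note that caches have a minimum token count, so this only pays off for genuinely large contexts.

```python
# Sketch: cache a large stable document once, then query it cheaply.
from google import genai
from google.genai import types

client = genai.Client(api_key="...")
manual_text = open("manual.txt").read()  # the big, stable part of the prompt

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about the attached manual.",
        contents=[manual_text],
        ttl="3600s",  # keep the cache warm for an hour
    ),
)

# Each query now references the cache instead of resending the manual.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What does chapter 12 say about warranty claims?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
```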
Grounding with Google Search#
Gemini Pro has a native Google Search grounding tool — the model can fetch fresh search results and cite them. Useful for any application that needs up-to-the-minute information without building a separate retrieval pipeline. Results come with attribution back to the source by default.
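Enabling it is one config line. A sketch (the tool name follows the google-genai SDK and is an assumption):

```python
# Sketch: ground a query in fresh Google Search results.
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What changed in the most recent stable Chrome release?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)  # grounding metadata with source links rides along
```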
Batch API for analytics workloads#
Gemini's Batch API offers similar discounts to OpenAI's and Anthropic's — ~50% off for non-urgent workloads with up to 24-hour turnaround. Pair with Flash for the cheapest possible large-volume processing.
Thinking budget tuning#
Gemini Flash-Thinking exposes a thinking-budget parameter. Lower values trade reasoning quality for speed; higher values trade speed for quality. Tune empirically against your eval set — there's rarely a global optimum.
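A sketch of the knob, assuming the google-genai SDK's thinking_config parameter (names and supported models may differ by version):

```python
# Sketch: trade reasoning depth for latency via the thinking budget.
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="...",  # a hard reasoning task from your eval set
    config=types.GenerateContentConfig(
        # Higher budget: more internal reasoning tokens, slower and costlier.
        # Tune against your evals; 0 disables thinking on some models.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
```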
Multimodal RAG#
Gemini handles image embeddings natively, opening up RAG over corpora that include screenshots, diagrams, or scanned documents. The pipeline: chunk the text, embed the images, store both in a vector store, and retrieve by query similarity. Powerful for docs-with-figures use cases.
Things to avoid#
- Treating 1M context as a substitute for retrieval. For corpora that update frequently or are huge, RAG still wins on cost and freshness. Long context shines for static, bounded documents. See the RAG guide for the trade-off.
- Burying critical instructions in the middle of a long prompt. The middle is the weakest spot. Top or bottom only.
- Trusting unstructured 500-page outputs. For long generated outputs, prefer structured chunks (sections, JSON arrays) over free-form prose.
- Defaulting to Pro when Flash would do. Most workloads work fine on Flash. Premium tier costs add up at scale.
Quick reference#
The 60-second summary
The Gemini superpower: 1M-2M token context that actually works. Drop entire codebases, hundreds of pages of docs, full conversation histories.
The family: Pro (quality), Flash (cheap/fast), Flash-Thinking (cheap reasoning). Default to Flash; reserve Pro for hard tasks.
Long-context discipline: structure with headers/dividers, instructions at top + question at bottom, don't pre-filter — let the model find.
The trade-off: long context vs RAG isn't one-or-the-other. Static bounded docs → long context. Dynamic huge corpora → RAG. Both → measure.
What to read next#
Compare Gemini against ChatGPT and Claude head to head. For long-context-friendly techniques, few-shot with many examples and prompt chaining both work well. To choose between Flash and Pro for a given task, run an A/B test — the right answer is rarely "always Pro".