Can We Backprop Through a Prompt?

Published on June 7, 2026

I have been thinking about prompt optimization as a machine learning problem rather than a copywriting problem.

If you have a task, an evaluation set, and a score, the natural question is: can the prompt become the thing we train? Can we backprop through it, or at least create a differentiable proxy that tells us how much performance is hiding in the prompt?

This is the idea behind my Gemma prefix-tuning experiment. It is not meant to be a grand benchmark result. It is a small controlled setup for testing a hunch:

If a trainable soft prefix can improve a model on a benchmark, that improvement is a rough upper bound on what discrete prompt optimization might be able to find.

Why Gemma 3 270M?

The model choice was practical. I wanted something small enough to run locally on an M1 MacBook, but real enough that the result would mean something. google/gemma-3-270m-it was a good fit: small, instruction-tuned, and well documented enough to compare with reported model-card numbers.

The first baseline was simple:

lm_eval \
  --model hf \
  --model_args pretrained=google/gemma-3-270m-it \
  --tasks winogrande \
  --device mps \
  --batch_size auto \
  --apply_chat_template \
  --log_samples \
  --output_path ./eval_out/ \
  --trust_remote_code

That roughly reproduced the WinoGrande number from the model card: about 0.516 accuracy.

This matters because without a reproducible baseline, prefix tuning becomes too easy to fool yourself with. The model is small, the task is noisy, and a tiny formatting mistake can look like science.
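To put a number on "noisy", here is a small sketch of the binomial standard error for an accuracy figure. It assumes the evaluation set has 1,267 examples, which I believe is the size of the WinoGrande validation split (check it against your own lm-eval output), and it treats examples as independent. The function names are illustrative and not part of lm-eval.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n independent examples."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def is_clearly_different(acc_a: float, acc_b: float, n_examples: int, z: float = 1.96) -> bool:
    """True if the gap exceeds z standard errors of the difference.

    Treats the two runs as independent, which is conservative when both
    are scored on the same examples.
    """
    se_a = accuracy_standard_error(acc_a, n_examples)
    se_b = accuracy_standard_error(acc_b, n_examples)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) > z * se_diff


if __name__ == "__main__":
    # ASSUMPTION: 1,267 validation examples; 0.516 is the baseline quoted above.
    print(accuracy_standard_error(0.516, 1267))
```

Under those assumptions the standard error is a little over one accuracy point, so a one- or two-point change between runs should not be read as a real effect on its own.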

Why WinoGrande?

WinoGrande is a commonsense reasoning benchmark. Each example has a sentence with a blank and two candidate fillers.

Example shape:

Kyle does not wear leg warmers to bed, while Logan almost always does.
_ is more likely to live in a colder climate.

option1: Kyle
option2: Logan
answer: option2

The benchmark is useful here for three reasons.

First, it is small enough to iterate on. Second, Gemma 3 270M is not already saturated on it, so there is room to observe movement. Third, the task can be evaluated with log-likelihood: compare how likely the model thinks the sentence is with option 1 versus option 2.

That makes it a nice bridge between prompting and training. The prompt changes the context. The score is still a clean probabilistic comparison.
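The comparison can be sketched in a few lines. This is a minimal illustration, not the harness's implementation: the scorer is passed in as a function, and `toy_log_likelihood` is a hard-coded stand-in so the example runs without a model. As far as I know, lm-eval scores only the text after the blank, conditioned on each filled-in prefix, so treat the whole-sentence version here as the simpler cousin of that.

```python
from typing import Callable


def fill_blank(sentence: str, option: str) -> str:
    """Replace the single '_' placeholder with a candidate filler."""
    if sentence.count("_") != 1:
        raise ValueError("expected exactly one '_' blank in the sentence")
    return sentence.replace("_", option)


def pick_option(
    sentence: str,
    option1: str,
    option2: str,
    log_likelihood: Callable[[str], float],
) -> str:
    """Return 'option1' or 'option2', whichever filled sentence scores higher."""
    score1 = log_likelihood(fill_blank(sentence, option1))
    score2 = log_likelihood(fill_blank(sentence, option2))
    return "option1" if score1 >= score2 else "option2"


def toy_log_likelihood(text: str) -> float:
    """Stand-in scorer, NOT a language model: it rewards one hard-coded phrase."""
    return -1.0 if "Logan is more likely" in text else -2.0


if __name__ == "__main__":
    sentence = (
        "Kyle does not wear leg warmers to bed, while Logan almost always does. "
        "_ is more likely to live in a colder climate."
    )
    print(pick_option(sentence, "Kyle", "Logan", toy_log_likelihood))
```

In a real run, `log_likelihood` would sum the model's token log-probabilities for the text. Everything upstream of that call, including any prefix, is what the experiment varies.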

The prompt-optimization question

The usual DSPy-style prompt optimizer is searching over text. It might rewrite instructions, add examples, or change formatting. But that search is indirect: the prompt is discrete, the score is non-smooth, and the optimizer has to make guesses.

Prefix tuning gives a different view. Instead of asking an LLM to invent better text, attach trainable vectors before the input and update them with gradient descent.

The soft prefix is not human-readable. That is the point. It asks:

If the model could get an ideal learned preamble for this task, would the task score move?

If the answer is no, the bottleneck is probably not prompt wording. Maybe the model does not know the task, the model is too small, the benchmark is too hard, or the scoring method is not aligned.

If the answer is yes, then the prompt channel has signal. The next question becomes whether a discrete optimizer can recover some of that signal with readable text.
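To make the mechanics concrete, here is a deliberately tiny sketch of a soft prefix. The "model" is a frozen linear readout over mean-pooled vectors, which is nothing like Gemma; in a real transformer the prefix would act through attention. The only point is the training setup: prefix vectors are prepended to the input, and they are the only thing gradient descent updates. All names and the toy data are assumptions made for illustration.

```python
import math

W_FROZEN = [1.0, -1.0]  # frozen readout weights; never updated

# Toy data: (list of input vectors, label). The true label is 1 when
# x0 - x1 > 1, but the frozen model thresholds at 0, so it is biased.
DATA = [
    ([[0.0, 1.0]], 0),
    ([[0.5, 0.0]], 0),
    ([[1.0, 0.2]], 0),
    ([[1.5, 0.0]], 1),
    ([[2.0, 0.0]], 1),
    ([[3.0, 0.0]], 1),
]


def frozen_logit(prefix, inputs):
    """Frozen toy model: mean-pool [prefix; inputs], then a fixed linear readout."""
    seq = prefix + inputs  # prepend the soft prefix to the input vectors
    pooled = [sum(v[i] for v in seq) / len(seq) for i in range(len(W_FROZEN))]
    return sum(w * p for w, p in zip(W_FROZEN, pooled))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(prefix, data):
    """Fraction of examples where the sign of the logit matches the label."""
    correct = sum((frozen_logit(prefix, x) > 0) == bool(y) for x, y in data)
    return correct / len(data)


def train_prefix(data, n_prefix=1, lr=1.0, steps=300):
    """Gradient descent on mean logistic loss; only the prefix vectors change."""
    dim = len(W_FROZEN)
    prefix = [[0.0] * dim for _ in range(n_prefix)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n_prefix)]
        for inputs, label in data:
            err = sigmoid(frozen_logit(prefix, inputs)) - label  # dLoss/dlogit
            seq_len = n_prefix + len(inputs)
            for p in range(n_prefix):
                for i, w in enumerate(W_FROZEN):
                    # dlogit/dprefix[p][i] = w / seq_len because of mean pooling
                    grad[p][i] += err * w / seq_len / len(data)
        for p in range(n_prefix):
            for i in range(dim):
                prefix[p][i] -= lr * grad[p][i]
    return prefix


if __name__ == "__main__":
    print("no prefix:", accuracy([], DATA))
    print("trained prefix:", accuracy(train_prefix(DATA), DATA))
```

On this toy data the learned prefix corrects the frozen model's threshold without touching its weights. That is the kind of gain the real experiment is looking for, with the caveat that here it is measured on the same examples it was trained on.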

Why formatting matters for Gemma

Gemma instruction-tuned models use a particular chat format with explicit turn markers:

<start_of_turn>user
Question here<end_of_turn>
<start_of_turn>model

Gemma IT models also do not use a separate system role in the same way some other chat models do. System-like instructions need to be placed inside the user turn.

This is a small detail, but for prompt experiments it is not cosmetic. If you train or evaluate with the wrong format, the prefix can end up learning around a formatting error. Then the result is not about reasoning or prompt optimization; it is about fixing the input template.

So the experiment needs three conditions:

Raw task format.
Gemma chat-template format.
Prefix-tuned version of the correct format.

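Only the last comparison, the prefix-tuned model against the correctly formatted baseline, is meaningful. A gain over the raw format may just be the prefix repairing the template.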
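The turn format described in this section can be sketched as a small helper. This is a hand-rolled illustration with a hypothetical function name; the tokenizer's own chat template is the authority, and it typically also prepends a <bos> token that this sketch leaves out.

```python
from typing import Optional


def format_gemma_turn(user_text: str, system_text: Optional[str] = None) -> str:
    """Build a single-turn Gemma-style prompt that ends with an open model turn.

    System-like instructions are folded into the user turn, since there is
    no separate system role to put them in.
    """
    content = user_text if not system_text else f"{system_text}\n\n{user_text}"
    return (
        "<start_of_turn>user\n"
        f"{content}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


if __name__ == "__main__":
    print(format_gemma_turn("Question here", system_text="Answer with one word."))
```

In practice I would rather call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` from transformers and compare its output against a helper like this, so that any mismatch shows up before training starts.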

What prefix tuning tells us

The result I care about is not just final accuracy. It is the gap between:

the base prompted model;
a hand-written or DSPy-optimized prompt;
the soft-prefix tuned model.

Those three numbers answer different questions.

The base prompt tells you what the model does with minimal steering.

The text optimizer tells you what a readable prompt search can find.

The soft prefix tells you what a trainable hidden prompt can find.

If the text optimizer closes most of the gap to the soft prefix, prompt search is doing real work. If the soft prefix improves a lot but text search does not, then the task has steering signal that current discrete prompt optimizers are missing. If neither improves, it is probably time to change the model, data, or task framing.

Figure: Calibration chart comparing the observed Gemma baseline with planned text optimizer and soft prefix measurements. The baseline is the only measured value in this plot. The dashed bars are the comparisons the experiment is designed to make: readable prompt search versus a trainable hidden prompt.
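The three-way comparison can be reduced to a single ratio: the share of the soft-prefix gain that the readable prompt recovers. The sketch below is one way to compute it. The function name and the `min_gap` guard are my own assumptions, and the text-optimizer and soft-prefix numbers in the usage example are placeholders, not results.

```python
from typing import Optional


def recovered_fraction(
    base: float,
    text_opt: float,
    soft_prefix: float,
    min_gap: float = 0.0,
) -> Optional[float]:
    """Share of the soft-prefix gain over the base that the text prompt recovers.

    Returns None when the soft prefix does not beat the base by more than
    min_gap, because then there is no steering signal to recover.
    """
    soft_gain = soft_prefix - base
    if soft_gain <= min_gap:
        return None
    return (text_opt - base) / soft_gain


if __name__ == "__main__":
    # 0.516 is the measured baseline; 0.54 and 0.60 are HYPOTHETICAL placeholders.
    print(recovered_fraction(base=0.516, text_opt=0.54, soft_prefix=0.60))
```

A value near 1 would mean text search is closing most of the gap, a value near 0 would mean the signal is there but text search is missing it, and None corresponds to the case where neither improves. Setting `min_gap` to a couple of standard errors keeps noise from being read as signal.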

The practical loop I want

The longer-term loop looks like this:

Pick a small model and a benchmark where the baseline is reproducible.
Run a clean lm-eval baseline.
Train a soft prefix on a train split.
Evaluate on validation.
Run a discrete optimizer such as DSPy on the same split.
Compare text prompt gains to soft-prefix gains.

That gives prompt optimization a calibration target. Not “did my clever instruction sound better?”, but “how much of the trainable steering signal did my readable prompt recover?”

Figure: Diagram of the Gemma prefix tuning measurement loop. The loop keeps formatting, soft-prefix training, validation, and readable prompt search as separate checks so a gain can be attributed to the right part of the system.

This is also useful when deciding whether to spend time on prompt optimization at all. For many tasks, prompting is the wrong lever. Prefix tuning can make that visible earlier.

What I would improve next

The current experiment is a first pass. The obvious next steps are:

use a cleaner train/validation split for WinoGrande;
add BoolQ or another task with a different scoring structure;
track whether the learned prefix overfits;
compare against DSPy prompt optimization directly;
try multiple prefix lengths and learning rates;
inspect whether prompt gains transfer across nearby tasks.

The important part is not that prefix tuning is magic. It is that it gives a way to make prompt optimization less mystical. If the prompt is an interface to a model, then we should be able to measure the capacity of that interface.

That is the thread I want to pull on.