Turning Gemma 4 Into an Embedding Model

Gemma4Vec started as a simple question:

Can we turn Gemma 4 itself into a useful text embedding model?

Not by training a separate encoder. Not by treating embeddings as an afterthought. The experiment was to adapt a decoder language model into a retrieval model and keep the training and evaluation loop honest enough that the result would mean something.

The default route became a contrastive fine-tuning setup for google/gemma-4-E2B-it.

The first working recipe

The first recipe was deliberately low risk:

  1. Keep Gemma as a causal decoder.
  2. Append an explicit EOS token.
  3. Pool the final hidden state at the EOS/last token.
  4. Add LoRA.
  5. L2-normalize the output embedding.
  6. Train with query-positive contrastive loss and hard negatives.
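Steps 2, 3, and 5 can be sketched in a few lines. This is a minimal NumPy illustration, not the Gemma4Vec code: the function names are made up for the example, random numbers stand in for Gemma's hidden states, and the batch is assumed to be right-padded so that the last real token is the appended EOS.

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token in each sequence.

    hidden:         float array, shape (batch, seq_len, dim)
    attention_mask: 0/1 int array, shape (batch, seq_len), right-padded (assumption)
    """
    # Index of the last real token; with an appended EOS this is the EOS position.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy input: 2 sequences, 4 positions, 3 dims; the second has 2 real tokens.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

emb = l2_normalize(last_token_pool(hidden, mask))
```

With left-padded batches the index arithmetic changes (the last position is always the real one), so the padding side is worth pinning down before trusting a pooler like this.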

That recipe is close in spirit to decoder-to-embedding systems like F2LLM/Qwen-style approaches: keep the decoder mostly intact, then teach its final state to act like a sentence embedding.

The first canary run with a clearly positive result used a German-heavy F2LLM-style retrieval slice:

  • 94,981 clean retrieval pairs;
  • 4 hard negatives per pair;
  • one ministry-supported accelerator run;
  • about 6 hours of training;
  • GermanDPR NDCG@10 around 0.52278;
  • GermanSTS test around 0.6446;
  • local retrieval Acc@1 around 0.9414.

That did not prove broad embedding quality. It proved something narrower and more useful: the plumbing worked, the objective had signal, and Gemma 4 could be steered toward retrieval.

Why local smoke tests are not enough

Local retrieval smoke tests can be misleading.

If each row contains a query, a positive passage, and a few hard negatives, then ranking the positive above those negatives tells you whether the model learned the local row structure. It does not tell you whether the model is a good general embedder.
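To make the limitation concrete, here is a rough sketch of what a row-level Acc@1 check computes. The function name and array layout are assumptions for illustration; the point is that each query only competes against its own handful of candidates, never against a full corpus.

```python
import numpy as np

def local_acc_at_1(query_emb, pos_emb, neg_emb):
    """Fraction of rows where the positive outscores every hard negative in its row.

    query_emb: (rows, dim), pos_emb: (rows, dim), neg_emb: (rows, k, dim).
    Embeddings are assumed to be L2-normalized, so dot product = cosine similarity.
    """
    pos_scores = np.sum(query_emb * pos_emb, axis=1)          # (rows,)
    neg_scores = np.einsum("rd,rkd->rk", query_emb, neg_emb)  # (rows, k)
    return float(np.mean(pos_scores > neg_scores.max(axis=1)))

# Two toy rows with k=2 negatives each.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[1.0, 0.0], [0.6, 0.8]])
n = np.array([
    [[0.0, 1.0], [0.6, 0.8]],   # row 0: both negatives score below the positive
    [[0.0, 1.0], [1.0, 0.0]],   # row 1: the first negative beats the positive
])

acc = local_acc_at_1(q, p, n)   # row 0 is a hit, row 1 is a miss
```

The candidate pool here is 1 + k documents per query. A model can score very well on this and still rank poorly once thousands of unrelated passages are in play.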

So the validation stack had to separate:

  • training and validation loss;
  • local retrieval rank tests;
  • STS sanity checks;
  • compact NanoBEIR retrieval;
  • full English NanoBEIR follow-up.

The main metric became NDCG@10 on compact NanoBEIR-style retrieval. This is still a proxy, but it is much harder to fool than a local row-level smoke test.
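For reference, NDCG@10 for a single query can be written as below. This is a sketch of the standard linear-gain definition; whether the NanoBEIR tooling used for these runs matches it exactly (gain function, tie handling) is an assumption, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, relevant, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: doc ids sorted by descending model score over the whole corpus
    relevant:       dict of doc id -> graded relevance (ids not listed count as 0)
    """
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevant.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# The single relevant document lands at rank 2 instead of rank 1.
score = ndcg_at_k(["d3", "d1", "d7"], {"d1": 1})
```

Unlike the row-level check, the ranking here is over the full corpus for the dataset, and the reported number is the mean over queries and then over datasets.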

Figure: validation stack diagram, from training loss through full NanoBEIR retrieval.
The point of the validation stack is not more metrics for their own sake. It is to stop local wins from being mistaken for retrieval quality.

The recipe matrix

Once the baseline worked, the interesting question became recipe choice.

I tested several families:

  • causal_last_token_native_dim: causal mask, EOS/last-token pooling, no projection.
  • llm2vec_bidirectional_mean: bidirectional attention with mean pooling.
  • nv_embed_bidirectional_latent_cross_attention: bidirectional attention with NV-Embed-style latent cross-attention pooling.
  • causal2vec_contextual_token: inject a contextual token from a frozen encoder.
  • multi_layer_trainable_pooling_bidirectional: trainable pooling across hidden layers.
  • echo_delta_causal: a novel causal readout based on repeated input and hidden-state deltas.
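The pooling half of llm2vec_bidirectional_mean is the simplest of these to write down. The sketch below shows masked mean pooling in NumPy with an illustrative function name; the bidirectional attention itself lives inside the decoder's attention mask and is not shown.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token hidden states, ignoring padding positions.

    hidden:         (batch, seq_len, dim)
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # guard against all-padding rows
    return summed / counts

hidden = np.arange(12, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)
```

Mean pooling only makes sense once every token can see the whole input. Under a causal mask, early tokens have not seen the rest of the text, which is why the causal recipe reads the last token instead.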

Some of these were useful because they failed.

The trainable multi-layer pooler trained cleanly, then collapsed on compact NanoBEIR. Echo-Delta passed the local smoke tests but did not transfer to external retrieval. Causal2Vec trailed the simpler active recipes. These are the kinds of negative results that save compute.

The useful contenders narrowed to:

  • simple bidirectional mean pooling;
  • NV-Embed-style latent cross-attention pooling;
  • causal last-token pooling as a baseline.
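For readers who have not seen latent cross-attention pooling, here is a stripped-down sketch. It follows my understanding of the NV-Embed description (token hidden states act as queries, a trainable latent array supplies keys and values, and the result is mean-pooled) and leaves out the learned projections, multiple heads, and the MLP. It is not the Gemma4Vec implementation, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_attention_pool(hidden, attention_mask, latents):
    """Single-head latent cross-attention followed by masked mean pooling.

    hidden:  (batch, seq_len, dim)  token states, used as attention queries
    latents: (num_latents, dim)     trainable array, used as keys and values
    """
    dim = hidden.shape[-1]
    scores = hidden @ latents.T / np.sqrt(dim)   # (batch, seq_len, num_latents)
    weights = softmax(scores, axis=-1)           # each token attends over the latents
    attended = weights @ latents                 # (batch, seq_len, dim)
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    return (attended * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
latents = rng.normal(size=(32, 8))               # 32 latents, randomly initialized here

pooled = latent_attention_pool(hidden, mask, latents)
```

The latent count is the capacity knob: each token's output is a convex combination of the latent rows, so more latents give the pooler a larger dictionary to mix from.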

The 250k and 2M lessons

On an earlier 250k-row run, llm2vec_bidirectional_mean looked strongest on compact NanoBEIR with mean NDCG@10 around 0.55894. The causal baseline was close at 0.54972. The NV latent recipe was also in the mix.

Then scale-up complicated the story.

A 2M-row LLM2Vec-style scale-up completed cleanly but did not beat the best 250k runs (the NV and LLM2Vec pair) on external retrieval. Validation loss improved; NanoBEIR did not. That was the key warning:

More rows on the same objective are not automatically the right next move.

Later 2M NV and causal scale-ups made the recommendation more nuanced. A 32-latent NV recipe became the best stage-1 base in one regime. Then a setup with more latents, a larger batch, and 8 hard negatives per pair (k8) shifted the winner again:

  • NV128 b64 k8 all_packed/gcon reached a final compact NanoBEIR score around 0.579368.
  • It beat the earlier 250k k4 control and the simpler candidate-packed fallback.
  • Short parameter probes with NV256, NV512, altered hidden size, and learning-rate changes did not beat it.

The pattern is useful: the best recipe was not a single architectural trick. It was a bundle of choices that worked together: latent count, batch size, negative count, forward packing, and gradient checkpointing.
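One way to keep such a bundle comparable across runs is to treat it as a single frozen config object. The sketch below is hypothetical: the class and field names are invented, and reading "b64" as batch size 64 and "gcon" as gradient checkpointing enabled is my interpretation of the run tag.

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class RecipeConfig:
    num_latents: int              # "NV128" -> 128 latent vectors in the pooler
    batch_size: int               # "b64"; per-device vs global is not specified
    hard_negatives: int           # "k8" -> hard negatives per query-positive pair
    forward_mode: str             # e.g. "all_packed"
    gradient_checkpointing: bool  # "gcon" (assumed meaning)

    def tag(self) -> str:
        """Rebuild a short run tag from the fields."""
        parts = [f"NV{self.num_latents}", f"b{self.batch_size}",
                 f"k{self.hard_negatives}", self.forward_mode]
        if self.gradient_checkpointing:
            parts.append("gcon")
        return "_".join(parts)

def changed_fields(a: RecipeConfig, b: RecipeConfig) -> dict:
    """Map each differing field to its (a, b) values."""
    da, db = asdict(a), asdict(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}

best = RecipeConfig(128, 64, 8, "all_packed", True)
# Purely illustrative variant that changes one field; not a run from this post.
variant = replace(best, hard_negatives=4)
delta = changed_fields(best, variant)
```

When two runs differ in more than one field, the score difference cannot be attributed to any single change, which is exactly the situation the bundle describes.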

Figure: compact NanoBEIR NDCG chart comparing causal pooling, LLM2Vec mean pooling, and NV128 b64 k8.
Compact NanoBEIR was the first place where the recipe choice became meaningfully visible. The best result here was a bundle of architectural and training choices, not just a larger dataset.

The engineering matters

The repo ended up with a structure that made this kind of iteration possible:

gemma4vec/
tasks/contrastive_embedder/
context/docs/

The task code supports:

  • causal and bidirectional attention;
  • EOS/last-token pooling;
  • mean pooling;
  • latent cross-attention pooling;
  • LoRA;
  • explicit hard negatives;
  • in-batch and cross-device negatives;
  • packed forward modes;
  • single-node DDP;
  • checkpoint evaluation.
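As an illustration of how explicit hard negatives and in-batch negatives combine in one objective, here is a minimal InfoNCE-style loss in NumPy. It is a sketch under assumptions: the temperature of 0.05 is a placeholder, the function name is made up, and cross-device negatives (gathering positives from other GPUs) are not shown.

```python
import numpy as np

def info_nce_loss(q, p, n, temperature=0.05):
    """Contrastive loss with in-batch negatives plus per-row hard negatives.

    q: (B, D) queries, p: (B, D) positives, n: (B, K, D) hard negatives.
    All embeddings are assumed L2-normalized.
    """
    batch = q.shape[0]
    in_batch = q @ p.T                        # (B, B); the diagonal holds true pairs
    hard = np.einsum("bd,bkd->bk", q, n)      # (B, K); each query vs its own negatives
    logits = np.concatenate([in_batch, hard], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct class for row i is column i (its own positive).
    return float(-np.mean(log_probs[np.arange(batch), np.arange(batch)]))

# Toy data: queries match their positives exactly; negatives are orthogonal.
q = np.eye(5)[:3]
p = q.copy()
n = np.tile(np.eye(5)[3:5], (3, 1, 1))        # shape (3, 2, 5)

good_loss = info_nce_loss(q, p, n)
bad_loss = info_nce_loss(q, np.roll(p, 1, axis=0), n)  # positives misaligned
```

Every extra hard negative and every extra batch row adds a column to the softmax, which is one reason batch size and negative count interact with each other rather than acting as independent knobs.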

This is not glamorous, but it matters. If recipe changes require rewriting the whole trainer, you cannot compare them cleanly. If generated checkpoints and logs pollute the repo, you cannot reason about what changed. If every result lives only in terminal scrollback, you lose the thread.
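A small append-only results log is one way to avoid the scrollback problem. The sketch below uses only the standard library; the file layout and helper names are assumptions rather than what the repo actually does. The two logged values are the 250k compact NanoBEIR numbers quoted earlier in this post.

```python
import json
import tempfile
import time
from pathlib import Path

def log_result(path, run_name, metric, value, **extra):
    """Append one evaluation result as a JSON line and return the record."""
    record = {"run": run_name, "metric": metric, "value": value,
              "logged_at": time.time(), **extra}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def load_results(path):
    """Read every logged record back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# A temporary directory stands in for a results folder kept outside the repo.
log_path = Path(tempfile.mkdtemp()) / "results.jsonl"
log_result(log_path, "llm2vec_bidirectional_mean", "compact_nanobeir_ndcg@10",
           0.55894, rows="250k")
log_result(log_path, "causal_last_token_native_dim", "compact_nanobeir_ndcg@10",
           0.54972, rows="250k")

results = load_results(log_path)
best_run = max(results, key=lambda r: r["value"])["run"]
```

Because each line is self-describing, later runs can be appended without migrating a schema, and comparisons become a filter over the file instead of a search through terminal history.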

What I currently believe

Gemma4Vec convinced me that decoder LLMs can be converted into useful embedding models with a relatively small amount of targeted training. But it also made me less impressed by vague claims like “we trained on more data” or “we added a learned pooler.”

The important questions are:

  • Does external retrieval improve, not just validation loss?
  • Does the recipe transfer beyond local hard-negative rows?
  • Does increasing capacity help, or just make the run more expensive?
  • Is the benchmark measuring the product direction you care about?
  • Is the negative result strong enough to stop a bad branch?

The current best path is not to blindly scale all of F2LLM. It is to lock a recipe only after compact and full-suite validation agree, then spend compute where it changes the external retrieval curve.
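The locking rule can be stated as a tiny gate. This is a hypothetical helper, and the recipe names and scores in the example are toy placeholders, not results from any run.

```python
def recipe_to_lock(compact_scores, full_scores):
    """Return the winning recipe only if both suites pick the same one, else None.

    Both arguments map recipe name -> mean NDCG@10 on that suite.
    Only recipes evaluated on both suites are considered.
    """
    shared = compact_scores.keys() & full_scores.keys()
    if not shared:
        return None
    best_compact = max(shared, key=lambda r: compact_scores[r])
    best_full = max(shared, key=lambda r: full_scores[r])
    return best_compact if best_compact == best_full else None

# Toy placeholder numbers for illustration only.
agree = recipe_to_lock({"recipe_a": 0.58, "recipe_b": 0.56},
                       {"recipe_a": 0.61, "recipe_b": 0.60})
disagree = recipe_to_lock({"recipe_a": 0.58, "recipe_b": 0.56},
                          {"recipe_a": 0.59, "recipe_b": 0.60})
```

A stricter version would also require the margin on each suite to exceed run-to-run noise, but even this crude check blocks the case where the compact proxy and the full suite point in different directions.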

Why this matters for hyper³labs

hyper³labs is about embedding models and the tools to understand them. Gemma4Vec is part of that loop. It is not just a model-training exercise. It is a way to ask:

  • What does a retrieval model learn?
  • Where does it fail?
  • Which proxy metrics are honest?
  • Which architecture changes are worth compute?

That is the same loop I want in product form:

  1. Inspect the model.
  2. Find failure structure.
  3. Change data, objective, or architecture.
  4. Validate on a benchmark that matters.
  5. Repeat without fooling yourself.

Gemma4Vec is a research workbench for that loop.