LLMs as Search Operators

Published on June 7, 2026

The most interesting use of LLMs in coding is not “ask it once and hope.”

It is search.

A model proposes a candidate. A real evaluator scores it. The next candidate uses the failures of the previous one. Repeat until the evaluator says the system is better.

That is the pattern I kept running into while working with DSPy, generated scripts, and GEPA-style evolutionary optimization.

The Lyceum regression challenge

One of the cleanest examples was a small tabular regression challenge.

The task looked simple:

training_data.csv had 800 rows, 10 numerical features, and a target value.
test_features.csv had 200 rows and no labels.
The submission target was a CSV of predicted values.
The public feedback was RMSE.

This is exactly the kind of problem where one-shot LLM coding is usually weak. The answer is not a clever paragraph. It is an empirical loop:

Load the data correctly.
Generate a modelling script.
Run cross-validation.
Record RMSE and failures.
Feed the result back into the next generation.
Keep the best candidate.

The useful system is not “LLM writes model.” It is “LLM proposes code under a scoring harness.”

The loop works because the model proposes candidates while the evaluator judges them. That separation keeps iteration honest: an improvement only counts if it shows up in a measured score.
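A minimal sketch of that loop, under stated assumptions: `propose` stands in for the LLM call and `evaluate` for the real scoring harness. Both names and the `Attempt` record are hypothetical, not taken from any library, and the toy functions at the bottom exist only so the sketch runs without a model or a dataset. Lower scores are better, as with RMSE.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
import random


@dataclass
class Attempt:
    candidate: float
    score: Optional[float]  # None when the candidate failed to run
    error: Optional[str]    # failure note that is fed back to the proposer


def search(
    propose: Callable[[List[Attempt]], float],
    evaluate: Callable[[float], float],
    rounds: int,
) -> Tuple[Optional[Attempt], List[Attempt]]:
    """Propose, score, remember. Returns the best attempt and the full history."""
    history: List[Attempt] = []
    best: Optional[Attempt] = None
    for _ in range(rounds):
        candidate = propose(history)
        try:
            attempt = Attempt(candidate, evaluate(candidate), None)
        except Exception as exc:  # a crash is feedback, not the end of the search
            attempt = Attempt(candidate, None, repr(exc))
        history.append(attempt)
        if attempt.score is not None and (best is None or attempt.score < best.score):
            best = attempt
    return best, history


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)


def toy_propose(history: List[Attempt]) -> float:
    scored = [a for a in history if a.score is not None]
    if not scored:
        return rng.uniform(-5.0, 5.0)
    best_so_far = min(scored, key=lambda a: a.score)
    return best_so_far.candidate + rng.gauss(0.0, 1.0)  # mutate the current best


def toy_evaluate(x: float) -> float:
    if x < 0:
        raise ValueError("negative candidates are invalid in this toy")
    return (x - 3.0) ** 2  # lower is better, like RMSE


best, history = search(toy_propose, toy_evaluate, rounds=50)
```

The point of the shape is that failures land in the same history as scores, so the proposer sees both.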

The first failure mode: generated code is not a program yet

Early iterations failed for boring reasons. The generated code had indentation problems or returned a function body in the wrong shape. That is still useful information.

If the generation step cannot reliably produce executable candidates, the evaluator never gets to judge the modelling idea. The first layer of the system has to be a compiler/test harness:

Does the script import?
Does train_and_predict exist?
Does it return the right shape?
Does it run under K-fold CV?
Does it fail quickly when dependencies are missing?

Only after that can you care whether it chose CatBoost, XGBoost, ExtraTrees, feature transforms, or an ensemble.
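A sketch of that first layer, not the harness actually used. It assumes a candidate is a Python source string defining `train_and_predict(X_train, y_train, X_test)` that returns one prediction per test row; that signature is an assumption, since only the function name is known. It uses plain lists and a simple interleaved K-fold split so it has no dependencies. Note that `exec` runs in-process and is not a sandbox; a real harness would more likely run candidates in a subprocess with a timeout.

```python
import math
import traceback
from typing import List, Optional, Tuple


def check_candidate(
    source: str, X: List[List[float]], y: List[float], k: int = 5
) -> Tuple[Optional[float], Optional[str]]:
    """Return (cv_rmse, None) on success or (None, error_message) on failure."""
    namespace: dict = {}
    try:
        # Catches syntax errors, bad indentation, and missing imports.
        exec(compile(source, "<candidate>", "exec"), namespace)
    except Exception:
        return None, "load failed:\n" + traceback.format_exc(limit=1)

    fn = namespace.get("train_and_predict")
    if not callable(fn):
        return None, "train_and_predict is not defined"

    n = len(X)
    squared_errors: List[float] = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        try:
            preds = fn(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            preds = [float(p) for p in preds]
        except Exception:
            return None, f"fold {fold} crashed:\n" + traceback.format_exc(limit=1)
        if len(preds) != len(test_idx):
            return None, f"fold {fold}: expected {len(test_idx)} predictions, got {len(preds)}"
        squared_errors += [(p - y[i]) ** 2 for p, i in zip(preds, test_idx)]
    return math.sqrt(sum(squared_errors) / n), None


# Two hypothetical candidates: a mean predictor and one with broken indentation.
GOOD = (
    "def train_and_predict(X_train, y_train, X_test):\n"
    "    mean = sum(y_train) / len(y_train)\n"
    "    return [mean for _ in X_test]\n"
)
BAD = "def train_and_predict(X_train, y_train, X_test):\nreturn []\n"

X_demo = [[float(i)] for i in range(20)]
y_demo = [2.0] * 20
good_rmse, good_error = check_candidate(GOOD, X_demo, y_demo)
bad_rmse, bad_error = check_candidate(BAD, X_demo, y_demo)
```

The error string is as important as the score: it is what gets fed back into the next generation.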

The best working iteration in the local history used a pragmatic ensemble-style approach and reached cross-validation RMSE around 0.4587. That number is less important than the process that produced it: generate, execute, score, remember.

Why DSPy is the right abstraction

DSPy is useful because it turns “prompting” into modules, signatures, optimizers, and feedback. You can describe a generation task, run examples through it, and optimize the instructions or program structure around a metric.

For generated-code search, the shape is natural:

task description + previous attempts -> candidate implementation
candidate implementation + evaluator -> score
score + failure notes -> next attempt

The LLM is not the evaluator. The evaluator is Python, cross-validation, an offline game simulator, an API score, or a benchmark.

This distinction matters. LLMs are good at proposing structured variants. They are bad at knowing whether a variant actually works without being checked.
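To make the third arrow concrete, here is a hedged sketch of how scores and failure notes might be rendered into the text the next proposal sees. This is plain Python, not DSPy's API; `build_prompt`, the tuple layout, and the example values are all hypothetical.

```python
from typing import List, Optional, Tuple

# (candidate code, score or None if it did not run, failure note or None)
Attempt = Tuple[str, Optional[float], Optional[str]]


def build_prompt(task: str, attempts: List[Attempt], keep: int = 3) -> str:
    """Render the task plus the most recent attempts as input for the next proposal."""
    lines = [f"Task: {task}", ""]
    for number, (code, score, note) in enumerate(attempts[-keep:], start=1):
        outcome = f"score={score:.4f}" if score is not None else "did not run"
        lines.append(f"Attempt {number}: {outcome}")
        if note:
            lines.append(f"Failure note: {note}")
        lines.append(code.strip())
        lines.append("")
    lines.append("Propose an improved implementation.")
    return "\n".join(lines)


# Made-up attempts, for illustration only.
demo_prompt = build_prompt(
    "Minimise cross-validation RMSE on a 10-feature regression task.",
    [
        ("def train_and_predict(...): ...", None, "IndentationError on line 2"),
        ("def train_and_predict(...): ...", 0.512, None),
    ],
)
```

Keeping only the last few attempts is a design choice to bound prompt length; a real system might instead keep the best few.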

The Berghain-style online decision problem

The second useful example was an online decision game.

You see one person at a time. Each person has binary attributes. You can accept or reject them. The venue has fixed capacity and minimum required counts for attributes. Decisions are irrevocable.

The core tension is obvious:

accept too many neutral people and you run out of slots;
reject too aggressively and you waste the stream;
rare attributes dominate the horizon;
correlated attributes create hidden overlap constraints.

This is more interesting than the tabular problem because you need a policy, not just predictions.

I modeled the state explicitly:

admitted count;
rejected count;
attribute deficits;
remaining capacity;
person history;
constraints;
scenario statistics.

Then I built an offline simulator as a drop-in replacement for the online API. The offline stream sampled attributes using the scenario marginals and correlations. That made policy iteration cheap.
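A minimal sketch of such an offline stream for the two-attribute case, assuming the scenario gives marginals `p_a`, `p_b` and a Pearson correlation `rho`. For two binary attributes the joint distribution is fully determined by those three numbers, because the covariance equals P(both) minus `p_a * p_b`. The function names are hypothetical, and the real simulator may well have sampled more attributes with a different method.

```python
import math
import random
from typing import Dict, Iterator, Tuple


def joint_probabilities(p_a: float, p_b: float, rho: float) -> Dict[Tuple[int, int], float]:
    """Joint distribution of two binary attributes with the given marginals and correlation."""
    p_both = p_a * p_b + rho * math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    probs = {
        (1, 1): p_both,
        (1, 0): p_a - p_both,
        (0, 1): p_b - p_both,
        (0, 0): 1 - p_a - p_b + p_both,
    }
    if min(probs.values()) < -1e-12:
        raise ValueError("this correlation is not achievable for these marginals")
    return {cell: max(0.0, p) for cell, p in probs.items()}


def offline_stream(p_a: float, p_b: float, rho: float, seed: int) -> Iterator[Dict[str, bool]]:
    """Endless seeded stream of people, each with attributes 'a' and 'b'."""
    rng = random.Random(seed)
    probs = joint_probabilities(p_a, p_b, rho)
    cells = list(probs)
    weights = list(probs.values())
    while True:
        a, b = rng.choices(cells, weights=weights)[0]
        yield {"a": bool(a), "b": bool(b)}


# Example with made-up scenario statistics.
demo_probs = joint_probabilities(0.3, 0.4, -0.2)
stream = offline_stream(0.3, 0.4, -0.2, seed=1)
first_people = [next(stream) for _ in range(5)]
```

Seeding the stream is what makes policy comparisons repeatable: two policies can be scored on exactly the same arrivals.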

The math that made the policy sane

Two ideas were useful.

First, feasibility has hard counting constraints. If the venue capacity is N and two attributes require at least m_a and m_b people respectively, then by inclusion-exclusion any feasible final crowd needs at least:

max(0, m_a + m_b - N)

people who have both attributes.

This matters when two required attributes are negatively correlated. Having enough of each attribute independently is not sufficient. You may also need enough people who carry both.
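The bound is small enough to write down directly. The sketch below is just the formula above as a function; the name `min_overlap` and the example numbers are made up for illustration. The same function can be applied mid-game by passing the remaining capacity and the remaining deficits instead of the initial values.

```python
def min_overlap(capacity: int, m_a: int, m_b: int) -> int:
    """Fewest people with both attributes that any feasible crowd must contain.

    Follows from |A or B| = |A| + |B| - |A and B| <= capacity.
    """
    return max(0, m_a + m_b - capacity)


# Hypothetical scenario: 1000 slots, each attribute needs 600 people.
overlap_at_start = min_overlap(1000, 600, 600)

# Mid-game: 150 slots left, still short 100 of one attribute and 80 of the other.
overlap_remaining = min_overlap(150, 100, 80)

# Loose constraints need no overlap at all.
overlap_loose = min_overlap(1000, 300, 400)
```

If the stream is unlikely to deliver that many people with both attributes, the scenario is effectively lost no matter how the rest of the policy behaves.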

Second, rare attributes define the horizon. If you still need r people with an attribute whose marginal probability is p, and arrivals are independent, the expected number of further arrivals before you have seen them all is:

r / p

That negative-binomial intuition helps explain why a rare “creative” attribute can dominate the whole strategy.

Figure: expected arrivals as deficit divided by marginal attribute probability. A rare attribute with even a small remaining deficit can control the policy horizon. At p=0.05, a deficit of five implies roughly 100 expected arrivals.
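The r / p estimate is easy to sanity-check with a small simulation. This sketch assumes independent arrivals with a fixed probability p; the function names are hypothetical.

```python
import random


def expected_arrivals(deficit: int, p: float) -> float:
    """Mean number of arrivals needed to see `deficit` people with the attribute."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return deficit / p


def simulate_arrivals(deficit: int, p: float, rng: random.Random) -> int:
    """Count arrivals until `deficit` people with the attribute have shown up."""
    arrivals = 0
    seen = 0
    while seen < deficit:
        arrivals += 1
        if rng.random() < p:
            seen += 1
    return arrivals


rng = random.Random(42)
analytic = expected_arrivals(5, 0.05)  # 100.0
trials = [simulate_arrivals(5, 0.05, rng) for _ in range(2000)]
simulated_mean = sum(trials) / len(trials)
```

The simulated mean should land close to the analytic value, but individual runs vary widely, which is one reason to score policies across many seeds instead of one.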

The policy that emerged was not a giant neural network. It was a reservation policy:

accept people who reduce outstanding deficits;
accept neutral people only if doing so preserves feasibility;
reject neutral people when they consume capacity needed for remaining deficits.

That kind of policy is understandable, fast, and testable. It also gives the LLM something structured to modify.
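A sketch of what such a reservation policy can look like, not the exact policy that emerged. The main assumption is how much capacity to reserve: this version reserves one slot per outstanding unit of deficit, which is conservative because a single person can cover several deficits at once. The names `decide` and `admit` are hypothetical.

```python
from typing import Dict


def decide(person: Dict[str, bool], deficits: Dict[str, int], remaining_capacity: int) -> bool:
    """Return True to accept the person, False to reject."""
    if remaining_capacity <= 0:
        return False
    # Accept anyone who reduces at least one outstanding deficit.
    if any(person.get(attr, False) and need > 0 for attr, need in deficits.items()):
        return True
    # Neutral person: accept only if the slots left afterwards still cover the deficits.
    reserved = sum(deficits.values())  # conservative: assumes no overlap between deficits
    return remaining_capacity - 1 >= reserved


def admit(person: Dict[str, bool], deficits: Dict[str, int]) -> Dict[str, int]:
    """Deficits after admitting the person (never below zero)."""
    return {
        attr: max(0, need - 1) if person.get(attr, False) else need
        for attr, need in deficits.items()
    }


# Made-up state: two more "creative" people are still required.
deficits = {"creative": 2}
helpful_accepted = decide({"creative": True}, deficits, remaining_capacity=5)
neutral_accepted = decide({"creative": False}, deficits, remaining_capacity=3)
neutral_rejected = decide({"creative": False}, deficits, remaining_capacity=2)
after_admit = admit({"creative": True}, deficits)
```

The reservation rule is the natural knob for an LLM to mutate: swapping the sum for a tighter bound trades safety for fewer rejections.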

Where GEPA-style evolution fits

GEPA-style optimization is attractive because it treats prompts/programs as candidates that can be mutated, selected, and improved using feedback.

For these tasks, the candidate is not a final answer. It is a strategy:

a script body for a regression model;
a decision policy for an online game;
a prompt or module for DSPy;
a lightweight heuristic with only a few interpretable knobs.

The evaluator is what keeps it grounded.

In a regression task, the evaluator is cross-validation RMSE. In the online game, it is average rejection count and success rate across seeds. In a prompt task, it might be exact match, retrieval NDCG, or a domain-specific score.

The LLM does the exploration. The evaluator does the judgment.
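For the online game, the evaluator can be as small as the sketch below. It assumes a hypothetical `run_episode(seed)` callable that plays one full game with the candidate policy and returns whether the constraints were met plus the number of rejections. Averaging rejections over successful episodes only is an assumption about how the two metrics are combined.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple


def evaluate_policy(
    run_episode: Callable[[int], Tuple[bool, int]], seeds: Iterable[int]
) -> Dict[str, float]:
    """Score a policy across seeds: success rate and mean rejections on successes."""
    results = [run_episode(seed) for seed in seeds]
    if not results:
        raise ValueError("at least one seed is required")
    rejections_on_success = [rejected for ok, rejected in results if ok]
    return {
        "success_rate": len(rejections_on_success) / len(results),
        "mean_rejections": mean(rejections_on_success)
        if rejections_on_success
        else float("inf"),
    }


def stub_episode(seed: int) -> Tuple[bool, int]:
    """Deterministic stand-in for a real simulator run (illustration only)."""
    return seed % 2 == 0, 100 + seed


report = evaluate_policy(stub_episode, range(4))
```

Returning both numbers keeps a policy from looking good by failing fast: a low rejection count means little if the success rate collapses.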

The pattern I want to reuse

The pattern is:

Build a simulator or local evaluator first.
Define a strict candidate interface.
Let the LLM generate candidates inside that interface.
Execute candidates in a sandboxed harness.
Save scores, errors, and the best candidate.
Feed back concrete failure information.
Repeat.

The most important design choice is the interface. If candidates can be arbitrary, you spend all your time debugging format errors. If the interface is too narrow, the optimizer cannot discover anything new.

Good interfaces are somewhere in the middle: constrained enough to run, expressive enough to surprise you.

What this has to do with building products

This is how I increasingly think about AI agents.

The agent should not be a chat window that gives advice. It should be a search process wrapped around real tools:

code execution;
tests;
local simulators;
benchmark runs;
data probes;
browser checks;
deployment checks.

For hyper³labs, the same idea appears in embedding work. A model is only useful if it improves a real retrieval metric. A visualization is only useful if it helps you find a fixable failure. A generated candidate is only useful if it survives the evaluator.

The common thread is not “LLMs automate everything.” It is:

Make the loop tight enough that ideas hit reality quickly.

That is where LLMs become useful search operators instead of confident autocomplete.