LLMs vs. Classical HPO: Where AI Agents Fall Short — and How a Hybrid Wins
A benchmark study pitting LLM agents against CMA-ES and TPE on hyperparameter tuning finds classical methods still lead — but a hybrid approach called Centaur closes the gap with a surprisingly small model.
The question of whether large language models can replace classical optimization algorithms in ML workflows now has a direct empirical answer from Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, and Arber Zela in a new arXiv paper (arXiv:2603.24647). Short version: not yet — but the exact shape of the failure is illuminating, and their proposed fix is worth examining closely.
The Testbed and the Contenders
The researchers ran experiments using the autoresearch repository, which lets an LLM agent optimize hyperparameters by editing training source code directly. The target workload is tuning a small language model under a fixed compute budget — a realistic constraint, not an open-ended playground.
The classical contenders are serious: CMA-ES (Covariance Matrix Adaptation Evolution Strategy), a second-order evolutionary optimizer that explicitly models the correlation structure between hyperparameters, and TPE (Tree-structured Parzen Estimator), the algorithm underlying Optuna and Hyperopt. Neither is a straw-man baseline: both are long-established and widely used in AutoML pipelines.
Two LLM configurations are evaluated: one where the LLM selects configurations from a fixed search space (matching what CMA-ES and TPE see), and one where the LLM can freely edit the training source code — the regime that should play to its strengths.
Why Classical Methods Win in the Fixed-Space Regime
When the search space is defined upfront, CMA-ES and TPE consistently outperform LLM-based agents. The researchers' diagnosis is precise and a bit surprising: avoiding out-of-memory failures matters more than search diversity in this setting. Classical probabilistic models naturally stay within feasible configuration regions as they converge. LLM agents, by contrast, are more prone to proposing configurations that exceed available memory, and every failed trial still draws down the fixed compute budget.
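One way to picture the out-of-memory point is a feasibility check that runs before any trial is launched. The sketch below is an illustration only, not the paper's method: the `Config` fields, the memory formula, and the constants in it are all assumptions chosen to make the idea concrete, and a real pipeline would calibrate the estimate against measured peak memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical hyperparameter configuration for a small transformer."""
    n_layers: int
    d_model: int
    batch_size: int
    seq_len: int


def estimate_memory_gb(cfg: Config, bytes_per_value: int = 4) -> float:
    """Very rough memory estimate (an assumption, not a measured model).

    Parameters: about 12 * n_layers * d_model^2 values, multiplied by 4
    to cover weights, gradients, and two Adam moment buffers.
    Activations: batch * seq_len * d_model values per layer, multiplied
    by an assumed factor of 16 for intermediate tensors.
    """
    params = 12 * cfg.n_layers * cfg.d_model ** 2
    param_bytes = 4 * params * bytes_per_value
    act_values = 16 * cfg.n_layers * cfg.batch_size * cfg.seq_len * cfg.d_model
    act_bytes = act_values * bytes_per_value
    return (param_bytes + act_bytes) / 1024 ** 3


def filter_feasible(proposals, budget_gb):
    """Split proposals into those inside and outside the memory envelope."""
    feasible, rejected = [], []
    for cfg in proposals:
        if estimate_memory_gb(cfg) <= budget_gb:
            feasible.append(cfg)
        else:
            rejected.append(cfg)
    return feasible, rejected
```

Rejected proposals never reach the training script, so they cost no compute. The same check could equally be used to send a rejected proposal back to the LLM with the estimated memory figure attached.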
Allowing the LLM to directly edit source code — where it can restructure training loops, swap optimizers, or touch architectural choices no fixed grid would expose — does narrow the gap. But it doesn't close it. According to the paper, this holds even with frontier models including Claude Opus 4.6 and Gemini 3.1 Pro Preview.
The deeper reason: LLMs struggle to track optimization state across trials. CMA-ES maintains an explicit probabilistic model updated by every evaluation: its mean vector, step size, and covariance matrix summarize what the algorithm has learned about the landscape. An LLM reasoning from scratch each trial has none of that. Without structured memory of past evaluations, it tends to revisit poor regions, propose redundant configurations, or miss convergence signals that would be obvious to any method that models the search history explicitly.
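To make "explicit state" concrete, here is a deliberately simplified evolution strategy in plain Python. It is not CMA-ES: it keeps only a diagonal variance instead of a full covariance matrix, leaves the step size fixed, and uses an assumed blending rate `lr`. All names are hypothetical. It does show the pattern the paragraph describes, where every batch of evaluations is folded back into a small state object that shapes the next batch.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class SearchState:
    """Explicit optimizer state carried from one batch of trials to the next."""
    mean: list    # current estimate of the best configuration
    sigma: float  # global step size (kept fixed in this sketch)
    var: list     # per-dimension variance (diagonal covariance only)
    history: list = field(default_factory=list)  # (candidate, loss) pairs


def ask(state, n, rng):
    """Sample n candidates around the mean using the stored state."""
    return [
        [m + state.sigma * math.sqrt(v) * rng.gauss(0.0, 1.0)
         for m, v in zip(state.mean, state.var)]
        for _ in range(n)
    ]


def tell(state, candidates, losses, lr=0.3):
    """Fold a batch of evaluations back into the state."""
    state.history.extend(zip(candidates, losses))
    ranked = [c for _, c in sorted(zip(losses, candidates), key=lambda p: p[0])]
    elite = ranked[: max(1, len(ranked) // 2)]
    old_mean = state.mean
    dims = range(len(old_mean))
    # Move the mean to the centroid of the better half of the batch.
    state.mean = [sum(c[i] for c in elite) / len(elite) for i in dims]
    # Blend each variance toward the squared size of the successful steps.
    for i in dims:
        step_sq = sum(
            ((c[i] - old_mean[i]) / state.sigma) ** 2 for c in elite
        ) / len(elite)
        state.var[i] = (1 - lr) * state.var[i] + lr * step_sq
```

A loop that alternates `ask` and `tell` never needs to reread old trials, because the mean and variances already summarize them. That summary is exactly what a stateless LLM call lacks.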
Centaur: Structured State as the Missing Link
The paper's most actionable contribution is Centaur, a hybrid that directly addresses the state-tracking gap. Rather than asking the LLM to infer search history from raw trial logs, Centaur explicitly hands CMA-ES's internal state — the current mean vector, step size, and covariance matrix — to the LLM at each iteration. The LLM then proposes informed candidates that are shaped by, but not constrained to, what CMA-ES would suggest on its own.
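The article does not reproduce the paper's prompt format, so the following is only a guess at what "handing over the state" might look like in practice. The function names, JSON field names, and prompt wording are all assumptions made for illustration. The sketch serializes the mean vector, step size, and covariance matrix, plus a few recent trials, into a structured block that can be placed in an LLM prompt.

```python
import json


def build_context(mean, sigma, cov, names, recent_trials, max_trials=5):
    """Render optimizer state as a compact JSON block for an LLM prompt.

    The field names and layout are illustrative assumptions, not the
    format used in the paper.
    """
    dim = len(names)
    if len(mean) != dim or len(cov) != dim or any(len(r) != dim for r in cov):
        raise ValueError("mean, names, and cov must agree on dimension")
    state = {
        "hyperparameters": list(names),
        "mean": {n: round(m, 6) for n, m in zip(names, mean)},
        "step_size": round(sigma, 6),
        "covariance": [[round(x, 6) for x in row] for row in cov],
        # Only the most recent trials are included to keep the prompt short.
        "recent_trials": list(recent_trials)[-max_trials:],
    }
    return json.dumps(state, indent=2)


def build_prompt(context_json):
    """Wrap the serialized state in a short instruction (hypothetical wording)."""
    return (
        "You are assisting a CMA-ES hyperparameter search.\n"
        "Current optimizer state:\n"
        + context_json
        + "\nPropose one candidate as a JSON object keyed by hyperparameter name."
    )
```

Keeping the state in JSON rather than free text makes it easy to validate on the way in and cheap for a small model to read. It also keeps the prompt size bounded regardless of how many trials have run.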
Centaur achieves the best results in the experiments, and the scale required is striking: a 0.8B-parameter LLM already outperforms all pure-classical and pure-LLM baselines. That's a small, self-hostable model, not a frontier API call.
The logic is clean. CMA-ES knows where it has been and what the loss landscape looks like so far. An LLM knows things CMA-ES never could: that warmup schedules tend to matter for transformer training, that certain batch sizes create memory cliffs, that gradient clipping is usually worth trying before deep weight decay tuning. Feed the CMA-ES state to the LLM and both kinds of knowledge become available simultaneously.
Scaling Behavior and Practical Takeaways
The paper traces model scaling from 0.8B up to frontier size. Bigger models help in Centaur, but returns diminish quickly: the 0.8B model already clears the competitive bar. For unconstrained code editing without the Centaur structure, larger models do become necessary, because the task demands more reliable code generation and the LLM must carry more of the optimization burden itself.
For developers building or evaluating AutoML pipelines, the operational guidance is fairly concrete:
- Fixed search space + standalone LLM: classical methods win; use CMA-ES or Optuna's TPE sampler
- Unconstrained code editing + standalone LLM: still trails classical methods, even with frontier models
- Centaur pattern: expose the optimizer's live internal state to the LLM, not just the trial history; a small model suffices
- OOM failures first: before worrying about search diversity, constrain LLM proposals to valid memory envelopes
The Centaur architecture suggests a concrete integration path for teams already using Optuna, Ax, or SMAC: instead of replacing the surrogate model with an LLM, surface the surrogate's current beliefs as structured context for a small LLM that proposes domain-informed corrections alongside the surrogate's own candidates.
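As a rough sketch of that integration path, the helper below merges an optimizer's own batch with a capped number of LLM proposals. It is a hypothetical design, not code from the paper or from Optuna, Ax, or SMAC: the function names, the `max_llm` cap, and the validation rules are assumptions. Malformed proposals are dropped and out-of-range values are clipped, so the LLM can nudge the search without being able to break it.

```python
import math


def clip_to_bounds(candidate, bounds):
    """Clamp each hyperparameter into its (low, high) range."""
    return {k: min(max(candidate[k], lo), hi) for k, (lo, hi) in bounds.items()}


def is_valid(proposal, bounds):
    """Accept only dicts with exactly the expected keys and finite numbers."""
    if not isinstance(proposal, dict) or set(proposal) != set(bounds):
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v)
        for v in proposal.values()
    )


def merge_candidates(optimizer_candidates, llm_proposals, bounds, max_llm=1):
    """Keep the batch size fixed, swapping in up to max_llm valid LLM proposals."""
    valid = [clip_to_bounds(p, bounds) for p in llm_proposals if is_valid(p, bounds)]
    # Never let LLM proposals outnumber the optimizer's own batch.
    valid = valid[: min(max_llm, len(optimizer_candidates))]
    kept = optimizer_candidates[: len(optimizer_candidates) - len(valid)]
    return [clip_to_bounds(c, bounds) for c in kept] + valid
```

Capping the LLM's share of each batch is a conservative choice: the classical optimizer still receives enough of its own samples to update its state, and the evaluated LLM candidates can be reported back to it like any other trial.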
The broader pattern here mirrors findings across ML automation research: LLMs add the most value not when they replace structured algorithms wholesale, but when they augment them with domain knowledge those algorithms cannot encode on their own. Classical optimizers excel at exploiting search history systematically. LLMs carry priors about what tends to work. The combination, done carefully, beats either alone.
Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.