Analysis of Youtu-Agent Team (2025), arXiv:2510.08191 (preprint). Generated on April 29, 2026.
If you have used a large language model agent to do something specialized — say, multi-step web research or contest math with a code interpreter — you have probably felt the gap between a frontier model's general capability and its domain-specific habits. The model knows how to reason; it just does not know which moves work in your domain, which tools to skip, which sources to trust. The textbook fix is reinforcement learning: collect ground-truth examples, run a few thousand training rollouts, and update the model's weights so high-reward behaviors become more likely. The recipe most teams reach for, made famous by DeepSeek's R1, is Group Relative Policy Optimization (GRPO) — a relative of PPO that drops the value-network critic and instead estimates per-output advantage by comparing each sample to the group mean of a small batch of rollouts. It works, and it is now the default tool for "agentic RL".
The Youtu-Agent team at Tencent argue that the vast majority of GRPO's value is mechanical — the multi-epoch rollout structure, the group-relative comparison, the iterative refinement — and that the gradient update at the end is one specific way to capture what was learned, not the only way. Their proposal: keep GRPO's structure, but replace the gradient with a natural-language experience buffer that gets read into the prompt. They call it Training-Free GRPO. The frozen base model takes the role of the policy; an evolving block of plain-text "lessons" takes the role of the learned weights.
The headline numbers are unflattering to the parameter-update orthodoxy. With about $18 in API spend, 100 training samples, and zero gradient updates, a frozen DeepSeek-V3.1-Terminus reaches 82.7% on AIME24 and 73.3% on AIME25 (Mean@32), beating the best 32B model fine-tuned via vanilla agentic RL — methods like ReTool [arXiv:2504.11536] and AFM [arXiv:2508.13167] that the paper estimates at ~$10K each. Web search gets the same treatment: WebWalkerQA pass@1 climbs from 63.2% to 67.8% with a different 100-sample buffer.
The fine print matters. The method depends on a strong base model — running the same recipe on QwQ-32B for web search actually hurts performance (-2.0 pass@1). It also depends on the buffer being learned with proper group-relative comparison: an ablation that just dumps directly-generated tips into the prompt gives essentially no gain (79.8% vs 80.0% baseline). The optimization machinery is doing real work; this is not a glorified prompt template.
What's at stake practically: if this generalizes, the cost calculus of building a vertical agent flips. Instead of paying $10K to fine-tune a model you have to host on a fixed-cost GPU cluster, you pay $20 to learn a small block of natural-language lessons (a few dozen entries, on the order of a few KB) that you append to API calls. For low-traffic specialized agents — most enterprise agents, in practice — that is a different business.
A short historical arc helps here. Until about 2017, "reinforcement learning for language models" mostly meant REINFORCE-style policy gradients with a learned baseline: high variance, painful to stabilize. PPO (Proximal Policy Optimization, Schulman et al. 2017) tamed that variance by clipping the per-step ratio between the new and old policy and adding a value-network critic to estimate per-token advantages. PPO is what powered the original RLHF wave at OpenAI and Anthropic, and it remains a standard choice for RLHF-style preference tuning.
PPO has a wart, though: the critic. Training a separate value network roughly doubles your active parameter count and adds its own optimization headaches. GRPO (Group Relative Policy Optimization), introduced by DeepSeek's mathematical-reasoning work in 2024 and now widely used in DeepSeek's reasoning-model lineage, throws the critic out. Its trick is to sample G outputs for the same prompt, score them all with the reward model, and define each output's advantage relative to its group's mean and standard deviation. No critic, no separate value model, just a small batch of rollouts and a normalized score. It is a remarkably clean idea, and it scales.
When the LLM-agents community adopted GRPO, they kept the gradient-update assumption intact. "Agentic RL" — as opposed to ordinary fine-tuning — usually means: build a tool-using ReAct loop, generate multi-step trajectories, score the trajectories, then GRPO-update the policy. The result is a domain-specialized model: ReTool for math with code interpreter, AFM for web research, and so on. Each one is good at its domain and worse than the base elsewhere. Each one costs roughly $10K to train and requires a dedicated GPU cluster to serve.
Now consider the cost calculus the paper opens with. Fine-tuning anything bigger than ~32B parameters is essentially out of reach for most teams (compute budget, but also data scarcity — annotated agentic trajectories are scarce). So you fine-tune a smaller model, get a domain-specific gain, but lose to the general-purpose frontier model on everything else. Worse, you pay continuous serving costs even when traffic is low. This is the cost-performance dilemma the authors put at the center of the paper: API access to a strong frozen model is cheap and elastic, but you can't fine-tune it; smaller models you can fine-tune are inherently weaker.
The paper's pivot is conceptual. The authors note that GRPO's parameter update is just one mechanism for steering the output distribution toward higher-reward behaviors. In-context learning — the GPT-3 finding that LLMs adapt their outputs based on prompt content — provides a second mechanism. If we can manufacture a chunk of text that, when prepended to the prompt, has the same effect on the output distribution as a weight update would have, we have built a non-parametric optimizer. They call this chunk the experiential knowledge buffer and treat it as the analog of θ. The optimization loop then becomes: roll out, compare, distill what worked into the buffer, repeat.
The "why now" is mundane but real. Two recent shifts make this feasible: (1) frontier API models like DeepSeek-V3.1 became cheap enough per token to run real multi-epoch rollouts via API; (2) those same models became coherent enough at long-context reasoning to play both roles in this loop — they are the policy generating rollouts and the judge introspecting on them and updating the buffer. Earlier, weaker models couldn't credibly self-introspect; now they can.
Before the algorithm, a one-sentence orientation: a rollout is a complete trajectory the agent produces for a given query (every reasoning step, every tool call, every final answer), and a group is the set of G rollouts the algorithm samples for the same query, so it can rank them against each other.
Vanilla GRPO produces, for each query q, a group of G outputs {o_1, …, o_G}. Each output gets a scalar reward r_i from a verifier (in math, a check whether the boxed answer is correct; in web search, a graded comparison to a ground-truth answer). The group-relative advantage is Â_i = (r_i − mean(r)) / std(r). The training loss is a PPO-clipped objective over these advantages, with a KL-divergence penalty pulling the policy back toward a fixed reference model. Gradient ascent on θ.
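To make the group-relative advantage concrete, here is a minimal sketch of the formula above for a single group. The function name, the use of the population standard deviation, and the `eps` guard for zero-spread groups are assumptions of this sketch rather than details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Return (r_i - mean(r)) / std(r) for one group of rollouts.

    Assumption: population standard deviation. A group whose rewards are
    all equal carries no contrast, so every advantage is set to 0.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma < eps:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# A group of G = 5 rollouts with binary rewards: three correct, two wrong.
print(group_relative_advantages([1, 0, 0, 1, 1]))
# A group where every rollout is correct yields no learning signal.
print(group_relative_advantages([1, 1, 1, 1, 1]))
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, and the advantages within a group sum to zero. The all-equal case is the one the training-free variant handles by skipping the group.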
Training-Free GRPO keeps the rollouts and the rewards, but replaces every step downstream of "I have G rewards". The policy is the frozen base model conditioned on an experiential knowledge buffer E — π_θ(o_i | q, E) — where E is plain text, initialized to the empty string. After scoring, the algorithm checks whether the group has both clear winners (correct answers) and clear losers (incorrect answers). If not, the group is skipped (analogous to the std(r) = 0 case in vanilla GRPO, where Â_i = 0). If yes, the LLM itself is asked to (1) summarize each rollout step-by-step, (2) compare them given the ground-truth answer, and (3) extract a natural-language experience — a concise piece of advice for what worked or what to avoid. This natural-language object is the semantic advantage A_text, the analog of Â_i.
The "optimization step" is then a buffer update. Given the existing E and the new A_text from this batch, the LLM is prompted to emit one of four operations: Add a new lesson, Delete a lesson that this batch contradicts, Modify an existing lesson with new nuance, or Keep E unchanged. The buffer typically grows to a few dozen entries — Appendix A of the paper shows a math buffer of 37 lessons (e.g. "When solving geometry problems with intersections, validate solutions lie within bounded regions") and a web-search buffer with similar style.
Two design notes worth pulling out. First, the frozen base model serves the function of the KL constraint in vanilla GRPO: because π_θ never changes, the policy cannot drift arbitrarily far from a coherent base — the buffer can only steer, not corrupt. Second, the same LLM (DeepSeek-V3.1-Terminus in the main experiments) plays three roles: policy that generates rollouts, judge that produces semantic advantages, and optimizer that emits buffer operations. There is no separate critic, no separate reward model, no separate optimizer. It is the same model called with different prompts.
Group size G is the parameter that gives Training-Free GRPO its name. The whole point of "group relative" is that comparing rollouts against each other generates a useful contrast: a judgment of why one path worked and another did not. With G = 1 there is nothing to compare to (and the paper's ablation confirms this collapses the gain). The interesting question is how large G needs to be before the contrast saturates; the main experiments use G = 5 for math and G = 3 for web search.
A minimal sketch of the inner loop, following the structure of the paper's pseudocode in §2 and abbreviated for readability. The `LLM` protocol is a stub interface: to run the class you still need to supply a concrete client and a reward function, but the shapes and step structure are correct.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class LLM(Protocol):
    """Stub interface for the frontier API client (an assumption of this sketch)."""

    def complete(self, prompt: str) -> str: ...

    def judge(self, *, query: str, rollouts: list[str], rewards: list[float],
              ground_truth: str, existing_buffer: str) -> str: ...

    def optimize_buffer(self, *, buffer: list[str], new_advantage: str) -> list[dict]: ...


@dataclass
class TrainingFreeGRPO:
    """One epoch of Training-Free GRPO: call step() once per training query.

    Sketch of the algorithm in Youtu-Agent (2025), §2. The same
    LLM is reused as policy (rollouts), judge (semantic advantage),
    and optimizer (buffer ops). No gradient is computed anywhere.

    The buffer E starts empty and grows into a few-dozen-line block
    of natural-language lessons that the next epoch's rollouts read.
    """

    llm: LLM                                # frontier API client
    reward_fn: Callable[[str, str], float]  # scores one trajectory against ground truth
    group_size: int = 5                     # paper used G=5 (math), G=3 (web)
    experience: list[str] = field(default_factory=list)

    def step(self, query: str, ground_truth: str) -> None:
        # 1. Roll out G trajectories conditioned on the current buffer.
        prompt = f"{self._render_buffer()}\n\nQuestion: {query}".lstrip()
        rollouts = [self.llm.complete(prompt) for _ in range(self.group_size)]

        # 2. Score each rollout. In math, this is exact-match on the boxed
        #    answer; in web search, the agent's answer is graded vs gt.
        rewards = [self.reward_fn(r, ground_truth) for r in rollouts]

        # 3. Skip groups with no contrast, same as Â_i = 0 when std(r) = 0.
        if min(rewards) == max(rewards):
            return

        # 4. Semantic advantage: have the LLM compare winners and losers and
        #    extract one lesson in plain text. This is A_text in the paper.
        a_text = self.llm.judge(
            query=query,
            rollouts=rollouts,
            rewards=rewards,
            ground_truth=ground_truth,
            existing_buffer=self._render_buffer(),
        )

        # 5. Buffer update: ask the LLM to emit Add / Delete / Modify / Keep
        #    operations on the current buffer given the new advantage.
        ops = self.llm.optimize_buffer(
            buffer=self.experience,
            new_advantage=a_text,
        )
        self._apply(ops)

    def _render_buffer(self) -> str:
        if not self.experience:
            return ""
        lines = "\n".join(f"[{i+1}] {e}" for i, e in enumerate(self.experience))
        return f"Useful experiences from prior problems:\n{lines}"

    def _apply(self, ops: list[dict]) -> None:
        # ops example: [{"type": "add", "text": "Verify rectangle..."}, ...]
        # Assumption of this sketch: "index" is 0-based and refers to the buffer
        # as it was passed to optimize_buffer. Modifies therefore run first,
        # deletes run from the highest index down so earlier removals cannot
        # shift later ones, and adds are appended last. "keep" ops are no-ops.
        for op in ops:
            if op["type"] == "modify":
                self.experience[op["index"]] = op["text"]
        doomed = {op["index"] for op in ops if op["type"] == "delete"}
        for index in sorted(doomed, reverse=True):
            self.experience.pop(index)
        self.experience.extend(op["text"] for op in ops if op["type"] == "add")
```
A quick orientation for the unfamiliar reader: AIME (American Invitational Mathematics Examination) is a 15-question competition for high school students; the AIME24 and AIME25 sets used here are the official problem batches from those years. They are short-answer integer problems requiring multi-step proof-style reasoning. Mean@32 is the average accuracy across 32 independent runs of the same problem — useful because LLMs at non-zero temperature give different answers across runs, and a single-shot Pass@1 is noisy.
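As a small illustration of the metric, the sketch below computes Mean@k from a table of per-run correctness flags. The function name and the toy data are hypothetical; only the definition (accuracy averaged over k independent runs per problem) comes from the text above.

```python
def mean_at_k(correct: list[list[bool]]) -> float:
    """correct[p][j] is True when run j on problem p produced the right answer.

    Returns per-problem accuracy over the k runs, averaged across problems.
    """
    per_problem = [sum(runs) / len(runs) for runs in correct]
    return sum(per_problem) / len(per_problem)


# Toy data: 2 problems, k = 4 runs each (the paper uses k = 32).
toy_runs = [
    [True, True, False, True],     # solved in 3 of 4 runs
    [False, False, True, False],   # solved in 1 of 4 runs
]
print(mean_at_k(toy_runs))  # 0.5
```

In this toy data the score of a single run swings between 0.0 and 1.0 depending on which column you pick, which is the noise that averaging over many runs smooths out.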
The setup: train on DAPO-100, a random sample of 100 problems from the DAPO-Math-17K dataset. 3 epochs, single batch per epoch (so 3 optimization steps total), G = 5, temperature 0.7 during training, 0.3 at evaluation. The base model is DeepSeek-V3.1-Terminus (671B total parameters, MoE; called via API), evaluated in two configurations — Direct (prompt-only, no tools) and ReAct (with a Python code interpreter as a tool).
The headline numbers (Mean@32):
| Configuration | AIME24 | AIME25 |
|---|---|---|
| Direct | 68.6% | 52.9% |
| Direct + Training-Free GRPO | 72.6% (+4.0) | 54.0% (+1.1) |
| ReAct | 80.0% | 67.9% |
| ReAct + Training-Free GRPO | 82.7% (+2.7) | 73.3% (+5.4) |
Three things to register. First, the gains transfer to a different base — DeepSeek-V3.2-Exp also improves (+2.1, +1.4 on the same benchmarks). Second, the gains transfer to smaller bases too, suggesting the method isn't purely a frontier-model phenomenon: Qwen3-32B Non-Thinking gets +4.4 / +5.9, Qwen2.5-72B-Instruct gets +1.4 / +1.8. Third, and this is the loudest finding, the absolute number 82.7% on AIME24 beats fully fine-tuned 32B models like ReTool (67.0%) and AFM (66.7%) that cost ~$10K to train — and it does so with $18 of API spend.
The ablations make the case that the optimization machinery is what matters, not just having a longer prompt. The most pointed one: prepending experiences directly generated by DeepSeek-V3.1-Terminus (the paper asks the model to produce a list of useful experiences, matched in quantity to what Training-Free GRPO learned) gives 79.8%, basically identical to the 80.0% baseline. The same model, the same kind of content, but produced without the rollout-then-distill loop, gives no gain. It is the iterative comparison against ground truth that earns the +2.7. A second ablation removes ground truth entirely (the LLM has to compare rollouts against each other only, leveraging implicit majority voting / self-consistency) and recovers only part of the gain: 80.7% on AIME24 and 68.9% on AIME25, above the 80.0% / 67.9% ReAct baseline but well short of the full method's 82.7% / 73.3%. A third sets G = 1, eliminating the group, and gains collapse, confirming the relative comparison is load-bearing, not just the buffer mechanism.
A subtler observation from Figure 4 of the paper: across the 3 learning steps, the average number of tool calls per problem decreases even though accuracy rises. The buffer doesn't just teach the agent which moves are correct — it teaches it which tool calls are wasted. That dual effect (more correct, more efficient) is consistent with what an experienced practitioner internalizes after solving many problems.
Before the numbers: WebWalkerQA [arXiv:2501.07572] is a benchmark that gives an agent a real web environment plus a natural-language question whose answer requires navigating, clicking, reading, and synthesizing across multiple pages. The reward is whether the final answer matches ground truth. Pass@1 is single-trajectory accuracy; pass@3 is success-among-three-attempts. AFM-100 is 100 queries randomly sampled from the AFM (Chain-of-Agents) web-RL training corpus.
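The two metrics can be sketched as follows, reading pass@k as "at least one of the first k attempts succeeds", which is how the text above describes it. The function name and toy data are hypothetical, and the benchmark's own harness may compute pass@k differently (for example with an unbiased estimator over more samples).

```python
def pass_at_k(attempts: list[list[bool]], k: int) -> float:
    """Fraction of questions solved by at least one of the first k attempts."""
    solved = [any(question_attempts[:k]) for question_attempts in attempts]
    return sum(solved) / len(solved)


# Toy data: 4 questions, 3 attempts each.
toy_attempts = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [False, True, True],
]
print(pass_at_k(toy_attempts, k=1))  # 0.25
print(pass_at_k(toy_attempts, k=3))  # 0.75
```

Under this reading pass@3 can never be lower than pass@1, so a method that lifts pass@3 without moving pass@1 is making the agent succeed on more questions at least occasionally, not more reliably on the first try.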
Same recipe, different domain. 3 epochs, group size G = 3 (smaller than math because web rollouts are more expensive), DeepSeek-V3.1-Terminus, ReAct loop with web tools.
| Configuration | pass@1 (full set) | pass@1 (51-instance subset) | pass@3 (51-instance subset) |
|---|---|---|---|
| ReAct (baseline) | 63.2% | 66.7% | 74.5% |
| + Directly generated experiences | n/a | 64.7% | 76.5% |
| + TF-GRPO (no ground truth) | n/a | 66.7% | 78.4% |
| + TF-GRPO (full) | 67.8% | 68.6% | 78.4% |
The +4.6 pass@1 on the full WebWalkerQA set (63.2% → 67.8%) confirms the method generalizes beyond math. The same shape of ablation holds: directly-generated experiences fail (64.7%, slightly below baseline), confirming that the optimization is not just "longer prompt = better". And the no-ground-truth variant matches baseline pass@1 but lifts pass@3 to 78.4%, suggesting that even without explicit verification, the LLM-on-LLM relative comparison improves the consistency of the agent across attempts.
The buffer the method learns for web search reads like the inside of a research analyst's head. Examples from the paper's Appendix A.2: "Prioritize systematic extraction from authoritative comprehensive documents over fragmented information for coherent topic coverage" (Source prioritization), "Continuously refine search terms based on emerging patterns while periodically re-evaluating previously encountered information" (Iterative refinement), "Focus on extracting formal titles and collection names from official metadata and headers rather than inferring relationships from content descriptions" (Document identification). These are tactical, transferable, and not the kind of thing you would write down a priori — they emerge from comparing successful and failed rollouts on actual queries.
The discordant note: applying TF-GRPO to QwQ-32B drops pass@1 from 27.5% to 25.5%, so the smaller model regresses below its own ReAct baseline. The authors attribute this to the underlying capability ceiling: the buffer adds advice the model still cannot follow because its reasoning and tool-use foundations are too weak. Pass@3 on QwQ-32B does improve (43.1% → 45.1%), so the method is not pure noise on the smaller model, but the pass@1 regression is a real failure mode and worth taking seriously when planning to apply this to a smaller-base agent.
The most damaging comparison in the paper is the cross-domain transfer table. Take ReTool, fine-tuned via PPO on math; it scores 67.0% on AIME24 (good, in-domain) and 18.3% on WebWalker (worse than the un-tuned ReAct baseline of 31.9% on the same model). Take MiroThinker, fine-tuned for web research; it scores 53.6% on WebWalker but only 43.5% on AIME24. Each fine-tuned specialist trades cross-domain capability for in-domain gains.
Training-Free GRPO does not have this trade-off because the model never changes — only the buffer does. The authors maintain two buffers, one for math, one for web search, and plug whichever is appropriate into the prompt. The frozen DeepSeek-V3.1-Terminus then scores 82.7% on AIME24 and 67.8% on WebWalker — best in both columns simultaneously. A weight-tuned model is one specialist; a frozen-base + buffer system is N specialists for the cost of N text files.
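A minimal sketch of what "N specialists for the cost of N text files" could look like at call time. The `BUFFERS` dictionary, the `build_prompt` helper, and the prompt wording are illustrative assumptions; the two lessons are the examples quoted elsewhere in this article, not the full learned buffers.

```python
# Hypothetical in-memory store: one learned buffer per domain.
BUFFERS: dict[str, list[str]] = {
    "math": [
        "When solving geometry problems with intersections, validate "
        "solutions lie within bounded regions",
    ],
    "web": [
        "Prioritize systematic extraction from authoritative comprehensive "
        "documents over fragmented information for coherent topic coverage",
    ],
}


def build_prompt(domain: str, question: str) -> str:
    """Prepend the domain's lessons to the question; the model itself is untouched."""
    lessons = BUFFERS.get(domain, [])
    if not lessons:
        return f"Question: {question}"
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(lessons, start=1))
    return f"Useful experiences from prior problems:\n{numbered}\n\nQuestion: {question}"


print(build_prompt("math", "Find the area of the region bounded by ..."))
```

Switching specialists is a dictionary lookup, and an unknown domain falls back to the plain frozen model, which is why no cross-domain capability is lost.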
The compute cost story is even more lopsided. ReTool's reported training cost is roughly 20K GPU-hours × $0.50 = ~$10K, plus a dedicated 32B model deployment. Training-Free GRPO with DeepSeek-V3.1-Terminus takes 6 hours to run 3 steps on 100 samples, consuming 38M input tokens + 6.6M output tokens ≈ $18 at DeepSeek's pricing (most input qualifies for cache-hit pricing because the prompt prefix is reused across rollouts). That is a reduction of roughly 500× in training cost, between two and three orders of magnitude.
Inference cost is the one place fine-tuning has an edge — but only conditionally. ReTool-32B running on a 4×GPU vLLM server at $0.50/GPU-hour processes ~400 problems/hour, costing about $0.005 per problem if you can keep the GPUs saturated. Training-Free GRPO via API is ~$0.02 per problem (60K input + 8K output tokens at cache-hit pricing). Per-call, the API is 4× more expensive. But the API has zero fixed serving cost: you pay $0 when there is no traffic. Most enterprise agentic deployments have spiky, low-volume traffic — the kind where a dedicated GPU sits idle most of the day. The break-even is roughly: if you would have utilized a 4-GPU cluster less than ~25% of the time, the API model is cheaper in total.
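The arithmetic behind these figures can be checked in a few lines. All constants are the ones quoted in the text; the variable names and the simple linear cost model (a fixed hourly cluster bill versus a pure per-problem API bill) are assumptions of this sketch.

```python
GPU_HOURLY_USD = 0.50        # quoted price per GPU-hour
GPUS = 4                     # size of the vLLM serving cluster
PROBLEMS_PER_HOUR = 400      # throughput at full saturation
API_USD_PER_PROBLEM = 0.02   # quoted API cost per problem

# Self-hosted: the cluster bills by the hour whether or not it is busy.
cluster_hourly_usd = GPUS * GPU_HOURLY_USD
self_hosted_per_problem = cluster_hourly_usd / PROBLEMS_PER_HOUR


def api_hourly_usd(utilization: float) -> float:
    """API bill for the traffic that would keep the cluster busy `utilization` of the time."""
    return utilization * PROBLEMS_PER_HOUR * API_USD_PER_PROBLEM


# Break-even utilization: below this, the API is cheaper in total.
break_even = cluster_hourly_usd / (PROBLEMS_PER_HOUR * API_USD_PER_PROBLEM)

# Training-side ratio: ~20K GPU-hours at $0.50 versus ~$18 of API spend.
training_ratio = (20_000 * GPU_HOURLY_USD) / 18

print(f"self-hosted per problem at saturation: ${self_hosted_per_problem:.3f}")
print(f"break-even utilization: {break_even:.0%}")
print(f"training cost ratio: about {training_ratio:.0f}x")
```

At 10% utilization, for example, the API bill for the same traffic is `api_hourly_usd(0.10)`, well under the $2.00 per hour the idle-most-of-the-day cluster costs.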
The paper's place in the landscape is worth pinning down. Three lines converge here.
Line 1 — agentic RL. Started with ReAct (interleave reasoning and acting), formalized by GRPO and its derivatives (DAPO, GSPO, GiGPO), instantiated for tool use by ReTool, AFM, Tongyi DeepResearch, and similar systems. All of these update model weights.
Line 2 — training-free / in-context methods. GPT-3 demonstrated few-shot in-context learning. Self-Refine has a model critique its own output and revise within a single trajectory. Reflexion layers a verbal critic on top of an agent loop. TextGrad generalizes this to "back-propagation through text", treating LLM calls as differentiable nodes whose gradients are natural-language critiques. In-context RL (Song et al. 2025; Monea et al. 2024) explicitly feeds reward signals into the prompt across attempts. The common thread: optimization happens within a single query's lifetime.
Line 3 — shared-experience agents. Agent KB maintains a hierarchical knowledge base across tasks but uses a complex retrieve-refine pipeline and collects training trajectories off-policy.
Training-Free GRPO sits at the intersection. From Line 1 it inherits the multi-epoch on-policy structure and the explicit group-relative comparison. From Line 2 it inherits the "no gradients" stance. From Line 3 it inherits the shared, persistent buffer. What is new is the combination: an on-policy multi-epoch RL loop whose update is buffer-edit operations on a single shared text artifact. Self-Refine and Reflexion give per-query corrections; TextGrad gives per-call gradients; Agent KB gives a static knowledge base. None of them have all three pieces — multi-epoch, group-relative, single shared buffer — at once.
The point of the paper, in one sentence: the optimization that matters in modern agentic RL can be done in context space rather than parameter space, with most of the benefit and a small fraction of the cost. The authors do not claim parameter updates are obsolete — they explicitly note their method depends on a strong base model and would not replace fine-tuning when the base is too weak to introspect on its own behavior. What they do show is that for the very real case of "I have access to a strong frontier API model and I need it to behave better in my domain", the GRPO recipe ports cleanly to a no-gradient setting. The fine-tune-or-prompt question, often answered with "fine-tune for serious work, prompt for prototypes", now has a third option that competes on serious work.
If the technique generalizes — and the cross-domain transfer experiment is encouraging here — the operational implication is significant. Building a domain-specialized agent becomes a matter of curating ~100 ground-truth examples, paying $20 of API spend, and shipping a text file alongside the rest of your prompt template. Specialists multiply with low marginal cost. Updates to the base model carry the buffer forward without retraining. The agentic-RL pipeline, as it currently exists, is mostly compute and infrastructure overhead designed to enable parameter updates that, this paper argues, may not have been the most efficient way to encode the lesson.
A fair criticism: the paper's evaluation is narrow. AIME math and WebWalkerQA are real benchmarks, but they are also benchmarks where verification is cleanly available (correct answer or not). For domains without easy verification (open-ended writing, scientific hypothesis generation), the no-ground-truth ablation suggests the method still works partially, but with a visibly smaller gain than when ground truth is available (80.7% vs 82.7% on AIME24). The technique is also not the unconditional win a casual reading might suggest: on QwQ-32B the same recipe hurts. There is a regime where this works; the paper has mapped a useful slice of it.
Code: TencentCloudADP/youtu-agent — training_free_GRPO branch. Paper: arXiv:2510.08191.