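At its core this is sample selection, but the underlying method is mildly novel: Bayesian modeling + active learning + Gaussian processes.
It transfers well and gives **efficient performance estimation.** On the item side, it can also actively probe which questions are most likely to be problematic, so they get filtered out faster.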
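
A minimal sketch of how the "Gaussian process + active learning" combination could look, not the paper's actual method. Assumptions: each benchmark item has a feature vector (`features`, hypothetical), `evaluate(i)` (hypothetical) returns 1.0 or 0.0 for item `i`, and the next item to run is the one with the highest GP posterior variance. The kernel, noise level, and selection rule are all illustrative choices.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_obs, y_obs, x_all, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at every row of x_all."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_ao = rbf_kernel(x_all, x_obs)
    mean = k_ao @ np.linalg.solve(k_oo, y_obs)
    # Prior variance is 1 for this kernel; subtract what the observations explain.
    explained = np.einsum("ij,ji->i", k_ao, np.linalg.solve(k_oo, k_ao.T))
    return mean, np.clip(1.0 - explained, 0.0, None)


def estimate_accuracy(features, evaluate, budget, seed=0):
    """Actively pick `budget` items, then estimate accuracy over all items."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(features)))]
    scores = [float(evaluate(chosen[0]))]
    while len(chosen) < budget:
        # Center 0/1 outcomes on 0.5 so the zero-mean prior is neutral.
        _, var = gp_posterior(features[chosen], np.array(scores) - 0.5, features)
        var[chosen] = -1.0  # never re-query an item that was already run
        nxt = int(np.argmax(var))  # active learning: most uncertain item next
        chosen.append(nxt)
        scores.append(float(evaluate(nxt)))
    mean, _ = gp_posterior(features[chosen], np.array(scores) - 0.5, features)
    return float(np.mean(mean) + 0.5), chosen
```

Usage would be along the lines of `est, picked = estimate_accuracy(features, evaluate, budget=50)`: only the 50 picked items are actually run, and the GP fills in the rest.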
MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation | alphaXiv
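Method: contrast trajectories from multiple agents, then extract experience from the contrast. The paper reports that this improves results.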

The extracted experience ends up in prompt form.
SkillFlow: Scalable and Efficient Agent Skill Retrieval System | alphaXiv

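Recall and filter from a large pool of skills to get the most effective ones to use. Findings:
- Filler does not help: document length and the amount of explanation do not affect effectiveness.
- Code is king: skills that actually work have a significantly higher share of code blocks (Code Blocks) and usually ship with executable script files.
- Takeaway: when contributing to a skill library, write fewer essays and provide more runnable code.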
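
A rough sketch of how the code-block share of a skill document could be measured. The paper's exact metric is not given in these notes, so this is an assumption: it counts non-blank lines inside Markdown fences against all non-blank, non-fence lines. `code_line_ratio` is a hypothetical helper name.

```python
FENCE = "`" * 3  # Markdown fence marker, built indirectly so this snippet nests cleanly


def code_line_ratio(markdown_text):
    """Share of non-blank lines that sit inside fenced code blocks."""
    in_code = False
    code_lines = 0
    total_lines = 0
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_code = not in_code  # fence lines toggle state and are not counted
            continue
        if not stripped:
            continue
        total_lines += 1
        if in_code:
            code_lines += 1
    return code_lines / total_lines if total_lines else 0.0


doc = "Intro\n" + FENCE + "python\nprint(1)\nprint(2)\n" + FENCE + "\nOutro\n"
print(code_line_ratio(doc))  # 2 code lines out of 4 counted lines -> 0.5
```

Sorting a skill library by this ratio would be one cheap way to surface the "more runnable code" candidates first.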
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents | alphaXiv

A benchmark that tests whether an agent can accumulate and apply experience over the long term.
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents | alphaXiv

Benchmark survey
| Bench | Background | Link | Public models evaluated | Notes |
|---|---|---|---|---|
| SWE-bench Verified | GitHub issue fixing; software-engineering agents | SWE-bench/experiments: Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. | 100+ | |
| Terminal-Bench 2.0 | Real tasks in a terminal environment: software engineering / ML / security / data science / system administration | yoonholee/terminalbench-trajectories · Datasets at Hugging Face | 40+ | Some trajectories are empty; their share and impact are unknown |
| MathArena | Extremely hard competition problems | MathArena Outputs – a MathArena Collection | 18 competition sets, 6–53 problems each, typically 20–70 models per set; heavy duplication, long trajectories | Long-CoT results |
| Toolathlon | Long-horizon, multi-tool, multi-app real task execution | hkust-nlp/Toolathlon-Trajectories · Datasets at Hugging Face | 17 models × 3 runs, over 5,000 task execution records | |
| LiveCodeBench | Continuously updated coding problems | https://github.com/LiveCodeBench/submissions | | Single-turn |
| Codeforces Rating | Competitive programming ability | None | | |
| Aider Polyglot | Real-world code editing | No concrete trajectories | | |
| MMLU-Pro / GPQA / HLE / AIME / HMMT / SimpleQA | | None | | |

| test_acc | test_saved | test_drop_pp |
|---|---|---|
| 77.60% | 79.83% | -12.92pp |
| 81.57% | 66.41% | -7.72pp |
| 87.97% | 45.65% | -2.26pp |
| 90.01% | 39.06% | -1.30pp |
| 91.83% | 32.86% | -0.68pp |
| 93.58% | 28.61% | -0.07pp |
| 98.90% | 17.39% | 0.07pp |

| acc | saved | drop_pp |
|---|---|---|
| 65.84% | 74.39% | -22.54pp |
| 65.84% | 74.39% | -22.54pp |
| 71.10% | 55.49% | -12.70pp |
| 86.82% | 33.82% | -2.25pp |
| 89.29% | 29.65% | -1.64pp |
| 94.74% | 25.62% | -0.61pp |
| 98.77% | 22.36% | 0.20pp |
| 98.57% | 20.30% | 0.20pp |
| 100.00% | 14.00% | 0.00pp |

| target | actual | save | drop pp | succ prec | fail prec | succ weight | fail weight |
|---|---|---|---|---|---|---|---|
| 0.75 | 78.1% | 72.0% | 12.0 | 75.6% | 78.7% | 19.3% | 80.7% |
| 0.80 | 80.1% | 64.5% | 11.4 | 80.1% | 80.1% | 19.7% | 80.3% |
| 0.85 | 85.8% | 50.3% | 7.6 | 86.5% | 85.7% | 18.3% | 81.7% |
| 0.90 | 90.2% | 38.8% | 4.5 | 90.4% | 90.2% | 16.6% | 83.4% |
| 0.95 | 95.3% | 22.1% | 1.5 | 95.1% | 95.4% | 13.7% | 86.3% |
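
A quick arithmetic check on the last table. The `actual` column appears to be the weight-averaged precision of the two groups, i.e. `succ prec × succ weight + fail prec × fail weight`. This reading is an assumption inferred from the numbers, not something stated in the notes. The values are copied from the table, and the small residuals are consistent with rounding to one decimal.

```python
# (target, actual %, succ prec %, fail prec %, succ weight %, fail weight %)
rows = [
    (0.75, 78.1, 75.6, 78.7, 19.3, 80.7),
    (0.80, 80.1, 80.1, 80.1, 19.7, 80.3),
    (0.85, 85.8, 86.5, 85.7, 18.3, 81.7),
    (0.90, 90.2, 90.4, 90.2, 16.6, 83.4),
    (0.95, 95.3, 95.1, 95.4, 13.7, 86.3),
]


def weighted_precision(succ_prec, fail_prec, succ_weight, fail_weight):
    """Weight-averaged precision of the success and failure groups, in percent."""
    total_weight = succ_weight + fail_weight
    return (succ_prec * succ_weight + fail_prec * fail_weight) / total_weight


# Gap between the recomputed value and the reported `actual`, per row.
residuals = [
    abs(weighted_precision(sp, fp, sw, fw) - actual)
    for _, actual, sp, fp, sw, fw in rows
]
for (target, actual, *_), gap in zip(rows, residuals):
    print(f"target={target:.2f} actual={actual:.1f}% gap={gap:.3f}pp")
```

Every row recomputes to within 0.1pp of `actual`, and the fail group carries over 80% of the weight in each row.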