Agentic Context Engineering (ACE): The Self-Improving Framework for LLM Contexts
ACE (Agentic Context Engineering) is a framework that treats LLM contexts as evolving playbooks that accumulate and organize strategies over time. This design prevents context collapse and counters brevity bias by using incremental, modular updates guided by three specialized roles: a Generator, a Reflector, and a Curator, which work together to extract insights and curate knowledge.
MantraVid Admin
March 26, 2026
The Problem: Why Context Engineering Has Failed So Far
Modern large language model (LLM) applications—whether autonomous agents, domain-specific systems, or compound AI workflows—increasingly depend on context adaptation rather than model weight updates. Instead of retraining expensive foundation models, engineers modify inputs with:
Clarified instructions (system prompts)
Structured reasoning steps
Domain-specific input formats
Factual evidence to reduce hallucinations
But here's the catch: existing approaches to context engineering are fundamentally broken.
The Two Fatal Flaws
| Problem | What Happens | Consequence |
|---|---|---|
| Brevity bias | Optimization prioritizes concise, generic prompts | Critical domain insights (heuristics, tool-use guidelines, failure modes) get dropped |
| Context collapse | Monolithic rewriting by an LLM compresses long contexts | Performance collapses dramatically (e.g., 318K tokens → 122 tokens with a 10.5% accuracy drop) |
Think about it this way: If you're teaching someone how to be an expert chess player, you'd never summarize the game into 3 bullet points. You'd preserve every move, every strategic insight, every mistake, every correction. ACE does exactly this for LLM contexts.
Key Insight: Unlike humans who benefit from generalization, LLMs perform better with comprehensive, detailed contexts. They can distill relevance autonomously. The error isn't in providing too much context—it's in what we discard.
Introducing ACE: Agentic Context Engineering
ACE (Agentic Context Engineering) is a framework that treats contexts not as static prompts but as evolving playbooks—structured, incremental knowledge bases that accumulate, refine, and organize strategies over time.
The Core Design Philosophy
Instead of compressing knowledge into terse summaries, ACE:
✓ Preserves detailed, domain-specific knowledge
✓ Prevents context collapse through modular updates
✓ Scales with long-context models (128K+ tokens)
✓ Enables both offline (system prompts) and online (dynamic memory) optimization
The ACE Framework: Three Specialized Roles
Inspired by human learning processes, ACE distributes the work across three agents:
1. The Generator
Role: Produces reasoning trajectories for new queries
What it does:
Given a context and a query, the model generates execution paths (code, reasoning steps, tool calls)
Tracks which parts of existing context were useful vs. misleading
Surfaces patterns that emerge from real execution traces
Key prompt design:
Read the Playbook first, then execute the task by explicitly leveraging each relevant section
...
Using these APIs and cheatsheet, generate code to solve the actual task

2. The Reflector
Role: Critiques execution traces and extracts actionable insights
What it does:
Diagnoses why failures happened (grounded in actual errors, not hypothetical scenarios)
Identifies specific conceptual errors, calculation mistakes, or misapplied strategies
Provides root cause analysis: was it the wrong source of truth, bad filters, or a formatting issue?
Key distinction: Unlike prior methods that just say "improve your code," the Reflector specifies exactly how to improve:
"Always resolve identities from the correct source app - Phone app for relationships, never rely on transaction descriptions or other indirect heuristics which are unreliable."
3. The Curator
Role: Integrates insights into structured context updates
What it does:
Identifies ONLY NEW insights missing from existing playbook (no redundancy)
Formats additions as compact, itemized bullets with metadata (helpful/harmful counters)
Maintains a modular structure: strategies, formulas, failure modes, API schemas
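The itemized bullets with helpful/harmful metadata can be sketched as a small data structure. This is an illustrative sketch only; the field names (`bullet_id`, `helpful`, `harmful`) and section name are assumptions, not taken from the ACE paper's code.

```python
from dataclasses import dataclass

# Hypothetical playbook item: one actionable insight plus curation metadata.
@dataclass
class Bullet:
    bullet_id: str
    section: str              # e.g. "strategies_and_hard_rules"
    content: str              # one actionable insight
    helpful: int = 0          # times the Generator found this bullet useful
    harmful: int = 0          # times it proved misleading

    def mark(self, was_helpful: bool) -> None:
        # Counters let the Curator later prune bullets that hurt more than they help.
        if was_helpful:
            self.helpful += 1
        else:
            self.harmful += 1

b = Bullet("s-001", "strategies_and_hard_rules",
           "Resolve identities via the Phone app, not transaction descriptions.")
b.mark(True)
```

The counters are what make curation data-driven: a bullet that repeatedly misleads the Generator becomes a pruning candidate rather than lingering forever.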
Technical Innovations in ACE
1. Incremental Delta Updates
Instead of regenerating the entire context each time, ACE uses an itemized design:
Localization: Only update relevant bullets, not the whole playbook
Fine-grained retrieval: Generator focuses on pertinent knowledge
Incremental adaptation: Merge, prune, deduplicate as needed
Why this matters: Full context rewrites are expensive (latency + compute). ACE avoids this by making compact, targeted updates that preserve past knowledge.
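A delta update can be sketched as a small merge function over an itemized playbook. The operation shapes loosely mirror the Curator's JSON (`ADD`, plus a hypothetical `UPDATE`); the dict layout and ID scheme are assumptions for illustration.

```python
# Minimal sketch: playbook as {bullet_id: text}, modified by deltas
# rather than regenerated wholesale.
def apply_delta(playbook: dict, operations: list) -> dict:
    for op in operations:
        if op["type"] == "ADD":
            new_id = f"b-{len(playbook) + 1:04d}"   # fresh ID for the new bullet
            playbook[new_id] = op["content"]
        elif op["type"] == "UPDATE":
            playbook[op["id"]] = op["content"]      # edit one bullet in place
    return playbook  # bullets not mentioned in the delta are preserved untouched

pb = {"b-0001": "Prefer official APIs over scraping."}
apply_delta(pb, [{"type": "ADD",
                  "content": "Check pagination before assuming a list is complete."}])
```

Because a delta only names the bullets it touches, the rest of the playbook survives every update, which is exactly what monolithic rewriting fails to guarantee.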
2. Grow-and-Refine Principle
ACE balances expansion with redundancy control through two mechanisms:
Proactive refinement: After each delta, add new bullets with fresh IDs; update existing ones in place
Lazy refinement: Only when the context window is exceeded
Deduplication: Compare bullets via semantic embeddings to remove redundancy
This maintains contexts that expand adaptively while remaining interpretable.
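The deduplication step can be sketched as follows. A real system would compare semantic embeddings; here token-set Jaccard similarity stands in for embedding cosine similarity, and the 0.8 threshold is an arbitrary assumption.

```python
# Stand-in for embedding similarity: Jaccard overlap of lowercased tokens.
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def deduplicate(bullets: list, threshold: float = 0.8) -> list:
    kept = []
    for b in bullets:
        # Keep a bullet only if it is not a near-duplicate of one already kept.
        if all(similarity(b, k) < threshold for k in kept):
            kept.append(b)
    return kept

bullets = [
    "Always check pagination limits on list endpoints.",
    "Always check pagination limits on list endpoints.",  # exact duplicate
    "Verify the recipient before sending payments.",
]
deduped = deduplicate(bullets)
```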
3. No Labeled Supervision Required
ACE learns entirely from natural execution feedback:
Generator runs task → observes success/failure
Reflector analyzes failure → extracts lessons
Curator adds structured updates
No human labeling, no reward modeling, no expensive annotations. Just the traces from real agent runs.
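The three-step loop above can be sketched as a single adaptation step. `generate`, `execute`, `reflect`, and `curate` are placeholders for LLM and environment calls; only the control flow follows the paper's description.

```python
# One ACE adaptation step: Generator -> execution feedback -> Reflector -> Curator.
def ace_step(playbook, task, generate, execute, reflect, curate):
    trace = generate(playbook, task)     # Generator: reasoning / code / tool calls
    feedback = execute(trace)            # natural execution feedback (pass/fail, errors)
    lessons = reflect(trace, feedback)   # Reflector: root-cause analysis of the trace
    delta = curate(playbook, lessons)    # Curator: only new, non-redundant bullets
    playbook.extend(delta)               # incremental update, never a full rewrite
    return playbook

# Demo with trivial stand-ins for the LLM calls:
demo = ace_step(
    playbook=[],
    task="send $10 to Alex",
    generate=lambda pb, t: "trace",
    execute=lambda tr: {"ok": False, "error": "wrong contact source"},
    reflect=lambda tr, fb: ["Resolve contacts via the Phone app."],
    curate=lambda pb, lessons: [l for l in lessons if l not in pb],
)
```

Note that nothing in the loop requires a ground-truth label: the only supervision signal is whatever the execution environment reports.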
Performance Results
Main Benchmarks
| Task | ACE Score | Best Baseline | Improvement |
|---|---|---|---|
| AppWorld (agents) | 59.5% | 47.5% (GEPA) | +10.6% |
| FiNER (finance) | 76.5% | 69.5% | +8.6% |
| Formula (financial) | 78.3% | 74.2% | +4.1% |
The Big Claim
On the AppWorld leaderboard, ACE matches the top-ranked production-level agent (IBM-CUGA, powered by GPT-4.1) on average and surpasses it on the harder test-challenge split, despite using a smaller open-source model (DeepSeek-V3.1).
Think about this: a smaller, open model matches, and on the hardest split beats, the best proprietary agent when given evolving contexts. That's the power of comprehensive knowledge.
ACE consistently outperforms strong baselines:
Base LLM: 42.4% → ACE: 59.5% (+17.1 percentage points)
GEPA (baseline optimizer): 46.4% → ACE: 59.5% (+13.1)
Dynamic Cheatsheet: 51.9% → ACE: 59.5% (+7.6)

Across all domains—agents, finance, numerical reasoning—ACE beats every major contender.
Cost and Efficiency
ACE isn't just more accurate—it's cheaper.
| Metric | ACE vs. Best Competitor |
|---|---|
| Adaptation latency | 86.9% lower |
| Rollouts required | Fewer (~30% reduction) |
| Token cost | Lower (incremental, parallel updates) |
How? By using incremental delta updates instead of monolithic rewrites. ACE can batch adaptation tasks and update in parallel.
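Batched, parallel adaptation follows from the delta design: since deltas are itemized additions rather than rewrites of shared text, deltas produced concurrently can simply be concatenated. The sketch below assumes a `propose_delta` stand-in for a Reflector+Curator call.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a Reflector+Curator pass over one execution trace.
def propose_delta(task_trace: str) -> list:
    return [f"lesson from {task_trace}"]

def batch_adapt(playbook: list, traces: list) -> list:
    # Propose deltas for all traces in parallel...
    with ThreadPoolExecutor() as pool:
        deltas = list(pool.map(propose_delta, traces))
    # ...then merge: for append-only deltas, merge order does not change the result.
    for delta in deltas:
        playbook.extend(delta)
    return playbook

merged = batch_adapt([], ["trace-a", "trace-b"])
```

This commutativity is what a monolithic rewriter lacks: two concurrent full rewrites of the same context cannot be merged without losing one of them.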
Code-Level Design: What the Prompts Look Like
Generator Prompt (AppWorld)
I am your supervisor and you are a super intelligent AI Assistant whose job is to achieve my day-to-day tasks completely autonomously.
To do this, you will need to interact with apps (e.g., spotify, venmo) using their associated APIs on my behalf. For this you will undertake a multi-step conversation using a python REPL environment.
Here are three key APIs that you need to know:
Using these APIs and cheatsheet, generate code to solve the actual task:
My name is: {{ main_user.first_name }} {{ main_user.last_name }}.
My personal email is {{ main_user.email }} and phone number is {{ main_user.phone_number }}.
Task: {{ input_str }}
[...]

Curator Prompt (extracting new insights)
CRITICAL: You MUST respond with valid JSON only. Do not use markdown formatting or code blocks.
Instructions:
- Review the existing playbook and the reflection from the previous attempt
- Identify ONLY the NEW insights, strategies, or mistakes that are MISSING from the current playbook
- Avoid redundancy
Response Format:
{
"reasoning": "[Your chain of thought]",
"operations": [
{
"type": "ADD",
"section": "strategies_and_hard_rules",
"content": "[New actionable bullet]"
}
]
}

Why This Matters: The Bigger Picture
1. Self-Improving Systems Without Fine-Tuning
ACE shows that LLMs can continuously improve themselves by:
Extracting lessons from actual execution traces
Structuring those lessons into reusable context updates
Applying those updates in subsequent tasks
No fine-tuning, no weight updates, just context engineering.
2. Scalable to Real-World Workloads
ACE reduces adaptation latency by 86.9% and significantly lowers token costs. For enterprises deploying agents for:
Customer service automations
Financial analysis systems
Code generation platforms
Domain-specific reasoning workloads
ACE means you can deploy complex agents that actually get better over time.
3. Demystifies the "Magic" of Context
Prior to ACE, context optimization felt like a black box: "we improved prompts and got better results." ACE exposes the machinery:
How lessons are extracted from failures
How they're organized into actionable bullets
How they're applied in subsequent tasks
This interpretability matters for debugging, auditing, and trust.
4. Opens the Door to Multi-Agent Collaboration
ACE's modular architecture suggests applications like:
Multiple Generators working on different aspects of a task
Shared playbooks across agent teams
Collective reflection on failures (like a distributed brain)
The future isn't single powerful models; it's orchestrated systems of specialized components.
Implementation: Key Takeaways
For Researchers
Beware monolithic context rewriting: it causes collapse
Design for incremental updates, not regeneration: preserve knowledge while adding new insights
Separate evaluation from curation: the Reflector role is crucial for quality
Use semantic deduplication: keep contexts compact and relevant
For Engineers
Start small: ACE works in both offline (system prompt) and online (dynamic memory) scenarios
Track helpful/harmful counters: this metadata guides future curation
Batch updates: ACE supports parallel delta merging
Monitor playbook size: lazy refinement triggers when the context window is exceeded
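The playbook-size check behind lazy refinement can be sketched as below. The token budget, per-bullet token estimate, and pruning rule are all illustrative assumptions, not values from the paper.

```python
# Lazy refinement trigger: refine only once the serialized playbook would
# exceed the model's context budget. Numbers here are illustrative.
def maybe_refine(bullets: list, token_budget: int = 128_000,
                 tokens_per_bullet: int = 40) -> list:
    est_tokens = len(bullets) * tokens_per_bullet   # crude size estimate
    if est_tokens <= token_budget:
        return bullets                              # under budget: do nothing
    # Over budget: drop bullets judged harmful more often than helpful.
    return [b for b in bullets if b.get("helpful", 0) >= b.get("harmful", 0)]

kept = maybe_refine(
    [{"text": "a", "helpful": 2, "harmful": 0},
     {"text": "b", "helpful": 0, "harmful": 3}],
    token_budget=40,
)
```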
For Leaders
ACE represents a paradigm shift:
From training models to training contexts (cheaper, more interpretable)
From static prompts to evolving playbooks (self-improving systems)
From black boxes to interpretable agents (easier to debug and audit)
Limitations to Watch For
ACE isn't perfect. Potential pitfalls:
Context window constraints: ACE assumes long-context models are available. For short-context systems, other approaches may be needed.
Cold start problem: ACE needs an initial playbook. First-time accuracy may not match baselines; improvement accumulates over iterations.
Complex task decomposition: for highly complex tasks, the Generator may produce scattered insights that the Curator struggles to synthesize.
Domain transferability: ACE was evaluated on agents and finance. Adapting it to other domains (e.g., creative writing, scientific reasoning) may require prompt tuning.
Conclusion: The Future of Context Engineering
ACE demonstrates a crucial truth: LLMs don't need better optimization to succeed; they need better contexts.
By treating contexts as evolving playbooks rather than static prompts, ACE prevents collapse, avoids brevity bias, and enables self-improving systems. The framework scales efficiently, matches proprietary models, and opens the door to truly autonomous, interpretable AI systems.
The era of context engineering has arrived. We're no longer limited by what we can compress into a context window. With ACE, we can leverage the full power of long-context LLMs to build systems that learn, adapt, and improve through natural execution feedback.
References
Zhang, Q., Hu, C., et al. "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models." arXiv:2510.04618, 2025.
Agrawal, L. A., et al. "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning." arXiv:2507.19457, 2025.
Suzgun, M., et al. "Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory." arXiv:2504.07952, 2025.
Trivedi, H., et al. "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents." ACL 2024.
Source: https://arxiv.org/abs/2510.04618