Agentic Context Engineering (ACE): The Self-Improving Framework for LLM Contexts

ACE (Agentic Context Engineering) is a framework that treats LLM contexts as evolving playbooks that accumulate and organize strategies over time. This design prevents context collapse and addresses brevity bias by using incremental, modular updates guided by three specialized roles: a Generator, a Reflector, and a Curator, which work together to extract insights and curate knowledge.

MantraVid Admin

March 26, 2026




The Problem: Why Context Engineering Has Failed So Far

Modern large language model (LLM) applications—whether autonomous agents, domain-specific systems, or compound AI workflows—increasingly depend on context adaptation rather than model weight updates. Instead of retraining expensive foundation models, engineers modify inputs with:

  • Clarified instructions (system prompts)

  • Structured reasoning steps

  • Domain-specific input formats

  • Factual evidence to reduce hallucinations

But here's the catch: existing approaches to context engineering are fundamentally broken.

The Two Fatal Flaws

| Problem | What Happens | Consequence |
| --- | --- | --- |
| Brevity Bias | Optimization prioritizes concise, generic prompts | Critical domain insights (heuristics, tool-use guidelines, failure modes) get dropped |
| Context Collapse | Monolithic rewriting by an LLM compresses long contexts | Performance collapses dramatically, e.g., 318K tokens → 122 tokens with a 10.5% accuracy drop |

Think about it this way: If you're teaching someone how to be an expert chess player, you'd never summarize the game into 3 bullet points. You'd preserve every move, every strategic insight, every mistake, every correction. ACE does exactly this for LLM contexts.

Key Insight: Unlike humans who benefit from generalization, LLMs perform better with comprehensive, detailed contexts. They can distill relevance autonomously. The error isn't in providing too much context—it's in what we discard.


Introducing ACE: Agentic Context Engineering

ACE (Agentic Context Engineering) is a framework that treats contexts not as static prompts but as evolving playbooks—structured, incremental knowledge bases that accumulate, refine, and organize strategies over time.

The Core Design Philosophy

Instead of compressing knowledge into terse summaries, ACE:
✓ Preserves detailed, domain-specific knowledge
✓ Prevents context collapse through modular updates
✓ Scales with long-context models (128K+ tokens)
✓ Enables both offline (system prompts) and online (dynamic memory) optimization


The ACE Framework: Three Specialized Roles

Inspired by human learning processes, ACE distributes the work across three agents:

1. The Generator

Role: Produces reasoning trajectories for new queries

What it does:

  • Given a context and a query, the model generates execution paths (code, reasoning steps, tool calls)

  • Tracks which parts of existing context were useful vs. misleading

  • Surfaces patterns that emerge from real execution traces

Key prompt design:

Read the Playbook first, then execute the task by explicitly leveraging each relevant section
...
Using these APIs and cheatsheet, generate code to solve the actual task

2. The Reflector

Role: Critiques execution traces and extracts actionable insights

What it does:

  • Diagnoses why failures happened (grounded in actual errors, not hypothetical scenarios)

  • Identifies specific conceptual errors, calculation mistakes, or misapplied strategies

  • Provides root cause analysis: was it wrong source of truth? Bad filters? Formatting issues?

Key distinction: Unlike prior methods that just say "improve your code," the Reflector specifies exactly how to improve:

"Always resolve identities from the correct source app - Phone app for relationships, never rely on transaction descriptions or other indirect heuristics which are unreliable."

3. The Curator

Role: Integrates insights into structured context updates

What it does:

  • Identifies ONLY NEW insights missing from existing playbook (no redundancy)

  • Formats additions as compact, itemized bullets with metadata (helpful/harmful counters)

  • Maintains a modular structure: strategies, formulas, failure modes, API schemas
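The itemized bullets described above can be modeled as a small record type. The sketch below is illustrative only: the field names (`helpful`, `harmful`, `section`) follow this article's description of the metadata counters, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Bullet:
    """One itemized playbook entry with curation metadata (hypothetical schema)."""
    id: int
    section: str          # e.g. "strategies_and_hard_rules"
    content: str
    helpful: int = 0      # incremented when the Generator found this bullet useful
    harmful: int = 0      # incremented when this bullet misled an execution

    def render(self) -> str:
        # Compact rendering for inclusion in the Generator's context.
        return f"[{self.id}] ({self.helpful}+/{self.harmful}-) {self.content}"

bullet = Bullet(7, "failure_modes",
                "Resolve identities from the Phone app, not transaction text.")
bullet.helpful += 1
print(bullet.render())  # [7] (1+/0-) Resolve identities from the Phone app, not transaction text.
```

The counters give the Curator a cheap signal for later pruning: bullets with a high harmful count become candidates for removal.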


Technical Innovations in ACE

1. Incremental Delta Updates

Instead of regenerating the entire context each time, ACE uses an itemized design:

  • Localization: Only update relevant bullets, not the whole playbook

  • Fine-grained retrieval: Generator focuses on pertinent knowledge

  • Incremental adaptation: Merge, prune, deduplicate as needed

Why this matters: Full context rewrites are expensive (latency + compute). ACE avoids this by making compact, targeted updates that preserve past knowledge.
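A minimal sketch of such a delta update, assuming the playbook is stored as a dict mapping bullet IDs to text; the ADD/UPDATE operation names are assumptions mirroring the Curator's JSON response format quoted later in this article.

```python
# Apply a compact delta to an itemized playbook without regenerating it.
# Past bullets are preserved; only the targeted entries change.
def apply_delta(playbook: dict, operations: list) -> dict:
    updated = dict(playbook)                 # copy: past knowledge is never rewritten
    next_id = max(updated, default=0) + 1
    for op in operations:
        if op["type"] == "ADD":
            updated[next_id] = op["content"]  # append with a fresh ID
            next_id += 1
        elif op["type"] == "UPDATE":
            updated[op["id"]] = op["content"]  # localized, in-place edit
    return updated

playbook = {1: "Check API schemas before calling."}
delta = [{"type": "ADD", "content": "Paginate results past 50 items."}]
playbook = apply_delta(playbook, delta)
print(playbook)  # both the old bullet and the new one survive
```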

2. Grow-and-Refine Principle

ACE balances expansion with redundancy control through two refinement modes:

  • Proactive refinement: after each delta, new bullets are appended with fresh IDs and existing ones are updated in place

  • Lazy refinement: refinement is deferred until the context window is exceeded

In both modes, bullets are compared via semantic embeddings so redundant entries can be deduplicated. This maintains contexts that expand adaptively while remaining interpretable.
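Semantic deduplication can be sketched as follows. A real system would compare sentence-embedding vectors; a bag-of-words cosine similarity stands in here so the example stays self-contained, and the 0.9 threshold is an arbitrary assumption.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for real embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def deduplicate(bullets: list[str], threshold: float = 0.9) -> list[str]:
    # Keep a bullet only if it is not near-identical to one already kept.
    kept: list[str] = []
    for b in bullets:
        if all(cosine(b, k) < threshold for k in kept):
            kept.append(b)
    return kept

bullets = [
    "Always paginate API results.",
    "Always paginate API results.",        # exact duplicate, cosine = 1.0
    "Verify the user before sending money.",
]
print(deduplicate(bullets))  # duplicate removed, distinct bullets kept
```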

3. No Labeled Supervision Required

ACE learns entirely from natural execution feedback:

  • Generator runs task → observes success/failure

  • Reflector analyzes failure → extracts lessons

  • Curator adds structured updates

No human labeling, no reward modeling, no expensive annotations. Just the traces from real agent runs.
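The three-step loop above can be sketched as a single adaptation step. The role functions here are toy stand-ins: in practice each would be an LLM call, and `execute` would be a real agent environment.

```python
# One ACE adaptation step driven purely by execution feedback (no labels).
def ace_step(playbook, task, generate, reflect, curate, execute):
    trajectory = generate(playbook, task)      # Generator: produce an execution path
    success, trace = execute(trajectory)       # environment: natural feedback signal
    if success:
        return playbook                        # nothing to learn from this run
    lesson = reflect(trajectory, trace)        # Reflector: root-cause the failure
    return playbook + curate(playbook, lesson) # Curator: merge only the new insight

# Toy stand-ins so the loop runs end to end.
generate = lambda pb, task: f"attempt({task})"
execute = lambda traj: (False, "KeyError: 'phone'")
reflect = lambda traj, trace: "Resolve contacts via the Phone app."
curate = lambda pb, lesson: [lesson] if lesson not in pb else []

playbook = ace_step([], "send $5 to Alex", generate, reflect, curate, execute)
print(playbook)
```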


Performance Results

Main Benchmarks

| Task | ACE Score | Best Baseline | Improvement |
| --- | --- | --- | --- |
| AppWorld (Agents) | 59.5% | 47.5% (GEPA) | +10.6% |
| FiNER (Finance) | 76.5% | 69.5% | +8.6% |
| Formula (Financial) | 78.3% | 74.2% | +4.1% |

The Big Claim

On the AppWorld leaderboard, ACE matches the top-ranked production-level agent (IBM-CUGA, powered by GPT-4.1) on average and surpasses it on the harder test-challenge split, despite using a smaller open-source model (DeepSeek-V3.1).

Think about this: a smaller, open model matches, and on the hardest split beats, an agent powered by the best proprietary model when given evolving contexts. That's the power of comprehensive knowledge.

ACE consistently outperforms strong baselines:

  • Base LLM: 42.4% → ACE: 59.5% (+17.1 percentage points)

  • GEPA (baseline optimizer): 46.4% → ACE: 59.5% (+13.1)

  • Dynamic Cheatsheet: 51.9% → ACE: 59.5% (+7.6)


Across all domains—agents, finance, numerical reasoning—ACE beats every major contender.


Cost and Efficiency

ACE isn't just more accurate—it's cheaper.

| Metric | ACE vs. Best Competitor |
| --- | --- |
| Adaptation latency | 86.9% lower |
| Rollouts required | ~30% fewer |
| Token cost | Lower (parallel delta updates) |

How? By using incremental delta updates instead of monolithic rewrites. ACE can batch adaptation tasks and update in parallel.


Code-Level Design: What the Prompts Look Like

Generator Prompt (AppWorld)

I am your supervisor and you are a super intelligent AI Assistant whose job is to achieve my day-to-day tasks completely autonomously.

To do this, you will need to interact with apps (e.g., spotify, venmo) using their associated APIs on my behalf. For this you will undertake a multi-step conversation using a python REPL environment.

Here are three key APIs that you need to know:

Using these APIs and cheatsheet, generate code to solve the actual task:

My name is: {{ main_user.first_name }} {{ main_user.last_name }}. 
My personal email is {{ main_user.email }} and phone number is {{ main_user.phone_number }}.
Task: {{ input_str }}
[...]

Curator Prompt (extracting new insights)

CRITICAL: You MUST respond with valid JSON only. Do not use markdown formatting or code blocks.

Instructions:
- Review the existing playbook and the reflection from the previous attempt
- Identify ONLY the NEW insights, strategies, or mistakes that are MISSING from the current playbook
- Avoid redundancy

Response Format:
{
  "reasoning": "[Your chain of thought]",
  "operations": [
    {
      "type": "ADD",
      "section": "strategies_and_hard_rules",
      "content": "[New actionable bullet]"
    }
  ]
}
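Consuming such a response is simple precisely because of the "JSON only" instruction. A sketch of the parsing side follows; the fence-stripping fallback is a defensive assumption on my part, not part of ACE.

```python
import json

def parse_curator_response(raw: str) -> list[dict]:
    """Extract ADD operations from the Curator's JSON response."""
    raw = raw.strip()
    if raw.startswith("```"):  # tolerate a model that emitted a code fence anyway
        raw = raw.strip("`").removeprefix("json").strip()
    data = json.loads(raw)
    return [op for op in data.get("operations", []) if op.get("type") == "ADD"]

response = '''{
  "reasoning": "The playbook lacks a rule about identity resolution.",
  "operations": [
    {"type": "ADD",
     "section": "strategies_and_hard_rules",
     "content": "Resolve identities from the Phone app."}
  ]
}'''
ops = parse_curator_response(response)
print(ops[0]["content"])  # Resolve identities from the Phone app.
```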

Why This Matters: The Bigger Picture

1. Self-Improving Systems Without Fine-Tuning

ACE shows that LLMs can continuously improve themselves by:

  • Extracting lessons from actual execution traces

  • Structuring those lessons into reusable context updates

  • Applying those updates in subsequent tasks

No fine-tuning, no weight updates, just context engineering.

2. Scalable to Real-World Workloads

ACE reduces adaptation latency by 86.9% and token costs significantly. For enterprises deploying agents for:

  • Customer service automations

  • Financial analysis systems

  • Code generation platforms

  • Domain-specific reasoning workloads

ACE means you can deploy complex agents that actually get better over time.

3. Demystifies the "Magic" of Context

Prior to ACE, context optimization felt like a black box: "we improved prompts and got better results." ACE exposes the machinery:

  • How lessons are extracted from failures

  • How they're organized into actionable bullets

  • How they're applied in subsequent tasks

This interpretability matters for debugging, auditing, and trust.

4. Opens the Door to Multi-Agent Collaboration

ACE's modular architecture suggests applications like:

  • Multiple Generators working on different aspects of a task

  • Shared playbooks across agent teams

  • Collective reflection on failures (like a distributed brain)

The future isn't single powerful models; it's orchestrated systems of specialized components.


Implementation: Key Takeaways

For Researchers

  1. Beware monolithic context rewriting—it causes collapse

  2. Design for incremental updates, not regeneration—preserve knowledge while adding new insights

  3. Separate evaluation from curation—the Reflector role is crucial for quality

  4. Use semantic deduplication—keep contexts compact and relevant

For Engineers

  1. Start small —ACE works with both offline and online scenarios

  2. Track helpful/harmful counters —Metadata guides future curation

  3. Batch updates —ACE supports parallel delta merging

  4. Monitor playbook size —Lazy refinement triggers when window exceeded
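Item 4 can be sketched as a simple budget check; the 4-characters-per-token estimate and the 50% reserve are rough heuristic assumptions, not values from the paper.

```python
# Lazy-refinement trigger: run dedup/pruning only when the rendered
# playbook approaches the model's context window.
def needs_refinement(bullets: list[str], window_tokens: int,
                     reserve: float = 0.5) -> bool:
    est_tokens = sum(len(b) for b in bullets) // 4  # rough chars-per-token heuristic
    return est_tokens > window_tokens * reserve     # leave room for task + output

bullets = ["x" * 400] * 200  # ~20K estimated tokens of playbook
print(needs_refinement(bullets, window_tokens=128_000))  # False: well under budget
print(needs_refinement(bullets, window_tokens=32_000))   # True: trips the trigger
```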

For Leaders

ACE represents a paradigm shift:

  • From training models to training contexts (cheaper, more interpretable)

  • From static prompts to evolving playbooks (self-improving systems)

  • From black boxes to interpretable agents (easier to debug and audit)


Limitations to Watch For

ACE isn't perfect. Potential pitfalls:

  1. Context window constraints —ACE assumes long-context models exist. For short-context systems, other approaches may be needed.

  2. Cold start problem —ACE needs an initial playbook. First-time accuracy may not match baselines; improvement accumulates over iterations.

  3. Complex task decomposition —For highly complex tasks, the Generator may produce scattered insights that the Curator struggles to synthesize.

  4. Domain transferability —ACE is optimized for agents and finance. Adapting to other domains (e.g., creative writing, scientific reasoning) may require prompt tuning.


Conclusion: The Future of Context Engineering

ACE demonstrates a crucial truth: LLMs don't need better optimization to succeed; they need better contexts.

By treating contexts as evolving playbooks rather than static prompts, ACE prevents collapse, avoids brevity bias, and enables self-improving systems. The framework scales efficiently, matches proprietary models, and opens the door to truly autonomous, interpretable AI systems.

The era of context engineering has arrived. We're no longer limited by what we can compress into a context window. With ACE, we can leverage the full power of long-context LLMs to build systems that learn, adapt, and improve through natural execution feedback.


References

  1. Zhang, Q., Hu, C., et al. "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models." arXiv:2510.04618, 2025.

  2. Agrawal, L., et al. "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning." arXiv, 2025.

  3. Suzgun, M., et al. "Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory." arXiv, 2025.

  4. Trivedi, H., et al. "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents." ACL, 2024.


Source: https://arxiv.org/abs/2510.04618
