
For all their seemingly superhuman abilities, today’s AI models suffer from a surprisingly human flaw: they forget. Give an AI assistant a sprawling conversation, a multifaceted reasoning task or a project that spans days, and it will eventually lose the thread. Engineers call this phenomenon “context rot,” and it has quietly become one of the most significant obstacles to building AI agents that can work reliably in the real world.
A research team from China and Hong Kong believes it has developed a solution to context rot. Their new paper introduces General Agentic Memory (GAM), a system designed to store information over long intervals without overwhelming the model. The basic premise is simple: divide memory into two specialized roles, one that captures everything and another that retrieves the right things at the right time.
Early results are encouraging, and the timing couldn’t be better. As the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM is emerging at just the right inflection point.
When larger context windows still aren’t enough
At the heart of every large language model (LLM) is a hard boundary: a fixed “working memory,” commonly called a context window. Once a conversation outgrows that window, older information is truncated, summarized, or silently dropped. AI researchers have long recognized this limitation, and since as early as 2023, developers have raced to expand context windows, dramatically increasing the amount of information a model can handle in a single pass.
Mistral’s Mixtral 8x7B debuted with a 32K-token window, roughly 24,000 to 25,000 words of English text. MosaicML’s MPT-7B-StoryWriter-65k+ doubled that capacity. Then came Google’s Gemini 1.5 Pro and Anthropic’s Claude 3, which offered massive 128K and 200K windows, both scalable to an extraordinary one million tokens. Even Microsoft joined the push, moving from the 2K-token limit of the first Phi model to the 128K context window of Phi-3.
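As a rough sanity check, a common rule of thumb puts one English token at about 0.75 words. A quick back-of-envelope sketch (the ratios and page size below are conventional approximations, not measurements from any particular tokenizer) shows what these windows hold:

```python
# Back-of-envelope conversion from context-window sizes to English text.
# The 0.75 words-per-token ratio and 500-word page are rough conventions,
# not properties of any specific tokenizer or layout.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

for name, tokens in [("Mixtral 8x7B", 32_000),
                     ("MPT-7B-StoryWriter", 65_000),
                     ("Gemini 1.5 Pro", 128_000),
                     ("Claude 3", 200_000)]:
    words = tokens * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    print(f"{name}: ~{words:,.0f} words (~{pages:.0f} pages)")
```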
Expanding context windows might seem like an obvious fix, but it isn’t. Even models with 100K-token windows that can ingest hundreds of pages of text still struggle to recall details buried near the beginning of a long conversation. Scaling context brings its own problems: as prompts grow longer, models become less reliable at locating and interpreting information, because attention over distant tokens weakens and accuracy gradually erodes.
Longer inputs also reduce the signal-to-noise ratio; including every possible detail can distort the response more than a focused prompt would. Long prompts also slow the model down. More input tokens mean significantly higher latency before output tokens arrive, which imposes a practical limit on how much context can be used before performance suffers.
Memory doesn’t come cheap
For most organizations, supersized context windows come with an obvious downside: they’re expensive. Sending massive prompts through an API is never cheap, and because pricing scales directly with input tokens, even a single bloated request adds up. Prompt caching helps, but not enough to offset the habit of routinely overloading the model with unnecessary context. That’s the tension at the heart of the problem: memory is essential to making AI more powerful, yet carrying it by brute force is costly.
As context windows expand into hundreds of thousands or millions of tokens, the financial overhead climbs steeply. Scaling context is as much an economic challenge as a technical one, and relying on ever-larger windows quickly becomes an unsustainable strategy for long-term memory.
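A toy cost model makes the economics concrete. If an agent resends its full history on every turn, total input cost grows quadratically with conversation length; the price below is an illustrative placeholder, not any vendor’s actual rate:

```python
# Illustrative cost model for resending the full history on each turn.
# PRICE_PER_M_INPUT is a placeholder figure, not a real vendor rate.
PRICE_PER_M_INPUT = 3.00   # dollars per million input tokens (hypothetical)
TOKENS_PER_TURN = 1_000    # new tokens added per conversation turn

def conversation_cost(turns: int) -> float:
    """Total input cost if every turn resends the whole history so far."""
    total_tokens = sum(t * TOKENS_PER_TURN for t in range(1, turns + 1))
    return total_tokens / 1_000_000 * PRICE_PER_M_INPUT

for turns in (10, 100, 1_000):
    print(f"{turns:>5} turns -> ${conversation_cost(turns):,.2f}")
    # 10 turns -> $0.17; 100 turns -> $15.15; 1,000 turns -> $1,501.50
```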
Workarounds such as summarization and retrieval-augmented generation (RAG) are not silver bullets either. Summaries inevitably omit subtle but important details, and traditional RAG, while robust on static documents, breaks down when information is spread across multiple sessions or evolves over time. Even newer variants such as agentic RAG and RAG 2.0, which do a better job of steering the retrieval process, inherit the same fundamental flaw: they treat retrieval, rather than memory, as the primary problem.
Compilers solved this problem decades ago
If memory is the real bottleneck and retrieval alone can’t fix it, the gap calls for a different kind of solution. That is the bet behind GAM. Rather than compressing memory into a lossy sketch, GAM keeps a complete, flexible record and layers smart, on-demand recall on top of it, reproducing the exact details an agent needs even as the conversation twists and evolves. A useful way to understand GAM is through a familiar software engineering concept: just-in-time (JIT) compilation. Instead of locking history into a rigid, heavily compressed memory, GAM stays light by storing a complete, untouched archive of raw history alongside a minimal set of lightweight cues. Then, when a request arrives, it “compiles” an appropriate context on the fly.
This JIT approach is built into GAM’s dual architecture, allowing the AI to carry context through long conversations without bloating the prompt or guessing prematurely about what will matter. The result is the right information, delivered at exactly the right moment.
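In code, the analogy looks something like the sketch below: a lossless page store on the write path, and a “compile” step on the read path that assembles only what the current request needs, within a fixed budget. This is one interpretation of the paper’s JIT analogy; every name here is invented for illustration:

```python
# Minimal sketch of just-in-time context assembly, GAM-style.
# All names are invented for illustration; this is not the paper's code.
from dataclasses import dataclass, field

@dataclass
class PageStore:
    pages: list[str] = field(default_factory=list)  # lossless raw history

    def append(self, raw_text: str) -> int:
        """Archive an interaction verbatim; return its page ID."""
        self.pages.append(raw_text)
        return len(self.pages) - 1

def compile_context(store: PageStore, request: str, budget: int) -> str:
    """'Compile' a task-specific context on demand: select only pages
    that look relevant to this request, up to a fixed token budget,
    instead of shipping the entire archive every time."""
    keywords = set(request.lower().split())
    relevant = [p for p in store.pages if keywords & set(p.lower().split())]
    context, used = [], 0
    for page in relevant:
        cost = len(page.split())  # crude stand-in for a token count
        if used + cost > budget:
            break
        context.append(page)
        used += cost
    return "\n---\n".join(context)
```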
Inside GAM: A two-agent system built for memory that persists
GAM revolves around a simple idea: separate the process of remembering from the process of recall. Accordingly, it comprises two components: the “memorizer” and the “researcher.”
The memorizer: Total recall without overload
The memorizer captures every exchange in its entirety, silently converting each interaction into a comprehensive memo while preserving the complete, unaltered session in a searchable page store. It doesn’t aggressively compress or guess at what’s important. Instead, it organizes interactions into structured pages, adds metadata for efficient retrieval and generates optional lightweight summaries for quick scanning. Critically, every detail is preserved; nothing is thrown away.
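A minimal sketch of this write path might look like the following; the Page schema and the truncated-prefix gist are assumptions based on the description above, not the paper’s actual format:

```python
# Sketch of a memorizer write path: every interaction becomes a
# structured, searchable page, and nothing is discarded. The schema
# is assumed from the paper's description, not taken from its code.
import time
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    raw_text: str        # the complete, unaltered interaction
    metadata: dict       # e.g., timestamp and session ID for retrieval
    summary: str | None  # optional lightweight gist for quick scanning

class Memorizer:
    def __init__(self) -> None:
        self.store: list[Page] = []

    def memorize(self, raw_text: str, session: str,
                 summarize: bool = True) -> Page:
        page = Page(
            page_id=len(self.store),
            raw_text=raw_text,  # lossless: the full text is archived
            metadata={"ts": time.time(), "session": session},
            # A real system would generate this gist with an LLM;
            # a truncated prefix stands in for it here.
            summary=raw_text[:120] if summarize else None,
        )
        self.store.append(page)
        return page
```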
The researcher: A deep retrieval engine
When the agent needs to act, the researcher takes the helm: it plans a search strategy, combines embedding-based retrieval with keyword methods like BM25, navigates via page IDs and stitches the pieces together. It performs layered searches over the page store, mixing vector retrieval, keyword matching and direct lookup. It reviews the results, identifies gaps and keeps searching until it has enough evidence to offer a confident answer, like a human analyst poring over old notes and source documents. It iterates, explores, integrates and reflects until it produces a tight, task-specific briefing.
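Continuing the memorizer sketch above, the read path can be caricatured as an iterative loop over the page store. Simple keyword overlap stands in for real BM25 and vector search, and a fixed round limit stands in for the researcher’s reflection step; this illustrates the pattern, not the paper’s implementation:

```python
def keyword_score(query: str, text: str) -> int:
    """Crude stand-in for BM25 / embedding similarity: shared terms."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def research(memorizer: "Memorizer", question: str,
             max_rounds: int = 3, top_k: int = 3) -> list["Page"]:
    """Iteratively search the page store, widening the query with
    terms from pages found so far, until enough evidence is gathered."""
    evidence: dict[int, "Page"] = {}
    query = question
    for _ in range(max_rounds):
        ranked = sorted(memorizer.store,
                        key=lambda p: keyword_score(query, p.raw_text),
                        reverse=True)
        for page in ranked[:top_k]:
            if keyword_score(query, page.raw_text) > 0:  # keep only hits
                evidence[page.page_id] = page
        if len(evidence) >= top_k:  # naive "enough evidence" check
            break
        # Reflect: expand the query with what has been found so far.
        query += " " + " ".join(p.raw_text for p in evidence.values())
    return list(evidence.values())
```

In the real system, each of these stand-ins is an agentic decision: the researcher chooses among retrieval tools, judges whether the evidence suffices, and rewrites its own queries.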
GAM’s power comes from this JIT memory pipeline, which assembles task-specific context on demand, leaning on lightweight, pre-built summaries when speed matters. Its core innovation is simple yet powerful: retain all information, and make every detail retrievable.
Ablation studies support this view: the memory component alone falls short, and naive retrieval is not sufficient either. It’s the complete archive paired with an active, iterative exploration engine that lets GAM deliver a level of detail other systems leave behind.
Outperforming RAG and long-context models
To test GAM, the researchers ran it against standard RAG pipelines and extended-context models such as GPT-4o-mini and Qwen2.5-14B. They evaluated GAM on four major long-context and memory benchmarks, each chosen to probe a different aspect of the system’s capabilities:
LoCoMo: Measures an agent’s ability to retain and recall information over long, multi-session conversations, including single-hop, multi-hop, temporal reasoning, and open-domain tasks.
HotpotQA: A widely used multi-hop QA benchmark built from Wikipedia, adapted using a version of MemAgent’s memory-pressure stress test, which mixes relevant documents with distractors to create 56K, 224K and 448K token contexts, ideal for testing how well GAM handles sprawling input.
RULER: Probes long-horizon reasoning, evaluating retrieval accuracy, multi-hop state tracking, long-sequence aggregation and QA under a 128K-token context.
NarrativeQA: A benchmark in which each question must be answered using the full text of a book or movie script. The researchers sampled 300 instances with an average context of 87K tokens.
Together, these benchmarks let the team evaluate GAM’s ability to preserve fine-grained historical information and its effectiveness in supporting complex downstream reasoning tasks.
GAM came out on top across all benchmarks. Its biggest win came on RULER, which stresses long-range state tracking. In particular:
GAM exceeded 90% accuracy.
Summarization-based baselines fall apart because key details are lost in their compressed summaries.
Long-context models, though technically able to hold the history, effectively “blur” old information.
Clearly, large context windows are not the answer. GAM works because it retrieves tokens with precision instead of stacking them up.
GAM, context engineering and competing approaches
Poorly structured context, not model limitations, is often the real reason AI agents fail. GAM addresses this by ensuring that nothing is permanently lost and that the correct information can always be retrieved, even far downstream. The technique’s emergence coincides with a broader shift toward context engineering: the discipline of constructing everything an AI model sees, including its instructions, history, retrieved documents, tools, preferences, and output formats.
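In practice, “constructing everything a model sees” means deliberately assembling separate ingredients into one input rather than letting a transcript grow unchecked. A minimal sketch, with section names that are common conventions rather than any standard:

```python
# Sketch of context engineering: the model's input is assembled from
# distinct, curated ingredients instead of one ever-growing transcript.
def build_context(instructions: str, history: list[str],
                  retrieved_docs: list[str], tools: list[str],
                  output_format: str) -> str:
    sections = [
        "# Instructions\n" + instructions,
        "# Relevant history\n" + "\n".join(history),
        "# Retrieved documents\n" + "\n".join(retrieved_docs),
        "# Available tools\n" + "\n".join(tools),
        "# Output format\n" + output_format,
    ]
    return "\n\n".join(sections)
```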
Context engineering has increasingly eclipsed prompt engineering in importance, and other research groups are tackling the memory problem from different angles. Anthropic is exploring curated, evolving context states. DeepSeek is experimenting with storing context as compressed images. Another group of Chinese researchers has proposed a “semantic operating system” built around lifelong, adaptive memory.
GAM’s philosophy, however, is different: lose nothing, and retrieve intelligently. Instead of trying to guess in advance what will matter later, it keeps everything and uses a dedicated search engine to surface the relevant fragments at runtime. For agents managing multi-day projects, ongoing workflows or long-term relationships, that reliability can be essential.
Why GAM matters for the long haul
Just as adding more compute doesn’t automatically create a better algorithm, increasing context alone won’t solve AI’s long-term memory problems. Meaningful progress requires rethinking the underlying system, and GAM takes that approach. Instead of relying on ever-larger models, massive context windows, or endlessly refined prompts, it treats memory as an engineering challenge, one that benefits from structure rather than brute force.
As AI agents transition from clever demos to mission-critical tools, their ability to remember long histories becomes critical to building reliable, intelligent systems. Enterprises need AI agents that can track tasks in progress, maintain continuity and recall past interactions with precision. GAM offers a practical path toward that future, hinting at what could be the next big frontier in AI: not bigger models, but better memory systems and the context architectures that make them possible.