What are we learning today?

CoCoMix (Tack et al., 2025)¹ by Meta has made conceptual learning (learning the concepts behind words instead of just predicting the next token) a reality, making models remarkably steerable and interpretable.

But a core question remains: even a conceptually brilliant model can struggle with nuanced or factual recall challenges after training, during actual deployment. You could ask a seemingly simple question like, “Earlier in our 2-million-token conversation, where did we discuss Pinocchio’s famously growing nose?” No matter how conceptually capable the LLM is, it cannot answer this simple question if the answer lies outside its context window.

So the question becomes, can we equip these intelligent LLMs with an adaptable “memory” or performance boost precisely when it counts — during inference?

1. Problems with the current foundation: The Transformers

Transformers (Vaswani et al., 2017)² have become nothing short of ubiquitous in the modern AI landscape. Ever since their breakout success, they’ve been the go-to architecture across domains. 

Back in 2020, the default response to any machine learning problem was often, “just throw attention at it” — and surprisingly, it worked, often outperforming state-of-the-art models. Vision tasks? Use transformers (Dosovitskiy et al., 2020)³. Time series forecasting? Transformers again (Zerveas et al., 2021)⁴. Natural language processing? Well, transformers practically defined it (Rogers et al., 2021)⁵.

But as our reliance on large models deepened and compute budgets expanded, even this “do it all” architecture began to show its limits — and so began the push to stretch its capabilities even further.

The bottleneck? Attention’s ‘everyone-talks-to-everyone’ approach. Brilliant, but quadratically expensive: imagine a room of a million people, where each person must remember every conversation with everyone else. This restricts Transformers to a narrow “working memory”; they struggle with the “long-term recall” needed to understand vast documents, as early information simply fades away.

Beyond the context limits, vanilla transformers face another fundamental hurdle: a lack of adaptability after training. While they excel at applying their vast pre-trained knowledge to predict the next token — a process of sophisticated reasoning and prediction — this is not the same as true learning. Like Google Maps: it finds the “shortest path” for you, but it doesn’t know about the construction that went up yesterday and happily routes you into the barricades. A human guide, on the other hand, would have shown you an alternate alley route.

This inability to “learn on the fly” from the data they are currently processing represents a critical limitation for tasks requiring continuous adaptation or memory of novel experiences beyond the training set.

(Source: Author)
Two of the many problems in the current vanilla Transformers

2. The Solution? Titans!

Instead of targeting just one limitation, the researchers took a broader perspective: how do intelligent systems, like the human brain, manage memory and adapt to new situations? It’s not about having one massive, ever-accessible memory. It’s a more flexible setup, where different components coordinate to handle different kinds of information and experiences.

The Titans’ architecture (Behrouz et al., 2025)⁶ embraces this, built not around a single, monolithic attention block but around a cooperative team of specialized memory systems, each playing a crucial role in understanding and responding to the task at hand.

2.1 Architecture Components: The Memory Modules

  • Short-Term Memory (STM): This is the sharp, detail-oriented expert. It functions much like the attention you know, but instead of being overwhelmed by the entire past (now LMM’s job), its attention (pun intended) is now focused on the immediate present. This is like you remembering the words the person just spoke to you, for just long enough so that you can respond to them.
  • Long-Term Memory Module (LMM): This is the most exciting addition. It’s designed to learn and adapt during inference — yes, right there, on the fly! And by “adapt,” I literally mean its parameters change! Think of it as you understanding a friend over the years — adding experiences, while filtering out unimportant happenings.
  • Persistent Memory (PM): This member holds the bedrock, task-specific knowledge. These are learnable, fundamental insights the model picked up during its main training. This knowledge is not dynamic in the moment, but provides an essential foundation and context for the other two members. It is like your personality, your demeanor, the ability to walk or drive a car, things that you don’t need to relearn or change.
An illustration of three memory components: Short Term Memory, shown as a stressed figure at an ‘STM/Attention’ laptop, focusing on immediate context. Long Term Memory, a smiling figure at an ‘LTM weights’ laptop, updating itself with a quill for historical context. Persistent Memory, a calm figure with stone tablets showing ‘Same weights prepended’, embodying fixed, data-independent task knowledge.
(Source: Author)
The three memory modules: Short-Term Memory (STM), Long-Term Memory Module (LMM), and Persistent Memory (PM).

2.2 How are these memory modules implemented?

So, how do these three truly work together? To get started, the STM is essentially the standard self-attention calculation, a staple of vanilla transformers. Its “memory” consists of the attention projection matrices learned during training plus the KV cache it builds over the current context at inference.

PM, on the other hand, is a set of learnable parameters that are prepended to the input sequence. They are learned during training, stay fixed at inference, and act as the “Holy Grail” the model adheres to, no matter what.
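To make that concrete, here is a minimal PyTorch-style sketch (my own illustration, not the paper’s code) of how persistent memory can be realized: a handful of learnable tokens, fixed at inference, prepended to the current segment before the attention (STM) block sees it. The class and argument names (PersistentMemory, num_tokens) are made up for illustration.

# Minimal sketch: persistent memory as learnable tokens prepended to the segment.
import torch
import torch.nn as nn

class PersistentMemory(nn.Module):
    def __init__(self, num_tokens: int = 16, dim: int = 512):
        super().__init__()
        # Learned during training, frozen at inference: the "task knowledge" tokens.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> (batch, num_tokens + seq_len, dim)
        batch = x.shape[0]
        pm = self.tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([pm, x], dim=1)

# Usage: the expanded sequence is what the attention (STM) block actually sees.
pm = PersistentMemory(num_tokens=16, dim=512)
segment = torch.randn(2, 128, 512)      # one chunk of the input (batch, tokens, dim)
stm_input = pm(segment)                 # shape: (2, 16 + 128, 512)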

Fairly easy to understand so far, hmm? Then let us dive into the truly exciting innovation, the one that, although implemented as a simple MLP network, can adapt during test time: the LMM module.

2.3 The Heart of the Titan: The Adaptive Long-Term Memory (LMM) Module

Wait a minute… parameter updates at test time? Isn’t that something we only do during training? Isn’t this basically cheating?

Are these the questions you thought of when you heard the term Test-time training? These are valid questions, but no, it is not cheating. Titans leverage principles from online learning and meta-learning to enable rapid, localized updates tailored specifically for memorization, not general task improvement. It doesn’t look at external labels during test-time to compute gradients and optimize parameters; instead, everything stays self-contained: the model adjusts internally, using only what it already knows and what it sees in the moment.

In human memory, routine and predictable events often fade, while unexpected or surprising moments tend to persist (Mandler, 2014)⁷. This is the core idea behind the implementation of dynamic test-time updates.

2.3.1 How the LMM Learns: Associative Loss Function

The LMM acts as an associative memory: it learns to connect “keys” (cues) to “values” (information). For every new piece of data xt (the input chunk in MAG & MAL, the STM/self-attention output in MAC), three things happen; a minimal code sketch follows the list:

  • Key-Value Extraction: The system first converts xt into a specific key (kt) and an associated value (vt) using learnable transformations (Wk and Wv).
(Source: Author)
Using linear layers to map xt to kt and vt
  • Testing the LMM: The LMM, in its current state, is then “asked”: given this new key kt, what value would you predict? Let’s call its prediction pt.
(Source: Author)
Mt-1: current LMM state;
kt: key for the current chunk
  • Calculating Loss: The loss measures how wrong the LMM’s prediction was:
(Source: Author)
Standard MSE loss between predicted output and “ground truth”
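Putting the three bullets above together, here is the promised sketch in PyTorch-style code. This is my own toy illustration, assuming the LMM is a small two-layer MLP; the variable names (W_k, W_v, k_t, v_t, p_t, M) mirror the article’s symbols and are not the paper’s actual API.

# Toy sketch of the associative-memory objective (not the official Titans code).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
W_k = nn.Linear(dim, dim, bias=False)   # learnable map: x_t -> k_t
W_v = nn.Linear(dim, dim, bias=False)   # learnable map: x_t -> v_t
M = nn.Sequential(                      # the LMM itself: a simple MLP
    nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)
)

x_t = torch.randn(1, dim)               # current chunk (MAG/MAL) or STM output (MAC)
k_t, v_t = W_k(x_t), W_v(x_t)           # 1. key-value extraction
p_t = M(k_t)                            # 2. "ask" the current memory state M_{t-1}
loss = F.mse_loss(p_t, v_t)             # 3. how wrong (how "surprised") was the LMM?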

2.3.2 The Gradient and the “Surprise” Signal

To make the LMM learn from this loss, we incorporate the Surprise Signal, which measures how much the model was “surprised” at seeing the ground truth (vt). This “Surprise” is mathematically defined as the gradient of the loss function with respect to the LMM’s parameters.

(Source: Author)
Measure of “surprise”, i.e., how far the model is from predicting the “correct” vt

A large gradient means xt is highly “surprising” or unexpected given the LMM’s current knowledge.

Basic Learning Step:
The simplest way the LMM then learns is by adjusting its parameters slightly in the direction that would reduce this surprise (i.e., reduce the loss), much like a step in gradient descent:

(Source: Author)
Mt: Updated LMM params;
Mt-1: Previous LMM params;
lr: Learning rate
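Written out in LaTeX with the article’s symbols (my own reconstruction of the two captioned equations above, so the exact typesetting is mine): the surprise is the gradient of the associative loss with respect to the LMM’s parameters, and the basic learning step is a single gradient-descent update.

\[
\text{surprise}_t \;=\; \nabla_{\Theta_M}\,\mathcal{L}\big(M_{t-1}(k_t),\, v_t\big),
\qquad
\Theta_{M_t} \;=\; \Theta_{M_{t-1}} \;-\; lr \cdot \text{surprise}_t
\]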

2.3.3 Refining the Surprise: Smarter Learning with Momentum & Forgetting

Reacting only to immediate “surprise” is not enough. A good memory needs to see trends and also know when to let go of old, irrelevant information.

Smart Learning Direction (ΔΘMt): First, the LMM calculates the best direction to adjust its parameters. This is not just based on the current surprise, but also on a “memory” of recent surprises.

(Source: Author)
Change in parameters is calculated based on previous changes and current surprise
  • ΔΘMt: The proposed change for LMM’s parameters.
  • ηt * ΔΘMt-1: This is momentum — it carries forward the learning trend from the previous step. ηt (data-dependent) decides how much past momentum persists.
  • θt * ∇ Loss_current_surprise: This is the impact of the current surprise. θt (data-dependent) scales its influence.

Final Parameter Update (ΘMt): The LMM then updates its actual parameters, mixing its old knowledge with this new learning direction, and crucially, allowing for “forgetting.”

(Source: Author)
The final update consists of how much to update and how much to retain
  • ΘMt: The LMM’s new parameters after learning from xt.
  • (1 − at) * ΘMt-1: This is how much of the old LMM state is kept. at (data-dependent, between 0 and 1) is the forgetting factor — if at is high, more of the old state is forgotten.
  • ΔΘMt: The smart learning direction calculated above.
Diagram illustrating the LTM module’s update process. Chunked input sequence (e.g., STM output) is projected into Key and Value vectors. The Key vector goes through a forward pass in the LTM module, which, alongside the Value vector, computes a Loss. Gradients from this Loss (via a backward pass without update) are combined with stored previous updates from a Momentum Buffer via weighted sum. This combined update passes through a “Forget” gate which determines new weights for the LTM.
(Source: Author)
The entire LMM update process visualized

In a Nutshell:
The LMM looks at the current data’s “surprise” (∇Loss_current_surprise), blends it with recent learning trends (momentum ΔΘMt-1), and then updates its internal knowledge (ΘMt), deciding how much old information to keep or forget (at) in the process. The data-dependent gates (ηt, θt, at) make it adaptive on the fly.
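As a single picture of the whole update, here is a small PyTorch-style sketch (my own reading of the equations above, not the paper’s code). It follows the paper’s sign convention, where the current gradient enters the momentum term with a minus sign so that the step reduces the loss; eta_t, theta_t, and a_t are treated as plain scalars here, whereas in Titans they are data-dependent gates.

import torch

@torch.no_grad()
def lmm_update(params, momentum, grads, eta_t, theta_t, a_t):
    # params / momentum / grads are parallel lists of tensors for the LMM's MLP.
    for p, s, g in zip(params, momentum, grads):
        s.mul_(eta_t).add_(g, alpha=-theta_t)   # delta_t = eta_t*delta_{t-1} - theta_t*grad
        p.mul_(1.0 - a_t).add_(s)               # theta_t = (1 - a_t)*theta_{t-1} + delta_t

# One step, reusing M and loss from the earlier sketch:
grads = torch.autograd.grad(loss, list(M.parameters()))
momentum = [torch.zeros_like(p) for p in M.parameters()]
lmm_update(list(M.parameters()), momentum, grads, eta_t=0.9, theta_t=0.1, a_t=0.01)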

2.4 Titans’ Architectural Blueprints: Memory in Action

The Google researchers explored three main ways in which these three memory modules could be arranged:

Memory as a Context (MAC)

In this setup, Titans builds an augmented, richer context for the STM (the standard self-attention block); a rough sketch of one MAC step follows the numbered list below.

  1. The very long input sequence is broken into segments or chunks.
  2. When processing, the model maps the chunk to a query and uses it to retrieve relevant historical context from LMM (the query passes through LMM, and its output is the historical context).
  3. These retrieved historical tokens are then concatenated with the static Persistent Memory tokens and current segment tokens.
  4. This entire, expanded sequence (Persistent + Historical + Current) is fed into the STM (Attention) layer, which processes relationships within this large combined context.
  5. The output from the Attention layer, which now reflects a deep understanding of the current data considering history and task knowledge, is then used as the input data for the LMM’s dynamic parameter update process.
  6. The updated LMM is queried again using the same attention result, and its response is then combined, either via a gated sum or partial concatenation, with the original attention to produce the final result.
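Here is the rough sketch promised above: one MAC step in Python-style pseudocode. The callables lmm, attention, and gate, plus the method lmm.update, are hypothetical stand-ins for the components described in Sections 2.2 and 2.3, not the paper’s actual API; tensors are assumed to be shaped (batch, tokens, dim).

import torch

def mac_step(chunk, lmm, attention, persistent_tokens, gate):
    historical = lmm(chunk)                                              # 1-2. retrieve history from the LMM
    context = torch.cat([persistent_tokens, historical, chunk], dim=1)   # 3. persistent + historical + current
    y = attention(context)                                               # 4. STM reads the expanded sequence
    lmm.update(y)                                                        # 5. attention output drives the test-time LMM update
    return gate(y, lmm(y))                                               # 6. blend attention with the updated memory's answer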

Analogy: The text (sequence) arrives in pages (chunks). For each page, an ever-learning note-taker (LMM) quickly finds relevant summaries from past notes and mixes them with essential “rulebook” notes (PM). The student (STM/Attention) reads the whole thing — rulebook, relevant past summaries, AND the current page — and based on what it learns from this enriched context, tells the note-taker exactly which points on the current page were crucial to remember for future summaries. 

The final answer is formed considering both the student’s detailed reading and the note-taker’s updated memory perspective.

A diagram of the Memory as a Context (MAC) architecture. Chunked input queries an LMM for historical context. This, with persistent memory, forms a ‘Richer context’ for the Attention Block (STM). The attention output updates the LMM and combines with the updated LMM’s output for the final result.
(Source: Author)
MAC architecture visualized

Memory as a Gate (MAG)

This design integrates the memories differently: the STM and LMM run in parallel, and their outputs are blended through a dynamic gate (a sketch of one step follows the list below).

  1. The input sequence (Persistent + Current) is fed into two paths:
    One path goes to the STM, which is implemented here using Sliding Window Attention, allowing it to focus on a local neighborhood around each token.
    The other parallel path goes to the LMM network.
  2. The LMM’s input is used to dynamically update its parameters. It also produces a corresponding output representation.
  3. The output from the STM (Sliding Window Attention) and the output from the LMM’s forward pass are then passed through a gating mechanism. This gate learns to blend or choose between the information from local attention and the information represented by the LMM’s state/output.
  4. The output of this gating mechanism is the final output for that token position.
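And the corresponding sketch for MAG (same caveats: hypothetical names, not the paper’s API). The structural change from MAC is that the two branches run in parallel on the same input and are merged by a learned gate, rather than the LMM feeding the attention’s context.

import torch

def mag_step(chunk, lmm, sliding_window_attention, persistent_tokens, gate):
    x = torch.cat([persistent_tokens, chunk], dim=1)   # persistent + current tokens
    local = sliding_window_attention(x)                # STM branch: local detail
    lmm.update(x)                                      # LMM branch: test-time update from the same input...
    big_picture = lmm(x)                               # ...and its output representation
    return gate(local, big_picture)                    # learned blend of the two branches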

Analogy: As each page arrives, two tasks happen side-by-side: The student (STM) focuses intently only on the page in front of them, while the separate note-taker (LMM) reads through the current page and relates it to the content it has seen so far, quietly updating its overall understanding. The student provides an answer based on the paragraph, and the note-taker offers a thought based on its evolving big picture.

The coordinator (Gating mechanism) simply blends or picks from their two outputs for the final response.

An illustration of the Memory as Gating (MAG) architecture. A chunked input sequence is combined with persistent memory tokens. This augmented sequence simultaneously queries the LMM module (also used for update) and an Attention Block (STM). The outputs from the LMM branch and the STM branch are then combined via a gated sum to produce the final output.
(Source: Author)
MAG architecture visualized

Memory as a Layer (MAL)

In this variant, the LMM network is used as an initial processing layer that transforms the sequence before it reaches the attention mechanism (STM); a sketch of one step follows the list below.

  1. The input sequence (again, Persistent + Current) is first passed through the LMM network as a distinct layer.
  2. As the LMM network processes the sequence chunk by chunk, its parameters are dynamically updated. It also produces a transformed output sequence.
  3. This transformed output sequence from the LMM layer is then used as the input for the subsequent STM (Attention) layer (Sliding Window or full attention within windows).
  4. The output from the Attention layer is the final output of the model for that sequence.
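The MAL sketch (again, hypothetical names) is the simplest of the three: the LMM is just a layer that transforms the sequence before attention ever sees it.

import torch

def mal_step(chunk, lmm, attention, persistent_tokens):
    x = torch.cat([persistent_tokens, chunk], dim=1)   # persistent + current tokens
    lmm.update(x)                                      # the LMM updates itself as it reads the chunk...
    transformed = lmm(x)                               # ...and emits a transformed sequence
    return attention(transformed)                      # STM (sliding-window attention) runs on that output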

Analogy: First, every new page goes straight to a main note-taker (LMM) who processes it all, summarizing as it goes and updating its summarizing style along the way. This (potentially less detailed) summary is then handed off to the student (STM). The student only sees and focuses on local parts of this summarized text, basing their answer entirely on what the main note-taker has provided.

A diagram of the Memory as a Layer (MAL) architecture. A chunked input sequence, prepended with persistent memory tokens, feeds into the LMM module for querying and updating. The LMM’s output then serves as input (queries) to the Attention Block (STM), which produces the final output.
(Source: Author)
MAL architecture visualized

3. What do we gain out of all this? Results and Findings

So, now we know everything about the next possible revolution after Transformers. But will it be that big? Did Google’s researchers truly crack the code for models that can remember, adapt, and conquer challenges previously thought impossible? Let’s go through the long list of novel findings one by one:

Language Prowess: More Than Just Words

Titans go far beyond simply predicting the next word a bit more accurately. Thanks to its dynamic Long-Term Memory Module (LMM), it shows a deeper, more intuitive grasp of language and context. When evaluated against strong baselines like Transformer++ and several of the latest recurrent models, Titans consistently outperformed them, not just in language modeling, but also on commonsense reasoning tasks.

(Source: Adapted from Behrouz et al., 2025, Table 1)
Titans’ performance (Hybrid: MAC, MAG, MAL; Simple: LMM) on commonsense and reasoning tasks

The Needle in a Haystack Challenge

Titans’ designs showed remarkably consistent performance on the S-NIAH task from the RULER benchmark (Hsieh et al., 2024)⁸, which was created to assess effective context length. Titans models, including the standalone Neural Memory (the LMM used as a model on its own), maintained strong retrieval rates even at 16K tokens, in contrast to several state-of-the-art recurrent models whose accuracy declined sharply with growing sequence length.

(Source: Behrouz et al., 2025, Table 2)
Titans’ performance (Hybrid: MAC, MAG, MAL; Simple: LMM) on S-NIAH task from RULER (Hsieh et al., 2024)

Mastering Complex Reasoning in BABILong

Retrieving a fact is one thing. But reasoning over multiple facts spread across massive contexts? That’s the real test, and it is exactly what the BABILong benchmark (Kuratov et al., 2024)⁹ demands. Titans (specifically the MAC architecture) didn’t just do well: it outperformed everyone, including much bigger models like GPT-4 and Llama 3.1–70B, even those with access to external tools or retrieval systems, while Titans’ largest model has only 760M parameters!

Apart from that, Titans (MAC hybrid architecture) also managed to score 70% accuracy even at 10 million tokens. To put that into perspective, that’s like navigating and finding puzzle pieces in the entire Harry Potter series… times ten.

(Source: Behrouz et al., 2025, Figure 6)
Accuracy vs. sequence length for different LLMs on BABILong (Kuratov et al., 2024)⁹

Memory Depth vs. Speed

The researchers explored what happens when the Long-Term Memory Module (LMM) is made deeper by stacking more layers. The results? A deeper LMM dramatically improves its ability to store and organize important information, making it less likely to forget crucial details, especially in long-form sequences where most models struggle to maintain context. 

While the LMM alone achieves linear time complexity, enabling efficient processing of massive inputs, deeper LMMs do come with a slight trade-off: reduced throughput, i.e., fewer tokens processed per second.

A line graph displays training throughput (1⁰³ Tokens/Second) against sequence length for LMM models with varying depths (L_M=1, 2, 3, 4). All LMM variants show nearly constant throughput regardless of sequence length, indicating linear scaling. However, deeper LMMs (L_M=3 and L_M=4) exhibit progressively lower throughput than shallower ones (L_M=1 and L_M=2), demonstrating an efficiency trade-off with increased memory depth.
(Source: Behrouz et al., 2025, Figure 8)
Sequence Length Vs. Throughput for different LMM depths

Beyond Language Tasks

Another really exciting fact is that the same memory mechanism worked outside of traditional language tasks. In time series forecasting, a domain known for chaotic, shifting patterns, the Long-Term Memory Module (LMM) held its own against highly specialized models, including those based on Mamba (previous SOTA). 

In DNA modeling, which is a completely different task, the architecture showed strong results. That kind of generality is not easy to come by, and it suggests that memory, when handled well, is not just useful, it is foundational across domains.

(Source: Adapted from Behrouz et al., 2025, Table 3)
Neural Memory’s (LMM as a model) performance on various Time-Series datasets
(Source: Behrouz et al., 2025, Table 4)
Neural Memory Module’s (LMM as a model) performance on Genomic Benchmarks (Grešová et al. 2023)¹⁰

4. Conclusion and Final Thoughts

And that wraps up this deep dive into Titans. Exploring this architecture has been genuinely fun — it is refreshing to see research that goes beyond scaling and instead digs into how memory and learning might actually work in more adaptive, human-like ways.
Google’s legacy of foundational work continues here, from inventing the Transformer to now rethinking how AI can learn during inference. Titans feel like a natural evolution of that spirit.

That said, the AI landscape today is a lot more crowded than it was back in 2017. New ideas, no matter how brilliant, face a steeper path to becoming the default. Performance is just one piece — efficiency, simplicity, and community traction matter more than ever.

Still, Titans make a strong case for a future where models don’t just think with what they already know, but genuinely adapt as they go. Whether this becomes the next “just throw attention at it” moment or not, it is a promising step toward a smarter, more intelligent AI.


5. References:

[1] Tack, Jihoon, et al., “LLM Pretraining with Continuous Concepts” (2025), arXiv preprint arXiv:2502.08524.
[2] Vaswani, Ashish, et al., “Attention Is All You Need” (2017), Advances in Neural Information Processing Systems 30.
[3] Dosovitskiy, Alexey, et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (2020), arXiv preprint arXiv:2010.11929.
[4] Zerveas, George, et al., “A Transformer-Based Framework for Multivariate Time Series Representation Learning” (2021), Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
[5] Rogers, Anna, et al., “A Primer in BERTology: What We Know About How BERT Works” (2021), Transactions of the Association for Computational Linguistics 8: 842–866.
[6] Behrouz, Ali, Peilin Zhong, and Vahab Mirrokni, “Titans: Learning to Memorize at Test Time” (2024), arXiv preprint arXiv:2501.00663.
[7] Mandler, George, “Affect and Cognition” (2014), Psychology Press, 3–36.
[8] Hsieh, Cheng-Ping, et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?” (2024), First Conference on Language Modeling.
[9] Kuratov, Yury, et al., “BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack” (2024), Advances in Neural Information Processing Systems 37: 106519–106554.
[10] Grešová, Katarína, et al., “Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification” (2023), BMC Genomic Data 24.1: 25.
