In my last article, we dove into Google’s Titans — a model that pushes the boundaries of long-context recall by introducing a dynamic memory module that adapts on the fly, kind of like how our own memory works.

It’s a strange paradox. We have AI that can analyze a 10-million-word document, yet it still fumbles questions like: “How many ‘r’s are in the word strawberry?”

The problem isn’t the AI’s brain; it’s the eyes. The first step in how these models read, tokenization, essentially pre-processes language for them. In doing so, it strips away the rich, messy details of how letters form words; the whole world of sub-word information just vanishes.


1. Lost in Tokenization: Where Subword Semantics Die

Language, for humans, begins as sound, spoken long before it is written. Yet it is through writing and spelling that we begin to grasp the compositional structure of language. Letters form syllables, syllables form words, and from there, we build conversations. This character-level understanding allows us to correct, interpret, and infer even when the text is noisy or ambiguous. In contrast, language models skip this phase entirely. They are never exposed to characters or raw text as-is; instead, their entire perception of language is mediated by a tokenizer.

This tokenizer, ironically, is the only component in the entire pipeline that is not learned. It is dumb, fixed, and entirely based on heuristics, despite sitting at the entry point of a model designed to be deeply adaptive. In effect, tokenization sets the stage for learning, but without any learning of its own.

Moreover, tokenization is extremely brittle. A minor typo, say "strawverry" instead of "strawberry", can yield a completely different token sequence, even though the semantic intent remains obvious to any human reader. This sensitivity, instead of being handled at the source, is passed downstream, forcing the model to interpret a corrupted input. Worse still, optimal tokenizations are highly domain-dependent. A tokenizer trained on everyday English text may perform beautifully for natural language but fail miserably on source code, producing long and semantically awkward token chains for variable names like user_id_to_name_map.

The language pipeline is like a spinal cord: the higher up it's compromised, the more it cripples everything downstream. Sitting right at the top, a flawed tokenizer distorts the input before the model even begins reasoning. No matter how smart the architecture is, it's working with corrupted signals from the start.

(Source: Author)
How a simple typo can waste an LLM's "thinking power" just to rectify it

2. Behold! Byte Latent Transformer

If tokenization is the brittle foundation holding modern LLMs back, the natural question follows: why not eliminate it entirely? That’s precisely the radical direction taken by researchers at Meta AI with the Byte Latent Transformer (BLT) (Pagnoni et al. 2024)1. Rather than operating on words, subwords, or even characters, BLT models language from raw bytes — the most fundamental representation of digital text. This enables LLMs to learn the language from the very ground up, without the tokenizer being there to eat away at the subword semantics.

But modeling bytes directly is far from trivial. A naïve byte-level Transformer would choke on input lengths several times longer than tokenized text — one million words become nearly five million bytes (1 word = 4.7 characters on average, and 1 character = 1 byte), making attention computation infeasible due to its quadratic scaling. BLT circumvents this by introducing a dynamic two-tiered system: easy-to-predict byte segments are compressed into latent “patches,” significantly shortening the sequence length. The full, high-capacity model is then selectively applied, focusing its computational resources only where linguistic complexity demands it.

(Source: Adapted from Pagnoni et al. 2024, Figure 2)
Zoomed-out view of the entire Byte Latent Transformer architecture

2.1 How does it work?

The model can be conceptually divided into three primary components, each with a distinct responsibility:

2.1.1 The Local Encoder:

The primary function of the Local Encoder is to transform a long input sequence of Nbytes raw bytes, b = (b1, b2, …, bN_bytes), into a much shorter sequence of Npatches latent patch representations, p = (p1, p2, …, pN_patches).

Step 1: Input Segmentation and Initial Byte Embedding

The input sequence is segmented into patches based on a pre-defined strategy, such as entropy-based patching. This provides patch boundary information but does not alter the input sequence itself. This patch boundary information will come in handy later.

(Source: Pagnoni et al. 2024, Figure 3)
Different strategies for patching, visualized
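To make entropy-based patching more concrete, here is a minimal sketch (assuming PyTorch): a small byte-level language model scores each position, and a new patch starts wherever the next-byte entropy crosses a global threshold. The function name, the threshold value, and the random logits standing in for a real byte LM are all illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def entropy_patch_boundaries(byte_logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """Mark positions where a new patch should start.

    byte_logits: [N_bytes, 256] next-byte logits from a small byte-level LM
                 (the paper trains a lightweight LM just for patching).
    threshold:   global entropy threshold; higher entropy means the next byte is
                 harder to predict, so a new patch begins there.
    Returns a boolean tensor of shape [N_bytes], True where a patch begins.
    """
    probs = F.softmax(byte_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)   # [N_bytes]
    boundaries = entropy > threshold
    boundaries[0] = True          # the first byte always opens a patch
    return boundaries

# Illustrative usage with random logits standing in for a real byte LM:
logits = torch.randn(20, 256)
starts = entropy_patch_boundaries(logits, threshold=5.0)
patch_ids = torch.cumsum(starts.long(), dim=0) - 1   # patch index for every byte
```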

The first operation within the encoder is to map each discrete byte value (0-255) into a continuous vector representation. This is achieved via a learnable embedding matrix, Ebyte (shape: [256, he]), where he is the hidden dimension of the local module.
Input: A tensor of byte IDs of shape [B, Nbytes], where B is the batch size.
Output: A tensor of byte embeddings, X (shape: [B, Nbytes, he]).
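A minimal sketch of this lookup, assuming PyTorch (the sizes below are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

B, N_bytes, h_e = 2, 512, 256          # illustrative sizes
E_byte = nn.Embedding(num_embeddings=256, embedding_dim=h_e)  # one row per byte value

byte_ids = torch.randint(0, 256, (B, N_bytes))   # [B, N_bytes] raw byte IDs
X = E_byte(byte_ids)                             # [B, N_bytes, h_e] byte embeddings
```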

Step 2: Contextual Augmentation via N-gram Hashing

To enrich each byte representation with local context beyond its individual identity, the researchers employ a hash-based n-gram embedding technique. For each byte bi at position i, a set of preceding n-grams, gi,n = {bi−n+1, …, bi}, is constructed for multiple values of n ∈ {3, …, 8}.

These n-grams are mapped via a hash function to indices within a second, separate embedding table, Ehash (shape: [Vhash, he]), where Vhash is a fixed, large vocabulary size (i.e., the number of hash buckets).

The resulting n-gram embeddings are summed with the original byte embedding to produce an augmented representation, ei. This operation is defined as:

(Source: Author)
Explanation: Look up the hash of the n-gram in the embedding table and add it to the respective byte embedding, for all n ∈ [3,8]

where xi is the initial embedding for byte bi.
The shape of the tensor E = {e1, e2,…,eN_bytes} remains [B, Nbytes, he].
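Here is a hedged sketch of the augmentation step, again assuming PyTorch. The rolling hash and the bucket count Vhash are illustrative stand-ins; the paper's exact hash function may differ, and a real implementation would vectorize this instead of looping:

```python
import torch
import torch.nn as nn

V_hash, h_e = 500_000, 256            # hash bucket count and local width (illustrative)
E_byte = nn.Embedding(256, h_e)       # byte embedding table from Step 1
E_hash = nn.Embedding(V_hash, h_e)    # shared table for all hashed n-grams

def ngram_hash(ngram, n, v_hash):
    """Simple polynomial rolling hash; the paper's exact hash function may differ."""
    h = n                              # mix in the n-gram length
    for b in ngram:
        h = (h * 257 + b) % v_hash
    return h

def augment_with_ngrams(byte_ids, X):
    """e_i = x_i + sum over n in 3..8 of E_hash(hash(b_{i-n+1..i})). Readable, not fast."""
    B, N, _ = X.shape
    rows = []
    for b in range(B):
        ids = byte_ids[b].tolist()
        augmented = []
        for i in range(N):
            e_i = X[b, i]
            for n in range(3, 9):
                if i - n + 1 < 0:
                    continue                       # not enough preceding bytes for this n
                g = tuple(ids[i - n + 1 : i + 1])  # the n-gram ending at position i
                e_i = e_i + E_hash(torch.tensor(ngram_hash(g, n, V_hash)))
            augmented.append(e_i)
        rows.append(torch.stack(augmented))        # [N_bytes, h_e]
    return torch.stack(rows)                       # [B, N_bytes, h_e]

byte_ids = torch.randint(0, 256, (2, 32))            # [B, N_bytes]
E = augment_with_ngrams(byte_ids, E_byte(byte_ids))  # [B, N_bytes, h_e]
```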

Step 3: Iterative Refinement with Transformer and Cross-Attention Layers

The core of the Local Encoder consists of a stack of le identical layers. Each layer performs a two-stage process to refine byte representations and distill them into patch representations.

Step 3a: Local Self-Attention: 
The input is processed by a standard Transformer block. This block uses a causal self-attention mechanism with a limited attention window, meaning each byte representation is updated by attending only to a fixed number of preceding byte representations. This ensures computational efficiency while still allowing for contextual refinement.

Input: If it’s the first layer, the input is the context-augmented byte embedding E; otherwise, it receives the output from the previous local Self-Attention layer.

(Source: Author)
Hl: Input for the current Self-Attention layer
E: Context-Augmented Byte Embedding from Step 2
Hl-1: Output from the previous Self-Attention layer

Output: More contextually aware byte representations, Hl (shape: [B, Nbytes, he])
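A minimal sketch of the limited-window causal attention, assuming PyTorch's built-in multi-head attention; the window size and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

def local_causal_mask(n: int, window: int) -> torch.Tensor:
    """Boolean mask [n, n]: position i may attend to j only if i - window < j <= i."""
    i = torch.arange(n).unsqueeze(1)          # [n, 1]
    j = torch.arange(n).unsqueeze(0)          # [1, n]
    return (j <= i) & (j > i - window)

B, N_bytes, h_e, window = 2, 512, 256, 128    # illustrative sizes, not the paper's
H = torch.randn(B, N_bytes, h_e)              # byte representations entering the layer
attn = nn.MultiheadAttention(h_e, num_heads=8, batch_first=True)
mask = ~local_causal_mask(N_bytes, window)    # True = "do not attend" for attn_mask
H_next, _ = attn(H, H, H, attn_mask=mask)     # [B, N_bytes, h_e]
```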

Step 3b: Multi-Headed Cross-Attention:
The purpose of the Cross-Attention is to distill the fine-grained, contextual information captured in the byte representations and inject it into the more abstract patch representations, giving them a rich awareness of their constituent sub-word structures. This is achieved through a cross-attention mechanism where patches “query” the bytes they contain.

Queries (Q): The patch embeddings are projected using a simple linear layer to form the queries.
For any subsequent layer (l>0), the patch embeddings are simply the refined patch representations output by the cross-attention block of the previous layer, P(l−1).
However, for the very first layer (l=0), these patch embeddings must be created from scratch. This initialization is a three-step process:

  1. Gathering: Using the patch boundary information obtained in Step 1, the model gathers the byte representations from H0 that belong to each patch. For a single patch, this results in a tensor of shape (Nbytes_per_patch, he). After padding each patch representation to be of the same length, if there are J patches, the shape of the entire concatenated tensor becomes:
    (B, J, Nbytes_per_patch, he).
  2. Pooling: To summarize the vector for each patch, a pooling operation (e.g., max-pooling) is applied across the Nbytes_per_patch dimension. This effectively summarizes the most salient byte-level features within the patch.
    • Input Shape: (B, J, Nbytes_per_patch, he)
    • Output Shape: (B, J, he)
  3. Projection: This summarized patch vector, still in the small local dimension he, is then passed through a dedicated linear layer to the global dimension, hg, where he ≪ hg. This projection is what bridges the local and global modules.
    • Input Shape: (B, J, he)
    • Output Shape: (B, J, hg)
(Source: Author)
Summary of the 3-step process to get the first patch embeddings:
1. Gathering and pooling the bytes for each respective patch.
2. Concatenating the patches to a single tensor.
3. Projection of the patch embedding tensor to the global dimension.

The patch representations, obtained either from the previous cross-attention block's output or initialized from scratch, are then fed into a linear projection layer to form queries (see the sketch after this list).

  • Input Shape: (B, J, hg)
  • Output Shape: (B, J, da), where da is the “attention dimension”.
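The three-step initialization and the query projection can be sketched as follows, assuming PyTorch. The byte-to-patch mapping (patch_ids), the pooling choice, and all sizes are illustrative; a real implementation pads patches to a common length rather than indexing them one by one:

```python
import torch
import torch.nn as nn

B, N_bytes, h_e, h_g, d_a = 1, 12, 256, 1024, 512        # illustrative sizes
J = 3                                                     # patches in this toy example

H0 = torch.randn(B, N_bytes, h_e)                         # byte states from Step 3a
patch_ids = torch.tensor([[0,0,0,0, 1,1,1, 2,2,2,2,2]])   # byte -> patch map from Step 1

# 1. Gather + 2. Pool: max-pool the bytes belonging to each patch.
pooled = torch.stack(
    [H0[:, (patch_ids[0] == j), :].max(dim=1).values for j in range(J)],
    dim=1,
)                                                         # [B, J, h_e]

# 3. Projection: bridge the local dimension h_e to the global dimension h_g.
to_global = nn.Linear(h_e, h_g)
P0 = to_global(pooled)                                    # [B, J, h_g] initial patch embeddings

# Query projection used by the cross-attention block.
W_q = nn.Linear(h_g, d_a)
Q = W_q(P0)                                               # [B, J, d_a]
```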

Keys and Values: These are derived from the byte representations Hl from Step 3a. They are projected from dimension he to an intermediate attention dimension, da, via independent linear layers:

(Source: Author)
Projection of the Self-Attention output from Step 3a to Keys and Values.
(Source: Author)
Overview of the Information flow in the Local Encoder
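Putting the cross-attention together, here is a minimal single-head sketch, assuming PyTorch. The residual connection, the absence of per-patch masking (the paper restricts each patch query to its own bytes), and all sizes are simplifications for readability:

```python
import math
import torch
import torch.nn as nn

B, N_bytes, J, h_e, h_g, d_a = 1, 12, 3, 256, 1024, 512   # illustrative sizes

H_l = torch.randn(B, N_bytes, h_e)   # byte states from the self-attention step (3a)
P_prev = torch.randn(B, J, h_g)      # patch states (initialized as above at layer 0)

W_q = nn.Linear(h_g, d_a)            # queries come from patches
W_k = nn.Linear(h_e, d_a)            # keys come from bytes
W_v = nn.Linear(h_e, d_a)            # values come from bytes
W_o = nn.Linear(d_a, h_g)            # map the attended result back to the global dim

Q = W_q(P_prev)                      # [B, J, d_a]
K = W_k(H_l)                         # [B, N_bytes, d_a]
V = W_v(H_l)                         # [B, N_bytes, d_a]

# Single-head scaled dot-product attention for readability; the real model is
# multi-headed and lets each patch attend only to its own constituent bytes.
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_a)   # [B, J, N_bytes]
attn = scores.softmax(dim=-1)
P_l = P_prev + W_o(attn @ V)                        # [B, J, h_g] refined patch states
```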

2.1.2 The Latent Global Transformer

The sequence of patch representations generated by the Local Encoder is passed to the Latent Global Transformer. This module serves as the primary reasoning engine of the BLT model. It is a standard, high-capacity autoregressive Transformer composed of lg self-attention layers, where lg is significantly larger than the number of layers in the local modules.

Operating on patch vectors (shape: [B, J, hg]), this transformer performs full self-attention across all patches, enabling it to model complex, long-range dependencies efficiently. Its sole function is to predict the representation of the next patch, oj (shape: [B, 1, hg]), in the sequence based on all preceding ones. The output is a sequence of predicted patch vectors, Oj (shape: [B, J, hg]), which encode the model’s high-level predictions.

(Source: Author)
oj is the patch that contains the information for the next prediction
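Conceptually, the global module is just a causal Transformer over patch vectors. The sketch below uses PyTorch's stock encoder blocks with a causal mask as a stand-in for the paper's decoder-style layers; depth and widths are illustrative:

```python
import torch
import torch.nn as nn

B, J, h_g = 1, 3, 1024                 # illustrative sizes
l_g = 4                                # kept tiny here; the real global model is far deeper

layer = nn.TransformerEncoderLayer(d_model=h_g, nhead=16, batch_first=True)
global_transformer = nn.TransformerEncoder(layer, num_layers=l_g)

P = torch.randn(B, J, h_g)                                   # patches from the Local Encoder
causal = nn.Transformer.generate_square_subsequent_mask(J)   # -inf above the diagonal
O = global_transformer(P, mask=causal)                       # [B, J, h_g]
o_next = O[:, -1:, :]                                        # [B, 1, h_g] next-patch prediction
```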

2.1.3 The Local Decoder

The final architectural component is the Local Decoder, a lightweight Transformer that decodes the predicted patch vector oj (the last position of the global model's output, Oj) back into a sequence of raw bytes. It operates autoregressively, generating one byte at a time.

The generation process, designed to be the inverse of the encoder, starts from the hidden state of the last byte in the encoder's output, Hl. From then on, in typical autoregressive fashion, each newly predicted byte's hidden state (d'k) becomes the input that guides the next generation step.

Cross-Attention: The last byte’s state of the encoder’s output Hl[:,-1,:] (acting as query, with shape: [B, 1, he]) attends to the target patch vector oj (acting as Key and Value). This step injects the high-level semantic instruction from the patch concept into the byte stream.

The query vectors are projected to an attention dimension, da, while the patch vector is projected to create the key and value. This alignment ensures the generated bytes are contextually relevant to the global prediction.

(Source: Author)
The general equations, which encapsulate what Query, Key, and Value are.
d’k: The (k+1)-th predicted byte’s hidden state from the decoder.

Local Self-Attention: The resulting patch-aware byte representations are then processed by a causal self-attention mechanism. This allows the model to consider the sequence of bytes already generated within the current patch, enforcing local sequential coherence and correct character ordering.

After passing through all ld layers, each including the above two stages, the hidden state of the last byte in the sequence is projected by a final linear layer to a 256-dimensional logit vector. A softmax function converts these logits into a probability distribution over the byte vocabulary, from which the next byte is sampled. This new byte is then embedded and appended to the input sequence for the subsequent generation step, continuing until the patch is fully decoded.

(Source: Author)
Overview of the Information flow in Local Decoder
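To tie the decoder steps together, here is a heavily simplified, greedy decoding sketch in PyTorch. The single cross-attention head, the omitted local self-attention over already-generated bytes, the fixed byte budget per patch, and the reuse of a byte embedding table are all assumptions made for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

h_e, h_g, d_a = 256, 1024, 512         # illustrative sizes

E_byte = nn.Embedding(256, h_e)        # byte embedding for decoder inputs (assumption)
W_q = nn.Linear(h_e, d_a)              # byte states act as queries
W_k = nn.Linear(h_g, d_a)              # the predicted patch o_j provides keys...
W_v = nn.Linear(h_g, d_a)              # ...and values
W_o = nn.Linear(d_a, h_e)
to_logits = nn.Linear(h_e, 256)        # final projection over the byte vocabulary

def decode_patch(o_j: torch.Tensor, last_byte_state: torch.Tensor, max_bytes: int = 8):
    """Greedy sketch of autoregressive byte decoding for one patch.

    o_j:             [B, 1, h_g] predicted patch vector from the global model
    last_byte_state: [B, 1, h_e] hidden state of the last byte from the encoder
    """
    state, out_bytes = last_byte_state, []
    for _ in range(max_bytes):                      # real stopping depends on the patching
        q = W_q(state)                              # [B, 1, d_a]
        k, v = W_k(o_j), W_v(o_j)                   # [B, 1, d_a] each
        attn = F.softmax(q @ k.transpose(-2, -1) / d_a**0.5, dim=-1) @ v
        state = state + W_o(attn)                   # patch-aware byte state
        # (local self-attention over previously generated bytes omitted for brevity)
        logits = to_logits(state)                   # [B, 1, 256]
        next_byte = logits.argmax(dim=-1)           # greedy; sampling also possible
        out_bytes.append(next_byte.item())
        state = E_byte(next_byte)                   # feed the new byte back in
    return out_bytes

decoded = decode_patch(torch.randn(1, 1, h_g), torch.randn(1, 1, h_e))
```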

3. The Verdict: Bytes Are Better Than Tokens!

The Byte Latent Transformer could genuinely be an alternative to vanilla tokenization-based Transformers at scale. Here are a few convincing reasons for that argument:

1. Byte-Level Models Can Match The Ones Based On Tokens.
One of the main contributions of this work is that byte-level models, for the first time, can match the scaling behavior of state-of-the-art token-based architectures such as LLaMA 3 (Grattafiori et al. 2024)2. When trained under compute-optimal regimes, the Byte Latent Transformer (BLT) exhibits performance scaling trends comparable to those of models using byte pair encoding (BPE). This finding challenges the long-standing assumption that byte-level processing is inherently inefficient, showing instead that with the right architectural design, tokenizer-free models also have a shot.

(Source: Adapted from Pagnoni et al. 2024, Figure 6)
BLT showing competitive BPB (perplexity equivalent for byte models) and similar scaling laws to those of the tokenizer-based LLaMA models

2. A New Scaling Dimension: Trading Patch Size for Model Size.
The BLT architecture decouples model size from sequence length in a way that token-based models cannot. By dynamically grouping bytes into patches, BLT can use longer average patches to save on compute. This saved compute can be reallocated to increase the size and capacity of the main Latent Global Transformer while keeping the total inference cost (FLOPs) constant. The paper shows this new trade-off is highly beneficial: larger models operating on longer patches consistently outperform smaller models operating on shorter tokens/patches for a fixed inference budget.
This means you can have a larger and more capable model — at no extra compute cost!

(Source: Adapted from Pagnoni et al. 2024, Figure 1)
The steeper scaling curves of the larger BLT models allow them to surpass the performance of the token-based Llama models after the crossover point.

3. Subword Awareness Through Byte-Level Modeling
By processing raw bytes directly, BLT avoids the information loss typically introduced by tokenization, gaining access to the internal structure of words — their spelling, morphology, and character-level composition. This results in a heightened sensitivity to subword patterns, which the model demonstrates across several benchmarks.
On CUTE (Character-level Understanding and Text Evaluation) (Edman et al., 2024)3, BLT excels at tasks involving fine-grained edits like character swaps or substitutions, achieving near-perfect accuracy on spelling tasks where models like LLaMA 3 fail entirely.
Similarly, on noised HellaSwag (Zellers et al., 2019)4, where inputs are perturbed with typos and case variations, BLT retains its reasoning ability far more effectively than token-based models. These results point to an inherent robustness that token-based models cannot match even when trained on significantly more data.

(Source: Pagnoni et al. 2024, Table 3)
The model’s direct byte-level processing leads to massive gains on character manipulation (CUTE) and noise robustness (HellaSwag Noise Avg.), tasks that challenge token-based architectures.

4. BLT Shows Stronger Performance on Low-Resource Languages.
Fixed tokenizers, typically trained mostly on English or other high-resource languages, can be inefficient and inequitable for low-resource languages, frequently breaking their words down into individual bytes (a phenomenon known as "byte-fallback"). Because BLT is inherently byte-based, it treats all languages equally from the start. The results show this leads to improved performance in machine translation, particularly for languages with scripts and morphologies that are poorly represented in standard BPE vocabularies.

(Source: Pagnoni et al. 2024, Table 4)
Machine translation performance on the FLORES-101 benchmark (Goyal et al., 2022)5. Comparable performance on high-resource languages, but superior for low-resource languages, outperforming the LLaMA 3 model.

5. Dynamic Allocation Of Compute: Not Every Word Is Equally Deserving
A key strength of the BLT architecture lies in its ability to dynamically allocate computation based on input complexity. Traditional models expend a fixed amount of compute per token, treating simple words like "the" and complex ones like "antidisestablishmentarianism" with equal cost. BLT, in contrast, ties its computational effort to the structure of its learned patches. The high-capacity Latent Global Transformer operates only on patches, so BLT can form longer patches over predictable, low-complexity sequences and shorter patches over regions requiring deeper reasoning. This lets the model focus its most powerful component where it is needed most, while offloading routine byte-level decoding to the lighter local decoder, yielding a far more efficient and adaptive allocation of resources.


4. Final Thoughts And Conclusion

For me, what makes BLT exciting isn’t just the benchmarks or the novelties, it’s the idea that a model can move beyond the superficial wrappers we call “languages” — English, Japanese, even Python — and start learning directly from the raw bytes, the fundamental substrate of all communication. I love that. A model that doesn’t rely on a fixed vocabulary, but instead learns structure from the ground up? That feels like a real step toward something more universal.

Of course, something this different won't be embraced overnight. Tokenizers are baked into everything: our models, our tools, our intuition. Ditching them means rethinking a foundational building block of the entire AI ecosystem. But the upside is hard to ignore. Perhaps, rather than adopting the complete architecture, future systems will integrate some of its ideas.


5. References

[1] Pagnoni, Artidoro, et al. "Byte Latent Transformer: Patches scale better than tokens." arXiv preprint arXiv:2412.09871 (2024).
[2] Grattafiori, Aaron, et al. "The Llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
[3] Edman, Lukas, Helmut Schmid, and Alexander Fraser. "CUTE: Measuring LLMs' Understanding of Their Tokens." arXiv preprint arXiv:2409.15452 (2024).
[4] Zellers, Rowan, et al. "HellaSwag: Can a machine really finish your sentence?" arXiv preprint arXiv:1905.07830 (2019).
[5] Goyal, Naman, et al. "The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation." Transactions of the Association for Computational Linguistics 10 (2022): 522-538.
