Why LLMs Cannot Predict Financial Returns by Tokenizing a Price Sequence
Transformers and LLMs work because language has syntax, semantics, and recoverable structure. Financial return sequences have none of that. A deep look at why tokenizing returns the way you tokenize words is a category error.
There is a test that gets run constantly in quantitative finance circles: take a hot new model architecture, point it at a price or return sequence, and see if it outperforms a baseline.
The baselines are usually embarrassingly simple: ARIMA, a ridge regression, a naive carry-forward. And the hot new architecture usually loses, or at best ties at a level indistinguishable from chance.
This is not a bug in the model architecture. It is a structural mismatch between what the model was built to learn and what financial returns actually are.
What Tokenization Assumes
Tokenization, as used in LLMs and transformers, is built around a core assumption: that your input sequence has recoverable semantic structure encoded in the ordering and co-occurrence of discrete tokens.
When you train GPT on English text, the attention mechanism discovers that “the”, “a”, and “an” are interchangeable in certain positions. That “Paris” co-occurs with “France” far more than with “Brazil”. That “ran” and “running” and “run” map to nearby regions of the embedding space because they share context windows across billions of sentences.
This works because language is not random. There are grammar rules, semantic relationships, and shared world knowledge baked into the patterns of token co-occurrence. The model is not memorising sequences, it is learning the underlying structure that generates them.
Now consider a daily return sequence for a liquid equity:
+0.0082, -0.0124, +0.0031, -0.0006, +0.0215, -0.0088, ...
What structure is there for a model to learn?
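The absence of structure is measurable. Under the standard null model of i.i.d. daily returns, the lag-1 autocorrelation is statistically indistinguishable from zero. A minimal pure-Python sketch, using simulated data rather than any real series:

```python
import random

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

random.seed(0)
# Simulated daily returns: zero mean, ~1% daily vol (a common null model).
returns = [random.gauss(0.0, 0.01) for _ in range(10_000)]
print(round(lag1_autocorr(returns), 4))  # close to 0: no exploitable linear structure
```

Real daily equity returns show essentially the same picture at this horizon: whatever autocorrelation exists is too small and too unstable to trade on after costs.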
The Structural Difference
In language, the same word in the same context almost always means the same thing. “The bank by the river” versus “the bank after hours” — ambiguity is the exception, not the rule. The model can learn to resolve it with enough context.
In return sequences, the same numerical value in the same local context can precede entirely different future values. A return of +0.02 on a Monday following three down days means something completely different depending on whether there was a Fed announcement, an index rebalancing, a short squeeze, or nothing at all.
The signal that determines tomorrow’s return is not sitting in yesterday’s return sequence. It is in order flow, positioning data, macro regime, earnings revision cycles, funding rates, and dozens of other signals that are invisible to a model looking only at the historical series.
Language has grammar. Returns do not.
What the Research Actually Shows
In 2022, Zeng et al. published “Are Transformers Effective for Time Series Forecasting?” (arXiv:2205.13504). The paper introduced LTSF-Linear, a set of embarrassingly simple one-layer linear models, and benchmarked them against the best transformer-based time series architectures of the day.
The result was damning: the linear models outperformed the transformers on almost every dataset tested, often by a large margin.
The reason given is fundamental to how attention works. The self-attention mechanism is permutation-invariant by design. Given a sequence, attention computes a weighted sum over all positions regardless of their order. Positional encoding patches this somewhat, but for time series where temporal order is the entire point, this invariance is a structural liability, not a minor flaw. The model is architecturally predisposed to lose the very information that matters most.
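The permutation property is easy to verify directly. The sketch below implements plain dot-product self-attention with identity Q/K/V projections (real transformers add learned projections and positional encodings, but the invariance of the attention core is the same): shuffling the input rows just shuffles the output rows identically, so the temporal ordering itself carries no information.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    """Plain dot-product self-attention, identity Q/K/V projections."""
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in X]
        w = softmax(scores)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, X))
                    for d in range(len(q))])
    return out

# A toy "return sequence" embedded as 2-d vectors.
X = [[0.8, -0.1], [-1.2, 0.4], [0.3, 0.9], [-0.5, -0.7]]
perm = [2, 0, 3, 1]                 # reorder the time steps
X_perm = [X[i] for i in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)

# Output rows are permuted the same way: attention never saw the order.
assert all(math.isclose(Y_perm[j][d], Y[perm[j]][d])
           for j in range(4) for d in range(2))
print("attention is permutation-equivariant")
```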
In 2023, Gruver et al. pushed this frontier specifically with language models, publishing “Large Language Models Are Zero-Shot Time Series Forecasters” (LLMTime, NeurIPS 2023, arXiv:2310.07820). They encode time series as strings of numbers and let GPT-3, GPT-4, and LLaMA-2 do next-token prediction in text space.
The results are more nuanced but equally instructive. Where LLMs show competitive performance is on series with seasonal patterns and trends — things like temperature cycles, traffic data, and electricity consumption. These have genuine repetitive structure that an LLM can exploit by leaning on its prior for repetition and periodicity learned from general text.
For financial return series? The LLMTime paper does not claim strong performance there for a simple reason: daily equity returns are specifically the case where the structure that LLMs exploit does not exist.
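The contrast is easy to reproduce with synthetic data. In the sketch below (simulated series, illustrative only), a seasonal-naive forecast — copy the value from one period ago — does well on a periodic series, while on i.i.d. returns the analogous carry-forward forecast loses to simply predicting zero:

```python
import math
import random

def mae(pred, true):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

random.seed(1)

# A seasonal series (temperature-like): period-24 cycle plus noise.
seasonal = [10 * math.sin(2 * math.pi * t / 24) + random.gauss(0, 0.5)
            for t in range(480)]
# Seasonal-naive forecast: repeat the value from one period ago.
err_seasonal = mae(seasonal[:-24], seasonal[24:])

# I.i.d. Gaussian "returns": there is no cycle to copy.
rets = [random.gauss(0, 0.01) for _ in range(480)]
err_returns = mae(rets[:-1], rets[1:])    # naive carry-forward
err_zero = mae([0.0] * 479, rets[1:])     # just predict zero

print(err_seasonal < 1.0)       # periodicity makes the series forecastable
print(err_returns > err_zero)   # carry-forward loses to predicting zero
```

Any model that forecasts by extrapolating repetition, LLMs included, inherits exactly this split: strong where a cycle exists, worse than a constant where it does not.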
The paper also makes a revealing side observation: GPT-4 performs worse than GPT-3 on time series forecasting, partly because of how it tokenizes numbers. We will get to that next.
The Tokenization Problem Is Also a Literal Problem
Even if financial returns had predictable structure, there is a more immediate technical problem: LLM tokenizers were not built for continuous numerical values.
BPE (Byte Pair Encoding), the tokenizer used by most GPT-family models, treats numbers as text. The value -0.0128 might get split into ['-', '0', '.', '01', '28'] or some other arbitrary fragmentation depending on what subword token boundaries happen to land on. The token 01 has the same ID whether it is part of a date, a serial number, or a return value.
This means:
- Numerical proximity is not preserved in the token space. -0.0128 and -0.0129 might tokenize completely differently.
- The precision of the return value is lost. A model working in token space cannot easily represent “this is 0.0001 larger than the previous value.”
- The model must learn to reconstruct numerical magnitude from arbitrary character sequences, which it can sometimes do but inconsistently.
The LLMTime authors worked around this by proposing specialised tokenization procedures for numerical data. But even with that, the fundamental problem remains: the return sequence lacks the co-occurrence structure that makes token-based learning work.
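The fragmentation effect can be shown with a toy greedy longest-match tokenizer. The vocabulary below is invented for illustration — it is not GPT's real merge table — but it mimics the BPE behaviour that frequent substrings become single tokens while rare ones fall back to characters, so two adjacent return values end up with completely different token sequences:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword tokenization (BPE-like behaviour)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Hypothetical subword vocabulary: "0128" happened to be frequent enough
# to merge, "0129" did not. (Illustrative, not GPT's actual vocabulary.)
vocab = {"-0.", "0128", "0", "1", "2", "8", "9", "."}

print(greedy_tokenize("-0.0128", vocab))  # ['-0.', '0128']
print(greedy_tokenize("-0.0129", vocab))  # ['-0.', '0', '1', '2', '9']
```

Two values that differ by 0.0001 produce token sequences of different lengths with no shared suffix, so nothing in the embedding space reflects their numerical proximity.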
Why Language Models Have a Strong Prior That Hurts Here
There is a less obvious failure mode that is worth naming directly.
A pretrained LLM has seen enormous amounts of financial commentary, earnings analysis, market news, and economic reporting. It has learned that markets “tend to recover”, that “volatility clusters”, that “bull runs follow corrections”. These are pop-finance narratives that have some statistical truth at long horizons.
But when you ask the same model to predict next-period returns from a raw numerical sequence, those learned narratives become misleading priors. The model may lean towards regression-to-mean or momentum narratives simply because that is what financial text says, not because the signal in the current return sequence supports it.
A purpose-built time series model trained from scratch has no such prior. It sees only what is in the data. That is often better.
The Efficient Market Hypothesis Is Doing Real Work Here
There is a deeper reason why all of this is hard, which predates machine learning entirely.
The Efficient Market Hypothesis (Fama, 1970) states that, in an efficient market, prices reflect all available information. If tomorrow’s return were predictable from yesterday’s return sequence alone, every market participant with a computer would already be trading on it, and the arbitrage would immediately remove the signal.
This is not a perfect description of reality. Markets are not perfectly efficient. Microstructure effects, liquidity constraints, and behavioural patterns create short-lived predictable inefficiencies. But those inefficiencies, when found, get traded into oblivion quickly.
The practical implication is that the residual information content in a clean return sequence is very low by construction. The market is adversarially optimised to remove it. Training an LLM on an adversarially cleaned signal and expecting it to extract semantic structure is asking the model to find patterns that arbitrageurs have already eliminated.
What Might Actually Work
None of this means machine learning has no role in finance. The distinction is between predicting returns from returns and using ML for something else.
What ML can do well:
- Classifying market regime (trending vs. mean-reverting) using multi-dimensional feature sets.
- Processing unstructured text — earnings calls, macro reports, news — to extract signals. Here LLMs are genuinely strong because the input is actual language with actual semantic structure.
- Sizing and execution: given a signal from another source, deciding how large to trade and when, factoring in market impact estimates.
- Anomaly detection in order flow, where you are looking for unusual microstructure patterns rather than predicting direction.
- Cross-sectional ranking: predicting which assets in a universe will outperform each other given a rich feature set, which is a much more tractable problem than predicting absolute returns.
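Cross-sectional ranking is also easy to evaluate honestly: the standard metric is the rank information coefficient, the Spearman correlation between model scores and realised returns across the universe on a given date. A minimal sketch with made-up numbers (five hypothetical assets, not real data):

```python
def ranks(xs):
    """0-based ranks of values; ties broken by position."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def rank_ic(scores, realised):
    """Spearman rank correlation between scores and realised returns."""
    rs, rr = ranks(scores), ranks(realised)
    n = len(rs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rs, rr))
    var = sum((a - mean) ** 2 for a in rs)
    return cov / var

# Hypothetical one-day cross-section for five assets.
scores   = [0.9, -0.3, 0.1, 0.5, -0.8]
realised = [0.012, -0.004, 0.006, -0.001, -0.009]
print(round(rank_ic(scores, realised), 3))  # 0.9
```

A daily rank IC of a few percent, sustained over years, is considered a good signal; the task only asks which assets beat which, never what any absolute return will be.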
The LLM-for-finance research that has traction is almost entirely in the text processing direction. FinGPT, BloombergGPT, and similar models are valuable for classifying sentiment in earnings transcripts, summarising macro data releases, and answering queries about specific financial instruments. These are NLP tasks dressed in financial clothing, and LLMs are genuinely the right tool.
Asking those same models to look at a column of return numbers and predict the next one is a category error, not a scaling problem.
The Honest Bottom Line
The architecture of a transformer, specifically the attention mechanism and the tokenizer, was designed to extract co-occurrence and semantic structure from discrete symbolic sequences. It is extraordinarily good at that.
Financial return sequences lack that structure by construction. They are contaminated by adversarial market forces that remove predictable patterns, poorly represented by standard tokenizers, and generated by a process driven by exogenous information that no next-token predictor can see.
The empirical evidence is consistent. Zeng et al. showed that simple linear models beat transformers across time series tasks. Gruver et al. showed LLMs can compete on trend-and-seasonality tasks but required specially designed numeric tokenization to even approach the problem. Neither result supports the idea that feeding returns to a GPT-style model gives you a viable trading signal.
The research continues. There are groups working on specialised numerical tokenization, architecture modifications that bake in explicit temporal ordering, and hybrid models that combine return features with textual signals. Some of that will eventually matter in production.
But the naive version of the idea — tokenize daily returns like words, run them through a transformer, predict the next value — fails for reasons that are structural, well-documented, and unlikely to be resolved by scaling up the parameter count.