Cristobal Santana

The Reversal Curse: Why LLMs Know Tom Cruise’s Mother But Not Her Son

Cristobal Santana — Tue, 16 Jun 2026 12:31:34 GMT

When a person learns that Tom Cruise’s mother is Mary Lee Pfeiffer, something automatic happens in the background. We don’t just store the fact in one direction, from the son to the mother. We get the reverse for free: we now also know that Mary Lee Pfeiffer’s son is Tom Cruise. It’s the same fact, looked at from the other end, and we don’t learn the two ends separately. They’re one fact with two entrances.

In classical statistics, this symmetry is built into the math. A correlation between two variables is the same whether you call one of them X or Y. A joint distribution of two things, written P(A, B), doesn’t have a direction. From that single object you can compute the chance of A given B, or the chance of B given A, and any system that knows one should be able to recover the other. The relationship is symmetric by construction.
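To make the symmetry concrete, here is a minimal sketch with an invented joint table over two binary variables. The numbers and the names `joint`, `marginal`, and `conditional` are assumptions for illustration only. The point is that both conditionals come out of the same table.

```python
# Hypothetical joint distribution P(A, B); the probabilities are made up.
joint = {
    ("a0", "b0"): 0.30,
    ("a0", "b1"): 0.20,
    ("a1", "b0"): 0.10,
    ("a1", "b1"): 0.40,
}

def marginal(joint, index):
    """Sum the joint over the other variable (index 0 is A, index 1 is B)."""
    totals = {}
    for outcome, p in joint.items():
        totals[outcome[index]] = totals.get(outcome[index], 0.0) + p
    return totals

def conditional(joint, given_index):
    """Divide each joint entry by the marginal of the variable we condition on."""
    given_marginal = marginal(joint, given_index)
    return {
        outcome: p / given_marginal[outcome[given_index]]
        for outcome, p in joint.items()
    }

p_a_given_b = conditional(joint, given_index=1)  # P(A | B)
p_b_given_a = conditional(joint, given_index=0)  # P(B | A)
```

Nothing about the table has a direction. Which conditional you get depends only on which marginal you divide by.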

Large language models are different, and that’s the surprising part. A language model is the system behind tools like ChatGPT: it’s trained on huge amounts of text to predict what comes next. The architecture these models use, the transformer, has no built-in sense of “forward” or “backward,” so on paper it could reason in both directions equally well. And yet, when you train one the standard way, by having it predict the next token (a token is a chunk of text, roughly a word or part of a word) given the tokens before it, something strange happens. The model learns a relationship in one direction and almost completely fails to retrieve the same relationship from the other. It will tell you who Tom Cruise’s mother is. It will not reliably tell you who Mary Lee Pfeiffer’s son is. The fact is the same. The model just can’t reach it from the other side.

Forward, the fact flows: A is B. Backward, the same fact hits a wall: B is A?

In 2023, Berglund and colleagues carefully demonstrated this effect and named it the Reversal Curse. They ran two experiments. In the first, they built a set of made-up facts in the form “name is description,” like “Daphne Barrington is the director of the film A Journey Through Time,” trained models on them, and tested both directions. Given the name, the model could recover the description. Given the description, it collapsed to near-random guessing. In the second experiment, they used real celebrity-parent pairs and asked GPT-4 questions both ways. Asked who a celebrity’s parent was, it answered correctly about 79% of the time. Asked the reverse, who a given parent’s child was, it dropped to 33%. The model knew both names. It just couldn’t travel from one to the other.

Two things made this finding hard to ignore. The first is that it showed up in every model they tested: GPT-3, GPT-4, Llama, and a range of smaller open models, so it wasn’t a quirk of one system. The second is that a separate team at Anthropic, working at the same time and without knowledge of the first group, hit the same wall from a different angle. Grosse and colleagues (2023) were studying which training examples most influence what a model predicts, using influence functions, a technique from classical statistics that estimates how a model’s behavior would change if you removed a specific training example. They found, almost in passing, that when a model answered a question phrased in one direction, the training examples that mattered were the ones phrased that same way. Examples phrased in reverse had almost no effect. Two papers, written independently, describing the same thing: to the model, the forward and reverse versions of a fact are nearly separate facts.

That rules out the easy explanations. This isn’t a problem with one architecture, one dataset, or one company’s training pipeline. It’s a property of how these models, trained to predict the next token, store and retrieve what they know. Three years later, it’s still one of the cleanest demonstrations of a gap that’s easy to forget: predicting the next token well is not the same thing as understanding a fact. This post is about that finding, why it happens, what people have tried to do about it, and why the most promising recent direction suggests the problem belongs to one specific way of training models rather than to language models in general.

Why the Reversal Happens

The main explanation is mechanical, and once you see it, it’s almost obvious.

These models are trained to predict the next token given everything before it. When the training text says “Tom Cruise’s mother is Mary Lee Pfeiffer,” the training process teaches the model to predict “Mary Lee Pfeiffer” after seeing “Tom Cruise’s mother is.” That update is directional. It strengthens the path from the prefix to the answer. It does nothing, by itself, for the opposite path, from “Mary Lee Pfeiffer’s son is” to “Tom Cruise.” Those are two different starting points leading to two different answers, and the training only ever asked the model to learn one of them.
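A toy version of that directionality fits in a few lines. This is a deliberately crude stand-in, a lookup table of which word follows each prefix, not a neural network, and the function names are my own. A real model generalizes far more than this table does, but the training signal points the same way: from prefix to continuation.

```python
from collections import Counter, defaultdict

def train_next_token(sentences):
    """Count which word follows each prefix: a crude stand-in for next-token training."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            prefix = " ".join(tokens[:i])
            table[prefix][tokens[i]] += 1
    return table

def predict(table, prefix):
    """Return the most frequent continuation, or None if the prefix was never seen."""
    if prefix not in table:
        return None
    return table[prefix].most_common(1)[0][0]

# Only the forward phrasing appears in the training text.
model = train_next_token(["Tom Cruise's mother is Mary Lee Pfeiffer"])

forward = predict(model, "Tom Cruise's mother is")      # "Mary"
reverse = predict(model, "Mary Lee Pfeiffer's son is")  # None
```

The forward prefix has a learned continuation. The reverse prefix was never a starting point during training, so there is nothing to retrieve.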

For the model to get the reverse direction, one of two things would have to happen. Either it sees the reverse phrasing somewhere in training, or it figures out the reverse on its own from the forward version.

The first option is what saves us most of the time. Most facts on the internet are written in many different ways, and famous people have their relationships described from every angle. So for Tom Cruise, the reverse is probably stated somewhere too. The reversal curse shows up most clearly when a relationship appears in only one direction in the data: less-famous people, facts freshly added during fine-tuning (a short extra round of training that adds specific new facts to an already-trained model), or synthetic data made for a narrow purpose.

The second option, that the model could work out the reverse on its own, asks for something the training never rewards. Predicting the next token doesn’t teach a model that “X is Y’s mother” implies “Y is X’s child.” That logical equivalence is a fact about the world, not about sequences of text, and the model is only ever trained on sequences of text. The symmetry is invisible to what the model is optimizing for.

More recent work has sharpened this picture. Lv and colleagues (2024) showed that the reversal curse comes specifically from the next-token training objective and how it shapes the model’s internal representations. Wang and Sun (2025) argued the problem is structural: the way the model represents the two entities in “A is B” gets tangled together in a way that doesn’t allow a clean flip. Kitouni and colleagues (2024) widened the lens and called it the Factorization Curse, the broader point that what a model is trained to predict determines what kinds of generalizations it can and can’t make. The reversal curse is one case of that larger pattern.

The bottom line is that the failure isn’t a bug. It’s a direct consequence of training a model to predict the next token rather than to model the underlying facts. The model is doing exactly what it was trained to do. We just trained it to do something slightly different from what we actually wanted.

Three Years Later: Where Things Stand in 2026

The reversal curse hasn’t gone away, but the conversation around it has matured in two directions.

The first is a set of fixes that work by changing how the training data is presented. Reverse training (Golovneva et al., 2024) takes each training example and adds a reversed version, doubling the data and forcing the model to see both directions. Semantic-aware permutation training (Guo et al., 2024) does something similar but smarter, generating reworded versions that change the order in which entities and their attributes appear. These methods shrink the gap, sometimes closing it almost entirely on test sets, but they’re expensive to apply at the scale of full pretraining, and they don’t fully carry over to facts that weren’t part of the augmentation. They treat the symptom, not the cause.
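As a rough picture of what this kind of augmentation looks like, here is a sketch of word-level reversal that keeps multi-word entities intact. It is loosely inspired by the idea of reverse training, not a reproduction of the published method. The entity list is supplied by hand here, and the function names are hypothetical.

```python
def reverse_preserving_entities(text, entities):
    """Reverse word order while keeping each listed entity's words in their original order."""
    # Temporarily glue multi-word entities together with a placeholder character
    # so that splitting on whitespace treats each entity as a single unit.
    protected = text
    for entity in entities:
        protected = protected.replace(entity, entity.replace(" ", "\u0000"))
    reversed_words = protected.split()[::-1]
    return " ".join(reversed_words).replace("\u0000", " ")

def augment(examples, entities):
    """Return the original examples plus one reversed copy of each, doubling the data."""
    return examples + [reverse_preserving_entities(e, entities) for e in examples]

entities = ["Daphne Barrington", "A Journey Through Time"]
examples = ["Daphne Barrington is the director of A Journey Through Time"]
augmented = augment(examples, entities)
```

The reversed copy is not grammatical English, and it does not need to be. Its job is to give the model a training sequence in which the description comes before the name.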

The second direction is more interesting, and in my view more important. A growing body of work suggests the reversal curse is specific to autoregressive models, the kind that generate text strictly left to right, one token after another. It may not be a limitation of transformers in general, or of language models as a class. Masked diffusion models are a different approach: instead of generating left to right, they fill in tokens in any order, revealing them gradually. These models seem to handle reverse questions far better. LLaDA, an 8-billion-parameter masked diffusion model released in 2025 (Nie et al., 2025), is the clearest case. When a follow-up study put it head to head with autoregressive models on the same parent-child and person-description datasets used to demonstrate the curse (trained in one direction only), LLaDA held up in the reverse direction while Llama-3.1 and Qwen-2.5 collapsed to near-random, in some splits answering almost no reverse questions (Shin et al., 2026).

That’s a genuinely surprising result. It suggests the reversal curse isn’t a deep limit of neural networks or of transformers. It’s a consequence of the left-to-right training objective. Change the objective, and the curse weakens. A model trained to predict tokens at any position, given any other tokens, seems to store relationships in a way that survives being flipped, presumably because it was never locked into a single direction in the first place.

Whether these models can scale up to compete with the best left-to-right models across every task is still being worked out. But for the narrow question of representing knowledge in both directions, they look like a better answer at the root than any patch you can bolt onto a left-to-right model.

The honest summary is this. Three years in, we understand much better why the reversal curse exists, we have a few methods that reduce it, and we have at least one kind of model that seems to avoid it from the start. What we don’t have is a frontier left-to-right model that has actually solved it. The curse is still present in every major LLM you can use today.

What This Costs in Practice

It helps to make the cost concrete. Picture a mid-sized company that sells industrial parts and decides to put an assistant on top of its product catalog. The catalog is full of relationships: this pump is compatible with that controller, this part replaces that older one, this component requires that adapter. The team feeds all of it to the model, tests it, and ships.

In the demo, everything looks fine. Ask “what controller is compatible with pump X?” and the assistant answers correctly, because the catalog was written that way, from the pump to the controller. Then a customer asks the question from the other side: “I have controller Y, which pumps work with it?” The relationship is the same one, and it’s sitting right there in the catalog. But the model was only ever trained on the forward phrasing, so it answers with a plausible, confident, wrong list. The customer orders the wrong part. Support gets a complaint. And nobody on the team can reproduce the problem easily, because their tests asked questions in the forward direction, the same direction the data was written in.

That’s the shape of the damage. It isn’t a crash or an obvious error. It’s a silent gap that passes every forward-looking test and only shows up when a real user approaches the fact from the other end. For a company, that means wrong answers reaching customers, confident enough that no one flags them, in exactly the cases the team never thought to check. The cost isn’t the technology failing loudly. It’s the technology failing quietly, in a direction the team assumed was covered.

Mitigation Strategies

You can’t fully fix the reversal curse from outside the model, because it isn’t a behavior problem; it’s a missing-knowledge problem. The model doesn’t have the fact accessible in the reverse direction, and no amount of wrapping makes it retrieve something it never learned that way. But you can design the system around the model so that it doesn’t depend on a direction the model is weak at. The goal isn’t to repair the model. It’s to build a system that doesn’t need the model to do the thing it can’t.

The most reliable strategy is to not ask the model for the reverse direction at all. If your application needs reverse lookup (given a controller, find compatible pumps), don’t rely on the model’s memory for it. Put the relationships in a database or a structured index that stores them symmetrically, and let that handle the reverse query. The model is great at language and reasoning over what’s in front of it. It’s unreliable at recalling a relationship backward from its weights. So hand the backward lookup to a tool that holds the fact in both directions, and use the model for what it’s actually good at.
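A symmetric store can be very small. The sketch below is an in-memory stand-in for whatever database you actually use. The class name, method names, and part names are all invented for the example.

```python
from collections import defaultdict

class CompatibilityIndex:
    """Stores each compatibility pair once and answers lookups from either side."""

    def __init__(self):
        self._controllers_by_pump = defaultdict(set)
        self._pumps_by_controller = defaultdict(set)

    def add(self, pump, controller):
        # One write updates both directions, so they cannot drift apart.
        self._controllers_by_pump[pump].add(controller)
        self._pumps_by_controller[controller].add(pump)

    def controllers_for(self, pump):
        return sorted(self._controllers_by_pump.get(pump, set()))

    def pumps_for(self, controller):
        return sorted(self._pumps_by_controller.get(controller, set()))

index = CompatibilityIndex()
index.add("Pump X", "Controller Y")
index.add("Pump Z", "Controller Y")
reverse_lookup = index.pumps_for("Controller Y")
```

The reverse query never touches the model. The model only has to phrase the answer once the index has returned it.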

A close second is to store relationships in both directions from the start. If you control the data the model sees, which you usually do in a retrieval system (a setup that finds relevant text for each question and places it in the prompt before the model answers, also known as retrieval-augmented generation, or RAG), you can write each relationship twice, once in each direction, so that when the system pulls context for a query, the reverse form is already there. This connects to something the original paper noted: when the fact is present in the context, the model can use the relationship fine. The curse only shows up when the model has to recall it from memory. So the cheap, robust move is to make sure the relevant fact is in the context in the direction the question needs, rather than hoping the model can flip it on its own.
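One way to do this at indexing time is to expand each catalog relation into a forward and a reverse sentence before it is embedded or stored. The templates and relation names below are assumptions for illustration. A real catalog would need its own wording.

```python
# Hypothetical templates: one forward and one reverse sentence per relation type.
TEMPLATES = {
    "compatible_with": (
        "{a} is compatible with {b}.",
        "{b} is compatible with {a}.",
    ),
    "replaces": (
        "{a} replaces {b}.",
        "{b} is replaced by {a}.",
    ),
    "requires": (
        "{a} requires {b}.",
        "{b} is required by {a}.",
    ),
}

def both_directions(a, relation, b):
    """Return the forward and reverse sentences for one relationship."""
    forward, reverse = TEMPLATES[relation]
    return [forward.format(a=a, b=b), reverse.format(a=a, b=b)]

chunks = both_directions("Pump X", "compatible_with", "Controller Y")
```

Both sentences go into the retrieval index, so a question that starts from the controller has a chunk that starts from the controller.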

A third strategy is to enrich the query before it reaches the model. A small step in the system can take the incoming question, fetch the relevant fact from a reliable source, and place it in the prompt in the direction the model needs. The model then answers from context, not from its weak reverse memory. This is just the previous idea applied at query time instead of at indexing time.
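At query time, the same idea might look like the sketch below. The dictionary stands in for a real database query, and the prompt wording and function names are assumptions, not a recommended template.

```python
# Hypothetical reverse-direction store; in practice this would be a database query.
PUMPS_BY_CONTROLLER = {"Controller Y": ["Pump X", "Pump Z"]}

def build_prompt(question, facts):
    """Place fetched facts ahead of the question so the model answers from context."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )

def enrich(question, controller):
    """Fetch the facts in the direction the question needs and add them to the prompt."""
    pumps = PUMPS_BY_CONTROLLER.get(controller, [])
    facts = [f"{controller} works with {pump}." for pump in pumps]
    return build_prompt(question, facts)

prompt = enrich("I have Controller Y, which pumps work with it?", "Controller Y")
```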

And underneath all of these is the one habit that costs almost nothing: test both directions explicitly. The reversal curse is dangerous mostly because it’s invisible to forward-only testing. If your application has any reverse-lookup component, write tests that ask the question from the other side, and you’ll catch the gap before a customer does. None of this removes the curse. It just keeps the curse from reaching anyone who depends on your system.
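A both-directions check can be a short loop over the pairs you already have. In this sketch, `forward_only_assistant` is a stand-in for the deployed system that I made up to show the failure. In a real test you would pass in whatever function calls your assistant.

```python
# Hypothetical ground truth: (pump, controller) pairs taken from the catalog.
PAIRS = [("Pump X", "Controller Y"), ("Pump Z", "Controller Y")]

def direction_report(ask, pairs):
    """Ask each fact from both sides and list every (pump, controller, direction) that fails."""
    failures = []
    for pump, controller in pairs:
        forward_q = f"Which controller is compatible with {pump}?"
        reverse_q = f"Which pumps are compatible with {controller}?"
        if controller not in ask(forward_q):
            failures.append((pump, controller, "forward"))
        if pump not in ask(reverse_q):
            failures.append((pump, controller, "reverse"))
    return failures

def forward_only_assistant(question):
    """Stand-in assistant that only knows the forward phrasing."""
    known = {f"Which controller is compatible with {p}?": c for p, c in PAIRS}
    return known.get(question, "I am not sure.")

failures = direction_report(forward_only_assistant, PAIRS)
```

A forward-only test suite would report zero failures here. Adding the reverse question surfaces both gaps.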

A Pattern Bigger Than Reversal

What I find most interesting about the reversal curse isn’t the specific failure on celebrity-parent pairs. It’s what that failure says about two things that get treated as the same and aren’t: predicting the next token well, and modeling the underlying facts of the world.

The next-token objective, applied to enough text at enough scale, produces models that are extraordinarily good at one thing: continuing text in plausible ways. When the training data contains enough phrasings of a fact, the model can answer questions about it from many angles, and that looks like understanding. But the moment you ask a question whose phrasing wasn’t in the training data, even one whose answer is logically equivalent to something the model clearly knows, the gaps become visible. The model was never taught to derive new facts from logical equivalences. It was taught to continue text. Those are different skills, and we keep being surprised when they come apart.

That, I think, is the real lesson. These models aren’t building internal maps of the world the way people do, where a fact sits as a relationship you can approach from any side. They’re building dense statistical models of text, and the shape of those models reflects the direction in which the text was written. When the data is symmetric, as it is for famous people described from every angle, the model looks symmetric. When the data runs in one direction, as it almost always does for less-discussed entities, new facts, or specialized fields, the model inherits that one-directionality.

For anyone building on top of LLMs, the takeaway is uncomfortable but useful. Don’t assume the model knows a fact in both directions just because it knows it in one. The asymmetry is real, it’s persistent, and it’s silent. Those three properties together are what make it dangerous, and they’re why the fix is almost never a bigger model. It’s designing the system so it never has to depend on a direction the model was never taught.

References

Foundations

Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., & Evans, O. (2023). The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. ICLR 2024. arXiv:2309.12288 - https://arxiv.org/abs/2309.12288

Independent confirmation

Grosse, R., et al. (2023). Studying Large Language Model Generalization with Influence Functions. Anthropic. arXiv:2308.03296 - https://arxiv.org/abs/2308.03296

Mechanisms and analysis

Lv, A., et al. (2024). An Analysis and Mitigation of the Reversal Curse. EMNLP 2024. arXiv:2311.07468 - https://arxiv.org/abs/2311.07468

Kitouni, O., et al. (2024). The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More. arXiv:2406.05183 - https://arxiv.org/abs/2406.05183

Wang, B., & Sun, H. (2025). Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure. arXiv:2504.01928 - https://arxiv.org/abs/2504.01928

Mitigations

Golovneva, O., Allen-Zhu, Z., Weston, J., & Sukhbaatar, S. (2024). Reverse Training to Nurse the Reversal Curse. arXiv:2403.13799 - https://arxiv.org/abs/2403.13799

Guo, Q., et al. (2024). Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training. arXiv:2403.00758 - https://arxiv.org/abs/2403.00758

Recent updates (2025-2026)

Nie, S., et al. (2025). Large Language Diffusion Models (LLaDA). arXiv:2502.09992 - https://arxiv.org/abs/2502.09992

Shin, S., Kim, B., Lee, K., Jeon, M., & No, A. (2026). Understanding the Reversal Curse Mitigation in Masked Diffusion Models through Attention and Training Dynamics. arXiv:2602.02133 - https://arxiv.org/abs/2602.02133

Lost in the Middle: Why LLMs Forget What They Just Read

Cristobal Santana — Tue, 02 Jun 2026 11:03:46 GMT

The Problem

If you have ever built a RAG system, where the model retrieves relevant documents and uses them to answer a question, you have probably felt this without naming it. You retrieve twenty chunks, the relevant one is in there somewhere, the model has it in context, and it still answers as if it never saw it. So you add more context to be safe, and somehow the answer gets worse. This is not a retrieval bug, and you cannot fix it by switching to a model with a bigger context window. It is the model not reading its own middle.

This matters more than it sounds. A system that silently ignores part of its input is a system that fails in ways you cannot see in a demo. It works when you test it with three documents and the answer is in the first one. It breaks in production when the answer happens to land in document eleven of twenty, and now you have a support ticket, a user who does not trust the output, and an engineer trying to reproduce a bug that depends on the position of a document nobody thought to track.

There is an old finding in cognitive psychology that fits this almost perfectly. Murdock described the serial position effect in 1962: people recall the first and last items in a list far better than the items in between. Human memory is U-shaped. We remember the beginning, we remember the conclusion, and the middle turns into a vague impression. Modern machine learning was supposed to be different. The transformer was sold partly on its ability to attend equally to every position in the input. The attention mechanism, in principle, lets each token reach across the entire context with equal ease. Everything is, mathematically, the same distance away. That is the implicit promise of long-context models: feed it a long document and it will actually use it.

In 2023, Liu and colleagues at Stanford and Berkeley showed that this assumption is wrong. Their paper, Lost in the Middle: How Language Models Use Long Contexts, showed that current LLMs behave a lot like the human U-shape. Information at the beginning and end of the context gets used. Information in the middle, even when it is the exact answer to the question being asked, often gets ignored. The model can read it. It just does not pull from it the way it pulls from the edges. This post is about why that happens, what it costs you if you are building real systems on top of LLMs, and why it is still an open problem in 2026.

What’s Actually Happening

The experiment Liu et al. designed is clean enough that you can picture it right away. The model receives a question plus several documents, exactly one of which contains the answer. The other documents are real but irrelevant, distractors pulled from the same corpus. The key move is that they change the position of the answer-bearing document within the context, from the first slot to the last, keeping everything else the same. Same question, same documents, same model. Only the position of the answer changes.

If a model really used its context evenly, accuracy should be flat across positions. Instead they found a clear U-shape: high accuracy when the relevant document sits at the start or the end, and a clear drop when it sits in the middle. The model is not using its middle.
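The position sweep is easy to reproduce on your own data. The sketch below only builds the contexts, with function names and a document format of my own choosing. Scoring each context against your model is left out, since that depends on the API you use.

```python
def build_context(answer_doc, distractors, position):
    """Insert the answer-bearing document at a 0-based position among the distractors."""
    docs = list(distractors)
    docs.insert(position, answer_doc)
    return "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(docs))

def position_sweep(answer_doc, distractors):
    """Yield one (position, context) pair for every slot the answer document can occupy."""
    for position in range(len(distractors) + 1):
        yield position, build_context(answer_doc, distractors, position)

contexts = dict(position_sweep("ANSWER", ["d1", "d2", "d3"]))
```

Plotting accuracy against `position` for a fixed question set is what shows whether the curve is flat or U-shaped for your model.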

What made this land was that it was not a quirk of one model. They tested open-source and proprietary models alike (GPT-3.5-Turbo, Claude, MPT, Longchat variants) and the U-shape showed up in every one, with different severity. Models with longer context windows did not escape it. If anything, the drop got worse as the context grew. This is what made the finding important for anyone deploying these systems: it was not a bug in one architecture you could swap out, it was a property of how transformer-based models, as a class, process long inputs. Later replications across newer models have confirmed that while the size of the effect can be reduced, the U-shape stays. The middle of the context is still, in 2026, a partial blind spot for most LLMs.

Why the Middle Disappears

Three explanations have been proposed, and the honest answer is that the field has not fully settled which one matters most. They probably stack on top of each other. You do not need to master the math to make good decisions here, but the intuition is worth having, because it tells you why no prompt trick makes the problem go away.

The first is about the data these models were trained on. In typical text, important information sits at the beginning (introductions, headlines, opening paragraphs) and at the end (conclusions, key takeaways). The middle tends to be supporting or transitional. A model trained to predict the next token in this kind of text learns, without anyone telling it to, that the edges matter more than the middle. This bias is not written anywhere explicit. The model picks it up on its own, from billions of examples.

The second is about how attention itself behaves. Xiao and colleagues (2023) found a phenomenon called attention sinks: the first tokens of a sequence get a large share of attention from every later token, no matter what they actually say. This appears to be a side effect of how attention works: each token’s attention weights over the tokens before it are forced to sum to one, so even when nothing in the context deserves attention, the weight has to go somewhere, and the model tends to dump it on the first few positions. Early tokens become anchors that dominate the processing, and tokens deep in the middle get pushed out.
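The sum-to-one constraint comes from the softmax that turns raw scores into attention weights. A small sketch with invented scores shows that even when every score is strongly negative, the full unit of weight still gets handed out.

```python
import math

def softmax(scores):
    """Numerically stable softmax: exponentiate shifted scores and normalize."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores where nothing looks relevant.
weights = softmax([-9.0, -9.0, -9.0, -9.0])
total_weight = sum(weights)
```

There is no way for the weights to all be small at once. Softmax only sees the differences between scores, so four equally bad candidates split the weight evenly, and a trained model can learn to park that leftover weight on a few fixed positions instead.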

The third is about positional encoding, the mechanism that tells the model where each token sits. Different schemes for doing this (RoPE, ALiBi, learned embeddings) behave differently with long contexts. Press et al. (2022) showed that some encodings handle lengths beyond their training range well and others fall apart. When a model runs on contexts longer than it saw during training, the middle is the region that suffers most, because those are exactly the positions it never had to deal with.

For anyone building on top of these models, the mechanism matters less than the result. The bias is real, it is baked in during training, and you cannot prompt your way out of it. You have to design around it. That is a product decision, not a tuning problem.

Where We Are in 2026

It is worth stopping on what “long context” even means anymore, because the numbers have moved several times, and the marketing has moved faster than the reality. When Liu et al. published in 2023, long context meant 8k to 32k tokens. By early 2026 those numbers look small: GPT-5.4 ships with 128k, Claude Opus 4.6 with 200k, Gemini 3.1 Pro advertises 2 million, Llama 4 Scout claims 10 million. The ceiling has grown by two orders of magnitude in three years. You might expect the Lost in the Middle problem to have quietly disappeared with all that room to spare. It has not.

The most useful way to think about this, especially if you are paying for tokens, is the gap between advertised and effective context length. Hsieh and colleagues (2024) built RULER, a benchmark that stress-tests long-context performance with multi-needle retrieval and multi-hop tasks. Their headline finding: of the models claiming 32k or more, only about half could actually handle 32k well, and almost all dropped below a usable level well before their advertised length. As of 2026, the rule of thumb across independent benchmarks is that effective capacity is roughly 60-70% of the advertised maximum. In plain terms: the 1M-token window you are paying for gives you maybe 600k to 700k tokens of context you can actually trust. The rest is capacity on the spec sheet that does not show up in the output.

There has been real progress, to be fair. The latest generation has put a lot of work into long-context post-training and the results are visible. On the multi-needle MRCR v2 benchmark at 1M tokens, Claude Opus 4.6 reportedly reaches around 76%, compared to about 18.5% for Claude Sonnet 4.5 on the same test, a roughly fourfold jump in a single generation. That is not small. It suggests the U-shape is something that scaling and targeted training can reduce. But reduce is the right word: no current model gives you flat accuracy across positions. The drop is smaller, but it is still there.

A more uncomfortable result came in late 2025, when Du and colleagues showed that context length alone hurts performance even when retrieval is perfect. Even when the relevant information is clearly available and the model has provably read it, more surrounding context still makes the answer worse. Stuffing the context window is not a free lunch, and “just put everything in the prompt” is not a strategy.

So what do you actually do about it if you are shipping something today? A few things, in rough order of how much they help:

Reorder. Put your highest-relevance chunks at the start and end of the context, not sorted top-to-bottom by similarity score, and let the middle hold the low-confidence material. This costs you nothing and it measurably helps.

Re-rank with position in mind. When you fetch documents and pass them to the model, sort for edge placement, not just raw similarity. The order you pass things in matters more than people first realized.

Test at production length. This is the one most teams skip. Check your pipeline with multi-needle retrieval at your real context size, not at the advertised maximum and not with toy inputs. A model with a 1M window that only uses the first and last 50k tokens will pass every demo and fail in production, and you will not know until a customer finds the gap for you.

Consider training-side fixes if you control the model. Work like IN²-training (An et al., 2024) trains models on answers placed all across the context, including the middle, and it helps. But none of these removes the U-shape, and most teams buying an API do not have this option anyway.
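The reordering step from the list above can be sketched in a few lines. This is one simple way to do it, assuming the chunks arrive already sorted from most to least relevant. The function name is my own.

```python
def reorder_for_edges(chunks_by_relevance):
    """Place the most relevant chunks at the two ends and the least relevant in the middle.

    The input must be sorted from most to least relevant.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill from the start, odd ranks fill from the end.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ordered = reorder_for_edges(["c1", "c2", "c3", "c4", "c5"])
```

With five chunks ranked c1 (best) to c5 (worst), the best chunk lands first, the second best lands last, and the weakest one ends up in the middle, where a missed read costs the least.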

The realistic stance is to assume the middle is unreliable until proven otherwise, and to build that assumption into your architecture from the start instead of finding out after launch.

A Pattern Bigger Than Long Context

What I find most interesting about Lost in the Middle is not specific to long contexts. It is a clean example of a bigger pattern, and one worth keeping in mind if you make decisions about these systems: the way training shapes behavior in ways you cannot see from the architecture alone.

The transformer does not, on paper, prefer the edges of its input. The attention mechanism is symmetric in a precise mathematical sense. And yet, after training on text where important information lives at the edges, the model behaves as if it had a built-in bias. The bias is real, you can measure it, and it is consistent across models. It just does not live in the architecture. It lives in the interaction between the architecture and the data. The model you bought is not the model the spec sheet describes. It is that model, plus the hidden priors of its training data, plus behaviors nobody put there on purpose.

This is a pattern you see again and again in modern ML, and it has a direct consequence for how you work with these systems. The capability advertised on a model card is the ceiling, not the floor. Lost in the Middle is easy to see because the failure is easy to show: change one position, watch accuracy drop. Most data-driven biases are much harder to spot, and you never notice them until something specific breaks in front of a user.

The teams that win with these systems are not the ones with access to the biggest models. They are the ones who assume the model has limits the vendor did not advertise, test under conditions that match production, and design around the gaps early. Do not trust the model to read evenly. Do not assume that “in the context” means “available to the model.” The model is doing what it was trained to do, not what its spec sheet suggests it could do, and the difference is your problem to manage, not the vendor’s.

If this kind of thing is what you like to think about, I write about one of these phenomena every couple of weeks. Next up: the Reversal Curse, why a model that knows “A is B” often fails to know “B is A,” and what that means if you are building anything that relies on an LLM reasoning over structured facts.

References

Foundations

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. arXiv:2307.03172 - https://arxiv.org/abs/2307.03172

Mechanisms

Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453 - https://arxiv.org/abs/2309.17453

Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi). ICLR 2022. arXiv:2108.12409 - https://arxiv.org/abs/2108.12409

Mitigations and benchmarks

An, S., Ma, Z., Lin, Z., Zheng, N., Lou, J. G., & Chen, W. (2024). Make Your LLM Fully Utilize the Context. arXiv:2404.16811 - https://arxiv.org/abs/2404.16811

Hsieh, C. P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., & Ginsburg, B. (2024). RULER: What’s the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654 - https://arxiv.org/abs/2404.06654

Recent updates

Du, Y., et al. (2025). Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. arXiv:2510.05381 - https://arxiv.org/abs/2510.05381

Salvatore, N., et al. (2025). Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs. arXiv:2510.10276 - https://arxiv.org/abs/2510.10276

Cognitive background

Murdock, B. B. (1962). The Serial Position Effect of Free Recall. Journal of Experimental Psychology, 64(5), 482-488. - https://doi.org/10.1037/h0045106

The PINN Loss Function: Where Physics and Optimization Collide

Cristobal Santana — Fri, 01 May 2026 05:13:54 GMT

In classical statistics, you usually start from a hypothesis about the function that generates your data. You assume a linear relationship, or a Gaussian distribution, or a logistic shape, and you spend most of your effort estimating a small set of parameters that make that assumed shape fit the observations as well as possible. The function is, in a sense, known up to a few numbers. The model is what you brought to the problem; the data tells you how to tune it.

Modern machine learning works differently, and the difference is more philosophical than people usually admit. When we train a deep neural network, we don’t have a hypothesis about the function. We have a flexible family of functions — the architecture — that is rich enough to approximate almost anything, and we let the data drag the parameters toward whatever shape minimizes a loss. The true function that maps inputs to outputs is unknown, possibly unknowable, and we never write it down. What we have instead are metrics. Train loss. Validation accuracy. Held-out error. We iterate, and the metrics tell us whether the function we’re approximating behaves like the unknown one, on the inputs we happen to have. The model is a black box that we trust because it agrees with reality where we can check.

This is a remarkable thing, and also a fragile one. Most of the failure modes in modern ML — distribution shift, spurious correlations, hallucinations, weird emergent behaviors — are direct consequences of this setup. We never know whether the function we learned matches the unknown one outside the points where we measured. We only know it agrees on the points we tested.

Physics-Informed Neural Networks are interesting precisely because they break this pattern. We still don’t know the function we’re trying to learn — the solution of a partial differential equation can be wildly complicated and is exactly what we want the network to discover. But we do know something the standard ML setup never has: an equation that the unknown function must satisfy. The PDE itself is a strong, exact, mathematically derived constraint on the shape of the solution. It tells us, at every point in space and time, what relationship must hold among the function’s values and its derivatives. This is information we usually don’t have in machine learning, and it’s enormously valuable. In principle, it should let the network converge faster, generalize better, and stay faithful to physical reality even in regions of the input space where we have no data at all.

When Raissi, Perdikaris, and Karniadakis posted their original PINN preprints in 2017 (the journal version followed in 2019), this is exactly what they exploited. Take a neural network. Add the governing equation as part of the loss. Let backpropagation discover a function that is simultaneously consistent with your data and with the laws of physics. No mesh. No finite differences. No discretization headaches. Just a loss function and an optimizer.

For someone trained as a physicist, this was seductive. Solving partial differential equations is what physicists spend years learning to do — by hand, with elaborate numerical schemes, with decades of accumulated craft. The PINN promise was that you could replace much of that machinery with a few hundred lines of PyTorch.

Eight years and thousands of papers later, the picture is more honest. PINNs work — sometimes spectacularly — and they also fail in ways that the field is still trying to understand. Most of those failures trace back to one place: the loss function.

This post is about why a deceptively simple loss function turns out to be one of the hardest objects to optimize in modern machine learning, and what we’ve learned about it.

The Idea in One Paragraph

A PINN is a neural network that takes coordinates (a position in space, a moment in time) and outputs the value of some physical field at that point — temperature, pressure, displacement, whatever the PDE describes. What makes it “physics-informed” is that the network is trained not just to fit observed data, but to satisfy the PDE itself at a cloud of collocation points sprinkled across the domain. Because automatic differentiation gives us exact derivatives of the network’s output with respect to its inputs, we can plug the network directly into the differential operator, compute how much it violates the equation, and minimize that violation. That’s the whole trick. The rest is engineering.
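To make "plug the network into the differential operator" concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the toy equation u'(x) + u(x) = 0, the one-hidden-layer `tiny_network`, and a hand-rolled `Dual` class standing in for the automatic differentiation a framework like PyTorch would provide.

```python
import math


class Dual:
    """value + deriv * eps with eps**2 == 0: a minimal forward-mode autodiff number."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule carries the derivative along with the value.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def dual_tanh(z):
    t = math.tanh(z.value)
    return Dual(t, (1.0 - t * t) * z.deriv)


def tiny_network(x, params):
    """u(x) = sum_k a_k * tanh(w_k * x + b_k), a hypothetical stand-in for a PINN."""
    out = Dual(0.0)
    for a, w, b in params:
        out = out + a * dual_tanh(w * x + b)
    return out


def residual(x, params):
    """How much the network violates the toy ODE u'(x) + u(x) = 0 at one point."""
    u = tiny_network(Dual(x, 1.0), params)  # seed dx/dx = 1
    return u.deriv + u.value


def residual_loss(points, params):
    """Mean squared residual over a set of collocation points."""
    return sum(residual(x, params) ** 2 for x in points) / len(points)
```

Calling `residual_loss([0.0, 0.5, 1.0], params)` gives the quantity an optimizer would push toward zero. A network whose output weights are all zero already scores a perfect zero here, which is why the residual term alone cannot pin down a solution.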

The Loss Has Several Voices

The loss function a PINN minimizes is not one thing. It’s a sum of distinct objectives, each pulling the network in its own direction.

Before unpacking each piece, it’s worth pausing on one observation. The data loss term is exactly the loss any standard supervised model uses: a discrepancy between predictions and observations, typically measured with the squared L² norm — the p=2 case of the general L^p family that powers nearly every loss function in modern machine learning:
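For a vector of errors with N entries, the family can be written as follows. The symbols e and N are chosen here for illustration and are not taken from the original notation.

```latex
\|e\|_p = \Big( \sum_{i=1}^{N} |e_i|^p \Big)^{1/p}, \qquad p \ge 1
```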

In practice, we set p=2 — the squared error you’ve seen in every regression problem since your first stats course:
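With p = 2, squared and averaged over the observations, the data term becomes the familiar mean squared error. The notation is an assumption for illustration: u_θ is the network, (x_i, t_i) are the N_d measurement locations, and u_i are the observed values.

```latex
\mathcal{L}_{\text{data}}(\theta) = \frac{1}{N_d} \sum_{i=1}^{N_d} \big| u_\theta(x_i, t_i) - u_i \big|^2
```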

In that sense, a PINN trained only on the data term would be just a regular regression model. What makes it physics-informed is everything else: the additional terms that don’t compare predictions to data, but to the equation itself.

The full loss looks like this:
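One common way to write the weighted sum is shown here, with the terms ordered to match the right-to-left reading that follows. The subscript names are assumptions: r for the PDE residual, bc for boundary conditions, ic for initial conditions, and data for observations. The operator 𝒩 stands for the PDE written in the form 𝒩[u] = 0, evaluated at N_r collocation points.

```latex
\mathcal{L}(\theta)
  = \lambda_{r}\,\mathcal{L}_{r}(\theta)
  + \lambda_{bc}\,\mathcal{L}_{bc}(\theta)
  + \lambda_{ic}\,\mathcal{L}_{ic}(\theta)
  + \lambda_{\text{data}}\,\mathcal{L}_{\text{data}}(\theta),
\qquad
\mathcal{L}_{r}(\theta) = \frac{1}{N_r} \sum_{j=1}^{N_r} \big| \mathcal{N}[u_\theta](x_j, t_j) \big|^2
```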

Reading right to left, those terms ask the network to: match observed data, match the initial state of the system, respect the boundaries of the domain, and satisfy the PDE everywhere in between. The lambdas are scalar weights that decide how much each voice gets to speak.

They look innocent. They are not.

A Choir That Doesn’t Agree

The clean way to describe the problem is multi-objective optimization disguised as single-objective optimization. We hand the optimizer one number to minimize, but inside that number live four objectives with different units, different scales, and different geometric properties.

The vivid way to describe it is a choir where every singer is trying to hit a different note at a different volume, and the conductor — the optimizer — keeps walking toward whoever is loudest.

If the residual loss produces gradients orders of magnitude larger than the boundary loss, the optimizer effectively ignores the boundaries. The network learns to satisfy the PDE almost everywhere, sometimes by collapsing into a trivial solution like “everything is zero,” which has zero residual for any homogeneous PDE and is ruled out only by the boundary, initial, and data terms. Conversely, if the boundary loss dominates, you get a network that fits the boundary beautifully and disrespects the physics in the interior.

This isn’t speculation. Wang, Teng, and Perdikaris (2021) showed both empirically and theoretically that during training, the gradients from different loss components become severely imbalanced. They called it a gradient pathology. The optimizer ends up being steered by whichever loss has the loudest gradient, regardless of which loss matters most for solving the problem.

For a physicist this should feel familiar. It’s the same disease as a stiff system of differential equations: the dynamics are dominated by whichever component has the fastest timescale, even when the slow components carry most of the physics.
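A toy calculation makes the imbalance concrete. The two quadratic terms below are stand-ins invented for illustration: they are not a PINN, and in a real network the loss terms share parameters rather than owning one each. The point is only that the largest stable learning rate is set by the steep term, so the shallow term barely moves.

```python
def grad_residual(a, stiffness=1000.0):
    # Gradient of the steep term 0.5 * stiffness * a**2.
    return stiffness * a


def grad_boundary(b):
    # Gradient of the shallow term 0.5 * (b - 1)**2, minimized at b = 1.
    return b - 1.0


def train(steps=200, lr=1.5e-3, stiffness=1000.0):
    """Plain gradient descent on the sum of both terms.

    Stability of the steep term requires lr < 2 / stiffness, so the
    shallow term is forced to crawl at that same small learning rate.
    """
    a, b = 1.0, 0.0
    for _ in range(steps):
        a -= lr * grad_residual(a, stiffness)
        b -= lr * grad_boundary(b)
    return a, b


if __name__ == "__main__":
    a, b = train()
    print("steep-term parameter a:", a)
    print("shallow-term parameter b (target 1.0):", b)
```

At the starting point the two gradient magnitudes differ by a factor of 1000. After 200 steps the steep term has converged to machine precision while the shallow term is still most of the way from its target.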

Why Physics Wins and Boundaries Lose

A year later, Wang, Yu, and Perdikaris (2022) gave a sharper diagnosis using Neural Tangent Kernel theory, a tool that lets us analyze very wide neural networks as if they were linear models, where the convergence rate of each error component is set by an eigenvalue of a kernel matrix.

Their finding is uncomfortably specific: in the problems they analyze, the kernel block associated with the PDE residual has much larger eigenvalues than the block associated with the boundary and initial conditions. In plain language, gradient descent drives the residual down quickly and fits the boundary and initial data slowly, which matches the gradient imbalance described above. By the time the boundary loss starts converging meaningfully, the optimizer has already settled into a region of parameter space shaped almost entirely by the residual.

This is why PINNs frequently produce solutions that drive the residual close to zero and are still visibly wrong: a function that satisfies the equation but misses the boundary or initial conditions is solving a different problem. It’s also why “just train it longer” so often fails to fix the problem. The slow components of the error are tied to small eigenvalues, and the number of steps they need grows in inverse proportion to those eigenvalues.
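The link between eigenvalues and convergence speed can be sketched with a few lines of arithmetic. In a linearized model trained by gradient descent, the error component along an eigenvector with eigenvalue λ shrinks by a factor |1 − lr·λ| per step. The function name and the numbers below are hypothetical, picked only to show the scale of the gap.

```python
import math


def steps_to_tolerance(eigenvalue, lr, tol=1e-3):
    """Steps of gradient descent needed to shrink one error component by `tol`.

    Assumes linear dynamics, where the component along an eigenvector
    is multiplied by (1 - lr * eigenvalue) at every step.
    """
    factor = abs(1.0 - lr * eigenvalue)
    if not 0.0 < factor < 1.0:
        raise ValueError("need 0 < |1 - lr * eigenvalue| < 1 for geometric decay")
    return math.ceil(math.log(tol) / math.log(factor))


if __name__ == "__main__":
    lr = 0.005  # must stay below 2 / largest eigenvalue
    print("large eigenvalue (100.0):", steps_to_tolerance(100.0, lr), "steps")
    print("small eigenvalue (0.1):  ", steps_to_tolerance(0.1, lr), "steps")
```

Whichever loss term sits on the small eigenvalues pays the price: its error takes roughly a thousand times more steps to shrink when its eigenvalue is a thousand times smaller.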

Failure Is Not an Edge Case

Krishnapriyan and collaborators (NeurIPS 2021) made the failure even more concrete. They studied PINNs on benchmark problems — including the convection equation, about as simple a PDE as exists — and showed that as the convection coefficient grows, PINNs systematically fail to converge to the correct solution. The loss landscape becomes increasingly ill-conditioned, and the optimizer gets trapped in solutions that are smooth, look reasonable, and are physically meaningless.

The lesson is uncomfortable: PINN failure isn’t an exotic edge case that shows up only in adversarial test problems. It happens on textbook equations, with reasonable hyperparameters, in ways you wouldn’t catch by looking at training curves. The training loss converges. The model is still wrong.
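One practical consequence is to validate against a reference solution whenever one exists, not against the training loss. The sketch below assumes a specific setup for illustration: the convection equation u_t + β u_x = 0 on x in [0, 2π), t in [0, 1], with initial condition sin(x), whose exact solution is sin(x − βt). The helper names are hypothetical, and the residual uses finite differences only to keep the example dependency-free.

```python
import math


def exact_solution(x, t, beta):
    # Solution of u_t + beta * u_x = 0 with u(x, 0) = sin(x).
    return math.sin(x - beta * t)


def pde_residual(candidate, x, t, beta, h=1e-5):
    """Central-difference estimate of u_t + beta * u_x for a candidate u(x, t)."""
    u_t = (candidate(x, t + h) - candidate(x, t - h)) / (2.0 * h)
    u_x = (candidate(x + h, t) - candidate(x - h, t)) / (2.0 * h)
    return u_t + beta * u_x


def relative_l2_error(candidate, beta, n=50):
    """Relative L2 error against the exact solution on an n-by-n grid."""
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            x = 2.0 * math.pi * i / n
            t = j / (n - 1)
            ref = exact_solution(x, t, beta)
            num += (candidate(x, t) - ref) ** 2
            den += ref ** 2
    return math.sqrt(num / den)


def zero_candidate(x, t):
    # A smooth, reasonable-looking, physically meaningless "solution".
    return 0.0


if __name__ == "__main__":
    beta = 30.0
    print("residual of zero candidate:", pde_residual(zero_candidate, 1.0, 0.5, beta))
    print("relative L2 error of zero candidate:", relative_l2_error(zero_candidate, beta))
```

The zero function has no residual at all and is still completely wrong relative to the reference, which is the gap a converged residual loss can hide.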

What People Are Doing About It

Most modern PINN research is, in one form or another, an attempt to fix the loss function. Four directions are worth knowing about:

Adaptive loss weighting. Stop treating the lambdas as fixed hyperparameters. Update them online based on the gradient norms of each term, so no single voice in the choir can dominate. McClenny and Braga-Neto (2023) took this further with self-adaptive PINNs, where the weights themselves become trainable parameters in a min-max game: the network minimizes the loss, the weights maximize it, and the equilibrium concentrates effort where the network is failing most.

Causal training. Time-dependent PDEs have a built-in causal structure: the solution at time t depends on the solution at earlier times. Standard PINNs ignore this and train at all collocation points simultaneously, which is a bit like learning the ending of a story before reading the beginning. Causal PINNs re-weight the loss so that the network is forced to converge in time order: the residual at a later time receives meaningful weight only once the residuals at earlier times are small.

Curriculum and domain decomposition. Train on easier subproblems first — smaller domains, smoother coefficients, gentler regimes — and expand the difficulty gradually. Or split the domain into pieces and let local PINNs collaborate. The XPINN and cPINN families follow this route.

Architectural fixes. Bake the boundary conditions directly into the network architecture so that the boundary loss is zero by construction. Use Fourier features to fight the network’s natural preference for low-frequency solutions. Use quasi-Newton optimizers like L-BFGS, which build up curvature information and often handle ill-conditioned landscapes better than Adam.
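Two of these directions reduce to a few lines of arithmetic. The sketch below is a simplified illustration, not the exact rule from any of the cited papers: `balance_weights` is a hypothetical moving-average update in the spirit of gradient-norm balancing, and `causal_weights` uses an exponential form along the lines of the causal-training paper in the references. The parameter names `momentum` and `epsilon` are assumptions.

```python
import math


def balance_weights(weights, grad_norms, momentum=0.9):
    """Raise the weight of loss terms whose gradients are small.

    weights and grad_norms are dicts keyed by loss-term name. Each target
    weight is the loudest gradient norm divided by that term's own norm,
    blended into the old weight with a moving average.
    """
    loudest = max(grad_norms.values())
    updated = {}
    for name, old in weights.items():
        target = loudest / max(grad_norms[name], 1e-12)
        updated[name] = momentum * old + (1.0 - momentum) * target
    return updated


def causal_weights(residual_losses, epsilon=1.0):
    """Weight for each time slice, given residual losses ordered by time.

    Slice i gets exp(-epsilon * sum of residual losses at earlier slices),
    so later times only count once earlier times are well resolved.
    """
    weights, cumulative = [], 0.0
    for loss in residual_losses:
        weights.append(math.exp(-epsilon * cumulative))
        cumulative += loss
    return weights


if __name__ == "__main__":
    print(balance_weights({"pde": 1.0, "bc": 1.0}, {"pde": 100.0, "bc": 1.0}))
    print(causal_weights([2.0, 0.5, 0.1]))
```

In a training loop, both functions would be called every few steps with freshly measured gradient norms or per-slice residuals, and the returned weights would multiply the corresponding loss terms.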

None of these is a silver bullet. Each helps on some problems and hurts on others. The honest summary of the field today is that PINN training remains an active research problem, not a solved one.

A Pattern Bigger Than PINNs

What I find most interesting about all of this isn’t specific to physics-informed networks. The same pathology appears anywhere we compose a loss function from multiple terms with different physical meanings.

Auxiliary losses in self-supervised learning. KL divergence terms in variational autoencoders. Value and policy losses in actor-critic reinforcement learning. Reconstruction plus regularization in almost everything. Every time we sum two objectives with a single weight in front, we’re recreating the PINN choir, and we’re handing the optimizer the right to follow whichever voice is loudest. “Loudness” rarely correlates with importance.

The PINN literature has pushed harder than most subfields on understanding this, partly because the failure mode is visually obvious. When your network claims to have solved a partial differential equation and you plot the result, the lie is right there on the screen. In domains where the loss is the only signal you ever look at, the same pathologies are quietly degrading models all the time. We just don’t notice them as easily.

This is the part that feels most “physics” to me. The PINN loss function is a small, transparent example of a much more general principle: when you ask one optimizer to balance multiple objectives by adding them up, you’re betting that the gradients will respect the same hierarchy of importance that you do. They almost never do.

The classical statistician brought a hypothesis and let the data tune it. The deep learning researcher brings a flexible architecture and lets the data shape it. The PINN researcher brings the equation itself, and discovers that even with that much extra structure, the optimizer still finds ways to disappoint. There’s a lesson in there about how much of modern ML’s behavior is decided not by the problem, not by the data, not even by the model — but by the geometry of the loss we choose to minimize.

Physics-informed neural networks remain a beautiful idea. The loss function is what makes them work, and what makes them fail. Understanding it is the difference between using PINNs as a black box and using them as a tool whose limits you can anticipate.

Where PINNs Are Actually Used

The literature on PINN applications is vast and growing. A non-exhaustive map of where they’ve shown promise, with a representative reference for each domain:

Fluid dynamics. Navier-Stokes for incompressible flow, turbulence modeling, weather and climate sub-models. Cai et al., 2021 — comprehensive review of PINNs for fluid mechanics.
Solid mechanics. Stress and strain analysis, fracture propagation, elasticity problems with complex geometries. Haghighat et al., 2021 — PINN framework for solid mechanics and elasticity.
Heat transfer. Conduction, convection, and radiation problems where boundary conditions are messy or partially known. Cai et al., 2021 — PINNs for heat transfer problems.
Electromagnetism. Maxwell’s equations in heterogeneous media, antenna design, photonics. Chen et al., 2020 — PINNs for inverse problems in nano-optics and metamaterials.
Inverse problems. Recovering physical parameters — diffusion coefficients, source terms, material properties — from sparse observations. Often the most compelling use case, since traditional solvers struggle here. Raissi, Perdikaris, & Karniadakis, 2019 — original paper covers both forward and inverse formulations.
Subsurface and reservoir modeling. Oil and gas flow, groundwater contamination, CO₂ sequestration. Tartakovsky et al., 2020 — PINNs for subsurface flow and transport.
Biomedical modeling. Cardiovascular hemodynamics, blood pressure estimation from sparse measurements, cardiac electrophysiology. Sahli Costabal et al., 2020 — PINNs for cardiac activation mapping; Kissas et al., 2020 — PINNs for cardiovascular flow.
Quantum mechanics. Solving Schrödinger’s equation for systems where traditional grid methods become intractable. Pfau et al., 2020 — FermiNet, a deep learning approach to many-electron Schrödinger equation; Han, Jentzen, & E, 2018 — deep learning for high-dimensional PDEs.
Finance. Black-Scholes and related option pricing PDEs, especially in high dimensions where grid-based methods suffer the curse of dimensionality. Dhiman & Hu, 2023 — PINNs for European and American option pricing.
Climate and environmental modeling. Pollutant dispersion, atmospheric dynamics, ocean currents. Kashinath et al., 2021 — physics-informed ML for weather and climate modeling.

The common thread across these domains: PINNs shine when the physics is well-understood but the data is sparse, the geometry is complex, or the problem is high-dimensional enough to make classical solvers expensive. They struggle exactly where the loss landscape becomes pathological — stiff regimes, sharp gradients, multi-scale dynamics. Knowing which side of that line your problem sits on is half the battle.

References

Foundations

Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707. arXiv:1711.10561 — https://arxiv.org/abs/1711.10561

Training pathologies

Wang, S., Teng, Y., & Perdikaris, P. (2021). Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing. arXiv:2001.04536 — https://arxiv.org/abs/2001.04536
Wang, S., Yu, X., & Perdikaris, P. (2022). When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics. arXiv:2007.14527 — https://arxiv.org/abs/2007.14527
Krishnapriyan, A. S., Gholami, A., Zhe, S., Kirby, R. M., & Mahoney, M. W. (2021). Characterizing possible failure modes in physics-informed neural networks. NeurIPS 2021. arXiv:2109.01050 — https://arxiv.org/abs/2109.01050

Adaptive methods and fixes

McClenny, L. D., & Braga-Neto, U. M. (2023). Self-adaptive physics-informed neural networks. Journal of Computational Physics. arXiv:2009.04544 — https://arxiv.org/abs/2009.04544
Wang, S., Sankaran, S., & Perdikaris, P. (2024). Respecting causality is all you need for training physics-informed neural networks. arXiv:2203.07404 — https://arxiv.org/abs/2203.07404

Comprehensive overview

Cuomo, S., Di Cola, V. S., Giampaolo, F., et al. (2022). Scientific Machine Learning Through Physics-Informed Neural Networks: Where we are and What’s Next. Journal of Scientific Computing. arXiv:2201.05624 — https://arxiv.org/abs/2201.05624

Applications by domain

Cai, S., Mao, Z., Wang, Z., Yin, M., & Karniadakis, G. E. (2021). Physics-informed neural networks (PINNs) for fluid mechanics: A review. Acta Mechanica Sinica. arXiv:2105.09506 — https://arxiv.org/abs/2105.09506
Haghighat, E., Raissi, M., Moure, A., Gomez, H., & Juanes, R. (2021). A physics-informed deep learning framework for inversion and surrogate modeling in solid mechanics. Computer Methods in Applied Mechanics and Engineering. arXiv:2003.02751 — https://arxiv.org/abs/2003.02751
Chen, Y., Lu, L., Karniadakis, G. E., & Dal Negro, L. (2020). Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Optics Express, 28(8), 11618–11633. — https://doi.org/10.1364/OE.384875
Tartakovsky, A. M., Marrero, C. O., Perdikaris, P., Tartakovsky, G. D., & Barajas-Solano, D. (2020). Physics-informed deep neural networks for learning parameters and constitutive relationships in subsurface flow problems. Water Resources Research. arXiv:1912.02968 — https://arxiv.org/abs/1912.02968
Sahli Costabal, F., Yang, Y., Perdikaris, P., Hurtado, D. E., & Kuhl, E. (2020). Physics-informed neural networks for cardiac activation mapping. Frontiers in Physics. — https://doi.org/10.3389/fphy.2020.00042
Kissas, G., Yang, Y., Hwuang, E., Witschey, W. R., Detre, J. A., & Perdikaris, P. (2020). Machine learning in cardiovascular flows modeling: Predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering. — https://doi.org/10.1016/j.cma.2019.112623
Pfau, D., Spencer, J. S., Matthews, A. G. de G., & Foulkes, W. M. C. (2020). Ab initio solution of the many-electron Schrödinger equation with deep neural networks. Physical Review Research. arXiv:1909.02487 — https://arxiv.org/abs/1909.02487
Han, J., Jentzen, A., & E, W. (2018). Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505–8510. — https://doi.org/10.1073/pnas.1718942115
Dhiman, A., & Hu, Y. (2023). Physics Informed Neural Network for Option Pricing. arXiv:2312.06711 — https://arxiv.org/abs/2312.06711
Kashinath, K., et al. (2021). Physics-informed machine learning: case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A. — https://doi.org/10.1098/rsta.2020.0093