
The Most Important AI Runs on Software Nobody Funds: The Fragile Infrastructure of Scientific AI

  • Writer: Thomas Thurston
  • 2 days ago
  • 9 min read

You’ve probably heard of fusion: the long-held dream of clean, inexhaustible energy. In late November 2022, a team of physicists at Lawrence Livermore National Laboratory ran the numbers on what would become the most consequential fusion experiment in history. They’d spent years training an AI model on more than 150,000 physics simulations and real experimental data,1 teaching it the messy realities of how lasers interact with tiny capsules of hydrogen fuel. Now the model was telling them something no model had told them before: the next experiment at the National Ignition Facility, Livermore’s massive laser complex, had roughly a coin flip’s chance of achieving ignition, getting more energy out of a fusion reaction than the laser put in. This is what physicists had been chasing for decades.2


A coin flip doesn’t sound like much, but it was the highest probability of success the program had ever seen. The previous record was 17%.2


On December 5, 192 laser beams carrying 2.05 megajoules of energy hit a diamond capsule the size of a peppercorn.3 The reaction produced 3.15 megajoules of fusion energy: more energy came out than the laser put in. A historic first. The AI model’s prediction had landed within the expected range. A later peer-reviewed analysis, published in Science in 2025, would put the real odds even higher, at 74%.1


“This was not a lucky guess,” said Brian Spears, director of Livermore’s AI Innovation Incubator and lead author of the paper. “We used a rigorous, data-driven AI framework to quantify the likelihood of ignition before the shot took place.”4


Most of the coverage focused on the physics, and rightly so. What almost nobody talked about was the computing infrastructure that made the prediction possible. Livermore had built a framework it calls “cognitive simulation,” or CogSim: an AI that learned to mimic the laboratory’s own simulation software, running thousands of times faster.1 The team corrected the AI with real experimental data, then did something unusual. They asked the model to quantify how confident it was in its own predictions. That’s what gave them the nerve to tell management, before the laser fired, that this one was going to be different.2
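To make the last step of that pipeline concrete, here is a minimal sketch, not Livermore’s actual code: given a surrogate’s predictive mean and uncertainty for fusion yield, compute the probability that the yield exceeds the laser input. The function name and the illustrative numbers are this sketch’s own assumptions.

```python
import math

def ignition_probability(pred_mean_mj, pred_std_mj, laser_input_mj=2.05):
    """P(yield > laser input) under a Gaussian predictive distribution.
    Illustrative only; names and numbers are this sketch's, not LLNL's."""
    z = (laser_input_mj - pred_mean_mj) / pred_std_mj
    # Gaussian upper-tail probability via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

# A predictive mean just above breakeven, with wide uncertainty, gives
# roughly a coin flip (illustrative numbers, not the paper's):
p = ignition_probability(pred_mean_mj=2.1, pred_std_mj=0.8)
```

The point of quantifying the uncertainty is visible in the arithmetic: shrinking the predictive spread (say, by correcting the surrogate against experimental data) moves the estimate decisively away from the coin flip in whichever direction the mean points.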


CogSim proved something important: when you tightly combine AI with physics-based simulations, you can see things neither approach reveals by itself. It also pointed to a harder problem. CogSim is the equivalent of studying game film after the game is over. The AI learns from past simulations, spots patterns and makes predictions before the next experiment.


The harder problem is putting a coach on the field during the game: AI that operates inside a running simulation in real time, making adjustments as the physics unfolds. Researchers at Argonne National Laboratory have shown that AI stand-ins, called surrogates, can approximate certain physics simulations tens of thousands of times faster than the original code.5 The catch is getting the AI and the simulation to work together while both are running, not one after the other.
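The surrogate idea itself can be shown with a toy (this sketch’s own construction, not Argonne’s code): run an “expensive” simulation offline on a grid of inputs once, then answer later queries from a cheap interpolant instead of rerunning the physics. Here `time.sleep` stands in for real compute cost.

```python
import bisect
import time

def expensive_sim(x):
    """Stand-in for a physics code: artificially slow toy 'physics'."""
    time.sleep(0.001)  # pretend each run costs real compute
    return x ** 3 - 2 * x

# Offline phase: run the simulation on a training grid once.
grid = [i / 50 for i in range(-100, 101)]   # x in [-2, 2]
table = [expensive_sim(x) for x in grid]

def surrogate(x):
    """Piecewise-linear interpolation over the precomputed runs."""
    i = min(max(bisect.bisect_left(grid, x), 1), len(grid) - 1)
    x0, x1 = grid[i - 1], grid[i]
    y0, y1 = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

The surrogate answers in microseconds while each simulation call takes milliseconds; real surrogates replace the lookup table with a trained neural network, but the economics are the same.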


Researchers call this “in-situ scientific AI.” It’s a new frontier, a holy grail of sorts. Yet the infrastructure it requires is far more fragile than people might suspect, especially given all the money and headlines flowing into AI these days.


Two Systems, One Machine


To understand what’s holding back progress, think about what these systems are actually asking a computer to do. Traditional simulation codes are like a submarine built over 40 years. Every weld, every material, every vibration dampener was engineered around one imperative: silence. Now asking that system to run AI is like asking the submarine to double as a concert hall. Both are feats of engineering. Both require obsessive control of the physical environment. One exists to suppress sound and the other to project it. Same hull. Opposite physics.


The AI side of scientific computing is largely built in Python and runs on GPUs. The simulation side is often built in Fortran and C++, codebases usually refined and validated over decades, long before modern AI frameworks existed. That longevity is a strength: code trusted for nuclear weapons simulations or climate models has earned that trust through years of verification. The problem is that it was never designed to share a machine with AI. Getting them to coexist, sharing the same memory, at the same time, is not a software update. It’s a structural problem that runs from the silicon all the way up to the source code.


We mapped the value chain for merging the two: high-performance computing simulation and AI. We found at least nine bottlenecks where supply falls well short of demand. What’s striking isn’t any single one; it’s the pattern. The hardware at the bottom has billions of dollars chasing it. The software at the top, the part that actually connects AI to the physics, sometimes has one academic lab. The further up you go, the thinner the safety net.


Three Layers of Stuck


At the bottom, the silicon. The specialized memory chips these systems need (called High Bandwidth Memory, or HBM) are produced by essentially three companies. The manufacturing step that bonds them to the processor is dominated by one company, TSMC, with a reported majority of its 2025 capacity reserved for a single customer: NVIDIA.6 The interconnect chips that let processors communicate inside a server are dominated by one vendor. Three companies for the memory. One for the packaging. One for the interconnect.

These constraints are structural, not cyclical. Expanding any of them takes well over a year. Billions of dollars are flowing toward the problem, and it’s still not enough.


In the middle, security. When multiple organizations share a supercomputer, they need hardware isolation so one user’s data stays invisible to everyone else. For single-processor workloads, the technology exists. For the multi-processor jobs that scientific AI requires, the major cloud providers couldn’t offer it (as of our most recent analysis).7 Organizations handling classified simulations, proprietary drug compounds or sensitive patient data are locked out of the most powerful shared infrastructure. Less money is chasing this problem. Fewer people are working on it.


At the top, the software. Here’s where the pattern becomes vivid. Making AI and simulation work together requires translator tools that let the AI learn the math inside physics simulations. The most critical is a tool called Enzyme, created by William Moses as a PhD student at MIT.8 Moses is now an assistant professor at the University of Illinois and a researcher at DeepMind,9 running an academic lab that maintains a tool the national laboratories depend on. Other tools do versions of this, but Enzyme is the only one that works on the existing code these simulations are written in, without requiring scientists to rewrite their software.10
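What these translator tools compute can be shown with a toy. The dual-number class below is the operator-overloading style of automatic differentiation used by tools like ADOL-C (note 10); Enzyme produces the same derivatives by transforming compiled LLVM IR instead, which is why it works on existing C, C++ and Fortran without rewrites. The code is an illustrative sketch, not Enzyme.

```python
class Dual:
    """Dual number: carries a value and its derivative together."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,  # product rule for the derivative
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f and df/dx in one pass by seeding dot = 1."""
    out = f(Dual(x, 1.0))
    return out.val, out.dot

# d/dx of x^3 + 2x at x = 3 is 3x^2 + 2 = 29
val, grad = derivative(lambda x: x * x * x + 2 * x, 3.0)
```

The overloading trick requires that the source code route every operation through these special objects, which is exactly why legacy Fortran and C++ codes would otherwise need rewriting; Enzyme sidesteps that by doing the same bookkeeping on the compiler’s intermediate representation.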


The field also needs specialized tools for the complicated math of real-world physics: blood flow through arteries, wind around buildings, turbulence in jet engines. It needs statistical tools that can quantify how much to trust the AI’s predictions. Each one is maintained by a small research group, sometimes at a single university.
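One common statistical technique for judging that trust, sketched here generically rather than as any particular lab’s tool, is ensemble disagreement: ask several independently perturbed surrogates the same question and treat their spread as a warning signal. All names and coefficients below are illustrative.

```python
import random
import statistics

def make_surrogate(seed):
    """Stand-in for a surrogate trained from a different random init:
    each member fits slightly different coefficients to the same data."""
    rng = random.Random(seed)
    a = 1.0 + rng.gauss(0, 0.05)
    b = rng.gauss(0, 0.05)
    return lambda x: a * x * x + b

ensemble = [make_surrogate(s) for s in range(16)]

def predict_with_uncertainty(x):
    """Mean prediction plus the ensemble's disagreement about it."""
    ys = [member(x) for member in ensemble]
    return statistics.mean(ys), statistics.stdev(ys)

mean, spread = predict_with_uncertainty(2.0)
```

A wide spread flags inputs where the surrogates disagree, which is exactly where spending a real simulation run is worth the cost.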


The hardware bottleneck has a handful of major companies and billions in capital behind it. The software bottleneck sometimes has one research group. This is the gradient that defines the field: the further up the stack you go, the more critical the work and the less funding behind it.


Why It Compounds


These bottlenecks don’t just coexist. They compound. You can’t use faster memory if you can’t package it. You can’t use more processors if you can’t secure them. You can’t deploy AI surrogates if you can’t connect them to the physics code. You can’t trust the surrogates if you can’t measure their uncertainty. Fixing any one chokepoint in isolation doesn’t unblock the system.


Here’s the part I keep coming back to: the weakest links are the cheapest ones. Open-source tools and math libraries that the entire field depends on. They don’t look like critical infrastructure because they’re free. They don’t attract procurement budgets because nobody sells them. They’re just the thing without which none of the expensive things work.


Vendor concentration makes this worse. NVIDIA’s processors need memory from a small group of suppliers, packaging from one dominant supplier and interconnects largely from NVIDIA itself. They’re also the only platform with a mature scientific AI software ecosystem. Even when alternative hardware exists, the software to use it often doesn’t. The dependencies stack.


What Livermore Got Right


I think this is why the fusion story matters beyond physics. The problems that most need solving (predicting how a tumor responds to a drug, designing materials that don’t exist yet, understanding what a changing climate will do to food and water) all live at the boundary where computation meets the physical world. That’s where scientific AI operates. The CogSim team didn’t shortcut the hard part. They built the bridges between two computing worlds one equation, one data pipeline, one trust-earning validation at a time. That work took years and some of the most powerful computers on Earth.1 It’s the kind of investment most organizations won’t make, because the payoff is invisible until suddenly it isn’t.


The broader field of in-situ scientific AI needs that same kind of patient, systematic bridge-building, at every layer of the stack. It has far less institutional support behind it, and I think this needs to change. The software infrastructure, the layer that will ultimately determine whether any of this works, is underfunded relative to what it carries. Some of the widest gaps in the entire supply chain are not in silicon. They're in compiler tools, math libraries and statistical software maintained by small academic teams with grant funding.


Somewhere right now, William Moses and a small group of collaborators are maintaining the compiler tool that much of the field depends on to connect the world of physics simulation to the world of machine learning. A tool behind faster climate predictions, drug design and fusion energy. Moses received the 2024 SIGHPC Doctoral Dissertation Award for this work.9 It’s the bottleneck of bottlenecks.




End Notes

 

1. Spears, B., et al., published in Science, August 2025. The peer-reviewed analysis reports a 74% probability of ignition for the N221204 design and describes the CogSim framework, including deep neural network surrogates trained on more than 150,000 HYDRA radiation hydrodynamics simulations on LLNL’s Sierra supercomputer, with transfer learning used to correct models against experimental data. See also “LLNL Used AI to Predict Historic Fusion Ignition Shot,” Lawrence Livermore National Laboratory, August 21, 2025.


2. The 50.2% pre-shot probability and the 17.3% figure for the previous record experiment (N210808) are from the CogSim team’s preliminary analysis, reported in “High-Performance Computing, AI and Cognitive Simulation Helped LLNL Conquer Fusion Ignition,” Lawrence Livermore National Laboratory, December 2022. The 2025 Science paper (note 1) later refined the probability upward to 74% using improved methodology. Both figures are accurate for their respective analyses.


3. The December 5, 2022 NIF experiment produced approximately 3.15 megajoules of fusion energy from 2.05 megajoules of laser energy delivered to the target by 192 laser beams. The target capsule is diamond, approximately the size of a peppercorn. Total energy drawn from the power grid by the laser system was roughly 300 megajoules. See NIF FAQ, lasers.llnl.gov; also “Supercomputing’s Critical Role in the Fusion Ignition Breakthrough,” HPCwire, December 21, 2022.


4. Brian Spears, quoted in “LLNL Used AI to Predict Historic Fusion Ignition Shot,” Lawrence Livermore National Laboratory, August 21, 2025. Spears is identified as director of LLNL’s AI Innovation Incubator and first author of the Science paper.


5. Rick Stevens, keynote address, International Conference on Parallel Processing (ICPP), August 2021. Stevens is associate laboratory director for computing, environment and life sciences at Argonne National Laboratory. Reported in “How Machine Learning Is Revolutionizing HPC Simulations,” InsideHPC, August 2021.


6. Multiple industry analysts reported TSMC’s CoWoS advanced packaging allocation for NVIDIA in 2025. Specific allocation figures vary by source; the characterization here reflects directional consensus of public analyst reports and booking data. HBM (High Bandwidth Memory) is produced primarily by SK hynix, Samsung and Micron, stacked vertically in towers of 8 to 12 layers that feed data to processors at extraordinary speeds.


7. As of our analysis window (March 2024 to February 2026), Microsoft Azure and Google Cloud productized single-GPU confidential computing VMs using NVIDIA H100 hardware with AMD SEV-SNP or Intel TDX but did not support multi-GPU or multi-node confidential computing clusters.


8. William S. Moses and Valentin Churavy, “Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients,” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Extended for GPU differentiation with collaborators at Argonne National Laboratory and the Technical University of Munich: Moses et al., “Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme,” SC ’21 (2021). See enzyme.mit.edu.


9. William S. Moses is an assistant professor of computer science and electrical and computer engineering at the University of Illinois Urbana-Champaign and a researcher at Google DeepMind. He received the 2024 ACM SIGHPC Doctoral Dissertation Award for his dissertation “Supercharging Programming through Compiler Technology.” See “CS Professor Billy Moses Has Received the 2024 SIGHPC Doctoral Dissertation Award,” Siebel School of Computing and Data Science, University of Illinois.


10. Other automatic differentiation tools for HPC include Tapenade (source-to-source), ADOL-C and CoDiPack (operator overloading) and JAX (Python-native). Enzyme’s distinguishing capability is compiler-level AD on existing compiled codes (C/C++/Fortran) including GPU kernels, without requiring code rewrites. See Moses et al., “Scalable Automatic Differentiation of Multiple Parallel Paradigms through Compiler Augmentation,” SC ’22 (2022), Best Student Paper Award.

 
 



© 2025 AGrowth Science International, LLC
