Why the most important AI applications haven't been built yet and what history says about what comes next.
March 30, 2026
Markets under infrastructure constraints always look the same from the inside. The technology is real. The demand is real. And most of the companies that will define what AI actually becomes likely don't exist yet, because the cost of the critical input is still too high for the interesting use cases to pencil out.
Not because the ideas aren't there. Because the infrastructure isn't cheap enough yet for anyone to discover what's actually possible. We are living through the most consequential technology buildout in recent memory, and most of what it eventually produces hasn't been invented yet, held back not by imagination or ambition, but by the mundane economics of running a query.
Here is the part that gets missed in most coverage of this: the constraint is not just a problem to solve. It is doing something. The scarcity that makes AI infrastructure so expensive today is the same force producing the efficiency gains, the architectural innovations, and the supply chain diversification that will define what AI becomes. The hard years are not a detour. They are the mechanism.
It has happened before, in different industries, at different moments. It always resolves. And the world on the other side always looks nothing like the world people were extrapolating from during the constraint.
Right now, Anthropic, the company behind Claude, cannot keep up with its own customers.
Enterprise customers aren't just adopting Claude. They're restructuring workflows around it, hitting rate limits mid-task, and asking for more capacity than Anthropic can provide. The demand picture is almost absurdly lopsided. Anthropic grew from roughly $1 billion in annualized revenue in late 2024 to over $14 billion by early 2026, a pace with few precedents in enterprise software history.1 And yet inference costs came in 23% higher than the company had projected, forcing a downward revision to gross margin targets despite that explosive revenue growth.2
This is not a company in trouble. This is what the infrastructure economics problem looks like when you're winning. More users means more queries means more compute means more cost, and the cost grows faster than the revenue because the hardware required to serve it is scarce, expensive, and controlled by a very small number of players. Anthropic has responded by diversifying aggressively: Google TPUs, Amazon Trainium, Nvidia GPUs, a $50 billion data center buildout with Fluidstack.3 It helps. It doesn't fix anything.
The underlying problem is that every token served costs money, and a significant portion of that money flows to a supply chain controlled by very few players, one of whom is operating at gross margins above 70% against buyers who have no credible alternative.4 Anthropic is among the most sophisticated consumers of AI infrastructure in the world, backed by tens of billions in capital from Google and Amazon, and it still can't reliably serve its own customers without hitting capacity walls.
When a company growing this fast, backed by two of the largest corporations on earth, keeps hitting that wall, the ceiling is the story.
There is a distinction in AI infrastructure that sounds technical but is really an economic one.
Training a model is a capital expenditure, enormous per run, and right now frontier labs are training constantly because the capability curve is still steep enough to justify it. Each new model generation, each fine-tune, each safety evaluation and distillation run adds to a training bill that is anything but a one-time cost at this stage. But inference is different in kind. It is an operating expense that never stops. Every query, every API call, every background agent loop is a cost event, and that cost scales with every user and every use case built on top of the model.
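A toy cost model makes the distinction concrete. Every number below is an illustrative assumption, not a figure reported by any lab; the point is only the shape of the two curves: training cost is fixed per run, inference cost scales with every query served.

```python
# Toy cost model contrasting training (per-run capital cost) with inference
# (per-query operating cost). All figures are illustrative assumptions.

TRAINING_COST_PER_RUN = 100_000_000   # assumed cost of one frontier training run, USD
TRAINING_RUNS_PER_YEAR = 4            # assumed: new generations, fine-tunes, evals
COST_PER_QUERY = 0.01                 # assumed blended inference cost per query, USD

def annual_cost(queries_per_day: float) -> tuple[float, float]:
    """Return (training_cost, inference_cost) for one year."""
    training = TRAINING_COST_PER_RUN * TRAINING_RUNS_PER_YEAR
    inference = COST_PER_QUERY * queries_per_day * 365
    return training, inference

for qpd in (1e6, 1e8, 1e10):
    train, infer = annual_cost(qpd)
    print(f"{qpd:>14,.0f} queries/day -> training ${train/1e6:,.0f}M, inference ${infer/1e6:,.1f}M")
```

Training stays flat at a few hundred million dollars a year under these assumptions, while inference climbs from single-digit millions to tens of billions as usage grows. That crossover is the whole problem.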
The AI industry has known this distinction for years and mostly ignored its implications. Labs priced inference below cost and called it a growth strategy. Venture capital funded the gap. OpenAI is spending $1.35 for every dollar it earns.5 Anthropic burned through 23% more on inference than projected in a year when revenue grew 10x.2 The bet was that market share acquired cheaply today would justify pricing power later. That bet is now being stress-tested in real time, and the results are instructive.
But the deeper cost isn't what it's doing to the labs. It's what it's preventing from being built at all.
In the early 1990s, if you wanted to access the internet, you dialed in through a phone line and waited. Not metaphorically waited. You heard the screeching handshake tones of a modem negotiating a connection, and then watched a page load at speeds measured in kilobits per second. The internet existed. The potential was obvious to anyone paying attention. But the bottleneck meant that only a narrow slice of what the internet could be was actually possible to build. Nobody was going to make YouTube on a dial-up connection. Nobody was going to run a cloud software business when individual users couldn't reliably load a web page.
The applications that defined the modern internet, the ones that generated trillions of dollars, were literally impossible to build until the infrastructure got cheap enough. They weren't waiting in a queue. They didn't exist yet. The bottleneck didn't just slow the market down. It prevented an entire category of creation.
The same logic applies to AI compute today. An AI agent that costs $4 to run and saves a customer service rep 15 minutes of work has negative ROI. At $0.40 it might have positive ROI. At $0.04 it changes the entire labor economics of customer service.6 The applications that would prove this technology's value at scale, the ones that require cheap enough inference to build sustainable businesses on, are not being built yet. Not because the ideas don't exist. Because the unit economics don't work.
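The arithmetic is easy to sanity-check. In the sketch below, the $16-per-hour fully loaded labor cost is an assumption chosen for illustration; the per-run agent costs are the ones quoted above. At $4 the agent roughly breaks even on raw labor before any integration or oversight overhead, which is why the ROI goes negative in practice.

```python
# Back-of-the-envelope version of the agent ROI arithmetic above.
# The hourly labor cost is an assumption; the agent costs are from the text.

LABOR_COST_PER_HOUR = 16.0   # assumed fully loaded cost of a customer service rep
MINUTES_SAVED = 15

labor_value = LABOR_COST_PER_HOUR * MINUTES_SAVED / 60   # $4.00 of labor saved per task

for agent_cost in (4.00, 0.40, 0.04):
    net = labor_value - agent_cost
    multiple = labor_value / agent_cost
    print(f"agent run ${agent_cost:>5.2f}: net ${net:+.2f}, {multiple:>5.0f}x return on spend")
```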
Venture capital can fund a company burning through its runway. It can't fund an idea nobody has thought to pitch yet, because the economics made it look impossible before the idea fully formed. The bottleneck isn't just a tax on existing use cases. It's preventing the imagination of entire categories.
The reason the bottleneck is so stubborn has less to do with any single company's decisions than with the structure of the supply chain underneath the whole industry.
It helps to start with what that supply chain actually looks like.
Nvidia controls somewhere between 80% and 90% of the market for AI training accelerators.7 Its data center revenue grew 68% year over year in fiscal 2026, reaching $193.7 billion for the full year.8 Gross margins sit above 70%.8 The company has demand commitments exceeding a trillion dollars through 2027.9 By any conventional measure, this is one of the most successful businesses in the history of technology.
But Nvidia's position is not a monopoly in the traditional sense. It is not holding back supply to inflate prices. It is shipping as fast as it possibly can. The constraint runs deeper than any single company's decisions.
TSMC, the Taiwanese chipmaker, manufactures the leading-edge semiconductors that power virtually every competitive AI accelerator in existence: Nvidia's GPUs, Google's TPUs, Amazon's Trainium, AMD's Instinct accelerators.10 All of them flow through TSMC's fabs, almost all of which sit on a single island at the center of one of the world's most contested geopolitical flashpoints. TSMC's advanced packaging capacity is a genuine constraint on the entire AI buildout. Nvidia's pricing power is partly a downstream effect of controlling where that constrained supply gets allocated.
Then there is HBM, high-bandwidth memory, the specialized chips that allow data to move fast enough for AI workloads to function. HBM is manufactured by three companies in the world: Samsung, SK Hynix, and Micron.11 When AI demand for HBM spiked, conventional DRAM prices roughly tripled year over year as manufacturers reallocated wafer capacity.12 A shift in one part of the supply chain sent shockwaves through unrelated memory markets.
This is not theoretical fragility. It has materialized before.
In the early 1980s, Japanese semiconductor manufacturers flooded the global DRAM market. Prices collapsed 60% in a single year. American DRAM producers abandoned the business one by one until only Micron and Texas Instruments remained.13 The concentration of manufacturing in Japan meant that a single set of industrial policy decisions could nearly destroy an entire US industry. The US government eventually intervened with the Semiconductor Trade Agreement of 1986, but by then the structural damage was done. American memory manufacturing never fully recovered.
The risk today is different in character but similar in kind. It is not about pricing wars. It is about physical access. A policy decision, a natural disaster, a disruption to the advanced packaging supply chain: any of these could interrupt the infrastructure the entire AI buildout depends on. Markets that ignore concentration risk don't eliminate it. They defer it, and collect interest on the deferral.
But here is the optimistic reading of the same facts: every one of these fragilities is creating pressure that produces something. The TSMC concentration is accelerating investment in alternative fabs and domestic semiconductor manufacturing. The HBM constraint is driving research into alternative memory architectures. The pricing pressure is forcing model efficiency work that wouldn't exist under abundance. The constraint is not just a problem. It is a forcing function.
The more telling signal is what Nvidia's best customers are doing.
Google has been building TPUs, custom AI accelerators, since 2015.14 Amazon has Trainium and Inferentia. Microsoft built Maia, an in-house AI chip that arrived late and underdelivered, but the intent was clear. Meta is building its own inference silicon. These are not companies trying to start a chip business. These are companies trying to stop being permanently dependent on a single supplier with unchecked pricing power over their most critical infrastructure.
When your largest customers start vertically integrating to escape you, that is not just a competitive threat. It is proof that the constraint is severe enough to justify enormous investment to route around it. And that investment, even when it underdelivers as Maia did, builds the engineering capability, the institutional knowledge, and the supply chain relationships that make the next attempt more likely to succeed.
History offers three templates for how binding infrastructure constraints get broken. None of them involve the dominant player voluntarily lowering prices. All of them have happened before. And in each case, the constraint itself was the catalyst.
The first is learning to need less of the constrained thing.
The 1973 oil crisis already told this story. Detroit didn't build a better oil well. It built a better engine. The constraint didn't disappear. The economy learned to need less of it per unit of output. Morgan Housel has pointed out that the biggest energy story of the last fifty years had nothing to do with oil supply at all. It was efficiency. Conservation. The thing nobody was watching.17 None of that innovation happens without the price shock. The crisis was the precondition.
This path already has a name in AI: DeepSeek. In early 2025, a Chinese lab released a model that matched frontier performance at a fraction of the training compute.18 DeepSeek exists because compute is expensive. The engineering pressure that produced it, do more with less, is exactly the pressure the oil crisis put on Detroit. Whether or not that specific result holds up under scrutiny, the direction it points is real. If models keep getting more capable per unit of compute, the absolute demand for leading-edge silicon grows slower than the current investment wave assumes. The bottleneck softens rather than breaks. This is probably the most likely near-term path, and notably the one that produces a healthier, more efficient AI ecosystem than cheap abundant compute ever would have.
The second is changing the unit entirely.
In 1956, a trucking entrepreneur named Malcolm McLean had a simple idea: instead of hiring longshoremen to load and unload individual crates, barrels, and bags from ships, put everything in standardized steel boxes. The same box that left a factory in Ohio could ride a truck to port, transfer to a ship without being opened, and arrive at a warehouse in Rotterdam ready to load onto another truck. Nobody had to invent a faster ship or a bigger port. They just had to agree on the size of the box.19
Containerization didn't improve the shipping industry. It restructured it entirely. The constraint wasn't speed or capacity. It was interoperability. And it took the economic pressure of a broken, expensive, fragmented shipping system to make standardization commercially viable.
The small language model (SLM) path in AI is this story. Instead of one enormous model that handles every task, requiring the most powerful, most expensive hardware to run, you build many smaller models, each specialized for a specific job, routing work between them based on what each does best. A small model that answers customer service questions doesn't need a $30,000 GPU. It can run on hardware that already exists in abundance. The constraint isn't solved. The question changes. And the SLM paradigm exists precisely because the frontier model paradigm got too expensive for most use cases to justify.
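As a rough illustration of the routing idea, here is a minimal sketch. The model names, relative costs, and keyword classifier are hypothetical placeholders; a production router would use a learned classifier and real latency and cost data.

```python
# Minimal sketch of SLM routing: send each request to a small specialized
# model and fall back to an expensive frontier model only when nothing
# cheaper fits. All names and costs are hypothetical.

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_call: float  # assumed relative cost, not a real price

SPECIALISTS = {
    "support": Model("support-slm", 0.001),    # runs on commodity hardware
    "code":    Model("code-slm", 0.002),
    "extract": Model("extraction-slm", 0.001),
}
FRONTIER = Model("frontier-llm", 0.05)          # reserved for open-ended tasks

def route(query: str) -> Model:
    """Pick the cheapest model that plausibly handles the query."""
    q = query.lower()
    if "refund" in q or "order" in q:
        return SPECIALISTS["support"]
    if "stack trace" in q or "exception" in q:
        return SPECIALISTS["code"]
    if "invoice" in q or "extract" in q:
        return SPECIALISTS["extract"]
    return FRONTIER  # everything else pays the frontier price

print(route("Where is my order #1234?").name)          # support-slm
print(route("Write a strategy memo on pricing").name)  # frontier-llm
```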
The third is different physics.
In the mid-19th century, every city in the world was designed around the horse. Not as a metaphor. Literally. How wide streets needed to be, how far a neighborhood could extend from a market, how goods moved, how armies supplied themselves. The horse was the constraint, and it was so fundamental that most people didn't think of it as a constraint at all. It was just the world.
The internal combustion engine didn't improve the horse. It made the horse irrelevant for most of what horses did. Cities redesigned themselves. Industries that existed to support horses collapsed. New industries emerged that the horse had made impossible: suburbs, highway commerce, modern logistics.
Something like this is possible in computing. Optical computing, neuromorphic chips, quantum processors for specific problem types: none of these are imminent, and most predictions about their timelines have been wrong for decades. But the economic pressure to find a different physical substrate for AI compute has never been higher, which historically does accelerate discovery. If it happens, the question won't be which company has the best GPU. The question will be whether GPUs are the right tool at all.
We don't know which of these paths the market takes, or in what combination, or on what timeline.
What history is fairly consistent about is this: binding infrastructure constraints that look permanent rarely get solved head-on. They get routed around, made irrelevant, or dissolved by a shift in the underlying question. The horse didn't lose to a faster horse. The dial-up bottleneck didn't get resolved by better copper wire. The oil crisis wasn't solved by finding more oil. In each case, the pressure the constraint created was the precondition for the solution.
The AI compute market is currently priced as if the current supply chain is the permanent answer, as if leading-edge GPU capacity from a handful of fabs, assembled into rack-scale systems and sold at high margins, is how AI inference gets delivered indefinitely. That has been the right bet for the last three years. It is probably not the right bet for the next ten.
What comes after the bottleneck breaks will not look like AI today with better numbers. It will look like something we don't have a name for yet, built by companies that don't exist yet, on infrastructure that hasn't been invented yet, serving use cases that seem obvious only in retrospect. The companies worth paying attention to now are the ones building as if that world already exists, treating today's constraints as temporary inputs rather than fixed costs.
We are in the hard years. The internet that exists today was not built on dial-up. It was built on what replaced it. The same is almost certainly true here.
Epilogue
Nvidia is the most interesting character in this story, because it is simultaneously the bottleneck and the most sophisticated actor responding to it.
Think of it as the OPEC of this moment. OPEC created the 1973 crisis. But the crisis produced fuel injection, hybrid engines, energy efficiency standards, and eventually the conditions for electric vehicles. OPEC didn't control any of that. It couldn't prevent any of that. The constraint it imposed generated the innovation that structurally reduced the world's dependence on what OPEC sold.
Nvidia is doing something OPEC never did, though. It is actively trying to become indispensable to whatever comes next.
The CUDA ecosystem, the programming framework that made Nvidia GPUs the default choice for AI workloads, has been the primary lock-in mechanism for years. Developers learned CUDA, built on CUDA, optimized for CUDA. Switching meant rewriting code, retraining teams, accepting performance penalties. The switching costs were real. But there is a threat to this moat that did not exist until recently: AI code generation tools can now abstract the hardware layer. When a model can rewrite your CUDA-optimized inference code for a different accelerator architecture, the switching cost drops and the moat gets shallower.
Nvidia's response is not to defend CUDA more aggressively. It is to move up the stack.
Dynamo, released as version 1.0 in March 2026, is the clearest expression of this strategy.15 It is an inference operating system, a distributed orchestration layer that manages how inference requests route across GPU fleets, how memory is handled, how the prefill and decode phases split across hardware. It is open source, free, and already integrated by AWS, Google Cloud, Microsoft Azure, and dozens of AI-native companies. The strategic logic mirrors Google giving away Android: you commoditize the layer above your core product to make the layer you control more valuable. The open-source wrapper around a proprietary core is a very old strategy, and it has worked before.
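To make the prefill/decode split concrete without reproducing Dynamo's actual API, here is a toy scheduler under assumed worker pools: prompt processing (prefill) is compute-bound, token generation (decode) is memory-bandwidth-bound, so an orchestrator can assign each phase of a request to the pool that suits it.

```python
# Toy illustration of disaggregated prefill/decode scheduling.
# This is NOT Dynamo's API, just the underlying idea: the two phases of one
# inference request are assigned to different (hypothetical) hardware pools.

PREFILL_POOL = ["gpu-a0", "gpu-a1"]             # hypothetical compute-optimized workers
DECODE_POOL = ["gpu-b0", "gpu-b1", "gpu-b2"]    # hypothetical bandwidth-optimized workers

def schedule(request_id: str, prompt_tokens: int, max_new_tokens: int) -> dict:
    """Assign the prefill and decode phases of one request to separate pools."""
    prefill_worker = PREFILL_POOL[hash(request_id) % len(PREFILL_POOL)]
    decode_worker = DECODE_POOL[hash(request_id) % len(DECODE_POOL)]
    return {
        "request": request_id,
        "prefill": {"worker": prefill_worker, "tokens": prompt_tokens},
        "decode": {"worker": decode_worker, "tokens": max_new_tokens},
        # In a real system the KV cache built during prefill would be
        # transferred or shared between the two workers.
    }

print(schedule("req-42", prompt_tokens=8000, max_new_tokens=500))
```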
Groq followed the same playbook but at the hardware level. Groq was building inference chips that were, for specific tasks, dramatically more efficient than Nvidia GPUs. Rather than compete on efficiency, Nvidia paid $20 billion for the technology and team.16 The strategic move is legible: own inference efficiency before inference efficiency becomes the axis on which you lose.
Whether these moves are enough to keep Nvidia relevant through the transition is the most interesting open question in technology right now. The three paths out of the bottleneck all reduce Nvidia's pricing power to some degree. Efficiency gains mean you need less hardware. SLMs run on commodity silicon. A hardware paradigm shift makes GPUs less central. Nvidia is adapting faster than any incumbent in recent memory, but it is adapting to a world where the thing it currently monetizes at extraordinary margins becomes less scarce.
History doesn't offer much comfort to incumbents betting on software moats to survive hardware paradigm shifts. But Nvidia is not a typical incumbent, and this is not a typical transition. The outcome is genuinely uncertain. That, too, is worth paying attention to.
Sources