How Microsoft Vaporized a Trillion Dollars, Pt. 5

Source: https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-841

(Continued from Part 4)

If you hear hoofbeats, think horses, not zebras.

Microsoft rushed Azure out of the gate under intense competitive pressure. Corners were cut. Fundamental principles of reliability and operational simplicity were quietly abandoned.

The company formalized the idea that defects could be fixed through human intervention on live production systems, all to accelerate time-to-market and secure major federal cloud contracts. As VP-level executives later admitted, the “digital escort strategy” helped the company “go to market faster.”

Instead of going back to the drawing board to tackle the growing technical debt, Microsoft relied on quick fixes: layers of automation running mitigation scripts, a growing team of on-call staff, and, when automation was not enough, manual repairs.

Public reports revealed hundreds of these interventions monthly on sensitive government clouds alone. In reality, across the much larger commercial fleet, the total number of interventions was significantly higher.

OPEX and support engineers who need access to privileged parts of the system submit a Just-In-Time (JIT) request, which is broadcast on a dedicated mailing list for approval. Any full-time member of the organization can approve these requests. Once approved, the requester receives 8 hours of system access, during which they can interact with physical nodes and fabric controllers, and manage secrets when the requested access level is set to RdmSecretsAdministrator.
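The approval flow described above can be modeled in a few lines. This is only an illustrative sketch: the names `JitRequest`, `approve`, `has_access`, and `can_manage_secrets` are hypothetical and do not reflect Azure's actual tooling. Only the 8-hour window, the rule that any full-time member may approve, and the `RdmSecretsAdministrator` level come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACCESS_WINDOW = timedelta(hours=8)  # from the text: 8 hours of access once approved


@dataclass
class JitRequest:
    requester: str
    access_level: str                      # e.g. "RdmSecretsAdministrator"
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None


def approve(request: JitRequest, approver: str, approver_is_full_time: bool,
            now: datetime) -> None:
    """Record an approval. Per the text, any full-time member may approve."""
    if not approver_is_full_time:
        raise PermissionError("only full-time members can approve")
    request.approved_by = approver
    request.approved_at = now


def has_access(request: JitRequest, now: datetime) -> bool:
    """True while the request is approved and inside its 8-hour window."""
    if request.approved_at is None:
        return False
    return request.approved_at <= now < request.approved_at + ACCESS_WINDOW


def can_manage_secrets(request: JitRequest, now: datetime) -> bool:
    """Secrets management requires the RdmSecretsAdministrator level."""
    return has_access(request, now) and request.access_level == "RdmSecretsAdministrator"
```

In this model the only gate on approval is the approver's employment status, which is all the description specifies.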

In just over two months, from August 14, 2024, to October 26, 2024, the Outlook folder I created to separate JIT requests from other messages collected 14,209 requests — nearly 200 per day.
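The daily rate follows directly from the dates and the count given above. A quick check, assuming both the first and last day are counted:

```python
from datetime import date

start, end = date(2024, 8, 14), date(2024, 10, 26)
requests = 14_209

days_inclusive = (end - start).days + 1   # count both endpoints
per_day = requests / days_inclusive

print(days_inclusive, round(per_day))     # prints: 74 192
```

That is roughly 192 requests per day over 74 days, consistent with "nearly 200 per day."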

What may have started as temporary workarounds became standard procedures, just part of doing business. Azure never operated as smoothly or independently as promised. What Microsoft presented to the world, and to its most demanding customers, was a sophisticated system perpetually on life support.

This foundational fragility, rooted in rushed decisions and wishful thinking about how fast the platform could grow and stabilize, led to small but ongoing disruptions. Over time, those disruptions built up.

The result was a classic butterfly effect: internal flaws in Azure node software quality, testing discipline, and architectural clarity spread outward, undermining the execution of high-visibility commitments.

By early 2025, OpenAI — still nominally under Microsoft’s right of first refusal — began aggressively diversifying its compute footprint.

The consequences quickly became visible: Wall Street grew doubtful despite record profits, and investor confidence declined sharply. From its peak in late October 2025, Microsoft’s stock dropped over 30% in the following months, wiping out more than a trillion dollars in market value.

The hoofbeats had been present all along.

Hindsight makes the better path clear: pause aggressive feature velocity, invest heavily in stabilizing the core node stack, simplify the agent ecosystem, and rebuild testing and ownership discipline before layering on ambitious offload projects or promising bare-metal capabilities to flagship customers.

But that path was never pursued. The organization had already adapted to constant firefighting. More importantly, Microsoft no longer had the deep senior systems talent — the experienced kernel, virtualization, and distributed-systems engineers who built the original Fabric — needed for such a fundamental overhaul.

Replacing or re-architecting a system of Azure’s scale and complexity is like swapping an airplane’s engines mid-flight. Not impossible in theory, but extremely risky in practice, especially when the crew has changed and the original expertise has mostly left.

The reality is clear: there is no quick fix. Azure is in a deep structural hole, and the company must now operate with the platform it has while stabilizing it under full load.

The situation was salvageable, though. In 2024, I read the OpenAI PM specs, which detail OpenAI’s demands and the promises Azure made to meet them.

I judged that the existing plans were likely to fail (a hunch that history has since proven correct), so I began drafting new ones to rebuild the Azure node stack from first principles.

The new node platform rested on two foundational elements: a simple cross-platform component model for creating portable modules that could be built for both Windows and Linux, and a new message bus spanning the entire node, over which agents could communicate freely across guest, host, and SoC boundaries. I shared those ideas widely through written documents and presented some of them at a high-profile cross-organization technical meeting.

Some of OpenAI’s requests for their future bare-metal nodes, which would have allowed them to extract the last few percent from the hardware, required extensions to the Overlake card itself. I drafted these extensions and shared them with a division’s Technical Fellow, a renowned kernel architect who had recently shifted to Azure and whom I knew from my previous tenure in the kernel team.

The improvements might have been part of Overlake 4, the next major version of the Windows Boost offloading platform. In the meantime, a software-only implementation could have been deployed to enable true read-only remote system images and fast system resets, which allow the quick experimentation and rollbacks common in research domains.

I created a new code repository that adheres to the latest Azure governance standards and began developing actual components, aiming to set an example and build momentum.

I solved the “million files deletion problem,” which seems simple but still needs careful handling to run reliably at cloud scale. Next, I built an encrypting LRU cache to separate tenants’ data and follow basic security principles in hostile multi-tenancy environments. Still fairly simple, but that’s the goal of componentization.
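To give a sense of why bulk deletion needs care, here is a minimal sketch of one possible approach. It is not the component described above, whose design is not given here. The function name `delete_tree` and its return shape are hypothetical. The assumptions are: stream directory entries instead of building a giant in-memory list, tolerate files that vanish concurrently, and collect failures instead of aborting on the first error.

```python
import os
from typing import List, Tuple


def delete_tree(root: str) -> Tuple[int, List[Tuple[str, str]]]:
    """Delete everything under `root`, then `root` itself.

    Returns (number of files removed, list of (path, error) failures).
    """
    removed = 0
    failures: List[Tuple[str, str]] = []
    stack = [(root, False)]          # (directory, children already handled?)

    while stack:
        path, visited = stack.pop()
        if visited:
            # Second visit: children are gone, remove the directory itself.
            try:
                os.rmdir(path)
            except FileNotFoundError:
                pass                 # someone else removed it; not an error
            except OSError as exc:
                failures.append((path, str(exc)))
            continue

        stack.append((path, True))   # come back after the children
        try:
            # scandir yields entries lazily, so memory stays bounded
            # even when a directory holds millions of files.
            with os.scandir(path) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append((entry.path, False))
                        else:
                            os.unlink(entry.path)
                            removed += 1
                    except FileNotFoundError:
                        pass         # vanished between listing and delete
                    except OSError as exc:
                        failures.append((entry.path, str(exc)))
        except FileNotFoundError:
            pass
        except OSError as exc:
            failures.append((path, str(exc)))
    return removed, failures
```

Returning the failure list lets a caller retry or report leftovers, rather than leaving a half-deleted tree behind with no record of what went wrong.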

These components could be called directly from existing code, significantly enhancing resilience and security with minimal changes beyond deletions.

The practical strategy I suggested was incremental improvement: choose an area, develop and thoroughly test a reliable, reusable replacement, then remove the old code and replace it with a call to the new component.
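The steps above can be sketched on a toy example. Everything here is hypothetical and chosen only for illustration: `legacy_parse_endpoint` stands in for an isolated piece of old code, `parse_endpoint` for its replacement component, and `handle_request` for the original entry point after the swap.

```python
def legacy_parse_endpoint(text):
    """Original inline logic, kept only while the replacement is validated."""
    parts = text.strip().split(":")
    host = parts[0]
    port = int(parts[1]) if len(parts) > 1 and parts[1] else 80
    return host, port


# Steps 1 and 2: a small replacement component that can be tested on its own.
def parse_endpoint(text: str, default_port: int = 80) -> tuple:
    host, _, port = text.strip().partition(":")
    return host, int(port) if port else default_port


# Step 2, continued: confirm the component matches the old behaviour
# on representative inputs before switching over.
def agrees_with_legacy(samples) -> bool:
    return all(parse_endpoint(s) == legacy_parse_endpoint(s) for s in samples)


# Step 3: the old entry point survives only as a call into the new component.
def handle_request(text):
    return parse_endpoint(text)
```

Once `agrees_with_legacy` passes on a representative sample, `legacy_parse_endpoint` can be deleted, and the change to the surrounding code is a single call site.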

This strategy goes a long way toward modernizing a running system with minimal disruption and offers gradual, consistent improvements. It uses small, reliable components that can be easily tested separately and solidified before integration into the main platform at scale.

Eventually, there is nothing left to carve out, and the original components are just skeletons calling into new ones. Componentization also enables moving elements around; for example, a secure cache could be used on the offload accelerator, on the host, inside a guest VM, a guest L1/L2 container, or on a bare-metal node, with a uniform message bus connecting all parts.
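As a rough picture of why a uniform bus makes components relocatable, consider this in-process stand-in. It is a sketch under heavy assumptions: `MessageBus`, `CacheComponent`, and the `cache.put` topic are hypothetical names, and a real transport across guest, host, and SoC boundaries is out of scope here.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """In-process stand-in for a node-wide bus; topics are plain strings."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver to every subscriber; return how many handlers ran."""
        handlers = list(self._handlers.get(topic, ()))
        for handler in handlers:
            handler(message)
        return len(handlers)


class CacheComponent:
    """Same code wherever it is hosted; `location` is only a label."""

    def __init__(self, bus: MessageBus, location: str) -> None:
        self.location = location
        self.store: Dict[str, bytes] = {}
        bus.subscribe("cache.put", self._on_put)

    def _on_put(self, message: dict) -> None:
        self.store[message["key"]] = message["value"]


bus = MessageBus()
on_host = CacheComponent(bus, "host")
on_soc = CacheComponent(bus, "soc")
delivered = bus.publish("cache.put", {"key": "k", "value": b"v"})
```

The component only knows about the bus, never about where it runs, so moving it from host to accelerator is a deployment decision and not a code change.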

This vision was met with disdain among lower-level management in Azure Core, who may not have understood the urgency, or the scale, of the changes needed to make the platform truly scalable while lowering long-term OPEX.

Gradual enhancement through componentization challenged the status quo of constant firefighting and the comfort of familiar, yet fragile, code paths.

In the end, the organization chose the easiest route at the moment: keep adding complexity on a fragile foundation instead of investing in a careful, step-by-step modernization that could have restored autonomy and reliability.

The outcome was predictable. High-profile commitments fell through, customer trust continued to decline, and the internal weaknesses I had pointed out from the start kept surfacing in increasingly visible ways outside the Redmond campus.

What started as engineering disagreements turned into something bigger: a test of whether Microsoft could still perform at the level its most strategic customers and partners expected.

The hoofbeats grew louder. Over the following months, I extended my concerns beyond my direct managers.

(Continued in Part 6)