How Microsoft Vaporized a Trillion Dollars, Pt. 3

Source: https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-f67

(Continued from Part 2)

Circling back to the origins of Azure, Cutler’s intent was to produce a system with the same level of quality, unshakable reliability, and attention to detail he was famous for in his work on VMS and NT.

In a 2009 interview with ZDNET, he declared that the intent [for the Azure Fabric Controller] was that “it manages the placement, provisioning, updating, patching, capacity, load balancing, and scale out of nodes in the cloud all without any operational intervention.” (emphasis added)

From my years with one of the original contributors to the Fabric, I learned that touching the nodes by hand was also strictly off-limits: the original design intent was that Azure would operate without human intervention.

When discussing the discretion around Azure promises at the time, Cutler said, “The answer to this is simply that the RD group is very conservative and we are not anywhere close to being done.”

He further added that “[they] are taking each step slowly and attempting to have features 100% operational and solidly debugged before talking about them.”

That was on February 24, 2009. A mere 48 weeks later, Azure shipped for general consumption.

Fast forward to Summer 2025, and the Secretary of Defense, Pete Hegseth, publicly mentioned “a breach of trust” with Microsoft, following an article from ProPublica describing “digital escort sessions” conducted on Azure computers.

The article details how escort sessions involve specialized $18/hour employees who copy/paste and execute commands on government cloud nodes under direction from Microsoft support personnel, often based in foreign countries, including China.

However, direct node access and manual interventions are common daily practices that extend well beyond government clouds.

Cutler’s vision of a “no human touch” cloud service unfortunately never materialized, as the article mentions “hundreds of interactions” each month for the government clouds alone.

The article reveals that the program was devised at the highest levels of the company, with support from CVP-level contributors who declared that “the digital escort strategy allowed the company to ‘go to market faster,’ positioning it to win major federal cloud contracts.”

Azure shipped as an unfinished product under intense market pressure, and major corners were cut. Notably, routine manual intervention on the nodes was part of the strategy.

Marketing and competitive pressure often work in mysterious ways; however, the article does not explain why manual repairs were needed on the nodes.

The answer is now simple: the software didn't work as well as hoped, in large part because the system was rushed under intense pressure.

Cue the post-launch talent exodus, its replacement by people of very different experience levels, and you end up with a system that over-promises and under-delivers, drowning in unsolvable problems.

This gap between Cutler’s “no human touch” ideal and the reality of hundreds of monthly manual interventions wasn’t abstract for me.

In the Overlake team and Compute Node Services, I saw the same underlying fragility I had observed since day one: chronic crashes, resource leaks, malformed VMs, and a bloated agent ecosystem that no one could fully explain. Together, these created exactly the kind of instability that demanded constant human firefighting, including on sensitive government clouds.

What I encountered in 2023–2024 was not a handful of occasional edge cases, but a steady stream of symptoms from a system that had never been allowed to stabilize, even though its foundations, namely the hypervisor and Windows OS, were robust.

The manual escort sessions were, in many ways, the visible symptom of deeper architectural and process debt.

I began raising these issues internally, including through formal warnings that eventually reached the highest levels of the company.

On one particular occasion, a feature that had been baking for eleven months, intended to exchange secret encryption keys between some actor in the guest VMs and the host OS, generated two Sev-2 incidents within hours of being rolled out to general production.

It turned out that one of the agents was calling into another through an unknown endpoint, generating errors that were logged on both sides.

An infinite retry loop kept both agents busy logging errors, saturating the circular logs and shrinking their retention horizon from the usual 2-3 days to about two hours.
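To make the arithmetic behind that shrinking horizon concrete, here is a minimal sketch in Python. It assumes a fixed-capacity circular log measured in entries; the capacity and write rates are illustrative numbers chosen to match the "2-3 days" and "two hours" figures above, not values taken from Azure, and the function names are hypothetical.

```python
from collections import deque


def horizon_hours(capacity_entries: int, entries_per_hour: float) -> float:
    """Hours of history a full circular log retains at a steady write rate."""
    return capacity_entries / entries_per_hour


def simulate(capacity_entries: int, entries_per_hour: int, hours: int) -> float:
    """Write timestamped entries into a ring buffer; return the retained time span."""
    log = deque(maxlen=capacity_entries)  # oldest entries are silently overwritten
    for hour in range(hours):
        for i in range(entries_per_hour):
            log.append(hour + i / entries_per_hour)  # timestamp in hours
    return log[-1] - log[0]


# Assumed numbers, for illustration only.
CAPACITY = 60_000      # entries the circular log can hold
NORMAL_RATE = 1_000    # entries per hour in normal operation
STORM_RATE = 30_000    # entries per hour while the retry loop is spinning

normal = horizon_hours(CAPACITY, NORMAL_RATE)  # 60 hours, i.e. 2.5 days
storm = horizon_hours(CAPACITY, STORM_RATE)    # 2 hours
print(f"normal horizon: {normal:.0f} h, storm horizon: {storm:.0f} h")
print(f"implied write amplification: {normal / storm:.0f}x")
```

Under these assumptions, going from a horizon of 2-3 days to about two hours corresponds to roughly a 24x to 36x increase in log volume. The capacity of the log never changes; only the write rate does, which is why a retry storm can erase days of diagnostic history without any component reporting a failure.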

This incident illustrates the lack of deep code ownership, overly complex inter-agent interactions, technical leadership gaps, and testing practices that allow major defects to reach production.

I distinctly remember asking the dev manager for permission to halt the worldwide rollout, and it took the teams the entire weekend and half of the following week to roll back the system to the previous version.

In another instance, it took three months, from January to March 2024, to run a file-deletion script across the fleet to clean up leaked files that had triggered a 100GB temporary files threshold on some nodes.

Systemic failures and limitations of the automated systems, internally known as “OaaS” and “Geneva Actions,” made a simple task daunting.

These incidents were emblematic of the daily reality for Azure OPEX teams: a constant flood of issues stemming directly from instabilities in the node software and in the surrounding support systems.

These were not isolated failures but part of a persistent pattern. The same poorly understood, interdependent agent ecosystem creates fragile chains that turn minor changes into production crises.

For Azure customers, those failures manifest mostly when commissioning or decommissioning large numbers of resources, or during other operations involving the node management stack.

Nodes experiencing failures are placed in an “unhealthy” state, and user workloads are migrated to other physical machines so the faulty node can be repaired. This causes service interruptions: VMs must be suspended and the gigabytes of memory they consume copied to another machine, where the VMs are “rehydrated.” These recovery operations are not immune to errors either.

Resource leaks, crashes, “rogue” and “zombie” VMs, and node health issues are generally accommodated during normal times, as Azure has some room to spare and personnel to help with recovery around the clock.

However, how the system would cope near capacity, for example during a crisis, is anyone’s guess. A “run on the bank,” where a large number of customers suddenly require increased capacity, is likely to end in disaster.

As these issues accumulated, I began raising them more formally through my management chain, including through structured warnings that ultimately reached senior leadership and beyond.

I also mentioned potential security issues that I had discovered along the way.

The responses varied from acknowledgment to defensiveness, revealing how deeply the culture had adapted to operating in a state of perpetual firefighting rather than addressing root causes.

This tension came to a head with the Azure-wide Rust mandate, conflicting porting plans, and the parallel demands of high-visibility projects such as the long-delayed OpenAI bare-metal SKUs.

What started as technical disagreements quickly exposed larger strategic and cultural fractures within the organization.

Click for Part 4.