📝 Резюме · 🧾 Транскрипт (формат) · 📄 Оригинал (9.7 KB)

https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-8f4

How Microsoft Vaporized a Trillion Dollars, Pt. 2

Источник: https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-8f4

Краткое содержание: Source: https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-8f4 ============================================================ (Continued from Part 1) What I discovered in the following weeks and months was a strained organization, exhausted by constant incidents, millions of unattended crashes in the Azure node management stack, conflicting coding standards, limited security awareness, weak testing practices, code freezes born of fear, unrealistic timelines, blame-shifting, and a noticeable gap in senior technical leadership. Before diving deeper into each issue, it helps to understand how the team reached this point.

Основные тезисы:

Source: https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-8f4
============================================================
(Continued from Part 1)

Значимость: Затрагивает международную повестку и политический контекст.

🧾 Транскрипт (формат)

How Microsoft Vaporized a Trillion Dollars, Pt. 2 Source: https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-8f4

(Continued from Part 1)

What I discovered in the following weeks and months was a strained organization, exhausted by constant incidents, millions of unattended crashes in the Azure node management stack, conflicting coding standards, limited security awareness, weak testing practices, code freezes born of fear, unrealistic timelines, blame-shifting, and a noticeable gap in senior technical leadership.

Before diving deeper into each issue, it helps to understand how the team reached this point.

During my earlier tenure as a kernel engineer on the Windows Core OS team, I reported to one of the most talented operating system engineers I encountered at Microsoft.

He had decades of experience, stretching back to working with Dave Cutler on the Windows systems we know today. Among his contributions were the Server and Application Silos (code-named Helium and Argon), which form the foundation of the Windows Container platform.

He also worked on research operating systems such as Midori and Singularity, and was one of the original contributors to the Azure Fabric, the meta operating system that orchestrates Microsoft’s cloud infrastructure.

One day early in my time under him, he brought in some old team swag: a sweater emblazoned with 0xF0FFFF, the hex color of Azure.

From him and over the years, I learned not only about kernel design but also about Azure’s origins and the intense competitive pressure that shaped it.

Amazon had launched S3 and EC2 in 2006; Microsoft was late to the public cloud race and needed to move fast. The project, code-named Red Dog, began with a small team of just five or six elite engineers, led by Cutler.

The heavy lifting on the nodes falls to the hypervisor and modified host and guest OSes optimized for virtualized environments.

Nodes belong to clusters managed by the Fabric Controller, which handles resource inventory, VM placement, provisioning, servicing, load balancing, and scaling.

A set of agents, including the central RdAgent, reports back to the controller and orchestrates local resources on each node, as well as the creation of virtual machines.

Creating a VM is still fundamentally like ordering pizza (skipping some details): you choose from a menu of sizes and ingredients: 16 cores? Sure. 128 GB of memory? Done. 32 disks? No problem. Four NICs? GPU? You got it!

From there, the node software pilots the hypervisor to create the partition, attach the required devices, including a disk containing the boot image, and start the VM.

The project succeeded and shipped in February 2010 as Windows Azure. But as often happens with rushed, high-pressure efforts, many of the original core contributors eventually moved on.

At the time, Microsoft was still heavily focused on PCs, tablets, and phones. Teams were porting Windows to ARM, shipping Windows 8 and 8.1, and acquiring Nokia while reimagining Xbox One around Hyper-V under Cutler’s leadership.

Cloud was important but not yet central. OneDrive and SharePoint ran on separate infrastructure, and Azure remained a distant second to AWS.

Just months after Satya Nadella became CEO in February 2014, he canceled the dedicated SDET (Software Development Engineer in Test) role, triggering significant layoffs.

Due to Washington state WARN rules, Microsoft could not eliminate every tester position; hundreds remained.

Many of these testers, strong at execution but with limited experience in system design or deep software engineering, were retrained.

Some became data engineers focused on Windows 10 telemetry; others moved into software engineering roles (often down-leveled); and still others landed in lower-impact areas, including Azure OPEX, where they helped keep the lights on through on-call rotations and incident mitigation.

Fast forward, and large parts of Azure operations were being run by these former testers. Many were dedicated colleagues, but the shift left gaps in architectural depth for mission-critical systems.

OPEX teams exist to maintain production stability. Their work is grueling, with 24/7 on-call rotations, rapid mitigations, post-mortem analysis, and scripting fixes, leading to high attrition.

They typically do not design new software or own long-term bug fixes; instead, they file repair items for product teams and maintain a living knowledge base of incidents.

In 2018, Nadella repositioned the company around Cloud + AI and placed Scott Guthrie in charge.

Windows was reorganized under Azure, and overnight the existing Azure teams became central to Microsoft’s most strategic bet.

Most of the people stayed the same, save for a few high-profile transfers.

By the time I rejoined in 2023, roughly half the organization responsible for Compute Node Services consisted of junior engineers with only one or two years of experience.

The Group Engineering Manager’s background was in web performance (optimizing CSS for page load times), and the dev manager had limited Windows experience.

This group was now tasked with moving their inherited stack to the new Azure Boost accelerator environment, an effort Microsoft had publicly implied was well underway at Ignite conferences since 2023.

In reality, as the person responsible for the hypervisor-layer porting and reengineering, I knew the substantive work had barely begun.

The team had no clear starting point. The existing stack suffered from chronic crash-causing defects and memory leaks, leaving everyone firefighting.

Few engineers could reliably build the software locally; debugger usage was rare (I ended up writing the team's first how-to guide in 2024); and automated test coverage sat below 40%.

Every monthly release introduced more new defects than it fixed. Most rollouts were panicked rollbacks. Millions of crashes occurred each month, the majority unattributed because teams had never claimed ownership of their modules in the Azure Watson crash reporting system.

As a result, automated triage created few formal incidents, allowing monthly newsletters to tout glowing quality metrics unsupported by actual data.

The Core OS team often absorbed blame for issues originating in the Azure node software. Crashes frequently leaked resources: files, disks, even entire VMs.

Weak error handling led to malformed VMs (e.g., missing disks). When customers decommissioned them, the node software attempted to detach non-existent disks, triggering hypervisor errors.

The Azure team pointed fingers at Hyper-V, sparking escalations that reached VP level.

I once convened a high-stakes meeting with stakeholders from both sides; the Hyper-V leads were visibly frustrated by the repeated, misplaced blame.

Layered on this chaos was an Azure-wide mandate: all new software must be written in Rust. Some porting plans were abandoned, and many junior engineers grew excited by the new language.

Critical modules at the heart of Azure's node management, a critical part of the company's flagship Cloud + AI initiative, were sometimes designed by engineers with less than a year of tenure, under leads who lacked visibility into the details.

None of it shipped.

The VM management software continued to run and crash on Windows, despite repeated public statements from 2023 through 2025 claiming that key components had been offloaded to the Azure Boost accelerator and rewritten in Rust.

From my direct involvement, I know those claims did not reflect reality as late as the end of 2024. Of the 64 key work items identified a year earlier to reengineer the VM management stack for offload, none had been completed, and work had not even started on approximately 60 of them.

The list included foundational pieces such as a key-value store, tracing, logging, and observability infrastructure.

Worse, early prototypes already pulled in nearly a thousand third-party Rust crates, many of which were transitive dependencies and largely unvetted, posing potential supply-chain risks.

On top of all that, the org had a hard commitment to deliver the already long-delayed OpenAI bare-metal SKUs that had been promised for years. This work started around May 2024 with a target of Spring 2025 and was led by a Principal engineer who had evidently never tackled a task of that scale.

Fast-forward to March 10, 2025: OpenAI signed an $11.9 billion compute deal with CoreWeave for model training and services.

Sam Altman, OpenAI’s CEO, declared that “Advanced AI systems require reliable compute, and we’re excited to continue scaling with CoreWeave so we can train even more powerful models and offer great services to even more users” — words that landed as a pointed comment on Azure’s reliability and scalability.

This was significant because just weeks earlier at the World Economic Forum in Davos, Satya Nadella had highlighted Microsoft’s “ROFR” (right of first refusal) with OpenAI, stating that OpenAI would need to come to Microsoft first and could only look elsewhere if Microsoft could not deliver.

In September 2025, OpenAI—still technically under Microsoft’s ROFR—expanded its CoreWeave agreement by another $6.5 billion. Around the same period, OpenAI also committed to a massive, multi-year computing power deal with Oracle valued at $300 billion.

Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025 —most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.

One can reasonably infer that Microsoft struggled to meet OpenAI’s demanding requirements on time and at scale. That outcome should come as no surprise after reading this series.

Click for Part 3.