Google’s Virgo Network Signals a Fundamental Shift in How AI Data Centers Are Designed

May 12, 2026
Akash Sharma

Google’s Virgo Network, introduced at Google Cloud Next 2026, is a megascale AI data center fabric that Google describes as built on a campus-as-a-computer philosophy. It connects up to 134,000 TPU chips within a single data center with 47 petabits per second of bisection bandwidth. The design delivers four times the bandwidth per accelerator of Google’s previous-generation networking fabric and 40% lower unloaded latency, while enabling more than one million TPUs across multiple data center sites to operate as a single training cluster. The numbers are striking, but the more significant aspect of Virgo is not its performance specifications. It is what the design choices embedded in Virgo reveal about what AI at scale actually requires from data center networking, and how far those requirements diverge from the general-purpose networking architectures that hyperscale data centers have used for the past decade.
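To put the headline figures in proportion, the short Python sketch below divides the quoted bisection bandwidth evenly across the quoted chip count. This is a back-of-the-envelope scale indicator only. The even split is an assumption, and the result is not a per-chip port speed published by Google.

```python
# Figures quoted in the article.
BISECTION_BANDWIDTH_BPS = 47e15  # 47 petabits per second
TPU_CHIPS = 134_000              # chips in a single data center


def bisection_share_per_chip_gbps(bisection_bps: float, chips: int) -> float:
    """Bisection bandwidth split evenly across all chips, in Gb/s.

    Assumption: an even split. Real per-chip port speeds are not
    stated in the article and may differ.
    """
    return bisection_bps / chips / 1e9


share = bisection_share_per_chip_gbps(BISECTION_BANDWIDTH_BPS, TPU_CHIPS)
print(f"{share:.1f} Gb/s of bisection bandwidth per chip")
```

On these numbers the even split works out to roughly 350 Gb/s per chip.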

The core architectural departure in Virgo is the shift from a three-layer network topology to a two-layer non-blocking fabric. Conventional data center networks use a three-tier hierarchy of access, aggregation, and core switching layers, a design suited to general-purpose enterprise workloads and cloud services where many small flows compete for shared bandwidth. AI training clusters generate a fundamentally different traffic pattern: massive, tightly synchronised collective communications where tens of thousands of accelerators exchange gradient updates simultaneously on a precise schedule. The three-tier hierarchy creates queuing delays at the aggregation and core layers that appear as tail latency in AI training, slowing the entire cluster when even a subset of transfers is delayed. Virgo addresses this by eliminating the aggregation layer entirely, using high-radix switches with more ports per switch to flatten the topology and reduce the hop count between any two nodes in the fabric.
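The link between switch radix and topology depth can be illustrated with textbook folded-Clos arithmetic. The sketch below assumes a non-blocking design in which every switch has the same port count and each endpoint uses a single port. The article does not state Virgo’s actual radix or port layout, so the radix values are purely illustrative.

```python
# Worst-case number of switches a packet crosses between two endpoints.
TWO_TIER_SWITCH_HOPS = 3    # leaf -> spine -> leaf
THREE_TIER_SWITCH_HOPS = 5  # access -> aggregation -> core -> aggregation -> access


def two_tier_max_hosts(radix: int) -> int:
    """Non-blocking leaf-spine: each leaf uses half its ports for hosts,
    and each spine port connects to a different leaf, so at most `radix` leaves."""
    return radix * (radix // 2)


def three_tier_max_hosts(radix: int) -> int:
    """Classic three-tier fat-tree built from identical switches."""
    return radix ** 3 // 4


for radix in (64, 128, 512):  # illustrative radix values, not Virgo's
    print(
        f"radix {radix:>3}: "
        f"two-tier {two_tier_max_hosts(radix):>10,} hosts "
        f"({TWO_TIER_SWITCH_HOPS} switch hops), "
        f"three-tier {three_tier_max_hosts(radix):>12,} hosts "
        f"({THREE_TIER_SWITCH_HOPS} switch hops)"
    )
```

Under these assumptions a two-tier fabric only reaches six-figure endpoint counts once the radix is in the hundreds, which is why flattening the topology depends on high-radix switches.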

The Three Traffic Domains That Define Virgo’s Architecture

Virgo’s flat two-layer fabric handles a specific category of AI traffic: the tightly coupled accelerator-to-accelerator communication of large training clusters. However, Google’s data centers serve multiple traffic types simultaneously, and Virgo’s architecture explicitly segments these into three independent network domains. The first domain is the tightly coupled AI training fabric that Virgo’s flat topology addresses. The second is the broader east-west traffic that moves data across larger clusters and between training and serving infrastructure. The third is the north-south traffic that connects accelerators to storage and external services through Google’s existing Jupiter fabric.
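The three-domain split can be pictured as a simple classification of flows by their endpoints. The Python sketch below is a hypothetical illustration of that idea. The `Endpoint` fields, the labels, and the rules in `classify` are assumptions made for this example, not a description of how Google actually steers traffic.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    TRAINING_FABRIC = "tightly coupled accelerator-to-accelerator traffic"
    EAST_WEST = "cross-cluster and training-to-serving traffic"
    NORTH_SOUTH = "storage and external services via Jupiter"


@dataclass(frozen=True)
class Endpoint:
    kind: str     # hypothetical labels: "accelerator", "storage", "external"
    cluster: str  # hypothetical cluster identifier


def classify(src: Endpoint, dst: Endpoint) -> Domain:
    """Assign a flow to one of the three domains described in the article."""
    # Any flow touching storage or an external service is north-south.
    if {src.kind, dst.kind} & {"storage", "external"}:
        return Domain.NORTH_SOUTH
    # Accelerators in the same cluster talk over the training fabric.
    if src.cluster == dst.cluster:
        return Domain.TRAINING_FABRIC
    # Everything else crosses clusters.
    return Domain.EAST_WEST


a = Endpoint("accelerator", "train-1")
b = Endpoint("accelerator", "train-1")
c = Endpoint("accelerator", "serve-1")
s = Endpoint("storage", "store-1")
print(classify(a, b).name, classify(a, c).name, classify(a, s).name)
```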

This segmentation reflects a design philosophy in which the data center network is no longer a single uniform system but a set of coordinated layers, each tuned to the specific traffic pattern it serves. As Ron Westfall, vice president and analyst at HyperFrame Research, noted in an analysis of Virgo, Google has reimagined the data center as a campus-as-a-computer, treating tail latency as a hardware reliability issue rather than a network tuning problem and isolating AI training traffic to keep large clusters synchronised. The practical implication is that operators designing AI data centers cannot apply conventional network architecture thinking to the AI training layer. The traffic characteristics are different enough that a purpose-built fabric, rather than a tuned general-purpose network, is the appropriate design response.
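The emphasis on tail latency follows from how synchronised collectives behave: a training step finishes only when its slowest transfer finishes. The sketch below is a minimal probability model of that effect. It assumes each transfer is delayed independently with the same small probability, and the probability values are illustrative rather than measured.

```python
def prob_step_delayed(p_flow_delayed: float, flows: int) -> float:
    """Probability that at least one of `flows` independent transfers is delayed.

    Because a synchronised step waits for its slowest transfer, this is
    also the probability that the whole step is delayed.
    """
    return 1.0 - (1.0 - p_flow_delayed) ** flows


P_FLOW_DELAYED = 1e-5  # illustrative assumption: 1 in 100,000 transfers is slow

for flows in (1_000, 10_000, 100_000):
    p = prob_step_delayed(P_FLOW_DELAYED, flows)
    print(f"{flows:>7,} concurrent transfers: {p:.1%} of steps hit a delay")
```

Under this model a delay rate that is negligible for any single transfer still affects a large share of steps once the cluster reaches the sizes discussed above, which is why rare delays matter more as clusters grow.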

The Scaling Tax That Virgo Eliminates

Google describes Virgo’s architecture as eliminating the scaling tax, the performance degradation that conventional networks experience as cluster sizes increase. In traditional three-layer networks, adding more accelerators to a cluster does not produce proportionally more training throughput because the aggregation and core switching layers become bottlenecks as the number of concurrent flows increases. The result is that 100,000 accelerators do not produce 10 times the training throughput of 10,000 accelerators, because network congestion at the aggregation layers grows faster than the compute capacity being added. Virgo’s non-blocking two-layer topology is designed to give every accelerator full-bandwidth connectivity to every other accelerator with no shared aggregation bottleneck, enabling near-linear scaling of training throughput with cluster size.
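The scaling tax can be made concrete with a toy model. The sketch below assumes that per-accelerator efficiency falls by a fixed fraction every time the cluster grows tenfold. Both the functional form and the 0.8 retention value are illustrative assumptions, not figures from Google.

```python
import math


def relative_throughput(accelerators: int, reference: int,
                        retained_per_10x: float) -> float:
    """Throughput relative to a `reference`-sized cluster.

    Toy model: each 10x growth in cluster size keeps only
    `retained_per_10x` of per-accelerator efficiency (1.0 = no tax).
    """
    growth = accelerators / reference
    efficiency = retained_per_10x ** math.log10(growth)
    return growth * efficiency


taxed = relative_throughput(100_000, 10_000, retained_per_10x=0.8)
linear = relative_throughput(100_000, 10_000, retained_per_10x=1.0)
print(f"with scaling tax: {taxed:.1f}x, with linear scaling: {linear:.1f}x")
```

With 20% of efficiency lost per tenfold growth, ten times the accelerators deliver about eight times the throughput. A fabric that scales linearly would deliver the full ten.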

The elimination of the scaling tax has direct implications for how data center operators and hyperscalers should think about network infrastructure investment relative to compute infrastructure investment. A cluster whose training throughput scales linearly with compute additions justifies higher per-accelerator spending on compute than a cluster where network bottlenecks limit the effective utilisation of added capacity. Conversely, a network investment that eliminates the scaling tax and enables linear compute scaling can generate returns that exceed the cost of the additional accelerators otherwise required to achieve the same training throughput. The economic logic of Virgo is therefore not just about better network performance. It is about changing the relationship between network investment and compute utilisation in ways that alter the optimal capital allocation for AI infrastructure at scale.
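The capital allocation argument can be sketched by asking how many accelerators a taxed network needs to match a throughput target. The example uses a toy model in which each tenfold growth in cluster size retains only a fixed fraction of per-accelerator efficiency. The model and the 0.8 value are illustrative assumptions, and no prices are assumed.

```python
import math


def growth_needed(target_multiple: float, retained_per_10x: float) -> float:
    """Cluster growth factor required to reach `target_multiple` throughput.

    Toy model: throughput = g * retained_per_10x ** log10(g)
                          = g ** (1 + log10(retained_per_10x)),
    so g = target_multiple ** (1 / (1 + log10(retained_per_10x))).
    """
    exponent = 1.0 + math.log10(retained_per_10x)
    return target_multiple ** (1.0 / exponent)


REFERENCE_CLUSTER = 10_000  # accelerators in the baseline cluster (illustrative)
TARGET = 10.0               # desired throughput relative to the baseline

taxed_growth = growth_needed(TARGET, retained_per_10x=0.8)
linear_growth = growth_needed(TARGET, retained_per_10x=1.0)
extra = (taxed_growth - linear_growth) * REFERENCE_CLUSTER
print(f"taxed network: {taxed_growth:.1f}x accelerators, "
      f"linear network: {linear_growth:.1f}x, "
      f"difference: about {extra:,.0f} accelerators")
```

In this model the cost of the extra accelerators is the ceiling on what a tax-free network is worth paying for at that throughput target.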

As covered in our analysis of the untested assumptions behind the AI infrastructure spending model, the assumptions embedded in AI infrastructure investment models have not been stress-tested at the scale that Virgo-class infrastructure enables.

What Virgo Means for the Broader Networking Market

Virgo’s architectural choices are not simply Google-specific engineering decisions. They reflect constraints of AI training workloads that apply to every operator building large-scale AI infrastructure, and the solutions Google has implemented in Virgo are influencing how the broader networking industry thinks about AI fabric design. The shift from three-layer to two-layer topology, the use of high-radix switches to reduce hop count, and the segmentation of AI training traffic from other data center traffic types are all design patterns that competing hyperscalers, independent data center operators, and enterprise AI infrastructure teams are evaluating in their own network architecture decisions.

The convergence between Virgo and MRC, OpenAI’s multipath networking protocol released through the Open Compute Project on May 5, illustrates how the networking layer of AI infrastructure is undergoing coordinated evolution across the industry rather than incremental improvement within existing frameworks. Both Virgo and MRC reflect the same underlying insight: general-purpose data center networking cannot efficiently support the synchronised, high-bandwidth, low-latency demands of large-scale AI training because it was originally designed for diverse workloads with unpredictable traffic patterns. The two start from different points, Virgo through topology redesign and MRC through transport protocol redesign, but they are complementary approaches to the same problem. Data center operators who are planning AI networking infrastructure for the next three to five years need to engage with both simultaneously rather than treating network topology and transport protocol as independent design decisions.