Meta Joins AMD, Broadcom, Intel, Microsoft, NVIDIA to Release MRC Protocol: Solving Network Bottlenecks in AI Training Clusters

Core Takeaway

On May 6, 2026, Meta released the Multipath Reliable Connection (MRC) open network protocol jointly with five other tech giants: AMD, Broadcom, Intel, Microsoft, and NVIDIA. It is a new network protocol designed specifically for large-scale AI training clusters, with the core goals of reducing GPU wait time, minimizing training interruptions caused by network failures, and improving overall training efficiency.

The announcement tweet received 4,485 likes, 488 retweets, and 1,250 bookmarks on the day of publication, with more than 580,000 views, an unusually high level of engagement for the AI infrastructure domain.

What Happened

The MRC protocol's core positioning: make large-scale AI training clusters run faster and more stably, and waste less GPU time.

Participant Lineup

| Company | Role | Position in AI Infrastructure |
| --- | --- | --- |
| Meta | Initiator | Ultra-large model training demand side (Llama series) |
| AMD | Co-publisher | GPU/CPU computing power supplier |
| Broadcom | Co-publisher | Custom AI network chip designer |
| Intel | Co-publisher | CPU/network processor supplier |
| Microsoft | Co-publisher | Cloud infrastructure operator (Azure) |
| NVIDIA | Co-publisher | GPU and network solution supplier (InfiniBand) |

The significance of this lineup is that it covers almost the entire AI training infrastructure chain: from compute chips to network hardware, and from cloud operators to the model-training teams themselves.

What Problem MRC Protocol Solves

The core network challenges faced by large-scale AI training clusters:

Traditional approach problems:
┌─────┐    ┌─────┐    ┌─────┐
│GPU 0│────│GPU 1│────│GPU 2│  ← Single path dependency, any link failure causes training interruption
└─────┘    └─────┘    └─────┘
    │          │          │
    └──────────┴──────────┘
         Single network path
MRC approach improvement:
┌─────┐    ┌─────┐    ┌─────┐
│GPU 0│════│GPU 1│════│GPU 2│  ← Multipath reliable connection, automatic failover
└─────┘    └─────┘    └─────┘
    │   ╲    │   ╲    │
    │    ╲   │    ╲   │
    │     ╲  │     ╲  │
    └══════╲═┴══════╲═┘
      Multipath redundancy + reliable transport
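
The announcement does not publish wire-level details, so the sketch below is only an illustration of the failover idea in the diagram. It is written in Python with hypothetical names (Path, MultipathConnection, and the simulated failure model are all assumptions, not MRC APIs): one logical connection owns several redundant paths, and a send that fails on one path is transparently retried on the next.

```python
import random

class PathDown(Exception):
    """Raised by the simulated link when a path drops a packet."""

class Path:
    """One physical route between two endpoints (simulated; hypothetical
    name, not an MRC API)."""
    def __init__(self, name: str, fail_rate: float = 0.0):
        self.name = name
        self.fail_rate = fail_rate

    def send(self, payload: bytes) -> None:
        if random.random() < self.fail_rate:
            raise PathDown(self.name)
        print(f"delivered {len(payload)} bytes via {self.name}")

class MultipathConnection:
    """The diagram's core idea: several redundant paths behind one logical
    connection, with automatic failover when a path fails."""
    def __init__(self, paths: list[Path]):
        self.paths = paths

    def send(self, payload: bytes) -> None:
        for path in self.paths:        # try paths in preference order
            try:
                path.send(payload)
                return                 # delivered; caller never sees the fault
            except PathDown:
                continue               # failover: try the next path
        raise ConnectionError("all paths down; training would stall here")

conn = MultipathConnection([
    Path("spine-A", fail_rate=0.5),    # flaky primary route
    Path("spine-B", fail_rate=0.0),    # healthy backup route
])
conn.send(b"gradient shard 0")         # succeeds even when spine-A drops it
```

The point of the sketch is the control flow: the caller (the training loop) never observes the path failure, which is exactly the "automatic failover" property the diagram claims.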

Technical Advantages

| Dimension | Traditional Approach | MRC Protocol |
| --- | --- | --- |
| Network path | Single path; failure means interruption | Multipath redundancy, automatic failover |
| Reliability | Depends on physical link stability | Reliable connection layer, software-level fault tolerance |
| GPU utilization | Network issues leave GPUs idle and waiting | Reduced GPU wait time |
| Openness | Vendor-proprietary protocols (e.g., InfiniBand) | Open protocol, cross-vendor compatibility |
| Ecosystem support | Lock-in to specific vendor solutions | Jointly backed by six tech giants as an open standard |
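
The "Reliability" row is the other half of the design: instead of assuming a lossless fabric, a reliable connection layer keeps retransmitting until delivery is acknowledged. Here is a stop-and-wait sketch of that idea in Python (illustrative only, not the MRC mechanism; a real protocol pipelines many packets with sequence numbers and selective acknowledgments rather than waiting on each one):

```python
import random

def flaky_link(packet: bytes) -> bool:
    """Simulated lossy hop: delivers and acknowledges ~70% of packets."""
    return random.random() < 0.7

def reliable_send(packet: bytes, link, max_tries: int = 8) -> int:
    """Retransmit one packet over an unreliable link until it is acked,
    returning the number of attempts used. Hypothetical helper, not MRC API."""
    for attempt in range(1, max_tries + 1):
        if link(packet):
            return attempt             # ack received: the packet is durable
    raise ConnectionError("link effectively down after retries")

tries = reliable_send(b"all-reduce chunk 42", flaky_link)
print(f"delivered after {tries} attempt(s)")
```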

Why It Matters

1. AI Training Bottleneck Shifting from Compute to Network

As model sizes grow from hundreds of billions to trillions of parameters, training clusters have expanded from hundreds of GPUs to tens of thousands. At that scale, communication overhead rises steeply, and with tens of thousands of links, NICs, and switches, the probability that some component is failing at any given moment approaches certainty.

A typical trillion-parameter model training task:

  • Requires thousands of GPUs working simultaneously
  • Parameter synchronization between GPUs consumes significant network bandwidth
  • A network failure on any single GPU's path can pause the entire training task

The MRC protocol directly addresses this pain point, reducing the impact of network failures on training through multipath redundancy and reliable connection layers.
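
To put rough numbers on the bandwidth claim above, here is a back-of-envelope calculation; every figure (parameter count, gradient precision, NIC speed) is an illustrative assumption, not a measured value:

```python
# Back-of-envelope gradient-sync arithmetic for a 1-trillion-parameter model.
# Every number here is an illustrative assumption, not a measured figure.
params         = 1_000_000_000_000    # 1T parameters
bytes_per_grad = 2                    # bf16/fp16 gradients
grad_tb        = params * bytes_per_grad / 1e12   # -> 2.0 TB per step

nic_gbps       = 400                  # assumed per-GPU NIC speed (Gb/s)
nic_tb_per_s   = nic_gbps / 8 / 1000  # -> 0.05 TB/s

# Naive lower bound: time for one GPU to push its full gradient once.
print(f"gradients per step: {grad_tb:.1f} TB")
print(f"naive transfer time at {nic_gbps} Gb/s: {grad_tb / nic_tb_per_s:.0f} s")
```

Real systems shard gradients across GPUs and overlap communication with compute, so actual per-step overheads are far smaller than this naive 40-second figure; the point is only that terabytes cross the network on every step, so any stall or retransmission storm translates directly into idle GPUs.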

2. Open Protocol vs. Proprietary Protocol Competition

Today's AI training cluster networks are dominated by NVIDIA's proprietary InfiniBand. The emergence of MRC as an open protocol means:

  • Reduced vendor lock-in risk: Cluster operators can mix network equipment from different vendors
  • Lower infrastructure costs: Competition from open protocols may reduce network equipment prices
  • Accelerated technical innovation: Multi-vendor participation drives protocol iteration

3. AMD Signals 80% Datacenter AI Business Growth

On the same day, AMD announced that its datacenter AI business is expected to grow 80%, driven primarily by GPU/CPU orders from cloud and infrastructure operators. AMD specifically noted that market forecasts are now catching up to actual deployment cycles, a signal of sustained demand ahead.

This echoes the release of the MRC protocol — the AI infrastructure market is at a turning point from planning to large-scale deployment.

Impact on the Industry

For Model Training Parties

  • Higher training stability: Reduced training interruptions and restarts caused by network issues
  • Lower GPU idle costs: Less GPU time spent waiting on the network, improved training efficiency
  • More flexible hardware choices: No longer locked into specific vendors’ network solutions

For Cloud Service Providers

  • Infrastructure differentiation: Cloud platforms that support the MRC protocol gain a training-efficiency advantage
  • Reduced operations complexity: Multipath redundancy reduces dependency on physical network stability

For Chip Vendors

  • New competitive dimension: Competition at the network protocol level will affect GPU/network chip market dynamics
  • Open ecosystem opportunities: Smaller vendors can enter the AI infrastructure market by supporting the MRC protocol

Landscape Assessment

The release of the MRC protocol is a watershed event in the AI infrastructure domain. It marks:

  1. Shifting bottleneck awareness in AI training — from “need more GPUs” to “need better networks”
  2. Open protocols challenging proprietary protocol monopolies — InfiniBand’s moat is being eroded
  3. Industry giants jointly setting standards — participation from all six publishers (Meta, NVIDIA, AMD, Broadcom, Intel, Microsoft) shows that AI infrastructure standardization is accelerating

For China’s AI industry, there are two reasons to watch the MRC protocol’s development: domestic large-model training faces the same cluster network bottlenecks, and the rise of open protocols may lower the barrier for domestic vendors to enter the AI training infrastructure market.