Abstract: The Ultra Accelerator Link (UALink) specification, UALink_200 Rev 1.0, presents a sophisticated framework for high-speed, low-latency communication between processing accelerators. This paper provides an in-depth technical examination of UALink, detailing its multi-layered protocol architecture, fundamental operational principles including addressing, coherency, and flow control, comprehensive security measures (UALinkSec), and its robust reliability, availability, and serviceability (RAS) features. Designed to support up to 1024 accelerators within an interconnected "Pod," UALink aims to standardize and advance the capabilities of large-scale accelerator-based computing systems, particularly in the demanding fields of Artificial Intelligence and High-Performance Computing.
Keywords: Accelerator Interconnect, UALink, UPLI, Transaction Layer, Data Link Layer, Physical Layer, SerDes, Coherency, Flow Control, UALinkSec, RAS, System Node, Accelerator Switch.
1. Introduction
The relentless growth in computational demands, especially from AI/ML and HPC applications, necessitates highly efficient and scalable interconnects for specialized processing accelerators. The UALink_200 Rev 1.0 specification addresses this by defining a comprehensive set of protocols and interfaces. UALink is engineered to enable direct, low-latency memory access (reads, writes) and atomic operations between accelerators, fostering a robust ecosystem for switched accelerator fabrics. This paper will explore the technical intricacies of the UALink protocol as defined in its v1.0 specification.
2. UALink System and Network Architecture
UALink envisions a system composed of multiple System Nodes (OS Domains), each potentially housing CPUs and one or more accelerators. UALink provides the backbone for accelerator-to-accelerator communication both within and, crucially, across these System Nodes.
Key Architectural Components:
- Accelerators (Acc): The primary processing units that utilize UALink for communication.
- UALink Switches (ULS): Network devices forming the UALink fabric, responsible for routing UPLI transactions based on 10-bit Source and Destination Accelerator Identifiers. Switches are stateless concerning individual transactions.
- UALink Links: Point-to-point serial interfaces. A UALink Station comprises 4 lanes, which can be bifurcated into one x4, two x2, or four x1 UALink Links (Ports). Each lane delivers an effective data rate of 200 GT/s; the 212.5 GT/s on-wire signaling rate accommodates coding and protocol overhead.
- Pod: The largest collection of accelerators and switches interconnected via UALink, with a capacity of up to 1024 accelerators.
- Virtual Pod: A logically isolated, non-overlapping partition of a Pod, enabling multiple tenants or workloads to share the physical infrastructure securely.
3. Layered Protocol Stack
UALink's functionality is organized into a four-layer protocol stack:
3.1. Protocol Layer (UPLI - UALink Protocol Level Interface) The UPLI is the uppermost layer, defining the logical signaling and protocol for exchanging data and control information.
- Transaction Model: Based on Request-Response pairs. An Originator Device issues a Request (Command), and a Completer Device responds.
- UPLI Channels: Independent channels carry the three transaction streams between Originator and Completer: Requests, Read Responses, and Write Responses.
- Credit Management: Each channel employs a credit-based flow control mechanism.
- Time Division Multiplexing (TDM): UPLI channels are TDM'd, with the PortID signal in each channel indicating the TDM cycle for a specific bifurcated port.
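The credit-based flow control named above can be illustrated with a minimal sketch. This is a hypothetical model, not the UPLI signal-level mechanism: the receiver advertises a fixed credit pool, the sender consumes one credit per transfer and stalls at zero, and credits return as the receiver drains its buffer.

```python
from collections import deque

class CreditedChannel:
    """Minimal sketch of per-channel credit-based flow control (illustrative)."""

    def __init__(self, credits: int):
        self.credits = credits      # credits currently held by the sender
        self.rx_buffer = deque()    # receiver-side buffer

    def send(self, payload) -> bool:
        """Transmit only if a credit is available; otherwise back-pressure."""
        if self.credits == 0:
            return False            # sender must stall: no credit left
        self.credits -= 1
        self.rx_buffer.append(payload)
        return True

    def drain_one(self):
        """Receiver consumes one entry and returns a credit to the sender."""
        item = self.rx_buffer.popleft()
        self.credits += 1
        return item

ch = CreditedChannel(credits=2)
assert ch.send("req0") and ch.send("req1")
assert not ch.send("req2")   # out of credits: back-pressured
ch.drain_one()               # credit returned to the sender
assert ch.send("req2")
```

The key property this models is losslessness: a sender can never overrun the receiver's buffer, because it only transmits against credits the receiver has explicitly granted.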
3.2. Transaction Layer (TL) The TL interfaces the UPLI with the Data Link Layer.
- Flit Packaging: Packages UPLI channel information into 64-byte Transmit (Tx) Flits and unpacks received 64-byte Receive (Rx) Flits. A TL Flit consists of two 32-byte Half-Flits.
- Half-Flit Types: Control Half-Flits (carrying request/response fields), Data Half-Flits (carrying payload), and, when UALinkSec integrity is enabled, Tag Half-Flits.
- Address Caching (Optional): Supports Tx and Rx address caches to enable compressed requests by omitting parts of the address if previously cached. Synchronization between Tx and Rx caches is maintained.
- Flow Control: Implements TL-to-TL flow control using credits for Request/Response fields and Data Half-Flits.
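The 64-byte TL Flit structure can be sketched as follows. The field layout is hypothetical, chosen only to illustrate the two-half-flit organization; the real bit-level format is defined by the specification.

```python
HALF_FLIT = 32   # bytes per Half-Flit
TL_FLIT = 64     # bytes per TL Flit: two Half-Flits

def pack_tl_flit(control: bytes, data: bytes) -> bytes:
    """Pack one Control Half-Flit and one Data Half-Flit into a 64-byte TL Flit.
    Illustrative layout: control first, each half zero-padded to 32 bytes."""
    assert len(control) <= HALF_FLIT and len(data) <= HALF_FLIT
    flit = control.ljust(HALF_FLIT, b"\x00") + data.ljust(HALF_FLIT, b"\x00")
    assert len(flit) == TL_FLIT
    return flit

def unpack_tl_flit(flit: bytes) -> tuple:
    """Split a received 64-byte TL Flit back into its two Half-Flits."""
    assert len(flit) == TL_FLIT
    return flit[:HALF_FLIT], flit[HALF_FLIT:]

flit = pack_tl_flit(b"\x01\x02", b"\xaa" * 32)
ctrl, data = unpack_tl_flit(flit)
assert ctrl[:2] == b"\x01\x02" and data == b"\xaa" * 32
```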
3.3. Data Link Layer (DL) The DL manages the reliable transmission of TL Flits over the Physical Layer.
- Flit Aggregation: Packs multiple 64-byte TL Flits (and DL messages) into larger 640-byte DL Flits.
- DL Message Service: Supports in-band DL-to-DL messages for functions like TL rate notification, device/port ID query, channel negotiation (online/offline for TL Flits and UART), and UART for firmware communication.
- Link Level Replay (LLR): Ensures guaranteed, in-order delivery of DL Flits. Uses sequence numbers and Ack/Replay Request mechanisms. Transmitters buffer payload Flits until acknowledged.
- CRC: Calculates and appends a 32-bit CRC to each 640-byte DL Flit for error detection.
- Tx Pacing/Rx Rate Adaptation: Allows link partners to limit the transmission rate of TL Flits, accommodating differing UPLI clock frequencies.
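DL Flit assembly and CRC checking can be sketched in a few lines. Assumptions to note: ten 64-byte TL Flits are packed per DL Flit, the CRC is appended after the 640-byte payload, and `zlib.crc32` stands in for the specification's CRC-32 polynomial, which this sketch does not model.

```python
import zlib

TL_FLIT = 64
DL_PAYLOAD = 640  # ten TL Flits per DL Flit in this sketch

def build_dl_flit(tl_flits: list) -> bytes:
    """Pack TL Flits into a 640-byte DL payload and append a 32-bit CRC.
    zlib.crc32 is a stand-in for the spec's CRC polynomial."""
    payload = b"".join(tl_flits)
    assert len(payload) == DL_PAYLOAD
    crc = zlib.crc32(payload).to_bytes(4, "big")
    return payload + crc

def check_dl_flit(dl_flit: bytes) -> bool:
    """Receiver-side check: recompute the CRC over the payload and compare."""
    payload, crc = dl_flit[:-4], dl_flit[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc

flits = [bytes([i]) * TL_FLIT for i in range(10)]
dl = build_dl_flit(flits)
assert check_dl_flit(dl)
# A single corrupted bit is detected by the CRC:
assert not check_dl_flit(bytes([dl[0] ^ 1]) + dl[1:])
```

In the real link, a CRC failure does not drop data: LLR retains the Flit in the transmit replay buffer until it is acknowledged, so the receiver can request a replay instead.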
3.4. Physical Layer (PL) The PL handles the electrical transmission of bits, based on IEEE 802.3dj.
- Signaling Rates: Supports 212.5 GT/s (for 200G per lane modes like 200GBASE-KR1/CR1, 400GBASE-KR2/CR2, 800GBASE-KR4/CR4) and 106.25 GT/s (for 100G per lane modes).
- Reconciliation Sublayer (RS): Adapts DL Flits into a stream of 64B/66B blocks. Manages link fault signaling and DL Flit-to-codeword alignment.
- PCS/PMA Modifications: The PCS and PMA sublayers are derived from IEEE 802.3dj, with UALink-specific adaptations.
- Auto-Negotiation/Link Training: Unchanged from IEEE 802.3.
4. Core UALink Mechanisms and Concepts
4.1. Addressing and Remote Memory Access (RMA) UALink facilitates memory access across System Node boundaries. Accelerators can use System Physical Addresses (SPA) for local accesses and Network Physical Addresses (NPA) for remote accesses. A typical translation flow involves: Source Accelerator MMU (GVA -> NPA) -> UALink Fabric -> Destination Link MMU (NPA -> SPA). This enables RMA, crucial for distributed applications.
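The two-stage translation flow can be sketched with dictionary-based page tables. Everything concrete here is a hypothetical example (4 KiB pages, the table contents, the `remote_access` helper); only the GVA -> NPA -> SPA flow itself comes from the text above.

```python
PAGE = 1 << 12  # hypothetical 4 KiB pages for illustration

# Hypothetical page tables; real MMU structures are implementation-specific.
src_mmu = {0x1000: (7, 0x8000)}         # GVA page -> (dest accelerator ID, NPA page)
dst_link_mmu = {7: {0x8000: 0x4_0000}}  # dest ID -> NPA page -> SPA page

def remote_access(gva: int) -> tuple:
    """GVA -> NPA at the source accelerator's MMU, then NPA -> SPA at the
    destination Link MMU, preserving the page offset end to end."""
    page, off = gva & ~(PAGE - 1), gva & (PAGE - 1)
    dest_id, npa_page = src_mmu[page]           # source accelerator MMU
    spa_page = dst_link_mmu[dest_id][npa_page]  # destination Link MMU
    return dest_id, spa_page | off

assert remote_access(0x1234) == (7, 0x4_0234)
```

The fabric itself only sees the destination accelerator ID and the NPA; the final NPA -> SPA step happens at the destination, which keeps switches stateless with respect to memory layout.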
4.2. Coherency UALink adopts a software-managed I/O coherency model for accelerator-to-accelerator interactions, avoiding hardware snoops across UALink.
- Reads from peer memory obtain the most recent coherent data.
- Writes to peer memory invalidate peer cache copies.
- Partial writes fetch, merge, and write back.

Coherency within a System Node (Host-Accelerator) is implementation-specific.
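The fetch-merge-write-back behavior for partial writes can be sketched as a read-modify-write over a cache line. The 64-byte line size and byte-enable representation here are illustrative choices, not taken from the specification.

```python
LINE = 64  # illustrative line size

def partial_write(memory: bytearray, addr: int, data: bytes, byte_enables: list) -> None:
    """Sketch of a partial write: fetch the full line, merge only the
    enabled bytes, then write the whole line back."""
    base = addr - (addr % LINE)
    line = bytearray(memory[base:base + LINE])      # fetch
    for i, (b, en) in enumerate(zip(data, byte_enables)):
        if en:
            line[(addr - base) + i] = b             # merge enabled bytes only
    memory[base:base + LINE] = line                 # write back

mem = bytearray(b"\xff" * 128)
partial_write(mem, 4, b"\x01\x02\x03", [True, False, True])
assert mem[4:7] == b"\x01\xff\x03"  # byte 5 untouched: enable was False
```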
4.3. Transaction Routing and Ordering
- Routing: UPLI Requests and Responses are routed using ReqSrcPhysAccID, ReqDstPhysAccID, and ReqPortID (and their response channel counterparts). Switches use ReqDstPhysAccID for routing table lookups.
- Ordering: Transactions within each of the three streams (Request, Read Response, Write Response) between a given source-destination pair are delivered in order; switches must additionally support a Strict Ordering mode.
4.4. Single-Copy Atomicity A UPLI operation is single-copy atomic if performed entirely without visible fragmentation. Operations not declared single-copy atomic may be decomposed by the destination accelerator into smaller single-copy atomic units. The UPLI and TL deliver the operation without decomposition; the atomicity behavior at the final completer is an accelerator implementation detail.
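Decomposition into single-copy atomic units typically means splitting an access into naturally aligned power-of-two pieces. The sketch below illustrates that idea; the 8-byte upper bound and the greedy algorithm are hypothetical, since the actual granularity is an accelerator implementation detail.

```python
def decompose(addr: int, size: int, max_atomic: int = 8) -> list:
    """Split an operation into naturally aligned power-of-two pieces, each of
    which could be performed as one single-copy atomic access.
    max_atomic is a hypothetical per-accelerator upper bound."""
    pieces = []
    while size:
        # Largest power-of-two chunk that is both aligned at addr and fits.
        chunk = max_atomic
        while chunk > 1 and (addr % chunk or chunk > size):
            chunk //= 2
        pieces.append((addr, chunk))
        addr += chunk
        size -= chunk
    return pieces

# An aligned 8-byte access stays whole; a misaligned 6-byte access splits.
assert decompose(0x1000, 8) == [(0x1000, 8)]
assert decompose(0x1003, 6) == [(0x1003, 1), (0x1004, 4), (0x1008, 1)]
```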
4.5. UPLI Reset and Connection Handshake UPLIReset_N initializes the UPLI logic. A Connection Handshake Protocol (OrigClkReq, OrigClkAck, CompClkReq, CompClkAck signals) ensures an ordered bring-up of the interface, preventing race conditions.
5. Security (UALinkSec)
UALinkSec provides robust protection for UALink traffic.
- Objectives: Confidentiality (encryption) and optional integrity (authentication, including replay protection) for traffic within a Virtual Pod.
- Adversary Model: Protects against physically present adversaries and, in Confidential Computing (CC) contexts, against untrusted infrastructure providers or other tenants.
- Encryption Scheme: AES-GCM with 256-bit keys. Applied on a per-UALink transaction basis.
- Tagging: An optional 8-byte authentication tag can be generated. The TL includes a "Tag Half-Flit" with the Control Half-Flit when integrity is enabled.
- Initialization Vector (IV): A 96-bit IV (partially fixed, partially a 32-bit per-transaction invocation counter) is used with a 32-bit block counter (reset per transaction).
- Field Protection: Specific UPLI/TL header fields are encrypted and/or authenticated. Fields necessary for routing, compression, and flow control remain plaintext but are authenticated if integrity is enabled. Data payloads and byte enables are fully protected.
- Key Management:
- Ordering for Security: Strict ordering of transactions within each of the three independent streams (Request, Read Response, Write Response) between a source-destination pair is mandatory when security is active to maintain IV synchronization.
- Poisoned Data: Securely handles data beats marked as "poisoned" (corrupted), ensuring they are skipped during crypto operations but their status is included in authentication.
- Switch Requirements: Switches must handle tag half-flits and maintain stream ordering.
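The IV construction described above can be sketched with plain byte operations. The split into a 64-bit fixed field plus a 32-bit big-endian invocation counter is illustrative; the exact field widths and ordering are defined by the specification, and this sketch does not perform AES-GCM itself.

```python
def build_iv(fixed: bytes, invocation: int) -> bytes:
    """Sketch of the 96-bit AES-GCM IV: a fixed field concatenated with the
    32-bit per-transaction invocation counter. Field layout is illustrative."""
    assert len(fixed) == 8  # 64 fixed bits in this sketch
    return fixed + (invocation & 0xFFFFFFFF).to_bytes(4, "big")

iv0 = build_iv(b"\x00" * 8, 0)
iv1 = build_iv(b"\x00" * 8, 1)
assert len(iv0) == 12       # 96 bits
assert iv0 != iv1           # incrementing the counter yields a fresh IV
```

This is why the strict per-stream ordering rule matters: both ends must observe transactions in the same order so their invocation counters, and hence their IVs, stay synchronized without ever being transmitted in full.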
6. Reliability, Availability, and Serviceability (RAS)
UALink defines mechanisms for robust fault management.
- Error Types: UPLI Control/Data/Protocol Errors, Switch Core Control/Data Errors, Link Down Errors.
- End-to-End Data Protection: UPLI uses parity; DL/PL use FEC and CRC.
- Error Handling Mechanisms: Errors are detected, contained, and reported; containment relies on Drop Mode at the TLs and Isolation Mode at the UPLI Originators, exercised during Link Down processing.
- Link Down Error Processing: A defined sequence involving PHY detection, TLs entering Drop Mode, the local UPLI Originator entering Isolation Mode, and remote UPLI Originators timing out and also entering Isolation Mode. This facilitates controlled application shutdown and system recovery.
7. UALink Switch Requirements
Switches are central to the UALink fabric.
- Functionality: Relay UPLI Requests and Responses between accelerators. Switch-to-switch links are not supported. Routing is unicast and stateless at the switch level.
- Protocol Stack: Must implement the UALink DL and TL. The interface to the switch core is nominally UPLI.
- Bifurcation: Stations must support 1x4, 2x2, or 4x1 lane port configurations.
- Lossless Delivery: Guaranteed via LLR and flow control, except in defined error scenarios.
- Non-Blocking Architecture: Traffic between port pairs should be independent, minimizing head-of-line blocking. Responses must not be blocked by stalled requests.
- Forward Progress: Starvation-free arbitration is required.
- Ordering & VCs: Must adhere to UPLI ordering rules. Strict Ordering mode support is mandatory.
- Routing Tables: Each station has an independently programmable routing table, indexed by the 10-bit Destination Accelerator ID. Entries specify egress station/port and a deny/allow flag for partitioning (Virtual Switches/Pods).
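A station routing table with a deny/allow flag can be sketched as a 1024-entry array indexed by the 10-bit Destination Accelerator ID. The entry layout `(egress_station, egress_port, allowed)` is a hypothetical encoding chosen for illustration.

```python
class StationRoutingTable:
    """Sketch of a per-station routing table indexed by the 10-bit
    Destination Accelerator ID. A denied or unprogrammed entry models
    Virtual Pod partitioning; entry layout is illustrative."""

    SIZE = 1 << 10  # 1024 possible accelerator IDs in a Pod

    def __init__(self):
        self.entries = [None] * self.SIZE  # None = unprogrammed

    def program(self, dest_id: int, station: int, port: int, allowed: bool = True):
        self.entries[dest_id & 0x3FF] = (station, port, allowed)

    def route(self, dest_id: int):
        """Return (egress station, egress port), or None if denied/unprogrammed."""
        entry = self.entries[dest_id & 0x3FF]
        if entry is None or not entry[2]:
            return None
        return entry[:2]

rt = StationRoutingTable()
rt.program(42, station=1, port=3)
rt.program(43, station=1, port=3, allowed=False)  # outside this Virtual Pod
assert rt.route(42) == (1, 3)
assert rt.route(43) is None
```

Because each station's table is programmed independently by the Pod Controller, the deny flag gives a simple enforcement point for tenant isolation without any per-transaction state in the switch.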
8. Manageability
A UALink Pod is managed by a central UALink Pod Controller.
- Pod Controller: Manages Pod resources, configuration, topology validation, Accelerator ID allocation, switch routing, RAS (error recovery, logging), health monitoring, and telemetry.
- Node Management Agents: Reside on System Nodes, interfacing with the Pod Controller to manage local accelerators (e.g., Virtual Pod assignment, port configuration).
- Switch Management Agents: Manage switches, typically on Switch Platforms, and communicate with the Pod Controller.
- Virtual Pods: The Pod Controller can partition the physical Pod into Virtual Pods for tenant isolation and resource allocation.
9. Performance and Latency Goals (Recommendations)
UALink Switches are recommended to target specific pin-to-pin latencies for a 64-byte Write Request (4-lane, FEC enabled, unloaded):
- 128-lane switch: <200ns
- 256-lane switch: <250ns
- 512-lane switch: <300ns

Switches should aim to maintain a post-FEC line rate of 200 Gbps per port.
10. Conclusion
The UALink_200 Rev 1.0 specification provides a robust and comprehensive standard for high-performance accelerator interconnects. Its layered architecture, detailed protocol mechanisms for data transfer, coherency, flow control, and advanced features for security and RAS, make it a pivotal technology for building scalable and efficient systems for AI, HPC, and other data-intensive workloads. UALink's emphasis on standardization combined with vendor flexibility is poised to foster a strong ecosystem around accelerator-based computing.
11. References
[1] Ultra Accelerator Link Consortium, Inc. (2025). UALink_200 Rev 1.0 Specification.