By running production workloads simultaneously on x86 and Arm, Google is signaling a new era of hardware neutrality and accelerating Arm's growing role in hyperscale cloud environments.

While x86 has been dominant for decades, a new migration project at Google represents a significant shift toward mixed architectures. The tech giant has released the technical details of its ongoing transition to production clusters that can run both x86 and Axion Arm-based machines at the same time.

Flagship services including YouTube, Gmail, and BigQuery now run on both instruction set architectures (ISAs), and Google has migrated more than 30,000 applications, both enormous and tiny, to Arm: roughly one-third of its 100,000-plus applications.

This Google multi-architecture ("multiarch") deployment not only signals a new era of hardware neutrality, but also underscores Arm's growing influence in hyperscale cloud environments.

"Google has created a dev pipeline to adopt multi-architecture as a guiding principle, and other cloud service providers (CSPs) are also moving in the same direction," said Manish Jain, a principal research director at Info-Tech Research Group.

How Google migrated to Arm

During the migration from x86-only to Arm and x86, Google researchers ran production services on Axion Arm-based CPUs. They analyzed 38,156 commits made to Google3, the vast unified repository (monorepo) that stores code for projects across the company, to track the types of changes the migration required.

The researchers began by porting jobs running on Google's F1, Spanner, and Bigtable databases via "typical software practices," and were surprised at how well modern compiler and sanitizer tools handled architectural differences such as drift, performance, and platform-specific operators.
That let them focus on fixing tests that broke due to overfitting (code tailored to x86 that performed poorly on other hardware); updating build and release systems for the oldest and highest-traffic services; resolving production configurations; and preventing destabilization.

Initially, the researchers manually ported a dozen applications to Arm and got them up and running on Google's cluster management system, Borg. But they recognized the need to move beyond "just a few jobs" to Google's remaining 100,000-plus apps. Automation helped them tackle this enormous task.

A large-scale change tool, for instance, sharded master changes into smaller pieces, allowing the team to shepherd large groups of commits through review more quickly. They used sanitizers and fuzzers to catch common execution differences between x86 and Arm and avoid difficult-to-debug behavior later on. They also employed continuous health monitoring to pull out jobs that presented issues such as repeated crashing or slow throughput on Arm. Those jobs were later fine-tuned and debugged offline.

The researchers then built an AI-based migration tool, CogniPort, to automate the remainder of the migration. It was designed to automatically fix problems such as an Arm library, binary, or test that did not build or that failed with an error.

CogniPort is a three-agent system comprising an orchestrator, a build-fixer, and a test-fixer. The orchestrator agent repeatedly calls the two below it, which handle different reasoning steps and invoke and execute tools. The build-fixer agent, for instance, was tasked with building a particular target and making modifications until it succeeded (or until the agent gave up); the test-fixer's job was to run tests and make modifications until achieving success (or, again, until the agent gave up). The two could also coordinate autonomously to execute their tasks.
Google's researchers also generated a benchmark set of 245 commits, rolled them back, and evaluated whether the agent loop was able to fix them. Early tests were "very encouraging," with the agents successfully fixing failed tests 30% of the time. "We're confident that as we invest in further optimizations of this approach, we will be even more successful," the researchers note.

Multiarch by default indicative of a larger trend

All new Google applications are now designed to be multiarch by default. This is for a variety of reasons, the researchers note: code is visible in a vast unified repository, most required structural changes have been completed, and automation allows continued expansion and rollouts without much human intervention. "We're increasingly confident in our goal of driving Google's monorepo towards architecture neutrality for production services," the researchers write.

Matt Kimball, VP and principal analyst with Moor Insights & Strategy, pointed out that AWS and Microsoft have already moved many workloads from x86 to internally designed Arm-based servers. He noted that when Arm first hit the hyperscale datacenter market, the architecture was used to support more lightweight, cloud-native workloads with an interpretive layer, where architectural affinity was "non-existent." But now there is much more focus on architecture, and compatibility issues "largely go away" as Arm servers support more and more workloads.

"In parallel, we've seen CSPs expand their designs to support both scale out (cloud-native) and traditional scale up workloads effectively," said Kimball. Simply put, CSPs are looking to monetize chip investments, and this migration signals that Google has found its performance-per-dollar (and likely performance-per-watt) better on Axion than on x86.
Google will likely continue to expand its Arm footprint as it evolves its Axion chip; as a reference point, Kimball pointed to AWS Graviton, which didn't really support "scale up" performance until its v3 or v4 chip.

Arm is coming to enterprise data centers too

When looking at architectures, enterprise CIOs should ask themselves which instances they use for cloud workloads and which servers they deploy in their data centers, Kimball noted. "I think there is a lot less concern about putting my workloads on an Arm-based instance on Google Cloud, a little more hesitance to deploy those Arm servers in my datacenter," he said. But ultimately, he said, "Arm is coming to the enterprise datacenter as a compute platform, and Nvidia will help usher this in."

Info-Tech's Jain agreed that Nvidia is the "biggest cheerleader" for Arm-based architecture, and that Arm is increasingly moving from niche and mobile use to general-purpose and AI workload execution. While x86 remains dominant for legacy enterprise workloads, Arm's architectural simplicity, power efficiency, and flexible licensing versus x86 make it "increasingly attractive" for modern cloud and AI infrastructure, said Jain. In particular, its flexible licensing model promotes more innovation and cost competition, which will benefit users, hyperscalers, and enterprises.

"What is noteworthy about Google's approach is the scale (30,000-plus applications) and method (AI-powered automation)," said Jain. "As much as it is a technological feat, it is a win for large-scale program management and organizational change management."