Decentralized training isn't a policy nightmare — yet
Governments should still be able to keep track of who's training frontier models — though a shift to reinforcement learning could make that harder
Experts predict that by 2030, frontier AI training runs will require data centers that consume five gigawatts of power — enough to power New York City. But the US doesn’t currently have the spare grid capacity to run such behemoths. Amidst the search for options, companies are quietly turning to an alternative: distributing the computational load across multiple data centers.
Decentralized training is a technique that splits AI model training across multiple servers or data centers. And it could help address the growing energy and infrastructure demands of AI development: it’s much easier to build lots of smaller data center facilities than one big one. While the US is expanding its energy capacity, it’s challenging to make five gigawatts available in one place. Spreading the required power over multiple counties or states eases the constraint. But it’s not without its challenges.
Traditionally, training a new AI model happens in a single massive data center. During pre-training, vast amounts of scraped internet data are fed into the model. Each chip processes a batch of that data, computes a small adjustment that makes the model weights fit the data slightly better, and then syncs that update with all the other chips. Billions of updates later, there’s a pre-trained model ready for fine-tuning into a helpful chatbot.
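For a concrete picture of that loop, here is a minimal toy sketch in Python. It is purely illustrative (a linear model stands in for the neural network, plain Python stands in for the specialized chips and interconnects, and every name is made up), but the rhythm is the same: each simulated chip computes an update from its own batch, the updates are averaged (the "sync" step), and every chip then applies the same change to its copy of the weights.

```python
import numpy as np

# Toy illustration of synchronous data-parallel pre-training.
# Each "chip" computes a gradient on its own batch; the gradients are
# averaged (the sync step); every chip applies the same update.
# A linear model stands in for a network with billions of weights.

rng = np.random.default_rng(0)
n_chips, dim, lr = 4, 8, 0.1
weights = rng.normal(size=dim)        # the shared model weights
true_weights = rng.normal(size=dim)   # target the toy data is generated from

for step in range(100):
    grads = []
    for chip in range(n_chips):       # each chip sees a different batch
        x = rng.normal(size=(32, dim))
        y = x @ true_weights
        pred = x @ weights
        grads.append(2 * x.T @ (pred - y) / len(y))   # local gradient
    avg_grad = np.mean(grads, axis=0)  # the sync: average across all chips
    weights -= lr * avg_grad           # identical update applied everywhere

print(f"distance from target weights: {np.linalg.norm(weights - true_weights):.4f}")
```

The sync step is the part that matters for what follows: in a real run it happens across thousands of chips, several times a second, for months.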
Distributing that training is challenging because the chips need to communicate constantly, syncing with each other multiple times per second as the model weights are updated. Each time information moves across a chip or cable, Epoch AI researcher Ege Erdil explains, there is a fixed latency cost from loading and unloading data at each connection. Data centers minimize these costs with direct chip-to-chip cables, but greater distances require more intermediate connections, and the delays accumulate.
At larger distances, Erdil says, even the speed of light becomes a limiting factor — data in fiber optic cables moves at just two-thirds light speed, causing additional delays as servers wait for massive amounts of information to travel through bandwidth-constrained cables. This problem only gets harder each year as models get bigger and bigger, requiring more and more data flowing between chips.
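A rough back-of-the-envelope calculation shows why distance hurts. The figures below assume straight-line fiber runs at two-thirds the speed of light and ignore routing detours and switching overhead, both of which make real-world numbers worse:

```python
# One-way propagation delay in fiber at roughly two-thirds the speed of
# light. Real routes are longer than straight lines and add switching
# overhead, so these are optimistic lower bounds.
SPEED_OF_LIGHT = 3.0e8               # metres per second (approx.)
FIBER_SPEED = SPEED_OF_LIGHT * 2 / 3

for km in (1, 100, 1_000, 4_000):    # rack scale up to cross-country
    delay_ms = (km * 1_000) / FIBER_SPEED * 1_000
    print(f"{km:>5} km  ->  {delay_ms:7.3f} ms one way")
```

Within a single building the propagation delay is measured in microseconds; across a continent it is tens of milliseconds each way, a meaningful chunk of the budget when chips are trying to sync multiple times per second.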
These communication challenges make training a frontier AI model across a network of ordinary computers almost impossible, Erdil says. The internet is too slow and unreliable to manage frontier training runs.
What might be possible, however, is training models across multiple small data centers. Researchers are exploring new training algorithms and infrastructure solutions to do so. Techniques like DiLoCo, for example, let each data center train its own copy of the model for many steps at a time, drifting slightly out of sync before re-merging with its counterparts in other data centers, so that sites only need to communicate occasionally. DiLoCo has already been used to train a 10B parameter model — much smaller than frontier models, but still substantial. On the infrastructure side, meanwhile, companies can install more fiber optic cables between facilities to increase the amount of information that can flow at one time, reducing bandwidth bottlenecks and the number of latency-adding connection points along the way.
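To make that more concrete, here is a heavily simplified sketch of the general idea behind DiLoCo-style training. It is not the published recipe (the real algorithm uses more sophisticated inner and outer optimizers); it just shows the structure: each site takes many cheap local steps on its own data and only occasionally merges its accumulated change with the other sites.

```python
import numpy as np

# Simplified sketch of DiLoCo-style training: every "data center" holds
# a full copy of the model, runs many local-only steps, and only rarely
# communicates, by averaging each site's accumulated weight change into
# the shared model. The published algorithm uses more sophisticated
# inner and outer optimizers; this is just the shape of the idea.

rng = np.random.default_rng(0)
n_sites, dim = 4, 8
inner_steps, inner_lr, outer_lr = 50, 0.05, 1.0
global_w = rng.normal(size=dim)       # weights shared across sites
true_w = rng.normal(size=dim)         # target the toy data is generated from

for outer_round in range(20):         # cross-site syncs are infrequent
    deltas = []
    for site in range(n_sites):
        w = global_w.copy()           # start from the shared weights
        for _ in range(inner_steps):  # many cheap, local-only updates
            x = rng.normal(size=(32, dim))
            y = x @ true_w
            grad = 2 * x.T @ (x @ w - y) / len(y)
            w -= inner_lr * grad
        deltas.append(w - global_w)   # this site's accumulated change
    global_w += outer_lr * np.mean(deltas, axis=0)   # the merge step

print(f"distance from target weights: {np.linalg.norm(global_w - true_w):.4f}")
```

Compared with the fully synchronous loop sketched earlier, the sites here exchange information once every 50 steps rather than every step. That is exactly the trade such techniques make: far less communication, in exchange for some training efficiency.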
While these solutions make training across distance feasible, they come at a cost in training efficiency. It’s hard to say exactly how much efficiency is lost, but AI companies’ push for giant data centers suggests the trade-off is substantial enough that they still prefer centralized training runs, despite the difficulty of securing the required energy.
However, we’re already seeing hints that AI labs are quietly deploying some decentralized training. Dylan Patel, founder of semiconductor research company SemiAnalysis, said that Microsoft and OpenAI’s actions suggest they have figured out “how to train across multiple data centers.” Microsoft alone has “signed deals north of $10 billion with fiber companies to connect their data centers together,” he explained. Additionally, Google DeepMind’s first Gemini model was trained across multiple data centers, although it’s not clear how many sites or how far apart they were.
Some worry that decentralized training could undermine AI policy, by making it much harder for governments to track who has the computing power necessary to train frontier AI models — knowledge that’s necessary if governments ever want to impose regulation on training AI models. Currently, tracking multi-gigawatt data centers is straightforward — after all, they’re half the size of Manhattan. Some believe that if decentralized training leads to companies abandoning such huge facilities, it’ll be much harder for governments to keep track of who owns which chips.
But compute governance researchers Lennart Heim and Konstantin F. Pilz told Transformer that these fears are overstated. Research suggests that, in practice, even if there are more data centers, only around a dozen companies own facilities capable of delivering the compute needed for frontier models, even spread across multiple locations. Building lots of small data centers would still cost millions, if not billions, of dollars, putting it within reach of only a handful of tech giants and Infrastructure as a Service (IaaS) companies. With such concentrated ownership, the list of players to monitor is very short.
That makes monitoring them fairly easy. The government could readily keep track of tech giants’ chip purchases, even if those chips were dispersed across multiple data centers. IaaS companies, meanwhile, collect data on their clients’ utilization, network traffic, and similar metrics, so they are well aware of who is renting large amounts of compute across their facilities. A government information-sharing program between IaaS providers could also ensure that a malicious actor doesn’t split their training across multiple companies to try to evade detection.
Furthermore, the money involved, the scarcity of suitable sites, and the chip shortage make it unlikely that anyone could build large data centers in secret: each data center would still be too big to hide, even with decentralized training. And even if each facility could be made small enough to evade detection, you’d still need to connect them with huge amounts of fiber optic cable, requiring large construction projects that would also be easy to track.
Future AI models will likely require an order of magnitude more training compute, meaning it will only get harder to find — and easier to track — the amount of compute required. So while decentralized training may make it easier to assemble that compute, for now it likely won’t have many implications for policy.
But that might not be true for long. While decentralized training struggles with large pre-training runs, reinforcement learning requires far fewer synchronizing updates between chips, reducing the networking requirements and making the training easier to decentralize. With RL potentially becoming an increasingly important part of developing a model, that may push companies towards decentralized training — and it could erode some of the features that currently make decentralized training easy for governments to monitor, such as the need for large facilities or big fiber-optic buildouts.
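To see why the communication pattern changes, consider the schematic sketch below. It is not any lab's pipeline, and the "rollouts" and "targets" are trivial stand-ins rather than real reinforcement learning, but the structure is the point: each site spends most of its time generating experience on its own, using a slightly stale copy of the weights, and fresh weights only cross the wide-area network occasionally, instead of after every single step as in pre-training.

```python
import numpy as np

# Schematic of an RL-style communication pattern (not a real RL
# algorithm): sites generate "rollouts" locally with a slightly stale
# copy of the weights, the learner updates from the pooled experience,
# and new weights are only broadcast across the wide-area network
# every few steps, rather than after every update.

rng = np.random.default_rng(0)
dim, n_sites, broadcast_every = 8, 4, 10
learner_w = rng.normal(size=dim)                      # learner's weights
site_w = [learner_w.copy() for _ in range(n_sites)]   # stale site copies

for step in range(100):
    experience = []
    for s in range(n_sites):                   # purely local work
        states = rng.normal(size=(16, dim))
        actions = states @ site_w[s]           # toy "policy"
        targets = states.sum(axis=1)           # toy notion of a good action
        experience.append((states, actions, targets))

    # Learner updates from the pooled experience. (In a real system the
    # rollouts would be shipped to the learner, but they are far smaller
    # than the weights or gradients exchanged during pre-training.)
    for states, actions, targets in experience:
        grad = states.T @ (targets - actions) / len(actions)
        learner_w += 0.01 * grad               # move toward better actions

    if step % broadcast_every == 0:            # the only wide-area sync
        site_w = [learner_w.copy() for _ in range(n_sites)]

print(f"distance from the ideal toy policy: {np.linalg.norm(learner_w - np.ones(dim)):.3f}")
```

The rollout-generation loop never touches the wide-area network, which is why facilities doing this kind of work could, in principle, be smaller and less tightly connected.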
Erdil cautions that this is all highly uncertain: "We don't know about this new paradigm of RL," he says. "It is something we don't understand as well." But in AI, the landscape can shift fast — and governments might be left playing catch-up.
Disclosure: The author’s partner works at Google DeepMind.