Google DeepMind introduced Decoupled DiLoCo, a distributed training method that tolerates network failures and high latency. The system separates local gradient computation from outer-loop synchronization, allowing worker nodes to continue training even when connections drop.
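The split described above can be pictured as two routines: an inner loop that runs on each worker with no communication, and an outer step that averages each worker's accumulated parameter delta (a "pseudo-gradient") and applies it to the shared model. The following is a minimal NumPy sketch of that general pattern, not DeepMind's code; the toy loss, step counts, learning rates, and function names are illustrative assumptions.

```python
# Minimal sketch of a DiLoCo-style inner/outer split (illustrative only;
# names and hyperparameters are assumptions, not DeepMind's implementation).
import numpy as np

def inner_steps(params, data_batches, lr=0.01):
    """Run local training steps on one worker without any communication."""
    local = params.copy()
    for batch in data_batches:
        grad = 2 * (local - batch.mean())  # gradient of a toy quadratic loss
        local -= lr * grad
    return local

def outer_step(global_params, local_replicas, outer_lr=0.7):
    """Synchronize: average each worker's pseudo-gradient (the parameter
    delta accumulated during local training) and apply it globally."""
    pseudo_grads = [global_params - local for local in local_replicas]
    avg_delta = np.mean(pseudo_grads, axis=0)
    return global_params - outer_lr * avg_delta

# One outer round with two workers training on disjoint local data.
global_params = np.zeros(4)
worker_data = [
    [np.random.randn(8) + 1.0 for _ in range(5)],  # worker 0 batches
    [np.random.randn(8) - 1.0 for _ in range(5)],  # worker 1 batches
]
replicas = [inner_steps(global_params, d) for d in worker_data]
global_params = outer_step(global_params, replicas)
print(global_params)
```

The key point is that communication happens only in `outer_step`; everything inside `inner_steps` is purely local, which is what makes the two phases separable in the first place.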
Traditional data-parallel training synchronizes gradients at every step, which demands constant, reliable communication between nodes. DiLoCo's architecture lets workers train independently for extended periods before synchronizing updates. This decoupling reduces bandwidth requirements and maintains progress despite intermittent connectivity.
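One way to picture the tolerance for intermittent connectivity is a worker loop that treats each synchronization attempt as optional: if the sync call fails, the worker keeps its unsent delta and continues training, retrying at the next outer boundary. This is a hypothetical sketch of that behavior based on the article's description, not the actual protocol; the retry policy and failure rate are invented for illustration.

```python
# Hypothetical worker loop illustrating deferred synchronization: if an
# outer sync fails, the worker keeps training and carries the unsent delta
# into the next attempt. Names and retry policy are illustrative only.
import random

def try_sync(delta):
    """Stand-in for a network call that may fail; returns True on success."""
    return random.random() > 0.3  # assume ~30% of sync attempts drop

params, pending_delta = 100.0, 0.0
for outer_round in range(10):
    # Local training: accumulate an update with no communication.
    local_update = -1.0            # toy per-round parameter delta
    pending_delta += local_update
    params += local_update

    if try_sync(pending_delta):
        print(f"round {outer_round}: synced delta {pending_delta:+.1f}")
        pending_delta = 0.0        # delta delivered to the outer optimizer
    else:
        print(f"round {outer_round}: connection dropped, training continues")
```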
The method targets scenarios where compute is spread across geographic regions or sits behind unreliable infrastructure. DeepMind's approach could enable training on networks of devices that previously couldn't participate in large-scale model development because of connection constraints.