Microsoft Azure’s Chief Technology Officer, Mark Russinovich, has warned that the data centres powering advanced AI models such as ChatGPT will soon run up against practical size limits, and is pushing for new ways to connect multiple data centres so that progress can continue.
The most powerful AI models are currently trained within a single facility, where tens of thousands of processors - such as Nvidia’s H100 GPUs - are networked together to function as a single system.
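To make that concrete, here is a minimal sketch of the synchronous data-parallel pattern that lets many GPUs behave as one system, assuming PyTorch’s DistributedDataParallel with the NCCL backend; the model, batch, and hyperparameters are stand-ins, not anything Microsoft actually runs.

```python
# Minimal sketch of synchronous data-parallel training. Illustrative only:
# the model, data, and learning rate are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    # One process per GPU; NCCL carries the inter-GPU traffic.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(1024, 1024).to(f"cuda:{rank}"),
                device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 1024, device=f"cuda:{rank}")  # stand-in batch
        loss = model(x).square().mean()
        opt.zero_grad()
        # DDP all-reduces gradients across every GPU during backward();
        # one slow or failed rank stalls the whole collective.
        loss.backward()
        opt.step()

    dist.destroy_process_group()
```

In practice a script like this is launched with a tool such as torchrun, which assigns each process its rank and the shared rendezvous address.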
However, as Microsoft and other tech giants race to build ever more capable AI models, rising energy demand, compounded by America’s ageing energy grid, will impose practical limits on data centre expansion.
These centres consume vast amounts of electricity and could soon demand gigawatts of power, on the order of what hundreds of thousands of homes use.
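That comparison is easy to sanity-check. Assuming an average household draw of roughly 1.2 kW (about 10,500 kWh per year, a commonly cited US figure), one gigawatt covers several hundred thousand homes; the numbers below are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope check of the "hundreds of thousands of homes" figure.
DATA_CENTRE_DEMAND_W = 1e9    # 1 gigawatt
AVG_HOME_DRAW_W = 1.2e3       # assumed ~1.2 kW average household draw

homes = DATA_CENTRE_DEMAND_W / AVG_HOME_DRAW_W
print(f"1 GW covers about {homes:,.0f} average homes")  # ~833,000
```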
Certain regions are already struggling to meet power demands, with rolling blackouts becoming more common during peak electricity usage, such as on hot days when air conditioners are heavily used.
To counteract these issues, Microsoft has embarked on numerous projects to enhance grid capacity. These include a deal to reopen the Three Mile Island Nuclear Power Plant, a $30 billion AI infrastructure fund launched with BlackRock, and a $10 billion green energy agreement with Brookfield.
Despite the US government’s investment in upgrading energy infrastructure, such as the $3 billion for transmission lines under the 2022 Inflation Reduction Act, companies like Microsoft cannot wait for these long-term improvements to materialise.
The technical challenge of linking data centres, which already push the boundaries of modern networking, is immense. Even connecting two of them requires fibre optic links of enormous bandwidth, something that has only recently become achievable over long distances, and even then the latency imposed by distance means the centres may need to be built relatively close to each other.
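Some quick arithmetic shows why. Light in optical fibre travels at roughly two-thirds the speed of light, about 200,000 km per second, so round-trip delay grows linearly with distance; the figures below are lower bounds that ignore switching and routing overhead.

```python
# Why linked data centres want to be close: fibre latency vs. distance.
FIBRE_SPEED_KM_PER_S = 200_000  # ~2/3 of c; an approximation

def round_trip_ms(distance_km: float) -> float:
    return 2 * distance_km / FIBRE_SPEED_KM_PER_S * 1000

for d in (1, 50, 500, 4000):  # same campus ... cross-country
    print(f"{d:>5} km: >= {round_trip_ms(d):6.2f} ms round trip")

# GPU-to-GPU links inside one data centre operate in microseconds, so
# even 500 km apart (>= 5 ms round trip) is orders of magnitude slower
# than the network a training job is tuned for.
```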
Training large AI models involves splitting the computation across tens of thousands of GPUs, all of which must communicate continuously. Any lag or failure of a single GPU can disrupt the entire process, a problem that becomes even harder to manage when the work is spread across multiple data centres. Failures, often caused by overheating, remain a significant challenge.
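Because a single failed GPU can stall an entire run, training systems periodically checkpoint their state so that a crash costs minutes rather than weeks. Below is a minimal sketch of that pattern in PyTorch; the file path and save interval are illustrative placeholders, not any lab’s real configuration.

```python
# Minimal checkpointing sketch: save training state periodically so a
# GPU failure forces a resume, not a restart from scratch.
import torch

CKPT_PATH = "checkpoint.pt"   # placeholder path
SAVE_EVERY = 500              # steps between checkpoints (illustrative)

def save_checkpoint(step, model, opt):
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": opt.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, opt):
    """Resume from the last checkpoint; start at step 0 if none exists."""
    try:
        state = torch.load(CKPT_PATH)
    except FileNotFoundError:
        return 0
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["optimizer"])
    return state["step"] + 1

# Inside the training loop:
#     if step % SAVE_EVERY == 0:
#         save_checkpoint(step, model, opt)
```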
While China’s hyperscalers have experimented with connecting multiple data centres, they lack access to the most advanced AI chips due to US restrictions.
Nevertheless, some experts believe that in the future, it might be possible to train AI models on smaller, globally distributed computers.
Start-ups like Gensyn are working on new methods that could leverage a range of computing power, from CPUs to GPUs, to train AI models - an idea reminiscent of SETI@home, which used volunteers’ computers to analyse radio telescope data in the search for extraterrestrial life.
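As a toy illustration of that idea, and emphatically not Gensyn’s actual protocol, the sketch below uses the classic local-SGD / federated-averaging pattern: each volunteer machine trains at its own pace and a coordinator periodically averages the weights. The model, data, and step counts are all stand-ins.

```python
# Toy federated-averaging sketch: heterogeneous volunteers train locally,
# a coordinator averages their weights. Not Gensyn's method.
import copy
import torch

def local_update(model, data, steps, lr=1e-3):
    """A few SGD steps on one volunteer's machine (CPU or GPU)."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = model(data).square().mean()  # stand-in objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.state_dict()

def average_weights(states):
    """Coordinator merges the volunteers' results into one model."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    return avg

global_model = torch.nn.Linear(64, 64)
volunteers = [torch.randn(16, 64) for _ in range(3)]  # each holds local data

for _round in range(5):
    # Faster machines contribute more local steps than slower ones.
    states = [local_update(global_model, data, steps=10 * (i + 1))
              for i, data in enumerate(volunteers)]
    global_model.load_state_dict(average_weights(states))
```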