Microsoft Azure: Meta shares details of new 24k GPU AI clusters

Mar 13, 2024 | Posted by Abdul-Rahman Oladimeji

Meta has shared the details of the hardware, network, storage, design, performance, and software that make up its two new 24,000-GPU data center scale training clusters. These clusters based on Meta’s AI Research SuperCluster (RSC) are being usedto train its Llama 3 large language AI model.

In a blog post co-written by Kevin Lee, technical program manager; Adi Gangidi, production network engineer; and Mathew Oldham, director, production engineering, the company said it maintains its commitment to open innovation in AI software and hardware and has launched the AI Alliance in an effort to build an open ecosystem that brings “transparency, scrutiny, and trust to AI development and leads to innovations that everyone can benefit from that are built with safety and responsibility top of mind.”

The blog post continued: “As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow's needs. That's why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond. Our goal is to create systems that are flexible and reliable to support the fast-evolving new models and research.”