- Meta says it’s delivering supercomputers on the scale of mass-produced consumer electronics to meet surging AI demand
- It’s scaling to multi-region, gigawatt data centers with Ethernet-based fabrics, giant racks and custom hardware
- Reliability, efficiency and talent are bottlenecks
2025 OCP GLOBAL SUMMIT, SAN JOSE, CA – Supercomputers have transformed from "artisanally crafted" projects built over more than a year at elite research laboratories to products built by the hundreds daily, according to Meta's Dan Rabinovitsj.
"We are delivering supercomputers as if they were a mass-produced product," said Rabinovitsj, Meta VP for hardware and AI systems infrastructure engineering, during a keynote. "We're transforming infrastructure to be like a consumer electronics play."
      
He added, "We're used to a new phone or a new television coming out every year for the holiday season. Now we're doing the same thing with supercomputers."
      
      
That volume is needed to feed AI demand for computing power to serve Meta's 3.4 billion daily users. "We've managed to integrate AI into almost every surface and everything we do as a company," Rabinovitsj said. "This is really an exciting journey, but man, it is hard. I used to think hardware is called hardware because it's hard, but actually it's all hard," he quipped.
Meta started using AI before the 2022 generative AI revolution. The company built a single cluster for AI spanning an entire data center. Over time, Meta continued doubling the number of GPUs in its clusters — from 24,000 to 100,000. It is now planning gigawatt data centers, with millions of GPUs spanning multiple regions, he said.
      
Greater scale brought stricter requirements for error control, so Meta developed the concept of "server karma" to predict server errors, allowing the company to take servers offline proactively and maintain reliability.
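Rabinovitsj didn't describe how the karma score is computed. Purely as an illustration of the general idea, the sketch below scores each host on its recent error events and flags low-scoring machines to be drained before they can disrupt a job; all names, weights and thresholds here are invented for illustration, not Meta's actual system.

```python
from dataclasses import dataclass, field

# Hypothetical "server karma"-style policy: servers lose points for
# observed error events, and low-scoring hosts are proactively drained.
# Weights and thresholds below are invented, purely illustrative values.
ERROR_WEIGHTS = {"ecc_correctable": 0.5, "ecc_uncorrectable": 5.0,
                 "nic_flap": 1.0, "thermal_trip": 3.0}
DRAIN_THRESHOLD = 10.0   # karma below this triggers proactive removal
STARTING_KARMA = 20.0

@dataclass
class Server:
    host: str
    karma: float = STARTING_KARMA
    events: list = field(default_factory=list)

    def record(self, event: str) -> None:
        """Subtract a weighted penalty for each observed error event."""
        self.events.append(event)
        self.karma -= ERROR_WEIGHTS.get(event, 0.5)

    def should_drain(self) -> bool:
        return self.karma < DRAIN_THRESHOLD

def plan_drains(fleet: list[Server]) -> list[str]:
    """Return hosts to take offline before they degrade a training job."""
    return [s.host for s in fleet if s.should_drain()]

if __name__ == "__main__":
    fleet = [Server("gpu-host-001"), Server("gpu-host-002")]
    for _ in range(3):
        fleet[0].record("ecc_uncorrectable")  # repeated memory errors
    print(plan_drains(fleet))  # -> ['gpu-host-001']
```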
"Where we're going next is even more bananas," Rabinovitsj said.
Innovation all the way down
Meta plans to scale AI to multiple regions separated by hundreds of kilometers, requiring innovation across software, hardware and every part of the tech stack "all the way down to transistors in custom silicon."
The company recently introduced the Minipack3 data center switch in partnership with Nvidia, and it is developing Non-Scheduled Fabric (NSF), a new network architecture designed for the largest AI clusters. "Just to address the challenges of building these massive clusters across such huge distances, we needed a way to guarantee that that performance would be delivered end-to-end," Rabinovitsj said.
Meta's Prometheus cluster, which will scale to 1 gigawatt, is up and running in New Albany, Ohio. It was built in tents to speed deployment, much like the big tent in which he was delivering the OCP keynote, Rabinovitsj noted.
Hyperion, a 5-gigawatt cluster, is scheduled to come online over the next several years in Richland Parish, Louisiana. Hyperion will span a distance equivalent to that from Lower Manhattan to Central Park in New York, roughly a three-hour walk. And it will be one of many clusters of that scale built by Meta and other hyperscalers, Rabinovitsj said.
To achieve that scale, Meta is deploying many different types of standardized hardware, which introduces software challenges and requires abstracting complexity away from the developer community. That diversity pays off in resiliency, performance and redundant supply chain options.
Like moving an African elephant
For networking, only Ethernet can meet Meta's requirements, Rabinovitsj said. Meta supports the newly introduced Ethernet for Scale Up Networking (ESUN) OCP workstream, which is also supported by AMD, Arista, ARM, Broadcom, Cisco, HPE, Nvidia, OpenAI and other AI and networking leaders. ESUN is designed to improve throughput and reduce latency in scale-up environments.
Large scale-up domains require bigger racks. By the third quarter of 2027, Meta will need a rack supporting up to 256 accelerators. Rabinovitsj called them "BFRs." "We'll let you imagine in your mind what a 'BFR' is," he said.
These racks are required to meet AI demands. But they are difficult to design, manufacture, transport, operate and service. For example, the racks are too big and heavy to be put on casters. A 60- to 70-pound tray will tend to sag in the middle and needs to be reinforced.
Meta had to design a new type of tug to move the racks around the data center — "the equivalent mass of an African elephant" — and will open source that design to OCP.
"These things are so big you have to structurally build everything in a different way to get that rigidity and integrity that needs to be there as you're moving these racks around," he said. They are liquid cooled, with "a lot of expensive and fragile equipment inside them, so the care taken in this design is really, really impressive."
Bubble? What bubble?
Rabinovitsj dismissed talk of AI being an economic bubble due for imminent collapse, like the dot-com bubble or the fiber build-out of the 1990s and 2000s. This capacity will be in demand for at least the next couple of years and needs to be delivered "with quality and reliability," he said.
"Having built infrastructure for years, we thought we had learned everything there is to learn about scale, but honestly, AI is kicking our collective butts every day," Rabinovitsj said. "We have to stand up and figure out how to address these challenges."
The demand, however, leads to a skills shortage. "We are starved for quality engineers across the industry," he said. That also applies to partners, who need to hire skilled workers for their factories.
And data centers need to be designed to reduce greenhouse gas emissions. "We need to look at very dramatic and creative ways to reduce the emissions associated with all this infrastructure," Rabinovitsj said.
He concluded, "One of the most fun things about being at Meta, for me, is that we get to work on every layer of the stack, from PyTorch down to transistors, and that helps a lot with understanding the scale and the context of all these challenges."