Meta has released details of its next-generation Meta Training and Inference Accelerator (MTIA).

This MTIA chip delivers significant improvements in compute and memory bandwidth over the previous generation, and is designed to efficiently serve the ranking and recommendation models that provide high-quality recommendations.
The new MTIA design is built on TSMC’s 5nm process and features higher clock frequencies, a larger gate count, more floating-point compute, and a larger package size.
It also provides higher GEMM and SIMD vector operation throughput, as well as greater local and on-chip memory capacity and bandwidth.
In addition, Meta has developed a large rack system that can accommodate up to 72 accelerators, and a new software stack that is fully integrated with PyTorch 2.0 to support efficient model and kernel code generation.
These optimizations allow the new generation of MTIA to deliver three times the performance of the first-generation chip, six times the model-serving throughput, and 1.5 times the performance per watt.
Meta is deploying the chip in its data centers to support its AI workloads, where it demonstrates advantages in performance and efficiency, especially on Meta-specific workloads.

Introducing the next-generation Meta Training and Inference Accelerator (MTIA), the latest member of our custom chip family, designed specifically for Meta’s AI workloads.

This inference accelerator is part of our broader full-stack development program for custom, domain-specific silicon that addresses our unique workloads and system requirements. This new version of MTIA more than doubles the compute and memory bandwidth of our previous solution while maintaining a tight coupling to our workloads. It is designed to efficiently serve the ranking and recommendation models that provide users with high-quality recommendations.

The chip’s architecture is fundamentally focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models. In inference, we need to achieve relatively high utilization even when our batch sizes are relatively low. By providing large SRAM capacity relative to typical GPUs, we can sustain high utilization at limited batch sizes while still providing enough compute when larger amounts of potentially concurrent work are available.

The accelerator consists of an 8×8 grid of processing elements (PEs). These PEs significantly improve dense compute performance (3.5x over MTIA v1) and sparse compute performance (7x improvement). This comes partly from architectural improvements in the sparse compute pipelines, and partly from how we feed the PE grid: we tripled the size of local PE storage, doubled the on-chip SRAM and increased its bandwidth by 3.5x, and doubled the capacity of LPDDR5.

The new MTIA design also features an improved network-on-chip (NoC) architecture that doubles bandwidth and allows us to coordinate between different PEs at low latency. These and other new capabilities in the PEs are key technologies on our long-term roadmap for scaling MTIA to broader and more challenging workloads.

To support the next generation of chips, we have developed a large rack-mounted system that accommodates up to 72 accelerators. It consists of three chassis, each containing 12 boards, with two accelerators per board. We specifically designed the system so that we could clock the chip at 1.35 GHz (up from 800 MHz) and run it at 90 watts, compared to 25 watts for the first-generation design. Our design delivers denser capability along with higher compute, memory bandwidth, and memory capacity. This density allows us to more easily accommodate a wide range of model complexities and sizes.
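The rack topology and the clock/power figures above can be sanity-checked with a quick calculation; every number comes from this section, and the script is purely illustrative:

```python
# Rack topology from the description: 3 chassis x 12 boards x 2 accelerators each.
chassis_per_rack = 3
boards_per_chassis = 12
accelerators_per_board = 2

accelerators_per_rack = chassis_per_rack * boards_per_chassis * accelerators_per_board
print(accelerators_per_rack)  # -> 72

# Clock and power quoted for the new chip vs. the first-generation design.
clock_ratio = 1.35 / 0.8   # 1.35 GHz vs. 800 MHz
power_ratio = 90 / 25      # 90 W vs. 25 W
print(f"clock up {clock_ratio:.2f}x, power budget up {power_ratio:.2f}x")
```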

In addition, we have upgraded the fabric between accelerators, and between hosts and accelerators, to PCIe Gen5 to improve system bandwidth and scalability. Should we choose to scale beyond the rack, we also have the option of adding an RDMA NIC.

Since the beginning of our investment in MTIA, software has been one of our key areas of focus. As the original developers of PyTorch, we value programmability and developer efficiency. Our MTIA stack is designed to integrate fully with PyTorch 2.0 and features such as TorchDynamo and TorchInductor. The front end’s graph-level capture, analysis, transformation, and extraction mechanisms (such as TorchDynamo, torch.export, etc.) are device-independent and are being reused for MTIA. MTIA’s lower-level compiler takes the output of the front end and generates efficient, device-specific code. This lower-level compiler itself consists of several components responsible for generating executable code for models and kernels.
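The split described here, a device-independent front end that captures a graph and a device-specific compiler that lowers it, can be sketched with PyTorch’s public torch.compile API. The `toy_backend` below is a hypothetical stand-in for a device compiler, not MTIA’s actual backend:

```python
import torch

# Hypothetical backend: TorchDynamo hands it the captured FX graph plus
# example inputs; a real device compiler would lower the graph to
# device-specific executable code instead of falling back to eager mode.
def toy_backend(gm: torch.fx.GraphModule, example_inputs):
    print(f"captured {len(list(gm.graph.nodes))} FX nodes")
    return gm.forward  # fall back to eager execution

@torch.compile(backend=toy_backend)
def rank_scores(features, weights):
    # A toy stand-in for a ranking model's scoring step.
    return torch.sigmoid(features @ weights)

scores = rank_scores(torch.randn(4, 8), torch.randn(8, 1))
print(scores.shape)  # torch.Size([4, 1])
```

Because the capture mechanism is independent of the backend, the same model code runs unchanged whether the backend is this toy fallback or a real device compiler.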

Below that sits the runtime stack, which is responsible for interfacing with the driver/firmware. The MTIA Streaming interface abstraction provides the basic operations that inference and (future) training software need to manage device memory, run operators on the device, and execute compiled graphs. Finally, the runtime interacts with a driver located in user space, a decision we made so that we can iterate faster on drivers and firmware across our production stack.
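Meta has not published the MTIA Streaming interface itself, but the responsibilities listed above (device-memory management, operator dispatch, compiled-graph execution) can be sketched as a hypothetical interface. Every name below is an assumption for illustration, not MTIA’s real API:

```python
from abc import ABC, abstractmethod

class StreamingDeviceSketch(ABC):
    """Hypothetical sketch of a streaming-style runtime interface covering
    the three responsibilities described above: memory, operators, graphs."""

    @abstractmethod
    def alloc(self, nbytes: int) -> int:
        """Reserve device memory and return an opaque buffer handle."""

    @abstractmethod
    def run_operator(self, op_name: str, inputs: list, outputs: list) -> None:
        """Dispatch a single compiled operator on the device."""

    @abstractmethod
    def execute_graph(self, graph_handle: int, inputs: list) -> list:
        """Run an entire compiled graph and return its outputs."""

# A trivial in-process implementation, standing in for a real driver binding.
class FakeDevice(StreamingDeviceSketch):
    def alloc(self, nbytes: int) -> int:
        return 1  # opaque handle

    def run_operator(self, op_name: str, inputs: list, outputs: list) -> None:
        pass  # no-op

    def execute_graph(self, graph_handle: int, inputs: list) -> list:
        return inputs  # identity "graph"
```

Keeping this layer behind an abstract interface is what allows the same inference (and future training) software to run against a user-space driver that can be updated independently.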

In many ways, this new chip runs a software stack similar to MTIA v1’s, which lets the team deploy faster because we have already completed most of the integration and development work needed to run applications on this architecture. The new MTIA is designed to be compatible with code developed for MTIA v1. Because the complete software stack was already integrated with the chip, we had our traffic up and running on the new silicon within days. This allowed us to ramp the next-generation MTIA quickly, going from first silicon to production models running in 16 regions in less than 9 months.

Full details: https://go.fb.me/kwahju
