Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Photonic Fabric Platform for AI Accelerators

Created by
  • Haebom

Author

Jing Ding, Trung Diep

Outline

This paper presents the Photonic Fabric™ and the Photonic Fabric Appliance™ (PFA), an optical switch-and-memory subsystem that delivers low latency, high bandwidth, and low energy consumption. The PFA integrates high-bandwidth HBM3E memory, an on-module optical switch, and external DDR5 in a 2.5D electro-optical system-in-package, providing up to 32 TB of shared memory and 115 Tbps of all-to-all digital switching.

The Photonic Fabric™ lets distributed AI training and inference execute parallelism strategies more efficiently. It removes the silicon beachfront constraint that imposes the fixed memory-to-compute ratio seen in traditional XPU accelerator designs: replacing an XPU's local HBM stacks with chiplets that connect to the Photonic Fabric scales memory capacity and bandwidth beyond what on-package HBM alone can achieve.

The authors introduce CelestiSim, a lightweight analytical simulator validated against NVIDIA H100 and H200 systems, to evaluate LLM inference and training performance and energy savings on the PFA without significant changes to the GPU core design. Simulation results show that the PFA achieves up to 3.66x throughput improvement and 1.40x latency reduction for 405B-parameter LLM inference, up to 7.04x throughput improvement and 1.41x latency reduction for 1T-parameter LLM inference, and a 60-90% reduction in data-movement energy for collective operations across all LLM training scenarios. Although the results are presented for NVIDIA GPUs, they apply equally to other AI accelerator designs (XPUs) that share the same fundamental limitation of a fixed memory-to-compute ratio.
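
To make the fixed memory-to-compute-ratio argument concrete, here is a minimal roofline-style sketch in Python. It is not CelestiSim (whose internals are not described in this summary); the model size, bandwidth, and FLOP figures are illustrative assumptions chosen only to show that decode-phase LLM inference is memory-bandwidth-bound, so fabric-attached bandwidth raises throughput almost linearly until the compute bound takes over.

```python
# Illustrative roofline sketch (NOT CelestiSim; all numbers are assumptions).
# Each decode step streams the weights once (memory term) and runs roughly
# 2 FLOPs per parameter (compute term); the slower term sets the token rate.

def decode_tokens_per_sec(weight_bytes: float,
                          mem_bw: float,
                          flops_per_token: float,
                          peak_flops: float) -> float:
    t_mem = weight_bytes / mem_bw            # time to stream the weights
    t_compute = flops_per_token / peak_flops # time to do the arithmetic
    return 1.0 / max(t_mem, t_compute)       # bounded by the slower of the two

# Hypothetical 405B-parameter model at 1 byte/parameter, batch size 1.
weights = 405e9
flops = 2 * 405e9

baseline = decode_tokens_per_sec(weights, 4.8e12, flops, 2e15)      # HBM3E-class BW
fabric = decode_tokens_per_sec(weights, 4 * 4.8e12, flops, 2e15)    # assumed 4x fabric BW

print(f"baseline: {baseline:.1f} tok/s, fabric: {fabric:.1f} tok/s "
      f"({fabric / baseline:.2f}x)")
```

Under these assumptions the memory term dominates the compute term by more than two orders of magnitude, which is why the simulated speedups in the paper track added memory bandwidth rather than added FLOPs.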

Takeaways, Limitations

Takeaways:
  • Presents a novel optical-fabric architecture that overcomes the fixed memory-to-compute ratio limitation.
  • Shows the potential to significantly improve LLM inference and training performance and energy efficiency (up to 7.04x throughput improvement, up to 90% data-movement energy savings).
  • Is applicable to a broad range of AI accelerator (XPU) designs.
  • Enables efficient performance evaluation via the lightweight analytical simulator CelestiSim.
Limitations:
  • Results are simulation-only; implementation and validation on real hardware are still required.
  • Lacks an analysis of the cost and complexity of the PFA.
  • Generalizability across diverse XPU architectures needs further study.
  • CelestiSim's accuracy and limitations require further analysis.