
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Photonic Fabric Platform for AI Accelerators

Created by
  • Haebom

Author

Jing Ding, Trung Diep

Outline

This paper presents the Photonic Fabric™ and the Photonic Fabric Appliance™ (PFA), a photonic switch and memory subsystem that delivers low latency, high bandwidth, and low energy consumption. The PFA integrates HBM3E memory, on-module optical switches, and external DDR5 in a 2.5D electro-optical system-in-package, providing up to 32 TB of shared memory and 115 Tbps of full-bandwidth digital switching. The Photonic Fabric™ allows distributed AI training and inference to execute parallelism strategies more efficiently, and it addresses the silicon area constraint that fixes the memory-to-compute ratio in conventional XPU accelerator designs: replacing the XPU's local HBM stack with a chiplet that connects to the Photonic Fabric expands both memory capacity and memory bandwidth.

Using CelestiSim, a lightweight analytical simulator validated on NVIDIA H100 and H200 systems, the authors evaluate LLM inference performance and energy savings on the PFA without changing the GPU core design. Simulation results show up to 3.66x higher throughput and 1.40x lower latency for 405B-parameter LLM inference, up to 7.04x higher throughput and 1.41x lower latency at 1T parameters, and 60-90% reductions in data-movement energy across all LLM training scenarios. Although the results are reported for NVIDIA GPUs, they apply similarly to other AI accelerator (XPU) designs with the same memory-compute constraints.
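The underlying premise, that large-model inference is bound by how fast weights can be streamed from memory rather than by compute, can be illustrated with a toy roofline-style estimate. The sketch below is not CelestiSim; the bandwidth and precision figures are assumptions chosen only to show the shape of the calculation.

```python
# Toy roofline-style model of memory-bound LLM decode throughput.
# This is NOT CelestiSim -- just a sketch of why raising effective memory
# bandwidth (the Photonic Fabric's goal) lifts tokens/s when decoding is
# bandwidth-bound. All concrete numbers are illustrative assumptions.

def decode_tokens_per_s(params: float, bytes_per_param: float,
                        mem_bw: float) -> float:
    """Upper bound on single-stream decode rate: each generated token must
    stream the full weight set from memory once (KV-cache traffic ignored)."""
    return mem_bw / (params * bytes_per_param)

PARAMS = 405e9             # 405B-parameter model, as in the paper's experiments
BW_HBM = 4.8e12            # ~4.8 TB/s, roughly H200-class HBM (assumed figure)
BW_FABRIC = 3.66 * BW_HBM  # hypothetical: effective bandwidth scaled by the
                           # paper's reported 3.66x throughput gain

for label, bw in [("baseline HBM", BW_HBM), ("fabric-attached", BW_FABRIC)]:
    rate = decode_tokens_per_s(PARAMS, 1.0, bw)  # assume FP8: 1 byte/param
    print(f"{label:>15}: {rate:5.1f} tokens/s")
```

In such a model, throughput scales linearly with effective memory bandwidth, which is why decoupling bandwidth from the on-package HBM stack can improve inference without touching the GPU core.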

Takeaways, Limitations

Takeaways:
  • Presents a novel photonics-based architecture that overcomes the fixed memory-to-compute ratio of conventional accelerator designs.
  • Shows potential for large gains in LLM inference and training performance and energy efficiency (up to 7.04x throughput improvement and 60-90% data-movement energy savings).
  • Applicable to various XPU architectures beyond NVIDIA GPUs.
  • Enables efficient performance evaluation via CelestiSim, a lightweight analytical simulator.
Limitations:
  • Results are simulation-based; real-hardware implementation and validation are still required.
  • No analysis of the actual implementation cost and complexity of the PFA.
  • Generalizability across different XPU architectures remains to be verified in practice.
  • The accuracy and limits of the CelestiSim simulator itself warrant further scrutiny.