
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Photonic Fabric Platform for AI Accelerators

Created by
  • Haebom

Authors

Jing Ding, Trung Diep

Outline

This paper presents the Photonic Fabric™ and the Photonic Fabric Appliance™ (PFA), photonic switch-and-memory subsystems that deliver low latency, high bandwidth, and low energy consumption. The PFA integrates high-bandwidth HBM3E memory, on-module photonic switches, and external DDR5 in a 2.5D electro-optical system-in-package, providing up to 32 TB of shared memory and 115 Tbps of all-to-all digital switching. The Photonic Fabric™ lets distributed AI training and inference execute parallelism strategies more efficiently: by replacing the local HBM stacks of an XPU with chiplets connected to the Photonic Fabric, it addresses the silicon area constraints that force a fixed memory-to-compute ratio in conventional XPU accelerator designs, expanding both memory capacity and bandwidth.

To evaluate this without changing the GPU core design, the authors present CelestiSim, a lightweight analytical simulator validated on NVIDIA H100 and H200 systems, and use it to estimate LLM inference performance and energy savings on the PFA. Simulation results show up to 3.66x throughput improvement and 1.40x latency reduction for 405B-parameter LLM inference, up to 7.04x throughput improvement and 1.41x latency reduction for 1T-parameter LLM inference, and a 60-90% reduction in data-movement energy across all LLM training scenarios. Although the results are presented for NVIDIA GPUs, they apply similarly to other AI accelerator designs (XPUs) that share the same memory-compute constraints.
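To make the memory-to-compute constraint concrete, here is a minimal roofline-style sketch in Python (an intuition aid, not CelestiSim itself) of why auto-regressive LLM decode is typically bound by memory bandwidth rather than compute. Every hardware number below is an illustrative placeholder, not a figure from the paper.

    # Roofline-style bound on auto-regressive decode throughput.
    # All numbers are illustrative assumptions, not values from the paper.

    def decode_tokens_per_sec(param_bytes: float, mem_bw: float,
                              flops_per_token: float, peak_flops: float) -> float:
        """Upper bound on tokens/s: each decoded token must stream the model
        weights from memory, so throughput is capped by bandwidth; a second
        cap comes from raw compute, but it rarely binds during decode."""
        memory_bound = mem_bw / param_bytes
        compute_bound = peak_flops / flops_per_token
        return min(memory_bound, compute_bound)

    PARAM_BYTES = 405e9          # hypothetical 405B-parameter model at 1 byte/param
    FLOPS_PER_TOKEN = 2 * 405e9  # ~2 FLOPs per parameter per decoded token
    PEAK_FLOPS = 2e15            # placeholder accelerator peak

    local_hbm = decode_tokens_per_sec(PARAM_BYTES, 3.35e12, FLOPS_PER_TOKEN, PEAK_FLOPS)
    fabric = decode_tokens_per_sec(PARAM_BYTES, 10e12, FLOPS_PER_TOKEN, PEAK_FLOPS)
    print(f"local HBM bound: {local_hbm:.2f} tok/s, fabric bound: {fabric:.2f} tok/s, "
          f"speedup: {fabric / local_hbm:.2f}x")

Under this simple model, added memory bandwidth converts almost linearly into decode throughput until the compute roof binds, which is the regime the Photonic Fabric targets by decoupling memory from the XPU die.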

Takeaways, Limitations

Takeaways:
• Optical interconnect technology can overcome the fixed memory-to-compute ratio of XPUs and significantly improve scalability.
• Simulations indicate that substantial performance gains (throughput and latency) and energy savings are achievable in LLM inference and training (a back-of-envelope energy sketch follows this list).
• The approach is general-purpose and applicable to various AI accelerator designs.
Limitations:
• The results are simulation-based; actual hardware implementation and validation are still required.
• Only results for NVIDIA GPUs are presented; performance on other architectures requires further study.
• The accuracy and generalizability of the CelestiSim simulator need further validation.
• The cost and complexity of the PFA are not analyzed.
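As rough intuition for the reported 60-90% data-movement energy savings (referenced in the Takeaways above), a back-of-envelope sketch: in a simple per-bit cost model, the saving depends only on the ratio of the electrical and optical link energies. The pJ/bit values and the payload below are assumed placeholders, not figures from the paper.

    # Toy data-movement energy comparison: energy = bits moved * pJ/bit.
    # Both per-bit costs are assumptions for illustration only.

    PJ_PER_BIT_ELECTRICAL = 5.0  # assumed off-package electrical link cost
    PJ_PER_BIT_OPTICAL = 1.0     # assumed photonic-fabric link cost

    def movement_energy_j(bytes_moved: float, pj_per_bit: float) -> float:
        """Energy in joules to move a payload across one link."""
        return bytes_moved * 8 * pj_per_bit * 1e-12

    payload = 810e9  # e.g., one all-reduce of 405B FP16 gradients (~810 GB)
    e_elec = movement_energy_j(payload, PJ_PER_BIT_ELECTRICAL)
    e_opt = movement_energy_j(payload, PJ_PER_BIT_OPTICAL)
    print(f"electrical: {e_elec:.1f} J, optical: {e_opt:.1f} J, "
          f"saving: {1 - e_opt / e_elec:.0%}")

With these assumed numbers the saving comes out to 80%, inside the reported 60-90% band; the real figure depends on actual per-bit energies and the traffic mix of a given training run.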