PerspectiveNet is a lightweight, efficient model for generating long-form descriptions from multiple camera perspectives. It combines a compressed connector module, which converts visual features into fixed-size tensors, with a large language model (LLM) that provides strong natural language generation capabilities. The connector module is designed with three goals: mapping visual features into the LLM embedding space, highlighting the key information needed for description generation, and producing fixed-size feature matrices. In addition, an auxiliary frame-order detection task helps the model recover the correct temporal order of frames before generating descriptions. Finally, the connector module, auxiliary task, LLM, and visual feature extractor are integrated into a single architecture and trained jointly on the traffic safety description and analysis task, which requires generating detailed, fine-grained event descriptions from multiple cameras and perspectives. The resulting model is lightweight, enabling efficient training and inference while maintaining high performance.
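To make the connector's role concrete, the sketch below shows one common way such a module can be realized: a fixed set of learnable queries cross-attends over a variable-length sequence of visual features and the result is projected into the LLM embedding space, so the output size is constant regardless of the number of input frames or patches. All names, dimensions, and the cross-attention design are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Hypothetical sketch of a compressed connector.

    Learnable queries attend over variable-length visual features and
    emit a fixed-size matrix in the LLM embedding space. Dimensions
    and layer choices here are assumptions for illustration only.
    """

    def __init__(self, vis_dim=768, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Fixed number of learnable query vectors -> fixed output size.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Project compressed features into the LLM's embedding dimension.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats):
        # vis_feats: (batch, N, vis_dim), where N varies with the input.
        batch = vis_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(q, vis_feats, vis_feats)  # (batch, num_queries, vis_dim)
        return self.proj(out)  # fixed-size (batch, num_queries, llm_dim)
```

Because the query count is fixed, inputs of any length compress to the same output shape, which keeps the sequence fed to the LLM short and the overall model lightweight.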