Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Created by
  • Haebom

Authors

Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, and Dimitris N. Metaxas.

Outline

Large foundation models trained on large-scale vision-language data can boost open-vocabulary object detection (OVD) via synthetic training data, but hand-crafted data-generation pipelines often introduce bias and overfit to specific prompts. In this paper, we present a systematic method to enhance visual grounding by leveraging the decoder layers of an LLM. We introduce a zero-initialized cross-attention adapter that enables efficient knowledge fusion from the LLM into the object detector, a new approach called LLM Enhanced Open-Vocabulary Object Detection (LED). We find that intermediate LLM layers already encode rich spatial semantics, and that adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel at only 8.7% additional GFLOPs; with a larger vision backbone, the improvement rises to 6.22%. Extensive ablations over adapter variants, LLM scales, and fusion depths further validate the design.
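
For intuition, below is a minimal PyTorch sketch of what a zero-initialized cross-attention adapter of this kind can look like: detector queries attend to LLM decoder-layer features, and a zero-initialized output projection makes the fused branch a no-op at the start of training. All names, dimensions, and the residual placement are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ZeroInitCrossAttentionAdapter(nn.Module):
    """Illustrative sketch (not the paper's code) of a zero-initialized
    cross-attention adapter fusing LLM features into detector queries."""

    def __init__(self, det_dim: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(det_dim)
        # Project LLM hidden states down/up to the detector's feature width.
        self.kv_proj = nn.Linear(llm_dim, det_dim)
        self.attn = nn.MultiheadAttention(det_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(det_dim, det_dim)
        # Zero initialization: the residual branch contributes nothing at
        # step 0, so training starts from the unmodified detector.
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, det_feats: torch.Tensor, llm_feats: torch.Tensor) -> torch.Tensor:
        # det_feats: (B, N_queries, det_dim); llm_feats: (B, N_tokens, llm_dim)
        kv = self.kv_proj(llm_feats)
        fused, _ = self.attn(self.norm(det_feats), kv, kv)
        return det_feats + self.out_proj(fused)  # identity mapping at init
```

Because the adapter is exactly the identity at initialization, it can be bolted onto a pretrained detector such as GroundingDINO without degrading its starting performance, and the LLM's contribution is learned gradually.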

Takeaways, Limitations

Takeaways:
  • Presents LED, a novel method that effectively improves open-vocabulary object detection by leveraging the decoder layers of an LLM.
  • Mitigates the bias and overfitting issues of hand-crafted data-generation pipelines.
  • Confirms that intermediate LLM layers encode rich spatial semantics (see the sketch after this list).
  • Achieves significant performance gains at little additional computational cost.
  • Demonstrates applicability across LLM sizes and vision backbones.
Limitations:
  • Further research is needed on the generalization of the proposed method.
  • Possible dependence on specific LLMs and vision backbones.
  • Performance evaluation on other OVD datasets is needed.
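
For the takeaway about intermediate layers, here is a hedged sketch of how one might inspect early decoder-layer hidden states of Qwen2-0.5B with Hugging Face transformers. The layer index is illustrative only; the paper's actual fusion depth is one of its ablated hyperparameters and is not reproduced here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
model.eval()

inputs = tok("a red traffic cone next to a bicycle", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the input embeddings plus one tensor per
# decoder layer, each of shape (batch, seq_len, hidden_dim).
early_feats = out.hidden_states[6]  # layer 6 chosen arbitrarily for illustration
print(early_feats.shape)  # (1, seq_len, 896) for Qwen2-0.5B
```

Features like `early_feats` are what an adapter such as the one sketched above would consume as its key/value input.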