Large language models (LLMs) trained on large-scale vision-language data can improve open-vocabulary object detection (OVD) through synthetic training data, but handcrafted data-generation pipelines often introduce bias and overfit to specific prompts. In this paper, we present a systematic method for enhancing visual grounding by leveraging the decoder layers of the LLM. We introduce a zero-initialized cross-attention adapter that enables efficient knowledge fusion from the LLM into the object detector, yielding a new approach called LLM Enhanced Open-Vocabulary Object Detection (LED). We find that intermediate LLM layers already encode rich spatial semantics, and that adapting only the early layers captures most of the performance gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel at a cost of only 8.7% additional GFLOPs; with a larger vision backbone, the improvement rises to 6.22%. Extensive experiments on adapter design variants, LLM scale, and fusion depth further validate the design.
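To make the zero-initialized cross-attention idea concrete, the following is a minimal PyTorch sketch, not the authors' released implementation; the module structure, dimensions, and the placement of the gating parameter are assumptions. Detector queries attend to hidden states from an early LLM decoder layer, and a gate initialized to zero keeps the adapter's contribution at exactly zero at the start of training, so the pretrained detector is preserved until the gate learns to open.

```python
import torch
import torch.nn as nn


class ZeroInitCrossAttentionAdapter(nn.Module):
    """Hypothetical sketch: fuse LLM decoder hidden states into detector queries.

    A learnable gate initialized to zero makes the adapter a no-op at the
    start of training, so the pretrained detector's behavior is unchanged
    until the gate is learned.
    """

    def __init__(self, det_dim: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        # Project LLM hidden states into the detector's feature space.
        self.llm_proj = nn.Linear(llm_dim, det_dim)
        self.cross_attn = nn.MultiheadAttention(det_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(det_dim)
        # Zero-initialized gate: the residual branch contributes nothing initially.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, det_queries: torch.Tensor, llm_hidden: torch.Tensor) -> torch.Tensor:
        # det_queries: (B, Nq, det_dim) detector query features
        # llm_hidden:  (B, Nt, llm_dim) hidden states from an LLM decoder layer
        kv = self.llm_proj(llm_hidden)
        fused, _ = self.cross_attn(self.norm(det_queries), kv, kv)
        # Gated residual connection; tanh(0) = 0 at initialization.
        return det_queries + torch.tanh(self.gate) * fused


# Toy usage with assumed dimensions (e.g. a 256-d detector decoder and an
# LLM hidden size of 896, as in Qwen2-0.5B).
if __name__ == "__main__":
    adapter = ZeroInitCrossAttentionAdapter(det_dim=256, llm_dim=896)
    queries = torch.randn(2, 900, 256)
    llm_states = torch.randn(2, 32, 896)
    out = adapter(queries, llm_states)
    print(out.shape)  # torch.Size([2, 900, 256])
```

Because the gated branch starts at zero, the adapter can be inserted into a frozen, pretrained detector without degrading its initial predictions, which is the usual motivation for zero-initialized fusion layers.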