Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CountingFruit: Language-Guided 3D Fruit Counting with Semantic Gaussian Splatting

Created by
  • Haebom

Authors

Fengze Li, Yangle Liu, Jieming Ma, Hai-Ning Liang, Yaochun Shen, Huangxiang Li, Zhijing Wu

Outline

FruitLangGS is a language-guided 3D fruit counting framework that reconstructs orchard-scale scenes with an adaptive dense Gaussian splatting pipeline using radius-aware pruning and tile-based rasterization. Existing pipelines rely on multi-view 2D segmentation and dense volume sampling. FruitLangGS instead filters the compressed, CLIP-aligned semantic vector stored in each Gaussian through a double-threshold cosine similarity mechanism. This retrieves the Gaussians relevant to the target prompt and suppresses common distractors (e.g., leaves), with no retraining and no image-space masks. The selected Gaussians are then sampled into a dense point cloud and geometrically clustered to estimate fruit instances, and the resulting estimates remain robust to severe occlusion and viewpoint variation. Experiments on nine orchard-scale datasets show that FruitLangGS consistently outperforms existing pipelines in instance counting recall, avoids multi-view segmentation fusion errors, and achieves up to 99.7% recall on the Pfuji-Size_Orch2018 orchard dataset. Ablation studies confirm that language-conditioned semantic embeddings and double-threshold prompt filtering are essential for suppressing distractors and improving counting accuracy under severe occlusion. Beyond fruit counting, the same framework enables prompt-based 3D semantic retrieval without retraining, which highlights the potential of language-guided 3D recognition for scalable agricultural scene understanding.
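The filter-then-cluster idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The summary does not say how the two thresholds are defined, so the sketch assumes one plausible reading: a Gaussian is kept when its feature is at least `t_pos` similar to the target prompt and at most `t_neg` similar to every distractor prompt. The function names, the threshold values, the 3-dimensional toy features, and the radius-based connected-components grouping (a stand-in for whatever geometric clustering the paper uses) are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_gaussians(features, target, distractors, t_pos, t_neg):
    """Assumed double-threshold rule: close to the target, far from all distractors."""
    keep = []
    for i, f in enumerate(features):
        if cosine(f, target) < t_pos:
            continue  # not similar enough to the target prompt
        if any(cosine(f, d) > t_neg for d in distractors):
            continue  # too similar to a distractor prompt such as "leaf"
        keep.append(i)
    return keep

def cluster_points(points, radius):
    """Group points into connected components; points within `radius` are linked."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            p = stack.pop()
            near = [q for q in unvisited
                    if math.dist(points[p], points[q]) <= radius]
            for q in near:
                unvisited.remove(q)
                stack.append(q)
                members.append(q)
        clusters.append(sorted(members))
    return clusters

# Toy stand-ins for prompt embeddings and per-Gaussian semantic features.
target = (1.0, 0.0, 0.0)         # hypothetical embedding of "apple"
distractors = [(0.0, 1.0, 0.0)]  # hypothetical embedding of "leaf"
features = [
    (0.9, 0.1, 0.0),   # fruit-like
    (0.95, 0.05, 0.0), # fruit-like
    (0.1, 0.9, 0.0),   # leaf-like: fails the target threshold
    (0.7, 0.7, 0.0),   # ambiguous: passes the target threshold, fails the distractor one
    (0.9, 0.0, 0.1),   # fruit-like
]
centers = [(0, 0, 0), (0.02, 0, 0), (0.5, 0.5, 0), (1, 1, 1), (2, 0, 0)]

keep = select_gaussians(features, target, distractors, t_pos=0.7, t_neg=0.5)
clusters = cluster_points([centers[i] for i in keep], radius=0.05)
print(keep, len(clusters))  # [0, 1, 4] 2
```

The fourth feature shows why a second threshold matters in this reading: it is similar enough to the target to pass a single cutoff, yet it is rejected because it is equally similar to the distractor.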

Takeaways, Limitations

Takeaways:
Offers an efficient and accurate solution to 3D fruit counting in orchards.
Presents a novel approach that avoids multi-view fusion errors and reduces computational cost.
Enables prompt-based 3D semantic retrieval through language guidance.
Maintains high accuracy even under severe occlusion.
Opens new possibilities for scalable agricultural scene understanding.
Limitations:
Generalization needs further study, because the performance evaluation is weighted toward specific orchard datasets.
Applicability to other fruit types and orchard environments has yet to be verified.
Because the framework depends on CLIP, limitations of the CLIP model may carry over to FruitLangGS.
The computational complexity of 3D reconstruction based on Gaussian splatting needs further consideration.