Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PlantDeBERTa: An Open Source Language Model for Plant Science

Created by
  • Haebom

Authors

Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri

Outline

This paper presents PlantDeBERTa, a high-performance, open-source language model designed to extract structured knowledge from the literature on plant stress responses. Built on the DeBERTa architecture, it is fine-tuned on a carefully curated corpus of expert-annotated abstracts covering diverse biotic and abiotic stress responses in lentil (Lens culinaris). The approach combines Transformer-based modeling with rule-based linguistic post-processing and ontology-based entity normalization to capture biologically meaningful relationships accurately. The corpus is annotated with a hierarchical schema aligned with the Crop Ontology and spans the molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantDeBERTa generalizes well across diverse entity types, which indicates that robust domain adaptation is feasible in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, the work addresses a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenotyping, and agricultural knowledge discovery. The model is distributed openly to increase transparency and accelerate interdisciplinary innovation in computational plant science.
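The three stages named above (model predictions, rule-based post-processing, ontology-based normalization) can be illustrated with a minimal sketch. This is not the authors' implementation: the BIO tag scheme, the label names (STRESS, METABOLITE), the punctuation-stripping rule, and the `EX:` identifiers are all hypothetical placeholders, and the token-level tags are assumed to come from a fine-tuned token classifier that is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical synonym table, invented for illustration only.
# The "EX:" identifiers are placeholders, not real Crop Ontology IDs.
ONTOLOGY = {
    "drought": "EX:0001",
    "water deficit": "EX:0001",  # synonym mapped to the same concept
    "proline": "EX:0002",
}


@dataclass
class Entity:
    text: str
    label: str
    ontology_id: Optional[str] = None


def decode_bio(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Group token-level BIO tags (assumed model output) into entity spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # Start a new span; this also repairs an I- tag with no matching B-.
            if current:
                entities.append(Entity(" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # "O" closes any open span
            if current:
                entities.append(Entity(" ".join(current), label))
            current, label = [], None
    if current:
        entities.append(Entity(" ".join(current), label))
    return entities


def postprocess(entity: Entity) -> Optional[Entity]:
    """Example rule: trim stray punctuation and drop spans left empty."""
    cleaned = entity.text.strip(" .,;:()")
    return Entity(cleaned, entity.label) if cleaned else None


def normalize(entity: Entity, ontology=ONTOLOGY) -> Entity:
    """Map the surface form to a concept ID; unknown terms keep None."""
    entity.ontology_id = ontology.get(entity.text.lower())
    return entity


def extract(tokens: List[str], tags: List[str]) -> List[Entity]:
    cleaned = (postprocess(e) for e in decode_bio(tokens, tags))
    return [normalize(e) for e in cleaned if e is not None]


if __name__ == "__main__":
    tokens = ["Water", "deficit", "increased", "proline", ",", "levels"]
    tags = ["B-STRESS", "I-STRESS", "O", "B-METABOLITE", "I-METABOLITE", "O"]
    for entity in extract(tokens, tags):
        print(entity)
```

Keeping normalization as a separate lookup step means that synonyms such as "drought" and "water deficit" resolve to one concept without retraining the model, although the lookup table then has to be maintained alongside the ontology.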

Takeaways, Limitations

Takeaways:
Presents a successful case study of developing and applying a domain-adapted language model in plant science, a low-resource field.
Provides a scalable and reproducible framework for high-resolution entity recognition.
Shows the potential for intelligent, data-driven systems in plant genomics, phenotyping, and agricultural knowledge discovery.
Releases the model as open source to accelerate interdisciplinary collaboration and innovation.
Limitations:
Training focused on lentils, so further study is needed to determine how well the model generalizes to other plant species.
Model performance depends on the size and quality of the training corpus; a larger dataset may improve it.
The rule-based post-processing and ontology-based normalization steps are complex, which may make the pipeline difficult to maintain and extend.
Further research is needed to improve generalizability across different stress types and plant species.