Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

InterFeat: A Pipeline for Finding Interesting Scientific Features

Created by
  • Haebom

Authors

Dan Ofer, Michal Linial, Dafna Shahaf

Outline

This paper presents an integrated pipeline for automatically discovering simple, interesting hypotheses from structured biomedical data. Here a hypothesis is a feature-target relationship together with its direction of effect and a potential underlying mechanism. The pipeline combines machine learning, knowledge graphs, literature search, and large language models, and formalizes "interestingness" as a combination of novelty, utility, and relevance. In experiments on eight major diseases from the UK Biobank, the pipeline consistently identified risk factors years before they appeared in the literature. Between 45% and 53% of its top candidates were validated as interesting, compared to 0% to 7% for a SHAP-based baseline. Overall, medical experts rated 28% of 109 candidates as interesting. The authors describe the work as a step toward making "interestingness" operational at scale for any target. The data and code are publicly available (https://github.com/LinialLab/InterFeat).
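To make the idea of combining novelty, utility, and relevance into a single score concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Candidate` class, the `interestingness` and `rank` functions, the [0, 1] score range, the weighted-mean aggregation, and the equal default weights are all assumptions chosen for illustration. The summary above does not state how the paper actually combines the three criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical feature-target candidate with three criterion scores."""
    feature: str
    novelty: float    # assumed to lie in [0, 1]
    utility: float    # assumed to lie in [0, 1]
    relevance: float  # assumed to lie in [0, 1]


def interestingness(c: Candidate, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted mean of the three criteria (an assumed aggregation rule)."""
    w_n, w_u, w_r = weights
    total = w_n + w_u + w_r
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_n * c.novelty + w_u * c.utility + w_r * c.relevance) / total


def rank(candidates, weights=(1.0, 1.0, 1.0)):
    """Return candidates sorted from most to least interesting."""
    return sorted(
        candidates,
        key=lambda c: interestingness(c, weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Placeholder candidates with made-up scores, for illustration only.
    pool = [
        Candidate("feature_a", novelty=0.2, utility=0.9, relevance=0.4),
        Candidate("feature_b", novelty=0.9, utility=0.5, relevance=0.7),
    ]
    for c in rank(pool):
        print(c.feature, round(interestingness(c), 3))
```

A weighted mean lets a strong score on one criterion offset a weak score on another. A stricter rule, such as a product or a minimum over the three scores, would instead require a candidate to do reasonably well on all of them. Which behavior is appropriate depends on how the criteria are defined.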

Takeaways, Limitations

Takeaways:
The paper presents a novel pipeline for automatically discovering interesting hypotheses from biomedical data.
The pipeline's top candidates were validated as interesting far more often (45% to 53%) than those of a SHAP-based baseline (0% to 7%).
The paper formalizes "interestingness" as a combination of novelty, utility, and relevance, which makes it possible to measure and evaluate.
The data and code are publicly available, which supports reproducibility and reuse.
Limitations:
The definition of "interesting" is partly subjective and relies to some extent on expert judgment.
Pipeline performance may vary with the quality and quantity of the available data.
The pipeline is designed for structured biomedical data and may not transfer directly to other types of data.