Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

Created by
  • Haebom

Authors

Uttej Kallakurik, Edward Humes, Rithvik Jonna, Xiaomin Lin, Tinoosh Mohsenin

Outline

This paper presents a medical assistant system for deploying large language models (LLMs) in resource-constrained, real-time healthcare settings. The system applies a general-purpose compression framework to tailor an LLM to a target domain: it measures neuron importance on domain-specific data and aggressively prunes low-importance neurons, reducing model size while maintaining performance. Post-training quantization is then applied to further cut memory usage, and the compressed models are evaluated on healthcare benchmarks including MedMCQA, MedQA, and PubMedQA. Finally, the authors deploy a 50%-compressed Gemma model and a 67%-compressed LLaMA3 model on a Jetson Orin Nano and a Raspberry Pi 5, achieving real-time, energy-efficient inference under hardware constraints.
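For intuition, below is a minimal PyTorch sketch of the two compression steps the outline describes: importance-based neuron pruning followed by post-training quantization. It operates on a toy feed-forward block; the names (`MLPBlock`, `neuron_saliency`, `prune_block`, `slice_linear`) and the mean-absolute-activation saliency score are illustrative assumptions standing in for the paper's input-driven importance measure, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer feed-forward block.
# All names here are illustrative, not the paper's actual code.
class MLPBlock(nn.Module):
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

def slice_linear(layer, out_idx=None, in_idx=None):
    """Return a smaller nn.Linear keeping only the selected rows/columns."""
    w, b = layer.weight.data, layer.bias.data
    if out_idx is not None:
        w, b = w[out_idx], b[out_idx]
    if in_idx is not None:
        w = w[:, in_idx]
    new = nn.Linear(w.shape[1], w.shape[0])
    new.weight.data.copy_(w)
    new.bias.data.copy_(b)
    return new

@torch.no_grad()
def neuron_saliency(block, calib_batches):
    """Score each hidden neuron by its mean absolute activation on domain data
    (a simple proxy for the paper's input-driven importance measure)."""
    scores = torch.zeros(block.up.out_features)
    for x in calib_batches:
        h = block.act(block.up(x))                 # (batch, d_hidden)
        scores += h.abs().mean(dim=0)
    return scores / len(calib_batches)

@torch.no_grad()
def prune_block(block, calib_batches, keep_ratio=0.5):
    """Drop the lowest-saliency hidden neurons, shrinking both projections."""
    scores = neuron_saliency(block, calib_batches)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = scores.topk(k).indices.sort().values    # keep index order stable
    block.up = slice_linear(block.up, out_idx=keep)
    block.down = slice_linear(block.down, in_idx=keep)
    return block

# Calibrate on (stand-in) domain-specific data, prune 50% of hidden neurons,
# then apply dynamic post-training quantization to int8.
block = MLPBlock()
calib = [torch.randn(8, 64) for _ in range(16)]    # placeholder for medical text features
pruned = prune_block(block, calib, keep_ratio=0.5)
quantized = torch.quantization.quantize_dynamic(pruned, {nn.Linear}, dtype=torch.qint8)
print(quantized(torch.randn(2, 64)).shape)         # torch.Size([2, 64])
```

In a real model this procedure would be repeated per layer, with saliency measured on a held-out slice of the target medical corpus rather than random tensors.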

Takeaways, Limitations

Takeaways:
Demonstrates that a real-time medical assistant system built on LLMs is feasible even in resource-constrained environments.
Proposes an effective model compression technique based on neuron-importance measurement.
Shows successful real-time inference with compressed models on real hardware (Jetson Orin Nano, Raspberry Pi 5), as sketched above.
Presents an energy-efficient deployment approach for medical LLMs.
Limitations:
Further research is needed on the generalizability of the proposed compression framework.
Performance must be validated on a broader range of medical datasets and in clinical settings.
A more detailed analysis of the performance degradation introduced by compression is needed.
Optimizations targeted at specific hardware may limit portability to other platforms.