Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SMA: Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling

Created by
  • Haebom

Authors

Shixuan Sun, Siyuan Liang, Ruoyu Chen, Jianjie Huang, Jingzhi Li, Xiaochun Cao

Outline

This paper proposes Source-Aware Membership Audit (SMA), which the authors describe as the first methodology for precisely identifying the source of content generated by Retrieval-Augmented Generation (RAG) and Multimodal Retrieval-Augmented Generation (MRAG) systems. Because these systems are complex, existing membership inference methods cannot accurately attribute generated content to its source (pre-training data, external retrieval results, or user input). To overcome this limitation, SMA uses an attribution estimation mechanism based on zeroth-order optimization, together with a cross-modal attribution technique. Specifically, it uses a multimodal large language model (MLLM) to convert image inputs into text, which enables membership inference on image retrieval records in MRAG systems. The work thereby shifts the focus of membership inference from whether data is "remembered" to "where content comes from."
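The idea of estimating attribution without gradients can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple difference-of-means estimator over random token masks, swapped in for whatever estimator the paper actually fits, and every name in it (estimate_attribution, attribution_by_source, toy_score, the token and source lists) is hypothetical. In a real semi-black-box audit, the score function would presumably query the RAG system with the perturbed context and compare the new output with the original one; here a toy function stands in for that query.

```python
import random


def estimate_attribution(tokens, score_fn, n_samples=2000, keep_prob=0.5, seed=0):
    """Zeroth-order estimate of each token's influence on score_fn.

    Only black-box calls to score_fn are used (no gradients). The influence
    of token i is the mean score when i is kept minus the mean score when
    i is masked out.
    """
    rng = random.Random(seed)
    n = len(tokens)
    kept_sum, kept_cnt = [0.0] * n, [0] * n
    drop_sum, drop_cnt = [0.0] * n, [0] * n
    for _ in range(n_samples):
        # Sample a random perturbation: each token is kept with keep_prob.
        mask = [rng.random() < keep_prob for _ in range(n)]
        score = score_fn([t for t, m in zip(tokens, mask) if m])
        for i, kept in enumerate(mask):
            if kept:
                kept_sum[i] += score
                kept_cnt[i] += 1
            else:
                drop_sum[i] += score
                drop_cnt[i] += 1
    return [
        kept_sum[i] / max(kept_cnt[i], 1) - drop_sum[i] / max(drop_cnt[i], 1)
        for i in range(n)
    ]


def attribution_by_source(attributions, sources):
    """Sum per-token attributions into per-source totals."""
    totals = {}
    for value, source in zip(attributions, sources):
        totals[source] = totals.get(source, 0.0) + value
    return totals


def toy_score(kept_tokens):
    """Hypothetical stand-in for the audited system: the 'output' is
    preserved only if the retrieved token "Paris" survives masking."""
    return 1.0 if "Paris" in kept_tokens else 0.0


# Hypothetical input: a user question followed by one retrieved passage.
tokens = ["What", "is", "the", "capital", "?", "Paris", "is", "in", "France"]
sources = ["user"] * 5 + ["retrieved"] * 4

attributions = estimate_attribution(tokens, toy_score)
totals = attribution_by_source(attributions, sources)
top_source = max(totals, key=totals.get)
print(top_source, totals)
```

In this toy setup the retrieved passage receives almost all of the attribution mass, because masking "Paris" is the only perturbation that changes the score. Each sample costs one query to the system, so the number of samples directly drives the audit's cost.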

Takeaways, Limitations

Takeaways:
A novel methodology is presented to precisely identify the source of content generated in RAG/MRAG systems.
A zeroth-order optimization-based attribution estimation mechanism enables effective auditing even in semi-black-box environments.
A cross-modal attribution technique using an MLLM enables membership inference on image retrieval records in MRAG systems.
A new perspective on data provenance auditing.
Limitations:
Attribution estimation based on zeroth-order optimization can be computationally expensive because it requires large-scale perturbation sampling.
Information may be lost when the MLLM converts images to text.
The accuracy and efficiency of SMA may vary depending on the specific RAG/MRAG system architecture and data characteristics.
Further application to real-world systems, and performance evaluation on them, is required.