Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context

Created by
  • Haebom

Author

Victoria R. Li, Yida Chen, Naomi Saphra

Outline

This paper examines bias in the guardrails of large language models (LLMs). Specifically, the authors analyze how user background information (age, gender, race, political affiliation, etc.) affects the likelihood that GPT-3.5 refuses a request. They find that personas of young women and Asian Americans are more likely to be refused when requesting prohibited or illegal information, and that guardrails tend to refuse requests that contradict the user's political leaning. Furthermore, even innocuous information, such as sports fandom, can signal a user's political leaning and thereby influence guardrail activation.
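The measurement setup described above can be illustrated with a minimal sketch: collect model responses to the same request prefixed with different user-persona introductions, flag refusals, and compare refusal rates across personas. The keyword heuristic, persona labels, and mock responses below are all hypothetical stand-ins, not the paper's actual method or data.

```python
from collections import defaultdict

# Crude keyword heuristic for flagging a guardrail refusal (illustrative only;
# the paper's actual refusal detection may differ).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a guardrail refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(records):
    """Given (persona, response) pairs, return {persona: refusal fraction}."""
    counts = defaultdict(lambda: [0, 0])  # persona -> [refusals, total]
    for persona, response in records:
        counts[persona][0] += is_refusal(response)
        counts[persona][1] += 1
    return {p: refused / total for p, (refused, total) in counts.items()}

# Mock data standing in for GPT-3.5 responses to an identical request,
# each preceded by a different hypothetical user-persona introduction.
mock = [
    ("persona_a", "I'm sorry, I can't help with that."),
    ("persona_a", "I cannot assist with that request."),
    ("persona_b", "Sure, here is the information you asked for."),
    ("persona_b", "I'm sorry, I can't help with that."),
]
print(refusal_rates(mock))  # -> {'persona_a': 1.0, 'persona_b': 0.5}
```

Comparing these per-persona rates is what reveals the kind of disparity the paper reports: identical requests, different refusal likelihoods depending on who the model believes is asking.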

Takeaways, Limitations

Takeaways: The study demonstrates that LLM guardrails can behave in a biased manner depending on a user's demographic characteristics and political leanings, raising serious questions about fairness and equity. It suggests that guardrail design and evaluation should account for user diversity, and highlights the need for methodologies that measure guardrail bias with respect to user background information.
Limitations: The study focuses on a single LLM, GPT-3.5, so further research is needed to determine whether the findings generalize to other models. The persona-generation method may not fully capture the diversity of real users, and the scope of user background information examined may be limited.