This paper examines the security risks posed by modern audio language models (AudioLMs) that process speech directly. End-to-end approaches that bypass a separate transcription step preserve details such as intonation and multi-speaker information, but they also introduce new risks, including the potential misuse of sensitive speech attributes such as speaker identity. We present experimental evidence that end-to-end modeling increases sociotechnical security risks, such as identity inference, biased decision-making, and emotion detection, relative to hierarchical pipeline approaches. We also raise concerns about voiceprint storage and voiceprint-based functionality, which could create uncertainty under existing legal frameworks. We argue that model development and deployment should be guided by the principle of least privilege, which calls for assessing the privacy and security risks of end-to-end modeling against the appropriate scope of information access. Finally, we identify shortcomings in current AudioLM benchmarks and highlight key technical and policy research challenges that must be addressed to enable responsible end-to-end AudioLM deployment.