Speak Up, I’m Listening: Extracting Speech from Zero-Permission VR Sensors

Derin Cayir (Florida International University), Reham Mohamed Aburas (American University of Sharjah), Riccardo Lazzeretti (Sapienza University of Rome), Marco Angelini (Link Campus University of Rome), Abbas Acar (Florida International University), Mauro Conti (University of Padua), Z. Berkay Celik (Purdue University), Selcuk Uluagac (Florida International University)

As Virtual Reality (VR) technologies advance, their use in privacy-sensitive contexts, such as meetings, lectures, simulations, and training, expands. These environments often involve conversations that contain privacy-sensitive information about users and the individuals with whom they interact. The advanced sensors in modern VR devices raise concerns about side-channel attacks that exploit these sensor capabilities. In this paper, we introduce IMMERSPY, a novel acoustic side-channel attack that exploits motion sensors in VR devices to extract sensitive speech content from on-device speakers. We analyze two powerful attacker scenarios: an informed attacker, who possesses labeled data about the victim, and an uninformed attacker, who has no prior information about the victim. We design a Mel-spectrogram CNN-LSTM model to extract digit information (e.g., social security or credit card numbers) by learning the speech-induced vibrations captured by motion sensors. Our experiments show that IMMERSPY detects four consecutive digits with 74% accuracy and 16-digit sequences, such as credit card numbers, with 62% accuracy. Additionally, we use Generative AI text-to-speech models in our attack experiments to illustrate how attackers can create training datasets without the victim's labeled data. Our findings highlight the critical need for security measures in VR domains to mitigate evolving privacy risks. To address this need, we introduce a defense technique that emits inaudible tones through the Head-Mounted Display (HMD) speakers, and we show its effectiveness in mitigating acoustic side-channel attacks.
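To make the feature-extraction step mentioned in the abstract concrete, the following is a minimal sketch of how a one-axis motion-sensor trace could be turned into a log-Mel spectrogram, the kind of input a CNN-LSTM classifier would consume. It is not the authors' pipeline: the function names, the 1000 Hz sampling rate, the frame and hop sizes, the number of Mel bands, and the synthetic 120 Hz test tone are all illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style Mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale from 0 Hz to Nyquist.
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(signal, sample_rate, n_fft=256, hop=64, n_mels=32):
    # Short-time Fourier transform with a Hann window, then Mel pooling in dB.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_bins)
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return 10.0 * np.log10(mel_power + 1e-10)             # (n_frames, n_mels)

# Hypothetical input: one second of a 120 Hz vibration sampled at an assumed 1000 Hz.
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
trace = np.sin(2.0 * np.pi * 120.0 * t)
features = log_mel_spectrogram(trace, sample_rate)
print(features.shape)
```

Each row of `features` is one time frame and each column one Mel band, so the array can be treated as a single-channel image for a convolutional front end followed by a recurrent layer over the time axis. Real sensor rates and window sizes depend on the device and would need to be chosen accordingly.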

