Yichen Gong (Tsinghua University), Delong Ran (Tsinghua University), Xinlei He (Hong Kong University of Science and Technology (Guangzhou)), Tianshuo Cong (Tsinghua University), Anyu Wang (Tsinghua University), Xiaoyun Wang (Tsinghua University)

The safety alignment of Large Language Models (LLMs) is crucial to prevent unsafe content that violates human values.
To ensure this, it is essential to evaluate the robustness of their alignment against diverse malicious attacks.
However, the lack of a large-scale, unified measurement framework hinders a comprehensive understanding of potential vulnerabilities.
To fill this gap, this paper presents the first comprehensive evaluation of existing and newly proposed safety misalignment methods for LLMs. Specifically, we investigate four research questions: (1) evaluating the robustness of LLMs with different alignment strategies, (2) identifying the most effective misalignment method, (3) determining key factors that influence misalignment effectiveness, and (4) exploring various defenses.
The safety misalignment attacks in our paper include system-prompt modification, model fine-tuning, and model editing.
Our findings show that Supervised Fine-Tuning is the most potent attack but requires harmful model responses.
In contrast, our novel Self-Supervised Representation Attack (SSRA) achieves significant misalignment without harmful responses.
We also examine defensive mechanisms such as safety data filter, model detoxification, and our proposed Self-Supervised Representation Defense (SSRD), demonstrating that SSRD can effectively re-align the model.
In conclusion, our unified safety alignment evaluation framework empirically highlights the fragility of the safety alignment of LLMs.

View More Papers

Victim-Centred Abuse Investigations and Defenses for Social Media Platforms

Zaid Hakami (Florida International University and Jazan University), Ashfaq Ali Shafin (Florida International University), Peter J. Clarke (Florida International University), Niki Pissinou (Florida International University), and Bogdan Carbunar (Florida International University)

Read More

Non-intrusive and Unconstrained Keystroke Inference in VR Platforms via...

Tao Ni (City University of Hong Kong), Yuefeng Du (City University of Hong Kong), Qingchuan Zhao (City University of Hong Kong), Cong Wang (City University of Hong Kong)

Read More

SongBsAb: A Dual Prevention Approach against Singing Voice Conversion...

Guangke Chen (Pengcheng Laboratory), Yedi Zhang (National University of Singapore), Fu Song (Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Science; Nanjing Institute of Software Technology), Ting Wang (Stony Brook University), Xiaoning Du (Monash University), Yang Liu (Nanyang Technological University)

Read More

Retrofitting XoM for Stripped Binaries without Embedded Data Relocation

Chenke Luo (Wuhan University), Jiang Ming (Tulane University), Mengfei Xie (Wuhan University), Guojun Peng (Wuhan University), Jianming Fu (Wuhan University)

Read More