Rui Zeng (Zhejiang University), Xi Chen (Zhejiang University), Yuwen Pu (Zhejiang University), Xuhong Zhang (Zhejiang University), Tianyu Du (Zhejiang University), Shouling Ji (Zhejiang University)

Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike static text triggers, which use fixed tokens, words, phrases, or sentences, dynamic backdoor attacks on NLP models design triggers associated with abstract and latent text features (e.g., style), making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection focuses primarily on defending against static backdoor attacks, and the detection of dynamic backdoors in NLP models remains largely unexplored.

This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. At a high level, CLIBE injects a "few-shot perturbation" into the suspect Transformer model by crafting an optimized weight perturbation in the attention layers that makes the perturbed model classify a limited number of reference samples as a target label. CLIBE then leverages the generalization capability of this "few-shot perturbation" to determine whether the original suspect model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness and generality of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one model exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of the backdoor behavior of this model. Moreover, we show that CLIBE can be easily extended to detect backdoored text generation models (e.g., GPT-Neo-1.3B) that are modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without requiring access to trigger input test samples. The code is available at https://github.com/Raytsang123/CLIBE.
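The "few-shot perturbation" idea described above can be illustrated with a toy sketch. This is not the CLIBE implementation: a linear softmax classifier stands in for the Transformer (whose attention-layer weights CLIBE perturbs), and the function names, synthetic data, and hyperparameters below are all hypothetical choices made for illustration. The sketch optimizes a weight perturbation so that a handful of reference samples are classified as a target label, then measures how often held-out samples also flip to that label.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def few_shot_perturbation(W, X_ref, target, steps=1000, lr=0.1, l2=0.01):
    """Gradient descent on a weight perturbation `delta` so that the
    perturbed toy model X @ (W + delta) labels X_ref as `target`.
    The L2 term keeps the perturbation small (an assumed regularizer)."""
    delta = np.zeros_like(W)
    onehot = np.zeros((X_ref.shape[0], W.shape[1]))
    onehot[:, target] = 1.0
    for _ in range(steps):
        probs = softmax(X_ref @ (W + delta))
        # Gradient of mean cross-entropy plus (l2 / 2) * ||delta||^2.
        grad = X_ref.T @ (probs - onehot) / X_ref.shape[0] + l2 * delta
        delta -= lr * grad
    return delta

def generalization_rate(W, delta, X, target):
    """Fraction of samples in X that the perturbed model labels as `target`."""
    preds = np.argmax(X @ (W + delta), axis=1)
    return float(np.mean(preds == target))

# Synthetic stand-ins: a random "suspect" classifier and random inputs.
rng = np.random.default_rng(0)
d, k = 32, 2
W = 0.1 * rng.normal(size=(d, k))
X_ref = rng.normal(size=(4, d))      # the few reference samples
X_held = rng.normal(size=(200, d))   # samples not used in the optimization

delta = few_shot_perturbation(W, X_ref, target=1)
ref_rate = generalization_rate(W, delta, X_ref, target=1)
held_rate = generalization_rate(W, delta, X_held, target=1)
print(f"reference flip rate: {ref_rate:.2f}, held-out flip rate: {held_rate:.2f}")
```

In this toy setting, the held-out flip rate is only a rough proxy for the generalization signal the abstract refers to; how CLIBE turns that signal into a detection decision is specified in the paper and the linked repository, not here.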
