Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder (Auburn University)

Tokenization is fundamental in assembly code analysis, affecting both intrinsic characteristics, such as vocabulary size and semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction, a critical problem in binary code analysis.
To this end, we conduct a thorough study of various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, and that intrinsic metrics only partially predict extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
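To make the notions of a pre-tokenization rule and an intrinsic metric concrete, the following is a minimal Python sketch. It is not the method used in the study: the regular expression, the `<IMM>` placeholder for immediates, and the two statistics (vocabulary size and mean tokens per instruction) are illustrative assumptions, and the three x86-style instructions are made-up sample input.

```python
import re

# Hypothetical pre-tokenization rule (an assumption, not the paper's rule):
# split an instruction into hex immediates, identifiers (mnemonics and
# registers), decimal numbers, and single punctuation symbols.
TOKEN_PATTERN = re.compile(
    r"0x[0-9a-fA-F]+|[A-Za-z_.][A-Za-z0-9_.]*|\d+|[^\sA-Za-z0-9_]"
)
HEX_IMMEDIATE = re.compile(r"0x[0-9a-fA-F]+")


def pre_tokenize(instruction, normalize_immediates=False):
    """Split one assembly instruction into coarse tokens."""
    tokens = TOKEN_PATTERN.findall(instruction)
    if normalize_immediates:
        # Optional preprocessing step: collapse every hex immediate
        # into a single placeholder to shrink the vocabulary.
        tokens = ["<IMM>" if HEX_IMMEDIATE.fullmatch(t) else t for t in tokens]
    return tokens


def intrinsic_stats(instructions, normalize_immediates=False):
    """Return (vocabulary size, mean tokens per instruction)."""
    vocabulary = set()
    total_tokens = 0
    for instruction in instructions:
        tokens = pre_tokenize(instruction, normalize_immediates)
        vocabulary.update(tokens)
        total_tokens += len(tokens)
    return len(vocabulary), total_tokens / len(instructions)


# Made-up sample input for illustration only.
sample = ["mov eax, [rbp-0x8]", "add eax, 0x10", "ret"]

print(pre_tokenize(sample[0]))
print(intrinsic_stats(sample))
print(intrinsic_stats(sample, normalize_immediates=True))
```

In this sketch, normalizing immediates reduces the vocabulary while leaving the number of tokens per instruction unchanged, which illustrates the kind of trade-off a preprocessing choice can introduce. Whether such a change helps a downstream task is an empirical question that this snippet does not answer.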
