Abstract
Data contamination is a critical barrier to the widespread industrial adoption of advanced software engineering techniques that leverage large language models (LLMs). It occurs when evaluation data inadvertently overlaps with the public code repositories used to train LLMs, severely undermining the credibility of performance evaluations. Code refactoring, which comprises code restructuring and variable renaming, has emerged as a promising measure to mitigate data contamination. However, the lack of automated code refactoring tools and scientifically validated refactoring techniques has hampered its industrial implementation. To bridge this gap, this paper presents the first systematic study of the efficacy of code refactoring operators at multiple scales (method level, class level, and cross-class level) and in different programming languages. We develop CodeCleaner, which includes 11 operators for Python at multiple scales and 4 for Java. We explain the rationale for why these operators can mitigate data contamination and use both data-wise metrics (e.g., N-gram matching overlap ratio) and model-wise metrics (e.g., perplexity) to quantify their efficacy. We find a 75% drop in overlap ratio when all operators in CodeCleaner are applied, demonstrating their effectiveness in addressing data contamination. In addition, we migrate four operators to Java, showing that they generalize to another language. We also observe an average 19% decrease in LLMs’ performance after applying our operators. CodeCleaner is available online at https://github.com/ArabelaTso/CodeCleaner-v1 to facilitate further studies on mitigating LLM data contamination.
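To make the two ideas in the abstract concrete, the following is a minimal sketch of a variable-renaming operator and a token-level N-gram overlap ratio. It is an illustration only and is not taken from CodeCleaner: the names `code_tokens`, `overlap_ratio`, `RenameIdentifiers`, and `rename_variables`, the choice of trigrams, and the definition of the ratio (share of the candidate's distinct N-grams that also occur in the reference) are all assumptions. It requires Python 3.9+ for `ast.unparse`.

```python
import ast
import io
import tokenize

# Token types that carry layout only and are ignored when comparing code.
_SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
         tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}


def code_tokens(source: str) -> list[str]:
    """Lex Python source into a flat list of token strings."""
    readline = io.StringIO(source).readline
    return [t.string for t in tokenize.generate_tokens(readline)
            if t.type not in _SKIP]


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, reference: str, n: int = 3) -> float:
    """Share of the candidate's distinct n-grams found in the reference.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    cand = ngrams(code_tokens(candidate), n)
    if not cand:
        return 0.0
    return len(cand & ngrams(code_tokens(reference), n)) / len(cand)


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename_variables(source: str, mapping: dict[str, str]) -> str:
    """Apply the renaming and regenerate source code from the AST."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


# Hypothetical benchmark snippet and renaming map, for illustration only.
ORIGINAL = """
def total(prices, tax):
    result = 0
    for p in prices:
        result += p
    return result * (1 + tax)
"""
MAPPING = {"prices": "v0", "tax": "v1", "result": "v2", "p": "v3"}

refactored = rename_variables(ORIGINAL, MAPPING)
print(refactored)
print("overlap with original:", overlap_ratio(refactored, ORIGINAL))
```

The renamed function computes the same result as the original while sharing fewer token trigrams with it, which is the effect a data-wise metric is meant to capture. A real operator would also need scope analysis to avoid renaming globals, builtins, or keyword arguments at call sites; this sketch skips that.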
| Original language | English |
|---|---|
| Pages | 71-83 |
| Number of pages | 13 |
| Publication status | Published - 27 Oct 2025 |
| Event | 16th International Conference on Internetware, Internetware 2025 - Trondheim, Norway. Duration: 20 Jun 2025 → 22 Jun 2025 |
Conference
| Conference | 16th International Conference on Internetware, Internetware 2025 |
|---|---|
| Country/Territory | Norway |
| City | Trondheim |
| Period | 20/06/25 → 22/06/25 |
Bibliographical note
Publisher Copyright: © 2025 Copyright held by the owner/author(s).
Keywords
- Code Mutation
- Code Refactoring
- Data Contamination
- Empirical Study
- Large Language Model