CodeCleaner: Mitigating Data Contamination for LLM Benchmarking

Jialun Cao, Songqiang Chen*, Wuqi Zhang, Hau Ching Lo, Yeting Li*, Shing Chi Cheung

*Corresponding author for this work

Research output: Contribution to conference › Conference Paper › peer-review

Abstract

Data contamination presents a critical barrier to the widespread industrial adoption of advanced software engineering techniques that leverage large language models (LLMs). It occurs when evaluation data inadvertently overlaps with the public code repositories used to train LLMs, severely undermining the credibility of performance evaluations. Code refactoring, which comprises code restructuring and variable renaming, has emerged as a promising measure to mitigate data contamination. However, the lack of automated code refactoring tools and scientifically validated refactoring techniques has hampered widespread industrial implementation. To bridge this gap, this paper presents the first systematic study of the efficacy of code refactoring operators at multiple scales (method level, class level, and cross-class level) and in different programming languages. We develop CodeCleaner, which includes 11 operators for Python at multiple scales and 4 for Java. We elaborate on the rationale for why these operators could resolve data contamination, and we use both data-wise metrics (e.g., N-gram matching overlap ratio) and model-wise metrics (e.g., perplexity) to quantify their efficacy after the operators are applied. Applying all operators in CodeCleaner yields a 75% drop in overlap ratio, demonstrating their effectiveness in addressing data contamination. In addition, we migrate four operators to Java, showing their generalizability to another language. We also observe an average 19% decrease in LLMs' performance after applying our operators. We make CodeCleaner available online at https://github.com/ArabelaTso/CodeCleaner-v1 to facilitate further studies on mitigating LLM data contamination.
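To make two ideas from the abstract concrete, the sketch below renames variables in a small Python function and then measures an N-gram overlap ratio against the original. It is an illustration only and not CodeCleaner's implementation: the names `rename_variables` and `overlap_ratio`, the `var_i` naming scheme, whitespace tokenization, and the choice of n = 3 are all assumptions made for this example. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Rename bound variables and parameters to var_0, var_1, ... (hypothetical scheme)."""

    def __init__(self, names):
        # Sorted so the mapping is deterministic across runs.
        self.mapping = {name: f"var_{i}" for i, name in enumerate(sorted(names))}

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename_variables(source: str) -> str:
    """Return source with assignment targets and parameters renamed."""
    tree = ast.parse(source)
    # Collect only names bound in this code, so builtins and
    # function names are left untouched.
    bound = {
        n.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    bound |= {n.arg for n in ast.walk(tree) if isinstance(n, ast.arg)}
    return ast.unparse(RenameVariables(bound).visit(tree))


def ngrams(tokens, n):
    """Set of distinct n-grams over a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, reference: str, n: int = 3) -> float:
    """Fraction of the candidate's distinct n-grams that also occur in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    return len(cand & ref) / len(cand) if cand else 0.0


original = "def add(total, item):\n    result = total + item\n    return result\n"
cleaned = rename_variables(original)

print(cleaned)
print("overlap before:", overlap_ratio(original, original))
print("overlap after: ", overlap_ratio(cleaned, original))
```

This simple renamer does not update keyword arguments at call sites or handle nested scopes, so it is only safe on small, self-contained functions like the one shown.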

Original language: English
Pages: 71-83
Number of pages: 13
Publication status: Published - 27 Oct 2025
Event: 16th International Conference on Internetware, Internetware 2025 - Trondheim, Norway
Duration: 20 Jun 2025 → 22 Jun 2025

Conference

Conference: 16th International Conference on Internetware, Internetware 2025
Country/Territory: Norway
City: Trondheim
Period: 20/06/25 → 22/06/25

Bibliographical note

Publisher Copyright:
© 2025 Copyright held by the owner/author(s).

Keywords

  • Code Mutation
  • Code Refactoring
  • Data Contamination
  • Empirical Study
  • Large Language Model
