Abstract
Model editing has become an increasingly popular method for efficiently updating knowledge within language models. Current approaches primarily focus on reliability, generalization, and locality, and many methods excel across these criteria. Some recent studies have highlighted potential pitfalls of these editing methods, such as knowledge distortion and conflicts; however, the general capabilities of post-edited language models remain largely unexplored. In this paper, we conduct a comprehensive evaluation of various editing methods across different language models and report the following findings. (1) Existing editing methods inevitably degrade performance on general benchmarks, indicating that they can preserve a model's general capabilities only within a limited number of edits; as the number of edits increases, the model's intrinsic knowledge structure may be disrupted or even destroyed outright. (2) Instruction-tuned models are more robust to editing, showing a smaller drop in general knowledge after editing. (3) Larger language models are more resistant to editing than smaller ones. (4) The safety of edited models is significantly compromised, even for models that were originally safety-aligned. Our findings indicate that current editing methods are suitable only for small-scale knowledge updates within language models, which motivates further research on more practical and reliable editing methods. Details of the code and reproduction can be found in Appendix F.
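The evaluation protocol the abstract describes, applying edits sequentially and tracking general-benchmark performance as the edit count grows, can be summarized with the minimal sketch below. Everything in it is hypothetical scaffolding: `apply_edit` and `eval_general` are placeholder callables standing in for an editing method (e.g., ROME- or MEMIT-style) and a general-capability benchmark, not the paper's actual code, which is detailed in its Appendix F.

```python
# Hypothetical sketch of a sequential-editing evaluation loop.
# apply_edit() and eval_general() are placeholders, not the paper's code.

from typing import Any, Callable

def sequential_edit_eval(
    model: Any,
    edits: list[dict],                       # e.g. {"prompt": ..., "target_new": ...}
    apply_edit: Callable[[Any, dict], Any],  # one editing step (ROME, MEMIT, ...)
    eval_general: Callable[[Any], float],    # score on a general benchmark
    eval_every: int = 10,
) -> list[tuple[int, float]]:
    """Apply edits one by one and record general-benchmark scores."""
    scores = [(0, eval_general(model))]      # pre-edit baseline
    for i, edit in enumerate(edits, start=1):
        model = apply_edit(model, edit)      # inject one knowledge update
        if i % eval_every == 0:
            scores.append((i, eval_general(model)))
    return scores
```

Under the paper's findings, the returned scores would be expected to decline as the edit count grows, with the decline steeper for smaller or non-instruction-tuned models.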
| Original language | English |
|---|---|
| Journal | Advances in Neural Information Processing Systems |
| Volume | 37 |
| Publication status | Published - 2024 |
| Event | 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada |
| Duration | 9 Dec 2024 → 15 Dec 2024 |
Bibliographical note
Publisher Copyright: © 2024 Neural Information Processing Systems Foundation. All rights reserved.