(Why) is my prompt getting worse? Rethinking regression testing for evolving LLM APIs

Wanqin Ma*, Chenyang Yang, Christian Kästner

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report › Conference Paper published in a book › peer-review

Abstract

Large Language Models (LLMs) are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. This can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. Based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving LLM APIs. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in LLM APIs.

Original language: English
Title of host publication: Proceedings - 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, CAIN 2024
Publisher: Association for Computing Machinery, Inc
Pages: 166-171
Number of pages: 6
ISBN (Electronic): 9798400705915
Publication status: Published - 14 Apr 2024
Externally published: Yes
Event: 3rd International Conference on AI Engineering, CAIN 2024, co-located with the 46th International Conference on Software Engineering, ICSE 2024 - Lisbon, Portugal
Duration: 14 Apr 2024 to 15 Apr 2024

Publication series

Name: Proceedings - 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, CAIN 2024

Conference

Conference: 3rd International Conference on AI Engineering, CAIN 2024, co-located with the 46th International Conference on Software Engineering, ICSE 2024
Country/Territory: Portugal
City: Lisbon
Period: 14/04/24 to 15/04/24

Bibliographical note

Publisher Copyright:
© 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Keywords

  • large language models (LLM)
  • regression testing
