Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

  • Tsz On Li
  • Wenxi Zong
  • Yibo Wang
  • Haoye Tian
  • Ying Wang*
  • Shing Chi Cheung*
  • Jeff Kramer

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

40 Citations (Scopus)

Abstract

Automated detection of software failures is an important but challenging software engineering task. It involves finding, in a vast search space, the failure-inducing test cases that contain an input triggering the software fault and an oracle asserting the incorrect execution. We are motivated to study how far this outstanding challenge can be addressed by recent advances in large language models (LLMs) such as ChatGPT. However, our study reveals that ChatGPT has a relatively low success rate (28.8%) in finding correct failure-inducing test cases for buggy programs. A possible conjecture is that finding failure-inducing test cases requires analyzing the subtle differences (nuances) between the tokens of a program's correct version and those of its buggy version. When these two versions have similar sets of tokens and attentions, ChatGPT is weak in distinguishing their differences. We find that ChatGPT can successfully generate failure-inducing test cases when it is guided to focus on these nuances.

Our solution is inspired by an interesting observation that ChatGPT can infer the intended functionality of buggy code when that code is close to its correct version. Driven by this observation, we develop a novel technique, called Differential Prompting, to effectively find failure-inducing test cases with the help of compilable code synthesized from the inferred intention. Prompts are constructed based on the nuances between the given version and the synthesized code.

We evaluate Differential Prompting on QuixBugs (a popular benchmark of buggy programs) and recent programs published on Codeforces (a popular programming contest portal, which is also an official benchmark of ChatGPT). We compare Differential Prompting with two baselines constructed using conventional ChatGPT prompting and Pynguin (the state-of-the-art unit test generation tool for Python programs). Our evaluation results show that for programs from QuixBugs, Differential Prompting achieves a success rate of 75.0% in finding failure-inducing test cases, outperforming the best baseline by 2.6X. For programs from Codeforces, its success rate is 66.7%, outperforming the best baseline by 4.0X.
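The final step of the pipeline described in the abstract amounts to differential testing: once a reference version has been synthesized from the inferred intention, a failure-inducing test case is any input on which the given program and the reference disagree, with the reference's output serving as the oracle. The following is a minimal sketch of that step only; the ChatGPT-driven parts (intention inference and code synthesis) are replaced by hand-written stand-ins, and all function names are illustrative, not taken from the paper.

```python
import random

def differential_test(program_under_test, reference, gen_input, trials=200):
    """Search for a failure-inducing test case: an input on which the
    program under test disagrees with the synthesized reference version.
    Returns (input, expected_output) -- i.e. a test with its oracle --
    or None if no disagreement is found within the trial budget."""
    for _ in range(trials):
        x = gen_input()
        expected = reference(x)          # oracle from the reference version
        if program_under_test(x) != expected:
            return x, expected           # failure-inducing test case found
    return None

# Hand-written stand-ins for the two program versions (illustrative only).
def buggy_second_largest(xs):
    # Faulty version under test: off-by-one index returns the largest.
    return sorted(xs, reverse=True)[0]

def reference_second_largest(xs):
    # Stand-in for the code the LLM would synthesize from the
    # inferred intention ("return the second-largest element").
    return sorted(xs, reverse=True)[1]

def gen_input():
    # Random lists of five distinct integers.
    return random.sample(range(100), k=5)

found = differential_test(buggy_second_largest, reference_second_largest, gen_input)
```

Because the generated lists always contain distinct values, the two versions disagree on every input here, so a failure-inducing test is found immediately; in the paper's setting the search is harder, since the nuance between the two versions manifests only on specific inputs.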

Original language: English
Title of host publication: Proceedings - 2023 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
Publisher: IEEE
Pages: 14-26
Number of pages: 13
ISBN (Electronic): 9798350329964
DOIs
Publication status: Published - 2023
MoE publication type: A4 Conference publication
Event: IEEE/ACM International Conference on Automated Software Engineering - Echternach, Luxembourg
Duration: 11 Sept 2023 – 15 Sept 2023
Conference number: 38

Conference

Conference: IEEE/ACM International Conference on Automated Software Engineering
Abbreviated title: ASE
Country/Territory: Luxembourg
City: Echternach
Period: 11/09/2023 – 15/09/2023

Keywords

  • failure-inducing test cases
  • large language models
  • program generation
  • program intention inference
