Synthetic Students: A Comparative Study of Bug Distribution Between Large Language Models and Computing Students

Stephen MacNeil, Magdalena Rogalska, Juho Leinonen, Paul Denny, Arto Hellas, Xandria Crosland

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

Large language models (LLMs) present an exciting opportunity for generating synthetic classroom data. Such data could include code containing a typical distribution of errors, simulated student behavior to address the cold start problem when developing education tools, and synthetic user data when access to authentic data is restricted due to privacy reasons. In this research paper, we conduct a comparative study examining the distribution of bugs generated by LLMs in contrast to those produced by computing students. Leveraging data from two previous large-scale analyses of student-generated bugs, we investigate whether LLMs can be coaxed to exhibit bug patterns that are similar to authentic student bugs when prompted to inject errors into code. The results suggest that, when unguided, LLMs do not generate plausible error distributions, and many of the generated errors are unlikely to be generated by real students. However, with guidance including descriptions of common errors and typical frequencies, LLMs can be shepherded to generate realistic distributions of errors in synthetic code.
Original language: English
Title of host publication: SIGCSE Virtual 2024: Proceedings of the 2024 on ACM Virtual Global Computing Education Conference V. 1
Place of Publication: New York, NY, USA
Publisher: ACM
Pages: 137–143
ISBN (Electronic): 979-8-4007-0598-4
Publication status: Published - 5 Dec 2024
MoE publication type: A4 Conference publication
Event: ACM Virtual Global Computing Education Conference - Virtual, Online
Duration: 5 Dec 2024 – 8 Dec 2024
Conference number: 1
https://sigcsevirtual.acm.org/

Conference

Conference: ACM Virtual Global Computing Education Conference
Abbreviated title: SIGCSE Virtual
City: Virtual, Online
Period: 05/12/2024 – 08/12/2024

Keywords

  • buggy code
  • generative ai
  • gpt-4
  • llms
  • synthetic data
