Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests

A. Hellas, J. Leinonen, S. Sarsa, C. Koutcheme, L. Kujanpää, and J. Sorva.
Proceedings of the 2023 ACM Conference on International Computing Education Research V.1, pages 93-105. ACM (August 2023)
DOI: 10.1145/3568813.3600139

Abstract

Background and Context: Over the past year, large language models (LLMs) have taken the world by storm. In computing education, like in other walks of life, many opportunities and threats have emerged as a consequence. Objectives: In this article, we explore such opportunities and threats in a specific area: responding to student programmers’ help requests. More specifically, we assess how good LLMs are at identifying issues in problematic code that students request help on. Method: We collected a sample of help requests and code from an online programming course. We then prompted two different LLMs (OpenAI Codex and GPT-3.5) to identify and explain the issues in the students’ code and assessed the LLM-generated answers both quantitatively and qualitatively. Findings: GPT-3.5 outperforms Codex in most respects. Both LLMs frequently find at least one actual issue in each student program (GPT-3.5 in 90% of the cases). Neither LLM excels at finding all the issues (GPT-3.5 finding them 57% of the time). False positives are common (40% chance for GPT-3.5). The advice that the LLMs provide on the issues is often sensible. The LLMs perform better on issues involving program logic rather than on output formatting. Model solutions are frequently provided even when the LLM is prompted not to. LLM responses to prompts in a non-English language are only slightly worse than responses to English prompts. Implications: Our results continue to highlight the utility of LLMs in programming education. At the same time, the results highlight the unreliability of LLMs: LLMs make some of the same mistakes that students do, perhaps especially when formatting output as required by automated assessment systems. Our study informs teachers interested in using LLMs as well as future efforts to customize LLMs for the needs of programming education.

BibTeX key: Hellas_2023
entry type: inproceedings
booktitle: Proceedings of the 2023 ACM Conference on International Computing Education Research V.1
year: 2023
month: aug
pages: 93-105
publisher: ACM
series: ICER 2023
collection: ICER 2023
DOI: 10.1145/3568813.3600139
url: http://dx.doi.org/10.1145/3568813.3600139

Cite this publication

%0 Conference Paper
%1 Hellas_2023
%A Hellas, Arto
%A Leinonen, Juho
%A Sarsa, Sami
%A Koutcheme, Charles
%A Kujanpää, Lilja
%A Sorva, Juha
%B Proceedings of the 2023 ACM Conference on International Computing Education Research V.1
%D 2023
%I ACM
%K feedback help icer2023 llm programming progtutor
%P 93-105
%R 10.1145/3568813.3600139
%T Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests
%U http://dx.doi.org/10.1145/3568813.3600139
%X Background and Context: Over the past year, large language models (LLMs) have taken the world by storm. In computing education, like in other walks of life, many opportunities and threats have emerged as a consequence. Objectives: In this article, we explore such opportunities and threats in a specific area: responding to student programmers’ help requests. More specifically, we assess how good LLMs are at identifying issues in problematic code that students request help on. Method: We collected a sample of help requests and code from an online programming course. We then prompted two different LLMs (OpenAI Codex and GPT-3.5) to identify and explain the issues in the students’ code and assessed the LLM-generated answers both quantitatively and qualitatively. Findings: GPT-3.5 outperforms Codex in most respects. Both LLMs frequently find at least one actual issue in each student program (GPT-3.5 in 90% of the cases). Neither LLM excels at finding all the issues (GPT-3.5 finding them 57% of the time). False positives are common (40% chance for GPT-3.5). The advice that the LLMs provide on the issues is often sensible. The LLMs perform better on issues involving program logic rather than on output formatting. Model solutions are frequently provided even when the LLM is prompted not to. LLM responses to prompts in a non-English language are only slightly worse than responses to English prompts. Implications: Our results continue to highlight the utility of LLMs in programming education. At the same time, the results highlight the unreliability of LLMs: LLMs make some of the same mistakes that students do, perhaps especially when formatting output as required by automated assessment systems. Our study informs teachers interested in using LLMs as well as future efforts to customize LLMs for the needs of programming education.

@inproceedings{Hellas_2023,
  abstract = {Background and Context: Over the past year, large language models (LLMs) have taken the world by storm. In computing education, like in other walks of life, many opportunities and threats have emerged as a consequence. Objectives: In this article, we explore such opportunities and threats in a specific area: responding to student programmers’ help requests. More specifically, we assess how good LLMs are at identifying issues in problematic code that students request help on. Method: We collected a sample of help requests and code from an online programming course. We then prompted two different LLMs (OpenAI Codex and GPT-3.5) to identify and explain the issues in the students’ code and assessed the LLM-generated answers both quantitatively and qualitatively. Findings: GPT-3.5 outperforms Codex in most respects. Both LLMs frequently find at least one actual issue in each student program (GPT-3.5 in 90% of the cases). Neither LLM excels at finding all the issues (GPT-3.5 finding them 57% of the time). False positives are common (40% chance for GPT-3.5). The advice that the LLMs provide on the issues is often sensible. The LLMs perform better on issues involving program logic rather than on output formatting. Model solutions are frequently provided even when the LLM is prompted not to. LLM responses to prompts in a non-English language are only slightly worse than responses to English prompts. Implications: Our results continue to highlight the utility of LLMs in programming education. At the same time, the results highlight the unreliability of LLMs: LLMs make some of the same mistakes that students do, perhaps especially when formatting output as required by automated assessment systems. Our study informs teachers interested in using LLMs as well as future efforts to customize LLMs for the needs of programming education.},
  added-at = {2023-12-09T19:29:38.000+0100},
  author = {Hellas, Arto and Leinonen, Juho and Sarsa, Sami and Koutcheme, Charles and Kujanpää, Lilja and Sorva, Juha},
  biburl = {https://www.bibsonomy.org/bibtex/2d209fdfc8125bc4a8bc3da6a53d79cc3/brusilovsky},
  booktitle = {Proceedings of the 2023 ACM Conference on International Computing Education Research V.1},
  collection = {ICER 2023},
  description = {Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests | Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1},
  doi = {10.1145/3568813.3600139},
  interhash = {1ca0cd703ae6f44dd934fa1f17705986},
  intrahash = {d209fdfc8125bc4a8bc3da6a53d79cc3},
  keywords = {feedback help icer2023 llm programming progtutor},
  month = aug,
  pages = {93-105},
  publisher = {ACM},
  series = {ICER 2023},
  timestamp = {2023-12-09T19:37:04.000+0100},
  title = {Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests},
  url = {http://dx.doi.org/10.1145/3568813.3600139},
  year = 2023
}
