Inproceedings,

Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses

J. Savelka, A. Agarwal, M. An, C. Bogart, and M. Sakr.
Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, volume 1 of ICER '23, page 78–92. New York, NY, USA, Association for Computing Machinery, (Sep 10, 2023)
DOI: 10.1145/3568813.3600142

Abstract

This paper studies recent developments in large language models’ (LLM) abilities to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates of its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming classes (e.g., cheating). Recent studies show that while the technology performs surprisingly well on diverse sets of assessment instruments employed in typical programming classes the performance is usually not sufficient to pass the courses. The release of GPT-4 largely emphasized notable improvements in the capabilities related to handling assessments originally designed for human test-takers. This study is the necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to the previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed into multiple files (599 exercises overall). Additionally, we analyze the assessments that were not handled well by GPT-4 to understand the current limitations of the model, as well as its capabilities to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing the typical programming class’ assessments (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4’s handling of MCQs and coding exercises, the rate of improvement across the recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses. These findings could be leveraged by educators and institutions to adapt the design of programming assessments as well as to fuel the necessary discussions into how programming classes should be updated to reflect the recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use widely accessible technology that can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable programming knowledge and skills assessments.

BibTeX key: Savelka2023
entry type: inproceedings
address: New York, NY, USA
booktitle: Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1
year: 2023
month: 9
day: 10
pages: 78–92
publisher: Association for Computing Machinery
series: ICER '23
volume: 1
isbn: 9781450399760
location: Chicago, IL, USA
DOI: 10.1145/3568813.3600142
url: https://doi.org/10.1145/3568813.3600142

Users

Comments and Reviewsshow / hide

Please log in to take part in the discussion (add own reviews or comments).

Cite this publication

%0 Conference Paper %1 Savelka2023 %A Savelka, Jaromir %A Agarwal, Arav %A An, Marshall %A Bogart, Chris %A Sakr, Majd %B Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1 %C New York, NY, USA %D 2023 %I Association for Computing Machinery %K assessment icer2023 llm programming %P 78–92 %R 10.1145/3568813.3600142 %T Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses %U https://doi.org/10.1145/3568813.3600142 %V 1 %X This paper studies recent developments in large language models’ (LLM) abilities to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates of its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming classes (e.g., cheating). Recent studies show that while the technology performs surprisingly well on diverse sets of assessment instruments employed in typical programming classes the performance is usually not sufficient to pass the courses. The release of GPT-4 largely emphasized notable improvements in the capabilities related to handling assessments originally designed for human test-takers. This study is the necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to the previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed into multiple files (599 exercises overall). Additionally, we analyze the assessments that were not handled well by GPT-4 to understand the current limitations of the model, as well as its capabilities to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing the typical programming class’ assessments (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4’s handling of MCQs and coding exercises, the rate of improvement across the recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses. These findings could be leveraged by educators and institutions to adapt the design of programming assessments as well as to fuel the necessary discussions into how programming classes should be updated to reflect the recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use widely accessible technology that can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable programming knowledge and skills assessments. %@ 9781450399760

@inproceedings{Savelka2023, abstract = {This paper studies recent developments in large language models’ (LLM) abilities to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates of its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming classes (e.g., cheating). Recent studies show that while the technology performs surprisingly well on diverse sets of assessment instruments employed in typical programming classes the performance is usually not sufficient to pass the courses. The release of GPT-4 largely emphasized notable improvements in the capabilities related to handling assessments originally designed for human test-takers. This study is the necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to the previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed into multiple files (599 exercises overall). Additionally, we analyze the assessments that were not handled well by GPT-4 to understand the current limitations of the model, as well as its capabilities to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing the typical programming class’ assessments (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4’s handling of MCQs and coding exercises, the rate of improvement across the recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses. These findings could be leveraged by educators and institutions to adapt the design of programming assessments as well as to fuel the necessary discussions into how programming classes should be updated to reflect the recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use widely accessible technology that can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable programming knowledge and skills assessments.}, added-at = {2023-12-18T19:37:13.000+0100}, address = {New York, NY, USA}, author = {Savelka, Jaromir and Agarwal, Arav and An, Marshall and Bogart, Chris and Sakr, Majd}, biburl = {https://www.bibsonomy.org/bibtex/2a2bce8b693e3c9774045b5b40e22b055/brusilovsky}, booktitle = {Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1}, day = 10, description = {Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses | Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1}, doi = {10.1145/3568813.3600142}, interhash = {dc3814c03e3429bbe7a320d0d8431222}, intrahash = {a2bce8b693e3c9774045b5b40e22b055}, isbn = {9781450399760}, keywords = {assessment icer2023 llm programming}, location = {Chicago, IL, USA}, month = {9}, pages = {78–92}, publisher = {Association for Computing Machinery}, series = {ICER '23}, timestamp = {2023-12-18T19:37:13.000+0100}, title = {Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses}, url = {https://doi.org/10.1145/3568813.3600142}, volume = 1, year = 2023 }

BibSonomy

Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses

Abstract

Tags

Users

Comments and Reviewsshow / hide

Cite this publication

More citation styles

search on