Academics from 186 universities wanted to know how an AI chatbot would do on accounting exams. Recently, OpenAI’s newest chatbot, GPT-4, which uses machine learning to generate natural language text, passed the bar exam with a score in the 90th percentile, passed 13 of 15 AP exams, and nearly got a perfect score on the GRE Verbal test, according to the folks at OpenAI.

So, in response to the ongoing debate about how AI chatbots should factor into education, David Wood, a professor of accounting at Brigham Young University, decided to recruit as many professors as possible to see how OpenAI’s original chatbot, ChatGPT, would fare against actual university accounting students on accounting exams.

“When this technology first came out, everyone was worried that students could now use it to cheat,” said Wood, the lead author on the study. “But opportunities to cheat have always existed. So, for us, we’re trying to focus on what we can do with this technology now that we couldn’t do before to improve the teaching process for faculty and the learning process for students. Testing it out was eye-opening.”

Wood pitched the need for study co-authors on social media at the end of last year.

He ended up getting a staggering 327 co-authors from 186 educational institutions in 14 countries participating in the research, contributing 25,181 classroom accounting exam questions. They also recruited undergrad BYU students—including Wood’s daughter, Jessica—to feed another 2,268 textbook test bank questions to ChatGPT. The questions covered accounting information systems (AIS), auditing, financial accounting, managerial accounting and tax, and varied in difficulty and type, such as true or false, multiple choice, and short answer.

While ChatGPT’s score was an impressive 47.4%, the students performed much better with an overall average score of 76.7%. On 11.3% of the questions, ChatGPT scored higher than the student average, doing particularly well on AIS and auditing. But the chatbot did worse on tax, financial, and managerial assessments, possibly because ChatGPT struggled with the mathematical processes those subjects require, according to the researchers.

When it came to question type, ChatGPT did better on true or false questions (68.7% correct) and multiple-choice questions (59.5%), but it struggled with short-answer questions (between 28.7% and 39.1%). In general, higher-order questions were harder for ChatGPT to answer, and sometimes ChatGPT would provide authoritative written descriptions for incorrect answers or answer the same question in different ways, according to the researchers.

The researchers also revealed other interesting trends through the study, including:

ChatGPT doesn’t always recognize when it is doing math and makes nonsensical errors, such as adding two numbers in a subtraction problem or dividing numbers incorrectly.
ChatGPT often provides explanations for its answers, even if they are incorrect. Other times, ChatGPT’s descriptions are accurate, but it will then proceed to select the wrong multiple-choice answer.
ChatGPT sometimes makes up facts. For example, when providing a reference, it generates a real-looking reference that is completely fabricated. The work and sometimes the authors do not even exist.

“It’s not perfect; you’re not going to be using it for everything,” said Jessica Wood, a freshman at BYU. “Trying to learn solely by using ChatGPT is a fool’s errand.”

However, the researchers expect the newer chatbot, GPT-4, to do far better on the accounting questions posed in their study. What they find most promising is how the chatbot can help improve teaching and learning, including the ability to design and test assignments, or perhaps be used for drafting portions of a project.

“It’s an opportunity to reflect on whether we are teaching value-added information or not,” said Melissa Larson, a BYU accounting professor and study co-author. “This is a disruption, and we need to assess where we go from here. Of course, I’m still going to have teaching assistants, but this is going to force us to use them in different ways.”

The paper, “The ChatGPT Artificial Intelligence Chatbot: How well does it answer accounting assessment questions?”, was published in Issues in Accounting Education.