A Comparison of Teacher and Artificial Intelligence Grading and Feedback on Secondary School Writing
As technology advances, a perennial question is the extent to which it will change or replace jobs traditionally done by humans. From self-checkouts at the grocery store to AI that can detect serious diseases in medical scans, workers in every field are finding themselves working alongside tools that can do parts of their jobs. As the availability of AI tools in the classroom grows with no sign of slowing down, teaching has become another profession whose work is shared with such tools.
We wondered about the role of AI in one particular teaching practice: assessing student learning. Grading and giving feedback on student work takes time; as a result, many writing instructors are reluctant to assign longer pieces, and students often wait a long time to receive scores and comments. By helping to grade student work, AI could save substantial time and reclaim lost learning potential. Then again, we wondered: can AI grading and feedback systems really help students as much as teachers do?
“The teacher has the ability to say, ‘What were you trying to tell me? Because I don’t understand.’ Artificial intelligence tries to fix the writing process and formatting – fixing what’s already there – rather than trying to understand what the student is trying to say.”
We recently completed an evaluation of an AI-equipped platform through which middle school students can draft, submit, and revise argumentative essays based on pre-planned writing prompts. Each time students clicked “submit,” they received mastery-based scores (1-4) in four writing dimensions (Claim and Focus, Support and Evidence, Organization, and Language and Style), along with dimension-aligned comments offering observations and suggestions for improvement – all generated by the AI immediately after submission.
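For concreteness, the sketch below shows how one submission’s assessment could be represented as data. This is our illustration only; the platform’s actual schema is not public, and every name here is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical representation of one AI assessment as described above.
# All names are illustrative; they are not the platform's real API.

DIMENSIONS = ("Claim and Focus", "Support and Evidence",
              "Organization", "Language and Style")

@dataclass
class DimensionFeedback:
    dimension: str  # one of DIMENSIONS
    score: int      # mastery-based, 1 to 4
    comment: str    # an observation plus a suggestion for improvement

@dataclass
class EssayAssessment:
    essay_id: str
    feedback: list[DimensionFeedback]  # one entry per dimension

    def total_score(self) -> int:
        # Totals range from 4 (all 1s) to 16 (all 4s).
        return sum(f.score for f in self.feedback)
```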
To compare the AI’s scores and feedback with those given by actual teachers, we convened 16 middle school writing teachers who had used the platform with their students during the 2021-22 school year. To ensure reliable understanding and application of the scoring criteria, we first worked together to calibrate on the rubric, and then randomly assigned each teacher 10 essays (none from their own students) to score and give feedback on. This produced a total of 160 teacher-assessed essays whose scores and feedback we could compare directly with the AI’s on the same essays.
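As an illustration of the assignment step, a minimal sketch follows: each teacher receives 10 essays drawn only from other teachers’ classrooms. The data shapes are assumptions made for the example, not the study’s actual procedure code.

```python
import random

# Illustrative sketch: each of the 16 teachers receives 10 essays,
# never drawn from their own students. Data shapes are assumptions.

def assign_essays(essays_by_teacher, per_teacher=10, seed=2022):
    rng = random.Random(seed)
    assignments = {}
    for teacher in essays_by_teacher:
        # Pool: every essay except those written by this teacher's students.
        pool = [essay
                for other, essays in essays_by_teacher.items()
                if other != teacher
                for essay in essays]
        assignments[teacher] = rng.sample(pool, per_teacher)
    return assignments  # 16 teachers x 10 essays = 160 scored essays
```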
How were the teachers’ scores similar to or different from those given by the AI?
On average, teachers scored essays lower than the AI did, with significant differences on every dimension except Claim and Focus. In terms of total scores across all four dimensions (minimum of 4, maximum of 16), teachers averaged 7.6 on these 160 essays, while the AI averaged 8.8 on the same set. Looking at specific dimensions, Figure 1 shows that on Claim and Focus and Support and Evidence, teachers and the AI tended to agree on high-scoring (4) and low-scoring (1) essays but diverged in the middle, where teachers were more likely to give a 2 and the AI a 3. On Organization and Language and Style, by contrast, teachers mostly gave essays a 1 or 2, while the AI’s scores spread across the full 1-4 range, with many more essays receiving 3s and even 4s.
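The article reports significance but does not name the statistical test; a paired t-test on the matched teacher and AI totals is one standard way to make such a comparison. The sketch below uses placeholder numbers, since we are not reproducing the study’s data here.

```python
from scipy import stats  # scipy is an assumed dependency for this sketch

# Hedged sketch of the comparison above: each essay has a teacher total
# and an AI total (4-16), and we test whether the paired difference is
# significant. The numbers below are placeholders, not the study's data.

teacher_totals = [7, 8, 6, 9, 7, 8, 7, 9]   # stand-ins for 160 values
ai_totals      = [8, 9, 8, 10, 8, 9, 8, 10]

t_stat, p_value = stats.ttest_rel(teacher_totals, ai_totals)
print(f"teacher mean: {sum(teacher_totals) / len(teacher_totals):.1f}")
print(f"AI mean:      {sum(ai_totals) / len(ai_totals):.1f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")
```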
How were teachers’ written comments similar to or different from those given by the AI?
In our meetings with the 16 teachers, we gave them the opportunity to discuss their scores and feedback on their 10 assigned essays. Even before reflecting on specific essays, one observation we heard repeatedly was that when they had used the program in their own classrooms the previous year, they had needed to help most of their students read and interpret the AI’s comments. In many cases, they reported, students would read a comment but were not sure what it was asking them to do to improve their writing. Thus, according to the teachers, an immediate difference was their own ability to phrase comments in developmentally appropriate language matched to students’ needs and abilities.
“Upon reflection, we discussed how good the AI is, even in its comments/feedback. Kids today are used to more direct and honest feedback. It’s not always about ego gratification; it’s about problem solving. So we don’t always need ‘two stars and a wish.’ Sometimes we need to get to the point.”
Another difference was that teachers focused on the essay as a whole – the flow of the writing, the writer’s voice, whether the essay was a summary or an argument, whether the evidence matched the claim, and whether it made sense overall. They attributed the tendency for essays to receive 2s from teachers in the argument-centered dimensions of Claim and Focus and Support and Evidence to this ability to see the whole essay – something the AI could not actually do, because many AI scoring systems are trained at the sentence level rather than on whole essays.
Teachers’ more rigorous assessment of Organization likewise stemmed from their ability to grasp the sequence and flow of a whole text. For example, teachers shared that the AI could spot transitions, instruct students to use more of them, and treat their mere presence as evidence of good organization, whereas they, as teachers, could see whether transitions were truly fluid or just inserted into a disjointed set of sentences. On Language and Style, teachers again pointed to ways the AI is more easily fooled, such as by a string of seemingly sophisticated words – which may impress the AI, but which a teacher will recognize as words that do not add up to a coherent sentence or idea.
Can AI help teachers with grading?
Assessing student work is a time-consuming and critically important teaching task, especially when students are learning to write. To become confident, capable writers, students need consistent practice and quick feedback, but most teachers have too little planning and grading time, and too many students, to assign routine or lengthy writing while maintaining work-life balance or a sustainable career.
The promise of AI in alleviating this burden could be quite significant. While our initial findings in this study suggest that teachers and AI assess somewhat differently, we believe AI has the potential to help teachers grade – if AI systems can be trained to look at essays more holistically, as teachers do, and to phrase feedback in ways that are more developmentally and contextually appropriate for students, so that students can process comments independently. We believe improving AI in these areas is a worthwhile pursuit, both to reduce teachers’ grading burden and to give students more opportunities to write and to receive immediate, useful feedback so they can grow as writers.