Using AI to judge writing could ‘revolutionise’ assessment – trial

No More Marking says its trial found AI is 'very good at judging student writing and is a viable and time-saving alternative for many forms of school assessment'

Using AI to judge student writing “has the potential to revolutionise assessment and decimate workload”, according to the findings of a trial of the approach.

No More Marking, an organisation that pioneers comparative judgment as an alternative to marking written work, recently ran an AI assessment project called CJ Lightning. Comparative judgment involves deciding which is better out of two pieces of writing. 
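Behind the scenes, many such pairwise decisions are combined into a scaled score for each piece of writing, typically by fitting a statistical model such as Bradley-Terry. Neither the article nor the blog post includes NMM’s implementation, so the Python sketch below is only a minimal illustration of that fitting step, not their code:

```python
# Illustrative sketch only: comparative judgment turns many pairwise
# "which is better?" decisions into a scaled score per script, commonly
# by fitting a Bradley-Terry model. This is not NMM's actual code.
import math

def bradley_terry(decisions, lr=0.05, epochs=500):
    """Fit one log-strength (a scaled score) per script by gradient
    ascent on the Bradley-Terry log-likelihood.

    decisions: list of (winner_id, loser_id) pairs from the judges.
    """
    scripts = {s for pair in decisions for s in pair}
    theta = {s: 0.0 for s in scripts}
    for _ in range(epochs):
        grad = {s: 0.0 for s in scripts}
        for winner, loser in decisions:
            # probability the current scores assign to this observed win
            p = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for s in scripts:
            theta[s] += lr * grad[s]
    mean = sum(theta.values()) / len(theta)
    return {s: t - mean for s, t in theta.items()}  # centre the scale on 0

decisions = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "B"), ("A", "C")]
print(bradley_terry(decisions))  # script "A" comes out strongest
```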

In a blog post, director of education Daisy Christodoulou and founder Chris Wheadon said the results showed AI “is very good at judging student writing and is a viable and time-saving alternative for many forms of school assessment”.

It comes as the government is pushing to use AI and other technology to cut teacher workload.

A separate study last year suggested teachers who use ChatGPT alongside a guide on using it effectively can cut lesson planning time by 31 per cent.

The CJ Lightning project assessed the writing of 5,251 year 7 students from 44 secondary schools.

Pupils wrote a non-fiction response to a short text prompt about improving the environment.

Teachers uploaded their writing to the No More Marking website and then used comparative judgment to assess it.

The process, NMM said, “typically delivers very high levels of inter-rater reliability, and is the gold standard of human judgment”.

AI agreed with 81% of human decisions

In this project, No More Marking asked AI to make judgments too. This allowed them to compare the decisions made by humans and AI and see how often the two agreed.

Of the 3,640 decisions made by humans, the AI agreed with 81 per cent of them.

In NMM’s most recent fully human-judged year 7 assessment, the human judges agreed with each other 87 per cent of the time, which is “fairly typical”.
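Both percentages are simple proportions. As a minimal sketch of how such an agreement rate is computed for pairs judged by both a human and the AI, with made-up decisions rather than the trial’s data:

```python
# Minimal sketch of an agreement check: for pairs judged by both a human
# and the AI, count how often the two picked the same winner.
# The decisions below are made up for illustration, not the trial's data.
def agreement_rate(human_winners, ai_winners):
    """Each argument maps a pair id to the script id chosen as winner."""
    shared = human_winners.keys() & ai_winners.keys()
    matches = sum(human_winners[p] == ai_winners[p] for p in shared)
    return matches / len(shared)

human = {1: "A", 2: "C", 3: "B", 4: "D", 5: "A"}
ai = {1: "A", 2: "C", 3: "B", 4: "E", 5: "A"}
print(agreement_rate(human, ai))  # 0.8, i.e. 80 per cent agreement
```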

But they said total levels of disagreement were “not conclusive” and that the “type of error matters”.

“The overall agreement can be good, but if the 20 per cent of disagreements are full of absolute howlers, that’s still a huge problem.”

“Reassuringly”, in this trial those disagreements “peak where the scaled score difference is small”, that is, among pairs of scripts that are close in quality anyway.

[Graph: frequency of agreements by the scaled score difference between them]

Disagreements sometimes down to human error

NMM scrutinised a sample of the biggest disagreements in detail, and talked to teachers who made some of the decisions.

“They are not cases where the AI is wrong and the human is right. In fact, some of the biggest disagreements involved teachers being biased by handwriting, and accepting on review that the AI was probably right and they were wrong”.

[Image: an example where the AI was probably right]

Other examples “involved teachers making a manual error and clicking the wrong button”.

NMM also compared the scores of 2,297 pupils who took part both in this project and in a similar assessment in September last year.

The correlation of scores between the two sessions was 0.65. For comparison, NMM said they had seen a correlation of 0.58 between two human-judged assessments in May and September last year.

“The high correlation reassures us that the AI is not judging on some strange dimension of writing ability, but is actually providing us with a similar dimension to the one we value,” wrote Christodoulou and Wheadon.
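A correlation like this is typically the Pearson coefficient between each pupil’s two scaled scores. A minimal sketch, with hypothetical scores rather than the trial’s data:

```python
# Minimal sketch of the between-session check: Pearson correlation of
# each pupil's scaled scores across the two assessments. Scores made up.
from statistics import correlation  # Python 3.10+

september = {"p1": 480.0, "p2": 510.0, "p3": 540.0, "p4": 470.0}
march = {"p1": 492.0, "p2": 500.0, "p3": 556.0, "p4": 481.0}

pupils = sorted(september.keys() & march.keys())  # pupils who sat both
r = correlation([september[p] for p in pupils],
                [march[p] for p in pupils])
print(round(r, 2))
```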

Not just ‘asking AI for a mark’

They added that their approach to AI assessment was “very different to the ‘ask an AI for a mark’ approach, and offers far more assurances that you are getting the right grade”.

This is because AI, like humans, is better at comparative judgments than absolute ones. They also got the AI to make every decision twice to “eliminate its tendency to position bias”.
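Position bias means favouring a script because of the order in which the pair is presented. NMM’s prompt and code are not published, but the double-judgment idea can be sketched as follows, with ask_model as a hypothetical stand-in for an LLM comparison call:

```python
# Sketch of the order-swapping idea: judge each pair twice, once in each
# presentation order, and only keep verdicts that survive the swap.
# `ask_model` is a hypothetical stand-in for an LLM comparison call;
# this is not No More Marking's actual implementation.
def judge_pair(ask_model, script_a, script_b):
    first = ask_model(left=script_a, right=script_b)    # "left" or "right"
    second = ask_model(left=script_b, right=script_a)   # order swapped
    winner_first = script_a if first == "left" else script_b
    winner_second = script_b if second == "left" else script_a
    if winner_first == winner_second:
        return winner_first   # stable under swapping: accept the verdict
    return None               # position-sensitive: escalate to a human judge

def demo_model(left, right):
    # toy stand-in that prefers the longer script, whatever its position
    return "left" if len(left) >= len(right) else "right"

print(judge_pair(demo_model, "a much longer piece of writing", "short"))
```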

Christodoulou and Wheadon also “think that you could run a 100 per cent AI judged assessment with no human judging.

“However, we would not recommend that you routinely do this. You would always want to run some human-AI hybrids to a) keep validating the AI model and b) make sure that teachers are engaging with student writing.”

For this assessment, they recommended a split of 10 per cent human judgment and 90 per cent AI.
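As a purely hypothetical sketch of how such a split could be allocated, not a description of NMM’s system, one could route a random tenth of the pairwise comparisons to teachers and the rest to the model:

```python
# Hypothetical sketch of a 10/90 human-AI split: send a random tenth of
# the pairwise comparisons to teachers and the rest to the model, then
# use the human decisions to keep validating the AI.
import random

def split_pairs(pairs, human_share=0.10, seed=42):
    rng = random.Random(seed)        # fixed seed for reproducibility
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * human_share)
    return shuffled[:cut], shuffled[cut:]   # (human pairs, AI pairs)

pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H"), ("I", "J"),
         ("K", "L"), ("M", "N"), ("O", "P"), ("Q", "R"), ("S", "T")]
human_pairs, ai_pairs = split_pairs(pairs)
print(len(human_pairs), len(ai_pairs))  # 1 9
```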

Teachers could save time

In one school with 269 year 7s, a head of department spent an hour and 12 minutes on the assessment.

That was “enough to validate all the other AI decisions and provide robust and meaningful scores for every student”.

“In other schools, they shared the decisions out amongst lots of teachers, resulting in 5-10 minutes of judging per teacher.”

Christodoulou and Wheadon concluded that they “still think [AI technologies] have flaws and are prone to hallucinations”.

“But we think the process we’ve developed here has the potential to revolutionise assessment and decimate workload (quite literally decimate if you follow our recommended 10 per cent human judging approach).”

NMM will run free projects in the summer term for any primary or secondary school wanting to trial the approach.

They will then have a “more comprehensive plan available in academic year 2025-26”.
