Opinion

Let’s consign grades to the educational graveyard

14 Apr 2021, 12:30

All awarding systems must discriminate, but our grades discriminate in all the wrong ways, writes Dennis Sherwood

Grades, grades, grades. Why are we so obsessed with grades? Simple. Because the difference between an A and a B means a student can become a doctor, or can’t. Because a 3 rather than a 4 in GCSE English relegates a student to the “Forgotten Third”.

Grades have a peculiar duality. They appear to achieve two contradictory outcomes simultaneously: ‘homogenisation’ and ‘discrimination’. ‘Homogenisation’ because all students awarded the same grade are regarded as indistinguishable in quality. ‘Discrimination’ because grade Bs are deemed profoundly different from grade As – doctor material, or not.

But are all grade As the same? How different is the student with the highest grade A from the one with the lowest? More importantly, are all As different from – and inherently better than – all Bs? What, in truth, is the difference between the student awarded the lowest grade A, and the student awarded the top grade B? Is that a smaller difference than between the top grade A and the lowest grade A?

Are those cliff-edge grade boundaries making false – and unfair – distinctions? Every teacher agonising over which side of a grade boundary a given student will be placed this summer will be all too familiar with this dilemma.
In truth, even the “gold standard exam system” doesn’t get it right. By Ofqual’s own admission, “it is possible for two examiners to give different but appropriate marks to the same answer”. So a script given 64 marks by one examiner (or team) might equally legitimately have been given 66 by another. And if the cliff-edge grade boundary is 65, then the grade on that candidate’s certificate depends on the lottery of who marked their script.

That explains Dame Glenys Stacey’s statement to the Education Select Committee that exam grades “are reliable to one grade either way”. By any reckoning, that must mean that grades, as currently awarded, are fatally flawed. But grades have been with us for a long time, and inertia makes it hard to imagine an alternative.

Yet, there is a simple one. Ditch the grade. A student’s certificate could just as easily present assessment outcomes in the form of a mark, plus a measure of the ‘fuzziness’ associated with marking – a statistically valid way of representing those “different but appropriate marks”. ‘Fuzziness’ is real, and according to Ofqual’s own research, some subjects (such as English and History) are fuzzier than others (such as Maths and Physics).

So, for example, a certificate might show not grade B but 64 ± 5. 64 is the script’s mark, and ± 5 is the measure of the subject’s fuzziness.

Instantly, we are rid of cliff-edge grade boundaries. Anyone seeking to distinguish between a student assessed as 64 ± 5 and another assessed as 66 ± 5 will realise that these two students are in essence indistinguishable on the basis of this exam alone.

We need to change the rules for appeals too. As things stand, the student awarded 64 and re-marked at 66 on appeal (if that were allowed!) would consequently see their grade rise from B to A. But 64 ± 5 explicitly recognises that marking is ‘fuzzy’, and that it is possible, nay likely, that a re-mark might be anywhere in the range from 59 to 69. And since 66 is within this range, the re-mark confirms the original assessment: only if the re-mark were greater than 69 or less than 59 would the assessment be changed.
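
To make the arithmetic concrete, here is a minimal sketch, in Python, of the two rules this implies. The function names are illustrative, and reading ‘indistinguishable’ as ‘the two ranges overlap’ is my own assumption rather than an official definition from Ofqual or any exam board.

```python
# A minimal sketch of the two rules implied by reporting a mark as, say, 64 +/- 5.
# Function names are illustrative; treating "indistinguishable" as
# "the two ranges overlap" is an assumption, not an official definition.

def indistinguishable(mark_a: int, mark_b: int, fuzz: int) -> bool:
    """Two results are indistinguishable when their ranges overlap."""
    return abs(mark_a - mark_b) <= 2 * fuzz

def appeal_changes_assessment(original: int, re_mark: int, fuzz: int) -> bool:
    """A re-mark changes the assessment only if it falls outside the original range."""
    return re_mark < original - fuzz or re_mark > original + fuzz

print(indistinguishable(64, 66, 5))           # True: 64 +/- 5 and 66 +/- 5 overlap
print(appeal_changes_assessment(64, 66, 5))   # False: 66 lies within 59 to 69
print(appeal_changes_assessment(64, 71, 5))   # True: 71 lies outside 59 to 69
```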

Accordingly, if the ‘fuzziness’ measure is determined statistically correctly, the likelihood that an appeal would result in a change in the assessment will be very low. So this idea delivers assessments that are not only fairer but also much more reliable.

Showing assessments in the form of 64 ± 5 is not perfect. No awarding system is. Issues with curriculum and the weaknesses of exams themselves would still need addressing.

But the benefits of fairness and reliability are highly significant. And shifting the responsibility for deciding who should and shouldn’t become a doctor onto those who will train those doctors, rather than those who teach teenagers, must surely be better and fairer.

That alone seems reason enough to consign grades to the graveyard.

Your thoughts

7 Comments

  1. Bob Harrison

    Agreed, Dennis. As a former chief examiner, I find a norm-referenced assessment system that overlays a normal curve of distribution on the efforts of learners morally and ethically abhorrent.

    • Doug Green

      Norm referencing does alleviate a race to the bottom, with exam boards trying to produce the easiest exams possible to get more business. It works on a national scale, but last year when norm referencing was applied at individual schools, grades may as well have been pulled out of a hat. Of course grades are only reliable to one grade either way, as that’s the smallest incremental change for discrete data. Presenting a ‘grade’ as a mark with uncertainty is ludicrous as different exam boards will have different total marks; a percentage seems reasonable. Although this still doesn’t help anyone, as all employers and further courses will have a ‘cliff edge’ on what they will accept.

  2. While I’m not defending grades, and agree that they have issues, I don’t think this is the answer. This does nothing to address the relative difficulty of exams from one year to the next, or between different subjects. You could apply some sort of normalisation, but as soon as you do that you get back to cliff-edge boundaries.

  3. There are three separate issues here, two of which are well-known and the third not so much.

    1. Cliff-edge grade boundaries. These would be eliminated by Dennis’ proposal of a percentage associated with a measure of confidence. It is an easy win. The problem is that if the margin of fuzziness were established by credible statistical means, it would be found to be unacceptably wide. I disagree with Doug Green that cliff-edge grades are OK because employers have to make cliff-edge decisions about recruitment: employers will want to establish their own cliff-edges (including considerations about personality, aspiration, soft skills etc) – and should not delegate this to an algorithm being run by an exam board who know nothing about the job (/university place etc) being offered.

    2. Normalisation. I disagree with Bob (sorry Bob). Normalisation is required to ensure consistent standards because the overall variation in average ability between one national cohort and the next will be insignificant in comparison with the difference in difficulty between one paper and another. If one year’s results are higher than last year’s, then it is almost certainly because the paper was easier. This answers the point raised by “Dc” (who is not correct to say that normalisation inevitably creates a cliff edge – it is percentage scores that are normalised, not grades).

    3. The third problem is the one that is being missed: the lack of dependability of exam results because of the paucity of data on which the inferences (i.e. grades) are being based. Normalisation checks one cohort against another, but not (as Doug Green remarks) one exam board against another, or one teacher’s judgement against another.

    This (largely hidden) weakness of our exam system will only improve when we start to use data analytics to aggregate large amounts of data from different sources. Only this policy will allow us (a) to improve the reliability of our inferences, and at the same time (b) to measure a much wider range of educational objectives and to stop the narrowing of the curriculum. This narrowing is being caused by a form of standardised testing that places too much weight on small quantities of unrepresentative raw data. Standardisation is an exercise in achieving the appearance of reliability by *narrowing* the dataset being used. What we should be doing (and what modern technology allows us to do) is the exact opposite of this.

    When will government start to take this potential seriously, instead of running so-called “edtech” programmes that only help teachers use Microsoft Teams?

    Crispin.

  4. Thank you, Crispin. All very valid points, especially the one about the consequences of publishing measures of “fuzziness”.

    This was noted in a report, published by AQA in 2005, on page 70 of which we read:

    “However, to not routinely report the levels of unreliability associated with examinations leaves awarding bodies open to suspicion and criticism. For example, Satterly (1994) suggests that the dependability of scores and grades in many external forms of assessment will continue to be unknown to users and candidates because reporting low reliabilities and large margins of error attached to marks or grades would be a source of embarrassment to awarding bodies. Indeed it is unlikely that an awarding body would unilaterally begin reporting reliability estimates or that any individual awarding body would be willing to accept the burden of educating test users in the meanings of those reliability estimates.”

    Indeed. And reference to the front cover will show that the lead author of this report was Dr Michelle Meadows, then at AQA, and currently Executive Director for Strategy, Risk and Research at Ofqual.

    Yes, that report was published in 2005.

    https://www.aqa.org.uk/about-us/our-research/research-library/paper?path=review-literature-marking-reliability

    • Thanks Dennis – I look forward to reading this interesting report. And it is interesting too that the question of the reliability of exams has been unresolved for so long.

      One (practical) problem is that it is difficult to demonstrate the reliability (i.e. the consistency) of something that in normal circumstances has no comparators.

      The more fundamental problem is our obsession with the reliability of the test rather than the statistical confidence we can place in our assessment of the student’s capability – what Dylan Wiliam refers to as the “inference” we make on the basis of the test. The reliability of the test only matters if we use a single test. Unreliability is only random error, which can be eliminated by taking the mean from multiple samples (if we had them). Unreliability would be an easy problem to solve if we tested early, tested often (probably in the first instance for formative purposes) and compared the results.

      By narrowing the dataset to a small number of similar, predictable, end-of-course exams, the process of standardisation converts random error (measurable and possible to eliminate by statistical means) into systematic error (not apparent or measurable and impossible to eliminate by statistical means). It converts an easy problem into a difficult one. But (and maybe this is the prize), at least it makes things look tidy. All conflicting data is eliminated, except for what is handled by an “appeals” process, which suggests some form of incompetence, rather than the normal management of inevitable uncertainty.

      I fully support your proposal – but I think it is only likely to be practical to implement (both technically and politically) in the context of the sort of wider and more fundamental reform that I have outlined.