Exam board marking league tables delayed by monitoring concerns

Exclusive

Ofqual, the exams regulator, is still trying to find a way to publish data on how the quality of marking varies between exam boards, more than three years after the idea was floated.

Dame Glenys Stacey, the former chief regulator at Ofqual, announced in June 2015 that the organisation would publish metrics for exam marking quality in 2017.

However, only limited data was made available last year, and Schools Week understands the regulator is still trying to find a “sensible” way to make the published data work.

New documents published this week reveal Ofqual is concerned about the impact any more detailed data would have on the way marking is monitored.

Although the document, Marking consistency metrics: an update, reported for the first time on qualification-level metrics, it warned that future work with metrics “needs to proceed with some caution”.

“This is to manage the risk that any use of thresholds or benchmarks do not compromise the live online monitoring procedures and hence the actual quality of marking, which is the very thing we wish to improve.”

Concerns were also raised in a January 2017 Ofqual board meeting. Minutes state that although the regulator was “now able to routinely create marking consistency metrics for GCSEs and A-levels”, the metrics were based on data from exam boards’ own quality control mechanisms.

“As we have previously discussed, publishing such metrics might have perverse consequences for the monitoring of live marking.”

A set of marking reliability studies completed last year provided limited information about the quality of marking by exam boards. However, it is believed a method of regularly publishing quality metrics is still some way off.

The Joint Council for Qualifications, which represents the four exam boards that provide GCSE and A-level exams in England, said its members “welcome any research into marking consistency.

“We are focused on implementing improvements to the quality of our marking. Our priority is, as it always has been, to give students the results they deserve for their performance in examinations.”



2 Comments

  1. To quote Ofqual (https://ofqual.blog.gov.uk/2016/06/03/gcse-as-and-a-level-marking-reviews-and-appeals-10-things-you-need-to-know/), “it is possible for two examiners to give different but appropriate marks to the same answer”. This ‘possibility’ has nefarious consequences: when two “different but appropriate” marks are on different sides of a grade boundary, the grade awarded to the candidate depends on the lottery of which one of those two examiners happened to mark the script. The measurement of this ‘fuzziness’ is therefore important. Here is a suggestion as to how to do it.

    Take, for example, the summer 2018 [history] scripts for any exam board, and choose, say, [100] scripts which were all given the same mark, say, [56]. Then give each of these [100] scripts to a different examiner*, so [100] examiners in all, drawn from the ‘normal’ examiner team (not just senior examiners), and ask that the scripts be fairly re-marked. Each of the original [100] scripts will be given a single re-mark, so generating [100] mark/re-mark pairs, all for different scripts with the same original mark, [56]. Some of these re-marks will be the same as the original mark, [56], but some will be different. The distribution of these [100] re-marks will be characterised by an end-to-end spread (the difference between the highest and the lowest re-marks), as well as statistical measures such as the standard deviation.

    Then do exactly the same thing for the other boards, and compare the results.

    I wonder if the results from the different boards will all have the same spread, and standard deviation… and if they don’t…

    * A quick note (if that’s not too pompous!!!): this process is different from having the same script marked by [100] different examiners, and gives a different result. The significance of the process as described is that it mirrors the reality that different examiners can give “different but appropriate” marks, so reflecting the lottery of a single mark, and the award of a grade based on that random chance.

  2. To quote an Ofqual blog (https://ofqual.blog.gov.uk/2016/06/03/gcse-as-and-a-level-marking-reviews-and-appeals-10-things-you-need-to-know/), “it is possible for two examiners to give different but appropriate marks to the same answer”. If those two “different but appropriate marks” are on different sides of a grade boundary, then the resulting grades are different. The grade actually awarded therefore depends on the lottery of which examiner happens to mark the script first. This is why grades are unreliable, and explains the results published in Ofqual’s recent report (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/759207/Marking_consistency_metrics_-_an_update_-_FINAL64492.pdf).

    The key driver of grade (un)reliability is the spread of “different but appropriate marks” that might be given to the same script by different, equally qualified, examiners. Here is a pragmatic way to measure it.

    Take, for example, the marked scripts from [xyz board’s] 2018 GCSE [history], and randomly choose, say, [100] scripts, all of which were given the same mark, say, [56]. Give each of these scripts to another examiner, drawn randomly from all the original examiners, and re-mark each script. This will result in 100 mark/re-mark pairs, all of which have the same original mark, [56]. Some of the re-marks will be 56, but others won’t, so there will be a distribution of re-marks, characterised by a spread (highest – lowest), and statistical measures such as the standard deviation. This can be repeated for other original marks, giving an average spread and standard deviation for that subject, as examined by that board. In general, the greater these averages, the more unreliable the grade.

    Then do the same for history scripts from the other boards. I wonder if the results will be the same…
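
The measurement both commenters describe could be sketched roughly as follows. This is only an illustration of the arithmetic, not a method used by Ofqual or any exam board: the function names, the board labels and all the re-mark figures are hypothetical, and the choice of the population standard deviation (rather than the sample version) is an assumption.

```python
from statistics import mean, pstdev


def remark_summary(remarks):
    """Spread (highest minus lowest) and standard deviation of the re-marks
    given to scripts that all received the same original mark."""
    return {
        "spread": max(remarks) - min(remarks),
        # Population standard deviation; an assumption made for this sketch.
        "std_dev": pstdev(remarks),
    }


def board_summary(remarks_by_original_mark):
    """Average the per-mark spread and standard deviation for one board.

    `remarks_by_original_mark` maps an original mark (e.g. 56) to the list
    of re-marks collected for scripts that were first given that mark.
    """
    per_mark = [remark_summary(r) for r in remarks_by_original_mark.values()]
    return {
        "mean_spread": mean(s["spread"] for s in per_mark),
        "mean_std_dev": mean(s["std_dev"] for s in per_mark),
    }


# Invented figures for two hypothetical boards, purely for illustration.
hypothetical_board_a = {56: [54, 55, 56, 56, 57, 58], 40: [39, 40, 40, 41]}
hypothetical_board_b = {56: [50, 53, 56, 59, 62], 40: [36, 40, 40, 44]}

for name, data in [("A", hypothetical_board_a), ("B", hypothetical_board_b)]:
    print(name, board_summary(data))
```

On the commenters' argument, a board with larger averages would be awarding less reliable grades in that subject, although a real comparison would need far more scripts per mark than the handful shown here.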