We need a debate about the future of education post-Covid-19, but that shouldn’t mean a rush to abolish things like GCSEs, writes Tim Oates.
A lot has changed over the past year. The effect of Covid has been seismic for individuals, industries and countries around the globe. But there is now hope that mass vaccination will arrest the rate of infection and save a great many lives.
Yet we may only just be starting to feel some of the shockwaves created by the disease. While crises can create positive opportunities for change, their reverberations can damage well-laid foundations. This is our worry for education here in the UK and around the globe.
Few aspects of the current education and assessment landscape have escaped attention. Some think the national curriculum should be scrapped or that textbooks are a thing of the past. Secure, well-grounded evidence says otherwise.
And then there are the calls to cull GCSEs.
Exams have been a ‘national success’
GCSEs and A-levels have been an extraordinary national success. They have been long-lived and valuable precisely because they have repeatedly adapted to changing educational and social needs over the past 50 years.
There is no reason to think they cannot evolve again, driven by the very best evidence about quality.
As external, independent assessments, they are entirely in line with the high stakes assessment present across other leading education systems around the world.

And they do a lot more than just provide internationally-trusted grades: they provide comprehensive programmes of learning and clarity in expectations and attainment standards, present motivating goals, offer quality assurance of education, support fair admissions and so on.
Any change in assessment would also need to support these key functions.
It sounds a complex set of demands, but that’s what a developed system needs.
Without attending to this, urges to abolition could leave serious gaps in our educational arrangements, and add even more disruption to that already stressing us all.
Debate is needed, but changes mustn’t add workload
Cambridge Assessment and Cambridge University Press have been long-term supporters of debate about the future of education, reflecting our mission, to contribute to society through the pursuit of education, learning and research at the highest international levels of excellence.
We have reflected on how recent educational policy has influenced our standing in international surveys and on the events of 2020 and early 2021. From research and reflection, we have developed a set of outline principles (see below) that we believe will help all those who are interested to continue the debate around teaching, learning and assessment.
We recognise that the priorities right now are getting young people back into schools, implementing effective recovery learning, and managing the demanding task of determining grades for those young people due to receive qualifications this summer.
Without adding to workload or detracting from these priorities, we would like to begin to lay out the evidence and views which can provide robust building blocks for our educational future.
As part of our effort, we will be curating a series of blogs and online debates over coming months through which we will dissect important areas such as the essential role of teachers and schools, textbooks and other learning materials, approaches to the curriculum and assessment, the role of cognitive science, and the welfare of young people, as well as approaches to learning and schools themselves.
We hope you will take the opportunity to engage with us, and in doing so, help reaffirm foundations where appropriate, draw from our national experience of interrupted education, and establish the grounds for new approaches that can truly stand the test of time.
Our principles for the future of teaching, learning and assessment
- Early literacy and numeracy are vital foundations for broad and balanced learning
- Curriculum coherence – the alignment of curriculum content and standards, teaching practices, learning resources and assessment – remains fundamental to high equity and high attainment for all learners
- Curriculum and its assessment, and all other requirements, should be a manageable load for all teachers
- Effective learning should be built on variety, using a well-managed mix of adaptable approaches and modes
- Excellence in teaching and elevated attainment can be supported by well-designed and carefully chosen technology
- Well-trained and well-supported teachers are central to high quality pedagogy, high attainment and the well-being of learners
- Evidence and cognitive science should inform teacher practices
- Access to high quality teaching and learning materials is essential for high quality, manageable education at all ages
- Dependable assessment is vital for social justice, learning support and equitable progression
- Clear standards are important for equity, accessibility and progression for all learners
- Equity and high attainment can be achieved hand in hand
- First and second language skills are essential for all learners, including English as the language of international communication.
“GCSEs and A-levels have been an extraordinary national success….And they do a lot more than just provide internationally-trusted grades… ” (Tim Oates, 19 Feb 2021)
“Exam grades are reliable to one grade either way.” (Dame Glenys Stacey, evidence to the Select Committee, 2 Sept 2020)
That second statement, in plain English, means “a GCSE certificate showing 8, 8, 7, 7 really means any set of grades from 9, 9, 8, 8 to 7, 7, 6, 6 but no one knows which”. If that is “internationally trusted” then it seems to me that that trust is sorely misplaced.
“GCSEs and A-levels have been an extraordinary national success….And they do a lot more than just provide internationally-trusted grades… ” (Tim Oates, 19 Feb 2021)
“Exam grades are reliable to one grade either way.” (Dame Glenys Stacey, evidence to the Select Committee, 2 Sept 2020)
That second statement, in plain English, means “an A level certificate showing ABB really means any set of grades from A*AA to BCC but no one knows which” (and likewise for GCSE). Tell that to a student denied an AAB University place.
If exam results “reliable to one grade either way” are “internationally trusted” then that trust is surely misplaced.
And is it any surprise that an organisation that generates income from selling GCSEs is arguing in favour of maintaining them?
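The "one grade either way" reading quoted in the comment above can be made concrete with a minimal Python sketch. The grade scales are the standard A-level and GCSE ones; the function name `one_grade_either_way` is hypothetical, and clamping at the top and bottom of each scale (and ignoring the U grade) is a simplifying assumption rather than anything stated by Ofqual or the commenter.

```python
# Grade scales listed from best to worst (U grade omitted for simplicity).
A_LEVEL = ["A*", "A", "B", "C", "D", "E"]
GCSE = ["9", "8", "7", "6", "5", "4", "3", "2", "1"]


def one_grade_either_way(grades, scale):
    """Return (best_case, worst_case) if every grade may be off by one."""

    def shift(grade, step):
        # Move along the scale, clamping at the top and bottom grade.
        index = scale.index(grade) + step
        index = max(0, min(len(scale) - 1, index))
        return scale[index]

    best_case = [shift(g, -1) for g in grades]   # one grade higher
    worst_case = [shift(g, +1) for g in grades]  # one grade lower
    return best_case, worst_case


if __name__ == "__main__":
    print(one_grade_either_way(["A", "B", "B"], A_LEVEL))
    print(one_grade_either_way(["8", "8", "7", "7"], GCSE))
```

Under these assumptions a certificate showing ABB spans everything from A*AA to BCC, and 8, 8, 7, 7 spans 9, 9, 8, 8 to 7, 7, 6, 6, which is the range described in the comment.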
Hi Dennis…I am very conscious of an organisation which runs exam boards speaking up in support of exams…but it’s important to bear in mind the following: exam boards in England run public examinations as a public good, on behalf of the state, the public and young people. In effect, something on which young people, employers and education rely in common is designed jointly and operated ‘in trust’ by exam boards. In other nations this public good often is run purely by the State or its direct agencies. You will note that what I suggest above is that assessment will indeed change, as social and economic needs change. This they have done since their inception. Your own important work on grading shows that elements of fairness need constantly to be reviewed, and assessment – whether in the form of an exam or any other form – needs to be refined. We know that this is a constant process, best managed through solid research and refinement, not sudden abolitions and pendulum swings. Yes, we run exams, but what we run is operated through the request of society and the state, and supports essential functions in education, which I list in the piece above. Collective interest and social consent run like a thread through all work on public qualifications.
Hi Tim… thank you. There is now no dispute that exam grades are “reliable to one grade either way”, and, as you are well aware, Ofqual’s research, published in 2016 and 2018, contains all the evidence. So grades “reliable to one grade either way” have knowingly been awarded every year (when exams took place!) since at least 2016. And, given the absence of any indication to the contrary, this will continue when exams eventually return – as A levels surely will, and quite likely GCSEs too, at least for several years.
So my point still stands. Are grades “reliable to one grade either way” satisfactory? Or – now that there is an unwelcome gap – is the time right for fixing this? Especially since this is so easy to do, with the benefit that no longer will some 1.5 million ‘wrong’ grades be ‘awarded’ each year.
GCSEs are not ‘in line with the high stakes assessment’ found in exams present in other developed nations. Where tests exist at the end of lower secondary, they are fewer in number than are expected here and low stakes. They are used to decide upper secondary progression not to judge schools.
Nearly ten years ago, the OECD warned there was too much emphasis on GCSEs in England which risked negative consequences. Since that time, the emphasis has increased not decreased.
That said, GCSEs or equivalent could form part of a coherent exam system leading to graduation at 18 via multiple routes.
Regarding OECD that’s not correct. OECD is not a singular organisation, and has said many things at many times. But most recently Andreas Schleicher supported reformed GCSEs in Schoolsweek in 2017. In your own article in 2011 you quote the OECD economic survey 2011 which does not advise that we should abandon GCSE but points out that grade inflation and teaching to the test are problematic, which concurs with other research.
Tim – thank you for your reply but I did not say that OECD recommended abandoning GCSEs but said it warned about negative consequences arising from the excessive emphasis on GCSEs in England. As you say, these negative consequences have been highlighted in other research.
I have not suggested GCSEs should be scrapped but, as I made clear, they could be part of a suite of exams and other assessment leading towards graduation at 18.
I am neutral about this, but it seems to me that some of Tim Oates’ arguments need to be questioned.
“As external, independent assessments, they are entirely in line with the high stakes assessment present across other leading education systems around the world.”
Perhaps. However, the point is what proportion of those systems have high stake exams for all students at 16 when most don’t leave school until 18? What proportion have these exams for around 10 subjects then immediately narrow down to around 3? Is our system “entirely in line” with theirs?
Second, “And they do a lot more than just provide internationally-trusted grades”.
Do they even do that? When English and Welsh students apply for further education, undergraduate or postgraduate courses, or jobs abroad, are their applications accorded more trust, by virtue of their GCSEs, than those by students from leading education systems without high stake exams at 16?
Is there evidence for what Tim Oates claims above even when the student applies to a British university? For example, Cambridge University’s view is that GCSE grades are not a very good predictor of how well a student will do in their degree, and it’s unlikely that applicants from leading education systems without GCSEs will be disadvantaged.
In 2013, I researched the exam systems in other countries including those cited by the then education secretary (Michael Gove) which he said informed the exam reforms in England which included exams at 16. He said these reforms would match the best in the world.
However, my research showed his reforms did not do so. In fact, most of the exam systems were moving towards graduation at 18 with far less emphasis on tests at 16.
You can find my research here but be aware it was written in 2013 and is likely to have changed. However, it is unlikely countries have moved backwards to high-stakes assessment at 16. https://www.localschoolsnetwork.org.uk/faq/what-are-examination-systems-other-countries
Hi Janet,
Thank you for the link to your research. From it, it doesn’t look like it’s the norm to have exams at 15 or 16. Furthermore, some places that have them are former British colonies, so having them might be a vestige of British influence. We also need to see how high the stakes are. For example, while France has the Brevet, it’s a lot lighter than GCSEs and I’m often told that it’s not as serious or important as GCSEs. And that doesn’t seem to put French candidates at a disadvantage when they apply to study or work abroad.
For the 50% of students who go on to universities, once you have a degree, your GCSE grades will not matter very much. When university admission becomes post A-levels, GCSE grades will have even less weight than they do now.
Janet and Huy – indeed very important to look at the density of assessment in those countries which have high stakes exam-based assessment at 15/16, and also the uses to which the results are put. We will be publishing tomorrow our updated report on the presence of exam-based arrangements at 15/16 in a number of key nations. This certainly contradicts the assertion that England is exceptional in having examinations at 16. Janet, you mention the ‘direction of travel’, and it’s interesting that Sweden is introducing more national tests as the severe grade inflation and serious decline in standards evident in the last three decades begin to really worry that nation (see the IVA report Javervall and Henrekson). This is not to argue for the status quo in national assessment, but to argue that forward-looking policy on refinement of assessment needs to be based on sound evidence.
Tim – thank you again for your comments. As you say, any exam reforms in England should be based on sound research. This raises questions about why Gove pushed ahead with his radical reforming of GCSEs when the evidence at the time showed a move away from high-stakes exams at 16 while at the same time arguing that reformed GCSEs would bring England in line with the ‘best’ systems in the world. But this was not so.
The research Tim references has been published:
https://www.cambridgeassessment.org.uk/blogs/exams-at-16-more-common-around-the-world-than-you-thought/
I have a lot of sympathy with Dennis’ comment about grade reliability. It is a factor of the grading approach which is inevitable, but does have benefits. Of course, O-levels were originally only issued as a pass-fail, with some examining boards issuing ‘indicative’ grades until a common system was introduced. I guess you might argue that was a fairer approach.
I think it slightly disingenuous to suggest that Tim would be arguing for the continuation of GCSE to support Cambridge Assessment’s finances! The wealth of curriculum development and innovation in education that examining boards have engaged in, let alone assessment, is enormous. Often, curriculum change initiatives only gained momentum when examining boards were prepared to support them with a certification. And, to be clear, most (if not all) of those did not make money. One in which I was involved, for example, both sides of the fence was the Schools’ Council Integrated Science Project in the 1970s and 1980s.
I think Tim’s principles are worth examining. I don’t see in any of them that he argues for an end-of-course examination as exists now. Is there an assumption being made there?
Of course, memories can be short in this business. There have been many approaches to assessment that not only aim to be ‘reliable’ but also ‘valid’. Teacher assessments appropriately standardised and/or moderated. Observation of skills and competencies (yes – in GCSEs). But recent ‘reforms’ have removed many of those in the mistaken belief (in my opinion) that written exams are the fairest approach to assessment.
It is worth taking a look at the JISC report, The future of assessment: five principles, five targets for 2025 (https://bit.ly/3sgmPBc). I would certainly add those to the mix.
I would argue for the retention of the GCSE in title. But look to how technology-enhanced assessments can not only improve engagement with pupils but also provide support to teachers and more appropriate methods of assessment. In a talk given to OEB Berlin last year (https://bit.ly/3kilEhJ), I argue for short ‘sprints’ of learning, assessed regularly (in the most appropriate way for the ‘sprint’), credit awarded and feedback provided to enhance learning. This could be aggregated to a GCSE. Keeping the title makes sense for continuity and also credibility as Tim refers to.
If you want to get in touch to discuss, happy to do so. You’ll find me on LinkedIn.
Tim, Thank you for your comments. I look forward to seeing Cambridge Assessment’s report.
My concern is not whether England is exceptional in having examinations at 16 (I don’t think it is), but whether those exams are overweight in terms of contents and in terms of unnecessary pressure on students, especially in view of the fact that about 80% of students remain in full time education after that, and 50% go on to university, and that university admission might become post-A-level.
I should say that our first child got eight 9s, two 8s, one distinction, one A* at GCSEs, so I don’t have anything against GCSEs from the perspective of personal interests. However, I’m not sure if GCSEs, as they are, are good for education. For example, at 14 (which is quite a young age) a student has to learn about Lady Macbeth, at 15 or 16 take an exam in it, then they can forget about it completely. And no university or employer will care whether they got Grade 8 or Grade 9. I suspect that few, if any, will care whether it’s grade 8 or grade 7, and so on.
I used to trust that the DfE, Ofqual and the exam boards are competent and conscientious, but what happened in 2020 made me question things more. What the DfE and Ofqual did in 2020 was not as competent or conscientious as I would like, and most of the time the exam boards just went along, even as far as defending the status quo.
For example, in this article, https://www.theguardian.com/education/2020/aug/07/a-level-result-predictions-to-be-downgraded-england,
you said “On results day, energy should be channelled into how each young person can progress swiftly with what they have been awarded, rather than time lost on agonising over an apparently controversial but fundamentally misleading difference between teacher grades and final grades.”
despite the fact that Ofqual’s testing had indicated that their algorithm got A-level Biology grades wrong around 35% of the time and French grades wrong around 50% of the time, while for GCSEs, it awarded around 25% wrong Maths grades and around 45% wrong History grades.
The revulsion when the grades came out, the fact that the PM called it a derailing of grades by a mutant algorithm, and the “No algorithm” rule for 2021, prove clearly that there was a problem, but the exam boards, for all their expertise and research, did not inform the public of potential problems.
Our continued reliance on exams as the gold standard of assessment is indefensible. I am not talking from a left-wing “prizes for all” perspective, but from one that values rigour and knowledge in the curriculum and accuracy in the assessment system.
Dennis Sherwood asks the critical question and Tim’s answer is revealingly evasive.
The unreliability of an assessment should not be a problem. Unreliability is just random error, which can be corrected by repeat sampling and the use of data analytics (removing outliers, averaging results, correlating different data subsets). The problem is that standardised tests do not allow for repeat sampling because the apparatus of formal exams is just too expensive. By trying to reduce the unreliability of the data collection method (i.e. by using standardised tests and controlled conditions) we make the assessment predictable and thereby convert random error into systematic error, which is statistically invisible and impossible to correct. Taken on their own, without correlation against other forms of assessment, exams make the problem worse.
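The distinction drawn above between random and systematic error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" score, the Gaussian noise model, the size of the bias and the function name `mean_observed_score` are all hypothetical choices made for illustration, not figures taken from any exam data.

```python
import random
import statistics


def mean_observed_score(true_score, n_samples, noise_sd, bias, seed=0):
    """Average n_samples noisy observations of a single true score.

    Each observation = true score + fixed bias + Gaussian random noise.
    The noise model and all parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    observations = [
        true_score + bias + rng.gauss(0.0, noise_sd) for _ in range(n_samples)
    ]
    return statistics.mean(observations)


TRUE_SCORE = 60.0

# One noisy assessment versus many repeated, unbiased assessments.
single = mean_observed_score(TRUE_SCORE, n_samples=1, noise_sd=8.0, bias=0.0)
repeated = mean_observed_score(TRUE_SCORE, n_samples=400, noise_sd=8.0, bias=0.0)

# Many repeated assessments that all share the same systematic bias.
biased = mean_observed_score(TRUE_SCORE, n_samples=400, noise_sd=8.0, bias=5.0)

if __name__ == "__main__":
    print(f"single sample:        {single:.2f}")
    print(f"mean of 400 samples:  {repeated:.2f}")
    print(f"mean with +5 bias:    {biased:.2f}")
```

In this toy model the average of many samples settles close to the true score, because the random error shrinks roughly with the square root of the number of samples. The shared bias of 5 marks survives averaging untouched, which is the sense in which systematic error is invisible to repeat sampling alone.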
When politicians say that exams are the best option, they mean “in comparison to teacher assessment”. The experience of last year’s grade inflation shows that this is not saying very much, to put it mildly.
The answer is to fund a massive expansion in the availability of centrally administered, locally assigned, digitally delivered formative assessments. All data (including from other sources like traditional tests and teacher assessments) should be monitored, using data analytics to form defensible inferences about student capability. Drive reliability not through controlling exam conditions but by increasing the amount and variety of data collected.
This approach would improve reliability and validity, massively improve teaching by encouraging the frequent use of formative assessment (which I take to include increased use of Bjork’s testing effect + better feedback, including adaptive instruction), reduce teacher workload, reduce the sense of polarisation between the profession and government (which tends to use formal exams as an instrument of policy a.k.a. stick to beat the teachers with) and, finally, avoid the sort of catastrophe that we are currently experiencing.
The phased abolition of GCSEs in their current form (which, as has been pointed out elsewhere on this thread, are largely redundant) should be used as an opportunity to introduce a new edtech policy to underpin the new approach.
If society is looking for a radical yet level-headed way out of our current mess, then for the core strategy, there is nothing else going but this.
Thanks Crispin…a couple of things….no I wasn’t being evasive, and certainly not ‘revealingly so’. I did think that it was vital to address the ‘vested interest speaks’ point; which I hope highlighted the way in which exam boards support vital public goods ‘in trust’; a vital point of national arrangements not usually discussed. I acknowledged that Dennis’ work is important and is part of the research intelligence which we should use to review and enhance assessment. There is discussion in Ofqual and amongst my researchers of Dennis’ work and the different forms of measurement imprecision which can enter educational assessment. You are right that measurement imprecision can be reduced by multiple assessment of the same thing, on different occasions and by different methods. This is a feature of medical education, where the critical nature of the later professional practice requires high levels of assurance that things have been learned. It’s sound assessment method. But very time-hungry and extremely expensive. We like it to happen in medical education because of the effects of getting it wrong. Ofqual has its marking accuracy research, which yields measurement imprecision at levels different to some other studies, but corroborates research that measurement imprecision is there, from various sources, and affects grade classification. We could go down the route of medical education but I believe, like you, that we will see assessment being woven more densely into learning, using digital applications like Isaac Physics (which received the Ogden Gold Medal from IOP in 2019). Such applications will require the same high quality items which have been refined and refined in public exams – we are very good as a nation at high quality assessment items (questions). The best of these will not only produce good evidence of attainment, but will support rich and motivating learning.
I would anticipate that this will be the most likely incremental, well-theorised and highly practical way in which public examinations and national assessment will be refined. And it will introduce the kind of resilience we are looking for in assessment arrangements in the future.
Thank you for the response, Tim, which does now acknowledge the point about the (in)accuracy of exams.
Before I respond to your comments on accuracy, let me also comment on “vested interests speaks” and also your arguments about evidence.
Your response to the charge of being a vested interest is to assure your critics that you are all good chaps looking after public goods in trust. It’s the sort of answer I would expect from a politician flannelling his way through a tricky 5 minutes on the Today programme. I do not doubt your motives but (a) I believe that you have got important things wrong in your theoretical understanding (notably about “curriculum coherence” as we have discussed before), and (b) I question whether, in view of the position of power and influence that has been vested in the exam boards (and particularly in you personally, given your role in Michael Gove’s curriculum reforms), you are acting as a powerful blockage to innovation and progress.
On the subject of theory and evidence, which you come back to repeatedly, I agree with you that these are most important – but not as straightforward as one might imagine. Evidence is not only difficult to use in education, owing to most educational research being underpowered, but empirical evidence can only reflect what is being done now. An over-cautious approach, which says “we can’t do z because we have no direct empirical evidence that it works”, suppresses innovation. One has to be able to use evidence in a more agile way, saying that “we have evidence that x and y both work, and x + y = z”.
Second, “theory” seems often to be used in education like “doctrine” was in the medieval church, when the concept was also used as another powerful suppressant of innovation and change. When you look at, say “curriculum theory” (which we have discussed in the past) you find that it is a bird’s nest of contradictions and absurdities. Much of our current educational theory is bad theory.
I accept that the sort of theory and evidence you reference are on the whole respectable. I appreciate your reference outside education to medical training. I would reach out further and reference evidence from the data analytics revolution that is occurring in business, yet leaving education almost entirely untouched. My main objection to your own use of the evidence argument is the one about innovation. It serves to promote the status quo.
So – to return to the main point, accuracy. Your objection to using the approach used in medicine is that (1) mainstream education does not need to be as accurate as medicine does; (2) it is too expensive; but that (3) this will nevertheless be our direction of travel in the future.
Point (1) sounds like a pretty straight answer to Dennis Sherwood’s question: “yes, educational exams are not very accurate, but their level of accuracy is acceptable, given the nature of our business”.
When you consider all those students who did not get to the appropriate university because of being given the wrong grade, I suspect that this is a little too sanguine. Indeed, I look forward to hearing Gavin Williamson giving that answer in the House of Commons. But even if we accept that our summative assessment does not need to be more accurate than it is, I make two counter points.
a) If our summative assessment is that inaccurate, then our formative assessment must be even worse – and the effect of bad formative assessment is not just that students leave school with the wrong bit of paper, it is to damage their learning by depriving them of the right sort of practice, feedback and remediation. If one believes in education at all, then this is a very serious matter.
b) It undermines the force of your assurances about the pride we can feel about being a nation capable of producing such high quality question items. The items might be very good but if the accuracy achieved is only “good enough” then the outcome of the process, which is what matters in the end, is not of such “high quality”.
Your argument (2) (the medical approach is too expensive) goes to the heart of my previous post, which you haven’t engaged with. I make three points.
a). It is only “very time-hungry and extremely expensive” when you are running formal exams that i) have all the apparatus of standardised tests, controlled conditions and moderation; and ii) are evaluated only as assessments and not (as you admit they could be) as a form of instruction. There is no time limit on the amount of assessment we can do if that assessment is at the same time our principal method of instruction.
b). If, as I argue, you can achieve accuracy by repeat sampling and not by focusing on the quality of the initial items or the use of controlled conditions and moderated marking, then this expense would be very substantially reduced.
c). A third benefit of my approach is that, by lowering the quality threshold of the “data in” part of the process, you can consider data that evaluate types of learning objective that exams find it very hard to assess, such as attitudes and soft skills, improving what most teachers would think of as the “validity” of our assessment. This reflects a concern that is not just felt on the left but also on the right, which values qualities like grit.
Finally, on your point (3) I am glad that we agree that this will be the direction of travel in the future. The problem with your response on this is that (a) progress of the use of digital technology requires innovation by technology companies, which is not Cambridge Assessment. The fact that our ultimate measure of educational attainment remains in the hands of traditional awarding bodies becomes a powerful blocker to such innovation happening. And (b) you continue to emphasize the importance of high-quality question items – which to me shows that you don’t really understand the transformation that your words imply. The whole point of the argument above is (i) that high-quality question items do not give us assurance of accurate assessment, and (ii) the move towards data analytics means that the quality of “data in” becomes much less important.
I am sorry, Tim, to seem such a relentless critic of your position because there is much of what you say that I agree with. I also think the traditional skills of the formal assessor will remain an important sheet anchor, even in the new world of edtech. We just have to persuade you (or your political masters) that traditional awarding bodies cannot remain in their current dominant position, effectively as the sole arbiter of attainment. For all your expertise and good intentions, the current awarding bodies are a key blocker to introducing forms of assessment that are more dependable, better serve the cause of rigour and knowledge, and are more useful from a formative perspective.
One more point.
We agree that “measurement imprecision can be reduced by multiple assessment of the same thing”.
If this is to happen in practice, we need to be able to define with some precision those “things” that we need to be addressed by multiple assessments (for “things”, read “constructs” – I prefer “learning objectives” or “capabilities”).
This is what was attempted by the system of criterion referencing, introduced after the 1988 Education Act. But it didn’t work. The long retreat from criterion referencing started almost immediately and was only completed with the withdrawal of the system of levels after 2010, on your watch. The current orthodox “theory” states that criterion referencing is fundamentally mistaken (and not just poorly implemented).
The result of this position is that repeat testing of the same construct is made very difficult – another major blocker to the sort of approach that you claim to support.
To both Tim and Crispin in particular… are we using the right language? For example, words like “accurate” and “consistent”? To me these are inappropriate – there is never a “right” mark for any question (other than unambiguous multiple choice), so there is no ultimate benchmark against which “accurate” can be compared. So no matter how many examiners might mark any script, there is still no “right”, and the average over any number of examiners’ marks is no better or worse than any single mark. Likewise “consistency” carries baggage that marking should be “consistent” and that any “inconsistency” in marking is “bad” and so must be eliminated.
So rather than “accuracy”, let’s talk about “reliability”, where I mean “the probability that a grade will be confirmed, not changed, on a fair re-mark”. That does not require a definition – or a fudge – about “right”. It asks a much more pragmatic question: does a second opinion confirm the original opinion, or not?
Fundamentally, all marking is necessarily “fuzzy” – this being an attribute of marking that is neither good, nor bad, just is; an attribute that accepts that different examiners can legitimately give the same script different marks. Indeed, it was the recognition of fuzziness that led to the invention of grades at Yale in 1785 – where they also recognised that particular care is needed at grade boundaries – care that has long since disappeared for school exams. Hence the grading problem – fuzzy marks that straddle grade boundaries with no proper review of which side that script fairly lies.
So the key question is “is it possible to award assessments that are reliable – meaning that there is a high probability that the original assessment will be confirmed by a fair re-mark – even if the original assessment is fuzzy?” To which the answer is “yes”, in many different ways, as described here https://www.hepi.ac.uk/2019/07/16/students-will-be-given-more-than-1-5-million-wrong-gcse-as-and-a-level-grades-this-summer-here-are-some-potential-solutions-which-do-you-prefer/.
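To make the definition of reliability above concrete, here is a minimal sketch that estimates "the probability that a grade will be confirmed on a fair re-mark" by simulation. Everything in it is an assumption for illustration only: the grade boundaries are hypothetical, and fuzziness is modelled as a normal spread of legitimate marks around a typical mark, which is a modelling choice and not something taken from Ofqual's research.

```python
import random

# Hypothetical grade boundaries (minimum mark for each grade), highest first.
BOUNDARIES = [("A", 66), ("B", 56), ("C", 46)]

def grade(mark, boundaries=BOUNDARIES):
    """Return the highest grade whose minimum mark is reached."""
    for letter, minimum in boundaries:
        if mark >= minimum:
            return letter
    return "U"

def confirmation_probability(typical_mark, spread, trials=20000, seed=0):
    """Estimate how often two independent markings give the same grade.

    typical_mark: centre of the range of marks examiners legitimately give.
    spread: assumed standard deviation of that range (the "fuzziness").
    """
    rng = random.Random(seed)
    confirmed = 0
    for _ in range(trials):
        first = round(rng.gauss(typical_mark, spread))
        second = round(rng.gauss(typical_mark, spread))
        if grade(first) == grade(second):
            confirmed += 1
    return confirmed / trials

# A script sitting just below a boundary versus one in the middle of a grade.
near_boundary = confirmation_probability(typical_mark=65, spread=2.0)
mid_grade = confirmation_probability(typical_mark=61, spread=2.0)
print(f"near boundary: {near_boundary:.2f}, mid grade: {mid_grade:.2f}")
```

Under these assumptions the same fuzziness matters little for a script in the middle of a grade and a great deal for a script close to a boundary, which is the grading problem described above.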
Hi Dennis,
Thanks for the challenge (your post of today at 2.47). You express the orthodox view, but one which seems to me to be fundamentally wrong.
With regard to your objection to “consistent”, to me “consistent” and “reliable” are synonymous, other than that “reliable” is a technical term that is more easily misunderstood by the layman. If you think I am wrong, can you explain what the difference is?
If you think the only quality of a dataset that matters is reliability, you presumably do not admit the possibility of systematic error, i.e. bias? Let’s say your assessment systematically penalizes a student’s score on maths because of their weakness in reading. Your marks that purport to measure their ability in maths end up being reliably wrong. But if you say that the only thing that matters is reliability and there is no such thing as accuracy, then I presume that you don’t see that there is any problem in being reliably wrong?
I admit two things about this.
1. That the only thing we can see in the data is reliability (i.e. the consistency of results). It’s like shooting on a rifle range when you can’t see the target. You can tell the difference between a case where the shots are scattered all over the place and one where the shots are closely grouped; but you can’t tell the difference between an example where the shots are closely grouped on the bullseye and where the shots are closely grouped off-target.
But the fact that you can’t see the target (the construct) doesn’t mean that the construct doesn’t exist. This is where I find the conventional language profoundly unhelpful. The term “construct” suggests something that is an artefact of the observer, that doesn’t exist in reality. But that is not true of e.g. ability in maths: one student really is more able in maths than another. The fact that we cannot see the construct does not mean that it does not exist. Scientists cannot observe the atom directly – they infer its properties from their observations (in fact this is true of all observations – see Descartes’ cogito). So the fact that we cannot see directly whether our assessments are accurate does not mean that the concept is invalid.
2. The only way to detect and ultimately to correct systematic error is to widen your dataset. If the systematic error on the rifle range results from the sights on your rifle being skewed, then use a variety of rifles with different sights. If the bias in an assessment is being introduced by your rater, then use a variety of raters. If by the test instrument, then use a variety of test instruments. By widening your dataset, you convert systematic error (bad) into random error (good). Random error (unreliability) is good because it can be detected and corrected, essentially by taking the mean.
The problem with standardised assessment is that it does the opposite of this. Incorrectly thinking that unreliability is our enemy and not our friend, we narrow the dataset, converting random error into systematic error, which is invisible and not correctable. But at least it makes things look tidy.
You say that the mean is no better than any individual mark. That is not true on the rifle range, it is not true for the pollster, it is not true in any case where you are dealing with random error. Because the error is random, it can be reduced by taking the mean, and the more observations you average, the smaller it becomes. The only way that I can make sense of your position is that you do not believe that unreliability equates to random error because you do not believe in error because you do not believe that there is a true answer. I am afraid this reflects the corrosive effect of relativism in contemporary educational thought. What is the point of trying to measure something (i.e. educational attainment) that does not truly exist?
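The rifle-range argument can be sketched as a small simulation. This is only an illustration under stated assumptions: it presupposes that a true score exists (the very point in dispute), the numbers are invented, and in the "varied raters" case it assumes that individual rater biases average out to zero across raters, which real raters or test instruments need not do.

```python
import random
import statistics

TRUE_SCORE = 60.0  # assumed to exist, for the sake of the argument

def mean_of_marks(n_marks, shared_bias, rater_bias_sd, noise_sd, rng):
    """Average n_marks marks, each from a different rater.

    shared_bias: systematic error common to every rater (skewed sights).
    rater_bias_sd: how much individual raters' own biases differ.
    noise_sd: purely random marking error.
    """
    marks = []
    for _ in range(n_marks):
        rater_bias = shared_bias + rng.gauss(0, rater_bias_sd)
        marks.append(TRUE_SCORE + rater_bias + rng.gauss(0, noise_sd))
    return statistics.mean(marks)

def typical_error(n_marks, shared_bias, rater_bias_sd,
                  noise_sd=3.0, repeats=2000, seed=1):
    """Mean absolute distance between the averaged mark and TRUE_SCORE."""
    rng = random.Random(seed)
    errors = [
        abs(mean_of_marks(n_marks, shared_bias, rater_bias_sd, noise_sd, rng)
            - TRUE_SCORE)
        for _ in range(repeats)
    ]
    return statistics.mean(errors)

single = typical_error(1, shared_bias=0.0, rater_bias_sd=0.0)
averaged = typical_error(25, shared_bias=0.0, rater_bias_sd=0.0)
biased = typical_error(25, shared_bias=-5.0, rater_bias_sd=0.0)
varied = typical_error(25, shared_bias=0.0, rater_bias_sd=5.0)

print(f"one mark, random error only:       {single:.2f}")
print(f"mean of 25, random error only:     {averaged:.2f}")
print(f"mean of 25, shared bias of -5:     {biased:.2f}")
print(f"mean of 25, biases vary by rater:  {varied:.2f}")
```

In this sketch averaging shrinks random error but leaves a shared bias untouched. Only when the biases differ from rater to rater does the mean move back towards the true score.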
Dennis,
I did not answer your point about fuzziness, or refer to your blog.
You make this statement, with reference to the Ofqual website, which in my view is a non sequitur: “different examiners can legitimately give the same script different marks. As a consequence, of the more than 6 million grades to be awarded this August, over 1.5 million – that’s about 1 in 4 – will be wrong”.
If examiners can “legitimately” give different grades for the same answer, then this cannot lead to grades that are “wrong” (e.g. “inaccurate” – I don’t understand the difference).
The inevitable consequence of saying that different examiners can “legitimately” give different marks is that students can “legitimately” get different grades for the same scripts. The position taken by Ofqual is therefore unacceptable.
Most of the supposed solutions you give follow the approach that I explain in my previous post, not of solving the problem but of concealing it. That is the danger when you focus on the *appearance* of reliable outcomes and do not concern yourself with concepts of accuracy.
The two approaches that increase genuine accuracy are:
1. using multiple raters (better still, multiple test instruments) and taking the mean of the outcomes – but this is contradicted by your own statement above that the mean score is no better than any individual score;
2. creating a better definition of the construct, the importance of which I explain in my answer to Tim, which is why our contemporary rejection of all notions of criterion referencing is what has led us into this mess.
The problem is that these two approaches are in tension with each other. If you are to have multiple raters/test instruments, it is important that they all share the same understanding of the construct. But you avoid this problem by stating that you should have “Just one examiner [who has a private, unpublished view of the construct], so ensuring consistency”. This is not only completely impractical, it is just an exercise in sweeping the problem under the carpet. You create inaccurate marks, whose inaccuracy is not apparent to anyone because there is nothing to compare them with.
Tim – I’ve now read the report you mentioned. It is a detailed summary of exam systems in 21 ‘repeatedly high performing’ jurisdictions (RHPJs) and will prove a useful resource.
Your report concludes that two-thirds of RHPJs use them; one third do not.
But does the report compare like with like? GCSEs are end-of-course exams with no internal assessment – a feature shared by eight of the 14 RHPJs. The remaining six used internal assessment combined with the external exams.
What isn’t clear, however, is the number of tests and subjects covered. In England and Northern Ireland, pupils take exams in an average of nine subjects. Is this large number typical or extreme? The report doesn’t make it clear.
What is clear is that only four countries state they use exam results at 16 for formal accountability purposes. Two of these are England and Northern Ireland. This is what makes GCSEs so high-stake.
I have no problem with tests at 16 but they should work for pupils – deciding future courses and not being so onerous that they cause undue stress. They should be a stepping-stone not a final destination.
Error alert – my comment says ‘Your report concludes that two-thirds of RHPJs use them; one third do not’. The word ‘them’ should be ‘external exams’. Sorry – poor proof reading combined with over-zealous editing to reduce word count.
Janet – many thanks; we hope that the report is a contribution to the pool of evidence for the policy debates around qualifications. It’s indeed the case that the density of assessment is important. In Estonia it’s three key external examinations, but they too are very significant for the focus of pupils, parents and schools. I think the number and ‘felt weight’ of qualifications is an important point for discussion. Important also to consider workload from the bottom up. The QCA research on controlled assessment and other work on continuous teacher assessment emphasises the high workload this places on teachers. This is where innovation in linking assessment and learning may bring benefit – although there is a significant risk in putting in place an instrumental approach to ‘turning everything into assessment’ – and then increasing student stress, not decreasing it. On accountability, in line with the 2011 OECD report, we wanted to highlight the different ways in which the outcomes from external exams are used – particularly in respect of national accountability systems. It’s clear that as nations around the world push to improve quality, accountability arrangements increasingly are part of public policy. And you will see we highlight how some countries, such as Finland, have relatively high accountability but through different mechanisms. We don’t go into that in huge detail in this report – since the focus was on informing the ‘England is an outlier’ debate – but more detail of the performance and impact (eg on teacher behaviour and pupil welfare) of respective national accountability arrangements is a very important area.
My observations are:
1. The scope of the report is whether there are exams at 16, which is not a very interesting or useful scope. As I indicated earlier, I think what’s important, where there are exams, is the contents that are examined, the importance that’s placed on those exams for students (ie the pressure), how much do big-bang exams weigh in the grading, and the uses and values of the grades in the real world.
2. The report’s choice of subjects seems strange. It does not consider countries such as France, the US, Denmark, Israel, Sweden, Switzerland, South Korea, Germany, which are 8 of the 10 countries that Bloomberg rated as the most innovative. It’s strange that three different parts of Belgium feature as three separate entries, and NSW and Ontario are featured, but many countries that are influential, our important partners, our competitors, are not considered. I don’t think the subjects have been chosen very well.
3. The report does not consider the context of each country’s educational system. For example, suppose we know that a country has national exams in maths at 16. To understand the significance of that, we need to know whether students continue to study maths, or have a reasonable opportunity to do so, after those exams. I think to discuss whether GCSEs in their current form are optimal in serving students, we also have to consider what happens after GCSE, not just whether there are exams at 16.
4. The report does not consider the educational culture of each country. For example, in some countries students take exams in all subjects that they do in a year at the end of that (academic) year, sometimes even from primary school. For those countries, having national exams at 16 is not that different from other years. England and Wales do not have the educational culture and taking big-bang GCSEs at 16 is likely to be more stressful for students than for some other countries that have exams at 16.
So I think the report sheds little light on how our GCSEs compare with what happens in other countries, especially ones that are our important partners and competitors.
Sorry, typo, should have been “England and Wales do not have that educational culture [of students taking exams every year]”
The case of Belgium in the report is quite interesting, although not in terms of which country does what.
In the same country, the French speaking region has exams at 16 but the Flemish and German speaking regions don’t. I don’t think anyone can suggest that school leavers in the Flemish and German speaking regions are less able or more disadvantaged in any way. If anything, the Flemish and German speaking regions have higher income per capita. So perhaps Belgium’s case suggests that exams at 16 are a preference or a tradition rather than a necessity.
Tim – re teacher workload. I agree that controlled coursework assessment can be onerous and eat into teaching time. But coursework during the early years of GCSE was part of general classwork. It gave pupils a sense that the work they did would contribute to their final exam grade. The work was both internally and externally moderated. It was a well-honed system which was unfortunately dismantled. If it were still in place, then the angst surrounding exams at the moment would have been greatly lessened because schools would have had something to fall back on.
I’ve valued talking to you about this – thank you for responding.
Hi Crispin – thank you for your most insightful thoughts. I’m pretty sure we’re very much on the same page, so let me try to explain that. I’ll be as brief as I can – and am very happy to continue the conversation directly.
A student takes an exam; it’s given 64 marks and awarded grade B. Is that award ‘right’?
One ‘right’ is that the exam result truly reflects the student’s abilities, and potential in the future. But the exam was a single event, taking place over 3 hours on one particular day, asking only a narrow range of questions. There is no way that the exam can be a ‘holistic assessment’. So that calls into question exams, the syllabus, learning opportunities, assessment and a host of other mega-important issues. I think that is your metaphor of “which target?”. I agree. A very important, and very complex, issue which I truly hope will be addressed sooner or later, as lobby groups such as Rethinking Assessment are arguing for (https://rethinkingassessment.com).
A much narrower ‘right’ is about the ‘cluster’ of marks on a given ‘target’ – and it is this ‘right’ that concerns me, despite my agreeing that the target could well be the wrong one. This is pure pragmatism – if the state of current society is that this is the ‘target’ everyone is using, then it makes sense to ensure that at least that (arguably wrong in principle) target is being ‘used’ correctly.
So what is the ‘right’ mark for a conventional exam? If there were only one ‘right’ answer, an answer that was independently verifiable, then that’s fine. One way of achieving that would be to run all exams like “Who wants to be a millionaire?”. Another is to have a single, trusted, examiner, who marks all scripts using exactly the same standards consistently (!yes, consistently!) for all scripts at all times. As long as that examiner is trusted by all students, that’s fine. This could work for small cohorts, but is tricky for 600,000 GCSE English scripts. Maybe one day AI will resolve this: if the algorithm gains trust, then one ‘examiner’ could indeed mark large numbers of scripts to the same, accepted, standard.
When multiple examiners have to be used, then the problem is the empirical observation that different examiners, all equally conscientious, can give the same script different marks – one gives 67, another 64. Which is right?
One answer could be “the mark given by the more senior of the two” – which happens to be 67, grade A. Which causes a problem given that the script was actually marked 64, grade B.
Another approach is to mark the script multiple times, and generate a distribution – say, from 60 to 70. Which mark is ‘right’? A fundamental truth is “we don’t know, any one could be”; a more pragmatic approach is to agree one measure that can be applied to all scripts – say, the mean, the median, the mode, the mark that defines the upper quartile, whatever. In principle, it doesn’t matter, as long as we all use the same measure consistently. Pragmatically, though, this does not work: marking 600,000 GCSE English scripts multiple times is tricky.
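As a minimal sketch of the "agree one measure" idea, the snippet below computes each candidate measure for one script marked eleven times. The marks and the A/B boundary of 66 are hypothetical, chosen only to fall in the 60 to 70 range mentioned above, and the upper quartile uses the default method of Python's `statistics.quantiles`, which is one of several conventions.

```python
import statistics

# Hypothetical marks given to the same script by eleven examiners.
marks = [60, 62, 63, 64, 64, 64, 65, 66, 67, 68, 70]

# Hypothetical boundary: 66 or more is grade A, anything lower is grade B.
A_BOUNDARY = 66

def grade(mark):
    """Two-grade illustration only; real grading has more boundaries."""
    return "A" if mark >= A_BOUNDARY else "B"

measures = {
    "mean": statistics.mean(marks),
    "median": statistics.median(marks),
    "mode": statistics.mode(marks),
    # quantiles(n=4) returns the three quartile cut points; index 2 is the upper one.
    "upper quartile": statistics.quantiles(marks, n=4)[2],
}

for name, value in measures.items():
    print(f"{name}: {value:.1f} -> grade {grade(value)}")
```

For a script whose marks straddle a boundary, the agreed measure decides the grade, so whichever one is chosen has to be applied to every script in the same way.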
So another approach might be to attempt to narrow the range, perhaps by better training of examiners, by tighter mark schemes (ultimately resulting in multiple-choice-by-stealth, as is happening). This is sensible, but is unlikely to eliminate fuzziness totally.
Yet another approach is to accept that marking is fuzzy, and ask “how can the fuzziness of marking – which is inherent and varies by subject – be recognised in the assessment, so that the student is not penalised?”. To me that is a sensible question, and it has sensible answers, for example, as described here https://www.hepi.ac.uk/2019/07/16/students-will-be-given-more-than-1-5-million-wrong-gcse-as-and-a-level-grades-this-summer-here-are-some-potential-solutions-which-do-you-prefer/.
A brief word on “1 grade in 4 is wrong”.
“1 grade in 4 is wrong” is my short-hand for “According to Ofqual’s research, if the entire cohorts of scripts in 14 popular subjects were marked twice, once by an ‘ordinary’ examiner and once by a ‘senior’ examiner, then the corresponding grades would be the same for about 75% of those scripts, and different for the remaining 25%”. That’s long-winded, and doesn’t make a pithy headline, hence my short-hand. And my use of ‘wrong’ is driven by Ofqual’s description of the grade corresponding to the mark given by a ‘senior’ examiner as being ‘definitive’ or ‘true’. Ofqual’s research shows that, for example, for Geography, 65% of scripts are awarded the ‘definitive’ or ‘true’ grade, implying that the remaining 35% of grades, as actually awarded, are ‘non-definitive’ or ‘untrue’. I prefer the plain English ‘wrong’.
In essence, Ofqual are defining the ‘right’ mark and the ‘right’ grade as “that given by a senior examiner”. Which is fine, as long as all senior examiners in any given subject always give exactly the same mark to the same script, and as long as only senior examiners mark scripts. In practice, the first condition is unproven, and the second doesn’t happen. So the result is that students’ grades are the result of a lottery of which examiner happens to mark any given script – a lottery that gives results such that 25% of those grades would be different, had a senior examiner marked the script. To me, that’s bad news. I can’t trust any result for I don’t know which examiner did the marking, and there is a possibility – quite a high possibility in fact – that the grade would have been higher otherwise. The grade I was “awarded” is therefore untrustworthy, unreliable… And to make matters worse, since 2016, Ofqual’s policy for appeals has denied me the opportunity of requesting a fair re-mark…
Which brings me back to solutions. How many ways can we think of that accept fuzziness (and yes, let’s try sensibly to minimise that, but whilst it’s there, it’s real) and design assessments that are reliable and trustworthy, in the sense that I can have a high degree of confidence that my original grade will be confirmed, and not changed, as the result of a fair re-mark?
GCSEs are ridiculous: they only test you on how much you can shove into your brain at one time, and this could be said of any exam for that matter. They do not show the student’s full learning abilities. We should take the time now, during COVID-19, to look into other ways of testing students on their learning ability. For example, look at Canada and Scandinavia: they use teacher-assessed grades and they have the most successful education systems in the world. Why are we not looking at these countries and developing? The basis of our education system today is still the same as it was 150 years ago, and that is deeply tragic. So let’s change!