Ofsted inspections are unreliable by design

How can we know which schools are good if inspectors are inconsistent and biased and the data is wrong, asks Becky Allen.

We want school inspection to be able to tell us where the quality of curriculum and instruction is good. If we could show the psychologists Daniel Kahneman and (the late) Amos Tversky how we do this, I suspect they would express concern about the reliability of inspector judgment.

The author Michael Lewis introduces us to their work in his brilliant book, The Undoing Project. In an early chapter we meet Daryl Morey, manager of the Houston Rockets and self-confessed nerd, who is trying to understand why scouts make poor choices when picking players for the National Basketball Association draft. He might as well have been talking about why inspectors cannot spot the best schools, because all the same arguments apply.

It seems self-evident that if you want to know how well a school is doing, you visit it rather than rely on performance data alone.

But just as Lewis argues that watching a prospective hire play a single game might be worse than not doing so, it is possible that a short inspection leads to worse judgments than having no inspection at all. As he says, if the data on a player says he was a great free-throw shooter in college, it is worse than useless if the scout sees him miss a free throw a few times during a workout match. It is worse because, once a scout has seen something happen, he finds it almost impossible to discount it, even in the face of data that suggests it is not generally true.

As it happens, we currently have a system where the Ofsted judgment rarely diverges much from a data judgment. By the time you finish this article you may think this is something of a relief. In past research we showed that where Ofsted inspectors made judgments that were misaligned with the data that would have been available to them at the time, these judgments did not act as leading indicators for future exam performance. They did not spot the schools on the cusp of a material improvement in their performance, nor those where exam outcomes were about to significantly decline.

Kahneman and Tversky would say that the problem is not simply that the inspection (or the basketball match) is short. It is rather that the inspector is a human.

Like all of us, they use mental shortcuts (called heuristics) to make sense of the new information they encounter, constantly trying to associate it with existing patterns of information to make models of the world. The way we compile these patterns can create systematic biases that undermine the validity of the opinions we form.

For example, these theories suggest that a human would form a near instant impression as he or she walks into a school, around which all other observations tend to organise themselves (anchoring bias). That the human mind finds it too hard to see things it didn’t expect to see, and too easy to see what it wants to see (confirmation bias).

That we humans place too much importance on one aspect of an event, such as observing one student swear at another in a classroom (focusing effect). That inspectors unconsciously might favour heads and teachers who remind them of how they used to run a school, and will construct entirely unrelated arguments as to why they like them (mere familiarity effect).

They may also prefer heads or schools that visually look like those that are known to be great (halo effect). And finally, that inspection team opinions will quickly and unconsciously converge to minimise conflict (a bandwagon or false consensus effect).

These mental shortcuts are thought to be adaptive, leading humans to make faster decisions when timeliness is more valuable than accuracy. Because they are “built-in” mechanisms it is almost impossible for individuals to be aware that they are invoking them, and thus to safeguard against any illogical interpretations or inaccurate judgments that may arise.

I know of no research that determines how and when these well-established heuristics are employed by inspectors in a way that undermines the validity of the judgments they make. If we are to continue to use humans to make high-stakes judgments on schools then we should probably figure this out.

If humans are unreliable, does this mean that data inspection wins? Not so fast. We know that data is currently used naively to make poor judgments on schools. This is why junior schools are judged less favourably than infant schools: unlike all-through primary schools, they are unable to depress the key stage 1 baseline from which they are measured.

I worry about our primary school performance metrics because I regularly hear of material inconsistencies in the way that 11-year-olds are “supported” during SATs.

We reward schools that enter native speakers for exams designed for second language speakers and those using the European Computer Driving Licence for reasons other than its value to the student. We hail schools for fantastic results, neglecting to ask what happened to the pupils who disappeared from the roll before the year 11 spring census.

We have a duty to address these issues to create more reliable statutory assessments and performance metrics.

It is perfectly possible that inspectors are human (that is, unreliable and biased) and the data is wrong. What then?

In the book, Morey says of his earlier career in management consultancy: “the consultant’s job is to feign total certainty about uncertain things”. Such is the job of the Ofsted inspector.

Morey couldn’t hack it as a consultant because he was a nerd – a person who knows his own mind well enough to mistrust it. Morey says that basketball scouts are not nerds. Neither, I fear, are school inspectors.

We need to stop pretending we know for certain which schools are doing a good job and lower the stakes associated with inspection judgment. No more forced academisation and pushing out headteachers based on flawed data and a few hours of some humans walking around a school.

As Lucy Crehan said at the Headteachers’ Roundtable summit last week, most high-performing school systems manage just fine with an accountability system that promotes responsibility and answerability, rather than culpability and liability.

This is not to say that school inspection should not have a role in our system. It is possible that the threat of inspection, day-in-day-out, leads to better practice in schools that outweighs the obvious dysfunctional behaviours it creates.

This alone would be a good argument for universal inspection, where every school has a non-zero chance of being visited each week, and where data is used to weight the probability of inspection at a school by the risk it needs support to improve.
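
To make the idea concrete, here is a minimal sketch in Python of how such a weekly draw might work. It assumes each school already has a data-derived risk score between 0 and 1. The function names, the base rate and the extra rate are all illustrative assumptions, not anything Ofsted actually uses.

```python
import random


def weekly_inspection_probabilities(risk_scores, base_rate=0.005, extra_rate=0.02):
    """Turn risk scores (0 to 1) into weekly inspection probabilities.

    base_rate and extra_rate are hypothetical values chosen for illustration.
    Every school gets at least base_rate, so no school is ever exempt;
    riskier schools get proportionally more on top of that.
    """
    probabilities = {}
    for school, risk in risk_scores.items():
        if not 0.0 <= risk <= 1.0:
            raise ValueError(f"risk score for {school} must be between 0 and 1")
        probabilities[school] = base_rate + extra_rate * risk
    return probabilities


def draw_inspections(probabilities, rng):
    """Decide independently for each school whether it is visited this week."""
    return [school for school, p in probabilities.items() if rng.random() < p]


# Hypothetical schools and risk scores, purely for illustration.
risk_scores = {"School A": 0.0, "School B": 0.4, "School C": 1.0}
probabilities = weekly_inspection_probabilities(risk_scores)
visited_this_week = draw_inspections(probabilities, random.Random(2017))
```

The design choice that matters is the floor: because the base rate is added before any weighting, even a school with apparently perfect data can still be visited, which is what keeps the threat of inspection universal.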

We really do need to know where best practice lies in our system so that we can share good practice and identify where support is needed.

We should be using human inspections and inhuman data to set up the best process we can for working out whether a school is truly well-functioning or not.

We should keep trying to do this better, knowing that we will never be able to answer this question with any certainty.


Rebecca Allen is director of Education Datalab

Your thoughts

One comment

  1. Stephen Fowler

    Where there is an increased level of political correctness, the Ofsted ratings will become more and more meaningless. In fact they will start to become inverted, with the most politically correct, and therefore the worst schools, scoring the highest. What measures is the government taking to ensure that the politically correct types do not dominate the Ofsted inspection process?