Ofsted’s worrying reliability findings won’t comfort heads who get ‘the call’ in a couple of months

27 Jun 2019, 9:36

In a report on the future of Ofsted that accompanied the launch of our new think tank called EDSK (short for ‘education and skills’), we delved into the research on whether people agreed with each other when observing schools and teachers.

It transpired that when two observers, however experienced, walked into a classroom and started recording what they saw, they were likely to reach different judgements in as many as 50% of cases. Imagine that multiplied across more than 20,000 state schools.

Yesterday, Ofsted released their own research that set out to test the reliability (consistency) of inspectors’ decisions. This involved measuring the reliability of lesson visits and work scrutiny – two of the three main pillars of the new ‘quality of education’ judgement in the upcoming inspection framework. A reliability study reports its results as a score on a sliding scale from 1 (perfect agreement between inspectors) to 0 (no agreement).

On lesson visits, Ofsted found that “reliability between observers was good in both the primary and secondary schools’ sample”, with a score of approximately 0.6 (on the borderline between ‘moderate’ and ‘substantial’ agreement).
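For readers who want to see how a score like this behaves, here is a minimal sketch. It assumes the statistic is Cohen's kappa and that the labels ‘moderate’ and ‘substantial’ follow the widely used Landis and Koch bands; the article does not name the statistic Ofsted used, so both are assumptions. The function names and the ratings are made up for illustration and are not Ofsted data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items.

    Assumption: this is one common agreement statistic; it may not be
    the one Ofsted used in its study.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which the two raters gave the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own grade frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical grade throughout
    return (observed - expected) / (1 - expected)


def landis_koch_band(kappa):
    """Verbal label for a kappa value (Landis and Koch bands, assumed here)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical grades from two inspectors watching the same ten lessons.
inspector_a = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
inspector_b = [1, 1, 1, 1, 2, 2, 2, 2, 2, 1]

raw_agreement = sum(a == b for a, b in zip(inspector_a, inspector_b)) / len(inspector_a)
kappa = cohens_kappa(inspector_a, inspector_b)
print(f"raw agreement: {raw_agreement:.0%}, kappa: {kappa:.2f}")
```

In this made-up data the two inspectors agree on eight lessons out of ten, yet kappa comes out at about 0.6, because half of that agreement would be expected by chance alone. Under the same assumed bands, a score of 0.55 would read as ‘moderate’ and 0.65 as ‘substantial’.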

Even so, these results were achieved by Her Majesty’s Inspectors (HMIs) – the most experienced individuals at Ofsted’s disposal. When HMIs were paired together for observations, their reliability score was around 0.65. However, this dropped to about 0.55 when an HMI was paired with a less experienced Ofsted inspector.

HMIs are outnumbered 9-to-1 in Ofsted’s workforce, and you could easily have a school inspection that does not involve an HMI. In fact, of the 16 indicators for judging lessons that Ofsted tested, “none of the indicators achieved a substantial level of reliability in the HMI and non-HMI pairings.” Ofsted’s study did not even test reliability between two Ofsted inspectors.

For work scrutiny, Ofsted compared the verdicts of just nine HMIs in five subjects, with 15 exercise books typically being checked by two HMIs per subject. They were asked to scrutinise work on four indicators (e.g. ‘pupils’ progress’). None of the four indicators produced reliability scores above 0.5, and one indicator produced a score of just 0.38. Astonishingly, Ofsted concluded that “this suggests that HMI rated reliably”.

Furthermore, these numbers were being propped up by work scrutiny at primary level. For secondaries, the results were dismal, with reliability scores of 0.22, 0.59, 0.32 and 0.21 across the four indicators.

Ofsted noted that the sample size was smaller for secondaries than for primaries, but they also acknowledged that non-specialists could struggle at secondary level “where subject matter is more complex.” Remember that the inspector who visits your school could well be a non-specialist in the subject they are inspecting.

To cap it all off, Ofsted admitted that work scrutiny might not be possible in special schools, it may not work in further education and skills, it probably won’t be any use when judging “alternative methodologies in teaching and learning” (e.g. Montessori schools) and it might not produce anything useful for modern foreign languages.

Moreover, Ofsted said the amount of work in workbooks at the beginning of an academic year “may not be sufficient for inspectors to make a valid and reliable judgement about curriculum and learning progression”. Their solution? Workbooks from the last few months of the previous academic year should be made available to inspectors. I can guess the reaction from teachers to that suggestion, and I doubt it would be very polite.

Ofsted’s response to their own findings was to promote the virtues of more training for inspectors and more detailed subject guidance. Both may improve reliability scores, but we have no idea how bad the situation is (or will be) for the more numerous Ofsted inspectors as opposed to a handful of HMIs. Our EDSK report on Ofsted called for the new framework to be delayed by a year to ensure that new processes such as work scrutiny were rigorously evaluated before being rolled out. These new studies show exactly why our recommendation was so pertinent.

In summary, Ofsted’s research is a welcome sign of engagement with some critical issues, but the worrying findings will not be any comfort to a headteacher who gets ‘the call’ in a couple of months from now.

Your thoughts


  1. Terry Pearson

    When Ofsted set out to test the reliability of inspectors’ overall judgements during short inspections in March 2017, they set a target of 80% agreement between inspectors. Only HMI took part in the test and an agreement of 92% was achieved. Consequently, Ofsted proclaimed profusely that this showed inspectors’ judgements were reliable. Ofsted’s report of the test failed to draw attention to significant flaws in the testing methodology which meant that the results obtained should not be used as a sign of the reliability of short inspection outcomes at all. A detailed review of that test is available on this link: https://www.researchgate.net/publication/327894743_A_review_of_Ofsted's_test_of_the_reliability_of_short_inspections
    Here we are again more than two years later with Ofsted continuing to use a flawed testing methodology only this time getting results that are much worse. Ofsted really must stop making claims about inspectors’ judgements being reliable until it has developed a suitable methodology for testing the reliability of them.

  2. Tom Burkard

    Already, we rely far too much on subjective assessments of pupils’ learning. Ofsted inspections of teaching and learning are inherently unfair – inspectors’ biases will inevitably vary, and schools have to prepare for a range of expected outcomes.

    ‘Bold Beginnings’, the Ofsted report on teaching in Reception Year, focussed on the curriculum and pedagogy at primary schools with outstanding results. This was also the approach used by the Rose Review of early reading instruction, and arguably this has substantially reduced reading failure. Giving less successful schools a clear roadmap of what works has to be a more positive means of improving them than the stressful and subjective process of inspection.