In a report on the future of Ofsted that accompanied the launch of our new think tank called EDSK (short for ‘education and skills’), we delved into the research on whether people agreed with each other when observing schools and teachers.

It transpired that when two observers, however experienced, walked into a classroom and started recording what they saw, they were likely to reach different judgements in as many as 50% of cases. Imagine that multiplied across more than 20,000 state schools.

Yesterday, Ofsted released their own research that set out to test the reliability (consistency) of inspectors’ decisions. This involved measuring the reliability of lesson visits and work scrutiny – two of the three main pillars of the new ‘quality of education’ judgement in the upcoming inspection framework. In a reliability study, agreement is scored on a sliding scale from 1 (perfect agreement between inspectors) to 0 (no agreement).

On lesson visits, Ofsted found that “reliability between observers was good in both the primary and secondary schools’ sample”, with a score of approximately 0.6 (on the borderline between ‘moderate’ and ‘substantial’ agreement).


Even so, these results were achieved by Her Majesty’s Inspectors (HMIs) – the most experienced individuals at Ofsted’s disposal. When HMIs were paired together for observations, their reliability score was around 0.65. However, this dropped to about 0.55 when an HMI was paired with a less experienced Ofsted inspector.

HMIs are outnumbered 9-to-1 in Ofsted’s workforce, so a school inspection could easily take place without an HMI on the team. Indeed, of the 16 indicators for judging lessons that Ofsted tested, “none of the indicators achieved a substantial level of reliability in the HMI and non-HMI pairings.” Ofsted’s study did not even test reliability between two inspectors who were not HMIs.

For work scrutiny, Ofsted compared the verdicts of just nine HMIs in five subjects, with 15 exercise books typically being checked by two HMIs per subject. They were asked to scrutinise work on four indicators (e.g. ‘pupils’ progress’). None of the four indicators produced reliability scores above 0.5, and one indicator produced a score of just 0.38. Astonishingly, Ofsted concluded that “this suggests that HMI rated reliably”.

Furthermore, these numbers were being propped up by work scrutiny at primary level. For secondaries, the results were dismal, with reliability scores of 0.22, 0.59, 0.32 and 0.21 across the four indicators.

Ofsted noted that the sample size was smaller for secondaries than for primaries, although they also acknowledged that non-specialists could struggle at secondary level “where subject matter is more complex.” Remember that the inspector who visits your school could well be a non-specialist in the subject they are inspecting.

To cap it all off, Ofsted admitted that work scrutiny might not be possible in special schools, may not work in further education and skills, probably won’t be any use when judging “alternative methodologies in teaching and learning” (e.g. Montessori schools) and might not produce anything useful for modern foreign languages.

Moreover, Ofsted said the amount of work in workbooks at the beginning of an academic year “may not be sufficient for inspectors to make a valid and reliable judgement about curriculum and learning progression”. Their solution? Workbooks from the last few months of the previous academic year should be made available to inspectors. I can guess the reaction from teachers to that suggestion, and I doubt it would be very polite.

Ofsted’s response to their own findings was to promote the virtues of more training for inspectors and more detailed subject guidance. Both may improve reliability scores, but we have no idea how bad the situation is (or will be) for the more numerous Ofsted inspectors as opposed to a handful of HMIs. Our EDSK report on Ofsted called for the new framework to be delayed by a year to ensure that new processes such as work scrutiny were rigorously evaluated before being rolled out. These new studies show exactly why our recommendation was so pertinent.

In summary, Ofsted’s research is a welcome sign of engagement with some critical issues, but the worrying findings will be of little comfort to a headteacher who gets ‘the call’ in a couple of months’ time.