**Published: Friday 13th February on EEF website.**

One should always be wary of donning the mantle of an expert.

I am no statistician. I have no pedigree as an education researcher or academic. What little teaching experience I have was acquired through the PGCE I completed before many of you were born.

But I do have a strong grounding in education policymaking. And I’ve spent much of the past five years monitoring educational publications, press releases and media coverage, not infrequently pointing out the inconsistencies between them.

So my antennae were twitching when the Education Endowment Foundation selected Friday February 13 to publish the results of its randomised controlled trials of Mathematics Mastery, one of its first four high-profile awards back in October 2011.

You don’t choose the Friday before half term to broadcast good news.

The EEF’s press release featured nine different projects. Maths Mastery was granted a single paragraph.

Over the course of the day, press releases also appeared from academy chain Ark, which originated the Maths Mastery programme, and from the eponymous organisation they set up to run it.

Both seemed rather more positive than was warranted by the outcome of the trials, so I worked through all the published material and wrote a blog post about what I discovered.

There were two principal reports – one focused on two successive Year 1 cohorts; the other on implementation in a single Year 7 cohort, plus a related process evaluation. The outcomes had also been combined through meta-analysis.

According to the EEF’s rating scale, the effect size from the primary school evaluation showed that the average pupil following Maths Mastery would make two months’ more progress than the average pupil in the control group. This fell to one month’s additional progress for the secondary evaluation. The same was true of the meta-analysis. EEF describes all these effect sizes as “low”.

But the effect sizes were qualified by 95 per cent confidence intervals. The toolkit’s technical appendices explain that: “If the confidence interval includes zero, then the effect size would be considered not to have reached conventional statistical significance.”

According to the EEF’s summary report, the lower bound of the confidence interval was negative for the primary and secondary evaluations and zero for the two combined. Given this, the assumption would be that none of the three effect sizes is statistically significant.

Yet the EEF, Ark and Maths Mastery press releases all claimed statistical significance for the meta-analysis. How could this be?

It turns out that, whereas the table in the EEF’s project summary shows confidence intervals to two decimal places, the table describing the outcomes of the meta-analysis provides them to three decimal places. So “0.00” becomes “0.004”.

As the full report said: “…the pooled effect size of 0.073 is just significantly different from zero at conventional thresholds.”

Statistical wizardry rescues the outcome from statistical insignificance, but the distinction is marginal.
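The rounding effect described above can be reproduced in a few lines. This is a minimal sketch in Python, not the evaluators' method: the helper names `format_bound` and `lower_bound_excludes_zero` are hypothetical, and the only figure taken from the reports is the lower bound of 0.004.

```python
def format_bound(value: float, decimals: int) -> str:
    """Render a confidence-interval bound to a fixed number of decimal places."""
    return f"{value:.{decimals}f}"


def lower_bound_excludes_zero(lower: float) -> bool:
    """For a positive effect size, the interval excludes zero only if its
    lower bound is strictly greater than zero."""
    return lower > 0


lower = 0.004  # lower bound reported for the meta-analysis

print(format_bound(lower, 2))                       # 0.00  (two decimal places)
print(format_bound(lower, 3))                       # 0.004 (three decimal places)
print(lower_bound_excludes_zero(lower))             # True
print(lower_bound_excludes_zero(round(lower, 2)))   # False once rounded
```

The same number reads as "includes zero" or "excludes zero" depending only on how many decimal places are displayed, which is why the distinction is so marginal.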

I was even more disturbed to find Ark claiming that this effect size for one year of Maths Mastery could simply be multiplied to calculate the impact of full immersion: “A two-month gain every primary year and one-month gain every secondary year could see pupils more than one and a half years ahead by age 16 – halving the gap with higher performing jurisdictions.”

The maths is a little iffy, the logic more so.
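To see why the arithmetic looks shaky, the multiplication can be written out. This is only a sketch: the function name `naive_cumulative_months` is hypothetical, and the number of school years counted (six primary years from Year 1 to Year 6, five secondary years from Year 7 to Year 11) is my assumption, since Ark's statement does not say which years it included.

```python
def naive_cumulative_months(primary_years: int, secondary_years: int,
                            primary_gain: int = 2, secondary_gain: int = 1) -> int:
    """Add up the claimed per-year gains as if they simply accumulate.

    Gains are in months of additional progress per year of the programme.
    This mirrors the multiplication in the quoted claim; it does not model
    whether gains would actually persist or compound.
    """
    return primary_years * primary_gain + secondary_years * secondary_gain


# Assumption: Years 1-6 in primary, Years 7-11 in secondary.
print(naive_cumulative_months(6, 5))  # 17 months, just short of 18

# Assumption: Reception is counted as a seventh primary year.
print(naive_cumulative_months(7, 5))  # 19 months
```

On the first count the total falls just short of one and a half years; it only exceeds it if an extra primary year is included. Either way, the sum takes for granted that each year's gain carries forward undiminished, which is the weaker part of the logic.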

Fortunately Ark subsequently amended this, though it continues to claim that: “…the data indicates that the programme may have the potential to halve the attainment gap with high performing countries in the Far East.”

I ended my post by showing how these findings might be summarised more accurately, because evidence-based policy demands evidence-based publicity.

All three bodies seem worryingly impervious to this constructive criticism, so perhaps we need a code of practice to control publicity material built upon the outcomes of EEF evaluations.

Asked to respond to this review, Ark said: “We are encouraged by the IoE’s judgement that the extra progress made was statistically significant, but as a long term programme, we are mindful not to overemphasise test results from only one year of our support. We look forward to results of the follow-up studies and to working closely with partner schools to develop our support year-on-year.”

## Stephen Gorard

March 12, 2015 at 5:26 pm

Thanks. I think these kinds of problems are very easily avoided simply by avoiding presenting confidence intervals (and significance tests) in the first place. No one knows what they mean. In these two studies the attrition (missing scores from pupils randomised to treatments) rate was over 18% for primary and over 23% for secondary schools. This means that even if confidence intervals made sense they cannot be used here. The allocations are no longer random, so the probability calculations presented in the reports are plain wrong. The key question is whether around 20% attrition, as a source of potential bias in the findings, could explain away an effect size of around 0.06. In my view, easily so. There is no need for the confusing CIs, p values and regression results which in my view serve only to confuse the reader.

## Dylan Wiliam

March 12, 2015 at 6:39 pm

Stephen is, as ever, right about the impact of these levels of attrition on effect sizes, but there are other considerations here too.

The first point is that you either accept the logic of null hypothesis significance testing (NHST), or you don’t. Stephen, like many statisticians, does not. But if you do accept the logic of NHST then even if your result achieves the predetermined level of significance by a whisker, you accept it. And if it does not, then you say that the result is not significant. You do not claim that it is “bordering on significance” nor, as Ronald Coase famously remarked, do you “torture the data until it confesses” by doing multiple significance tests.

The second, rather technical, but incredibly important, point is that the EEF has this rather strange notion that one year’s progress is one standard deviation. It is for key stage 1, but for students over the age of 7, one year’s progress is typically around 0.4 standard deviations. An effect size of 0.073, if it is correct, would not be a “small” effect size if it was achieved with secondary-aged students. It would be an increase in the rate of learning of almost 20%. Tim is right that you cannot aggregate this year on year. Economists of education typically assume that 30% of increased achievement is lost each year, so that five years of a 0.073 effect size would add an extra year of learning in secondary school. And if you hear anyone saying that an effect size of 0.3 is “small” because Jacob Cohen said so in 1988, be aware that you are listening to someone who does not know what they are talking about.
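The conversions in this comment can be sketched as follows. The function names are hypothetical, and the way the 30% annual loss is applied (each year's accumulated gain is multiplied by 0.7 before the next year's gain is added) is one possible reading of the assumption, not necessarily the one intended here.

```python
def relative_learning_rate(effect_size: float, sd_per_year: float) -> float:
    """Express an effect size as a fraction of one year's typical progress."""
    return effect_size / sd_per_year


def cumulative_effect(annual_effect: float, years: int, retention: float) -> float:
    """Accumulate a yearly effect, keeping only `retention` of earlier gains
    each year (retention=0.7 corresponds to losing 30% per year)."""
    total = 0.0
    for _ in range(years):
        total = total * retention + annual_effect
    return total


effect = 0.073      # pooled effect size from the meta-analysis
sd_per_year = 0.4   # typical yearly progress after age 7, as stated above

# 0.073 / 0.4 = 0.1825, i.e. "almost 20%" of a year's learning.
print(round(relative_learning_rate(effect, sd_per_year), 4))  # 0.1825

# Five years with 30% of earlier gains lost each year (assumed geometric decay).
with_decay = cumulative_effect(effect, years=5, retention=0.7)
# Five years with no loss at all, for comparison.
no_decay = cumulative_effect(effect, years=5, retention=1.0)

print(with_decay / sd_per_year)  # cumulative gain in years of learning, with decay
print(no_decay / sd_per_year)    # cumulative gain in years of learning, no decay
```

The cumulative figure is sensitive to how the decay is modelled, so the two results are best treated as illustrative bounds under these assumptions rather than as a check on any particular number.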

However, the broader point is this. Researchers can afford the luxury of saying that “more research is needed.”

Those actually doing education at the sharp end, in classrooms, have to decide whether a programme such as Mathematics Mastery might be an improvement on what is happening in their school right now. And while some of the “spin” regarding the result is unhelpful, I would regard it as quite reasonable to conclude, on the basis of the evidence presented in the EEF evaluation, that Mathematics Mastery was worth a try.

## Gifted Phoenix

March 16, 2015 at 10:43 am

The two comments above prompted a vigorous debate on Twitter which I have embedded at the bottom of my original post here: https://giftedphoenix.wordpress.com/2015/02/22/maths-mastery-evidence-versus-spin/

The discussion is captured in reverse order, with the most recent tweet at the top, so you’ll need to scroll down to the beginning to read it in chronological order.