It has long been an adage of the education profession: your school’s inspection outcome depends on the inspector you get. Trade union conference halls are often filled with stories of nightmare inspections and unjust outcomes.
Myth or not, Ofsted has come under increasing pressure to prove the reliability of its inspectors and system of inspection.
As a result, Ofsted last week reported the results of the first study of its kind into inspection outcomes – with notably positive headline findings.
However, the report has been criticised for a very small, primary-only sample as well as other limitations that critics say make it inappropriate to draw too many conclusions about the reliability of inspection as a whole.
Chief inspector Amanda Spielman said the research was a “small-scale exploratory study” and also acknowledged its limitations. The report itself says that the study is just “the first step towards collecting a body of evidence on the reliability of inspection practice”.
The study focused solely on the short inspections for schools rated “good”, which were introduced in September 2015. It aimed to evaluate “how frequently two inspectors independently conducting a short inspection of the same school on the same day agreed whether the school remained good or whether they needed further evidence to reach a secure decision”.
The findings are based on 24 completed short inspections of primary schools rated as “good”. The schools were inspected by a lead inspector as usual, but a second inspection was carried out simultaneously by a second “methodology inspector”. The two were instructed to work independently of one another.
The results show that in 22 of the 24 inspections, the Ofsted inspectors both reached the same final decision – a rate of 92 per cent.
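The headline figure is simple arithmetic, and a rough calculation also shows why critics point to the sample size. The sketch below is an illustration only and is not taken from the Ofsted report: it recomputes the agreement rate from the 22-out-of-24 result and, as an assumption on our part, uses a Wilson score interval (a standard method for small samples) to indicate how much uncertainty surrounds a rate based on 24 inspections. The function names are hypothetical.

```python
import math


def agreement_rate(agreed, total):
    """Share of paired inspections in which both inspectors agreed."""
    return agreed / total


def wilson_interval(agreed, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    This is our own illustrative choice of method, not one used in the report.
    """
    p = agreed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Figures from the study: 22 agreements in 24 paired short inspections.
rate = agreement_rate(22, 24)
low, high = wilson_interval(22, 24)

print(f"Agreement rate: {rate:.0%}")
print(f"Approximate 95% interval: {low:.0%} to {high:.0%}")
```

With so few inspections the interval is wide, which is consistent with the report's own caution about the sample.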
In the two inspections where the inspectors disagreed, one case involved the inspectors “interpreting the school’s self-evaluation document and the initial discussion with the senior leadership team in different ways”. The second disagreement was “more clearly associated with inspectors undertaking different inspection activities with different people”.
In one disagreement, one inspector thought the school was “still good” while the second wanted to convert the inspection into a Section 5 because they thought the school potentially outstanding.
In the second, both inspectors wanted to convert to a Section 5, one because they wanted more evidence to justify a “good” rating and the other because they thought the school potentially outstanding.
Both of these schools were ultimately found to have remained “good” after the Section 5 inspections.
The report concludes that Ofsted short inspection reliability comes down to four factors:
- The structure of the short inspection framework, whereby inspections can be converted if there is a lack of evidence.
- The triangulation of evidence from different inspection activities across the day (not just the leadership meeting).
- Ofsted’s quality assurance and complaint procedures.
- Inspector training.
The education unions and Ofsted itself have highlighted a number of limitations in the study – not least the size and make-up of the sample and the fact that the findings cannot be applied to full inspections.
The report itself admits that ensuring the two inspectors remained independent of each other while conducting their investigations in the same school on the same day was difficult.
A key problem was that only four of the inspections were monitored. The report states: “The lack of sufficient involvement of independent observers across the sample means we cannot be completely sure that all of the inspectors in the test inspections arrived at their decisions independently.” It adds: “Unintentional influence by inspectors cannot be completely ruled out.”
Furthermore, some of the methodology inspectors complained that the process was “too artificial” – the lead inspector in each school had priority whereas the so-called “methodology inspector” sometimes found it difficult to carry out inspection activities as they would on a routine inspection.
On the sample size, the report adds: “While a relatively high level of inter-rater agreement has been achieved, the small sample size means there is limited external validity to these findings. It is someway short of the calculated sample required for statistical validity.”
In her monthly commentary as chief inspector, Ms Spielman also acknowledges the limitations. She writes: “There are, of course, limitations to a small-scale exploratory study like this that need to be taken into account. The findings cannot be extrapolated across other types of inspections or all types of institution. For instance, the study looked only at short inspections of primary schools in a certain size range and it had a relatively small sample.”
Reaction from the teaching and school leadership unions has been critical of the study’s shortcomings, although Ofsted has received widespread praise for embarking on research into inspection reliability in the first place.
Russell Hobby, general secretary of the National Association of Head Teachers, said the findings were “encouraging”, but that they still imply that roughly one in 10 short inspection outcomes could be wrong. However, he added: “The sample size here is small, so we would like to see a more expansive study to look into short inspections to give a more accurate snapshot of the current system.”
Malcolm Trobe, interim general secretary of the Association of School and College Leaders, agreed. He said the study was “an important first step” in ensuring that Ofsted judgements are “consistent and reliable”. He added: “It involved a relatively small sample of schools, and we hope to see similar studies involving a larger sample in the future, and which also look at full inspections.”
The Association of Teachers and Lecturers, meanwhile, warned that the fact all the schools in the sample were rated “good” constricted the research.
General secretary Dr Mary Bousted said: “In reality, away from the constraints of research methodology, inspectors are called upon to make much more complex judgements.
“In secondary schools, where a greater range of subjects are taught, inspectors have to make an overall judgement of the quality of a whole school. Given that there is much greater variation in teaching quality within schools than between schools, we question how valid and reliable these inspection judgements really are.”
The National Union of Teachers was clear that the independent verification of the study fell short. It pointed to the lack of independent observation but also to some aspects of the inspection. It said: “Both the lead and methodology inspector jointly attended the initial meeting with the headteacher where the logistics of the inspections were agreed, the head and senior leadership team explained their self-evaluation and key lines of enquiry were agreed. It is difficult to accept that the two inspectors would begin their independent inspections without some shared assumptions at the outset.”
Ofsted is clear that this is a “first step” in its research programme. Ms Spielman said: “As an initial attempt at evaluating reliability, these findings should provide some reassurance that the purpose of the short inspection model is being met and that inspectors made consistent judgements.”
She continued: “This study is just a first step towards a continuing programme of research into inspection. We should routinely be looking at issues of consistency and reliability. And even more importantly, we should be looking at the validity of inspection: is inspection succeeding in measuring what it is intended to measure? This is not an easy question, in part because validity is not an absolute: it depends on the purpose of the inspection.
“We are beginning to shape up what this research programme should look like. But this is not a quick hit in which everything is sorted at once: rather, it will be a steady process in which questions are addressed systematically.”
And it is clear the unions will be keeping a close eye on this work. General secretary of the NUT, Kevin Courtney, said: “Given the limitation of this research the NUT concurs with Ofsted that further research is required in order to establish the reliability of inspection more broadly as well as investigating the overall validity of school inspection for determining school quality.
“It is frankly astonishing that reviews of reliability and validity have not been carried out previously. Many headteachers will tell you that it is the fear of Ofsted that is driving the huge over-work of teachers and in turn driving the teacher shortage.”
Chris Keates, general secretary of the NASUWT, added: “While inspection reliability in relation to short inspection is important, Ofsted also needs to demonstrate how it will ensure reliability of longer Section 5 inspections, where the risks to schools are often far higher due to the wider range of factors that are taken into account in forming overall judgements.”
Dr Bousted went further, calling for future research work to be passed to an independent body. “The time is overdue for an independent investigation of Ofsted,” she said.
She continued: “The question must be answered: does Ofsted inspect schools in different areas, with different social intakes, fairly? Do inspectors inspect the right things and do they come to accurate judgements?
“Ofsted produces an enormous ‘backwash’ in England’s schools. The consequences of a poor inspection judgement provoke fear among school leaders and an epidemic of over-work in the profession which is driving teachers and school leaders away from teaching. We have to question whether Ofsted is a catalyst for improved educational performance in our schools.”
- The report – Do two inspectors inspecting the same school make consistent decisions? – can be downloaded at http://bit.ly/2m6xD3d
- Chief inspector Amanda Spielman’s monthly commentary (March 2017), which speaks to the report, can be found at http://bit.ly/2m6sasU