Investigating volatility in exam results

Written by: Tim Oates

Last month, SecEd reported on research showing just how volatile a school’s exam results can be from year to year. It has led to calls for schools to be judged on a rolling basis across five years of results. Tim Oates explains.

Around 10 years ago, I was out mountain-biking with a great friend of mine, a senior teacher in a high-performing sixth-form college. It was a wonderful day, and the tracks on the South Downs were in fine condition. But he looked extremely glum.

“What’s up?” I asked. He said: “My A level results were vastly better this year...”

I immediately said: “But surely that’s good?”

“No,” he replied. “I have absolutely no idea why they are better; the pupils are roughly the same as last year, I taught the same syllabus in the same way, and these results will come back to haunt me since I can see no reason for them.

“My results will most likely be down next year and then top management will be all over me for an apparent crash in results.”

We went on to discuss what might have been going on in the pupil group, the examination and the consistency of marking, the rank order of the pupils, and so on. But I filed the conversation away, in a place in my head marked “come back to it, since this is important”.

Fast-forward 10 years, and researchers at Cambridge Assessment have been digging and probing.

The organisation includes three large awarding bodies: Oxford, Cambridge and RSA (OCR), Cambridge International Examinations, and Cambridge English Language Assessment. It administers millions of assessments every year, in more than 170 countries.

Many of the assessments are very high stakes, determining life chances. So research is not a luxury or indulgence; it is a vital part of evaluating the performance of the assessments and improving accuracy of measurement.

For the last three years, as part of a specific programme of work focusing on improving measurement accuracy, researchers in our Assessment Research and Development Department have been assiduously tracking down things which cause variability in exam results.

Variability is a problem for everyone developing, using and taking qualifications. Consistent measurement is at the heart of good assessment. It is central to measurement accuracy and public confidence in assessment.

What do I mean by variability? In an individual school, the numbers gaining specific grades can go up or down each year. This may come from differences in what children know and how they are taught, but also from differences in the assessment.

At an individual level, two similar candidates, in different years, can get a different score or grade if they prepare in different ways, face or choose different questions, or if what they do is treated differently by those making the assessment.

There is a lot going on which affects the outcomes. This means that the causes of variability are numerous, diverse and elusive.

On the part of exam boards, some of the things which have in the past given rise to variability include problematic exam questions, inconsistent marking, the way that exam standards are set and maintained, and administration errors on the ground.

But this is just the tip of an iceberg of reasons for variability. Others are not in the control of an examinations body – the variations which can occur in teaching groups, changes in the school which disrupt learning, and so on.

Systematic analysis of volatility has fallen out of fashion, with the last extensive work in England being done more than 25 years ago, by Caroline Gipps and Pat Broadfoot.

MT Kane has recently done some research on test scores in the USA. But despite this overall lack of attention, it is a very serious issue.

With millions of assessments being designed, despatched, administered, marked and graded, the necessary complexity of the system means that sources of variation can be subtle and elusive.

For a whole host of reasons, political, technical and educational, there has also been constant change in the form, content and administration of assessments – government has encouraged modularisation of qualifications and then removal of modular exams; new technology has been introduced through on-screen marking; the national regulator has determined new approaches to setting standards.

Each and every one of these can have an impact on the accuracy of assessments. We see our task as ensuring that this impact is positive, and does not detract from accuracy and fairness.

The programme of work on variation which Cambridge Assessment has put in place involves isolating and identifying where problems may be located and where improvements might be gained. It is the same meticulous approach which was adopted by the British Cycling Team in its massive Olympic success – after you have made all the obvious improvements in training processes, start homing in on all the little, hidden things; by doing something to all of them, a significant gain can be obtained.

Because of the complexity and diversity of assessment, this hunt for sources of variation is a long and painstaking process. Identifying the impact of each potential problem also requires carefully designed, meticulous research.

Our recently published work, by Tom Benton and Tom Bramley – just one small part of this large programme – carefully controls for how standards are set and removes any effect of marking inaccuracy.

The level of volatility which remains, due to the impact of factors which we necessarily termed “other”, is higher than most people expect – among 55 benchmark centres selected for stability in entry, 12 centres had a “native volatility” of more than 10 per cent, and two more than 20 per cent.

Now, “other” is not some throw-away category, as it frequently is in other social research. Here, “other” is the variation which arises in the complex thing we call schooling – a bad or particularly good dynamic among a year group; a severe bout of viral illness; a teacher finding some brilliant new resources, or losing some; disruptions to families; a series of bereavements; and many, many more.

They can all be at work, and all combine in peculiar ways in a specific year.

Any realistic educationalist knows about this complexity, but in a system where hyper-accountability holds sway, management and governance can react very adversely to a single set of poor results, particularly if the drop is substantial.

But we have shown that substantial drops in results can, and do, occur. Our research suggests that management needs to be forensic yet balanced in its analysis of the reasons for an increase or decrease.

A decrease can simply be a return to “normal” after an unusual peak. A decrease from a trajectory of improvement may not mark a collapse of teaching. But likewise, a set of poorer results should not be written off as “sometimes results go up, sometimes results go down”.

If something has indeed been substantially changed in the form of teaching then there may have been an adverse impact on grade outcomes.

Our judgement is that teaching staff and management should continually review performance through a rolling five-year set of data – a five-year picture which gives each year’s results as well as a rolling average.

Against this can be logged changes which have deliberately been made in teaching and learning. And while in the research we controlled for standards-setting and marking accuracy, schools should continue to monitor their experiences of each examination session to make sure that no errors or administration problems on the part of exam boards have arisen.
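
For readers who keep their results in a spreadsheet or script, the five-year picture can be sketched in a few lines of Python. This is only an illustration, not the method used in the research: the function name `rolling_picture`, the choice of "percentage reaching a chosen grade" as the measure, and every figure in `example` are invented for the purpose, and the sketch assumes one result for each consecutive year.

```python
from statistics import mean

def rolling_picture(results, window=5):
    """Pair each year's result with the mean of the `window` years ending in it.

    `results` maps year -> percentage of candidates reaching a chosen grade
    (an assumed measure). Years without a full window behind them get None.
    """
    years = sorted(results)
    picture = []
    for i, year in enumerate(years):
        if i + 1 >= window:
            recent = years[i + 1 - window : i + 1]
            avg = round(mean(results[y] for y in recent), 1)
        else:
            avg = None
        picture.append((year, results[year], avg))
    return picture

# Invented figures for one subject in one school (not taken from the report).
example = {2010: 62.0, 2011: 65.0, 2012: 61.0, 2013: 74.0, 2014: 63.0, 2015: 64.0}

for year, result, avg in rolling_picture(example):
    print(year, result, avg)
```

In these made-up figures, 2013 is an unusual peak and 2014 looks like a crash if the two years are compared on their own, yet the five-year average stays close to 65 throughout. Deliberate changes to teaching can be noted against the relevant year in the same table.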

Assessment is an exacting science in a very complex system; our research takes us more than a few steps closer to a better understanding of our impact. So next time I go mountain biking with my teacher friend, I’ve got a better reply.

But nobody has a definitive answer yet. Maybe we never will, but we think it is worth continuing to try.

  • Tim Oates CBE is group director of assessment research and development at Cambridge Assessment.

Further information
See the report, Volatility in Exam Results (May 2015), and SecEd’s coverage of it.

There is a significant problem with assuming you've controlled for exam-end issues by just looking at marking reliability. Marking can be reliable while the judgements lack validity. In fact, I think this is often the case. Indeed, the more a board focuses on reliability, the more it narrows the possible responses that can score marks, and the more variability will be seen in results.