Mixed-up confusion: Trying to make sense of IELTS scores

Disclaimer: I was an IELTS examiner (Speaking and Writing) for two years.

The International English Language Testing System (IELTS) is receiving a great deal of attention at the moment, particularly from people who would not normally pay much attention to the field of language assessment. That’s the result of a recently released Australian Government discussion paper, Strengthening the test for Australian Citizenship, which contains this section on page 9:

[Image: excerpt from page 9 of the discussion paper]

Since the release of this paper, Australia’s Immigration Minister, Peter Dutton, has indicated that the test will be IELTS and the minimum level will be a score of 6:

[Image: screenshot of the Minister’s statement]

Non-experts are confused about it

This has led to impassioned and articulate calls from a range of groups for the Government to reconsider – and rightly so – as well as to discussions amongst politicians and journalists in the national media, including on ABC’s Insiders program. However, some pundits seem to have trouble distinguishing IELTS from the Citizenship test. Take this tweet from comedian Charlie Pickering:

https://twitter.com/charliepick/status/879177884316745730

Pickering was responding to an article on the RenewEconomy site:

[Image: screenshot from the RenewEconomy article]

This was followed by – unbeknownst to the writer, Giles Parkinson – samples from an IELTS Reading module:

[Image: the IELTS Reading samples reproduced in the article]

Several of the 64 replies to Pickering’s tweet expressed skepticism that the Citizenship test was asking people to ‘swear allegiance to coal’ but none (except my own, that is!) pointed out that this was not from the ‘Citizenship test’ per se. A fine distinction, perhaps, but it indicates that, in addition to eliciting strong emotions from a range of ‘stakeholders’, tests can be easily misunderstood at even the most superficial level.

Experts are confused about it, too

In terms of understanding what IELTS is, at a much deeper level, language assessment ‘experts’ can get quite confused too. Exhibit A: IELTS ‘Test performance data’ for 2015, a section of the IELTS website targeted at ‘teachers and researchers’. It reports, among other things, ‘reliability estimates’ for the various Reading, Writing, Listening and Speaking modules used for live IELTS tests in 2015, the most recent year for which such data is publicly available.

Confusion over Cronbach’s alpha

The focus on ‘reliability’ on this page – further illustrated by the fact that the words ‘valid’ and ‘validity’ are not used anywhere on it – reflects a psychometric approach to test analysis and, for such analysis, “classical test theory is without doubt the most extensively used model” (Borsboom, 2005, p. 22; Borsboom describes two others: latent variable and scale). It seems safe to assume that this is the model supporting the analysis presented on the IELTS website.

It also brings to mind this comment in Borsboom (2005, p. 31):

Of all psychometric concepts, reliability plays the most important role in practical test analysis. Of course, all researchers pay lip service to validity [or not, as is the case here], but if one reads empirical research reports, reliability estimates are more often than not used as a primary criterion for judging and defending the adequacy of a test.

Based on his analysis, Borsboom (p. 31) goes on to question whether

reliability deserves this status. The theoretical acrobatics [described in detail in Measuring the Mind] necessary to couple empirical quantities, like test-retest correlations, to reliability, as defined in classical test theory, are disconcerting.

What role, then, does ‘reliability’ play in the analysis of the IELTS test performance data? This gives us a clue:

[Image: excerpt from the IELTS ‘Test performance data’ page]

If Cronbach’s alpha gives us ‘meaningful reliability values’, and Cronbach’s alpha measures the ‘internal consistency’ of a test, then reliability is obviously closely related to ‘internal consistency’, if not necessarily synonymous with it.

Cronbach’s alpha appears here unproblematised and uncontroversial, but, as Sijtsma (2009, p. 107) argues,

probably no other statistic has been reported more often as a quality indicator of test scores than Cronbach’s (1951) alpha coefficient, and presumably no other statistic has been subject to so much misunderstanding and confusion. … alpha is persistently and incorrectly taken to be a measure of the internal structure [i.e., internal consistency] of the test and hence as evidence that the items in the test “measure the same thing.”

Sijtsma argues that Cronbach himself was vague in his definition of ‘internal consistency’ and that this has led to it being interpreted in different ways. What exactly is meant by the term when used on the IELTS site? It’s not clear but, at any rate, according to Sijtsma:

Alpha is not a measure of internal consistency. Neither is it a measure of the degree of unidimensionality [i.e., the items all measure a single construct or trait] … Alpha has been shown to correlate with many other statistics and much as these results are interesting, they are also confusing in the sense that without additional information, both very low and very high alpha values can go either with unidimensionality or multidimensionality [i.e., items measure several different traits] of the data. But given that one needs the additional information to know what alpha stands for, alpha itself cannot be interpreted as a measure of internal consistency (2009, p. 119)

To recap:

It’s not clear what is meant by ‘internal consistency’
There is good reason to think that it has nothing to do with ‘reliability’
High alpha values do not necessarily indicate a high degree of ‘internal consistency’
To interpret Cronbach’s alpha, we would need additional information
It’s not clear what that additional information is or whether it is provided on the IELTS site.
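
To make Sijtsma’s point concrete, here is a minimal Python sketch. It uses made-up toy data, not IELTS data, and the function name cronbach_alpha and both data sets are my own illustrative assumptions. The only thing taken as given is the standard formula: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items.

```python
from itertools import product
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score table.

    scores: list of rows, one row per test-taker, one column per item.
    """
    k = len(scores[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*scores)]  # variance of each item
    total_var = variance([sum(row) for row in scores])   # variance of total scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data A (hypothetical): ONE underlying trait t, measured by three
# noisy items. Each item is t plus its own unrelated noise term.
one_trait = [
    [t + 2 * e1, t + 2 * e2, t + 2 * e3]
    for t, e1, e2, e3 in product([0, 1], repeat=4)
]

# Toy data B (hypothetical): TWO unrelated traits f1 and f2.
# Five items copy f1 exactly and five items copy f2 exactly.
two_traits = [[f1] * 5 + [f2] * 5 for f1, f2 in product([0, 1], repeat=2)]

alpha_one = cronbach_alpha(one_trait)
alpha_two = cronbach_alpha(two_traits)

print(f"one trait, 3 noisy items:  alpha = {alpha_one:.2f}")   # 0.43
print(f"two traits, 10 clean items: alpha = {alpha_two:.2f}")  # 0.89
```

In this toy case the items that all tap a single trait produce the lower alpha (about 0.43), while the items that split cleanly across two unrelated traits produce the higher one (about 0.89). The number alone, in other words, does not tell us whether the items “measure the same thing”.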

In response to such confusion, Borsboom (2005, p. 47) asks

Why is it that virtually every empirical study in psychology reports values of Cronbach’s [alpha] as the main justification for test use? I am afraid that the reason for this is entirely pragmatic. … In fact, this value can be obtained through a mindless mouse-click.

Even more confusion over reliability

Another curious aspect of the treatment of the IELTS test performance data is that it uses two different methods to determine reliability: one for the Reading and Listening modules and another for the Writing and Speaking modules. The website states that the

reliability of the Writing and Speaking modules cannot be reported in the same manner as for Reading/Listening because they are not item-based … Reliability of rating [the Writing and Speaking modules] is assured through the face-to-face training and certification of examiners and all must undergo a retraining and recertification process every two years.

Clearly, we are now dealing with a different type of reliability. The Reading/Listening type could supposedly be measured with Cronbach’s alpha and related to internal consistency. Writing/Speaking don’t have ‘items’ and so it is difficult to apply the notion of internal consistency.

But why would we dispense with the notion so easily if it was so important for Reading/Listening? Simply because it’s inconvenient? How should we understand ‘reliability’ in terms of the Writing/Speaking modules if it does not relate to internal consistency? And what about validity?

There are apparently ‘reliability measures’ that come out of the training and certification processes described above, and these produce ‘outcomes’ which in turn “feed back into examiner retraining and continually build on quality management and assurance systems for IELTS.” But why is the website so vague about these ‘reliability measures’ when it provides so much detail regarding the Reading/Listening modules?

There’s a startling lack of coherence within this one web page, but, as the comments of Borsboom and Sijtsma suggest, this probably reflects the confusion, even over the meaning of basic terminology, which characterises a great deal of expert discourse in the assessment/testing/measurement communities. What chance have the rest of us got of interpreting and using IELTS scores (or those from any other test, for that matter) appropriately?
