[Listening to: Ode to Joy by Wilco; Nuclear War EP by Yo La Tengo; Too Pure – The Peel Sessions; Born into Trouble as the Sparks Fly Upward by the Silver Mt. Zion Memorial Orchestra & Tra-La-La Band]
Since my last post, I have been confirmed as a PhD candidate. As such, I’m on a Research Training Program scholarship: for each year of my candidature, the Federal Government pays my university somewhere between $28,000 and $44,000 to provide me with research training, supervision and support. For this reason, I consider myself to be very fortunate and, increasingly, I am thinking about what I can give back to my communities. I think this is where blogging can be of use as a relatively open-access place to share some thoughts.
What follows are some thoughts on the Literacy And Numeracy Test for Initial Teacher Education students, or LANTITE. Earlier this year, I sent these thoughts to an Education Review journalist, who included some of the material in an article titled ‘Is the LANTITE contributing to the ‘collapse’ of the Australian teaching profession?’ That article is behind a paywall so I have posted my full comments here.
What is LANTITE?
LANTITE is really two tests, a Literacy test and a Numeracy test. They were introduced in 2016 as part of a series of changes to the regulation of Initial Teacher Education (ITE) (i.e., Bachelor of Education and Master of Teaching courses) in Australia. The first of these changes was the introduction of the Australian Institute for Teaching and School Leadership (AITSL) in 2010 (Dinham, 2011). AITSL then developed, as part of a ‘national accreditation system’, ‘Program Standards’ which stipulated that ITE entrants “will possess levels of personal literacy and numeracy broadly equivalent to the top 30 per cent of the population” (Education Services Australia, 2015, p. 14); these were implemented in 2013 (TEMAG, 2014).
Next, in 2014, the Teacher Education Ministerial Advisory Group (TEMAG) was established by then Education Minister Christopher Pyne and given the task of working out how to make ITE “more practical” (SBS News, 2014). Pyne, apparently, was keen to improve ‘teacher quality’ and saw that the only way the federal government could do this was by putting pressure on university courses (SBS News, 2014).
TEMAG recommended the introduction of a standardised test to check that ITE students met the ‘personal literacy and numeracy’ standard (TEMAG, 2014). In 2015, the Department of Education, Skills and Employment published a request for tender and the Australian Council for Educational Research (ACER) was the successful tenderer. ACER, according to a report in the Sydney Morning Herald,
received a $1 million government tender to design and run the tests until 2018. It will also receive up to $3.7 million a year in revenue from student fees given around 20,000 education students graduate annually. (Knott, 2016)
ACER is still designing and running the tests.
This section of ACER’s LANTITE website states the following:
The test has been developed to rigorous professional and technical standards. Test questions are designed and developed by a team of ACER test writers, specialists in their fields, and reviewed by panels of external experts. All test questions are also subject to trial testing, statistical analysis and final review. The content, style and duration of the test are determined to ensure that the testing program is relevant, fair, valid and reliable.
The test data are subjected to statistical analysis to check that each test question has performed as required. Test questions in development are carefully scrutinised in an ongoing attempt to minimise gender, ethnic or religious bias, and to ensure the test is culturally fair. The test may contain a small number of trial questions. These questions will not contribute to candidate scores. This is standard practice in secure testing.
The last sentence refers to ‘standard practice’ in the educational testing field. ‘Standard practice’ in educational testing certainly does include teams of people writing and scrutinising test items; trialling these items; and analysing these items statistically. These are important steps in the process of validating the test: determining to what extent test scores mean what they are assumed to mean and can be used as an appropriate basis for decisions about test-takers. For example, in the case of the LANTITE Literacy Test, the public assume that ‘more literate’ test-takers score higher on the Literacy Test than ‘less literate’ test-takers, and that test-takers who do not ‘achieve the standard’ do not have an adequate level of literacy to be effective teachers, etc. Validation involves investigating these assumptions using data gathered about how different test-takers respond to different Literacy Test items. If these assumptions turn out to be unsupported by the analysis, it would be unfair and unethical to use the test scores as the basis for decisions as to whether individual test-takers were ‘literate enough’ to be effective teachers.
It is also standard practice in the educational testing field to release publicly the results of this design-trial-analysis process. This allows the public to understand what the testers are claiming about the fairness, validity and reliability of the test, and to see some of the data which supports these claims. Testing organisations frequently release this analysis and data in the form of ‘technical reports’. ACER have released such reports for other tests that they have developed (see here and here). Technical reports for NAPLAN are released annually by ACARA. Sets of statistics are also released annually for the UK’s version of LANTITE, the Professional Skills Test, which, incidentally, was scrapped last year to “allow universities and schools to better identify the individual needs of each trainee and offer them extra support to strengthen their skills where needed”. Other illustrations of this standard practice can be found easily in the technical reports for IELTS, SAT, TOEFL, Pearson Test of English, and so on.
Contrary to this standard practice, ACER have released no such technical reports or sets of statistics relating to LANTITE. This is puzzling, especially given the public significance of the test. It is hard to imagine that the technical reports don’t exist at all; it is more likely that they do exist but a decision has been made – I’m guessing within the Federal Department of Education – not to release them. It is even more puzzling when you consider that ACER’s CEO, Geoff Masters, is a globally-recognised expert on the technical validation of tests, specifically using the Rasch measurement technique. It seems likely that Rasch measurement is used to validate LANTITE but we simply do not know because ACER has not said anything publicly about the validation process.
What this means is that ACER’s claim that the LANTITE “testing program is relevant, fair, valid and reliable” cannot be meaningfully evaluated by the public. The Australian public has, to date, simply been expected to take for granted that LANTITE does the job it is supposed to do. In my view, this situation is unacceptable. It would be unacceptable even if there were no specific reasons to doubt LANTITE’s validity, reliability and fairness. However, there are specific reasons to doubt LANTITE’s validity, reliability and fairness, and that’s what I want to explain next.
A person’s LANTITE score, on either the Literacy or Numeracy Test, is an aggregate of their responses to individual test items. In Rasch measurement, it is assumed that there are at least two main factors which determine how people respond to individual test items: the person’s ability, and the item’s difficulty. As Merton Krause argues here, there is an intractable logical problem with this. If you take the Literacy Test as an example, ACER will determine your ability based on how many easy and difficult test items you get correct: if you get all the easy ones correct and most of the difficult ones correct, then ACER will safely say you’re very literate. But how does ACER know how difficult each test item is? Well, if the more literate people get an item correct but less literate people get it wrong, then ACER conclude it is a relatively difficult item; if most people get an item correct, including the less literate people, then ACER conclude it is a relatively easy item. But how does ACER know which test-takers are more literate and less literate? Well, that’s based on the difficulty of the items. But how does ACER know the difficulty of the items? And so on and so on. In theoretical terms, this is the problem of ‘parameter separation’: you can’t determine the difficulty of an item if you don’t know beforehand the test-taker’s ability, but the entire point of the test is to determine the test-taker’s ability… Proponents of Rasch measurement respond to this logical problem with a series of mathematical equations to demonstrate that the item difficulty and person ability ‘parameters’ have indeed been ‘separated’. I’m not convinced.
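We don't know how ACER actually calibrates LANTITE, because nothing has been published about it. But the basic Rasch model, and the way the circular dependence between ability and difficulty is handled in practice, can be sketched in a few lines of Python. Everything below — the simulated response data, the learning rate, the iteration count — is illustrative only, not ACER's procedure:

```python
import math
import random

def p_correct(ability, difficulty):
    """Rasch model: probability that a person of the given ability
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# Simulate a response matrix with known "true" parameters
# (200 test-takers, 5 items) -- purely illustrative numbers.
random.seed(0)
true_abilities = [a for a in (-2.0, -1.0, 0.0, 1.0, 2.0) for _ in range(40)]
true_difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [[1 if random.random() < p_correct(a, d) else 0
              for d in true_difficulties]
             for a in true_abilities]

# Joint estimation by alternating gradient steps: abilities are
# updated assuming the current difficulty estimates are right, and
# difficulties are updated assuming the current ability estimates
# are right -- the circularity described above, resolved only by
# iterating towards a mutually consistent fixed point.
abilities = [0.0] * len(responses)
difficulties = [0.0] * len(true_difficulties)
for _ in range(200):
    for i, row in enumerate(responses):
        grad = sum(x - p_correct(abilities[i], difficulties[j])
                   for j, x in enumerate(row)) / len(row)
        abilities[i] += 0.5 * grad
    for j in range(len(difficulties)):
        grad = sum(p_correct(abilities[i], difficulties[j]) - row[j]
                   for i, row in enumerate(responses)) / len(responses)
        difficulties[j] += 0.5 * grad
    # The scale is only identified up to a constant shift, so anchor
    # mean difficulty at zero -- a reminder that neither parameter
    # set is measured on an absolute scale.
    mean_d = sum(difficulties) / len(difficulties)
    difficulties = [d - mean_d for d in difficulties]

print("estimated difficulties:", [round(d, 2) for d in difficulties])
```

Note the structure of the loop: neither set of parameters is ever estimated without presupposing the other. The 'separation' that Rasch proponents claim to achieve mathematically exists, in computational practice, only as the fixed point of this mutual dependence.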
At any rate, there are very good reasons to suspect that LANTITE test-takers’ responses to individual items are influenced by a wide range of factors which are unlikely to be accounted for (and probably technically cannot be accounted for) by the Rasch measurement technique. I’ll make a list here of some of these factors, based on my own experience of completing ACER’s practice test materials, familiarity with other documents on ACER’s LANTITE site, and hearing about the LANTITE experiences of students from all over the country.

I took the complete practice test and scored 60 out of 65. I consider myself a very literate person and I have spent 18 years practising reading tests, teaching grammar, etc. But:
- In the second half of the test, I got extremely bored and frustrated and, by the end of it, I did something I would never encourage my students to do: I just guessed the answers to items based on my existing knowledge of the topic, without reading the text. I still got the items correct, which indicates that background knowledge is a significant factor influencing responses to at least some test items.
- As I mentioned, I got extremely bored and frustrated with the test because of the sheer number of texts I had to read and make sense of. This indicates that boredom and frustration are significant factors influencing responses to at least some test items, particularly those in the last third of the test.
- I have completed the ‘retired’ practice items and scored 28 out of 35. Several of these items are very poorly written and it is alarming to think that they could ever have been included in a live test. Questions 10–14 relate to a relatively easy text, but I found the items themselves very hard to answer. This suggests that the variable quality of test items is a significant factor influencing responses, i.e., some items are so poorly written that they could not be answered correctly even by test-takers who had no problem comprehending the associated text.
- I have heard that test-takers who attempt LANTITE multiple times get the same test items on subsequent attempts. To illustrate the problem, imagine this scenario: you take the Literacy Test and fail. On your second attempt, you see some of the items from the first. You don’t know whether you answered these items correctly the first time but, feeling confused and uncertain, you choose answers because you think they are different from what you answered before. This suggests that previous encounters with test items are a significant factor influencing responses.
- Some students apparently don’t realise that, if you click on certain highlighted words in the test items, the relevant part of the text is highlighted. This means that some students might spend longer than others trying to figure out which paragraph is ‘paragraph 2’, for example, and may identify the wrong part of the text. This suggests that familiarity with key technical aspects of the test is a significant factor influencing responses to test items.
- Some students who take the test remotely report major disruptions from the online test ‘proctor’. Test-takers have mentioned being interrupted and/or distracted repeatedly by the remote proctor while taking the test; these interruptions and distractions include the following:
    - Test-takers can hear the remote proctors making various kinds of noises while they are trying to concentrate on the test.
    - Remote proctors interrupt test-takers to ask about a noise that actually happened outside the test-taker’s house, or because the test-taker covered their mouth or read aloud to themselves.
  This can be intrusive and intimidating for test-takers and suggests that the behaviour of the remote proctor is a significant factor influencing responses to test items.
- Some test-takers experience technical problems with remote proctoring which delay the start time of their test, sometimes by several hours. This suggests that technical issues are a significant factor influencing responses to test items.
- Some test-takers receive from their university the actual numerical score generated by the testing procedure, while others are told that this is not available. Imagine that two students achieve the same score of 109 on the Literacy Test, only one point below the ‘standard’. One of these students receives this numerical score from their university, feels more confident preparing for their next attempt and goes on to pass; the other does not even know that a numerical score is available, sees only the vague black circle on the result form provided by ACER, feels unsure how far away from the standard they are, attempts a second time and fails. This suggests that whether or not a test-taker’s university chooses to share their numerical score with them may be a factor indirectly influencing responses.
The bottom line is that there is very good reason to suspect that a range of factors which have nothing to do with ‘literacy’ are influencing test-takers’ responses to individual LANTITE test items and, consequently, their overall score. If this is the case, then the validity, reliability and fairness of LANTITE is also suspect. The absolute minimum that ACER and/or the Federal Department of Education can do is to release a technical report, preferably a separate report for each test window since 2015.