Yesterday, I tried out the following thought experiment in the #AusELT online community:
You’ve been teaching General English at your private language school for a few months, but you’d really like to teach EAP. You’ve been teaching for a few years overall and you’re confident you could do it well. Also, your goal is to teach at uni and you know you’ll need EAP experience first. Your Academic Manager says that there is a standard process all teachers have to go through first, in order to be fair to everyone.
- She will observe you teaching four separate one-hour lessons, one for each of the macro skills
- In each hour, you must *only* teach that skill and not integrate it in any way with the other three skills
- Using a criteria sheet (which you have access to in advance), she will give you a score out of ten for each skill, then take an average of the four scores to produce an overall score out of ten
- You must score at least 6 in order to be able to teach EAP.
Do you think this would be appropriate? Why? Why not? If not, what would a suitable alternative be?
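To make the manager’s decision rule concrete, here is a minimal sketch in Python. The four macro skills and the pass mark of 6 come from the scenario; everything else (the function name, the sample scores) is invented for illustration.

```python
# Minimal sketch of the manager's gatekeeping rule in the scenario above.
# The skills and pass mark come from the scenario; the scores are invented.

MACRO_SKILLS = ["reading", "writing", "listening", "speaking"]
PASS_MARK = 6

def can_teach_eap(scores: dict) -> bool:
    """Average the four skill scores (each out of 10) and compare to the cut-off."""
    overall = sum(scores[skill] for skill in MACRO_SKILLS) / len(MACRO_SKILLS)
    return overall >= PASS_MARK

# A teacher who is strong in three skills and weak in one still passes,
# because the average (7.0) erases the profile.
print(can_teach_eap({"reading": 9, "writing": 8, "listening": 8, "speaking": 3}))  # True
```

Notice how the averaging step throws the profile away: a 9/8/8/3 teacher and a 7/7/7/7 teacher are indistinguishable to this rule.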
Inspired by Pamela Moss’s 1994 article ‘Can there be validity without reliability?’, I thought the scenario of a manager making ‘gatekeeping’ decisions about teachers based on a single score (albeit one which is the aggregate of several other scores) might help to clarify some complex and confusing aspects of assessment, specifically the psychometric approach. Although the scenario clearly has parallels with IELTS, I wasn’t thinking of any particular test.
In the assessment scenario above (ridiculous as it is!), the manager designed a method which minimised the influence of her own subjective judgment, reflecting a “profound concern for fairness to individual [teachers] and protection of stakeholders’ interests by providing accurate information” (Moss, 1994, p. 9). Moss says that “with psychometric approaches to assessment, fairness in task selection has typically been addressed by requiring that all subjects respond to equivalent tasks, which have been investigated for bias” (p. 9).
Related to this is the notion that getting everyone to do equivalent tasks contributes to higher reliability and that, as the manager believes, ‘without reliability, there is no validity’.
> Many of us who develop and use educational assessments were taught to take this maxim for granted as a fundamental principle of sound measurement … Theoretically, reliability is defined as “the degree to which test scores are free from errors of measurement … Measurement errors reduce the reliability (and therefore the generalizability) of the score obtained for a person from a single measurement” (AERA et al., 1985, p. 19). Typically, reliability is operationalized by examining consistency, quantitatively defined, among independent observations or sets of observations that are intended as interchangeable (p. 6).
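To give a concrete (and entirely invented) sense of what ‘consistency among interchangeable observations’ means in practice: one standard way to operationalize reliability is to correlate two parallel sets of scores, such as two raters observing the same lessons. A minimal sketch:

```python
from statistics import correlation  # Pearson's r; available from Python 3.10

# Invented scores from two raters observing the same six lessons.
rater_a = [6, 7, 5, 8, 6, 9]
rater_b = [5, 7, 6, 8, 5, 9]

# A high correlation is taken as evidence that the two sets of observations
# are interchangeable, i.e. reliable in the psychometric sense.
print(round(correlation(rater_a, rater_b), 2))
```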
Thus, fairness, reliability, measurement (i.e., quantification) and objectivity are tightly bound together. There are also some deeper assumptions underlying all this which have been transferred from physics to education via psychology: there is such a thing as ‘EAP teaching ability’ (ETA) and, because this thing exists, it exists in a certain quantity, i.e. there is a lot of it or a little of it, and so on. Thus, we can quantify it, and we can then also compare different amounts of it: Person A has more of it than Person B, and therefore Person A gets the job teaching EAP. But, of course, we need to develop an instrument which measures ETA and, hopefully, nothing else; we want scores to vary according to different quantities of ETA and not, for example, extroversion. This is a process analogous to using a thermometer to measure temperature.
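The measurement model lurking behind this is classical test theory: an observed score is assumed to be a ‘true’ quantity of the attribute plus random error, and reliability is the proportion of observed-score variance that is not error:

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

Day-to-day fluctuation is absorbed into the error term E. A stable irrelevant trait like extroversion, though, would contaminate the ‘true’ score itself, which is why the thermometer analogy raises a validity worry as much as a reliability one.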
This is psychometrics (or at least my understanding of it), and much of our assessment practice is influenced by it. Why is each macro skill assessed separately? Psychometrics, influenced by the natural sciences, involves breaking phenomena down into their constituent parts (E. L. Thorndike called these ‘ability-atoms’), labelling them and producing taxonomies. To return to the other side of the looking glass: in language assessment, this involves macro skills and subskills. Specific tasks are designed to assess subskills; the scores on the various ‘subskill tasks’ are aggregated to produce a ‘Reading’ score; this is repeated for each macro skill; and the macro-skill scores are then aggregated to produce an overall ‘English Language Proficiency’ score. This process is valued in psychometrics because it is rational, scientific and objective.
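The roll-up just described is easy to state as a sketch. All the subskill names and numbers below are invented, not taken from any real test:

```python
# Illustrative only: subskill scores aggregate into macro-skill scores,
# which aggregate into one overall 'English Language Proficiency' number.

subskill_scores = {
    "reading":   {"skimming": 7, "scanning": 6, "inference": 5},
    "writing":   {"coherence": 6, "grammar": 7, "task_response": 8},
    "listening": {"gist": 8, "detail": 7},
    "speaking":  {"fluency": 6, "pronunciation": 7, "interaction": 6},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Step 1: aggregate subskills into a macro-skill score.
macro_scores = {skill: mean(subs.values()) for skill, subs in subskill_scores.items()}

# Step 2: aggregate macro-skill scores into a single proficiency score.
overall = mean(macro_scores.values())

print(macro_scores)
print(round(overall, 2))
```

Each aggregation step discards information, which is exactly what makes the final number look clean, and exactly what the hermeneutic alternative below refuses to do.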
Moss argues that, for psychometricians, “less standardised forms of assessment [e.g. portfolio assessment] … present serious problems for reliability” because they “typically permit students substantial latitude in interpreting, responding to, and perhaps designing tasks; they result in fewer independent responses, each of which is more complex, reflecting integration of multiple skills and knowledge; and they require expert judgment for evaluation” (p. 6). This kind of latitude would lead to unacceptably low levels of reliability and thus tends to be deliberately designed out of many assessment systems. Moss proposes ‘hermeneutics’ (literally, ‘the art of interpretation’) as an alternative to psychometrics.
Instead of the psychometric ‘EAP teaching ability’ test in the scenario, several people in the #AusELT discussion suggested quite different approaches, including interviews, portfolios and more ‘naturalistic’ observations. This is much closer to the hermeneutic approach to assessment described by Moss, which “would involve holistic, integrative interpretations of collected performances that … privilege readers [or ‘raters’, ‘assessors’] who are most knowledgeable about the context in which the assessment occurs” (p. 7). This expands “the role of human judgment to develop integrative interpretations based on all the relevant evidence” (p. 8). In contrast to psychometrics, neither disagreement amongst assessors nor “inconsistency in student performance across tasks” would invalidate the assessment; “rather, it would provide the impetus for dialogue, debate, and enriched understanding informed by multiple perspectives as interpretations are refined and as decisions or actions are justified” (pp. 8-9). From this perspective, the psychometric desire for “detached and impartial” high-stakes assessment appears “arbitrarily authoritarian and counterproductive, because it silences the voices of those who are most knowledgeable about the context and most directly affected by the results” (pp. 9-10).
Perhaps the best example of a hermeneutic assessment is the portfolio. There are lots of different types of portfolio assessment systems, but imagine one in which students take responsibility for collecting samples of their spoken and written work to meet certain curriculum goals or requirements; select their best pieces along with the drafts on which teachers have provided formative feedback (but no grades or scores); write a cover letter explaining why they have met the curriculum goals with reference to the actual portfolio contents; and finally submit this so it can be assessed by two teachers, who read the cover letter and portfolio contents where necessary and then, for summative purposes, rate it as ‘Unsatisfactory’, ‘Satisfactory’ or ‘Excellent’.
From a psychometric perspective, this would present a range of serious threats to reliability and confound efforts to establish criterion validity (no numbers are produced, so there is nothing to correlate with IELTS, etc.). From a hermeneutic and also an intuitive pedagogical perspective, however, the same features that would give Thorndike hives are actually considered desirable, fair, more positive in terms of washback, valid and, perhaps surprisingly, even reliable and objective.
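For contrast, here is what ‘criterion validity’ asks for, with entirely invented data: correlate the new instrument’s scores against an established benchmark. The sketch also shows why the portfolio’s three verbal categories frustrate the exercise:

```python
from statistics import correlation  # Pearson's r; Python 3.10+

# Invented data: a numeric test score correlates neatly with IELTS bands...
new_test = [55, 62, 70, 48, 80, 66]
ielts    = [5.5, 6.0, 7.0, 5.0, 8.0, 6.5]
print(round(correlation(new_test, ielts), 2))  # criterion validity evidence

# ...but the portfolio produces only three verbal categories. To correlate
# it at all, you must first impose numbers on the categories -- an arbitrary
# coding step the portfolio system itself never asked for.
coding = {"Unsatisfactory": 0, "Satisfactory": 1, "Excellent": 2}
portfolio = ["Satisfactory", "Excellent", "Satisfactory",
             "Unsatisfactory", "Excellent", "Satisfactory"]
print(round(correlation([coding[p] for p in portfolio], ielts), 2))
```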
Finally, coming back to the scenario, the questions I’m interested in are these:
- Do we accept psychometric assessment for our ESL students but not for ourselves when it comes to, for example, career progression?
- If so, why?
- Are there good reasons to use hermeneutics in one situation and psychometrics in the other?