The notion of interpreting quality is indeed multi-faceted, complex, and dynamic, as Collados Aís and García Becerra observe (see Chapter 23, this volume); equally multi-faceted and complex are the knowledge, ability, and skills of a competent interpreter or successful interpreter-in-training. Contributing to the complexity is a conundrum: users of interpreting services are not able to comprehensively evaluate the quality of interpretation events because they cannot understand the source material. Who, then, should evaluate an interpretation, and against what criteria or standards? To address these questions and others, the goals of this chapter include: (1) providing readers with an overview of current practice in the areas of certification testing for court interpreting and medical interpreting as well as admissions testing for interpreter training programs; (2) discussing assessment-related issues and concerns in the specified professional settings and at the various stages in interpreters’ training, including admission, formative, and summative assessment; and (3) identifying important areas of ongoing challenge. This chapter reviews research and theory in the assessment of spoken-language interpreting published since 2000, focusing primarily on work published in English, though some seminal work published prior to that time is referenced.

An observation by Clifford (2005) serves as a useful introduction to a discussion of testing
practices in interpreting. He noted that “while it is clear that an interpreter’s lack-luster performance can have very serious consequences, it is much less obvious what steps should be taken to ensure that interpreters do good work” (p. 98). Among the important questions he addressed in his research on evaluating the work of interpreters is this: “What exactly should be assessed?” (p. 98). This question is related to the concept of test validity, which Sawyer (2004, p. 14) described as “an argument or a series of arguments for the effectiveness of a test for a particular purpose.” As Sawyer noted, a test must be valid for its intended purpose; that is, evidence must be assembled to support a “justification” of a test’s use for a particular purpose.

The reliability of a test, its consistency of measurement, is another critical characteristic. The type of reliability relevant to the testing of interpreters varies depending on the type of test. High-stakes multiple-choice tests like the Written Examination for the U.S. Federal Court Interpreter Certification Examination must have a demonstrably high level of internal consistency,
the extent to which the items’ statistical performance is similar (Turner, Impara, and Romberger, 2012, p. 14). However, another frequently used test format in the testing of interpreters is performance testing. Performance tests require human scoring, so a different type of reliability is relevant: interrater reliability, the consistency of multiple scorers’ application of scoring criteria.

Clifford (2001) observed that “performance-based assessment evaluates behavior in a realistic
context designed to target particular competencies” (p. 375); when testing the skills, knowledge, and abilities of interpreters, a performance test requires them to perform tasks that are similar to those interpreters do on the job. Clifford’s description of a four-step cycle for the design, implementation, and use of performance tests (p. 375) highlighted the rigor that is necessary to achieve an acceptable degree of validity and reliability. The first step, intention, contributes directly to validity and involves deliberate decisions to evaluate specific skills, behavior, and knowledge. The second, the actual measurement, is done through the test taker’s engagement in tasks similar to those done by interpreters as they carry out their professional responsibilities. The third, judgment, is done by evaluators trained to use a rubric, or scoring guide, which defines the specific levels of quality for the skills, behaviors, and knowledge the performance test tasks are designed to tap. The fourth, decision, as Clifford noted, is made “by comparing the level of performance achieved [by an individual] against some minimum standards set within the profession” (p. 375). The consistency of scores across raters and the consistency of decisions across raters are the basis for interrater reliability. Clear, transparent criteria are a starting point for interrater reliability; however, scorers must be trained in how to apply the criteria to attain an acceptable degree of interrater reliability. (See Chapter 23 for discussion of quality criteria.)

Clifford (2005) noted that the original approaches to assessing quality in interpreting focused
primarily on flaws in the interpretation product (p. 98). Great attention was given in assessment procedures to identifying errors and classifying their gravity, though Pöchhacker’s work in 2001 established four areas of concern beyond errors: accuracy, adequacy, equivalency, and success. Grbić’s (2008) observation that “quality is not intrinsic to an object” (p. 234) and her thought-provoking discussion of deriving benchmarks for quality from the perspectives of exception, perfection, and fitness for purpose provide a framework for further research on the evaluation of interpreting. With regard to the area of fitness for purpose, a user’s perception of the quality of an interpretation is an important consideration. Kurz’s (2001) review of research on users’ impressions of interpretation quality highlighted ongoing attention in the literature to this issue, though she observed that users of interpreters’ services are often considered “poor judges of quality since they lack one of the most crucial means of assessing quality – an understanding of the source message” (p. 403). She recommended research aimed at developing “user expectation profiles” (p. 407) as a productive direction for examining interpreter quality from the perspective of users’ satisfaction.

Clifford (2005, p. 102) identified an additional shortcoming of scoring approaches that focus
only on the product, noting that they are not “sufficient” when “decisions about the sample are used to make wider inferences.” The potential insufficiency of a scoring approach is particularly important for a certification examination, “which is intended to determine whether individuals possess the competencies that are needed for practice in a particular professional field.” He posed a very important question about the relevance of test tasks to interpreters’ work: “what if the things we ask a test taker to do on a test bear no relationship to the activities he or she would undertake in the profession” (p. 103). He also noted that one of the critical aspects of test validity is the extent to which a test measures what it is intended to measure (p. 104). Clifford argued convincingly for establishing an evidential basis for the appropriateness of test design, test scoring, and test use, a concept that he referred to as a psychometric approach in the sense that principles of measurement and evaluation are followed (p. 128). As he wrote, “we need to
collect evidence to provide an indication of the reliability and validity of certification tests, so that we can make appropriate assumptions based on their results” (p. 103). Though he referred specifically to certification tests, these principles should be followed in the design, development, and use of any type of test used for making decisions about people, their lives, their well-being, or the quality of services that they might provide or receive.
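
As a rough illustration of what evidence for the two kinds of reliability discussed above might look like in practice, the following minimal sketch computes one common index for each: Cohen's kappa for interrater reliability and Cronbach's alpha for internal consistency. Neither statistic is named in the sources cited here; they are assumed for illustration as widely used measures. The function names and all scores are hypothetical and do not come from any actual examination.

```python
from collections import Counter
from statistics import pvariance


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical decisions."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of candidates.")
    n = len(rater_a)
    # Proportion of candidates on whom the two raters reached the same decision.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal category counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when both raters use one identical category.")
    return (observed - expected) / (1 - expected)


def cronbach_alpha(item_scores):
    """Internal consistency; item_scores holds one list of item scores per test taker."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("At least two items are needed.")
    # Variance of each item across test takers, and variance of the total scores.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("Alpha is undefined when all total scores are identical.")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical pass/fail decisions by two trained raters on eight performances.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
print(cohens_kappa(rater_a, rater_b))  # 0.5 for this invented data

# Hypothetical right/wrong (1/0) scores of four test takers on two written items.
written_items = [[1, 1], [0, 0], [1, 1], [0, 0]]
print(cronbach_alpha(written_items))
```

In the invented rater data, the two raters agree on six of eight decisions, but half of that agreement would be expected by chance, which is why kappa is lower than the raw agreement rate. What counts as an acceptable value for either index is a judgment that depends on the stakes of the test and is not settled by the computation itself.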