ABSTRACT

In the last few decades, positivistic worldviews and the related scientific techniques of testing have influenced the development and evaluation of second and foreign language tests (Hamp-Lyons and Lynch, 1998), and psychometrically oriented research on scored/rated data has been encouraged and is well connected to the professional literature of language testing. More effort has been devoted to rater reliability and rater characteristics than to validity issues in item construction. The general procedure for item writing has been repeated and elaborated in the language testing (Alderson et al., 1995; Bachman and Palmer, 1996; Hughes, 2003; Spaan, 2006) and educational measurement literature (Downing and Haladyna, 2006). Some researchers (Davidson and Lynch, 2002; Fulcher and Davidson, 2007) approach item writing as a system comprising the writing of test specifications (or “test specs”) and the building of validity arguments. How items are actually written from test specs, and who writes them, nonetheless remain areas of active enquiry. Research and development in the item writing process and writer training (e.g., Peirce, 1992) have not been properly introduced to testing communities, whereas issues related to rating and rater training appear often in the language testing literature.

The purpose of this chapter is to review different historical views of item writing and to summarize critical issues and research topics surrounding item writing and item writers. This chapter will not, however, focus on the scoring aspect of item writing, despite the intricate relationship between the two. When test items are developed to be objectively scored, the scoring is, strictly speaking, a matter of item creation; for instance, if an item is in multiple-choice format, the item writer’s responsibilities include crafting good distractors for that item. If, however, an item is rater scored, as in direct tests of English speaking, or an admissions-related portfolio is evaluated by experts, then complicated steps must be taken beyond the item creation process. Such items require the development of some kind of scoring rubric or training system. In these instances, difficult questions remain: Where does the responsibility of the item writer end? When does it shift to the province of test administration, scale calibration, rater training, and so forth? Although these are interesting and important questions, a discussion of them is beyond the scope of this chapter. Instead, I will focus on item writing in and of itself. It should also be noted that in this chapter the term “item” is used interchangeably with “task”.