As the BWC and, to a slightly lesser extent, the MC are the major sources for the discussion of spoken language features in this book, it is important to consider the origins, nature and quality of the data contained in these two corpora. This is particularly the case because both the BWC and the MC present written records of speech which were collected for purposes other than linguistic research. Written records of speech, as Schneider (2002: 68) observes, can vary on a continuum from quite faithful renderings to a ‘gross distortion’. As we shall see, avoiding the pitfalls of ‘gross distortion’ meant that the compilation of both corpora involved a constant filtering of the potential data in the sources. I repeatedly wrestled with two questions: ‘Is this written sample of spoken data convincing enough to include in the corpus, and how can I make consistent decisions about what to include and exclude?’ This chapter discusses the following points in relation to the BWC and the MC in turn:

Chance discovery of the data

The historical background to the data

The nature and verisimilitude of the data (including samples of the data)

The compilation process.

The chapter concludes with some general reflections on the process of authenticating the data in the BWC and the MC.