
This section introduces the corpora used in this thesis. The Corpus of Late Modern English Texts (CLMET) and the Early American Fiction (EAF) corpus were used in the diachronic part of the study, representing British and American English respectively, and the British National Corpus (BNC) was used to compare the historical British English data with present-day British English.

2.2.1 The Corpus of Late Modern English Texts (1710-1920)

The Corpus of Late Modern English Texts (CLMET) was compiled by Hendrik de Smet at the University of Leuven, utilising the online text collections of Project Gutenberg1 and the Oxford Text Archive2, and it includes British English texts from 1710 to 1920. The original version contains slightly less than 10 million words (de Smet, 2005), and the extended version a little less than 15 million words3.

1 http://www.gutenberg.org/

2 http://ota.ahds.ac.uk/

3 The count of roughly 15 million words was obtained with the Monoconc program. The word counts listed on the CLMET website (http://perswww.kuleuven.be/~u0044428/) were obtained with Microsoft Word, and they differ from the Monoconc figures, probably because the two programs treat borderline cases, such as hyphenated words, differently, counting them as either one word or two. In any case, de Smet recommends using the Monoconc figures.
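
Footnote 3 attributes the differing totals to how each program handles borderline cases such as hyphenated words. The following minimal sketch is purely illustrative; it does not reproduce the actual tokenisation rules of Monoconc or Microsoft Word, but only shows how the same text yields different word counts depending on whether a hyphenated form is counted as one token or as two.

    import re

    sample = "The well-known author re-examined the matter."

    # Rule A: a hyphenated form counts as a single word.
    one_word = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sample)

    # Rule B: each hyphenated part counts as a separate word.
    two_words = re.findall(r"[A-Za-z]+", sample)

    print(len(one_word))   # 6 -- "well-known" and "re-examined" each count once
    print(len(two_words))  # 8 -- the same two forms contribute two tokens each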

In this thesis I have used the extended version of the first part, covering the years 1710-1780, and the original versions of the last two parts, covering the years 1780-1850 and 1850-1920. I decided on this mixture of the two versions of CLMET because the first part is somewhat smaller than the other two parts in both the original and the extended version, and because the three parts cover different time periods, so that the 18th century is underrepresented compared to the 19th century in terms of the amount of data available. Both points become clear from the rounded figures for the sizes of the different parts. In the original version of the corpus, the first part has 2.1 million words4, and the second and third parts both have 3.8 million words. In the extended version, the first part has 3.0 million words, the second part 5.8 million and the third part 6.1 million words.

In this study, the sizes of the three selected parts are 3.0 million words for the first part of the extended version and 3.8 million words each for the second and third parts of the original version. This combination gives a much more even balance between the time periods than either version as a whole. Moreover, choosing one part from one version and the other two parts from the other (or any other combination) does not lead to any real inconsistency between the subcorpora of the different time periods, owing to the method used in the compilation of the corpus, explained below.
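
To make the effect of this selection explicit, the following short calculation, a sketch based only on the rounded word counts cited above, compares the share held by the earliest period, 1710-1780, under the original version, the extended version, and the combination used here.

    # Approximate sizes (millions of words) of the three CLMET parts,
    # taken from the rounded figures given in the text above.
    original = {"1710-1780": 2.1, "1780-1850": 3.8, "1850-1920": 3.8}
    extended = {"1710-1780": 3.0, "1780-1850": 5.8, "1850-1920": 6.1}
    selected = {"1710-1780": 3.0, "1780-1850": 3.8, "1850-1920": 3.8}  # mixture used in this thesis

    def shares(parts):
        """Each period's share of the whole corpus, as a percentage."""
        total = sum(parts.values())
        return {period: round(100 * size / total, 1) for period, size in parts.items()}

    for label, parts in (("original", original), ("extended", extended), ("selected", selected)):
        print(label, shares(parts))
    # original: the 1710-1780 part holds about 21.6% of the data
    # extended: about 20.1%
    # selected: about 28.3%, i.e. a clearly more even distribution across the periods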

According to de Smet (2005: 70), each subperiod in the original version of CLMET represents a fairly homogeneous set of authors as regards their dates of birth, and no author is represented in more than one subperiod. The amount of text contributed by each author is approximately the same, with a maximum of 200,000 words per author. The same method was applied to the extended version, which simply means that more authors are included for each time period, with no author appearing in more than one period. Hence, apart from size, there is no essential difference between the structures of the original and the extended version of CLMET when the parts are considered by time period. The difference between the versions is merely formal, due to the practical fact that it was desirable to first release a preliminary version of the corpus, which could later be expanded. Counteracting the greater sizes of the subcorpora of the two later time periods by using the extended version for the first time period can therefore only be beneficial.

4 Rounded to the nearest 100,000 words.
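
The structural principle just described, whereby each author belongs to exactly one subperiod and contributes at most 200,000 words, can be expressed as a small sketch. The function below is only a toy illustration of these two constraints, not a reconstruction of de Smet's actual compilation procedure; the data format and names are invented for the example.

    MAX_WORDS_PER_AUTHOR = 200_000  # the cap reported by de Smet (2005: 70)

    def select_texts(texts, author_period):
        """texts: iterable of (author, period, word_count) tuples (invented format).
        author_period: maps each author to the single subperiod he or she belongs to."""
        included = []
        words_so_far = {}
        for author, period, words in texts:
            # Constraint 1: an author may appear in one subperiod only.
            if author_period.get(author) != period:
                continue
            # Constraint 2: at most 200,000 words are included per author.
            if words_so_far.get(author, 0) + words > MAX_WORDS_PER_AUTHOR:
                continue
            words_so_far[author] = words_so_far.get(author, 0) + words
            included.append((author, period, words))
        return included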

As for the genres of the texts and the social background of the authors, de Smet (2005) notes that there is a slight bias toward literary, fictional, and formal texts written by men from the better-off layers of English society, even though other texts were consciously favoured to counteract this (a list of all texts is provided in de Smet, 2005: 72-78). According to de Smet (ibid: 78), the size of the corpus makes it suitable for the study of “relatively infrequent syntactic patterns, or borderline phenomena between grammar and the lexicon”. As a disadvantage, de Smet (ibid: 79) mentions the lack of an exact bibliographical history of the texts, i.e. there is rarely information on which edition a text represents. Even so, de Smet maintains that editors are unlikely to introduce new constructions into a historical text, or to influence the possible semantic development of specific words or constructions over time.

2.2.2 The Early American Fiction corpus (1809-1874)

The corpus of Early American Fiction (EAF) used in this study goes by the same name as a subcorpus of the Chadwyck-Healey corpus, but the version used here includes only 173 works by 51 authors from the 19th century. This version contains approximately 11.9 million words5, covering the years 1809-1874, and it was used in this study as a source of American English from before the 20th century. This smaller version of EAF was obtained from the Electronic Text Collection at the University of Virginia6, and it includes all the works that are publicly accessible. The original, full version of EAF in the Chadwyck-Healey corpus covers the period 1789-1875 and contains more than 730 works by more than 130 authors7.

All works in both versions of the EAF corpus are fiction, and for this reason the results from EAF cannot be directly compared with those from the CLMET corpora, which also include non-fictional works. However, since the use of prevent in American English during this period has not been extensively researched before, any results are of interest.
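
Since the two corpora also differ in size (roughly 10.6 million words in the CLMET selection and 11.9 million in EAF), any comparison of frequencies additionally presupposes normalisation. The sketch below illustrates the standard normalisation to occurrences per million words; the raw counts in it are invented placeholders, not results from this study.

    def per_million(raw_count, corpus_size):
        """Normalise a raw frequency to occurrences per million words."""
        return raw_count / corpus_size * 1_000_000

    # Corpus sizes are the approximate figures cited in this chapter;
    # the raw counts are hypothetical placeholders.
    examples = {
        "CLMET (selected parts)": (500, 10_600_000),
        "EAF": (560, 11_900_000),
    }

    for corpus, (count, size) in examples.items():
        print(f"{corpus}: {per_million(count, size):.1f} occurrences per million words")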

5 Again rounded to the nearest 100,000 words.

6 http://etext.lib.virginia.edu/eaf/

7 All information taken from http://collections.chadwyck.com

2.2.3 The British National Corpus (1960-1995)

According to the website of the British National Corpus8 (BNC), it is a collection of samples of written and spoken British English comprising approximately 100 million words. The written part makes up 90% of the corpus and the spoken part 10%. The BNC is a synchronic corpus, representing British English from the end of the 20th century, more specifically from the 1960s to the 1990s. The texts cover a variety of subject fields, genres and registers.

In this thesis, the BNC is used to provide data on present-day British English, to be compared with the data from the historical CLMET and EAF corpora. The methods used in gathering the data from the BNC were different from and more varied than those used with the historical corpora, and they will be explained in detail in chapter 5.

3 Complementation

This chapter introduces some views on how the terms complementation and complement can be defined, and on how complements are related to adjuncts. When studying the complementation of a word, it can occasionally be challenging to determine whether an element is a complement or an adjunct, both intuitively and in formal terms, as will be seen in the following sections. The terminology used in this thesis follows that of Huddleston and Pullum (2002), introduced in section 3.1.