
4. Aims and methods of research

4.2 Material of research: ICE-GB and ICECUP 3.1

Corpora are commonly used as research material in linguistics, and a large variety of corpora is available. In addition to simpler text search programs, such as MonoConc, more complex corpus utility programs have been developed for some corpora to enable in-depth analysis of language and the factors that lie behind it. One corpus with such a program is ICE-GB, which is the material of this study. The data will be gathered with the ICECUP 3.1 corpus utility program, which has been developed exclusively for ICE-GB.

The British Component of the International Corpus of English, ICE-GB (described in Nelson et al. 2002), is one corpus in the ongoing ICE (International Corpus of English) project, which currently includes 21 corpora of English from around the world, covering the following regions: Australia, Cameroon, Canada, Fiji, Ghana, Great Britain, Hong Kong, India, Ireland, Jamaica, Kenya, Malawi, New Zealand, Nigeria, Philippines, Sierra Leone, Singapore, South Africa, Sri Lanka, Tanzania and USA. The project was initiated in 1988 and most of the corpora were gathered around the 1990s, although new corpora are still being added from time to time. The texts in ICE-GB date from 1990 to 1993, so the corpus represents contemporary BrE.

ICE-GB has been grammatically analysed. The compilation and analysis of the corpus consisted of the following stages: text collection, optical scanning and transcription, applying structural markup, part-of-speech tagging, tag selection, syntactic marking, parsing, parse selection, alignment of tagged and parsed versions, cross-sectional checking and speech digitization (Nelson et al. 2002: 3). The subjects included in the corpus are defined as adults aged 18 or over who have completed either secondary school or university.

The total number of words in the corpus is 1,061,264, divided into a spoken part (637,562 words) and a written part (423,702 words). Both parts are further divided into subcategories, such as dialogues and monologues, and private and public conversations, in the spoken part. The following table illustrates the corpus design further (adapted from the ICECUP 3.1 program Help feature):

Written Texts (200)
    Non-printed (50)
        Non-professional writing (20)
            untimed student essays (10)
            student examination scripts (10)
        Correspondence (30)
            social letters (15)
            business letters (15)
    Printed (150)
        Academic writing (40)
            humanities (10)
            social sciences (10)

As can be seen in the table, ICE-GB covers a variety of situations in which the data have been gathered, ranging from more formal to less formal. Informal situations in particular can reflect contemporary natural language accurately, which makes them an interesting field of study. However, this research will include the whole spoken corpus, and the formality aspect will be examined separately, as it is difficult to determine the level of formality of the different parts of the spoken corpus. For example, the formality of unscripted speeches, classroom lessons or even private dialogues can vary considerably depending on many factors, such as how well the speakers know each other. Thus, it is difficult to classify a text group as purely formal or purely informal.

ICECUP (ICE Corpus Utility Program) is a program developed specifically for searching the ICE-GB corpus. As the corpus has been parsed and tagged, this annotation can be used extensively in the searches. ICECUP provides several search options, such as variable queries, node queries, markup queries, random sampling, text fragment queries and fuzzy tree fragment searches. There have been two releases of the program, the first at the release of the corpus itself and the second in 2006. In this study, the newest patched 3.1 version has been used, since it is more stable and useful than the previous version.

Although ICE-GB, with approximately one million words, is not a very large corpus compared to a corpus like the BNC (British National Corpus) with 100 million words, it was chosen because it is the most suitable corpus that could be found for this study. No other available corpus provided adequate tagging and a sufficiently good search program, both of which are vital to the nature of my research. The ICECUP 3.1 program has many functional features that help with studying the corpus in various ways. It includes variable searches, which are the basis of my study of spoken BrE, as they enable a search restricted to a defined group of people of a certain age, gender and education.

Unfortunately, however, no corpus or corpus utility program is perfect. In ICE-GB, the proportions of the specified subject groups and their numbers of words are not balanced. For example, some groups (e.g. university-educated males) are more widely represented than others (e.g. secondary-educated females). This might create some problems in comparing the results of the searches. In addition, as the previous example shows, I will be dealing with the education factor, not the social class factor that has been studied in most other studies. ICE-GB does not provide information about social class, but tendencies related to education are close to those related to social class, and therefore this should not be a problem.
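One common way to reduce the comparison problem caused by unequal group sizes is to report relative rather than raw frequencies. The following is a minimal sketch of that idea only: it is not part of ICECUP, the function name is hypothetical, and the hit counts and word counts are invented placeholder values, not figures from ICE-GB.

```python
def normalised_frequency(hits: int, group_words: int, per: int = 1000) -> float:
    """Return the number of hits per `per` words produced by a group."""
    if group_words <= 0:
        raise ValueError("group_words must be positive")
    return hits / group_words * per


# Hypothetical placeholder values for illustration, NOT figures from ICE-GB.
groups = {
    "university-educated males": {"hits": 120, "words": 200_000},
    "secondary-educated females": {"hits": 30, "words": 40_000},
}

for label, g in groups.items():
    rate = normalised_frequency(g["hits"], g["words"])
    print(f"{label}: {rate:.2f} per 1,000 words")
```

With these invented numbers, the smaller group has fewer raw hits but a higher rate per 1,000 words, which is the kind of difference that raw counts from unbalanced groups would hide.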

Another aspect that will restrict the study is that personal information is not available for all the subjects contributing to the corpus, or the information may be partial. This means that the results of the variable queries will also be partial, as they only include words whose speaker information specifies all three factors discussed: age, gender and education. Text with insufficient information on these social factors must be left out in order to keep the results as comparable and as reliable as possible.
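The exclusion rule described above can be sketched as a simple filter. This is an illustration under stated assumptions, not ICECUP functionality: the record layout, the field names (`age`, `gender`, `education`, `words`) and the sample records are all hypothetical.

```python
from typing import List, Optional, Tuple, TypedDict


class SpeakerText(TypedDict):
    """Hypothetical record: speaker metadata plus a word count."""
    age: Optional[str]
    gender: Optional[str]
    education: Optional[str]
    words: int


# The three social factors that must all be specified.
REQUIRED_FIELDS = ("age", "gender", "education")


def has_complete_metadata(record: SpeakerText) -> bool:
    """True only if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def split_by_completeness(
    records: List[SpeakerText],
) -> Tuple[List[SpeakerText], List[SpeakerText]]:
    """Separate records that can be used from those that must be left out."""
    kept = [r for r in records if has_complete_metadata(r)]
    dropped = [r for r in records if not has_complete_metadata(r)]
    return kept, dropped


# Invented sample records for illustration, NOT data from ICE-GB.
sample: List[SpeakerText] = [
    {"age": "26-45", "gender": "f", "education": "university", "words": 1500},
    {"age": None, "gender": "m", "education": "secondary", "words": 900},
    {"age": "18-25", "gender": "m", "education": "", "words": 700},
]

kept, dropped = split_by_completeness(sample)
print(sum(r["words"] for r in kept), "words kept;",
      sum(r["words"] for r in dropped), "words left out")
```

Reporting how many words are left out alongside those kept makes it visible how partial the variable query results are.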