
Differences in the temporal structure between text types

5.6.1 The structure of corpus2

For the current analysis, clauses were collected from the texts characterised above and stratified into three groups of equal size, each with an equal proportion of Finnish and Polish originals, as presented in Table 5.2.⁶

Text type                         Orig. lang. Finnish   Orig. lang. Polish
narrative, literary texts (LIT)   150                   150
informative (INFO)                150                   150
to-be-spoken (TBS)                150                   150
Total                             450                   450

Table 5.2: Number of clauses chosen from corpus1 to corpus2

For several, mostly practical, reasons, clauses were selected manually and not truly randomly. First and foremost, random selection of clauses from the corpus would require extensive preprocessing of all texts, including proper division of texts into clauses, alignment of Polish and Finnish clauses, and filtering of the clauses which satisfy the preconditions of the current study. These steps could only have been carried out with automatic tools. Considering the current state of the art, in particular in marking clause borders, this was not feasible.

Instead, whenever the collected set of texts was big enough (e.g. narrative, literary texts), the text excerpts were chosen randomly. The manual selection of clauses had the advantage that obviously erroneous translations were immediately excluded.

Clauses in the narrative, literary group come from fiction; informative clauses were collected from news, company websites, and essays; to-be-spoken texts (henceforth: TBS) are dialogues from literary texts, film subtitles and playscript dialogues.

In order to avoid bias related to the fact that the choice of available texts was small, I ensured that clauses from at least two texts per combination of text type and source language were included, and that, whenever possible, they had different authors and translators (see Appendix B).

⁶ Detailed information about each text is given in Appendix B.

The detailed descriptive statistics of all studied features of corpus2 are presented in Chapter 7, following the full annotation scheme from Chapter 6. However, considering the representativeness and temporal properties of the text types summarised in Table 5.2, it is worth examining whether the pre-selected types really differ with respect to temporal features.

Following Smith’s (2003) idea that some text types are temporal and others atemporal, I provide a tentative analysis of the three groups as to differences in the four temporal features which must characterise each aligned clause⁷ according to the annotation model (see Chapter 6). These features are: temporal quantification (specific, non-specific, pattern, statement, as described in Section 2.5.2), Polish tense, Finnish tense, and PVA.

The analysis is based on the frequencies of these four features in the different text types. They are examined using mosaic plots, in which each box shows the frequencies (numbers of occurrences) of the values of a feature in proportion to their share in corpus2. Since this is count data, the differences in distributions can be tested with the chi-square test (having ensured that the data distribution does not violate the prerequisites for the test). The plots were generated with the mosaicplot function (Meyer et al. 2017), which allows the Pearson residuals to be shown with colours. The plots can be used to easily obtain information on the strength and type of correlation between the features in focus and the text types.
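The computation behind such a plot can be sketched as follows. This is only a minimal illustration in Python using the standard library, not the procedure actually used for the thesis. The counts in `observed` are hypothetical and are not taken from corpus2; only the row totals (300 aligned clauses per text type) follow Table 5.2. The function names are likewise chosen for illustration. The sketch computes the expected counts under independence, the Pearson residuals (observed minus expected, divided by the square root of expected), and the chi-square statistic with its degrees of freedom.

```python
from math import sqrt

# HYPOTHETICAL counts for illustration only (not the corpus2 data).
# Rows: text types, 300 aligned clauses each, as in Table 5.2.
# Columns: the four values of temporal quantification.
observed = {
    "LIT":  {"specific": 180, "non-specific": 40, "pattern": 50, "statement": 30},
    "INFO": {"specific": 90,  "non-specific": 55, "pattern": 45, "statement": 110},
    "TBS":  {"specific": 120, "non-specific": 55, "pattern": 85, "statement": 40},
}


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_tot = {r: sum(table[r].values()) for r in rows}
    col_tot = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_tot.values())
    return {r: {c: row_tot[r] * col_tot[c] / n for c in cols} for r in rows}


def pearson_residuals(table):
    """Pearson residual per cell: (observed - expected) / sqrt(expected)."""
    exp = expected_counts(table)
    return {
        r: {c: (table[r][c] - exp[r][c]) / sqrt(exp[r][c]) for c in table[r]}
        for r in table
    }


def chi_square(table):
    """Chi-square statistic (sum of squared Pearson residuals) and its df."""
    res = pearson_residuals(table)
    stat = sum(v ** 2 for row in res.values() for v in row.values())
    n_cols = len(next(iter(table.values())))
    dof = (len(table) - 1) * (n_cols - 1)
    return stat, dof


if __name__ == "__main__":
    exp = expected_counts(observed)
    # Common rule of thumb for the test's prerequisites: all expected counts >= 5.
    smallest = min(v for row in exp.values() for v in row.values())
    print("smallest expected count:", smallest)
    stat, dof = chi_square(observed)
    print("chi-square statistic:", round(stat, 2), "df:", dof)
    for text_type, row in pearson_residuals(observed).items():
        print(text_type, {k: round(v, 2) for k, v in row.items()})
```

The p-value would then be read from the chi-square distribution with the given degrees of freedom, which is omitted here to keep the sketch free of external dependencies.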

Grey is used for residuals between -2 and 2, which do not show much deviation from the expected values; the intensity of blue shows the strength of positive correlation, while the intensity of red indicates the strength of negative correlation.

The values between -2 and -4 (light red) and between 2 and 4 (light blue) show some correlation, but only residuals over 4 (deep blue) or below -4 (deep red) show very strong deviations from the expected values.
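The colour scheme just described amounts to a simple mapping from a Pearson residual to one of five shading classes. The following Python sketch makes that mapping explicit. The function name `shading` is hypothetical, and the treatment of values lying exactly on a threshold (2, 4, -2, -4) is an assumption, since the description above does not specify it.

```python
def shading(residual):
    """Map a Pearson residual to the shading classes described in the text.

    ASSUMPTION: values exactly on a threshold fall into the weaker class
    (e.g. a residual of exactly 2 is grey, exactly 4 is light blue).
    """
    if residual > 4:
        return "deep blue"   # very strong positive deviation
    if residual > 2:
        return "light blue"  # some positive deviation
    if residual >= -2:
        return "grey"        # little deviation from the expected value
    if residual >= -4:
        return "light red"   # some negative deviation
    return "deep red"        # very strong negative deviation


if __name__ == "__main__":
    for value in (-5.0, -3.0, 0.5, 3.0, 4.5):
        print(value, "->", shading(value))
```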

⁷ An original clause and its translation.

5.6.2 Temporal quantification

Figure 5.1: Distribution of temporal quantification across text types (left: Finnish clauses; right: Polish clauses)

Figure 5.1 shows that the three samples differ with regard to the temporal quantification of clauses. Although each type of quantification appears in each subset, one type of quantification dominates each type of text. Statements are typical for informative texts and atypical for the remaining text types. Temporally specific clauses are most strongly represented in the literary texts; they are also the most frequent type in TBS, but there they do not deviate significantly from the expected value. TBS can be characterised by the particularly high frequency of patterns.

The frequency of non-specific quantification is quite equally distributed across text types, with a slight negative residual for literary texts; thus it contributes least to the characteristics of the temporal structure of the chosen samples.

5.6.3 Tense

The difference in the temporal structure of the three text types is even more clearly visible in the comparison of tense use (Figure 5.2). In Polish, the Past tense is predominant in the literary texts, while the Non-past dominates in TBS. Informative texts are balanced in this regard.

Figure 5.2: Distribution of Finnish tenses (left) and Polish tenses (right) across text types. The Polish Analytical Future was excluded due to infrequency.

Similar tendencies can be observed in Finnish (the Simple Past dominates in literary texts, the Non-past in TBS), but Finnish tense is a four-level variable (see Section 4.4.1). Although the Perfect and the Pluperfect are relatively infrequent, they play a certain role in determining the temporal structure of the three samples. The Perfect is associated with the informative subset and the Pluperfect with the literary texts. These results seem reasonable, as news often describes situations of current relevance, which is one of the functions of the Perfect, while the Pluperfect mainly fulfils an ordering function. Since the primary narrative line is described in the Simple Past, situations anterior to the main events are described in the Pluperfect.

5.6.4 PVA

The contribution of PVA (see Figure 5.3) is not as clear as in the case of temporal quantification and tense.

Figure 5.3: Distribution of PVA across text types of the sample

The generally dominant type of PVA is IPFV. PFV is dominant only in the literary texts. In the informative subset and in TBS the proportion is exactly the opposite, but the Pearson residual only crosses the value 2, which indicates correlation, in the informative text type.

Thus it seems that PVA’s contribution to distinguishing the temporal structure of text types is primarily relevant for the division between literary texts and the other text types.

Chapter 6

Annotation scheme