
A CROSS-SECTIONAL STUDY ON LEXICAL COMPLEXITY DEVELOPMENT OF VIETNAMESE LEARNERS OF ENGLISH

MINH NGHIA NGUYEN ID: 291294

UNIVERSITY OF EASTERN FINLAND
Linguistic Sciences

MASTER'S THESIS

UNIVERSITY OF EASTERN FINLAND, SPRING 2019

Acknowledgements
Abstracts

1. Introduction
2. Literature review
2.1 Linguistic complexity
2.2 Lexical complexity in SLA
2.3 Lexical diversity
2.3.1 Defining lexical diversity
2.3.2 Measuring lexical diversity
TTR: A problematic measure
Solution 1: Transforming TTR
Solution 2: A curve-fitting approach
D – A random sampling approach
MTLD – A sequential approach
2.4 Lexical sophistication
2.4.1 Defining lexical sophistication
2.4.2 Measuring lexical sophistication
Frequency band-based approach
Frequency count-based approach
3. Aims and research questions
4. Method
4.1 Research context and participants
4.2 Data collection
4.3 Linguistic complexity measures
4.3.1 Linguistic diversity measures
4.3.2 Linguistic sophistication measures
4.4 Statistical analyses
5. Results
5.1 Research question 1
5.2 Research question 2
5.3 Research question 3
6. Discussion
7. Conclusion
References

Acknowledgements

There are a number of people to whom I can never say thank you enough. Indeed, no words or verbalization of any kind can convey my gratefulness to them. Nevertheless, I feel honored to have the chance to name them here and give them credit.

Firstly, I would like to thank my supervisor and academic advisor, Prof. Dr. Stefan Werner. His immeasurable help (regardless of time and space), considerable patience, and faith in me have been a great source of motivation from my first day at UEF up to now, as I embark on another journey in my research career. My gratitude to him will last forever.

Secondly, I would like to express my gratitude to Dr. Bastien De Clercq from VUB, Brussels, Belgium. His doctoral dissertation inspired me to start this project. His tremendous support helped me through hardships I encountered in Belgium and Finland. I am also indebted to him for his thoughtful guidance and constant encouragement in my pursuit of an academic career.

Thirdly, I would like to thank Prof. Dr. Mikko Laitinen, Prof. Dr. Jukka Mäkisalo, and Dr. Hanna Lantto for their constructive comments and helpful feedback on my work in the LingSci seminars 2017, 2018. I am also grateful to my statistics professor, Prof. Dr. Matti Estola. Without his instruction and assistance, I could not have finished statistical analyses in my work.

Last but not least, I would like to thank my dearest friends at UEF, Lotta, Phát, Hạnh, Alica, Dora, Nasim, Paulina, and others for their immense support and our shared memories that I always treasure. Thanks to them, UEF was home to me.

ITÄ-SUOMEN YLIOPISTO – UNIVERSITY OF EASTERN FINLAND

Tiedekunta – Faculty: Filosofinen tiedekunta (Philosophical Faculty)
Osasto – School: Humanistinen osasto (School of Humanities)
Tekijät – Author: Nguyen, Minh Nghia
Työn nimi – Title: A cross-sectional study on lexical complexity development of Vietnamese learners of English
Pääaine – Main subject: Yleinen kielitiede ja kieliteknologia (General linguistics and language technology)
Työn laji – Level: Pro gradu -tutkielma (Master's thesis)
Päivämäärä – Date: 2.5.2019
Sivumäärä – Number of pages: 62

Tiivistelmä – Abstract

This thesis is a cross-sectional study of lexical complexity in the production of Vietnamese learners of English at four different proficiency levels. Lexical complexity was chosen as the object of study because, of the CAF measures of language proficiency (complexity/accuracy/fluency), it is the most multifaceted and multilayered parameter. In addition, lexical complexity has long been considered an important factor in evaluating learner language.

In this study, lexical complexity is examined in terms of lexical diversity and lexical sophistication. The quantitative study also focuses on the use of several lexical complexity indices. On this basis, the second aim of the work is to statistically examine the strength of the selected indices, that is, their construct validity.

The results suggest that the students' lexical complexity development is observable only from level 2 (A1.2) onwards. The most significant development occurs in lexical diversity, whereas only subtle change is observable in lexical sophistication. As for the measurement indices, Guiraud and D best distinguish the groups' proficiency levels and explain the variance in the data. Some of the results are in line with previous research, while others clearly contradict it, so there is an evident need for further research.

Avainsanat – Keywords

Vietnamese English learners, L2, lexical complexity, lexical diversity, lexical sophistication

ITÄ-SUOMEN YLIOPISTO – UNIVERSITY OF EASTERN FINLAND

Tiedekunta – Faculty: Philosophical Faculty
Osasto – School: School of Humanities
Tekijät – Author: Nguyen, Minh Nghia
Työn nimi – Title: A cross-sectional study on lexical complexity development of Vietnamese learners of English
Pääaine – Main subject: General linguistics and language technology
Työn laji – Level: Pro gradu -tutkielma (Master's thesis)
Päivämäärä – Date: 2.5.2019
Sivumäärä – Number of pages: 62

Tiivistelmä – Abstract

The current research is a cross-sectional study investigating the lexical complexity development of Vietnamese learners of English at four different proficiency levels. I chose lexical complexity because, in the CAF triad (Complexity-Accuracy-Fluency), complexity stands out as the most multi-layered and multi-faceted construct. Furthermore, lexical complexity has long been considered an important factor when judging learner language. In this study, lexical complexity is scrutinized in terms of lexical diversity and lexical sophistication. Moreover, this quantitative research focuses on adopting multiple lexical complexity indices. On this basis, the second goal is to statistically examine the strength of the selected indices, or their construct validity.

Findings suggest that the students' lexical complexity development is only observable starting from level two, equivalent to level A1.2. The most significant development is found in lexical diversity, as opposed to only subtle changes in lexical sophistication. As for the measurement indices, Guiraud and D are the most effective in distinguishing the groups' proficiency and explaining the variance. Some of the findings are in agreement and others in disagreement with previous studies. Thus, the research calls for further investigation.

Avainsanat – Keywords

Vietnamese English learners, L2, lexical complexity, lexical diversity, lexical sophistication


1. Introduction

As bilingualism is a worldwide phenomenon and a daily reality, Second Language Acquisition (SLA) has become a mainstream area of the applied linguistics discipline. SLA has been investigated from numerous perspectives, from the characteristics of learner language to the learner-internal and learner-external factors affecting the learning process (Ellis, 2008). Describing learner language, among other tasks, has received considerable attention from researchers. Since teachers' qualitative judgments can be subjective, and hence unreliable, researchers have turned to quantitative methods. Complexity, accuracy, and fluency (CAF) are well-established constructs frequently used to describe and evaluate learners' linguistic attainment.

Of the three constructs in the CAF triad, complexity is claimed to be the most multilayered and multifaceted aspect (Bulté and Housen, 2012; Ortega, 201; Pallotti, 2009, 2015). Wolfe-Quintero, Inagaki, & Kim (1998) referred to linguistic complexity as "the scope of expanding or restructured second language knowledge" (p. 4). As far as the linguistic complexity of learner language is concerned, Bulté and Housen (2012) named three essential properties to take into consideration, namely discourse-interactional, propositional, and linguistic complexity. Of the three, according to the researchers, linguistic complexity was less vague and more focused than the others.

Linguistic complexity studies contribute to the field of SLA in two main ways. Firstly, they help describe characteristics of learners' language knowledge and production, such as the length and internal structure of clauses and sentences (known as absolute complexity).

Indeed, a wealth of studies have shown a close link between complexity itself and overall language competence. Secondly, second language teaching can also benefit from such research. Research on linguistic complexity has informed the teachability of a target linguistic construction or effectiveness of various instructional methods (Bulté and Housen, 2012; Pallotti, 2015; De Clercq, 2015).

As a broad construct, linguistic complexity is often categorized into two major components, namely lexical and grammatical complexity. Of the two, lexical complexity is said to be an important indicator of linguistic proficiency in general and linguistic complexity in particular. A body of literature has provided evidence supporting this claim, which will be discussed in depth later in this thesis.

Moreover, measuring lexical complexity draws special attention from researchers. This research area came into being around 1935-1940, a critical period that witnessed a thriving trend in vocabulary research in linguistics, psychology, and statistical analysis (Jarvis, 2013). However, researchers still refer to it as a young research area, possibly because of the lack of a stable framework of analysis and measurement (McCarthy and Jarvis, 2010). There has been a constant call for more in-depth investigation, not only to explore new potentials but also to re-examine long-standing conventions.

On this basis, I selected lexical complexity as the research object for this project. Choosing Vietnamese adolescents learning English as a foreign language, I designed a cross-sectional study and analyzed free writing samples produced by the participants. This served two purposes. Firstly, I attempted to investigate the development of lexical complexity over four years by describing quantitative changes in lexical complexity at each developmental stage. Secondly, the study largely involved the process of lexical complexity measurement, which entailed adopting multiple lexical complexity measures; given that, the second goal was to evaluate the construct validity of each measure. I define construct validity with three properties, namely (1) distinguishing the participants' proficiency levels (discriminatory power); (2) explaining the variance (explanatory power); and (3) correlating with each other (comparability).


2. Literature review

2.1 Linguistic Complexity

It is common for difficulty and complexity to be used interchangeably. However, researchers have noted that the two constructs entail different epistemological values (Dahl, 2004; Bulté and Housen, 2012). The former has to do with the cognitive perception of target linguistic items. In other words, difficulty tells us how the learner cognitively perceives the difficulty of learning and verbalizing a linguistic item. In this sense, difficulty reveals the relation between language and its learners: it is learner-dependent, subjectively perceived, and inconsistent among learners. Meanwhile, the latter, (linguistic) complexity, reflects relations among concrete components. It can be a quantitative relation (e.g., longer sentences, fewer repeated words) and/or a qualitative relation (e.g., the learner using rare words). Given these different natures, linguistic complexity, or absolute linguistic complexity (Bulté and Housen, 2012), is investigated quantitatively, whereas qualitative methods are characteristic of research on difficulty, or relative complexity (Bulté and Housen, 2012).

Bulté and Housen (2012) defined linguistic complexity as a broad construct featuring two main types of properties, namely dynamic and stable properties. On the one hand, the dynamic property of learner language reflects the size, width, breadth, and richness of the language. For instance, a text would be considered more complex if it is written or spoken with more distinct word types, infrequently used lexicon, and more idiomatic chunks or collocations. In this example, the three patterns of language production, in that order, are widely conceptualized as lexical diversity, lexical sophistication, and lexical density (De Clercq, 2015).

On the other hand, the stable property of linguistic complexity indicates the depth rather than the breadth of language. One can gauge it by looking at a number of linguistic components, such as passive voice or past tenses, in text or speech. The transparency of the mapping between meanings and forms in text or speech also indicates the stable property of linguistic complexity.

Bulté and Housen (2012) developed a taxonomy identifying three levels on which one can examine the construct of linguistic complexity. The most abstract level in this taxonomy is the theoretical level, at which complexity is analyzed cognitively. The second level involves the judgment of the complexity of the learner's language production. The third level, the operational level, is the most concrete: it is specified as mathematical measurement and statistical assessment of the text or speech in question.

2.2 Lexical complexity in SLA

Lexical complexity (henceforth, LC) is one of the two broad strands (the other being grammatical complexity) in the linguistic complexity schemata (Wolfe-Quintero et al., 1998; Bulté and Housen, 2012). Bulté et al. (2008) defined lexical complexity as one component of lexical competence. In Bulté and Housen (2012), lexical complexity was described as an intrinsic property alongside lexical diversity and lexical sophistication, all placed equally under the umbrella of linguistic competence. However, in their later work, they placed lexical diversity and lexical sophistication under the hierarchy of LC. Pallotti (2015) defined lexical complexity as the number of lexical components and explained that, defined this way, lexical complexity could be practically studied in production data. One important note in Pallotti's definition is the absence of semantic analysis in measuring LC on the operational level. The author attributed this to the polysemic nature of words' semantic meanings, which are affected and nuanced by co-occurring lexical items. That factor makes it impractical to measure and statistically evaluate LC. This, however, seems to clash with the stance held by Bulté et al. (2008), according to whom lexical complexity reflects learners' ability to comprehend and produce both prevalent and peripheral semantic aspects (among other aspects) of a specific word.

The brief overview above implies that there is no consensus on defining, conceptualizing, and operationalizing LC. However, there is markedly more agreement among researchers about the essential characteristics of LC: lexical diversity and lexical sophistication are the two key constructs of LC that researchers tend to agree on (Bulté and Housen, 2012; Jarvis, 2013a, 2013b; Crossley et al., 2011a, 2011b; Pallotti, 2015; De Clercq, 2015).

Given that, I take lexical diversity and lexical sophistication as the focal epistemological aspects of this research. The following discussions will be devoted to reviewing prior literature on conceptualizing and measuring these two constructs.

2.3 Lexical diversity

2.3.1 Defining lexical diversity

Concerning the conceptualization of lexical diversity (henceforth, LD), no consensus on what defines the construct has been reached yet. To Bulté and Housen (2012), LD indicates the size, breadth, width, and number of lexical items in a language sample. This accounts for why LD is often used interchangeably with lexical variation and/or lexical variety (Verspoor, Schmid, & Xu, 2012; Jarvis, 2013a, 2013b). Jarvis (2013a, 2013b) put it simply: LD is the reverse of word repetition rates. The definition by Bulté and Housen (2012) also aligns with the one by Pallotti (2015), who referred to LC as the number of lexical components.

Researchers are also in agreement about LD being an important and determining factor of LC in particular and of vocabulary knowledge or lexical richness in general. Lexical richness is a term used to describe one's mental lexicon (Yule, 1994, as cited in Jarvis, 2013a). In this sense, lexical richness covers more than the variety of the lexicon (or LD). In fact, researchers tend to see lexical richness as a broad construct that captures multiple subordinate constructs, for instance LD and lexical sophistication (Read, 2000; Šišková, 2012). That is to say that lexical richness and LD are two distinct constructs even though they may overlap at some point. As reported in Jarvis (2013a, 2013b), however, LD and lexical richness are interchangeable for some researchers. Cases in point are the studies by Daller, van Hout & Treffers-Daller (2003), Daller and Xue (2007), Daller (2010), and Daller (2013), in which the researchers measured lexical richness using LD indices.

In spite of inconsistency in conceptualizing LD, researchers seem to agree upon one thing. It is that LD is an essential constituent of LC. Hence, measuring LD has received noticeable attention. This will be discussed in the following sections.

2.3.2 Measuring lexical diversity

The history of LD studies can be traced back to the years between 1935 and 1944, a period that witnessed an intensive focus on lexical research. The number of LD measures is too large to list here. In this report, I chose to review several representatives: (1) the Type/Token Ratio (TTR), the longest-standing but so far deemed the most problematic measure; (2) the Guiraud index, a representative of the transformed TTR; and (3) D and MTLD (Measure of Textual Lexical Diversity), two recent innovations on TTR adopting a curve-fitting approach.

TTR: A problematic index

Some of the first renowned achievements in this research line are Zipf's law in 1935 and the notion of diversity of vocabulary coined by John Carroll in 1938. On the basis of Carroll's work, Johnson (1944) developed the TTR (Type/Token Ratio) index to calculate the ratio between the number of different words (types) and the total number of words (tokens) in a text. TTR was an important invention in the field and drew considerable interest and attention. For years, TTR enjoyed the status of a ground-breaking and predominant instrument for calculating the LD of language samples, until researchers put its construct validity into question.

In fact, among other issues, text-length dependence is arguably the thorniest problem challenging the construct validity of LC measures, and TTR is claimed to be the most affected by text length. The index is calculated by dividing the number of word types (every different word is a new type) by the number of tokens (each running word is one token). Researchers found that the TTR score decreases when the length of a text increases (Malvern et al., 2004; McCarthy & Jarvis, 2010; Jarvis, 2013a, 2013b). They explained that when the text becomes longer, the number of tokens increases, while the number of word types increases at a much slower pace. This way, TTR steadily regresses as new tokens enter. Due to this, TTR scores hardly reflect genuine changes in the complexity of a text; rather, they reflect changes in text length, which could be produced by simply duplicating the same text several times (Malvern et al., 2004; McCarthy and Jarvis, 2010; Šišková, 2012; Koizumi & In'nami, 2012).
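To make this length effect concrete, here is a minimal sketch in Python (the sample sentence is invented for illustration): duplicating a text adds tokens but no new types, so the TTR drops even though the vocabulary itself has not changed.

```python
def ttr(tokens):
    """Type/Token Ratio: number of distinct word forms divided by total word forms."""
    return len(set(tokens)) / len(tokens)

text = "the teacher helps the students and the students thank the teacher".split()
print(ttr(text))      # 6 types / 11 tokens, roughly 0.55
print(ttr(text * 3))  # same text repeated three times: 6 / 33, roughly 0.18
```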

Text length sensitivity has caused TTR to lose ground, and researchers have been extensively seeking alternatives; solutions have long been studied (Malvern et al., 2000; Jarvis, 2002; Jarvis, 2013a, 2013b). Two immediate solutions deriving from this idea are limiting text length and mitigating the falling rate of the TTR curve. The first method is implemented by cutting the text into segments of equal length. However, this intervention may be detrimental to the validity of the measures (Jarvis, 2002; Šišková, 2012). The second method seems more plausible. Instead of changing the text length, TTR can be mathematically transformed using a square root (e.g., the Guiraud index) or a logarithm (e.g., the Herdan index and the Uber index) to create a model in which the curve changes only slowly with text length. In this paper, I choose the Guiraud index as a representative of such transformations of TTR to review in more depth.

Solution 1: Transforming TTR

The Guiraud index is a transformation of TTR that has been long studied, is well-established, and claimed to be a valid alternative measure. It is sometimes referred to as a measure for lexical richness (Daller, van Hout, & Treffers-Daller, 2003; van Hout & Vermeer, 2007; Daller & Xue, 2007; Daller, 2010; Šišková, 2012; Bulté & Housen, 2014). Bulté and Housen (2014) argued that the Guiraud index could capture lexical richness rather than sheer LD. Note that, in Bulté and Housen (2014)’s definition, lexical richness covers LD and lexical productivity.

The index was invented by Guiraud in 1954 and is calculated by dividing the number of word types by the square root of the number of tokens. Guiraud found empirical evidence showing the index to be stable across French literary texts of 1,000 to 100,000 words (Daller, 2010). This means that the Guiraud index is impervious to length variation. Thus, it is claimed to surpass TTR by compensating for the decrease in TTR as text length increases (Daller & Xue, 2007; Koizumi & In'nami, 2012; Bulté & Housen, 2014). Up to now, the Guiraud index has prevailed as an effective measure of LC.
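For comparison, here is a minimal sketch of the square-root and logarithmic transformations mentioned above (Guiraud and Herdan); the formulas follow the definitions given in this section rather than any particular software implementation.

```python
import math

def guiraud(tokens):
    """Guiraud index: number of types divided by the square root of the number of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def herdan(tokens):
    """Herdan's index (log TTR): log of the number of types divided by log of the number of tokens."""
    return math.log(len(set(tokens))) / math.log(len(tokens))
```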

Much research has provided evidence acknowledging the validity of the Guiraud index in measuring LC. One instance is the research reported by van Hout & Vermeer (2007). The authors chose three chapters of Genesis (King James Bible), consisting of 2,241 tokens and 376 types, and then fragmented the text into samples of different sizes varying from 100 to 1,000 tokens. A linear regression analysis of the relationship between types and tokens was not valid for TTR, because the fitted model implied that a sample of 0 or 1 tokens would contain 59 types, which did not make sense; additionally, the residuals were not randomly distributed. To address these issues, they applied algebraic transformations using the Guiraud index and the Herdan index. Statistically comparing the three measures, TTR, Guiraud, and Herdan, the researchers acknowledged Guiraud as the best candidate to yield a linear regression (R = .99854; b0 = -32.9891; b1 = 9.3994).

Another example is the research by Daller and Xue (2007). The researchers analyzed utterances produced by Chinese learners of English in two different instructional settings, in the UK and in China, and then measured the development of lexical richness in the language output of each group. The Guiraud index differentiated the participants' performance (F = 24.912; η² = .342) most accurately.


In the study by Verspoor et al. (2012), the Guiraud index also was a strong discriminator to distinguish adjacent levels of groups of pre-intermediate Dutch learners of English. In the same line, De Clercq (2015) operationalized the Guiraud indices for nouns and verbs in English and French utterances produced by secondary school pupils in Belgium. The author also acknowledged the Guiraud index as a reliable descriptor of proficiency which distinguished LD of texts from level one to level three in both languages.

Furthermore, high correlations between the Guiraud index and other LD measures have been confirmed in a wealth of research. Bulté et al. (2008) found high correlations of 0.95 to 0.99 between Guiraud, Uber, and D when investigating LD of French data obtained from unplanned oral story retellings by 38 Dutch-speaking and 19 French-speaking pupils in Brussels. Likewise, Šišková (2012) reported that the Guiraud index correlated with the Uber index, D, and HD-D at 0.93, 0.865, and 0.75 respectively.

However, regardless of such recognition, there is still dispute about the Guiraud index. Bulté and Housen (2014) examined complexity development in writings by 45 adult ESL learners over the course of an academic term. They employed three LC measures, including the Guiraud index, the advanced Guiraud index, and D, but found no statistically significant differences across levels; in that study, the Guiraud indices had low discriminatory power. Besides, Koizumi and In'nami (2012) studied the robustness of multiple measures, including the Guiraud index, and investigated their text length sensitivity. Texts of 200 tokens were fragmented into segments ranging from 50 to 100 tokens. Across this range, they found the Guiraud index the second most affected by text length (ηp² = .73) after TTR (ηp² = .64). This inconsistency in findings regarding the validity of Guiraud justifies the concern of Vermeer (2000) and Daller & Xue (2007) that the Guiraud index is only valid in certain circumstances.

Solution 2: TTR curve-fitting approaches

D – A random sampling approach

There seems to be no doubt about the flaw of TTR in terms of being unable to capture the variation of texts. However, even for transformed versions of TTR such as the Guiraud index, researchers remain doubtful about their validity. Firstly, such measures are still sensitive to text length (Malvern et al., 2004, p. 27; Koizumi & In'nami, 2012). Secondly, the Guiraud index and the like entail analyses of the size of the smallest samples instead of taking into account the entire TTR curve. Given that, results can be uninterpretable and incomparable (Richards and Malvern, 2000; Jarvis, 2002; Malvern et al., 2004, chapter 2).

This problem was addressed by Richards and Malvern (2000) and Malvern et al. (2004), who proposed a robust alternative measure called D. D differentiates itself from and surpasses its antecedents because it is based on an entire TTR curve rather than a single point. In this light, Malvern et al. (2004) assumed that D was not affected by text length. Furthermore, this makes studies that vary in their baseline of tokens comparable (Richards & Malvern, 2000; Jarvis, 2002; Malvern et al., 2004). Richards and Malvern (2000) proposed the following formula.

TTR = (D / N) × (√(1 + 2N/D) − 1)

In the above equation, N is the number of tokens and D is a parameter that is adjusted to fit the data; TTR in the equation is the predicted TTR. D is adjusted several times until the predicted TTR yields values closest to the actual TTR. To calculate the actual TTR, 100 samples of 35 tokens are first drawn from the text, and the TTR of each sample and the mean TTR of all samples are calculated. A similar calculation is then carried out on longer samples, from 36 up to 50 tokens. The TTR curve is the aggregation of all TTR means of the samples. The calculation is implemented successively up to the point where the predicted TTR is as close as possible to the actual TTR, at which point a model curve of TTR is produced. The D value represents this curve, and the height of the curve indicates the LD of the text. D can be computed with a program called Vocd, available in CLAN (MacWhinney, 2013).
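To make the procedure concrete, the following is a rough, simplified sketch of a vocd-style estimation (random samples of 35-50 tokens, least-squares fit of the curve above). It is not the actual Vocd program in CLAN; for instance, it uses a single run and a plain grid search for D.

```python
import random
from statistics import mean

def ttr(tokens):
    return len(set(tokens)) / len(tokens)

def predicted_ttr(d, n):
    # Richards and Malvern (2000): TTR = (D/N) * (sqrt(1 + 2N/D) - 1)
    return (d / n) * ((1 + 2 * n / d) ** 0.5 - 1)

def estimate_d(tokens, sizes=range(35, 51), trials=100):
    """Fit D by matching the predicted TTR curve to the empirical mean TTRs of
    random samples of 35-50 tokens (sampling without replacement).
    Assumes the text contains at least 50 tokens."""
    empirical = {n: mean(ttr(random.sample(tokens, n)) for _ in range(trials))
                 for n in sizes}
    candidates = [d / 10 for d in range(10, 2001)]  # crude grid: D from 1.0 to 200.0
    return min(candidates,
               key=lambda d: sum((empirical[n] - predicted_ttr(d, n)) ** 2 for n in sizes))
```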

Another characteristic of the D calculation is that it draws on random sampling of tokens to plot the TTR curve. Researchers see this as a strength of D because random sampling avoids clusters of content words (Richards and Malvern, 2000; Malvern et al., 2004; McCarthy & Jarvis, 2010). D has been employed in a wealth of research and has earned recognition as an "industry standard" (McCarthy & Jarvis, 2010).


De Clercq (2015), in his analyses of oral retellings by Dutch learners of English and French, reported that D had a stronger discriminatory effect than Guiraud. D was better able to distinguish consecutive proficiency levels (effect sizes ηp² = .75 for French and ηp² = .68 for English), whereas the effect sizes of the Guiraud index ranged from 0.54 to 0.68. McCarthy and Jarvis (2010) used D to distinguish text registers in a corpus composed of 16 different registers; D accurately explained 46.7% of the model, ranking third (after MTLD and Maas) out of six measures.

D is also said to highly correlate with other commonly-used measures.

McCarthy and Jarvis (2010) found the highest correlation between MTLD and D (r = .848). In the same line, DeBoer (2014) compared D scores and HD-D scores in English texts of different L1 speakers and saw a consistently high correlation (r > .9) between the two.

However, regardless of these achievements, the validity of D is not absolute. Firstly, since it draws on random sampling, which is a non-sequential approach, D may perform worse than its sequential counterparts, as cautioned by its creators (Malvern et al., 2004, p. 72). Secondly, as McCarthy and Jarvis (2007) pointed out, D can still be affected by text length. Treffers-Daller (2013) in fact verified this: the author found that D strongly correlated with the number of tokens (r = .61, p < .001).


MTLD – A sequential approach

Another popular measure taking up the curve-fitting approach is MTLD (Measure of Textual Lexical Diversity). MTLD was developed by McCarthy (2005) as an index of LD adopting a sequential approach.

Sequentially processed, MTLD is assumed to outdo D in terms of maintaining the integrity of a text, which, according to researchers, makes results more interpretable (McCarthy and Jarvis, 2010, 2013). MTLD is computed as the mean length of word strings that maintain a default factor TTR of 0.72: the TTR is evaluated token by token until its value falls to 0.72. The value 0.72 is the default because at this point TTR trajectories reach their stabilization point (McCarthy and Jarvis, 2010), that is, the point at which the trajectory is no longer affected by an increase in new types. This is said to be the feature distinguishing MTLD from its counterpart indices. To be specific, a non-sequential approach such as D calculates the number of words making up a TTR curve, whereas MTLD looks at the progress a text makes before reaching the point of saturation. By implication, the faster a text gets to the saturation point, the less diverse it is, because it takes fewer words to become type-saturated.
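A minimal sketch of this sequential procedure (forward and backward passes, factor threshold 0.72, partial factors included) may help; it is a simplified reading of the description above and of McCarthy and Jarvis (2010), not the Gramulator implementation itself.

```python
def mtld_one_pass(tokens, threshold=0.72):
    """One directional pass: count full factors (segments whose running TTR
    falls to the threshold) plus a partial factor for the leftover segment."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count <= threshold:
            factors += 1                      # full factor reached, start a new segment
            types, count = set(), 0
    if count > 0:                             # partial factor for the unfinished segment
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors > 0 else float("inf")

def mtld(tokens, threshold=0.72):
    # MTLD is the mean of a forward and a backward pass over the token sequence.
    return (mtld_one_pass(tokens, threshold) +
            mtld_one_pass(list(reversed(tokens)), threshold)) / 2
```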

Comparing MTLD with D, McCarthy and Jarvis (2010, 2013) confidently stated that the former was at least as effective as the latter. Indeed, MTLD outperformed D and other indices in a number of studies (Crossley & McNamara, 2012; Crossley, Salsbury, & McNamara, 2009; McCarthy and Jarvis, 2010, 2013). MTLD has also been reported to correlate highly with long-standing indices such as D (Koizumi and In'nami, 2012; Šišková, 2012).

However, it does not necessarily mean that MTLD always yields an optimal result across research. To shed light on this, we need to consider one important property in MTLD operationalization. It is the effect of partial factors.

Partial factors are any values less than 0.72 (a full factor). When calculating D, partial factors are discarded; in computing MTLD, however, they are taken into consideration because they inform how far the text is from the saturation point of 0.72. Partial factors nevertheless come with a caveat. McCarthy and Jarvis (2010) were concerned about a practical problem when processing short texts: according to them, there is a high chance that short texts contain mainly partial factors. Given this, the researchers suggested a threshold of 100 words in order to maintain the validity of MTLD. This was confirmed by Koizumi and In'nami (2012), who found MTLD the best candidate in terms of resistance to text length, but still affected in texts of 50-150 and 50-200 tokens, with effect sizes of ηp² = .11 and ηp² = .12 respectively. Besides, the superiority of MTLD over other indices, especially D, is not necessarily constant. In analyses of French texts by English-speaking university learners of French, Treffers-Daller (2013) found that D (r = .763), rather than MTLD (r = .571), correlated most strongly with the C-test, a language ability measure. Furthermore, in the same research, the author reported a length dependency of MTLD of r = .47, p < .001.


2.4 Lexical sophistication

2.4.1 Defining lexical sophistication

Lexical frequency is a key aspect of lexical sophistication (henceforth, LS). According to Crossley et al. (2011a), input frequency is a determining factor in language acquisition and production. In Jarvis's (2013) terms, LS reflects learners' advanced use of language in terms of producing infrequent words. In this sense, sophisticated lexicon is concomitant with infrequent vocabulary. As reviewed in Crossley et al. (2011a), the empirical literature on lexical acquisition has pinpointed that high-frequency words tend to be processed, responded to, and judged more quickly than infrequent ones. LS is also said to be an effective predictor of the holistic quality of learner language (Kyle & Crossley, 2015).

Furthermore, as Crossley and Salsbury (2010) suggested, sophisticated words were related to word length, an important indicator of LD. This, to some extent, accounts for the growing interest in revealing the correlation between LD and LS measures (Šišková, 2012).

2.4.2 Measuring lexical sophistication

Conceptualized on the basis of word frequency, LS is accordingly operationalized with frequency-based measures. There is not much dispute about word frequency being the most common approach to investigating lexical knowledge (Crossley et al., 2011b; Kyle and Crossley, 2015). Malvern et al. (2004) called it an extrinsic measure because its implementation requires a reference to an external source, normally frequency word lists retrieved from reference corpora. Juxtaposing it with the traditional TTR, Malvern et al. named the frequency-based measure the "Type-Type Ratio" (p. 121). They found it in accordance with the rarity measure by Ménard (1983, as cited in Malvern et al., 2004).

Up to now, to my knowledge, there are two mainstream methods taking up frequency measurement: frequency band-based and frequency count-based. Each of them is discussed below.

Frequency band-based approach

The band-based approach involves the use of frequency word lists as a reference source: the words in a text are evaluated and then categorized into multiple bands, such as the 1,000 most frequent word band (the K-1 band) or the 2,000 most frequent word band (the K-2 band). The fewer high-frequency words a text is made up of, the more sophisticated the text is. To take an example, with the K-2 band as the frequency band, if text A contains 80% of its tokens in that band and text B has 70% of its words in the same band, then text B is more sophisticated than text A.
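As a sketch of the band-based idea, the function below profiles a text against frequency bands; the two miniature word lists are invented placeholders standing in for the real K-1 and K-2 lists used by tools such as VocabProfile.

```python
def band_profile(tokens, bands):
    """Percentage of tokens falling into each frequency band; tokens found in
    none of the bands are counted as 'off-list'."""
    counts = {name: 0 for name in bands}
    counts["off-list"] = 0
    for tok in tokens:
        for name, wordlist in bands.items():
            if tok.lower() in wordlist:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    return {name: 100 * n / len(tokens) for name, n in counts.items()}

# Hypothetical miniature word lists (the real bands contain 1,000 word families each).
bands = {"K-1": {"my", "teacher", "is", "very", "good"},
         "K-2": {"favourite", "kind"}}
print(band_profile("my favourite teacher is very kind".split(), bands))
# roughly {'K-1': 66.7, 'K-2': 33.3, 'off-list': 0.0}
```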

Researchers have acknowledged effective band-based tools, for example the Lexical Frequency Profile (Laufer and Nation, 1995) and VocabProfile (Cobb, 2013), which indicate language proficiency (Cobb, 2000), test performance (Morris & Cobb, 2004), and text types and text comprehension (Nation, 2006). Among the numerous frequency-based methods, to my knowledge, the groundbreaking and most widely used one is the Lexical Frequency Profile (LFP). In order to reveal the lexical depth of language samples and the method's sample size effect, Laufer and Nation created word lists and classified them under four headings corresponding to four frequency bands: the 1,000 most frequent words, the second 1,000 most frequent words, academic words, and the least common words, the last band containing words belonging to none of the first three groups. On this basis, they analyzed written essays by university students in an attempt to discriminate their language levels. The researchers appreciated the discriminatory power of the method, especially when using the last two bands.

However, LFP by Laufer and Nation and similar tools still face critique. Meara and Bell (2001) and Malvern et al. (2004) remarked that the method is still subject to sample size dependency and that results derived from processing short texts are inconsistent. The discriminatory power of frequency-based measures is also not always consistent. Verspoor et al. (2012) and De Clercq (2015) respectively drew on a Customized Lexical Frequency Profile and the BNC-COCA-25 frequency lists (similar to LFP) to examine the link between LS and the language proficiency of Dutch learners of English and French. The authors of the two studies were in agreement that the frequency-based measures could not discriminate very well between learners at adjacent levels.


Frequency count-based approach

The implementation of the frequency count-based approach begins with choosing a reference corpus. After that, the frequency of each word in the target text is calculated, from which an average frequency score for the text is derived.
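A minimal sketch of this procedure is given below; the small frequency table is an invented placeholder, whereas a real study would draw the values from a reference corpus such as CELEX.

```python
import math

# Hypothetical frequencies (occurrences per million words) standing in for a reference corpus.
FREQ_PER_MILLION = {"the": 60000.0, "my": 9000.0, "teacher": 120.0, "kind": 90.0, "meticulous": 2.0}

def mean_log_frequency(tokens, freq=FREQ_PER_MILLION, missing=0.5):
    """Average log10 frequency of the words in a text; lower scores mean rarer
    vocabulary, i.e. higher lexical sophistication."""
    return sum(math.log10(freq.get(tok.lower(), missing)) for tok in tokens) / len(tokens)

print(mean_log_frequency("my meticulous teacher".split()))
```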

Among many others, Coh-Metrix is a state-of-the-art tool to compute LS indices.

Coh-Metrix is developed on the CELEX database (Graesser et al., 2004; Graesser and McNamara, 2011).

Researchers have attempted to investigate the construct validity of Coh-Metrix word frequency indices as indicators of linguistic proficiency. For instance, Crossley et al. (2011b) studied the lexical knowledge of 100 learners of English by analyzing their writings and computing various indices; CELEX content word frequency was among the strongest predictors of the students' lexical knowledge and was able to explain 29% of the variance. However, the track record is not always consistent. Crossley et al. (2011a) examined the correlation between the CELEX word frequency index and human evaluations of lexical proficiency in 240 written texts by adult English learners whose levels varied from beginner to advanced. Even though the authors reported that the word frequency index was an important indicator, the index itself shared only 5% of the variance with the human judgments. Results were even less significant in speech data, as reported in Crossley et al. (2011b). In this study, Crossley and his colleagues looked at the CELEX content word index and found a negative Pearson correlation for speech samples produced by 29 adult learners of English. Also, the CELEX content word index failed to predict human ratings of the speech samples (t = -.672, r² = 0, p > 0.050).

Comparing the two approaches, the count-based approach is said to be closer to the frequency data, thus allowing greater accuracy and, especially, the ability to capture small changes (Crossley, Cobb, and McNamara, 2013). Meanwhile, missing subtle changes in the learner's language development is deemed the weakness of the band-based approach, because small changes in the learner's linguistic development may occur within one band (Meara, 2005). Another pitfall of the band-based approach is the arbitrariness of grouping words into bands, which can cause overlap. In fact, research has shown that the count-based approach outperforms its counterpart in analyzing texts. One case in point is the research by Crossley et al. (2013), which compared the validity of the Coh-Metrix indices and LFP in classifying texts by English speakers of different competence levels (both native English speakers and L2 learners). The accuracy rate of the Coh-Metrix indices was 58%, and that of LFP was 10% lower.

My aforementioned reviews, however, are not meant to imply that the count-based approach is flawlessly superior to its counterpart. In fact, as pointed out by Crossley et al. (2013), frequency-count indices may be misleading as indicators of lexical knowledge because the operationalization may also take into account morphological characteristics, such as go and gone, that reflect grammatical rather than lexical proficiency. This feature would make the frequency-count method fail when studying morphologically rich languages such as Finnish.


3. Aims and Research Questions

In this research, I firstly attempt to investigate the development of lexical complexity of English learners over four years. The research subjects are four groups of Vietnamese students of English, each differing from the others in English proficiency and age. The first focus is to examine how changes in LC are reflected in their writings according to their English proficiency levels. The investigation revolves around two sub-components of LC, namely LD and LS.

Furthermore, the current research is characterized by quantitative analyses. In other words, on the operational level (Bulté and Housen, 2012), I compute multiple measures of LD and LS. On this basis, the second aim of this study is to statistically evaluate the strength of the chosen measures. Their strength is defined as (1) how well they distinguish the groups' levels (discriminatory power); (2) how much they are able to indicate the groups' proficiency (explanatory power); and (3) how strongly they correlate with one another (comparability).

The two guiding purposes are respectively reflected in the three following research questions.

Research Question 1: How does lexical complexity in English develop over four years?

1.1. How does lexical diversity in English develop over four years?

1.2 How does lexical sophistication in English develop over four years?


Research Question 2: How well do the measures explain learners' linguistic proficiency?

Research Question 3: How comparable are the measures to one another?

3.1. How comparable are the lexical diversity measures to one another?

3.2. How comparable are the lexical sophistication measures to one another?


4. Method

4.1 Research context and participants

This current research was designed in the form of a cross-sectional study in order to explore developmental trends in the LC of Vietnamese learners of English. The research setting was a secondary school in Hanoi, Vietnam. The school covers four grade levels, from 6th to 9th grade. In each grade, I randomly selected one group of 30 students, resulting in four separate groups and 120 participants in total. They were respectively named Group 1, Group 2, Group 3, and Group 4.

The students were placed in each group by the school on the basis of their age and academic performance. The age range of the four groups was from 11 to 14 years old. The students’ academic performance was mainly quantitatively assessed with a set of tests at the school.

As for English, the language was mandated as a foreign language subject at the school. Group 1 and Group 2 received three hours of instruction in English every week. The amount of time for Group 3 and Group 4 was about two hours.

Students took two summative exams by the end of each semester, together with several formative tests over the course of an academic year. Their scores in the summative and formative exams (the former weighted more) determined their transition to the next grade. The Common European Framework of Reference for Languages (henceforth, CEFR) was the systematic benchmark for English proficiency assessment at the school. Accordingly, the English proficiency levels of the four groups were respectively A1.1, A1.2, A2.1, and A2.2. The following are the descriptions of bands A1 and A2.

A2: Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very basic personal and family information, shopping, local geography, employment). Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment, and matters in areas of immediate need.

A1: Can understand and use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help (Global scale - CEFR 3.3: Common Reference levels, n.d.).

In this research, I adopted the school's assessment as the evaluative basis for the overall linguistic competence of the four groups. I assumed that overall English proficiency increased across the four groups over the four years. Note that the four proficiency levels represent general developmental stages of language learning, from beginning to more advanced, rather than implying that the learners in each group attained exactly the same level of linguistic competence.

4.2 Data collection

I collected data by the end of the first semester in the 2016-2017 academic year. To tackle practical issues, I chose a narrative writing task over a speaking activity. I believed a writing task suited beginners better because it allowed them more time for reflection and composition. Additionally, according to teachers at the school, writing was a routine activity in class. A familiar task, I anticipated, would prevent unwanted socio-affective factors such as anxiety and stress.

Moreover, the choice of this task essentially made the project feasible for me because I was not able to be present at the school at the time of data collection.

Given my absence, the process of data collection was largely facilitated by four English teachers at the school following my remote instructions. The corpus consisted of 120 writings by the participants, with an average text length of 177 words. In all four groups, the teachers gave the students the same task at approximately the same time and provided explicit instructions on the task day. The prompt was to write about their favorite teachers. The cut-off point for length was 100 words; this threshold was suggested in prior research (McCarthy & Jarvis, 2010; Koizumi & In'nami, 2012) as the minimal word count needed to compute MTLD and sustain its validity. The students were not provided with any help from external sources such as teachers, peers, dictionaries, or the internet during the task. The allotted time was 45 minutes, equivalent to one class meeting at the school. The writings were then scanned by an assistant and sent to me for analysis in Finland.

4.3 Linguistic complexity measures

LC was measured in two broad lines, diversity and sophistication. The operationalization of each construct involves multiple measures and will be described in the following sections.

4.3.1 Linguistic diversity measures

The four LD measures are TTR, the Guiraud index, D, and MTLD. The rationales behind the choice vary but mainly boil down to two factors, namely practicality and credibility.

Firstly, practicality refers to the availability of measurement tools. To calculate TTR scores, I used the software AntConc (Anthony, 2019). As for the Guiraud index, its formula is the total number of word types divided by the square root of the number of tokens (Daller, 2010); I calculated the Guiraud scores manually. D and MTLD are more recent measures and, to some extent, require a more advanced operationalization. D indexes the LD of all words (Malvern et al., 2004) and is calculated with the CLAN software (MacWhinney, 2013), which also features lemmatization functions through the MOR and POST programs. MTLD (the Measure of Textual Lexical Diversity) was created by McCarthy (2005), and its author provides free software named Gramulator (McCarthy, 2011) to calculate MTLD scores.


Regarding credibility, it is crucial to determine the validity and reliability of the data analyses and findings. Credibility, by and large, implies the construct validity that each method features, which has been discussed in depth in the literature review (see above). As also clearly stated there, each of the four measures chosen in this study is valuable in different ways. TTR is a groundbreaking index that has stood the longest but has also encountered the most critique due to its sample size effect. The Guiraud index is a representative mathematical transformation of TTR intended to fix TTR's text length sensitivity. D and MTLD are two measures derived from the more recent, innovative curve-fitting approach, but each of them takes a different direction. On the one hand, in the case of D, a fitted TTR curve is produced with a random sampling approach; the rationale behind this is to avoid clusters of words. On the other hand, MTLD takes the sequential approach and takes into consideration the partial factors that are discarded in the D calculation. The purpose is to look at the progress that a text makes before reaching the point of saturation, at which text length is no longer influential.

4.3.2 Linguistic sophistication measures

To operationalize linguistic sophistication, I chose two tools, VocabProfile (Cobb, 2013) and Coh-Metrix version 3 (Graesser et al., 2004; Graesser and McNamara, 2011). VocabProfile is available at https://www.lextutor.ca/vp/comp/ and Coh-Metrix at http://tool.cohmetrix.com. They represent the two main, well-established types of frequency measures, namely frequency band and frequency count (see above). However, they have a practical drawback: the websites allow only one text to be processed at a time, so analyzing this number of texts was a time-consuming and laborious process.

To use VocabProfile, I chose the VP-Compleat BNC-COCA 25 version provided on the website. BNC-COCA 25 is among the most recently updated lists provided there. The list is an integration of the BNC (British National Corpus) word lists (Nation, 2005, as cited in Cobb, n.d.) and 450 million words from the COCA corpus (Corpus of Contemporary American English) (Davies, 2012, as cited in Cobb, n.d.). BNC-COCA 25 is expanded to 25 K-levels (K stands for thousand) and was developed on the basis of word frequency and range. Covering the two dominant English varieties, British English and North American English, the list is said to be more inclusive; hence, texts analyzed against this list will have fewer "off-list" items, that is, words that do not belong to any of the list's categories, which is often the case when the reference corpus is small.

Having chosen the version, I entered one text at a time and submitted it after correcting spelling mistakes and removing proper names. This was to avoid an unnecessary number of words falling into the "off-list" category. The results were reported in terms of percentages of word families, types, and tokens in each K-band ranging from K-1 to K-25 (25 bands). In the current research, I


When computing with the Coh-Metrix tool, I adopted a similar procedure of text editing (correcting spelling mistakes and removing proper nouns) and submission for analysis. Coh-Metrix returned a list of 106 descriptive items corresponding to various analytical categories that reflect lexical proficiency from multiple perspectives, such as vocabulary size, depth of knowledge, and access to the lexical core (Crossley et al., 2011a, 2011b). Given the purpose of this study, I collected only the mean frequency values of all words (descriptive item 93) and of all content words (descriptive item 92). The following table summarizes the LC measures used in the current research.

Table 1: A summary of measures

Constructs | Measures
Lexical diversity | TTR, Guiraud index, D, MTLD
Lexical sophistication | VocabProfile, CELEX-all words, CELEX-content words

4.4 Statistical analyses

Statistical analyses were carried out in SPSS 25 (IBM, 2017) and served the purpose of comparing developmental trends (Research Question 1) and examining the strength of LD and LS measures used in this research in terms of how well they are able to explain the variance and how comparable they are to each other (Research Questions 2 and 3).


As for Research Question 1, asking how lexical complexity develops over the course of four years, I computed one-way ANOVA tests and post-hoc comparisons using the Bonferroni test. Field (2014, pp. 372-373) describes Bonferroni as an effective test for controlling the Type I error rate and as more suitable than alternatives such as the Tukey test when the number of means is small.

To answer Research Question 2, I calculated effect sizes in the form of ηp², using the partial eta-squared formula. The formula is a good fit for the question because it gives the percentage of the variance (in learners' proficiency) that can be explained by one variable but not by others in the analyses (Field, 2014, p. 415).
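For reference, partial eta-squared can be written as follows (in a one-way design it coincides with eta-squared):

\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}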

Research Question 3 asks how comparable the measures are to one another. Statistically speaking, the question taps into the correlation among the measures. I carried out Pearson correlation tests with two groups of measures.

In the investigation into the strength of the chosen measures, I touched upon two main aspects of construct validity, namely convergent validity and divergent validity (McCarthy and Jarvis, 2010). The former validates how well measures of the same construct agree with each other. The latter evaluates how well an index disagrees with the most flawed one. Hypothetically, the most flawed measure is TTR, as findings in the previous literature suggest (see the literature review). However, no such hypothesis was made for any LS measure.

The table below is an overview of the statistical testing corresponding to each research question.

Table 2: An overview of statistical analyses

Research question | Analyses | Statistical tests
1 | Comparing lexical complexity differences among groups | One-way ANOVA & Bonferroni test
2 | Explaining the variance | Partial eta-squared (ηp²)
3 | Correlating with other measures | Pearson correlation
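The analyses in Table 2 can be illustrated with a short sketch. The thesis used SPSS 25; the Python version below is only an illustrative equivalent, and the per-group score lists (e.g. guiraud_g1 ... guiraud_g4) are hypothetical placeholders.

```python
from itertools import combinations
from scipy import stats

def one_way_anova_with_eta(groups):
    """One-way ANOVA F test plus partial eta-squared, which in a one-way design
    equals SS_between / (SS_between + SS_within)."""
    f, p = stats.f_oneway(*groups)
    grand_mean = sum(x for g in groups for x in g) / sum(len(g) for g in groups)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return f, p, ss_between / (ss_between + ss_within)

def bonferroni_pairwise(groups):
    """Pairwise t tests with Bonferroni-adjusted p values (capped at 1)."""
    pairs = list(combinations(range(len(groups)), 2))
    adjusted = {}
    for i, j in pairs:
        _, p = stats.ttest_ind(groups[i], groups[j])
        adjusted[(i + 1, j + 1)] = min(p * len(pairs), 1.0)
    return adjusted

# Example usage with hypothetical per-group score lists:
# f, p, eta_p2 = one_way_anova_with_eta([guiraud_g1, guiraud_g2, guiraud_g3, guiraud_g4])
# pairwise_p = bonferroni_pairwise([d_g1, d_g2, d_g3, d_g4])
# r, p = stats.pearsonr(guiraud_scores, d_scores)   # comparability (Research Question 3)
```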


5. Results

Findings derived from statistical analyses of data will be presented in the form of answering the three research questions.

5.1 Research question 1: How does lexical complexity develop over four years?

Research question 1.1: How does lexical diversity develop across four groups?

Table 3 below summarizes the descriptive statistics (means and standard deviations) of all four LD measures.

Table 3: Descriptive statistics of lexical diversity scores

Group | TTR Mean (SD) | Guiraud Mean (SD) | D Mean (SD) | MTLD Mean (SD)
1 | .56 (0.06) | 6.61 (.96) | 57.11 (17.98) | 48.09 (18.23)
2 | .56 (0.06) | 6.73 (.59) | 55.91 (11.76) | 54.36 (11.18)
3 | .56 (0.05) | 7.6 (.74) | 80.74 (17.09) | 66.21 (17.56)
4 | .557 (0.06) | 7.98 (.85) | 86.6 (24.07) | 70.08 (17.36)

My tentative judgment of the results is that the four group means were relatively unchanged in the case of TTR, for example group 1, M = .56, SD = .06, and group 4, M = .557, SD = .06. There was an increase in Guiraud score means between Group 1, M = 6.61, SD = .96, and Group 4, M = 7.98, SD = .85. Marked differences appeared in the D scores: group 1's mean was 57.11 (SD = 17.98), much lower than group 4's, M = 86.6, SD = 24.07. The MTLD mean score of group 1 was 48.09 (SD = 18.23), while Group 4's score was noticeably higher, M = 70.08, SD = 17.36.

A one-way between-groups ANOVA was conducted to compare the four groups' LD mean scores. LD, indexed by the scores of TTR, Guiraud, D, and MTLD, was the dependent variable, and the groups' English proficiency was the independent variable. The rationale behind the ANOVA tests was to examine how well the LD scores (dependent variable) explain differences in the language proficiency of the four groups (independent variable). The following table reports the degrees of freedom (df), F test results (F), and significance values (p).

Table 4: ANOVA results

Measure | df** | F | p
TTR | 3 | .1 | .96
Guiraud | 3 | 20.81 | 0*
D | 3 | 22.67 | 0*
MTLD | 3 | 11.76 | 0*

*. The p-value is less than .001.
**. The within-group degrees of freedom are 116.

As seen in Table 4, significant differences among the four groups were found for the three indices Guiraud, D, and MTLD, but not for TTR. Of the three, the most significant difference was in D, F(3, 116) = 22.67, p < .001. The F value for Guiraud was slightly lower, F(3, 116) = 20.81, p < .001, and the F value for MTLD was about half of that, F(3, 116) = 11.76, p < .001. A striking contrast is found in the results for TTR, F(3, 116) = 0.1, p = .96. The finding suggests that there was no significant difference in LD as calculated with TTR; in other words, TTR is a very weak discriminator of the groups' LD development.

As no significant differences were found in TTR mean scores, I proceeded to multiple comparison tests for scores derived from the three other measures.

The comparisons were implemented on three levels including consecutive groups, two-level apart, and three-level apart. Consecutive groups are one-level apart groups such as group 1 and group 2, coded as 1-2. Likewise, two-level apart groups such as group 1 and group 3 are written as 1--3. Three level apart is 1-4, meaning group 1 and group 4. On this basis, I defined the discriminatory capability of the measures as follows.

Significantly distinguishes consecutive groups (1-2, 2-3, 3-4): Strong
Significantly distinguishes two-level apart groups (1-3, 2-4): Medium
Significantly distinguishes the three-level apart group (1-4): Weak
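The sketch below (an illustration only, not a re-run of the SPSS Bonferroni post-hoc test) shows how this pair coding and a simple Bonferroni adjustment could be implemented: independent-samples t-tests are computed for every pair and the raw p values are multiplied by the number of comparisons. The scores are invented placeholders.

    # Illustrative sketch: Bonferroni-adjusted pairwise comparisons using the
    # pair coding defined above (1-2 consecutive, 1-3 two levels apart, etc.).
    # Scores are invented placeholders, not data from this study.
    from itertools import combinations
    from scipy.stats import ttest_ind

    scores_by_group = {
        1: [52.3, 61.0, 48.7, 66.2, 55.4],
        2: [58.4, 50.1, 57.9, 60.3, 54.8],
        3: [79.5, 83.2, 72.8, 90.1, 77.6],
        4: [88.0, 92.4, 75.6, 95.3, 84.1],
    }

    pairs = list(combinations(scores_by_group, 2))  # 6 pairs for 4 groups
    n_comparisons = len(pairs)

    for g1, g2 in pairs:
        t_stat, p_raw = ttest_ind(scores_by_group[g1], scores_by_group[g2])
        p_bonf = min(p_raw * n_comparisons, 1.0)    # Bonferroni adjustment
        print(f"{g1}-{g2} ({g2 - g1} level(s) apart): adjusted p = {p_bonf:.3f}")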

Table 5 below presents the results of post-hoc comparisons using the Bonferroni test.


Table 5: Multiple comparison results

         Guiraud                      D                            MTLD
Groups   Mean diff.   SD    p        Mean diff.   SD     p        Mean diff.   SD     p
1-2      .012         .21   1        -1.2         4.71   1        6.27         4.22   1
1-3      .99          .21   .05      23.63        4.71   .05      18.12        4.22   .05
1-4      1.37         .21   .05      29.5         4.71   .05      22           4.22   .05
2-3      .87          .21   .05      23.63        4.71   .05      11.86        4.22   .05
2-4      1.25         .21   .05      30.7         4.71   .05      15.73        4.22   .05
3-4      .38          .21   .05      5.87         4.71   .05      3.87         4.22   .05

Firstly, and unsurprisingly, Table 5 shows that the biggest differences were found at the three-level apart level, between group 1 and group 4. This holds for Guiraud, D, and MTLD alike (respectively M = 1.37, SD = .21; M = 29.5, SD = 4.71; and M = 22, SD = 4.22, all at p = .05). Meanwhile, there was no statistically significant development of lexical complexity between level 1 and level 2: all three measures failed to distinguish group 1 and group 2, namely Guiraud (M = .012, SD = .21), D (M = 1.2, SD = 4.71), and MTLD (M = 6.27, SD = 4.22), at p = 1. By contrast, among the other consecutive pairs, groups 2 and 3 showed a marked difference in mean scores: the two groups had a significant mean difference in D scores (M = 24.83, SD = 4.71, p = .05), and the mean difference in MTLD scores was M = 11.86 (SD = 4.22, p = .05). Compared with the pair 2-3, groups 3 and 4 differed far less; their mean differences in the three measures were Guiraud M = .38 (SD = .21), D M = 5.87 (SD = 4.71), and MTLD M = 3.87 (SD = 4.22, p = .05). Furthermore, the two-level apart pairs differed strongly, especially in D and MTLD mean scores. The biggest difference in D scores was between level 2 and level 4 (M = 30.7, SD = 4.71, p = .05). For groups 1 and 3, the mean difference in MTLD scores was 18.12 (SD = 4.22, p = .05), which was also the highest mean difference found for MTLD. The findings suggest that no significant development of LD was found between levels 1-2 or between levels 3-4. The clearest change, as expected, was between level 1 and level 4. More notably, noticeable changes started from level 2 onwards, as reflected in the significant mean differences between groups 2-3 and groups 2-4 calculated with the stronger measures, especially D and MTLD.

In addition, of all the indices, D appeared to be the strongest discriminator: it was able to significantly distinguish one consecutive pair, groups 2-3 (M = 23.63), as well as the two-level apart pairs, groups 1-3 and 2-4. Meanwhile, the other indices were able to distinguish only the two-level apart groups.


Research question 1.2: How does lexical sophistication develop over four years?

Table 6 below reports descriptive statistics for the LS scores. There was no marked difference except for a slight decrease in the groups' use of frequent words. In VocabProfile, means slightly increased between group 1 (M = 96.62, SD = .45) and group 2 (M = 97.41, SD = .29) and then gradually declined. The two other measures showed a similar trend. In the case of CELEX-all, there was a slight increase between level 3 and level 4, from 3.06 (SD = .01) to 3.08 (SD = .01). In CELEX-content, mean scores stayed unchanged between level 1 and level 2 (M = 2.59, SD = .03 and .02) and then gradually decreased towards level 4 (M = 2.52, SD = .02).

Table 6: Descriptive statistics of lexical sophistication scores

         VocabProfile       CELEX-all          CELEX-content
Group    Mean      SD       Mean      SD       Mean      SD
1        96.62     .45      3.07      .02      2.59      .03
2        97.41     .29      3.09      .01      2.59      .02
3        96.52     .38      3.06      .01      2.55      .02
4        95.54     .36      3.08      .01      2.52      .02

As in the statistical testing of the LD scores, I conducted a one-way between-groups ANOVA to reveal differences in the four groups' LS mean scores. The LS indices, namely VocabProfile, CELEX-all, and CELEX-content, were the dependent variables, and the groups were the independent variable. Table 7 below reports degrees of freedom (df), F values (F), and significance values (p).

Note that the within-group degrees of freedom were 116.

Table 7: ANOVA results

                 df*    F       p
VocabProfile     3      4.23    .007
CELEX-all        3      .685    .563
CELEX-content    3      2.476   .065

*. Within-group degrees of freedom = 116.

Table 7 shows that, of the three measures, significant differences lay only in the mean scores of the VocabProfile measure, F(3, 116) = 4.23, p = .007. For the two others, the p values were greater than the alpha level of .05, meaning that there was no significant development in LS across the groups when measured with the CELEX database. Hence, only the mean scores derived from the VocabProfile tool were kept for multiple comparisons using the Bonferroni post-hoc test. Following the same protocol, the comparisons were conducted in three conditions, namely one-level apart (groups 1-2, 2-3, and 3-4), two-level apart (groups 1-3 and 2-4), and three-level apart (groups 1-4).


Table 8: Comparison of VocabProfile score means

Groups   Mean difference   SD    Sig.
1-2      -.79              .53   .825
1-3      .1                .53   1
1-4      1.08              .53   .255
2-3      .89               .53   .568
2-4      1.87              .53   .003
3-4      .98               .53   .392

As shown in Table 8, groups 2 and 4 were the only pair found to differ significantly in LS scores; the mean difference between the two groups was 1.87 (SD = .53, p = .003). The other pairs showed no significant changes in the use of frequent words. In particular, the performance of groups 1 and 3 was statistically indistinguishable (p = 1). The findings suggest that the LS indices selected in this research had very limited capability to distinguish the groups, and the participants' linguistic development was hardly reflected in these indices.


Research Question 2: How well do the measures explain LD development and language proficiency?

As mentioned in the Method section (see above), answers to this question are meant to shed light on how much of the variance each measure explains. The individual indices are the dependent variables, and the variance represents progress between the developmental stages. The larger the proportion of variance a measure is able to explain, the more reliable it is as an indicator of language proficiency in general and of lexical complexity development in particular. Effect sizes, in the form of partial eta-squared (ηp2), were calculated in SPSS and are reported in the following sections.
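For reference, partial eta-squared is defined as ηp2 = SS_effect / (SS_effect + SS_error), which in a one-way design can equivalently be recovered from the F value and its degrees of freedom as ηp2 = (F × df_effect) / (F × df_effect + df_error). The short check below applies this relation to the F values reported for the LD indices (df_effect = 3, df_error = 116); it is a sketch for verification only and not part of the SPSS output.

    # Hedged sketch: partial eta-squared recovered from F and the degrees of freedom,
    # eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    # F values are those reported for the LD indices; df_effect = 3, df_error = 116.
    def partial_eta_squared(f_value, df_effect, df_error):
        return (f_value * df_effect) / (f_value * df_effect + df_error)

    for name, f_value in [("TTR", 0.103), ("Guiraud", 20.81), ("D", 22.67), ("MTLD", 11.76)]:
        print(f"{name}: eta_p^2 = {partial_eta_squared(f_value, 3, 116):.3f}")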

Lexical Diversity

Table 9: Effect sizes of lexical diversity measures

          df    F       p       ηp2
TTR       3     .103    .958    .003
Guiraud   3     20.81   .00     .35
D         3     22.67   .00     .37
MTLD      3     11.76   .00     .23

As Table 9 shows, D (ηp2 = .37, df = 3, F = 22.67, p < .001) and Guiraud (ηp2 = .35, df = 3, F = 20.81, p < .001) were the two measures that explained the most variance. MTLD, in third position, was able to explain 23% of the variance.
