
UNIVERSITY OF EASTERN FINLAND PHILOSOPHICAL FACULTY

SCHOOL OF HUMANITIES MDP in Linguistic Sciences

Translation Studies and Translation Technology

Daniil Bergman

Segmental pronunciation and error gravity in L2 Russian speakers of English: A data-driven approach

MA Thesis

Autumn 2021


ITÄ-SUOMEN YLIOPISTO – UNIVERSITY OF EASTERN FINLAND

Tiedekunta – Faculty: Philosophical Faculty
Osasto – School: School of Humanities
Tekijät – Author: Daniil Bergman
Työn nimi – Title: Segmental pronunciation and error gravity in L2 Russian speakers of English: A data-driven approach
Pääaine – Main subject: MDP in Linguistic Sciences
Työn laji – Level: Pro gradu -tutkielma (Master's thesis)
Päivämäärä – Date: 12.12.2021
Sivumäärä – Number of pages: 50

Tiivistelmä – Abstract

The main problem in pronunciation training is identifying second language speakers’ most frequent production errors. Several approaches address the issue of error systematization and identification, notably the contrastive analysis hypothesis and expert analysis. However, most lists based on these approaches (e.g. Swan, 2001) either imply that all errors are equally important or rest on the experience of language teachers, with little data supporting the classification. The main goal of this study is to take a data-driven approach to this matter. An error analysis of L1 Russian speakers of English was conducted using a self-collected corpus containing 1,132 phonetically balanced English sentences, 150 of which were randomly selected and manually annotated. The annotated database was then analyzed to determine whether this approach is suitable for pronunciation error identification. The findings confirmed that, although necessarily limited in scope, the approach can be useful: it identified several errors not explicitly mentioned by classroom experts and drew important distinctions for some others. The discussion of pronunciation combined with functional load (Brown, 1988) highlighted persistent errors whose treatment might be beneficial in a classroom setting, helping teachers optimize the learning process and shift their focus accordingly.


Avainsanat – Keywords

Russian learners, word error rate, error frequency, accented speech intelligibility, functional load, pronunciation corpus, pronunciation teaching, segmental errors

Contents

I. Introduction... 4

II. Literature review ... 6

1. Segmental pronunciation and error gravity ... 6

2. Russian pronunciation of English ... 8

3. Automatic speech recognition and speech intelligibility ... 12

4. Research questions ... 15

III. Methodology ... 17

1. Corpus justification ... 17

2. Recording process ... 19

3. Annotation process ... 20

4. Results collection process ... 22

IV. Results ... 24

1. Substitutions and distortions ... 27

2. Deletions ... 31

3. Additions ... 32

4. Errors identified by the corpus and the experts ... 34

5. Unexpected errors... 36

6. Word error rate ... 37

V. Discussion ... 38

1. Distribution of errors in the corpus ... 39

2. Priorities ... 40

VI. Implications ... 42

VII. Conclusions ... 45

References ... 45

Appendix I. List of ARPAbet symbols and corresponding IPA symbols and examples. ... 50


I. Introduction

Throughout the history of pronunciation teaching, there has been a persistent concern with how the speech errors of second language (L2) learners should be identified. The best approach is to identify and rectify those errors on a learner-by-learner basis, but this requires a lot of time and cannot reliably serve as a guideline for other L2 speakers. There are, however, several approaches that deal with the problem of commonality. The first one, the contrastive analysis hypothesis (Munro, 2018), assumes that ‘mismatches between the phonological systems of the L1’ (or native language) ‘and target language (TL) will help identify common difficulties’ (Rehman, 2020).

The effect of L1 is most noticeable in pronunciation, as Swan and Smith (2001) point out. Moreover, Munro and Derwing (2006) note that even though pronunciation is the area most affected by L1, it still does not receive enough attention in the classroom. That is especially evident for Russian speakers of English. The linguistic literature available on the topic is quite scarce, and local classroom literature dedicates serious attention to this issue only in university-level linguistics programs and other fields where comprehensibility of speech is considered one of the main objects of study. As Sokolova (2008) points out, even bachelor-level textbooks that focus on English phonetics still require a corrective course in pronunciation taken beforehand, acknowledging that this matter has not been a focus for most students entering university, even among those who plan to pursue a career in a linguistics-related field. She also firmly believes that comparison between the TL and L1 phonetic systems might help students improve their pronunciation and comprehensibility.

However effective that system may be at improving TL pronunciation, it falls short in predicting whether an error will occur. In his later work, Munro (2018) acknowledges that some errors predicted by comparing language systems did not occur, while other errors, not predicted by these comparisons, appeared in analysis, often across several L1s, ‘as part of the developmental path shared by learners’. In other words, while this system can predict pronunciation difficulties with relative accuracy, consistently predicting pronunciation errors with 100% accuracy is not possible (Rehman, 2020).

Several other approaches have been used to identify pronunciation errors, the most prominent being expert analysis. Several Russian classroom textbooks (Kichigina et al., 2002; Sokolova, 2008; Bondarenko, 2009) often combine expert analysis with the comparative approach. Kichigina (2002) explicitly mentions in her textbook that the authors deliberately combined tongue-twisters and articulatory exercises found to be effective in the classroom with theoretical knowledge from several different sources. Nilsen and Nilsen (1971) drew on ‘more than fifty linguists and other language specialists’ (p. xiii) to identify possible pronunciation errors according to learners’ L1. Swan and Smith (2001) likewise describe drawing on the expertise of different writers to capture common grammatical and phonological errors according to learners’ L1.

Despite the wide usage of the aforementioned approaches, most of the lists compiled by various authors do not include data supporting their claims, in no small part because these lists are mostly based on the expertise of language teachers (see McAndrews & Thomson, 2017).

None of the error lists discovered, however, differentiates between errors by their gravity, thereby, as Rehman (2020) rightly notes, ‘missing a key aspect of a principled approach to setting priorities for teaching’. This can cause confusion, because errors that barely affect intelligibility are treated the same way as errors that are ‘far more likely to result in confusion for listeners’ (Levis, 2018).


II. Literature review

1. Segmental pronunciation and error gravity

There have been several attempts to determine segmental pronunciation errors. One system, mentioned in the introduction and discussed by Fries (1945) and Lado (1957), is the contrastive analysis hypothesis (CAH), which states that errors made in the L2 learning process are mostly caused by interference from the L1. However, as Derwing and Munro (2018) mention, CAH has been found flawed in terms of error analysis. As Rehman et al. (2020) aptly point out, other approaches such as functional load (FL) are more representative of error gravity in speech (see Brown, 1988; King, 1967; Meyerstein, 1970; Munro & Derwing, 2006).

Accent comprehensibility as a whole is another important topic when it comes to accented speech. Derwing and Munro’s (2020) article revisiting the problem of foreign accent and comprehensibility tackles the issue from a general standpoint. They confirm a notion that is quite important for this paper: that the “dimensions at issue are related, but partially independent” and that “the speech can be heavily accented but highly intelligible”. This correlates with the idea that even though the speech samples collected can be quite accented, it does not necessarily mean that the speaker’s utterances are significantly harder to process for an L1 English speaker. This is where the concept of functional load can be quite useful.

For this work, it is also important to understand what functional load is and how it helps with error distinction. Brown (1988) covers this topic extensively in his paper, while also explaining the diminishing presence of linguistically oriented material in the classroom, an observation that still holds up today in some sense. He states that for reasonable phonetic training in the limited time available, the teacher should come to terms with the fact that some minimal pairs occur more often than others, and should thus have more time dedicated to them. Only a few languages reliably distinguish the [f] and [θ] sounds, and seeing how few minimal pairs use these sounds, perhaps more attention should be paid to more frequently used pairs, e.g. [p] – [b]. Functional load serves precisely this purpose: it measures the frequency of a minimal pair’s occurrence in the language and uses it to predict the pairs that may hinder the comprehensibility of speech. The functional load is then put on a scale from 1 to 10, where 10 stands for the most frequently used minimal pairs. However, even though this approach has been used in classroom settings for quite some time (Brown, 1991), it has not yet been used widely enough for L2 assessment purposes (Kang and Moran, 2014).
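To make the idea of functional load concrete, the toy sketch below finds the minimal pairs a given phoneme contrast distinguishes in a small lexicon. The word list and phoneme symbols here are invented for illustration; a real study would count pairs over a full pronouncing dictionary such as CMUdict.

```python
def minimal_pairs(lexicon, a, b):
    """Find word pairs whose pronunciations differ only in the single
    phoneme contrast a/b -- the kind of counting that underlies
    functional load. `lexicon` maps words to phoneme tuples."""
    pairs = []
    words = list(lexicon)
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            p1, p2 = lexicon[w1], lexicon[w2]
            if len(p1) != len(p2):
                continue
            # positions where the two pronunciations disagree
            diffs = [(x, y) for x, y in zip(p1, p2) if x != y]
            if len(diffs) == 1 and set(diffs[0]) == {a, b}:
                pairs.append((w1, w2))
    return pairs

# a tiny hand-made lexicon (hypothetical phoneme symbols)
toy = {
    "pat": ("p", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "pin": ("p", "ih", "n"),
    "bin": ("b", "ih", "n"),
    "thin": ("th", "ih", "n"),
}
```

In this toy lexicon the [p]–[b] contrast distinguishes two word pairs while an [f]–[θ] contrast distinguishes none, mirroring Brown's point that some contrasts carry far more weight than others.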

Munro and Derwing (2006), in their paper The functional load principle in ESL pronunciation instruction: An exploratory study, further show that some pairs affect intelligibility more than others, depending on their functional load. Their analysis of L2 accented speech, assessed by L1 speakers of English, shows that mistakes in pairs with high FL tend to considerably lower the intelligibility of speech. It is important to note that the number of errors with high functional load also drops considerably as proficiency increases, as illustrated in research conducted by Kang and Moran (2014). Their research shows a significant decline in high-FL errors in CPE (Cambridge Proficiency Exam, corresponding to the C2 level of English) speakers (who had a low accentedness rating), compared to PET (Preliminary English Test, corresponding to the B1 level) speakers (who had a high accentedness rating). However, due to the time constraints of this project and the limited number of people available, only two problems will be looked into: how intelligible the L2 speech in this research is, and how much it is affected by accent, using the word error rate (WER) from automatic speech recognition (ASR).

This approach was previously used by different researchers (Moore et al., 2019; Bernstein, 2013; Gallardo, 2017; Cooper and Wang, 2017) with relative success, although for different purposes. Even though the consensus seems to be that the WER approach has many drawbacks, it is still viable for this research’s purposes. This is further discussed in the ‘Automatic speech recognition and speech intelligibility’ section of this paper.


It is important to mention, however, that even though the functional load for some phonemes might be high, it tells us very little about how the speaker may be perceived. A good illustration of this is an L1 speaker of Russian who wants to pronounce the word ‘rat’. Generally speaking, it is very probable that this speaker will replace the voiced alveolar approximant [ɹ] with the voiced alveolar trill [r], which is a native sound of the Russian language, and will also pronounce the [æ] in the middle of the word as [e], one of the vowels of the Russian phonetic system. Compared with RP, it is likely that [ret] will be produced instead of [ɹæt]. However, even though comprehensibility is not lowered significantly, these sounds will most likely be noticeable to an L1 speaker of English, even if they would not be able to explain exactly why it sounds wrong.

Therefore, it would also be beneficial to look at approaches that do not involve phonemic contrasts. Syllable structure errors are among the segmental errors that are hard to define using the concept of functional load. As Rehman et al. (2020) rightly mention, there is no certainty about the extent to which this type of error affects speech intelligibility. Gao and Weinberger (2018), in their study of accentedness, find that some errors, like vowel epenthesis, are more likely to be considered accented, while others, like deletion, may not be considered a signifier of accent at all. Some of the classroom-oriented textbooks described later do a good job of tackling these problems.

2. Russian pronunciation of English

Russian is one of the most widespread languages in the world. It occupies the number 8 spot for the number of speakers worldwide1, with around 258 million speakers2. It is the official and cultural language of Russia, as well as of other countries such as Belarus and Kazakhstan. Russian is also widely used in the CIS countries, the Baltic states, Central Asia and the Caucasus3, making it one of the most widely distributed languages in Europe and the most widely spoken Slavic language. Compared to other geographically distributed languages, like Arabic or Chinese, there is a high degree of mutual intelligibility not only between distant regions of Russia, but also between Russian, Belarusian and Ukrainian (Comrie, 2018).

1 https://www.ethnologue.com/guides/ethnologue200, accessed 1 December 2021

2 https://www.ethnologue.com/language/rus, accessed 18 October 2021

Pronunciation difficulties in different languages, specifically foreign accent, have long been studied by a multitude of scholars. Munro and Derwing (1998) define foreign accent as “the extent to which an L2 learner’s speech is perceived to differ from native speaker norms”. However, there is a distinct lack of papers that approach this subject from a data-driven point of view, and they rarely concern L1 Russian accented English speech. According to Munro (2018), most pedagogues have shifted their attention from linguistic sources to such classroom-oriented material as Nilsen and Nilsen’s (1971) Pronunciation Contrasts in English, despite the book being first published 49 years ago. This is also the case for Russian institutes of higher education: even though students prefer more expensive foreign books and the government prohibits the use of books published before 2010, older textbooks are still used in the classroom or simply republished with slight adjustments (Pitina, 2015). A good example is Gzhanyanz’s (1969) Korrektivniy fonetiko-yazikovoi kurs [Speech Pattern Course]. Another is Sokolova’s (2008) Prakticheskaya fonetika Angliyskogo yazika [Practical Phonetics of the English Language], also a popular phonetics textbook, even though it is just a slightly revised version of an original book published in 1984.

Only a handful of international papers written in English focus on the problem of L1 Russian pronunciation of English, notably Crosby’s L1 Influence on L2 Intonation in Russian Speakers of English, a longitudinal case study of one Russian-speaking student and the influence of L1 Russian on L2 English intonation. Another paper that tackles this problem is Gildersleeve-Neumann and Wright’s (2010) study of bilingual children in the US who speak both English and Russian at home. Even though its main focus is bilingual children, it offers quite an extensive analysis of the Russian phonetic system and its comparison to the English system. Despite the fact that the paper takes a very thorough approach to the problem of accented speech, its subjects (bilingual children between 3 and 5 years old) exist in a quite different environment compared to the subjects of this research. The effect of L2 exposure cannot be overstated, as a significant number of researchers have shown (Kubota et al., 2020; Zsiga, 2003; Swan, 1987). A good example is Kubota et al.’s (2020) paper Losing access to the second language and its effect on executive function development in childhood: The case of 'returnees'. In this study, L1 Japanese speakers with prolonged exposure to L2 English, who returned to the L1 environment for a while and then came back, were studied to measure the effect of their exposure. The results show a noticeable improvement in proficiency in the language the speakers were exposed to.

3 According to https://en.wikipedia.org/wiki/Russian_language

This fact is quite important for this study because, even though the speakers who provided the recordings for this paper range from B1 (Intermediate) to C2 (Proficient) in their command of English, they are exposed to the Russian language daily. This provides a unique opportunity to study the influence of L1 Russian on L2 English pronunciation, since all the studies mentioned above look at speakers who temporarily or permanently reside in an L2 environment.

Since English-language research papers on L1 Russian English pronunciation mistakes are scarce, one of the main sources for pronunciation errors has been Swan’s textbook on pronunciation difficulties, titled Learner English. It features 22 different groups of speakers, with a meticulous description of the difficulties each group encounters during the L2 acquisition process, and it contains a very thorough description of the problems L1 Russian speakers of English face. This textbook offers the most extensive list of the phonetic difficulties a speaker might have; however, due to the scope of this research and in order to account for confirmation bias, this study will focus on some of the most obvious ones.


Another study that analyzes L1 Russian speakers’ phonetic difficulties is Gildersleeve-Neumann and Wright’s (2010) paper, mentioned above. As already noted, its main focus is not phonetic difficulties, so it does not explicitly list the problems L1 Russian speakers might encounter. However, the part of the paper that discusses Russian phonetics and compares it to English phonetics implicitly shows some of the possible difficulties. The higher number of consonants, which can also be palatalized in Russian, combined with the absence of diphthongs in Russian and a smaller vowel inventory, are among the difficulties evident from the comparison. Even syllable structures can present quite a challenge for an untrained L1 Russian speaker of English, with Russian allowing structures such as CCCCVC (взгляд [vzɡlʲat], ‘glance’) or CVCCCC (монстр [monstr], ‘monster’).

Another extensive list appears in the textbook authored by Sokolova (2008). Since it is more classroom-oriented, it focuses on some of the most salient features of a Russian accent and thus is not as extensive as the first list, but it still gives a good idea of how a phonetics teacher teaches English phonetics in Russia. Her thorough analysis of phonetic mistakes, marked in Tables 1 to 3 below, is an almost identical match for the mistakes Swan (2001) pointed out. Two of the four speakers recorded had linguistic training that included this textbook in the curriculum, which might produce some interesting results when compared to the speakers who had no prior linguistic training.

Kichigina’s (2002) phonetics textbook, Uchebnoe posobie po phonetike [Phonetics Textbook], offers a structured side-by-side comparison of problematic phonemes. Like Sokolova’s textbook, it also illustrates the articulation of various problematic vowels and consonants, but it does not focus as much on intonation patterns and suprasegmental difficulties as Sokolova does.


3. Automatic speech recognition and speech intelligibility

ASR, or automatic speech recognition, is a system that uses various methodologies and technologies to enable speech recognition, in other words, to convert spoken language into text by computer. The conversion process, however, is quite complex and requires several steps. First, the acoustic speech signal is recorded using a microphone. Then the signal is processed by the computer and further adjusted (i.e. filtered, denoised, converted to the required frequency and cut into slices). The resulting signal slices can then be analyzed as separate entities. Depending on the method, the program extracts features from those slices and compares them to a source language library, essentially finding a matching segment from that library, which allows the analyzed slice to be recorded as a certain sound. Some existing systems, such as the system described below, can also give several alternative options along with the accuracy of the prediction for the segment in question. These segments are then combined, or concatenated, into words, and words may be concatenated into sentences. Modern ASR systems also allow the use of different language models, mathematical methods of analysis and specific accuracy-increasing vocabularies.
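As a rough illustration of the ‘cut into slices’ step described above, the sketch below frames a waveform into short overlapping windows. The 25 ms window and 10 ms step are common conventions in ASR front ends, not values taken from this thesis.

```python
def frame_signal(signal, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Cut a 1-D waveform (a list of samples) into overlapping slices
    ('frames'), the first preprocessing step an ASR front end performs.
    Window and step sizes are illustrative defaults."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per slice
    step = int(sample_rate * step_ms / 1000)         # hop between slices
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    return [signal[i * step : i * step + frame_len] for i in range(n_frames)]

# one second of audio at 16 kHz yields 98 overlapping 400-sample slices
frames = frame_signal([0.0] * 16000, 16000)
```

Each slice would then go through feature extraction (e.g. spectral features) before being matched against the system's acoustic model.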

Even though a thorough description including all the mathematical and computational details is beyond the scope of this research, a simple metaphor by Kitzing et al. (2009) may help to understand the inner workings of such systems. They describe this process as a jigsaw puzzle or a memory game, where “a picture can only be entirely right or wrong”, but “in a puzzle the picture is built up by many pieces, some of which sometimes may be ‘almost right’”. Comparing that to ASR systems, it provides some insight into why these systems are not always able to produce a 100% correct result.

Modern advances in ASR have made this technology widely available and easy to set up, which has made it an option for transcribing audio input from various sources. In some ways, it might yield better results than manual assessment, since it is not affected by subjectivity, emotion, fatigue, or accidental lapses of concentration (Neri et al., 2010).

General-purpose off-the-shelf ASR systems such as Google ASR4 are becoming progressively more popular for these reasons (see Meeker, 2017; Këpuska & Bohouta, 2017). Google ASR is a very useful tool for assessing non-specific speech, since it is trained on large datasets, spans several different domains, and is constantly improved upon.

The main problem with ASR systems, however, is that while they are quite good at recognizing native English speech, recognition for non-native speakers, and sometimes even for speakers of different English dialects, can be quite poor, mostly due to the heterogeneity of non-native speech (Park & Culnan, 2019). Google reports the error rate for gASR as 5% for native speech and up to 30% for non-native speech (Tejedor-García et al., 2021). Word error rate, or WER, is commonly used to provide this measure.

Word error rate (WER), according to Ali and Renals (2021), is commonly used to ‘evaluate the performance of a large vocabulary continuous speech recognition (LVCSR) system’. The sequence of words ‘hypothesized by the ASR system is aligned with a reference transcription, and the number of errors is computed as the sum of substitutions (S), insertions (I), and deletions (D)’. If there are N words in the reference transcription in total, then the WER is calculated using the following formula:

WER = (I + D + S) / N × 100 (Ali et al., 2016)
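For concreteness, the formula can be implemented with a standard word-level Levenshtein alignment. This is a generic sketch, not code used in the thesis:

```python
def wer(reference, hypothesis):
    """Word error rate: (I + D + S) / N * 100, computed via the minimum
    edit distance between the reference and hypothesis word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution/match
    return d[len(ref)][len(hyp)] / len(ref) * 100

# dropping "on the" from a six-word reference gives 2/6, i.e. about 33.3% WER
error = wer("the cat sat on the mat", "the cat sat mat")
```

Note that because insertions are counted against the reference length N, WER can exceed 100% when the hypothesis is much longer than the reference.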

For WER to be reliable, however, ‘at least two hours of data’ are needed ‘for a standard LVCSR system’ (Ali and Renals, 2021). The common problem with this procedure is that the data in question has to be manually transcribed at least at the sentence level. In this case, however, due to the nature of the experiment, all of the reference sentences are already available and paired with voice recordings comprising about 1.5–2 hours of speech for each speaker.

4 https://cloud.google.com/speech-to-text, accessed 19 October 2021

There are also studies (Park et al., 2008; Ali & Renals, 2018) that have used WER or its analogues to measure speech production accuracy. Even though some authors (He et al., 2011; Favre et al., 2013) suggest that WER is not a good metric, they discuss different applications, such as training a speech recognizer for a speech translation task (He et al., 2011) or “human subjects’ success in finding decisions” (Favre et al., 2013). Since the goal of this study is to compare speakers’ language levels with their production accuracy and to find out whether there is a general relation between speech intelligibility and recognition accuracy, WER is expected to serve as a good measure of how much the Russian accent affects the intelligibility of L2 English.

Lane (1962) discusses intelligibility and accent, mentioning that non-native speakers of English may have problems making themselves understood, especially in non-ideal listening conditions. Flege (1995) expands on that, listing several other possible problems that non-native speakers may encounter, among them the misperception of prosodic features and even frustration in listeners, which may prompt them to adjust the speech they direct at L2 speakers.

As far as this study is concerned, even though many studies discuss the benefits of ASR systems (Gerosa et al., 2009; Raju et al., 2020; Krishna et al., 2019; Mirghafori, Fosler and Morgan, 1996) and their use in various situations, no studies employ ASR models specifically to examine recognition accuracy for the accented speech of L1 Russian speakers of English and to compare it with data from manual annotation and error analysis.


4. Research questions

The purpose of this study is to see whether using a corpus containing phonetic annotations yields any information not identified by classroom experts, and to highlight the differences between the two approaches.

Table 1. English consonant errors for speakers of Russian in various sources.

Error (FL from Brown, 1988): Sokolova (2008), Kichigina (2002), Gildersleeve-Neumann & Wright (2010), Swan (2001)

Wrong variants of /r/: + + + +
/w/ – /v/: + + +
/s or t/ – /θ/ (5, 4): + + + +
/h/ – /x/: + + + +
/ŋ/ – /n/: + + +
/z or d/ – /ð/ (7, 5): + + + +
Dark–light /l/ contrast: + + + +

Notes. FL = functional load. + means that the error is discussed or implied by the authors. Numbers in parentheses following contrasts indicate the functional load of the contrast according to Brown (1988). The numbers range from 1 (lowest FL) to 10 (highest FL).

Table 2. English vowel errors for speakers of Russian in various sources.

Error (FL from Brown, 1988): Sokolova (2008), Kichigina (2002), Gildersleeve-Neumann & Wright (2010), Swan (2001)

Vowels in beat–bit (8): + + + +
Monophthongization of diphthongs: + + + +
Diphthongization of monophthongs (e.g. so pronounced as saw): + +
Vowel in bat (bet) (10): + + + +
Vowels in cot–caught (4): + + + +
Vowels in boat–bought (10): + + + +
Vowels in fool–foot (3): + + + +
Vowels in work–job (4): + + + +
Vowels in bit–bet (9): + + +

Notes. FL = functional load. + means that the error is discussed or implied by the authors. Numbers in parentheses following contrasts indicate the functional load of the contrast according to Brown (1988). The numbers range from 1 (lowest FL) to 10 (highest FL).

Table 3. English syllable structure errors for speakers of Russian in various sources.

Error*: Sokolova (2008), Kichigina (2002), Gildersleeve-Neumann & Wright (2010), Swan (2001)

Pronouncing geminate consonants (e.g. ap-ple): + +
Word-final obstruents pronounced voiceless: + + + +
Word-final voiceless consonants voiced: + +

Note. * No functional load (FL) for these types of errors.

This study hypothesizes that a speech corpus of moderate size may uncover another perspective on the distribution of pronunciation errors among L1 Russian speakers of English. As in Rehman’s (2020) research, ‘positive results would provide a justification for including more widely representative phonetic corpora stratified by spoken proficiency and by task type’. If the research yields positive results, it could also inform the creation of multi-faceted phonetic corpora, which might include more samples and a suitcase corpus, as the original ARCTIC corpus does. Specifically, this study of Russian pronunciation errors uses the self-collected L2-ERSC corpus with approximately 600 sentences from four L1 Russian learners of English with varying levels of proficiency, so that the following three research questions can be answered:

1. How common are errors of the different types (substitutions, deletions, distortions and insertions)?

2. Is there any merit to taking a data-driven approach to pronunciation error identification? Does it yield different results compared to classroom experts’ data, and what are the strengths and weaknesses of the two approaches?

3. Does word error rate (WER) correlate with speaker proficiency and intelligibility?

III. Methodology

1. Corpus justification

Several studies (Toda et al., 2007; Sun et al., 2015; Zhao and Gutierrez-Osuna, 2017; Wu et al., 2016) have used the CMU (Carnegie Mellon University) ARCTIC speech corpus (Kominek and Black, 2004) for a multitude of tasks such as voice conversion and phonetic studies. However, there are not many corpora available online that include L1 Russian speakers of English. One of the few corpora freely available online is the OSCAAR corpus, which includes five speakers of Russian; however, it contains only a few short recorded passages (about 10 for each speaker). CMU ARCTIC itself does not have any Russian-accented speakers; most of its speakers are either native speakers of different English dialects or even professional native-speaking voice actors (VCC dataset). Therefore, apart from CMU ARCTIC, the available corpora are not suitable for accent conversion or voice conversion tasks, since these tasks require a lot of data.

Among other non-native English corpora, the Speech Accent Archive5 and IDEA6 have many different native languages represented, L1 Russian being one of them. The problem with these corpora is that each speaker recorded only a short paragraph or a short free-speech task (Speech Accent Archive and IDEA respectively), which is also not very suitable for voice conversion or accent conversion studies. The corpus recorded for this study, by contrast, may be used in such tasks, as demonstrated by the L2-ARCTIC corpus, used with relative success by the studies mentioned above.

5https://accent.gmu.edu/, accessed 10 June 2021

6https://www.dialectsarchive.com/, accessed 10 June 2021


Another common problem with existing speech corpora is that they are not easily accessible. As Zhao et al. (2018) mention, the Wildcat 7, LDC2007S08 8, and NUFAESD (Bent & Bradlow, 2003) datasets provide ‘a limited number of recordings for each non-native speaker, and have restricted access – LDC2007S08 requires a fee’, while ‘Wildcat and NUFAESD are only available to designated research groups’. In fact, most modern mispronunciation detection systems use private datasets, which makes their use for scientific purposes not feasible (Zhao et al., 2018).

Corpora available in the Russian-speaking part of the internet mainly focus on text collection. There are several corpora that include texts produced by L1 Russian learners of English – texts translated from Russian to English and from English to Russian by future translators, blog posts in English by Russian speakers, and essays produced by university students. However, no freely available speech corpora of Russian learners of English could be found.

In order to overcome the problems mentioned above, a new corpus was constructed. Like L2-ARCTIC, it can be used as a corpus for accent conversion, voice conversion between speakers, and mispronunciation detection (Zhao et al., 2018). This L2 English Russian speakers' corpus (L2-ERSC for short) contains English speech from four different speakers, two male and two female, all with Russian as their L1. Speakers were recruited from the Novosibirsk State University student body; their ages range from 18 to 29 years, with an average of 23.5 years (SD: 4.65). Speakers’ demographic information is presented in Table 4. Proficiency was measured using the score of a standardized test the speakers had taken (IELTS, a Cambridge test, or TOEFL iBT), recorded as disclosed by the speakers. The level of each speaker is listed in a separate column as a value between A1 and C2.

7https://groups.linguistics.northwestern.edu/speech_comm_group/wildcat/, accessed 10 June 2021

8https://catalog.ldc.upenn.edu/LDC2007S08, accessed 10 June 2021


Table 4. Demographic information for Russian participants from the L2-ERSC corpus

Speaker   L1        Gender   Test score          English level
OL        Russian   F        TOEFL iBT - 98      C1
AM        Russian   M        CPE Grade A - 224   C2
ME        Russian   F        TOEFL iBT - 64      B1
EK        Russian   M        IELTS 5.5           B2

The corpus contains 1,132 prompts from the original CMU ARCTIC corpus. These prompts were used for multiple reasons. First, according to the authors of the original L2-ARCTIC corpus, Zhao et al. (2018), they ‘are phonetically balanced (100%, 79.6%, and 13.7% coverage for phonemes, diphones, and triphones, respectively), are open source’ (collected from the Gutenberg library) and ‘produce approximately 1-1.5 hours of speech’. Second, the prompts were ‘proven to work well with speech synthesis and voice conversion’. Finally, they are quite ‘challenging for non-native speakers and may elicit them to make more pronunciation mistakes’, which opens plenty of opportunities for mispronunciation studies.

2. Recording process

The speech was recorded in a quiet classroom at Novosibirsk State University (NSU). A Zoom H2 microphone was used for recording, paired with a generic no-name pop filter. To present the prompts sentence by sentence and to simplify the analysis, BAS SpeechRecorder (Draxler and Jänsch, 2004) was used to display the sentences and automate the process, minimizing interference from the experiment supervisor. The microphone was placed about 15 cm from the speaker, which helped to avoid most of the air puffing. During each recording session, the L2 speaker was guided through the recording process and then left alone to record for 1-2 hours.


Occasionally during the recording process, and after each session, random checks were conducted to ensure that the recordings produced were of sufficient quality. Each speaker took about 3-4 sessions to record all 1,132 prompts, with no session longer than 2 hours, to reduce pronunciation fatigue.

Once the recording was done, a script was used to trim off mouse clicks at the start and the end of the recordings where possible. Another script removed random audio clipping, presumably when the speaker accidentally leaned in. The speech was originally sampled at 44.1 kHz, dual channel, and saved as a WAV file. Each sentence was saved as a separate file. Finally, the recordings were resampled to 16 kHz, single-channel PCM-16 signed WAV files.
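The down-mixing and resampling step can be sketched as follows. This is a minimal in-memory example using SciPy's polyphase resampler, not the actual conversion script used for the corpus; the function name is illustrative.

```python
import numpy as np
from scipy.signal import resample_poly

def to_corpus_format(samples: np.ndarray) -> np.ndarray:
    """Down-mix 44.1 kHz dual-channel audio to mono and resample it
    to 16 kHz, returning signed 16-bit samples (PCM-16)."""
    if samples.ndim == 2:
        # average the two channels into a single mono track
        samples = samples.mean(axis=1)
    x = samples.astype(np.float64)
    # 16000 / 44100 reduces to the integer ratio 160 / 441
    y = resample_poly(x, 160, 441)
    # clip to the int16 range before converting back
    y = np.clip(y, -32768, 32767)
    return y.astype(np.int16)

one_second_stereo = np.zeros((44100, 2), dtype=np.int16)
out = to_corpus_format(one_second_stereo)
print(len(out))  # 16000 samples: one second at 16 kHz
```

Writing the result back out as a single-channel WAV file is then a matter of passing the int16 array and the 16 kHz rate to any WAV writer.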

One session of recordings was affected by an equipment malfunction. As a result, some recordings from 1 to 400 for speaker AM contained segments greatly affected by plosives and were not legible. Since it was not possible to re-record this session due to the speaker leaving the city, these recordings were salvaged with the ERA Plosive plugin for Audacity. Even though the affected words became legible and thus were annotated, the restored words are of poor quality, so these recordings might not be suitable for acoustic measurements, together with sentences 65 (construction noise in the background) and 366 (car alarm in the background) for speaker OL.

3. Annotation process

Like the original L2-ARCTIC corpus, L2-ERSC provides orthographic transcriptions at the word level. The Montreal Forced Aligner (McAuliffe et al., 2017) was used to create phonetic transcriptions in TextGrid format, containing word and phone tiers with their respective boundaries (fig. 1). After the automatic annotation process, a common set of 150 sentences was annotated manually for each speaker. The sentences were selected randomly using a Python script. Each recording selected for further annotation contains an additional tier with pronunciation mistakes. To annotate pronunciation mistakes, the online version of the Longman Dictionary of Contemporary English 9 was used. Also, to account both for the inclination of Russian EFL textbooks towards British English (BrE) (Sokolova, 2008; Kichigina, 2002) and for the huge exposure to American English (AmE) due to the prominence of the Internet, switching between BrE and AmE during the course of the recording was not recorded as a mistake. If the pronunciation did not correspond to either variant, or there was only one correct variant and it differed from the utterance, the American English transcription was used to record the mistake. This kept the annotation broadly in line with the Montreal Forced Aligner's automatic annotation, which is also based on American English variants.

The same reasoning applied to weak and strong variants of words in a sentence, and to suprasegmental features such as incorrect stress, aspiration, or intonation. Even though there are definitive rules for the usage of weak and strong word variants, and incorrect stress or intonation may significantly hinder the intelligibility of a spoken utterance, these features were not the focus of this research and thus were not recorded.

In addition, the boundaries were manually adjusted, and incorrect labels were manually fixed. To ease future computer processing, the ARPAbet phoneme set was used for the phonetic tier, while the tier that lists pronunciation mistakes uses IPA symbols for the error tags.
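To illustrate how the two notations relate, the sketch below converts ARPAbet labels from the phone tier into the IPA symbols used on the error tier. The mapping shown is deliberately partial (the full ARPAbet set has 39 phonemes) and the helper function is illustrative, not the script used in this study.

```python
# A partial ARPAbet-to-IPA mapping, covering the phonemes discussed
# most often in this study.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "DH": "ð", "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i",
    "JH": "dʒ", "NG": "ŋ", "OW": "oʊ", "R": "ɹ", "SH": "ʃ", "TH": "θ",
    "UH": "ʊ", "UW": "u", "ZH": "ʒ",
}

def arpabet_to_ipa(label: str) -> str:
    """Strip the stress digit (AE1 -> AE) and look the symbol up;
    unknown labels fall back to their lower-case form."""
    return ARPABET_TO_IPA.get(label.rstrip("012"), label.lower())

print(arpabet_to_ipa("DH"), arpabet_to_ipa("NG"), arpabet_to_ipa("AE1"))
# ð ŋ æ
```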

9https://www.ldoceonline.com/, accessed 15 October 2021


Figure 1. An example of an annotated Praat recording visualisation with pronunciation errors commented.

4. Results collection process

The data presented in the Results section was collected using a set of Python and Praat scripts written for this research. Since the original prompts were provided in a single text document, they were separated into one .txt file per sentence. After that, as mentioned before, 150 random sentences were selected together with their corresponding prompts. For the resulting 600 recordings, the Montreal Forced Aligner's automatic annotation was used to record the phones on the ‘words – phones’ tier, with boundaries corresponding to the actual positions of those phonemes in the recording. For the error labels, the author of this study listened to the recordings and added a corresponding boundary in the ‘comments’ section. After the manual annotation, another script went through all the TextGrid files containing the annotations, recorded the results for each speaker, and produced pie charts, bar plots, and a .csv file (fig. 2) containing a table of all the data extracted from the TextGrid files.
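The selection and aggregation steps described above can be sketched as follows. Both function names and the error-tag format (e.g. 'ð->z') are illustrative assumptions, not the actual scripts from the author's repository.

```python
import csv
import random
from collections import Counter

def select_common_subset(prompt_ids, k=150, seed=0):
    """Draw one shared random subset of prompt ids, so that all four
    speakers are annotated on the same sentences (seed is illustrative)."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(prompt_ids), k))

def tally_errors(error_tags, out_path):
    """Aggregate error tags collected from the comments tier and dump
    the counts to a .csv table."""
    counts = Counter(error_tags)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["error", "count"])
        for tag, n in counts.most_common():
            writer.writerow([tag, n])
    return counts

subset = select_common_subset(range(1, 1133))  # 1,132 prompts in total
print(len(subset))  # 150 shared sentences
```

In the actual pipeline, the tags passed to the tallying step would come from a TextGrid parser rather than a plain list.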


Figure 2. A sample of the .csv table

This table was later used to produce Table 6 in the Results section, along with other figures used throughout the study, such as the percentage of phonemes affected by errors relative to the total phoneme count. All of the scripts are available on the author’s GitHub 10. The corpus requires additional work to be accessible to everyone and will therefore be made publicly available, together with the scripts, several months after publication.

For the WER calculation, another script was written that used Google Speech-to-Text ASR (gASR) cloud functionality and then applied an algorithm that calculated WER using the Levenshtein distance (Navarro, 2001). Like the Levenshtein distance, WER is defined by the minimum number of operations needed to get from the reference to the hypothesis; unlike the Levenshtein distance, however, the operations are performed on words rather than on individual characters. The results for the WER were recorded in a separate .txt file.
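As a sketch of the second half of that pipeline, the function below computes WER as a word-level Levenshtein distance via dynamic programming. The gASR call is omitted, and the function name is illustrative rather than taken from the actual script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference,
    i.e. the Levenshtein distance computed over words, not characters."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# one substitution (sat -> sit) and one deletion (the) over six words
print(round(word_error_rate("the boy sat on the mat",
                            "the boy sit on mat"), 2))  # 0.33
```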

10https://github.com/Rewaster/L2-ERSC


IV. Results

The 600 sentences recorded by the four Russian L2 speakers of English included 18,604 individual phones (4,651 phones per speaker, with an average of 31 phones per sentence). There were 3,690 segmental errors in total (2,682 substitution errors, 173 deletion errors, 714 distortion errors and 121 addition errors), which amounts to 19.83% of all segments, or an average of 6.15 errors per sentence. The first research question concerned the error distribution for substitutions, deletions, distortions, and additions. As evident from Figure 1, substitutions comprise the majority of phonetic deviations, totaling 72.68%.
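The headline figures can be reproduced directly from the per-category counts, as in this minimal check (the counts are those reported above; the variable names are illustrative):

```python
# Per-category segmental error counts from the annotated subset.
counts = {"substitution": 2682, "deletion": 173,
          "distortion": 714, "addition": 121}
total_phones = 18604  # 4,651 phones per speaker, four speakers
sentences = 600       # 150 shared sentences, four speakers

total_errors = sum(counts.values())
share_of_segments = 100 * total_errors / total_phones
errors_per_sentence = total_errors / sentences
substitution_share = 100 * counts["substitution"] / total_errors

print(round(share_of_segments, 2),    # 19.83 % of all segments
      round(errors_per_sentence, 2),  # 6.15 errors per sentence
      round(substitution_share, 2))   # 72.68 % substitutions
```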

Figure 1. Phoneme error distribution


Distortions, which mainly comprise the /r/ - /ɹ/ and /ɪ/ - /i/ pairs, were not counted as substitutions in this distribution, mainly because the Russian /r/ sound is not phonemic in English and the /ɪ/ - /i/ pair is allophonic. However, applying the same reasoning, the /i:/ - /i/ error was recorded as a substitution mistake – even though the minimal pair for English is /ɪ/ - /i:/, speakers replaced /ɪ/ with /i/ often enough to warrant separating these two pairs into distinct categories.

Figures 2 – 5 show how common each mistake was for each individual speaker. Figure 2 displays the substitution errors’ distribution. ME produced 35.38%, EK produced 29.90%, AM produced 21.29% and OL produced 13.42% of all substitution errors.

Figure 2. Substitution error distribution

Distortion errors in Figure 3 show that speaker ME produced more than half of all the distortion errors. An argument can be made for a correlation between L2 English proficiency level and the number of distortion errors. However, it was not possible to account for individual speaker differences. This is evident from the fact that speaker OL, who, based on her language test results, should be less proficient than speaker AM, was more successful both in her production of /r/ - /ɹ/ and /ɪ/ - /i/ and in overall speech production accuracy, resulting in a 7.99% difference between the highly proficient speakers in terms of distortion errors, or 57 mistakes, 2.5 times fewer than the more proficient AM, and in terms of total production errors (11.66% and 19.22% for speakers OL and AM respectively).

Figure 3. Distortion error distribution

Figure 4. Deletion error distribution


Deletion errors are presented in Figure 4. The distribution here is relatively even for three out of four speakers, with ME, AM and OL producing 25.43%, 19.65% and 12.72% respectively. EK, who was slightly more successful than ME in terms of substitution and distortion errors, accounts for 42.20% of deletion mistakes.

Figure 5. Addition error distribution

Addition errors in Figure 5 follow a similar distribution pattern to distortion errors, with a slight difference in performance from OL, who this time had results very close to those of speaker AM, the two accounting for 7.44% and 6.61% of all addition errors respectively.

1. Substitutions and distortions

Distortions and substitutions, where the expected phoneme was replaced by another identifiable sound, were the most common types of pronunciation error. Some sounds were expected to deviate from English pronunciation norms more than others, but a number of errors were discovered that occurred in the speech of both advanced and intermediate learners. The most common phonetic error in the L2-ERSC data was the voiced dental fricative /ð/, or DH in ARPAbet annotation. It was substituted 381 out of 547 times (69.65%). The most common substitution variants were /z/ (221 occurrences, or 58%), /d/ (157 occurrences, or 41.2%) and /t/ (only 3 occurrences, or 0.8%). It is important to note, however, that the reverse substitution /t/ - /d/ was quite uncommon (only 7 occurrences), which is another argument for the prevalence of word-final devoicing.

A very common pronunciation error recorded was the voiced postalveolar approximant /ɹ/, with 225 errors (29.4%) out of 765 occurrences. For the reasons mentioned above, this pronunciation error was recorded as a distortion: in Russian, <r> is a voiced alveolar trill, so L1 Russian speakers of English frequently replaced /ɹ/ with /r/. In the L2-ERSC dataset, [r] was used to replace the voiced postalveolar approximant /ɹ/ 223 times out of a total of 225 (99.11%).

Interestingly, despite the fact that /ʒ/ exists in Russian, it had a very high error rate in English, being replaced 11 out of 22 times that the phoneme occurred in L2-ERSC sentences (50%).

Figure 6. Distribution of substitution errors

AE, or /æ/ in ARPAbet, was a very common substitution error in the corpus. 300 phonemes out of 509 were affected (58.93%), which might be linked to the fact that it is not a phoneme in the Russian language, making it more unnatural for the speakers to produce. Moreover, judging from Figure 7, improvement in the production of this phoneme can be attributed to increasing proficiency in the language.

/ŋ/ was another commonly substituted phoneme. Having occurred 195 times in the sentences selected for the corpus, 125 of those occurrences were affected by substitution. 118 out of 125 times (94.4%) it was replaced by /n/, which might also be evidence of speaker production being hindered by the absence of the phoneme in the L1.

Word-final devoicing was a very persistent problem within the corpus. Figure 6 includes six of the most persistent mistakes made in speech production. /z/ (268 out of 534, 50.19%), /v/ (106 out of 364, 29.95%) and /w/ (222 out of 455, 48.79%) all have a very high occurrence percentage. The other three phonemes – /dʒ/ (29 out of 106, 27.35%), /d/ (230 out of 923, 24.92%) and /g/ (10 out of 107, 9.34%) – while still prevalent, had an occurrence rate under 30%. /w/ was mostly substituted by /v/ (214 out of 222, 96.3%), /z/ was often replaced by /s/ (262 out of 268, 97.7%) and /v/ changed into /f/ 104 out of the 106 times it occurred (98.11%). For the lower-occurrence phonemes, /dʒ/, /g/ and /d/ were mostly substituted by /tʃ/, /k/, and /t/, with substitution rates of 96.55%, 100%, and 99.56%, respectively.

Monophthongization was most prevalent with the phoneme /oʊ/. Out of 246 occurrences in total, it was substituted 97 times (27.5%), out of which 82 (84.5%) were /ɔ/. Another two instances of monophthongization are /eɪ/ and /aɪ/, which had 19 and 18 errors out of 336 and 339 occurrences respectively (5.65% and 5.3%). Out of these 37 substitutions, 28 involved a monophthong, mostly /a/ (9 times, or 32.1%) for /eɪ/ and /i/ (14 times, or 50%) for /aɪ/.

The vowel /ɑ:/ occurred 266 times and was substituted 153 times (57.5%). One hundred and fifteen (75.16%) of all substitutions were /ɔ/, which is accepted in AmE. However, 34 (22.23%) of the /ɑ:/ substitutions were /a/, most likely because the speakers were substituting the phoneme with a natural-sounding one from the Russian language, as the closest-sounding vowel in Russian is /a/. Errors with other vowels were infrequent, with less than 10% of the tokens being substituted by another phoneme.

Another interesting phenomenon was the substitution of /ə/. Of its 1,717 occurrences, 232 were substituted (13.5%), with very heterogeneous substitution variants. The three most common were /e/ (34.45%), /o/ (21.12%) and /a/ (25.86%). Even though the /ə/ - /e/ variant was counted as a substitution, an argument could be made for recording it as a distortion error. The /a/ and /o/ substitutions, occurring in words such as Hawaiian or disappointment for /a/ and oppressive or Solomon for /o/, are clear examples of orthographic influence on L2 English for L1 Russian speakers, since they tend to read these words phonetically.

The voiced dental fricative /ð/ was substituted 372 out of 547 times (68.01%). The two most common substitution options were /z/ (214 substitutions, or 57.53%) and /d/ (157 substitutions, or 42.2%); /θ/ was substituted 37 times out of a total of 145 phonemes in the annotated part of the corpus (25.52%), with most of these substitutions (21, or 56.7%) being /s/. Other recorded errors had an occurrence rate of less than 5%. Figures 6 and 7 show the distribution of distortion and substitution errors across phonemes and individuals respectively.

Figure 7. Distribution of substitution errors per individual


2. Deletions

Figure 8 portrays the distribution of phoneme deletion errors. The most common deletions involve /j/ and /tʃ/, with 15 out of 164 (9.14%) and 8 out of 125 (6.4%) respectively. The former is related to a sound change process, since most of the words that carried the error (mostly calculated and document) were borrowed and slowly adapted to the Russian phonetic alphabet. The latter, however, is characteristic of spelling pronunciation, since in the words in question (mostly Portuguese and situation) /tʃ/ is spelled with just <t>.

Figure 8. Distribution of deletion errors

Figure 9 shows the distribution of deletion errors per individual. Apart from several phonemes such as /eɪ/ or /t/, there seems to be no correlation between proficiency level and the number of deletion errors, which can thus be attributed to individual speaker differences. Some errors are common to all four speakers, while others (such as /dʒ/ or /əɹ/) are characteristic of only one or two speakers.


Figure 9. Distribution of deletion errors per individual

3. Additions

Figure 10 shows the total distribution of phoneme addition errors. There were a total of 121 insertions, making this the least frequent pronunciation mistake in the L2-ERSC corpus. As evident from the graph, most of the insertions concern the /ŋ/ phoneme. Having occurred 195 times in the sentences selected for the corpus, 34 of those occurrences were affected by addition (17.43%). The two most common additions were the /g/ and /k/ sounds, and they were often connected to the substitution of /ŋ/ with /n/, resulting in either /nk/ or /ng/ as the final phoneme cluster. This is a multi-layered issue. First, L1 Russian speakers of English are very likely to read English phonetically because of the orthographic nature of the Russian language. Therefore, word-final letters that are part of another phoneme, such as <g> in the case of the -ing suffix, will be pronounced as separate phonemes by a less proficient L1 Russian speaker of English. Second, because of /ŋ/'s word-final position, not only do speakers tend to substitute /ŋ/ with the /n/ phoneme, which is more acceptable in terms of the Russian phonetic alphabet, but devoicing, a common phenomenon in the Russian language, also affects the added <g> sound, replacing it with <k>.


Figure 10. Distribution of addition errors

Figure 11 shows the distribution of addition errors across individuals. They follow the same pattern as deletion errors, where some errors are speaker-specific, while others are shared by all of the speakers recorded.

Figure 11. Distribution of addition errors per individual


4. Errors identified by the corpus and the experts

The second research question asked whether a data-driven approach to pronunciation error identification would yield different results compared to classroom experts’ data. After comparing the results with the classroom expert findings from Tables 1-3, Tables 5.1 and 5.2 were compiled to display the findings.

These tables show that there is a clear overlap between the three original tables and the two different approaches employed. However, looking at the data in Figures 1 through 10, it is evident that the data-driven approach uncovered some unexpected errors that were either not explicitly identified or not discussed in detail commensurate with the extent of the error. A good example of this is the /ə/ - /o/ or /a/ substitution. Another point the experts did not discuss is deletion and addition mistakes. Although these account for only 10.93% of total mistakes, they are still a significant portion of pronunciation errors to analyze. This means that out of the 14 errors mentioned in Table 5, five did not overlap. Of the nine errors that did overlap, four had minor or moderate differences in the details.

Tables 5.1 and 5.2. Comparison of expert and corpus findings.


For example, experts often describe /s or t/ for /θ/ as being on the same level of commonality as /z or d/ for /ð/, while the data showed that even when the total number of phonemes is taken into account and the errors are calculated as percentages, /ð/ is still almost three times more commonly affected than /θ/ (68.01% versus 25.52%). Also, even though monophthongization of diphthongs was present and quite significant in the data (although not as prevalent as classroom experience might suggest), the reverse process, diphthongization of monophthongs, was quite uncommon.

The data also suggests that while some of the errors might be attributed to individual speaker differences, with a large enough sample of errors, such as the dataset for the substitution and distortion errors, more proficient speakers almost always outperformed less proficient ones. The data also reveals that errors with rates above 10% were problematic for all of the speakers. Even though all of the textbooks listed are geared towards speakers who are pursuing higher education and have thus demonstrated some level of language proficiency, the teaching and research materials might be oriented towards improving poor pronunciation habits and targeted towards, or based on, less proficient learners. In this case, as Rehman (2020) points out, ‘a systematic collection and analysis’ would ‘represent the types of errors that occur for a group of L2 learners’ of English more adequately. A sufficiently large phonetically annotated speech corpus would be more accurate in showing whether there are other patterns to the errors this study uncovered, whether they are significant on a bigger scale, and how much different levels of proficiency actually affect the patterns discussed here. Despite all that, this study suggests that even with a larger corpus, the classroom expert approach and the data-driven approach would still have several areas with little to no overlap.

5. Unexpected errors

Another part of the second research question concerned the errors that were not identified by experts. A number of errors were revealed by the corpus, such as the /ə/ - /o/ or /a/ substitution already mentioned in section IV. This is most likely due to the fact that Russian has a close correspondence between sound and spelling, while English orthography can often be quite opaque and inconsistent. There are several other reasons why these errors might have occurred.

Several experts suggested that Russian speakers might have problems with English articulation, since English is much more demanding in terms of articulation than Russian. Sokolova (2008) even dedicates a whole section to articulatory exercises aimed at improving sound production in Russian speakers of L2 English. However, people without such training or without sufficient language proficiency may have problems articulating certain segments, resorting to replacing phonemes with the ones they use on a daily basis. This hypothesis is further supported by the data gathered, given how linguistically trained and proficient speakers outperform speakers with little or no training and practice. The same two reasons apply to the word-final additions of /g/ and /k/ in words with a word-final voiced velar nasal.

Another unexpected error is implicitly discussed by the authors of the textbooks in Tables 1-3, but never explicitly mentioned. 50% of all /ʒ/ phonemes were replaced by either /ʃ/ or /z/ (8 occurrences, 72.73%), and while some of this can be attributed to the word-final devoicing common for L1 Russian speakers of English even in their native language, that does not account for 5 out of the 11 substitutions occurring in the middle of a word. This could also be analyzed with a larger phonetically annotated corpus, given that the sample here is small, although overall consistent.

It is also notable that none of the textbooks discuss the problems of addition and deletion, despite even this paper’s limited-scale research uncovering that almost 11% of all errors are either additions or deletions. A possible explanation is the same as the reasoning for the /ə/ - /o/ or /a/ substitution. The two most common unexpected deletion and addition errors (/j/ and /u:/ respectively) often occur in similar word positions, and /j/ is often not reflected in the spelling (common carriers of both mistakes were words such as Fitzhugh, illuminating or lieutenant), where orthography sometimes confused even the more proficient speakers and elicited phonemic errors.

Despite all that, it is also notable how many similarities there are between the errors discussed by the experts and the errors found in the L2-ERSC corpus. There is also some consistency across different languages. Even though some of the patterns differ from those displayed by speakers of Arabic (Rehman et al., 2020), commonalities such as orthographic confusion and /r/ variant production are quite striking and further support the view that the data-driven approach may account for errors which are hard to detect and discover in a classroom setting.

6. Word error rate

The last research question asked whether word error rate (WER) has any relation to speaker proficiency and intelligibility. The results are displayed in Table 6:

Table 6. WER and accuracy rate compared to the reported gASR WER

Speaker   WER (%)   Accuracy (100% - WER)   Speaker proficiency   Target WER reported by Google
AM        23.34%    76.66%                  C2                    5% for native speakers, up to 30% for non-native speakers
EK        26.78%    73.22%                  B2
ME        27.15%    72.85%                  B1
OL        17.13%    82.87%                  C1

Combined with the data collected during the annotation process, this relation between speaker proficiency and intelligibility may be indicative of several things. As suggested by the manually annotated phonetic data, although there is a clear discrepancy between the two less proficient and the two more proficient speakers, it does not strictly align with the proficiency levels disclosed by the participants: speaker OL disclosed her level as high C1, while speaker AM has the highest grade for the Cambridge proficiency exam. Since all 2 hours of recordings for each speaker were used in this experiment, the main problem that this research could not account for was the general-purpose nature of the gASR recognition system. While the ‘video’ model was used, since it provided the best speech-to-text results, gASR is still not suited to recognizing specific, phonetically rich speech without accuracy-boosting dictionaries. While it is possible to construct such a dictionary, it would be counterproductive for this research, since it would artificially boost accuracy even for low-intelligibility sentences. Overall, together with the data from the manual annotation, it can be assumed that there is some consistency between lower proficiency levels and automatic speech recognition accuracy in general. However, it is likely that at higher proficiency levels, individual differences account for a bigger part of speech intelligibility.

V. Discussion

This exploratory study hypothesized that collecting, annotating, and analyzing a corpus containing a sufficiently large number of sentences read by speakers of varying proficiency, combined with contemporary methods of data analysis, would provide greater insight into pronunciation error identification. It would also confirm some of the data presented by experts, disconfirm other parts, and reveal errors that were either not present in the experts’ data at all or were unreasonably overlooked. Finally, the results of this study can be used to identify some of the recurring problems for pronunciation instruction for L1 Russian speakers of English, possibly in an EFL teaching setting.

1. Distribution of errors in the corpus

The first research question asked how common the errors of different types (substitutions, deletions, distortions and insertions) were. Unsurprisingly, the majority of errors were substitution and distortion errors (72.68% and 19.35% respectively). Without proper linguistic training or a sufficiently advanced proficiency level, the ability to produce L2 sounds accurately is hindered to some extent. Even though some speakers might produce better results in a classroom environment, in a less controlled linguistic environment, such as a dialogue between an L1 and an L2 speaker of English, L2 learners are more ‘likely to produce the sound that is most similar to the target sound in perception and/or articulation’ (Rehman et al., 2020). While the substitution errors of less proficient speakers could be attributed to a lack of exposure to the language or simply a lack of practice, for more proficient speakers the situation is more complex: having mastered most aspects of the language and formed a set picture of it, along with its perceived phonemic contrasts, their substitution errors stem from the inability to correctly distinguish some of those contrasts and separate them from L1 contrasts.

The findings also show that, while not as common, addition and deletion errors still comprised more than 10% of the total error count. This category raised the most questions during data annotation and analysis. Despite not being explicitly pointed out by experts, addition and deletion errors are very ‘likely to affect the ability of the listener to interpret the speech signal accurately’ (Rehman et al., 2020), since they often change the syllabic structure of the word. However, there is still the question of which of the mistakes – deletion or addition – affects intelligibility more. For instance, Jenkins (2000) argues that deletions are more likely to hinder intelligibility, since they require syllabic reconstruction by the speech recipient. On the other hand, it is also possible that this is error-dependent, as suggested by Levis (2018).

While this study tried to partially address this concern using ASR and WER, it is still a question to research using more phonetically annotated data and cross-examining the how the amount of each type of errors correlated with speech intelligibility. As for this study’s data, while some patterns have emerged during the analysis, such as the word final /ŋ/ being quite likely (17.43%) to have an addition error, especially for B1 and B2-level speakers, it is quite likely that with the more extensively annotated dataset, more patterns will emerge and error distributions will change.
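The WER metric referred to here is the standard one: the minimum number of word-level substitutions, deletions and insertions needed to turn the ASR hypothesis into the reference transcript, divided by the reference length. A minimal sketch (not the study's actual scoring pipeline) using word-level Levenshtein distance:

```python
# Word Error Rate: (substitutions + deletions + insertions) / reference length,
# computed with word-level Levenshtein (edit) distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words:
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 0.333...
```

Correlating per-speaker WER of this kind with counts of each phonetic error type is the cross-examination suggested above.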

Another thought-provoking category was distortions. Syllable-initial [r] is very unlikely to be interpreted as any category other than English /ɹ/, but the large number of /ɹ/ mispronunciations (262 out of 765 tokens, 34.24%) suggests that this is quite likely to affect comprehensibility. /ɹ/ is quite a common phoneme in English, and if an L2 speaker’s pronunciation is affected by a significant number of these errors, the listener may have to put extra effort into speech processing, which may lead to any, if not all, of the problems described in the Word Error Rate and Speech Intelligibility section of this paper. In addition, as Rehman et al. (2020) rightly note, since /ɹ/ occurs postvocalically in English, when this more vocalic realization of /ɹ/ is pronounced as a trilled /r/, listeners may have difficulty processing the consonantal realization of the trill as the vocalic allophone of /ɹ/. Since the scope of the research was limited and the sample consisted of only four speakers, all of these observations should be interpreted as hypotheses. They would require further testing to determine whether these distortions are merely a mark of accentedness or whether intelligibility is actually affected; given the number of distortions discovered even in this limited sample, such testing may reveal interesting findings.

2. Priorities
