
1. Corpus justification

Several studies (T. Toda et al., 2007; L. Sun et al., 2015; G. Zhao and R. Gutierrez-Osuna, 2017; Y.-C. Wu et al., 2016) have used the CMU (Carnegie Mellon University) ARCTIC speech corpus (J. Kominek and A. W. Black, 2004) for a multitude of tasks such as voice conversion or phonetic studies. However, there are not many corpora available online that include L1 Russian speakers of English. One of the few corpora freely available online is the OSCAAR corpus, which includes five speakers of Russian; however, each of them recorded only several short abstracts (about 10 per speaker). CMU ARCTIC itself does not include any Russian-accented speakers: most of its speakers are either native speakers of different English dialects or even professional, native-speaking voice actors (VCC dataset). Therefore, apart from CMU ARCTIC, the available corpora are not suitable for either accent conversion or voice conversion tasks, since these tasks require a lot of data.

Among other non-native English corpora, the Speech Accent Archive 5 and IDEA 6 have many different native languages represented, L1 Russian being one of them. The problem with these corpora is that each speaker recorded only a short paragraph or a short free-speech task (Speech Accent Archive and IDEA respectively), which is likewise not very suitable for voice conversion or accent conversion studies. The corpus recorded for this study may be used in such tasks, which is further supported by the L2-ARCTIC corpus having been used with relative success by the studies mentioned above.

5 https://accent.gmu.edu/, accessed 10 June 2021

6 https://www.dialectsarchive.com/, accessed 10 June 2021

Another common problem with existing speech corpora is that they are not easily accessible. As Zhao et al. (2018) mention, the Wildcat 7, LDC2007S08 8, and NUFAESD (Bent & Bradlow, 2003) datasets provide ‘a limited number of recordings for each non-native speaker, and have restricted access – LDC2007S08 requires a fee’, while ‘Wildcat and NUFAESD are only available to designated research groups’. In fact, most modern mispronunciation detection systems use private datasets, which makes them unfeasible to use for scientific purposes (Zhao et al., 2018).

Corpora available in the Russian-speaking part of the internet mainly focus on text collection. There are several corpora that include texts produced by L1 Russian learners of English – texts translated from Russian to English and from English to Russian by future translators, blog posts in English by Russian speakers, and essays produced by university students. No freely available speech corpora of Russian learners of English could be found.

In order to overcome the problems mentioned above, a new corpus was constructed. Like L2-ARCTIC, it can be used for accent conversion, voice conversion between speakers, and mispronunciation detection (Zhao et al., 2018). This L2 English Russian Speakers' Corpus (L2-ERSC for short) contains English speech of four speakers, two male and two female, all of them using Russian as their L1. Speakers were recruited from the Novosibirsk State University student body; their ages range from 18 to 29 years, with an average of 23.5 years (std: 4.65). Speakers' demographic information is presented in Table 4. Proficiency was measured using the score of a standardized test the speakers had taken (IELTS, Cambridge test or TOEFL iBT), and the results are recorded based on what the speakers disclosed. The level of each speaker is listed in a separate column as a value between A1 and C2.

7https://groups.linguistics.northwestern.edu/speech_comm_group/wildcat/, accessed 10 June 2021

8https://catalog.ldc.upenn.edu/LDC2007S08, accessed 10 June 2021

Table 4. Demographic information for Russian participants from the L2-ERSC corpus

Speaker   L1        Gender   Test score          English level
OL        Russian   F        TOEFL iBT - 98      C1
AM        Russian   M        CPE Grade A - 224   C2
ME        Russian   F        TOEFL iBT - 64      B1
EK        Russian   M        IELTS 5.5           B2

The corpus contains the 1,132 prompts from the original CMU ARCTIC corpus. These prompts were used for multiple reasons. First, according to the authors of the original L2-ARCTIC corpus, Zhao et al. (2018), the prompts ‘are phonetically balanced (100%, 79.6%, and 13.7% coverage for phonemes, diphones, and triphones, respectively), are open source’ (collected from the Gutenberg library) and ‘produce approximately 1-1.5 hours of speech’. Second, they were ‘proven to work well with speech synthesis and voice conversion’. Finally, the prompts are quite ‘challenging for non-native speakers and may elicit them to make more pronunciation mistakes’, which opens plenty of opportunities for mispronunciation studies.

2. Recording process

The speech was recorded in a quiet classroom at Novosibirsk State University (NSU). A Zoom H2 microphone was used for recording, paired with a generic no-name pop filter. To present the prompts sentence by sentence and to simplify the analysis, BAS SpeechRecorder (Draxler and Jänsch, 2004) was used to show the sentences and automate the process, so that there was less interference from the experiment supervisor. The microphone was placed about 15 cm from the speaker, which helped avoid most of the air puffing. During each recording session, the L2 speaker was guided through the recording process and then left alone to record for 1-2 hours.

Occasionally during the recording process, and after each session, random checks were conducted to ensure that the recordings produced were of sufficient quality. Each speaker took about 3-4 sessions to record all 1,132 prompts, with no session lasting longer than 2 hours, in order to reduce pronunciation fatigue.

Once the recording was done, a script was used to trim off mouse clicks at the start and end of the recordings where possible. Another script removed occasional audio clipping, presumably caused by the speaker accidentally leaning in. The speech was originally sampled at 44.1 kHz, dual channel, and saved as WAV files, with each sentence stored in a separate file. Finally, the recordings were downmixed and resampled to 16 kHz, single-channel, signed 16-bit PCM WAV files.
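As an illustration of this final conversion step, a minimal sketch is given below. It is not the actual script used for the corpus; the folder names and the use of the librosa and soundfile libraries are assumptions.

    import glob
    import os

    import librosa    # loads audio, downmixing and resampling on the fly
    import soundfile  # writes the PCM-16 WAV output

    SRC_DIR = "recordings_44k"   # hypothetical folder with the original 44.1 kHz stereo WAVs
    DST_DIR = "recordings_16k"   # hypothetical output folder for 16 kHz mono PCM-16 WAVs
    os.makedirs(DST_DIR, exist_ok=True)

    for path in glob.glob(os.path.join(SRC_DIR, "*.wav")):
        # mono=True averages the channels, sr=16000 resamples to the target rate
        audio, sr = librosa.load(path, sr=16000, mono=True)
        out_path = os.path.join(DST_DIR, os.path.basename(path))
        soundfile.write(out_path, audio, sr, subtype="PCM_16")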

One session of recordings was affected by an equipment malfunction. As a result, some records from 1 to 400 for speaker AM contain segments greatly affected by plosives and were not legible. Since it was not possible to re-record this session due to the speaker leaving the city, these records were salvaged with the use of ERA Plosive plugins for Audacity. Even though the words affected became legible and thus were annotated, the words restored are of poor quality and these recordings might not be suitable for performing acoustic measurements, together with sentences 65 (construction noise in the background) and 366 (car alarm in the background) for speaker OL.

3. Annotation process

Like the original L2-ARCTIC corpus, L2-ERSC provides orthographic transcriptions at the word level. The Montreal Forced Aligner (McAuliffe et al., 2017) was used to create phonetic transcriptions in TextGrid format, containing word and phone tiers with their respective boundaries (fig. 1). After the automatic annotation process, a common set of 150 sentences was annotated manually for each speaker. The sentences were selected randomly using a Python script (see the sketch below), and each recording selected for further annotation contains an additional tier with pronunciation mistakes. To annotate pronunciation mistakes, the online version of the Longman Dictionary of Contemporary English 9 was used. In addition, to account for both the inclination of Russian EFL textbooks towards British English (BrE) (Sokolova, 2008; Kichigina, 2002) and the heavy exposure to American English (AmE) due to the prominence of the Internet, switching between BrE and AmE over the course of a recording was not recorded as a mistake. If the pronunciation corresponded to neither variant, or there was only one correct variant and the utterance differed from it, the American English transcription was used to record the mistake. This kept the annotation roughly in line with the Montreal Forced Aligner's automatic annotation, which is also based on American English pronunciations.
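The selection step could look roughly as follows. This is an illustrative sketch rather than the original script; the prompt numbering scheme and the fixed seed are assumptions.

    import random

    # Hypothetical prompt IDs: the 1,132 CMU ARCTIC sentences, numbered 1..1132.
    all_ids = list(range(1, 1133))

    # A fixed seed makes the common set reproducible, so the same 150
    # sentences can be annotated manually for every speaker.
    random.seed(42)
    common_set = sorted(random.sample(all_ids, 150))

    print(common_set[:10])  # first ten selected sentence numbers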

Similar reasoning applied to the weak and strong variants of words within a sentence, and to suprasegmental features such as incorrect stress, aspiration, or intonation. Even though there are definitive rules for the usage of weak and strong forms, and incorrect stress or intonation may significantly hinder the intelligibility of a spoken utterance, these features were not the focus of this research and thus were not recorded.

In addition, the boundaries were manually adjusted and incorrect labels were manually fixed. To ease future computer processing, the ARPAbet phoneme set was used for the phone tier, while the tier listing pronunciation mistakes uses IPA symbols for the error tags.
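The relationship between the two notations can be illustrated with a small mapping; the dictionary below shows only a handful of symbols and is illustrative, not the full set used in the corpus.

    # A few ARPAbet phone labels (as used on the phone tier) and their IPA
    # equivalents (as used in the error tags); illustrative subset only.
    ARPABET_TO_IPA = {
        "AA": "ɑ",
        "IY": "i",
        "UW": "u",
        "TH": "θ",
        "DH": "ð",
        "NG": "ŋ",
    }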

9 https://www.ldoceonline.com/, accessed 15 October 2021

Figure 1. An example of an annotated recording visualised in Praat, with pronunciation errors commented.

4. Results collection process

The data presented in the Results section was collected using a set of Python and Praat scripts written for this research. Since the original prompts were provided in a single text document, they were first split into a separate .txt file for each sentence. After that, as mentioned before, the common set of 150 random sentences was selected for each speaker together with the corresponding prompts, yielding 600 recordings in total. For these 600 sentences, the Montreal Forced Aligner's automatic annotation was used to record the phones on the ‘words – phones’ tier, with the boundaries corresponding to the actual positions of those phonemes in the recording. For the error labels, the author of this study listened to the recordings and added a corresponding boundary in the ‘comments’ section. After the manual annotation, another script went through all the TextGrid files containing the annotations, recorded the results for each speaker, and produced pie charts, bar plots, and a .csv file (fig. 2) containing a table of all the data extracted from the TextGrid files.

Figure 2. A sample of the .csv table
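A minimal sketch of this extraction step is given below. It is not the actual script used for the corpus; the third-party textgrid package, the file names, and the output columns are assumptions.

    import csv

    import textgrid  # third-party TextGrid parser (assumed here)

    def extract_rows(tg_path, speaker):
        """Collect one row per non-empty annotated interval in a TextGrid file."""
        tg = textgrid.TextGrid.fromFile(tg_path)
        rows = []
        for tier in tg.tiers:
            for interval in tier:
                if interval.mark.strip():  # skip empty intervals
                    rows.append({
                        "speaker": speaker,
                        "tier": tier.name,
                        "label": interval.mark,
                        "start": interval.minTime,
                        "end": interval.maxTime,
                    })
        return rows

    # Hypothetical usage: write all rows for one annotated recording to a .csv table.
    rows = extract_rows("OL/arctic_a0065.TextGrid", "OL")
    with open("annotations.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["speaker", "tier", "label", "start", "end"])
        writer.writeheader()
        writer.writerows(rows)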

This table was later used to produce Table 6 in the Results section, along with other numbers used throughout the study, such as the percentage of phonemes affected by errors relative to the total number of phonemes. All of the scripts are available on the author's GitHub 10. The corpus requires additional work before it can be made accessible to everyone, and will therefore be released publicly, together with the scripts, several months after publication.

As for the WER calculation, another script was written that used the Google Speech-to-Text ASR (gASR) cloud service and then calculated WER using the Levenshtein distance (Navarro, 2001). In this formulation, ‘like the Levenshtein distance, WER defines the distance by the number of minimum operations that has to been done for getting from the reference to the hypothesis. Unlike the Levenshtein distance, however, the operations are on words and not on individual characters.’ The results for the WER were recorded in a separate .txt file.
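The word-level edit-distance computation behind the WER figures can be sketched as follows. This is an illustrative implementation of the standard definition, not the exact script used; the gASR call itself is omitted, and the example transcripts are hypothetical.

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level Levenshtein distance divided by the
        number of words in the reference transcript."""
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()

        # Dynamic-programming table of edit distances between prefixes.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                       # deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                       # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1    # substitution cost
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    # Hypothetical usage with a prompt and an ASR transcript of it:
    print(wer("author of the danger trail philip steels etc",
              "author of the danger trail philip steel is etc"))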

10 https://github.com/Rewaster/L2-ERSC