
2.3.3.2 Reliability and validity of LENA speaker identification and core

The LENA System uses pre-defined rules for segmenting the audio stream and American-English-based (AE) probabilistic models to identify and label sound segments with speaker labels (key child near/far, female adult near/far, male adult near/far, other child near/far) or labels for environmental sounds (overlapping near/far, noise near/far, electronics near/far, and silence). Segmentation accuracy and correct labelling are crucial for the LENA adult word count, child vocalization count, and conversational turn count measures, as these counts are grounded in segmentation, speaker identification, and phone recognition.

Inter-rater percent agreement (between LENA and “human”, with “human” considered the gold standard) for speaker identification in the normative sample has been presented in the LENA technical report (LTR-05-02) and is reported to be 82% for adult, 76% for child, 71% for TV, and 76% for other sounds (Xu, Yapanel & Gray, 2008). However, percent agreement has been widely criticized because it reflects only the observed agreement and fails to take chance agreement into account (Hayes & Hatch, 1999).

Therefore, for the purposes of examining observer agreement, it would be advisable to use, for example, Cohen’s kappa (κ) (Viera & Garrett, 2005). In addition, for any potential diagnostic tools, the diagnostic accuracy should also be tested, for example, using sensitivity, specificity, overall accuracy, and predictive and/or discriminative values (Eusebi, 2013; Okeh & Okoro, 2012, review). VanDam and Silbert (2013b) have further stated that an important goal of automatic labelling is to maintain relatively high precision by reducing false positives. Studies conducted on LENA reliability that have looked beyond agreement rates are summarized below.
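The difference between percent agreement and Cohen’s kappa can be illustrated with a minimal sketch. The segment labels below are hypothetical and are not taken from any LENA study; the function names are likewise illustrative. The example assumes two raters (a human coder and an automatic labeller) who each assign one label per segment.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of segments on which the two raters give the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal label proportions.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(rater_a) | set(rater_b)
    ) / (n * n)
    if p_expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for ten segments (assumption, for illustration only).
human = ["child"] * 9 + ["adult"]
machine = ["child"] * 8 + ["adult", "child"]

agreement = percent_agreement(human, machine)
kappa = cohens_kappa(human, machine)
print(agreement)        # 0.8
print(round(kappa, 3))  # -0.111
```

In this constructed case the raters agree on 80% of the segments, yet kappa is slightly negative, because both raters label nearly everything as "child" and the agreement expected by chance alone is already 82%.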

VanDam and Silbert (2013a) and Oller and colleagues (2010) calculated percent agreement between machine-coded segment labels and human judges, but reported it alongside kappa statistics. Compared with the LENA Foundation’s agreement rates, VanDam and Silbert (2013a) found percent agreement to be higher for children (85.9%, Cohen’s kappa κ=.708), but lower for male adults (60.9%, κ=.599) and female adults (59.4%, κ=.503). Oller and colleagues (2010) followed previous studies and chose to study agreement rates for child versus adult segments, which were found to be 73% with 5% of false positives (when “human” was used as the gold standard). Gilkerson et al. (2014), in turn, explored the ability of LENA to identify speakers in Chinese Mandarin dialect data through sensitivity (the proportion of true positives among true positives and false negatives) and precision (the proportion of true positives among true and false positives, also called “positive predictive power”). Gilkerson and colleagues found that LENA was similarly sensitive to child and adult segments as in the AE validation, but that precision in child segment identification was poor.
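Sensitivity and precision, as defined above, can be computed from paired labels as in the following sketch. The data and the function name are hypothetical and serve only to show how a labeller can be highly sensitive to a class while its precision for that class remains poor.

```python
def sensitivity_and_precision(reference, predicted, target):
    """Sensitivity and precision for one target label.

    reference: gold-standard (human) labels, one per segment
    predicted: automatic labels for the same segments
    """
    pairs = list(zip(reference, predicted))
    true_pos = sum(r == target and p == target for r, p in pairs)
    false_neg = sum(r == target and p != target for r, p in pairs)
    false_pos = sum(r != target and p == target for r, p in pairs)
    # Sensitivity: true positives out of all segments that truly are the target.
    sensitivity = (
        true_pos / (true_pos + false_neg) if true_pos + false_neg else float("nan")
    )
    # Precision: true positives out of all segments labelled as the target.
    precision = (
        true_pos / (true_pos + false_pos) if true_pos + false_pos else float("nan")
    )
    return sensitivity, precision


# Hypothetical labels (assumption): every child segment is found,
# but four adult segments are also labelled as child.
human_labels = ["child"] * 4 + ["adult"] * 6
machine_labels = ["child"] * 8 + ["adult"] * 2

child_sensitivity, child_precision = sensitivity_and_precision(
    human_labels, machine_labels, "child"
)
print(child_sensitivity, child_precision)  # 1.0 0.5
```

Here all true child segments are detected (sensitivity 1.0), but only half of the segments labelled as child are actually child segments (precision 0.5), which is the kind of pattern that false positives produce.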

The reliability of the LENA System counts has been considered as part of studies conducted with typically developing children (TD), late speakers (LT), and children with autism spectrum disorder (ASD), mainly with English-speaking populations and for adult word count (AWC) and child vocalization count (CVC) (Table 2). All reliability tests have been conducted by comparing LENA segments, counts, and estimates with those provided by human transcribers. For AWC, inter-rater correlations between LENA and human coders have been reported to range from r=.76 to r=.83 and, more importantly, encouraging results have been reported in Spanish (AWC r=.80) and Chinese SDM studies (AWC to SDM orthographic words r=.73, p<.001) (Gilkerson et al., 2014; Weisleder & Fernald, 2013). For CVC, inter-rater agreement has been reported to range from r=.65 to r=.76. However, to the author’s best knowledge, all LENA core measure reliability tests have so far been conducted with correlational analyses, which may not be the most reliable way of conducting such research (Bland & Altman, 1986; Haber & Barnhardt, 2006). Bland and Altman (1986) have stated that the use of the correlation coefficient is inappropriate in agreement studies because, for example, a high correlation coefficient does not actually mean that the two measurements agree, and, conversely, data that seem to be in poor agreement can produce high correlations. In addition, the author is not aware of any studies that have questioned LENA’s ability to distinguish multiple child speakers from each other.
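The point made by Bland and Altman (1986) can be sketched with a small numerical example. The counts below are invented for illustration and do not come from any LENA validation study; the 1.96 multiplier gives the conventional 95% limits of agreement under the assumption that the differences are approximately normally distributed.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired samples."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def limits_of_agreement(x, y):
    """Mean difference (bias) and 95% limits of agreement for y minus x."""
    differences = [b - a for a, b in zip(x, y)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical word counts for five samples (assumption): the automatic
# count is always 1.5 times the human count.
human_counts = [100, 200, 300, 400, 500]
machine_counts = [150, 300, 450, 600, 750]

r = pearson_r(human_counts, machine_counts)
bias, lower, upper = limits_of_agreement(human_counts, machine_counts)
print(round(r, 3))                       # 1.0
print(bias)                              # 150
print(round(lower, 2), round(upper, 2))
```

The two series correlate perfectly, yet the automatic count overestimates the human count by 150 words on average, and the overestimation grows with the size of the count. A correlation coefficient alone would hide this systematic disagreement.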

LENA AWC, CVC, and CTC (conversational turn count) measures have also been compared to various types of language, social behavior, and developmental measures.

AWC has been reported to correlate positively with the Preschool Language Scale (PLS-4) scores (r=.35, p<.05) and the Mullen Scales of Early Learning (MULLEN-VR; r=.41, p<.01) (Dykstra, Sabatos-DeVito, Irvin, Boyd, Hume & Odom, 2012), and negatively with scores on the Modified Checklist for Autism in Toddlers (M-CHAT) (r=-.66, p<.01) (Warren et al., 2010). CVC has been reported to correlate positively with PLS-4 (r=.33-.51, p<.01) (Greenwood, Thiemann-Bourque, Walker, Buzhardt & Gilkerson, 2011).

LENA CTC has been studied in relation to children’s performance in traditional measures and/or parent reports. LENA CTC has been reported to correlate positively with PLS-4 (Greenwood et al., 2011; see also Dykstra et al., 2012, for a close to significant correlation), MULLEN-VR (r=.33, p<.05) (Dykstra et al., 2012), the Communication and Symbolic Behavior Scales (CSBS) (r=.76, p<.01), the Child Development Inventory (CDI) (r=.78, p<.01), and the MacArthur-Bates Communicative Development Inventories (MB-CDI) (r=.80, p<.01) (Warren et al., 2010). Negative correlations have been found between CTC and several tools screening for atypical behaviors. CTC correlated statistically significantly with M-CHAT (r=-.52, p<.01), the Child Behavior Checklist (CBCL) (r=-.39, p<.01), and the Social Communication Questionnaire (SCQ) (r=-.57, p<.05) (Warren et al., 2010). However, the Autism Diagnostic Observation Schedule (ADOS) did not correlate with LENA measures in the study of Dykstra et al. (2012), but the authors suggest that the result may reflect the small sample size of the study. An older version of LENA (V 2.3) has also correlated positively with SALT transcription for AWC (r=.71-.85, p<.001) and CVC (r=.76), but not for CTC (Oetting et al., 2009).

3 Aims of the study

It is not known how the early language of twins develops, and what the role of biomedical and social environmental variables is in their language development. In addition, it is only because of recent technological advancements that it has become possible to study naturalistic social interaction without sampling restrictions as it occurs in families living their daily lives and to discover the very basic information needed to understand language acquisition through socialization. Therefore, this study is two-fold in nature, relating to a) questions about the reliability of automated technology and its performance in relation to traditional parental questionnaires, and b) questions about twins’ language development and the role of the social, pre-, and neonatal environment in language development. Firstly, this study aims to explore whether the algorithm of the automated method provides reliable information about the detection and identification of speakers and the accuracy of child utterance and adult word counts. Secondly, the automated method is applied to measure the quantity of speech and speech-like utterances spoken in twin families and to explore whether children’s neonatal health and demographic variables have any effect on the volubility of different family members. Thirdly, the study aims to discover how babbling and early linguistic skills develop in twins and to explore whether neonatal health and demographic variables affect this development. Lastly, the study aims to discover whether there are associations between variables of quantified family speech and parent-reported variables of twins’ language development.

In the first part of this study, the reliability of a novel method and its automated analysis (Language Environment Analysis™, later LENA) is assessed with a special focus on segmentation accuracy, speaker identification, and reliability of adult word and child vocalization counts. These are studied using the following questions:

1. How similarly do LENA and a native Finnish speaker identify speakers?

2. How reliably does LENA distinguish key child vocalizations from non-vocal elements (i.e. cries and vegetative sounds) in key child segments, compared with human-identified vocalizations?

3. How accurate are LENA-provided adult word counts (AWC) and child vocalization counts (CVC) compared with counts provided by native Finnish speakers?

Secondly, automatic LENA analyses are utilized to gain an understanding of what twin children hear during their typical day. The second part also inspects the amount of vocalization produced by twins and the possible relations of shared and non-shared environmental variables to the quantity of vocalization and input frequency in twin families. These themes are addressed using the following questions:

4. How much do families talk, according to the LENA spoken segment durations of the key child, other child, male adult, and female adult?

5. According to LENA, how many adult words do children hear, how many conversational turns do they participate in, and how many vocalizations do they produce?

6. Do differences in social and biomedical environments affect the amount of automatically detected speech and vocalizations?

The third part inspects twin children’s language development through the eyes of the parents: how parents discover vocal milestones from pre-lexical stages, and how children acquire vocabulary and language skills in early toddlerhood. These themes are addressed through the following questions:

7. When do parents report their children starting reduplicative and variegated babbling and when do they discover their children’s first protowords?

8. How do twin children’s vocabularies and language skills develop during the second year of their lives, when compared with normative information?

9. Do the emergence of vocal milestones, the acquisition of vocabulary, and language skills differ when social and biomedical environmental differences are compared?

In the final part, the associations between LENA-measured heard input and the parent-reported onset of vocal milestones, the quantity of children’s vocabulary, and language development are studied. These themes are addressed using the following questions:

10. Is the amount of LENA-detected speech or speech-like vocalizations associated with the LENA-detected volubility of family members?

11. Is there a relationship between the LENA-detected amount of child vocalizations and heard input and the information gathered on pre-lexical development, vocabulary development, and language development using parent questionnaires?

12. Is there a relationship between parent-reported vocal milestones, early vocabulary, and language skills in twins?