A Speech Corpus for Modeling Language Acquisition: CAREGIVER

T. Altosaar*, L. ten Bosch, Guillaume Aimetti, C. Koniaris, K. Demuynck, H. van den Heuvel

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

Abstract

A multilingual speech corpus for modeling language acquisition, called CAREGIVER, has been designed and recorded within the framework of the EU-funded Acquisition of Communication and Recognition Skills (ACORNS) project. The paper describes the motivation behind the corpus and its design, drawing on current knowledge of infant language acquisition. Instead of recording infants and children, the corpus captures the voices of their primary and secondary caregivers in both infant-directed and adult-directed speech modes, as read speech, across four languages. The paper also covers the challenges of obtaining prompts of similar complexity and semantics across languages, the methods applied to meet them, and the normalized recording procedures employed at different locations. The corpus contains nearly 66000 utterance-based audio files spoken over a two-year period by 17 male and 17 female native speakers of Dutch, English, Finnish, and Swedish. An orthographic transcription is available for every utterance. Time-aligned word and phone annotations also exist for many of the sub-corpora. The CAREGIVER corpus will be published via ELRA.

Original language: English
Title of host publication: LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION
Editors: N Calzolari, K Choukri, B Maegaard, J Mariani, J Odijk, S Piperidis, M Rosner, D Tapias
Number of pages: 7
Publication status: Published - 2010
MoE publication type: A4 Article in a conference publication
Event: International Conference on Language Resources and Evaluation - Valletta, Malta
Duration: 17 May 2010 to 23 May 2010
Conference number: 7

Conference

Conference: International Conference on Language Resources and Evaluation
Abbreviated title: LREC
Country: Malta
City: Valletta
Period: 17/05/2010 to 23/05/2010

Cite this

Altosaar, T., ten Bosch, L., Aimetti, G., Koniaris, C., Demuynck, K., & van den Heuvel, H. (2010). A Speech Corpus for Modeling Language Acquisition: CAREGIVER. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, & D. Tapias (Eds.), LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION