Samromur 21.05
Identifier: SLR112
Summary: Samrómur Icelandic Speech corpus approved for release in May 2021
Category: Speech
License: CC BY 4.0
Downloads (use a mirror closer to you):
samromur_21.05.tgz [7.0G] (whole corpus; includes dev and train sets)
Mirrors:
[US]
[EU]
[CN]
About this resource:
The corpus is the result of a crowd-sourcing effort run by the Language and Voice Lab at Reykjavik University, in cooperation with Almannarómur, Center for Language Technology. Recording started in October 2019 and continues to this day (May 2021). This release was authorized in May 2021. The aim is to create an open-source speech corpus to enable research and development for Icelandic Language Technology. The corpus contains audio recordings and a metadata file that contains the prompts the participants read. A Kaldi-based script using this data can be found on the Language and Voice Lab GitHub page: https://github.com/cadia-lvl/samromur-asr

Collection Procedure:
The data was collected using the website https://samromur.is, the code of which is available at https://github.com/cadia-lvl/samromur. The participants are aged between 18 and 90. Of the recordings, 59,782 are from female speakers and 40,218 are from male speakers; they were made with a smartphone or the web app. The original audio was collected at a 44.1 kHz or 48 kHz sampling rate as *.wav files, which were down-sampled to 16 kHz and re-encoded as *.flac. Each recording contains one read sentence from a script. The script contains 85,080 unique sentences and 90,838 unique tokens. The participants self-reported their age group, gender, and native language. There was no identifier other than the session ID, which is used as the speaker ID. The corpus is distributed with a metadata file containing detailed information on each utterance and speaker.

Data Format Specifics, Audio:
The corpus contains 100,000 utterances from 8,392 speakers, totalling 145 hours. The distributed audio files are encoded at a 16 kHz sampling rate, 16-bit linear PCM, 1 channel, in *.flac format. The corpus is split into train, dev, and test subsets with no speaker overlap. Each subset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac.
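The folder layout and file naming convention described above can be read back from a file path. The following is a minimal Python sketch, not part of the corpus tooling. It assumes paths of the form {subset}/{speaker_ID}/{speaker_ID}-{utterance_ID}.flac and that the speaker ID itself contains no hyphen; the function name, the Utterance type, and the example IDs are hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Utterance(NamedTuple):
    """Fields recovered from one audio file path (hypothetical helper type)."""
    subset: str
    speaker_id: str
    utterance_id: str


def parse_utterance_path(path: str) -> Utterance:
    """Split '{subset}/{speaker_ID}/{speaker_ID}-{utterance_ID}.flac' into parts.

    Assumption: the speaker ID contains no hyphen, so the first hyphen in the
    file name separates the speaker ID from the utterance ID.
    """
    p = PurePosixPath(path)
    if p.suffix != ".flac":
        raise ValueError(f"not a .flac file: {path}")
    if len(p.parts) < 3:
        raise ValueError(f"expected subset/speaker/file layout: {path}")

    subset, speaker_dir = p.parts[-3], p.parts[-2]
    speaker_id, sep, utterance_id = p.stem.partition("-")
    if not sep or not utterance_id:
        raise ValueError(f"file name does not match the convention: {path}")
    if speaker_id != speaker_dir:
        # The enclosing folder is expected to be named after the speaker ID.
        raise ValueError(f"folder and file speaker IDs differ: {path}")
    return Utterance(subset, speaker_id, utterance_id)


# Example with made-up IDs (not taken from the corpus):
example = parse_utterance_path("train/000123/000123-0456789.flac")
print(example.subset, example.speaker_id, example.utterance_id)
```

Checking that the folder name matches the speaker ID in the file name is a cheap sanity test after unpacking the archive; the metadata file remains the reference for per-utterance details.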
You can cite the data using the following BibTeX entry:
@inproceedings{mollberg-etal-2020-samromur,
    title = "{S}amr{\'o}mur: Crowd-sourcing Data Collection for {I}celandic Speech Recognition",
    author = "Mollberg, David Erik and
      J{\'o}nsson, {\'O}lafur Helgi and
      {\TH}orsteinsd{\'o}ttir, Sunneva and
      Steingr{\'\i}msson, Stein{\th}{\'o}r and
      Magn{\'u}sd{\'o}ttir, Eyd{\'\i}s Huld and
      Gudnason, Jon",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.425",
    pages = "3463--3467",
    abstract = "This contribution describes an ongoing project of speech data collection, using the web application Samr{\'o}mur which is built upon Common Voice, Mozilla Foundation{'}s web platform for open-source voice collection. The goal of the project is to build a large-scale speech corpus for Automatic Speech Recognition (ASR) for Icelandic. Upon completion, Samr{\'o}mur will be the largest open speech corpus for Icelandic collected from the public domain. We discuss the methods used for the crowd-sourcing effort and show the importance of marketing and good media coverage when launching a crowd-sourcing campaign. Preliminary results exceed our expectations, and in one month we collected data that we had estimated would take three months to obtain. Furthermore, our initial dataset of around 45 thousand utterances has good demographic coverage, is gender-balanced and with proper age distribution. We also report on the task of validating the recordings, which we have not promoted, but have had numerous hours invested by volunteers.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}