Open Speech and Language Resources

Multilingual and code-switching ASR Challenge Dataset - sub-task1

Identifier: SLR103

Summary: Datasets for sub-task1 of the Multilingual and Code-Switching ASR Challenges for Low-Resource Indian Languages, MUCS 2021 (https://navana-tech.github.io/MUCS2021/)

Category: Speech

License: https://msropendata-web-api.azurewebsites.net/licenses/f1f352a6-243f-4905-8e00-389edbca9e83/view

Downloads (use a mirror closer to you):
Hindi_train.tar.gz [4.4G] ( Hindi Train speech and transcripts ) Mirrors: [EU] [EU] [CN]
Hindi_test.tar.gz [258M] ( Hindi Test speech and transcripts ) Mirrors: [EU] [EU] [CN]
Marathi_train.tar.gz [4.3G] ( Marathi Train speech and transcripts ) Mirrors: [EU] [EU] [CN]
Marathi_test.tar.gz [235M] ( Marathi Test speech and transcripts ) Mirrors: [EU] [EU] [CN]
Odia_train.tar.gz [4.3G] ( Odia Train speech and transcripts ) Mirrors: [EU] [EU] [CN]
Odia_test.tar.gz [251M] ( Odia Test speech and transcripts ) Mirrors: [EU] [EU] [CN]
subtask1_blindtest_wReadme.tar.gz [1.1G] (sub-task1 Blind Test set and transcripts) Mirrors: [EU] [EU] [CN]
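
Once an archive has been downloaded from one of the mirrors, it can be unpacked with any tar tool. The following is a minimal Python sketch using only the standard library. The function name `extract_archive` and the destination directory are illustrative choices, not part of this resource, and the internal layout of the archives is not described on this page, so the sketch only lists whatever entries it finds.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list[str]:
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse entries that would be written outside dest_dir (e.g. "../x").
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest, members=members)
    return [m.name for m in members]
```

For example, `extract_archive("Hindi_test.tar.gz", "data/hindi_test")` would unpack the Hindi test archive into a local `data/hindi_test` directory (a hypothetical path) and return the names of the extracted entries.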

About this resource:

Summary of Hindi Data

The Hindi speech dataset is split into train and test sets with 95.05 hours and 5.55 hours of audio, respectively. The train and test sets contain 4506 and 386 unique sentences, respectively, taken from Hindi stories, with no sentence overlap between the two sets. The train set contains utterances from 59 speakers, and the test set contains utterances from a disjoint set of 19 speakers. The audio files are sampled at 8 kHz with 16-bit encoding. The total vocabulary size of the train and test sets is 6542.
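
As a rough sanity check on storage needs, the stated duration, sampling rate and bit depth give an estimate of the uncompressed audio size. The sketch below assumes single-channel (mono) PCM audio, which this page does not state, and ignores file headers and transcripts. The function name `pcm_size_bytes` is illustrative.

```python
def pcm_size_bytes(hours: float, sample_rate_hz: int = 8000,
                   bits_per_sample: int = 16, channels: int = 1) -> int:
    """Estimate raw PCM size in bytes for a given audio duration.

    channels=1 (mono) is an assumption; the page does not state it.
    """
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return round(seconds * bytes_per_second)


# Durations taken from the Hindi summary above.
hindi_train = pcm_size_bytes(95.05)
hindi_test = pcm_size_bytes(5.55)
print(f"train: {hindi_train / 1e9:.2f} GB, test: {hindi_test / 1e9:.2f} GB")
```

Under these assumptions one hour of audio is 57.6 MB, so the Hindi train set comes to roughly 5.5 GB uncompressed, somewhat above the 4.4G size listed for the compressed archive.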

Summary of Marathi Data

The Marathi speech data was collected from three user groups: college students, rural low-income workers, and urban low-income workers. The dataset is split into train and test sets with 93.89 hours and 5 hours of audio, respectively. There are 2543 and 200 unique sentences in the train and test sets, respectively. The same 31 speakers appear in both sets (100% speaker overlap), but the text transcriptions of the train and test sets are disjoint. The audio files are sampled at 8 kHz with 16-bit encoding. The total vocabulary size of the train and test sets is 3395.
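
The stated audio format can be verified on a downloaded file. The sketch below assumes the audio is stored as WAV files, which this page does not confirm, and uses Python's standard `wave` module. The helper names `describe_wav` and `matches_stated_format` are illustrative.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format information from a WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,  # sample width is in bytes
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_stated_format(info: dict) -> bool:
    """Check against the 8 kHz, 16-bit format given in the summary."""
    return info["sample_rate_hz"] == 8000 and info["bits_per_sample"] == 16
```

Running `matches_stated_format(describe_wav(path))` over the extracted files is a quick way to catch resampled or corrupted audio before training.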

Summary of Odia Data

The text data was collected from four districts (representative dialects in parentheses): Sambalpur (North Western Odia), Mayurbhanj (North Eastern Odia), Puri (Central and Standard Odia), and Koraput (Southern Odia). The focal themes of data collection were agriculture, healthcare, and finance. Data was collected in the field from farmers and agriculture officers for the agriculture domain; from nurses, doctors, and associate professionals (front desk staff, naturopathy practitioners) for the healthcare domain; and from bank employees for the finance domain. A total of 885 sentences were obtained for speech data collection and were split across the train and test sets, which contain 94.54 hours and 5.49 hours of audio, respectively. The test set has 65 unique sentences, which do not overlap with the 820 unique sentences in the train set. The audio files are sampled at 8 kHz with 16-bit encoding. The vocabulary size is 1644.
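
The per-language figures in the three summaries can be collected into one small structure for comparison. The sketch below copies the hours and sentence counts from the text above. The class name `SplitStats` and the derived quantities (test fraction, totals) are illustrative additions, not figures published with the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitStats:
    language: str
    train_hours: float
    test_hours: float
    train_sentences: int
    test_sentences: int

    @property
    def test_fraction(self) -> float:
        """Share of the total audio duration held out for testing."""
        return self.test_hours / (self.train_hours + self.test_hours)


# Values copied from the Hindi, Marathi and Odia summaries.
STATS = [
    SplitStats("Hindi", 95.05, 5.55, 4506, 386),
    SplitStats("Marathi", 93.89, 5.0, 2543, 200),
    SplitStats("Odia", 94.54, 5.49, 820, 65),
]

# The Odia summary states 885 sentences in total: 820 train + 65 test.
odia = STATS[2]
assert odia.train_sentences + odia.test_sentences == 885

total_train = sum(s.train_hours for s in STATS)
total_test = sum(s.test_hours for s in STATS)

for s in STATS:
    print(f"{s.language}: {s.test_fraction:.1%} of audio in the test set")
print(f"all languages: {total_train:.2f} h train, {total_test:.2f} h test")
```

Each language holds out a little over 5% of its audio for testing, and the three train sets together amount to roughly 283 hours.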

Summary of the blind test data

In addition to the train and test sets, the blind test set for sub-task1 is provided.