Open Speech and Language Resources



Multilingual and code-switching ASR Challenge Dataset - sub-task2

Identifier: SLR104

Summary: Datasets for sub-task2 in Multilingual and code-switching ASR challenges for low resource Indian languages - Interspeech 2021 Special Session

Category: Speech

License: CC BY-SA 4.0

Downloads (use a mirror closer to you):
Hindi-English_train.zip [7.3G]   ( Hindi-English code-switched train speech and transcripts )   Mirrors: [China]  
Hindi-English_test.zip [443M]   ( Hindi-English code-switched test speech and transcripts )   Mirrors: [China]  
Bengali-English_train.zip [3.9G]   ( Bengali-English code-switched train speech and transcripts )   Mirrors: [China]  
Bengali-English_test.zip [606M]   ( Bengali-English code-switched test speech and transcripts )   Mirrors: [China]  

About this resource:

These datasets are part of the train and test data for sub-task2 in the Multilingual and code-switching ASR challenges for low resource Indian languages (Interspeech 2021 special session). For the duration of the challenge, the data is password-protected, and you must be a registered participant in the challenge to receive the passwords. Register here. If you have any questions, please write to is21ss.indicasrchallenge@gmail.com

Summary of Hindi-English and Bengali-English Data

The Hindi-English and Bengali-English datasets are extracted from spoken tutorials. These tutorials cover a range of technical topics, and the code-switching arises predominantly from the technical content of the lectures. The segments file in the baseline recipe provides sentence time-stamps. These time-stamps were used to derive segments from each audio file, which are aligned with the transcripts given in the text file. The Hindi-English train and test datasets contain 89.86 hours and 5.18 hours of speech, respectively, while the Bengali-English train and test datasets contain 46.11 hours and 7.02 hours of speech, respectively. All the audio files in both datasets are sampled at 16 kHz with 16-bit encoding. The vocabulary sizes for Hindi-English and Bengali-English are 17877 and 13656, respectively.
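
The use of sentence time-stamps to cut segments out of a longer recording can be sketched as follows. This is only an illustration, not the baseline recipe itself: it assumes the segments file follows the common Kaldi layout of `<utterance-id> <recording-id> <start-seconds> <end-seconds>` per line, which this page does not confirm, and the function names `parse_segments`, `cut_segment`, and `make_silent_wav` are hypothetical helpers introduced here. It relies only on the Python standard library.

```python
import io
import wave


def parse_segments(lines):
    """Parse segment lines, ASSUMING the Kaldi-style layout
    '<utt-id> <rec-id> <start-sec> <end-sec>' (not confirmed by this page)."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments.append((utt_id, rec_id, float(start), float(end)))
    return segments


def cut_segment(wav_file, start_sec, end_sec):
    """Return the raw PCM bytes between start_sec and end_sec.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        first = int(round(start_sec * rate))
        last = int(round(end_sec * rate))
        # Clamp the start so that setpos never goes past the end of the file.
        wav.setpos(min(first, wav.getnframes()))
        return wav.readframes(max(0, last - first))


def make_silent_wav(seconds, rate=16000):
    """Build an in-memory mono, 16-bit WAV of silence (stand-in for real audio)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)    # mono (an assumption for this demo)
        wav.setsampwidth(2)    # 2 bytes per sample = 16-bit encoding
        wav.setframerate(rate)  # 16 kHz, as stated for these datasets
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf
```

For example, parsing the made-up line `utt1 rec1 0.25 0.75` and passing its start and end times to `cut_segment` on a 16 kHz, 16-bit mono recording yields half a second of audio, i.e. 8000 samples. With the real data, the path of each recording would replace the in-memory stand-in.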