openslr.org

Open Speech and Language Resources

1111 Hours Hindi ASR Challenge

Identifier: SLR118

Summary: Datasets for the Closed, Self Supervised Closed and Open tracks of the 1111 Hours Hindi ASR Challenge 2022 (https://sites.google.com/view/gramvaaniasrchallenge/home)

Category: Speech

License: The data released as a part of this challenge can be used freely for academic purposes, but permission for any commercial use of the data should be sought by writing to contact@gramvaani.org.

Downloads (use a mirror closer to you):
GV_Train_100h.tar.gz [2.0G] (Gramvaani Hindi Train set: 100 hours of speech, transcripts and metadata) Mirrors: [EU] [EU] [CN]
GV_Dev_5h.tar.gz [98M] (Gramvaani Hindi Development set: 5 hours of speech, transcripts and metadata) Mirrors: [EU] [EU] [CN]
GV_Eval_3h.tar.gz [62M] (Gramvaani Hindi Eval set: 3 hours of speech, transcripts and metadata) Mirrors: [EU] [EU] [CN]
Gramvaani_1000hrData_Part1.tar.gz [6.0G] (Gramvaani Hindi 1000 hours speech-only data, Part 1) Mirrors: [EU] [EU] [CN]
Gramvaani_1000hrData_Part2.tar.gz [7.3G] (Gramvaani Hindi 1000 hours speech-only data, Part 2) Mirrors: [EU] [EU] [CN]
Gramvaani_1000hrData_Part3.tar.gz [8.2G] (Gramvaani Hindi 1000 hours speech-only data, Part 3) Mirrors: [EU] [EU] [CN]
Gramvaani_1000hrData_Part4.tar.gz [5.8G] (Gramvaani Hindi 1000 hours speech-only data, Part 4) Mirrors: [EU] [EU] [CN]
Gramvaani_1000hrData_Part5.tar.gz [3.2G] (Gramvaani Hindi 1000 hours speech-only data, Part 5) Mirrors: [EU] [EU] [CN]
Metadata.tar.gz [460K] (Metadata for all the Gramvaani data released as part of the 1111 Hours Hindi ASR Challenge) Mirrors: [EU] [EU] [CN]
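
Once an archive has been downloaded, it can be unpacked with any tar tool. The following is a minimal Python sketch using only the standard library. The function name extract_archive and the destination directory are illustrative assumptions, and the directory layout inside the archives is not documented on this page, so the function simply returns whatever files it finds.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and list the extracted files.

    Hypothetical helper for illustration; it is not part of the challenge scripts.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        # Refuse any member whose path would land outside dest_dir.
        for member in members:
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest, members=members)
    # Return the relative paths of all regular files now under dest_dir.
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file())


# Example call (assumes the file has already been downloaded to the working directory):
# files = extract_archive("GV_Dev_5h.tar.gz", "data/GV_Dev_5h")
```

The path check is a precaution that applies to any downloaded archive and is not specific to this dataset.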

About this resource:

The 1111 Hours Hindi ASR Challenge 2022 is a challenge on Automatic Speech Recognition for Hindi, built on spontaneous telephone speech recordings in regional variations of Hindi shared by Gram Vaani, a social technology enterprise. The regional variations of Hindi, together with the spontaneity of the speech, natural background noise, and crowdsourced transcriptions of varying accuracy, make it a unique corpus for automatic recognition of spontaneous telephone speech.

The data set comprises telephone-quality speech data in Hindi. The challenge releases approximately 1000 hours of unlabeled data and 105 hours of labeled speech data. The details of the data sets released for this challenge are as follows:
1) Train set - 100 hours (labeled)
2) Development set - 5 hours (labeled)
3) 1000 hours of unlabeled data

The Gramvaani data consists of .mp3 files with a mix of sampling rates from 8 kHz to 48 kHz. The following table shows the sampling rate distribution in the Train and Development sets and in the unlabeled 1000-hour dataset.

Sampling rate	Percentage of the train and dev datasets	Percentage of the unlabeled 1000-hour dataset
8 kHz	60.87%	67.63%
16 kHz	0.84%	0.66%
22 kHz	0.00%	3.01%
24 kHz	0.00%	0.08%
32 kHz	0.25%	0.26%
44 kHz	34.46%	25.45%
48 kHz	3.56%	2.87%
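
Because the files mix sampling rates, a processing pipeline will usually need to pick one target rate and resample everything to it. The sketch below only does the arithmetic on the percentages in the table above, showing how much of each dataset was recorded at or above a candidate target rate. The names DISTRIBUTION and share_at_or_above are illustrative assumptions and are not taken from the baseline scripts.

```python
# Percentages copied from the table above.
# Key: sampling rate label in kHz. Value: (train and dev %, unlabeled 1000-hour %).
DISTRIBUTION = {
    8: (60.87, 67.63),
    16: (0.84, 0.66),
    22: (0.00, 3.01),
    24: (0.00, 0.08),
    32: (0.25, 0.26),
    44: (34.46, 25.45),
    48: (3.56, 2.87),
}


def share_at_or_above(target_khz, column):
    """Percentage of data whose sampling rate is >= target_khz.

    column 0 selects the train and dev datasets, column 1 the unlabeled dataset.
    """
    total = sum(pct[column] for rate, pct in DISTRIBUTION.items() if rate >= target_khz)
    return round(total, 2)


# Share of each dataset recorded at 16 kHz or higher.
labeled_share = share_at_or_above(16, 0)
unlabeled_share = share_at_or_above(16, 1)
print(labeled_share, unlabeled_share)
```

With a 16 kHz target, roughly 39% of the train and dev data and roughly 32% of the unlabeled data were recorded at that rate or higher; the remainder is 8 kHz audio that would have to be upsampled. The resampling itself would need an audio library or command-line tool, which this sketch does not cover.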

Baseline results and scripts can be found here (https://github.com/anish9208/gramvaani_hindi_asr#gramvaani_hindi_asr).