Open Speech and Language Resources



Golos

Identifier: SLR114

Summary: Russian ASR dataset (1240 hours) with trained acoustic and language models

Category: Speech

License: https://github.com/sberdevices/golos/blob/master/license/en_us.pdf

Downloads (use a mirror closer to you):
golos_opus.tar.gz [18G]   ( Opus audio files with Russian speech and transcripts )   Mirrors: [China]  
QuartzNet15x5_golos.nemo.gz [71M]   ( Acoustic model trained using Golos dataset )   Mirrors: [China]  
kenlms.tar.gz [4.7G]   ( KenLM language models created using Russian Common Crawl corpus )   Mirrors: [China]  
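
Once an archive from the list above has been downloaded, it can be unpacked with Python's standard library. The following is a minimal sketch, not an official Golos tool: the helper name `extract_archive` and the paths in the usage comment are hypothetical, and the internal layout of the archives is not described on this page.

```python
import tarfile
from pathlib import Path


def extract_archive(archive: Path, dest: Path) -> list[str]:
    """Unpack a .tar or .tar.gz archive into dest and return its member names.

    Hypothetical helper for illustration; the archive layout is an assumption.
    """
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r" lets tarfile detect compression (plain tar, gzip, ...) itself.
    with tarfile.open(archive) as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: reject absolute paths and path traversal.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names


# Example call (paths are placeholders, adjust to your download location):
# extract_archive(Path("golos_opus.tar.gz"), Path("data/golos"))
```

The opus archive is listed at 18G, so make sure the destination has room for both the archive and its unpacked contents.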

About this resource:

Golos dataset

Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours.
We have made the corpus freely available for download, along with an acoustic model trained on this corpus. We also built a 3-gram KenLM language model using the open Common Crawl corpus.
The main project page: Golos GitHub repository

Dataset structure

Domain    Train utterances   Train hours   Test utterances   Test hours
Crowd              979 796         1 095             9 994         11.2
Farfield           124 003         132.4             1 916          1.4
Total            1 103 799       1 227.4            11 910         12.6
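
The Total row and the overall figure of about 1240 hours can be recomputed from the per-domain figures. The sketch below only does that arithmetic; the dictionary keys and the helper name `column_total` are illustrative names, not part of the dataset.

```python
# Per-domain figures copied from the dataset structure table.
SPLITS = {
    "Crowd": {
        "train_utts": 979_796, "train_hours": 1095.0,
        "test_utts": 9_994, "test_hours": 11.2,
    },
    "Farfield": {
        "train_utts": 124_003, "train_hours": 132.4,
        "test_utts": 1_916, "test_hours": 1.4,
    },
}


def column_total(column: str) -> float:
    """Sum one column over all domains, rounded to one decimal place."""
    return round(sum(row[column] for row in SPLITS.values()), 1)


totals = {
    col: column_total(col)
    for col in ("train_utts", "train_hours", "test_utts", "test_hours")
}
# Train and test hours together give the overall duration of the corpus.
overall_hours = round(totals["train_hours"] + totals["test_hours"], 1)

print(totals)
print("overall hours:", overall_hours)
```

The computed sums agree with the Total row, and the train and test hours together come to 1240 hours, which matches the duration stated in the summary.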

External URLs

Audio files in opus format

golos_opus.tar [20.5 GB]

Audio files in wav format

The manifest files with all training transcriptions are in the train_crowd9.tar archive. The archives are listed below:
train_farfield.tar [15.4 GB]
train_crowd0.tar [11 GB]
train_crowd1.tar [14 GB]
train_crowd2.tar [13.2 GB]
train_crowd3.tar [11.6 GB]
train_crowd4.tar [15.8 GB]
train_crowd5.tar [13.1 GB]
train_crowd6.tar [15.7 GB]
train_crowd7.tar [12.7 GB]
train_crowd8.tar [12.2 GB]
train_crowd9.tar [8.08 GB]
test.tar [1.3 GB]
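
Because the wav data is split across twelve archives, it helps to add up the listed sizes before downloading. The sketch below sums the sizes given above and checks free disk space. The helper name `enough_space` is hypothetical, and treating 1 GB as 10^9 bytes is an assumption, since the page does not say which unit convention it uses.

```python
import shutil

# Archive sizes in GB, copied from the list of wav archives.
WAV_ARCHIVES_GB = {
    "train_farfield.tar": 15.4,
    "train_crowd0.tar": 11.0,
    "train_crowd1.tar": 14.0,
    "train_crowd2.tar": 13.2,
    "train_crowd3.tar": 11.6,
    "train_crowd4.tar": 15.8,
    "train_crowd5.tar": 13.1,
    "train_crowd6.tar": 15.7,
    "train_crowd7.tar": 12.7,
    "train_crowd8.tar": 12.2,
    "train_crowd9.tar": 8.08,
    "test.tar": 1.3,
}

total_gb = round(sum(WAV_ARCHIVES_GB.values()), 2)


def enough_space(path: str, needed_gb: float) -> bool:
    """Return True if the filesystem holding path has at least needed_gb free.

    Assumes 1 GB = 10**9 bytes (the page does not state the convention).
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 10**9


print(f"wav archives: {total_gb} GB in total")
print("room for the archives here:", enough_space(".", total_gb))
```

The listed archives add up to 144.08 GB. Unpacking needs additional space on top of that, so treat the check as a lower bound.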

Acoustic and language models

QuartzNet15x5_golos.nemo [68 MB]
KenLMs.tar [4.8 GB]

Authors (in alphabetical order):

  • Alexander Denisenko
  • Angelina Kovalenko
  • Fedor Minkin
  • Nikolay Karpov

You can cite the data using the following BibTeX entry:

  @article{karpov2021golos,
    title={Golos: Russian Dataset for Speech Research},
    author={Karpov, Nikolay and Denisenko, Alexander and Minkin, Fedor},
    journal={arXiv preprint arXiv:2106.10161},
    year={2021}
  }

To contact us, please create an issue in the Golos GitHub repository.

External URLs:
https://sc.link/JpD   (Opus audio files and transcripts with Russian speech )
https://sc.link/ZMv   (Acoustic model trained using Golos dataset )
https://sc.link/YL0   (KenLM language models created using Russian Common Crawl corpus )