Open Speech and Language Resources



Contact
dpovey@gmail.com
Phone: 425 247 4129
(Daniel Povey)

Resources

Resource Name Category Summary
SLR1 Yesno Speech Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long.
SLR2 OpenFST Software A mirror of the OpenFst toolkit
SLR3 sph2pipe Software A mirror of the sph2pipe software
SLR4 sctk Software A mirror of the sctk scoring software
SLR5 MSU Switchboard transcripts Text A mirror of the Mississippi State transcripts and lexicon for Switchboard.
SLR6 Vystadial Speech English and Czech data, mirrored from the Vystadial project
SLR7 TED-LIUM Speech English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
SLR8 Sprakbanken Text Danish pronunciation dictionary generated using eSpeak
SLR9 The AMI pack Text Some auxiliary non-speech data used to build AMI systems with Kaldi
SLR10 SRE Data Misc Various files from SRE data that NIST used to host online
SLR11 LibriSpeech language models, vocabulary and G2P models Text Language modelling resources, for use with the LibriSpeech ASR corpus
SLR12 LibriSpeech ASR corpus Speech Large-scale (1000 hours) corpus of read English speech
SLR13 RWCP Sound Scene Database Speech + Software A database of recordings of real-world sounds and measured room impulse responses
SLR14 BEEP Dictionary Text Phonemic transcriptions of over 250,000 English words. (British English pronunciations)
SLR15 SRE Speaker List Misc A list linking speakers across NIST SRE corpora
SLR16 The AMI Corpus Speech Acoustic speech data and meta-data from The AMI corpus.
SLR17 MUSAN Audio A corpus of music, speech, and noise
SLR18 THCHS-30 Speech A Free Chinese Speech Corpus Released by CSLT@Tsinghua University
SLR19 TED-LIUMv2 Audio TED-LIUM corpus release 2, English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
SLR20 Aachen Impulse Response Database Audio Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here)
SLR21 Spanish Word list Text A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword).
SLR22 THUYG-20 Speech A free Uyghur speech database released by CSLT@Tsinghua University & Xinjiang University
SLR23 NIST LRE 2007 Key Misc A file containing metadata for the utterances in the LRE 2007 evaluation
SLR24 Iban Speech Iban language text and speech corpora for ASR
SLR25 ALFFA (African Languages in the Field: speech Fundamentals and Automation) Speech Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository
SLR26 Simulated Room Impulse Response Database Audio A database of simulated room impulse responses
SLR27 Cantab-TEDLIUM Release 1.1 (February 2015) Text Cantab Research Language models for the TEDLIUM database
SLR28 Room Impulse Response and Noise Database Audio A database of simulated and real room impulse responses, isotropic and point-source noises. All audio files in this data set use a 16 kHz sampling rate and 16-bit precision.
SLR29 Sprakbanken_Swe Text Swedish pronunciation dictionary
SLR30 Sinhala TTS Speech Sinhalese multi-speaker TTS corpora
SLR31 Mini LibriSpeech ASR corpus Speech Subset of the LibriSpeech corpus for the purpose of regression testing
SLR32 High quality TTS data for four South African languages (af, st, tn, xh) Speech Multi-speaker TTS data for four South African languages: Afrikaans, Sesotho, Setswana and isiXhosa.
SLR33 Aishell Speech Mandarin data, provided by Beijing Shell Shell Technology Co.,Ltd
SLR34 Santiago Spanish Lexicon Text A pronouncing dictionary for the Spanish language.
SLR35 Large Javanese ASR training data set Speech Javanese ASR training data set containing ~185K utterances.
SLR36 Large Sundanese ASR training data set Speech Sundanese ASR training data set containing ~220K utterances.
SLR37 High quality TTS data for Bengali languages Speech Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN).
SLR38 Free ST Chinese Mandarin Corpus Speech A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102,600 utterances from 855 speakers.
SLR39 Heroico Speech Spanish data, mirrored from the LDC
SLR40 Zeroth-Korean Speech Corpus for Automatic Speech Recognition Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth)
SLR41 High quality TTS data for Javanese. Speech Multi-speaker TTS data for Javanese (jv-ID)
SLR42 High quality TTS data for Khmer. Speech Multi-speaker TTS data for Khmer (km-KH)
SLR43 High quality TTS data for Nepali. Speech Multi-speaker TTS data for Nepali (ne-NP)
SLR44 High quality TTS data for Sundanese. Speech Multi-speaker TTS data for Sundanese (su-ID)
SLR45 Free ST American English Corpus Speech A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers, each with about 350 utterances.
SLR46 Tunisian_MSA Speech Tunisian Modern Standard Arabic
SLR47 Primewords Chinese Corpus Set 1 Speech Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data.
SLR48 MADCAT Arabic data splits Other Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus
SLR49 VoxCeleb Data Misc Various files for the VoxCeleb datasets
SLR50 MADCAT Chinese data splits Other Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus
SLR51 TED-LIUM Release 3 Speech TED-LIUM corpus release 3
SLR52 Large Sinhala ASR training data set Speech Sinhala ASR training data set containing ~185K utterances.
SLR53 Large Bengali ASR training data set Speech Bengali ASR training data set containing ~196K utterances.
SLR54 Large Nepali ASR training data set Speech Nepali ASR training data set containing ~157K utterances.
SLR55 CLMAD Text A Chinese Language Model Adaptation Dataset (CLMAD).
SLR56 IAM Aachen splits Other Aachen data splits (train/test/val) for the IAM dataset.