Multilingual LibriSpeech (MLS)
Identifier: SLR94
Summary: A large multilingual corpus derived from LibriVox audiobooks
Category: Speech
License: CC BY 4.0
About this resource:
NOTE: The data is not hosted on OpenSLR (due to its size); please use the links provided below to download it.
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese and Polish.
ASR Resources
Consists of train, dev and test sets for each language. Also includes small training sets for limited supervision (10 hours, 1 hour or 10 minutes of labelled speech).
Language | Download Link (Original: flac) | Download Link (Compressed: opus) |
---|---|---|
English * | mls_english.tar.gz (2.4T) | mls_english_opus.tar.gz (651G) |
German | mls_german.tar.gz (115G) | mls_german_opus.tar.gz (29G) |
Dutch | mls_dutch.tar.gz (86G) | mls_dutch_opus.tar.gz (23G) |
French | mls_french.tar.gz (61G) | mls_french_opus.tar.gz (16G) |
Spanish | mls_spanish.tar.gz (50G) | mls_spanish_opus.tar.gz (14G) |
Italian | mls_italian.tar.gz (15G) | mls_italian_opus.tar.gz (3.8G) |
Portuguese | mls_portuguese.tar.gz (9.3G) | mls_portuguese_opus.tar.gz (2.5G) |
Polish | mls_polish.tar.gz (6.2G) | mls_polish_opus.tar.gz (1.6G) |
LM Resources
Consists of a language modelling corpus and pre-trained 3-gram and 5-gram LMs.
Language | Download Link |
---|---|
English | mls_lm_english.tar.gz (44G) |
German | mls_lm_german.tar.gz (2.7G) |
Dutch | mls_lm_dutch.tar.gz (1.4G) |
French | mls_lm_french.tar.gz (4.8G) |
Spanish | mls_lm_spanish.tar.gz (1.2G) |
Italian | mls_lm_italian.tar.gz (1.7G) |
Portuguese | mls_lm_portuguese.tar.gz (558M) |
Polish | mls_lm_polish.tar.gz (30M) |
Other Resources
About | Download Link |
---|---|
Downloaded text from LibriVox books | lv_text.tar.gz (2.0G) |
Unrated dev/test transcripts (before human rating) | unrated_transcripts.tar.gz (2.8M) |
MD5 Checksums
All the above files are hosted in an AWS S3 bucket and can also be downloaded using the AWS CLI tools. To do so, create an AWS account and configure the CLI tools with your credentials; all the resources can then be downloaded for free. To get the S3 bucket path of a URL, replace https://dl.fbaipublicfiles.com/ with s3://dl.fbaipublicfiles.com/. For example, the S3 bucket path of the URL https://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz is s3://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz.
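The prefix replacement described above can be sketched in a few lines of Python. This is only an illustration: the function name `to_s3_path` and the constants are hypothetical helpers, not part of any MLS tooling.

```python
# Minimal sketch: map a public MLS download URL to its S3 bucket path
# by swapping the HTTPS prefix for the S3 prefix, as described above.
# The names below are illustrative assumptions, not an official API.
HTTPS_PREFIX = "https://dl.fbaipublicfiles.com/"
S3_PREFIX = "s3://dl.fbaipublicfiles.com/"


def to_s3_path(url: str) -> str:
    """Return the S3 bucket path for a dl.fbaipublicfiles.com URL."""
    if not url.startswith(HTTPS_PREFIX):
        raise ValueError(f"URL does not start with {HTTPS_PREFIX}: {url}")
    # Keep everything after the prefix (e.g. "mls/mls_polish.tar.gz").
    return S3_PREFIX + url[len(HTTPS_PREFIX):]


if __name__ == "__main__":
    print(to_s3_path("https://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz"))
```

The resulting path can then be passed to the AWS CLI, for instance as the source argument of `aws s3 cp`, assuming the CLI has been configured with credentials as described above.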
More details on these files and their directory structure can be found in README files included in the .tar.gz files.
You can cite the data using the following BibTeX entry:
@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}
NOTE: We have made a few updates to the MLS dataset after our INTERSPEECH paper was submitted, to include more hours of audio and to improve the quality of the transcripts. To avoid the confusion of having multiple versions, we are making ONLY one release with all the improvements included. For accurate dataset statistics and baselines, please refer to the arXiv paper: https://arxiv.org/abs/2012.03411.
* This resource can also be downloaded in parts from the 100GB splits - mls_english_parts_list.txt and mls_english_opus_parts_list.txt for the original and compressed versions respectively. After downloading all the splits, run `cat` on the files to create a single tar file and verify the md5 checksum.
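For readers who prefer not to use `cat`, the join-and-verify step can be approximated in Python. This is a sketch under assumptions: the function names are hypothetical, the parts must be supplied in their correct order (check the parts list files for the actual names), and the expected checksum has to be taken from the MD5 Checksums file.

```python
import hashlib
import shutil


def concatenate_parts(part_paths, output_path):
    """Append each part, in the given order, to a single output file.

    Equivalent in effect to `cat part1 part2 ... > output`; the caller is
    responsible for passing the parts in the correct order.
    """
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream the copy so large parts are never fully in memory.
                shutil.copyfileobj(src, out)


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After calling `concatenate_parts` with the ordered list of downloaded splits, compare the value returned by `md5_of_file` against the published checksum before extracting the archive.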