LibriSpeech ASR corpus
Identifier: SLR12
Summary: Large-scale (1000 hours) corpus of read English speech
Category: Speech
License: CC BY 4.0
Downloads (use a mirror closer to you):
dev-clean.tar.gz [337M] (development set, "clean" speech)
dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
test-clean.tar.gz [346M] (test set, "clean" speech)
test-other.tar.gz [328M] (test set, "other" speech)
train-clean-100.tar.gz [6.3G] (training set of 100 hours "clean" speech)
train-clean-360.tar.gz [23G] (training set of 360 hours "clean" speech)
train-other-500.tar.gz [30G] (training set of 500 hours "other" speech)
intro-disclaimers.tar.gz [695M] (extracted LibriVox announcements for some of the speakers)
original-mp3.tar.gz [87G] (LibriVox mp3 files, from which corpus' audio was extracted)
original-books.tar.gz [297M] (Project Gutenberg texts, against which the audio in the corpus was aligned)
raw-metadata.tar.gz [33M] (Some extra meta-data produced during the creation of the corpus)
md5sum.txt [600 bytes] (MD5 checksums for the archive files)
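Since md5sum.txt provides MD5 checksums for the archives, a download can be checked before it is unpacked. The following is a minimal sketch, not an official tool: it assumes md5sum.txt uses the usual md5sum layout (one "<md5 hex> <file name>" pair per line), which is not confirmed here, and the function names md5_of_file, parse_checksums and verify_archives are hypothetical helpers introduced for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so multi-gigabyte archives are never held in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse lines of the ASSUMED form '<md5 hex> <file name>' into a dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        checksum, name = parts
        # md5sum marks binary mode with a leading '*'; drop it if present
        expected[name.strip().lstrip("*")] = checksum.lower()
    return expected


def verify_archives(checksum_file, directory="."):
    """Map each listed archive to True (match), False (mismatch) or None (not downloaded)."""
    expected = parse_checksums(Path(checksum_file).read_text())
    results = {}
    for name, checksum in expected.items():
        archive = Path(directory) / name
        if not archive.is_file():
            results[name] = None
        else:
            results[name] = md5_of_file(archive) == checksum
    return results


if __name__ == "__main__":
    # Assumes md5sum.txt and the archives sit in the current directory
    if Path("md5sum.txt").is_file():
        for name, status in verify_archives("md5sum.txt").items():
            label = {True: "OK", False: "MISMATCH", None: "not downloaded"}[status]
            print(f"{name}: {label}")
```

Archives that have not been downloaded are reported as such rather than as failures, so the check can be run after fetching only a subset such as dev-clean.tar.gz and test-clean.tar.gz.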
About this resource:
Acoustic models trained on this data set, and language models suitable for evaluation, are also available.
For more information, see the paper "LibriSpeech: an ASR corpus based on public domain audio books", Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015 (submitted).