Open Speech and Language Resources


Identifier: SLR145

Summary: LibriSpeech text with Punctuation and Capitalization

Category: Text

License: CC BY 4.0

Downloads (use a mirror closer to you):
manifests.tar.gz [25M]   ( Manifest files that match original LibriSpeech splits )   Mirrors: [US]   [EU]   [CN]  

About this resource:

LibriSpeech-PC: A dataset based on LibriSpeech* with restored punctuation and capitalization.
  • The dataset includes ONLY .json manifests and NO audio files; the audio files can be taken from the original LibriSpeech.
  • Subsets' structure is preserved.
  • Some samples were dropped during punctuation and capitalization restoration; see STATISTICS for details.
*V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964.
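
Because the manifests ship without audio, each entry has to be matched to a file from a local copy of the original LibriSpeech. The sketch below shows one way this might be done. It assumes each manifest is a JSON-lines file (one JSON object per line) with "audio_filepath" and "text" keys, and that the stored paths are relative to the LibriSpeech root. Neither the key names nor the path layout is confirmed by this page, and the function name read_manifest is hypothetical, so check an actual manifest before relying on it.

```python
import json
from pathlib import Path


def read_manifest(manifest_path, audio_root):
    """Yield (audio_path, text) pairs from a JSON-lines manifest.

    Assumption: every non-empty line is a JSON object with the keys
    "audio_filepath" and "text". These key names are not confirmed
    by the resource page; adjust them to match the real manifests.
    """
    audio_root = Path(audio_root)
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            # Resolve the stored path against the local LibriSpeech copy.
            yield audio_root / entry["audio_filepath"], entry["text"]
```

A caller would pass the path of one extracted manifest and the directory holding the original LibriSpeech audio, then iterate over the resulting pairs. Since some samples were dropped, expect fewer entries than in the corresponding original split.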

You can cite the data using the following BibTeX entry:

        title={LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models}, 
        author={A. Meister and M. Novikov and N. Karpov and E. Bakhturina and V. Lavrukhin and B. Ginsburg},
        journal={arXiv preprint arXiv:2310.02943},