Pansori-TEDxKR
Identifier: SLR58
Summary: Korean speech corpus generated from Korean language TEDx talks
Category: Speech
License: Creative Commons BY-NC-ND 4.0 (attribution/non-commercial/no-derivatives)
Downloads (use a mirror closer to you):
pansori-tedxkr-corpus-1.0.tar.gz [174M] (Korean speech and transcripts)
Mirrors:
[US]
[EU]
[CN]
About this resource:
The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech as audio-transcript pairs from 41 speakers.
This corpus was generated with Pansori, a new system for corpus data ingestion and processing. Please refer to the code repository (https://github.com/yc9701/pansori) and the following paper for further information on the Pansori ASR corpus generation system:
@inproceedings{choi_2018,
  title={{Pansori: ASR corpus generation from open online video contents}},
  author={Choi, Yoona and Lee, Bowon},
  booktitle={Proceedings of the IEEE Seoul Section Student Paper Contest 2018},
  pages={117-121},
  month={Nov},
  year={2018},
}
Extra care was taken to maintain the quality of the generated corpus:
- Only TEDx talks hand-transcribed by community translators were included.
- Corpus fragments were segmented at subtitle boundaries.
- Segmentation was fine-tuned by manual (tool-assisted) speech-text alignment.
- Final validation was performed with a state-of-the-art speech recognizer (Google Cloud Speech-To-Text).
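As an illustration of the subtitle-boundary segmentation step listed above, here is a minimal Python sketch. It is not the Pansori implementation: the cue format (start and end in seconds plus text), the function name `segment_at_cues`, and the 16 kHz sample rate in the usage example are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical cue format: (start_sec, end_sec, subtitle_text).
Cue = Tuple[float, float, str]


def segment_at_cues(
    samples: Sequence[int], sample_rate: int, cues: List[Cue]
) -> List[Tuple[Sequence[int], str]]:
    """Cut one (audio clip, transcript) pair per subtitle cue."""
    pairs = []
    for start, end, text in cues:
        text = text.strip()
        if end <= start or not text:
            continue  # skip malformed or empty cues
        # Convert cue times to sample offsets, clamped to the audio length.
        lo = max(0, round(start * sample_rate))
        hi = min(len(samples), round(end * sample_rate))
        if lo < hi:
            pairs.append((samples[lo:hi], text))
    return pairs


if __name__ == "__main__":
    audio = list(range(16000 * 3))  # 3 seconds of placeholder samples
    cues = [(0.0, 1.0, "안녕하세요"), (1.5, 2.5, "감사합니다")]
    for clip, text in segment_at_cues(audio, 16000, cues):
        print(len(clip), text)
```

Cutting exactly at cue times is only a first pass; subtitle timings are rarely tight against the speech, which is presumably why the list above also includes a manual alignment step.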
Contact Information
- Yoona Choi yoona@ieee.org
- Bowon Lee bowon.lee@inha.ac.kr
Electronics Engineering, Inha University
External URLs:
- https://storage.googleapis.com/pansori/corpus/pansori-tedxkr-corpus-1.0.tar.gz (External link for download)
- https://github.com/yc9701/pansori-tedxkr-corpus (GitHub repository)
- https://github.com/yc9701/pansori (Data processing scripts)
- https://storage.googleapis.com/pansori/paper/pansori_asr_corpus_tool.pdf (Paper)
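For convenience, the following Python sketch downloads the archive from the external link listed above and unpacks it using only the standard library. The helper names `download` and `extract` and the output directory name are assumptions for this example, and the sketch makes no assumption about the directory layout inside the archive.

```python
import tarfile
import urllib.request
from pathlib import Path

CORPUS_URL = (
    "https://storage.googleapis.com/pansori/corpus/"
    "pansori-tedxkr-corpus-1.0.tar.gz"
)


def download(url: str, dest: Path) -> Path:
    """Fetch url to dest unless the file is already present."""
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract(archive: Path, out_dir: Path) -> list:
    """Unpack a .tar.gz archive into out_dir and return the member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    root = out_dir.resolve()
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # Path-based check only: refuse entries that resolve outside out_dir.
            target = (root / member.name).resolve()
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(root)
    return [member.name for member in members]


if __name__ == "__main__":
    archive = download(CORPUS_URL, Path("pansori-tedxkr-corpus-1.0.tar.gz"))
    names = extract(archive, Path("pansori-tedxkr"))
    print(len(names), "entries extracted")
```

The archive is about 174 MB, so the download is skipped when the file already exists locally.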