Open Speech and Language Resources



Multilingual TEDx

Identifier: SLR100

Summary: a multilingual corpus of TEDx talks for speech recognition and translation

Category: Speech

License: Creative Commons Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)

Downloads (use a mirror closer to you):
mtedx_es-es.tgz [35G]   ( Spanish speech and transcripts )   Mirrors: [China]  
mtedx_fr-fr.tgz [34G]   ( French speech and transcripts )   Mirrors: [China]  
mtedx_pt-pt.tgz [29G]   ( Portuguese speech and transcripts )   Mirrors: [China]  
mtedx_it-it.tgz [19G]   ( Italian speech and transcripts )   Mirrors: [China]  
mtedx_ru-ru.tgz [10G]   ( Russian speech and transcripts )   Mirrors: [China]  
mtedx_el-el.tgz [5.5G]   ( Greek speech and transcripts )   Mirrors: [China]  
mtedx_ar-ar.tgz [3.6G]   ( Arabic speech and transcripts )   Mirrors: [China]  
mtedx_de-de.tgz [2.6G]   ( German speech and transcripts )   Mirrors: [China]  
mtedx_es-en.tgz [13G]   ( Spanish speech and transcripts with aligned English translations )   Mirrors: [China]  
mtedx_es-fr.tgz [1.9G]   ( Spanish speech and transcripts with aligned French translations )   Mirrors: [China]  
mtedx_es-it.tgz [1.9G]   ( Spanish speech and transcripts with aligned Italian translations )   Mirrors: [China]  
mtedx_es-pt.tgz [8.1G]   ( Spanish speech and transcripts with aligned Portuguese translations )   Mirrors: [China]  
mtedx_fr-en.tgz [9.8G]   ( French speech and transcripts with aligned English translations )   Mirrors: [China]  
mtedx_fr-es.tgz [7.1G]   ( French speech and transcripts with aligned Spanish translations )   Mirrors: [China]  
mtedx_fr-pt.tgz [4.7G]   ( French speech and transcripts with aligned Portuguese translations )   Mirrors: [China]  
mtedx_pt-en.tgz [10G]   ( Portuguese speech and transcripts with aligned English translations )   Mirrors: [China]  
mtedx_pt-es.tgz [625M]   ( Portuguese speech and transcripts with aligned Spanish translations )   Mirrors: [China]  
mtedx_it-en.tgz [746M]   ( Italian speech and transcripts with aligned English translations )   Mirrors: [China]  
mtedx_it-es.tgz [746M]   ( Italian speech and transcripts with aligned Spanish translations )   Mirrors: [China]  
mtedx_ru-en.tgz [2.3G]   ( Russian speech and transcripts with aligned English translations )   Mirrors: [China]  
mtedx_el-en.tgz [2.4G]   ( Greek speech and transcripts with aligned English translations )   Mirrors: [China]  
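
The archive names in the list above appear to follow the pattern mtedx_<source>-<target>.tgz, where matching codes (for example es-es) mark a transcript-only archive and differing codes (for example es-en) mark an archive with aligned translations. A minimal Python sketch of reading that convention is given below; the pattern is inferred from the listing rather than documented, and the names ArchiveName and parse_archive_name are hypothetical helpers introduced only for illustration.

```python
import re
from typing import NamedTuple


class ArchiveName(NamedTuple):
    """Hypothetical container for the parts of an archive file name."""
    source: str           # language code of the speech and transcripts
    target: str           # language code of the translations (or same as source)
    is_translation: bool  # True when source and target differ


# Assumption: names look like mtedx_<two letters>-<two letters>.tgz,
# as in the download list; this is inferred, not an official specification.
_PATTERN = re.compile(r"^mtedx_([a-z]{2})-([a-z]{2})\.tgz$")


def parse_archive_name(filename: str) -> ArchiveName:
    """Split an archive file name into source and target language codes."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected archive name: {filename!r}")
    source, target = match.groups()
    return ArchiveName(source, target, source != target)
```

For instance, parse_archive_name("mtedx_es-en.tgz") yields source "es" and target "en" with is_translation set to True, while "mtedx_es-es.tgz" yields is_translation set to False.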

About this resource:

Multilingual TEDx is a multilingual speech recognition and translation corpus built to facilitate the training of automatic speech recognition (ASR) and speech translation (SLT) models in additional languages.

The corpus comprises audio recordings and transcripts from TEDx Talks in 8 languages (Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, German) with translations into up to 5 languages (English, Spanish, French, Portuguese, Italian). The audio recordings are automatically aligned at the sentence level with their manual transcriptions and translations. Each .tgz file contains two directories: data and docs. The docs directory contains a README detailing the provided files and their structure.
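
To check the data/docs layout described above before extracting a multi-gigabyte archive, the file listing can be read directly from the tarball. The following is a rough Python sketch using only the standard library; summarize_archive is a hypothetical helper, and the depth parameter is an assumption added in case the two directories sit under a wrapper directory inside the archive.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(path: str, depth: int = 1) -> Counter:
    """Count regular files per leading directory without extracting anything.

    depth controls how many leading path components form the grouping key;
    files stored at the archive root are grouped under ".".
    """
    counts: Counter = Counter()
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other entries
            # Directory components only (the file name itself is dropped).
            directories = PurePosixPath(member.name).parts[:-1]
            key = "/".join(directories[:depth]) or "."
            counts[key] += 1
    return counts
```

Calling summarize_archive on a downloaded archive and printing the result should show how many files fall under each top-level directory; if only a single wrapper directory shows up, retrying with depth=2 may reveal the data and docs directories beneath it.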

NOTE: the training data for pt-es, it-es, it-en are being held out for IWSLT 2021 and will be released in mid-April.

Contact: Elizabeth Salesky (esalesky@jhu.edu), Matthew Wiesner (mwiesner@jhu.edu)

You can cite the data using the following BibTeX entry:

  @misc{salesky2021mtedx,
      title={Multilingual TEDx Corpus for Speech Recognition and Translation},
      author={Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
      year={2021},
  }