openslr.org

Contributing new resources

Contact

jtrmal@gmail.com
(Jan "Yenda" Trmal)

What data we host

We are open to hosting any type of data that is useful for speech recognition and related tasks and that needs a stable URL to be downloaded from. We may need to think more carefully in cases where the data is very large (e.g. tens of gigabytes or more).

Submitting your data

The process of adding data to OpenSLR is as follows. First, you might want to check with us quickly whether the data you want to contribute is something we want to host; you can email jtrmal@gmail.com. If we think it's a good idea, you can then prepare a .tar.gz file containing a directory with your data in it.

The format of submitted data

The directory that you transfer to us as a .tar.gz file should not contain subdirectories; it should just contain the files you want to host and two special files called info.txt and about.html whose format we'll explain below. Here is an example of such a directory:

# ls /var/www/openslr/resources/6
about.html  data_voip_cs.tgz  data_voip_en.tgz  info.txt

Note: the .tgz files in this directory are the actual files that we offer for download. There is no restriction on their names or file types, apart from the no-subdirectories rule. What you would transfer to us is a .tar.gz file containing the directory /var/www/openslr/resources/6, i.e. the four files you see in the listing above. The info.txt and about.html files are used to automatically populate the web page at http://www.openslr.org/6/. An example of what the info.txt file looks like is as follows:

root@www:/var/www/openslr# cat /var/www/openslr/resources/6/info.txt
name: Vystadial
summary: English and Czech data, mirrored from the Vystadial project
category: speech
license: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0 US)
file: data_voip_cs.tgz  Czech speech and transcripts
file: data_voip_en.tgz  English speech and transcripts
alternate_url: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4670-6 Czech data 
alternate_url: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4671-4 English data

This is a plain-text file that will be parsed by PHP scripts on our site. Four fields are mandatory and must each appear exactly once: name, summary, category and license. The name field gives the name of your resource, which shouldn't be too long. The summary is a description of the resource, about one short sentence long. The category will normally be "speech", "text" or "software", but it can have other values too. The license line should be concise; it can just summarize the license, which we assume is explained more fully in the download itself or in the about.html file. There may be multiple instances of the file field; each one corresponds to one of the files in the directory you sent us. The text after the filename in the file field is optional; if your resource contains only one file, it may not be necessary. The alternate_url field is optional and may be repeated if it occurs; the text after the URL is optional.
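To check an info.txt file against the rules just described before sending it, something like the following Python sketch may help. It is not the parser used on the site (that one is written in PHP and is not shown here); the function name `parse_info` is hypothetical, and rejecting unrecognized field names is an assumption of this sketch rather than a documented rule.

```python
MANDATORY = ("name", "summary", "category", "license")
REPEATABLE = ("file", "alternate_url")


def parse_info(text):
    """Parse info.txt-style text into a dict; raise ValueError on a rule violation."""
    info = {key: [] for key in REPEATABLE}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split at the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or key not in MANDATORY + REPEATABLE:
            # Assumption of this sketch: unknown field names are treated as errors.
            raise ValueError(f"unrecognized line: {raw!r}")
        if key in REPEATABLE:
            # The first token is the filename or URL; the rest is the
            # optional free-text description.
            parts = value.split(None, 1)
            if not parts:
                raise ValueError(f"field {key!r} has no value")
            description = parts[1] if len(parts) > 1 else ""
            info[key].append((parts[0], description))
        else:
            if key in info:
                raise ValueError(f"field {key!r} appears more than once")
            info[key] = value
    missing = [key for key in MANDATORY if key not in info]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    return info
```

Calling `parse_info(open("info.txt").read())` on the example file above would return the four mandatory fields as strings, plus lists of (filename, description) and (URL, description) pairs under the keys `file` and `alternate_url`.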

The about.html file is generic HTML which will be included in the "about this resource" section of the automatically generated webpage. Just send us a first guess and you can edit it later if needed. In our example, the about.html file looks like this:

This data is transcribed telephone conversation data, in English and Czech.
<p>
The data collection process and development of these training scripts was partly
funded by the Ministry of Education, Youth and Sports of the Czech Republic
under the grant agreement LK11221 and core research funding of Charles
University in Prague.
<p>

You can cite the data using the following BibTeX entry:
<pre>

@inproceedings{korvas_2014,
  title={{Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license}},
  author={Korvas, Mat\v{e}j and Pl\'{a}tek, Ond\v{r}ej and Du\v{s}ek, Ond\v{r}ej and \v{Z}ilka, Luk\'{a}\v{s} and Jur\v{c}\'{i}\v{c}ek, Filip},
  booktitle={Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)},
  pages={To Appear},
  year={2014},
}
</pre>

Once you have your .tar.gz file containing info.txt, about.html and your actual data, you can transfer it to us, and we'll check it and put it on the site. If the file is too big to send by email, we'll have to discuss the exact transfer mechanism.
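
As a rough illustration of the packaging step, the following Python sketch checks that a directory follows the layout described above (no subdirectories, with info.txt and about.html present) and then writes it to a .tar.gz file. The function names `check_flat_directory` and `make_archive` are hypothetical, and the checks are only the ones stated on this page; the site may apply others.

```python
import tarfile
from pathlib import Path

REQUIRED = ("info.txt", "about.html")


def check_flat_directory(directory):
    """Return a list of problems found in the directory layout (empty if none)."""
    directory = Path(directory)
    problems = []
    for entry in sorted(directory.iterdir()):
        if entry.is_dir():
            problems.append(f"subdirectory not allowed: {entry.name}")
    for name in REQUIRED:
        if not (directory / name).is_file():
            problems.append(f"missing required file: {name}")
    return problems


def make_archive(directory, archive_path):
    """Write the directory to a gzip-compressed tar file after checking its layout."""
    directory = Path(directory).resolve()
    problems = check_flat_directory(directory)
    if problems:
        raise ValueError("; ".join(problems))
    # Write the archive somewhere outside the directory being packed.
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory's own name inside the archive,
        # not the full path on the local machine.
        tar.add(directory, arcname=directory.name)
    return Path(archive_path)
```

For example, `make_archive("6", "6.tar.gz")` would produce an archive whose entries all sit under `6/`, matching the single-directory layout shown in the listing earlier.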