openslr.org

Open Speech and Language Resources

ScribbleLens

Identifier: SLR84

Summary: Dutch cursive handwriting from the 16th-18th centuries, as pages and lines, for (un)supervised AI and other research.

Category: Handwriting

License: CC-BY-NC-ND (details in LICENSE.txt)

Downloads (use a mirror closer to you):
scribblelens.corpus.v1.2.zip [12G] ( Dutch historical handwritings ) Mirrors: [EU] [EU] [CN]
scribblelens.supplement.original.pages.tgz [2.5G] ( Supplementary data, mostly high-resolution images of the pages in the corpus above ) Mirrors: [EU] [EU] [CN]

About this resource:

This data consists of transcribed and untranscribed Dutch cursive handwriting from historical (16th-18th century) manuscripts.

Historical handwritten documents guard an important part of human knowledge that is within reach of only a few scholars and experts. Recent developments in machine learning and handwriting research have the potential to make this information accessible to a larger audience. Data-driven approaches to automatic manuscript recognition require large amounts of transcribed scans to work. To this end, we introduce a new handwritten corpus based on 400-year-old, cursive, early modern Dutch documents such as ship journals and daily logbooks. The 1000-page collection has been segmented into lines, and we provide textual transcriptions for 20% of the pages. Other annotations, such as handwriting slant, year of origin, complexity, and writer identity, have been added manually. With over 80 writers, this corpus is significantly larger and more varied than other existing data sets such as the Spanish RODRIGO data set. We provide train/test splits, experimental results from an automatic transcription baseline, and tools to facilitate its use in deep learning research. The manuscripts span over 150 years of significant journeys by captains and traders of the Vereenigde Oostindische Compagnie (VOC, the Dutch East India Company), such as Tasman, Brouwer and Van Neck, making this resource also valuable to historians and the paleography community.

Contact: scribblelens@protonmail.com

The data has been used for academic research as part of JSALT'19, in the project "Distant Supervision for Representation Learning".

README.txt contains all details on data, splits, structure and organization.
LICENSE.txt contains the CC-BY-NC-ND details plus acknowledgments and thanks.
scribblelens.py is an example Python loader of the data.
MD5 (scribblelens.corpus.v1.2.zip) = 005a4572daeb1e144bbb5e59d344f45f

You can cite the data using the following BibTeX entry:


@inproceedings{Dolfing20,
  author={Hans J.G.A. Dolfing and Jerome Bellegarda and Jan Chorowski and Ricard Marxer and Antoine Laurent},
  title={{The ``ScribbleLens'' Dutch historical handwriting corpus}},
  booktitle={International Conference on Frontiers of Handwriting Recognition (ICFHR)},
  pages={To Appear},
  year={2020},
  note="{http://www.openslr.org/84/}"
}