Skip to content

Chayan-halder/DIDA

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

67 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Swedish historical handwritten document

DIDA: The largest historical handwritten digit dataset with 250k digits

DIDA is a new image-based historical handwritten digit dataset and collected from the Swedish historical handwritten document images between the year 1800 and 1940. It is the largest historical handwritten digit dataset which is introduced to the Optical Character Recognition (OCR) community to help the researchers to test their optical handwritten character recognition methods. To generate DIDA, 250,000 single digits and 100,000 multi-digits are cropped from 75,000 different document images. The dataset has multiple unique characteristics as explained below:

  • Degradation: The poor quality of the used papers and inks in the nineteenth centuries, age of documents, and distortions influence digits in DIDA dataset. These issues simply result in multiple degradation and artifacts. For instance age of documents causes deterioration of digits (i.e.~faint). Moreover, other artefacts in the document images are background variation, show-through, bleed-through, and smear. Consequently, the CARDIS dataset is exposed to many different inter- and intra-class variations.
  • Handwriting styles: In the recorded documents, the digits were written in copperplate, cursive, and Gothic styles by different priests using various types of ink, nib, and dip pen, which result in different appearances. Moreover, in the documents the same class digits appeare with different sizes thus, the shapes can be diverse. Hence, in DIDA dataset, the digits appear in various writing styles, sizes, directions, widths, and arrangements. These variations in handwriting patterns due to individual writing style and materials used to write the texts generate endless inter-class variations.

The DIDA single digits dataset has 250,000 handwritten digit samples with 10 different classes from 0 to 9, and each class contains 20,000-25,000 single digit images. To the best of our knowledge, this dataset is the largest one to present historical handwritten single digit samples in RGB color space with the original sizes and appearances (a). This dataset is in contrast with the existing publicly available handwritten digit datasets (e.g. MNIST (b)), where the digit images are size-normalized, denoised and cleaned.

DIDA vs MNIST

12k, 30k, and 58k digit string data-set are generated from different document images.

DIGITNET Model and Weights

DIGITNET and Weights

A Swedish Historical Handwritten Character Dataset with 116,000 characters, 30,000 Swedish names, and 1000 region names

CArDIS Dataset

If you use any of these datasets, please cite:

Reference:

  • Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, Niklas Lavesson, "DIGITNET: A Deep Handwritten Digit Detection and Recognition Method Using a New Historical Handwritten Digit Dataset", Big Data Research, 2020, DOI: 10.1016/j.bdr.2020.100182.
  • Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, Niklas Lavesson, DIDA: The largest historical handwritten digit dataset with 250k digits, June 2021. Accessed on: June 13, 2021. Available: https://github.com/didadataset/DIDA/.

BibTeX:

  • @article{Kusetogullari2020DIDA,
    title={DIGITNET: A Deep Handwritten Digit Detection and Recognition Method Using a New Historical Handwritten Digit Dataset},
    author={Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, Niklas Lavesson},
    journal={Big Data Research},
    year={2020}
    }
  • @misc{DIDA2020,
    author = {Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, Niklas Lavesson},
    title = {DIDA: The largest historical handwritten digit dataset with 250k digits},
    howpublished = {\url{https://github.com/didadataset/DIDA/}},
    note = {Accessed: 2021-06-13}
    }

News: