Processing the output of OCR4all

EML-txt2txt is a set of three scripts (o4a-solver, EML-spellchecker, EML-normalizer) under a common GUI. It converts the output of OCR4all into texts in which (1) abbreviations are resolved (OCR4all outputs Unicode), (2) scanning mistakes are corrected, and, finally, (3) the orthography is normalized.
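
As a rough illustration of the three stages, here is a minimal Python sketch. It is not the code of o4a-solver, EML-spellchecker or EML-normalizer; the function names and the tiny lookup tables are invented for this example, and real abbreviation resolution and spellchecking would need much larger, context-aware data.

```python
import re

# All three tables are hypothetical and exist only to make the sketch runnable.
ABBREVIATIONS = {"\ua751": "per", "\ua753": "pro"}  # p with stroke, p with flourish
CORRECTIONS = {"dorninus": "dominus"}               # "rn" misread for "m"
NORMALIZATIONS = {"\u017f": "s", "j": "i"}          # long s -> s, j -> i


def resolve_abbreviations(text):
    """Stage 1: expand Unicode abbreviation glyphs into full letter sequences."""
    for glyph, expansion in ABBREVIATIONS.items():
        text = text.replace(glyph, expansion)
    return text


def correct_scanning_mistakes(text):
    """Stage 2: replace known misreadings, word by word."""
    def fix(match):
        word = match.group(0)
        return CORRECTIONS.get(word, word)
    return re.sub(r"\w+", fix, text)


def normalize_orthography(text):
    """Stage 3: map variant spellings onto one convention."""
    for old, new in NORMALIZATIONS.items():
        text = text.replace(old, new)
    return text


def process(text):
    """Run the three stages in the order listed above."""
    text = resolve_abbreviations(text)
    text = correct_scanning_mistakes(text)
    return normalize_orthography(text)
```

Calling `process("dorninus \ua751 ejus \u017fuo")` on this toy input returns `"dominus per eius suo"`. The stages are kept as separate functions so that each can be run or replaced on its own.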


Digital Corpora of Latin Texts

For some time now I have been looking for digital text corpora and wordlists of Latin. I would like to construct wordlists that can be used with the EML-spellchecker (currently in development). I am especially interested in data that are chronologically tagged, since later iterations of the software should be able to produce information about the chronological stratification of an EML text's vocabulary.



OCR for Incunables

Probably the most significant step forward in a long time for quantitative (really, any kind of text-oriented) research in Early Modern Latin (EML) is OCR4all, an OCR tool developed at the University of Würzburg (github.com/OCR4all) that reliably converts scans of early printed books into machine-readable (and human-researchable) text. High-quality scans of early printed books have been abundant for some time now; so far, however, that has not translated into an increased availability of texts.



Used as template

Next you can update your site name, avatar and other options using the _config.yml file in the root of your repository (shown below).
