This post describes an attempt to use lemmatization (with the software ‘Collatinus’) to detect and isolate errors in the OCR of a 16th-century print and, in some cases, to rejoin words that were split by incorrectly recognized word boundaries.
This post describes my project to OCR the Gesta Porsennae (1458/1460) by Leonardo Dati (1408-1472) with OCR4all, working from the high-quality scans that the Biblioteca Apostolica Vaticana offers online of ms. Vrb. lat. 411, a beautiful late fifteenth-century manuscript from the library of Federico da Montefeltro.
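The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a small hand-made set of word forms stands in for the lemmatizer (Collatinus itself is not called here), and the names `KNOWN_FORMS`, `is_known`, and `check_tokens` are hypothetical. A token that the stand-in cannot recognize is flagged as a suspected OCR error, unless joining it with its right-hand neighbour yields a recognized form, in which case the two fragments are merged.

```python
# Stand-in for a lemmatizer: a tiny set of recognized Latin forms (assumption).
KNOWN_FORMS = {"arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"}


def is_known(token):
    """Return True if the stand-in 'lemmatizer' recognizes the token."""
    return token.lower() in KNOWN_FORMS


def check_tokens(tokens):
    """Merge wrongly split words and flag tokens that stay unrecognized.

    Returns a list of (token, status) pairs, where status is
    'ok', 'merged', or 'suspect'.
    """
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if is_known(token):
            result.append((token, "ok"))
            i += 1
        elif i + 1 < len(tokens) and is_known(token + tokens[i + 1]):
            # Two fragments form a known word: the word border was wrong.
            result.append((token + tokens[i + 1], "merged"))
            i += 2
        else:
            result.append((token, "suspect"))
            i += 1
    return result


if __name__ == "__main__":
    line = "arma vir umque cano troiae qui prirnus ab oris"
    for token, status in check_tokens(line.split()):
        print(f"{status:8} {token}")
```

Note the limitation of this greedy approach: a fragment that happens to be a valid word on its own (such as "vir" with a real lemmatizer) would be accepted as is and never merged, so a real implementation would need to look at both readings.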
The quality of the scans is one of the two factors that decide the quality of OCR4all's output (the other being font training). Since I have mostly OCR’ed incunables and early 16th-century prints, this has been a major concern. ‘My’ books are often heavily discolored and hardly ever evenly lit, since pages curve away from the camera towards the spine and thus reflect light differently there than on the rest of the page. This usually does not affect the legibility of the original scans, but bitonal conversion often turns the beginning of the line into a black blob that even the ingenuity of OCR4all cannot make sense of. I have tried to improve the legibility of the scans by heightening the contrast, by changing the histogram curve in Adobe Lightroom, by setting the scanning parameter in OCR4all to greyscale instead of bitonal (not a success in my few attempts), by creating bitonal output with Scantailor (only a success if the scan is rather uniform), and so on. This post is about the circuitous route to optimized input for OCR4all.
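Why uneven lighting produces a black blob can be shown on a toy example. The following is a minimal sketch, not what OCR4all or Scantailor do internally; the function names, the default parameters, and the tiny 'page' are all made up for illustration. A global threshold uses one cutoff for the whole page, while a local threshold compares each pixel with the mean brightness of its immediate neighbourhood.

```python
def global_threshold(image, cutoff=128):
    """Binarize with one cutoff for the whole page: 1 = ink, 0 = paper."""
    return [[1 if pixel < cutoff else 0 for pixel in row] for row in image]


def local_threshold(image, radius=1, offset=20):
    """Binarize each pixel against the mean of its (2*radius+1)^2 window."""
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            # Ink is a pixel clearly darker than its surroundings.
            row.append(1 if image[y][x] < mean - offset else 0)
        output.append(row)
    return output


# Toy 'page' (grey values 0-255): the paper gets darker towards the spine
# on the left, and there is one ink pixel near each edge.
page = [
    [100, 120, 140, 160, 180, 200, 220, 240],
    [100,  50, 140, 160, 180, 200, 150, 240],
    [100, 120, 140, 160, 180, 200, 220, 240],
]

if __name__ == "__main__":
    for name, func in (("global", global_threshold), ("local", local_threshold)):
        print(name)
        for row in func(page):
            print("".join("#" if value else "." for value in row))
```

On this toy page the global threshold turns the two darkest columns entirely black and loses the ink pixel on the bright side, while the local threshold keeps exactly the two ink pixels.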
EML-txt2txt is a collection of three scripts (o4a-solver, EML-spellchecker, EML-normalizer) under a common GUI that work on Early Modern Latin (EML) texts. They transform the output of OCR4all (version of May 2019) into texts in which (1) abbreviations are resolved (OCR4all outputs them as Unicode characters), (2) scanning mistakes are corrected, and (3) the orthography is normalized.
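The three steps can be pictured as a small pipeline. The sketch below is not the code of the actual scripts; the mapping tables, the word list, and all function names are illustrative assumptions, and fuzzy matching with Python's `difflib` merely stands in for whatever the spellchecker really does.

```python
import difflib

# Illustrative tables only; real ones would be far larger (assumptions).
ABBREVIATIONS = {"ꝑ": "per", "ꝓ": "pro"}      # Unicode abbreviation signs
NORMALIZATION = str.maketrans({"j": "i", "J": "I", "ſ": "s"})
WORDLIST = {"jam", "per", "montem", "profectus", "est"}


def resolve_abbreviations(token):
    """Step 1: expand abbreviation signs into the letters they stand for."""
    for sign, expansion in ABBREVIATIONS.items():
        token = token.replace(sign, expansion)
    return token


def correct_spelling(token, wordlist=WORDLIST):
    """Step 2: replace an unknown token with its closest known form, if any."""
    if token in wordlist:
        return token
    matches = difflib.get_close_matches(token, sorted(wordlist), n=1, cutoff=0.6)
    return matches[0] if matches else token


def normalize_orthography(token):
    """Step 3: map period spellings to a normalized orthography."""
    return token.translate(NORMALIZATION)


def pipeline(text):
    """Run the three steps, in order, on each whitespace-separated token."""
    return " ".join(
        normalize_orthography(correct_spelling(resolve_abbreviations(token)))
        for token in text.split()
    )


if __name__ == "__main__":
    # 'eft' imitates a typical misreading of 'eſt' (long s taken for f).
    print(pipeline("jam ꝑ montem ꝓfectus eft"))
```

In this sketch the word list holds period spellings (such as "jam"), so that correction can run before normalization, which mirrors the order of the three steps above.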
For some time now I have been looking for digital text corpora and wordlists of Latin. I would like to construct wordlists that can be used with the EML-spellchecker (now in development); I am especially interested in data that are chronologically tagged, since later iterations of the software should be able to produce information about the chronological stratification of an EML text’s vocabulary.
Probably the most significant step forward in a long time for quantitative (really any kind of text-oriented) research in Early Modern Latin (EML) is OCR4all, OCR software developed at the University of Würzburg (github.com/OCR4all) that reliably converts scans of early printed books to machine-readable (and human-researchable) text. High-quality scans of early printed books have been abundant for some time now; so far, however, that has not translated into an increased availability of texts.