ALexis

This blog entry describes an attempt to use a lemmatizer (the software ‘Collatinus’) to recognize and isolate errors in the OCR of a 16th-century print and, in some cases, to reconstitute words split by incorrectly recognized word boundaries.

Recently I had the chance to work with a text of the Artis historicae penus, printed in Basel in 1579. This is a collection of texts on the theory of historiography in two volumes; a large part of the first volume is occupied by Bodin’s Methodus ad facilem historiae cognitionem. The two volumes had been OCR’ed some time ago with poor results, partly because of the poor quality of the digitization used as source, partly because no suitable OCR software was available. Thus, among many mistakes, there was a constant misreading of long ‘s’ as ‘f’ and of the ligature for enclitic ‘-que’ as ‘q’ or ‘q;’, and hyphens were lost or replaced by ‘.’. There were also many split words, since the print often does not use hyphens when splitting words at the line break.

Enclitic -que was easily resolved, since the words ending in -que are a closed class in Latin with very few ambiguous cases (most importantly ‘quoque’, meaning either ‘etiam’ or ‘et quo’).

This collection of texts seemed the perfect test case for an idea I have toyed with for some time: to use a lemmatizer as a spell-checker. The software I used was Collatinus, described on its website as a ‘Lemmatiser and morphological analyser for Latin texts’, i.e. for classical Latin texts. It is maintained by Yves Ouvrard (https://outils.biblissima.fr/en/collatinus) and is available under the GNU GPL licence. I have been using it recently for lemmatizing Early Modern Latin texts. Version 11 comes with a built-in server, which can be queried from other programs and returns its answer via the clipboard.

The text of the first volume, which was my source, has 360,000 tokens. First, I removed all non-ASCII letters: those in foreign alphabets (Cyrillic, Arabic) were deleted, and accented letters were replaced by their unaccented equivalents. Then I converted the text into a list of distinct tokens ordered by frequency, since it made sense to focus on the most frequent errors; tokens that occur only once or twice contribute less to the overall state of the text. This resulted in a list of 55,930 distinct tokens:

ut	2984
ad	2596
de	2578
non	2518
quod	2356
qui	2172
cum	2165
eft	2113
ac	2039
quae	1542
aut	1485
...
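
The two preparation steps can be sketched in Python roughly as follows. This is only an illustration under my own assumptions, not the script actually used: the function names are made up, and letters that have no Unicode decomposition (for instance ‘æ’) would be dropped by this version and would need an explicit mapping.

```python
import re
import unicodedata
from collections import Counter


def to_ascii_letters(text: str) -> str:
    """Replace accented letters by their base letter and drop other non-ASCII characters."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Whatever is still outside ASCII (Cyrillic, Arabic, ...) is deleted.
    return without_marks.encode("ascii", "ignore").decode("ascii")


def frequency_list(text: str) -> list[tuple[str, int]]:
    """Return (token, count) pairs, most frequent first."""
    tokens = re.findall(r"[a-z]+", to_ascii_letters(text).lower())
    return Counter(tokens).most_common()


if __name__ == "__main__":
    for token, count in frequency_list("Vt ut èft eft eft"):
        print(f"{token}\t{count}")
```

Lower-casing everything before counting is a simplification of mine; it merges ‘Vt’ and ‘vt’ into one entry, which may or may not be wanted.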

This list was then lemmatized with Collatinus; tokens that came back lemmatized were presumed correct. In this case the tendency of Collatinus to return even the most obscure of forms (such as ‘natura’ for the feminine future participle of ‘no, nare’, to swim) did not matter much. I noticed only one important case: ‘fi’ was accepted as the imperative of ‘fieri’ (!), whereas in this text it is a frequent misreading of ‘si’. The result was a list of 23,801 items for further correction. Within this list, words containing ‘f’ were also tested with ‘s’, and the resulting lemmatization, if any, was written into column 3 and eventually into the text itself:

eft	2113	est
fed	758	sed
effe	620	esse
funt	584	sunt
rum	403	-
bodinus	399
fe	342	se
con	287
tur	284	-
fint	254	sint
bus	241	-
kai	240
atq	234
poteft	221	potest
poft	221	post
...

Column 1 contains the token as found in the data, column 2 its frequency, and column 3 the possible correction.
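
The ‘f’/‘s’ test can be sketched as below. The lemmatizer is represented by a hypothetical callable `is_known`, which stands in for a query to Collatinus; in the demo it is just a small set of word forms. I also assume here that every combination of ‘f’ and ‘s’ is tried (so that ‘fufficit’ can become ‘sufficit’), which is one possible reading of the step and not necessarily how it was done.

```python
from itertools import product
from typing import Callable, Iterator


def s_variants(token: str) -> Iterator[str]:
    """Yield every spelling obtained by turning one or more 'f' into 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        if "s" not in choice:
            continue  # nothing replaced: this is the unchanged token
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)


def suggest(token: str, is_known: Callable[[str], bool]) -> list[str]:
    """Return the variants that the (hypothetical) lemmatizer check accepts."""
    return [variant for variant in s_variants(token) if is_known(variant)]


if __name__ == "__main__":
    # Stand-in for the lemmatizer: a tiny set of accepted forms.
    known_forms = {"est", "esse", "potest", "sufficit"}
    for token in ["eft", "effe", "poteft", "fufficit", "bodinus"]:
        print(token, suggest(token, known_forms.__contains__))
```

The number of variants doubles with every ‘f’ in the token, which is harmless for ordinary word lengths. Suggestions still need a human eye: a variant accepted by the lemmatizer is not automatically the right reading.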

My test setup was intrinsically quite slow. Furthermore, sometimes the connection to the server timed out, and lemmatization had to be repeated after 3 seconds. So the first pass of the correction took the better part of a night.
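
The retry logic is simple enough to sketch. The function `query` below is a placeholder for whatever sends one token to the lemmatizer; I do not reproduce the Collatinus server protocol here, and the limit of five attempts is an arbitrary choice of mine.

```python
import time
from typing import Callable


def with_retry(
    query: Callable[[str], str],
    token: str,
    attempts: int = 5,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call query(token); on a timeout or connection error, wait and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return query(token)
        except OSError:
            # TimeoutError and ConnectionError are both subclasses of OSError.
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter makes it possible to test the function without actually waiting three seconds each time.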

The first list of words not lemmatized alerted me to another problem: the original OCR had split many words, and these might be recovered using the same procedure. Ideally one should just make a list of all unlemmatized neighbours and test them; to reduce the time used, I only applied this procedure to a list of tokens that could be either the beginning or the end of a word. These I marked by hand, and I created a list of their respective neighbours, which was then tested with Collatinus. This procedure allowed the recovery of a further number of words lost in the original OCR. At this point I also intervened manually to restore some words which had been corrupted by the insertion of an additional letter, where the restoration was secure (in the list below e.g. ‘halicarnaso fus’; both the deletion of ‘o’ and the change ‘f/s’ were unproblematic). I targeted mainly word families like ‘historia, -cus’ etc., since these texts will be used in a research project about the history of ideas.

guber nationem	1	gubernationem
ha bemus	1	habemus
ha bent	3	habent
ha bet	2	habet
ha buit	3	habuit
haben da	2	habenda
haben disque	1	habendisque
hac bemus	3	habemus
hac bet	9	habet
haeredi tate	1	haereditate
halicarnaso fus	2	halicarnassus
hanni bal	1	hannibal
he bris	1	hebris
hi storia	2	historia
hi storici	4	historici
hi storico	2	historico
hi storicus	2	historicus
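
The ‘ideal’ variant of this step, testing all neighbours, might look like the sketch below. Again `is_known` is a hypothetical stand-in for the lemmatizer. The rule used here is my own assumption: two neighbours are joined only if at least one of them is not recognized on its own and the joined form is recognized.

```python
from typing import Callable


def rejoin_split_words(tokens: list[str], is_known: Callable[[str], bool]) -> list[str]:
    """Merge adjacent tokens when the pair looks like one word split by the OCR."""
    result = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            one_half_unknown = not (is_known(left) and is_known(right))
            if one_half_unknown and is_known(left + right):
                result.append(left + right)
                i += 2  # both halves are consumed
                continue
        result.append(tokens[i])
        i += 1
    return result


if __name__ == "__main__":
    known_forms = {"et", "in", "hi", "habemus", "historia"}
    text = ["et", "ha", "bemus", "in", "hi", "storia"]
    print(rejoin_split_words(text, known_forms.__contains__))
```

A pair such as ‘hi storia’ shows why the check cannot require both halves to be unknown: ‘hi’ is a perfectly good Latin word by itself.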

All in all, aside from the Unicode corrections, I made approx. 22,000 corrections, recovered 2,000 split words, and added ‘-que’ 1,050 times.

Conclusion:

(+) The text is more legible than before
(+) Key terms of the text can be more reliably searched by any text processor
(+) The procedure can with little effort target specific word families
(-) Running Collatinus on a large scale is time-consuming
(-) If run unsupervised, new mistakes will inevitably be introduced

Written on April 14, 2021