For some time now I have been looking for digital text corpora and wordlists of Latin. I would like to construct wordlists that can be used with the EML-spellchecker (in development now); I am specially interested in data which are chronologically tagged, since later iterations of the software should be able to produce information about the chronological stratification of an EML-text’s vocabulary.
Digital Corpora of Latin Texts
I have not had much luck with chronologically tagged data; but I have an idea of how to construct them myself with the help of the Index librorum of the Thesaurus Linguae Latinae which contains complete data for the Latin literature before 600 AD and is conveniently open access on line, and the list of works in the NLW. The following are the corpora which are significant for me.
LatinISE
lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2506
License: CC BY-SA 4.0
9873823 words, 340233 different forms
A lemmatized corpus built in 2012 by Barbara McGillivray collected from the LacusCurtius, IntraText and Musisque Deoque websites. The lemmatization reflects the state of the art of 2012, but will be usable for the more frequent forms.
Perseus
www.perseus.tufts.edu/hopper
License: Creative Commons ShareAlike 3.0 License
The source of many other text corpora on the Web. I have not managed to find any lemmatization data from the Perseus Lemmatizer on the Perseus site. Probably they are created dynamically.
IntraText
www.intratext.com
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
Texts up to the twentieth century. Rough chronological tagging. No lemmatization data.
Corpus Corporum
www.mlat.uzh.ch
License: Rights vary due to heterogeneous origin
A rich ressource for the Latin corpus linguist; many texts have lemmatization data.
All the corpora mentioned above contain data from medieval and later texts.
digital Monumenta Germaniae Historica
www.dmgh.de
License: Open Access
For medieval Latin this is the site. The site’s technology is from 2010, and clearly the primary public are not corpus linguists, but medieval scholars who need access to the MGH-volumes. Consequently, no lemmatization data. Still, one can access an OCR of all the pages with a high quality text; thus the dMGH could contribute to the construction of a corpus of medieval texts, though it would need some effort. Lemmatization would probably be a lot of work.
Index Thomisticus Treebank
itreebank.marginalia.it
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
Latin texts of Thomas Aquinas (Summa contra Gentiles partially, Scriptum super Sententiis Magistri Petri Lombardi, Summa Theologiae) with morphological, syntactic and semantic/pragmatic annotation. The package contains lemmatization data.
The Corpus Thomisticum itself has an “All rights reserved” license, though I can not fathom what the point might be of putting a text corpus on the Web without accompanying rights.
LatInfLex
github.com/matteo-pellegrini/LatInfLexi
No license indicated
This is an inflected lexicon of Latin verbs in a csv-file. It includes 254 inflected forms of 3,348 verbs from Delatte et al., Dictionnaire fréquentiel et Index inverse de la langue latine (1981), irrespective of whether they are actually attested. A fascinating experiment.
Thesaurus Formarum Totius Latinitatis
License: commercial
By Paul Tombeur. Chronologically tagged. No lemmatization data. This is a commercial product; its data can neither be freely used nor distributed. I mention it because it would be an interesting product.