3D scan with identified cuneiform strokes of HS 1174 © Hilprecht Collection of Babylonian Antiquities, adapted from Fig. 2 in Homburg et al. 2022.

The Cuneiform Wide Web: From Card Catalogues to Digital Assyriology

October 2022 | Vol. 10.10

By Shai Gordin and Avital Romach

Can computers read cuneiform better than experts? The answer, at the moment, is no. But will computers read cuneiform better than experts? And why would we want them to?

There are multiple benefits in using computational methods for studying ancient texts, particularly ancient cuneiform texts. The most straightforward benefit is obvious. Imagine all cuneiform texts were available with an effective search engine. You could search for specific words, sign sequences, or even general keywords. The benefits of searchable databases of cuneiform texts and artefacts are already well established with the Cuneiform Digital Library Initiative, Open Richly-Annotated Cuneiform Corpus, Archibab, Achemenet, Database of Neo-Sumerian Texts, The Electronic Text Corpus of Sumerian Literature, and the Hethitologie Portal Mainz, to name a few of the largest. A particularly exciting venture is the electronic Babylonian Literature (eBL) project, which currently holds ca. 20,000 editions of tablet fragments from the British Museum.

Force-directed network visualization of the Mesopotamian Ancient Placenames Almanac (MAPA), which documents placenames attested in and around Uruk from the 8th-4th centuries BCE. Nodes are colored based on their geographic relationships as recorded in the publications of Ran Zadok; graph produced in the cloud-based visualization tool Polinode (https://www.polinode.com/).

Assyriological research often involves examining substantial numbers of sources and extracting relevant information. In fact, saving pieces of data in digital format is not so different from the traditional card catalog. Instead of small paper cards kept in specific boxes, the pertinent information is stored digitally in files that are like cabinets: their structure, which is easy for a computer to process, is like folders within folders, which can quickly take you to the information you are looking for. The important difference is that when these folders and cards are digital, you can easily rearrange, visualize, explore, and reuse them in ways that were barely possible with analog methods.
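
The card catalog analogy can be sketched in a few lines of Python. This is only an illustration: the tablet identifiers, field names, and values below are invented and are not taken from any of the databases mentioned above. The same small set of "cards" is rearranged into two different folder structures without any manual re-sorting.

```python
from collections import defaultdict

# Hypothetical catalogue cards: every identifier and value here is invented.
cards = [
    {"id": "T001", "period": "Neo-Babylonian", "genre": "letter", "place": "Uruk"},
    {"id": "T002", "period": "Neo-Babylonian", "genre": "contract", "place": "Sippar"},
    {"id": "T003", "period": "Ur III", "genre": "receipt", "place": "Umma"},
    {"id": "T004", "period": "Neo-Babylonian", "genre": "contract", "place": "Uruk"},
]


def regroup(records, *fields):
    """Arrange records as folders within folders, one level per field."""
    if not fields:
        # Innermost folder: just list the card identifiers.
        return [record["id"] for record in records]
    first, rest = fields[0], fields[1:]
    folders = defaultdict(list)
    for record in records:
        folders[record[first]].append(record)
    return {key: regroup(group, *rest) for key, group in folders.items()}


# The same cards, filed two different ways.
by_period = regroup(cards, "period", "genre")
by_place = regroup(cards, "place")

print(by_period)
print(by_place)
```

With paper cards, each of these arrangements would require physically re-sorting the boxes; here it is a single function call on the same data.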

In addition to translating analog to digital, computational methods can open new avenues of research that are unprecedented. Machines understand text very differently from humans. For computational text analysis, words need to be transformed into numbers, or vectors. Then, it is possible to perform a multitude of analyses, for example, finding lexical similarities and meanings, or identifying parts of speech and syntactical structures. While this practice is foreign to traditional text analysis, it opens an opportunity to view texts from new perspectives.
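
To make the idea of turning words into vectors concrete, here is a minimal sketch in Python. It assumes the simplest possible representation, plain word counts, whereas the models discussed in this article learn far richer vectors. The three toy "texts" are invented English placeholders, not real editions.

```python
import math
from collections import Counter


def to_vector(text, vocabulary):
    """Represent a text as word counts, one number per vocabulary entry."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity: 1.0 for identical direction, 0.0 for no overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented toy texts, standing in for transliterated or translated documents.
texts = {
    "A": "silver barley silver field",
    "B": "barley field silver",
    "C": "king palace temple",
}

vocabulary = sorted({word for text in texts.values() for word in text.split()})
vectors = {name: to_vector(text, vocabulary) for name, text in texts.items()}

sim_ab = cosine(vectors["A"], vectors["B"])  # texts sharing most of their words
sim_ac = cosine(vectors["A"], vectors["C"])  # texts sharing no words
print(vocabulary)
print(vectors)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Once texts are numbers, "similarity" becomes a calculation: texts A and B score close to 1, while A and C score 0 because they share no vocabulary.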

Force-directed network visualization of the MAPA gazetteer. Nodes are colored based on a community detection algorithm, which groups together closely linked places. Graph produced in the cloud-based visualization tool Polinode (https://www.polinode.com/).

These are just some examples of many possibilities. Moving in a digital direction also provides opportunities for interdisciplinary research that was previously unfeasible. Think of the card catalog example. When the catalog is analog, its boxes are of use only to the limited number of scholars with access, and they cannot be connected to the catalogs of other scholars from the same or adjacent fields. But if they are digital, these files can be shared, compared, and joined together through the semantic web, using established conventions under the principles of linked open data (see for example the FactGrid Cuneiform Project, Cuneiform Inscriptions Geographical Site Index, or the Mesopotamian Ancient Place-names Almanac).
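
The principle behind linked open data can be sketched with plain Python sets. This is an assumption-laden illustration, not the data model of any of the projects named above: the identifiers use the reserved example.org domain, the property names are invented, and the coordinates for Uruk are approximate. Each statement is a (subject, predicate, object) triple, and two independently compiled catalogs join automatically wherever they use the same identifier.

```python
# Catalog A: a hypothetical text catalog that records where a tablet was found.
catalogue_a = {
    ("http://example.org/place/uruk", "label", "Uruk"),
    ("http://example.org/tablet/T001", "foundAt", "http://example.org/place/uruk"),
}

# Catalog B: a hypothetical gazetteer with (approximate) coordinates.
catalogue_b = {
    ("http://example.org/place/uruk", "latitude", "31.32"),
    ("http://example.org/place/uruk", "longitude", "45.64"),
}

# Because both catalogs use the same identifier for Uruk, merging is a set union.
merged = catalogue_a | catalogue_b


def describe(subject, triples):
    """Collect everything the triples say about one subject."""
    return {predicate: obj for subj, predicate, obj in triples if subj == subject}


def site_of_tablet(tablet, triples):
    """Follow the foundAt link from a tablet to the description of its site."""
    site = describe(tablet, triples)["foundAt"]
    return describe(site, triples)


print(site_of_tablet("http://example.org/tablet/T001", merged))
```

Neither catalog alone can answer "what are the coordinates of the place where this tablet was found?"; the merged data can, which is the practical payoff of shared identifiers.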

Furthermore, machine learning can aid the decipherment of unpublished cuneiform texts. Reading ancient tablets is challenging: the corpus is substantial, and the work requires expertise not only in the script but also in the language, genre, and context of the text. These tasks can be aided by natural language processing (NLP) and optical character recognition (OCR) models.

NLP is a subfield of computer science and linguistics that studies and develops methods for computers to understand and analyze human languages. The best models for NLP and OCR are often neural networks – advanced machine learning (ML) models whose mechanism attempts to imitate the workings of the human mind.

Part-of-speech (POS) and syntactic labeling of a Neo-Assyrian letter (CDLI: P224395) in the semantic annotation tool INCePTION (https://inception-project.github.io/); labeling by Matthew Ong from UC Berkeley and the Digital Pasts Lab, for his project on metaphor detection in Akkadian texts using AI models.

But while a human does not need to see tens of thousands of images of cats and dogs to differentiate the two, neural networks do. This is not a problem when training models on the English language, for example, since it is easy to extract millions of English words from the Internet. The cuneiform languages, however, fall under the category of low-resource languages: they have a limited number of examples, and some, like Akkadian, also have more complex morpho-syntactic structures. The upside of cuneiform texts is that many are formulaic and repetitive, unlike texts in modern languages, which are highly variable in nature. This counteracts, to a certain extent, the sparseness of available texts.

One of the common tasks in NLP is restoring masked words in a sentence, a task that, in practice, is equivalent to restoring broken passages in cuneiform texts. Based on the complete sentences that the model has seen, it can predict how to restore broken sentences. The more examples it sees, the better the results. Two research teams have shown the effectiveness of NLP models for this task in Akkadian. One research group developed a model to restore words, training on Neo-Babylonian and Achaemenid archival and administrative texts from Achemenet. Another developed a model to restore missing signs, based on most corpora available on ORACC.
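
The logic of restoring a masked word can be illustrated with a deliberately simple stand-in. The sketch below is not the neural approach used by either research team; it merely counts which word appeared between a given pair of neighbors in complete sentences and proposes the most frequent one for a gap. The training sentences are invented, formulaic English placeholders, and the `[...]` mask token is an arbitrary choice.

```python
from collections import Counter, defaultdict

MASK = "[...]"  # arbitrary marker for a broken word


def train(sentences):
    """Count which word occurs between each (left, right) pair of neighbors."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for left, middle, right in zip(words, words[1:], words[2:]):
            contexts[(left, right)][middle] += 1
    return contexts


def restore(sentence, contexts):
    """Replace each masked word with the most frequent word seen in that context."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    restored = []
    for i in range(1, len(words) - 1):
        word = words[i]
        if word == MASK:
            candidates = contexts.get((words[i - 1], words[i + 1]))
            if candidates:  # leave the gap open if the context was never seen
                word = candidates.most_common(1)[0][0]
        restored.append(word)
    return " ".join(restored)


# Invented formulaic "texts" standing in for complete, unbroken lines.
complete = [
    "one shekel of silver belonging to the temple",
    "two shekels of silver belonging to the temple",
    "one shekel of silver belonging to the king",
    "one kor of barley belonging to the temple",
]
model = train(complete)

print(restore("two shekels of [...] belonging to the temple", model))
print(restore("[...] [...] of silver", model))
```

The first gap is filled with "silver" because that word appeared most often between "of" and "belonging". The second line stays broken, since two adjacent gaps leave no known context. This shows both why formulaic texts help and why more training examples improve the results.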

A related NLP task is predicting sequences of words, which in the case of cuneiform can also be signs. This can be adapted to the task of choosing a specific sign reading, given a sequence of cuneiform signs that have only been identified visually and are represented digitally as Unicode cuneiform. Using Neo-Assyrian royal inscriptions from ORACC, with their equivalent Unicode cuneiform glyphs, it was possible to effectively train a model to provide transliteration and segmentation of signs, creating the tool Akkademia. This was performed using neural networks, as before, and statistical models that look at the frequency with which certain sign readings follow each other.
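
The statistical side of this task can be sketched as follows. This is a simplified illustration and not Akkademia's actual code: it assumes each sign has a known list of possible readings, counts how often one reading follows another in training data, and then uses a Viterbi-style search to pick the highest-scoring sequence of readings. The score is based on smoothed counts rather than properly normalized probabilities. The sign names (`S1`, `S2`, `S3`) and their readings are invented placeholders, not real sign values.

```python
import math
from collections import Counter


def train_bigrams(reading_sequences):
    """Count how often each reading follows another (or starts a line)."""
    counts = Counter()
    for sequence in reading_sequences:
        padded = ["<s>"] + sequence
        counts.update(zip(padded, padded[1:]))
    return counts


def best_readings(signs, sign_values, bigrams):
    """Viterbi-style search for the reading sequence with the highest score."""
    # best maps the last reading to (score, path) of the best path ending there.
    best = {"<s>": (0.0, [])}
    for sign in signs:
        step = {}
        for reading in sign_values[sign]:
            # log(count + 1) keeps unseen pairs possible but unattractive.
            step[reading] = max(
                (score + math.log(bigrams[(prev, reading)] + 1), path + [reading])
                for prev, (score, path) in best.items()
            )
        best = step
    return max(best.values())[1]


# Hypothetical polyvalent signs: each can be read in two ways.
sign_values = {"S1": ["ka", "du"], "S2": ["ra", "tam"], "S3": ["bu", "sir"]}

# Invented training data: reading sequences from already transliterated lines.
training = [
    ["ka", "tam", "bu"],
    ["ka", "tam", "sir"],
    ["du", "ra", "bu"],
    ["ka", "tam", "bu"],
]
bigrams = train_bigrams(training)

print(best_readings(["S1", "S2", "S3"], sign_values, bigrams))
```

Here the sequence "ka tam bu" wins because each of its steps was seen more often in the training data than the alternatives. The reading of each sign is chosen in light of its neighbors rather than in isolation.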

The MAPA gazetteer uploaded to the World Historical Gazetteer website, a growing initiative to document and link historical geographical data from across disciplines.

To make cuneiform studies accessible to a larger audience, it is necessary to provide translations of the primary sources. While machine translations are often not perfect, they can still provide some basic understanding of a text at hand. Neural machine translation has recently been developed for Ur III Sumerian texts (under the Machine Translation and Automated Analysis of Cuneiform Languages project), the first time a cuneiform language has been translated by a machine.

The above-mentioned tasks could be performed by training on existing digital texts. The greatest challenge, however, lies in optical character recognition (OCR), the visual identification of cuneiform signs, be that from handcopies, 2D and 2D+ images, or 3D models. For this task, no digital training data existed; it had to be created specifically. The number of available datasets is still small but gradually growing, along with dedicated editors and tools for cuneiform that are under construction.

The outputs of Akkademia’s machine learning models (HMM, MEMM, and biLSTM) for column 2, lines 31-34 of Sennacherib’s clay prism (CDLI: P430082). The translation of these lines: ‘On my return march, I received a heavy tribute from the distant Medes, of whose land none of the kings, my ancestors, had heard mention’, translation adapted from A.K. Grayson and J. Novotny’s ORACC edition (http://oracc.org/rinap/Q003497/).

What will the field of Assyriology look like if computers are able to perform all of these tasks, from visual recognition to transliteration and segmentation, and even translation? Can we say that the models are reading cuneiform, like a scholar? It is important to remember that even when the models perform well, they do not perform perfectly. There is always a margin of error. Besides that, reading and reconstructing cuneiform texts does not create a final truth – many issues are open to interpretation.

Furthermore, computers do not view texts like humans do. Words and signs are translated to numbers, and multiple tasks that a human performs at once need to be broken down into parts. The models repeat to us what they have seen most often, what is the most likely scenario, but no more than that. Even for English, tasks that involve language comprehension, that is, understanding the meaning of a text, are the most difficult, and do not yet provide satisfactory results in all cases.


The Assyriologist, then, needs to guide the models, and to correct and analyze their results from a humanistic perspective. The models described above can be viewed as effective assistants, significantly shortening the time-consuming tasks needed for the creation of textual editions. As the number of digital cuneiform resources and tools grows, the possibilities discussed at the outset will become more and more attainable. Tasks that Assyriologists deem desiderata, like cuneiform paleography and orthography, investigating intertextuality in literary works, or the identification of people and places, can be performed more comprehensively, some for the first time.

The combination of human and machine-based approaches will offer fresh perspectives on such classical problems, as well as open new avenues for research. Digitally curating Assyriological data will allow it to join the global community of knowledge through the semantic web, and make this specialist field more present, impactful, and relevant to the wider research community.

Screenshot of the new Babylonian Engine portal, currently under development, with the CuRe (Cuneiform Recognition) tool for identifying cuneiform signs from handcopies. CuRe can be used to view the results of the machine learning models, correct them, and save the data as part of a digital scholarly edition. The image is from Strassmaier’s copies of archival texts from the reign of Cyrus, and the transliteration is published on Achemenet.

Shai Gordin is Senior Lecturer in Assyriology and Digital Humanities at Ariel University (Digital Pasts Lab).

Avital Romach is a Ph.D. student in Assyriology at Yale University.

Further Reading

Alstola, Tero, Heidi Jauhiainen, Saana Svärd, Aleksi Sahala, and Krister Lindén. 2022. “Digital Approaches to Analyzing and Translating Emotion: What Is Love?” Pages 88-116 in The Routledge Handbook of Emotions in the Ancient Near East, edited by Ulrike Steinert and Karen Sonik. Routledge: London.

Bogacz, Bartosz and Hubert Mara. 2022. “Digital Assyriology—Advances in Visual Cuneiform Analysis.” JOCCH 15/2: 1–22. https://doi.org/10.1145/3491239

Chiarcos, Christian, Émilie Pagé-Perron, Ilya Khait, Niko Schenk, and Lucas Reckling. 2018. “Towards a Linked Open Data Edition of Sumerian Corpora.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), May. https://aclanthology.org/L18-1387.

Fetaya, Ethan, Yonatan Lifshitz, Elad Aaron, and Shai Gordin. 2020. “Restoration of Fragmentary Babylonian Texts Using Recurrent Neural Networks.” PNAS 117:22743–51. https://doi.org/10.1073/pnas.2003794117.

Gordin, Shai, Gai Gutherz, Ariel Elazary, Avital Romach, Enrique Jiménez, Jonathan Berant, and Yoram Cohen. 2020. “Reading Akkadian Cuneiform Using Natural Language Processing.” PLOS ONE 15:e0240511. https://doi.org/10.1371/journal.pone.0240511.

Gordin, Shai, Samuel Clark, and Avital Romach. In print. “MAPA: A Linked Open Data Gazetteer of the Southern Babylonian Landscape.” Interdisciplinary Digital Engagement in Arts & Humanities.

Homburg, Timo, Robert Zwick, Hubert Mara, and Kai-Christian Bruhn. 2022. “Annotated 3D-Models of Cuneiform Tablets.” Journal of Open Archaeology Data 10/4. http://doi.org/10.5334/joad.92

Lazar, Koren, Benny Saret, Asaf Yehudai, Wayne Horowitz, Nathan Wasserman, and Gabriel Stanovsky. 2021. “Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach.” Pages 4682–91 in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, November. https://aclanthology.org/2021.emnlp-main.384.

Punia, Ravneet, Niko Schenk, Christian Chiarcos, and Émilie Pagé-Perron. 2020. “Towards the first machine translation system for Sumerian transliterations.” Pages 3454–3460 in Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.308

Sahala, Aleksi, Tero Alstola, Jonathan Valk, and Krister Lindén. 2022. “BabyLemmatizer: A Lemmatizer and POS-tagger for Akkadian.” Pages 14-18 in CLARIN Annual Conference Proceedings, 2022, Prague, Czechia, edited by Tomaž Erjavec and Maria Eskevich. https://office.clarin.eu/v/CE-2022-2118-CLARIN2022_ConferenceProceedings.pdf

Veldhuis, Niek. 2021. “Exploring Ancient Networks.” H2D|Digital Humanities Journal 3(1). https://doi.org/10.21814/h2d.3508

Selective links to recent projects in Digital Assyriology (for a fuller listing see for example https://www.arch.cam.ac.uk/about-us/mesopotamia/online-resources-for-mesopotamia):

Annotated Corpus of Ancient West Asian Imagery: Cylinder Seals (Roßberger, Elisa et al.): https://www.acawai-cs.gwi.uni-muenchen.de/
Computational Assyriology Jupyter Notebooks (Veldhuis, Niek): https://github.com/niekveldhuis/compass
Prosobab: Prosopography of Babylonia (c. 620-330 BCE) (Waerzeggers, Caroline and Melanie Groß): https://prosobab.leidenuniv.nl/index.php
CUNE-IIIF-ORM: Towards an Internationally Image Interoperable Corpus of Cuneiform Tablets (De Graef, Katrien et al.): https://www.kmkg-mrah.be/en/scientific-research/cune-iiif-orm
Late Babylonian Signs (Jursa, Michael and Reinhard Pringruber): https://labasi.acdh.oeaw.ac.at/
LIBER: Applying Machine Learning and Computer Vision to the Study of Scribal Marks on Cuneiform Tablets (Corò, Paola et al.): https://pric.unive.it/projects/liber/home
Sumerian Networks JupyterBook (Veldhuis, Niek et al.): https://niekveldhuis.github.io/sumnet/welcome.html

How to cite this article

Gordin, S. and A. Romach. 2022. “The Cuneiform Wide Web: From Card Catalogues to Digital Assyriology.” The Ancient Near East Today 10.10. Accessed at: https://anetoday.org/gordin-romach-digital-cuneiform/.
