
Contributions to Computational Assyriology

Aleksi Sahala

Doctoral dissertation, to be presented for public discussion with the permission of the Faculty of Arts of the University of Helsinki, in Auditorium 1, Metsätalo, on the 31st of August, 2021 at 15 o’clock.


ISBN 978-951-51-7416-1 (PDF) University of Helsinki

Helsinki 2021


Abstract

This thesis explores the use of Natural Language Processing (NLP) on the ancient Mesopotamian primary sources, especially those written in the Akkadian language documented from 2400 BCE to 100 CE. The methods and tools proposed in this thesis aim to fill the gaps left in previous research in Computational Assyriology, contributing to the transformation of transliterated cuneiform tablets into richly annotated text corpora, as well as to the quantitative lexicographic analysis of cuneiform texts.

Three contributions of this thesis address the task of transforming Akkadian from its basic Latinized representation, transliteration, into linguistically annotated text corpora. These include (I) automatic phonological transcription of transliterated cuneiform text, which is essential for normalizing the diverse spelling variations encountered in the Akkadian writing system; (II) automatic morphological analysis of Akkadian that allows deconstructing word forms into morphological labels, lemmata and part-of-speech tags to improve the usability of Akkadian corpora for quantitative analysis; and (III) creation of a morphological gold standard, and a standardized Universal Dependencies approved morphological label set for Akkadian morphology as the byproduct of an Akkadian treebank.

Three contributions address the previously unexplored quantitative analysis of Akkadian lexical semantics using word association measures and word embeddings in order to better understand the language in its own terms.

One of these contributions is (IV) an algorithmic method for reducing the distortion caused by fully or partially duplicated sequences in Akkadian texts.

This algorithm solves issues encountered in pointwise mutual information (PMI)-based collocation analysis, and according to preliminary results, also in PMI-based word embeddings. Two contributions (V and VI) are quantitative case studies that demonstrate the use of PMI and word embeddings to gain insights into the concepts of seeing and fearing in Akkadian texts.

The last contribution (VII) is a hybrid approach, where PMI is applied to social network analysis of the Neo-Assyrian pantheon in order to reinforce the statistical relevance between the actors, and to study the position of the Assyrian main god, Aššur, within it.

In addition to the contributions, this thesis presents the first survey of Computational Assyriology, which covers six decades of research on automatic artifact reconstruction, optical character recognition, linguistic annotation, and quantitative analysis of cuneiform texts.


Preface

My journey to Computational Assyriology has been full of coincidences. I began my studies in Network Engineering, but my sudden interest toward languages, which at that time were Old English, Náhuatl and Proto-Uralic, tempted me to change my subject to Computational Linguistics. With some of my new classmates, Stephan and Ilari, I began to attend a wide range of language courses from Basque to Korean just out of curiosity. One of these courses was introductory Sumerian lectured by Simo Parpola. I never had any particular interest toward Mesopotamia before, but Simo’s knowledge about every little detail of the texts we read made even the most mundane things sound fascinating. Not long after, I had taken 40 credits worth of Sumerian and Akkadian courses, which I could no longer satisfactorily include in my degree unless I made Assyriology my minor subject. Ultimately, I ended up taking as many Assyriology courses as I had taken courses in Computational Linguistics.

This made me eligible to write a combined Master’s thesis, which I wrote in 2014 to leave doors open to continue my studies in either of the subjects.

Choosing Assyriology to accompany Computational Linguistics turned out to be a good choice, as in 2016, 2017 and 2018 Saana Svärd and Krister Lindén successfully applied for funding for three projects, which all involved the use of computers to study cuneiform texts: Semantic Domains in Akkadian Texts (SemDom) (2016–2020) funded by the Academy of Finland and led by Krister Lindén, Deep Learning and Semantic Domains in Akkadian Texts (2017–2020) funded by the University of Helsinki and led by Saana Svärd, and the Centre of Excellence in Ancient Near Eastern Empires (ANEE) (2018–2025) also funded by the Academy of Finland and led by Saana Svärd.

The work presented in this thesis was conducted in the Deep Learning and Semantic Domains in Akkadian Texts project in close co-operation with SemDom and ANEE. During the finalization of this thesis in 2021, I was employed by FIN-CLARIN, led by Krister Lindén. I am very grateful to ANEE for funding my visit to the University of California, Berkeley, in 2019, and Niek Veldhuis for inviting me. Two (and a half) of the publications included in this thesis were written during my stay in Berkeley.

I wish to thank my supervisors Saana Svärd and Krister Lindén, and my co-authors in Helsinki and abroad: Tero Alstola, Heidi Jauhiainen, Shana Zaia (Vienna), Mikko Luukko, Sam Hardwick, Miikka Silfverberg (British Columbia), and Antti Arppe (Alberta). I also express my gratitude to Robert M. Whiting, and the preliminary examiners Steve Tinney (Pennsylvania) and Gerlof Bouma (Gothenburg) for providing valuable feedback on the manuscript, Johannes Bach for commenting on some of my papers, Laurie Pearce (Berkeley) for helping me to find some relevant Assyriological publications, Jyrki Niemi for assisting me with Korp updates, and Sebastian Fink (Innsbruck) for interesting research ideas for the future.

I owe my thanks to Adam Anderson (Berkeley) for completing my list of Computational Assyriology related publications with some early papers that I failed to discover by myself. I am also grateful to Tommi Jauhiainen, Fumi Karahashi (Chuo) and Evelien Vanderstraeten for pointing out a few missing publications, and to Ziya Aktaş (Başkent), who kindly answered my questions about pioneering work done on computerization of Hittite cuneiform in Turkey in the 1980s.

I am indebted to everyone involved in the Open Richly Annotated Cuneiform Corpus (Oracc), whether by contributing new data, annotating it, or maintaining the corpus. I am particularly thankful to Jamie Novotny (Munich) and Niek Veldhuis, who have always provided me with whatever data or information about Oracc I have ever needed. Without Oracc and the people behind it, none of the research presented in this thesis would have been possible.

Finally, I thank my close ones for their support during this project.


Contents

Abstract
Preface
List of publications
Abbreviations and notation
1 Introduction
1.1 History of Assyriology
1.2 Languages of Mesopotamia
1.2.1 Sumerian
1.2.2 Akkadian
1.3 Cuneiform writing
1.3.1 Development
1.3.2 Transliteration and transcription
1.4 Research objectives and motivation
1.5 Author’s contributions
1.6 The Data
2 An Overview of Computational Assyriology
2.1 Automatic artifact reconstruction
2.2 Automatic transliteration and transcription
2.2.1 Optical character recognition of cuneiform
2.2.2 Tokenization, transliteration and transcription
2.3 Linguistic annotation
2.3.1 Morphological analysis, POS-tagging and lemmatization
2.3.2 Syntactic annotation
2.3.3 Named-entity recognition
2.3.4 Machine translation
2.3.5 Cuneiform language identification
2.4 Content analysis
2.4.1 Semantic analysis
2.4.2 Social network analysis
2.4.3 Other quantitative approaches
3 Discussion
3.1 Conclusions and future work
4 References
4.1 Referenced language resources
5 Publications


List of publications

Publication I Sahala, A., Silfverberg, M., Arppe, A. & Lindén, K. (2020). Automated phonological transcription of Akkadian cuneiform text. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3528–3534. European Language Resources Association (ELRA). [6 pages] [hdl.handle.net/10138/317688]

Publication II Sahala, A., Silfverberg, M., Arppe, A. & Lindén, K. (2020). BabyFST: Towards a finite-state based computational model of ancient Babylonian. Proceedings of the 12th Conference on Language Resources and Evaluation, pp. 3886–3894. European Language Resources Association (ELRA). [8 pages] [hdl.handle.net/10138/317691]

Publication III Luukko, M., Sahala, A., Hardwick, S. & Lindén, K. (2020). Akkadian Treebank for early Neo-Assyrian Royal Inscriptions. Proceedings of the 19th Workshop on Treebanks and Linguistic Theories, pp. 124–134. Association for Computational Linguistics (ACL). [10 pages] [hdl.handle.net/10138/322305]

Publication IV Sahala, A. & Lindén, K. (2020). Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting. Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management – KDIR, pp. 48–58. Science and Technology Publications. [11 pages] [DOI: 10.5220/0010106800480058]

Publication V Sahala, A. & Svärd, S. (2021). Language Technology Approach to ‘Seeing’ in Akkadian. Handbook of the Senses in the Ancient Near East, ed. by K. Neumann & A. Thomason. Routledge/Taylor and Francis. (Accepted). [18 pages]

Publication VI Svärd, S., Alstola, T., Jauhiainen, H., Sahala, A. & Lindén, K. (2021). Fear in Akkadian Texts: New Digital Perspectives on Lexical Semantics. The Expression of Emotions in Ancient Egypt and Mesopotamia, ed. by S.-W. Hsu & J. Llop-Raduà. Leiden: Brill, pp. 470–502. Culture and History of the Ancient Near East 116. [32 pages] [DOI: 10.1163/9789004430761_019]

Publication VII Alstola, T., Zaia, S., Sahala, A., Jauhiainen, H., Svärd, S. & Lindén, K. (2019). Aššur and his Friends: A Statistical Analysis of Neo-Assyrian Texts. Journal of Cuneiform Studies 71, pp. 159–180. ASOR. [21 pages] [DOI: 10.1086/703859]


Abbreviations and notation

Dialects and development stages of Akkadian

OAkk Old Akkadian

OA Old Assyrian

OB Old Babylonian

MA Middle Assyrian

MB Middle Babylonian

NA Neo-Assyrian

NB Neo-Babylonian

LB Late Babylonian

SB Standard Babylonian

Methods

CNN Convolutional Neural Network

CSW Context Similarity Weighting

FST Finite-State Transducer

LSTM Long Short-Term Memory (neural network)

OCR Optical Character Recognition

PMI Pointwise Mutual Information

RNN Recurrent Neural Network

SNA Social Network Analysis

Other

POS Part-of-speech

For abbreviations of language resources see 4.1.


1 Introduction

This thesis describes the development and application of Natural Language Processing methods to ancient Mesopotamian texts, primarily those written in the Akkadian language. The contributions of this thesis aim to fill the gaps left by previous research in Computational Assyriology.1 Some of the research papers included in this thesis contribute to a text processing pipeline that transforms texts from digital representations of tablet fragments into fully annotated text corpora. This is important because it saves the limited resources of the small Assyriological research community and allows researchers to concentrate their energy on analyzing the contents of cuneiform texts instead of manually digitizing and annotating them. Tools or prototypes of tools for automatic tablet reconstruction, optical character recognition, and automatic tokenization and transliteration of cuneiform have already been published, but the tasks of automatic phonological transcription and morphological analysis, which also allow more sophisticated lemmatization and POS-tagging of the Akkadian language, have not been studied in detail, or at all. There are existing tools for Akkadian lemmatization and POS-tagging, but they rely on dictionary-based approaches that are not capable of making generalizations and recognizing previously unseen word forms.

The annotation tools presented in this thesis aim to provide better grounds for analyzing the contents of Akkadian cuneiform texts, especially by means of social network analysis and distributional semantics, in order to answer Assyriological research questions. Other research papers included in this thesis contribute to these topics by improving the existing methods and applying them to Akkadian data.

This thesis is structured as follows: Chapter 1 provides an overview of traditional Assyriology, including the two most extensively documented and studied Mesopotamian languages (Akkadian and Sumerian), the cuneiform writing system and its Latinization conventions, and explains the motivation and the main contributions of this thesis. Chapter 2 is a survey of previous NLP related research on Computational Assyriology, which aims to provide a context for the publications included in this thesis. Chapter 3 is dedicated to discussion, conclusions, and planned future work.

1 Often also called Digital Assyriology (1,340 hits on Google) after Digital Humanities and the closely related field of Digital Archaeology. I personally favor the term Computational Assyriology (401 hits on Google) due to its many analogies in other fields of science, such as Economics, Biology, Physics, Chemistry and Linguistics. I also prefer to emphasize the methodological (computational) side of research instead of its platform (digital).


1.1 History of Assyriology

For centuries, the western knowledge of Mesopotamia was mostly dependent on the works of classical authors, the Hebrew Bible, and some published accounts of medieval travelers. One such traveler was Benjamin de Tudela, who visited the ruins of Nineveh already in 1170 during his journeys in the Middle East and left behind his account of Mesopotamia that was first transmitted in a manuscript and later printed in 1543 (Ooghe 2007). From the 16th century onward, many Europeans began to visit the ancient Mesopotamian sites in hopes of connecting them with the almost mythical locations mentioned in the classical and biblical sources.

Some primary sources written in cuneiform were discovered by westerners already in the 17th century. In 1611, Antoine de Gouvea, who was a professor of theology and the rector of the College of Goa, mentioned the strange writing he saw in the ruins he had visited nine years before in Persia, and only five years later in 1616, Pietro Della Valle discovered Babylon and was the first to bring cuneiform inscribed bricks back to Europe (Meade 1974). More inscriptions carved in black jasper were found in 1618 in the ruins of Persepolis by García de Silva Figueroa, and ultimately the first drawings of cuneiform texts were published in the latter half of the 17th century (Robinson 2007). However, unlike the beautiful hieroglyphs of Egypt, cuneiform inscriptions did not draw much public interest. There was even a dispute over whether the mysterious wedgelike symbols represented writing at all, or were merely ancient geometric experiments, ornaments, or even bird-tracks perpetuated on soft clay (Robinson 2007).

The first insight toward the decipherment of cuneiform was made in 1712 by Engelbert Kaempfer, who noticed that the cuneiform inscriptions contained different types of cuneiform scripts. Further advances were made by Carsten Niebuhr, who made a reasonably accurate copy of the trilingual inscription found at Persepolis in 1767, and was the first to discover that the script was written left-to-right and that it featured three distinct cuneiform scripts (Robinson 2007).

The decipherment was continued by Georg Grotefend, who focused on the simplest of the three scripts, the Old Persian cuneiform, which at the time was called “Class I” script. In 1802, Grotefend applied a method that relied on comparing different inscriptions with each other and trying to find recurring sequences of signs that could represent names of Persian kings known from Greek sources in similar patterns as found in later Avestan texts (Robinson 2007). This approach proved to be fairly successful, and Grotefend was able to decipher the script at least partly. Following his breakthrough, other scholars such as Henry Rawlinson contributed to the decipherment of the Old Persian script with the aid of the freshly copied trilingual Behistun inscription, and already by 1848 the script was largely deciphered.

The ability to read Old Persian provided grounds for deciphering the remaining “Class II” and “Class III” scripts (which were actually the same script but encoded different languages). In 1846, Edward Hincks, a scholar who had already worked on the Old Persian cuneiform, announced that he had made some progress working on the third script and stated that it represented a Semitic language he called Assyrio-Babylonian. In 1850, he also made an interesting note and claimed that the Assyrio-Babylonian script likely contains foreign signs, which later were shown to be Sumerian logograms. The decipherment of the Class II script, today known as Elamite, was also progressing. Edwin Norris published the Elamite part of the Behistun inscription in 1853, building on the earlier work of Hincks and Nils Ludwig Westergaard, who had already deciphered readings for several Elamite signs in 1844 and 1845 (Cathcart 2011).

Independently of Hincks, Henry Rawlinson also worked on the decipherment of the Assyrio-Babylonian script, and further contributions were also made by Julius Oppert and Henry Fox Talbot. In 1857, the Royal Asiatic Society organized an experiment, where Hincks, Rawlinson, Oppert and Talbot were assigned to translate a yet unpublished cuneiform inscription. The translations agreed to such an extent that the Assyrio-Babylonian (now better known as Sumero-Akkadian) cuneiform was proclaimed to be adequately deciphered (Robinson 2007: 79).

The growing interest toward ancient Mesopotamia yielded a series of archaeological excavations. The first systematic excavations were begun already in 1842 by Paul-Émile Botta in Mosul, and in 1851 Austen H. Layard discovered the library of Aššurbanipal at Nineveh. Eventually ca. 30,000 tablets representing various genres of texts were found (Parpola 1983), including some masterpieces of Akkadian literature, such as the Epic of Gilgameš. Layard’s additional findings at Nimrud brought Assyrian reliefs and sculptures to the British Museum and sparked even more public interest toward Mesopotamian history and culture. Additional remarkable discoveries, such as the remains of the Ištar Gate, were made in the excavations at Babylon from 1899 to 1917 funded by the German Oriental Society (Fitzgerald 2019).

These and numerous subsequent excavations have ultimately yielded more than half a million tablets or their fragments. Some of the largest collections are housed in the British Museum (>130,000 tablets) and the Iraq Museum in Baghdad (>100,000 tablets), but individual collections of tens of thousands of tablets are also found in several other museums including Ankara, Istanbul, Berlin, Paris and Idlib, and in universities such as Yale and Philadelphia (Streck 2010). Thus, what began as an almost complete lack of primary sources, transformed over the course of 150 years into a field of study with vast amounts of fragmentary source material that allowed the Assyriologists to recover three millennia of Mesopotamian history, culture and religion, as well as to rediscover several long-forgotten cuneiform languages, including Akkadian, Sumerian, Eblaite, Elamite, Urartian, Hurrian, Hittite, Luwian and Hattic.


1.2 Languages of Mesopotamia

With the exception of Semitic Akkadian and Eblaite, and Indo-European Hittite and Luwian (spoken in Anatolia), most of the documented cuneiform languages are either isolates, or have only a single known long extinct relative (as in the case of Hurrian and Urartian, which belong to the same language family). The extensive language shifts from Sumerian to Akkadian and from Akkadian to Aramaic that took place in Mesopotamia likely obliterated the closest relatives of Sumerian (Michalowski 2000: 180), and this was probably the fate of many other prehistoric languages and language families of the region.

Streck (2010) estimates the total word count of the five best documented cuneiform languages as 10 million words for Akkadian, 3 million words for Sumerian, 770,000 words for Hittite, 300,000 words for Eblaite, and 100,000 words for Elamite. The figures are based on the number of catalogued tablets and do not reflect the state of digitized resources, which at present represent only a fraction of the excavated texts especially for Akkadian. However, for Sumerian, Streck’s estimate is already outdated, as the JSON dumps of EPSD2 alone contain ca. 4.41 million words as of 2021.2 The estimated word counts of the remaining languages, Hurrian, Urartian, Luwian and Hattic are significantly smaller.3

This section gives a brief introduction to Sumerian and Akkadian, the two most researched languages in traditional and Computational Assyriology. The Akkadian language will be discussed in closer detail to provide background for the contributions of this thesis.

1.2.1 Sumerian

Sumerian is documented as a written language from the late fourth millennium BCE until the first or the second century CE. Although the earliest corpus of 6,000 administrative texts from the Uruk IV/III periods (3200–3000 BCE) is written in a pictographic proto-cuneiform script that shows very little evidence of the underlying language, most scholars today agree on its Sumerian origin (Wilcke 2005). Texts that indisputably feature Sumerian grammatical elements begin to appear around the 28th century BCE (Jagersma 2010: 4).

Old Sumerian is attested from 2600 BCE onwards. During this period the use of written language extended first from administrative purposes to literary texts, incantations and legal documents, and ca. 2500 BCE to royal inscriptions, letters and dedicatory texts. In the 24th century BCE, the Sumerian heartland fell under Akkadian control and Akkadian became the dominant language of the region. This resulted in widespread bilingualism and intensive linguistic interaction known as the Sumero-Akkadian Sprachbund. This interaction not only led to lexical borrowings, but it also had an effect on the Sumerian and Akkadian phonology, and to some extent, grammar (Edzard 2003: 173).

2 [http://oracc.org/epsd2/json] (Accessed 2021-06-01)

3 Streck (2010) gives some estimates, but at least the size of the Urartian corpus is outdated. Oracc gives a word count of 27,000 for its Urartian corpus [http://oracc.org/ecut] (Accessed 2021-04-01), but Streck's estimate in 2010 was only 10,000.

During the following Neo-Sumerian period, which covers the historical Lagaš II (2200–2113 BCE) and Ur III (2112–2004 BCE) periods, southern Mesopotamia underwent a Sumerian renaissance. Although Akkadian was still used as a language of everyday communication, Sumerian regained its position as the administrative language. The Neo-Sumerian period left behind ca. 120,000 administrative documents (Molina 2008: 20), as well as one of the most important Sumerian texts, the Cylinders of Gudea, an almost perfectly preserved temple building hymn comprising 1,300 lines inscribed on two clay cylinders (Edzard 1997).

The fall of the Ur III dynasty in 2004 BCE marks the beginning of the Late Sumerian period, which coincides with the historical Old Babylonian period (2000–1500 BCE). It has been suggested that the Sumero-Akkadian bilingualism had already shifted in favor of Akkadian in the early 21st century BCE, and that Sumerian as a native tongue became marginalized in the 20th century BCE. Most scholars agree that Sumerian died as a spoken language at the latest in the 18th century BCE. The Sumerian language documented after its death is sometimes separated from Late Sumerian and called Post-Sumerian. It was still studied in scribal schools as a part of the curriculum, and older texts were copied and new ones composed probably until the first or the second century CE (Geller 1997). The Late and the Post-Sumerian texts consist of literary compositions (songs, prayers, epics, myths, dialogues, proverbs), grammatical texts, and lexical lists of various kinds, as well as laments written in Emesal, a Sumerian liturgical language (Thomsen 1984: 30–33).

Linguistically, Sumerian is a language isolate without known relatives. It is documented in two distinct dialects, the main dialect Emeĝir, and the liturgical language Emesal, attested from the Old Babylonian period onwards (Schretter 1990). Moreover, there were two regional Emeĝir Sumerian dialects, North Sumerian and South Sumerian, but the study of their distinctive features has proven to be challenging due to the opacity of the cuneiform script. These dialects coexisted during the Old Sumerian period and are only distinguished by a handful of morphological and phonological differences. South Sumerian evolved into Neo-Sumerian, whereas the North Sumerian dialect disappeared by the end of the Old Sumerian period (Jagersma 2010: 6–9).

Typologically, Sumerian is a split-ergative agglutinative language. It features a rich noun declension comprising at least 10 grammatical cases, possessive suffixes and distinct ways to express grammatical number in singular, plural and distributive depending on the animacy of the noun.

Sumerian verb conjugation involves using a prefix chain to mark various grammatical functions, including mood, voice, subject and object. There are also various, partly mutually exclusive dimensional prefixes that agree with the verb’s arguments in animacy, number, person and case. Verbal suffixes are used to mark subject, object, nominalization and various syntactic functions. Some characteristic features of Sumerian are productive use of reduplication, use of perfective and imperfective aspects instead of tense, and the ability to decline phrases and clauses as if they were single nouns.

Although scholars of the Sumerian language generally agree on the description of the language, certain details of the verbal prefix chain are still widely disputed. Most of the conflicting views concern the form, function and mutual combinatorics of verbal prefixes traditionally called “the conjugation prefixes” (e.g., compare Jagersma 2010, Michalowski 2008, Foxvog 2014, and Woods 2008), but the nature of the local prefixes has also divided opinions (Jagersma 2010, Zólyomi 2017).

The first comprehensive Sumerian grammars were published by Langdon (1911) and Poebel (1923). Some important later descriptions of Sumerian are Thomsen (1984), Edzard (2003) and recently Jagersma (2010), which is to date the most comprehensive monograph on the Sumerian language. The first Sumerian grammar published in Finnish is Sahala (2018).

1.2.2 Akkadian

Akkadian was an East Semitic language best known as the language of the Babylonians and the Assyrians. The earliest texts written in Akkadian consist of 1,575 documents dating back to the Sargonic Empire (2350–2170 BCE) (Streck 2011a). From the beginning of the second millennium BCE, Akkadian is attested as two distinct dialects: Babylonian in the south and Assyrian in the north. Although these dialects were mutually intelligible, they were distinguished from each other by certain phonological and morphological differences, as well as partly by their lexicon.

Grammatically Akkadian is a typical Semitic language featuring a combination of linear and nonlinear morphology. Stem derivation is mostly done by using nonlinear morphological processes such as interdigitation and infixation, whereas grammatical conjugation and declension utilizes linear agglutination (Huehnergard & Woods 2008: 106–107, 117).

Nouns have three states (rectus, constructus and absolutus) and two genders (masculine and feminine), and are declined in number and case. Akkadian has three numbers, of which dual is generally restricted to nouns describing body parts. The case system consists of three productive cases: nominative, accusative and genitive, dative in pronouns and pronominal suffixes, and marginally used locative and terminative cases. The noun declension also involves using various other suffixes, such as the particularizing {ān}, abstract {ūt}, and possessive suffixes that have a distinct form in every person (von Soden 1995: 91–112). In general, the number and case marking strategies in Akkadian nouns are rather diverse. Consider the following Old Babylonian examples of the nouns šarrum “king” and šarratum “queen” in the nominative case (Table 1). The first two rows are in the rectus and construct states, and the last in the construct state with a third person singular masculine possessive suffix {šu} “his” attached.

          Masculine               Feminine
          Singular    Plural      Singular      Plural
Rectus    šarr|um     šarr|ū      šarr|at|um    šarr|āt|um
Constr.   šar         šar         šarr|at       šarr|āt
Poss.     šarr|a|šu   šarr|ū|šu   šarr|as|su4   šarr|āt|ū|šu

Table 1. An example of noun declension (with segmentation cues). The nominative is marked with {u, Ø, a} ({ū} in the feminine plural possessed construct state), plural masculine nominative with {ū, Ø}, singular feminine with {at} and plural feminine with {āt}. The morpheme {m} is mimation that does not carry a meaning here.

Verbal stems are formed by interdigitating root consonants (often called radicals) and lexically defined vocalizations into pattern templates. The roots can be strong or weak, the latter having one or more radicals that are typically not directly observable in the surface forms (Kouwenberg 2010: 447) (Table 2).

G-Infinitive          Root + vowel class   Deep          Surface
parāsu “to decide”    prs a/u              {i+prus+ū}    iprusū
nadānu “to give”      ndn i/i              {i+ndin+ū}    iddinū
amāru “to see”        ʾmr a/u              {i+ʾmur+ū}    īmurū
wašābu “to sit”       wšb a/i              {i+wšib+ū}    ūšibū or ūšbū
dâku “to kill”        dwk u/u              {i+dwuk+ū}    idūkū
nadû “to throw”       ndj i/i              {i+ndij+ū}    iddû

Table 2. Interdigitation of different roots into a G-Preterite pattern C1 C2 V2 C3 with the plural third person masculine personal circumfix {i+X+ū}.
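The interdigitation summarized in Table 2 can be illustrated with a minimal Python sketch. It is not the morphological analyzer described in this thesis: the function names and the set of weak radicals are hypothetical, the choice of the second vowel of the class for the preterite is read off the table, and the only surface rule modeled is the assimilation of a root-initial /n/ seen in iddinū. Roots with weak radicals are deliberately rejected at the surface level, because their surface forms require rules that this sketch does not attempt.

```python
# Illustrative sketch only; names and the weak-radical set are assumptions.
WEAK_RADICALS = {"ʾ", "w", "j"}


def preterite_vowel(vowel_class: str) -> str:
    """Return the second vowel of a class such as 'a/u' (as used in Table 2)."""
    return vowel_class.split("/")[1]


def deep_form(root: str, vowel_class: str) -> str:
    """Interdigitate a triradical root into C1 C2 V2 C3 inside {i+X+ū}."""
    if len(root) != 3:
        raise ValueError("expected exactly three radicals")
    c1, c2, c3 = root
    return "{i+" + c1 + c2 + preterite_vowel(vowel_class) + c3 + "+ū}"


def surface_form(root: str, vowel_class: str) -> str:
    """Derive the surface form for strong roots and roots with initial n."""
    if len(root) != 3:
        raise ValueError("expected exactly three radicals")
    if WEAK_RADICALS & set(root):
        raise ValueError("weak radicals are not modeled in this sketch")
    c1, c2, c3 = root
    if c1 == "n":
        # Root-initial n assimilates to the following consonant (nd -> dd).
        c1 = c2
    return "i" + c1 + c2 + preterite_vowel(vowel_class) + c3 + "ū"


print(deep_form("prs", "a/u"), surface_form("prs", "a/u"))
print(deep_form("ndn", "i/i"), surface_form("ndn", "i/i"))
```

Keeping the deep and surface levels as separate functions mirrors the two columns of the table: the deep form is pure concatenation, while every difference between the columns has to be expressed as an explicit rule.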

In addition to finite forms, verbal roots can also be used in noun formation (Huehnergard & Woods 2008: 109). Akkadian verbs have four main stems (G, D, Š, N) which can be further extended by using {t} and {tan} infixes.5 The stems are used to mark various grammatical categories from causative to factitive, passive, reciprocal and intensive, and they all occur in three tenses: present, preterite and perfect. Moods, with the exception of imperative, are formed by affixation. The morphological template of the finite verb consists of nine slots for morphemes such as person prefixes and suffixes indicating the subject (or object in passive), the verbal stem and its preformatives, a subordinative marker, deictic ventive element, and indirect and direct object markers, which occur in all persons, genders and numbers. Akkadian also features a special verbal form known as the stative, which is essentially a verbal adjective that is conjugated with a special set of personal suffixes (von Soden 1995: 120–157).

4 Assimilated from *šarratšu.

5 For example, G: wabālu “to carry”, Š: šūbulu “to deliver”, Št: šutābulu “to be delivered”.

1.2.2.1 Dialects and development

Old Akkadian (OAkk), also called Sargonic Akkadian, is the earliest documented form of the Akkadian language, dating back to 2350–2150 BCE. The size of the Old Akkadian corpus is ca. 1,600 texts and it consists of royal inscriptions, incantations, letters, and administrative and legal documents (Streck 2011a). One feature of Old Akkadian is the productive use of the archaic dual forms, which may occur for instance in nouns (šarrān “two kings”) and pronominal suffixes such as {šunēti} “these two (accusative)”, {kunē} “your two (genitive)”, and are thus not restricted to paired body parts as in the later dialects of Akkadian. Also, the use of the terminative {iš} is still productive (bētiš “to a/the house”) (Hasselbach 2005). The most recent grammatical description of Old Akkadian is Hasselbach (2005).

Old Babylonian (OB) is attested already in the Ur III period (ca. 2100–2000 BCE) in Ešnunna and Mari as a distinct dialect of Akkadian. Babylonian began to spread around Mesopotamia during the following centuries, ultimately being used from northern Syria to the southwestern parts of Iran. A vast majority of the Old Babylonian text material comes from the classical Old Babylonian period (1800–1500 BCE). This corpus comprises about 45,000 texts of various genres from letters to administrative and legal texts, royal inscriptions, omens, lexical texts, and literary works (Streck 2011b).

Old Babylonian underwent several phonological changes likely due to the influence of the Sumerian language. Several archaic Semitic gutturals /h ḥ ʿ ġ/ were lost, either by merger with /ḫ/ or replacement by /ʾ/, and the old phoneme /ś/ developed into /š/ (Streck 2011b). Additionally, the morphophonemic vowel /e/ began to assimilate neighboring /a/ vowels into /e/ (epāšum → epēšum “to do”). A recent grammatical description of Old Babylonian is Streck (2011c).

Middle Babylonian (MB) (1500–1000 BCE) is documented in about 12,000 texts consisting mostly of letters, economic texts and a few royal inscriptions. The dialect was used as a lingua franca in diplomatic communication around the Ancient Near East, including Cyprus, Anatolia and Egypt, and thus almost all MB texts come from outside Babylonia (e.g., from Ugarit, Amarna, Alalakh and Nuzi). Some typical features of Middle Babylonian were lateralization of /š/ before dental consonants (ištēn → iltēn “one”), loss of word-initial /w/ and dissimilation of geminates /dd, bb, gg/ into /nd, mb, ng/. Additionally, the archaic morpheme {m} known as mimation was lost (kakkum → kakku “weapon”) (Streck 2011b). A grammatical description of Middle Babylonian is Aro (1955).


The latest stages of Babylonian, Neo-Babylonian (NB) (1000–626 BCE) and Late Babylonian (LB) (626 BCE – 100 CE) are documented in 47,500 letters and economic texts found in Babylonia. Around the 8th century BCE, the West-Semitic Aramaic language began to gain status as a lingua franca in the Ancient Near East (Gzella 2011: 574), and it displaced Akkadian as an everyday language in the middle of the first millennium BCE, leaving Akkadian to be used only in liturgical and scholarly contexts (Huehnergard & Woods 2008: 53). The latest texts written in Akkadian have been estimated to date back to the first or the second century CE. These texts are part of the Graeco-Babyloniaca, a small corpus of Akkadian and Sumerian texts accompanied by Greek transcription (Geller 1997).

One of the most distinctive features of Neo- and Late Babylonian is the merger of the nominative {u} and accusative {a} cases into {u}. Ultimately, the omission of short final vowels reduced the case system even further, to the point that occasionally all case distinctions were lost (Streck 2011b). A grammatical description of Neo-Babylonian letters is Woodington (1982), and for Late Babylonian the relevant grammatical works are Streck (1995) and Hackl (2007).

In addition to the previously mentioned stages of Babylonian, there was also a literary language called Standard Babylonian (SB) (Ger. jungbabylonisch). This variant of Old Babylonian was used in literary contexts (especially in hymnic-epic texts and royal inscriptions) by both the Babylonians and the Assyrians. Although the origin of Standard Babylonian was in Old Babylonian, the texts often contain some residue from the contemporary Babylonian or Assyrian dialects. Some grammatical features of Standard Babylonian were a somewhat productive use of archaic terminative {iš} and locative {ūm} endings, and the use of the ŠD-stem of verbs, which was otherwise very rarely attested and only found in the Old Babylonian dialect (Kouwenberg 2010: 16).

The Assyrian dialect differs from Babylonian by some archaic, but also by some innovative features. A feature that Assyrian shares with Old Akkadian (but not with Babylonian) is unmerged vowel contractions over weak radicals (Bab. dâku(m) ~ Ass. duāku(m) “to kill”). Another vowel-related phenomenon that distinguishes Assyrian from Babylonian is the contraction of the Proto-Semitic diphthong *aj and triphthongs *aji and *awi into /ē/ in Assyrian but /ī/ in Babylonian (Huehnergard & Woods 2008: 104). An innovative feature in Assyrian is its vowel harmony, which involves the assimilation of short /a/ in open unstressed syllables to the vowel of the following syllable as in Ass. *iddanū → iddunū “they gave” (Huehnergard & Woods 2008: 104). A typical morphological feature of Assyrian is also the use of subordinative marker {(ū)ni} instead of Babylonian {u} (Streck 2011b: 368). The dialects also feature lexical differences, such as OA kēna vs. OB anna/i “yes”, OA pūrum vs. OB isqum “lot” and OA aršātum vs. OB kibtum “wheat” (Streck 2011b: 370).


Old Assyrian (OA) (1950–1500 BCE) is attested in ca. 22,000 documents mostly excavated at ancient Assyrian trade colonies in Anatolia, at the time populated by the Hittites. A recent grammatical description of Old Assyrian is Kouwenberg (2017).

Middle Assyrian (MA) (1500–1000 BCE) is rather poorly attested compared with Old Assyrian, and it lacked a comprehensive grammatical treatment before de Ridder’s recent descriptive grammar (de Ridder 2018). The MA corpus consists of some royal inscriptions, economic texts and, most importantly, the Middle Assyrian laws. A distinct feature of MA that it shares with Middle Babylonian is the omission of mimation: OA/OB šarrum vs. MA/MB šarru “king”.

Neo-Assyrian (NA) (1000–600 BCE) is the best attested stage of the Assyrian dialect, documented in ca. 7,000 texts largely found at Nineveh. The corpus consists mainly of texts related to the kings and the royal court, including letters, grants, decrees and treaties. Literary compositions and royal inscriptions of the Neo-Assyrian period were composed in Standard Babylonian.

A distinctive feature of NA compared to OA and MA is the partial free variation of voiced and voiceless stops (igtaldū ~ igdaldū “they became frightened”) (Luukko 2004: 69). Neo-Assyrian grammatical descriptions are Hämeen-Anttila (2000) and Luukko (2004).

1.3 Cuneiform writing

The cuneiform script was a logo-syllabic writing system used widely in Mesopotamia and the surrounding regions from the late fourth millennium BCE until the first or second century CE. It was originally invented by the Sumerians around 3200 BCE and then adapted to write Eblaite and Akkadian in the 25th and the 24th centuries BCE (Michalowski 2008: 13). During the following centuries, the use of cuneiform spread around the Ancient Near East and was adapted from Akkadian to write several languages, including Elamite ca. 2300 BCE, Hurrian ca. 2000 BCE and Hittite ca. 1600 BCE.

Cuneiform signs can be used for three clearly distinguishable purposes. Logograms are used to depict complete words or ideas, whereas syllabograms are used to express phonetic sequences. Cuneiform signs can also be used as determinatives, which classify words into various categories from divine names to place names, various animals and objects made of different materials. Logo-syllabic writing is realized as the use of logograms to denote the stems of words, whose grammatical details are marked by using syllabograms. Occasionally cuneiform also uses phonetic complements, which are essentially syllabic signs that reveal phonetic details about the preceding or the following logogram.


1.3.1 Development

It has been argued that the earliest stage of the cuneiform script, known as proto-cuneiform, developed from an earlier book-keeping system where small tokens depicting various commodities and livestock were sealed inside clay envelopes (bullae). The step from a token-bullae system to writing possibly took place when the bookkeepers began to impress the contents of the envelope onto its surface (Robinson 2007: 60-62).

Proto-cuneiform still lacked its wedge-like appearance as the signs were drawn on the clay instead of being impressed. The script was purely pictographic and did not indicate any phonetic elements until 2800 BCE, when a small set of signs acquired new reading values as syllables. The syllabic values were assigned following the rebus principle. For example, the Sumerian sign A, depicting the word for water /ʔaj/, began to mark a syllabic value of /ʔa/ (later simply /a/) alongside its pictographic meaning. This invention gave birth to the logo-syllabic cuneiform script and made the marking of grammatical elements possible. In the middle of the third millennium BCE, wedge-shaped styluses were introduced and the scribes began to impress the signs into the clay using four basic wedge types: horizontal, vertical, oblique and Winkelhaken. This innovation gave the script its peculiar appearance (Cooper 1996: 38).

Around 2400 BCE the scribes began to write the signs more consistently in the order that they were supposed to be read, that is, from left to right (Michalowski 2008: 13).

When Akkadians adopted the cuneiform script to write their own language around the 24th century BCE, they discarded most of the Sumerian logograms and preferred to use syllabic signs instead. This pushed the development of the cuneiform syllabary forward (Michalowski 2008: 13). Sumerian logograms were not, however, completely discarded, but still used, especially to write various nouns, likely because nouns featured a simpler morphological structure than the Akkadian verbs (Michalowski 2010: 13). Additionally, the use of signs as determinatives to classify words into certain categories was preserved (Huehnergard & Woods 2008: 89). Over the course of the next two millennia, the appearance of the cuneiform script became even more abstract, and most of the signs lost their resemblance to their original pictographic forms as they diverged into their Babylonian and Assyrian sign forms. The preference between syllabic and logo-syllabic writing varied depending on the period and context of the writing. Whereas letters were often written mostly using syllabic script, scholarly circles still favored the use of logograms during the first millennium BCE, especially in the Neo-Assyrian period.


1.3.2 Transliteration and transcription

Transliteration of cuneiform involves representing the source text sign by sign in the Latin alphabet, preserving as much information about the original tablet as possible. Imperfections, such as reconstructed broken or missing signs are enclosed within various brackets, and signs used in different functions are distinguished from each other. Logograms are written in capital letters, determinatives and phonetic complements in superscript, and syllabic text in lowercase (typically in italic except in Sumerian). Logogram sequences are separated from each other with dots and syllabic signs with dashes. To make the transliteration reversible, homophonic signs are distinguished by an indexing system, where a subscript index number is added after the transliterated sign.

For example, as the signs ŠU and ŠU2 can both represent the phonemic value /šu/, they are distinguished in transliteration as šu and šu2 to explicitly indicate which sign was used in the original source (Huehnergard & Woods 2008: 89–92).
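The indexing convention can be illustrated with a small parser that splits a syllabically written word into its signs and their index numbers. This is a minimal sketch under stated assumptions: the function name is hypothetical, an unindexed reading is treated as index 1 (the usual convention), indices may be typed either as plain or as subscript digits, and determinatives, brackets and other editorial marks are not handled.

```python
import re
import unicodedata

# Subscript digits are mapped to plain digits before parsing.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")
# A reading is a run of letters, optionally followed by an index number.
SIGN = re.compile(r"([^\W\d_]+)(\d*)")


def parse_signs(word: str) -> list:
    """Return (reading, index) pairs for a hyphen-separated transliterated word."""
    word = unicodedata.normalize("NFC", word).translate(SUBSCRIPTS)
    signs = []
    for sign in word.split("-"):
        match = SIGN.fullmatch(sign)
        if match is None:
            raise ValueError(f"unsupported sign notation: {sign!r}")
        reading, index = match.groups()
        signs.append((reading, int(index) if index else 1))
    return signs


print(parse_signs("u2-ṭe4-bi-ma"))
```

Keeping the index as a separate field makes it possible to compare spellings on the level of readings (šu) while still being able to recover which sign (šu or šu2) was written.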

Transliteration is a complex task because many signs can be read as syllables, logograms, determinatives or phonetic complements, and the word boundaries are not marked. Thus, for every sign, its context has to be taken into account. Signs may also form compounds whose readings may be completely unrelated to their components. For example, in Akkadian the sign IGI has several syllabic values, including ši, lim, lem and li3, whereas the sign RI can be read syllabically as ri, re, tal, dal, or ṭal. However, when these two signs occur as a compound IGI.RI, they are read as ar.

Some cuneiform languages can also be represented in phonological transcription. This process involves translating logograms into the target language in their expected surface forms, and normalizing the syllabic renderings from the graphemic level into strings of phonemes (Huehnergard & Woods 2008: 89–90). Like transliteration, this task requires a profound knowledge of the language (if not a deeper one, because the transcriber has to be very familiar with the morphology and syntax of the language), especially when the context is poorly preserved or the text contains numerous logograms. For instance, the sign IGI from our previous example has several logographic uses in Akkadian, such as īnu “eye”, pānu “front”, maḫru “before”, or amāru “to see”, which can all occur in several grammatical forms, including but not limited to construct states maḫar, īn and pān, and genitives maḫri, īni and pāni. For the verb amāru, theoretical possibilities are even more numerous: īmur “he saw”, immar “he sees”, ītamar “he has seen”, innamir “it became visible”, or innammar “it becomes visible”, to name a few. In syllabic writing, transcription challenges emerge from inconsistent spelling of vowel and consonant quantities, as in i-be-el → ibēl “he ruled” or ibêl “he rules”, i-di-in → idin “give!” or iddin “he gave”, and a-na-ku → anāku “I” or annaku “tin”, where the transliteration does not explicitly indicate if the underlying vowels are short /a/, long /ā/ or contracted /â/, or if the consonants are single or geminated (Huehnergard & Woods 2008: 93–94). Examples of transliteration and transcription are presented in Figure 1.
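The extent of this ambiguity can be made concrete by enumerating every transcription that a syllabic spelling would allow if vowel length and consonant gemination were left completely open. The sketch below is a deliberately naive illustration, not the transcription method of this thesis: the function names are hypothetical, each vowel is allowed to be short, long or contracted, and each consonant between two vowels is allowed to be single or geminated, without any lexical or morphological filtering.

```python
from itertools import product

VOWELS = {
    "a": ("a", "ā", "â"),
    "e": ("e", "ē", "ê"),
    "i": ("i", "ī", "î"),
    "u": ("u", "ū", "û"),
}


def collapse(transliteration: str) -> str:
    """Join signs, writing the shared vowel of CV-VC spellings only once."""
    signs = transliteration.split("-")
    word = signs[0]
    for sign in signs[1:]:
        if word[-1] in VOWELS and sign[0] == word[-1]:
            sign = sign[1:]  # e.g. di + in -> din
        word += sign
    return word


def candidates(transliteration: str) -> set:
    """Enumerate all length and gemination variants of a syllabic spelling."""
    base = collapse(transliteration)
    options = []
    for pos, char in enumerate(base):
        if char in VOWELS:
            options.append(VOWELS[char])
        elif (0 < pos < len(base) - 1
              and base[pos - 1] in VOWELS and base[pos + 1] in VOWELS):
            options.append((char, char + char))  # single or geminated
        else:
            options.append((char,))
    return {"".join(choice) for choice in product(*options)}


for spelling in ("i-be-el", "i-di-in", "a-na-ku"):
    print(spelling, len(candidates(spelling)))
```

Even for these three short spellings the candidate sets are large, and each contains both of the readings cited above. Choosing among them requires knowledge of the lexicon, the morphology and the context.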

(1) [cuneiform text]
(2) šum-ma MA2-LAḪ4 gišMA2 a-wi-lim u2-ṭe4-bi-ma
(3) šumma malāḫum elip awīlim uṭebbīma

“If a sailor sank a boat of a free man (and made it refloat, he shall give half of the boat’s price in silver)”

Figure 1. A short example from the Codex Hammurabi (§238) in (1) cuneiform, (2) transliteration, and (3) phonological transcription.

Phonological transcription is commonly used in Akkadian, Hurrian, Hittite and Elamite, especially in their grammatical descriptions, but it is not commonly practiced in Sumerian due to our poor understanding of Sumerian phonology (and partly morphology). The electronic Pennsylvania Sumerian Dictionary (ePSD2) uses a rudimentary transcription in Sumerian as a spelling normalization, but the use is strictly limited to the dictionary forms of the words.

1.4 Research objectives and motivation

Since the beginning of Assyriology, Mesopotamian primary sources have mostly been studied using qualitative methods. A major part of this research has involved collating and publishing texts and studying their contents by close-reading. The close-reading oriented research is still dominant, although over a hundred thousand Mesopotamian texts have been published in digital format. Thus, the digitalization of Assyriology has mostly had an impact on the ease of use and accessibility of primary sources, which in the times before digital corpora were mostly accessible only to those fortunate enough to work or study in universities provided with a comprehensive Assyriological library.

The availability of digital corpora has offered many possibilities for quantitative research. A pioneer in this respect is Oracc, which offers anyone the opportunity to download its data and to explore it without the constraints of Oracc’s standard search functionalities. People involved in the development of Oracc have also published tools and tutorials, such as Compass (Veldhuis 2020) to aid and motivate researchers to explore the possibilities of computational approaches for analyzing cuneiform texts.

The feasibility of quantitative research on cuneiform texts is highly dependent on the detail of the annotation in the source corpora. A corpus with extensive metadata on the provenience, dialect, historical period and genre offers many more opportunities than plain transliterations. Even better grounds are provided if the corpus includes rich linguistic annotation, such as spelling normalization, lemmatization, POS-tagging and morphological analysis. Lemmatization in particular has been considered a mandatory step for making the cuneiform corpora useful in practice (Maiocchi 2019). This can easily be demonstrated by using the Korp version of Oracc (Jauhiainen et al. 2019a).6 A query for the lemma nadānu “to give” returns 1,933 hits, whereas queries for its (seemingly) common transliterated word forms id-din and na-da-nu yield only 107 and 22 hits, respectively. Even a longish regular expression that represents the five most common word forms and their spellings in the corpus, ^(id-din|i-nam-di|it-ta-di|na-dan|ina-an-din).*, yields only 1,028 hits, including 56 false matches. It thus finds 972 true occurrences and misses 961 of the 1,933 occurrences of nadānu.
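The difference between a lemma query and a spelling-based query can be reproduced on a toy scale. The sketch below applies the regular expression quoted above to a handful of (word form, lemma) pairs. The token list is invented for illustration and is not drawn from Korp-Oracc, and the function name is hypothetical.

```python
import re

# The regular expression quoted in the text.
PATTERN = re.compile(r"^(id-din|i-nam-di|it-ta-di|na-dan|ina-an-din).*")

# Hypothetical toy data: (transliterated form, lemma).
TOKENS = [
    ("id-din", "nadānu"),
    ("id-din-nu", "nadānu"),
    ("i-nam-din", "nadānu"),
    ("na-dan", "nadānu"),
    ("SUM-in", "nadānu"),     # logographic spelling, invisible to the regex
    ("it-ta-din", "nadānu"),
    ("it-ta-di", "nadû"),     # matches the regex but is a different lemma
    ("i-din", "nadānu"),
]


def compare(tokens, lemma):
    """Compare a lemma query with the spelling-based regex query."""
    lemma_hits = sum(1 for _, lem in tokens if lem == lemma)
    regex_matches = [(form, lem) for form, lem in tokens if PATTERN.match(form)]
    false_matches = sum(1 for _, lem in regex_matches if lem != lemma)
    true_matches = len(regex_matches) - false_matches
    return {
        "lemma_hits": lemma_hits,
        "regex_hits": len(regex_matches),
        "false_matches": false_matches,
        "missed": lemma_hits - true_matches,
    }


print(compare(TOKENS, "nadānu"))
```

The same arithmetic underlies the corpus figures given above: 1,028 regex hits minus 56 false matches leaves 972 true matches, which is 961 fewer than the 1,933 hits of the lemma query.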

The process of transforming a physical clay tablet into a format that is useful for quantitative analysis involves a chain of steps where each step provides adequate source data for the next step. For example, in order to study semantic relationships between Akkadian words, one has to have lemmatized and preferably POS-tagged and morphologically annotated source material due to the morphological complexity of the language. To be able to lemmatize and POS-tag the texts through morphological analysis, a consistent phonological transcription is required in turn due to the extent of spelling variation in the primary sources. This, again, requires a transliterated source text based on the original cuneiform tablet, often collated from numerous surviving fragments.

The contributions of this thesis can be seen as a part of a pipeline for transforming Akkadian cuneiform texts, first from physical tablet fragments into annotated corpora, and ultimately into content-related information extracted from the annotated corpora (Figure 2). Although this thesis does not contribute to the first steps of the pipeline, namely artifact reconstruction and the OCR of cuneiform, an extensive survey of relevant research is presented in Sections 2.1 and 2.2.

Figure 2. A processing pipeline from fragments to information. [Diagram: fragments, tablets, transliteration, transcription, linguistic annotation and content analysis, arranged under the headings ARTIFACT, TEXT, CORPUS and INFORMATION; Publication I is marked at transcription, Publications II and III at linguistic annotation, and Publications IV-VI at content analysis.]

As thousands of digitized texts exist in transliteration without linguistic annotation (e.g., Achemenet comprises ca. 3,000 texts alone), approaching the pipeline from top to bottom contributes significantly to the amount of available data for higher level quantitative analysis, including social network analysis and the study of Akkadian lexical semantics. In addition to contributing to the volume of annotated data, it is hoped that the pipeline, being strictly based on Oracc conventions, also advances harmonizing data with Oracc standards.

6 The searches are constrained by disallowing unlemmatized matches.

As sketched in Figure 2, the focus is set on three topics, which form the three primary research objectives and main contributions of this thesis:

1. Transformation of transliterated Akkadian cuneiform text into phonological transcription. This is necessary for morphological analysis due to extensive spelling variation. The work conducted in this thesis investigates the feasibility of this task using a statistical/heuristic model and state-of-the-art sequence-to-sequence models.

2. Morphological analysis of Akkadian, especially the Babylonian dialect, which is the most widely attested variant of the Akkadian language. Although morphological analysis is useful for several purposes,7 the work conducted in this thesis investigates the feasibility of morphology-driven lemmatization and part-of-speech tagging of Akkadian.

3. Improving the quantitative analysis of Akkadian lexical semantics by proposing methods that are better suited to the Akkadian data. The work conducted in this thesis investigates an algorithmic way to deduplicate the source data to reduce the amount of statistical noise, and experiments with the proposed method on different PMI-based word association measures in corpora with varying degrees of duplication. This is motivated by the possibility of studying Akkadian lexical semantics from an emic perspective, and seeing behind the formulaic and prescriptive use of the language without manual alteration of the historical source data by removing problematic parts of the corpus.

Although this thesis concentrates on Akkadian, some of the proposed methods are directly applicable to other cuneiform languages as well. The models for automatic phonological transcription of transliterated cuneiform can be trained for any cuneiform language, as long as standardized transcription conventions exist and there is adequate training material available for the models. Similarly, the improvements proposed for semantic analysis aim to solve general problems also encountered in Sumerian texts.

Because Akkadian is a low-resource language, the applied methods are especially suitable for sparse data. These include finite-state transducers for morphological analysis, and primarily statistical methods for semantic analysis. Although neural networks need large amounts of data, they are also experimented with on the phonological transcription task, with promising results, as well as for creating Akkadian word embeddings.

7 For example, studying root-pattern formalisms and word formation, identifying derived words (especially verbal stems), studying bilingual and grammatical texts, or stylistic features of different genres, and providing better grounds for various NLP tasks from dependency parsing to machine translation.


1.5 Author’s contributions

Publication I (Automated phonological transcription of Akkadian cuneiform text) is a description of automatic phonological transcription of Akkadian by using statistical/heuristic and state-of-the-art neural sequence-to-sequence models. The contribution of this paper is to apply context-aware and non-context-aware models to produce normalized phonological transcriptions from transliterated cuneiform texts to make them better suited to morphological analysis and further annotation. I developed the statistical/heuristic baseline model and implemented the initial versions of the neural transcriber, performed the extrinsic evaluation and part of the intrinsic evaluation, and wrote the paper with Miikka Silfverberg and Antti Arppe under the supervision of the fourth author, Krister Lindén.

Publication II (Towards a finite-state based computational model of ancient Babylonian) describes a finite-state model for the Babylonian dialect of the Akkadian language, capable of providing morphological labeling, lemmatization, and part-of-speech tagging from transcribed text. The contribution of this paper is to introduce a comprehensive, easily extendable and maintainable way to annotate phonologically transcribed Akkadian texts. I developed the morphological model, performed the experiments for the non-weighted model and wrote the paper with Miikka Silfverberg and Antti Arppe under the supervision of the fourth author, Krister Lindén.

Publication III (Akkadian treebank for early Neo-Assyrian royal inscriptions) presents a syntactic treebank for Neo-Assyrian royal inscriptions. The contribution of this paper is to define a standard for Akkadian syntactic and morphological notation following the Universal Dependencies guidelines, test the reproducibility of the syntactic annotations by using the TurkuNLP neural parser, and build a morphological gold standard for Akkadian. I developed the standard for Akkadian morphological description, and validated and harmonized the morphological annotations. I wrote the paper with Mikko Luukko and Sam Hardwick under the supervision of the fourth author, Krister Lindén.

Publication IV (Improving word association measures in repetitive corpora with Context Similarity Weighting) proposes context similarity weighting (CSW) for pointwise mutual information (PMI), and a new normalized derivation for one of the commonly used PMI measures known as PMI². The contribution of this paper is to (a) present a consistent and reproducible method for down-sampling word co-occurrences that occur in repetitive or formulaic passages without the need for manual and often methodologically problematic alteration of historical primary sources, and (b) introduce a new PMI variant that has fixed bounds for perfect dependence and independence, does not suffer from low-frequency bias, and takes slightly better advantage of CSW. I developed the weighting algorithm and the normalized variant of PMI², performed the experiments and wrote the paper under the supervision of the second author, Krister Lindén.
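For orientation, the two standard measures that this publication builds on can be computed from raw counts as follows. The sketch shows only the textbook definitions of PMI and PMI²; neither context similarity weighting nor the normalized variant proposed in Publication IV is reproduced here, and the function and parameter names are hypothetical.

```python
import math


def _probabilities(count_ab: int, count_a: int, count_b: int, total: int):
    """Turn co-occurrence and marginal counts into probabilities."""
    if min(count_ab, count_a, count_b, total) <= 0:
        raise ValueError("all counts must be positive")
    return count_ab / total, count_a / total, count_b / total


def pmi(count_ab: int, count_a: int, count_b: int, total: int) -> float:
    """PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) )."""
    p_ab, p_a, p_b = _probabilities(count_ab, count_a, count_b, total)
    return math.log2(p_ab / (p_a * p_b))


def pmi2(count_ab: int, count_a: int, count_b: int, total: int) -> float:
    """PMI²(a, b) = log2( p(a, b)**2 / (p(a) * p(b)) )."""
    p_ab, p_a, p_b = _probabilities(count_ab, count_a, count_b, total)
    return math.log2(p_ab ** 2 / (p_a * p_b))


# Two words that always co-occur (10 joint occurrences in 100 windows).
print(pmi(10, 10, 10, 100), pmi2(10, 10, 10, 100))
```

With these definitions PMI is 0 when the two words are statistically independent, while PMI² reaches 0 when they always occur together, so the two measures place their reference points at opposite ends of the scale.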

Publication V (Language technology approach to ‘seeing’ in Akkadian) applies PMI to study the use of seven verbs of seeing in Akkadian, and compares the observed results with previous philological studies. I wrote the tools, performed the quantitative analysis, analyzed the results, and wrote the paper with my supervisor Saana Svärd.

Publication VI (Fear in Akkadian texts: new digital perspectives on lexical semantics) studies the concept of fear in Akkadian using fastText, and PMI weighted with an early version of the duplicate down-sampling method presented in Publication IV. The contribution of this paper is to apply the methods of distributional semantics to the study of emotions in ancient Mesopotamia. I developed the down-sampling method and wrote the tool for calculating the association measures. I wrote the paper with Saana Svärd, Tero Alstola, Heidi Jauhiainen, and Krister Lindén under supervision of the first- and the last-mentioned authors.

Publication VII (Aššur and his friends: A statistical analysis of Neo-Assyrian texts) extends the use of PMI to network analysis in order to study the position of the god Aššur within the Neo-Assyrian pantheon. The contribution of this paper is to combine the methods of distributional semantics with social network analysis to improve the relevance of the connections within the network. My contributions were weighting the edges by using PMI and writing the tool for calculating the association measures. I wrote the paper with Tero Alstola, Shana Zaia and Heidi Jauhiainen under supervision of the fifth and sixth authors, Saana Svärd and Krister Lindén.


1.6 The Data

The data used in this thesis comes ultimately from the Open Richly Annotated Cuneiform Corpus (Oracc), which is one of the largest corpora of cuneiform texts comprising ca. 30 subprojects featuring several Mesopotamian languages.

What makes Oracc an outstanding source of texts for computational analysis and developing tools for cuneiform languages is its open accessibility and the detail of its annotation. Most of the texts have been lemmatized, POS-tagged and provided with necessary metadata on their provenience, language/dialect, period and genre, and several other relevant details. In addition, the corpus is linked with the Cuneiform Digital Library Initiative (CDLI), which provides drawings or photographs for many texts in the corpus. The Oracc data is available for download in JSON format (and at least for some corpora in TEI-XML and ATF8 formats) under the Creative Commons 3.0 (Share-Alike) license.

In practice, the datasets used in the contributions of this thesis are subsets extracted from Korp-Oracc (Jauhiainen et al. 2019a) provided by the Language Bank of Finland. The latest version of Korp-Oracc is a snapshot of Oracc from May 2019, comprising 1.98 million words, of which 1.55 million are labeled as Akkadian.9 Of these words, 1.33 million are provided with POS-tagging and ca. one million with lemmatization, most of the unlemmatized words being lacunae.10 The data in Korp-Oracc differs slightly from Oracc in a few respects. First, it provides an option to use simplified and normalized POS-tags and text genre definitions instead of the very detailed and diverse labels used in Oracc. Second, it contains optional normalizations for divine names and some place names based on their most common lemmatization (for instance, Anunnaki instead of Anunna, Anunnaki, Anunnaku, Anunak etc.). Finally, it is encoded into VRT (VeRticalized Text) format, a token-oriented columnized text format, which represents text structure in XML and text content in TSV. This format is highly human-readable, light, and very efficient and easy to parse.
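How little code a VRT reader needs can be shown with a short sketch. The sample below is invented for illustration: the column order (word, lemma, POS), the attribute names and the text identifier are assumptions and do not necessarily match the actual Korp-Oracc files, whose column layout should be checked before use.

```python
import re

ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')
COLUMNS = ("word", "lemma", "pos")  # hypothetical column layout

# Hypothetical sample: XML tags mark structure, token lines are tab-separated.
SAMPLE = "\n".join([
    '<text id="P000001" language="Akkadian">',
    "<sentence>",
    "id-din\tnadānu\tV",
    "a-na\tana\tPRP",
    "</sentence>",
    "</text>",
])


def parse_vrt(vrt: str) -> list:
    """Return one dict per token, tagged with the id of the enclosing text."""
    tokens = []
    text_attributes = {}
    for line in vrt.splitlines():
        if not line.strip():
            continue
        if line.startswith("<"):
            # Structural line: remember the attributes of an opening <text> tag.
            if line.startswith("<text"):
                text_attributes = dict(ATTRIBUTE.findall(line))
            continue
        token = dict(zip(COLUMNS, line.split("\t")))
        token["text_id"] = text_attributes.get("id")
        tokens.append(token)
    return tokens


for token in parse_vrt(SAMPLE):
    print(token)
```

The sketch assumes that token lines never begin with an unescaped angle bracket, so that every line starting with one can be treated as structure.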

8 ATF is a commonly used plain-text format for representing transliterated cuneiform text.

9 The majority of the Akkadian data comes from the following Oracc projects: SAAo (S. Parpola, K. Radner, E. Robson, J. Novotny & S. Tinney; lemmatized by M. Luukko, M. Groß & N. Morello), ADsD (R. Pirngruber, M. Rinderer with support of J. Novotny and N. Morello), RINAP (G. Frame, B. L. Eichler, E. Leichty, K. Radner & S. Tinney; lemmatized by J. Novotny, J. Jeffers, N. Morello & G. Lentini), HBTIN (L. Pearce, S. Langin-Hooper, C. Bravo, T. Prussin & J. Cristosomo; lemmatized by L. Pearce, J. Carnahan, J. Cristosomo, C. Bravo & T. Prusin), RIAo/RIBo (A. Bartelmus, G. Frame, J. Novotny & K. Radner; lemmatized by J. Novotny, A. Bartelmus & F. Weiershäuser), CAMS-GKAB (M.-H. Besnier, P. Clancier, G. Cunningham, R. Horry, F. Reynolds, E. Robson, K. Stevens, S. Tinney & G. van Buylaere). See Section 4.1 for links to Korp-Oracc and [http://oracc.org/projectlist.html] for more information about the projects in Oracc.

10 Errata: Publication II (page 3) incorrectly reports the total word count in lemmatized/POS-tagged texts that contain Akkadian, instead of lemmatized words labeled explicitly as Akkadian (as given in this section). Publication V (page 4) states incorrect counts for the left-out data based on a broken file. In reality, ca. 5,400 lexical texts, containing 470k words in total, were left out. These comprised 144k Akkadian words, of which 48k were lemmatized. The description of the dataset used is accurate. I am responsible for both of these errors.


2 An Overview of Computational Assyriology

After Roberto Busa gave the first public demonstration of computerized linguistic analysis in June of 1952 (Stout 2019), it was not very long before the idea of using computers was also adopted by a handful of Assyriologists. In Paris, Jean-Claude Gardin began to apply computational methods such as decision trees to archaeological and Assyriological research already in the 1950s, and published the first network analysis of Old Assyrian economic relationships with Paul Garelli in 1961 (Gardin & Garelli 1961). This work did not gather much recognition at the time, but recently it has been acknowledged as an important early contribution to network analysis (Plutniak 2018). Pioneering work was also done in Helsinki by Simo Parpola, who used punch cards and mainframe computers to digitize 500 Neo-Assyrian letters in the mid-1960s (Söderlund 2005), and published a computer-generated list of Neo-Assyrian toponyms in 1970 (Parpola 1970). In the 1980s, computer-aided Assyriology continued in Helsinki and culminated in The Neo-Assyrian Text Corpus Project (NATC) established in 1986, which aimed to build a corpus of, as well as to publish, the Assyrian royal archives of Nineveh (Aro & Mattila 2007). During the early years of the project, Laura Kataja and Kimmo Koskenniemi published the first finite-state model of the Akkadian language (Kataja & Koskenniemi 1988).

In Chicago, Ignace J. Gelb began to work on the computerized analysis of Amorite in 1965 with the help of Robert M. Whiting, Joyce Bartels and Stuart-Morgan Vance. The project continued for almost two decades and the results were published in 1980 (Gelb 1980). In Los Angeles, another pioneer, Giorgio Buccellati, worked on the Old Babylonian Linguistic Analysis Project (OBLAP) from 1968 to 1975, and in 1977 he began to digitize texts from Ebla (Buccellati 2016). In the course of the OBLAP, Old Babylonian letters were lemmatized, phonologically transcribed, graphemically and morphologically analyzed, and printed out in a keyword-in-context view (Buccellati 1977). In 1979, Buccellati published a comparative graphemic analysis of Old Babylonian and Western Akkadian, which also involved the use of computers (Buccellati 1979), and by the 1980s he had already initiated a concept called Cybernetica-Mesopotamica with the intention of distributing disks containing digital Assyriological publications and data, accompanied by computer programs for analyzing them (Buccellati 1990).
