The semi-automatic parser - adnominal-person morphology



3PL = (а|я|о|ё|е|э|и|ы|у|ю)(ст)(|как|ан(|гак)|ат(|как)|та(н|д)о(|як)|(о|э)линь(|гак)|(о|э)лит(ь|як|ькак)|(о|э)ль(|гак)|(о|э)линек(|ак|как)|(о|э)лиде(|як)|(о|э)льт(ь|як|ькак))(<|$| )

Data extracted with this set of regular expressions will be used for establishing sublexica typically associated with adnominal person. Sublexicon distinctions will show close adherence to the parts of speech established by Mariya Imaikina (2000: 56–59), where she enumerates ten different parts of speech: NOUNS, ADJECTIVES, NUMERALS, PRONOUNS, VERBS, ADVERBS, POSTPOSITIONS, CONJUNCTIONS, PARTICLES, and INTERJECTIONS. Additional semantic characteristics will be taken into consideration to provide a more concise description of adnominal-person morphology.
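The 3PL expression can be tried out with a short Python sketch. It assumes the Cyrillic reading of the pattern and applies it token by token with Python's `re` module rather than with egrep itself; the name `third_plural_candidates` and the sample tokens are illustrative assumptions, not part of the original extraction scripts.

```python
import re

# Cyrillic reading of the 3PL expression (an assumption of this sketch):
# a vowel letter, the 3PL marker -ст, an optional clitic or copula ending,
# and finally a tag start, end of string, or space.
THIRD_PLURAL = re.compile(
    "(а|я|о|ё|е|э|и|ы|у|ю)(ст)"
    "(|как|ан(|гак)|ат(|как)|та(н|д)о(|як)"
    "|(о|э)линь(|гак)|(о|э)лит(ь|як|ькак)|(о|э)ль(|гак)"
    "|(о|э)линек(|ак|как)|(о|э)лиде(|як)|(о|э)льт(ь|як|ькак))"
    "(<|$| )"
)


def third_plural_candidates(text):
    """Return the whitespace-separated tokens that match the 3PL pattern."""
    return [token for token in text.split() if THIRD_PLURAL.search(token)]


if __name__ == "__main__":
    # Illustrative tokens only; every hit is a candidate, not a confirmed 3PL form.
    print(third_plural_candidates("кудост кудосткак кудом вастомс"))
```

Because the pattern inspects surface strings only, any token that happens to end in a vowel followed by -ст is returned as a candidate, so the hits still have to be checked against the sublexica.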

The data tend to yield ambiguous readings for the first and second person singular candidates in the nominative and genitive cases, owing to the indefinite genitive reading of -нь <= -OĔ and the indefinite nominative plural reading of -ть/-т <= -T. The reading INDEF.GEN for -нь <= -OĔ can be contrasted with the reading POSS-1SG>PL/OBL -нь <= -ON; and the reading PL for -ть/-т <= -T can be contrasted with the readings POSS-2SG -ть/-т <= -OT and POSS-2SG>[+KIN]GEN -ть <= -t́. (This is a counter to the assumption that -ть/-т can be reduced to a T representation (cf. Abondolo 1987: 219-233).) These two ambiguous sets also illustrate limitations in the “egrep” attestation strategy and indicate why certain strategies of avoiding 1SG and 2SG morphemes might be merited, for example, automatic parsing strategies involving other persons.
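The two ambiguous sets can be made concrete with a small Python lookup. This is only a sketch: the table `READINGS` restates the readings named above, while the function name `candidate_readings` and the sample forms are assumptions introduced for illustration.

```python
# Surface endings mapped to the competing readings listed in the text.
# The lookup is purely orthographic, which is exactly why it over-generates.
READINGS = {
    "нь": ["INDEF.GEN", "POSS-1SG>PL/OBL"],
    "ть": ["PL", "POSS-2SG", "POSS-2SG>[+KIN]GEN"],
    "т": ["PL", "POSS-2SG"],
}


def candidate_readings(form):
    """Return every reading compatible with the final letters of a word form."""
    # Longer endings are tried first so that -ть is not mistaken for -т.
    for ending in sorted(READINGS, key=len, reverse=True):
        if form.endswith(ending):
            return READINGS[ending]
    return []


if __name__ == "__main__":
    for form in ("кудонь", "кудот", "кудо"):  # illustrative forms
        print(form, candidate_readings(form))
```

A surface search cannot choose among the readings it returns, which is the limitation of the “egrep” strategy noted above.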

2.5. The semi-automatic parser

In a morphological analysis of the Erzya language one must bear in mind the extent of synchronic inflectional mechanisms utilized by the collective of speakers and writers of the language. By defining DECLINABLE WORDS as words that can take case marking in the same manner as nouns, with semantic limitations, we will arrive at subsets of the Erzya lexicon enumerated in nouns, adjectives, numerals, pronouns, non-finites, spatial adverbials and adpositions. These subsets of the Erzya lexicon attest to varied implementations of the three declension types, i.e. the INDEFINITE, the DEFINITE and the POSSESSIVE DECLENSIONS.

The methodological principles required for the description of the possessive declension in Erzya parallel work in the MORPHO-SEMANTIC ANALYSIS OF THE HUNGARIAN NOUN PHRASE by Moravcsik (2003). Her work is quite compatible with the preparatory morpho-semantic evaluation required in the construction of a finite-state two-level morphological parser, such as implemented in the Open Morphology of the Helsinki Finite-State Transducer (<http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/>), henceforth HFST. (See also Krister Lindén, Miikka Silfverberg and Tommi Pirinen 2009.) The two descriptions, it should be noted, have different scopes. Although a semi-automated analysis of Erzya, the language of study, might attest to a finer granularity in subdivisions of the lexicon, made possible by co-occurrence constraints inherent in the morphological concatenation strategies of the language, disambiguation of homonymous forms would be the target of a clausal syntactic description and/or manual disambiguation of a given analyzed text.

The construction of an HFST-based morphological analyzer involves establishing morpho-syntactic building blocks and structural rules that will ensure the well-formedness of a non-contextual word form through the delimitation of co-occurrence in phonemes, morphemes and sememes, and the delimitation of linear ordering. An implementation of such delimitation strategies can be outlined in the following sets and formulations, which correspond to the description of Erzya rendered in sections (3.–4.3.): (i) an alphabet of the Erzya language, i.e. phonological and graphical representations thereof (alphabet); (ii) sets of letters representing various phonetic/graphic feature groups inherent to phonetic contexts (sets); (iii) phonetic/graphic contexts intrinsic to allomorphic variation in the working of rules (contexts); (iv) rules which allow or disallow co-occurring phonetic/graphic contexts (rules); (v) part-of-speech groups with morpho-semantic granularity inherent in the derivation of well-formed lexemes (root-lexicon), and (vi) continuation lexicon strategies providing for proper linear ordering of the morphemes in a given word (continuation lexicon). Thus the extensible structural information and sets utilized in the construction of the two-level parser allow for addressing matters of cumulative expression, extended exponence, morpheme co-occurrence and linear ordering simultaneously, and therefore provide implementational force to the otherwise parallel description afforded in the information extracted from Hungarian by Moravcsik; see the sample parse table (2.4).
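Components (v) and (vi) can be illustrated with a toy Python sketch of a root lexicon and a continuation lexicon. It is not HFST or lexc code and does not reproduce the actual erzya.fst sources: the names `ROOT_LEXICON`, `CONTINUATIONS` and `analyze`, the reduced tag strings, and the restriction to two suffix entries for vowel-final stems are all assumptions made for this example.

```python
# Root lexicon: stem -> list of (part-of-speech tag, name of next lexicon).
ROOT_LEXICON = {
    "паця": [("+NCom", "Nominal")],
    "ашо": [("+NCom", "Nominal"), ("+Adj", "Nominal")],
    "эли": [("+Conjunction", "End")],
}

# Continuation lexicon: name -> list of (suffix, tags, name of next lexicon).
# Only a zero suffix and the definite genitive -нть are modelled here.
CONTINUATIONS = {
    "Nominal": [
        ("", "+Indet+Sg+Nom", "End"),
        ("нть", "+Det+Sg+Gen", "End"),
    ],
}


def analyze(form):
    """Return (lemma, tags) pairs for every path that consumes the whole form."""
    results = []

    def walk(rest, lexicon, lemma, tags):
        if lexicon == "End":
            # A path is valid only if no unanalysed material is left over.
            if not rest:
                results.append((lemma, tags))
            return
        for suffix, tag, next_lexicon in CONTINUATIONS[lexicon]:
            if rest.startswith(suffix):
                walk(rest[len(suffix):], next_lexicon, lemma, tags + tag)

    for stem, entries in ROOT_LEXICON.items():
        if form.startswith(stem):
            for pos_tag, next_lexicon in entries:
                walk(form[len(stem):], next_lexicon, stem, pos_tag)
    return results


if __name__ == "__main__":
    for word in ("ашо", "ашонть", "пацянть", "эли"):
        print(word, analyze(word))
```

Linear ordering is enforced by the chain of lexicon names: a suffix can only be reached through a lexicon that names it as a continuation. The alphabet, sets, contexts and rules of components (i) to (iv) have no counterpart in this sketch.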

Table 2.4 Example of an analyzed text fragment

Ton ramik ašo eĺi seń paćańt́? – Ašońt́. (Cf. Egorova 1976)

‘Did you buy the white or the blue kerchief? – The white [one].’

XML input file for erzya.fst:

<p>
<sent><txt>Тон рамик ашо эли сэнь пацянть?</txt></sent>
<sent><txt>Ашонть.</txt></sent>
</p>

Output:

<?xml version="1.0" encoding="utf-8"?>
<p>
<sent>
<parse id="тон">+Pron+Pers+2Sg+Nom+NoPredx+NoClitic</parse>
</const>
<parse id="рамамс">+Verb+Orth_morph+Imp+23+NoClitic</parse>
<parse id="рамамс">+Verb+Orth_morph+Ind+PretI+23+NoClitic</parse>
</const>
<parse id="ашо">+NCom+Inanim+Cnt+Cx+Indet+Sg+Nom+0Suf+NoPredx+NoClitic</parse>
<parse id="ашо">+Adj+Cx+Indet+Sg+Nom+0Suf+NoPredx+NoClitic</parse>
</const>
<parse id="эли">+Conjunction</parse>
</const>
<parse id="сэнь">+Adj+Cx+Indet+Sg+Nom+0Suf+NoPredx+NoClitic</parse>
<parse id="сэнь">+NCom+Inanim+Cnt+Cx+Indet+Sg+Nom+0Suf+NoPredx+NoClitic</parse>
</const>
<parse id="паця">+NCom+Inanim+Cnt+NoLVStem+Cx+Det+Sg+Gen+NoClitic</parse>
</const>
<parse id="ашо">+NCom+Inanim+Cnt+NoLVStem+Cx+Det+Sg+Gen+NoClitic</parse>
<parse id="ашо">+Adj+NoLVStem+Cx+Det+Sg+Gen+NoClitic</parse>
</const>
</p>
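Readings such as those in the output above can be collected per token with a few lines of Python. This is a sketch: it treats the output as plain text and uses `re` rather than an XML parser, since the printed fragment is not a complete XML document, and the names `PARSE_RE`, `readings_per_token` and `SAMPLE` are assumptions introduced here.

```python
import re

# One <parse> element: lemma in the id attribute, tag string as content.
# Straight and typographic quotes are both accepted, as is whitespace
# inside the closing tag.
PARSE_RE = re.compile(r'<parse id=["”]([^"”]+)["”]>([^<]*)</\s*parse>')


def readings_per_token(output):
    """Group (lemma, tags) readings by token, using </const> as the delimiter."""
    tokens = []
    for chunk in output.split("</const>"):
        parses = PARSE_RE.findall(chunk)
        if parses:
            tokens.append(parses)
    return tokens


SAMPLE = """
<parse id="тон">+Pron+Pers+2Sg+Nom+NoPredx+NoClitic</parse>
</const>
<parse id="рамамс">+Verb+Orth_morph+Imp+23+NoClitic</parse>
<parse id="рамамс">+Verb+Orth_morph+Ind+PretI+23+NoClitic</parse>
</const>
<parse id="эли">+Conjunction</parse>
</const>
"""

if __name__ == "__main__":
    tokens = readings_per_token(SAMPLE)
    # Tokens with more than one reading are the ones left for disambiguation.
    print([readings[0][0] for readings in tokens if len(readings) > 1])
```

Tokens that receive more than one reading are the ones that remain for syntactic or manual disambiguation.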

Manual disambiguation

Once the corpora have been automatically parsed there are a number of disambiguation problems to be dealt with. Whereas most personal pronoun forms have singleton parses, the ambiguous form si͔ń has two alternative readings: one is the third person plural ‘they’ and the other a finite verb form ‘I arrived’, see table below. Further ambiguity can be detected in the pronouns/adpositions, such as that found in t́eń with the readings genitive-form proximal demonstrative pronoun ‘of this; this (object)’, and dative of the first person singular ‘to me’, see tables (2.5) and (4.49a-b).

Table 2.5 Examples of items requiring manual disambiguation in this treatise

| Homonyms | Ambiguous parses |
|---|---|
| si͔ń | they_PRON-PERS.NOM |
| | arrive_V.PRETI.PRED-1SG |
| t́eń | to/for_PRON-DAT.POSS-1SG |
| | this_PRON-DEM-SG.GEN |

2.6. Sublexicon-case alignments and variation

In document Adnominal Person in the Morphological System of Erzya : Adnominaalinen persoona ersän kielen morfologisessa järjestelmässä (pages 67-70)