Doing Morphosemantic Analyses in Farsi WordNets

This article presents a morphological analysis of 3500 derived nouns in Persian (Farsi), combined with their semantic interpretation. These nouns are documented in FarsNet, a wordnet for Persian: a computational resource that encodes, among other things, the morphological relations between classes of derived nouns and their bases. A comprehensive and detailed description of the relevant linguistic levels is a prerequisite for progress in natural language processing (NLP).

WordNets are lexical ontologies that rely on semantic and morphological descriptions and formulations. FarsNet was established in 2009 by the NLP research lab of Shahid Beheshti University in Teheran, and its design principles follow those of comparable resources such as Princeton WordNet, EuroWordNet and BalkaNet (Shamsfard et al. 2010; readers familiar with Farsi are referred to the FarsNet page). Besides semantic relations (synonymy, hypernymy, hyponymy, meronymy and antonymy) and morphological relations (derivation), FarsNet defines some additional conceptual relations such as domain and related to. Figure 1 gives an example: it shows the semantic net of the word “shir”, a polysemic noun with the meanings ‘lion’, ‘Leo’ (constellation), ‘powdered milk’, ‘brave’, ‘faucet’, ‘plant sap’, ‘obverse’, ‘mother’s milk’ and ‘milk’, each with its own semantic relations. Currently (2021), FarsNet has more than one hundred thousand entries, organized in almost nine thousand synsets. A synset (or synonym set) is defined as a set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded (Shamsfard et al. 2010).

Figure 1: Semantic relations of the polysemic word “shir” in FarsNet
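The synset idea can be made concrete with a small sketch. The relation names and the nine meanings of “shir” come from the text above; the `Synset` class, the `senses_of` helper and the single hypernymy link are hypothetical illustrations and do not reproduce FarsNet's actual data model or the content of Figure 1.

```python
from dataclasses import dataclass, field

# Relation names are taken from the text; the classes themselves are hypothetical.
RELATIONS = {
    "synonymy", "hypernymy", "hyponymy", "meronymy", "antonymy",  # semantic
    "derivation",                                                  # morphological
    "domain", "related to",                                        # conceptual
}


@dataclass
class Synset:
    gloss: str
    members: set[str] = field(default_factory=set)
    # relation name -> glosses of the target synsets
    relations: dict[str, list[str]] = field(default_factory=dict)

    def relate(self, relation: str, target: "Synset") -> None:
        """Record a labeled link from this synset to `target`."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.relations.setdefault(relation, []).append(target.gloss)


def senses_of(word: str, synsets: list[Synset]) -> list[str]:
    """Return the glosses of all synsets that contain `word`."""
    return [s.gloss for s in synsets if word in s.members]


# The nine meanings of "shir" listed in the text, one synset per meaning.
SHIR_GLOSSES = ["lion", "Leo (constellation)", "powdered milk", "brave", "faucet",
                "plant sap", "obverse", "mother's milk", "milk"]
synsets = [Synset(gloss, {"shir"}) for gloss in SHIR_GLOSSES]

# Illustrative link only (an assumption, not read off Figure 1).
by_gloss = {s.gloss: s for s in synsets}
by_gloss["powdered milk"].relate("hypernymy", by_gloss["milk"])

print(senses_of("shir", synsets))
```

Because every meaning is its own synset, a polysemic word is simply a word that is a member of several synsets, and each of those synsets carries its own relations.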

In my own work, the morphological and semantic relations are considered at the word level (and not at the synset level) in order to achieve cross-linguistic validity, even if the morphological aspect of a relation is not the same in the languages studied. To provide an overview of the data, the nouns were classified into sixteen semantic groups based on general base concepts, with subgroups defined for each section and word. The semantic categories (beginners) were analyzed in a corpus following Princeton WordNet (PWN) standards (see Table 1). After their frequencies had been calculated, they were classified into nine more categories.

 1. act          6. cognition       11. location     16. plant        21. shape
 2. animal       7. communication   12. motive       17. possession   22. state
 3. artifact     8. event           13. object       18. process      23. substance
 4. attribute    9. feeling         14. person       19. quantity     24. time
 5. body        10. group           15. phenomenon   20. relation     25. food
Table 1: Semantic categories (noun beginners) based on PWN standards

FarsNet Word Entries

Every word entry in the wordnet includes a phonological transcription together with information on the part of speech (PoS), synonyms and their synset classifications, the word meaning and an example. A beginner is selected for each lexeme. According to Miller et al. (1990), a beginner is a primitive semantic component of any word in its hierarchically structured semantic field. Beginners can be used to recognize the domain of a synset. Different syntactic types can be related to each other by mapping each entry to its corresponding concept in Princeton WordNet (Shamsfard et al. 2010). Synsets that do not fall into any of the above categories receive the label nothing (the editor app developed in the NLP laboratory is shown in Figure 2).
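As a rough sketch of what such an entry might look like as a record, the following uses the fields named above and the 25 beginners of Table 1. The class name `WordEntry`, the helper `label_beginner` and all field values are assumptions made for illustration; they are not FarsNet's actual schema.

```python
from dataclasses import dataclass, field

# The 25 noun beginners of Table 1 (PWN standard).
BEGINNERS = frozenset({
    "act", "animal", "artifact", "attribute", "body", "cognition",
    "communication", "event", "feeling", "group", "location", "motive",
    "object", "person", "phenomenon", "plant", "possession", "process",
    "quantity", "relation", "shape", "state", "substance", "time", "food",
})


def label_beginner(candidate: str) -> str:
    """Return the beginner, or the label 'nothing' if it is not a known category."""
    return candidate if candidate in BEGINNERS else "nothing"


@dataclass
class WordEntry:
    """Hypothetical record holding the information listed in the text."""
    lemma: str
    transcription: str   # phonological transcription
    pos: str             # part of speech
    meaning: str
    example: str
    synonyms: list[str] = field(default_factory=list)
    beginner: str = "nothing"


# Placeholder values: only the lemma and the meaning 'lion' come from the text.
entry = WordEntry(
    lemma="shir",
    transcription="shir",                 # simplified romanization, not IPA
    pos="noun",
    meaning="lion",
    example="(example sentence)",
    beginner=label_beginner("animal"),
)
print(entry.beginner)
```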

Semantic relations are established among synsets with the same PoS. Synsets with different PoS receive labels such as related to. There are three possible choices for mapping a synset to the corresponding one in Princeton WordNet: equivalence mapping, near-equivalence mapping and no-mapping. Finally, the morphological relations among senses, such as derivational relations, are marked.
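These two decisions can be written down compactly. The three mapping types are those named above; the enum `PwnMapping` and the function `relation_label` are hypothetical names, and the rule that every cross-PoS link becomes “related to” is a simplification of “labels such as related to”.

```python
from enum import Enum


class PwnMapping(Enum):
    """The three ways a FarsNet synset can be mapped to Princeton WordNet."""
    EQUIVALENCE = "equivalence mapping"
    NEAR_EQUIVALENCE = "near-equivalence mapping"
    NO_MAPPING = "no-mapping"


def relation_label(pos_a: str, pos_b: str, semantic_relation: str) -> str:
    """Keep the semantic relation for same-PoS synsets, else use 'related to'.

    Simplifying assumption: every cross-PoS link gets the single label
    'related to'; the text only says that such labels are used.
    """
    return semantic_relation if pos_a == pos_b else "related to"


print(relation_label("noun", "noun", "hypernymy"))
print(relation_label("noun", "verb", "hypernymy"))
print([m.value for m in PwnMapping])
```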

In addition to the noun type (such as common, proper, countable, uncountable, pronoun, number or infinitive), each entry is classified by some more general semantic features (such as belonging to human, animal, location or time).

According to Deléger et al. (2009), a morpho-semantic process decomposes derived words, compounds and complex words into their base components and associates them with their semantic characteristics. Derived words and compounds are analyzed morphologically; the relations between base and derived form are interpreted semantically (Namer & Baud 2007). The term “morphosemantic” was suggested by Raffaelli & Kerovec (2008) for any work dealing with the relation between form and meaning at the word level.
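A minimal sketch of such a decomposition step, assuming a small lexicon of known bases and a suffix list, could look as follows. The romanized toy words (dust ‘friend’, kar ‘work’) and the names `Analysis` and `decompose` are illustrative assumptions, not FarsNet data or the analyzers used by the cited authors.

```python
from typing import NamedTuple, Optional


class Analysis(NamedTuple):
    base: str
    suffix: str
    base_gloss: str   # semantic information attached to the base


# Toy data in simplified romanization (illustrative only).
LEXICON = {"dust": "friend", "kar": "work"}
SUFFIXES = ["gar", "i"]


def decompose(word: str,
              lexicon: dict[str, str] = LEXICON,
              suffixes: list[str] = SUFFIXES) -> Optional[Analysis]:
    """Split `word` into a known base plus a suffix, or return None."""
    # Try longer suffixes first so that a long suffix is not shadowed by a short one.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            if base in lexicon:
                return Analysis(base, suffix, lexicon[base])
    return None


print(decompose("dusti"))    # 'friendship' = dust + -i
print(decompose("kargar"))   # 'worker' = kar + -gar
print(decompose("shir"))     # not a derived noun in this toy lexicon
```

Requiring the base to be in the lexicon is what keeps simple words that merely end in a suffix-like string from being analyzed as derivatives.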

The resulting morphosemantic formulations notably increase the linguistic and operative competence and performance of FarsNet. This is considered an achievement in the codification of Persian descriptive morphology. The components of the semantic network of nouns can be further obtained by identifying the basic concepts of semantic fields and new classifications of semantic categories. In addition, identifying the lexical gaps between Persian and other languages (e.g. English) can be helpful in mapping further sections of a wordnet. A practical benefit of such wordnets is that they facilitate human-machine interaction. From a linguistic perspective, they offer new possibilities for semantic analysis.

Figure 2: FarsNet editor

Data Analysis

The analysis can be illustrated with the noun corpus of FarsNet (22180 nouns). First, a list of derived nouns (2756 items) was prepared. These were then split into their roots and affixes. From the 26 available suffixes, the 12 most frequent ones were selected (2461 derivatives). Not surprisingly, their morphological descriptions correspond to Keshani’s (1992) description of Persian suffixes (see Table 2 for an example).


Table 2: Morphological pattern of suffix “-i” derivatives
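The selection step described above (count the suffixes, keep the most frequent ones, retain their derivatives) can be sketched with a frequency counter. The six romanized base/suffix pairs are toy examples and the function names are hypothetical; in the study the same idea is applied to 26 suffixes with a cut-off of 12.

```python
from collections import Counter

# Toy list of (base, suffix) analyses in simplified romanization (illustrative).
ANALYSES = [
    ("dust", "i"), ("mard", "i"), ("khub", "i"),
    ("kar", "gar"), ("ahan", "gar"),
    ("danesh", "mand"),
]


def most_frequent_suffixes(analyses: list[tuple[str, str]], k: int) -> list[str]:
    """Return the k suffixes with the most derivatives, most frequent first."""
    counts = Counter(suffix for _base, suffix in analyses)
    return [suffix for suffix, _n in counts.most_common(k)]


def derivatives_with(analyses: list[tuple[str, str]],
                     suffixes: list[str]) -> list[tuple[str, str]]:
    """Keep only the analyses whose suffix is among the selected ones."""
    chosen = set(suffixes)
    return [(base, suffix) for base, suffix in analyses if suffix in chosen]


top = most_frequent_suffixes(ANALYSES, k=2)
selected = derivatives_with(ANALYSES, top)
print(top, len(selected))
```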

A morphological analysis of a selection of derived nouns according to semantic categories in FarsNet showed that only 2 words out of 3500 (0.08%) did not fall into the patterns. It can thus be concluded that the patterns provide a foundation for automatically establishing relations between derived or complex nouns and their bases in FarsNet. The key criteria in formulating the relations were the words’ morphological features, such as their PoS and their semantic and grammatical category (e.g. agent noun, participle noun, present participle), together with the beginners of the bases (e.g. act, person, food) and how these change in the affixation process. These criteria were especially crucial because the majority of the suffixes studied are polysemous. Defining and codifying these morphological patterns leads to a coherent set of morphological relations, which makes the database promising for applications such as machine translation and question answering systems. Although the morphological relations were considered at the word level, mapping the results to the relations formulated in the wordnets of other languages provides cross-linguistic validity, even if the morphological aspect of the relation is not the same in the two mapped languages.
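One way to picture such a pattern is as a lookup from a suffix and the beginner of the base to the grammatical category and beginner of the derivative; a polysemous suffix then simply has several rows. The two rules below are hypothetical illustrations, not the patterns of Table 2, and the names `PATTERNS`, `derive` and `uncovered_share` are assumptions. The last helper makes the share of uncovered words explicit: 2 of 3500 is roughly 0.06%, while 2 of the 2461 selected derivatives is roughly 0.08%.

```python
from typing import Optional

# (suffix, beginner of base) -> (grammatical category, beginner of derivative)
# Hypothetical rules for illustration only.
PATTERNS: dict[tuple[str, str], tuple[str, str]] = {
    ("gar", "act"): ("agent noun", "person"),
    ("i", "person"): ("abstract noun", "state"),
}


def derive(suffix: str, base_beginner: str) -> Optional[tuple[str, str]]:
    """Look up the pattern for a suffix and base beginner, or return None."""
    return PATTERNS.get((suffix, base_beginner))


def uncovered_share(uncovered: int, total: int) -> float:
    """Percentage of analyzed words that no pattern accounts for."""
    return 100.0 * uncovered / total


print(derive("gar", "act"))
print(derive("i", "food"))
print(round(uncovered_share(2, 3500), 2))
print(round(uncovered_share(2, 2461), 2))
```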

In my PhD thesis at the Research Center Deutscher Sprachatlas, I am transferring these methods to regional varieties of German. I expect this to yield new insights into the semantic structure of dialects.

References

  • Davari Ardakani, Negar and Mahdiyeh Arvin (2015): Persian. In: N. Grandi and L. Körtvélyessy (eds.), Edinburgh Handbook of Evaluative Morphology. Edinburgh University Press, Edinburgh, 287–295.
  • Deléger, Louise, Fiammetta Namer and Pierre Zweigenbaum (2009): Morphosemantic Parsing of Medical Compound Words: Transferring a French Analyzer to English. International Journal of Medical Informatics 78(1): 48–55.
  • Farshidvard, Khosrow (2007): Derivation and Compounding in Persian. Zavar Press, Tehran.
  • Keshani, Khosrow (1992): Suffix Derivation in Contemporary Persian. Iran University Press, Tehran.
  • Miller, George A. et al. (1990): Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography 3(4): 235–244. doi: 10.1093/ijl/3.4.235
  • Raffaelli, Ida and Barbara Kerovec (2008): Morphosemantic Fields in the Analysis of Croatian Vocabulary. Jezikoslovlje 9(1–2): 141–169.
  • Shamsfard, Mehrnoush et al. (2010): Semi Automatic Development of FarsNet; The Persian WordNet. 5th Global WordNet Conference (GWA2010).
Featured image: Pete Linforth on Pixabay

Cite this article as:

Mirsobhani, Shabnam. 2021. Doing Morphosemantic Analyses in Farsi WordNets. Sprachspuren: Berichte aus dem Deutschen Sprachatlas 1(9). https://doi.org/10.57712/2021-09.

Shabnam Mirsobhani
Shabnam Mirsobhani is a doctoral student at the Research Center Deutscher Sprachatlas (Philipps Universität Marburg). She has a BA in French language and translation methods and an MA in General Linguistics (Shahid Beheshti University Teheran) as well as in Computational Linguistics (Universität Stuttgart). Her interests include computational processing of natural language, statistical language modeling and other related fields.