Ecological Models to Explain the Distributions of Words in Texts

dc.contributor.author Punchi-Manage, R.
dc.contributor.author Mapa, C.
dc.contributor.author Madhushani, M.
dc.contributor.author Dilhara, M.
dc.contributor.author Karunathilaka, P.
dc.contributor.author Amiyangoda, L.
dc.contributor.author Ekenayake, U.
dc.date.accessioned 2022-02-01T14:15:06Z
dc.date.available 2022-02-01T14:15:06Z
dc.date.issued 2021-12
dc.identifier.citation International Symposium of Rajarata University (ISYMRU 2021) en_US
dc.identifier.issn 2235-9710
dc.identifier.uri http://repository.rjt.ac.lk/handle/123456789/3408
dc.description.abstract John 1:1 says, “In the beginning was the Word, and the Word was with God, and the Word was God.” It has been found that fewer than 20% of the words can account for more than 80% of the content of a text. Pareto’s 80:20 rule, power laws, and Zipf’s law are often used to explain the distribution of words. However, we think that ecological models can also be used to describe the distribution of words. Species abundances in ecological communities are governed by a few dominant species, followed by a majority of rare species and singletons. This causes species rank-abundance curves to show highly skewed distributions with long right tails, a pattern that resembles the word distributions of texts. Ecologists often use three models to explain species rank-abundance curves: MacArthur’s broken-stick model, Fisher’s log-series model, and Preston’s octave curves. The first step of our research was to use these three ecological models to see whether they could explain the word distributions of texts. For this purpose, we examined the relative frequencies of words in 10 renowned works of scientific literature. We found that the relative word-frequency distribution of each book was characterized by a few dominant words followed by a large number of rarely (infrequently) used words, producing long-tailed distributions. MacArthur’s broken-stick model and Fisher’s log-series model explained the word distributions of the texts poorly. The observed rank-abundance curves fell outside the simulation envelopes of the broken-stick models. Further, Fisher’s log-series models with different values of the parameter alpha could not explain the full pattern: high values explained only the tail of the distribution, and low values explained only the dominant word frequencies. Interestingly, only Preston’s octave curves closely matched the observed relative word frequencies. Hence, our research emphasizes that an ecological model (i.e. Preston’s octave curve) can be applied in statistical linguistics. en_US
dc.language.iso en en_US
dc.publisher Faculty of Technology Rajarata University of Sri Lanka en_US
dc.subject Log-series model en_US
dc.subject broken-stick model en_US
dc.subject octave curves en_US
dc.subject relative word distribution en_US
dc.title Ecological Models to Explain the Distributions of Words in Texts en_US
dc.type Article en_US
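
The comparison described in the abstract, observed word rank-frequencies against the broken-stick model, can be illustrated with a minimal sketch. It is not the authors' code: the function names, the tokenizer, and the sample sentence are assumptions made for illustration. It uses the standard broken-stick expectation for the relative abundance of the i-th ranked of S categories, p_i = (1/S) * sum_{k=i}^{S} 1/k, and does not reproduce the simulation envelopes mentioned in the abstract.

```python
import re
from collections import Counter


def word_relative_frequencies(text):
    """Return relative word frequencies, sorted from most to least frequent.

    The lowercase, letters-and-apostrophes tokenizer is an assumption made
    for this sketch; the record does not say how the texts were tokenized.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return [n / total for _, n in counts.most_common()]


def broken_stick_expected(s):
    """Expected relative abundances for ranks 1..s under the broken-stick model.

    Implements p_i = (1/s) * sum_{k=i}^{s} 1/k by accumulating the harmonic
    tail from rank s down to rank 1.
    """
    expected = []
    tail = 0.0
    for i in range(s, 0, -1):
        tail += 1.0 / i
        expected.append(tail / s)
    expected.reverse()
    return expected


# Hypothetical sample text, far shorter than the books in the study.
sample = "the word was with the word and the word was first"
observed = word_relative_frequencies(sample)
expected = broken_stick_expected(len(observed))

for rank, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"rank {rank}: observed {obs:.3f}, broken-stick {exp:.3f}")
```

On a full-length text, plotting the two lists against rank on a logarithmic frequency axis would show whether the observed curve departs from the broken-stick expectation, which is the kind of mismatch the abstract reports.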

