Research Papers - Dept of Information Technology


Embargo
Enhanced Tokenizer for Sinhala Language
(IEEE, 2019-10-08) Senanayake, S. Y; Kariyawasam, K. T. P. M; Haddela, P. S
Tokenization plays a prominent role in natural language processing (NLP) applications: it splits content into its smallest meaningful units. However, only a limited number of tokenization approaches exist for the Sinhala language. The standard analyzer in the Apache software library and the Natural Language Toolkit (NLTK) are the main existing approaches for tokenizing Sinhala content. Because these are language independent, they have some limitations when applied to Sinhala. Our proposed Sinhala tokenizer focuses mainly on punctuation-based tokenization. It tokenizes content precisely by identifying the use case of each punctuation mark. In our research, we have shown that our punctuation-based tokenization approach outperforms the word tokenization of the existing approaches.
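As a rough illustration of what punctuation-based tokenization can mean in practice, the following minimal Python sketch treats a period inside a number differently from a period that ends a sentence. It is not the tokenizer from the paper: the pattern, the punctuation set, and the name tokenize are assumptions chosen for this example only.

```python
import re

# Hypothetical pattern (not the paper's rules):
#   1. a number, optionally with "." or "," between digit groups, stays whole
#   2. a run of characters that are neither whitespace nor listed punctuation is a word
#   3. any listed punctuation mark becomes its own token
TOKEN_PATTERN = re.compile(
    r"""\d+(?:[.,]\d+)*|[^\s.,!?;:()"']+|[.,!?;:()"']"""
)


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # The period after the verb is split off; the one inside 3.50 is kept.
    print(tokenize("මම ගෙදර යමි. මිල 3.50 කි!"))
```

A whitespace-only split would leave the sentence-final period attached to the preceding word, which is the kind of case a punctuation-aware rule is meant to handle.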
Embargo
A rule based stemmer for Sinhala language
(IEEE, 2020-04-13) Kariyawasam, K. T. P. M; Wickramasinghe, S. Y; Haddela, P. S
Stemming, as the word implies, converts a word into its root or base form, which is called the stem. Stemming plays a prominent role in natural language processing (NLP) because it makes applications more efficient and effective. Although stemming is such an important task, it is hard to find a stemming method for Sinhalese, an official language of Sri Lanka. Common language analyzers cannot be used for stemming Sinhalese because they are highly language dependent. In this paper, we present a rule-based stemming method that uses suffix and prefix rules of the Sinhalese language.
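The general idea of rule-based affix stripping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule set from the paper: the function name stem, the min_stem_len guard, and the three example case suffixes are all chosen for this sketch and have not been checked against the authors' rules.

```python
def stem(word, suffixes, prefixes=(), min_stem_len=2):
    """Strip at most one prefix and one suffix, trying the longest rule first.

    A rule is applied only if at least min_stem_len characters remain,
    so very short words are not reduced to nothing.
    """
    for prefix in sorted(prefixes, key=len, reverse=True):
        if prefix and word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[:-len(suffix)]
            break
    return word


# Illustrative suffix list only (an assumption, not the paper's rules).
EXAMPLE_SUFFIXES = ("ගෙන්", "ගේ", "ට")

if __name__ == "__main__":
    print(stem("ළමයාගේ", EXAMPLE_SUFFIXES))  # → ළමයා
```

Trying the longest rule first avoids stripping a short suffix when a longer one also matches; a real rule set would also need to handle exceptions and vowel changes that simple string matching cannot capture.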
