Faculty of Computing

Embargo
Automated Spelling Checker And Grammatical Error Detection And Correction Model for Sinhala Language
(IEEE, 2022-10-04) Goonawardena, M; Kulatunga, A; Wickramasinghe, R; Weerasekara, T; De Silva, H; Thelijjagoda, S
Sinhala is the native language of the Sinhalese people, the largest ethnic group in Sri Lanka. It is a morphologically rich language derived from Pali and Sanskrit. Sinhala exhibits diglossia: its written form differs from its spoken form, and because of this difference the written form requires more complex rules. Manually proofreading Sinhala material is time-consuming, labor-intensive, and tedious. Hence, a system is needed that can be used across industries such as journalism, and by students. At present, only a handful of systems and studies have automated Sinhala spelling and grammar analysis, and these mainly focus on either spelling analysis or grammar analysis. The proposed system covers both aspects and improves upon existing work by either optimizing or rebuilding the process to provide accurate outputs. It includes a suffix list built for verbs and subjects, which sets it apart from currently proposed solutions. This research intends to implement a service that checks spelling and grammatical correctness in formal Sinhala contexts. The research follows a rule-based approach, with some components adopting a hybrid approach. For the literature survey, many papers were analyzed, covering both individual aspects of the proposed system and complete systems. The proposed system aims to overcome most of the barriers faced in previous work while taking a fresh approach to the problem.
Embargo
Dynamic stopword removal for Sinhala Language
(IEEE, 2019-10-08) Jayaweera, A. A. V. A; Senanayake, Y. N; Haddela, P. S
In the modern era of information retrieval, text summarization, and text analytics, redundant (noise) words that carry little information and have low or no semantic meaning must be filtered out. Such words are known as stopwords. More than 40 languages have identified their language-specific stopwords. Researchers use various techniques to build language-specific stopword lists, but most of them define a "magical" cut-off point for the list without any justification. This research aims to show that the cut-off point depends on the source data and the machine learning algorithm, using Newton's iteration method for root finding. To achieve this, the research creates a stopword list for the Sinhala language using a term frequency-based method applied to more than 90,000 Sinhala documents. This paper presents the results obtained and the new datasets prepared for text preprocessing.
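As a rough illustration of the term frequency-based idea described in this abstract, the sketch below ranks terms by raw frequency and keeps the top `cutoff` as stopword candidates. The function name `stopword_candidates`, the whitespace tokenization, and the fixed `cutoff` argument are assumptions made for this example; the paper's actual procedure, including how Newton's method is used to study the cut-off, is not reproduced here.

```python
from collections import Counter

def stopword_candidates(documents, cutoff):
    """Return the `cutoff` most frequent whitespace-separated terms.

    Hypothetical helper: a plain frequency ranking, not the paper's method.
    """
    counts = Counter()
    for doc in documents:
        # Assumption: terms are separated by whitespace.
        counts.update(doc.split())
    # most_common() sorts terms by descending frequency.
    return [term for term, _ in counts.most_common(cutoff)]

# Toy corpus with placeholder terms instead of real Sinhala documents.
docs = ["a b a c", "a b d", "a e"]
print(stopword_candidates(docs, 2))
```

Here the cut-off is passed in by hand, which is exactly the kind of fixed choice the abstract argues should instead depend on the source data and the learning algorithm.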
Embargo
Enhanced Tokenizer for Sinhala Language
(IEEE, 2019-10-08) Senanayake, S. Y; Kariyawasam, K. T. P. M; Haddela, P. S
The tokenization process plays a prominent role in natural language processing (NLP) applications. It chops content into the smallest meaningful units. However, there is a limited number of tokenization approaches for the Sinhala language. The standard analyzer in the Apache software library and the Natural Language Toolkit (NLTK) are the main existing approaches for tokenizing Sinhala content. Since these are language independent, they have some limitations when applied to Sinhala. Our proposed Sinhala tokenizer focuses mainly on punctuation-based tokenization. It tokenizes content precisely by identifying the use case of each punctuation mark. In our research, we have shown that our punctuation-based tokenization approach outperforms the word tokenization of existing approaches.
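To make the notion of "identifying the use case of a punctuation mark" concrete, here is a minimal sketch under a single assumed rule: punctuation at the edge of a whitespace-delimited chunk becomes its own token, while punctuation inside a chunk (such as the point in a decimal number) is kept. The `PUNCT` set and the `tokenize` function are hypothetical and are not the rules of the proposed tokenizer.

```python
# Assumed punctuation set for this example only.
PUNCT = set(".,!?;:()\"'")

def tokenize(text):
    """Split on whitespace, then peel punctuation off the chunk edges."""
    tokens = []
    for chunk in text.split():
        lead, trail = [], []
        # Leading punctuation marks become separate tokens.
        while chunk and chunk[0] in PUNCT:
            lead.append(chunk[0])
            chunk = chunk[1:]
        # Trailing punctuation marks become separate tokens.
        while chunk and chunk[-1] in PUNCT:
            trail.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(lead)
        if chunk:
            # Internal punctuation (e.g. "3.14") stays inside the token.
            tokens.append(chunk)
        tokens.extend(reversed(trail))
    return tokens

print(tokenize("(abc), 3.14 xyz."))
```

The sketch works on whole characters rather than the regex class `\w`, because in Python's `re` module `\w` does not match combining vowel signs, which would split Sinhala words apart.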
Embargo
A rule based stemmer for Sinhala language
(IEEE, 2020-04-13) Kariyawasam, K. T. P. M; Wickramasinghe, S. Y; Haddela, P. S
Stemming, as the word implies, converts an original word into its root/base form, which is called the stem. The stemming process plays a prominent role in natural language processing (NLP) because it makes applications more efficient and effective. Though stemming is such an important task, it is hard to find a stemming method for the Sinhalese language, which is an official language of Sri Lanka. There are common language analyzers, but they cannot be used for stemming since they are highly language dependent. In this paper, we present a rule-based stemming method that uses suffix and prefix rules for the Sinhalese language.
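A rule-based stemmer of the kind this abstract describes can be sketched as longest-match affix stripping. Everything below is an assumption for illustration: the function `stem`, the `min_stem_len` guard, and especially the rule lists, which are English placeholders rather than the Sinhala suffix and prefix rules from the paper.

```python
def stem(word, suffixes, prefixes=(), min_stem_len=2):
    """Strip at most one prefix and one suffix, longest match first.

    Hypothetical helper; the rule lists are supplied by the caller.
    """
    # Try longer prefixes first so the most specific rule wins.
    for p in sorted(prefixes, key=len, reverse=True):
        if p and word.startswith(p) and len(word) - len(p) >= min_stem_len:
            word = word[len(p):]
            break
    # Same strategy for suffixes; skip empty rules to avoid word[:-0].
    for s in sorted(suffixes, key=len, reverse=True):
        if s and word.endswith(s) and len(word) - len(s) >= min_stem_len:
            word = word[:-len(s)]
            break
    return word

# Placeholder English rules, NOT Sinhala rules from the paper.
print(stem("unfolding", suffixes=["ing", "s"], prefixes=["un"]))
```

The `min_stem_len` check is one simple way to avoid over-stemming very short words; a real rule set would likely need language-specific conditions on when each rule may fire.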
