Research Publications

Permanent URI for this community: https://rda.sliit.lk/handle/123456789/4194

This main community comprises five sub-communities, each representing the academic contributions made by SLIIT-affiliated personnel.

Search Results

Now showing 1 - 3 of 3
  • Publication (Embargo)
    Sinhala Named Entity Recognition Model: Domain-Specific Classes in Sports
    (IEEE, 2022-12-09) Wijesinghe, W. M. S. K.; Tissera, M.
    Named Entity Recognition (NER) is a crucial subtask in most Natural Language Processing (NLP) pipelines. Constructing a NER system for the Sinhala language is challenging, however, because Sinhala is a low-resource language. The proposed approach therefore designed a mechanism to identify specific named entities in the sports domain. First, a domain-specific corpus was built from Sinhala sports e-news articles. A semi-automated, rule-based component named “Class_Label_Suggester” was then built to annotate pre-defined named entities, and the automatic annotations were further validated with minimal manual effort. Finally, the NER model was trained on the annotated data using Linear Perceptron, Stochastic Gradient Descent (SGD), Multinomial Naive Bayes (MNB), and Passive Aggressive classifiers. Although all of the above Machine Learning (ML) algorithms showed approximately 98% accuracy, the MNB model demonstrated the highest accuracy per class label: 99.76% for ‘Ground’, 99.53% for ‘School’, 98.55% for ‘Tournament’, and 97.87% for ‘Other’. Precision values for these classes were 81%, 72%, 62%, and 98% respectively. An accurately annotated Sinhala dataset and the trained Sinhala NER model are the main contributions of the study.
  • Publication (Open Access)
    Bidirectional LSTM-CRF for Named Entity Recognition
    (32nd Pacific Asia Conference on Language, Information and Computation, 2018-12-01) Panchendrarajan, R; Amaresan, A
    Named Entity Recognition (NER) is a challenging sequence labeling task which requires a deep understanding of the orthographic and distributional representation of words. In this paper, we propose a novel neural architecture that benefits from both word- and character-level information and from dependencies across adjacent labels. The model combines a bidirectional LSTM (BI-LSTM) with a bidirectional Conditional Random Field (BI-CRF) layer. Our work is the first to experiment with BI-CRF in neural architectures for sequence labeling. We show that a CRF can be extended to capture dependencies between labels in both the left-to-right and right-to-left directions of the sequence. This variation of the CRF is referred to as BI-CRF, and our results show that BI-CRF improves the performance of the NER model compared to a unidirectional CRF; the backward CRF captures the most difficult entities better than the forward CRF does. Our system is competitive on the CoNLL-2003 dataset for English and outperforms most existing approaches that do not use any external labeled data.
  • Publication (Embargo)
    Conditional Random Fields based named entity recognition for Sinhala
    (IEEE, 2015-12-18) Senevirathne, K. U; Attanayake, N. S; Dhananjanie, A. W. M. H; Weragoda, W. A. S. U; Nugaliyadde, A; Thelijjagoda, S
    Named Entity Recognition (NER) plays an important role in Natural Language Processing (NLP). Named Entities (NEs) are special atomic elements in natural language belonging to predefined categories such as persons, organizations, locations, expressions of time, quantities, monetary values, and percentages. They refer to specific things and are not listed in grammars or lexicons. NER is the task of identifying such NEs, and it is entwined with a number of challenges: entities may be difficult to find at first and, once found, difficult to classify. For instance, locations and person names can be the same and follow similar formatting. The task becomes tougher for South and South East Asian languages, mainly due to the nature of these languages. Even though Latin-script languages have accurate NER solutions, those cannot be directly applied to Indic languages, because the features found in those languages differ from English. Therefore, the research focused on producing a mathematical model which acts as the integral part of a Sinhala NER system. The researchers used a Sinhala news corpus as the dataset to train the Conditional Random Fields (CRF) algorithm: 90% of the corpus was used to train the model and 10% to test the resulting model. The research makes use of orthographic word-level features along with contextual information, which are helpful in predicting three different NE classes, namely Persons, Locations, and Organizations. The findings of the research were applied in developing an NE annotator which identifies NE classes in unstructured Sinhala text. The contribution of this research to NER could benefit Sinhala NLP application developers and NLP researchers in the near future.
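The first result above trains, among other classifiers, a Multinomial Naive Bayes (MNB) model over auto-annotated tokens. As a rough illustration of that idea only, the following from-scratch sketch classifies tokens represented as bags of features with Laplace smoothing; the feature names and romanized toy data are invented here, not taken from the paper's corpus, and the paper itself likely used a library implementation.

```python
import math
from collections import Counter, defaultdict

class TokenNB:
    """Multinomial Naive Bayes over bags of token features (Laplace smoothing)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength
        self.class_counts = Counter()            # label -> number of training tokens
        self.feat_counts = defaultdict(Counter)  # label -> feature -> count
        self.feat_totals = Counter()             # label -> total feature count
        self.vocab = set()

    def fit(self, samples):
        for feats, label in samples:
            self.class_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.feat_totals[label] += 1
                self.vocab.add(f)

    def predict(self, feats):
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        best_label, best_lp = None, float("-inf")
        for label, count in self.class_counts.items():
            lp = math.log(count / n)             # log prior
            for f in feats:                      # smoothed log likelihood
                num = self.feat_counts[label][f] + self.alpha
                den = self.feat_totals[label] + self.alpha * v
                lp += math.log(num / den)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

# Toy stand-in for the auto-annotated sports corpus: each token is reduced
# to a bag of features (romanized here; the real corpus is Sinhala text).
TRAIN = [
    (["suffix=pitiya", "shape=title"], "Ground"),
    (["suffix=pitiya", "shape=title"], "Ground"),
    (["suffix=vidyalaya", "shape=title"], "School"),
    (["suffix=vidyalaya", "shape=title"], "School"),
    (["suffix=kusalana", "shape=title"], "Tournament"),
    (["suffix=other", "shape=lower"], "Other"),
]

nb = TokenNB()
nb.fit(TRAIN)
print(nb.predict(["suffix=pitiya", "shape=title"]))  # -> Ground
```

The smoothing parameter `alpha` keeps unseen features from zeroing out a class's probability, which matters in a low-resource setting with a small feature vocabulary.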
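The second result extends a CRF to capture label dependencies in both directions. The core of any linear-chain CRF decoder is Viterbi search over per-token emission scores and pairwise transition scores; the sketch below shows the standard forward (left-to-right) version with made-up scores, whereas the paper's BI-CRF additionally models right-to-left label transitions.

```python
def viterbi(emissions, transitions, labels):
    """Highest-scoring label path for a forward (left-to-right) linear-chain
    CRF, given emission scores per token and pairwise transition scores."""
    # Best score of any path ending in each label at the first token.
    score = {y: emissions[0][y] for y in labels}
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for y in labels:
            # Best previous label to transition from (missing pairs score 0).
            p = max(labels, key=lambda q: score[q] + transitions.get((q, y), 0.0))
            new_score[y] = score[p] + transitions.get((p, y), 0.0) + emit[y]
            ptr[y] = p
        score = new_score
        backptrs.append(ptr)
    # Trace the best path backwards from the best final label.
    y = max(labels, key=lambda l: score[l])
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    return path[::-1]

labels = ["O", "B", "I"]
emissions = [
    {"O": 0.0, "B": 2.0, "I": 0.0},
    {"O": 1.0, "B": 0.0, "I": 0.9},  # emission alone slightly prefers O here
    {"O": 0.0, "B": 0.0, "I": 2.0},
]
transitions = {("B", "I"): 1.0, ("I", "I"): 1.0, ("O", "I"): -2.0}
print(viterbi(emissions, transitions, labels))  # -> ['B', 'I', 'I']
```

At the middle token, the transition scores override the greedy emission choice (`O`) and keep the entity span contiguous, which is exactly the kind of label dependency the CRF layer contributes on top of the BI-LSTM's emission scores.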
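The third result feeds orthographic word-level features plus contextual information into the CRF. A minimal feature-extractor sketch follows, with hypothetical feature names and romanized stand-in tokens; note that Sinhala script has no capitalization, so affix and context features carry most of the signal.

```python
def token_features(tokens, i):
    """Illustrative orthographic and contextual features for token i;
    the paper's actual feature set is not reproduced here."""
    w = tokens[i]
    return {
        "word": w,
        "prefix3": w[:3],       # orthographic: leading affix
        "suffix3": w[-3:],      # orthographic: trailing affix
        "is_digit": w.isdigit(),
        # contextual: neighbouring tokens, with sentence-boundary markers
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

sent = ["kolamba", "rajakiya", "vidyalaya"]  # romanized stand-in tokens
print(token_features(sent, 1)["suffix3"])  # -> iya
```

In a CRF tagger, one such feature dict is produced per token, and the model learns weights tying features like `suffix3` and `prev` to the Persons, Locations, and Organizations labels.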