Publication:
Enhanced Tokenizer for Sinhala Language

dc.contributor.author: Senanayake, S. Y
dc.contributor.author: Kariyawasam, K. T. P. M
dc.contributor.author: Haddela, P. S
dc.date.accessioned: 2022-04-22T05:27:28Z
dc.date.available: 2022-04-22T05:27:28Z
dc.date.issued: 2019-10-08
dc.description.abstract: The tokenization process plays a prominent role in natural language processing (NLP) applications. It splits content into the smallest meaningful units. However, only a limited number of tokenization approaches exist for the Sinhala language. The standard analyzer in the Apache software library and the Natural Language Toolkit (NLTK) are the main existing approaches for tokenizing Sinhala content. Since these are language independent, they have limitations when applied to Sinhala. Our proposed Sinhala tokenizer focuses mainly on punctuation-based tokenization. It tokenizes content precisely by identifying the use case of each punctuation mark. In our research, we show that our punctuation-based tokenization approach outperforms the word tokenization of the existing approaches. [en_US]
dc.identifier.citation: S. Y. Senanayake, K. T. P. M. Kariyawasam and P. S. Haddela, "Enhanced Tokenizer for Sinhala Language," 2019 National Information Technology Conference (NITC), 2019, pp. 84-89, doi: 10.1109/NITC48475.2019.9114420. [en_US]
dc.identifier.doi: 10.1109/NITC48475.2019.9114420 [en_US]
dc.identifier.issn: 2279-3895
dc.identifier.uri: https://rda.sliit.lk/handle/123456789/2010
dc.language.iso: en [en_US]
dc.publisher: IEEE [en_US]
dc.relation.ispartofseries: 2019 National Information Technology Conference (NITC); Pages 84-89
dc.subject: Sinhala Language [en_US]
dc.subject: Enhanced [en_US]
dc.subject: Tokenizer [en_US]
dc.title: Enhanced Tokenizer for Sinhala Language [en_US]
dc.type: Article [en_US]
dspace.entity.type: Publication
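
The abstract describes tokenizing by the use case of each punctuation mark. The sketch below is one possible illustration of that idea and is not the authors' implementation: this record does not give the paper's rule set, so the pattern `TOKEN_RE`, the function `tokenize`, and the specific rules (punctuation between digits stays inside the number, any other punctuation mark becomes its own token) are assumptions made for the example. Sinhala words are matched by an explicit Unicode range because Python's `\w` does not match the combining vowel signs used in Sinhala script.

```python
import re

# Illustrative rules only; the paper's actual rule set is not given in this record.
TOKEN_RE = re.compile(
    r"\d+(?:[.,:/]\d+)+"          # digits joined by punctuation: 3.5, 10,000, 10:30
    r"|[\u0D80-\u0DF3\u200D]+"    # run of Sinhala letters and signs (plus zero-width joiner)
    r"|[A-Za-z]+"                 # run of Latin letters
    r"|\d+"                       # plain integer
    r"|\S"                        # any other non-space character, e.g. a punctuation mark
)


def tokenize(text):
    """Split text into tokens, keeping number-internal punctuation attached."""
    return TOKEN_RE.findall(text)


print(tokenize("Rs. 1,500.50 at 10:30."))
# ['Rs', '.', '1,500.50', 'at', '10:30', '.']
```

A purely whitespace-based or punctuation-splitting tokenizer would break `1,500.50` apart or leave the final full stop attached to `10:30`; deciding per punctuation mark avoids both in this small example.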

Files

Original bundle

Name: Enhanced_Tokenizer_for_Sinhala_Language.pdf
Size: 452.92 KB
Format: Adobe Portable Document Format