Auto Generation of Gold Standard, Class Labeled Data Set and Ontology Extension Tool [QuadW]

Tissera, M; Weerasinghe, R

Please use this identifier to cite or link to this item: https://rda.sliit.lk/handle/123456789/2599

Title:	Auto Generation of Gold Standard, Class Labeled Data Set and Ontology Extension Tool [QuadW]
Authors:	Tissera, M; Weerasinghe, R
Keywords:	Auto Generation; Gold Standard; Ontology Extension Tool; Data Set; Class Labeled
Issue Date:	25-Feb-2019
Publisher:	IEEE
Citation:	M. Tissera and R. Weerasinghe, "Auto Generation of Gold Standard, Class Labeled Data Set and Ontology Extension Tool [QuadW]," 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), 2019, pp. 1-6, doi: 10.1109/ICACCP.2019.8882996.
Series/Report no.:	2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP)
Abstract:	Automatic Knowledge Extraction (AKE) from domain-independent, unstructured text sources is a challenging task in Natural Language Processing and text analytics. Although supervised learning mechanisms give promising results, applying them is difficult because they require a class-labeled training data set, which involves expensive and time-consuming manual effort. As a solution to this problem, this paper introduces a novel mechanism for building a self-learned classifier model that can automatically generate a class-labeled training data set for Knowledge/Information Extraction from domain-independent unstructured text. Sri Lankan English newspapers (which comprise unstructured text in unconstrained domains) are the main data source for this study, and a prototype was built for Professional Information Extraction with the semantic pattern Who holds/held What position, Where and When (four words starting with 'W', hence the name 'QuadW'). The methodology uses advanced machine learning techniques, such as a Random Forest with the AdaBoost ensemble algorithm, to build a composite classification model. The classifier is called self-learned because it generates its own training data set automatically. The composite model improved accuracy and also avoided overfitting to the data. The rule-based feature extraction algorithm and the hand-crafted ontology developed can also be considered novel components of this study. The self-learned classifier has been extensively improved and tested, and shows high accuracy with precision and recall close to one. Therefore, the classified output from the self-learned classifier can be used as a gold-standard data set for future research in Professional Information Extraction. The constructed ontology, with approximately 400 facts, can also be used effectively in future research. Further, the introduced classifier can be used as a tool to extend the existing ontology.
This novel application of machine learning algorithms to text classification demonstrates that the study is in line with state-of-the-art technologies.
URI:	http://rda.sliit.lk/handle/123456789/2599
ISBN:	978-1-5386-7989-0
Appears in Collections:	Department of Information Technology-Scopes Research Papers - IEEE Research Publications -Dept of Information Technology
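
The abstract describes rule-based extraction of the Who/What/Where/When pattern and a classifier that labels its own training data. The following is a minimal sketch of that idea only, not the authors' implementation: the seed ontology terms, the regular expressions, and the function names (`extract_features`, `auto_label`, `build_training_set`) are all hypothetical, and the Random Forest with AdaBoost stage is left out. The rules are deliberately crude; for example, any run of capitalised words is treated as a person name.

```python
import re

# Hypothetical seed ontology of position titles (illustrative only; the
# paper's ontology of roughly 400 facts is not reproduced here).
POSITION_TERMS = {"chairman", "director", "minister", "secretary", "president"}

# Assumed rules: crude stand-ins for the paper's feature extraction algorithm.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")                    # When
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b")     # Who
WHERE_RE = re.compile(                                         # Where
    r"\b(?:of|at) (?:the )?([A-Z][\w&]*(?:\s[A-Z][\w&]*)*)"
)


def extract_features(sentence):
    """Return a Who/What/Where/When feature dict for one sentence."""
    words = re.findall(r"[A-Za-z]+", sentence)
    what = next((w for w in words if w.lower() in POSITION_TERMS), None)
    who = NAME_RE.search(sentence)
    where = WHERE_RE.search(sentence)
    when = YEAR_RE.search(sentence)
    return {
        "who": who.group(0) if who else None,
        "what": what.lower() if what else None,
        "where": where.group(1) if where else None,
        "when": when.group(0) if when else None,
    }


def auto_label(sentence):
    """Label a sentence from the rules alone, with no manual annotation.

    A sentence is marked positive (1) when both a person-like name and a
    known position term are found, otherwise negative (0).
    """
    features = extract_features(sentence)
    label = 1 if features["who"] and features["what"] else 0
    return features, label


def build_training_set(sentences):
    """Return (sentence, features, label) tuples for a list of sentences."""
    return [(s, *auto_label(s)) for s in sentences]


if __name__ == "__main__":
    sample = [
        "Nimal Perera was appointed chairman of the Ceylon Tea Board in 2015.",
        "Heavy rain flooded several roads in the city yesterday.",
    ]
    for sentence, features, label in build_training_set(sample):
        print(label, features)
```

In a fuller pipeline, the automatically labeled tuples would presumably be turned into feature vectors and passed to an ensemble classifier such as the Random Forest with AdaBoost model mentioned in the abstract; that step is not shown here.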

Files in This Item:

File	Description	Size	Format
Auto_Generation_of_Gold_Standard_Class_Labeled_Data_Set_and_Ontology_Extension_Tool_QuadW.pdf Until 2050-12-31		298.47 kB	Adobe PDF
