Publication:
Auto Generation of Gold Standard, Class Labeled Data Set and Ontology Extension Tool [QuadW]

dc.contributor.author: Tissera, M
dc.contributor.author: Weerasinghe, R
dc.date.accessioned: 2022-06-09T08:19:28Z
dc.date.available: 2022-06-09T08:19:28Z
dc.date.issued: 2019-02-25
dc.description.abstract: Automatic Knowledge Extraction (AKE) from domain-independent, unstructured text sources is a challenging task in Natural Language Processing and text analytics. Although supervised learning mechanisms give promising results, they are hard to apply because they require a class-labeled training data set, which takes expensive and time-consuming manual effort to produce. As a solution to this problem, this paper introduces a novel mechanism to build a self-learned classifier model that can automatically generate a class-labeled training data set for Knowledge/Information Extraction from domain-independent unstructured text. Sri Lankan English newspapers (which comprise unstructured text in unconstrained domains) are the main data source for this study, and a prototype was built for Professional Information Extraction with the semantic pattern Who holds/held What position, Where and When (four words that start with 'W', hence the name 'QuadW'). The methodology uses advanced machine learning techniques, such as a Random Forest with the AdaBoost ensemble algorithm, to build a composite classification model. The classifier is called self-learned because it generates its own training data set automatically. The composite model improved accuracy and avoided overfitting to the data. The rule-based feature extraction algorithm and the hand-crafted ontology developed here can also be considered novel components of this study. The self-learned classifier has been extensively improved and tested, and shows high accuracy, with precision and recall close to one. The classified output from the self-learned classifier can therefore be used as a gold-standard data set for future research in Professional Information Extraction. The constructed ontology, with approximately 400 facts, can also be used effectively in future research. Further, the introduced classifier can be used as a tool to extend the existing ontology.
This novel application of machine learning algorithms to text classification shows that the study is in line with state-of-the-art technologies. (en_US)
dc.identifier.citation: M. Tissera and R. Weerasinghe, "Auto Generation of Gold Standard, Class Labeled Data Set and Ontology Extension Tool [QuadW]," 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), 2019, pp. 1-6, doi: 10.1109/ICACCP.2019.8882996. (en_US)
dc.identifier.doi: 10.1109/ICACCP.2019.8882996 (en_US)
dc.identifier.isbn: 978-1-5386-7989-0
dc.identifier.uri: https://rda.sliit.lk/handle/123456789/2599
dc.language.iso: en (en_US)
dc.publisher: IEEE (en_US)
dc.relation.ispartofseries: 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP);
dc.subject: Auto Generation (en_US)
dc.subject: Gold Standard (en_US)
dc.subject: Ontology Extension Tool (en_US)
dc.subject: Data Set (en_US)
dc.subject: Class Labeled (en_US)
dc.title: Auto Generation of Gold Standard, Class Labeled Data Set and Ontology Extension Tool [QuadW] (en_US)
dc.type: Article (en_US)
dspace.entity.type: Publication

Files

Original bundle

Name: Auto_Generation_of_Gold_Standard_Class_Labeled_Data_Set_and_Ontology_Extension_Tool_QuadW.pdf
Size: 298.47 KB
Format: Adobe Portable Document Format