Document Clustering with Evolved Single Word Search Queries

We present a novel, hybrid approach for clustering text databases. We use a genetic algorithm to generate and evolve a set of single-word search queries in Apache Lucene format. Each cluster is formed as the set of documents matching one search query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query in a set). Optionally, the number of clusters can be specified in advance, which normally improves performance. Some documents in a collection may not be returned by any of the search queries in a set, so once the search query evolution is complete, a second stage applies a k-nearest neighbors (KNN) algorithm to assign each unassigned document to its nearest cluster. We describe the method and compare its effectiveness with that of other well-known existing systems on 8 different text datasets. We note that the search query format has the qualitative benefits of being interpretable and of providing an explanation of cluster construction.
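The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: a plain token match stands in for an Apache Lucene single-word query, the fitness formula (uniquely matched documents minus a weighted overlap count) is one plausible reading of "maximize documents returned, minimize overlap", and Jaccard similarity is an arbitrary choice for the KNN stage. All function names, parameters, and the toy corpus are hypothetical.

```python
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

def matches(word, docs):
    """Indices of documents containing the word (stand-in for a Lucene term query)."""
    return {i for i, d in enumerate(docs) if word in tokenize(d)}

def fitness(queries, docs, overlap_penalty=2):
    """Assumed fitness: documents hit by exactly one query, minus penalized overlaps."""
    hits = Counter(i for q in set(queries) for i in matches(q, docs))
    unique = sum(1 for c in hits.values() if c == 1)
    overlap = sum(1 for c in hits.values() if c > 1)
    return unique - overlap_penalty * overlap

def evolve(docs, k, generations=50, pop_size=20, seed=0):
    """Small genetic algorithm: truncation selection, one-point crossover, mutation."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in tokenize(d)})
    pop = [[rng.choice(vocab) for _ in range(k)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda q: fitness(q, docs), reverse=True)
        parents = pop[: pop_size // 2]          # keep the fitter half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, k) if k > 1 else 0
            child = a[:cut] + b[cut:]           # one-point crossover
            child[rng.randrange(k)] = rng.choice(vocab)  # mutate one query word
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda q: fitness(q, docs))

def jaccard(a, b):
    a, b = set(tokenize(a)), set(tokenize(b))
    return len(a & b) / len(a | b) if a | b else 0.0

def assign_unmatched(queries, docs, k_neighbors=3):
    """Second stage: label each unmatched document by majority vote of its
    nearest query-matched documents."""
    labels = {}
    for cluster, q in enumerate(queries):
        for i in matches(q, docs):
            labels.setdefault(i, cluster)       # overlapping docs keep the first cluster
    if not labels:
        raise ValueError("no document matches any query")
    seeded = dict(labels)
    for i, d in enumerate(docs):
        if i not in seeded:
            nearest = sorted(seeded, key=lambda j: jaccard(d, docs[j]),
                             reverse=True)[:k_neighbors]
            labels[i] = Counter(seeded[j] for j in nearest).most_common(1)[0][0]
    return [labels[i] for i in range(len(docs))]

docs = [
    "cat sat on mat",
    "cat chased mouse",
    "cat and kitten play",
    "stock market fell",
    "stock prices rose",
    "market prices and stock",
    "kitten chased mouse",   # contains neither "cat" nor "stock"
]

print(fitness(["cat", "stock"], docs))           # 6: six documents, no overlap
print(fitness(["cat", "and"], docs))             # 1: three unique hits, one overlap
print(assign_unmatched(["cat", "stock"], docs))  # [0, 0, 0, 1, 1, 1, 0]
best = evolve(docs, k=2)
print(best, fitness(best, docs))
```

In this toy corpus the query set `["cat", "stock"]` explains the clusters directly: each cluster is "the documents containing this word", and the last document, which matches neither query, is placed by the KNN stage next to the documents it shares the most words with.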

Keywords

Document Clustering, Evolved, Single Word, Search Queries

Citation

L. Hirsch, A. D. Nuovo and P. Haddela, "Document Clustering with Evolved Single Word Search Queries," 2021 IEEE Congress on Evolutionary Computation (CEC), 2021, pp. 280-287, doi: 10.1109/CEC45853.2021.9504770.

URI

https://rda.sliit.lk/handle/123456789/2015

Collections

Research Papers - Dept of Information Technology
