Research Papers - Dept of Information Technology

Permanent URI for this collection: https://rda.sliit.lk/handle/123456789/593

  • Publication, Embargo
    Document Clustering with Evolved Single Word Search Queries
    (IEEE, 2021-06-28) Hirsch, L; Haddela, P. S; Di Nuovo, A
    We present a novel hybrid approach to clustering text databases. A genetic algorithm generates and evolves sets of single-word search queries in Apache Lucene format, and each cluster is formed as the set of documents matching one query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query in a set). Optionally, the number of clusters can be specified in advance, which normally improves performance. Some documents in a collection are not returned by any query in a set, so once query evolution is complete, a second stage applies a k-nearest neighbours (KNN) algorithm to assign each unassigned document to its nearest cluster. We describe the method and compare its effectiveness with that of other well-known systems on 8 different text datasets. We note that the search query format has the qualitative benefits of being interpretable and of explaining how each cluster was constructed.
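
The two stages described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact fitness formula, so the score used here (documents matched by exactly one query, minus documents matched by several) is an assumption. A plain word-set lookup stands in for an Apache Lucene term query, Jaccard similarity stands in for whatever distance the paper's KNN stage uses, and the toy `docs` collection and all function names are hypothetical.

```python
from collections import Counter

# Toy collection (hypothetical): document id -> set of words it contains.
docs = {
    "d1": {"stock", "market", "shares"},
    "d2": {"stock", "bank", "profit"},
    "d3": {"football", "goal", "match"},
    "d4": {"football", "league", "goal"},
    "d5": {"bank", "profit", "shares"},
}

def match(query_word, docs):
    """Ids of documents containing query_word (stand-in for a Lucene term query)."""
    return {doc_id for doc_id, words in docs.items() if query_word in words}

def fitness(query_set, docs):
    """Assumed score: reward documents hit by exactly one query, penalise overlap."""
    hits = Counter()
    for word in query_set:
        hits.update(match(word, docs))
    unique = sum(1 for n in hits.values() if n == 1)
    overlap = sum(1 for n in hits.values() if n > 1)
    return unique - overlap

def jaccard(a, b):
    """Jaccard similarity between two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_unmatched(query_set, docs, k=3):
    """Form clusters from query matches, then place leftover documents by k-NN vote."""
    clusters = {}
    for word in query_set:
        for doc_id in match(word, docs):
            clusters.setdefault(doc_id, word)  # on overlap, the first query wins
    assigned = dict(clusters)  # only query-matched documents act as neighbours
    for doc_id, words in docs.items():
        if doc_id in assigned:
            continue
        neighbours = sorted(
            assigned, key=lambda other: jaccard(words, docs[other]), reverse=True
        )[:k]
        if neighbours:
            votes = Counter(assigned[n] for n in neighbours)
            clusters[doc_id] = votes.most_common(1)[0][0]
    return clusters

print(fitness(("stock", "football"), docs))  # 4: four documents, no overlap
print(fitness(("stock", "bank"), docs))      # 1: d2 is matched by both queries
print(assign_unmatched(("stock", "football"), docs)["d5"])  # stock
```

In a genetic algorithm, a score such as `fitness` would rank candidate query sets in each generation. The label of each cluster is simply its query word, which is the interpretability benefit the abstract mentions.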