Evaluating Large Language Models for Software Testing: A Systematic Review of Metrics and Practices

Authors: Perera V.I.T; De Silva D.I.
Dates: 2026-05-25; 2026
ISBN: 978-303218912-7
URI: https://rda.sliit.lk/handle/123456789/5049
Language: en
Keywords: Evaluation Framework; Large Language Models; Metrics; Software Testing
Type: Article
DOI: 10.1007/978-3-032-18913-4_27

Abstract: Recent advances in Large Language Models (LLMs) present substantial potential for revolutionizing software testing practices, particularly through automated test case generation. This review synthesizes contemporary research on LLM-driven software testing methods, with a specific focus on evaluation metrics. A systematic literature review was conducted using IEEE Xplore, ResearchGate, and Google Scholar, targeting literature on LLM-based test case generation published between 2020 and 2024. Selection criteria included relevance to automated testing and practical application insights. The review analyzes 15 key studies that span multiple test domains. The key findings reveal significant advances in using LLMs for diverse testing types, including unit, property-based, security, and user acceptance testing. Despite substantial benefits, test case validity, reliability, and prompt engineering complexity remain challenging. The review concludes with recommendations for developing a standardized, metric-driven evaluation framework for better assessing LLM-generated tests. This approach aims to measure and optimize the practical utility and reliability of LLM-generated software tests, guiding future research directions and improving adoption within the software industry. The key contribution of this review is a comprehensive, metric-focused evaluation of LLM-driven software testing techniques, offering a foundation for developing standardized evaluation methodologies and practical testing frameworks.