• Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing 100700, P. R. China;
CHE Qianzi, Email: cheqianzi123@126.com; SHI Nannan, Email: 13811839164@vip.126.com

Objective  To systematically evaluate the accuracy and consistency of large language models (LLMs) in assessing the risk of bias in analytical studies.

Methods  Cohort and case-control studies related to COVID-19 were included, drawn from the team's previously published systematic review of the clinical characteristics of COVID-19. Two researchers independently screened studies, extracted data, and assessed the risk of bias of the included studies; the LLM-based BiasBee model (Non-RCT version) was used for automated evaluation. Kappa statistics and score differences were used to analyze agreement between the LLM and human evaluations, with subgroup analyses for Chinese- and English-language studies.

Results  A total of 210 studies were included. Meta-analysis showed that LLM scores were generally higher than those of human evaluators, particularly for representativeness of the exposed cohort (Δ=0.764) and selection of external controls (Δ=0.109). Kappa analysis indicated slight agreement on items such as ascertainment of exposure (κ=0.059) and adequacy of follow-up (κ=0.093), but significant discrepancies on more subjective items such as selection of controls (κ=−0.112) and non-response rate (κ=−0.115). Subgroup analysis revealed higher LLM scoring consistency for English-language studies than for Chinese-language studies.

Conclusion  LLMs show potential for risk of bias assessment; however, notable differences from human raters remain on more subjective items. Future research should focus on optimizing prompt engineering and model fine-tuning to improve LLM accuracy and consistency on complex tasks.
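The agreement statistic reported above can be illustrated with a minimal sketch of unweighted Cohen's kappa for two raters (e.g. human vs. LLM) giving per-item judgments on a bias-assessment checklist. This is a generic illustration, not the paper's actual analysis pipeline; the example ratings below are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the two raters' marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if p_e == 1.0:  # both raters used a single identical category
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary judgments (1 = item satisfied) on one checklist item.
human = [1, 1, 0, 0]
llm   = [1, 0, 0, 1]
print(cohen_kappa(human, llm))  # 0.0: agreement is no better than chance
```

Values near 0 (as in several items reported above) mean agreement barely exceeds chance, and negative values mean agreement below chance, which is why kappa is more informative here than raw percentage agreement.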

Copyright © the editorial department of Chinese Journal of Evidence-Based Medicine of West China Medical Publisher. All rights reserved