• Department of Health Statistics, Naval Medical University, Shanghai 200433, P. R. China;
JIN Zhichao, Email: jinzhichao@smmu.edu.cn

Objective When multi-center data are used to build clinical prediction models, the independence assumption is violated because patients within the same center are clustered. To account for this clustering effect, this study compares, under different scenarios, the performance of two models that consider clustering, the random intercept logistic regression model (RI) and the fixed effects model (FEM), with two that ignore it, the standard logistic regression model (SLR) and the random forest algorithm (RF).

Methods A simulation study assessed the center-level predictive performance of the models under different degrees of clustering. Differences in discrimination and calibration were compared across scenarios, along with how these differences changed at different event rates.

Results At the center level, the models other than RF differed little in discrimination across clustering scenarios, and their mean C-index changed very little. When highly clustered multi-center data were used for prediction, the marginal predictions (M.RI, SLR and RF) had calibration intercepts slightly below 0 compared with the conditional predictions, that is, they overestimated the mean predicted probability. RF achieved good intercept calibration with many centers and large samples, reflecting the advantage of machine learning algorithms in handling large-sample data. When centers contained few patients, the conditional predictions from FEM had a calibration intercept greater than 0, underestimating the mean predicted probability. In addition, when large-sample multi-center data were used to develop the prediction model, the slopes of the three conditional predictions (FEM, A.RI, C.RI) were well calibrated, while the calibration slopes of the marginal predictions (M.RI and SLR) were greater than 1, indicating underfitting, which became more pronounced as the center-level clustering effect increased. In particular, with few centers and few patients, overfitting could mask the difference in calibration performance between marginal and conditional predictions. Finally, the lower the event rate, the more pronounced the impact of center-level clustering on the predictive performance of the different models.

Conclusion When highly clustered multi-center data are used to build a model that will be applied for prediction in a specific setting, RI or FEM can be chosen for conditional prediction if the number of centers is small or the between-center differences are large because of differing incidence rates. When both the number of centers and the sample size are large, RI can be chosen for conditional prediction or RF for marginal prediction.
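The contrast between marginal and conditional calibration slopes can be illustrated with a minimal simulation sketch. This is not the authors' simulation design: the number of centers, patients per center, random-intercept standard deviation, coefficients, seed, and all function names are assumptions chosen for illustration. FEM is approximated here by center indicator variables, SLR ignores the center, and the center-level calibration slope is taken as the mean of per-center logistic recalibration slopes. RI and RF are left out to keep the sketch dependent on NumPy only.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    X must already contain any intercept or indicator columns."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def simulate(rng, u, n_per_center, beta0=-1.0, beta1=1.0):
    """Draw patients for every center; u holds the true center intercepts (assumed values)."""
    center = np.repeat(np.arange(len(u)), n_per_center)
    x = rng.normal(size=center.size)
    y = rng.binomial(1, expit(beta0 + u[center] + beta1 * x)).astype(float)
    return center, x, y

def mean_center_slope(center, lp, y):
    """Mean over centers of the slope from regressing y on the linear predictor lp."""
    slopes = []
    for j in np.unique(center):
        m = center == j
        X = np.column_stack([np.ones(m.sum()), lp[m]])
        slopes.append(fit_logistic(X, y[m])[1])
    return float(np.mean(slopes))

rng = np.random.default_rng(2023)
n_centers = 50
u = rng.normal(scale=1.0, size=n_centers)      # strong clustering (assumed SD = 1)
c_dev, x_dev, y_dev = simulate(rng, u, 200)    # development data
c_val, x_val, y_val = simulate(rng, u, 1000)   # new patients from the same centers

# SLR: marginal prediction, center ignored
b_slr = fit_logistic(np.column_stack([np.ones(x_dev.size), x_dev]), y_dev)
lp_slr = b_slr[0] + b_slr[1] * x_val

# FEM: conditional prediction, one indicator column per center
dummies = np.eye(n_centers)[c_dev]
b_fem = fit_logistic(np.column_stack([dummies, x_dev]), y_dev)
lp_fem = b_fem[c_val] + b_fem[-1] * x_val

slope_slr = mean_center_slope(c_val, lp_slr, y_val)
slope_fem = mean_center_slope(c_val, lp_fem, y_val)
print(f"center-level calibration slope  SLR: {slope_slr:.2f}  FEM: {slope_fem:.2f}")
```

Under these assumed settings the SLR slope should come out above 1 and the FEM slope close to 1, because ignoring the center effect attenuates the coefficient of the predictor relative to its within-center value. Lowering the assumed random-intercept standard deviation should shrink the gap between the two.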

Citation: YU Jian, PENG Chi, JIN Zhichao. Simulation comparison of various prediction model construction strategies under clustering effect. Chinese Journal of Evidence-Based Medicine, 2023, 23(7): 834-842. doi: 10.7507/1672-2531.202301032

Copyright © the editorial department of Chinese Journal of Evidence-Based Medicine of West China Medical Publisher. All rights reserved
