A Comparative Analysis of the Readability and Information Quality of the Chinese and English Versions of Educational Materials for Thoracic Surgery Patients Generated by DeepSeek, Grok-3 and ChatGPT

Authors

  • Shiyu Wang Chinese Academy of Medical Sciences and Peking Union Medical College
  • Yuan Yu Chinese Academy of Medical Sciences and Peking Union Medical College

DOI:

https://doi.org/10.62177/apjcmr.v1i4.731

Keywords:

Thoracic Surgery, Thoracoscopic Lobectomy, Large Language Models (LLMs), Patient Educational Materials, Readability, Information Quality, Bilingual Comparison

Abstract

Objective: To comparatively analyze the readability and information quality of the Chinese and English versions of educational materials for patients undergoing thoracoscopic lobectomy generated by three mainstream large language models (LLMs), namely DeepSeek, Grok-3 and ChatGPT, and to provide an evidence-based basis for the clinical selection of AI-assisted patient education tools.

Method: A cross-sectional study design was adopted, with "education for patients undergoing thoracoscopic lobectomy" as the core requirement. Standardized Chinese and English prompts were designed to drive each of the three models to generate three independent educational materials per language (18 in total: 9 in Chinese and 9 in English). Readability was evaluated with internationally recognized assessment tools (English: Flesch-Kincaid Grade Level, FKGL, and Flesch Reading Ease, FRE; Chinese: average sentence length), and information quality was evaluated with the DISCERN instrument. Differences among the three models were compared with the Kruskal-Wallis H test, differences between the Chinese and English versions were analyzed with the paired-samples t-test, and inter-rater reliability was assessed with the intraclass correlation coefficient (ICC).
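The article does not include analysis code. As a rough illustration of the metrics and tests named above, the Python sketch below shows how the English readability scores (FRE and FKGL, via the textstat package), the Chinese average sentence length, and the Kruskal-Wallis H test (via scipy) could be computed; the sample scores in it are hypothetical placeholders, not the study's data.

```python
# Illustrative sketch only; the authors' actual analysis pipeline is not published.
# Assumes the third-party packages `textstat` and `scipy` are installed.
import re

import textstat
from scipy import stats


def english_readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for an English passage using textstat."""
    return textstat.flesch_reading_ease(text), textstat.flesch_kincaid_grade(text)


def chinese_avg_sentence_length(text: str) -> float:
    """Average sentence length in characters, splitting on Chinese end punctuation."""
    sentences = [s for s in re.split(r"[。！？]", text) if s.strip()]
    return sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0


# Hypothetical placeholder data: three FRE scores per model (one per generated text).
fre_scores = {
    "DeepSeek V3": [79.2, 80.5, 81.4],
    "ChatGPT-o3": [66.8, 67.1, 68.2],
    "Grok-3": [44.1, 45.9, 47.0],
}

# Kruskal-Wallis H test across the three models, as described in the Methods.
h_stat, p_value = stats.kruskal(*fre_scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")
```

The paired-samples t-test and the ICC mentioned above could be computed analogously, for example with scipy.stats.ttest_rel and pingouin.intraclass_corr, although the exact scoring workflow used by the authors is not reported.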

Result: 1. Readability: Among the English versions, DeepSeek V3 had the highest FRE score (80.36±1.18) and the lowest FKGL score (4.83±0.12), significantly better than ChatGPT-o3 (FRE: 67.36±0.74, FKGL: 6.56±0.36) and Grok-3 (FRE: 45.67±1.65, FKGL: 11.93±0.17) (P<0.05). Among the Chinese versions, Grok-3 had the shortest average sentence length (17.74±1.02 characters), significantly better than ChatGPT-o3 (27.81±1.47 characters) and DeepSeek V3 (26.75±1.18 characters) (P<0.05).

2. Information quality: Inter-rater reliability was excellent (ICC=0.92, 95% CI: 0.925-0.998, P<0.001). The DISCERN total scores of the Chinese and English versions from all three models were at the "good to excellent" level (59.00-71.17 points). ChatGPT-o3 achieved the highest total scores in both languages (English: 71.17±1.17, Chinese: 70.50±0.55) and Grok-3 the lowest (English: 63.17±0.94, Chinese: 59.00±0.89); the differences between groups were statistically significant (P<0.05).

Conclusion: Among the educational materials for thoracoscopic lobectomy generated by the three LLMs, the English version from DeepSeek V3 had the best readability, the Chinese version from Grok-3 showed outstanding reading fluency, and ChatGPT-o3 performed in a balanced way across both languages. The Chinese versions still need optimization with respect to terminology consistency and informational detail. In clinical application, the model should be selected according to language requirements, and AI-generated content should be professionally reviewed before use.




How to Cite

Wang, S., & Yu, Y. (2025). A Comparative Analysis of the Readability and Information Quality of the Chinese and English Versions of Educational Materials for Thoracic Surgery Patients Generated by DeepSeek, Grok-3 and ChatGPT. Asia Pacific Journal of Clinical Medical Research, 1(4). https://doi.org/10.62177/apjcmr.v1i4.731

Section

Articles

DATE

Received: 2025-10-13
Accepted: 2025-10-16
Published: 2025-11-01