Abstract
Artificial Intelligence is increasingly applied in education, but its effectiveness in evaluating research methodologies remains underexplored. This study examines the intra- and inter-rater reliability of ChatGPT-4o, employing zero-shot prompting, in assessing 37 research proposals written by English majors at Saigon University, Vietnam, focusing on Research Title, Questions, Hypotheses, Paradigm, Design, and Techniques. A quantitative quasi-experimental design was used, with two evaluation groups: Control (module lecturers) and Experimental (ChatGPT-4o). ChatGPT-4o evaluated each proposal twice, following a structured set of zero-shot prompts, with a researcher-designed five-point rubric and the book How to Research uploaded for reference. The lecturers evaluated the proposals independently, then discussed and finalized their scores. The data were analyzed using Cohen's quadratic weighted kappa. Results showed moderate to high intra-rater reliability and moderate inter-rater reliability on straightforward criteria, but ChatGPT-4o struggled with abstract criteria requiring deeper reasoning, such as evaluating title relevance and the justification of paradigm and design. These findings highlight the limitations of AI in fully capturing the complexities of research methodologies. Nevertheless, ChatGPT-4o may serve as a reliable evaluation tool in contexts with clear rubrics and minimal training, potentially reducing the need for human intervention. Future studies should expand the sample size and explore alternative approaches to improving its performance in research evaluation.
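For reference, the agreement statistic used in this study, Cohen's (1968) weighted kappa with quadratic weights, can be written as

\kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, p^{(o)}_{ij}}{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, p^{(e)}_{ij}}, \qquad w_{ij} = \frac{(i - j)^2}{(k - 1)^2},

where p^{(o)}_{ij} and p^{(e)}_{ij} are the observed and chance-expected proportions of proposals scored i under one rating and j under the other, and k is the number of scale points (k = 5 for the rubric described above). Larger score discrepancies are thus penalized quadratically, which suits ordinal rubric scores. As an illustrative sketch only, and not the authors' analysis script, the statistic can be computed in Python with scikit-learn; the scores below are invented placeholders rather than data from the study:

    from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

    # Hypothetical five-point rubric scores for eight proposals (placeholders only)
    lecturer_scores = [5, 4, 4, 3, 5, 2, 4, 3]  # e.g., lecturers' consensus scores
    chatgpt_scores = [5, 4, 3, 3, 4, 2, 4, 2]   # e.g., ChatGPT-4o's scores

    # weights="quadratic" applies the (i - j)^2 penalty from the formula above
    kappa = cohen_kappa_score(lecturer_scores, chatgpt_scores, weights="quadratic")
    print(f"Quadratic weighted kappa: {kappa:.3f}")

Intra-rater reliability is obtained the same way, by passing ChatGPT-4o's first- and second-round scores as the two rating vectors.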
References
Andrews, R. (2003). Research questions. Continuum.
Berry, K. J., Johnston, J. E., & Mielke, P. W., Jr. (2008). Weighted Kappa for Multiple Raters. Perceptual and Motor Skills, 107(7), 837–848. https://doi.org/10.2466/PMS.107.7.837-848
Blaxter, L., Hughes, C., & Tight, M. (2010). How to research (4th ed.). McGraw-Hill/Open University Press.
Brookhart, S. M. (2013). How to Create and Use Rubrics for Formative Assessment and Grading. ASCD.
Brown, H. D. (2018). Language assessment: Principles and classroom practices. Pearson.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners (No. arXiv:2005.14165). arXiv. https://doi.org/10.48550/arXiv.2005.14165
Cadman, K. (2002). English for Academic Possibilities: the research proposal as a contested site in postgraduate genre pedagogy. Journal of English for Academic Purposes, 1(2), 85–104.
Chaudhary, S., & Gupta, P. (2023). A Comprehensive Study on Chat GPT. Journal of Emerging Technologies and Innovative Research, 10(10), 196–201.
Checco, A., Bracciale, L., Loreti, P., Pinfield, S., & Bianchi, G. (2021). AI-assisted peer review. Humanities and Social Sciences Communications, 8(1), 25. https://doi.org/10.1057/s41599-020-00703-8
Cohen, J. (1968). Weighted Kappa: Nominal Scale Agreement with Provision for Scaled Disagreement or Partial Credit. Psychological Bulletin, 70(4), 213–220.
Cohen, L., Manion, L., & Morrison, K. (2018). Research Methods in Education. Routledge.
Creswell, J. W. (2015). Educational Research: Planning, Conducting, and Evaluating Quantitative and Qualitative Research. Pearson.
Creswell, J. W., & Creswell, J. D. (2018). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. SAGE Publications, Inc.
Denscombe, M. (2020). Research proposals: A practical guide (2nd ed.). Open University Press.
Duong, N., Tong, T., & Le, D. (2024). Utilizing ChatGPT in checking academic writing for postgraduate students. In Proceedings of the AsiaCALL International Conference (pp. 193–203). https://doi.org/10.54855/paic.24614
Duong, T., & Le, T. (2024). Utilizing artificial intelligence in writing feedback: Benefits and challenges for first-year students at Hanoi University of Industry. In Proceedings of the AsiaCALL International Conference (pp. 238–249).
Fraenkel, J. R., Wallen, N. E., & Hyun, H. (2023). How to Design and Evaluate Research in Education. McGraw Hill.
Gwet, K. L. (2008). Intrarater Reliability. In Wiley Encyclopedia of Clinical Trials. John Wiley & Sons.
Heriyawati, D. F., & Romadhon, M. G. E. (2025). “Can AI Be Trusted for My Thesis?” The Voices of Indonesian Higher Education Levels About ChatGPT in Automated Writing Evaluation (AWE). Computer-Assisted Language Learning Electronic Journal, 26(1), 58–75. https://doi.org/10.54855/callej.252614
Hoang, T., & Vu, T. (2024). Khám phá vai trò của trí tuệ nhân tạo AI – ChatGPT trong kỷ nguyên chuyển đổi số để ứng dụng vào giảng dạy tiếng Anh (ELT) tại Đại học Kinh tế - Kỹ thuật Công nghiệp (UNETI) [Exploring the role of artificial intelligence (AI) – ChatGPT in the digital transformation era for application in English language teaching (ELT) at the University of Economics – Technology for Industries (UNETI)]. In Kỷ yếu hội thảo khoa học quốc gia: Ngôn ngữ học tính toán – những xu hướng mới, triển vọng và thách thức [Proceedings of the national scientific conference: Computational linguistics – new trends, prospects, and challenges] (pp. 31–39).
Hockly, N. (2019). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2023). Large Language Models are Zero-Shot Reasoners (No. arXiv:2205.11916). arXiv. https://doi.org/10.48550/arXiv.2205.11916
Kousha, K., & Thelwall, M. (2022). Artificial intelligence technologies to support research assessment: A review. Statistical Cybermetrics and Research Evaluation Group, University of Wolverhampton.
Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310
Latif, E., & Zhai, X. (2023). Fine-tuning ChatGPT for Automatic Scoring (No. arXiv:2310.10072). arXiv. https://doi.org/10.48550/arXiv.2310.10072
Lee, C. J., Sugimoto, C. R., Zhang, G., & Cronin, B. (2013). Bias in peer review. Journal of the American Society for Information Science and Technology, 64(1), 2–17. https://doi.org/10.1002/asi.22784
Liang, W., Zhang, Y., Cao, H., Wang, B., Ding, D., Yang, X., Vodrahalli, K., He, S., Smith, D., Yin, Y., McFarland, D., & Zou, J. (2023). Can large language models provide useful feedback on research papers? A large-scale empirical analysis (No. arXiv:2310.01783). arXiv. https://doi.org/10.48550/arXiv.2310.01783
Lin, J., Song, J., Zhou, Z., Chen, Y., & Shi, X. (2023). Automated scholarly paper review: Concepts, technologies, and challenges. Information Fusion, 98, 101830. https://doi.org/10.1016/j.inffus.2023.101830
Locke, L. F., Spirduso, W. W., & Silverman, S. J. (2007). Proposals That Work: A Guide for Planning Dissertations and Grant Proposals. SAGE Publications, Inc.
Luu, T. M. V., & Doan, Q. V. (2025). ChatGPT’s Impact on Listening Comprehension: Perspectives from Vietnamese EFL University Learners. Computer-Assisted Language Learning Electronic Journal, 26(3), 43–63. https://doi.org/10.54855/callej.252633
Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126(2), 161–169.
Menke, J., Roelandse, M., Ozyurt, B., Martone, M., & Bandrowski, A. (2020). The Rigor and Transparency Index Quality Metric for Assessing Biological and Medical Science Methods. iScience, 23(11), 101698. https://doi.org/10.1016/j.isci.2020.101698
Nguyen, T. S., Nguyen, T. D. T., Hoang, N. Q. N., & Do, T. K. H. (2025). How AI-Powered Voice Recognition Has Supported Pronunciation Competence among EFL University Learners. Computer-Assisted Language Learning Electronic Journal, 26(3), 64–83. https://doi.org/10.54855/callej.252634
OpenAI. (2022, November 30). Introducing ChatGPT. https://openai.com/index/chatgpt/
OpenAI. (2024a). GPT-4o System Card.
OpenAI. (2024b, May 13). Hello GPT-4o. https://openai.com/index/hello-gpt-4o/
Ormerod, C. M., Malhotra, A., & Jafari, A. (2021). Automated essay scoring using efficient transformer-based language models (No. arXiv:2102.13136). arXiv. https://doi.org/10.48550/arXiv.2102.13136
Paltridge, B., & Starfield, S. (2020). Thesis and dissertation writing in a second language: A handbook for students and their supervisors (2nd ed.). Routledge.
Pham, M. T., & Cao, T. X. T. (2025). The Practice of ChatGPT in English Teaching and Learning in Vietnam: A Systematic Review. International Journal of TESOL & Education, 5(1), 50–70. https://doi.org/10.54855/ijte.25513
Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm (No. arXiv:2102.07350). arXiv. https://doi.org/10.48550/arXiv.2102.07350
Rodriguez, P. U., Jafari, A., & Ormerod, C. M. (2019). Language models and Automated Essay Scoring (No. arXiv:1909.09482). arXiv. https://doi.org/10.48550/arXiv.1909.09482
Russell, S., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
Sabzalieva, E., & Valentini, A. (2023). ChatGPT and artificial intelligence in higher education: Quick start guide. United Nations Educational, Scientific and Cultural Organization. https://unesdoc.unesco.org/ark:/48223/pf0000385146
Saigon University. (2018, January). Chiến lược phát triển Trường Đại học Sài Gòn đến năm 2025 và tầm nhìn [Development strategy of Saigon University to 2025 and vision]. Saigon University.
Sim, J., & Wright, C. C. (2005). The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Physical Therapy, 85(3), 257–268. https://doi.org/10.1093/ptj/85.3.257
Spaapen, J. B., Dijstelbloem, H., & Wamelink, F. J. M. (2007). Evaluating research in context: A method for comprehensive assessment (2nd ed.). Consultative Committee of Sector Councils for Research and Development (COS).
Swales, J. (2004). Research genres: Explorations and applications. Cambridge University Press.
Syriani, E., David, I., & Kumar, G. (2023). Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews.
Tcherni-Buzzeo, M., & Pyrczak, F. (2024). Evaluating Research in Academic Journals. Routledge.
Thelwall, M. (2024). Can ChatGPT evaluate research quality? Journal of Data and Information Science, 9(2), 1–21. https://doi.org/10.2478/jdis-2024-0013
Thomas, R. M. (2003). Blending Qualitative and Quantitative Research Methods in Theses and Dissertations. Corwin.
Turabian, K. L., Booth, W. C., Colomb, G. G., Williams, J. M., Bizup, J., & FitzGerald, W. T. (2018). A Manual for Writers of Research Papers, Theses, and Dissertations: Chicago Style for Students and Researchers (9th ed.). University of Chicago Press.
Wang, Q., & Gayed, J. M. (2024). Effectiveness of large language models in automated evaluation of argumentative essays: Finetuning vs. zero-shot prompting. Computer Assisted Language Learning, 1–29. https://doi.org/10.1080/09588221.2024.2371395
