Using ChatGPT to Evaluate the Methodological Components of Research Proposals: An Experimental Study on Undergraduate English Majors in Vietnam

Keywords

research proposals, evaluation, ChatGPT, zero-shot learning, methodological components

How to Cite

Using ChatGPT to Evaluate the Methodological Components of Research Proposals: An Experimental Study on Undergraduate English Majors in Vietnam. (2025). Computer-Assisted Language Learning Electronic Journal, 26(3), 129-227. https://doi.org/10.54855/callej.252637

Abstract

Artificial intelligence is increasingly applied in education, but its effectiveness in evaluating research methodologies remains underexplored. This study examines the intra- and inter-rater reliability of ChatGPT-4o, employing zero-shot learning, in assessing 37 research proposals from English majors at Saigon University, Vietnam, focusing on the Research Title, Questions, Hypotheses, Paradigm, Design, and Techniques. A quantitative quasi-experimental design was used, with two evaluation groups: Control (module lecturers) and Experimental (ChatGPT-4o). ChatGPT-4o followed a structured zero-shot prompt set, with a researcher-designed five-point rubric and the book How to Research uploaded for reference, and evaluated each proposal twice. The lecturers evaluated the proposals independently, then discussed and finalized their scores. The data were analyzed using Cohen's quadratic weighted kappa. Results showed moderate to high intra-rater reliability and moderate inter-rater reliability on straightforward criteria, but the model struggled with abstract criteria requiring deeper reasoning, such as evaluating title relevance and the justification of the paradigm and design. These findings highlight the limitations of AI in fully capturing the complexities of research methodologies. Nevertheless, ChatGPT-4o may serve as a reliable tool in contexts with clear rubrics and minimal training, reducing the need for human intervention. Future studies should expand the sample size and explore different approaches to improve its research-evaluation capability.
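The reliability analysis described in the abstract uses Cohen's kappa with quadratic weights, which penalizes rater disagreements by the squared distance between ordinal rubric scores. As an illustration only (not the authors' analysis code), a minimal Python sketch of the statistic, assuming two raters scoring on the study's five-point rubric:

```python
import numpy as np

def quadratic_weighted_kappa(r1, r2, n_categories=5):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale.

    r1, r2: sequences of integer ratings in 1..n_categories.
    """
    r1 = np.asarray(r1) - 1
    r2 = np.asarray(r2) - 1
    k = n_categories
    n = len(r1)
    # Observed matrix O[i, j]: items rated i by rater 1 and j by rater 2
    O = np.zeros((k, k))
    for a, b in zip(r1, r2):
        O[a, b] += 1
    # Expected matrix under chance agreement: outer product of marginals / n
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
    # Quadratic disagreement weights: squared score distance, normalized
    i, j = np.indices((k, k))
    W = ((i - j) ** 2) / (k - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

# Identical ratings give kappa = 1; maximally reversed ratings give kappa = -1
print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # → 1.0
```

In practice, two scoring passes by the same rater (intra-rater) or scores from two different raters (inter-rater) are passed as `r1` and `r2`, and the resulting kappa is read against a benchmark scale such as Landis and Koch (1977).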


References

Andrews, R. (2003). Research questions. Continuum.

Berry, K. J., Johnston, J. E., & Miele, P. W., Jr. (2008). Weighted kappa for multiple raters. Perceptual and Motor Skills, 107(7), 837–848. https://doi.org/10.2466/PMS.107.7.837-848

Blaxter, L., Hughes, C., & Tight, M. (2010). How to research (4th ed.). McGraw-Hill/Open University Press.

Brookhart, S. M. (2013). How to Create and Use Rubrics for Formative Assessment and Grading. ASCD.

Brown, H. D. (2018). Language assessment: Principles and classroom practices. Pearson.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners (No. arXiv:2005.14165). arXiv. https://doi.org/10.48550/arXiv.2005.14165

Cadman, K. (2002). English for academic possibilities: The research proposal as a contested site in postgraduate genre pedagogy. Journal of English for Academic Purposes, 1(2), 85–104.

Chaudhary, S., & Gupta, P. (2023). A Comprehensive Study on Chat GPT. Journal of Emerging Technologies and Innovative Research, 10(10), 196–201.

Checco, A., Bracciale, L., Loreti, P., Pinfield, S., & Bianchi, G. (2021). AI-assisted peer review. Humanities and Social Sciences Communications, 8(1), 25. https://doi.org/10.1057/s41599-020-00703-8

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.

Cohen, L., Manion, L., & Morrison, K. (2018). Research Methods in Education. Routledge.

Creswell, J. W. (2015). Educational Research: Planning, Conducting, and Evaluating Quantitative and Qualitative Research. Pearson.

Creswell, J. W., & Creswell, J. D. (2018). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. SAGE Publications, Inc.

Denscombe, M. (2020). Research proposals: A practical guide (2nd ed.). Open University Press.

Duong, N., Tong, T., & Le, D. (2024). Utilizing ChatGPT in checking academic writing for postgraduate students. In Proceedings of the AsiaCALL International Conference (pp. 193–203). https://doi.org/10.54855/paic.24614

Duong, T., & Le, T. (2024). Utilizing artificial intelligence in writing feedback: Benefits and challenges for first-year students at Hanoi University of Industry. In Proceedings of the AsiaCALL International Conference (pp. 238–249).

Fraenkel, J. R., Wallen, N. E., & Hyun, H. (2023). How to Design and Evaluate Research in Education. McGraw Hill.

Gwet, K. L. (2008). Intrarater Reliability. Wiley Encyclopedia of Clinical Trials.

Heriyawati, D. F., & Romadhon, M. G. E. (2025). “Can AI Be Trusted for My Thesis?” The Voices of Indonesian Higher Education Levels About ChatGPT in Automated Writing Evaluation (AWE). Computer-Assisted Language Learning Electronic Journal, 26(1), 58–75. https://doi.org/10.54855/callej.252614

Hoang, T., & Vu, T. (2024). Khám phá vai trò của trí tuệ nhân tạo AI – ChatGPT trong kỷ nguyên chuyển đổi số để ứng dụng vào giảng dạy tiếng Anh (ELT) tại Đại học Kinh tế - Kỹ thuật Công nghiệp (UNETI) [Exploring the role of artificial intelligence (AI) – ChatGPT in the era of digital transformation for application in English language teaching (ELT) at the University of Economics – Technology for Industries (UNETI)]. In Kỷ yếu hội thảo khoa học quốc gia: Ngôn ngữ học tính toán – những xu hướng mới, triển vọng và thách thức [Proceedings of the national scientific conference: Computational linguistics – new trends, prospects, and challenges] (pp. 31–39).

Hockly, N. (2019). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2023). Large Language Models are Zero-Shot Reasoners (No. arXiv:2205.11916). arXiv. https://doi.org/10.48550/arXiv.2205.11916

Kousha, K., & Thelwall, M. (2022). Artificial intelligence technologies to support research assessment: A review. Statistical Cybermetrics and Research Evaluation Group, University of Wolverhampton.

Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310

Latif, E., & Zhai, X. (2023). Fine-tuning ChatGPT for Automatic Scoring (No. arXiv:2310.10072). arXiv. https://doi.org/10.48550/arXiv.2310.10072

Lee, C. J., Sugimoto, C. R., Zhang, G., & Cronin, B. (2013). Bias in peer review. Journal of the American Society for Information Science and Technology, 64(1), 2–17. https://doi.org/10.1002/asi.22784

Liang, W., Zhang, Y., Cao, H., Wang, B., Ding, D., Yang, X., Vodrahalli, K., He, S., Smith, D., Yin, Y., McFarland, D., & Zou, J. (2023). Can large language models provide useful feedback on research papers? A large-scale empirical analysis (No. arXiv:2310.01783). arXiv. https://doi.org/10.48550/arXiv.2310.01783

Lin, J., Song, J., Zhou, Z., Chen, Y., & Shi, X. (2023). Automated scholarly paper review: Concepts, technologies, and challenges. Information Fusion, 98, 101830. https://doi.org/10.1016/j.inffus.2023.101830

Locke, L. F., Spirduso, W. W., & Silverman, S. J. (2007). Proposals That Work: A Guide for Planning Dissertations and Grant Proposals. SAGE Publications, Inc.

Luu, T. M. V., & Doan, Q. V. (2025). ChatGPT’s Impact on Listening Comprehension: Perspectives from Vietnamese EFL University Learners. Computer-Assisted Language Learning Electronic Journal, 26(3), 43–63. https://doi.org/10.54855/callej.252633

Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126(2), 161–169.

Menke, J., Roelandse, M., Ozyurt, B., Martone, M., & Bandrowski, A. (2020). The Rigor and Transparency Index Quality Metric for Assessing Biological and Medical Science Methods. iScience, 23(11), 101698. https://doi.org/10.1016/j.isci.2020.101698

Nguyen, T. S., Nguyen, T. D. T., Hoang, N. Q. N., & Do, T. K. H. (2025). How AI-Powered Voice Recognition Has Supported Pronunciation Competence among EFL University Learners. Computer-Assisted Language Learning Electronic Journal, 26(3), 64–83. https://doi.org/10.54855/callej.252634

OpenAI. (2022, November 30). Introducing ChatGPT. https://openai.com/index/chatgpt/

OpenAI. (2024a). GPT-4o System Card (p. 32).

OpenAI. (2024b, May 13). Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

Ormerod, C. M., Malhotra, A., & Jafari, A. (2021). Automated essay scoring using efficient transformer-based language models (No. arXiv:2102.13136). arXiv. https://doi.org/10.48550/arXiv.2102.13136

Paltridge, B., & Starfield, S. (2020). Thesis and dissertation writing in a second language: A handbook for students and their supervisors (2nd ed.). Routledge.

Pham, M. T., & Cao, T. X. T. (2025). The Practice of ChatGPT in English Teaching and Learning in Vietnam: A Systematic Review. International Journal of TESOL & Education, 5(1), 50–70. https://doi.org/10.54855/ijte.25513

Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm (No. arXiv:2102.07350). arXiv. https://doi.org/10.48550/arXiv.2102.07350

Rodriguez, P. U., Jafari, A., & Ormerod, C. M. (2019). Language models and Automated Essay Scoring (No. arXiv:1909.09482). arXiv. https://doi.org/10.48550/arXiv.1909.09482

Russell, S., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.

Sabzalieva, E., & Valentini, A. (2023). ChatGPT and artificial intelligence in higher education: Quick start guide – UNESCO Digital Library. United Nations Educational, Scientific and Cultural Organization. https://unesdoc.unesco.org/ark:/48223/pf0000385146.

Saigon University. (2018, January). Chiến lược phát triển Trường Đại học Sài Gòn đến năm 2025 và tầm nhìn [Development strategy of Saigon University toward 2025 and vision]. Saigon University.

Sim, J., & Wright, C. C. (2005). The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Physical Therapy, 85(3), 257–268. https://doi.org/10.1093/ptj/85.3.257

Spaapen, J. B., Dijstelbloem, H., & Wamelink, F. J. M. (2007). Evaluating research in context: A method for comprehensive assessment (2nd ed.). Consultative Committee of Sector Councils for Research and Development (COS).

Swales, J. (2004). Research genres: Explorations and applications. Cambridge University Press.

Syriani, E., David, I., & Kumar, G. (2023). Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews.

Tcherni-Buzzeo, M., & Pyrczak, F. (2024). Evaluating Research in Academic Journals. Routledge.

Thelwall, M. (2024). Can ChatGPT evaluate research quality? Journal of Data and Information Science, 9(2), 1–21. https://doi.org/10.2478/jdis-2024-0013

Thomas, R. M. (2003). Blending Qualitative and Quantitative Research Methods in Theses and Dissertations. Corwin.

Turabian, K. L., Booth, W. C., Colomb, G. G., Williams, J. M., Bizup, J., & FitzGerald, W. T. (2018). A Manual for Writers of Research Papers, Theses, and Dissertations: Chicago Style for Students and Researchers (9th ed.). University of Chicago Press.

Wang, Q., & Gayed, J. M. (2024). Effectiveness of large language models in automated evaluation of argumentative essays: Finetuning vs. zero-shot prompting. Computer Assisted Language Learning, 1–29. https://doi.org/10.1080/09588221.2024.2371395

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2025 Author and CALL-EJ