SDA: Using large language models to support teachers in teaching young people about critical thinking

The Slovak Debate Association (SDA) teaches young people about debating, democracy, human rights, and openness. To scale these efforts, we are helping SDA develop a generative-AI-based approach that automatically rates the answers of students taking part in educational activities such as the Critical Thinking Olympics. This approach will enable SDA to extend its activities and involve many more students. The AI-powered solution must provide students with truthful and trustworthy information, which remains a challenge for the current generation of language models.

In this project, we provide consultation services for the Slovak Debate Association, a non-profit organization that teaches young people to debate, develops their critical thinking, and supports democracy, human rights, and social engagement.

As part of its curriculum, SDA challenges students with various tests and quizzes that ask them to apply the skills they have acquired. The Critical Thinking Olympics is one of these activities, reaching thousands of students. However, as the number of students rises, manually evaluating every student’s answer is becoming difficult to manage.

This project uses generative language models (such as the well-known ChatGPT) to automatically annotate students’ answers. We keep a close eye on the truthfulness of the ratings, as we need to ensure that the AI rates students accurately. SDA implemented multiple countermeasures to identify and correct the hallucinations of the GPT models.

Firstly, the difficult task of grading open-ended answers is broken down into smaller independent problems that are less likely to cause hallucinations. Secondly, each prompt goes through multiple rounds of iteration in which it is tested, evaluated, and adjusted by humans on a small sample. Thirdly, methods such as ensembles of GPTs (where multiple GPTs vote on the grading) and self-validation (where the model is asked to check its own answers) are used. Fourthly, anomaly detection is performed on the grading, and outlier answers are graded either by a stronger model or by humans. Lastly, a small sample from each grading category is validated by humans, which ensures that the grading is done rigorously. This is a nice example of how state-of-the-art NLP technologies can help educate young people.
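The ensemble voting and escalation steps described above can be illustrated with a minimal Python sketch. It is not SDA's actual implementation: the names `ensemble_grade`, `min_agreement`, and the three stub graders are hypothetical, and in a real setup each grader would wrap a call to a language model rather than a simple rule.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical grader type: maps a student's answer to an integer score.
# In practice each grader would wrap one language-model call.
Grader = Callable[[str], int]


def ensemble_grade(
    answer: str,
    graders: Sequence[Grader],
    min_agreement: float = 0.6,  # assumed threshold, chosen for illustration
) -> Tuple[int, bool]:
    """Return (majority score, needs_review) for one answer.

    needs_review is True when the share of graders backing the
    majority score falls below min_agreement, so the answer can be
    passed to a stronger model or a human.
    """
    if not graders:
        raise ValueError("at least one grader is required")
    votes = [grade(answer) for grade in graders]
    score, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return score, agreement < min_agreement


# Stub graders standing in for independent model calls (illustrative only).
def lenient(answer: str) -> int:
    return 2 if len(answer.split()) >= 3 else 1


def strict(answer: str) -> int:
    return 2 if len(answer.split()) >= 6 else 1


def keyword(answer: str) -> int:
    return 2 if "because" in answer.lower() else 0
```

Calling `ensemble_grade(answer, [lenient, strict, keyword])` returns the majority score together with a flag. Answers on which the graders disagree are flagged for review instead of being accepted automatically, which mirrors the idea of routing uncertain cases to a stronger model or a human.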