Ján Čegiň has successfully defended his thesis on LLM-based data augmentation

We are thrilled to announce that our researcher Ján Čegiň has successfully defended his doctoral thesis titled Machine Learning With Human in the Loop for Textual Augmentation in the Era of LLMs. The thesis was supervised by Jakub Šimko and co-supervised by Peter Brusilovsky.

Ján began his doctoral journey in 2021 with the ambition to explore how humans and machines can collaborate when creating labelled data for low-resource domains. His early work focused on data augmentation, expanding existing data samples through techniques ranging from paraphrasing to synthetic sample generation. As part of this effort, he developed a crowdsourcing game for collecting adversarial examples for text classification tasks and published his first paper in mid-2022.

Then, ChatGPT arrived. 

Recognising the transformative potential of large language models, Ján quickly shifted his focus to understanding how this new generation of LLMs would impact human-computation tasks and text-oriented crowdsourcing. It soon became evident that LLMs offer substantial advantages for data augmentation, particularly when compared to traditional human-driven approaches.

Ján therefore built his thesis around the idea of integrating LLMs into textual augmentation, examining how LLM-based augmentation compares to human-centred approaches in terms of cost, performance, and effectiveness.

The thesis is oriented around four central questions: (1) the efficacy of LLMs versus human workers in data augmentation tasks, (2) the transferability of human computation techniques to LLM prompting, (3) the cost-benefit analysis of LLM-based augmentation compared to traditional methods, and (4) the impact of sample selection strategies on downstream model performance when using LLMs.

Through extensive experimentation, Ján demonstrated that LLMs can generate more diverse and valid textual data than human workers while significantly reducing costs. Additionally, incorporating human-inspired prompting techniques, such as providing hints or banning certain words, can improve model performance, although the impact on lexical diversity remains limited. His findings also reveal that LLM augmentation is particularly beneficial in low-resource settings where only a few seed samples are available. Finally, he evaluated various sample selection strategies and found that random sampling remains a strong baseline, while hint-based strategies yield the best results on out-of-distribution data.

The results of this thesis highlight the potential of LLM-based textual augmentation to surpass traditional methods under specific conditions, paving the way for more efficient and cost-effective data augmentation practices in the era of advanced language models.

Over the course of his doctoral studies, Ján published his results prolifically. He authored four first-author papers at various conferences (3× CORE Rank A*, 1× CORE Rank A) and contributed to three additional publications: a workshop paper at a CORE Rank A* conference, a joint first-author paper at a CORE Rank A* conference, and a co-authored paper at a CORE Rank A* conference.