Ján Čegiň

Research areas: natural language processing, crowdsourcing, human-computer interaction, human-inspired interaction with large language models

Position: Researcher

Ján focuses on natural language processing, machine learning, crowdsourcing and bridging human-computer interaction with large language models.

He holds a Master's degree in Intelligent Software Systems from the Slovak University of Technology, where he graduated with honors (cum laude).

He has participated in national research projects in cooperation with industry leaders such as ESET (malware detection) and Continental Automotive Systems (test data generation for the MC/DC criterion).

PhD topic: Machine learning with human in the loop in the era of LLMs

Supervising team: Jakub Šimko (KInIT), Peter Brusilovsky (University of Pittsburgh)

The rapid advancements in large language models (LLMs) have sparked interest in their potential to enhance data augmentation processes, particularly compared to traditional human-driven methods like crowdsourcing. This thesis investigates the integration of LLMs into textual augmentation, addressing how LLM-based augmentation compares to human-centred approaches regarding cost, performance, and effectiveness. Our research addresses four central questions: (1) the efficacy of LLMs versus human workers in data augmentation tasks, (2) the transferability of human computation techniques to LLM prompting, (3) the cost-benefit analysis of LLM-based augmentation compared to traditional methods, and (4) the impact of sample selection strategies on downstream model performance when using LLMs. Through extensive experimentation, we demonstrate that LLMs can generate more diverse and valid textual data than human workers while significantly reducing costs. Additionally, incorporating human-inspired prompting techniques, such as hints and chaining, can improve model performance, although the impact on lexical diversity remains limited. Our findings also reveal that LLM augmentation is particularly beneficial in low-resource settings where only a few seed samples are available. Furthermore, we evaluate various sample selection strategies and find that random sampling remains a strong baseline, while hint-based strategies yield the best results for out-of-distribution performance. The results of this thesis highlight the potential of LLM-based textual augmentation to surpass traditional methods under specific conditions and pave the way for more efficient and cost-effective data augmentation practices in the era of advanced language models.

Selected achievements

Member of PeWe, a team of excellence at the Slovak University of Technology in Bratislava, led by Prof. Mária Bieliková

Automated Recognition of Antisocial Behaviour in Online Communities

Feb 23, 2021
APVV-17-0267, 2018-2020. Partners: Comenius University in Bratislava, Technical University in Kosice. Principal investigator: Navrat, P.