Research Group

Web & User Data Processing

Ján Čegiň

Research areas: natural language processing, crowdsourcing, human-computer interaction, human-inspired interaction with large language models

Position: Researcher

Ján focuses on natural language processing, machine learning, crowdsourcing and bridging human-computer interaction with large language models.

He holds a Master's degree in Intelligent Software Systems from the Slovak University of Technology, where he graduated with honors (cum laude).

He has participated in national research projects in co-operation with industry leaders such as ESET (malware detection) and Continental Automotive Systems (test data generation for the MC/DC criterion).

PhD topic: Machine learning with human in the loop in the era of LLMs

Supervising team: Jakub Šimko (KInIT), Peter Brusilovsky (University of Pittsburgh)

The rapid advancements in large language models (LLMs) have sparked interest in their potential to enhance data augmentation processes, particularly compared to traditional human-driven methods like crowdsourcing. This thesis investigates the integration of LLMs into textual augmentation, addressing how LLM-based augmentation compares to human-centred approaches regarding cost, performance, and effectiveness. Our research addresses four central questions: (1) the efficacy of LLMs versus human workers in data augmentation tasks, (2) the transferability of human computation techniques to LLM prompting, (3) the cost-benefit analysis of LLM-based augmentation compared to traditional methods, and (4) the impact of sample selection strategies on downstream model performance when using LLMs. Through extensive experimentation, we demonstrate that LLMs can generate more diverse and valid textual data than human workers while significantly reducing costs. Additionally, incorporating human-inspired prompting techniques, such as hints and chaining, can improve model performance, although the impact on lexical diversity remains limited. Our findings also reveal that LLM augmentation is particularly beneficial in low-resource settings where only a few seed samples are available. Furthermore, we evaluate various sample selection strategies and find that random sampling remains a strong baseline, while hint-based strategies yield the best results for out-of-distribution performance. The results of this thesis highlight the potential of LLM-based textual augmentation to surpass traditional methods under specific conditions and pave the way for more efficient and cost-effective data augmentation practices in the era of advanced language models.

Selected achievements

Member of the PeWe excellence team at the Slovak University of Technology in Bratislava, led by Prof. Mária Bieliková

Selected Projects

Symbiosy: AI and Predictions for Smart buildings

Jun 16, 2021

In collaboration with Symbiosy by HB Reavis we aim at reaching both goals – to optimize the building efficiency and to improve the wellbeing of the people who interact with…

CEDMO: Central European Digital Media Observatory

Dec 7, 2021

As EDMO sets the frame for Europe, the aim of CEDMO is to implement an unprecedented, but highly experienced hub against disinformation for the Czech Republic, Poland and Slovakia. Its…

Automated Recognition of Antisocial Behaviour in Online Communities

Feb 23, 2021

APVV-17-0267. 2018-2020, Partners: Comenius University in Bratislava, Technical University in Kosice, Navrat, P. – principal investigator

Selected Publications

ChatGPT to Replace Crowdsourcing of Paraphrases for Intent Classification: Higher Diversity and Comparable Model Robustness

Cegin, J., Simko, J., Brusilovsky, P. – Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The emergence of generative large language models (LLMs) raises the question: what will be its impact on crowdsourcing? Traditionally, crowdsourcing has been used for acquiring solutions to a wide variety…

Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation

Cegin, J., Pecher, B., Simko, J., Srba, I., Bielikova, M., Brusilovsky, P. – Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) – ACL 2024

The latest generative large language models (LLMs) have found their application…

LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

Čegiň, J., Šimko, J., Brusilovsky, P. – Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Social media platforms are constantly shifting towards algorithmically curated content based on implicit or explicit user feedback. Regulators, as well as researchers, are calling for systematic social media algorithmic audits…

Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation

Pecher, B., Cegin, J., Belanec, R., Simko, J., Srba, I., Bielikova, M. – Findings of the Association for Computational Linguistics: EMNLP 2024

While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation…

A Game for Crowdsourcing Adversarial Examples for False Information Detection

Čegiň, J., Šimko, J., Brusilovsky, P. – AIofAI '22: 2nd Workshop on Adverse Impacts and Collateral Effects of Artificial Intelligence Technologies, CEUR-WS.org, 2022

Test Data Generation for MC/DC Criterion Using Reinforcement Learning

Cegin, J., Rastocny, K. – International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2020

Synthesized Dataset for Search-based Test Data Generation Methods Focused on MC/DC Criterion

Cegin, J., Rastocny, K., Bielikova, M. – 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C), 2020

All publications: see the Google Scholar profile
