KInIT is going to EMNLP 2025

We are pleased to share that the Kempelen Institute of Intelligent Technologies (KInIT) will present six papers at EMNLP 2025. These works explore various aspects of fact-checking, multilingual NLP, and low-resource language processing, with a particular focus on making large language models (LLMs) more robust, inclusive, and practical.

Here’s a closer look at each of them:

Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches (Main)

This paper examines several strategies to improve the multilingual and crosslingual retrieval of previously fact-checked claims. The study compares different negative example selection strategies in supervised settings and evaluates multiple multilingual models combined with reranking in unsupervised setups.

Authors: Alan Ramponi, Marco Rovera, Róbert Móro, Sara Tonelli
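
To give a flavour of the unsupervised setup, here is a minimal sketch of crosslingual claim retrieval with a multilingual sentence encoder and cosine similarity. The encoder name, the toy fact-check database, and the Slovak query are illustrative assumptions, not the paper's exact configuration; in the paper's pipelines, a dedicated reranker would then rescore the retrieved candidates.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative multilingual encoder; not necessarily one evaluated in the paper.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy fact-check database (in English) and an input claim in Slovak.
fact_checks = [
    "COVID-19 vaccines do not alter human DNA.",
    "5G networks do not spread viruses.",
]
query = "Vakcíny proti COVID-19 menia ľudskú DNA."

# Embed both sides into the same multilingual space.
query_emb = model.encode(query, convert_to_tensor=True)
db_emb = model.encode(fact_checks, convert_to_tensor=True)

# Rank fact-checks by cosine similarity; a reranker would rescore the top-k.
scores = util.cos_sim(query_emb, db_emb)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {fact_checks[idx]}")
```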

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages (Main)

This work systematically investigates how LLMs can generate synthetic training data for low-resource languages, covering 11 languages, 3 NLP tasks, and multiple LLMs. The results show that synthetic data can substantially narrow the performance gap relative to training on gold-standard data.

Authors: Ján Čegiň, Tatiana Anikina, Jakub Šimko, Simon Ostermann
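
As a rough illustration of the generation step, the sketch below asks an LLM to produce new labelled sentences in a target language, seeded with a few gold examples. The `generate` wrapper, the review-classification task, and the prompt wording are hypothetical placeholders, not the paper's setup; the study itself compares several such generation strategies.

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical wrapper around any LLM API; plug in your own client."""
    raise NotImplementedError

def synthesize_examples(language: str, label: str, seeds: list[str], n: int = 5):
    """Ask the LLM for n new labelled sentences, seeded with gold examples."""
    demos = "\n".join(f"- {s}" for s in random.sample(seeds, min(3, len(seeds))))
    prompt = (
        f"Write {n} short {label} product reviews in {language}.\n"
        f"Style them after these examples:\n{demos}\n"
        "Return one review per line."
    )
    lines = [l.lstrip("- ").strip() for l in generate(prompt).splitlines() if l.strip()]
    return [{"text": t, "label": label, "lang": language} for t in lines[:n]]
```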

Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance (Main)

This study explores how many labelled samples are required for small fine-tuned models to match or outperform large state-of-the-art LLMs such as LLaMA3 or ChatGPT. Interestingly, the authors find that a small BERT model needs only around 100 labelled samples to surpass them in text classification tasks.

Authors: Branislav Pecher, Ivan Srba, Mária Bieliková
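
Here is a minimal sketch of the "specialised small model" side of that comparison, assuming a Hugging Face setup: fine-tune a tiny BERT-style model on just 100 labelled examples. The dataset (SST-2), the checkpoint, and the hyperparameters are illustrative choices, not the paper's configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 100 labelled examples, around the break-even point reported in the paper.
train = load_dataset("sst2", split="train").shuffle(seed=42).select(range(100))

checkpoint = "prajjwal1/bert-tiny"  # illustrative small checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tiny-clf", num_train_epochs=10,
                           per_device_train_batch_size=8),
    train_dataset=train.map(tokenize, batched=True),
)
trainer.train()
```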

Large Language Models for Multilingual Previously Fact-Checked Claim Detection (Findings)

This paper evaluates seven LLMs and various prompting strategies to detect previously fact-checked claims across 20 languages. While the models perform well for high-resource languages, challenges persist for low-resource ones – though simple translation to English can substantially improve results. The findings highlight the potential of LLMs in building multilingual automated fact-checking systems, contributing to the global fight against misinformation.

Authors: Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Tatiana Anikina, Michal Gregor, Marián Šimko
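
The translation trick is simple enough to sketch: normalise both sides to English before asking the LLM whether a claim was already fact-checked. Both helper functions below are hypothetical placeholders for whatever MT system and LLM you use, and the yes/no prompt is an illustrative simplification of the paper's prompting strategies.

```python
def translate_to_english(text: str) -> str:
    """Hypothetical MT step; any reasonable translator will do."""
    raise NotImplementedError

def chat(prompt: str) -> str:
    """Hypothetical LLM call; stands in for any of the evaluated models."""
    raise NotImplementedError

def matches_fact_check(claim: str, fact_check: str, translate: bool = True) -> bool:
    # Normalising both sides to English is the simple step that helps
    # low-resource languages most in the paper's experiments.
    if translate:
        claim = translate_to_english(claim)
        fact_check = translate_to_english(fact_check)
    answer = chat(
        "Does the fact-check below address the claim? Answer Yes or No.\n"
        f"Claim: {claim}\nFact-check: {fact_check}"
    )
    return answer.strip().lower().startswith("yes")
```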

Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-Based Text Augmentation for Classification (Findings)

The paper investigates few-shot selection strategies for LLM-based text augmentation, i.e., how to choose the examples an LLM sees when generating new training data. Surprisingly, simple baselines such as random selection often perform competitively, especially in in-distribution settings (see the sketch below).

Authors: Ján Čegiň, Branislav Pecher, Jakub Šimko, Ivan Srba, Mária Bieliková, Peter Brusilovský
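
The headline baseline is easy to picture: sample the few-shot demonstrations uniformly at random before prompting the LLM for augmented data. The prompt wording and the toy example pool below are assumptions for illustration only.

```python
import random

def build_augmentation_prompt(pool: list[dict], k: int = 2, seed: int = 0) -> str:
    """Random few-shot selection: the baseline the paper finds hard to beat."""
    shots = random.Random(seed).sample(pool, k)
    demos = "\n".join(f'Text: "{ex["text"]}"\nLabel: {ex["label"]}' for ex in shots)
    return ("Paraphrase each example below to create new training data, "
            "keeping the label unchanged.\n" + demos)

pool = [
    {"text": "Great battery life.", "label": "positive"},
    {"text": "Screen cracked within a week.", "label": "negative"},
    {"text": "Does exactly what it promises.", "label": "positive"},
    {"text": "Support never replied.", "label": "negative"},
]
print(build_augmentation_prompt(pool))
```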

o-mega: Optimized Methods for Explanation Generation & Analysis (Demo)

This work introduces o-mega, a hyperparameter optimization framework for selecting the most effective explainability (XAI) methods in semantic-matching tasks, particularly within fact-checking pipelines. Tested on a curated post-claim dataset, o-mega automates the selection of explanation techniques and configurations, revealing that Occlusion achieves the best balance between technical fidelity and human interpretability.

Authors: Ľuboš Kriš, Jaroslav Kopčan, Qiwei Peng, Andrej Ridzik, Marcel Veselý, Martin Tamajka
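
To make "Occlusion in a semantic-matching task" concrete, here is a word-level sketch, assuming a sentence-encoder matcher: delete each word of a post and measure how far its similarity to the fact-checked claim drops. This illustrates the general occlusion idea rather than o-mega's implementation, and the encoder choice is arbitrary.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary encoder choice

def occlusion_importance(post: str, claim: str) -> list[tuple[str, float]]:
    """Delete each word of the post and measure how far its similarity
    to the claim drops; a large drop marks the word as important."""
    words = post.split()
    claim_emb = model.encode(claim, convert_to_tensor=True)
    base = util.cos_sim(model.encode(post, convert_to_tensor=True), claim_emb).item()
    importances = []
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        score = util.cos_sim(model.encode(ablated, convert_to_tensor=True),
                             claim_emb).item()
        importances.append((word, base - score))
    return importances

for word, imp in occlusion_importance(
    "Viral post says vaccines rewrite your DNA",
    "COVID-19 vaccines do not alter human DNA",
):
    print(f"{imp:+.3f}  {word}")
```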

Together, these papers demonstrate both the power and limitations of large language models, highlighting opportunities where smaller, specialized models or simpler strategies can still excel. KInIT would like to acknowledge its collaborators from Fondazione Bruno Kessler (FBK), Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), and the University of Copenhagen for their invaluable contributions. 

We look forward to engaging with the NLP community at EMNLP 2025 and sharing the outcomes of this research.