Natural Language Processing

Publication

Authors Blšták, M., Kopčan, J., Suppa, M., Havran, S., Findor, A., Takac, M. and Simko, M.

Published in Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

When the Dictionary Strikes Back: A Case Study on Slovak Migration Location Term Extraction and NER via Rule-Based vs. LLM Methods

This study explores the task of automatically extracting migration-related locations (source and destination) from media articles, focusing on the challenges posed by Slovak, a low-resource and morphologically complex language. We present the first comparative analysis of rule-based dictionary approaches (NLP4SK) versus Large Language Models (LLMs, e.g. SlovakBERT, GPT-4o) for both geographical relevance classification (Slovakia-focused migration) and specific source/target location extraction. To facilitate this research and future work, we introduce the first manually annotated Slovak dataset tailored for migration-focused locality detection. Our results show that while a fine-tuned SlovakBERT model achieves high accuracy for classification, specialized rule-based methods still have the potential to outperform LLMs for specific extraction tasks, though improved LLM performance with few-shot examples suggests future competitiveness as research in this area continues to evolve.

Cite: Miroslav Blšták, Jaroslav Kopčan, Marek Suppa, Samuel Havran, Andrej Findor, Martin Takac, and Marian Simko. 2025. When the Dictionary Strikes Back: A Case Study on Slovak Migration Location Term Extraction and NER via Rule-Based vs. LLM Methods. In Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025), pages 91–100, Vienna, Austria. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2025.bsnlp-1.11

Authors

Miroslav Blšták

AI Specialist

Jaroslav Kopčan

Research Engineer

Marián Šimko

Lead and Researcher