Diacritics Restorer: Automatic diacritics restoration for Slovak Google Docs

Diacritics restoration is one of the lesser-known Natural Language Processing (NLP) tasks. It is rarely discussed because it is not an issue for major languages such as English. In Slovakia, however, this problem affects us significantly.

In the Slovak language, one word with different diacritics can have completely different meanings (e.g. zástavka = a small flag, zastávka = a pause or a bus stop). The need for grammatically correct and professional-looking documents is strong, but people often do not have enough time to type all the words with complicated diacritics properly.
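To make the ambiguity concrete, the following minimal Python sketch shows how the two example words collapse into the same string once diacritics are removed. The helper names (strip_diacritics, group_by_stripped_form) are hypothetical and chosen for illustration only; they are not part of the Diacritics Restorer tool.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents, carons, ...) from the text."""
    # NFD splits e.g. "á" into "a" + a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def group_by_stripped_form(words):
    """Group words that become identical once diacritics are removed."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_diacritics(word)].add(word)
    return dict(groups)


# The two words from the example above share one diacritics-free form.
groups = group_by_stripped_form(["zástavka", "zastávka"])
for stripped_form, candidates in groups.items():
    print(stripped_form, "<-", sorted(candidates))
```

Any group with more than one candidate is a case where the correct diacritics cannot be recovered from the word alone and the surrounding context has to decide.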

At KInIT, we are exploring how to continuously improve NLP, while focusing on different tasks such as sentiment analysis, determining the semantic similarity of texts or language modelling. We wanted to make using Slovak diacritics faster and more efficient, so we created a practical tool that is useful for anyone who creates Google Docs in the Slovak language.

The task of automatic diacritics restoration is essentially a task of predicting the next character based on its context: the preceding (and sometimes the following) words or characters. We have built a diacritics restoration tool based on a bidirectional two-layer recurrent neural network model, which we trained on a large number of Slovak texts [1]. Our model learned to add and remove diacritics by “reading” these texts.
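One common way to frame this as a learning problem is to strip the diacritics from correctly written text and ask a model to predict, for each character, its original form from the surrounding characters. The Python sketch below only illustrates how such character-level training examples could be built; it is an assumption for illustration and not the actual pipeline behind the tool. The function names and the window parameter are hypothetical, and the recurrent network itself is not shown.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents, carons, ...) from the text."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def make_examples(text: str, window: int = 3):
    """Build (left context, input char, right context, target char) tuples.

    The input side never contains diacritics; the target is the original
    character. `window` is a hypothetical context size in characters.
    """
    target = unicodedata.normalize("NFC", text)
    source = strip_diacritics(target)
    # Slovak letters with diacritics have single-character NFC forms, so
    # both strings align position by position; fail loudly otherwise.
    if len(source) != len(target):
        raise ValueError("text contains characters without a precomposed form")
    examples = []
    for i, target_char in enumerate(target):
        left = source[max(0, i - window):i]
        right = source[i + 1:i + 1 + window]
        examples.append((left, source[i], right, target_char))
    return examples


examples = make_examples("zastávka")
for left, char, right, target_char in examples:
    print(f"{left!r:>7} [{char}] {right!r:<7} -> {target_char}")
```

A bidirectional model corresponds to using both the left and the right context; a model that reads only the preceding characters would ignore the third element of each tuple.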

Automated diacritics recovery can also be used as a part of text pre-processing and analysis. The presence of diacritics in text can significantly affect the meaning of the text, so restoring diacritics could facilitate better understanding in other language processing tasks.

Diacritics Restorer is a Google Docs add-on for fast and automated addition and removal of diacritics in Slovak online documents. You will find all the necessary information about this handy assistant here: diakritikovac.kinit.sk (website in Slovak only).

The Diacritics Restorer tool is completely free and available to everyone. Installation is fast, and we paid special attention to privacy and ethical considerations. On top of that, our method of adding diacritics uses a compact model that is more environmentally friendly than other available solutions based on large neural language models, which consume more energy because of their higher computational and memory complexity.

[1] Náplava, Jakub; Straka, Milan; Hajič, Jan and Straňák, Pavel, 2018, Corpus for training and evaluating diacritics restoration systems, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-2607.

Project team

Martin Konôpka
Researcher 10/2020 – 12/2021
Marián Šimko
Lead and Researcher
Miroslav Blšták
Research Engineer