Diacritics Restorer: Automatic diacritics restoration for your online documents in Slovak

Say goodbye to embarrassing diacritic-free typing.

The pandemic has greatly accelerated the shift of office work into the virtual space. To stay efficient, we have grown used to working collaboratively online. When we write, we often skip accents and diacritics for speed or convenience. However, Slovak text without diacritics is harder to read, can cause misunderstandings, and can make the author look unprofessional.

At KInIT, we are fascinated by automatic Natural Language Processing (NLP). We explore how to keep improving it, focusing on tasks such as sentiment analysis, determining the semantic similarity of texts, and language modelling.

One of the smaller NLP tasks is diacritics restoration. The topic is not widely discussed because it is not a problem for major world languages such as English.

In Slovakia, however, this problem affects us significantly. One word with different diacritics can have completely different meanings (e.g. zástavka = a small flag, zastávka = a pause or a bus stop). When we need professional-looking documents, we don’t want filling in the diacritics to take forever.

The task of automatic diacritics restoration is essentially one of predicting the correct form of a character based on the preceding (and sometimes the following) words or characters, that is, its context. We have built a diacritics restoration tool based on a bidirectional two-layer recurrent neural network model, which we trained on a large number of Slovak texts [1]. Our model learned to add and remove diacritics by “reading” them.
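To make the character-level framing concrete, here is a minimal Python sketch of how input/target pairs for such a model could be derived from ordinary text. It is an illustration only, not the pipeline we used: the helper names strip_diacritics and training_pairs are hypothetical, and the sketch assumes that removing Unicode combining marks is an acceptable way to simulate diacritic-free typing.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Simulate diacritic-free typing by dropping Unicode combining marks."""
    # NFD splits a letter such as "á" into the base letter "a" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def training_pairs(text: str) -> list[tuple[str, str]]:
    """Pair each stripped character (model input) with its original form (target)."""
    # Assumption: in NFC form every Slovak letter is a single code point,
    # so input and target stay aligned character by character.
    composed = unicodedata.normalize("NFC", text)
    return [(strip_diacritics(ch), ch) for ch in composed]


# Two different words collapse into the same diacritic-free string,
# which is why the model needs the surrounding context to choose between them.
print(strip_diacritics("zástavka"))
print(strip_diacritics("zastávka"))
print(training_pairs("ľad"))
```

Both example words print as the same diacritic-free string, so a model that looked at a single character or word in isolation could not tell them apart.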

We were interested to see how our tool compares to other existing solutions, so we validated it scientifically.

There are already a number of solutions that achieve high success rates (e.g. based on large neural language models). However, their improvement of a few tenths or hundredths of a percent often comes at the expense of higher computational and memory complexity, which in practice can translate into higher energy consumption and a greater ecological burden. Our method of diacritics restoration, which uses a more compact model, is more considerate in this respect.

We are researchers, but we did not treat automatic diacritics restoration as merely a scientific experiment. We developed a practical tool for everyone working with documents in an online environment. We are thrilled to introduce our Diacritics Restorer tool. It is a Google Docs add-on for fast, automated addition and removal of diacritics in Slovak online documents. You will find all the necessary information about this handy assistant on this page: diakritikovac.kinit.sk (website in Slovak only).

We believe that thanks to the quick installation and ease of use, Diacritics Restorer will find many fans outside of KInIT. We have also paid special attention to privacy and ethical considerations. The Diacritics Restorer tool is completely free and available to everyone (who can type in Slovak 🙂 ).

By the way, automated diacritics restoration can also be used as a part of text pre-processing and analysis. The presence of diacritics in text can significantly affect the meaning of the text, so restoring diacritics could facilitate better understanding in other language processing tasks.

PS: The Slovak version of this article was written without diacritics. We restored diacritics using the KInIT Diacritics Restorer tool and only two words out of 437 were restored incorrectly 🙂
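For the curious, the figure above works out to a word-level accuracy of roughly 99.5%. A minimal sketch of the arithmetic follows; word_accuracy is a hypothetical helper written for this illustration, not part of the tool.

```python
def word_accuracy(total_words: int, wrong_words: int) -> float:
    """Return the share of words that were restored correctly."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (total_words - wrong_words) / total_words


# 2 incorrectly restored words out of 437, as reported above.
print(f"{word_accuracy(437, 2):.2%}")  # 99.54%
```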

[1] Náplava, Jakub; Straka, Milan; Hajič, Jan and Straňák, Pavel, 2018, Corpus for training and evaluating diacritics restoration systems, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-2607.