Introducing the first public large neural Slovak language model – SlovakBERT


Author

Matúš Pikuliak

Marián Šimko


Oct 6, 2021


KInIT and Gerulata Technologies introduce SlovakBERT, a new language model for Slovak, which will help improve the automatic processing of texts written in Slovak.

Neural language models have recently become the state-of-the-art technology for natural language processing (NLP). Researchers have improved results on many NLP tasks with these models, and they also serve as the technological foundation for applications such as Google Search or Google Translate, which are used by billions of people every day. Such models were initially created mainly for English and subsequently for other widely used languages, such as Chinese or French. Models for smaller languages, such as Czech and Polish, appeared later. Multilingual models are also available nowadays.

Today we present the first such modern language model for Slovak, built on the so-called transformer architecture – SlovakBERT¹. The model was trained by our partner, Gerulata Technologies, with consultation and scientific evaluation by our NLP team. SlovakBERT learned Slovak from about 20 GB of Slovak text collected from the Web. These data are a snapshot of what the Slovak language looks like to the model.

Try it out!

You can learn more about SlovakBERT and experiment with how it works. Just visit the SlovakBERT website and try it out for yourself.

Training SlovakBERT was not an easy task: it required almost two weeks of computation on a powerful server. By comparison, a computer with a mid-range graphics card might take years to finish the computations, and a regular work laptop perhaps decades. SlovakBERT is now open to the world and accessible² to the NLP community. We believe that this step will improve the level of automated Slovak language processing for researchers, companies and the general public.

As researchers, we verified the potential of SlovakBERT and tested how well it works for various tasks. We have found that it achieves excellent results in grammatical analysis, semantic analysis, sentiment analysis and document classification. We described the results of these experiments in the publicly available article SlovakBERT: Slovak Masked Language Model³. The model proved to be so good that we are already using it in projects with our industry partners, and it might soon appear in the first deployed applications, for example in an upcoming system for analysing the sentiment of customer communication on public social network profiles.
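The "masked language model" in the article's title refers to a training objective in which a word is hidden and the model learns to predict it from the words on both sides. The following is only a toy sketch of that idea, using simple word counts instead of a neural network; the tiny corpus and the function names `train_context_counts` and `fill_mask` are hypothetical illustrations and are not part of SlovakBERT or its code.

```python
from collections import Counter, defaultdict

MASK = "<mask>"  # placeholder token for the hidden word (illustrative choice)


def train_context_counts(sentences):
    """Count which word appears between each (left, right) neighbour pair."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            counts[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1
    return counts


def fill_mask(counts, masked_sentence):
    """Return the word seen most often in the mask's two-sided context, or None."""
    tokens = ["<s>"] + masked_sentence.lower().split() + ["</s>"]
    i = tokens.index(MASK)
    candidates = counts.get((tokens[i - 1], tokens[i + 1]))
    return candidates.most_common(1)[0][0] if candidates else None


# Hypothetical three-sentence "corpus"; the real model saw about 20 GB of text.
corpus = [
    "Bratislava je hlavné mesto Slovenska",
    "Praha je hlavné mesto Česka",
    "Košice sú veľké mesto",
]
counts = train_context_counts(corpus)
prediction = fill_mask(counts, "Viedeň je <mask> mesto Rakúska")
print(prediction)
```

A real model replaces the lookup table with a neural network that generalises to contexts it has never seen, but the task is the same: use the left and the right context together to guess the hidden word.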

We are also aware of the possible pitfalls of such a model. Since it is trained on text available on the Web, there is no filtering mechanism by which we can verify the suitability of this text. SlovakBERT therefore also learned from texts containing vulgarities, conspiracy theories, prejudices, stereotypes and many other negative phenomena that Slovak language users produced on the Web. It is therefore, in a sense, a mirror of everything that happens in society. In the near future, we plan to research this issue: how to identify various prejudices in language models and, if necessary, suppress them.

¹ The name follows the original BERT model from Google, which was trained for English. It is an abbreviation of “Bidirectional Encoder Representations from Transformers”, i.e. the technology used for deep learning of the neural network.

² SlovakBERT on GitHub

³ Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, Filip Uhlárik. 2021. SlovakBERT: Slovak Masked Language Model
