Ivana Beňová
Research areas: natural language processing, machine learning, deep learning
Position: PhD Student
Ivana is a research assistant on the NLP team, working on low-resource language processing, transparency and explainability of NLP, and multimodal content processing. Her academic interests are machine learning and deep learning.
She holds a Master's degree in Mathematical Economics, Finance and Modelling from the Faculty of Mathematics, Physics and Informatics of Comenius University in Bratislava, where she graduated with honors. She focused on mathematical modelling, including time series analysis, optimization, dimensionality reduction, and data categorization and classification. During her studies, she participated in the CFA Research Challenge, an annual global competition in financial analysis, where her school's team placed 2nd in the Czech-Slovak round. She also presented her Master's thesis, Generation of origin-destination matrices in transport modelling, at the faculty's Student Science Conference 2021.
PhD topic: Understanding and grounding of multimodal image-language models
Supervising team: Marián Šimko (KInIT), Jana Košecká (George Mason University)
Developments in artificial intelligence subfields such as machine learning, natural language processing (NLP), and computer vision (CV) have increased interest in solving problems that combine linguistic and visual data. Solving complex image-language challenges may be a stepping stone for artificial intelligence. The majority of the advancements in image-language modeling, along with advances in CV and NLP, were made possible by deep learning. Various multimodal models were introduced after self-attention and self-supervised training were extended to incorporate visual information.
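The self-attention mechanism mentioned above can be sketched in a few lines. The following is a minimal single-head, scaled dot-product illustration in NumPy; the matrix shapes and weights are arbitrary placeholders, not any specific model's implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                            # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

In multimodal models, the same mechanism is applied over a sequence that mixes text tokens with visual tokens (e.g., image patch embeddings), which is what lets the model relate words to image regions.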
In our proposal, we surveyed the types of state-of-the-art image-language models, focusing on their differences, pretraining strategies, advantages, and usability for downstream tasks. Because these models are deep neural networks, the relationships they learn between language and vision tokens are not directly interpretable. We therefore define the concept of grounding language in vision. We also studied different probing methods, mainly the frequently used image-text matching, to explore how image-language models understand linguistic aspects.
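The idea behind the image-text matching probe can be illustrated schematically. In the sketch below, placeholder NumPy vectors stand in for embeddings from a real pretrained image-language encoder; in an actual probe, the model scores an image against its true caption and a linguistically perturbed foil (e.g., with swapped word order), and a correct model should rank the true caption higher:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def itm_probe(image_emb, caption_embs, correct_idx):
    """Image-text matching probe: does the model score the correct
    caption above its perturbed variants for a given image?"""
    scores = [cosine(image_emb, c) for c in caption_embs]
    return int(np.argmax(scores)) == correct_idx

# Placeholder embeddings; in practice these come from a pretrained
# image-language encoder applied to an image and candidate captions.
rng = np.random.default_rng(1)
image_emb = rng.normal(size=16)
true_cap = image_emb + 0.1 * rng.normal(size=16)   # close to the image
foil_cap = rng.normal(size=16)                     # unrelated / perturbed
print(itm_probe(image_emb, [true_cap, foil_cap], correct_idx=0))
```

Aggregating this binary outcome over a dataset of caption/foil pairs that differ in one linguistic aspect (word order, negation, counting, etc.) measures how well the model grounds that aspect.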
Based on this extensive survey of understanding and grounding in image-text models, we identified three open problems that we want to study in the following years:
(1) how to eliminate the identified challenges of the image-text matching probing technique,
(2) the lack of knowledge about how different linguistic aspects are grounded in vision in pretrained image-language models, and
(3) how to improve language grounding in these models using the knowledge obtained from the previous experiments.