Multimodal processing: Can Artificial Intelligence learn the meaning and relationship between several different modalities?

Artificial intelligence has become a hot topic in many industries as its use grows every day. Even though we are far from an artificial general intelligence that would be indistinguishable from a human being, we can now easily and quite reliably solve complicated tasks such as:

  • translating complicated text into a selected language,
  • recognizing faces and automatically opening a door only for a person whose photo is in the employee database,
  • detecting and localizing an empty parking space in video footage from a car.

All of these examples use complex, task-specific deep learning models to predict the required output with high accuracy (translated sentences, permission or refusal to open the door, location of a free parking space, …). These models take only one modality as input at a time, so we call them unimodal.

Modality refers to a particular way or mechanism of encoding information. Examples of different modalities are image, video, text and audio.

However, in the real world, these inputs usually occur together. By processing them jointly, we can gain more reliable information. An example is shown in Figure 1. If we want to automatically collect feedback on a movie from cinema visitors, we can focus on their facial expressions, ask them and note their opinion, or do both. Then, when one modality is ambiguous, we can rely on the other.

Figure 1: The advantage of multimodal models over unimodal models in sentiment analysis. In the unimodal model we use only one modality at a time to predict the sentiment. The blue input represents text, the orange input represents facial expression or image and the yellow input represents the audio recording. In the bimodal model we use two of the modalities and in the trimodal model we use all three to predict the sentiment. Using more than one modality helps to better predict the feedback of cinema visitors.
Source: [1]
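
The idea of relying on a second modality when the first one is ambiguous can be sketched as a simple late fusion of per-modality predictions. This is only a minimal illustration and not the tensor fusion method of [1]; the function name `fuse_sentiment`, the label order and all probability values are made up for the example.

```python
def fuse_sentiment(predictions, labels=("negative", "neutral", "positive")):
    """Average the class probabilities predicted by each available modality.

    `predictions` maps a modality name to a list of probabilities,
    one per label. Returns the winning label and the fused probabilities.
    """
    if not predictions:
        raise ValueError("at least one modality is required")
    fused = [0.0] * len(labels)
    for probs in predictions.values():
        if len(probs) != len(labels):
            raise ValueError("each modality needs one probability per label")
        for i, p in enumerate(probs):
            fused[i] += p / len(predictions)
    best = max(range(len(labels)), key=lambda i: fused[i])
    return labels[best], fused


# Hypothetical model outputs: the facial expression is ambiguous,
# while the stated opinion (text) is clearly positive.
face_only = {"face": [0.34, 0.33, 0.33]}
face_and_text = {"face": [0.34, 0.33, 0.33], "text": [0.05, 0.15, 0.80]}

print(fuse_sentiment(face_only)[0])
print(fuse_sentiment(face_and_text)[0])
```

With the face alone the decision rests on a margin of one percentage point; adding the text prediction makes the fused result clearly positive.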

What is multimodality and when is a model multimodal?

Multimodality in machine learning occurs when two or more inputs that are recorded on different types of media, and that cannot be mapped unambiguously into one another by an algorithm, are processed by the same machine learning model. This means that a deep learning model that processes video and written text is multimodal, whereas a model that processes images in PDF and JPG formats (which can be mapped into one another) is unimodal.

Human beings are naturally good at understanding and connecting multiple modalities without even realizing it. While we are watching a movie, we are able to recognize or locate objects, scenes and activities that we see on the screen. We can read subtitles, recognize relationships between characters and focus on words they are saying and their meanings. We can understand the emotions or importance of the situation through the intensity and emphasis of voices. All these inputs and modalities are processed in our brain at the same time and we create meaningful and understandable concepts from them. Simply put, the human brain is multimodal.

At KInIT we have started to focus on vision-language modeling, or more specifically image-language modeling, which is the processing of images and text with the same model. This area of multimodal processing has many applications, such as helping visually impaired people, streamlining healthcare with automatic descriptions of X-ray or CT scans, or understanding images posted on social media and banning those that contain inappropriate content.

Currently we are starting a Horizon Europe project, DisAI, where we focus on combating disinformation using AI, which also includes research on combined image and text disinformation.

The development of models used for joint processing of image and text: from statistical models to transformers

Image-language research has advanced considerably in recent years. Before neural networks, early attempts at joint image and language models relied on statistical algorithms such as canonical correlation analysis. Canonical correlation analysis finds a joint representation as a linear combination of previously extracted image and text representations.

After the emergence of neural networks, more elaborate methods were introduced. At first, a CNN for images was combined with an LSTM or another word-embedding technique for text, and the two representations were fused by concatenation, element-wise vector multiplication, or later an attention mechanism. One of these methods is depicted in Figure 2.

Figure 2: An example of one of the first multimodal representations created with neural networks. The green model shows the embedding of images, using a CNN to extract image features. The blue model uses a skip-gram approach to embed the text. These two representations are then concatenated to create the multimodal word vector.
Source: [2]
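
The two simplest fusion operations mentioned above, concatenation and element-wise multiplication, can be sketched in a few lines. The vectors below are made-up stand-ins for CNN image features and text embeddings; real feature vectors have hundreds or thousands of dimensions.

```python
def concatenate(image_vec, text_vec):
    """Fuse by placing the two vectors side by side.

    The result has len(image_vec) + len(text_vec) dimensions.
    """
    return list(image_vec) + list(text_vec)


def elementwise_product(image_vec, text_vec):
    """Fuse by multiplying matching dimensions.

    Both vectors must have the same length, so in practice they are
    first projected to a common size.
    """
    if len(image_vec) != len(text_vec):
        raise ValueError("vectors must have the same dimension")
    return [a * b for a, b in zip(image_vec, text_vec)]


# Hypothetical toy features (assumed values, for illustration only).
image_features = [0.5, 1.0, -2.0]
text_features = [2.0, 0.0, 1.0]

print(concatenate(image_features, text_features))
print(elementwise_product(image_features, text_features))
```

Concatenation keeps all information from both modalities and leaves the interaction to later layers, while the element-wise product models a direct interaction but requires equal dimensions.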

After Vaswani et al. introduced the transformer architecture [3], which achieved enormous success and state-of-the-art results on NLP tasks, self-attention and later cross-attention began to be used to combine language and images.

There are two main types of image-language transformers for modeling cross-modal interaction: single-stream and dual-stream.

In a single-stream transformer, a BERT-like architecture is used: the text embeddings and image features, together with special embeddings that indicate position and modality, are concatenated and fed into a single transformer-based encoder. Examples of such models are VisualBERT [4], VL-BERT [5] and OSCAR [6].
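
How such a single input sequence is assembled can be sketched as follows. This is a simplified illustration under assumed toy values, not the exact input pipeline of any of the models above, which differ in details such as how image regions are position-encoded; the function name `build_single_stream_input` and all embedding values are hypothetical.

```python
def build_single_stream_input(text_emb, image_feats, segment_emb, position_emb):
    """Concatenate text and image vectors into one sequence and add a
    modality (segment) embedding and a position embedding to each element."""
    sequence = [("text", v) for v in text_emb] + [("image", v) for v in image_feats]
    if len(sequence) > len(position_emb):
        raise ValueError("not enough position embeddings for this sequence")
    result = []
    for pos, (modality, vec) in enumerate(sequence):
        seg = segment_emb[modality]
        p = position_emb[pos]
        result.append([x + s + q for x, s, q in zip(vec, seg, p)])
    return result


# Toy 2-dimensional embeddings (assumed values).
text_emb = [[1.0, 0.0], [0.0, 1.0]]          # two text tokens
image_feats = [[0.5, 0.5]]                   # one image region
segment_emb = {"text": [0.1, 0.1], "image": [0.2, 0.2]}
position_emb = [[0.0, 0.0], [0.01, 0.01], [0.02, 0.02]]

sequence = build_single_stream_input(text_emb, image_feats, segment_emb, position_emb)
print(len(sequence))  # one vector per text token and image region
```

The resulting list of vectors is what a single transformer encoder would receive, so self-attention can relate every text token to every image region from the first layer on.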

Dual-stream transformers, on the other hand, first process the two modalities with separate transformers and then combine them using cross-attention, where the query vectors come from one modality, while the key and value vectors come from the other. Examples of such models are ViLBERT [7], LXMERT [8] or ALBERT [9]. The difference between the single-stream and dual-stream architectures is shown in Figure 3.

Figure 3: Comparison between single-stream (left) and dual-stream (right) image-language transformer. 
Source: [10]
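
The cross-attention step used by dual-stream models can be sketched as scaled dot-product attention in which the queries come from text tokens and the keys and values come from image regions. This is a minimal sketch: it omits the learned query, key and value projections and the multiple heads that real transformers use, and the toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy example: two text tokens attend over three image regions.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]

attended = cross_attention(text_tokens, image_regions, image_regions)
print(attended)
```

Each text token ends up with a representation that is a mixture of the image regions most similar to it. Swapping the roles of the two modalities gives the image-to-text direction.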

In contrast with image-language transformers, there are also dual encoders, which use two single-modal encoders to encode the two modalities separately. The resulting embeddings are then projected into the same semantic space using an attention layer or a dot product, and similarity scores between image-text pairs are calculated and maximized for matching pairs. The most famous dual encoder is CLIP [11]; its pre-training is shown in Figure 4. More detailed descriptions of the differences, advantages and disadvantages can be found, for example, in the survey paper [12].

Figure 4: CLIP architecture: first the image and text are encoded separately, and then the representations are projected into the same semantic space using a dot product. The similarity score is maximized for captions and images that match (on the diagonal) and minimized for those that do not match.
Source: [11]
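
The objective described in the caption can be sketched as a symmetric contrastive loss over a batch of image and text embeddings. This is a simplified sketch under stated assumptions: the embeddings are toy values rather than encoder outputs, and the temperature is fixed here, whereas CLIP [11] learns it during training.

```python
import math


def normalize(v):
    """Scale a vector to unit length so the dot product is a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Entry [i][j] is the cosine similarity of image i and text j."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: matching pairs lie on the diagonal of the matrix."""
    sims = similarity_matrix(image_embs, text_embs)
    n = len(sims)
    logits = [[s / temperature for s in row] for row in sims]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy batch (assumed values): text k describes image k.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]

print(contrastive_loss(images, texts))        # matched pairs: low loss
print(contrastive_loss(images, texts[::-1]))  # mismatched pairs: high loss
```

Minimizing this loss pushes the diagonal entries of the similarity matrix up and the off-diagonal entries down, which is the behavior shown in Figure 4.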

What are the image-language models used for?

There are many more models than the ones mentioned above, and their usage is sometimes tied to specific image-language tasks. The scientific literature describes a variety of more or less specific tasks.

The first, highly popular task is image description generation, where the goal is to generate a meaningful and grammatically correct description of the whole image. A more complicated one is image storytelling, where a description has to be generated for each image in a set so that together they form a short story.

The next two tasks focus on referring expressions. These are noun phrases that unambiguously describe a given object or person in the image (e.g. the woman in the red hat next to the man with the dog). They can be generated (referring expression generation), when an object in the image is selected and the noun phrase has to be produced. Or they can be comprehended (referring expression comprehension), when the image and the noun phrase are given and the location of the mentioned object or person needs to be found.

Another well-known task is image question answering (also called visual question answering), where the goal is to answer questions about the image. The answer can either be selected from multiple choices or generated word by word. Image reasoning is a harder variant of visual question answering, in which very sophisticated questions need to be answered by reasoning about the visual world. A natural continuation of visual question answering (just as image storytelling is of image description generation) is image dialog, where the model answers a series of questions while remembering its previous answers and connecting them.

A very interesting but less commonly discussed task is image entailment. The model has to decide whether the image entails a textual description (the hypothesis), contradicts it, or whether this cannot be decided.

In recent years, image generation has become a very popular task. It is the inverse of image captioning: a textual description is given and the model generates the image. The results are more precise than ever thanks to recent models such as DALL-E 2 [13].

Of course, various survey papers (such as [14]) list other tasks that involve images and text, but the ones mentioned above are the most popular.

Existing problems and open questions in image-language processing

However, even with state-of-the-art transformers specialized for different tasks, we are still far from a flawless image-language model. Several open problems in this area are under active study.

One of them is that the size of models (number of parameters) and the size of datasets are growing at incredible speed. As a result, some researchers can no longer train competitive models, because they lack access to the necessary computational power. This trend is also harmful to the environment, as more and more energy is needed to train such models.

Another one is known as object hallucination. It happens, for example, during image caption generation, when the generated caption contains a word describing an object that is not in the image. The model is used to seeing the object in the given context and relies on that more than on the actual visual input. This problem is also connected with the difficulty of evaluating generated text, as one image can have multiple different but correct captions.
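
A rough way to quantify object hallucination is to compare the object words in a generated caption with the objects annotated in the image. The sketch below is a simplified illustration, not an established metric implementation: it assumes a fixed object vocabulary and exact word matching (no synonyms or plurals), and the caption and annotations are made up.

```python
def mentioned_objects(caption, vocabulary):
    """Return the words of the caption that name an object from the vocabulary."""
    words = [w.strip(".,!?") for w in caption.lower().split()]
    return [w for w in words if w in vocabulary]


def hallucinated_objects(caption, objects_in_image, vocabulary):
    """Objects mentioned in the caption that are not annotated in the image."""
    return [w for w in mentioned_objects(caption, vocabulary)
            if w not in objects_in_image]


def hallucination_rate(caption, objects_in_image, vocabulary):
    """Fraction of mentioned objects that are hallucinated (0.0 if none mentioned)."""
    mentioned = mentioned_objects(caption, vocabulary)
    if not mentioned:
        return 0.0
    return len(hallucinated_objects(caption, objects_in_image, vocabulary)) / len(mentioned)


# Hypothetical example: the model adds a fork because forks
# often appear next to plates in the training data.
vocabulary = {"man", "table", "fork", "plate", "dog"}
objects_in_image = {"man", "table", "plate"}
caption = "A man sits at a table with a fork and a plate."

print(hallucinated_objects(caption, objects_in_image, vocabulary))
print(hallucination_rate(caption, objects_in_image, vocabulary))
```

Such a check only covers objects that are annotated, which is one reason why evaluating generated captions remains difficult.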

We can also mention other problems. For example, datasets often contain statistical biases, so that tasks that require information from both modalities with equal importance become solvable by models which exploit data biases in a single modality to make predictions. 
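
How a model can exploit bias in a single modality can be illustrated with a "blind" question answering baseline that never looks at the image. The tiny dataset below is invented for the example, and the function names are hypothetical; the point is only that a skewed answer distribution lets a text-only model score well.

```python
from collections import Counter, defaultdict


def train_blind_baseline(examples):
    """Learn the most frequent answer for each question, ignoring the image."""
    counts = defaultdict(Counter)
    for question, _image, answer in examples:
        counts[question][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}


def accuracy(model, examples):
    """Share of examples where the memorized answer matches the true answer."""
    correct = sum(1 for q, _image, a in examples if model.get(q) == a)
    return correct / len(examples)


# Invented (question, image id, answer) triples with a skewed answer distribution.
examples = [
    ("what color is the banana?", "img1", "yellow"),
    ("what color is the banana?", "img2", "yellow"),
    ("what color is the banana?", "img3", "green"),
    ("is there a dog?", "img4", "yes"),
    ("is there a dog?", "img5", "yes"),
]

blind_model = train_blind_baseline(examples)
print(blind_model["what color is the banana?"])
print(accuracy(blind_model, examples))
```

A high score from such a baseline signals that the benchmark can be partly solved without using the image at all, so a good score from a multimodal model does not by itself show that it uses both modalities.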

Another problem is that large transformers have hundreds of millions of parameters, which makes them incomprehensible to humans, so their choices and results cannot be straightforwardly explained. There is also a problem with generalization, as the models only learn what they see in the training set.

These problems, including biases, explainability and robustness of models, are some of the topics we are researching at KInIT, alongside the direct deployment of models for practical use, e.g. in combating disinformation.


[1] Zadeh, A., Chen, M., Poria, S., Cambria, E., & Morency, L. P. (2017). Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.

[2] Kiela, D., & Bottou, L. (2014, October). Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 36-45).

[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[4] Li, L. H., Yatskar, M., Yin, D., Hsieh, C. J., & Chang, K. W. (2019). VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

[5] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., & Dai, J. (2019). VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.

[6] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., … & Gao, J. (2020, August). Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision (pp. 121-137). Springer, Cham.

[7] Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32.

[8] Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.

[9] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

[10] Cao, J., Gan, Z., Cheng, Y., Yu, L., Chen, Y. C., & Liu, J. (2020, August). Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In European Conference on Computer Vision (pp. 565-580). Springer, Cham.

[11] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.

[12] Du, Y., Liu, Z., Li, J., & Zhao, W. X. (2022). A survey of vision-language pre-trained models. arXiv preprint arXiv:2202.10936.

[13] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

[14] Mogadala, A., Kalimuthu, M., & Klakow, D. (2021). Trends in integration of vision and language research: A survey of tasks, datasets, and methods. Journal of Artificial Intelligence Research, 71, 1183-1317.