Multimodal Fact-Check Retrieval
In the two previous installments of our 3-part series, we covered how a text embedding model (TEM) can help fact-checkers cut down on duplicate effort by retrieving existing fact-checks across different languages (part 1), and how generative LLMs can further improve the results of such retrieval (part 2) by going over the retrieved documents and filtering out irrelevant ones – much like a human sifting through results returned by a search engine such as Google and figuring out which ones are really useful.
One important aspect that we have not addressed so far is how to also incorporate visual information when performing the retrieval. Taking visual content into account can be crucial in some cases, because such content can carry essential information that is not expressed in the text. Even when this is not the case, images can still make the retrieval task much easier – for instance, a fact-check article will often include an annotated version of an image from the post it fact-checks. One such example is shown in Fig. 1. Finally, it is very common that a social media post contains no actual textual content and the text is instead rendered in the form of an image. In such cases, optical character recognition (OCR) needs to be applied to extract the text.

Fig. 1: An example used in Multimodal and Multilingual Fact-Checked Article Retrieval (ICMR 2025); a social media post, including an image (top left); the post’s translation to English (bottom left); a corresponding fact-checking article (right).
How, then, does one incorporate visual content into a retrieval pipeline? It depends on the character of the images and the retrieval pipeline. We are going to go over several different cases, namely:
- Incorporating visual content with a text-only TEM;
- Incorporating visual content with a multimodal TEM;
- Incorporating visual content with a generative LLM;
- Handling text rendered in images.
Visual Content and a Text-Only TEM
In the current LLM landscape, a lot of work goes towards building both strong multilingual models and multimodal models that can handle both language and visual content. However, these efforts often do not go hand in hand – to get the best multilingual support, you will often need to opt for a text-only TEM. It would therefore be great if one could effectively combine a pretrained text embedding model with a vision encoder in a way that would allow one to create joint embeddings, combining the information from both modalities.
We explored one very simple approach of this kind in our ICMR 2025 paper. The idea is straightforward: take a competent TEM and a high-performing vision encoder and use them to embed the texts and the images, respectively, into embedding vectors. Then use another, much smaller model to combine these into a single vector. As the paper shows, when one initialises things reasonably, it is possible to get good results even with this simple baseline.
As illustrated in Fig. 2, in the paper we actually combine three modalities in this way: texts, images and texts extracted from images using OCR. As a further simplification, we only use the first image of each post and fact-check in our baseline and show that this already helps. Incorporating multiple images is straightforward, however. The model that combines the embedding vectors is a transformer model, so it can easily process variable-length sequences of items – one can simply plug in as many images as necessary.

Fig. 2: An architecture that combines three modalities: images, texts and texts extracted from images using OCR. Each modality is processed using a separate pre-trained embedding model, and the resulting embedding vectors are then fused into a single vector using an additional (and much smaller) model trained for this task.
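To make this concrete, the sketch below shows one possible way to wire such a setup together in Python. The model names, the `FusionEncoder` class and its dimensions are illustrative assumptions rather than the exact configuration from the paper, and the training of the fusion model is omitted.

```python
# A minimal sketch of late fusion over per-modality embeddings.
# Model names and dimensions are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from PIL import Image
from sentence_transformers import SentenceTransformer

text_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # texts + OCR texts
image_model = SentenceTransformer("clip-ViT-B-32")                         # images

class FusionEncoder(nn.Module):
    """Small transformer that fuses a variable-length sequence of
    per-modality embeddings into a single joint vector."""
    def __init__(self, dims, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        # One linear projection per input modality (their dimensions may differ).
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, embeddings):
        # embeddings: list of tensors, one per modality, each of shape (batch, dim)
        tokens = [p(e).unsqueeze(1) for p, e in zip(self.proj, embeddings)]
        cls = self.cls.expand(embeddings[0].size(0), -1, -1)
        seq = torch.cat([cls] + tokens, dim=1)
        return self.encoder(seq)[:, 0]  # the [CLS]-like position is the joint embedding

# Embed each modality separately, then fuse.
text_emb = torch.tensor(text_model.encode(["Post text ..."]))
ocr_emb = torch.tensor(text_model.encode(["Text extracted by OCR ..."]))
img_emb = torch.tensor(image_model.encode([Image.open("post_image.jpg")]))

fusion = FusionEncoder(dims=[text_emb.size(1), ocr_emb.size(1), img_emb.size(1)])
joint = fusion([text_emb, ocr_emb, img_emb])  # (1, 256) joint embedding for retrieval
```

In practice, the fusion model would be trained (e.g. with a contrastive objective over post/fact-check pairs) while the pre-trained text and image encoders are kept frozen, which keeps the trainable part small.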
Visual Content and a Multimodal TEM
With multimodal TEMs, there are a number of different model families that suggest very different ways of combining the visual and textual information. However, the most expressive approach is probably the one where the entire content is presented to the model as a single sequence – comprising both the text and one or more images, with the images interleaved with the text as necessary – as illustrated in Fig. 3. The images are typically converted to a sequence of “visual tokens” by a vision encoder before entering the model itself. More advanced models support arbitrary image resolutions, where larger images get converted to a larger number of tokens than smaller ones.
The advantage of this single-sequence paradigm is that it allows one to preserve a lot of the information from the original content. Especially with more complex texts, preserving the order and position of images with respect to the text is necessary to form a correct understanding of the content. In highly technical texts, for example, even human readers would be confused if figures were presented out of order at the end of the text without proper referencing.

Fig. 3: A multimodal embedding model, where the text and the images are modelled jointly as a single sequence of textual and visual tokens.
Multimodal embedding models of this single-sequence kind are typically closely related to the architectures of modern causal, generative LLMs – they may even be generative LLMs fine-tuned as embedding models after pre-training. They tend to be a bit larger and more expensive to run than standard TEMs, although this is by no means an absolute rule.
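As a rough illustration of this paradigm, the sketch below feeds interleaved text and image tokens through an off-the-shelf generative vision-language model (LLaVA, via Hugging Face transformers) and pools its final hidden states into a single vector. Note that a stock LLaVA checkpoint is not trained as an embedding model, so this only illustrates the single-sequence interface, not a production-ready retriever.

```python
# A hedged sketch of the single-sequence paradigm: a generative vision-language
# model processes interleaved text and visual tokens, and we pool its last
# hidden states into one embedding vector. The model choice is illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# The <image> placeholder marks where the visual tokens are interleaved with the text.
prompt = "USER: <image>\nPost text that accompanies the image goes here. ASSISTANT:"
image = Image.open("post_image.jpg")

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the final hidden states over the whole (text + visual) token sequence.
embedding = outputs.hidden_states[-1].mean(dim=1)  # shape: (1, hidden_size)
```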
While, as already mentioned, strong multimodal models do not always offer the best multilingual support, models under this single-sequence paradigm are certainly going to be among the main contenders in multimodal retrieval tasks in the future.
Handling Text Rendered in Images
Finally, there is one separate component: handling text rendered in images. While this could technically be handled very well by multimodal TEMs, especially those in the more sophisticated single-sequence paradigm, the current generation of these models is typically not trained to perform OCR, or is at least not very proficient at it. It is therefore – at least for now – a good idea to handle this separately.
The process is actually quite straightforward – one applies an off-the-shelf OCR solution to extract text from the image (if there is any). This text can then be embedded using a text embedding model. The one caveat is that standard OCR solutions can sometimes have trouble handling images included in the fact-checks, as these can be a bit specific. As illustrated in Fig. 4, the text is often partially obscured by being crossed out.
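In its basic form, this step might look as follows – a minimal sketch assuming Tesseract (via pytesseract) as the off-the-shelf OCR engine and a multilingual TEM from sentence-transformers; any other OCR solution or TEM could be substituted.

```python
# OCR the image (if it contains any text), then embed the extracted text.
import pytesseract
from PIL import Image
from sentence_transformers import SentenceTransformer

text_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

image = Image.open("post_image.jpg")
ocr_text = pytesseract.image_to_string(image)  # may be empty if there is no text

if ocr_text.strip():
    ocr_embedding = text_model.encode(ocr_text)  # usable as an extra modality in retrieval
```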
It is also quite common for fact-checkers to take high-resolution screenshots of posts, including the images and their captions; the font, although quite legible at the high resolution, can then be very small in comparison to the image as a whole. This is also a problem for standard OCR solutions, which are not tailored to such content and expect text to be presented prominently in the image.
One way to work around such issues is to use one of the commercially available multimodal LLMs, such as GPT-4o or the more recent versions of Claude, which can generally handle cases like this gracefully – although they do bring the risk of hallucinating content in some cases.
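Such a workaround might look like the sketch below, which asks GPT-4o (via the OpenAI Python client) to transcribe the text in an image. The prompt and model choice are illustrative assumptions, and the output should still be sanity-checked precisely because of the hallucination risk mentioned above.

```python
# A hedged sketch of using a commercial multimodal LLM for OCR on difficult images.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("fact_check_screenshot.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe all text visible in this image verbatim. "
                     "If there is no text, reply with an empty string."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
ocr_text = response.choices[0].message.content
```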

Fig. 4: An example used in Multimodal and Multilingual Fact-Checked Article Retrieval (ICMR 2025); the text in the image is partially obscured by being crossed out.
Conclusion
All in all, NLP models are now able to significantly assist human fact-checkers by retrieving relevant existing fact-checks and thus preventing duplicate work. By creating setups where language models can also leverage visual information – whether by augmenting text-only TEMs or by using TEMs that are natively multimodal – one can squeeze out additional performance and make the results even better.
This blog is part of a series; here is a list of all the parts:
- Fact-Check Retrieval Using Text Embedding Models, where we explain the fact-check retrieval task and how it can be addressed using text embedding models (TEMs), while supporting search across different languages.
- Fact-Check Retrieval Using LLMs, where we explain how generative large language models (LLMs) enter into the equation and how they can help to refine retrieval to improve results further.
[This is where you are!] Multimodal Fact-Check Retrieval, where we explain how visual content can be leveraged to improve retrieval performance, starting with a very straightforward setup based on a combination of several smaller models and then going on to how generative LLMs can be applied to the same task.
