Fact-Check Retrieval Using Causal LLMs
In the first instalment of our 3-part series, we saw how a text embedding model can be used to retrieve existing fact-checks across different languages and help fact-checkers to cut down on duplicate effort. In this second part, we are going to look at how one can plug causal, generative LLMs (large language models) into the pipeline to refine the results further.
Generative LLMs are now widely understood to capture richer semantic representations and to handle contextual nuances better than text embedding models. They owe this to several factors, such as the scale and generality of their pretraining data, their instruction tuning, their (limited) ability to perform reasoning, and even just the sheer number of parameters.
LLMs Will Sift through Your Retrieved Fact-Checks
So how can we incorporate generative LLMs into the retrieval process? Let us recall what we did last time using TEMs (text embedding models). As shown in Fig. 1, the idea was to simply feed both our documents (fact-checks) and our query (the claim we are interested in) into the TEM to get their embedding vectors. We would then compare the embedding vector of our query against those of all the documents and rank the documents by the resulting similarity scores. The most relevant documents should then be at the top of the list. Note that the individual documents and the query are all embedded independently of each other, so the embeddings of all the documents can be precomputed ahead of time and then reused across all searches.
Fig. 1: Document retrieval using text embedding models (TEMs).
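To make the Fig. 1 pipeline a bit more concrete, here is a minimal sketch of the TEM stage in Python, assuming the sentence-transformers library. The model name and the toy fact-check database are purely illustrative and not necessarily what we used in the paper.

```python
# Minimal sketch of the TEM-only retrieval stage from Fig. 1.
import numpy as np
from sentence_transformers import SentenceTransformer

# The model name is illustrative – any multilingual text embedding model will do.
tem = SentenceTransformer("intfloat/multilingual-e5-large")

fact_checks = [
    "Fact-check: the video does not show event X; it was filmed years earlier.",
    "Fact-check: product Y does not cure disease Z.",
    # ... in practice, the whole fact-check database goes here
]

# Document embeddings do not depend on the query, so they can be precomputed
# once and reused across all searches.
doc_vecs = tem.encode(fact_checks, normalize_embeddings=True)

def tem_rank(query: str, top_k: int = 10):
    """Rank fact-checks by cosine similarity to the query."""
    q_vec = tem.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity (vectors are normalised)
    order = np.argsort(-scores)[:top_k]
    return [(fact_checks[i], float(scores[i])) for i in order]
```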
When incorporating generative LLMs into the process, we are not going to touch the TEM part, as it is still a very efficient way to get the first rough ranking. You can think of this as performing a Google search – you get a list of results which are more or less relevant already, but then you need to sift through them and figure out which are really useful to you and which, although superficially similar, are not what you were looking for.
In our new setup, it is this sifting task that is going to be performed by the generative LLM. The LLM is going to sort through a list of top-ranked documents retrieved by the TEM. It will push the most relevant items towards the top of the list, and it will filter some items out altogether if it finds that they are not actually relevant. The pipeline as a whole is illustrated in Fig. 2.
Fig. 2: A retrieval pipeline combining a text embedding model with an LLM.
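A minimal sketch of this sifting stage might look as follows. The llm_chat() helper and the exact prompt are placeholders: the real pipeline would call whichever LLM backend you use and would likely rely on a more carefully designed prompt (more on that below).

```python
# Minimal sketch of the LLM sifting stage from Fig. 2. The llm_chat() helper is
# hypothetical – plug in whichever backend you use (a local open-weight model,
# an API client, etc.); it should return the model's text response.

def llm_chat(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM backend here")

RELEVANCE_PROMPT = (
    "You will be given a social media post and a previously fact-checked claim.\n"
    "Decide whether the fact-check is relevant to the post.\n"
    "Answer with a single word: Yes or No.\n\n"
    "Post: {post}\n"
    "Fact-checked claim: {claim}\n"
    "Relevant:"
)

def llm_sift(post: str, tem_results: list[tuple[str, float]]):
    """Filter the TEM's top-ranked fact-checks with yes/no LLM judgements."""
    kept = []
    for claim, tem_score in tem_results:
        answer = llm_chat(RELEVANCE_PROMPT.format(post=post, claim=claim))
        if answer.strip().lower().startswith("yes"):
            kept.append((claim, tem_score))
    # Items the LLM judges irrelevant are dropped; the rest keep their TEM order.
    return kept

# Usage (with tem_rank from the previous sketch):
#   results = llm_sift(post, tem_rank(post, top_k=50))
```

A yes/no filter like this is just one option; one could also re-order the remaining items by the model's confidence in its answer rather than keeping the TEM order.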
Why LLMs Have an Easier Time Figuring Out Details
Note that the TEM and the LLM tackle the task in two very different ways. The TEM is asked to embed every item independently into a fixed-size vector. Consequently, what it really needs to do is take each document and squash all the information that could possibly be relevant into a very compact form – and all that without having any idea of what the query is going to be about! It is no wonder that the TEM-based ranking can sometimes miss particular details or ignore various nuances.
The LLM, in contrast, would naturally have an easier time figuring out details. After all, we are not asking it to blindly squash all the content into one small vector and then make do with the information that survived. On the contrary, the LLM attends to the entire query and to all the documents as it works. It can literally weigh the meaning of each particular word and, having seen the query, consider all this information in the context of what the user is looking for just now.
Prompting Is Crucial!
If you work with large language models often, you may not be surprised that it matters how you ask the LLM to perform the task – LLMs can be quite sensitive to what prompt you provide, and a prompt can often even make or break the entire pipeline. We have experimented with several different setups in our accompanying paper (which is now out on arXiv); a sketch of the corresponding prompt templates follows the list:
- Zero-shot: We just ask the LLM: “Is the claim relevant to the social media post?” – without providing any additional context.
- Zero-shot with task description: We provide the LLM with more detailed instructions on how to approach the task, derived from guidelines provided to human annotators.
- Few-shot with task description: We provide the LLM with the task description and also include 10 examples (5 relevant and 5 irrelevant pairs). These 10 examples are not fixed ahead of time, but retrieved from a larger set by a TEM, based on similarity to the user’s query, to make them as helpful to the LLM as possible.
- Chain-of-thought: We additionally instruct the LLM to “think step by step” – this encourages it to spend some time reasoning about the post and fact-checks before it provides an answer.
- Cross-lingual thought prompting (XLT): We additionally instruct the model to translate the content into English before it proceeds with the rest of the task – it has been observed that some models simply perform better in English than in other languages, so this can sometimes help.
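To make the differences between these setups more tangible, here is a rough sketch of how the corresponding prompts could be assembled. The wording of the templates is an illustrative paraphrase, not the exact prompts from the paper.

```python
# Illustrative paraphrases of the prompting strategies above – not the exact
# prompts from the paper. They all build on the same yes/no relevance question.

TASK_DESCRIPTION = (
    "You are helping fact-checkers find previously fact-checked claims. "
    "A fact-check is relevant if it addresses the same claim as the post, "
    "even if the wording, language or level of detail differs."
)

XLT_INSTRUCTION = "First translate the post and the claim into English, then continue."

COT_INSTRUCTION = "Think step by step before giving your final Yes/No answer."

def build_prompt(post, claim, examples=None, task_description=False,
                 chain_of_thought=False, xlt=False):
    """Assemble one of the prompt variants described in the list above."""
    parts = []
    if task_description:
        parts.append(TASK_DESCRIPTION)
    if xlt:
        parts.append(XLT_INSTRUCTION)
    if examples:
        # Few-shot: (post, claim, "Yes"/"No") triples retrieved by a TEM.
        for ex_post, ex_claim, ex_label in examples:
            parts.append(f"Post: {ex_post}\nFact-checked claim: {ex_claim}\n"
                         f"Relevant: {ex_label}")
    parts.append(f"Post: {post}\nFact-checked claim: {claim}")
    parts.append(COT_INSTRUCTION if chain_of_thought else "Relevant (Yes/No):")
    return "\n\n".join(parts)
```

For example, build_prompt(post, claim, examples=few_shot, task_description=True) would correspond to the few-shot setup with task description, which is the variant that worked best for Mistral-Large in our experiments.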
If you ask which setup is the best for fact-check retrieval, the answer is that it depends. The optimal strategy differs a lot from model to model. Out of the open-weight models that we experimented with, Mistral-Large showed the most robust performance, improving over TEM baselines in all setups, but achieving the best scores when using few-shot prompting with the task description – the results are summarised in Fig. 3. Note that in the plain zero-shot setup (without the more comprehensive task description), the improvement is marginal – the results are more or less the same as in the TEM-only setup.
Fig. 3: Performance comparison of LLMs across five prompting strategies in the original language, measured by Macro F1 score with confidence intervals. The dashed horizontal lines represent the best-performing baselines.
Multilingual Performance of Generative LLMs
One particularly challenging aspect of searching for existing fact-checks is that they may easily have been written in a different language than your post, so to find them efficiently, you really need a tool that can conduct cross-lingual searches.
Thankfully, there are now some models – both TEMs and generative LLMs – with reasonably good multilingual capabilities. As we already know, Mistral-Large takes the lead overall – this can also be observed in Fig. 4, which shows the results achieved by different generative LLMs and by TEM baselines (shown in bold) across 20 individual languages and 20 cross-lingual pairs (i.e. the searched-for claim is in a different language than the target fact-checks).
Interestingly, though, it seems that even small models can be competitive – Qwen 2.5 7B arguably does not lag far behind Mistral-Large in some cases and even manages to outpace it in a few. Furthermore, although both versions of Qwen actually underperform the highest-achieving TEM (DeBERTa v3 Large) in many monolingual cases, they do seem to be more robust in terms of cross-lingual performance.
For additional information, have a look at our paper on arXiv.
Fig. 4: Performance of TEMs (bold) and generative LLMs (normal) across 20 individual languages (left) and 20 cross-lingual pairs (right).
The Way Ahead
It is clear that as models keep getting better, they are able to tackle increasingly complex tasks – and across a growing number of languages. At present, one can often push performance by carefully designing prompts and combining several kinds of models in a pipeline. Hopefully, going forward, the amount of engineering required to get the best results will decrease, and as models gain more skills, improved multilingual support and more advanced reasoning capabilities, they will also get even easier to apply – we will see.
As a quick recap, our blog series has three parts: the first covered TEMs, this second part looked at generative LLMs, and the third and final part – on multimodal fact-check retrieval – is yet to come:
- Fact-Check Retrieval Using Text Embedding Models, where we explain the fact-check retrieval task and how it can be addressed using text embedding models (TEMs), while supporting search across different languages.
- [This is where you are!] Fact-Check Retrieval Using LLMs, where we explain how generative large language models (LLMs) enter into the equation and how they can help to refine retrieval to improve results further.
- Multimodal Fact-Check Retrieval, where we explain how visual content can be leveraged to improve retrieval performance, starting with a very straightforward setup based on a combination of several smaller models and then going on to show how generative LLMs can be applied to the same task.
