Research Visit at the University of Sheffield: Spreading and Combating False Information with Large Language Models

Ivan Srba is a member of the Web & User Data Processing team where he focuses on tackling information disorders with the assistance of artificial intelligence. In June 2023, Ivan had the wonderful opportunity to do a research visit at the University of Sheffield. Read on to find out more about his experience. 

The emergence of pre-trained transformer large language models (with parameter sizes ranging from hundreds of millions, like RoBERTa, up to the most recent models with hundreds of billions of parameters, like ChatGPT) has attracted significant attention not only from the research community but also from the general public.

Among the multiple concerns related to the adoption of such models in practice, one is mentioned repeatedly: such models can be misused to automatically generate text that spreads existing or even completely new, made-up false information. At the same time, pre-trained language models can significantly help with combating false information, although this potential remains largely unexplored.

As part of my research visit to the GATE NLP team at the University of Sheffield (supported by the SoBigData++ TNA programme), I had a great opportunity to summarise the potential of large language models to spread and combat false information. I also tapped into their potential to help with the automatic assessment of online content credibility.

This blog post documents the achieved outcomes and my personal experience from the fruitful research visit and collaboration with Sheffield’s NLP research team.

Large Language Models And False Information

At first, I summarised the current state-of-the-art knowledge to obtain a comprehensive picture of large language models and their interconnection with false information (misinformation as well as disinformation). I built on existing works as well as on the past and ongoing research activities of our own KInIT research team and Sheffield's GATE research team.

The resulting overview of the latest results confirms that the recognised threats regarding their potential to generate disinformation are non-negligible. Large language models can generate high-quality text that humans cannot distinguish from human-written text. Moreover, as we showed in our ongoing work within the vera.ai project, they can also generate highly persuasive disinformation texts (news articles, social media posts, comments) with new (completely made-up) arguments. The current mitigation techniques (such as safety filters) have been shown to be critically insufficient.

The positive news is, as we showed in our recent research in the VIGILANT project, that the detection of generated synthetic text is feasible – even in multilingual settings, although performance varies across languages, text generators (including the languages they are trained on) and the detectors themselves.

At the same time, large language models offer considerable potential for various tasks, including disinformation-related ones. They can serve as a source of weak signals, assess the credibility of sources, or even provide explanations to end users. This potential, however, remains mostly unexplored, and therefore it became the focus of my work during the research visit.

Comparison of the sizes of large language models (source). The size of GPT-4 is not publicly available at this moment, but it is expected to be at least 1 trillion parameters.

Text Classification With Language Models

Starting in 2017, pre-trained language models brought a new paradigm for solving Natural Language Processing (NLP) tasks, denoted "pre-train, fine-tune", in which the language model is first pre-trained on large amounts of raw data and then adapted to the downstream task in a fully supervised way. With the latest state-of-the-art large language models, an additional "pre-train, prompt, and predict" paradigm has emerged. In this paradigm, the downstream task is solved with the help of prompts, while techniques for finding the best prompts are called prompt engineering. A specific case of prompt engineering is so-called in-context learning, in which the language model is conditioned on a few annotated samples provided as part of the prompt, without updating the model's parameters.
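To make the "pre-train, prompt, and predict" paradigm more concrete, below is a minimal sketch of a few-shot classification prompt (in-context learning) that could be sent to any instruction-tuned large language model. The task wording, labels and example texts are illustrative assumptions, not the exact prompts used in our experiments.

```python
# A minimal sketch of the "pre-train, prompt, and predict" paradigm:
# the task is phrased as a prompt, optionally preceded by a few annotated
# examples (in-context learning). Examples and labels are hypothetical.

ICL_EXAMPLES = [
    ("Our brave citizens will never surrender to these traitors!", "persuasive"),
    ("The city council meets every first Monday of the month.", "not persuasive"),
]

def build_few_shot_prompt(text: str) -> str:
    """Build a few-shot classification prompt for an instruction-tuned LLM."""
    lines = [
        "Decide whether the text uses persuasion techniques.",
        "Answer with 'persuasive' or 'not persuasive'.",
        "",
    ]
    for example_text, label in ICL_EXAMPLES:
        lines += [f"Text: {example_text}", f"Answer: {label}", ""]
    lines += [f"Text: {text}", "Answer:"]
    return "\n".join(lines)

print(build_few_shot_prompt("Only a fool would believe the official numbers."))
```

In a zero-shot setting the list of demonstrations would simply be empty, leaving only the task instruction and the input text.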

Both of these paradigms provide potential to help in combating disinformation. More specifically, I focused on researching how they can be utilised to evaluate the credibility of content. During the research visit, we built on our previous results from the SemEval Task 3 data challenge, where the GATE team proposed the best-performing solution for subtasks 1 and 2 (News Genre Categorisation and Framing Detection). In addition, I was a member of the KInIT team, which proposed a solution that achieved the best results in subtask 3 (Persuasion Techniques Detection), winning 6 first places out of 9 languages.

Persuasion Techniques Detection With Prompting And In-Context Learning

At first, we conducted multiple experiments on prompting and in-context learning on top of the SemEval data as well as data from a similar data challenge called DIPROMATS. We experimented with a simplified binary version of the task (predicting whether the input text is persuasive or not) as well as the original version of the task (predicting the presence of individual persuasion techniques). Experiments were done with an OpenAI large language model (ChatGPT) as well as a LLaMA-based large language model (Vicuna). A specific focus was given to the inherent randomness of the predictions – we experimented with various numbers, orderings, selection techniques and label distributions of in-context samples to investigate the best prompt formats. A sketch of how such demonstration sets can be varied is shown below.
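The following is a hypothetical sketch of one experimental variable: how the in-context demonstrations are selected. The pool of examples, the labels and the helper function name are illustrative only, not the actual experiment code.

```python
import random

# Illustrative sketch: vary the number, label distribution and ordering of
# in-context demonstrations before building the prompt. The texts below are
# made up for this example.

POOL = [
    ("Our enemies want you to stay silent – act now!", "persuasive"),
    ("They call it reform, but everyone knows it is theft.", "persuasive"),
    ("The new timetable takes effect next Monday.", "not persuasive"),
    ("The committee published its annual budget report.", "not persuasive"),
]

def sample_in_context(pool, k=2, positive_ratio=0.5, seed=0):
    """Pick k (text, label) demonstrations with a chosen label distribution."""
    rng = random.Random(seed)
    positives = [ex for ex in pool if ex[1] == "persuasive"]
    negatives = [ex for ex in pool if ex[1] == "not persuasive"]
    n_pos = round(k * positive_ratio)
    demos = rng.sample(positives, n_pos) + rng.sample(negatives, k - n_pos)
    rng.shuffle(demos)  # the ordering of demonstrations is varied as well
    return demos

for seed in range(3):  # different seeds give different selections and orderings
    print(sample_in_context(POOL, k=2, positive_ratio=0.5, seed=seed))
```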

The experiments showed that neither prompting nor in-context learning can overcome the fine-tuned model. The reason is that both datasets (SemEval and DIPROMATS) contain quite large training sets. These training sets allow the multilingual model (XLM-RoBERTa-large) to be sufficiently fine-tuned to the downstream task, and thus zero- or few-shot settings, even with the state-of-the-art large language models, cannot achieve comparable performance.
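For comparison, the supervised baseline follows the classic "pre-train, fine-tune" paradigm. Below is a minimal sketch using the Hugging Face transformers library; the toy training data is purely illustrative, and the real experiments used the full SemEval and DIPROMATS training sets with proper hyperparameter tuning.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Sketch: fine-tune XLM-RoBERTa-large on a binary "persuasive vs. not
# persuasive" task. The two training examples are made up for illustration.

model_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

train = Dataset.from_dict({
    "text": ["Only traitors doubt our glorious leader!",
             "The report was published on Tuesday."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-persuasion",
                           per_device_train_batch_size=8,
                           num_train_epochs=3,
                           learning_rate=1e-5),
    train_dataset=train,
)
trainer.train()
```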

This is an interesting finding that deserves further investigation. Additional experiments can be done on multiple NLP tasks and multiple datasets to investigate whether the same pattern will be observed – and if so, whether there is some break-even point in the number of training samples at which fine-tuning overtakes prompting and in-context learning on top of the latest large language models.

Making Fine-Tuning More Efficient With Adapters

The previous result showed that, given a sufficiently large training set, fine-tuning outperforms prompting and in-context learning. However, fine-tuning of large language models can be quite expensive in terms of computational costs. To make training more efficient, several techniques have been proposed.

In this direction, in close collaboration with the GATE team, we compared the performance of fully fine-tuned models (which we used originally on the SemEval data) with the performance of models trained with low-rank adaptation (LoRA) and language model adapters. Both techniques aim to provide parameter-efficient fine-tuning of large language models without significantly losing performance, or even to achieve better results in comparison with fully fine-tuned models.
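As an illustration, the following is a minimal sketch of how LoRA can be attached to XLM-RoBERTa-large with the PEFT library; the rank, dropout and target modules shown are illustrative defaults, not the configuration from our experiments.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Sketch of parameter-efficient fine-tuning with LoRA: instead of updating
# all weights of XLM-RoBERTa-large, small low-rank matrices are injected
# into the attention projections and only those (plus the classification
# head) are trained. Hyperparameters below are illustrative.

base = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,        # keep the classification head trainable
    r=8,                               # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"], # attention projections in RoBERTa layers
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()     # only a small fraction of weights is updated
```

The resulting model can then be passed to the same training loop as the fully fine-tuned baseline, while updating only a small fraction of the parameters.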

We showed that LoRA and adapters are able to surpass the performance of the fully fine-tuned models in some situations. The experiments on the SemEval data allowed us to show how the individual techniques perform on tasks with different properties (input length, number of classes, overall difficulty of classification) and in various multilingual settings (training on the original data, or on English data with and without translation), as well as to compare the performance for individual languages.

In general, our findings are in line with existing research works, which show no significant decrease in performance (or even an increase) while reducing the training costs. Moreover, we significantly extended the existing works with the above-mentioned in-depth comparisons.

SoBigData++ TNA Experience

This work was done as a part of the research visit supported by the SoBigData++ Transnational Access (TNA) programme. During the visit, I cooperated with the GATE NLP team at the University of Sheffield, led by professor Kalina Bontcheva.

The visit gave me a unique opportunity to present the outcomes of my work as a part of GATE team seminars, to discuss and conduct research experiments with other GATE team members, and to participate in summarising the achieved results in the form of a joint journal paper.

Last but not least, it was an opportunity to improve my expertise in the NLP techniques, strengthen my international cooperation and research network, and learn from the methodological, organisational and management skills of the GATE team.