Dealing with the sensitivity of Large Language Models
Large language models, such as ChatGPT, have recently become popular and are widely used as assistants for many tasks, both by researchers and by people without extensive AI knowledge. As such, they represent one of the most popular applications of learning with limited labelled data. Large language models make our work more effective: they can summarise longer texts, act as “discussion partners” when coming up with new ideas, generate texts (such as emails) from a few keywords, help with simple planning, or handle categorisation/classification tasks such as determining sentiment.
Despite their popularity and widespread adoption, not many people are aware of their shortcomings and weaknesses. One of the most significant is their instability and sensitivity to the effects of randomness, which negatively affects their effectiveness and trustworthiness.
This behaviour is best showcased on the texts or instructions we give to the language model to describe what we want it to do – called “prompts” by AI researchers and practitioners. How these prompts are written has a significant impact on whether the model accomplishes the task correctly or fails completely. As many are aware, using a completely different prompt will lead to a completely different answer from the model. However, even keeping almost the exact same prompt – replacing only a single word with its synonym, or changing a part without semantic meaning (such as punctuation) – can have the same effect.
We can showcase this on the task of determining the sentiment of a sentence – an easy task for people. Take the sentence “The movie was terrific” and two prompts that differ in only one word:
- Determine the sentiment of the following sentence
- Determine the sentiment of the subsequent sentence
Oftentimes, the answer to the first prompt will be “positive”, while the second one yields “negative”. ChatGPT itself can handle this small change for sentiment, as it was extensively trained for it, but for other tasks such a change can cause problems.
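To see this effect for yourself, you can send the same sentence with several paraphrased instructions and compare the answers side by side. The following is a minimal sketch, assuming the OpenAI Python client and an illustrative model name; any chat-style API would work the same way, and the `query_llm` helper is our own shorthand, not part of any library.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

def query_llm(prompt: str) -> str:
    """Send a single prompt to a chat model and return its text answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # remove sampling randomness so only the wording differs
    )
    return response.choices[0].message.content.strip()

sentence = "The movie was terrific."
prompt_variants = [
    "Determine the sentiment of the following sentence: ",
    "Determine the sentiment of the subsequent sentence: ",
    "What is the sentiment of this sentence? ",
]

# The same sentence with three slightly different wordings of the instruction;
# any disagreement between the printed answers is the sensitivity in action.
for prompt in prompt_variants:
    print(f"{prompt!r} -> {query_llm(prompt + sentence)}")
```

Setting the temperature to 0 is deliberate: it switches off the sampling randomness, so any differences between the answers come from the wording of the prompt alone.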
Besides the wording of the prompt, the order in which we give the instructions and the examples can also change the answer. For example, consider the following two prompts, which differ only in the order of their parts:
- Which one is larger, 13.11 or 13.8?
- 13.11 and 13.8, which one is larger?
For the first prompt, the model will correctly answer that 13.8 is larger. However, if we use the second one (where the parts are switched but still make sense), we will get the incorrect answer that 13.11 is larger – something that even ChatGPT still struggles with.
When trying only a few cases, the problem of sensitivity may not seem significant, or it may even stay hidden (as with the sentiment example). However, it can cause serious problems with more extensive use of the models – returning to the sentiment example, the difference between getting the correct answer in 9 out of 10 cases and in 7 out of 10 becomes significant when applied across hundreds or even thousands of examples.
So how can we deal with this problem?
The good news is that there are multiple ways to deal with this problem – although they make the use of large language models slightly slower and more expensive. Here, we will cover three of the most popular ones, based on our comprehensive survey of the papers that address this sensitivity, which was recently published in a prestigious journal.
Leave nothing to chance: write more complete prompts. The more detail the prompt includes, the higher the probability that the answer will be what we are looking for. As such, the prompt should include as much relevant information as possible. For example, when we ask the large language model to write an email for us, it may not be enough to provide only some keywords it should include. Instead, the prompt should also specify things like:
- Which keywords or sentences are the most important and should therefore be highlighted
- Whether the language should be formal or informal
- Who the email is meant for, which mostly affects the wording – for example, an instruction such as “word it so that it can be read and understood by elementary school children” works rather well
When preparing prompts, do not be afraid to improve them iteratively together with the model – first use a simple prompt, see what the model outputs, and then either modify the original prompt or ask the model to include additional information (e.g., “can you also add a paragraph saying that I would like them to answer as soon as possible?”).
In most cases, the length of the prompt will not be an issue. What can be an issue is the number of back-and-forth exchanges already in the conversation, as models have been shown to struggle once there are too many of them. If the model either stops following new instructions or starts to forget the older ones, do not hesitate to write a fresh, complete prompt with all the instructions to get it back on track.
A well-written, complete prompt will often deal with the sensitivity to small changes.
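To make the difference concrete, here is a minimal sketch contrasting a keyword-only email prompt with a more complete one that spells out the emphasis, tone, audience and closing request. The prompt text is just one possible way to write out the details, and the `query_llm` helper is the same illustrative one-prompt call (OpenAI-style chat API) as in the earlier sketch.

```python
from openai import OpenAI

client = OpenAI()

def query_llm(prompt: str) -> str:
    # Same illustrative helper as before: one prompt in, one text answer out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# A keyword-only prompt leaves most decisions to the model.
vague_prompt = "Write an email: project delay, new deadline Friday, apology."

# A more complete prompt pins down the emphasis, tone, audience and closing request.
complete_prompt = (
    "Write an email to our client about a delay in the project.\n"
    "- Most important point (make it stand out): the new deadline is Friday.\n"
    "- Include a short apology for the inconvenience.\n"
    "- Use a formal but friendly tone.\n"
    "- The reader is a busy manager, so keep it under 150 words.\n"
    "- End by asking them to confirm the new deadline as soon as possible."
)

print(query_llm(vague_prompt))
print("\n--- and with the complete prompt ---\n")
print(query_llm(complete_prompt))
```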
Use examples wherever possible. Even though just giving the model instructions works in most cases, showing examples of how it should solve the problem or task can significantly improve its answers. In the sentiment task, for instance, showing it a few sentences that you consider positive, neutral or negative can be beneficial. An important aspect, however, is how the examples are chosen – the most informative samples, those that represent the task best, have been shown to bring the most benefit.
Showing the examples as part of the prompt also reduces the sensitivity, as the model does not have to work purely from the instructions, but can also do a kind of “imitation”.
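One common way to do this is to place a few labelled examples directly into the prompt, before the sentence we actually care about – often called few-shot prompting. The sketch below shows one possible way to build such a prompt for the sentiment task; the example sentences and the `query_llm` helper (the same OpenAI-style call as above) are illustrative, not a fixed recipe.

```python
from openai import OpenAI

client = OpenAI()

def query_llm(prompt: str) -> str:
    # Same illustrative helper as before: one prompt in, one text answer out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# A few labelled examples that cover all three classes of the task.
few_shot_examples = [
    ("The plot was predictable and the acting was flat.", "negative"),
    ("The movie was released last Friday.", "neutral"),
    ("I enjoyed every single minute of it.", "positive"),
]

def build_few_shot_prompt(sentence: str) -> str:
    """Put the labelled examples in front of the sentence we want classified."""
    lines = ["Determine the sentiment of the last sentence (positive, neutral or negative).", ""]
    for text, label in few_shot_examples:
        lines.append(f"Sentence: {text}\nSentiment: {label}\n")
    lines.append(f"Sentence: {sentence}\nSentiment:")
    return "\n".join(lines)

print(query_llm(build_few_shot_prompt("The movie was terrific.")))
```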
Ask multiple times with different prompts and combine the answers. Instead of asking the model a single time with a single prompt, it is often beneficial to create multiple prompts, each with slightly different instructions and wording, and get an answer for each of them from the model. The idea is that different prompts may lead the model to focus on different aspects – for example, when writing an email, one prompt may produce a better introduction, while another may better highlight the most important issues. Merging everything into a single prompt may not capture these benefits, so the strengths of different prompts (with different wording, order of instructions or examples) are best leveraged by asking repeatedly.
After getting an answer to each of the prompts, the answers can be combined. In simpler cases, this can be done by majority voting – for example, in the sentiment task, if we use 10 prompts and in 7 out of 10 cases the model returns positive sentiment, we can say that the sentence indeed has positive sentiment (or neutral when the votes are split 50:50). In more complicated cases, where a lot of text is generated, there are two options. One is to combine the answers manually, by choosing parts from each answer and putting them together. A better option is to do it automatically, by utilising a language model itself – we can take all the answers, give them to another language model (or the same one that generated them), and ask it to combine them into one. Although this may introduce further problems, large language models excel at tasks such as summarisation, so the issues should be minimal – especially when using well-written and complete prompts.
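For the sentiment task, the majority-voting variant is simple enough to sketch in a few lines. The code below asks the model once per prompt paraphrase, extracts the label from each answer, and falls back to “neutral” when there is no clear majority; the paraphrases and the `query_llm` helper (the same OpenAI-style call as above) are again only illustrative.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def query_llm(prompt: str) -> str:
    # Same illustrative helper as before: one prompt in, one text answer out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Several paraphrases of the same instruction; more can be added freely.
prompt_variants = [
    "Determine the sentiment (positive, neutral or negative) of this sentence: ",
    "Classify the following sentence as positive, neutral or negative: ",
    "Is the sentiment of this sentence positive, neutral or negative? ",
]

def vote_on_sentiment(sentence: str) -> str:
    """Ask once per prompt variant and return the majority label."""
    labels = []
    for prompt in prompt_variants:
        answer = query_llm(prompt + sentence + " Answer with a single word.").lower()
        # Keep only the label, in case the model answers with a full sentence.
        label = next((l for l in ("positive", "negative", "neutral") if l in answer), "neutral")
        labels.append(label)
    top_label, top_count = Counter(labels).most_common(1)[0]
    # No clear majority (e.g. a split vote) falls back to "neutral".
    return top_label if top_count > len(labels) / 2 else "neutral"

print(vote_on_sentiment("The movie was terrific."))
```

When the answers are longer generated texts rather than labels, the same loop would simply collect the drafts and, instead of counting votes, pass them to one final prompt asking the model to combine them into a single answer.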
Asking the model multiple times and combining the answers is the most effective way to deal with the sensitivity and to obtain the best possible answers, but it is also the most expensive one. Its most significant strength is that it addresses not only the sensitivity to how the prompt is worded, but also other factors such as the order of instructions, the choice of examples, or even the inherent randomness in the model itself.
For more detailed information and findings, please check out our scientific paper.
We would like to thank the PwC Endowment Fund at the Pontis Foundation for funding this research!