Survey paper accepted to ACM Computing Surveys – Addressing the sensitivity of language models to the effects of randomness

Have you ever tried to replicate the results of a machine learning study, only to find performance numbers and findings different from those reported in the official study? Or have you ever tried to determine which model can be considered state-of-the-art for a specific task, only to find that many studies report contradictory findings? Or have you ever tried the newest method that everyone claims leads to significantly better performance, expecting it could help you progress on your research problem, only to find that it actually underperforms a simple baseline?

A common culprit that significantly contributes to all of these problems is uncontrolled randomness in the training and evaluation process. Approaches for dealing with limited labelled data in particular (but to a certain extent also neural networks in general), such as in-context learning, fine-tuning, parameter-efficient fine-tuning or meta-learning, have been identified as sensitive to the effects of uncontrolled randomness. Take in-context learning, for example, where something as simple as changing the set of in-context examples, or the order in which they are presented to the model, can determine whether we get state-of-the-art predictions or random guessing. Similarly, repeating the fine-tuning process multiple times can lead to large deviations in performance, where in some cases smaller models can outperform their larger counterparts.
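To illustrate how such sensitivity is typically measured, here is a minimal sketch that evaluates the same set of in-context examples under several random orderings and reports the mean and standard deviation of the resulting accuracy. It assumes a `predict(prompt)` function that wraps whatever model or API you use; the prompt template and all function names are purely illustrative, not taken from the paper.

```python
import random
import statistics

def prompt_from(examples, query):
    """Concatenate labelled in-context examples and the query into one prompt."""
    demos = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{demos}\nInput: {query}\nLabel:"

def run_order_sensitivity(examples, test_set, predict, n_orders=10, seed=0):
    """Evaluate the same in-context examples under several random orderings.

    `predict(prompt)` is assumed to return a label string for a single prompt;
    plug in your own model or API call. Only the order of the demonstrations
    changes between runs, everything else is held fixed.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_orders):
        order = examples[:]
        rng.shuffle(order)  # the only source of variation in this experiment
        correct = sum(predict(prompt_from(order, x)) == y for x, y in test_set)
        accuracies.append(correct / len(test_set))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

A large standard deviation relative to the mean is exactly the kind of instability the survey is concerned with: the same model, data and prompt template, differing only in demonstration order, can produce very different scores.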

If not properly addressed, this uncontrolled randomness has been identified to lead to negative consequences. In comparisons and benchmarks, changing only the random seed or using a different prompt format may lead to completely different model rankings. It may also prevent an objective comparison between different models, create an illusory perception of research progress (due to unintentional cherry-picking), or make the research unreproducible. However, even though the effects of randomness can have a significant impact, the attention devoted to addressing them remains limited, especially when dealing with a limited number of labels.
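One simple way to guard against such seed-dependent rankings is to compare models over multiple runs rather than a single one. The sketch below is only an illustration of this practice (not a method from the paper): it takes per-seed scores for two models and reports how often a single-seed comparison would contradict the mean-based ranking. The numbers are made up for demonstration.

```python
import statistics

def compare_across_seeds(scores_a, scores_b):
    """Compare two models using per-seed scores instead of a single run.

    Prints mean ± standard deviation for each model and how many individual
    seeds produce a ranking that contradicts the mean-based ranking.
    """
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    flips = sum((a >= b) != (mean_a >= mean_b) for a, b in zip(scores_a, scores_b))
    print(f"Model A: {mean_a:.3f} ± {statistics.stdev(scores_a):.3f}")
    print(f"Model B: {mean_b:.3f} ± {statistics.stdev(scores_b):.3f}")
    print(f"{flips}/{len(scores_a)} seeds reverse the mean-based ranking")

# Made-up accuracies over 5 random seeds, for demonstration only.
compare_across_seeds([0.81, 0.74, 0.79, 0.83, 0.72], [0.78, 0.77, 0.79, 0.76, 0.80])
```

In this toy example model B is slightly better on average, yet a comparison based on a single seed would declare model A the winner three times out of five.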

In our newest paper, entitled A Survey on Stability of Learning with Limited Labelled Data and its Sensitivity to the Effects of Randomness, which was accepted to the prestigious ACM Computing Surveys journal, we provide a comprehensive survey of 415 papers that address the effects of randomness. First, we give an overview of the possible sources of randomness in training (i.e., randomness factors), such as initialisation, data choice or data order, that may lead to lower stability of the learned models. Second, we focus on the tasks for addressing the effects of randomness: investigating the impact of the different factors across the different learning approaches; determining the underlying origin of the randomness, such as the problem of underspecification; and finally mitigating the effects, i.e., reducing their impact and increasing stability without reducing the overall performance of the models.

Overall, we find that the majority of the focus is on in-context learning with large language models, especially on choosing a set of high-quality in-context examples. However, other areas are receiving increasing attention, including:

  • Design of the prompt format, as it was identified as the most significant contributor to the variance in results
  • Designing general mitigation strategies by extending ensembling, making it more efficient (a minimal sketch of the basic idea follows this list)
  • Considering sensitivity to the effects of randomness in comparisons and benchmarks, as a small change can lead to completely different rankings
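To make the ensembling idea in the second point above concrete, here is a minimal sketch of its most basic form: aggregating predictions from several runs of the same model (e.g., trained or prompted with different seeds) by majority vote. This illustrates only the general principle, not the specific, more efficient extensions discussed in the surveyed papers; the function and variable names are our own.

```python
from collections import Counter

def majority_vote(runs_predictions):
    """Aggregate per-example predictions from several runs (e.g. different seeds).

    `runs_predictions` is a list of runs, each a list of predicted labels for the
    same test examples. Majority voting reduces the variance introduced by any
    single run's randomness, at the cost of training/evaluating several times.
    """
    per_example = zip(*runs_predictions)  # group the runs' predictions per example
    return [Counter(preds).most_common(1)[0][0] for preds in per_example]

# Three runs of the same model with different seeds; illustrative labels only.
runs = [
    ["pos", "neg", "pos", "neg"],
    ["pos", "pos", "pos", "neg"],
    ["neg", "neg", "pos", "neg"],
]
print(majority_vote(runs))  # -> ['pos', 'neg', 'pos', 'neg']
```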

However, many areas still remain underexplored, such as:

  • More in-depth analysis of the randomness factors and their importance that would allow for better comparison across different experimental setups
  • Exploring how the interactions between randomness factors and systematic choices affect the importance of the factors
  • Sensitivity of parameter-efficient fine-tuning methods and its mitigation

Finally, we provide aggregated findings from our analysis of the surveyed papers, based on which we identify 7 challenges and open problems that point to future directions in this field. The main challenges include the inconsistency in findings, the limited in-depth analysis of the effects of randomness, and suboptimal experimental setups that disregard the effects of systematic choices.

The purpose of this survey is to emphasize the importance of this research area, as it has so far not received adequate attention. First, it should serve existing and newly arriving researchers in this field by supporting their research. At the same time, it also aims to inform researchers and practitioners utilizing learning with limited labelled data about the consequences of unaddressed randomness and how to effectively prevent and deal with them. We hope this survey will help researchers better understand the negative effects of randomness and the tasks performed when dealing with them, grasp the core challenges, and better focus their attention on addressing the randomness and the open problems so that the field can advance. Finally, we believe that this survey will allow future works to determine and compare how the area continues to advance and evolve.

For more detailed information and findings, please check out our paper.

We would like to thank the EU-funded projects TAILOR, DisAI and vera.ai for funding this research!