On the Effects of Randomness on Stability of Learning with Limited Labelled Data

Have you ever tried to replicate the results of a machine learning study, only to find performance numbers and findings that differ from those reported in the original work? Or have you tried to determine which model can be considered state-of-the-art for a specific task, only to find that studies report contradictory findings in this regard? Maybe you tried the newest method that promised significantly better performance, only to find that it actually underperforms a simple baseline?

We recently published a survey paper as a preprint that aims to inform researchers and practitioners who use learning with limited labelled data about the consequences of unaddressed randomness and how to effectively prevent and deal with them.

A common culprit that significantly contributes to all of these problems is uncontrolled randomness in the training process. Approaches for dealing with limited labelled data in particular, such as in-context learning, transfer learning or meta-learning (but, to a certain extent, also neural networks in general), have been identified as sensitive to the effects of uncontrolled randomness.

Take in-context learning, for example, where something as simple as changing the order in which the in-context samples are presented to the model can determine whether we get state-of-the-art predictions or random guessing. Similarly, in the setting of limited data, repeating fine-tuning with different initialisations can lead to large deviations in performance, where in some cases the smallest BERT variants can outperform their larger counterparts.
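To see what such run-to-run variance looks like in practice, here is a minimal sketch that uses a small scikit-learn classifier as a stand-in for fine-tuning a large model (the dataset, model, and seed range are our own illustrative choices, not taken from the survey): we repeat training on a small labelled subset, varying only the random seed, and report the spread of test accuracy.

```python
# Measure run-to-run instability under limited labelled data: repeat
# training with different seeds and report the spread of test accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=50, random_state=0)  # only 50 labelled samples

scores = []
for seed in range(10):  # 10 runs that differ only in the random seed
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=seed)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print(f"accuracy: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}, "
      f"min={np.min(scores):.3f}, max={np.max(scores):.3f}")
```

With so few labelled samples, the gap between the best and worst seed can easily dwarf the improvements typically reported between competing methods.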

This uncontrolled randomness, if not properly addressed, has been found to lead to negative consequences, such as:

  • prohibiting objective comparisons between different models
  • creating an illusory perception of research progress (due to unintentional cherry-picking)
  • making the research irreproducible

However, even though the effects of randomness can have a significant impact, the attention devoted to addressing them remains limited, especially when dealing with a limited number of labels.

In our new paper, we provide a comprehensive survey of papers that address the effects of randomness. First, we provide an overview of the possible sources of randomness in training (i.e., randomness factors), such as initialisation, data choice or data order, that may lower the stability of the learned models.
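To make the distinction between these factors concrete, here is a minimal sketch, assuming a PyTorch setup, of how each of the three named factors can be fixed (or varied) with its own seed; the helper name and seed-splitting scheme are our own illustrative choices, not an API from the survey.

```python
# Fix each randomness factor with its own seed so that their effects
# can be studied in isolation.
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_run(init_seed, data_choice_seed, data_order_seed, X, y,
              train_size=50):
    # Factor 1 -- initialisation: the model's starting weights.
    torch.manual_seed(init_seed)
    model = torch.nn.Linear(X.shape[1], 2)

    # Factor 2 -- data choice: which samples end up in the labelled set.
    rng = np.random.default_rng(data_choice_seed)
    idx = torch.from_numpy(rng.choice(len(X), size=train_size,
                                      replace=False))
    train_set = TensorDataset(X[idx], y[idx])

    # Factor 3 -- data order: how samples are shuffled into batches.
    order_gen = torch.Generator().manual_seed(data_order_seed)
    loader = DataLoader(train_set, batch_size=8, shuffle=True,
                        generator=order_gen)
    return model, loader

# Varying one seed while fixing the others isolates that factor's impact:
X, y = torch.randn(1000, 20), torch.randint(0, 2, (1000,))
for init_seed in range(5):
    model, loader = build_run(init_seed, data_choice_seed=0,
                              data_order_seed=0, X=X, y=y)
```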

Second, we focus on the tasks for addressing the effects of randomness:

  • investigating the impact of the individual randomness factors across different learning approaches; 
  • determining the underlying origin of the randomness effects, such as the problem of underspecification; 
  • and finally mitigating the effects, reducing their impact so as to increase stability without reducing the overall performance of the models (a minimal sketch of one such mitigation follows this list). 
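As a hedged illustration of mitigation, here is a minimal sketch of one common strategy: a seed ensemble, which averages the predictions of models trained with different seeds and typically reduces run-to-run variance without reducing mean performance. It reuses the scikit-learn setup from the earlier sketch; all names and numbers are our own illustrative choices, not prescriptions from the survey.

```python
# Seed-ensemble mitigation: average the predicted probabilities of
# models that differ only in their random seed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=50, random_state=0)

# Train several models that differ only in their random seed ...
probas = []
for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=seed).fit(X_train, y_train)
    probas.append(clf.predict_proba(X_test))

# ... and average their predicted probabilities before taking the argmax.
ensemble_pred = np.mean(probas, axis=0).argmax(axis=1)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())
```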

Finally, we aggregate the findings from our analysis of the surveyed papers, based on which we identify 7 open problems that point to future directions in this field.

The purpose of this survey is to emphasise the importance of this research area, which has so far not received adequate attention. First, it should support researchers in this field in their own work. At the same time, it aims to inform researchers and practitioners who use learning with limited labelled data about the consequences of unaddressed randomness and how to effectively prevent and deal with them. The survey paper, which we plan to continuously update along with its supplementary material, is available as a preprint here.