Overshoot: Speeding up model training with a new momentum-based method

In machine learning, models are trained in steps. During a step, the model is shown a batch of examples (training data) and its parameters are then updated so that it performs better on such data in the future. An update step can be imagined as a shift in a multidimensional space (one model parameter = one dimension). The hope is that, over many steps, the model arrives at a parameter configuration in which it performs the given task well.
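
As a concrete illustration, a single plain gradient-descent step on a toy linear model can look like the sketch below; the model, data, and learning rate are made up for the example and are not taken from our setup:

    import numpy as np

    # Toy loss: mean squared error of a linear model on one batch of examples.
    def loss_gradient(params, x_batch, y_batch):
        errors = x_batch @ params - y_batch
        return x_batch.T @ errors / len(y_batch)

    rng = np.random.default_rng(0)
    params = rng.normal(size=3)            # one model parameter = one dimension
    x_batch = rng.normal(size=(8, 3))      # the examples shown in this step
    y_batch = x_batch @ np.array([1.0, -2.0, 0.5])

    learning_rate = 0.1
    step = -learning_rate * loss_gradient(params, x_batch, y_batch)
    params = params + step                 # the shift in parameter space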

This basic principle works just fine, but since training models consumes resources, it has been enhanced with the idea of momentum. In momentum-based training, the model is updated with a weighted sum of several recent steps, not just the last one. This allows the model to move faster through those parts of the space where consecutive steps point roughly in the same direction, saving some training steps. It also helps the model overcome bumps along the way (local optima).
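
To illustrate, a common form of classical (heavy-ball) momentum can be written in a few lines; the concrete coefficients and names below are only illustrative:

    import numpy as np

    def momentum_step(params, velocity, gradient, learning_rate=0.1, mu=0.9):
        # The velocity is a decayed (weighted) sum of recent gradients, so steps
        # that keep pointing in the same direction accumulate and grow larger.
        velocity = mu * velocity + gradient
        params = params - learning_rate * velocity
        return params, velocity

    params = np.zeros(3)
    velocity = np.zeros_like(params)
    for gradient in [np.array([1.0, 0.5, -0.2])] * 5:  # same direction repeatedly
        params, velocity = momentum_step(params, velocity, gradient)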

Our new Overshoot method tries to improve on the idea of momentum (see the figure below). In classical momentum, some of the past steps (dark grey) may have been computed far away from the current position of the model (blue). Overshoot mitigates this problem by computing updates (gradients) against model weights shifted gamma times further in the direction of the last update (red). This leads to better placement of the past steps relative to the current position, and hence a better-informed next step. Overshoot can be used in combination with existing momentum methods such as Adam.
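
The sketch below shows one way to read this idea in code, built on plain SGD with momentum. It is only our illustrative approximation of the description above; the exact Overshoot algorithm, its hyperparameters, and the names used here (overshoot_sgd_step, grad_fn, gamma) are assumptions, so please refer to the paper for the real formulation:

    import numpy as np

    def overshoot_sgd_step(params, velocity, grad_fn,
                           learning_rate=0.1, mu=0.9, gamma=2.0):
        # The gradient is evaluated at weights shifted gamma times further along
        # the last update (the "red" model), not at the current weights (the
        # "blue" model), so past steps line up better with the current position.
        last_update = -learning_rate * velocity
        gradient = grad_fn(params + gamma * last_update)
        velocity = mu * velocity + gradient
        params = params - learning_rate * velocity
        return params, velocity

    # Usage on a toy quadratic loss with its minimum at zero (illustrative only).
    grad_fn = lambda w: 2.0 * w
    params = np.array([5.0, -3.0])
    velocity = np.zeros_like(params)
    for _ in range(20):
        params, velocity = overshoot_sgd_step(params, velocity, grad_fn)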

We tested the idea on multiple neural architectures and tasks. We observed that the Overshoot enhancement, combined with either Adam or SGD with classical momentum (CM), regularly speeds up convergence, saving between 15% and 25% of the training steps required to reach a 95% loss-reduction threshold. With only minimal computational overhead and no memory overhead, this makes Overshoot potentially beneficial for any large-scale training effort.

For more details on Overshoot and its evaluation, see our paper or try using Overshoot yourself.

As of January 2025, this work is far from over. We plan to test further variants of the method, and a broader evaluation is ahead of us as well.