Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

By Alexander Hägele and 5 other authors

Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative, a constant learning rate with cooldowns, and find that it scales predictably and reliably, similarly to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs. Our code is available at this https URL.
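As a rough illustration of the schedule the abstract contrasts with cosine, the following is a minimal sketch of a constant learning rate followed by a final cooldown phase. The function name, the linear decay shape, and all parameter values are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a constant learning rate with a final cooldown phase.
# The linear decay shape and the parameter values below are assumptions for
# illustration, not the schedule used in the paper.

def constant_with_cooldown(step, total_steps, base_lr=3e-4, cooldown_frac=0.2):
    """Return the learning rate at `step`: constant, then decayed to zero."""
    cooldown_steps = int(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return base_lr
    # Linear decay from base_lr down to 0 over the cooldown window.
    progress = (step - cooldown_start) / max(1, cooldown_steps)
    return base_lr * max(0.0, 1.0 - progress)


if __name__ == "__main__":
    # Example: a 10k-step run whose last 20% of steps are spent cooling down.
    for s in (0, 5000, 8000, 9000, 10000):
        print(s, constant_with_cooldown(s, total_steps=10_000))
```

Because the constant phase is independent of the final training length, a single long run can in principle be branched into cooldowns of different lengths, which is what allows scaling experiments to reuse training runs.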

Submission history

From: Alexander Hägele [view email]
[v1]
Tue, 28 May 2024 17:33:54 UTC (704 KB)
[v2]
Wed, 29 May 2024 16:56:26 UTC (702 KB)
[v3]
Thu, 17 Oct 2024 12:01:15 UTC (958 KB)


