Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Porian, Tomer; Wortsman, Mitchell; Jitsev, Jenia; Schmidt, Ludwig; Carmon, Yair

arXiv:2406.19146 (cs)

[Submitted on 27 Jun 2024 (v1), last revised 19 Jan 2025 (this version, v4)]

Abstract: Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
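
A scaling law for compute-optimal model size is commonly written as a power law, $N^\star(C) \approx k \cdot C^{a}$, where $C$ is the compute budget and $N^\star$ is the optimal parameter count; the exponent $a$ is the quantity on which the two laws disagree. The sketch below is only an illustration of how such an exponent can be estimated with a least-squares fit in log-log space. The function name `fit_power_law`, the compute grid, and the synthetic data are assumptions made here for illustration and are not taken from the paper, whose fitting procedure may differ.

```python
import numpy as np


def fit_power_law(compute, n_opt):
    """Fit n_opt ~= k * compute**a by linear least squares in log-log space.

    Returns (a, k). Hypothetical helper for illustration only.
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_n = np.log(np.asarray(n_opt, dtype=float))
    # A degree-1 polynomial fit returns (slope, intercept).
    slope, intercept = np.polyfit(log_c, log_n, deg=1)
    return float(slope), float(np.exp(intercept))


# Synthetic, noise-free data generated from a known power law
# (assumed values, not measurements from the paper).
compute = np.logspace(17, 21, num=9)   # compute budgets in FLOPs
n_opt = 0.1 * compute ** 0.5           # "optimal" model sizes

a, k = fit_power_law(compute, n_opt)
print(f"exponent a = {a:.3f}, coefficient k = {k:.3g}")
```

On real measurements, each `n_opt` value would itself be estimated, for example as the loss-minimizing model size at a fixed compute budget, so the fitted exponent inherits that estimation noise.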

Comments:	Spotlight at NeurIPS 2024
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2406.19146 [cs.LG]
	(or arXiv:2406.19146v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2406.19146

Submission history

From: Tomer Porian
[v1] Thu, 27 Jun 2024 13:02:43 UTC (3,601 KB)
[v2] Thu, 25 Jul 2024 13:09:18 UTC (3,650 KB)
[v3] Mon, 28 Oct 2024 09:42:09 UTC (3,727 KB)
[v4] Sun, 19 Jan 2025 10:34:08 UTC (3,947 KB)
