Abstract
Given the massive cost of language model pre-training, a non-trivial
improvement of the optimization algorithm would lead to a material reduction in
the time and cost of training. Adam and its variants have been state-of-the-art
for years, while more sophisticated second-order (Hessian-based) optimizers often
incur too much per-step overhead. In this paper, we propose Sophia,
Second-order Clipped Stochastic Optimization, a simple scalable second-order
optimizer that uses a light-weight estimate of the diagonal Hessian as the
pre-conditioner. The update is the moving average of the gradients divided by
the moving average of the estimated Hessian, followed by element-wise clipping.
The clipping controls the worst-case update size and tames the negative impact
of non-convexity and rapid change of Hessian along the trajectory. Sophia only
estimates the diagonal Hessian every handful of iterations, which has
negligible average per-step time and memory overhead. On language modeling with
GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up
compared with Adam in the number of steps, total compute, and wall-clock time.
Theoretically, we show that Sophia adapts to the curvature in different
components of the parameters, which can be highly heterogeneous for language
modeling tasks. Our run-time bound does not depend on the condition number of
the loss.
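The update described in the abstract (EMA of gradients preconditioned by an EMA of a diagonal-Hessian estimate, then element-wise clipping) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' reference implementation; the hyperparameter names (`beta1`, `beta2`, `rho`, `eps`) and the `hess_est`/`sophia_step` signatures are assumptions for exposition.

```python
import numpy as np

def sophia_step(theta, m, h, grad, hess_est=None,
                lr=1e-4, beta1=0.96, beta2=0.99, rho=0.04, eps=1e-12):
    """One Sophia-style update (hedged sketch, not the reference code).

    theta: parameters; m: EMA of gradients; h: EMA of diagonal-Hessian
    estimates. `hess_est` is a fresh diagonal-Hessian estimate, supplied
    only every handful of steps; on the other steps the stale `h` is reused,
    which is what keeps the average per-step overhead negligible.
    """
    m = beta1 * m + (1 - beta1) * grad
    if hess_est is not None:  # Hessian refreshed only occasionally
        h = beta2 * h + (1 - beta2) * hess_est
    # Preconditioned direction with element-wise clipping to [-rho, rho];
    # clipping bounds the worst-case update size per coordinate.
    update = np.clip(m / np.maximum(h, eps), -rho, rho)
    return theta - lr * update, m, h
```

Because of the clip, each coordinate moves by at most `lr * rho` per step, regardless of how small the estimated curvature `h` is.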
Description
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training