References
[1] Sun C, Zhang Y. A brief review of gradient methods[J]. Operations Research Transactions, 2021, 25(3): 119-132. https://doi.org/10.15960/j.cnki.issn.1007-6093.2021.03.007.
[2] Robbins H, Monro S. A stochastic approximation method[J]. The Annals of Mathematical Statistics, 1951, 22(3): 400-407. https://doi.org/10.1214/aoms/1177729586.
[3] Gower R M, Loizou N, Qian X, et al. SGD: General Analysis and Improved Rates[C]//International Conference on Machine Learning (ICML). PMLR, 2019. https://doi.org/10.48550/arXiv.1901.09401.
[4] Bottou L, Curtis F E, Nocedal J. Optimization Methods for Large-Scale Machine Learning[J]. SIAM Review, 2018, 60(2): 223-311. https://doi.org/10.1137/16M1080173.
[5] Needell D, Srebro N, Ward R. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm[J]. Advances in Neural Information Processing Systems, 2014, 27. https://doi.org/10.1007/s10107-015-0864-7.
[6] Gower R M, Loizou N, Qian X, et al. SGD: General Analysis and Improved Rates[C]//International Conference on Machine Learning (ICML). PMLR, 2019. https://doi.org/10.48550/arXiv.1901.09401.
[7] Ghadimi S, Lan G. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming[J]. SIAM Journal on Optimization, 2013, 23(4): 2341-2368. https://doi.org/10.1137/120880811.
[8] Wang X, Yuan Y X. On the Convergence of Stochastic Gradient Descent with Bandwidth-based Step Size[J]. Journal of Machine Learning Research, 2023, 24(1): 49.
[9] Kingma D, Ba J. Adam: A Method for Stochastic Optimization[J]. arXiv preprint arXiv:1412.6980, 2014. https://doi.org/10.48550/arXiv.1412.6980.
[10] Duchi J, Hazan E, Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization[J]. Journal of Machine Learning Research, 2011, 12: 2121-2159.
[11] Tieleman T, Hinton G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude[R]. COURSERA: Neural Networks for Machine Learning, 2012.
[12] Zeiler M D. ADADELTA: An Adaptive Learning Rate Method[J]. arXiv preprint arXiv:1212.5701, 2012. https://doi.org/10.48550/arXiv.1212.5701.
[13] Tan C, Ma S, Dai Y H, et al. Barzilai-Borwein Step Size for Stochastic Gradient Descent[C]//Advances in Neural Information Processing Systems (NIPS). Curran Associates Inc., 2016. https://doi.org/10.48550/arXiv.1605.04131.
[14] Barzilai J, Borwein J M. Two-Point Step Size Gradient Methods[J]. IMA Journal of Numerical Analysis, 1988, 8(1): 141-148. https://doi.org/10.1093/imanum/8.1.141.
[15] Vaswani S, Laradji I, Gidel G, et al. Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates[C]//Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, December 8-14, 2019.
[16] Loizou N, Vaswani S, Laradji I H, et al. Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence[C]//International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, 2021.
[17] Fathi Hafshejani S, Gaur D, Hossain S, et al. A fast non-monotone line search for stochastic gradient descent[J]. Optimization and Engineering, 2024, 25(2): 1105-1124. https://doi.org/10.1007/s11081-023-09836-6.
[18] Grippo L, Lampariello F, Lucidi S. A nonmonotone line search technique for Newton's method[J]. SIAM Journal on Numerical Analysis, 1986, 23(4): 707-716. https://doi.org/10.1137/0723046.
[19] Zhang Y, Sun C. Cyclic Gradient Methods for Unconstrained Optimization[J]. Journal of the Operations Research Society of China, 2024, 12(3): 809-828. https://doi.org/10.1007/s40305-022-00432-6.
[20] Chang C C, Lin C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): Article 27. https://doi.org/10.1145/1961189.1961199.