Convergence Analysis of Training Two-Hidden-Layer Partially Over-Parameterized ReLU Networks via Gradient Descent

Over-parameterized neural networks have attracted a great deal of attention in recent deep learning theory research, as they challenge the classical view that models with excessive parameters must over-fit, and they have achieved empirical success in a variety of settings. While a number of theoretical works have been presented to demystify the properties of such models, their convergence behavior is still far from thoroughly understood. In this work, we study the convergence of gradient descent for training two-hidden-layer, partially over-parameterized, fully connected networks with the Rectified Linear Unit (ReLU) activation. To our knowledge, this is the first theoretical work to analyze the convergence of deep over-parameterized networks without the equally-wide-hidden-layer assumption or other unrealistic assumptions. We provide a probabilistic lower bound on the widths of the hidden layers and prove a linear convergence rate for gradient descent. We also conduct experiments on synthetic and real-world datasets to validate our theory.
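
For concreteness, the short NumPy sketch below illustrates the kind of training procedure the abstract refers to: a fully connected network with two ReLU hidden layers fitted by full-batch gradient descent on a squared loss. The layer widths m1 and m2, the Gaussian initialization scales, the step size, and the synthetic regression data are illustrative assumptions, not the paper's exact construction.

# A minimal sketch, assuming a squared loss and Gaussian initialization;
# not the paper's exact setting.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n unit-norm inputs in d dimensions with scalar targets.
n, d, m1, m2 = 64, 10, 256, 256
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)

# Gaussian-initialized weights: two hidden layers and a linear output layer.
W1 = rng.standard_normal((d, m1)) / np.sqrt(d)
W2 = rng.standard_normal((m1, m2)) / np.sqrt(m1)
a = rng.standard_normal(m2) / np.sqrt(m2)

relu = lambda z: np.maximum(z, 0.0)

lr = 0.5  # step size chosen heuristically for this toy problem
for step in range(500):
    # Forward pass through the two ReLU hidden layers.
    H1 = relu(X @ W1)            # (n, m1)
    H2 = relu(H1 @ W2)           # (n, m2)
    pred = H2 @ a                # (n,)
    residual = pred - y
    loss = 0.5 * np.mean(residual ** 2)
    if step % 100 == 0:
        print(f"step {step:4d}  loss {loss:.6f}")

    # Backward pass: manual gradients of the averaged squared loss.
    g_pred = residual / n                      # d loss / d pred
    g_a = H2.T @ g_pred                        # (m2,)
    g_Z2 = np.outer(g_pred, a) * (H2 > 0)      # grad at 2nd pre-activation
    g_W2 = H1.T @ g_Z2                         # (m1, m2)
    g_Z1 = (g_Z2 @ W2.T) * (H1 > 0)            # grad at 1st pre-activation
    g_W1 = X.T @ g_Z1                          # (d, m1)

    # Full-batch gradient descent update.
    W1 -= lr * g_W1
    W2 -= lr * g_W2
    a -= lr * g_a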
