Adaptive Gradient Methods and Normalization Methods in Machine Learning
Friday, February 14, 2020
11AM – 12PM
Optimizing deep neural networks with stochastic gradient descent (SGD) requires many heuristic tricks, including normalization methods such as batch normalization (Ioffe and Szegedy, 2015) and adaptive gradient methods such as Adam (Kingma and Ba, 2014). A significant challenge in understanding these heuristics is the highly non-convex and non-linear nature of modern neural networks such as ResNet and BERT. As a result, classical convex optimization techniques and the existing theory of SGD do not directly explain why these tricks help when training neural networks.
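To make the two families of heuristics concrete, here is a minimal NumPy sketch of the Adam update from Kingma and Ba (2014): it maintains exponential moving averages of the gradient and its elementwise square, and rescales each coordinate's step by the square root of the second moment. The quadratic objective at the bottom is an illustrative toy example, not part of the talk.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. theta: parameters, grad: current gradient,
    m/v: first/second moment estimates, t: 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
    return theta, m, v

# Toy demo: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(np.linalg.norm(theta))  # converges toward 0
```

Note that the effective step size is roughly `lr` per coordinate regardless of the gradient's magnitude, which is precisely the adaptive behavior whose convergence on non-convex landscapes the talk analyzes.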
In this talk, I will present convergence results for SGD with adaptive learning rates on general non-convex landscapes, give convergence rates of adaptive gradient methods to global minima for two-layer over-parameterized neural networks, and describe an interesting connection between adaptive gradient methods and normalization methods. Beyond convergence, I will offer some perspectives on the implicit regularization induced by normalization algorithms.
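For reference, the normalization side of the connection discussed above is the batch normalization transform of Ioffe and Szegedy (2015). A minimal sketch of its forward pass (training mode, without the running statistics used at inference) is below; the random input is an illustrative assumption.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass (training mode).
    x: (batch, features). Normalizes each feature to zero mean and
    unit variance over the batch, then applies the learned affine
    transform gamma * x_hat + beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy demo: activations with a shifted, scaled distribution.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 8))
out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0))  # approximately 0 per feature
print(out.std(axis=0))   # approximately 1 per feature
```

Like an adaptive gradient method, this operation divides by a data-dependent scale, which is one intuition behind the connection between the two families of methods.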