University of Texas at Austin

Past Event: CSEM Student Forum

Adaptive Gradient Methods and Normalization Methods in Machine Learning

Xiaoxia Wu, Mathematics, The University of Texas at Austin

11AM – 12PM
Friday Feb 14, 2020

POB 6.304

Abstract

Optimizing deep neural networks with stochastic gradient descent (SGD) requires many heuristic tricks, including normalization methods such as batch normalization (Ioffe and Szegedy, 2015) and adaptive gradient methods such as Adam (Kingma and Ba, 2014). A significant challenge in understanding these heuristics is the highly non-convex and non-linear nature of neural networks such as ResNet and BERT. Classical convex optimization techniques and theories for SGD therefore do not necessarily explain why these tricks help when training neural networks. In this talk, I will provide convergence results for SGD with adaptive learning rates in general non-convex landscapes, give convergence rates of adaptive gradient methods to global minima in two-layer over-parameterized neural networks, and present an interesting connection between adaptive gradient methods and normalization methods. Beyond convergence, I will discuss some interesting perspectives on the implicit regularization induced by normalization methods.
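As a rough illustration of the adaptive gradient methods mentioned in the abstract, the sketch below shows the standard Adam update of Kingma and Ba (2014) in plain NumPy. The step size, decay parameters, and toy quadratic objective are illustrative choices only and are not details of the talk or of the speaker's results.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update (Kingma and Ba, 2014): the effective step size adapts
    # per coordinate via exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta (illustrative only).
theta = np.ones(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, v = adam_step(theta, theta, m, v, t)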

Event information

Date: Friday Feb 14, 2020, 11AM – 12PM
Location: POB 6.304