True Asymptotic Natural Gradient Optimization

Author

  • Yann Ollivier

Abstract

We introduce a simple algorithm, True Asymptotic Natural Gradient Optimization (TANGO), that converges to a true natural gradient descent in the limit of small learning rates, without explicit Fisher matrix estimation. For quadratic models the algorithm is also an instance of averaged stochastic gradient, where the parameter is a moving average of a “fast”, constant-rate gradient descent. TANGO appears as a particular de-linearization of averaged SGD, and is sometimes quite different on non-quadratic models. This further connects averaged SGD and natural gradient, both of which are arguably optimal asymptotically. In large dimension, small learning rates will be required to approximate the natural gradient well. Still, this shows it is possible to get arbitrarily close to exact natural gradient descent with a lightweight algorithm.

Let p_θ(y|x) be a probabilistic model for predicting output values y from inputs x (x = ∅ for unsupervised learning). Consider the associated log-loss

    l(y|x) := −ln p_θ(y|x)    (1)

Given a dataset D of pairs (x, y), we optimize the average log-loss over θ via a momentum-like gradient descent.

Definition 1 (TANGO). Let δt_k ≤ 1 be a sequence of learning rates and let γ > 0. Set v_0 = 0. Iterate the following:

  • Select a sample (x_k, y_k) at random in the dataset D.
  • Generate a pseudo-sample ỹ_k for input x_k according to the predictions of the current model, ỹ_k ∼ p_θ(ỹ_k|x_k) (or just set ỹ_k = y_k for the “outer product” variant). Compute the gradients

        g_k ← ∂l(y_k|x_k)/∂θ,    g̃_k ← ∂l(ỹ_k|x_k)/∂θ    (2)

  • Update the velocity and the parameter via

        v_k = (1 − δt_{k−1}) v_{k−1} + γ g_k − γ (1 − δt_{k−1}) (v_{k−1}⊤ g̃_k) g̃_k    (3)
        θ_k = θ_{k−1} − δt_k v_k    (4)
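To make the iteration concrete, here is a minimal sketch of the TANGO update (equations (2)–(4)) on a toy Gaussian least-squares model, where the log-loss gradient and the pseudo-sample are easy to write down. The model, the data, the learning-rate schedule δt_k = 1/k, and the constant γ are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

# Hedged sketch of TANGO (Definition 1) on a toy least-squares model
# y ~ N(x·w, 1), so l(y|x) = 0.5 (x·theta - y)^2 + const.
rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

gamma = 0.05            # "fast" constant rate gamma > 0
theta = np.zeros(d)
v = np.zeros(d)         # velocity, v_0 = 0
dt_prev = 0.0           # delta t_{k-1}; irrelevant at k = 1 since v_0 = 0

for k in range(1, 5001):
    dt = 1.0 / k                          # learning rate, delta t_k <= 1
    i = rng.integers(n)                   # random sample from the dataset
    x_i = X[i]
    pred = x_i @ theta
    g = (pred - y[i]) * x_i               # gradient for the observed sample
    y_tilde = pred + rng.normal()         # pseudo-sample from p_theta(.|x_i)
    g_tilde = (pred - y_tilde) * x_i      # gradient for the pseudo-sample
    # Velocity update (3) and parameter update (4)
    v = (1 - dt_prev) * v + gamma * g \
        - gamma * (1 - dt_prev) * (v @ g_tilde) * g_tilde
    theta = theta - dt * v
    dt_prev = dt

print(np.linalg.norm(theta - w_true))     # distance to the generating weights
```

For this unit-variance Gaussian model, E[g̃ g̃⊤] is exactly the Fisher matrix E[x x⊤], which is how the rank-one correction term in (3) accumulates Fisher information without ever forming the matrix.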


Related articles

Non-Asymptotic Convergence Analysis of Inexact Gradient Methods for Machine Learning Without Strong Convexity

Many recent applications in machine learning and data fitting call for the algorithmic solution of structured smooth convex optimization problems. Although the gradient descent method is a natural choice for this task, it requires exact gradient computations and hence can be inefficient when the problem size is large or the gradient is difficult to evaluate. Therefore, there has been much inter...


Momentum and Optimal Stochastic Search

The rate of convergence for gradient descent algorithms, both batch and stochastic, can be improved by including in the weight update a “momentum” term proportional to the previous weight update. Several authors [1, 2] give conditions for convergence of the mean and covariance of the weight vector for momentum LMS with constant learning rate. However stochastic algorithms require that the learn...
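The momentum scheme described above can be sketched in a few lines: the weight update reuses a fraction of the previous update. The quadratic objective and all constants here are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

# Minimal sketch of batch gradient descent with a classical momentum term:
# the update adds a fraction `mu` of the previous weight update.
A = np.diag([1.0, 10.0])       # ill-conditioned quadratic 0.5 w^T A w
w = np.array([5.0, 5.0])
dw = np.zeros_like(w)
lr, mu = 0.05, 0.9             # learning rate and momentum coefficient

for _ in range(200):
    grad = A @ w               # exact gradient of the quadratic
    dw = mu * dw - lr * grad   # momentum: blend in the previous update
    w = w + dw

print(float(0.5 * w @ A @ w))  # objective value after training
```

On ill-conditioned quadratics like this one, the momentum term damps oscillation along the steep direction while accelerating progress along the shallow one.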


Sequential Convex Approximations to Joint Chance Constrained Programs: A Monte Carlo Approach

When there is parameter uncertainty in the constraints of a convex optimization problem, it is natural to formulate the problem as a joint chance constrained program (JCCP) which requires that all constraints be satisfied simultaneously with a given large probability. In this paper, we propose to solve the JCCP by a sequence of convex approximations. We show that the solutions of the sequence o...


Approximate Joint Diagonalization Using a Natural Gradient Approach

We present a new algorithm for non-unitary approximate joint diagonalization (AJD), based on a “natural gradient”-type multiplicative update of the diagonalizing matrix, complemented by step-size optimization at each iteration. The advantages of the new algorithm over existing non-unitary AJD algorithms are in the ability to accommodate non-positive-definite matrices (compared to Pham’s algorit...


Bayesian Learning via Stochastic Gradient Langevin Dynamics

In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small mini-batches. By adding the right amount of noise to a standard stochastic gradient optimization algorithm we show that the iterates will converge to samples from the true posterior distribution as we anneal the stepsize. This seamless transition between optimization and Bayesi...
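The recipe summarized above — a stochastic gradient step plus injected Gaussian noise whose variance matches the annealed step size — can be sketched for a simple conjugate model. The 1-D Gaussian model, flat prior, and step-size schedule are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Hedged sketch of stochastic gradient Langevin dynamics (SGLD):
# theta_{t+1} = theta_t + (eps_t/2) * grad_log_posterior + N(0, eps_t).
rng = np.random.default_rng(2)
N = 100
data = rng.normal(loc=2.0, scale=1.0, size=N)   # x_i ~ N(theta_true, 1)

theta = 0.0
samples = []
for t in range(1, 20001):
    eps = 0.5 / (10.0 + t)                      # annealed step size
    batch = rng.choice(data, size=10)
    # mini-batch estimate of the log-posterior gradient (flat prior,
    # likelihood gradient rescaled by N / batch size)
    grad = (N / 10.0) * np.sum(batch - theta)
    theta = theta + 0.5 * eps * grad + rng.normal(scale=np.sqrt(eps))
    samples.append(theta)

post = np.array(samples[5000:])                 # discard burn-in
print(post.mean())                              # ~ posterior mean of theta
```

With a flat prior this posterior is Gaussian around the data mean, so the late iterates should wander near 2.0 rather than collapse to a point estimate.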



Journal title:
  • CoRR

Volume: abs/1712.08449    Issue:

Pages: -

Publication date: 2017