Energetic Natural Gradient Descent
Abstract
In this appendix we show that $\frac{1}{2}\Delta^\top F(\theta)\Delta$ is a second-order Taylor approximation of $D_{\mathrm{KL}}(p(\theta)\,\|\,p(\theta+\Delta))$. First, let
$$g_q(\theta) := D_{\mathrm{KL}}(q\,\|\,p(\theta)) = \sum_{\omega\in\Omega} q(\omega)\ln\frac{q(\omega)}{p(\omega|\theta)}.$$
We begin by deriving equations for the Jacobian and Hessian of $g_q$ at $\theta$:
$$\frac{\partial g_q(\theta)}{\partial\theta}
= \sum_{\omega\in\Omega} q(\omega)\,\frac{p(\omega|\theta)}{q(\omega)}\,\frac{\partial}{\partial\theta}\frac{q(\omega)}{p(\omega|\theta)}
= \sum_{\omega\in\Omega} q(\omega)\,\frac{p(\omega|\theta)}{q(\omega)}\,\frac{-q(\omega)\frac{\partial p(\omega|\theta)}{\partial\theta}}{p(\omega|\theta)^2}
= -\sum_{\omega\in\Omega} \frac{q(\omega)}{p(\omega|\theta)}\frac{\partial p(\omega|\theta)}{\partial\theta}, \tag{4}$$
and so:
$$\begin{aligned}
\frac{\partial^2 g_q(\theta)}{\partial\theta^2}
&= \frac{\partial}{\partial\theta}\frac{\partial g_q(\theta)}{\partial\theta}
= -\sum_{\omega\in\Omega} q(\omega)\,\frac{\partial}{\partial\theta}\!\left(\frac{1}{p(\omega|\theta)}\frac{\partial p(\omega|\theta)}{\partial\theta}\right)\\
&= -\sum_{\omega\in\Omega} \frac{q(\omega)}{p(\omega|\theta)}\frac{\partial^2 p(\omega|\theta)}{\partial\theta^2}
+ \sum_{\omega\in\Omega} \frac{q(\omega)}{p(\omega|\theta)^2}\frac{\partial p(\omega|\theta)}{\partial\theta}\frac{\partial p(\omega|\theta)}{\partial\theta}^{\!\top}\\
&= -\sum_{\omega\in\Omega} \frac{q(\omega)}{p(\omega|\theta)}\frac{\partial^2 p(\omega|\theta)}{\partial\theta^2}
+ \sum_{\omega\in\Omega} q(\omega)\,\frac{\partial\ln p(\omega|\theta)}{\partial\theta}\frac{\partial\ln p(\omega|\theta)}{\partial\theta}^{\!\top}.
\end{aligned} \tag{5}$$

Next we compute a second-order Taylor expansion of $g_{p(\theta)}(\theta+\Delta)$ around $g_{p(\theta)}(\theta)$:
$$g_{p(\theta)}(\theta+\Delta) \overset{\text{Taylor}_2}{\approx} g_{p(\theta)}(\theta) + \Delta^\top\frac{\partial g_{p(\theta)}(\theta)}{\partial\theta} + \frac{1}{2}\Delta^\top\frac{\partial^2 g_{p(\theta)}(\theta)}{\partial\theta^2}\Delta. \tag{6}$$
Notice that $g_{p(\theta)}(\theta) = D_{\mathrm{KL}}(p(\theta)\,\|\,p(\theta)) = 0$, and by (4)
$$\Delta^\top\frac{\partial g_{p(\theta)}(\theta)}{\partial\theta}
= -\Delta^\top\sum_{\omega\in\Omega}\frac{p(\omega|\theta)}{p(\omega|\theta)}\frac{\partial p(\omega|\theta)}{\partial\theta}
= -\Delta^\top\frac{\partial}{\partial\theta}\sum_{\omega\in\Omega} p(\omega|\theta)
\overset{(a)}{=} 0,$$
where (a) holds because $\sum_{\omega\in\Omega} p(\omega|\theta) = 1$, so
$$\frac{\partial}{\partial\theta}\sum_{\omega\in\Omega} p(\omega|\theta) = \frac{\partial 1}{\partial\theta} = 0. \tag{7}$$
Thus, the first two terms on the right side of (6) are zero, and so:
$$g_{p(\theta)}(\theta+\Delta) \overset{\text{Taylor}_2}{\approx} \frac{1}{2}\Delta^\top\frac{\partial^2 g_{p(\theta)}(\theta)}{\partial\theta^2}\Delta. \tag{8}$$
Next we focus on the Hessian, (5), with $q = p(\theta)$:
$$\frac{\partial^2 g_{p(\theta)}(\theta)}{\partial\theta^2}
= \underbrace{-\sum_{\omega\in\Omega}\frac{p(\omega|\theta)}{p(\omega|\theta)}\frac{\partial^2 p(\omega|\theta)}{\partial\theta^2}}_{(a)\;=\;0}
+ \sum_{\omega\in\Omega} p(\omega|\theta)\,\frac{\partial\ln p(\omega|\theta)}{\partial\theta}\frac{\partial\ln p(\omega|\theta)}{\partial\theta}^{\!\top}
= F(\theta),$$
where (a) comes from taking the derivative of both sides of (7) with respect to $\theta$. Substituting this into (8), we have that
$$g_{p(\theta)}(\theta+\Delta) \overset{\text{Taylor}_2}{\approx} \frac{1}{2}\Delta^\top F(\theta)\Delta.$$

In this section we show that $\Delta^\top E(\theta)\Delta$ is a second-order Taylor approximation of $D_E(p(\theta), p(\theta+\Delta))^2$. First, let $g_q(\theta) := D_E(q, p(\theta)) = 2\,\ldots$
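The approximation derived above is easy to sanity-check numerically. The following is a minimal sketch, not from the paper: it assumes a softmax-parameterized categorical distribution over a small finite $\Omega$ (for which the Fisher matrix is $\operatorname{diag}(p) - pp^\top$), with an illustrative dimension and perturbation size, and compares $D_{\mathrm{KL}}(p(\theta)\,\|\,p(\theta+\Delta))$ against $\frac{1}{2}\Delta^\top F(\theta)\Delta$.

```python
# Numerical sanity check (illustrative, not the paper's code):
# D_KL(p(theta) || p(theta + Delta)) ~= 1/2 Delta^T F(theta) Delta
# for a softmax-parameterized categorical distribution over a finite Omega.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def fisher(theta):
    """Fisher matrix F(theta) = sum_w p(w|theta) dlogp dlogp^T.
    For the softmax, dlogp(w|theta)/dtheta = e_w - p, so F = diag(p) - p p^T."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

rng = np.random.default_rng(0)
theta = rng.normal(size=5)          # assumed 5-dimensional parameter vector
delta = 1e-2 * rng.normal(size=5)   # small perturbation Delta

exact = kl(softmax(theta), softmax(theta + delta))
approx = 0.5 * delta @ fisher(theta) @ delta
print(exact, approx)  # the two values agree up to O(||Delta||^3) terms
```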
Similar Resources
Energetic Natural Gradient Descent
We propose a new class of algorithms for minimizing or maximizing functions of parametric probabilistic models. These new algorithms are natural gradient algorithms that leverage more information than prior methods by using a new metric tensor in place of the commonly used Fisher information matrix. This new metric tensor is derived by computing directions of steepest ascent where the distance ...
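To make the role of the metric tensor concrete, here is a minimal sketch of a natural-gradient-style update in which the metric is pluggable, so the Fisher information matrix could be swapped for an energetic metric as the abstract describes. This is not the authors' implementation; the function names, damping term, and toy quadratic objective are assumptions made for illustration.

```python
# Generic natural-gradient-style step with a pluggable metric tensor G(theta).
# Hypothetical sketch; metric_fn could return the Fisher matrix or another metric.
import numpy as np

def natural_gradient_step(theta, grad_fn, metric_fn, lr=0.1, damping=1e-6):
    """One step of theta <- theta - lr * G(theta)^{-1} grad f(theta)."""
    g = grad_fn(theta)
    G = metric_fn(theta) + damping * np.eye(theta.size)  # regularize the metric
    return theta - lr * np.linalg.solve(G, g)

# Toy usage: minimize f(theta) = 1/2 theta^T A theta with a fixed metric.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda th: A @ th
metric_fn = lambda th: A          # stand-in for F(theta) or an energetic metric
theta = np.array([1.0, -2.0])
for _ in range(20):
    theta = natural_gradient_step(theta, grad_fn, metric_fn)
print(theta)  # converges toward the minimizer at the origin
```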
An eigenvalue study on the sufficient descent property of a modified Polak-Ribière-Polyak conjugate gradient method
Based on an eigenvalue analysis, a new proof for the sufficient descent property of the modified Polak-Ribière-Polyak conjugate gradient method proposed by Yu et al. is presented.
Extensions of the Hestenes-Stiefel and Polak-Ribiere-Polyak conjugate gradient methods with sufficient descent property
Using search directions of a recent class of three-term conjugate gradient methods, modified versions of the Hestenes-Stiefel and Polak-Ribiere-Polyak methods are proposed which satisfy the sufficient descent condition. The methods are shown to be globally convergent when the line search fulfills the (strong) Wolfe conditions. Numerical experiments are done on a set of CUTEr unconstrained opti...
A Note on the Descent Property Theorem for the Hybrid Conjugate Gradient Algorithm CCOMB Proposed by Andrei
In [1] (Hybrid Conjugate Gradient Algorithm for Unconstrained Optimization, J. Optim. Theory Appl. 141 (2009) 249-264), an efficient hybrid conjugate gradient algorithm, the CCOMB algorithm, is proposed for solving unconstrained optimization problems. However, the proof of Theorem 2.1 in [1] is incorrect due to an erroneous inequality which is used to indicate the descent property for the s...
Natural Gradient Descent for Training Stochastic Complex-Valued Neural Networks
In this paper, the natural gradient descent method for multilayer stochastic complex-valued neural networks is considered, and the natural gradient is given for a single stochastic complex-valued neuron as an example. Since the space of the learnable parameters of stochastic complex-valued neural networks is not a Euclidean space but a curved manifold, the complex-valued natural gradient ...
Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix
Second-order optimization methods, such as natural gradient, are difficult to apply to high-dimensional problems, because they require approximately solving large linear systems. We present FActorized Natural Gradient (FANG), an approximation to natural gradient descent where the Fisher matrix is approximated with a Gaussian graphical model whose precision matrix can be computed efficiently. We ...
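As a rough illustration of why a sparse approximation to the inverse Fisher matrix helps at scale, the sketch below (hypothetical, not FANG's actual algorithm; the tridiagonal sparsity pattern and all numerical values are illustrative assumptions) replaces the dense linear solve in a natural gradient step with a sparse matrix-vector product.

```python
# If F(theta)^{-1} is approximated by a sparse precision-like matrix Lambda,
# the natural gradient direction becomes a cheap sparse matrix-vector product
# instead of a dense linear solve. Illustrative sketch only.
import numpy as np
from scipy import sparse

d = 10_000                                   # illustrative parameter count
g = np.random.default_rng(0).normal(size=d)  # gradient of the objective

# Sparse stand-in for an approximate inverse Fisher; the tridiagonal structure
# is purely illustrative (FANG learns the structure from a graphical model).
main = np.full(d, 2.0)
off = np.full(d - 1, -0.5)
Lambda = sparse.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

direction = Lambda @ g          # O(nnz) cost instead of a dense O(d^3) solve
theta = np.zeros(d)
theta -= 0.1 * direction        # one (approximate) natural gradient step
```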