We introduce a class of variational actor-critic algorithms based on a variational formulation over both the value function and the policy. The objective function consists of two parts: one for maximizing the value function and the other for minimizing the Bellman residual. Besides vanilla gradient descent with both value-function and policy updates, we propose two variants, the clipping method and the flipping method, in order to speed up convergence. We also prove that, when the prefactor of the Bellman residual is s...