Policy-based reinforcement learning (RL) can be viewed as the maximization of its objective. However, due to the inherent non-concavity of the objective, convergence of a policy gradient method to a first-order stationary point (FOSP) cannot guarantee that it finds a maximal point. A FOSP can be a minimal point or even a saddle point, which is undesirable for RL. It has been found that if all saddle points are strict, the second-order stationary points (SOSP) are exactly ...
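
To make the distinction between the two notions of stationarity concrete, the following is a minimal sketch on a toy two-parameter objective, not the method of this paper. The objective `J(theta) = theta[0]**2 - theta[1]**2`, the helper names, and the tolerance `tol` are all illustrative assumptions. For a maximization problem, the sketch treats a point as a FOSP when the gradient vanishes, and as a SOSP when in addition the largest Hessian eigenvalue is non-positive (up to the tolerance).

```python
import math


def toy_gradient(theta):
    """Gradient of the assumed toy objective J(theta) = theta[0]**2 - theta[1]**2."""
    x, y = theta
    return (2.0 * x, -2.0 * y)


def toy_hessian(theta):
    """Hessian of the same toy objective (constant, since J is quadratic)."""
    return ((2.0, 0.0), (0.0, -2.0))


def max_eigenvalue_2x2(h):
    """Largest eigenvalue of a symmetric 2x2 matrix, computed in closed form."""
    (a, b), (_, d) = h
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean + radius


def classify(gradient, hessian, tol=1e-8):
    """Classify a point of a maximization problem from its gradient and Hessian.

    tol is a hypothetical numerical tolerance, not a quantity from the source.
    """
    if math.hypot(*gradient) > tol:
        return "not stationary"
    if max_eigenvalue_2x2(hessian) > tol:
        # Positive curvature remains, so an ascent direction still exists.
        return "FOSP only"
    return "SOSP"


origin = (0.0, 0.0)
result_at_origin = classify(toy_gradient(origin), toy_hessian(origin))
```

At the origin the gradient of this toy objective is zero, yet the Hessian has eigenvalue 2 along the first coordinate, so the point is a strict saddle: a FOSP at which the objective can still be increased. Replacing the Hessian with a negative definite one, such as `((-2.0, 0.0), (0.0, -2.0))`, makes the same check report a SOSP.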