Riemannian Proximal Policy Optimization

In this paper, we propose a general Riemannian proximal optimization algorithm with guaranteed convergence for solving Markov decision process (MDP) problems. To model policy functions in MDPs, we employ the Gaussian mixture model (GMM) and formulate policy optimization as a non-convex optimization problem in the Riemannian space of positive semidefinite matrices. For two given policy functions, we also provide a lower bound on the policy improvement by using bounds derived from the Wasserstein distance between GMMs. Preliminary experiments show the efficacy of our proposed Riemannian proximal policy optimization algorithm.


Introduction
Reinforcement learning studies how agents explore and exploit an environment and take actions to maximize long-term reward. It has broad applications in robot control and game playing (Mnih et al., 2015; Silver et al., 2016; Argall et al., 2009; Silver et al., 2017). Value iteration and policy gradient methods are the mainstream methods for reinforcement learning (Sutton and Barto, 2018; Li, 2017).
Policy gradient methods learn the optimal policy directly from past experience or on the fly. They maximize the expected discounted reward through a parametrized policy whose parameters are updated using gradient ascent. Traditional policy gradient methods suffer from three well-known obstacles: high variance, sample inefficiency, and difficulty in tuning the learning rate. To make the learning algorithm more robust and scalable to large datasets, Schulman et al. (2015) proposed the trust region policy optimization (TRPO) algorithm. TRPO searches for the optimal policy by maximizing a surrogate function subject to a constraint on the KL divergence between the old and new policy distributions, which guarantees monotonic improvement. To further improve data efficiency and reliability, the proximal policy optimization (PPO) algorithm was proposed, which uses first-order optimization and a clipped probability ratio between the new and old policies (Schulman et al., 2017). TRPO was also extended to constrained reinforcement learning: Achiam et al. (2017) proposed constrained policy optimization (CPO), which guarantees near-constraint satisfaction at each iteration.
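For reference, the clipped surrogate objective optimized by PPO takes the form
\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\]
where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping parameter (Schulman et al., 2017).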
Although TRPO, PPO, and CPO have shown promising performance on complex decision-making problems, such as continuous-control tasks and playing Atari, like other neural-network-based models they face two typical challenges: lack of interpretability, and difficulty converging due to the non-convex optimization in a high-dimensional parameter space. For many real applications, data lying in a high-dimensional ambient space usually have a much lower intrinsic dimension, so it may be easier to optimize the policy function on low-dimensional manifolds.
In recent years, many optimization methods have been generalized from Euclidean space to Riemannian space, owing to the manifold structures present in many machine learning problems (Absil et al., 2007, 2009; Vandereycken, 2013; Huang et al., 2015; Zhang et al., 2016). In this paper, we leverage the merits of TRPO, PPO, and CPO and propose a new algorithm called Riemannian proximal policy optimization (RPPO), which takes manifold learning into account for policy optimization. To estimate the policy, we need a density-estimation function; options include kernel density estimation, neural networks, and the Gaussian mixture model (GMM). In this study we choose the GMM because of its good analytical properties, universal representation power, and low computational cost compared with neural networks. It is well known that the covariance matrices of a GMM lie in a Riemannian manifold of positive semidefinite matrices.
To be more specific, we first model policy functions using a GMM. Second, to optimize the GMM and learn the optimal policy functions efficiently, we formulate it as a non-convex optimization problem in the Riemannian space. In this way, our method gains advantages in both interpretability and speed of convergence. Note that our RPPO algorithm can easily be extended to any other non-GMM density estimator, as long as its parameter space is Riemannian. GMMs have previously been applied to reinforcement learning by embedding the GMM in the Q-learning framework (Agostini and Celaya, 2010); that approach inherits the well-known limitation of Q-learning, namely that it can hardly handle problems with large continuous state-action spaces.

Reinforcement learning
In this study, we consider a Markov decision process (MDP) defined as a tuple (S, A, P, r, γ), where S is the set of states, A is the set of actions, P : S × A × S → [0, 1] is the transition probability function, r : S × A × S → R is the reward function, and γ is the discount factor which balances future rewards against immediate ones.
To make optimal decisions for MDP problems, reinforcement learning was proposed to learn an optimal value function or policy. A value function is the expected discounted cumulative reward of a state or state-action pair obtained by following a policy $\pi$. We define the state value function as $v_\pi(s) = \mathbb{E}_{\tau \sim \pi}[r(\tau) \mid s_0 = s]$, where $\tau = (s_0, a_0, s_1, \ldots)$ denotes a trajectory generated by playing policy $\pi$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. Similarly, we define the state-action value function as $q_\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[r(\tau) \mid s_0 = s, a_0 = a]$, and the advantage function as $A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$.
In reinforcement learning, we try to find or learn an optimal policy $\pi$ that maximizes a given performance metric $J(\pi)$. The infinite-horizon discounted cumulative return is widely used to evaluate a given policy and is defined as
\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right],
\]
where $r(s_t, a_t, s_{t+1})$ is the reward received when moving from $s_t$ to $s_{t+1}$ by taking action $a_t$. Note that the expectation is taken over the distribution of trajectories.
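As an illustration (not part of the original formulation), a minimal sketch of estimating the discounted return and one-step advantages from a sampled trajectory is shown below; the function names and the value-function baseline are our own choices for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one sampled trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def advantage_estimates(rewards, values, gamma):
    """One-step advantage estimates A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds V(s_0), ..., V(s_T); `rewards` holds r_0, ..., r_{T-1}.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

# Example: a short trajectory with three transitions.
print(discounted_return([1.0, 0.5, 2.0], gamma=0.99))
print(advantage_estimates([1.0, 0.5, 2.0], [0.3, 0.4, 0.6, 0.0], gamma=0.99))
```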

Riemannian space
Here we give a brief introduction to Riemannian space; for more details see (Eisenhart, 2016). Let $M$ be a connected, finite-dimensional manifold of dimension $m$. We denote by $T_pM$ the tangent space of $M$ at $p$. Let $M$ be endowed with a Riemannian metric $\langle \cdot, \cdot \rangle$, with corresponding norm denoted by $\|\cdot\|$, so that $M$ is a Riemannian manifold (Eisenhart, 2016). We use $l(\gamma) = \int_a^b \|\gamma'(t)\|\, dt$ to denote the length of a piecewise smooth curve $\gamma : [a, b] \to M$ joining $\theta'$ to $\theta$, i.e., such that $\gamma(a) = \theta'$ and $\gamma(b) = \theta$. Minimizing this length functional over the set of all piecewise smooth curves joining $\theta'$ and $\theta$ yields the Riemannian distance $d(\theta', \theta)$, which induces the original topology on $M$. For $\theta \in M$, the exponential map $\exp_\theta : T_\theta M \to M$ maps a tangent vector $v$ at $\theta$ to $M$ along the corresponding curve $\gamma$. For any $\theta \in M$ we define the inverse exponential map $\exp_\theta^{-1} : M \to T_\theta M$, which is $C^\infty$ and maps a point $\theta'$ on $M$ to a tangent vector at $\theta$ with $d(\theta', \theta) = \|\exp_\theta^{-1} \theta'\|$. We assume $(M, d)$ is a complete metric space, bounded, and that all closed subsets of $M$ are compact. For a given convex function $f : M \to \mathbb{R}$, the set of all subgradients of $f$ at $\theta' \in M$ is called the subdifferential of $f$ at $\theta'$, denoted by $\partial f(\theta')$. If $M$ is a Hadamard manifold, i.e., complete, simply connected, and with everywhere non-positive sectional curvature, the subdifferential of $f$ at any point of $M$ is non-empty (Ferreira and Oliveira, 2002).
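As a concrete example (not taken from the paper), on the manifold of symmetric positive definite matrices with the affine-invariant metric the exponential map and its inverse have closed forms; the sketch below uses SciPy, and the choice of metric is our assumption for illustration.

```python
import numpy as np
from scipy.linalg import sqrtm, expm, logm

def spd_exp(W, eta):
    """Exponential map on the SPD manifold (affine-invariant metric):
    exp_W(eta) = W^{1/2} expm(W^{-1/2} eta W^{-1/2}) W^{1/2}."""
    W_half = np.real(sqrtm(W))          # sqrtm may return a complex array with tiny imaginary parts
    W_half_inv = np.linalg.inv(W_half)
    return W_half @ expm(W_half_inv @ eta @ W_half_inv) @ W_half

def spd_log(W, X):
    """Inverse exponential map: log_W(X), a tangent vector at W."""
    W_half = np.real(sqrtm(W))
    W_half_inv = np.linalg.inv(W_half)
    return W_half @ np.real(logm(W_half_inv @ X @ W_half_inv)) @ W_half

# The induced Riemannian distance is d(W, X) = ||logm(W^{-1/2} X W^{-1/2})||_F.
W = np.eye(2)
eta = np.array([[0.1, 0.05], [0.05, 0.2]])
print(spd_exp(W, eta))
print(spd_log(W, spd_exp(W, eta)))   # recovers eta (up to numerical error)
```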

Modeling policy function using Gaussian mixture model
To model policy functions, we employ the Gaussian mixture model, a widely used and statistically mature method for clustering and density estimation. The policy function can be modeled as
\[
\pi(a \mid s) = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}\!\left(a;\, \mu_k, S_k\right),
\]
where $\mathcal{N}$ is a (multivariate) Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and covariance matrix $S \succeq 0$, $K$ is the number of components in the mixture model, and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K)$ are the mixture component weights, which sum to 1. In the following, we drop $s$ from the GMM notation for simplicity; the GMM parameters still depend on the state variable $s$ implicitly.
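A minimal sketch of such a GMM policy (evaluating $\pi(a \mid s)$ and sampling an action) is shown below; the state-dependence of the parameters is left abstract, and the helper names are our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(a, alphas, means, covs):
    """Evaluate pi(a | s) = sum_k alpha_k N(a; mu_k, S_k)."""
    return sum(alpha * multivariate_normal.pdf(a, mean=mu, cov=S)
               for alpha, mu, S in zip(alphas, means, covs))

def gmm_sample(alphas, means, covs, rng=None):
    """Sample an action: pick a component by its weight, then sample from it."""
    rng = np.random.default_rng() if rng is None else rng
    k = rng.choice(len(alphas), p=alphas)
    return rng.multivariate_normal(means[k], covs[k])

# Example with K = 2 components in a 2-D action space.
alphas = [0.3, 0.7]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([0.5, 0.5]), alphas, means, covs))
print(gmm_sample(alphas, means, covs))
```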
We would like to optimize the corresponding objective, referred to below as optimization problem (2), subject to the constraints induced by the GMM: the mixture weights sum to one and the covariance matrices are positive semidefinite. We employ a reparametrization method to make the Gaussian distributions zero-centered: we augment the action variables by 1 and define a new variable vector $a' = [a, 1]^\top$ with new covariance matrix
\[
S' = \begin{pmatrix} S + \mu\mu^\top & \mu \\ \mu^\top & 1 \end{pmatrix}
\]
(Hosseini and Sra, 2015).
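A small sketch of this reparametrization (building the augmented covariance from $\mu$ and $S$) is given below; it simply follows the block structure above and is only illustrative.

```python
import numpy as np

def augment_covariance(mu, S):
    """Build the augmented covariance for a' = [a, 1]:
        S' = [[S + mu mu^T, mu],
              [mu^T,         1]]
    so that each mixture component can be treated as zero-centered in the lifted space."""
    d = mu.shape[0]
    S_aug = np.empty((d + 1, d + 1))
    S_aug[:d, :d] = S + np.outer(mu, mu)
    S_aug[:d, d] = mu
    S_aug[d, :d] = mu
    S_aug[d, d] = 1.0
    return S_aug

# Example in a 2-D action space.
mu = np.array([0.2, -0.1])
S = np.array([[1.0, 0.3], [0.3, 0.5]])
print(augment_covariance(mu, S))
```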
where $\alpha_k$ is the step size. The proof of Lemma 1 can be found in the Appendix.

Theorem 1. Under Assumption 1, the following statements hold for any sequence $\{\theta_k\}_{k \ge 0}$ generated by Algorithm 1: (a) any limit point of the sequence $\{\theta_k\}_{k \ge 0}$ is a critical point, and the sequence of function values $\{f(\theta_k)\}_{k \ge 0}$ is strictly decreasing and convergent.
Proof of Theorem 1 can be found in the Appendix.
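Although the full statement of Algorithm 1 is not reproduced here, its iteration is consistent with the retraction step described in the implementation section below; under that assumption, one update can be written as
\[
\theta_{k+1} = R_{\theta_k}\!\big(-\alpha_k\,(\operatorname{grad} g(\theta_k) + \partial \varphi(\theta_k))\big),
\]
where $R_{\theta_k}$ denotes the retraction at $\theta_k$ and $\alpha_k$ is the step size.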

Lower bound of policy improvement
Assume we have two policy functions $\pi'(a \mid s) = \sum_i \alpha'_i\, \mathcal{N}(a; S'_i)$ and $\pi(a \mid s) = \sum_i \alpha_i\, \mathcal{N}(a; S_i)$, parameterized by GMMs with parameters $\theta'$ and $\theta$, respectively. We would like to bound the performance improvement of $\pi'(a \mid s)$ over $\pi(a \mid s)$ under the limitation imposed by the proximal operator.
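For reference (this formula is standard and not specific to the paper), the squared 2-Wasserstein distance between two Gaussian components, which underlies Wasserstein-distance-based bounds for GMMs, is
\[
W_2^2\big(\mathcal{N}(\mu_1, S_1), \mathcal{N}(\mu_2, S_2)\big)
= \|\mu_1 - \mu_2\|_2^2 + \operatorname{tr}\!\Big(S_1 + S_2 - 2\big(S_1^{1/2} S_2\, S_1^{1/2}\big)^{1/2}\Big).
\]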

Implementation of the Riemannian proximal policy optimization method
Recall that in optimization problem (2) we are trying to minimize an objective function of the form $f(\theta') = g(\theta') + \varphi(\theta')$ over the GMM parameters $\theta'$.

2) Retraction
With $S_{i,t}$ and $\operatorname{grad}_{S_{i,t}} g(\theta')$ shown above at iteration $t$, we would like to compute $S_{i,t+1}$ using retraction. From (Cheng, 2013), for any tangent vector $\eta \in T_W M$, where $W$ is a point in the Riemannian space $M$, the retraction is $R_W(\eta) := \arg\min_{X \in M} \|W + \eta - X\|_F$. In our case this yields
\[
S_{i,t+1} = \sum_{j} \max(\sigma_j, 0)\, q_j q_j^\top,
\]
where $\sigma_j$ and $q_j$ are the $j$-th eigenvalue and eigenvector of the matrix $S_{i,t} - \alpha_t\,(\operatorname{grad}_{S_{i,t}} g(\theta') + \partial_{S_{i,t}} \varphi(\theta'))$. The parameters $\eta_i$, $i = 1, 2, \ldots, K-1$, are updated using the standard gradient descent method in Euclidean space. The gradient computation and retraction shown above are repeated until $f(\theta')$ converges.
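A minimal sketch of this retraction step (projecting the Euclidean update back onto the positive semidefinite cone via an eigendecomposition) is shown below; the step size and gradient inputs are placeholders, and the helper names are our own.

```python
import numpy as np

def psd_retraction(W, eta):
    """Retraction R_W(eta) = argmin_{X PSD} ||W + eta - X||_F:
    eigendecompose W + eta and clip negative eigenvalues to zero."""
    sigma, Q = np.linalg.eigh(W + eta)           # eigenvalues / eigenvectors of the update
    return (Q * np.maximum(sigma, 0.0)) @ Q.T    # sum_j max(sigma_j, 0) q_j q_j^T

def riemannian_proximal_step(S, riem_grad, subgrad_phi, step_size):
    """One update S_{t+1} = R_{S_t}(-alpha_t (grad g + d phi)) for a single component."""
    return psd_retraction(S, -step_size * (riem_grad + subgrad_phi))

# Example: one step from the identity with a symmetric placeholder "gradient".
S = np.eye(3)
G = np.array([[0.2, 0.1, 0.0], [0.1, -0.3, 0.0], [0.0, 0.0, 0.5]])
print(riemannian_proximal_step(S, G, np.zeros((3, 3)), step_size=0.1))
```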

Simulation environments and baseline methods
We choose TRPO and PPO, which are well known to excel at continuous-control tasks, as baseline algorithms. Each algorithm is run on the following three environments in the OpenAI Gym MuJoCo simulator (Todorov et al., 2012): InvertedPendulum-v2, Hopper-v2, and Walker2d-v2, which have increasing task complexity in terms of the size of their state and action spaces. For each run, we compute the average reward over every 50 episodes and report the mean reward curve and parameter statistics for comparison.
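A minimal sketch of the evaluation loop described above (running episodes in a Gym MuJoCo environment and averaging rewards over blocks of 50 episodes) is shown below; the random policy is a stand-in for the learned one, and the calls follow the classic `gym` interface.

```python
import gym
import numpy as np

env = gym.make("Hopper-v2")           # also: InvertedPendulum-v2, Walker2d-v2
episode_returns = []

for episode in range(200):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = env.action_space.sample()        # placeholder for the learned policy
        obs, reward, done, info = env.step(action)
        total += reward
    episode_returns.append(total)

# Average reward over every block of 50 episodes, as reported in the experiments.
block_means = [np.mean(episode_returns[i:i + 50])
               for i in range(0, len(episode_returns), 50)]
print(block_means)
```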

Preliminary results
In Fig. 1 we show the mean reward (column 1) for the PPO, RPPO, and TRPO algorithms on the three MuJoCo environments, together with screenshots (column 2) and the probability density of the GMM (column 3) for RPPO on each environment. From the learning curves, we can see that as the state-action dimension of the environment increases (see Table 1), both the convergence speed and the reward improvement slow down: the higher the dimension of the environment, the more difficult the optimization task becomes. Correspondingly, in the GMM plots, S and A denote the state and action dimensions, respectively, and the probability density is shown on the z-axis. From the density plots, we can see that as the environment complexity increases, the density pattern becomes more diverse and the off-diagonal covariance terms become more important. The probability density of the GMM shows that RPPO learns a meaningful structure for the policy functions.
TRPO and PPO are purely neural-network-based models with numerous parameters, which makes them vulnerable to overfitting, poor network architecture design, and hyper-parameter tuning. RPPO achieves better robustness by using far fewer parameters. In Table 1 we compare the number of parameters of each algorithm on each environment; the GMM has on the order of $10^3$ to $10^5$ times fewer parameters than TRPO and PPO.

Conclusion
We proposed a general Riemannian proximal optimization algorithm with guaranteed convergence for solving Markov decision process (MDP) problems. To model policy functions in MDPs, we employed the Gaussian mixture model (GMM) and formulated policy optimization as a non-convex optimization problem in the Riemannian space of positive semidefinite matrices. Preliminary experiments on benchmark tasks in the OpenAI Gym MuJoCo simulator (Todorov et al., 2012) show the efficacy of the proposed RPPO algorithm.
In Sec. 4.1, the Algorithm 1 we proposed is capable of optimizing a general class of non-convex functions of the form $f(\theta) = g(\theta) - h(\theta) + \varphi(\theta)$. Due to the page limit, in this study we focus on $f(\theta) = g(\theta) + \varphi(\theta)$, as shown in optimization problem (2). In the future, it would be interesting to incorporate constraints of MDP problems, as in constrained policy optimization (Achiam et al., 2017), and encode them as a concave term $-h(\theta)$ in our RPPO algorithm.