Notes for Generative Adversarial Imitation Learning (GAIL)


Author: Charles Zhang
All Notes Catalog for Reinforcement Learning. This post is released under the BY-NC-ND 4.0 license; please follow its terms when sharing.

Generative Adversarial Imitation Learning (GAIL)

PAPER LINK

A Quick Glance at the Experiment

Input state $s \Rightarrow$ policy $\pi_\theta \Rightarrow$ action $a \Rightarrow$ discriminator $D_w \Rightarrow$ "is this $(s, a)$ from the expert policy $\pi_E$?" YES/NO

Algorithm

GAIL solves imitation by finding a saddle point $(\pi, D)$ of the expression

$$ \mathbb{E}_{\pi}[\log (D(s, a))]+\mathbb{E}_{\pi_{E}}[\log (1-D(s, a))]-\lambda H(\pi) $$

Algorithm Generative adversarial imitation learning


1: Input: Expert trajectories $\tau_{E} \sim \pi_{E}$, initial policy and discriminator parameters $\theta_{0}, w_{0}$

2: for $i=0,1,2, \ldots$ do

3: Sample trajectories $\tau_{i} \sim \pi_{\theta_{i}}$

4: Update the discriminator parameters from $w_{i}$ to $w_{i+1}$ with the gradient $$ \hat{\mathbb{E}}_{\tau_{i}}\left[\nabla_{w} \log \left(D_{w}(s, a)\right)\right]+\hat{\mathbb{E}}_{\tau_{E}}\left[\nabla_{w} \log \left(1-D_{w}(s, a)\right)\right] $$

5: Take a policy step from $\theta_{i}$ to $\theta_{i+1}$, using the TRPO rule with cost function $\log \left(D_{w_{i+1}}(s, a)\right)$. Specifically, take a KL-constrained natural gradient step with $$ \begin{aligned} &\hat{\mathbb{E}}_{\tau_{i}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q(s, a)\right]-\lambda \nabla_{\theta} H\left(\pi_{\theta}\right) \\ &\text { where } Q(\bar{s}, \bar{a})=\hat{\mathbb{E}}_{\tau_{i}}\left[\log \left(D_{w_{i+1}}(s, a)\right) \mid s_{0}=\bar{s}, a_{0}=\bar{a}\right] \end{aligned} $$

6: end for


Specifically, an Adam gradient ascent step on $w$ is applied to the $D_w$ network to increase the expression, while a TRPO gradient descent step on $\theta$ is applied to the $\pi_\theta$ network to decrease it.
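As a concrete (non-authoritative) sketch of the discriminator step, the snippet below performs the Adam ascent on $w$ by minimizing the negated objective. PyTorch is used purely for illustration (the reference implementation in OpenAI Baselines uses TensorFlow); the hidden sizes follow the architecture described below, while `obs_dim`, `act_dim`, the learning rate, and the batch tensors `policy_sa`, `expert_sa` (concatenated $(s, a)$ pairs from $\tau_i$ and $\tau_E$) are placeholders of my own.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 11, 3   # placeholder environment dimensions

# Discriminator D_w: 2 hidden layers of 100 tanh units with a sigmoid
# output in (0, 1), matching the architecture described in these notes.
disc = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 1), nn.Sigmoid(),
)
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)  # lr is an arbitrary placeholder


def discriminator_step(policy_sa, expert_sa, eps=1e-8):
    """One Adam ascent step on E_pi[log D] + E_piE[log(1 - D)].

    policy_sa, expert_sa: float tensors of shape (batch, obs_dim + act_dim)
    holding concatenated (s, a) pairs from tau_i and tau_E.
    Ascent is implemented as descent on the negated objective.
    """
    d_pi = disc(policy_sa)
    d_exp = disc(expert_sa)
    objective = torch.log(d_pi + eps).mean() + torch.log(1.0 - d_exp + eps).mean()
    loss = -objective            # maximize the objective == minimize its negation
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return objective.item()
```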

Architecture: $\pi_\theta$ and $D_w$ each use 2 hidden layers of 100 units, with $\tanh$ nonlinearities in between.

To reduce the gradient variance, the experiments in the paper also fit a value function with the same architecture and use GAE with $\gamma = 0.995$ and $\lambda = 0.97$.

For implementation, the output of $D_w$ is constrained to $(0,1)$ (e.g., by a sigmoid output). The policy network can be trained with any RL algorithm ($Q(s, a)$ can be replaced by an advantage estimate as well). For example, with PPO, the advantage can be computed by GAE from the cost $c(s, a) = \log D_w(s, a)$ and the value $V$ given by the critic, and then used to update the actor and critic networks.
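A minimal sketch of that recipe, assuming the `disc` network from the previous snippet and a single trajectory collected with the current policy: the GAIL cost $\log D_w(s,a)$ is negated into a surrogate reward, and GAE (with the paper's $\gamma$, $\lambda$) turns it into advantages that a PPO or TRPO update could consume. The helper names are mine, not from the paper or Baselines.

```python
import numpy as np
import torch


def gail_reward(disc, states, actions, eps=1e-8):
    """Surrogate reward r(s, a) = -log D_w(s, a), the negated GAIL cost,
    so that a reward-maximizing learner (PPO/TRPO) minimizes log D_w."""
    with torch.no_grad():
        d = disc(torch.cat([states, actions], dim=-1)).squeeze(-1)
    return -torch.log(d + eps).numpy()


def gae_advantages(rewards, values, gamma=0.995, lam=0.97):
    """Generalized Advantage Estimation over one trajectory.

    rewards: length-T array of surrogate rewards
    values:  length-(T+1) array of critic values (bootstrap value appended)
    gamma and lam follow the settings reported in the paper.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv   # value targets for the critic: adv + values[:-1]
```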

Paper Workflow

$*$ Background: Two Approaches to Imitation Learning

  • Behavioral Cloning (BC)
    • BC learns a policy by supervised learning over state-action pairs from expert trajectories.
    • Problems: 1. requires a large amount of training data; 2. inefficient due to compounding errors caused by covariate shift.
  • Inverse Reinforcement Learning (IRL)
    • IRL learns a cost function (reward) under which the expert is uniquely optimal; this explains the expert behavior but does not directly give a policy (how to act).
    • Problems: 1. produces a cost rather than a policy; 2. expensive, since it runs RL in an inner loop; 3. can diverge when the inner-loop RL is only locally optimal for the learned cost.

$*$ Typical IRL (e.g., apprenticeship learning): $$\underset{\pi}{\operatorname{minimize}} \max_{c\in\mathcal{C}} \mathbb{E}_\pi[c(s, a)] - \mathbb{E}_{\pi_E}[c(s, a)]$$

Note that $\displaystyle \mathbb{E}_\pi[c(s, a)] = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t c(s_t, a_t)\right]$

$*$ Adding:

  • a convex cost function regularizer $\psi(c)$ (convexity ensures a single saddle point)
  • the $\gamma$-discounted causal entropy $H(\pi) = \mathbb{E}_\pi[-\log \pi(a\mid s)]$

$\Rightarrow$ $\displaystyle\operatorname{IRL}_{\psi}(\pi_E) = \underset{c\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}}{\operatorname{argmax}} -\psi(c) + \left( \min_{\pi\in\Pi} -H(\pi) + \mathbb{E}_\pi[c(s, a)] \right) - \mathbb{E}_{\pi_E}[c(s,a)]$

The inner term $\left(\min_{\pi\in\Pi} -H(\pi) + \mathbb{E}_\pi[c(s, a)] \right)$ can be viewed as an RL procedure, so IRL looks for the cost function that maximizes the gap between the cost of the "RL policy derived from that cost" and the cost of the expert trajectories.

Therefore (ignoring the entropy and regularization terms), the maximum of the expression above is $0$, i.e., IRL learns a cost function under which no policy incurs lower cost than the expert trajectories.

$*$ $\operatorname{RL}(c) = \displaystyle \underset{\pi\in\Pi}{\operatorname{argmin}}\ {-H(\pi)} + \mathbb{E}_\pi[c(s, a)]$
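With the causal entropy term, $\operatorname{RL}(c)$ is entropy-regularized RL. On a small tabular MDP it can be solved exactly with soft (log-sum-exp) value iteration; the sketch below is a standard solver for entropy-regularized RL rather than anything from the GAIL paper, and the toy MDP is made up for the example.

```python
import numpy as np


def soft_value_iteration(P, c, gamma=0.95, iters=500):
    """Entropy-regularized RL(c): returns the policy minimizing
    -H(pi) + E_pi[c(s, a)] via soft Bellman backups.

    P: transitions, shape (S, A, S);  c: costs, shape (S, A).
    """
    S, A = c.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = -c + gamma * (P @ V)              # soft Q-values, shape (S, A)
        V = np.log(np.exp(Q).sum(axis=1))     # soft (log-sum-exp) state values
    pi = np.exp(Q - V[:, None])               # softmax policy, rows sum to 1
    return pi


# Toy usage with a random MDP and cost, purely illustrative.
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
c = rng.uniform(0.0, 1.0, size=(S, A))
pi = soft_value_iteration(P, c)
print(pi, pi.sum(axis=1))                     # each row sums to 1
```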

$*$ Occupancy Measure: $\displaystyle \rho_\pi(s, a):=\pi(a\mid s)\sum_{t=0}^\infty \gamma^t P(s_t = s\mid \pi)$

  • it measures how often (with discounting) the pair $(s, a)$ is visited under $\pi$
  • $\pi_\rho(a\mid s) = \rho(s, a)/ \sum_{a^\prime}\rho(s, a^{\prime})$

It is an unnormalized distribution over state-action pairs (similar to the visitation distribution used in TRPO). Correspondingly, $\mathbb{E}_\pi[c(s, a)] = \sum_{s, a}\rho_\pi(s, a)c(s,a)$.
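To make this concrete, here is an illustrative Monte Carlo estimate of $\rho_\pi$ on a tiny random tabular MDP (the MDP, rollout counts, and function names are all made up for this example); it also recovers $\pi_\rho$ from $\rho$ via the formula above.

```python
import numpy as np


def estimate_occupancy(P, policy, mu0, gamma=0.95, n_rollouts=3000, horizon=150, seed=0):
    """Monte Carlo estimate of rho_pi(s, a) = pi(a|s) * sum_t gamma^t P(s_t = s | pi).

    P:      transition tensor, shape (S, A, S)
    policy: stochastic policy, shape (S, A), rows sum to 1
    mu0:    initial state distribution, shape (S,)
    """
    rng = np.random.default_rng(seed)
    S, A = policy.shape
    rho = np.zeros((S, A))
    for _ in range(n_rollouts):
        s = rng.choice(S, p=mu0)
        for t in range(horizon):
            a = rng.choice(A, p=policy[s])
            rho[s, a] += gamma ** t
            s = rng.choice(S, p=P[s, a])
    return rho / n_rollouts   # unnormalized: entries sum to about 1 / (1 - gamma)


# Tiny random MDP, purely for illustration.
S, A = 4, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))
policy = rng.dirichlet(np.ones(A), size=S)
mu0 = np.ones(S) / S

rho = estimate_occupancy(P, policy, mu0)
print(rho.sum(), 1 / (1 - 0.95))              # both close to 20

# Recover pi_rho(a|s) = rho(s, a) / sum_a' rho(s, a'); matches the policy
# up to Monte Carlo error.
pi_from_rho = rho / rho.sum(axis=1, keepdims=True)
print(np.abs(pi_from_rho - policy).max())
```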

Convex conjugate: $\psi^*(x) = \sup_{y\in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}} x^T y - \psi(y)$

$*\Rightarrow$ Main Framework: "the policy learned by RL on the cost recovered by IRL":

$$R L \circ I R L_\psi\left(\pi_{E}\right)=\underset{\pi \in \Pi}{\arg \min }-H(\pi)+\psi^{*}\left(\rho_{\pi}-\rho_{\pi_{E}}\right)$$

$*$ Some Propositions:

  1. For any policy $\pi$ and occupancy measure $\rho$, $$\begin{aligned}H(\pi) &= \mathbb{E}_\pi[-\log \pi(a\mid s)] = -\sum_{s,a}\rho_\pi(s, a)\log \pi(a\mid s) \\ &= -\sum_{s, a}\rho_\pi(s, a)\log \frac{\rho_\pi(s, a)}{\sum_{a^\prime} \rho_\pi(s, a^\prime)} =: \bar{H}(\rho_\pi) \end{aligned}$$ $$\bar{H}(\rho) = -\sum_{s,a}\rho(s, a)\log \pi_\rho(a\mid s) = \mathbb{E}_{\pi_\rho}[-\log \pi_\rho (a\mid s)] = H(\pi_\rho)$$
  2. Let $L(\pi, c) = -H(\pi) + \mathbb{E}_\pi[c(s, a)], \bar{L}(\rho,c) = -\bar{H}(\rho)+\sum_{s, a}\rho(s, a)c(s, a)$, then $\forall c\in\mathcal{C}$:
$$L(\pi, c) = \bar{L}(\rho_\pi, c), \quad \forall \pi\in\Pi$$$$\bar{L}(\rho, c) = L(\pi_\rho, c), \quad \forall \rho$$
  3. If $\psi$ is a constant function, $\tilde{c}\in\operatorname{IRL}_\psi(\pi_E)$, and $\tilde{\pi}\in \operatorname{RL}(\tilde{c})$, then $\rho_{\tilde{\pi}} = \rho_{\pi_E}$. As: $$ \begin{aligned} \tilde{c} & \in \operatorname{IRL}_{\psi}\left(\pi_{E}\right)=\underset{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}}{\arg \max } \min _{\pi \in \Pi}-H(\pi)+\mathbb{E}_{\pi}[c(s, a)]-\mathbb{E}_{\pi_{E}}[c(s, a)]+\text { const. } \\ &=\underset{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}}{\arg \max } \min _{\rho \in \mathcal{D}}-\bar{H}(\rho)+\sum_{s, a} \rho(s, a) c(s, a)-\sum_{s, a} \rho_{\pi_E}(s, a) c(s, a) \end{aligned} $$ $$\text{so with the optimal }\tilde{c}, \text{ the optimal solution is }\tilde{\rho} = \underset{\rho\in\mathcal{D}}{\operatorname{argmin}}\, \bar{L}(\rho, \tilde{c}) = \rho_{\pi_E}$$

    This means that without regularization, the occupancy measure of the derived policy is exactly the same as the expert's. The proposition can be proved with the provided lemmas and the dual of the optimization problem (the proof is in the appendix of the paper and is fairly straightforward).

$*$ Apprenticeship Learning $\Rightarrow$ a special case of the above framework

As: let $\delta_\mathcal{C}(c) = 0$ if $c\in\mathcal{C}$ and $+\infty$ otherwise; then

$$ \max _{c \in \mathcal{C}} \mathbb{E}_{\pi}[c(s, a)]-\mathbb{E}_{\pi_{E}}[c(s, a)]=\max _{c \in \mathbb{R}^{\mathcal{S}\times \mathcal{A}} }-\delta_{\mathcal{C}}(c)+\sum_{s, a}\left(\rho_{\pi}(s, a)-\rho_{\pi_{E}}(s, a)\right) c(s, a)=\delta_{\mathcal{C}}^{*}\left(\rho_{\pi}-\rho_{\pi_{E}}\right) $$$$ \therefore \underset{\pi}{\operatorname{minimize}}\ {-H(\pi)}+\max _{c \in \mathcal{C}} \mathbb{E}_{\pi}[c(s, a)]-\mathbb{E}_{\pi_{E}}[c(s, a)]=\underset{\pi}{\operatorname{minimize}}\ {-H(\pi)}+\delta_{\mathcal{C}}^{*}\left(\rho_{\pi}-\rho_{\pi_{E}}\right) $$

which means apprenticeship learning corresponds to choosing $\psi = \delta_\mathcal{C}$.

$*$ GAIL $\Rightarrow$ a special case of the above framework

Given: $$ \begin{aligned} &g_{\phi}(x)= \begin{cases}-x+\phi\left(-\phi^{-1}(-x)\right) & \text { if } x \in T \\ +\infty & \text { otherwise }\end{cases} \\ &\psi_{\phi}(c)= \begin{cases}\displaystyle\sum_{s, a} \rho_{\pi_{E}}(s, a) g_{\phi}(c(s, a)) & \text { if } c(s, a) \in T \text { for all } s, a \\ +\infty & \text { otherwise }\end{cases} \end{aligned} $$ where $\phi$ is a strictly decreasing convex function whose range is $-T$. Let $R_{\phi}\left(\pi, \pi_{E}\right)=\displaystyle\sum_{s, a} \min _{\gamma \in \mathbb{R}} \rho_{\pi}(s, a) \phi(\gamma)+\rho_{\pi_{E}}(s, a) \phi(-\gamma)$; then: $$ \begin{aligned} \psi_{\phi}^{*}\left(\rho_{\pi}-\rho_{\pi_{E}}\right) &=\sum_{s, a} \max _{c \in T}\left(\rho_{\pi}(s, a)-\rho_{\pi_{E}}(s, a)\right) c-\rho_{\pi_{E}}(s, a)\left[-c+\phi\left(-\phi^{-1}(-c)\right)\right] \\ &=\sum_{s, a} \max _{c \in T} \rho_{\pi}(s, a) c-\rho_{\pi_{E}}(s, a) \phi\left(-\phi^{-1}(-c)\right) \\ &=\sum_{s, a} \max _{\gamma \in \mathbb{R}} \rho_{\pi}(s, a)(-\phi(\gamma))-\rho_{\pi_{E}}(s, a) \phi\left(-\phi^{-1}(\phi(\gamma))\right) \\ &= -R_\phi (\pi, \pi_{E}) \\ & \quad\text{and when }\phi(x) = \log(1+e^{-x}): \\ &=\sum_{s, a} \max _{\gamma \in \mathbb{R}} \rho_{\pi}(s, a) \log \left(\frac{1}{1+e^{-\gamma}}\right)+\rho_{\pi_{E}}(s, a) \log \left(\frac{1}{1+e^{\gamma}}\right)\\ &=\sum_{s, a} \max _{\gamma \in \mathbb{R}} \rho_{\pi}(s, a) \log \left(\frac{1}{1+e^{-\gamma}}\right)+\rho_{\pi_{E}}(s, a) \log \left(1-\frac{1}{1+e^{-\gamma}}\right)\\ &=\sum_{s, a} \max _{d \in(0,1)} \rho_{\pi}(s, a) \log d+\rho_{\pi_{E}}(s, a) \log (1-d) \\ &= \max_{D\in(0,1)^{\mathcal{S}\times\mathcal{A}}} \sum_{s, a}\rho_\pi(s, a)\log (D(s, a))+ \rho_{\pi_E}(s, a)\log(1-D(s, a)) \\ &=\max_{D\in(0,1)^{\mathcal{S}\times\mathcal{A}}} \mathbb{E}_\pi[\log(D(s, a))] +\mathbb{E}_{\pi_E}[\log(1-D(s, a))] \end{aligned} $$
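A quick numerical sanity check of the pointwise maximization over $d$ above: for fixed weights $a=\rho_\pi(s,a)$ and $b=\rho_{\pi_E}(s,a)$, setting the derivative of $a\log d + b\log(1-d)$ to zero gives $d^* = a/(a+b)$, the familiar optimal GAN-style discriminator. The grid search below is purely illustrative.

```python
import numpy as np

# Fixed weights a = rho_pi(s, a) and b = rho_piE(s, a), chosen at random.
rng = np.random.default_rng(0)
a, b = rng.uniform(0.1, 1.0, size=2)

# Numerically maximize a*log(d) + b*log(1 - d) over d in (0, 1).
d_grid = np.linspace(1e-4, 1 - 1e-4, 100001)
f = a * np.log(d_grid) + b * np.log(1.0 - d_grid)
d_numeric = d_grid[np.argmax(f)]
d_closed = a / (a + b)
print(d_numeric, d_closed)                     # agree up to the grid resolution

# Plugging d* back in gives (up to constants) the Jensen-Shannon-style
# divergence between the two occupancy measures that GAIL minimizes.
print(a * np.log(d_closed) + b * np.log(1.0 - d_closed))
```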

Meaning: replace the cost function of apprenticeship learning with a discriminator $D$.

In GAN terms, the learner's occupancy measure plays the role of the distribution generated by $G$, and the expert's occupancy measure plays the role of the true data distribution.

Therefore, GAIL looks for an agent policy $\pi$ together with a discriminator $D$ that tells whether a state-action pair was generated by the expert (true data) or by the policy.

$\Rightarrow$ GAIL: $\underset{\pi}{\operatorname{minimize}} \psi_{\mathrm{GA}}^{*}\left(\rho_{\pi}-\rho_{\pi_{E}}\right)-\lambda H(\pi)$

$*$ Some Further Explanation of the Algorithm

KL-constrained natural gradient step: the causal entropy gradient is

$$\begin{aligned} \nabla_{\theta} \mathbb{E}_{\pi_{\theta}}\left[-\log \pi_{\theta}(a \mid s)\right] &=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q_{\log }(s, a)\right] \\ \text { where } Q_{\log }(\bar{s}, \bar{a}) &=\mathbb{E}_{\pi_{\theta}}\left[-\log \pi_{\theta}(a \mid s) \mid s_{0}=\bar{s}, a_{0}=\bar{a}\right] \end{aligned}$$

In the proof, the key observation is that $\sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s)=\nabla_{\theta} \sum_{a} \pi_{\theta}(a \mid s)=\nabla_{\theta} 1=0$.

$$\Rightarrow \nabla_{\theta} \mathbb{E}_{\pi_{\theta}}\left[-\log \pi_{\theta}(a \mid s)\right]=\sum_{s, a}\left(\nabla_{\theta} \rho_{\pi_{\theta}}(s, a)\right)\left(-\log \pi_{\theta}(a \mid s)\right)$$

$\Rightarrow$ this is a policy gradient for RL with the (momentarily held fixed) cost function $c_{\log}(s, a):= -\log \pi_\theta(a\mid s)$
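The identity $\sum_a \nabla_\theta \pi_\theta(a\mid s)=0$ used above can be checked numerically, e.g., for a small softmax policy; the snippet below is an illustrative toy (the parameterization is made up, not from the paper) using automatic differentiation.

```python
import torch

# Numerical check of sum_a grad_theta pi_theta(a|s) = 0 for a small
# softmax policy with random parameters and a random state feature vector.
torch.manual_seed(0)
theta = torch.randn(5, 3, requires_grad=True)   # 5 features, 3 actions
s = torch.randn(5)                               # arbitrary state features

pi = torch.softmax(s @ theta, dim=0)             # pi_theta(. | s), a point on the simplex
grad_sum = torch.autograd.grad(pi.sum(), theta)[0]
print(grad_sum.abs().max())                      # ~0, up to floating-point error
```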

Optimization proceeds by alternating gradient steps over the policy and discriminator parameters. Policy gradient is only guaranteed to converge to a local minimum and is model-free.

Because vanilla policy-gradient estimates can have high variance and require small steps, the paper instead uses a TRPO step on $\theta$ to decrease the objective with respect to the policy $\pi_\theta$.

$*$ Result: GAIL outperforms BC, FEM, and GTAL in the paper's experiments.

Outlook

Improve GAIL's learning speed by initializing the policy parameters with BC.

Reference

[1] Generative Adversarial Imitation Learning

[2] GAIL implementation in OpenAI Baselines