Chapter 2 Multi-armed Bandits
“Reinforcement Learning: An Introduction” by Richard Sutton & Andrew Barto, 2nd Ed
Post Author: Charles Zhang
All notes catalog for Reinforcement Learning: An Introduction. This post is released under the CC BY-NC-ND 4.0 license; please follow its terms when sharing.
Action-Value Methods
$k$-armed bandit: $k$ options (actions) to choose among at each time step.
Value of an action: $q_{*}(a) \doteq \mathbb{E}[R_t\mid A_t = a]$, which we estimate with $Q_t(a) \approx q_{*}(a)$.
$$ Q_{t}(a) \doteq \displaystyle\frac{\displaystyle\sum_{i=1}^{t-1} R_{i} \cdot \mathbb{1}_{A_{i}=a}}{\displaystyle\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}, \qquad \mathbb{1}:=\left\{\begin{array}{l} 1 \text { if predicate is true } \\ 0 \text { if predicate is false } \end{array}\right. $$ When $\displaystyle \sum_{i=1}^{t-1} \mathbb{1}_{A_i=a} \rightarrow \infty$, $Q_t(a) \rightarrow q_{*}(a)$ (by the law of large numbers).
Greedy action: $A_{t} \doteq \arg \max_{a} Q_{t}(a)$; $\epsilon$-greedy: with probability $\epsilon$ pick a random action instead, which maintains exploration and prevents locking onto a suboptimal action.
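As a minimal sketch of these two ideas together (my own illustration, separate from the fuller `Bandit` class implemented below), an $\epsilon$-greedy agent with sample-average estimates on a hypothetical 10-armed testbed might look like this:
import numpy as np

def epsilon_greedy_run(q_true, epsilon=0.1, steps=1000, seed=0):
    """One run of epsilon-greedy with sample-average estimates Q_t(a)."""
    rng = np.random.default_rng(seed)
    k = len(q_true)
    reward_sum = np.zeros(k)            # numerator of the sample average
    count = np.zeros(k)                 # denominator: times each action was selected
    Q = np.zeros(k)                     # current estimates Q_t(a)
    total = 0.0
    for t in range(steps):
        if rng.random() < epsilon:      # explore with probability epsilon
            a = rng.integers(k)
        else:                           # otherwise act greedily
            a = int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)  # reward drawn from N(q*(a), 1)
        reward_sum[a] += r
        count[a] += 1
        Q[a] = reward_sum[a] / count[a] # sample-average update
        total += r
    return Q, total / steps

Q, avg_reward = epsilon_greedy_run(np.random.randn(10))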
The estimate of an action's value after it has been selected $n-1$ times:
$$Q_{n} \doteq \frac{R_{1}+R_{2}+\cdots+R_{n-1}}{n-1},$$
or, written incrementally, $Q_{n+1}=Q_{n}+\frac{1}{n}\left(R_{n}-Q_{n}\right)$; for $n=1$ this gives $Q_{2}=R_{1}$ for an arbitrary $Q_1$.
General [Update Rule]: NewEstimate $\leftarrow$ OldEstimate $+$ StepSize $\cdot$ [Target $-$ OldEstimate], with step size $\alpha_t(a)$; the quantity [Target $-$ OldEstimate] is the "error" in the estimate.
$$\begin{aligned} Q_{n+1} &= \displaystyle Q_{n}+\alpha(R_{n}-Q_{n}) = \alpha R_n +(1-\alpha) Q_n \\ &= \alpha R_n +(1-\alpha)[\alpha R_{n-1} +(1-\alpha) Q_{n-1}] \\ &= \alpha R_n +(1-\alpha)\alpha R_{n-1} +(1-\alpha)^2\alpha R_{n-2} + \cdots +(1-\alpha)^{n-1}\alpha R_{1} +(1-\alpha)^n Q_1 \\ &= (1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha(1-\alpha)^{n-i}R_i \end{aligned}$$
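A quick numerical check of this closed form (my own snippet, not from the book): run the iterative update and compare it with the weighted sum it is supposed to equal.
import numpy as np

rng = np.random.default_rng(0)
alpha, Q1 = 0.1, 0.0
rewards = rng.normal(size=50)                     # R_1, ..., R_n

# iterative form: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
Q = Q1
for R in rewards:
    Q += alpha * (R - Q)

# closed form: (1 - alpha)^n Q_1 + sum_i alpha (1 - alpha)^(n - i) R_i
n = len(rewards)
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
closed_form = (1 - alpha) ** n * Q1 + np.sum(weights * rewards)

assert np.isclose(Q, closed_form)                 # both agree up to floating-point error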
Implementations
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from tqdm import trange
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
class Bandit:
def __init__(self, k_arm=10, epsilon=0., initial_val=0., step_size=0.1, sample_averages=False, UCB_param=None,
gradient=False, gradient_baseline=False, true_reward=0.):
"""
        :param k_arm: # of arms, 10 by default (as in the book's 10-armed testbed)
:param epsilon: probability for exploration in epsilon-greedy algorithm
:param initial_val: initial estimated value for each action
:param step_size: constant step size(learning rate) for updating estimations
:param sample_averages: if True, use sample averages to update estimations instead of constant step size
:param UCB_param: if not None, use UCB algorithm to select action
:param gradient: if True, use gradient based bandit algorithm
:param gradient_baseline: if True, use average reward as baseline for gradient based bandit algorithm
        :param true_reward: offset added to the true action values (the mean of q*(a))
"""
self.k = k_arm
self.epsilon = epsilon
self.initial_val = initial_val
self.step_size = step_size
self.sample_averages = sample_averages
self.UCB_param = UCB_param
self.gradient = gradient
self.gradient_baseline = gradient_baseline
self.true_reward = true_reward
self.indices = np.arange(self.k)
self.time = 0
self.average_reward = 0
self.action_prob = 0
self.q_true = np.random.randn(self.k) + self.true_reward # true(actual) value for each action, q(a)
self.q_estimation = np.zeros(self.k) + self.initial_val # estimated value for each action, Q(a)
self.action_count = np.zeros(self.k) # # of chosen times for each action
        self.best_action = np.argmax(self.q_true)  # best action according to the true values q*(a)
def reset(self):
self.q_true = np.random.randn(self.k) + self.true_reward
self.q_estimation = np.zeros(self.k) + self.initial_val
self.action_count = np.zeros(self.k)
self.best_action = np.argmax(self.q_true)
self.time = 0
def action(self):
"""
:return: get an action for this bandit
"""
if np.random.rand() < self.epsilon:
return np.random.choice(self.indices)
if self.UCB_param is not None:
UCB_estimation = self.q_estimation + \
self.UCB_param * np.sqrt(np.log(self.time + 1) / (self.action_count + 1e-5))
q_best = np.max(UCB_estimation)
return np.random.choice(np.where(UCB_estimation == q_best)[0])
if self.gradient:
exp_est = np.exp(self.q_estimation)
self.action_prob = exp_est / np.sum(exp_est)
return np.random.choice(self.indices, p=self.action_prob)
q_best = np.max(self.q_estimation)
return np.random.choice(np.where(self.q_estimation == q_best)[0])
# take an action, update estimation for this action
def step(self, action):
# generate the reward under N(real reward, 1)
reward = np.random.randn() + self.q_true[action]
self.time += 1
self.action_count[action] += 1
self.average_reward += (reward - self.average_reward) / self.time
if self.sample_averages:
# update estimation using sample averages
self.q_estimation[action] += (reward - self.q_estimation[action]) / self.action_count[action]
elif self.gradient:
one_hot = np.zeros(self.k)
one_hot[action] = 1
if self.gradient_baseline:
baseline = self.average_reward
else:
baseline = 0
self.q_estimation += self.step_size * (reward - baseline) * (one_hot - self.action_prob)
else:
# update estimation with constant step size
self.q_estimation[action] += self.step_size * (reward - self.q_estimation[action])
return reward
def simulate(runs, time, bandits):
rewards = np.zeros((len(bandits), runs, time))
best_action_counts = np.zeros(rewards.shape)
for i, bandit in enumerate(bandits):
for r in trange(runs):
bandit.reset()
for t in range(time):
action = bandit.action()
reward = bandit.step(action)
rewards[i, r, t] = reward
if action == bandit.best_action:
best_action_counts[i, r, t] = 1
mean_best_action_counts = best_action_counts.mean(axis=1)
mean_rewards = rewards.mean(axis=1)
return mean_best_action_counts, mean_rewards
def figure_2_1(runs=1000, time=1000):
epsilons = [0, 0.1, 0.01]
bandits = [Bandit(epsilon=eps, sample_averages=True) for eps in epsilons]
best_action_counts, rewards = simulate(runs, time, bandits)
plt.figure(figsize=(9, 12))
plt.subplot(2, 1, 1)
    for eps, rewards_eps in zip(epsilons, rewards):
        plt.plot(rewards_eps, label=r'$\epsilon = %.02f$' % eps)
plt.xlabel('steps')
plt.ylabel('average reward')
plt.legend()
plt.subplot(2, 1, 2)
for eps, counts in zip(epsilons, best_action_counts):
        plt.plot(counts, label=r'$\epsilon = %.02f$' % eps)
plt.xlabel('steps')
plt.ylabel('% optimal action')
plt.legend()
    plt.show()
figure_2_1()
Note that $(1-\alpha)^n+ \displaystyle\sum_{i=1}^n \alpha(1-\alpha)^{n-i} = 1$ for $\alpha \in [0,1]$; $Q_{n+1}$ is therefore called an (exponential recency-)weighted average of the past rewards $R_i$ and the initial estimate $Q_1$.
Proof:
$n=1, 1-\alpha+\alpha = 1$;
assume for $n=k, (1-\alpha)^k+ \displaystyle\sum_{i=1}^k \alpha(1-\alpha)^{k-i} = 1$;
then for $n=k+1$,
$\begin{aligned} & (1-\alpha)^{k+1}+ \displaystyle\sum_{i=1}^{k+1} \alpha(1-\alpha)^{k+1-i} - [(1-\alpha)^k+ \displaystyle\sum_{i=1}^k \alpha(1-\alpha)^{k-i}] \\ &= (1-\alpha)^{k+1}- (1-\alpha)^{k} + \displaystyle\sum_{i=1}^k \alpha(1-\alpha)^{k-i} + \alpha(1-\alpha)^{k+1-1} - \displaystyle\sum_{i=1}^k \alpha(1-\alpha)^{k-i}\\ &= (1-\alpha)^k(1-\alpha-1)+\alpha(1-\alpha)^k \\ &= -\alpha(1-\alpha)^k+\alpha(1-\alpha)^k \\ &= 0 \end{aligned} $
$\therefore (1-\alpha)^{k+1}+ \displaystyle\sum_{i=1}^{k+1} \alpha(1-\alpha)^{k+1-i} = 1, \forall k \in \mathbb{N}$
$\therefore \forall n\in\mathbb{N}, \text{ }\alpha \in [0,1], \text{ } (1-\alpha)^n+ \displaystyle\sum_{i=1}^n \alpha(1-\alpha)^{n-i} = 1$
Now consider a general step-size sequence $\{\alpha_n(a)\}$. By stochastic approximation theory, the estimates converge to the true action values with probability 1 if
$$\sum_{n=1}^{\infty}\alpha_n(a)=\infty \quad \text{and} \quad \sum_{n=1}^{\infty}\alpha_n^2(a)<\infty.$$
This means (i) the steps are large enough to eventually overcome any initial conditions or random fluctuations, and (ii) the steps eventually become small enough to ensure convergence. The sample-average step size $\alpha_n(a)=\frac{1}{n}$ satisfies both conditions; a constant $\alpha$ violates the second, so its estimates never fully converge but keep tracking the most recent rewards (useful for nonstationary problems).
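A rough numerical illustration of the two conditions (my own snippet): for the sample-average step size $\alpha_n = 1/n$ the partial sums of $\alpha_n$ keep growing while those of $\alpha_n^2$ level off, whereas for a constant step size the second sum also grows without bound.
import numpy as np

n = np.arange(1, 100001)

alpha_sample = 1.0 / n                               # sample-average step sizes
print(alpha_sample.sum())                            # ~12.1 and still growing (condition i holds)
print((alpha_sample ** 2).sum())                     # ~1.64, approaching pi^2/6 (condition ii holds)

alpha_const = np.full_like(alpha_sample, 0.1)        # constant step size
print(alpha_const.sum(), (alpha_const ** 2).sum())   # both grow linearly with n, so condition ii fails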
Also, since $Q_{n+1}$ remains (to a shrinking degree) dependent on $Q_1(a)$, constant-step-size estimates are biased by the initial estimate.
For the general stationary case, we can use this bias to encourage exploration through optimistic initial values: set $Q_1(a)$ well above any plausible reward (here $Q_1 = 5$, $\epsilon = 0$) and compare against a realistic $\epsilon$-greedy method ($Q_1 = 0$, $\epsilon = 0.1$).
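To see why the optimistic start drives exploration even with a purely greedy policy, here is a small sketch (my own, simplified relative to the `Bandit` class above): every sampled reward "disappoints" the inflated estimate, pulling that arm below the others, so the greedy choice keeps rotating through the arms early on.
import numpy as np

rng = np.random.default_rng(0)
k = 10
q_true = rng.normal(size=k)          # true values, roughly within [-2, 2]
Q = np.full(k, 5.0)                  # optimistic initial estimates, Q_1 = 5
alpha = 0.1

first_actions = []
for t in range(20):
    a = int(np.argmax(Q))            # pure greedy, epsilon = 0
    r = rng.normal(q_true[a], 1.0)
    Q[a] += alpha * (r - Q[a])       # estimate drops toward q*(a), well below 5
    first_actions.append(a)

print(first_actions)                 # the first steps visit many different arms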
def figure_2_2(runs=1000, time=1000):
bandits = [Bandit(epsilon=0, initial_val=5, step_size=0.1),
Bandit(epsilon=0.1, initial_val=0, step_size=0.1)]
best_action_counts, _ = simulate(runs, time, bandits)
plt.figure(figsize=(9, 6))
    plt.plot(best_action_counts[0], label=r'$\epsilon = 0, Q_1 = 5$')
    plt.plot(best_action_counts[1], label=r'$\epsilon = 0.1, Q_1 = 0$')
plt.xlabel('Steps')
plt.ylabel('% optimal action')
plt.legend()
    plt.show()
figure_2_2()
Upper Confidence Bound (UCB)
Action selection:
$$A_t \doteq \arg\max_a\left[Q_t(a)+c\cdot\sqrt{\frac{\ln t}{N_t(a)}}\right]$$ where $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$;
$c>0$ controls the degree of exploration;
$\displaystyle \sqrt{\frac{\ln t}{N_t(a)}}$ measures the uncertainty (variance) in the estimate of $a$'s value: the more often $a$ has been selected, the smaller this term becomes, so that action receives a smaller exploration bonus.
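A small standalone computation of the UCB scores (my own made-up numbers, separate from the `Bandit` class above), showing how the bonus term favors rarely tried actions:
import numpy as np

Q = np.array([0.2, 0.5, 0.1])        # current estimates Q_t(a)
N = np.array([10, 50, 1])            # selection counts N_t(a)
t = N.sum()
c = 2.0

ucb_scores = Q + c * np.sqrt(np.log(t) / N)
print(ucb_scores)                    # the rarely tried third arm gets the largest bonus
print(int(np.argmax(ucb_scores)))    # so UCB selects it despite its low estimate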
def figure_2_3(runs=1000, time=1000):
bandits = [Bandit(epsilon=0, UCB_param=2, sample_averages=True),
Bandit(epsilon=0.1, sample_averages=True)]
_, average_rewards = simulate(runs, time, bandits)
plt.figure(figsize=(9, 6))
plt.plot(average_rewards[0], label='UCB $c = 2$')
    plt.plot(average_rewards[1], label=r'$\epsilon$-greedy, $\epsilon = 0.1$')
plt.xlabel('Steps')
plt.ylabel('Average reward')
plt.legend()
    plt.show()
figure_2_3()
Gradient Bandit Algorithms
Action probabilities are given by the soft-max distribution (i.e., the Gibbs or Boltzmann distribution):
$$\text{Pr}\{A_t=a\} \doteq \displaystyle \frac{e^{H_t(a)}}{\sum_{b=1}^k e^{H_t(b)}} \doteq \pi_t(a)$$
$H_t(a)\in\mathbb{R}$ is a preference for each action, initially $H_1(a)=0$ for all $a$. A natural learning algorithm for the soft-max action preferences is based on the idea of stochastic gradient ascent. The update is:
$$ H_{t+1}(A_t) \doteq H_t(A_t) + \alpha(R_t-\overline{R}_t)\left(1-\pi_t(A_t)\right) $$$$ H_{t+1}(a) \doteq H_t(a) - \alpha(R_t-\overline{R}_t)\pi_t(a), \quad \forall a \neq A_t $$ or, written as one rule for all $a$, $H_{t+1}(a) \doteq H_t(a) + \alpha(R_t-\overline{R}_t)\left(\mathbb{1}_{a=A_t}-\pi_t(a)\right)$, where $\alpha>0$ is the step size and $\overline{R}_t$ is the average of the rewards $R_1,\dots,R_{t-1}$, used as a baseline.
$\overline{R}_t$ acts as a baseline: if $R_t>\overline{R}_t$, the probability of taking $A_t$ in the future increases, and vice versa; the non-selected actions move in the opposite direction.
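A one-step numeric sketch of the preference update (names and numbers are my own): compute $\pi_t$ by soft-max, sample $A_t$, then move $H(A_t)$ up and all other preferences down because the reward beats the baseline.
import numpy as np

rng = np.random.default_rng(0)
H = np.zeros(4)                      # preferences H_t(a), initially zero
alpha = 0.1
baseline = 1.0                       # stand-in for the running average reward R-bar_t

pi = np.exp(H) / np.exp(H).sum()     # soft-max action probabilities pi_t(a)
a = rng.choice(len(H), p=pi)         # sample A_t
R = 2.0                              # observed reward, here above the baseline

one_hot = np.eye(len(H))[a]
H += alpha * (R - baseline) * (one_hot - pi)   # H(A_t) rises, the other preferences fall
print(a, H)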
Conclusion: the gradient bandit algorithm is a stochastic approximation to gradient ascent:
the expected update equals the gradient of $\mathbb{E}[R_t]$ with respect to the action preferences, which makes it an instance of stochastic gradient ascent and gives it robust convergence properties.
Proof:
$\begin{aligned} \frac{\partial \mathbb{E}\left[R_{t}\right]}{\partial H_{t}(a)} &=\frac{\partial}{\partial H_{t}(a)}\left[\sum_{b} \pi_{t}(b) q_{*}(b)\right] \\ &=\sum_{b} q_{*}(b) \frac{\partial \pi_{t}(b)}{\partial H_{t}(a)} \end{aligned}$
Now insert an arbitrary baseline $X_t$ that does not depend on $b$. Since $\sum_b \pi_t(b)=1$, we have $\displaystyle \sum_b\frac{\partial\pi_t(b)}{\partial H_t(a)} = 0$, so subtracting $X_t$ does not change the sum:
$\begin{aligned}\frac{\partial \mathbb{E}\left[R_{t}\right]}{\partial H_{t}(a)} & =\sum_{b}\left(q_{*}(b)-X_{t}\right) \frac{\partial \pi_{t}(b)}{\partial H_{t}(a)} \\ &=\sum_{b} \pi_{t}(b)\left(q_{*}(b)-X_{t}\right) \frac{\partial \pi_{t}(b)}{\partial H_{t}(a)} / \pi_{t}(b) \\ &=\mathbb{E}\left[\left(q_{*}\left(A_{t}\right)-X_{t}\right) \frac{\partial \pi_{t}\left(A_{t}\right)}{\partial H_{t}(a)} / \pi_{t}\left(A_{t}\right)\right] \end{aligned}$
Because $\mathbb{E}[R_t\mid A_t] = q_{*}(A_t)$, we may replace $q_{*}(A_t)$ by $R_t$ inside the expectation, and we choose the arbitrary baseline $X_t = \overline{R}_t$:
$\displaystyle \frac{\partial \mathbb{E}\left[R_{t}\right]}{\partial H_{t}(a)} = \mathbb{E}\left[\left(R_t-\overline{R}_t\right) \frac{\partial \pi_{t}\left(A_{t}\right)}{\partial H_{t}(a)} / \pi_{t}\left(A_{t}\right)\right] $
Next,
$\begin{aligned} \frac{\partial \pi_{t}(b)}{\partial H_{t}(a)} &=\frac{\partial}{\partial H_{t}(a)}\left[\frac{e^{H_{t}(b)}}{\sum_{c=1}^{k} e^{H_{t}(c)}}\right] \\ &=\frac{\frac{\partial e^{H_{t}(b)}}{\partial H_{t}(a)} \sum_{c=1}^{k} e^{H_{t}(c)}-e^{H_{t}(b)} \frac{\partial \sum_{c=1}^{k} e^{H_{t}(c)}}{\partial H_{t}(a)}}{\left(\sum_{c=1}^{k} e^{H_{t}(c)}\right)^{2}} \\ &=\frac{\mathbb{1}_{a=b} e^{H_{t}(b)} \sum_{c=1}^{k} e^{H_{t}(c)}-e^{H_{t}(b)} e^{H_{t}(a)}}{\left(\sum_{c=1}^{k} e^{H_{t}(c)}\right)^{2}} \\ &=\frac{\mathbb{1}_{a=b} e^{H_{t}(b)}}{\sum_{c=1}^{k} e^{H_{t}(c)}}-\frac{e^{H_{t}(b)} e^{H_{t}(a)}}{\left(\sum_{c=1}^{k} e^{H_{t}(c)}\right)^{2}} \\ &=\mathbb{1}_{a=b} \pi_{t}(b)-\pi_{t}(b)\pi_{t}(a) \\ &=\pi_{t}(b)\left(\mathbb{1}_{a=b}-\pi_{t}(a)\right) \end{aligned}$
Therefore,
$ \begin{aligned} \frac{\partial \mathbb{E}\left[R_{t}\right]}{\partial H_{t}(a)} &=\mathbb{E}\left[\left(R_{t}-\overline{R}_{t}\right) \frac{\partial \pi_{t}\left(A_{t}\right)}{\partial H_{t}(a)} / \pi_{t}\left(A_{t}\right)\right] \\ &=\mathbb{E}\left[\left(R_{t}-\overline{R}_{t}\right) \pi_{t}\left(A_{t}\right)\left(\mathbb{1}_{a=A_{t}}-\pi_{t}(a)\right) / \pi_{t}\left(A_{t}\right)\right] \\ &=\mathbb{E}\left[\left(R_{t}-\overline{R}_{t}\right)\left(\mathbb{1}_{a=A_{t}}-\pi_{t}(a)\right)\right] \end{aligned} $
which, after replacing the expectation by a single sample, is exactly the gradient-bandit update above; hence the algorithm performs stochastic gradient ascent on $\mathbb{E}[R_t]$.
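As a sanity check on the key soft-max derivative used above, a finite-difference comparison (my own snippet) of $\partial\pi_t(b)/\partial H_t(a)$ against $\pi_t(b)(\mathbb{1}_{a=b}-\pi_t(a))$:
import numpy as np

def softmax(H):
    e = np.exp(H - H.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=5)
pi = softmax(H)
a, eps = 2, 1e-6

# numerical derivative of every pi(b) with respect to H(a)
H_plus, H_minus = H.copy(), H.copy()
H_plus[a] += eps
H_minus[a] -= eps
numeric = (softmax(H_plus) - softmax(H_minus)) / (2 * eps)

# analytic form: pi(b) * (1{a == b} - pi(a))
analytic = pi * (np.eye(len(H))[a] - pi[a])

assert np.allclose(numeric, analytic, atol=1e-6)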
def figure_2_4(runs=1000, time=1000):
bandits = [Bandit(gradient=True, step_size=0.1, gradient_baseline=True, true_reward=4),
Bandit(gradient=True, step_size=0.1, gradient_baseline=False, true_reward=4),
Bandit(gradient=True, step_size=0.4, gradient_baseline=True, true_reward=4),
Bandit(gradient=True, step_size=0.4, gradient_baseline=False, true_reward=4)]
best_action_counts, _ = simulate(runs, time, bandits)
labels = [r'$\alpha = 0.1$, with baseline',
r'$\alpha = 0.1$, without baseline',
r'$\alpha = 0.4$, with baseline',
r'$\alpha = 0.4$, without baseline']
plt.figure(figsize=(9, 6))
for i in range(len(bandits)):
plt.plot(best_action_counts[i], label=labels[i])
plt.xlabel('Steps')
plt.ylabel('% Optimal action')
plt.legend()
    plt.show()
figure_2_4()
Summary
A parameter study of the various bandit algorithms presented in this chapter is shown below. Each point is the average reward obtained over 1000 steps with a particular algorithm at a particular setting of its parameter.
def figure_2_5(runs=1000, time=1000):
labels = ['epsilon-greedy', 'gradient bandit',
'UCB', 'optimistic initialization']
generators = [lambda epsilon: Bandit(epsilon=epsilon, sample_averages=True),
lambda alpha: Bandit(gradient=True, step_size=alpha, gradient_baseline=True),
lambda coef: Bandit(epsilon=0, UCB_param=coef, sample_averages=True),
lambda initial: Bandit(epsilon=0, initial_val=initial, step_size=0.1)]
    parameters = [np.arange(-7, -1, dtype=float),
                  np.arange(-5, 2, dtype=float),
                  np.arange(-4, 3, dtype=float),
                  np.arange(-2, 3, dtype=float)]
bandits = []
for generator, parameter in zip(generators, parameters):
for param in parameter:
bandits.append(generator(pow(2, param)))
_, average_rewards = simulate(runs, time, bandits)
rewards = np.mean(average_rewards, axis=1)
plt.figure(figsize=(9, 6))
i = 0
for label, parameter in zip(labels, parameters):
l = len(parameter)
plt.plot(parameter, rewards[i:i + l], label=label)
i += l
plt.xlabel('Parameter($2^x$)')
plt.ylabel('Average reward')
plt.legend()
    plt.show()
figure_2_5()
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[2] https://github.com/ShangtongZhang/reinforcement-learning-an-introduction