Reinforcement Learning: Solving MDPs with Dynamic Programming

Contents

Learning Objectives

Policy Evaluation

Policy Improvement

Policy Iteration

Value Iteration

Learning Objectives

1. Understand Policy Evaluation and Policy Improvement;

2. Understand the Policy Iteration algorithm;

3. Understand the Value Iteration algorithm;

4. Understand the differences between policy iteration and value iteration;

5. Understand the limitations of dynamic programming methods;

6. Implement policy iteration and value iteration for Gridworld in Python.

Dynamic Programming (DP) is a general method for solving complex problems: by defining problem states and the relations between them, it decomposes a complex problem into simpler subproblems that can be solved recursively (in a divide-and-conquer fashion). For DP to apply, the problem must (1) decompose into subproblems and (2) have subproblems that recur, so that their solutions can be reused. Markov decision processes satisfy both conditions: an MDP consists of transitions between states, the Bellman equation decomposes the overall problem into a recursive computation for each state, and the value function stores the results of these computations. Once we have the optimal value, and hence the optimal action, for every state, the whole MDP is solved. Note, however, that DP relies on full knowledge of the MDP's model (its transition probabilities and rewards), which is why it is called a model-based method.

Policy Evaluation

Policy evaluation is just what the name suggests: assessing how good a given policy is. Concretely, it means computing the state-value function $v_{\pi}(s)$ of an arbitrary policy $\pi$; this is also called prediction. The previous post derived the following expression for $v_{\pi}(s)$ from the backup diagram:

$$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\left(\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{\pi}(s')\right)$$

But how do we compute this when the values of the successor states $s'$ are themselves unknown? We solve it iteratively, with in-place updates in the spirit of Gauss–Seidel iteration: start from an arbitrary initial guess and repeatedly apply the update below. It can be shown that as $k$ tends to infinity the iterates converge to $v_{\pi}(s)$.

$$v_{k+1}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\left(\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{k}(s')\right)$$
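As a minimal sketch of this update (separate from the full Gridworld implementation further below), assume the model is given as plain NumPy arrays P[s, a, s'] for transition probabilities and R[s, a] for expected rewards; these array names are illustrative, not part of the original code:

import numpy as np

def policy_evaluation_sweeps(policy, P, R, gamma=1.0, theta=1e-5):
    """Iterative policy evaluation with in-place sweeps.

    policy: array [S, A], pi(a|s)
    P:      array [S, A, S], P[s, a, s'] = transition probability
    R:      array [S, A], expected immediate reward for taking a in s
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)          # arbitrary initial guess v_0
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = sum_a pi(a|s) * (R_s^a + gamma * sum_s' P_ss'^a * v_k(s'))
            v_new = np.sum(policy[s] * (R[s] + gamma * P[s] @ V))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new            # in-place: later states in this sweep see the new value
        if delta < theta:           # stop once the largest change falls below the tolerance
            return V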

Policy Improvement

Now that we know how to evaluate a policy, the next step is to find a better one. In any state we may wonder whether changing the policy would yield a higher return, i.e., picking some action $a\neq\pi(s)$ and then continuing to follow $\pi$ afterwards. The value of doing exactly that is the action-value function (recall the question raised in the previous post; this is a good answer to it):

$$q_{\pi}(s,a)=\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{\pi}(s')$$

We improve the policy greedily, choosing in every state the action that maximizes the action-value function:

$$\pi'(s)=\underset{a\in\mathcal{A}}{\operatorname{argmax}}\,q_{\pi}(s,a)$$

It can be shown that the improved policy $\pi'$ has a state-value function at least as large as that of $\pi$, i.e., $v_{\pi'}(s)\geqslant v_{\pi}(s)$ for every state; see the references for the proof.
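Continuing the array-based sketch from above (with the same assumed P[s, a, s'] and R[s, a] arrays), the greedy improvement step is a one-step lookahead followed by an argmax:

import numpy as np

def greedy_improvement(V, P, R, gamma=1.0):
    """Return a deterministic policy that is greedy with respect to V.

    q(s, a) = R_s^a + gamma * sum_s' P_ss'^a * V(s')
    pi'(s)  = argmax_a q(s, a)
    """
    n_states, n_actions = R.shape
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = R[s] + gamma * P[s] @ V       # action values q(s, .), shape [A]
        policy[s, np.argmax(q)] = 1.0     # put all probability on the greedy action
    return policy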

Policy Iteration

Having covered policy evaluation and policy improvement, policy iteration is simple: alternate the two steps repeatedly, and the process converges to the optimal policy.

The pseudocode just wraps these two steps in a loop that stops once the policy no longer changes.

Sutton's book gives a Gridworld example: on a 4x4 grid, the top-left and bottom-right cells are terminal states, every step gives a reward of -1, and the goal is to find an optimal policy.

Let us now implement policy iteration for this example in Python.

import numpy as np
from lib.envs.gridworld import GridworldEnv


def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the
    environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the
            environment. env.P[s][a] is a list of transition tuples
            (prob, next_state, reward, done). env.nS is the number of states
            in the environment. env.nA is the number of actions in the environment.
        theta: We stop evaluation once our value function change is less than
            theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start with an all-zero value function
    V = np.zeros(env.nS)
    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Calculate the expected value
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            # How much our value function changed (across any state)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
        # Stop evaluating once our value function change is below the threshold
        if delta < theta:
            break
    return np.array(V)


def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.

    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes 3 arguments:
            policy, env, discount_factor.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V).
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
    """

    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value for all actions in a given state.

        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, vector of length env.nS

        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A

    # Start with a random (uniform) policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    while True:
        # Evaluate the current policy
        V = policy_eval_fn(policy, env, discount_factor)

        # Will be set to False if we make any changes to the policy
        policy_stable = True

        # For each state...
        for s in range(env.nS):
            # The best action we would take under the current policy
            chosen_a = np.argmax(policy[s])

            # Find the best action by one-step lookahead
            # (ties are resolved arbitrarily)
            action_values = one_step_lookahead(s, V)
            best_a = np.argmax(action_values)

            # Greedily update the policy
            if chosen_a != best_a:
                policy_stable = False
            policy[s] = np.eye(env.nA)[best_a]

        # If the policy is stable we've found an optimal policy. Return it
        if policy_stable:
            return policy, V


env = GridworldEnv()

random_policy = np.ones([env.nS, env.nA]) / env.nA
v = policy_eval(random_policy, env)

policy, v = policy_improvement(env)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

print("Value Function:")
print(v)
print("")

print("Reshaped Grid Value Function:")
print(v.reshape(env.shape))
print("")

Running this code prints the policy distribution, the greedy action for each grid cell, and the value function. The resulting policy matches the optimal policy given in the book.

Value Iteration

Policy iteration has one drawback: every iteration contains a full policy evaluation, which is very time-consuming when the state space is large. Value iteration instead iterates the Bellman optimality equation directly; this is where it differs from policy iteration, and the update it applies is shown below.
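For comparison with the policy-evaluation update above, value iteration replaces the expectation over the policy with a maximum over actions; this is the standard Bellman optimality backup, written in the same notation as the earlier formulas:

$$v_{k+1}(s)=\max_{a\in\mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{k}(s')\right)$$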

Value iteration therefore converges directly to the optimal value function, from which the optimal policy follows immediately by a greedy choice. Now look back at policy iteration: policy evaluation applies the Bellman expectation equation to compute the value function of the current policy, and policy improvement then picks, in every state, the action that maximizes that value function. But do we really have to wait for policy evaluation to converge to the exact value function? Not necessarily. We can set a tolerance and cut the evaluation short, using an approximate value function for the improvement step (the Gridworld example shows this: after roughly the third sweep the optimal policy is already determined). Waiting for evaluation to converge completely, as plain policy iteration does, is one extreme; the opposite extreme, stopping after a single evaluation sweep and improving greedily right away, amounts to iterating the Bellman optimality equation itself. In other words, value iteration can be seen as an extreme case of policy iteration.

In general, policy iteration converges in fewer iterations, so it is usually the better choice when the state space is small; when the state space is large, each sweep of value iteration is cheaper because it skips the inner evaluation loop.

As before, using the Gridworld example, here is a Python implementation of value iteration:

import numpy as np
from lib.envs.gridworld import GridworldEnv


def value_iteration(env, theta=0.0001, discount_factor=1.0):
    """
    Value Iteration Algorithm.

    Args:
        env: OpenAI env. env.P represents the transition probabilities of the
            environment. env.P[s][a] is a list of transition tuples
            (prob, next_state, reward, done). env.nS is the number of states
            in the environment. env.nA is the number of actions in the environment.
        theta: We stop evaluation once our value function change is less than
            theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V) of the optimal policy and the optimal value function.
    """

    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value for all actions in a given state.

        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, vector of length env.nS

        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A

    V = np.zeros(env.nS)
    while True:
        # Stopping condition
        delta = 0
        # Update each state...
        for s in range(env.nS):
            # Do a one-step lookahead to find the best action
            A = one_step_lookahead(s, V)
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            delta = max(delta, np.abs(best_action_value - V[s]))
            # Update the value function
            V[s] = best_action_value
        # Check if we can stop
        if delta < theta:
            break

    # Create a deterministic policy using the optimal value function
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One-step lookahead to find the best action for this state
        A = one_step_lookahead(s, V)
        best_action = np.argmax(A)
        # Always take the best action
        policy[s, best_action] = 1.0

    return policy, V


env = GridworldEnv()

policy, v = value_iteration(env)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

print("Value Function:")
print(v)
print("")

print("Reshaped Grid Value Function:")
print(v.reshape(env.shape))
print("")

The output is identical to the result obtained with policy iteration.

References

[1] Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 4: Dynamic Programming

[2] David Silver's RL Course, Lecture 3: Planning by Dynamic Programming (video, slides)

[3] Quora: quora.com/How-is-policy by Sergio Valcarcel Macua

[4] The difference between policy iteration and value iteration

[5] Open-source code on GitHub
