Reinforcement Learning: Solving MDPs with Dynamic Programming

Contents

Learning Objectives

Policy Evaluation

Policy Improvement

Policy Iteration

Value Iteration

Learning Objectives

1. Understand Policy Evaluation and Policy Improvement;

2. Understand the Policy Iteration algorithm;

3. Understand the Value Iteration algorithm;

4. Understand the differences between policy iteration and value iteration;

5. Understand the limitations of dynamic programming methods;

6. Implement policy iteration and value iteration for Gridworld in Python.

Dynamic Programming (DP) is a general method for solving complex problems: by defining problem states and the relations between them, it decomposes a complex problem into simpler subproblems that can be solved recursively (in a divide-and-conquer fashion). For DP to apply, the problem must (1) decompose into subproblems and (2) have subproblems that recur, so that their solutions can be reused. Markov decision processes satisfy both conditions: an MDP consists of transitions between states, the Bellman equation decomposes the overall problem into a recursive computation for each state, and the value function stores the results of these computations. Once we have the optimal value, and hence the optimal action, for every state, the whole MDP is solved. Note, however, that DP relies on full knowledge of the MDP's model (its transition probabilities and rewards), which is why it is called a model-based method.

Policy Evaluation

Policy evaluation is just what the name suggests: assessing how good a given policy is. Concretely, it means computing the state-value function $v_{\pi}(s)$ of an arbitrary policy $\pi$; this is also called prediction. The previous post derived the following expression for $v_{\pi}(s)$ from the backup diagram:

$$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\left(\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{\pi}(s')\right)$$

But how do we compute this when the values of the successor states $s'$ are themselves unknown? We solve it iteratively, with in-place updates in the spirit of Gauss–Seidel iteration: start from an arbitrary initial guess and repeatedly apply the update below. It can be shown that as $k$ tends to infinity the iterates converge to $v_{\pi}(s)$.

$$v_{k+1}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\left(\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{k}(s')\right)$$
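As a minimal sketch of this update (separate from the full Gridworld implementation further below), assume the model is given as plain NumPy arrays P[s, a, s'] for transition probabilities and R[s, a] for expected rewards; these array names are illustrative, not part of the original code:

import numpy as np

def policy_evaluation_sweeps(policy, P, R, gamma=1.0, theta=1e-5):
    """Iterative policy evaluation with in-place sweeps.

    policy: array [S, A], pi(a|s)
    P:      array [S, A, S], P[s, a, s'] = transition probability
    R:      array [S, A], expected immediate reward for taking a in s
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)          # arbitrary initial guess v_0
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = sum_a pi(a|s) * (R_s^a + gamma * sum_s' P_ss'^a * v_k(s'))
            v_new = np.sum(policy[s] * (R[s] + gamma * P[s] @ V))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new            # in-place: later states in this sweep see the new value
        if delta < theta:           # stop once the largest change falls below the tolerance
            return V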

Policy Improvement

Now that we know how to evaluate a policy, the next step is to find a better one. In any state we may wonder whether changing the policy would yield a higher return, i.e., picking some action $a\neq\pi(s)$ and then continuing to follow $\pi$ afterwards. The value of doing exactly that is the action-value function (recall the question raised in the previous post; this is a good answer to it):

$$q_{\pi}(s,a)=\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{\pi}(s')$$

We improve the policy greedily, choosing in every state the action that maximizes the action-value function:

$$\pi'(s)=\underset{a\in\mathcal{A}}{\operatorname{argmax}}\,q_{\pi}(s,a)$$

It can be shown that the improved policy $\pi'$ has a state-value function at least as large as that of $\pi$, i.e., $v_{\pi'}(s)\geqslant v_{\pi}(s)$ for every state; see the references for the proof.
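Continuing the array-based sketch from above (with the same assumed P[s, a, s'] and R[s, a] arrays), the greedy improvement step is a one-step lookahead followed by an argmax:

import numpy as np

def greedy_improvement(V, P, R, gamma=1.0):
    """Return a deterministic policy that is greedy with respect to V.

    q(s, a) = R_s^a + gamma * sum_s' P_ss'^a * V(s')
    pi'(s)  = argmax_a q(s, a)
    """
    n_states, n_actions = R.shape
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = R[s] + gamma * P[s] @ V       # action values q(s, .), shape [A]
        policy[s, np.argmax(q)] = 1.0     # put all probability on the greedy action
    return policy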

Policy Iteration

Having covered policy evaluation and policy improvement, policy iteration is simple: alternate the two steps repeatedly, and the process converges to the optimal policy.

The pseudocode just wraps these two steps in a loop that stops once the policy no longer changes.

Sutton's book gives a Gridworld example: on a 4x4 grid, the top-left and bottom-right cells are terminal states, every step gives a reward of -1, and the goal is to find an optimal policy.

Let us now implement policy iteration for this example in Python.

import numpy as np
from lib.envs.gridworld import GridworldEnv


def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the
    environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the
            environment. env.P[s][a] is a list of transition tuples
            (prob, next_state, reward, done). env.nS is the number of states
            in the environment. env.nA is the number of actions in the environment.
        theta: We stop evaluation once our value function change is less than
            theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start with an all-zero value function
    V = np.zeros(env.nS)
    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Calculate the expected value
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            # How much our value function changed (across any state)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
        # Stop evaluating once our value function change is below the threshold
        if delta < theta:
            break
    return np.array(V)


def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.

    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes 3 arguments:
            policy, env, discount_factor.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V).
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
    """

    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value for all actions in a given state.

        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, vector of length env.nS

        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A

    # Start with a random (uniform) policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    while True:
        # Evaluate the current policy
        V = policy_eval_fn(policy, env, discount_factor)

        # Will be set to False if we make any changes to the policy
        policy_stable = True

        # For each state...
        for s in range(env.nS):
            # The best action we would take under the current policy
            chosen_a = np.argmax(policy[s])

            # Find the best action by one-step lookahead
            # (ties are resolved arbitrarily)
            action_values = one_step_lookahead(s, V)
            best_a = np.argmax(action_values)

            # Greedily update the policy
            if chosen_a != best_a:
                policy_stable = False
            policy[s] = np.eye(env.nA)[best_a]

        # If the policy is stable we've found an optimal policy. Return it
        if policy_stable:
            return policy, V


env = GridworldEnv()

random_policy = np.ones([env.nS, env.nA]) / env.nA
v = policy_eval(random_policy, env)

policy, v = policy_improvement(env)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

print("Value Function:")
print(v)
print("")

print("Reshaped Grid Value Function:")
print(v.reshape(env.shape))
print("")

Running this code prints the policy distribution, the greedy action for each grid cell, and the value function. The resulting policy matches the optimal policy given in the book.

Value Iteration

Policy iteration has one drawback: every iteration contains a full policy evaluation, which is very time-consuming when the state space is large. Value iteration instead iterates the Bellman optimality equation directly; this is where it differs from policy iteration, and the update it applies is shown below.
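For comparison with the policy-evaluation update above, value iteration replaces the expectation over the policy with a maximum over actions; this is the standard Bellman optimality backup, written in the same notation as the earlier formulas:

$$v_{k+1}(s)=\max_{a\in\mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}_{ss'}^{a}\,v_{k}(s')\right)$$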

Value iteration therefore converges directly to the optimal value function, from which the optimal policy follows immediately by a greedy choice. Now look back at policy iteration: policy evaluation applies the Bellman expectation equation to compute the value function of the current policy, and policy improvement then picks, in every state, the action that maximizes that value function. But do we really have to wait for policy evaluation to converge to the exact value function? Not necessarily. We can set a tolerance and cut the evaluation short, using an approximate value function for the improvement step (the Gridworld example shows this: after roughly the third sweep the optimal policy is already determined). Waiting for evaluation to converge completely, as plain policy iteration does, is one extreme; the opposite extreme, stopping after a single evaluation sweep and improving greedily right away, amounts to iterating the Bellman optimality equation itself. In other words, value iteration can be seen as an extreme case of policy iteration.

In general, policy iteration converges in fewer iterations, so it is usually the better choice when the state space is small; when the state space is large, each sweep of value iteration is cheaper because it skips the inner evaluation loop.

As before, using the Gridworld example, here is a Python implementation of value iteration:

import numpy as np
from lib.envs.gridworld import GridworldEnv


def value_iteration(env, theta=0.0001, discount_factor=1.0):
    """
    Value Iteration Algorithm.

    Args:
        env: OpenAI env. env.P represents the transition probabilities of the
            environment. env.P[s][a] is a list of transition tuples
            (prob, next_state, reward, done). env.nS is the number of states
            in the environment. env.nA is the number of actions in the environment.
        theta: We stop evaluation once our value function change is less than
            theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V) of the optimal policy and the optimal value function.
    """

    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value for all actions in a given state.

        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, vector of length env.nS

        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A

    V = np.zeros(env.nS)
    while True:
        # Stopping condition
        delta = 0
        # Update each state...
        for s in range(env.nS):
            # Do a one-step lookahead to find the best action
            A = one_step_lookahead(s, V)
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            delta = max(delta, np.abs(best_action_value - V[s]))
            # Update the value function
            V[s] = best_action_value
        # Check if we can stop
        if delta < theta:
            break

    # Create a deterministic policy using the optimal value function
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One-step lookahead to find the best action for this state
        A = one_step_lookahead(s, V)
        best_action = np.argmax(A)
        # Always take the best action
        policy[s, best_action] = 1.0

    return policy, V


env = GridworldEnv()

policy, v = value_iteration(env)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

print("Value Function:")
print(v)
print("")

print("Reshaped Grid Value Function:")
print(v.reshape(env.shape))
print("")

The output is identical to the result obtained with policy iteration.

References

[1] Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 4: Dynamic Programming

[2] David Silver's RL Course, Lecture 3: Planning by Dynamic Programming (video, slides)

[3] Quora: quora.com/How-is-policy by Sergio Valcarcel Macua

[4] The difference between policy iteration and value iteration

[5] Open-source code on GitHub
