The MADDPG porting process

Notes on implementing MADDPG

For multi-agent interaction OpenAI published Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, a new way of combining actors and critics. The paper runs experiments in the multiagent-particle-envs (MPE) environment and compares against other methods; see the paper for the detailed results. This post takes the pytorch-maddpg implementation (from MADRL) as a reference and implements the algorithm in OpenAI's MPE environment, in order to understand the algorithm better and see how well it works. For an introduction to the algorithm and to pytorch-maddpg, see my earlier notes.

Code: yexme/maddpg-mpe (github.com)


The structure of MPE

core.py contains the base classes EntityState, AgentState, Action, Entity, Landmark, Agent and World. Each scenario builds a World from these classes, and a MultiAgentEnv environment is created from the chosen scenario. Actions interact with the environment by changing an agent's p_vel and p_pos, which happens through env.step.
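For reference, a minimal sketch of how an MPE environment is typically built from a scenario, following the make_env.py pattern in multiagent-particle-envs (the import paths assume the repo's standard layout):

import multiagent.scenarios as scenarios
from multiagent.environment import MultiAgentEnv

# load the scenario module and build the World it defines
scenario = scenarios.load("simple_tag.py").Scenario()
world = scenario.make_world()

# MultiAgentEnv wraps the world; the reset/reward/observation callbacks come from the scenario
env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)

obs_n = env.reset()    # one observation array per agent
# env.step(action_n) applies one action per agent, updates each agent's p_vel / p_pos,
# and returns (obs_n, reward_n, done_n, info_n)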

Running a scenario

The simple_tag scenario is used.

OpenAI's own results can be seen at sites.google.com/site/m. Three red agents cooperate to chase the green agent; after MADDPG training the red agents head straight for the green agent and follow it closely. The reward computation in this scenario:

def agent_reward(self, agent, world):
    # Agents are negatively rewarded if caught by adversaries
    rew = 0
    shape = False
    adversaries = self.adversaries(world)
    if shape:
        # reward can optionally be shaped (increased reward for increased distance from adversary)
        for adv in adversaries:
            rew += 0.1 * np.sqrt(np.sum(np.square(agent.state.p_pos - adv.state.p_pos)))
    if agent.collide:
        for a in adversaries:
            if self.is_collision(a, agent):
                rew -= 10
    return rew

def adversary_reward(self, agent, world):
    # Adversaries are rewarded for collisions with agents
    rew = 0
    shape = False
    agents = self.good_agents(world)
    adversaries = self.adversaries(world)
    if shape:
        # reward can optionally be shaped (decreased reward for increased distance from agents)
        for adv in adversaries:
            rew -= 0.1 * min([np.sqrt(np.sum(np.square(a.state.p_pos - adv.state.p_pos))) for a in agents])
    if agent.collide:
        for ag in agents:
            for adv in adversaries:
                if self.is_collision(ag, adv):
                    rew += 10
    return rew

The observation returned by the scenario is an array:

def observation(self, agent, world):
    # get positions of all entities in this agent's reference frame
    entity_pos = []
    for entity in world.landmarks:
        if not entity.boundary:
            entity_pos.append(entity.state.p_pos - agent.state.p_pos)
    # communication of all other agents
    comm = []
    other_pos = []
    other_vel = []
    for other in world.agents:
        if other is agent:
            continue
        comm.append(other.state.c)
        other_pos.append(other.state.p_pos - agent.state.p_pos)
        if not other.adversary:
            other_vel.append(other.state.p_vel)
    print([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel,
          np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel))
    return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel)
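Assuming the default simple_tag configuration (three adversaries, one good agent, two non-boundary landmarks), this concatenation accounts for the dimensions quoted below: an adversary observes its own velocity (2) and position (2), two landmark offsets (4), three other-agent offsets (6) and the good agent's velocity (2), giving 16 values in total; the good agent has no non-adversary neighbours, so other_vel is empty and its observation is 2 + 2 + 4 + 6 = 14.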

The porting process

The networks and the overall structure are unchanged; the environment is now created through MPE, which requires changes in the following places:

Observation dimensions

In pytorch-maddpg every agent has the same observation dimension, 213. In the MPE environment the dimensions differ between agents: in simple_tag, a good agent's observation has dimension 14 while an adversary's has dimension 16. Observation handling in the main function:

# pytorch-maddpg (original):
obs = world.reset()                        # returns an array
obs = np.stack(obs)
obs = th.from_numpy(obs).float()           # array -> tensor
obs = Variable(obs).type(FloatTensor)      # wrap the tensor in a Variable
action = maddpg.select_action(obs).data.cpu()

# MPE version:
obs = env.reset()                          # returns a list of arrays with different dimensions
obs = np.asarray(obs)
for i in range(len(obs)):                  # process each array separately, then keep them in the list
    if isinstance(obs[i], np.ndarray):
        obs[i] = th.from_numpy(obs[i]).float()
        obs[i] = Variable(obs[i]).type(FloatTensor)
action = maddpg.select_action(obs).data.cpu()
# since the obs passed to maddpg has changed, every place that handles obs must change as well

# obs after env.reset():
# obs        [array([...]),  -- 16
#             array([...]),  -- 16
#             array([...]),  -- 16
#             array([...])]  -- 14
# obs[0][0]  <class numpy.float64>
# obs        <class list>
# obs[0]     <class list>
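A quick way to confirm these per-agent dimensions is to inspect env.observation_space; a small sketch, assuming the make_env.py helper at the root of multiagent-particle-envs is importable:

from make_env import make_env   # helper script shipped at the root of multiagent-particle-envs

env = make_env('simple_tag')
# one Box space per agent; the shapes should match the dimensions above
print([space.shape for space in env.observation_space])
# expected: [(16,), (16,), (16,), (14,)]  -- three adversaries, then the good agent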

Action

The environment flags discrete versus continuous actions with the variable discrete_action_space and handles them with gym's Discrete and Box spaces; Box is used here. Box usage is covered in OpenAI's gym documentation: Box(2,) denotes a two-dimensional array, and the dimension is set by the environment's dim_p. When an action is applied to the environment, the collision force and acceleration are computed, handled differently depending on whether the agent collides with other agents, and the agent's state is updated accordingly.
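As a small illustration of the Box space itself (a sketch: in MPE the bounds come from agent.u_range and the shape from world.dim_p, so the concrete values used here are assumptions):

from gym import spaces

# continuous movement action: one force component per world dimension (dim_p = 2)
u_action_space = spaces.Box(low=-1.0, high=+1.0, shape=(2,))

a = u_action_space.sample()           # e.g. array([ 0.31, -0.77]) -- a 2-D continuous action
print(u_action_space.contains(a))     # True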

The structure of the output action:

bin/interactive.py --scenario simple_tag.py
u_action_space         Discrete(5)
c_action_space         Discrete(2)
total_action_space     [Discrete(5)]
total_action_space[0]  Discrete(5)
action_space           [Discrete(5)]

# with continuous actions:
action_space [Box(2,)]
action_space [Box(2,), Box(2,)]
action_space [Box(2,), Box(2,), Box(2,)]
action_space [Box(2,), Box(2,), Box(2,), Box(2,)]

main.py:
action = maddpg.select_action(obs).data.cpu()
# action  [torch.FloatTensor of size 2x2]   len() == 2

maddpg.py:
def select_action(self, state_batch):
    # actions  Variable containing: [torch.FloatTensor of size 2x2]
    actions = Variable(th.zeros(self.n_agents, self.n_actions))
    # inside the loop over agents i:
    sb = state_batch[i, :].detach()                    # Variable tensor, [torch.FloatTensor of size 213]
    act = self.actors[i](sb.unsqueeze(0)).squeeze()    # act  Variable containing: [torch.FloatTensor of size 2]
    return actions
# action  Variable containing: [torch.FloatTensor of size 2x2]

main.py:
# action.numpy():
# [[ 0.85404074  0.98276335]
#  [-0.12386087 -0.52516943]]
# action.numpy()  <class numpy.ndarray>  len = 2
# action          <class torch.FloatTensor>

As the trace shows, each agent's action is still a 2-dimensional vector (the 2x2 shape above comes from the original two-agent setup), so this part does not need to be changed.

Replay buffer

This is the hardest part to handle. First look at how the replay buffer is defined in pytorch-maddpg and how data is stored into it at run time:

Structure:

import random
from collections import namedtuple

Experience = namedtuple('Experience',
                        ('states', 'actions', 'next_states', 'rewards'))

class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Experience(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
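A minimal usage sketch of this buffer, where obs, action, next_obs, reward and batch_size are placeholders for one transition and the training batch size:

memory = ReplayMemory(capacity=1000000)
memory.push(obs, action, next_obs, reward)     # stored as one Experience namedtuple

if len(memory) > batch_size:
    transitions = memory.sample(batch_size)    # a list of batch_size Experience tuples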

What gets stored:

Experience(states=[torch.FloatTensor of size 2x213],
           actions=[torch.FloatTensor of size 2x2],
           next_states=[torch.FloatTensor of size 2x213],
           rewards=[torch.FloatTensor of size 2])

Sampling from it to train the critic:

batch = Experience(*zip(*transitions))
print(Experience(*zip(*transitions)), type(batch))
# type(batch)                               <class memory.Experience>
# len(batch)                                4
# batch[0]   states                         [torch.FloatTensor of size 2x213]
# batch[1]   actions                        [torch.FloatTensor of size 2x2]
# batch.states                              <class tuple>, len(batch.states) == 1000
# batch.states[0]                           <class torch.FloatTensor>
# batch.states[0][0]                        <class torch.FloatTensor>
# th.stack(batch.states)                    [torch.FloatTensor of size 1000x2x213]
# th.stack(batch.states).type(FloatTensor)  [torch.FloatTensor of size 1000x2x213]

state_batch = Variable(th.stack(batch.states).type(FloatTensor))
# state_batch                               Variable containing: [torch.FloatTensor of size 1000x2x213]

whole_state = state_batch.view(self.batch_size, -1)
# whole_state                               Variable containing: [torch.FloatTensor of size 1000x426]  (426 = 2 * 213)
# type(whole_state)                         <class torch.autograd.variable.Variable>
# whole_state[0]                            [torch.FloatTensor of size 426]
# len(whole_state[1])                       426
# len(whole_state[1][1])                    1
# whole_state[1][425]                       Variable containing: 2  [torch.FloatTensor of size 1]
# batch.next_states[0]                      <class torch.FloatTensor>, len 2
# whole_action                              [torch.FloatTensor of size 100x8]  [torch.FloatTensor of size 1000x4]

current_Q = self.critics[agent](whole_state, whole_action)
# all of the processing above is just to build the input of the critic network
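For orientation: current_Q is the centralised critic's estimate Q_i(x, a_1, ..., a_N) for the sampled joint state and joint action. Following the MADDPG paper, it is regressed towards the target y = r_i + gamma * Q'_i(x', a'_1, ..., a'_N), where the primed quantities come from the target critic and the target actors evaluated on the next joint state, and the critic loss is the mean squared error between current_Q and y.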

The replay buffer when using MPE:

Experience(states=array([Variable containing: [torch.FloatTensor of size 16],
                         Variable containing: [torch.FloatTensor of size 16],
                         Variable containing: [torch.FloatTensor of size 16],
                         Variable containing: [torch.FloatTensor of size 14]], dtype=object),
           actions=[torch.FloatTensor of size 4x2],
           next_states=array([Variable containing: [torch.FloatTensor of size 16],
                              Variable containing: [torch.FloatTensor of size 16],
                              Variable containing: [torch.FloatTensor of size 16],
                              Variable containing: [torch.FloatTensor of size 14]], dtype=object),
           rewards=[torch.FloatTensor of size 4])

Sampling:

# batch.states             <class tuple>, len(batch.states) == 1000
# batch.states[0]          <class list>, len(batch.states[0]) == 4:
#                          [Variable containing: [torch.FloatTensor of size 16],
#                           Variable containing: [torch.FloatTensor of size 16],
#                           Variable containing: [torch.FloatTensor of size 16],
#                           Variable containing: [torch.FloatTensor of size 14]]
# batch.states[0][0]       <class torch.autograd.variable.Variable>, len(batch.states[0][0]) == 16
# batch.states[0][0][0]    <class torch.autograd.variable.Variable>
# whole_state              [torch.FloatTensor of size 100x62]

It is easy to see that the problem is again the inconsistent observation dimensions across agents inside each state. They have to be processed separately; the main work is building the input for the critic, as follows:

# flatten the 4 per-agent observations of every sampled state into one joint vector
whole_list = []
for i in range(len(batch.states)):
    n_list = []
    for j in range(4):
        for k in range(len(batch.states[i][j])):
            n_list.append(batch.states[i][j][k].data.numpy())
    n_array = np.asarray(n_list)
    n_tensor = th.from_numpy(n_array).float()
    n_variable = Variable(n_tensor).type(FloatTensor)
    whole_list.append(n_variable.data.numpy())
whole_array = np.asarray(whole_list)
whole_tensor = th.from_numpy(whole_array).float()

# the same flattening for the non-final next states
next_whole_list = []
for i in range(len(batch.states)):
    next_list = []
    if batch.next_states[i] is not None:
        for j in range(4):
            for k in range(len(batch.next_states[i][j])):
                next_list.append(batch.next_states[i][j][k].data.numpy())
        next_array = np.asarray(next_list)
        next_tensor = th.from_numpy(next_array).float()
        next_variable = Variable(next_tensor).type(FloatTensor)
        next_whole_list.append(th.t(next_variable).data.numpy())
next_whole_array = np.asarray(next_whole_list)
next_whole_tensor = th.from_numpy(next_whole_array).float()

state_batch = Variable(th.stack(whole_tensor).type(FloatTensor))
action_batch = Variable(th.stack(batch.actions).type(FloatTensor))
reward_batch = Variable(th.stack(batch.rewards).type(FloatTensor))
non_final_next_states = Variable(th.stack(next_whole_tensor).type(FloatTensor))

whole_state = state_batch.view(self.batch_size, -1)
whole_action = action_batch.view(self.batch_size, -1)
self.critic_optimizer[agent].zero_grad()

# next actions come from each agent's target actor, fed with that agent's own next observation
for a in range(self.n_agents):
    batch_obs = []
    for i in range(len(batch.next_states)):
        if batch.next_states[i] is not None:
            batch_obs.append(batch.next_states[i][a].data.numpy())
    batch_obs = np.asarray(batch_obs)
    batch_obs = th.from_numpy(batch_obs).float()
    batch_obs = Variable(batch_obs).type(FloatTensor)
    non_final_next_actions.append(self.actors_target[a](batch_obs))

# the target critic takes the flattened 62-dim joint next state and the joint next actions
target_Q = Variable(th.zeros(self.batch_size).type(FloatTensor))
target_Q[non_final_mask] = self.critics_target[agent](
    non_final_next_states.view(-1, 62),
    non_final_next_actions.view(-1, self.n_agents * self.n_actions))

# the actor update needs agent i's own observations from the sampled states
state_i = []
for i in range(len(state_batch)):
    state_i.append(batch.states[i][agent].data.numpy())
state_i = np.asarray(state_i)
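The same flattening can also be written more compactly; the following sketch (not the code used in the port) builds the 62-dimensional joint states directly with torch operations:

# batch.states[i] is a list of four per-agent Variables of sizes 16, 16, 16 and 14;
# concatenating them yields the joint observation the centralised critic expects
whole_state = th.stack([th.cat([s.view(-1) for s in batch.states[i]])
                        for i in range(len(batch.states))])
# whole_state: Variable containing [torch.FloatTensor of size batch_size x 62]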

Results:

