The Simplest Introduction to Reinforcement Learning with TensorFlow, Part 1.5: Contextual Bandits
This article is a translation of Simple Reinforcement Learning with Tensorflow Part 1.5: Contextual Bandits by Arthur Juliani (link to the original).
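In the previous part of this series, we introduced some basic concepts of reinforcement learning and showed how to build an agent that solves the multi-armed bandit problem. The multi-armed bandit can be viewed as a special case of reinforcement learning: there is no state, and the agent only has to choose the action that earns the largest reward. Because there is no state, the action that is best at one moment is the best action at every moment. Part 2 of the series presents the full reinforcement learning problem, which includes environment states and delayed rewards.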
# Note: tensorflow.contrib exists only in TensorFlow 1.x.
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np


class contextual_bandit():
    def __init__(self):
        self.state = 0
        # List out our bandits. Currently arms 4, 2, and 1 (respectively) are optimal.
        self.bandits = np.array([[0.2,0,-0.0,-5],[0.1,-5,1,0.25],[-5,5,5,5]])
        self.num_bandits = self.bandits.shape[0]
        self.num_actions = self.bandits.shape[1]

    def getBandit(self):
        # Returns a random state for each episode.
        self.state = np.random.randint(0,len(self.bandits))
        return self.state

    def pullArm(self,action):
        # Get a random number.
        bandit = self.bandits[self.state,action]
        result = np.random.randn(1)
        if result > bandit:
            # Return a positive reward.
            return 1
        else:
            # Return a negative reward.
            return -1
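Because pullArm draws a standard normal sample and pays +1 only when the sample exceeds the arm's value, each arm's chance of success has a closed form: 1 - Φ(value), where Φ is the standard normal CDF. The sketch below is not from the original article. It uses only the standard library and NumPy to check the comment about which arm is best in each state. The helper names `normal_cdf` and `expected_rewards` are illustrative.

```python
import math

import numpy as np

# Same arm values as contextual_bandit.bandits above.
BANDITS = np.array([[0.2, 0, -0.0, -5],
                    [0.1, -5, 1, 0.25],
                    [-5, 5, 5, 5]])


def normal_cdf(x):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_rewards(bandits):
    """Expected reward of every (state, arm) pair.

    pullArm pays +1 with probability P(Z > value) = 1 - cdf(value)
    and -1 otherwise, so E[reward] = 1 - 2 * cdf(value).
    """
    cdf = np.vectorize(normal_cdf)(bandits)
    return 1.0 - 2.0 * cdf


if __name__ == "__main__":
    rewards = expected_rewards(BANDITS)
    for state, row in enumerate(rewards):
        best = int(np.argmax(row))
        print("bandit", state + 1, "best arm", best + 1,
              "expected reward", round(float(row[best]), 3))
```

The lower an arm's value, the higher its expected reward, which is why the final evaluation in the training script compares against np.argmin of each row.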
class agent():
    def __init__(self, lr, s_size,a_size):
        # These lines establish the feed-forward part of the network. The agent takes a state and produces an action.
        self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
        state_in_OH = slim.one_hot_encoding(self.state_in,s_size)
        output = slim.fully_connected(state_in_OH,a_size,
            biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())
        self.output = tf.reshape(output,[-1])
        self.chosen_action = tf.argmax(self.output,0)

        # The next six lines establish the training procedure. We feed the reward and chosen action into the network
        # to compute the loss, and use it to update the network.
        self.reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
        self.responsible_weight = tf.slice(self.output,self.action_holder,[1])
        self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
        self.update = optimizer.minimize(self.loss)

tf.reset_default_graph() # Clear the TensorFlow graph.
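The agent above has a single bias-free layer fed with a one-hot state, so the output for a given state and action is just sigmoid(W[state, action]). The loss -log(output) * reward then has the gradient -reward * (1 - sigmoid(w)) with respect to that one weight. The following NumPy sketch works through a single gradient-descent step by hand. It illustrates the arithmetic and does not replace the TensorFlow code. The names `sigmoid` and `policy_update` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def policy_update(weights, state, action, reward, lr):
    """Apply one gradient-descent step to weights[state, action] in place.

    loss = -log(sigmoid(w)) * reward, so d(loss)/dw = -reward * (1 - sigmoid(w)).
    """
    w = weights[state, action]
    grad = -reward * (1.0 - sigmoid(w))
    weights[state, action] = w - lr * grad
    return weights


if __name__ == "__main__":
    W = np.ones((3, 4))  # same all-ones start as weights_initializer above
    policy_update(W, state=0, action=3, reward=1, lr=0.001)   # weight goes up
    policy_update(W, state=0, action=0, reward=-1, lr=0.001)  # weight goes down
    print(W[0])
```

A reward of +1 raises the weight of the chosen action for that state, and a reward of -1 lowers it. No other weight is touched.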
cBandit = contextual_bandit() # Load the bandits.
myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) # Load the agent.
weights = tf.trainable_variables()[0] # The weights we will evaluate to look into the network.

total_episodes = 10000 # Set total number of episodes to train agent on.
total_reward = np.zeros([cBandit.num_bandits,cBandit.num_actions]) # Set scoreboard for bandits to 0.
e = 0.1 # Set the chance of taking a random action.

init = tf.global_variables_initializer()

# Launch the TensorFlow graph.
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
        s = cBandit.getBandit() # Get a state from the environment.

        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(cBandit.num_actions)
        else:
            action = sess.run(myAgent.chosen_action,feed_dict={myAgent.state_in:[s]})

        reward = cBandit.pullArm(action) # Get our reward for taking an action given a bandit.

        # Update the network.
        feed_dict={myAgent.reward_holder:[reward],myAgent.action_holder:[action],myAgent.state_in:[s]}
        _,ww = sess.run([myAgent.update,weights], feed_dict=feed_dict)

        # Update our running tally of scores.
        total_reward[s,action] += reward
        if i % 500 == 0:
            print("Mean reward for each of the " + str(cBandit.num_bandits) + " bandits: " + str(np.mean(total_reward,axis=1)))
        i+=1
for a in range(cBandit.num_bandits):
    print("The agent thinks action " + str(np.argmax(ww[a])+1) + " for bandit " + str(a+1) + " is the most promising....")
    if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):
        print("...and it was right!")
    else:
        print("...and it was wrong!")
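The script above depends on tf.contrib, which is not available in TensorFlow 2.x. For readers who only want to watch the learning dynamics, here is a rough NumPy-only sketch of the same epsilon-greedy loop, assuming the same arm values, learning rate, and exploration rate. It is a re-implementation written for illustration, not the author's code. The function name `train` and the `seed` parameter are assumptions, and the learned weights will vary with the seed.

```python
import numpy as np

# Same arm values as contextual_bandit.bandits above.
BANDITS = np.array([[0.2, 0, -0.0, -5],
                    [0.1, -5, 1, 0.25],
                    [-5, 5, 5, 5]])


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train(bandits, total_episodes=10000, lr=0.001, epsilon=0.1, seed=0):
    """Epsilon-greedy training with the same loss as the agent above."""
    rng = np.random.RandomState(seed)
    num_states, num_actions = bandits.shape
    weights = np.ones((num_states, num_actions))  # all-ones initial weights
    for _ in range(total_episodes):
        state = rng.randint(num_states)
        # Explore with probability epsilon, otherwise act greedily.
        if rng.rand() < epsilon:
            action = rng.randint(num_actions)
        else:
            action = int(np.argmax(sigmoid(weights[state])))
        # Same reward rule as pullArm.
        reward = 1 if rng.randn() > bandits[state, action] else -1
        # Gradient-descent step on loss = -log(sigmoid(w)) * reward.
        w = weights[state, action]
        weights[state, action] = w + lr * reward * (1.0 - sigmoid(w))
    return weights


if __name__ == "__main__":
    learned = train(BANDITS)
    for state in range(BANDITS.shape[0]):
        guess = int(np.argmax(learned[state]))
        best = int(np.argmin(BANDITS[state]))
        print("bandit", state + 1, "agent picks arm", guess + 1,
              "best arm is", best + 1)
```

Keeping one row of weights per state is what makes this a contextual bandit: the agent can prefer a different arm in each state, which a single stateless weight vector cannot do.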