A Step-by-Step Guide to Image Captioning with TensorFlow | Tutorial + Code

02-11

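Describing what is in a picture is a skill we humans pick up in kindergarten. After years of effort, machines can finally produce the simplest descriptions of images too.

O'Reilly and the TensorFlow team have jointly published a tutorial that explains in detail how to train an image caption generator on the Flickr30k dataset, building on Google's Show and Tell model. The model is built, trained, and tested entirely in TensorFlow.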

On to the tutorial:

(Compiled and edited by Wang Xinmin | Produced by QbitAI, WeChat public account QbitAI)

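Prerequisites

Install TensorFlow;
Install the pandas, OpenCV (opencv2), and Jupyter libraries;
Download the image embeddings and image captions for the Flickr30k dataset.

The README of the tutorial's GitHub repository (mlberkeley/oreilly-captions) has download links for the libraries, the image embeddings, and the image captions.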

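The image caption generation model

△ Schematic of the image caption generation network.

The network takes an image of a horse as input and, through a deep convolutional neural network (Deep CNN) and a language-generating recurrent neural network (RNN), learns to produce a caption for it.

This is the architecture we are going to train. The deep CNN encodes each input image as a 4,096-dimensional vector, and the RNN language model decodes that vector into a description of the image.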

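Caption generation as an extension of image classification

Image classification is a classic computer vision task with many strong, well-established models. A classifier recognizes the objects in an image by piecing together visual information about the shapes and objects present in it.

Machine learning models for other computer vision tasks, such as object detection and image segmentation, go further: they not only recognize what information is present, but also learn to interpret the 2D spatial structure and combine the two to determine where an object is located in the image. For caption generation, this raises two questions:

1. How can we build on successful image classification models to extract the important information from an image?

2. How can our model, having understood the image, combine that information to generate a caption?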

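Leveraging transfer learning

We can use an existing model to help extract information from images. Transfer learning lets us take the data transformations learned by a neural network trained on a different task and apply them to our own data.

In our case, the VGG-16 image classification model takes 224×224-pixel input images and produces a 4,096-dimensional feature vector, which feeds into fully connected layers that perform the classification.

We can use the feature extraction layers of VGG-16 as a building block of our caption generation network. For this article, we have abstracted away the VGG-16 feature extractor and load precomputed 4,096-dimensional features instead, which removes the image feature extraction step from training and speeds up training of the overall network.

The code for loading the VGG features and the image captions is relatively simple: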

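import numpy as np
import pandas as pd

def get_data(annotation_path, feature_path):
    # The annotation file is tab-separated: image file name, then caption
    annotations = pd.read_table(annotation_path, sep='\t', header=None, names=['image', 'caption'])
    # Memory-map the precomputed features (mmap_mode='r') instead of loading them all at once
    return np.load(feature_path, 'r'), annotations['caption'].values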

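Understanding captions

Now that we have a representation of the image, we need the model to learn to decode that representation into an understandable caption.

Because text is sequential, we use an RNN with LSTM cells, trained to predict the next word of a caption given the preceding words and the image representation.

Long short-term memory (LSTM) cells let the model better pick out the key information in the sequence of caption words, selectively remembering some things and forgetting what is not useful. TensorFlow provides a wrapper function that generates an LSTM layer for a given input and output dimension.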
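
To make the "selectively remember and forget" behaviour concrete, here is a minimal NumPy sketch of a single step of a standard LSTM cell. It is not the TensorFlow wrapper the tutorial uses: the helper name `lstm_step`, the weight layout, the random weights, and the hidden size of 256 are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell (hypothetical helper, illustration only).

    x:      (batch, dim_in)      input at this time step
    h_prev: (batch, dim_hidden)  previous output
    c_prev: (batch, dim_hidden)  previous memory cell
    W:      (dim_in + dim_hidden, 4 * dim_hidden), b: (4 * dim_hidden,)
    """
    z = np.concatenate([x, h_prev], axis=1) @ W + b
    i, f, o, g = np.split(z, 4, axis=1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates in (0, 1)
    g = np.tanh(g)                                # candidate memory content
    c = f * c_prev + i * g   # forget part of the old memory, write part of the new
    h = o * np.tanh(c)       # expose part of the memory as output
    return h, c

rng = np.random.default_rng(0)
batch, dim_in, dim_hidden = 2, 256, 256  # sizes are assumptions
W = rng.normal(scale=0.1, size=(dim_in + dim_hidden, 4 * dim_hidden))
b = np.zeros(4 * dim_hidden)
x = rng.normal(size=(batch, dim_in))
h0 = np.zeros((batch, dim_hidden))
c0 = np.zeros((batch, dim_hidden))

h1, c1 = lstm_step(x, h0, c0, W, b)
print(h1.shape, c1.shape)
```

The forget gate `f` scales the old memory and the input gate `i` scales the new candidate, which is the mechanism behind "remembering" and "forgetting".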

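To turn words into fixed-length representations suitable as LSTM input, we use an embedding layer that learns to map each word to a 256-dimensional feature vector, a step known as word embedding. Word embeddings represent words as vectors, such that words with similar vectors are semantically similar.

In the VGG-16 image classifier, the 4,096-dimensional representation is passed through a final softmax layer for classification. Because the LSTM cells take 256-dimensional text features as input, we need to convert the image representation into the representation used for the caption sequence. To do so, we add another embedding layer that maps the 4,096-dimensional image features into the same 256-dimensional vector space as the text features.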
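
The two embeddings can be sketched in NumPy as follows. This is only an illustration of the shapes involved: the parameters are random rather than learned, and the vocabulary size, batch size, and word indices are made-up values. The variable names mirror those used in the tutorial's model code.

```python
import numpy as np

rng = np.random.default_rng(0)
# dim_image and dim_embed come from the text; n_words and batch are assumptions
dim_image, dim_embed, n_words, batch = 4096, 256, 1000, 2

# Randomly initialised parameters; in the real model these are learned
img_embedding = rng.normal(scale=0.01, size=(dim_image, dim_embed))
img_embedding_bias = np.zeros(dim_embed)
word_embedding = rng.normal(scale=0.01, size=(n_words, dim_embed))

image_features = rng.normal(size=(batch, dim_image))  # stand-in for VGG-16 features
word_ids = np.array([3, 42])                          # stand-in word indices

# Linear projection: 4,096-d image features -> 256-d LSTM input
image_input = image_features @ img_embedding + img_embedding_bias
# Table lookup: word index -> 256-d LSTM input
word_input = word_embedding[word_ids]

print(image_input.shape, word_input.shape)  # (2, 256) (2, 256)
```

Because both inputs end up in the same 256-dimensional space, the LSTM can consume the image at the first step and words at every later step.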

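Building and training the model

The figure below shows how the image captioning model works:

In the figure, {s0, s1, …, sN} are the words of the caption we are trying to predict, and {Wes0, Wes1, …, WesN-1} are the word embedding vectors of each word. The LSTM outputs {p1, p2, …, pN} are the probability distributions the model generates for the next word, given the words so far. The model is trained to maximize the sum of the log probabilities of the correct words, or equivalently to minimize the negative of that sum.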
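
As a rough sketch of that objective, the NumPy function below computes the average negative log-probability of the correct words, ignoring padded positions via a 0/1 mask. The function name `masked_caption_loss` and the toy inputs are assumptions for illustration; the tutorial's own TensorFlow version additionally skips the first LSTM step, whose input is the image rather than a word.

```python
import numpy as np

def masked_caption_loss(logits, targets, mask):
    """Average negative log-likelihood over the real (unmasked) words.

    logits:  (batch, steps, n_words) unnormalised scores for the next word
    targets: (batch, steps) integer index of the correct word at each step
    mask:    (batch, steps) 1.0 for real words, 0.0 for padding
    """
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to the correct word at each step
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return -(picked * mask).sum() / mask.sum()

logits = np.zeros((1, 3, 4))        # uniform distribution over a 4-word vocabulary
targets = np.array([[1, 2, 0]])
mask = np.array([[1.0, 1.0, 0.0]])  # the third position is padding
loss = masked_caption_loss(logits, targets, mask)
print(loss)  # log(4), about 1.386
```

With uniform predictions over four words, every real word costs log(4), so the masked average is log(4) regardless of the padding.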

# Method of the caption generator class; written against the TensorFlow 1.x graph API
# (requires `import tensorflow as tf`)
def build_model(self):
    # declare the placeholders for our extracted image feature vectors, our caption, and our mask
    # (the mask describes how long our caption is with an array of 0/1 values of length `maxlen`)
    img = tf.placeholder(tf.float32, [self.batch_size, self.dim_in])
    caption_placeholder = tf.placeholder(tf.int32, [self.batch_size, self.n_lstm_steps])
    mask = tf.placeholder(tf.float32, [self.batch_size, self.n_lstm_steps])

    # get an initial LSTM input from our image embedding
    image_embedding = tf.matmul(img, self.img_embedding) + self.img_embedding_bias

    # set the initial state of our LSTM
    state = self.lstm.zero_state(self.batch_size, dtype=tf.float32)

    total_loss = 0.0
    with tf.variable_scope("RNN"):
        for i in range(self.n_lstm_steps):
            if i > 0:
                # if this isn't the first iteration of our LSTM, we need the word embedding
                # corresponding to the (i-1)th word in our caption
                with tf.device("/cpu:0"):
                    current_embedding = tf.nn.embedding_lookup(self.word_embedding, caption_placeholder[:, i-1]) + self.embedding_bias
            else:
                # on the first iteration of our LSTM we use the embedded image as input
                current_embedding = image_embedding

            if i > 0:
                # allows us to reuse the LSTM variables on each iteration
                tf.get_variable_scope().reuse_variables()

            out, state = self.lstm(current_embedding, state)

            if i > 0:
                # get the one-hot representation of the next word in our caption
                labels = tf.expand_dims(caption_placeholder[:, i], 1)
                ix_range = tf.range(0, self.batch_size, 1)
                ixs = tf.expand_dims(ix_range, 1)
                concat = tf.concat([ixs, labels], 1)
                onehot = tf.sparse_to_dense(
                    concat, tf.stack([self.batch_size, self.n_words]), 1.0, 0.0)

                # perform a softmax classification to predict the next word in the caption
                logit = tf.matmul(out, self.word_encoding) + self.word_encoding_bias
                xentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logit, labels=onehot)
                # zero out the loss at padded positions
                xentropy = xentropy * mask[:, i]

                loss = tf.reduce_sum(xentropy)
                total_loss += loss

        # average over the number of real (unmasked) words
        total_loss = total_loss / tf.reduce_sum(mask[:, 1:])
        return total_loss, img, caption_placeholder, mask

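Generating captions through inference

After training, we have a model that gives the probability of the next word given an image and the caption words so far. How do we use it to generate new captions?

The simplest approach is to take an input image and iteratively output the most probable next word, building up a single caption.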

# Method of the caption generator class; written against the TensorFlow 1.x graph API
def build_generator(self, maxlen, batchsize=1):
    # same setup as the `build_model` function; the placeholder uses `batchsize`
    # so that its shape matches the LSTM state created below
    img = tf.placeholder(tf.float32, [batchsize, self.dim_in])
    image_embedding = tf.matmul(img, self.img_embedding) + self.img_embedding_bias
    state = self.lstm.zero_state(batchsize, dtype=tf.float32)

    # list to hold the words of our generated caption
    all_words = []
    with tf.variable_scope("RNN"):
        # in the first iteration we have no previous word, so we directly pass in the image embedding
        # and set `previous_word` to the embedding of the start token ([0]) for the future iterations
        output, state = self.lstm(image_embedding, state)
        previous_word = tf.nn.embedding_lookup(self.word_embedding, [0]) + self.embedding_bias

        for i in range(maxlen):
            tf.get_variable_scope().reuse_variables()

            out, state = self.lstm(previous_word, state)

            # score every word in the vocabulary and greedily pick the most probable one
            logit = tf.matmul(out, self.word_encoding) + self.word_encoding_bias
            best_word = tf.argmax(logit, 1)

            with tf.device("/cpu:0"):
                # get the embedding of best_word to use as input to the next iteration of our LSTM
                previous_word = tf.nn.embedding_lookup(self.word_embedding, best_word)

            previous_word += self.embedding_bias

            all_words.append(best_word)

    return img, all_words

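In many cases this works reasonably well. But by greedily picking the most probable word at each step, we may not end up with the most probable caption overall, or even a coherent one.

One way to avoid this is an algorithm called beam search. At each step it takes the k best sentences of length t, extends them into candidate sentences of length t+1, and keeps only the k best of those. This explores a larger space of good captions while keeping inference computationally tractable. In the example below, the algorithm maintains a list of the k = 2 best candidate sentences, shown as the paths through the bold words at each vertical time step.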
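
The procedure can be sketched in plain Python as below. This is a toy illustration rather than the tutorial's implementation: `next_log_probs` is a hypothetical callback standing in for one step of the trained LSTM, and the tiny probability table `TOY` is invented so that greedy decoding and beam search disagree.

```python
import math

def beam_search(next_log_probs, start, end, k=2, max_len=10):
    """Keep the k highest-scoring partial sentences at every step.

    next_log_probs(prefix) returns a dict mapping each possible next word
    to its log-probability (hypothetical stand-in for the trained model).
    """
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            if words[-1] == end:
                candidates.append((score, words))  # finished sentences carry over unchanged
                continue
            for word, logp in next_log_probs(words).items():
                candidates.append((score + logp, words + [word]))
        # keep only the k best candidates of length t + 1
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(words[-1] == end for _, words in beams):
            break
    return beams

# Invented next-word probabilities; any prefix not listed ends the sentence
TOY = {
    ("<s>",): {"a": 0.6, "the": 0.4},
    ("<s>", "a"): {"dog": 0.5, "cat": 0.5},
    ("<s>", "the"): {"horse": 0.9, "dog": 0.1},
}

def toy_next_log_probs(prefix):
    probs = TOY.get(tuple(prefix), {"</s>": 1.0})
    return {w: math.log(p) for w, p in probs.items()}

best_score, best_words = beam_search(toy_next_log_probs, "<s>", "</s>", k=2)[0]
greedy_score, greedy_words = beam_search(toy_next_log_probs, "<s>", "</s>", k=1)[0]
print(best_words, round(math.exp(best_score), 2))      # ['<s>', 'the', 'horse', '</s>'] 0.36
print(greedy_words, round(math.exp(greedy_score), 2))  # ['<s>', 'a', 'dog', '</s>'] 0.3
```

With k = 1 the search reduces to greedy decoding, which commits to "a" (0.6) and ends at probability 0.3, while k = 2 keeps "the" alive long enough to find the more probable sentence at 0.36.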

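Limitations and discussion

A neural image caption generator provides a useful framework for learning to map images to natural-language descriptions. By training on a large collection of images paired with captions, the model learns to capture the relevant semantic information from visual features.

However, with a static image embedding, the caption generator relies on image features that are useful for image classification, which are not necessarily the features most useful for caption generation. To increase the amount of task-relevant information in each feature, we can train the image embedding model (the VGG-16 network used to encode features) as part of the caption generation model, so that backpropagation fine-tunes the image encoder to better suit caption generation.

Also, if we look closely at the generated captions, we notice that they are rather generic and show little variation. Take the following image as an example:

The caption generated for this picture is "a giraffe standing next to a tree." If we look at other pictures, however, we may notice that the model generates "a giraffe standing next to a tree" for almost any picture containing a giraffe, because giraffes in the training set often appear near trees.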

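Next steps

First, if you want to improve on the caption generation model described here, take a look at Google's open-source Show and Tell network, which is trained on the MS COCO dataset with an Inception-v3 image embedding.

Current state-of-the-art image captioning models include a visual attention mechanism, which lets the model attend to specific regions of the image and selectively focus on particular kinds of information while generating a caption.

If you are interested in this state-of-the-art approach to caption generation, see the paper co-authored by Yoshua Bengio: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.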

Resources mentioned in this article:

Original O'Reilly tutorial: Caption this, with TensorFlow

MS COCO: Common Objects in Context

A paper on training image captioning with the MS COCO dataset: https://arxiv.org/pdf/1609.06647.pdf

Flickr30k dataset: http://web.engr.illinois.edu/~bplumme2/Flickr30kEntities/

GitHub code for the tutorial: mlberkeley/oreilly-captions

Paper co-authored by Yoshua Bengio: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
