Introduction to Recurrent Neural Networks (RNN), Part 2: A Keras Code Walkthrough

01-30

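The previous article gave us a basic understanding of recurrent neural networks. Its "Deploying a neural network with keras" section included code for building a simple RNN, but actually running that code tends to raise all sorts of problems, as the author found firsthand. To make the code easier to follow, this article walks through it step by step, explains the natural language processing concepts it touches on, and analyzes the errors encountered at run time along with their fixes.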

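Our goal: build an RNN with keras and train it on tweet data.

The approach: preparation -> data serialization -> building the word embedding matrix -> training the network

Preparation: import the required libraries, load the dataset, and split it into training and test sets.
Data serialization: convert the text into sequences of numbers.
Building the word embedding matrix: load the GloVe model and derive the embedding matrix from it.
Training the network: split off a validation set, build a simple RNN, and train it.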

Preparation

1. Import the required libraries

import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.convolutional import Conv1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pandas as pd
import numpy as np
import spacy

This requires keras and spacy to be installed first. spacy is a Python natural language processing toolkit, first released in mid-2014, that bills itself as "Industrial-Strength Natural Language Processing in Python".

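2. Load the dataset

# load the dataset
train = pd.read_csv("./datasets/training.1600000.processed.noemoticon.csv", encoding="latin-1")
Y_train = train[train.columns[0]]
X_train = train[train.columns[5]]

We first read the file and then pull out the two pieces we need: the labels and the tweet text. train.head(3) shows the first three rows of the file; the first column holds the labels and the sixth column holds the tweets. How do we extract them? A pandas.DataFrame column can be selected by name with train[column_name], but here we want to select by position, so we use a small trick: train.columns[i] returns the name of the column at position i, which we then use as the key.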
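The column-by-position trick can be tried on a tiny in-memory file. The sketch below is only an illustration: the two sample rows are hypothetical, and the six-column layout (label, id, date, query, user, text) is an assumption about the dataset rather than something stated above. It also shows why the Name field in the output that follows contains a tweet: by default read_csv treats the first row as the header, whereas header=None keeps every row as data. Using iloc is an alternative way to select by position.

```python
import io
import pandas as pd

# Hypothetical two-row sample in an assumed six-column layout:
# label, id, date, query, user, text
csv_text = (
    '"0","1","Mon Apr 06","NO_QUERY","alice","first tweet"\n'
    '"4","2","Mon Apr 06","NO_QUERY","bob","second tweet"\n'
)

# Default behaviour: the first row is consumed as the header.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; columns are then named 0..5.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

labels = no_header.iloc[:, 0]   # first column, selected by position
tweets = no_header.iloc[:, 5]   # sixth column, selected by position

print(len(with_header), len(no_header))
print(labels.tolist(), tweets.tolist())
```

With header=None, train[0] and train[5] would work directly as well, since the column names are then the integer positions.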

Let's look at the data and the labels, starting with the first three rows:

X_train.head(3)
0    is upset that he cant update his Facebook by ...
1    @Kenichan I dived many times for the ball. Man...
2    my whole body feels itchy and like its on fire
Name: @switchfoot http://twitpic.com/2y1zl - Awww, thats a bummer. You shoulda got David Carr of Third Day to do it. ;D, dtype: object

Y_train.head(3)
0    0
1    0
2    0
Name: 0, dtype: int64

Clearly, the first three tweets all express negative sentiment, and all three are labeled 0.

Now the last three rows:

X_train.tail(3)
19996    Are you ready for your MoJo Makeover? Ask me f...
19997    Happy 38th Birthday to my boo of alll time!!! ...
19998    happy #charitytuesday @theNSPCC @SparksCharity...
Name: @switchfoot http://twitpic.com/2y1zl - Awww, thats a bummer. You shoulda got David Carr of Third Day to do it. ;D, dtype: object

Y_train.tail(3)
19996    4
19997    4
19998    4
Name: 0, dtype: int64

Clearly, tweets expressing happiness are labeled 4.

Our aim is to classify these tweets with an RNN.

3. Split the data into training and test sets

# split the data into train and test sets
from sklearn.model_selection import train_test_split
trainset1x, trainset2x, trainset1y, trainset2y = train_test_split(X_train.values, Y_train.values, test_size=0.02, random_state=42)
trainset1y = pd.get_dummies(trainset1y)

We use sklearn's train_test_split to split the data into a training set and a test set. The last line, pd.get_dummies(), converts the labels to one-hot vectors; since y takes only two values, 0 and 4, each label is encoded as a two-dimensional one-hot vector.
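To make the one-hot step concrete, here is a minimal pure-Python sketch of the idea. The helper one_hot is hypothetical and is not how pandas implements get_dummies; it only mirrors the outcome for integer labels, with one column per distinct label in sorted order.

```python
def one_hot(labels):
    """Encode labels as one-hot rows, one column per distinct label (sorted)."""
    classes = sorted(set(labels))
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1   # switch on the column belonging to this label
        rows.append(row)
    return classes, rows

classes, encoded = one_hot([0, 4, 4, 0])
print(classes)   # [0, 4]
print(encoded)   # [[1, 0], [0, 1], [0, 1], [1, 0]]
```

So label 0 becomes [1, 0] and label 4 becomes [0, 1]; the column order matters later when reading the softmax output. Recent pandas versions return booleans rather than 0/1 from get_dummies, which np.array converts without trouble.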

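Data serialization

tokenizer = Tokenizer()
tokenizer.fit_on_texts(trainset1x)
sequences = tokenizer.texts_to_sequences(trainset1x)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=15, padding="post")
print(data.shape)

Found 29863 unique tokens.
(19599, 15)

This part is more involved and touches on a fair amount of natural language processing background. For the sake of general readers, we will not go into the NLP theory behind the code at length.

We focus only on what each step changes and why. Looking closely, this code falls into three parts. The first part produces sequences, so let's see what sequences is:

sequences[1]
[27, 95, 94, 15, 4, 27, 105, 976, 487, 78, 139]

trainset1x[1]
good night twitter have a good sleep everybody xx 2 days!

Comparing sequences[1] with trainset1x[1], we can see that this step maps each word to a number. Note that the first and sixth values of sequences[1] are both 27, and the first and sixth words of trainset1x[1] are both good. The point of the mapping is that these numbers later serve as indices for looking up each word's vector.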
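The word-to-number mapping can be sketched in a few lines of plain Python. This is a simplified stand-in for the keras Tokenizer, not its actual implementation: the helper names are hypothetical, and the punctuation handling is cruder than what keras does. It keeps the two properties that matter here: more frequent words get smaller indices, and numbering starts at 1 because 0 is reserved for padding.

```python
import string
from collections import Counter

def tokenize(text):
    # Lowercase, drop ASCII punctuation, split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def fit_word_index(texts):
    counts = Counter(word for text in texts for word in tokenize(text))
    # Most frequent word gets index 1; index 0 is left free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

texts = ["good night twitter have a good sleep", "good morning twitter"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
print(word_index["good"], sequences[0])
```

As in the real output, both occurrences of good in the first sentence map to the same index.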

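Now the second part:

print('Found %s unique tokens.' % len(word_index))
Found 29863 unique tokens.

This one is simple: it reports how many unique words there are.

The third part is also a single line. Let's inspect the resulting data:

data.shape
(19599, 15)

pad_sequences brings every sequence in sequences to a length of 15 by appending zeros at the end (sequences longer than 15 are truncated). Let's compare sequences[1] with data[1] again:

data[1]
array([ 27,  95,  94,  15,   4,  27, 105, 976, 487,  78, 139,   0,   0,
         0,   0], dtype=int32)

The original sequences[1] has been extended to 15 elements, with zeros filling the trailing positions.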
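The padding behaviour is easy to reproduce by hand. The sketch below is a hypothetical helper, pad_post, that mimics pad_sequences(..., padding="post") for lists of integers; it assumes the keras default of truncating over-long sequences from the front, which the call above does not override.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end to maxlen; drop leading items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                     # keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

short = [27, 95, 94, 15, 4, 27, 105, 976, 487, 78, 139]   # sequences[1] from above
long_ = list(range(1, 18))                                # 17 items, longer than 15

result = pad_post([short, long_], maxlen=15)
print(result[0])   # [27, 95, 94, 15, 4, 27, 105, 976, 487, 78, 139, 0, 0, 0, 0]
print(result[1])   # [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
```

The first row matches data[1]; the second shows that a tweet longer than 15 tokens would lose its first words.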

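Building the word embedding matrix

1. Load the GloVe model

# load the GloVe model
def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    # each line holds a word followed by the components of its vector
    with open(gloveFile, "r", encoding="utf-8") as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = [float(val) for val in splitLine[1:]]
            model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model

# read the 200-dimensional Twitter GloVe vectors into a dict
model = loadGloveModel("./glove/glove.twitter.27B/glove.twitter.27B.200d.txt")

Loading Glove Model
Done. 1177902 words loaded!

What is the GloVe model? For our purposes it is simply a mapping from each word to its word vector; here every word is represented by a 200-dimensional vector. Continuing with the word good, here are just the first three values, to keep the output short.

model.get('good')[0:3]
[0.018223, -0.012323, 0.035569]

Readers will have noticed the output of this code: we loaded vectors for 1177902 words, yet according to the previous section our dataset contains only 29863 distinct words. So next we build the embedding matrix, keeping only the words we actually need.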

2. Build the embedding matrix

# calculate the number of words (+1 because index 0 is reserved for padding)
nb_words = len(word_index) + 1
# obtain the embedding matrix
embedding_matrix = np.zeros((nb_words, 200))
for word, i in word_index.items():
    embedding_vector = model.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

Null word embeddings: 12354

Let's go straight to row 27, the vector for good, and check its first three values.

embedding_matrix[27][0:3]
array([ 0.018223, -0.012323,  0.035569])

OK, this matches the vector we saw earlier. The matrix is done: each row is the word vector of one word.
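A toy version of this step helps in reading the "Null word embeddings" count. The vectors and word_index below are made-up three-dimensional examples, not values from GloVe. The sketch compares two ways of counting empty rows: the row-sum test used above, and a stricter test that a row contains only zeros. The two differ whenever a real vector happens to sum to exactly zero, and both include row 0, which is the padding index rather than a word.

```python
import numpy as np

# Made-up 3-d vectors; "xx" has no pretrained vector.
vectors = {"good": [0.5, -0.5, 0.0], "night": [0.1, 0.2, 0.3]}
word_index = {"good": 1, "night": 2, "xx": 3}

embedding_matrix = np.zeros((len(word_index) + 1, 3))
for word, i in word_index.items():
    vector = vectors.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Row-sum test: also flags "good", whose components cancel out.
null_by_sum = int(np.sum(embedding_matrix.sum(axis=1) == 0))
# All-zero test: flags only rows 0 (padding) and 3 ("xx").
null_by_all = int(np.sum(~embedding_matrix.any(axis=1)))

print(null_by_sum, null_by_all)
```

With real 200-dimensional vectors an exact zero sum is unlikely, so the two counts will usually agree, but the all-zero test states the intent more directly.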

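Training the network

1. Split the training set again into training and validation sets

# reshape the data in preparation for training
data = data.reshape(19599, 15, 1)
trainx, validx, trainy, validy = train_test_split(data, trainset1y, test_size=0.3, random_state=42)

This step holds out 30% of the training data as a validation set (a single hold-out split rather than k-fold cross-validation). It also switches to the numeric data in place of the raw text in trainset1x.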
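The sample counts that appear throughout this article follow from the two split ratios. The sketch below assumes that train_test_split rounds the test portion up and gives the remainder to the training portion, which is our understanding of scikit-learn's behaviour when only test_size is given; split_sizes is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(19999, 0.02))   # (19599, 400)
print(split_sizes(19599, 0.3))    # (13719, 5880)
```

The first line reproduces the 19599 rows seen in data.shape, and the second gives the 13719 training and 5880 validation samples used when fitting the model.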

trainy = np.array(trainy)
validy = np.array(validy)

Converting the labels to NumPy arrays ensures they are in a format keras accepts.

2. Build a simple RNN

# building a simple RNN model
def modelbuild():
    model = Sequential()

    model.add(keras.layers.InputLayer(input_shape=(15, 1)))

    # NOTE: this Embedding layer is constructed but never passed to model.add(),
    # so it is not part of the model that gets trained below.
    keras.layers.embeddings.Embedding(nb_words, 15, weights=[embedding_matrix], input_length=15,
                                      trainable=False)

    model.add(keras.layers.recurrent.SimpleRNN(units=100, activation='relu',
                                               use_bias=True))

    model.add(keras.layers.Dense(units=1000, input_dim=2000, activation='sigmoid'))
    model.add(keras.layers.Dense(units=500, input_dim=1000, activation='relu'))
    model.add(keras.layers.Dense(units=2, input_dim=500, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

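This is the core of building the RNN, and it is fairly simple. We first create an instance of a sequential model, model.

The model needs an input layer, a weight matrix, fully connected layers, and an output layer.

The first layer has shape (15, 1), because every sentence has been fixed to a length of 15 words.

Next, the code declares an embedding weight matrix of shape (nb_words, 15), initialized from the embedding_matrix we built earlier. Note that embedding_matrix itself has shape (nb_words, 200), and that this layer is never passed to model.add, so as written it does not take part in training.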
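What an embedding lookup does to the padded index sequences can be shown with plain NumPy indexing. This is a sketch with toy sizes and random vectors, not the keras layer itself: each integer in data selects one row of the matrix, so a batch of shape (samples, 15) becomes (samples, 15, 200) in the article's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
nb_words, dim = 6, 4                      # toy sizes; the article uses 29864 and 200
embedding_matrix = rng.normal(size=(nb_words, dim))
embedding_matrix[0] = 0.0                 # row 0 belongs to the padding index

data = np.array([[2, 5, 2, 0, 0],
                 [1, 3, 4, 5, 0]])        # two padded index sequences of length 5

embedded = embedding_matrix[data]         # integer indexing: (2, 5) -> (2, 5, 4)
print(embedded.shape)
```

Repeated indices yield identical vectors, and padded positions yield all-zero vectors, which is the input a recurrent layer would see if the embedding were part of the model.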

The main body is a simple RNN layer, for which we specify 100 recurrent units and relu as the activation function.
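For readers who want to see what such a layer computes, here is a minimal NumPy sketch of a simple recurrent forward pass with relu: at each time step the new state is relu(x_t W + h U + b), and the state after the last step is the layer's output. The weights are random placeholders and the function name is hypothetical; this illustrates the recurrence and is not keras code.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b):
    """x has shape (timesteps, input_dim); returns the final state, shape (units,)."""
    h = np.zeros(U.shape[0])
    for x_t in x:
        # Combine the current input with the previous state, then apply relu.
        h = np.maximum(0.0, x_t @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
timesteps, input_dim, units = 15, 1, 100          # matches input_shape=(15, 1), units=100
W = rng.normal(scale=0.1, size=(input_dim, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

x = rng.integers(0, 1000, size=(timesteps, input_dim)).astype(float)
h = simple_rnn_forward(x, W, U, b)
print(h.shape)
```

Whatever the sentence length, the layer hands a single vector of 100 values to the dense layers that follow.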

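It is followed by two fully connected layers with 1000 and 500 units respectively.

The final output layer uses softmax and produces two values: the predicted probabilities of the two classes (labels 0 and 4).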

3. Train the model

Finally, we train the model using the training and validation sets.

# build and train the model
finalmodel = modelbuild()
finalmodel.fit(trainx, trainy, epochs=10, batch_size=120, validation_data=(validx, validy))

Train on 13719 samples, validate on 5880 samples
Epoch 1/10
13719/13719 [==============================] - 6s 439us/step - loss: 0.7447 - acc: 0.5237 - val_loss: 0.6976 - val_acc: 0.5235
Epoch 2/10
13719/13719 [==============================] - 5s 343us/step - loss: 0.6947 - acc: 0.5254 - val_loss: 0.6891 - val_acc: 0.5376
Epoch 3/10
13719/13719 [==============================] - 5s 329us/step - loss: 0.6881 - acc: 0.5306 - val_loss: 0.6863 - val_acc: 0.5425
Epoch 4/10
13719/13719 [==============================] - 4s 326us/step - loss: 0.6876 - acc: 0.5350 - val_loss: 0.6858 - val_acc: 0.5417
Epoch 5/10
13719/13719 [==============================] - 5s 334us/step - loss: 0.6857 - acc: 0.5431 - val_loss: 0.6873 - val_acc: 0.5255
Epoch 6/10
13719/13719 [==============================] - 5s 358us/step - loss: 0.6857 - acc: 0.5462 - val_loss: 0.6852 - val_acc: 0.5410
Epoch 7/10
13719/13719 [==============================] - 4s 326us/step - loss: 0.6850 - acc: 0.5482 - val_loss: 0.6841 - val_acc: 0.5383
Epoch 8/10
13719/13719 [==============================] - 5s 359us/step - loss: 0.6848 - acc: 0.5491 - val_loss: 0.6924 - val_acc: 0.5306
Epoch 9/10
13719/13719 [==============================] - 5s 363us/step - loss: 0.6838 - acc: 0.5528 - val_loss: 0.6851 - val_acc: 0.5454
Epoch 10/10
13719/13719 [==============================] - 5s 356us/step - loss: 0.6837 - acc: 0.5516 - val_loss: 0.6874 - val_acc: 0.5378
<keras.callbacks.History at 0x171a31b38>

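In the end we have an RNN model with a training accuracy of 0.5516. Let's test it on the tweet we analyzed in the data serialization section, "good night twitter have a good sleep everybody xx 2 days!".

finalmodel.predict(np.array(data[1].reshape(1,15,1)))
array([[ 0.54520983,  0.4547902 ]], dtype=float32)

The softmax output layer returns two values, the larger of which is 0.54520983. With a threshold of 0.5, the tweet is assigned to class 0, which is the negative-sentiment label, even though the tweet itself reads as positive. This is in line with the model's accuracy being only slightly above chance.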
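Turning the two probabilities back into a label is a matter of taking the position of the largest value and looking it up in the one-hot column order. The sketch below assumes that order is (0, 4), as produced by get_dummies on these labels, and uses the meaning of the labels seen in the data preview (0 for negative tweets, 4 for happy ones); predict_label is a hypothetical helper.

```python
def predict_label(probabilities, classes=(0, 4)):
    """Return the class whose probability is largest."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best]

sentiment = {0: "negative", 4: "positive"}

label = predict_label([0.54520983, 0.4547902])
print(label, sentiment[label])   # 0 negative
```

For two classes, picking the largest probability is equivalent to applying a 0.5 threshold to either output.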