Some Notes on TensorFlow

02-04

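With so many deep learning frameworks around, choosing one is a real headache. I started with Keras, but as my models grew more complex I found it too abstract and not flexible enough, so I went looking for a lower-level framework. I wavered between PyTorch and TensorFlow for a while and finally settled on TensorFlow because PyTorch's cross-platform support was poor (a bit embarrassing, I know). Later I also discovered that plotting with TensorBoard is really great.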

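I ran into problems from time to time while using TF. Frankly, TF's API is somewhat messy and the tutorials are not very friendly, so it is harder to pick up than other frameworks, but once you are familiar with it, it turns out to be quite pleasant to use. This post summarizes some of the problems I hit, how I solved them, and a few interesting features.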

Importing data

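My dataset is fairly large, so I wanted to read it through TF's own input pipeline: it saves memory, and TF ships decode functions for various formats, which is convenient. Mostly, to be honest, I was too lazy to write my own dataloader. Specifically, I used the tf.contrib.data API that was newly added in r1.2. The code is short: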

    import tensorflow as tf  # written against TensorFlow r1.2 (tf.contrib.data)

    def input_pipeline(filenames, batch_size):
        # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
        dataset = (tf.contrib.data.TextLineDataset(filenames)
                   .map(lambda line: tf.decode_csv(
                       line, record_defaults=[[1], [1], [1]], field_delim='\t'))
                   .shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
                   .batch(batch_size))

        # Return an *initializable* iterator over the dataset, which will allow us to
        # re-initialize it at the beginning of each epoch.
        return dataset.make_initializable_iterator()

    filenames = ['1.txt']
    batch_size = 300
    num_epochs = 10
    iterator = input_pipeline(filenames, batch_size)

    # `a1`, `a2`, and `a3` represent the next element to be retrieved from the iterator.
    a1, a2, a3 = iterator.get_next()

    with tf.Session() as sess:
        for _ in range(num_epochs):
            # Resets the iterator at the beginning of an epoch.
            sess.run(iterator.initializer)

            try:
                while True:
                    a, b, c = sess.run([a1, a2, a3])
                    print(a, b, c)
            except tf.errors.OutOfRangeError:
                # This will be raised when you reach the end of an epoch (i.e. the
                # iterator has no more elements).
                pass

            # Perform any end-of-epoch computation here.
            print('Done training, epoch reached')

This API improves on tf.train.string_input_producer and is somewhat nicer to use. Calling sess.run(iterator.initializer) at the start of each epoch re-shuffles the data.

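In practice, though, I found that this shuffle is not really a shuffle over the whole dataset. Taking the code above as an example: buffer_size=10000 means the first 10000 lines of the file are read into a buffer; a batch of 300 is then drawn at random from it (batch_size=300), which leaves 9700 elements in the buffer, so another 300 lines are read from the file to refill it, then the next batch is drawn at random, and so on.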
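To make the locality concrete, here is a small pure-Python simulation of a fixed-size shuffle buffer. It is only a toy model and not TensorFlow code: the function name buffered_shuffle is made up for illustration, and it assumes the buffer is refilled one element at a time as elements are drawn, which is my understanding of the behavior rather than something checked against the TF source.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them.

    Hypothetical helper for illustration only (not a TensorFlow API).
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit one random element from the buffer
        buffer[idx] = item       # refill the freed slot with the next input
    # Input exhausted: drain whatever is left, in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item

# 100000 "lines", numbered by their position in the file.
out = list(buffered_shuffle(range(100000), buffer_size=10000))
first_batch = out[:300]

# The first batch can only contain lines from the very beginning of the file.
print(max(first_batch) < 10000 + 300)
```

In this model the k-th emitted element always comes from the first buffer_size + k lines of the file, so a line near the end of a large file can never show up in an early batch, no matter how the random draws fall.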

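This approach cannot randomize over the whole dataset, and in my runs it was also slow, which I attribute to the buffer being shuffled on every draw. In the end I went back to a dataloader I wrote myself, which turned out to be about five times faster than the method TF provides.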
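My actual dataloader is not reproduced here, but the idea of a full-dataset shuffle can be sketched in a few lines of plain Python. Everything below is an illustrative assumption: the name iterate_batches, the in-memory list of rows, and the three-integer row layout (chosen to mirror the three CSV columns above). It keeps all rows in memory, so it only fits data that fits in RAM.

```python
import random

def iterate_batches(rows, batch_size, num_epochs, seed=None):
    """Yield (epoch, batch) pairs, reshuffling the whole dataset every epoch.

    Hypothetical helper: `rows` is any in-memory sequence of parsed records.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    for epoch in range(num_epochs):
        rng.shuffle(order)  # one permutation over the entire dataset
        for start in range(0, len(order), batch_size):
            yield epoch, [rows[i] for i in order[start:start + batch_size]]

# Stand-in data: 1000 rows of three integers each.
rows = [(i, 2 * i, 3 * i) for i in range(1000)]
batches = list(iterate_batches(rows, batch_size=300, num_epochs=2, seed=0))
print(len(batches), len(batches[3][1]))
```

Because only the index list is shuffled, each epoch costs one pass over the indices, and every row has the same chance of landing in any batch. The last batch of an epoch is simply smaller when the dataset size is not a multiple of batch_size.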

For details, see the Stack Overflow discussion "How to use TensorFlow tf.train.string_input_producer to produce several epochs data?"

Parameter sharing

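Take the model from my blog post Text Matching (II) as an example: if a model has to share parameters between two inputs (such as a Question and an Answer), the graph needs to be designed with some care. The usual approach is to declare the parameters with tf.get_variable(), put both calls inside the same variable_scope, and declare that the variables can be reused there. TF then checks while building the graph whether a variable with that name has already been created. A rough outline: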

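    import tensorflow as tf  # TensorFlow 1.x

    def nets(sequence, hidden_size=128):
        # Placeholder network: one projection matrix shared by both inputs.
        # `hidden_size` is an illustrative assumption, not from the original model.
        input_dim = sequence.get_shape().as_list()[-1]
        W = tf.get_variable("W", [input_dim, hidden_size],
                            initializer=tf.contrib.layers.xavier_initializer())
        return tf.matmul(sequence, W)

    def inference(question, answer):
        with tf.variable_scope("nets") as scope:
            q = nets(question)
            scope.reuse_variables()  # from here on, get_variable returns the existing W
            a = nets(answer)
        return q, a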
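The lookup rule behind this can be mimicked without TensorFlow. The class below is only a toy stand-in written for this post (VariableStore and its attributes are invented names, not TF APIs); it models the two behaviors that matter here: asking for an existing name without reuse raises a ValueError, and asking with reuse hands back the same object.

```python
class VariableStore:
    """Toy stand-in for name-based variable lookup; not TensorFlow code."""

    def __init__(self):
        self.variables = {}
        self.scope = ""
        self.reuse = False

    def get_variable(self, name, make_value):
        full_name = self.scope + "/" + name if self.scope else name
        if full_name in self.variables:
            if not self.reuse:
                raise ValueError("Variable %s already exists" % full_name)
            return self.variables[full_name]  # shared: same object is returned
        if self.reuse:
            raise ValueError("Variable %s does not exist" % full_name)
        self.variables[full_name] = make_value()
        return self.variables[full_name]

def nets(store):
    # Both the question branch and the answer branch ask for the same name.
    return store.get_variable("W", lambda: [0.0] * 4)

store = VariableStore()
store.scope = "nets"
w_question = nets(store)
store.reuse = True  # plays the role of scope.reuse_variables()
w_answer = nets(store)
print(w_question is w_answer, list(store.variables))
```

Sharing is decided purely by the full variable name, which is why both calls have to sit under the same scope and why the reuse switch has to be flipped between them.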

Plotting with TensorBoard

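Once I started using TensorBoard, I found it so convenient for visualization that I hardly need to draw plots myself anymore. Data is written through TensorFlow's tf.summary API, and the dashboards I use most are: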

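Scalar: shows how loss, accuracy, and similar values change at every step
Distribution, Histogram: show how the distribution of the parameters changes during training, which helps judge whether the model has learned enough
Graph: renders the model architecture as defined, so you can see how the graph was built. For example, if the parameter sharing described above is implemented correctly, the Graph view shows question and answer using the same module
Embedding: projects the inputs into a low-dimensional space using PCA, which looks very cool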

I strongly recommend using this feature. See summaries_and_tensorboard for details; the tutorials cover this one reasonably well.