TensorFlow Pitfall Notes: tf.data


From the column 慢慢學TensorFlow

All the content of this post is also available on my blog — feel free to follow it:

lanhongvp.github.io

All the code involved is on my GitHub as well — mutual follows welcome:

lanhongvp/tensorflow_dataset_learn

Today I'll try to summarize some usage of the tf.data API. I ended up using it because the data I had to process was very large and stored in distributed fashion across multiple servers, so the traditional feeding approach was not an option; instead I used tf.data to do the preprocessing. Since I need to write a summary anyway, I'm taking the chance to write up some tf.data usage. If you spot any mistakes, please do let me know.

Data reading in TensorFlow

Let's first look at TensorFlow's data reading mechanism:

  • Zhihu column by 何之源: "TensorFlow's data reading mechanism explained in ten diagrams"

That article explains TensorFlow's data reading mechanism very well; it's worth a read first to get the general picture.
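To make the mechanism concrete before the API links below, here is a minimal, self-contained sketch of the Dataset → Iterator pattern that the rest of this post builds on. This toy example is my own illustration (not from the linked article), using a small in-memory array:

import tensorflow as tf
import numpy as np

# Toy data: six scalars, just to show the Dataset -> Iterator flow.
data = np.arange(6, dtype=np.int64)

# Build the pipeline: slice the array, shuffle, then batch.
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.shuffle(buffer_size=6).batch(2)

# A one-shot iterator pulls batches until the data is exhausted.
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    while True:
        try:
            print(sess.run(next_batch))
        except tf.errors.OutOfRangeError:
            break  # dataset fully consumed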

How do you use the Dataset API?

  • Zhihu column by 何之源: "TensorFlow's brand-new data reading approach: an introductory Dataset API tutorial"

  • The official TensorFlow "Importing Data" guide (Chinese)

  • A GitHub example that demonstrates dataset usage very clearly

  • CS230 - (TensorFlow) how to build the data pipeline

Although the materials above all explain tf.data well, I couldn't find a single complete example that uses both tf.data.TextLineDataset() and tf.data.TFRecordDataset(), which is why I decided to write this summary.

The classic MNIST example

Using the classic MNIST example, this post takes two kinds of source data — csv files and tfrecord files — and builds the corresponding Datasets with tf.data.TextLineDataset() and tf.data.TFRecordDataset() respectively, then preprocesses the source data with four different kinds of Iterator: one-shot, initializable, reinitializable, and feedable.

I've put all the related materials on my GitHub (starry eyes), including the csv and tfrecords source data files (in the data folder) and the full code implemented in a jupyter notebook (tf_dataset_learn.ipynb).

If you want to follow along, just git clone https://github.com/lanhongvp/tensorflow_dataset_learn.git and run the notebook in jupyter to see the outputs — that gives you a fairly direct feel for what's going on. For how to use Git and GitHub, see my VSCODE_GIT post. Next, a brief walkthrough of the MNIST example.

tf.data.TFRecordDataset() & make_one_shot_iterator()

tf.data.TFRecordDataset() takes the path(s) of tfrecords files directly as input; because it streams records from the files rather than loading everything into memory, it sidesteps the problem of datasets too large to train on a single machine. In this post the file path is /Users/honglan/Desktop/train_output.tfrecords — that's a path on my own machine, so replace it with whatever path suits you. make_one_shot_iterator() creates a one-shot iterator, the simplest form of iterator: it supports iterating over the dataset only once and needs no explicit initialization. The code below combines it with the MNIST dataset and tf.data.TFRecordDataset().

# Validate tf.data.TFRecordDataset() using make_one_shot_iterator()
import tensorflow as tf
import numpy as np

num_epochs = 2
num_class = 10
sess = tf.Session()

# Use `tf.parse_single_example()` to extract data from a `tf.Example`
# protocol buffer, and perform any additional per-record preprocessing.
def parser(record):
    keys_to_features = {
        "image_raw": tf.FixedLenFeature((), tf.string, default_value=""),
        "pixels": tf.FixedLenFeature((), tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
        "label": tf.FixedLenFeature((), tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
    }
    parsed = tf.parse_single_example(record, keys_to_features)

    # Parse the string into an array of pixels corresponding to the image
    images = tf.decode_raw(parsed["image_raw"], tf.uint8)
    images = tf.reshape(images, [28, 28, 1])
    labels = tf.cast(parsed["label"], tf.int32)
    labels = tf.one_hot(labels, num_class)
    pixels = tf.cast(parsed["pixels"], tf.int32)
    print("IMAGES", images)
    print("LABELS", labels)
    return {"image_raw": images}, labels

filenames = ["/Users/honglan/Desktop/train_output.tfrecords"]  # replace the filenames with your own path
dataset = tf.data.TFRecordDataset(filenames)
print("DATASET", dataset)

# Use `Dataset.map()` to build a pair of a feature dictionary and a label
# tensor for each example.
dataset = dataset.map(parser)
print("DATASET_1", dataset)
dataset = dataset.shuffle(buffer_size=10000)
print("DATASET_2", dataset)
dataset = dataset.batch(32)
print("DATASET_3", dataset)
dataset = dataset.repeat(num_epochs)
print("DATASET_4", dataset)

iterator = dataset.make_one_shot_iterator()

# `features` is a dictionary in which each value is a batch of values for
# that feature; `labels` is a batch of labels.
features, labels = iterator.get_next()
print("FEATURES", features)
print("LABELS", labels)
print("SESS_RUN_LABELS \n", sess.run(labels))

tf.data.TFRecordDataset() & Initializable iterator

make_initializable_iterator() creates an initializable iterator: you must run the explicit iterator.initializer operation before using it. In return, an initializable iterator lets you switch between training and validation data. The code below combines it with the MNIST dataset.

# Validate tf.data.TFRecordDataset() using make_initializable_iterator()
# In order to switch between train and validation data
num_epochs = 2
num_class = 10

def parser(record):
    keys_to_features = {
        "image_raw": tf.FixedLenFeature((), tf.string, default_value=""),
        "pixels": tf.FixedLenFeature((), tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
        "label": tf.FixedLenFeature((), tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
    }
    parsed = tf.parse_single_example(record, keys_to_features)

    # Parse the string into an array of pixels corresponding to the image
    images = tf.decode_raw(parsed["image_raw"], tf.uint8)
    images = tf.reshape(images, [28, 28, 1])
    labels = tf.cast(parsed["label"], tf.int32)
    labels = tf.one_hot(labels, num_class)
    pixels = tf.cast(parsed["pixels"], tf.int32)
    print("IMAGES", images)
    print("LABELS", labels)
    return {"image_raw": images}, labels

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parser)  # Parse the record into tensors
# print("DATASET", dataset)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
print("DATASET", dataset)

iterator = dataset.make_initializable_iterator()
features, labels = iterator.get_next()
print("ITERATOR", iterator)
print("FEATURES", features)
print("LABELS", labels)

# Initialize `iterator` with training data.
training_filenames = ["/Users/honglan/Desktop/train_output.tfrecords"]  # replace the filenames with your own path
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
print("TRAIN\n", sess.run(labels))
# print(sess.run(features))

# Initialize `iterator` with validation data.
validation_filenames = ["/Users/honglan/Desktop/val_output.tfrecords"]  # replace the filenames with your own path
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
print("VAL\n", sess.run(labels))

tf.data.TextLineDataset() & Reinitializable iterator

tf.data.TextLineDataset() takes the path of a csv or txt source file as input. The iterator used here is a reinitializable iterator. The official definition follows; the implementation code with the MNIST dataset comes right after.

A reinitializable iterator can be initialized from multiple different Dataset objects. For example, you might have a training input pipeline that applies random perturbations to the input images to improve generalization, and a validation input pipeline that evaluates predictions on unmodified data. These pipelines will typically use different Dataset objects that have the same structure (i.e. the same types and compatible shapes for each component).

# Validate tf.data.TextLineDataset() using a Reinitializable iterator
# In order to switch between train and validation data
def decode_line(line):
    # Decode the line to tensor
    record_defaults = [[1.0] for col in range(785)]
    items = tf.decode_csv(line, record_defaults)
    features = items[1:785]
    label = items[0]

    features = tf.cast(features, tf.float32)
    features = tf.reshape(features, [28, 28, 1])
    label = tf.cast(label, tf.int64)
    label = tf.one_hot(label, num_class)
    return features, label

def create_dataset(filename, batch_size=32, is_shuffle=False, n_repeats=0):
    """create dataset for train and validation dataset"""
    dataset = tf.data.TextLineDataset(filename).skip(1)
    if n_repeats > 0:
        dataset = dataset.repeat(n_repeats)  # for train
    # dataset = dataset.map(decode_line).map(normalize)
    dataset = dataset.map(decode_line)  # decode and normalize
    if is_shuffle:
        dataset = dataset.shuffle(10000)  # shuffle
    dataset = dataset.batch(batch_size)
    return dataset

training_filenames = ["/Users/honglan/Desktop/train.csv"]  # replace the filenames with your own path
validation_filenames = ["/Users/honglan/Desktop/val.csv"]  # replace the filenames with your own path

# Create different datasets
training_dataset = create_dataset(training_filenames, batch_size=32,
                                  is_shuffle=True, n_repeats=num_epochs)  # train_filename
validation_dataset = create_dataset(validation_filenames, batch_size=32,
                                    is_shuffle=True, n_repeats=num_epochs)  # val_filename

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
features, labels = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Using the reinitializable iterator to alternate between training and validation.
sess.run(training_init_op)
print("TRAIN\n", sess.run(labels))
# print(sess.run(features))

# Reinitialize `iterator` with validation data.
sess.run(validation_init_op)
print("VAL\n", sess.run(labels))

tf.data.TextLineDataset() & Feedable iterator

The data is read the same way as in the previous part, with tf.data.TextLineDataset(). The iterator used here is a feedable iterator, which can be used together with tf.placeholder to select which Iterator to use on each call to tf.Session.run, via the familiar feed_dict mechanism. tf.data.Iterator.from_string_handle defines a feedable iterator that can switch between the two datasets. The code with the MNIST dataset follows.

# Validate tf.data.TextLineDataset() using two different iterators
# In order to switch between train and validation data
def decode_line(line):
    # Decode the line to tensor
    record_defaults = [[1.0] for col in range(785)]
    items = tf.decode_csv(line, record_defaults)
    features = items[1:785]
    label = items[0]

    features = tf.cast(features, tf.float32)
    features = tf.reshape(features, [28, 28])
    label = tf.cast(label, tf.int64)
    label = tf.one_hot(label, num_class)
    return features, label

def create_dataset(filename, batch_size=32, is_shuffle=False, n_repeats=0):
    """create dataset for train and validation dataset"""
    dataset = tf.data.TextLineDataset(filename).skip(1)
    if n_repeats > 0:
        dataset = dataset.repeat(n_repeats)  # for train
    # dataset = dataset.map(decode_line).map(normalize)
    dataset = dataset.map(decode_line)  # decode and normalize
    if is_shuffle:
        dataset = dataset.shuffle(10000)  # shuffle
    dataset = dataset.batch(batch_size)
    return dataset

training_filenames = ["/Users/honglan/Desktop/train.csv"]  # replace the filenames with your own path
validation_filenames = ["/Users/honglan/Desktop/val.csv"]  # replace the filenames with your own path

# Create different datasets
training_dataset = create_dataset(training_filenames, batch_size=32,
                                  is_shuffle=True, n_repeats=num_epochs)  # train_filename
validation_dataset = create_dataset(validation_filenames, batch_size=32,
                                    is_shuffle=True, n_repeats=num_epochs)  # val_filename

# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()

# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())

# Using different handles to alternate between training and validation.
print("TRAIN\n", sess.run(labels, feed_dict={handle: training_handle}))
# print(sess.run(features))

# Initialize `iterator` with validation data.
sess.run(validation_iterator.initializer)
print("VAL\n", sess.run(labels, feed_dict={handle: validation_handle}))

Summary

  • Processing data via tfrecords is noticeably faster
  • You can choose whichever iterator style suits your own preprocessing needs
  • The same tf.data APIs work just as well for single-machine training
