Getting Started with TensorFlow (5)

Hi everyone, I'm zyy. I'm a beginner and hobbyist in machine learning and deep learning, and I'd like to share my learning experience here so we can all exchange ideas. What I write is not guaranteed to be entirely correct, but it is definitely made of pitfalls I climbed out of step by step, experience already chewed over, ready for you to "absorb" directly.

My articles mainly cover the derivation and from-scratch implementation of various machine learning and deep learning algorithms, plus some small demo applications, and occasionally implementations of algorithms from papers.

All of the code that appears in my articles can be found on my GitHub.

GitHub

Getting Started with TensorFlow (5)

After all this writing, I've finally reached the point of actually building a neural network. Let's start with some basics.

First, let's take a look at what softmax is.

This is a multi-class softmax model: the bottom is the input layer, the middle is a layer of nonlinear transformations, and the top is the output layer. Here $u_i = x^{\mathrm{T}}\theta_i$ and $\pi_i = \frac{\exp(u_i)}{\sum_{j=1}^{N}\exp(u_j)}$, so the final output is a probability for each class, which is why it is called softmax. Because it classifies through several nonlinear transformations, it generally performs better than an ordinary linear classifier.
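To make those two formulas concrete, here is a tiny numeric sketch (the input $x$ and the parameters $\theta$ are made-up numbers, not taken from the post's data):

```python
import numpy as np

x = np.array([1.0, 2.0])              # one input sample
theta = np.array([[0.5, -1.0, 0.2],   # one column of parameters per class: theta_1, theta_2, theta_3
                  [0.1,  0.3, -0.4]])

u = x @ theta                         # class scores u_i = x^T theta_i
u = u - u.max()                       # shift by the max for numerical stability (doesn't change the result)
pi = np.exp(u) / np.exp(u).sum()      # pi_i = exp(u_i) / sum_j exp(u_j)
print(pi, pi.sum())                   # probabilities of the 3 classes; they sum to 1.0
```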

Now for neural networks. Simply put, a neural network stacks layer upon layer of nonlinear transformations (that is the job of the activation function), repeatedly mapping and classifying, until data that was hard to classify has been transformed into a space where it is easy to classify.

Overall it feels like the figure above. As you keep adding hidden layers and nodes, the network's ability to abstract and map the data grows, which is why, when tuning a deep learning model, we sometimes try adding layers and nodes to boost its classification power. Seen from another angle, deep learning is also a kind of "deep" ensemble learning that keeps using weak classifiers to approach the target.
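As a minimal, hand-wired illustration of this idea (an assumed toy example, not taken from the post's repo): XOR is not linearly separable, but a single hidden layer of nonlinear units already maps it into a space where one linear output unit can separate it.

```python
import numpy as np

def step(z):
    # a hard-threshold activation, just for this illustration
    return (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the four XOR inputs

# hidden layer: h1 fires for "x1 OR x2", h2 fires for "x1 AND x2"
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = step(X @ W1 + b1)

# output layer: y fires for "h1 AND NOT h2", which is exactly XOR(x1, x2)
w2 = np.array([1.0, -1.0])
b2 = -0.5
y = step(H @ w2 + b2)
print(y)   # [0. 1. 1. 0.] -- separable once the hidden layer has remapped the data
```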

Back to neural networks: we use the forward pass (feedforward) to compute predictions, which simply means pushing the data through the network layer by layer until we get the output. During training, we define a loss function and use its gradient in backpropagation to correct the weights layer by layer, working backwards. Let's look at a fairly typical example:

As shown in the figure, we take a slice of the network: $X$ is a three-dimensional input, $a_i$ are the outputs of this layer, and the forward pass is simply $a = \mathrm{sigmoid}(WX + b)$.

Here $\mathrm{sigmoid}(\cdot)$ is an activation function; if you haven't seen it, look up what it looks like (this is, after all, not a from-zero tutorial). $W$ and $b$ should be self-explanatory; to fix the notation, $b$ has shape $[\text{output}, 1]$ and $W = [w_{\text{output},\,\text{input}}]$, one row per output and one column per input. We write the error leaving this layer (that is, the amount to be corrected) as $E_i$. During training we correct the weights and the other quantities as follows:

(We take $E_1$, $w_{11}$, $b_1$ and $x_1$ as the example.)

$$\frac{\partial E_1}{\partial w_{11}} = \frac{\partial E_1}{\partial a_1}\frac{\partial a_1}{\partial f}\frac{\partial f}{\partial w_{11}} = \frac{\partial E_1}{\partial a_1}\cdot f'\cdot x_1$$

$$\frac{\partial E_1}{\partial b_1} = \frac{\partial E_1}{\partial a_1}\frac{\partial a_1}{\partial f}\frac{\partial f}{\partial b_1} = \frac{\partial E_1}{\partial a_1}\cdot f'\cdot 1$$

$$\frac{\partial E_1}{\partial x_1} = \frac{\partial E_1}{\partial a_1}\frac{\partial a_1}{\partial f}\frac{\partial f}{\partial x_1} = \frac{\partial E_1}{\partial a_1}\cdot f'\cdot w_{11}$$

Here $f = w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1$ is the pre-activation and $f'$ is the derivative of the activation function evaluated at it (this matches the `dC/dw = C*f'*x`, `dC/db = C*f'*1`, `dC/dx = C*f'*w` comments in the code below).

The $\mathrm{sigmoid}(\cdot)$ function has a very nice property: its derivative can be written directly in terms of its own output, $f' = a(1-a)$. But that is also its weakness: this factor is never larger than $0.25$, so as the error is propagated down through layer after layer, the gradient keeps getting smaller and smaller. The first quantity above is the derivative with respect to the weight $w_{11}$, i.e. the correction to $w_{11}$ implied by the error, and likewise for $b_1$ and $x_1$. Someone will ask: isn't $x_1$ an input? Why compute a correction for it, and does correcting an input even make sense? It does! Of course it does, because although $x_1$ looks like the data input of this layer, it is at the same time the output of the previous layer! The previous layer's error is exactly this layer's correction for $x_1$. Repeating this computation and correction layer after layer, the whole network can be steered in the direction we want.
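Here is a minimal numeric sketch of the three update rules above for a single sigmoid unit with a squared-error loss (all numbers are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, x, t, eta = 0.5, 0.1, 1.5, 1.0, 0.1   # weight, bias, input, target, learning rate

a = sigmoid(w * x + b)         # forward pass: a = sigmoid(f) with f = w*x + b
dE_da = a - t                  # dE/da for the loss E = 1/2 * (a - t)^2
f_prime = a * (1.0 - a)        # sigmoid's derivative written through its output; at most 0.25

grad_w = dE_da * f_prime * x   # dE/dw = dE/da * f' * x
grad_b = dE_da * f_prime       # dE/db = dE/da * f' * 1
grad_x = dE_da * f_prime * w   # dE/dx = dE/da * f' * w  -> handed back to the previous layer as its error

w -= eta * grad_w              # gradient-descent correction of the weight
b -= eta * grad_b              # ... and of the bias
print(grad_w, grad_b, grad_x)
```

One factor of `f_prime` appears for every layer the error travels through, which is exactly why the signal shrinks in deep sigmoid networks.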

Oh, that reminds me of something. A couple of days ago, while reading someone else's blog, I came across this picture:

My goodness, a true role model for sneaking a cheeky picture into an article. If that picture manages to spark even a little interest in the $\mathrm{sigmoid}(\cdot)$ function, that would be just great...

Python version

I wrote quite a lot this time: for both the Python and the TensorFlow version I picked two examples, one on classification data I made myself and one on the MNIST handwritten digit dataset. The Python versions are named NeuralNetwork1 and NeuralNetwork2, and this time the comments inside really are very detailed, so download them and read through carefully. Without further ado, here is the code.

```python
# author zyyFTD
# Github: https://github.com/YuyangZhangFTD/zyy_ML-DL
"""
    This code is for python3.
    Neural Network using numpy.
    For convenience, all matrices and vectors should be np.mat.
    ================================== The whole architecture ===========================================
    input layer   :  x1 x2 x3 ... x784   ==> x[784, 1]
            ||  full connection ==> w1[512, 784] + b1[512, 1]
    hidden layer1 :  a1 a2 a3 ... a512   ==> a[512, 1]
            ||  full connection ==> w2[256, 512] + b2[256, 1]
    hidden layer2 :  b1 b2 b3 ... b256   ==> b[256, 1]
            ||  full connection ==> w3[128, 256] + b3[128, 1]
    hidden layer3 :  c1 c2 c3 ... c128   ==> c[128, 1]
            ||  full connection ==> w4[64, 128] + b4[64, 1]
    hidden layer4 :  d1 d2 d3 ... d64    ==> d[64, 1]
            ||  softmax ==> w5[64, 10] + b5[10, 1]
    output layer  :  y1 y2 y3 ... y10    ==> y[10, 1]
    ====================================== One part of the network ======================================
        output        f()      weight       input      bias
        |y1|               |w11 w12 w13|     |x1|       |b1|
        |y2|       =  f(   |w21 w22 w23|  *  |x2|   +   |b2|  )    f() is activation function
        |y3|               |w31 w32 w33|     |x3|       |b3|
    The correction is C.
    If the loss function is \frac{1}{2}(\hat{y}-y)^2, C = \hat{y}-y.
    f' is the derivation of f.
    the gradient:
        dC/dw = C*f'*x
        dC/db = C*f'*1
        dC/dx = C*f'*w
    The weight and bias can be updated with gradient of w and b.
    And the correction of the previous layer is the gradient of x in this layer.
    ========================================== softmax ==================================================
        output        f()      weight       input      bias
        |y1|               |w11 w12 w13|     |x1|       |b1|
        |y2|       =  f(   |w21 w22 w23|  *  |x2|   +   |b2|  )    f() is activation function
        |y3|               |w31 w32 w33|     |x3|       |b3|

        |z1| = y1/sum(y)        |y1|
        |z2| = y2/sum(y)    y = |y2|
        |z3| = y3/sum(y)        |y3|
"""
import numpy as np
import pandas as pd

# ================================= init parameter ========================================
layer_n = 3
input_size = 2
hidden1_size = 5
hidden2_size = 5
# hidden3_size = 128
hidden4_size = 5
output_size = 4
learning_rate = 0.1
# batch_size = 32    # batch size should be 2^n, for GPU training.
epoch_n = 1001
# =========================================================================================


# ================================= all function ==========================================
def one_hot(para_y, para_output):
    # label one-hot encoding [2]==>[0,1,0,0,0,0,0,0,0,0]
    tmp = []
    for ii in range(len(para_y)):
        data = [0] * para_output
        data[int(para_y[ii])] = 1
        tmp.append(data)
    return np.mat(tmp)


def one_encode(para_x):
    # the input data [1-256]==>[1]
    tmp = []
    for ii in range(len(para_x)):
        tmp.append(list(map(lambda x: 1 if x > 0 else 0, para_x[ii])))
    return np.mat(tmp)


def sigmoid(para_x, para_weight, para_bias):
    tmp = np.exp(-1*(para_weight * para_x + para_bias))
    return 1/(1+tmp)


def relu(para_x, para_weight, para_bias):
    tmp = para_weight * para_x + para_bias
    return np.mat(list(map(lambda x: x if x > 0 else 0, [float(x) for x in tmp]))).T


def softmax(para_x, para_weight, para_bias):
    tmp = np.exp(para_weight * para_x + para_bias)
    return tmp/sum(tmp)


def feedforward(para_weight, para_bias, para_value, para_input_vector,
                activation_function=sigmoid, output_function=softmax):
    # the keys of dict must be ordered
    n = len(para_weight.keys())     # get number of layers
    for ii in range(n):
        # get x, w, b for each layer
        if ii == 0:
            tmp_x = para_input_vector
        else:
            tmp_x = para_value[ii-1]
        tmp_weight = para_weight[ii]
        tmp_bias = para_bias[ii]
        # y = f(wx+b)
        # the output layer is softmax layer
        if ii < n-1:
            para_value[ii] = activation_function(tmp_x, tmp_weight, tmp_bias)
        else:
            para_value[ii] = output_function(tmp_x, tmp_weight, tmp_bias)
    return para_value[n-1]


def loss_func(para_hat, para_true):
    # a row vector is a piece of data
    # para_hat  : [0.1, 0.2, ..., 0.1]
    # para_true : [ 0,   1,  ...,  0 ]
    # the sum of a row vector is 1.0
    # calculate the loss of the whole data, return loss and average loss
    return np.sum(np.power((para_hat - para_true), 2) * 0.5)


def backpropagation(para_weight, para_bias, para_value, para_true, para_input_vector,
                    para_eta=0.01, activation_function=sigmoid, loss_function=loss_func):
    n = len(para_weight.keys())     # get number of layers
    # the output layer
    # the derivation of loss function
    if loss_function != loss_func:
        # get the derivation of loss function which you set.
        # error =
        print("Define your own derivation of loss function")
        return None
    else:
        error = para_value[n-1] - para_true     # column vector
    para_weight[n-1] -= para_eta * error * para_value[n-2].T
    para_bias[n-1] -= para_eta * error
    tmp_delta = (para_eta * error.T * para_weight[n-1]).T
    # the hidden layer
    # the derivation of activation function
    if activation_function == sigmoid:
        def gradient(para_para_x):
            return np.multiply(para_para_x, (1 - para_para_x))
    elif activation_function == relu:
        gradient = one_encode
    else:
        # get the derivation of activation function which you set
        # gradient =
        print("Define your own derivation of activation function")
        return None
    for ii in range(n-1)[::-1]:
        tmp_delta = np.multiply(gradient(para_value[ii]), tmp_delta)
        if ii == 0:
            # para_weight[ii] -= para_eta * para_input_vector * tmp_delta.T
            para_weight[ii] -= para_eta * tmp_delta * para_input_vector.T
        else:
            # para_weight[ii] -= para_eta * para_value[ii-1] * tmp_delta.T
            para_weight[ii] -= para_eta * tmp_delta * para_value[ii-1].T
        para_bias[ii] -= para_eta * tmp_delta
        tmp_delta = para_eta * (tmp_delta.T * para_weight[ii]).T
    return para_weight, para_bias


def is_true(para_hat, para_true):
    if np.argmax(para_hat) == np.argmax(para_true):
        return 1
    else:
        return 0


def accuracy(para_hat, para_true):
    true_num = 0
    tmp = 0
    for ii in range(len(para_hat)):
        if np.argmax(para_hat[ii]) == np.argmax(para_true[ii]):
            true_num += 1
        tmp += is_true(para_hat[ii], para_true[ii])
    print(tmp)
    return true_num / len(para_hat)
# ========================================================================================


file = pd.read_csv("Labeled_Data_4cls_100.csv")
train_y = one_hot(file["label"].values, output_size)
train_x = np.mat(file.drop("label", axis=1))

# size_list = [input_size, hidden1_size, hidden2_size, hidden4_size, output_size]
size_list = [input_size, hidden2_size, hidden4_size, output_size]
# size_list = [input_size, hidden4_size, output_size]

weight = dict()
bias = dict()
tmp_value = dict()
for i in range(layer_n):
    weight[i] = np.mat(np.random.rand(size_list[i+1], size_list[i]))
    bias[i] = np.mat(np.random.rand(size_list[i+1], 1))
    tmp_value[i] = np.mat(np.random.rand(size_list[i+1], 1))

data_n = len(train_y)
for epoch_i in range(epoch_n):
    loss = 0
    sum_true = 0
    for ii in range(data_n):
        x_tmp = train_x[ii].T
        y_tmp = train_y[ii].T
        hat_y = feedforward(weight, bias, tmp_value, x_tmp)
        backpropagation(weight, bias, tmp_value, y_tmp, x_tmp, para_eta=learning_rate)
        loss += loss_func(hat_y, y_tmp)
        sum_true += is_true(hat_y, y_tmp)
    if epoch_i % 20 == 0:
        print("epoch_i :", epoch_i)
        print("loss :", loss)
        print("average accuracy :", sum_true/data_n)

print("predict all data: ")
hat_test = np.mat(np.ones([len(train_y), output_size]))
for ii in range(len(train_y)):
    hat_test[ii] = feedforward(weight, bias, tmp_value, train_x[ii].T).T
print(accuracy(hat_test, train_y))
```

TensorFlow version

```python
# author zyyFTD
# Github: https://github.com/YuyangZhangFTD/zyy_ML-DL
"""
    this code is for python3
    logistic regression with tensorflow
"""
import numpy as np
import tensorflow as tf
import pandas as pd

# parameter init
n_epochs = 100
eta = 0.01
n_input = 2
n_hidden1 = 5
n_hidden2 = 5
n_output = 4


# ========================================== function ====================================================
# You can write your own neural network model here.
# It is not a good style to define a model in this way.
# Define keys of weights and biases in a convenient way, then use loop to define each layer.
# By the way, batch normalization and other tricks can be applied here.
def my_nn(para_x, para_weights, para_biases):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(para_x, para_weights["h1"]), para_biases["b1"]))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, para_weights["h2"]), para_biases["b2"]))
    out_layer = tf.nn.softmax(tf.matmul(layer_2, para_weights["out"]) + para_biases["out"])
    return out_layer


def one_hot(para_y, para_output):
    # label one-hot encoding [2]==>[0,1,0,0,0,0,0,0,0,0]
    tmp = []
    for ii in range(len(para_y)):
        data = [0] * para_output
        data[int(para_y[ii])] = 1
        tmp.append(data)
    return np.mat(tmp)


def one_encode(para_x):
    # the input data [1-256]==>[1]
    tmp = []
    for ii in range(len(para_x)):
        tmp.append(list(map(lambda x: 1 if x > 0 else 0, para_x[ii])))
    return np.mat(tmp)
# =======================================================================================================

"""
# classification test
# ========================================== init parameter ==============================================
# read data
file = pd.read_csv("Labeled_Data_4cls_100.csv")
ys = one_hot(file["label"].values, n_output)
xs = np.mat(file.drop("label", axis=1))

weights = {
    "h1": tf.Variable(tf.random_normal([n_input, n_hidden1])),
    "h2": tf.Variable(tf.random_normal([n_hidden1, n_hidden2])),
    "out": tf.Variable(tf.random_normal([n_hidden2, n_output]))
}
biases = {
    "b1": tf.Variable(tf.random_normal([n_hidden1])),
    "b2": tf.Variable(tf.random_normal([n_hidden2])),
    "out": tf.Variable(tf.random_normal([n_output])),
}

X = tf.placeholder(tf.float32, [None, n_input])
Y = tf.placeholder(tf.float32, [None, n_output])

y_hat = my_nn(X, weights, biases)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=Y))
correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
optimizer = tf.train.AdamOptimizer(learning_rate=eta).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    prev_training_cost = 0.0
    for i_epoch in range(n_epochs):
        for (x, y) in zip(xs, ys):
            sess.run(optimizer, feed_dict={X: x, Y: y})
        if i_epoch % 20 == 0:
            training_cost = sess.run(accuracy, feed_dict={X: xs, Y: ys})
            print(training_cost)
"""

# mnist test
# ========================================== init parameter ==============================================
learning_rate = 0.01
n_input = 784
n_hidden1 = 256
n_hidden2 = 64
n_output = 10
epoch_n = 100
batch_size = 512    # batch size should be 2^n, for GPU training.
# ========================================================================================================
# read data
# train data
train = pd.read_csv("mnist_train_mine.csv")
train_y = one_hot(train.label.values, n_output)
train_x = one_encode(train.drop("label", 1).values)
n_train = len(train_x)
# test data
test = pd.read_csv("mnist_test_mine.csv")
test_y = one_hot(test.label.values, n_output)
test_x = one_encode(test.drop("label", 1).values)

weights = {
    "h1": tf.Variable(tf.random_normal([n_input, n_hidden1])),
    "h2": tf.Variable(tf.random_normal([n_hidden1, n_hidden2])),
    "out": tf.Variable(tf.random_normal([n_hidden2, n_output]))
}
biases = {
    "b1": tf.Variable(tf.random_normal([n_hidden1])),
    "b2": tf.Variable(tf.random_normal([n_hidden2])),
    "out": tf.Variable(tf.random_normal([n_output])),
}

X = tf.placeholder(tf.float32, [None, n_input])
Y = tf.placeholder(tf.float32, [None, n_output])

y_hat = my_nn(X, weights, biases)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=Y))
correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
optimizer = tf.train.AdamOptimizer(learning_rate=eta).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    prev_training_cost = 0.0
    for i_epoch in range(n_epochs):
        total_batch = int(n_train / batch_size)
        for i in range(total_batch):
            avg_cost = 0.
            batch_index = np.random.randint(n_train, size=[batch_size])
            batch_x = train_x[batch_index]
            batch_y = train_y[batch_index]
            _, c = sess.run([optimizer, cost], feed_dict={X: batch_x, Y: batch_y})
        if i_epoch % 20 == 0:
            training_cost = sess.run(accuracy, feed_dict={X: train_x, Y: train_y})
            print(training_cost)
    print("Accuracy:", accuracy.eval({X: test_x, Y: test_y}))
```

Summary

Reading back over today's post, it feels rather short, and quite a few things were not explained all that clearly, such as activation functions and ways to optimize softmax, but I think the big picture came across. Next time I can focus on activation functions, loss functions and evaluation metrics, and then walk through the functions commonly used in TensorFlow.

As of today, the TensorFlow series has only one or two more posts to go. Someone asked me: other people get an algorithm working in seven or eight lines of Python, so why did you write one or two hundred, and how do you have the nerve to show that to anyone? Here's the thing: most blog posts out there are just a read-the-data, pick-a-model, call-fit pipeline, and many beginners who read them learn only that you fit and then predict, without ever understanding why it works. Besides, some things really are better written yourself; only after you have rolled up your sleeves and done it do you actually gain something.

That said, I'm not encouraging you to write everything from scratch. Life is short, we use Python. The biggest difference between humans and animals is that humans use tools, after all. Next time I'll show you a short Keras example: a neural network in three lines, a CNN in ten...

