TensorFlow Estimator for Deep CTR: DeepFM/NFM/AFM/FNN/PNN

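Deep learning is being applied more and more to CTR prediction, and new models keep appearing. After sorting out how these models relate to one another in "Looking at f(x) Design from the CTR Prediction Problem: DNN Edition", I kept thinking about how to put them into production. After several months of investigation, I found the following problems with what is currently available: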

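* Open-source implementations are mostly written by academics, and there is still a wide gap between them and industrial use
* Models are implemented with heavy use of low-level APIs; implementations differ wildly from version to version, the code is bloated and hard to follow, and porting it is costly
* They run on a single machine and cannot keep up with industrial workloads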

I explored these problems and worked out a workable solution with the following features:

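* Data is read with the Dataset API, with parallel parsing and prefetching
* f(x) is implemented in an Estimator model_fn, so moving to another algorithm is easy: only the f(x) part of model_fn needs to be rewritten
* Both distributed training and single-machine multi-threaded training are supported
* The model can be exported and served online with TensorFlow Serving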

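Following industry practice, a complete machine learning project should include five parts: a feature framework, a training framework, a serving framework, an evaluation framework and a monitoring framework. Only the first three are discussed here.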

Feature framework -- logs in, samples out

The experiments use the Criteo dataset; the feature engineering follows: github.com/PaddlePaddle

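#1 Continuous features: remove outliers, then normalize
#2 Categorical features: drop low-frequency values, then encode all features into one unified id space (the encoding must be saved, because online prediction needs it)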
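A minimal sketch of these two steps in plain Python. The per-field clipping cap and the min_count threshold are hypothetical values, not the ones used in the PaddlePaddle pipeline. Because model_fn later reshapes every sample to exactly field_size features, this sketch maps each categorical field's dropped low-frequency values to a shared per-field "<rare>" id instead of deleting them; that bucket is my assumption. The resulting feature map is what has to be saved for online prediction.

```python
import json
from collections import Counter

def normalize_continuous(value, cap):
    """Clip an outlier into [0, cap], then scale into [0, 1]; `cap` is a hypothetical per-field bound."""
    return min(max(value, 0.0), cap) / cap

def build_feature_map(num_cont_fields, cat_columns, min_count):
    """One unified id space: ids 1..num_cont_fields are the continuous fields, then each
    categorical field gets an assumed '<rare>' id plus one id per value seen >= min_count times."""
    feature_map = {f"I{i + 1}": i + 1 for i in range(num_cont_fields)}
    next_id = num_cont_fields + 1
    for field, values in cat_columns.items():
        feature_map[f"{field}:<rare>"] = next_id
        next_id += 1
        counts = Counter(values)
        for value in sorted(v for v, n in counts.items() if n >= min_count):
            feature_map[f"{field}:{value}"] = next_id
            next_id += 1
    return feature_map

def encode_row(label, cont_values, cat_values, feature_map):
    """Emit one libsvm line '<label> id:val ...'; categorical features always carry val=1."""
    parts = [str(label)]
    for i, value in enumerate(cont_values):
        fid = feature_map[f"I{i + 1}"]
        parts.append(f"{fid}:{value:g}")
    for field, value in cat_values.items():
        fid = feature_map.get(f"{field}:{value}", feature_map[f"{field}:<rare>"])
        parts.append(f"{fid}:1")
    return " ".join(parts)

def save_feature_map(feature_map, path):
    """Persist the encoding so the online service applies exactly the same ids."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(feature_map, f)
```

For example, with two continuous fields and the categorical columns {"C1": ["a", "a", "b"], "C2": ["x", "x", "x"]} at min_count=2, the value "b" falls into the C1 "<rare>" bucket (id 3), while "a" and "x" get their own ids.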

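Modeling large-scale categorical features is exactly where DNNs have the edge in CTR prediction. Papers mostly focus on how to embed ID-type features and rarely discuss continuous features, which can be handled in roughly three ways: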

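--No embedding
  |1--concat[continuous, emb_vec], then feed into fc layers
--With embedding
  |2--discretize, then embed
  |3--like the FM second-order part: embed everything uniformly as <id, val>, with val=1.0 for categorical features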
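To make approaches 2 and 3 concrete, here is a small pure-Python sketch. The embedding table, ids and bucket boundaries are made-up illustration values. Approach 3 looks up one vector per feature id and multiplies it by the feature value, which is what the Second-order block of model_fn does with `tf.multiply(embeddings, feat_vals)`. Approach 2 instead turns a continuous value into a bucket index that is then embedded like any categorical id.

```python
import bisect

def scaled_field_embeddings(feat_ids, feat_vals, table):
    """Approach 3: look up the vector for each id and scale it by the feature value.
    Categorical features carry val=1.0, so their vectors pass through unchanged."""
    return [[val * w for w in table[fid]] for fid, val in zip(feat_ids, feat_vals)]

def bucketize(value, boundaries):
    """Approach 2: map a continuous value to a bucket index using sorted boundaries;
    a value equal to a boundary falls into the bucket to its right."""
    return bisect.bisect_right(boundaries, value)
```

Under approach 3 all values of one continuous field share a single direction in embedding space and differ only in length; approach 2 gives every bucket its own vector, at the cost of having to choose the boundaries.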

To keep the model design simple and uniform, I use the third approach; interested readers can try the first two and compare the results.

Training framework -- samples in, model out

DeepFM/wide_n_deep/NFM/AFM/FNN/PNN are implemented so far. Taking DeepFM as an example, let's look at how to implement input_fn and model_fn with the TensorFlow Estimator and Dataset APIs:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Sample line in libsvm format: <label> <feat_id>:<feat_val> ...
# 1 1:0.5 2:0.03519 3:1 4:0.02567 7:0.03708 8:0.01705 9:0.06296 10:0.18185 11:0.02497 12:1 14:0.02565 15:0.03267 17:0.0247 18:0.03158 20:1 22:1 23:0.13169 24:0.02933 27:0.18159 31:0.0177 34:0.02888 38:1 51:1 63:1 132:1 164:1 236:1
def input_fn(filenames, batch_size=32, num_epochs=1, perform_shuffle=False):
    print('Parsing', filenames)

    def decode_libsvm(line):
        # Split on spaces: the first token is the label, the rest are "id:val" pairs
        columns = tf.string_split([line], ' ')
        labels = tf.string_to_number(columns.values[0], out_type=tf.float32)
        # Split every "id:val" pair on ':' and reshape into an [n, 2] string matrix
        splits = tf.string_split(columns.values[1:], ':')
        id_vals = tf.reshape(splits.values, splits.dense_shape)
        feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
        feat_ids = tf.string_to_number(feat_ids, out_type=tf.int32)
        feat_vals = tf.string_to_number(feat_vals, out_type=tf.float32)
        return {"feat_ids": feat_ids, "feat_vals": feat_vals}, labels

    # Extract lines from input files using the Dataset API; accepts one filename or a list of filenames
    dataset = tf.data.TextLineDataset(filenames).map(decode_libsvm, num_parallel_calls=10).prefetch(500000)  # multi-threaded preprocessing, then prefetch

    # Randomize input using a window of 256 elements (read into memory)
    if perform_shuffle:
        dataset = dataset.shuffle(buffer_size=256)

    # Shuffling before repeat keeps separate epochs from blending together
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)  # batch size to use

    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
```
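For checking the data format without a TensorFlow session, a pure-Python mirror of decode_libsvm and the batching step can help; the helper names below are hypothetical and not part of the repository. One consequence worth verifying this way: dataset.batch() only stacks examples of identical shape, and model_fn reshapes to [-1, field_size], so every line must contain exactly field_size id:val pairs.

```python
def decode_libsvm_line(line):
    """Parse '<label> id:val id:val ...' into ({'feat_ids', 'feat_vals'}, label)."""
    columns = line.split()
    label = float(columns[0])
    feat_ids, feat_vals = [], []
    for token in columns[1:]:
        fid, val = token.split(":")
        feat_ids.append(int(fid))
        feat_vals.append(float(val))
    return {"feat_ids": feat_ids, "feat_vals": feat_vals}, label

def batches(lines, batch_size):
    """Group parsed examples like dataset.batch(batch_size); the last batch may be smaller."""
    batch = []
    for line in lines:
        batch.append(decode_libsvm_line(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```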

```python
# FLAGS (batch_norm, optimizer) and batch_norm_layer are defined elsewhere in DeepFM.py
def model_fn(features, labels, mode, params):
    """Build model function f(x) for Estimator."""
    #------hyperparameters------
    field_size = params["field_size"]
    feature_size = params["feature_size"]
    embedding_size = params["embedding_size"]
    l2_reg = params["l2_reg"]
    learning_rate = params["learning_rate"]
    # list(...) so that len() works in Python 3, where map() returns an iterator
    layers = list(map(int, params["deep_layers"].split(',')))
    dropout = list(map(float, params["dropout"].split(',')))  # keep probabilities, one per layer

    #------build weights------
    FM_B = tf.get_variable(name='fm_bias', shape=[1], initializer=tf.constant_initializer(0.0))
    FM_W = tf.get_variable(name='fm_w', shape=[feature_size], initializer=tf.glorot_normal_initializer())
    FM_V = tf.get_variable(name='fm_v', shape=[feature_size, embedding_size], initializer=tf.glorot_normal_initializer())

    #------build features------
    feat_ids = features['feat_ids']
    feat_ids = tf.reshape(feat_ids, shape=[-1, field_size])
    feat_vals = features['feat_vals']
    feat_vals = tf.reshape(feat_vals, shape=[-1, field_size])

    #------build f(x)------
    with tf.variable_scope("First-order"):
        feat_wgts = tf.nn.embedding_lookup(FM_W, feat_ids)  # None * F
        y_w = tf.reduce_sum(tf.multiply(feat_wgts, feat_vals), 1)  # None

    with tf.variable_scope("Second-order"):
        embeddings = tf.nn.embedding_lookup(FM_V, feat_ids)  # None * F * K
        feat_vals = tf.reshape(feat_vals, shape=[-1, field_size, 1])
        embeddings = tf.multiply(embeddings, feat_vals)  # v_ij * x_i
        sum_square = tf.square(tf.reduce_sum(embeddings, 1))
        square_sum = tf.reduce_sum(tf.square(embeddings), 1)
        y_v = 0.5 * tf.reduce_sum(tf.subtract(sum_square, square_sum), 1)  # None

    with tf.variable_scope("Deep-part"):
        train_phase = (mode == tf.estimator.ModeKeys.TRAIN)  # used by batch norm
        deep_inputs = tf.reshape(embeddings, shape=[-1, field_size * embedding_size])  # None * (F*K)
        for i in range(len(layers)):
            deep_inputs = tf.contrib.layers.fully_connected(
                inputs=deep_inputs, num_outputs=layers[i],  # ReLU activation by default
                weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg), scope='mlp%d' % i)
            if FLAGS.batch_norm:
                # BN placed after ReLU: https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md#bn----before-or-after-relu
                deep_inputs = batch_norm_layer(deep_inputs, train_phase=train_phase, scope_bn='bn_%d' % i)
            if mode == tf.estimator.ModeKeys.TRAIN:
                # Apply dropout after all BN layers; dropout=0.8 means keep_prob=0.8 (drop_ratio=0.2)
                deep_inputs = tf.nn.dropout(deep_inputs, keep_prob=dropout[i])
                # Equivalent: tf.layers.dropout(inputs=deep_inputs, rate=1 - dropout[i], training=True)

        y_deep = tf.contrib.layers.fully_connected(
            inputs=deep_inputs, num_outputs=1, activation_fn=tf.identity,
            weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg), scope='deep_out')
        y_d = tf.reshape(y_deep, shape=[-1])

    with tf.variable_scope("DeepFM-out"):
        # Do not use tf.ones_like(labels) here: labels is None in PREDICT mode (and at export),
        # so predict/export would fail even though train/evaluate work.
        y_bias = FM_B * tf.ones_like(y_d, dtype=tf.float32)  # None
        y = y_bias + y_w + y_v + y_d
        pred = tf.sigmoid(y)

    predictions = {"prob": pred}
    export_outputs = {tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                      tf.estimator.export.PredictOutput(predictions)}
    # Provide an estimator spec for `ModeKeys.PREDICT`
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions, export_outputs=export_outputs)

    #------build loss------
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y, labels=labels)) + \
        l2_reg * tf.nn.l2_loss(FM_W) + \
        l2_reg * tf.nn.l2_loss(FM_V)
    # The fc layers' weights_regularizer terms are only collected, so add them explicitly
    loss += tf.losses.get_regularization_loss()

    # Provide an estimator spec for `ModeKeys.EVAL`
    eval_metric_ops = {"auc": tf.metrics.auc(labels, pred)}
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions, loss=loss, eval_metric_ops=eval_metric_ops)

    #------build optimizer------
    if FLAGS.optimizer == 'Adam':
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8)
    elif FLAGS.optimizer == 'Adagrad':
        optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate, initial_accumulator_value=1e-8)
    elif FLAGS.optimizer == 'Momentum':
        optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.95)
    elif FLAGS.optimizer == 'ftrl':
        optimizer = tf.train.FtrlOptimizer(learning_rate)
    else:
        raise ValueError('Unknown optimizer: %s' % FLAGS.optimizer)

    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    # Provide an estimator spec for `ModeKeys.TRAIN`
    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions, loss=loss, train_op=train_op)
```
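The Second-order block relies on the standard FM identity: the sum over pairs i<j of <v_i x_i, v_j x_j> equals 0.5 * sum_k [(sum_i v_ik x_i)^2 - sum_i (v_ik x_i)^2], which turns an O(F^2 K) computation into O(F K). A minimal stdlib check that the sum_square/square_sum form used for y_v matches the pairwise definition (the helper names are my own):

```python
import itertools
import math

def fm_second_order_pairwise(emb, vals):
    """Direct definition: sum over all pairs i<j of <v_i * x_i, v_j * x_j>."""
    scaled = [[x * w for w in v] for v, x in zip(emb, vals)]
    total = 0.0
    for a, b in itertools.combinations(scaled, 2):
        total += sum(p * q for p, q in zip(a, b))
    return total

def fm_second_order_fast(emb, vals):
    """Same quantity as y_v in model_fn: 0.5 * sum_k (sum_square - square_sum)."""
    scaled = [[x * w for w in v] for v, x in zip(emb, vals)]
    k = len(scaled[0])
    sum_square = [sum(row[d] for row in scaled) ** 2 for d in range(k)]
    square_sum = [sum(row[d] ** 2 for row in scaled) for d in range(k)]
    return 0.5 * sum(s - q for s, q in zip(sum_square, square_sum))

def check_agreement(emb, vals, tol=1e-9):
    """True when both formulations agree up to floating-point tolerance."""
    return math.isclose(fm_second_order_pairwise(emb, vals),
                        fm_second_order_fast(emb, vals), rel_tol=tol, abs_tol=tol)
```

With emb = [[1, 2], [3, 4], [5, 6]] and all values 1.0, both functions give 11 + 17 + 39 = 67.0.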

Once wrapped in an Estimator, it is very simple to use:

```shell
# train
python DeepFM.py --task_type=train --learning_rate=0.0005 --optimizer=Adam --num_epochs=1 --batch_size=256 --field_size=39 --feature_size=117581 --deep_layers=400,400,400 --dropout=0.5,0.5,0.5 --log_steps=1000 --num_threads=8 --model_dir=./model_ckpt/criteo/DeepFM/ --data_dir=../../data/criteo/
# predict
python DeepFM.py --task_type=infer --learning_rate=0.0005 --optimizer=Adam --num_epochs=1 --batch_size=256 --field_size=39 --feature_size=117581 --deep_layers=400,400,400 --dropout=0.5,0.5,0.5 --log_steps=1000 --num_threads=8 --model_dir=./model_ckpt/criteo/DeepFM/ --data_dir=../../data/criteo/
```
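Some of these flags end up in the `params` dict that model_fn reads, while others (optimizer, batch_norm) are read from a global FLAGS object. Below is a hedged sketch of the params side, using argparse as a stand-in for the script's own flag definitions. The embedding_size and l2_reg defaults are placeholders, because the commands above do not set them. The sketch also checks that --dropout lists one keep probability per --deep_layers entry, since model_fn indexes dropout[i] for every layer.

```python
import argparse

def build_params(argv):
    """Hypothetical flag handling: collect the values model_fn reads from `params`."""
    parser = argparse.ArgumentParser(description="DeepFM params sketch")
    parser.add_argument("--task_type", default="train", choices=["train", "infer", "export"])
    parser.add_argument("--field_size", type=int, required=True)
    parser.add_argument("--feature_size", type=int, required=True)
    parser.add_argument("--embedding_size", type=int, default=32)  # placeholder default
    parser.add_argument("--l2_reg", type=float, default=1e-4)      # placeholder default
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    parser.add_argument("--deep_layers", default="400,400,400")
    parser.add_argument("--dropout", default="0.5,0.5,0.5")        # keep probabilities
    # Flags such as --optimizer or --model_dir are left to the rest of the script
    args, _unknown = parser.parse_known_args(argv)

    layers = [int(x) for x in args.deep_layers.split(",")]
    keep_probs = [float(x) for x in args.dropout.split(",")]
    if len(keep_probs) != len(layers):
        raise ValueError("--dropout needs one keep probability per --deep_layers entry")
    if not all(0.0 < p <= 1.0 for p in keep_probs):
        raise ValueError("keep probabilities must lie in (0, 1]")

    params = {
        "field_size": args.field_size,
        "feature_size": args.feature_size,
        "embedding_size": args.embedding_size,
        "l2_reg": args.l2_reg,
        "learning_rate": args.learning_rate,
        "deep_layers": args.deep_layers,  # model_fn splits these strings itself
        "dropout": args.dropout,
    }
    return args.task_type, params
```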

Full code: lambdaji/tf_repos

Serving framework -- request in, pctr out

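TensorFlow Serving is a high-performance open-source library for serving machine learning models. It deploys trained models online and accepts external calls through a gRPC interface. Even more impressive, it supports hot model updates and automatic model version management. This means that once TensorFlow Serving is deployed, you hardly need to worry about the online service any more and can concentrate on offline model training.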

First, export model files that TF Serving can load:

```shell
python DeepFM.py --task_type=export --learning_rate=0.0005 --optimizer=Adam --batch_size=256 --field_size=39 --feature_size=117581 --deep_layers=400,400,400 --dropout=0.5,0.5,0.5 --log_steps=1000 --num_threads=8 --model_dir=./model_ckpt/criteo/DeepFM/ --servable_model_dir=./servable_model/
```

By default, versions are managed by timestamp, and the generated files look like this:

```
$ ls -lh servable_model/1517971230
|--saved_model.pb
|--variables
   |--variables.data-00000-of-00001
   |--variables.index
```
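Under its default version policy, TF Serving watches the model base path and serves the latest (numerically largest) version directory. This is why timestamp-named exports work well for hot updates: a newer export automatically becomes the served version. Below is a small sketch of that selection rule, plus a way to read the export time back from the directory name. These are hypothetical helpers, not TF Serving code.

```python
from datetime import datetime, timezone

def latest_version(entries):
    """Pick the numerically largest all-digit directory name, ignoring anything else."""
    versions = [e for e in entries if e.isdigit()]
    return max(versions, key=int) if versions else None

def version_time(version):
    """Interpret a timestamp-named version directory as a UTC time string."""
    return datetime.fromtimestamp(int(version), tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
```

For the export above, version_time("1517971230") gives "2018-02-07 02:40:30 UTC".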

Then write a client to send requests; here it is written in C++:

```cpp
// model_name, model_signature_name and _stub (the PredictionService stub) are set up elsewhere in the client
PredictRequest predictRequest;
PredictResponse response;
ClientContext context;

predictRequest.mutable_model_spec()->set_name(model_name);
predictRequest.mutable_model_spec()->set_signature_name(model_signature_name);  // serving_default

google::protobuf::Map<tensorflow::string, tensorflow::TensorProto>& inputs = *predictRequest.mutable_inputs();

// feature to tf request
std::vector<int64_t> ids_vec = {1,2,3,4,5,6,7,8,9,10,11,12,13,15,555,1078,17797,26190,26341,28570,35361,35613,
    35984,48424,51364,64053,65964,66206,71628,84088,84119,86889,88280,88283,100288,100300,102447,109932,111823};
std::vector<float> vals_vec = {0.05,0.006633,0.05,0,0.021594,0.008,0.15,0.04,0.362,0.1,0.2,0,0.04,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};

tensorflow::TensorProto feat_ids;
for (int64_t id : ids_vec) {
    feat_ids.add_int64_val(id);
}
feat_ids.mutable_tensor_shape()->add_dim()->set_size(1);  // batch_size
feat_ids.mutable_tensor_shape()->add_dim()->set_size(feat_ids.int64_val_size());  // field_size
feat_ids.set_dtype(tensorflow::DataType::DT_INT64);
inputs["feat_ids"] = feat_ids;

tensorflow::TensorProto feat_vals;
for (float val : vals_vec) {
    feat_vals.add_float_val(val);
}
feat_vals.mutable_tensor_shape()->add_dim()->set_size(1);  // batch_size
feat_vals.mutable_tensor_shape()->add_dim()->set_size(feat_vals.float_val_size());  // field_size
feat_vals.set_dtype(tensorflow::DataType::DT_FLOAT);
inputs["feat_vals"] = feat_vals;

Status status = _stub->Predict(&context, predictRequest, &response);  // check status.ok() before reading response
```
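The client above sends one sample, so both tensors have shape [1, 39]. To score several candidates in one request, the same row-major flattening applies, with the batch size as the first dimension. Below is a hypothetical helper that prepares the flat id and value lists and the shape before they are copied into add_int64_val/add_float_val and the two add_dim calls. It rejects samples of the wrong width, since model_fn reshapes to [-1, field_size].

```python
def pack_batch(samples, field_size):
    """Flatten (feat_ids, feat_vals) samples row-major and return them with
    the [batch_size, field_size] shape expected by the serving signature."""
    flat_ids, flat_vals = [], []
    for ids, vals in samples:
        if len(ids) != field_size or len(vals) != field_size:
            raise ValueError("each sample needs exactly field_size ids and values")
        flat_ids.extend(int(i) for i in ids)
        flat_vals.extend(float(v) for v in vals)
    return flat_ids, flat_vals, [len(samples), field_size]
```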

Full code: lambdaji/tf_repos

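Production environments have strict latency and performance requirements, and a DNN needs far more computation than the simple table lookups of LR, so you often have to trade off accuracy against performance. This step really tests engineering skill. The figure below shows real data from putting the wide_n_deep model into production; we can see: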

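* Intercept, 15 ms: parsing the request packet, querying redis/tair, converting the feature format, logging and so on
* Slope, 0.5 ms: the time for one forward pass of one sample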
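The two numbers above suggest a simple linear model of request latency: intercept + slope × samples. Below is a back-of-the-envelope sketch that uses the reported 15 ms and 0.5 ms. The latency budget passed in is up to the caller; these are not additional measurements.

```python
INTERCEPT_MS = 15.0  # fixed per-request cost reported above
SLOPE_MS = 0.5       # per-sample forward-pass cost reported above

def request_latency_ms(num_samples):
    """Estimated latency of one request that scores num_samples samples."""
    return INTERCEPT_MS + SLOPE_MS * num_samples

def max_samples_within(budget_ms):
    """Largest number of samples per request whose estimate stays within budget_ms."""
    if budget_ms < INTERCEPT_MS:
        return 0
    return int((budget_ms - INTERCEPT_MS) // SLOPE_MS)
```

For example, 100 candidates are estimated at 65 ms, and a hypothetical 50 ms budget allows about 70 samples per request. Packing more samples into one request amortizes the fixed 15 ms.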

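An interesting observation: as traffic was ramped up further, the average latency went down instead of up. I suspect TF Serving does some cache-like optimization internally.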

Model Performance

I had planned to tune the hyperparameters before releasing this, but gave up after crashing the machine three times :(

The results shown in the figure are not good, possibly for several reasons:

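--Feature engineering was not done well (continuous features may not be suited to embedding; negative sampling, shuffling and so on)
--The model design has problems (I am not sure whether there are bugs)
--Tuning: the model did not converge to a good enough solution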

Anyone interested is welcome to fork it and tinker. Parallelizing across people is much faster than one person working alone.

Project: github.com/lambdaji/tf_

Finally, an early Happy New Year to everyone, and happy model "alchemy"!

References:

github.com/wnzhang/deep

github.com/Atomu2014/pr

github.com/hexiangnan/a

github.com/hexiangnan/n

github.com/ChenglongChe

zhuanlan.zhihu.com/p/32

zhuanlan.zhihu.com/p/28

