Kaggle in Practice: Click-Through Rate Prediction

Copyright notice: This article was originally published by 程世東 on Zhihu. Please credit the source when reposting: Kaggle in Practice: Click-Through Rate Prediction.

Please install TensorFlow 1.0 and Python 3.5.

Project repository: chengstone/kaggle_criteo_ctr_challenge-

Click-through rate (CTR) prediction estimates the probability that a user will click on an ad. By scoring every ad impression, we can surface the ads a user is most likely to click; it is one of the most important algorithms in advertising technology.

Downloading the dataset

This time we use the Criteo dataset from Kaggle's Display Advertising Challenge.

To download the dataset, run the following commands in a terminal (script path: ./data/download.sh):

wget --no-check-certificate s3-eu-west-1.amazonaws.com

tar zxf dac.tar.gz

rm -f dac.tar.gz

mkdir raw

mv ./*.txt raw/

After decompressing, train.txt is 11.7 GB and test.txt is 1.35 GB.

That is too much data, so we only use the first 1,000,000 rows of each file.

head -n 1000000 test.txt > test_sub100w.txt

head -n 1000000 train.txt > train_sub100w.txt

Then rename these files back to train.txt and test.txt, keeping them in the same location.

Data fields

Label

  • Target variable that indicates if an ad was clicked (1) or not (0).

I1-I13

  • A total of 13 columns of integer features (mostly count features).

C1-C26

  • A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.

The data contains a Label field indicating whether the ad was clicked, I1-I13 (13 numerical features, the dense input), and C1-C26 (26 categorical features, the sparse input).

Network model

The model consists of three subnetworks: an FFM (Field-aware Factorization Machine), an FM (Factorization Machine), and a DNN, where the FM branch itself is built from two components, GBDT and FM. Normally the preprocessing stage requires feature engineering such as crossing and combining features to find the ones that help prediction, which is real craftsmanship.

Here we skip that feature-engineering step, combine these components with a deep neural network, and let the model do the feature selection. FFM uses LibFFM, FM uses LibFM, and GBDT uses LightGBM; you could of course use XGBoost instead.

GBDT

After feeding in the training data, GBDT trains a number of trees. What we use is the leaf node output by each tree: these leaf indices are passed to FM as categorical features. For background on this use of decision trees, see Facebook's paper Practical Lessons from Predicting Clicks on Ads at Facebook.
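A minimal sketch of the idea, on synthetic data rather than the project's real pipeline (the actual leaf extraction happens in generat_lgb2fm_data later in this post):

import numpy as np
import lightgbm as lgb

# Toy data: 500 samples, 10 features, a binary label.
X = np.random.rand(500, 10)
y = (np.random.rand(500) > 0.75).astype(int)

gbm = lgb.train({'objective': 'binary', 'num_leaves': 30, 'verbose': -1},
                lgb.Dataset(X, y), num_boost_round=32)

# pred_leaf=True returns an array of shape (n_samples, n_trees);
# entry [i, t] is the index of the leaf that sample i falls into in tree t.
# Each column is then a categorical feature we can hand to FM.
leaf_idx = gbm.predict(X, pred_leaf=True)
print(leaf_idx.shape)  # (500, 32)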

FM

FM addresses learning feature interactions when the data is large and sparse. Let's first look at the formula (considering only the second-order polynomial case): n is the number of features, x_i is the value of the i-th feature, and w_0, w_i, w_{ij} are the model parameters.
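The second-order polynomial model referred to here is (written out in standard notation, since the formula image from the original post is not available):

y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij} x_i x_j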

As the formula shows, this adds the interaction terms x_i x_j on top of a linear model; of course an interaction is only meaningful when both x_i and x_j are nonzero. In practice, however, training the interaction parameters is hard: the input data is generally sparse, so x_i x_j is zero most of the time, and the interaction parameter w_{ij} can only be learned meaningfully from samples where both features are nonzero.

For example, among shopping-related features, women may care more about cosmetics or jewelry while men may care more about sporting goods or electronics, so learning feature interactions is clearly worthwhile. But a product category feature may have hundreds or thousands of values; we usually one-hot encode categorical features, so one feature turns into hundreds of dimensions, and together with the other categorical features the input feature space explodes. Data sparsity is therefore an unavoidable challenge in real problems.

To solve the problem of training the second-order parameters, matrix factorization is introduced. In the previous post we discussed a movie recommender system: we built user feature vectors and movie feature vectors, and the dot product of the two gave a user's rating for a movie. Multiplying the user feature matrix by the movie feature matrix yields the rating matrix of all users for all movies.

Looking at that process in reverse, a rating matrix can be factorized into a user matrix and a movie matrix, and each entry of the rating matrix plays the same role as the interaction parameter w_{ij} discussed above.

For the parameter matrix W, we use matrix factorization to decompose each parameter w_{ij} into the dot product of two vectors (called latent vectors). The matrix factorizes as W = V^T V, and each parameter becomes w_{ij} = \langle v_i, v_j \rangle, where v_i is the latent vector of the i-th feature. The second-order FM formula then becomes:
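y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j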

This is the core idea of the FM model.

We feed the leaf nodes output by GBDT as training data to the FM model, so for this branch both GBDT and FM need to be trained. As you can see, this CTR network is considerably more complex than before, with more factors and hyperparameters affecting the final result. Training of the FM and GBDT components is described below.

FFM

Next we train the FFM model. FFM extends FM with the notion of a Field: for example, a product field is one categorical feature that can be split into many different features, but all of those features belong to the same Field; in other words, all values of one categorical feature can be placed in the same Field.

You can think of this as a one-to-many relationship. For instance, occupation is one feature; after one-hot encoding it becomes N features, but all N of them still belong to occupation, so occupation is one Field.

We learn the latent vectors through feature interactions, so each feature x_i learns a separate latent vector v_{i,f_j} for every Field f_j of the other features. In other words, the latent vector depends not only on the feature but also on the Field. The model formula is:
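The standard FFM interaction term (reproduced here in its usual form, since the formula image from the original post is not available; f_{j_2} denotes the field of feature j_2) is:

\phi(w, x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} \langle w_{j_1, f_{j_2}}, w_{j_2, f_{j_1}} \rangle x_{j_1} x_{j_2}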

DNN

Now the DNN part. The input is split in two: numerical features (dense input) and categorical features (sparse input). We still do not use one-hot encoding; instead, the categorical features go through an embedding layer, producing embedding vectors that are concatenated with the numerical features and passed through three fully connected layers with ReLU activations. The output of the third fully connected layer is then concatenated with the outputs of the FFM and FM fully connected layers and passed to one final fully connected layer.

The target Label indicates whether the ad was clicked, with only two states: 1 (clicked) and 0 (not clicked). So the last layer of the network performs logistic regression: the final fully connected layer uses a sigmoid activation, producing the probability that the ad is clicked.

We use LogLoss as the loss function and FTRL as the learning algorithm.
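For reference (this is the standard definition, not reproduced from the original post), the LogLoss over N samples with labels y_i and predicted probabilities p_i is:

LogLoss = -\frac{1}{N} \sum_{i=1}^{N} [ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) ]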

The paper on FTRL: Ad_click_prediction_a_view_from_the_trenches

I modified the LibFFM and LibFM code, so please use my versions from the repository.

Preprocessing the dataset

  • Generate the input for the neural network
  • Generate the input for FFM
  • Generate the input for GBDT

First we preprocess the inputs for the DNN, FFM, and GBDT. For the numerical features, we map I1-I13 to decimals between 0 and 1. For the categorical features, we drop any value used fewer than cutoff (a hyperparameter) times, keep the frequently used values as the features of that categorical field, and then number those features within each field.

For example, suppose there are two categorical fields C1 and C2. C1 has values a (more than cutoff occurrences), b (fewer than cutoff), and c (more than cutoff); C2 has values x and y (both above cutoff). The retained features are then C1: a, c and C2: x, y. Numbering within each field, the feature ids of a and c under C1 are 0 and 1; under C2, x and y are also 0 and 1.
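A minimal sketch of this per-field indexing (a hypothetical helper for illustration only; the project's actual implementation is the preprocess.py code shown later):

from collections import Counter

def build_field_dicts(rows, cutoff):
    """rows: list of dicts like {'C1': 'a', 'C2': 'x', ...}.
    Returns, per field, a mapping value -> id, keeping only values seen
    at least `cutoff` times; ids restart from 0 in every field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts.setdefault(field, Counter())[value] += 1
    dicts = {}
    for field, cnt in counts.items():
        kept = [v for v, c in cnt.most_common() if c >= cutoff]
        dicts[field] = {v: i for i, v in enumerate(kept)}  # ids restart at 0 per field
    return dicts

rows = [{'C1': 'a', 'C2': 'x'}, {'C1': 'c', 'C2': 'y'}, {'C1': 'a', 'C2': 'x'}]
print(build_field_dicts(rows, cutoff=1))  # {'C1': {'a': 0, 'c': 1}, 'C2': {'x': 0, 'y': 1}}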

The categorical input is prepared differently for FFM and for GBDT, so let's go through each.

GBDT

GBDT is the simpler case: for C1-C26 we just use each field's own feature id as the input. The GBDT input format is: Label I1-I13 C1-C26, so an actual row might look like: 0 decimal1 decimal2 ~ decimal13 1 (the C1 feature id) 0 (the C2 feature id) ~ the C26 feature id. Here the C1 feature id being 1 means the C1 value on this row is c, and the C2 value is x.

Here is an actual generated row:

0 0.05 0.004983 0.05 0 0.021594 0.008 0.15 0.04 0.362 0.166667 0.2 0 0.04 2 3 0 0 1 1 0 3 1 0 0 0 0 3 0 0 1 4 1 3 0 0 2 0 1 0

Sorry, my writing is not great; if the text above leaves you confused, just read the code :)

FFM

The FFM input is more involved; see the explanation on the official GitHub, excerpted below:

It is important to understand the difference between field and feature. For example, if we have a raw data like this:

Click   Advertiser   Publisher
=====   ==========   =========
0       Nike         CNN
1       ESPN         BBC

Here, we have

* 2 fields: Advertiser and Publisher
* 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC

Usually you will need to build two dictionaries, one for field and one for features, like this:

DictField[Advertiser] -> 0
DictField[Publisher]  -> 1

DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN]   -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC]   -> 3

Then, you can generate FFM format data:

0 0:0:1 1:1:1
1 0:2:1 1:3:1

Note that because these features are categorical, the values here are all ones.

Fields should be easy to understand. Feature numbering differs from the GBDT case: in the GBDT preprocessing above we numbered features independently within each category (C1 has features 0~n, C2 has features 0~n), whereas FFM numbers all features jointly. In the example above, C1 is Advertiser with two features and C2 is Publisher with two features, and numbered jointly they become 0~3. For GBDT we would number them independently, like this:

DictFeature[Advertiser-Nike] -> 0
DictFeature[Advertiser-ESPN] -> 1
DictFeature[Publisher-CNN]   -> 0
DictFeature[Publisher-BBC]   -> 1

Now suppose there is a third row; let's see how the FFM input is constructed:

Click   Advertiser   Publisher
=====   ==========   =========
0       Nike         CNN
1       ESPN         BBC
0       Lining       CNN

Following the rule above, it should look like this:

DictFeature[Advertiser-Nike]   -> 0
DictFeature[Publisher-CNN]     -> 1
DictFeature[Advertiser-ESPN]   -> 2
DictFeature[Publisher-BBC]     -> 3
DictFeature[Advertiser-Lining] -> 4

Our own FFM preprocessing differs slightly from the above: after one category is numbered, the next category continues counting from there, so the final feature numbering looks like this:

DictFeature[Advertiser-Nike]   -> 0
DictFeature[Advertiser-ESPN]   -> 1
DictFeature[Advertiser-Lining] -> 2
DictFeature[Publisher-CNN]     -> 3
DictFeature[Publisher-BBC]     -> 4

In our data the numbering starts at I1 and runs through I1-I13, so the C1 ids start with an offset of 13.

Here is a real FFM input row:

0 0:0:0.05 1:1:0.004983 2:2:0.05 3:3:0 4:4:0.021594 5:5:0.008 6:6:0.15 7:7:0.04 8:8:0.362 9:9:0.166667 10:10:0.2 11:11:0 12:12:0.04 13:15:1 14:29:1 15:64:1 16:76:1 17:92:1 18:101:1 19:107:1 20:122:1 21:131:1 22:133:1 23:143:1 24:166:1 25:179:1 26:209:1 27:216:1 28:243:1 29:260:1 30:273:1 31:310:1 32:317:1 33:318:1 34:333:1 35:340:1 36:348:1 37:368:1 38:381:1

DNN

The DNN input is less complicated: still the I1-I13 decimals and the jointly numbered C1-C26 ids, just like FFM but without the offset of 13, with the Label at the end.

A real row looks like this:

0.05,0.004983,0.05,0,0.021594,0.008,0.15,0.04,0.362,0.166667,0.2,0,0.04,2,16,51,63,79,88,94,109,118,120,130,153,166,196,203,230,247,260,297,304,305,320,327,335,355,368,0

That covers the explanation; let's look at the code. Since it generates the training, validation, and test data in one pass, it takes a while to run.

Core code walkthrough

For the full code, see the project repository.

The following code comes from Baidu deep_fm's preprocess.py with a few small additions; no need to reinvent the wheel :)

import os
import sys
import random
import collections
import numpy as np

# There are 13 integer features and 26 categorical features
continous_features = range(1, 14)
categorial_features = range(14, 40)

# Clip integer features. The clip point for each integer feature
# is derived from the 95% quantile of the total values in each feature
continous_clip = [20, 600, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]


class ContinuousFeatureGenerator:
    """
    Normalize the integer features to [0, 1] by min-max normalization
    """

    def __init__(self, num_feature):
        self.num_feature = num_feature
        self.min = [sys.maxsize] * num_feature
        self.max = [-sys.maxsize] * num_feature

    def build(self, datafile, continous_features):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    val = features[continous_features[i]]
                    if val != '':
                        val = int(val)
                        if val > continous_clip[i]:
                            val = continous_clip[i]
                        self.min[i] = min(self.min[i], val)
                        self.max[i] = max(self.max[i], val)

    def gen(self, idx, val):
        if val == '':
            return 0.0
        val = float(val)
        return (val - self.min[idx]) / (self.max[idx] - self.min[idx])


class CategoryDictGenerator:
    """
    Generate dictionary for each of the categorical features
    """

    def __init__(self, num_feature):
        self.dicts = []
        self.num_feature = num_feature
        for i in range(0, num_feature):
            self.dicts.append(collections.defaultdict(int))

    def build(self, datafile, categorial_features, cutoff=0):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    if features[categorial_features[i]] != '':
                        self.dicts[i][features[categorial_features[i]]] += 1
        for i in range(0, self.num_feature):
            self.dicts[i] = filter(lambda x: x[1] >= cutoff,
                                   self.dicts[i].items())
            self.dicts[i] = sorted(self.dicts[i], key=lambda x: (-x[1], x[0]))
            vocabs, _ = list(zip(*self.dicts[i]))
            self.dicts[i] = dict(zip(vocabs, range(1, len(vocabs) + 1)))
            self.dicts[i]['<unk>'] = 0

    def gen(self, idx, key):
        if key not in self.dicts[idx]:
            res = self.dicts[idx]['<unk>']
        else:
            res = self.dicts[idx][key]
        return res

    def dicts_sizes(self):
        return list(map(len, self.dicts))


def preprocess(datadir, outdir):
    """
    All the 13 integer features are normalized to continuous values and these
    continuous features are combined into one vector with dimension 13.

    Each of the 26 categorical features are one-hot encoded and all the one-hot
    vectors are combined into one sparse binary vector.
    """
    dists = ContinuousFeatureGenerator(len(continous_features))
    dists.build(os.path.join(datadir, 'train.txt'), continous_features)

    dicts = CategoryDictGenerator(len(categorial_features))
    dicts.build(
        os.path.join(datadir, 'train.txt'), categorial_features, cutoff=200)  # 200 50

    dict_sizes = dicts.dicts_sizes()
    categorial_feature_offset = [0]
    for i in range(1, len(categorial_features)):
        offset = categorial_feature_offset[i - 1] + dict_sizes[i - 1]
        categorial_feature_offset.append(offset)

    random.seed(0)

    # 90% of the data are used for training, and 10% of the data are used
    # for validation.
    train_ffm = open(os.path.join(outdir, 'train_ffm.txt'), 'w')
    valid_ffm = open(os.path.join(outdir, 'valid_ffm.txt'), 'w')

    train_lgb = open(os.path.join(outdir, 'train_lgb.txt'), 'w')
    valid_lgb = open(os.path.join(outdir, 'valid_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'train.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid.txt'), 'w') as out_valid:
            with open(os.path.join(datadir, 'train.txt'), 'r') as f:
                for line in f:
                    features = line.rstrip('\n').split('\t')

                    continous_feats = []
                    continous_vals = []
                    for i in range(0, len(continous_features)):
                        val = dists.gen(i, features[continous_features[i]])
                        continous_vals.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                        continous_feats.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))  # ('{0}'.format(val))

                    categorial_vals = []
                    categorial_lgb_vals = []
                    for i in range(0, len(categorial_features)):
                        val = dicts.gen(i, features[categorial_features[i]]) + categorial_feature_offset[i]
                        categorial_vals.append(str(val))
                        val_lgb = dicts.gen(i, features[categorial_features[i]])
                        categorial_lgb_vals.append(str(val_lgb))

                    continous_vals = ','.join(continous_vals)
                    categorial_vals = ','.join(categorial_vals)
                    label = features[0]
                    if random.randint(0, 9999) % 10 != 0:
                        out_train.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')

                        train_ffm.write('\t'.join(label) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii, val in enumerate(continous_vals.split(','))]) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                        train_lgb.write('\t'.join(label) + '\t')
                        train_lgb.write('\t'.join(continous_feats) + '\t')
                        train_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

                    else:
                        out_valid.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')

                        valid_ffm.write('\t'.join(label) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii, val in enumerate(continous_vals.split(','))]) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                        valid_lgb.write('\t'.join(label) + '\t')
                        valid_lgb.write('\t'.join(continous_feats) + '\t')
                        valid_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    train_ffm.close()
    valid_ffm.close()

    train_lgb.close()
    valid_lgb.close()

    test_ffm = open(os.path.join(outdir, 'test_ffm.txt'), 'w')
    test_lgb = open(os.path.join(outdir, 'test_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'test.txt'), 'w') as out:
        with open(os.path.join(datadir, 'test.txt'), 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')

                continous_feats = []
                continous_vals = []
                for i in range(0, len(continous_features)):
                    val = dists.gen(i, features[continous_features[i] - 1])
                    continous_vals.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                    continous_feats.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))  # ('{0}'.format(val))

                categorial_vals = []
                categorial_lgb_vals = []
                for i in range(0, len(categorial_features)):
                    val = dicts.gen(i,
                                    features[categorial_features[i] -
                                             1]) + categorial_feature_offset[i]
                    categorial_vals.append(str(val))

                    val_lgb = dicts.gen(i, features[categorial_features[i] - 1])
                    categorial_lgb_vals.append(str(val_lgb))

                continous_vals = ','.join(continous_vals)
                categorial_vals = ','.join(categorial_vals)

                out.write(','.join([continous_vals, categorial_vals]) + '\n')

                test_ffm.write('\t'.join(
                    ['{}:{}:{}'.format(ii, ii, val) for ii, val in enumerate(continous_vals.split(','))]) + '\t')
                test_ffm.write('\t'.join(
                    ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                test_lgb.write('\t'.join(continous_feats) + '\t')
                test_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    test_ffm.close()
    test_lgb.close()
    return dict_sizes

Training FFM

With the data ready, we call LibFFM to train the FFM model.

The learning rate is 0.1 with 32 iterations; the trained model is saved to the file model_ffm.

import subprocess, sys, os, time

NR_THREAD = 1
cmd = './libffm/libffm/ffm-train --auto-stop -r 0.1 -t 32 -s {nr_thread} -p ./data/valid_ffm.txt ./data/train_ffm.txt model_ffm'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()

Training output:

First check if the text file has already been converted to binary format (1.3 seconds)
Binary file found. Skip converting text to binary
First check if the text file has already been converted to binary format (0.2 seconds)
Binary file found. Skip converting text to binary
iter   tr_logloss   va_logloss   tr_time
   1      0.49339      0.48196      12.8
   2      0.47621      0.47651      25.9
   3      0.47149      0.47433      39.0
   4      0.46858      0.47277      51.2
   5      0.46630      0.47168      63.0
   6      0.46447      0.47092      74.7
   7      0.46269      0.47038      86.4
   8      0.46113      0.47000      98.0
   9      0.45960      0.46960     109.6
  10      0.45811      0.46940     121.2
  11      0.45660      0.46913     132.5
  12      0.45509      0.46899     144.3
  13      0.45366      0.46903
Auto-stop. Use model at 12th iteration.

With the FFM model trained, we feed the training, validation, and test data through it to obtain the FFM layer's output; the output files are named *.out.logit.

cmd = './libffm/libffm/ffm-predict ./data/train_ffm.txt model_ffm tr_ffm.out'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/valid_ffm.txt model_ffm va_ffm.out'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/test_ffm.txt model_ffm te_ffm.out true'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()

Training GBDT

Now we call LightGBM to train the GBDT model. Since decision trees overfit easily, we set the number of trees to 32 and the number of leaves to 30, leave the depth unset, and use a learning rate of 0.05.

import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

def lgb_pred(tr_path, va_path, _sep='\t', iter_num=32):
    # load or create your dataset
    print('Load data...')
    df_train = pd.read_csv(tr_path, header=None, sep=_sep)
    df_test = pd.read_csv(va_path, header=None, sep=_sep)

    y_train = df_train[0].values
    y_test = df_test[0].values
    X_train = df_train.drop(0, axis=1).values
    X_test = df_test.drop(0, axis=1).values

    # create dataset for lightgbm
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

    # specify your configurations as a dict
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': {'l2', 'auc', 'logloss'},
        'num_leaves': 30,
        # 'max_depth': 7,
        'num_trees': 32,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }

    print('Start training...')
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=iter_num,
                    valid_sets=lgb_eval,
                    feature_name=["I1","I2","I3","I4","I5","I6","I7","I8","I9","I10","I11","I12","I13","C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    categorical_feature=["C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    early_stopping_rounds=5)

    print('Save model...')
    # save model to file
    gbm.save_model('lgb_model.txt')

    print('Start predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

    return gbm, y_pred, X_train, y_train

Training output:

[1]   valid_0's l2: 0.241954   valid_0's auc: 0.70607
Training until validation scores don't improve for 5 rounds.
[2]   valid_0's l2: 0.234704   valid_0's auc: 0.715608
[3]   valid_0's l2: 0.228139   valid_0's auc: 0.717791
[4]   valid_0's l2: 0.222168   valid_0's auc: 0.72273
[5]   valid_0's l2: 0.216728   valid_0's auc: 0.724065
[6]   valid_0's l2: 0.211819   valid_0's auc: 0.725036
[7]   valid_0's l2: 0.207316   valid_0's auc: 0.727427
[8]   valid_0's l2: 0.203296   valid_0's auc: 0.728583
[9]   valid_0's l2: 0.199582   valid_0's auc: 0.730092
[10]  valid_0's l2: 0.196185   valid_0's auc: 0.730792
[11]  valid_0's l2: 0.193063   valid_0's auc: 0.732316
[12]  valid_0's l2: 0.190268   valid_0's auc: 0.733773
[13]  valid_0's l2: 0.187697   valid_0's auc: 0.734782
[14]  valid_0's l2: 0.185351   valid_0's auc: 0.735636
[15]  valid_0's l2: 0.183215   valid_0's auc: 0.736346
[16]  valid_0's l2: 0.181241   valid_0's auc: 0.737393
[17]  valid_0's l2: 0.179468   valid_0's auc: 0.737709
[18]  valid_0's l2: 0.177829   valid_0's auc: 0.739096
[19]  valid_0's l2: 0.176326   valid_0's auc: 0.740135
[20]  valid_0's l2: 0.174948   valid_0's auc: 0.741065
[21]  valid_0's l2: 0.173675   valid_0's auc: 0.742165
[22]  valid_0's l2: 0.172499   valid_0's auc: 0.742672
[23]  valid_0's l2: 0.171471   valid_0's auc: 0.743246
[24]  valid_0's l2: 0.17045    valid_0's auc: 0.744415
[25]  valid_0's l2: 0.169582   valid_0's auc: 0.744792
[26]  valid_0's l2: 0.168746   valid_0's auc: 0.745478
[27]  valid_0's l2: 0.167966   valid_0's auc: 0.746282
[28]  valid_0's l2: 0.167264   valid_0's auc: 0.74675
[29]  valid_0's l2: 0.166582   valid_0's auc: 0.747429
[30]  valid_0's l2: 0.16594    valid_0's auc: 0.748392
[31]  valid_0's l2: 0.165364   valid_0's auc: 0.748986
[32]  valid_0's l2: 0.164844   valid_0's auc: 0.749362
Did not meet early stopping. Best iteration is:
[32]  valid_0's l2: 0.164844   valid_0's auc: 0.749362
Save model...
Start predicting...
The rmse of prediction is: 0.406009502303

Let's sort the features by importance and take a look:

def ret_feat_impt(gbm):
    gain = gbm.feature_importance('gain').reshape(-1, 1) / sum(gbm.feature_importance('gain'))
    col = np.array(gbm.feature_name()).reshape(-1, 1)
    return sorted(np.column_stack((col, gain)), key=lambda x: x[1], reverse=True)

Output (feature, share of total gain):

I6    0.1978774213012332
I11   0.1892171073393491
C13   0.09876586224832032
I7    0.09328723289667494
C15   0.07837089393651243
I1    0.06896606612740637
C18   0.03397325870627491
C4    0.03194220375573926
I13   0.027751948092299045
C14   0.022884477973766117
C17   0.01758709018584479
I3    0.01745531293913725
C24   0.015748415135270675
C7    0.014203757070472703
I8    0.013413268591324624
C11   0.012366386458128355
C10   0.011022221770323784
I5    0.01042866903792042
C16   0.010389410428237439
I9    0.009918639946598076
C2    0.006787009911825981
C12   0.005168884905437884
I4    0.00468917800335175
C26   0.003364625407413743
C23   0.0031263193710805628
C21   0.0008737398560005959
C19   0.00042059860405565207
I2, I10, I12, C1, C3, C5, C6, C8, C9, C20, C22, C25: 0.0

Analyzing the model with eli5

import eli5

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import csv
import numpy as np

with open('./data/train_eli5.csv', 'rt') as f:
    data = list(csv.DictReader(f))

_all_xs = [{k: v for k, v in row.items() if k != 'clicked'} for row in data]
_all_ys = np.array([int(row['clicked']) for row in data])

all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))

# from xgboost import XGBClassifier
import warnings
# xgboost <= 0.6a2 shows a warning when used with scikit-learn 0.18+
warnings.filterwarnings('ignore', category=UserWarning)

class CSCTransformer:
    def transform(self, xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        return xs.tocsc()
    def fit(self, *args):
        return self

clf = lgb.LGBMClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, CSCTransformer(), clf)

def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted

evaluate(pipeline)

booster = clf.booster_  # if this raises an error, use clf.booster() instead
original_feature_names = booster.feature_name
booster.feature_names = vec.get_feature_names()
# recover original feature names
booster.feature_names = original_feature_names

Output:

899991 items total, 25.5% true
Accuracy: 0.776 ± 0.003

from eli5 import show_weights
show_weights(clf, vec=vec)

from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)

Generating FM data from the LightGBM output

For the input format, see the description in the libFM 1.4.2 manual.

GBDT is now trained, and we use its output leaf nodes as the input data X for FM. With 30 leaves per tree, the FM input format is index:value for every nonzero entry of X.

A real row looks like this:

0 0:31 1:61 2:93 3:108 4:149 5:182 6:212 7:242 8:277 9:310 10:334 11:365 12:401 13:434 14:465 15:491 16:527 17:552 18:589 19:619 20:648 21:678 22:697 23:744 24:770 25:806 26:826 27:862 28:899 29:928 30:955 31:988

def generat_lgb2fm_data(outdir, gbm, dump, tr_path, va_path, te_path, _sep='\t'):
    with open(os.path.join(outdir, 'train_lgb2fm.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid_lgb2fm.txt'), 'w') as out_valid:
            with open(os.path.join(outdir, 'test_lgb2fm.txt'), 'w') as out_test:
                df_train_ = pd.read_csv(tr_path, header=None, sep=_sep)
                df_valid_ = pd.read_csv(va_path, header=None, sep=_sep)
                df_test_ = pd.read_csv(te_path, header=None, sep=_sep)

                y_train_ = df_train_[0].values
                y_valid_ = df_valid_[0].values

                X_train_ = df_train_.drop(0, axis=1).values
                X_valid_ = df_valid_.drop(0, axis=1).values
                X_test_ = df_test_.values

                train_leaves = gbm.predict(X_train_, num_iteration=gbm.best_iteration, pred_leaf=True)
                valid_leaves = gbm.predict(X_valid_, num_iteration=gbm.best_iteration, pred_leaf=True)
                test_leaves = gbm.predict(X_test_, num_iteration=gbm.best_iteration, pred_leaf=True)

                tree_info = dump['tree_info']
                tree_counts = len(tree_info)
                for i in range(tree_counts):
                    train_leaves[:, i] = train_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    valid_leaves[:, i] = valid_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    test_leaves[:, i] = test_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    # print(train_leaves[:, i])
                    # print(tree_info[i]['num_leaves'])

                for idx in range(len(y_train_)):
                    out_train.write((str(y_train_[idx]) + '\t'))
                    out_train.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii, val in enumerate(train_leaves[idx]) if float(val) != 0]) + '\n')

                for idx in range(len(y_valid_)):
                    out_valid.write((str(y_valid_[idx]) + '\t'))
                    out_valid.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii, val in enumerate(valid_leaves[idx]) if float(val) != 0]) + '\n')

                for idx in range(len(X_test_)):
                    out_test.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii, val in enumerate(test_leaves[idx]) if float(val) != 0]) + '\n')

Training FM

The FM training data is ready, so we call LibFM to train.

We run 64 iterations with SGD at a learning rate of 0.00000001; the trained model is saved to the file fm_model.

In the training log below, the Train and Test numbers are accuracy, not loss.

cmd = "./libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 64 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -save_model fm_model"
os.popen(cmd).readlines()

Training output:

----------------------------------------------------------------------------
libFM
  Version: 1.4.4
  Author:  Steffen Rendle, srendle@libfm.org
  WWW:     http://www.libfm.org/
This program comes with ABSOLUTELY NO WARRANTY; for details see license.txt.
This is free software, and you are welcome to redistribute it under certain
conditions; for details see license.txt.
----------------------------------------------------------------------------
Loading train...
has x = 1
has xt = 0
num_rows=899991  num_values=28799712  num_features=32  min_target=0  max_target=1
Loading test...
has x = 1
has xt = 0
num_rows=100009  num_values=3200288  num_features=32  min_target=0  max_target=1
#relations: 0
Loading meta data...
learnrate=1e-08
learnrates=1e-08,1e-08,1e-08
#iterations=64
SGD: DON'T FORGET TO SHUFFLE THE ROWS IN TRAINING DATA TO GET THE BEST RESULTS.
#Iter=  0  Train=0.625438  Test=0.619484
#Iter=  1  Train=0.636596  Test=0.632013
#Iter=  2  Train=0.627663  Test=0.623114
#Iter=  3  Train=0.609776  Test=0.606605
#Iter=  4  Train=0.563581  Test=0.56092
#Iter=  5  Train=0.497907  Test=0.495655
#Iter=  6  Train=0.461677  Test=0.461408
#Iter=  7  Train=0.453666  Test=0.452639
#Iter=  8  Train=0.454026  Test=0.453419
#Iter=  9  Train=0.456836  Test=0.455919
#Iter= 10  Train=0.46032   Test=0.459339
#Iter= 11  Train=0.466546  Test=0.465358
#Iter= 12  Train=0.473565  Test=0.472317
#Iter= 13  Train=0.481726  Test=0.480967
#Iter= 14  Train=0.492357  Test=0.491216
#Iter= 15  Train=0.504419  Test=0.502935
#Iter= 16  Train=0.517793  Test=0.516214
#Iter= 17  Train=0.533604  Test=0.532102
#Iter= 18  Train=0.552926  Test=0.5515
#Iter= 19  Train=0.575645  Test=0.573198
#Iter= 20  Train=0.59418   Test=0.590887
#Iter= 21  Train=0.610691  Test=0.607815
#Iter= 22  Train=0.626138  Test=0.623384
#Iter= 23  Train=0.640751  Test=0.637923
#Iter= 24  Train=0.65393   Test=0.652141
#Iter= 25  Train=0.666099  Test=0.6641
#Iter= 26  Train=0.677933  Test=0.675419
#Iter= 27  Train=0.689539  Test=0.687108
#Iter= 28  Train=0.700177  Test=0.697397
#Iter= 29  Train=0.709265  Test=0.706156
#Iter= 30  Train=0.716553  Test=0.713266
#Iter= 31  Train=0.723218  Test=0.719635
#Iter= 32  Train=0.729163  Test=0.726065
#Iter= 33  Train=0.734428  Test=0.731354
#Iter= 34  Train=0.738863  Test=0.735844
#Iter= 35  Train=0.74284   Test=0.740323
#Iter= 36  Train=0.746316  Test=0.743793
#Iter= 37  Train=0.749123  Test=0.746333
#Iter= 38  Train=0.751573  Test=0.748493
#Iter= 39  Train=0.753264  Test=0.750292
#Iter= 40  Train=0.754803  Test=0.751642
#Iter= 41  Train=0.756011  Test=0.753062
#Iter= 42  Train=0.756902  Test=0.753892
#Iter= 43  Train=0.757642  Test=0.754872
#Iter= 44  Train=0.758293  Test=0.755372
#Iter= 45  Train=0.758855  Test=0.755782
#Iter= 46  Train=0.759293  Test=0.756322
#Iter= 47  Train=0.759695  Test=0.756652
#Iter= 48  Train=0.760084  Test=0.756982
#Iter= 49  Train=0.760343  Test=0.757252
#Iter= 50  Train=0.76055   Test=0.757332
#Iter= 51  Train=0.760706  Test=0.757582
#Iter= 52  Train=0.760944  Test=0.757842
#Iter= 53  Train=0.761035  Test=0.757952
#Iter= 54  Train=0.761173  Test=0.758152
#Iter= 55  Train=0.761291  Test=0.758382
#Iter= 56  Train=0.76142   Test=0.758412
#Iter= 57  Train=0.761541  Test=0.758452
#Iter= 58  Train=0.761677  Test=0.758572
#Iter= 59  Train=0.76175   Test=0.758692
#Iter= 60  Train=0.761829  Test=0.758822
#Iter= 61  Train=0.761855  Test=0.758862
#Iter= 62  Train=0.761918  Test=0.759002
#Iter= 63  Train=0.761988  Test=0.758972
Final  Train=0.761988  Test=0.758972
Writing FM model to fm_model

With the FM model trained, we feed the training, validation, and test data through it to obtain the FM layer's output; the output files are named *.fm.logits.

cmd = "./libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix tr"
os.popen(cmd).readlines()
cmd = "./libfm/libfm/bin/libFM -task c -train ./data/valid_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix va"
os.popen(cmd).readlines()
cmd = "./libfm/libfm/bin/libFM -task c -train ./data/test_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix te -test2predict true"
os.popen(cmd).readlines()

Building the model

embed_dim = 32
sparse_max = 30000  # sparse_feature_dim = 117568
sparse_dim = 26
dense_dim = 13
out_dim = 400

Define the input placeholders

import tensorflow as tf

def get_inputs():
    dense_input = tf.placeholder(tf.float32, [None, dense_dim], name="dense_input")
    sparse_input = tf.placeholder(tf.int32, [None, sparse_dim], name="sparse_input")
    FFM_input = tf.placeholder(tf.float32, [None, 1], name="FFM_input")
    FM_input = tf.placeholder(tf.float32, [None, 1], name="FM_input")

    targets = tf.placeholder(tf.float32, [None, 1], name="targets")
    LearningRate = tf.placeholder(tf.float32, name="LearningRate")
    return dense_input, sparse_input, FFM_input, FM_input, targets, LearningRate

Feed the categorical features through the embedding layer to obtain the embedding vectors

def get_sparse_embedding(sparse_input):
    with tf.name_scope("sparse_embedding"):
        sparse_embed_matrix = tf.Variable(tf.random_uniform([sparse_max, embed_dim], -1, 1), name="sparse_embed_matrix")
        sparse_embed_layer = tf.nn.embedding_lookup(sparse_embed_matrix, sparse_input, name="sparse_embed_layer")
        sparse_embed_layer = tf.reshape(sparse_embed_layer, [-1, sparse_dim * embed_dim])
    return sparse_embed_layer

Concatenate the numerical features with the embedding vectors and pass them through three fully connected layers

def get_dnn_layer(dense_input, sparse_embed_layer):
    with tf.name_scope("dnn_layer"):
        input_combine_layer = tf.concat([dense_input, sparse_embed_layer], 1)  # (?, 845 = 832 + 13)
        fc1_layer = tf.layers.dense(input_combine_layer, out_dim, name="fc1_layer", activation=tf.nn.relu)
        fc2_layer = tf.layers.dense(fc1_layer, out_dim, name="fc2_layer", activation=tf.nn.relu)
        fc3_layer = tf.layers.dense(fc2_layer, out_dim, name="fc3_layer", activation=tf.nn.relu)
    return fc3_layer

Build the computation graph

As described above, the FFM and FM outputs each pass through a fully connected layer, are concatenated with the output of the three fully connected layers over the numerical features and embedding vectors, and logistic regression is applied.

We use the LogLoss loss and FtrlOptimizer to minimize it.

tf.reset_default_graph()
train_graph = tf.Graph()
with train_graph.as_default():
    dense_input, sparse_input, FFM_input, FM_input, targets, lr = get_inputs()
    sparse_embed_layer = get_sparse_embedding(sparse_input)
    fc3_layer = get_dnn_layer(dense_input, sparse_embed_layer)

    ffm_fc_layer = tf.layers.dense(FFM_input, 1, name="ffm_fc_layer")
    fm_fc_layer = tf.layers.dense(FM_input, 1, name="fm_fc_layer")
    feature_combine_layer = tf.concat([ffm_fc_layer, fm_fc_layer, fc3_layer], 1)  # (?, 402)

    with tf.name_scope("inference"):
        logits = tf.layers.dense(feature_combine_layer, 1, name="logits_layer")
        pred = tf.nn.sigmoid(logits, name="prediction")

    with tf.name_scope("loss"):
        # LogLoss; logistic regression onto the click-through rate
        # cost = tf.losses.sigmoid_cross_entropy(targets, logits)
        sigmoid_cost = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits, name="sigmoid_cost")
        logloss_cost = tf.losses.log_loss(labels=targets, predictions=pred)
        cost = logloss_cost  # + sigmoid_cost
        loss = tf.reduce_mean(cost)

    # Optimize the loss
    # train_op = tf.train.AdamOptimizer(lr).minimize(loss)  # cost
    global_step = tf.Variable(0, name="global_step", trainable=False)
    optimizer = tf.train.FtrlOptimizer(lr)  # tf.train.FtrlOptimizer(lr) AdamOptimizer
    gradients = optimizer.compute_gradients(loss)  # cost
    train_op = optimizer.apply_gradients(gradients, global_step=global_step)

    # Accuracy
    with tf.name_scope("score"):
        correct_prediction = tf.equal(tf.to_float(pred > 0.5), targets)
        accuracy = tf.reduce_mean(tf.to_float(correct_prediction), name="accuracy")

    # auc, uop = tf.contrib.metrics.streaming_auc(pred, targets)

Hyperparameters

The dataset is large, so we run only one epoch.

# Number of Epochs
num_epochs = 1
# Batch Size
batch_size = 32

# Learning Rate
learning_rate = 0.01
# Show stats for every n number of batches
show_every_n_batches = 25

save_dir = './save'

ffm_tr_out_path = './tr_ffm.out.logit'
ffm_va_out_path = './va_ffm.out.logit'
fm_tr_out_path = './tr.fm.logits'
fm_va_out_path = './va.fm.logits'
train_path = './data/train.txt'
valid_path = './data/valid.txt'

Load the FFM output

ffm_train = pd.read_csv(ffm_tr_out_path, header=None)
ffm_train = ffm_train[0].values

ffm_valid = pd.read_csv(ffm_va_out_path, header=None)
ffm_valid = ffm_valid[0].values

Load the FM output

fm_train = pd.read_csv(fm_tr_out_path, header=None)
fm_train = fm_train[0].values

fm_valid = pd.read_csv(fm_va_out_path, header=None)
fm_valid = fm_valid[0].values

Load the dataset

Read the DNN data together with the FM and FFM outputs and concatenate them.

train_data = pd.read_csv(train_path, header=None)
train_data = train_data.values

valid_data = pd.read_csv(valid_path, header=None)
valid_data = valid_data.values

cc_train = np.concatenate((ffm_train.reshape(-1, 1), fm_train.reshape(-1, 1), train_data), 1)
cc_valid = np.concatenate((ffm_valid.reshape(-1, 1), fm_valid.reshape(-1, 1), valid_data), 1)

np.random.shuffle(cc_train)
np.random.shuffle(cc_valid)

train_y = cc_train[:, -1]
test_y = cc_valid[:, -1]

train_X = cc_train[:, 0:-1]
test_X = cc_valid[:, 0:-1]

Training the network

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import time
import datetime
from sklearn.metrics import log_loss
from sklearn.learning_curve import learning_curve
from sklearn import metrics

def train_model(num_epochs):
    losses = {'train': [], 'test': []}
    acc_lst = {'train': [], 'test': []}
    pred_lst = []

    with tf.Session(graph=train_graph) as sess:

        # Keep track of gradient values and sparsity
        grad_summaries = []
        for g, v in gradients:
            if g is not None:
                grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name.replace(':', '_')), g)
                sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name.replace(':', '_')), tf.nn.zero_fraction(g))
                grad_summaries.append(grad_hist_summary)
                grad_summaries.append(sparsity_summary)
        grad_summaries_merged = tf.summary.merge(grad_summaries)

        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        print("Writing to {}\n".format(out_dir))

        # Summaries for loss and accuracy
        loss_summary = tf.summary.scalar("loss", loss)
        # acc_summary = tf.scalar_summary("accuracy", accuracy)

        # Train Summaries
        train_summary_op = tf.summary.merge([loss_summary, grad_summaries_merged])
        train_summary_dir = os.path.join(out_dir, "summaries", "train")
        train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

        # Inference summaries
        inference_summary_op = tf.summary.merge([loss_summary])
        inference_summary_dir = os.path.join(out_dir, "summaries", "inference")
        inference_summary_writer = tf.summary.FileWriter(inference_summary_dir, sess.graph)

        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        saver = tf.train.Saver()
        for epoch_i in range(num_epochs):

            # Split the dataset into training and test batches
            train_batches = get_batches(train_X, train_y, batch_size)
            test_batches = get_batches(test_X, test_y, batch_size)

            # Training iterations; record the training loss
            for batch_i in range(len(train_X) // batch_size):
                x, y = next(train_batches)

                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14], 1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40], 1),
                    FFM_input: np.reshape(x.take(0, 1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1, 1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
                # _ = sess.run([train_op], feed)  # cost
                step, train_loss, summaries, _, prediction, acc = sess.run(
                    [global_step, loss, train_summary_op, train_op, pred, accuracy], feed)  # cost

                prediction = prediction.reshape(y.shape)
                losses['train'].append(train_loss)

                acc_lst['train'].append(acc)
                train_summary_writer.add_summary(summaries, step)

                if np.mean(y) != 0:
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1

                # Show every <show_every_n_batches> batches
                if (epoch_i * (len(train_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    time_str = datetime.datetime.now().isoformat()
                    print('{}: Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}   accuracy = {}   auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(train_X) // batch_size),
                        train_loss,
                        acc,
                        auc))
                    # print(metrics.classification_report(y, np.float32(prediction > 0.5)))

            # Iterate over the test data
            for batch_i in range(len(test_X) // batch_size):
                x, y = next(test_batches)

                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14], 1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40], 1),
                    FFM_input: np.reshape(x.take(0, 1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1, 1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
                # Get Prediction
                step, test_loss, summaries, prediction, acc = sess.run(
                    [global_step, loss, inference_summary_op, pred, accuracy], feed)  # cost

                # Record the test loss and accuracy
                prediction = prediction.reshape(y.shape)
                losses['test'].append(test_loss)

                acc_lst['test'].append(acc)
                inference_summary_writer.add_summary(summaries, step)
                pred_lst.append(prediction)

                if np.mean(y) != 0:
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1

                time_str = datetime.datetime.now().isoformat()
                if (epoch_i * (len(test_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    print('{}: Epoch {:>3} Batch {:>4}/{}   test_loss = {:.3f}   accuracy = {}   auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(test_X) // batch_size),
                        test_loss,
                        acc,
                        auc))
                    print(metrics.classification_report(y, np.float32(prediction > 0.5)))

        # Save Model
        saver.save(sess, save_dir)  # , global_step=epoch_i
        print('Model Trained and Saved')
        save_params((losses, acc_lst, pred_lst, save_dir))
    return losses, acc_lst, pred_lst, save_dir

Report training metrics on the validation set

  • Mean accuracy
  • Mean loss
  • Mean AUC
  • Mean predicted click-through rate
  • Precision, recall, F1 score, and related metrics

Because most of the data are negative examples and positives are rare, a model that always predicts 0 already reaches about 75% accuracy, so accuracy by itself is not a trustworthy metric.

We need to look at precision and recall for the positive class, and above all at the LogLoss value, since the competition is evaluated on LogLoss rather than AUC.

def train_info():
    print("Test Mean Acc  : {}".format(np.mean(acc_lst['test'])))  # test_pred_mean
    print("Test Mean Loss : {}".format(np.mean(losses['test'])))  # test_pred_mean
    print("Mean Auc : {}".format(metrics.roc_auc_score(test_y[:-9], np.array(pred_lst).reshape(-1, 1))))
    print("Mean prediction : {}".format(np.mean(np.array(pred_lst).reshape(-1, 1))))
    print(metrics.classification_report(test_y[:-9], np.float32(np.array(pred_lst).reshape(-1, 1) > 0.5)))

Viewing the loss in TensorBoard

Summary

That is the complete CTR prediction pipeline. We did not train on the full dataset, and many hyperparameters are left to tune. From the single epoch we ran, the LogLoss on the validation set is 0.46 and the other metrics fall between 75% and 80%, roughly matching the accuracies reached by the FFM, GBDT, and FM components themselves.

Further reading

  • Code for the 3rd place finish for Avazu Click-Through Rate Prediction
  • Kaggle : Display Advertising Challenge( ctr 預估 )
  • 用機器學習對CTR預估建模
  • Beginner's Guide to Click-Through Rate Prediction with Logistic Regression
  • 2nd place solution for Avazu click-through rate prediction competition
  • 常見計算廣告點擊率預估演算法總結
  • 3 Idiots' Approach for Display Advertising Challenge
  • Solution to the Outbrain Click Prediction competition
  • Deep Interest Network for Click-Through Rate Prediction
  • Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction
  • 重磅!阿里媽媽首次公開自研CTR預估核心演算法MLR
  • 阿里蓋坤團隊提出深度興趣網路,更懂用戶什麼時候會剁手
  • 深入FFM原理與實踐

That's all for today's share. Cheers~

