Open-sourcing 5 baseline versions | Game Player Payment Amount Prediction Competition

07-12

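I happened to have some free time these past few days and tried several approaches. None of them performs particularly well, but I'm sharing them here as a rough reference; experts can skip this post. The code covers both R and Python. There are three R versions (a crude "hack" version, linear regression, and xgboost) and two Python versions (ridge regression and lgb).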

Competition link

See the original post.

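Competition task

Build a model that uses a user's in-game data from the first 7 days after account registration to predict how much that user spends within 45 days.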

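Competition data

1) Training set (labeled): 2,288,007 samples. tap_fun_train.csv holds all information for the training samples: user_id is the sample id, prediction_pay_price is the training label, and the remaining fields are features.

2) Test set: 828,934 samples. tap_fun_test.csv holds the test-set features; apart from lacking the prediction_pay_price field, its format is the same as tap_fun_train.csv. The goal is to predict prediction_pay_price, the amount spent by day 45, as accurately as possible.

3) tap4fun 數據欄位解釋.xlsx (field descriptions) explains the 109 fields in the competition data. Every attribute is stored as a numeric value, and there are no null values.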
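
The description above says every field is numeric and free of nulls. A quick check along the following lines could confirm that before modeling. This is only a minimal sketch: the helper name `check_numeric_no_nulls` is hypothetical, and the toy frame uses made-up values rather than real competition rows.

```python
import pandas as pd

def check_numeric_no_nulls(df):
    """Return (non_numeric_columns, columns_with_nulls) for a DataFrame."""
    non_numeric = [c for c in df.columns
                   if not pd.api.types.is_numeric_dtype(df[c])]
    null_counts = df.isna().sum()
    with_nulls = null_counts[null_counts > 0].index.tolist()
    return non_numeric, with_nulls

# Toy stand-in for a few rows of tap_fun_train.csv (values are made up)
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "register_time": ["2018-02-02 19:47:15", "2018-01-26 00:01:05",
                      "2018-01-26 00:01:58"],
    "pay_price": [0.0, 0.99, None],
})
non_numeric, with_nulls = check_numeric_no_nulls(toy)
print("non-numeric columns:", non_numeric)
print("columns with nulls:", with_nulls)
```

Any column reported as non-numeric would need to be dropped or encoded before it goes into a model.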

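Prizes

First prize (one team): 50,000 yuan

Second prize (one team): 30,000 yuan

Third prize (one team): 20,000 yuan

The top five teams also get a fast track ("green channel") straight to interviews at tap4fun.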

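Schedule

Main competition: June 19 to August 31

Registration opens: June 19, 11:00

Online scoring: June 19, 2018, 11:00 to August 31, 15:00

Code submission: August 31, 2018, 17:00 to September 2, 16:00

Manual review: September 2 to September 12, 2018

Results announced: September 14, 2018

Note: at 17:00 on August 31, 2018, the leaderboard switches to the B-board scores.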

Open-source code

-------------------------------------------------------------------------------------------------

R

--------------------------------------------------------------------------------------------------------

library(dplyr)
library(ggplot2)
library(xgboost)
library(lubridate)
library(data.table)
options(scipen = 200)  # do not display numbers in scientific notation

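Version 1: the hack approach (best result, score 65)

The correlation between the amount paid in the first seven days and the prediction target is 0.7352345, which is high (the relationship plot is the cover image). So I simply multiply that one variable by a coefficient and submit. The best coefficient I have found so far is 4.3, which scores 65.71001 online (lower is better), roughly 20th place.

cor(tap_fun_train$pay_price, tap_fun_train$prediction_pay_price)  # compute the correlation (run after loading the data)

Other important features include ivory_add_value and pay_count. You can try adding more variables to the regression.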

The code is simple:

setwd("C:/Users/wuzhengxiang/Desktop/遊戲玩家付費金額預測大賽")
tap_fun_train = fread("tap_fun_train.csv", header = T)  # 2,288,007 rows; paying share = 1 - 2242019/2288007, about 2%
tap_fun_test = fread("tap_fun_test.csv", header = T)  # 828,934 rows
predict_final = select(tap_fun_test, user_id)
predict_final$prediction_pay_price = 4.3 * tap_fun_test$pay_price
write.csv(predict_final, "lm-07072319.csv", row.names = FALSE, quote = FALSE)
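
The 4.3 multiplier above was found by trying values against the leaderboard. As a possible starting point, the multiplier that minimizes squared error on the training set has a closed form, k = Σxy / Σx². The sketch below assumes squared error is what matters (the post does not state the official metric), uses made-up numbers, and the names `fit_multiplier` and `rmse` are hypothetical helpers, not part of the original code.

```python
import numpy as np

def fit_multiplier(x, y):
    """Least-squares k for the no-intercept model y ≈ k * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.dot(x, x)
    if denom == 0:
        raise ValueError("x is all zeros; the multiplier is undefined")
    return float(np.dot(x, y) / denom)

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up 7-day and 45-day spending for six users (most pay nothing)
pay_price = np.array([0.0, 0.0, 0.99, 4.99, 0.0, 9.99])
target = np.array([0.0, 0.0, 0.99, 20.0, 0.0, 45.0])

k = fit_multiplier(pay_price, target)
print("fitted multiplier:", round(k, 3))
print("training RMSE:", round(rmse(target, k * pay_price), 3))
```

A value fitted this way is tuned to the training set only, so it may still differ from whatever scores best online.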

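Version 2: linear regression (very poor, score 127)

model = lm(prediction_pay_price ~ pay_price + ivory_add_value + pay_count, data = tap_fun_train)
# predict() takes newdata=; without it the fitted values for the training set are returned
pred = predict(model, newdata = tap_fun_test)
predict_final = select(tap_fun_test, user_id)
predict_final$prediction_pay_price = pred
# spending cannot be negative, so set negative predictions to 0
predict_final$prediction_pay_price = if_else(predict_final$prediction_pay_price < 0, 0, predict_final$prediction_pay_price)
# export the predictions
write.csv(predict_final, "lm-0707-01.csv", row.names = FALSE, quote = FALSE)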
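
Version 2 scores far worse than a single multiplier, which is the kind of surprise a local hold-out check can catch before spending a submission. The sketch below shows one way to do that with plain numpy: split off a validation slice, fit least squares with an intercept, clip negative predictions to 0 as in the R code, and report RMSE. The data is synthetic and the helper names are hypothetical; RMSE is assumed as the yardstick because the post does not state the official metric.

```python
import numpy as np

def holdout_split(n_rows, valid_frac=0.2, seed=2018):
    """Return (train_idx, valid_idx) from a shuffled index."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_valid = int(n_rows * valid_frac)
    return idx[n_valid:], idx[:n_valid]

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefs...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predict and clip at 0, since spending cannot be negative."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.clip(X1 @ coef, 0, None)

def rmse(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic stand-in: 3 columns playing the role of
# pay_price, ivory_add_value and pay_count
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(1000, 3))
noise = rng.normal(0, 0.5, size=1000)
y = np.clip(4.0 * X[:, 0] + 0.5 * X[:, 2] - 1.0 + noise, 0, None)

train_idx, valid_idx = holdout_split(len(X))
coef = fit_ols(X[train_idx], y[train_idx])
valid_pred = predict_ols(coef, X[valid_idx])
print("hold-out RMSE:", rmse(y[valid_idx], valid_pred))
```

The same split can be reused to score the multiplier baseline, so the two approaches are compared on identical rows.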

Version 3: xgboost (mediocre, score 72)

# read the base data
setwd("C:/Users/wuzhengxiang/Desktop/遊戲玩家付費金額預測大賽")
tap_fun_train = fread("tap_fun_train.csv", header = T)  # 2,288,007 rows; paying share = 1 - 2242019/2288007, about 2%
tap_fun_test = fread("tap_fun_test.csv", header = T)  # 828,934 rows
train = tap_fun_train
test = tap_fun_test
LABEL = train$prediction_pay_price
train$register_time = NULL
train$prediction_pay_price = NULL
test$register_time = NULL
user_id = select(tap_fun_test, user_id)

# outlier handling and feature selection (here: fill any NA with -1)
train[is.na(train)] = -1
test[is.na(test)] = -1

# convert the data to xgboost matrices (user_id is still among the features)
dtrain = xgb.DMatrix(data.matrix(train), label = LABEL)
dtest = xgb.DMatrix(data.matrix(test))

# parameter settings
param = list(booster = "gbtree",
             objective = "reg:squarederror",  # called "reg:linear" in older xgboost releases
             eval_metric = "rmse",
             #gamma = 0.2,
             #lambda = 2,
             eta = 0.05,
             subsample = 0.65,
             colsample_bytree = 0.75,
             min_child_weight = 3,
             max_depth = 5
)

# cross-validation
model.cv = xgb.cv(data = dtrain,
                  params = param,
                  nrounds = 10000,
                  nfold = 5,
                  print_every_n = 10,
                  early_stopping_rounds = 30
)
# test-rmse:3726.037988+1361.778319

# train the model
model_xgb = xgboost(data = dtrain,
                    params = param,
                    print_every_n = 100,
                    nrounds = 1200)

# predict
pred = predict(model_xgb, dtest)
predict_final = data.frame(user_id, prediction_pay_price = pred)

# export the predictions to a timestamped file (pred_final-YYYY-MM-DD_HH_MM.csv)
pred_xgb = paste0("pred_final-", format(Sys.time(), "%Y-%m-%d_%H_%M"), ".csv")
write.csv(predict_final, pred_xgb, row.names = FALSE, quote = FALSE, fileEncoding = "utf8")

# feature importance
names = colnames(train)
importance = xgb.importance(names, model = model_xgb)

# plot the top 20 features
xgb.plot.importance(importance[1:20,])
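
In version 3, the cross-validation step runs with early_stopping_rounds = 30, while the final model is trained for a fixed 1200 rounds. The idea behind early stopping can be sketched in a few lines: track the best validation score so far and stop once a set number of rounds pass without improvement. This is a simplified illustration, not xgboost's actual implementation; `best_round` and the score list are made up, and indices here are 0-based.

```python
def best_round(scores, patience):
    """Return (best_index, best_score) under a simple early-stopping rule.

    Scans per-round validation scores (lower is better) and stops once
    `patience` rounds have passed without a new best score.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best_idx, best_score = 0, scores[0]
    for i, s in enumerate(scores):
        if s < best_score:
            best_idx, best_score = i, s
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx, best_score

# Made-up per-round validation RMSE values
cv_rmse = [5.0, 4.0, 3.0, 3.5, 3.6, 2.0, 1.0]
print(best_round(cv_rmse, patience=2))
print(best_round(cv_rmse, patience=3))
```

With a short patience the scan can stop before a later, better round is reached, which is the trade-off behind choosing the patience value.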

-------------------------------------------------------------------------------------------------

Python

--------------------------------------------------------------------------------------------------------

import gc
import re
import sys
import time
import os.path
import os
import datetime
import numpy as np
import pandas as pd
import lightgbm as lgb

Version 1: lgb (score 81)

# set the working directory
data_path = "C:/Users/wuzhengxiang/Desktop/遊戲玩家付費金額預測大賽"
os.chdir(data_path)  # change the current working directory
print(os.getcwd())  # show the current working directory

# read the data
tap_fun_train = pd.read_csv("tap_fun_train.csv")  # 2,288,007 rows
tap_fun_test = pd.read_csv("tap_fun_test.csv")  # 828,934 rows
# give test rows a placeholder label of -1 so they can be separated again after concatenation
tap_fun_test["prediction_pay_price"] = -1
data = pd.concat([tap_fun_train, tap_fun_test])

# model training
train_feat = data[data["prediction_pay_price"] > -1].fillna(0)
testt_feat = data[data["prediction_pay_price"] <= -1].fillna(0)
label_x = train_feat["prediction_pay_price"]
label_y = testt_feat["prediction_pay_price"]  # placeholder labels, all -1
submission = testt_feat[["user_id"]].copy()
train_feat = train_feat.drop("prediction_pay_price", axis=1)
testt_feat = testt_feat.drop("prediction_pay_price", axis=1)
# user_id is left in as a feature in this version
#train_feat = train_feat.drop("user_id", axis=1)
#testt_feat = testt_feat.drop("user_id", axis=1)
train_feat = train_feat.drop("register_time", axis=1)
testt_feat = testt_feat.drop("register_time", axis=1)
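
The preparation above stacks the train and test sets into one frame and later separates them again by a placeholder label of -1. The pattern can be seen in isolation in the sketch below. It uses a toy frame with made-up values, and `combine` and `split` are hypothetical helper names, not functions from the original script.

```python
import pandas as pd

LABEL = "prediction_pay_price"
SENTINEL = -1  # placeholder label for test rows

def combine(train_df, test_df):
    """Stack train and test rows, marking test rows with the sentinel label."""
    test_df = test_df.copy()  # leave the caller's frame untouched
    test_df[LABEL] = SENTINEL
    return pd.concat([train_df, test_df], ignore_index=True)

def split(data):
    """Undo combine(): return (train features, train labels, test features)."""
    is_train = data[LABEL] > SENTINEL
    train_part = data[is_train]
    test_part = data[~is_train]
    return (train_part.drop(columns=[LABEL]),
            train_part[LABEL],
            test_part.drop(columns=[LABEL]))

# Toy frames with made-up values
toy_train = pd.DataFrame({"user_id": [1, 2, 3],
                          "pay_price": [0.0, 0.99, 4.99],
                          LABEL: [0.0, 0.99, 20.0]})
toy_test = pd.DataFrame({"user_id": [4, 5],
                         "pay_price": [0.0, 9.99]})

data = combine(toy_train, toy_test)
X_train, y_train, X_test = split(data)
print(X_train.shape, y_train.shape, X_test.shape)
```

The trick relies on real labels never being negative, which holds for spending amounts; any shared preprocessing can then be applied once to the combined frame.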

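# lgb
train = lgb.Dataset(train_feat, label=label_x)
test = lgb.Dataset(testt_feat, label=label_y, reference=train)
params = {
    "boosting_type": "gbdt",
    "objective": "regression_l2",
    "metric": "l2",
    #"objective": "multiclass",
    #"metric": "multi_error",
    "min_child_weight": 3,
    "num_leaves": 2 ** 5,
    #"lambda_l2": 10,
    "subsample": 0.75,
    "colsample_bytree": 0.65,
    #"colsample_bylevel": 0.75,  # xgboost option, not a LightGBM parameter
    "learning_rate": 0.01,
    #"tree_method": "exact",  # xgboost option, not a LightGBM parameter
    "seed": 2018,
    "nthread": 12,
    "silent": True
}
num_round = 8000
gbm = lgb.train(params,
                train,
                num_round,
                # the test labels are the -1 placeholder, so the metric printed for that set is not meaningful
                valid_sets=[train, test],
                # log every 50 rounds; LightGBM versions before 4.0 used verbose_eval=50 instead
                callbacks=[lgb.log_evaluation(50)]
                )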

preds_sub = gbm.predict(testt_feat)

# save the results
submission["prediction_pay_price"] = preds_sub
nowTime = datetime.datetime.now().strftime("%m%d%H%M")  # current time as month, day, hour, minute
name = "lgb_" + nowTime + ".csv"
submission.to_csv(name, index=False)

# feature importance
features = pd.DataFrame()
features["features"] = gbm.feature_name()
features["importance"] = gbm.feature_importance()
features.sort_values(by=["importance"], ascending=False, inplace=True)

Version 2: ridge regression (score 97)

from sklearn.linear_model import Ridge
from scipy.sparse import coo_matrix

def offline_train():
    # split the combined frame `data` (built in version 1) back into train and test rows
    train_feat = data[data["prediction_pay_price"] > -1].fillna(0)
    testt_feat = data[data["prediction_pay_price"] <= -1].fillna(0)
    label_x = train_feat["prediction_pay_price"]
    label_y = testt_feat["prediction_pay_price"]  # placeholder labels, unused
    submission = testt_feat[["user_id"]].copy()
    train_feat = train_feat.drop("prediction_pay_price", axis=1)
    testt_feat = testt_feat.drop("prediction_pay_price", axis=1)
    train_feat = train_feat.drop("user_id", axis=1)
    testt_feat = testt_feat.drop("user_id", axis=1)
    train_feat = train_feat.drop("register_time", axis=1)
    testt_feat = testt_feat.drop("register_time", axis=1)
    offline_traindata = coo_matrix(train_feat.values)
    offline_testdata = coo_matrix(testt_feat.values)
    print(offline_traindata.shape[0], offline_testdata.shape[0])
    clf = Ridge()
    model = clf.fit(offline_traindata, label_x)
    res = model.predict(offline_testdata)
    submission["prediction_pay_price"] = res
    return submission  # returned so it can be saved outside the function

submission = offline_train()
nowTime = datetime.datetime.now().strftime("%m%d%H%M")  # current time as month, day, hour, minute
name = "Ridge_" + nowTime + ".csv"
submission.to_csv(name, index=False)
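
For readers wondering what ridge regression adds over plain linear regression: it minimizes squared error plus alpha times the squared size of the coefficients, which has a closed-form solution. The numpy sketch below illustrates that textbook formula with an unpenalized intercept. It is an illustration of the technique rather than scikit-learn's internal code, and `ridge_fit` and the toy numbers are made up.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Solves (Xc^T Xc + alpha * I) w = Xc^T yc on centered data,
    then recovers the intercept. Returns (coef, intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # centering keeps the intercept out of the penalty
    yc = y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = float(y_mean - x_mean @ coef)
    return coef, intercept

# Toy data generated from y = 1 + 2*x0 - x1 (no noise)
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - X[:, 1]

coef_ols, b_ols = ridge_fit(X, y, alpha=0.0)        # alpha = 0: plain least squares
coef_ridge, b_ridge = ridge_fit(X, y, alpha=100.0)  # heavy shrinkage
print("alpha=0:  ", coef_ols, b_ols)
print("alpha=100:", coef_ridge, b_ridge)
```

Larger alpha pulls the coefficients toward zero, which can stabilize a model with many correlated features at the cost of some bias.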

WeChat official account: Python或R人工智慧學習

ID: Python_R_wu