A Beginner's Complete Record of Kaggle Competitions


Contents:

1. Why Kaggle competitions matter
2. The competition workflow
3. Competition 1: Mercari Price Suggestion Challenge
4. Competition 2: Toxic Comment Classification Challenge (brief)

Main text:

1. Why Kaggle competitions matter: For students researching deep learning or data mining, or for people like me who are eager to switch careers into data science, Kaggle offers an excellent platform. It lets you verify whether your model works on other datasets, it serves as a good practice ground for beginners, and for experts it can even be a way to earn money.

2. The competition workflow:

First, go to the Kaggle site and click Competitions.

Choose a competition.

On the left you can see the competition description, the evaluation metric, the prizes, the FAQ, and the competition timeline.

The top bar gives access to the data, public kernels (shared code), the discussion forum, the leaderboard, the rules, and your team.

On the right you can review your team's submissions and submit new results.

(1) Data Exploration

This step is EDA (Exploratory Data Analysis): exploring the data to prepare for the processing and modeling that follow.
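A minimal sketch of a first pass with pandas (the file name train.csv is a placeholder):

import pandas as pd

# Load the data and get a first impression of its size and contents
df = pd.read_csv('train.csv')   # placeholder path
print(df.shape)                 # rows x columns
print(df.head())                # first five rows
print(df.dtypes)                # column types
print(df.describe())            # summary statistics of numerical columns
print(df.isnull().sum())        # missing values per column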

(2) Visualization

Visualization is usually done with two tools: matplotlib and seaborn.

The most commonly used charts include (a sketch follows the reference link below):

  • Check the distribution of the target variable. When it is imbalanced, performance can suffer badly, depending on the evaluation metric and the specific model used.
  • For a numerical variable, a box plot gives a direct view of its distribution.
  • For coordinate-like data, a scatter plot shows distribution trends and whether outliers exist.
  • For classification problems, plotting the data in different colors according to its label helps a lot with feature construction.
  • Plot pairwise distributions and correlations between variables.
  • Reference:

Python Data Visualizations | Kaggle (www.kaggle.com)
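A minimal sketch of these charts with matplotlib and a recent seaborn, on made-up toy data (all column names here are illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data standing in for a real training set
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'price': rng.lognormal(3, 1, 500),   # skewed target variable
    'weight': rng.rand(500) * 10,        # a numerical feature
    'label': rng.randint(0, 2, 500),     # a class label
})

# Distribution of the target variable
sns.histplot(df['price'])
plt.show()

# Box plot of a numerical variable
sns.boxplot(x=df['weight'])
plt.show()

# Scatter plot colored by label, useful for classification problems
sns.scatterplot(x='weight', y='price', hue='label', data=df)
plt.show()

# Pairwise distributions and correlations
sns.pairplot(df, hue='label')
plt.show()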

(3) Data Preprocessing

The usual data-preprocessing steps are (a short sketch follows the list):

  • Sometimes the data is scattered across several files and needs to be joined.
  • Handle missing data.
  • Handle outliers.
  • Convert the representation of certain categorical variables when necessary.
  • Some float variables may have been converted from unknown int variables; the precision lost in that conversion introduces unnecessary noise into the data, i.e. two values that were originally identical begin to differ at some decimal place. This can hurt the model, so the noise should be removed or weakened.
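A short pandas sketch of the missing-value and outlier steps (toy data; the median fill, 'missing' sentinel, and 99th-percentile clip are illustrative choices):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, np.nan, 12.0, 9999.0, 11.0],
    'brand': ['A', 'B', None, 'A', 'C'],
})

# Missing data: median for the numerical column, a sentinel for the categorical one
df['price'] = df['price'].fillna(df['price'].median())
df['brand'] = df['brand'].fillna('missing')

# Outliers: clip extreme values to the 99th percentile
df['price'] = df['price'].clip(upper=df['price'].quantile(0.99))

print(df)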

(4) Dummy Variables

For categorical variables, the usual approach is one-hot encoding: create a group of new dummy variables for the variable, one per possible value. Only the dummy corresponding to the row's actual value is 1; all the others are 0.
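A minimal sketch with pandas (the color column is made up):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One new dummy column per possible value; each row has exactly one "hot" entry
# (0/1 integers in older pandas, True/False booleans in newer versions)
dummies = pd.get_dummies(df['color'], prefix='color')
print(dummies)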

(5) Feature Engineering

Learning how to construct useful features is a continuous process of study and improvement.

In general, we should generate as many features as possible and trust the model to pick out the most useful ones.

After identifying the most important features, the results of various operations and combinations between them can be used as new features, which can bring unexpected improvements.
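For instance, a sketch of building new features by simple arithmetic on existing ones (column names are made up):

import pandas as pd

df = pd.DataFrame({'length': [2.0, 3.0, 5.0], 'width': [1.0, 4.0, 2.0]})

# New features built from operations between existing ones
df['area'] = df['length'] * df['width']
df['aspect_ratio'] = df['length'] / df['width']
df['total'] = df['length'] + df['width']

print(df)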

(6) Model Selection

Commonly used models:

Gradient Boosting

Random Forest

Extremely Randomized Trees (Extra Trees)

SVM

Linear Regression

Logistic Regression

Neural Networks

(7) Model Training

The most important part of training is tuning the parameters, which requires understanding the model well enough to know how each parameter affects performance.

Beyond that, there are cross-validation and ensemble generation.
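A minimal sketch of K-fold cross-validation with scikit-learn (random toy data; the Ridge model and MSE metric are stand-ins):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = rng.rand(100)

# Split the training set into 5 folds; each fold serves once as validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, valid_idx in kf.split(X):
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    scores.append(mean_squared_error(y[valid_idx], preds))

print('mean CV MSE:', np.mean(scores))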

3. Competition 1: Mercari Price Suggestion Challenge

Overview: this competition is a price-prediction problem. The Japanese shopping site Mercari provided product data, and the task is to design an algorithm that accurately predicts the price of a new product. The training data has 1,482,525 rows with 8 feature columns; the test data has 693,359 rows with 7 columns.

The columns are: id, product name (name), item condition (item_condition_id), category name (category_name), brand name (brand_name), price, shipping, and item description (item_description).

Examine the target variable, price, and the shipping column (the original post shows plots of both here).
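A sketch of how such plots might be drawn, assuming the competition files sit in the Kaggle input directory that the kernel below also uses:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_table('../input/mercari-price-suggestion-challenge/train.tsv')

# Prices are heavily right-skewed, so a log1p transform makes the histogram readable
plt.hist(np.log1p(train['price']), bins=50)
plt.xlabel('log1p(price)')
plt.ylabel('count')
plt.show()

# shipping is a binary flag (whether the shipping fee is paid by the seller)
print(train['shipping'].value_counts())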

There are many ways to examine the data; I won't go through them all here. Let's move straight to the code we submitted.

import gc
import time
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import PassiveAggressiveRegressor
import sys

Load the packages that will be needed.

# lasso (least absolute shrinkage and selection operator)
from sklearn.linear_model import Lasso
import time

start_time = time.time()
tcurrent = start_time

np.random.seed(4413)

sys.path.insert(0, '../input/wordbatch/wordbatch/')
import wordbatch
from wordbatch.extractors import WordBag, WordHash
from wordbatch.models import FTRL, FM_FTRL

from nltk.corpus import stopwords
import re

NUM_BRANDS = 4560
NUM_CATEGORIES = 1290  # the maximum number of brands and categories we will handle

develop = False
# develop = True

Now let's write the functions:

def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y0), 2)))

Compute the RMSLE (root mean squared logarithmic error) evaluation metric.
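A quick sanity check of the metric on toy values (uses the rmsle defined above):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])

# ~0.11 here; 0.0 would mean a perfect prediction
print(rmsle(y_true, y_pred))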

def split_cat(text):
    try:
        return text.split("/")
    except:
        return ("No Label", "No Label", "No Label")

Split the category text on "/"; if the text is missing, return "No Label" for all three levels.
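For example (using the split_cat defined above; the category string is a typical value from this dataset):

import numpy as np

# A normal category string splits into its three levels
print(split_cat('Men/Tops/T-shirts'))   # ['Men', 'Tops', 'T-shirts']

# NaN is a float with no .split, so it falls into the except branch
print(split_cat(np.nan))                # ('No Label', 'No Label', 'No Label')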

def handle_missing_inplace(dataset):
    dataset['general_cat'].fillna(value='missing', inplace=True)
    dataset['subcat_1'].fillna(value='missing', inplace=True)
    dataset['subcat_2'].fillna(value='missing', inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].fillna(value='missing', inplace=True)

Fill missing values: every column's missing entries are filled with the string 'missing'.

def cutting(dataset):
    pop_brand = dataset['brand_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_BRANDS]
    dataset.loc[~dataset['brand_name'].isin(pop_brand), 'brand_name'] = 'missing'
    pop_category1 = dataset['general_cat'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_CATEGORIES]
    pop_category2 = dataset['subcat_1'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_CATEGORIES]
    pop_category3 = dataset['subcat_2'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_CATEGORIES]
    dataset.loc[~dataset['general_cat'].isin(pop_category1), 'general_cat'] = 'missing'
    dataset.loc[~dataset['subcat_1'].isin(pop_category2), 'subcat_1'] = 'missing'
    dataset.loc[~dataset['subcat_2'].isin(pop_category3), 'subcat_2'] = 'missing'

The "pop" in pop_brand and pop_category presumably stands for "popular". This function cuts away brands and category names that occur too rarely, treating them all as the missing value 'missing'.

def to_categorical(dataset):
    dataset['general_cat'] = dataset['general_cat'].astype('category')
    dataset['subcat_1'] = dataset['subcat_1'].astype('category')
    dataset['subcat_2'] = dataset['subcat_2'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')

Convert the three columns split out of the category name, as well as item_condition_id, to the pandas category dtype.

stopwords = {x: 1 for x in stopwords.words('english')}
non_alphanums = re.compile(u'[^A-Za-z0-9]+')


def normalize_text(text):
    return u" ".join(
        [x for x in [y for y in non_alphanums.sub(' ', text).lower().strip().split(" ")]
         if len(x) > 1 and x not in stopwords])

Text normalization. I am not very familiar with regular expressions, so this part needs further study; roughly, the pattern [^A-Za-z0-9]+ matches runs of characters that are neither letters nor digits, and sub replaces each such run with a space, after which the text is lowercased, split into tokens, and filtered to drop single-character tokens and stopwords.
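A quick demo of what the function produces (the input string is made up; uses the normalize_text and stopwords defined above):

text = "Brand-NEW iPhone 7, 32GB!! A+ condition"
# Punctuation becomes spaces, everything is lowercased,
# and 1-character tokens ('7', 'a') and stopwords are dropped
print(normalize_text(text))  # 'brand new iphone 32gb condition'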

def main():
    start_time = time.time()
    from time import gmtime, strftime
    print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))  # timestamp

    # if 1 == 1:
    train = pd.read_table('../input/mercari-price-suggestion-challenge/train.tsv', engine='c')
    test = pd.read_table('../input/mercari-price-suggestion-challenge/test.tsv', engine='c')  # load the data
    # train = pd.read_table('../input/train.tsv', engine='c')
    # test = pd.read_table('../input/test.tsv', engine='c')

    print('[{}] Finished to load data'.format(time.time() - start_time))
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)

    nrow_test = train.shape[0]  # -dftt.shape[0]
    dftt = train[(train.price < 1.0)]
    train = train.drop(train[(train.price < 1.0)].index)  # drop samples with price below 1
    del dftt['price']
    nrow_train = train.shape[0]
    # print(nrow_train, nrow_test)
    y = np.log1p(train["price"])  # train on log1p(price); mapped back later with expm1
    merge: pd.DataFrame = pd.concat([train, dftt, test])
    submission: pd.DataFrame = test[['test_id']]

    del train
    del test
    gc.collect()

    merge['general_cat'], merge['subcat_1'], merge['subcat_2'] = \
        zip(*merge['category_name'].apply(lambda x: split_cat(x)))
    merge.drop('category_name', axis=1, inplace=True)
    print('[{}] Split categories completed.'.format(time.time() - start_time))  # split the category name

    handle_missing_inplace(merge)
    print('[{}] Handle missing completed.'.format(time.time() - start_time))  # handle missing values

    cutting(merge)
    print('[{}] Cut completed.'.format(time.time() - start_time))  # drop low-frequency values

    to_categorical(merge)
    print('[{}] Convert categorical completed'.format(time.time() - start_time))  # convert dtypes

    wb = wordbatch.WordBatch(normalize_text,
                             extractor=(WordBag, {"hash_ngrams": 2, "hash_ngrams_weights": [1.5, 1.0],
                                                  "hash_size": 2 ** 29, "norm": None, "tf": 'binary',
                                                  "idf": None,
                                                  }), procs=8)
    wb.dictionary_freeze = True
    X_name = wb.fit_transform(merge['name'])
    del(wb)
    X_name = X_name[:, np.array(np.clip(X_name.getnnz(axis=0) - 1, 0, 1), dtype=bool)]
    print('[{}] Vectorize `name` completed.'.format(time.time() - start_time))  # turn `name` into vectors with WordBatch

    wb = CountVectorizer()
    X_category1 = wb.fit_transform(merge['general_cat'])
    X_category2 = wb.fit_transform(merge['subcat_1'])
    X_category3 = wb.fit_transform(merge['subcat_2'])
    print('[{}] Count vectorize `categories` completed.'.format(time.time() - start_time))  # turn the categories into vectors

    # wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {"hash_ngrams": 3, "hash_ngrams_weights": [1.0, 1.0, 0.5],
    wb = wordbatch.WordBatch(normalize_text,
                             extractor=(WordBag, {"hash_ngrams": 2, "hash_ngrams_weights": [1.0, 1.0],
                                                  "hash_size": 2 ** 28, "norm": "l2", "tf": 1.0,
                                                  "idf": None}), procs=8)
    wb.dictionary_freeze = True
    X_description = wb.fit_transform(merge['item_description'])
    del(wb)
    X_description = X_description[:, np.array(np.clip(X_description.getnnz(axis=0) - 1, 0, 1), dtype=bool)]
    print('[{}] Vectorize `item_description` completed.'.format(time.time() - start_time))  # vectorize `item_description`

    lb = LabelBinarizer(sparse_output=True)
    X_brand = lb.fit_transform(merge['brand_name'])
    print('[{}] Label binarize `brand_name` completed.'.format(time.time() - start_time))  # binarize `brand_name` with LabelBinarizer

    X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']], sparse=True).values)
    print('[{}] Get dummies on `item_condition_id` and `shipping` completed.'.format(time.time() - start_time))
    # one-hot encode `item_condition_id` and `shipping` into discrete features

    print(X_dummies.shape, X_description.shape, X_brand.shape, X_category1.shape, X_category2.shape,
          X_category3.shape, X_name.shape)
    sparse_merge = hstack((X_dummies, X_description, X_brand, X_category1, X_category2, X_category3, X_name)).tocsr()
    print('[{}] Create sparse merge completed'.format(time.time() - start_time))

    del X_dummies, merge, X_description, lb, X_brand, X_category1, X_category2, X_category3, X_name
    gc.collect()
    # pd.to_pickle((sparse_merge, y), "xy.pkl")
    # else:
    #     nrow_train, nrow_test = 1481661, 1482535
    #     sparse_merge, y = pd.read_pickle("xy.pkl")

    # Remove features with document frequency <= 1
    print(sparse_merge.shape)
    mask = np.array(np.clip(sparse_merge.getnnz(axis=0) - 1, 0, 1), dtype=bool)
    sparse_merge = sparse_merge[:, mask]
    X = sparse_merge[:nrow_train]
    X_test = sparse_merge[nrow_test:]
    print(sparse_merge.shape)

    train_X, train_y = X, y
    if develop:
        train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.05, random_state=100)

    model = FTRL(alpha=0.01, beta=0.1, L1=0.00001, L2=1.0, D=sparse_merge.shape[1], iters=30,
                 inv_link="identity", threads=1)
    del X
    gc.collect()
    model.fit(train_X, train_y)
    print('[{}] Train FTRL completed'.format(time.time() - start_time))
    if develop:
        preds = model.predict(X=valid_X)
        print("FTRL dev RMSLE:", rmsle(np.expm1(valid_y), np.expm1(preds)))

    predsF = model.predict(X_test)
    print('[{}] Predict FTRL completed'.format(time.time() - start_time))
    # training and prediction with the FTRL model; worth a closer look if you are interested

    model = FM_FTRL(alpha=0.012, beta=0.01, L1=0.00001, L2=0.1, D=sparse_merge.shape[1], alpha_fm=0.01,
                    L2_fm=0.0, init_fm=0.01, D_fm=200, e_noise=0.0001, iters=17, inv_link="identity", threads=4)
    model.fit(train_X, train_y)
    del train_X, train_y
    gc.collect()
    print('[{}] Train ridge v2 completed'.format(time.time() - start_time))
    if develop:
        preds = model.predict(X=valid_X)
        print("FM_FTRL dev RMSLE:", rmsle(np.expm1(valid_y), np.expm1(preds)))

    predsFM = model.predict(X_test)
    print('[{}] Predict FM_FTRL completed'.format(time.time() - start_time))
    # training and prediction with the FM_FTRL model; worth a closer look if you are interested
    del X_test
    gc.collect()

    params = {
        'learning_rate': 0.65,
        'application': 'regression',
        'max_depth': 4,
        'num_leaves': 42,
        'verbosity': -1,
        'metric': 'RMSE',
        'data_random_seed': 1,
        'bagging_fraction': 0.71,
        'bagging_freq': 5,
        'feature_fraction': 0.67,
        'nthread': 4,
        'min_data_in_leaf': 120,
        'max_bin': 40
    }

    print(sparse_merge.shape)
    mask = np.array(np.clip(sparse_merge.getnnz(axis=0) - 100, 0, 1), dtype=bool)
    sparse_merge = sparse_merge[:, mask]
    X = sparse_merge[:nrow_train]
    X_test = sparse_merge[nrow_test:]
    print(sparse_merge.shape)
    del sparse_merge
    gc.collect()

    train_X, train_y = X, y
    if develop:
        train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.05, random_state=100)
    del X, y
    gc.collect()

    d_train = lgb.Dataset(train_X, label=train_y)
    # del train_X, train_y; gc.collect()
    watchlist = [d_train]
    if develop:
        d_valid = lgb.Dataset(valid_X, label=valid_y)
        del valid_y
        gc.collect()
        watchlist = [d_train, d_valid]

    # model = lgb.train(params, train_set=d_train, num_boost_round=7500, valid_sets=watchlist,
    #                   early_stopping_rounds=1000, verbose_eval=1000)
    model = lgb.train(params, train_set=d_train, num_boost_round=3000, valid_sets=watchlist,
                      early_stopping_rounds=1000, verbose_eval=1000)
    del d_train
    gc.collect()
    if develop:
        preds = model.predict(valid_X)
        del valid_X
        gc.collect()
        print("LGB dev RMSLE:", rmsle(np.expm1(valid_y), np.expm1(preds)))

    predsL = model.predict(X_test)
    # del X_test; gc.collect()
    print('[{}] Predict LGB completed.'.format(time.time() - start_time))
    # training and prediction with the LightGBM model; worth a closer look if you are interested

    # --- BEGIN Huber
    # Details: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html
    # class sklearn.linear_model.HuberRegressor(epsilon=1.35, max_iter=100, alpha=0.0001,
    #     warm_start=False, fit_intercept=True, tol=1e-05)
    setup_Huber = 0
    if setup_Huber == 1:
        model = HuberRegressor(fit_intercept=True, alpha=0.01, max_iter=80, epsilon=363)
    if setup_Huber == 2:
        model = HuberRegressor(fit_intercept=True, alpha=0.05, max_iter=200, epsilon=1.2)
    if setup_Huber == 3:
        model = HuberRegressor(fit_intercept=True, alpha=0.02, max_iter=200, epsilon=256)
    if setup_Huber > 0:
        model.fit(train_X, train_y)
        print('[{}] Predict Huber completed.'.format(time.time() - start_time))
        preds = model.predict(X=X_test)
    # --- END Huber
    # training and prediction with the Huber regression model; worth a closer look if you are interested

    # --- BEGIN PassiveAggressiveRegressor (PAR)
    # Details: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html
    # class sklearn.linear_model.PassiveAggressiveRegressor(C=1.0, fit_intercept=True, max_iter=None,
    #     tol=None, shuffle=True, verbose=0, loss='epsilon_insensitive', epsilon=0.1,
    #     random_state=None, warm_start=False, average=False, n_iter=None)
    setup_PAR = 1
    if setup_PAR == 1:
        model = PassiveAggressiveRegressor(C=1.05, fit_intercept=True, loss='epsilon_insensitive',
                                           max_iter=120, random_state=433)
    if setup_PAR > 0:
        model.fit(train_X, train_y)
        print('[{}] Predict PAR completed.'.format(time.time() - start_time))
        preds = model.predict(X=X_test)
    # --- END PAR
    # training and prediction with the PAR regression model; worth a closer look if you are interested

    # modified setup (IT NEEDS MORE TUNING TESTS)
    # weighted blend of the four models; a better single model usually gets a higher weight
    w = (0.10, 0.11, 0.23, 0.56)
    preds = preds * w[0] + predsF * w[1] + predsL * w[2] + predsFM * w[3]

    submission['price'] = np.expm1(preds)
    submission.to_csv("sub ftrl_fm_lgb_PAR.csv", index=False)
    nm = (time.time() - start_time) / 60
    print("Total processing time %s min" % nm)


if __name__ == '__main__':
    main()

That completes the walkthrough of the code for competition 1. Competition 2 followed much the same process, except that it added a stacking step. Unlike the blending just described, which is a weighted sum of predictions, stacking is a different model-fusion method; for details see this Zhihu column, with a minimal sketch after the link:

Leon: Kaggle機器學習之模型融合(stacking)心得 (zhuanlan.zhihu.com)
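For reference, a minimal stacking sketch with scikit-learn on random toy data (the base and meta models here are illustrative, not the ones used in the competition):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.rand(200)

base_models = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=50, random_state=0)]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions of each base model become the meta-model's features,
# so the meta-model never sees predictions made on a base model's own training data
oof = np.zeros((len(X), len(base_models)))
for j, model in enumerate(base_models):
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx, j] = model.predict(X[valid_idx])

# Second level: learn how to combine the base models
meta = Ridge(alpha=1.0)
meta.fit(oof, y)
print('meta-model weights:', meta.coef_)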

This code is based on someone else's work, not my own original; my own code is honestly too messy to post, and the overall approach also follows that expert's.

This is my first article, so if anything here infringes on someone's rights, I will remove it immediately!

