Good News for the Plant-Blind: The Kaggle Plant Seedlings Classification Competition

02-16

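The Lunar New Year is almost here, and I get to go back to my rustic hometown again~~ As the son of farmers, and a "code farmer" myself (just kidding), I have spent the whole year missing the all-natural, pollution-free organic vegetable garden back home. That said, this code farmer is actually plant-blind: I often cannot tell one plant from another.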

Kaggle hosts a competition called Plant Seedlings Classification, and it rather appeals to a code farmer like me:

Can you differentiate a weed from a crop seedling?
The ability to do so effectively can mean better crop yields and better stewardship of the environment.

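The competition provides a dataset with images of 12 different plant species for us to classify. On RussellCloud we have reproduced the most-upvoted public kernel on Kaggle. It relies on a pretrained deep neural network model, so let's see how it works.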

Approach

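Image recognition and classification bring the ImageNet competition to mind, which produced many excellent image recognition models. Our competition is also an image recognition task, so we might as well pick a model that performs well on ImageNet. A look at our dataset shows only about four to five thousand images, so defining and training a deep neural network from scratch would be difficult in practice. Instead, we can use transfer learning with a pretrained model.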

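What is transfer learning, and why use it?
Transfer learning, or inductive transfer, is a research problem in machine learning that focuses on storing the knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars may apply when recognizing trucks.
Transfer learning lets us reuse model parameters that have already been learned, which speeds up and improves the training of a new model. Here it helps us get good results from relatively little training data.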

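The idea behind this project is fairly simple: use a pretrained model to extract image features, then classify those features with logistic regression: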

Transfer learning for plant seedling classification

Let's look at some code to understand it better:

    import os
    import numpy as np
    import pandas as pd

    # Define the 12 labels and count them
    CATEGORIES = ['Black-grass', 'Charlock', 'Cleavers', 'Common Chickweed', 'Common wheat', 'Fat Hen',
                  'Loose Silky-bent', 'Maize', 'Scentless Mayweed', 'Shepherds Purse',
                  'Small-flowered Cranesbill', 'Sugar beet']
    NUM_CATEGORIES = len(CATEGORIES)
    # Use 200 images per label as training data
    SAMPLE_PER_CATEGORY = 200
    # Random seed for splitting the training and validation sets
    SEED = 1987
    # Data directories
    data_dir = '../input/'
    train_dir = os.path.join(data_dir, 'train/train')
    test_dir = os.path.join(data_dir, 'test/test')
    # Print the number of images for each label
    for category in CATEGORIES:
        print('{} {} images'.format(category, len(os.listdir(os.path.join(train_dir, category)))))
    # Collect every image path, label ID, and label name into a pandas DataFrame
    train = []
    for category_id, category in enumerate(CATEGORIES):
        for file in os.listdir(os.path.join(train_dir, category)):
            train.append(['{}/{}/{}'.format(train_dir, category, file), category_id, category])
    train = pd.DataFrame(train, columns=['file', 'category_id', 'category'])
    # Keep the configured number of rows per label, shuffle, and reset the index
    train = pd.concat([train[train['category'] == c][:SAMPLE_PER_CATEGORY] for c in CATEGORIES])
    train = train.sample(frac=1)
    train.index = np.arange(len(train))

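This gives us a DataFrame with 12 labels and 200 rows per label, each row holding the path of one training image. Next we split it into training and validation data: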

    # Draw one random number per row (2,400 in total), seeded with SEED
    np.random.seed(seed=SEED)
    rnd = np.random.random(len(train))
    # Rows with a random number below 0.8 go to train_idx; the rest go to valid_idx
    train_idx = rnd < 0.8
    valid_idx = rnd >= 0.8
    # Labels used as y: ytr for training, yv for validation
    ytr = train.loc[train_idx, 'category_id'].values
    yv = train.loc[valid_idx, 'category_id'].values
    len(ytr), len(yv)
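Because the split is drawn from random numbers rather than by counting rows, the two parts come out only approximately 80/20. The sketch below reproduces the idea with Python's standard `random` module instead of NumPy, so the exact rows chosen will differ from the notebook's. `split_by_random_mask` is a hypothetical helper name, not something from the kernel.

```python
import random

def split_by_random_mask(n_rows, train_fraction=0.8, seed=1987):
    """Assign each row index to train or validation using one random draw per row."""
    rng = random.Random(seed)  # seeded, so the split is repeatable
    draws = [rng.random() for _ in range(n_rows)]
    train_rows = [i for i, r in enumerate(draws) if r < train_fraction]
    valid_rows = [i for i, r in enumerate(draws) if r >= train_fraction]
    return train_rows, valid_rows

train_rows, valid_rows = split_by_random_mask(2400)
# Every row lands in exactly one of the two sets.
print(len(train_rows), len(valid_rows), len(train_rows) + len(valid_rows))
```

If an exact 80/20 split or balanced classes in each part matter, a stratified splitter such as scikit-learn's `train_test_split` with its `stratify` argument is one alternative.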

Next, load the training images:

    from keras.applications import xception
    from tqdm import tqdm

    # The pretrained Xception model expects 299*299*3 input; read each image into a numpy array
    # (read_img is a helper from the notebook that is not shown in this post)
    INPUT_SIZE = 299
    POOLING = 'avg'
    x_train = np.zeros((len(train), INPUT_SIZE, INPUT_SIZE, 3), dtype='float32')
    for i, file in tqdm(enumerate(train['file'])):
        img = read_img(file, (INPUT_SIZE, INPUT_SIZE))
        x = xception.preprocess_input(np.expand_dims(img.copy(), axis=0))
        x_train[i] = x
    print('Train Images shape: {} size: {:,}'.format(x_train.shape, x_train.size))
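Two details of this step are easy to check by hand. As far as I know, Keras's `xception.preprocess_input` rescales pixel values from [0, 255] to [-1, 1], and a float32 array holding 2,400 images at 299×299×3 is fairly large. Here is a minimal sketch of both in plain Python. `scale_pixel` and `array_bytes` are illustrative names, and the scaling formula is my reading of Keras's behavior rather than something stated in the kernel.

```python
def scale_pixel(value):
    """Map a pixel from [0, 255] to [-1, 1] (assumed Xception-style scaling)."""
    return value / 127.5 - 1.0

def array_bytes(n_images, side=299, channels=3, bytes_per_value=4):
    """Memory needed for a float32 image array of shape (n, side, side, channels)."""
    return n_images * side * side * channels * bytes_per_value

print(scale_pixel(0), scale_pixel(255))
print(array_bytes(2400) / 1024 ** 3)  # size in GiB
```

The array comes to roughly 2.4 GiB, which is worth keeping in mind before raising `SAMPLE_PER_CATEGORY`.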

With the images loaded, extract features with the pretrained Xception model:

    # Split the image array using the train/valid masks created earlier
    Xtr = x_train[train_idx]
    Xv = x_train[valid_idx]
    print((Xtr.shape, Xv.shape, ytr.shape, yv.shape))
    xception_bottleneck = xception.Xception(weights='imagenet', include_top=False, pooling=POOLING)
    # Extract features with the pretrained model
    train_x_bf = xception_bottleneck.predict(Xtr, batch_size=32, verbose=1)
    valid_x_bf = xception_bottleneck.predict(Xv, batch_size=32, verbose=1)
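With `include_top=False` and `pooling='avg'`, the network's final feature map is averaged over its spatial positions, leaving one number per channel (2,048 for Xception, to my knowledge). Here is a toy illustration of that pooling step in plain Python on a tiny 2×2×3 feature map. `global_average_pool` is a hypothetical name and the values are made up.

```python
def global_average_pool(feature_map):
    """Average a height x width x channels nested list over height and width."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for c, value in enumerate(cell):
                totals[c] += value
    return [total / (height * width) for total in totals]

# Toy 2 x 2 feature map with 3 channels (made-up values).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
pooled = global_average_pool(toy_map)
print(pooled)
```

Each image therefore ends up as a single fixed-length vector, which is what makes a simple classifier such as logistic regression usable on top.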

Next, fit a logistic regression on the features and validate it:

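    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Fit the logistic regression
    logreg = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=SEED)
    logreg.fit(train_x_bf, ytr)
    # Check validation accuracy
    valid_probs = logreg.predict_proba(valid_x_bf)
    valid_preds = logreg.predict(valid_x_bf)
    print('Validation Xception Accuracy {}'.format(accuracy_score(yv, valid_preds)))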
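For intuition about what the classifier does with each feature vector: a multinomial logistic regression turns one linear score per class into probabilities with a softmax and predicts the class with the highest probability. Accuracy is then the share of predictions that match the labels. Here is a small plain-Python sketch with made-up scores (the function names are illustrative, not scikit-learn's API):

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Index of the most probable class."""
    probs = softmax(scores)
    return probs.index(max(probs))

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up scores for 4 images over 3 classes.
all_scores = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0], [1.0, 0.9, 0.8]]
labels = [0, 1, 2, 1]
preds = [predict(s) for s in all_scores]
print(preds, accuracy(labels, preds))
```

In the notebook, `predict_proba` plays the role of `softmax` and `predict` the role of the argmax, applied to scores that the fitted model computes from the Xception features.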

Reproducing on RussellCloud

Before you start:

Register a RussellCloud account

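If you do not have an invitation code, you can ask for one by posting in the RussellCloud community; every registered user also gets 5 invitation codes.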

RussellCloud community: New posts - RussellCloud

Install the russell-cli command-line tool
Clone the project files; the Git repository is RussellCloud/plant-seedlings-classification

    # Clone the code
    $ git clone https://github.com/RussellCloud/plant-seedlings-classification

Log in from the command line:

    # Use the russell login command
    $ russell login

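Enter y, log in on the web page, then copy your account Token from the page, paste it into the terminal, and press Enter. If you are using the Windows command prompt, pasting may not work; in that case, right-click the window to paste.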

On a successful login you will see:

Login Successful as XXX

Create a project:

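Go to the RussellCloud home page, open the console, and create a new project. Any project name will do, but the default container environment must be set to keras.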

Creating a project on the web page

Initialize the project:

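Once the project is created, remember to copy the overview ID from the project home page; it is needed to initialize the project.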

    # Bind the remote project; <project_id> is the project overview ID copied from the web page
    $ russell init --id <project_id>

On a successful initialization you will see:

Project "XXX" initialized in current directory

Run the project:

Our project is an IPython Notebook project, so we start it in Jupyter mode and add the GPU flag.

    # Start the project in jupyter mode, mounting the training dataset, the test dataset, and the pretrained-model dataset
    $ russell run --mode jupyter --data f22acb30ae3948e48990fedaf61a0ae6:train --data ba0b93ee80854dfeaf0be5284ee2449a:test --data 4b8076e2928e4cb3ac04e69a9bc073f7:model --gpu

On success the output looks like the following, and a web page opens automatically so we can work in the Notebook:

    RUN ID                            NAME                                     VERSION
    --------------------------------  ---------------------------------------  ---------
    4de3e17ca14a4bb780d38289cf3b376e  Kaggle/plant-seedlings-classification:4  4
    Setting up your instance and waiting for Jupyter notebook to become available ...
    Path to jupyter notebook: http://gpu.russellcloud.com/notebook/4de3e17ca14a4bb780d38289cf3b376e/
    To view logs enter:
        russell logs 4de3e17ca14a4bb780d38289cf3b376e

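With this many datasets mounted, it is best to check that they were mounted correctly. The project Notebook already contains some cells for checking the files; just run them in order in the Notebook to see the result.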
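If you want a quick check of your own in addition to the notebook's cells, something like the following may do. It assumes the datasets appear as `train`, `test`, and `model` directories under `../input`, matching the names passed to `--data` and the `data_dir` used earlier. The actual mount location on RussellCloud may differ, and `summarize_mounts` is a hypothetical helper.

```python
import os

def summarize_mounts(base_dir, names):
    """Return {name: entry count, or None if the directory is missing}."""
    summary = {}
    for name in names:
        path = os.path.join(base_dir, name)
        summary[name] = len(os.listdir(path)) if os.path.isdir(path) else None
    return summary

# Assumed mount point and names; adjust to where the datasets actually appear.
for name, count in summarize_mounts("../input", ["train", "test", "model"]).items():
    print(name, "missing" if count is None else "{} entries".format(count))
```

A directory reported as missing or with zero entries is a hint to recheck the dataset IDs given to `russell run`.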