Want to Train Neural Networks on Google's Resources for Free? A Detailed Colab Tutorial (original article by Jinkey)

02-02

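Original link: https://jinkey.ai/post/tech/xiang-mian-fei-yong-gu-ge-zi-yuan-xun-lian-shen-jing-wang-luo-colab-xiang-xi-shi-yong-jiao-cheng
Author: Jinkey (WeChat official account jinkey-love, website https://jinkey.ai)
This article may be reprinted with attribution and without alteration. Reprinting it with this copyright paragraph deleted or modified is regarded as an infringement of intellectual property rights, and we reserve the right to pursue legal liability.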

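1 Introduction

Colab is Google's internal, Jupyter Notebook-like interactive Python environment. It needs no installation, lets you switch quickly between Python 2 and Python 3, supports the Google ecosystem (TensorFlow, BigQuery, Google Drive, and so on), and lets you install any custom library with pip. URL: https://colab.research.google.com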

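2 Installing and using libraries

Colab ships with TensorFlow, Matplotlib, NumPy, Pandas, and other basic deep learning libraries. If you need further dependencies, such as Keras, create a new code cell and enter: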

# Install the latest version of Keras
# https://keras.io/
!pip install keras
# Install a specific version
!pip install keras==2.0.9
# Install OpenCV
# https://opencv.org/
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python
# Install PyTorch
# http://pytorch.org/
!pip install -q http://download.pytorch.org/whl/cu75/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl torchvision
# Install XGBoost
# https://github.com/dmlc/xgboost
!pip install -q xgboost
# Install 7Zip
!apt-get -qq install -y libarchive-dev && pip install -q -U libarchive
# Install GraphViz and PyDot
!apt-get -qq install -y graphviz && pip install -q pydot
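Before installing anything, it can help to check whether a library is already present in the runtime. The following is a minimal sketch that uses only the standard library and assumes a Python 3.8+ runtime; the helper names `is_importable` and `installed_version` are hypothetical and not part of Colab.

```python
import importlib.util
from importlib import metadata  # available from Python 3.8


def is_importable(module_name):
    """Return True if a top-level module can be imported in this runtime."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report on a few of the libraries mentioned above.
for name in ["keras", "xgboost", "pydot"]:
    print(name, "importable:", is_importable(name), "version:", installed_version(name))
```

If a library is reported as missing, run the matching `!pip install` line from the cell above.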

3 Google Drive file operations

Authorization and login

For a given notebook you only need to log in once; after that you can read and write files.

# Install the PyDrive library; this only needs to run once per notebook
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authorize and log in; authentication is only requested the first time
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

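After this code runs, an authorization prompt is printed. Click the link to authorize, copy the token into the input box, and press Enter to complete the login.

Listing a directory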

# List all files in the root directory
# For the "q" query syntax, see https://developers.google.com/drive/v2/web/search-parameters
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))

The console prints:

title: Colab 測試, id: 1cB5CHKSdL26AMXQ5xrqk2kaBv5LSkIsJ8HuEDyZpeqQ, mimeType: application/vnd.google-apps.document
title: Colab Notebooks, id: 1U9363A12345TP2nSeh2K8FzDKSsKj5Jj, mimeType: application/vnd.google-apps.folder

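Here id is the unique identifier used to fetch files in the rest of this tutorial. From mimeType you can tell that Colab 測試 is a Google Docs document, while Colab Notebooks is a folder (the root directory where Colab stores its notebooks). To list the files inside the Colab Notebooks folder, write the query like this: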

# '<folder id>' in parents (the id below is that of the Colab Notebooks folder)
file_list = drive.ListFile({'q': "'1U9363A12345TP2nSeh2K8FzDKSsKj5Jj' in parents and trashed=false"}).GetList()
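The query string and the mimeType check can be wrapped in small helpers. This is only a sketch: `children_query` and `is_folder` are hypothetical names, and the only facts taken from the listing above are the query format and the folder mimeType.

```python
FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"


def children_query(folder_id="root", include_trashed=False):
    """Build a Drive `q` string that selects the direct children of a folder."""
    query = "'%s' in parents" % folder_id
    if not include_trashed:
        query += " and trashed=false"
    return query


def is_folder(entry):
    """Return True if a file entry (a dict with a mimeType key) is a folder."""
    return entry["mimeType"] == FOLDER_MIME_TYPE


print(children_query())
print(children_query("1U9363A12345TP2nSeh2K8FzDKSsKj5Jj"))
```

The returned string can be passed as the 'q' value to drive.ListFile, and `is_folder` can be applied to each entry of the resulting list.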

Reading file contents

So far, the only format I have tested that can be read directly is .txt (mimeType: text/plain). The code to read it:

file = drive.CreateFile({'id': "replace with your .txt file id"})
file.GetContentString()

For a .csv file, GetContentString() only prints the first row of data, so use GetContentFile() instead:

file = drive.CreateFile({'id': "replace with your .csv file id"})
# This download only caches the file in the Colab environment; it does not create an extra copy in your Google Drive
file.GetContentFile('iris.csv', "text/csv")

# Print the file contents directly
with open('iris.csv') as f:
    print(f.readlines())
# Read it with pandas
import pandas as pd
pd.read_csv('iris.csv', index_col=[0,1], skipinitialspace=True)

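Colab renders the result directly as a table (the figure in the original post shows the first few rows of the iris dataset). The iris dataset is available at http://aima.cs.berkeley.edu/data/iris.csv ; readers following along can upload it to their own Google Drive.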
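Once the file is cached locally, it is an ordinary file and can also be read without pandas. The sketch below uses the standard csv module on a two-row sample written to a temporary directory; it assumes the file has no header row and that the last column is the species name. `read_rows` is a hypothetical helper.

```python
import csv
import os
import tempfile

# Two sample rows in the same layout as iris.csv (no header row is assumed).
sample = "5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n"

path = os.path.join(tempfile.mkdtemp(), "iris_sample.csv")
with open(path, "w", newline="") as f:
    f.write(sample)


def read_rows(csv_path):
    """Read a CSV file into a list of rows, converting numeric fields to float."""
    rows = []
    with open(csv_path, newline="") as f:
        for record in csv.reader(f):
            *measurements, species = record
            rows.append([float(x) for x in measurements] + [species])
    return rows


rows = read_rows(path)
print(rows)
```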

Writing files

# Create a text file
uploaded = drive.CreateFile({'title': 'example.txt'})
uploaded.SetContentString('test content')
uploaded.Upload()
print('Created file with id {}'.format(uploaded.get('id')))

For more operations, see http://pythonhosted.org/PyDrive/filemanagement.html

4 Google Sheets spreadsheet operations

Authorization and login

For a given notebook you only need to log in once; after that you can read and write spreadsheets.

!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

Reading

For the demonstration, import the data from iris.csv into a new Google Sheets file. It can live in any directory of your Google Drive.

worksheet = gc.open('iris').sheet1

# Get a list of rows:
# [[row 1 col 1, row 1 col 2, ..., row 1 col n], ..., [row n col 1, row n col 2, ..., row n col n]]
rows = worksheet.get_all_values()
print(rows)

# Read it with pandas
import pandas as pd
pd.DataFrame.from_records(rows)

The printed result is:

[['5.1', '3.5', '1.4', '0.2', 'setosa'], ['4.9', '3', '1.4', '0.2', 'setosa'], ...
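Because get_all_values() returns every cell as a string, the numeric columns usually need converting before use. A minimal sketch follows; the column names are assumptions (the sheet itself has no header row) and `rows_to_records` is a hypothetical helper.

```python
# Assumed column names; the sheet built from iris.csv carries no header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def rows_to_records(rows, columns=COLUMNS):
    """Turn a list of string rows into dicts, converting numeric cells to float."""
    records = []
    for row in rows:
        record = {}
        for name, cell in zip(columns, row):
            try:
                record[name] = float(cell)
            except ValueError:
                record[name] = cell  # keep non-numeric cells (the species) as text
        records.append(record)
    return records


rows = [["5.1", "3.5", "1.4", "0.2", "setosa"], ["4.9", "3", "1.4", "0.2", "setosa"]]
records = rows_to_records(rows)
print(records[0])
```

A list of dicts like this can be handed to pd.DataFrame directly, which then gets named, typed columns.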

Writing

sh = gc.create('Google Sheet')

# Open the spreadsheet and its first worksheet
worksheet = gc.open('Google Sheet').sheet1
cell_list = worksheet.range('A1:C2')

import random
for cell in cell_list:
    cell.value = random.randint(1, 10)
worksheet.update_cells(cell_list)
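The range 'A1:C2' covers six cells, so the loop above assigns six random values. The sketch below expands such a range into its cell labels to make that concrete. It handles only single-letter columns and assumes the cells are listed row by row; `expand_a1_range` is a hypothetical helper and not part of gspread.

```python
import re


def expand_a1_range(a1_range):
    """List the cell labels covered by a range such as 'A1:C2', row by row.

    Only single-letter columns (A to Z) are handled in this sketch.
    """
    match = re.fullmatch(r"([A-Z])(\d+):([A-Z])(\d+)", a1_range)
    if match is None:
        raise ValueError("unsupported range: %r" % a1_range)
    first_col, first_row, last_col, last_row = match.groups()
    labels = []
    for row in range(int(first_row), int(last_row) + 1):
        for col in range(ord(first_col), ord(last_col) + 1):
            labels.append("%s%d" % (chr(col), row))
    return labels


print(expand_a1_range("A1:C2"))
```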

5 Downloading files to your local machine

from google.colab import files
with open('example.txt', 'w') as f:
    f.write('test content')
files.download('example.txt')
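The google.colab module exists only inside a Colab runtime, so the same notebook fails at the import when run elsewhere. One possible guard is sketched below; `save_and_download` is a hypothetical helper, and writing to a temporary directory is a choice made for this example.

```python
import os
import tempfile

try:
    from google.colab import files  # only available inside a Colab runtime
except ImportError:
    files = None


def save_and_download(filename, text, directory=None):
    """Write `text` to a file and, when running in Colab, offer it for download."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, filename)
    with open(path, "w") as f:
        f.write(text)
    if files is not None:
        files.download(path)
    return path


saved_path = save_and_download("example.txt", "test content")
print(saved_path)
```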

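6 Hands-on example

Here I use my open-source LSTM text classification project on GitHub as the example: https://github.com/Jinkeycode/keras_lstm_chinese_document_classification . Store the three files from the master/data directory in your Google Drive. The example classifies titles into three categories: health, technology, and design.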

Create a notebook

Create a new Python 2 notebook in Colab.

Install dependencies

!pip install keras
!pip install jieba
!pip install h5py

import h5py
import jieba as jb
import numpy as np
import keras as krs
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

Load the data

Authorization and login

# Install the PyDrive library; this only needs to run once per notebook
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def login_google_drive():
    # Authorize and log in; authentication is only requested the first time
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    return drive

List all files in Google Drive

def list_file(drive):
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
    for file1 in file_list:
        print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))


drive = login_google_drive()
list_file(drive)

Cache the data in the working environment

def cache_data():
    # Replace each id with the corresponding file id printed in the previous step
    health_txt = drive.CreateFile({'id': "117GkBtuuBP3wVjES0X0L4wVF5rp5Cewi"})
    tech_txt = drive.CreateFile({'id': "14sDl4520Tpo1MLPydjNBoq-QjqOKk9t6"})
    design_txt = drive.CreateFile({'id': "1J4lndcsjUb8_VfqPcfsDeOoB21bOLea3"})
    # This download only caches the files in the Colab environment; it does not create extra copies in your Google Drive

    health_txt.GetContentFile('health.txt', "text/plain")
    tech_txt.GetContentFile('tech.txt', "text/plain")
    design_txt.GetContentFile('design.txt', "text/plain")

    print("Cached successfully")

cache_data()

Read the data from the working environment

def load_data():
    titles = []
    print("Loading the health category...")
    with open("health.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("Loading the technology category...")
    with open("tech.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("Loading the design category...")
    with open("design.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("Loaded %s titles in total" % len(titles))

    return titles

titles = load_data()

Load the labels
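The three near-identical blocks in load_data can be folded into a loop that also records how many titles each file contributed. This is a sketch for Python 3 with assumptions that differ from the code above: files are read as UTF-8 and blank lines are skipped. `load_titles` is a hypothetical helper, and the tiny files it reads here are made up so that the block runs on its own.

```python
import os
import tempfile


def load_titles(paths):
    """Read one title per line from each file; return (titles, counts per file)."""
    titles, counts = [], []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
        titles.extend(lines)
        counts.append(len(lines))
    return titles, counts


# Tiny made-up files stand in for health.txt, tech.txt and design.txt.
workdir = tempfile.mkdtemp()
samples = {
    "health.txt": ["title a", "title b"],
    "tech.txt": ["title c"],
    "design.txt": ["title d"],
}
paths = []
for name in ["health.txt", "tech.txt", "design.txt"]:
    path = os.path.join(workdir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(samples[name]) + "\n")
    paths.append(path)

titles, counts = load_titles(paths)
print(titles, counts)
```

Keeping the per-file counts makes it possible to check them against the number of labels generated for each category.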

def load_label():
    arr0 = np.zeros(shape=[12000, ])
    arr1 = np.ones(shape=[12000, ])
    arr2 = np.array([2]).repeat(7318)
    target = np.hstack([arr0, arr1, arr2])
    print("Loaded %s labels in total" % target.shape)

    encoder = LabelEncoder()
    encoder.fit(target)
    encoded_target = encoder.transform(target)
    dummy_target = krs.utils.np_utils.to_categorical(encoded_target)

    return dummy_target

target = load_label()

Text preprocessing
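load_label hard-codes the sizes 12000, 12000 and 7318, one per category, and then one-hot encodes the class index. The plain-Python sketch below shows the same idea on toy counts; `make_labels` and `one_hot` are hypothetical helpers, not the Keras or scikit-learn functions used above.

```python
def make_labels(counts):
    """Return one integer class label per title: 0 for the first file, 1 for the next, and so on."""
    labels = []
    for class_index, count in enumerate(counts):
        labels.extend([class_index] * count)
    return labels


def one_hot(labels, num_classes):
    """Encode integer labels as one-hot rows, e.g. 2 -> [0.0, 0.0, 1.0]."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)] for label in labels]


labels = make_labels([2, 1, 1])  # toy counts; the article uses 12000, 12000 and 7318
print(one_hot(labels, num_classes=3))
```

With the article's counts, make_labels yields 31318 labels, which has to equal the number of titles loaded in the previous step.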

max_sequence_length = 30
embedding_size = 50

# Segment each title into words, separated by spaces
titles = [" ".join(jb.cut(t, cut_all=True)) for t in titles]

# Map each word to a vocabulary index and pad every title to a fixed length
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length, min_frequency=1)
text_processed = np.array(list(vocab_processor.fit_transform(titles)))

# Read the word-to-index mapping
dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(dict.items(), key = lambda x : x[1])

Build the neural network

Embedding and LSTM form the first two layers here, and the output goes through a softmax activation.
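To make the preprocessing step concrete, here is a simplified plain-Python stand-in for what the vocabulary processor produces: each word becomes an integer id and every title is cut or padded to a fixed length. It is not TensorFlow's implementation. Using id 0 for both unknown words and padding, and the way `min_frequency` is applied, are assumptions of this sketch; `build_vocab` and `encode` are hypothetical helpers.

```python
from collections import Counter


def build_vocab(tokenized_titles, min_frequency=1):
    """Map each token seen at least `min_frequency` times to an id; 0 is reserved."""
    counts = Counter(token for title in tokenized_titles for token in title.split())
    vocab = {}
    for title in tokenized_titles:
        for token in title.split():
            if counts[token] >= min_frequency and token not in vocab:
                vocab[token] = len(vocab) + 1  # ids start at 1
    return vocab


def encode(title, vocab, max_length):
    """Convert a space-separated title to a fixed-length list of ids."""
    ids = [vocab.get(token, 0) for token in title.split()][:max_length]
    return ids + [0] * (max_length - len(ids))


docs = ["learn design tips", "design tips"]
vocab = build_vocab(docs)
print(encode("learn new design", vocab, max_length=5))
```

In the example, "new" is not in the vocabulary and maps to 0, and the three-word title is padded with zeros up to length 5.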

# Configure the network structure
def build_network(num_vocabs):
    model = krs.Sequential()
    model.add(krs.layers.Embedding(num_vocabs, embedding_size, input_length=max_sequence_length))
    model.add(krs.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2))
    model.add(krs.layers.Dense(3))
    model.add(krs.layers.Activation("softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model

num_vocabs = len(dict.items())
model = build_network(num_vocabs=num_vocabs)

import time
start = time.time()
# Train the model
model.fit(text_processed, target, batch_size=512, epochs=10, )
finish = time.time()
print("Training took %f seconds" % (finish - start))

Predict a sample
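As a rough size check for this architecture, the number of trainable parameters per layer can be worked out by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer with bias, and a dense layer; the vocabulary size of 10000 is a hypothetical value for illustration, since the real one depends on the data.

```python
def embedding_params(num_vocabs, embedding_size):
    """One trainable vector of length `embedding_size` per vocabulary entry."""
    return num_vocabs * embedding_size


def lstm_params(input_size, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_size + units) + units)


def dense_params(input_size, units):
    """A weight matrix plus one bias per output unit."""
    return input_size * units + units


num_vocabs = 10000  # hypothetical vocabulary size, for illustration only
total = embedding_params(num_vocabs, 50) + lstm_params(50, 32) + dense_params(32, 3)
print(lstm_params(50, 32), dense_params(32, 3), total)
```

With a vocabulary of any realistic size, nearly all the parameters sit in the embedding table, so vocabulary size dominates the model's footprint.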

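sen can be replaced with your own sentence. The prediction has the form [probability of health, probability of technology, probability of design]. The category with the highest probability is the predicted one, but when the highest probability is below 0.8 the article is treated as unclassifiable.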

# Sample title: "Tips you need to learn to do good commercial design"
sen = "做好商業設計需要學習的小技巧"
sen_prosessed = " ".join(jb.cut(sen, cut_all=True))
sen_prosessed = vocab_processor.transform([sen_prosessed])
sen_prosessed = np.array(list(sen_prosessed))
result = model.predict(sen_prosessed)

catalogue = list(result[0]).index(max(result[0]))
threshold = 0.8
if max(result[0]) > threshold:
    if catalogue == 0:
        print("This is an article about health")
    elif catalogue == 1:
        print("This is an article about technology")
    elif catalogue == 2:
        print("This is an article about design")
else:
    print("This article has no reliable category")
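The thresholding rule can be isolated from the model and tried on hand-written probability vectors. In this sketch, `classify` is a hypothetical helper and the probabilities are made up. It follows the strict "greater than" comparison used in the code above.

```python
CATEGORIES = ["health", "technology", "design"]


def classify(probabilities, threshold=0.8):
    """Return the most likely category, or None if its probability is not above the threshold."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] > threshold:
        return CATEGORIES[best_index]
    return None


print(classify([0.05, 0.05, 0.90]))  # confident prediction
print(classify([0.40, 0.35, 0.25]))  # no category is confident enough
```

Returning None instead of printing keeps the decision reusable, for example when scoring many titles in a loop.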