Beginner Tutorial | Use Colab to Get the Most out of Google's Deep Learning Suite

04-05

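Preface

Colab is a Python notebook tool that Google released to the public not long ago. It works with Google's ecosystem, including TensorFlow, BigQuery, and Google Drive. This article first covers basic Colab operations and then closes with a hands-on code walkthrough: an LSTM text classification task in Colab.

1. Introduction: What Is Colab

Colab is an interactive Python environment, similar to Jupyter Notebook, that originated inside Google. It requires no installation, lets you switch quickly between Python 2 and Python 3 runtimes, works with Google's ecosystem (TensorFlow, BigQuery, Google Drive, and so on), and supports pip. It is available at https://colab.research.google.com

2. How to Install and Use Deep Learning Libraries in Colab

Colab ships with basic deep learning libraries such as TensorFlow, Matplotlib, NumPy, and Pandas. If you need other dependencies, such as Keras, create a new code cell and enter: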

# Install the latest version of Keras
# https://keras.io/
!pip install keras

# Install a specific version
!pip install keras==2.0.9

# Install OpenCV
# https://opencv.org/
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python

# Install PyTorch
# http://pytorch.org/
!pip install -q http://download.pytorch.org/whl/cu75/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl torchvision

# Install XGBoost
# https://github.com/dmlc/xgboost
!pip install -q xgboost

# Install 7Zip
!apt-get -qq install -y libarchive-dev && pip install -q -U libarchive

# Install GraphViz and PyDot
!apt-get -qq install -y graphviz && pip install -q pydot
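Before running an install command, it can help to check which libraries the runtime already provides. The following is a minimal sketch using only the standard library; the helper names `is_installed` and `missing_modules` and the `wanted` list are hypothetical. Note that it checks import names, which may differ from pip package names (for example, the package opencv-python is imported as cv2).

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is missing
        # or a module has no usable spec.
        return False


def missing_modules(module_names):
    """Return the names from `module_names` that are not importable."""
    return [name for name in module_names if not is_installed(name)]


# Hypothetical list of modules a notebook might need.
wanted = ["json", "keras", "xgboost"]
print("Need to install:", missing_modules(wanted))
```

Anything reported as missing can then be installed with the matching `!pip install` line.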

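3. Reading and Writing Google Drive Files in Colab

3.1 Authorization and Login

For a given notebook, you only need to log in once; after that you can perform read and write operations.

# Install the PyDrive library; this only needs to run once per notebook
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authorize and log in; authentication is only requested the first time
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

After running this code, an authorization link is printed. Click the link to authorize and log in, paste the resulting token into the input box, and press Enter to finish logging in.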

3.2 Listing a Directory

# List all files in the root directory
# For the "q" query syntax, see: https://developers.google.com/drive/v2/web/search-parameters
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))

The console prints the following result:

title: Colab 測試, id: 1cB5CHKSdL26AMXQ5xrqk2kaBv5LSkIsJ8HuEDyZpeqQ, mimeType: application/vnd.google-apps.document
title: Colab Notebooks, id: 1U9363A12345TP2nSeh2K8FzDKSsKj5Jj, mimeType: application/vnd.google-apps.folder

The id is the unique identifier used to fetch files in the rest of this tutorial. From the mimeType you can tell that Colab 測試 is a Google Docs document, while Colab Notebooks is a folder (the root directory where Colab stores its notebooks). To query the files inside the Colab Notebooks folder, write the query condition with that folder's id:

# '<folder id>' in parents
file_list = drive.ListFile({'q': "'1U9363A12345TP2nSeh2K8FzDKSsKj5Jj' in parents and trashed=false"}).GetList()
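The query pattern above can be wrapped in small helpers so that the folder id is not hand-edited into the string each time. This is a minimal sketch, not part of PyDrive: `children_query` and `folders_only` are hypothetical names, and the `listing` entries simply mirror the output shown earlier.

```python
FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"


def children_query(folder_id, include_trashed=False):
    """Build a Drive `q` string selecting the direct children of a folder."""
    query = "'%s' in parents" % folder_id
    if not include_trashed:
        query += " and trashed=false"
    return query


def folders_only(file_list):
    """Keep only the entries whose mimeType marks them as folders."""
    return [f for f in file_list if f["mimeType"] == FOLDER_MIME_TYPE]


# Entries shaped like the listing printed in this section.
listing = [
    {"title": "Colab 測試",
     "id": "1cB5CHKSdL26AMXQ5xrqk2kaBv5LSkIsJ8HuEDyZpeqQ",
     "mimeType": "application/vnd.google-apps.document"},
    {"title": "Colab Notebooks",
     "id": "1U9363A12345TP2nSeh2K8FzDKSsKj5Jj",
     "mimeType": "application/vnd.google-apps.folder"},
]

for folder in folders_only(listing):
    print(folder["title"], "->", children_query(folder["id"]))
```

In a notebook, the resulting string would be passed as the 'q' value to drive.ListFile in the same way as the queries above.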

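3.3 Reading File Contents

So far, the format I have tested that can be read directly is .txt (mimeType: text/plain). The code to read it:

file = drive.CreateFile({'id': "replace with your .txt file id"})
file.GetContentString()

For a .csv file, however, GetContentString() only prints the first row of data, so use the following instead:

file = drive.CreateFile({'id': "replace with your .csv file id"})
# This download only caches the file in the working environment;
# it does not create an extra copy in your Google Drive
file.GetContentFile('iris.csv', "text/csv")

# Print the file contents directly
with open('iris.csv') as f:
    print(f.readlines())

# Read it with pandas
import pandas as pd
pd.read_csv('iris.csv', index_col=[0,1], skipinitialspace=True)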
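Once the file is cached locally, it can also be read without pandas. The sketch below assumes Python 3 and uses the standard csv module; `read_csv_rows` is a hypothetical helper, and the temporary file stands in for the iris.csv written by GetContentFile, filled with the two iris rows quoted in this tutorial.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Read a CSV file into a list of rows, each a list of strings."""
    with open(path, newline="") as f:
        return list(csv.reader(f, skipinitialspace=True))


# Stand-in for the locally cached file: two rows from the iris dataset.
sample = "5.1,3.5,1.4,0.2,setosa\n4.9,3,1.4,0.2,setosa\n"
path = os.path.join(tempfile.mkdtemp(), "iris.csv")
with open(path, "w") as f:
    f.write(sample)

rows = read_csv_rows(path)
print(rows)
```

Every cell comes back as a string, so numeric columns still need converting before any computation.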

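Colab renders the result directly as a table (the original screenshot showed the first few rows of the iris dataset). The iris dataset is available at http://aima.cs.berkeley.edu/data/iris.csv ; readers following along can upload it to their own Google Drive.

3.4 Writing Files

# Create a text file
uploaded = drive.CreateFile({'title': 'example.txt'})
uploaded.SetContentString('Test content')
uploaded.Upload()
print('File created with id {}'.format(uploaded.get('id')))

For more operations, see: http://pythonhosted.org/PyDrive/filemanagement.html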

4. Working with Google Sheets in Colab

4.1 Authorization and Login

For a given notebook, you only need to log in once; after that you can perform read and write operations.

!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())

4.2 Reading

To demonstrate, import the iris.csv data into a new Google Sheet file named iris; it can live in any directory of your Google Drive.

worksheet = gc.open('iris').sheet1

# Get a list of rows:
# [[row 1 col 1, row 1 col 2, ... , row 1 col n], ... , [row n col 1, row n col 2, ... , row n col n]]
rows = worksheet.get_all_values()
print(rows)

# Read it with pandas
import pandas as pd
pd.DataFrame.from_records(rows)

The printed value of rows is:

[['5.1', '3.5', '1.4', '0.2', 'setosa'], ['4.9', '3', '1.4', '0.2', 'setosa'], ...
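Cell values read from a sheet this way are typically strings, so the measurements need converting before any numeric work. Here is a minimal sketch under that assumption; `to_typed_rows` is a hypothetical helper, and treating the first four columns as numeric is specific to the iris layout.

```python
def to_typed_rows(rows, numeric_columns=4):
    """Convert the first `numeric_columns` cells of each row to float."""
    typed = []
    for row in rows:
        numbers = [float(cell) for cell in row[:numeric_columns]]
        typed.append(numbers + list(row[numeric_columns:]))
    return typed


# The two rows shown in the printed result.
rows = [["5.1", "3.5", "1.4", "0.2", "setosa"],
        ["4.9", "3", "1.4", "0.2", "setosa"]]
typed_rows = to_typed_rows(rows)
print(typed_rows)
```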

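4.3 Writing

sh = gc.create('Google Sheet')

# Open the spreadsheet and its first worksheet
worksheet = gc.open('Google Sheet').sheet1
cell_list = worksheet.range('A1:C2')

# Fill every cell in the range with a random integer from 1 to 10
import random
for cell in cell_list:
    cell.value = random.randint(1, 10)
worksheet.update_cells(cell_list)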
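The range 'A1:C2' used above covers two rows and three columns, so the loop fills six cells. The sketch below expands such a range into (row, column) pairs to make that concrete. It is an illustration only, not gspread's implementation: `column_to_index` and `expand_a1_range` are hypothetical helpers, and the row-by-row ordering is an assumption made for the example.

```python
import re


def column_to_index(letters):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def expand_a1_range(a1_range):
    """List the (row, col) pairs covered by a range like 'A1:C2', row by row."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", a1_range)
    if match is None:
        raise ValueError("expected a range like 'A1:C2', got %r" % a1_range)
    first_col, first_row, last_col, last_row = match.groups()
    cols = range(column_to_index(first_col), column_to_index(last_col) + 1)
    rows = range(int(first_row), int(last_row) + 1)
    return [(r, c) for r in rows for c in cols]


cells = expand_a1_range("A1:C2")
print(len(cells), cells)
```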

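5. Downloading Files from Colab to Your Local Machine

from google.colab import files

with open('example.txt', 'w') as f:
    f.write('Test content')
files.download('example.txt')

6. Hands-On: LSTM Text Classification

This section uses my open-source LSTM text classification project on GitHub as the example (https://github.com/Jinkeycode/keras_lstm_chinese_document_classification). Store the three files from the master/data directory on Google Drive. The example classifies titles into three categories: health, technology, and design.

6.1 Create a Notebook

Create a new Python 2 notebook in Colab.

6.2 Install Dependencies

!pip install keras
!pip install jieba
!pip install h5py

import h5py
import jieba as jb
import numpy as np
import keras as krs
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder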

6.3 Load the Data

Authorize and log in

# Install the PyDrive library; this only needs to run once per notebook
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def login_google_drive():
    # Authorize and log in; authentication is only requested the first time
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    return drive

List all files in Google Drive

def list_file(drive):
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
    for file1 in file_list:
        print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))

drive = login_google_drive()
list_file(drive)

Cache the data in the working environment

def cache_data():
    # Replace each id with the matching file id printed in the previous step
    health_txt = drive.CreateFile({'id': "117GkBtuuBP3wVjES0X0L4wVF5rp5Cewi"})
    tech_txt = drive.CreateFile({'id': "14sDl4520Tpo1MLPydjNBoq-QjqOKk9t6"})
    design_txt = drive.CreateFile({'id': "1J4lndcsjUb8_VfqPcfsDeOoB21bOLea3"})
    # This download only caches the files in the working environment;
    # it does not create extra copies in your Google Drive
    health_txt.GetContentFile('health.txt', "text/plain")
    tech_txt.GetContentFile('tech.txt', "text/plain")
    design_txt.GetContentFile('design.txt', "text/plain")
    print("Cached successfully")

cache_data()

Read the data from the working environment

def load_data():
    titles = []
    print("Loading data for the health category...")
    with open("health.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())
    print("Loading data for the technology category...")
    with open("tech.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())
    print("Loading data for the design category...")
    with open("design.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())
    print("Loaded %s titles in total" % len(titles))
    return titles

titles = load_data()

Load the labels

def load_label():
    # 12000 health titles (label 0), 12000 technology titles (label 1),
    # and 7318 design titles (label 2), in the same order as load_data()
    arr0 = np.zeros(shape=[12000, ])
    arr1 = np.ones(shape=[12000, ])
    arr2 = np.array([2]).repeat(7318)
    target = np.hstack([arr0, arr1, arr2])
    print("Loaded %s labels in total" % target.shape)
    encoder = LabelEncoder()
    encoder.fit(target)
    encoded_target = encoder.transform(target)
    # One-hot encode the integer labels
    dummy_target = krs.utils.np_utils.to_categorical(encoded_target)
    return dummy_target

target = load_label()
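To make the label layout concrete, here is a pure-Python sketch of what the label step produces: one integer label per title, in the order health, technology, design, then a one-hot row per label. The helpers `build_labels` and `one_hot` are hypothetical stand-ins for the NumPy and Keras calls; the class counts are the ones used in load_label.

```python
def build_labels(class_counts):
    """Return integer labels: class i repeated class_counts[i] times."""
    labels = []
    for class_index, count in enumerate(class_counts):
        labels.extend([class_index] * count)
    return labels


def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


counts = [12000, 12000, 7318]  # health, technology, design
labels = build_labels(counts)
encoded = one_hot(labels, num_classes=len(counts))
print(len(encoded), encoded[0], encoded[12000], encoded[-1])
```

The labels only line up with the titles if the files are loaded in the same order, so changing the order in load_data would require changing the counts here too.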

6.4 Text Preprocessing

max_sequence_length = 30
embedding_size = 50

# Segment each title into words, joined by spaces
titles = [" ".join(jb.cut(t, cut_all=True)) for t in titles]

# Convert each title into a fixed-length sequence of vocabulary ids
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length, min_frequency=1)
text_processed = np.array(list(vocab_processor.fit_transform(titles)))

# Read the word-to-id mapping
dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(dict.items(), key = lambda x : x[1])
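The idea behind this step is to turn each segmented title into a fixed-length list of integer word ids. The sketch below shows that idea in plain Python. It is a simplification, not the VocabularyProcessor implementation: the helper names and the sample texts are hypothetical, id 0 is assumed to stand for both unknown words and padding, and frequency filtering is left out.

```python
def build_vocabulary(tokenized_texts):
    """Assign ids to tokens in first-seen order; id 0 is reserved."""
    vocab = {"<UNK>": 0}
    for text in tokenized_texts:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_id_sequence(text, vocab, max_length):
    """Map tokens to ids, truncate to max_length, and right-pad with 0."""
    ids = [vocab.get(token, 0) for token in text.split()][:max_length]
    return ids + [0] * (max_length - len(ids))


# Hypothetical pre-segmented titles (space-separated tokens).
texts = ["healthy diet tips", "design tips"]
vocab = build_vocabulary(texts)
sequences = [to_id_sequence(t, vocab, max_length=5) for t in texts]
print(vocab)
print(sequences)
```

Every sequence has the same length, which is what lets the titles be stacked into a single array for the Embedding layer.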

6.5 Build the Neural Network

Here Embedding and LSTM are the first two layers, and a softmax activation produces the output.

# Configure the network structure
def build_network(num_vocabs):
    model = krs.Sequential()
    # Map each word id to a dense vector of size embedding_size
    model.add(krs.layers.Embedding(num_vocabs, embedding_size, input_length=max_sequence_length))
    model.add(krs.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2))
    # One output per category: health, technology, design
    model.add(krs.layers.Dense(3))
    model.add(krs.layers.Activation("softmax"))
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model

num_vocabs = len(dict.items())
model = build_network(num_vocabs=num_vocabs)

import time
start = time.time()
# Train the model
model.fit(text_processed, target, batch_size=512, epochs=10, )
finish = time.time()
print("Training took %f seconds" % (finish - start))
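As a sanity check on the size of this network, the trainable parameter count of each layer can be worked out by hand from the standard formulas. The sketch below does that arithmetic; the function names are hypothetical, and `num_vocabs = 10000` is an assumed vocabulary size for illustration, since the real value depends on the data.

```python
def embedding_params(num_vocabs, embedding_size):
    """One vector of size embedding_size per vocabulary entry."""
    return num_vocabs * embedding_size


def lstm_params(input_size, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_size + units) + units)


def dense_params(input_size, units):
    """A weight per input-output pair plus one bias per output."""
    return input_size * units + units


num_vocabs = 10000  # assumed vocabulary size, for illustration only
total = (embedding_params(num_vocabs, 50)
         + lstm_params(50, 32)
         + dense_params(32, 3))
print(embedding_params(num_vocabs, 50), lstm_params(50, 32), dense_params(32, 3), total)
```

Most of the parameters sit in the Embedding layer, so the vocabulary size largely determines the model size.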

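6.6 Predict a Sample

sen can be replaced with your own sentence. The prediction has the form [probability of health, probability of technology, probability of design]. The category with the highest probability is the predicted class, but when the highest probability does not exceed 0.8 the article is treated as unclassifiable.

# Sample title: "Tips you need to learn to do good commercial design"
sen = "做好商業設計需要學習的小技巧"
sen_processed = " ".join(jb.cut(sen, cut_all=True))
sen_processed = vocab_processor.transform([sen_processed])
sen_processed = np.array(list(sen_processed))
result = model.predict(sen_processed)

# Index of the category with the highest probability
catalogue = list(result[0]).index(max(result[0]))
threshold = 0.8
if max(result[0]) > threshold:
    if catalogue == 0:
        print("This is an article about health")
    elif catalogue == 1:
        print("This is an article about technology")
    elif catalogue == 2:
        print("This is an article about design")
else:
    print("This article has no reliable classification")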
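The thresholding rule can be isolated into a small function that is easy to test without a trained model. This is a minimal sketch: `classify` and `CATEGORIES` are hypothetical names, and the probability lists are made-up inputs rather than real model output.

```python
CATEGORIES = ["health", "technology", "design"]


def classify(probabilities, threshold=0.8):
    """Return the most likely category, or None if it is not confident enough."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] > threshold:
        return CATEGORIES[best_index]
    return None


# Made-up probability vectors in the order [health, technology, design].
print(classify([0.05, 0.10, 0.85]))
print(classify([0.40, 0.35, 0.25]))
```

Returning None for the low-confidence case leaves it to the caller to decide how to report an unclassifiable title.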

That's it for this tutorial. You can now start using Google's resources to build your own neural networks.
