[Python] Recommending Related Articles with Cosine Similarity

05-01

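This is an assignment for a Big Data and Business Analytics course.

The task: for each of 6 target articles, list the 3 articles most worth recommending from a dataset of 2,081 articles. The vector space is built from 245 terms supplied by the instructor; we compute the inner product between article vectors and take the top 3.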
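The task statement mentions the inner product, while the rest of the post uses cosine similarity. The two are related as follows, where a and b are illustrative names (not from the assignment) for the 245-dimensional term vectors of two articles:

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{245} a_i b_i}
         {\sqrt{\sum_{i=1}^{245} a_i^2} \, \sqrt{\sum_{i=1}^{245} b_i^2}}
```

Dividing by the two vector lengths keeps a long article from scoring high simply because its term counts are larger.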

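As usual, let's first look at what the data looks like:

The dataset of 2,081 articles (all related to Hon Hai, 鴻海, i.e. Foxconn):

The 6 target documents: our job is to find, for each of the six, the 3 most related articles.

In addition, there are 245 feature terms (the step of extracting features from the text is omitted here).

Now let's get to work. First import the required packages: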

import string

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn.feature_extraction import DictVectorizer

First read the full dataset (90,000 articles), then return the articles that mention the keyword (鴻海):

# Read all articles and return those that contain the keyword
def readAllArticles(path, keyword):
    # Read column E of the "all" sheet into a pandas Series.
    # usecols replaces the old parse_cols argument, and .squeeze("columns")
    # replaces squeeze=True, which pandas 2.0 removed.
    records = pd.read_excel(path, sheet_name="all", usecols="E").squeeze("columns")
    # articleTitle = pd.read_excel(path, sheet_name="all", usecols="D").squeeze("columns")
    # Keep only the articles that mention the keyword
    bankRecords = pd.Series([records[x] for x in range(len(records)) if keyword in records[x]])
    return bankRecords
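The keyword filter above can also be written with pandas string methods. This is a minimal sketch on toy data, not the assignment's file: `filter_by_keyword` and the sample sentences are made up for illustration. One difference worth noting is that `keyword in records[x]` raises a `TypeError` on an empty cell, while `str.contains(..., na=False)` simply skips it.

```python
import pandas as pd

# Toy stand-in for the article column; the real data comes from the Excel file.
records = pd.Series([
    "鴻海宣布新的投資計畫",
    "今日股市收盤小幅上漲",
    None,  # a missing cell, to show how empty values are handled
    "蘋果供應鏈包含鴻海與和碩",
])

def filter_by_keyword(records: pd.Series, keyword: str) -> pd.Series:
    """Return the records that contain `keyword`, re-indexed from 0."""
    # regex=False treats the keyword as a literal string;
    # na=False counts missing cells as "no match".
    mask = records.str.contains(keyword, regex=False, na=False)
    return records[mask].reset_index(drop=True)

fox = filter_by_keyword(records, "鴻海")
print(fox.tolist())
```

Resetting the index keeps positions 0..n-1 aligned with the rows of the term matrix built later, which the lookup `foxRecords[a]` in the printing step relies on.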

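Here I continue the n-gram approach from my previous post: split each article into character n-grams, record the term frequency (tf) of each feature term, and save the results to totalTermsList: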

def getAllNgramTerms(foxRecords):
    totalTermsList = []
    # Full-width punctuation and whitespace to strip. ASCII punctuation, digits
    # and letters are already covered by the string constants below.
    # (The ideographic space is an assumption; that character was lost in the original.)
    names = " 　，、。（）「」！？；．：》《○／"
    exclude = set(string.digits + string.punctuation + string.ascii_letters + names)
    # featureSelected is the global Series of 245 feature terms, loaded further down
    features = list(featureSelected.to_dict().values())
    for content in foxRecords:
        # Clean the article and split it into single characters
        charList = [ch for ch in content if ch not in exclude]
        AddTermDict = {}
        for n in range(2, 7):
            # Count every character n-gram of length n
            grams = ["".join(charList[i:i+n]) for i in range(0, len(charList)-n+1)]
            ngramSeries = pd.Series(grams, dtype=object).value_counts()
            # Keep only n-grams that occur more than once
            AddTermDict.update(ngramSeries[ngramSeries > 1].to_dict())
        # For every feature term, store its count, or 0 if it is absent
        # (no smoothing is applied), so that all articles share the same keys
        tempDict = {term: AddTermDict.get(term, 0) for term in features}
        totalTermsList.append(tempDict)
    return totalTermsList
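To make the counting step easier to follow, here is a small standalone sketch that uses `collections.Counter` instead of pandas. The function names, the sample sentence and the three feature terms are made up for illustration. It keeps the same rule as the function above: an n-gram counts only if it appears more than once in the article.

```python
from collections import Counter

def char_ngram_counts(chars, n_min=2, n_max=6, min_count=2):
    """Count character n-grams of length n_min..n_max seen at least min_count times."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

def feature_tf(chars, features):
    """Map every feature term to its n-gram count, or 0 when absent or too rare."""
    counts = char_ngram_counts(chars)
    return {f: counts.get(f, 0) for f in features}

chars = list("鴻海集團與鴻海精密")
tf = feature_tf(chars, ["鴻海", "集團", "夏普"])
print(tf)  # {'鴻海': 2, '集團': 0, '夏普': 0}
```

Note that 集團 appears once in the sample but is recorded as 0, because of the "more than once" threshold.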

Vectorize each article:

# Vectorize the articles
def termVec(termListForVec):
    vec = DictVectorizer()
    termVector = vec.fit_transform(termListForVec).toarray()
    return termVector
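Because `termVec` fits a new `DictVectorizer` on every call, the dataset and the target articles end up with matching columns only because every dictionary contains all the feature terms as keys. A slightly safer pattern, sketched below on toy dictionaries (the terms and counts are made up), is to fit once and reuse the fitted vectorizer:

```python
from sklearn.feature_extraction import DictVectorizer

# Toy term-frequency dictionaries; both contain the same keys, in different order.
docs_a = [{"鴻海": 2, "夏普": 0, "蘋果": 1}]
docs_b = [{"蘋果": 3, "鴻海": 0, "夏普": 1}]

vec = DictVectorizer(sparse=False)  # return a dense numpy array directly
A = vec.fit_transform(docs_a)       # learn the column order from the first set
B = vec.transform(docs_b)           # reuse that same column order

print(vec.feature_names_)  # column order: feature names sorted by code point
print(A)
print(B)
```

With `transform`, a key missing from one document becomes a 0 in the right column instead of shifting the column layout.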

I wrapped the steps above in functions; the rest I left as plain script code.

First read all the data and store it:

# Find the articles related to 鴻海
path = "/Users/vincentyau/Documents/台大碩一下資料/大數據與商業分析/bda2018_hw1/bda2018_hw1_text.xlsx"
keyword = "鴻海"
# foxRecords is stored as a Series
foxRecords = readAllArticles(path, keyword)
# Read the 245 feature terms
featurePath = "/Users/vincentyau/Documents/台大碩一下資料/大數據與商業分析/HW2/bda2018_hw2_table.xlsx"
featureSelected = pd.read_excel(featurePath, sheet_name="L2_foxconn_keyword", usecols="B").squeeze("columns")
# Load the target articles
articles = pd.read_excel(featurePath, sheet_name="L2_query", usecols="E").squeeze("columns")

Get the vectorized form of both the dataset articles and the target articles:

# Get the term vectors of all articles and of the 6 target articles, as numpy.ndarray
totalTermVec = termVec(getAllNgramTerms(foxRecords))
articlesVec = termVec(getAllNgramTerms(articles))

Compute the cosine similarity between each target article and every article in the dataset:

# Compute cosine similarity
saveAllConSim = []
for vector in articlesVec:
    singleSim = []
    for vectorAll in totalTermVec:
        # np.dot on 1-D arrays replaces the np.mat product (np.mat is gone in NumPy 2.0)
        num = float(np.dot(vector, vectorAll))
        denom = np.linalg.norm(vector) * np.linalg.norm(vectorAll)
        # An article with none of the feature terms has a zero vector; treat its cosine as 0
        cos = num / denom if denom > 0 else 0.0
        # Rescale the cosine from [-1, 1] to [0, 1]
        sim = 0.5 + 0.5 * cos
        singleSim.append(sim)
    saveAllConSim.append(singleSim)
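The nested loop above can be replaced by two matrix products. The following is a minimal sketch on toy vectors, assuming one row per article; `cosine_similarity_matrix` is an illustrative name, not part of the original script.

```python
import numpy as np

def cosine_similarity_matrix(queries, corpus):
    """Cosine similarity between every row of `queries` and every row of `corpus`."""
    queries = np.asarray(queries, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    dots = queries @ corpus.T  # all pairwise inner products
    q_norm = np.linalg.norm(queries, axis=1, keepdims=True)
    c_norm = np.linalg.norm(corpus, axis=1, keepdims=True)
    denom = q_norm @ c_norm.T  # all pairwise products of vector lengths
    # Where either vector is all zeros, leave the cosine at 0 instead of dividing by 0
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

queries = np.array([[1, 0, 2]])
corpus = np.array([[1, 0, 2], [0, 3, 0], [2, 0, 4], [0, 0, 0]])

cos = cosine_similarity_matrix(queries, corpus)
sim = 0.5 + 0.5 * cos  # same rescaling as in the loop version
print(sim)
```

Term-frequency vectors are non-negative, so the cosine already lies in [0, 1] and the rescaling maps it to [0.5, 1]. The rescaling is monotonic, so it does not change the ranking.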

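Take the four most similar articles for each target (the top hit is expected to be the target article itself) and print the results: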

# saveAllConSim is an ordered list: one list of similarities per target article
x = 1
for i in saveAllConSim:
    # Turn the similarities into a DataFrame
    newDataFrame = pd.DataFrame(i)
    newDataFrame.columns = ["conSim"]
    newDataFrame.columns.name = "articles"
    sortedConSim = newDataFrame.sort_values(by="conSim", ascending=False)[:4]
    print(sortedConSim)
    n = 1
    for a in sortedConSim.index:
        if n == 1:
            print("Original article " + str(x) + ": " + foxRecords[a] + "\n")
        else:
            print("An article related to article " + str(x) + ":")
            print(foxRecords[a] + "\n")
        n += 1
    x += 1
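Taking the top four and treating the first as the target article works only if the target is in the dataset and sorts first. If another article ties at similarity 1, the order is not guaranteed. Below is a sketch of an alternative that excludes the target by position. It assumes the target's index in the dataset is known; `top_k_indices` and the similarity values are made up for illustration.

```python
import numpy as np

def top_k_indices(similarities, k=3, exclude=None):
    """Indices of the k largest similarities, optionally skipping one index."""
    sims = np.asarray(similarities, dtype=float)
    # A stable sort on the negated values keeps ties in their original order
    order = np.argsort(-sims, kind="stable")
    if exclude is not None:
        order = order[order != exclude]
    return order[:k].tolist()

sims = [0.62, 1.0, 0.91, 0.55, 0.88, 0.91]
print(top_k_indices(sims, k=3, exclude=1))  # [2, 5, 4]
```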

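Let's look at the results. The article with similarity 1 is the target article itself, and the remaining three are the related articles recommended for it: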

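Summary:

1. This is a simple article recommender based on cosine similarity. The vector values used here are term frequencies (tf). Some classmates used tf-idf instead, and it did not give noticeably better results here either.

2. The likely reason tf-idf did not help is that the instructor's 245 feature terms already exclude common terms.

3. I would like to know which algorithms Toutiao (今日頭條) uses.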
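One way to check the explanation in points 1 and 2 is to look at the idf weights themselves. If they are nearly the same across the feature terms, tf-idf is close to a uniform rescaling of tf and the cosine ranking barely moves. The sketch below uses a toy count matrix and one common smoothed idf variant, log((1 + N) / (1 + df)) + 1. This variant is an assumption; the post does not say which one the classmates used.

```python
import numpy as np

def tfidf(counts):
    """Weight a (documents x terms) count matrix by a smoothed idf."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Document frequency: in how many documents each term appears
    df = np.count_nonzero(counts, axis=0)
    # Smoothed idf (assumed variant): terms found in every document get weight 1
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    return counts * idf

# Toy counts: the first two terms appear in every document, the third in only one
counts = np.array([[2, 1, 0],
                   [1, 1, 0],
                   [3, 1, 1]])
weighted = tfidf(counts)
print(weighted)
```

In this toy matrix only the third column changes, because the first two terms occur in every document.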