Picking Keywords from a Chinese News Dataset with n-grams

For a big data business analytics course, the instructor handed out a dataset of 90,000 news articles; each record mainly contains a news article's title and content.

The specific requirements:

  • Pick out the news related to 「銀行」 (bank)
  • Remove runs of pure alphanumerics and special symbols
  • Segment with n-grams (n from 2 to 6; see the toy sketch after this list)
  • Merge redundant substrings
  • List the top 100 keywords by TF-IDF
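Before diving in, a minimal sketch of what character n-grams look like (the sample string and n here are my own, not from the assignment): an n-gram is simply every window of n consecutive characters.

```python
# Toy illustration: character bigrams of a short Chinese string
text = "銀行利率調整"
n = 2
grams = [text[i:i + n] for i in range(len(text) - n + 1)]
print(grams)  # ['銀行', '行利', '利率', '率調', '調整']
```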

I know, you're about to say that nobody uses n-grams to segment Chinese anymore. True, but I just wanted to run the n-gram approach myself.

The news data looks roughly like this:

[Figure: a sample of the news data]


I wrote this quite a while ago, so I'll just walk through the main functions and the design ideas behind the code, pretending that plenty of friends are reading this post.

As always, you'll need roughly the following packages:

```python
import string
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
```

First, read the dataset in, then find the articles related to the keyword 「銀行」 and save them to bankRecords, which is a pd.Series:

```python
# Read all articles and return the subset matching the keyword
def readAllArticles(path, keyword):
    # Read all articles and store them in records (a pandas Series)
    records = pd.read_excel(path, sheet_name="all", parse_cols="E", squeeze=True)
    # Pick out the news articles that contain the keyword
    bankRecords = pd.Series([r for r in records if isinstance(r, str) and keyword in r])
    return bankRecords
```
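One caveat: parse_cols and the squeeze keyword have been removed from recent pandas releases, so the call above may error on a current install. An equivalent read on modern pandas should be the following (assuming the sheet really is named "all"):

```python
# Modern-pandas equivalent: parse_cols -> usecols, squeeze kwarg -> .squeeze()
records = pd.read_excel(path, sheet_name="all", usecols="E").squeeze("columns")
```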

Next, process the articles in bankRecords: strip out English letters, digits, and special symbols, segment with n-grams (pick any n from 2 to 6), and save the segmented articles to termList:

```python
# Build the n-gram term list for the article set; terms that appear
# 3 times or fewer within an article are dropped
def getNgramTermList(n, bankRecords):
    termList = []
    for content in bankRecords:
        # full-width punctuation seen in the news text; ASCII digits, letters
        # and punctuation are covered by the string constants below
        names = ",。、()「」!?;.:《》%○\u3000"  # \u3000 is the full-width space
        exclude = string.digits + string.punctuation + string.ascii_letters + names
        charList = [ch for ch in content if ch not in exclude]
        if len(charList) < n:  # article too short to yield any n-gram
            termList.append({})
            continue
        # slide a window of size n over the remaining characters
        frame = DataFrame(["".join(ch) for ch in [charList[i:i + n] for i in range(0, len(charList) - n + 1)]])
        bigramSeries = frame[0].value_counts()
        # keep only the terms that occur more than 3 times in this article
        newBigram = bigramSeries[bigramSeries > 3].to_dict()
        termList.append(newBigram)
    return termList
```
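A quick sanity check on a one-article toy Series (made-up text, not real news) shows the shape of termList: one dict per article, mapping each surviving n-gram to its in-article count. Only grams occurring more than 3 times survive the filter.

```python
# Hypothetical input: "銀行升息" repeated four times, plus a short tail
toy = pd.Series(["銀行升息銀行升息銀行升息銀行升息好消息"])
print(getNgramTermList(2, toy))
# [{'銀行': 4, '行升': 4, '升息': 4}]
# (dict order may vary; '息銀' appears only 3 times, so it is dropped)
```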

At this point you have a termList and can start computing TF, DF, and the like. Since we are picking 100 keywords out of some 6,000 bank-related articles, collection frequency (CF) is used in place of TF here, and two tables are kept, allTermDocFre and termFrequency:

```python
# Calculate tf (meaning collection frequency here) and document frequency
def getFrequency(termList):
    # Store every term's document frequency in allTermDocFre
    allTermDocFre = {}
    # Store every term's collection frequency in termFrequency
    termFrequency = {}
    for content in termList:
        for term in content:
            if term in allTermDocFre:
                termFrequency[term] = termFrequency[term] + content[term]
                allTermDocFre[term] += 1
            else:
                termFrequency[term] = content[term]
                allTermDocFre[term] = 1
    return allTermDocFre, termFrequency
```
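To make the two tables concrete, here is a toy run with invented numbers: DF counts how many articles contain a term, while CF sums the term's counts across all articles.

```python
# Hypothetical termList for two articles
toyTermList = [{"銀行": 5, "利率": 4}, {"銀行": 6}]
df, cf = getFrequency(toyTermList)
print(df)  # {'銀行': 2, '利率': 1}
print(cf)  # {'銀行': 11, '利率': 4}
```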

Next comes a type conversion. This step isn't really necessary, but because I wasn't yet fluent with Series, I could only store the data as a list first and then turn it into a Series:

```python
# Convert a dict (or list) to a Series, then reset the index into a column
def getSeries(data, names, indexName):
    h = pd.Series(data, name=names)
    h.index.name = indexName
    return h.reset_index()
```
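For the record, on a toy dict the helper produces the two-column frame that pd.merge needs later:

```python
# Hypothetical document-frequency dict -> DataFrame with columns Term and df
print(getSeries({"銀行": 2, "利率": 1}, "df", "Term"))
#   Term  df
# 0   銀行   2
# 1   利率   1
```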

With the CF and DF numbers in hand, we can compute TF-IDF. The formula here differs slightly from the standard TF-IDF; if you're curious, look it up on the wiki. The returned data is saved to data:

```python
# tf-idf
def calTfidf(allTermDocFre, termFrequency, bankRecords):
    # The count here is 1830 less than over all articles, because some terms
    # were deleted (tf < 3 within a single article)
    h1 = getSeries(allTermDocFre, "df", "Term")
    h2 = getSeries(termFrequency, "tf", "Term")
    data = pd.merge(h1, h2).sort_values(by="tf", ascending=False)[:1000]
    numOfRecords = len(bankRecords)
    numOfTerms = len(allTermDocFre)
    # math.log only accepts a single float, so use the vectorized np.log here
    data["tf-idf"] = np.log(numOfRecords / data["df"]) * (data["tf"] / numOfTerms)
    data = data.sort_values(by="tf-idf", ascending=False)[:100]
    return data
```
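Reading the formula off the code, with $N$ the number of bank articles, $|V|$ the number of distinct terms, $cf(t)$ the collection frequency, and $df(t)$ the document frequency:

$$\text{tf-idf}(t) = \frac{cf(t)}{|V|} \cdot \ln\frac{N}{df(t)}$$

The textbook definition computes tf per document; using the collection-level cf and dividing by the global vocabulary size is what makes this a variant.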

Finally, I wrote one function that wraps all of the steps above:

```python
# This function is still extensible; it could take on more helpers
def topTerms(path, keyword, n):
    bankRecords = readAllArticles(path, keyword)
    termList = getNgramTermList(n, bankRecords)
    allTermDocFre, termFrequency = getFrequency(termList)
    result = calTfidf(allTermDocFre, termFrequency, bankRecords)
    return result
```

So the final call is:

path = "/Users/vincentyau/Documents/Python/bda2018_hw1/bda2018_hw1_text.xlsx"topTerms1 = topTerms(path, 銀行, 3)

Let's look at the results of a run:

The results aren't all that impressive; I just wanted to record the overall process. If you're interested in the data or in how some of the functions work, come and discuss it with me.

PS: I have a Japanese exam on Thursday. I want a 90, but I've only studied enough for an 80; I just hope I don't drop below 70. In the end, long live passing.

That's all!

