Picking Keywords from a Chinese News Dataset with n-grams

For a big data business analytics course, the instructor handed out a dataset of 90,000 news articles; each record mainly contains a news article's title and content.

The specific requirements:

  • Pick out the news related to 「銀行」 (bank)
  • Remove runs of pure alphanumerics and special symbols
  • Segment with n-grams (n from 2 to 6; see the toy sketch after this list)
  • Merge redundant substrings
  • List the top 100 keywords by TF-IDF
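Before diving in, a minimal sketch of what character n-grams look like (the sample string and n here are my own, not from the assignment): an n-gram is simply every window of n consecutive characters.

```python
# Toy illustration: character bigrams of a short Chinese string
text = "銀行利率調整"
n = 2
grams = [text[i:i + n] for i in range(len(text) - n + 1)]
print(grams)  # ['銀行', '行利', '利率', '率調', '調整']
```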

I know, you're about to say that nobody uses n-grams to segment Chinese anymore. True, but I just wanted to run the n-gram approach myself.

The news data looks roughly like this:

[Figure: a sample of the news data]


I wrote this quite a while ago, so I'll just walk through the main functions and the design ideas behind the code, pretending that plenty of friends are reading this post.

As always, you'll need roughly the following packages:

```python
import string
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
```

First, read the dataset in, then find the articles related to the keyword 「銀行」 and save them to bankRecords, which is a pd.Series:

```python
# Read all articles and return the subset matching the keyword
def readAllArticles(path, keyword):
    # Read all articles and store them in records (a pandas Series)
    records = pd.read_excel(path, sheet_name="all", parse_cols="E", squeeze=True)
    # Pick out the news articles that contain the keyword
    bankRecords = pd.Series([r for r in records if isinstance(r, str) and keyword in r])
    return bankRecords
```
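One caveat: parse_cols and the squeeze keyword have been removed from recent pandas releases, so the call above may error on a current install. An equivalent read on modern pandas should be the following (assuming the sheet really is named "all"):

```python
# Modern-pandas equivalent: parse_cols -> usecols, squeeze kwarg -> .squeeze()
records = pd.read_excel(path, sheet_name="all", usecols="E").squeeze("columns")
```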

Next, process the articles in bankRecords: strip out English letters, digits, and special symbols, segment with n-grams (pick any n from 2 to 6), and save the segmented articles to termList:

```python
# Build the n-gram term list for the article set; terms that appear
# 3 times or fewer within an article are dropped
def getNgramTermList(n, bankRecords):
    termList = []
    for content in bankRecords:
        # full-width punctuation seen in the news text; ASCII digits, letters
        # and punctuation are covered by the string constants below
        names = ",。、()「」!?;.:《》%○\u3000"  # \u3000 is the full-width space
        exclude = string.digits + string.punctuation + string.ascii_letters + names
        charList = [ch for ch in content if ch not in exclude]
        if len(charList) < n:  # article too short to yield any n-gram
            termList.append({})
            continue
        # slide a window of size n over the remaining characters
        frame = DataFrame(["".join(ch) for ch in [charList[i:i + n] for i in range(0, len(charList) - n + 1)]])
        bigramSeries = frame[0].value_counts()
        # keep only the terms that occur more than 3 times in this article
        newBigram = bigramSeries[bigramSeries > 3].to_dict()
        termList.append(newBigram)
    return termList
```
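A quick sanity check on a one-article toy Series (made-up text, not real news) shows the shape of termList: one dict per article, mapping each surviving n-gram to its in-article count. Only grams occurring more than 3 times survive the filter.

```python
# Hypothetical input: "銀行升息" repeated four times, plus a short tail
toy = pd.Series(["銀行升息銀行升息銀行升息銀行升息好消息"])
print(getNgramTermList(2, toy))
# [{'銀行': 4, '行升': 4, '升息': 4}]
# (dict order may vary; '息銀' appears only 3 times, so it is dropped)
```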

At this point you have a termList and can start computing TF, DF, and the like. Since we are picking 100 keywords out of some 6,000 bank-related articles, collection frequency (CF) is used in place of TF here, and two tables are kept, allTermDocFre and termFrequency:

```python
# Calculate tf (meaning collection frequency here) and document frequency
def getFrequency(termList):
    # Store every term's document frequency in allTermDocFre
    allTermDocFre = {}
    # Store every term's collection frequency in termFrequency
    termFrequency = {}
    for content in termList:
        for term in content:
            if term in allTermDocFre:
                termFrequency[term] = termFrequency[term] + content[term]
                allTermDocFre[term] += 1
            else:
                termFrequency[term] = content[term]
                allTermDocFre[term] = 1
    return allTermDocFre, termFrequency
```
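To make the two tables concrete, here is a toy run with invented numbers: DF counts how many articles contain a term, while CF sums the term's counts across all articles.

```python
# Hypothetical termList for two articles
toyTermList = [{"銀行": 5, "利率": 4}, {"銀行": 6}]
df, cf = getFrequency(toyTermList)
print(df)  # {'銀行': 2, '利率': 1}
print(cf)  # {'銀行': 11, '利率': 4}
```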

Next comes a type conversion. This step isn't really necessary, but because I wasn't yet fluent with Series, I could only store the data as a list first and then turn it into a Series:

```python
# Convert a dict (or list) to a Series, then reset the index into a column
def getSeries(data, names, indexName):
    h = pd.Series(data, name=names)
    h.index.name = indexName
    return h.reset_index()
```
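For the record, on a toy dict the helper produces the two-column frame that pd.merge needs later:

```python
# Hypothetical document-frequency dict -> DataFrame with columns Term and df
print(getSeries({"銀行": 2, "利率": 1}, "df", "Term"))
#   Term  df
# 0   銀行   2
# 1   利率   1
```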

With the CF and DF numbers in hand, we can compute TF-IDF. The formula here differs slightly from the standard TF-IDF; if you're curious, look it up on the wiki. The returned data is saved to data:

```python
# tf-idf
def calTfidf(allTermDocFre, termFrequency, bankRecords):
    # The count here is 1830 less than over all articles, because some terms
    # were deleted (tf < 3 within a single article)
    h1 = getSeries(allTermDocFre, "df", "Term")
    h2 = getSeries(termFrequency, "tf", "Term")
    data = pd.merge(h1, h2).sort_values(by="tf", ascending=False)[:1000]
    numOfRecords = len(bankRecords)
    numOfTerms = len(allTermDocFre)
    # math.log only accepts a single float, so use the vectorized np.log here
    data["tf-idf"] = np.log(numOfRecords / data["df"]) * (data["tf"] / numOfTerms)
    data = data.sort_values(by="tf-idf", ascending=False)[:100]
    return data
```
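Reading the formula off the code, with $N$ the number of bank articles, $|V|$ the number of distinct terms, $cf(t)$ the collection frequency, and $df(t)$ the document frequency:

$$\text{tf-idf}(t) = \frac{cf(t)}{|V|} \cdot \ln\frac{N}{df(t)}$$

The textbook definition computes tf per document; using the collection-level cf and dividing by the global vocabulary size is what makes this a variant.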

Finally, I wrote one function that wraps all of the steps above:

```python
# This function is still extensible; it could take on more helpers
def topTerms(path, keyword, n):
    bankRecords = readAllArticles(path, keyword)
    termList = getNgramTermList(n, bankRecords)
    allTermDocFre, termFrequency = getFrequency(termList)
    result = calTfidf(allTermDocFre, termFrequency, bankRecords)
    return result
```

So the final call is:

path = "/Users/vincentyau/Documents/Python/bda2018_hw1/bda2018_hw1_text.xlsx"topTerms1 = topTerms(path, 銀行, 3)

Let's look at the results of a run:

The results aren't all that impressive; I just wanted to record the overall process. If you're interested in the data or in how some of the functions work, come and discuss it with me.

PS: I have a Japanese exam on Thursday. I want a 90, but I've only studied enough for an 80; I just hope I don't drop below 70. In the end, long live passing.

That's all!

