Stock Price Prediction Based on NLP

02-12

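A few words up front:

Week three. This third data analysis report is running a little late, but I was not going to let her skip a week.
This time I mainly combine some natural language processing (NLP) methods with logistic regression to predict stock price movement.
What sets this dataset apart is that it consists of news text, so the language first has to be converted into a numeric matrix.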

Contents:

1 Loading and previewing the data

2 Processing and transforming the data

3 Training the model and making predictions

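PART 1 Loading and previewing the data

As the saying goes, the supplies move before the troops do: before loading the data, import a few libraries.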

# pandas, for loading and handling tabular data
import pandas as pd
# sklearn's tool for turning text into features
from sklearn.feature_extraction.text import CountVectorizer
# the logistic regression algorithm
from sklearn.linear_model import LogisticRegression

Now the data can show her face. Let's see what she looks like.

# Load the data
data = pd.read_csv('Combined_News_DJIA.csv')
data.head()

Figure 1-1

Figure 1-2

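Having taken in the data's charms, let's sum up her looks.

A quick look at the first five rows shows that Date is the trading date and Label records whether the market went up or down that day. As the file name suggests, the target is the Dow Jones Industrial Average (DJIA) rather than a single company's stock. According to the dataset's description, Label=1 means the DJIA adjusted close rose or stayed the same that day, and Label=0 means it fell. The remaining feature columns (Top1-Top25) hold that day's news headlines. The next thing to work out is how to turn text into numbers that a machine learning algorithm can use. This really matters!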

Before converting the text into a numeric matrix, here is a simple example to illustrate the idea.

# First split the data into a training set and a test set by date

train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
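The split above compares the Date column as plain strings. That works because zero-padded YYYY-MM-DD dates sort the same way alphabetically as they do chronologically. Here is a minimal sketch on a made-up three-row table; the rows and the toy_* names are purely illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from Combined_News_DJIA.csv
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02"],
    "Label": [0, 1, 1],
})

# Plain string comparison is enough for zero-padded YYYY-MM-DD dates
toy_train = toy[toy["Date"] < "2015-01-01"]
toy_test = toy[toy["Date"] > "2014-12-31"]

print(len(toy_train), len(toy_test))
```

If the dates were stored in another layout, such as 1/2/2015, it would be safer to convert the column with pd.to_datetime before comparing.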

# An example

# Get the value in row 4, column 11
example = train.iloc[3, 10]
print(example)
Output:
b"The commander of a Navy air reconnaissance squadron that provides the President and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"

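The output above mixes upper and lower case, but we know that a word's meaning does not depend on its capitalization, so everything can be converted to lower case. That is easy: just call lower().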

example2 = example.lower()
print(example2)
Output:
b"the commander of a navy air reconnaissance squadron that provides the president and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"

# With the case sorted out, we can get on with processing the text itself.

# Use the tokenizer from sklearn's CountVectorizer to split the passage into a list of words
example3 = CountVectorizer().build_tokenizer()(example2)
print(example3)
Output:
['the', 'commander', 'of', 'navy', 'air', 'reconnaissance', 'squadron', 'that', 'provides', 'the', 'president', 'and', 'the', 'defense', 'secretary', 'the', 'airborne', 'ability', 'to', 'command', 'the', 'nation', 'nuclear', 'weapons', 'has', 'been', 'relieved', 'of', 'duty']

# Count how many times each word occurs in the list and collect the counts in a DataFrame. A list comprehension does the counting, and set() is a neat way to make sure each word is listed only once.

pd.DataFrame([[x, example3.count(x)] for x in set(example3)], columns=['Word', 'Count'])
Output:
    Word            Count
0   commander       1
1   defense         1
2   secretary       1
3   ability         1
4   and             1
5   of              2
6   has             1
7   duty            1
8   to              1
9   navy            1
10  squadron        1
11  command         1
12  nation          1
13  that            1
14  reconnaissance  1
15  weapons         1
16  nuclear         1
17  been            1
18  air             1
19  airborne        1
20  relieved        1
21  president       1
22  provides        1
23  the             5
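The same count can be reproduced without sklearn or pandas, which helps show what the tokenizer is doing. The sketch below assumes CountVectorizer's default token pattern, which keeps only runs of two or more word characters; that is why the single letters "a", "b" and the "s" after the apostrophe never show up in the word list. The variable names are my own.

```python
import re
from collections import Counter

# The lower-cased example headline from above
text = ("the commander of a navy air reconnaissance squadron that provides "
        "the president and the defense secretary the airborne ability to "
        "command the nation's nuclear weapons has been relieved of duty")

# Same regular expression as CountVectorizer's default token_pattern:
# tokens must be at least two word characters long
tokens = re.findall(r"(?u)\b\w\w+\b", text)

# Counter does in one pass what the list comprehension with set() did
counts = Counter(tokens)

print(len(tokens), len(counts))
print(counts.most_common(2))
```

Counter walks the list once, whereas calling example3.count(x) for every distinct word rescans the list each time. For one headline the difference is invisible, but it adds up on longer texts.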

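The example above shows how words are turned into numbers, namely how often each word occurs. Next we use the fit_transform method of sklearn's CountVectorizer to do this conversion for the whole dataset.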

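PART 2 Processing and transforming the data

Before converting, we need to gather all the words together: each day's headlines are joined into one string, and together these strings form the raw corpus.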

trainheadlines = []
for row in range(0, len(train.index)):
    trainheadlines.append(' '.join(str(x) for x in train.iloc[row, 2:27]))

All the news is now stored in trainheadlines, so we can call CountVectorizer's mighty fit_transform directly.

basicvectorizer = CountVectorizer()
basictrain = basicvectorizer.fit_transform(trainheadlines)
print(basictrain.shape)
Printed result:
(1611, 31675)

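The printed result shows that fit_transform built a large matrix with 1611 rows and 31675 columns. The 1611 rows correspond to the 1611 rows of the training data, and each of the 31675 columns corresponds to one distinct word. Each cell holds the number of times that word occurs in that row's headlines, or 0 if it does not occur at all.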
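To make the shape of that matrix concrete, here is a small sketch on two made-up headlines (not from the dataset). It shows that the cells are occurrence counts rather than 0/1 flags; the toy documents and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical headlines standing in for two days of news
docs = [
    "oil prices rise as oil supply falls",
    "markets fall on oil fears",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

vocab = vectorizer.vocabulary_            # maps each word to its column index
oil_col = vocab["oil"]
dense = matrix.toarray()

print(matrix.shape)                       # (documents, distinct words)
print(dense[0][oil_col], dense[1][oil_col])
```

The word "oil" appears twice in the first headline, so its cell holds 2, not 1. If presence/absence flags are what you want, CountVectorizer accepts binary=True.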

PART 3 Training the model and making predictions

# Build the model and fit it on the training matrix

basicmodel = LogisticRegression()
basicmodel = basicmodel.fit(basictrain, train["Label"])

# Turn the test set into a matrix in the same way, then predict with the model trained above

testheadlines = []
for row in range(0, len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row, 2:27]))
basictest = basicvectorizer.transform(testheadlines)
predictions = basicmodel.predict(basictest)
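Note that the test set goes through transform, not fit_transform. The vocabulary learned from the training headlines fixes the columns, so the test matrix has the same width as the training matrix, and words that never appeared in training are simply ignored. A minimal sketch with made-up headlines (the toy_* names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines, not from the dataset
toy_train_docs = ["oil prices rise", "markets fall on oil fears"]
toy_test_docs = ["oil prices collapse overnight"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training text
toy_train_matrix = toy_vectorizer.fit_transform(toy_train_docs)
# transform reuses that vocabulary; "collapse" and "overnight" are unseen
toy_test_matrix = toy_vectorizer.transform(toy_test_docs)

print(toy_train_matrix.shape, toy_test_matrix.shape)
print(toy_test_matrix.sum())   # only "oil" and "prices" are counted
```

Calling fit_transform on the test text instead would build a different vocabulary, and the model's coefficients would no longer line up with the columns.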

A table in the style of a confusion matrix shows how accurate the predictions are.

pd.crosstab(test["Label"], predictions, rownames=["Actual"], colnames=["Predicted"])
Output:
Predicted    0    1
Actual
0           61  125
1           92  100

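Working it out from the table, the logistic regression model's accuracy is only about 0.43, that is (61 + 100) / 378. That is too low: even guessing at random should be right about half the time.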
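For anyone who wants to check the arithmetic, accuracy is the sum of the diagonal of the table (correct predictions) divided by the total number of test rows. A short sketch using the four counts from the crosstab above:

```python
# Counts copied from the crosstab: confusion[actual][predicted]
confusion = {
    0: {0: 61, 1: 125},
    1: {0: 92, 1: 100},
}

# Correct predictions sit on the diagonal (actual == predicted)
correct = confusion[0][0] + confusion[1][1]
total = sum(sum(row.values()) for row in confusion.values())
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.3f}")   # 161 / 378 = 0.426
```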

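Before going any further, let's look at the coefficient the model assigned to each word, which indicates how strongly that word is associated with the predicted label.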

# On scikit-learn older than 1.0, use get_feature_names() instead
basicwords = basicvectorizer.get_feature_names_out()
basiccoeffs = basicmodel.coef_.tolist()[0]
coeffdf = pd.DataFrame({'Word': basicwords, 'Coefficient': basiccoeffs})
coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[False, True])
# List the ten words with the largest coefficients
coeffdf.head(10)
Output:
       Coefficient  Word
19419  0.497924     nigeria
25261  0.452526     self
29286  0.428011     tv
15998  0.425863     korea
20135  0.425716     olympics
15843  0.411636     kills
26323  0.411267     so
29256  0.394855     turn
10874  0.388555     fears
28274  0.384031     territory
# List the ten words with the smallest coefficients
coeffdf.tail(10)
Output:
       Coefficient  Word
27299  -0.424441    students
8478   -0.427079    did
6683   -0.431925    congo
12818  -0.444069    hacking
7139   -0.448571    country
16949  -0.463116    low
3651   -0.470454    begin
25433  -0.494555    sex
24754  -0.549725    sanctions
24542  -0.587794    run

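That counts as a complete end-to-end solution, but the final predictions are remarkably poor, so some optimization is clearly needed. So far we have used CountVectorizer with its default parameters. Here we make a small change: setting ngram_range=(2,2) builds the features from pairs of adjacent words instead of single words. Naturally this produces far more columns, nearly 370,000, as shown below: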

advancedvectorizer = CountVectorizer(ngram_range=(2,2))
advancedtrain = advancedvectorizer.fit_transform(trainheadlines)
print(advancedtrain.shape)
Output:
(1611, 366721)
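To see what ngram_range=(2,2) actually produces, here is a sketch on one made-up sentence. Each feature is a pair of neighbouring words, which is why the column count grows so sharply. The sentence and variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A hypothetical headline
sentence = ["the country is set to vote"]

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(sentence)

# vocabulary_ maps every learned bigram to its column index
bigrams = sorted(bigram_vectorizer.vocabulary_)
print(bigrams)
# ['country is', 'is set', 'set to', 'the country', 'to vote']
```

With ngram_range=(1,2) the vectorizer would keep the single words as well as the pairs; the post uses pairs only.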

# Train a fresh logistic regression model on the new matrix and predict again

advancedmodel = LogisticRegression()
advancedmodel = advancedmodel.fit(advancedtrain, train["Label"])
testheadlines = []
for row in range(0, len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row, 2:27]))
advancedtest = advancedvectorizer.transform(testheadlines)
advpredictions = advancedmodel.predict(advancedtest)

# With two-word phrases as features, the accuracy now clearly improves to about 0.56, that is (66 + 147) / 378

pd.crosstab(test["Label"], advpredictions, rownames=["Actual"], colnames=["Predicted"])
Output:
Predicted    0    1
Actual
0           66  120
1           45  147

# Now look at the coefficients of the individual phrases

# On scikit-learn older than 1.0, use get_feature_names() instead
advwords = advancedvectorizer.get_feature_names_out()
advcoeffs = advancedmodel.coef_.tolist()[0]
advcoeffdf = pd.DataFrame({'Words': advwords, 'Coefficient': advcoeffs})
advcoeffdf = advcoeffdf.sort_values(['Coefficient', 'Words'], ascending=[False, True])
# List the ten phrases with the largest coefficients
advcoeffdf.head(10)
Output:
        Coefficient  Words
272047  0.286533     right to
24710   0.275274     and other
285392  0.274698     set to
316194  0.262873     the first
157511  0.227943     in china
159522  0.224184     in south
125870  0.219130     found in
124411  0.216726     forced to
173246  0.211137     it has
322590  0.209239     this is
# List the ten phrases with the smallest coefficients
advcoeffdf.tail(10)
Output:
        Coefficient  Words
326846  -0.198495    to help
118707  -0.201654    fire on
155038  -0.209702    if he
242528  -0.211303    people are
31669   -0.213362    around the
321333  -0.215699    there is
327113  -0.221812    to kill
340714  -0.226289    up in
358917  -0.227516    with iran
315485  -0.331153    the country

These results show that building the matrix from two-word phrases improves the prediction accuracy.
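To put the two results side by side, the sketch below recomputes both accuracies from the two crosstabs and adds one extra reference point that is my own addition, not part of the original analysis: the accuracy of always predicting the more common label in the test set.

```python
def accuracy(table):
    """table[actual][predicted] -> share of correct predictions."""
    correct = table[0][0] + table[1][1]
    total = sum(sum(row) for row in table)
    return correct / total

# Counts copied from the two crosstabs above
unigram_table = [[61, 125], [92, 100]]
bigram_table = [[66, 120], [45, 147]]

unigram_acc = accuracy(unigram_table)
bigram_acc = accuracy(bigram_table)

# Added reference point (an assumption of this sketch): always guess the
# label that occurs most often among the actual test labels
actual_counts = [sum(row) for row in bigram_table]
majority_baseline = max(actual_counts) / sum(actual_counts)

print(f"single words: {unigram_acc:.3f}")
print(f"word pairs:   {bigram_acc:.3f}")
print(f"majority:     {majority_baseline:.3f}")
```

On this one test set the word-pair model comes out ahead of the constant guess (192 of the 378 test days have Label=1), while the single-word model falls below it.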