樸素貝葉斯演算法和R語言

04-20

最近在看一本書《機器學習和R語言》，就針對第四章樸素貝葉斯演算法的實現寫一篇小文章。主要自己在練習的時候發現網上好像還沒有完整解決方案，所以我來搬磚勒。文章的內容主要是介紹這一章實驗的實現以及我給出的一些小調整。

樸素貝葉斯演算法是非常經典的分類演算法，在分析分本數據時有很好的應用。這裡就不再介紹他的理論依據了，直接上數據。

—————————————————————————————————————————

基於貝葉斯演算法的手機簡訊過濾

數據可在Github上打包下載，需要的數據為sms_spam.csv。

由於原數據文本中包含雙引號，會導致在讀取數據時發生錯誤，將原本屬於text的誤讀入type。可以在Excel中查找所有的雙引號，然後替換成單引號。這裡的替換對模型不會有任何影響，因為後面將去除text中的所有標點符號。數據中可能還有部分文本的錯位，可以邊讀邊定位，並且進行修正。

回到RConsole

一、將數據導入R

首先，導入數據

> sms_raw <- read.csv("sms_spam.csv",header=TRUE,stringsAsFactors=FALSE)

查看一下sms_raw的數據結構

> str(sms_raw)data.frame: 5574 obs. of 2 variables: $ type: chr "ham" "ham" "spam" "ham" ... $ text: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor... U c already then say..." ...

sms_raw為數據框結構，有5574個對象，2個變數

將sms_raw$type設置為因子變數

> sms_raw$type <- factor(sms_raw$type)> str(sms_raw$type)Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...

這裡也剛好可以檢驗一下數據有沒有誤讀。若type中有其他類型的文本，那麼factor會大於2個。看一下垃圾簡訊與非垃圾簡訊在這個數據集中各佔了多少。

> table(sms_raw$type) ham spam 4827 747 > prop.table(table(sms_raw$type)) ham spam 0.8659849 0.1340151

其中，垃圾簡訊佔了13.4%，非垃圾簡訊佔了86.6%

二、清洗數據

看一下數據集中的text

> sms_raw$text[1:3][1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." [2] "Ok lar... Joking wif u oni..." [3] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&Cs apply 08452810075over18s"

文本中包含著數字、縮略的短語和標點符號等，對於NaiveBayesClassifier而言，這些信息是有干擾的，因為關注的更多的是terms和spam之間的聯繫。因此，在建模之前需要在語料庫中對數據進行清洗。

這裡用到添加包tm

> install.packages("tm")> library(NLP)> library(tm)

1.創建語料庫（語料庫中只包含text的內容）

> sms_corpus <- Corpus(VectorSource(sms_raw$text))> print(sms_corpus)<<VCorpus>>Metadata: corpus specific: 0, document level (indexed): 0Content: documents: 5574

sms_corpus中含有5574個文件

2.清理語料庫

從raw語料庫到cleaned語料庫 sms_corpus -> corpus_clean

# 所有字母轉換成小寫> corpus_clean <- tm_map(sms_corpus, tolower)# 去除text中的數字> corpus_clean <- tm_map(corpus_clean, removeNumbers)# 去除停用詞，例如and,or,until...> corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())# 去除標點符號> corpus_clean <- tm_map(corpus_clean, removePunctuation)# 去除多餘的空格，使單詞之間只保留一個空格> corpus_clean <- tm_map(corpus_clean, stripWhitespace)

查看一下清理後的語料庫文本

> inspect(corpus_clean[1:3])<<VCorpus>>Metadata: corpus specific: 0, document level (indexed): 0Content: documents: 3[[1]][1] go jurong point crazy available bugis n great world la e buffet cine got amore wat[[2]][1] ok lar joking wif u oni[[3]][1] free entry wkly comp win fa cup final tkts st may text fa receive entry questionstd txt ratetcs apply s

三、標記化——將文本分解成由單個單片語成的組

這裡實現語料庫向稀疏矩陣的轉變 corpus_clean -> sms_dtm

> sms_dtm <- DocumentTermMatrix(corpus_clean)Error: inherits(doc, "TextDocument") is not TRUE

這裡報錯了，corpus_clean的數據結構沒有滿足DocumentTermMatrix( )的代入要求。在tm 0.6.0之後tm_map() 不再返回TextDocument，而是返回charactor。

NEWS

Changes in tm version 0.6

SIGNIFICANT USER-VISIBLE CHANGES

Text documents now provide the functions content() and as.character() to access the (possibly raw) document content and the natural language text in a suitable (not necessarily structured) form.

此時需要藉助一個函數PlainTextDocument( )

Plain Text Documents

Description

Create plain text documents.

> corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

將文本信息轉化成DocumentTermMatrix類型的稀疏矩陣，快接近數據處理的最後階段了。

1. 數據準備——建立train set & test set

# split sms_raw> sms_raw_train <- sms_raw[1:4169,]> sms_raw_test <- sms_raw[4170:5574,]#split sms_dtm> sms_dtm_train <- sms_dtm[1:4169,]> sms_dtm_test <- sms_dtm[4170:5574,]#split corpus_clean> sms_corpus_train <- corpus_clean[1:4169,]> sms_corpus_test <- corpus_clean[4170:5574,]

train和test好像分的有點隨意，咳咳。那啥我們看看兩個set中spam佔比如何。

> prop.table(table(sms_raw_train$type)) ham spam 0.8647158 0.1352842> prop.table(table(sms_raw_test$type)) ham spam 0.8694722 0.1305278

spam均在13%左右，這裡可以暫時往下走，出問題再回頭look look。

2.可視化文本數據——詞雲

通過詞雲可以大致瀏覽一下哪些詞在spam中經常出現，哪些詞在ham中經常出現。當然，前者對於垃圾簡訊的過濾相對重要一點。繪製詞雲可以通過添加包wordcloud實現。

> install.packages(wordcloud)> library(RColorBrewer)> library(wordcloud)

為了看train set中spam和ham各自的多頻詞，取train set的子集

> spam <- subset(sms_raw_train, type == "spam")> ham <- subset(sms_raw_train, type == "ham")

接下來是見證奇蹟的時刻。。。plot wordcloud...

> wordcloud(spam$text, max.words=40, scale=c(3,0.5))

call free now 等詞出現的次數較多，腦補了一下自己收到的垃圾簡訊，好像都是啥啥啥促銷了。。。看來老外的垃圾簡訊很多都讓打電話啊，還free呢，了解的不多。但是，在選擇合適的方式過濾垃圾簡訊時，瀏覽一些垃圾簡訊的範本還是很有必要的。

> wordcloud(ham$text,max.words=40,scale=c(3,0.5))

正常簡訊用的詞都差不多呢，大小都比較接近，可以看到其中也有出現call，但一般情況一下會覺得call有固定的搭配，比如call you later啊，call me啊，call 某某啊（以及這些搭配的縮寫）之類的。

2.縮減特徵

現在面臨問題是稀疏矩陣的特徵太多了，而且很多詞在所有text中可能都沒怎麼出現過，為減少運算量對特徵瘦瘦身。先留下來在所有text中出現至少5次的詞。

> findFreqTerms(sms_dtm_train,5) [1] "abiola" "able" "abt" [4] "accept" "access" "account" [7] "across" "activate" "actually" [10] "add" "address" "admirer" [13] "advance" "aft" "afternoon" [16] "aftr" "age" "ago" [19] "ahead" "aight" "aint" [22] "air" "aiyah" "alex" [25] "almost" "alone" "already" ............[1216] "yun" "yup" "攏 award" [1219] "攏 bonus" "攏 cash" "攏 gift" [1222] "攏 per" "攏 prize" "攏 sub" [1225] "攏 weekly" "攏 worth" "攏wk" [1228] "眉 wan"

將這些詞設置成指示標識，下面建模時用這個指示標識提示模型只對這些詞進行計算

> sms_dict <- Dictionary(findFreqTerms(sms_dtm_train,5))Error: could not find function "Dictionary"

納尼，沒有找到Dictionary，反覆檢查了一下拼寫，沒錯啊

Changes in tm version 0.5-10

DEPRECATED & DEFUNCT

Following functions have been removed:

Dictionary() (use a character vector instead; use Terms() to extract terms from a document-term or term-document matrix),

媽賣批，函數都給我刪了，學術之路說斷就斷（微笑臉）。不過還好給指了一條明路，可以用Terms( )來代替，但是Terms( )只讀入TermDocumentMatrix或者DocumentTermMatrix，但是findFreqTerms 返回的是charactor，好吧，別逼我。改一下findFreqTerms

myfindFreqTerms <- function(x,lowfreq=0,highfreq=Inf){ stopifnot(inherits(x,c("DocumentTermMatrix","TermDocumentMatrix")), is.numeric(lowfreq),is.numeric(highfreq)) if(inherits(x,"DocumentTermMatrix")) x<-t(x) rs <- slam::row_sums(x) y <- which(rs >= lowfreq & rs<= highfreq) return(x[y,])}

Ctrl+A Ctrl+Enter 又能愉快地玩耍了

> sms_dict <- Terms(myfindFreqTerms(sms_dtm_train,5))

用sms_dict對sms_corpus進行一下瘦身

> sms_train <- DocumentTermMatrix(sms_corpus_train,list(dictionary=sms_dict))> sms_train <- DocumentTermMatrix(sms_corpus_test, list(dictionary=sms_dict))

train和set留下了1228個特徵

> sms_train$ncol[1] 1228> sms_test$ncol[1] 1228

好了，有train set了，可以帶入NB了吧？不不不，train和test都是計數矩陣，如果一條text中某個單詞出現2次，那麼這個單詞在這條文本下會被記上2，NB只想知道這個單詞出現了或者沒出現，因此需要對矩陣進行轉化成因子矩陣。

> convert_counts <- function(x){ x <- ifelse(x>0,1,0) x <- factor(x, levels=c(0,1),labels=c("No","Yes")) return(x)}

將convert_counts用在train和set的每一列上

> sms_train <- apply(sms_train, MARGIN=2, convert_counts)> sms_test <- apply(sms_test, MARGIN=2, convert_counts)

好了，「萬事俱備，只欠東風」！

四、訓練模型

需要的包是e1071，老規矩

> install.packages("e1071")> library(e1071)

這裡分為兩步：

1.建立NaiveBayesClassifier

2.測試Classifier

> sms_classifier <- naiveBayes(sms_train,sms_raw_train$type)> sms_prediction <- predict(sms_classifier, sms_test)

好了，經歷了「萬里長征」最後怎麼也得憋個大招啊，可是。。。就是這麼突然的已經有分類器勒。當然啦，因為這是比較經典（古老）的分類方法，所以不用自己寫太多東西。

五、評估模型

用交叉表來看看test中多少預測對了，好給我們的工作有個匯總

> install.packags(gmodels)> library(gmodels)> CrossTable(sms_prediction,sms_raw_test$type,prop.chisq=TRUE,prop.t=FALSE,dnn=c("predicted","actual")) Cell Contents|-------------------------|| N || N / Row Total || N / Col Total || N / Table Total ||-------------------------| Total Observations in Table: 1405 | sms_raw_test$type sms_predictions | ham | spam | Row Total | ----------------|-----------|-----------|-----------| ham | 1216 | 28 | 1244 | | 0.977 | 0.023 | 0.885 | | 0.995 | 0.153 | | | 0.865 | 0.020 | | ----------------|-----------|-----------|-----------| spam | 6 | 155 | 161 | | 0.037 | 0.963 | 0.115 | | 0.005 | 0.847 | | | 0.004 | 0.110 | | ----------------|-----------|-----------|-----------| Column Total | 1222 | 183 | 1405 | | 0.870 | 0.130 | | ----------------|-----------|-----------|-----------|

ham-ham和spam-spam是預測正確的，本身不是垃圾簡訊卻被認為是垃圾簡訊過濾掉，即spam-ham，可能會導致用戶不再信任此功能。可能需要將這些誤分類的簡訊挑選出來，看是由於什麼原因導致的，由於Classifier1沒有設置拉普拉斯估計，下面再嘗試建立classifier2，看結果是否被優化。

> sms_classifier2 <- naiveBayes(sms_train,sms_raw_train$type,laplace=1)> sms_predictions2<- predict(sms_classifier2,sms_test)> CrossTable(sms_predictions2,sms_raw_test$type,prop.chisq = FALSE,prop.r = FALSE,dnn=c(predicted,actual)) Cell Contents|-------------------------|| N || N / Col Total || N / Table Total ||-------------------------| Total Observations in Table: 1405 | actual predicted | ham | spam | Row Total | -------------|-----------|-----------|-----------| ham | 1217 | 30 | 1247 | | 0.996 | 0.164 | | | 0.866 | 0.021 | | -------------|-----------|-----------|-----------| spam | 5 | 153 | 158 | | 0.004 | 0.836 | | | 0.004 | 0.109 | | -------------|-----------|-----------|-----------|Column Total | 1222 | 183 | 1405 | | 0.870 | 0.130 | | -------------|-----------|-----------|-----------|

spam-ham誤分類減少，而ham-spam誤分數量上升，雖然laplace只設置了1，但是分類有明顯的變化，可能是由於簡訊text較短，由此導致一個document的單詞數量少，因此對laplace比較敏感。為了減少laplace的影響，但又避免類條件概率為0，嘗試laplace=0.5。代碼和之前的類似，只需修改最後一個parameter。

Cell Contents|-------------------------|| N || N / Col Total || N / Table Total ||-------------------------| Total Observations in Table: 1405 | actual predicted | ham | spam | Row Total | -------------|-----------|-----------|-----------| ham | 1217 | 28 | 1245 | | 0.996 | 0.153 | | | 0.866 | 0.020 | | -------------|-----------|-----------|-----------| spam | 5 | 155 | 160 | | 0.004 | 0.847 | | | 0.004 | 0.110 | | -------------|-----------|-----------|-----------|Column Total | 1222 | 183 | 1405 | | 0.870 | 0.130 | | -------------|-----------|-----------|-----------|

結果比Classifer1和Classifer2都有一定的改進。當然，優化的道路才剛剛開始。。。哈哈哈哈哈。。。

滾去吃飯了， see u guys