Using R to Analyze the Relationship Between Price and Ratings of Japanese Restaurants in Shanghai

02-10

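I. Data preparation: scrape the Japanese-restaurant listings on Dianping (大眾點評), collecting price, taste, environment, and service scores.

The code is as follows: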

# Return the text between the first leftchar and the first rightchar in each element of name
hy1<-function(name,leftchar,rightchar){
  left<-gregexpr(leftchar,name)
  right<-gregexpr(rightchar,name)
  for(i in seq_along(name)){
    name[i]<-substring(name[i],left[[i]][1]+attr(left[[i]],"match.length")[1],right[[i]][1]-1)
  }
  name
}

# Request headers that make the scraper look like a browser
myheader<-c(
  "User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
  "Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language"="en-us",
  "Connection"="keep-alive",
  "Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7")

library(XML)
library(bitops)
library(RCurl)

date_all<-data.frame()
for (j in 1:50){
  # The listing URL did not survive in the source text; the string below is the placeholder that remains
  url<-paste("抱歉-大眾點評網",j,sep = "")
  temp<-getURL(url,httpheader=myheader) # fetch page j with browser-like headers
  k<-strsplit(temp," ")[[1]]
  name1<-k[grep("data-hippo-type",k)+1]
  name<-hy1(name1,"<h4>","</h4>")
  price1<-k[grep("￥",k)]
  price<-as.numeric(hy1(price1,"￥","</b>"))
  taste1<-k[grep("comment-list",k)+1]
  taste<-as.numeric(hy1(taste1,"<b>","</b>"))
  environment1<-k[grep("comment-list",k)+2]
  environment<-as.numeric(hy1(environment1,"<b>","</b>"))
  service1<-k[grep("comment-list",k)+3]
  service<-as.numeric(hy1(service1,"<b>","</b>"))
  address1<-k[grep('class="addr"',k)]
  address<-hy1(address1,'"addr">',"</span>")
  # Keep the page only if every field was found for every shop
  if (length(name)==length(price)&&length(price)==length(taste)&&length(taste)==length(environment)&&length(environment)==length(service)&&length(service)==length(address)) {
    date_0105<-data.frame(name,price,taste,environment,service,address)
    date_all<-rbind(date_0105,date_all)
  } else {
    print(paste("cant get page",j))
  }
}
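The helper hy1 pulls out whatever sits between a left marker and a right marker. As a rough sketch of the same idea with explicit handling of missing markers, the snippet below runs on invented HTML fragments (they are not real Dianping markup), and extract_between is a hypothetical name that does not appear in the original script:

```r
# Extract the text between the first `left` marker and the next `right` marker.
# `extract_between` is a hypothetical helper used only for illustration.
extract_between <- function(x, left, right) {
  # fixed = TRUE treats the markers as literal strings, not regular expressions
  start <- regexpr(left, x, fixed = TRUE)
  out <- rep(NA_character_, length(x))
  for (i in seq_along(x)) {
    if (start[i] == -1) next                 # left marker not found: leave NA
    rest <- substring(x[i], start[i] + nchar(left))
    end <- regexpr(right, rest, fixed = TRUE)
    if (end == -1) next                      # right marker not found: leave NA
    out[i] <- substring(rest, 1, end - 1)
  }
  out
}

# Invented sample fragments, assumed to resemble the tokens the scraper sees
sample_html <- c("<h4>Sushi Bar</h4>", "<b>8.9</b>", "no markers here")
shop_names <- extract_between(sample_html, "<h4>", "</h4>")
ratings <- as.numeric(extract_between(sample_html, "<b>", "</b>"))
print(shop_names)
print(ratings)
```

Returning NA when a marker is missing makes broken pages easy to spot with is.na(), instead of silently producing an empty or shifted string.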

The result is a data frame with the following columns:

name price taste environment service

II. Data analysis

1. Scatter plot

library(ggplot2)
hy1<-hy0106
hy1<-hy1[(hy1$price<1000),] # drop points priced at 1000 or more
ggplot(hy1,aes(x=taste,y=price))+geom_point()

2. Frequency histogram

library(ggplot2)
hy2<-hy1[(hy1$price<500),] # keep shops priced below 500
ggplot(hy2,aes(x=price))+geom_histogram(binwidth=20,fill="white",colour="black")

3. Data summary

summary(hy0106$price)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 141.0   168.0   209.0   294.1   315.0  2714.0

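The median price is 209 yuan and half of the shops fall between 168 and 315 yuan. Even the cheapest Japanese restaurant costs 141 yuan per person, but a typical bill of around 209 yuan is still acceptable. The mean (294.1) sits well above the median, so a few expensive shops pull the distribution to the right. As for the maximum of 2714 yuan, that is probably about more than just having a Japanese meal.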
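To put a number on how unusual the most expensive shop is, one option is Tukey's 1.5 × IQR rule of thumb. The sketch below uses only the quartiles reported by summary() above. The rule itself is an assumption brought in for illustration and is not part of the original analysis:

```r
# Quartiles as reported by summary(hy0106$price) in the text
q1 <- 168.0
q3 <- 315.0
max_price <- 2714.0

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # prices above this are flagged as high outliers
lower_fence <- q1 - 1.5 * iqr   # prices below this would be low outliers

cat("IQR:", iqr, "\n")
cat("Upper fence:", upper_fence, "\n")
cat("Maximum beyond the upper fence:", max_price > upper_fence, "\n")
```

With the real data frame, the same fence could be applied as a filter in place of the fixed cut-offs of 1000 and 500 used for the plots.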

III. Slightly more advanced analysis (for example, does a higher price mean better service?)

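hy0107<-date_all
hy2<-hy0107
hy2<-hy2[,2:6]
# Overall score: the mean of the taste, environment and service ratings
hy2$score<-(hy2$taste+hy2$environment+hy2$service)/3
# foodclass (cuisine type) is not produced by the scraper shown earlier; it must already be among columns 2 to 6
ggplot(hy2,aes(x=foodclass,y=score))+geom_boxplot()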

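The box plot shows that Japanese food and hot pot have the best ratings overall. However, Western food and Japanese food also have the most outliers (very low scores), so it pays to choose a Japanese restaurant with some care.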

2. Regression of price on rating

lm_hy<-lm(price~score,data=hy2)
summary(lm_hy)

Call:
lm(formula = price ~ score, data = hy2)

Residuals:
   Min     1Q Median     3Q    Max
-289.1 -219.5 -138.3   65.8 4770.9

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   447.73     186.20   2.405   0.0165 *
score          16.07      22.34   0.719   0.4721
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 424.3 on 688 degrees of freedom
Multiple R-squared:  0.0007517, Adjusted R-squared:  -0.0007007
F-statistic: 0.5176 on 1 and 688 DF,  p-value: 0.4721

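The p-value for the score coefficient is 0.47, far above 0.05, and R-squared is only about 0.00075, so the data show no statistically significant linear relationship between price and rating.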
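For a regression with a single predictor, the figures in the summary are tied together: F equals t squared, R-squared equals t² / (t² + residual df), and the p-value comes from the t distribution. As a sanity check, the sketch below recomputes them from the rounded t value printed above, so small rounding differences from the printed output are expected:

```r
# Values copied from the regression summary in the text
t_score <- 0.719    # t value of the score coefficient (rounded in the printout)
df_resid <- 688     # residual degrees of freedom

f_stat <- t_score^2                               # F statistic on 1 and df_resid DF
r_squared <- t_score^2 / (t_score^2 + df_resid)   # share of variance explained
p_value <- 2 * pt(-abs(t_score), df = df_resid)   # two-sided p-value

cat("F statistic:", round(f_stat, 4), "\n")
cat("R-squared:", signif(r_squared, 4), "\n")
cat("p-value:", round(p_value, 4), "\n")
```

An R-squared this close to zero means the overall score explains well under 0.1% of the variation in price.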

3. Test whether price and rating follow a normal distribution; here we use two methods

3.1 Shapiro-Wilk test

> shapiro.test(hy2$score)

        Shapiro-Wilk normality test

data:  hy2$score
W = 0.9302, p-value < 2.2e-16

The p-value is far below 0.05, so the scores do not follow a normal distribution.
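To see how shapiro.test behaves on data of a known shape, here is a small sketch on simulated samples. The two samples and their parameters are invented for illustration and are not the restaurant scores. Note that shapiro.test only accepts between 3 and 5000 observations.

```r
set.seed(1)

# Simulated, roughly bell-shaped scores (invented parameters)
normal_sample <- rnorm(500, mean = 8, sd = 0.5)

# Simulated left-skewed scores: most values are high, with a tail of low ones
skewed_sample <- 9.5 - rexp(500, rate = 2)

res_normal <- shapiro.test(normal_sample)
res_skewed <- shapiro.test(skewed_sample)

cat("Normal sample: W =", round(res_normal$statistic, 4), "\n")
cat("Skewed sample: W =", round(res_skewed$statistic, 4),
    " p-value =", format.pval(res_skewed$p.value), "\n")
```

A W statistic noticeably below 1 together with a tiny p-value, as for the skewed sample, is the same pattern reported above for hy2$score.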

3.2 Pearson chi-squared test

First bin the data and count the observations in each bin

> X<-hy2$price
> summary(X)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  304.0   364.0   446.5   581.2   649.0  5367.0
> # count the observations per bin; include.lowest=TRUE keeps the minimum (304) in the first bin
> A<-table(cut(X,br=c(304,330,364,394,446,500,580,620,640,5000,5367),include.lowest=TRUE))
> # normal CDF at the upper edge of each bin, using the sample mean and sd of X
> p<-pnorm(c(330,364,394,446,500,580,620,640,5000,5367), mean(X), sd(X))
> p<-c(p[1], p[2]-p[1], p[3]-p[2], p[4]-p[3],p[5]-p[4],p[6]-p[5],p[7]-p[6],p[8]-p[7],p[9]-p[8],p[10]-p[9])
> chisq.test(A,p=p)

        Chi-squared test for given probabilities

data:  A
X-squared = Inf, df = 9, p-value < 2.2e-16

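The p-value is again far below 0.05, so price does not follow a normal distribution either. The statistic is reported as Inf because the fitted normal assigns essentially zero probability to the last bin (5000 to 5367), which nevertheless contains the maximum observation.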
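One way to keep the statistic finite is to let the outer bins run to minus and plus infinity and to place the inner breaks at sample quantiles, so that no bin has a near-zero expected count. The sketch below does this on simulated log-normal prices; the data, the seed and the choice of decile breaks are all assumptions made for illustration, not the original data or method:

```r
set.seed(42)

# Simulated right-skewed prices (invented parameters, not the scraped data)
X <- rlnorm(690, meanlog = log(450), sdlog = 0.6)

# Inner breaks at the sample deciles; open-ended outer bins cover the whole real line
inner <- quantile(X, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)
breaks <- c(-Inf, inner, Inf)
observed <- table(cut(X, breaks = breaks))

# Bin probabilities under a normal with the sample mean and sd; they sum to 1
p <- diff(pnorm(breaks, mean = mean(X), sd = sd(X)))

res <- chisq.test(observed, p = p)

# chisq.test uses k - 1 degrees of freedom; since the mean and sd were estimated
# from the same data, a common textbook adjustment removes two more
k <- length(observed)
p_adjusted <- pchisq(unname(res$statistic), df = k - 1 - 2, lower.tail = FALSE)

cat("X-squared:", round(unname(res$statistic), 2), "\n")
cat("p-value with df = k - 3:", format.pval(p_adjusted), "\n")
```

Because diff(pnorm(breaks, ...)) is computed from one vector of breaks, the ten probabilities do not have to be typed out by hand, and they sum to 1 by construction.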