A Simple Data Analysis and Processing Exercise (R)

01-28

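Plan for this exercise: analyze the 2016 sales data of Chaoyang Hospital in R and produce the hospital's average monthly number of purchases, average monthly spending, average spend per purchase, and spending trend, together with a chart.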

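Step 1: Read the data

Import the Excel data. The read.xlsx() function imports one worksheet into a data frame. The general form is read.xlsx(file, n), where file is the path to the workbook and n is the number of the worksheet. The code is: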

    library(xlsx)
    readFilepPath<-"E:/猴子大數據分析課程/朝陽醫院2016年銷售數據.xlsx"
    excelData<-read.xlsx(readFilepPath,"Sheet1",encoding="UTF-8")
    excelData

Alternatively:

    library(openxlsx)
    readFilepPath<-"E:/猴子大數據分析課程/朝陽醫院2016年銷售數據.xlsx"
    excelData<-read.xlsx(readFilepPath,"Sheet1")
    excelData

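Notes:

First, the xlsx package depends on Java (through rJava), so make sure an up-to-date Java is installed and configured. openxlsx does not need Java.

[How to configure Java environment variables on Windows 10](http://jingyan.baidu.com/article/02027811629b941bcc9ce521.html)

Second, make sure the xlsx or openxlsx package is installed. With the first approach, omitting encoding="UTF-8" may produce garbled characters. "Sheet1" can also be replaced with 1.

Third, make sure the file path is correct and uses forward slashes "/" rather than backslashes "\".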
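As a small illustration of the note on path separators, the sketch below converts a Windows-style path to forward slashes using base R only. The path is hypothetical and is not the one used in this exercise.

```r
# Hypothetical Windows-style path. Inside an R string literal every
# backslash has to be doubled, which is why forward slashes are easier.
winPath <- "E:\\data\\sales2016.xlsx"

# Replace each backslash with a forward slash.
# fixed = TRUE treats the pattern as plain text, not as a regular expression.
rPath <- gsub("\\", "/", winPath, fixed = TRUE)

print(rPath)
```

The converted string can then be passed to read.xlsx() in place of a hand-edited path.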

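Step 2: Preprocess the data

Remove the rows that contain missing data.

Section 4.5.2 of Chapter 4 explains how to exclude missing values from an analysis. The na.omit() function removes every row that contains missing data. This example instead uses !is.na() to keep the rows whose purchase time is not missing. The code is: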

    excelData<-excelData[!is.na(excelData$購葯時間),] # keep the rows whose purchase time is not missing
    excelData # print the result
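The two approaches do not keep the same rows. The sketch below shows the difference on a toy data frame whose column names and values are made up for illustration.

```r
# Toy data frame (hypothetical values) with missing entries in different columns.
toy <- data.frame(
  time   = c("2016-01-01", NA, "2016-01-03"),
  amount = c(10, 20, NA)
)

# na.omit() drops every row that has an NA in any column.
allComplete <- na.omit(toy)

# !is.na() on a single column only requires that column to be present.
timePresent <- toy[!is.na(toy$time), ]

nrow(allComplete)  # only the first row is complete in every column
nrow(timePresent)  # rows 1 and 3 both have a time value
```

Filtering on one column is the gentler choice when a row with a missing amount should still count as a visit.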

Step 3: Rename the columns

Calling fix(excelData) opens an interactive editor in which the names can be changed.

    fix(excelData)

Or use the names() function to rename the variables.

    names(excelData)<-c("time","cardNo","drugID","drugName","saleNmuber",
                        "virtualMoney","actualMoney") # rename the header row of the original data
    excelData # print the renamed data
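Assigning a full vector to names() relies on the column order being exactly as expected. As a hedged alternative, the sketch below renames a single column by matching its current name. The toy data frame and its column names are hypothetical.

```r
# Toy data frame with placeholder column names (not the real data set).
toy <- data.frame(a = 1:2, b = c("x", "y"))

# Rename only the column currently called "b", wherever it sits.
names(toy)[names(toy) == "b"] <- "drugName"

names(toy)
```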

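Step 4: Process the date column

1. The goal is to separate the date from the day of the week and drop the latter, keeping only the date, which makes the analysis easier.
This is done with str_split_fixed() from stringr, a string-processing package for R.

    library(stringr) # load the string-processing package
    timeSplit<-str_split_fixed(excelData$time," ",n=2) # split the time column into two columns: date and weekday
    excelData$time<-timeSplit[,1] # keep the first column, i.e. the date
    excelData # print the updated data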

Figure: the time column split into two columns, with only the date column kept.
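If stringr is not available, the same date part can be extracted with base R. This is a minimal sketch that assumes each value is a date followed by a space and a weekday label. The sample strings are hypothetical placeholders, not rows from the real file.

```r
# Hypothetical raw values: a date, a space, then a weekday label.
rawTime <- c("2016-01-01 Fri", "2016-01-02 Sat")

# Remove everything from the first space to the end of the string.
dateOnly <- sub(" .*$", "", rawTime)

print(dateOnly)
```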

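2. Then convert the time column from character type to a date.

    class(excelData$time) # print the class of the object
    is.character(excelData$time) # check whether it is character type
    excelData$time<-as.Date(excelData$time,"%Y-%m-%d") # convert to Date
    is.character(excelData$time) # check again whether it is character type
    class(excelData$time) # print the class of the object
    excelData$time # print the dates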

Figure: the result after converting to Date.
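When a format is supplied, as.Date() returns NA for strings it cannot parse, so it can be worth checking for failed conversions afterwards. The sketch below uses made-up input values.

```r
# Hypothetical input: two valid dates and one value that is not a date.
rawDates <- c("2016-01-01", "unknown", "2016-03-15")

# Strings that do not match the format become NA instead of raising an error.
parsed <- as.Date(rawDates, format = "%Y-%m-%d")

# List the original strings that failed to convert.
failed <- rawDates[is.na(parsed)]

class(parsed)
print(failed)
```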

Step 5: Convert saleNmuber, virtualMoney and actualMoney to numeric

    # quantity sold
    excelData$saleNmuber<-as.numeric(excelData$saleNmuber)
    # amount receivable
    excelData$virtualMoney<-as.numeric(excelData$virtualMoney)
    # amount actually received
    excelData$actualMoney<-as.numeric(excelData$actualMoney)
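One pitfall to be aware of: if a column happens to have been read as a factor, as.numeric() returns the internal level codes rather than the values. Whether that applies here depends on how the file was read, so the sketch below is only a precaution, shown on made-up values.

```r
# Hypothetical column that was read as a factor instead of numbers.
f <- factor(c("10.5", "3", "10.5"))

# as.numeric() on a factor gives the level codes, not the values.
codes <- as.numeric(f)

# Converting to character first recovers the actual numbers.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```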

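Step 6: Sort

The order() function can be used to sort a data frame. Here the data is sorted by sale time in ascending order (decreasing = FALSE), so that the earliest record comes first.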

    # sort by sale time in ascending order
    excelData<-excelData[order(excelData$time,decreasing = FALSE),]
    excelData

Figure: the data sorted by sale time in ascending order.

Or it can be done like this:

    attach(excelData) # attach the data frame to the search path
    excelData<-excelData[order(excelData$time,decreasing = FALSE),] # sort by time in ascending order
    detach(excelData) # detach it again
    excelData # print the sorted data
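The direction of the sort is controlled by the decreasing argument of order(). The sketch below shows both directions on a toy data frame with made-up values.

```r
# Toy data frame (hypothetical values), deliberately out of order.
toy <- data.frame(
  time   = as.Date(c("2016-03-01", "2016-01-15", "2016-02-10")),
  amount = c(30, 10, 20)
)

# decreasing = FALSE is the default: earliest date first.
asc <- toy[order(toy$time), ]

# decreasing = TRUE: latest date first.
desc <- toy[order(toy$time, decreasing = TRUE), ]

asc$amount
desc$amount
```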

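Step 7: Compute the number of purchases and the number of months

All purchases made by the same person on the same day count as one purchase, so rows that repeat the same combination of date and medical insurance card number are removed.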

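    # use the logical NOT operator to drop rows whose time and cardNo combination has already appeared
    kpi1<-excelData[!duplicated(excelData[,c("time","cardNo")]),]
    # the total number of purchases equals the number of rows
    consumeNumber<-nrow(kpi1)

duplicated() returns a logical vector that marks which elements (here, rows) repeat an earlier one. Negating it with ! keeps the first occurrence of each combination.

This gives the total number of purchases.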
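To make the behaviour of duplicated() concrete, the sketch below applies it to a toy data frame. The dates and card numbers are made up.

```r
# Toy data (hypothetical): card A buys twice on the same day.
toy <- data.frame(
  time   = c("2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02"),
  cardNo = c("A", "A", "B", "A")
)

# TRUE marks a row whose (time, cardNo) pair already appeared above it.
dupFlag <- duplicated(toy[, c("time", "cardNo")])

# Keep only the first row for each pair.
visits <- toy[!dupFlag, ]

print(dupFlag)
nrow(visits)
```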

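    # earliest time (the first row, since the data is sorted in ascending order)
    startTime<-kpi1$time[1]
    # latest time (the last row)
    endTime<-kpi1$time[nrow(kpi1)]
    # number of days
    day<-endTime-startTime
    # number of months
    month<-as.numeric(day)%/%30 # day is converted to numeric before the integer division
    # X%/%Y is the integer quotient of X divided by Y
    month # print the number of months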
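The arithmetic can be checked by hand on a small case. The sketch below uses two hypothetical dates. Subtracting Date objects gives a difftime measured in days.

```r
# Hypothetical start and end dates (2016 is a leap year).
startTime <- as.Date("2016-01-01")
endTime   <- as.Date("2016-07-19")

# Date minus Date yields a difftime in days.
day <- endTime - startTime

# Strip the difftime class, then take the integer quotient by 30.
month <- as.numeric(day) %/% 30

class(day)
as.numeric(day)
month
```

Because a month is approximated as 30 days, the result is a rough count rather than a calendar-month count.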

Step 8: Compute the target metrics

Metric 1
Average monthly number of purchases = total number of purchases / number of months.

    # average monthly number of purchases
    > monthconsume<-consumeNumber/month
    > monthconsume
    [1] 899.6667
    > monthconsume<-consumeNumber%/%month
    > monthconsume
    [1] 899
    > class(consumeNumber)
    [1] "integer"
    > class(day)
    [1] "difftime"
    > class(month)
    [1] "numeric"

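Metric 2
Average monthly spending = total amount / number of months.

    > totalmoney<-sum(excelData$actualMoney,na.rm = TRUE)
    > monthmoney<-totalmoney/month
    > monthmoney
    [1] 50771.71

(The reference result is 50776.38. I tried twice and did not match it either time, so I need to check more carefully. The difference works out to roughly 28 yuan missing from the total.)

Metric 3
Average spend per purchase = total spending / total number of purchases, which can be computed directly.

    > pct<-totalmoney / consumeNumber
    > pct
    [1] 56.43391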
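The three metrics can be gathered into one helper. The sketch below is one way to do it under the same assumptions as the text (a month counted as 30 days, one visit per date and card). The function name computeKpis and the toy data are hypothetical.

```r
# Hypothetical helper: expects columns time (Date), cardNo and actualMoney.
computeKpis <- function(sales) {
  # One visit per unique (time, cardNo) pair.
  visits <- sales[!duplicated(sales[, c("time", "cardNo")]), ]
  consumeNumber <- nrow(visits)

  # Span of the data in 30-day months.
  # Note: this is 0 for spans under 30 days, which makes the ratios infinite.
  days  <- as.numeric(max(sales$time) - min(sales$time))
  month <- days %/% 30

  totalMoney <- sum(sales$actualMoney, na.rm = TRUE)

  list(
    monthlyVisits = consumeNumber / month,
    monthlyMoney  = totalMoney / month,
    perVisit      = totalMoney / consumeNumber
  )
}

# Toy data (made-up values): 60 days, 3 visits, 120 in total.
toy <- data.frame(
  time        = as.Date(c("2016-01-01", "2016-01-01", "2016-02-15", "2016-03-01")),
  cardNo      = c("A", "A", "B", "A"),
  actualMoney = c(10, 20, 30, 60)
)

kpis <- computeKpis(toy)
print(kpis)
```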
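Metric 4
Spending trend: compute the spending for each week, using the tapply() function.

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE): X is usually a vector. INDEX is a factor or a list of one or more factors, each the same length as X. FUN is the function applied to each group; if FUN is NULL, tapply returns a vector of group indices instead of applying a function. simplify = TRUE means that if FUN always returns a scalar, tapply returns an array with the mode of that scalar.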

    > # compute the spending for each week
    > weekconsume<-tapply(excelData$actualMoney,format(excelData$time,"%Y-%U"),sum)
    # Take the actualMoney column, use format() to turn each date into a
    # year-week key, and sum within each key. tapply returns an array.
    > weekconsume
     2016-00  2016-01  2016-02  2016-03  2016-04  2016-05  2016-06  2016-07  2016-08
     1972.80  9679.64 10979.01  8719.73 15662.30 18758.82  3665.70  8441.51  8453.57
     2016-09  2016-10  2016-11  2016-12  2016-13  2016-14  2016-15  2016-16  2016-17
     9988.98  8500.78  9869.16 10135.23  8426.46 11400.66 14408.21 10385.33 10265.98
     2016-18  2016-19  2016-20  2016-21  2016-22  2016-23  2016-24  2016-25  2016-26
     9496.06  9728.40 11794.11 11497.20  9530.38 10806.71 11877.43 14077.38 10894.90
     2016-27  2016-28  2016-29
     8386.97 13372.67  3454.18

In the code above, %U is the week of the year as a decimal number (00-53), using the first Sunday as day 1 of week 1. Days before the first Sunday of the year fall in week 00. Thanks to 小熊貓's data analysis study notes for this point.
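The week numbering is easy to verify on a few dates. In the sketch below the amounts are made up. 1 January 2016 was a Friday, so the first Sunday of the year is 3 January.

```r
# Toy data (hypothetical amounts) around the first two Sundays of 2016.
toy <- data.frame(
  time = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03",
                   "2016-01-09", "2016-01-10")),
  actualMoney = c(1, 2, 3, 4, 5)
)

# Year-week key: days before the first Sunday belong to week 00.
weekKey <- format(toy$time, "%Y-%U")

# Sum the amounts within each week.
weekly <- tapply(toy$actualMoney, weekKey, sum)

print(weekKey)
print(weekly)
```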

Step 9: Output and display the results

Use the plot() function to draw the chart.

    > # store the data in a data frame
    > weekconsume<-as.data.frame.table(weekconsume)
    > weekconsume
          Var1     Freq
    1  2016-00  1972.80
    2  2016-01  9679.64
    3  2016-02 10979.01
    4  2016-03  8719.73
    5  2016-04 15662.30
    6  2016-05 18758.82
    7  2016-06  3665.70
    8  2016-07  8441.51
    9  2016-08  8453.57
    10 2016-09  9988.98
    11 2016-10  8500.78
    12 2016-11  9869.16
    13 2016-12 10135.23
    14 2016-13  8426.46
    15 2016-14 11400.66
    16 2016-15 14408.21
    17 2016-16 10385.33
    ......

    > # rename the columns
    > names(weekconsume)<-c("time","actualmoney")
    > weekconsume
          time actualmoney
    1  2016-00     1972.80
    2  2016-01     9679.64
    3  2016-02    10979.01
    4  2016-03     8719.73
    5  2016-04    15662.30
    6  2016-05    18758.82
    7  2016-06     3665.70
    8  2016-07     8441.51
    9  2016-08     8453.57
    10 2016-09     9988.98
    ......
    # convert to character
    weekconsume$time<-as.character(weekconsume$time)
    class(weekconsume$time)
    # number the time points 1, 2, 3, ...
    weekconsume$timenumber<-c(1:nrow(weekconsume))
    weekconsume$timenumber

    # draw the chart
    plot(weekconsume$timenumber,weekconsume$actualmoney,
         xlab ="Time (year-week)",
         ylab ="Spending amount",
         xaxt="n",
         main="Chaoyang Hospital spending curve, 2016",
         col="blue",
         type="b")
    # define the x-axis tick labels
    axis(1,at=weekconsume$timenumber,labels = weekconsume$time,cex.axis=1.5)
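To keep the chart as a file rather than only on screen, the plotting calls can be wrapped in a graphics device. The sketch below writes a PDF to a temporary file. It reuses the first three weekly totals from the output above, and the output location and the rotated labels (las = 2) are illustrative choices, not part of the original workflow.

```r
# First three weekly totals, copied from the output above.
weekconsume <- data.frame(
  time        = c("2016-00", "2016-01", "2016-02"),
  actualmoney = c(1972.80, 9679.64, 10979.01)
)
weekconsume$timenumber <- seq_len(nrow(weekconsume))

# Open a PDF device on a temporary file (illustrative location).
outFile <- tempfile(fileext = ".pdf")
pdf(outFile, width = 8, height = 5)

plot(weekconsume$timenumber, weekconsume$actualmoney,
     xlab = "", ylab = "Spending amount",
     xaxt = "n", col = "blue", type = "b")
# las = 2 rotates the tick labels so that many weeks fit on the axis.
axis(1, at = weekconsume$timenumber, labels = weekconsume$time,
     las = 2, cex.axis = 0.8)

# Close the device so the file is actually written.
invisible(dev.off())

file.exists(outFile)
```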

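Summary

By studying the first four chapters of R in Action, together with lecture 3 of 猴子's course and some notes from other students, I completed the whole analysis: reading the data, preprocessing it, analyzing it, and outputting and displaying the results. Along the way I used xlsx/openxlsx (the two interfere with each other, since both provide a read.xlsx() function), is.na(), stringr, order(), duplicated(), class() and plot(), and carried out renaming, splitting, type conversion, sorting, summation and plotting. I ran into errors and things I did not understand at many points. After going back over the material several times some of it is still unclear to me, so I need to keep studying and practicing.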