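A Hands-On Guide to Exploring Data Distributions with ggplot2
During data exploration you often need to understand how the data are distributed: where the lower and upper quartiles lie, which distribution the data follow, and so on. This article uses R's ggplot2 package to explore data distributions.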
Drawing histograms
The histogram is the most widely used distribution plot in data exploration, and the geom_histogram() function in ggplot2 makes it easy to draw one.
library(ggplot2)

set.seed(1234)
x <- rnorm(1000, mean = 2, sd = 3)

# Default histogram of 1000 normal draws
ggplot(data = NULL, mapping = aes(x = x)) +
  geom_histogram()
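By default the histogram cuts the data into 30 bins (bins = 30). If the default binning does not suit you, set either the bin width (binwidth = ) or the number of bins (bins = ) yourself. If you do not care for the default colors, you can also set the fill color and the border color of the bars.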
# Cut the data into 50 bins; steel-blue fill, black border
ggplot(data = NULL, mapping = aes(x = x)) +
  geom_histogram(bins = 50, fill = "steelblue", colour = "black")

# Set the bin width to one twentieth of the range of x
group_diff <- diff(range(x)) / 20
ggplot(data = NULL, mapping = aes(x = x)) +
  geom_histogram(binwidth = group_diff, fill = "steelblue", colour = "black")
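To see what the binwidth setting actually does, the binned data can be inspected without drawing anything. The following is a minimal sketch that assumes ggplot2's layer_data() helper, which returns the data a layer computes. The names p_width, bins_width, bin_widths and total_count are illustrative only and not part of the original article.

```r
library(ggplot2)

set.seed(1234)
x <- rnorm(1000, mean = 2, sd = 3)

# Histogram with a fixed bin width: one twentieth of the range of x
group_diff <- diff(range(x)) / 20
p_width <- ggplot(data = NULL, mapping = aes(x = x)) +
  geom_histogram(binwidth = group_diff)

# layer_data() returns one row per bin (columns include xmin, xmax, count)
bins_width <- layer_data(p_width)

# Width of every bin; each one should equal the requested binwidth
bin_widths <- bins_width$xmax - bins_width$xmin
range(bin_widths)

# Every observation falls into exactly one bin
total_count <- sum(bins_width$count)
total_count
```

Inspecting the computed bins this way is a quick check that a chosen binwidth or bins value behaves as intended before the colors are tuned.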
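Drawing grouped histograms
A grouped histogram needs a grouping variable, which can be a character variable or a numeric variable that has been converted to a factor. There are two common ways to draw one:
1) Map the grouping variable to a color aesthetic (here, fill)
2) Use the faceting feature of ggplot2
For an introduction to faceting, see the article "ggplot2作圖之分面操作" (faceting in ggplot2).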
# Map the grouping variable to the fill color
set.seed(1234)
x <- c(rnorm(500, mean = 1, sd = 2), rt(500, df = 10))
y <- rep(c(0, 1), times = c(500, 500))
df <- data.frame(x = x, y = y)

# Convert the numeric grouping variable to a factor
df$y <- factor(df$y)

# position = "identity": both groups are drawn from a common baseline and overlap
ggplot(data = df, mapping = aes(x = x, fill = y)) +
  geom_histogram(position = "identity", bins = 50, colour = "black")

# Default position ("stack"): the groups are stacked on top of each other
ggplot(data = df, mapping = aes(x = x, fill = y)) +
  geom_histogram(bins = 50, colour = "black")

# Use faceting: one panel per group
ggplot(data = df, mapping = aes(x = x)) +
  geom_histogram(bins = 50, fill = "steelblue", colour = "black") +
  facet_grid(. ~ y)
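Drawing kernel density curves
Besides the histogram, a kernel density curve also gives an estimate of how the data are distributed. Below, kernel density curves are drawn in two ways: with geom_density() and with geom_line(stat = "density").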
# Draw a kernel density curve with geom_density()
state <- as.data.frame(state.x77)
ggplot(data = state, mapping = aes(x = Income)) +
  geom_density()

# Draw a kernel density curve with geom_line()
ggplot(data = state, mapping = aes(x = Income)) +
  geom_line(stat = "density")

# To compare bandwidths, draw several density curves in one plot
# (linewidth replaces size for line widths in ggplot2 >= 3.4.0)
ggplot(data = state, mapping = aes(x = Income)) +
  geom_line(stat = "density", adjust = 0.5, colour = "red", linewidth = 2) +
  geom_line(stat = "density", adjust = 1, colour = "black", linewidth = 2) +
  geom_line(stat = "density", adjust = 2, colour = "steelblue", linewidth = 2)

# The same comparison with filled, semi-transparent densities
ggplot(data = state, mapping = aes(x = Income)) +
  geom_density(adjust = 0.5, fill = "red", alpha = .2) +
  geom_density(adjust = 1, fill = "black", alpha = .5) +
  geom_density(adjust = 2, fill = "steelblue", alpha = .4)
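The adjust argument used above is a multiplier on the default bandwidth. Base R's density() function has an argument of the same name, so the effect can be checked numerically. This is a small sketch under that assumption; the names income, d_half, d_one and d_two are illustrative.

```r
# Income column of the built-in state.x77 matrix, as used above
income <- as.data.frame(state.x77)$Income

# stats::density() is the base-R kernel density estimator;
# its adjust argument multiplies the default bandwidth
d_half <- density(income, adjust = 0.5)
d_one  <- density(income, adjust = 1)
d_two  <- density(income, adjust = 2)

# Bandwidths actually used by the three estimates
c(half = d_half$bw, default = d_one$bw, double = d_two$bw)

# Ratio of each bandwidth to the default one
c(half = d_half$bw, double = d_two$bw) / d_one$bw
```

A value below 1 narrows the kernel and yields a wigglier curve, while a value above 1 widens it and yields a smoother curve, which matches the red and steel-blue curves in the plots above.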
Density curves can likewise be drawn by group, in the same way as histograms. Two examples:
# Map the grouping variable to the line color
set.seed(1234)
x <- c(rnorm(500), rnorm(500, 2, 3), rnorm(500, 0, 5))
y <- rep(c("A", "B", "C"), each = 500)
df <- data.frame(x = x, y = y)
# linewidth replaces size for line widths in ggplot2 >= 3.4.0
ggplot(data = df, mapping = aes(x = x, colour = y)) +
  geom_line(stat = "density", linewidth = 2)

# Use faceting: one panel per group
ggplot(data = df, mapping = aes(x = x, colour = y)) +
  geom_density(linewidth = 2) +
  facet_grid(. ~ y)
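Histograms and kernel density curves have been covered separately so far. Combining the two in one plot lets you compare the histogram of the observed data with a smoothed estimate of the underlying distribution.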
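# With counts on the y-axis the density curve is squashed near zero
# (linewidth replaces size for line widths in ggplot2 >= 3.4.0)
ggplot(data = df, mapping = aes(x = x)) +
  geom_histogram(bins = 50, fill = "blue", colour = "black") +
  geom_density(adjust = 0.5, colour = "black", linewidth = 2) +
  facet_grid(. ~ y)

# Put the histogram on the density scale so both layers are comparable
# (after_stat(density) is the current spelling of ..density.., ggplot2 >= 3.3.0)
ggplot(data = df, mapping = aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 fill = "blue", colour = "black") +
  geom_density(adjust = 0.5, colour = "red") +
  facet_grid(. ~ y)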
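The reason the second version works is that a histogram on the density scale has a total bar area of 1, just like a density curve. The sketch below checks this on simulated data; it assumes ggplot2 >= 3.3.0 (after_stat(density) is the current spelling of ..density..) and uses layer_data() to read the computed bins. The names p, h, bar_width, manual_density and total_area are illustrative.

```r
library(ggplot2)

set.seed(1234)
x <- rnorm(500, mean = 2, sd = 3)

# Histogram on the density scale instead of the count scale
p <- ggplot(data = NULL, mapping = aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50)

# One row per bin, with both count and density columns
h <- layer_data(p)
bar_width <- h$xmax - h$xmin

# Density of a bin = count / (number of observations * bin width)
manual_density <- h$count / (sum(h$count) * bar_width)

# Total area of all bars on the density scale
total_area <- sum(h$density * bar_width)
total_area
```

On the count scale the bar heights grow with the sample size, so a density curve drawn on top of them stays close to the x-axis.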
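Drawing box plots
Box plots are another staple of data exploration and are drawn with geom_boxplot() in ggplot2. They are typically used with a grouping variable, so that comparing the boxes reveals differences between groups.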
# Basic grouped box plot
ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Width)) +
  geom_boxplot(fill = "steelblue")

# Customize the outlier color and shape and the box width
ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Width)) +
  geom_boxplot(fill = "steelblue", outlier.colour = "red",
               outlier.shape = 15, width = 1.2)
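To compare the medians of the groups, you can add notches to the box plot by setting the notch argument of geom_boxplot() to TRUE:
ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Width)) +
  geom_boxplot(notch = TRUE, fill = "steelblue", outlier.colour = "red",
               outlier.shape = 15, width = 1.2)
If the notches of two boxes do not overlap, this suggests that the medians of those groups differ.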
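The ggplot2 documentation states that the notches extend 1.58 * IQR / sqrt(n) on either side of the median. The following base-R sketch recomputes these intervals for the iris example and tests one pair of groups for overlap. The helper overlaps() and the table notch_table are hypothetical names introduced for illustration.

```r
# Notch interval per species: median +/- 1.58 * IQR / sqrt(n)
notch_table <- do.call(rbind, lapply(
  split(iris$Sepal.Width, iris$Species),
  function(v) {
    half <- 1.58 * IQR(v) / sqrt(length(v))
    data.frame(median = median(v),
               lower  = median(v) - half,
               upper  = median(v) + half)
  }
))
notch_table

# Two notch intervals overlap unless one lies entirely above the other
overlaps <- function(a, b) {
  notch_table[a, "lower"] <= notch_table[b, "upper"] &&
    notch_table[b, "lower"] <= notch_table[a, "upper"]
}

# Compare the notches of two species
overlaps("setosa", "versicolor")
```

This is a rough visual aid for comparing medians rather than a formal test, so a borderline overlap is better followed up with a proper statistical comparison.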
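How do you draw a box plot without groups? Two points matter here:
1) x must be given a constant value; leaving it unset raises an error in older ggplot2 versions
2) Remove the tick marks and labels from the x-axis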
# Single box plot: x is set to an arbitrary constant
ggplot(data = iris, mapping = aes(x = "Test", y = Sepal.Width)) +
  geom_boxplot(fill = "steelblue", outlier.colour = "red",
               outlier.shape = 15, width = 1.2)

# Remove the tick marks and labels from the x-axis
ggplot(data = iris, mapping = aes(x = "Test", y = Sepal.Width)) +
  geom_boxplot(fill = "steelblue", outlier.colour = "red",
               outlier.shape = 15, width = 1.2) +
  theme(axis.title.x = element_blank()) +
  scale_x_discrete(breaks = NULL)

# Add the mean as an orange diamond
# (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
ggplot(data = iris, mapping = aes(x = "Test", y = Sepal.Width)) +
  geom_boxplot(fill = "steelblue", outlier.colour = "red",
               outlier.shape = 15, width = 1.2) +
  theme(axis.title.x = element_blank()) +
  scale_x_discrete(breaks = NULL) +
  stat_summary(fun = "mean", geom = "point", shape = 18,
               colour = "orange", size = 5)
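Drawing violin plots
The book R in Action (《R語言實戰》) mentions violin plots and predicts that they will gain in importance. ggplot2 supports them as well: geom_violin() draws a violin plot with little effort.
A violin plot is essentially a kernel density estimate too, used to compare the distributions of several groups. With the density curves shown earlier, several colored curves easily interfere with one another, whereas violins sit side by side, which makes the group distributions easier to compare.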
# Draw a plain violin plot
ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Width)) +
  geom_violin()

# Draw a violin plot with a box plot overlaid
# (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Width)) +
  geom_violin() +
  geom_boxplot(width = 0.3, outlier.colour = NA, fill = "blue") +
  stat_summary(fun = "median", geom = "point", shape = 18, colour = "orange")

# Groups of unequal size; scale = "count" scales each violin's area
# in proportion to the number of observations in its group
set.seed(1234)
x <- rep(c("A", "B", "C"), times = c(100, 300, 200))
y <- c(rnorm(100), rnorm(300, 1, 2), rnorm(200, 2, 3))
df <- data.frame(x = x, y = y)
ggplot(data = df, mapping = aes(x = x, y = y)) +
  geom_violin(scale = "count") +
  geom_boxplot(width = 0.1, outlier.colour = NA, fill = "blue") +
  stat_summary(fun = "median", geom = "point", shape = 18, colour = "orange")
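Drawing two-dimensional density plots
Everything so far (histograms, kernel density curves, box plots and violin plots) dealt with one-dimensional data. Next we use ggplot2 to explore the distribution of two-dimensional data, which is commonly done with density plots.
The stat_density2d() function performs two-dimensional kernel density estimation. The examples below show how to use it: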
library(C50)
data(churn)

# Draw a scatter plot with density contour lines
ggplot(data = churnTrain, mapping = aes(x = total_day_minutes, y = total_eve_calls)) +
  geom_point() +
  stat_density2d()

# Map the height of the density surface to the contour color
# (after_stat(level) is the current spelling of ..level.., ggplot2 >= 3.3.0)
ggplot(data = churnTrain, mapping = aes(x = total_day_minutes, y = total_eve_calls)) +
  stat_density2d(aes(colour = after_stat(level))) +
  scale_color_gradient(low = "lightblue", high = "darkred")

# Map the density estimate to the fill color
ggplot(data = churnTrain, mapping = aes(x = total_day_minutes, y = total_eve_calls)) +
  stat_density2d(aes(fill = after_stat(density)), geom = "tile", contour = FALSE)

# Map the density estimate to transparency
ggplot(data = churnTrain, mapping = aes(x = total_day_minutes, y = total_eve_calls)) +
  stat_density2d(aes(alpha = after_stat(density)), geom = "tile", contour = FALSE)

# Set the bandwidths in the x and y directions by hand with h
ggplot(data = iris, mapping = aes(x = Petal.Length, y = Petal.Width)) +
  stat_density2d(aes(alpha = after_stat(density)), geom = "tile",
                 contour = FALSE, h = c(0.1, 0.2))
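According to the ggplot2 documentation, the two-dimensional density is computed with MASS::kde2d(), and when h is not given the bandwidths are estimated with MASS::bandwidth.nrd(). The sketch below calls these functions directly on the iris data to compare the default bandwidths with the hand-picked h = c(0.1, 0.2) from the last example. The grid size n = 50 and all variable names are assumptions made for illustration.

```r
library(MASS)  # ships with standard R installations

# Default bandwidths (normal reference rule) and the hand-picked ones
h_default <- c(bandwidth.nrd(iris$Petal.Length),
               bandwidth.nrd(iris$Petal.Width))
h_manual  <- c(0.1, 0.2)
h_default

# Evaluate the 2D kernel density estimate on a 50 x 50 grid
est_default <- kde2d(iris$Petal.Length, iris$Petal.Width,
                     h = h_default, n = 50)
est_manual  <- kde2d(iris$Petal.Length, iris$Petal.Width,
                     h = h_manual, n = 50)

# z holds the estimated density at each grid point
dim(est_manual$z)

# Peak heights of the two surfaces
c(default = max(est_default$z), manual = max(est_manual$z))
```

A much smaller bandwidth concentrates the estimate around the individual observations, which typically gives a spikier surface with higher peaks, so hand-picked values are worth comparing against the defaults before they are trusted.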
References
R in Action (R語言實戰)
ggplot2: Elegant Graphics for Data Analysis (R語言_ggplot2:數據分析與圖形藝術)
R Graphics Cookbook (R數據可視化手冊)
----------------------------------------------
Author: 劉順祥
Source: 劉順祥's blog
WeChat public account: 每天進步一點點2015