Learning seaborn visualization: distribution visualization

02-02

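Descriptive statistics are indispensable in data analysis and data mining. For example, we need to look at how each quantitative variable is distributed, and a good distribution visualization lays the groundwork for the modeling that follows. Using a dataset from 科賽網 (Kesci), this post explains how to do distribution visualization with seaborn, a powerful plotting library. The detailed code is available here.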

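Distribution visualization for quantitative variables has two main goals. The first is to explore how a single variable is distributed, which is univariate distribution visualization. The second is to explore whether two variables are related in how they are distributed, which is bivariate distribution visualization. seaborn organizes its plot functions along the same workflow.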

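univariate distributions visualization:

distplot --- plot the distribution of a single variable
kdeplot --- fit the kernel density estimate of a distribution (for one variable, or jointly for two variables)
rugplot --- draw each data point as a small stick along the axis

bivariate distributions visualization:

jointplot --- plot the joint distribution of two variables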

The dataset

To avoid bugs when decoding Chinese text, the table headers were replaced with English names:

# Import the plotting packages
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
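The header replacement itself is not shown in this post, so here is a minimal sketch of how it might look with pandas. The Chinese header names and the two data rows are hypothetical; only the English names `Dist`, `Area`, `Price`, and `Tprice` come from the post.

```python
import pandas as pd

# Tiny stand-in table. The Chinese headers and the values are hypothetical,
# since the post does not show the original columns or how the file is loaded.
raw = pd.DataFrame({
    "行政區": ["Xuhui", "Xuhui"],
    "面積": [60.0, 95.5],
    "單價": [7.5, 8.0],
    "總價": [450.0, 764.0],
})

# Map each original header to the ASCII name used in the rest of the post.
header_map = {"行政區": "Dist", "面積": "Area", "單價": "Price", "總價": "Tprice"}
sh = raw.rename(columns=header_map)
print(sh.columns.tolist())
```

With ASCII column names, attribute access such as `sh.Tprice` works and plot labels do not depend on a font that can render Chinese.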

A first look at single-variable visualization

The quantitative variables in this dataset are the house area (Area), the price per square meter (Price), and the total house price (Tprice).

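Let's first look at how the total price Tprice is distributed in each administrative district of Shanghai, using distplot. Note that by default distplot also draws the fitted kernel density estimate curve on top of the histogram.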

dist = sh.Dist.unique()
plt.figure(1, figsize=(16, 30))
with sns.axes_style("ticks"):
    # One subplot per district, laid out on a 6x3 grid
    for i in range(17):
        temp = sh[sh.Dist == dist[i]]
        plt.subplot(6, 3, i + 1)
        plt.title(dist[i])
        sns.distplot(temp.Tprice)
        plt.xlabel(' ')
plt.show()
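To make the kernel density curve less of a black box, the sketch below computes a one-dimensional Gaussian KDE by hand and compares it with `scipy.stats.gaussian_kde`. It illustrates the general technique on synthetic data; the helper name `gaussian_kde_1d` is made up, and the sketch does not claim to reproduce seaborn's exact bandwidth choice.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed sample; NOT the Tprice column from the post.
rng = np.random.default_rng(1)
sample = rng.gamma(shape=3.0, scale=100.0, size=400)

def gaussian_kde_1d(data, grid):
    """Evaluate a Gaussian KDE of `data` at `grid`, using Scott's rule for the bandwidth."""
    n = data.size
    h = data.std(ddof=1) * n ** (-1.0 / 5.0)  # Scott's rule in one dimension
    # Scaled distance from every grid point to every observation
    z = (grid[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth so the curve integrates to 1
    return kernels.sum(axis=1) / (n * h)

grid = np.linspace(sample.min(), sample.max(), 200)
density = gaussian_kde_1d(sample, grid)
reference = stats.gaussian_kde(sample)(grid)  # SciPy also defaults to Scott's rule
print(np.allclose(density, reference))
```

Each observation contributes one small bell curve, and the KDE is their average. A smaller bandwidth `h` gives a bumpier curve, and a larger one gives a smoother curve.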

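We can also turn off the kernel density curve and look at the histogram alone. The distplot API exposes two parameters for this, kde and rug, which correspond to the kernel density estimate and the rugplot (ticks drawn on the axis at the position of each data point). Below we take the data for Xuhui district on its own and set both parameters; the resulting histogram follows.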

temp = sh[sh.Dist == 'Xuhui']
plt.figure(1, figsize=(6, 6))
plt.title('Xuhui')
sns.distplot(temp.Tprice, kde=False, bins=20, rug=True)
plt.xlabel(' ')
plt.show()

seaborn also lets us call kdeplot and rugplot directly. Let's now study the distribution of the house area variable Area in the Xuhui data.

from scipy import stats
plt.figure(1, figsize=(12, 6))
with sns.axes_style("ticks"):
    plt.subplot(1, 2, 1)
    sns.kdeplot(temp.Area, shade=True)
    sns.rugplot(temp.Area)
    plt.title('Xuhui --- Area Distribution')
plt.subplot(1, 2, 2)
plt.title('Xuhui - Area Distribution fits with gamma distribution')
sns.distplot(temp.Area, kde=False, fit=stats.gamma)
plt.show()

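The left panel above overlays a kdeplot call and a rugplot call, which shows how flexibly seaborn plots can be combined. The right panel sets the fit parameter of distplot so that a gamma distribution is fitted to the data.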
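What the fit parameter does can be reproduced directly with SciPy. To my understanding, distplot calls the `fit` method of the object it is given and then draws the resulting density; the sketch below performs the same two steps on synthetic data. The sample and all variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for an area column; NOT the Xuhui data.
rng = np.random.default_rng(2)
area = rng.gamma(shape=3.0, scale=30.0, size=1000)

# Maximum-likelihood fit; gamma.fit returns the tuple (shape, loc, scale).
params = stats.gamma.fit(area)
shape, loc, scale = params

# Evaluate the fitted density on a grid, ready to overlay on a density-scaled histogram.
grid = np.linspace(area.min(), area.max(), 200)
fitted_pdf = stats.gamma.pdf(grid, *params)
print(f"shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}")
```

Having the fitted parameters in hand also makes it possible to judge the fit numerically, for example with `stats.kstest`, instead of relying on the picture alone.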

Visualizing two variables (pairs)

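Having visualized the distributions of single quantitative variables, let's see whether a pair of variables is related in how they are distributed. seaborn provides the jointplot function for this. Below we visualize house area (Area) against total price (Tprice) for the whole dataset.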

Drawing a scatterplot

sns.jointplot(x='Area', y='Tprice', data=sh)
plt.show()

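The data points are densely concentrated where the total price is below 1000W (Tprice < 1000, with Tprice recorded in units of 10,000) and the area is below 200 square meters. We set a filter to pull this subset out for separate study and redraw the scatterplot.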

test = sh[(sh.Tprice < 1000) & (sh.Area < 200)]
with sns.axes_style("white"):
    sns.jointplot(x='Area', y='Tprice', data=test)
plt.show()
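The filter is a boolean mask built from two comparisons. The sketch below shows the same pattern on four made-up rows (the frame `sh_demo` and its values are hypothetical) and checks it against the equivalent `DataFrame.query` call.

```python
import pandas as pd

# Made-up rows purely for illustration; NOT taken from the dataset.
sh_demo = pd.DataFrame({
    "Tprice": [350.0, 1200.0, 800.0, 999.0],
    "Area": [60.0, 150.0, 230.0, 199.0],
})

# Each comparison needs its own parentheses because & binds tighter than <.
mask = (sh_demo.Tprice < 1000) & (sh_demo.Area < 200)
subset = sh_demo[mask]

# DataFrame.query expresses the same filter as a string.
subset_q = sh_demo.query("Tprice < 1000 and Area < 200")
print(subset.index.tolist())  # → [0, 3]
```

Only the rows that satisfy both conditions survive, so the redrawn plot zooms in on the dense region without changing the underlying data.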

When the dataset is large, a hexbin plot goes one step further and shows the regions where the data concentrate, as in the plot below.

with sns.axes_style("white"):
    sns.jointplot(x='Tprice', y='Area', data=test, kind='hex')
plt.show()
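A hexbin plot is essentially a two-dimensional histogram with hexagonal cells. The sketch below calls matplotlib's `Axes.hexbin` directly on synthetic data to expose the per-cell counts that the colours encode. The data, the `gridsize` value, and the colour map are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated stand-in data; NOT the housing dataset.
rng = np.random.default_rng(3)
n = 5000
x = rng.normal(100.0, 30.0, size=n)
y = 4.0 * x + rng.normal(0.0, 80.0, size=n)

fig, ax = plt.subplots(figsize=(6, 6))
# Every point is assigned to one hexagonal cell; the cell colour encodes its count.
hb = ax.hexbin(x, y, gridsize=25, cmap="Blues")
fig.colorbar(hb, ax=ax, label="points per cell")

counts = hb.get_array()  # one count per hexagon
print(int(counts.sum()), int(counts.max()))
```

Because overlapping points are summed into counts instead of being drawn on top of each other, dense regions stay readable even with many thousands of observations.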

We can also use kernel density estimation to visualize the joint distribution.

with sns.axes_style("white"):
    sns.jointplot(x='Area', y='Tprice', data=test, kind='kde')
plt.show()

Summary

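The clever thing about seaborn is that it visualizes as much as possible with as little code as possible, and its API is very flexible: if you can think of a plot, you can probably draw it. This short post focused on how to do distribution visualization with seaborn, so it does not explore or interpret the dataset itself in much depth. To explore the dataset further, head over to 科賽網 (Kesci). Later posts will continue the series on learning visualization with seaborn.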