10-Minute Python Charting | Getting Started with seaborn (1): distplot and kdeplot

01-25

Introduction to Seaborn

Official site: Seaborn: statistical data visualization

Seaborn is a Python visualization library built on matplotlib. It provides a high-level interface that makes it easy to draw attractive statistical charts.

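Seaborn is essentially a higher-level API wrapped around matplotlib that makes plotting easier. In most cases seaborn alone is enough to produce attractive figures, while matplotlib lets you build more customized ones. Seaborn should be seen as a complement to matplotlib, not a replacement. It also works well with numpy and pandas data structures and with the statistical routines in scipy and statsmodels. Learning seaborn goes a long way toward exploring data and charts more efficiently and understanding them more deeply.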

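Its main features are:

Built-in themes that build on matplotlib's default aesthetics
Color palette tools for making colorful plots that reveal patterns in your data
Functions for plotting univariate and bivariate distributions and comparing them across subsets of data
Visualization of matrix data, with clustering algorithms to reveal structure
Flexible handling of time series data
Grids for building complex multi-plot figures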

Installing seaborn

1. Install with pip

pip install seaborn

2. In an Anaconda environment, open the Anaconda Prompt and run

conda install seaborn

distplot

seaborn.distplot - seaborn 0.7.1 documentation

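seaborn's distplot() combines matplotlib's hist() with the kernel density estimate drawn by kdeplot(). It also adds a rugplot that marks individual observations, and it can fit a parametric distribution to the data through the fit method of a scipy distribution. Its signature is: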

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)

Parameters:

a : Series, 1d-array, or list.

Observed data. If this is a Series object with a name attribute, the name will be used to label the data axis.

bins : argument for matplotlib hist(), or None, optional # sets the number of bins

Specification of hist bins, or None to use Freedman-Diaconis rule.

hist : bool, optional # whether to draw the histogram

Whether to plot a (normed) histogram.

kde : bool, optional # whether to draw the kernel density estimate

Whether to plot a gaussian kernel density estimate.

rug : bool, optional # whether to draw a small tick for each observation (the rug)

Whether to draw a rugplot on the support axis.

fit : random variable object, optional # the parametric distribution to fit and draw

An object with fit method, returning a tuple that can be passed to a pdf method as positional arguments following a grid of values to evaluate the pdf on.

{hist, kde, rug, fit}_kws : dictionaries, optional

Keyword arguments for underlying plotting functions.

vertical : bool, optional # flips the plot orientation

If True, observed values are on y-axis.
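
The parameter list mentions the Freedman-Diaconis rule as the default way of choosing bins. As a rough illustration of that rule (not necessarily seaborn's exact internal code), the sketch below computes the suggested bin count from the interquartile range. The function name freedman_diaconis_bins, the random seed, and the sample are assumptions made for this example.

```python
import numpy as np


def freedman_diaconis_bins(a):
    """Number of histogram bins suggested by the Freedman-Diaconis rule."""
    a = np.asarray(a, dtype=float)
    q75, q25 = np.percentile(a, [75, 25])
    iqr = q75 - q25
    # Bin width: h = 2 * IQR / n^(1/3)
    h = 2 * iqr / len(a) ** (1 / 3)
    if h == 0:
        # Degenerate case (e.g. constant data): fall back to sqrt(n) bins
        return int(np.sqrt(len(a)))
    # Enough bins of width h to cover the full data range
    return int(np.ceil((a.max() - a.min()) / h))


rng = np.random.RandomState(0)      # illustrative seed
sample = rng.normal(size=100)       # illustrative data
n_bins = freedman_diaconis_bins(sample)
print(n_bins)
```

Because the width depends on the interquartile range rather than the standard deviation, a few extreme values have little effect on the suggested number of bins.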

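Histograms

A histogram, also known as a quality distribution chart, is a statistical chart that shows how data are distributed using a series of vertical bars of varying height. The horizontal axis usually represents the value ranges (bins) of the data, and the vertical axis shows how many observations fall into each range.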

%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt  # matplotlib plotting interface
import seaborn as sns
sns.set(color_codes=True)  # apply seaborn defaults and enable its short color codes
np.random.seed(sum(map(ord, "distributions")))

x = np.random.normal(size=100)
sns.distplot(x, kde=False, rug=True);  # kde=False turns off the density curve; rug=True draws a small tick on the x-axis at each observation

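When drawing a histogram, the most important choices are how many bins to use and where to place them. The bins parameter makes it easy to set the number of bins, as shown here: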

sns.distplot(x, bins=20, kde=False, rug=True);  # use 20 bins
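
Since distplot() hands bins to matplotlib's hist(), the value can be either an integer (how many bins) or a sequence of edges (where the bins sit). The sketch below uses np.histogram as a stand-in to show the difference in numbers rather than in a picture. The seed and the edge values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # illustrative seed
x = rng.normal(size=100)

# An integer asks for that many equal-width bins spanning min(x)..max(x)
counts, edges = np.histogram(x, bins=20)

# A sequence gives the bin edges explicitly, which fixes where the bars sit
fixed_edges = np.arange(-4.0, 4.5, 0.5)   # -4.0, -3.5, ..., 4.0
fixed_counts, _ = np.histogram(x, bins=fixed_edges)

print(len(counts), len(edges))    # number of bins and number of edges
print(counts.sum(), fixed_counts.sum())
```

Explicit edges are handy when several histograms need to share the same bins so their bars can be compared directly.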

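Kernel density estimation

Kernel density estimation is used in probability theory to estimate an unknown density function, and it is a nonparametric method. Because it uses no prior knowledge about the distribution of the data and imposes no assumptions on it, it studies the characteristics of the distribution starting from the sample itself. For that reason it is highly valued in both statistical theory and applications.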
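
To make the idea concrete, here is a minimal hand-rolled Gaussian kernel density estimate: one Gaussian bump is placed on each observation and the bumps are averaged. This is a sketch for illustration, not the code seaborn runs. The function name gaussian_kde_1d, the use of Scott's rule of thumb for the bandwidth, the seed, and the grid are all assumptions.

```python
import numpy as np


def gaussian_kde_1d(data, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `data` on `grid`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = data.size
    if bandwidth is None:
        # Scott's rule of thumb for 1-D data: h = sigma * n^(-1/5)
        bandwidth = data.std(ddof=1) * n ** (-1 / 5)
    # One standard-normal bump per observation, evaluated at every grid point
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the curve integrates to 1
    return bumps.sum(axis=1) / (n * bandwidth)


rng = np.random.RandomState(0)            # illustrative seed
sample = rng.normal(size=100)
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(sample, grid)

# Trapezoidal integration as a sanity check: the area should be close to 1
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```

A smaller bandwidth follows the sample more closely and gives a bumpier curve, while a larger one smooths more aggressively.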

seaborn.kdeplot - seaborn 0.7.1 documentation

⒈distplot()

sns.distplot(x, hist=False, rug=True);  # turn off the histogram and turn on the rug ticks

⒉kdeplot()

sns.kdeplot(x, shade=True);  # shade=True fills the area under the curve

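Fitting parametric distributions

You can use distplot() to fit a parametric distribution to the data and see how closely the fitted curve matches the observations. The fit parameter chooses which distribution to fit.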

x = np.random.gamma(6, size=200)  # draw data from a gamma distribution
sns.distplot(x, kde=False, fit=stats.gamma);  # fit a gamma distribution and overlay it
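
The parameter documentation describes fit as an object whose fit method returns a tuple that is then passed to its pdf method. The sketch below performs those two steps by hand with scipy.stats.gamma to show what the overlaid curve represents. The seed and the grid are assumptions for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)  # illustrative seed
x = rng.gamma(6, size=200)

# Maximum-likelihood fit; for gamma the tuple is (shape, loc, scale)
params = stats.gamma.fit(x)
shape, loc, scale = params

# Evaluate the fitted density on a grid, passing the tuple positionally
grid = np.linspace(x.min(), x.max(), 200)
pdf = stats.gamma.pdf(grid, *params)

print(params)
```

Any scipy.stats continuous distribution with fit and pdf methods, such as stats.norm, can be used in the same way.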

Example Ⅰ for practice

Python source code:[download source: distplot_options.py]

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", palette="muted", color_codes=True)
rs = np.random.RandomState(10)
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
sns.despine(left=True)
# Generate a random univariate dataset
d = rs.normal(size=100)
# Plot a simple histogram with binsize determined automatically
sns.distplot(d, kde=False, color="b", ax=axes[0, 0])
# Plot a kernel density estimate and rug plot
sns.distplot(d, hist=False, rug=True, color="r", ax=axes[0, 1])
# Plot a filled kernel density estimate
sns.distplot(d, hist=False, color="g", kde_kws={"shade": True}, ax=axes[1, 0])
# Plot a histogram and kernel density estimate
sns.distplot(d, color="m", ax=axes[1, 1])
plt.setp(axes, yticks=[])
plt.tight_layout()

Example Ⅱ for practice

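Use kdeplot to explore the relationship between students' spending habits and whether they receive financial aid at a university. The dataset is shown below:

There are 10,798 records in total.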

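import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="white", palette="muted", color_codes=True)
train = pd.read_csv("train.csv")  # load the dataset
# "金額" is the amount spent; "是否得到助學金" is 1 if the student received aid, 0 otherwise
# ax1 plots the distribution for students who received aid, ax2 for those who did not
ax1 = sns.kdeplot(train["金額"][train["是否得到助學金"] == 1], color="r")
ax2 = sns.kdeplot(train["金額"][train["是否得到助學金"] == 0], color="b")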

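The plot shows the blue curve (students without aid) sitting further to the right and the red curve (students with aid) further to the left. Since the x-axis is the amount spent, students who receive aid spend less day to day than those who do not. This is consistent with financial aid going, at least to some extent, toward helping lower-income students improve their living conditions.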
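
A visual comparison like this can be backed up with a few summary numbers per group. The sketch below shows one way to do that with pandas. Because train.csv is not included here, it builds a small synthetic stand-in that reuses the two column names from the code above. The values are invented purely so the code runs and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # illustrative seed

# Synthetic stand-in for train.csv (hypothetical values, not the real data)
train = pd.DataFrame({
    "是否得到助學金": np.repeat([1, 0], 500),
    "金額": np.concatenate([
        rng.gamma(4.0, 100.0, size=500),  # stand-in for aid recipients
        rng.gamma(4.0, 130.0, size=500),  # stand-in for non-recipients
    ]),
})

# Select each group with a boolean mask through .loc
got_aid = train.loc[train["是否得到助學金"] == 1, "金額"]
no_aid = train.loc[train["是否得到助學金"] == 0, "金額"]

# Group-wise summary statistics to compare against the density curves
summary = train.groupby("是否得到助學金")["金額"].describe()
print(summary[["count", "mean", "50%"]])
```

With the real data, comparing the mean and median of the two groups gives a quick check on whether the shift seen in the density curves is substantial.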

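This is my first article. If you found it useful, please follow along and send feedback, which I will gladly take on board. If you need the dataset to practice with, feel free to message me privately.

Thank you all for reading.

This article is also published on Jizhi (http://jizhi.im) as "Seaborn Basics 01: distplot and kdeplot", where the original code can be run and edited online for practice.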