Basic Data Analysis Workflow

03-17

NumPy

Importing NumPy

import numpy as np

The N-dimensional array object: ndarray

In [5]: a = np.array([[0,1,2,3,4],[9,8,7,6,5]])

In [6]: a
Out[6]:
array([[0, 1, 2, 3, 4],
       [9, 8, 7, 6, 5]])

In [7]: a.shape
Out[7]: (2, 5)

In [8]: a.size
Out[8]: 10

In [9]: a.dtype
Out[9]: dtype('int64')

In [10]: a.ndim
Out[10]: 2

Ways to create an ndarray

1. Create an ndarray from Python lists, tuples, and similar types

In [11]: x = np.array([0,1,2,3])  # from a list

In [12]: x
Out[12]: array([0, 1, 2, 3])

In [13]: y = np.array((4,5,6,7))  # from a tuple

In [14]: y
Out[14]: array([4, 5, 6, 7])

In [15]: z = np.array([[1,2],[8,9],(0.2,0.4)])  # from a mix of lists and tuples

In [16]: z
Out[16]:
array([[ 1. ,  2. ],
       [ 8. ,  9. ],
       [ 0.2,  0.4]])

2. Create an ndarray with NumPy functions such as arange, ones, and zeros

In [17]: np.arange(10)
Out[17]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [18]: np.ones((3,4))
Out[18]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

In [19]: np.zeros((2,3),dtype=np.int32)
Out[19]:
array([[0, 0, 0],
       [0, 0, 0]], dtype=int32)

In [20]: np.eye(4)
Out[20]:
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

In [21]: x = np.ones((2,3,4))

In [23]: x
Out[23]:
array([[[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]],

       [[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]]])

In [24]: x.shape
Out[24]: (2, 3, 4)

Changing the dimensions of an ndarray

In [28]: a = np.ones((2,3,4),dtype=np.int32)

In [29]: a
Out[29]:
array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int32)

In [30]: a.reshape((3,8))  # keeps the elements, returns an array with the given shape; the original array is unchanged
Out[30]:
array([[1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)

In [31]: a.resize((3,8))  # same effect as .reshape(), but modifies the original array

In [32]: a
Out[32]:
array([[1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)

In [33]: a.flatten()  # reduces the dimensions and returns a flattened 1-D array; the original array is unchanged
Out[33]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
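To make the difference between the three methods easier to check, here is a minimal sketch on a small array. The variable names are illustrative only and do not come from the notes above.

```python
import numpy as np

a = np.arange(6)                  # 1-D array with 6 elements

reshaped = a.reshape((2, 3))      # returns an array with the new shape
shape_of_a = a.shape              # a itself still has shape (6,)

flat = reshaped.flatten()         # returns a 1-D copy
flat[0] = 99                      # changing the copy does not touch reshaped

c = np.arange(6)
result = c.resize((3, 2))         # returns None and changes c itself

print(shape_of_a, reshaped.shape, c.shape, result)
```

Because .resize() works in place and returns None, writing `c = c.resize(...)` would discard the array.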

Changing the element type of an ndarray

General form: new_a = a.astype(new_type)

In [34]: a = np.ones((2,3,4),dtype=int)  # the np.int alias was removed in recent NumPy releases; use int

In [35]: a
Out[35]:
array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]])

In [36]: b = a.astype(float)  # likewise, use float instead of the removed np.float

In [37]: b
Out[37]:
array([[[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]],

       [[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]]])

Indexing and slicing arrays

Indexing and slicing a 1-D array works much like a Python list:

In [38]: a = np.array([9,8,7,6,5])

In [39]: a[2]
Out[39]: 7

In [40]: a[1:4:2]
Out[40]: array([8, 6])

Indexing a multi-dimensional array:

In [41]: a = np.arange(24).reshape((2,3,4))

In [42]: a
Out[42]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [43]: a[1,2,3]
Out[43]: 23

In [44]: a[0,1,2]
Out[44]: 6

In [45]: a[-1,-2,-3]
Out[45]: 17

Slicing a multi-dimensional array (a is the same array as in the indexing example):

In [46]: a[:,1,-3]
Out[46]: array([ 5, 17])

In [47]: a[:,:,::2]
Out[47]:
array([[[ 0,  2],
        [ 4,  6],
        [ 8, 10]],

       [[12, 14],
        [16, 18],
        [20, 22]]])

Operations between arrays and scalars

An operation between an array and a scalar is applied to every element of the array.

In [48]: a.mean()
Out[48]: 11.5

In [49]: b = a / a.mean()

In [50]: b
Out[50]:
array([[[ 0.        ,  0.08695652,  0.17391304,  0.26086957],
        [ 0.34782609,  0.43478261,  0.52173913,  0.60869565],
        [ 0.69565217,  0.7826087 ,  0.86956522,  0.95652174]],

       [[ 1.04347826,  1.13043478,  1.2173913 ,  1.30434783],
        [ 1.39130435,  1.47826087,  1.56521739,  1.65217391],
        [ 1.73913043,  1.82608696,  1.91304348,  2.        ]]])
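The same element-wise rule holds for the other arithmetic operators. The sketch below uses a small 2x3 array, chosen here only for illustration, so the results are easy to verify by hand.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

plus_ten = a + 10                  # adds 10 to every element
doubled = a * 2                    # multiplies every element by 2
centered = a - a.mean()            # subtracts the mean (2.5) from every element

print(plus_ten)
print(doubled)
print(centered)
```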

Examples of NumPy unary functions

In [53]: np.square(a)
Out[53]:
array([[[  0,   1,   4,   9],
        [ 16,  25,  36,  49],
        [ 64,  81, 100, 121]],

       [[144, 169, 196, 225],
        [256, 289, 324, 361],
        [400, 441, 484, 529]]])

In [54]: b = np.sqrt(a)

In [55]: b
Out[55]:
array([[[ 0.        ,  1.        ,  1.41421356,  1.73205081],
        [ 2.        ,  2.23606798,  2.44948974,  2.64575131],
        [ 2.82842712,  3.        ,  3.16227766,  3.31662479]],

       [[ 3.46410162,  3.60555128,  3.74165739,  3.87298335],
        [ 4.        ,  4.12310563,  4.24264069,  4.35889894],
        [ 4.47213595,  4.58257569,  4.69041576,  4.79583152]]])

Examples of NumPy binary functions

In [56]: np.maximum(a,b)
Out[56]:
array([[[  0.,   1.,   2.,   3.],
        [  4.,   5.,   6.,   7.],
        [  8.,   9.,  10.,  11.]],

       [[ 12.,  13.,  14.,  15.],
        [ 16.,  17.,  18.,  19.],
        [ 20.,  21.,  22.,  23.]]])

In [57]: a > b
Out[57]:
array([[[False, False,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True]],

       [[ True,  True,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True]]], dtype=bool)

Importing pandas

import pandas as pd
from pandas import Series, DataFrame

Series: a labeled one-dimensional array

In [58]: a = pd.Series([9,8,7,6])

In [59]: a
Out[59]:
0    9
1    8
2    7
3    6
dtype: int64

A Series consists of a set of data values and the index associated with them.

The index can be user-defined:

In [60]: b = pd.Series([9,8,7,6],index=['a','b','c','d'])

In [61]: b
Out[61]:
a    9
b    8
c    7
d    6
dtype: int64

A Series can be created from the following types:

- A Python list

The two examples above are both of this kind.

- A scalar value (the index cannot be omitted)

In [62]: s = pd.Series(25,index=['a','b','c'])

In [63]: s
Out[63]:
a    25
b    25
c    25
dtype: int64

- A Python dict

In [64]: d = pd.Series({'a':9,'b':8,'c':7})

In [65]: d
Out[65]:
a    9
b    8
c    7
dtype: int64

# index selects entries from the dict; labels missing from the dict become NaN
In [66]: e = pd.Series({'a':9,'b':8,'c':7},index=['c','a','b','d'])

In [67]: e
Out[67]:
c    7.0
a    9.0
b    8.0
d    NaN
dtype: float64

- An ndarray

In [68]: n = pd.Series(np.arange(5))

In [69]: n
Out[69]:
0    0
1    1
2    2
3    3
4    4
dtype: int64

In [71]: m = pd.Series(np.arange(5),index=np.arange(9,4,-1))

In [72]: m
Out[72]:
9    0
8    1
7    2
6    3
5    4
dtype: int64

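Basic Series operations

A Series has two parts: index and values.

Operations on a Series resemble those on an ndarray.

Operations on a Series also resemble those on a Python dict.

In [73]: b = pd.Series([9,8,7,6],['a','b','c','d'])

In [74]: b
Out[74]:
a    9
b    8
c    7
d    6
dtype: int64

In [75]: b.index  # get the index
Out[75]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [76]: b.values  # get the data
Out[76]: array([9, 8, 7, 6])

In [77]: b['b']  # use the custom index
Out[77]: 8

In [78]: b[1]  # use the automatic (positional) index; b.iloc[1] is the explicit form
Out[78]: 8

In [79]: b[['c','d',0]]  # the two index schemes cannot be mixed (recent pandas versions raise KeyError here instead of returning NaN)
Out[79]:
c    7.0
d    6.0
0    NaN
dtype: float64

In [80]: b[['c','d','a']]
Out[80]:
c    7
d    6
a    9
dtype: int64

# slicing works as in NumPy
In [81]: b[:3]
Out[81]:
a    9
b    8
c    7
dtype: int64

In [82]: b[b>b.median()]
Out[82]:
a    9
b    8
dtype: int64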
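As a small illustration of the dict-like and ndarray-like behavior described above, the sketch below uses the `in` operator, `.get()`, and the explicit `.loc` / `.iloc` indexers. Using `.loc` and `.iloc` is a suggestion added here to avoid mixing the two index schemes; it is not something the original session does.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# dict-like behavior: membership and .get() work on the index labels
has_c = 'c' in b                 # True, 'c' is an index label
missing = b.get('f', 100)        # 100, the default for a missing label

# explicit indexers keep the two index schemes apart
by_label = b.loc['b']            # custom (label) index
by_position = b.iloc[1]          # automatic (positional) index

# ndarray-like behavior: arithmetic applies to every element and keeps the labels
doubled = b * 2

print(has_c, missing, by_label, by_position)
print(doubled)
```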

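DataFrame

A DataFrame consists of a set of columns that share the same index. It is a tabular data type with both a row index and a column index. DataFrames are most often used for two-dimensional data, but they can also represent higher-dimensional data.

A DataFrame can be created from the following types:

- A two-dimensional ndarray

In [83]: d = pd.DataFrame(np.arange(10).reshape(2,5))

In [84]: d
Out[84]:
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9

- A dict of one-dimensional ndarrays, lists, dicts, tuples, or Series

In [85]: dt = {'one':pd.Series([1,2,3],index=['a','b','c']),
    ...:       'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}
    ...:

In [86]: d = pd.DataFrame(dt)

In [87]: d
Out[87]:
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6

In [88]: pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])
Out[88]:
   two three
b    8   NaN
c    7   NaN
d    6   NaN

The DataFrame type

A DataFrame is a two-dimensional labeled array.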

Operating on these data types

Adding or reordering: reindexing

.reindex() changes or reorders the index of a Series or DataFrame.

The column labels in this example are 城市 (city), 環比 (month-on-month), 同比 (year-on-year), and 定基 (fixed-base index). The outputs were recorded on an older pandas version that sorted the dict keys; recent versions keep the insertion order of the dict, so the initial column order may differ.

In [66]: dl = {'城市':['北京','上海','廣州','深圳','瀋陽'],
    ...:       '環比':[101.5, 101.2, 101.3, 102.0, 100.1],
    ...:       '同比':[120.7, 127.3, 119.4, 140.9, 101.4],
    ...:       '定基':[121.4, 127.8, 120.0, 145.5, 101.6]}

In [67]: d = pd.DataFrame(dl, index=['c1','c2','c3','c4','c5'])

In [68]: d
Out[68]:
       同比  城市     定基     環比
c1  120.7  北京  121.4  101.5
c2  127.3  上海  127.8  101.2
c3  119.4  廣州  120.0  101.3
c4  140.9  深圳  145.5  102.0
c5  101.4  瀋陽  101.6  100.1

In [75]: d = d.reindex(index=['c5','c4','c3','c2','c1'])

In [76]: d
Out[76]:
       同比  城市     定基     環比
c5  101.4  瀋陽  101.6  100.1
c4  140.9  深圳  145.5  102.0
c3  119.4  廣州  120.0  101.3
c2  127.3  上海  127.8  101.2
c1  120.7  北京  121.4  101.5

In [77]: d = d.reindex(columns=['城市','同比','環比','定基'])

In [78]: d
Out[78]:
    城市     同比     環比     定基
c5  瀋陽  101.4  100.1  101.6
c4  深圳  140.9  102.0  145.5
c3  廣州  119.4  101.3  120.0
c2  上海  127.3  101.2  127.8
c1  北京  120.7  101.5  121.4

In [80]: newc = d.columns.insert(4,'新增')  # 新增 = "new"

In [81]: newd = d.reindex(columns=newc, fill_value=200)

In [82]: newd
Out[82]:
    城市     同比     環比     定基   新增
c5  瀋陽  101.4  100.1  101.6  200
c4  深圳  140.9  102.0  145.5  200
c3  廣州  119.4  101.3  120.0  200
c2  上海  127.3  101.2  127.8  200
c1  北京  120.7  101.5  121.4  200

In [83]: d.index
Out[83]: Index(['c5', 'c4', 'c3', 'c2', 'c1'], dtype='object')

In [84]: d.columns
Out[84]: Index(['城市', '同比', '環比', '定基'], dtype='object')
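One detail the session above does not show is what .reindex() does with a label that does not exist yet. The following sketch uses a small made-up frame with English column names (both are assumptions for illustration) to show that new rows are filled with NaN unless fill_value is given, and that the original frame is left unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {'city': ['Beijing', 'Shanghai'], 'index_value': [101.5, 101.2]},
    index=['c1', 'c2'],
)

# reorder the rows and ask for a label that does not exist: 'c3' becomes a NaN row
reordered = df.reindex(index=['c2', 'c1', 'c3'])

# add a column that does not exist and fill it with a constant
added = df.reindex(columns=['city', 'index_value', 'extra'], fill_value=0)

print(reordered)
print(added)
print(df.shape)   # reindex returns a new object; df itself is unchanged
```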

Removing entries by index label

.drop() removes the specified row or column labels from a Series or DataFrame.

By default it removes elements along axis 0; to operate on axis 1, pass axis=1.

In [95]: a = pd.Series([9,8,7,6], index=['a','b','c','d'])

In [96]: a
Out[96]:
a    9
b    8
c    7
d    6
dtype: int64

In [97]: a.drop(['b','c'])
Out[97]:
a    9
d    6
dtype: int64

In [99]: d
Out[99]:
    城市     同比     環比     定基
c5  瀋陽  101.4  100.1  101.6
c4  深圳  140.9  102.0  145.5
c3  廣州  119.4  101.3  120.0
c2  上海  127.3  101.2  127.8
c1  北京  120.7  101.5  121.4

In [100]: d.drop('c5')  # removes elements along axis 0 by default
Out[100]:
    城市     同比     環比     定基
c4  深圳  140.9  102.0  145.5
c3  廣州  119.4  101.3  120.0
c2  上海  127.3  101.2  127.8
c1  北京  120.7  101.5  121.4

In [101]: d.drop('同比',axis=1)  # to remove a column label, pass axis=1
Out[101]:
    城市     環比     定基
c5  瀋陽  100.1  101.6
c4  深圳  102.0  145.5
c3  廣州  101.3  120.0
c2  上海  101.2  127.8
c1  北京  101.5  121.4

The iloc attribute looks up values by position

In [90]: d = pd.DataFrame(dl, index=['c1','c2','c3','c4','c5'])

In [91]: d
Out[91]:
       同比  城市     定基     環比
c1  120.7  北京  121.4  101.5
c2  127.3  上海  127.8  101.2
c3  119.4  廣州  120.0  101.3
c4  140.9  深圳  145.5  102.0
c5  101.4  瀋陽  101.6  100.1

# value in row 1, column 2
In [93]: d.iloc[0,1]
Out[93]: '北京'

# get row 1
In [94]: d.iloc[0,:]
Out[94]:
同比    120.7
城市       北京
定基    121.4
環比    101.5
Name: c1, dtype: object

# get column 1
In [96]: d.iloc[:,0]
Out[96]:
c1    120.7
c2    127.3
c3    119.4
c4    140.9
c5    101.4
Name: 同比, dtype: float64

The loc attribute looks up values by index label

# look up a single element
In [97]: d.loc['c1','城市']
Out[97]: '北京'

# get row 1
In [98]: d.loc['c1',:]
Out[98]:
同比    120.7
城市       北京
定基    121.4
環比    101.5
Name: c1, dtype: object
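A difference between the two indexers that is easy to miss is how they treat the end of a slice. The sketch below uses a small made-up frame (the English column names are assumptions for illustration): iloc slices exclude the end position, as Python lists do, while loc slices include the end label.

```python
import pandas as pd

df = pd.DataFrame(
    {'city': ['Beijing', 'Shanghai', 'Guangzhou'], 'yoy': [120.7, 127.3, 119.4]},
    index=['c1', 'c2', 'c3'],
)

by_position = df.iloc[0, 1]        # row 0, column 1
by_label = df.loc['c1', 'yoy']     # the same cell, addressed by labels

pos_slice = df.iloc[0:1]           # positions 0 up to (not including) 1: one row
label_slice = df.loc['c1':'c2']    # labels c1 through c2 inclusive: two rows

print(by_position, by_label)
print(len(pos_slice), len(label_slice))
```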

Descriptive statistics

# print the first 5 rows
salesDf.head()
# summary statistics of the data
salesDf.describe()

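A worked example

1. Pose the question

Derive the following business metrics from the sales data:
1) average number of purchases per month
2) average spending per month
3) average spend per customer visit
4) spending trend

Read the Excel data

fileNameStr = './朝陽醫院2018年銷售數據.xlsx'
# pd.ExcelFile takes no dtype argument in current pandas; pass dtype to parse() instead
xls = pd.ExcelFile(fileNameStr)
salesDf = xls.parse('Sheet1', dtype=object)

Inspect the basic information

Print the first 5 rows to confirm that the data loaded correctly

salesDf.head()

Check the shape

salesDf.shape

Check the data type of each column

salesDf.dtypes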

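2. Understand the data

3. Clean the data (the key step)

1. Select a subset: use the DataFrame's loc indexer to select a subset by index.
2. Rename columns

# rename 購葯時間 (purchase time) to 銷售時間 (sale time)
colNameDict = {'購葯時間':'銷售時間'}
# inplace=False (the default): the DataFrame itself is unchanged and a modified copy is returned
# inplace=True: the DataFrame itself is modified
salesDf.rename(columns=colNameDict, inplace=True)
salesDf.head()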

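3. Handle missing data

# drop rows where 銷售時間 (sale time) or 社保卡號 (social security card number) is empty
# how='any': drop the row if any of the given columns has a missing value
salesDf = salesDf.dropna(subset=['銷售時間','社保卡號'], how='any')
print('Shape after dropping missing rows:', salesDf.shape)

Shape after dropping missing rows: (6575, 7)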
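To see what subset and how do in isolation, here is a minimal sketch on a made-up four-row frame. The column names and values are assumptions for illustration and are not taken from the sales file.

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['2018-01-01', None, '2018-01-03', '2018-01-04'],
    'card': ['001', '002', None, '004'],
    'note': [None, 'x', 'y', 'z'],
})

# how='any': drop a row if either 'time' or 'card' is missing.
# The missing 'note' in row 0 is ignored because 'note' is not in subset.
any_missing = df.dropna(subset=['time', 'card'], how='any')

# how='all': drop a row only if both 'time' and 'card' are missing
all_missing = df.dropna(subset=['time', 'card'], how='all')

print(any_missing.shape, all_missing.shape)
print(list(any_missing.index))   # the surviving rows keep their old index labels
```

The surviving rows keep their original index labels, so the index has gaps after dropna.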

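4. Convert data types

Convert strings to numbers (floating point)

# convert strings to numbers (floating point)
# 銷售數量 = quantity sold, 應收金額 = amount receivable, 實收金額 = amount received
salesDf['銷售數量'] = salesDf['銷售數量'].astype(float)
salesDf['應收金額'] = salesDf['應收金額'].astype(float)
salesDf['實收金額'] = salesDf['實收金額'].astype(float)
print('Data types after conversion:', salesDf.dtypes)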

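Convert strings to the date type

# define a helper that splits the sale-time strings and keeps the date part
def split_time(time_column):
    time_list = []
    for value in time_column:
        # keep the text before the first space
        time_data = value.split(' ')[0]
        time_list.append(time_data)
    # convert the list to a one-dimensional Series; reuse the original index so the
    # assignment below lines up row by row (rows were dropped earlier, so the index has gaps)
    time_series = pd.Series(time_list, index=time_column.index)
    return time_series

# split the time with split_time and store the result in time_now
time_now = split_time(salesDf.loc[:,'銷售時間'])
# overwrite the values of the 銷售時間 column
salesDf.loc[:,'銷售時間'] = time_now
# view the first 5 rows
salesDf.head()

Convert the strings to dates

# errors='coerce': values that do not match the date format are converted to the null value NaT
# format: the date format used in the raw data
# assign with [] rather than .loc so that the column takes the datetime dtype
salesDf['銷售時間'] = pd.to_datetime(salesDf['銷售時間'], format='%Y-%m-%d', errors='coerce')
salesDf.dtypes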
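The split-then-parse step can also be written without an explicit loop by using the pandas string accessor. The sketch below is an alternative under stated assumptions: the sample strings (a date, a space, then extra text) and the gapped index are made up for illustration, and the real file may look different.

```python
import pandas as pd

# hypothetical raw values; the index has gaps, as it would after dropna
raw = pd.Series(
    ['2018-01-01 Fri', '2018-02-30 Sat', '2018-03-05 Mon'],
    index=[0, 2, 5],
)

# keep the text before the first space; the result keeps the index of raw
date_part = raw.str.split(' ').str[0]

# '2018-02-30' is not a real date, so errors='coerce' turns it into NaT
parsed = pd.to_datetime(date_part, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum())   # number of values that could not be parsed
```

Because the string accessor preserves the index, the result can be assigned back to the column without any alignment problems.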

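5. Sort the data

# by: the column(s) to sort by
# ascending=True sorts in ascending order; ascending=False sorts in descending order
# sort by sale date in ascending order
salesDf = salesDf.sort_values(by='銷售時間', ascending=True)
# reset the row index: after sorting, the row index still holds the old row numbers,
# so rebuild it as consecutive values from 0 to N
salesDf = salesDf.reset_index(drop=True)
salesDf.head()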
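The effect of ascending and of reset_index(drop=True) can be checked on a three-row frame. The column names and values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-03-01', '2018-01-15', '2018-02-10']),
    'qty': [3, 1, 2],
})

asc = df.sort_values(by='date', ascending=True)     # earliest date first
desc = df.sort_values(by='date', ascending=False)   # latest date first

print(list(asc.index))         # the old row labels travel with the rows
asc_reset = asc.reset_index(drop=True)
print(list(asc_reset.index))   # consecutive labels starting at 0
```

With drop=True the old index is discarded; without it, the old index would be kept as a new column named index.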

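6. Handle outliers

# remove outliers: filter the data with a condition
# query condition: 銷售數量 (quantity sold) must be positive
querySer = salesDf.loc[:,'銷售數量'] > 0
# apply the query condition
print('Shape before removing outliers:', salesDf.shape)
salesDf = salesDf.loc[querySer,:]
print('Shape after removing outliers:', salesDf.shape)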
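The query condition is simply a boolean Series with one True or False per row, and loc keeps the rows where it is True. A minimal sketch with made-up quantities:

```python
import pandas as pd

df = pd.DataFrame({'qty': [2.0, -1.0, 0.0, 5.0]})

mask = df.loc[:, 'qty'] > 0    # boolean Series, aligned with the rows of df
kept = df.loc[mask, :]         # rows where the mask is True

print(mask.tolist())
print(kept)
```

Zero and negative quantities are both removed because the condition is a strict greater-than.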

That is the basic process of data cleaning.
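The notes stop before computing the four metrics listed in step 1. As a rough sketch of how they might be computed on cleaned data, the example below uses a made-up frame with English column names. The definitions are assumptions, not the author's: one visit is one unique (date, card) pair, the number of months is the date span in days divided by 30 (integer division), and spend per visit is total amount divided by total visits.

```python
import pandas as pd

# hypothetical cleaned data; names and values are for illustration only
sales = pd.DataFrame({
    'sale_date': pd.to_datetime(
        ['2018-01-01', '2018-01-01', '2018-01-20', '2018-02-15', '2018-03-02']),
    'card_id': ['A', 'A', 'B', 'A', 'C'],
    'amount': [10.0, 5.0, 20.0, 30.0, 35.0],
})

# assumption: several rows with the same date and card count as one visit
visits = sales.drop_duplicates(subset=['sale_date', 'card_id'])
total_visits = len(visits)

# assumption: months = span of the data in days, integer-divided by 30
days = (sales['sale_date'].max() - sales['sale_date'].min()).days
months = days // 30

total_amount = sales['amount'].sum()

visits_per_month = total_visits / months     # 1) purchases per month
amount_per_month = total_amount / months     # 2) spending per month
amount_per_visit = total_amount / total_visits   # 3) spend per visit

# 4) trend: total amount per calendar month
monthly_trend = sales.groupby(sales['sale_date'].dt.to_period('M'))['amount'].sum()

print(visits_per_month, amount_per_month, amount_per_visit)
print(monthly_trend)
```

On real data the month count should be checked against zero before dividing, since a data set spanning fewer than 30 days would give months equal to 0 under this definition.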