Pandas Notes

04-24

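The core of pandas is the DataFrame (the same concept as in R, and comparable to an Excel sheet), plus a set of operations built around it. Picturing a table opened in Excel makes all of this intuitive. (Panel has few practical uses, so it is not covered here.)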

NumPy's ndarray requires every element to share one dtype; a pandas DataFrame does not. A DataFrame is essentially n equal-length Series, and the values inside each Series share one dtype.
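To make the dtype difference concrete, here is a minimal sketch. The sample values and the variable names (`arr`, `df`, `col`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy coerces a mixed list to one common dtype (a string dtype here).
arr = np.array([21, "M", 86.5])
print(arr.dtype)

# Each DataFrame column is its own Series and keeps its own dtype.
df = pd.DataFrame({"age": [21, 22], "sex": ["M", "F"], "score": [86.5, 93.0]})
print(df.dtypes)

# Selecting a single column hands back that Series.
col = df["age"]
print(type(col))
```

The per-column dtypes are what let one table hold numbers and labels side by side.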

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Series: the building block of a DataFrame

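# A Series = data + index + name + dtype
s_age = pd.Series(data=[21, 22, 22, 22], index=["zhao", "qian", "sun", "li"], name="age", dtype=np.int32)
# Series supports most Python list operations.
# In addition, every value has a label (the index), so elements can be accessed like a dict.
s_age.iloc[0]  # access by position
s_age.iloc[1:2] = 22  # anything you can index, you can assign
s_age["li"]  # access by label
s_age.reindex(["li", "wang", "qian"])  # reorder; "wang" becomes NaN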

For the less common Series operations, see the official Series documentation.
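A small sketch of the dict-like behaviour and of what `reindex` does to a missing label. It rebuilds the same age data so it runs on its own, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([21, 22, 22, 22], index=["zhao", "qian", "sun", "li"], name="age")

by_label = s["li"]       # dict-style access by label
by_position = s.iloc[0]  # explicit positional access
has_wang = "wang" in s   # membership tests the index, like dict keys

# reindex aligns to the new labels; a label missing from the original gets NaN,
# which forces the integer values to become floats.
reordered = s.reindex(["li", "wang", "qian"])
print(reordered)
```

The switch to a float dtype after `reindex` is worth remembering when a later step expects integers.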

Creating a DataFrame

# 1. Build from Series
s_sex = pd.Series(data=["M", "F", "F", "M"], index=["zhao", "qian", "sun", "li"], name="sex")
s_score = pd.Series(data=[86, 93, 90, 98], index=["zhao", "qian", "sun", "li"], name="score")
df1 = pd.DataFrame({"age": s_age, "sex": s_sex.astype("category"), "score": s_score})  # 4 people = 4 rows, 3 attribute columns
df1["name"] = df1.index  # add one more attribute column; anything you can index, you can assign
#       age sex  score  name
# zhao   21   M     86  zhao
# qian   22   F     93  qian
# sun    22   F     90   sun
# li     22   M     98    li
# 2. Read from a file. This is the more common way at work.
iris = pd.read_csv("https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv", sep=",")
# iris.to_csv("iris.csv", index=False)

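For more file I/O methods, see the official IO documentation. The main control parameters are the line terminator, the field separator, row and column names, and missing-value handling.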
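As a rough sketch of those control parameters, the example below parses a small semicolon-separated text. The file contents are invented, and `io.StringIO` stands in for a real path so the snippet runs without touching disk.

```python
import io
import pandas as pd

# Hypothetical file contents; "NA" and "?" both mark missing values.
raw = "name;age;score\nzhao;21;86\nqian;22;NA\nsun;?;90\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",           # field separator
    header=0,          # the first line holds the column names
    index_col="name",  # use this column as the row labels
    na_values=["?"],   # extra strings to treat as missing ("NA" is recognised by default)
)
print(df)

# Writing back out: choose the separator and how missing values are rendered.
out = df.to_csv(sep=",", na_rep="NULL")
print(out)
```

Calling `to_csv` without a path returns the CSV text as a string, which is handy for checking the output format before writing a file.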

Inspecting the data: like opening the table in Excel and getting a feel for what it contains.

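# Basic information
df1.index; df1.columns; df1.dtypes  # row labels, column names, dtype of each column
df1.head(5); df1.tail(5)  # first / last 5 rows
df1.describe()  # summary statistics for the numeric columns
# Sorting returns a DataFrame, so it can be chained with head() to see the top rows
df1.sort_index(ascending=False)  # sort by row label
df1.sort_values(by="score", ascending=False)  # sort by column value
# Aggregation. To avoid getting lost in tutorials, just remember this one line (roughly Excel's Subtotal):
df1.groupby("sex").agg({"score": ["mean", "count"], "age": ["sum"]})
#     score        age
#      mean count  sum
# sex
# F    91.5     2   44
# M    92.0     2   43
# Pivot. Same rule: just remember this one line (Excel's PivotTable):
df1.pivot_table(index=["age"], columns=["sex"], values=["score"], aggfunc=[len, "mean", "sum"], margins=True, fill_value=0)
# Rows, columns, values, and aggregation functions are all lists, so each can hold several; the behaviour matches an Excel pivot table.
#       len          sum               mean
#     score        score              score
# sex     F  M All     F    M  All      F   M        All
# age
# 21      0  1   1     0   86   86    0.0  86  86.000000
# 22      2  1   3   183   98  281   91.5  98  93.666667
# All     2  2   4   183  184  367   91.5  92  91.750000
# Plotting, via matplotlib
df1.plot(); plt.show()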

For more plotting options, see Visualization - pandas 0.21.0 documentation.
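One way to see how the aggregation and pivot one-liners relate: a pivot table with a single aggregation function gives the same numbers as a groupby on both keys followed by `unstack`. The sketch below rebuilds the sample data so it runs on its own; `pivot` and `by_group` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 22, 22, 22], "sex": ["M", "F", "F", "M"], "score": [86, 93, 90, 98]},
    index=["zhao", "qian", "sun", "li"],
)

# Pivot: one row per age, one column per sex, mean score in each cell.
pivot = df.pivot_table(index="age", columns="sex", values="score", aggfunc="mean")

# The same numbers via groupby on both keys, then moving "sex" into the columns.
by_group = df.groupby(["age", "sex"])["score"].mean().unstack("sex")

print(pivot)
print(by_group)
```

Without `fill_value`, a combination that has no rows (age 21, sex F) shows up as NaN in both results.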

Selecting, merging, and sampling data: these operations come up often when preparing machine-learning training data.

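# Selecting rows. It mirrors Python and is easy to follow, as long as tutorials have not confused you.
df1[1:3]  # slice selection
df1[df1.score > 90]  # boolean selection
df1.query("score > 90")  # SQL-like
df1.where(df1.score > 90)  # keeps df1's shape; rows that fail the condition are filled with NaN (pass other=... for a different fill value)
# Selecting columns
df1.age; df1[["age", "sex"]]
# Selecting a row/column region by name
df1.loc[["zhao", "li"], ["age", "sex"]]  # rows, columns
# If writing names is tedious, use integer positions with .iloc
df1.iloc[1:3, 0:2]  # rows 1-2, columns 0-1
# Combining data with concat (df2 is a second DataFrame, not defined in these notes)
pd.concat([df1, df2], axis=0, join="outer")  # axis=0 stacks rows, axis=1 stacks columns; the other axis is handled per join (inner/outer)
# Combining data with merge = join, SQL style
pd.merge(df1, df2, on=["age"], how="left")  # "df1 left outer join df2 on age"; how is one of inner/outer/left/right
# Sampling, shuffling, etc.
df1.sample(n=100, weights="age", axis=0, replace=True)  # draw 100 rows with replacement, weighted by the age column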

The behaviour of concat and merge is easiest to grasp from table examples, which make the results quite intuitive.
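A small runnable sketch of both operations. The two frames here are invented for the example; in particular `df2` and its `grade` column are hypothetical and do not come from these notes.

```python
import pandas as pd

df1 = pd.DataFrame({"age": [21, 22], "score": [86, 93]}, index=["zhao", "qian"])
# Hypothetical second table: shares "age" with df1 and adds a "grade" column.
df2 = pd.DataFrame({"age": [22, 23], "grade": ["B", "C"]}, index=["sun", "wang"])

# axis=0 stacks rows; join decides what happens to the columns.
outer = pd.concat([df1, df2], axis=0, join="outer")  # union of columns, gaps become NaN
inner = pd.concat([df1, df2], axis=0, join="inner")  # only the shared columns

# merge works like a SQL join on the key column(s).
left = pd.merge(df1, df2, on="age", how="left")   # every df1 row, matched where possible
both = pd.merge(df1, df2, on="age", how="inner")  # only ages present in both tables

print(outer, inner, left, both, sep="\n\n")
```

concat keeps the original row labels, whereas merge discards them and returns a fresh integer index.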

SQL: no surprise here. A DataFrame is a table, so SQL operations follow naturally. A feature this cool deserves a section of its own.

# pip install pandasql
import pandasql
pandasql.sqldf("select * from df1 limit 5", globals())
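If installing pandasql is not an option, a similar effect can be sketched with the standard library's `sqlite3` plus pandas' own `to_sql` and `read_sql_query`. This is an alternative route, not pandasql itself, and the table name and query are illustrative.

```python
import sqlite3
import pandas as pd

df1 = pd.DataFrame(
    {"name": ["zhao", "qian", "sun", "li"], "sex": ["M", "F", "F", "M"], "score": [86, 93, 90, 98]}
)

conn = sqlite3.connect(":memory:")    # throwaway in-memory database
df1.to_sql("df1", conn, index=False)  # copy the DataFrame into a table named df1

query = "select sex, avg(score) as mean_score from df1 group by sex order by sex"
result = pd.read_sql_query(query, conn)  # the query result comes back as a DataFrame
conn.close()
print(result)
```

The cost is an explicit copy into the database, so this suits small and medium tables better than very large ones.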

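Pandas is not alone here: R, Excel, Spark, Hadoop (Hive), and other data tools can all support SQL through plug-ins. SQL is the second-best language in the world; the best is shell-awk.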