利用python進行數據分析之數據載入、存儲與文件格式(五)
6.4 使用資料庫
輸入和輸出可以劃分幾個大類:讀取文本文件和其他更高效的磁碟存儲格式,載入資料庫中的數據,利用web API操作網路資源。6.1 讀寫文件格式的數據pandas中得解析函數函數read_csvt 從文件、URL、文件型對象中載入帶分隔符的數據。默認分隔符為逗號read_table 從文件、URL、文件型對象中載入帶分隔符的數據。默認分隔符為製表符("t")reda_fwft 讀取定寬格式數據(沒有分隔符)read_clipboard 讀取剪貼板中的數據,可以看做read_table的剪貼板版。在將網頁轉換為表格時很有用這些函數的選項可以劃分下面幾大類:- 索引:將一個或多個列當做返回的DataFrame處理,以及是否從文件、用戶獲得列名
- 類型推斷和數據轉換:包括用戶定義值得轉換、缺失值標記列表等
- 日期解析:包括組合功能,比如將分散在多個列中的日期時間信息組合成結果中的單個列。
- 迭代:支持對大文件進行逐塊迭代
- 不規整數據問題:跳過一些行、頁腳、注釋或其他一些不重要的東西(比如成千上萬個逗號隔開的數值數據)
In [22]: !type E:ch06ex1.csvna,b,c,d,messagen1,2,3,4,hellon5,6,7,8,worldn9,10,11,12,foon
In [25]: df=pd.read_csv(E:ch06ex1.csv)nnIn [26]: dfnOut[26]:na b c d messagen0 1 2 3 4 hellon1 5 6 7 8 worldn2 9 10 11 12 foon
In [28]: pd.read_table(E:ch06ex1.csv,sep=,)nOut[28]:na b c d messagen0 1 2 3 4 hellon1 5 6 7 8 worldn2 9 10 11 12 foon
In [29]: pd.read_table(E:ch06ex1.csv)nOut[29]:na,b,c,d,messagen0 1,2,3,4,hellon1 5,6,7,8,worldn2 9,10,11,12,foon
In [3]: !type E:ch06ex2.csvn1,2,3,4,hellon5,6,7,8,worldn9,10,11,12,foonnIn [5]: pd.read_csv(E:ch06ex2.csv,header=None)nOut[5]:n0 1 2 3 4n0 1 2 3 4 hellon1 5 6 7 8 worldn2 9 10 11 12 foonnIn [8]: pd.read_csv(E:ch06ex2.csv,names=[a,b,c,d,message])nOut[8]:na b c d messagen0 1 2 3 4 hellon1 5 6 7 8 worldn2 9 10 11 12 foon
In [10]: names=[a,b,c,d,message]nnIn [11]: pd.read_csv(E:ch06ex2.csv,names=names,index_col=message)nOut[11]:na b c dnmessagenhello 1 2 3 4nworld 5 6 7 8nfoo 9 10 11 12n
In [15]: !type E:ch06_mindex.csvnkey1,key2,value1,value2none,a,1,2none,b,3,4none,c,5,6none,d,7,8ntwo,a,9,10ntwo,b,11,12ntwo,c,13,14ntwo,d,15,16nnIn [17]: parsed=pd.read_csv(E:ch06_mindex.csv,index_col=[key1,key2])nnIn [18]: parsednOut[18]:nvalue1 value2nkey1 key2none a 1 2nb 3 4nc 5 6nd 7 8ntwo a 9 10nb 11 12nc 13 14nd 15 16n
In [35]: list(open(E:ch06ex3.txt))nOut[35]:n[ A B Cn,naaa -0.264438 -1.343465 -0.619500n,nbbb -0.264438 -1.343465 -0.619500n,nccc -0.264438 -1.343465 -0.619500n,nddd -0.264438 -1.343465 -0.619500n]n
In [36]: result=pd.read_table(E:ch06ex3.txt,sep=s+)nnIn [37]: resultnOut[37]:nA B Cnaaa -0.264438 -1.343465 -0.6195nbbb -0.264438 -1.343465 -0.6195nccc -0.264438 -1.343465 -0.6195nddd -0.264438 -1.343465 -0.6195n
In [4]: !type E:ch06ex4.csvn#hey!na,b,c,d,messagen#just wanted to make things more difficult for youn#who reads CSV files with computers,anyway?n1,2,3,4,hellon5,6,7,8,worldn9,10,11,12,foonnIn [5]: pd.read_csv(E:ch06ex4.csv,skiprows=[0,2,3])nOut[5]:na b c d messagen0 1 2 3 4 hellon1 5 6 7 8 worldn2 9 10 11 12 foon
In [56]: !type E:ch06ex5.csvnsomething,a,b,c,d,messagenone,1,2,3,4,NAntwo,5,6,,8,worldnthree,9,10,11,12,foonnIn [58]: import pandas as pdnnIn [59]: result=pd.read_csv(E:ch06ex5.csv)nnIn [60]: resultnOut[60]:nsomething a b c d messagen0 one 1 2 3.0 4 NaNn1 two 5 6 NaN 8 worldn2 three 9 10 11.0 12 foonnIn [61]: pd.isnull(result)nOut[61]:nsomething a b c d messagen0 False False False False False Truen1 False False False True False Falsen2 False False False False False Falsen
In [62]: result=pd.read_csv(E:ch06ex5.csv,na_values=[NULL])nnIn [63]: resultnOut[63]:nsomething a b c d messagen0 one 1 2 3.0 4 NaNn1 two 5 6 NaN 8 worldn2 three 9 10 11.0 12 foon
In [65]: sentinels={message:[foo,NA],something:[two]}nnIn [66]: result=pd.read_csv(E:ch06ex5.csv,na_values=sentinels)nnIn [67]: resultnOut[67]:nsomething a b c d messagen0 one 1 2 3.0 4 NaNn1 NaN 5 6 NaN 8 worldn2 three 9 10 11.0 12 NaNn
In [4]: result=pd.read_csv(rF:pydata-book-masterch06ex6.csv)nnIn [5]: resultnOut[5]: n one two three four keyn0 0.467976 -0.038649 -0.295344 -1.824726 Ln1 -0.358893 1.404453 0.704965 -0.200638 Bn2 -0.501840 0.659254 -0.421691 -0.057688 Gn3 0.204886 1.074134 1.388361 -0.982404 Rn4 0.354628 -0.133116 0.283763 -0.837063 Qn5 1.817480 0.742273 0.419395 -2.251035 Qn... ... ... ... ... ..n9993 1.821117 0.416445 0.173874 0.505118 Xn9994 0.068804 1.322759 0.802346 0.223618 Hn9995 2.311896 -0.417070 -1.409599 -0.515821 Ln9996 -0.479893 -0.650419 0.745152 -0.646038 En9997 0.523331 0.787112 0.486066 1.093156 Kn9998 -0.362559 0.598894 -1.843201 0.887292 Gn9999 -0.096376 -1.012999 -0.657431 -0.573315 0nn[10000 rows x 5 columns]n
In [7]: pd.read_csv(rF:pydata-book-masterch06ex6.csv,nrows=5)nOut[7]: n one two three four keyn0 0.467976 -0.038649 -0.295344 -1.824726 Ln1 -0.358893 1.404453 0.704965 -0.200638 Bn2 -0.501840 0.659254 -0.421691 -0.057688 Gn3 0.204886 1.074134 1.388361 -0.982404 Rn4 0.354628 -0.133116 0.283763 -0.837063 Qn
In [10]: chunker=pd.read_csv(rF:pydata-book-masterch06ex6.csv,chunksize=10000)nnIn [11]: chunkernOut[11]: <pandas.io.parsers.TextFileReader at 0xb894d68>n
In [10]: chunker=pd.read_csv(rF:pydata-book-masterch06ex6.csv,chunksize=10000)nnIn [11]: chunkernOut[11]: <pandas.io.parsers.TextFileReader at 0xb894d68>nnnIn [13]: from pandas import Series,DataFramennIn [14]: tot=Series([])nnIn [15]: for piece in chunker:n ...: tot=tot.add(piece[key].value_counts(),fill_value=0)n ...:nnIn [16]: tot.order(ascending=False)n__main__:1: FutureWarning: order is deprecated, use sort_values(...)nOut[16]:nE 368.0nX 364.0nL 346.0nO 343.0nQ 340.0nM 338.0nJ 337.0nF 335.0nK 334.0nH 330.0nV 328.0nI 327.0nU 326.0nP 324.0nD 320.0nA 320.0nR 318.0nY 314.0nG 308.0nS 308.0nN 306.0nW 305.0nT 304.0nB 302.0nZ 288.0nC 286.0n4 171.0n6 166.0n7 164.0n3 162.0n8 162.0n5 157.0n2 152.0n0 151.0n9 150.0n1 146.0ndtype: float64nnIn [17]: tot[:6]nOut[17]:nE 368.0nX 364.0nL 346.0nO 343.0nQ 340.0nM 338.0ndtype: float64n
In [1]: import pandas as pdnnIn [2]: data=pd.read_csv(r"F:pydata-book-masterch06ex5.csv")nnIn [3]: datanOut[3]: n something a b c d messagen0 one 1 2 3.0 4 NaNn1 two 5 6 NaN 8 worldn2 three 9 10 11.0 12 foonnnIn [6]: !type "F:pydata-book-masterch06out.csv"n,something,a,b,c,d,messagen0,one,1,2,3.0,4,n1,two,5,6,,8,worldn2,three,9,10,11.0,12,foon
In [8]: import sysnnIn [9]: data.to_csv(sys.stdout,sep=|)n|something|a|b|c|d|messagen0|one|1|2|3.0|4|n1|two|5|6||8|worldn2|three|9|10|11.0|12|foon
In [10]: data.to_csv(sys.stdout,na_rep="NULL")n,something,a,b,c,d,messagen0,one,1,2,3.0,4,NULLn1,two,5,6,NULL,8,worldn2,three,9,10,11.0,12,foon
In [11]: data.to_csv(sys.stdout,index=False,header=False)none,1,2,3.0,4,ntwo,5,6,,8,worldnthree,9,10,11.0,12,foon
In [16]: from pandas import Series,DataFramennIn [18]: import numpy as npnnIn [20]: dates=pd.date_range(1/1/2000,periods=7)nnIn [21]: ts=Series(np.arange(7),index=dates)nnIn [22]: ts.to_csv(r"F:pydata-book-masterch06tseries.csv")nnIn [23]: !type F:pydata-book-masterch06tseries.csv"n2000-01-01,0n2000-01-02,1n2000-01-03,2n2000-01-04,3n2000-01-05,4n2000-01-06,5n2000-01-07,6n
In [24]: Series.from_csv(rF:pydata-book-masterch06tseries.csv,parse_dates=True)nOut[24]: n2000-01-01 0n2000-01-02 1n2000-01-03 2n2000-01-04 3n2000-01-05 4n2000-01-06 5n2000-01-07 6ndtype: int64n
6.1.3 ·手工處理分隔符格式
In [25]: !type F:pydata-book-masterch06ex7.csvn"a","b","c"n"1","2","3"n"1","2","3","4"n
In [26]: import csvnnIn [27]: f=open(r"F:pydata-book-masterch06ex7.csv")nnIn [28]: reader=csv.reader(f)n
In [30]: for line in reader:n ...: print linen ...: n[a, b, c]n[1, 2, 3]n[1, 2, 3, 4]n
In [4]: lines=list(csv.reader(open(F:pydata-book-masterch06ex7.csv)))nnIn [5]: header,values=lines[0],lines[1:]nnIn [8]: header,valuesnOut[8]: ([a, b, c], [[1, 2, 3], [1, 2, 3, 4]])nnIn [9]: data_dict={h:v for h,v in zip(header,zip(*values))}nnIn [10]: header,valuesnOut[10]: ([a, b, c], [[1, 2, 3], [1, 2, 3, 4]])n
class my_dialect(csv.Dialect):n lineterminator=nn delimiter=;n quotechar= " nreader=csv.reader(f,dialect=my_dialect)n
reader=csv.reader(f,delimiter=|)n
with open(rF:pydata-book-masterch06mydata.csv,w) as f: n writer=csv.writer(f,dialect=my_dialect)n writer.writerow((one,two,three))n writer.writerow((1,2,3))n writer.writerow((4,5,6))n writer.writerow((7,8,9))n
obj="""{"name":"Wes","places_lived":["United States","Spain","Germany"],"pet":null,"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},{"name":"Katie","age":33,"pet":"Cisco"}]}"""nIn [12]: obj="""{"name":"Wes","places_lived":["United States","Spain","Germany"],"pet":null,"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},{"name":"Katie","age":33,"pet":"Cisco"}]}"""n
In [13]: import jsonnnIn [14]: result=json.loads(obj)nnIn [15]: resultnOut[15]:n{name: Wes,npet: None,nplaces_lived: [United States, Spain, Germany],nsiblings: [{age: 25, name: Scott, pet: Zuko},n{age: 33, name: Katie, pet: Cisco}]}n
asjson=json.dumps(result)n
In [29]: siblings=DataFrame(result[siblings],columns=[name,age])nnIn [30]: siblingsnOut[30]:nname agen0 Scott 25n1 Katie 33n
推薦閱讀:
TAG:Python |