pandas 1 | 10 Minutes to pandas, SO EASY!!!

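Hello World, I'm Big Brother Nie, and I love learning.

The other day I came across a flashy, too-good-to-be-true article: "10 Minutes to pandas".

My inner monologue: you've got to be kidding, no way...

But that is exactly what the official pandas documentation calls it: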

10 Minutes to pandas

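Fully convinced that 10 minutes would never be enough, I did indeed need 1 hour and 16 minutes to get through it...

The title of the official doc is simply infuriating...

If you seriously want to learn pandas, please open this post on a computer...

On a phone the formatting is a total mess...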

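Before reading this post, it helps to know a little about Python, pandas and NumPy, or at least to be familiar with basic concepts such as Series, array and DataFrame.

Complete beginners can start with another article: "Python Data Processing: Everything You Need to Know About Pandas".

Before reading this "10 minutes" con of an article, my pandas knowledge was skin-deep: reading files, understanding the data structures, selecting the data I needed, all very beginner-level moves. It did trick me, but there is no denying that it is top-notch learning material!!!

If you are a pandas newbie like me, I bet that after reading it you will also be thrilled, scroll to the end with trembling hands, hit like, save it to your bookmarks, and then quite possibly never open it again.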

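OK, enough chatter. Now follow along and study properly (my version is slightly condensed compared with the original). Most of it stays in English, since the original English comments are very friendly, and I have added notes wherever beginners might get stuck~~~

Today we cover only two parts, Object Creation and Viewing Data, which is roughly 10 minutes of study and very simple~ The other reason is... editing the code nearly wore me out... so let's take it one bit at a time (smile)...

Strongly recommended: don't be lazy, type the code yourself in your own Jupyter notebook and you will understand pandas much better! After that you and pandas can happily become good friends~~~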

0)Import Libraries

Customarily, we import as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

*Without importing pandas, numpy and the matplotlib plotting library, nothing that follows will run~

1)Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

*pandas uses NaN (not a number) to represent missing values; here it is produced with numpy's nan. By default these values are excluded from computations.
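To make the note about missing values concrete, here is a minimal sketch (the variable names are my own, not from the official guide). It assumes the same Series as above and shows that reductions such as sum() and mean() skip NaN by default, while skipna=False switches that behaviour off.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

total = s.sum()                      # NaN is skipped: 1 + 3 + 5 + 6 + 8
average = s.mean()                   # mean over the 5 non-missing values
n_valid = s.count()                  # count() only counts non-missing values
strict_total = s.sum(skipna=False)   # with skipna=False the NaN propagates

print(total, average, n_valid, strict_total)
```

If you need the missing value to show up in the result, skipna=False is probably the simplest switch to reach for.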

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

input:

dates = pd.date_range('20130101', periods=6)
dates

output:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

input:

df = pd.DataFrame(np.random.randn(6, 4),
                  index=dates, columns=list('ABCD'))
df

output:

                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
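Because the values come from a random number generator, the numbers in your own notebook will not match the output shown here. If you want a frame that is identical on every run, one option is to seed the generator. The sketch below uses NumPy's default_rng; the helper make_frame and the seed value 0 are my own illustrative choices, not part of the official guide.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

def make_frame(seed):
    """Build a 6x4 demo frame from a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((6, 4)),
                        index=dates, columns=list("ABCD"))

df_a = make_frame(0)
df_b = make_frame(0)

# The same seed yields the same values, so the two frames compare equal.
print(df_a.equals(df_b))
```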

Creating a DataFrame by passing a dict of objects that can be converted to series-like.

input:

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

output:

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have specific dtypes

input:

df2.dtypes

output:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

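*The difference between float32 and float64 is the bit width: a value occupies 32 or 64 bits in memory respectively, i.e. 4 bytes or 8 bytes, and the more bits, the higher the floating-point precision (source: Baidu Zhidao, user 行雲啊)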
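The size and precision difference can be checked directly in NumPy. The following is a small sketch of my own (not from the official guide): it reads the storage size of each dtype and measures how far the stored value drifts from 0.1 once it is converted back to a Python float.

```python
import numpy as np

size32 = np.dtype("float32").itemsize   # bytes per float32 value
size64 = np.dtype("float64").itemsize   # bytes per float64 value

# Storing 0.1 in each type and converting back to a Python float
# exposes the rounding error that the narrower type introduces.
err32 = abs(float(np.float32(0.1)) - 0.1)
err64 = abs(float(np.float64(0.1)) - 0.1)

print(size32, size64)
print(err32, err64)
```

A Python float is itself a 64-bit double, which is why the float64 round trip loses nothing while the float32 one does.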

2)Viewing Data

See the top & bottom rows of the frame

input:

df.head()

output:

                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

input:

df.tail(3)

output:

                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Display the index, columns, and the underlying numpy data

input:

df.index

output:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

input:

df.columns

output:

Index(['A', 'B', 'C', 'D'], dtype='object')

input:

df.values

output:

array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
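A side note that is not in the version of the guide followed here: newer pandas releases (0.24 and later) recommend DataFrame.to_numpy() over the .values attribute. The sketch below uses two tiny made-up frames (frame and mixed are my own names) to show that both return the same array, and that a frame with mixed column types falls back to a general object array.

```python
import numpy as np
import pandas as pd

# All-float frame: the resulting array keeps the float64 dtype.
frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr = frame.to_numpy()

# Mixed float/string frame: NumPy needs one dtype for the whole array,
# so the result becomes an object array.
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})
mixed_arr = mixed.to_numpy()

print(arr.dtype, mixed_arr.dtype)
print(np.array_equal(arr, frame.values))
```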

Describe shows a quick statistic summary of your data

input:

df.describe()

output:

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

Transposing your data (swapping rows and columns)

input:

df.T

output:

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

Sorting by an axis

input:

df.sort_index(axis=1, ascending=False)

output:

                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690

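sort_index() defaults to axis=0, ascending=True, which sorts the rows by their index labels in ascending order.

To sort the columns by their labels instead, in descending order, use df.sort_index(axis=1, ascending=False)~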
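Here is a small sketch of both calls side by side. The frame and its labels are made up for illustration so that the effect is easy to see; it is not part of the official guide.

```python
import pandas as pd

# Deliberately unsorted row labels (z, x, y) and column labels (B, A).
frame = pd.DataFrame(
    {"B": [1, 2, 3], "A": [4, 5, 6]},
    index=["z", "x", "y"],
)

by_rows = frame.sort_index()                              # defaults: axis=0, ascending=True
by_cols_desc = frame.sort_index(axis=1, ascending=False)  # column labels, descending

print(by_rows)
print(by_cols_desc)
```

Note that sort_index() looks only at the labels and never at the values inside the frame, and that it returns a new frame rather than changing the original.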

Sorting by values

input:

df.sort_values(by='B')

output:

                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
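sort_values() accepts the same ascending flag, and by can also be a list of column names. The sketch below goes slightly beyond what the guide shows and uses a tiny made-up frame: it sorts by one column in descending order, then by two columns with a different direction for each.

```python
import pandas as pd

frame = pd.DataFrame({"A": [2, 1, 2, 1], "B": [0.5, -1.0, -0.3, 0.7]})

# Single key, largest B first.
desc = frame.sort_values(by="B", ascending=False)

# Two keys: A ascending, and within equal A values, B descending.
multi = frame.sort_values(by=["A", "B"], ascending=[True, False])

print(desc)
print(multi)
```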

Next up →_→ pandas 2 | 10 Minutes to pandas: Data Selection

