Python data analysis modules | Data analysis with pandas (2): common preprocessing operations

02-05

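Author: 謝小小
Original link: http://blog.csdn.net/xierhacker/article/details/65935459
In data analysis and machine learning tasks, dropping certain rows or columns of a dataset and merging datasets are very common operations.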

1. Merge operations

pandas.merge

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

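Purpose: merge DataFrame objects on columns or on indexes by performing a database-style join. If columns are joined on columns, the DataFrame indexes are ignored. Otherwise, when joining indexes on indexes, or indexes on one or more columns, the index is passed on to the result.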
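The difference between joining on columns and joining on indexes can be seen in a minimal sketch. The frames and column names below (`k`, `v`, `w`) are made up for illustration and are not from the original article.

```python
import pandas as pd

# Hypothetical frames with non-default indexes, joined on the column "k"
left = pd.DataFrame({"k": ["a", "b"], "v": [1, 2]}, index=[10, 11])
right = pd.DataFrame({"k": ["a", "b"], "w": [3, 4]}, index=[20, 21])

# Column-on-column join: the original indexes (10, 11, 20, 21) are discarded
on_cols = pd.merge(left, right, on="k")
print(on_cols)

# Hypothetical frames whose indexes hold the join keys
left2 = pd.DataFrame({"v": [1, 2]}, index=["a", "b"])
right2 = pd.DataFrame({"w": [3, 4]}, index=["b", "c"])

# Index-on-index join: the matching index labels are kept in the result
on_idx = pd.merge(left2, right2, left_index=True, right_index=True)
print(on_idx)
```

In the first result the index is a fresh 0..n-1 range, while the second result keeps the shared label `b` as its index.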

Parameters:

left : DataFrame

right : DataFrame

how : {'left', 'right', 'outer', 'inner'}, default 'inner'

left: use only keys from left frame (SQL: left outer join)

right: use only keys from right frame (SQL: right outer join)

outer: use union of keys from both frames (SQL: full outer join)

inner: use intersection of keys from both frames (SQL: inner join)

on : label or list Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False Use the index from the right DataFrame as the join key. Same caveats as left_index

sort : boolean, default False Sort the join keys lexicographically in the result DataFrame

suffixes : 2-length sequence (tuple, list, …) Suffix to apply to overlapping column names in the left and right side, respectively

copy : boolean, default True If False, do not copy data unnecessarily

indicator : boolean or string, default False If True, adds a column to output DataFrame called "_merge" with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of "left_only" for observations whose merge key only appears in 'left' DataFrame, "right_only" for observations whose merge key only appears in 'right' DataFrame, and "both" if the observation's merge key is found in both. New in version 0.17.0.

Returns:

merged : DataFrame The output type will be the same as 'left', if it is a subclass of DataFrame.
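As a rough illustration of the `how`, `on` and `indicator` parameters listed above, the following sketch merges two small frames. The data and the column names `key`, `lval` and `rval` are assumptions chosen only for the example.

```python
import pandas as pd

# Two hypothetical frames that share the column "key"
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Default how="inner": only keys present in both frames survive
inner = pd.merge(left, right, on="key")
print(inner)

# how="outer" keeps the union of keys; indicator=True adds a "_merge"
# column telling which side each row came from
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer)
```

Rows that exist on only one side get NaN in the columns coming from the other side, and the `_merge` column makes those rows easy to filter.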

pandas.concat
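The original gives no description for this function, so the following is only a minimal sketch of how `pandas.concat` is commonly used to stack frames along rows or columns. The frames `a` and `b` are hypothetical.

```python
import pandas as pd

# Two hypothetical frames with the same column
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Stack rows; ignore_index=True renumbers the resulting index
rows = pd.concat([a, b], ignore_index=True)
print(rows)

# Place frames side by side (axis=1); rows are aligned on the index
cols = pd.concat([a, b.rename(columns={"x": "y"})], axis=1)
print(cols)
```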

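2. Drop operations

pandas.DataFrame.drop

DataFrame.drop(labels, axis=0, level=None, inplace=False, errors='raise')

Purpose: return a new object with the given labels removed along the specified axis.

Parameters:

labels : a single label or a list of labels

axis : int or axis name. It works together with labels: with axis=0 the labels refer to rows, with axis=1 they refer to columns

level : int or level name, default None

For MultiIndex

inplace : bool, default False. Whether to modify the original DataFrame. If True, the original DataFrame is changed and None is returned.

errors : {'ignore', 'raise'}, default 'raise'. If 'ignore', the error is suppressed and only the labels that exist are dropped.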
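The effect of the `errors` parameter can be sketched as follows. The frame and the nonexistent label `"Z"` are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical frame; "Z" is a label that does not exist in it
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# errors="ignore": the missing label "Z" is skipped, the existing "B" is dropped
kept = df.drop(labels=["B", "Z"], axis=1, errors="ignore")
print(kept.columns.tolist())

# Default errors="raise": a missing label raises KeyError
try:
    df.drop(labels="Z", axis=1)
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)
```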

Example:

import pandas as pd
df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"],
                   "C": [1, 2, 3]})
print("original:\n", df)
# get1 receives the new object with row 0 removed (axis=0 by default)
# inplace defaults to False, so df itself does not change
get1 = df.drop(labels=0)
print("df:\n", df)
print("get1:\n", get1)
# inplace is True here, so df changes and get2 receives None
get2 = df.drop(labels=0, inplace=True)
print("df:\n", df)
print("get2:\n", get2)
# this time a column is removed; compare with the calls above
get3 = df.drop(labels="A", axis=1)
print("df:\n", df)
print("get3:\n", get3)


pandas.DataFrame.pop

DataFrame.pop(item)

Purpose: return the item and drop it from the frame.
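A minimal sketch of `pop`, assuming a small made-up frame: the popped column comes back as a Series and the frame is modified in place.

```python
import pandas as pd

# Hypothetical frame with two columns
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# pop returns column "B" as a Series and removes it from df
b = df.pop("B")
print(b)
print(df)
```

Unlike `drop`, which returns a new object by default, `pop` always changes the original frame.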

3. Encoding

pandas.get_dummies()

Converts categorical variables into indicator variables (in other words, one-hot encoding).

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

Parameters:

data : array-like, Series, or DataFrame.

prefix : string, list of strings, or dict of strings, default None. When a list is passed, its length must equal the number of columns that get_dummies will encode. Alternatively, prefix can be a dict mapping column names to prefixes.

prefix_sep : string, default '_' If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

dummy_na : bool, default False Add a column to indicate NaNs, if False NaNs are ignored.

columns : list-like, default None Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.

sparse : bool, default False Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.

New in version 0.16.1.

drop_first : bool, default False Whether to get k-1 dummies out of k categorical levels by removing the first level.

New in version 0.18.0.

Returns:

dummies : DataFrame or SparseDataFrame
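The `dummy_na` and `drop_first` options described above can be sketched on a small made-up Series. The exact dtype of the dummy columns varies between pandas versions, so the sketch only looks at the resulting columns.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two categories and one missing value
s = pd.Series(["a", "b", np.nan, "a"])

# dummy_na=True adds an extra indicator column for NaN
with_na = pd.get_dummies(s, dummy_na=True)
print(with_na)

# drop_first=True keeps k-1 columns out of k levels (here only "b")
reduced = pd.get_dummies(s, drop_first=True)
print(reduced)
```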

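Example 1. Series

import pandas as pd
# for a Series the number of rows stays the same and the number of columns
# becomes the number of distinct categories
# each row still represents its original category, in encoded form
# the function returns a DataFrame whose column names are the categories
s = pd.Series(list("abca"))
print("original:")
print(s)
print("get dummy:")
s_dummy = pd.get_dummies(data=s)
print(s_dummy)
print("type of s_dummy:", type(s_dummy))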


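Example 2. DataFrame

import pandas as pd
df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"],
                   "C": [1, 2, 3]})
print("original:")
print(df)
# every categorical column is one-hot encoded
# each feature (original column name) produces one column per distinct category
# (for example, A only contains a and b, so A_a and A_b are generated)
# features that were numeric to begin with stay unchanged
# prefix sets the prefix of the newly generated columns; you can choose the names
df_dummy = pd.get_dummies(data=df, prefix=["A", "B"])
print("get dummy:")
print(df_dummy)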


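4. Handling missing values

pandas uses the floating-point value NaN (not a number) to represent missing data in both floating-point and non-floating-point arrays.

In pandas, an np.nan that you pass in yourself and Python's built-in None are both treated as missing values, as the following example shows.

import numpy as np
import pandas as pd
s = pd.Series(data=["tom", "jack", "kate", np.nan])
print(s)
s[0] = None
print(s)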


Finding missing values

DataFrame.isnull()

Purpose: return a DataFrame of booleans with the same shape as the original DataFrame.
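Because the result is a boolean frame, it can be summed or used as a mask. The sketch below, on a made-up frame, counts missing values per column and selects the rows that contain any missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({"A": ["a", "b", np.nan], "C": [1, 2, np.nan]})

mask = df.isnull()

# True counts as 1, so summing gives the number of missing values per column
per_col = mask.sum()
print(per_col)

# Keep only the rows where at least one value is missing
rows_with_missing = df[mask.any(axis=1)]
print(rows_with_missing)
```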

Example:

import numpy as np
import pandas as pd
s = pd.Series(data=["tom", "jack", "kate", np.nan])
print(s)
print(s.isnull())
print(type(s.isnull()))
df = pd.DataFrame({"A": ["a", "b", np.nan], "B": ["b", "a", "c"],
                   "C": [1, 2, np.nan]})
print("original:")
print(df)
print(df.isnull())


Filling missing values

pandas.DataFrame.fillna

Fills missing values using the specified method and returns the filled DataFrame.

DataFrame.fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None, **kwargs)

Parameters:

value : a scalar, dict, Series, or DataFrame. It supplies the values used to fill the missing entries.

method : one of {'backfill', 'bfill', 'pad', 'ffill', None}, default None.

Method to use for filling holes in reindexed Series

pad / ffill: propagate last valid observation forward to next valid

backfill / bfill: use NEXT valid observation to fill gap

axis : {0 or 'index', 1 or 'columns'}

inplace : bool, default False. If True, the object is modified in place.

limit : (for forward and backward filling) the maximum number of consecutive missing values to fill.
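The `value` and `limit` parameters can be sketched on a small made-up frame. Since newer pandas releases deprecate the `method` argument of `fillna`, this sketch uses `DataFrame.ffill` for the forward fill, which corresponds to `method='ffill'` in the signature above.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with several gaps
df = pd.DataFrame({"A": [1.0, np.nan, np.nan, 4.0],
                   "B": [np.nan, 2.0, np.nan, np.nan]})

# Scalar value: every missing entry becomes 0
filled_scalar = df.fillna(value=0)

# Dict value: a different fill value per column
filled_dict = df.fillna(value={"A": -1, "B": -2})

# Forward fill, filling at most one consecutive missing value per gap
forward = df.ffill(limit=1)

print(filled_scalar)
print(filled_dict)
print(forward)
```

Because `inplace` defaults to False, `df` itself is left unchanged by all three calls.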