A Pandas Cheat Sheet

As is well known, Pandas is the most popular Python library for data science and analysis.

NumPy handles lower-level scientific computing. Pandas is built on top of NumPy and designed for practical data analysis.

This article summarizes some of the library's most common and convenient features. Without further ado, let's get started!

Datasets

Data is the lifeblood of data science and analysis: we look to data for answers and for patterns, and without it there is nothing to work with.

There are many ways to obtain data; here are a few platforms that provide datasets.

pandas-videos:github.com/justmarkham/

Datasets on pandas-videos

UCI Machine Learning Repository: archive.ics.uci.edu/ml/


The 50 Best Free Datasets for Machine Learning:gengo.ai/datasets/the-5

Importing Data

Any data analysis starts with importing data, and Pandas provides many ways to do it.

pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Reads from a SQL table/database
pd.read_json(json_string) # Reads from a JSON formatted string, URL or file
pd.read_html(url) # Parses an HTML URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for column names, values for data as lists

For example, suppose we want to analyze top-rated IMDB movies. After some searching, the data hosted on raw.githubusercontent.com looks good; to import it, pd.read_csv is all we need!

Specifically:

import pandas as pd

dataURL = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv'
df = pd.read_csv(dataURL)
df.head() # Prints first 5 rows of the DataFrame

The first 5 rows of data

For more details on these import methods, see IO Tools.

Exploring Data

Once the data is loaded into a DataFrame, the following methods help you get a feel for it.

df.index # Index Description
df.columns # Columns in the DataFrame
df.shape # Prints number of rows and columns in DataFrame
df.head(n) # Prints first n rows of the DataFrame
df.tail(n) # Prints last n rows of the DataFrame
df.info() # Index, DataType and Memory information
df.describe() # Summary statistics for numerical columns
s.value_counts(dropna=False) # Views unique values and counts
df.apply(pd.Series.value_counts) # Unique value and counts for all columns
df.mean() # Returns the mean of all columns
df.corr() # Returns the correlation between columns in a DataFrame
df.count() # Returns the number of non-null values in each DataFrame column
df.max() # Returns the highest value in each column
df.min() # Returns the lowest value in each column
df.median() # Returns the median of each column
df.std() # Returns the standard deviation of each column
df.idxmax() # Index of the highest value in each column
df.idxmin() # Index of the lowest value in each column
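To see a few of these inspectors in action, here is a quick sketch on a tiny made-up DataFrame (the column names and values below are invented for illustration):

```python
import pandas as pd

# Tiny made-up DataFrame to try the inspection methods on
df = pd.DataFrame({"rating": [8.1, 7.5, 9.0, 7.5],
                   "votes": [120, 300, 450, 300]})

print(df.shape)                              # (4, 2)
print(df["rating"].value_counts(dropna=False))  # 7.5 appears twice
print(df.describe())                         # count/mean/std/min/quartiles/max
print(df.idxmax())                           # row index of each column's maximum
```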

For example, after importing the IMDB data, to count how many movies fall into each genre we can use:

df['genre'].value_counts()

Counts per movie genre

Selection

Often we need to select a single element or a subset of the data for closer analysis. These methods come in handy:

df[col] # Returns column with label col as Series
df[[col1, col2]] # Returns columns as a new DataFrame
s.iloc[0] # Selection by position (selects first element)
s.loc[0] # Selection by index label (selects the element with index label 0)
df.iloc[0, :] # First row
df.iloc[0, 0] # First element of first column
df.iat[0, 0] # First element of first column. Access a single value for row/column pair by integer position
df.at[row_label, col_label] # Access a single value for row/column label pair
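A minimal sketch contrasting label-based and position-based selection, on a made-up DataFrame with a non-default index (names and values are invented):

```python
import pandas as pd

# Non-default index makes the loc/iloc difference visible
df = pd.DataFrame({"title": ["A", "B", "C"], "rating": [8.0, 7.2, 9.1]},
                  index=[10, 20, 30])

s = df["rating"]               # single column as a Series
sub = df[["title", "rating"]]  # list of columns -> new DataFrame

print(s.iloc[0])               # 8.0  (by position)
print(s.loc[20])               # 7.2  (by index label)
print(df.iloc[0, 1])           # 8.0  (first row, second column)
print(df.at[30, "title"])      # C    (single value by row/column labels)
```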

For more, see Indexing and Selecting Data.

Data Cleaning

In practice, raw data usually needs cleaning first. Here are some very useful methods:

df.columns = ['a', 'b', 'c'] # Renames columns
pd.isnull(obj) # Checks for null values, returns a Boolean array
pd.notnull(obj) # Opposite of pd.isnull()
df.dropna() # Drops all rows that contain null values
df.dropna(axis=1) # Drops all columns that contain null values
df.dropna(thresh=n) # Drops all rows that have fewer than n non-null values
df.fillna(x) # Replaces all null values with x
s.fillna(s.mean()) # Replaces all null values with the mean
s.astype(float) # Converts the datatype of the series to float
s.replace(1, 'one') # Replaces all values equal to 1 with 'one'
s.replace([1, 3], ['one', 'three']) # Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) # Mass renaming of columns
df.rename(columns={old_name: new_name}) # Selective renaming
df.set_index('column_one') # Changes the index
df.rename(index=lambda x: x + 1) # Mass renaming of index
df.drop(labels) # Drops specified labels from rows or columns
df.drop_duplicates(subset) # Returns DataFrame with duplicate rows removed, optionally only considering certain columns
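A short sketch of the most common cleaning moves on a made-up frame with missing values (column names and numbers are invented):

```python
import pandas as pd
import numpy as np

# Made-up frame with NaNs to exercise the cleaning methods
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

print(df.dropna())                      # keeps only the fully populated last row
print(df.fillna(0))                     # NaN -> 0 everywhere
print(df["a"].fillna(df["a"].mean()))   # NaN in 'a' -> 2.0 (the column mean)

df2 = df.rename(columns={"a": "alpha"})
print(df2.columns.tolist())             # ['alpha', 'b']
```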

For more, see Working with missing data.

Filtering, Sorting and Grouping

Some methods for filtering, sorting and grouping data:

df[df[col] > 0.5] # Rows where the col column is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
df.sort_values(col1) # Sorts values by col1 in ascending order
df.sort_values(col2, ascending=False) # Sorts values by col2 in descending order
df.sort_values([col1, col2], ascending=[True, False]) # Sorts values by col1 in ascending order then col2 in descending order
df.groupby(col) # Returns a groupby object for values from one column
df.groupby([col1, col2]) # Returns a groupby object for values from multiple columns
df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
df.pivot_table(index=col1, values=[col2, col3], aggfunc='mean') # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.groupby(col1).agg(np.mean) # Finds the average across all columns for every unique column 1 group
df.apply(np.mean) # Applies a function across each column
df.apply(np.max, axis=1) # Applies a function across each row
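Putting filtering, sorting and grouping together on a small made-up sales table (store names and sales figures are invented):

```python
import pandas as pd

# Made-up sales data for filtering, sorting and grouping
df = pd.DataFrame({"store": ["A", "A", "B", "B"],
                   "sales": [10, 30, 20, 40]})

high = df[df["sales"] > 15]                       # boolean filtering
by_sales = df.sort_values("sales", ascending=False)
per_store = df.groupby("store")["sales"].mean()   # A -> 20.0, B -> 30.0
print(per_store)
```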

For more, see Group By: split-apply-combine.

Exporting Data

Finally, when it's time to export your results, there are several convenient methods:

df.to_csv(filename) # Writes to a CSV file
df.to_excel(filename) # Writes to an Excel file
df.to_sql(table_name, connection_object) # Writes to a SQL table
df.to_json(filename) # Writes to a file in JSON format
df.to_html(filename) # Saves as an HTML table
df.to_clipboard() # Writes to the clipboard
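As a quick sanity check that export and import are inverses, here is a sketch that round-trips a small made-up DataFrame through an in-memory CSV buffer instead of a file on disk:

```python
import io

import pandas as pd

df = pd.DataFrame({"title": ["A", "B"], "rating": [8.0, 7.2]})

# Round-trip through an in-memory buffer rather than a real file
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_back = pd.read_csv(buf)
print(df_back.equals(df))  # True
```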

For details, see IO Tools.

Joining and Merging

Methods for combining two DataFrames:

pd.concat([df1, df2], axis=1) # Concatenates the columns of df1 and df2 side by side
pd.merge(df1, df2) # SQL-style merges
pd.concat([df1, df2]) # Adds the rows of df2 to the end of df1 (columns should be identical; replaces the deprecated df1.append(df2))
df1.join(df2, on=col1, how='inner') # SQL-style joins
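A minimal sketch of inner and outer merges and a row-wise concat, on two made-up key/value frames (keys and values are invented):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

inner = pd.merge(left, right, on="key")             # only key 'b' survives
outer = pd.merge(left, right, on="key", how="outer")  # keys a, b, c all kept
stacked = pd.concat([left, left], ignore_index=True)  # row-wise append

print(inner)
```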

For more, see Merge, join, and concatenate.

Creating Test Objects

Useful for testing code snippets.

pd.DataFrame(np.random.rand(20,5)) # 5 columns and 20 rows of random floats
pd.Series(my_list) # Create a series from an iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]) # Add a date index
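Putting the three lines together as a runnable sanity check (the start date is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 5))                    # 20x5 random floats
df.index = pd.date_range("1900/1/30", periods=df.shape[0])  # one row per day

print(df.shape)     # (20, 5)
print(df.index[0])  # 1900-01-30 00:00:00
```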

A Worked Example

To wrap up, let's work through an example (taken from the book Python for Data Analysis) as a small exercise covering the material above.

We use the bitly_usagov dataset, which can be found at github.com/wesm/pydata-

Importing the Data

Assuming the data has been downloaded locally, we can import it into a DataFrame for the analysis that follows.

import pandas as pd

file_path = '../../Datasets/bitly_usagov/example.txt' # Local file path
df = pd.read_json(file_path, lines=True)
df.info()

Inspecting the data with info

Counting Time Zones

The df object has a column named tz holding the time zone. We can count the time zones across all records:

tz_counts = df['tz'].value_counts()
tz_counts.head(10)

We can visualize the result, but first let's fill in a placeholder for unknown or missing time zones.

The fillna method replaces missing values (NA), while unknown values (empty strings) can be replaced via boolean array indexing:

clean_tz = df['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()

We can then create a horizontal bar chart with the seaborn package:
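A minimal sketch of such a chart; since the real dataset is not bundled here, a small made-up Series stands in for the tz column (the time-zone values below are invented):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import seaborn as sns

# Made-up stand-in for df['tz'] from the real dataset
clean_tz = pd.Series(["America/New_York", "", None,
                      "Europe/London", "America/New_York"]).fillna("Missing")
clean_tz[clean_tz == ""] = "Unknown"
tz_counts = clean_tz.value_counts()

# Horizontal bar chart of the time-zone counts
ax = sns.barplot(x=tz_counts.values, y=tz_counts.index.tolist())
ax.figure.savefig("tz_counts.png")
```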

The column named a in df contains information about the browser, device and application. We can extract just the browser information:

df.a.str.split(' ').str.get(0).head(10)

Similarly, we can visualize the browser statistics.


