A Pandas Cheat Sheet

As is well known, Pandas is the most popular Python library for data science and analysis.

NumPy handles lower-level scientific computing. Pandas is built on top of NumPy and designed for practical data analysis.

This article summarizes some of the library's most common and convenient features. Without further ado, let's get started!

Datasets

Data is the lifeblood of data science and analysis: we look to data for answers and for patterns, and without it there is nothing to work with.

There are many ways to obtain data; here are a few platforms that provide datasets.

pandas-videos:github.com/justmarkham/

Datasets on pandas-videos

UCI Machine Learning Repository: archive.ics.uci.edu/ml/


The 50 Best Free Datasets for Machine Learning:gengo.ai/datasets/the-5

Importing Data

Any data analysis starts with importing data, and Pandas provides many ways to do it.

pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Reads from a SQL table/database
pd.read_json(json_string) # Reads from a JSON formatted string, URL or file
pd.read_html(url) # Parses an HTML URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for column names, values for data as lists

For example, suppose we want to analyze top-rated IMDB movies. After some searching, the data hosted on raw.githubusercontent.com looks good; to import it, pd.read_csv is all we need!

Specifically:

import pandas as pd

dataURL = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv'
df = pd.read_csv(dataURL)
df.head() # Prints first 5 rows of the DataFrame

The first 5 rows of data

For more details on these import methods, see IO Tools.

Exploring Data

Once the data is loaded into a DataFrame, the following methods help you get a feel for it.

df.index # Index Description
df.columns # Columns in the DataFrame
df.shape # Prints number of rows and columns in DataFrame
df.head(n) # Prints first n rows of the DataFrame
df.tail(n) # Prints last n rows of the DataFrame
df.info() # Index, DataType and Memory information
df.describe() # Summary statistics for numerical columns
s.value_counts(dropna=False) # Views unique values and counts
df.apply(pd.Series.value_counts) # Unique value and counts for all columns
df.mean() # Returns the mean of all columns
df.corr() # Returns the correlation between columns in a DataFrame
df.count() # Returns the number of non-null values in each DataFrame column
df.max() # Returns the highest value in each column
df.min() # Returns the lowest value in each column
df.median() # Returns the median of each column
df.std() # Returns the standard deviation of each column
df.idxmax() # Index of the highest value in each column
df.idxmin() # Index of the lowest value in each column
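To see a few of these inspectors in action, here is a quick sketch on a tiny made-up DataFrame (the column names and values below are invented for illustration):

```python
import pandas as pd

# Tiny made-up DataFrame to try the inspection methods on
df = pd.DataFrame({"rating": [8.1, 7.5, 9.0, 7.5],
                   "votes": [120, 300, 450, 300]})

print(df.shape)                              # (4, 2)
print(df["rating"].value_counts(dropna=False))  # 7.5 appears twice
print(df.describe())                         # count/mean/std/min/quartiles/max
print(df.idxmax())                           # row index of each column's maximum
```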

For example, after importing the IMDB data, to count how many movies fall into each genre we can use:

df['genre'].value_counts()

Counts per movie genre

Selection

Often we need to select a single element or a subset of the data for closer analysis. These methods come in handy:

df[col] # Returns column with label col as Series
df[[col1, col2]] # Returns columns as a new DataFrame
s.iloc[0] # Selection by position (selects first element)
s.loc[0] # Selection by index label (selects the element with index label 0)
df.iloc[0, :] # First row
df.iloc[0, 0] # First element of first column
df.iat[0, 0] # First element of first column. Access a single value for row/column pair by integer position
df.at[row_label, col_label] # Access a single value for row/column label pair
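A minimal sketch contrasting label-based and position-based selection, on a made-up DataFrame with a non-default index (names and values are invented):

```python
import pandas as pd

# Non-default index makes the loc/iloc difference visible
df = pd.DataFrame({"title": ["A", "B", "C"], "rating": [8.0, 7.2, 9.1]},
                  index=[10, 20, 30])

s = df["rating"]               # single column as a Series
sub = df[["title", "rating"]]  # list of columns -> new DataFrame

print(s.iloc[0])               # 8.0  (by position)
print(s.loc[20])               # 7.2  (by index label)
print(df.iloc[0, 1])           # 8.0  (first row, second column)
print(df.at[30, "title"])      # C    (single value by row/column labels)
```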

For more, see Indexing and Selecting Data.

Data Cleaning

In practice, raw data usually needs cleaning first. Here are some very useful methods:

df.columns = ['a', 'b', 'c'] # Renames columns
pd.isnull(obj) # Checks for null values, returns a Boolean array
pd.notnull(obj) # Opposite of pd.isnull()
df.dropna() # Drops all rows that contain null values
df.dropna(axis=1) # Drops all columns that contain null values
df.dropna(thresh=n) # Drops all rows that have fewer than n non-null values
df.fillna(x) # Replaces all null values with x
s.fillna(s.mean()) # Replaces all null values with the mean
s.astype(float) # Converts the datatype of the series to float
s.replace(1, 'one') # Replaces all values equal to 1 with 'one'
s.replace([1, 3], ['one', 'three']) # Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) # Mass renaming of columns
df.rename(columns={old_name: new_name}) # Selective renaming
df.set_index('column_one') # Changes the index
df.rename(index=lambda x: x + 1) # Mass renaming of index
df.drop(labels) # Drops specified labels from rows or columns
df.drop_duplicates(subset) # Returns DataFrame with duplicate rows removed, optionally only considering certain columns
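A short sketch of the most common cleaning moves on a made-up frame with missing values (column names and numbers are invented):

```python
import pandas as pd
import numpy as np

# Made-up frame with NaNs to exercise the cleaning methods
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

print(df.dropna())                      # keeps only the fully populated last row
print(df.fillna(0))                     # NaN -> 0 everywhere
print(df["a"].fillna(df["a"].mean()))   # NaN in 'a' -> 2.0 (the column mean)

df2 = df.rename(columns={"a": "alpha"})
print(df2.columns.tolist())             # ['alpha', 'b']
```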

For more, see Working with missing data.

Filtering, Sorting and Grouping

Some methods for filtering, sorting and grouping data:

df[df[col] > 0.5] # Rows where the col column is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
df.sort_values(col1) # Sorts values by col1 in ascending order
df.sort_values(col2, ascending=False) # Sorts values by col2 in descending order
df.sort_values([col1, col2], ascending=[True, False]) # Sorts values by col1 in ascending order then col2 in descending order
df.groupby(col) # Returns a groupby object for values from one column
df.groupby([col1, col2]) # Returns a groupby object for values from multiple columns
df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
df.pivot_table(index=col1, values=[col2, col3], aggfunc='mean') # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.groupby(col1).agg(np.mean) # Finds the average across all columns for every unique column 1 group
df.apply(np.mean) # Applies a function across each column
df.apply(np.max, axis=1) # Applies a function across each row
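Putting filtering, sorting and grouping together on a small made-up sales table (store names and sales figures are invented):

```python
import pandas as pd

# Made-up sales data for filtering, sorting and grouping
df = pd.DataFrame({"store": ["A", "A", "B", "B"],
                   "sales": [10, 30, 20, 40]})

high = df[df["sales"] > 15]                       # boolean filtering
by_sales = df.sort_values("sales", ascending=False)
per_store = df.groupby("store")["sales"].mean()   # A -> 20.0, B -> 30.0
print(per_store)
```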

For more, see Group By: split-apply-combine.

Exporting Data

Finally, when it's time to export your results, there are several convenient methods:

df.to_csv(filename) # Writes to a CSV file
df.to_excel(filename) # Writes to an Excel file
df.to_sql(table_name, connection_object) # Writes to a SQL table
df.to_json(filename) # Writes to a file in JSON format
df.to_html(filename) # Saves as an HTML table
df.to_clipboard() # Writes to the clipboard
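As a quick sanity check that export and import are inverses, here is a sketch that round-trips a small made-up DataFrame through an in-memory CSV buffer instead of a file on disk:

```python
import io

import pandas as pd

df = pd.DataFrame({"title": ["A", "B"], "rating": [8.0, 7.2]})

# Round-trip through an in-memory buffer rather than a real file
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_back = pd.read_csv(buf)
print(df_back.equals(df))  # True
```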

For details, see IO Tools.

Joining and Merging

Methods for combining two DataFrames:

pd.concat([df1, df2], axis=1) # Concatenates the columns of df1 and df2 side by side
pd.merge(df1, df2) # SQL-style merges
pd.concat([df1, df2]) # Adds the rows of df2 to the end of df1 (columns should be identical; replaces the deprecated df1.append(df2))
df1.join(df2, on=col1, how='inner') # SQL-style joins
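A minimal sketch of inner and outer merges and a row-wise concat, on two made-up key/value frames (keys and values are invented):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

inner = pd.merge(left, right, on="key")             # only key 'b' survives
outer = pd.merge(left, right, on="key", how="outer")  # keys a, b, c all kept
stacked = pd.concat([left, left], ignore_index=True)  # row-wise append

print(inner)
```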

For more, see Merge, join, and concatenate.

Creating Test Objects

Useful for testing code snippets.

pd.DataFrame(np.random.rand(20,5)) # 5 columns and 20 rows of random floats
pd.Series(my_list) # Create a series from an iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]) # Add a date index
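Putting the three lines together as a runnable sanity check (the start date is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 5))                    # 20x5 random floats
df.index = pd.date_range("1900/1/30", periods=df.shape[0])  # one row per day

print(df.shape)     # (20, 5)
print(df.index[0])  # 1900-01-30 00:00:00
```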

A Worked Example

To wrap up, let's work through an example (taken from the book Python for Data Analysis) as a small exercise covering the material above.

We use the bitly_usagov dataset, which can be found at github.com/wesm/pydata-

Importing the Data

Assuming the data has been downloaded locally, we can import it into a DataFrame for the analysis that follows.

import pandas as pd

file_path = '../../Datasets/bitly_usagov/example.txt' # Local file path
df = pd.read_json(file_path, lines=True)
df.info()

Inspecting the data with info

Counting Time Zones

The df object has a column named tz holding the time zone. We can count the time zones across all records:

tz_counts = df['tz'].value_counts()
tz_counts.head(10)

We can visualize the result, but first let's fill in a placeholder for unknown or missing time zones.

The fillna method replaces missing values (NA), while unknown values (empty strings) can be replaced via boolean array indexing:

clean_tz = df['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()

We can then create a horizontal bar chart with the seaborn package:
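A minimal sketch of such a chart; since the real dataset is not bundled here, a small made-up Series stands in for the tz column (the time-zone values below are invented):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import seaborn as sns

# Made-up stand-in for df['tz'] from the real dataset
clean_tz = pd.Series(["America/New_York", "", None,
                      "Europe/London", "America/New_York"]).fillna("Missing")
clean_tz[clean_tz == ""] = "Unknown"
tz_counts = clean_tz.value_counts()

# Horizontal bar chart of the time-zone counts
ax = sns.barplot(x=tz_counts.values, y=tz_counts.index.tolist())
ax.figure.savefig("tz_counts.png")
```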

The column named a in df contains information about the browser, device and application. We can extract just the browser information:

df.a.str.split(' ').str.get(0).head(10)

Similarly, we can visualize the browser statistics.


