Python Data Analysis Notes (1)

09-03

From the column 山間詞話

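I. The counting problem

timezones = [
    'America/New_York', 'America/New_York', 'America/New_York',
    'Pacific/Auckland', 'America/Los_Angeles', 'America/Chicago',
    '', 'Europe/Berlin', '', 'America/New_York', 'America/New_York',
    'America/New_York', 'America/Chicago', 'America/Chicago',
]

Find the values that occur most often (or sort the values by their number of occurrences, from fewest to most).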

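Solution 1

# First way: count occurrences with a plain dict
def get_count(sequence):
    count = {}
    for x in sequence:
        if x in count:
            count[x] += 1
        else:
            count[x] = 1
    return count

Then:

def top_counts(count_dict, n=10):
    # Build (count, timezone) pairs so that sorting orders by count first
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    # Sort in descending order so the most frequent entries come first
    value_key_pairs.sort(reverse=True)
    # After a descending sort the top n are at the front, not the back
    return value_key_pairs[:n]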

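The sort() method used here deserves a closer look.

sort()

It accepts a key parameter that sets the sort criterion; the argument passed to it is a function. For example:

def takeSecond(pair):
    return pair[1]

key_value_pairs_oo.sort(key=takeSecond)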
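To make the key parameter concrete, here is a minimal sketch. The (timezone, count) pairs are made up for illustration, and take_second is just a renamed copy of the helper above. It also shows sorted(), which takes the same key argument but returns a new list instead of sorting in place.

```python
# Hypothetical (timezone, count) pairs, invented for this example.
pairs = [("America/Chicago", 3), ("America/New_York", 6), ("Europe/Berlin", 1)]


def take_second(pair):
    """Return the second element of the pair (the count)."""
    return pair[1]


# Sort in place by count, largest first.
pairs.sort(key=take_second, reverse=True)

# sorted() accepts the same key argument and leaves `pairs` untouched.
ascending = sorted(pairs, key=lambda pair: pair[1])

print(pairs)
print(ascending)
```

Without key, tuples are compared element by element starting from the first, which is why Solution 1 puts the count in front of the timezone before sorting.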

Solution 2

# Using the Python standard library
from collections import Counter

time_zones_count_by_lib = Counter(timezones)
timezones = time_zones_count_by_lib.most_common(10)
print(timezones)

Simply use Counter from the collections module.
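As a self-contained sketch of Solution 2, the snippet below runs Counter on the sample list from the start of this section. It assumes the two blank entries in that list are empty strings. It also covers the ascending variant of the question by sorting the counter's items with a key function.

```python
from collections import Counter

# Sample data from the problem statement; the two blank entries are
# assumed to be empty strings.
timezones = [
    "America/New_York", "America/New_York", "America/New_York",
    "Pacific/Auckland", "America/Los_Angeles", "America/Chicago",
    "", "Europe/Berlin", "", "America/New_York", "America/New_York",
    "America/New_York", "America/Chicago", "America/Chicago",
]

counts = Counter(timezones)

# The three most frequent values, most frequent first.
top_three = counts.most_common(3)

# All values ordered from least to most frequent.
least_first = sorted(counts.items(), key=lambda item: item[1])

print(top_three)
print(least_first)
```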

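II. A brief guide to pivot_table (pivot tables)

When using a pivot table, it is important to understand the structure of the data you are working with.

Suppose we have the following data:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    data={
        'name': ['Tom', 'Tom', 'Tom', 'Jerry', 'Jerry', 'Jerry'],
        'category': [0, 1, 2, 0, 1, 2],
        'rating': [5, 4, 3, 4, 4, 3],
    }
)

That is:

   category   name  rating
0         0    Tom       5
1         1    Tom       4
2         2    Tom       3
3         0  Jerry       4
4         1  Jerry       4
5         2  Jerry       3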

1. Compute the total rating for each name

pt1 = pd.pivot_table(
    data=df1,
    index='name',
    values='rating',
    aggfunc=np.sum,
)

This gives:

       rating
name
Jerry      11
Tom        12

Note: if no argument is passed for aggfunc, it defaults to 'mean', i.e. the average is computed.
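The difference between the default and an explicit aggfunc can be checked with a short sketch. It rebuilds the six-row frame from above and passes the aggregation as the string "sum" rather than np.sum; that is a substitution made here to keep the example free of a NumPy import, not something the notes above do. The groupby line is included only as a cross-check.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Tom", "Tom", "Tom", "Jerry", "Jerry", "Jerry"],
    "category": [0, 1, 2, 0, 1, 2],
    "rating": [5, 4, 3, 4, 4, 3],
})

# No aggfunc given: pivot_table falls back to the mean.
pt_mean = pd.pivot_table(df1, index="name", values="rating")

# Explicit sum, passed as a string instead of np.sum.
pt_sum = pd.pivot_table(df1, index="name", values="rating", aggfunc="sum")

# The same totals computed with groupby, as a cross-check.
totals = df1.groupby("name")["rating"].sum()

print(pt_mean)
print(pt_sum)
```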

2. Compute the total rating for each name within each category

Now let's add a few more rows to the data:

   category   name  rating
0         0    Tom       5
1         1    Tom       4
2         2    Tom       3
3         1    Tom       2
4         2    Tom       2
5         0  Jerry       4
6         1  Jerry       4
7         2  Jerry       3
8         0  Jerry       5
9         1  Jerry       2

Now a single name can have several different ratings within the same category.

We now pass an argument to the columns parameter of pivot_table:

pt1 = pd.pivot_table(
    data=df1,
    index='name',
    values='rating',
    columns='category',
    aggfunc=np.sum,
)

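This gives:

category  0  1  2
name
Jerry     9  6  3
Tom       5  6  5

It may actually be clearer when laid out like this:

name \ category  0  1  2
Jerry            9  6  3
Tom              5  6  5

Summary: to recap briefly, values determines which column fills the cells, aggfunc is applied to those values, and columns splits the result into sub-categories.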
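The roles of index, columns, values and aggfunc can be sketched end to end on the ten-row data. The frame name df2 is chosen here for illustration, and "sum" is again passed as a string. The groupby/unstack line is one way to reproduce the same table by hand, which may help show what pivot_table is doing.

```python
import pandas as pd

# The ten-row data from above, rebuilt under the illustrative name df2.
df2 = pd.DataFrame({
    "name": ["Tom"] * 5 + ["Jerry"] * 5,
    "category": [0, 1, 2, 1, 2, 0, 1, 2, 0, 1],
    "rating": [5, 4, 3, 2, 2, 4, 4, 3, 5, 2],
})

# index -> row labels, columns -> column labels,
# values -> the numbers being aggregated, aggfunc -> how they are combined.
pt = pd.pivot_table(
    df2, index="name", columns="category", values="rating", aggfunc="sum"
)

# Equivalent by hand: group on both keys, sum, then move category to columns.
by_hand = df2.groupby(["name", "category"])["rating"].sum().unstack("category")

print(pt)
```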

III. Filtering an index

Why: once we have filtered the index, we can use the loc method to retrieve specific rows.

(1) Filter the index

active_titles = rating_by_title.index[rating_by_title>250]

This gives:

Index(['burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       ......])

or values similar to these.

(2) Use the loc method to get the rows

mean_ratings = mean_ratings.loc[active_titles]

This way, we get the rows for titles that have more than 250 ratings (the group size).
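The two steps can be tried on a small made-up example. The titles, counts, and the F and M columns below are all hypothetical stand-ins; the notes do not show how rating_by_title and mean_ratings were built, so this sketch only assumes that the first is a Series of rating counts indexed by title and the second a DataFrame with the same index.

```python
import pandas as pd

# Hypothetical number of ratings per title.
rating_by_title = pd.Series({"Movie A": 300, "Movie B": 120, "Movie C": 251})

# Hypothetical mean ratings, indexed by the same titles.
mean_ratings = pd.DataFrame(
    {"F": [4.1, 3.2, 3.9], "M": [3.8, 3.5, 4.0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Step 1: a boolean mask selects the index labels that pass the threshold.
mask = rating_by_title > 250
active_titles = rating_by_title.index[mask]

# Step 2: loc retrieves the rows carrying those labels.
active_mean_ratings = mean_ratings.loc[active_titles]

print(active_titles)
print(active_mean_ratings)
```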

IV. sort_values

DataFrame.sort_values

This is a DataFrame method:

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
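A small sketch of sort_values on made-up data follows. The titles and numbers are hypothetical, and the F and M columns are assumed to hold one mean rating per group. It also shows sorting by several columns, and that the method returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

# Hypothetical mean ratings per title.
mean_ratings = pd.DataFrame(
    {"F": [3.9, 4.5, 4.1], "M": [4.2, 3.0, 4.1]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Highest F rating first.
top_female_ratings = mean_ratings.sort_values(by="F", ascending=False)

# Several sort keys: M descending, ties broken by F ascending.
by_m_then_f = mean_ratings.sort_values(by=["M", "F"], ascending=[False, True])

print(top_female_ratings)
print(by_m_then_f)
```

Because sort_values returns a sorted copy by default, mean_ratings keeps its original row order after both calls.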