Python 3 Machine Learning Cookbook - Chapter 5: Building Recommendation Engines 26

04-20

Computing the Pearson correlation score

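The Euclidean distance score is a good metric, but it has some shortcomings. For that reason, the Pearson correlation score is frequently used in recommendation engines. Let's see how to compute it.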
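The text does not say which shortcomings it has in mind. One commonly cited example is that a Euclidean score penalizes two users whose tastes agree but who rate on different baselines, while the Pearson correlation ignores such a constant offset. The sketch below illustrates this with made-up ratings; the `1 / (1 + distance)` form of the Euclidean score is an assumption, and the names `ratings_a` and `ratings_b` are hypothetical.

```python
import numpy as np

# Hypothetical ratings: user B rates every movie exactly 2 points higher than user A
ratings_a = np.array([1.0, 2.0, 3.0])
ratings_b = ratings_a + 2.0

# Euclidean distance score, assuming the common 1 / (1 + distance) normalization
euclidean_score = 1.0 / (1.0 + np.linalg.norm(ratings_a - ratings_b))

# Pearson correlation, taken from NumPy's 2x2 correlation matrix
pearson = np.corrcoef(ratings_a, ratings_b)[0, 1]

print("Euclidean score:", euclidean_score)
print("Pearson score:", pearson)
```

The two users rank the movies identically, so the Pearson score is essentially 1, while the Euclidean score is pulled down by the constant gap between their ratings.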

Create a file named pearson-score.py and import the required packages:

import json
import numpy as np

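Next, define a function that computes the Pearson correlation score between two users. The first step is to check that both users are present in the dataset: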

# Returns the Pearson correlation score between user1 and user2
def pearson_score(dataset, user1, user2):
    if user1 not in dataset:
        raise TypeError('User ' + user1 + ' not present in the dataset')

    if user2 not in dataset:
        raise TypeError('User ' + user2 + ' not present in the dataset')

Extract the movies that both users have rated:

    # Movies rated by both user1 and user2
    rated_by_both = {}

    for item in dataset[user1]:
        if item in dataset[user2]:
            rated_by_both[item] = 1

    num_ratings = len(rated_by_both)
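The same overlap can also be found with a set intersection on the dictionary keys. This is only an alternative sketch, not the approach used in this section; the user names, movie titles, and ratings are hypothetical, and the nested `{user: {movie: rating}}` shape is inferred from how the function indexes `dataset`.

```python
# Hypothetical ratings in the nested shape {user name: {movie title: rating}}
dataset = {
    'Alice': {'Movie A': 4.0, 'Movie B': 3.5, 'Movie C': 5.0},
    'Bob': {'Movie B': 2.0, 'Movie C': 4.5, 'Movie D': 1.0},
}

# dict.keys() returns a set-like view, so & yields the titles both users rated
common_movies = dataset['Alice'].keys() & dataset['Bob'].keys()
num_ratings = len(common_movies)

print(sorted(common_movies))
print(num_ratings)
```

The loop in the section keeps the movies in the order they appear for user1, whereas a set has no defined order. That does not matter for the sums computed later.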

If the two users have no rated movies in common, there is no measurable similarity between them, so return 0:

    # If there are no common movies, the score is 0
    if num_ratings == 0:
        return 0

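Compute the sum of each user's ratings over the common movies, then the sum of the squared ratings, then the sum of the products of the paired ratings. From these sums, compute the terms the Pearson correlation needs (Sxy, Sxx, and Syy). If the denominator is 0, return 0; otherwise, return the Pearson correlation score: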

    # Compute the sum of ratings of all the common preferences
    user1_sum = np.sum([dataset[user1][item] for item in rated_by_both])
    user2_sum = np.sum([dataset[user2][item] for item in rated_by_both])

    # Compute the sum of squared ratings of all the common preferences
    user1_squared_sum = np.sum([np.square(dataset[user1][item]) for item in rated_by_both])
    user2_squared_sum = np.sum([np.square(dataset[user2][item]) for item in rated_by_both])

    # Compute the sum of products of the common ratings
    product_sum = np.sum([dataset[user1][item] * dataset[user2][item] for item in rated_by_both])

    # Compute the Pearson correlation
    Sxy = product_sum - (user1_sum * user2_sum / num_ratings)
    Sxx = user1_squared_sum - np.square(user1_sum) / num_ratings
    Syy = user2_squared_sum - np.square(user2_sum) / num_ratings

    if Sxx * Syy == 0:
        return 0

    return Sxy / np.sqrt(Sxx * Syy)
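In formula form, the code above computes the following. The symbols are notation introduced here for illustration: x_i and y_i are the ratings that user1 and user2 gave to the i-th common movie, and n is num_ratings.

```latex
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}, \qquad
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}

r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
```

When either user gave the same rating to every common movie, S_xx or S_yy is 0 and r is undefined, which is why the function returns 0 in that case.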

Define the main function and compute the Pearson correlation score between two users:

if __name__ == '__main__':
    data_file = 'movie_ratings.json'

    with open(data_file, 'r') as f:
        data = json.loads(f.read())

    user1 = 'John Carson'
    user2 = 'Michelle Peterson'

    print("Pearson score:")
    print(pearson_score(data, user1, user2))

Output:

Pearson score:
0.39605901719066977
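As a sanity check, the function's result can be compared with NumPy's built-in `np.corrcoef` on the ratings the two users share. The sketch below restates the function compactly so that it runs on its own, and uses a small made-up JSON string in place of movie_ratings.json; the user names, movie titles, and ratings are hypothetical.

```python
import json
import numpy as np

def pearson_score(dataset, user1, user2):
    # Compact restatement of the function built in this section
    common = [item for item in dataset[user1] if item in dataset[user2]]
    n = len(common)
    if n == 0:
        return 0
    x = np.array([dataset[user1][item] for item in common], dtype=float)
    y = np.array([dataset[user2][item] for item in common], dtype=float)
    Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n
    Syy = np.sum(y ** 2) - np.sum(y) ** 2 / n
    if Sxx * Syy == 0:
        return 0
    return Sxy / np.sqrt(Sxx * Syy)

# Hypothetical stand-in for the contents of movie_ratings.json
sample_json = '''
{
  "User A": {"Movie 1": 4.0, "Movie 2": 2.5, "Movie 3": 5.0, "Movie 4": 3.0},
  "User B": {"Movie 1": 3.5, "Movie 2": 3.0, "Movie 3": 4.0, "Movie 5": 1.0}
}
'''
data = json.loads(sample_json)
score = pearson_score(data, "User A", "User B")

# Reference value: NumPy's correlation coefficient over the shared movies only
common = [m for m in data["User A"] if m in data["User B"]]
x = [data["User A"][m] for m in common]
y = [data["User B"][m] for m in common]
reference = np.corrcoef(x, y)[0, 1]

print(score, reference)
```

The two values should agree up to floating-point rounding. `np.corrcoef` does not handle the zero-variance case the way the function does, so the explicit check for `Sxx * Syy == 0` is still needed.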