Python 3 Machine Learning Cookbook, Chapter 5: Building Recommendation Engines (26)
Computing the Pearson correlation score
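The Euclidean distance score is a useful metric, but it has shortcomings. For example, it treats two users as dissimilar when their tastes agree but one of them consistently gives higher ratings than the other. For this reason, the Pearson correlation coefficient is often used in recommendation engines. This section shows how to compute it.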
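To make that shortcoming concrete, here is a minimal sketch with made-up ratings. The rating values and the `1 / (1 + distance)` form of the Euclidean score are assumptions for illustration, not data from this chapter. The two users rank the movies in the same order, but the second one always rates one point lower.

```python
import numpy as np

# Hypothetical ratings: user B agrees with user A on the ordering,
# but rates every movie exactly one point lower.
ratings_a = np.array([5.0, 4.0, 3.0, 2.0])
ratings_b = ratings_a - 1.0

# Euclidean distance score, assumed here to be 1 / (1 + distance)
distance = np.sqrt(np.sum(np.square(ratings_a - ratings_b)))
euclidean_score = 1.0 / (1.0 + distance)

# Pearson correlation coefficient, read from NumPy's correlation matrix
pearson = np.corrcoef(ratings_a, ratings_b)[0, 1]

print("Euclidean score:", euclidean_score)
print("Pearson score:", pearson)
```

Under these assumptions the Euclidean score comes out at one third, while the Pearson score is 1 up to floating-point rounding, because the correlation ignores a constant offset between the two users.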
- Create a file named pearson-score.py and import the required packages:
import json

import numpy as np
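- Next, define a function that computes the Pearson correlation score between two users. The first step is to check that both users are present in the dataset: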
# Returns the Pearson correlation score between user1 and user2
def pearson_score(dataset, user1, user2):
    if user1 not in dataset:
        raise TypeError('User ' + user1 + ' not present in the dataset')

    if user2 not in dataset:
        raise TypeError('User ' + user2 + ' not present in the dataset')
- Extract the movies that both users have rated:
    # Movies rated by both user1 and user2
    rated_by_both = {}

    for item in dataset[user1]:
        if item in dataset[user2]:
            rated_by_both[item] = 1

    num_ratings = len(rated_by_both)
- If the two users have no rated movies in common, there is no measurable similarity between them, so return 0:
    # If there are no common movies, the score is 0
    if num_ratings == 0:
        return 0
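- Compute the sum of each user's ratings for the common movies, the sum of the squared ratings, and the sum of the products of the two users' ratings. From these sums, compute the terms needed for the Pearson correlation coefficient (Sxy, Sxx, and Syy). If the denominator is 0, return 0; otherwise return the Pearson correlation coefficient: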
    # Compute the sum of ratings of all the common preferences
    user1_sum = np.sum([dataset[user1][item] for item in rated_by_both])
    user2_sum = np.sum([dataset[user2][item] for item in rated_by_both])

    # Compute the sum of squared ratings of all the common preferences
    user1_squared_sum = np.sum([np.square(dataset[user1][item]) for item in rated_by_both])
    user2_squared_sum = np.sum([np.square(dataset[user2][item]) for item in rated_by_both])

    # Compute the sum of products of the common ratings
    product_sum = np.sum([dataset[user1][item] * dataset[user2][item] for item in rated_by_both])

    # Compute the Pearson correlation
    Sxy = product_sum - (user1_sum * user2_sum / num_ratings)
    Sxx = user1_squared_sum - np.square(user1_sum) / num_ratings
    Syy = user2_squared_sum - np.square(user2_sum) / num_ratings

    # A zero denominator means at least one user gave identical ratings
    if Sxx * Syy == 0:
        return 0

    return Sxy / np.sqrt(Sxx * Syy)
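For reference, the quantities computed in the function can be written out as formulas. The symbols $x_i$, $y_i$, and $n$ are notation introduced here, not taken from the original: $x_i$ and $y_i$ stand for the ratings that user1 and user2 gave to the $i$-th common movie, and $n$ corresponds to `num_ratings`.

```latex
\begin{aligned}
S_{xy} &= \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \\
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \\
S_{yy} &= \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
\end{aligned}
```

These are the sum-based forms of $\sum_i (x_i - \bar{x})(y_i - \bar{y})$, $\sum_i (x_i - \bar{x})^2$, and $\sum_i (y_i - \bar{y})^2$, so $r$ is the usual Pearson correlation coefficient and lies between -1 and 1. $S_{xx}$ or $S_{yy}$ is 0 exactly when one user gave the same rating to every common movie, which is the case the function guards against.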
- Define the main function and compute the Pearson correlation score between two users:
if __name__ == '__main__':
    data_file = 'movie_ratings.json'

    # Load the ratings dictionary from the JSON file
    with open(data_file, 'r') as f:
        data = json.loads(f.read())

    user1 = 'John Carson'
    user2 = 'Michelle Peterson'

    print("\nPearson score:")
    print(pearson_score(data, user1, user2))
Output:

Pearson score:
0.39605901719066977
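One way to sanity-check a sum-based implementation like this is to compare it with NumPy's built-in correlation matrix. The sketch below is only an illustration: the users, movies, and ratings are invented (they are not from movie_ratings.json), and `pearson_sums` is a hypothetical compact version of the same calculation, written so the block runs on its own.

```python
import numpy as np

# Hypothetical ratings, not taken from movie_ratings.json
dataset = {
    'Alice': {'Movie A': 5.0, 'Movie B': 3.0, 'Movie C': 4.0, 'Movie D': 1.0},
    'Bob': {'Movie A': 4.0, 'Movie B': 2.5, 'Movie C': 5.0, 'Movie E': 3.0},
}


def pearson_sums(dataset, user1, user2):
    """Pearson score computed from sums over the commonly rated items."""
    common = [item for item in dataset[user1] if item in dataset[user2]]
    n = len(common)
    if n == 0:
        return 0

    x = np.array([dataset[user1][item] for item in common])
    y = np.array([dataset[user2][item] for item in common])

    Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    Sxx = np.sum(np.square(x)) - np.square(np.sum(x)) / n
    Syy = np.sum(np.square(y)) - np.square(np.sum(y)) / n

    if Sxx * Syy == 0:
        return 0
    return Sxy / np.sqrt(Sxx * Syy)


# Ratings for the movies both users have rated, in the same order
common = [item for item in dataset['Alice'] if item in dataset['Bob']]
x = [dataset['Alice'][item] for item in common]
y = [dataset['Bob'][item] for item in common]

score = pearson_sums(dataset, 'Alice', 'Bob')
reference = np.corrcoef(x, y)[0, 1]

print("Sum-based score:", score)
print("np.corrcoef:    ", reference)
```

The two values should agree to within floating-point rounding. Note that `np.corrcoef` does not handle the zero-variance case gracefully, which is one reason to keep the explicit check on the denominator.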