Python 3 Machine Learning Cookbook - Chapter 5: Building Recommendation Engines (26)

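Computing the Pearson correlation score

The Euclidean distance score is a good metric, but it has shortcomings: for example, it treats two users as dissimilar when their tastes agree but one of them rates consistently higher than the other. For this reason, the Pearson correlation score is often used in recommendation engines. Let's see how to compute it.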
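To make the difference concrete, here is a minimal sketch comparing the two scores on made-up ratings. The rating vectors are hypothetical, and the Euclidean score is assumed to take the common form 1 / (1 + distance), which may differ from the exact definition used elsewhere in this series.

```python
import numpy as np

# Hypothetical ratings: user b likes the same movies as user a,
# but rates everything exactly one point higher.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = a + 1.0

# Assumed Euclidean score: 1 / (1 + Euclidean distance between the ratings)
euclidean_score = 1 / (1 + np.sqrt(np.sum(np.square(a - b))))

# Pearson correlation, taken from NumPy's correlation matrix
pearson = np.corrcoef(a, b)[0, 1]

print("Euclidean score:", euclidean_score)
print("Pearson score:", pearson)
```

The Euclidean score drops because of the constant offset, while the Pearson score stays at (numerically) 1 because the two users rank the movies identically.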

  • Create a file named pearson-score.py and import the required packages:

import json
import numpy as np

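  • Next, define a function that computes the Pearson correlation score between two users. The first step is to check that both users are present in the dataset: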

# Returns the Pearson correlation score between user1 and user2
def pearson_score(dataset, user1, user2):
    if user1 not in dataset:
        raise TypeError('User ' + user1 + ' not present in the dataset')

    if user2 not in dataset:
        raise TypeError('User ' + user2 + ' not present in the dataset')

  • Extract the movies that both users have rated:

    # Movies rated by both user1 and user2
    rated_by_both = {}

    for item in dataset[user1]:
        if item in dataset[user2]:
            rated_by_both[item] = 1

    num_ratings = len(rated_by_both)

  • If the two users have no rated movies in common, there is no measurable similarity between them, so return 0:

    # If there are no common movies, the score is 0
    if num_ratings == 0:
        return 0

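  • For the movies both users have rated, compute the sum of each user's ratings, the sum of each user's squared ratings, and the sum of the products of the two users' ratings. From these, compute the terms needed for the Pearson correlation score. Handle the case where the denominator is 0; if everything is fine, return the Pearson correlation score: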

    # Compute the sum of ratings of all the common preferences
    user1_sum = np.sum([dataset[user1][item] for item in rated_by_both])
    user2_sum = np.sum([dataset[user2][item] for item in rated_by_both])

    # Compute the sum of squared ratings of all the common preferences
    user1_squared_sum = np.sum([np.square(dataset[user1][item]) for item in rated_by_both])
    user2_squared_sum = np.sum([np.square(dataset[user2][item]) for item in rated_by_both])

    # Compute the sum of products of the common ratings
    product_sum = np.sum([dataset[user1][item] * dataset[user2][item] for item in rated_by_both])

    # Compute the Pearson correlation
    Sxy = product_sum - (user1_sum * user2_sum / num_ratings)
    Sxx = user1_squared_sum - np.square(user1_sum) / num_ratings
    Syy = user2_squared_sum - np.square(user2_sum) / num_ratings

    # Guard against a zero denominator
    if Sxx * Syy == 0:
        return 0

    return Sxy / np.sqrt(Sxx * Syy)
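For reference, the terms Sxy, Sxx and Syy in the code correspond to the usual computational form of the Pearson correlation. As a sketch, write x_i and y_i for the ratings that user1 and user2 gave to the i-th common movie, and n for num_ratings (this notation is introduced here for illustration and does not appear in the code):

```latex
\begin{aligned}
S_{xy} &= \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \\
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \\
S_{yy} &= \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
\end{aligned}
```

If a user gives the same rating to every common movie, Sxx or Syy is 0 and r is undefined, which is why the function returns 0 in that case.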

  • Define the main block and compute the Pearson correlation score between two users:

if __name__ == '__main__':
    data_file = 'movie_ratings.json'

    # Load the ratings: a dictionary mapping each user to their movie ratings
    with open(data_file, 'r') as f:
        data = json.loads(f.read())

    user1 = 'John Carson'
    user2 = 'Michelle Peterson'

    print("\nPearson score:")
    print(pearson_score(data, user1, user2))

Output:

Pearson score:
0.39605901719066977
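One way to sanity-check a score like this is to compare the same formula against NumPy's built-in np.corrcoef. The sketch below does so on a small made-up dataset. The user names, movie titles and ratings are hypothetical, and the nested user -> movie -> rating layout is only assumed to mirror movie_ratings.json, based on how the function indexes the data.

```python
import json
import numpy as np

# Hypothetical stand-in for movie_ratings.json: user -> {movie title: rating}
raw = '''
{
  "User A": {"Movie 1": 2.5, "Movie 2": 3.5, "Movie 3": 3.0, "Movie 4": 3.5},
  "User B": {"Movie 1": 3.0, "Movie 2": 3.5, "Movie 3": 1.5, "Movie 5": 4.0}
}
'''
data = json.loads(raw)

# Movies rated by both users (sorted so both vectors use the same order)
common = sorted(set(data["User A"]) & set(data["User B"]))

x = np.array([data["User A"][movie] for movie in common])
y = np.array([data["User B"][movie] for movie in common])
n = len(common)

# Same terms as in pearson_score
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
Sxx = np.sum(np.square(x)) - np.square(np.sum(x)) / n
Syy = np.sum(np.square(y)) - np.square(np.sum(y)) / n
score = Sxy / np.sqrt(Sxx * Syy)

# Entry [0, 1] of the correlation matrix is the correlation between x and y
reference = np.corrcoef(x, y)[0, 1]

print("Formula:", score)
print("np.corrcoef:", reference)
```

The two values should agree up to floating-point rounding. Note that np.corrcoef does not guard against a zero denominator, so the explicit check in pearson_score is still needed for users whose common ratings are all identical.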
