Implementing Personalized Search with word2vec and Elasticsearch

05-24

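In an earlier post, "word2vec Study Notes", we looked at word2vec, a tool that builds on and optimizes the neural network language model to produce word vectors and a language model. In our product search system we use word2vec to compute user vectors and product vectors, and we implement personalized search with Elasticsearch's function_score scoring mechanism and a custom script plugin.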

Background

Let's first look at how Wikipedia defines and describes personalized search:

Personalized search refers to web search experiences that are tailored specifically to an individual's interests by incorporating information about the individual beyond specific query provided. Pitkow et al. describe two general approaches to personalizing search results, one involving modifying the user's query and the other re-ranking search results.

From this we can draw two important points:

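Personalized search must take the user's preferences fully into account and show the content the user is interested in first;
There are two main ways to implement personalization: modifying the query and re-ranking the search results.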

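For our e-commerce site, the point of personalized search is that when a user searches for a keyword such as "sweatshirt" (衛衣), the products that user is most interested in and most likely to buy (for example, from a preferred brand or in a preferred style) are shown first, improving both the user experience and click-through conversion.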

Design Approach

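We previously had an earlier version of personalized search. It computed weights for a few important attributes shared by users and products (such as brand, category, and gender), derived a correlation coefficient between each user and product, and then re-ranked results by that coefficient.
That version did not perform well, though. In my view the main reasons were: users do not place equal weight on every product attribute; the set of product attributes we considered was incomplete; and it ignored the relationships between products.
In the new design, we use time-ordered data, namely users' browsing histories, to capture the relationships between users and products and between products. The core idea is to train vector representations from the order in which products appear, much as a language model learns from the order of words.
Once we have vector representations for users and products, we can compute relevance from the distance between vectors and show the products the user is interested in first.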
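
To make the last point concrete, here is a minimal sketch of ranking products by their similarity to a user vector. It assumes cosine similarity as the measure; the function names and the toy 2-dimensional vectors are made up for illustration and are not taken from our system.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Return cos(x, y), or 0.0 when either vector has zero length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def rerank(user_vec: Sequence[float],
           product_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Order product ids by descending similarity to the user vector."""
    return sorted(
        product_vecs,
        key=lambda skn: cosine_similarity(user_vec, product_vecs[skn]),
        reverse=True,
    )


# Toy vectors, purely illustrative.
user = [1.0, 0.0]
products = {
    "skn_a": [0.0, 1.0],   # orthogonal to the user vector
    "skn_b": [2.0, 0.1],   # points almost the same way
    "skn_c": [-1.0, 0.0],  # points the opposite way
}
ranked = rerank(user, products)
```

In the real service the ranking is not done in application code like this; the similarity is folded into the Elasticsearch score instead.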

Implementation Details

Computing Product Vectors

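From users' browsing records over a recent period (for example, the last 30 days), build each user's list of viewed SKNs, separated by spaces. The core logic is shown in the following SQL: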

select concat_ws(' ', collect_set(product_skn)) as skns
from (
    select uid, cast(product_skn as string) as product_skn, click_time_stamp
    from product_click_record
    where date_id <= $date_id and date_id >= $date_id_30_day_ago
    order by uid, click_time_stamp
) as a
group by uid;

Write the result of this SQL to a file to serve as the training input for word2vec.
Run word2vec to train the model and save the result:

time ./word2vec -train $prepare_file -output $result_file -cbow 1 -size 20 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15

Read the vectors from the training result and save them to the product vector table in the search database.
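
The two ends of this pipeline can be sketched in a few lines of Python. This is an illustration under assumptions, not our production job: `build_sentences` and `parse_word2vec_text` are hypothetical helper names, the input rows are assumed to be `(uid, skn, click_timestamp)` tuples, and the output is assumed to be in word2vec's text format (a `vocab_size dim` header line followed by one `token v1 v2 ...` line per token), which is what the tool writes when `-binary` is not set. Unlike `collect_set` in the SQL, this sketch keeps repeated clicks and explicitly preserves click order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_sentences(clicks: Iterable[Tuple[int, str, int]]) -> List[str]:
    """Turn (uid, skn, click_timestamp) rows into one line per user.

    Each line lists the SKNs the user viewed, space-separated and ordered
    by click time, which is the "sentence" word2vec trains on.
    """
    by_user = defaultdict(list)
    for uid, skn, ts in clicks:
        by_user[uid].append((ts, skn))
    return [
        " ".join(skn for _, skn in sorted(rows))
        for _, rows in sorted(by_user.items())
    ]


def parse_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text output into {token: vector}."""
    it = iter(lines)
    _vocab_size, dim = (int(x) for x in next(it).split())
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.split()
        if not parts:
            continue
        token, values = parts[0], [float(v) for v in parts[1:]]
        if token == "</s>":
            # word2vec's end-of-sentence token, not a product.
            continue
        if len(values) != dim:
            raise ValueError("unexpected dimension for token %r" % token)
        vectors[token] = values
    return vectors


# Illustrative data only.
sentences = build_sentences([(2, "300", 5), (1, "100", 20), (1, "200", 10)])
vectors = parse_word2vec_text(
    ["3 2", "</s> 0.0 0.0", "100 0.1 0.2", "200 -0.3 0.4"]
)
```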

Computing User Vectors

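User vectors are computed with a simplification: take the products a user browsed over a recent period (for example, the last 30 days) and average their vectors dimension by dimension. The core logic is as follows: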

vec_list = []
for i in range(feature_length):
    vec_list.append("avg(coalesce(b.vec[%s], 0))" % (str(i)))
vec = ', '.join(vec_list)

select a.uid as uid, array(%s) as vec
from (select * from product_click_record where date_id <= $date_id and date_id >= $date_id_30_day_ago) as a
left outer join (select * from product_w2v where date_id = $date_id) as b
on a.product_skn = b.product_skn
group by a.uid;

Save the resulting user vectors to Redis, where the search service can fetch them.
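
The same averaging can be written out in plain Python, which may be easier to follow than the generated SQL. This is a sketch under assumptions: `user_vector` and `to_param` are hypothetical names, and products without a vector are counted as zeros to mirror `coalesce(b.vec[i], 0)` combined with the left outer join. The comma-separated string form matches what the scoring script expects for its `inputFeatureVector` parameter; how the value is actually stored in Redis is not shown here.

```python
from typing import Dict, List, Sequence


def user_vector(viewed_skns: Sequence[str],
                product_vecs: Dict[str, Sequence[float]],
                dim: int) -> List[float]:
    """Average the vectors of the viewed products, one dimension at a time.

    A viewed SKN with no vector still counts in the denominator and
    contributes zeros, like coalesce(..., 0) over a left outer join.
    """
    if not viewed_skns:
        return [0.0] * dim
    total = [0.0] * dim
    for skn in viewed_skns:
        vec = product_vecs.get(skn)
        if vec is None:
            continue  # contributes zeros to every dimension
        for i in range(dim):
            total[i] += vec[i]
    return [t / len(viewed_skns) for t in total]


def to_param(vec: Sequence[float]) -> str:
    """Serialize a vector as the comma-separated string the script parses."""
    return ",".join(str(v) for v in vec)


# Illustrative data: "x" has no trained vector.
avg = user_vector(["a", "b", "x"], {"a": [1.0, 2.0], "b": [3.0, 4.0]}, dim=2)
param = to_param([0.5, -1.0])
```

One consequence of this simplification is worth keeping in mind: every viewed product has equal weight, regardless of how recently or how often it was viewed.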

Adding a Personalization Score at Search Time

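When rebuilding the index, the product index builder writes each product vector into a field of product_index, such as the productFeatureVector field in the example below.
On top of its default cross_fields relevance scoring, the search service needs to add a personalization score, which can be done with function_score.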

Map<String, Object> scriptParams = new HashMap<>();
scriptParams.put("field", "productFeatureVector");
scriptParams.put("inputFeatureVector", userVector);
scriptParams.put("version", version);
Script script = new Script("feature_vector_scoring_script", ScriptService.ScriptType.INLINE, "native", scriptParams);
functionScoreQueryBuilder.add(ScoreFunctionBuilders.scriptFunction(script));
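
For readers who think in request bodies rather than the Java builder API, the query built above should correspond roughly to the JSON produced by the sketch below. Treat it as an approximation that assumes an Elasticsearch 2.x-style script object (`inline`, `lang`, `params`); the `personalized_query` helper, the `productName` and `brandName` fields, and the sample parameter values are all hypothetical. `boost_mode` is spelled out as `multiply`, which is the function_score default, so the keyword relevance score is multiplied by the script's score.

```python
import json
from typing import Any, Dict


def personalized_query(keyword: str, user_vector: str, version: str) -> Dict[str, Any]:
    """Build a function_score request body with a native script function."""
    return {
        "query": {
            "function_score": {
                # Base relevance; the field names here are hypothetical.
                "query": {
                    "multi_match": {
                        "query": keyword,
                        "type": "cross_fields",
                        "fields": ["productName", "brandName"],
                    }
                },
                "functions": [
                    {
                        "script_score": {
                            "script": {
                                "inline": "feature_vector_scoring_script",
                                "lang": "native",
                                "params": {
                                    "field": "productFeatureVector",
                                    "inputFeatureVector": user_vector,
                                    "version": version,
                                },
                            }
                        }
                    }
                ],
                # Default behaviour, stated explicitly for clarity.
                "boost_mode": "multiply",
            }
        }
    }


# Sample values are made up for illustration.
body = personalized_query("sweatshirt", "0.1,0.2", "20170524")
body_json = json.dumps(body, indent=2)
```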

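Relevance is scored here with the elasticsearch-feature-vector-scoring plugin, which is built around the cosine similarity of two vectors; see the next section for details. Among the script parameters, field is the index field that stores the product feature vector, and inputFeatureVector is the input vector, which here is the user vector.
The version parameter deserves a separate explanation. The vectors computed each day are different, and the individual dimensions do not correspond to any concrete product attribute (at least none that we can see so far), so we must take particular care never to compute relevance between vectors produced at different times. In our implementation, an intermediate variable records the latest version: it is updated only after the product vectors and user vectors have been computed and pushed to search. The search index builder polls this variable periodically, and when it detects a change it bulk-updates the product feature vectors in ES, so that subsequent searches use the new version of the vectors.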
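
The version handoff can be illustrated with a small in-memory simulation. Everything here is a hypothetical stand-in: `VectorStore` plays the role of the vector tables plus the intermediate variable, `IndexBuilder` plays the role of the search index builder, and a plain dict replaces the ES index. The one detail taken from the plugin is the field format, a `version|` prefix followed by comma-separated values.

```python
from typing import Dict, List, Optional


def encode_field(version: str, vec: List[float]) -> str:
    """Format a vector as 'version|v1,v2,...' for the index field."""
    return version + "|" + ",".join(str(v) for v in vec)


class VectorStore:
    """Stand-in for the vector tables and the 'latest version' variable."""

    def __init__(self) -> None:
        self.vectors_by_version: Dict[str, Dict[str, List[float]]] = {}
        self.latest_version: Optional[str] = None

    def publish(self, version: str, vectors: Dict[str, List[float]]) -> None:
        # Write every vector first and flip the version pointer last,
        # so a reader never sees a half-written version.
        self.vectors_by_version[version] = vectors
        self.latest_version = version


class IndexBuilder:
    """Stand-in for the index builder that polls for new versions."""

    def __init__(self, store: VectorStore) -> None:
        self.store = store
        self.indexed_version: Optional[str] = None
        self.index: Dict[str, str] = {}  # skn -> field value

    def poll(self) -> bool:
        """Re-index vectors if a new version was published."""
        latest = self.store.latest_version
        if latest is None or latest == self.indexed_version:
            return False
        for skn, vec in self.store.vectors_by_version[latest].items():
            self.index[skn] = encode_field(latest, vec)
        self.indexed_version = latest
        return True


store = VectorStore()
builder = IndexBuilder(store)
first_poll = builder.poll()             # nothing published yet
store.publish("v1", {"100": [0.1, 0.2]})
second_poll = builder.poll()            # picks up version v1
third_poll = builder.poll()             # nothing new
```

While a bulk update is in flight, some documents still carry the old version prefix. The scoring script handles that case by returning its base constant for any document whose prefix does not match the requested version.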

The elasticsearch-feature-vector-scoring Plugin

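This is a plugin I wrote myself; see the project home page for usage details. Its core is a single class, and I have pasted the main code and comments below: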

public class FeatureVectorScoringSearchScript extends AbstractSearchScript {

    public static final ESLogger LOGGER = Loggers.getLogger("feature-vector-scoring");
    public static final String SCRIPT_NAME = "feature_vector_scoring_script";

    private static final double DEFAULT_BASE_CONSTANT = 1.0D;
    private static final double DEFAULT_FACTOR_CONSTANT = 1.0D;

    // field in index to store feature vector
    private String field;

    // version of feature vector, if it isn't null, it should match version of index
    private String version;

    // final_score = baseConstant + factorConstant * cos(X, Y)
    private double baseConstant;

    // final_score = baseConstant + factorConstant * cos(X, Y)
    private double factorConstant;

    // input feature vector
    private double[] inputFeatureVector;

    // cos(X, Y) = Σ(Xi * Yi) / ( sqrt(Σ(Xi * Xi)) * sqrt(Σ(Yi * Yi)) )
    // the inputFeatureVectorNorm is sqrt(Σ(Xi * Xi))
    private double inputFeatureVectorNorm;

    public static class ScriptFactory implements NativeScriptFactory {
        @Override
        public ExecutableScript newScript(@Nullable Map<String, Object> params) throws ScriptException {
            return new FeatureVectorScoringSearchScript(params);
        }

        @Override
        public boolean needsScores() {
            return false;
        }
    }

    private FeatureVectorScoringSearchScript(Map<String, Object> params) throws ScriptException {
        this.field = (String) params.get("field");
        String inputFeatureVectorStr = (String) params.get("inputFeatureVector");
        if (this.field == null || inputFeatureVectorStr == null || inputFeatureVectorStr.trim().length() == 0) {
            throw new ScriptException("Initialize script " + SCRIPT_NAME + " failed!");
        }
        this.version = (String) params.get("version");
        this.baseConstant = params.get("baseConstant") != null
                ? Double.parseDouble(params.get("baseConstant").toString()) : DEFAULT_BASE_CONSTANT;
        this.factorConstant = params.get("factorConstant") != null
                ? Double.parseDouble(params.get("factorConstant").toString()) : DEFAULT_FACTOR_CONSTANT;

        String[] inputFeatureVectorArr = inputFeatureVectorStr.split(",");
        int dimension = inputFeatureVectorArr.length;
        double sumOfSquare = 0.0D;
        this.inputFeatureVector = new double[dimension];
        double temp;
        for (int index = 0; index < dimension; index++) {
            temp = Double.parseDouble(inputFeatureVectorArr[index].trim());
            this.inputFeatureVector[index] = temp;
            sumOfSquare += temp * temp;
        }
        this.inputFeatureVectorNorm = Math.sqrt(sumOfSquare);
        LOGGER.debug("FeatureVectorScoringSearchScript.init, version:{}, norm:{}, baseConstant:{}, factorConstant:{}.",
                this.version, this.inputFeatureVectorNorm, this.baseConstant, this.factorConstant);
    }

    @Override
    public Object run() {
        if (this.inputFeatureVectorNorm == 0) {
            return this.baseConstant;
        }
        if (!doc().containsKey(this.field) || doc().get(this.field) == null) {
            LOGGER.error("cannot find field {}.", field);
            return this.baseConstant;
        }
        String docFeatureVectorStr = ((ScriptDocValues.Strings) doc().get(this.field)).getValue();
        return calculateScore(docFeatureVectorStr);
    }

    public double calculateScore(String docFeatureVectorStr) {
        // 1. check docFeatureVector
        if (docFeatureVectorStr == null) {
            return this.baseConstant;
        }
        docFeatureVectorStr = docFeatureVectorStr.trim();
        if (docFeatureVectorStr.isEmpty()) {
            return this.baseConstant;
        }

        // 2. check version and get feature vector array of document
        String[] docFeatureVectorArr;
        if (this.version != null) {
            String versionPrefix = version + "|";
            if (!docFeatureVectorStr.startsWith(versionPrefix)) {
                return this.baseConstant;
            }
            docFeatureVectorArr = docFeatureVectorStr.substring(versionPrefix.length()).split(",");
        } else {
            docFeatureVectorArr = docFeatureVectorStr.split(",");
        }

        // 3. check the dimension of input and document
        int dimension = this.inputFeatureVector.length;
        if (docFeatureVectorArr == null || docFeatureVectorArr.length != dimension) {
            return this.baseConstant;
        }

        // 4. calculate the relevance score of the two feature vector
        double sumOfSquare = 0.0D;
        double sumOfProduct = 0.0D;
        double tempValueInDouble;
        for (int i = 0; i < dimension; i++) {
            tempValueInDouble = Double.parseDouble(docFeatureVectorArr[i].trim());
            sumOfProduct += tempValueInDouble * this.inputFeatureVector[i];
            sumOfSquare += tempValueInDouble * tempValueInDouble;
        }
        if (sumOfSquare == 0) {
            return this.baseConstant;
        }
        double cosScore = sumOfProduct / (Math.sqrt(sumOfSquare) * inputFeatureVectorNorm);
        return this.baseConstant + this.factorConstant * cosScore;
    }
}
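
To see what scores the script produces, it can help to trace the same steps outside Elasticsearch. The sketch below re-expresses `calculateScore` in Python for experimentation only; it is not an exact port (for example, Python's `split` treats trailing commas differently from Java's), and `calculate_score` with its keyword arguments is a name made up for this illustration.

```python
import math
from typing import Optional, Sequence


def calculate_score(doc_field: Optional[str],
                    input_vec: Sequence[float],
                    version: Optional[str] = None,
                    base: float = 1.0,
                    factor: float = 1.0) -> float:
    """Return base + factor * cos(input_vec, doc_vec), or base on any mismatch."""
    input_norm = math.sqrt(sum(v * v for v in input_vec))
    if input_norm == 0.0:
        return base

    # 1. Missing or empty document field.
    if doc_field is None or not doc_field.strip():
        return base
    doc_field = doc_field.strip()

    # 2. The stored vector must carry the requested version prefix.
    if version is not None:
        prefix = version + "|"
        if not doc_field.startswith(prefix):
            return base
        doc_field = doc_field[len(prefix):]

    # 3. Dimensions must agree.
    parts = doc_field.split(",")
    if len(parts) != len(input_vec):
        return base

    # 4. Cosine similarity, shifted and scaled.
    doc_vec = [float(p.strip()) for p in parts]
    sum_of_square = sum(v * v for v in doc_vec)
    if sum_of_square == 0.0:
        return base
    dot = sum(d * u for d, u in zip(doc_vec, input_vec))
    cos_score = dot / (math.sqrt(sum_of_square) * input_norm)
    return base + factor * cos_score


same_direction = calculate_score("v1|1,0", [1.0, 0.0], version="v1")
opposite = calculate_score("v1|-1,0", [1.0, 0.0], version="v1")
stale_version = calculate_score("v0|1,0", [1.0, 0.0], version="v1")
```

With the default constants of 1.0, the score ranges from 0 (opposite direction) to 2 (same direction), and every fallback case returns 1. When this score is multiplied into the keyword relevance score, a product with no usable vector is therefore left unchanged rather than penalized.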

Summary and Future Improvements

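With word2vec, Elasticsearch, and a custom script plugin, we built a personalized search service. Compared with the previous implementation, the new version substantially improved both click-through rate and conversion rate.
The word2vec-based product vectors have another use: recommending similar products.
In my view, though, using word2vec for personalized search or personalized recommendation has real limitations: it can only work with time-ordered data such as a user's click history and cannot take a user's preferences into account comprehensively. There is still a lot of room for improvement here.
Going forward, we will draw more on industry practice and model user preferences more fully. We also need to account for timeliness in order to improve product ranking and recommendation.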

References

Personalized search Wiki
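The Next Stop for Search: Basic Methods and Simple Experiments in Personalized Search (搜索下一站：個性化搜索基本方法和簡單實驗)
JD.com's Personalized E-commerce Search Engine Based on Big Data Technology (京東基於大數據技術的個性化電商搜索引擎)
Why Can't Taobao Do Personalized Recommendation and Search Yet? (淘寶為什麼還不能實現個性化推薦和搜索？)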