《Ranking Relevance in Yahoo Search》(上)

01-29

寫在前面：

這是Yahoo解散前又釋放出的一點熱量，也是KDD2016 best paper。

一句話簡述：

從評估指標，模型，特徵等角度介紹了搜索引擎上層的全貌，並對排序問題，基礎相關性問題，擴大召回問題，以及時間地點感知問題的給出了解決方案。

ABSTRACT

We introduce three key techniques for base relevance:ranking functions, semantic matching features and query rewriting.
We also describe solutions for recency and location sensitive relevance.

主要介紹幾個大的方向，ranking，語義匹配，query rewriting（即query理解與改寫），以及時效性與地域性。

（這裡其實表徵了搜索引擎幾個大的方面，排序問題，基礎相關性問題，擴大召回問題，以及時間地點感知。但為何沒有提及重要的反饋信息--點擊模型？後文中提及「...semantic matching features including click similarity...」，相當於把click歸類到semantic matching來進行闡述。）

INTRODUCTION

Relevance challenges:
1 Query language is usually different from document language.
2 The majority of queries are tail queries.
3 Comprehensive relevance includes both temporal and spatial dimensions.

即query口語化描述與doc描述存在差異，如「特斯拉多少錢」，實際上在大多數網頁中是以「價格」這樣字面存在的。

當前長尾詞越來越多，缺少反饋信息。

同時，隨著移動端的發展，對於地域和時間的感知度也越來越高了。

BACKGROUND

Overview of Architecture

parallelizing the work for one query across many servers; and repeatedly taking the best candidate documents from a cheaper ranking function and reranking them using a better one.

使用一個粗粒度的rank方式，並行地在多台伺服器上選擇出候選項結果；然後再將這些結果匯總，使用精細的rank方式(Core Ranking Function)再計算一次。這樣的做的目的主要是節約性能開銷。

在第一輪的召回過程中，有個聚合操作，對同一站點或主域的結果數量做個限制，目的是維護結果的多樣性。

Obtaining labeled samples is very expensive and time-consuming.

對於Core Ranking Function，由於扮演著重要角色，其所需的訓練數據也是需要非常準確的，同時量級又不小。通常這樣一份數據，需要花費大量的人力來進行標註和維護。

Ranking Features

這裡所說的Ranking Features，是指第二輪rank方法中使用的features。實際搜索引擎使用的feature可以達到幾百幾千的量級，可以將它們分為以下幾個大類：

Web graph: determine the quality or the popularity of a document based on its connectivity in the web graph.
Document statistics: some basic statistics of the document such as the number of words in various fields.
Document classifier: such as spam, adult, language, maintopic, quality, type of page.
Query Features: number of terms, frequency of the query and of its terms,click-through rate of the query.
Text match: Basic text matching features are computed from different sections of the document (title,body, abstract, keywords) as well as from the anchor text and the URL.
Topical matching: tries to go beyond similarity at the word level and compute similarity at the topic level.
Click: These features try to incorporate user feedback，different click probabilities can be computed:probability of click, first click, last click, long dwell time click or only click.
Time: For time sensitive queries, the freshness of a page is important.
There are several features which measure the age of a document as well as the one of its inlinks and outlinks.

（感覺作者給的分類不是太好，因為Text match和Topical matching可以統一成matching類；關於Time，雖然其也很重要，但是在作為幾個大類中的一個倒是稍有牽強。

這樣的話，features可以分為以下幾個大類：權威性，doc統計和分析類，query相關類，query-doc匹配類，點擊，時效類。

此外，這裡提及Time時，只提及了衡量doc時間的問題，即時間因子的提取。這裡忽略了另外一個問題，就是query time程度的識別，因為不是所有query都對doc有明顯時間需求，並且需求的程度也可能不一樣。）

Evaluation of Search Relevance

對一個搜索結果的評估可以分為human labeling和user behavioral metrics，後者包括點擊率，換query比例，停留時長等重要指標。

（前者的優勢是精度高，可以更好的評估準確率，但劣勢是成本過高，過於費時費力。相比之下，後者更加容易統計，但也會有很多的干擾因素在其中。）

they may be affected by other factors, such as presentation (e.g.fonts, bold, size, abstract summary etc.) and result diversity.

（這裡沒有作者沒有提及重要的一個影響因素，position）

人工標註的方式是對query-url進行5檔標註（ Perfect, Excellent, Good, Fair or Bad.），然後來計算序列的DCG（Discounted Cumulative Gain）：

Gi是第i個位置上的label，可以看出分母其實表示的就是位置的權重，當把label越高的結果排在越高的位置，該序列的DCG得分越高。

（不同的數據集具有不同的DCG，DCG的取值不僅和排序結果有關，也和數據集中實際的label分布有關，這樣兩份不同數據集上的DCG不具有可比性。其實NDCG更好，更具有可比性，它是對DCG做了歸一，即除以該序列理想情況下的DCG，也稱IDCG）

MACHINE LEARNED RANKING

Search can be treated as a binary problem. In our experiments,we observe that gradient boosting trees (GBDT) with logistic loss usually are able to reduce the bad URLs at top positions.

具體第二輪使用的core ranking 是GBDT with logistic loss。它的log likelihood(loss)如下：

為了表徵樣本的不同，對其賦予了不同的權重。 (e.g., 3 for Perfect, 2 for Excellent, and 1 for Good).

這裡給出的結論是LogisticRank要比GBRank和LambdaMart效果要好。

（這裡有所懷疑，因為從實踐的角度來看pairwise的要優於pointwise，網路上也發現其他對此也有懷疑：gbRank & logsitcRank自頂向下）

（可見，調整樣本分布以及樣本賦權是個精細活。）

Contextual Reranking

上述的core ranking只是考慮Q-U對，但並沒有考慮該query其他候選url中可以提取的信息。由於候選項集合是海量的，通常考慮取其中的topN。

（相當於僅看一個Q-U對是不夠的，還要和其他候選項整體的有所比較。）

We extract following contextual features for specific existing features ：
(1) Rank: sorting URLs by the feature value in ascending order to get the ranks of specific URLs.
(2) Mean: calculating the mean of the feature values of top 30 URLs.
(3) Variance: calculating the variance of the feature values of top 30 URLs.
(4) Normalized feature: normalizing the feature by using mean and standard deviation.
(5) Topic model feature: aggregating the topical distributions of 30 URLs to create a query topic model vector, and calculating similarity with each individual result.

（未完待續。）