Intelligent Operations Systems (Part 1)

01-30

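This post is an exploration of intelligent operations systems. The paper discussed here is "Focus: Shedding Light on the High Search Response Time in the Wild", from Professor Dan Pei of Tsinghua University. Its goal is to use machine learning algorithms to find the causes of, and rules behind, the anomaly once high search response time has been observed during operations. The system (Focus) has been run on 2.5 months of data and has analyzed billions of log entries. The main content of the paper is described in detail below.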

Problem description:

To help search operators debug HSRT (high search response time), Focus is a search log analysis framework that answers three questions:

(1) What is the HSRT condition?

(2) Which HSRT condition types are prevalent across days?

(3) How does each attribute affect SRT in those prevalent HSRT condition types?

Solution:

Focus has one component for each of the above questions:

(1) A decision tree based classifier to identify HSRT conditions in search logs of each day;

(2) A clustering based condition type miner to combine similar HSRT conditions into one type, and find the prevalent condition types across days, following the principle of Occam's razor.

(3) An attribute effect estimator to analyze the effect of each individual attribute on SRT within a prevalent condition type.

Background:

(A) Search Logs:

For each measured query, its search log records two types of data: SRT and SRT components, and query attributes.

(1) SRT and SRT components (feature level):

$t_{1}$ is when a query is submitted; $t_2$ is when the result HTML file has been downloaded; $t_3$ is when a browser finishes parsing the HTML; $t_4$ is when the page is completely rendered. SRT is measured by $t_4-t_1$, the user-perceived search response time. $T_{server}$ is the server response time of the HTML file, which is recorded by servers; $T_{net}=t_{2}-t_{1}-T_{server}$ is the network transmission time of the HTML file; $T_{browser}=t_{3}-t_{2}$ is the browser parsing time of the HTML; $T_{other}=t_{4}-t_{3}$ is the remaining time spent before the page is rendered, e.g. download time of images from image servers.
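The decomposition above can be written out as a small sketch. The class and field names below are illustrative assumptions, not the log schema used in the paper; only the formulas come from the definitions above.

```python
from dataclasses import dataclass


@dataclass
class QueryTiming:
    """Timestamps in seconds for one measured query (field names are hypothetical)."""
    t1: float        # query submitted
    t2: float        # result HTML downloaded
    t3: float        # browser finished parsing the HTML
    t4: float        # page completely rendered
    t_server: float  # server response time, as recorded by the servers


def srt_components(q: QueryTiming) -> dict:
    """Split SRT into the four components defined in the text."""
    return {
        "srt": q.t4 - q.t1,
        "t_server": q.t_server,
        "t_net": q.t2 - q.t1 - q.t_server,
        "t_browser": q.t3 - q.t2,
        "t_other": q.t4 - q.t3,
    }


# Made-up timings, chosen only to illustrate the arithmetic.
parts = srt_components(QueryTiming(t1=0.0, t2=0.5, t3=0.75, t4=1.25, t_server=0.25))
```

By construction the four components add up to the SRT, since $T_{server}+T_{net}=t_2-t_1$.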

(2) Query Attributes (feature level):

The search logs record the following attributes for each measured query:

(i) Browser Engine: Webkit (e.g. Chrome, Safari and 360 Secure Browser), Gecko, Trident LEGC, Trident 4.0, Trident 5.0, and others.

(ii) ISP: China Telecom, China Unicom, China Mobile, China Netcom, China Tietong, others.

(iii) Location: the client IP is converted to its geographic location. In total, there are 32 provinces.

(iv) #Image: the number of embedded images in the result page.

(v) Ads: whether or not a result page contains paid advertising links.

(vi) Loading Mode: The loading mode of a result page can be either synchronous or asynchronous.

(vii) Background page views: On the service side, the search engine S also post-analyzes the logs and generates the background page views. The background PVs (page views) for a query q are measured by the number of queries served within 30 seconds before and after q is served. This reflects the average search request load at the time q is served. Due to confidentiality constraints, we normalize specific background PVs by the maximum value. (In other words, some necessary features are computed by post-hoc analysis and then fed into the machine learning model of the Focus system.)
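As a rough sketch of how such a post-hoc feature could be computed, the function below counts, for each query, the queries served within 30 seconds before and after it, then normalizes by the maximum. The function name is hypothetical, and excluding the query itself from its own count is an assumption; the text does not say either way.

```python
from bisect import bisect_left, bisect_right


def background_pvs(timestamps, window=30.0):
    """Normalized background page views for each query.

    timestamps: serving times in seconds, in any order.
    Returns one value in [0, 1] per input timestamp, in the input order.
    """
    ordered = sorted(timestamps)
    counts = []
    for t in timestamps:
        lo = bisect_left(ordered, t - window)   # first query at or after t - window
        hi = bisect_right(ordered, t + window)  # one past the last query at or before t + window
        counts.append(hi - lo - 1)              # assumption: do not count the query itself
    peak = max(counts, default=0)
    return [c / peak if peak else 0.0 for c in counts]


# Made-up serving times: three queries close together and one isolated query.
pvs = background_pvs([0.0, 20.0, 40.0, 100.0])
```

Sorting once and using binary search keeps the cost at O(n log n), which matters when a day of logs holds a very large number of queries.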

(B) HSRT and HSRT Conditions (sample level):

Usually, we can use the cumulative distribution function (CDF) of SRT in the search logs to choose the threshold that defines high search response time (HSRT). In this paper, we define HSRT as an SRT longer than 1s.

Challenges of identifying HSRT conditions in multi-dimensional search logs (the following are some of the difficulties and challenges for this system):
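A minimal sketch of reading the HSRT fraction off the empirical CDF, assuming the 1s threshold stated above. The sample values are made up for illustration and do not come from the paper.

```python
def empirical_cdf(srts, threshold):
    """Fraction of queries whose SRT is at most `threshold` (seconds)."""
    if not srts:
        raise ValueError("need at least one SRT sample")
    return sum(1 for s in srts if s <= threshold) / len(srts)


HSRT_THRESHOLD = 1.0  # seconds, as defined in the text

sample_srts = [0.4, 0.6, 0.9, 1.2, 2.5]  # illustrative values only
hsrt_fraction = 1.0 - empirical_cdf(sample_srts, HSRT_THRESHOLD)
```

The global HSRT fraction computed this way is the baseline against which individual conditions are later compared.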

(a) Naive Single Dimensional Based Methods: these include pair-wise correlation analysis and so on, but they are inefficient.

(b) Attributes can potentially be interdependent: this means the Naive Bayes method may not be applicable in this situation.

(c) Need to avoid outputting overlapping conditions: like {#image>30}, {ads=yes}, and {#image>20, ads=yes}. (As time goes on, running the model every day may produce similar or duplicate rules.)

Key ideas and system overview

A condition is a combination of attributes and specific values in search logs.

An HSRT condition is a condition that covers at least 1% of total queries, and has a fraction of HSRT larger than the global level:

(# of HSRT queries in a HSRT condition / # of queries in a HSRT condition) > (# of HSRT queries / # of queries). This definition is used to assign labels, and we can change it in practice. (It is only a labeling definition used to decide what counts as HSRT; in a real application a different definition can be adopted depending on the scenario, for example one based on return codes or other metrics.)
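The two-part definition above translates directly into a check on four counts. This is only a sketch; the function and parameter names are hypothetical.

```python
def is_hsrt_condition(n_cond, n_cond_hsrt, n_total, n_total_hsrt, min_coverage=0.01):
    """Return True if a condition qualifies as an HSRT condition.

    n_cond:       queries matching the condition
    n_cond_hsrt:  HSRT queries matching the condition
    n_total:      all queries
    n_total_hsrt: all HSRT queries
    """
    if n_total == 0 or n_cond == 0:
        return False
    covers_enough = n_cond / n_total >= min_coverage
    worse_than_global = n_cond_hsrt / n_cond > n_total_hsrt / n_total
    return covers_enough and worse_than_global


# Illustrative counts: 100,000 queries, of which 20,000 are HSRT (global level 20%).
qualifies = is_hsrt_condition(2000, 800, 100_000, 20_000)
too_small = is_hsrt_condition(500, 400, 100_000, 20_000)
not_worse = is_hsrt_condition(2000, 200, 100_000, 20_000)
```

Here `qualifies` passes both tests, `too_small` covers only 0.5% of queries, and `not_worse` has an HSRT fraction below the global level.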

Focus System Overview:

Input: search logs

(i) Use a decision tree based classifier to identify HSRT conditions in search logs every day. (The decision tree model is rerun on each day's logs to extract that day's HSRT conditions.)

(ii) Use a clustering based condition type miner to identify condition types of similar HSRT conditions, and find prevalent condition types across days. (This merges similar conditions together.)

(iii) Use an attribute effect estimator to analyze how an attribute affects SRT and SRT components in each prevalent condition type. (This determines which attributes or features have the strongest influence on the label.)

Output: prevalent condition types and their attributes' effects on SRT. (That is, the conditions produced by step two and the attribute importance produced by step three.)

Part (i): Decision Tree Based Classifier (decision tree algorithms include ID3, C4.5, and CART). It contains five important parts: (1) expressing attribute splits; (2) evaluating splits; (3) stopping tree growth; (4) assigning labels: assign HSRT labels to the leaf nodes whose fraction of HSRT is larger than the global fraction of HSRT; (5) identifying HSRT branching attribute conditions. (This is the machine learning algorithm that the Focus system uses.)

Part (ii): Condition Type Miner: group HSRT conditions according to (1) the same combination of attributes, (2) the same value for each categorical attribute, and (3) a similar interval for each numeric attribute, using the Jaccard index to measure the similarity between intervals. (This is the condition-merging step.)
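To make "evaluating splits" concrete, here is a sketch that scores a numeric threshold split by information gain, the criterion used by ID3. The text above does not say which criterion Focus uses, so treat this choice, and all names below, as assumptions.

```python
from math import log2


def entropy(n_hsrt, n_total):
    """Binary entropy (in bits) of the HSRT label within a group of queries."""
    if n_total == 0:
        return 0.0
    p = n_hsrt / n_total
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(rows, attribute, threshold):
    """Gain from splitting rows into attribute <= threshold and attribute > threshold.

    rows: list of (attributes_dict, is_hsrt) pairs.
    """
    n = len(rows)
    if n == 0:
        return 0.0
    left = [hsrt for attrs, hsrt in rows if attrs[attribute] <= threshold]
    right = [hsrt for attrs, hsrt in rows if attrs[attribute] > threshold]
    parent = entropy(sum(hsrt for _, hsrt in rows), n)
    children = sum(len(side) / n * entropy(sum(side), len(side)) for side in (left, right))
    return parent - children


# Made-up rows: queries with many images are the slow ones.
rows = [({"images": 5}, False), ({"images": 10}, False),
        ({"images": 25}, True), ({"images": 40}, True)]
gain_at_20 = information_gain(rows, "images", 20)
gain_at_7 = information_gain(rows, "images", 7)
```

A tree builder would try candidate thresholds for each attribute and keep the split with the highest gain; in this toy data the split at 20 separates the labels perfectly.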
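The interval similarity in criterion (3) can be sketched as follows, assuming each numeric condition is represented as a finite interval `(low, high)`. An open-ended condition such as #image>30 would first need an upper bound (for example the largest observed value), which is an assumption made here and not something the text specifies.

```python
def interval_jaccard(a, b):
    """Jaccard index of two finite intervals: overlap length / union length."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    overlap = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    union = (a_hi - a_lo) + (b_hi - b_lo) - overlap
    # Two zero-length intervals have no measurable union; report 0.0 in that case.
    return overlap / union if union > 0 else 0.0


partial = interval_jaccard((20, 40), (30, 50))   # overlap 10, union 30
disjoint = interval_jaccard((0, 10), (20, 30))
identical = interval_jaccard((20, 40), (20, 40))
```

Two conditions on the same numeric attribute could then be merged into one type when this score exceeds a chosen similarity threshold.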

Part (iii): Attribute Effect Estimator: With each condition type $C=\{c_{1}\wedge c_{2}\wedge \cdots \wedge c_{i}\wedge \cdots \wedge c_{n}\}$, we design a method to understand how each attribute condition $c_{i}$ affects SRT.

For example, what is the HSRT fraction caused by $c_{i}$ in $C$? What SRT components (e.g. $T_{net}$ and $T_{server}$) are affected by $c_{i}$?

Main Idea: flip condition $c_{i}$ to the opposite $\overline{c}_{i}$ to get a variant condition type

$C_{i}=\{c_{1}\wedge c_{2}\wedge \cdots \wedge \overline{c}_{i}\wedge \cdots \wedge c_{n}\}$

From the past days of logs, we have the total number of HSRT events, the number of HSRT events in condition $C$, and the number of HSRT events in condition $C_{i}$. As a result, we believe a comparison based on historical data can provide a reasonable estimate of the attribute effects. The comparison between $C$ and $C_{i}$ over these days is based on the specific HSRT conditions of those days. (This is used to determine which attributes are more likely to cause HSRT.)
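The flipping idea can be sketched with predicates over query records. This is an illustration under stated assumptions, not the estimator from the paper: the record fields, function names, and sample data are all hypothetical, and the effect is measured only as the change in HSRT fraction.

```python
def hsrt_fraction(queries, conditions):
    """HSRT fraction among queries that satisfy every condition, or None if none match."""
    matched = [q for q in queries if all(cond(q) for cond in conditions)]
    if not matched:
        return None
    return sum(q["hsrt"] for q in matched) / len(matched)


def flip_effect(queries, conditions, i):
    """Change in HSRT fraction when condition i is replaced by its opposite.

    A positive value means HSRT gets worse after flipping; negative means it improves.
    Returns None when either C or the variant C_i matches no historical query.
    """
    original = conditions[i]
    flipped = list(conditions)
    flipped[i] = lambda q: not original(q)
    base = hsrt_fraction(queries, conditions)
    variant = hsrt_fraction(queries, flipped)
    if base is None or variant is None:
        return None
    return variant - base


# Made-up historical queries for the condition type {#image > 30, ads = yes}.
queries = [
    {"images": 40, "ads": True, "hsrt": True},
    {"images": 35, "ads": True, "hsrt": True},
    {"images": 50, "ads": True, "hsrt": False},
    {"images": 45, "ads": True, "hsrt": True},
    {"images": 10, "ads": True, "hsrt": False},
    {"images": 5, "ads": True, "hsrt": False},
    {"images": 8, "ads": True, "hsrt": True},
    {"images": 12, "ads": True, "hsrt": False},
]
conditions = [lambda q: q["images"] > 30, lambda q: q["ads"]]
image_effect = flip_effect(queries, conditions, 0)
ads_effect = flip_effect(queries, conditions, 1)
```

In this toy data, flipping the image condition lowers the HSRT fraction from 0.75 to 0.25, while the ads condition cannot be assessed because no query without ads matches the rest of the condition type.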

In Table IV, the results are sorted by the variation of the fraction of HSRT in condition types (HSRT% column) caused by flipping an attribute condition.

(i) We highlight the variations greater than zero (getting worse after flipping an attribute condition).

(ii) We focus on the fact that flipping the HSRT branching attribute conditions can yield improvements in HSRT%. For example, the conditions #image>x are all ranked at the top. This means that reducing the impact of images on SRT offers the highest potential improvement in HSRT.

(iii) Table III and Table IV are the output of Focus to the operators for these months.

Observations by Further Investigation

Table IV raises some interesting questions (the table output by Focus prompts many further questions, possibly ones that manual experience alone would not easily uncover):

(1) Why does reducing #images increase $T_{server}$ , the time that servers prepare the result HTML (row 1, 2, 3, 4 of Table IV)?

(2) How do ads inflate SRT? Why do the pages with ads need more $T_{net}$ and $T_{browser}$ (row 7)?

(3) Why does Webkit engine perform better, especially greatly decreasing $T_{browser}$ (row 5, 10, 11, 12)?

(4) It is natural that switching ISPs can affect network transmission time $T_{net}$, but why does switching to China Telecom reduce $T_{server}$ by over 20% (row 6, 8, 9)?