Reformulation of Query for Code Search

10-12

原文為Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics。

文章的主要主題是searching relevant code snippets. 這個領域的主要存在的問題是vocabulary mismatch issues. 因此，開發人員不停的重新組織他們的搜索的語言：移除一些不相關的關鍵詞，新增一些相關的關鍵詞。然而這個過程非常瑣碎並且可能帶有很多錯誤。於是這篇文章的目標是自動的重新組織一般性的語句，把相關的API類名作為關鍵詞加入搜索語句中：query reformulation targeting code search。

為此，本文作者提出了一個新的技術：NLP2API技術，能夠針對用自然語言描述的編程任務，自動的識別出相關的API類別，然後得到的API類別重新組織查詢語句，已得到更好的代碼搜索結果。

Development of Candidate API Lists

Corpus Preparation:(Steps 1a, 1b, Algorithms Line 3):

Data from StackOverFlow:
question or answer with a nit of code(existence of <code> tags);
accepted as solution by the submitter of question;
avoid stemming on controversial evidence
then perform standard natural language preprocessing (i.e., removal of stop words, punctuation marks and programming keywords, token splitting)
indexed using Lucene

Pseudo-Relevance Feedback (PRF) on the NL Query(Fig. 1, Steps 2–4, Algorithm 1, Lines 4–8):

Given a query, first perform standard natural language preprocessing, then use it retrieve Top-M Q&A thread from Corpus with Lucene (Machansm of Top-M??)

[Side Effect? I just wonder the relationship between program elements and PR-feedback:

Extract the program elements( code segments, API classes).

Develop two separate sets of code segments from questions and answers from feedback threads.

]

API Class Weight Estimation with TF-IDF:(Fig. 1, Step 5,

Algorithm 1, Lines 11–12):

For each code segment(feedback document) and API classes inside the code segments, their relative weigh is calculated by:

$TF-IDF(A_{i}) = (1 + log(TF_{A_{i}})) * log(1 + N/DF_{A_{i}})$

$TF_{A_{i}}$ 是指API $A_{i}$ 在文本中出現分頻率， $E$ 是指整個Q&A thread 的數量， $DF_{A_{i}}$ 是指有 $A_{i}$ 的文本數量。

API Class Weight Estimation with PageRank(Fig. 1, Step 6, Algorithm 1, Line 13):

However, TFIDF assumes term independence (i.e., ignores term contexts) in the weight estimation. Hence, it might fail to identify highly important but not so frequent terms from a body of texts. We thus employ another term weighting method that considers dependencies among terms in the weight estimation。

Selection of Candidate API Classe(Fig. 1, Step 7, Algorithm 1, Lines 9–16):

Then we select Top-N (e.g., N = 16, check RQ1 for justification) API classes from each of the four lists (i.e., two lists for each term weight)。

API classes involved with the problem and API classes forming the solution should be treated differently for identifying the relevant and specific API classes for the task. We leverage this inherent differences of context and semantics between questions and answers, and treat their code segments separately。

Borda Score Calculation(Fig. 1, Step 8, Algorithm 1, Lines 22–23)

we apply this method to our four candidate API lists where each of the API classes are ranked based on their importance estimates. Thus, an API class that occurs at the top positions in multiple candidate lists is likely to be more important for a target programming task than the ones that either occurs at the lower positions or does not occur in multiple lists.

Query-API Semantic Proximity Analysis

Word2Vec Model Development((Fig. 1, Step 1b, 1c, 1d, Algorithm 1, Lines 17–18)：We thus develop a word2vec model where 1.3 million programming questions and answers are employed as the corpus.

Semantic Relevance Estimation:(Fig. 1, Step 9, Algorithm 1, Lines 24–25):

the relevance estimate of the class of a query and candidate API class is captured by the amximun proximity berween an API class and any of the query keywords.

word embedding of candidate API is a vector of 100 real valued estimates of the contexts.

$S_{P}(A_{i} in A) = {f(A_{i},q)| f(A_{i},q_{0}) forall q_{0}in Q}$

$f(A_{i},q) = cosine(fastText(A_{i}),fastText(q))$

Here $fastText(.)$ returns the learned word embeddings of either a query keyword or an API class. $f(A_{i},q)$ returns the cosine similarity between their word embeddings.

API Class Relevance Ranking & Query Reformulation(Line 26, Algorithm 1)

sum Borda score $S_{B}$ and semantic proximity score $S_{P}$ up using a linear combination.

The API classes are then ranked according to their final scores, and Top-K (e.g., K = 10) classes are suggested as the relevant classes for the programming task stated as a generic query (i.e., Fig. 1, Steps 10–12, Algorithm 1, Lines 19–29).

These API classes are then appended to the given query as reformulations [31] (i.e., Fig. 1, Steps 13, Algorithm 1, Lines30–31).

總結：這篇文章。。。通過非常複雜的演算法（吐血）得到了較好的結果。。。。