如何計算兩份代碼的相似度？

01-02

推薦一本書《軟體加密與解密》，作者是Christian Collberg和Jasvir Nagra。

裡面第十章很系統地介紹了目前主流的相似性檢測演算法。

以前做腳本病毒的查殺引擎的時候有做過一些這方面的分析，你把它做成AST，然後比對一下一些關鍵的地方就能破掉許多簡單的混淆。

正好之前的畢設中閱讀了相關的文獻這裡依照我膚淺的理解也來回答一下

計算兩份代碼的相似度是有很多實際的應用背景的，比如在代碼倉庫中（Software Repository）中，大量的代碼來自於相互的複製粘貼，如果能夠合理地檢測到相似的代碼，將相似的代碼只保存一份，那麼可以減少大量的存儲冗餘，同時也可以更好地保證數據的一致性。

相關的研究有不少，在論文[*]中，提到了五種主要類型的相似性比較，分別是：

Textual comparison,
Token comparison,
Metric comparison,
Comparison of abstract syntax trees (AST),
Comparison of program dependency graphs (PDG) ,
其他一些類型的比較方法。

對於這五種相似度對比方法及其他的一些方法，文章是這麼闡述的：

Textual comparison. The approach of Ducasse et al. compares whole lines to each other textually [4]. To increase performance, lines are partitioned using a hash function for strings. Only lines in the same partition are compared. The result is visualized as a dot plot, where each dot indicates a pair of cloned lines. Clones may be found as certain patterns in those dot plots visually. Consecutive lines can be summarized to larger cloned sequences automatically as uninterrupted diagonals or displaced diagonals in the dot plot.
Johnson [13] uses the efficient string matching by Karp and Rabin [14] based on fingerprints.
Token comparison. Baker』s technique is also a line- based comparison. Instead of a string comparison, the token sequences of lines are compared efficiently through a suffix tree. First, each token sequence for a whole line is summarized by a so-called functor that abstracts from concrete values of identifiers and literals [1]. The functor characterizes this token sequence uniquely. Assigning functors can be viewed as a perfect hash function. Concrete values of identifiers and literals are captured as parameters o this functor. An encoding of these parameters abstracts from their concrete values but not from their order so that code fragments may be detected that differ only in systematic renaming of parameters. Two lines are clones if they match in their functors and parameter encoding.
The functors and their parameters are summarized in a suffix tree, a trie that represents all suffixes of the program in a compact fashion. A suffix tree can be built in time and space linear to the input length [7], [15]. Every branch in the suffix tree represents program suffixes with common beginnings, hence, cloned sequences.
Kamiya et al. increase recall for superficially different yet equivalent sequences by normalizing the token sequences [9]. Because syntax is not taken into account, the found clones may overlap different syntactic units, which cannot be replaced through functional abstraction. In either a preprocessing [16], [17] or a postprocessing [18] step, clones that completely fall in syntactic blocks can be found if block delimiters are known.
Metric comparison. Merlo et al. gather different metrics for code fragments and compare these metric vectors instead of comparing code directly [2], [3], [12], [19]. An allowable distance (for instance, euclidean distance) for these metric vectors can be used as a hint for similar code. Specific metric-based techniques were also proposed for clones in Web sites [20], [21].
Comparison of abstract syntax trees (AST). Baxter et al. partition subtrees of the abstract syntax tree of a program based on a hash function and then compare subtrees in the same partition through tree matching (allowing for some divergences) [8]. A similar approach was proposed earlier by Yang [22] using dynamic programming to find differ- ences between two versions of the same file.
Comparison of program dependency graphs (PDG). Control and data flow dependencies of a function may be represented by a program dependency graph; clones may be identified as isomorphic subgraphs [10], [11]; because this problem is NP-hard, Krinke uses approximative solutions.
Other techniques. Marcus and Maletic use latent semantic indexing (an information retrieval technique) to identify fragments in which similar names occur [23]. Leitao [24] combines syntactic and semantic techniques through a combination of specialized comparison functions that com- pare various aspects (similar call subgraphs, commutative operators, user-defined equivalences, and transformations into canonical syntactic forms). Each comparison function yields an evidence that is summarized in an evidence-factor model yielding a clone likelihood. Wahler et al. [25] and Li et al. [26] cast the search for similar fragments as a data mining problem. Statement sequences are summarized to item sets. An adapted data mining algorithm searches for frequent item sets.

對於上述段落中具體引用的文章，這裡不貼出來了。

這篇文章的Abstract如下：

Abstract—Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone candidates were evaluated by one of the authors as an independent third party. The selected techniques cover the whole spectrum of the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program dependency graphs.

文章裡面提供了大量的相似性計算公式和實驗對比，如果感興趣，可以深入閱讀此文章。

Reference:

[*]. Bellon, S., Koschke, R., Antoniol, G., Krinke, J., Merlo, E. (2007). Comparison and evaluation of clone detection tools. Software Engineering, IEEE Transactions on, 33(9), 577–591. IEEE.

用斯坦福大學的moss

呵呵，本來因為不熟悉這類問題，所以沒有打算回答的。不過被邀請了還是說一下自己的觀點吧。

對於可以反編譯的語言來說，可以將兩份代碼全部編譯，然後再反編譯，之後對反編譯的結果進行文本對比。

這樣的好處是不需要專門去做什麼語意分析的功能了，但是實際效果怎麼樣那我也不是很清楚，因為沒有用過。

我覺得這裡可能有一個問題就是提問者是在什麼場合下要做這樣的事情？

內部重構么？那應該文本比對就夠了

要檢查是否有抄襲？那就必須按照受認可的標準進行對比，建議找找之前的案例

這個問題好，讓我想起了之前編譯上機練習助教出的一個題: 怎麼根據一份別人的代碼，量產多份作業？還有作為助教怎麼識破？

問題里既然說是代碼了，那就可以看到源碼了？(我默認)

最簡單的就是diff一下啦，

對策可以改改代碼風格，排版，變數命名，甚至插曲一些無用代碼等等。這些只是看起來不同了。

另外幾位大神的答案，似乎忽略了程序運行時的狀況呢，如果程序本身邏輯沒變，只是文本和語法上做些處理(忽略lisp之類的奇葩)，程序運行起來之後，比較調用堆棧的變化，這種方法很容易識破。

你說運行起來比較難的話，那就忽略了測試人員的作用了，把程序/代碼段/模塊當成黑盒，通過mock等方式，還是可以一部分一部分搞的。

那改變調用層次，函數套函數？甚至某些地方變成宏？模塊重構？諸如此類。這就不好識別了。模塊都給重構了的話，姑且不算抄了，但是設計上一樣的，人工code review，去了解它的設計，還是能識破。

怎麼做到毫無痕迹呢？學會之後變成自己的東西吧。然後再創造。

好像跑題了，回答問題，怎麼比較，我覺得:

首先可以diff一下，然後看運行狀態，如果是差別很大的，看設計，做code review，看文檔，如果自己的代碼特別多，而人家的代碼又拿不到，可以在自己的代碼里加些【水印】，比如搞些magic number，運行時去調試，就找這個magic number，有的話，基本就是盜的。計算的話，我覺得評級比較靠譜，列規則，挨個檢測，根據規則中槍程度評級，完全量化的方法應該很難做到，但是規則+評級可以糙快猛的構造出一個比較相似度的系統，工程思維哈，不斷根據實踐，去調整這個系統，讓它更接近準確就可以啦～

我說的都是土方法，期待學術大神給個系統答案。

這個還真做過一點，主要用了2種辦法：

1.借鑒rsync將代碼按長度分塊比較，得到相同和修改的塊

2.基於編輯距離演算法，計算修改的地方

基於這2個辦法實現了一個帶有增量更新的js模塊載入器mt：

mtjs/mt · GitHub

如果是acm，記得讀書時acm小組檢查作業的方式為看程序佔用的內存大小，還有運行時的幾個其他參數（記不住了）。話說你不會是老師或者助教吧？如果是，我就犯錯了。

分情況：

1. 有源碼：用到的方法可參考斯坦福的moss。

2. 沒有源碼，沒有obfuscation：常見的方法是比較AST, CFG, PDG等。

3. 沒有源碼，有obfuscation：只能靠動態分析了，給兩個相同或相似的input，比較兩個程序的trace。

之前寫過一個通過字元串相似度來模擬實現的庫，可以參考下，演算法複雜度有待優化:https://github.com/SKing7/mini-clone-digger

核心思想是怎麼找到代碼的中間表達，可以是語法分析樹, 也可以是控制流圖，也可以自己定義規則，將代碼中的關鍵信息抽取出來映射到向量空間上，就可以比較其相似性了。

這是一個可能涉及到很多步驟和技術的問題。如果打算使用機器學習技術做這方面的研究和應用，可以讀一下 Springer 去年出的一本綜述小書《Software Similarity and Classification》，從而獲得一個整體和清晰的印象。每章後的參考文獻，可以有選擇地看看。

請參考GiT

摘錄自網路:

Levenshtein Distance演算法及其優化改進

假設文件A有a行代碼，B有b行代碼，在使用Hunt的改進演算法後得到，從A到B需要修改c行，從B到A需要修改d行（即A中有b-c行代碼和B是相同的），因此可以定義A、B間的相似度為max( (b-c)/a, (a-d)/b )。