Is the Clang parser entirely hand-written?

01-12

If so, why not generate the parser automatically?
I have been reading the code myself, but I can't get the full picture at a glance.

Yes.

Clang - Features and Goals

A single unified parser for C, Objective C, C++, and Objective C++
Clang is the "C Language Family Front-end", which means we intend to support the most popular members of the C family. We are convinced that the right parsing technology for this class of languages is a hand-built recursive-descent parser. Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straight-forward to implement excellent diagnostics and error recovery.
We believe that implementing C/C++/ObjC in a single unified parser makes the end result easier to maintain and evolve than maintaining a separate C and C++ parser which must be bugfixed and maintained independently of each other.

Because when you need to add all sorts of odd features, a hand-written parser is easier to implement and to understand.

With a parser generator there are always awkward spots where you end up fighting the generator, which makes things more trouble, not less.
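To make "plain C++ code" concrete, here is a minimal sketch of a hand-written recursive-descent parser for a toy arithmetic grammar. It is not Clang's code; `ToyParser` and all of its members are hypothetical names invented for this illustration. Each grammar rule becomes one function, and an ad-hoc check (here, a division-by-zero diagnostic with a column number) is just an ordinary `if` at the point where it is needed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical toy grammar (an assumption for this sketch, not Clang's):
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
class ToyParser {
public:
    explicit ToyParser(std::string text) : src_(std::move(text)) {}

    // Parse the whole input and return the value of the expression.
    long parse() {
        long value = parseExpr();
        skipSpaces();
        if (pos_ != src_.size()) fail("unexpected trailing input");
        return value;
    }

private:
    std::string src_;
    std::size_t pos_ = 0;

    // Diagnostics are ordinary code: report the 1-based column and stop.
    [[noreturn]] void fail(const std::string& message) const {
        assert(pos_ <= src_.size());
        throw std::runtime_error("column " + std::to_string(pos_ + 1) + ": " + message);
    }

    bool atDigit() const {
        return pos_ < src_.size() &&
               std::isdigit(static_cast<unsigned char>(src_[pos_])) != 0;
    }

    void skipSpaces() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_])) != 0) {
            ++pos_;
        }
    }

    // Consume the character c if it is next (after whitespace).
    bool accept(char c) {
        skipSpaces();
        if (pos_ < src_.size() && src_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    long parseExpr() {
        long value = parseTerm();
        for (;;) {
            if (accept('+')) value += parseTerm();
            else if (accept('-')) value -= parseTerm();
            else return value;
        }
    }

    long parseTerm() {
        long value = parseFactor();
        for (;;) {
            if (accept('*')) {
                value *= parseFactor();
            } else if (accept('/')) {
                long rhs = parseFactor();
                // An ad-hoc rule slots in exactly where it applies.
                if (rhs == 0) fail("division by zero");
                value /= rhs;
            } else {
                return value;
            }
        }
    }

    long parseFactor() {
        if (accept('(')) {
            long value = parseExpr();
            if (!accept(')')) fail("expected ')'");
            return value;
        }
        if (atDigit()) {
            long value = 0;
            while (atDigit()) value = value * 10 + (src_[pos_++] - '0');
            return value;
        }
        fail("expected a number or '('");
    }
};
```

Calling `ToyParser("1 + 2 * 3").parse()` evaluates the expression with the usual precedence, while an input such as `"(1 + 2"` throws a `std::runtime_error` whose message names the column where `)` was expected.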

In fact, GCC's C++ parser has also been hand-written since the GCC 3.4 series.

https://gcc.gnu.org/gcc-3.4/changes.html

A hand-written recursive-descent C++ parser has replaced the YACC-derived C++ parser from previous GCC releases. The new parser contains much improved infrastructure needed for better parsing of C++ source codes, handling of extensions, and clean separation (where possible) between proper semantics analysis and parsing. The new parser fixes many bugs that were found in the old parser.

And GCC's C and Objective-C parsers have been hand-written since the GCC 4.1 series.

https://gcc.gnu.org/gcc-4.1/changes.html

The old Bison-based C and Objective-C parser has been replaced by a new, faster hand-written recursive-descent parser.

Of course, some people still think a parser generator is the better choice. For example, here:

c - Are GCC and Clang parsers really handwritten?

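Ira Baxter of Semantic Designs argues that the folklore about C++ parsers being hard to write with a generator does not come from the idea of a generator being bad, but from the parsing algorithms in use being too outdated: many traditional parser generators use LALR(1), for example bison in its default mode.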

He argues that with a more powerful parsing algorithm, such as GLR, the problem simply goes away.

Well, that is still a matter of opinion. Summoning @vczh to share his views on GLR >_<

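Interested readers can look at the design documentation for Elsa, a C++ parser that uses GLR. Just as @vczh's answer describes, the parse phase actually produces multiple ambiguous ASTs, and type checking then removes the invalid ones to yield the final, clean AST.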

The Clang website has a passage comparing the Clang parser with the Elsa parser: Comparing clang to other open source compilers

To add to @RednaxelaFX's answer: the way GLR makes this problem go away is that when it encounters code like the following:

Fuck& bitch;

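it generates two syntax trees on the spot, one for a bitwise-AND expression and one for a variable declaration. Once the whole parse is done, your program is dotted with small holes that each hold alternatives like these, and semantic analysis picks among them afterwards. So you no longer need to know, right at this point, whether Fuck is a type or a value, as the C++ standard currently requires.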
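The "keep both readings, choose later" idea can be sketched in a few lines. This is not a real GLR parser and not Elsa's implementation; it is a hand-rolled illustration in which `Reading`, `Candidate`, `parseAmbiguous`, and `resolve` are all hypothetical names. The parse step returns both readings of a statement shaped like `A & B;` without consulting any symbol table, and a later semantic step keeps only the reading consistent with the set of known type names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// The two ways a statement of the shape `A & B;` can be read.
enum class Reading { ReferenceDeclaration, BitwiseAndExpression };

// One candidate syntax tree, reduced to the bare minimum for this sketch.
struct Candidate {
    Reading reading;
    std::string lhs;
    std::string rhs;
};

// Parse phase (assumed shape `lhs & rhs;`): produce every reading and
// defer the decision. No symbol table is consulted here.
std::vector<Candidate> parseAmbiguous(const std::string& lhs, const std::string& rhs) {
    assert(!lhs.empty() && !rhs.empty());
    return {
        {Reading::ReferenceDeclaration, lhs, rhs},
        {Reading::BitwiseAndExpression, lhs, rhs},
    };
}

// Semantic phase: drop the readings that contradict what is known about lhs.
std::vector<Candidate> resolve(const std::vector<Candidate>& candidates,
                               const std::set<std::string>& typeNames) {
    std::vector<Candidate> kept;
    for (const Candidate& c : candidates) {
        const bool lhsIsType = typeNames.count(c.lhs) > 0;
        const bool readingNeedsType = c.reading == Reading::ReferenceDeclaration;
        if (lhsIsType == readingNeedsType) kept.push_back(c);
    }
    return kept;
}
```

With `parseAmbiguous("Foo", "bar")`, passing `{"Foo"}` as the type names to `resolve` leaves only the declaration reading, and passing an empty set leaves only the expression reading. A hand-written parser in the Clang style would instead look the name up during parsing and build a single tree.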

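As for why people generally don't use GLR: its running time is unpredictable. In general, as soon as GLR starts handling an ambiguity, memory consumption goes up and more CPU cores get occupied. Add the time spent choosing among the alternatives, and the result is always less efficient than a hand-written parser, so well-known general-purpose languages rarely use GLR for parsing. For parsing DSLs of all kinds, by contrast, GLR works well: grammar rules can be written freely because every context-free grammar is supported, and efficiency requirements are not very high, so parser generators are used more often in those cases.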