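What problem has kept Microsoft from implementing two-phase name lookup in C1 to this day?
The old C1 parser was based on YACC, and with YACC alone it is hard to parse non-dependent names inside a template, so a new parser had to be added specifically for templates.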
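To make "non-dependent name" concrete, here is a minimal sketch (the names `f`, `call_non_dependent`, and `demo` are made up for illustration and do not come from the original answers). Under the two-phase rules, the call `f(1)` does not depend on `T`, so it is bound where the template is defined, at which point only `f(double)` has been declared. A compiler that defers all lookup until instantiation would see both overloads and could pick `f(int)` instead.

```cpp
#include <string>

// The only overload visible at the point where the template is defined.
inline std::string f(double) { return "f(double)"; }

template <typename T>
std::string call_non_dependent() {
    // `f(1)` has no argument that depends on T, so it is a non-dependent call.
    // Two-phase lookup binds it here, at definition time, to f(double).
    return f(1);
}

// Declared after the template definition: a better match for an int argument,
// but not visible to the non-dependent call above under two-phase lookup.
inline std::string f(int) { return "f(int)"; }

// Hypothetical helper that instantiates the template after both overloads exist.
inline std::string demo() { return call_non_dependent<int>(); }
```

On a compiler that implements two-phase lookup, `demo()` returns `"f(double)"`. A compiler that resolves every name at instantiation time would be expected to return `"f(int)"`.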
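If you are not familiar with the code names "CL", "C1" / "C1XX", and "C2" mentioned in this question, see this post first: Optimizing C++ Code : Overview
- cl.exe: the "shell" or "driver" of the Visual C++ compiler. CL stands for compile-and-link.
- c1.dll / c1xx.dll: the C front end and the C++ front end of the Visual C++ compiler, respectively. They read C / C++ source code and perform lexical analysis -> parsing -> semantic analysis -> intermediate code generation in a single pass. Their output is intermediate code called "CIL" / "CxxIL" (C Intermediate Language / C++ Intermediate Language).
- c2.dll: the back end of the Visual C++ compiler, also known as "UTC" (Universal Tuple Compiler). It reads CIL / CxxIL, applies platform-independent and platform-dependent optimizations, and then generates target code.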
Our compiler is old. There are comments in the source from 1982, which was when Microsoft was just starting its own C compiler project. The comments of that person (Ralph Ryan) led me to a paper he published in 1985 called "The C Programming Language and a C Compiler". It is an interesting read and some of what he describes is still reflected in the code today. He mentions that you can compile C programs with two floppy drives and 192K of RAM (although he recommends a hard drive and 256K of RAM). Being able to run in that environment meant that you couldn't keep a lot of work in memory at a time. The compiler was designed to scan programs and convert statements and expressions to IL (intermediate language) as quickly as possible and write them to disk without ever having an entire function in memory at one time. In fact, the compiler will start emitting IL for an expression before even seeing the end of the expression. This meant you could compile programs that were quite large on a pretty small machine.
Note: Our compiler consists of two pieces (a front-end and a back-end). The front-end reads in source code, lexes, parses, does semantic analysis and emits the IL. The back-end reads the IL and performs code generation and optimizations. The use of the term "compiler" in the rest of this post pertains only to the front-end.
For C code (especially K&R C), this approach worked well. Remember, you didn't even need to have prototypes for functions. Microsoft added support for C++ in C 7.0, which was released in 1992. It shared much of the same code as the C compiler and that is still true today. Although the compiler has two different binaries (c1.dll and c1xx.dll) for C and C++, there is a lot of source code that is shared between them.
At first, the old design of the compiler worked OK for C++. However, once templates arrived, a new approach was needed. The method chosen to implement this was to do some minimal parsing of a template and then capture the whole template as a string of tokens (this is very similar to how macros are handled in the compiler). Later, when a template is instantiated, that token stream would be replayed through the parser and template arguments would be replaced. This approach is the fundamental reason why our compiler has never implemented two phase lookup.
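The capture-and-replay scheme described above can be illustrated with a toy model. This is only a sketch under loud assumptions: `CapturedTemplate`, `capture`, `replay`, and the string-based "symbol table" are all hypothetical and have nothing to do with the real c1xx.dll internals. It just shows why a design that stores raw tokens ends up resolving every name at instantiation time.

```cpp
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;
// Toy symbol table: identifier -> the declaration it currently resolves to.
using SymbolTable = std::map<std::string, std::string>;

// Hypothetical stored template: the parameter name plus the unparsed body tokens.
struct CapturedTemplate {
    std::string param;
    Tokens body;
};

// "Definition time": nothing is looked up, the tokens are simply recorded.
CapturedTemplate capture(const std::string& param, const Tokens& body) {
    return CapturedTemplate{param, body};
}

// "Instantiation time": replay the tokens, substitute the template argument,
// and resolve every other identifier against whatever is visible right now.
Tokens replay(const CapturedTemplate& t, const std::string& arg,
              const SymbolTable& visible_now) {
    Tokens out;
    for (const std::string& tok : t.body) {
        if (tok == t.param) {
            out.push_back(arg);
        } else {
            auto it = visible_now.find(tok);
            out.push_back(it != visible_now.end() ? it->second : tok);
        }
    }
    return out;
}

// Hypothetical driver: an overload declared after the template definition
// still wins, because lookup only happens during replay.
Tokens demo_replay() {
    SymbolTable symbols;
    symbols["f"] = "f(double)";  // visible when the template is defined
    CapturedTemplate t = capture("T", {"T", "x", ";", "f", "(", "x", ")", ";"});
    symbols["f"] = "f(int)";     // declared later, before instantiation
    return replay(t, "int", symbols);
}
```

Because `capture` never consults the symbol table, there is no point at which a non-dependent name could be bound early. Doing that requires actually parsing the template body when it is defined, which is the parser work the answers in this thread refer to.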
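So there it is: historical reasons. Another entry in the "technical debt is not so easy to pay off" series.
For readers who enjoy technical archaeology, the paper mentioned in the blog post quoted above, The C programming language and a C compiler, is a fascinating read. The design of Microsoft's early C compiler that it describes reflects the technical background and design trends of compilers in that era. It is a paper from 1985. Quoting a few passages:
Because C continues to be an evolving language, the first question we asked was which C we should implement. We started with the UNIX System-V C compilers, which we used as a base language definition. When the ANSI standards effort began, we reviewed the extensions proposed by the committee and incorporated the ones that seemed likely to become part of the eventual standard.
And then: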
General architecture. The Microsoft C compiler consists of three phases -- P1, P2, and P3 -- as shown in Figure 1. The P1 phase or front end reads the source program and does the C language preprocessing. This includes macro substitution, conditional compilation, and inclusion of named files. P1 then translates the program into expression trees, doing semantic analysis, checking for syntax and type correctness, and writes an intermediate language (IL) temporary file. This intermediate language is both machine- and language-independent and contains information about such things as control structures, expressions, symbols, and data initialization. P1 also emits DIL, which contains data-type information to be used by a symbolic debugger.
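The P1 and P2+P3 mentioned here are the predecessors of C1 and C2 in the later Visual C++ compiler.
Note that the "PGO" in this figure stands for "Post-Generation Optimizer", not the profile-guided optimization that the abbreviation usually refers to today. The PGO here is just a set of machine-dependent optimizations in the back end, modeled on the design of the FINAL phase of the Bliss compiler.
The latest blog post, dated 2016/11/16, says that VC2017 RC will not yet implement two-phase name lookup, but that there is a chance in the first half of next year: Give Visual C++ a Switch to Standard Conformance
On two-phase name lookup and Clang's implementation: The Dreaded Two-Phase Name Lookup
AndrewPardoe (MSVC tools; EWG scribe):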
And we're targeting full C++ 98/11/14 Standard conformance in 2017.
Off the top of my head, the remaining issues are:
- Expression SFINAE completion
- Bugs in current 11/14 features, especially having to do with the ordering of initialization of statics, but goodness knows where else (thank you for your bug reports!)
- Two-phase name lookup. The parser changes should be sufficiently complete for us to get two-phase done early next year. No promises but it's likely enough that you could probably win a few bets with your colleagues.
- The C99 preprocessor. This is probably the last thing to come. But we'll get it in 2017.
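Actually, there is no technical reason, only historical ones. In the beginning, MSVC's C++ compiler was just the C compiler with some modifications, meant to be good enough to get by. Whenever a new feature was unsupported it was bolted on without changing the architecture, and the compiler still had to run on truly awful machines. Patch piled on patch, and the result is what we have today. Two-phase name lookup was never a high priority, so nobody worked on it.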
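The only effect of two-phase name lookup is to make your code longer. I see no need whatsoever to implement it (and it happens to be hard to implement in that particular codebase), and the standard does not say you absolutely have to do it this way either.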
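One plausible reading of "makes your code longer" (an assumption, since the answer does not say what it has in mind) is the extra qualification that two-phase lookup asks for when a template uses members of a dependent base class. A minimal sketch, with the hypothetical names `Base`, `Derived`, and `demo_dependent_base`:

```cpp
template <typename T>
struct Base {
    int value = 42;
};

template <typename T>
struct Derived : Base<T> {
    int get() const {
        // return value;
        // ^ Rejected by a compiler with two-phase lookup: `value` is a
        //   non-dependent name, and dependent base classes such as Base<T>
        //   are not searched when the template is defined.
        return this->value;  // `this->` makes the name dependent, so lookup
                             // is deferred until instantiation.
    }
};

// Hypothetical helper that instantiates the template.
inline int demo_dependent_base() { return Derived<int>().get(); }
```

A compiler that replays tokens at instantiation time already knows what `Base<int>` looks like when it first resolves `value`, which is why the unqualified form was accepted by older MSVC releases and why code written for them tends to need `this->` or `Base<T>::` added when ported.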
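I translated R大's answer and adjusted a few things against the original post; for example, support for C++ was introduced in C 7.0, which was released in 1992.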
Jim Springfield: Rejuvenating the Microsoft C/C++ Compiler - Visual C++ Team Blog, MSDN
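Our compiler is old. Back in 1982, Microsoft was just starting its own C compiler project. The source code contains comments from Ralph Ryan, which led me to a paper he published in 1985 titled "The C Programming Language and a C Compiler". It is still an interesting read today, and some of what he describes is still reflected in how the code works now. He mentions that you can compile C programs with just two floppy drives and 192K of RAM (although he recommends a hard drive and 256K of RAM). Being able to run in that environment meant you could not keep much work in memory at a time. The compiler was designed to scan the program and convert statements and expressions to IL (intermediate language) as quickly as possible and write them to disk, without ever holding an entire function in memory. In fact, the compiler starts emitting IL for an expression before it has even seen the end of the expression, which means you could compile fairly large programs on a pretty small machine.
Note: Our compiler consists of two parts (a front end and a back end). The front end reads the source code, lexes, parses, performs semantic analysis, and emits IL. The back end reads the IL and performs code generation and optimization. In the rest of this post, the term "compiler" refers only to the front end.
For C code (especially K&R C), this approach worked well. You did not even need function prototypes. Microsoft added support for C++ in C 7.0, which was released in 1992. The C++ compiler shared much of its code with the C compiler, and it still does today. Although the compiler has two different binaries (c1.dll and c1xx.dll) for C and C++, a lot of source code is shared between them.
At first, the compiler's old design worked acceptably for C++. Once templates arrived, however, a new approach was needed. The approach chosen was to do some minimal parsing of a template and then capture the whole template as a string of tokens (very similar to how macros are handled in the compiler). Later, when the template is instantiated, that token stream is replayed through the parser with the template arguments substituted. This is the fundamental reason our compiler has never implemented two-phase lookup. There is no AST.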