An Intro to Compilers
(by Nicole Orchard)
How to Speak to Computers, Pre-Siri
tl;dr: Learning new meanings for front-end and back-end.
A compiler is just a program that translates other programs. Traditional compilers translate source code into executable machine code that your computer understands. (Some compilers translate source code into another programming language. These compilers are called source-to-source translators or transpilers.) LLVM is a widely used compiler project, consisting of many modular compiler tools.
Traditional compiler design comprises three parts:
- The Frontend translates source code into an intermediate representation (IR)*. clang is LLVM's frontend for the C family of languages.
- The Optimizer analyzes the IR and translates it into a more efficient form. opt is the LLVM optimizer tool.
- The Backend generates machine code by mapping the IR to the target hardware instruction set. llc is the LLVM backend tool.
* LLVM IR is a low-level language that is similar to assembly. However, it abstracts away hardware-specific information.
Hello, Compiler
Below is a simple C program that prints "Hello, Compiler!" to stdout. The C syntax is human-readable, but my computer wouldn't know what to do with it. I'm going to walk through the three compilation phases to make this program machine-executable.
// compile_me.c
// Wave to the compiler. The world can wait.
#include <stdio.h>

int main() {
  printf("Hello, Compiler!\n");
  return 0;
}
The Frontend
As I mentioned above, clang is LLVM's frontend for the C family of languages. Clang consists of a C preprocessor, lexer, parser, semantic analyzer, and IR generator.
- The C Preprocessor modifies the source code before beginning the translation to IR. The preprocessor handles including external files, like #include <stdio.h> above. It will replace that line with the entire contents of the stdio.h C standard library file, which will include the declaration of the printf function.
See the output of the preprocessor step by running:
clang -E compile_me.c -o preprocessed.i
- The Lexer (or scanner or tokenizer) converts a string of characters to a string of words. Each word, or token, is assigned to one of five syntactic categories: punctuation, keyword, identifier, literal, or comment.
Tokenization of compile_me.c
- The Parser determines whether or not the stream of words consists of valid sentences in the source language. After analyzing the grammar of the token stream, it outputs an abstract syntax tree (AST). Nodes in a Clang AST represent declarations, statements, and types.
The AST of compile_me.c
- The Semantic Analyzer traverses the AST, determining if code sentences have valid meaning. This phase checks for type errors. If the main function in compile_me.c returned "zero" instead of 0, the semantic analyzer would throw an error because "zero" is not of type int.
- The IR Generator translates the AST to IR.
Run the clang frontend on compile_me.c to generate LLVM IR:
clang -S -emit-llvm -o llvm_ir.ll compile_me.c
The main function in llvm_ir.ll
; llvm_ir.ll
@.str = private unnamed_addr constant [18 x i8] c"Hello, Compiler!