`
hanmiao
  • 浏览: 56922 次
  • 性别: Icon_minigender_1
  • 来自: 上海
文章分类
社区版块
存档分类
最新评论

如何编写你自己的编译器

 
阅读更多

1、Introduction 简介

This on-line document accompanies the slides and other material available for the courses Formal Languages and Compilers and Linguaggi e Traduttori.
It describes the project of a complete compiler for a real programming language and is organized to demonstrate how to incrementally design and implement the successive phases of the compilation process.
All the used tools are freely available on the Internet:
1. the front end of the compiler has been developed by means of JFlex, a lexical analyzer generator, and Cup, an LALR parser generator, both based on the Java Standard Edition Platform
2. the back end is made by the LLVM retargettable code generator, which currently supports X86, X86-64, PowerPC, PowerPC-64, ARM, Thumb, SPARC, Alpha, and IA-64 architectures.

2、Source Language 源语言

The source language for the compiler is mjava, a subset of Java which mainly differs from the standard language in the following points:
1. all classes in a program are declared in a single file (linking is not needed)
2. multiple inheritance is not supported (interface declarations are not allowed)
3. abstract classes are not admitted
4. static fields and static methods are not allowed
5. overloading is not supported
6. basic only flow-of-control statements ( if, if-else and while ) are implemented
7. basic input/output operations on the I/O standard devices are performed by invoking the C-like functions printf and scanf.

This is an example of mjava source program:

/********************************************************************
 * Sample mjava source program
 *********************************************************************/
public class String {

} // end of class String

public class Int {
	int n;

	public Int(int i) {
		n = i;
	}

	public int f() {
		return fact(n);
	}

	int fact(int n) {
		return n > 2 ? n * fact(n - 1) : n;
	}
} // end of class Int

public class Test {
	public void main(){
    int n, f;
    Int t;
    n = 0;
    while(n < 1 || n > 16) {
       printf ("Enter an integer greater than 0 and less than 17: ");
       scanf ("%d", &n);
    }
    t = new Int(n);
    f = t.f();
    printf("factorial(%d)= %d\n", n, f);
  }
} // end of class Test

3、Lexical Analysis 词法分析

First of all we have to specify the lexical elements of our source language by means of a lexical-analyzer generator.
JFlex transforms the specification of a scanner (regular definitions, actions to be executed when a matching occurs, ...) into a Java program that implements a minimum state Deterministic Finite Automaton accepting the specified lexemes.
The lexical specification of the mjava language is reported in the mjava.flex file, that uses the interface sym.java to declare integer constants for the terminal tokens in the language.
From these files JFlex generates a Scanner.java file with one class that contains code for the scanner. The class will have a constructor taking a java.io.Reader from which the input is read.
The %debug directive in mjava.flex also creates a main function in the generated class that expects the name of an input file on the command line and then runs the scanner on this input file, by printing information about each returned token.
The information includes: line number (if line counting is enabled), column (if column counting is enabled), the matched text, the executed action (with line number in the specification), and the integer value of the token declared in sym.java.

You can generate a scanner for mjava and run it on an input program such as SampleProgram.mjava by means of the Windows batch file mjava.bat or the Linux shell script mjava.spt :
1. put the mjava.flex, sym.java, SampleProgram.mjava and mjava.bat / mjava.spt files in a same directory and “cd” into it
2. run mjava.bat / mjava.spt
3. you will get something like SampleProgramTokens.txt printed to your Java console.

4、Parsing 解析

Now we have to specify the syntactic structure of mjava well-formed programs.
By means of Cup, we’ll then transform the parser specification (grammar, conflict resolution directives, ...) into a Java program implementing an LALR parser.
The parser specification of the mjava language is reported in the mjava.cup file, that uses the scanner generated by mjava.flex to perform lexical analysis and then generates the interface sym.java and the parser mjavac.java.
We have slightly modified the previous version of mjava.flex, by introducing the class JavaSymbol.java to retrieve more detailed information about lexical tokens.

You can generate a parser for mjava and run it on an input program such as SampleProgram.mjava by means of the Windows batch file mjava.bat or the Linux shell script mjava.spt:
1. put the mjava.flex, mjava.cup, JavaSymbol.java, SampleProgram.mjava and mjava.bat / mjava.spt files in a same directory and “cd” into it
2. run mjava.bat / mjava.spt
3. the files Scanner.java, mjavac.java and sym.java will be generated
4. you will get something like SampleProgramParsingResult.txt printed to your Java console.

5、Error handling 错误处理

The parser produced so far has a major drawback: on receiving an input program with syntactic errors, such as ErroneousProgram.mjava, it will stop with a Fatal Syntax Error Exception on the first error occurrence, as shown in ErroneousProgramResult.txt.
This behavior is quite unkind to the user, who would like to have all the errors reported, not just the first one.
Cup supports a local error recovery mechanism, based on a special error symbol, to allow the parser to continue when a syntactic error occurs: whenever the error symbol appears in a grammar rule, it can match a sequence of erroneous input symbols.
On encountering an error action, the parser:
1. pops the stack until a state including an item obtained from an error rule A ::= error a is reached
2. shifts a fictitious error token onto the stack, as though error was found on input
3. skips ahead on the input, discarding symbols until a substring is found that can be reduced to a
4. reduces error a to A
5. emits a diagnostic message
6. resumes normal parsing.

Unfortunately this error recovery mechanism is not powerful enough to correctly report any kind of syntactic error: error rules cannot be inserted anywhere into an LALR grammar because they may introduce both shift/reduce and reduce/reduce conflicts.
A parser with error rules for mjava is specified in mjava.cup.
When run on the input program ErroneousProgram.mjava, it will produce the output reported in ErrorHandlerResult.txt on the Java console.

6、Symbol table 语法表

Once completed the syntax analysis, we can start to develop the semantic analysis phase.
First of all we have to design the symbol table organization, in order to hold information about source-program constructs.
Information will be collected incrementally by the analysis phase and used by the synthesis phase to generate the code.
Entries in the symbol table will contain information about every identifier, such as its lexeme, its type, its position in storage, and any other information relevant to the translation process.
Since mjava admits nested environments, where identifiers can be redeclared, we support multiple declarations of a same identifier by setting up a separate symbol table for each scope.
The most-closely nested rule will be implemented by chaining the symbol tables so that the table for a nested scope points to the table for its enclosing scope.
An example of multiple declarations, their scopes, and the corresponding chained symbol tables is shown in the figure:

语法表样例与说明

The implementation of chained symbol tables is defined by the class Env in Env.java.
This class models an environment by means of an HashMap (table) and a reference (prev) to the table of the enclosing scope; the static variables top and root maintain the references to the current environment and to the root environment respectively.
Entries in a table are key-value pairs, where the key is the lexeme of an identifier and the value is an instance of the class Symb, declared in Symb.java, that will contain information about the identifier (in this first implementation Symb is empty).
The following static methods are declared in class Env :
1. put adds a new entry in the current environment if its lexeme is not already there
2. get retrieves an entry in the chain of tables starting from the current environment
3. putClass adds a new entry in the root environment if its lexeme (a class name) is not already there and its super class (if any) has already been declared
4. push creates a new environment and makes it the current environment
5. pop restores the previous environment.
The current environment is printed on the Java console anytime it is modified.
The construction of symbol tables for mjava is specified in mjava.cup.
You can generate the parser and run it on the input program SampleProgram.mjava by means of mjava.bat or mjava.spt (Env.java and Symb.java should be stored in a subdirectory named symtab).
The output reported in SymbolTableResult.txt will be printed on the Java console.

7、Types 数据类型

The structure of a type in mjava is represented by a type expression defined as:
· a basic type ( integer, floating, character, boolean, void, error )
· the expression formed by applying a type constructor (name, array, product, method, constructor, reference ) to a type expression.

Types are implemented by the class hierarchy in package type : the superclass Type in Typejava, declares a field (tag) corresponding to either a basic type or a type constructor, and a field (width) containing the number of bytes required to store variables of that type.
Each type constructor is implemented by a corresponding subclass of Type defined, respectively, in Name.java, Array.java, Product.java, Method.java, Constructor.java, and Reference. java.
Type expressions are strings formed by:
· just a tag (for basic types)
· the tag of a type constructor followed by the type expression to which it applies (for constructed types).

In order to efficiently check type equivalence, we maintain a dictionary of the types declared in a source program by means of the HashMap table in class Type : entries in the dictionary are key-value pairs where the key is a type expression and the value is the corresponding Type object.
When a type is encountered in the source program, its type expression is searched for in the dictionary: if present, the corresponding Type object is retrieved and used, otherwise a new Type object is created and a new entry is added to the dictionary.
In this way two types are equivalent if and only if they are represented by the same Type object.
To complete the construction of symbol tables, we have to specify the information we want to collect about identifiers in the class Symb.
In Symb.java we have declared the following fields:
· a reference (type) to a Type object
· a reference (ownerClass) to the Name object of the class where the identifier is declared
· a boolean value (pub), indicating whether the identifier is public or private.

In order to admit forward references, it is necessary to postpone type checking until all the identifiers have been inserted into the symbol table, and therefore to organize the front end in two passes: the first one to create the symbol table, the second one to perform type checking and intermediate code generation.
The front end for mjava specified in mjava.cup implements two passes by reading and processing the source file two times .
The static variables parser.first and parser.second indicate the current pass.
In the second pass the parser moves through the chains of symbol tables, following the same path followed in the first pass.
This behavior is obtained by saving, in the first pass, the current (top) environment into the ArrayList newEnvs (in class Env ) anytime it is going to be changed by push and pop operations.
In the second pass the sequence of environments saved in newEnvs is restored by invoking method next of class Env in Env.java, in place of push and pop.
You can generate the parser and run it on the input program SampleProgram.mjava by means of mjava.bat or mjava.spt (the type package should be stored in a subdirectory named type).
The output reported in TypesResult.txt will be printed on the Java console.

8、Type Checking 类型检查

In the second pass, the compiler front end must firstly perform type checking, to assure that the type of a construct matches the type expected by its context.
Type checking rules have to verify that operators and functions are applied to the right number and types of operands.
Actual types must match with expected types even when coercion occurs (the type of an operand is automatically converted to the type expected by an operator).
The type checker for mjava is specified in mjava.cup.
You can generate the parser and run it on the input program WrongProgram.mjava by means of the files in TypeChecher.rar.
The output reported in TypeCheckerResult.txt will be printed on the Java console.

9、Code generation 代码生成

Finally the front end of a compiler has to generate intermediate code.
Since we don't aim at developing a new back end for our compiler, the choice of an intermediate representation language is determined by the choice of an already available back end.
The LLVM compiler infrastructure is a collection of source code that implements static back ends for the X86, X86-64, PowerPC, PowerPC-64, ARM, Thumb, SPARC, Alpha, and IA-64 architectures.
LLVM uses a virtual instruction set that provides type safety, low-level operations and flexibility.
We have installed LLVM-2.7 revision on the Linux Fedora 13 platform and used the shell script llvm.spt to perform the following steps:
1. assemble a LLVM program ( .ll file ) into LLVM bitcode ( .bc file )
2. compile LLVM bitcode into Intel assembly language for the x86 architecture ( .s file )
3. run LLVM bitcode .

The LLVM code generator for mjava is specified in mjava.cup.
In this first implementation, we have only considered non object-oriented mjava programs, such as Fact.mjava, where no class instance is created and all the methods are called internally to their declaring classes.
You can generate the front end, run it on the input program Fact.mjava to produce the LLVM program Fact.mjava.ll, and finally launch llvm by means of mjava.spt and the files in CodeGenerator.rar.
The x86 assembly program Fact.mjava.s will be emitted and the output reported in CodeGeneratorResult.txt will be printed on your terminal.
You can also try to compile and run the other mjava programs stored in the CodeGenerator/source directory, or to write and test your own mjava programs.

10、Object oriented code generation 面向对象的代码生成

In object oriented languages such as Java (and mjava), array and object variables behave like pointers whose values are assigned when new instances are created.
When a method is called externally to its declaring class, the following operations must be performed:
1. determine the class in which the method is declared and look for the method in that class
2. pass the object on which the method is called as an extra argument of the method call.

The translation of method declarations must avoid name clashes among methods declared with the same name in different classes: for this reason the name of the translated method is the concatenation of the class name and the method name.
The final version of the LLVM code generator for mjava is specified in mjava.cup.
You can generate the front end, run it on the input program SampleProgram.mjava to produce the LLVM program SampleProgram.mjava.ll and finally launch llvm by means of mjava.spt and the files in OOCodeGenerator.rar.
The x86 assembly program SampleProgram.mjava.s will be emitted and the output reported in OOCodeGeneratorResult.txt will be printed on your terminal.
You can also try to compile and run the other mjava programs stored in the OOCodeGenerator/source directory, or to write and test your own mjava programs.

本文原文来自:http://staff.polito.it/silvano.rivoira/HowToWriteYourOwnCompiler.htm,转载请注明原始出处。


分享到:
评论

相关推荐

    MFC编写的C编译器

    C编译器是一种将高级语言(如C语言)编写的源代码转换为计算机可执行的机器代码的程序。它扮演着至关重要的角色,因为C语言是许多操作系统、应用程序以及嵌入式系统的基础。MFC(Microsoft Foundation Classes)是...

    C#编写的词法编译器

    总的来说,"C#编写的词法编译器"是一个实用的工具,它结合了编译原理和C#编程技术,提供了自定义和测试词法分析规则的能力。通过状态转换矩阵和图形化界面,用户可以深入理解编译器的工作机制,并且有效地开发和调试...

    c语言编写的c语言编译器.zip

    在用C编写C编译器时,你需要实现自己的预处理器来处理这些特性。 此外,C语言的指针操作需要特别注意,因为它们允许直接访问内存,这在编译器中可能涉及到复杂的内存模型和类型检查。 在实现过程中,可能会用到...

    C++编写的小型编译器

    可以使用单元测试框架(如Google Test)来编写自动化测试,同时利用调试工具(如GDB)来查找和修复程序中的错误。 总之,这个基于C++的小型编译器项目涵盖了编程语言基础、文本编辑器设计、文件操作、命令处理、...

    用C++编写的编译器

    它的强大在于其灵活性和高效性,使得编写编译器这样的系统软件成为可能。 "tiny"一词在这里可能是指这个编译器项目的规模相对较小,可能是为了简化教学或者快速实现编译器的基本功能。一个小型编译器通常会包括词法...

    简单易读的JavaScript编写的最小编译器

    JavaScript编写的最小编译器,通常被称为微型编译器,是一种用JavaScript实现的简单...通过实践和探索这个项目,你可以深化对编程语言本质的理解,提高自己的编程能力,并为进一步深入研究编译器技术奠定坚实的基础。

    C#编写的语法编译器

    在C#中编写编译器,我们需要关注的是语法分析部分,这部分主要负责验证输入代码是否符合特定的语法规则。 1. **词法分析**:这是编译过程的第一步,也称为扫描或词法分解。程序源代码被分割成一个个称为“标记”的...

    自己动手写编译器链接器PDF及源码.rar

    同时,源码部分则提供了实际操作的机会,让读者能够动手编写简单的编译器和链接器,从而更深入地理解这些工具的工作原理。 在源码实践中,读者可能需要了解汇编语言、数据结构和算法、编译原理等相关知识。例如,...

    自己动手写编译器链接器高清完整版

    ### 自己动手编写编译器和链接器的意义 学习如何编写自己的编译器和链接器不仅可以帮助程序员深入理解程序的执行过程,还能提升对底层系统架构的理解。通过实践编写编译器和链接器,开发者能够更好地掌握计算机科学...

    自己动手写编译器、链接器_编译器_

    本书介绍的SCC编译器没有借助Lex与 Yacc这些编译 器自动生成工具纯手工编写而成更便于学习和理解。为了生成可以直接运行EXE文件本书还实现了 一个链接器。读完本书读者将知道一门全新的语言如何定义一个真实的编译器...

    用MFC编写一个编译器程序、词法分析

    在MFC框架下编写编译器,首先需要了解MFC的基本架构,包括消息映射、文档/视图模型、控件和对话框的使用。MFC提供了一套丰富的类库,使得开发者可以方便地创建图形用户界面(GUI),这对于一个编译器的前端至关重要...

    c++编写的Brainfk编译器

    这个项目是用C++编写的一个Brainfuck编译器,可以将Brainfuck源代码转换为可执行的机器码,从而在计算机上运行。 **1. Brainfuck语言简介** - Brainfuck的基本命令包括:`+`(增加当前指针的值)、`-`(减少当前...

    VB6编写的编译器

    《VB6编写的编译器:深入理解与探讨》 Visual Basic 6(简称VB6)是Microsoft在1998年推出的一款经典的编程环境,它以其易学易用的特性,深受程序员喜爱。VB6编写的编译器,如"Visia Compiler 2018",是一个专门...

    用C++编写的Lex 编译器

    LEX广泛地用于各种语言的词法分析器的描述,我们...本程序用C++语言来编写这种编译器,同时用此款编译器分析C++语言的子集。 压缩包中包括编译器生成程序和应用部分,主要算法和功能有说明进行解释。其中算法参考龙书。

    自己动手写编译器链接器

    编写自己的编译器和链接器不仅可以帮助深入理解这些工具的工作原理,还可以提升编程技能。以下是一些关键点: 1. **选择编程语言**:首先,你需要选择一种编程语言来实现你的编译器和链接器。常见的选择包括C/C++、...

    vc一个自己写的编译器

    标题中的“vc一个自己写的编译器”表明这是一个基于Visual C++(简称VC)平台自制的编译器项目。在编程领域,编译器扮演着至关重要的角色,它将高级语言源代码转换为机器可执行的二进制代码。下面我们将深入探讨...

    cpp-8ccvim采用Vim脚本编写的C编译器

    《使用Vim脚本编写的C编译器:8cc.vim详解》 在IT行业中,开发工具的选择往往直接影响到编程效率与代码质量。本文将深入探讨一款独特...通过这个压缩包文件"8cc.vim-master",你可以下载并亲自体验这一创新的C编译器。

    decaf简易编译器实验

    《深入理解decaf简易编译器实验:从概念到实践》 decaf简易编译器实验,是一项旨在...通过这个实验,你可以掌握编译器设计的关键概念,这对于理解软件开发的底层机制以及未来在软件工程领域的深造都有着深远的影响。

    自己动手写编译器、链接器高清带目录PDF书籍+源码

    这本书的配套源码能够帮助读者通过实践加深对理论的理解,通过编写自己的小型编译器和链接器,你可以更直观地看到这些过程如何在实际中运作。这不仅锻炼了编程技能,也有助于培养解决问题和调试的能力。 在学习过程...

    c语言编写的pl0 文发编译器

    C语言因其高效和可移植性,常被用于编写编译器。 在PL0编译器中,词法分析器(也称为扫描器)会识别输入源代码中的关键字、标识符、数字、运算符和分隔符等基本元素。这些元素被称为“符号”(tokens),并被转化为...

Global site tag (gtag.js) - Google Analytics