- Rui Ueyama's chibicc
- Daisuke Kashiwagi's babygo
- Thorsten Ball's following two books
- Writing An Interpreter In Go (Yes, the "An" and "In" are unfortunately stylized as capitalized. See title case.)
- Writing A Compiler In Go (This casing yet again pains me...)
When ./cmd/x86/main.go is invoked, input nit source code goes through 5 stages to produce an executable. Usage of this command:
-asm
only output assembly; do not output object file or executable
-asm-path string
(optional) path to the output assembly file. defaults to [base-path].asm
-file string
path to a file containing nit source code
-obj-path string
(optional) path to the output file. defaults to [base-path].o
-out string
(optional) path to the output executable. defaults to "[base-path].exe"
-run
run the generated executable
base-path refers to the name of the .nit file with its extension stripped.
However, if the input file does not end in .nit, base-path refers to the
name of this file itself.
The lex package's Lexer takes in or reads nit source code and breaks it up
into tokens. For example,
fun sum(a, b int) int {
let c = 0 // comment are ignored
return a + b + c
}Gets translated into
type Token struct {
Kind Kind
Literal string
// ... (omitted for simplicity)
}
{Fun "fun"} {Ident "sum"} {LParen "("}
{Ident "a"} {Comma ","} {Ident "b"} {Ident "int"}
{RParen ")"} {LCurly "{"} {Newline "\n"}
{Let "let"} {Ident "c"} {Assign "="} {Number "0"} {Newline "\n"}
{Return "return"} {Ident "a"} {Plus "+"} {Ident "b"} {Plus "+"}
{Ident "c"} {Newline "\n"}
{RCurly "}"} {Eof ""}When the lex package is used by itself, a *Lexer can be created and used in
the following way:
// Create a *Lexer
func New(input string, filename string) *Lexer
func NewFromReader(rd io.Reader, filename string) (*Lexer, error)
func NewFromFilename(name string) (*Lexer, error)
// Eg, iterate through tokens and print each
import (
"log"
"nit/lex"
)
func main() {
lexer, err := lex.NewFromFilename("./examples/ret2.nit")
if err != nil {
log.Fatalln(err)
}
numTokens := 0
for {
token := lexer.EatToken()
log.Println(token)
numTokens += 1
if token.Kind == lex.Eof {
// Reached end of file
break
}
}
log.Printf("\ngot %d token(s)\n", numTokens)
}
The ast package's Parser uses lex.Lexer to step through each token and
produce a recursive tree structure for later stages to be able to easily step
through the code.
The parser would produce roughly the following (simplified for demonstration).
&Fun{
Name: &Ident{"sum"},
Params: []FunParam{
{Name: &Ident{"a"}, Type: &Ident{"int"}},
{Name: &Ident{"b"}, Type: &Ident{"int"}}},
RetType: &Ident{"int"},
Body: &Block{
Stmts: []Stmt{
&Let{
Name: &Ident{"c"},
Value: &Int{0}},
&Return{
Vals: []Expr{
&BinaryExpr{
Left: &Ident{"a"},
Operator: lex.Token{Kind: lex.Plus},
Right: &BinaryExpr{
Left: &Ident{"b"},
Operator: lex.Token{Kind: lex.Plus},
Right: &Ident{"c"}}}}}}}}When the ast backage is used by itself, a *Parser can be created and user in
the following way:
// Create a *Parser
func NewParser(input string, filename string) *Parser {
func NewParserFromFilename(filename string) (*Parser, error) {
func NewParserFromReader(rd io.Reader, filename string) (*Parser, error)
// Eg, print the AST
import (
"log"
"nit/ast"
)
func main() {
parser, err := ast.NewParserFromFilename("./examples/ret2.nit")
if err != nil {
log.Fatalln(err)
}
program, err := parser.ParseProgram()
if err != nil {
log.Fatalln(err)
}
log.Println(program)
log.Printf("\ngot %d top-level statement(s)\n", len(program.Stmts))
}The seman (semantic **an**alysis) package walks the raw, untyped AST
generated by ast.Parser validates that the language's semantics are adhered
to. That is, it type-checks the AST. It makes sure that, among other things,
sumis not a name that is already taken (for instance, by another global function or variable)sums parametersaandbactually have a specified typecis not a taken name, and0is an expression that produces a type (as opposed to, for example, a call to a function that returns nothing)a + bis a valid operation since both sides are of the same type, andints support the binary operation+cis in scope and its value can likewise be added to the result ofa + bsumreturns a singleintas it declared it would to so
The asmgen package generates x86_64 assembly which invokes Linux system
calls (to achieve things like exiting the program). The generated assembly
follows the design of a simple stack machine. The generated x86_64 assembly
would look something like this:
global sum
sum:
; set up stack frame
; redundant push in nit's calling convention? probably
push rbp
mov rbp, rsp
; allocate local variables
; in this case, the only variable is c and it is of type int (size 8)
sub rsp, 8
push 0 ; stack = [0]
pop rax ; rax = 0; stack = []
mov [rbp - 8], rax ; c = 0
; function parameters are pushed to the stack just before the function is
; called, so we access them using the base pointer + an offset, skipping
; over the return address.
; - rbp + 8 (upwards) contains the return address
; - rbp + 16 contains b
; - rbp + 24 contains a
mov rax, [rbp + 24] ; rax = a
push rax ; stack = [a]
mov rax, [rbp + 16] ; rax = b
push rax ; stack = [a, b]
pop rcx ; rcx = b; stack = [a]
pop rax ; rax = a; stack = []
add rax, rcx ; rax += rcx
push rax ; stack = [a+b]
mov rax, [rbp - 8] ; rax = c
push rax ; stack = [a+b, c]
pop rcx ; rcx = c; stack = [a+b]
pop rax ; rax = a+b; stack = []
add rax, rcx ; rax += rcx
push rax ; stack = [a+b+c]
; the caller grows its stack to accomodate the function's return values,
; and the callee places the values in those slots
pop rax ; rax = a+b+c, stack = []
mov [rbp + 32], rax ; return_slot = a+b+c
; deallocate stack frame and return
mov rsp, rbp
pop rbp
retnit has a simple calling convention:
- The caller grows its stack to make space for the return values. The callee places the return values in these slots using the base pointer.
- The caller pushes its arguments to the stack. The callee accesses these arguments using the base pointer.
The order of both the arguments and the return values is as declared in the
source code: left-to-right in nit: top-to-bottom in the stack.
This step generates an .asm file (if its io.Writer was indeed a handle to
a file).
The x86 package
- Invokes the host's
nasmto assemble the assembly file into anELF64.oobject file. - Then, it passes the
.ofile to the host'sgccto link the object file vialdand generate the final executable, ready to be run. - In addition, if a
mainfunction was declared, an executable file containingmainas its entry point will be produces, ready to be run on anx64_64Linux environment.
I don't like that intermediate files have to be created for the .asm and .o
files before the executable is able to be generated; however, I was not able to
find library forms of gcc or nasm for Go. I am considering switching to
another assembler or writing C bindings myself in order to prevent generating
intermediate/temporary files, since disk read/writes are much slower than memory
manipulation.