Work / Virtual Machine / Dev Blog

2011-02-19 23:25:41

Nested Labels

I have the intention of implementing support for nested labels (a la nasm). I think they are a really elegant way of achieving some form of logical compartmentalisation - i.e. *when* the high-level language is complete, its cross-compiler can spit out non-local labels for the 'functions', with local labels inside that don't interfere with the guts of other 'functions'. Without nested labels, any automatically generated label would need to be uniquely named, even for inner loops etc. which i would imagine might make any generated code quite annoying and/or difficult to read (but hey, what auto-generated code isn't fugly! ;p).

So this is the sort of syntax I'm aiming for (*tips hat to nasm*):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
; void someFunction()
someFunction:
    entr    1
    movi    0, 10
.loop:
    farc    cout
    dec     0
    jz      0, .exitLoop
    jmp    .loop
.exitLoop:
    leav
    retn

; void anotherFunction()
anotherFunction:
    entr    3
    movi    0, 0
    movi    1, 10
anotherFunction.loop:
    farc    cout
    inc     0
    sub     1, 0
    jz      1, anotherFunction.exitLoop
    jmp    .loop
.exitLoop:
    leav
    retn
 

If the identifier starts with a dot, the label is a nested (or local) label. If the identifier contains a dot, its a local label that has its namespace (non-local label context) explicitly defined. Each non-local label definition essentially defines a name-space in which all nested labels reside and can be referenced through. So to jump straight into the someFunction non-local label context if you were in a different context you could do:

1
2
    jmp    someFunction.loop
 

Or if you're in the someFunction context itself you could do:

1
2
    jmp    .loop
 

The compiler will see this as shorthand for someFunction.loop and substitute internally. There is an edge case that must be checked for which involves declaring a namespaced local label with an invalid context - i.e. the part before the dot doesn't match with the current non-local label context:

1
2
3
4
5
6
nonLocal0:
    jmp .loop
wibble.loop:
    die

 

Here, 'wibble' is invalid because it does not match the current context of 'nonLocal0'.


the way the parser currently works at the moment is roughly as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
for each token do the following
{
    if its a directive
    {
        process it and consume any additional tokens (dependant on directive)
    }
    else if its an identifier (any [a-zA-Z0-9_] string with optional terminating ':')
    {
        if its a label (has terminating ':')
        {
            process it
        }
        else if its an instruction
        {
            process it and consume any additional tokens (dependant on instruction)
        }
        else
        {
            syntax error
        }
    }
    else
    {
        syntax error
    }
}

 

Each 'process' call in the above pseudo-code calls one of the following functions:

1
2
3
4
5
6
7
bool process_directive(segment_type_t& segment_type, uint& token_index, parse_tree& p_output_tree, std::vector<lex_token>& p_input_stream);

bool process_instruction(segment_type_t segment_type, uint& token_index, parse_tree& p_output_tree, std::vector<lex_token>& p_input_stream);

bool process_label(segment_type_t segment_type, uint& token_index, parse_tree& p_output_tree, std::vector<lex_token>& p_input_stream);

 

It is in these functions that the logic for each syntax element type is placed and it is the process_label() function in particular that needs to be modified. The segment_type parameter specifies whether or not the current token is inside a data or code segment. The p_output_tree parameter is the syntax tree being generated. The p_input_stream is the token vector passed to the parser. I wrote this code a long time ago when i was still using what appears to be some rough form of Hungarian Notation (p is for parameter)!

Each process*() function operates directly on the input token vector and output syntax tree. The process_label() function manages a flat list of label identifiers. The actual storage of this list is in the syntax tree - i don't think the storage logic needs to be changed in order to support nested labels.


I'm not convinced there is any real need for a nesting depth more than 1 - not only would it make the syntax a bit messy and potentially hard to read (how many dots are preceding that identifier?) but it would add a significant amount of additional complexity to the parser logic. Enforcing a max depth of 1 keeps the syntax simple and clear (a preceding dot signifies local label, no dot is non-local) and also keeps the parser logic simple as well.

Parser simplicity really is the main driving force behind my intention here as I'm trying really hard to KISS because i feel this is the sort of application area where the complexity can run away with you really quickly (and before you know it you're lost) if you don't keep it under control.

The way a max depth of 1 keeps the parser simple is as follows - while iterating the tokens, if it comes across a label, the parser simply needs to check if it starts with a dot - if it does, its a local label - if it doesn't its a non-local label. Every time the parser encounters a non-local label, it remembers that label's identifier, until it encounters the next non-local label - and so on. When the parser encounters a local label, it simply pre-pends the non-local label identifier (that's it's currently remembering) to the local label identifier and pushes that concatenated identifier string onto the syntax tree's label identifier vector.

Thus the syntax tree's label identifier vector ends up containing identifiers like this (taking above vm asm code as input):

1
2
3
4
5
6
someFunction
someFunction.loop
someFunction.exitLoop
anotherFunction
anotherFunction.loop
anotherFunction.exitLoop

Exactly the same logic can be applied to instruction operands too :)


So, to sum up I think the parser needs to be changed in the following ways:

  • Make parser capable of remembering the current non-local label.
  • Alter syntax tree label push logic to support dotted identifiers.
  • Expand Instruction operand type enum to include local and non-local label variants.
  • Enable parser sensing of local and non-local label declarations and references.

the instructions that currently take labels as operands are as follows:

  • call
  • jmp
  • jz

 


 

I may implement this quite soon as it is something i have been thinking about doing for some time - its a negligible change to the language syntax and hopefully wont require too much alteration in the compiler logic (the execution core wont be affected by this as its effectively  just syntactic sugar :) Ooh i love this sorta stuff!


Now im trying to keep the compiler as thread safe as possible beause it would be nice for it to go multi-core later on. One cheeky way of doing this is to keep as much data on the stack as possible :p no static variables or shared heap instances. So the parser remebers the current non-local label context in a scoped stack variable which it passes to the process_label() and process_instruction() functions. the process_label() function must be able to alter the current label context so it is passed as a reference. the process_instruction() function merely needs to know of the current context if one of the instructions operands is a local label that must converted to an absolute label OR is an absolute label that must be confirmed to match the current context.