Dave's nasmjf Dev Log 01
Created: 2022-07-22
This is an entry in my developer’s log series written between December 2021 and August 2022 (started project in September). I wrote these as I completed my port of the JONESFORTH assembly language Forth interpreter.
First log session to test what I've got so far. GNU Debugger recorded in
GNU screen for the full GNU experience.
I'll clean up a lot of the gdb prompts and stuff for clarity.
Reading symbols from nasmjf...
Breakpoint 1 at 0x804900e: file nasmjf.asm, line 80.
Breakpoint 1, _start () at nasmjf.asm:80
80 cld ; Clear the "direction flag" which means the string instructions (such
82 mov [var_S0], esp ; save the regular stack pointer (used for data) in FORTH var S0!
84 mov ebp, return_stack_top ; Initialise the return stack pointer
Trying a defined "function" in GDB to cut down on the typing. I always
have to cast the NASM labels to (int) since the debugging info has no
way of telling GDB what I'm storing there. "int" in this case just means
I've got a 4-byte (32 bit) value. GDB has a strong C heritage.
p - displays the VALUE of the label, which is an address
x - displays the memory at the address
p/x and x/x display as hexadecimal
*(int) uses the address stored AT the memory referenced by the label
(again, strong C heritage in this syntax)
All three of these won't always be relevant, but it saves a lot of typing.
(gdb) define foo
Type commands for definition of "foo".
End with a line saying just "end".
>p/x (int)$arg0
>x/x (int)$arg0
>x/x *(int)$arg0
>end
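To make the three commands in 'foo' concrete, here is a small Python model (not GDB itself) of what each one reads when the argument is a register such as $eax. The `memory` dict and the Python `foo` function are illustrative assumptions; the two addresses and values are copied from the GDB output further down in this log.

```python
# Toy model of the GDB 'foo' helper. 'memory' maps an address to the
# 32-bit value stored there; the entries are copied from the GDB output
# later in this log.
memory = {
    0x804a010: 0x08049000,   # a cell holding the address of docol
    0x08049000: 0x89fc6d8d,  # first four bytes of docol's machine code
}

def foo(value):
    """Return what p/x, x/x and x/x * would show for 'value'."""
    printed = value                 # p/x (int)$arg0: the value itself
    at_value = memory[value]        # x/x (int)$arg0: memory at that address
    at_pointer = memory[at_value]   # x/x *(int)$arg0: follow the pointer once more
    return printed, at_value, at_pointer

for shown in foo(0x804a010):
    print(hex(shown))
```

Each step is one more level of indirection than the one before it, which is why not all three lines are meaningful for every argument.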
Initial nonsense over. Now we use the main mechanism that drives the Forth
instructions: the NEXT macro is inlined at the end of every word and here
to bootstrap the action. cold_start contains the address of the "QUIT" word.
(quit is a silly name - it doesn't quit Forth, it "quits" TO the interpreter)
(side note: I'd like everything to be lowercase except assembly macros. But
after 'quit' and 'docol', I haven't been good about converting them. Will
probably do a couple rounds of cleanup at some point...)
NEXT loads the address of the next instruction and we jump to it, executing
the machine code there.
_start () at nasmjf.asm:88
88 mov esi, cold_start ; give next forth word to execute
27 lodsd ; NEXT: Load from memory into eax, inc esi to point to next word.
28 jmp [eax] ; Jump to whatever code we're now pointing at.
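The two instructions of NEXT can be modelled in a few lines of Python. This is only a sketch of the idea (indirect threaded code), not the assembly itself: memory is a dict, `COLD_START` is a made-up address, and the QUIT and docol addresses are the ones that show up in the GDB output later in this log.

```python
# Toy model of NEXT. Memory is a dict of address -> 32-bit value.
COLD_START = 0x1000        # hypothetical address of the cold_start cell
QUIT = 0x804a010           # QUIT's first cell (seen later via $eax)
DOCOL = 0x08049000         # address of docol's machine code

memory = {
    COLD_START: QUIT,      # cold_start holds the address of QUIT
    QUIT: DOCOL,           # QUIT's first cell holds the address of docol
}

def next_(esi):
    """lodsd, then jmp [eax]: return (eax, new esi, jump target)."""
    eax = memory[esi]      # lodsd: load the cell esi points at
    esi += 4               # ...and advance esi to the following cell
    target = memory[eax]   # jmp [eax]: jump through the cell eax points at
    return eax, esi, target

eax, esi, target = next_(COLD_START)
print(hex(eax), hex(esi), hex(target))
```

Note the double indirection: esi points at a cell, that cell points at a word, and the word's first cell points at the machine code that actually runs.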
Since QUIT is defined with the DEFWORD macro, it begins with a pointer to
the 'DOCOL' code - which, in essence, sets up the rest of the Forth word
to be executed (QUIT, in this case) for another call to NEXT.
docol () at nasmjf.asm:40
40 lea ebp, [ebp-4] ; "load effective address" of next stack position
41 mov [ebp], %1 ; "push" the register value to the address at ebp
70 add eax, 4 ; eax points to docol (me!) in word definition. Go to next.
Here I use that 'foo' function to see if that's true about the eax register.
Note that the add 4 instruction has NOT yet executed. GDB always shows the
next instruction before you tell it to step forward to that instruction!
(gdb) foo $eax
$9 = 0x804a010
0x804a010: 0x08049000
0x8049000 <docol>: 0x89fc6d8d
Yup! It points to DOCOL all right. Now we step and add 4 to eax:
(gdb) s
71 mov esi, eax ; Put the next word pointer into esi
(gdb) foo $eax
$10 = 0x804a014
0x804a014: 0x0804a12c
0x804a12c: 0x08049218
Every single Forth word ends with NEXT, which executes the next word.
In this case, it's happening at the end of DOCOL (and DOCOL's job is
to get everything set up to have NEXT execute the rest of the word...)
(gdb) s
27 lodsd ; NEXT: Load from memory into eax, inc esi to point to next word.
28 jmp [eax] ; Jump to whatever code we're now pointing at.
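Here is a rough Python model of what DOCOL plus the trailing NEXT just did. It assumes, as in the original JONESFORTH, that the register DOCOL saves on the return stack is esi (the log only shows the macro parameter %1). The starting esi and ebp values are invented; the other addresses are copied from the GDB output.

```python
# Toy model of DOCOL followed by NEXT. Entries marked "from the log"
# are copied from the GDB output; everything else is made up.
memory = {
    0x804a010: 0x08049000,   # QUIT's first cell -> docol (from the log)
    0x804a014: 0x0804a12c,   # first word in QUIT's body: R0 (from the log)
    0x804a12c: 0x08049218,   # R0's first cell -> code_R0 (from the log)
}

def docol(eax, esi, ebp):
    """Save esi on the return stack, then point esi at the word body."""
    ebp -= 4                 # lea ebp, [ebp-4]
    memory[ebp] = esi        # mov [ebp], esi (assumed): push on return stack
    eax += 4                 # add eax, 4: skip past the pointer to docol
    esi = eax                # mov esi, eax: esi -> first word of the body
    return esi, ebp

def next_(esi):
    """lodsd, then jmp [eax]: return (eax, new esi, jump target)."""
    eax = memory[esi]
    return eax, esi + 4, memory[eax]

# Hypothetical starting state: these esi and ebp values are invented.
esi, ebp = docol(eax=0x804a010, esi=0x1004, ebp=0x2000)
eax, esi, target = next_(esi)
print(hex(eax), hex(target))
```

The saved esi is what lets a later word return to wherever the caller left off once this word's body is finished.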
Double-checking that the instructions in QUIT are what we'll be running
now...
(gdb) foo $eax
$12 = 0x804a12c
0x804a12c: 0x08049218
0x8049218 <code_R0>: 0x04c30868
Yes! The 'R0' constant is the first thing we run in QUIT! It's really wild
how constants in Forth are actually words with a single instruction that
pushes a value onto the stack! In this case, R0 is the top of the return
stack.
The push %5 line is from the DEFCONST macro, which, in turn, calls the
DEFCODE macro because consts are words. Then the NEXT macro continues to
the next word in QUIT...
code_R0 () at nasmjf.asm:568
568 push %5
code_R0 () at nasmjf.asm:27
27 lodsd ; NEXT: Load from memory into eax, inc esi to point to next word.
28 jmp [eax] ; Jump to whatever code we're now pointing at.
...which happens to be RSPSTORE, which pops that value off the data stack
and into ebp, the return stack pointer.
code_RSPSTORE () at nasmjf.asm:201
201 pop ebp
code_RSPSTORE () at nasmjf.asm:27
27 lodsd ; NEXT: Load from memory into eax, inc esi to point to next word.
28 jmp [eax] ; Jump to whatever code we're now pointing at.
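As a sketch of those last two words: R0 pushes one value onto the data stack and RSPSTORE pops it straight into the return stack pointer, which resets the return stack. The Python below is a toy model; `RETURN_STACK_TOP` is a made-up address standing in for return_stack_top, and the data stack is just a list.

```python
# Toy model: the data stack is a Python list, ebp is a plain integer.
RETURN_STACK_TOP = 0x2000    # hypothetical address, not the real one

def r0(stack):
    """The R0 constant: a word whose whole job is to push one value."""
    stack.append(RETURN_STACK_TOP)   # push %5 in the DEFCONST macro

def rspstore(stack):
    """RSPSTORE: pop the data stack; the result becomes the new ebp."""
    return stack.pop()               # pop ebp

stack = []
ebp = 0x1ff0                 # pretend a few return addresses were pushed
r0(stack)
ebp = rspstore(stack)
print(hex(ebp), stack)
```

After the pair runs, the data stack is unchanged and ebp is back at the top of the return stack, which is what you want at the start of the interpreter loop.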
...and then QUIT runs INTERPRET, which takes words on STDIN and then
...calls _WORD to get a word from input which
...calls _KEY to get a character ("key") of input
code_INTERPRET () at nasmjf.asm:209
209 call _WORD ; Returns %ecx = length, %edi = pointer to word.
_WORD.skip_non_words () at nasmjf.asm:310
310 call _KEY ; get next key, returned in %eax
_KEY () at nasmjf.asm:351
_KEY first checks to see if it needs input (currkey has reached
bufftop). On first run, they're both zero, so yeah, we need more
input.
Aside: again, "key" isn't how we would normally describe this in
a modern environment - it's the next "character" (and even that's
becoming a thing of the past now that Unicode is pretty much standard
everywhere...).
Anyway, comparing currkey (ebx = 0) and bufftop (0) sets the Zero
Flag (ZF) because the difference between them is zero, so the jge
(jump if greater or equal) is taken. We can see ZF in the
'info reg' display below:
351 mov ebx, [currkey]
352 cmp ebx, [bufftop]
353 jge .get_more_input
(gdb) info reg
...
ebx 0x0 0
eflags 0x246 [ PF ZF IF ]
...
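For anyone who wants to check the flag logic by hand, here is a small Python sketch of how cmp sets the flags for 32-bit operands, when jge is taken, and how the eflags value from 'info reg' breaks down into named bits. The function names are my own; the second test case uses invented buffer addresses.

```python
def cmp32(a, b):
    """Return the flags x86 'cmp a, b' sets for 32-bit operands."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    zf = result == 0                     # zero flag: operands were equal
    sf = bool(result & 0x80000000)       # sign flag: top bit of the result
    cf = a < b                           # carry flag: unsigned borrow
    # overflow flag: operand signs differ and the result's sign differs
    # from the first operand's sign
    of = bool((a ^ b) & (a ^ result) & 0x80000000)
    return {"ZF": zf, "SF": sf, "CF": cf, "OF": of}

def jge_taken(flags):
    """jge jumps when SF == OF (signed greater-or-equal)."""
    return flags["SF"] == flags["OF"]

EFLAGS_BITS = {0: "CF", 2: "PF", 6: "ZF", 7: "SF", 9: "IF", 11: "OF"}

def decode_eflags(value):
    """Name the set bits in an eflags value, lowest bit first."""
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if (value >> bit) & 1]

flags = cmp32(0, 0)                      # currkey = 0, bufftop = 0
print(flags["ZF"], jge_taken(flags))
print(decode_eflags(0x246))
```

Once the buffer has been filled, currkey is below bufftop, the subtraction goes negative, SF no longer equals OF, and the same jge falls through instead.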
We get more input by telling Linux to give us input from
STDIN into a fixed-size buffer:
_KEY.get_more_input () at nasmjf.asm:361
361 xor ebx,ebx ; 1st param: stdin
362 mov ecx,buffer ; 2nd param: buffer
363 mov [currkey],ecx
364 mov edx,buffer_size ; 3rd param: max length
365 mov eax,__NR_read ; syscall: read
366 int 0x80 ; syscall!
Now I type "foo<enter>":
foo
We check to make sure the input isn't zero-length.
At an interactive prompt the <enter> key always gives us
at least '\n', but read does return 0 at end of input
(Ctrl-D on an empty line, or a piped file running out).
367 test eax,eax ; If %eax <= 0, then exit.
368 jbe .eof
369 add ecx,eax ; buffer+%eax = bufftop
370 mov [bufftop],ecx
We can see how long the input string is. Yup, 4 bytes is
right: "foo\n".
(gdb) foo $eax
$15 = 0x4
Now we're back to _KEY, having gathered some input.
We repeat the check...
371 jmp _KEY
_KEY () at nasmjf.asm:351
351 mov ebx, [currkey]
352 cmp ebx, [bufftop]
353 jge .get_more_input
This time we have input (and bufftop is at a higher
address than currkey), so we continue on by grabbing
the current "key" (character):
354 xor eax, eax
355 mov al, [ebx] ; get next key from input buffer
If that worked, the al register now has the first
character of "foo\n". Yup, there's the "f"! (p/c means
print as a character. We can also p/s to print a C-style
string.)
(gdb) p/c $al
$19 = 102 'f'
Now we set currkey to the next character and return...
356 inc ebx
357 mov [currkey], ebx ; increment currkey
358 ret
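The whole _KEY routine, including the refill path, can be sketched in Python roughly like this. It is a model under a few assumptions: the buffer size of 4096 is a placeholder (the log doesn't say what buffer_size is), currkey and bufftop are indexes rather than addresses, and end of input raises EOFError where the assembly jumps to .eof. A pipe stands in for typing at the terminal.

```python
import os

class KeyReader:
    """Toy version of _KEY: hand out one byte at a time, refilling on demand."""

    def __init__(self, fd, buffer_size=4096):
        self.fd = fd
        self.buffer_size = buffer_size   # placeholder for buffer_size
        self.buffer = b""
        self.currkey = 0                 # index of the next unread byte
        self.bufftop = 0                 # index one past the last valid byte

    def key(self):
        while self.currkey >= self.bufftop:            # jge .get_more_input
            data = os.read(self.fd, self.buffer_size)  # the read syscall
            if not data:                               # read returned 0
                raise EOFError("end of input")
            self.buffer = data
            self.currkey = 0                           # mov [currkey], ecx
            self.bufftop = len(data)                   # buffer + eax
        ch = self.buffer[self.currkey]                 # mov al, [ebx]
        self.currkey += 1                              # inc ebx
        return ch

# Feed "foo\n" through a pipe instead of typing it on stdin.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"foo\n")
os.close(write_fd)
reader = KeyReader(read_fd)
print(chr(reader.key()))
```

The point of the two-pointer scheme is that the system call only happens when the buffer runs dry; every other call to key is just a load and an increment.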
Back at _WORD, we check to see if we've hit a character
to skip. Forth is so syntactically simple, I just love it.
NOTE that the jbe instruction is "jump if below or equal"
(an unsigned comparison), so any character less than or equal
to an ASCII space (0x20) will cause us to keep seeking in the
.skip_non_words loop. This is a clever way to skip spaces,
tabs, newlines, returns, form feeds, etc. I'll improve the
comments for these instructions in the actual program now.
_WORD.skip_non_words () at nasmjf.asm:311
311 cmp al,'\' ; start of a comment?
312 je .skip_comment ; if so, skip the comment
313 cmp al,' ' ; space?
314 jbe .skip_non_words ; if so, keep looking
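A quick Python sketch of why the unsigned jbe matters here. The signed equivalent, jle, is shown only for contrast; it is not in the program. The byte 0xC3 is my own example: it is the first byte of many two-byte UTF-8 sequences, so it stands in for non-ASCII input.

```python
def jbe_taken(al, operand):
    """cmp al, operand / jbe: unsigned 'below or equal' (CF or ZF set)."""
    return (al & 0xFF) <= (operand & 0xFF)

def jle_taken(al, operand):
    """cmp al, operand / jle: signed 'less or equal', for contrast only."""
    def signed(b):
        b &= 0xFF
        return b - 0x100 if b & 0x80 else b
    return signed(al) <= signed(operand)

SPACE = ord(" ")
for byte in (ord("\t"), ord("\n"), SPACE, ord("f"), 0xC3):
    print(hex(byte), jbe_taken(byte, SPACE), jle_taken(byte, SPACE))
```

With the unsigned test, bytes of 0x80 and up count as part of a word. A signed test would treat them as negative and skip them like whitespace.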
Nope, character looks good. So we add it to word_buffer
in memory. The stosb instruction implicitly copies what's
in the al register (the 'b' is for byte) to memory at
the address stored in the edi register.
Then edi is incremented so that the next time this happens,
the next byte will go to the next position, and so forth.
This is the sort of thing we guaranteed when we cleared the
direction flag at the very beginning: with the flag clear,
string instructions like stosb count upward.
317 mov edi,word_buffer ; put addr to word return buffer in edi
Now that we've established that we're past any whitespace
and are gathering the actual input, we're in .collect_word.
I'll snip the stepping through _KEY for 'o', 'o', and '\n'
_WORD.collect_word () at nasmjf.asm:319
319 stosb ; add character to return buffer
320 call _KEY ; get next key, returned in %al
After every call to _KEY, we check to see if we're done
collecting the word. The ja instruction is "jump if above"
(unsigned greater than), which is the
exact opposite of the jbe check above.
To put it straight: before we were looping WHILE the
character was whitespace, now we loop UNTIL the character
is whitespace.
321 cmp al,' ' ; is blank?
322 ja .collect_word ; if not, keep looping
Now _WORD returns the length and address of the collected word.
325 sub edi, word_buffer ; hmm, the len?
326 mov ecx, edi ; return it
327 mov edi, word_buffer ; return address of the word
328 ret
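Put together, the two loops in _WORD look something like the Python below. This is a simplified sketch: it leaves out the backslash comment handling, takes its input from any iterator of byte values instead of calling _KEY, and will raise StopIteration if the input ends without trailing whitespace.

```python
def word(keys):
    """Toy version of _WORD. 'keys' is an iterator of byte values (ints)."""
    space = ord(" ")
    # .skip_non_words: loop WHILE the byte is a space or anything below it
    ch = next(keys)
    while ch <= space:
        ch = next(keys)
    # .collect_word: loop UNTIL the byte is a space or below again
    word_buffer = bytearray()
    while ch > space:
        word_buffer.append(ch)     # stosb: store al, advance edi
        ch = next(keys)            # call _KEY
    # sub edi, word_buffer gives the length; return it with the buffer
    return len(word_buffer), bytes(word_buffer)

length, text = word(iter(b"  foo\n"))
print(length, text)
```

The returned pair plays the role of ecx (length) and edi (pointer to the word). There is no terminator stored with the word, so the length is the only way to know where it ends.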
Then we return to INTERPRET from _WORD:
code_INTERPRET () at nasmjf.asm:212
212 xor eax,eax ; back from _WORD...zero eax
...
Let's check the return values now:
(gdb) p $ecx
$1 = 3
(gdb) x/3c $edi
0x804a068 <word_buffer>: 102 'f' 111 'o' 111 'o'
Yay! There's the "foo" string that was input.
Even though I've already got some of the _FIND word written (it
tries to match the input word), I think this has been quite enough
for one log. :-)