Welcome back! In the last log, I got string literals working with the 'quote' word. One thing lead to another and I ended up solving a lot of different problems so that now I can define words not just from "primitives" (machine code words written in assembler), but also words composed from *other* non-primitive words. So yeah, you can only make programs that print strings, but they are real programs. Okay, so now the fun TODOs! [ ] Pretty-print meta info about word! [ ] Loop through dictionary, list all names [ ] Loop through dictionary, pretty-print all (And I'm not being sarcastic or something - I'm really looking forward to these.) First, I need general number handling! I need to parse numbers from input and I need to be able to print numbers as strings. [ ] New word: str2num (ascii to integer) [ ] New word: num2str (integer to ascii) I'd also like to have the ability to intuitively append strings. Ideally: "There are " 5 " chickens." "Your name is '" username "'?" (I also don't have variables yet, so the username one is even more hypothetical. But you get the idea.) My new 'str2num' word seems to work. I realized I could use the exit status of the epplication since it is already popping a number from the stack (which I'd completely forgotten about, ha ha). I updated my `mr` (meow5 run) alias to print the exit status: alias mr=`./build.sh run ; echo "Exit status: $?"' The current parse/print number radix is set in a new "variable" currently set to decimal on startup: mov dword [radix], 10 And created these words to manage the radix: radix (takes number from stack) dec - sets radix to 10 hex - sets radix to 16 oct - sets radix to 8 bin - sets radix to 2 Now I'll try str2num out with a decimal number: 42 exit $ mr Exit status: 42 And I'll mess with it by giving it a hex number without updating the radix to hex: ff exit $ mr Could not find word "ff" while looking in IMMEDIATE mode. Exit status: 1 Now I'll compact the results a bit to not take up too much room here: hex ff exit <-- hexadecimal Exit status: 255 oct 100 exit <-- octal Exit status: 64 bin 101010 exit <-- binary Exit status: 42 (uh...seems right) bin 11111111 exit Exit status: 255 (yup, that's definitely right) Alright! Now this is getting fun. It'll be even better when it's taking input from STDIN to be fully interactive. Number printing is interesting. So I could just have it print numbers when requested, like Forth's DOT (.) word. But to print some mixed strings and numbers would look like this: "You own " print 16 printnum " snakes." print And like I wrote above in this log, I'd rather have strings and numbers automatically appended like this: "You own " 16 " snakes." print But I'm not sure if that's a good idea or not because I'm going to either have to further complicate my interpreter (it's already doing look-ahead for quotes so I don't have to have a space like in Forth: " foo") or some up with something else. One thing I've thought about is having interpolated strings. As long as I have some unambiguous symbol, I could have it pop numbers off the stack and append them to the string as I go: 16 "You own $ snakes." print I planed to have escapes anyway, so \$ for a literal '$' is no problem. Okay, I like that. So I need to write num2str now and then add number interpolation to my 'quote' word. This is gonna be cool! Next night: Wrote num2str. I've got a division error. Here's how I'm testing: ; test num2str push 42 ; num push dword [free] ; store here CALLWORD num2str pop eax ; chars written First, let's double-check the input: (gdb) break num2str 528 pop ebp ; address of string destination num2str () at meow5.asm:529 529 pop eax ; number num2str () at meow5.asm:530 530 mov ecx, 0 ; counter of digit characters 531 mov ebx, [var_radix] 532 mov edx, 0 (gdb) p/a $ebp $1 = 0x804b114 <-- addr for string storage (gdb) p $eax $2 = 42 <-- num to convert (gdb) p $ebx $3 = 10 <-- radix That's all good: string storage addr: 0x804b114 num to convert: 42 radix: 10 Now we'll start converting. I divide the number by the radix. The remainder is the digit that was less than the radix (so 0-9 for dec, 0-f for hex, etc.). The result of the division is the rest of the number. (gdb) s num2str.divide_next () at meow5.asm:534 534 idiv ebx ; eax / ebx = eax, remainder in edx (gdb) p $eax $6 = 4 <-- quotient (gdb) p $ebx $7 = 10 <-- radix (double-checking) (gdb) p $edx $8 = 2 <-- remainder (digit to convert) Yup, that looks good. We've got the first digit (working lowest to highest) to convert to a character (2) and the rest of the number with the first digit gone (4). I convert to the character equivalent and store that on the stack temporarily. This is because I convert starting from the least significant digit, but we want the string to have the number in the expected order, starting with the greatest significant digit. (gdb) s 535 cmp edx, 9 ; digit bigger than 9? (radix allows a-z) 536 jg .toalpha ; yes, convert to 'a'-'z' 537 add edx, '0' ; no, convert to '0'-'9' 538 jmp .store_char num2str.store_char () at meow5.asm:542 542 push edx ; put on stack (pop later to reverse order) 543 inc ecx 544 cmp eax, 0 ; are we done converting? 545 jne .divide_next ; no, loop Now it's on to the next division: Program received signal SIGFPE, Arithmetic exception. num2str.divide_next () at meow5.asm:534 534 idiv ebx ; eax / ebx = eax, remainder in edx There it is. Well, I'm not dividing by 0, so that's good. But then I remember idiv is signed division, and I'm not doing signed integers just yet. Maybe that's it? Ah, a quick look at a reference: "32-bit division with IDIV requires EAX be sign-extended into EDX." Yeah, I don't want to do signed division anyway because I don't have the ability to parse or print negative numbers (yet?) so I should really be using DIV anyway: Program received signal SIGFPE, Arithmetic exception. Nope! I know this isn't a divide-by-zero problem... Oh, I needed to read a bit more: "DIV Always divides the 64 bits value accross EDX:EAX by a value." Oh, right, I need to clear EDX to 0 in my division loop because it currently holds the character to convert. It makes total sense that the answer is overflowing. I was dividing by a huge 64-bit number the second time around! So I just need to clear edx before my division: 533 mov edx, 0 ; div actually divides edx:eax / ebx! 534 div ebx ; eax / ebx = eax, remainder in edx 535 cmp edx, 9 ; digit bigger than 9? (radix allows a-z) (gdb) p $eax $1 = 0 (gdb) p $edx $2 = 4 That's got it! The final digit (remainder after division) is 4 and there's nothing left to divide after that (quotient is 0). Once it sees the rest of the number is now 0, I store the string at the address provided and return the number of characters written (digits in radix) on the stack: 554 push eax ; return num chars written end_num2str () at meow5.asm:149 149 mov eax, [return_addr] ; RETURN (gdb) p $eax $1 = 2 (gdb) x/s $ebp 0x804b114 : "42" Correct! Two characters were written, and the string is "42", my value in decimal! Since I can now, I'll make the test print real nice and use the exit value to show the number of digits written (to make sure it's correct): ; test num2str CALLWORD decimal push dword 42 ; num push dword [free] ; store here CALLWORD num2str PRINTSTR "Answer: " push dword [free] ; print from here CALLWORD print CALLWORD newline CALLWORD exit ; stack still has chars written $ mr Answer: 42 Exit status: 2 Correct. Let's see some others: CALLWORD oct push dword 64 Answer: 100 Exit status: 3 CALLWORD hex push dword 3735928559 Answer: deadbeef Exit status: 8 CALLWORD bin push dword 7 Answer: 111 Exit status: 3 CALLWORD bin push dword 257 Answer: 100000001 Exit status: 9 Looks good! Now I'm kinda wishing the interpreter was interactive because that would have been way more fun than changing the assembly test for each run. But that'll be coming up soon enough. After sleeping on it, I'm really excited about the idea of string interpolation using '$' as a placeholder. So I'll make that the next todo: [ ] Add "$" placeholders so the 'quote' word so it can interpolate numbers from the stack into the string. I suspect the hard part will be remembering to push the stack values in the opposite order that they appear in the string...so we'll see what that's like in practice. Okay, I've got it working, but only just barely. It feels incredibly fragile. In particular, the interpreter has become way too complicated for my taste. Juggling registers willy-nilly and using the stack when I run out of places to put things is just sloppy. It does work. Cleaned up example: 42 "The answer is $." print 42 hex "The answer is 0x$ in hex." print 42 bin "The answer is $ in computer." print 42 oct "The answer is $ in octal." print Output: The answer is 42. The answer is 0x2a in hex. The answer is 101010 in computer. The answer is 52 in octal. Which is really cool. But I need to get my mess under control. For one thing, I've been using the registers very inconsistantly because I don't have any plan regarding where to put stuff. This article has helped a ton: https://www.swansontec.com/sregisters.html I'm not sure I'm going to re-write EVERYTHING to follow these guidelines, but new stuff definitely will and I'm going to immediately switch the 'quote' word to use the esi and edi registers! Using esi and edi in the 'quote' word helped make the usage much clearer and simplified some of the address manipulation considerably (the source and destination pointers aren't always moving in sync because escapes and placeholders in the source string may expand to a different number of characters in the destination string. Another thing that makes the interpreter so fragile is the way I have to juggle registers and the stack around to check for a valid numeric literal when an input token doesn't match any of the exisitng words. So what I'm thinking is that just like quote, I'll test for tokens that start with a number and handle them _before_ trying to find them in the dictionary. Consequently, I won't be able to start any word or variable names with a number. (Like Forth allows.) In exchange, it should simplify the interpreter a fair amount in some messy areas. [ ] Break out 'number' into separate word that does pre-check like 'quote'. This will also give it a nice symetry with the way I'm handling string literals (my first big departure from pur Forth syntax). Three sleepy nights later: Well, it works...but I've now exposed another problem. This is what the top of my interpreter loop looks like: get_next_token: CALLWORD eat_spaces ; skip whitespace CALLWORD quote ; handle any string literals CALLWORD number ; handle any number literals CALLWORD get_token It's nice and neat and readable. The problem is that it doesn't work unless you have an optional quote followed by an optional number followed by a non-string, non-quote token. Two strings in a row or a number followed by a string or various other combos won't work. Here's what needs to happen: interpreter loop: eat spaces if quote make string literal restart interpreter loop if number make number literal restart interpreter loop if token find/execute/compile restart interpreter loop else error So as much as I liked having 'quote' and 'number' handle all aspects of those respective inputs, it's just making everying too interdependant. I need to break them up into even smaller words and give more control back to the interpreter where it belongs. So more refactoring awaits me before I can do the fun stuff. Ah well, this is where the "slow and steady" approach wins - I know I'll eventually push through this kinda boring housekeeping stuff. Next night: Progress made on that refactor. Then there's a segfault. So that's what I'll work on tomorrow night. Next night: Looks like the segfault is happening in old code (specifically inline, but because of wrong calculations in semicolon ";"). Since the old code won't just start failing for no reason, this has to be due to bad input from all the changes I've made in the outer interpreter loop. Hard to track down with GDB with breakpoints and watch statements. So I added some print debugging and sure enough: COLON here: 0804a290 inlining print SEMICOLON there: 0804b2e6 That's the problem. 'colon' is trying to save the machine code start address for 'meow' on the stack and 'semicolon' should be getting that same value, but it's not. It's while executing this, by the way: : meow "Meow." print ; The botched semicolon tail writing is why trying to use 'meow' in 'meow5' fails - the calculations are wildly wrong (it thinks the machine code length is -4144). So I need to figure out where I'm leaving something on the stack. But how? Oh, I've got an idea, That wrong value from the stack looks like an address. I'll fire up GDB and see what's at that address. Maybe that will help. (gdb) x 0x0804b2e6 0x804b2e6: 0x776f654d Hmmm. Looks like a string? (gdb) x/s 0x0804b2e6 0x804b2e6: "Meow." Aha! Interesting. It's the address from the string literal before the print. What's that doing there? In compile mode, 'quote' should be compiling that address into the new word's machine code memory, not on the stack. Oh, right, I knew that would come back to haunt me. At some point, my run/compile logic got lost (could be a dozen commits ago by now). I'm currently compiling the instruction to push the string's address *and* I'm also actually pushing the string's address every single time. So this has actually been broken for a while and my refactoring has (almost?) nothing to do with it. This is why I need a set of proper regression tests ASAP. Well, I'll get there soon. Gotta be able to take input before I can do that! Anyway, time to add (back) some conditional jumps in 'quote'! The answer is 42. The answer is 0x2a in hex. The answer is 101010 in computer. The answer is 52 in octal. Meow.Meow.Meow.Meow.Meow. Much better. And I was reminded that I had been in the middle of adding escape sequences to string literals! Let's try them out: 42 "I paid \$$ for beans\\cheese.\nOkay?\n" print $ mr I paid $42 for beans\cheese. Okay? Wow, that was super satisfying. Okay, I think *now* I'm finally ready to make those introspective word-printing words! Next night: Uh, so I didn't think far enough ahead. I made some really nice string making and printing facilities, but I can't use them easily from assembly. Nor can I define something as complex as a "dictionary" printing word (to use the Forth term for all the defined words.) Hmmm... this is a quandary. Okay, it wasn't so bad. I just needed to write an additional word 'printnum' (with macro PRINTNUM_CODE) to help it out. After that, this is all it takes to write an 'inspect' that takes a word tail address and prints the word name and bytes of machine code (from the tail metadata): DEFWORD inspect pop esi ; get tail addr lea eax, [esi + T_NAME] push eax PRINT_CODE PRINTSTR ": " mov eax, [esi + T_CODE_LEN] push esi ; preserve tail addr ; param 1: num to be stringified push eax PRINTNUM_CODE PRINTSTR " bytes " NEWLINE_CODE ENDWORD inspect, 'inspect', (IMMEDIATE) Here it is printing the word 'find': "find" find inspect find: 58 bytes Neat! Oh, and I guess I can show my new 'printnum' in standalone action: bin 11101111 hex printnum ef Nice. +--------------------------------------------------------+ | | | Just so it's clear I'm not trying to trick anybody, | | the interpreter input is still hard-coded into | | memory, so that calling line really looks like this | | in assembly: | | | | db 'bin 11101111 hex printnum ' | | | | I'm just trying to make the log readable, so I leave | | off the 'db' and quotes. | | | | I'll take input on STDIN very soon (next!). | | | +--------------------------------------------------------+ And now I can find out how big a "meow" printing function is in machine code (minus the "Meow." string itself, because that's stored elsewhere. And also the combined five meows: "meow" find inspect "meow5" find inspect meow: 38 bytes meow5: 190 bytes There we go. So even repeating the entire printing routine five times is still under 200 bytes of machine code. When's the last time you shed a tear over 200 bytes? Okay, now I just need a loop to call inspect for every defined word and I'm all set. I'm not sure I want to use the term "dictionary" from Forth (or "word", for that matter). Hmm, I guess I can avoid the issue by just calling it 'inspect_all'... Almost there on inspect_all. It seems something has left the token str addr on the stack and it's getting in the way of popping the link addr and ruining everything. At least, that's what it looks like. Next night: So what I woke up thinking in the morning is that if I'm not maintaining my stack correctly between words, there are three ways I could debug this: 1. Painfully, with GDB stepping through...a lot. 2. Write a DEBUG_STACK macro that non-destructively prints the values on the stack from anywhere within the assembly. 3. Write a script (perhaps in awk?) that checks the push/pop balance by following all of the called or inlined code and incrementing a counter for each push and decrementing for each pop. Obviously I'm trying to avoid Option 1. Option 3 sounds neat, but I'm not sure how to *display* that information so that it helps me track down the problem - some words push more than they pop and vice-versa. It'd still be a lot of manual checking. Option 2 seems generally useful anyway, so I'm going to give that a shot and hope that peppering it about the program will help me figure out what's going on. Hmmm...well, printing the stack at the Meow5 language level is no problem, but printing it at the assembly level is a big problem. I've got a chicken-and-egg problem where I need to preserve the registers it uses AND preserve the stack. I could preserve the registers in a purpose-built bit of memory. Well, I think I did that right. Here's the whole ugly thing: section .bss ds_eax: resb 4 ds_ebx: resb 4 ds_ecx: resb 4 ds_edx: resb 4 ds_esi: resb 4 ds_edi: resb 4 section .text %macro DEBUG_STACK 0 mov [ds_eax], eax mov [ds_ebx], ebx mov [ds_ecx], ecx mov [ds_edx], edx mov [ds_esi], esi mov [ds_edi], edi PRINTSTR "Stack: " PRINTSTACK_CODE NEWLINE_CODE mov eax, [ds_eax] mov ebx, [ds_ebx] mov ecx, [ds_ecx] mov edx, [ds_edx] mov esi, [ds_esi] mov edi, [ds_edi] %endmacro But talk about chicken-and-egg problem. Or is it more of a "who watches the watchmen?" thing? Or maybe there's an even better expression for it, but my problem is that I think I'm acutally calling whatever is doing bad stack things WHILE trying to print the stack... :-( Okay, it's worse than even that. I don't know how I've gotten away with it this long, but I've been using a bunch of words mid-assembly to debug stuff...and a lot of them aren't safe for that (like 'newline') because they don't preserve the registers. So it's possible that by trying to observe what's going on for the 'inspect_all' feature, I've been ruining it! I have to remember that most of my words are only safe if I'm treating them as if they're being called from the language - transfering values on the stack and not giving a hoot what happens to the registers in between! Okay, I think I got it. No idea how this got here, but I had an extra pop for no reason I could understand in 'printnum'! I have no idea how it got there, but that would definitely mess things up: pop esi ; get preserved <-----?????? I think the lesson is to test all new words more thoroughly before moving on! Okay, can we _finally_ see the output of 'inspect_all'? $ mr Meow. Meow. Meow. Meow. Meow. meow5: 190 bytes IMMEDIATE COMPILE meow: 38 bytes IMMEDIATE COMPILE inspect_all: 359 bytes IMMEDIATE inspect: 340 bytes IMMEDIATE ps: 177 bytes IMMEDIATE COMPILE printmode: 100 bytes IMMEDIATE COMPILE printnum: 116 bytes IMMEDIATE COMPILE number: 306 bytes IMMEDIATE COMPILE decimal: 10 bytes IMMEDIATE COMPILE bin: 10 bytes IMMEDIATE COMPILE oct: 10 bytes IMMEDIATE COMPILE hex: 10 bytes IMMEDIATE COMPILE radix: 6 bytes IMMEDIATE COMPILE str2num: 95 bytes IMMEDIATE COMPILE quote: 247 bytes IMMEDIATE COMPILE num2str: 58 bytes IMMEDIATE COMPILE ;: 150 bytes COMPILE RUNCOMP return: 7 bytes IMMEDIATE :: 137 bytes IMMEDIATE copystr: 18 bytes IMMEDIATE COMPILE get_token: 55 bytes IMMEDIATE eat_spaces: 38 bytes IMMEDIATE COMPILE find: 58 bytes IMMEDIATE is_runcomp: 5 bytes IMMEDIATE COMPILE get_flags: 7 bytes IMMEDIATE COMPILE inline: 25 bytes IMMEDIATE print: 33 bytes IMMEDIATE COMPILE newline: 26 bytes IMMEDIATE COMPILE strlen: 16 bytes IMMEDIATE COMPILE exit: 8 bytes IMMEDIATE COMPILE Goodbye. Exit status: 0 Yay! At last! Every word in the system with the number of bytes of machine code and mode(s). And with the help of the shell, we can sort by size. $ ./meow5 | awk ' /^.*: [0-9]+/ {t=t+$2; print $2, $1} END{print "Total bytes:", t} ' | sort -n 5 is_runcomp: 6 radix: 7 get_flags: 7 return: 8 exit: 10 bin: 10 decimal: 10 hex: 10 oct: 16 strlen: 18 copystr: 25 inline: 26 newline: 33 print: 38 eat_spaces: 38 meow: 55 get_token: 58 find: 58 num2str: 95 str2num: 100 printmode: 116 printnum: 137 :: 150 ;: 177 ps: 190 meow5: 247 quote: 306 number: 340 inspect: 359 inspect_all: Total bytes: 2655 (Command line formatted for semi-readability and total bytes separated out for clarity.) So the smallest word is 'is_runcomp', which tests for the RUNCOMP flag in a mode and pushes the result. The largest word is not a surprise to me, 'inspect_all', which contains 'inspect' and puts a loop around it. In turn, 'inspect' includes a bunch of string and number printing code. So let's see how I did: [x] Pretty-print meta info about word! [ ] Loop through dictionary, list all names [x] Loop through dictionary, pretty-print all [x] New word: str2num (ascii to integer) [x] New word: num2str (integer to ascii) [x] Add "$" placeholders so the 'quote' word so it can interpolate numbers from the stack into the string. [x] Break out 'number' into separate word that does pre-check like 'quote'. I forgot about a loop that just prints all the word names. And 'number' works a little differently now than I'd envisioned. But I'm gonna call this part done for now and close out this log file.