Well, I think the next thing to do is make the interpreter take input from STDIN. With that, I'll be able to not only play with it interactively, but also redirect and pipe instructions into it, which will mean being able to have some regression tests as well. [ ] Get input string from 'read' syscall [ ] Add some testing (shell script?) Testing is one of those things that a lot of people have strong opinions about. But I absolutely love having a reasonable number of tests in place to let me know that I haven't broken something. Tests let me be _more_ creative and _more_ brave with the code because I can try something and know right away whether or not it works. I have come to _loathe_ manual testing because I've been in the Web dev world forever and testing on the web SUCKS. In a lot of cases, the state of the art is still refreshing a browser and clicking through a bunch of pages. When you're used to that, it is a _delight_ to be able to easily set up some STDIN/STDOUT tests on a command line!!! Okay, that's more than enough about that. It's time to get real input! Right off the bat, I know I'm going to need to handle this in three places: * get_token * eat_spaces * quote On the plus side, I'm happy with my input functionality (especially the string literals) and I don't regret how they've turned out. But now's when I have to pay the price for four separate methods that read input. This is definitely going to test my resolve to stick with the "inline all the things" redundant code. But I don't think "get_input" or whatever I end up calling it will be very long. Might even be under 100 bytes. So three copies shouldn't be too bad. :-) (When I wrote the above paragraphs, I thought I was going to have up to six copies, but I kept realizing that most of those places weren't actually reading a whole stream of input - they were relying on one of these three to do it.) Okay, stripped of comments, this is get_input: mov ebx, [input_file] mov ecx, input_buffer mov edx, INPUT_SIZE mov eax, SYS_READ int 0x80 cmp eax, INPUT_SIZE jge %%done mov byte [input_buffer + eax], 0 %%done: mov dword [input_buffer_pos], input_buffer It's tiny - just the Linux 'read' syscall to get more input into the input_buffer. The only interesting thing is that if we read more than an entire buffer's worth, it null-terminates the string. Now I gotta use it in (at least?) three places. Two of them also needed to be updated now that I understand what the esi and edi registers are for, ha ha. Anyway, here's a typical example from 'eat_spaces': cmp esi, input_buffer_end ; need to get more input? jl .continue ; no, keep going GET_INPUT_CODE ; yes, get some jmp .reset ; got more input, reset and continue I kept simplifying until I got down to those four lines. But does it work? Just a couple dumb mistakes and then... $ mr hello Could not find word "hello" while looking in IMMEDIATE mode. Exit status: 1 Wow! I don't know why I typed "hello" as my first live input into this thing. But it totally worked. Ha ha, I probably don't need an unfound word to be a fatal error anymore. :-) How about something that *will* work: $ mr "Hello world!\n" print Hello world! Goodbye. Exit status: 0 Yay! My first live "Hello world" in this interpreter! I'm still exiting after one line of input. I'll have to figure that out. Do I read until an actual EOF character is encountered? I can't remember. I'm just super excited this works! So after all that hand-wringing about having a couple copies of this code, how much impact has that actually had? Here's the relevant bits from 'inspect_all': get_input: 45 bytes IMMEDIATE COMPILE get_token: 108 bytes IMMEDIATE eat_spaces: 80 bytes IMMEDIATE COMPILE quote: 348 bytes IMMEDIATE COMPILE Let's compare with the previous log07.txt results: get_token: 55 bytes IMMEDIATE eat_spaces: 38 bytes IMMEDIATE COMPILE quote: 247 bytes IMMEDIATE COMPILE Since I cleaned up some of the words, there wasn't an across-the board increase of 45 * 3 bytes. I'd like to see what the grand total has become. And I'll probably want to do that often. So I'll make a new option in my build.sh script: if [[ $1 == 'bytes' ]] then AWK='/^.*: [0-9]+/ {t=t+$2} END{print "Total bytes:", t}' echo 'inspect_all' | ./$F | awk -e "$AWK" exit fi Okay, let's see the damage: $ ./build.sh bytes Total bytes: 2816 The last run in log07.txt was 2655 bytes, so the difference is: 2816 (current) - 2655 (previous) ------ 161 Ha ha, only 161 bytes difference, and since one of the three copies is needed, I only gained 116 bytes of "bloat". I think I can live with that on the x86 platform. :-) Now I gotta figure out how to continue reading after one line of input. Oh, wait! One last thing. I had also set the input buffer to an artificially tiny size so I could make sure it was being refilled as needed. I'll add a DEBUG statement to see where that's happening. The buffer size is 16 bytes. $ mr GET_INPUT00000000 "This is a jolly long string to make sure we read plenty into input buffer a couple times.\n" print GET_INPUT00000000 GET_INPUT00000000 GET_INPUT00000000 GET_INPUT00000000 GET_INPUT00000000 GET_INPUT00000000 This is a jolly long string to make sure we read plenty into input buffer a couple times. Goodbye. Exit status: 0 Okay, perfect, that long line of input required 7 calls to 'get_input' to refill the input_buffer. Now I'll set it to a reasonable size. I've seen some conflicting stuff online, so I'll just take the coward's way out: %assign INPUT_SIZE 1024 ; size of input buffer Now to figure out how to keep reading after the first line (or token?) of input. Okay, so I do need to check the return value from 'read' because that's the only way I can know if I've really got an EOF instead of just "no more input at this moment" - as would be the case between when the user hits enter and types the next line of input. I also added a new eof global that I can trip as soon as any of the 'get_input' instances hits the end of input: cmp eax, 0 ; 0=EOF, -1=error jge %%normal mov dword [input_eof], 1 ; set EOF reached %%normal: dave@cygnus~/meow5$ mr "Hello world!\n" print Hello world! : loud_meow "MEOW!\n" print ; loud_meow MEOW! exit Exit status: 12 Heh, that's so cool. I can finally interact with this thing for real. But CTRL+D doesn't exit. I had to type 'exit' to make that happen. I'll add a debug to 'get_input' to see what 'read' is returning... dave@cygnus~/meow5$ mr "goodbye cruel world" print read bytes: 0000001c goodbye cruel world <---- I typed ENTER here read bytes: 00000001 <---- ENTER again here read bytes: 00000001 read bytes: 00000000 <---- CTRL+D read bytes: 00000001 <---- ENTER again exit read bytes: 00000005 Exit status: 1 Okay, so I guess I'm not checking the input_eof flag correctly in my interpreter loop? No! Ha, perhaps you spotted it before I did in the assembly snippet? Here it is again: cmp eax, 0 ; 0=EOF, -1=error jge %%normal mov dword [input_eof], 1 ; set EOF reached %%normal: Silly mistake: jge %%normal should be jg %%normal so that 0 will trigger EOF! Okay, that pretty much worked. But there's still some inelegant code in the interpreter where I feel like I'm checking for input too many times and it's somehow still not enough. I was null-terminating it and I think I would be better off setting an upper bound on it. Two nights later: Okay, just about have the kinks worked out. I've got two new global variables to keep track of the input buffer: input_buffer: resb INPUT_SIZE input_buffer_pos: resb 4 input_buffer_end: resb 4 <--- new input_eof: resb 4 <--- new Now I can check input_eof in any input words and in the outer interpreter. Okay, I'm stuck in 'eat_spaces'. I'm peppering it with DEBUG macro calls to see what's up. esi contains the current character in the input buffer (if it's a space, we want to advance past it). ebx contains the last position filled in the buffer by 'read'. $ mr eat_spaces pos: 0804c774 eat_spaces RESET, pos: 0804c774 ES more input! esi: 0804c774 ES more input! ebx: 0804c774 45 234 "hello!" meow <----- I typed this read bytes: 00000015 eat_spaces RESET, pos: 0804c774 ES more input! esi: 0804c774 ES more input! ebx: 0804c774 read bytes: 00000000 <----- I typed CTRL+D here get_input EOF! 00000001 eat_spaces RESET, pos: 0804c774 get_next_token checking for EOF 0804ace2 Goodbye. Exit status: 0 Well, that would be a problem. Looks like esi and ebx are always the same value. Oops! LOL, that's exactly it. I forgot to save the new end of buffer pointer in 'get_input'. Here we are: mov dword [input_buffer_end], ebx ; save it Do you like super verbose logging? You'll love this. Here I am printing "hello" and then quitting with CTRL+D. It's hard to even find the interaction amidst all the noise: eat_spaces pos: 0804c7d5 eat_spaces RESET, pos: 0804c7d5 eat_spaces looking at char... 0000000a ES more input! esi: 0804c7d6 ES more input! ebx: 0804c7d6 "hello" print read bytes: 0000000e eat_spaces RESET, pos: 0804c7cc eat_spaces looking at char... 00000022 get_next_token checking for EOF 0804ad12 get_next_token looking at chars. 0804ad12 quote0804c7cc eat_spaces pos: 0804c7d3 eat_spaces RESET, pos: 0804c7d3 eat_spaces looking at char... 0804c320 eat_spaces looking at char... 0804c370 eat_spaces pos: 0804c7d4 eat_spaces RESET, pos: 0804c7d4 eat_spaces looking at char... 00000070 get_next_token checking for EOF 0804ad12 get_next_token looking at chars. 0804ad12 get_token0804c7d4 helloeat_spaces pos: 0804c7d9 eat_spaces RESET, pos: 0804c7d9 eat_spaces looking at char... 0000000a ES more input! esi: 0804c7da ES more input! ebx: 0804c7da read bytes: 00000000 get_input EOF! 00000001 eat_spaces RESET, pos: 0804c7cc get_next_token checking for EOF 0804ad12 Goodbye. Exit status: 0 But it works. I'll clean this up tomorrow night and see if I can add a simple test script. Next night: The DEBUGs are cleaned up. Now a couple housekeeping things. First, I want to complete that TODO item from the last log, a word to print all defined words (just the names, not the entire 'inspect' output. I think I'll call it 'all'. [ ] New word: 'all' to list all current word names Well, that was easy: $ mr all all inspect_all inspect ps printmode printnum number decimal bin oct hex radix str2num quote num2str ; return : copystr get_token eat_spaces get_input find is_runcomp get_flags inline print newline strlen exit Goodbye. Exit status: 0 I also added a non-destructive stack printing word last log and I never actually got it working. So I'd like to fix that. [ ] Finish 'ps' (non-destructive stack print) And since I have string escape sequences for runtime newline printing and NASM can include newlines in string literals with backticks, I'd like to remove the 'newline' word. I'm only using it in a couple places anyway. [ ] Remove word 'newline' (replace with `\n`) That one was super-easy too. I didn't really need a TODO item for it. But it'll feel good to show that checked box at the end of the log, so why not? Now for that print stack: $ mr 42 ps 1 4290881940 0 4290881948 4290881964 4290881982 4290882002 4290882040 4290882048 4290882106 ... It just keeps going on and on. And then ends in a Segmentation fault. So clearly I've got something wrong. When the interpreter starts, I save the stack pointer to a variable. mov dword [stack_start], esp I want to do a sanity check, so I'll push two values: push dword 555 push dword 42 Let's see this in action to confirm how x86 stacks work: $ mb Reading symbols from meow5... (gdb) break 877 Breakpoint 1 at 0x8049f92: file meow5.asm, line 877. (gdb) r Starting program: /home/dave/meow5/meow5 Breakpoint 1, _start () at meow5.asm:877 Okay, let's see what the stack register current points to (and by using GDB's 'display', this will always print after every command): (gdb) disp $esp 1: $esp = (void *) 0xffffd780 (gdb) disp *(int)$esp 2: *(int)$esp = 1 I've noticed that 1 (one) when I was trying to debug the stack before. I have no idea why that's there. That's something else to figure out. Anyway, we can see that the "first" stack address: 0xffffd780 And as I push values onto the stack, esp should decrement by 4 since the x86 stack writes to memory backward. (By the way, I feel a rant about how we describe this coming on, stay tuned for that in a moment.) ------------------------------------------------------- NOTE ------------------------------------------------------- By the way, I often manually manipulate these GDB sessions here in my logs so that the instruction I'm executing shows up right before I start examining memory. Sorry if that confuses people who are well-versed in GDB and are wondering what the heck is going on. ------------------------------------------------------- Now I'll just verify that my stack_start variable indeed holds the same value as esp and it points to that '1' at the beginning of the stack: 877 mov dword [stack_start], esp (gdb) s 1: $esp = (void *) 0xffffd780 2: *(int)$esp = 1 (gdb) x/a (int)stack_start 0xffffd780: 0x1 Yup. No surprises so far. Now when I push, we should see esp decrement and point to the newly pushed value: 879 push dword 555 (gdb) s 1: $esp = (void *) 0xffffd77c 2: *(int)$esp = 555 880 push dword 42 (gdb) s 1: $esp = (void *) 0xffffd778 2: *(int)$esp = 42 Looks good so far! 0xffffd780 1 0xffffd77c 555 0xffffd778 42 ...I think. I'm really no good at hex calculations in my head. Even easy ones. Let's confirm with 'dc', the old RPN desk calculator on UNIX systems since forever: $ dc 16 i 10 o <--- set input and output base to 16 (get it?) 1A 5 + p 1F <--- just making sure it's set up okay D780 p D780 <--- 0xffffd780 4 - p D77C <--- 0xffffd77c 4 - p D778 <--- 0xffffd778 dc is crazy. Anyway, those addresses are right. Every push subtracts 4 from esp and writes the pushed value to that address. So when I examine the stack area of memory, I should be able to subtract 4 from my stack_start variable and see each value. When I hit the current value of esp, that's the last value on the stack and I'm done: (gdb) x/d (int)stack_start 0xffffd780: 1 (gdb) x/d (int)stack_start -4 0xffffd77c: 555 (gdb) x/d (int)stack_start -8 0xffffd778: 42 Great! So the computer is doing what I think it's doing. Always a good sign. :-) ***************************************************** * RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT * ***************************************************** Okay, so my issue with how we talk about stacks is the use of terms like "top" and "bottom". If we start with the stack of plates analogy, it's perfectly fine to talk about the top of the stack because it makes physical sense: ===== <--- top plate ===== ===== ===== But where's the "top" of this memory? +-----+ | | 0x0000 +-----+ | | ... +-----+ | | 0xFFFF +-----+ Okay, now where's the "top" of this memory? +-----+ | | 0xFFFF +-----+ | | ... +-----+ | | 0x0000 +-----+ Where's the "top" of the stack in this memory? +-----+ | === | 0xFFFF } stack start +-===-+ } stack | === | ... } stack +-----+ | | 0x0000 +-----+ And the "top" of the stack in this memory? +-----+ | === | 0x0000 } stack start +-===-+ } stack | === | ... } stack +-----+ | | 0xFFFF +-----+ Or this? +-----+ | | 0xFFFF +-----+ | === | ... } stack +-===-+ } stack | === | 0x0000 } stack start +-----+ Or this? +-----+ | | 0x0000 +-----+ | === | ... } stack +-===-+ } stack | === | 0xFFFF } stack start +-----+ I've seen ALL of these representations over the years and the person making the diagram just passes it off like their own personal mental model is completely obvious. This situation is nuts. And I know Intel's official docs for x86 use the "top" and "bottom" terms. But guess what? Intel's "word" size on 64-bit processors is 16 bits, so I think we can safely ignore their advice on terminology. Personally, I don't picture ANY of the diagrams above. Instead, I imagine the stack as horizontal memory and the stack grows to the right: +-------------------- | A | B | C | D | E ---> +-------------------- ^ ^ oldest current But you'll notice that I don't say "rightmost" or "leftmost". That would be ridiculous. Especially since x86 has a stack that grows from a high-numbered address to a lower-numbered address. So it's really more like this: --------------------+ <--- E | D | C | B | A | --------------------+ ^ ^ 0xE4 0xFF (current) (oldest) Anyway, the point is that using directional descriptions as if we were all looking at the same physical object is super confusing. I prefer stack descriptions such as: * current / newest / recent * older / previous * oldest * hot vs cold * surfaced / buried And so on. I'm sure you can think of some better ones. Actually, please do. ***************************************************** * RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT * ***************************************************** Sorry about that. I do feel better now. So, I've made some changes in how I do the stack printing (I needed to basically reverse everything I was doing, ha ha) and let's see if it works now: $ mr ps 1 42 555 97 33 ps 1 42 555 97 33 "Hello $ $ $" print Hello 33 97 555 ps 1 42 "I put $ on there, but where does the $ come from???\n" print I put 42 on there, but where does the 1 come from??? Goodbye. Exit status: 0 I don't know if that's hard to follow or not? It's tempting to make some sort of prompt in the interpreter just so it's easier to see the commands I type versus the responses. Anyway, it works great. I just don't understand why there's a 1 on the stack when I start? I guess it doesn't really matter. It occurs to me that I should consider the start of the stack to be the *next* available position. I'll update that now. From: mov dword [stack_start], esp To: lea eax, [esp - 4] mov [stack_start], eax Did that fix it? ps 42 16 ps 42 16 8 ps 42 16 8 Yup! Now we start with nothing on the stack and adding items to the stack only shows those items. Now how about a test script? I'm a big fan of simple tests that are just enough to give me the peace-of-mind that I haven't broken anything that used to work. One thing that works just fine now that I take input on STDIN is piping input: $ echo "42 13 ps" | ./meow5 42 13 Goodbye. And I can grep/ag the results to make they contain what I want. But I remember 'expect' from back when I was heavy into Tcl. I think I'll give that a shot to interactively drive the interpreter and test it. Expect is so cool. Here's my whole test script so far: #!/usr/bin/expect spawn ./meow5 # Print a string send -- "\"Meow\\n\" print\r" expect "Meow" # Consruct meow and test it send -- ": meow \"Meow. \" print ;\r" send -- "meow\r" expect "Meow. " # Consruct meow5 and test it send -- ": meow5 meow meow meow meow meow \"\\n\" print ;\r" send -- "meow5\r" expect "Meow. Meow. Meow. Meow. Meow." # Exit (send CTRL+D EOF) send -- "\x04" expect eof The long meow5 definition line has been broken onto the next line for this log. Here it is running! $ ./test.exp spawn ./meow5 "Meow\n" print Meow : meow "Meow. " print ; meow Meow. : meow5 meow meow meow meow meow "\n" print ; meow5 Meow. Meow. Meow. Meow. Meow. Goodbye. I'll add a new alias for it now. (Defined by my "meow" function in .bashrc): alias mt="./build.sh ; ./test.exp" Sweet! That wraps up this log and the goals I had for it. I'll ad more to the test script as I go. This was just go get it started. [x] Get input string from 'read' syscall [x] Finish 'ps' (non-destructive stack print) [x] New word: 'all' to list all current word names [x] Remove word 'newline' (replace with `\n`) [x] Add some testing (expect!) I think I might make some math words next so I can use the language to do basic stuff like add and subtract!