meow5

A stack-based pure inlining concatenative programming language written in NASM assembly
git clone http://ratfactor.com/repos/meow5/meow5.git

meow5/log08.txt

Well, I think the next thing to do is make the
interpreter take input from STDIN. With that, I'll be
able to not only play with it interactively, but also
redirect and pipe instructions into it, which will mean
being able to have some regression tests as well.

    [ ] Get input string from 'read' syscall
    [ ] Add some testing (shell script?)

Testing is one of those things that a lot of people have
strong opinions about. But I absolutely love having a
reasonable number of tests in place to let me know that
I haven't broken something. Tests let me be _more_
creative and _more_ brave with the code because I can
try something and know right away whether or not it
works.

I have come to _loathe_ manual testing because I've been
in the Web dev world forever and testing on the web
SUCKS. In a lot of cases, the state of the art is still
refreshing a browser and clicking through a bunch of
pages. When you're used to that, it is a _delight_ to be
able to easily set up some STDIN/STDOUT tests on a
command line!!!

Okay, that's more than enough about that. It's time to
get real input!

Right off the bat, I know I'm going to need to handle
this in three places:

    * get_token
    * eat_spaces
    * quote

On the plus side, I'm happy with my input functionality
(especially the string literals) and I don't regret
how they've turned out. But now's when I have to pay the
price for three separate methods that read input.

This is definitely going to test my resolve to stick
with the "inline all the things" redundant code. But I
don't think "get_input" or whatever I end up calling it
will be very long. Might even be under 100 bytes. So
three copies shouldn't be too bad.
:-)

(When I wrote the above paragraphs, I thought I was
going to have up to six copies, but I kept realizing
that most of those places weren't actually reading a
whole stream of input - they were relying on one of
these three to do it.)

Okay, stripped of comments, this is get_input:

    mov ebx, [input_file]
    mov ecx, input_buffer
    mov edx, INPUT_SIZE
    mov eax, SYS_READ
    int 0x80
    cmp eax, INPUT_SIZE
    jge %%done
    mov byte [input_buffer + eax], 0
    %%done:
    mov dword [input_buffer_pos], input_buffer

It's tiny - just the Linux 'read' syscall to get more
input into the input_buffer. The only interesting thing
is that if we read less than an entire buffer's worth,
it null-terminates the string.

Now I gotta use it in (at least?) three places. Two of
them also needed to be updated now that I understand
what the esi and edi registers are for, ha ha. Anyway,
here's a typical example from 'eat_spaces':

    cmp esi, input_buffer_end ; need to get more input?
    jl .continue      ; no, keep going
    GET_INPUT_CODE    ; yes, get some
    jmp .reset        ; got more input, reset and continue

I kept simplifying until I got down to those four lines.

But does it work? Just a couple dumb mistakes and
then...

    $ mr
    hello
    Could not find word "hello" while looking in IMMEDIATE mode.
    Exit status: 1

Wow! I don't know why I typed "hello" as my first live
input into this thing. But it totally worked. Ha ha, I
probably don't need an unfound word to be a fatal error
anymore. :-)

How about something that *will* work:

    $ mr
    "Hello world!\n" print
    Hello world!
    Goodbye.
    Exit status: 0

Yay! My first live "Hello world" in this interpreter!

I'm still exiting after one line of input. I'll have to
figure that out. Do I read until an actual EOF character
is encountered? I can't remember. I'm just super excited
this works!
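On the EOF question: with the Linux 'read' syscall there is
no EOF character in the stream. End of input shows up as a
read that returns 0 bytes. The Python sketch below is not
meow5 code, just a small model of refilling a fixed-size
buffer until that happens. The name read_all, the pipe
standing in for STDIN, and the 16-byte size are all
assumptions made for illustration.

```python
import os

INPUT_SIZE = 16  # deliberately tiny so the buffer has to be refilled


def read_all(fd, size=INPUT_SIZE):
    """Refill a fixed-size buffer until read() reports end of input."""
    chunks = []
    while True:
        data = os.read(fd, size)   # thin wrapper over the 'read' syscall
        if len(data) == 0:         # 0 bytes read means EOF
            break
        chunks.append(data)
    return chunks


# A pipe stands in for STDIN: write one line, then close the write end.
message = b'"Hello world!\\n" print\n'
read_end, write_end = os.pipe()
os.write(write_end, message)
os.close(write_end)                # closing the writer is what produces EOF

chunks = read_all(read_end)
os.close(read_end)
print(len(chunks), "reads before EOF")
```

Each chunk is at most INPUT_SIZE bytes, which is why a short
read (fewer bytes than requested) cannot be treated as the
end of input on its own.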

So after all that hand-wringing about having a couple
copies of this code, how much impact has that actually
had?

Here's the relevant bits from 'inspect_all':

    get_input: 45 bytes IMMEDIATE COMPILE
    get_token: 108 bytes IMMEDIATE
    eat_spaces: 80 bytes IMMEDIATE COMPILE
    quote: 348 bytes IMMEDIATE COMPILE

Let's compare with the previous log07.txt results:

    get_token: 55 bytes IMMEDIATE
    eat_spaces: 38 bytes IMMEDIATE COMPILE
    quote: 247 bytes IMMEDIATE COMPILE

Since I cleaned up some of the words, there wasn't an
across-the-board increase of 45 * 3 bytes.

I'd like to see what the grand total has become. And
I'll probably want to do that often. So I'll make a new
option in my build.sh script:

    if [[ $1 == 'bytes' ]]
    then
        AWK='/^.*: [0-9]+/ {t=t+$2} END{print "Total bytes:", t}'
        echo 'inspect_all' | ./$F | awk -e "$AWK"
        exit
    fi

Okay, let's see the damage:

    $ ./build.sh bytes
    Total bytes: 2816

The last run in log07.txt was 2655 bytes, so the
difference is:

      2816 (current)
    - 2655 (previous)
    ------
       161

Ha ha, only 161 bytes difference, and since one of the
three copies is needed, I only gained 116 bytes of
"bloat". I think I can live with that on the x86
platform. :-)

Now I gotta figure out how to continue reading after one
line of input.

Oh, wait! One last thing. I had also set the input
buffer to an artificially tiny size so I could make sure
it was being refilled as needed. I'll add a DEBUG
statement to see where that's happening.

The buffer size is 16 bytes.

    $ mr
    GET_INPUT00000000
    "This is a jolly long string to make sure we read plenty into input buffer a couple times.\n" print
    GET_INPUT00000000
    GET_INPUT00000000
    GET_INPUT00000000
    GET_INPUT00000000
    GET_INPUT00000000
    GET_INPUT00000000
    This is a jolly long string to make sure we read plenty into input buffer a couple times.
    Goodbye.
    Exit status: 0

Okay, perfect, that long line of input required 7 calls
to 'get_input' to refill the input_buffer. Now I'll set
it to a reasonable size. I've seen some conflicting
stuff online, so I'll just take the coward's way out:

    %assign INPUT_SIZE 1024 ; size of input buffer

Now to figure out how to keep reading after the first
line (or token?) of input.

Okay, so I do need to check the return value from 'read'
because that's the only way I can know if I've really
got an EOF instead of just "no more input at this
moment" - as would be the case between when the user
hits enter and types the next line of input.

I also added a new eof global that I can trip as soon as
any of the 'get_input' instances hits the end of input:

    cmp eax, 0 ; 0=EOF, -1=error
    jge %%normal
    mov dword [input_eof], 1 ; set EOF reached
    %%normal:

    dave@cygnus~/meow5$ mr
    "Hello world!\n" print
    Hello world!

    : loud_meow "MEOW!\n" print ;
    loud_meow
    MEOW!

    exit
    Exit status: 12

Heh, that's so cool. I can finally interact with this
thing for real. But CTRL+D doesn't exit. I had to type
'exit' to make that happen.

I'll add a debug to 'get_input' to see what 'read' is
returning...

    dave@cygnus~/meow5$ mr
    "goodbye cruel world" print
    read bytes: 0000001c
    goodbye cruel world      <---- I typed ENTER here
    read bytes: 00000001
                             <---- ENTER again here
    read bytes: 00000001
    read bytes: 00000000     <---- CTRL+D

    read bytes: 00000001     <---- ENTER again
    exit
    read bytes: 00000005
    Exit status: 1

Okay, so I guess I'm not checking the input_eof flag
correctly in my interpreter loop?

No! Ha, perhaps you spotted it before I did in the
assembly snippet? Here it is again:

    cmp eax, 0 ; 0=EOF, -1=error
    jge %%normal
    mov dword [input_eof], 1 ; set EOF reached
    %%normal:

Silly mistake:

    jge %%normal

should be

    jg %%normal

so that 0 will trigger EOF!

Okay, that pretty much worked. But there's still some
inelegant code in the interpreter where I feel like I'm
checking for input too many times and it's somehow still
not enough.

I was null-terminating it and I think I would be better
off setting an upper bound on it.

Two nights later: Okay, just about have the kinks worked
out. I've got two new global variables to keep track of
the input buffer:

    input_buffer:     resb INPUT_SIZE
    input_buffer_pos: resb 4
    input_buffer_end: resb 4    <--- new
    input_eof:        resb 4    <--- new

Now I can check input_eof in any input words and in the
outer interpreter.

Okay, I'm stuck in 'eat_spaces'. I'm peppering it with
DEBUG macro calls to see what's up. esi contains the
current character in the input buffer (if it's a space,
we want to advance past it). ebx contains the last
position filled in the buffer by 'read'.

    $ mr
    eat_spaces pos: 0804c774
    eat_spaces RESET, pos: 0804c774
    ES more input! esi: 0804c774
    ES more input! ebx: 0804c774
    45 234 "hello!" meow      <----- I typed this
    read bytes: 00000015
    eat_spaces RESET, pos: 0804c774
    ES more input! esi: 0804c774
    ES more input! ebx: 0804c774
    read bytes: 00000000      <----- I typed CTRL+D here
    get_input EOF! 00000001
    eat_spaces RESET, pos: 0804c774
    get_next_token checking for EOF 0804ace2
    Goodbye.
    Exit status: 0

Well, that would be a problem. Looks like esi and ebx
are always the same value. Oops!

LOL, that's exactly it. I forgot to save the new end of
buffer pointer in 'get_input'. Here we are:

    mov dword [input_buffer_end], ebx ; save it

Do you like super verbose logging? You'll love this.
Here I am printing "hello" and then quitting with
CTRL+D. It's hard to even find the interaction amidst
all the noise:

    eat_spaces pos: 0804c7d5
    eat_spaces RESET, pos: 0804c7d5
    eat_spaces looking at char... 0000000a
    ES more input! esi: 0804c7d6
    ES more input! ebx: 0804c7d6
    "hello" print
    read bytes: 0000000e
    eat_spaces RESET, pos: 0804c7cc
    eat_spaces looking at char... 00000022
    get_next_token checking for EOF 0804ad12
    get_next_token looking at chars. 0804ad12
    quote0804c7cc
    eat_spaces pos: 0804c7d3
    eat_spaces RESET, pos: 0804c7d3
    eat_spaces looking at char... 0804c320
    eat_spaces looking at char... 0804c370
    eat_spaces pos: 0804c7d4
    eat_spaces RESET, pos: 0804c7d4
    eat_spaces looking at char... 00000070
    get_next_token checking for EOF 0804ad12
    get_next_token looking at chars. 0804ad12
    get_token0804c7d4
    helloeat_spaces pos: 0804c7d9
    eat_spaces RESET, pos: 0804c7d9
    eat_spaces looking at char... 0000000a
    ES more input! esi: 0804c7da
    ES more input! ebx: 0804c7da
    read bytes: 00000000
    get_input EOF! 00000001
    eat_spaces RESET, pos: 0804c7cc
    get_next_token checking for EOF 0804ad12
    Goodbye.
    Exit status: 0

But it works.
I'll clean this up tomorrow night and see
if I can add a simple test script.

Next night: The DEBUGs are cleaned up. Now a couple
housekeeping things. First, I want to complete that TODO
item from the last log, a word to print all defined
words (just the names, not the entire 'inspect' output).
I think I'll call it 'all'.

    [ ] New word: 'all' to list all current word names

Well, that was easy:

    $ mr
    all
    all inspect_all inspect ps printmode printnum number decimal bin oct hex radix str2num quote num2str ; return : copystr get_token eat_spaces get_input find is_runcomp get_flags inline print newline strlen exit
    Goodbye.
    Exit status: 0

I also added a non-destructive stack printing word last
log and I never actually got it working. So I'd like to
fix that.

    [ ] Finish 'ps' (non-destructive stack print)

And since I have string escape sequences for
runtime newline printing and NASM can include newlines
in string literals with backticks, I'd like to remove
the 'newline' word. I'm only using it in a couple places
anyway.

    [ ] Remove word 'newline' (replace with `\n`)

That one was super-easy too. I didn't really need a TODO
item for it. But it'll feel good to show that checked
box at the end of the log, so why not?

Now for that print stack:

    $ mr
    42 ps
    1 4290881940 0 4290881948 4290881964 4290881982
    4290882002 4290882040 4290882048 4290882106 ...

It just keeps going on and on. And then ends in a
Segmentation fault. So clearly I've got something wrong.

When the interpreter starts, I save the stack pointer to
a variable.

    mov dword [stack_start], esp

I want to do a sanity check, so I'll push two values:

    push dword 555
    push dword 42

Let's see this in action to confirm how x86 stacks work:

    $ mb
    Reading symbols from meow5...
    (gdb) break 877
    Breakpoint 1 at 0x8049f92: file meow5.asm, line 877.
    (gdb) r
    Starting program: /home/dave/meow5/meow5

    Breakpoint 1, _start () at meow5.asm:877

Okay, let's see what the stack register currently points
to (and by using GDB's 'display', this will always print
after every command):

    (gdb) disp $esp
    1: $esp = (void *) 0xffffd780
    (gdb) disp *(int)$esp
    2: *(int)$esp = 1

I've noticed that 1 (one) when I was trying to debug the
stack before. I have no idea why that's there. That's
something else to figure out.

Anyway, we can see the "first" stack address:

    0xffffd780

And as I push values onto the stack, esp should
decrement by 4 since the x86 stack writes to memory
backward. (By the way, I feel a rant about how we
describe this coming on, stay tuned for that in a
moment.)

    -------------------------------------------------------
    NOTE
    -------------------------------------------------------
    By the way, I often manually manipulate these GDB
    sessions here in my logs so that the instruction I'm
    executing shows up right before I start examining
    memory. Sorry if that confuses people who are
    well-versed in GDB and are wondering what the heck is
    going on.
    -------------------------------------------------------

Now I'll just verify that my stack_start variable indeed
holds the same value as esp and it points to that '1' at
the beginning of the stack:

    877 mov dword [stack_start], esp
    (gdb) s
    1: $esp = (void *) 0xffffd780
    2: *(int)$esp = 1
    (gdb) x/a (int)stack_start
    0xffffd780: 0x1

Yup. No surprises so far.
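As a prediction to check against GDB, here is a rough Python
model of a stack that grows toward lower addresses. It is a
sketch only and not part of meow5: DescendingStack and its
methods are made-up names, and the starting address is
simply the esp value GDB reported above.

```python
class DescendingStack:
    """Toy model of a 32-bit x86 stack: each push moves esp down by 4."""

    def __init__(self, esp, current_value):
        self.stack_start = esp              # like: mov dword [stack_start], esp
        self.esp = esp
        self.memory = {esp: current_value}  # address -> 32-bit value

    def push(self, value):
        self.esp -= 4                       # grow toward lower addresses
        self.memory[self.esp] = value       # then store at the new esp

    def oldest_to_newest(self):
        """Walk from stack_start down to esp, 4 bytes at a time."""
        addr = self.stack_start
        while addr >= self.esp:
            yield addr, self.memory[addr]
            addr -= 4


stack = DescendingStack(0xFFFFD780, 1)  # esp and the mystery 1 seen in GDB
stack.push(555)
stack.push(42)
for addr, value in stack.oldest_to_newest():
    print(f"{addr:#010x} {value}")
```

If the model matches the hardware, esp should land on
0xffffd77c after the first push and 0xffffd778 after the
second.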

Now when I push, we should see esp decrement and point
to the newly pushed value:

    879 push dword 555
    (gdb) s
    1: $esp = (void *) 0xffffd77c
    2: *(int)$esp = 555
    880 push dword 42
    (gdb) s
    1: $esp = (void *) 0xffffd778
    2: *(int)$esp = 42

Looks good so far!

    0xffffd780 1
    0xffffd77c 555
    0xffffd778 42

...I think. I'm really no good at hex calculations in my
head. Even easy ones. Let's confirm with 'dc', the old
RPN desk calculator on UNIX systems since forever:

    $ dc
    16 i 10 o  <--- set input and output base to 16 (get it?)
    1A 5 + p
    1F         <--- just making sure it's set up okay
    D780 p
    D780       <--- 0xffffd780
    4 - p
    D77C       <--- 0xffffd77c
    4 - p
    D778       <--- 0xffffd778

dc is crazy. Anyway, those addresses are right. Every
push subtracts 4 from esp and writes the pushed value to
that address.

So when I examine the stack area of memory, I should be
able to subtract 4 from my stack_start variable and see
each value. When I hit the current value of esp, that's
the last value on the stack and I'm done:

    (gdb) x/d (int)stack_start
    0xffffd780: 1
    (gdb) x/d (int)stack_start -4
    0xffffd77c: 555
    (gdb) x/d (int)stack_start -8
    0xffffd778: 42

Great! So the computer is doing what I think it's doing.
Always a good sign. :-)

*****************************************************
* RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT *
*****************************************************

Okay, so my issue with how we talk about stacks is the
use of terms like "top" and "bottom".

If we start with the stack of plates analogy, it's
perfectly fine to talk about the top of the stack
because it makes physical sense:

    =====  <--- top plate
    =====
    =====
    =====

But where's the "top" of this memory?

    +-----+
    |     | 0x0000
    +-----+
    |     | ...
    +-----+
    |     | 0xFFFF
    +-----+

Okay, now where's the "top" of this memory?

    +-----+
    |     | 0xFFFF
    +-----+
    |     | ...
    +-----+
    |     | 0x0000
    +-----+

Where's the "top" of the stack in this memory?

    +-----+
    | === | 0xFFFF  } stack start
    +-===-+         } stack
    | === | ...     } stack
    +-----+
    |     | 0x0000
    +-----+

And the "top" of the stack in this memory?

    +-----+
    | === | 0x0000  } stack start
    +-===-+         } stack
    | === | ...     } stack
    +-----+
    |     | 0xFFFF
    +-----+

Or this?

    +-----+
    |     | 0xFFFF
    +-----+
    | === | ...     } stack
    +-===-+         } stack
    | === | 0x0000  } stack start
    +-----+

Or this?

    +-----+
    |     | 0x0000
    +-----+
    | === | ...     } stack
    +-===-+         } stack
    | === | 0xFFFF  } stack start
    +-----+

I've seen ALL of these representations over the years
and the person making the diagram just passes it off
like their own personal mental model is completely
obvious.

This situation is nuts.

And I know Intel's official docs for x86 use the "top"
and "bottom" terms. But guess what? Intel's "word" size
on 64-bit processors is 16 bits, so I think we can
safely ignore their advice on terminology.

Personally, I don't picture ANY of the diagrams above.

Instead, I imagine the stack as horizontal memory and
the stack grows to the right:

    +--------------------
    | A | B | C | D | E --->
    +--------------------
      ^               ^
      oldest          current

But you'll notice that I don't say "rightmost" or
"leftmost". That would be ridiculous. Especially since
x86 has a stack that grows from a high-numbered address
to a lower-numbered address.
So it's really more like
this:

         --------------------+
    <--- E | D | C | B | A   |
         --------------------+
         ^               ^
         0xE4            0xFF
         (current)       (oldest)

Anyway, the point is that using directional descriptions
as if we were all looking at the same physical object is
super confusing.

I prefer stack descriptions such as:

    * current / newest / recent
    * older / previous
    * oldest
    * hot vs cold
    * surfaced / buried

And so on. I'm sure you can think of some better ones.
Actually, please do.

*****************************************************
* RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT *
*****************************************************

Sorry about that. I do feel better now. So, I've made
some changes in how I do the stack printing (I needed to
basically reverse everything I was doing, ha ha) and
let's see if it works now:

    $ mr
    ps
    1
    42 555 97 33
    ps
    1 42 555 97 33
    "Hello $ $ $" print
    Hello 33 97 555
    ps
    1 42
    "I put $ on there, but where does the $ come from???\n" print
    I put 42 on there, but where does the 1 come from???
    Goodbye.
    Exit status: 0

I don't know if that's hard to follow or not? It's
tempting to make some sort of prompt in the interpreter
just so it's easier to see the commands I type versus
the responses.

Anyway, it works great. I just don't understand why
there's a 1 on the stack when I start?

I guess it doesn't really matter. It occurs to me that I
should consider the start of the stack to be the *next*
available position. I'll update that now.

From:

    mov dword [stack_start], esp

To:

    lea eax, [esp - 4]
    mov [stack_start], eax

Did that fix it?

    ps

    42 16 ps
    42 16
    8 ps
    42 16 8

Yup!
Now we start with nothing on the stack and adding
items to the stack only shows those items.

Now how about a test script? I'm a big fan of simple
tests that are just enough to give me the peace-of-mind
that I haven't broken anything that used to work.

One thing that works just fine now that I take input on
STDIN is piping input:

    $ echo "42 13 ps" | ./meow5
    42 13
    Goodbye.

And I can grep/ag the results to make sure they contain
what I want.

But I remember 'expect' from back when I was heavy into
Tcl. I think I'll give that a shot to interactively
drive the interpreter and test it.

Expect is so cool. Here's my whole test script so far:

    #!/usr/bin/expect

    spawn ./meow5

    # Print a string
    send -- "\"Meow\\n\" print\r"
    expect "Meow"

    # Construct meow and test it
    send -- ": meow \"Meow. \" print ;\r"
    send -- "meow\r"
    expect "Meow. "

    # Construct meow5 and test it
    send -- ": meow5 meow meow
        meow meow meow \"\\n\" print ;\r"
    send -- "meow5\r"
    expect "Meow. Meow. Meow. Meow. Meow."

    # Exit (send CTRL+D EOF)
    send -- "\x04"
    expect eof

The long meow5 definition line has been broken onto the
next line for this log.

Here it is running!

    $ ./test.exp
    spawn ./meow5
    "Meow\n" print
    Meow
    : meow "Meow. " print ;
    meow
    Meow. : meow5 meow meow meow meow meow "\n" print ;
    meow5
    Meow. Meow. Meow. Meow. Meow.
    Goodbye.

I'll add a new alias for it now. (Defined by my "meow"
function in .bashrc):

    alias mt="./build.sh ; ./test.exp"

Sweet! That wraps up this log and the goals I had for
it. I'll add more to the test script as I go. This was
just to get it started.
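For comparison, the simpler pipe-and-grep style of check
mentioned above could look something like this Python
sketch. It is not the test script used in this log. The
helper name output_contains is hypothetical, and STAND_IN is
a stand-in command that only echoes STDIN so the example
runs without a built ./meow5.

```python
import subprocess
import sys


def output_contains(cmd, stdin_text, expected):
    """Pipe stdin_text into cmd and report whether expected is in its stdout."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return expected in result.stdout


# Stand-in for ./meow5 (assumption): a child process that echoes its STDIN.
STAND_IN = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

ok = output_contains(STAND_IN, "42 13 ps\n", "42 13")
print("PASS" if ok else "FAIL")
```

Against the real interpreter the call would presumably be
output_contains(["./meow5"], "42 13 ps\n", "42 13"),
mirroring the echo example earlier in this log.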


    [x] Get input string from 'read' syscall
    [x] Finish 'ps' (non-destructive stack print)
    [x] New word: 'all' to list all current word names
    [x] Remove word 'newline' (replace with `\n`)
    [x] Add some testing (expect!)

I think I might make some math words next so I can use
the language to do basic stuff like add and subtract!