
meow5

A stack-based pure inlining concatenative programming language written in NASM assembly
git clone http://ratfactor.com/repos/meow5/meow5.git

meow5/log06.txt


Well, log05.txt ended with some great excitement. I
double-checked and all of the open TODOs are now closed.

So I think I'll dip into design-notes.txt and pick the
next thing to do.

I just remembered one thing: I need to _remove_ a
feature. The return stack doesn't need to be a stack at
all because my "inline all the things!" language can't
have nested word calls anyway:

    [ ] Replace return stack with single addr

So, that's not super rewarding, but I do enjoy deleting
unneeded code.

Oh, I know which feature I'm doing after that! Time to
reward myself for staying on track with something fun
and visual:

    [ ] Pretty-print meta info about word!
    [ ] Loop through dictionary, list all names
    [ ] Loop through dictionary, pretty-print all

Next night: De-evolving the return mechanism for
immediate word calling was easy, so that one's done.

Now for the fun ones.

I'm more of a strings programmer than a numbers
programmer. So the ultra-primitive state of my string
printing is a bit of a bummer. Before I start storing a
billion little pieces of strings in the DATA segment,
I'd like to consider adding some convenience words for
string handling.

It would be nice to have, at the very least, string
literals in the language.

    [ ] Add string literals.
    [ ] Re-define 'meow' using a string literal.

I like the idea of just writing "anonymous" strings to
be printed into the dictionary space where all the words
are. And I think my choice to null-terminate my strings
will pay off here (I hope).

Adding immediate mode strings that are just references
to the input buffer turned out to be super easy:

    ; IMMEDIATE version of " scans forward until it finds end
    ; quote '"' character in input_buffer and replaces it with
    ; the null terminator. Leaves start addr of string on the
    ; stack. Use it right away!
    DEFWORD quote
        mov ebp, [input_buffer_pos]
        inc ebp                     ; skip initial space
        push ebp                    ; we leave this start addr on the stack
    .look_for_endquote:
        inc ebp
        cmp byte [ebp], '"'         ; endquote?
        jne .look_for_endquote      ; nope, loop
        mov byte [ebp], 0           ; replace endquote with null
        inc ebp                     ; move past the new null terminator
        mov [input_buffer_pos], ebp ; save position
    ENDWORD quote, '"', (IMMEDIATE)

And now I can do my first legit Hello World:

    db ' " Hello world!" print newline exit '

Which works just fine:

    $ mr
    Hello world!

But since it just saves a reference to the input buffer,
real world usage won't really be safe. Unless the input
buffer is limitless, I have no idea if the string address
will still be valid by the time I try to use it.

For that reason, I'm gonna have to copy any strings from
the input buffer to somewhere.

I could either have a special-purpose buffer just for
storing strings, or I could write to the stack, or I
could write to the compile area.

The other thing that's really messing with my mind is
trying to think ahead (probably way too much) towards
how I might handle this stuff in a stand-alone
executable program produced by Meow5...which, now that
I've written it out, is DEFINITELY thinking too far
ahead.

Next night: Moving on, I've also decided that I should
extract the part of 'get_token' that eats any initial
space characters (or other whitespace) out into its own
word.

    [ ] New word: 'eat_spaces'

That will allow me to use it to "peek ahead" if I
want to in the outer interpreter and possibly switch
into a "string mode" (which is something I'm
contemplating). But all these paragraphs are me getting
way ahead of myself. Back to the assembly!

Okay, done. I had just one mistake, but GDB was a clumsy
way to debug it.
So I added some more print debugging,
leading to this extremely verbose output once it worked:

    $ mr
    Running ":"
    Inlining "meow"
    Inlining "meow"
    Inlining "meow"
    Inlining "meow"
    Inlining "meow"
    Running ";"
    Running "meow5"
    Meow. Meow. Meow. Meow. Meow. Running "newline"

    Running "exit"

I'll comment those out for now, but I'm betting I'll be
using them again soon.

    $ mr
    Meow. Meow. Meow. Meow. Meow.

There we are, good as new.

Next night: So while it's true that I could save strings
(and other data) in a variety of clever places, my
understanding is that modern CPUs do much better with
separate instruction and data memory.

So I'm gonna say for now that there will be three types
of memory in Meow5:

    1. The stack for all register-sized parameters
    2. The "compile area" where all inlined words go
    3. The "data area" where all variables and other
       data (such as "anonymous" strings) will go.

In fact, I'm gonna name #2 and #3 exactly like that:

    section .bss

    ...

    compile_area: resb 1024
    data_area: resb 1024

    here: resb 4
    free: resb 4

Where 'here' points to the next free spot in the
compile_area (the 'here' name comes from Forth).

And 'free' points to the next free spot in the
data_area.

And I'm gonna go against the Forth grain and add a
special handler for quote syntax. I'll go ahead and peek
at the next character of input. If it's a quote, I'll
handle the rest as a string. Otherwise, keep processing
tokens as usual.

The word is called 'quote' instead of '"' and I'm going
to call it explicitly in my outer interpreter.

The point of this is to allow "normal looking" strings
like this:

    "Hello world"

Rather than requiring a token delimiter after the '"'
word as in traditional Forth:

    " Hello world"

Between that and copying the string from the input
buffer to a new variable space, the change in my
immediate mode hello world is just the missing space,
but it's a world of difference:

    db ' "Hello World!" print newline exit '

Does it work?

    $ mr
    Hello World!

Compile mode is exactly the same (I'll put the string in
the data_area at compile time), but instead of pushing
the address of the string to the stack right at that
moment, I need to inline (or "compile") the machine
code to push the address *when the word being compiled
runs*!

To do that, I need to actually "assemble" the i386
opcode to push the 32-bit address onto the stack.

So that'll be the "PUSH imm32" instruction in Intel
documentation parlance.

Handy reference: https://www.felixcloutier.com/x86/push

    6A <ib>      PUSH imm8
    66 68 <iw>   PUSH imm16  (0x66 prefix in 32-bit mode)
    68 <id>      PUSH imm32

And I'm gonna test that out with NASM and GDB:

    push byte 0x99
    push word 0x8888
    push dword 0x77777777

disassembles as:

    0x0804942d <+0>: 6a 99           push $0xffffff99
    0x0804942f <+2>: 66 68 88 88     pushw $0x8888
    0x08049433 <+6>: 68 77 77 77 77  push $0x77777777

Bingo! So I'm going to want opcode 0x68 followed by
the address value.

    mov edx, [here]
    mov byte [edx], 0x68      ; i386 opcode for PUSH imm32
    mov dword [edx + 1], ebx  ; address of string
    add edx, 5                ; update here
    mov [here], edx           ; save it

Well, here goes nothing...

    db ': meow "Meow." print ; meow newline exit '

There's no way that's gonna work...

    $ mr
    Running ":"
    Inlining "print"
    Running ";"
    Running "meow"
    Meow.Running "newline"

    Running "exit"

What?! It worked!

As you can see, I had also turned my debugging
statements back on 'cause I was expecting trouble. They
help assure me that this is, in fact, compiling a word
called 'meow' that prints a string stored in memory at
compile time. I'll turn the debugging off again.

And while I'm at it, I'll remove the old assembly test
'meow' word and define it like this in order to create
the 'meow5' word.

    input_buffer:
        db ': meow "Meow." print ; '
        db ': meow5 meow meow meow meow meow ; '
        db 'meow5 '
        db 'newline '
        db 'exit',0

    ./build.sh: line 33: 2650 Segmentation fault ./$F

Aw man.

Okay, where are we crashing?

    (gdb) r
    Starting program: /home/dave/meow5/meow5
    Running ":"
    Inlining "print"
    Running ";"
    Running ":"
    Inlining "meow"
    Inlining "meow"
    Inlining "meow"

    Program received signal SIGSEGV, Segmentation fault.
    find.test_word () at meow5.asm:165
    165 and eax, [edx + T_FLAGS] ; see if mode bit is set...

Hmmm. Weird that it dies while trying to find the fourth
'meow' to inline. I bumped up the compile area memory to
4kb and it wasn't that. So I guess I'll be stepping
through this.

Three nights later (I think): I did step through it
quite a bit with GDB, but this thing is getting to the
point where it feels like there's a pretty big mismatch
between GDB's strengths (stepping through C) and this
crazy machine code concatenation I'm doing.

I've always preferred "print debugging" anyway. So I've
made what I think is a neat little DEBUG print macro. It
takes a string and an expression to print as a 32-bit hex
number. The expression is anything that would be valid
as the source for a MOV to a register: mov eax, <expr>.

Examples:

    DEBUG "Value in eax: ", eax
    DEBUG "My memory: ", [mem_label]
    DEBUG "32 bits of glory: ", 0xDEADBEEF

Since the segfault is happening after a fourth iteration
of inline, I feel almost certain that this is a memory
clobbering problem. But all my data areas seem more than
big enough, so there must be a bug.

I've peppered 'inline' and 'find' (where the actual
crash takes place) with DEBUG statements. Here's a
sampling:

    Start [here]: 0804a280
    Start [last]: 08049ad8
    find [last]: 08049ad8
    find edx: 08049ad8
    find [edx]: 08049a3b
    find [edx+T_FLAGS]: 00000003
    ...
    Running ":"
    ...
    Inlining "print"
    ...
    ...
    Running ";"
    semicolon end of machine code [here]: 0804a2a5
    inline to [here]: 0804a2a5
    inline len: 00000007
    inline from: 0804961a
    inline done, [here]: 0804a2ac
    semicolon tail [here]: 0804a2ac
    semicolon linking to [last]: 08049ad8
    semicolon done with [last]: 0804a2ac
    [here]: 0804a264
    find [last]: 0804a2ac
    ...
    Running ":"
    ...
    Inlining "meow"
    inline to [here]: 0804a264
    inline len: 00000025
    inline from: 0804a280
    inline done, [here]: 0804a289
    ...
    Inlining "meow"
    ...
    inline done, [here]: 0804a2ae
    Inlining "meow"
    ...
    inline done, [here]: 0804a2d3
    find [last]: 0804a2ac
    find edx: 0804a2ac
    find [edx]: 595a5a80
    find [edx+T_FLAGS]: 0004b859
    find [last]: 0804a2ac
    find edx: 595a5a80

Even viewing exactly what I want to see, all of these
addresses are still enough to make me go cross-eyed.
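
To make the 'find' lines in that dump easier to follow,
here is a rough Python model of walking the dictionary's
linked list. It is a sketch under loud assumptions: the
log only shows that 'find' starts at [last], reads the
link at [edx] and tests flags at [edx+T_FLAGS]. The
T_FLAGS and T_NAME values, the name living inside the
tail, a link of 0 ending the list, and the tiny memory
size are all invented here for illustration.

```python
import struct

# Hypothetical tail layout; the real Meow5 offsets may differ.
T_FLAGS = 12  # assumed offset of the flags dword in a tail
T_NAME = 16   # assumed offset of the null-terminated name
memory = bytearray(256)


def read32(addr: int) -> int:
    """Read a little-endian 32-bit value, like 'mov eax, [addr]'."""
    return struct.unpack_from("<I", memory, addr)[0]


def write_tail(addr: int, link: int, flags: int, name: bytes) -> None:
    """Write a minimal tail: link, flags, then the name plus a null."""
    struct.pack_into("<I", memory, addr, link)
    struct.pack_into("<I", memory, addr + T_FLAGS, flags)
    memory[addr + T_NAME:addr + T_NAME + len(name) + 1] = name + b"\0"


def find(last: int, name: bytes, mode: int):
    """Follow links from 'last' until a word matches name and mode."""
    addr = last
    while addr != 0:  # assumed: a link of 0 ends the list
        if addr + T_NAME >= len(memory):
            # Stand-in for the segfault on a garbage link.
            raise MemoryError(f"bad link {addr:08x}")
        if read32(addr + T_FLAGS) & mode:
            end = memory.index(0, addr + T_NAME)
            if memory[addr + T_NAME:end] == name:
                return addr
        addr = read32(addr)  # like following [edx]
    return None


write_tail(0x20, link=0, flags=1, name=b"print")
write_tail(0x60, link=0x20, flags=1, name=b"meow")
print(hex(find(0x60, b"print", 1)))

struct.pack_into("<I", memory, 0x60, 0x595A5A80)  # clobber the link
try:
    find(0x60, b"print", 1)
except MemoryError as err:
    print("crash:", err)
```

The point of the model: a single overwritten link dword
is enough to send the search off to an address like
595a5a80, and the failure shows up in 'find' even though
'find' itself did nothing wrong.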

So immediately after compiling a new word, I should have
this:

    (word's machine code)
    tail:
        link: 0x0804____   <-- [last] points here
        (offsets and flags)
    end of tail            <-- [here] points here

The [last] address should point to the tail of the last
compiled word and [here] should point to the next
available free space in the compile_area.

Time to examine the output.

When Meow5 begins, [last] is pointing to the last word
created in assembly and [here] is pointing to the very
beginning of the compile_area:

    Start [here]: 0804a280
    Start [last]: 08049ad8

After a run-time word is compiled (such as 'meow'),
[here] should always be a little larger than [last].

    Running ";"
    semicolon end of machine code [here]: 0804a2a5
    inline to [here]: 0804a2a5
    inline len: 00000007
    inline from: 0804961a
    inline done, [here]: 0804a2ac
    semicolon tail [here]: 0804a2ac
    semicolon linking to [last]: 08049ad8
    semicolon done with [last]: 0804a2ac
    [here]: 0804a264

Which is indeed the case - [here] is a tail's worth of
bytes after [last]. So far so good.

Then we crash while finding and inlining the 'meow'
machine code into a new 'meow5' word. Here's the first:

    Inlining "meow"
    inline to [here]: 0804a264
    inline len: 00000025
    inline from: 0804a280
    inline done, [here]: 0804a289

To double-check, I put in even more DEBUG statements in
'inline':

    Inlining "meow"
    word tail: 0804b30c
    len: 00000025
    code offset: 0000002c
    source: 0804b2e0
    dest [here]: 0804b30e
    dest edi: 0804b30e
    end edi: 0804b333
    end [here]: 0804b333

No, that all seems fine. It doesn't look like 'inline'
is at fault here.
But _something_ is making the linked
list incorrect in the tail:

    find [last]: 0804b30c
    find edx: 0804b30c
    find [edx]: 595a5a80   <--- not a valid address

Sure, semicolon could have a bug...but that should be
causing the problem immediately, not between inlining
'meow' the third and fourth times.

Okay, actually, I think GDB can help me here. I need to
know when this value in memory is getting clobbered.
Here's the syntax for watching a specific address. Have
to cast it - "int" is 32 bits for 32-bit elf and '*'
tells GDB that our value is a pointer. It's all very
'C'.

    (gdb) watch *(int)0x0804b30c
    Hardware watchpoint 1: *(int)0x0804b30c

And let's see what happens:

    ...
    semicolon linking to [last]: 08049dc8
    Hardware watchpoint 1: *(int)0x0804b30c

    Old value = 0
    New value = 134520264
    @124.continue () at meow5.asm:454
    454 mov [last], eax ; and store this tail as new 'last'
    (gdb) x/x *(int)0x0804b30c
    0x8049dc8 <tail_quote>: 0x08049d2b

Okay, that's a good address. So semicolon is doing the
right thing so far. Let's continue...

    ...
    Inlining "meow"
    word tail: 0804b30c
    len: 00000025
    code offset: 0000002c
    source: 0804b2e0
    dest [here]: 0804b2e9
    dest edi: 0804b2e9

    Hardware watchpoint 1: *(int)0x0804b30c

    Old value = 134520264
    New value = 134520192
    @36.continue () at meow5.asm:247
    247 rep movsb ; copy [esi]...[esi+ecx] into [edi]

Bingo! Well, it *is* inline then. Yeah, clearly it
is. Ah, I see, but that's the first 'meow' inline. Which
kinda explains why I missed it.

So it's gotta be with a [here] that wasn't updated
correctly at some point.

Wait, has it been staring me in the face this whole
time?

    semicolon done with [last]: 0804b30c
    [here]: 0804b2c4

Ah geez.
Yeah, [here] should certainly be *after*
[last]:

    [last]: 0804b30c   <-- 30c (after)
    [here]: 0804b2c4   <-- 2c4 (before)

Dangit! Okay, so some more DEBUGs:

    tail eax: 0804b35c
    tail eax: 0804b360
    tail eax: 0804b364
    tail eax: 0804b368
    tail eax: 0804b30c   <-- yup! lost some ground here :-(
    tail eax: 0804b310

Got it! So it was my decision to go against Chuck
Moore's advice to always have words consume their
parameters from the stack so you don't have to remember
which words do and which words don't:

    %macro STRLEN_CODE 0
        mov eax, [esp] ; get string addr (without popping!)
        ...

Sure enough, I forgot to pop (and throw away) the
address in this one unique case where I really do just
need the string length:

    ; Call strlen again so we know how much string name we
    ; wrote to the tail:
    push name_buffer
    STRLEN_CODE
    pop ebx ; get string len pushed by STRLEN_CODE
    pop eax ; get saved 'here' position

That second pop was getting the name_buffer address I'd
pushed before STRLEN_CODE.

And now the novice is enlightened.

I'll fix that behavior right now and always heed that
particular bit of advice from here on out!

Okay, then it wouldn't find 'meow' after the *second*
try:

    Finding...0804b2fc
    meow
    find [last]: 0804b369
    find edx: 0804b369
    find [edx]: 0804a062
    flags okay: 00000001
    finding: 776f656d
    finding: 00776f65
    finding: 0000776f
    finding: 00000077
    finding: 00000000

It turns out I had one more problem with my strlen:

    add eax, ebx ; advance 'here' by that amt
    inc eax      ; plus one for the null

Had to add that last inc because strlen doesn't count
the null terminator as a character. So why did it find
'meow' the first time?
Because I hadn't yet written
anything to the compile area, and the "blank" memory
acted as a terminator, but once I started to inline a
copy of 'meow' right after 'meow's tail as the
definition of 'meow5', that null was no longer there!

Now I'm gonna remove about two dozen DEBUG statements...

And will this work?

    input_buffer:
        db ': meow "Meow." print ; '
        db ': meow5 meow meow meow meow meow ; '
        db 'meow5 '
        db 'newline '
        db 'exit',0

Crossing fingers:

    $ mr
    Meow.Meow.Meow.Meow.Meow.

At last!

Guess pretty-printing the dictionary got super
delayed, but this was vital stuff. I'll put those todos
in a new log in just a bit. But leaving this much
simpler todo for tomorrow night:

    [ ] Factor out a PRINTSTR macro from DEBUG, then use
        it *from* DEBUG and also anywhere else I'm
        currently hard-coding strings in the data
        section and printing them in the interpreter. Go
        ahead and push/pop the 4 registers in that one
        too. Performance is totally not a concern with
        these convenience macros in the interpreter.

Well, that was even easier than I expected.

Now to test (I'm using PRINTSTR in DEBUG and
stand-alone):

    PRINTSTR "Hello world!"
    NEWLINE_CODE

    DEBUG "[here] starting at 0x", [here]

Run:

    $ mr
    Hello world!
    [here] starting at 0x0804a114
    Meow.Meow.Meow.Meow.Meow.

And replaced all the strings in the data section with my
PRINTSTR macro - which makes those parts at least 30%
shorter and MUCH easier to read.

Here's where I'm at with the TODOs in this log:

    [x] Replace return stack with single addr
    [ ] Pretty-print meta info about word!
    [ ] Loop through dictionary, list all names
    [ ] Loop through dictionary, pretty-print all
    [x] Add string literals.
    [x] Re-define 'meow' using a string literal.
    [x] New word: 'eat_spaces'
    [x] Factor out a PRINTSTR macro from DEBUG
    [x] Use it in DEBUG
    [x] Replace data strings + CALLWORD print

So I'll start the next log where I started this one:
With the fun dictionary pretty-printer words.

This log's progress has been better than I'd hoped for
and I think now I'm in a good position for the fun
stuff!
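
Postscript on the second strlen bug from this log (the
missing "plus one for the null"): it can be reproduced
in a tiny Python model. This is a sketch under
assumptions, not the Meow5 code. The function names are
hypothetical, the 32-byte area stands in for zero-filled
memory, and NEXT_WORD is just five arbitrary non-zero
bytes standing in for whatever gets written next.

```python
# Hypothetical model of the name-writing bug; not the Meow5 source.
NEXT_WORD = bytes.fromhex("6880a20408")  # stand-in for the next write


def write_name(area: bytearray, here: int, name: bytes,
               count_null: bool) -> int:
    """Write name at 'here' and return the advanced 'here'."""
    area[here:here + len(name)] = name
    here += len(name)  # strlen-style length: excludes the null
    if count_null:
        area[here] = 0
        here += 1      # the fix: plus one for the null
    return here


def read_cstring(area: bytearray, addr: int) -> bytes:
    """Read bytes from addr up to (not including) the first null."""
    return bytes(area[addr:area.index(0, addr)])


def name_before_and_after(count_null: bool):
    """Return the stored name before and after the next write."""
    area = bytearray(32)  # zero-filled, like untouched memory
    here = write_name(area, 0, b"meow", count_null)
    before = read_cstring(area, 0)
    area[here:here + len(NEXT_WORD)] = NEXT_WORD
    after = read_cstring(area, 0)
    return before, after


print(name_before_and_after(count_null=False))
print(name_before_and_after(count_null=True))
```

Without the extra byte, the name reads back fine until
something else is written directly after it, which
matches the "found it the first time" behavior above.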