1 Well, I think the next thing to do is make the
2 interpreter take input from STDIN. With that, I'll be
3 able to not only play with it interactively, but also
4 redirect and pipe instructions into it, which will mean
5 being able to have some regression tests as well.
6
7 [ ] Get input string from 'read' syscall
8 [ ] Add some testing (shell script?)
9
10 Testing is one of those things that a lot of people have
11 strong opinions about. But I absolutely love having a
12 reasonable number of tests in place to let me know that
13 I haven't broken something. Tests let me be _more_
14 creative and _more_ brave with the code because I can
15 try something and know right away whether or not it
16 works.
17
18 I have come to _loathe_ manual testing because I've been
19 in the Web dev world forever and testing on the web
20 SUCKS. In a lot of cases, the state of the art is still
21 refreshing a browser and clicking through a bunch of
22 pages. When you're used to that, it is a _delight_ to be
23 able to easily set up some STDIN/STDOUT tests on a
24 command line!!!
25
26 Okay, that's more than enough about that. It's time to
27 get real input!
28
29 Right off the bat, I know I'm going to need to handle
30 this in three places:
31
32 * get_token
33 * eat_spaces
34 * quote
35
36 On the plus side, I'm happy with my input functionality
37 (especially the string literals) and I don't regret
38 how they've turned out. But now's when I have to pay the
39 price for three separate methods that read input.
40
41 This is definitely going to test my resolve to stick
42 with the "inline all the things" redundant code. But I
43 don't think "get_input" or whatever I end up calling it
44 will be very long. Might even be under 100 bytes. So
45 three copies shouldn't be too bad. :-)
46
47 (When I wrote the above paragraphs, I thought I was
48 going to have up to six copies, but I kept realizing
49 that most of those places weren't actually reading a
50 whole stream of input - they were relying on one of
51 these three to do it.)
52
53 Okay, stripped of comments, this is get_input:
54
55 mov ebx, [input_file]
56 mov ecx, input_buffer
57 mov edx, INPUT_SIZE
58 mov eax, SYS_READ
59 int 0x80
60 cmp eax, INPUT_SIZE
61 jge %%done
62 mov byte [input_buffer + eax], 0
63 %%done:
64 mov dword [input_buffer_pos], input_buffer
65
66 It's tiny - just the Linux 'read' syscall to get more
67 input into the input_buffer. The only interesting thing
68 is that if we read less than an entire buffer's worth,
69 it null-terminates the string.
70
71 Now I gotta use it in (at least?) three places. Two of
72 them also needed to be updated now that I understand
73 what the esi and edi registers are for, ha ha. Anyway,
74 here's a typical example from 'eat_spaces':
75
76 cmp esi, input_buffer_end ; need to get more input?
77 jl .continue ; no, keep going
78 GET_INPUT_CODE ; yes, get some
79 jmp .reset ; got more input, reset and continue
80
81 I kept simplifying until I got down to those four lines.
82
83 But does it work? Just a couple dumb mistakes and
84 then...
85
86 $ mr
87 hello
88 Could not find word "hello" while looking in IMMEDIATE mode.
89 Exit status: 1
90
91 Wow! I don't know why I typed "hello" as my first live
92 input into this thing. But it totally worked. Ha ha, I
93 probably don't need an unfound word to be a fatal error
94 anymore. :-)
95
96 How about something that *will* work:
97
98 $ mr
99 "Hello world!\n" print
100 Hello world!
101 Goodbye.
102 Exit status: 0
103
104 Yay! My first live "Hello world" in this interpreter!
105
106 I'm still exiting after one line of input. I'll have to
107 figure that out. Do I read until an actual EOF character
108 is encountered? I can't remember. I'm just super excited
109 this works!
110
111 So after all that hand-wringing about having a couple
112 copies of this code, how much impact has that actually
113 had?
114
115 Here's the relevant bits from 'inspect_all':
116
117 get_input: 45 bytes IMMEDIATE COMPILE
118 get_token: 108 bytes IMMEDIATE
119 eat_spaces: 80 bytes IMMEDIATE COMPILE
120 quote: 348 bytes IMMEDIATE COMPILE
121
122 Let's compare with the previous log07.txt results:
123
124 get_token: 55 bytes IMMEDIATE
125 eat_spaces: 38 bytes IMMEDIATE COMPILE
126 quote: 247 bytes IMMEDIATE COMPILE
127
128 Since I cleaned up some of the words, there wasn't an
129 across-the-board increase of 45 * 3 bytes.
130
131 I'd like to see what the grand total has become. And
132 I'll probably want to do that often. So I'll make a new
133 option in my build.sh script:
134
135 if [[ $1 == 'bytes' ]]
136 then
137 AWK='/^.*: [0-9]+/ {t=t+$2} END{print "Total bytes:", t}'
138 echo 'inspect_all' | ./$F | awk -e "$AWK"
139 exit
140 fi
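
To see what that awk program does on its own, here's a
rough sketch that feeds it the four 'inspect_all' lines
quoted above plus one line without a byte count. The
function names are made up for this example, and I'm
assuming each word's summary sits on a single line like
the ones shown. It uses plain `awk "$AWK"` instead of
`awk -e`, since '-e' is a gawk option.

```shell
#!/bin/sh
# Same awk program as in build.sh: sum field 2 of every
# line that looks like "name: <number> ...".
AWK='/^.*: [0-9]+/ {t=t+$2} END{print "Total bytes:", t}'

# Hypothetical stand-in for the interpreter's output.
sample_inspect_lines() {
    printf '%s\n' \
        'get_input: 45 bytes IMMEDIATE COMPILE' \
        'get_token: 108 bytes IMMEDIATE' \
        'eat_spaces: 80 bytes IMMEDIATE COMPILE' \
        'quote: 348 bytes IMMEDIATE COMPILE' \
        'a line with no byte count is ignored'
}

# Hypothetical wrapper so the sum can be reused.
total_bytes() {
    awk "$AWK"
}

sample_inspect_lines | total_bytes
```

Only the lines matching "name: number" contribute, so
any other output from the interpreter is skipped.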
141
142 Okay, let's see the damage:
143
144 $ ./build.sh bytes
145 Total bytes: 2816
146
147 The last run in log07.txt was 2655 bytes, so the
148 difference is:
149
150 2816 (current)
151 - 2655 (previous)
152 ------
153 161
154
155 Ha ha, only 161 bytes difference, and since one of the
156 three copies is needed, I only gained 116 bytes of
157 "bloat". I think I can live with that on the x86
158 platform. :-)
159
160 Now I gotta figure out how to continue reading after one
161 line of input.
162
163 Oh, wait! One last thing. I had also set the input
164 buffer to an artificially tiny size so I could make sure
165 it was being refilled as needed. I'll add a DEBUG
166 statement to see where that's happening.
167
168 The buffer size is 16 bytes.
169
170 $ mr
171 GET_INPUT00000000
172 "This is a jolly long string to make sure we read plenty into input buffer a couple times.\n" print
173 GET_INPUT00000000
174 GET_INPUT00000000
175 GET_INPUT00000000
176 GET_INPUT00000000
177 GET_INPUT00000000
178 GET_INPUT00000000
179 This is a jolly long string to make sure we read plenty into input buffer a couple times.
180 Goodbye.
181 Exit status: 0
182
183 Okay, perfect, that long line of input required 7 calls
184 to 'get_input' to refill the input_buffer. Now I'll set
185 it to a reasonable size. I've seen some conflicting
186 stuff online, so I'll just take the coward's way out:
187
188 %assign INPUT_SIZE 1024 ; size of input buffer
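
As a rough cross-check of those 7 refills with the
16-byte buffer, here's a small shell sketch. It assumes
each 'read' call returns a full buffer until the line
runs out, and that ENTER adds one newline byte. The
function name is made up for this example.

```shell
#!/bin/sh
# reads_needed BYTES SIZE: how many SIZE-byte reads it
# takes to consume BYTES bytes (ceiling division).
reads_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# The long test line as typed (the \n stays as two
# literal characters inside single quotes).
line='"This is a jolly long string to make sure we read plenty into input buffer a couple times.\n" print'

bytes=$(( ${#line} + 1 ))   # +1 for the newline from ENTER
echo "reads needed: $(reads_needed "$bytes" 16)"
```

That agrees with the seven GET_INPUT lines in the
session above.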
189
190 Now to figure out how to keep reading after the first
191 line (or token?) of input.
192
193 Okay, so I do need to check the return value from 'read'
194 because that's the only way I can know if I've really
195 got an EOF instead of just "no more input at this
196 moment" - as would be the case between when the user
197 hits enter and types the next line of input.
198
199 I also added a new eof global that I can trip as soon as
200 any of the 'get_input' instances hits the end of input:
201
202 cmp eax, 0 ; 0=EOF, -1=error
203 jge %%normal
204 mov dword [input_eof], 1 ; set EOF reached
205 %%normal:
206
207 dave@cygnus~/meow5$ mr
208 "Hello world!\n" print
209 Hello world!
210
211 : loud_meow "MEOW!\n" print ;
212 loud_meow
213 MEOW!
214
215 exit
216 Exit status: 12
217
218 Heh, that's so cool. I can finally interact with this
219 thing for real. But CTRL+D doesn't exit. I had to type
220 'exit' to make that happen.
221
222 I'll add a debug to 'get_input' to see what 'read' is
223 returning...
224
225 dave@cygnus~/meow5$ mr
226 "goodbye cruel world" print
227 read bytes: 0000001c
228 goodbye cruel world <---- I typed ENTER here
229 read bytes: 00000001
230 <---- ENTER again here
231 read bytes: 00000001
232 read bytes: 00000000 <---- CTRL+D
233
234 read bytes: 00000001 <---- ENTER again
235 exit
236 read bytes: 00000005
237 Exit status: 1
238
239 Okay, so I guess I'm not checking the input_eof flag
240 correctly in my interpreter loop?
241
242 No! Ha, perhaps you spotted it before I did in the
243 assembly snippet? Here it is again:
244
245 cmp eax, 0 ; 0=EOF, -1=error
246 jge %%normal
247 mov dword [input_eof], 1 ; set EOF reached
248 %%normal:
249
250 Silly mistake:
251
252 jge %%normal
253
254 should be
255
256 jg %%normal
257
258 so that 0 will trigger EOF!
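
The same off-by-one is easy to see outside of assembly.
This is just a sketch with made-up function names: one
mimics the buggy 'jge' test and the other the fixed 'jg'
test, applied to a few possible return values from
'read'.

```shell
#!/bin/sh
# Buggy version: treats 0 as normal input (like jge).
classify_jge() {
    if [ "$1" -ge 0 ]; then echo normal; else echo eof; fi
}

# Fixed version: only positive counts are normal (like jg).
classify_jg() {
    if [ "$1" -gt 0 ]; then echo normal; else echo eof; fi
}

for n in 28 1 0 -1; do
    echo "read=$n jge:$(classify_jge "$n") jg:$(classify_jg "$n")"
done
```

Only the 'jg' version flags a return value of 0, which
is what CTRL+D produces.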
259
260 Okay, that pretty much worked. But there's still some
261 inelegant code in the interpreter where I feel like I'm
262 checking for input too many times and it's somehow still
263 not enough.
264
265 I was null-terminating it and I think I would be better
266 off setting an upper bound on it.
267
268 Two nights later: Okay, just about have the kinks worked
269 out. I've got two new global variables to keep track of
270 the input buffer:
271
272 input_buffer: resb INPUT_SIZE
273 input_buffer_pos: resb 4
274 input_buffer_end: resb 4 <--- new
275 input_eof: resb 4 <--- new
276
277 Now I can check input_eof in any input words and in the
278 outer interpreter.
279
280 Okay, I'm stuck in 'eat_spaces'. I'm peppering it with
281 DEBUG macro calls to see what's up. esi points to the
282 current character in the input buffer (if it's a space,
283 we want to advance past it). ebx contains the last
284 position filled in the buffer by 'read'.
285
286 $ mr
287 eat_spaces pos: 0804c774
288 eat_spaces RESET, pos: 0804c774
289 ES more input! esi: 0804c774
290 ES more input! ebx: 0804c774
291 45 234 "hello!" meow <----- I typed this
292 read bytes: 00000015
293 eat_spaces RESET, pos: 0804c774
294 ES more input! esi: 0804c774
295 ES more input! ebx: 0804c774
296 read bytes: 00000000 <----- I typed CTRL+D here
297 get_input EOF! 00000001
298 eat_spaces RESET, pos: 0804c774
299 get_next_token checking for EOF 0804ace2
300 Goodbye.
301 Exit status: 0
302
303 Well, that would be a problem. Looks like esi and ebx
304 are always the same value. Oops!
305
306 LOL, that's exactly it. I forgot to save the new end of
307 buffer pointer in 'get_input'. Here we are:
308
309 mov dword [input_buffer_end], ebx ; save it
310
311 Do you like super verbose logging? You'll love this.
312 Here I am printing "hello" and then quitting with
313 CTRL+D. It's hard to even find the interaction amidst
314 all the noise:
315
316 eat_spaces pos: 0804c7d5
317 eat_spaces RESET, pos: 0804c7d5
318 eat_spaces looking at char... 0000000a
319 ES more input! esi: 0804c7d6
320 ES more input! ebx: 0804c7d6
321 "hello" print
322 read bytes: 0000000e
323 eat_spaces RESET, pos: 0804c7cc
324 eat_spaces looking at char... 00000022
325 get_next_token checking for EOF 0804ad12
326 get_next_token looking at chars. 0804ad12
327 quote0804c7cc
328 eat_spaces pos: 0804c7d3
329 eat_spaces RESET, pos: 0804c7d3
330 eat_spaces looking at char... 0804c320
331 eat_spaces looking at char... 0804c370
332 eat_spaces pos: 0804c7d4
333 eat_spaces RESET, pos: 0804c7d4
334 eat_spaces looking at char... 00000070
335 get_next_token checking for EOF 0804ad12
336 get_next_token looking at chars. 0804ad12
337 get_token0804c7d4
338 helloeat_spaces pos: 0804c7d9
339 eat_spaces RESET, pos: 0804c7d9
340 eat_spaces looking at char... 0000000a
341 ES more input! esi: 0804c7da
342 ES more input! ebx: 0804c7da
343 read bytes: 00000000
344 get_input EOF! 00000001
345 eat_spaces RESET, pos: 0804c7cc
346 get_next_token checking for EOF 0804ad12
347 Goodbye.
348 Exit status: 0
349
350 But it works. I'll clean this up tomorrow night and see
351 if I can add a simple test script.
352
353 Next night: The DEBUGs are cleaned up. Now a couple
354 housekeeping things. First, I want to complete that TODO
355 item from the last log, a word to print all defined
356 words (just the names, not the entire 'inspect' output).
357 I think I'll call it 'all'.
358
359 [ ] New word: 'all' to list all current word names
360
361 Well, that was easy:
362
363 $ mr
364 all
365 all inspect_all inspect ps printmode printnum number decimal bin oct hex radix str2num quote num2str ; return : copystr get_token eat_spaces get_input find is_runcomp get_flags inline print newline strlen exit
366 Goodbye.
367 Exit status: 0
368
369 I also added a non-destructive stack printing word last
370 log and I never actually got it working. So I'd like to
371 fix that.
372
373 [ ] Finish 'ps' (non-destructive stack print)
374
375 And since I have string escape sequences for
376 runtime newline printing and NASM can include newlines
377 in string literals with backticks, I'd like to remove
378 the 'newline' word. I'm only using it in a couple places
379 anyway.
380
381 [ ] Remove word 'newline' (replace with `\n`)
382
383 That one was super-easy too. I didn't really need a TODO
384 item for it. But it'll feel good to show that checked
385 box at the end of the log, so why not?
386
387 Now for that print stack:
388
389 $ mr
390 42 ps
391 1 4290881940 0 4290881948 4290881964 4290881982
392 4290882002 4290882040 4290882048 4290882106 ...
393
394 It just keeps going on and on. And then ends in a
395 Segmentation fault. So clearly I've got something wrong.
396
397 When the interpreter starts, I save the stack pointer to
398 a variable.
399
400 mov dword [stack_start], esp
401
402 I want to do a sanity check, so I'll push two values:
403
404 push dword 555
405 push dword 42
406
407 Let's see this in action to confirm how x86 stacks work:
408
409 $ mb
410 Reading symbols from meow5...
411 (gdb) break 877
412 Breakpoint 1 at 0x8049f92: file meow5.asm, line 877.
413 (gdb) r
414 Starting program: /home/dave/meow5/meow5
415
416 Breakpoint 1, _start () at meow5.asm:877
417
418 Okay, let's see what the stack register currently points
419 to (and by using GDB's 'display', this will always print
420 after every command):
421
422 (gdb) disp $esp
423 1: $esp = (void *) 0xffffd780
424 (gdb) disp *(int)$esp
425 2: *(int)$esp = 1
426
427 I've noticed that 1 (one) when I was trying to debug the
428 stack before. I have no idea why that's there. That's
429 something else to figure out.
430
431 Anyway, we can see the "first" stack address:
432
433 0xffffd780
434
435 And as I push values onto the stack, esp should
436 decrement by 4 since the x86 stack writes to memory
437 backward. (By the way, I feel a rant about how we
438 describe this coming on, stay tuned for that in a
439 moment.)
440
441 -------------------------------------------------------
442 NOTE
443 -------------------------------------------------------
444 By the way, I often manually manipulate these GDB
445 sessions here in my logs so that the instruction I'm
446 executing shows up right before I start examining
447 memory. Sorry if that confuses people who are
448 well-versed in GDB and are wondering what the heck is
449 going on.
450 -------------------------------------------------------
451
452 Now I'll just verify that my stack_start variable indeed
453 holds the same value as esp and it points to that '1' at
454 the beginning of the stack:
455
456 877 mov dword [stack_start], esp
457 (gdb) s
458 1: $esp = (void *) 0xffffd780
459 2: *(int)$esp = 1
460 (gdb) x/a (int)stack_start
461 0xffffd780: 0x1
462
463 Yup. No surprises so far.
464
465 Now when I push, we should see esp decrement and point
466 to the newly pushed value:
467
468 879 push dword 555
469 (gdb) s
470 1: $esp = (void *) 0xffffd77c
471 2: *(int)$esp = 555
472 880 push dword 42
473 (gdb) s
474 1: $esp = (void *) 0xffffd778
475 2: *(int)$esp = 42
476
477 Looks good so far!
478
479 0xffffd780 1
480 0xffffd77c 555
481 0xffffd778 42
482
483 ...I think. I'm really no good at hex calculations in my
484 head. Even easy ones. Let's confirm with 'dc', the old
485 RPN desk calculator on UNIX systems since forever:
486
487 $ dc
488 16 i 10 o <--- set input and output base to 16 (get it?)
489 1A 5 + p
490 1F <--- just making sure it's set up okay
491 D780 p
492 D780 <--- 0xffffd780
493 4 - p
494 D77C <--- 0xffffd77c
495 4 - p
496 D778 <--- 0xffffd778
497
498 dc is crazy. Anyway, those addresses are right. Every
499 push subtracts 4 from esp and writes the pushed value to
500 that address.
501
502 So when I examine the stack area of memory, I should be able to
503 subtract 4 from my stack_start variable and see each
504 value. When I hit the current value of esp, that's the
505 last value on the stack and I'm done:
506
507 (gdb) x/d (int)stack_start
508 0xffffd780: 1
509 (gdb) x/d (int)stack_start -4
510 0xffffd77c: 555
511 (gdb) x/d (int)stack_start -8
512 0xffffd778: 42
513
514 Great! So the computer is doing what I think it's doing.
515 Always a good sign. :-)
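
The walk I just did by hand in GDB can be sketched as a
loop: start at the saved stack_start address and step
down by 4 until reaching esp. This is only shell
arithmetic on the addresses from the session above, not
the interpreter's code, and the function name is made up
for this example.

```shell
#!/bin/sh
# walk_stack START ESP: list each 4-byte slot address
# from the oldest value (START) to the newest (ESP).
walk_stack() {
    addr=$(( $1 ))
    last=$(( $2 ))
    while [ "$addr" -ge "$last" ]; do
        printf '0x%x\n' "$addr"
        addr=$(( addr - 4 ))
    done
}

walk_stack 0xffffd780 0xffffd778
```

The addresses come out oldest first, which is the order
a non-destructive stack print wants to show them in.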
516
517 *****************************************************
518 * RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT *
519 *****************************************************
520
521 Okay, so my issue with how we talk about stacks is the
522 use of terms like "top" and "bottom".
523
524 If we start with the stack of plates analogy, it's
525 perfectly fine to talk about the top of the stack
526 because it makes physical sense:
527
528 ===== <--- top plate
529 =====
530 =====
531 =====
532
533 But where's the "top" of this memory?
534
535 +-----+
536 | | 0x0000
537 +-----+
538 | | ...
539 +-----+
540 | | 0xFFFF
541 +-----+
542
543 Okay, now where's the "top" of this memory?
544
545 +-----+
546 | | 0xFFFF
547 +-----+
548 | | ...
549 +-----+
550 | | 0x0000
551 +-----+
552
553 Where's the "top" of the stack in this memory?
554
555 +-----+
556 | === | 0xFFFF } stack start
557 +-===-+ } stack
558 | === | ... } stack
559 +-----+
560 | | 0x0000
561 +-----+
562
563 And the "top" of the stack in this memory?
564
565 +-----+
566 | === | 0x0000 } stack start
567 +-===-+ } stack
568 | === | ... } stack
569 +-----+
570 | | 0xFFFF
571 +-----+
572
573 Or this?
574
575 +-----+
576 | | 0xFFFF
577 +-----+
578 | === | ... } stack
579 +-===-+ } stack
580 | === | 0x0000 } stack start
581 +-----+
582
583 Or this?
584
585 +-----+
586 | | 0x0000
587 +-----+
588 | === | ... } stack
589 +-===-+ } stack
590 | === | 0xFFFF } stack start
591 +-----+
592
593 I've seen ALL of these representations over the years
594 and the person making the diagram just passes it off
595 like their own personal mental model is completely
596 obvious.
597
598 This situation is nuts.
599
600 And I know Intel's official docs for x86 use the "top"
601 and "bottom" terms. But guess what? Intel's "word" size
602 on 64-bit processors is 16 bits, so I think we can
603 safely ignore their advice on terminology.
604
605 Personally, I don't picture ANY of the diagrams above.
606
607 Instead, I imagine the stack as horizontal memory and
608 the stack grows to the right:
609
610 +--------------------
611 | A | B | C | D | E --->
612 +--------------------
613 ^ ^
614 oldest current
615
616 But you'll notice that I don't say "rightmost" or
617 "leftmost". That would be ridiculous. Especially since
618 x86 has a stack that grows from a high-numbered address
619 to a lower-numbered address. So it's really more like
620 this:
621
622 --------------------+
623 <--- E | D | C | B | A |
624 --------------------+
625 ^ ^
626 0xE4 0xFF
627 (current) (oldest)
628
629 Anyway, the point is that using directional descriptions
630 as if we were all looking at the same physical object is
631 super confusing.
632
633 I prefer stack descriptions such as:
634
635 * current / newest / recent
636 * older / previous
637 * oldest
638 * hot vs cold
639 * surfaced / buried
640
641 And so on. I'm sure you can think of some better ones.
642 Actually, please do.
643
644 *****************************************************
645 * RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT *
646 *****************************************************
647
648 Sorry about that. I do feel better now. So, I've made
649 some changes in how I do the stack printing (I needed to
650 basically reverse everything I was doing, ha ha) and
651 let's see if it works now:
652
653 $ mr
654 ps
655 1
656 42 555 97 33
657 ps
658 1 42 555 97 33
659 "Hello $ $ $" print
660 Hello 33 97 555
661 ps
662 1 42
663 "I put $ on there, but where does the $ come from???\n" print
664 I put 42 on there, but where does the 1 come from???
665 Goodbye.
666 Exit status: 0
667
668 I don't know if that's hard to follow or not? It's
669 tempting to make some sort of prompt in the interpreter
670 just so it's easier to see the commands I type versus
671 the responses.
672
673 Anyway, it works great. I just don't understand why
674 there's a 1 on the stack when I start?
675
676 I guess it doesn't really matter. It occurs to me that I
677 should consider the start of the stack to be the *next*
678 available position. I'll update that now.
679
680 From:
681
682 mov dword [stack_start], esp
683
684 To:
685
686 lea eax, [esp - 4]
687 mov [stack_start], eax
688
689 Did that fix it?
690
691 ps
692
693 42 16 ps
694 42 16
695 8 ps
696 42 16 8
697
698 Yup! Now we start with nothing on the stack and adding
699 items to the stack only shows those items.
700
701 Now how about a test script? I'm a big fan of simple
702 tests that are just enough to give me the peace-of-mind
703 that I haven't broken anything that used to work.
704
705 One thing that works just fine now that I take input on
706 STDIN is piping input:
707
708 $ echo "42 13 ps" | ./meow5
709 42 13
710 Goodbye.
711
712 And I can grep/ag the results to make sure they contain
713 what I want.
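
A pipe-and-grep check like that might look something
like this. It's a minimal sketch, not my actual test
script: the 'check' helper is made up for this example,
and it only runs against ./meow5 if the binary is there.

```shell
#!/bin/sh
# check INPUT EXPECTED CMD...: pipe INPUT into CMD and
# look for EXPECTED as a fixed string in its output.
check() {
    input=$1
    expected=$2
    shift 2
    if printf '%s\n' "$input" | "$@" | grep -qF -- "$expected"; then
        echo "ok: $input"
    else
        echo "FAIL: $input (wanted: $expected)"
        return 1
    fi
}

# Only try the real interpreter if it has been built.
if [ -x ./meow5 ]; then
    check '42 13 ps' '42 13' ./meow5
fi
```

Since 'check' takes the command as arguments, it can be
tried against any filter (even 'cat') before pointing it
at the interpreter.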
714
715 But I remember 'expect' from back when I was heavy into
716 Tcl. I think I'll give that a shot to interactively
717 drive the interpreter and test it.
718
719 Expect is so cool. Here's my whole test script so far:
720
721 #!/usr/bin/expect
722
723 spawn ./meow5
724
725 # Print a string
726 send -- "\"Meow\\n\" print\r"
727 expect "Meow"
728
729 # Construct meow and test it
730 send -- ": meow \"Meow. \" print ;\r"
731 send -- "meow\r"
732 expect "Meow. "
733
734 # Construct meow5 and test it
735 send -- ": meow5 meow meow
736 meow meow meow \"\\n\" print ;\r"
737 send -- "meow5\r"
738 expect "Meow. Meow. Meow. Meow. Meow."
739
740 # Exit (send CTRL+D EOF)
741 send -- "\x04"
742 expect eof
743
744 The long meow5 definition line has been broken onto the
745 next line for this log.
746
747 Here it is running!
748
749 $ ./test.exp
750 spawn ./meow5
751 "Meow\n" print
752 Meow
753 : meow "Meow. " print ;
754 meow
755 Meow. : meow5 meow meow meow meow meow "\n" print ;
756 meow5
757 Meow. Meow. Meow. Meow. Meow.
758 Goodbye.
759
760 I'll add a new alias for it now. (Defined by my "meow"
761 function in .bashrc):
762
763 alias mt="./build.sh ; ./test.exp"
764
765 Sweet! That wraps up this log and the goals I had for
766 it. I'll add more to the test script as I go. This was
767 just to get it started.
768
769
770 [x] Get input string from 'read' syscall
771 [x] Finish 'ps' (non-destructive stack print)
772 [x] New word: 'all' to list all current word names
773 [x] Remove word 'newline' (replace with `\n`)
774 [x] Add some testing (expect!)
775
776 I think I might make some math words next so I can use
777 the language to do basic stuff like add and subtract!