tutorial, developer, with an peek in opposition to a brighter techno-social lifestyles
[prev]
[next]
2021-07-10
I wrote a 231-byte Brainfuck compiler by abusing the entirety
All offer code for this weblog post would possibly per chance presumably presumably be found right here.
We wrote a Brainfuck compiler back in April. That one developed a simple architecture to enhance a few backends. At the present time, I could per chance presumably presumably love to write a novel Brainfuck compiler. One who would possibly per chance presumably most productive bustle on a single plot, OpenBSD/amd64. And one that is as shrimp as I will presumably impact it. I broke into the sub-256 bytes membership with my compiler, so I believed it used to be recommended sharing and speaking about how I was ready to drag it off.
Whereas this works on my machine, there would possibly per chance presumably presumably be subtle differences about how your machine works that trigger this compiler no longer to work. Make determined to determine your machine’s habits if there are any differences!
We’re optimizing completely for size. Which technique that we would possibly per chance presumably impact selections which are less performant than diverse ways of writing a Brainfuck compiler. That is OK.
The assembly code shall be written the utilize of AT&T syntax. Final time I had a weblog post with assembly code, any individual complained that it used to be no longer Intel syntax. I’m no longer vital afflicted by either and I enlighten my undergraduates each syntaxes because I deem it’s a long way necessary with a understanding to learn each. In the occasion you are an Intel syntax particular person, spend this post as a risk to abet your AT&T syntax abilities.
Reviewing Brainfuck
Let’s rapidly evaluation the syntax of Brainfuck so we know what we want to implement. The table beneath presumes that we are translating Brainfuck to C, have a global array performing because the Brainfuck tape, and have a pointer, *p
, that is intialized to the starting of the array.
Image | C identical |
---|---|
< |
--p; |
> |
++p; |
- |
--*p; |
+ |
++*p; |
, |
*p = getchar(); |
. |
putchar(*p); |
[ |
while (*p) { |
] |
} |
Selecting an implementation language
I deem the most productive technique we stand a gamble at increasing a compiler as shrimp as capacity is to utilize hand-optimized assembly. Those precious bytes are now not going to set themselves!
We shall be capable of be the utilize of the GNU binutils for our assembler and the OpenBSD plot linker, which is LLVM lld, for our linker. I'm the utilize of the GNU assembler from ports, which is a substantial newer version of the assembler. This newer assembler is a shrimp bit bit smarter in the case of encoding and that finally ends up saving a byte for us down the motorway. I will level it out once we assemble there.
Coding our strings
Our compiler will bring collectively Brainfuck to C. I selected C since it will enable us to have a few of the smallest output code, and by extension the shortest strings wanted to have built-in to our compiler. We shall be capable of utilize the table above for our translation.
Our compiler therefore begins as follows:
.text .globl necessary necessary: .LSprologue: .ascii "a[65536],*p=a;necessary(){" # 21 .LS: .ascii "--p;" # 4 .ascii "++p;" # 4 .ascii "--*p;" # 5 .ascii "++*p;" # 5 .ascii "*p=getchar();" # 13 .ascii "putchar(*p); " # 13 .ascii "while(*p){" # 10 .ascii "return 0;" # 9 .ascii "}" # 1
The tip is just a few boilerplate to repeat the toolchain that we now have a global image named necessary and it (and the entirety else, for that matter) lives in the text piece of our binary.
Starting with .LSprototype
is our strings. Labels starting with .L
are regarded as native labels for ELF targets delight in my machine. After each string I set aside aside a inform with the sequence of characters in the string. We shall be capable of utilize the OpenBSD kernel facilities for studying and writing, and the write(2)
plot call has a parameter for sequence of bytes to write. Yes, we would possibly per chance presumably write a strlen(3)
aim, and if truth be told we did that for our echo program. On the replacement hand, writing such a aim would spend too many bytes when put next to easily shoving the ideal price into the register.
The .LSprologue
string is change into self sufficient from the the leisure of the .LS
strings because the technique I organized the compiler the prologue is passe straight and for all diverse strings we index into the string starting at .LS
.
Additionally, you will gape I'm the utilize of the .ascii
pseudo-op moderately than the .asciz
pseudo-op. This implies that there would possibly per chance be no NUL-termination of any of these strings, so that they're no longer C strings. Since we are offering the real sequence of bytes to write with each write(2)
call, we are able to assemble away with this and set a series of bytes in the job.
We're also leaving them in the .text
piece. That technically technique they're code. We would possibly per chance presumably set aside them in the .rodata
piece however since these string are by no technique written to, I deem it's stunning.
We would possibly per chance presumably take into memoir some more byte-shaving at work. Whereas we are compiling to C, we didn't specify which similar old we had been targeting. We're going very passe by having implicit int
s all over the set aside. We're also going to be implicitly declaring getchar(3)
and putchar(3)
which is invalid in C99 and later, in step with clang. So we can aim C89 and/or K&R C.
As a results of the implicit int
s the cells on the Brainfuck tape are 32-bits in size. There is no requirement for any particular cell size, so right here is appropriate from a Brainfuck standpoint.
I also made the getchar and putchar strings equal size, though the putchar string is on the entire one personality shorter. We shall be capable of employ a byte now to set several bytes later.
Finding out code into a buffer
Our parser must learn on this procedure one personality at a time and job each personality. If it's one in all the Brainfuck characers, we output the correct translation, otherwise we transfer on to the next personality. We're going to spend fair correct thing regarding the truth that we are able to learn from stdin
and write to stdout
with out having to pickle anything up outselves; these facilities are pickle up for us by our ambiance.
We shall be capable of also ought to survey up the syscall numbers for learn(2)
and write(2)
, which are 3 and 4.
We must always always learn in a personality before we are able to parse it and write anything. Our learn can survey delight in this:
.Lparse: movb $3, %al # Load learn(2) plot call movb $1, %dl # Read one personality xorl %edi, %edi # Read from stdin leaq (%rsp), %rsi # Read into high of stack syscall # learn(0, (%rsp), 1); incl %edi # Field %edi to 1, for write
We are able to spend fair correct thing about some assumptions on our plot for this to work. On our plot all overall-aim registers are initialized to 0 by the ambiance before program birth up. The movb
directions require most productive 2 bytes however attain no longer 0-out the the leisure of the register. If there used to be something already in %rax
or %rdx
above the lowest byte, it would dwell as-is. But since our ambiance pickle these registers to 0, we are OK. These movb
directions are 2 bytes each.
We also selected to utilize xorl
as an replacement of movb
for the %rdi
register. Right here's since it's seemingly you'll presumably presumably presumably't utilize movb
on %rdi
since it's at its smallest a 16-bit register. Since we are making it 0, xor'ing any 32-bit register with itself is the smallest technique to attain this at 2 bytes.
The leaq
instruction is costly at 4 bytes. Or no longer it's truly one in all the smallest variations of this instruction and we can take into memoir a 7-byte version later. In the slay, we want to call the syscall for 2 byte. That is 12 bytes total to assemble a personality from stdin
.
We're also assuming that our ambiance pickle up some stack for us and that our stack pointer parts to something practical. That happens to be correct kind. As we discovered before, the ambiance does non-public our argc
and argv
variables and that's the case right here too. So we now have an inexpensive stack pointer.
Whereas it's no longer straight related to studying in a personality, it turns out that %rdi
stays the similar all the way thru learn(2)
calls, so we are able to increment it to prepare for an upcoming write(2)
call. Incrementing is 2 bytes, which is the smallest technique to turn a 0 into a 1 for us. Even in the case where we attain no longer truly cease up writing anything, as an illustration because we learn in a inform personality, it would now not vital matter since we xor the register before the learn(2)
call on every occasion.
Checking for EOF
Now would possibly per chance presumably presumably be a correct kind time to determine if we are at the cease of the file, since if we are we now have executed our work and must exit this procedure. Fortunately, learn(2)
returns the sequence of bytes learn in %rax
. If that quantity is 1, then we are no longer but executed. In every other case, if it's 0 or -1 then we are executed. A 0 technique cease of file and a -1 technique an error occured. We assemble no longer truly care if we reached cease of file or encountered an error. Each of these imply it's time to exit for us.
Swiftly aside: The Machine V AMD64 ABI calling convention
Our machine uses the Machine V AMD64 ABI calling convention. This dictates how our registers are passe and how registers are saved or no longer all the way thru aim calls, in conjunction with syscalls. For us, we want to know that %rax
will non-public the return price of a call, so we must presume that it will commerce after each call. Additionally, %rax
contains the syscall amount earlier than making the syscall. The calling convention also says that the %rbx
, %rsp
, %rbp
, and %r12
-%r15
registers has to be restored to their usual price by the callee in the occasion that they are passe. Which technique as a long way as we are fervent, we are able to utilize these registers all the way thru calls and presume that their price will slay the similar. Any diverse registers would possibly per chance presumably presumably be clobbered by calls. We must always always also appreciate the registers passe for call parameters. Since learn(2)
and write(2)
utilize three parameters, we must always always be sure to utilize %rdi
, %rsi
, and %rdx
for these parameters, delight in we did with our learn(2)
call above.
It happens to be the case that that %rdi
and %rdx
attain no longer assemble clobbered by learn(2)
and write(2)
calls on our machine, which we can utilize to our advantage.
Checking for EOF
, portion 2
Now we are able to envision for EOF
after each learn:
cmpl %edx, %eax # EOF ? (%eax < 1) jl .Leof
That is all we resolve. Since %rdx
would now not assemble clobbered on our plot, we sign it's 1. If %rax
is lower than that, we know we either reached the cease of the file or an error occured. In that case, we must jump to the code that exits this procedure.
Parsing the input
One thing that we want to know is that the smallest cmp
is when you compare a byte in opposition to %al
. But as of now, %rax
contains the return code of the learn(2)
call. We stand to set eight bytes the utilize of the smallest cmp
. We shall be capable of cease up spending three bytes to set eight bytes, which is tall.
We are able to impact this happen by a shrimp bit tweaking our cease of file take a look at and then parsing our input after the cease of file take a look at:
xchg %ebp, %eax # Store return price in %ebp movb (%rsi), %al # cmpb imm, %al is the smallest cmp leal .LS, %esi # Preload string cmpl %edx, %ebp # EOF ? (%ebp < 1) jl .Leof cmpb $60, %al # '<' ? je .Lleft cmpb $62, %al # '>' ? je .Lright cmpb $45, %al # '-' ? je .Ldec cmpb $43, %al # '+' ? je .Linc cmpb $44, %al # ',' ? je .Lgetchar cmpb $46, %al # '.' ? je .Lputchar cmpb $91, %al # '[' ? je .Lopenloop cmpb $93, %al # ']' ? je .Lcloseloop jmp .Lparse # Comment personality, skip
We are able to xchg
to assemble our smallest cmp
. Right here's the explanation in the back of the newer GNU assembler. The older assembler in atrocious consistently encodes this to a 2-byte sequence however the newer assembler is terribly very top ample to know this would presumably presumably be encoded as a 1-byte instruction and does so. We then set aside aside the personality we learn in into %rax
and preload the take care of of all our strings by a 7-byte leal
. That is very dear, truly it's the most costly instruction we can come upon. On the replacement hand it's a long way a the biggest substandard for us. Now we create our cease of file take a look at and if we are no longer at the cease of the file we continue on determining which personality we now have and what to attain with it.
Processing each risk
Let's poke away the cease of file processing for later and now address the Brainfuck image processing. There are eight symbols we want to take care of:
.Lright: addl $4, %esi .Lleft: movb $4, %dl jmp .Lwrite
Let's spend a behold at fair correct this for correct now. This code handles each the <
and >
symbols. Steal that .LS
contains all our strings, prologue notwithstanding. The most necessary personality of .LS
corresponds to the <
symbol. We are calling that the .Lleft
label. That has an offset of 0 so we immediately move to telling write(2)
we want to write 4 characters and then head to the syscall. If we instead see a >
that is the .Lright
tag and we want to offset the place to begin of the string by 4. We then fall thru because each outputs are 4 characters in size so we are able to combine the code to set bytes.
We shall be capable of utilize these tricks just a few more times. Let's continue writing the emblem processing code:
.Ldec: subl $5, %esi # 13 - 5 = 8 .Linc: addl $13, %esi movb $5, %dl jmp .Lwrite
Let's cease right here too and try but every other trick. Each .Ldec
and .Linc
output strings which are the similar size. We resolve with a understanding to combine the code delight in we did in the .Lleft
and .Lright
case. On the replacement hand, we assemble no longer have an index of 0 for either. We assemble round this by creatively subtracting the size of the string from the offset of the string farther from the starting of .LS
to assemble the offset of the string nearer to the starting of .LS
and then fall thru from there. These subtractions and additions, because they're all the way thru the vary of -128 and 127 all encode to 2 bytes.
Let's continue on:
.Lgetchar: subl $13, %esi # 31 - 13 = 18 .Lputchar: addl $31, %esi movb $13, %dl jmp .Lwrite
Precisely the similar tricks as .Linc
and .Ldec
. Advise though that if I did no longer extend the putchar string by one personality at the starting, we would possibly per chance presumably no longer utilize any of these tricks and it would price us an entire bunch bigger than the one byte it price to elongate the putchar string.
Let's attain up our processing code:
.Lopenloop: incl %ebx # Increment loop counter addl $44, %esi movb $10, %dl jmp .Lwrite .Lcloseloop: decl %ebx # Decrement loop counter js .Lexit # %ebx < 0 ? (%rdi == 1 from the write(2) call) addl $63, %esi movb $1, %dl jmp .Lwrite
Sadly, we can't utilize the fall thru trick right here however there are some diverse tricks we are able to utilize. We utilize the truth that every our overall-aim registers are initialized to 0 and the truth that %rbx
is saved all the way thru calls to fair correct birth up the utilize of it successfully out of nowhere to act as our loop counter. The .Lopenloop
tag is otherwise easy.
The .Lcloseloop
tag has a further trick. Attributable to we want to have a correct kind compiler that has correct kind return values, 0 for success and 1 for failure, we decrement the loop counter and take a look at to take into memoir if it has fallen beneath 0. If it has, there is an imbalance of brackets and we must cease processing and mistake out. Sadly, dec
preserves the price of the lift flag, so we can't put it to use to uncover if we now have long gone beneath 0. On the replacement hand, it alters all diverse flags and we are able to utilize the signal flag to determine if we now have long gone detrimental. The js
opcode jumps if the signal flag is determined, which it most productive shall be if we now have long gone detrimental. We are able to steer determined of a cmp
in opposition to -1 this system. Fortunately, %rsi
has already been pickle to 1 and is precisely the price we resolve for _exit(2)
once we error out so we attain no longer must impact any changes to %rdi
. The leisure is straightfoward.
Writing output
We shall be capable of utilize the truth that stdout
comes on hand to utilize from our ambiance with out us having to attain anything and as such we can write our compiled code to stdout
. We're also going to utilize one other trick right here: we can set aside the .Lwrite
tag between the .Lparse
and .Lright
labels. Right here's because even as you have a jmp
that is all the way thru the vary of -128 and 127 bytes, that encodes to a short jump which is most productive 2 bytes. That is the smallest jmp
there is. We resolve all our jmp
s to be short jumps. Our .Lwrite
labels ends with a jmp
to .Lparse
in characterize to learn in the next personality and if it came after the entire code we fair correct wrote, it shall be bigger than -128 bytes away.
Our .Lwrite
tag is terribly short:
.Lwrite: movb $4, %al syscall # write(1, string, size); jmp .Lparse
That is it. The entire writes must position 4 into %rax
, impact the syscall, and then learn in the next personality.
Dealing with cease of file and exiting
Now let's write the code to take care of cease of file and the code to take care of exit, since we can utilize some fall thru to take care of them. Let's also set aside aside this unique code straight after .Lwrite
and before .Lright
, but again to make certain all our jmp
s are short jumps.
It appears to be like to be delight in this:
.Leof: cmpl %edx, %ebx # Loop counter < 1 ? (i.e., 0) jge .Lexit addl $54, %esi movb $10, %dl movb $4, %al syscall xorl %edi, %edi # Derive ready to exit .Lexit: movb $1, %al # _exit(%edi); syscall
When we attain the cease of the file, we want to make certain our loop counter is 0. If it's no longer, we now have a bracket imbalance and must error out straight. Like before, %rdi
happens to be 1 from the pickle up for the write(2)
call so we attain no longer ought to impact any changes.
The the leisure of the .Leof
tag is writing the C epilogue. It then sets %rdi
to 0 because we're going to exit efficiently and falls thru to .Lexit
which sets the syscall amount to the _exit(2)
syscall and calls it, which exits this procedure.
Writing the prologue
The closing thing we want to write in characterize to attain our compiler is the prologue. This code goes straight after the necessary
tag and before the .Lparse
tag.
We want to impact a call to write(2)
with our .LSprologue
string:
popq %rdi # Write prologue (argc == 1, confidently) leal .LSprologue, %esi movb $21, %dl jmp .Lwrite
Our closing trick is straight away popping the tip of the stack into %rdi
correct to birth with up. As we discovered before, our ambiance puts argc
in there. See you later as argc
is 1, which it must be, that lets in us to pickle up %rdi
because it shall be for this preliminary write(2)
call in barely 1 byte. If no longer, successfully then your output shall be unsuitable. It is the one error that goes undetected by our compiler. We're going to fair safe that we now have chop this corner.
And that's the explanation it! Our compiler is executed.
Inserting all of it collectively
Right here's what our closing performed compiler appears to be like to be delight in:
.text .globl necessary necessary: popq %rdi # Write prologue (argc == 1, confidently) leal .LSprologue, %esi movb $21, %dl jmp .Lwrite .Lparse: movb $3, %al # Load learn(2) plot call movb $1, %dl # Read one personality xorl %edi, %edi # Read from stdin leaq (%rsp), %rsi # Read into high of stack syscall # learn(0, (%rsp), 1); incl %edi # Field %edi to 1, for write xchg %ebp, %eax # Store return price in %ebp movb (%rsi), %al # cmpb imm, %al is the smallest cmp leal .LS, %esi # Preload string cmpl %edx, %ebp # EOF ? (%ebp < 1) jl .Leof cmpb $60, %al # '<' ? je .Lleft cmpb $62, %al # '>' ? je .Lright cmpb $45, %al # '-' ? je .Ldec cmpb $43, %al # '+' ? je .Linc cmpb $44, %al # ',' ? je .Lgetchar cmpb $46, %al # '.' ? je .Lputchar cmpb $91, %al # '[' ? je .Lopenloop cmpb $93, %al # ']' ? je .Lcloseloop jmp .Lparse # Comment personality, skip .Lwrite: movb $4, %al syscall # write(1, string, size); jmp .Lparse .Leof: cmpl %edx, %ebx # Loop counter < 1 ? (i.e., 0) jge .Lexit addl $54, %esi movb $10, %dl movb $4, %al syscall xorl %edi, %edi # Derive ready to exit .Lexit: movb $1, %al # _exit(%edi); syscall .Lright: addl $4, %esi .Lleft: movb $4, %dl jmp .Lwrite .Ldec: subl $5, %esi # 13 - 5 = 8 .Linc: addl $13, %esi movb $5, %dl jmp .Lwrite .Lgetchar: subl $13, %esi # 31 - 13 = 18 .Lputchar: addl $31, %esi movb $13, %dl jmp .Lwrite .Lopenloop: incl %ebx # Increment loop counter addl $44, %esi movb $10, %dl jmp .Lwrite .Lcloseloop: decl %ebx # Decrement loop counter js .Lexit # %ebx < 0 ? (%rdi == 1 from the write(2) call) addl $63, %esi movb $1, %dl jmp .Lwrite .LSprologue: .ascii "a[65536],*p=a;necessary(){" # 21 .LS: .ascii "--p;" # 4 .ascii "++p;" # 4 .ascii "--*p;" # 5 .ascii "++*p;" # 5 .ascii "*p=getchar();" # 13 .ascii "putchar(*p); " # 13 .ascii "while(*p){" # 10 .ascii "return 0;" # 9 .ascii "}" # 1
How many bytes is our compiler?
On OpenBSD, there is just a few most necessary overhead, 24 bytes price. We shall be capable of measure the size of our compiler three ways: by itself, with the OpenBSD overhead, and with the entire ELF overhead.
The compiler by itself:
$ size bf256.o text facts bss dec hex 231 0 0 231 e7
Entirely 231 bytes! Now now not wrong. If I remember because it shall be, supposedly the unique Brainfuck compiler used to be 240 bytes.
The bring along with the OpenBSD overhead:
$ size bf256 text facts bss dec hex 255 0 0 255 ff
So even if you remark that it's unfair that I excluded the OpenBSD overhead in my rely, we restful come in at most productive 255 bytes.
With the entire ELF overhead:
$ ls -l bf256 -rwxr-xr-x 1 brian brian 952 Jul 10 19: 38 bf256
Correctly, 952 bytes. OK, that's where the ELF overhead will get you. I assemble no longer would like to assemble out a hex editor and begin up whittling down all that. I'm going to yelp it's correct kind ample and giving us all permission to ignore the entire ELF overhead.
Conclusion
It is time for me to position away the text editor. My aim used to be 256 bytes and I fair correct scraped by. Can you acquire extra bytes to shave off?
Update: A 216-byte compiler
I was ready to set a further jmp
and an extraneous movb $1, %dl
, saving a further 4 bytes, which allowed me so that you can add back one byte and commerce the preliminary popq %rdi
to an incl %edi
, guaranteeing that the preliminary prologue write will consistently work no matter the sequence of arguments to this procedure and restful set a byte. Tag it right here.
I was ready to set a further byte by getting rid of the extra dwelling in the putchar string and the utilize of the closing personality in the getchar string to pad out the putchar string. As right here is a ;
personality, in C that turns into an empty assertion, which is nice C. This will get us all the kind down to 944 bytes with the entire ELF overhead. Tag it right here.
I was ready to set a further 11 bytes by getting rid of the epilogue string entirely and easily the utilize of the closeloop string for the epilogue. Curiously clang would now not poke over it. That will get us to 936 bytes with the entire ELF overhead. Tag it right here.