Compilers, Decompilers and Programming Language Design

2024-01-04

I'm starting this blog to talk about compilers, decompilers and programming language design. I have accumulated tons of random notes on these topics from my experiments over the years. Many of them are worthless, but a few may be worth sharing: the good ideas as well as the bad ones.

This is a hobbyist's blog. I am not an expert in programming language theory, so there won't be any advanced theory here, just simple problems with simple solutions. I'm not a researcher either: I may occasionally cite some papers, the classics of the field, but don't expect any scientific rigor here.

Programming Language Design

I really love assembly programming:

it's simple,
I tell the CPU directly what to execute,
I see the cost in size and time of my code,
I have access to all instructions and
it encourages good practices such as simple control flows, short functions and the DRY principle.

But, as with other languages that lack static typing, huge programs can be a nightmare to maintain and refactor.

Since I'm not a big fan of C and C++, I tried to create my perfect replacement for C. It's been a lot of fun. In the end, despite being far from perfect, I've managed to create something that I'm happy to use every day for my personal projects. I've experimented with a lot of features such as exceptions, block expressions, union types, traits and many more, only to drop them all in the end. Not because they're bad or don't fit my needs, but most often because I want to keep the language simple and close to the bare metal.

Compilers

LLVM has changed the game in writing compilers: now you can focus more on the language design itself and less on code generation.

Yet it can be a good idea to write your own code generator, even for a language that is well suited to LLVM: it is a very interesting learning experience, and compilation can be an order of magnitude faster. Last time I checked, my own x86 generator was 30 times faster than LLVM.

I don't think I'll write a lot about compiler development itself, except for the tricks I used to make compilation very fast. There are already hundreds of tutorials on how to write a compiler and I don't think I have anything to add.

Decompilers

Years ago I started a very ambitious project: writing a decompiler completely agnostic of the source language, in order to translate any program, including the ones written in assembly, into a higher-level language.

I've done a lot of experiments and I have a lot to write on this topic, but since I haven't yet released even a pre-alpha version of a toy decompiler, I don't think I'll post a lot in the near future.

Today I focus more on decompiling programs from old 8-bit machines (Z80), so I have to deal with some specific problems and some ancient, forgotten techniques that modern decompilers usually ignore:

PC-based parameter passing,
self-modifying code,
Return-Oriented Programming (ROP),
instruction overlapping,
operations split into multiple parts and
tail calls,

so I think it might be interesting to present these old techniques that were once more common, along with the solutions I've found to handle them.
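To give a feel for one item on that list, here is a rough sketch of instruction overlapping, written in Python. Everything in it is an assumption made up for illustration: the six-byte sequence, the `disassemble` helper and the three-entry opcode table are a toy, not code from any real decompiler. The three Z80 opcodes themselves are real (0x3E is LD A,n; 0x01 is LD BC,nn; 0xC9 is RET). The trick shown is the classic one where the opcode byte of LD BC,nn is used to swallow the two bytes of the next instruction, so a routine gets two entry points that share the same tail.

```python
# Toy Z80 decoder covering only the three opcodes used in this example.
# opcode -> (mnemonic template, total instruction length in bytes)
OPCODES = {
    0x3E: ("LD A,{n}", 2),    # load an 8-bit immediate into A
    0x01: ("LD BC,{nn}", 3),  # load a 16-bit little-endian immediate into BC
    0xC9: ("RET", 1),
}


def disassemble(code, start):
    """Decode linearly from `start` until RET; return (address, text) pairs."""
    out = []
    pc = start
    while pc < len(code):
        template, length = OPCODES[code[pc]]
        # Operand bytes follow the opcode; an empty slice decodes to 0.
        operand = int.from_bytes(code[pc + 1:pc + length], "little")
        text = template.format(n=f"0x{operand:02X}", nn=f"0x{operand:04X}")
        out.append((pc, text))
        pc += length
        if text == "RET":
            break
    return out


# Hypothetical routine with two entry points, at offsets 0 and 3.
CODE = bytes([0x3E, 0x01, 0x01, 0x3E, 0x02, 0xC9])

first = disassemble(CODE, 0)   # entry point 1
second = disassemble(CODE, 3)  # entry point 2, inside the LD BC of entry 1

for name, listing in (("entry at 0", first), ("entry at 3", second)):
    print(name)
    for address, text in listing:
        print(f"  {address:04X}  {text}")
```

Entering at offset 0, the bytes at offsets 3 and 4 are only the operand of a harmless LD BC; entering at offset 3, the same two bytes are a real LD A. A decompiler that assumes each byte belongs to exactly one instruction cannot represent both readings, which is why this case needs special handling.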

That's All for Now

I group all these topics in a single blog because I think that they are strongly related:

Writing a decompiler requires knowing the basics of compilers.
Designing a language suitable for decompilation can help with reverse engineering.
Compilers and decompilers are more similar than they seem. Both are instances of the more general concept of program transformation, and some operations performed by a compiler are really closer to decompilation.