As far back as I can remember, one of the accepted dogmas of reverse engineering is that when a program is compiled, some information about the program is lost, and there is no way to recover it. Variable names are one of the most frequently cited casualties of this idea. In fact, this argument is used by countless authors in the introductions of their papers to explain some of the unique challenges that binary analysis has compared to source analysis.
You can imagine how shocking (and cool!) it was then when my colleagues and I found that it's possible to recover a large percentage of variable names. Our key insight is that the semantics of the binary code that accesses a variable is actually a surprisingly good signal for what the variable was named. In other words, we took a huge amount of source code from github, compiled it, and then decompiled it. We then trained a model to predict the variable names, which we can't see in an executable, based on the way that the code accesses those variables, which we can see in an executable. In Meaningful Variable Names for Decompiled Code: A Machine Translation Approach, we showed that this was possible using Statistical Machine Translation, which is one of the techniques used to translate between natural languages.
I'm happy to announce that our latest paper on this subject, DIRE: A Neural Approach to Decompiled Identifier Renaming, was accepted to ASE 2019. In this paper, we found that neural network models work really well for recovering variable names in decompiled code. I'll post our camera ready as soon as its finished.
I'll be speaking at the 2019 Central Pennsylvania Open Source Conference (CPOSC) in September. I attended CPOSC for the first time last year, and was very impressed by the quality of the talks. The name is a little misleading; talks are not necessarily related to open source software. I'll actually be giving a primer on code reuse attacks.
Powered with by Gatsby 5.0