(Apologies in advance, this is probably going to be rambling.)
I spend a lot of time looking at various research artifacts. Yesterday, I was looking at several verification artifacts, and I was struck by how much impact small details like a Docker container can have. One of the projects I was using is STOKE. STOKE is a cool project, but the salient detail for this post is that it is an abandoned research project. The last commit was in December 2020. This is very common: Ph.D. student creates a project, maintains project, graduates, and then no longer maintains project. Despite that, STOKE has a Docker container, which makes using the project trivial.
In contrast, I was also attempting to run Psyche-C this week. Like STOKE, Psyche-C has accumulated a fair bit of bitrot on the branch that contains the type-inference component. Unlike STOKE, it does not have a Docker container. Part of this branch uses Haskell lts-5.1, which is from 2016! Trying to get this running was a nightmare, since modern versions of stack, GHC, and cabal could not cope with such an old environment. I was eventually able to get it running by creating an Ubuntu focal Docker image, but it took me an entire day. I also created a HuggingFace space for it.
I have said it before, but I just love HuggingFace spaces for hosting research artifacts. It makes it almost effortless for others to try out your research. I wish more researchers would use them.
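To give a sense of how little is needed: a space is essentially a repository containing an app.py (plus a requirements.txt listing your dependencies). Below is a minimal sketch of such a file; rename_variables is a hypothetical stand-in for whatever your artifact actually does.

```python
# app.py -- a minimal Gradio app of the kind a HuggingFace space runs.
import gradio as gr

def rename_variables(decompiled_code: str) -> str:
    # Call your model or tool here; this stub just echoes its input.
    return decompiled_code

demo = gr.Interface(
    fn=rename_variables,
    inputs=gr.Textbox(lines=15, label="Decompiled code"),
    outputs=gr.Textbox(lines=15, label="Annotated code"),
    title="My research artifact",
)

if __name__ == "__main__":
    demo.launch()
```

That is essentially the whole app: push it to a space and HuggingFace builds and hosts the demo for you.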
I think that the decompilation and reverse engineering research community could also significantly benefit from using HuggingFace spaces and generally making artifacts easier to use. I say this because there are many subtle details about reverse engineering research artifacts that can make them less usable in practice.
For example, I was recently reading DecompileBench, which is a good paper about benchmarking decompilers. In particular, they have a very clever method for testing whether a decompiled function is semantically equivalent to the original source code. In short, they compile the decompiled function in isolation, splice it into the original program, and then test whether it behaves the same way. I've been thinking about this topic a lot recently, since I have been discussing it with some of my students. The problem is that if the binary is stripped, the decompiler can't refer to symbols by their original names, and thus the decompiled code can't reliably be linked back into the original program. (Ryan pointed out on Bluesky that this is possible in some cases.) DecompileBench sidesteps this problem by decompiling unstripped binaries, which is a limitation because decompilers are usually used on stripped binaries, and they generally perform significantly worse on them.
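To make the symbol issue concrete, here is a toy sketch of the splicing idea. This is not DecompileBench's actual harness, just an illustration that assumes gcc is on your PATH: a "decompiled" function is compiled on its own and linked against the rest of the program, which only works because the function is still named add. From a stripped binary, the decompiler would instead emit a placeholder name like sub_401000, and the link step would fail with an undefined reference.

```python
#!/usr/bin/env python3
# Toy illustration of splicing a decompiled function back into a program.
import pathlib
import subprocess
import tempfile

REST_OF_PROGRAM = r"""
#include <stdio.h>
int add(int a, int b);  /* provided by the spliced-in object file */
int main(void) { printf("%d\n", add(2, 3)); return 0; }
"""

# What a decompiler might emit for `add` from an *unstripped* binary.
DECOMPILED_ADD = r"""
int add(int a, int b) { return a + b; }
"""

def run(*cmd):
    subprocess.run(cmd, check=True)

with tempfile.TemporaryDirectory() as tmp:
    d = pathlib.Path(tmp)
    (d / "rest.c").write_text(REST_OF_PROGRAM)
    (d / "decompiled.c").write_text(DECOMPILED_ADD)
    # Compile the decompiled function in isolation, then link it into the program.
    run("gcc", "-c", str(d / "rest.c"), "-o", str(d / "rest.o"))
    run("gcc", "-c", str(d / "decompiled.c"), "-o", str(d / "decompiled.o"))
    run("gcc", str(d / "rest.o"), str(d / "decompiled.o"), "-o", str(d / "spliced"))
    result = subprocess.run([str(d / "spliced")], capture_output=True, text=True)
    print("spliced program printed:", result.stdout.strip())  # expect 5
```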
My goal is not to criticize DecompileBench; I think it's a nice paper. My point is that there are many subtle details like this that can make research artifacts less useful in practice, and I've had my own share of them. As one example, the DIRT dataset was stripped using the wrong command, so function names were still present in the binaries, which is unrealistic. Fortunately, it turned out (surprisingly) that this did not significantly affect the results in that paper, but it could have.
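For the curious, this failure mode is easy to reproduce. I'm not claiming the snippet below uses the exact command from the DIRT pipeline; it just shows how subtle the difference is. strip --strip-debug removes debug information but leaves the symbol table (and thus function names) intact, while strip --strip-all removes the symbol table as well. It assumes gcc and binutils are installed.

```python
#!/usr/bin/env python3
# Demonstrates that an incomplete strip leaves function names in the binary.
import pathlib
import subprocess
import tempfile

SOURCE = "int secret_helper(int x) { return x * 2; }\nint main(void) { return secret_helper(21); }\n"

def has_symbol(binary: str, name: str) -> bool:
    # nm lists the symbol table; a fully stripped binary has none.
    out = subprocess.run(["nm", binary], capture_output=True, text=True)
    return name in out.stdout

with tempfile.TemporaryDirectory() as tmp:
    d = pathlib.Path(tmp)
    (d / "prog.c").write_text(SOURCE)
    subprocess.run(["gcc", "-g", str(d / "prog.c"), "-o", str(d / "prog")], check=True)

    # Removes debug info only; the symbol table (and function names) survive.
    subprocess.run(["strip", "--strip-debug", "-o", str(d / "partial"), str(d / "prog")], check=True)
    print("name visible after --strip-debug:", has_symbol(str(d / "partial"), "secret_helper"))  # True

    # Removes the symbol table too; this is what "stripped" should mean.
    subprocess.run(["strip", "--strip-all", "-o", str(d / "full"), str(d / "prog")], check=True)
    print("name visible after --strip-all:  ", has_symbol(str(d / "full"), "secret_helper"))  # False
```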
I think part of the problem in both of these examples is that it's hard to get close to the actual use case. In decompilation, the real use case is decompiling stripped binaries in the wild, but it's hard to run DIRTY on a new binary to see how well it works. I have found this to be a common problem in machine-learning-based research. The straightforward approach is to start with a dataset for which you have ground truth, and then train and test on that dataset. This often leads to preprocessing code that expects ground-truth information to be available, which is problematic when you want to perform inference on a new example for which there is no ground truth, i.e., the primary use case of these technologies!
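The fix is mostly structural. Here is a hedged sketch, with hypothetical names rather than code from any particular project, of the pattern I wish more projects followed: a single preprocessing path in which the ground truth is optional, so the same code serves training and inference on a binary in the wild.

```python
# Sketch: make ground truth optional so preprocessing works at inference time.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    decompiled_code: str                # always available
    debug_names: Optional[dict] = None  # ground truth; None for a new binary

def preprocess(decompiled_code: str, debug_names: Optional[dict] = None) -> Example:
    # Tokenization, normalization, etc. must not assume debug_names exists.
    cleaned = decompiled_code.strip()
    return Example(decompiled_code=cleaned, debug_names=debug_names)

# Training time:  preprocess(code, debug_names=names_from_debug_info)
# Inference time: preprocess(code)   # no ground truth required
```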
My overall message is that Docker containers and HuggingFace spaces are great ways to make research artifacts easier to use. That is important in general, but it is also important that an artifact lets others get as close to the real use case as possible. If, for example, your technique only works on unstripped binaries and you forget to mention this in your paper, a Docker container or space is going to make that very apparent.
I have a pretty hot take: the top-tier conferences should mandate that research artifacts be easy to use on new examples, e.g., via Docker containers or HuggingFace spaces, and that these artifacts be considered as part of the submission. Having a separate, optional artifact evaluation process simply doesn't work. (The incentive for going through artifact evaluation is a badge, which is essentially a sticker for grown-ups!) But if reviewers can actually try out the artifact on new examples, they can see how well it works in practice. This would significantly improve the quality of research artifacts in our community.
I'm excited to announce that "Quantifying and Mitigating the Impact of Obfuscations on Machine-Learning-Based Decompilation Improvement" has been published at the 2025 Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA 2025)!
This work was primarily conducted by Deniz Bölöni-Turgut—a bright undergraduate at Cornell University—as part of the REU in Software Engineering (REUSE) program at CMU. She was supervised by Luke Dramko from our research group.
This paper tackles an important question in the evolving landscape of AI-powered reverse engineering: How do code obfuscations impact the effectiveness of these ML-based approaches? In the real world, adversaries often employ obfuscation techniques to make their code harder for reverse engineers to analyze. Although these obfuscation techniques were not designed with machine learning in mind, they can significantly modify the code, which raises the question of whether they could hinder the performance of ML models, which are currently trained on unobfuscated code.
Our research provides important quantitative insights into how obfuscations affect ML-based decompilation:
Obfuscations do negatively impact ML models: We demonstrated that semantics-preserving transformations that obscure program functionality significantly reduce the accuracy of machine learning-based decompilation tools.
Training on obfuscated code helps: Our experiments show that training models on obfuscated code can partially recover the lost accuracy, making the tools more resilient to obfuscation techniques.
Consistent results across multiple models: We validated our findings across three different state-of-the-art models from the literature (DIRTY, HexT5, and VarBERT), suggesting that they generalize.
Practical implications for malware analysis: Since obfuscations are commonly used in malware, these findings are directly applicable to improving real-world binary analysis scenarios.
This work represents an important step forward in making ML-based decompilation tools more resilient against the obfuscation techniques commonly encountered in real-world binary analysis scenarios. As the field continues to evolve, understanding these vulnerabilities and developing robust solutions will be crucial for maintaining the effectiveness of AI-powered security tools.
Want to know more? Download the complete paper.
I'm excited to announce that "A Human Study of Automatically Generated Decompiler Annotations" has been published at the 2025 IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2025)!
This work represents the culmination of Jeremy Lacomis's Ph.D. research, alongside our fantastic collaborators:
This paper investigates a critical question in reverse engineering: Do automatically generated variable names and type annotations actually help human analysts understand decompiled code?
Our study built upon DIRTY, our machine learning system that automatically generates meaningful variable names and type information for decompiled binaries. While DIRTY showed promising technical results, we wanted to understand its real-world impact on human reverse engineers.
Interested in the full methodology and detailed results? Download the complete paper to dive deeper into our human study design, statistical analysis, and implications for future decompilation tools.
Can existing neural decompiler artifacts be run on a new example? Here are some notes on the current state of the art. I assign each decompiler a score from 0 to 10 based on how easy it is to use its publicly available artifacts on a new example.
SLaDe has a publicly released replication artifact, but there are several problems that prevent it from being used on new examples. For example, the pipeline uses input/output (IO) examples to select among candidates during beam search, and automatically generating those IO examples (the "props" file) for arbitrary new functions is not straightforward.
Below is a quote from a private conversation with the author:
You are right that IO are somehow used to select in the beam search, in the sense that we report pass@5. They are not strictly required to get the outputs though.
The link you sent is for the program synthesis dataset. In this one, IO generation was programmatic but still kind of manual, I don't think it would be feasible to automatically generate the props file in the general case. For the Github functions, we have a separate repo that automatically generates IO tests, but those are randomly generated and the quality depends on each case. If I had to redo now, I would ask an LLM to generate unit tests! I can give you access to the private repo we used to automatically generate the IO examples for the general case if you wish, but now I'd do it with LLMs rather than randomly.
LLM4Decompile has published model files on HuggingFace that can easily be used to run on new examples. I created a few HuggingFace Spaces for testing.
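For reference, running one of these checkpoints locally is just the standard transformers recipe. In the sketch below, the checkpoint name and the prompt format are assumptions on my part; check the LLM4Decompile model cards on HuggingFace for the exact checkpoint names and the prompt template each version expects.

```python
# Sketch: load an LLM4Decompile checkpoint and decompile one function.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "LLM4Binary/llm4decompile-1.3b-v1.5"  # assumed name; verify on the hub

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval()

# func.asm: disassembly of the target function, formatted per the model card.
prompt = open("func.asm").read()
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```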
resym has a publicly released replication artifact. Unfortunately, as of February 2025, the artifact is missing the "prolog-based inference system for struct layout recovery" which is the key contribution of the paper. Thus it is not possible to run resym on new examples.
DeGPT has a publicly released GitHub repository. I'm largely going on memory, but I used it previously on new examples and it was relatively easy to use. I did have to file a few PRs though.