You see this with some apps (I think ReVanced is a popular example?) and games occasionally, and I’ve never been clear on how they do it.
The term you are looking for in general is “reverse engineering”. For software in particular you are looking at disassembly, decompilation and various forms of tracing and debugging.
As for particular software: For .NET there is ILSpy that can help you look into how things work. For native code I have used Ghidra in the past.
Native code is a lot more effort to understand. In both cases things like variable names will be gone. Most function names will be missing (even more so for native code). Type names too. For native code the types themselves will be gone, so you will have to look at what is going on and guess whether something is a struct or an array, how big the struct is, and what its fields are.
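To make the "struct or array?" guessing concrete, here is a minimal sketch in Python. The 12 bytes and both candidate layouts are made up for illustration; they do not come from any real program. The point is only that the same raw bytes decode cleanly under more than one guess, and nothing in the bytes themselves tells you which one the original author had in mind.

```python
import struct

# Hypothetical 12 bytes captured from memory; the real layout is unknown to us.
raw = bytes.fromhex("010000000200000000004040")

# Guess 1: an array of three little-endian unsigned 32-bit integers.
as_array = struct.unpack("<3I", raw)

# Guess 2: a struct of two unsigned 32-bit integers followed by a 32-bit float.
as_struct = struct.unpack("<IIf", raw)

print(as_array)
print(as_struct)
```

Both unpack calls succeed. In practice you decide between such guesses by looking at how the surrounding code uses the values (integer vs. floating point instructions, loop indexing, and so on).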
Leftover debug or logging lines are very valuable in figuring out what something is. Often you have to go over a piece of disassembly or decompiled code several times as your understanding of it gradually builds.
With heavily object-oriented C++ code it tends to be easier to figure out the big picture than with C code, as the classes and inheritance provide a more obvious pattern.
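Finding those leftover strings is usually the first thing to try. Below is a rough sketch of the idea in Python: scan a blob of bytes for runs of printable ASCII. The function name `extract_strings` and the sample blob are invented for this example; tools like Ghidra have a proper string search built in, so this is only meant to show what is going on underneath.

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Hypothetical blob standing in for a slice of a compiled binary.
blob = (
    b"\x00\x13\xff"
    b"battery_read_status failed: %d"
    b"\x00\x01\x02ok\x00"
    b"fan_speed_rpm\x00"
)

found = extract_strings(blob)
print(found)
```

A format string like the first one hints at both what the nearby function does and what it was probably called. Short runs such as "ok" fall below the minimum length and are skipped to reduce noise.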
Then there is dynamic tracing (running under some sort of debugger or call tracer to see what the software does). I have not had as much success with this.
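As a toy illustration of call tracing, here is a sketch that uses Python's `sys.settrace` hook to log which functions get called while some code runs. The functions `read_sensor` and `update_fan` are hypothetical stand-ins for code you cannot read. Tracing native code works at the debugger or OS level instead, so take this as the general idea rather than the actual method.

```python
import sys

calls = []

def tracer(frame, event, arg):
    # Record the name of every Python-level function that gets called.
    if event == "call":
        calls.append(frame.f_code.co_name)
    return None  # no per-line tracing inside the called function

# Hypothetical stand-ins for code we do not have the source for.
def read_sensor():
    return 42

def update_fan():
    return read_sensor() * 2

sys.settrace(tracer)
try:
    result = update_fan()
finally:
    sys.settrace(None)  # always remove the hook again

print(calls, result)
```

The recorded call order tells you which function drives which, without reading any of the code first.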
Note that I’m absolutely an amateur at reverse engineering. I thought it was interesting enough that I wanted to learn it (and I had a small project where it was useful). But I’m mostly a programmer.
I have done a lot of low-level programming (C, C++, even a small amount of assembly, in recent times a lot of Rust), and this knowledge helps when reverse engineering. You need to understand how compilers and linkers lower code to machine code in order to have a fighting chance at reversing that.
Also note that there may be legal complications when doing reverse engineering, especially with regard to how you make use of the things you learned. I’m not a lawyer, this is not legal advice, etc. But check out the legal guidelines of Asahi Linux (who are working on reverse engineering M1 Macs to run Linux on them): https://asahilinux.org/copyright/ (scroll down to “reverse engineering policy”).
Now this covers (at a high level) how to figure things out. How you then patch closed source software I have no idea. Haven’t looked into that, as my interest was in figuring out how hardware and drivers worked to make open source software talk to said hardware.
What do you mean when you say “native code”? It sounds like perhaps C and similar languages?
Also, as someone who would be approaching this as an amateur as well: have you pulled together some resources you’ve found useful in your learning, or has it largely been a matter of scraping together info from searches as you go, without much that would be useful to point others to?
With native code I mean machine code. That is indeed usually produced by C or C++, though there are some other options too, notably Rust and Go both also compile to native machine code rather than some sort of byte code. In contrast Java, C# and Python all compile to various byte code representations (that are usually much higher level and thus easier to figure out).
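You can see how much higher level byte code is without any extra tools, since Python ships a disassembler in its standard library. A small sketch (the function `apply_discount` is just something made up for the example):

```python
import dis

def apply_discount(price, percent):
    reduction = price * percent / 100
    return price - reduction

code = apply_discount.__code__

# The compiled code object still carries the function and variable names.
print(code.co_name)
print(code.co_varnames)

# The instruction names vary between Python versions, so just list them.
opnames = [ins.opname for ins in dis.get_instructions(apply_discount)]
print(opnames)
```

The function name and all local variable names survive compilation, which is a big part of why byte code is so much easier to read back than machine code. The exact instructions differ between Python versions, so do not expect identical output everywhere.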
You could of course also have hand-written assembly code, but that is rare these days outside a few specific critical functions like memcpy or media encoders/decoders.
I basically learnt as I went, googling things I needed to figure out. I was goal oriented in this case: I wanted to figure out how some particular drivers worked on a particular laptop so I could implement the same thing on Linux. I had heard of and used Ghidra briefly before (during a capture the flag security competition at university). I didn’t really want to use it here though, to ensure I could be fully in the clear legally. So I focused on tracing instead.
I did in fact write up what I found out. Be warned it is a bit on the vague side and mostly focuses on the results I found. I did plan a followup blog post with more details on the process as well as more things I figured out about the laptop, but never got around to it. In particular I did eventually figure out power monitoring and how to read the fan speed. Here is a link to what I did write, if you are interested: https://vorpal.se/posts/2022/aug/21/reverse-engineering-acpi-functionality-on-a-toshiba-z830-ultrabook/
Thanks for the response, and the link! It’s interesting info, and good pointers to look to some of the existing tools from your OS and/or hardware providers for getting a start into whatever you’re working on.
I think I might have made the mistake of assuming they wouldn’t be available, and only bothered to look after trying a lot of other indirect methods, so it’s a good reminder to check for any available official tooling first and then supplement it with other tools where needed.
Not exactly the question you were asking, but there are also SDKs for closed source software. You can get a library, or just an interface definition you adapt to. It can be frustrating when you cannot peek a layer deeper into the system, and takes head banging, but it’s a thing. Often, if you are a significant enough client, you can get consulting or guidance from the devs at the other end.
Nowadays a lot more business software is open source (at least partially), because it increases adoption. People found that when you remove the stops, others will flock and build stuff around.
Compiled binaries can be decompiled back into source code. It’s not perfect by any means, but I was very surprised how well it worked the first time I decompiled a .NET application. With this as your base you can then make changes and recompile a new binary. This glosses over a lot of detail, and there are other ways like obtaining a leaked copy of the source code.
Yeah, it’s particularly easy with Java and C#, as they don’t compile all the way to machine code, but rather just to an intermediate representation (byte code).
The reason this works well for certain applications and not others comes down to programming language / framework and compilation optimization.
If the application was compiled directly into an executable binary and optimized, it can be decompiled, but it won’t be human-readable. Programmers would have to delve in and manually trace the code paths to figure out how it works. Fun fact, this is how a lot of the retro game decompilation projects are happening. Teams of volunteers are going through the unreadable decompilations and working together to figure them out.
.NET and Java based applications are easier, because they don’t usually get directly compiled into machine-executable binaries, and even when they do, it’s still easy to decompile them. This is because they’re both compiled to an intermediate language that is lower level than the original source but usually still keeps type and method names, and that IL is then run by a runtime. .NET’s IL is called Common Intermediate Language and Java’s is called bytecode. This sounds weird, but it’s kinda cool, because it lets people create different languages without having to write a full compiler. They just have to be able to compile to the intermediate language, and then the existing runtime can take it from there.
Are the tools involved typically called decompilers, or would you happen to know the different names they may go by? Trying to make sure I have some solid terms to guide my own research. Thanks for the response!
Yep, decompiler is the correct term.