I'm here to talk to you about semi-automatic unpacking — and by that I mean unpacking malware, not anything to do with machine guns. To get started: man, there are a lot of packers out there. The problem that poses for people like me, doing malware reverse engineering, and for anti-virus companies, is that anybody can take a piece of malware, trivially repack it, and evade detection. A lot of times in my role I'll be faced with packers I've never seen before. I've got to figure out how they work, get to the malware inside, and figure out what it does. And I really don't want to spend a whole lot of time playing around with the packer code, because that's not unique to the malware — it's just something that's in the way for me. So I don't want to spend my time messing around with all these packers, and being in there every week figuring out how each new one works is getting old. I'd like to be able to do this in a more automatic way. So I want to review some of the methods people are using right now when it comes to automated unpacking. The hard way: you can take a particular packer and spend a lot of time looking at it — reverse engineering the packer code, figuring out how it rewrites the sections, what kind of encryption or compression it uses, and how it sets up the import table after it's done. That takes a lot of time. But when you're finished with all that, you can write your own engine that does everything the unpacking code does, handling all those algorithms and compression methods, and produce a one-use, per-version unpacker for Packer X. A lot of people have done this, and I'm not knocking it, but it takes a lot of time and energy to maintain something like that.
So it's great that people have done this, but there just aren't enough of these unpackers for me to go and say, well, I've got FSG version 1.33 — does this unpacker handle that? I don't have time to keep up on that either. AV engines have to have some sort of unpacking methodology; they have to know something about these packers. And if they go and put code into their AV engine for every single packer that's out there, that engine is going to be huge after a while, possibly introducing even more bugs and overflows and things like that. So that's the hard way. Another approach people sometimes take is: instead of learning the algorithm, why don't I just emulate the CPU, and let the packing code think it's running and writing to its memory, while we virtualize all of it. That's a pretty good approach, and a lot of people take it. Then you don't really have to deal with the algorithms the packer uses — you just have to deal with whatever idiosyncrasies cause your emulation engine not to work, and go fix those. But you've got to put a lot of time and energy into an emulation engine; it takes a long time even to get one off the ground. So a lot of times I see people write such things but never release the code, and it's something I don't have access to. Maybe the antivirus companies have this internally, or somebody like Piotr Bania has an unpacking engine, and people just aren't giving it away for free. It's a lot of work, and that's their right. Then there's the cheating method, and I've done this myself: just run the code and let it unpack itself in memory, on a system that I don't care about or a VMware image I can re-image later. Once it's in memory, go ahead and dump it right out of memory. And that, for all intents and purposes, is unpacked.
But it's not something we can really use if we want to do an in-depth analysis, because we don't know what that code looked like at the time it started — we don't have a snapshot of that moment in time. We don't know how variables that were initialized at the start have changed; there are things in there that wouldn't be in the actual unpacked code. And you don't necessarily know where the entry point was. In some cases you can use heuristics — if you know various compilers and they all do the entry point roughly the same way, sometimes you can find it that way — but if it was coded in assembly, you may not have a recognizable entry point. So you're left guessing where the entry point was and where the code started. There are ways around that, but it's just not a clean way to go about it. Now, when I look at all the code I've unpacked from various packers, I notice that a great many of them — not all, but a great many — have a very similar methodology for how they pack and then unpack the executable. They'll first analyze the sections — say, okay, I've got four sections in this file — and compress them in various ways. They'll take the import table and chop it up. Then they'll put a stub section at the end of the file, and that's where the new entry point is, so that the unpacking code runs first; when it's done its job, everything's decompressed, and the import table is set back up, it jumps to what we call the OEP, the original entry point — the entry point the file had before it was packed. Here's a graphic example of how this works. In the unpacked code it's very straightforward: the code section is the first section of the file, and then you have other sections — a data section, resource sections.
There's an entry point field in the PE header that points somewhere into that code section, and that's where execution jumps as soon as the file is loaded. In the packed version, we see that the entry point now points down to a new section that's been appended to the file, and that's what runs. Somewhere in there it stores the OEP, and a lot of times that's intentionally obfuscated, so you can't just read the file or the memory and say, okay, here's where the OEP is stored, I'll just set a breakpoint there. Sometimes you can — some packers aren't that sophisticated — but in most cases packers make a concerted effort to obscure that information, because it would be pretty darn easy if you could just say, here's the address, set a breakpoint, done. So if you use OllyDbg like I do and you unpack code, you've probably sat there and said: gee, if I could just set a breakpoint on that section as a whole and stop running when execution reached it — because I know that's when the code will be unpacked — that would be great. And you can sort of do that already. You can toggle a breakpoint on the memory section, but it's break-on-access, which means anything that touches that memory causes a break. The problem is that before the packer actually jumps in and executes that code, it's going to be reading from it and most definitely writing to it — hundreds of times, thousands of times, depending on how big the code is. So you'd have to set the breakpoint, clear it, and reset it again and again and again, which turns out to be a really slow process. Even if you automate it — you can use OllyScript or something like that: am I executing yet? No, I'm just reading; okay, set the breakpoint again.
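To make that entry point field concrete, here's a minimal sketch in C that pulls `AddressOfEntryPoint` out of a PE image sitting in a byte buffer. The offsets (0x3C for `e_lfanew`, 0x28 past the `PE\0\0` signature for the 32-bit optional header's entry point RVA) come from the PE/COFF specification; the helper names are mine.

```c
#include <stdint.h>
#include <stddef.h>

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t rd32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the entry point RVA from a PE image in `buf`, or 0 if the
 * buffer doesn't look like a PE file. Offsets per the PE/COFF spec:
 *   0x3C      -> e_lfanew (file offset of the "PE\0\0" signature)
 *   pe + 0x28 -> IMAGE_OPTIONAL_HEADER32.AddressOfEntryPoint
 */
uint32_t pe_entry_point(const uint8_t *buf, size_t len) {
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z')
        return 0;                              /* no DOS header */
    uint32_t pe = rd32(buf + 0x3C);
    if ((uint64_t)pe + 0x2C > len)
        return 0;                              /* headers truncated */
    if (buf[pe] != 'P' || buf[pe + 1] != 'E' || buf[pe + 2] || buf[pe + 3])
        return 0;                              /* bad PE signature */
    return rd32(buf + pe + 0x28);              /* AddressOfEntryPoint */
}
```

A packer rewrites exactly this one field to point at its appended stub section, which is why comparing it against the section table tells you immediately that a file is packed.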
You can do that, but it's really, really slow. And if you're going to do that, why not just set a tracing condition and say: if my instruction pointer falls between this address and this address, stop executing. That's a pretty old way of catching the code — it goes way back, and it's pretty common. But not only is it slow, it's also detectable: the packer code itself can check, say, the CPU's timestamp counter, see how long a subroutine took to run, and if it took way longer than it should have, there's a good chance a debugger is interacting with it and doing tracing. So it would be great, I thought to myself — and hopefully somebody else out there has thought this too — if I just had a way to break on execution, not access: just the execution of a memory section, without having to do all this tracing. That's what OllyBone is. OllyBone stands for break-on-execute for OllyDbg. The way we did this was to look at the architecture and ask: is there a possibility of setting a no-execute bit? Well, on x86, you can't. An NX bit has been added to more recent CPUs, but as it stands, the chips out there generally don't have this bit. But this problem was already solved in the PaX project. If you remember, PaX is designed to stop execution if your code happens to be running on the heap or on the stack — that's a sign there's been a bug or an overflow and something bad is happening. So PaX took care of this some years ago. What we do here is take the same idea, and instead of protecting the stack and the heap, we protect arbitrary pages that we pick out as targets — wherever we want to stop execution. So I hope everybody's familiar with PaX; I have a passing familiarity myself.
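The trace-based approach he's dismissing here boils down to single-stepping and testing, after every instruction, whether EIP has landed inside the target section. A toy sketch of that check (the function names and the simulated EIP log are mine, purely to show the shape of the loop — in a real debugger each iteration costs a full single-step exception, which is why it's so slow and so easy to detect with timing):

```c
#include <stdint.h>

/* The per-step test a tracing unpacker performs: has the
 * instruction pointer landed inside the target section yet? */
static int eip_in_section(uint32_t eip, uint32_t base, uint32_t size) {
    return eip >= base && eip < base + size;
}

/* Simulated trace: walk a recorded instruction-pointer log one
 * "step" at a time until execution reaches the protected range.
 * Returns the first EIP inside the section (a candidate OEP),
 * or 0 if the section is never executed. */
uint32_t trace_until_section(const uint32_t *eip_log, int n,
                             uint32_t base, uint32_t size) {
    for (int i = 0; i < n; i++)
        if (eip_in_section(eip_log[i], base, size))
            return eip_log[i];
    return 0;
}
```

Break-on-execute replaces this whole loop with a single hardware-assisted fault, so the packer never sees the timing skew that thousands of single steps introduce.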
I've never actually run it. Any time you try to do some sort of memory access, your OS has to translate the virtual address you're asking for into the actual physical address of the memory, so that we can maintain this gigantic virtual address space on top of a small amount of physical memory; the operating system figures that out and translates it for you. Because this process of doing virtual memory translations is time consuming, the Intel architecture uses what it calls translation lookaside buffers, or TLBs. The first time you do one of these virtual-to-physical address translations, a TLB caches the resulting address, so the next time you go to that page in memory you already know exactly where it is, which speeds things up tremendously. Now, the thing that makes PaX possible is that the x86 uses split TLBs: one for data, called the DTLB, and one for instructions, the ITLB. Because of that, we can make one of them see one thing and the other see something different. Basically, we want to give the stack a certain permission: we want to be able to read from it, so we let the DTLB cache that entry as is. But if the program comes around and tries to fetch an instruction from the stack, the ITLB doesn't have a cached entry, a fault is thrown, and the process is killed. PaX marks the pages it wants to protect using the user/supervisor bit of the page table entry. It installs its own page fault handler, and any time a page fault is thrown because a page only has supervisor permissions while we're running in user mode, that handler figures out whether this is a protected area like the stack, or whether it's just a normal user/supervisor fault from some other process on the system.
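Since the split-TLB trick is the heart of both PaX and OllyBone, here's a toy user-mode model of it — a deliberately simplified simulation, not real MMU code, with all names mine. The point it demonstrates: once a data access primes the DTLB, later reads proceed without faulting, while an instruction fetch from the protected page still faults every time because the ITLB is never primed for it.

```c
#include <stdint.h>

/* Toy model of the x86 split TLBs. One entry each is enough
 * to illustrate the desynchronization trick. */
typedef struct { uint32_t vpage; int valid; } tlb_entry;

static tlb_entry dtlb, itlb;
static uint32_t protected_page;   /* page marked supervisor-only */

/* Data access: on a miss, our fault handler services the fault,
 * primes the DTLB, and lets the read proceed. Returns 1 = access ok. */
int data_access(uint32_t vpage) {
    if (dtlb.valid && dtlb.vpage == vpage)
        return 1;                          /* DTLB hit, no fault */
    dtlb.vpage = vpage; dtlb.valid = 1;    /* fault serviced, cached */
    return 1;
}

/* Instruction fetch: the ITLB is never primed for the protected
 * page, so fetching from it always faults -- that's the
 * break-on-execute. Returns 0 on the "fault". */
int exec_access(uint32_t vpage) {
    if (itlb.valid && itlb.vpage == vpage)
        return 1;
    if (vpage == protected_page)
        return 0;                          /* execution detected */
    itlb.vpage = vpage; itlb.valid = 1;    /* ordinary page: prime ITLB */
    return 1;
}
```

This is exactly why the technique dies on Bochs and QEMU, as he mentions later: with a single unified TLB there is no way to let the data path through while still trapping the instruction fetch.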
So using that bit, we can mark a page and have it mean more than just user/supervisor — it means execute or no-execute. To make this PaX concept work for packed code, we just have to do a few things differently. We're not protecting the stack and the heap; we're looking at a section of memory that's defined by the PE loader. We find every page that belongs to that section, and we mark those pages — we flip that PTE bit. Then we install our own page fault handler, and this time, instead of PaX trying to protect the operating system's integrity by killing the process, all we do from the page fault handler is jump to the INT 1 handler — the single-step break handler. What that does is pass control from the kernel back to our ring-3 debugger. Essentially OllyDbg comes back and says, I just got a single-step break, and it stops right there. The way this is implemented is an OllyDbg plugin that handles figuring out where the section of memory is and how many pages it has, and then sends IOCTLs to a kernel module; the kernel module can then arbitrarily assign these user/supervisor bits to all those pages, and it also installs its own page fault handler to override the normal one and redirect the flow whenever we want. If anybody's interested in using IDA for this, you could probably do it with the same kernel module — all you need is an IDA plugin that can figure out what memory segment you want to target and send the IOCTLs for all of those pages. It wouldn't be hard to implement; I just haven't done it because I typically don't use IDA as a debugger. Now I'll go a little bit into how the flow actually works in the page fault handler.
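The page marking he describes is just bit manipulation on the page table entries. Here's a sketch of it on plain 32-bit (non-PAE) PTE values — the flag positions are from the Intel manuals, but the function names are mine, and the real OllyBone driver of course does this on live PTEs in the kernel, not on integers in user mode:

```c
#include <stdint.h>

/* x86 (non-PAE) page table entry flag bits, per the Intel manuals. */
#define PTE_PRESENT  (1u << 0)
#define PTE_RW       (1u << 1)
#define PTE_USER     (1u << 2)   /* user/supervisor bit OllyBone repurposes */

/* Mark a page supervisor-only: user-mode fetches and reads now fault. */
uint32_t pte_protect(uint32_t pte)   { return pte & ~PTE_USER; }

/* Restore normal user access (done briefly to service a data read
 * and let the DTLB cache the translation). */
uint32_t pte_unprotect(uint32_t pte) { return pte | PTE_USER; }

/* Does this PTE describe a page we've marked? */
int pte_is_protected(uint32_t pte) {
    return (pte & PTE_PRESENT) && !(pte & PTE_USER);
}
```

The plugin side then only needs to walk the section's address range page by page (4 KB at a time) and ask the driver, via an IOCTL per page or per range, to apply `pte_protect` to each one.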
So your packer is now unpacking itself, and it attempts to write to a page we've marked no-execute. This is the first access, so there's no cached virtual address translation. A page table walk happens, it comes back with the user/supervisor bit set, and that generates a page fault. The kernel vectors into the page fault handler, where we've taken over the flow, and we check whether that page fault belongs to us — whether it's a result of what we've done in OllyBone — or whether it's a real page fault that just needs to be passed on down. If it does belong to us, we decide whether this is a data access or an execution access, and that's pretty easy to do: all we have to do is take the faulting address, which is passed to our page fault handler, and see whether it equals the instruction pointer, which we can get from one of the trap frame structures. If it doesn't equal the instruction pointer, this was a data access — it wasn't trying to execute there, just trying to read or write. So we toggle that PTE bit back to user access, then do a read from the page; doing that caches the page table entry in the DTLB, so we don't have to do this again — the next time the packer reads, it doesn't go through our page fault handler, and we don't slow the system down terribly. After we've cached that, we toggle the page's bit back to supervisor mode, so that when a potential execution happens, the page fault will be hit again. When that happens — hopefully in the target section where we've guessed execution is going to occur — we get another page fault.
The CPU tries to do a page table walk for the instruction fetch, comes back and says, you've only got the supervisor bit set here and you're running as user, so it generates that page fault for us. Once again our page fault handler asks: is this ours, or does it belong to something else in the system? If it belongs to us, it asks whether this constitutes an execute access — which is when the faulting address equals the instruction pointer address. When that happens, we have to pop one extra argument off the stack, because the page fault handler is called with an error code argument that the INT 1 handler is not, and then we simply jump to the INT 1 handler and let OllyDbg take control of the program. This works great on standard hardware, and it works pretty seamlessly on VMware. I found that on Bochs and QEMU — I looked at their code and tried to make this work — it simply doesn't, because they haven't implemented split TLBs. Whoever was writing that part of the code in those projects said, let's save some time, we'll just put all of the translation lookaside buffers into one table. So that doesn't give us the ability to do this on those platforms, unfortunately. I think I filed a bug a while back with QEMU about that; I don't know if it's been fixed since then, so it might work at some point. And I have not tested this on Microsoft Virtual PC. I'd be interested to know if any of you use that — I don't — so if you do try it out and it works or doesn't work, I'd like to hear from you. Usage is pretty straightforward, and it's actually very fast. You just load your executable in OllyDbg.
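Put together, the decision inside the fault handler comes down to one comparison: the faulting address against the saved instruction pointer from the trap frame. A user-mode sketch of that dispatch logic — the enum and names are mine; the real version lives in the kernel module and acts on CR2 and the trap frame rather than plain arguments:

```c
#include <stdint.h>

typedef enum { FAULT_NOT_OURS, FAULT_DATA, FAULT_EXEC } fault_kind;

/* Classify a page fault the way the OllyBone handler does:
 *  - outside our protected range -> pass it down to the real handler
 *  - faulting address == EIP     -> an instruction fetch: this is the
 *    break-on-execute, so pop the error code and jump to the INT 1
 *    handler, handing control to the debugger
 *  - otherwise                   -> a data read/write: briefly restore
 *    user access, touch the page so the DTLB caches it, re-protect
 *    the page, and resume */
fault_kind classify_fault(uint32_t fault_addr, uint32_t eip,
                          uint32_t prot_base, uint32_t prot_size) {
    if (fault_addr < prot_base || fault_addr >= prot_base + prot_size)
        return FAULT_NOT_OURS;
    return (fault_addr == eip) ? FAULT_EXEC : FAULT_DATA;
}
```

The FAULT_EXEC case fires exactly once, at the OEP jump, which is what makes the whole scheme so much faster than break-on-access or tracing.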
You view the memory map, look for where your process is loaded, and figure out which of those sections is going to be the final code section once it's unpacked — which one will be running in the unpacked state. That's guesswork sometimes. Sometimes you'll know, because you're familiar with a particular packer: you don't have to know its algorithms, but you'll know that it always jumps to section one after it's done — it's pretty predictable. Sometimes you'll get it wrong and it'll never break; it'll run to completion. If you're analyzing malware, that could be bad if you're not running on an isolated, protected system. If you're not using this for malware, you probably won't have a problem. Basically there's a break-on-execute flag that's been added to the right-click menu in the memory map view. You toggle that and run the program, and hopefully, when it tries to execute that section after it's unpacked, it encounters the single-step break. Control goes back to OllyDbg, and it dumps you out basically at the OEP — exactly where you want to be, in one step. Theoretically, that's how it should work in most cases. The differences between packers mean it doesn't always work quite that cleanly, so I'm going to show you a video demo of about five different packers and using OllyBone on them. Let's go ahead and start. The first packer we're going to attack here is FSG, which is a pretty easy one to manually unpack — if you've ever walked through the code, there's not much to it, and if you look at the two or three different versions out there, you can spot pretty quickly the point at which it's going to jump to the OEP. But it's also a pretty common one that malware authors like to use, so I'll show you how long this takes.
So we've loaded it up and we land somewhere in memory, and we look at the memory map. We can see that the code section starts at 0x401000, and we've actually started in the stub section. So what we're going to do is toggle our break-on-execute flag for that code section, and then just hit the play button. And down at the bottom there you can see it's already hit our break-on-execute. So we re-run the code analysis, and we can now read the strings that are in the file. I'd like to thank Piotr Bania for providing these packed executables, which he uses to test his unpacking engine. He saved me a lot of time I would have spent finding a bunch of packers and packing something up with them. So that's what we should see every time something's unpacked, basically. This one's done — we could dump it right now, note the OEP, and we're good to go. All right, moving on to an older version of UPack. Once again we look at the memory map. We're well outside that initial code section; we're in the stub section. So we set our break-on-execute, hit play, and it's already done. We can analyze the code and read it fairly easily. So that's two fairly straightforward, easy ones. I would have put UPX in here as an example, because that's also a very common one, but for some reason UPX did not like the sample executable and wouldn't compress it. All right, so this is ASProtect. Is it ASProtect or ASPprotect? ASProtect sounds kind of funny — but it sounds a heck of a lot better than ASPack, that's all I know. We'll look at this one now; it takes a few more tricks. We can see that we've started off and — because we've already unpacked this a couple of times, we know this is actually our OEP — it starts off in the actual code section.
So we're not going to be able to simply set a break-on-execute on that section right now, because we're already executing in it and we wouldn't be able to execute anything. We're going to have to work around that. Fortunately, we can see here that this push, this call, and these returns are really all just a jump into another section. So we step through that, and now we've landed in another section. We go back to our memory map, set our break-on-execute on the code section, and hit play. And — blah, blah, blah — this is debugger detection built into ASProtect. It detected that we were running under OllyDbg and didn't like it. So we're going to have to re-strategize, because being able to break on the unpacked code doesn't mean the packer won't detect our debugger first. Fortunately, the debugger detection technique being used here is not all that sophisticated: we can use an OllyDbg plugin that hides the IsDebuggerPresent flag in the PEB, so we don't show up as being debugged. We go through the motions again: step through that first section, land in the second section, go to the memory map, set our break-on-execute, and hit play. And we land somewhere — at 0x4104-something. A little strange, because that isn't where we know the OEP is, and this code doesn't look like it's been unpacked. So what is it? Well, if we remove the somewhat messed-up analysis OllyDbg tried to do, we can read the instruction: it's a return. What ASProtect is doing is jumping into the code section and then returning — probably to pop that address off the stack and then do something else with it. So in order to continue execution, we have to remove the break-on-execute that we set. Now we can step, we return, and now we're in the heap.
So we're well out of our code section. All we have to do now is go back to the memory map, select that section, set a break-on-execute one more time, and run the program. And it breaks. We did hit a break-on-execute, and we are at the OEP, but the code doesn't look quite right. It is unpacked — it's just that in certain cases OllyDbg's analysis gets freaked out and can't quite get it together. So we are actually at the OEP and we are unpacked; the analysis just looks a little wonky here. All right, that one's done. Let's go on to PE-Pack. It does the same kind of trick at the beginning that ASProtect uses — it's not jumping outside of the code section; it starts right in the section that holds the OEP. So we're going to have to step through it and see what the deal is. We can see that only a few lines down, we hit an exception. What it's doing is using the exception handler to do some of the unpacking work — maybe to trip up emulators that don't properly handle exceptions. But now that we're in the exception routine, we're outside of the code section, and we can set our break-on-execute on it. When we do that, we land here — and that's a jump into another section, which is good news, because it means we're probably getting closer. We have to remove our break-on-execute in order to step, because right now we're frozen; it won't let us do anything. So we toggle that off and hit F7 to step into that section. Then we go back to the memory map, set our break-on-execute one more time, and run it. And now we land, once again, at the break-on-execute — at the OEP. We're unpacked. We're still in a state where OllyDbg's analysis doesn't help much, but toward the lower right of the screen you can see the call that creates the message box and the call to exit the process. So it's unpacked.
All right. The last one for example purposes here is tElock. It's also one that uses a few tricks. It starts outside the code section, so it seems like we'd be good to go if we just set a break-on-execute. So we go to our memory map, pick out the target code section, set it, and run the program. We've landed somewhere — what's going on here? Let's go down to the log and look: we've got a single-step event. What's happened is that tElock uses single-step exceptions as part of its unpacking operation. It actually uses a series of different exceptions, and some of them have to be single steps. So what we have to do here is let the program handle the exception — Shift+F9 or Ctrl+F9 — and just keep running until we're no longer looking at single-step events but have hit the break-on-execute. It runs through quite a few of these, so we keep hitting Shift+F9, running and passing the exceptions on. And now we've landed somewhere: you can see that a break-on-execute has been reached, and if we run our analysis, we can see that we're at the OEP. Unpacked again. Is that a movie, or is that live? It's a video of me doing it live — the speed of the unpacking was not edited; I just put a bunch of clips together. If you want to record something like that yourself, by the way, XVidCap works well. So, the problems we're going to face using this method to unpack malware: we still haven't solved the fact that our debugger can be detected. Any code that we're skipping over when we say "break on execute" has the potential to do anything it wants — it can detect the debugger, it can just exit, or it can potentially do other bad things. So you can't really use it as a 100% automated solution.
That's why I call it semi-automatic unpacking — you have to interact with it to some extent. So the anti-debugging stuff: you're still going to need to handle that in your debugging setup. There's no rule saying you have to use a quote-unquote debugger with that kernel module. You could write your own unpacking engine, use the kernel module as the basis for it, and maybe get around some of the anti-debugging tricks that way. Obviously the IsDebuggerPresent flag is a pretty trivial one. Some of the anti-debugging tricks I've seen are easy — they'll just look for a window class of OLLYDBG; that's trivial. Others might be a little harder, where they do timing of exception handlers, and you'd have to work around those. Another thing that's kind of frustrating is that some of the malware now, or some of the packers, will refuse to run on VMware. There was a particular sample I was struggling with, trying to figure out why I was never able to unpack it. I thought something I had done in OllyBone was wrong, but it turned out it was rejecting the fact that it was running on VMware — when I ran it on native hardware, it worked fine. I haven't seen a lot of those, but I understand the concept. And I'm pretty sure there are some packers that are a little less rigid in the way they unpack, and don't just put everything back in nice little sections for you. Some of them will unpack to, for instance, the heap and then run from there, relocating all the jumps and everything. That's a problem right now, because we haven't actually included a method to set a break-on-execute on the heap. It wouldn't be hard to do; I just haven't done it to date, because figuring out where the heap is, and what might potentially be shared code, is a little more time consuming.
Needless to say, it would be bad if, instead of a break-on-execute, it set those permissions on a piece of shared memory, and every process on the system suddenly came up with an INT 1 break at the same time. In terms of evasion — and I'll get into a little of how you could do it — it's totally possible; I'm not saying this is an end-all method of unpacking. For instance, what if they don't make a stub section? What if they do everything from the initial code section that execution is eventually going to jump to? That could happen; we've already seen packers that start out in the code section and jump elsewhere. We can always get a little finer grained: we can take this protection down page by page. Maybe they're only running in the last five pages of the code section — we can protect the pages up to that point. Again, slower work and implementation details, but it's possible. Also, the fact that OllyBone's kernel module is loaded can be detected; they could refuse to run, or they might even try to get clever and use that IOCTL interface to unset the break-on-execute after you've set it. The advantage you have here is that I'm giving you the source code, so you can change any of that. You don't have to name it ollybone.sys — you can name it anything, and you can change the IOCTL numbers. If that evasion ever comes around, that's an option, though I'd be surprised if somebody actually started doing it. One thing that I think might be possible — I haven't really tested it — is that if somebody plays around with VirtualProtect, which reads and writes the PTEs, I don't necessarily know that the supervisor bit stays intact. There's a chance that after this memory is touched by some other API call, we might want to continually maintain that bit — make sure it stays set, rather than toggling it once and hoping it stays that way. You can download it right now; it's GPL. There's still stuff to do on it.
I don't know that I'll necessarily implement these changes myself, but since the code is out there, don't complain — you can implement them yourself and share them back with the community. There are some ideas there, for instance break-on-execute for shared DLLs. At some point you might want to set a break-on-execute on kernel32.dll. I theorize that it's possible by using the copy-on-write aspect of those pages: write to the page, forcing it to be copied into another region of memory that's only seen by your process, and then set the break-on-execute there. I haven't tested that — it's a very rough thought — so maybe it won't work. That's all I have for you at the moment. Any questions? Your driver is actually modifying the page fault handler? Yes, the driver is modifying the page fault handler. It finds the location from the interrupt descriptor table and inserts its own. Some of the code in the kernel module is actually adapted from other people's projects — the hooking of that interrupt handler I borrowed in part from the Shadow Walker project, if you're familiar with that. It did a similar thing, where it took the PaX concept and used it to make pages executable but not readable. I'm not really a kernel programmer — I put this kernel module together, debugged it, and made it work, but I don't consider myself a kernel programmer, or really even a C programmer for that matter. So if you find bugs in there, that's probably why. Do you think it will work on Vista? I'm sorry? Do you think this will work on Vista? On Vista — I don't know. You can be the beta tester and find out. Thank you.