Breaking Cerber strings obfuscation with Python and radare2

Written by aaSSfxxx

After almost two years of inactivity, I finally motivated myself to write a new article on this blog! In this article, I'll show how to use radare2, and especially its Python bindings, to write a little tool able to decode a large part of the strings encrypted by the Cerber ransomware. So, let's begin with the fun stuff! :D

Overview

You'll obviously need to have radare2 installed, along with its Python bindings. Also keep in mind that I'm using a mixture of the radare2 internal API and radare2 commands (to work around missing features and bugs in the bindings, mostly related to the nonexistent support for unsigned char* parameters from Python), so any code relying on this internal API can break at any time. At the time of writing, I haven't found any Python 3 bindings, so I'm using Python 2 for this article.

After analyzing the binary and locating the string obfuscation function, we find that Cerber uses this function for almost every string in the binary, which makes the analysis tedious. To get around this, I wanted to write a little tool that finds every place where the function is called, walks back through the calling code to find the arguments pushed on the stack, and decodes the strings itself. Obviously, finding cross-references and decoding instructions by hand is a daunting task, so we'll use the radare2 engine to perform it "easily".

Loading the binary and finding cross-references

Loading radare2 inside our Python script is as easy as importing the RCore class from r2.r_core and creating a new instance of it. After this, we just need to load the file and ask the RBin plugin to map the sections inside radare2. This can be achieved simply by doing:

from r2.r_core import RCore

core = RCore()
core.file_open("your_malware_here.exe", 0, 0)
# set base address manually, I didn't find how to do this automatically yet
core.bin_load(None, 0x400000)
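
As a side note on the hard-coded base address: for a PE file, the preferred base is stored in the optional header, so it can be read directly from the file. The sketch below does not use radare2 at all; it is a minimal parser written for illustration, and the name read_image_base is hypothetical. It assumes a well-formed PE header and does not account for relocation or ASLR.

```python
import struct


def read_image_base(data):
    """Return the preferred ImageBase stored in a PE file's optional header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew (offset of the "PE\0\0" signature) lives at 0x3c in the DOS header
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # The optional header follows the 4-byte signature and the 20-byte COFF header
    opt = pe_offset + 24
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:  # PE32: 32-bit ImageBase at offset 28
        return struct.unpack_from("<I", data, opt + 28)[0]
    if magic == 0x20B:  # PE32+: 64-bit ImageBase at offset 24
        return struct.unpack_from("<Q", data, opt + 24)[0]
    raise ValueError("unknown optional header magic 0x%x" % magic)
```

The value returned for the first few kilobytes of the sample could then be passed to core.bin_load(None, base) instead of the literal 0x400000.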

Then, we want to perform analysis to identify the called functions and, more importantly, run a cross-reference analysis to spot all calls to our decryption function. To achieve this, we'll just send radare2 the commands "af" and "aar" (which stand for "analyze functions" and "analyze references"), with:

core.cmd0("af")
core.cmd0("aar")

Finally, we get all the xrefs to our function through core.anal.xrefs_get(address), so we can iterate over them and get the address of each calling instruction. To sum up, we can get the cross-references with code like this:

from r2.r_core import RCore

PROGRAM="sample.exe"
ADDRESS=0xdeadbeef

def process_xref(core, addr):
    # do something awesome :D
    pass

core = RCore()
core.file_open(PROGRAM, 0, 0)
core.bin_load(None, 0x400000)  # set base address manually
core.cmd0("af")
core.cmd0("aar")

# xrefs_get lives on the RAnal object, as in the full script further down
for xref in core.anal.xrefs_get(ADDRESS):
    process_xref(core, xref.addr)
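
For readers who cannot get the r_core bindings working, roughly the same step can be sketched with r2pipe, which drives radare2 purely through commands. This is an alternative to the approach used in this article, not what the article's tool does. The helper names find_callers and open_and_find_callers are hypothetical, and the sketch assumes that the "axtj" command returns a JSON list whose entries carry the referencing address in a "from" field, which may differ between radare2 versions.

```python
def find_callers(r2, target):
    """Return the addresses of every instruction referencing `target`.

    `r2` is any object exposing an r2pipe-style cmd()/cmdj() interface.
    """
    # Same analysis commands as in the article
    for command in ("af", "aar"):
        r2.cmd(command)
    refs = r2.cmdj("axtj @ 0x%x" % target) or []
    # Assumption: each entry stores the referencing address under "from"
    return [ref["from"] for ref in refs]


def open_and_find_callers(path, target):
    # Imported lazily so that find_callers() stays usable without r2pipe
    import r2pipe
    r2 = r2pipe.open(path)
    try:
        return find_callers(r2, target)
    finally:
        r2.quit()
```

Keeping the session object as a parameter makes the lookup easy to exercise with a stub, without having radare2 installed.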

More craziness: walk through the calling code

Once we have located the place where the function is called, we want to recover the arguments pushed on the stack, so we can find the string to be decrypted, its length and the decryption key, and then call PyCrypto's RC4 decryption routine (yeah, string obfuscation in Cerber is just plain lame RC4). To achieve this, we need to parse the code backwards and, unluckily for us, the bindings don't expose any method to do this easily. So I chose to implement the dumbest method, which consists of going one byte backwards, trying to decode the instruction, and looping until we encounter a "push" opcode.

So here is the method performing this and returning a RAnalOp object (which saves us the burden of parsing assembly code):

# Returns a RAnalOp
def disassemble_instr(inst, addr):
    return inst.op_anal(addr)


# Hack because radare2 doesn't expose API to find prev instruction in bindings
def get_prev_instr(inst, addr):
    return addr - 1


def find_prev_push(inst, addr):
    cur = get_prev_instr(inst, addr)
    found = False
    instr = None
    while not found:
        instr = disassemble_instr(inst, cur)
        found = (instr.type == 13)
        if not found:
            cur = get_prev_instr(inst, cur)
    return (cur, instr)

We can see in the code above that "push" is an instruction of type 13 for the radare2 analyzer. We loop until we find a push instruction, and return the instruction decoded by the analyzer along with its address, which allows us to keep walking the code backwards.
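
To make the backward walk concrete without radare2, here is a minimal sketch of the same idea on a raw byte buffer. The tiny decoder only knows the two x86 "push immediate" encodings (0x68 and 0x6a) and stands in for the analyzer; the names decode_push_imm and find_prev_push_imm and the sample bytes are made up for illustration.

```python
import struct


def decode_push_imm(code, off):
    """Decode `push imm32` (0x68) or `push imm8` (0x6a) at `off`, else None."""
    if code[off] == 0x68 and off + 5 <= len(code):
        return struct.unpack_from("<I", code, off + 1)[0]
    if code[off] == 0x6A and off + 2 <= len(code):
        # The CPU sign-extends imm8; that is ignored in this sketch
        return code[off + 1]
    return None


def find_prev_push_imm(code, off):
    """Step back one byte at a time from `off` until a push decodes."""
    cur = off - 1
    while cur >= 0:
        value = decode_push_imm(code, cur)
        if value is not None:
            return cur, value
        cur -= 1
    raise LookupError("no push found before offset %d" % off)


# push 0x11223344 ; push 0x10 ; push 0x403000 ; call rel32
CODE = bytearray(b"\x68\x44\x33\x22\x11"
                 b"\x6a\x10"
                 b"\x68\x00\x30\x40\x00"
                 b"\xe8\x00\x00\x00\x00")
CALL_OFFSET = 12

# The push closest to the call is found first
off, str_addr = find_prev_push_imm(CODE, CALL_OFFSET)
off, str_len = find_prev_push_imm(CODE, off)
off, key = find_prev_push_imm(CODE, off)
```

Because the scan tests every byte, an immediate that happens to contain 0x68 or 0x6a would be decoded as a push too, which is the kind of false positive this byte-by-byte approach is prone to.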

So, to get our call parameters, we need to call our new "find_prev_push" function three times, which gives us the three "push" instructions, and we'll be able to fetch their values without hassle. We can now decode our strings with the code below:

from r2 import r_core
from Crypto.Cipher.ARC4 import ARC4Cipher
from struct import pack


# Returns a RAnalOp
def disassemble_instr(inst, addr):
    return inst.op_anal(addr)


# Hack because radare2 doesn't expose API to find prev instruction in bindings
def get_prev_instr(inst, addr):
    return addr - 1


def find_prev_push(inst, addr):
    cur = get_prev_instr(inst, addr)
    found = False
    instr = None
    while not found:
        instr = disassemble_instr(inst, cur)
        found = (instr.type == 13)
        if not found:
            cur = get_prev_instr(inst, cur)
    return (cur, instr)


def process_xref(inst, addr):
    old, push = find_prev_push(inst, addr)
    str_addr = push.val
    old, push = find_prev_push(inst, old)
    str_len = push.val
    old, push = find_prev_push(inst, old)
    key = push.val

    if str_addr >= 0xffffffff or str_len >= 0xffffffff or key >= 0xffffffff:
        print("Manual decoding required at 0x%x" % addr)
        return
    if str_len > 1000:
        print("Strange strlen found %d at 0x%x" % (str_len, addr))
        return
    # Issue a command to dump the encrypted bytes as hex, since the bindings
    # don't let us read them programmatically
    ciph = inst.cmd_str("p8 %d @0x%x" % (str_len, str_addr)).rstrip()
    ciph = ciph.decode("hex")
    deciph = ARC4Cipher(pack("<I", key)).decrypt(ciph)
    print("%08x\t| %s" % (addr, deciph))

inst = r_core.RCore()
inst.file_open("macbook_tutorial.unpacked.exe", 0, 0)
f = inst.bin_load(None, 0x400000)

# Launch function analysis
inst.cmd0("af")
inst.cmd0("aar")

# Gets our xrefs
anal = inst.anal
xrefs = anal.xrefs_get(0x408545)  # decryption func
print(len(xrefs))
for xref in xrefs:
    process_xref(inst, xref.addr)
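
The decoding step itself can be tried in isolation. The sketch below is a pure-Python RC4, used here instead of PyCrypto so that it has no dependency, together with a hypothetical decode_string helper that mirrors the script: the hex dump printed by "p8" is converted back to bytes, and the 32-bit key is packed little-endian before being handed to RC4. It uses binascii.unhexlify rather than .decode("hex") so that it is not tied to Python 2.

```python
import binascii
import struct


def rc4(key, data):
    """Plain RC4: key scheduling followed by keystream XOR."""
    key = bytearray(key)
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) & 0xFF
        state[i], state[j] = state[j], state[i]
    out = bytearray()
    i = j = 0
    for byte in bytearray(data):
        i = (i + 1) & 0xFF
        j = (j + state[i]) & 0xFF
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) & 0xFF])
    return bytes(out)


def decode_string(hex_dump, key):
    """Decrypt the hex output of `p8` with a 32-bit key packed little-endian."""
    ciph = binascii.unhexlify(hex_dump.strip())
    return rc4(struct.pack("<I", key), ciph)
```

Since RC4 is symmetric, the same rc4 function can be used to build test inputs: encrypting a known string with a chosen key and feeding the hex back to decode_string should return the original bytes.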

Limitations / Conclusion

As you may have seen in the code above, I added some checks because of weird behaviour that showed up on real samples: sometimes the interesting values are stored in registers and the register is pushed instead of an immediate, which leads to inconsistent values and decoding failures. Also, because of my dumb approach to finding the previous instruction, some junk bytes are decoded as a push, which produces invalid results as well. Learning how to use the radare2 opcode emulation engine (ESIL?) should solve most of the issues faced here.

Still, we can programmatically use radare2's power to do interesting things (such as binary unpacking, shellcode recognition, or maybe obfuscation cleanup!) and automate boring tasks.