There has been substantial progress in the development of myrrh since the previous blog article. Development has focused primarily on the implementation of myrrh’s support for Remote Procedure Calls (RPC) which can be issued by external applications.
Some of myrrh’s new features are listed below:
- myrrh now contains both a JSON decoder and a JSON encoder, both of which are nearly fully compliant with the JSON standard. JSON support enables external applications to encode a request as a JSON object and send it over an established TCP/IP connection.
- A VNC server has been implemented, so it is no longer necessary to run myrrh on one’s local system in order to see the screen output
- There are many more remote procedure calls available to the client (thus enabling the client to exercise even more freedom over the emulated system)
- Breakpoints can now be grouped in combinations
- Full support for requesting register values from both before and after a breakpoint is hit
- The phasing out of myrrh’s internal debugging console has commenced, and an implementation of a similar console written entirely in Python has been added to the project’s design goals. Eventually myrrh will work as a “headless” emulation server.
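To make the JSON-over-TCP idea concrete, here is a minimal Python sketch of how an external application might encode a request and decode a reply. The wire format is not documented in this article, so the "Method" and "Parameters" field names, the newline framing and the helper names are all assumptions made for illustration. Only the "ReturnValues" key mirrors what the scripts further down read from myrrh's replies.

```python
import json
import socket


def encode_request(method, *params):
    """Serialize one RPC request as a newline-terminated JSON object.

    The "Method"/"Parameters" field names are hypothetical and are not
    taken from myrrh's actual protocol.
    """
    request = {"Method": method, "Parameters": list(params)}
    return (json.dumps(request) + "\n").encode("utf-8")


def decode_return_values(raw_reply):
    """Return the "ReturnValues" object of a JSON reply, or {} if absent."""
    reply = json.loads(raw_reply.decode("utf-8"))
    return reply.get("ReturnValues", {})


def call(host, port, method, *params):
    """Send one request over TCP and return the decoded return values."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(encode_request(method, *params))
        # Assumption: the server answers with one newline-terminated line.
        raw_reply = conn.makefile("rb").readline()
    return decode_return_values(raw_reply)
```

Keeping the encoding and decoding in separate functions means they can be exercised without a running server. Only call() touches the network.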
I will now focus on several Python scripts for interaction with myrrh that I have written. These examples focus on the dynamic or runtime analysis of binary code.
A script to generate control-flow graphs
I’ll start out with this assembly language program:
    global main
    main:
        mov cx, 5
    first:
        push cx
        mov cx, 10
    second:
        loop second
        pop cx
        loop first
        jmp main
    function:
        nop
        ret
I compile it and save it to program.bin. Now, consider the following Python script:
    import myrrh
    import json
    import matplotlib.pyplot as plt
    import networkx as nx

    def create_edge_data(previous):
        if previous == False:
            address = m.get_absolute_address()
        else:
            address = m.get_absolute_address_p()
        datastr = hex(address)
        disasm = json.loads(m.disassemble(address, 1))["ReturnValues"]["1"]
        datastr = datastr + " (" + disasm + ")"
        return datastr

    G = nx.Graph()

    # Create a class instance
    m = myrrh.myrrh()
    # Connect to the myrrh server
    m.connect("localhost", 5000)
    # Configure (do not load bioses and ROM Basic)
    print m.configure(m.EXEC_FLAG_NONE)
    print m.start()
    # Load the binary at absolute address 0
    print m.load_binary(0x0000, "program.bin")
    # Let CS:IP point to absolute address 0
    print m.set_register_value(m.R_CS, 0x0000)
    print m.set_register_value(m.R_EIP, 0x0000)
    # Set a breakpoint on every type of branch
    m.set_breakpoint_branch(m.BRANCH_FLAG_ALL)

    print "Collecting branch node data, please wait.."
    # Encounter a branch 1000 times
    for x in range(0, 1000):
        # Retrieve the current CS and EIP
        fromtotal = create_edge_data(False)
        # Run until a breakpoint is encountered
        m.run()
        # Breakpoint encountered; we want to know the CS and EIP values
        # of before the branch, hence CS_p(revious)() and EIP_(previous)()
        tototal = create_edge_data(True)
        print str(x) + "|" + fromtotal + " - " + tototal
        # Add it to the graph
        G.add_edge(fromtotal, tototal)
        # Now graph a line from the previous position to the current position
        # (ie. the position where the branch jumped to)
        fromtotal = tototal
        tototal = create_edge_data(False)
        G.add_edge(fromtotal, tototal)
    print "Done"

    # Call the exit function of myrrh to make it halt
    m.exit()

    # Draw the graph
    pos=nx.spring_layout(G, scale=3)
    # nodes
    nx.draw_networkx_nodes(G,pos,node_size=60)
    # edges
    nx.draw_networkx_edges(G,pos, width=1)
    # labels
    nx.draw_networkx_labels(G,pos,font_size=35,font_family='sans-serif')
    plt.axis('off')
    plt.savefig("myrrh_code_path.png") # save as png
    plt.show() # display
As you can see, the script above puts a breakpoint on every type of branch. A branch can be a JMP, CALL, RET and so forth: anything that alters (E)IP. Within the loop, which runs a thousand times, the script records the location of the code (CS:IP) at which execution is about to resume and then, once the breakpoint has been hit, the location of the instruction that branched. These two points are connected using add_edge. Next, an edge is added from the branching instruction to the current CS:IP (the location it branched to). This way we get a nice graph of the program’s code paths:
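The edge-recording step can also be tried without a running emulator. The sketch below uses only the standard library and stores directed edges in a dictionary of sets, whereas the script uses an undirected nx.Graph(). The function name and the event tuples are hypothetical: the events are written by hand for illustration and were not captured from myrrh.

```python
from collections import defaultdict


def build_flow_graph(branch_events):
    """Build a directed adjacency map from branch events.

    Each event is (start, branch_site, target): the address where execution
    resumed, the address of the branching instruction, and the address the
    branch jumped to.
    """
    graph = defaultdict(set)
    for start, branch_site, target in branch_events:
        graph[start].add(branch_site)   # straight-line run up to the branch
        graph[branch_site].add(target)  # the branch itself
    return graph


# Hand-written events for illustration only (assumed addresses).
events = [
    (0x0, 0x7, 0x7),  # run from the entry point to a loop that jumps to itself
    (0x7, 0x7, 0x7),  # the same loop iterating again
    (0x7, 0xA, 0x3),  # an outer loop jumping back to its start
    (0x3, 0x7, 0x7),  # the outer loop body reaching the inner loop again
    (0x7, 0xC, 0x0),  # a final jump back to the entry point
]
for node, successors in sorted(build_flow_graph(events).items()):
    print(hex(node), "->", sorted(hex(s) for s in successors))
```

Because the edges sit in sets, a branch taken many times still contributes a single edge, which matches how add_edge behaves when the same pair of nodes is added repeatedly.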
Let’s modify the assembly language program slightly:
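    global main
    main:
        mov cx, 5
    first:
        push cx
        mov cx, 10
    second:
        call function
        loop second
        pop cx
        loop first
        jmp main
    function:
        nop
        ret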
As you can see a call to ‘function’ was added in the inner loop of the program. When we run the script now, the image it produces is as follows:
Although this is a simple Python script that uses a very basic assembly language program, it does show the power of scripting the emulator: it enables one to gather data and produce interesting results with very few lines of code.
Runtime detection of self-modifying code
Now for a more advanced example. Consider the following program:
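    global main
    main:
        mov cx, 5
    first:
        push cx
        mov cx, 10
    second:
        mov byte [thenop], 0x90
        call function
        loop second
        pop cx
        loop first
        inc byte [abyte]
        jmp main
    abyte db 0
    function:
    thenop db 0
        ret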
And consider the following Python script:
    import myrrh
    import json

    def format_current_instruction(previous = False):
        if previous == False:
            addr = m.get_absolute_address()
        else:
            addr = m.get_absolute_address_p()
        disasm = m.disassemble(addr, 1)
        line = hex(addr) + " - " + json.loads(disasm)["ReturnValues"]["1"].lower()
        return line

    def load_program():
        m.reboot()
        # Load the binary at absolute address 0x1000
        m.load_binary(0x1000, "program.bin")
        # Let CS:IP point to absolute address 0x1000
        m.set_register_value(m.R_CS, 0x0000)
        m.set_register_value(m.R_EIP, 0x1000)

    def find_code_bytes():
        load_program()
        code_bytes = ()
        # Single-step 1000 times, recording the address of each instruction
        for x in range(1000):
            addr = m.get_absolute_address()
            code_bytes = code_bytes + (addr, )
            m.run(1)
        code_bytes = tuple(set(code_bytes))
        return code_bytes

    def has_self_modifying_code(code_bytes):
        load_program()
        # Put a write breakpoint on every executed address
        for code_byte in code_bytes:
            m.set_breakpoint_memory_write(code_byte, code_byte)
        self_modifying_instructions = ()
        found = False
        for x in range(1, 100):
            runreturn = json.loads(m.run(1000))
            if "ReturnValues" in runreturn:
                self_modifying_instructions = self_modifying_instructions + (format_current_instruction(True), )
                found = True
        self_modifying_instructions = tuple(set(self_modifying_instructions))
        return (found, self_modifying_instructions)

    def find_reads_writes():
        load_program()
        # Break on any read from or write to the entire emulated memory
        m.set_breakpoint_memory_read(0x00000, 0xFFFFF)
        m.set_breakpoint_memory_write(0x00000, 0xFFFFF)
        reads = []
        writes = []
        for x in range(0, 1000):
            X = json.loads(m.run())
            line = format_current_instruction(True)
            if X["ReturnValues"]["1"] == 1:
                reads.append(line)
            else:
                writes.append(line)
        reads = list(set(reads))
        writes = list(set(writes))
        m.delete_breakpoint(1)
        m.delete_breakpoint(2)
        return (reads, writes)

    m = myrrh.myrrh()
    m.connect("localhost", 5000)
    m.configure(0)
    m.start()

    reads, writes = find_reads_writes()
    print "Reads occurred from these addresses:"
    print
    for X in reads:
        print X
    print
    print "Writes occurred from these addresses:"
    print
    for X in writes:
        print X
    print

    code_bytes = find_code_bytes()
    contains, disasm = has_self_modifying_code(code_bytes)
    if contains == True:
        print "Contains self-modifying code:"
        print
        for line in disasm:
            print line
    else:
        print "Does not contain self-modifying code"

    m.exit()
I will now explain its functioning.
First, after a connection with the myrrh server has been established, the script gathers two lists: the instructions that read from memory and the instructions that wrote to memory. This is accomplished by putting both a memory read breakpoint and a memory write breakpoint on the entire emulated memory. In a loop, the program is resumed a thousand times, and each time a breakpoint is hit the script records whether a read or a write occurred. The lists are returned to the caller and their contents are printed to the screen.
Then a list of addresses containing code that is actually executed during the program’s lifetime is gathered. This is done by stepping (running a single instruction) a thousand times and recording CS:IP each time. The resulting list is purged of duplicates and returned to the caller.
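The classification step can be sketched in isolation with canned replies. The reply shape, with a "ReturnValues" object whose "1" entry equals 1 for a read, follows the script. Reading that entry as the number of the breakpoint that fired is my interpretation here, and the function name and sample replies are hypothetical. Unlike list(set(...)), the dict.fromkeys idiom used below keeps the entries in the order in which they were first seen.

```python
import json


def classify_hits(hits):
    """Split (reply_json, instruction) pairs into read and write lists.

    Follows the script's convention that ReturnValues["1"] == 1 marks a
    read and any other value marks a write.
    """
    reads, writes = [], []
    for reply_json, instruction in hits:
        hit_value = json.loads(reply_json)["ReturnValues"]["1"]
        if hit_value == 1:
            reads.append(instruction)
        else:
            writes.append(instruction)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(reads)), list(dict.fromkeys(writes))


# Canned replies for illustration; not captured from a myrrh session.
sample_hits = [
    ('{"ReturnValues": {"1": 1}}', "0x1011 - pop cx"),
    ('{"ReturnValues": {"1": 2}}', "0x1003 - push cx"),
    ('{"ReturnValues": {"1": 1}}', "0x1011 - pop cx"),
]
reads, writes = classify_hits(sample_hits)
print(reads)
print(writes)
```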
The function has_self_modifying_code() will then put a write breakpoint on all the code bytes. If the program should try to modify its own code, the breakpoint is triggered. Finally a list of instructions that cause the modification of code is returned to the caller.
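Stripped of the emulator calls, the detection boils down to checking whether a write lands on an address that was executed. The sketch below shows that check on plain Python data. The function name is hypothetical, and the addresses are sample values chosen to resemble this article's example program: they are assumptions for illustration, in particular the assumption that 0x101b is executed.

```python
def find_self_modifying(executed_addresses, write_events):
    """Return addresses of instructions that write to executed addresses.

    write_events is an iterable of (instruction_address, written_address).
    """
    executed = set(executed_addresses)
    return sorted({instr for instr, target in write_events if target in executed})


# Assumed sample data: addresses that were executed at least once ...
executed = [0x1003, 0x1007, 0x100C, 0x1011, 0x1014, 0x101B, 0x101C]
# ... and writes as (instruction address, address written to).
writes = [(0x1007, 0x101B), (0x1014, 0x101A)]

print([hex(a) for a in find_self_modifying(executed, writes)])  # prints ['0x1007']
```

The write from 0x1014 is ignored because its target, 0x101a, is not in the executed set.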
If I compile the assembly language program listed above to program.bin and run the Python script above, the following output is displayed:
    Reads occurred from these addresses:

    0x101c - ret
    0x1011 - pop cx

    Writes occurred from these addresses:

    0x100c - call 000c
    0x1014 - inc [101a]
    0x1003 - push cx
    0x1007 - mov [101b], 90

    Contains self-modifying code:

    0x1007 - mov [101b], 90
Note that inc byte [abyte] is not flagged as self-modifying code, since abyte is never executed.
Again, this Python script demonstrates the ease and conciseness with which dynamic analysis of binaries can be performed, a task that would likely be much more tedious using other methods or software.
P.S. I am aware that the code above does not detect modifications to code bytes other than the first byte of an instruction, but it would be easy to adapt for that purpose; the script serves only as a proof of concept.
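One possible way to cover every byte of an instruction is to expand each executed instruction into the full range of addresses it occupies before checking the writes. The sketch below assumes the instruction lengths are known, for example from a disassembler. How myrrh would expose them is not covered here, so the helper names and the sample data are hypothetical. Since set_breakpoint_memory_write in the script already takes a start and an end address, such a range could presumably be passed to it directly.

```python
def instruction_bytes(instructions):
    """Expand (start_address, length) pairs into the set of covered bytes."""
    covered = set()
    for start, length in instructions:
        covered.update(range(start, start + length))
    return covered


def writes_into_code(instructions, write_events):
    """Return instruction addresses whose write lands on any code byte."""
    covered = instruction_bytes(instructions)
    return sorted({instr for instr, target in write_events if target in covered})


# Hypothetical data: a 5-byte instruction at 0x1007 and a 1-byte one at 0x101B.
instructions = [(0x1007, 5), (0x101B, 1)]
# The first write hits the last byte of the 5-byte instruction;
# the second hits a byte that belongs to no instruction.
writes = [(0x2000, 0x100B), (0x2004, 0x101A)]

print([hex(a) for a in writes_into_code(instructions, writes)])
```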