Reverse Engineering

Reverse engineering in a CTF is typically the process of taking a compiled program (machine code or bytecode) and converting it back into a more human-readable format.

Very often the goal of a reverse engineering challenge is to understand the functionality of a given program such that you can identify deeper issues.

1. Assembly / Machine Code

1.1 From Source to Compilation

Godbolt shows the differences in machine code generated by various compilers.

1.2 x86-64

x86-64 (also called amd64 or x64) is a 64-bit Complex Instruction Set Computing (CISC) architecture. It extends Intel's 32-bit x86 architecture, widening the registers from 32 to 64 bits. CISC means that a single instruction can do several different things at once, such as memory accesses and register reads. It is also a variable-length instruction set, which means different instructions can be different sizes, ranging from 1 to 15 bytes long. Finally, x86-64 allows multi-sized register access, which means you can access parts of a register that are different sizes.

1.2.1 x86-64 Registers

x86-64 registers behave similarly to registers on other architectures. A key feature is multi-sized access: the lower 32 bits of the register RAX can be accessed as EAX, the lower 16 bits as AX, and the lowest 8 bits as AL. This lets the compiler make optimizations that speed up program execution.
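The sub-register relationship can be sketched in C by modelling RAX as a 64-bit integer and masking off the lower bits. This is only an illustration: the function names low32, low16, and low8 are hypothetical and stand in for reading EAX, AX, and AL.

```c
#include <stdint.h>

/* Hypothetical helpers: treat `rax` as the full 64-bit register value
   and return the part that each smaller register name refers to. */
uint32_t low32(uint64_t rax) { return (uint32_t)(rax & 0xFFFFFFFFu); } /* EAX */
uint16_t low16(uint64_t rax) { return (uint16_t)(rax & 0xFFFFu); }     /* AX  */
uint8_t  low8(uint64_t rax)  { return (uint8_t)(rax & 0xFFu); }        /* AL  */
```

For a value of 0x1122334455667788 in RAX, these helpers return 0x55667788, 0x7788, and 0x88 respectively.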

(Figure: the RAX register and its EAX, AX, and AL sub-registers.)

x86-64 has plenty of registers to use including rax, rbx, rcx, rdx, rdi, rsi, rsp, rip, r8-r15, and more! But some registers serve special purposes.

The special registers include:

- RIP: the instruction pointer
- RSP: the stack pointer
- RBP: the base pointer

1.2.2 Instructions

An instruction represents a single operation for the CPU to perform.

There are different types of instructions, including:

Data movement: mov rax, [rsp - 0x40]
Arithmetic: add rbx, rcx
Control-flow: jne 0x8000400

Because x86-64 is a CISC architecture, a single instruction can be quite complex. For example, repne scasb repeats up to RCX times over memory at RDI looking for the byte held in AL (a NULL byte, 0x00, when AL is zero), decrementing RCX for each byte. That is essentially strlen() in a single instruction!

It is important to remember that an instruction is really just bytes in memory. This idea will become useful with Return Oriented Programming (ROP).
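As a rough sketch of what that single instruction does, the loop below scans memory one byte at a time until it finds a zero byte or runs out of budget. The function name scan_for_nul and the parameter limit (standing in for the count register) are assumptions made for illustration, and the exact counter semantics of the real instruction differ slightly.

```c
#include <stddef.h>

/* Rough C counterpart of a byte scan for 0x00: walk memory starting
   at `p` until a zero byte is found or `limit` bytes were examined. */
size_t scan_for_nul(const char *p, size_t limit) {
    size_t scanned = 0;
    while (scanned < limit && p[scanned] != '\0') {
        scanned++;
    }
    return scanned; /* count of non-zero bytes seen before stopping */
}
```

With a large enough limit, the result matches what strlen() would return for the same string.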

Note

Instructions, numbers, strings, everything! It is all stored as bytes, and those bytes are conventionally written in hex.
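To see the note in practice, the sketch below prints arbitrary bytes as hex digits. The helper name to_hex is hypothetical, and the caller is assumed to provide an output buffer of at least 2 * len + 1 bytes.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes each byte of `data` as two lowercase hex digits into `out`.
   Assumption: `out` has room for 2 * len + 1 bytes. */
void to_hex(const unsigned char *data, size_t len, char *out) {
    for (size_t i = 0; i < len; i++) {
        /* Each call writes two digits plus a terminating NUL. */
        snprintf(out + 2 * i, 3, "%02x", (unsigned)data[i]);
    }
    out[2 * len] = '\0';
}
```

Passing the two-character string "Hi" yields "4869", since the ASCII codes for H and i are 0x48 and 0x69. The same view applies to code: the x86-64 nop instruction, for example, is the single byte 0x90.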

1.2.3 Execution

What should the CPU execute? This is determined by the RIP register where IP means instruction pointer. Execution follows the pattern: fetch the instruction at the address in RIP, decode it, run it.
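The fetch, decode, run pattern can be sketched with a toy interpreter. This is not x86-64: the one-byte opcodes OP_HALT, OP_INC, and OP_DOUBLE are invented for the example, and the variable ip plays the role of RIP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-byte opcodes for a toy machine (not real x86-64). */
enum { OP_HALT = 0x00, OP_INC = 0x01, OP_DOUBLE = 0x02 };

/* Runs `code` until OP_HALT, an unknown opcode, or the end of the
   buffer, and returns the final accumulator value. */
uint64_t run(const uint8_t *code, size_t len) {
    uint64_t acc = 0;
    size_t ip = 0;                 /* instruction pointer */
    while (ip < len) {
        uint8_t opcode = code[ip]; /* fetch the byte at ip */
        ip++;                      /* point at the next instruction */
        switch (opcode) {          /* decode and run */
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        return acc; /* unknown opcode: stop */
        }
    }
    return acc;
}
```

Running the byte sequence 01 01 02 00 increments twice, doubles, and halts with the accumulator at 4.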

1.2.4 Examples

mov rax, 0xdeadbeef

Here the operation mov is moving the “immediate” 0xdeadbeef into the register RAX.

mov rax, [0xdeadbeef + rbx * 4]

Here the operation mov is moving the data at the address of [0xdeadbeef + RBX*4] into the register RAX. When brackets are used, you can think of the program as getting the content from that effective address.
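The arithmetic inside the brackets is plain integer math. A minimal sketch in C, with the hypothetical function name effective_address, shows the address the CPU would read from:

```c
#include <stdint.h>

/* Computes base + index * scale, the same arithmetic the CPU performs
   for an operand such as [0xdeadbeef + rbx * 4]. */
uint64_t effective_address(uint64_t base, uint64_t index, uint64_t scale) {
    return base + index * scale;
}
```

If RBX holds 2, the operand [0xdeadbeef + rbx * 4] refers to address 0xdeadbef7.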

1.2.5 Example Execution

nop

1.2.6 Control Flow

How can we express conditionals in x86-64? We use conditional jumps such as:

jnz <address>
je <address>
jge <address>
jle <address>
etc.

They jump if their condition is true, and just go to the next instruction otherwise. These conditionals check the flags register (EFLAGS, or RFLAGS in 64-bit mode), a special register whose bits are set by certain instructions. For example, add rax, rbx sets the overflow flag (OF) if the signed sum does not fit in a 64-bit register and wraps around. You can jump based on that with a jo instruction. The most important instruction to remember is cmp, which subtracts its second operand from its first, discards the result, and sets the flags so that a following conditional jump can act on them.
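The relationship between cmp, the flags, and a conditional jump can be modelled in C. The sketch below is an illustration only: struct flags, cmp64, and the *_taken helpers are hypothetical names, and only the zero (ZF), sign (SF), and overflow (OF) flags are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of three flags that `cmp a, b` sets. */
struct flags { bool zf, sf, of; };

struct flags cmp64(int64_t a, int64_t b) {
    /* Subtract as unsigned so that wraparound is well defined in C. */
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t diff = ua - ub;
    struct flags f;
    f.zf = (diff == 0);          /* result was zero */
    f.sf = (diff >> 63) != 0;    /* top bit of the result */
    /* Signed overflow: the operands differ in sign and the result's
       sign differs from the first operand's sign. */
    f.of = (((ua ^ ub) & (ua ^ diff)) >> 63) != 0;
    return f;
}

/* Conditions tested by the signed conditional jumps. */
bool je_taken(struct flags f)  { return f.zf; }
bool jge_taken(struct flags f) { return f.sf == f.of; }
bool jle_taken(struct flags f) { return f.zf || (f.sf != f.of); }
```

For instance, jle_taken(cmp64(1, 2)) is true, which corresponds to a cmp followed by jle jumping when the first operand is less than or equal to the second.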

1.2.7 Addresses

Memory acts similarly to a big array where the indices of this “array” are memory addresses. Remember from earlier:

mov rax, [0xdeadbeef]

The square brackets mean “get the data at this address”. This is roughly analogous to dereferencing a pointer in C/C++: rax = *(uint64_t *)0xdeadbeef;
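A minimal sketch of the same idea in C, using the address of a real variable rather than a hard-coded address (reading from an arbitrary address such as 0xdeadbeef would most likely crash). The function name load64 is hypothetical.

```c
#include <stdint.h>

/* Reads the 8 bytes stored at `address`, much like
   `mov rax, [address]` does in assembly. */
uint64_t load64(const uint64_t *address) {
    return *address; /* dereference: fetch the data at that address */
}
```

Calling load64(&value) on a variable holding 0xdeadbeef returns 0xdeadbeef.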

2. The C Programming Language

2.1 History

C was written by Dennis Ritchie in the 1970s while he was working at Bell Labs. It was first used to reimplement the Unix operating system, which had been written purely in assembly language. At first, the Unix developers were considering using a language called “B”, but because B wasn’t optimized for the target computer, the C language was created.

C was designed to be close to assembly and is still widely used in lower-level programming where speed and control are needed (operating systems, embedded systems). C was also very influential on other programming languages used today. Notable languages include C++, Objective-C, Golang, Java, JavaScript, PHP, Python, and Rust.

2.2 Hello World

nop

2.3 Today

Today C is widely used either as a low-level programming language or as the base language that other programming languages are implemented in.

While it can be difficult to see, the C language compiles down directly into machine code. The compiler is programmed to process the provided C code and emit assembly that is targeted at whatever operating system and architecture the compiler is set to use.

Some common compilers include:

gcc
clang

With regard to CTFs, many reverse engineering and exploitation challenges are written in C because the language compiles down directly to assembly and has little to no built-in safeguards. This means developers must manage memory and guard against errors such as out-of-bounds writes themselves. Of course, this can lead to mistakes, which can sometimes lead to security issues.

Note

Other higher-level languages like Python manage memory and garbage collection for you. Google's Go was inspired by C but adds functionality like garbage collection and memory safety.

There are some examples of famously vulnerable functions in C which are still available and can still result in vulnerabilities:

gets - Can result in buffer overflows
strcpy - Can result in buffer overflows
strcat - Can result in buffer overflows
strcmp - Can result in timing attacks
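As a hedged illustration of how the overflow risk in strcpy can be avoided, the sketch below copies a string with an explicit size limit using the standard snprintf function. The helper name copy_bounded is hypothetical, and this is one possible approach rather than the only safe alternative.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies `src` into `dst` without ever writing
   more than `dst_size` bytes. Returns 1 if the whole string fit and
   0 if it had to be truncated. */
int copy_bounded(char *dst, size_t dst_size, const char *src) {
    if (dst_size == 0) {
        return 0; /* nowhere to write, not even a terminator */
    }
    /* snprintf always NUL-terminates when dst_size is greater than 0
       and returns the length the full string would have needed. */
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

Unlike strcpy, the caller can detect truncation from the return value instead of silently overwriting adjacent memory.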

2.4 Types

nop

2.5 Pointers

nop

2.6 Arrays

nop

How do arrays work?

Arrays are a clever combination of multiplication, pointers, and programming.

Because the computer knows the data type used for every element in the array, the computer needs to simply multiply the size of the data type by the index you are looking for and then add this value to the address of the beginning of the array.

For example, if the base address of an array is 1000 and each integer takes 8 bytes, then with 8 integers stored right next to each other, the integer at index 4 is located at:

1000 + (4 * 8) = 1032
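The same calculation can be sketched in C. The function names element_address and same_as_indexing are hypothetical, and the comparison through uintptr_t assumes a conventional flat address space, as found on common desktop platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Address of element `index` = base address + index * element size. */
uintptr_t element_address(uintptr_t base, size_t index, size_t elem_size) {
    return base + index * elem_size;
}

/* Checks that the compiler's own indexing, &arr[index], lands on the
   address the formula predicts. Returns 1 when they agree. */
int same_as_indexing(const int64_t *arr, size_t index) {
    return (uintptr_t)&arr[index] ==
           element_address((uintptr_t)arr, index, sizeof arr[0]);
}
```

Here element_address(1000, 4, 8) gives 1032, matching the arithmetic above.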

2.7 Memory Management

nop

3. Disassemblers

A disassembler is a tool which translates the machine code of a compiled program into human-readable assembly.

List of Disassemblers

IDA
Binary Ninja
GNU Debugger (GDB)
radare2
Hopper

IDA

The Interactive Disassembler (IDA) is the industry standard for binary disassembly. IDA is capable of disassembling “virtually any popular file format”. This makes it very useful to security researchers and CTF players who often need to analyze obscure files without knowing what they are or where they came from. IDA also features the industry-leading Hex-Rays decompiler, which can convert assembly code back into a pseudocode-like format.

IDA also has a plugin interface which has been used to create some successful plugins that can make reverse engineering easier.

Binary Ninja

Binary Ninja is an up-and-coming disassembler that attempts to bring a new, more programmatic approach to reverse engineering. Binary Ninja brings an improved plugin API and modern features to reverse engineering. While it is neither as popular nor as old as IDA, Binary Ninja (often called binja) is quickly gaining ground and has a small community of dedicated users and followers.

Binja also has some community contributed plugins which are collected here: https://github.com/Vector35/community-plugins

gdb

The GNU Debugger is a free and open source debugger which also disassembles programs. It is capable as a disassembler, but CTF players use it most notably for its debugging and dynamic analysis capabilities.

gdb is often used in tandem with enhancement scripts like peda, pwndbg, and GEF.

4. Decompilers

Decompilers do the impossible and reverse compiled code back into pseudocode or source code.

IDA offers Hex-Rays, which translates machine code into higher-level pseudocode.
