Introduction to file structure

File Type Introduction

A file format (or file type) is a specific way of encoding information for storage on a computer, and it identifies the kind of data stored in a file. For example, some formats store pictures, some store programs, and some store text. Each kind of information can be stored in one or more file formats. A file format usually has one or more file name extensions that identify it, although a file may also have no extension at all. Extensions help applications recognize the file format.

A hard drive, like any other computer storage, can only hold 0s and 1s, so the computer needs a defined method for converting information into bits. Different kinds of information use different storage formats.

Some file formats are designed to store one particular kind of data. For example, among image formats, JPEG only stores static images, while GIF can store both static images and simple animations; the QuickTime format can store several different media types. Among text formats, txt files generally store only plain, unformatted ASCII or Unicode text; HTML files can store formatted text; PDF can store richly formatted documents that combine text and images.

The same file, processed by different programs, may produce completely different results. For example, when you open a Word file in Microsoft Word you see the text, but if a music player plays the same bytes as raw audio, the result is noise. A file format that is meaningful to one program may look like useless digital garbage to another.
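As a small illustration of this point, the sketch below reads the same eight bytes in two ways. The byte string and the choice of 16-bit little-endian samples are assumptions made for the example, not something a particular player is known to do.

```python
import struct

# Eight bytes that happen to be readable ASCII text (hypothetical content).
data = b"Word doc"

# Interpretation 1: a text viewer decodes the bytes as characters.
as_text = data.decode("ascii")

# Interpretation 2: an audio player might treat the same bytes as
# four signed 16-bit little-endian samples.
as_samples = struct.unpack("<4h", data)

print(as_text)
print(as_samples)
```

Neither reading is wrong at the byte level; only the format tells a program which one is intended.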

Identifying file formats by extension was popularized by Digital Research's CP/M operating system and was later adopted by DOS and Windows. The extension is the sequence of letters after the last dot (.) in the file name. For example, HTML files are identified by the .htm or .html extension, and GIF image files by the .gif extension.

So how does a machine actually recognize different file formats? If we rename a file on the desktop from .txt to .jpg, the document now looks like a picture, but we all know that double-clicking it will not display an image. How does the computer decide what the file really is? In fact, each file format has its own file format flag (a signature, often called a magic number). This flag sits at the beginning of the file and determines the file's type. Once the type is known, the rest of the content is parsed according to that format. We can see all of this by opening the file in a hex editor.

Common file format flags:

PE (exe, dll), file header: 4D5A

JPEG (jpg), file header: FFD8FF

PNG (png), file header: 89504E47

GIF (gif), file header: 47494638

TIFF (tif), file header: 49492A00

Windows Bitmap (bmp), file header: 424D

CAD (dwg), file header: 41433130

Adobe Photoshop (psd), file header: 38425053

Rich Text Format (rtf), file header: 7B5C727466

XML (xml), file header: 3C3F786D6C

HTML (html), file header: 68746D6C3E

Email [thorough only] (eml), file header: 44656C69766572792D646174653A

Outlook Express (dbx), file header: CFAD12FEC5FD746F

Outlook (pst), file header: 2142444E

MS Word/Excel (xls or doc), file header: D0CF11E0

MS Access (mdb), file header: 5374616E64617264204A

WordPerfect (wpd), file header: FF575043

Postscript (eps or ps), file header: 252150532D41646F6265

Adobe Acrobat (pdf), file header: 255044462D312E

Quicken (qdf), file header: AC9EBD8F

Windows Password (pwl), file header: E3828596

ZIP Archive (zip), file header: 504B0304

RAR Archive (rar), file header: 52617221

Wave (wav), file header: 57415645 ("WAVE", found at offset 8; the file itself starts with "RIFF", 52494646)

AVI (avi), file header: 41564920 ("AVI ", found at offset 8; the file itself starts with "RIFF", 52494646)

Real Audio (ram), file header: 2E7261FD

Real Media (rm), file header: 2E524D46

MPEG (mpg), file header: 000001BA

MPEG (mpg), file header: 000001B3

Quicktime (mov), file header: 6D6F6F76

Windows Media (asf), file header: 3026B2758E66CF11

MIDI (mid), file header: 4D546864
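The check described above can be sketched in a few lines. This is a minimal illustration, not how any particular operating system does it: it uses a handful of the signatures listed above, and the names SIGNATURES, identify and identify_file are made up for the example.

```python
# A few signatures taken from the list above, as (magic bytes, label).
# Only formats whose signature sits at offset 0 are included.
SIGNATURES = [
    (bytes.fromhex("4D5A"), "PE (exe, dll)"),
    (bytes.fromhex("FFD8FF"), "JPEG"),
    (bytes.fromhex("89504E47"), "PNG"),
    (bytes.fromhex("47494638"), "GIF"),
    (bytes.fromhex("504B0304"), "ZIP"),
    (bytes.fromhex("255044462D312E"), "PDF"),
]


def identify(data: bytes) -> str:
    """Return the label of the first signature that prefixes data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first 16 bytes of a file and identify it."""
    with open(path, "rb") as f:
        return identify(f.read(16))
```

Because only the leading bytes are inspected, renaming a file from .txt to .jpg does not change the result.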

PE file structure

PE (Portable Executable) is the general term for executable files under Windows. Common ones include DLL, EXE, OCX and SYS files. Starting from the beginning of the file, a PE file generally consists of the DOS header, the NT header, the section table, and the sections themselves.

1. DOS header

The DOS header exists for compatibility with executable files of the MS-DOS operating system. In a 32-bit PE file, the DOS part only displays a line of text telling the user that the program needs to run on 32-bit Windows. It is a slightly odd arrangement: the program does run under DOS, it just does not do what the user expects. That is beside the point here, though. The header is defined in the Windows SDK header winnt.h as IMAGE_DOS_HEADER.

We only need to focus on two fields:

e_magic: A WORD whose value is the constant 0x5A4D. On disk this is the byte pair 4D 5A, which a text editor shows as 'MZ'. Every executable file must start with 'MZ'.

e_lfanew: A field added for 32-bit executable files. It gives the offset of the NT header, which follows the DOS header, relative to the start of the file.
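A minimal sketch of reading these two fields is shown below. It assumes the standard 64-byte IMAGE_DOS_HEADER layout, with e_magic at offset 0 and e_lfanew at offset 0x3C; the function name and the sample bytes are hypothetical.

```python
import struct


def parse_dos_header(data: bytes) -> int:
    """Check e_magic and return e_lfanew (file offset of the NT header)."""
    if len(data) < 64:
        raise ValueError("too short for a DOS header")
    (e_magic,) = struct.unpack_from("<H", data, 0)
    if e_magic != 0x5A4D:  # bytes 4D 5A on disk, i.e. 'MZ'
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<i", data, 0x3C)
    return e_lfanew


# Hypothetical 64-byte header: 'MZ', zero padding, e_lfanew = 0x80.
sample = b"MZ" + bytes(58) + struct.pack("<i", 0x80)
print(hex(parse_dos_header(sample)))
```

Reading the WORD as little-endian is why the two bytes 4D 5A compare equal to 0x5A4D.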

2. NT header

Following e_lfanew in the DOS header, we can easily find the NT header, which is the most useful header in a 32-bit PE file. It is defined as IMAGE_NT_HEADERS: a 4-byte Signature, followed by an IMAGE_FILE_HEADER and an IMAGE_OPTIONAL_HEADER.


The following figure is a real PE file header structure and the value of each field:


Signature: Similar to e_magic in the DOS header. It is a DWORD whose upper 16 bits are 0 and whose lower 16 bits are 0x4550. On disk it is the bytes 50 45 00 00, which read as 'PE' followed by two zero bytes.
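The signature check can be sketched as follows, assuming e_lfanew has already been read from the DOS header. The function name is made up for the example.

```python
import struct


def has_pe_signature(data: bytes, e_lfanew: int) -> bool:
    """True if the 4 bytes at e_lfanew are 'PE' plus two zero bytes."""
    if e_lfanew < 0 or e_lfanew + 4 > len(data):
        return False
    (signature,) = struct.unpack_from("<I", data, e_lfanew)
    return signature == 0x00004550
```

The bounds check matters in practice, since a damaged or deliberately malformed file can carry an e_lfanew that points outside the file.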

IMAGE_FILE_HEADER is the PE file header. Its C definition is in winnt.h.

The specific meaning of each field is as follows:

Machine: The target platform of the file, such as x86, x64 or IA64 (Itanium). Typical values are IMAGE_FILE_MACHINE_I386 (0x014C), IMAGE_FILE_MACHINE_AMD64 (0x8664) and IMAGE_FILE_MACHINE_IA64 (0x0200).


NumberOfSections: How many sections are there in the PE file, that is, the number of items in the section table.

TimeDateStamp: The creation time of the PE file, usually filled in by the linker.

PointerToSymbolTable: The file offset of the COFF symbol table.

NumberOfSymbols: The number of entries in the symbol table.

SizeOfOptionalHeader: The size of the optional header that follows.

Characteristics: The attributes of the executable file, formed as the bitwise OR of flag values such as IMAGE_FILE_EXECUTABLE_IMAGE (0x0002), IMAGE_FILE_32BIT_MACHINE (0x0100) and IMAGE_FILE_DLL (0x2000).


As we can see, the PE file header defines some basic information and attributes of the PE file. The PE loader uses these attributes when loading the file. If the loader finds that some attribute in the PE file header does not match the current operating environment, it stops loading the PE file.
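The fields above can be unpacked with a short sketch like the one below. It assumes the 20-byte little-endian layout of IMAGE_FILE_HEADER and shows only three Characteristics flags; the class name, helper names and sample values are hypothetical.

```python
import struct
from typing import NamedTuple


class FileHeader(NamedTuple):
    machine: int
    number_of_sections: int
    time_date_stamp: int
    pointer_to_symbol_table: int
    number_of_symbols: int
    size_of_optional_header: int
    characteristics: int


# A few well-known flag values from winnt.h (not the full list).
CHARACTERISTICS = {
    0x0002: "EXECUTABLE_IMAGE",
    0x0100: "32BIT_MACHINE",
    0x2000: "DLL",
}


def parse_file_header(raw: bytes) -> FileHeader:
    """Unpack the 20-byte IMAGE_FILE_HEADER (little-endian)."""
    return FileHeader(*struct.unpack_from("<HHIIIHH", raw, 0))


def flag_names(characteristics: int) -> list[str]:
    """Names of the known flags that are set in characteristics."""
    return [name for bit, name in CHARACTERISTICS.items() if characteristics & bit]


# Hypothetical header: x86, 3 sections, 224-byte optional header, EXE flags.
raw = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 224, 0x0102)
header = parse_file_header(raw)
print(hex(header.machine), header.number_of_sections, flag_names(header.characteristics))
```

In a real file, raw would be the 20 bytes that immediately follow the 4-byte Signature.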

Another important header is the PE optional header. Do not be misled by the name: in an executable image it is always present. Its layout differs between platforms, however: it is IMAGE_OPTIONAL_HEADER32 for 32-bit files and IMAGE_OPTIONAL_HEADER64 for 64-bit files. For simplicity, we only look at the 32-bit version.

Magic: Indicates the type of the optional header: 0x10B for a 32-bit (PE32) header and 0x20B for a 64-bit (PE32+) header.
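As a rough sketch, the Magic field can be read from the first two bytes of the optional header. The values 0x10B (PE32) and 0x20B (PE32+) are the documented ones; the function and table names are made up for the example.

```python
import struct

OPTIONAL_MAGIC = {0x10B: "PE32", 0x20B: "PE32+"}


def optional_header_kind(optional_header: bytes) -> str:
    """Read the leading WORD of the optional header and name the format."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    return OPTIONAL_MAGIC.get(magic, "unknown")
```

A parser would typically branch on this result to choose between the 32-bit and 64-bit field layouts.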

MajorLinkerVersion and MinorLinkerVersion: The version number of the linker.

SizeOfCode: The length of the code segment, if there are multiple code segments, it is the sum of the lengths of the code segments.

SizeOfInitializedData: Initialized data length.

SizeOfUninitializedData: Uninitialized data length.

AddressOfEntryPoint: The RVA of the program entry point. For an EXE it can be thought of as the RVA of WinMain, for a DLL as the RVA of DllMain, and for a driver as the RVA of DriverEntry. Strictly speaking, the actual entry point is not WinMain, DllMain or DriverEntry, because a series of initialization steps runs before these functions, but that is not the focus of this article.

BaseOfCode: The RVA of the start address of the code segment.

BaseOfData: RVA of the starting address of the data segment.

ImageBase: The preferred base address of the image (the PE file as loaded into memory). It is only a suggestion: if a DLL cannot be loaded at this address, the system automatically picks another address for it.
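Since the fields above are RVAs, a virtual address is obtained by adding the RVA to the base address at which the image was actually loaded. The sketch below uses hypothetical values; 0x00400000 is assumed here as the usual default base of a 32-bit EXE.

```python
def rva_to_va(image_base: int, rva: int) -> int:
    """Virtual address of an RVA once the image is loaded at image_base."""
    return image_base + rva


# Hypothetical entry point RVA, loaded at the preferred base.
preferred = rva_to_va(0x00400000, 0x1234)

# If the loader had to pick another base, only the base changes.
rebased = rva_to_va(0x10000000, 0x1234)

print(hex(preferred), hex(rebased))
```

This is why the headers store RVAs rather than absolute addresses: the same file stays valid wherever it ends up in memory.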

SectionAlignment: Section alignment. When a section in PE is loaded into memory, it will be aligned according to the value specified by this field. For example, if this value is 0x1000, then the lower 12 bits of the start address of each section are 0.

FileAlignment: Sections are aligned by this value in the file, and SectionAlignment must be greater than or equal to FileAlignment.
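The effect of the two alignment fields can be sketched with a small rounding helper. The values 0x200 and 0x1000 are common choices used here as assumptions, and align_up is a made-up name.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


# Hypothetical section holding 0x1234 bytes of data.
size_in_file = align_up(0x1234, 0x200)     # FileAlignment
size_in_memory = align_up(0x1234, 0x1000)  # SectionAlignment

print(hex(size_in_file), hex(size_in_memory))
```

The same section therefore usually occupies more space in memory than on disk, which is why file offsets and RVAs differ.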

MajorOperatingSystemVersion, MinorOperatingSystemVersion: The version number of the required operating system. With more and more operating system versions, this does not seem to be so important.

MajorImageVersion, MinorImageVersion: The version number of the image, which is specified by the developer and filled in by the linker.

MajorSubsystemVersion, MinorSubsystemVersion: The required subsystem version number.

Win32VersionValue: reserved, must be 0.

SizeOfImage: The size of the image. A PE file is loaded into a contiguous region of memory, and this value specifies the size of the virtual address space it occupies.

SizeOfHeaders: The size of all file headers (including section tables), this value is aligned with FileAlignment.

CheckSum: The checksum of the image file.

Subsystem: The subsystem required to run the PE file, for example IMAGE_SUBSYSTEM_NATIVE (1), IMAGE_SUBSYSTEM_WINDOWS_GUI (2) or IMAGE_SUBSYSTEM_WINDOWS_CUI (3).

SizeOfStackReserve: The size of memory reserved for each thread stack at runtime.
SizeOfStackCommit: The initial memory size of each thread stack at runtime.

SizeOfHeapReserve: The size of memory reserved for the process heap at runtime.

SizeOfHeapCommit: The initial memory size of the process heap at runtime.

LoaderFlags: reserved, must be 0.

NumberOfRvaAndSizes: The number of items in the data directory, that is, the number of items in the following array.

DataDirectory: The data directory. It is an array in which each entry is an IMAGE_DATA_DIRECTORY structure with two fields:

VirtualAddress: An RVA.

Size: A size in bytes.

What are these two numbers for? One is an address and the other is a size, so each data directory entry describes a region of the image. Which region? As mentioned earlier, DataDirectory is an array, and each index corresponds to a specific data structure, such as the import table or the export table. The meaning of each index is defined in the header file winnt.h by the IMAGE_DIRECTORY_ENTRY_* constants, for example IMAGE_DIRECTORY_ENTRY_EXPORT (0) and IMAGE_DIRECTORY_ENTRY_IMPORT (1).
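A minimal sketch of walking the data directory is shown below. The index names follow the IMAGE_DIRECTORY_ENTRY_* constants in winnt.h, with the reserved 16th slot left out; the function name and the sample directory are hypothetical.

```python
import struct

# Index names as defined by the IMAGE_DIRECTORY_ENTRY_* constants in winnt.h.
DIRECTORY_NAMES = [
    "EXPORT", "IMPORT", "RESOURCE", "EXCEPTION", "SECURITY", "BASERELOC",
    "DEBUG", "ARCHITECTURE", "GLOBALPTR", "TLS", "LOAD_CONFIG",
    "BOUND_IMPORT", "IAT", "DELAY_IMPORT", "COM_DESCRIPTOR",
]


def parse_data_directory(raw: bytes, count: int) -> dict[str, tuple[int, int]]:
    """Map directory names to (VirtualAddress, Size), skipping empty entries."""
    entries = {}
    for index in range(min(count, len(DIRECTORY_NAMES))):
        rva, size = struct.unpack_from("<II", raw, index * 8)
        if rva or size:
            entries[DIRECTORY_NAMES[index]] = (rva, size)
    return entries


# Hypothetical directory with only an import table and an IAT filled in.
raw = bytearray(16 * 8)
struct.pack_into("<II", raw, 1 * 8, 0x2000, 0x50)
struct.pack_into("<II", raw, 12 * 8, 0x3000, 0x40)
print(parse_data_directory(bytes(raw), 16))
```

Here count stands for NumberOfRvaAndSizes, so a file that declares fewer entries is not read past its own array.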

ELF file structure

An ELF file consists of four parts: the ELF header, the program header table, the sections, and the section header table. A given file does not necessarily contain all of them, and they need not appear in that order. Only the position of the ELF header is fixed; the position, size and other properties of the remaining parts are determined by values in the ELF header.
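To make this concrete, the sketch below reads the offsets and entry counts of the two tables from the ELF header. It assumes the standard header layout (a 16-byte e_ident followed by fields whose width depends on the 32-bit or 64-bit class); the function name and the sample header are hypothetical.

```python
import struct


def parse_elf_header(data: bytes) -> dict:
    """Return the table offsets and counts stored in an ELF header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = data[4] == 2                   # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little, 2 = big
    # Fields after the 16-byte e_ident; address-sized fields differ by class.
    fmt = endian + ("HHIQQQIHHHHHH" if is_64 else "HHIIIIIHHHHHH")
    (_type, _machine, _version, _entry, phoff, shoff, _flags, _ehsize,
     _phentsize, phnum, _shentsize, shnum, _shstrndx) = struct.unpack_from(fmt, data, 16)
    return {"class": 64 if is_64 else 32, "phoff": phoff, "phnum": phnum,
            "shoff": shoff, "shnum": shnum}


# Hypothetical 64-bit little-endian header with made-up offsets and counts.
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
body = struct.pack("<HHIQQQIHHHHHH", 2, 0x3E, 1, 0x401000, 64, 0x2000, 0, 64, 56, 2, 64, 5, 4)
print(parse_elf_header(ident + body))
```

The phoff and shoff values are what a loader or analysis tool follows to find the program header table and the section header table.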

Tip: You can use a hexadecimal editor to view it

Introduction to the shell

A shell (also called a packer) is a piece of code added to a program that is specifically responsible for protecting the software from unauthorized modification or decompilation.

They generally run ahead of the program, gain control, and complete their task of protecting the software.

We usually divide shells into two categories, one is compressed shells and the other is encrypted shells.

Compressed shell: 

Compression shells appeared as early as the DOS era, but they were not widely used at the time because computing power was limited and decompression was too expensive.

A compression shell reduces the size of the PE file, hides its internal code and resources, and makes the file easier to transmit over a network and to store.

There are generally two types of compression shells. One simply compresses ordinary PE files. The other heavily transforms the source file and severely damages the PE file header, and is often used to compress malicious programs.

Common compression shells are: UPX, ASPack, PECompact

Encryption shell:

An encryption shell, or protection shell, applies a range of techniques against reverse engineering. Its main purpose is to protect the PE file's code from reverse analysis.

Since the main purpose of an encryption shell is no longer to compress the file, a PE program protected by one is usually much larger than the original file.

At present, encryption shells are widely used in applications that have high security requirements and are sensitive to cracking. Malicious programs also use them to evade (or reduce) detection and removal by antivirus software.

Common encryption shells are: ASProtect, Armadillo, EXECryptor, Themida, VMProtect

Shell loading process:

1. Save the entry parameters

① Save the value of each register when the shell is initialized

② After the shell has finished executing, restore the value of each register

③ Finally, jump to the original program and execute it

The pushad / popad and pushfd / popfd instruction pairs are usually used to save and restore this context

2. Obtain the required API functions

① The import table of a typical shell contains only the API functions GetProcAddress, GetModuleHandle and LoadLibrary

② If other API functions are needed, the shell maps the DLL file into the address space of the calling process with LoadLibraryA(W) or LoadLibraryExA(W)

③ If the DLL file has already been mapped into the address space of the calling process, the shell can call GetModuleHandleA(W) to obtain the DLL module handle

④ Once the DLL module is loaded, the shell can call GetProcAddress to get the address of an imported function

3. Decrypt the data of each block

① To protect the code and data of the source program, each block of the source program file is generally encrypted. When the program runs, the shell decrypts the block data so that the program can run normally

② The shell generally encrypts by block and decrypts by block, and puts the decrypted data back at the appropriate memory location
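The block-by-block idea can be illustrated with a toy model. The single-byte XOR below is only a stand-in chosen for brevity, not the scheme of any real shell, and the image layout, block list and function names are all hypothetical.

```python
def xor_block(block: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key (the same call encrypts and decrypts)."""
    return bytes(b ^ key for b in block)


def process_blocks(image: bytearray, blocks: list[tuple[int, int]], key: int) -> None:
    """Transform each (offset, size) block and write it back in place."""
    for offset, size in blocks:
        image[offset:offset + size] = xor_block(bytes(image[offset:offset + size]), key)


# Hypothetical image: a 4-byte plain header followed by two "sections".
original = bytearray(b"HDR!" + b"code" + b"data")
blocks = [(4, 4), (8, 4)]

packed = bytearray(original)
process_blocks(packed, blocks, 0x5A)    # packing time: encrypt each block

unpacked = bytearray(packed)
process_blocks(unpacked, blocks, 0x5A)  # run time: the shell decrypts in place
```

The header stays readable so the loader can still map the file, while the block contents only become meaningful again after the shell's pass.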

4. Jump back to the original program entry point

① Before jumping back to the entry point, the shell generally restores the original PE file's import table (IAT) and processes the relocation entries (mainly for DLL files)

② Because the shell builds its own import table when packing, it has to look up the addresses of all functions imported from each DLL again and fill them into the IAT

③ After this work is done, control is transferred to the original program, which continues to execute

Common tools

1. Behavior analysis tools

Huorong Sword (cannot be used in the same environment as Sangfor EDR), procexp, processminer, etc.

2. Shell detection tools

ExeinfoPE, Detect It Easy

3. Dynamic analysis tools

OllyDbg (can only analyze 32-bit programs), x64dbg, WinDbg, gdb (Linux)

4. Static analysis tools

IDA Pro

5. Auxiliary tools

HashMyFiles (calculates the hash value of the file)

010 Editor (views and modifies the hexadecimal content of a file)

Unpacking tool (download special unpacking tool according to the type of shell)

Pwndbg (gdb plugin)

Origin blog.csdn.net/jd_cx/article/details/126494746