Determine whether a file is a text file or a binary file (implemented in C)

First, the requirement: use C to determine whether a file is a text file, a binary file, or a file in some other format such as a compressed archive.

File types

Under the Linux system, everything is a file.
To manage everything as files, Linux divides files into seven types, as follows:

[Image: table of the seven Linux file types and the corresponding stat macros]
The third and fourth columns of the table above list the macros that can be used with the stat function to determine the file type on Linux. To check whether a file is a regular file, you can use the following code:

struct stat sb;
if (stat(pathname, &sb) == 0 && (sb.st_mode & S_IFMT) == S_IFREG) {
    /* Handle regular file */
}

Or use the test macro directly:

struct stat sb;
if (stat(pathname, &sb) == 0 && S_ISREG(sb.st_mode)) {
    /* Handle regular file */
}
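
For completeness, here is a minimal self-contained sketch of the same check, assuming a POSIX system. The function names is_regular_file and file_type_name are made up for this illustration, and the type names are simply the seven file types that the S_IFMT bits can describe.

```c
#define _XOPEN_SOURCE 700  /* assumption: needed on strict compilers for lstat and S_IFSOCK */
#include <sys/stat.h>

/* Returns 1 for a regular file, 0 for any other type, -1 if stat() fails. */
int is_regular_file(const char *pathname)
{
    struct stat sb;
    if (stat(pathname, &sb) != 0) {
        return -1;
    }
    return S_ISREG(sb.st_mode) ? 1 : 0;
}

/* Maps the S_IFMT bits to a readable name. lstat() is used so that a
 * symbolic link is reported as a link rather than as its target. */
const char *file_type_name(const char *pathname)
{
    struct stat sb;
    if (lstat(pathname, &sb) != 0) {
        return "unknown";
    }
    switch (sb.st_mode & S_IFMT) {
    case S_IFREG:  return "regular file";
    case S_IFDIR:  return "directory";
    case S_IFCHR:  return "character device";
    case S_IFBLK:  return "block device";
    case S_IFIFO:  return "FIFO";
    case S_IFLNK:  return "symbolic link";
    case S_IFSOCK: return "socket";
    default:       return "unknown";
    }
}
```

Calling file_type_name("/") should report a directory. Note that a text file and an ELF executable both come back as "regular file".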

But our requirement is to tell whether a file is a text file or a binary file. Both kinds are regular files (S_IFREG), so the method above cannot tell them apart.

The universal file command

The file command is a standard utility for detecting file types on Linux.
Roughly speaking, it reads the beginning of a file (about the first 1024 bytes), analyzes the file header against the rules in the magic database (/etc/magic or /usr/share/misc/magic), and prints the result to the screen.
It is also very simple to use: just put the file name after file:

[root@ck08 ~]# file anaconda-ks.cfg
anaconda-ks.cfg: ASCII text
[root@ck08 ~]# file tls.pcap
tls.pcap: tcpdump capture file (little-endian) - version 2.4 (Ethernet, capture length 262144)
[root@ck08 ~]# file zlib-1.2.11.tar.gz
zlib-1.2.11.tar.gz: gzip compressed data, was "zlib-1.2.11.tar", from Unix, last modified: Mon Jan 16 01:36:58 2017, max compression
[root@ck08 ~]# file /usr/bin/grep
/usr/bin/grep: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=bb5d89868c5a04ae48f76250559cb01fae1cd762, stripped

As the examples above show, the file command is very powerful: it identifies the detailed type of almost any file, down to specifics such as the encoding, compression format, and endianness.
So this command matches our needs.
However, we need a C implementation, so we have to study the rules that magic uses for file headers.

Magic file rules

Each line in a magic file specifies one test for checking the file type. A rule consists of four fields: offset, type, test, and message.

offset

The byte offset, counted from the beginning of the file, at which the check starts.

type

The data type to be checked, that is, how the data starting at offset is interpreted. For the full list of data types, see magic(5). The commonly used ones are:

byte: a one-byte value
short: a two-byte value
long: a four-byte value
string: a string of bytes

test

The test value, which is compared with the data of the given type at offset. It is written using C notation for numbers and characters.

message

The text to print when the test succeeds.

If the type is numeric, it can be followed by &value, meaning that the value read from the file is first ANDed with that mask and then compared with the test value. If the type is string, it can be followed by /[Bbc]*; for example, /b relaxes the matching of blanks and /c ignores letter case.
If the test value is numeric, it can be prefixed with =, <, >, &, ^ or ~. The first three mean equal to, less than, and greater than; & requires every bit set in the test value to be set in the file value; ^ requires at least one of those bits to be clear; ~ negates the test value before the comparison.
If the test value is a string, it can be prefixed with =, < or >.
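
To make the numeric operators concrete, here is a small sketch of how one numeric test could be evaluated. It is a simplification and not libmagic's code: the function name numeric_test is made up, values are treated as unsigned 32-bit numbers, and the ~ prefix is left out.

```c
#include <stdint.h>

/* Evaluates one numeric test.
 * mask models the optional &value after the type (0xFFFFFFFF = no mask);
 * op is the prefix character of the test value.
 * Returns 1 on match, 0 on no match, -1 for an operator not handled here. */
int numeric_test(uint32_t file_value, uint32_t mask, char op, uint32_t test_value)
{
    uint32_t v = file_value & mask;
    switch (op) {
    case '=': return v == test_value;
    case '<': return v < test_value;
    case '>': return v > test_value;
    case '&': return (v & test_value) == test_value;  /* all test bits set */
    case '^': return (v & test_value) != test_value;  /* some test bit clear */
    default:  return -1;
    }
}
```

For instance, numeric_test(0x1234, 0x00FF, '=', 0x34) matches, because the mask keeps only the low byte before the comparison.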
For example, the magic representation of an ELF file is:

# ELF
#0     string  ELF      ELF
0      string  \177ELF  ELF
>4     byte    1        32-bit
>4     byte    2        64-bit
>5     byte    1        LSB
>5     byte    2        MSB
>16    short   0        unknown type
>16    short   1        relocatable
>16    short   2        executable
>16    short   3        dynamic lib
>16    short   4        core file
>18    short   0        unknown machine
>18    short   1        WE32100
>18    short   2        SPARC
>18    short   3        80386
>18    short   4        M68000
>18    short   5        M88000
>20    long    1        Version 1
>36    long    1        MathCoPro/FPU/MAU Required
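
To show how such rules read in C, here is a sketch that applies the string rule at offset 0 and the four byte rules at offsets 4 and 5 to a buffer holding the start of a file. The struct byte_rule and the function describe_elf are names invented for this illustration; the short and long rules are left out because they also depend on byte order.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One "byte" rule from the listing: offset, expected value, message. */
struct byte_rule {
    size_t offset;
    unsigned char value;
    const char *message;
};

static const struct byte_rule elf_rules[] = {
    {4, 1, "32-bit"}, {4, 2, "64-bit"},
    {5, 1, "LSB"},    {5, 2, "MSB"},
};

/* Writes a description such as "ELF 64-bit LSB" into out.
 * Returns 1 if the buffer starts with the ELF magic, 0 otherwise. */
int describe_elf(const unsigned char *buf, size_t len, char *out, size_t outsz)
{
    size_t i;
    if (outsz == 0) {
        return 0;
    }
    out[0] = '\0';
    /* Rule "0 string \177ELF": the '>' rules only run if this matches. */
    if (len < 4 || memcmp(buf, "\177ELF", 4) != 0) {
        return 0;
    }
    snprintf(out, outsz, "ELF");
    for (i = 0; i < sizeof(elf_rules) / sizeof(elf_rules[0]); i++) {
        const struct byte_rule *r = &elf_rules[i];
        if (r->offset < len && buf[r->offset] == r->value) {
            size_t used = strlen(out);
            /* Append the message of every matching continuation rule. */
            snprintf(out + used, outsz - used, " %s", r->message);
        }
    }
    return 1;
}
```

Fed the first bytes of a 64-bit little-endian ELF file, this produces the same leading words that the file command prints.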

As you can see, magic rules are not easy for humans to read. Linux distributions provide the libmagic library for parsing magic files, but on the CentOS 7 and Ubuntu 20 systems I tried, the magic.h header was not available by default. Besides, I need a more general method that works not only on Linux but also on Windows and AIX. So re-implementing something along the lines of file is a dead end.
So is there no more general way to determine the file type?
Many sources on the Internet suggest judging by the bytes in the file: if the file contains \x00, it must be a binary or compressed file; otherwise it is an ordinary text file.
Most of the time this rule holds. But if a text file is encoded as UTF-16 or UTF-32, it breaks down, because those encodings store ASCII characters with zero bytes.
Therefore, this approach is unreliable.
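
The failure is easy to reproduce. The sketch below implements the zero-byte rule (the function name contains_nul and the two sample buffers are made up for this illustration) and applies it to the word "hi" in ASCII and in UTF-16LE.

```c
#include <stddef.h>
#include <string.h>

/* The zero-byte rule: returns 1 if the buffer contains \x00, else 0. */
int contains_nul(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}

/* "hi" followed by a newline, as plain ASCII text */
static const unsigned char sample_ascii[] = {'h', 'i', '\n'};

/* "hi" as UTF-16LE text: every ASCII character gets a zero high byte */
static const unsigned char sample_utf16le[] = {0x68, 0x00, 0x69, 0x00};
```

contains_nul returns 0 for sample_ascii but 1 for sample_utf16le, so perfectly ordinary UTF-16 text would be classified as binary.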

Headers of special files

Put simply, the idea behind libmagic is to judge by the bytes of the file header. As long as we know the headers of some special file types, we can try to match them: if a header matches, the file is a special file; otherwise it is treated as an ordinary text file. Following this idea, we can get a similar effect to libmagic for the common cases.
The article on the standard header encodings of various file types lists some common file headers. For example, jar packages and zip archives start with 504B0304 (rar archives have their own header, 52617221, that is "Rar!"), while binaries on Linux, including .o, .so and core dump files, are ELF files whose header is 7F454C46 (.a static libraries are ar archives and start with "!<arch>" instead). Windows executables start with 4D5A ("MZ"). AIX is more complicated, but the first three bytes are basically 01DF00. Based on these headers, many distinctions can be made.
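
One way to organize such headers is a lookup table, sketched below. This is an illustration rather than the code used later in this article: struct signature and match_signature are made-up names, and the gzip entry (1F8B) is an extra signature that the text above does not mention.

```c
#include <stddef.h>
#include <string.h>

/* A known file header: readable name, magic bytes, number of bytes. */
struct signature {
    const char *name;
    const char *bytes;
    size_t len;
};

static const struct signature signatures[] = {
    {"ELF",                     "\x7F" "ELF",  4},
    {"ZIP/JAR",                 "PK\x03\x04",  4},
    {"RAR",                     "Rar!",        4},
    {"Windows executable (MZ)", "MZ",          2},
    {"gzip",                    "\x1F\x8B",    2},  /* assumption: extra entry */
    {"ar archive",              "!<arch>\n",   8},
};

/* Returns the name of the first matching signature, or NULL if none match. */
const char *match_signature(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof(signatures) / sizeof(signatures[0]); i++) {
        const struct signature *s = &signatures[i];
        if (len >= s->len && memcmp(buf, s->bytes, s->len) == 0) {
            return s->name;
        }
    }
    return NULL;
}
```

A table like this keeps every platform's headers available on every platform, at the cost of a few more false positives: a text file that happens to begin with "MZ" would match the two-byte entry.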
In fact, on Windows the file type can usually be determined from the suffix, and the customary suffix conventions on Unix also identify many files. For example, you would never treat a file with the .rpm suffix as a text file; .o tells you it is a binary object file, and .so is a shared library. The more ambiguous cases are mostly executables such as ls, grep, and a.out, which have no suffix or a suffix that carries no real meaning.
So the approach is clear, and it has two steps: first, use the file suffix to roughly pick out the obviously binary and compressed files; then make further distinctions based on the file header.

C language code implementation

The code is implemented as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int boolean;
#define FALSE 0
#define TRUE  1

/* Some common suffixes of binary and archive files; extend as needed,
 * e.g. by loading them from a configuration file */
static const char *with_suffix[] = {
    ".gz", ".rar", ".exe", ".bz2",
    ".tar", ".xz", ".Z", ".rpm", ".zip",
    ".a",   ".so", ".o", ".jar", ".dll",
    ".lib", ".deb", ".I", ".png", ".jpg",
    ".mp3", ".mp4", ".m4a", ".flv", ".mkv",
    ".rmvb", ".avi", ".pcap", ".pdf", ".docx",
    ".xlsx", ".pptx", ".ram", ".mid", ".dwg",
    NULL};

/* Check whether a string ends with the given suffix */
static boolean string_has_suffix(const char *str, const char *suffix) {
    size_t n, m;
    if (str == NULL || suffix == NULL) {
        return FALSE;
    }
    n = strlen(str);
    m = strlen(suffix);
    if (n < m) {
        return FALSE;
    }
    /* Compare the last m characters of str with suffix */
    return memcmp(str + n - m, suffix, m) == 0 ? TRUE : FALSE;
}

/* Check whether the file name ends with one of the special suffixes */
static boolean file_has_spec_suffix(const char *fname) {
    const char **suffix = with_suffix;
    while (*suffix) {
        if (string_has_suffix(fname, *suffix)) {
            return TRUE;
        }
        suffix++;
    }
    return FALSE;
}

/* Check whether the file starts with one of the special headers */
static boolean file_has_spec_header(const char *fname) {
    FILE *fp = NULL;
    size_t len = 0;
    unsigned char buf[16] = {0};

    /* Open in binary mode so that Windows does not translate any bytes */
    if ((fp = fopen(fname, "rb")) == NULL) {
        return FALSE;
    }

    /* fread, unlike fgets, does not stop at a newline byte in the header */
    len = fread(buf, 1, sizeof(buf), fp);
    fclose(fp);
    if (len < 4) {
        return FALSE;
    }
#if defined(__linux__)
    /* ELF header */
    if (memcmp(buf, "\x7F\x45\x4C\x46", 4) == 0) {
        return TRUE;
    }
#elif defined(_AIX)
    /* executable binary */
    if (memcmp(buf, "\x01\xDF\x00", 3) == 0) {
        return TRUE;
    }
#elif defined(_WIN32) || defined(WIN32)
    /* standard exe file; normally caught by the suffix check already */
    if (memcmp(buf, "\x4D\x5A\x90\x00", 4) == 0) {
        return TRUE;
    }
#endif
    if (memcmp(buf, "\x50\x4B\x03\x04", 4) == 0) {
        /* probably an archive file, e.g. jar, zip, etc. */
        return TRUE;
    }

    return FALSE;
}


/* Test program:
 * takes a file name on the command line and reports the type of that file
 */
int main(int argc, char **argv) {
    if (argc < 2) {
        printf("usage: need target file\n");
        exit(EXIT_FAILURE);
    }
    const char *fname = argv[1];

    if (file_has_spec_suffix(fname)) {
        printf("file %s have special suffix, maybe it's a binary or archive file\n", fname);
    } else if (file_has_spec_header(fname)) {
        printf("file %s have special header, maybe it's a binary or archive file\n", fname);
    } else {
        printf("file %s should be a text file\n", fname);
    }
    return 0;
}

The results are as follows; you can compare them with the output of the file command shown earlier:

[root@ck08 ctest]# gcc -o magic magic.c 
[root@ck08 ctest]# ./magic ~/anaconda-ks.cfg
file /root/anaconda-ks.cfg should be a text file
[root@ck08 ctest]# ./magic ~/tls.pcap
file /root/tls.pcap have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic ~/zlib-1.2.11.tar.gz
file /root/zlib-1.2.11.tar.gz have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic /usr/bin/grep
file /usr/bin/grep have special header, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic kafka_2.12-2.8.0.jar
file kafka_2.12-2.8.0.jar have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic kafka_2.12-2.8.0.jar.1
file kafka_2.12-2.8.0.jar.1 have special header, maybe it's a binary or archive file
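
One limitation of the suffix step is that the comparison is case-sensitive, so a name such as ARCHIVE.ZIP, which is common on Windows, falls through to the header check. A possible variant is sketched below; the function name has_suffix_nocase is made up, and it is an optional extension rather than part of the program above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Like a plain suffix check, but ignoring ASCII letter case.
 * Returns 1 if str ends with suffix, 0 otherwise. */
int has_suffix_nocase(const char *str, const char *suffix)
{
    size_t n, m, i;
    if (str == NULL || suffix == NULL) {
        return 0;
    }
    n = strlen(str);
    m = strlen(suffix);
    if (n < m) {
        return 0;
    }
    for (i = 0; i < m; i++) {
        /* Cast to unsigned char: tolower() is undefined for negative values */
        int a = tolower((unsigned char)str[n - m + i]);
        int b = tolower((unsigned char)suffix[i]);
        if (a != b) {
            return 0;
        }
    }
    return 1;
}
```

The trade-off is that suffixes whose case carries meaning on Unix, such as .Z in the list above, would then also match their lowercase form.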

Origin blog.csdn.net/C214574728/article/details/126997690