Pure js to judge the file stream format type: pdf, doc, docx, xls, xlsx, ppt, pptx once done!

Use js to judge the scene of the file type

When developing a pure front-end file preview component based on the react framework, it needs to be distributed to different components to complete the preview according to different file types. Existing open source projects on the Internet usually pass file name parameters and distinguish file types through suffix name string matching.
But this approach requires the user to pass the exact file name and suffix. If your file is obtained from the server, it also requires the back-end development to accurately have this information. However, if the file type can be determined directly from the file stream, the file name parameter can be completely omitted, and the preview effect can be realized, and the dependence on the user or the server is greatly reduced .
This file preview project and the github address of the functional functions implemented in this article are placed at the end of the file. Currently, previews of the react framework pdf, xls, xlsx, and docx formats are implemented, and the pdf preview implements the basic operation functions of the toolbar. Welcome to criticize and exchange, if it helps you, please help me light up the star.

method features

Many methods on the Internet can only distinguish whether the file belongs to the Microsoft2003 specification (doc, xls, ppt) or the Microsoft2007 specification (docx, xlsx, pptx) only through the header encoding of the file stream, but they cannot distinguish the specific types within the same version . Because docx, xlsx, pptx and other formats specified by Microsoft2007 are essentially zip-type files, the hexadecimal numbers at the beginning of the file stream are all unified "0x50, 0x4b, 0x03, 0x04,...". This method is extended on the basis of distinguishing the two types of specifications, and can effectively identify specific file types under each specification.

input Output

Input: arrayBuffer data of files
Output: one of 'file2003, file2007, pdf, doc, docx, xls, xlsx, ppt, pptx, other'.

Method steps

1. Check the hexadecimal code of each file format and extract the "feature numbers" of different file types.

This step can be done with vscode. First, you can install a Hex Editor plug-in to display the hexadecimal file stream.
insert image description here
Put the file into vscode and open it. The left side is the hexadecimal code of the document (0x25, 0x50...), and the right side is the ascii code converted from the hexadecimal number (it can be converted by String.fromCharCode). This mode can visually see the keyword information of each file.

The first few numbers of the pdf file are unique, so as long as you get the file stream, it is easy to judge that it must be a pdf file.
Keywords appearing in the file header
The file header identification information of docx, xlsx, and pptx files is the same. From these fixed numbers, it can be judged that it is a file of the new Microsoft2007 standard.
insert image description here

However, keywords also appear at the end of these files, and they are unique for each type. Therefore, not only the header of the file can be obtained, but also a piece of data at the end of the file can be obtained for judgment.
The picture below is the docx fileKeywords appearing at the end of the docx file

The figure below is an xlsx file, and the key word is /worksheets. insert image description here
Similarly, the key word of the pptx file is ppt/, which also appears at the end of the file.
The beginning of the Microsoft2003 specification file is unified 'd0', 'cf', '11', 'e0'.
The keyword for xls is Microsoft Excel.
The keyword for doc is Microsoft Word.
The keyword of ppt is PowerPoint Document.

The following work is to obtain the data at different positions of the file stream and put them in an array to determine whether the array contains the above-mentioned hexadecimal numbers with characteristics.

2. First judge the large type, and then judge the small type under the specific large type

1) It is necessary to sort out the corresponding relationship between the various formats and feature numbers just summarized, and organize them into json format.

//大类
const formatMap = {
    
    
    'pdf': ['25', '50', '44', '46'],`在这里插入代码片`
    'file2003': ['d0', 'cf', '11', 'e0'],
    'file2007': ['50', '4b', '03', '04', '14', '00', '06', '00'],
}
//区分xlsx,pptx,docx三种格式的特征数。通过每个文件末尾的关键词检索判断
const format2007Map = {
    
    
    xlsx: ['77', '6f', '72', '6b', '73', '68', '65', '65', '74', '73', '2f'],// 转换成ascii码的含义是 worksheets/
    docx: ['77', '6f', '72', '64', '2f'],// 转换成ascii码的含义是 word/
    pptx: ['70', '70', '74', '2f'],// 转换成ascii码的含义是 ppt/
}
//区分xls,ppt,doc三种格式的特征数,关键词同样出现在文件末尾
const pptFormatList = ['50', '6f', '77', '65', '72', '50', '6f', '69', '6e', '74', '20', '44', '6f', '63', '75', '6d', '65', '6e', '74'];// 转换成ascii码的含义是 PowerPoint Document
const format2003Map = {
    
    
    xls: ['4d', '69', '63', '72', '6f', '73', '6f', '66', '74', '20', '45', '78', '63', '65', '6c'],// 转换成ascii码的含义是 Microsoft Excel
    doc: ['4d', '69', '63', '72', '6f', '73', '6f', '66', '74', '20', '57', '6f', '72', '64'],// 转换成ascii码的含义是 Microsoft Word
    ppt: pptFormatList.join(',00,').split(',')
}
//xls格式的文件还有一种例外情况,就是保存为.html格式的文件。特征码是office:excel
let xlsHtmlTarget = ['6f', '66', '66', '69', '63', '65', '3a', '65', '78', '63', '65', '6c']; 

2) Convert the arraybuffer data into an array using Unit8Array first, and then convert it into hexadecimal (for comparison, '0x' is omitted, and only the last two digits are reserved). At the same time, we don't need to use all the array data for each operation, but find a part of the corresponding position, so we use the slice method to intercept the array at our desired position.

It should be explained here that the specific location of the file should be intercepted, which is the approximate range estimated after opening several files. The larger the range, the higher the matching success rate, but a balance must be made between the efficiency of array traversal. It is not ruled out that due to standard compatibility issues in some files, the position where keywords appear is different from the position we intercepted, resulting in the possibility of final judgment error.

 //截取部分数组,并转化成16进制
function getSliceArrTo16(arr, start, end) {
    
    
    let newArr = arr.slice(start, end);
    return Array.prototype.map
        .call(newArr, (x) => ('00' + x.toString(16)).slice(-2));
}
//判断arr数组是否包含target数组,且不能乱序。如果数组比较小,直接将两个数组转换成字符串比较。
//在数组长度大于500的情况下,写了如下方法来提高检索的效率:
function isListContainsTarget(target, arr) {
    
    
    let i = 0;
    while (i < arr.length) {
    
    
        if (arr[i] == target[0]) {
    
    
            let temp = arr.slice(i, i + target.length);
            if (temp.join() === target.join()) {
    
    
                return true
            }
        }
        i++;
    }
}

Specific judgment method:

export default function getFileTypeFromArrayBuffer(arrayBuffer) {
    
    
	try {
    
    
        if (Object.prototype.toString.call(arrayBuffer) !== '[object ArrayBuffer]') {
    
    
            throw new TypeError("The provided value is not a valid ArrayBuffer type.")
        }
        let arr = new Uint8Array(arrayBuffer);
		let str_8 = getSliceArrTo16(arr, 0, 8).join('');  //只截取了前8位
		//为了简便,匹配的位置索引我选择在代码里直接固定写出,你也可以把相应的索引配置在json数据里。
		//第一次匹配,只匹配数组前八位,得到大范围的模糊类型
		let result = '';
		for (let type in formatMap) {
    
    
		    let target = formatMap[type].join('');
		    if (~str_8.indexOf(target)) {
    
     //相当于(str_8.indexOf(target) !== '-1')
		        result = type;
		        break;
		    }
		}
		
		if (!result) {
    
    
			//第一次匹配失败,不属于file2003,file2007,pdf。有可能是html格式的xls文件
			//通过前50-150位置判断是否是xls
		    let arr_start_16 = getSliceArrTo16(arr, 50, 150);
		    if (~(arr_start_16.join('').indexOf(xlsHtmlTarget.join('')))) {
    
    
		        return 'xls';
		    }
		    return 'other';
		}
		if (result == 'pdf') {
    
    
		     return result;
		 }
		 if (result == 'file2007') {
    
    
		     //默认是xlsx,pptx,docx三种格式中的一种,进行第二次匹配.如果未匹配到,结果仍然是file2007
		     let arr_500_16 = getSliceArrTo16(arr, -500);
		     for (let type in format2007Map) {
    
    
		         let target = format2007Map[type];
		         if (isListContainsTarget(target, arr_500_16)) {
    
    
		             result = type;
		             break;
		         }
		     }
		     return result;
		 }
		 if (result == 'file2003') {
    
    
		     let arr_end_16 = getSliceArrTo16(arr, -550, -440);
		     for (let type in format2003Map) {
    
    
		         let target = format2003Map[type];
		         //通过倒数440-550位置判断是否是doc/ppt/xls
		         if (~(arr_end_16.join('').indexOf(target.join('')))) {
    
    
		             result = type;
		             break
		         }
		     }
		     return result;
		 }
		//未匹配成功
		return 'other';

    }
        
}

project address:

Pure front-end based on react-based multi-type file preview:

https://github.com/react-office-viewer/react-office-viewer.git

Determine the file type by arraybuffer:

https://github.com/react-office-viewer/getFileTypeFromArrayBuffer.git

epilogue

Finally, what this article hopes most is to provide you with an idea to distinguish file formats. In actual operation, more abundant and detailed standards can be used to continuously improve the accuracy of judgment. Welcome to leave a message and exchange.

Guess you like

Origin blog.csdn.net/csdnyiiran/article/details/128735272