Reading text from all kinds of file formats with Python | doc, excel, html, mht

Introduction

 

As we all know, Python's greatest strength is its community: a wealth of open-source third-party libraries that more and more developers keep improving.

 

This is where Python shines.

 

Artificial intelligence, big data, and blockchain are all expected to grow with Python at their center.

 

Ahem! That is starting to sound like an advertisement.

 

In today's information-sharing Internet age, what matters most? Data. What is most valuable? Data. What most directly reflects the level of a technology? Data again.

 

So today's topic is: how to get the text content out of each file format.

 

Common file formats generally include: txt plain text, doc Word documents, html web pages, excel spreadsheets, and the more unusual mht files.

 

One, Processing html web pages with Python

An html file is text data: front-end markup tags plus the text itself. It can be opened directly in a browser such as Chrome, which displays it as readable text.

Python reads an html file the same way it reads a txt file: open the file and read it.

The reading code is as follows:

with open(html_path, "r", encoding="utf-8") as f:
    file = f.read()

An html file is a text file, so what you read back is the page content, still wrapped in its tags.
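
Saved pages are not always UTF-8, so the fixed encoding="utf-8" above can raise UnicodeDecodeError. A minimal sketch that tries a list of encodings in turn; the function name read_text_file and the fallback order (utf-8, then gbk) are assumptions made for illustration, not something the original code does:

```python
from pathlib import Path


def read_text_file(path, encodings=("utf-8", "gbk")):
    """Return the file's text, trying each candidate encoding in turn."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    # Nothing decoded cleanly: fall back to the first encoding and
    # mark undecodable bytes instead of raising.
    return data.decode(encodings[0], errors="replace")
```

Calling read_text_file(html_path) then takes the place of the open(...).read() pair for both html and txt files.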

Two, Processing excel spreadsheets with Python

Python has third-party libraries that work on excel files directly: xlrd for reading and xlwt for writing. Calling their methods lets you read and write spreadsheet data.

The excel reading code is as follows:

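import xlrd

# Raw string, so the backslashes in the Windows path are not treated as escapes
filepath = r"C:\Users\Administrator\Desktop\新建文件夹\笨笨 前程6份 武汉.xls"
sheet_name = "UserList"
rb = xlrd.open_workbook(filepath)

sheet = rb.sheet_by_name(sheet_name)

# clox_list = [0, 9, 14, 15, 17]
for row in range(1, sheet.nrows):  # starts at 1, presumably to skip a header row
    w = WriteToExcel()  # the author's own helper class (from write_to_excel)
    # for clox in clox_list:
    # sheet.cell(row, col) takes zero-based row and column indexes
    name = sheet.cell(row, 0).value
    phone = sheet.cell(row, 15).value
    address = sheet.cell(row, 9).value
    major = sheet.cell(row, 14).value
    age = sheet.cell(row, 8).value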

Here row is the row index of the data in the sheet. sheet.cell(row, col) returns the cell at that row and column, and its value attribute holds the data.
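
The per-column lookups can also be driven by a mapping, in the spirit of the commented-out clox_list. A minimal sketch, assuming only that the sheet object offers nrows and cell(row, col).value as an xlrd sheet does; the names rows_to_records and COLUMNS are made up for this example:

```python
def rows_to_records(sheet, columns, header_rows=1):
    """Collect the selected columns from every data row of an xlrd-style sheet.

    columns maps a field name to a zero-based column index.
    """
    records = []
    for row in range(header_rows, sheet.nrows):
        record = {name: sheet.cell(row, col).value for name, col in columns.items()}
        records.append(record)
    return records


# Column positions taken from the reading code above
COLUMNS = {"name": 0, "age": 8, "address": 9, "major": 14, "phone": 15}
```

With the workbook from the reading code, rows_to_records(rb.sheet_by_name(sheet_name), COLUMNS) would return one dict per person.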

 

Three, Reading doc documents with Python

 

Reading doc documents is the most troublesome case in Python. The processing logic is complex, and there are many ways to go about it.

 

Python has no third-party library that handles doc files directly, but it does have one for docx. So we convert the doc file to a docx file, then call the third-party library pydocx to read the document's contents.

 

Note: do not simply change the file extension from doc to docx. pydocx cannot read a docx file produced just by renaming.

 

We can use another library (win32com, which drives Microsoft Word) to convert doc to docx.

 

The specific code is as follows:

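def doSaveAas(self, doc_path):
    """
    Convert a doc document to a docx document.
    :rtype: object
    """
    # Swap only the extension; replace("doc", "docx") would also
    # rewrite any other "doc" that appears in the path.
    docx_path = os.path.splitext(doc_path)[0] + ".docx"
    word = wc.Dispatch('Word.Application')  # needs Microsoft Word installed on Windows
    doc = word.Documents.Open(doc_path)  # the file at the source path
    # 12 is Word's format code for docx; the result is written to docx_path
    doc.SaveAs(docx_path, 12, False, "", True, "", False, False, False, False)
    doc.Close()
    word.Quit()
    return docx_path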

The imports the code needs (xpath_content and write_to_excel appear to be the author's own local modules):

import os
import zipfile
from win32com import client as wc
import xlrd
from bs4 import BeautifulSoup
from pydocx import PyDocX
from lxml import html
from xpath_content import XpathContent
from write_to_excel import WriteToExcel

There are many ways to process docx documents in Python; which one to use depends on your needs.

No.1 Unzipping the docx file

The principle: a docx file is essentially a zip archive. Once it is unzipped, you can get at each of the original files inside.

After unzipping, a docx file turns out to be a set of folders and xml files. The document text is stored in the word/document.xml file.
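
To see that structure for yourself, the standard zipfile module can list the archive's members. A minimal sketch; list_docx_parts is a name made up for this example:

```python
import zipfile


def list_docx_parts(docx_path):
    """Return the names of the files stored inside a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()
```

For a real document the returned list should include word/document.xml among its entries.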

In the first method, we rename the docx file to a zip file, open the zip archive, and read the contents of word/document.xml. (The rename is optional: zipfile opens the archive whatever its extension.)

The specific code is as follows:

def get_content(self):
    """
    Get the text content of a docx document.
    :rtype: object
    """
    os.chdir(r"C:\Users\Administrator\Desktop\新建文件夹")  # change to the file's directory
    # Rename to a zip file (optional: zipfile can open the .docx as it is)
    os.rename("51 2014.09.12 1份Savannah.docx", "51 2014.09.12 1份Savannah.ZIP")
    with zipfile.ZipFile('51 2014.09.12 1份Savannah.ZIP', 'r') as f:  # open the archive
        xml = f.read("word/document.xml")

    # Name the parser explicitly; the "xml" parser requires lxml
    wordObj = BeautifulSoup(xml.decode("utf-8"), "xml")
    # print(wordObj)
    texts = wordObj.find_all("w:t")  # w:t elements hold the runs of text
    content = []
    for text in texts:
        content.append(text.text)
    content_str = "".join(content)
    return content_str

The result is all of the text in the docx document.
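
The same idea also works with the standard library alone and without renaming the file. This is a sketch rather than the author's method: it swaps BeautifulSoup for xml.etree.ElementTree and joins paragraphs with newlines instead of concatenating everything. The function name docx_text is an assumption:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace behind the "w:" prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(docx_path):
    """Return the document text, one line per paragraph."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each w:t element carries one run of text within the paragraph
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because the archive is opened in place, the original docx file is left untouched on disk.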

 

No.2 Converting the docx document into a text format Python can handle

 

The first method works from how a docx file is built, and the process is a bit cumbersome. Is there no way to read a docx document's contents directly? The answer is definitely no. Forget it, go home, wash up, and go to sleep.

 

If there is no way to read a docx document directly, is there a way to convert it into a text format that Python handles easily?

 

As said earlier, Python has a rich set of third-party libraries (one more round of boasting about Python). After much searching I finally found one that can convert the docx format: pydocx. It has a method, PyDocX.to_html(), that converts a docx document directly into html. How about that? Surprised or not?

 

The conversion code for the second method is as follows:

def docx_to_html(self, docx_path):
    """
    Convert a docx document to html.
    :rtype: object
    """
    # docx_path = r"C:\Users\Administrator\Desktop\新建文件夹\51 2014.09.12 1份Savannah.docx"
    response = PyDocX.to_html(docx_path)
    return response  # the html content as a string

response holds the html content of the document.

 

Four, Processing mht files with Python

 

An mht file displays as readable text only in the IE browser; opened in Chrome it is a pile of garbled characters.

 

No.1 Faking an IE request to get the mht file's content

 

The most basic way to read the text of an mht file is to fake an IE browser request.

 

Call the requests library and send a GET request to the web link, with request headers constructed to look like IE.

 

In theory this approach works. It is not recommended, though, for reasons we all know.

 

 

No.2 Converting the file format

 

Now for a proper approach: could the mht file be changed into some other file format that can be read directly?

 

docx? No. html? That will not do either. excel? Needless to say.

 

There is only one truth!!!

 

A docx file obtained by simply changing the extension cannot be read.

 

So what method is left? Yes: change it to a doc document.

 

The method sounds incredible, but it came from a flash of inspiration.

 

Changing the extension turns an mht file directly into a doc file. To read the text of that doc file, follow the doc processing steps described above.
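
A different route from the author's, offered only as a sketch: an mht file is a MIME multipart archive (MHTML), so Python's standard email package can parse it and hand back the html part without Word being involved. The function name mht_to_html is an assumption, and a real file may contain several html parts, of which this returns only the first:

```python
import email
from email import policy


def mht_to_html(mht_path):
    """Return the first text/html part of an .mht file as a string, or None."""
    with open(mht_path, "rb") as f:
        message = email.message_from_binary_file(f, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and the charset
            return part.get_content()
    return None
```

The string it returns is ordinary html, so it can be handled the same way as the html files in section One.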

 

How do we get the text out of html?

An html page is text data structured by tags. The text can be extracted with re regular expressions or with xpath.
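
For the re route, a rough sketch is given here. Regular expressions are not a full html parser, so this should be treated as good enough for simple pages only. The function name html_to_text is an assumption:

```python
import html
import re


def html_to_text(page):
    """Strip tags from an html string and return the visible text."""
    # Drop script and style blocks, whose contents are not page text
    page = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", page)
    # Replace every remaining tag with a space
    page = re.sub(r"(?s)<[^>]+>", " ", page)
    # Decode entities such as &amp; and collapse runs of whitespace
    page = html.unescape(page)
    return re.sub(r"\s+", " ", page).strip()
```

It accepts any html string, whether read from a file or produced by PyDocX.to_html().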

 

If readers need it, I will follow up with a separate article on how to use re and xpath.


Origin blog.csdn.net/weixin_44786530/article/details/92392135