用xlrd读xls类型的文件,结果一直报错
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'
最后发现原来该文件是披着xls格式的xml文件,查找了很多资料,发现大多数都是针对纯xml文件的解决方法,对于偏向于excel结构的很少。由于我的xml文件一打开就只有一行,而且很卡,所以光检查文件里的标签就花了很长时间。下面附上一般偏向excel的xml文件。
<?xml version="1.0"?> <?mso-application progid="Excel.Sheet"?> <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40"> <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office"> <Title>Excel表格</Title> <LastAuthor>bigtall</LastAuthor> </DocumentProperties> <Styles> <Style ss:ID="Default" ss:Name="Normal"> <Alignment ss:Vertical="Center"/> <Font ss:FontName="宋体" x:CharSet="134" ss:Size="12"/> </Style> </Styles> <Worksheet ss:Name="tt"> <Table> <Row> <Cell ss:MergeAcross="6" ><Data ss:Type="String">Hello!World!</Data></Cell> </Row> </Table> </Worksheet> </Workbook>
还可以精简到:
<?xml version="1.0"?> <?mso-application progid="Excel.Sheet"?> <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40"> <Worksheet ss:Name="tt"> <Table> <Row> <Cell><Data ss:Type="String">Hello!World!</Data></Cell> </Row> </Table> </Worksheet> </Workbook>
所以对于规整的标签格式,用正则就可以取出对应的元素,然后存入列表,这就是解决这个问题的思路。
#!/usr/bin/python3 # -*- coding: utf-8 -*- # @Time : 2018/6/13 0013 8:42 # @Author : 一梦南柯 # @File : upload_data4.py ''' 用python读取xml格式的excel文件 ''' # coding=utf-8 import os import re import csv INPUTPATH = u"F:\\GY-DB\\upload_data" def readxmldata(): strstock = os.path.join(INPUTPATH,'ttt.xml') fp = open(strstock, "rb") file_data = fp.read() data1 = re.findall('<Data ss:Type=("String"|"Number")>([\s\W\w\S]*?)</Data></Cell>', file_data.decode('utf-8'))[1:-2] print(len(data1)) data2 = [] for i in range(len(data1)): data2.append(data1[i][1]) len_axis = 28 re_data = [] for ii in range(0, int(len(data2) / len_axis)): re_data.append(data2[ii * len_axis:(ii + 1) * len_axis]) print(int(len(data1) / len_axis) - 1) print(re_data[int(len(data1) / len_axis) - 1], len(re_data)) with open('F:\\GY-DB\\upload_data\\{t3}.csv'.format(t3='13ruku1'), 'w', newline='') as f: writer = csv.writer(f) writer.writerows(re_data) f.close() if __name__ == '__main__': readxmldata()
因为xml文件中的标签内有<Data ss:Type="String">,<Data ss:Type="Number">这两种格式,所以用正则匹配或运算同时获取满足这两个条件之一的元素。
如果用这种方式打开的话
dom = xml.dom.minidom.parse('abc.xml')
非常吃内存,一下子吃了6个g吧。
Reference:
https://www.cnblogs.com/bigtall/archive/2004/10/13/51821.aspx
https://blog.csdn.net/lishenluo/article/details/53560961