Reading an XML-format .xls file with Python and writing it out as CSV

I tried to read an .xls file with xlrd, but it kept failing with this error:

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'
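
The bytes quoted in the error message, b'<?xml ve', already hint at the cause: xlrd expected a binary BOF record and found the start of an XML declaration instead. A quick way to check what a file really contains is to look at its first few bytes. The sketch below is only an illustration: `sniff_excel_format` and `sniff_file` are hypothetical helper names (not part of xlrd), and the list of signatures is deliberately incomplete.

```python
# Hypothetical helpers: guess the real format of a ".xls" file from its first bytes.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 container used by most binary .xls files
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx files are zip archives
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_excel_format(head: bytes) -> str:
    """Classify a file by its leading bytes."""
    if head.startswith(OLE2_MAGIC):
        return "xls-binary"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-zip"
    # Skip an optional UTF-8 BOM and leading whitespace before the XML declaration.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 64 bytes of a file and classify them."""
    with open(path, "rb") as fp:
        return sniff_excel_format(fp.read(64))


print(sniff_excel_format(b'<?xml version="1.0"?>'))  # xml
```

A result of "xml" for a file named .xls means it cannot be read with xlrd and has to be parsed as XML.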

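It turned out that the file was actually an XML file disguised with an .xls extension. I searched through a lot of material, but most solutions target plain XML files, and very few deal with XML that follows the Excel structure. My XML file was a single line when opened and the editor lagged badly, so just inspecting the tags in the file took a long time. A typical Excel-flavored XML file looks like this: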

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
  <Title>Excel表格</Title>
  <LastAuthor>bigtall</LastAuthor>
 </DocumentProperties>
 <Styles>
  <Style ss:ID="Default" ss:Name="Normal">
   <Alignment ss:Vertical="Center"/>
   <Font ss:FontName="宋体" x:CharSet="134" ss:Size="12"/>
  </Style>
 </Styles>
 <Worksheet ss:Name="tt">
  <Table>
   <Row>
    <Cell ss:MergeAcross="6" ><Data ss:Type="String">Hello!World!</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

It can be trimmed down to:

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <Worksheet ss:Name="tt">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Hello!World!</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

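Because the tags follow such a regular pattern, a regular expression can pull out the matching elements, which are then stored in a list. That is the idea behind the solution.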

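#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Time    : 2018/6/13 0013 8:42
# @Author  : 一梦南柯
# @File    : upload_data4.py

'''
Read an Excel file stored as XML with Python and write it out as CSV.
'''
import os
import re
import csv

INPUTPATH = "F:\\GY-DB\\upload_data"
# Matches every String or Number cell; group 1 is the type, group 2 the cell text.
# [\s\S] matches any character, including newlines.
DATA_PATTERN = re.compile(r'<Data ss:Type=("String"|"Number")>([\s\S]*?)</Data></Cell>')


def readxmldata():
    strstock = os.path.join(INPUTPATH, 'ttt.xml')
    with open(strstock, "rb") as fp:
        file_data = fp.read()
    # [1:-2] drops the first match and the last two (specific to this file).
    data1 = DATA_PATTERN.findall(file_data.decode('utf-8'))[1:-2]
    print(len(data1))
    # Keep only the cell text (group 2) from each (type, text) tuple.
    data2 = [item[1] for item in data1]
    # Number of columns per row in this file.
    len_axis = 28
    # Split the flat list into rows of len_axis cells; a trailing partial row is dropped.
    re_data = []
    for ii in range(len(data2) // len_axis):
        re_data.append(data2[ii * len_axis:(ii + 1) * len_axis])
    # Debug output: index and content of the last row, and the row count.
    print(len(data1) // len_axis - 1)
    print(re_data[len(data1) // len_axis - 1], len(re_data))
    # The with block closes the file, so no explicit close() is needed.
    with open(os.path.join(INPUTPATH, '13ruku1.csv'), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(re_data)


if __name__ == '__main__':
    readxmldata()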

The cells in the XML file come in two forms, <Data ss:Type="String"> and <Data ss:Type="Number">, so the regular expression uses alternation (|) to match elements that satisfy either one.
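
To make the alternation concrete, here is a small sketch on made-up sample data (the two cells below are not taken from my real file). It uses the same idea as the script: one capturing group for the type, one for the cell text, with `[\s\S]` matching any character including newlines. Because the pattern has two groups, `re.findall` returns (type, text) tuples, which is why the script keeps only the second item of each match.

```python
import re

# Two cells as they might appear in an Excel-flavored XML file (made-up sample data).
sample = (
    '<Cell><Data ss:Type="String">Hello!World!</Data></Cell>'
    '<Cell><Data ss:Type="Number">42</Data></Cell>'
)

# ("String"|"Number") accepts either type; *? keeps the match non-greedy so
# each match stops at the first closing </Data></Cell>.
pattern = re.compile(r'<Data ss:Type=("String"|"Number")>([\s\S]*?)</Data></Cell>')

matches = pattern.findall(sample)
print(matches)  # [('"String"', 'Hello!World!'), ('"Number"', '42')]

# Keep only the cell text, discarding the type.
values = [text for _type, text in matches]
print(values)   # ['Hello!World!', '42']
```

Note that cells of any other type (for example DateTime) would be skipped silently by this pattern.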

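If the file is opened with a DOM parser instead, like this:

import xml.dom.minidom
dom = xml.dom.minidom.parse('abc.xml')

it is very memory-hungry: in my case it ate something like 6 GB right away.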
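
A possible middle ground between the regex approach and a full DOM, which I did not use in the script above, is the standard library's `xml.etree.ElementTree.iterparse`, which reads the file incrementally so that each row can be written and discarded. The sketch below is an assumption-laden illustration, not a tested replacement for my script: `xml_to_csv` is a hypothetical function name, it only reads the direct text of each Data element, and it ignores attributes such as ss:Index and ss:MergeAcross, so sparse rows may end up with shifted columns.

```python
import csv
import io
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"  # default namespace of the Workbook
ROW_TAG = f"{{{SS}}}Row"
DATA_TAG = f"{{{SS}}}Data"


def xml_to_csv(xml_source, csv_file):
    """Stream rows from an Excel-flavored XML source into an open CSV file object."""
    writer = csv.writer(csv_file)
    n_rows = 0
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == ROW_TAG:
            writer.writerow([d.text or "" for d in elem.iter(DATA_TAG)])
            n_rows += 1
            elem.clear()  # drop the row's cells; only an empty Row shell remains
    return n_rows


# Usage on a small in-memory sample; xml_source may also be a file path.
sample = b"""<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="tt">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Hello!World!</Data></Cell>
    <Cell><Data ss:Type="Number">42</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>
"""
out = io.StringIO()
n_rows = xml_to_csv(io.BytesIO(sample), out)
print(n_rows, repr(out.getvalue()))
```

When writing to a real file, open it with newline='' as in the script above. Grouping by Row also removes the need for a hard-coded column count.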

References:

https://www.cnblogs.com/bigtall/archive/2004/10/13/51821.aspx

https://blog.csdn.net/lishenluo/article/details/53560961
