Parsing large XML files in Python, and the problem of XML being read incompletely

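I had previously written an XML parsing script with Python's minidom. It worked well at first because the XML files were small. Once a file grew past 70 MB, though, minidom was not only slow but also used a huge amount of memory, because it reads the entire file and builds the whole XML tree in memory (the resulting code is easy to follow, but it really is slow and memory-hungry). For a 70 MB XML file it ate more than 4 GB of my 8 GB of RAM, which is frightening. Since the XML files I need to read may get even bigger, I quickly wrote a new reader script.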
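To make the difference concrete, here is a minimal sketch (not my original script) that counts `result` elements both ways. The tiny in-memory document and the function names `count_with_minidom` and `count_with_iterparse` are made up for illustration. minidom has to build the whole tree before anything can be queried, while iterparse hands elements over one at a time so they can be discarded along the way.

```python
import io
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical miniature document, just for this comparison.
SAMPLE = (
    b"<results>"
    b"<result id_product='1'>3.98</result>"
    b"<result id_product='1'>1</result>"
    b"</results>"
)


def count_with_minidom(data):
    # Builds the complete DOM tree in memory first, then queries it.
    dom = xml.dom.minidom.parseString(data)
    return len(dom.getElementsByTagName("result"))


def count_with_iterparse(data):
    # Streams through the document; each element is cleared once handled.
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "result":
            count += 1
        elem.clear()
    return count


print(count_with_minidom(SAMPLE), count_with_iterparse(SAMPLE))
```

Both functions give the same count; the point is only that the second one never needs the full tree to exist at once.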

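Before that, I had read this article and this other one, and decided to go with the ET_iter approach (ElementTree's iterparse) that they describe.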

Then I found this blogger's post and wrote my code along the same lines:

# coding=utf-8
__author__ = 'Arthur'
# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator when available.
import xml.etree.ElementTree as ET

if __name__ == "__main__":
   for event, elem in ET.iterparse("test2.xml", events=('start', 'end')):
      if event == 'start':
         if elem.tag in ('product', 'property', 'evaluation'):
            print(elem.attrib)
         elif elem.tag == 'result':
            a_result = dict(elem.attrib)
            # First attempt: the text is read on the 'start' event.
            a_result['value'] = elem.text
            if elem.text is None:
               print("result none")
            else:
               print(a_result)
      elif event == 'end':
         if elem.tag == 'products':
            print("deal with products over")
         elif elem.tag == 'propertys':
            print("deal with propertys over")
         elif elem.tag == 'evaluations':
            print("deal with evaluations over")
         elif elem.tag == 'results':
            print("deal with results over")
      # Runs on every event, 'start' as well as 'end'.
      elem.clear()
With an XML file I had constructed myself, this worked without any problems:

<?xml version='1.0' encoding='utf-8'?>
<testresults	source="ICRT EvalDB" type="data"
        		user="unknown">
<project	id_project="697"
            icrt_code="IC16539"
            name="Combined Wearables"
            comment="">
<snapshots>
<snapshot	id_snapshot="4"
            name="Combined snapshot"
            timestamp_created="1471515160"
            timestamp_lastchange="1482147798"
            time_lastchange="2016-12-19 (11:43)">
<manufacturers>
<manufacturer	id_manufacturer="1"
                name="Apple"
                comment=""
                timestamp_created="1465471929"
                timestamp_lastchange="0" />
<manufacturer	id_manufacturer="2"
                name="Fitbit"
                comment=""
                timestamp_created="1465471929"
                timestamp_lastchange="0" />
</manufacturers>
<productgroups>
<productgroup	id_productgroup="1"
                name="SMARTWATCH"
                comment=""
                timestamp_created="1465471929"
                timestamp_lastchange="0" />
<productgroup	id_productgroup="2"
                name="FITNESS TRACKER"
                comment=""
                timestamp_created="1465471929"
                timestamp_lastchange="0" />
</productgroups>
<products>
<product	id_product="10"
            icrt_code="IC16539-0036-00"
            modelname="Gear S2"
            completename="Samsung Gear S2"
            shortname=""
            systemmodelid=""
            releasedate=""
            labreportdate="2016-05-27T00:00:00.000"
            labarrivaldate="2016-05-06T00:00:00.000"
            boughtbyorganisation="WHICH"
            serialnumber="RFAH105HFQF"
            articlenumber="8.80608808859E+12"
            comment=""
            id_productgroup="1"
            id_manufacturer="9"
            sortorder="0"
            batch="1"
            labcode=""
            parentmodelcode=""
            similarmodelscodes=""
            testtype=""
            picture_lores=""
            picture_hires=""
            timestamp_created="1465471929"
            timestamp_lastchange="1466062628" />
<product	id_product="11"
            icrt_code="IC16539-0040-00"
            modelname="Vivofit 3"
            completename="Garmin Vivofit 3"
            shortname=""
            systemmodelid=""
            releasedate=""
            labreportdate="2016-06-15T00:00:00.000"
            labarrivaldate="2016-06-24T00:00:00.000"
            boughtbyorganisation="WHICH"
            serialnumber="4R0201708"
            articlenumber="53759 15457"
            comment=""
            id_productgroup="2"
            id_manufacturer="3"
            sortorder="0"
            batch="2"
            labcode=""
            parentmodelcode=""
            similarmodelscodes=""
            testtype=""
            picture_lores=""
            picture_hires=""
            timestamp_created="1469800248"
            timestamp_lastchange="1475593828" />
<product	id_product="12"
            icrt_code="IC16539-0047-00"
            modelname="Go"
            completename="Withings Go"
            shortname=""
            systemmodelid=""
            releasedate=""
            labreportdate="2016-06-15T00:00:00.000"
            labarrivaldate="2016-06-24T00:00:00.000"
            boughtbyorganisation="WHICH"
            serialnumber="00:24:E4:39:F0:0D"
            articlenumber="700546 701481"
            comment=""
            id_productgroup="2"
            id_manufacturer="10"
            sortorder="0"
            batch="2"
            labcode=""
            parentmodelcode=""
            similarmodelscodes=""
            testtype=""
            picture_lores=""
            picture_hires=""
            timestamp_created="1469800248"
            timestamp_lastchange="1475593828" />
</products>
<propertygroups>
<propertygroup	id_propertygroup="36"
                name="Features|inventory"
                comment=""
                timestamp_created="1465222484"
                timestamp_lastchange="0" />
<propertygroup	id_propertygroup="37"
                name="Features|Smart"
                comment=""
                timestamp_created="1465222484"
                timestamp_lastchange="0" />
</propertygroups>
<propertys>
<property	id_property="381"
			id_propertygroup=""
			binding="FIRMWARE"
			name="Firmware version on device"
			comment=""
			max="0"
			min="0"
			unit=""
			precision="0"
			type="String"
			use="1"
			testprogram="1.1.3"
			timestamp_created="1465222485"
			timestamp_lastchange="1465222485" />
<property	id_property="382"
			id_propertygroup=""
			binding="COMPATABILITY"
			name="What phones are compatible with device"
			comment=""
			max="0"
			min="0"
			unit=""
			precision="0"
			type="String"
			use="1"
			testprogram="1.1.7"
			timestamp_created="1465222485"
			timestamp_lastchange="1468831229" />
</propertys>
<calculationtypes>
<calculationtype	id_calculationtype="0"
	        		name="Arithmetic mean calculation" />
<calculationtype	id_calculationtype="5"
	        		name="Geometric mean calculation" />
<calculationtype	id_calculationtype="1"
	        		name="Versatility calculation" />
<calculationtype	id_calculationtype="2"
	        		name="Free formula calculation (complex)" />
<calculationtype	id_calculationtype="3"
	        		name="Minimum calculation" />
<calculationtype	id_calculationtype="4"
	        		name="Maximum calculation" />
</calculationtypes>
<evaluations>
<evaluation	id_evaluation="3165"
			id_childs="3185,3199,3176,3166,3180,3175,3195,3615"
			id_parent="0"
			id_calculationtype="0"
			name="total test result"
			binding=""
			use_inheritna="0"
			use_lookuptable="0"
			use_limiting="0"
			weighting_normalized="0"
			weighting_given="1"
			lookuptable="0.5,1.5,2.5,3.5,4.5,5.5" unit=""
			precision="3"
			timestamp_created="1465222499"
			timestamp_lastchange="1467972637" />
<evaluation	id_evaluation="3166"
			id_childs="3167"
			id_parent="3165"
			id_calculationtype="0"
			name="App"
			binding=""
			use_inheritna="0"
			use_lookuptable="0"
			use_limiting="0"
			weighting_normalized="0"
			weighting_given="0"
			lookuptable="0.5,1.5,2.5,3.5,4.5,5.5" unit=""
			precision="3"
			timestamp_created="1465222499"
			timestamp_lastchange="1467969418" />
</evaluations>
<results>
<result	id_product="1"
        id_evaluation="3165"
        is_downgrading="0"
        downgrading_value="">3.98268146</result>
<result	id_product="1"
        id_evaluation="100000635"
        is_downgrading="0"
        downgrading_value="">Provides reminders to stand every hour. You can set progress updates to be given every 4, 6 or 8 hours. Congratulates you when you complete a goal and provides individual feedback and history of activity data. Notifications to focus on specific goals _eg activity__, tells you what percentage of your goal is complete </result>
<result	id_product="1"
        id_evaluation="100000636"
        is_downgrading="0"
        downgrading_value="">1</result>
<result	id_product="1"
        id_evaluation="100000637"
        is_downgrading="0"
        downgrading_value="">Using the workout app gives you a breakdown of steps, total and active calories and distance covered for that session as well adding these values onto daily accumulated totals</result>
<result	id_product="1"
        id_evaluation="100000638"
        is_downgrading="0"
        downgrading_value="">1</result>
</results>
</snapshot>
</snapshots>
</project>
</testresults>
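In real use, however, I found that elem.text was sometimes read incorrectly: the element clearly had a value, but it came back as None. I debugged for a long time without figuring out why (the XML I had constructed was not the real data, so it could not fully reproduce the real case). After a lot of searching I finally found this note in the official documentation: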


Note iterparse() only guarantees that it has seen the “>” character of a starting tag when it emits a “start” event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.
If you need a fully populated element, look for “end” events instead.
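The note can be reproduced deterministically with `ET.XMLPullParser`, which lets you feed the document in pieces by hand. This is only a sketch: the function name `text_at_start_and_end` and the two-chunk split are my own choices for illustration. After the first chunk the parser has seen the ">" of the opening tag but no text yet, so the attributes are there while the text is not.

```python
import xml.etree.ElementTree as ET


def text_at_start_and_end():
    parser = ET.XMLPullParser(events=("start", "end"))

    # Chunk 1: only the opening tags. The parser has seen ">" but no text.
    parser.feed("<results><result id_product='1'>")
    start_attrib = None
    start_text = None
    for event, elem in parser.read_events():
        if event == "start" and elem.tag == "result":
            start_attrib = dict(elem.attrib)
            start_text = elem.text

    # Chunk 2: the rest of the document, then finish parsing.
    parser.feed("3.98268146</result></results>")
    parser.close()
    end_text = None
    for event, elem in parser.read_events():
        if event == "end" and elem.tag == "result":
            end_text = elem.text

    return start_attrib, start_text, end_text


start_attrib, start_text, end_text = text_at_start_and_end()
print("start:", start_attrib, start_text)
print("end:  ", end_text)
```

With a real file the chunk boundaries fall wherever the reader's buffer happens to end, which would explain why the text is only missing some of the time.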


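So the cause is that a start event only guarantees that the attributes are present; it does not guarantee the text value or the child nodes. Switching to handling the end event therefore looked like the fix. But after I moved the handling to the end event, even the small XML file could no longer be read correctly. Why? Fortunately this one was easy to debug, and the cause turned out to be simple: I was still listening for both start and end, and on the start event I did nothing except call elem.clear(), so by the time the end event arrived for the same element, all that was left was an empty node.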
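The effect of clearing on start can be seen on a tiny in-memory document. This is a sketch under my own assumptions: `SAMPLE` and the helper `result_texts` are hypothetical, and it relies on CPython's iterparse reading the source in chunks, so a small document is already fully parsed before the first event is handed out.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document, just for this demonstration.
SAMPLE = (
    b"<results>"
    b"<result id_product='1'>3.98268146</result>"
    b"<result id_product='1'>1</result>"
    b"</results>"
)


def result_texts(data, events):
    """Collect the text of each <result> as seen on its 'end' event."""
    texts = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=events):
        if event == "end" and elem.tag == "result":
            texts.append(elem.text)
        # Called for every event that iterparse was asked to report.
        elem.clear()
    return texts


both_events = result_texts(SAMPLE, ("start", "end"))
end_only = result_texts(SAMPLE, ("end",))
print(both_events)
print(end_only)
```

When both events are requested, the clear() that runs on start wipes the text before the end handler gets to read it; when only end is requested, the text is still there.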

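So, to be very clear: you generally do not need to listen for both start and end. The blogger I copied from listened for both, which is completely unnecessary. One is enough unless you have some special need, such as keeping hold of the root node. When reading values, make sure you read them on the end event, and that the current node has not already been cleared when end fires.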

The final working code:


# coding=utf-8
__author__ = 'Arthur'
# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator when available.
import xml.etree.ElementTree as ET

if __name__ == "__main__":
   # Note: listening for 'end' alone is enough.
   for event, elem in ET.iterparse("test.xml", events=('end',)):
      if elem.tag in ('product', 'property', 'evaluation'):
         print(elem.attrib)
      elif elem.tag == 'result':
         # On 'end' the element is fully populated, so elem.text is reliable.
         a_result = dict(elem.attrib)
         a_result['value'] = elem.text
         if elem.text is None:
            print("result none")
         else:
            print(a_result)
      if elem.tag == 'products':
         print("deal with products over")
      elif elem.tag == 'propertys':
         print("deal with propertys over")
      elif elem.tag == 'evaluations':
         print("deal with evaluations over")
      elif elem.tag == 'results':
         print("deal with results over")
      # Safe here: the element has already been handled.
      elem.clear()
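If the rows need to go somewhere other than print(), the same loop can be wrapped in a generator. This is only a sketch of one possible shape, not part of my script: the name `iter_results` and the in-memory sample are made up for illustration, and only `result` elements are handled.

```python
import io
import xml.etree.ElementTree as ET


def iter_results(source):
    """Yield one dict per <result>: its attributes plus the text under 'value'."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "result":
            # Copy the attributes so that clear() below cannot affect the row.
            row = dict(elem.attrib)
            row["value"] = elem.text
            yield row
        elem.clear()


# Usage with a hypothetical in-memory document; a file path works the same way.
sample = io.BytesIO(
    b"<results>"
    b"<result id_product='1' id_evaluation='3165'>3.98268146</result>"
    b"</results>"
)
rows = list(iter_results(sample))
print(rows)
```

The caller then decides what to do with each row, and the parsing loop stays in one place.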

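Researching the new XML parsing approach and rewriting the code took only an hour, but chasing the bug I had introduced then took another hour and a half. Painful.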



Reposted from blog.csdn.net/hahajinbu/article/details/69660420