Scenario
While writing a Python crawler, I ran into data in XML format that could not be parsed properly with XPath. What can be done in this case?
Test environment
- Python 3.9.13
Test Data
<?xml version="1.0" encoding="UTF-8"?>
<tradeproperty>
<INDEX>
<TRADING_DAY>20221012</TRADING_DAY>
<PRODUCT_ID>IC</PRODUCT_ID>
<INSTRUMENT_ID>IC2210</INSTRUMENT_ID>
<INSTRUMENT_MONTH>2210</INSTRUMENT_MONTH>
<BASIS_PRICE>6361.6</BASIS_PRICE>
<OPEN_DATE>20220822</OPEN_DATE>
<END_TRADING_DAY>20221021</END_TRADING_DAY>
<UPPER_VALUE>0.1</UPPER_VALUE>
<LOWER_VALUE>0.1</LOWER_VALUE>
<UPPERLIMITPRICE>6256</UPPERLIMITPRICE>
<LOWERLIMITPRICE>5118.8</LOWERLIMITPRICE>
<LONG_LIMIT>1200</LONG_LIMIT>
</INDEX>
<INDEX>
<TRADING_DAY>20221012</TRADING_DAY>
<PRODUCT_ID>IC</PRODUCT_ID>
<INSTRUMENT_ID>IC2211</INSTRUMENT_ID>
<INSTRUMENT_MONTH>2211</INSTRUMENT_MONTH>
<BASIS_PRICE>5929.6</BASIS_PRICE>
<OPEN_DATE>20220919</OPEN_DATE>
<END_TRADING_DAY>20221118</END_TRADING_DAY>
<UPPER_VALUE>0.1</UPPER_VALUE>
<LOWER_VALUE>0.1</LOWER_VALUE>
<UPPERLIMITPRICE>6231.6</UPPERLIMITPRICE>
<LOWERLIMITPRICE>5098.8</LOWERLIMITPRICE>
<LONG_LIMIT>1200</LONG_LIMIT>
</INDEX>
</tradeproperty>
Parsing code
import xml.etree.ElementTree as ET
test_xml = '''<?xml version="1.0" encoding="UTF-8"?>
<tradeproperty>
<INDEX>
<TRADING_DAY>20221012</TRADING_DAY>
<PRODUCT_ID>IC</PRODUCT_ID>
<INSTRUMENT_ID>IC2210</INSTRUMENT_ID>
<INSTRUMENT_MONTH>2210</INSTRUMENT_MONTH>
<BASIS_PRICE>6361.6</BASIS_PRICE>
<OPEN_DATE>20220822</OPEN_DATE>
<END_TRADING_DAY>20221021</END_TRADING_DAY>
<UPPER_VALUE>0.1</UPPER_VALUE>
<LOWER_VALUE>0.1</LOWER_VALUE>
<UPPERLIMITPRICE>6256</UPPERLIMITPRICE>
<LOWERLIMITPRICE>5118.8</LOWERLIMITPRICE>
<LONG_LIMIT>1200</LONG_LIMIT>
</INDEX>
<INDEX>
<TRADING_DAY>20221012</TRADING_DAY>
<PRODUCT_ID>IC</PRODUCT_ID>
<INSTRUMENT_ID>IC2211</INSTRUMENT_ID>
<INSTRUMENT_MONTH>2211</INSTRUMENT_MONTH>
<BASIS_PRICE>5929.6</BASIS_PRICE>
<OPEN_DATE>20220919</OPEN_DATE>
<END_TRADING_DAY>20221118</END_TRADING_DAY>
<UPPER_VALUE>0.1</UPPER_VALUE>
<LOWER_VALUE>0.1</LOWER_VALUE>
<UPPERLIMITPRICE>6231.6</UPPERLIMITPRICE>
<LOWERLIMITPRICE>5098.8</LOWERLIMITPRICE>
<LONG_LIMIT>1200</LONG_LIMIT>
</INDEX>
</tradeproperty>
'''
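# Load the data from the XML string
root = ET.fromstring(test_xml)
# Iterate over each XML record: INDEX
for child in root:
    print('=' * 60)
    # Iterate over the fields inside each record
    ## Option 1: iterate over the child elements directly
    for item in child:
        print(f'{item.tag}:{item.text}')
    ## Option 2: access the child elements by index
    # for i in range(len(child)):
    #     print(f'{child[i].tag}:{child[i].text}')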
Sample output:
============================================================
TRADING_DAY:20221012
PRODUCT_ID:IC
INSTRUMENT_ID:IC2210
INSTRUMENT_MONTH:2210
BASIS_PRICE:6361.6
OPEN_DATE:20220822
END_TRADING_DAY:20221021
UPPER_VALUE:0.1
LOWER_VALUE:0.1
UPPERLIMITPRICE:6256
LOWERLIMITPRICE:5118.8
LONG_LIMIT:1200
============================================================
TRADING_DAY:20221012
PRODUCT_ID:IC
INSTRUMENT_ID:IC2211
INSTRUMENT_MONTH:2211
BASIS_PRICE:5929.6
OPEN_DATE:20220919
END_TRADING_DAY:20221118
UPPER_VALUE:0.1
LOWER_VALUE:0.1
UPPERLIMITPRICE:6231.6
LOWERLIMITPRICE:5098.8
LONG_LIMIT:1200
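If the records are needed for further processing rather than just printing, one possible follow-up is to collect each INDEX element into a dictionary. The sketch below is not part of the original code: the shortened `sample_xml` string and the names `index_to_dict`, `records`, `match`, and `basis_price` are made up for illustration. It also shows that `xml.etree.ElementTree` supports a limited subset of XPath in `find` and `findall`, which may be enough for simple lookups.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same structure as the test data above
# (only two fields per record, to keep the sketch small).
sample_xml = """<?xml version="1.0" encoding="UTF-8"?>
<tradeproperty>
<INDEX>
<INSTRUMENT_ID>IC2210</INSTRUMENT_ID>
<BASIS_PRICE>6361.6</BASIS_PRICE>
</INDEX>
<INDEX>
<INSTRUMENT_ID>IC2211</INSTRUMENT_ID>
<BASIS_PRICE>5929.6</BASIS_PRICE>
</INDEX>
</tradeproperty>
"""


def index_to_dict(index_elem):
    """Map each child tag of one INDEX element to its text (hypothetical helper)."""
    return {field.tag: field.text for field in index_elem}


root = ET.fromstring(sample_xml)

# findall("INDEX") returns the direct INDEX children of the root element.
records = [index_to_dict(elem) for elem in root.findall("INDEX")]

# ElementTree's XPath subset allows a [tag='text'] predicate,
# so a single record can be looked up by its INSTRUMENT_ID.
match = root.find("./INDEX[INSTRUMENT_ID='IC2211']")
basis_price = float(match.findtext("BASIS_PRICE"))

print(records)
print(basis_price)
```

All values come out as strings, so numeric fields such as BASIS_PRICE need an explicit conversion, as done with `float` here.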