Transmitting data with Python and XML [Part 1]: generating the XML

0 Background and environment setup

0.1 Background

 One of our projects uses XML to transmit data. There are two main parts to this:

  1. ** Exporting data as XML. ** A colleague scrapes data from web pages and packages it in a well-defined XML format; the data is then loaded into the original-data store through the project's web platform.
  2. ** Parsing the XML for processing and modeling. ** Because Python is well suited to algorithm work, the backend calls Python to parse the transmitted XML and then process and model the data.

 Some basic knowledge of XML is assumed here.

 The XML format defined for data transmission:

# The defined XML format (indented here for readability; the generated file is a single line)
<?xml version="1.0" ?>
<OriginalDataStorage.Entity.database>
  <Name>DB_name</Name>
  <Description>this is a databases</Description>
  <tables>
    <OriginalDataStorage.Entity.table>
      <Name>TB_name</Name>
      <Description>this is a table</Description>
      <fieldItems>
        <OriginalDataStorage.Entity.fieldItem>
          <Name>col</Name>
          <Length>3</Length>
          <DataItemType>varchar</DataItemType>
        </OriginalDataStorage.Entity.fieldItem>
        <OriginalDataStorage.Entity.fieldItem>
          <Name>col2</Name>
          <Length>2</Length>
          <DataItemType>varchar</DataItemType>
        </OriginalDataStorage.Entity.fieldItem>
      </fieldItems>
      <dataRows>
        <OriginalDataStorage.Entity.dataRowItem>
          <dataItems>
            <OriginalDataStorage.Entity.dataItem>
              <fieldName>col</fieldName>
              <fieldValue>jsh</fieldValue>
            </OriginalDataStorage.Entity.dataItem>
            <OriginalDataStorage.Entity.dataItem>
              <fieldName>col2</fieldName>
              <fieldValue>24</fieldValue>
            </OriginalDataStorage.Entity.dataItem>
          </dataItems>
        </OriginalDataStorage.Entity.dataRowItem>
        <OriginalDataStorage.Entity.dataRowItem>
          <dataItems>
            <OriginalDataStorage.Entity.dataItem>
              <fieldName>col</fieldName>
              <fieldValue>tc</fieldValue>
            </OriginalDataStorage.Entity.dataItem>
            <OriginalDataStorage.Entity.dataItem>
              <fieldName>col2</fieldName>
              <fieldValue>25</fieldValue>
            </OriginalDataStorage.Entity.dataItem>
          </dataItems>
        </OriginalDataStorage.Entity.dataRowItem>
      </dataRows>
    </OriginalDataStorage.Entity.table>
  </tables>
</OriginalDataStorage.Entity.database>

Structure mapping: [figure: structure of the XML format above]

0.2 Environment

 Windows 7 64-bit, Python 3.6.4, PyCharm 2018.1.4
 Python package used: xml (standard library)

# Writing and parsing XML both use the same two modules
import xml.etree.ElementTree as ET  # build the XML tree step by step
import xml.dom.minidom as MD  # output

1 Writing XML

1.1 Problem

  First, be clear about what the algorithm has to do:

  1. Each column needs a Type and a Length: the most common data type among the column's values, and the maximum length of any value in the column;
  2. Each row must be written to the XML, one column value at a time.

1.2 Implementation

1.21 Finding the most common type in a column

# Find the most common type; for a pandas.DataFrame column, convert it to a list first
import datetime  # needed for the datetime.datetime check below


def list_find_most_type(mylist):
    """
    Find the most common element type in a list: str, int, float, datetime or other.
    :param mylist: input list
    :return: name of the most common type
    """
    count_int = 0  # per-type counters
    count_flo = 0
    count_str = 0
    count_oth = 0
    count_datetime = 0
    for i in mylist:
        if isinstance(i, int):
            count_int += 1
        elif isinstance(i, float):
            count_flo += 1
        elif isinstance(i, str):
            count_str += 1
        elif isinstance(i, datetime.datetime):
            count_datetime += 1
        else:
            count_oth += 1
            print('Warning: value of an unhandled type.', str(type(i)))
    list_count = [count_int, count_flo, count_str, count_oth, count_datetime]  # counts per type
    list_type = ['int', 'float', 'varchar', 'varchar', 'datetime']  # type names in the same order
    index = list_count.index(max(list_count))  # index of the most common type
    maxType = list_type[index]  # name of the most common type
    if 'varchar' not in maxType and count_str != 0:  # e.g. storing a str in an int column is likely to fail
        print('Warning: this column is mostly {}, but contains {} string value(s); storing it may fail.'.format(maxType, count_str))
    return maxType
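
As a more compact way to see the same idea, the counting can also be sketched with `collections.Counter`. This is only an illustration, not the code used in the project: `xml_type_name` and `most_common_type` are hypothetical names, str and "other" values are counted together as 'varchar', and bool values (a subclass of int in Python) are deliberately treated as 'varchar', which the function above does not do.

```python
from collections import Counter
import datetime


def xml_type_name(value):
    """Hypothetical helper: map a Python value to a type name used in the XML."""
    if isinstance(value, bool):  # bool is a subclass of int; assumption: store it as text
        return 'varchar'
    if isinstance(value, int):
        return 'int'
    if isinstance(value, float):
        return 'float'
    if isinstance(value, datetime.datetime):
        return 'datetime'
    return 'varchar'  # str and every other type


def most_common_type(values):
    """Return the type name that occurs most often in values."""
    counts = Counter(xml_type_name(v) for v in values)
    return counts.most_common(1)[0][0]


print(most_common_type(['jsh', 'tc']))    # varchar
print(most_common_type([24, 25, 'n/a']))  # int
```

Unlike `list_find_most_type`, this sketch prints no warning when a mostly numeric column also contains strings.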

1.22 Finding the maximum length in a column:

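# In UTF-8, an ASCII letter takes one byte and a Chinese character takes three bytes
def list_find_max_len(mylist):
    """
    :param mylist: input list
    :return: length of the longest element, as an estimated UTF-8 byte count
    """

    def stat_list_ele_len(ele):
        """
        :param ele: element (as a string)
        :return: estimated byte length of the element
        """
        count_zh = 0
        for s in ele:
            if '\u4e00' <= s <= '\u9fff':  # CJK Unified Ideographs (Unicode code point range)
                count_zh += 1
        len_ele = len(ele) + 2 * count_zh  # each Chinese character takes three bytes
        return len_ele

    mylist = [stat_list_ele_len(str(i)) for i in mylist]
    maxLen = max(mylist)  # largest length found
    return maxLen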
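
The byte length can also be measured directly by encoding each value, which may be simpler than counting Chinese characters by hand. The sketch below is an alternative under that assumption, not the author's method: `encoded_len` and `max_encoded_len` are hypothetical names, and the results differ from `list_find_max_len` for non-ASCII characters outside the `\u4e00`-`\u9fff` range (full-width punctuation, for example), which this version counts at their real encoded size.

```python
def encoded_len(value, encoding='utf-8'):
    """Byte length of str(value) in the given encoding."""
    return len(str(value).encode(encoding))


def max_encoded_len(values, encoding='utf-8'):
    """Largest encoded byte length in values; 0 for an empty list."""
    return max((encoded_len(v, encoding) for v in values), default=0)


print(max_encoded_len(['jsh', 'tc']))         # 3
print(max_encoded_len(['中文', 'ab']))         # 6
print(max_encoded_len(['中文', 'ab'], 'gbk'))  # 4
```

Because a Chinese character takes three bytes in UTF-8 but two in gbk, a Length computed from UTF-8 is never smaller than the gbk length for such text.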

1.23 Building the XML structure step by step

  • 1. Create the XML tree and write the OSE.database node:
def xml_db():
    """
    Create the DB entity node (the root), which has three sub-nodes.
    :return: the root and its third sub-node, under which tables are created
    """
    db = ET.Element('OriginalDataStorage.Entity.database')  # root node, OSE.database
    db_n1 = ET.SubElement(db, 'Name')  # first of the three children of the root
    db_n1.text = input('Please enter the name of the DB you want to store:\t')  # text of the Name tag
    db_n2 = ET.SubElement(db, 'Description')
    db_n2.text = 'this is a databases'
    db_n3 = ET.SubElement(db, 'tables')
    return db, db_n3
  • 2. Write the tables node:
def xml_tablesItem(db, db_n3):
    """
    Under db_n3 (the tables node), create a table entity node (OSE.table), which has four sub-nodes.
    :return: db plus the third and fourth sub-nodes, used to create the table's fieldItems and dataRows.
    """
    tb = ET.SubElement(db_n3, 'OriginalDataStorage.Entity.table')  # create a table entity
    tb_n1 = ET.SubElement(tb, 'Name')  # first child of the table entity
    tb_n1.text = input('Please enter the name of the table you want to store:\t')
    tb_n2 = ET.SubElement(tb, 'Description')
    tb_n2.text = 'this is a table'
    tb_n3 = ET.SubElement(tb, 'fieldItems')
    tb_n4 = ET.SubElement(tb, 'dataRows')
    return db, tb_n3, tb_n4
  • 3. Write the fieldItems nodes:
def xml_fieldItems(db, tb_n3, Name, Length, Type):
    """
    Under tb_n3 (the fieldItems node), create a field entity node (OSE.fieldItem) with its three descriptions: Name, Length and DataItemType.
    :return: the db tree being built.
    """
    fieldItem = ET.SubElement(tb_n3, 'OriginalDataStorage.Entity.fieldItem')
    field_name = ET.SubElement(fieldItem, 'Name')
    field_name.text = str(Name)
    field_length = ET.SubElement(fieldItem, 'Length')
    field_length.text = str(Length)
    field_type = ET.SubElement(fieldItem, 'DataItemType')
    field_type.text = str(Type)
    return db
  • 4. Write the dataRows nodes:
def xml_dataRows(db, tb_n4):
    """
    Under tb_n4 (the dataRows node), create a row entity node (OSE.dataRowItem) with one dataItems child.
    :return: db and the dataItems node, which receives this row's per-column values.
    """
    dataRowItem = ET.SubElement(tb_n4, 'OriginalDataStorage.Entity.dataRowItem')
    dataItems = ET.SubElement(dataRowItem, 'dataItems')
    return db, dataItems
  • 5. Write the OSE.dataItem nodes:
def xml_dataItem(db, dataItems, Name, Value):
    """
    Under dataItems, create a data item entity node (OSE.dataItem) with two children: fieldName and fieldValue.
    :return: the db tree being built.
    """
    dataItem = ET.SubElement(dataItems, 'OriginalDataStorage.Entity.dataItem')
    fieldName = ET.SubElement(dataItem, 'fieldName')
    fieldName.text = str(Name)
    fieldValue = ET.SubElement(dataItem, 'fieldValue')
    fieldValue.text = str(Value)
    return db
  • 6. Output the XML
     Following the custom XML rules above, the functions so far build the XML tree step by step. To export it, the tree still has to be serialized and written to a file:

 Following the project's earlier definition, gbk is used as the encoding throughout.

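def write_xml_out(db, outpath, encode):
    """
    Export the iteratively built <class 'xml.etree.ElementTree.Element'> tree:
    convert it to a string, parse that into a minidom document, then write it out as XML.
    :param db: the XML tree built above
    :param outpath: output path
    :param encode: encoding of the output file, e.g. 'gbk'
    :return:
    """
    xml_string = ET.tostring(db)
    tree = MD.parseString(xml_string)
    xml_string = tree.toxml()  # note: the XML declaration carries no encoding attribute
    #  print(db, tree, xml_string)
    with open(outpath, 'w', encoding=encode) as fxml:
        fxml.write(xml_string)
    print('XML created successfully.')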
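
To show how the six steps fit together, here is a minimal non-interactive sketch that builds the sample document from section 0.1 out of in-memory data. It is an illustration under assumptions rather than the project's code: `add_text`, `build_db` and `to_xml_string` are hypothetical names, the database and table names are passed as arguments instead of being read with `input()`, and the field list and rows are hard-coded sample values.

```python
import xml.etree.ElementTree as ET
import xml.dom.minidom as MD

PREFIX = 'OriginalDataStorage.Entity.'  # common prefix of the entity tags


def add_text(parent, tag, text):
    """Append <tag>text</tag> under parent and return the new node."""
    node = ET.SubElement(parent, tag)
    node.text = str(text)
    return node


def build_db(db_name, tb_name, fields, rows):
    """
    fields: list of (name, length, type) tuples
    rows:   list of dicts mapping column name -> value
    """
    db = ET.Element(PREFIX + 'database')  # step 1: root node
    add_text(db, 'Name', db_name)
    add_text(db, 'Description', 'this is a databases')
    tables = ET.SubElement(db, 'tables')

    tb = ET.SubElement(tables, PREFIX + 'table')  # step 2: one table
    add_text(tb, 'Name', tb_name)
    add_text(tb, 'Description', 'this is a table')
    field_items = ET.SubElement(tb, 'fieldItems')
    data_rows = ET.SubElement(tb, 'dataRows')

    for name, length, type_name in fields:  # step 3: column descriptions
        item = ET.SubElement(field_items, PREFIX + 'fieldItem')
        add_text(item, 'Name', name)
        add_text(item, 'Length', length)
        add_text(item, 'DataItemType', type_name)

    for row in rows:  # steps 4 and 5: one dataRowItem per row, one dataItem per column
        row_item = ET.SubElement(data_rows, PREFIX + 'dataRowItem')
        data_items = ET.SubElement(row_item, 'dataItems')
        for name, value in row.items():
            data_item = ET.SubElement(data_items, PREFIX + 'dataItem')
            add_text(data_item, 'fieldName', name)
            add_text(data_item, 'fieldValue', value)
    return db


def to_xml_string(db):
    """Step 6: serialize the same way write_xml_out does, without writing a file."""
    return MD.parseString(ET.tostring(db)).toxml()


db = build_db(
    'DB_name', 'TB_name',
    fields=[('col', 3, 'varchar'), ('col2', 2, 'varchar')],
    rows=[{'col': 'jsh', 'col2': 24}, {'col': 'tc', 'col2': 25}],
)
xml_string = to_xml_string(db)
print(xml_string)
```

In practice the field tuples would come from `list_find_most_type` and `list_find_max_len` applied to each column, and the resulting string would be written to disk with the chosen encoding, as `write_xml_out` does.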

2 Parsing XML

This post is already long; parsing is covered in the next article.


Origin blog.csdn.net/qq_40260867/article/details/84778008