70 GB ALTO XML file parsing

Tanu :

I am having trouble parsing a 70 GB XML file into CSV.

This is what the XML looks like:

<?xml version="1.0" encoding="utf-8"?>
<File>
  <row Id="1" Name="tanu" Count="289949" />
  <row Id="2" Name="daniel" Count="863524" />
  <row Id="3" Name="ricky" Count="1909662"/>
</File>

Since it's such a big file, I cannot read the whole thing in one go as it kills the kernel. I want to iterate over some number of rows at a time and write them to a CSV file.

I am using the following code:

import xml.etree.ElementTree as et
import pandas as pd

path = 'file path'
root = et.parse(path)
rows = root.findall('.//row')
column_names = ['Id','Name','Count']
xml_data = [[row.get(col) for col in column_names] for row in rows]
data = pd.DataFrame(xml_data,columns=column_names)
data.to_csv('File.csv', index=False, header=True)

I would really appreciate it if anyone could tell me how to read the XML in chunks and write it to CSV. I have not been able to get the iterator approach working properly in the code above.
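For reference, here is a minimal sketch of how the streaming iterparse approach could look for the structure shown above; the 'file path' placeholder and the chunk_size value are illustrative, not from the original post:

import csv
import xml.etree.ElementTree as ET

path = 'file path'                     # placeholder path, as in the question
column_names = ['Id', 'Name', 'Count']
chunk_size = 100000                    # assumed batch size; tune to available memory

with open('File.csv', 'w', newline='') as fdout:
    writer = csv.writer(fdout)
    writer.writerow(column_names)

    # iterparse streams the document element by element instead of loading it whole
    context = ET.iterparse(path, events=('start', 'end'))
    _, root = next(context)            # grab the root element so it can be cleared as we go

    batch = []
    for event, elem in context:
        if event == 'end' and elem.tag == 'row':
            batch.append([elem.get(col) for col in column_names])
            if len(batch) >= chunk_size:
                writer.writerows(batch)
                batch = []
            root.clear()               # drop processed rows so memory stays bounded
    if batch:
        writer.writerows(batch)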

Serge Ballesta :

I would use a parser that can process an XML file in chunks, like the expat parser. The only constraint is that you have to know in advance the columns you want to write to the CSV file. The code could be:

import csv
import xml.parsers.expat

with open('file path', 'rb') as fdin, open('File.csv', 'w', newline='') as fdout:
    writer = csv.DictWriter(fdout, ['Id', 'Name', 'Count'],
                            extrasaction='ignore')   # any additional field will be ignored
    writer.writeheader()

    def start_elt(name, attrs):
        # expat calls this for every start tag; only <row> attributes are written out
        if name == 'row':
            writer.writerow(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_elt
    parser.ParseFile(fdin)

With the sample file, the resulting File.csv contains:

Id,Name,Count
1,tanu,289949
2,daniel,863524
3,ricky,1909662
