Which methods can be used to return valid and invalid XML data from a file in Java?

Felipe C. :

I have the following data that is supposed to be XML:

<?xml version="1.0" encoding="UTF-8"?>
<Product>
    <id>1</id>
    <description>A new product</description>
    <price>123.45</price>
</Product>

<Product>
    <id>1</id>
    <description>A new product</description>
    <price>123.45</price>
</Product>

<ProductTTTTT>
    <id>1</id>
    <description>A new product</description>
    <price>123.45</price>
</Product>

<Product>
    <id>1</id>
    <description>A new product</description>
    <price>123.45</price>
</ProductAAAAAA>

So basically I have multiple root elements (Product), and some of them have mismatched start and end tags.

I'm trying to split this data into two XML documents: one for the valid nodes and another for the invalid ones.

Valid node:

<Product>
   ...
</Product>

Invalid nodes: <ProductTTTTT>...</Product> and <Product>...</ProductAAAAAA>

Now I am wondering how I can achieve this using plain Java (not web).

  • If I am not wrong, validating it against an XSD will invalidate the whole file, so that is not an option.
  • Using the default JAXB parser (unmarshaller) leads to the same problem as the item above, since internally it creates an XSD from my entity.
  • Using XPath alone will (from what I know) just return the whole file. I did not find a way to select something like GET !VALID (that is just to explain the idea...)
  • Using XQuery (maybe?). By the way, how do I use XQuery with JAXB?
  • XSL(T) leads to the same problem as XPath, since it uses XPath to select the content.

So... which method can I use to achieve this? (If possible, please provide links or code.)

Mads Hansen :

If the file contains lines with start and end tags whose names begin with "Product", you could:

  • use a file scanner to split this document into individual pieces whenever a line starts with <Product or </Product
  • attempt to parse the extracted text as XML using an XML API.
    • If it succeeds, add that object to a list of "good" well-formed XML documents
      • then perform any additional schema validation or validity checks
    • If it throws a parse error, catch it, and add that snippet of text to the list of "bad" items that need to be cleaned up or otherwise handled

An example to get you started:

package com.stackoverflow.questions.q52012383; // a package name segment cannot start with a digit

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.StringReader;

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class FileSplitter {

    public static void parseFile(File file, String elementName) 
      throws ParserConfigurationException, IOException {

        List<Document> good = new ArrayList<>();
        List<String> bad = new ArrayList<>();

        String startTag = "<" + elementName;
        String endTag = "</" + elementName;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        StringBuilder buffer = new StringBuilder();
        String line;
        boolean append = false;

        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextLine()) {
                line = scanner.nextLine();

                if (line.startsWith(startTag)) {
                    append = true; //start accumulating content
                } else if (line.startsWith(endTag)) {
                    append = false;
                    buffer.append(line); 
                    //instead of the line above, you could hard-code the ending tag to compensate for bad data:
                    // buffer.append(endTag + ">");

                    try { // to parse as XML
                        builder = factory.newDocumentBuilder();
                        Document document = builder.parse(new InputSource(new StringReader(buffer.toString())));
                        good.add(document); // parsed successfully, add it to the good list
                    } catch (SAXException ex) {
                        bad.add(buffer.toString()); // something is wrong, not well-formed XML
                    } finally {
                        buffer.setLength(0); // always reset the buffer, so a bad snippet does not leak into the next one
                    }
                }

                if (append) { // accumulate content
                    buffer.append(line);
                }
            }
            System.out.println("Good items: " + good.size() + " Bad items: " + bad.size());
            //do stuff with the good/bad results...
        }
    }

    public static void main(String[] args) 
      throws ParserConfigurationException, IOException {
        File file = new File("/tmp/test.xml");
        parseFile(file, "Product");
    }

}
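
To get from those two lists to the two output documents the question asks for, one possible continuation is sketched below. It is only a sketch under assumptions: the class name ResultWriter, its methods mergeGood, wrapBad and serialize, and the wrapper element names "Products", "Rejected" and "Item" are hypothetical and not part of the code above. Since the bad snippets are not well-formed, they cannot be copied in as elements, so this sketch stores them as escaped text.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns the good/bad lists into two XML documents.
public class ResultWriter {

    // Copies the root element of every parsed fragment under one new root element.
    public static Document mergeGood(List<Document> good, String rootName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document merged = builder.newDocument();
        Element root = merged.createElement(rootName);
        merged.appendChild(root);
        for (Document doc : good) {
            // importNode makes a deep copy that is owned by the merged document
            root.appendChild(merged.importNode(doc.getDocumentElement(), true));
        }
        return merged;
    }

    // Stores every rejected snippet as a text node; its markup is escaped on output.
    public static Document wrapBad(List<String> bad, String rootName, String itemName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement(rootName);
        doc.appendChild(root);
        for (String snippet : bad) {
            Element item = doc.createElement(itemName);
            item.appendChild(doc.createTextNode(snippet));
            root.appendChild(item);
        }
        return doc;
    }

    // Serializes a DOM document to a string with the default identity transformer.
    public static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document ok = builder.parse(new InputSource(new StringReader("<Product><id>1</id></Product>")));
        String broken = "<ProductTTTTT><id>1</id></Product>";

        System.out.println(serialize(mergeGood(Arrays.asList(ok), "Products")));
        System.out.println(serialize(wrapBad(Arrays.asList(broken), "Rejected", "Item")));
    }
}
```

In parseFile, the "do stuff with the good/bad results" comment would be the place to call something like these helpers and write the two strings to files. Keeping the rejected snippets as text means the second document stays well-formed, and the original markup can be recovered later with getTextContent().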
