Felipe C. :
I have the following data that is supposed to be XML:
<?xml version="1.0" encoding="UTF-8"?>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<ProductTTTTT>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</ProductAAAAAA>
So, basically, I have multiple root elements (Product)...
The point is that I'm trying to transform this data into two XML documents: one for the valid nodes and another for the invalid nodes.
Valid node:
<Product>
...
</Product>
Invalid nodes: <ProductTTTTT>...</Product> and <Product>...</ProductAAAAAA>
I am wondering how I can achieve this using Java (not a web application).
- If I am not mistaken, validating the file against an XSD will invalidate the whole file, so that is not an option.
- Using the default JAXB parser (unmarshaller) leads to the same problem as the item above, since internally it creates an XSD from my entity.
- Using XPath alone will (from what I know) just return the whole file; I did not find a way to express something like GET !VALID (this is just to illustrate the idea).
- Using XQuery (maybe?). By the way, how would XQuery be used with JAXB?
- XSL(T) runs into the same problem as XPath, since it uses XPath to select the content.
So, which method can I use to achieve this objective? (If possible, please provide links or code.)
Mads Hansen :
If the file contains lines with start and end tags whose names begin with "Product", you could:
- use a file scanner to split the document into individual pieces whenever a line starts with <Product or </Product
- attempt to parse each extracted piece of text as XML using an XML API
  - If it succeeds, add that object to a list of "good" well-formed XML documents
    - then perform any additional schema validation or validity checks
  - If it throws a parse error, catch it, and add that snippet of text to the list of "bad" items that need to be cleaned up or otherwise handled
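The "additional schema validation" step in the list above could be sketched roughly as follows, using the standard javax.xml.validation API. The class name ProductValidator and the inline XSD are assumptions made up for illustration (the question does not show a schema), so treat this as a starting point rather than a definitive implementation.

```java
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import java.io.IOException;
import java.io.StringReader;

public class ProductValidator {

    // Hypothetical schema: the real XSD for Product is not given in the question.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='Product'><xs:complexType><xs:sequence>"
          + "<xs:element name='id' type='xs:int'/>"
          + "<xs:element name='description' type='xs:string'/>"
          + "<xs:element name='price' type='xs:decimal'/>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:schema>";

    // Returns true only if the snippet is well-formed AND valid against the schema.
    public static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // schema validation of a DOM needs a namespace-aware tree
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false; // either not well-formed or rejected by the schema
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<Product><id>1</id><description>A new product</description>"
                  + "<price>123.45</price></Product>";
        System.out.println(isValid(ok));
        System.out.println(isValid("<Product><id>1</id></Product>"));
    }
}
```

Because each snippet is validated on its own, one invalid Product does not cause the rest of the file to be rejected, which was the concern with validating the whole file against an XSD.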
An example to get you started:
package com.stackoverflow.questions.q52012383;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
public class FileSplitter {

    public static void parseFile(File file, String elementName)
            throws ParserConfigurationException, IOException {
        List<Document> good = new ArrayList<>();
        List<String> bad = new ArrayList<>();
        String startTag = "<" + elementName;
        String endTag = "</" + elementName;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        StringBuilder buffer = new StringBuilder();
        String line;
        boolean append = false;
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextLine()) {
                line = scanner.nextLine();
                if (line.startsWith(startTag)) {
                    append = true; // start accumulating content
                } else if (line.startsWith(endTag)) {
                    append = false;
                    buffer.append(line);
                    // instead of the line above, you could hard-code the ending tag to compensate for bad data:
                    // buffer.append(endTag + ">");
                    try { // to parse as XML
                        builder = factory.newDocumentBuilder();
                        Document document = builder.parse(new InputSource(new StringReader(buffer.toString())));
                        good.add(document); // parsed successfully, add it to the good list
                    } catch (SAXException ex) {
                        bad.add(buffer.toString()); // something is wrong, not well-formed XML
                    } finally {
                        // reset the buffer after good AND bad snippets, so a bad one
                        // does not leak into the next XML doc
                        buffer.setLength(0);
                    }
                }
                if (append) { // accumulate content
                    buffer.append(line);
                }
            }
            System.out.println("Good items: " + good.size() + " Bad items: " + bad.size());
            // do stuff with the good/bad results...
        }
    }

    public static void main(String[] args)
            throws ParserConfigurationException, IOException {
        File file = new File("/tmp/test.xml");
        parseFile(file, "Product");
    }
}
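For the "do stuff with the good/bad results" part, one possible way to end up with the two XML documents the question asks for is sketched below. The class name ResultWriter and the element names Products, InvalidProducts and Invalid are hypothetical. Since the bad snippets are not well-formed, this sketch assumes it is acceptable to store them as escaped text inside a wrapper element rather than as markup.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

public class ResultWriter {

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Copies the root element of every well-formed snippet under one new root element.
    public static Document wrapGood(List<Document> good, String rootName) {
        Document out = newDocument();
        Element root = out.createElement(rootName);
        out.appendChild(root);
        for (Document doc : good) {
            // importNode makes a deep copy owned by the output document
            root.appendChild(out.importNode(doc.getDocumentElement(), true));
        }
        return out;
    }

    // Stores every malformed snippet as text; the serializer escapes the angle brackets.
    public static Document wrapBad(List<String> bad, String rootName, String itemName) {
        Document out = newDocument();
        Element root = out.createElement(rootName);
        out.appendChild(root);
        for (String snippet : bad) {
            Element item = out.createElement(itemName);
            item.setTextContent(snippet);
            root.appendChild(item);
        }
        return out;
    }

    // Serializes a DOM tree to a string with an identity transform.
    public static String serialize(Document doc) {
        try {
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document product = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<Product><id>1</id></Product>")));
        System.out.println(serialize(wrapGood(Arrays.asList(product), "Products")));
        System.out.println(serialize(wrapBad(
                Arrays.asList("<ProductTTTTT><id>1</id></Product>"), "InvalidProducts", "Invalid")));
    }
}
```

In parseFile, the good and bad lists could be passed to wrapGood and wrapBad, and the serialized strings written to two separate files.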