Java: remove < and > from text in XML (not tags)

J. Snipe :

I'm having a hard time escaping XML so that it can be processed by Java. I'm using JTidy to escape unwanted characters, but I'm struggling to remove "<" and ">" from values such as <tag> capacity < 1000 </tag>

I'm using the code below to escape the input:

    public String CleanXML(String input){

        Tidy tidy = new Tidy();
        tidy.setInputEncoding("UTF-16");
        tidy.setOutputEncoding("UTF-16");
        tidy.setWraplen(Integer.MAX_VALUE);
        tidy.setXmlOut(true);
        tidy.setSmartIndent(true);
        tidy.setXmlTags(true);
        tidy.setMakeClean(true);
        tidy.setForceOutput(true);
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        StringReader in = new StringReader(input);
        StringWriter out = new StringWriter();
        tidy.parse(in, out);

        return out.toString();
    }

Nilanka Manoj :

Use the following function:

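    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Captures the text between <tag> and </tag>; DOTALL lets it span lines.
    private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

    public String CleanXML(String input){
        final Matcher matcher = TAG_REGEX.matcher(input);
        while (matcher.find()) {
            String value = matcher.group(1);
            // Keep only letters, digits and whitespace in the value.
            String valueReplace = value.replaceAll("[^a-zA-Z0-9\\s]", "");
            // Strings are immutable, so the result of replace() must be assigned back.
            input = input.replace(value, valueReplace);
        }
        return input;
    }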

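It uses a regular expression search to get the values between the tags, then removes all non-alphanumeric characters from them (whitespace is kept), so "<" and ">" are dropped along with any other punctuation. The regular expression and the basic idea come from "Java regex to extract text between tags".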
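
If only "<" and ">" should go and other punctuation should stay, a narrower variant might look like the sketch below. It is not part of the original answer: the class name XmlTextCleaner and the method stripAngleBrackets are made up for illustration, and it assumes the element is literally named <tag> and never contains nested elements. It rebuilds the string with Matcher.appendReplacement, so only the matched span is touched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlTextCleaner {
    // Matches the text between <tag> and </tag>; DOTALL lets the value span lines.
    private static final Pattern TAG_REGEX =
            Pattern.compile("<tag>(.*?)</tag>", Pattern.DOTALL);

    // Hypothetical helper: strips only '<' and '>' from the text of each <tag> element.
    public static String stripAngleBrackets(String input) {
        Matcher matcher = TAG_REGEX.matcher(input);
        // StringBuffer keeps this compatible with Java 8.
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String cleaned = matcher.group(1).replaceAll("[<>]", "");
            // quoteReplacement stops '$' and '\' in the value from being
            // interpreted as group references.
            matcher.appendReplacement(result,
                    Matcher.quoteReplacement("<tag>" + cleaned + "</tag>"));
        }
        // Copy whatever follows the last match unchanged.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // Prints: <tag> capacity  1000 </tag>
        System.out.println(stripAngleBrackets("<tag> capacity < 1000 </tag>"));
    }
}
```

Removing the character leaves the two surrounding spaces in place. If the comparison has to survive, replacing "<" with "&lt;" and ">" with "&gt;" inside the value would keep the XML well-formed without losing it.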
