I'm trying to process an XML file. Before that, I need to remove the DOCTYPE and ENTITY declarations from the input XML.
I'm using the code below to remove them:
fileContent = fileContent.replaceAll("<!ENTITY ((.|\n|\r)*?)\">", "");
fileContent = fileContent.replaceAll("<!DOCTYPE((.|\n|\r)*?)>", "");
This removes the entity and then the doctype. It works fine if the XML contains either of the following doctype declarations:
<!DOCTYPE ichicsr SYSTEM "http://www.w3.org/TR/html4/frameset.dtd">
<!DOCTYPE ichicsr SYSTEM "D:\UPGRADE\NTServices\Server\\Xml21.dtd"
[<!ENTITY % entitydoc SYSTEM "D:\UPGRADE\NTServices\Server\\latin-entities.dtd"> %entitydoc;]>
But if the doctype is as given below, it doesn't work and the root tag of the XML gets stripped off:
<!DOCTYPE ichicsr SYSTEM "D:\UPGRADE\NTServices\Server\\Xml21.dtd"
[<!ENTITY % entitydoc SYSTEM 'D:\UPGRADE\NTServices\Server\\Xml21.dtd'>
]>
Please let me know if the regular expression I'm using is incorrect or whether any other action needs to be taken.
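Your approach does not work because the ENTITY regex requires a " before the final >. In the failing input the entity declaration ends with '> instead, so the lazy match keeps going until the next "> it finds later in the document, which is how the root tag gets removed. You may just replace \" with ['\"] there.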
Besides, never use (.|\n|\r)*? in any regex since it is a performance killer. Instead, use .*? with Pattern.DOTALL (or the inline (?s) variant), or at least [\s\S]*?.
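As a rough sketch of the two fixes above applied to the original two-step approach (the class and method names here are hypothetical, and the DTD paths are shortened for readability), it could look something like this:

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class TwoStepDoctypeStripper {
    // (?s) lets '.' match line breaks; ['"] accepts either quote style before '>'.
    private static final Pattern ENTITY = Pattern.compile("(?s)<!ENTITY .*?['\"]>");
    private static final Pattern DOCTYPE = Pattern.compile("(?s)<!DOCTYPE.*?>");

    static String strip(String xml) {
        // Same order as in the question: entities first, then the doctype.
        String withoutEntities = ENTITY.matcher(xml).replaceAll("");
        return DOCTYPE.matcher(withoutEntities).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened version of the failing input (single-quoted entity value).
        String xml = "<!DOCTYPE ichicsr SYSTEM \"Xml21.dtd\"\n"
                + "[<!ENTITY % entitydoc SYSTEM 'Xml21.dtd'>\n"
                + "]>\n<ichicsr/>";
        System.out.println(strip(xml).trim()); // prints <ichicsr/>
    }
}
```

Precompiling the patterns is just a convenience here; calling replaceAll on the string with the same regexes should behave the same way.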
However, there is a better way: merge the two regexps into one:
fileContent = fileContent.replaceAll("(?i)<!DOCTYPE[^<>]*(?:<!ENTITY[^<>]*>[^<>]*)?>", "");
See the regex demo.
Details
(?i) - case insensitive Pattern.CASE_INSENSITIVE inline modifier
<!DOCTYPE - literal text
[^<>]* - 0+ chars other than < and >
(?:<!ENTITY[^<>]*>[^<>]*)? - an optional occurrence of:
    <!ENTITY - literal text
    [^<>]* - 0+ chars other than < and >
    > - a > char
    [^<>]* - 0+ chars other than < and >
> - a > char.
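To check the merged pattern against the three doctype declarations from the question, a small test harness along these lines may help. The class name, the method name and the <ichicsr/> placeholder root element are assumptions made for illustration; only the regex itself comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical test harness; only the regex is taken from the answer above.
class MergedDoctypeStripper {
    private static final Pattern DOCTYPE_BLOCK =
            Pattern.compile("(?i)<!DOCTYPE[^<>]*(?:<!ENTITY[^<>]*>[^<>]*)?>");

    static String strip(String xml) {
        return DOCTYPE_BLOCK.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String[] doctypes = {
            // Plain doctype without an internal subset.
            "<!DOCTYPE ichicsr SYSTEM \"http://www.w3.org/TR/html4/frameset.dtd\">",
            // Double-quoted entity value.
            "<!DOCTYPE ichicsr SYSTEM \"D:\\UPGRADE\\NTServices\\Server\\\\Xml21.dtd\"\n"
                + "[<!ENTITY % entitydoc SYSTEM \"D:\\UPGRADE\\NTServices\\Server\\\\latin-entities.dtd\"> %entitydoc;]>",
            // Single-quoted entity value, the case that failed in the question.
            "<!DOCTYPE ichicsr SYSTEM \"D:\\UPGRADE\\NTServices\\Server\\\\Xml21.dtd\"\n"
                + "[<!ENTITY % entitydoc SYSTEM 'D:\\UPGRADE\\NTServices\\Server\\\\Xml21.dtd'>\n"
                + "]>"
        };
        for (String doctype : doctypes) {
            // Each iteration prints <ichicsr/>
            System.out.println(strip(doctype + "\n<ichicsr/>").trim());
        }
    }
}
```

Note that the optional group matches at most one ENTITY declaration, so a doctype with several of them would not be matched as written. Changing the trailing ? of that group to * is one possible way to allow more, though that variant is not covered by the answer above.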