I'm trying to parse a large YAML file (over 3000 lines) in a Java application. The file is downloaded from another system (a PHP app), and I have limited control over it: changes to it are made manually, and the YAML parser in the other system seems to be a lot more forgiving about how the YAML is formatted.
The problem I'm running into is that when I try to parse the file with Jackson, I get an exception because a handful of lines have an invalid character at the end. This causes the entire parse attempt to fail.
Is there a way to configure or set up Jackson to simply skip over lines or YAML blocks if they are malformed or have invalid tokens?
Example YAML
example.good_yaml:
  description: "Example of good YAML"
example.bad_yaml:
  description: "Example of bad YAML")
Parsing Code
ObjectMapper mapper = new YAMLMapper();
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
Map<String, Object> result = mapper.readValue(sourceYaml, new TypeReference<Map<String, Object>>() {});
Error
com.fasterxml.jackson.dataformat.yaml.snakeyaml.error.MarkedYAMLException: while parsing a block mapping
 in 'reader', line 4, column 3:
      description: "Example of bad YAML")
      ^
expected <block end>, but found '<scalar>'
 in 'reader', line 4, column 37:
      description: "Example of bad YAML")
                                        ^
 at [Source: (File); line: 4, column: 37]
That would require SnakeYAML, which is used by Jackson for parsing, to support this. The options for loading don't include a setting for this, nor do I know of any API for it, so I am pretty sure that it doesn't have any such functionality.
Mind that recovery from syntax errors is a rather complex endeavor (even though it seems simple for your specific use-case) and I don't know of any YAML implementation which implements that (since most of them are rewrites of PyYAML/libyaml).
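For the narrow case in the question, where every top-level key starts at column 0 and blocks are independent of each other, one workaround is to split the file into top-level blocks yourself and parse each block separately, dropping the ones that fail. This is not a SnakeYAML or Jackson feature, only a rough sketch built around them. The names BlockwiseYaml, BlockParser, splitTopLevelBlocks and parseLeniently are made up for illustration, and the splitting rule is an assumption that breaks on document markers, top-level sequences, anchors shared between blocks, and multi-line scalars that continue at column 0.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class BlockwiseYaml {

    /** Hypothetical callback so that a parser throwing checked exceptions can be plugged in. */
    @FunctionalInterface
    public interface BlockParser {
        Map<String, Object> parse(String block) throws Exception;
    }

    /**
     * Splits the text into top-level blocks. Assumption: a new block starts at every
     * line whose first character is neither whitespace nor '#'.
     */
    public static List<String> splitTopLevelBlocks(String yaml) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean hasKey = false; // true once the current block contains a top-level line
        for (String line : yaml.split("\\R")) {
            boolean startsBlock = !line.isEmpty()
                    && !Character.isWhitespace(line.charAt(0))
                    && line.charAt(0) != '#';
            if (startsBlock && hasKey) {
                blocks.add(current.toString());
                current.setLength(0);
            }
            if (startsBlock) {
                hasKey = true;
            }
            current.append(line).append('\n');
        }
        if (hasKey) {
            blocks.add(current.toString());
        }
        return blocks;
    }

    /** Parses each block on its own; blocks that fail to parse are collected in 'skipped'. */
    public static Map<String, Object> parseLeniently(String yaml, BlockParser parser,
                                                     List<String> skipped) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String block : splitTopLevelBlocks(yaml)) {
            try {
                result.putAll(parser.parse(block));
            } catch (Exception e) {
                skipped.add(block); // keep the raw text so it can be logged or fixed by hand
            }
        }
        return result;
    }
}
```

With the mapper from the question, the callback could be something like block -> mapper.readValue(block, new TypeReference<Map<String, Object>>() {}). Parsing several thousand small blocks is slower than a single pass, so it is worth measuring before relying on it.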
Chances are that it's easier to sanitize your file with a well-placed sed command, assuming there is a small number of recurring syntax errors that are easy to find with a regex.
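If you would rather keep the cleanup inside the Java application than shell out to sed, the same idea can be sketched with java.util.regex. The pattern below is an assumption based only on the one error shown in the question (stray characters after the closing quote of a double-quoted value on a "key: value" line), so it will need adjusting for whatever other mistakes appear in the real file. The class name YamlSanitizer is made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class YamlSanitizer {

    // Group 1 captures "key: "double-quoted value"" up to and including the closing quote.
    // Whatever non-space, non-quote, non-'#' characters follow it on the line are treated as junk.
    private static final Pattern TRAILING_JUNK = Pattern.compile(
            "^(\\s*[^\\s:#][^:]*:\\s*\"(?:[^\"\\\\]|\\\\.)*\")[^\\s\"#]+\\s*$");

    /** Returns the YAML text with trailing junk removed from matching lines (requires Java 11+). */
    public static String sanitize(String yaml) {
        return yaml.lines()
                .map(line -> TRAILING_JUNK.matcher(line).replaceFirst("$1"))
                .collect(Collectors.joining("\n"));
    }
}
```

The sanitized string can then be passed to mapper.readValue in place of the original source. Lines that do not match the pattern are left untouched, so a remaining error still surfaces as an exception instead of being silently dropped.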