Parse or convert HTML code embedded inside JSON object in the JSON response

Akhila Rajeev :

I have the following URL :

https://en.wikipedia.org/w/api.php?action=parse&section=0&prop=text&format=json&page=The%20Matrix

which returns a JSON response with HTML code embedded within a JSON object (See the link).

How do I retrieve details such as actor, director, etc. from that HTML part using Java?

How do I convert that HTML part to JSON using Java, if that is possible?

Or is there a way to change the URL itself so that it returns the movie data as readable JSON?

Dhrubajyoti Gogoi :

Here is a solution that uses jsoup to parse the HTML and Jackson to parse the JSON:

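import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WikiInfobox {
    public static void main(String[] args) throws IOException {
        // Fetch the raw JSON response; ignoreContentType lets jsoup accept a non-HTML body
        String body = Jsoup.connect("https://en.wikipedia.org/w/api.php?action=parse&section=0&prop=text&format=json&page=The%20Matrix")
                .ignoreContentType(true).execute().body();
        // Extract the HTML string from the JSON: it sits under parse -> text -> "*"
        ObjectMapper mapper = new ObjectMapper();
        JsonNode targetNode = mapper.readTree(body).get("parse").get("text").get("*");
        // Parse the HTML once and reuse the document for every lookup
        Document html = Jsoup.parse(targetNode.asText());
        // Generic but fragile function: find the infobox row whose header contains
        // the given label, then collect the title attribute of each link in its cell
        Function<String, String> retrieveDetailsOf = detailsOf ->
                html.select(".infobox tr th:contains(" + detailsOf + ") ~ td a[title]")
                        .stream().map(e -> e.attr("title")).collect(Collectors.toList()).toString();

        System.out.println(retrieveDetailsOf.apply("Directed by"));
        System.out.println(retrieveDetailsOf.apply("Produced by"));
        System.out.println(retrieveDetailsOf.apply("Music by"));
        System.out.println(retrieveDetailsOf.apply("Starring"));
    }
}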

Output:

[The Wachowskis]
[Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, Joe Pantoliano]

Dependencies (in addition to jsoup, org.jsoup:jsoup):

implementation("com.fasterxml.jackson.core:jackson-core:2.10.2")
implementation("com.fasterxml.jackson.core:jackson-databind:2.10.2")

Keep in mind that this depends on the page's current HTML structure, so any change to the infobox markup will most likely break it. Use an official movie-details API instead if one is available.
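
If you want to run the same lookup for pages other than The Matrix, the page title in the URL has to be percent-encoded. Here is a minimal sketch using only the JDK (Java 10 or later). The class and method names WikiParseUrl and forPage are my own and not part of any library, and it assumes the other query parameters stay the same as in the question:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are illustrative only.
final class WikiParseUrl {
    // Same endpoint and query parameters as the URL in the question, minus the title.
    private static final String BASE =
            "https://en.wikipedia.org/w/api.php?action=parse&section=0&prop=text&format=json&page=";

    // Builds the parse-API URL for an arbitrary page title.
    static String forPage(String title) {
        // URLEncoder produces form encoding, which turns spaces into '+'.
        // Swap those for %20 to match the URL used in the question.
        // A literal '+' in the title has already become %2B, so the replace is safe.
        String encoded = URLEncoder.encode(title, StandardCharsets.UTF_8).replace("+", "%20");
        return BASE + encoded;
    }

    public static void main(String[] args) {
        // Prints the exact URL from the question
        System.out.println(forPage("The Matrix"));
    }
}
```

The returned string can be passed straight to Jsoup.connect(...) in place of the hardcoded URL.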
