How can I parse my HTML files using Jsoup?

beartown :

I have multiple files in a folder that I need to parse. My final goal is to produce a plain text file that contains whatever is in any div with an id attribute starting with "part_". I'm not familiar with Jsoup. Here is a sample of one of my HTML files:

<html>
<head>...</head>
<body>
<div id="document">
    <p>..<p>
    <div id=part_1>  </div>
<div id=part_2> Part 2 : Security measures </div>
<div id=part_3> 
         Part 3
    <p>security To review  </p>
    <ul> <li> ...</li> <li> ...</li></ul>
    <p>measures to adjust </p>  

</div>

I want to extract every div whose id starts with "part_". As you can see, these divs contain text as well as other HTML tags, and I don't know how to handle this hierarchy. I wrote the following code, but clearly I'm lost:

    Document document = Jsoup.parse(new File("Folder/2.html"), "utf-8");
    System.out.println(document.title());

    Elements divs = document.select("div");
    for (Element div : divs) {
        String divId = div.attr("id");
        if (divId.equals("part_201")) {
            System.out.println(div.text());
        }
    }

PS: I'm working with more than 200 HTML files, and in the end I should have a single .txt file containing all of their plain text.

Eritrean :

I think you're making this more complicated than it needs to be. Jsoup is actually intuitive and pretty straightforward.

To select all divs that have an id attribute whose value starts with "part_", just use document.select("div[id^=part_]"):

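    // Needs imports: java.io.File, java.io.IOException, org.jsoup.Jsoup,
    // org.jsoup.nodes.Document, org.jsoup.select.Elements
    public static void main(String[] args) throws IOException {
        Document document = Jsoup.parse(new File("C:\\Users\\Eritrean\\Desktop\\delete.html"), "utf-8");
        // Select every div whose id attribute starts with "part_"
        Elements myDivs = document.select("div[id^=part_]");
        // wholeText() keeps the whitespace and line breaks of the source HTML
        myDivs.forEach(d -> System.out.println(d.wholeText()));
    }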

Output:

Part 2 : Security measures 

     Part 3
security To review  

 ... 
 ...

measures to adjust   
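
The ragged spacing in the output above comes from wholeText(), which keeps the whitespace and line breaks exactly as they appear in the source. If whitespace-normalized text is preferable, text() may be a better fit. Here is a minimal sketch comparing the two on an inline snippet; the class name TextVsWholeText and the sample HTML string are made up for illustration and only loosely modeled on the file in the question:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsWholeText {
    // Hypothetical sample, loosely modeled on the HTML in the question
    static final String HTML =
        "<div id=\"part_3\">\n     Part 3\n  <p>security To review  </p>\n"
        + "  <p>measures to adjust </p>\n</div>";

    // Parses the sample and returns the first div whose id starts with "part_"
    static Element firstPart() {
        Document doc = Jsoup.parse(HTML);
        return doc.selectFirst("div[id^=part_]");
    }

    public static void main(String[] args) {
        Element div = firstPart();
        // text(): whitespace is collapsed and trimmed
        System.out.println("[" + div.text() + "]");
        // wholeText(): original spaces and newlines are preserved
        System.out.println("[" + div.wholeText() + "]");
    }
}
```

Which one to use depends on whether the line structure of the original HTML matters in the final text file.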

Look at the Jsoup cookbook

See the Selector API reference
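
For the PS in the question (200+ files ending up in one .txt), the same selector can be applied to each file in a loop. Below is a rough sketch, not a definitive implementation. It assumes the files sit directly in a folder named "Folder" (as in the question), are UTF-8 encoded and end in ".html". The class name ExtractParts, the helper extractParts and the output name all_parts.txt are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractParts {

    // Returns the text of every div whose id starts with "part_", one entry per div
    static List<String> extractParts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> parts = new ArrayList<>();
        for (Element div : doc.select("div[id^=part_]")) {
            // text() normalizes whitespace; swap in wholeText() to keep line breaks
            String text = div.text();
            if (!text.isEmpty()) {          // skip divs that hold only whitespace
                parts.add(text);
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        Path folder = Paths.get("Folder");          // assumed input folder
        Path output = Paths.get("all_parts.txt");   // hypothetical output file

        // Collect the .html files in the folder, in a stable order
        List<Path> htmlFiles;
        try (Stream<Path> files = Files.list(folder)) {
            htmlFiles = files
                .filter(p -> p.toString().toLowerCase().endsWith(".html"))
                .sorted()
                .collect(Collectors.toList());
        }

        // Extract the matching divs from each file and gather their text
        List<String> lines = new ArrayList<>();
        for (Path file : htmlFiles) {
            String html = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            lines.addAll(extractParts(html));
        }

        // Write everything to a single text file, one div per line
        Files.write(output, lines, StandardCharsets.UTF_8);
    }
}
```

Keeping the extraction in its own method makes it easy to try on a single HTML string before running it over the whole folder. If a "part_" div can be nested inside another "part_" div, the inner text would appear twice, so the selector may need tightening for that case.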
