I have multiple files in a folder that I need to parse. My final goal is to extract a plain text file that contains whatever is in any div with an id attribute starting with "part_". I'm not familiar with Jsoup. Here is a sample of one of my HTML files:
<html>
<head>...</head>
<body>
<div id="document">
<p>..<p>
<div id=part_1> </div>
<div id=part_2> Part 2 : Security measures </div>
<div id=part_3>
Part 3
<p>security To review </p>
<ul> <li> ...</li> <li> ...</li></ul>
<p>measures to adjust </p>
</div>
I want to extract all the <div> elements whose id starts with "part_". As we can see, they contain text as well as other HTML tags, and I don't know how to handle this hierarchy. I wrote the following code, but clearly I'm lost:
Document document = Jsoup.parse(new File("Folder/2.html"), "utf-8");
System.out.println(document.title());
Elements divs = document.select("div");
for (Element div : divs) {
    String divid = div.attr("id");
    if (divid.equals("part_201")) {
        System.out.println(div.text());
    }
}
PS: I'm working with more than 200 HTML files, and in the end I should have one .txt file containing all their plain text.
I think you're making this more complicated than it needs to be. Jsoup is actually intuitive and pretty straightforward:
To select all divs whose id attribute has a value starting with "part_", just use: document.select("div[id^=part_]");
public static void main(String[] args) throws IOException {
    Document document = Jsoup.parse(new File("C:\\Users\\Eritrean\\Desktop\\delete.html"), "utf-8");
    Elements myDivs = document.select("div[id^=part_]");
    myDivs.forEach(d -> System.out.println(d.wholeText()));
}
Output:
Part 2 : Security measures
Part 3
security To review
...
...
measures to adjust
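Since you mention having more than 200 files and wanting a single .txt file, here is a rough sketch of how the same selector could be applied to a whole folder. The class name PartsExtractor, the method extractParts, and the paths "Folder" and "all_parts.txt" are placeholders I made up for illustration, and I'm assuming all the files end in .html and are UTF-8 encoded. It uses text() as in your own code, which puts each div on one line; wholeText() could be swapped in if you want to keep the original line breaks.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of Jsoup.
public class PartsExtractor {

    // Writes the text of every div whose id starts with "part_", from each
    // *.html file directly inside inputDir, into one output file.
    static void extractParts(Path inputDir, Path outputFile) throws IOException {
        try (DirectoryStream<Path> htmlFiles = Files.newDirectoryStream(inputDir, "*.html");
             PrintWriter out = new PrintWriter(
                     Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8))) {
            for (Path htmlFile : htmlFiles) {
                Document document = Jsoup.parse(htmlFile.toFile(), "utf-8");
                for (Element div : document.select("div[id^=part_]")) {
                    // text() collapses whitespace, so each div becomes one line
                    out.println(div.text());
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths: adjust to your own folder and output file.
        extractParts(Paths.get("Folder"), Paths.get("all_parts.txt"));
    }
}
```

Note that the directory listing comes back in no guaranteed order, so if the order of the files matters in the final text file you would need to collect and sort the paths first.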
Look at the Jsoup cookbook
See the Selector API reference