Parsing HTML files in Java with Jsoup

One, introduction to Jsoup

Jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

Two, the main functions of Jsoup

1. Parse HTML from a URL, file or string

2. Use DOM or CSS selectors to find and retrieve data

3. Manipulate HTML elements, attributes, and text

Note: jsoup is released under the MIT license and can be used in commercial projects with confidence.

Three, Jsoup usage introduction

1. Obtain the Document object

Document document = Jsoup.parse(new File("D:\\information\\test.html"), "utf-8");
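Besides a file, Jsoup.parse also accepts an HTML string, which is handy for quick experiments. The following is a minimal sketch of that variant; the class name ParseFromStringDemo and the sample markup are made up for illustration. For a URL, Jsoup.connect(url).get() returns a Document in the same way, but it needs network access and is therefore only mentioned in a comment.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the markup below is invented for this sketch.
class ParseFromStringDemo {

    // Parses an in-memory HTML string; no file or network access is needed.
    // (For a URL, Jsoup.connect(url).get() would be used instead.)
    static Document parseSample() {
        String html = "<html><head><title>Score report</title></head>"
                + "<body><p id=\"intro\">Hello, Jsoup</p></body></html>";
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document document = parseSample();
        System.out.println(document.title());       // text of the <title> element
        System.out.println(document.body().text()); // visible text of the body
    }
}
```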

2. Get elements using DOM methods

After obtaining the Document object, the next step is to parse the Document object and get the elements we want from it.

Document provides a wealth of methods to obtain specified elements.

  1. getElementById(String id): get by id
  2. getElementsByTag(String tagName): get by tag name
  3. getElementsByClass(String className): get by class name
  4. getElementsByAttribute(String key): get by attribute name
  5. getElementsByAttributeValue(String key, String value): get by attribute name and attribute value
  6. getAllElements(): get all elements
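The methods listed above can be tried out on a small document. This is only a sketch: the class name DomLookupDemo, the markup, and the data-level attribute are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class; the markup is invented for this sketch.
class DomLookupDemo {

    static final String HTML =
            "<div id=\"main\">"
          + "<p class=\"note\">first</p>"
          + "<p class=\"note\" data-level=\"2\">second</p>"
          + "<a href=\"/next\">next</a>"
          + "</div>";

    public static void main(String[] args) {
        Document document = Jsoup.parse(HTML);

        // By id: returns a single Element, or null if the id does not exist
        Element main = document.getElementById("main");
        // By tag name, class name, attribute name, and attribute name plus value
        Elements paragraphs = document.getElementsByTag("p");
        Elements notes = document.getElementsByClass("note");
        Elements withHref = document.getElementsByAttribute("href");
        Elements level2 = document.getElementsByAttributeValue("data-level", "2");

        System.out.println(main.tagName());
        System.out.println(paragraphs.size());
        System.out.println(notes.size());
        System.out.println(withHref.attr("href"));
        System.out.println(level2.text());
    }
}
```

Note that getElementById returns one Element, while the other methods return an Elements collection that can be iterated like a list.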

3. Find elements with a selector

Elements can also be found with selectors similar to those of CSS or jQuery.

This uses the following method of the Element class:

public Elements select(String cssQuery)

It returns all elements under the current element that match the cssQuery selector string.
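A short sketch of select with a few common selector forms follows. The class name SelectorDemo and the markup are hypothetical; the class names in the markup are only loosely modelled on the ones used later in this post.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class; the markup is invented for this sketch.
class SelectorDemo {

    static final String HTML =
            "<p class=\"title\">Scores</p>"
          + "<table class=\"left\">"
          + "<tr><td colspan=\"2\">Subject and score</td></tr>"
          + "<tr><td class=\"class1\">Math</td><td>90</td></tr>"
          + "<tr><td class=\"class1\">Art</td><td>85</td></tr>"
          + "</table>";

    public static void main(String[] args) {
        Document document = Jsoup.parse(HTML);

        // Class selector
        Elements titles = document.select(".title");
        // Descendant selector combined with tag and class names
        List<String> subjects = document.select("table.left td.class1").eachText();
        // Attribute selector: cells that carry a colspan attribute
        Elements spanning = document.select("td[colspan]");

        System.out.println(titles.text());
        System.out.println(subjects);
        System.out.println(spanning.text());
    }
}
```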

Four, Jsoup code example

The original goal of this post was to parse a table in an HTML file and convert it into a Bean.

1. Add the dependency

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

2. Code example

// Requires: java.io.File, java.io.IOException, java.util.HashMap, java.util.Map,
// org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements

// Assumed to be a static field of the enclosing class; the original snippet does not show its declaration.
private static final Map<String, String> shiftInformationHashMap = new HashMap<>();

// Use Jsoup to read the information from the corresponding tags in each table
private static void HTMLParserMapInit() throws IOException {
    Document document = Jsoup.parse(new File("D:\\information\\test.html"), "utf-8");
    Elements tableTitles = document.select(".title");
    Elements tables = document.select(".left");
    for (int i = 0; i < tableTitles.size(); i++) {
        String keyLevel1 = "";
        String keyLevel2 = "";
        String value = "";
        // Collects the text of the cells that carry a colspan attribute
        String title = "";
        Element table = tables.get(i);
        for (Element eTr : table.select("tr")) {
            for (Element eTd : eTr.select("td")) {
                String tagColspan = eTd.attr("colspan");
                String tagClass = eTd.attr("class");
                String tagText = eTd.text();
                if (!tagColspan.equals("")) {
                    title += tagText + ",";
                }
                if (tagClass.equals("class2")) {
                    keyLevel1 = tagText;
                } else if (tagClass.equals("class1")) {
                    keyLevel2 = tagText;
                } else if (tagClass.equals("")) {
                    value += tagText + ",";
                }
            }
            // Store the row once at least one of the two key levels is known
            if (!(keyLevel1.equals("") && keyLevel2.equals(""))) {
                if (!value.equals("")) {
                    value = value.substring(0, value.length() - 1);
                    shiftInformationHashMap.put(keyLevel1 + "," + keyLevel2, value);
                }
                value = "";
            }
        }
        // Drop the trailing comma, if any header text was collected
        if (!title.isEmpty()) {
            title = title.substring(0, title.length() - 1);
        }
        System.out.println("title," + title);
        System.out.println("hashMap," + shiftInformationHashMap);
    }
}

Once the HTML data has been parsed into a HashMap, its contents are clear at a glance.
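The method above depends on the layout of test.html, which is not shown here. As a smaller, self-contained sketch of the same idea, the following turns each data row of a table into a map keyed by the header text. It assumes a simple table whose first row consists of th cells; the class name TableRowsDemo, the method tableToMaps, and the markup are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class; assumes the first table row holds the column names in <th> cells.
class TableRowsDemo {

    static final String HTML =
            "<table>"
          + "<tr><th>id</th><th>score</th></tr>"
          + "<tr><td>1</td><td>90</td></tr>"
          + "<tr><td>2</td><td>85</td></tr>"
          + "</table>";

    static List<Map<String, String>> tableToMaps(String html) {
        Document document = Jsoup.parse(html);
        Elements rows = document.select("table tr");
        List<Map<String, String>> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        // Header row: one column name per <th>
        List<String> headers = new ArrayList<>();
        for (Element th : rows.get(0).select("th")) {
            headers.add(th.text());
        }
        // Data rows: pair each <td> with the header at the same position
        for (int r = 1; r < rows.size(); r++) {
            Elements cells = rows.get(r).select("td");
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < cells.size() && c < headers.size(); c++) {
                row.put(headers.get(c), cells.get(c).text());
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map<String, String> row : tableToMaps(HTML)) {
            System.out.println(row);
        }
    }
}
```

Tables that use rowspan or colspan, like the one handled above, need extra bookkeeping that this sketch leaves out.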

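Five, converting a Map to a Bean

// Requires: java.lang.reflect.Field, java.util.Map
// Copies each map entry into the field of the same name on a new instance of clz.
public static <T, V> T map2Bean(Map<String, V> map, Class<T> clz) throws Exception {
    // Class.newInstance() is deprecated since Java 9; call the no-argument constructor instead
    T obj = clz.getDeclaredConstructor().newInstance();
    for (Map.Entry<String, V> entry : map.entrySet()) {
        Field field = clz.getDeclaredField(entry.getKey());
        field.setAccessible(true);
        field.set(obj, entry.getValue());
    }
    return obj;
}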
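To see the reflection-based conversion in action, here is a small self-contained sketch. The class MapToBeanDemo, the nested Score bean, and the mapToBean method are hypothetical stand-ins; mapToBean follows the same approach as map2Bean but wraps reflection failures in an unchecked exception, which is a choice made for this example only.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of the original code.
class MapToBeanDemo {

    // Hypothetical bean; fields are typed Object so that any map value can be assigned.
    static class Score {
        private Object id;
        private Object score;

        Object getId() { return id; }
        Object getScore() { return score; }
    }

    // Copies each map entry into the declared field with the same name.
    static <T> T mapToBean(Map<String, ?> map, Class<T> type) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, ?> entry : map.entrySet()) {
                Field field = type.getDeclaredField(entry.getKey());
                field.setAccessible(true);
                field.set(bean, entry.getValue());
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            // Covers a missing field, a missing no-argument constructor, and similar failures
            throw new IllegalArgumentException("Cannot map onto " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", "1");
        row.put("score", "90");
        Score bean = mapToBean(row, Score.class);
        System.out.println(bean.getId() + " -> " + bean.getScore());
    }
}
```

Two limits of this approach are worth keeping in mind: every map key must match a declared field name exactly, and Field.set rejects a value whose type does not fit the field, which is one reason to declare the bean fields as Object when the values come straight from parsed text.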

Six, parsing a CSV file

1. CSV file

2. Bean class

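// @Data is presumably Lombok's annotation (lombok.Data), which generates getters, setters, and toString()
@Data
public class ScoreBean {
	private Object id;
	private Object score;
}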

3. How to read the CSV file

// Requires: java.io.BufferedReader, java.io.FileReader, java.util.ArrayList, java.util.HashMap, java.util.List
public static List<HashMap<String, Object>> readCSVToList(String filePath) throws Exception {
    List<HashMap<String, Object>> list = new ArrayList<>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        String headerLine = reader.readLine();
        if (headerLine == null) {
            return list; // empty file
        }
        // The first line holds the column names
        String[] headTitle = headerLine.split(",");
        String line;
        while ((line = reader.readLine()) != null) {
            HashMap<String, Object> hashMap = new HashMap<>();
            String[] itemArray = line.split(",");
            // Guard against rows that have more fields than the header
            for (int i = 0; i < itemArray.length && i < headTitle.length; i++) {
                hashMap.put(headTitle[i], itemArray[i]);
            }
            list.add(hashMap);
        }
    }
    return list;
}
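The header-to-value mapping can be checked without a file on disk. The sketch below applies the same logic to lines held in memory; the class name CsvLinesDemo, the method parseCsvLines, and the sample data are hypothetical. Like the method above, it splits on plain commas, so it assumes a simple CSV without quoted fields or embedded commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; handles only simple CSV without quoted fields.
class CsvLinesDemo {

    // The first line supplies the column names; every later line becomes one map.
    static List<Map<String, String>> parseCsvLines(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] headers = lines.get(0).split(",");
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // The -1 limit keeps trailing empty fields instead of dropping them
            String[] items = line.split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < headers.length && i < items.length; i++) {
                row.put(headers[i].trim(), items[i].trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,score", "1,90", "2,85");
        for (Map<String, String> row : parseCsvLines(lines)) {
            System.out.println(row);
        }
    }
}
```

For a real file, the lines could come from java.nio.file.Files.readAllLines; for CSV with quoting, a dedicated CSV library is the safer choice.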

4. Testing

public static void main(String[] args) throws Exception {
    String path = "D:\\scoreInfo.csv";
    List<HashMap<String, Object>> list = readCSVToList(path);
    for (HashMap<String, Object> hashMap : list) {
        // BeanUtil.HashMapToBeanUtil is the author's helper; its source is not shown in this post
        BeanUtil.HashMapToBeanUtil(hashMap, ScoreBean.class);
    }
}

5. Console output

 


Origin blog.csdn.net/guorui_java/article/details/114714216