Crawling all teacher email addresses from the Xidian University teacher homepages (using Java's jsoup)

1. Introduction

To crawl teacher homepage information from Xidian University, I chose the old teacher homepage site. It is more stable than the newer site, and the teacher information on it is more detailed.

This time I used Jsoup (see the Jsoup Chinese documentation).

2. Basic ideas

Get the URL of each college from the homepage -> get the URL of each teacher's page from every college page -> extract the email address from every teacher page
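Each step in this chain turns an href found on one page into the URL of the next page to fetch. A minimal sketch of that step using only the JDK's java.net.URI is shown here; the class name LinkResolver and its resolve method are hypothetical and not part of the original program, which simply concatenates the base URL and the href.

```java
import java.net.URI;

class LinkResolver {
    // Base URL taken from the article's code; the rest of this class is illustrative.
    static final URI BASE = URI.create("https://web.xidian.edu.cn/");

    /** Resolves an href against BASE; returns null for empty, malformed, or non-HTTP links. */
    static String resolve(String href) {
        if (href == null || href.isBlank()) {
            return null;
        }
        try {
            URI resolved = BASE.resolve(href.trim());
            String scheme = resolved.getScheme();
            // Skip links such as javascript: or mailto: that cannot be crawled.
            if (!"http".equals(scheme) && !"https".equals(scheme)) {
                return null;
            }
            return resolved.toString();
        } catch (IllegalArgumentException e) {
            // The href is not a syntactically valid URI.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("showcollege.php?col_num=1"));
        System.out.println(resolve("baoliang/"));
        System.out.println(resolve("javascript:void(0)"));
    }
}
```

Compared with plain string concatenation, this also copes with hrefs that are already absolute and with empty hrefs, which may or may not occur on the real pages.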

3. Code


import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import static java.lang.Thread.sleep;

public class JsoupTestTitle {

    public  static List<String> school = new ArrayList<String>();
    public  static List<String> zhuye = new ArrayList<String>();
    public  static Set<String> email = new TreeSet<>();

    public static void main(String[] args) throws Exception {
        getSchool();
        getZhuye();
        getEmail();
    }


    // Crawl the URLs of all colleges
    public static void getSchool() {

        String url = "https://web.xidian.edu.cn/showcollege.php?col_num=1";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "right_container");
            for (Element element : listDiv) {
                Elements texts = element.getElementsByTag("a");
                for (Element text : texts) {

                    // Get all the text
                    //String ptext = text.text();

                    String ptext = text.attr("href");


                    // Keep only hrefs that start with "s"; startsWith is also safe for an empty href
                    if (ptext.startsWith("s")) {
                        school.add("https://web.xidian.edu.cn/" + ptext);
                    }

                    //if(!ptext.equals(""))
                    // System.out.println("https://web.xidian.edu.cn/" + ptext);

                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }


    // Crawl the teacher homepage URLs, college by college
    public static void getZhuye() {

        for(String ss : school){

            String url = ss;
            //for(String ss : school)
            Document doc = null;
            try {
                doc = Jsoup.connect(url).get();
                Elements listDiv = doc.getElementsByAttributeValue("class", "left_item");

                for (Element element : listDiv) {
                    Elements texts = element.getElementsByTag("a");
                    for (Element text : texts) {

                        // Get all the text
                        //String ptext = text.text();

                        String ptext = text.attr("href");
                        String ptitle = text.attr("title");
                        zhuye.add("https://web.xidian.edu.cn/" + ptext + "+" + ptitle);
                        //if(!ptext.equals(""))
                        // System.out.println("https://web.xidian.edu.cn/" + ptext);

                    }
                }
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }

    }


    // Crawl each teacher's email address from the teacher's homepage
    public static void getEmail() throws InterruptedException {

        for(String zz : zhuye) {
            int index = zz.indexOf("+");


            String url = zz.substring(0,index);
            String name = zz.substring(index+1,zz.length());
            sleep(1000);

            //String url = "https://web.xidian.edu.cn/baoliang/";
            Document doc = null;
            try {
                doc = Jsoup.connect(url).get();
                Elements listDiv = doc.getElementsByAttributeValue("class", "nr");
                sleep(1000);
                for (Element element : listDiv) {
                    String Content = element.text();
                    // Use a regular expression to find the email address
                    String patternStr = "[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+";
                    Pattern pattern = Pattern.compile(patternStr);
                    Matcher matcher = pattern.matcher(Content);
                    // If the homepage contains an email address
                    if (matcher.find()) {

                        String teacherEmail = name + ":" + matcher.group();
                        // Set.add returns false for duplicates, so each entry is printed only once
                        if (email.add(teacherEmail)) {
                            System.out.println(teacherEmail);
                        }

                    }
                }
            } catch (IOException | InterruptedException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }

}
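The matching step in getEmail() can be tried on its own without any network access. The sketch below reuses the same pattern but compiles it once and collects every match instead of only the first; the class name EmailExtractor and the sample addresses are made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    // Same pattern as in getEmail(), compiled once instead of once per element.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

    /** Returns every distinct address found in the text, in sorted order. */
    static Set<String> extractAll(String text) {
        Set<String> found = new TreeSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical page text with a label, a bare address, and a duplicate.
        String text = "E-mail: alice@example.edu; contact bob.lee@mail.example.org "
                + "or alice@example.edu";
        System.out.println(extractAll(text));
    }
}
```

Using while (matcher.find()) rather than if (matcher.find()) picks up pages that list more than one address, and the TreeSet drops repeats.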

4. Run result:

5. Some notes:

  • Xidian's old teacher homepages are very irregular in format: some write e-mail: [email protected], some write e-mail: [email protected], and some give just [email protected]. The location of the address is also inconsistent: some put it in the introduction, some in the left column, and some in the right column. So I ended up using a regular expression to match the address directly.
  • I added sleep() because I was temporarily blocked for crawling too fast at first. It seems that Jsoup.connect with a user-agent set is more likely to be blocked. It seems that ...
  • After running for a while, the program throws java.net.SocketTimeoutException: connect timed out. Setting a timeout in Jsoup.connect did not help either, so this still needs to be optimized.
  • The final results are available here: https://share.weiyun.com/5Hqwqhv
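
One possible way to soften the timeout problem noted above is to retry a failed request a few times with a growing pause, instead of giving up on the first exception. The following is only a sketch under that assumption: the class Retry and its withBackoff method are hypothetical helpers, and nothing here has been tested against the real site.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

class Retry {
    /**
     * Runs the task, retrying on IOException (which includes SocketTimeoutException)
     * up to maxAttempts times and doubling the pause after each failure.
     */
    static <T> T withBackoff(Callable<T> task, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: let the caller see the last failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated fetch that times out twice before succeeding.
        int[] calls = {0};
        String page = withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("connect timed out");
            }
            return "<html>ok</html>";
        }, 5, 10);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

In the crawler, the task would be something like () -> Jsoup.connect(url).timeout(10_000).get(), so that a single slow page does not abort the whole run. Whether this is enough to avoid the block is not something the article establishes.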

Origin blog.csdn.net/larry1648637120/article/details/103354591