1. Introduction
To crawl the teacher homepage information of Xidian, I chose the old teacher homepage site. It is more stable than the newer one, and the teacher information updated there is more detailed.
This time I used Jsoup (see the Jsoup Chinese documentation).
2. Basic ideas
Get the college URLs from the homepage -> get the teacher URLs from each college page -> extract the mailbox information from every teacher page
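The flow above is two levels of link-following before any page content is read. Below is a minimal sketch of that traversal, with the HTTP fetching replaced by a hypothetical in-memory link map so it runs without a network or Jsoup. The class name `CrawlPipelineSketch`, the method `collectTeacherPages`, and the fake URLs are all illustrative assumptions, not part of the crawler in the next section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class CrawlPipelineSketch {

    // Walks index page -> college pages -> teacher pages and returns
    // every teacher page URL found, in discovery order.
    // linksOf is a stand-in for "fetch this page and list its links".
    static List<String> collectTeacherPages(String indexUrl,
                                            Function<String, List<String>> linksOf) {
        List<String> teacherPages = new ArrayList<>();
        for (String collegeUrl : linksOf.apply(indexUrl)) {
            teacherPages.addAll(linksOf.apply(collegeUrl));
        }
        return teacherPages;
    }

    public static void main(String[] args) {
        // Hypothetical link graph standing in for real HTTP requests.
        Map<String, List<String>> fakeSite = Map.of(
                "index", List.of("collegeA", "collegeB"),
                "collegeA", List.of("teacher1", "teacher2"),
                "collegeB", List.of("teacher3"));

        Function<String, List<String>> linksOf =
                url -> fakeSite.getOrDefault(url, List.of());

        System.out.println(collectTeacherPages("index", linksOf));
    }
}
```

Keeping the link lookup behind a function makes the traversal easy to try out before pointing it at the real site.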
3. Code
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import static java.lang.Thread.sleep;

public class JsoupTestTitle {
    // College page URLs collected from the index page
    public static List<String> school = new ArrayList<String>();
    // Entries of the form "<teacher homepage URL>+<teacher name>"
    public static List<String> zhuye = new ArrayList<String>();
    // "<name>:<email>" entries already printed, used to skip duplicates
    public static Set<String> email = new TreeSet<>();

    // Email pattern, compiled once and reused for every page
    private static final Pattern EMAIL_PATTERN =
            Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

    public static void main(String[] args) throws Exception {
        getSchool();
        getZhuye();
        getEmail();
    }

    // Crawl the URLs of all colleges
    public static void getSchool() {
        String url = "https://web.xidian.edu.cn/showcollege.php?col_num=1";
        try {
            Document doc = Jsoup.connect(url).get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "right_container");
            for (Element element : listDiv) {
                Elements links = element.getElementsByTag("a");
                for (Element link : links) {
                    String href = link.attr("href");
                    // Keep only hrefs starting with "s"; unlike substring(0, 1),
                    // startsWith does not throw on an empty href
                    if (href.startsWith("s")) {
                        school.add("https://web.xidian.edu.cn/" + href);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Crawl the teacher homepage URLs, college by college
    public static void getZhuye() {
        for (String url : school) {
            try {
                Document doc = Jsoup.connect(url).get();
                Elements listDiv = doc.getElementsByAttributeValue("class", "left_item");
                for (Element element : listDiv) {
                    Elements links = element.getElementsByTag("a");
                    for (Element link : links) {
                        String href = link.attr("href");
                        String title = link.attr("title");
                        // Store URL and teacher name together, separated by "+"
                        zhuye.add("https://web.xidian.edu.cn/" + href + "+" + title);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Crawl each teacher's email address from the teacher's homepage
    public static void getEmail() throws InterruptedException {
        for (String zz : zhuye) {
            // Split the stored entry back into URL and teacher name
            int index = zz.indexOf("+");
            String url = zz.substring(0, index);
            String name = zz.substring(index + 1);
            // Pause between requests to avoid being blocked
            sleep(1000);
            try {
                Document doc = Jsoup.connect(url).get();
                Elements listDiv = doc.getElementsByAttributeValue("class", "nr");
                sleep(1000);
                for (Element element : listDiv) {
                    String content = element.text();
                    // Use the regular expression to look for an email address
                    Matcher matcher = EMAIL_PATTERN.matcher(content);
                    // If this block contains an email address, take the first match
                    if (matcher.find()) {
                        String teacherEmail = name + ":" + matcher.group();
                        // Set.add returns false for an entry that was already recorded,
                        // so each name/email pair is printed only once
                        if (email.add(teacherEmail)) {
                            System.out.println(teacherEmail);
                        }
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
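The email matching step in getEmail() can be tried in isolation, without any network access. The sketch below reuses the same regular expression on a few made-up strings; the class name `EmailRegexDemo`, the helper `findEmails`, and the sample addresses are assumptions for illustration only. Unlike getEmail(), which keeps only the first match per block, this helper collects every match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailRegexDemo {

    // Same pattern as in getEmail(): word characters, dots, and hyphens
    // on both sides of "@", ending in a dot followed by word characters.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

    // Returns every substring of text that matches the pattern, in order.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up samples with differently labelled or unlabelled addresses.
        String[] samples = {
            "e-mail: alice@example.com",
            "Email:bob.smith@mail.example.org (office 101)",
            "carol_01@example.cn",
            "no address on this page"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + findEmails(sample));
        }
    }
}
```

Because the pattern ignores whatever label precedes the address, it does not matter how each page introduces the mailbox.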
4. Results:
5. Some notes:
- Xidian's old teacher homepages are very irregular: some pages put the address after a label such as "e-mail:", while others give the bare address with no label at all. The location of the mailbox is also irregular: some are in the introduction, some in the left column, and some in the right column. So I ended up using a regular expression to match the mailbox directly.
- I used sleep() because I crawled too fast at first and was temporarily blocked. Calling Jsoup.connect with a user-agent seems more likely to get blocked. It seems that ...
- When run this way, the program throws java.net.SocketTimeoutException: connect timed out. Setting a timeout on Jsoup.connect did not help either, so this still needs to be optimized.
- The final results are available at: https://share.weiyun.com/5Hqwqhv
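One possible direction for the sleep() and SocketTimeoutException notes above is to retry a failed request a few times with a growing pause instead of giving up on the first failure. The sketch below is only an assumption about what might help, not a verified fix for this site. The class `RetrySketch` and the method `withRetry` are hypothetical names, and the fetch is simulated with a fake task so the example runs without Jsoup or a network.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

final class RetrySketch {

    // Runs task up to maxAttempts times. After each IOException it sleeps,
    // doubling the delay every time, and rethrows the last IOException
    // if every attempt fails.
    static <T> T withRetry(Callable<T> task, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Fake fetch: times out twice, then succeeds. In the crawler the task
        // would wrap the real request, e.g. () -> Jsoup.connect(url).get().
        int[] calls = {0};
        String page = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("connect timed out");
            }
            return "<html>ok</html>";
        }, 5, 10);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

SocketTimeoutException is a subclass of IOException, so the single catch covers both timeouts and other I/O failures. The delay and attempt count here are arbitrary and would need tuning against the real server.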