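The Jsoup jar can be downloaded from the official website and used as-is.
Entry URL for the crawl: (Shixiseng internship site, Java listings)
1: Inspecting the listing URLs shows that the pages differ only in the p=? parameter: the first page is p=1, the second is p=2, and so on.
The listing has 67 pages in total, so we crawl pages 1 through 67: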
public static void main(String[] args) throws Exception {
    // Pages are numbered 1..67 inclusive
    for (int i = 1; i <= 67; i++) {
        System.out.println("--------------------- Crawling page " + i + " ---------------------");
        String url = "https://www.shixiseng.com/interns?k=Java&p=" + i;
        // Crawl the listing page
        getPageData(url);
        System.out.println("--------------------- Finished page " + i + " ---------------------");
    }
}
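The URL pattern described above can be isolated in a small helper. The following is a minimal sketch using only the standard library; the class `PageUrls` and the methods `buildPageUrl` and `buildAllPageUrls` are hypothetical names that do not appear in the original code. It also URL-encodes the keyword, on the assumption that search terms other than "Java" may contain characters that need escaping.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class PageUrls {
    // Listing endpoint taken from the crawler above
    static final String BASE = "https://www.shixiseng.com/interns";

    // Build the URL of one listing page (pages are 1-based)
    static String buildPageUrl(String keyword, int page) {
        try {
            return BASE + "?k=" + URLEncoder.encode(keyword, "UTF-8") + "&p=" + page;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported by the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    // Build the URLs of pages 1..totalPages inclusive
    static List<String> buildAllPageUrls(String keyword, int totalPages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= totalPages; page++) {
            urls.add(buildPageUrl(keyword, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : buildAllPageUrls("Java", 3)) {
            System.out.println(url);
        }
    }
}
```

Keeping the page range in one place makes it harder to drop the last page through an off-by-one loop bound.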
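2: We need to visit each job's detail page to collect the information
Inspecting the listing page shows that the detail-page address is the href of the corresponding a element: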
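// Imports required by this method
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static void getPageData(String url) throws IOException {
    // Imitate a browser and open the connection
    Document document = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36")
            .get();
    // Select the job links by class name
    Elements elements = document.getElementsByClass("name");
    // System.out.println(document);
    // Note: the loop starts at index 1, so the first matched element is skipped
    for (int i = 1; i < elements.size(); i++) {
        // Prefix the relative href with the site root, then crawl the detail page
        String url2 = "https://www.shixiseng.com" + elements.get(i).attr("href");
        getInnerPageData(url2);
    }
}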
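Prefixing the site root by hand works as long as every href is a root-relative path. A more defensive variant, sketched here with only the standard library, resolves the href against the page URL so that links that are already absolute are left unchanged. The class `LinkResolver`, the method `resolve`, and the sample path `/intern/inn_example` are hypothetical and only serve the illustration. Jsoup's own `absUrl("href")` on an element can serve a similar purpose when the document was fetched over a connection.

```java
import java.net.URI;

class LinkResolver {
    // Resolve an href found on a page against that page's URL
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://www.shixiseng.com/interns?k=Java&p=1";
        // Root-relative link (hypothetical path): scheme and host come from the page URL
        System.out.println(resolve(page, "/intern/inn_example"));
        // Absolute link: returned unchanged
        System.out.println(resolve(page, "https://www.shixiseng.com/intern/inn_example"));
    }
}
```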
3: On the detail page, locate the fields we want
Use Jsoup to extract them, join them into a readable string, and write it to a file:
// Imports required by this method
import java.io.FileOutputStream;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static void getInnerPageData(String url) throws IOException {
    Document document = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36")
            .get();
    // Class names of the fields to extract, in output order
    String[] classNames = {"item-name", "item-tm", "job_academic", "job_good",
            "item-msg", "job_detail", "job_link"};
    StringBuilder content = new StringBuilder();
    for (String className : classNames) {
        Elements elements = document.getElementsByClass(className);
        if (elements.isEmpty()) {
            continue; // this field is missing on the page
        }
        // Only the first element of each class is used
        content.append(elements.get(0).text());
        // item-name and item-tm share one line; every other field ends its line
        if (!className.equals("item-name")) {
            content.append("\n");
        }
    }
    // Separator between job records
    content.append("-----------------------\n");
    // Append to the output file; try-with-resources closes the stream
    try (FileOutputStream fileOutputStream = new FileOutputStream("D://job.txt", true)) {
        fileOutputStream.write(content.toString().getBytes("UTF-8"));
    }
    System.out.println(content);
}
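The join-and-append step above can also be separated from the scraping so that it can be checked without network access. The sketch below is one possible arrangement, not the original author's code: the class `JobRecordWriter` and the methods `formatRecord` and `appendRecord` are hypothetical names, the sample field values are made up, and it writes to a temporary file rather than D://job.txt. Unlike the crawler above, it puts every field on its own line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

class JobRecordWriter {
    // Same separator line the crawler writes between records
    static final String SEPARATOR = "-----------------------";

    // Join the non-empty fields, one per line, and close the record with the separator
    static String formatRecord(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (field != null && !field.isEmpty()) {
                sb.append(field).append('\n');
            }
        }
        return sb.append(SEPARATOR).append('\n').toString();
    }

    // Append one record to the file as UTF-8, creating the file if needed
    static void appendRecord(Path file, List<String> fields) throws IOException {
        Files.write(file, formatRecord(fields).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("job", ".txt");
        // Made-up sample values; the empty field is skipped
        appendRecord(file, Arrays.asList("Java Intern", "", "Bachelor"));
        appendRecord(file, Arrays.asList("Backend Intern"));
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```

Because `Files.write` opens and closes the file on each call, no stream is left open if a later page fails to load.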
4: Results
File output: