Java can be used to write web crawlers.
When crawlers come up, most people think of Python first, but Java is in fact also a good choice for writing one.
Here is a basic example, written in Java, that crawls a novel:
Function:
Crawl the entire novel from the target site
Coding environment:
JDK:1.8.0_191
Eclipse:2019-03 (4.11.0)
Material:
Website: http://www.shicimingju.com
Novel: Romance of the Three Kingdoms
Techniques used in this example:
Regular Expressions
Java networking: URL
I/O streams
Map—HashMap
String Manipulation
Exception Handling
Approach:
Create a File object for the location where the novel will be saved
Write regular expressions based on the page structure and compile them into Pattern objects
Write a loop that creates a URL object for each chapter page and sends the network request
Wrap the network stream in a BufferedReader
Create the output stream for the local file
Read the response line by line and extract the content with the regular expressions
Write the extracted content to the local file until the loop ends
Take care to handle exceptions
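The regular-expression part of this approach can be tried in isolation before any network code is written. The following is a minimal sketch, assuming the same two patterns as the case code; the class name `RegexSketch`, the helper names, and the sample HTML lines are made up for illustration and are not taken from the real site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only
class RegexSketch {
    // Same expressions as in the case code
    private static final Pattern P_CONTENT = Pattern.compile("<p.*?>(.*?)</p>");
    private static final Pattern P_TITLE = Pattern.compile("<title>(.*?)</title>");

    /** Returns the text inside <title>...</title>, or null if the line has no title. */
    static String extractTitle(String line) {
        Matcher m = P_TITLE.matcher(line);
        // group(1) is the first capture group, i.e. the text between the tags
        return m.find() ? m.group(1) : null;
    }

    /** Returns the inner text of every <p ...>...</p> element found on the line. */
    static List<String> extractParagraphs(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = P_CONTENT.matcher(line);
        // find() advances to the next match on each call
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real page content
        String titleLine = "<title>Sample chapter title</title>";
        String bodyLine = "<p>First paragraph.</p><p class=\"x\">Second paragraph.</p>";
        System.out.println(extractTitle(titleLine));
        System.out.println(extractParagraphs(bodyLine));
    }
}
```

Because the crawler matches one line at a time, these patterns only find a paragraph whose opening and closing tags sit on the same line of the response.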
Running result:
Chapter 117 download started...
Deng Shi Yin Ping Zhuge Zhan died of carrying stolen Mianzhu _ "Three Kingdoms" _ famous poems Network
Chapter 117 download complete...
Chapter 118 download started...
Temple cried a dead Xiao Wang Shi Ru Xichuan two power struggle _ "Three Kingdoms" _ famous poems Network
Chapter 118 download complete...
Chapter 119 download started...
False surrender artifice into empty words again Zen follow suit _ "Three Kingdoms" _ famous poems Network
Chapter 119 download complete...
Chapter 120 download started...
Du Yu Jian veteran seeking to offer new drop-thirds owned by Sun Hao unification _ "Three Kingdoms" _ famous poems Network
Chapter 120 download complete...
Case Code:
package com.qianfeng.text;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetText {
    /**
     * Steps:
     * 1. Create a File object for the location where the novel will be saved.
     * 2. Write regular expressions based on the page structure and compile them into Pattern objects.
     * 3. Loop over all chapters, creating a URL object for each chapter page and sending the request.
     * 4. Wrap the network stream in a BufferedReader.
     * 5. Create the output stream for the local file.
     * 6. Read the response line by line, extracting content with the regular expressions.
     * 7. Write the extracted content to the local file until the loop ends.
     * 8. Handle exceptions.
     *
     * @param args command-line arguments (unused)
     */
    public static void main(String[] args) {
        // 1. Create a File object for the location where the novel will be saved
        //    (the directory must already exist)
        File file = new File("D:\\File\\three_guo.txt");
        // 2. Write regular expressions based on the page structure and compile them
        String regex_content = "<p.*?>(.*?)</p>";
        String regex_title = "<title>(.*?)</title>";
        Pattern p_content = Pattern.compile(regex_content);
        Pattern p_title = Pattern.compile(regex_title);
        Matcher m_content;
        Matcher m_title;
        // 3. Loop over all 120 chapters and request each chapter page
        for (int i = 1; i <= 120; i++) {
            System.out.println("Chapter " + i + " download started...");
            try {
                // Create a URL object for this chapter's page
                URL url = new URL("http://www.shicimingju.com/book/sanguoyanyi/" + i + ".html");
                // 4. Wrap the network stream in a BufferedReader
                // 5. Open the local file in append mode for writing
                // try-with-resources closes both streams even if an exception is thrown
                try (BufferedReader reader = new BufferedReader(
                             new InputStreamReader(url.openStream(), "UTF-8"));
                     BufferedWriter writer = new BufferedWriter(
                             new OutputStreamWriter(new FileOutputStream(file, true), "UTF-8"))) {
                    String str;
                    // 6. Read the response line by line and match each line against the patterns
                    while ((str = reader.readLine()) != null) {
                        m_title = p_title.matcher(str);
                        m_content = p_content.matcher(str);
                        // Extract the chapter title and write it to the local file
                        if (m_title.find()) {
                            // group(1) is the text between <title> and </title>
                            String title = m_title.group(1);
                            System.out.println(title);
                            writer.write("第" + i + "章:" + title + "\n");
                        }
                        while (m_content.find()) {
                            // group(1) is the text between the <p> tags
                            String content = m_content.group(1);
                            // Clean the extracted data: strip spaces and '?' characters
                            content = content.replace(" ", "").replace("?", "");
                            // 7. Write the novel's content to the file
                            writer.write(content + "\n");
                        }
                    }
                    // Blank lines between chapters
                    writer.write("\n\n");
                }
                System.out.println("Chapter " + i + " download complete...");
            } catch (Exception e) {
                // 8. Report the failure and continue with the next chapter
                System.out.println("Download failed");
                e.printStackTrace();
            }
        }
    }
}
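The case code opens the output file with `new FileOutputStream(file, true)`, where the second argument selects append mode, so every chapter is added to the end of the same file. The following is a minimal sketch of that write step on its own, assuming a temporary file and placeholder chapter text; the class name `AppendSketch` and the method `appendChapter` are hypothetical and not part of the original code.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical helper class, for illustration only
class AppendSketch {
    /** Appends one chapter (heading line plus body) to the end of the file. */
    static void appendChapter(File file, int number, String title, String body) throws IOException {
        // The second FileOutputStream argument (true) selects append mode;
        // try-with-resources flushes and closes the writer automatically.
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file, true), StandardCharsets.UTF_8))) {
            writer.write("Chapter " + number + ": " + title + "\n");
            writer.write(body + "\n\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real output path
        File file = File.createTempFile("novel", ".txt");
        file.deleteOnExit();
        appendChapter(file, 1, "Placeholder title one", "Placeholder text.");
        appendChapter(file, 2, "Placeholder title two", "More placeholder text.");
        // Both chapters are present because the second call appended instead of overwriting
        List<String> lines = Files.readAllLines(file.toPath(), StandardCharsets.UTF_8);
        lines.forEach(System.out::println);
    }
}
```

One consequence of append mode is that running the crawler a second time adds the whole novel to the file again, so the output file may need to be deleted between runs.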
Reproduced from: https://juejin.im/post/5d0b3f09518825122925c39d