Java web crawler: crawling a novel with Java

Java can be used to write web crawlers.

When web crawlers come up, the first language most people think of is Python, but Java is also a good choice for writing one.

Here is a basic example, written in Java, that crawls a novel:


Function:

Crawl the whole novel from the target site


Coding environment:

JDK: 1.8.0_191

Eclipse: 2019-03 (4.11.0)


Material:

Website: http://www.shicimingju.com

Novel: Romance of the Three Kingdoms


Techniques used in this example:

Regular expressions

Java network communication: URL

I/O streams

Map: HashMap

String Manipulation

Exception Handling
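As a small illustration of the URL and HashMap items above, the chapter page addresses can be prepared before any request is sent. This is only a sketch: the class name ChapterUrlDemo and the method buildChapterUrls are hypothetical, the case code in this article builds each URL inside its loop and does not use a map, and the address prefix is simply the one that appears in that code.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

class ChapterUrlDemo {
    // Address prefix as it appears in this article's case code.
    static final String BASE = "http://www.shicimingju.com/book/sanguoyanyi/";

    /** Maps each chapter number in [first, last] to the URL of its page (hypothetical helper). */
    static Map<Integer, URL> buildChapterUrls(int first, int last) {
        Map<Integer, URL> urls = new HashMap<>();
        for (int i = first; i <= last; i++) {
            try {
                // Constructing a URL object does not open a connection.
                urls.put(i, new URL(BASE + i + ".html"));
            } catch (MalformedURLException e) {
                // Not expected for the fixed prefix above, but the constructor declares it.
                throw new IllegalArgumentException(e);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        Map<Integer, URL> urls = buildChapterUrls(1, 120);
        System.out.println(urls.size());
        System.out.println(urls.get(117));
    }
}
```

Keeping the chapter number as the key makes it easy to look up or retry a single chapter later without rebuilding the address by hand.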


Code outline

Create a File object for the location where the novel will be stored

Write regular expressions based on the page structure and compile them into Pattern objects

Write a loop that creates a URL object and sends a request for each chapter page of the novel

Read the network stream with a BufferedReader

Create the output stream

Read the response line by line in a loop and match each line against the regular expressions

Write the matched content to the local file until the loop ends

Take care to handle exceptions
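The regular expression steps in this outline can be tried offline before any request is made. The sketch below uses the same two expressions as the case code; the class name ChapterRegexDemo, its helper methods, and the sample HTML lines are assumptions made for illustration and are not taken from the real site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChapterRegexDemo {
    // Same expressions as the case code in this article.
    static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    static final Pattern PARAGRAPH = Pattern.compile("<p.*?>(.*?)</p>");

    /** Returns the text inside <title>...</title>, or null if the line has none. */
    static String extractTitle(String line) {
        Matcher m = TITLE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the text of every <p>...</p> found on one line, with &nbsp; removed. */
    static List<String> extractParagraphs(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(line);
        while (m.find()) {
            // group(1) is the captured text between the opening and closing tags.
            result.add(m.group(1).replace("&nbsp;", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not copied from the real site.
        System.out.println(extractTitle("<title>Chapter one</title>"));
        System.out.println(extractParagraphs("<p>&nbsp;first</p><p class=\"x\">second</p>"));
    }
}
```

Because both expressions are matched against one line at a time, a paragraph whose opening and closing tags sit on different lines of the page would not be captured; whether that matters depends on how the target page is laid out.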



Running result


Chapter 117: download started ...

Deng Shi Yin Ping Zhuge Zhan died of carrying stolen Mianzhu _ "Three Kingdoms" _ famous poems Network

Chapter 117: download finished ...

Chapter 118: download started ...

Temple cried a dead Xiao Wang Shi Ru Xichuan two power struggle _ "Three Kingdoms" _ famous poems Network

Chapter 118: download finished ...

Chapter 119: download started ...

False surrender artifice into empty words again Zen follow suit _ "Three Kingdoms" _ famous poems Network

Chapter 119: download finished ...

Chapter 120: download started ...

Du Yu Jian veteran seeking to offer new drop-thirds owned by Sun Hao unification _ "Three Kingdoms" _ famous poems Network

Chapter 120: download finished ...



Case Code:

package com.qianfeng.text;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetText {

    /**
     * 1. Create a File object for the location where the novel is stored.
     * 2. Write regular expressions based on the page structure and compile them into Pattern objects.
     * 3. Write a loop that creates a URL object and requests every chapter page of the novel.
     * 4. Read the network stream with a BufferedReader.
     * 5. Create the output stream.
     * 6. Read the response line by line and match each line against the patterns.
     * 7. Write the matched content to the local file until the loop ends.
     * 8. Handle exceptions.
     *
     * @param args not used
     */
    public static void main(String[] args) {
        // 1. Create a File object for the location where the novel is stored
        File file = new File("D:\\File\\three_guo.txt");

        // 2. Write regular expressions based on the page structure and compile them
        String regex_content = "<p.*?>(.*?)</p>";
        String regex_title = "<title>(.*?)</title>";

        Pattern p_content = Pattern.compile(regex_content);
        Pattern p_title = Pattern.compile(regex_title);

        Matcher m_content;
        Matcher m_title;

        // 3. Loop over all chapter pages of the novel and request each one
        for (int i = 1; i <= 120; i++) {
            System.out.println("Chapter " + i + ": download started ...");
            try {
                // Create the URL object for this chapter's page
                URL url = new URL("http://www.shicimingju.com/book/sanguoyanyi/" + i + ".html");

                // 4. Read the network stream with a BufferedReader
                // 5. Create the output stream, appending to the local file
                // try-with-resources closes both streams even if an exception is thrown
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "utf8"));
                     BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, true)))) {
                    String str;

                    // 6. Read the response line by line and match each line against the patterns
                    while ((str = reader.readLine()) != null) {
                        m_title = p_title.matcher(str);
                        m_content = p_content.matcher(str);

                        // Get the chapter title and write it to the local file
                        if (m_title.find()) {
                            // group(1) is the text captured between the <title> tags
                            String title = m_title.group(1);
                            System.out.println(title);
                            // "第 i 章" means "Chapter i"
                            writer.write("第" + i + "章:" + title + "\n");
                        }

                        while (m_content.find()) {
                            // group(1) is the text captured between the <p> tags
                            String content = m_content.group(1);
                            // Clean the captured data
                            content = content.replace("&nbsp;", "").replace("?", "");
                            // 7. Write the chapter text to the local file
                            writer.write(content + "\n");
                        }
                    }

                    writer.write("\n\n");
                }

                System.out.println("Chapter " + i + ": download finished ...");
            } catch (Exception e) {
                // 8. Report the failure and continue with the next chapter
                System.out.println("Download failed");
                e.printStackTrace();
            }
        }
    }
}
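To exercise the read-and-write loop without touching the network or the disk, the same BufferedReader and BufferedWriter pattern can be run against in-memory streams. This is a sketch under stated assumptions: StreamCopyDemo and copyLines are hypothetical names, and a StringReader and StringWriter stand in for the network stream and the local file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamCopyDemo {
    /** Copies every line from source to target and returns how many lines were written. */
    static int copyLines(Reader source, Writer target) {
        int count = 0;
        // try-with-resources closes both streams, even when an exception is thrown.
        try (BufferedReader reader = new BufferedReader(source);
             BufferedWriter writer = new BufferedWriter(target)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line + "\n");
                count++;
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers of this sketch need no throws clause.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for the network stream and the local file.
        StringWriter out = new StringWriter();
        int n = copyLines(new StringReader("line one\nline two"), out);
        System.out.println(n + " lines copied");
        System.out.print(out);
    }
}
```

In a real run the Reader would wrap url.openStream() and the Writer would wrap a FileOutputStream opened in append mode, as in the case code.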





Reproduced from: https://juejin.im/post/5d0b3f09518825122925c39d


Origin blog.csdn.net/weixin_34404393/article/details/93180575