Crawler: build web crawlers without writing code or regular expressions

Crawler V1.0.0

  1. The code has not been optimized yet

  2. The framework structure is simple

  3. Some features are still missing; send me your requirements and I will add them later

Operation guide

Command: java -jar Crawler.jar -[option]
    -v  Print the crawler's version
    -h  Print the crawler's help text
    -ct [url]  Crawl one site as a test; url: the URL to test
    -cw [url] [k,v]  Test information extraction; url: the URL to test; [k,v]: field,selector pairs such as title,div[class=title]; separate multiple pairs with #
    -ci [urllist] [k,v] <InputResult>  Save the extracted information to an XML file; the XML can be loaded into a database with a SQL tool's import wizard or converted to other formats; <InputResult>: the output directory
    -cl [url] [k,v] <InputUrllist>  Save the list of URLs found on a page to a file, which can then be crawled in depth with -ci
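As a quick illustration of the rule syntax (the URL and CSS selectors below are placeholders, not part of the tool), the #-separated pairs can be built and inspected in a small shell snippet:

```shell
#!/bin/sh
# Extraction rules are "field,selector" pairs joined with '#'.
# The selectors here are illustrative; match them to the target page's HTML.
RULE='title,div[class=title]#date,span[class=date]'

# Show each rule on its own line, the way the crawler will split them:
printf '%s\n' "$RULE" | tr '#' '\n'

# With Crawler.jar in the current directory, the extraction test would be:
#   java -jar Crawler.jar -cw http://example.com "$RULE"
```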

Examples

1. -ci <URL file> <crawler rule> <output path>

First prepare a URL file containing the URLs to crawl (e.g. url.txt).

2. Run: java -jar crawler.jar -ci url.txt title,h1[id=artibodyTitle]#date,span[id=pub_date]#nodes,div[id=artibody] data.xml

The results are saved to data.xml.

The XML can then be imported into a database with a SQL tool's import wizard (Navicat, for example) and converted to TXT, Excel, Word, and other formats.

3. The -cl command generates urllist.txt, which can then be crawled in depth with the -ci command.
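A dry-run sketch of that two-step pipeline (the listing URL and selectors are hypothetical; drop the leading echo once Crawler.jar is on hand):

```shell
#!/bin/sh
# Step 1: harvest the URLs on a listing page into urllist.txt (-cl).
# Step 2: deep-crawl every harvested URL into data.xml (-ci).
# 'echo' makes this a dry run that only prints the commands.
echo java -jar Crawler.jar -cl http://example.com/news 'link,a[href]' urllist.txt
echo java -jar Crawler.jar -ci urllist.txt 'title,h1[id=artibodyTitle]#nodes,div[id=artibody]' data.xml
```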

My email is [email protected]. Report bugs directly as issues or by email; tell me what you need and I will improve it. I have a backlog of features on hand that are not yet polished.

Status and roadmap:

1. URL normalization: site URLs starting with "/", "./", "../", or "//" are now resolved correctly

2. An HTTP proxy interface is reserved but not yet wired in

3. Custom User-Agent and cookie-based login are also written but not yet enabled

4. JDBC output existed before, but it felt slower than XML import, so it was dropped as dead weight

5. Hooks are reserved for personalized tools such as batch extraction of email addresses, QQ numbers, and mobile phone numbers

6. An interface to SQLMAP is planned, so that automated injection and XSS testing can be added later

7. It can be combined with Nutch

8. If you have questions, ask me; I will note them and improve the tool step by step. The code is open source. Java GUI, you know.

PS: A Java runtime environment is required.


The current features can be combined with shell or DOS commands to build scheduled crawls and distributed crawls in any combination.
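For example, a scheduled crawl can be set up with cron on Linux (the path, rule, and schedule below are placeholders; on Windows, the Task Scheduler plays the same role):

```shell
# Hypothetical crontab entry: crawl url.txt every day at 02:00 and write
# a dated result file. '%' must be escaped as '\%' inside crontab lines.
0 2 * * * cd /opt/crawler && java -jar Crawler.jar -ci url.txt 'title,h1[id=artibodyTitle]' "data-$(date +\%F).xml"
```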

OSchina:http://git.oschina.net/puguoan/Crawler


The commands have changed a lot; see the instructions in the Git repository.
