Web Crawler Primer 1 --- A look at web crawlers
Web Crawler Primer 2 --- The webmagic crawler framework
Web Crawler Primer 3 --- Crawler in practice
3 Crawler in Practice
3.1 Requirements
Crawl articles from the **** blog at a fixed time each day, and store the articles in the database.
3.2 Data Preparation
The following are the addresses of each **** channel:
First, prepare two tables:
Channel table:
Article table:
Add records to the tb_channel table:
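The table structure screenshots from the original post are not reproduced here. For reference, below is a minimal sketch of the assumed schema, written as JPA entity classes; the fields (id and name for tb_channel; id, channelid, title, and content for tb_article) are inferred from the code in section 3.3 and may differ from your actual tables:
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Channel.java
@Entity
@Table(name = "tb_channel")
public class Channel {
    @Id
    private String id;   // channel ID, e.g. "ai"
    private String name; // channel name
    // getters and setters omitted
}

// Article.java
@Entity
@Table(name = "tb_article")
public class Article {
    @Id
    private String id;        // generated by IdWorker, see section 3.3
    private String channelid; // references tb_channel.id
    private String title;     // article title
    private String content;   // article HTML; may need a TEXT/LONGTEXT column
    // getters and setters omitted
}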
3.3 Coding
3.3.1 Write the module
(1) Create a Spring Boot project in IDEA (not explained in detail here), create the module article_crawler, and add the following dependencies:
<dependencies>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
    </dependency>
</dependencies>
(2) Create the configuration file application.yml:
server:
  port: 9015
spring:
  application:
    name: article-crawler # service name
  datasource:
    driverClassName: com.mysql.jdbc.Driver
    url: jdbc:mysql://****:3306/test_article?characterEncoding=UTF8
    username: ****
    password: ****
  jpa:
    database: MySQL
    show-sql: true
  redis:
    host: ****
    password: ****
(3) Create the startup class:
@SpringBootApplication
@EnableScheduling
public class ArticleCrawlerApplication {

    public static void main(String[] args) {
        SpringApplication.run(ArticleCrawlerApplication.class, args);
    }

    @Value("${spring.redis.host}")
    private String redisHost;

    @Value("${spring.redis.password}")
    private String redisPassword;

    @Bean
    public IdWorker idWorker() {
        // IdWorker is a snowflake-style ID generator (project utility)
        return new IdWorker(1, 1);
    }

    @Bean
    public RedisScheduler redisScheduler() {
        JedisPoolConfig config = new JedisPoolConfig(); // connection pool configuration
        config.setMaxTotal(100); // maximum number of connections
        config.setMaxIdle(10);   // maximum number of idle connections
        JedisPool jedisPool = new JedisPool(config, redisHost, 6379, 20000, redisPassword);
        return new RedisScheduler(jedisPool);
    }
}
(4) Create the entity classes and data access interfaces (not explained in detail here).
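For reference, a minimal sketch of the data access interface, assuming Spring Data JPA (the Article entity was sketched in section 3.2):
import org.springframework.data.jpa.repository.JpaRepository;

// ArticleDao.java; Spring Data JPA generates the implementation at runtime
public interface ArticleDao extends JpaRepository<Article, String> {
}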
3.3.2 Crawler class
Create the article crawler class ArticleProcessor:
/**
 * Article crawler class
 */
@Component
public class ArticleProcessor implements PageProcessor {

    @Override
    public void process(Page page) {
        // queue up further article pages found on the current page
        page.addTargetRequests(page.getHtml().links()
                .regex("https://blog\\.csdn\\.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        // article title
        String title = page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1").get();
        // article content
        String content = page.getHtml().xpath("//*[@id=\"article_content\"]/div[2]").get();
        if (title != null && content != null) {
            page.putField("title", title);
            page.putField("content", content);
        } else {
            page.setSkip(true); // skip pages that are not articles
        }
    }

    @Override
    public Site getSite() {
        return Site.me().setRetryTimes(3).setSleepTime(100); // retry up to 3 times, pause 100 ms between requests
    }
}
3.3.3 Storage class
Create the article storage class ArticleDbPipeline, which is responsible for saving the crawled data to the database:
@Component
public class ArticleDbPipeline implements Pipeline {

    @Autowired
    private ArticleDao articleDao;

    @Autowired
    private IdWorker idWorker;

    // channel ID; note this is mutable state on a singleton bean,
    // so set it before each crawl and avoid crawling channels concurrently
    private String channelId;

    public void setChannelId(String channelId) {
        this.channelId = channelId;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");     // title extracted by the processor
        String content = resultItems.get("content"); // content extracted by the processor
        Article article = new Article();
        article.setId(idWorker.nextId() + "");
        article.setChannelid(channelId);
        article.setTitle(title);
        article.setContent(content);
        articleDao.save(article);
    }
}
3.3.4 Task class
Create a task class and use @Scheduled to set up scheduled crawling:
/**
 * Task class
 */
@Component
public class ArticleTask {

    @Autowired
    private ArticleProcessor articleProcessor;

    @Autowired
    private ArticleDbPipeline articleDbPipeline;

    @Autowired
    private RedisScheduler redisScheduler;

    /**
     * Crawl AI articles
     */
    @Scheduled(cron = "0 15 15 * * ?") // fires every day at 15:15:00
    public void aiTask() {
        System.out.println("Starting to crawl CSDN articles");
        Spider spider = Spider.create(articleProcessor);
        spider.addUrl("https://blog.csdn.net/nav/ai");
        articleDbPipeline.setChannelId("ai");
        spider.addPipeline(articleDbPipeline);
        spider.setScheduler(redisScheduler);
        spider.start(); // start() runs asynchronously; use run() to block instead
    }
}
Run the Spring Boot project and query the database; you will see that the crawled data has been stored.
Of course, the above is just a simple introductory crawler project. Applying it in production also requires setting up proxy IPs, handling CAPTCHAs, and other more complex operations that are not covered here; interested readers can explore them on their own.
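As a pointer in that direction, here is a minimal sketch of routing requests through proxies with webmagic 0.7.x, using HttpClientDownloader and SimpleProxyProvider; the hosts, ports, and credentials below are placeholders:
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class ProxyConfig {
    // attach a proxy provider to a spider; call this before spider.start()
    public static void useProxies(Spider spider) {
        HttpClientDownloader downloader = new HttpClientDownloader();
        downloader.setProxyProvider(SimpleProxyProvider.from(
                new Proxy("127.0.0.1", 8888),              // unauthenticated proxy (placeholder)
                new Proxy("127.0.0.1", 8889, "user", "pw") // authenticated proxy (placeholder)
        ));
        spider.setDownloader(downloader); // downloads now go through the proxies
    }
}
Calling ProxyConfig.useProxies(spider) in aiTask() before spider.start() would route all requests through the configured proxies.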