WebMagic crawler demo, attempt 2: multiple pages

The first demo was covered in the previous article: WebMagic fetched information from a single page and printed it to the console. This time we will fetch information from multiple pages and then store it in a database, using the MyBatis framework and MySQL 5.5.

See the previous article for the pom.xml and log4j configuration.
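
Since this demo adds MyBatis and the MySQL driver on top of that setup, the pom.xml also needs dependencies along these lines (the exact versions below are my own choice; any MyBatis 3.x release and a MySQL 5.x connector should work):

<!-- versions are assumptions; use any compatible MyBatis 3.x / MySQL 5.x connector -->
<dependency>
    <groupId>org.mybatis</groupId>
    <artifactId>mybatis</artifactId>
    <version>3.4.6</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>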

The mybatis-config.xml configuration and the database connection settings are recorded below.

jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://localhost:3307/webmagic
jdbc.username=root
jdbc.password=123456

Create the jdbc.properties file and fill in your own database connection information.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!-- Load the properties file with the database connection settings -->
    <properties resource="jdbc.properties"></properties>
    <!-- Type alias configuration -->
    <typeAliases>
        <package name="pojo"/>
    </typeAliases>
    <!-- Property resource configuration -->
    <!-- SqlSessionFactory configuration -->
    <environments default="development">
        <environment id="development">
            <!-- Transaction manager configuration -->
            <transactionManager type="JDBC"/>
            <!-- Data source configuration -->
            <dataSource type="POOLED">
                <property name="driver" value="${jdbc.driver}"/>
                <property name="url" value="${jdbc.url}"/>
                <property name="username" value="${jdbc.username}"/>
                <property name="password" value="${jdbc.password}"/>
            </dataSource>
        </environment>
    </environments>
    <mappers>
        <mapper resource="mapper/csdn_titleUrl_oneDao.xml"/>
        <mapper resource="mapper/csdn_user_messageDao.xml"/>
    </mappers>
</configuration>

This configures MyBatis: the data source and the mapped Mapper files.

I won't go into the details of MyBatis itself and its configuration; there are plenty of write-ups on Baidu. Let's go straight to the crawler logic class.

Just like the single-page demo last time, the process is still:

    Download page -> Parse page -> Process the extracted information

The difference is that this time we need to fetch multiple pages instead of one, and extract multiple pieces of information from each.

For convenience, it is best to initialize everything that might be needed up front. Without worrying about safety or performance, the fields are simply declared and used directly.

private static csdn_titleUrl_one csdn;
private static csdn_titleUrl_oneService csdnService = new csdn_titleUrl_oneService();
private List<csdn_titleUrl_one> allList;
private static String username = "dog250";// username to crawl; can be changed, or read from the console (e.g. via Scanner)
private static int count = 0;// total number of articles
private static int number = 1;// current page number
private static Spider spider = Spider.create(new getCsdn_TitleAndUrl());
private static String START_URL = "https://blog.csdn.net/" + username + "/article/list/" + number;
private Site site = Site.me()
        .setDomain("www.baidu.com")
        .setSleepTime(5000)
        .setCharset("utf-8")
        .setRetrySleepTime(3)
        .setTimeOut(1000)// set the timeout
        .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

This time a Service class is initialized, mainly for persistent data storage.

The parsing logic of the crawler:

public void process(Page page) {
    allList = new ArrayList<csdn_titleUrl_one>();
    List<String> title = page.getHtml().xpath("*[@id=\"mainBox\"]/main/div[2]/div/h4/a/text()").all();// article titles
    List<String> url = page.getHtml().xpath("*[@id=\"mainBox\"]/main/div[2]/div/h4/a").links().all();// article title URLs
    for (int i = 0; i < title.size(); i++) {
        csdn = new csdn_titleUrl_one();
        csdn.setTitle(title.get(i));
        csdn.setUrl(url.get(i));
        allList.add(csdn);
    }
    number++;
    if (allList.size() != 0) {
        int num = csdnService.insertList(allList);
        System.out.println("Stored " + num + " rows in total");
    }
    page.addTargetRequests(doListUrl());// queue the next pages to crawl
}

A page contains multiple titles and URLs, so the items are collected into a list, which is then inserted into the database in one batch.

Below that is the page.addTargetRequests() method, which adds the next pages to the crawl queue; it takes a list parameter.

The list holds the URLs of the pages to be crawled. I was lazy and assembled the URL collection by hand. The code is as follows:

/**
 * Manually generate the list of page URLs
 */
public List<String> doListUrl() {
    List<String> list = new ArrayList<String>();
    for (int i = 2; i <= 79; i++) {
        list.add("https://blog.csdn.net/" + username + "/article/list/" + i);
    }
    return list;
}

This blogger's article list has 79 pages in total, so we simply crawl all 79 pages.
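
As the class comments in the full listing below note, a cleaner approach would be to read the total page count from the first page instead of hard-coding 79. Here is a minimal sketch of that idea, meant to live in the same PageProcessor; the XPath is an assumption about CSDN's pagination markup and would need to be checked against the real page:

/**
 * Hypothetical sketch: read the total page count from the pagination footer
 * instead of hard-coding it. The XPath below is an assumed selector.
 */
public int detectTotalPages(Page page) {
    String lastPage = page.getHtml()
            .xpath("//div[@class='ui-paging-container']//li[last()]/a/text()")
            .get();
    if (lastPage == null) {
        return 1; // pagination not found; fall back to a single page
    }
    try {
        return Integer.parseInt(lastPage.trim());
    } catch (NumberFormatException e) {
        return 1;
    }
}

doListUrl() could then loop from 2 up to the detected count instead of the fixed 79.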

The MyBatis-related classes are posted below as well:

POJO class:

package pojo;

public class csdn_titleUrl_one {
    private int id;
    private String title;
    private String url;

    public csdn_titleUrl_one(int id, String title, String url) {
        this.id = id;
        this.title = title;
        this.url = url;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    @Override
    public String toString() {
        return "csdn_titleUrl_one{" +
                "id=" + id +
                ", title='" + title + '\'' +
                ", url='" + url + '\'' +
                '}';
    }

    public csdn_titleUrl_one() {
    }
}

Service class:

package mapper;

import Util.SqlsessionFactory;
import org.apache.ibatis.session.SqlSession;
import org.springframework.stereotype.Service;
import javax.annotation.Resource;
import java.util.List;
import pojo.csdn_titleUrl_one;

@Service
public class csdn_titleUrl_oneService {
    private csdn_titleUrl_oneDao csdn_titleUrl_oneDao;
    private SqlSession session;

    public csdn_titleUrl_oneService(){
        session = SqlsessionFactory.getSessionAutoConmit();
        csdn_titleUrl_oneDao = session.getMapper(csdn_titleUrl_oneDao.class);
    }
    @Resource
    public int insert(csdn_titleUrl_one pojo){
        return csdn_titleUrl_oneDao.insert(pojo);
    }

    public int insertList(List< csdn_titleUrl_one> pojos){
        return csdn_titleUrl_oneDao.insertList(pojos);
    }

    public List<csdn_titleUrl_one> select(csdn_titleUrl_one pojo){
        return csdn_titleUrl_oneDao.select(pojo);
    }

    public int update(csdn_titleUrl_one pojo){
        return csdn_titleUrl_oneDao.update(pojo);
    }

}
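
The Util.SqlsessionFactory helper used in the constructor above is never shown in the article. A minimal sketch of what it might look like, assuming a standard SqlSessionFactoryBuilder bootstrap from the mybatis-config.xml shown earlier (the class and method names, including the "Conmit" spelling, are kept exactly as the constructor calls them; the body is my reconstruction):

package Util;

import java.io.IOException;
import java.io.InputStream;
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;

// Hypothetical reconstruction of the utility class used by the Service above.
public class SqlsessionFactory {

    private static SqlSessionFactory factory;

    static {
        try (InputStream in = Resources.getResourceAsStream("mybatis-config.xml")) {
            factory = new SqlSessionFactoryBuilder().build(in);
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Opens a session with auto-commit enabled, matching how the Service uses it.
    public static SqlSession getSessionAutoConmit() {
        return factory.openSession(true);
    }
}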

DAO interface:

package mapper;

import org.apache.ibatis.annotations.Param;
import java.util.List;
import pojo.csdn_titleUrl_one;

public interface csdn_titleUrl_oneDao {

    int insert(@Param("pojo") csdn_titleUrl_one pojo);

    int insertList(@Param("pojos") List< csdn_titleUrl_one> pojo);

    List<csdn_titleUrl_one> select(@Param("pojo") csdn_titleUrl_one pojo);

    int update(@Param("pojo") csdn_titleUrl_one pojo);

}

Finally, the Mapper file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd" >
<mapper namespace="mapper.csdn_titleUrl_oneDao">

<!--auto generated Code-->
    <resultMap id="AllColumnMap" type="pojo.csdn_titleUrl_one">
        <result column="id" property="id"/>
        <result column="title" property="title"/>
        <result column="url" property="url"/>
    </resultMap>

<!--auto generated Code-->
    <sql id="all_column">
        id,
        title,
        url
    </sql>

<!--auto generated Code-->
    <insert id="insert">
        INSERT INTO csdn_titleUrl_one
        <trim prefix="(" suffix=")" suffixOverrides=",">
            <if test="pojo.id != null"> id, </if>
            <if test="pojo.title != null"> title, </if>
            <if test="pojo.url != null"> url, </if>
        </trim>
        VALUES
        <trim prefix="(" suffix=")" suffixOverrides=",">
            <if test="pojo.id != null"> #{pojo.id}, </if>
            <if test="pojo.title != null"> #{pojo.title}, </if>
            <if test="pojo.url != null"> #{pojo.url}, </if>
        </trim>
    </insert>

<!--auto generated Code-->
    <insert id="insertList">
        INSERT INTO csdn_titleUrl_one(
        <include refid="all_column"/>
        )VALUES
        <foreach collection="pojos" item="pojo" index="index" separator=",">
            (
            #{pojo.id},
            #{pojo.title},
            #{pojo.url}
            )
        </foreach>
    </insert>

<!--auto generated Code-->
    <update id="update">
        UPDATE csdn_titleUrl_one
        <set>
            <if test="pojo.id != null"> id = #{pojo.id}, </if>
            <if test="pojo.title != null"> title = #{pojo.title}, </if>
            <if test="pojo.url != null"> url = #{pojo.url} </if>
        </set>
         WHERE id = #{pojo.id}
    </update>

<!--auto generated Code-->
    <select id="select" resultMap="AllColumnMap">
        SELECT <include refid="all_column"/>
        FROM csdn_titleUrl_one
        <where>
            <if test="pojo.id != null"> AND id = #{pojo.id} </if>
            <if test="pojo.title != null"> AND title = #{pojo.title} </if>
            <if test="pojo.url != null"> AND url = #{pojo.url} </if>
        </where>
        LIMIT 1000 
    </select>

<!--auto generated Code-->
    <delete id="delete">
        DELETE FROM csdn_titleUrl_one where id = #{pojo.id}
    </delete>
</mapper>
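
The csdn_titleUrl_one table itself is never shown in the article. A schema along these lines would match the mapper above (column types and lengths are my assumption):

-- Hypothetical table definition matching the mapper; types and lengths are assumptions.
CREATE TABLE csdn_titleUrl_one (
    id    INT          NOT NULL AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    url   VARCHAR(500) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;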

Finally, here is the full code of the crawler class:

package WebMagicForCSDN;

import mapper.csdn_titleUrl_oneService;
import pojo.csdn_titleUrl_one;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * This class demonstrates WebMagic crawling multiple pages: following the page structure
 * and the page-turn URLs to fetch page after page.
 *
 * For moving to the next page, no jump URL could be found on the button at the bottom of the
 * CSDN blog page; the jump is probably wrapped in JavaScript, so the link cannot be extracted
 * directly.
 *
 * So a rather clumsy method is used: I manually checked the total number of pages and then
 * assembled the URLs of the pages to crawl by string concatenation.
 *
 * To be optimized:
 *      The page count is currently hard-coded; I looked up the number of blog pages myself
 *      before fetching the data.
 *
 *      Possible optimization: on the first visit, read the text of the "last page" button at
 *      the bottom of the page, convert it to an int and assign it to a global variable to get
 *      the total page count (not yet implemented).
 */

/**
 * Updated 2018-09-26: integrated MyBatis and used MySQL 5.5 to store the extracted data.
 */
public class getCsdn_TitleAndUrl implements PageProcessor {
    private static csdn_titleUrl_one csdn;
    private static csdn_titleUrl_oneService csdnService = new csdn_titleUrl_oneService();
    private List<csdn_titleUrl_one> allList;
    private static String username = "dog250";// username to crawl; can be changed, or read from the console (e.g. via Scanner)
    private static int count = 0;// total number of articles
    private static int number = 1;// current page number
    private static Spider spider = Spider.create(new getCsdn_TitleAndUrl());
    private static String START_URL = "https://blog.csdn.net/" + username + "/article/list/" + number;
    private Site site = Site.me()
            .setDomain("www.baidu.com")
            .setSleepTime(5000)
            .setCharset("utf-8")
            .setRetrySleepTime(3)
            .setTimeOut(1000)// set the timeout
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
    @Override
    public void process(Page page) {
        allList = new ArrayList<csdn_titleUrl_one>();
        List<String> title = page.getHtml().xpath("*[@id=\"mainBox\"]/main/div[2]/div/h4/a/text()").all();// article titles
        List<String> url = page.getHtml().xpath("*[@id=\"mainBox\"]/main/div[2]/div/h4/a").links().all();// article title URLs
        for (int i = 0; i < title.size(); i++) {
            csdn = new csdn_titleUrl_one();
            csdn.setTitle(title.get(i));
            csdn.setUrl(url.get(i));
            allList.add(csdn);
        }
        number++;
        if (allList.size() != 0) {
            int num = csdnService.insertList(allList);
            System.out.println("Stored " + num + " rows in total");
        }
        page.addTargetRequests(doListUrl());// queue the next pages to crawl
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args){
        spider
                .thread(20)// keep the thread count within about 10, otherwise read timeouts become very likely
                .addUrl(START_URL)
                .run();

    }

    /**
     * Manually generate the list of page URLs
     */
    public List<String> doListUrl() {
        List<String> list = new ArrayList<String>();
        for (int i = 2; i <= 79; i++) {
            list.add("https://blog.csdn.net/" + username + "/article/list/" + i);
        }
        return list;
    }

    /**
     * File-output helper (not called anywhere in this class).
     */
    public static void ioWrite(String str, int number) throws Exception {
        File file = new File("D:" + File.separator + "WebMagic/CSDN_dog250" + number + ".txt");
        OutputStream out = null;
        try {
            out = new FileOutputStream(file);
            byte[] data = str.getBytes();
            out.write(data);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (out != null) {
                out.close();
            }
        }
    }


}

After the final run, the crawled data is stored in the database.


Origin: blog.csdn.net/Zachariahs/article/details/82877400