Using the phpspider crawler framework

I recently used a PHP crawler framework to scrape some data and found it very convenient. The framework is phpspider, and its documentation is the place to start.

The documentation explains the usage clearly and includes demo examples, so the code below is mainly here as my own notes.

<?php
include "./autoloader.php";

use phpspider\core\phpspider;

/* Do NOT delete this comment */

$configs = array(
    'name' => 'Chinese insulation network',
    'domains' => array(
        'www.cnbaowen.net',
        'cnbaowen.net'
    ),
    'scan_urls' => array(
        'http://www.cnbaowen.net/news/list-3720-1.html'
    ),

    'export' => array(
        'type' => 'db',
        'table' => 'articles_mc',
    ),

    'db_config' => array(
        'host'  => '127.0.0.1',
        'port'  => 3306,
        'user'  => 'root',
        'pass'  => '123456',
        'name'  => 'spider',
    ),

    'content_url_regexes' => array(
        "http://www.cnbaowen.net/news/show-\d+.html"
    ),

    'list_url_regexes' => array(
        "http://www.cnbaowen.net/news/list-3720-\d+.html"
    ),

    'fields' => array(
        array(
            // extract the title from the article content page
            'name' => "title",
            'selector' => "//h1[@id='title']",
            'required' => true
        ),
        array(
            // extract the article body from the content page
            'name' => "content",
            'selector' => "//div[@id='content']",
            'required' => true
        ),
        array(
            // no selector: the value is assigned in on_extract_field
            'name' => "type"
        ),
        array(
            // no selector: the value is assigned in on_extract_field
            'name' => "site_id"
        ),
    ),
);
$spider = new phpspider($configs);


$spider->on_list_page = function($page, $content, $spider){
    for ($i = 2; $i < 24; $i++)
    {
        $url = "http://www.cnbaowen.net/news/list-3720-{$i}.html";
        $spider->add_url($url);
    }
};

$spider->on_extract_field = function($fieldname, $data, $page){
    if($fieldname == "type"){
        return 2;
    }elseif($fieldname == "content"){
        $s = preg_replace("/<div style=\"float:right[\s\S]*?div>/","",$data);
        $s = preg_replace('/<a .*?href="(.*?)".*?>/is',"<a href='#'>",$s);
        $data = preg_replace('/<img.*?>/is',"",$s);
        return $data;
    }elseif($fieldname == "site_id"){
        return 1;
    }else{
        return $data;
    }
};

$spider->start();
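To make the role of `list_url_regexes` and `content_url_regexes` more concrete, here is a minimal standalone sketch that classifies a URL with the same pattern strings as the config above. The helper `classify_url` is hypothetical and not part of phpspider, and the way it wraps each pattern in `#^...$#i` is an assumption for illustration; the framework's internal matching may differ.

```php
<?php
// Hypothetical helper (not a phpspider API): decide whether a URL looks like
// a list page or a content page, using the pattern strings from the config.
function classify_url(string $url): string
{
    $list_regexes    = array("http://www.cnbaowen.net/news/list-3720-\d+.html");
    $content_regexes = array("http://www.cnbaowen.net/news/show-\d+.html");

    foreach ($list_regexes as $regex) {
        // "#" is the delimiter so the slashes in the URL need no escaping.
        // Anchoring with ^ and $ is an assumption made for this sketch.
        if (preg_match('#^' . $regex . '$#i', $url)) {
            return 'list';
        }
    }
    foreach ($content_regexes as $regex) {
        if (preg_match('#^' . $regex . '$#i', $url)) {
            return 'content';
        }
    }
    return 'other';
}

echo classify_url("http://www.cnbaowen.net/news/list-3720-5.html"), "\n"; // prints "list"
```

Note that the dots in these patterns are unescaped, so each one matches any single character. A stricter pattern would write `\.html` instead of `.html`.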

Note: I only needed to scrape the title and the content from each page, but the database table needs two more fields. That is why the field definitions also declare `type` and `site_id`. Their actual values are assigned in the `on_extract_field` callback.
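The callback logic can be tried without running a crawl. The sketch below copies the branches of `on_extract_field` into a plain function and feeds it a made-up HTML fragment. The function name `extract_field` and the sample markup are assumptions for illustration only; the regular expressions are the ones from the callback above.

```php
<?php
// Hypothetical standalone version of the on_extract_field logic.
function extract_field(string $fieldname, $data)
{
    if ($fieldname == "type") {
        return 2;                       // fixed value: technical data
    } elseif ($fieldname == "content") {
        // Drop the right-floated <div> block (non-greedy, up to the first "div>").
        $s = preg_replace("/<div style=\"float:right[\s\S]*?div>/", "", $data);
        // Replace every opening <a ...> tag with a neutral placeholder link.
        $s = preg_replace('/<a .*?href="(.*?)".*?>/is', "<a href='#'>", $s);
        // Remove all <img> tags.
        return preg_replace('/<img.*?>/is', "", $s);
    } elseif ($fieldname == "site_id") {
        return 1;                       // fixed value for this site
    }
    return $data;                       // other fields pass through unchanged
}

// Made-up sample markup, not taken from the real site.
$sample = '<div style="float:right">ad</div>'
        . '<p>See <a class="x" href="http://example.com/a">this</a><img src="a.png"></p>';

echo extract_field("content", $sample), "\n";
// prints: <p>See <a href='#'>this</a></p>
```

The `type` and `site_id` branches ignore `$data` entirely, which is why those two fields can be declared without a selector.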

The accompanying SQL statement:

CREATE TABLE `articles_mc` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(200) DEFAULT NULL,
  `content` text,
  `type` int(5) DEFAULT '0' COMMENT 'Article type: 1 industry information, 2 technical data',
  `site_id` int(5) DEFAULT NULL COMMENT 'Site id',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4887 DEFAULT CHARSET=utf8mb4;

Origin www.cnblogs.com/itsuibi/p/11100780.html