Recently I used a PHP crawler framework to scrape some data and found it very convenient. The first thing to read is the framework's own documentation: the phpspider framework documentation.
The documentation explains usage very clearly, and I started from its demo example. The code below is kept here as my own notes.
    <?php
    include "./autoloader.php";
    use phpspider\core\phpspider;

    /* Do NOT delete this comment */
    /* Do not delete this comment */

    $configs = array(
        'name' => 'Chinese insulation network',
        'domains' => array(
            'www.cnbaowen.net',
            'cnbaowen.net'
        ),
        // Entry point: first page of the news list
        'scan_urls' => array(
            'http://www.cnbaowen.net/news/list-3720-1.html'
        ),
        // Write extracted rows straight into a database table
        'export' => array(
            'type' => 'db',
            'table' => 'articles_mc',
        ),
        'db_config' => array(
            'host' => '127.0.0.1',
            'port' => 3306,
            'user' => 'root',
            'pass' => '123456',
            'name' => 'spider',
        ),
        'content_url_regexes' => array(
            "http://www.cnbaowen.net/news/show-\d+.html"
        ),
        'list_url_regexes' => array(
            "http://www.cnbaowen.net/news/list-3720-\d+.html"
        ),
        'fields' => array(
            array(
                // Article title, extracted from the content page
                'name' => "title",
                'selector' => "//h1[@id='title']",
                'required' => true
            ),
            array(
                // Article body, extracted from the content page
                'name' => "content",
                'selector' => "//div[@id='content']",
                'required' => true
            ),
            array(
                // Not scraped; value is assigned in on_extract_field
                'name' => "type"
            ),
            array(
                // Not scraped; value is assigned in on_extract_field
                'name' => "site_id"
            ),
        ),
    );

    $spider = new phpspider($configs);

    // Queue list pages 2 through 23 explicitly
    $spider->on_list_page = function($page, $content, $spider) {
        for ($i = 2; $i < 24; $i++) {
            $url = "http://www.cnbaowen.net/news/list-3720-{$i}.html";
            $spider->add_url($url);
        }
    };

    $spider->on_extract_field = function($fieldname, $data, $page) {
        if ($fieldname == "type") {
            return 2;
        } elseif ($fieldname == "content") {
            // Remove the right-floated div, neutralise link targets, strip images
            $s = preg_replace("/<div style=\"float:right[\s\S]*?div>/", "", $data);
            $s = preg_replace('/<a .*?href="(.*?)".*?>/is', "<a href='#'>", $s);
            $data = preg_replace('/<img.*?>/is', "", $s);
            return $data;
        } elseif ($fieldname == "site_id") {
            return 1;
        } else {
            return $data;
        }
    };

    $spider->start();
Note: I only needed to scrape the title and the content from each page, but the database rows need two more fields. That is why the field definitions also declare `type` and `site_id`, even though neither has a selector. Their values are not scraped at all; they are assigned in the `on_extract_field` callback.
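To check the field logic without running the whole crawl, the callback body can be pulled into a plain function and fed a sample string. This is only a sketch: the function name `clean_field` and the sample HTML are made up for illustration, and the regular expressions are the same ones used in the callback.

```php
<?php
// Hypothetical stand-alone copy of the on_extract_field logic.
// clean_field() is not part of phpspider; it exists only for this test.
function clean_field(string $fieldname, $data)
{
    if ($fieldname === 'type') {
        return 2; // fixed value, never scraped from the page
    }
    if ($fieldname === 'site_id') {
        return 1; // fixed value, never scraped from the page
    }
    if ($fieldname === 'content') {
        // Drop the right-floated <div> block
        $s = preg_replace("/<div style=\"float:right[\s\S]*?div>/", "", $data);
        // Replace every link target with '#'
        $s = preg_replace('/<a .*?href="(.*?)".*?>/is', "<a href='#'>", $s);
        // Strip <img> tags
        return preg_replace('/<img.*?>/is', "", $s);
    }
    return $data; // all other fields pass through unchanged
}

// Made-up sample input, not taken from the real site
$html = '<div style="float:right">share</div>'
      . '<p>See <a class="x" href="http://example.com/a">this</a><img src="a.png"></p>';

echo clean_field('content', $html), PHP_EOL;
// prints: <p>See <a href='#'>this</a></p>
```

Keeping the cleaning rules in a function like this makes it easy to adjust a regex and see the effect straight away, before pointing the crawler at the site again.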
Here is the accompanying SQL statement for the table:
    CREATE TABLE `articles_mc` (
      `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `title` varchar(200) DEFAULT NULL,
      `content` text,
      `type` int(5) DEFAULT '0' COMMENT 'Article type: 1 industry information, 2 technical data',
      `site_id` int(5) DEFAULT NULL COMMENT 'Site id',
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=4887 DEFAULT CHARSET=utf8mb4;