We use pthreads to write a small multi-threaded page crawler and store the results in the database.
The data table structure is as follows:
CREATE TABLE `tb_sina` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `url` varchar(256) DEFAULT '' COMMENT 'url address',
  `title` varchar(128) DEFAULT '' COMMENT 'title',
  `time` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP COMMENT 'time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2001 DEFAULT CHARSET=utf8mb4 COMMENT='Sina news';
The code is shown below:
<?php

class DB extends Worker
{
    private static $db;
    private $dsn;
    private $root;
    private $pwd;

    public function __construct($dsn, $root, $pwd)
    {
        $this->dsn  = $dsn;
        $this->root = $root;
        $this->pwd  = $pwd;
    }

    public function run()
    {
        // Create the connection object
        self::$db = new PDO($this->dsn, $this->root, $this->pwd);
        // Put require in the worker thread, not in the main thread,
        // otherwise a "class not found" error will be reported
        require './vendor/autoload.php';
    }

    // Return a connection resource
    public function getConn()
    {
        return self::$db;
    }
}

class Sina extends Thread
{
    private $name;
    private $url;

    public function __construct($name, $url)
    {
        $this->name = $name;
        $this->url  = $url;
    }

    public function run()
    {
        $db = $this->worker->getConn();
        if (empty($db) || empty($this->url)) {
            return false;
        }
        $content = file_get_contents($this->url);
        if (!empty($content)) {
            // Get title, address, time
            $data = QL\QueryList::Query($content, [
                'tit'  => ['.c_tit > a', 'text'],
                'url'  => ['.c_tit > a', 'href'],
                'time' => ['.c_time', 'text'],
            ], '', 'UTF-8', 'GB2312')->getData();
            // Insert the retrieved data into the database
            if (!empty($data)) {
                $sql = 'INSERT INTO tb_sina(`url`, `title`, `time`) VALUES';
                foreach ($data as $row) {
                    // Complete the time; Sina's time format looks like "04-23 15:30"
                    $time = date('Y') . '-' . $row['time'] . ':00';
                    $sql .= "('{$row['url']}', '{$row['tit']}', '{$time}'),";
                }
                $sql = rtrim($sql, ',');
                $ret = $db->exec($sql);
                if ($ret !== false) {
                    echo "Thread {$this->name} successfully inserted {$ret} rows\n";
                } else {
                    var_dump($db->errorInfo());
                }
            }
        }
    }
}

// The page address to crawl
$url = 'http://roll.news.sina.com.cn/s/channel.php?ch=01#col=89&spec=&type=&ch=01&k=&offset_page=0&offset_num=0&num=60&asc=&page=';

// Create the pool
$pool = new Pool(5, 'DB', ['mysql:dbname=test;host=192.168.33.226', 'root', '']);

// Fetch 100 pages of data
for ($ix = 1; $ix <= 100; $ix++) {
    $pool->submit(new Sina($ix, $url . $ix));
}

// Collect garbage in a loop, blocking the main thread until the child threads finish
while ($pool->collect());

$pool->shutdown();
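One caveat: the INSERT statement above is built by string interpolation, so a title containing a quote character would break the query (or open it to SQL injection). Below is a minimal sketch of a safer variant using PDO prepared statements; it assumes the same $db connection and $data array that run() already has, and is not part of the original code:

// Sketch: parameterized bulk insert, assuming $db (PDO) and $data as in run() above
$placeholders = rtrim(str_repeat('(?, ?, ?),', count($data)), ',');
$stmt = $db->prepare("INSERT INTO tb_sina(`url`, `title`, `time`) VALUES {$placeholders}");

$params = [];
foreach ($data as $row) {
    $params[] = $row['url'];
    $params[] = $row['tit'];
    $params[] = date('Y') . '-' . $row['time'] . ':00';
}
$stmt->execute($params);

With this form the driver handles quoting, so odd characters in titles no longer matter.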
Since QueryList is used, you can install it through Composer:
composer require jaeger/querylist
However, the version installed this way is 3.2, which has problems under my PHP 7.2: each() has been deprecated, so you have to edit the source code and replace each() with foreach.
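The typical replacement looks like this (the variable names are just for illustration):

// Before: each() is deprecated as of PHP 7.2
while (list($key, $value) = each($arr)) {
    // ...
}

// After: equivalent iteration with foreach
foreach ($arr as $key => $value) {
    // ...
}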
When run, each thread reports how many rows it inserted (via the echo in run()), and the data is stored in the database.
Of course, you could also take each stored URL and fetch the full page content in a second pass. I won't do a full demonstration here, but a rough sketch of the idea follows; those who are interested can flesh it out themselves.
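A minimal sketch of that follow-up fetch, assuming $articleUrl holds one of the url values stored in tb_sina; the '.article' selector and the GB2312 encoding are assumptions, so inspect the real detail page to find the right values:

// Sketch: fetch the body of one stored article URL
// '.article' is a hypothetical selector; adjust after inspecting the page
$html = file_get_contents($articleUrl);
if ($html !== false) {
    $detail = QL\QueryList::Query($html, [
        'content' => ['.article', 'text'],
    ], '', 'UTF-8', 'GB2312')->getData();
    var_dump($detail);
}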