Using pthreads v3 multithreading in PHP to grab Sina news

We will use pthreads to write a small multi-threaded page crawler and store the results in a database.

The data table structure is as follows:

CREATE TABLE `tb_sina` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `url` varchar(256) DEFAULT '' COMMENT 'url address',
  `title` varchar(128) DEFAULT '' COMMENT 'title',
  `time` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP COMMENT 'time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2001 DEFAULT CHARSET=utf8mb4 COMMENT='Sina news';

The code is as follows:

<?php

class DB extends Worker
{
    private static $db;
    private $dsn;
    private $root;
    private $pwd;

    public function __construct($dsn, $root, $pwd)
    {
        $this->dsn = $dsn;
        $this->root = $root;
        $this->pwd = $pwd;
    }

    public function run()
    {
        //create connection object
        self::$db = new PDO($this->dsn, $this->root, $this->pwd);

        //Keep the require in the worker thread, not in the main thread, otherwise the child threads will report that the class cannot be found
        require './vendor/autoload.php';
    }

    //return a connection resource
    public function getConn()
    {
        return self::$db;
    }
}

class Sina extends Thread
{
    private $name;
    private $url;

    public function __construct($name, $url)
    {
        $this->name = $name;
        $this->url = $url;
    }

    public function run()
    {
        $db = $this->worker->getConn();

        if (empty($db) || empty($this->url)) {
            return false;
        }

        $content = file_get_contents($this->url);
        if (!empty($content)) {
            //Get title, address, time
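            //The trailing Query() arguments are the range selector (empty here), the output encoding, and the input encoding (the roll page is served as GB2312)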
            $data = QL\QueryList::Query($content, [
                'tit' => ['.c_tit > a', 'text'],
                'url' => ['.c_tit > a', 'href'],
                'time' => ['.c_time', 'text'],
            ], '', 'UTF-8', 'GB2312')->getData();

            // Insert the retrieved data into the database
            if (!empty($data)) {
                $sql = 'INSERT INTO tb_sina(`url`, `title`, `time`) VALUES';
                foreach ($data as $row) {
                    //Normalize the time; Sina's format is like 04-23 15:30, so prepend the year and append the seconds
                    $time = date('Y') . '-' . $row['time'] . ':00';
                    $sql .= "('{$row['url']}', '{$row['tit']}', '{$time}'),";
                }
                $sql = rtrim($sql, ',');
                $ret = $db->exec($sql);

                if ($ret !== false) {
                    echo "Thread {$this->name} successfully inserted {$ret} pieces of data\n";
                } else {
                    var_dump($db->errorInfo());
                }
            }
        }
    }
}

// grab the page address
$url = 'http://roll.news.sina.com.cn/s/channel.php?ch=01#col=89&spec=&type=&ch=01&k=&offset_page=0&offset_num=0&num=60&asc=&page=';
//create pool
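//5 worker threads of the DB class, each constructed with (dsn, user, password)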
$pool = new Pool(5, 'DB', ['mysql:dbname=test;host=192.168.33.226', 'root', '']);

//Submit 100 pages to grab
for ($ix = 1; $ix <= 100; $ix++) {
    $pool->submit(new Sina($ix, $url . $ix));
}

//Collect garbage in a loop; this blocks the main thread until the child threads finish
while ($pool->collect());
$pool->shutdown();
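
A note on the environment (an assumption about your setup, not something the script checks): pthreads v3 requires a thread-safe (ZTS) build of PHP 7 and only runs from the CLI, with the extension enabled in php.ini:

; php.ini for the CLI - assumes a ZTS PHP 7 build with the pthreads extension installed (e.g. via pecl install pthreads)
extension=pthreads.so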

Since QueryList is used, install it through Composer:

composer require jaeger/querylist

However, the version installed is 3.2, which runs into problems on my PHP 7.2: each() has been deprecated, so you need to edit the source code and replace the each() calls with foreach.
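
For reference, the typical mechanical replacement looks like this (the variable names are illustrative, not copied from the QueryList source):

//Deprecated since PHP 7.2: each() returns the current key/value pair and advances the internal pointer
while (list($key, $value) = each($array)) {
    echo "$key => $value\n";
}

//Equivalent replacement with foreach
foreach ($array as $key => $value) {
    echo "$key => $value\n";
}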

The results are as follows: each thread prints how many rows it inserted, and the data is also stored in the tb_sina table.

Of course, you can also fetch the full article content by requesting each stored URL again. I won't demonstrate that in detail here, but a rough sketch follows; those who are interested can flesh it out themselves.
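
A minimal single-threaded sketch (the '#artibody' selector and the UTF-8 input encoding are assumptions about the article pages, not verified here):

<?php

require './vendor/autoload.php';

$db = new PDO('mysql:dbname=test;host=192.168.33.226', 'root', '');

//Read back the URLs collected by the crawler
foreach ($db->query('SELECT id, url FROM tb_sina') as $row) {
    $html = file_get_contents($row['url']);
    if (empty($html)) {
        continue;
    }

    //Extract the article body; '#artibody' is a guessed selector for Sina article pages
    $data = QL\QueryList::Query($html, [
        'content' => ['#artibody', 'text'],
    ], '', 'UTF-8', 'UTF-8')->getData();

    if (!empty($data[0]['content'])) {
        echo "Article {$row['id']}: " . mb_strlen($data[0]['content']) . " characters\n";
    }
}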
