ThinkPHP5 采集网页的指定内容

版权声明:支持技术共享 https://blog.csdn.net/qq_38350907/article/details/85162339

荆轲刺秦王

因业务需求,需要做一个网页的信息采集功能。这个网页就是安居客的新房的列表页。

第一步:一开始,我用最基本的采集,采集一点很基本的内容,就是网页 html 的的<title>标签的内容,采集出来的是乱码

问过同事后才明白:原来有些网站为了优化,会使用 gzip 压缩,这样就导致我们采集的信息一直是乱码。

如何检测网页是否使用了 gzip 压缩?

1.谷歌浏览器  F12打开页面

2.右键点击 Waterfall  > Response Headers > Content-Encoding

3. 如何开启了 gzip 则会显示 gzip 没有则为空

扫描二维码关注公众号,回复: 4785698 查看本文章

解决办法:使用下面的函数:

function https_request($url='',$gz=false){
        $curl=curl_init();
        if($gz){curl_setopt($curl,CURLOPT_ENCODING,'gzip');}//加入gzip解析
        curl_setopt($curl,CURLOPT_URL,$url);
        curl_setopt($curl,CURLOPT_SSL_VERIFYPEER,FALSE);
        curl_setopt($curl,CURLOPT_SSL_VERIFYHOST,FALSE);
        curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);
        $data=curl_exec($curl);
        if(curl_errno($curl)){
            return 'ERROR'.curl_error($curl);
        }
        curl_close($curl);
        return $data;
    }

第一个参数传入需要抓取页面的 url 第二个参数设为 true

然后,我将抓取页面信息的函数封装成一个方法,如果同学需要抓取其他页面,需要修改其中的正则表达式!

//this is collect function
    public function getCollectContent($url){
        $content = $this->https_request($url,true);
        //echo $content;exit();
        $title_preg='#<h3><span class="items-name">(.*)</span></h3>#i';
        preg_match_all($title_preg,$content,$titles);
        //print_r($res[1]);exit;
        $newarray = array();
        foreach($titles[1] as $k=>$v){
            $newarray[$k] = array(
                'ktitle' => $v,
            );
        }
        //print_r($newarray);
        //this  is  address and area collect
        $address_preg='#<span class="list-map" target="_blank">(.*)</span>#i';
        preg_match_all($address_preg,$content,$address);
        //print_r($address[1]);exit();
        foreach($address[1] as $k=>$v){
            foreach($newarray as $key=>$value){
                if($k == $key){
                    $newarray[$key]['address']=substr(str_replace('&nbsp;','',$v),strpos(str_replace('&nbsp;','',$v),']')+1);
                    $newarray[$key]['area']=substr(str_replace('&nbsp;','',$v),1,(strpos(str_replace('&nbsp;','',$v), ']') -1));
                }
            }
        }

        //this is  sole status
        $sale_preg='#<i class="status-icon .*s.*">(.*)</i>#i';
        preg_match_all($sale_preg,$content,$sales);
        //print_r($sales[1]);exit;
        foreach($sales[1] as $k=>$v){
            foreach($newarray as $key=>$value){
                if($k == $key){
                    $newarray[$key]['isflag']=$v;
                }
            }
        }

        //this is web_id
        $web_info_preg = '/<div class="item-mod " data-link="(.*?)" data-soj="(.*?)"  rel="nofollow">/';
        preg_match_all($web_info_preg,$content,$web_infos);
        //print_r($web_infos[1]);exit;
        foreach($web_infos[1] as $k=>$v){
            foreach($newarray as $key=>$value){
                if($k == $key){
                    $array = explode('/', $v);
                    $newarray[$key]['web_id'] = substr($array[2],0,strpos($array[2], '.'));
                    $newarray[$key]['web_domain'] = substr($array[4],0,strpos($array[4], '.'));
                    $newarray[$key]['web_fenqi'] = '';
                    $newarray[$key]['fromweb'] = 2;
                    $newarray[$key]['fromurl'] = $v;
                    $newarray[$key]['city_id'] = $this->city_id;
                    $newarray[$key]['user_id'] = UID;
                    $newarray[$key]['addtime'] = date("Y-m-d H:i",time());
                    $newarray[$key]['uptime'] = 0;
                    $newarray[$key]['deltime'] = 0;
                    $newarray[$key]['state'] = 1;
                    $newarray[$key]['cai_id'] = null;
                }
            }
        }
        return $newarray;
    }

其实我想重点讲解一下这个采集的方法,或者说是函数!

里面在采集 address 的时候,使用的字符串截取和去除 &nbsp; 是写到一块了,所以千万别被迷惑了。

主要内容就是将:[&nbsp;汤阴&nbsp;汤阴&nbsp;]&nbsp;乾坤大道与德贤路交汇处  处理为:汤阴汤阴  和  乾坤大道与德贤路交汇处  将网页中的内容截分为两个数据库的字段。

上面在采集具体某一个楼盘链接的时候,我顺带将需要插入数据库的所有字段都做了补充,方便后面直接使用 TP5 的 insertAll

方法一次性全部插入。不过就这个项目而言这个思路被我废弃了。不管怎样,你只需要知道 我写的这个方法返回的是一个二维数组就可以了!

然后,考虑到列表页肯定不止一页数据,而且城市不一样 楼盘数量肯定也不一样,我专门做了一个 getPage 方法,这个方法也很简单,就是先使用采集获取到一共有多少条数据,

这个 getPage 的思路也很简单,就是获取这个所有的楼盘数量,然后除以每页显示的数量,然后再向上取整 : 

//this is for get page function
    public function getPage(){
        $domain = $this->getCityDomain();
        //echo $domain;exit();
        $url = "https://".$domain.".fang.anjuke.com/loupan/?from=navigation";
        //echo $url;exit();
        $content = $this->https_request($url,true);
        //echo $content;exit();
        $em_preg='#<span class="result">.*<em>(.*)</em>#isU';
        preg_match_all($em_preg,$content,$result);
        //print_r($result);exit;
        $res = $result[1][0];
        $allPage = ceil($res/60);
        return $allPage;
    }

这个 getPage 函数主要也是为了获取第二页,第三页等等的链接地址的。

for($i=1;$i<=$this->getPage(); $i++) {
                sleep($wait_sec);
                $url = "https://".$domain.".fang.anjuke.com/loupan/all/p$i/";
                $res = $this->getCollectContent($url);
                //print_r($res);exit();
                $this->checkData($res);
            }

可以看到,我每次采集都会先使程序 sleep 几秒,我写这个是因为项目需要,一般情况是可以省略的。

然后使用 foreach 循环获取第一页,第二页,第三页等等的内容。

最后我还写了一个 checkData  解释一下这个函数:这个是为了解决:客户第一次采集完数据之后,如果再次采集,重复的数据作为更新,不重复的数据将作为新增数据,新增到数据库。

代码:

//this is for check data is exists
    public function checkData($obj_array){
        //$data = Db::table('tb_house_cai')->select();
        //print_r($obj_array);
        foreach($obj_array as $k=>$v){
            $result = Db::table('tb_house_cai')->where('ktitle',$v['ktitle'])->select();
            if(empty($result)){
                Db::table('tb_house_cai')->insert($v);
            }else{
                $res = Db::table('tb_house_cai')->where('ktitle',$v['ktitle'])->find();
                $v['cai_id'] = $res['cai_id'];
                Db::table('tb_house_cai')->update($v);
            }
        }
    }

注意!

这里 个人认为非常不理想,我一开始的思路是这样的:

public function checkDta($obj_array){
        $data = Db::table('tb_house_cai')->select();
        //print_r($obj_array);
        foreach($data as $k=>$v){
            foreach($obj_array as $key=>$value){
                if($v['ktitle'] == $value['ktitle']){
                    $obj_array[$key]['cai_id'] = $data[$k]['cai_id'];
                    Db::table('tb_house_cai')->update($v);
                }else{
                    Db::table('tb_house_cai')->insert($v);
                }
            }
        }
    }

我认为不好的地方也很明显,就是连接数据库的次数太多,我一开始的思路按照上面的代码操作,但是结果并不理想,或者说结果与我预期的效果不符合。我最后没办法,只能放弃这种思路,如果有同学可以帮我指出问题,可以在评论区里讨论。

最后,上整个控制器的代码:

这里面有个别地方需要借鉴的同学修改!

<?php 
namespace app\web\admin;

use app\admin\controller\Admin;
use app\common\builder\ZBuilder;//引入Zbuilder;
use app\sys\model\City as CityModel;
use think\Db;//引入Bb类

/**
* 楼盘信息采集管理
*/
class Collect extends Admin
{

    private $city_id;

    public function index()
    {

        if ($this->request->isPost()) {
            // 接收表单数据
            $data = $this->request->post();
            if($data['city_id'] == null)$this->error('城市不能为空');
            $wait_sec = $data['wait_sec'];
            //echo $wait_sec;exit();
            if($data['city_id'] == null){
                $city_id = 1;
            }else{
                $city_id = $data['city_id'];
            }
            $this->city_id = $city_id;
            $domain = $this->getCityDomain();
            //https://zz.fang.anjuke.com/loupan/all/p1/
            //echo $this->getPage();exit();
            for($i=1;$i<=$this->getPage(); $i++) {
                sleep($wait_sec);
                $url = "https://".$domain.".fang.anjuke.com/loupan/all/p$i/";
                $res = $this->getCollectContent($url);
                //print_r($res);exit();
                $this->checkData($res);
            }
            $this->success('采集完成');
        }
        $city_list = CityModel::getCityList();
        $wait_list = array(
            '3'=>'三秒',
            '5'=>'五秒',
            '10'=>'十秒',
            '15'=>'十五秒',
            '20'=>'二十秒',
            '25'=>'二十五秒',
            '30'=>'三十秒',
            '45'=>'四十五秒',
            '60'=>'一分钟',
            '120'=>'两分钟',
        );

        return ZBuilder::make('form')
            ->setPageTitle('采集链接')
            ->addFormItem('select', 'city_id', '请选择城市', '', $city_list)
            ->addFormItem('select', 'wait_sec', '请选择采集等待时间', '', $wait_list)
            //->addFormItem('text', 'url', '请输入需要采集的链接')
            ->fetch();
    }


    function https_request($url='',$gz=false){
        $curl=curl_init();
        if($gz){curl_setopt($curl,CURLOPT_ENCODING,'gzip');}//加入gzip解析
        curl_setopt($curl,CURLOPT_URL,$url);
        curl_setopt($curl,CURLOPT_SSL_VERIFYPEER,FALSE);
        curl_setopt($curl,CURLOPT_SSL_VERIFYHOST,FALSE);
        curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);
        $data=curl_exec($curl);
        if(curl_errno($curl)){
            return 'ERROR'.curl_error($curl);
        }
        curl_close($curl);
        return $data;
    }

    function DeleteHtml($str)
    {
        $str = trim($str); //清除字符串两边的空格
        $str = preg_replace("/\t/","",$str); //使用正则表达式替换内容,如:空格,换行,并将替换为空。
        $str = preg_replace("/\r\n/","",$str);
        $str = preg_replace("/\r/","",$str);
        $str = preg_replace("/\n/","",$str);
        $str = preg_replace("/ /","",$str);
        $str = preg_replace("/  /","",$str);  //匹配html中的空格
        return trim($str); //返回字符串
    }

    function SubStr($tinfo=NULL,$ks=NULL,$js=NULL,$s=1,$e=0) {
        if(empty($tinfo)){
            return '';
        }
        if(strpos($tinfo, $ks) == FALSE) {
            return '';
        }
        if(!empty($ks)){
            $arry_tinfo=explode($ks,$tinfo);
            if(!empty($arry_tinfo)){
                $tinfo=$arry_tinfo[$s];
            }
        }
        if(!empty($js)){
            $arry_tinfo=explode($js,$tinfo);
            if(!empty($arry_tinfo)){
                $tinfo=$arry_tinfo[$e];
            }
        }
        return $tinfo;
    }

    public function getCityDomain(){
        $array = array(
            '1'=>'zz',
            '2'=>'kf',
            '3'=>'ly',
            '4'=>'pds',
            '5'=>'ay',
            '6'=>'jz',
            '7'=>'hb',
            '8'=>'xx',
            '9'=>'py',
            '10'=>'xc',
            '11'=>'lh',
            '12'=>'smx',
            '13'=>'ny',
            '14'=>'sq',
            '15'=>'zk',
            '16'=>'xy',
            '17'=>'zmd',
            '18'=>'jy',
            '19'=>'zm'
        );
        return $array[$this->city_id];
    }


    //this is collect function
    public function getCollectContent($url){
        $content = $this->https_request($url,true);
        //echo $content;exit();
        $title_preg='#<h3><span class="items-name">(.*)</span></h3>#i';
        preg_match_all($title_preg,$content,$titles);
        //print_r($res[1]);exit;
        $newarray = array();
        foreach($titles[1] as $k=>$v){
            $newarray[$k] = array(
                'ktitle' => $v,
            );
        }
        //print_r($newarray);
        //this  is  address and area collect
        $address_preg='#<span class="list-map" target="_blank">(.*)</span>#i';
        preg_match_all($address_preg,$content,$address);
        //print_r($address[1]);exit();
        foreach($address[1] as $k=>$v){
            foreach($newarray as $key=>$value){
                if($k == $key){
                    $newarray[$key]['address']=substr(str_replace('&nbsp;','',$v),strpos(str_replace('&nbsp;','',$v),']')+1);
                    $newarray[$key]['area']=substr(str_replace('&nbsp;','',$v),1,(strpos(str_replace('&nbsp;','',$v), ']') -1));
                }
            }
        }

        //this is  sole status
        $sale_preg='#<i class="status-icon .*s.*">(.*)</i>#i';
        preg_match_all($sale_preg,$content,$sales);
        //print_r($sales[1]);exit;
        foreach($sales[1] as $k=>$v){
            foreach($newarray as $key=>$value){
                if($k == $key){
                    $newarray[$key]['isflag']=$v;
                }
            }
        }

        //this is web_id
        $web_info_preg = '/<div class="item-mod " data-link="(.*?)" data-soj="(.*?)"  rel="nofollow">/';
        preg_match_all($web_info_preg,$content,$web_infos);
        //print_r($web_infos[1]);exit;
        foreach($web_infos[1] as $k=>$v){
            foreach($newarray as $key=>$value){
                if($k == $key){
                    $array = explode('/', $v);
                    $newarray[$key]['web_id'] = substr($array[2],0,strpos($array[2], '.'));
                    $newarray[$key]['web_domain'] = substr($array[4],0,strpos($array[4], '.'));
                    $newarray[$key]['web_fenqi'] = '';
                    $newarray[$key]['fromweb'] = 2;
                    $newarray[$key]['fromurl'] = $v;
                    $newarray[$key]['city_id'] = $this->city_id;
                    $newarray[$key]['user_id'] = UID;
                    $newarray[$key]['addtime'] = date("Y-m-d H:i",time());
                    $newarray[$key]['uptime'] = 0;
                    $newarray[$key]['deltime'] = 0;
                    $newarray[$key]['state'] = 1;
                    $newarray[$key]['cai_id'] = null;
                }
            }
        }
        return $newarray;
    }

    //this is for get page function
    public function getPage(){
        $domain = $this->getCityDomain();
        //echo $domain;exit();
        $url = "https://".$domain.".fang.anjuke.com/loupan/?from=navigation";
        //echo $url;exit();
        $content = $this->https_request($url,true);
        //echo $content;exit();
        $em_preg='#<span class="result">.*<em>(.*)</em>#isU';
        preg_match_all($em_preg,$content,$result);
        //print_r($result);exit;
        $res = $result[1][0];
        $allPage = ceil($res/60);
        return $allPage;
    }

    //this is for check data is exists
    public function checkData($obj_array){
        //$data = Db::table('tb_house_cai')->select();
        //print_r($obj_array);
        foreach($obj_array as $k=>$v){
            $result = Db::table('tb_house_cai')->where('ktitle',$v['ktitle'])->select();
            if(empty($result)){
                Db::table('tb_house_cai')->insert($v);
            }else{
                $res = Db::table('tb_house_cai')->where('ktitle',$v['ktitle'])->find();
                $v['cai_id'] = $res['cai_id'];
                Db::table('tb_house_cai')->update($v);
            }
        }
    }


    public function checkDta($obj_array){
        $data = Db::table('tb_house_cai')->select();
        //print_r($obj_array);
        foreach($data as $k=>$v){
            foreach($obj_array as $key=>$value){
                if($v['ktitle'] == $value['ktitle']){
                    $obj_array[$key]['cai_id'] = $data[$k]['cai_id'];
                    Db::table('tb_house_cai')->update($v);
                }else{
                    Db::table('tb_house_cai')->insert($v);
                }
            }
        }
    }





}

关于采集:

有兴趣的同学可以学习使用一下: phpQuery 

不过这个 phpQuery v3 版本的用着没有 v4 版本的好用,但是 v4 版本的需要 php 7.0以上,所以我放弃了,但是如果条件允许,phpQuery 将是个非常好用的采集插件!

猜你喜欢

转载自blog.csdn.net/qq_38350907/article/details/85162339