A simple PHP + phpQuery crawler for Jingdong product categories

This is a simple crawler that uses plain PHP plus phpQuery to scrape Jingdong's product category page. phpQuery makes it easy to extract the HTML content you want. Its API closely mirrors jQuery's, so if you already know jQuery you can get started quickly.

1. Download phpQuery and place it in the phpQuery folder under the web root directory

phpQuery download: https://code.google.com/p/phpquery/downloads/list

A phpQuery tutorial is available here: https://code.google.com/p/phpquery/

2. The crawler script

<?php
/*
 * Created on 2015-1-29
 */
header("Content-type:text/html; charset=utf-8");

// Fetch a page and convert it from GBK to UTF-8.
function getPage($url)
{
  $cnt = file_get_contents($url);
  return mb_convert_encoding($cnt, "UTF-8", "GBK");
}

include 'phpQuery/phpQuery.php';
$url = 'http://www.jd.com/allSort.aspx';

// Let phpQuery fetch and parse the page itself. Alternatively, fetch and
// convert the page with getPage() and parse the resulting string:
//   phpQuery::newDocumentHTML(getPage($url));
phpQuery::newDocumentFile($url);

// Every output row has the form: id#name#link#parent_id#level
$firstCate = pq('#allsort .m');
$id = 0;
foreach ($firstCate as $first) {
  // Level 1: top-level category (parent_id is 0)
  $id++;
  $topcate = pq($first)->find(".mt a");
  echo $id . "#";
  foreach ($topcate as $top) {
    echo pq($top)->text() . "#" . "<a href='" . pq($top)->attr("href") . "' target='_blank'>" . pq($top)->text() . "</a>、";
  }
  echo "#0#1<br>";

  // Level 2: each <dl> under the top-level block is a second-level category
  $companies = pq($first)->find(".mc dl");
  $parent_id = $id;
  foreach ($companies as $company) {
    $id++;
    $sparent_id = $id;
    echo "  " . $id . "#" . pq($company)->find('dt')->text() . "#" . "<a href='" . pq($company)->find('dt a')->attr("href") . "' target='_blank'>" . pq($company)->find('dt')->text() . "</a>#" . $parent_id . "#2<br>";

    // Level 3: the links inside <dd><em> are third-level categories
    $cate = pq($company)->find('dd em a');
    foreach ($cate as $detail) {
      $id++;
      echo "  " . $id . "#" . pq($detail)->text() . "#" . "<a href='" . pq($detail)->attr("href") . "' target='_blank'>" . pq($detail)->text() . "</a>#" . $sparent_id . "#3<br>";
    }
  }
}
?>
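
Each row printed by the script has the form id#name#link#parent_id#level, so the three levels can be reassembled later by following parent_id. Here is a minimal sketch of that step, assuming the rows have already been collected into arrays instead of being echoed. The function name buildCategoryTree and the sample rows are made up for illustration and are not part of the original script.

```php
<?php
// Hypothetical sample rows in the same shape the crawler prints:
// id, name, parent_id (0 for a top-level category), level (1 to 3).
$rows = [
    ['id' => 1, 'name' => 'Books',     'parent_id' => 0, 'level' => 1],
    ['id' => 2, 'name' => 'E-books',   'parent_id' => 1, 'level' => 2],
    ['id' => 3, 'name' => 'Novels',    'parent_id' => 2, 'level' => 3],
    ['id' => 4, 'name' => 'Computers', 'parent_id' => 0, 'level' => 1],
];

/**
 * Build a nested tree from flat rows by following parent_id.
 * Each returned row gains a 'children' key holding its sub-categories.
 */
function buildCategoryTree(array $rows, int $parentId = 0): array
{
    $tree = [];
    foreach ($rows as $row) {
        if ($row['parent_id'] === $parentId) {
            // Recurse to collect everything whose parent is this row.
            $row['children'] = buildCategoryTree($rows, $row['id']);
            $tree[] = $row;
        }
    }
    return $tree;
}

$tree = buildCategoryTree($rows);
echo $tree[0]['name'], ' > ', $tree[0]['children'][0]['name'], PHP_EOL;
```

The recursive scan is quadratic in the number of rows, which should be acceptable for a category list of this size; for much larger data, indexing the rows by parent_id first would be the usual alternative.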

3. Result

In this way, the script captures Jingdong's product category information. You could add a database to store the results, which makes the data easier to keep and query. Although this example only captures Jingdong's product categories, it can be extended to capture other information such as product prices and positive and negative reviews. I won't go into detail here; the right approach depends on the specific requirements. If needed, the crawler could also be made generic: supply the XPath of a tag and get back its value. That is just speculation on my part; anyone interested can look up material online, and there are likely many ways to implement it.
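
As one possible take on the generic XPath idea, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath classes rather than phpQuery. The helper name extractByXPath and the sample markup are hypothetical; the markup only loosely imitates the category structure and is not Jingdong's real page.

```php
<?php
/**
 * Return the trimmed text of every node matched by an XPath expression.
 * Hypothetical helper for illustration; it is not part of phpQuery.
 */
function extractByXPath(string $html, string $xpath): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false) {
        // The expression could not be evaluated.
        return $values;
    }
    foreach ($nodes as $node) {
        // textContent works for both element and attribute nodes.
        $values[] = trim($node->textContent);
    }
    return $values;
}

// Made-up markup loosely mirroring the selectors used by the crawler.
$html = '<div id="allsort"><div class="m">'
      . '<div class="mt"><a href="/books">Books</a></div>'
      . '<div class="mc"><dl><dt><a href="/ebooks">E-books</a></dt>'
      . '<dd><em><a href="/novels">Novels</a></em>'
      . '<em><a href="/comics">Comics</a></em></dd></dl></div>'
      . '</div></div>';

$names = extractByXPath($html, '//div[@id="allsort"]//dd/em/a');
$links = extractByXPath($html, '//div[@id="allsort"]//dd/em/a/@href');
print_r($names);
print_r($links);
```

With a helper like this, the tag to extract becomes an input rather than something hard-coded, so the same function could in principle be pointed at prices or review counts once their XPath is known.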


Origin blog.csdn.net/sinat_37212928/article/details/103922028