Using PHP to make a simple content collector

Author: yzxh24 | Source: compiled by Tianji Forum | Published: 2018/4/23
 
  A collector, often called a "thief program", is mainly used to grab content from other people's web pages. Writing one is not difficult: open the target page remotely, then use regular expressions to match the content you need. With a basic grasp of regular expressions, you can write your own collector.

  A few days ago I built a novel serialization site. To avoid the chore of updating it by hand, I also wrote a collector that pulls content from the 86zw.com novel site. Its functionality is fairly simple and its rules cannot be customized, but you can extend it yourself.

  A collector written in PHP relies mainly on two functions: file_get_contents() and preg_match_all(). The former reads the content of a remote web page; it is available from PHP 4.3.0 onward, and reading URLs with it requires allow_url_fopen to be enabled. The latter is a regular expression function used to extract the required content.
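  As a minimal sketch of what preg_match_all() hands back, the example below runs it on a hard-coded HTML fragment instead of a downloaded page. The fragment and the variable names are made up for illustration and do not come from any real site.

```php
<?php
// Illustrative HTML; in a real collector this string would come from
// file_get_contents($url) on a remote page.
$html = '<ul><li><a href="1.html">First</a></li>'
      . '<li><a href="2.html">Second</a></li></ul>';

// With the default PREG_PATTERN_ORDER layout, $matches[0] holds the full
// matches, $matches[1] the first capture group, $matches[2] the second.
$found = preg_match_all('/<a href="(.*?)">(.*?)<\/a>/is', $html, $matches);

echo $found, "\n";      // number of full matches
print_r($matches[1]);   // link targets
print_r($matches[2]);   // link texts
```

  The point to notice is the indexing: the captured text lives in $matches[1] and up, while $matches[0] still contains the surrounding tags.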

  The function implementation is described step by step below. 

  Because it is a collection of novels, the title, author, and genre must be extracted first, and other information can be extracted as needed.

  The target here is the novel "Returning to the Ming Dynasty to Be a Prince". First open its bibliography page: http://www.86zw.com/Book/3727/Index.aspx

  Open a few more books and you will find that the bibliography page URL follows the pattern http://www.86zw.com/Book/book number/Index.aspx. So we can make a start page with an <input type="text" name="number"> field for entering the number of the book to collect, and read it on the server as $_POST['number']. With the book number in hand, the next step is to construct the bibliography page URL: $url = "http://www.86zw.com/Book/" . $_POST['number'] . "/Index.aspx"; This is only an example to keep the explanation simple; in a real program, check that $_POST['number'] is valid first.
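  One way to do that validity check is sketched below. The helper name build_book_url() and the digits-only rule are assumptions made for this example; the article itself does not specify how the number should be checked.

```php
<?php
// Hypothetical helper (not from the original article): returns the
// bibliography page URL for a book number, or null when the input is
// not a non-empty string of digits.
function build_book_url($number)
{
    // Digits only, so nothing else can be injected into the URL.
    if (!is_string($number) || !ctype_digit($number)) {
        return null;
    }
    return "http://www.86zw.com/Book/" . $number . "/Index.aspx";
}

// In the form handler this would receive $_POST['number'];
// '3727' is only a fallback so the sketch also runs from the command line.
$input = isset($_POST['number']) ? $_POST['number'] : '3727';
$url = build_book_url($input);
echo $url === null ? "Invalid book number\n" : $url . "\n";
```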

  After constructing the URL, you can start collecting the book information. Use file_get_contents() to open the bibliography page: $content = file_get_contents($url); this reads the page into $content. The next step is to match information such as the title, author and genre. The title is used as the example here; the others work the same way. Open the bibliography page, view its source, and find <span class="booktitle">"Returning to the Ming Dynasty to Be a Prince"</span>, which contains the title to extract. The regular expression for the title is /<span class=\"booktitle\">(.*?)<\/span>/is, and preg_match_all() applies it: preg_match_all("/<span class=\"booktitle\">(.*?)<\/span>/is", $content, $title); The captured title is then in $title[1][0], while $title[0][0] holds the whole match including the <span> tags (the usage of preg_match_all() can be looked up online and is not explained in detail here).

  After the book information, the next step is the chapter content. First find the address of each chapter, then open each chapter remotely, extract its content with a regular expression, and either store it in a database or generate a static HTML file directly. The chapter list is at http://www.86zw.com/Html/Book/18/3727/List.shtm. As with the bibliography page, there is a pattern: http://www.86zw.com/Html/Book/classification number/book number/List.shtm. The book number is already known, so the key is the classification number, which can be found on the bibliography page fetched earlier. Extract the classification number:

  preg_match_all("/Html\/Book\/[0-9]{1,}\/[0-9]{1,}\/List\.shtm/is", $content, $typeid);

  This matches the whole path, so a small cutting function is also needed:

function cut($string, $start, $end) {
    $message = explode($start, $string);
    $message = explode($end, $message[1]);
    return $message[0];
}

  Here $string is the content to be cut, $start is the starting place, and $end is the ending place. Take out the classification number:

$start = "Html/Book/";
$end = "List.shtm";
$typeid = cut($typeid[0][0], $start, $end);
$typeid = explode("/", $typeid);

  In this way, $typeid[0] is the classification number we are looking for. Next, construct the address of the chapter list: $chapterurl = "http://www.86zw.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm"; With this, you can find the address of each chapter. The method is as follows:

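// u is short for url
$ustart = "\"";
$uend = "\"";
// t is short for title
$tstart = ">";
$tend = "<";
// read the chapter list page
$file = file_get_contents($chapterurl);
// take the path, for example: 123.shtm, 2342.shtm, 233.shtm
// (this pattern was cut off in the source; the one below is a reconstruction
// that assumes the same link format as the title pattern)
preg_match_all("/<a href=\"[0-9]{1,}\.shtm\"/is", $file, $url);
// take the title, for example: Chapter 1: A Good Man for Nine Lifetimes
preg_match_all("/<a href=\"[0-9]{1,}\.shtm\"(.*?)\<\/a>/is", $file, $title);
$array = array();
$count = count($url[0]);
for ($i = 0; $i < $count; $i++)
{
    $u = cut($url[0][$i], $ustart, $uend);
    $t = cut($title[0][$i], $tstart, $tend);
    $array[$u] = $t;
}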
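  The same chapter map can also be built with capture groups, which avoids cutting the matches apart afterwards. The sketch below assumes chapter links of the form <a href="123.shtm">title</a>, as in the patterns above. The function name parse_chapter_list() and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: returns an array of chapter file name => title.
function parse_chapter_list($html)
{
    $chapters = array();
    // Group 1: the numeric .shtm path; group 2: the link text.
    // [^>]* tolerates extra attributes after href.
    $pattern = '/<a href="([0-9]+\.shtm)"[^>]*>(.*?)<\/a>/is';
    if (preg_match_all($pattern, $html, $m, PREG_SET_ORDER)) {
        foreach ($m as $match) {
            $chapters[$match[1]] = trim(strip_tags($match[2]));
        }
    }
    return $chapters;
}

// Made-up sample standing in for the downloaded chapter list page.
$sample = '<li><a href="123.shtm">Chapter 1</a></li>'
        . '<li><a href="2342.shtm" title="x">Chapter 2</a></li>'
        . '<li><a href="about.html">About</a></li>';
print_r(parse_chapter_list($sample));
```

  With PREG_SET_ORDER each element of $m is one complete link, so the path and the title of a chapter cannot drift out of step the way two separate preg_match_all() calls could.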

  $array now holds every chapter address, each mapped to its chapter title. At this point the collector is half done. What remains is to loop over the chapter addresses, open and read each one, and match its content. That part is relatively simple and is not described in detail here. That is all for today. This is the first time I have written such a long article, so there are bound to be rough spots in the writing. Please bear with me!
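  For readers who want a starting point for that remaining loop, here is a rough sketch. Everything specific in it is an assumption: the function name collect_chapters(), the idea that chapter files sit in the same directory as List.shtm, and above all the <div id="content"> marker, which must be replaced by whatever actually surrounds the chapter text in the target page's source.

```php
<?php
// Hypothetical sketch: $chapters maps file name => title (like $array),
// $baseUrl is the directory of the chapter list page, and $fetch is any
// callable returning a page's HTML (file_get_contents by default).
function collect_chapters(array $chapters, $baseUrl, $fetch = 'file_get_contents')
{
    $result = array();
    foreach ($chapters as $path => $title) {
        $html = call_user_func($fetch, $baseUrl . $path);
        if ($html === false) {
            continue; // skip chapters that could not be downloaded
        }
        // ASSUMPTION: the chapter text sits in <div id="content">...</div>.
        if (preg_match('/<div id="content">(.*?)<\/div>/is', $html, $m)) {
            $result[$title] = trim($m[1]);
        }
    }
    return $result;
}

// Offline demonstration with a fake fetcher instead of a network request.
$fake = function ($url) {
    return '<html><div id="content"> Text of ' . basename($url) . ' </div></html>';
};
$books = collect_chapters(
    array('123.shtm' => 'Chapter 1', '2342.shtm' => 'Chapter 2'),
    'http://www.86zw.com/Html/Book/18/3727/',
    $fake
);
print_r($books);
```

  Passing the fetcher in as a parameter keeps the loop testable without touching the network; in real use the default file_get_contents does the downloading, and the returned text can then be stored in a database or written out as static HTML.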

