Introduction to Python Crawlers | 5: Crawling Xiaozhu short-term rental listings

Xiaozhu Short-term Rental is a rental website with plenty of high-quality homestay listings. Let's take the listings in Chengdu as an example and try to crawl this data.

Xiaozhu Short-term Rental (Chengdu) page: http://cd.xiaozhu.com/


1. Crawl the listing titles

As usual, let's start by crawling the titles to test the waters: find a title on the page and copy its XPath.

Copy the title XPaths of several listings to compare:

//*[@id="page_list"]/ul/li[1]/div[2]/div/a/span
//*[@id="page_list"]/ul/li[2]/div[2]/div/a/span
//*[@id="page_list"]/ul/li[3]/div[2]/div/a/span 

Notice that the titles' XPaths differ only in the index after <li>, so the XPath that grabs every title on the page is simply:

//*[@id="page_list"]/ul/li/div[2]/div/a/span

Still the same fixed routine; let's try crawling all the titles on the page.

One caveat: Xiaozhu is fairly strict about IP restrictions, so the code must call sleep() to throttle the request frequency.
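One way to make that throttling systematic is a small helper that enforces a minimum gap between requests. This is a sketch of the idea, not code from the tutorial; the `throttle` function and its usage are hypothetical:

```python
import time

def throttle(last_time, min_interval=3.0):
    """Sleep just long enough so that consecutive requests are at least
    min_interval seconds apart; returns the new timestamp."""
    wait = last_time + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    return time.monotonic()

# Hypothetical usage inside a crawl loop:
#   last = 0.0
#   for url in urls:
#       last = throttle(last)          # pause before each request
#       data = requests.get(url).text
```

Calling `throttle` before every `requests.get` keeps the crawl rate predictable no matter how fast each page is parsed.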

Next, starting from the title tag, locate the tag that wraps an entire listing. Comparing the two XPaths:

//*[@id="page_list"]/ul/li   # whole listing
//*[@id="page_list"]/ul/li/div[2]/div/a/span   # title

From here it should be clear how to adapt the code: write a loop over the listing nodes.

file = s.xpath('//*[@id="page_list"]/ul/li')
for div in file:
    title = div.xpath("./div[2]/div/a/span/text()")[0]

OK, let's run it and see:
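The relative-XPath loop can be demonstrated end to end on a trimmed-down stand-in for the listing markup (the HTML below is an assumed simplification, not a live fetch of Xiaozhu's page):

```python
from lxml import etree

# Assumed, simplified stand-in for Xiaozhu's listing markup:
html = '''
<div id="page_list"><ul>
  <li><div></div><div><div><a><span>Cozy room near the metro</span></a></div></div></li>
  <li><div></div><div><div><a><span>Chunxi Road loft</span></a></div></div></li>
</ul></div>
'''

s = etree.HTML(html)
file = s.xpath('//*[@id="page_list"]/ul/li')   # one node per listing
titles = []
for div in file:
    # the leading "./" makes the query relative to the current <li>
    titles.append(div.xpath("./div[2]/div/a/span/text()")[0])

print(titles)  # → ['Cozy room near the metro', 'Chunxi Road loft']
```

The key point is the `./` prefix: each `div.xpath(...)` call searches only inside the current listing node, so the fields of one listing never get mixed up with another's.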


2. Crawl information for multiple elements

Compare the XPaths of the other elements:

//*[@id="page_list"]/ul/li   # whole listing
//*[@id="page_list"]/ul/li/div[2]/div/a/span   # title
//*[@id="page_list"]/ul/li/div[2]/span[1]/i   # price
//*[@id="page_list"]/ul/li/div[2]/div/em   # description
//*[@id="page_list"]/ul/li/a/img   # image

Then you can write the code:

file = s.xpath('//*[@id="page_list"]/ul/li')
for div in file:
    title = div.xpath("./div[2]/div/a/span/text()")[0]
    price = div.xpath("./div[2]/span[1]/i/text()")[0]
    scrible = div.xpath("./div[2]/div/em/text()")[0].strip()
    pic = div.xpath("./a/img/@lazy_src")[0]

Let's try to run it:
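All four extractions can be checked against a small assumed sample of the markup (the HTML below is a hypothetical simplification matching the XPaths above; the real page is more complex):

```python
from lxml import etree

# Hypothetical single-listing markup matching the XPath comparison above:
html = '''
<div id="page_list"><ul>
  <li>
    <a href="#"><img lazy_src="http://img.example.com/room1.jpg"/></a>
    <div class="result_tit"></div>
    <div>
      <div><a><span>Cozy loft near Chunxi Road</span></a><em> 1 bedroom, sleeps 2 </em></div>
      <span><i>328</i></span>
    </div>
  </li>
</ul></div>
'''

s = etree.HTML(html)
for div in s.xpath('//*[@id="page_list"]/ul/li'):
    title = div.xpath("./div[2]/div/a/span/text()")[0]
    price = div.xpath("./div[2]/span[1]/i/text()")[0]
    scrible = div.xpath("./div[2]/div/em/text()")[0].strip()  # strip() trims surrounding whitespace
    pic = div.xpath("./a/img/@lazy_src")[0]                   # image URL lives in the lazy_src attribute
    print(title, price, scrible, pic)
```

Note that the image URL is read from the `lazy_src` attribute rather than `src`, because the site lazy-loads its images; `@lazy_src` in XPath selects that attribute's value directly.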


3. Turn pages to crawl more pages

Take a look at the change of the url when turning the page:

http://cd.xiaozhu.com/search-duanzufang-p1-0/    # page 1
http://cd.xiaozhu.com/search-duanzufang-p2-0/    # page 2
http://cd.xiaozhu.com/search-duanzufang-p3-0/    # page 3
http://cd.xiaozhu.com/search-duanzufang-p4-0/    # page 4
...

The URL pattern is very simple: only the number after p changes, and it matches the page number exactly. Easy to handle — just write a simple loop to traverse all the URLs.

for a in range(1, 6):
    url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
    # We try 5 pages here; adjust the range to however many pages you need

The complete code is as follows:

from lxml import etree
import requests
import time

for a in range(1,6):
    url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
    data = requests.get(url).text
    time.sleep(3)   # throttle between page requests to avoid an IP ban

    s = etree.HTML(data)
    file = s.xpath('//*[@id="page_list"]/ul/li')
    
    for div in file:
        title=div.xpath("./div[2]/div/a/span/text()")[0]
        price=div.xpath("./div[2]/span[1]/i/text()")[0]
        scrible=div.xpath("./div[2]/div/em/text()")[0].strip()
        pic=div.xpath("./a/img/@lazy_src")[0]
            
        print("{}   {}   {}   {}\n".format(title,price,scrible,pic)) 

Take a look at the result of crawling 5 pages:

I believe you have now mastered the basic routine of crawlers, but you still need practice until you can write the code independently.

Writing code takes not only care but also patience. Many people give up not because programming is hard, but because of one small problem somewhere along the way.

Well, that's it for this section!


