Introduction to Python Crawlers | 4: Crawling Douban Top 250 Book Information


Let's take a look at what the page looks like: https://book.douban.com/top250


What information are we going to scrape: book titles, links, ratings, one-sentence reviews…


1. Crawling a single piece of information

Let's try crawling the title first, following the routine from the previous lesson: start by copying the title's xpath.

Copying the xpath for the title of the first book, The Kite Runner, gives:

//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a

With the xpath in hand, we can try it out the same way as before:

The query returned an empty list, which is rather awkward.

Note that an xpath copied from the browser is not completely reliable: the browser often inserts redundant tbody tags that are not present in the raw HTML, and we need to delete them by hand.

After removing the tbody from the xpath and trying again, the results are as follows:


Remember: an xpath copied from the browser is not completely reliable, especially when it contains a tbody tag.
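You can see the tbody pitfall in a self-contained snippet. The markup below is a made-up miniature of one entry on the page, not Douban's real HTML:

```python
from lxml import etree

# A made-up miniature of one book entry, trimmed to the tags the xpath uses.
html = """
<div id="content">
  <table>
    <tr><td><div><a title="The Kite Runner">The Kite Runner</a></div></td></tr>
  </table>
</div>
"""
s = etree.HTML(html)

# The browser's inspector shows a tbody inside every table, but the raw HTML
# the server sends has none, so the copied xpath (with tbody) matches nothing:
print(s.xpath('//*[@id="content"]/table/tbody/tr/td/div/a/@title'))   # []

# Delete the tbody by hand and the same path finds the title:
print(s.xpath('//*[@id="content"]/table/tr/td/div/a/@title'))   # ['The Kite Runner']
```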

Copy the title xpaths of "The Kite Runner", "The Little Prince", "Fortress Besieged", and "The Miracles of the Namiya General Store" for comparison:

//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[2]/tbody/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[3]/tbody/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[4]/tbody/tr/td[2]/div[1]/a 

Comparing them, the title xpaths differ only in the serial number after table, which matches each book's position on the page. Remove the serial number (with the tbody already deleted, as before) and we get a general xpath that matches every title:

//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a 

Well, let's try to crawl every title on this page:
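One way to write that try is as a small helper, so the xpath can be exercised without hitting the network (the function name `extract_titles` is mine, not from the lesson):

```python
from lxml import etree

def extract_titles(html):
    """Return every book title on one Top 250 page, using the generalized
    xpath: no tbody, and no serial number after table."""
    s = etree.HTML(html)
    return s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/@title')
```

Feeding it `requests.get('https://book.douban.com/top250').text` should yield all 25 titles on the first page.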


2. Crawling multiple pieces of information


Next, copy the rating xpaths of "The Kite Runner", "The Little Prince", "Fortress Besieged", and "The Miracles of the Namiya General Store" for comparison:

//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[2]/span[2]
//*[@id="content"]/div/div[1]/div/table[2]/tbody/tr/td[2]/div[2]/span[2]
//*[@id="content"]/div/div[1]/div/table[3]/tbody/tr/td[2]/div[2]/span[2]
//*[@id="content"]/div/div[1]/div/table[4]/tbody/tr/td[2]/div[2]/span[2] 

By now you can probably write the xpath that grabs all the ratings in seconds:

//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[2]/span[2] 


Put the xpath of the score into the previous code and run:


Now let's crawl the title and the rating at the same time:
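A sketch of the simultaneous version, pairing the two result lists with zip (the function name `titles_and_scores` is my own):

```python
from lxml import etree

def titles_and_scores(html):
    s = etree.HTML(html)
    # Two independent queries over the whole page, paired up afterwards.
    # This silently assumes both lists come back the same length.
    titles = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/@title')
    scores = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[2]/span[2]/text()')
    return list(zip(titles, scores))
```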

Here we assume that every title and every rating is extracted completely and correctly. That assumption usually holds, but it is flawed: if any item is extracted one-too-few or one-too-many times, the two lists end up with different lengths and the pairing goes wrong. For example:


If the @title at the end of the title xpath is changed to text(), the number of text nodes extracted no longer equals the number of ratings, and the pairs no longer line up.
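A tiny pure-Python illustration of that mismatch (the titles and numbers here are made-up sample data):

```python
# Three books' ratings, all extracted fine:
scores = ["8.9", "9.0", "8.9"]
# ...but suppose the second book's title failed to extract, so titles is one short:
titles = ["The Kite Runner", "Fortress Besieged"]

# zip pairs by position, so every title after the gap lands on the wrong score,
# and the last score is silently dropped:
pairs = list(zip(titles, scores))
print(pairs)   # [('The Kite Runner', '8.9'), ('Fortress Besieged', '9.0')]
```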

If instead we take each book as the unit and extract its fields separately, the pairing is guaranteed to match.


A book's title tag necessarily sits inside the block of tags for that book, so we start from the title tag and look upward until we find the tag that covers the whole book (the page on the left highlights the content matching the code as you hover), then copy its xpath:

//*[@id="content"]/div/div[1]/div/table[1] 

Compare the xpath of the whole book with those of the title and the rating:

//*[@id="content"]/div/div[1]/div/table[1]   # the whole book
//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a   # title
//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[2]   # rating 

It is not hard to see that the title and rating xpaths share the whole-book xpath as a prefix,
so we can locate the information by writing the xpaths like this:

file = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]')
for div in file:
    title = div.xpath("./tr/td[2]/div[1]/a/@title")
    score = div.xpath("./tr/td[2]/div[2]/span[2]/text()") 

Take a look at the actual code:
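A minimal runnable sketch of that per-book extraction (the helper name `parse_book` is mine; it returns the raw one-element lists as extracted):

```python
from lxml import etree

def parse_book(table):
    # Relative xpaths (note the leading ".") anchored on one book's <table>
    # element, so the title and score always belong to the same book.
    title = table.xpath("./tr/td[2]/div[1]/a/@title")
    score = table.xpath("./tr/td[2]/div[2]/span[2]/text()")
    return title, score
```

Calling it on each element of `s.xpath('//*[@id="content"]/div/div[1]/div/table[1]')` yields that book's title and score.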


We have just crawled the information for one book; how do we crawl every book on this page? It's very simple: just remove the serial number after the <table> in the xpath.

Now we can finally see the whole picture. But wait~

title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
score = div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0] 


Why the [0] at the end of these two lines? The values we extracted before came back wrapped in lists, which is awkward to look at. Since each list holds exactly one value, we simply take its first element. If lists are unfamiliar, go back and review them.

The next step is to crawl a few more fields in the same way!

One point to note is:

num=div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")") 

This line chains several strip() calls. strip(chars) removes the given characters from both ends of the string (never the middle): strip("(") removes the opening parenthesis, strip(")") removes the closing one, and a bare strip() removes surrounding whitespace.
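To see what each step does, here is the chain applied step by step to a made-up sample mimicking Douban's rating-count markup:

```python
raw = "(\n                104090人评价\n            )"

step1 = raw.strip("(")    # drop the leading "("
step2 = step1.strip()     # drop surrounding whitespace; trailing ")" blocks the right side
step3 = step2.strip(")")  # drop the trailing ")"
num = step3.strip()       # drop the whitespace that sat just before the ")"
print(num)   # 104090人评价
```

That last strip() is needed because the whitespace before the ")" only becomes trailing whitespace once the ")" itself is gone.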

Well, one page is done. Next, we need to crawl the information from all the pages.


3. Turning pages and crawling every page

Let's first look at how the url changes after the page is turned:

https://book.douban.com/top250?start=0    # page 1
https://book.douban.com/top250?start=25   # page 2
https://book.douban.com/top250?start=50   # page 3 

The url pattern is very simple: only the number after start= changes, increasing by 25 per page. Isn't 25 exactly the number of books per page? So all we need is a loop.

for a in range(10):
  url = 'https://book.douban.com/top250?start={}'.format(a * 25)
  # 10 pages in total; a*25 advances the offset in steps of 25 

A quick aside here on Python's range() function.

Basic syntax: range(start, stop, step)
start: counting begins at start, defaulting to 0. For example, range(5) is equivalent to range(0, 5);
stop: counting ends just before stop, which is not included. For example, range(0, 5) gives [0, 1, 2, 3, 4], without the 5;
step: the step size, defaulting to 1. For example, range(0, 5) is equivalent to range(0, 5, 1).

(In Python 3, range() returns a lazy range object rather than a list, so we wrap it in list() to see the values.)

>>> list(range(10))    # from 0 up to, but not including, 10
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 

>>> list(range(1, 11))    # from 1 up to, but not including, 11
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 

>>> list(range(0, 30, 5))    # from 0 to 30 (exclusive) in steps of 5
[0, 5, 10, 15, 20, 25]

With the loop added, the complete code looks like this:

from lxml import etree
import requests
import time

for a in range(10):
    url = 'https://book.douban.com/top250?start={}'.format(a * 25)
    data = requests.get(url).text

    s = etree.HTML(data)
    file = s.xpath('//*[@id="content"]/div/div[1]/div/table')
    time.sleep(3)    # pause between pages so we don't hammer the server

    for div in file:
        title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
        href = div.xpath("./tr/td[2]/div[1]/a/@href")[0]
        score = div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0]
        num = div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")").strip()
        scrible = div.xpath("./tr/td[2]/p[2]/span/text()")

        # Not every book has a one-sentence review, so check before indexing:
        if len(scrible) > 0:
            print("{},{},{},{},{}\n".format(title, href, score, num, scrible[0]))
        else:
            print("{},{},{},{}\n".format(title, href, score, num)) 

Let's run it:


Be sure to practice this a few times yourself. You may think you understand it just from reading, but you will still make mistakes; I'd bet fifty cents on it.

Python's basic syntax is important, so review it whenever you have spare time: strings, lists, dictionaries, tuples, conditional statements, loop statements, and so on.

The most important part of programming is practice. For example, now that you can crawl the Top 250 books, go try the Top 250 movies.

Well, that's it for this lesson!


