Crawling page titles and their corresponding URL links with a Python crawler

How do you use a Python crawler to scrape information from a web page?

We will use Books to Scrape as the example site, address: http://books.toscrape.com/

First, import the requests library and BeautifulSoup:

import requests 
from bs4 import BeautifulSoup

Next, get the page's HTML source code:

url = 'http://books.toscrape.com/'
res = requests.get(url)
print(res.status_code)	# check whether the request succeeded; 200 means success
html = res.text
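
In practice it helps to guard against network failures. The following is an optional sketch, not part of the original walkthrough, that adds a timeout and raises an error for non-200 responses using the requests library's own raise_for_status and RequestException:

import requests

url = 'http://books.toscrape.com/'
try:
    res = requests.get(url, timeout=10)   # give up after 10 seconds instead of hanging forever
    res.raise_for_status()                # raise an HTTPError for 4xx/5xx responses
    html = res.text
except requests.exceptions.RequestException as e:
    print('Request failed:', e)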

Press F12 (or right-click and choose Inspect) to examine the webpage's source code.
You can see that the titles sit inside <li> elements of the sidebar list, so we create a beau object and locate them with find/find_all statements.
We also create empty lists to receive the results, and parse the source code with BeautifulSoup.

book_list = []	# empty list to receive the title text later
link_list = []	# empty list to receive the matching links
soup = BeautifulSoup(html, 'lxml')
beau = soup.find('ul', class_='nav nav-list').find('ul').find_all('li')
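
To double-check that this selector matches what the browser inspector shows, you can print one entry of the parsed tree. This is just an optional sanity check, reusing the soup object from above:

nav = soup.find('ul', class_='nav nav-list')
print(nav.find('ul').find('li').prettify())	# prints one <li> together with its <a href="..."> child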

Use a loop variable q to traverse the beau object. Inside the loop, an each1 object picks out the hyperlink (the <a> tag) in q, and its text and href are appended to book_list and link_list. Then use two variables i and x to traverse enumerate(book_list) and print, for each entry: the sequence number, the title, and the URL link.

(enumerate starts counting from 0, so use i+1 if you want the output numbering to start from 1; the link part is taken from the 'href' attribute of the each1 object.)

for q in beau:
    each1 = q.find('a')	# the <a> tag inside this <li>
    book_list.append(each1.text.strip())	# collect the title text
    link_list.append(each1['href'])	# collect the matching relative link
for i, x in enumerate(book_list):
    print(i+1, 'Title: ' + x + '\t URL: ' + url + link_list[i] + '\n')

At this point, the output gives you: sequence number + title + URL link.
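
A small aside that is not in the original post: if the href values were ever absolute URLs or contained ../ segments, urllib.parse.urljoin from the standard library would build the full address more robustly than plain string concatenation:

from urllib.parse import urljoin

full_url = urljoin(url, link_list[0])	# joins the base URL with the first relative href
print(full_url)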
The essence of Python is efficiency!
Finally, here is a simplified version of the code that crawls the titles and their corresponding URL links: eight lines of code to do the whole job.

import requests
from bs4 import BeautifulSoup
beau = BeautifulSoup(requests.get('http://books.toscrape.com/').text, 'lxml').find('ul', class_='nav nav-list').find('ul').find_all('li')
book_list = []
for q in beau:
    book_list.append((q.a.text.strip(), q.a['href']))	# store each title together with its link
for i, (x, link) in enumerate(book_list):
    print(i+1, 'Title: ' + x + '\t URL: ' + 'http://books.toscrape.com/' + link + '\n')
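
As another aside not taken from the original post, BeautifulSoup's CSS-selector interface (select) can express the same lookup even more compactly; this sketch assumes the same sidebar structure:

import requests
from bs4 import BeautifulSoup

base = 'http://books.toscrape.com/'
soup = BeautifulSoup(requests.get(base).text, 'lxml')
for i, a in enumerate(soup.select('ul.nav.nav-list ul li a'), start=1):	# every sidebar link
    print(i, 'Title:', a.text.strip(), '\tURL:', base + a['href'])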

Origin: blog.csdn.net/JasonZ227/article/details/109556770