How do you use a Python crawler to scrape information from a web page?
Let's take Books to Scrape as an example, at http://books.toscrape.com/
First, import the requests library and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
Fetch the HTML source of the page:
url = 'http://books.toscrape.com/'
res = requests.get(url)
print(res.status_code) # check whether the request succeeded; 200 means OK
html = res.text
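In practice a request can fail or hang, so it is worth adding a timeout and an explicit error check. A minimal defensive sketch (the 10-second timeout is just an illustrative choice):
import requests

url = 'http://books.toscrape.com/'
try:
    res = requests.get(url, timeout=10) # don't wait forever on a dead server
    res.raise_for_status() # turn HTTP errors (404, 500, ...) into exceptions
except requests.RequestException as e:
    raise SystemExit(f'Request failed: {e}')
html = res.text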
Press F12, or right-click and choose Inspect, to examine the page's HTML.
You can see that each book on the home page sits in its own <li> block inside <ol class="row">, with the title and link on the <a> tag inside the <h3>. Create a beau object and locate these blocks with find and find_all.
We create an empty list to collect the results and parse the source with BeautifulSoup.
book_list = [] # empty list to collect the results
soup = BeautifulSoup(html, 'lxml')
beau = soup.find('ol', class_='row').find_all('li') # one <li> per book
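The same blocks can also be located with a CSS selector via select; a minimal equivalent sketch (the standard-library 'html.parser' works too if lxml is not installed):
soup = BeautifulSoup(html, 'html.parser') # stdlib parser, no lxml required
beau = soup.select('ol.row > li') # CSS selector for the same <li> blocks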
Loop over the beau object with a variable q. For each q, take the <a> tag inside its <h3>: the title attribute carries the full book title (the visible link text is truncated), and href carries the relative link. Then loop over book_list with enumerate and print, for each entry: sequence number, book title, and address link.
(enumerate starts counting from 0, so print i+1 if you want the sequence to start from 1; the relative href is joined onto the base url.)
for q in beau:
    a = q.h3.a # the <a> inside <h3> holds both the title and the link
    book_list.append((a['title'], url + a['href'])) # store (title, link) pairs
for i, (title, link) in enumerate(book_list):
    print(i+1, 'Title: ' + title + '\t URL: ' + link + '\n')
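As an aside, enumerate accepts a start argument, so the i+1 arithmetic can be dropped. The same output loop, written with start=1 and an f-string, might look like this:
for i, (title, link) in enumerate(book_list, start=1):
    print(f'{i} Title: {title}\t URL: {link}\n')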
At this point you get, for every book: "sequence number + book title + URL link".
The essence of Python is efficiency!
Finally, here is a simplified version of the code that scrapes the book titles and their URL links: crawling in just 8 lines of code!
import requests
from bs4 import BeautifulSoup
beau = BeautifulSoup(requests.get('http://books.toscrape.com/').text, 'lxml').find('ol', class_='row').find_all('li')
book_list = []
for q in beau:
    book_list.append((q.h3.a['title'], 'http://books.toscrape.com/' + q.h3.a['href']))
for i, (title, link) in enumerate(book_list):
    print(i+1, 'Title: ' + title + '\t URL: ' + link + '\n')
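One caveat about the chained one-liner: find returns None when a tag is missing, so any layout change on the site would crash mid-chain with an AttributeError. A slightly more defensive sketch, reusing the imports above (the error message is just illustrative):
page = BeautifulSoup(requests.get('http://books.toscrape.com/').text, 'lxml')
container = page.find('ol', class_='row')
if container is None:
    raise SystemExit('Could not find <ol class="row">; the page layout may have changed')
beau = container.find_all('li')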