A detailed walkthrough of a 50-line Python crawler that fetches and processes the Turing Press book catalog

This article walks through a roughly 50-line Python crawler that fetches and processes the Turing Press (ituring.com.cn) book catalog. The sample code is explained in detail and should be a useful learning or working reference for anyone who needs it.
Introduction

The crawler uses requests to fetch pages and BeautifulSoup to extract data.
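
Both are third-party libraries; if they are not installed yet, they can be installed with pip:

pip install requests beautifulsoup4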

The work divides into two steps. The first step parses a book list page and extracts the links to the detail pages inside it. The second step parses each book detail page and extracts the content of interest; depending on how the data is presented, different extraction methods are used. Overall, BeautifulSoup feels easy to use for this.
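
A condensed sketch of that two-step flow (the list-page URL and the .name selector are the ones used by the complete script at the end of the article, which adds error handling, throttling, and the remaining fields):

import requests
from bs4 import BeautifulSoup

# step 1: fetch a list page and pull out the detail page links
列表页 = requests.get('http://www.ituring.com.cn/book?tab=book&sort=new&page=0')
bs = BeautifulSoup(列表页.content, features="html.parser")
for 名称 in bs.select('.name'):
    # step 2: fetch and parse the detail page behind each link
    详情页 = requests.get('http://www.ituring.com.cn%s' % 名称.a.get('href'))
    详情 = BeautifulSoup(详情页.content, features="html.parser")
    print(详情.h2.get_text().strip())  # e.g. the book title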

The following are a few typical HTML snippets and the Python code used to extract content from them.

1. Extracting the detail page links

HTML snippet on the list page containing a detail page link:

<h4 class="name">
 <a href="/book/1921" rel="external nofollow" title="深度学习入门:基于Python的理论与实现">
  深度学习入门:基于Python的理论与实现
 </a>
</h4>

Python code to extract the detail page links:

# bs is a BeautifulSoup instance
for 详情链接信息 in bs.select('.name'):
    # extract the link
    print(详情链接信息.a.get('href'))
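
As a standalone check, feeding the snippet above straight into BeautifulSoup shows what the loop yields:

from bs4 import BeautifulSoup

html = '''<h4 class="name">
 <a href="/book/1921" rel="external nofollow" title="深度学习入门:基于Python的理论与实现">
  深度学习入门:基于Python的理论与实现
 </a>
</h4>'''
bs = BeautifulSoup(html, features="html.parser")
for 详情链接信息 in bs.select('.name'):
    print(详情链接信息.a.get('href'))  # prints: /book/1921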

2. Extracting the book name from the detail page

HTML snippet for the book name on the detail page:

<h2>
   深度学习入门:基于Python的理论与实现
</h2>

Python code to extract the book name:

# The extracted text is surrounded by whitespace, so strip() it off
bs.h2.get_text().strip()
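
A standalone check makes the effect of strip() visible:

from bs4 import BeautifulSoup

bs = BeautifulSoup('<h2>\n   深度学习入门:基于Python的理论与实现\n</h2>', features="html.parser")
print(repr(bs.h2.get_text()))          # '\n   深度学习入门:基于Python的理论与实现\n'
print(repr(bs.h2.get_text().strip()))  # '深度学习入门:基于Python的理论与实现'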

3. Extracting the e-book price

HTML snippet for the e-book price on the detail page:

<dt>电子书</dt>
 <dd>
   <span class="price">¥29.99</span>
 </dd>

Python code to extract the e-book price:

# Not every book has an e-book edition, so check first
有电子书 = bs.find("dt", text="电子书")
if 有电子书:
    价格 = 有电子书.next_sibling.next_sibling.find("span", {"class": "price"}).get_text().strip()[1:]
    print(float(价格))
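
next_sibling is called twice because, in the HTML above, the first sibling of the <dt> tag is the whitespace text node between <dt> and <dd>; only the second sibling is the <dd> tag itself. A standalone check:

from bs4 import BeautifulSoup

html = '''<dt>电子书</dt>
 <dd>
   <span class="price">¥29.99</span>
 </dd>'''
bs = BeautifulSoup(html, features="html.parser")
dt = bs.find("dt", text="电子书")
print(repr(dt.next_sibling))      # the whitespace text node '\n '
dd = dt.next_sibling.next_sibling  # the <dd> tag itself
print(float(dd.find("span", {"class": "price"}).get_text().strip()[1:]))  # 29.99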

The complete code

# ituring.py, Python 3. By default only the first two list pages are crawled;
# the range of list pages can be set via command-line arguments.
import sys
import requests
import time
from bs4 import BeautifulSoup

def 输出图书列表中的详情链接(bs):
    # find every <h4 class="name"><a href="/book/..." rel="external nofollow" >...</a></h4> on the page
    for 详情链接信息 in bs.select('.name'):
        # extract the link
        yield 详情链接信息.a.get('href')

def 获取图书详情(链接):
    详情页 = requests.get('http://www.ituring.com.cn%s' % 链接)
    if 详情页.ok:
        bs = BeautifulSoup(详情页.content, features="html.parser")

        图书 = {}

        图书['title'] = bs.h2.get_text().strip()
        图书['status'] = bs.find("strong", text="出版状态").next_sibling

        有定价 = bs.find("strong", text="定  价")
        if 有定价:
            图书['price'] = 有定价.next_sibling

            有电子书 = bs.find("dt", text="电子书")
            if 有电子书:
                图书['ePrice'] = float(有电子书.next_sibling.next_sibling.find("span", {"class": "price"}).get_text().strip()[1:])

        有出版日期 = bs.find("strong", text="出版日期")
        if 有出版日期:
            图书['date'] = 有出版日期.next_sibling

        图书['tags'] = []
        for tag in bs.select('.post-tag'):
            图书['tags'].append(tag.string)

        return 图书

    else:
        print('❌ detail page http://www.ituring.com.cn%s' % 链接)

def 解析图书列表页(起始页, 终止页):
    for 页序号 in range(起始页 - 1, 终止页):
        # visit each book list page in turn; the URL's page parameter is 0-based
        列表页 = requests.get('http://www.ituring.com.cn/book?tab=book&sort=new&page=%s' % 页序号)

        if 列表页.ok:
            # create a BeautifulSoup instance
            bs = BeautifulSoup(列表页.content, features="html.parser")

            # extract the detail page links from the list page and analyze them one by one
            for 详情页面链接 in 输出图书列表中的详情链接(bs):
                图书信息 = 获取图书详情(详情页面链接)
                # process the resulting book info however your own needs require
                print(图书信息)
                # rest briefly after each book
                time.sleep(0.1)

            print('✅ page %s done\n\t' % (页序号 + 1))
        else:
            print('❌ error fetching page %s\n\t' % (页序号 + 1))

if __name__ == '__main__':
    # default start and end pages of the book list
    起始图书列表页码 = 1
    终止图书列表页码 = 2  # ⚠️ with a small code change, the last page could be detected automatically

    # read command-line arguments; ⚠️ argument types are not validated here
    if len(sys.argv) == 2:
        # with one argument, it is the end page; the start page keeps its default of 1
        终止图书列表页码 = int(sys.argv[1])
    if len(sys.argv) == 3:
        # with two arguments, the first is the start page and the second is the end page
        起始图书列表页码 = int(sys.argv[1])
        终止图书列表页码 = int(sys.argv[2])

    解析图书列表页(起始图书列表页码, 终止图书列表页码)
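
Based on the sys.argv handling above, the script can be started like this:

python ituring.py        # list pages 1 to 2 (the defaults)
python ituring.py 5      # list pages 1 to 5
python ituring.py 3 5    # list pages 3 to 5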



Source: blog.csdn.net/haoxun09/article/details/104741459