Writing a simple web crawler

(A) On the legality of crawlers

Before crawling a site such as Taobao, check its robots.txt file, for example:

https://www.baidu.com/robots.txt

The last two lines of the file are:

User-Agent: *
Disallow: /

This means that, apart from the crawlers named earlier in the file, no other crawler is allowed to fetch any data.
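The same check can be done in code before crawling. Below is a minimal sketch using Python's standard urllib.robotparser module. The rules are a made-up sample fed in as text (no network access), and the crawler names ExampleBot and MyCrawler are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: one named crawler is allowed, every other crawler is blocked.
# "ExampleBot" is a made-up name used only for this sketch.
sample_rules = """\
User-agent: ExampleBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

# The named crawler matches its own entry; any other name falls back to "*".
named_allowed = parser.can_fetch("ExampleBot", "https://example.com/page")
other_allowed = parser.can_fetch("MyCrawler", "https://example.com/page")

print(named_allowed)  # True
print(other_allowed)  # False
```

For a live site, set_url() followed by read() would load the real robots.txt instead of the sample text.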

(B) Prerequisite knowledge: HTML, CSS, and JavaScript

(C) Requesting web pages with the requests library

(1) GET request, using www.cntour.cn as an example

import requests
# Fetch the page with a GET request and print its HTML source
url = 'http://www.cntour.cn/'
strhtml = requests.get(url)
print(strhtml.text)

(2) POST request, using http://fanyi.youdao.com/ as an example

First, determine how the data is submitted and which URL it is sent to. Press F12 to open the browser's developer tools.

After performing a translation, select Network -> XHR to find the request URL we need, then copy it and assign it to url.

Inspect the details of this request to confirm the submission method and the data it sends.

A POST request retrieves data differently from a GET request: the request has to be constructed explicitly. The parameters listed under Form Data are the request parameters; copy them into a dictionary.
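To make the difference concrete, here is a small sketch using only the standard library. It builds a GET request and a POST request from the same dictionary without sending either. The example.com URL and the three parameters are placeholders chosen for illustration, not the real translation endpoint.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical parameters, for illustration only.
params = {"from": "AUTO", "to": "AUTO", "doctype": "json"}

# urlencode turns the dictionary into key=value pairs joined by "&".
encoded = urlencode(params)

# GET: the parameters travel in the URL after "?".
get_request = Request("http://example.com/translate?" + encoded)

# POST: the same pairs travel in the request body as bytes.
post_request = Request("http://example.com/translate", data=encoded.encode("utf-8"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

The requests library does this encoding itself when a dictionary is passed as data, which is why the form parameters are collected into a dictionary first.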

Then use the requests.post method to submit the form data:

import requests
import json

def get_translate_data(word=None):
    url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
    # Form data copied from the developer tools; "i" carries the text to translate.
    # salt, sign, ts, and bv were captured from one browser request and may need refreshing.
    form_data = {
        "i": word, "from": "AUTO", "to": "AUTO", "smartresult": "dict",
        "client": "fanyideskweb", "salt": "15741317307555",
        "sign": "8c28354728736eb14f6c6e7330b7d4c0", "ts": "1574131730755",
        "bv": "e2a78ed30c66e16a857c5b6486a1d326", "doctype": "json",
        "version": "2.1", "keyfrom": "fanyi.web", "action": "FY_BY_REALTlME",
    }
    # Submit the form data with a POST request
    response = requests.post(url, data=form_data)
    # Convert the JSON string into a dictionary
    content = json.loads(response.text)
    # Print the translated text
    print(content['translateResult'][0][0]['tgt'])

if __name__ == '__main__':
    get_translate_data('我爱数据')
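The function above indexes straight into content['translateResult'][0][0]['tgt'], which raises an exception if the server returns anything else. A slightly more defensive sketch is shown below. The two sample strings are invented for illustration; only the nesting of the first one mirrors the indexing used above, and extract_translation is a hypothetical helper name.

```python
import json

# Invented sample responses; only the nesting of the first mirrors the code above.
good_text = '{"translateResult": [[{"tgt": "I love data"}]]}'
bad_text = '{"error": "rejected"}'

def extract_translation(text):
    """Return the first translated string, or None if the structure is missing."""
    try:
        content = json.loads(text)
        return content['translateResult'][0][0]['tgt']
    except (ValueError, KeyError, IndexError, TypeError):
        # ValueError covers invalid JSON; the others cover an unexpected shape.
        return None

print(extract_translation(good_text))
print(extract_translation(bad_text))
```

In the real function, response.text would be passed in place of the sample strings.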

(D) Parsing web pages with BeautifulSoup

The requests library can already fetch the page source; the next step is to find and extract data from that source, using BeautifulSoup from the bs4 library.

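First the HTML document is converted to Unicode, then the lxml parser we specified parses it. After parsing, the HTML document becomes a tree structure in which every node is a Python object. The parsed document is stored in a new variable, soup. Use select (a CSS selector) to locate the data: in the browser, use Inspect Element, then right-click and choose Copy -> Copy selector to copy the path automatically.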

Again using www.cntour.cn as an example:

import requests
from bs4 import BeautifulSoup
url='http://www.cntour.cn/'
strhtml=requests.get(url)
soup=BeautifulSoup(strhtml.text, 'lxml')
data=soup.select('main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')
print(data)
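The list returned by select holds tag objects rather than plain strings. As a rough sketch of the next step, the snippet below pulls the link text and href out of each selected tag. It runs on a small made-up HTML fragment instead of the live page, and it uses Python's built-in 'html.parser' so that lxml does not need to be installed; both choices are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# A small made-up HTML fragment standing in for a downloaded page.
html = """
<ul class="newsList">
  <li><a href="/news/1.html">First headline</a></li>
  <li><a href="/news/2.html">Second headline</a></li>
</ul>
"""

# 'html.parser' ships with Python; 'lxml' can be used instead if it is installed.
soup = BeautifulSoup(html, 'html.parser')
links = soup.select('ul.newsList > li > a')

# get_text() returns the visible text; get('href') reads the attribute.
results = [{'title': a.get_text(), 'link': a.get('href')} for a in links]
for item in results:
    print(item)
```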

Origin www.cnblogs.com/wowhy/p/11888430.html