Python crawler notes 1

1. Data acquisition
1.1 Importing requests

In [1]: import requests

1.2 The requests method

requests.get()  # The main method for fetching an HTML page; corresponds to HTTP GET

1.3 Acquisition Process

"""
url="http://www.mingchaonaxieshier.com/hong-wu-da-di-qianyan.html"
#---使用get方法获取数据,返回包含网页数据的response响应,超时时间测试---
r = requests.get(url,timeout=XXX))
#----http请求的返回状态, 200表示连接成功---
r.status_code
#---返回对象的文本内容---
r.text
#---返回对象的二进制形式---
r.content
#---分析返回对象的编码方式---
r.encoding
#---响应内容编码方式(备选编码方式)---
r.appearent_encoding
#---抛出异常---
raise_for_status
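Putting these steps together, a minimal fetch helper might look like the sketch below (the function name and the 10-second default timeout are my own choices, not from the original notes):

```python
import requests

def fetch(url, timeout=10):
    """GET a page with a timeout, fail fast on a bad status,
    and fall back to the encoding detected from the content."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()              # raises requests.HTTPError on 4xx/5xx
    r.encoding = r.apparent_encoding  # use the detected encoding as a fallback
    return r.text
```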

2. Parsing and matching data
Three methods: BeautifulSoup, lxml's xpath, and regular expressions.
(Efficiency comparison figure omitted.)
2.1 xpath
2.1.1 Import lxml and build the XML tree:

from lxml import etree
html = '''
# omitted
'''
s = etree.HTML(html)
print(s.xpath('//a/@href'))  # xpath() takes an XPath expression string

2.1.2 Common xpath functions

text()        # get the text content of a node
comment()     # get comment nodes
@xx           # get any attribute, e.g. @href, @src, @value
string()      # get all text under a tag, including text in child tags
starts-with   # match strings that begin with a given prefix
contains      # match a substring at any position
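A minimal sketch of these functions on a small inline document (the HTML snippet here is made up for illustration):

```python
from lxml import etree

html = '''
<div class="author">
  <a href="/u/1">Alice</a> wrote a <b>short</b> note
</div>
'''
s = etree.HTML(html)

print(s.xpath('//a/text()'))                # text() -> ['Alice']
print(s.xpath('//a/@href'))                 # @href -> ['/u/1']
print(s.xpath('string(//div)').split())     # string() gathers text from child tags too
print(s.xpath('//div[contains(@class, "auth")]/a/text()'))     # substring match
print(s.xpath('//div[starts-with(@class, "auth")]/a/text()'))  # prefix match
```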

Commonly used symbols of xpath (reference figure omitted).
Code:

import requests
from lxml import etree

url = 'https://www.douban.com/note/667285773/'
r = requests.get(url).text
s = etree.HTML(r)
# Select elements by their class attribute and print their text
print(s.xpath('//div[@class="author"]/span/text()')[1])
print(s.xpath('//div[@class="author"]/a/text()')[1])
print(s.xpath('//div[@class="content report-comment"]/p/text()')[1])

2.2 Regular expressions
(A reference table of common regular expressions is omitted here.)
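Since the reference table is missing, here is a small sketch of the usual `re` workflow on a made-up HTML snippet (the patterns are chosen for illustration only):

```python
import re

html = '<a href="/page/1">first</a> <a href="/page/2">second</a>'

# findall: every match of the capture group
links = re.findall(r'href="([^"]+)"', html)
print(links)              # ['/page/1', '/page/2']

# search: the first match anywhere in the string
m = re.search(r'<a[^>]*>(\w+)</a>', html)
print(m.group(1))         # 'first'

# sub: replace matches, here stripping the tags entirely
print(re.sub(r'</?a[^>]*>', '', html))  # 'first second'
```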
3. Beautiful Soup
Beautiful Soup is a Python library whose main job is to parse HTML and XML documents and extract data from them.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://quotes.toscrape.com/")
soup = BeautifulSoup(r.text, 'lxml')
soup.title                      # the <title> tag
soup.head.children              # iterator over the direct children of <head>
soup.find_all('a')              # all <a> tags on the page
a = soup.find_all('small', attrs={'class': 'author'})
soup.find('small', attrs={'class': 'author'}).get_text()  # text of the first author
soup.find('div', attrs={'class': 'quote'}).get_text()     # text of the first quote
for i in range(len(a)):
    print(a[i].get_text())      # print every author name
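The same calls work without a network connection; a minimal sketch on an inline snippet, using the stdlib 'html.parser' instead of lxml (the HTML and the names in it are invented for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div class="quote"><span class="text">Hello.</span>
  <small class="author">Alice</small></div>
<div class="quote"><span class="text">Bye.</span>
  <small class="author">Bob</small></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; get_text() strips the markup
authors = [tag.get_text() for tag in soup.find_all('small', attrs={'class': 'author'})]
print(authors)                                                # ['Alice', 'Bob']
print(soup.find('span', attrs={'class': 'text'}).get_text())  # 'Hello.'
```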
