--- Beautiful Soup crawlers: crawling the Zhihu hot list

  The first two chapters covered the basics of using Beautiful Soup. You have probably run into some anti-crawler measures while scraping, so how do you get past them? Today we will write a simple anti-anti-crawler example against Zhihu.

 

What is anti-crawling

Simply put, it is any technical means used to prevent others from fetching your site's information in bulk. The key word is bulk.

 

Anti-anti-crawler mechanisms

  • Add request headers (headers) --- simulate a more realistic user scenario
  • Change your IP address --- sites judge whether you are a crawler by how frequently your IP visits, so rotate IPs (e.g. through proxies)
  • Change your UA --- the User-Agent identifies the browser visiting the site; UA-based blocking works much like the IP limit, so rotate UAs as well
  • Simulate sign-in --- access pages that require an account by simulating the login request
  • Vary your cookies --- some sites limit the requests allowed per cookie, so change cookies between requests
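The first and third mechanisms can be sketched together in a few lines. This is a minimal illustration, not a vetted setup: the pool of User-Agent strings below is hypothetical, and a real crawler would use full, current UA strings.

```python
import random

# Hypothetical pool of User-Agent strings to rotate through on each request
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def build_headers(host, cookie=""):
    """Build a request-header dict with a randomly chosen User-Agent."""
    return {
        "Host": host,
        "Cookie": cookie,
        "User-Agent": random.choice(USER_AGENTS),
    }

headers = build_headers("www.zhihu.com")
print(sorted(headers))                        # ['Cookie', 'Host', 'User-Agent']
print(headers["User-Agent"] in USER_AGENTS)   # True
```

The returned dict can be passed straight to `requests.get(url, headers=...)`, so each request may present a different browser identity.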

 

Crawling the Zhihu hot list

1. First, open the site to be crawled

2. Inspect the site's HTML: the hot-list entries are 'a' tags with the attribute target = "_blank"

3. Request the site with the requests library

4. To start, access it without any anti-anti-crawler measures

# coding:utf-8
import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/hot'
# verify=False skips HTTPS certificate verification
html = requests.get(url, verify=False).content.decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
name = soup.find_all('a', target="_blank")
for i in name:
    print(i)

The result shows the request comes back empty: no hot-list links are found.
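The empty result can be reproduced offline: when the site serves a login or verification page to an unidentified client instead of the hot list, there are no matching tags, and find_all returns an empty list rather than raising an error. A minimal sketch, where the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the page served to an unidentified client:
# it contains no <a target="_blank"> tags, so the parse yields nothing
blocked_html = '<html><body><div class="SignFlow">Please sign in</div></body></html>'
soup = BeautifulSoup(blocked_html, 'html.parser')
links = soup.find_all('a', target="_blank")
print(links)   # [] -- an empty list, not an error
```

This is why the loop over `name` above prints nothing instead of failing.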

5. Extract a complete request header via the F12 developer tools (it can also be viewed in Fiddler)

  • Host of the request
  • Cookie of the request
  • User-Agent of the request

 

# coding:utf-8
import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/hot'
# Add the request headers pulled from the browser
headers={
    "host":"www.zhihu.com",
    "cookie":'_zap=482b5934-4878-4c78-84f9-893682c32b07; d_c0="ALCgSJhlsQ6PTpmYqrf51G'
             'HhiwoTIQIlS1w=|1545203069"; _xsrf=XrStkKiqUlLxzwMIqRDc01J7jikO4xby; q_c1=94622'
             '462a93a4238aafabad8c004bc41|1552532103000|1548396224000; __utma=51854390.1197068257.'
             '1552532107.1552532107.1552532107.1; __utmz=51854390.1552532107.1.1.utmcsr=zhihu.com|utmccn=(r'
             'eferral)|utmcmd=referral|utmcct=/; __utmv=51854390.100--|2=registration_date=20190314=1^3=entry_da'
             'te=20190125=1; z_c0="2|1:0|10:1552535646|4:z_c0|92:Mi4xcFRlN0RnQUFBQUFBc0tCSW1HV3hEaVlBQUFCZ0FsVk5Ya'
             'DUzWFFBWExTLXVpM3llZzhMb29QSmRtcjlKR3pRaTBB|03a1fa3d16c98e1688cdb5f6ba36082585d72af2f54597e370f05207'
             'cd3a873f"; __gads=ID=27a40a1873146c19:T=1555320108:S=ALNI_MYb5D7sBKFhvJj32HBQXgrhyC6xxQ; tgw_l7_route=7'
             '3af20938a97f63d9b695ad561c4c10c; tst=h; tshl=',
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36",
}
# verify=False skips HTTPS certificate verification
html = requests.get(url, headers=headers, verify=False).content.decode('utf-8')
soup = BeautifulSoup(html,'html.parser')
name = soup.find_all('a',target="_blank")
for i in name:
    print(i.get_text())

After running this, the request now successfully pulls out some of the hot-list information.

Readers who are interested can try it for themselves.
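As a side note, the same extraction can also be written with a CSS selector via Beautiful Soup's select method. A self-contained sketch on made-up sample markup (the snippet only mimics the structure of hot-list entries):

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking the structure of hot-list entries
sample = '''
<div class="HotList">
  <a target="_blank" href="/question/1">Topic one</a>
  <a target="_blank" href="/question/2">Topic two</a>
  <a href="/about">About</a>
</div>
'''
soup = BeautifulSoup(sample, 'html.parser')
titles = [a.get_text() for a in soup.select('a[target="_blank"]')]
print(titles)   # ['Topic one', 'Topic two']
```

The attribute selector `a[target="_blank"]` matches the same tags as `find_all('a', target="_blank")`, so either form works on the real page.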


Origin www.cnblogs.com/qican/p/11139899.html