Pitfalls encountered while writing crawlers 🕳

Crawling the Maoyan (cat's eye) top movie ranking: Chinese text does not display and comes out garbled

When crawling Baidu [https://www.baidu.com/], the output was garbled. Encoding the text and then decoding it again fixes this:

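import requests

url = 'https://www.baidu.com/'
# The text comes back decoded as ISO-8859-1, so re-encode it to recover the
# raw bytes and decode those bytes as UTF-8.
html = requests.get(url).text.encode('iso-8859-1').decode('utf-8')
print(html)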

This solves the garbled-text problem. However, it only works when no headers are specified; with headers, the output is garbled again.
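To see why the encode-then-decode step works, and why it can fail when the text is already decoded correctly, here is a minimal offline sketch. The helper name `fix_mojibake` and the sample string are illustrative assumptions, not part of the original script; the sketch simulates the garbling instead of making a network request.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as ISO-8859-1.

    Hypothetical helper for illustration. If the text cannot be
    round-tripped (for example, it already contains real Chinese
    characters), it is returned unchanged.
    """
    try:
        return text.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = '百度一下'
# Simulate the problem: UTF-8 bytes interpreted as ISO-8859-1.
garbled = original.encode('utf-8').decode('iso-8859-1')

print(fix_mojibake(garbled) == original)   # garbled text is repaired
print(fix_mojibake(original) == original)  # correct text is left alone
```

Wrapping the conversion in a `try` block means the same call can be applied whether or not the response was decoded with the wrong charset, which may help when the behavior changes depending on the headers sent.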

    • When crawling Maoyan Movies you must specify headers, otherwise you get a 403 error. The garbled-text problem therefore cannot be solved with the method above.
    • Observations:
      • With the crawler: the output is sometimes garbled and sometimes shows Chinese correctly.
      • Without the crawler, copying the link by hand and opening the site in a browser gives two outcomes that correspond to the crawler's two cases: sometimes a verification page pops up first and the Maoyan Movies site appears only after it; sometimes the Maoyan movie ranking is displayed directly.
    • The ultimate solution: when the output is garbled, copy the link by hand and open the site in a browser. A verification page appears; complete it, and the site to be crawled is displayed. After that, running the .py file no longer produces garbled output.
    • The garbled-text problem illustrated with Baidu above may be fairly general, whereas the one illustrated with the Maoyan movie ranking is probably an isolated case.
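The observations above suggest checking each response before parsing it. The following is a rough sketch of such a check, assuming a page that is expected to contain Chinese text; the function name `classify_response`, the return labels, and the "no Chinese characters means garbled or verification page" heuristic are all assumptions made for illustration, not something the site documents.

```python
import re

# Basic CJK Unified Ideographs range; enough for a rough "contains Chinese" test.
CJK = re.compile(r'[\u4e00-\u9fff]')


def classify_response(status_code, text):
    """Hypothetical triage of a crawl result (heuristic, not definitive)."""
    if status_code == 403:
        # Matches the case described above: request sent without headers.
        return 'forbidden'
    if CJK.search(text):
        return 'ok'
    # No Chinese found: likely garbled output or a verification page.
    # Open the URL in a browser, complete the verification, then rerun.
    return 'garbled'


sample_ok = '<title>猫眼电影</title>'
sample_garbled = sample_ok.encode('utf-8').decode('iso-8859-1')

print(classify_response(403, ''))
print(classify_response(200, sample_ok))
print(classify_response(200, sample_garbled))
```

With `requests`, the two arguments would come from `response.status_code` and `response.text`.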

The regular expression is correct, but it does not match as expected
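The post gives no details for this pitfall, so the following only illustrates one common cause and is an assumption about what may have gone wrong: by default `.` does not match a newline, so a pattern that looks correct fails on multi-line HTML unless `re.S` (`re.DOTALL`) is passed. The HTML snippet and pattern are made up for the example.

```python
import re

# Made-up multi-line fragment, similar in shape to a ranking list entry.
html = '<dd>\n<p class="name">Example</p>\n</dd>'
pattern = r'<dd>.*?<p class="name">(.*?)</p>.*?</dd>'

# Without re.S, "." stops at the newline after <dd>, so nothing matches.
without_flag = re.findall(pattern, html)

# With re.S, "." also matches newlines and the name is captured.
with_flag = re.findall(pattern, html, re.S)

print(without_flag)
print(with_flag)
```

If the pattern contains Chinese characters, it is also worth confirming that the page text is not garbled before matching, since a garbled page cannot match them.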

 


Origin www.cnblogs.com/fran-py-/p/12234588.html