Crawling the Maoyan (Cat's Eye) Top movie ranking: Chinese is not displayed, output is garbled
- response.text always returns garbled text
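Crawling Baidu [https://www.baidu.com/] returned garbled text. Re-encoding the text as ISO-8859-1 and then decoding it as UTF-8 fixes it:

    import requests

    url = 'https://www.baidu.com/'
    # Undo the ISO-8859-1 decoding applied by response.text, then decode the bytes as UTF-8
    html = requests.get(url).text.encode('iso-8859-1').decode('utf-8')
    print(html)

This solves the garbling problem. But: you must not specify headers with this method, otherwise the output is garbled again.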
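The round trip above works when the body is really UTF-8 but `response.text` decoded it as ISO-8859-1. requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset, and the sketch below assumes that is what happened here. It reproduces the effect offline on a sample string; the helper names `simulate_wrong_decode` and `repair` are made up for illustration.

```python
def simulate_wrong_decode(raw: bytes) -> str:
    """Decode bytes as ISO-8859-1, the way response.text would under that assumed encoding."""
    return raw.decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Recover the original bytes, then decode them with the real encoding (assumed UTF-8)."""
    return garbled.encode("iso-8859-1").decode("utf-8")


# Sample UTF-8 body standing in for the page content
raw = "百度一下，你就知道".encode("utf-8")

garbled = simulate_wrong_decode(raw)   # mojibake, not readable Chinese
fixed = repair(garbled)                # readable again

print(garbled)
print(fixed)
```

With a real response, setting `resp.encoding = 'utf-8'` before reading `resp.text`, or calling `resp.content.decode('utf-8')`, should give the same result without the round trip, provided the page is in fact UTF-8.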
- You must specify headers when crawling Maoyan (Cat's Eye) movies, otherwise the request fails with a 403 error. Because the method above only works without headers, it cannot fix the garbling for this crawl.
- Observations:
- With the crawler: the output is sometimes garbled and sometimes shows Chinese correctly.
- Without the crawler, copying the link and opening it manually in a browser also gives two outcomes, matching the two crawler cases: either a verification page pops up first and the Maoyan site is shown only after it is completed, or the Maoyan movie ranking is displayed directly.
- The final solution: when the output is garbled, copy the link and open the site manually. The verification page appears; complete it and the page to be crawled is displayed. After that, running the .py file no longer produces garbled output.
- The Baidu garbling described above is probably fairly general, while the garbled Maoyan movie ranking is more likely an isolated case.
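The observations above suggest that the "garbled" output may simply be the verification page rather than the ranking. A minimal sketch of how the crawler could tell the cases apart before parsing follows. The ranking URL, the User-Agent string, and the marker strings are all assumptions, not taken from the site; they would need to be checked against the real pages.

```python
# All constants below are assumptions for illustration only.
BOARD_URL = "https://maoyan.com/board/4"           # assumed ranking URL
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
VERIFY_MARKERS = ("验证中心", "verify")             # assumed text on the verification page
RANKING_MARKER = "board-wrapper"                   # assumed marker in the ranking HTML


def classify(status_code: int, html: str) -> str:
    """Return 'forbidden', 'ranking', 'verification', or 'unknown' for a response."""
    if status_code == 403:
        return "forbidden"          # typically what happens when headers are missing
    if RANKING_MARKER in html:
        return "ranking"            # the page we actually want to parse
    if any(marker in html for marker in VERIFY_MARKERS):
        return "verification"       # open the link in a browser and complete the check
    return "unknown"


def fetch_board(url: str = BOARD_URL):
    """Fetch the page with headers and report which kind of page came back."""
    import requests  # imported here so classify() is usable without the dependency

    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.encoding = "utf-8"         # assumption: the site serves UTF-8
    return classify(resp.status_code, resp.text), resp.text
```

If `fetch_board()` reports `'verification'`, the manual step described above (opening the link and completing the check) is needed before the crawler is run again.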
The regular expression is correct, but it fails to match