Why your crawler was banned

Common reasons a crawler gets banned:
1. First, check JavaScript. If the page you receive from the web server is blank, is missing information, or is otherwise not what you expected (or not what you see in your browser), the problem is probably that the site builds the page with JavaScript that your crawler never executes (see the first sketch after this list).
2. Check the parameters a normal browser submits. If you are going to submit a form or make a POST request to a website, check the page to make sure every field you intend to submit is filled in and in the correct format. Use Chrome's Network panel (press F12 to open the developer console, then click "Network") to inspect the POST request the browser sends to the site, and confirm that each of your parameters matches it (see the second sketch after this list).
3. Are your cookies valid? If you have logged into the site but cannot stay logged in, or the site otherwise behaves as though you are not logged in, check your cookies. Make sure they are set correctly on every page load and sent back to the site with every request (see the third sketch after this list).
4. Is your IP banned? If you are getting client-side HTTP errors, especially 403 Forbidden, the site may have decided your IP address belongs to a bot and stopped accepting requests from it. You either have to wait for your IP address to be removed from the site's blacklist or change your IP address (going to a Starbucks and using its Wi-Fi will do). If you are sure you have not been banned, work through the points below.
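For item 1, one way to confirm the diagnosis is to compare the raw HTTP response with what a real browser renders after running the page's JavaScript. Here is a minimal sketch, assuming Selenium and Chrome are installed; the URL is a placeholder:

```python
import requests
from selenium import webdriver

url = "https://example.com/page"  # placeholder

# What the raw HTTP response contains (no JavaScript executed)
raw_html = requests.get(url).text

# What the page contains after the browser runs its JavaScript
driver = webdriver.Chrome()
driver.get(url)
rendered_html = driver.page_source
driver.quit()

# If the rendered page is much larger, the content you are missing
# is being generated by JavaScript, not sent in the raw response.
print(len(raw_html), len(rendered_html))
```

If the two differ substantially, you need a tool that executes JavaScript (such as Selenium) rather than a plain HTTP client.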
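For item 2, a sketch of replaying the form fields you saw in the Network panel using the requests library; the URL and field names below are invented examples:

```python
import requests

# Reproduce every field the browser sent, with the same names and formats
form_data = {
    "username": "my_user",
    "password": "my_pass",
    "csrf_token": "value-copied-from-the-page",  # many forms require this
}

response = requests.post("https://example.com/login", data=form_data)
print(response.status_code)
```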
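For item 3, a requests.Session object keeps cookies across requests, so the site sees the same login cookie on every page load. Again, the URLs and field names are placeholders:

```python
import requests

session = requests.Session()
session.post("https://example.com/login",
             data={"username": "my_user", "password": "my_pass"})

# Inspect the cookies the site actually set
print(session.cookies.get_dict())

# The same cookies are sent automatically on every later request
page = session.get("https://example.com/members-only")
print(page.status_code)
```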

Make sure your crawler is not especially fast on any one site. Rapid collection is a bad habit: it puts a heavy load on the web server, it can land you in legal trouble, and it is the number-one reason IPs end up on website blacklists. Add delays to your crawler and let it run in the dead of night, as in the sketch below. Remember: writing programs or collecting data in a rush is a sign of poor project management; plan ahead so you never have to hurry.
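A minimal sketch of that advice, assuming the requests library; the three-second delay and the URL list are arbitrary examples:

```python
import time
import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

for url in urls:
    response = requests.get(url)
    if response.status_code == 403:
        # Likely blacklisted (item 4 above): stop instead of
        # hammering the server and making things worse.
        print("Got 403 Forbidden, backing off:", url)
        break
    # ... process response.text here ...
    time.sleep(3)  # a few seconds between requests is a polite default
```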

One more thing you must do: modify your request headers! Some sites block any visitor whose headers announce a crawler. If you are not sure what appropriate header values look like, copy the ones your own browser sends, as in the sketch below.
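A sketch of sending browser-like headers with requests; the User-Agent string below was copied from an ordinary Chrome session, and the values any current browser sends will do:

```python
import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://example.com", headers=headers)
# Confirm what was actually sent
print(response.request.headers["User-Agent"])
```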

Verify that you are not clicking on or requesting anything a human user could not normally see or reach; such hidden links are a common bot trap (see the sketch below).
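One way to follow this advice, as a rough sketch: skip links hidden from humans with CSS before following them. This only catches the simplest case (inline styles and the hidden attribute), and BeautifulSoup plus the URL are assumptions for illustration:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a", href=True):
    style = link.get("style", "")
    if "display:none" in style.replace(" ", "") or link.get("hidden"):
        # A human could never click this link; a bot that follows it
        # identifies itself as a bot.
        continue
    print(link["href"])
```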

If you are using a lot of complicated measures to access the site, consider contacting the webmaster and explaining what you are doing. Try emailing webmaster@<domain name> or admin@<domain name> and asking for permission to collect data with your crawler. Webmasters are human too!
