[Crawler in Practice] Let's analyze Amazon's anti-crawler mechanism step by step

Hello everyone, I'm Lex, the Lex who likes to bully Superman

Areas of expertise: Python development, network security and penetration testing, Windows domain controllers and Exchange architecture

Today's focus: analyzing and getting past Amazon's anti-crawler mechanism, step by step

Here's the thing

Amazon is the world's largest shopping platform

so its product information, user reviews, and similar data are among the most abundant anywhere.

Today, I'll walk you through getting past Amazon's anti-crawler mechanism

and crawling the product details, reviews, and other useful information you want.

The anti-crawler mechanism

However, when we want to use a crawler to collect the relevant data,

large shopping platforms like Amazon, Taobao, and JD

all have a complete set of anti-crawler mechanisms in place to protect their data.

Let's probe Amazon's anti-crawler mechanism first.

We'll test it step by step with several different Python crawler modules,

and in the end, successfully get past it.

1. The urllib module

The code is as follows:

# -*- coding:utf-8 -*-
import urllib.request, urllib.error

# urlopen() raises HTTPError on a 503, so catch it to read the status code
try:
    print(urllib.request.urlopen('https://www.amazon.com').code)
except urllib.error.HTTPError as e:
    print(e.code)

Result: status code 503.

Analysis: Amazon identified the request as a crawler and refused to provide service.

In the spirit of scientific rigor, let's run the same test against the ever-popular Baidu as a control.

Result: status code 200.

Analysis: normal access.

So the urllib module's request is recognized by Amazon as a crawler and refused service.
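One reason the bare urllib request is so easy to flag is its default User-Agent, which announces the request as a Python script. As a sketch (not a guaranteed pass, since Amazon checks other signals too), here is how one could attach a browser-like User-Agent via a `Request` object; no network call is made here, we just inspect the header:

```python
import urllib.request

url = 'https://www.amazon.com'
# A bare urlopen() sends "Python-urllib/3.x" as the User-Agent, an obvious
# bot signature. Building a Request lets us attach browser-like headers.
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
})
# The header is attached before any request is actually sent
print(req.get_header('User-agent'))
```

Passing this `req` to `urlopen()` would then send the spoofed header; as the next sections show, a User-Agent alone is usually not enough for Amazon.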

2. The requests module

1. Direct access with requests


The code is as follows ↓ ↓ ↓

import requests
url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
r = requests.get(url)
print(r.status_code)

Result: status code 503.

Analysis: Amazon also rejected the request from the requests module,

identifying it as a crawler and refusing to provide service.
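The requests module fails for the same basic reason: it also identifies itself in its default headers. A quick way to see the bot signature it sends (the exact version number will vary with your installation):

```python
import requests

# Every requests call without explicit headers sends a User-Agent of the
# form "python-requests/<version>", which server-side filters match easily.
session = requests.Session()
print(session.headers['User-Agent'])
```

This is exactly what the next step replaces with a browser-like header set.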

2. Add cookies to requests

We attach the request cookie and other browser-like header information.


The code is as follows ↓ ↓ ↓

import requests

url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
# Browser-like headers; fill in the Cookie value from your own browser session
web_header={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'your cookie value here',
    'TE': 'Trailers'}
r = requests.get(url, headers=web_header)
print(r.status_code)

Result: status code 200.

Analysis: the returned status code is 200, which looks normal. Now this is starting to smell like a working crawler.

3. Check the returned page

With the requests-plus-cookie approach, the status code we get is 200,

so at least Amazon's servers are currently responding to us.

Let's write the crawled page to a file and open it in a browser.

Well, damn... the status code is normal, but what came back is the anti-crawler CAPTCHA page.

Still blocked by Amazon.
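Since a 200 status code alone proves nothing, it helps to check the response body for the verification page programmatically instead of opening the file by hand every time. A minimal sketch; the marker strings are assumptions based on Amazon's typical CAPTCHA interstitial, so adjust them to what you actually see in your saved file:

```python
def looks_like_captcha(html: str) -> bool:
    """Return True if the HTML looks like Amazon's CAPTCHA page
    rather than real content. The markers are illustrative guesses."""
    markers = ('validatecaptcha', 'type the characters you see')
    page = html.lower()
    return any(m in page for m in markers)

# Stand-in snippets instead of live responses:
print(looks_like_captcha('<form action="/errors/validateCaptcha">'))   # True
print(looks_like_captcha('<div id="productTitle">KAVU Rope Bag</div>'))  # False
```

Calling `looks_like_captcha(r.text)` right after the `requests.get()` above would flag the block without any manual inspection.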

3. The selenium automation module

Install the selenium module:

pip install selenium

Import selenium in the code and set the relevant parameters:

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# selenium configuration
options = Options()
# headless mode: don't open a visible browser window
options.add_argument('--headless')
# path to the selenium driver for the Chrome browser
chromedriver = "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
# combine the options with the browser driver
browser = webdriver.Chrome(executable_path=chromedriver, options=options)

Test access:

url = "https://www.amazon.com"
print(url)
# visit Amazon through selenium
browser.get(url)

Result: the page loads normally (selenium drives a real browser, so there is no status code to print).

Analysis: access succeeds. Let's take a look at the information on the crawled webpage.

Save the page source to a local file:

# write the crawled page source to a local file
fw = open('E:/amzon.html', 'w', encoding='utf-8')
fw.write(str(browser.page_source))
browser.close()
fw.close()

Open the local file we saved and take a look:

we have successfully bypassed the anti-crawler mechanism and landed on Amazon's homepage.

Conclusion

With the selenium module, we have successfully gotten past

Amazon's anti-crawler mechanism.

Next up: we will continue with how to crawl hundreds of thousands of Amazon product listings and reviews.

【If you have any questions, please leave a message~~~】


Origin juejin.im/post/6974300157126901790