Hello everyone, I'm Lex, the Lex who likes to bully Superman
Areas of expertise: Python development, network security and penetration testing, Windows domain controllers and Exchange architecture
Today's focus: a step-by-step analysis of Amazon's anti-crawler mechanism, and how to get past it
Here's the thing:
Amazon is the world's largest shopping platform,
and its product information, user reviews, and other data are among the richest anywhere.
Today I'll walk you through getting past Amazon's anti-crawler mechanism,
so you can crawl the product details, reviews, and other information you want.
The anti-crawler mechanism
However, when we try to use a crawler to collect that data,
we find that large shopping platforms like Amazon, Taobao, and JD
all have a full set of anti-crawler mechanisms in place to protect their data.
Let's probe Amazon's anti-crawler mechanism first:
we'll test it step by step with several different Python crawler modules,
and in the end get past it.
1. The urllib module
The code is as follows:
# -*- coding:utf-8 -*-
import urllib.request
import urllib.error

# urlopen raises HTTPError for non-2xx responses, so catch it to see the code
try:
    req = urllib.request.urlopen('https://www.amazon.com')
    print(req.code)
except urllib.error.HTTPError as e:
    print(e.code)
Result: status code 503.
Analysis: Amazon identified our request as a crawler and refused to serve it.
In the spirit of scientific rigor, let's run the exact same test against Baidu.
Result: status code 200.
Analysis: normal access.
So the urllib request itself works fine; it's Amazon that recognized it as a crawler and refused to serve it.
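The Baidu check is just the same urllib request pointed at a different host. A minimal sketch:

```python
import urllib.request

# The same urllib request as above, aimed at Baidu instead of Amazon
req = urllib.request.urlopen('https://www.baidu.com')
print(req.code)  # Baidu serves the page normally with a 200
```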
2. The requests module
1. Direct access with requests
The code is as follows:
import requests
url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
r = requests.get(url)
print(r.status_code)
Result: status code 503.
Analysis: Amazon also rejects the requests module's request,
identifying it as a crawler and refusing to serve it.
2. Adding cookies to requests
This time we attach a real browser's cookie and the other related request headers
(you can copy them from the Network tab of your browser's developer tools).
The code is as follows:
import requests

url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
# Headers copied from a real browser session; replace the Cookie value with your own
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'your cookie value here',
    'TE': 'Trailers'}
r = requests.get(url, headers=web_header)
print(r.status_code)
Result: status code 200.
Analysis: a 200 status code means the request went through normally. Now that's starting to smell like a working crawler.
3. Checking the returned page
With the requests + cookie approach, the status code we get is 200,
so at least Amazon's servers are currently responding to us.
Let's write the crawled page to a local file and open it in a browser.
Well, damn... the status looks normal, but what came back is the anti-crawler CAPTCHA page.
Still blocked by Amazon.
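Rather than eyeballing the saved file every time, we can scan the response body for Amazon's robot-check wording programmatically. A minimal sketch: `page_html` stands in for `r.text` from the request above, and the marker strings are assumptions based on the wording of Amazon's English CAPTCHA page.

```python
# page_html stands in for r.text from the earlier requests call;
# here it is a stub containing Amazon's robot-check phrasing
page_html = '<html><body>Type the characters you see in this image</body></html>'

def looks_like_captcha(html: str) -> bool:
    """Heuristic: does the page look like Amazon's robot-check page?"""
    markers = ('Type the characters you see in this image',
               'api-services-support@amazon.com')
    return any(m in html for m in markers)

# Save the page for manual inspection in a browser, as described above
with open('amzon_check.html', 'w', encoding='utf-8') as fw:
    fw.write(page_html)

print(looks_like_captcha(page_html))  # True -> status 200, but still blocked
```

This way a crawler can detect "200 but blocked" responses automatically instead of silently saving CAPTCHA pages as data.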
3. The selenium automation module
Install the selenium module:
pip install selenium
Import selenium in the code and set the relevant parameters:
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# selenium configuration parameters
options = Options()
# Headless mode: don't open a visible browser window
options.add_argument('--headless')
# Path to the ChromeDriver executable that drives Chrome for selenium
chromedriver = "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
# Combine the driver path with the configured options
browser = webdriver.Chrome(chromedriver, chrome_options=options)
Test access:
url = "https://www.amazon.com"
print(url)
# Visit Amazon through selenium
browser.get(url)
Result: the page loads normally.
Analysis: selenium drives a real Chrome browser, so there is no status code to print here, but the access goes through. Let's look at the information on the crawled webpage.
Save the page source to a local file:
# Write the crawled page source to a local file
fw = open('E:/amzon.html', 'w', encoding='utf-8')
fw.write(str(browser.page_source))
browser.close()
fw.close()
Open the crawled local file and take a look:
we have successfully bypassed the anti-crawler mechanism and landed on Amazon's homepage.
Wrapping up
With the selenium module, we can successfully get past
Amazon's anti-crawler mechanism.
Next up: how to crawl hundreds of thousands of product listings and reviews on Amazon.
[If you have any questions, leave a comment~~~]