[Python web crawler in practice 22] Step-by-step walkthrough of crawling Vipshop product information


Manual anti-crawler: original blog address

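Organizing this material takes effort, so please respect the author's work. This article is published only on CSDN; any copy of it on another website was scraped without the author's authorization.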

If reprinted, please indicate the source, thank you!

1. Target URL and page analysis

Searching for skin care sets on the Vipshop website returns a page of product results.

Scrolling down shows that the page loads more products automatically as you approach the bottom. This is Ajax loading, which indicates that the product information comes from a JSON interface. Scrolling all the way to the end reveals the page-turning buttons.

2. Preliminary exploration of the crawler

First, capture the network traffic to find the URL that actually serves the product data. Right-click the page and choose Inspect, open the Network tab, and refresh the page so that the requests are listed. Then search and filter the requests to find the ones that contain the product information. Most of the content turns out to be in the requests whose URL contains callback.

Of these seven requests, only four are useful. The second one, rank, contains the IDs of all the products on the current page.

The remaining three v2 requests split these 120 products between them (the product numbering inside each response starts from 0).

This locates the real data interfaces for the 120 products on the search page. Next, request one of these URLs to see what comes back, and then work out the pattern to see whether all the data on the page can be crawled.

Set the request headers with the User-Agent, Cookie and Referer values (shown under Headers in the Network tab), copy the URL of the data interface into a variable, and send the request. The code below first requests the data for 20 products.

To find the cookie, clear the callback filter and select the first request in the list, suggest.

Note: set the request headers according to what your own browser returns.

import requests

headers = {
    
    
	'Cookie': 'vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
	'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'
html = requests.get(url,headers=headers)
print(html.text)

The output matches the result returned by the interface in the browser.

Next, compare the actual request URLs of the three v2 requests to identify the pattern:

'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'
'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets1&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918241720044454476%2C6917919624790589569%2C6917935170607219714%2C6918794091804350029%2C6918825617469761228%2C6918821681541400066%2C6918343188631192386%2C6918909902880919752%2C6918944714357405314%2C6918598446593061836%2C6917992439761061707%2C6918565057324098974%2C6918647344809112386%2C6918787811445699149%2C6918729979027610590%2C6918770949378056781%2C6918331290238460382%2C6918782319292540574%2C6918398146810241165%2C6918659293579989333%2C6917923814107067291%2C6918162041180009111%2C6918398146827042957%2C6917992175963801365%2C6918885216264034310%2C6918787811496047181%2C6918273588862755984%2C6917924752735125662%2C6918466082515404493%2C6918934739456193886%2C6917924837261255565%2C6918935779609622221%2C6917920117494382747%2C6917987978233958977%2C6917923641027928222%2C6918229910205674453%2C6917970328155673856%2C6918470882161509397%2C6918659293832008021%2C6918750646128649741%2C6917923139576259723%2C6918387987850605333%2C6917924445491982494%2C6918790938962557837%2C6918383695533143067%2C6918872378378761054%2C6918640250037793602%2C6918750646128641549%2C6917937020463562910%2C6917920520629265102%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865436'
'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets2&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918690813782926366%2C6918447252612175371%2C6918159188446941835%2C6918205147496443989%2C6918006775182997019%2C6918710130501497419%2C6917951703208964235%2C6918936224464094528%2C6918394023211385035%2C6918872268898919262%2C6918397905200202715%2C6918798460682221086%2C6918800888595138517%2C6917919413703328321%2C1369067222846365%2C6917924520139822219%2C6918904223283803413%2C6918507022166130843%2C6918479374087209281%2C6917924176900793243%2C6918750646145443341%2C6918449056102412742%2C6918901362318117467%2C6918570897095177292%2C6917924520223884427%2C6918757924517328902%2C6918398146827051149%2C6918789686747831253%2C6918476662192264973%2C6917919300445017109%2C6917919922739126933%2C6917920155539928286%2C6918662208810186512%2C6917923139508970635%2C6918859281628675166%2C6918750645658871309%2C6918820034693202694%2C6918689681141637573%2C6917919916536480340%2C6918719763326603415%2C6918659293579997525%2C6917920335390225555%2C6918589584225669211%2C6918386595131470421%2C6918640034622429077%2C6917923665227256725%2C6918331290238476766%2C6917924054840074398%2C6917924438479938177%2C6917920679932125915%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865437'

Comparing the three URLs, the essential difference is the productIds parameter in the middle (the callback name and the trailing _ timestamp also differ). So once the IDs of all the products are known, the information for all the products can be obtained. These IDs are exactly what the second request, rank, returns. The plan is therefore to request the rank URL first, extract the product IDs, and then rebuild the v2 URL with them to get the detailed product information.
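The comparison described above can also be done programmatically. The following is a minimal sketch using the standard urllib.parse module; the helper name diff_query_params and the two shortened example.com URLs are made up for illustration and only mimic the shape of the captured URLs.

```python
from urllib.parse import urlsplit, parse_qsl


def diff_query_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ."""
    params_a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    params_b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    return {
        name: (params_a.get(name), params_b.get(name))
        for name in sorted(params_a.keys() | params_b.keys())
        if params_a.get(name) != params_b.get(name)
    }


# Shortened, made-up URLs in the same shape as the captured ones.
url_1 = 'https://example.com/list/v2?callback=cb1&client=pc&productIds=111%2C222%2C&_=1'
url_2 = 'https://example.com/list/v2?callback=cb2&client=pc&productIds=333%2C444%2C&_=2'

differences = diff_query_params(url_1, url_2)
for name, (value_a, value_b) in differences.items():
    print(name, value_a, value_b)
```

Pasting two of the real captured URLs in place of url_1 and url_2 should list only the parameters that change between requests.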

3. Crawler in practice

3.1 Crawling the product IDs

To turn pages, look for the parameter that controls the page number. The first page holds 120 items, and its pageOffset parameter is 0. On the second page pageOffset is 120, on the third it is 240, and so on: the offset increases by 120 per page while the other parameters stay almost unchanged.
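The offset rule can be written down as a small helper. This is only a sketch: the function name page_offsets and its start_page argument are hypothetical, while the step of 120 comes from the pageOffset values observed above.

```python
def page_offsets(n_pages, batch_size=120, start_page=1):
    """Return the pageOffset value for each page, counting pages from 1."""
    return [(page - 1) * batch_size
            for page in range(start_page, start_page + n_pages)]


# Offsets for the first three pages.
print(page_offsets(3))
# Offset for the second page only.
print(page_offsets(1, start_page=2))
```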

3.2 Building the URL for the product ID data

The request code is therefore as follows:

import requests
import json
headers = {
    
    
	'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
	'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
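n = 1  # n determines which pages are requested; an input() call could supply it instead
for num in range(120,(n+1)*120,120):  # this range starts from the second page; set the first argument to 0 to start from the first page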
	url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
	html = requests.get(url,headers=headers)
	print(html.text)

The output shows that the product IDs are retrieved successfully.
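Instead of editing one long URL string, the query string can be assembled from a dictionary. The sketch below uses urllib.parse.urlencode and includes only a few of the parameters from the captured rank URL. Whether the server accepts such a reduced set is an untested assumption, so in practice the remaining parameters (api_key, mars_cid and so on) would be copied from your own captured request. The function name build_rank_url is hypothetical.

```python
from urllib.parse import urlencode

RANK_ENDPOINT = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank'


def build_rank_url(keyword, page_offset, batch_size=120):
    """Build a rank URL from a subset of the captured query parameters."""
    params = {
        'callback': 'getMerchandiseIds',
        'keyword': keyword,          # urlencode percent-encodes the UTF-8 bytes
        'pageOffset': page_offset,
        'batchSize': batch_size,
    }
    return RANK_ENDPOINT + '?' + urlencode(params)


rank_url = build_rank_url('护肤套装', 120)
print(rank_url)
```

The printed keyword value is the same percent-encoded string that appears in the captured URL, which shows that the search term is simply the UTF-8 encoding of 护肤套装 (skin care set).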

3.3 Converting the product ID data and verifying the count

The response is not yet plain JSON: the JSON object is wrapped in a call to the callback function (JSONP). Strip the wrapper and parse what remains so that Python can work with the data. The code is as follows:

import json

# note: the code below runs inside the for loop
start = html.text.index('{')
end = html.text.index('})')+1
json_data = json.loads(html.text[start:end])
print(json_data)

The output contains the ID information for the desired products.

Next, verify that the data is complete, that is, whether the number of product IDs obtained (the pid field) equals 120. The code is as follows:

# this also runs inside the for loop
print(json_data['data']['products'])
print('')
print(len(json_data['data']['products']))

The output confirms the count. Note that the first print shows a list of dictionaries.
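The slicing used above to strip the callback wrapper can be packaged as a reusable helper. This is a sketch rather than the author's code: the name strip_jsonp is hypothetical, and it searches for the last closing parenthesis instead of the first '})', which should be less fragile if that character pair ever appears inside a string value. The sample text is made up; only the data/products/pid nesting is taken from the article.

```python
import json


def strip_jsonp(text):
    """Parse a JSONP response of the form callbackName({...}) into a dict."""
    start = text.index('(') + 1   # first character after the opening parenthesis
    end = text.rindex(')')        # position of the last closing parenthesis
    return json.loads(text[start:end])


# Made-up response text in the shape returned by the rank interface.
sample = 'getMerchandiseIds({"code": 1, "data": {"products": [{"pid": "1"}, {"pid": "2"}]}})'
data = strip_jsonp(sample)
print(len(data['data']['products']))
```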

3.4 Obtaining product details

Now loop over the product IDs and request the details for each product. Pay attention to how product_url is built: delete all the product IDs from the productIds parameter in the middle of the URL, then fill in a single ID with the format method. The code is as follows:

# inside the for loop above
product_ids = json_data['data']['products']
for product_id in product_ids:
	print('product id',product_id['pid'])
	product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
	product_html = requests.get(product_url,headers = headers)
	print(product_html.text)

The output is the raw response for each product.

As with the product ID data, the detail response must first be parsed and then the fields extracted, for example the name, brand and price of the product:

# the first 10 products are used here as an example
product_start = product_html.text.index('{')
product_end = product_html.text.index('})')+1
product_json_data = json.loads(product_html.text[product_start:product_end])
product_info_data = product_json_data['data']['products'][0]
# print(product_info_data)
product_title = product_info_data['title']
product_brand = product_info_data['brandShowName']
product_price = product_info_data['price']['salePrice']
print('Product name: {}, brand: {}, sale price: {}'.format(product_title,product_brand,product_price))

The relevant information is retrieved correctly. The title, brand and sale price are used as the example here; other, more detailed fields can be extracted the same way.

The last step is to write the retrieved data to a local file:

with open('vip.txt','a+',encoding = 'utf-8') as f:
	f.write('Product name: {}, brand: {}, sale price: {}\n'.format(product_title,product_brand,product_price))

The data has now been crawled and saved locally.
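The loop above sends one request per product, whereas the v2 URLs captured from the browser carry up to 50 IDs each. A possible refinement is to group the IDs into batches before building the URL. The sketch below only prepares the productIds values and sends no requests; the helper names are hypothetical, and it is an assumption that the interface returns every requested product in data.products, in which case the code would iterate over that list instead of taking element [0].

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def product_ids_param(pids):
    """Join IDs as in the captured URLs: each ID followed by an encoded comma."""
    return ''.join(f'{pid}%2C' for pid in pids)


# Placeholder IDs standing in for the 120 pid values of one page.
pids = [str(n) for n in range(1, 121)]
batches = [product_ids_param(batch) for batch in chunked(pids, 50)]
print(len(batches))
```

Each element of batches can then be substituted for the productIds value in product_url, without the extra %2C that the single-ID template appends.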

4. Complete code

The whole process could be wrapped in a function, and the data could also be stored locally as csv or xlsx. Only plain-text (txt) storage is shown here.

import requests
import json

headers = {
    
    
	'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
	'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

n = 1  # n is the number of pages to crawl
for num in range(0,n*120,120): 
	url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
	html = requests.get(url,headers=headers)
	# print(html.text)

	start = html.text.index('{')
	end = html.text.index('})')+1
	json_data = json.loads(html.text[start:end])
	product_ids = json_data['data']['products']
	for product_id in product_ids:
		print('product id',product_id['pid'])
		product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
		product_html = requests.get(product_url,headers = headers)
		product_start = product_html.text.index('{')
		product_end = product_html.text.index('})')+1
		product_json_data = json.loads(product_html.text[product_start:product_end])
		product_info_data = product_json_data['data']['products'][0]
		# print(product_info_data)
		product_title = product_info_data['title']
		product_brand = product_info_data['brandShowName']
		product_price = product_info_data['price']['salePrice']
		print('Product name: {}, brand: {}, sale price: {}'.format(product_title,product_brand,product_price))
		with open('vip.txt','a+',encoding = 'utf-8') as f:
			f.write('Product name: {}, brand: {}, sale price: {}\n'.format(product_title,product_brand,product_price))

With n=4, run the code again. To check the amount of data, open the txt file in Sublime Text: the number of lines is exactly the total number of products on 4 pages. This completes the crawl of the Vipshop product information.
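CSV storage is mentioned at the start of this section as an alternative to the txt file. A minimal sketch with the standard csv module could look like the following. The function name, the column names and the sample row are all placeholders, and the example writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io


def write_products_csv(fileobj, rows):
    """Write a header line followed by (title, brand, sale price) rows."""
    writer = csv.writer(fileobj)
    writer.writerow(['title', 'brand', 'sale_price'])
    writer.writerows(rows)


# Placeholder row standing in for product_title, product_brand, product_price.
rows = [('Example skin care set', 'ExampleBrand', '199')]
buffer = io.StringIO()
write_products_csv(buffer, rows)
print(buffer.getvalue())
```

With a real file, the buffer would be replaced by something like open('vip.csv', 'w', newline='', encoding='utf-8-sig'); the newline='' argument is what the csv module documentation recommends, and the utf-8-sig encoding is a common choice so that spreadsheet programs detect the Chinese text correctly.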


Origin blog.csdn.net/lys_828/article/details/108600922