Foreword
Today's topic is crawling mobile phone product data with Python. The code is shared below for anyone who needs it, along with a few tips.
First, before crawling, make the crawler look as much like a browser as possible so that it is not recognized as a crawler. The basic step is to add request headers, but plain-text data like this is crawled by many people, so we should also consider rotating proxy IPs and randomly switching request headers when crawling the phone data.
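Rotating the request headers and the proxy IP can be sketched as follows. This is a minimal sketch, not the article's own code: the helper name `build_request_kwargs`, the User-Agent pool, and the proxy addresses are all assumptions made for illustration (the addresses are placeholders from a documentation-only range).

```python
import random

# Hypothetical pools: real User-Agent strings and proxy addresses would come
# from your own configuration. The proxy addresses are placeholders.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
]
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]


def build_request_kwargs(rng=random):
    """Return keyword arguments for requests.get with a random UA and proxy."""
    proxy = rng.choice(PROXIES)
    return {
        "headers": {"user-agent": rng.choice(USER_AGENTS)},
        # requests expects a mapping from URL scheme to proxy URL
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }
```

Each request would then be sent as `requests.get(url, **build_request_kwargs())`, so consecutive requests may go out with different headers and through different proxies.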
Before writing any crawler code, the first and most important step is to analyze the target web pages.
Analysis showed that crawling is relatively slow, so we can also speed up the crawler by disabling images, JavaScript, and so on in Google Chrome.
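Disabling images and JavaScript only applies when pages are loaded through a real browser. The following is a minimal sketch, assuming Chrome is driven by Selenium (which the requests-based code in this article does not use) and assuming the commonly used Chrome preference keys shown; `chrome_prefs` and `make_driver` are hypothetical helper names.

```python
def chrome_prefs(disable_images=True, disable_javascript=True):
    """Build the Chrome preference dict; the value 2 means "block"."""
    prefs = {}
    if disable_images:
        prefs["profile.managed_default_content_settings.images"] = 2
    if disable_javascript:
        prefs["profile.managed_default_content_settings.javascript"] = 2
    return prefs


def make_driver():
    """Start Chrome with the preferences applied (needs selenium and a local Chrome)."""
    from selenium import webdriver  # imported lazily: only needed here

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", chrome_prefs())
    return webdriver.Chrome(options=options)
```

Keeping the preference dict in its own function makes it easy to switch one setting back on, for example `chrome_prefs(disable_javascript=False)` for pages that need scripts to render their product list.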
Development tools
Python version: 3.6
Related modules:
requests module
json module
lxml module
openpyxl module
Environment setup
Install Python and add it to the PATH environment variable, then use pip to install the required modules (json ships with Python, so only requests, lxml, and openpyxl need installing).
Idea analysis
Open the page we want to crawl in the browser.
Press F12 to open the developer tools and find where the mobile phone product data we want is located.
Here, the data we need is in the page itself.
Code
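Request headers to avoid anti-crawling measures
# Note: most of these headers are optional; keeping only user-agent is also enough to crawl the data
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    'cookie': 'your Cookie',  # placeholder: paste your own cookie string here
    'accept-encoding': 'gzip, deflate, br',  # requests can only decode 'br' if a brotli package is installed
    'accept-language': 'zh-CN,zh;q=0.9',
    'upgrade-insecure-requests': '1',
    'referer': 'https://www.jd.com/',
}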
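### Get the number of product comments
```python
import json

import openpyxl
import requests
from lxml import etree

# Workbook that collects the results; row 1 holds the column titles
outwb = openpyxl.Workbook()
outws = outwb.create_sheet(index=0)
outws.cell(row=1, column=1, value="index")
outws.cell(row=1, column=2, value="title")
outws.cell(row=1, column=3, value="price")
outws.cell(row=1, column=4, value="CommentCount")
count = 2  # next row to write; row 1 is the header
```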
Get the number of comments based on the product id
def commentcount(product_id):
    url = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds="+str(product_id)+"&callback=jQuery8827474&_=1615298058081"
    res = requests.get(url, headers=headers)
    res.encoding = 'gbk'
    # The response is JSONP: strip the jQuery8827474( ... ); wrapper to get plain JSON
    text = (res.text).replace("jQuery8827474(","").replace(");","")
    text = json.loads(text)
    comment_count = text['CommentsCount'][0]['CommentCountStr']
    comment_count = comment_count.replace("+", "")
    # Convert counts given in units of "万" (10,000), e.g. "2万" -> "20000"
    if "万" in comment_count:
        comment_count = comment_count.replace("万","")
        # float() also accepts decimal values such as "1.5"; int() would raise ValueError on them
        comment_count = str(round(float(comment_count)*10000))
    return comment_count
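The two parsing steps in `commentcount` (unwrapping the JSONP response and converting counts written with "万") can be tried without any network access. The sketch below is an illustration under stated assumptions: `parse_jsonp` and `normalize_count` are hypothetical helper names, and the sample payload is made up to mirror the keys used above rather than captured from the site.

```python
import json
import re


def parse_jsonp(text):
    """Strip a callback wrapper such as jQuery123({...}); and parse the JSON inside."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text.strip(), re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


def normalize_count(count_str):
    """Turn strings like '3000+' or '2.5万+' into a plain digit string."""
    count_str = count_str.replace("+", "")
    if "万" in count_str:
        # 万 means 10,000; float() accepts both '2' and '2.5'
        return str(round(float(count_str.replace("万", "")) * 10000))
    return count_str


# Illustrative payload, not a captured response
sample = 'jQuery8827474({"CommentsCount":[{"CommentCountStr":"2.5万+"}]});'
data = parse_jsonp(sample)
print(normalize_count(data["CommentsCount"][0]["CommentCountStr"]))  # prints 25000
```

Matching the wrapper with a regular expression instead of a fixed string means the parser keeps working if the callback name in the URL changes.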
Get product data for each page
def getlist(url):
    global count
    #url="https://search.jd.com/search?keyword=笔记本&wq=笔记本&ev=exbrand_联想%5E&page=9&s=241&click=1"
    res = requests.get(url,headers=headers)
    res.encoding = 'utf-8'
    text = res.text
    selector = etree.HTML(text)
    # Each <li> under #J_goodsList is one product; named items so the built-in list is not shadowed
    items = selector.xpath('//*[@id="J_goodsList"]/ul/li')
    for i in items:
        # Note: [0] raises IndexError if an entry lacks one of these nodes
        title = i.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()')[0]
        price = i.xpath('.//div[@class="p-price"]/strong/i/text()')[0]
        # The comment link id looks like J_comment_<product id>
        product_id = i.xpath('.//div[@class="p-commit"]/strong/a/@id')[0].replace("J_comment_","")
        comment_count = commentcount(product_id)
        #print(title)
        #print(price)
        #print(comment_count)
        # Write one spreadsheet row per product, then move to the next row
        outws.cell(row=count, column=1, value=str(count-1))
        outws.cell(row=count, column=2, value=str(title))
        outws.cell(row=count, column=3, value=str(price))
        outws.cell(row=count, column=4, value=str(comment_count))
        count = count + 1
        #print("-----")
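Loop through each page
def getpage():
    page = 1
    s = 1
    for _ in range(1, 6):  # 5 result pages
        print("page="+str(page)+",s="+str(s))
        # Query string in the same form as the commented-out URL in getlist; the enc and pvid parameter names are assumed
        url = "https://search.jd.com/Search?keyword=手机&enc=utf-8&wq=手机&pvid=56b2bc7c47db4861986201bb72c1b281&page="+str(page)+"&s="+str(s)+"&click=1"
        getlist(url)
        page = page+2
        s = s+60

getpage()
# Save the collected rows; the file name is a placeholder
outwb.save("jd_phones.xlsx")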
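The paging loop advances `page` by 2 and `s` by 60 on every iteration, which matches the commented-out example URL in `getlist` (`page=9`, `s=241`). The sketch below makes that arithmetic explicit and builds the query string with the standard library. It is a minimal sketch, not the article's code: `page_params` and `search_url` are hypothetical helper names, and only parameters that appear in the article's URLs are used.

```python
from urllib.parse import urlencode


def page_params(n_pages):
    """Yield (page, s) pairs: page runs 1, 3, 5, ... and s runs 1, 61, 121, ..."""
    for k in range(n_pages):
        yield 2 * k + 1, 60 * k + 1


def search_url(keyword, page, s):
    """Build a search URL; urlencode percent-encodes the non-ASCII keyword."""
    query = urlencode({"keyword": keyword, "wq": keyword, "page": page, "s": s, "click": 1})
    return "https://search.jd.com/Search?" + query


for page, s in page_params(5):
    print(page, s, search_url("手机", page, s))
```

Building the query from a dict avoids the hand-concatenated string, where a single missing `&` or `=` silently changes which parameter a value belongs to.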
Result display