Python crawler in practice: crawling mobile phone product data with the requests and openpyxl modules (source code included)

Foreword

Today I will show how to crawl mobile phone product data with Python. The code is included for anyone who needs it, along with a few tips.

First, before crawling, make the crawler look as much like a browser as possible so that it is not recognized as a crawler. The most basic step is to add request headers. Because many people crawl this kind of plain-text data, we should also consider rotating proxy IPs and randomly changing the request headers while crawling the mobile phone data.
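
The idea of rotating proxy IPs and request headers can be sketched roughly as follows. This is only a minimal illustration, not the code used later in this article: `USER_AGENTS`, `PROXIES`, and `build_request_kwargs` are hypothetical names, and the proxy addresses are placeholders that you would replace with proxies you are allowed to use.

```python
import random

# Hypothetical pools: fill these with real browser user-agent strings and
# with proxies you are actually allowed to use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15",
]
PROXIES = [
    "http://127.0.0.1:8001",  # placeholder address
    "http://127.0.0.1:8002",  # placeholder address
]


def build_request_kwargs(rng=random):
    """Return keyword arguments for requests.get with a random identity."""
    proxy = rng.choice(PROXIES)
    return {
        "headers": {"user-agent": rng.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


kwargs = build_request_kwargs()
print(kwargs["headers"]["user-agent"])
# Usage (not executed here): requests.get(url, **kwargs)
```

Picking a new user-agent and proxy for each request keeps the pattern of requests less uniform; whether that is enough depends on the target site.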

Before writing any crawler code, the first and most important step is to analyze the target web pages.

During this analysis we found that crawling is relatively slow, so we can also speed up the crawler by disabling images, JavaScript, and so on in the Google Chrome browser.


Development tools

Python version: 3.6

Related modules:

requests module

json module

lxml module

openpyxl

Environment setup

Install Python and add it to the PATH environment variable, then use pip to install the required third-party modules (requests, lxml, and openpyxl); json is part of the standard library.

The complete code and Excel file in the article can be obtained by commenting and leaving a message

Idea analysis

Open the page we want to crawl in the browser.
Press F12 to open the developer tools and find where the mobile phone product data we want is located.
Here, what we need is the page data itself.

Source code structure

Code

Request headers to avoid anti-crawling measures

# Note: not all of these headers are required; keeping only user-agent is enough to crawl the data
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    # Replace with your own Cookie
    'cookie': 'your Cookie',
    # requests can only decode 'br' responses when the brotli package is installed
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'upgrade-insecure-requests': '1',
    'referer': 'https://www.jd.com/',
}

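### Get the number of product comments
```python
import json

import openpyxl
import requests
from lxml import etree

# Create the output workbook and insert the result sheet as the first sheet
outwb = openpyxl.Workbook()
outws = outwb.create_sheet(index=0)

# Header row
outws.cell(row=1, column=1, value="index")
outws.cell(row=1, column=2, value="title")
outws.cell(row=1, column=3, value="price")
outws.cell(row=1, column=4, value="CommentCount")

# Next row to write; row 1 holds the header
count = 2
```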

Get the number of comments based on the product id

def commentcount(product_id):
    # The summary endpoint returns JSONP: jQuery8827474({...});
    url = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds="+str(product_id)+"&callback=jQuery8827474&_=1615298058081"
    res = requests.get(url, headers=headers)
    res.encoding = 'gbk'
    # Strip the JSONP wrapper so that only the JSON body remains
    text = res.text.replace("jQuery8827474(", "").replace(");", "")
    data = json.loads(text)
    comment_count = data['CommentsCount'][0]['CommentCountStr']

    comment_count = comment_count.replace("+", "")
    # Convert counts given in units of 万 (ten thousand), e.g. "2万" -> "20000"
    if "万" in comment_count:
        comment_count = comment_count.replace("万", "")
        # float() also handles fractional values such as "1.5"
        comment_count = str(round(float(comment_count) * 10000))

    return comment_count
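
The two parsing steps in commentcount (removing the JSONP wrapper and converting the count string) can be tried without any network access. The sketch below is an illustration under assumptions: `strip_jsonp` and `parse_count` are hypothetical helper names, and the sample payload is made up to match only the fields that the code above reads.

```python
import json
import re


def strip_jsonp(payload):
    """Return the JSON text inside a JSONP wrapper such as callback({...});"""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", payload, re.S)
    return match.group(1) if match else payload


def parse_count(count_str):
    """Turn strings like '300+', '2万+' or '1.5万+' into an int."""
    cleaned = count_str.replace("+", "")
    if "万" in cleaned:
        return round(float(cleaned.replace("万", "")) * 10000)
    return int(cleaned)


# Made-up sample response, shaped like the fields read by commentcount
sample = 'jQuery8827474({"CommentsCount":[{"CommentCountStr":"1.5万+"}]});'
data = json.loads(strip_jsonp(sample))
print(parse_count(data["CommentsCount"][0]["CommentCountStr"]))
```

Matching the wrapper with a regular expression avoids hard-coding the callback name, so the helper keeps working if the `callback` parameter in the URL changes.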

Get product data for each page

def getlist(url):
    global count
    #url="https://search.jd.com/search?keyword=笔记本&wq=笔记本&ev=exbrand_联想%5E&page=9&s=241&click=1"
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    text = res.text

    # Each <li> under #J_goodsList is one product
    selector = etree.HTML(text)
    items = selector.xpath('//*[@id="J_goodsList"]/ul/li')

    for item in items:
        title = item.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()')[0]
        price = item.xpath('.//div[@class="p-price"]/strong/i/text()')[0]
        # The comment link id looks like "J_comment_<product id>"
        product_id = item.xpath('.//div[@class="p-commit"]/strong/a/@id')[0].replace("J_comment_", "")

        comment_count = commentcount(product_id)
        #print(title)
        #print(price)
        #print(comment_count)

        # Write one row per product; count - 1 is the running index
        outws.cell(row=count, column=1, value=str(count-1))
        outws.cell(row=count, column=2, value=str(title))
        outws.cell(row=count, column=3, value=str(price))
        outws.cell(row=count, column=4, value=str(comment_count))

        count = count + 1
        #print("-----")
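
One thing to watch in getlist: indexing an XPath result with `[0]` raises an IndexError whenever a list item lacks the expected node. A small guard can be sketched as below; `first` and `product_row` are hypothetical helpers, and plain lists with made-up values stand in for the results of the `xpath(...)` calls.

```python
def first(nodes, default=""):
    """Return the first XPath result, stripped, or a default when empty."""
    return str(nodes[0]).strip() if nodes else default


def product_row(index, title_nodes, price_nodes, comment_count):
    """Build one spreadsheet row from raw XPath results."""
    return [index, first(title_nodes, "N/A"), first(price_nodes, "N/A"), comment_count]


# Lists stand in for the results of item.xpath(...); the values are made up
print(product_row(1, ["  Phone A  "], ["1999.00"], "20000"))
print(product_row(2, [], [], "0"))
```

With a guard like this, one malformed item produces a row of placeholder values instead of stopping the whole crawl.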

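Loop through each page

def getpage():
    page = 1
    s = 1
    # Crawl 5 result pages; page advances by 2 and s by 60 on each iteration
    for i in range(1, 6):
        print("page="+str(page)+",s="+str(s))
        url = "https://search.jd.com/Search?keyword=手机&enc=utf-8&wq=手机&pvid=56b2bc7c47db4861986201bb72c1b281&page="+str(page)+"&s="+str(s)+"&click=1"
        getlist(url)
        page = page+2
        s = s+60

getpage()
# Save the workbook; the file name here is only an example
outwb.save("phones.xlsx")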
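
The paging logic can also be checked on its own by generating the URLs without requesting them. This is a sketch under assumptions: `build_search_urls` is a hypothetical helper, the parameter names `keyword`, `wq`, `page`, `s`, and `click` are taken from the URLs shown in this article, and `enc` is assumed from the `utf-8` value in the search URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"


def build_search_urls(keyword, pages, page_step=2, s_step=60):
    """Return one search URL per result page, mirroring getpage()."""
    urls = []
    page, s = 1, 1
    for _ in range(pages):
        query = urlencode({
            "keyword": keyword,
            "enc": "utf-8",  # assumed parameter name
            "wq": keyword,
            "page": page,
            "s": s,
            "click": 1,
        })
        urls.append(BASE_URL + "?" + query)
        page += page_step
        s += s_step
    return urls


for url in build_search_urls("手机", 3):
    print(url)
```

Using `urlencode` percent-encodes the Chinese keyword and joins the parameters with `&`, which avoids the kind of mistake that is easy to make when concatenating a long query string by hand.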

Result display


Finally

To thank my readers, I would like to share some of the practical programming resources I have collected recently. I hope they help.

They include hands-on tutorials suitable for beginners.

Come and grow with Xiaoyu!

① More than 100 Python PDFs (covering most mainstream and classic books)

② Python standard library (the most complete Chinese version)

③ Crawler projects (forty or fifty interesting, classic practice projects with source code)

④ Videos on Python basics, crawlers, web development, and big data analysis (suitable for beginners)

⑤ Python Learning Roadmap (Farewell to Influential Learning)


Origin blog.csdn.net/Modeler_xiaoyu/article/details/128274399