Crawler in practice: using Requests + BeautifulSoup to crawl JD (Jingdong) underwear listings and export them to a spreadsheet (Python)

Preparation

If we want to save all the information about JD's underwear products locally, copying and pasting by hand would be an enormous job. A Python crawler can do it for us.
Step 1: Analyze the web address

Starting page address

https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page=1&s=56&click=1

(Notice that the browser's address bar shows the Chinese keyword, but as soon as you copy the URL into a notepad or into code it turns into the percent-encoded form shown above.)
Many sites encode the GET parameters or keywords in their URLs, so the copied URL looks different from what the browser displays. The copied URL still opens normally, so this causes no problem in this example.
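If you are curious what the encoded keyword actually is, you can decode it with Python's standard library. A minimal sketch (urllib.parse is not used in the crawler itself) showing that the encoded text is just the UTF-8 percent-encoding of the Chinese search term:

from urllib.parse import quote, unquote

encoded = "%E5%86%85%E8%A1%A3%E5%A5%B3"
print(unquote(encoded))   # 内衣女 -- the keyword typed into the search box
print(quote("内衣女"))    # %E5%86%85%E8%A1%A3%E5%A5%B3 -- what appears in the copied URL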

So how do we crawl pages beyond the first one automatically? Open the third page and compare its URL with the first page's: the only difference is the trailing page parameter, &page=1 on the first page versus &page=3 on the third.
We can therefore fetch multiple pages with a for loop that increments page on each iteration, as sketched after the URL below.

The third page's URL looks like this:

https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page=3&s=56&click=1
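The loop idea is simple enough to show on its own before the full crawler. A minimal sketch that only builds and prints the per-page URLs, using the same base URL as above:

base = ("https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3"
        "&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page={}&s=56&click=1")
for page in range(1, 4):        # pages 1, 2 and 3
    url = base.format(page)
    print(url)                  # the URLs differ only in the page parameter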

Step 2: Analyze the page structure

First select the products: inside the product list, each li tag is one product.

Then, within each li, select the specific pieces of information we want (name, price, shop).

Take your time studying the page source to find the right CSS classes; the sketch below shows how they fit together.
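To make the selectors concrete, here is a minimal, self-contained sketch. The sample markup is a simplified reconstruction based on the CSS classes used in the code below (the real JD page has far more attributes and nesting), so treat it as an illustration rather than the actual source:

from bs4 import BeautifulSoup

sample = """
<div id="J_goodsList">
  <ul>
    <li class="gl-item">
      <div class="p-price"><strong><i>99.00</i></strong></div>
      <div class="p-name p-name-type-2"><a><em>sample product name</em></a></div>
      <div class="p-shop"><span><a>sample shop</a></span></div>
    </li>
  </ul>
</div>
"""
soup = BeautifulSoup(sample, 'lxml')
li = soup.select('#J_goodsList ul li')[0]                    # one li tag = one product
print(li.select('.p-name.p-name-type-2 em')[0].get_text())   # name
print(li.select('.p-price i')[0].get_text())                 # price
print(li.select('.p-shop a')[0].get_text())                  # shop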

Step 3: Code

from bs4 import BeautifulSoup
import numpy as np
import requests
from requests.exceptions import RequestException
import pandas as pd

# Fetch one page of search results and return its HTML as text
def crawl(url, page):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"}
        html1 = requests.request("GET", url, headers=headers, timeout=10)
        html1.encoding = 'utf-8'  # set the encoding explicitly; the raw response body is bytes
        html = html1.text
        return html
    except RequestException:  # network or HTTP problems
        print('read error')
        return None


# Parse one page and append the extracted data to the CSV file
def parse_page(url, page):
    html = crawl(url, page)
    if html is None:
        print('parse error')
        return
    soup = BeautifulSoup(html, 'lxml')
    # ---first select the products: one li tag per product---
    goods = soup.select('#J_goodsList ul li')
    rows = []
    for li in goods:
        # ---name---
        name = [i.get_text() for i in li.select('.p-name.p-name-type-2 em')]
        # ---price---
        price = [i.get_text() for i in li.select('.p-price i')]
        # ---shop---
        shop = [i.get_text() for i in li.select('.p-shop a')]
        if len(name) != 0 and len(price) != 0 and len(shop) != 0:
            # print('name: {0}, price: {1}, shop: {2}'.format(name[0], price[0], shop[0]))
            rows.append([name[0], price[0], shop[0]])
    if rows:
        information = pd.DataFrame(np.array(rows), columns=['名称', '价格', '店铺'])
        # mode='a+' appends to the file; write the header row only for the first page
        information.to_csv('京东文胸数据1.csv', mode='a+', index=False, header=(page == 1))


for i in range(1, 10):  # loop over search result pages 1 to 9
    url = "https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page=" + str(i) + "&s=56&click=1"
    parse_page(url, i)
    print('page {0} read successfully'.format(i))
print('done')

In this example I only extracted the product name, price, and shop name. You can pull out more fields in exactly the same way.
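As a quick sanity check on the result, you can read the CSV back with pandas. A small sketch, assuming the file name and column names used in the code above:

import pandas as pd

df = pd.read_csv('京东文胸数据1.csv')
df['价格'] = pd.to_numeric(df['价格'], errors='coerce')  # convert the price column to numbers
print(len(df), 'products collected')
print('average price:', df['价格'].mean())
print(df.sort_values('价格').head())                     # the five cheapest items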

Source: blog.csdn.net/kobeyu652453/article/details/113455232