Crawling and Analyzing Taobao Product Information with Python

Note: this article is for learning and exchange only, not for any illegal purpose!!!

Background

A classmate asked me: "XXX, is there any way to gather information on Taobao products? I'd like to run some statistics on them." And so, having nothing better to do, I started pondering the problem...

First, simulated login

Full of excitement, I opened Taobao, ready to search away. I typed "graphics card" into the search box and lightly tapped the Enter key (watch me~).
Feeling good, I waited for a page full of product information to come back. Instead, after all that waiting, what I got was a 302, and I found myself on the login screen.
So that's how things stood...
A little checking confirmed it: as Taobao keeps strengthening its anti-scraping measures, many readers will have noticed that its search function now requires the user to be logged in!
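
If you want to see the redirect for yourself, a minimal check with requests looks like this (a sketch; it assumes Taobao's search endpoint still behaves as it did when this was written):

import requests

res = requests.get('https://s.taobao.com/search?q=显卡', allow_redirects=False)
print(res.status_code)               # 302 when not logged in
print(res.headers.get('Location'))   # points at the login page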

As for simulating a Taobao login, it has already been done successfully with pure requests (interested readers can look up write-ups on logging in to Taobao with requests).
That method starts by analyzing the various requests in Taobao's login flow and then generating the corresponding parameters to simulate it, which is relatively difficult. So I decided to take a different route: Selenium plus a QR code:

import threading
import time

import requests
from PIL import Image
from selenium import webdriver

# Open and display an image
def Openimg(img_location):
    img = Image.open(img_location)
    img.show()

# Log in and collect the cookies
def Login():
    driver = webdriver.PhantomJS()  # headless browser (note: newer Selenium versions dropped PhantomJS)
    driver.get('https://login.taobao.com/member/login.jhtml')
    try:
        driver.find_element_by_id("J_Static2Quick").click()  # switch to QR-code login mode
    except Exception:
        pass
    time.sleep(3)
    code_element = driver.find_element_by_xpath('//*[@id="J_QRCodeImg"]/img')
    code_url = code_element.get_attribute('src')
    time.sleep(2)
    with open('./login.png','wb') as f:
        f.write(requests.get(code_url).content)
    # Show the QR code in a separate thread so the cookie polling below can run
    t = threading.Thread(target=Openimg,args=('./login.png',))
    t.start()
    print("Logging in... please scan the QR code!\n")
    while True:
        c = driver.get_cookies()
        if len(c) > 20:   # a successful login sets a large batch of cookies
            cookies = {}
            for i in range(len(c)):
                cookies[c[i]['name']] = c[i]['value']
            driver.close()
            print("Logged in successfully!\n")
            return cookies
        time.sleep(1)

We open the Taobao login page with webdriver, save the QR code locally, and wait for the user to scan it (the relevant elements are easy to find with the browser's F12 element inspector). Once the scan succeeds, we pull the cookies out of webdriver as a dict and return them. (They are what the subsequent crawling requests will use.)

Second, crawling product information

With the cookies in hand, we can crawl the product information.
(Here I come~)

1. Define the relevant parameters

Define the request URL, the request headers, and so on:

# Define the parameters
headers = {'Host':'s.taobao.com',
           'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
           'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
           'Accept-Encoding':'gzip, deflate, br',
           'Connection':'keep-alive'}
list_url = 'http://s.taobao.com/search?q=%(key)s&ie=utf8&s=%(page)d'  # s = item offset (44 items per page)
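
The %(key)s and %(page)d placeholders are filled from a dict; for example (the offset arithmetic mirrors the crawl loop below):

url = list_url % {'key': '显卡', 'page': 44}  # results page 2, since Taobao paginates in steps of 44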

2. Analyze the page and define regular expressions

The request returns an HTML page, so the data we want has to be extracted from it; here I chose regular expressions. Looking at the page source, the product fields appear as JSON-style "key":"value" pairs.
Out of laziness I only marked two of them while inspecting, but the others follow the same shape, which gives the following patterns:

# Regex patterns
p_title = '"raw_title":"(.*?)"'       # title
p_location = '"item_loc":"(.*?)"'     # seller location
p_sale = '"view_sales":"(.*?)人付款"'  # sales volume ('人付款' = 'people paid')
p_comment = '"comment_count":"(.*?)"' # comment count
p_price = '"view_price":"(.*?)"'      # price
p_nid = '"nid":"(.*?)"'               # unique product ID
p_img = '"pic_url":"(.*?)"'           # image URL

(P.S. The sharp-eyed will have noticed that the product information actually sits inside a variable called g_page_config, so we could instead extract that variable (a dictionary) first and then read the data out of it, as sketched below!)
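
A minimal sketch of that route (it assumes the page still embeds the variable on one line as g_page_config = {...}; and that the items sit under mods -> itemlist -> data -> auctions, as the page source showed at the time; treat the exact keys as assumptions):

# Parse g_page_config instead of matching field by field
import json
import re

# html: the page source, fetched with the same headers/cookies as in the next section
match = re.search(r'g_page_config = (.*?);\n', html)
if match:
    config = json.loads(match.group(1))
    auctions = config['mods']['itemlist']['data']['auctions']  # one dict per product
    for item in auctions:
        print(item['raw_title'], item['view_price'], item['item_loc'])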

3. Data crawling

Everything is ready; all that's missing is the east wind. So, let the east wind blow:

# Data crawling (re and pandas join the imports from the login section)
import re
import pandas as pd

key = input('Enter a keyword: ') # product keyword
N = 20                           # number of pages to crawl
data = []
cookies = Login()
for i in range(N):
    try:
        page = i*44              # Taobao paginates by item offset, 44 per page
        url = list_url%{'key':key,'page':page}
        res = requests.get(url,headers=headers,cookies=cookies)
        html = res.text
        title = re.findall(p_title,html)
        location = re.findall(p_location,html)
        sale = re.findall(p_sale,html)
        comment = re.findall(p_comment,html)
        price = re.findall(p_price,html)
        nid = re.findall(p_nid,html)
        img = re.findall(p_img,html)
        for j in range(len(title)):
            data.append([title[j],location[j],sale[j],comment[j],price[j],nid[j],img[j]])
        print('-------Page %s complete!--------\n\n'%(i+1))
        time.sleep(3)            # pause between pages to be gentle on the site
    except Exception:
        pass
data = pd.DataFrame(data,columns=['title','location','sale','comment','price','nid','img'])
data.to_csv('%s.csv'%key,encoding='utf-8',index=False)

The code above crawls 20 pages of product information and saves them to a local csv file; the result looks like this:
[Screenshot: the resulting csv file]

Third, simple data analysis

Having gone to the trouble of collecting the data, it would be a waste to let it sit idle; and as a good socialist youth, how could I allow such a thing? So let's take a simple look at the data:
(Of course, with this small a sample, the results are for entertainment only.)

1. Import the libraries

# Import the relevant libraries
import jieba
import operator
import pandas as pd
from wordcloud import WordCloud
from matplotlib import pyplot as plt

All of these libraries install in the usual way (basically, pip solves it: pip install jieba wordcloud pandas matplotlib).

2. Chinese character display

# Make matplotlib display Chinese characters
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']

Without these settings, the Chinese text in your plots may well come out garbled~

3. Read data

# Read the data ('显卡' means "graphics card", the keyword used earlier)
key = '显卡'
data = pd.read_csv('%s.csv'%key,encoding='utf-8',engine='python')

4. Analyze the price distribution

# Price distribution
plt.figure(figsize=(16,9))
plt.hist(data['price'],bins=20,alpha=0.6)
plt.title('价格频率分布直方图')  # "price frequency distribution histogram"
plt.xlabel('价格')              # "price"
plt.ylabel('频数')              # "frequency"
plt.savefig('价格分布.png')     # saved as "price distribution.png"

[Figure: price frequency distribution histogram]

5. Analyze the seller-location distribution

# Seller-location distribution
group_data = list(data.groupby('location'))
loc_num = {}
for i in range(len(group_data)):
    loc_num[group_data[i][0]] = len(group_data[i][1])  # item count per location
plt.figure(figsize=(19,9))
plt.title('销售地')  # "seller location"
plt.scatter(list(loc_num.keys())[:20],list(loc_num.values())[:20],color='r')
plt.plot(list(loc_num.keys())[:20],list(loc_num.values())[:20])
plt.savefig('销售地.png')
sorted_loc_num = sorted(loc_num.items(), key=operator.itemgetter(1),reverse=True)  # sort by count
loc_num_10 = sorted_loc_num[:10]  # take the top 10
loc_10 = []
num_10 = []
for i in range(10):
    loc_10.append(loc_num_10[i][0])
    num_10.append(loc_num_10[i][1])
plt.figure(figsize=(16,9))
plt.title('销售地TOP10')  # "top 10 seller locations"
plt.bar(loc_10,num_10,facecolor = 'lightskyblue',edgecolor = 'white')
plt.savefig('销售地TOP10.png')

[Figure: seller-location distribution]
[Figure: top 10 seller locations]

6. Build a word cloud

# Build the word cloud
content = ''
for i in range(len(data)):
    content += data['title'][i]           # concatenate every product title
wl = jieba.cut(content,cut_all=True)      # full-mode Chinese word segmentation
wl_space_split = ' '.join(wl)
wc = WordCloud('simhei.ttf',              # font path (needed to render Chinese glyphs)
               background_color='white',
               width=1000,
               height=600,).generate(wl_space_split)
wc.to_file('%s.png'%key)

[Figure: word cloud of Taobao "graphics card" product titles]

A final word

Thank you for reading so patiently~
