What kind of cat is the most popular? Python crawls cat website transaction data

The following article comes from practicing Python

Author: Ye Tingyun

I. Introduction

When I see cute cat emoticons, I can’t help but collect them. Some pictures are as follows:

 

Some friends of mine also keep cats, such as orange cats, British Shorthairs, and Garfield cats. Watching them post their cats on WeChat Moments makes me envious; the cats are really cute. I found a website that specializes in cat trading, maomijiaoyi.com, where you can admire cats online: http://www.maomijiaoyi.com/

 

From this website, we crawled the cat breed introductions and more than 200,000 cat transaction records to learn about these cute cats.

II. Data acquisition

Open the cat trading website and crawl the cat breed data first. The breed page shows a list of cat breeds:

 

The list shows only the breed name and reference price of each cat. Clicking through to the details page reveals more data: breed name, reference price, Chinese scientific name, basic information, personality characteristics, living habits, advantages and disadvantages, feeding methods, and so on.

 

Inspecting the page shows that its structure is simple, so the data is easy to parse and extract. The crawler code is as follows:

import requests
import re
import csv
from lxml import etree
from tqdm import tqdm
from fake_useragent import UserAgent

# Generate random request headers
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

def random_ua():        # Build headers with a randomly chosen User-Agent
    headers = {
        "Accept-Encoding": "gzip",
        "Accept-Language": "zh-CN",
        "Connection": "keep-alive",
        "Host": "www.maomijiaoyi.com",
        "User-Agent": ua.random
    }
    return headers


def create_csv():          # Create the csv file that stores the data
    with open('./data/cat_kind.csv', 'w', newline='', encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(['品种', '参考价格', '中文学名', '别名', '祖先', '分布区域',
                     '原产地', '体型', '原始用途', '今日用途', '分组', '身高',
                     '体重', '寿命', '整体', '毛发', '颜色', '头部', '眼睛',
                     '耳朵', '鼻子', '尾巴', '胸部', '颈部', '前驱', '后驱',
                     '基本信息', 'FCI标准', '性格特点', '生活习性', '优点/缺点',
                     '喂养方法', '鉴别挑选'])


def scrape_page(url1):      # Fetch a page and return its HTML source as text
    response = requests.get(url1, headers=random_ua())
    # print(response.status_code)
    response.encoding = 'utf-8'
    return response.text


def get_cat_urls(html1):    # Collect the detail-page URL of every cat breed
    dom = etree.HTML(html1)
    lis = dom.xpath('//div[@class="pinzhong_left"]/a')
    cat_urls = []
    for li in lis:
        cat_url = li.xpath('./@href')[0]
        cat_url = 'http://www.maomijiaoyi.com' + cat_url
        cat_urls.append(cat_url)
    return cat_urls


def get_info(html2):    # Extract the information on one breed's detail page
    # Breed
    kind = re.findall('div class="line1">.*?<div class="name">(.*?)<span>', html2, re.S)[0]
    kind = kind.replace('\r','').replace('\n','').replace('\t','')
    # Reference price
    price = re.findall('<div>参考价格:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    price = price.replace('\r', '').replace('\n', '').replace('\t', '')
    # Chinese scientific name
    chinese_name = re.findall('<div>中文学名:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    chinese_name = chinese_name.replace('\r', '').replace('\n', '').replace('\t', '')
    # Alias
    other_name = re.findall('<div>别名:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    other_name = other_name.replace('\r', '').replace('\n', '').replace('\t', '')
    # Ancestor
    ancestor = re.findall('<div>祖先:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    ancestor = ancestor.replace('\r', '').replace('\n', '').replace('\t', '')
    # Distribution area
    area = re.findall('<div>分布区域:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    area = area.replace('\r', '').replace('\n', '').replace('\t', '')
    # Place of origin
    source_area = re.findall('<div>原产地:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    source_area = source_area.replace('\r', '').replace('\n', '').replace('\t', '')
    # Body size
    body_size = re.findall('<div>体型:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    body_size = body_size.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Original use
    source_use = re.findall('<div>原始用途:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    source_use = source_use.replace('\r', '').replace('\n', '').replace('\t', '')
    # Present-day use
    today_use = re.findall('<div>今日用途:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    today_use = today_use.replace('\r', '').replace('\n', '').replace('\t', '')
    # Group
    group = re.findall('<div>分组:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    group = group.replace('\r', '').replace('\n', '').replace('\t', '')
    # Height
    height = re.findall('<div>身高:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    height = height.replace('\r', '').replace('\n', '').replace('\t', '')
    # Weight
    weight = re.findall('<div>体重:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    weight = weight.replace('\r', '').replace('\n', '').replace('\t', '')
    # Lifespan
    lifetime = re.findall('<div>寿命:</div>.*?<div>(.*?)</div>', html2, re.S)[0]
    lifetime = lifetime.replace('\r', '').replace('\n', '').replace('\t', '')
    # Overall appearance
    entirety = re.findall('<div>整体</div>.*?<!-- 页面小折角 -->.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    entirety = entirety.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Coat
    hair = re.findall('<div>毛发</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    hair = hair.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Color
    color = re.findall('<div>颜色</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    color = color.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Head
    head = re.findall('<div>头部</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    head = head.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Eyes
    eye = re.findall('<div>眼睛</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    eye = eye.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Ears
    ear = re.findall('<div>耳朵</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    ear = ear.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Nose
    nose = re.findall('<div>鼻子</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    nose = nose.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Tail
    tail = re.findall('<div>尾巴</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    tail = tail.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Chest
    chest = re.findall('<div>胸部</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    chest = chest.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Neck
    neck = re.findall('<div>颈部</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    neck = neck.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Forequarters
    font_foot = re.findall('<div>前驱</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    font_foot = font_foot.replace('\r', '').replace('\n', '').replace('\t', '').strip()
    # Hindquarters (the pattern must match 后驱, not 前驱, or this field duplicates the previous one)
    rear_foot = re.findall('<div>后驱</div>.*?<div></div>.*?<div>(.*?)</div>', html2, re.S)[0]
    rear_foot = rear_foot.replace('\r', '').replace('\n', '').replace('\t', '').strip()

    # Collect the fields extracted above into one row
    cat = [kind, price, chinese_name, other_name, ancestor, area, source_area,
           body_size, source_use, today_use, group, height, weight, lifetime,
           entirety, hair, color, head, eye, ear, nose, tail, chest, neck, font_foot, rear_foot]

    # Extract the tab sections (basic information, FCI standard, personality, living habits, pros and cons, feeding methods, selection tips)
    html2 = etree.HTML(html2)
    labs = html2.xpath('//div[@class="property_list"]/div')
    for lab in labs:
        text1 = lab.xpath('string(.)')
        text1 = text1.replace('\n','').replace('\t','').replace('\r','').replace(' ','')
        cat.append(text1)
    return cat


def write_to_csv(data):     # Save one row of data by appending it to the csv
    with open('./data/cat_kind.csv', 'a+', newline='', encoding='utf-8') as fn:
        wr = csv.writer(fn)
        wr.writerow(data)


if __name__ == '__main__':
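    # Create the csv file that stores the data
    create_csv()
    # URL of the cat breed list page
    base_url = 'http://www.maomijiaoyi.com/index.php?/pinzhongdaquan_5.html'
    # Collect all detail-page URLs from the breed list page
    html = scrape_page(base_url)
    urls = get_cat_urls(html)
    # Show progress with a progress bar instead of printing
    pbar = tqdm(urls)
    # Start crawling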
    for url in pbar:
        text = scrape_page(url)
        info = get_info(text)
        write_to_csv(info)
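
The field-by-field extraction in get_info repeats the same findall-then-replace pattern for every label. A minimal sketch of how that pattern could be folded into one helper is shown below. The name `extract_field` and the sample HTML fragment are hypothetical and not part of the original crawler. Two behaviors differ from the original: the helper always strips surrounding spaces, and it returns an empty string when a label is missing instead of raising IndexError.

```python
import re


def extract_field(html, label):
    """Return the text of the <div> that follows '<div>label</div>'.

    Hypothetical helper, not part of the original crawler. It returns ''
    when the label is absent, whereas indexing findall(...)[0] would
    raise IndexError.
    """
    pattern = '<div>' + re.escape(label) + '</div>.*?<div>(.*?)</div>'
    match = re.search(pattern, html, re.S)
    if match is None:
        return ''
    # Drop line breaks and tabs, then trim surrounding spaces.
    return re.sub(r'[\r\n\t]', '', match.group(1)).strip()


# Made-up fragment that imitates the detail-page layout assumed above.
sample_html = '''
<div>体型:</div>
<div>
    中型
</div>
<div>寿命:</div>
<div>10-15年</div>
'''

size = extract_field(sample_html, '体型:')
lifetime = extract_field(sample_html, '寿命:')
missing = extract_field(sample_html, '祖先:')
```

Returning an empty string keeps the crawl going when one breed page lacks a field, at the cost of silently writing a blank cell.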

Running the breed crawler produces the following:

 

With the cat breed data crawled and saved to csv, the next step is the cat transaction data. Open the page where cats are bought and sold:

 

To get more detailed data, you need to open each details page, which includes seller information, cat breed, cat age, price, title, number of cats for sale, vaccination status, and other information.

 

Because the data set is large, the crawl can be split into two stages. First collect the URLs of all transaction detail pages from each listing page and save them to csv; then read the URLs back from the csv, request each one, and extract the transaction data. The approach is similar to the breed crawler above. To speed up crawling, you can use a multi-threaded or asynchronous crawler. In the end, more than 200,000 records were obtained.
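
The second stage of that approach (read the saved URLs, then fetch them with several threads) might look like the sketch below. It uses the standard library's concurrent.futures. The names `read_urls`, `crawl_all`, and `fake_fetch`, the one-URL-per-row csv layout, and the example.com addresses are all assumptions for illustration; in a real run, `fetch` would be a function such as scrape_page from the breed crawler.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def read_urls(csv_file):
    """Read one URL per row from an open csv file (assumed layout)."""
    return [row[0] for row in csv.reader(csv_file) if row]


def crawl_all(urls, fetch, max_workers=8):
    """Fetch every URL with a thread pool and return {url: result}.

    A failed request is stored as None so that one bad page does not
    stop the whole run.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs.
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    if url.endswith('_2.html'):
        raise ConnectionError('simulated network failure')
    return '<html>' + url + '</html>'


# In-memory stand-in for the csv of detail-page URLs saved in stage one.
url_csv = io.StringIO(
    'http://example.com/detail_1.html\n'
    'http://example.com/detail_2.html\n'
    'http://example.com/detail_3.html\n'
)
pages = crawl_all(read_urls(url_csv), fake_fetch, max_workers=3)
```

Keeping the URL list in a csv between the two stages means an interrupted run can resume from the saved URLs rather than re-crawling the listing pages.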

III. Data exploration

A word cloud shows at a glance which breeds these cute cats belong to.

 

Next, look at the body-size distribution across the cat breeds.

 

Among all the breeds, there is only one large cat, the Ragdoll. The other breeds are small or medium-sized. So if you see a larger cat, the Ragdoll is a good first guess.
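
A size distribution like this can be tallied from the breed csv with the standard library alone. The sketch below is illustrative only: the rows are made up (品种A and 品种B are placeholder breeds), the size labels 大型/中型/小型 are assumed values of the 体型 column, and only two of the csv columns are shown.

```python
import csv
import io
from collections import Counter

# Made-up rows in the layout of cat_kind.csv (only two columns shown).
sample_csv = io.StringIO(
    '品种,体型\n'
    '布偶猫,大型\n'
    '橘猫,小型\n'
    '品种A,中型\n'
    '品种B,中型\n'
)

rows = list(csv.DictReader(sample_csv))

# Number of breeds in each body-size category.
size_counts = Counter(row['体型'] for row in rows)

# Breeds labeled as large.
large_breeds = [row['品种'] for row in rows if row['体型'] == '大型']
```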

 

Orange cats are found all over the world, as you would expect of the mighty orange cat. As the saying goes, "of ten orange cats, nine are fat, and the tenth is heavy enough to collapse the kang (a brick bed)". Orange cats like to eat more than other cats; their good appetite helps them survive, which may be why they are everywhere. Yet the orange cat is classed as a small cat. As a kitten it looks small, tender, and cute, but once it grows up you really understand what "the weight of orange" means.

 

Now let's look at the transaction data. Which breeds have the most listings?

 

Orange cats have the largest number of transactions. As mentioned earlier, orange cats are found all over the world, and here we can also see that they are the most numerous. They are followed by coffee cats, Ragdolls, blue-and-white British Shorthairs, and so on.

 

Maine Coons and Ragdolls have among the highest average prices, while orange cats have the second-lowest average price, which seems quite affordable.
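
Per-breed listing counts and average prices of this kind can be computed with a pandas groupby. The sketch below is one possible way to do it, not necessarily how the original charts were produced. The rows and prices are made up, and the breed column name 品种 is an assumption about the transaction table; 价格 is the price column used elsewhere in this article.

```python
import pandas as pd

# Made-up transactions; the breed column name '品种' is assumed.
df = pd.DataFrame({
    '品种': ['橘猫', '橘猫', '布偶猫', '缅因猫', '橘猫', '布偶猫'],
    '价格': [300, 500, 2500, 3000, 400, 2000],
})

# One row per breed: number of listings and mean price,
# sorted so the most frequently listed breed comes first.
summary = (
    df.groupby('品种')['价格']
      .agg(listings='count', mean_price='mean')
      .sort_values('listings', ascending=False)
)
print(summary)
```

Sorting the same table by `mean_price` instead gives the price ranking.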

How old are these cats for sale?

 

The cats for sale are mainly 1 to 6 months old, that is, kittens less than half a year old. Kittens at this age should be very cute, waiting for their destined owners to take them home.

Finally, take a look at the most expensive cats and the most viewed cats on the website:

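import pandas as pd

# Load the cleaned transaction data
df = pd.read_excel('处理后数据.xlsx')
df.info()   # info() prints its summary itself and returns None
# Top 3 listings by view count ('浏览次数')
df1 = df.sort_values(by='浏览次数', ascending=False)
print(df1.iloc[:3, :].values)
print('----------------------------------------------------------')
# Top 3 listings by price ('价格')
df2 = df.sort_values(by='价格', ascending=False)
print(df2.iloc[:3, :].values)
# Most viewed listings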
http://www.maomijiaoyi.com/index.php?/chanpinxiangqing_441879.html
http://www.maomijiaoyi.com/index.php?/chanpinxiangqing_462431.html
http://www.maomijiaoyi.com/index.php?/chanpinxiangqing_455366.html

 

The most viewed listing is this Maine Coon, with 16,164 views. Emmm, I feel that this cat looks quite fierce, not very cute.

 

 

By contrast, the listings ranked second and third by views are much cheaper, the cats have already had three vaccine shots, more cats are available for sale, and they look more likeable than the first one (personal opinion).

# The most expensive listings are as follows
http://www.maomijiaoyi.com/index.php?/chanpinxiangqing_265770.html
http://www.maomijiaoyi.com/index.php?/chanpinxiangqing_281910.html
http://www.maomijiaoyi.com/index.php?/chanpinxiangqing_230417.html

 

 

 

The most expensive listing found is a 3,000 yuan Ragdoll. Further reading shows that Ragdolls and other large cats are not only expensive to buy but also relatively costly to raise: they eat and exercise more, and related expenses such as grooming are also higher.

Origin blog.csdn.net/m0_48405781/article/details/113730745