Python crawler practice: scraping Taobao products and analyzing the data

Preface

Here is how it came about: a while ago a client approached me. He wanted to open a Taobao store selling small fish snacks and to analyze the products already on the market. The statistics could in principle be gathered by hand, since all the information is publicly visible, but doing it manually is tedious, so he asked me to help.


1. Project requirements

The specific requirements are as follows:

1. Search Taobao for "small fish snacks" and record the sales volume and price of every product in the first 10 pages of search results, then count how many products fall into each of the price ranges he defined. The price range table he gave me is as follows:

[Image: price range table. The ranges, as used in the tongji() function below, are <10, 10~30, 30~50, 50~70, 70~90, 90~110, 110~130, 130~150, 150~170 and 170~200 yuan.]

2. Across these 10 pages of results, where in the country are the merchants located?

3. Among the products on these 10 pages, what do users focus on most in their reviews?

4. From these search results, find the names and links of the 10 stores with the highest sales.

Judging from these requirements, none of them is particularly difficult to implement. Let's first take a look at what the finished project produces.

2. Effect preview

After collecting the data, I analyzed it and produced a bar chart; hovering the mouse over a bar shows the exact number of products in that range.

[Image: interactive bar chart of product counts per price range.]

Products priced between 10 and 30 yuan are the most numerous, and the counts fall off as the price rises. It seems most of these products are positioned at the low end of the market.
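For reference, here is a minimal sketch of how such an interactive bar chart can be rendered with the Bar class from pyecharts (the library the closing note of this article refers to). The counts below are placeholders, not the real crawled figures:

```python
from pyecharts import options as opts
from pyecharts.charts import Bar

ranges = ['<10', '10~30', '30~50', '50~70', '70~90',
          '90~110', '110~130', '130~150', '150~170', '170~200']
counts = [8, 120, 75, 40, 22, 10, 6, 4, 3, 2]  # placeholder values only

bar = (
    Bar()
    .add_xaxis(ranges)
    .add_yaxis('商品数量', counts)  # hovering a bar shows its exact count
    .set_global_opts(title_opts=opts.TitleOpts(title='价格区间商品数量'))
)
bar.render('价格区间商品数量.html')  # writes an interactive HTML chart
```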

Then let’s take a look at the distribution of merchants across the country:

[Image: map of merchant locations across China.]

Most merchants are clustered along the coast and around the middle and lower reaches of the Yangtze River, with the coastal areas being the densest.
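A sketch of the map side, again assuming pyecharts: it aggregates the 店铺位置 column of the crawled CSV by province (assuming the location field begins with the province name, e.g. '浙江 杭州') and feeds the counts to Geo:

```python
import csv
from collections import Counter

from pyecharts import options as opts
from pyecharts.charts import Geo

# Count merchants per province; 'utf-8-sig' strips the BOM the crawler writes.
counts = Counter()
with open('前十页销量和金额.csv', encoding='utf-8-sig', newline='') as f:
    for row in csv.DictReader(f):
        location = (row['店铺位置'] or '').strip()
        if location:
            counts[location.split()[0]] += 1  # keep the province part

geo = (
    Geo()
    .add_schema(maptype='china')
    .add('商家数量', list(counts.items()))
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(title_opts=opts.TitleOpts(title='商家地理分布'))
)
geo.render('商家地理分布.html')
```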

Next, let's look at what users say about these products in their reviews:

[Image: word cloud generated from the reviews of the top 10 products.]

The bigger the word, the more often it appears. Taste, packaging quality, product weight and shelf life are the aspects users mention most, so when presenting a product it pays to address these points explicitly: they are what most buyers care about.

Finally, here are the top 10 stores by sales, together with their links.

[Image: table of the top 10 stores by sales, with their links.]

After getting the data and doing the analysis, I kept wondering: if I were the one opening this store, what would I see in it? Perhaps an entry point on price, perhaps differentiating the product by its geographic origin, or perhaps building the marketing around the user.

The more I thought about it, the more directions there seemed to be. But I am an amateur when it comes to small fish snacks, so I will leave it there.

3. Crawler source code

The source code was originally split across several files and is quite long, so I won't explain it piece by piece here. If you already know web scraping, it will make sense after a couple of readings; if you don't, no amount of explanation will help until you have learned the basics of crawling.


```python
import csv
import os
import random  # needed by get_the_top_10() to pick a proxy
import time

import wordcloud
from selenium import webdriver
from selenium.webdriver.common.by import By
```
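One setup note before the functions: each one attaches Selenium to an already-running Chrome via options.add_experimental_option('debuggerAddress', '127.0.0.1:9222') instead of launching a fresh browser, which makes it easier to reuse a logged-in session. For that to work, Chrome must have been started with remote debugging enabled; a sketch of doing so (the paths are assumptions for a typical Windows install, adjust them to your machine):

```python
import subprocess

# Launch a local Chrome with remote debugging enabled so Selenium can
# attach to it at 127.0.0.1:9222. Both paths below are assumptions.
subprocess.Popen([
    r'C:\Program Files\Google\Chrome\Application\chrome.exe',
    '--remote-debugging-port=9222',
    r'--user-data-dir=C:\temp\chrome-debug-profile',
])
```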

```python
def tongji():
    """Count how many products fall into each price range (requirement 1)."""
    prices = []
    with open('前十页销量和金额.csv', 'r', encoding='utf-8', newline='') as f:
        fieldnames = ['价格', '销量', '店铺位置']
        reader = csv.DictReader(f, fieldnames=fieldnames)
        for index, i in enumerate(reader):
            if index != 0:  # skip the header row
                price = float(i['价格'].replace('¥', ''))
                prices.append(price)
    DATAS = {'<10': 0, '10~30': 0, '30~50': 0,
             '50~70': 0, '70~90': 0, '90~110': 0,
             '110~130': 0, '130~150': 0, '150~170': 0, '170~200': 0}
    for price in prices:
        if price < 10:
            DATAS['<10'] += 1
        elif 10 <= price < 30:
            DATAS['10~30'] += 1
        elif 30 <= price < 50:
            DATAS['30~50'] += 1
        elif 50 <= price < 70:
            DATAS['50~70'] += 1
        elif 70 <= price < 90:
            DATAS['70~90'] += 1
        elif 90 <= price < 110:
            DATAS['90~110'] += 1
        elif 110 <= price < 130:
            DATAS['110~130'] += 1
        elif 130 <= price < 150:
            DATAS['130~150'] += 1
        elif 150 <= price < 170:
            DATAS['150~170'] += 1
        elif 170 <= price < 200:
            DATAS['170~200'] += 1
    for k, v in DATAS.items():
        print(k, ':', v)
```
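An aside on the design: the long if/elif ladder works, but the same binning can be written more compactly with the standard library's bisect module. This is just an alternative sketch, not part of the original script:

```python
import bisect

EDGES = [10, 30, 50, 70, 90, 110, 130, 150, 170, 200]
LABELS = ['<10', '10~30', '30~50', '50~70', '70~90',
          '90~110', '110~130', '130~150', '150~170', '170~200']

def bucket(price):
    """Return the label of the range containing price, or None if price >= 200."""
    i = bisect.bisect_right(EDGES, price)
    return LABELS[i] if i < len(LABELS) else None
```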

```python
def get_the_top_10(url):
    """Collect price, sales, location and link of the 10 best sellers (requirement 4)."""
    top_ten = []
    # Get a proxy IP. zhima1() is a proxy-pool helper (Zhima proxy)
    # whose source is not included in this article.
    ip = zhima1()[2][random.randint(0, 399)]
    # Run a Quicker action (specific to the author's machine; can be ignored)
    os.system(r'"C:\Program Files\Quicker\QuickerStarter.exe" runaction:5e3abcd2-9271-47b6-8eaf-3e7c8f4935d8')
    options = webdriver.ChromeOptions()
    # Attach to a Chrome already running with remote debugging enabled
    options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')
    options.add_argument(f'--proxy-server={ip}')
    driver = webdriver.Chrome(options=options)
    # Implicit wait for elements to appear
    driver.implicitly_wait(3)
    # Open the search results page
    driver.get(url)
    # Click the element whose text contains '销量' (sales) to sort by sales
    driver.find_element(By.PARTIAL_LINK_TEXT, '销量').click()
    time.sleep(1)
    # Scroll to the bottom of the page so all items load
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(1)
    # Locate the item list
    element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
    items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
    for index, item in enumerate(items):
        if index == 10:
            break
        # Extract the fields of one item
        price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
        paid_num_data = item.find_element(By.XPATH, './div[2]/div[1]/div[@class="deal-cnt"]').text
        store_location = item.find_element(By.XPATH, './div[2]/div[3]/div[@class="location"]').text
        store_href = item.find_element(By.XPATH, './div[2]/div[@class="row row-2 title"]/a').get_attribute(
            'href').strip()
        # Collect the fields into a dict
        top_ten.append(
            {'价格': price,
             '销量': paid_num_data,
             '店铺位置': store_location,
             '店铺链接': store_href
             })
    for i in top_ten:
        print(i)
```

```python
def get_top_10_comments(url):
    """Scrape the reviews of the 10 best sellers into 排名前十评价.txt (requirement 3)."""
    # Truncate the output file
    with open('排名前十评价.txt', 'w+', encoding='utf-8') as f:
        pass
    # ip = ipidea()[1]  # alternative proxy pool, disabled here
    os.system(r'"C:\Program Files\Quicker\QuickerStarter.exe" runaction:5e3abcd2-9271-47b6-8eaf-3e7c8f4935d8')
    options = webdriver.ChromeOptions()
    options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')
    # options.add_argument(f'--proxy-server={ip}')
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(3)
    driver.get(url)
    driver.find_element(By.PARTIAL_LINK_TEXT, '销量').click()
    time.sleep(1)
    element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
    items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
    original_handle = driver.current_window_handle
    item_hrefs = []
    # First collect the links of the top 10 items
    for index, item in enumerate(items):
        if index == 10:
            break
        item_hrefs.append(
            item.find_element(By.XPATH, './/div[2]/div[@class="row row-2 title"]/a').get_attribute('href').strip())
    # Then scrape the reviews of each of the 10 items
    for item_href in item_hrefs:
        # Open the item in a new tab...
        driver.execute_script(f'window.open("{item_href}")')
        # ...and switch to it
        handles = driver.window_handles
        driver.switch_to.window(handles[-1])
        # Scroll until the '评价' (reviews) tab is visible, then click it.
        # The nested try/excepts fall back through several ways of reaching it,
        # since the tab sits at different places on different item pages.
        try:
            driver.find_element(By.PARTIAL_LINK_TEXT, '评价').click()
        except Exception:
            try:
                driver.find_element(By.PARTIAL_LINK_TEXT, '评价').location_once_scrolled_into_view
                driver.find_element(By.PARTIAL_LINK_TEXT, '评价').click()
            except Exception:
                try:
                    # Scroll down 100px first, in case '评价' is not on screen yet
                    driver.execute_script('var q=document.documentElement.scrollTop=100')
                    driver.find_element(By.PARTIAL_LINK_TEXT, '评价').location_once_scrolled_into_view
                except Exception:
                    driver.find_element(By.XPATH, '/html/body/div[6]/div/div[3]/div[2]/div/div[2]/ul/li[2]/a').click()
        time.sleep(1)
        # Two review layouts are handled: a table layout (the tm-rate-* classes)
        # and a list layout (J_KgRate_MainReviews) in the except branch.
        try:
            trs = driver.find_elements(By.XPATH, '//div[@class="rate-grid"]/table/tbody/tr')
            for index, tr in enumerate(trs):
                if index == 0:
                    comments = tr.find_element(By.XPATH, './td[1]/div[1]/div/div').text.strip()
                else:
                    try:
                        comments = tr.find_element(By.XPATH,
                                                   './td[1]/div[1]/div[@class="tm-rate-fulltxt"]').text.strip()
                    except Exception:
                        comments = tr.find_element(By.XPATH,
                                                   './td[1]/div[1]/div[@class="tm-rate-content"]/div[@class="tm-rate-fulltxt"]').text.strip()
                with open('排名前十评价.txt', 'a+', encoding='utf-8') as f:
                    f.write(comments + '\n')
                    print(comments)
        except Exception:
            lis = driver.find_elements(By.XPATH, '//div[@class="J_KgRate_MainReviews"]/div[@class="tb-revbd"]/ul/li')
            for li in lis:
                comments = li.find_element(By.XPATH, './div[2]/div/div[1]').text.strip()
                with open('排名前十评价.txt', 'a+', encoding='utf-8') as f:
                    f.write(comments + '\n')
                    print(comments)
```

```python
def get_top_10_comments_wordcloud():
    """Build a word cloud image from the scraped reviews."""
    file = '排名前十评价.txt'
    with open(file, encoding='utf-8') as f:
        txt = f.read()
    # Create the word cloud object and set the output image's attributes;
    # a Chinese font (msyh.ttc, Microsoft YaHei) is needed to render the text
    w = wordcloud.WordCloud(width=1000,
                            height=700,
                            background_color='white',
                            font_path='msyh.ttc')
    w.generate(txt)
    name = file.replace('.txt', '')
    w.to_file(name + '词云.png')
    os.startfile(name + '词云.png')  # open the image (Windows only)
```
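One caveat here: WordCloud tokenizes on whitespace and punctuation, which suits English better than unsegmented Chinese. If the cloud comes out as whole phrases, segmenting the reviews first with jieba (an extra dependency, not used in the original code) usually gives cleaner words:

```python
import jieba
import wordcloud

# Segment the Chinese review text into space-separated words before
# handing it to WordCloud. jieba is an assumption, not part of the original.
with open('排名前十评价.txt', encoding='utf-8') as f:
    txt = ' '.join(jieba.lcut(f.read()))

w = wordcloud.WordCloud(width=1000, height=700,
                        background_color='white', font_path='msyh.ttc')
w.generate(txt)
w.to_file('排名前十评价词云.png')
```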

```python
def get_10_pages_datas():
    """Scrape price, sales and location of every item on the first 10 result pages."""
    # Write the CSV header (with a BOM so Excel opens it as UTF-8)
    with open('前十页销量和金额.csv', 'w+', encoding='utf-8', newline='') as f:
        f.write('\ufeff')
        fieldnames = ['价格', '销量', '店铺位置']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
    infos = []
    options = webdriver.ChromeOptions()
    options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')
    # options.add_argument(f'--proxy-server={ip}')
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(3)
    driver.get(url)  # relies on the module-level url defined under __main__
    # driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
    items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
    for index, item in enumerate(items):
        price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
        paid_num_data = item.find_element(By.XPATH, './div[2]/div[1]/div[@class="deal-cnt"]').text
        store_location = item.find_element(By.XPATH, './div[2]/div[3]/div[@class="location"]').text
        infos.append(
            {'价格': price,
             '销量': paid_num_data,
             '店铺位置': store_location})
    # Click '下一页' (next page); if the button is off-screen, scroll down first
    try:
        driver.find_element(By.PARTIAL_LINK_TEXT, '下一').click()
    except Exception:
        driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        driver.find_element(By.PARTIAL_LINK_TEXT, '下一').click()
    # Pages 2 to 10
    for i in range(9):
        time.sleep(1)
        driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
        items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
        for index, item in enumerate(items):
            try:
                price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
            except Exception:
                # The element may not have loaded yet; scroll and retry once
                time.sleep(1)
                driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
                price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
            paid_num_data = item.find_element(By.XPATH, './div[2]/div[1]/div[@class="deal-cnt"]').text
            store_location = item.find_element(By.XPATH, './div[2]/div[3]/div[@class="location"]').text
            infos.append(
                {'价格': price,
                 '销量': paid_num_data,
                 '店铺位置': store_location})
        try:
            driver.find_element(By.PARTIAL_LINK_TEXT, '下一').click()
        except Exception:
            driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            driver.find_element(By.PARTIAL_LINK_TEXT, '下一').click()
        # One page done: append this batch of rows to the CSV, then clear the
        # buffer so each row is written only once
        for info in infos:
            print(info)
        with open('前十页销量和金额.csv', 'a+', encoding='utf-8', newline='') as f:
            fieldnames = ['价格', '销量', '店铺位置']
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            for info in infos:
                writer.writerow(info)
        infos = []
```

```python
if __name__ == '__main__':
    # Search URL for '小鱼零食' (small fish snacks); uncomment the step to run
    url = 'https://s.taobao.com/search?q=%E5%B0%8F%E9%B1%BC%E9%9B%B6%E9%A3%9F&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.21814703.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=4&ntoffset=4&p4ppushleft=2%2C48&s=0'
    # get_10_pages_datas()
    # tongji()
    # get_the_top_10(url)
    # get_top_10_comments(url)
    get_top_10_comments_wordcloud()
```

Through the code above we can collect the data we need; the charts are then rendered with the Bar and Geo classes (from pyecharts), as sketched in the preview section. Those two are worth exploring on your own.

Source: blog.csdn.net/Jernnifer_mao/article/details/133134795