Crawling the Jingdong mobile phone rankings

1. Thematic web crawler design

1. Thematic web crawler name: Crawl Jingdong mobile phone information.

2. Analysis of the content and data characteristics to be crawled: the product name, detail-page URL, image address, and price of every product on each list page (currently 141 pages), plus the "brand", "running memory", "body storage", "number of cameras" and other attributes shown on each phone's detail page.

3. Overview of the thematic web crawler design scheme. Idea: use a web crawler to collect the data, locate the target fields in the page source code, store the crawled data, and then analyze it.

The scrapy framework is used for fast crawling, XPath for data extraction, and CSV for data storage.

2. Analysis of the structural characteristics of the theme page

1. Analysis of the structural characteristics of the theme page:

The concept of scrapy: Scrapy is an application framework written for crawling website data and extracting structured data.

The operating process and data flow of the scrapy framework:

  • The start urls in the spider are constructed into request objects -> spider middleware -> engine -> scheduler
  • The scheduler hands a request -> engine -> downloader middleware -> downloader
  • The downloader sends the request and receives a response -> downloader middleware -> engine -> spider middleware -> spider
  • The spider extracts url addresses and assembles them into new request objects -> spider middleware -> engine -> scheduler (repeat from step 2)
  • The spider extracts data -> engine -> item pipeline, which processes and saves the data
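A minimal sketch (illustrative names only, not this project's jd spider) of how a spider plugs into this flow: a yielded Request goes back through the engine to the scheduler, while a yielded item goes on to the pipeline.

import scrapy


class FlowDemoSpider(scrapy.Spider):
    name = 'flow_demo'                      # illustrative spider name
    start_urls = ['https://example.com/']   # turned into the first request objects

    def parse(self, response):
        # A yielded dict is an item: engine -> item pipeline
        yield {'title': response.xpath('//title/text()').get()}
        # A yielded Request is a new crawl task: engine -> scheduler
        next_url = response.xpath('//a/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)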

2. HTML page analysis:
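Each product on the list page is an li element with class "gl-item". Inside it, the img node under div.p-img carries the picture URL (in src, or in data-lazy-img for lazily loaded images), the em node under div/div/a carries the product name, and the a node's href points to the detail page on item.jd.com; the price is not in the list page HTML and is fetched from JD's price API instead. The spider below extracts exactly these nodes with XPath.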

 

 3. Node (tag) search and traversal methods:

Create a project:

a)  First create a scrapy project named jdSpider : scrapy startproject jdSpider

b)  Then switch to the created project folder: cd jdSpider

c)  Create a crawler named jd: scrapy genspider jd list.jd.com (here list.jd.com is a domain allowed to be crawled; it can be modified later)

3. Web crawler programming

1. Data crawling and collection

 

import re

import scrapy
import requests


class JdSpider(scrapy.Spider):
    num = 1
    # Crawler name
    name = 'jd'
    # Domain names allowed to crawl: list.jd.com serves the list pages, item.jd.com the detail pages
    allowed_domains = ['list.jd.com', 'item.jd.com']
    # URL to start crawling from
    start_urls = ['https://list.jd.com/list.html?cat=9987,653,655&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main']
    # start_urls = ['https://list.jd.com/list.html?cat=9987,653,655&page=143&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main']

    # Data extraction method, receiving the response from the downloader middleware
    def parse(self, response):
        # print('Check whether the User-Agent changes between requests')
        # print(response.request.headers['User-Agent'])
        # print('-' * 100)
        # Use XPath for data extraction
        li_list = response.xpath('//li[@class="gl-item"]')
        # Traverse the list items and extract each product
        for li in li_list:
            # Image link (lazily loaded images keep the URL in data-lazy-img instead of src)
            img = li.xpath('./div/div[@class="p-img"]/a/img/@src').extract_first()
            if not img:
                img = li.xpath('./div/div[@class="p-img"]/a/img/@data-lazy-img').extract_first()
            # Product name; extract() returns a list
            name_list = li.xpath('./div/div/a/em/text()').extract()
            # Jingdong International product names are structured differently
            name = name_list[0] if len(name_list) == 1 else name_list[1]
            # Product url
            url = li.xpath('./div/div[@class="p-img"]/a/@href').extract_first()
            # Extract the product id (sku)
            sku_id = re.search(r'\d+', url).group()
            # Call the price API to get the product price
            headers = {"User-Agent": response.request.headers['User-Agent'].decode(),
                       'Connection': 'close'}
            price_response = requests.get("https://p.3.cn/prices/mgets?skuIds=" + sku_id, headers=headers)
            # Example response:
            # [{'cbf': '0', 'id': 'J_100006947212', 'm': '9999.00', 'op': '1399.00', 'p': '1289.00'}]
            # 'p' is the current price
            dict_response = price_response.json()[0]
            price = dict_response['p']
            # strip() removes the whitespace around the name; pipelines.py will save the data
            # yield {"name": name.strip(), "url": url, "img": img, "price": price}
            data_dict = {"name": name.strip(), "url": url, "img": img, "price": price}

            if len(name_list) == 1:
                # Ordinary product: follow the detail page and pass the data along in meta
                yield scrapy.Request("https:" + url, callback=self.parse_detail, meta={"data_dict": data_dict})
            else:
                # Jingdong International: do not enter the detail page, store the data directly
                yield data_dict
            # break

        print('Page', JdSpider.num, 'has been crawled')
        JdSpider.num += 1
        # Extract the url of the next page
        next_page_url = response.xpath("//a[@class='pn-next']/@href").extract_first()
        # print('-' * 200)
        # print(next_page_url, type(next_page_url))
        # print('-' * 200)
        if next_page_url:
            yield scrapy.Request('https://list.jd.com' + next_page_url, callback=self.parse)

    def parse_detail(self, response):
        """Process each product's detail page"""
        data_dict = response.meta["data_dict"]
        if response.status == 200:
            # Brand
            brand = response.xpath('//ul[@id="parameter-brand"]/li/@title').extract_first()
            if brand:
                # Ordinary merchandise
                data_dict["brand"] = brand
                # Other attributes
                li_list = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li')
                for li in li_list:
                    data = li.xpath('./text()').extract_first()
                    # Split each "label：value" entry with a regular expression
                    # (full-width Chinese colon; the ? makes the first group non-greedy)
                    ret = re.match(r'(.+?)：(.+)', data)
                    key = ret.group(1)
                    value = ret.group(2)
                    data_dict[key] = value

            # else:
            #     # Jingdong International
            #     li_list = response.xpath('//ul[@class="parameter2"]/li')
            #     for li in li_list:
            #         data = li.xpath('./text()').extract_first()
            #         if data == "店铺：":
            #             data_dict["店铺"] = li.xpath('./a/text()').extract_first()
            #             continue
            #         # Split each "label：value" entry with a regular expression
            #         ret = re.match(r'(.+?)：(.+)', data)
            #         key = ret.group(1)
            #         value = ret.group(2)
            #         data_dict[key] = value

            # Let pipelines.py process the data
            yield data_dict
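The item pipeline below (pipelines.py) receives every yielded item and appends it to data.csv: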

from pymongo import MongoClient
import csv

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# CSV column names for the attributes the spider may collect
fieldnames = ['name', 'url', 'img', 'price', 'brand', 'product name', "product number",
              "Product gross weight", "Product origin", "CPU model", "Run memory", "Body storage", "Memory card",
              "Number of cameras", "Rear camera main camera", "Front camera main camera", "Main screen size (inch)",
              "Resolution", "Screen Ratio", "Screen Proactive Combination", "Charger", "Hot Spot", "Special Features",
              "Operating System", "Game Performance", "Battery Capacity (mAh)", "Body Color", "Screen Ratio",
              "Charging Power", "Game Configuration", "Elderly Machine Configuration", "Shop", "Article Number"]


class JdspiderPipeline(object):

    def open_spider(self, spider):  # executed only once, when the spider is opened
        print('Crawler started')
        # # Connect to the MongoDB database (optional alternative storage)
        # client = MongoClient("127.0.0.1", 27017)
        # self.collection = client["jd"]["info"]  # database "jd", collection "info"
        # Write the CSV header once
        with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()

    def close_spider(self, spider):  # executed only once, when the spider is closed
        print('Crawler closed')

    def process_item(self, item, spider):
        # # Insert the data into MongoDB instead
        # print(item)
        # self.collection.insert(item)
        # Append one CSV row; keys not listed in fieldnames are ignored, missing keys are left empty
        with open('data.csv', 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore', restval='')
            writer.writerow(item)
        return item

2. Data cleaning and processing:

The ROBOTS protocol can be configured in settings.py:
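A one-line sketch of the relevant settings.py entry (Scrapy obeys robots.txt by default; setting the flag to False makes it ignore the site's robots rules):

# settings.py
ROBOTSTXT_OBEY = False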

 

 

 

Set the log level:
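For example, in settings.py (WARNING is just an example level; Scrapy's default is DEBUG):

# settings.py
LOG_LEVEL = 'WARNING'  # only warnings and errors are printed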

 

 

 3. Text analysis:

In order to send different requests with different User-Agents, we can add a list USER_AGENTS_LIST in settings.py:
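A sketch of such a list in settings.py; the strings below are ordinary browser User-Agent values and can be replaced or extended freely:

# settings.py
USER_AGENTS_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
]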

 

 

And improve the code in middlewares.py (the full middleware listing appears under "Complete program code" below):

 

 

 

At this point, go back to settings.py and enable the middleware, so that each request randomly selects a User-Agent:
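A sketch of the settings.py entry (the class names match middlewares.py shown later; the priority numbers are arbitrary example values):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'jdSpider.middlewares.UserAgentMiddleware': 543,  # rotate the User-Agent on every request
    'jdSpider.middlewares.CheckUA': 600,              # optional: inspect responses
}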

 

 

 

 4. Data analysis and visualization (for example: bar chart, histogram, scatter plot, combined graph, distribution graph):

Draw the distribution plot:
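A minimal sketch of a price-distribution plot, assuming the data.csv written by the pipeline above and its 'price' column:

import matplotlib.pyplot as plt
import pandas as pd

# Load the crawled data; 'price' is one of the CSV columns written by the pipeline
df = pd.read_csv('data.csv')
prices = pd.to_numeric(df['price'], errors='coerce').dropna()

# Histogram of phone prices
plt.hist(prices, bins=30, edgecolor='black')
plt.title('Price distribution of JD mobile phones')
plt.xlabel('Price (yuan)')
plt.ylabel('Number of products')
plt.show()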

 

 

5. Based on the relationships in the data, analyze the correlation coefficient between two variables, draw a scatter plot, and establish a regression equation between the variables:
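A sketch of this analysis, assuming the data.csv columns named above ('price' and 'Battery Capacity (mAh)'; substitute whatever attribute column is actually present):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
x = pd.to_numeric(df['Battery Capacity (mAh)'], errors='coerce')
y = pd.to_numeric(df['price'], errors='coerce')
mask = x.notna() & y.notna()
x, y = x[mask], y[mask]

# Correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]
print('correlation coefficient r =', r)

# Least-squares regression line: price = k * capacity + b
k, b = np.polyfit(x, y, 1)
print('regression equation: y = {:.4f} * x + {:.2f}'.format(k, b))

# Scatter plot with the fitted regression line
plt.scatter(x, y, s=10)
plt.plot(x, k * x + b, color='red')
plt.xlabel('Battery Capacity (mAh)')
plt.ylabel('Price (yuan)')
plt.title('Battery capacity vs. price')
plt.show()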

 

 

 

 

 

 6. Data persistence:
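Persistence is handled by the JdspiderPipeline shown above: every item is appended as a row of data.csv, and the commented-out lines show an alternative MongoDB storage path. For the pipeline to run it must be registered in settings.py; a sketch (300 is just the conventional example priority):

# settings.py
ITEM_PIPELINES = {
    'jdSpider.pipelines.JdspiderPipeline': 300,
}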

 

 7. Complete program code:

The spider (jd.py) and the pipeline (pipelines.py) are the listings shown under "1. Data crawling and collection" above.

middlewares.py (the improved code referenced in "3. Text analysis" above):

import random

from jdSpider.settings import USER_AGENTS_LIST


class UserAgentMiddleware(object):

    def process_request(self, request, spider):
        # Pick a random User-Agent from the list for every outgoing request
        user_agent = random.choice(USER_AGENTS_LIST)
        request.headers['User-Agent'] = user_agent
        request.headers["Connection"] = "close"


class CheckUA(object):

    def process_response(self, request, response, spider):
        # The User-Agent actually sent can be inspected here via response.request.headers
        return response

 "" "Use scatter () to draw a scatter plot" ""
import matplotlib.pyplot as plt
 
x_values ​​= range (1, 1001)
y_values ​​= [x * x for x in x_values]
'' '
scatter () 
x: abscissa y : Ordinate s: point size
''
plt.scatter (x_values, y_values, s = 10)
 
# Set the chart title and label the coordinate axis
plt.title ('Square Numbers', fontsize = 24)
plt.xlabel ('Value', fontsize = 14)
plt.ylabel ('Square of Value', fontsize = 14)
 
# Set the size of the tick mark
plt.tick_params (axis = 'both', which = 'major', labelsize = 14)
 
# Set the value range of each coordinate axis
plt.axis ([0, 1100, 0, 1100000])
plt.show ()
 
4. Conclusion
1. After analyzing and visualizing the subject data, the following conclusion can be drawn: Apple and Huawei phones are the most popular in the mobile phone market.
2. A brief summary of the completion of the programming task:
Summary: through this project I realized that I still have many shortcomings and plenty of room for improvement, and I also picked up some new knowledge along the way.

 


Origin www.cnblogs.com/cmmmmm/p/12758152.html