1. Thematic web crawler design
1. Thematic web crawler name: crawling Jingdong (JD.com) mobile phone information.
2. Analysis of the content and data characteristics to be crawled: the product name, detail-page URL, image address, and price of every item on each list page (currently 141 pages), plus the "brand", "running memory", "body storage", "number of cameras", and other specifications shown after entering each phone's detail page.
3. Overview of the thematic web crawler design scheme. Idea: inspect the page source to locate the data objects to be crawled, crawl and store the data, then analyze it.
The Scrapy framework handles fast crawling, XPath handles data extraction, and the results are stored as CSV.
2. Analysis of the structural characteristics of the theme page
1. Analysis of the structural characteristics of the theme page:
The concept of Scrapy: Scrapy is an application framework written for crawling website data and extracting structured data.
The operation process and data transmission process of scrapy framework:
- The start URL in the spider is wrapped in a request object -> spider middleware -> engine -> scheduler
- The scheduler returns the request -> engine -> downloader middleware -> downloader
- The downloader sends the request and gets a response -> downloader middleware -> engine -> spider middleware -> spider
- The spider extracts URL addresses and assembles them into request objects -> spider middleware -> engine -> scheduler (repeat from step 2)
- The spider extracts data -> engine -> pipeline, which processes and saves the data
2. HTML page analysis:
3. Node (tag) search and traversal methods:
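The node searching and traversal that Scrapy's response.xpath performs can be sketched with the standard library's xml.etree.ElementTree, which supports a subset of XPath. The markup below is an invented, simplified fragment standing in for a JD list page, purely for illustration:

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for a JD list page (illustrative markup, not the real page)
html = """
<ul>
  <li class="gl-item"><a href="//item.jd.com/1.html"><em>Phone A</em></a></li>
  <li class="gl-item"><a href="//item.jd.com/2.html"><em>Phone B</em></a></li>
</ul>
"""

root = ET.fromstring(html)
# Node search: find every <li> whose class attribute is "gl-item"
items = root.findall('.//li[@class="gl-item"]')
# Node traversal: descend into each match to read attributes and text
links = [(li.find('./a').get('href'), li.find('./a/em').text) for li in items]
print(links)
```

Scrapy's Selector offers the same pattern with full XPath support, which is what the spider below relies on.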
Create a project:
a) First create a Scrapy project named jdSpider: scrapy startproject jdSpider
b) Then switch into the newly created project folder: cd jdSpider
c) Create a spider named jd: scrapy genspider jd list.jd.com (here list.jd.com is the domain allowed to crawl; it can be modified later)
3. Web crawler programming
1. Data crawling and collection
import re

import requests
import scrapy


class JdSpider(scrapy.Spider):
    num = 1
    # Spider name
    name = 'jd'
    # Domains allowed to crawl: list.jd.com for list pages, item.jd.com for detail pages
    allowed_domains = ['list.jd.com', 'item.jd.com']
    # URL to start crawling from
    start_urls = ['https://list.jd.com/list.html?cat=9987,653,655&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main']
    # start_urls = ['https://list.jd.com/list.html?cat=9987,653,655&page=143&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main']

    # Data extraction method; receives the response from the downloader middleware
    def parse(self, response):
        # Use XPath for data extraction
        li_list = response.xpath('//li[@class="gl-item"]')
        # Traverse the list items and extract each product
        for li in li_list:
            # Image link (lazy-loaded images keep the URL in data-lazy-img)
            img = li.xpath('./div/div[@class="p-img"]/a/img/@src').extract_first()
            if not img:
                img = li.xpath('./div/div[@class="p-img"]/a/img/@data-lazy-img').extract_first()
            # Product name; extract() returns a list
            name_list = li.xpath('./div/div/a/em/text()').extract()
            # JD International product names are structured differently
            name = name_list[0] if len(name_list) == 1 else name_list[1]
            # Product URL
            url = li.xpath('./div/div[@class="p-img"]/a/@href').extract_first()
            # Extract the product id
            sku_id = re.search(r'\d+', url).group()
            # Call the price API to get the product price
            headers = {'User-Agent': response.request.headers['User-Agent'].decode(),
                       'Connection': 'close'}
            price_response = requests.get('https://p.3.cn/prices/mgets?skuIds=' + sku_id,
                                          headers=headers)
            # Response looks like:
            # [{'cbf': '0', 'id': 'J_100006947212', 'm': '9999.00', 'op': '1399.00', 'p': '1289.00'}]
            # where 'p' is the current price
            dict_response = price_response.json()[0]
            price = dict_response['p']
            # strip() removes surrounding whitespace; pipelines.py stores the data
            data_dict = {'name': name.strip(), 'url': url, 'img': img, 'price': price}
            if len(name_list) == 1:
                yield scrapy.Request('https:' + url, callback=self.parse_detail,
                                     meta={'data_dict': data_dict})
            else:
                # JD International items skip the detail page; store the data directly
                yield data_dict
        print('Page', JdSpider.num, 'has been crawled')
        JdSpider.num += 1
        # Extract the URL of the next page
        next_page_url = response.xpath("//a[@class='pn-next']/@href").extract_first()
        if next_page_url:
            yield scrapy.Request('https://list.jd.com' + next_page_url, callback=self.parse)

    def parse_detail(self, response):
        """Process each product detail page."""
        data_dict = response.meta['data_dict']
        if response.status == 200:
            # Brand
            brand = response.xpath('//ul[@id="parameter-brand"]/li/@title').extract_first()
            if brand:
                # Regular merchandise
                data_dict['brand'] = brand
                # Other specification data
                li_list = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li')
                for li in li_list:
                    data = li.xpath('./text()').extract_first()
                    # Regex splits "key：value" on the full-width Chinese colon; .+? is non-greedy
                    ret = re.match(r'(.+?)：(.+)', data)
                    key = ret.group(1)
                    value = ret.group(2)
                    data_dict[key] = value
            # else:
            #     # JD International
            #     li_list = response.xpath('//ul[@class="parameter2"]/li')
            #     for li in li_list:
            #         data = li.xpath('./text()').extract_first()
            #         if data == '店铺：':  # "Shop:"
            #             data_dict['店铺'] = li.xpath('./a/text()').extract_first()
            #             continue
            #         ret = re.match(r'(.+?)：(.+)', data)
            #         data_dict[ret.group(1)] = ret.group(2)
        # Hand the data to pipelines.py
        yield data_dict
import csv

# from pymongo import MongoClient  # only needed if the MongoDB code below is enabled

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# CSV columns; these must match the keys produced by the spider
fieldnames = ['name', 'url', 'img', 'price', 'brand', 'product name', 'product number',
              'Product gross weight', 'Product origin', 'CPU model', 'Run memory', 'Body storage',
              'Memory card', 'Number of cameras', 'Rear camera main camera', 'Front camera main camera',
              'Main screen size (inch)', 'Resolution', 'Screen Ratio', 'Screen Proactive Combination',
              'Charger', 'Hot Spot', 'Special Features', 'Operating System', 'Game Performance',
              'Battery Capacity (mAh)', 'Body Color', 'Screen Ratio', 'Charging Power',
              'Game Configuration', 'Elderly Machine Configuration', 'Shop', 'Article Number']


class JdspiderPipeline(object):
    def open_spider(self, spider):  # runs only once, when the spider starts
        print('Crawler on')
        # # Connect to MongoDB (database jd, collection info)
        # client = MongoClient('127.0.0.1', 27017)
        # self.collection = client['jd']['info']
        # utf-8 so Chinese field values are written correctly
        with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()

    def close_spider(self, spider):  # runs only once, when the spider closes
        print('Crawler off')

    def process_item(self, item, spider):
        # # Insert into MongoDB instead:
        # self.collection.insert(item)
        with open('data.csv', 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow(item)
        return item
2. Data cleaning and processing:
The ROBOTS protocol can be configured in settings.py:
Set the log level:
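A sketch of the two settings mentioned above, as they could appear in settings.py (the values shown are illustrative choices, not requirements):

```python
# settings.py
ROBOTSTXT_OBEY = False   # ignore robots.txt (set True to respect it)
LOG_LEVEL = 'WARNING'    # show only warnings and above, hiding Scrapy's INFO/DEBUG noise
```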
3. Text analysis:
To send a different User-Agent with each request, add a list USER_AGENTS_LIST to settings.py:
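For example (the UA strings below are ordinary real browser strings chosen for illustration; any set of valid ones works):

```python
# settings.py
USER_AGENTS_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0',
]
```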
And improve the code in middlewares.py:
Then go back to settings.py and enable the middleware, so that each request randomly selects a User-Agent:
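A sketch of the settings.py entry that enables the middleware (the priority value 543 is an arbitrary but typical choice):

```python
# settings.py — register the custom downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'jdSpider.middlewares.UserAgentMiddleware': 543,
}
```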
4. Data analysis and visualization (for example: bar chart, histogram, scatter plot, combined chart, distribution chart):
Draw the distribution map:
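A minimal sketch of preparing a price-distribution chart. The prices below are made-up sample values (an assumption for illustration); with real data they would come from the 'price' column that the pipeline writes to data.csv:

```python
# Made-up sample prices (in yuan); real values would be read from data.csv
prices = [1289.0, 1999.0, 2599.0, 3999.0, 4999.0, 999.0, 1599.0, 6999.0]

# Bin the prices into 2000-yuan buckets to form the distribution
bins = {}
for p in prices:
    bucket = int(p // 2000) * 2000
    bins[bucket] = bins.get(bucket, 0) + 1

for bucket in sorted(bins):
    print(f'{bucket}-{bucket + 1999}: {bins[bucket]}')

# With matplotlib the same data can be drawn directly:
#   import matplotlib.pyplot as plt
#   plt.hist(prices, bins=5)
#   plt.show()
```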
5. Based on the relationship between the data, analyze the correlation coefficient between two variables, draw a scatter plot, and establish a regression equation between the variables:
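A minimal sketch of computing the Pearson correlation coefficient and a least-squares regression line y = a*x + b between two variables. It is written in pure Python to show the formulas; numpy.corrcoef and numpy.polyfit do the same in one call each. The data points are illustrative:

```python
import math

# Illustrative paired observations (e.g. two numeric columns from the crawled data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Shared sums: covariance numerator and the two variance terms
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
var_x = sum((xi - mean_x) ** 2 for xi in x)
var_y = sum((yi - mean_y) ** 2 for yi in y)

# Pearson correlation coefficient
r = cov / math.sqrt(var_x * var_y)

# Least-squares regression: slope a and intercept b
a = cov / var_x
b = mean_y - a * mean_x

print(f'r = {r:.4f}, regression: y = {a:.3f}x + {b:+.3f}')
```

A scatter plot of x against y with the fitted line overlaid (plt.scatter plus plt.plot) then visualizes how well the regression equation fits.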
6. Data persistence:
7. Complete program code:
import re

import requests
import scrapy


class JdSpider(scrapy.Spider):
    num = 1
    # Spider name
    name = 'jd'
    # Domains allowed to crawl: list.jd.com for list pages, item.jd.com for detail pages
    allowed_domains = ['list.jd.com', 'item.jd.com']
    # URL to start crawling from
    start_urls = ['https://list.jd.com/list.html?cat=9987,653,655&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main']
    # start_urls = ['https://list.jd.com/list.html?cat=9987,653,655&page=143&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main']

    # Data extraction method; receives the response from the downloader middleware
    def parse(self, response):
        # Use XPath for data extraction
        li_list = response.xpath('//li[@class="gl-item"]')
        # Traverse the list items and extract each product
        for li in li_list:
            # Image link (lazy-loaded images keep the URL in data-lazy-img)
            img = li.xpath('./div/div[@class="p-img"]/a/img/@src').extract_first()
            if not img:
                img = li.xpath('./div/div[@class="p-img"]/a/img/@data-lazy-img').extract_first()
            # Product name; extract() returns a list
            name_list = li.xpath('./div/div/a/em/text()').extract()
            # JD International product names are structured differently
            name = name_list[0] if len(name_list) == 1 else name_list[1]
            # Product URL
            url = li.xpath('./div/div[@class="p-img"]/a/@href').extract_first()
            # Extract the product id
            sku_id = re.search(r'\d+', url).group()
            # Call the price API to get the product price
            headers = {'User-Agent': response.request.headers['User-Agent'].decode(),
                       'Connection': 'close'}
            price_response = requests.get('https://p.3.cn/prices/mgets?skuIds=' + sku_id,
                                          headers=headers)
            # Response looks like:
            # [{'cbf': '0', 'id': 'J_100006947212', 'm': '9999.00', 'op': '1399.00', 'p': '1289.00'}]
            # where 'p' is the current price
            dict_response = price_response.json()[0]
            price = dict_response['p']
            # strip() removes surrounding whitespace; pipelines.py stores the data
            data_dict = {'name': name.strip(), 'url': url, 'img': img, 'price': price}
            if len(name_list) == 1:
                yield scrapy.Request('https:' + url, callback=self.parse_detail,
                                     meta={'data_dict': data_dict})
            else:
                # JD International items skip the detail page; store the data directly
                yield data_dict
        print('Page', JdSpider.num, 'has been crawled')
        JdSpider.num += 1
        # Extract the URL of the next page
        next_page_url = response.xpath("//a[@class='pn-next']/@href").extract_first()
        if next_page_url:
            yield scrapy.Request('https://list.jd.com' + next_page_url, callback=self.parse)

    def parse_detail(self, response):
        """Process each product detail page."""
        data_dict = response.meta['data_dict']
        if response.status == 200:
            # Brand
            brand = response.xpath('//ul[@id="parameter-brand"]/li/@title').extract_first()
            if brand:
                # Regular merchandise
                data_dict['brand'] = brand
                # Other specification data
                li_list = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li')
                for li in li_list:
                    data = li.xpath('./text()').extract_first()
                    # Regex splits "key：value" on the full-width Chinese colon; .+? is non-greedy
                    ret = re.match(r'(.+?)：(.+)', data)
                    key = ret.group(1)
                    value = ret.group(2)
                    data_dict[key] = value
            # else:
            #     # JD International
            #     li_list = response.xpath('//ul[@class="parameter2"]/li')
            #     for li in li_list:
            #         data = li.xpath('./text()').extract_first()
            #         if data == '店铺：':  # "Shop:"
            #             data_dict['店铺'] = li.xpath('./a/text()').extract_first()
            #             continue
            #         ret = re.match(r'(.+?)：(.+)', data)
            #         data_dict[ret.group(1)] = ret.group(2)
        # Hand the data to pipelines.py
        yield data_dict
import csv

# from pymongo import MongoClient  # only needed if the MongoDB code below is enabled

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# CSV columns; these must match the keys produced by the spider
fieldnames = ['name', 'url', 'img', 'price', 'brand', 'product name', 'product number',
              'Product gross weight', 'Product origin', 'CPU model', 'Run memory', 'Body storage',
              'Memory card', 'Number of cameras', 'Rear camera main camera', 'Front camera main camera',
              'Main screen size (inch)', 'Resolution', 'Screen Ratio', 'Screen Proactive Combination',
              'Charger', 'Hot Spot', 'Special Features', 'Operating System', 'Game Performance',
              'Battery Capacity (mAh)', 'Body Color', 'Screen Ratio', 'Charging Power',
              'Game Configuration', 'Elderly Machine Configuration', 'Shop', 'Article Number']


class JdspiderPipeline(object):
    def open_spider(self, spider):  # runs only once, when the spider starts
        print('Crawler on')
        # # Connect to MongoDB (database jd, collection info)
        # client = MongoClient('127.0.0.1', 27017)
        # self.collection = client['jd']['info']
        # utf-8 so Chinese field values are written correctly
        with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()

    def close_spider(self, spider):  # runs only once, when the spider closes
        print('Crawler off')

    def process_item(self, item, spider):
        # # Insert into MongoDB instead:
        # self.collection.insert(item)
        with open('data.csv', 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow(item)
        return item
Improved code in middlewares.py:
import random

from jdSpider.settings import USER_AGENTS_LIST


class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(USER_AGENTS_LIST)
        request.headers['User-Agent'] = user_agent
        request.headers['Connection'] = 'close'


class CheckUA(object):
    def process_response(self, request, response, spider):
        return response
import matplotlib.pyplot as plt

x_values = range(1, 1001)
y_values = [x * x for x in x_values]
"""
scatter():
x: abscissa  y: ordinate  s: point size
"""
plt.scatter(x_values, y_values, s=10)
# Set the chart title and label the axes
plt.title('Square Numbers', fontsize=24)
plt.xlabel('Value', fontsize=14)
plt.ylabel('Square of Value', fontsize=14)
# Set the size of the tick labels
plt.tick_params(axis='both', which='major', labelsize=14)
# Set the value range of each axis
plt.axis([0, 1100, 0, 1100000])
plt.show()