Article Directory
It's 2021 and I still haven't crawled your favorite beauty pictures. Does my crawler look outdated? No problem: although the site's interface has changed, this article shows how to crawl the whole beauty site in 2021. It is an improved version of an existing project into which I worked many of my own ideas: parts of the original that were hard to understand I re-implemented in a simpler way, and I think the result turned out well. Take a look at the results of the first crawl below.
Introduction
Beauty web crawling based on Scrapy framework
Crawler entry address: http://www.meinv.hk/?cat=2
If your crawler runs normally but returns no data, the likely reason is that the site can only be reached through a proxy/VPN.
Here we focus on two techniques: a custom image pipeline and a custom CSV data pipeline.
Implementation process
Creating the project is trivial, so we will skip it.
Open website
Clicking through the site suggests a concrete crawling strategy: first crawl the popular recommendation tags, then obtain the URL of each model's detail page with her pictures.
Then use the rules attribute.
The rules specify how URLs are extracted from each response. Every extracted URL is requested again, then parsed or followed further according to the callback function and the follow attribute.
Two points are emphasized here:
- First, URLs are extracted from every returned response, including the response to the initial start_urls request;
- Second, every rule in the rules list is applied.
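To see what a rule like `allow=r'p=\d+'` actually matches, here is a minimal stdlib-only sketch. It is not Scrapy's real LinkExtractor (which also resolves relative URLs, deduplicates, and respects allowed_domains); the HTML fragment and URLs are made up for illustration:

```python
import re

# Sample HTML standing in for a listing-page response (hypothetical URLs)
html = '''
<a href="http://www.meinv.hk/?p=101">detail A</a>
<a href="http://www.meinv.hk/?cat=3">category</a>
<a href="http://www.meinv.hk/?p=205">detail B</a>
'''

# Mimic LinkExtractor(allow=r'p=\d+'): collect hrefs, keep only matches
hrefs = re.findall(r'href="([^"]+)"', html)
matched = [u for u in hrefs if re.search(r'p=\d+', u)]
print(matched)  # only the ?p=... detail-page links survive the filter
```

Category links like `?cat=3` are filtered out, so only detail pages reach the callback.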
Open pipelines.py, add from scrapy.pipelines.images import ImagesPipeline, and inherit from ImagesPipeline.
The custom pipeline can be built on top of Scrapy's built-in ImagesPipeline.
Three methods of ImagesPipeline can be overridden: get_media_requests(), file_path(), and item_completed().
Specific code
Because custom pipelines handle both the images and the CSV output, and the spider yields plain dicts, there is no need to write items.py.
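Since no Item class is defined, each result is a plain dict; the two pipelines only rely on certain keys being present. A sketch of the shape the spider yields (all values here are made-up placeholders):

```python
# Hypothetical example of the dict that parse_item() yields
item = {
    'name': 'Some Model',           # page title
    'hometown': 'unknown',          # parsed from the info line
    'birthday': '1995-01-01',
    'image_urls': ['http://www.meinv.hk/img/1.jpg'],  # consumed by the image pipeline
    'images': '',                   # filled in later by item_completed()
    'detail_url': 'http://www.meinv.hk/?p=101',
}
print(sorted(item.keys()))
```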
mv.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MvSpider(CrawlSpider):
    name = 'mv'
    allowed_domains = ['www.meinv.hk']
    start_urls = ['http://www.meinv.hk/?cat=2']

    # Add a rule extracting the href of <a> tags.
    # Each extracted href generates a new Request,
    # parsed by the callback given for the rule.
    rules = (
        # allow takes a regular expression matched against every href
        # follow=True: after a matched page is downloaded, keep
        # extracting links from it with the same rules
        Rule(LinkExtractor(allow=r'p=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        info = response.xpath('//div[@class="wshop wshop-layzeload"]/text()').extract_first()
        try:
            item['hometown'] = info.split("/")[2].strip().split()[1]
            item['birthday'] = info.split("/")[1].strip().split()[1]
        except (AttributeError, IndexError):
            item['birthday'] = "unknown"
            item['hometown'] = "unknown"
        item['name'] = response.xpath('//h1[@class="title"]/text()').extract_first()
        images = response.xpath('//div[@class="post-content"]//img/@src')
        item['image_urls'] = images.extract()  # empty list if no images found
        item['images'] = ''
        item['detail_url'] = response.url
        yield item
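The two split() chains in parse_item() are easier to follow with a concrete input. Assuming the info div's text looks roughly like the made-up string below (slash-separated fields, each a "label value" pair, matching the site's Chinese labels), the parsing works like this:

```python
# Hypothetical info string in the format the spider expects:
# fields separated by "/", each field like "<label> <value>"
info = "姓名 小美 / 生日 1995年1月1日 / 家乡 上海"

try:
    # field [2] is "家乡 上海"; split() on whitespace, take the value
    hometown = info.split("/")[2].strip().split()[1]
    # field [1] is "生日 1995年1月1日"
    birthday = info.split("/")[1].strip().split()[1]
except (AttributeError, IndexError):
    # info was None or did not have the expected shape
    hometown = birthday = "unknown"

print(hometown, birthday)
```

If the page lacks the info div, extract_first() returns None and the except branch fills in the fallback values.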
middlewares.py
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent_list):
        super().__init__()
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(user_agent_list=crawler.settings.get('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agent_list)
        if user_agent:
            request.headers['User-Agent'] = user_agent
        return None
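The middleware's behavior can be checked in isolation: for every outgoing request, random.choice picks one entry of USER_AGENT_LIST and sets it as the User-Agent header. A stdlib-only sketch with a stand-in list and a plain dict instead of Scrapy's request object:

```python
import random

# Stand-in for settings.USER_AGENT_LIST (shortened, hypothetical entries)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

# What process_request() does for each outgoing request
headers = {}
user_agent = random.choice(USER_AGENT_LIST)
if user_agent:
    headers['User-Agent'] = user_agent

print(headers['User-Agent'])
```

Rotating the User-Agent this way makes consecutive requests look like they come from different browsers.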
pipelines.py
import csv
import os
from hashlib import sha1

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

from meinv import settings


class MvImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for url in item['image_urls']:
            yield Request(url, meta={'name': item['name']})

    def item_completed(self, results, item, info):
        # Store the paths of the successfully downloaded images in the item
        item['images'] = [x for ok, x in results if ok]
        return item

    def file_path(self, request, response=None, info=None):
        # Create one directory per model to hold all of her pictures
        author_name = request.meta['name']
        author_dir = os.path.join(settings.IMAGES_STORE, author_name)
        if not os.path.exists(author_dir):
            os.makedirs(author_dir)

        # Extract the file name and extension from the URL;
        # fall back to a SHA-1 of the URL and .jpg
        filename = request.url.split("/")[-1].split(".")[0]
        if not filename:
            filename = sha1(request.url.encode('utf-8')).hexdigest()
        ext_name = request.url.split(".")[-1] if "." in request.url else 'jpg'

        # Return a path relative to IMAGES_STORE
        return '%s/%s.%s' % (author_name, filename, ext_name)


class MeinvPipeline(object):
    def __init__(self):
        self.csv_filename = 'meinv.csv'
        self.existed_header = False

    def process_item(self, item, spider):
        # item is the dict yielded by the spider's parse_item()
        with open(self.csv_filename, 'a', encoding='utf-8', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=(
                'name', 'hometown', 'birthday', 'detail_url'))
            if not self.existed_header:
                # First write: emit the header row exactly once
                writer.writeheader()
                self.existed_header = True
            data = {
                'name': item['name'].strip(),
                'hometown': item['hometown'],
                'birthday': item['birthday'].replace('年', '-').replace('月', '-').replace('日', ''),
                'detail_url': item['detail_url'],
            }
            writer.writerow(data)
        return item
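The header-once logic in MeinvPipeline can be exercised without Scrapy: writing two rows through the same DictWriter setup shows that the header line appears only on the first call. A sketch using a temporary file and made-up rows:

```python
import csv
import os
import tempfile

fieldnames = ('name', 'hometown', 'birthday', 'detail_url')
rows = [
    {'name': 'A', 'hometown': 'unknown', 'birthday': '1995-1-1',
     'detail_url': 'http://example.com/?p=1'},
    {'name': 'B', 'hometown': 'unknown', 'birthday': '1996-2-2',
     'detail_url': 'http://example.com/?p=2'},
]

path = os.path.join(tempfile.mkdtemp(), 'meinv.csv')
existed_header = False
for row in rows:                       # one process_item() call per row
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not existed_header:         # header is written exactly once
            writer.writeheader()
            existed_header = True
        writer.writerow(row)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines[0])       # the header row
print(len(lines))     # header + two data rows
```

Opening in append mode on every call means restarting the crawler keeps earlier rows, but note the flag lives in memory, so a restarted run would write the header again.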
settings.py
import os
BOT_NAME = 'meinv'
SPIDER_MODULES = ['meinv.spiders']
NEWSPIDER_MODULE = 'meinv.spiders'
# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DOWNLOADER_MIDDLEWARES = {
    'meinv.middlewares.RandomUserAgentMiddleware': 543,
}
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Directory where the image pipeline stores downloaded pictures
IMAGES_STORE = os.path.join(BASE_DIR, 'images')
ITEM_PIPELINES = {
    'meinv.pipelines.MvImagePipeline': 100,
    'meinv.pipelines.MeinvPipeline': 300,
}
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
1. File directory
2. One model's picture folder
3. CSV file content
Finally, a few of the crawled pictures.
Code download
https://download.csdn.net/download/weixin_54707168/15902904