Scrapy quick reference (common commands)

Copyright notice: this is an original article by the author; reproduction without permission is prohibited. https://blog.csdn.net/weixin_39532362/article/details/88658492

Creating a project

scrapy startproject <project_name>  # create a new project

scrapy genspider mySpider 163.com  # generate a basic spider from the default template

scrapy genspider -l  # list the available spider templates

scrapy genspider -d template  # preview a template's contents

scrapy genspider [-t template] <name> <domain>  # generate a spider with the given name and domain, optionally from a specific template

Changing the robots.txt policy

  • settings
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Setting up requests and parsing

  • spider
# -*- coding: utf-8 -*-
import scrapy

class mySpider(scrapy.Spider):
    name = 'mySpider'
    allowed_domains = ['163.com']  # restrict crawling to this domain
    start_urls = ['http://163.com/']

    def start_requests(self):
        # must return an iterable of Request objects
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        # errback=... may also be passed to handle request failures
        return scrapy.Request(url, callback=self.parse, method='GET',
                              encoding='utf-8', dont_filter=False)
        # return scrapy.FormRequest(url, formdata={}, callback=self.parse)

    def parse(self, response):
        response.text  # decoded response text
        response.body.decode(encoding='utf-8')  # raw bytes decoded manually

Setting headers

  • settings
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
  'Accept-Language': 'en',
}

Setting a proxy

  • middlewares
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # route every request through the proxy at this address
        request.meta['proxy'] = 'http://127.0.0.1:9743'
  • settings
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'testproject.middlewares.ProxyMiddleware': 543,
}
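
To rotate between several proxies, the middleware can pick one at random from a custom setting. A minimal sketch, where PROXY_LIST is an assumed custom setting (not built into Scrapy):

import random

class RandomProxyMiddleware(object):
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a custom setting, e.g. ['http://127.0.0.1:9743', ...]
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)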

Setting up pipelines

  • pipelines
import pandas as pd
from scrapy.exceptions import DropItem
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

class Pipeline_Format(object):
    def process_item(self, item, spider):
        # convert the item into a one-row DataFrame for the next pipeline stage
        item = pd.DataFrame([dict(item)])
        return item



class Pipeline_MySql(object):
    def __init__(self, user, password, port, database, charset):
        self.user = user
        self.password = password
        self.port = port
        self.database = database
        self.charset = charset

    # build the pipeline from the crawler's settings
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            charset=crawler.settings.get('MYSQL_CHARSET')
        )

    # called when the spider opens: create the engine and open a session
    def open_spider(self, spider):
        cracom = 'mysql+pymysql://{user}:{password}@127.0.0.1:{port}/{database}?charset={charset}'
        self.engine = create_engine(cracom.format(
            user=self.user,
            password=self.password,
            port=self.port,
            database=self.database,
            charset=self.charset)
        )
        self.session = sessionmaker(bind=self.engine)()

    # called when the spider closes: close the database session
    def close_spider(self, spider):
        self.session.close()

    # process the item: write it to the database and return it
    def process_item(self, item, spider):
        item.to_sql('tbname', con=self.engine, if_exists='append', index=False)
        return item
  • settings
# Enable pipelines; lower numbers are processed first
ITEM_PIPELINES = {
    'testproject.pipelines.Pipeline_Format': 300,
    'testproject.pipelines.Pipeline_MySql': 400,
}

# Database connection parameters
MYSQL_DATABASE='scrapy_test'
MYSQL_USER='root'
MYSQL_PASSWORD='123456'
MYSQL_PORT=3306
MYSQL_CHARSET='utf8mb4'

Running the project

scrapy crawl mySpider
scrapy crawl mySpider -o fname.json
-o fname.jl   # JSON Lines format
-o fname.csv
-o ftp://url/path/file_name.csv   # export to a remote FTP location
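
The same export can be configured once in settings instead of being passed on every command line; a minimal sketch assuming Scrapy >= 2.1, where the FEEDS setting is available (the file name is a placeholder):

# settings.py: equivalent of passing `-o fname.jl` on every crawl (assumes Scrapy >= 2.1)
FEEDS = {
    'fname.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    },
}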

Constructing requests

  • class scrapy.http.Request()
    url (string) – the URL of this request
    callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see "Passing additional data to callback functions" in the Scrapy documentation. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
    method (string) – the HTTP method of this request. Defaults to 'GET'.
    meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
    body (str or unicode) – the request body. If a unicode is passed, then it’s encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
    headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
    cookies (dict or list) – the request cookies. These can be sent in two forms.
  • scrapy.FormRequest(url, formdata={}, callback=self.parse [,…])
  • scrapy.FormRequest.from_response(response, formdata={}, callback=self.parse [,…])
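
A minimal login sketch using FormRequest.from_response; the URL, form field names, and the parse_after_login callback are placeholders, not taken from the original post:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'loginSpider'
    start_urls = ['http://example.com/login']  # placeholder URL

    def parse(self, response):
        # fill and submit the login form found in the page (sent as a POST request)
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},  # placeholder credentials
            callback=self.parse_after_login,
        )

    def parse_after_login(self, response):
        self.logger.info('logged in, status %s', response.status)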

Constructing responses

  • class scrapy.http.Response()
    url (string) – the URL of this response
    headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
    status (integer) – the HTTP status of the response. Defaults to 200.
    body (str) – the response body. It must be str, not unicode, unless you're using an encoding-aware Response subclass, such as TextResponse.
    meta (dict) – the initial values for the Response.meta attribute. If given, the dict will be shallow copied.
    flags (list) – is a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
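
Responses are rarely built by hand; in practice the meta dict is mostly used to pass values from one callback to the next. A minimal sketch (URLs and field names are placeholders):

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'detailSpider'
    start_urls = ['http://example.com/list']  # placeholder URL

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            # carry data forward to the next callback via the request's meta dict
            yield response.follow(href, callback=self.parse_detail,
                                  meta={'list_url': response.url})

    def parse_detail(self, response):
        # response.meta exposes the meta dict of the request that produced this response
        yield {'list_url': response.meta['list_url'], 'url': response.url}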

Other common functions

response.body  # raw response body as bytes
response.text  # decoded, readable text
response.urljoin(href)  # resolve a relative href into an absolute URL
  • Selector examples
# Selectors
sel.extract()  # returns a list of the matched contents
sel.extract_first(default='')  # returns the first match, or '' if there is none
sel.re('(.*)')  # returns a list of the contents captured by the () group
sel.re_first('(.*)')  # returns the first match

#(http://doc.scrapy.org/en/latest/_static/selectors-sample1.html)

# locate by tag
res.xpath('//div/a')
res.css('div a')

# locate by attribute value
res.xpath('//div[@id="images"]/a')
res.css('div[id=images] a')

# locate by attribute value containing a substring
res.xpath('//a[contains(@href,"image")]/img')
res.css('a[href*=image] img')

# select the text inside a tag
res.xpath('//title/text()')
res.css('title::text')

# select an attribute value
res.xpath('//a/@href')
res.css('a::attr(href)')
