Simulating Login with the Scrapy Framework

Note: when simulating a login, make sure COOKIES_ENABLED (the cookies middleware) is switched on in settings.py:

COOKIES_ENABLED = True
# COOKIES_ENABLED = False

Strategy 1: POST the login data directly (e.g. the account credentials)

This method works whenever the site only needs POSTed form data. In the example below, the POSTed data is the account name and password:

  • Use yield scrapy.FormRequest(url, formdata, callback) to send a POST request.
  • To send the POST request as soon as the spider starts, override the spider's start_requests(self) method so the URLs in start_urls are never requested.
import scrapy

class MyrenrenSpider(scrapy.Spider):
    name = 'myrenren'
    allowed_domains = ['renren.com']
    # start_urls = ['http://renren.com/']
    # Override start_requests(); the spider calls this method first,
    # so the URLs in start_urls are never requested
    def start_requests(self):
        loginUrl = "http://www.renren.com/PLogin.do"
        # Submit the form data:
        #   formdata - the form fields
        #   callback - called with the response after login
        yield scrapy.FormRequest(url=loginUrl,
                                 formdata={"email": "your account",
                                           "password": "your password"},
                                 callback=self.parse
                                 )

    def parse(self, response):
        print('*******',response.url)
        print(response.body.decode('utf-8'))

Strategy 2: the standard simulated-login steps

The canonical way to simulate a login:

  1. First send a GET request for the login page and extract the parameters the login requires (for example, the _xsrf token on zhihu's login page).
  2. Then POST those parameters to the server together with the account name and password to log in.
  3. Use the FormRequest.from_response() method to simulate the user logging in.
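Step 1 can be sketched without scrapy: pull the hidden token out of the login page's HTML so it can be POSTed along with the credentials. The HTML snippet and helper below are made-up stand-ins, not zhihu's real markup:

```python
# A minimal sketch of step 1: extract a hidden CSRF-style token
# (e.g. zhihu's _xsrf) from the login page before POSTing it.
# The HTML below is a made-up stand-in for a real login page.
import re

login_page_html = """
<form action="/login" method="post">
    <input type="hidden" name="_xsrf" value="abc123token"/>
    <input name="email"/><input name="password" type="password"/>
</form>
"""

def extract_xsrf(html):
    """Return the value of the hidden _xsrf field, or None if absent."""
    match = re.search(r'name="_xsrf"\s+value="([^"]+)"', html)
    return match.group(1) if match else None

print(extract_xsrf(login_page_html))  # abc123token
```

In a real spider, FormRequest.from_response() does this extraction for you by reading the form's hidden fields from the response.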

Simulating a browser login

start_requests() returns the spider's initial requests. They play the same role as start_urls: whatever start_requests() yields replaces the requests that would otherwise be generated from start_urls.

Request() sends a GET request; you can set the url, cookies, and a callback function.

FormRequest.from_response() submits a form via POST. Its first (required) argument is the response object that carries the previous request's cookies; the other arguments include cookies, url, the form fields, and so on.

yield Request() hands a new request back to the engine to be scheduled.

Cookie handling when sending requests:

  • meta={'cookiejar': 1} turns on cookie tracking; put it in the first Request().
  • meta={'cookiejar': response.meta['cookiejar']} reuses the cookies from the previous response; put it in the FormRequest.from_response() that POSTs the credentials.
  • meta={'cookiejar': True} uses the authorized cookies to visit pages that require login.

Example: crawling personal profile pages on renren:

# -*- coding: utf-8 -*-

import scrapy
from scrapy.http import HtmlResponse
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MyrenrenSpider(CrawlSpider):
    name = 'myrenren2'
    allowed_domains = ['renren.com']
    # A personal profile page
    start_urls = ['http://www.renren.com/353111356/profile']
    # Follow links to other people's profile pages
    rules = [Rule(LinkExtractor(allow=(r"(\d+)/profile",)), callback="get_parse", follow=True)]

    # The first method called when the spider starts; called only once
    def start_requests(self):
        indexURL = "http://www.renren.com"
        # Request the index page first, with cookie tracking turned on
        # (meta={"cookiejar": 1}; COOKIES_ENABLED must be True in settings.py)
        yield scrapy.Request(url=indexURL,
                             meta={"cookiejar": 1},
                             callback=self.login
                             )
    def login(self, response):
        print('URL returned from the index page:', response.url)
        # from_response() picks up the login form (including any hidden
        # token fields) from the previous response
        loginUrl = "http://www.renren.com/PLogin.do"
        yield scrapy.FormRequest.from_response(response,  # the previous response
                                               url=loginUrl,  # the login url
                                               # the form fields
                                               formdata={"email": "your account",
                                                         "password": "your password"},
                                               meta={"cookiejar": response.meta['cookiejar']},  # pass the cookies on
                                               callback=self.after_login
                                               )
    # After logging in
    def after_login(self, response):
        print('URL returned after login:', response.url)
        for url in self.start_urls:
            yield scrapy.Request(url, meta={"cookiejar": response.meta['cookiejar']})

    # Override CrawlSpider's _requests_to_follow() method
    def _requests_to_follow(self, response):
        """Overridden to propagate the cookiejar to followed links"""
        print('Followed URL:', response.url)
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)

                # Propagate the cookiejar
                r.meta.update(rule=n, link_text=link.text, cookiejar=response.meta['cookiejar'])

                yield rule.process_request(r)
    # Handle the crawled pages
    def get_parse(self, response):
        print('URL returned while following links:', response.url)
        print(response.body.decode('utf-8'))

Strategy 3: log in directly with a saved, logged-in Cookie

If nothing else works, you can simulate the login this way. It is a bit more tedious, but the success rate is essentially 100%.

First log in in the browser and copy the cookie string, convert it into a dict, save it in a COOKIES pool in settings.py, and then use a downloader middleware to attach the cookies to every request.

1. Convert the cookie string into a dict
def cookieChangeToDict(cookie):
    '''
    Convert a cookie string into a dict
    :param cookie: the cookie string copied after logging in
    :return: dict
    '''
    cookieDict = {}
    for item in cookie.split(';'):
        name, value = item.split('=', maxsplit=1)
        cookieDict[name.strip()] = value.strip()
    return cookieDict

if __name__ == '__main__':
    cookie = """
    your cookie
    """
    print(cookieChangeToDict(cookie))
# Paste the printed dict into the custom COOKIES = [] list in settings.py
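For reference, the pool is just a list of such dicts, one per logged-in account. A hypothetical shape (the cookie names below are made up):

```python
# settings.py (hypothetical entries; the cookie names below are made up)
COOKIES = [
    {"anonymid": "valueA", "t": "tokenA"},  # cookies of account 1
    {"anonymid": "valueB", "t": "tokenB"},  # cookies of account 2
]
```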
2. Send requests with the saved cookie

Option 1:

# Override the spider's start_requests() method and attach the cookie dict
# to the initial request (this is a plain GET, not a POST)
    def start_requests(self):
        url = ''
        return [scrapy.Request(url, cookies=self.cookies, callback=self.parse)]

Option 2: use a downloader middleware:

from scrapy import signals
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
import random

from renren.settings import COOKIES

class RandomCookieMiddleware(CookiesMiddleware):
    '''
    Attach a random cookie dict from the pool to every request
    '''
    def process_request(self, request, spider):
        cookie = random.choice(COOKIES)
        request.cookies = cookie
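What the middleware does can be demonstrated standalone, without scrapy installed, by using a stand-in for scrapy's Request object (FakeRequest and the pool contents below are made up for illustration):

```python
# Standalone sketch of the RandomCookieMiddleware logic, without scrapy:
# a stand-in class plays the role of scrapy's Request.
import random

COOKIES = [{"sid": "a1"}, {"sid": "b2"}, {"sid": "c3"}]  # made-up pool

class FakeRequest:
    """Minimal stand-in for scrapy.Request: just holds url and cookies."""
    def __init__(self, url):
        self.url = url
        self.cookies = None

def process_request(request):
    # Same logic as the middleware: attach one random cookie dict per request
    request.cookies = random.choice(COOKIES)

req = FakeRequest("http://www.renren.com")
process_request(req)
print(req.cookies in COOKIES)  # True
```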

Set the following in settings.py:


ROBOTSTXT_OBEY = False

COOKIES_ENABLED = True

# Enable the middleware
DOWNLOADER_MIDDLEWARES = {
   'renren.middlewares.RandomCookieMiddleware': 543,
}

# The COOKIES pool
COOKIES = [
]

Strategy 4: use selenium

1. Simulating a desktop login to CSDN

The spider:

import scrapy

class CSDNSpider(scrapy.Spider):
    name = 'myCSDN'
    allowed_domains = ['csdn.net']
    start_urls = ['http://passport.csdn.net/account/login',
                  'http://my.csdn.net/my/account/changepwd']

    def __init__(self):
        super().__init__()
        self.driver = None   # the selenium webdriver instance
        self.cookies = None  # used to save the cookies

    def parse(self, response):
        print(response.url)
        print(response.body.decode('utf-8'))

The middleware:

import time
import requests
from selenium import webdriver
from scrapy.http import HtmlResponse

class LoginMiddleware(object):

    def process_request(self, request, spider):
        '''
        :param request: the request being processed
        :param spider: the spider that issued it
        :return:
        '''
        # Only handle this particular spider
        if spider.name == 'myCSDN':
            # Is this the login page?
            if request.url.find('login') != -1:
                spider.driver = webdriver.Chrome()
                spider.driver.get(request.url)
                spider.driver.find_element_by_xpath('/html/body/div[3]/div/div/div[2]/div/h3/a').click()
                time.sleep(2)
                # Type in the account name and password
                username = spider.driver.find_element_by_xpath('//*[@id="username"]')
                password = spider.driver.find_element_by_xpath('//*[@id="password"]')
                username.send_keys('your account')
                password.send_keys('your password')
                # Click the "log in" button
                spider.driver.find_element_by_xpath('//*[@id="fm1"]/input[8]').click()
                time.sleep(3)
                spider.cookies = spider.driver.get_cookies()
                return HtmlResponse(url=spider.driver.current_url,  # 登录后的url
                                    body=spider.driver.page_source,  # html源码
                                    encoding='utf-8')

            # Not the login page
            else:
                # Replay the saved cookies through a requests session
                session = requests.Session()
                for cookie in spider.cookies:
                    session.cookies.set(cookie['name'], cookie['value'])
                # Clear the default headers
                session.headers.clear()
                response = session.get(request.url)
                return HtmlResponse(url=response.url,
                                    body=response.text,
                                    encoding='utf-8')

Set the following in settings.py:

ROBOTSTXT_OBEY = False

COOKIES_ENABLED = True

# Enable the middleware
DOWNLOADER_MIDDLEWARES = {
   'renren.middlewares.LoginMiddleware': 543,
}

2. Simulating a mobile login to taobao

The spider:

import scrapy

class TaobaoSpider(scrapy.Spider):
    name = 'mytaobao'

    allowed_domains = ['taobao.com']
    start_urls = ['https://login.m.taobao.com/login.htm',
                  "http://h5.m.taobao.com/mlapp/olist.html?spm=a2141.7756461.2.6"]
    def __init__(self):
        super().__init__()  
        self.browser = None
        self.cookies = None

    def parse(self, response):
        # Print the url and the page source

        print(response.url)
        print(response.body.decode("utf-8", "ignore"))

The middleware:

import time
import requests
from selenium import webdriver
from scrapy.http import HtmlResponse

class LoginMiddleware(object):

    def process_request(self, request, spider):
        if spider.name == "mytaobao":  # only handle this particular spider
            if request.url.find("login") != -1:  # is this the login page?
                mobilesetting = {"deviceName": "iPhone 6 Plus"}
                options = webdriver.ChromeOptions()  # browser options
                options.add_experimental_option("mobileEmulation", mobilesetting)  # emulate a phone
                spider.browser = webdriver.Chrome(chrome_options=options)  # launch a browser
                spider.browser.set_window_size(400, 800)  # size it like a phone

                spider.browser.get(request.url)  # open the login page
                time.sleep(3)
                print("login visit:", request.url)
                username = spider.browser.find_element_by_id("username")
                password = spider.browser.find_element_by_id("password")
                time.sleep(1)
                username.send_keys("your account")  # account
                time.sleep(2)
                password.send_keys("your password")  # password
                time.sleep(2)
                click = spider.browser.find_element_by_id("btn-submit")
                click.click()
                time.sleep(18)
                spider.cookies = spider.browser.get_cookies()  # grab all the cookies
                # spider.browser.close()

                return HtmlResponse(url=spider.browser.current_url,  # the post-login url
                                    body=spider.browser.page_source,  # the html source
                                    encoding="utf-8")  # hand the page back to the spider
            else:
                # Replay the saved cookies with requests instead of selenium:
                # faster, but requests cannot execute js
                print("requests visit")
                req = requests.session()  # a requests session
                for cookie in spider.cookies:
                    req.cookies.set(cookie['name'], cookie["value"])
                req.headers.clear()  # clear the default headers
                newpage = req.get(request.url)
                print("---------------------")
                print(request.url)
                print("---------------------")
                print(newpage.text)
                print("---------------------")
                time.sleep(3)
                return HtmlResponse(url=request.url,  # the requested url
                                    body=newpage.text,  # the page source
                                    encoding="utf-8")  # hand the page back to the spider

Reprinted from blog.csdn.net/lm_is_dc/article/details/81045288