Python Web Crawler 7 (Scrapy Simulated Login in Practice)

Scrapy Simulated Login in Practice | Logging in to Renren

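# Scrapy simulated login in practice | Logging in to Renren
'''
Use Scrapy to log in to a website in place of a human; once logged in, deeper pages can be crawled.
Capture and inspect the traffic on the login page to locate the URL the form is submitted to.
'''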
# login.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request, FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['renren.com']
    # start_urls is only used when start_requests() is not defined
    # start_urls = ['http://baidu.com/']
    # Pretend to be a browser
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"}
    # If start_requests() is defined, Scrapy takes its first requests from it; otherwise it requests the URLs in start_urls
    def start_requests(self):
        # Fetch the login page first, then continue in the callback parse().
        # The "cookiejar" meta key names the cookie session to use; 1 is just an identifier for that session
        return [Request("http://www.renren.com/PLogin.do", meta={"cookiejar": 1}, callback=self.parse)]

    def parse(self, response):
        # POST fields to submit; there is no captcha field at this point
        data = {
            "email": "your account",
            "password": "your password"
        }
        print("Logging in...")
        # Log in through FormRequest.from_response()
        return [FormRequest.from_response(
            response,
            # Keep using the same cookie session
            meta = {"cookiejar": response.meta["cookiejar"]},
            # Send browser-like headers
            headers = self.header,
            # Data for the POST form
            formdata = data,
            # The callback for the login response is next()
            callback = self.next
        )]

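    def next(self, response):  # Called with the response to the login form submission
        print('--Login succeeded 1--')
        data = response.body  # Raw bytes of the current response
        # Save the page to a local file
        with open('./ren1.html', 'wb') as fh:
            fh.write(data)
        # Reuse the same cookie session so the profile request is sent as the logged-in user
        yield Request("http://www.renren.com/974220719/profile", callback=self.next2, meta={"cookiejar": response.meta["cookiejar"]})

    def next2(self, response):
        print('--Login succeeded 2--')
        data = response.body
        # Save the page to a local file
        with open('./ren2.html', 'wb') as fh:
            fh.write(data)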
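
The callbacks above write `response.body` to disk, and `response.body` is `bytes`, so the file mode matters. Here is a minimal standard-library sketch of the difference. `fake_body`, `save_body`, and the temporary path are hypothetical stand-ins for a real response and output file, not part of the spider:

```python
import os
import tempfile

# Hypothetical stand-in for response.body, which Scrapy exposes as bytes
fake_body = b"<html><body>profile page</body></html>"

def save_body(path, body, mode):
    """Write body to path with the given mode; return True on success, False on TypeError."""
    try:
        with open(path, mode) as fh:
            fh.write(body)
        return True
    except TypeError:
        # Raised when bytes are written to a file opened in text mode
        return False

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "ren1.html")

text_mode_ok = save_body(path, fake_body, "w")     # text mode rejects bytes
binary_mode_ok = save_body(path, fake_body, "wb")  # binary mode accepts bytes

print(text_mode_ok, binary_mode_ok)  # False True
```

If a decoded string is preferred, `response.text` can be written in `'w'` mode with an explicit `encoding` argument instead.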

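Bugs encountered while running:

1. TypeError: write() argument must be str, not bytes
response.body is bytes, so the file must be opened in binary mode.
Wrong code: fh = open('./ren1.html', 'w')
Fixed code: fh = open('./ren1.html', 'wb')
2. DEBUG: Filtered offsite request to 'URL'
Explanation: the domain of the requested URL does not match any domain in allowed_domains, so Scrapy filters the request out automatically.
There are two ways to fix this:
1) Set allowed_domains = [] (or add the target domain to the list)
2) yield scrapy.Request(url, callback=self.parse, dont_filter=True)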
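
To make the offsite filter concrete, here is a rough standard-library sketch of the kind of host check involved. It is a simplified approximation of the idea, not Scrapy's actual implementation, and `is_offsite` is a hypothetical helper name:

```python
import re
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Return True if the URL's host falls outside allowed_domains (simplified approximation)."""
    if not allowed_domains:
        # An empty list disables the filter: every host is allowed
        return False
    host = urlparse(url).hostname or ""
    # Accept each listed domain and any of its subdomains
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is None

print(is_offsite("http://www.renren.com/PLogin.do", ["renren.com"]))  # False
print(is_offsite("http://baidu.com/", ["renren.com"]))                # True
print(is_offsite("http://baidu.com/", []))                            # False
```

This illustrates why both fixes work in principle: an empty `allowed_domains` leaves nothing to compare against, while `dont_filter=True` asks Scrapy to skip the check for that one request.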

牧阳MuYoung
