老男孩 (Old Boy) Crawler Practical Boot Camp, Season 1, June 2018 — Getting Started with Crawlers, Hands-on 2: Auto-login to Chouti (抽屉网)

New content compared with the previous session: auto-login to Chouti, auto-upvote, automatic paging

1. Auto-login to the site

  When logging in, deliberately mistype the credentials in the browser so that the login request shows up (in the devtools Network panel); inspect its contents to get the form data fields.

2. Using cookies

  The site crawled this time uses a cookie-authorization mechanism: you must first visit the main site to be assigned an unauthorized cookie, then log in while carrying that cookie so the server authorizes it.
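  The same two-step flow can also be written with requests.Session, which carries cookies across requests automatically instead of passing a cookie dict around by hand. A minimal sketch, assuming the same URLs and form fields as the full script below (the phone number and password here are placeholders):

import requests

session = requests.Session()
session.headers.update({'user-agent': 'Mozilla/5.0'})  # same idea as the browser UA used in the script below

# Step 1: the first visit assigns the (not yet authorized) cookie to the session
session.get('https://dig.chouti.com/all/hot/recent/1')

# Step 2: log in; the server authorizes the cookie the session is already carrying
session.post(
    url='https://dig.chouti.com/login',
    data={'phone': '86xxxxxxxxxxx', 'password': 'xxxxxx', 'oneMonth': '1'}  # placeholder credentials
)

# Step 3: any later request (e.g. voting) reuses the now-authorized cookie automatically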

Code:

import requests
from bs4 import BeautifulSoup
# 1. Visit Chouti first to get a cookie (not yet authorized); this page would always be visited before upvoting anyway
r1 = requests.get(
    url='https://dig.chouti.com/all/hot/recent/1',
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
    }
)
r1_cookie_dict = r1.cookies.get_dict()

# 2. Send the phone number and password for authentication, along with the (unauthorized) cookie
response_login = requests.post(
    url='https://dig.chouti.com/login',
    data={
        'phone': '8613026354610',    # phone number with the 86 country-code prefix
        'password': 'halou445513',
        'oneMonth': '1'              # stay logged in for one month
    },
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
    },
    cookies=r1_cookie_dict
)
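# Optional sanity check: print the login response body to confirm authentication succeeded before voting
# print(response_login.text)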
# 3. Walk the pages, collect each link id, and upvote it
for page_num in range(1,3):
    response_index = requests.get(
        url='https://dig.chouti.com/all/hot/recent/%s'%page_num,
        headers={
            'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
        }
    )
    # print(response_index.text)
    soup = BeautifulSoup(response_index.text,"html.parser")
    div = soup.find(attrs={'id':'content-list'})
    items = div.find_all(attrs={'class':'item'})

    for item in items:
        tag = item.find(attrs={'class': 'part2'})
        if not tag:
            continue
        nid = tag.get('share-linkid')
        print(nid)
        # Upvote: send the vote with the cookie from step 1, which was authorized at login
        r_vote = requests.post(
            url='https://dig.chouti.com/link/vote?linksId=%s' % nid,
            headers={
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
             },
            cookies=r1_cookie_dict
        )
        print(r_vote.text)

Other notes:

1. Common requests parameters:

url, params, headers, cookies

data, json: data sends the payload as a form-encoded dict; json serializes it to a JSON string (see the sketch after this list)

files: upload files (stream for chunked/streaming upload, not covered here)

auth: credentials for the browser's built-in login popup (HTTP basic auth)

proxies: proxy servers

cert, verify: certificate-related options, rarely needed
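
A small sketch of these parameters in use; httpbin.org is used here purely as a neutral test endpoint, and the proxy address and file name are placeholders:

import requests

# data= sends a form-encoded body; json= serializes the payload to a JSON string
requests.post('https://httpbin.org/post', data={'k1': 'v1'})
requests.post('https://httpbin.org/post', json={'k1': 'v1'})

# files= uploads a file as multipart/form-data (placeholder file name)
requests.post('https://httpbin.org/post', files={'f1': open('report.txt', 'rb')})

# auth= answers the browser's built-in basic-auth popup
requests.get('https://httpbin.org/basic-auth/user/passwd', auth=('user', 'passwd'))

# proxies= routes the request through a proxy (placeholder address)
requests.get('https://httpbin.org/ip', proxies={'https': 'http://127.0.0.1:8888'})

# verify=False skips TLS certificate verification (use with care); cert= supplies a client certificate
requests.get('https://httpbin.org/get', verify=False)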

2. CAPTCHAs (an AI-adjacent problem):

  • The PIL module can handle simple CAPTCHAs, with roughly a 70–80% pass rate on the easy ones (a minimal sketch follows this list)
  • Buy a third-party CAPTCHA-solving service
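
For the PIL route, a minimal preprocessing sketch; everything here is illustrative — the file name captcha.png, the threshold value 140, and the use of pytesseract for the actual OCR step are assumptions, not part of the course material:

from PIL import Image
import pytesseract

img = Image.open('captcha.png')                    # the downloaded CAPTCHA image (hypothetical name)
img = img.convert('L')                             # convert to grayscale
img = img.point(lambda p: 255 if p > 140 else 0)   # simple binarization to drop light noise
print(pytesseract.image_to_string(img).strip())    # OCR the cleaned-up image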

 


Reposted from www.cnblogs.com/yhstcxx/p/10952107.html