Login Forms (urllib vs. requests)

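This post compares urllib and requests by working through a login form. The urllib code comes from Chapter 6, Section 6.1 of the book "Web Scraping with Python", and the test site is also the one used in that book.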

Test site: http://example.webscraping.com/

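The form code on the login page is shown below. It has 7 input tags, 3 of which are hidden. The input with name=_formkey carries a random string that serves as a unique ID to prevent the form from being submitted more than once.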

<form action="#" enctype="application/x-www-form-urlencoded" method="post">
	<table>
		<tr id="auth_user_email__row">
			<td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">电子邮件: </label></td>
			<td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
			<td class="w2p_fc"></td>
		</tr>
		<tr id="auth_user_password__row">
			<td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">密码: </label></td>
			<td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
			<td class="w2p_fc"></td>
		</tr>
		<tr id="auth_user_remember_me__row">
			<td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">记住我(30 天): </label></td>
			<td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
			<td class="w2p_fc"></td>
		</tr>
		<tr id="submit_record__row">
			<td class="w2p_fl"></td>
			<td class="w2p_fw"><input type="submit" value="Log In" /><button class="btn w2p-form-button" onclick="window.location=&#x27;/places/default/user/register?_next=%2Fplaces%2Fdefault%2Findex&#x27;;return false">注册</button></td>
			<td class="w2p_fc"></td>
		</tr>
	</table>
	<div style="display:none;">
		<input name="_next" type="hidden" value="/places/default/index" />
		<input name="_formkey" type="hidden" value="f8c4ab38-8a8f-46b3-a6b7-119f1cbd5851" />
		<input name="_formname" type="hidden" value="login" />
	</div>
</form>
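As a quick check on the counts stated above, here is a minimal sketch that uses the standard library's `html.parser` on a trimmed copy of the form (table markup and labels omitted, input tags kept as they appear). `InputCounter` is a hypothetical helper name introduced only for this illustration; it is not from the book.

```python
from html.parser import HTMLParser

# Trimmed copy of the login form: only the <input> tags are kept.
FORM_HTML = """
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
  <input class="string" id="auth_user_email" name="email" type="text" value="" />
  <input class="password" id="auth_user_password" name="password" type="password" value="" />
  <input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" />
  <input type="submit" value="Log In" />
  <div style="display:none;">
    <input name="_next" type="hidden" value="/places/default/index" />
    <input name="_formkey" type="hidden" value="f8c4ab38-8a8f-46b3-a6b7-119f1cbd5851" />
    <input name="_formname" type="hidden" value="login" />
  </div>
</form>
"""


class InputCounter(HTMLParser):
    """Hypothetical helper: records the attributes of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also arrive here, because
        # the default handle_startendtag delegates to this method.
        if tag == "input":
            self.inputs.append(dict(attrs))


counter = InputCounter()
counter.feed(FORM_HTML)

# Names of the hidden fields, and of every field that has a name at all.
hidden = [i["name"] for i in counter.inputs if i.get("type") == "hidden"]
named = [i["name"] for i in counter.inputs if "name" in i]

print(len(counter.inputs))
print(hidden)
print(named)
```

The submit button has no name attribute, so it is counted as an input but would not appear in name-based POST data.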

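Below is the urllib version. The book's code is written in Python 2; here it is ported to Python 3. The submitted form data must include every field, hidden ones included.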

from urllib.request import build_opener, HTTPCookieProcessor, Request
from urllib.parse import urlencode
from http import cookiejar
import lxml.html

# Parse the name and value of every form field to submit and return them as a dict to use as the POST data
def parse_form(html_text):
    tree = lxml.html.fromstring(html_text)
    form_data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            form_data[e.get('name')] = e.get('value')
    return form_data

login_url = 'http://example.webscraping.com/places/default/user/login?_next=/places/default/index'
login_email = '[email protected]'
login_password = 'example'

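# Cookies are required: when the login page loads, the _formkey value is saved in a cookie,
# and that value is then compared with the _formkey in the submitted POST data.
# Without cookies the login fails.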
opener = build_opener(HTTPCookieProcessor(cookiejar.CookieJar()))
html = opener.open(login_url).read()

data = parse_form(html)
# Fill the login email and password into the POST data
data['email'] = login_email
data['password'] = login_password

# The POST data must be encoded with urlencode and then encoded to bytes as UTF-8 before it can be submitted
response = opener.open(Request(login_url, urlencode(data).encode('utf-8')))
print(response.geturl())

Finally, it prints the URL reached after the POST:

http://example.webscraping.com/places/default/index

This is the home page URL, which means the login succeeded. Had it failed, the URL would still be that of the login page.
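The URL comparison described above can be turned into a small check. This is only a sketch: `login_succeeded` is a hypothetical helper name, and the heuristic assumes the site behaves as described here, redirecting away from the login page on success and staying on it on failure.

```python
from urllib.parse import urlsplit


def login_succeeded(final_url, login_url):
    """Return True if the final URL's path differs from the login page's path."""
    return urlsplit(final_url).path != urlsplit(login_url).path


login_url = 'http://example.webscraping.com/places/default/user/login?_next=/places/default/index'

# Landing on the home page counts as success; staying on the login page does not.
print(login_succeeded('http://example.webscraping.com/places/default/index', login_url))
print(login_succeeded(login_url, login_url))
```

Comparing only the path ignores the `_next` query string, so a failed login that comes back with a slightly different query still counts as a failure.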

Next, the same login written with the requests library:

import requests
import lxml.html

login_url = 'http://example.webscraping.com/places/default/user/login?_next=/places/default/index'
login_email = '[email protected]'
login_password = 'example'

login_page = requests.get(login_url)
login_page_text = login_page.text
tree = lxml.html.fromstring(login_page_text)
# After loading the login page, extract the _formkey value to build the POST data
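# Select the input by name rather than by position, so the lookup does not depend on input order
formkey = tree.cssselect('form input[name="_formkey"]')[0].get('value')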

# POST data
data = {
    'email': login_email,
    'password': login_password,
    'remember_me': 'on',
    '_next': '/',
    '_formkey': formkey,
    '_formname': 'login'
    }
# The POST request must carry the cookies, otherwise the login fails
response = requests.post(login_url, data, cookies=login_page.cookies)
print(response.url)

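This differs a little from the urllib version above: there the POST data was generated by the parse_form function, while here it is written out directly as a dict and needs no manual encoding. What is the same is that the submission must include every field, hidden ones included, and must use cookies.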
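To make the encoding difference concrete, the sketch below shows what the manual step in the urllib version produces. The field values are placeholders chosen for illustration, not the real login data. requests form-encodes a dict passed as `data` on its own, which is why the second script skips this step.

```python
from urllib.parse import urlencode

# Placeholder values for illustration only.
data = {
    'email': 'user@example.com',
    'password': 'example',
    '_next': '/places/default/index',
}

query = urlencode(data)        # percent-encoded str
body = query.encode('utf-8')   # bytes; urllib requires bytes for POST data in Python 3

print(query)
print(type(body).__name__)
```

Characters such as `@` and `/` are percent-encoded, and the key order follows the dict's insertion order.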

Besides calling requests.post directly, you can also create a Session object, which makes it easier to stay logged in across requests.

# response = requests.post(login_url, data, cookies=login_page.cookies)
session = requests.Session()
response = session.post(login_url, data, cookies=login_page.cookies)
print(response.url)

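This also prints the home page URL, which indicates success:

http://example.webscraping.com/places/default/index

One more point worth noting: this form requires the cookie to be submitted, but not every site's form does, so it depends on the site.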
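As a possible refinement of the session approach, the login page itself can be fetched through the session, so that the session's cookie jar carries the cookie into the POST and `cookies=` need not be passed by hand. The sketch below is an assumption-laden illustration that has not been run against the live site: `login_with_session` is a hypothetical helper name, `session` is assumed to behave like `requests.Session`, and `parse_form` is assumed to be a function like the one in the urllib section.

```python
def login_with_session(session, login_url, email, password, parse_form):
    """Fetch the login page and submit the form through one session.

    `session` is assumed to expose get/post like requests.Session;
    `parse_form` is assumed to map the page HTML to a dict of form fields.
    """
    # The GET goes through the session, so any cookie the server sets
    # is stored in the session's cookie jar.
    login_page = session.get(login_url)

    # Start from the fields found in the page (hidden ones included),
    # then fill in the credentials.
    data = parse_form(login_page.text)
    data['email'] = email
    data['password'] = password

    # The same session sends the stored cookies back with the POST.
    return session.post(login_url, data=data)
```

With requests installed, this could be called as `login_with_session(requests.Session(), login_url, login_email, login_password, parse_form)`, and the returned response's `url` checked as before.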

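This example shows that requests is indeed more convenient than urllib: less code, a clearer interface, and no manual encoding to deal with, so it takes little mental effort.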
