Java Crawler Tutorial: Simulate User Form Login

Reprinted from: http://xiaolongonly.cn/2016/06/01/Reptile3/

This is the third part of the crawler tutorial, teaching you how to simulate user form login.

Preliminary preparation:

jsoup 1.8.3 jar package
Any version of Eclipse that can run Java
Google Chrome

Step 1: As before, analyze the page structure

We want to simulate a CSDN user's form login so that we can fetch data that is only available after logging in.
On some websites and forums, certain content can only be viewed by users with specific permissions, such as members.
That is why simulating a form login matters.

Go to the login page and press F12 to inspect the page's elements.

This time we only need the form tag and the username and password input fields inside it.

Step 2: The operations required for the POST request

A form's id is normally unique on the page, so selecting the form by its id is easy.
Here the form's id is fm1:

List<Element> et = d1.select("#fm1");// get the form

The next step is to get the username and password input fields inside the form. They could also be selected by id, but the name attribute is used here,
because a website normally pairs each field's name attribute with its value and posts those pairs to the server.
Put each name/value pair straight into a Map object, as follows:

Map<String, String> datas = new HashMap<>();
datas.put(e.attr("name"), e.attr("value"));

This collects the name/value pairs of the form's fields, including the username and password inputs.
Alternatively, you can skip the traversal and put the two values in directly:

datas.put("username", "your username");
datas.put("password", "your password");

That is a lot less code, isn't it?
After this step, the data we want to post is stored in datas.
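To make concrete what posting the datas map means, here is a minimal sketch of how name/value pairs are typically serialized into an application/x-www-form-urlencoded request body, which is what a browser sends for an ordinary HTML form. The class name FormBodySketch, the sample credentials, and the hidden field named token are illustrative assumptions and not part of the original code. jsoup performs this encoding itself when .data(datas) is used, so this is only meant to show what goes over the wire.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBodySketch {

    // Encode name/value pairs the way an HTML form POST does:
    // each name and value is URL-encoded, pairs are joined with '&'.
    public static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> datas = new LinkedHashMap<>();
        datas.put("username", "alice@example.com"); // placeholder account
        datas.put("password", "p&ss word");         // placeholder password
        datas.put("token", "T-123");                // hypothetical hidden form field
        System.out.println(encode(datas));
    }
}
```

Running main prints username=alice%40example.com&password=p%26ss+word&token=T-123. Hidden fields matter here: traversing the whole form, as the tutorial does, picks them up automatically, while the two-line shortcut does not.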
You can perform a second request to post the login information.
Response login = con2.ignoreContentType(true).method(Method.POST)
.data(datas).cookies(rs.cookies()).execute();
The POST carries the login data in the Map together with the cookies returned by the first request.
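As a rough illustration of what carrying the cookies involves, the sketch below uses only the JDK (not jsoup) to turn the Set-Cookie values of a first response into the Cookie header sent with a second request. The class name CookieHeaderSketch and the cookie names and values are made up for the example; with jsoup, rs.cookies() and .cookies(...) do the equivalent work for you.

```java
import java.net.HttpCookie;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class CookieHeaderSketch {

    // Collect cookie names and values from the Set-Cookie header values of a response.
    public static Map<String, String> fromSetCookie(List<String> setCookieValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String value : setCookieValues) {
            for (HttpCookie cookie : HttpCookie.parse(value)) {
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
        return cookies;
    }

    // Join the pairs into the value of a Cookie request header: "a=1; b=2".
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner header = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            header.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        // Made-up Set-Cookie values standing in for the first response.
        List<String> setCookie = List.of(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "lang=en; Path=/");
        Map<String, String> cookies = fromSetCookie(setCookie);
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```

Only the name and value of each cookie travel back to the server; attributes such as Path and HttpOnly are instructions to the client and are not echoed in the Cookie header.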

Step 3: Start writing code

It really is that simple. Now let's look at the full code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
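/**
 * Simulate logging in to CSDN with Jsoup.
 *
 * The general approach:
 *
 * The first request fetches the login page to obtain the page content,
 * including the form fields and the cookies (this is important: without
 * the cookies the simulated login fails).
 *
 * The second request sets the username and password and sends the
 * cookies from the first request along with them.
 *
 * How do we know whether the login succeeded?
 *
 * After logging in, print the page; it shows the account details.
 *
 * @date June 13, 2016
 * @author xiaolong
 */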
public class LoginDemo {
    public static void main(String[] args) throws Exception {
        LoginDemo loginDemo = new LoginDemo();
        loginDemo.login("your account", "password");// enter your CSDN username and password
    }
    /**
     * Simulate logging in to CSDN.
     *
     * @param userName
     *            the username
     * @param pwd
     *            the password
     */
    public void login(String userName, String pwd) throws Exception {
        // First request: fetch the login page
        Connection con = Jsoup
                .connect("https://passport.csdn.net/account/login");// open the connection
        con.header("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");// pretend to be a browser
        Response rs = con.execute();// get the response
        Document d1 = Jsoup.parse(rs.body());// parse into a DOM tree
        List<Element> et = d1.select("#fm1");// get the form; its id can be found by viewing the page source
        // Get the cookies and form fields; the map below holds the data to post
        Map<String, String> datas = new HashMap<>();
        for (Element e : et.get(0).getAllElements()) {
            if (e.attr("name").equals("username")) {
                e.attr("value", userName);// set the username
            }
            if (e.attr("name").equals("password")) {
                e.attr("value", pwd); // set the password
            }
            if (e.attr("name").length() > 0) {// skip elements without a name attribute
                datas.put(e.attr("name"), e.attr("value"));
            }
        }
        /**
         * Second request: post the form data along with the cookies.
         */
        Connection con2 = Jsoup
                .connect("https://passport.csdn.net/account/login");
        con2.header("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");
        // Set the cookies and post the map data from above
        Response login = con2.ignoreContentType(true).method(Method.POST)
                .data(datas).cookies(rs.cookies()).execute();
        // Print the page returned after a successful login
        System.out.println(login.body());
        // The cookies from a successful login can be saved locally, so that later runs do not need to log in again
        Map<String, String> map = login.cookies();
        for (String s : map.keySet()) {
            System.out.println(s + "      " + map.get(s));
        }
    }
}
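The comment near the end of login mentions that the cookies from a successful login can be saved locally. One possible way to do that, sketched here under assumptions, is to write the cookie map to a properties file and read it back on the next run. The class name CookieFileSketch, the file format, and the cookie value are illustrative choices, not something the original code prescribes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CookieFileSketch {

    // Write the cookie map in java.util.Properties format.
    public static void save(Map<String, String> cookies, Writer out) throws IOException {
        Properties props = new Properties();
        props.putAll(cookies);
        props.store(out, "session cookies");
    }

    // Read a cookie map back from java.util.Properties format.
    public static Map<String, String> load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        Map<String, String> cookies = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            cookies.put(name, props.getProperty(name));
        }
        return cookies;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> cookies = new HashMap<>();
        cookies.put("JSESSIONID", "abc123"); // made-up value; in LoginDemo this would be login.cookies()

        // A temporary file keeps the example from overwriting anything.
        Path file = Files.createTempFile("cookies", ".properties");
        try (Writer out = Files.newBufferedWriter(file)) {
            save(cookies, out);
        }
        try (Reader in = Files.newBufferedReader(file)) {
            System.out.println(load(in));
        }
    }
}
```

The loaded map could then be passed to .cookies(...) on a later jsoup connection. Saved session cookies eventually expire on the server, so a real crawler would still need to fall back to a fresh login.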

Summary

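This class makes two requests to the website.
The first request obtains the cookies.
The second request posts the cookies together with the login data to simulate the login.
It is that simple.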

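Tips:
To simulate a user form login, the request headers are indispensable. "User-Agent" identifies the browser making the request.
As the figure below shows, a request can carry many headers,
and the server may check them to decide whether a user's POST/GET request is legitimate.
So request headers matter, request headers matter, request headers matter. (Important things are worth saying three times.)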

[Screenshot: request headers in the browser's developer tools]

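These are the headers that every request after login needs to carry. You can press F12 and open the page's network panel to view the request and response headers.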
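For readers who want to see these headers set in code without jsoup, here is a small sketch using the JDK's own java.net.http API (Java 11 or later), which is a different client from the one used in this tutorial. It only builds the request and prints its headers; nothing is sent. The class name RequestHeaderSketch, the example.com URL, and the form and cookie values are placeholders chosen for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RequestHeaderSketch {

    // Build (but do not send) a login POST that carries browser-like headers.
    public static HttpRequest buildLoginPost(String url, String formBody, String cookieHeader) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Cookie", cookieHeader)
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildLoginPost(
                "https://example.com/login",        // placeholder URL
                "username=alice&password=secret",   // placeholder form body
                "JSESSIONID=abc123");               // placeholder cookie
        System.out.println(request.method() + " " + request.uri());
        // Print every header name with its values.
        request.headers().map().forEach(
                (name, values) -> System.out.println(name + ": " + String.join(", ", values)));
    }
}
```

Comparing this printout with the headers shown in the browser's network panel is a quick way to spot a header the server expects but the crawler is not sending.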
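Now that you can simulate a form login, it is time to put it to good use.
This concludes the crawler tutorial. I hope interested readers will keep exploring.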
