Crawling an educational management system with Python

Yesterday I learned the basics of web crawlers, so on a whim I wrote a crawler to fetch my exam results. Below is the whole process. Since I am just starting out with crawlers, I picked an old login page that does not require a verification code.

Login page (school information redacted): http://xxxjwc.its.xxu.edu.cn/jsxsd/

Results page: http://xxxjwc.its.xxu.edu.cn/jsxsd/kscj/cjcx_list

General idea:

1. Log in with the account number and password, and record the cookie information.

2. Use the cookie information to access the results page.

3. Use XPath to extract the information I want (I have not studied XPath carefully, only roughly, so my usage may not be ideal).

First, find the URL the login request is sent to

1.1 Use the browser developer tools

Open the educational management system page in Chrome, press F12, click the Network tab, and check Preserve log.

1.2 Log in with your account and password and use the tools to find the URL the login is sent to (in fact, you can also see it from the form submission URL in the page source)

At the bottom of the Headers panel you can see that the transmitted form data is a single field called encoded, as follows.

This is not the usual plain account-and-password format, so I read the page source and found the following:

function submitForm1(){  
        try{
            var xh = document.getElementById("userAccount").value;
            var pwd = document.getElementById("userPassword").value; 
            if(xh==""){
                alert("Username cannot be empty!");
                return false;
            }
            if(pwd==""){
                alert("Password cannot be empty!");
                return false;
            } 
            var account = encodeInp(xh);
            var passwd = encodeInp(pwd);
            var encoded = account+"%%%"+passwd;
            document.getElementById("encoded").value = encoded;
            var jzmmid = document.getElementById("Form1").jzmmid;  
            return true;
        }catch(e){
            alert(e.Message);
            return false;
        }
    }

So encoded is the encrypted account and password spliced together with "%%%". The encodeInp() function is defined in conwork.js, as follows:

eval(function(p,a,c,k,e,d){e=function(c){return(c<a?"":e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)d[e(c)]=k[c]||e(c);k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1;};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p;}('b 9="o+/=";p q(a){b e="";b 8,5,7="";b f,g,c,1="";b i=0;m{8=a.h(i++);5=a.h(i++);7=a.h(i++);f=8>>2;g=((8&3)<<4)|(5>>4);c=((5&s)<<2)|(7>>6);1=7&t;k(j(5)){c=1=l}v k(j(7)){1=l}e=e+9.d(f)+9.d(g)+9.d(c)+9.d(1);8=5=7="";f=g=c=1=""}u(i<a.n);r e}',32,32,'|enc4||||chr2||chr3|chr1|keyStr|input|var|enc3|charAt|output|enc1|enc2|charCodeAt||isNaN|if|64|do|length|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|function|encodeInp|return|15|63|while|else'.split('|'),0,{}))
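Incidentally, if you unpack the eval above, the alphabet string (A-Z, a-z, 0-9 plus "+/=") and the 3-bytes-to-4-characters bit shifting suggest that encodeInp is just standard Base64 encoding. Assuming that also holds for your system (worth verifying against the real login page), you could skip PyExecJS entirely and use a pure-Python stand-in:

```python
import base64

def encode_inp(text):
    """Pure-Python stand-in for the site's encodeInp().

    Assumption: the unpacked conwork.js is standard Base64 with '='
    padding, which is what its alphabet and bit arithmetic indicate.
    """
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def make_user_token(username, password):
    # Same "%%%" splicing as submitForm1() in the page source.
    return encode_inp(username) + "%%%" + encode_inp(password)

print(make_user_token("admin", "123456"))
# → YWRtaW4=%%%MTIzNDU2
```

Calling the original JavaScript, as described next, is still the safer choice in case the school ever changes conwork.js.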

So to encrypt the username and password, you can either rewrite this function in Python yourself or call the JavaScript function directly. I chose to call the JavaScript directly, which requires installing the PyExecJS module and then import execjs. Generating encoded from the account and password looks like this:

# Encode the account number and password

def make_user_token(username, password):
    with open('conwork.js') as f:
        ctx = execjs.compile(f.read())
        username_encode = ctx.call('encodeInp', username)
        password_encode = ctx.call('encodeInp', password)
    token = username_encode + '%%%' + password_encode
    return token

With the preparations complete, we can log in and capture the cookie.

Second, log in

Since I am just learning crawlers, I used the most basic approach: the CookieJar class from http.cookiejar to capture the cookie information.
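As a quick offline illustration of what the CookieJar does under the hood (the URL and cookie values here are made-up demo data), you can feed it a fake response carrying a Set-Cookie header and watch it store the cookie:

```python
from http.cookiejar import CookieJar
from urllib import request
from email.message import Message

class FakeResponse:
    """Minimal stand-in for a urllib response: just enough for CookieJar,
    which only needs .info() returning an object with headers."""
    def __init__(self, set_cookie):
        self._headers = Message()
        self._headers["Set-Cookie"] = set_cookie
    def info(self):
        return self._headers

jar = CookieJar()
req = request.Request("http://example.com/login")
# The jar parses the Set-Cookie header and remembers the cookie:
jar.extract_cookies(FakeResponse("JSESSIONID=abc123; Path=/"), req)

for cookie in jar:
    print(cookie.name, cookie.value)   # → JSESSIONID abc123

# An opener built on this jar sends the stored cookie back automatically
# on later requests to the same site:
opener = request.build_opener(request.HTTPCookieProcessor(jar))
```

This is exactly what happens invisibly inside the real login flow below: the login response sets a session cookie, and the opener replays it when fetching the grades page.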

# 1. Log in
def get_opener():
    # 1.1 Create a CookieJar object
    cookiejar = CookieJar()
    # 1.2 Use the cookiejar to create an HTTPCookieProcessor
    handler = request.HTTPCookieProcessor(cookiejar)
    # 1.3 Use the handler from the previous step to build an opener
    opener = request.build_opener(handler)
    return opener

# 1.4 Use the opener to send the login request (with the encoded username and password); the goal is to obtain the cookie
def login(opener, encoded):
    data = {}
    data['encoded'] = encoded
    login_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/xk/LoginToXk'
    req = request.Request(login_url, data=parse.urlencode(data).encode('utf-8'), headers=headers)
    opener.open(req)

Third, access the results page

# 2. Access the results page
def visit_grade(opener):
    grade_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/kscj/cjcx_list'
    req = request.Request(grade_url, headers=headers)
    resp = opener.open(req)
    return resp.read().decode('utf-8')

Fourth, extract the desired grade information

After fetching the results page, you can use split() or XPath to extract the content you want; I chose XPath to get the score information.

    web_data = visit_grade(opener)
    html = etree.HTML(web_data)
    grade_data = html.xpath('.//*[@id="dataList"]')  # Note: set this path according to your own needs

Then drill into the returned element and pick out the information you need for your own situation. One thing to note: if you are crawling data inside a table, the browser normalizes the HTML text to some degree and inserts a tbody element that may not exist in the raw source. If your XPath result is empty because of this, remove the tbody step from the path.
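To see the tbody pitfall in isolation (the table snippet below is made-up demo data, parsed with the standard library rather than lxml), compare a path copied from the browser inspector with one that skips tbody:

```python
import xml.etree.ElementTree as ET

# Raw HTML as the server sends it: no <tbody>, unlike what DevTools shows.
raw = """
<table id="dataList">
  <tr><th>Course</th><th>Score</th></tr>
  <tr><td>Math</td><td>95</td></tr>
</table>
"""

table = ET.fromstring(raw)

# Path copied from the browser inspector -- matches nothing in the raw source:
with_tbody = table.findall("./tbody/tr")
# Same path with tbody removed -- matches the actual rows:
without_tbody = table.findall("./tr")

print(len(with_tbody), len(without_tbody))   # → 0 2
```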

Fifth, the overall result

The complete code:

from urllib import request,parse
from http.cookiejar import CookieJar
from lxml import etree
import execjs

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
}

# 1. Log in
def get_opener():
    # 1.1 Create a CookieJar object
    cookiejar = CookieJar()
    # 1.2 Use the cookiejar to create an HTTPCookieProcessor
    handler = request.HTTPCookieProcessor(cookiejar)
    # 1.3 Use the handler from the previous step to build an opener
    opener = request.build_opener(handler)
    return opener

# 1.4 Use the opener to send the login request (with the encoded username and password); the goal is to obtain the cookie
def login(opener, encoded):
    data = {}
    data['encoded'] = encoded
    login_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/xk/LoginToXk'
    req = request.Request(login_url, data=parse.urlencode(data).encode('utf-8'), headers=headers)
    opener.open(req)
    
# 2. Access the results page
def visit_grade(opener):
    grade_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/kscj/cjcx_list'
    req = request.Request(grade_url, headers=headers)
    resp = opener.open(req)
    return resp.read().decode('utf-8')


# Encode the account number and password

def make_user_token(username, password):
    with open('conwork.js') as f:
        ctx = execjs.compile(f.read())
        username_encode = ctx.call('encodeInp', username)
        password_encode = ctx.call('encodeInp', password)
    token = username_encode + '%%%' + password_encode
    return token

if __name__ == '__main__':
    username = 'xxxxxxxx'
    password = 'xxxxxxxxxxxxxxxxx'
    encoded = make_user_token(username, password)
    opener = get_opener()
    login(opener, encoded)
    web_data = visit_grade(opener)
    html = etree.HTML(web_data)
    grade_data = html.xpath('.//*[@id="dataList"]')
    # Each child of the table element is a row; the column indices picked
    # here (3, 4, 5, 6) depend on the layout of your own grade table.
    for row in grade_data[0]:
        s = ""
        for n, cell in enumerate(row):
            if n in (3, 4, 6):
                s += cell.text + "\t"
            elif n == 5:
                # This column wraps its text in child elements (e.g. links)
                for grade in cell:
                    s += grade.text + "\t"
        print(s)
                
    

And that completes a basic crawler. It is very simple, of course; as I study further I may come back and improve this article.

Origin www.cnblogs.com/caijiyang/p/12551043.html