Yesterday I learned the basics of web crawling, so on the spur of the moment I wrote a crawler to fetch my exam results. Below is the whole process. Since I am just starting out with crawlers, I picked an old login page that does not require a captcha.
School information redacted here. Login page: http://xxxjwc.its.xxu.edu.cn/jsxsd/
Results page: http://xxxjwc.its.xxu.edu.cn/jsxsd/kscj/cjcx_list
General idea:
1. Log in with the account and password, and record the cookie information
2. Use the cookie information to access the results page
3. Use xpath to extract the desired information (I have not studied xpath carefully, just a rough understanding, so my usage may not be ideal)
First, obtain the login URL
1.1 Use the browser developer tools
Open the educational administration system page in Chrome, press F12, click Network, and check Preserve log.
1.2 Log in with the account and password and use the tools to find the URL the login request is sent to (you can also see the form's submission URL in the page source).
At the bottom of the request headers you can see that the data transmitted is a single field called encoded, as follows.
This is not the usual pair of account and password fields; reading the page source reveals the following:
function submitForm1(){
    try{
        var xh = document.getElementById("userAccount").value;
        var pwd = document.getElementById("userPassword").value;
        if(xh==""){
            alert("User name can not be empty!");
            return false;
        }
        if(pwd==""){
            alert("Password can not be empty!");
            return false;
        }
        var account = encodeInp(xh);
        var passwd = encodeInp(pwd);
        var encoded = account+"%%%"+passwd;
        document.getElementById("encoded").value = encoded;
        var jzmmid = document.getElementById("Form1").jzmmid;
        return true;
    }catch(e){
        alert(e.Message);
        return false;
    }
}
So encoded is the account and password each run through encodeInp() and then spliced together with "%%%". The encodeInp() function lives in conwork.js and looks like this:
eval(function(p,a,c,k,e,d){e=function(c){return(c<a?"":e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)d[e(c)]=k[c]||e(c);k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1;};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p;}('b 9="o+/=";p q(a){b e="";b 8,5,7="";b f,g,c,1="";b i=0;m{8=a.h(i++);5=a.h(i++);7=a.h(i++);f=8>>2;g=((8&3)<<4)|(5>>4);c=((5&s)<<2)|(7>>6);1=7&t;k(j(5)){c=1=l}v k(j(7)){1=l}e=e+9.d(f)+9.d(g)+9.d(c)+9.d(1);8=5=7="";f=g=c=1=""}u(i<a.n);r e}',32,32,'|enc4||||chr2||chr3|chr1|keyStr|input|var|enc3|charAt|output|enc1|enc2|charCodeAt||isNaN|if|64|do|length|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|function|encodeInp|return|15|63|while|else'.split('|'),0,{}))
So to encrypt the username and password, you either rewrite this function in Python yourself or call the JS function directly. I chose to call the JS directly, which requires installing the PyExecJS module and then import execjs. Generating encoded from the account and password looks like this:
# Encode the account and password
def make_user_token(username, password):
    with open('conwork.js') as f:
        ctx = execjs.compile(f.read())
    username_encode = ctx.call('encodeInp', username)
    password_encode = ctx.call('encodeInp', password)
    token = username_encode + '%%%' + password_encode
    return token
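If you would rather not depend on PyExecJS, unpacking the obfuscated conwork.js shows that encodeInp appears to be a Base64-style encoder over the input's character codes, so it can be rewritten in Python, which is the other option mentioned above. This is my own sketch of that rewrite (the name encode_inp is mine), assuming the account and password are plain ASCII:

```python
import base64

# Alphabet taken from the unpacked JS; index 64 ('=') is the pad character.
KEY = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

def encode_inp(s):
    # Mirror the unpacked JS: take character codes three at a time and
    # map them to four characters of KEY, padding with '=' at the end.
    out = []
    codes = [ord(c) for c in s]
    for i in range(0, len(codes), 3):
        c1 = codes[i]
        c2 = codes[i + 1] if i + 1 < len(codes) else None
        c3 = codes[i + 2] if i + 2 < len(codes) else None
        e1 = c1 >> 2
        e2 = ((c1 & 3) << 4) | (c2 >> 4 if c2 is not None else 0)
        if c2 is None:
            e3, e4 = 64, 64
        elif c3 is None:
            e3, e4 = (c2 & 15) << 2, 64
        else:
            e3, e4 = ((c2 & 15) << 2) | (c3 >> 6), c3 & 63
        out.append(KEY[e1] + KEY[e2] + KEY[e3] + KEY[e4])
    return ''.join(out)

# For ASCII input this coincides with standard Base64:
print(encode_inp('abc'))                  # YWJj
print(base64.b64encode(b'abc').decode())  # YWJj
```

Either way, the token is still the two encoded strings joined with '%%%', exactly as make_user_token builds it.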
With the preparations complete, we can start obtaining the login cookie.
Second, log in
Since I am just learning to write crawlers, I used the most basic tools: CookieJar from http.cookiejar to capture the cookie information.
# 1. Log in
def get_opener():
    # 1.1 Create a CookieJar
    cookiejar = CookieJar()
    # 1.2 Use the cookiejar to create an HTTPCookieProcessor handler
    handler = request.HTTPCookieProcessor(cookiejar)
    # 1.3 Use the handler from 1.2 to build an opener
    opener = request.build_opener(handler)
    return opener

# 1.4 Send the login request (with the encoded account and password) through the opener, to obtain the cookie
def login(opener, encoded):
    data = {}
    data['encoded'] = encoded
    login_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/xk/LoginToXk'
    req = request.Request(login_url, data=parse.urlencode(data).encode('utf-8'), headers=headers)
    opener.open(req)
Third, access the results page
# 2. Access the results page
def visit_grade(opener):
    grade_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/kscj/cjcx_list'
    req = request.Request(grade_url, headers=headers)
    resp = opener.open(req)
    return resp.read().decode('utf-8')
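One pitfall: if the login failed, many systems simply send the login page back and opener.open still succeeds, so a sanity check before parsing can save confusion. A small heuristic sketch (the marker string userPassword is my assumption, taken from the login form's field id shown earlier; adjust it for your own system):

```python
def logged_in(page_html):
    # Heuristic: if we were bounced back to the login page, the returned
    # HTML still contains the login form's password field id.
    return 'userPassword' not in page_html

# Example against two toy responses:
print(logged_in('<table id="dataList"><tr><td>90</td></tr></table>'))  # True
print(logged_in('<input id="userPassword" type="password">'))          # False
```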
Fourth, extract the desired grade information
After fetching the results page, you can use either split or xpath to pull out the content you want; I chose xpath to get the grade information.
web_data = visit_grade(opener)
html = etree.HTML(web_data)
grade_data = html.xpath('.//*[@id="dataList"]')  # set this path according to your own needs
From there, pick the information you need out of the returned elements according to your own situation. One thing to note: if you are crawling data inside a table, the browser normalizes the HTML it displays and inserts a tbody element that the server never sent; if your xpath returns an empty result, try removing tbody from the path.
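The tbody point is easy to demonstrate with lxml on a toy table (this markup is made up, not the real grade page): the XPath copied from DevTools includes tbody and matches nothing, while the same path without it works.

```python
from lxml import etree

# What the server actually sends: a table with no <tbody>.
html = etree.HTML('<table id="dataList"><tr><td>90</td></tr></table>')

# XPath copied from the browser's DOM inspector (with tbody) finds nothing...
print(html.xpath('//*[@id="dataList"]/tbody/tr/td'))    # []
# ...while the same path without tbody finds the cell.
print(html.xpath('//*[@id="dataList"]/tr/td')[0].text)  # 90
```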
Fifth, the overall effect
Full code:
from urllib import request, parse
from http.cookiejar import CookieJar
from lxml import etree
import execjs

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
}

# 1. Log in
def get_opener():
    # 1.1 Create a CookieJar
    cookiejar = CookieJar()
    # 1.2 Use the cookiejar to create an HTTPCookieProcessor handler
    handler = request.HTTPCookieProcessor(cookiejar)
    # 1.3 Use the handler from 1.2 to build an opener
    opener = request.build_opener(handler)
    return opener

# 1.4 Send the login request (with the encoded account and password) through the opener, to obtain the cookie
def login(opener, encoded):
    data = {}
    data['encoded'] = encoded
    login_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/xk/LoginToXk'
    req = request.Request(login_url, data=parse.urlencode(data).encode('utf-8'), headers=headers)
    opener.open(req)

# 2. Access the results page
def visit_grade(opener):
    grade_url = 'http://xxxjwc.its.xxu.edu.cn/jsxsd/kscj/cjcx_list'
    req = request.Request(grade_url, headers=headers)
    resp = opener.open(req)
    return resp.read().decode('utf-8')

# Encode the account and password
def make_user_token(username, password):
    with open('conwork.js') as f:
        ctx = execjs.compile(f.read())
    username_encode = ctx.call('encodeInp', username)
    password_encode = ctx.call('encodeInp', password)
    token = username_encode + '%%%' + password_encode
    return token

if __name__ == '__main__':
    username = 'xxxxxxxx'
    password = 'xxxxxxxxxxxxxxxxx'
    encoded = make_user_token(username, password)
    opener = get_opener()
    login(opener, encoded)
    web_data = visit_grade(opener)
    html = etree.HTML(web_data)
    grade_data = html.xpath('.//*[@id="dataList"]')
    for i in grade_data[0]:
        n = 0
        s = ""
        for j in i:
            if n == 3:
                s += j.text
                s += "\t"
            elif n == 4:
                s += j.text
                s += "\t"
            elif n == 5:
                for grade in j:
                    s += grade.text
                    s += "\t"
            elif n == 6:
                s += j.text
                s += "\t"
            n += 1
        print(s)
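The nested counter loop at the end walks each row's cells by index. The same idea can be written more directly with a second xpath per row; here is a sketch on a toy table (the column layout is invented for illustration, the real dataList table has different columns):

```python
from lxml import etree

# Invented sample table; the real grade table has more columns per row.
sample = ('<table id="dataList">'
          '<tr><td>1</td><td>2019-2020</td><td>1</td>'
          '<td>CS101</td><td>Data Structures</td><td>95</td></tr>'
          '</table>')
html = etree.HTML(sample)

rows = []
for row in html.xpath('//*[@id="dataList"]/tr'):
    cells = row.xpath('./td/text()')    # all cell texts for this row
    rows.append('\t'.join(cells[3:6]))  # columns 3-5, like the n==3..5 branches

print(rows)  # ['CS101\tData Structures\t95']
```

This avoids the manual counter but assumes every interesting cell holds plain text; cells with nested elements (like the grade column handled by the inner loop above) would need their own xpath.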
And that's a basic crawler. It is very simple, of course; as I study further I may come back and improve this article.