Article Directory
This post summarizes my process of hacking away at the school's educational administration system. The complete code is on GitHub; grab it if you need it.
URP educational administration system automatic login script
verification code
Open the website http://jwxs.hhu.edu.cn/ and you are redirected straight to the login page http://jwxs.hhu.edu.cn/login
Personally, I think the interface of this system is quite good-looking; when I was a freshman we still had the old version, whose UI was stuck in the style of the 2000s.
The first problem we face is the captcha.
A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), also known as a verification code, is a fully automated public test used to distinguish whether a user is a human or a machine.
Get the verification code image
Open the browser's developer tools and refresh the page; you can find that the verification code image is served from
http://jwxs.hhu.edu.cn/img/captcha.jpg
Let's write a small piece of code to download this picture
import requests

prefix = 'http://jwxs.hhu.edu.cn/'
captcha_url = prefix + 'img/captcha.jpg'
src = 'captcha.jpg'

response = requests.get(captcha_url)
with open(src, 'wb') as file:
    file.write(response.content)
For example, the picture below:
The next step is text recognition.
Identify the verification code content
Here I did some research and found that I needed an OCR engine, tesseract. After spending a long time installing it, I found the recognition results weren't accurate. Then I came across a Python library with a rather special name:
ddddocr - OCR Universal Verification Code Recognition SDK Free Open Source Edition
I installed it just to play around, tried it on a few images, and found it worked reasonably well.
import ddddocr
import requests

prefix = 'http://jwxs.hhu.edu.cn/'
captcha_url = prefix + 'img/captcha.jpg'
src = 'captcha.jpg'

response = requests.get(captcha_url)
with open(src, 'wb') as file:
    file.write(response.content)

ocr = ddddocr.DdddOcr(show_ad=False)
with open(src, 'rb') as f:
    img_bytes = f.read()
res = ocr.classification(img_bytes)
print('captcha:', res)
>>> captcha: c65a
That's it!
After trying many more images, I found the success rate wasn't very high because of the interference lines, so I kept digging and tried denoising the images.
image noise reduction
After many failures, I summed up the reasons:

- The denoising solutions found online don't suit every type of captcha. Some captchas only have background noise or many thin lines, but ours has thick black lines similar in weight to the content itself; blindly following those methods can remove the content along with the noise.
- Looking closely at the captcha, the text itself is red, with thick black interference lines on top. So all we need to do is turn every black (or near-black) pixel white, right?
After another round of attempts, I found that the following processing works best:
import ddddocr
import requests
from PIL import Image

prefix = 'http://jwxs.hhu.edu.cn/'
captcha_url = prefix + 'img/captcha.jpg'
src = 'captcha.jpg'
dst = 'captcha_p.png'

def process_data(src, dst):
    img = Image.open(src)
    w, h = img.size
    for x in range(w):
        for y in range(h):
            r, g, b = img.getpixel((x, y))
            low = 50
            up = 256
            # near-black pixels (the interference lines) -> white
            if r in range(low) and g in range(low) and b in range(low):
                img.putpixel((x, y), (255, 255, 255))
            # pixels whose three components are all light/grey -> white
            if r in range(low, up) and g in range(low, up) and b in range(low, up):
                img.putpixel((x, y), (255, 255, 255))
    img.save(dst)

if __name__ == "__main__":
    response = requests.get(captcha_url)
    with open(src, 'wb') as file:
        file.write(response.content)
    process_data(src, dst)
    ocr = ddddocr.DdddOcr(show_ad=False)
    with open(dst, 'rb') as f:
        img_bytes = f.read()
    res = ocr.classification(img_bytes)
    print('captcha:', res)
The difference before and after processing is striking.
The principle is actually very simple: traverse all pixels, and if a pixel's three RGB components are all between 0 and 50 (near-black) or all between 50 and 255 (light or grey), change it to white. Only the red-dominant pixels that make up the captcha text survive.
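To make the rule concrete, here is a minimal sketch of the same pixel test as a standalone predicate (the function name and sample values are my own, not from the original script):

```python
def is_kept(pixel, low=50):
    """Return True if the pixel survives denoising (i.e. is NOT turned white).

    A pixel is whitened when all three RGB components are near-black (< low)
    or all are light/grey (>= low). Red captcha strokes (high R, low G/B)
    fall into neither case, so they are kept.
    """
    r, g, b = pixel
    all_dark = r < low and g < low and b < low
    all_light = r >= low and g >= low and b >= low
    return not (all_dark or all_light)

print(is_kept((200, 20, 20)))    # red stroke -> True (kept)
print(is_kept((10, 10, 10)))     # black interference line -> False (whitened)
print(is_kept((240, 240, 240)))  # light background -> False (whitened)
```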
That finally solved the captcha problem; the next step is the main topic: automatically logging in to the system.
auto login
If we click the login button without entering anything, one more request appears:
POST http://jwxs.hhu.edu.cn/j_spring_security_check
Checking the form elements reveals the information submitted to the system at login, consisting of three fields.
At this point I hadn't yet noticed that the submitted password was not actually empty even though I hadn't typed one; this tripped me up for a long time.
Time to tidy up the code a little. Let's first write a Request class to define the login method:
import requests
from bs4 import BeautifulSoup

USERNAME = 'xxxxxxxxxx'
PASSWORD = 'xxxxxxxxxx'
Host = 'jwxs.hhu.edu.cn'
prefix = 'http://jwxs.hhu.edu.cn/'
UserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 ' \
            'Safari/537.36 '
login_url = prefix + 'login'
captcha_url = prefix + 'img/captcha.jpg'
post_url = prefix + 'j_spring_security_check'
index_url = prefix + 'index.jsp'

class Request(object):
    def __init__(self, username, password):
        self.username = username
        self.password = password
        self.session = requests.Session()
        self.headers = {
            'Host': Host,
            'User-Agent': UserAgent,
            'Referer': login_url,
        }
        self.cookies = self.session.cookies

    def captcha(self):
        src = 'captcha.jpg'
        dst = 'captcha_p.png'
        response = self.session.get(captcha_url)
        with open(src, 'wb') as file:
            file.write(response.content)
        # captcha_code: the denoise + ddddocr recognition routine from the previous section
        res = captcha_code(src, dst)
        return res

    def login(self):
        post_data = {
            'j_username': self.username,
            'j_password': self.password,
            'j_captcha': self.captcha(),
        }
        self.session.post(post_url, post_data, headers=self.headers)
        response = self.session.get(index_url, headers=self.headers, cookies=self.session.cookies)
        soup = BeautifulSoup(response.text, 'lxml')
        name = soup.find('title').string
        if name == 'URP综合教务系统首页':
            print('login success')
            print('JSESSIONID:', self.session.cookies.get('JSESSIONID'))

if __name__ == "__main__":
    request = Request(USERNAME, PASSWORD)
    request.login()
Let's run it. Strangely, the command line doesn't print login success.
Check the captcha recognition result? No problem.
Take another look at the form in the HTML? Everything matches up: the student number corresponds to j_username, the password to j_password, the captcha to j_captcha. There should be no problem.
Eh, wait. What is this hex_md5?!
And that's when I finally realized why the password field in the submitted form was not empty even when no password was typed.
It turns out the password field is hashed with MD5 before the form is submitted. Digging through the source files, I found one named md5.js
I considered rewriting this JS script in Python, but gave up after a few lines; although both are dynamic languages, there are still plenty of differences. So I searched for "convert js to python" directly and found the Python library Js2Py, which is quite convenient.
import js2py
# from md5 import *

if __name__ == "__main__":
    js2py.translate_file('md5.js', 'md5.py')
    # data = md5.hex_md5('12ibnsdkq1ed')
    # print(data)
Run it once to generate md5.py, then uncomment the commented lines and run it again to see the test result.
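As a sanity check: assuming the site ships the stock md5.js (the widely copied Paul Johnston implementation), its hex_md5 is just a standard MD5 hex digest, so the translated function can be cross-checked against Python's own hashlib:

```python
import hashlib

def hex_md5(s: str) -> str:
    # Standard MD5 hex digest, which stock md5.js's hex_md5 also produces
    return hashlib.md5(s.encode('utf-8')).hexdigest()

# If md5.py was translated correctly, md5.hex_md5('12ibnsdkq1ed')
# should return the same 32-character hex string as this:
print(hex_md5('12ibnsdkq1ed'))
```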
Now add this encryption step to our login code:
from md5 import *
...
...
    def login(self):
        post_data = {
            'j_username': self.username,
            'j_password': md5.hex_md5(self.password),
            'j_captcha': self.captcha(),
        }
        self.session.post(post_url, post_data, headers=self.headers)
        response = self.session.get(index_url, headers=self.headers, cookies=self.session.cookies)
        soup = BeautifulSoup(response.text, 'lxml')
        name = soup.find('title').string
        if name == 'URP综合教务系统首页':
            print('login success')
            print('JSESSIONID:', self.session.cookies.get('JSESSIONID'))
Now you can successfully enter the system
captcha: xxxx
login success
JSESSIONID: abcMTh7Thb9p4ef4DZ2my
Crawl the required data
Classes are offline now, and it isn't easy to find an empty classroom for self-study after class. Besides wandering around looking for one, you can also check the educational administration system, but that means entering the captcha every time you log in, and the login state expires quickly, turning a simple task into a repetitive chore. Since I can already log in automatically, why not crawl the free-classroom information as well? Just do it!
First, locate the free classroom query page.
Click on any teaching building and you can see that the browser sends a request to http://jwxs.hhu.edu.cn/student/teachingResources/freeClassroom/today, and the Content-Type in the request header is application/x-www-form-urlencoded. Make a note of this; it matters later.
Looking at the form, you'll find two fields, which appear to be the teaching building number and the campus number.
Scrolling down, there is one more request, queryCodeTeaBuildingList. Clicking through confirms it: the number of Qinxue Building on Jiangning Campus is 2_11.
What happens if we simply GET http://jwxs.hhu.edu.cn/student/teachingResources/freeClassroom/today? The answer, of course, is that we don't get the result we want, because the data has to be POSTed as application/x-www-form-urlencoded:
application/x-www-form-urlencoded: Data is encoded as key-value pairs separated by '&', while keys and values are separated by '='. Non-alphanumeric characters will be percent-encoded: this is why this type does not support binary data (use multipart/form-data instead).
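Python's standard library shows exactly what such a body looks like (the field values here are illustrative; requests builds the same body automatically when you pass a dict as data=):

```python
from urllib.parse import urlencode

# Key-value pairs joined by '&', keys and values separated by '='
body = urlencode({'weeks': 3, 'teaNum': '2_11'})
print(body)  # weeks=3&teaNum=2_11

# Non-alphanumeric characters are percent-encoded (space becomes '+')
print(urlencode({'q': 'a b&c'}))  # q=a+b%26c
```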
Looking at the page source, you'll find it uses server-side dynamic rendering, which here simply means JSP:
JSP (full name Jakarta Server Pages, formerly known as JavaServer Pages) is a dynamic web page technology standard created by Sun Microsystems. JSP is deployed on the web server, can respond to the request sent by the client, and dynamically generate a web page of HTML, XML or other format documents according to the content of the request, and then return it to the requester. JSP technology uses the Java language as a scripting language to provide services for users' HTTP requests, and can work with other Java programs on the server to handle complex business requirements.
How to put it: this technology is rarely chosen for new systems now, in the era of front-end/back-end separation, but a relatively old system like this one inevitably still relies on it, and an educational administration system is complex enough that replacing it is not easy.
So what can we do? There is still a way: pay attention to the custom query options.
There are many search options, and the query results appear in the two tables below, so let's just click search directly.
You can see an extra search request appear.
Pay attention to the Content-Type of the request and response headers: the server returns data in JSON format. If you've built a project with separated front and back ends, this should look familiar.
My guess is that the current system is not pure JSP after all; it also has interfaces like this one where the front and back ends are partially separated.
Parsing this data should be easy. Before pasting the code, let's analyze the form fields:
- weeks - the number of weeks
- jslxdm - classroom type
- codeCampusListNumber - campus number
- teaNum - the teaching building number
- wSection - week/section
- pageNum - page number
- pageSize - entries per page
Knowing what each field means, plus the building numbers obtained from the earlier query, we can look up the free classrooms of a given building on a given day and section. Here's the code:
...
...
    def search_free_classroom(self, query_param):
        headers = {
            'Host': Host,
            'User-Agent': UserAgent,
            'Referer': query_refer_url,
            'X-Requested-With': 'XMLHttpRequest',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }
        # query_url / query_refer_url: the search endpoint and its referer page
        # observed in the developer tools, defined alongside the other URLs
        response = self.session.post(query_url, data=query_param, headers=headers, cookies=self.session.cookies)
        data = response.json()[0]['records']
        logging.debug('free classrooms: (week %s) (section %s)',
                      query_param['weeks'], query_param['wSection'])
        sets = [record['classroomName'] for record in data]
        logging.debug(sets)
        return sets

if __name__ == "__main__":
    request = Request(USERNAME, PASSWORD)
    request.login()
    param = {
        'weeks': 3,
        'jslxdm': 1,
        'codeCampusListNumber': 1,
        'teaNum': 14,
        'wSection': '4/4',  # day/section as a string, not the arithmetic expression 4/4
        'pageNum': 1,
        'pageSize': 10,
    }
    request.search_free_classroom(param)
The basic idea is done. The one thing still bothering me is how to store the queried data; I need to think about it. Maybe I'll write an API to serve the data and then build an app for convenient queries? Not decided yet.
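As a stopgap before any proper interface exists, the query results could simply be appended to a JSON-lines file; a minimal sketch (the function and field names are my own):

```python
import json
from datetime import datetime

def save_result(path, query_param, classrooms):
    """Append one query result as a JSON object on its own line."""
    record = {
        'time': datetime.now().isoformat(timespec='seconds'),
        'query': query_param,
        'classrooms': classrooms,
    }
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# e.g. after request.search_free_classroom(param) returns a list of rooms:
save_result('free_classrooms.jsonl', {'weeks': 3, 'wSection': '4/4'}, ['A101', 'A102'])
```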
That's all.