URP educational administration system automatic login

This post summarizes how I went about automating the login to my school's educational administration system. The full code is on GitHub; feel free to grab it if you need it.

URP educational administration system automatic login script

Verification code

Opening http://jwxs.hhu.edu.cn/ redirects straight to the login page, http://jwxs.hhu.edu.cn/login

Personally, I think the interface of this system looks quite good. When I arrived as a freshman, the school was still on the old version, whose UI looked like something from the year 2000.

The first problem we face is the captcha.

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), also known as a verification code, is a public, fully automated test that distinguishes whether a user is a machine or a human.

Get the verification code image

Open the browser's developer tools and refresh the page; the path of the verification code image turns out to be

http://jwxs.hhu.edu.cn/img/captcha.jpg

Let's write a small piece of code to download this picture

import requests

prefix = 'http://jwxs.hhu.edu.cn/'
captcha_url = prefix + 'img/captcha.jpg'
src = 'captcha.jpg'

response = requests.get(captcha_url)
# Write the raw image bytes to disk; the with-block closes the file for us.
with open(src, 'wb') as file:
    file.write(response.content)

For example the picture below

The next step is text recognition.

Identify the verification code content

After some searching, I found that the usual suggestion was the Tesseract OCR engine. After spending a long time installing it, I found that its recognition results were not accurate. Then I came across a Python library with a very peculiar name:

ddddocr - OCR Universal Verification Code Recognition SDK Free Open Source Edition

I installed it just to play around, tried it on a few images, and found that it worked reasonably well.

import ddddocr
import requests

prefix = 'http://jwxs.hhu.edu.cn/'
captcha_url = prefix + 'img/captcha.jpg'
src = 'captcha.jpg'

response = requests.get(captcha_url)
with open(src, 'wb') as file:
    file.write(response.content)

ocr = ddddocr.DdddOcr(show_ad=False)
with open(src, 'rb') as f:
    img_bytes = f.read()
res = ocr.classification(img_bytes)
print('captcha:', res)
# Example output: captcha: c65a

That's it!

After trying many more images, I found that the success rate was not very high because of the interference lines in the images, so I kept searching and tried to denoise them.

Image noise reduction

After many failures, I summed up the reasons:

  1. The solutions found online do not suit every type of verification code. Some verification codes only have background noise or many thin lines, but ours has thick black lines that look much like the content. Following generic denoising methods may remove the content itself along with the noise.

  2. Looking closely at the verification code, the characters are red and the interference lines are thick and black. So all we need to do is turn the black or nearly black pixels white, right?

After another round of attempts, I finally found that the following processing works best:

import ddddocr
import requests
from PIL import Image

prefix = 'http://jwxs.hhu.edu.cn/'
captcha_url = prefix + 'img/captcha.jpg'
src = 'captcha.jpg'
dst = 'captcha_p.png'


def process_data(src, dst):
    # Whiten every pixel whose three channels are all dark (< low) or all
    # bright (>= low); only pixels mixing low and high channels, such as
    # the red characters, are kept.
    img = Image.open(src).convert('RGB')  # make sure getpixel returns (r, g, b)
    w, h = img.size
    low = 50
    for x in range(w):
        for y in range(h):
            r, g, b = img.getpixel((x, y))
            all_dark = r < low and g < low and b < low
            all_bright = r >= low and g >= low and b >= low
            if all_dark or all_bright:
                img.putpixel((x, y), (255, 255, 255))
    img.save(dst)


if __name__ == "__main__":
    response = requests.get(captcha_url)
    with open(src, 'wb') as file:
        file.write(response.content)

    process_data(src, dst)

    ocr = ddddocr.DdddOcr(show_ad=False)
    with open(dst, 'rb') as f:
        img_bytes = f.read()
    res = ocr.classification(img_bytes)
    print('captcha:', res)

There is still a big difference between before and after image processing

The principle is simple: traverse all pixels. If all three RGB components of a pixel are below 50 (black or nearly black), or all three are 50 or above (white, gray and other washed-out colors), the pixel is changed to white. Only pixels that mix low and high components, such as the red characters, are kept.
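The rule can be tried without any image library. The sketch below applies the same test to a hand-made row of pixels; the function name whiten_noise and the sample values are made up for illustration, and the threshold of 50 is the one used above.

```python
WHITE = (255, 255, 255)


def whiten_noise(pixels, low=50):
    """Return a copy of `pixels` (rows of (r, g, b) tuples) with noise whitened.

    A pixel is treated as noise when all three channels are below `low`
    or all three are at or above `low`.
    """
    cleaned = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            all_dark = r < low and g < low and b < low
            all_bright = r >= low and g >= low and b >= low
            new_row.append(WHITE if (all_dark or all_bright) else (r, g, b))
        cleaned.append(new_row)
    return cleaned


# black line, red character, dark gray noise, light background
sample = [[(0, 0, 0), (200, 20, 20), (30, 30, 30), (240, 240, 240)]]
print(whiten_noise(sample))
# [[(255, 255, 255), (200, 20, 20), (255, 255, 255), (255, 255, 255)]]
```

Only the red pixel survives. The rule is tuned to this particular captcha style and would probably need a different threshold or test for other color schemes.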

With that, the verification code problem is finally solved. Next comes the main topic: logging in to the educational administration system automatically.

Auto login

If we click the login button without entering anything, one more request shows up:

POST http://jwxs.hhu.edu.cn/j_spring_security_check

Inspecting the form data shows that this is the information submitted to the system at login, consisting of three fields.

At this point I had not yet noticed that the password submitted when nothing is entered is actually not empty. This tripped me up for a long time.

Time to tidy up the code a little. Let's first write a Request class that defines the login method.

import requests
from bs4 import BeautifulSoup

USERNAME = 'xxxxxxxxxx'
PASSWORD = 'xxxxxxxxxx'

Host = 'jwxs.hhu.edu.cn'
prefix = 'http://jwxs.hhu.edu.cn/'
UserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 ' \
            'Safari/537.36 '

login_url = prefix + 'login'
captcha_url = prefix + 'img/captcha.jpg'
post_url = prefix + 'j_spring_security_check'
index_url = prefix + 'index.jsp'


class Request(object):
    def __init__(self, username, password):
        self.username = username
        self.password = password
        self.session = requests.Session()
        self.headers = {
            'Host': Host,
            'User-Agent': UserAgent,
            'Referer': login_url,
        }
        self.cookies = self.session.cookies

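    def captcha(self):
        src = 'captcha.jpg'
        dst = 'captcha_p.png'
        # Fetch the captcha through the session so it is tied to our cookies.
        response = self.session.get(captcha_url)
        with open(src, 'wb') as file:
            file.write(response.content)
        # captcha_code is not shown in this post; presumably it wraps
        # process_data() and the ddddocr call from the previous section.
        res = captcha_code(src, dst)
        return res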

    def login(self):
        post_data = {
            'j_username': self.username,
            'j_password': self.password,
            'j_captcha': self.captcha(),
        }
        self.session.post(post_url, post_data, headers=self.headers)
        # Fetch the home page and use its title to decide whether login worked.
        response = self.session.get(index_url, headers=self.headers, cookies=self.session.cookies)
        soup = BeautifulSoup(response.text, 'lxml')
        name = soup.find('title').string
        if name == 'URP综合教务系统首页':
            print('login success')
            print('JSESSIONID:', self.session.cookies.get('JSESSIONID'))


if __name__ == "__main__":
    request = Request(USERNAME, PASSWORD)
    request.login()

Let's run it. Strangely, the command line does not print login success.

Check the verification code recognition result? No problem.

Look at the form in the HTML? Everything matches.

The student number corresponds to j_username, the password to j_password, and the verification code to j_captcha, so there should be no problem.

Wait, what is this hex_md5?!

That is how I finally discovered that the password field in the submitted form has content even when no password is entered.

It turned out that the password field is hashed with MD5 before the form is submitted. Looking through the source files, I found one named md5.js.

I wondered whether I should rewrite this JS script in Python. After a few lines I gave up: although both are dynamic languages, they differ in many places. So I searched for "convert js to python" and found the Python library Js2Py, which is quite convenient.

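import js2py
# from md5 import *  # enable on the second run, once md5.py exists

if __name__ == "__main__":
    # First run: translate md5.js into an importable md5.py module.
    js2py.translate_file('md5.js', 'md5.py')
    # Second run: uncomment to test the translated function.
    # data = md5.hex_md5('12ibnsdkq1ed')
    # print(data)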

On the second run, once md5.py has been generated, uncomment the commented lines to see the test result.

Now add this hashing function to our code:

from md5 import *

...
...

def login(self):
    post_data = {
        'j_username': self.username,
        'j_password': md5.hex_md5(self.password),
        'j_captcha': self.captcha(),
    }
    self.session.post(post_url, post_data, headers=self.headers)
    response = self.session.get(index_url, headers=self.headers, cookies=self.session.cookies)
    soup = BeautifulSoup(response.text, 'lxml')
    name = soup.find('title').string
    if name == 'URP综合教务系统首页':
        print('login success')
        print('JSESSIONID:', self.session.cookies.get('JSESSIONID'))

Now you can successfully enter the system

captcha: xxxx
login success
JSESSIONID: abcMTh7Thb9p4ef4DZ2my
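As an aside, if hex_md5 in md5.js turns out to be plain MD5 with lowercase hex output (an assumption; some deployments ship a modified script, so compare against the translated function first), the standard library can produce the same value for ASCII passwords without Js2Py. The helper build_login_form below is a hypothetical name; the field names are the ones from the login form above.

```python
import hashlib


def hex_md5(text):
    """Lowercase hex MD5 digest of a string (assumed to match md5.js for ASCII input)."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def build_login_form(username, password, captcha):
    # Only the password is hashed; the other two fields are sent as typed.
    return {
        'j_username': username,
        'j_password': hex_md5(password),
        'j_captcha': captcha,
    }


print(hex_md5('abc'))  # 900150983cd24fb0d6963f7d28e17f72
print(hex_md5(''))     # d41d8cd98f00b204e9800998ecf8427e
```

The second line of output also explains the earlier surprise: hashing an empty password still yields 32 hex characters, so the submitted field is never empty.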

Crawl the required data

Classes are back in person now, and it is not easy to find an empty classroom for self-study after class. Besides wandering around looking for one, you can check the educational administration system, but every login requires entering the verification code and the session expires quickly, which turns a simple task into a repetitive chore. Now that I can log in automatically, why not crawl the free classroom information while I am at it? Let's do it!

First locate the homepage of the free classroom query

Click on any teaching building and you will see that the browser sends a request to http://jwxs.hhu.edu.cn/student/teachingResources/freeClassroom/today, with Content-Type in the request header set to application/x-www-form-urlencoded. Make a note of this; it matters later.

Looking at the form, you will find that there are two pieces of information, which should be the teaching building number and the campus number

Looking further down the request list, there is one more request, queryCodeTeaBuildingList. Clicking on it confirms the guess.

The number of Qinxue Building on the Jiangning Campus is 2_11.

What happens if we simply try to fetch the content of http://jwxs.hhu.edu.cn/student/teachingResources/freeClassroom/today?

The answer, of course, is that we do not get the result we want, because the endpoint expects application/x-www-form-urlencoded form data.

application/x-www-form-urlencoded: data is encoded as key-value pairs separated by '&', with each key separated from its value by '='. Non-alphanumeric characters are percent-encoded, which is why this type is not suitable for binary data (use multipart/form-data instead).
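To see the encoding concretely, here is a small sketch using Python's standard library. The field names building and section are made up for the demonstration and are not the real form fields.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields, chosen to contain a space and a slash.
form = {'building': 'Qinxue Building', 'section': '4/4'}

body = urlencode(form)
print(body)            # building=Qinxue+Building&section=4%2F4
print(parse_qs(body))  # {'building': ['Qinxue Building'], 'section': ['4/4']}
```

The space becomes '+' and the slash becomes %2F, while '&' and '=' act as separators. When a dict is passed as data= to requests, this encoding is applied automatically.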

If you look at the page source, you will find that server-side dynamic rendering is used here; in short, JSP.

JSP (full name Jakarta Server Pages, formerly known as JavaServer Pages) is a dynamic web page technology standard created by Sun Microsystems. JSP is deployed on the web server, can respond to the request sent by the client, and dynamically generate a web page of HTML, XML or other format documents according to the content of the request, and then return it to the requester. JSP technology uses the Java language as a scripting language to provide services for users' HTTP requests, and can work with other Java programs on the server to handle complex business requirements.

How should I put it: this kind of technology is rarely chosen for new projects now, in the era of front-end and back-end separation. But an older system like this one inevitably still uses it, and the system is complex enough that it is not easy to change.

So what should we do? There is still a way. Take a look at the custom query options.

There are many search options, and the query results appear in the two tables below them. Try clicking search directly.

An extra search request appears.

Pay attention to Content-Type in the request and response headers. The server returns data in JSON format. If you have worked on a project with separated front and back ends, this should look familiar.

My guess is that the current system is not entirely JSP, and that some interfaces like this one follow the separated front-end/back-end style.

Data from this kind of interface should be easy to analyze. Before pasting the code, let's go through the form fields first.

  • weeks - the week number
  • jslxdm - classroom type code
  • codeCampusListNumber - campus number
  • teaNum - teaching building number
  • wSection - day of the week and class section, written as day/section
  • pageNum - page number
  • pageSize - number of records per page

Knowing what each field means, plus the teaching building numbers obtained from the earlier query, we can look up the free classrooms of a given building for a given day and section. Here is the code.
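The form can be assembled with a small helper so that each field is filled in by name. This is only a sketch: build_query and its parameter names are hypothetical, the default values are arbitrary, and the day/section reading of wSection is my interpretation of the field list above.

```python
def build_query(weeks, weekday, section, campus, building,
                room_type=1, page_num=1, page_size=10):
    """Assemble the free-classroom search form as a plain dict."""
    return {
        'weeks': weeks,
        'jslxdm': room_type,
        'codeCampusListNumber': campus,
        'teaNum': building,
        # wSection is sent as text such as '4/4', so build it as a string.
        'wSection': f'{weekday}/{section}',
        'pageNum': page_num,
        'pageSize': page_size,
    }


param = build_query(weeks=3, weekday=4, section=4, campus=1, building=14)
print(param['wSection'])  # 4/4
```

Building the string explicitly avoids accidentally writing 4/4 as a division, which Python would evaluate to 1.0.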

...
...

def search_free_classroom(self, query_param):
    # query_url and query_refer_url are assumed to be defined in the omitted
    # code above, which must also import the logging module.
    headers = {
        'Host': Host,
        'User-Agent': UserAgent,
        'Referer': query_refer_url,
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    }
    response = self.session.post(query_url, data=query_param, headers=headers, cookies=self.session.cookies)
    # The JSON body is a list; its first element holds the matching records.
    data = response.json()[0]['records']
    # logging expects %-style placeholders, not print-style argument lists.
    logging.debug('free classrooms: (week %s) (section %s)', query_param['weeks'], query_param['wSection'])
    sets = [record['classroomName'] for record in data]
    logging.debug(sets)
    return sets


if __name__ == "__main__":
    request = Request(USERNAME, PASSWORD)
    request.login()
    param = {
        'weeks': 3,
        'jslxdm': 1,
        'codeCampusListNumber': 1,
        'teaNum': 14,
        'wSection': '4/4',  # must be a string; a bare 4/4 would evaluate to 1.0
        'pageNum': 1,
        'pageSize': 10,
    }
    request.search_free_classroom(param)
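The parsing step can be tried offline. The sketch below runs the same extraction on a hand-written JSON string shaped like what the code above expects (a list whose first element has a records array). The sample body, the room names and the function name classroom_names are all made up; the real response very likely carries more fields.

```python
import json

# Hypothetical response body, reduced to the fields the code above reads.
sample_body = '''
[{"records": [{"classroomName": "Room 101"}, {"classroomName": "Room 202"}]}]
'''


def classroom_names(payload):
    """Pull the classroom names out of a parsed search response."""
    records = payload[0]['records']
    return [record['classroomName'] for record in records]


names = classroom_names(json.loads(sample_body))
print(names)  # ['Room 101', 'Room 202']
```

Keeping the extraction in its own function makes it easy to test against a saved response without logging in each time.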

The basic idea is complete, but one thing still bothers me: how to store the queried data. I need to think about it. Maybe I will write an API to serve the data and then an app for convenient lookups? Not sure yet.

that's all.


Origin blog.csdn.net/wji15/article/details/126922141