URI source code analysis

The following code uses java.net.URI to validate a URL:

import org.apache.commons.validator.routines.UrlValidator;

import java.net.URI;
import java.net.URISyntaxException;

public class UrlUtils {

    public static void main(String[] args) {
        String url = "http://www.jiaobuchong.com?name=tom&request={\"hobby\":\"film\"}";
        System.out.println(new UrlValidator().isValid(url));
        isValidUrl(url);
    }

    // Returns true if java.net.URI can parse the string, false otherwise
    public static boolean isValidUrl(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }
        return false;
    }
}

If the URL contains characters such as { } or ", new URI(url) throws a URISyntaxException and the check fails. The character check in the URI source code turns out to be quite interesting.
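
To see where parsing stops, URISyntaxException exposes getIndex() and getReason(). The sketch below reports the position of the first rejected character. It also shows one possible workaround that the original code does not use: percent-encoding the JSON value with URLEncoder before building the URL. The class and method names are hypothetical, and UTF-8 is assumed as the encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
final class UriFailureDemo {

    // Returns the index of the first character java.net.URI rejects,
    // or -1 if the whole string parses.
    static int firstIllegalIndex(String url) {
        try {
            new URI(url);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    // Percent-encodes a single query parameter value (assumption: UTF-8).
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String base = "http://www.jiaobuchong.com?name=tom&request=";
        String json = "{\"hobby\":\"film\"}";
        // Raw JSON: parsing stops at the '{' that follows "request="
        System.out.println(firstIllegalIndex(base + json));
        // Encoded JSON: only %XX escapes remain, so the URL parses
        System.out.println(firstIllegalIndex(base + encodeValue(json)));
    }
}
```

Encoding only the parameter value, rather than the whole URL, keeps the ?, & and = separators intact.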

First, analyze the checkChar method

1. Build the bit mask for the legal character range with highMask

private static final long L_LOWALPHA = 0L;
private static final long L_UPALPHA = 0L;
private static final long L_ALPHA = L_LOWALPHA | L_UPALPHA;

private static final long H_LOWALPHA = highMask('a', 'z');
private static final long H_UPALPHA = highMask('A', 'Z');
private static final long H_ALPHA = H_LOWALPHA | H_UPALPHA;

// Called by the parser: the first character of the scheme must be a letter
checkChar(0, L_ALPHA, H_ALPHA, "scheme name");

L_ALPHA evaluates to 0, because every letter has a code point of at least 64 and therefore belongs to the high mask. Now for the more interesting part: let's see what highMask does:

    // Compute a high-order mask for the characters
    // between first and last, inclusive
    private static long highMask(char first, char last) {
        long m = 0;
        // Math.min clamps the value to the ASCII range (at most 127)
        // Math.max(Math.min(first, 127), 64) clamps it to the characters in [64, 127]
        // Subtracting 64 turns the code point into an offset from 64, i.e. a bit index in [0, 63]
        int f = Math.max(Math.min(first, 127), 64) - 64;
        int l = Math.max(Math.min(last, 127), 64) - 64;
        for (int i = f; i <= l; i++)
            m |= 1L << i;
        return m;
    }

Binary value of m for highMask('a', 'z'): bits 33 through 58 are set (26 ones followed by 33 zeros).
Binary value of m for highMask('A', 'Z'): bits 1 through 26 are set (26 ones followed by a single zero).
ORing the two, H_ALPHA = H_LOWALPHA | H_UPALPHA, gives:
11111111111111111111111111000000111111111111111111111111110
The result represents the characters in the range a-zA-Z. Each character has its own bit position, and the bits for characters outside this range are 0.
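
The binary values can be reproduced with a short program. This is a minimal sketch: highMask is copied from the listing above because the original is private to java.net.URI, and the class name HighMaskDemo is hypothetical.

```java
// Hypothetical demo class; highMask mirrors the listing above.
final class HighMaskDemo {

    static long highMask(char first, char last) {
        long m = 0;
        int f = Math.max(Math.min(first, 127), 64) - 64;
        int l = Math.max(Math.min(last, 127), 64) - 64;
        for (int i = f; i <= l; i++)
            m |= 1L << i;
        return m;
    }

    public static void main(String[] args) {
        long lower = highMask('a', 'z'); // 'a' = 97 -> bit 33, 'z' = 122 -> bit 58
        long upper = highMask('A', 'Z'); // 'A' = 65 -> bit 1,  'Z' = 90  -> bit 26
        long alpha = lower | upper;

        // Long.toBinaryString drops leading zeros, so the strings differ in length
        System.out.println(Long.toBinaryString(lower));
        System.out.println(Long.toBinaryString(upper));
        System.out.println(Long.toBinaryString(alpha));
    }
}
```

Bit 0 stays clear in both masks because it belongs to '@' (code point 64), which is not a letter.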

2. The match method

    // Tell whether the given character is permitted by the given mask pair
    private static boolean match(char c, long lowMask, long highMask) {
        if (c == 0) // 0 doesn't have a slot in the mask. So, it never matches.
            return false;
        if (c < 64)
            return ((1L << c) & lowMask) != 0;
        if (c < 128)
            // Shift 1 left by (c - 64) bits and AND it with highMask;
            // a nonzero result means the character is legal
            return ((1L << (c - 64)) & highMask) != 0;
        return false;
    }

Second, analyze the checkChars method

From the analysis above, URI decides whether a character is legal as follows: the set of legal characters is encoded as a pair of 64-bit masks, one bit per character. The incoming character is converted to a single bit and ANDed with the matching mask. If the result is not 0, the character is legal.
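
A small experiment makes the principle concrete. In the sketch below, match is copied from the listing above because the original is private to java.net.URI. The class name and the alphaHighMask helper are hypothetical and exist only to build a letters-only mask for the test.

```java
// Hypothetical demo class; match mirrors the listing above.
final class MatchDemo {

    static boolean match(char c, long lowMask, long highMask) {
        if (c == 0)
            return false;
        if (c < 64)
            return ((1L << c) & lowMask) != 0;
        if (c < 128)
            return ((1L << (c - 64)) & highMask) != 0;
        return false;
    }

    // Hypothetical helper: sets one bit per letter, offset by 64.
    static long alphaHighMask() {
        long m = 0;
        for (char c = 'A'; c <= 'Z'; c++) m |= 1L << (c - 64);
        for (char c = 'a'; c <= 'z'; c++) m |= 1L << (c - 64);
        return m;
    }

    public static void main(String[] args) {
        long low = 0L;               // no letter has a code point below 64
        long high = alphaHighMask();
        char[] samples = {'h', 'Z', '1', '{', '"', '\u00e9'};
        for (char c : samples) {
            // Letters match; the digit, brace, quote and non-ASCII character do not
            System.out.println((int) c + " -> " + match(c, low, high));
        }
    }
}
```

The test costs one shift and one AND per character, with no branching over ranges and no lookup table beyond two long values.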

Now let's analyze this call:

checkChars(1, p, L_SCHEME, H_SCHEME, "scheme name");

Working through the constants in the code, L_SCHEME evaluates to:

0 | lowMask('0', '9') | lowMask("+-.")

H_SCHEME:

H_SCHEME = highMask('a', 'z') | highMask('A', 'Z') | 0 | highMask("+-.")
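
The calls lowMask("+-.") and highMask("+-.") use String overloads that are not listed in this article. The following is a sketch of how such overloads can be written. It is reconstructed from the behavior of the char-range versions, not copied from the JDK, so the method bodies and the class name are assumptions.

```java
// Hypothetical demo class; the method bodies are an assumption.
final class StringMaskDemo {

    // Sets one bit per character below 64; other characters are ignored.
    static long lowMask(String chars) {
        long m = 0;
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (c < 64)
                m |= 1L << c;
        }
        return m;
    }

    // Sets one bit per character in [64, 127], offset by 64; others are ignored.
    static long highMask(String chars) {
        long m = 0;
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (c >= 64 && c < 128)
                m |= 1L << (c - 64);
        }
        return m;
    }

    public static void main(String[] args) {
        // '+' = 43, '-' = 45, '.' = 46: all three land in the low mask
        System.out.println(Long.toBinaryString(lowMask("+-.")));
        // None of them is >= 64, so the high mask stays 0
        System.out.println(highMask("+-."));
    }
}
```

Since highMask("+-.") is 0, H_SCHEME has the same value as H_ALPHA. The three punctuation characters contribute only to L_SCHEME.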

Looking at the code of lowMask, the mask m produced by lowMask('0', '9') has one bit set for each digit in the range [0-9]:

    // Compute a low-order mask for the characters
    // between first and last, inclusive
    private static long lowMask(char first, char last) {
        long m = 0;
        int f = Math.max(Math.min(first, 63), 0);
        int l = Math.max(Math.min(last, 63), 0);
        for (int i = f; i <= l; i++)
            m |= 1L << i;
        return m;
    }

Then in the match method:

    // Tell whether the given character is permitted by the given mask pair
    private static boolean match(char c, long lowMask, long highMask) {
        if (c == 0) // 0 doesn't have a slot in the mask. So, it never matches.
            return false;
        if (c < 64)
            // For characters below 64, AND with lowMask;
            // a nonzero result means the character is legal
            return ((1L << c) & lowMask) != 0;
        if (c < 128)
            return ((1L << (c - 64)) & highMask) != 0;
        return false;
    }
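
Putting the pieces together, the scheme check can be approximated outside the JDK. This is a sketch under assumptions: lowMask, highMask and match are copied from the listings above, while isValidScheme is a hypothetical stand-in for the checkChar and checkChars pair, returning a boolean instead of throwing URISyntaxException.

```java
// Hypothetical demo class combining the listings above.
final class SchemeCheckDemo {

    static long lowMask(char first, char last) {
        long m = 0;
        int f = Math.max(Math.min(first, 63), 0);
        int l = Math.max(Math.min(last, 63), 0);
        for (int i = f; i <= l; i++)
            m |= 1L << i;
        return m;
    }

    static long highMask(char first, char last) {
        long m = 0;
        int f = Math.max(Math.min(first, 127), 64) - 64;
        int l = Math.max(Math.min(last, 127), 64) - 64;
        for (int i = f; i <= l; i++)
            m |= 1L << i;
        return m;
    }

    static boolean match(char c, long lowMask, long highMask) {
        if (c == 0) return false;
        if (c < 64) return ((1L << c) & lowMask) != 0;
        if (c < 128) return ((1L << (c - 64)) & highMask) != 0;
        return false;
    }

    static final long L_ALPHA = 0L;
    static final long H_ALPHA = highMask('a', 'z') | highMask('A', 'Z');
    // '+', '-' and '.' are all below 64, so they only affect the low mask
    static final long L_SCHEME =
            L_ALPHA | lowMask('0', '9') | (1L << '+') | (1L << '-') | (1L << '.');
    static final long H_SCHEME = H_ALPHA;

    // Hypothetical helper: the first character must be a letter,
    // every later character must be a letter, digit, '+', '-' or '.'.
    static boolean isValidScheme(String s) {
        if (s.isEmpty() || !match(s.charAt(0), L_ALPHA, H_ALPHA))
            return false;
        for (int i = 1; i < s.length(); i++)
            if (!match(s.charAt(i), L_SCHEME, H_SCHEME))
                return false;
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"http", "svn+ssh", "1http", "ht{tp"})
            System.out.println(s + " -> " + isValidScheme(s));
    }
}
```

The same pattern applies to the other URI components: each component has its own low and high mask pair, and only the constants change.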

Origin blog.csdn.net/jiaobuchong/article/details/102757459