Sensitive information detection tool (exposed surface detection)

Preface

Some time ago, I was given work that required a lot of manual effort: exposed surface detection. The last step was to manually check whether each page leaked sensitive information, which required human judgment, so I wrote a script to detect sensitive information on web pages automatically.

Introduction

The script mainly detects middleware versions, other version strings, source code leakage, sensitive information, and download behavior.

Since SSL certificate verification is disabled, there are some security implications; use the tool as appropriate.

The principle is to crawl the web page content and match keywords with regular expressions. If you need special keywords, you can modify or add regular expressions to adapt the code to your own use.
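
For example, a minimal sketch of the crawl-and-match idea (the URL and the password keyword here are purely illustrative, not built-in patterns of the script):

import re
import requests

# Fetch the page body (certificate verification disabled, as in the full script below)
content = requests.get("http://example.com", verify=False, timeout=3).text

# A custom keyword added as a regular expression
custom_pattern = re.compile(r'password\s*[:=]', re.IGNORECASE)
if custom_pattern.search(content):
    print("Custom keyword hit")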

Middleware versions

It mainly detects the following middleware (basically covering the common middleware on the market); a short sketch follows the list.

Tomcat
WebLogic
Jboss
Jetty
WebSphere
Glassfish
Nginx
Apache
Microsoft IIS
Kafka
RabbitMQ
Redis
Elasticsearch
MongoDB
MySQL
Node.js
Express.js
Django
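
As a sketch of how a middleware version is picked up, here is the same Tomcat pattern as in REGEX_DICT in the source code below, applied to a made-up error-page line:

import re

tomcat_regex = re.compile(r'Apache\s*Tomcat/([\d\.]+)', re.IGNORECASE)

sample = "HTTP Status 404 - Apache Tomcat/8.5.19"  # made-up error-page text
match = tomcat_regex.search(sample)
if match:
    print("Tomcat", match.group(1))  # prints: Tomcat 8.5.19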

Other versions

It mainly detects version strings in the two formats [number.number] and [number.number.number].
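
This is the same generic pattern as other_versions_regex in the source code below; a quick sketch:

import re

# Matches x.y and x.y.z style version strings
other_versions_regex = re.compile(r'(\d+(?:\.\d+){1,2})')
print(other_versions_regex.findall("powered by v2.4 / build 1.2.3"))  # ['2.4', '1.2.3']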

Is there any source code leakage?

Because most web pages are written in HTML, HTML matches are excluded first. For the other languages (listed below), the script matches common syntax keywords of each language; see the sketch after the list.

HTML

Python
JavaScript
Java
C++
Go
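
For example, the Python fingerprint from PROGRAMMING_LANGUAGES in the source code below, applied to a hypothetical leaked snippet:

import re

python_pattern = re.compile(r'import\s+|def\s+|print\s*\(|from\s+')

sample = "def handler(request):\n    import os"  # hypothetical leaked snippet
if python_pattern.search(sample):
    print("Possible Python source code leak")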

Sensitive information detection

It mainly detects email addresses ending in .com and .cn; 11-digit mainland Chinese mobile phone numbers starting with 13, 14, 15, 17, or 18; and mainland China ID numbers.
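
A minimal sketch of the three checks, using the (corrected) patterns from the source code below; the sample text is made up:

import re

patterns = [
    r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(?:cn|com)',  # email ending in .com/.cn
    r'1[34578]\d{9}',                                  # 11-digit mainland mobile number
    r'\d{17}[\dXx]|\d{15}',                            # mainland ID number
]

sample = "contact [email protected], tel 13812345678"
for pattern in patterns:
    print(re.findall(pattern, sample))  # hits for the email and phone patterns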

Is there any download behavior?

Match the captured response packets: a Content-Type of httpd/unix-directory or application in the response may indicate download behavior. If you encounter special cases, you can add your own patterns.
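
A minimal sketch of the check, mirroring is_downloadable in the source code below (the URL is a placeholder):

import requests

# A HEAD request is enough: only the response headers are needed
r = requests.head("http://example.com/file.zip", allow_redirects=True, verify=False, timeout=3)
content_type = r.headers.get('content-type', '')
if content_type.startswith(('application', 'httpd/unix-directory')):
    print("Possible download behavior")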

Usage

The script uses the argparse module; run it with -h to see how to use it.

It supports both single-URL detection and batch detection from a file. Scan results can also be written to a file, and threads have been added to process multiple entries faster (the default is 5).
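
For example (sensitive_detect.py is a placeholder name; use whatever filename you saved the script as):

# Scan a single URL
python sensitive_detect.py -u http://192.168.164.134:8080

# Batch-scan URLs from a file with 10 threads, writing results to result.txt
python sensitive_detect.py -f urls.txt -t 10 -o result.txt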

Source code

import argparse
import re
import requests
import threading
from tabulate import tabulate
import urllib3
from tqdm import tqdm

# Disable SSL certificate verification warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Regular expressions for middleware version detection
REGEX_DICT = {
    'Tomcat': r'Apache\s*Tomcat/([\d\.]+)',
    'Weblogic': r'Oracle\s*WebLogic\s*Server/([\d\.]+)',
    'Jboss': r'JBoss/([\d\.]+)',
    'Jetty': r'Jetty/([\d\.]+)',
    'WebSphere': r'IBM\s*WebSphere/([\d\.]+)',
    'Glassfish': r'GlassFish/([\d\.]+)',
    'Nginx': r'nginx/([\d\.]+)',
    'Apache': r'Apache/([\d\.]+)',
    'Microsoft IIS': r'Microsoft-IIS/([\d\.]+)',
    'Kafka': r'Apache\s*Kafka/([\d\.]+)',
    'RabbitMQ': r'RabbitMQ/([\d\.]+)',
    'Redis': r'Redis/([\d\.]+)',
    'Elasticsearch': r'Elasticsearch/([\d\.]+)',
    'MongoDB': r'MongoDB/([\d\.]+)',
    'MySQL': r'MySQL/([\d\.]+)',
    # X-Powered-By header fingerprints (Express implies Node.js)
    'Node.js': r'X-Powered-By:\s*Express',
    'Express.js': r'X-Powered-By:\s*Express',
    'Django': r'X-Powered-By:\s*Django'
}
COMPILED_REGEX_DICT = {
    middleware: re.compile(regex, re.IGNORECASE) for middleware, regex in REGEX_DICT.items()}

# Emails ending in .com/.cn, 11-digit mainland mobile numbers, mainland ID numbers
SENSITIVE_INFO_REGEX_LIST = [
    r'([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(cn|com))',
    r'(1[34578]\d{9})',
    r'(\d{17}[\dXx]|\d{15})',
]

# Syntax keywords used to fingerprint leaked source code; HTML is matched
# first and treated as "no leak", since most pages are HTML
PROGRAMMING_LANGUAGES = {
    'HTML': r'<html>|<!DOCTYPE',
    'Python': r'import\s+|def\s+|print\s*\(|from\s+',
    'JavaScript': r'function\s+|console\.',
    'Java': r'public\s+class\s+|import\s+java\.',
    'C++': r'#include\s+<|using\s+namespace\s+std',
    'Go': r'package\s+main|func\s+'
}

# Fetch the web page content
def read_content(url):
    try:
        response = requests.get(url, verify=False, timeout=3)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request error: {str(e)}")
        return "NONE"

# Find middleware versions; fall back to generic version strings
def find_versions(content, compiled_regex_dict):
    found_middleware = []
    found_other = []
    other_versions_regex = re.compile(r'(\d+(?:\.\d+){1,2})')

    for middleware, compiled_regex in compiled_regex_dict.items():
        matches = compiled_regex.findall(content)
        if matches:
            found_middleware.append(f"{middleware} {matches[0]}")
            break
    else:
        # No middleware matched: collect bare x.y / x.y.z version strings
        found_versions = other_versions_regex.findall(content)
        if found_versions:
            found_other.extend(found_versions)

    return found_middleware, found_other

# Version detection
def version_detection(content, compiled_regex_dict):
    middleware, other = [], []

    if content:
        middleware, other = find_versions(content, compiled_regex_dict)

    return middleware[0] if middleware else "NONE", other[0] if other else "NONE"

# Match the programming language of the page content (HTML counts as no leak)
def match_programming_language(content):
    for language, pattern in PROGRAMMING_LANGUAGES.items():
        if re.search(pattern, content):
            if language == "HTML":
                return "NONE"
            else:
                return language
    
    return "NONE"

# Detect sensitive information (emails, phone numbers, ID numbers)
def check_sensitive_info(content):
    for regex in SENSITIVE_INFO_REGEX_LIST:
        if re.search(regex, content):
            return "Possible"
    return "NONE"

# Judge whether the URL looks like a download link
def is_downloadable(url):
    try:
        r = requests.head(url, allow_redirects=True, verify=False, timeout=3)
        content_type = r.headers.get('content-type', '')
        if content_type.startswith(('application', 'httpd/unix-directory')):
            return "Possible"
    except requests.exceptions.RequestException:
        pass
    return "NONE"

# Output the results as a table
def output_results(output, output_file=None):
    headers = ["URL", "Middleware version", "Other version", "Source code leakage", "Information leakage", "Download files?"]
    table = tabulate(output, headers, tablefmt='simple')
    print(table)
    if output_file:
        with open(output_file, 'w', newline='') as file:
            file.write(table)

# Worker thread: run all checks for one URL
def worker(url, COMPILED_REGEX_DICT):
    semaphore.acquire()  # Acquire the semaphore; blocks beyond the thread limit
    try:
        content = read_content(url)
        middleware, other = version_detection(content, COMPILED_REGEX_DICT)
        language = match_programming_language(content)
        sensitive = check_sensitive_info(content)
        downloadable = is_downloadable(url)
        output.append([url, middleware, other, language, sensitive, downloadable])
    finally:
        semaphore.release()  # Release the semaphore so other threads can proceed


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Sensitive information detection')
    parser.add_argument('-u', '--url', type=str, help='single URL to scan')
    parser.add_argument('-f', '--file', type=str, help='input file with one URL per line')
    parser.add_argument('-o', '--output', type=str, help='output file')
    parser.add_argument('-t', '--thread', type=int, help='number of threads (default 5)')
    args = parser.parse_args()

    output = []

    if args.url:
        content = read_content(args.url)
        middleware, other = version_detection(content, COMPILED_REGEX_DICT)
        language = match_programming_language(content)
        sensitive = check_sensitive_info(content)
        downloadable = is_downloadable(args.url)
        output.append([args.url, middleware, other, language, sensitive, downloadable])
        output_results(output, args.output)
    elif args.file:
        # Create the thread semaphore (default 5 threads)
        thread = args.thread if args.thread else 5
        semaphore = threading.BoundedSemaphore(thread)
        with open(args.file, 'r') as f:
            urls = f.read().splitlines()
        with tqdm(total=len(urls)) as pbar:
            threads = []
            for url in urls:
                t = threading.Thread(target=worker, args=(url, COMPILED_REGEX_DICT))
                t.start()
                pbar.update(1)
                threads.append(t)
            # Wait for all threads to finish
            for t in threads:
                t.join()
        output_results(output, args.output)

Test

In the test case (192.168.164.134:8080 is a web service I set up myself, running Tomcat 8.5.19):

The first is an unknown version leak.

The second and third are both Tomcat version leaks.

The fourth is an unknown version leak together with a JavaScript source code leak.

The fifth shows possible information leakage; the detected information, which includes an email address, is as follows:

123456

[email protected]

The sixth, seventh, and eighth are responses whose packets contain httpd/unix-directory or application, indicating download behavior.

Disclaimer

This tool may only be used in enterprise security work for which sufficient legal authorization has been obtained. When using this tool, you must ensure that all your actions comply with local laws and regulations. If you commit any illegal act while using this tool, you bear all the consequences yourself; the developers and contributors of this tool assume no legal or joint liability. Do not use this tool unless you have fully read, fully understood, and accepted all the terms of this agreement. Your use of the tool, or your acceptance of this agreement in any other express or implied manner, shall be deemed to mean that you have read and agreed to be bound by this agreement.

Origin: blog.csdn.net/weixin_56378389/article/details/134006268