Preface
Some time ago I did some work that involved a lot of manual checking of exposed surfaces. The final step was to judge by hand whether a page leaked sensitive information, so I wrote a script to detect sensitive information on web pages automatically.
Introduction
This script mainly detects middleware versions, other version strings, possible source code leakage, sensitive information, and download behavior.
Since SSL certificate verification is disabled, there are some security implications, so use the script with discretion.
The principle is to crawl the page content and match keywords with regular expressions. If you need special keywords, you can modify or add regular expressions to make the code do what you want.
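As a minimal sketch of this principle (the sample content and the single regex here are illustrative), extracting a middleware version from crawled page content might look like:

```python
import re

# Illustrative response body; in the real script this comes from requests.get(url).text
content = "<html><body>Apache Tomcat/8.5.19 - Error report</body></html>"

# Match "Apache Tomcat/<version>" and capture the version number
tomcat_regex = re.compile(r'Apache\s*Tomcat/([\d\.]+)', re.IGNORECASE)
match = tomcat_regex.search(content)
if match:
    print(f"Tomcat {match.group(1)}")  # → Tomcat 8.5.19
```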
Middleware versions
Detection mainly covers the following middleware (basically covering the common middleware on the market):
Tomcat
WebLogic
Jboss
Jetty
WebSphere
Glassfish
Nginx
Apache
Microsoft IIS
Kafka
RabbitMQ
Redis
Elasticsearch
MongoDB
MySQL
Node.js
Express.js
Django
Other versions
This mainly detects version strings in the two formats [number.number] and [number.number.number].
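Both formats can be captured with a single regex, as in this sketch:

```python
import re

# Matches x.y or x.y.z version strings
version_regex = re.compile(r'\d+(?:\.\d+){1,2}')

sample = "Server build 2.4 released; bundled library at 1.18.0"
print(version_regex.findall(sample))  # → ['2.4', '1.18.0']
```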
Source code leakage
Because most web pages are written in HTML, HTML matches are excluded first. For the other languages (those listed below), the script mainly matches common syntax keywords of each language.
HTML
Python
JavaScript
Java
C++
Go
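As a sketch, keyword-based language matching (with an abridged subset of the script's patterns) might look like this; the first matching language wins, and an HTML match is treated as "no leak":

```python
import re

# Keyword patterns per language; abridged subset of the script's table
LANGUAGE_PATTERNS = {
    'HTML': r'<html>|<!DOCTYPE',
    'Python': r'import\s+|def\s+|print\s*\(|from\s+',
    'JavaScript': r'function\s+|console\.',
}

def match_language(content):
    for language, pattern in LANGUAGE_PATTERNS.items():
        if re.search(pattern, content):
            # Most pages are HTML, so an HTML match is not a leak
            return "NONE" if language == "HTML" else language
    return "NONE"

print(match_language("function greet() { console.log('hi'); }"))  # → JavaScript
print(match_language("<!DOCTYPE html><html></html>"))             # → NONE
```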
Sensitive information detection
This mainly detects email addresses ending in .com or .cn; 11-digit mainland Chinese mobile phone numbers starting with 13, 14, 15, 17, or 18; and mainland China ID numbers.
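A sketch of the corresponding patterns (tidied versions of the ones in the script) applied to a sample string:

```python
import re

SENSITIVE_PATTERNS = [
    r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(?:cn|com)',  # .com/.cn email addresses
    r'(?:13|14|15|17|18)\d{9}',                       # 11-digit mainland mobile numbers
    r'\d{17}[\dXx]|\d{15}',                           # mainland ID numbers (18 or 15 digits)
]

sample = "contact: admin@example.com tel: 13812345678"
for pattern in SENSITIVE_PATTERNS:
    print(re.findall(pattern, sample))
# → ['admin@example.com'], ['13812345678'], []
```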
Download behavior
The captured response packets are matched: a response whose Content-Type is httpd/unix-directory or starts with application may indicate download behavior. If you run into special cases, you can add your own types.
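A sketch of the header check, written against a plain dict so it runs without a live request (the actual script calls requests.head):

```python
def looks_downloadable(headers):
    # A Content-Type of application/* or httpd/unix-directory hints at a download
    content_type = headers.get('content-type', '')
    if content_type.startswith('application') or content_type.startswith('httpd/unix-directory'):
        return "Possible"
    return "NONE"

print(looks_downloadable({'content-type': 'application/zip'}))  # → Possible
print(looks_downloadable({'content-type': 'text/html'}))        # → NONE
```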
Usage
The script uses the argparse module; run it with -h to see how to use it.
Single-URL detection and batch detection from a file are both supported. Scan results can also be written to a file, and threading (default: 5 threads) makes processing multiple entries faster.
Source code
import argparse
import re
import threading

import requests
import urllib3
from tabulate import tabulate
from tqdm import tqdm

# Disable SSL certificate verification warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Compiled regular expressions
REGEX_DICT = {
    'Tomcat': r'Apache\s*Tomcat/([\d\.]+)',
    'Weblogic': r'Oracle\s*WebLogic\s*Server/([\d\.]+)',
    'Jboss': r'JBoss/([\d\.]+)',
    'Jetty': r'Jetty/([\d\.]+)',
    'WebSphere': r'IBM\s*WebSphere/([\d\.]+)',
    'Glassfish': r'GlassFish/([\d\.]+)',
    'Nginx': r'nginx/([\d\.]+)',
    'Apache': r'Apache/([\d\.]+)',
    'Microsoft IIS': r'Microsoft-IIS/([\d\.]+)',
    'Kafka': r'Apache\s*Kafka/([\d\.]+)',
    'RabbitMQ': r'RabbitMQ/([\d\.]+)',
    'Redis': r'Redis/([\d\.]+)',
    'Elasticsearch': r'Elasticsearch/([\d\.]+)',
    'MongoDB': r'MongoDB/([\d\.]+)',
    'MySQL': r'MySQL/([\d\.]+)',
    'Node.js': r'X-Powered-By: Express',
    'Express.js': r'X-Powered-By: Express',
    'Django': r'X-Powered-By: Django'
}
COMPILED_REGEX_DICT = {
    middleware: re.compile(regex, re.IGNORECASE)
    for middleware, regex in REGEX_DICT.items()
}
SENSITIVE_INFO_REGEX_LIST = [
    r'([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(cn|com))',  # .com/.cn email addresses
    r'((13|14|15|18|17)+[0-9]{9})',                   # mainland mobile numbers
    r'(\d{17}[\d|x]|\d{15})',                         # mainland ID numbers
]
PROGRAMMING_LANGUAGES = {
    'HTML': r'<html>|<!DOCTYPE',
    'Python': r'import\s+|def\s+|print\s*\(|from\s+',
    'JavaScript': r'function\s+|console\.',
    'Java': r'public\s+class\s+|import\s+java\.',
    'C++': r'#include\s+<|using\s+namespace\s+std',
    'Go': r'go\s+'
}
# Crawl the web page content
def read_content(url):
    try:
        response = requests.get(url, verify=False, timeout=3)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request error: {str(e)}")
        return "NONE"
# Find middleware and other versions
def find_versions(content, compiled_regex_dict):
    found_middleware = []
    found_other = []
    other_versions_regex = re.compile(r'(\d+(?:\.\d+){1,2})')
    for middleware, compiled_regex in compiled_regex_dict.items():
        matches = compiled_regex.findall(content)
        if matches:
            found_middleware.append(f"{middleware} {matches[0]}")
            break
    else:
        # No middleware matched: fall back to bare x.y / x.y.z version strings
        found_versions = other_versions_regex.findall(content)
        found_versions = [version for version in found_versions
                          if re.match(r'^\d+(?:\.\d+){1,2}$', version)]
        if found_versions:
            found_other.extend(found_versions)
    return found_middleware, found_other
# Version detection
def version_detection(content, compiled_regex_dict):
    middleware, other = [], []
    if content:
        middleware, other = find_versions(content, compiled_regex_dict)
    return middleware[0] if middleware else "NONE", other[0] if other else "NONE"
# Match programming languages in the page content
def match_programming_language(content):
    for language, pattern in PROGRAMMING_LANGUAGES.items():
        if re.search(pattern, content):
            # Most pages are HTML, so an HTML match is not treated as a leak
            return "NONE" if language == "HTML" else language
    return "NONE"
# Detect sensitive information
def check_sensitive_info(content):
    sensitive_info = []
    for regex in SENSITIVE_INFO_REGEX_LIST:
        matches = re.findall(regex, content)
        if matches:
            for match in matches:
                sensitive_info.extend(match)
    return "Possible" if sensitive_info else "NONE"
# Decide whether the URL may be a download link
def is_downloadable(url):
    try:
        r = requests.head(url, allow_redirects=True, verify=False)
        content_type = r.headers.get('content-type')
        if content_type and (content_type.startswith('application')
                             or content_type.startswith('httpd/unix-directory')):
            return "Possible"
    except requests.exceptions.RequestException:
        return "NONE"
    return "NONE"
# Output the results
def output_results(output, output_file=None):
    headers = ["URL", "Middleware version", "Other version", "Source code leakage",
               "Information leakage", "Download files?"]
    table = tabulate(output, headers, tablefmt='simple')
    print(table)
    if output_file:
        with open(output_file, 'w', newline='') as file:
            file.write(table)
# Thread worker function
def worker(url, compiled_regex_dict):
    semaphore.acquire()  # Block if the maximum number of threads is already running
    try:
        content = read_content(url)
        middleware, other = version_detection(content, compiled_regex_dict)
        language = match_programming_language(content)
        sensitive = check_sensitive_info(content)
        downloadable = is_downloadable(url)
        output.append([url, middleware, other, language, sensitive, downloadable])
    finally:
        semaphore.release()  # Release the semaphore so other threads can run
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Sensitive information detection')
    parser.add_argument('-u', '--url', type=str, help='target URL')
    parser.add_argument('-f', '--file', type=str, help='input file of URLs')
    parser.add_argument('-o', '--output', type=str, help='output file')
    parser.add_argument('-t', '--thread', type=int, default=5,
                        help='number of threads (default: 5)')
    args = parser.parse_args()
    output = []
    if args.url:
        content = read_content(args.url)
        middleware, other = version_detection(content, COMPILED_REGEX_DICT)
        language = match_programming_language(content)
        sensitive = check_sensitive_info(content)
        downloadable = is_downloadable(args.url)
        output.append([args.url, middleware, other, language, sensitive, downloadable])
        output_results(output, args.output)
    elif args.file:
        # Create the thread semaphore
        semaphore = threading.BoundedSemaphore(args.thread)
        with open(args.file, 'r') as f:
            urls = f.read().splitlines()
        with tqdm(total=len(urls)) as pbar:
            threads = []
            for url in urls:
                t = threading.Thread(target=worker, args=(url, COMPILED_REGEX_DICT))
                t.start()
                pbar.update(1)
                threads.append(t)
            # Wait for all threads to finish
            for t in threads:
                t.join()
        output_results(output, args.output)
Test
In the test case (192.168.164.134:8080 is a web service I set up myself, running Tomcat 8.5.19):
The first leaks an unknown version.
The second and third both leak the Tomcat version.
The fourth leaks an unknown version and JavaScript source code.
The fifth may leak information; the leaked data includes email addresses.
The sixth, seventh, and eighth show download behavior: their response packets contain httpd/unix-directory or application.
Disclaimer
This tool may only be used in security work for enterprises from which you have obtained sufficient legal authorization. When using this tool, you must ensure that all your actions comply with local laws and regulations. If you commit any illegal act while using this tool, you bear all consequences yourself; the developers and contributors of this tool assume no legal or joint liability. Do not use this tool unless you have fully read, fully understood, and accepted all terms of this agreement. Your use of the tool, or your acceptance of this agreement in any other express or implied manner, is deemed to mean that you have read and agreed to be bound by this agreement.