I've been learning Python web scraping recently, and a colleague working on a honeypot happened to need information for all 65535 ports, so I scraped the data in my spare time.
Fortunately, a site provides exactly this information: https://www.speedguide.net/port.php?port=21. The URL is fixed except for the port parameter, which is just the port number, so iterating over 0-65535 retrieves the information for every port.
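Since only the port parameter changes, the full list of URLs can be generated with a simple format string (a minimal sketch; the name port_urls is mine, not from the crawler below):

```python
# Build one lookup URL per port; only the port query parameter varies.
BASE = "https://www.speedguide.net/port.php?port={}"

def port_urls(start=0, end=65535):
    """Yield the speedguide.net lookup URL for each port in [start, end]."""
    for port in range(start, end + 1):
        yield BASE.format(port)

print(next(port_urls()))  # https://www.speedguide.net/port.php?port=0
```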
We only need the contents of the results table on that page, so I wrote a function to extract it. Parsing the page turned out to be a headache: I started with XPath, but the table cells contain nested HTML tags, and text extraction gets truncated at those tags, so I fell back to BeautifulSoup. Another wrinkle is that each port lists a different number of protocol entries, so the parser has to handle a variable row count. The site is also not very stable and has anti-scraping measures. Working around them is easy enough: just add a delay between requests and let the script run slowly on a server. A proxy pool would also work, but I didn't buy one and the free ones are unreliable, so patience it is. My parsing approach is a bit unusual and there are surely better ways; this is just my own take, and corrections are welcome. Source code below:
#coding=utf8
import sys

# Python 2 only: force the default encoding to UTF-8 so mixed-encoding
# strings do not raise UnicodeDecodeError. On Python 3 the default is
# already 'utf-8' (and reload is no longer a builtin), so this is skipped.
if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')
import requests
from bs4 import BeautifulSoup
import pymysql
import time

# A desktop browser User-Agent, to look less like a bot to the site.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101",
}
def getMiddleStr(content, startStr, endStr):
    """Return the part of content between startStr and endStr.

    str.index raises ValueError when a marker is missing, which the
    caller's try/except treats as a failed page.
    """
    startIndex = content.index(startStr) + len(startStr)
    endIndex = content.index(endStr)
    return content[startIndex:endIndex]
def parse(source):
    """Pull the table cells out of the page and insert one row per record."""
    soup = BeautifulSoup(source, "html.parser")
    # .text flattens any HTML nested inside a cell, which is why this works
    # where a plain XPath text() extraction got truncated.
    cells = [td.text for td in soup.find_all("td")]
    fields = cells[1:]  # drop the leading header cell
    # Each record is five cells: port, protocol, service, details, source.
    # Different ports list different numbers of protocols, so walk whatever
    # is there in steps of 5, ignoring any trailing partial record.
    for i in range(0, len(fields) - 4, 5):
        args = tuple(fields[i:i + 5])
        # Parameterized query: the driver escapes the values, so quotes in
        # the scraped details text cannot break the statement.
        sql = ("INSERT INTO port(port,protocol,service,details,source) "
               "VALUES (%s,%s,%s,%s,%s)")
        try:
            cursor.execute(sql, args)
            db.commit()
        except Exception:
            db.rollback()
try:
    # Placeholder credentials: replace with the real host/user/password/db.
    db = pymysql.connect(host="ip", user="username",
                         password="passw", database="dbname")
    cursor = db.cursor()
    print("Connected to the database.")
    print("Starting to collect port information into the database.")
    print("*" * 60)
except Exception as e:
    # Without a working connection there is nothing to do.
    print("Database connection failed: " + str(e))
    sys.exit(1)
print("Starting port collection.")
print("+" * 60)

for port in range(65536):
    url = "https://www.speedguide.net/port.php?port=" + str(port)
    try:
        response = requests.get(url, headers=headers, timeout=30)
        # Cut the page down to the results table before parsing.
        parse(getMiddleStr(response.text,
                           '<table class="port-outer">', "records found"))
        print("Collected data for port " + str(port))
    except Exception as e:
        print("Failed to collect data for port " + str(port) + ": " + str(e))
    # A fixed delay between requests keeps the anti-scraping checks happy.
    time.sleep(5)

db.close()
print("+" * 60)
print("Collection finished.")
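The XPath truncation mentioned above comes down to nested tags inside the cells: text() stops at the first child element, while BeautifulSoup's .text recursively joins every text node under a tag. A minimal sketch with a made-up cell (the sample HTML is mine, not from the site):

```python
from bs4 import BeautifulSoup

# A details cell containing nested HTML, like the ones on the port pages.
# .text concatenates all text nodes under the <td>, so the nested <a>
# markup disappears but its text survives.
html = '<td>FTP <a href="#">control</a> channel</td>'
cell = BeautifulSoup(html, "html.parser").td
print(cell.text)  # FTP control channel
```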