superl-url: a free and open source keyword URL collection tool for security penetration testing

#####################
Disclaimer: A tool is neither good nor bad in itself. Please use this tool only in compliance with applicable laws such as the Cybersecurity Law, and only in support of research and learning. Do not use it for illegal or criminal activity; neither I nor the developer bears any responsibility for losses caused by malicious use of this tool.
######################

superl-url keyword URL collection:

A free and open source keyword URL collection tool written in Python.

A lightweight program that collects the URLs returned in search engine results for given keywords.

The program is mainly used in security penetration testing projects and for batch assessment of how 0-day vulnerabilities affect various CMS systems; it also works as a small utility for batch-collecting websites you are interested in.

It automatically collects information such as the real address and title of relevant websites from search engines, saves the results to files, and removes duplicate URLs. You can also configure multiple domain names to ignore.

Program features
Supports collecting from multiple search engines at the same time (Baidu, Sogou, and 360 are built in); the modular structure makes it easy to extend with more engines.

What you get is the real URL of each search result, not the search engine's intermediate link.
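
For illustration only, here is a minimal sketch of how a search engine's redirect link can be resolved to the real address, assuming the requests package is available (an assumption; the project's own implementation may use a different HTTP client):

# Minimal sketch: resolving a search-engine redirect link to the real URL.
# Assumes the `requests` package is installed; the project's own code may
# use a different HTTP client and more error handling.
import requests

def resolve_real_url(redirect_url, timeout=10):
    """Follow the engine's redirect link and return the final (real) URL."""
    try:
        resp = requests.get(redirect_url, timeout=timeout, allow_redirects=True)
        return resp.url
    except requests.RequestException:
        return None

# Example: a Baidu result link such as "https://www.baidu.com/link?url=..."
# would resolve to the target site's real address.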

Cross-platform, written in Python, with all code fully open source, so there is no risk of bundled backdoors and updates are easy. Most URL collection software found online ships as Windows executables, and many stop working once a search engine is updated.

Powerful filtering: it can filter out a variety of top-level domains as well as URLs whose title contains a specified keyword, for example excluding all youku.com subdomain URLs from the results. Filters can also be configured through TXT files.
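
As a rough sketch of this kind of domain filtering, the tldextract dependency installed below can be used like this (the filter list and function name here are illustrative, not the project's actual API):

# Rough sketch of domain filtering with tldextract (a listed dependency).
# The filter list and function name are illustrative only.
import tldextract

IGNORED_DOMAINS = {"youku.com"}  # registered domains to skip

def is_filtered(url):
    """Return True if the URL belongs to a domain we want to ignore."""
    ext = tldextract.extract(url)
    registered = "{}.{}".format(ext.domain, ext.suffix)
    return registered in IGNORED_DOMAINS

print(is_filtered("http://v.youku.com/v_show/abc"))  # True: youku.com subdomain
print(is_filtered("http://www.example.com/page"))    # False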

Automatically removes duplicate URLs.

The saved result format can be customized flexibly through the configuration file: for example, output only the original real URL with its parameters, only the domain name, or include the title and search engine name as well.

The search engines used for collection can be switched on and off individually; for example, to use only Baidu, set the other engines' parameters to False.

Compatible with both Python 3 and Python 2. A conscientious little open source project~~~

The number of results displayed per page can be set separately for each search engine (where the engine itself supports it).

Supports multi-process collection, with one process per search engine.
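
Schematically, the one-process-per-engine design looks like the following sketch (collect_from_engine is a placeholder, not the project's real function):

# Schematic sketch of the "one process per search engine" design.
# collect_from_engine is a placeholder, not the project's actual code.
from multiprocessing import Process

def collect_from_engine(engine, keyword, pages):
    print("collecting '{}' from {} ({} pages)".format(keyword, engine, pages))

if __name__ == "__main__":
    procs = [Process(target=collect_from_engine, args=(name, "hacker", 3))
             for name in ("baidu", "sougou", "so")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()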

The interval between collecting each page can be customized to avoid being blocked.

Displays the [real URL] and [title] of collected pages in real time; the leading [ID] corresponds to the item's position in the search engine's results on the current page.

The save type can be customized; currently results can be saved to a local txt file or written to a remote MySQL database.

superl-url installation and use

git clone https://github.com/super-l/superl-url.git

Install dependencies:

Python 3:
pip install ConfigParser
pip install tldextract

Python 2:
pip install tldextract
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple configparser

If a module is reported as missing, install it as prompted.

Instructions for use
To collect websites related to the keyword "hacker" from the first 3 pages of search results, enter the following at the prompts:

please input keyword:hacker

Search Number of pages:3


Configuration file (config.cfg) description:

| Node | Parameter | Example value | Description |
| --- | --- | --- | --- |
| global | save_type | mysql | Save type: file or mysql. If file, results are saved to a local txt file. |
| global | sleep_time | 1 | Wait time (seconds) after each page of a search is processed, to avoid being blocked for querying the engine too frequently. |
| url | url_type | realurl | URL type written to the saved txt: realurl = real website address, baseurl = original search engine address, urlparam = real website address with parameters. |
| filter | filter_status | True | Whether filtering is enabled; if disabled, the domain and title filters have no effect. |
| filter | filter_domain | True | Whether to filter by domain name. |
| filter | filter_title | True | Whether to filter by title. |
| log | write_title | True | Whether to display the title. |
| log | write_name | True | Whether to display the search engine name. |
| engine | baidu | True | Whether the Baidu search engine module is enabled. |
| engine | sougou | True | Whether the Sogou module is enabled. |
| engine | so | False | Whether the 360 (so) module is enabled (it cannot be crawled at the moment). |
| pagesize | baidu_pagesize | 50 | Number of results per page for Baidu. |
| pagesize | sougou_pagesize | 50 | Number of results per page for Sogou. |
| pagesize | so_pagesize | 10 | Number of results per page for 360. |
| mysql | host | 127.0.0.1 | If the save type is mysql, this node must be configured correctly. |
| mysql | port | 3306 | Port. |
| mysql | user | root | User name. |
| mysql | password | root | Password. |
| mysql | database | superldb | Database name. |
| mysql | table | search_data | Table name. |
| file | save_pathdir | result | If the save type is file, the save path; by default the result folder under the program's root directory. |
| plugin | pr | True | Reserved plugin function, not supported yet. |
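
For illustration, a minimal sketch of reading these settings with configparser (a listed dependency); the section and key names follow the table above, but the project's actual loader may differ:

# Minimal sketch: reading the documented config.cfg keys with configparser.
# Section and key names follow the table above; the project's own loader may differ.
try:
    import configparser                      # Python 3
except ImportError:
    import ConfigParser as configparser      # Python 2

config = configparser.ConfigParser()
config.read("config.cfg")

save_type = config.get("global", "save_type")            # "file" or "mysql"
sleep_time = config.getint("global", "sleep_time")       # pause between pages
baidu_enabled = config.getboolean("engine", "baidu")     # toggle an engine
baidu_pagesize = config.getint("pagesize", "baidu_pagesize")
print(save_type, sleep_time, baidu_enabled, baidu_pagesize)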


SQL statement to create the database table:


CREATE TABLE `search_data` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `engine` varchar(20) NOT NULL DEFAULT '',
  `keyword` varchar(100) NOT NULL DEFAULT '',
  `baseurl` varchar(255) NOT NULL DEFAULT '',
  `realurl` varchar(255) NOT NULL DEFAULT '',
  `urlparam` varchar(255) NOT NULL DEFAULT '',
  `webtitle` varchar(255) NOT NULL DEFAULT '',
  `create_time` int(10) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=395 DEFAULT CHARSET=utf8;
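
As an illustration only, a row could be written to this table from Python roughly as below. pymysql is used here as the driver, which is an assumption (the project may use a different MySQL client), and the example data is made up; the column names follow the CREATE TABLE statement above.

# Sketch of writing one result row into the search_data table.
# pymysql is an assumed driver; column names match the schema above.
import time
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                       password="root", database="superldb", charset="utf8")
sql = ("INSERT INTO search_data "
       "(engine, keyword, baseurl, realurl, urlparam, webtitle, create_time) "
       "VALUES (%s, %s, %s, %s, %s, %s, %s)")
row = ("baidu", "hacker", "https://www.baidu.com/link?url=example",
       "http://www.example.com/", "http://www.example.com/?id=1",
       "Example title", int(time.time()))
with conn.cursor() as cur:
    cur.execute(sql, row)
conn.commit()
conn.close()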

Origin: blog.csdn.net/u014374009/article/details/128991614