Filter crawl data from search engines

Some rules

  • The search engine is not case sensitive;
  • Google limits a search to a maximum of 32 words, including search terms and advanced operators. There are ways around this limit, such as using wildcards in place of some search terms.
  • The same search syntax sometimes gives surprising results on Baidu.
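
The 32-word limit can be checked before a query is submitted. This is a minimal sketch, assuming that words are separated by whitespace and that an operator term such as site:example.com counts as one word. The helper names and the counting rule are illustrative assumptions, not Google's documented behavior.

```python
MAX_WORDS = 32  # limit stated in the rules; the exact counting rule is an assumption


def count_query_words(query: str) -> int:
    """Count whitespace-separated tokens in a search query."""
    return len(query.split())


def fits_word_limit(query: str, limit: int = MAX_WORDS) -> bool:
    """Return True when the query stays within the word limit."""
    return count_query_words(query) <= limit


query = "site:example.com inurl:admin_login intitle:login"
print(count_query_words(query), fits_word_limit(query))
```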

Basic query

  • inurl
    searches for URLs that contain specific characters. For example, inurl:/admin_login finds URLs containing admin_login, which are usually administrator back-end login pages.

  • intext
    searches for the specified characters in the body of a web page, for example intext:后台登陆 ("back-end login").
    This operator works like the "article content search" feature on some websites: it uses text from the page body as the search condition.

  • intitle
    searches for pages whose title contains specific characters. For example, intitle:后台登陆 returns pages with 后台登陆 ("back-end login") in the title. It works like intext above, except that it matches against the page title: intitle:"security angel" returns all pages with "security angel" in the title. allintitle works similarly to intitle, and allintext is the counterpart of intext, for example:
    allintext:家庭住址

  • filetype
    searches for files of the specified type. For example, filetype:pdf returns PDF documents. This operator is recommended both for broad, wide-net searches and for the targeted file-type searches discussed later.
    Other commonly searched extensions: .doc .bak .db .mdb .inc

  • site
    restricts the search to a domain name. This filter is quite precise and serves as the basis for the compound queries later. For example, site:www.sunghost.cn shows all indexed URLs for that website.


  • cache
    shows the cached snapshot of a page. When a page can no longer be accessed, or you want to see an earlier snapshot of it, use cache, for example cache:www.juwan888.com.

  • define
    searches for the definition of a word. For example, define:微积分 returns definitions of calculus.

  • info
    finds some basic information about the specified site. For example, info:www.douban.com returns introductions, news and announcements about Douban.


  • link
    finds pages that link to the specified URL. For example, link:www.cnblogs.com/mysticbinary returns the URLs of all pages that contain a link to www.cnblogs.com/mysticbinary.
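
The operators in this list all share the form operator:value, with no space after the colon. This is a minimal sketch of composing such terms and URL-encoding the result for a browser. The helper names and the base search URL are assumptions made for illustration.

```python
from urllib.parse import quote_plus


def op(name: str, value: str) -> str:
    """Build one operator term such as inurl:/admin_login (no space after the colon)."""
    return f"{name}:{value}"


def search_url(query: str, base: str = "https://www.google.com/search?q=") -> str:
    """URL-encode a query so it can be opened in a browser; the base URL is an assumption."""
    return base + quote_plus(query)


query = " ".join([op("site", "www.sunghost.cn"), op("inurl", "/admin_login")])
print(query)
print(search_url(query))
```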



Symbol use

Google is not case sensitive, except for the Boolean operator OR, which must be written in uppercase. The symbols work much like regular expression matching, and many of the rules carry over.

  • "Exact search"
    wrap the terms in English (ASCII) quotation marks "" to require that results match the phrase exactly.

  • The wildcard *
    must be used inside "", for example "kali * web渗透测试"

  • The dot .
    like the wildcard asterisk *, must also be used inside "". The difference is that the dot matches a single character rather than a word or phrase. Reserved symbols include , . [ ( - and so on.
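
Since the symbols resemble regular expressions, a quoted phrase with a * wildcard can be approximated locally, for example to re-check result titles that were saved earlier. This is a rough sketch. It assumes that * stands for at least one non-space character, which is an illustrative simplification and not the search engine's actual matching rule.

```python
import re


def phrase_to_regex(phrase: str) -> re.Pattern:
    """Turn a quoted phrase containing * into a case-insensitive regex.

    Each * is assumed to stand for at least one non-space character.
    """
    parts = [re.escape(part.strip()) for part in phrase.split("*")]
    return re.compile(r"\s*\S.*?\s*".join(parts), re.IGNORECASE)


pattern = phrase_to_regex("kali * web渗透测试")
print(bool(pattern.search("Kali Linux web渗透测试 入门")))
print(bool(pattern.search("kali web渗透测试")))
```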


Boolean logic

  • Logical AND: a space, or AND

  • Logical OR: | or OR, for example (java | php)

  • Logical NOT: -, for example -java
    excludes results that contain the term

  • Constraint +
    "mysticbinar" + "impossible thing to send"

  • Parenthesis grouping ()
    works the same as in regular expressions

  • Time frame 2020..2020
    "美团*术" 2020..2020

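The Boolean symbols above combine freely within one query. This is a small sketch of composing them programmatically. The helper names are made up for illustration, and the search terms are arbitrary examples.

```python
def any_of(*terms: str) -> str:
    """Group alternatives with | inside parentheses, e.g. (java | php)."""
    return "(" + " | ".join(terms) + ")"


def exclude(term: str) -> str:
    """Prefix a term with - so that results containing it are dropped."""
    return "-" + term


def all_of(*terms: str) -> str:
    """Join terms with spaces, which the search engine treats as AND."""
    return " ".join(terms)


query = all_of('"web"', any_of("java", "php"), exclude("python"), "2019..2020")
print(query)
```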


Compound query

The following queries filter out the desired data from a security perspective. Searching with Google syntax is itself a kind of data analysis, except that the data being analyzed comes from the search engine. Two points matter:

  1. You have to know what you are looking for first, so that its characteristics can be analyzed.
  2. The search engine must have collected (crawled) the data before it can be filtered. Some small sites have only a few URLs indexed, and however precise your filtering syntax is, it will find nothing there.

Site information collection

  • Subdomain query
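# Use site to limit the scope and * as a wildcard, then use - to exclude the main domain; what remains are the subdomains:
site:*.jd.com -www.jd.com
  • C-segment (/24 range) query
# If you know the site's IP address, you can also combine site with a wildcard to find sites on the same C segment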
site:18.18.18.*
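
Once the result URLs of a subdomain query have been collected, the distinct subdomains can be pulled out with the standard library. This is a minimal sketch. The sample URLs are made up for illustration, and how the result URLs are collected is left out.

```python
from urllib.parse import urlsplit


def subdomains(urls, root="jd.com", skip=("www.jd.com",)):
    """Return the sorted, distinct hostnames under `root`, excluding those in `skip`."""
    hosts = set()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.endswith("." + root) and host not in skip:
            hosts.add(host)
    return sorted(hosts)


# Hypothetical result URLs, for illustration only.
results = [
    "https://item.jd.com/100.html",
    "https://www.jd.com/",
    "http://Passport.jd.com/login",
    "https://example.org/jd.com",
]
print(subdomains(results))
```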

Filter out function

  • Login interface search
callback=|api=|interface=|function=|functions=|count=
passlogin|ftppwd|password|secret|credentials|token
conf|config|security|jdbc|auth|system|db|ini|init
security_credentials|connectionstring
ssh2_auth_password|send_keys
doc|docx|xls|xlsx|pdf
oa|rem|ehr|cms|main|wp|test|ceshiboos|bossbook
word|master|count|log|login|reg|register|phpMyAdmin

site:jd.com intext:管理|后台|登陆|用户名|密码|帐号|注册|admin|login|manage
site:jd.com intext:管理|后台|登录|用户名|密码|验证码|系统|账号|服务端|后端|phpMyAdmin
site:jd.com intitle:管理|后台|登录|用户名|密码|验证码|系统|账号|服务端|后端|phpMyAdmin
site:jd.com intext:(password|passcode|pass|密码) intext:(username|userid|user|用户|账户)
site:jd.com intext:oa|rem|ehr|system|test|guanli|denglu|manager|register|houtai|guanli|forgotten
site:ly.com intext:rem|ehr|guanli|denglu
site:jd.com intext:"Powered by"
  • Search for specific functions
site:jd.com inurl:ewebeditor|editor|uploadfile|eweb|edit|php?id=|asp?id=
site:jd.com inurl:upload|upfile|saveup intext:提交|确定|上传
site:jd.com inurl:"path="|"readfile="|"file="|"url="
site:jd.com intext:提交|确定|评论
site:jd.com intext:个人信息管理|会员|个人空间 OR inurl:member|zone

site:jd.com inurl:"/uddiexplorer/SetupUDDIExplorer.jsp"
site:jd.com inurl:admin|login|manage|manager|register|prelogin|logincheck
site:jd.com inurl:admin|administration|administrator|manage|login|sys|managetem|password|username
site:jd.com inurl:login|admin|manage|admin_login|login_admin|system|boos|master|main|cms|wp
site:jd.com inurl:oa|rem|ehr|system|test|guanli|denglu
site:jd.com inurl:*"gk"*|*"publick"*|*"pub"*

site:jd.com intext:"sql syntax near"|"syntax error has occurred"| "incorrect syntax near"|"unexpected end of SQL command"|"mysql_connect()"|"mysql_query()"|"Warning: pg_connect()"
site:jd.com intext:"/var/lib/"|"/var/www/"|"D:\"|"C:\"
# Check whether any leftover trojan (webshell) pages remain
site:example.com intext:剑眉大侠|不灭之魂|仗剑孤行|通杀版|法客论坛|上传的口令|"导出DLL文件出错"|"token虚拟机管理"|老子的绝对路径|免杀版
site:example.com intext:法克|后门|木马|小马|大马|脱库|黑客|一句话后门|挂马|清马|"扫描IP"|开放端口|提权|执行命令|设置密码|提升权限
site:example.com intext:一句话木马|过狗|安全狗|"K8飞刀"|"K8拉登哥哥"|"K8搞基大队"|反弹端口|"hacked by"
site:example.com inurl:phpspy|udf|JFolder|JspSpyJDK5|AspxSpy2014Final
site:example.com intext:"Georg says" intext:"All seems fine"
site:example.com intext:"Struts2 Exploit Test"
# Some container (server) signatures
site:example.com intext:"Dumping data for table"
site:example.com intitle:"apache tomcat/" "Apache Tomcat examples"
site:example.com inurl:examples|jsp|snp|snoop.jsp
site:example.com (inurl:"robot.txt" | inurl:"robots.txt") intext:disallow filetype:txt
site:example.com filetype:reg HKEY_CURRENT_USER username
site:example.com inurl:tmp|temp|cache…
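
The queries in this section follow one pattern: site:, then an operator, then keywords joined with |. This is a sketch of generating them for a target domain from a keyword list. The function name and the chunk size are assumptions. Splitting long keyword lists into several shorter queries is only a precaution to keep each query short.

```python
def build_queries(domain, operator, keywords, chunk=8):
    """Yield site-restricted queries, OR-ing at most `chunk` keywords per query."""
    for i in range(0, len(keywords), chunk):
        group = "|".join(keywords[i:i + chunk])
        yield f"site:{domain} {operator}:{group}"


keywords = ["admin", "login", "manage", "manager", "register"]
for q in build_queries("example.com", "inurl", keywords, chunk=3):
    print(q)
```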

Filter out sensitive files

  • Email / QQ / Group
site:example.com intext:qq|qq群|企鹅|腾讯|email|邮件
site:example.com intitle:qq|qq群|企鹅|腾讯|email|邮件
site:example.com intext:邮箱|邮件|email|e-mail
site:example.com intext:"@qq.com"|"@163.com"
site:example.com intext:电话|手机号|联系方式|请拨打
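
After pages matching these queries have been saved, contact details can be extracted from their text with regular expressions. This is a rough sketch. The email pattern is deliberately simple, the phone pattern assumes 11-digit mainland China mobile numbers, and the sample text is invented.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: 11-digit mobile numbers starting with 1 and a second digit of 3-9.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_contacts(text: str) -> dict:
    """Return the distinct emails and phone numbers found in `text`, in order of appearance."""
    return {
        "emails": list(dict.fromkeys(EMAIL_RE.findall(text))),
        "phones": list(dict.fromkeys(PHONE_RE.findall(text))),
    }


# Invented sample text, for illustration only.
sample = "Contact: admin@qq.com or sales@163.com, tel 13800138000, admin@qq.com"
print(extract_contacts(sample))
```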

  • index of/*
site:jd.com index of/*
site:xxx.xxx intitle:index of
Index of /password
Index of / passwd 
"index?of/" config
"Index of /" password.txt
site:example.com intitle:index .of "parent directory"
site:example.com intitle:index .of name size
site:example.com intitle:index .of inurl:admin
site:example.com intitle:index .of "Application Data/Microsoft/Credentials"
site:example.com intitle:index .of etc|.sh_history|.bash_history|passwd|people.lst|htpasswd
  • phpmyadmin
site:ulnetworks.co.kr ?inurl:.php ?intext:CHARACTER_SETS,COLLATIONS, ?intitle:phpmyadmin
  • File search
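This operator cannot be used with |. Why not use Boolean logic to combine everything into a single query?
Because in practice the filetype and ext operators do not cooperate well with Boolean logic
and often return no results at all, so it is better to run several separate queries to improve the hit rate.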

site:jd.com filetype:doc
mdb
ini
php
asp
aspx
jsp
json
xml
pdf
doc
xlsx
xls
csv
git
txt
text
log
sql
cnf
conf
zip
rar
tar
tar.gz
7z
cab
gz
iso
bz2
jar
bkf
bkp
bak
old
backup
dll
ctl
inf
cfg

sql
db
dbf
mdb
wdb
backupdb


site:example.com filetype:doc "密码"
site:example.com filetype:xls|xlsx "密码"
site:example.com filetype:doc intitle:"管理"
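
The one-query-per-extension approach described under File search is easy to automate. This is a minimal sketch that emits one filetype query per extension and drops duplicate extensions. The function name is an assumption, and the extensions shown are a small subset of the list.

```python
def filetype_queries(domain, extensions, keyword=None):
    """Build one site/filetype query per extension, optionally with a quoted keyword."""
    queries = []
    for ext in dict.fromkeys(extensions):  # drop duplicates while keeping order
        query = f"site:{domain} filetype:{ext}"
        if keyword:
            query += f' "{keyword}"'
        queries.append(query)
    return queries


for q in filetype_queries("example.com", ["doc", "xls", "xlsx", "doc"], keyword="密码"):
    print(q)
```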


References

http://absec.cn/?p=751
https://www.cnblogs.com/xuanhun/p/3910134.html

Origin www.cnblogs.com/mysticbinary/p/12703036.html