How to identify malicious requests and perform anti-crawler operations?

Foreword

These past few days I have felt more and more strongly that business needs drive the development of technology. Without business needs behind it, everything is empty talk.

A while ago, a question I answered on Zhihu suddenly caught fire, and traffic to my mini program surged, as shown below:

[screenshots of the mini program's traffic statistics]

At the peak, there were 200+ requests per minute from many different IPs, which works out to roughly 3 to 5 requests per second, call it 5 QPS at most. (Suddenly that feels very, very small.)

 

The system already has rate limiting and a cache in front of it, so several thousand QPS would be no problem.

So what I want to write about today is not high concurrency, but how to identify malicious requests and malicious attacks, and how to intercept them.

Because the code is open source, all of the API endpoints are completely exposed, so there are always malicious requests hitting my API. They do no real damage, but they are always irritating.

 

IP restrictions

Here is the interceptor code I have been using all along:

  1 package com.gdufe.osc.interceptor;
  2 
  3 import com.alibaba.fastjson.JSON;
  4 import com.gdufe.osc.common.OscResult;
  5 import com.gdufe.osc.enums.OscResultEnum;
  6 import com.gdufe.osc.service.RedisHelper;
  7 import com.gdufe.osc.utils.IPUtils;
  8 import lombok.extern.slf4j.Slf4j;
  9 import org.apache.commons.lang3.StringUtils;
 10 import org.springframework.beans.factory.annotation.Autowired;
 11 import org.springframework.lang.Nullable;
 12 import org.springframework.web.servlet.HandlerInterceptor;
 13 import org.springframework.web.servlet.ModelAndView;
 14 
 15 import javax.servlet.http.HttpServletRequest;
 16 import javax.servlet.http.HttpServletResponse;
 17 import java.util.Map;
 18 
 19 /**
 20  * @Author: yizhen
 21  * @Date: 2018/12/28 12:11
 22  */
 23 @Slf4j
 24 public class IPBlockInterceptor implements HandlerInterceptor {
 25 
 26     /** More than 50 requests within 10s is treated as hammering the interface and must be limited */
 27     private static final long TIME = 10;
 28     private static final long CNT = 50;
 29     private Object lock = new Object();
 30 
 31     /** Restrict based on the User-Agent request header */
 32     private static final String USERAGENT = "User-Agent";
 33     private static final String CRAWLER = "crawler";
 34 
 35     @Autowired
 36     private RedisHelper<Integer> redisHelper;
 37 
 38     @Override
 39     public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
 40         synchronized (lock) {
 41             boolean checkAgent = checkAgent(request);
 42             boolean checkIP = checkIP(request, response);
 43             return checkAgent && checkIP;
 44         }
 45     }
 46 
 47     private boolean checkAgent(HttpServletRequest request) {
 48         String header = request.getHeader(USERAGENT);
 49         if (StringUtils.isEmpty(header)) {
 50             return false;
 51         }
 52         if (header.contains(CRAWLER)) {
 53             log.error("Suspicious request header, blocked ==> User-Agent = {}", header);
 54             return false;
 55         }
 56         return true;
 57     }
 58 
 59     private boolean checkIP(HttpServletRequest request, HttpServletResponse response) throws Exception {
 60         String ip = IPUtils.getClientIp(request);
 61         String url = request.getRequestURL().toString();
 62         String param = getAllParam(request);
 63         boolean isExist = redisHelper.isExist(ip);
 64         if (isExist) {
 65             // Key already exists: just increment the counter
 66             int cnt = redisHelper.incr(ip);
 67             if (cnt > IPBlockInterceptor.CNT) {
 68                 OscResult<String> result = new OscResult<>();
 69                 response.setCharacterEncoding("UTF-8");
 70                 response.setHeader("content-type", "application/json;charset=UTF-8");
 71                 result = result.fail(OscResultEnum.LIMIT_EXCEPTION);
 72                 response.getWriter().print(JSON.toJSONString(result));
 73                 log.error("ip = {}, requesting too fast, limited", ip);
 74                 // Store the IP with no expiry, i.e. add it to the blacklist
 75                 redisHelper.set(ip, --cnt);
 76                 return false;
 77             }
 78             log.info("ip = {}, request #{} to {} within {}s, params = {}, allowed", ip, cnt, url, TIME, param);
 79         } else {
 80             // First request within the window
 81             redisHelper.setEx(ip, IPBlockInterceptor.TIME, 1);
 82             log.info("ip = {}, request #1 to {} within {}s, params = {}, allowed", ip, url, TIME, param);
 83         }
 84         return true;
 85     }
 86 
 87     private String getAllParam(HttpServletRequest request) {
 88         Map<String, String[]> map = request.getParameterMap();
 89         StringBuilder sb = new StringBuilder("[");
 90         map.forEach((x, y) -> {
 91             String s = StringUtils.join(y, ",");
 92             sb.append(x + " = " + s + ";");
 93         });
 94         sb.append("]");
 95         return sb.toString();
 96     }
 97 
 98     @Override
 99     public void postHandle(HttpServletRequest request, HttpServletResponse response, Object handler, @Nullable ModelAndView modelAndView) throws Exception {
100     }
101 
102     @Override
103     public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, @Nullable Exception ex) throws Exception {
104     }
105 }

 

Let me briefly walk through the code.

Look at lines 41 and 42: I perform two layers of interception.

The first layer blocks requests with non-compliant User-Agent headers; for example, any header that identifies a crawler is rejected outright.

The second layer is an IP check. If one IP calls my API more than 50 times within 10 seconds, I take that as hammering the interface, in other words, a crawler.

In that case I write the IP straight into Redis with no expiry, so every subsequent request from it is blocked.
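The RedisHelper used above is not shown in the post. For reference, here is a minimal sketch of the methods the interceptor needs, assuming it is backed by Spring Data Redis's StringRedisTemplate; the signatures are inferred from the calls in the interceptor, not taken from the original project.

import java.util.concurrent.TimeUnit;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

/** Sketch of the RedisHelper the interceptor depends on (inferred, not the original). */
@Component
public class RedisHelper<T> {

    @Autowired
    private StringRedisTemplate redisTemplate;

    /** True if the key already exists in Redis. */
    public boolean isExist(String key) {
        return Boolean.TRUE.equals(redisTemplate.hasKey(key));
    }

    /** Atomically increment the counter at key and return the new value. */
    public int incr(String key) {
        Long value = redisTemplate.opsForValue().increment(key, 1);
        return value == null ? 0 : value.intValue();
    }

    /** Set a counter with no expiry (used to blacklist an IP). */
    public void set(String key, int value) {
        redisTemplate.opsForValue().set(key, String.valueOf(value));
    }

    /** Set a counter that expires after the given number of seconds. */
    public void setEx(String key, long seconds, int value) {
        redisTemplate.opsForValue().set(key, String.valueOf(value), seconds, TimeUnit.SECONDS);
    }
}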

That is the first method.
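One more piece the post does not show: for the @Autowired RedisHelper to be injected, the interceptor has to be created as a Spring bean and registered with MVC. A minimal sketch of that wiring, assuming a standard Spring Boot setup (the config class itself is hypothetical, not from the original project):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

/** Hypothetical wiring; not part of the original post. */
@Configuration
public class WebConfig implements WebMvcConfigurer {

    /** Expose the interceptor as a bean so its @Autowired fields get injected. */
    @Bean
    public IPBlockInterceptor ipBlockInterceptor() {
        return new IPBlockInterceptor();
    }

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Apply the User-Agent and IP checks to every endpoint
        registry.addInterceptor(ipBlockInterceptor()).addPathPatterns("/**");
    }
}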

 

Counting accesses per IP

But some IPs crawl slowly, say only 20 to 30 requests every 10 seconds, staying under the limit, yet they keep requesting and scraping without pause, never stopping.

It does no great harm, but it is still very annoying.

Let's look at the logs the application prints:

2019-06-01 16:21:24.271 [http-nio-8083-exec-5] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 106.121.145.154, request #1 to zhihu/spider/get within 10s, params = [type = 1;offset = 80;limit = 10;], allowed
2019-06-01 16:21:24.271 [http-nio-8083-exec-5] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 356
2019-06-01 16:21:24.775 [http-nio-8083-exec-3] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 120.229.218.95, request #1 to zhihu/spider/get within 10s, params = [type = 1;offset = 70;limit = 10;], allowed
2019-06-01 16:21:24.775 [http-nio-8083-exec-3] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 612
2019-06-01 16:21:32.050 [http-nio-8083-exec-10] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 105.235.134.202, request #1 to zhihu/spider/get within 10s, params = [type = 2;offset = 0;limit = 10;], allowed
2019-06-01 16:21:32.050 [http-nio-8083-exec-10] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 93
2019-06-01 16:21:32.320 [http-nio-8083-exec-7] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 120.229.218.95, request #2 to zhihu/spider/get within 10s, params = [type = 1;offset = 80;limit = 10;], allowed
2019-06-01 16:21:32.320 [http-nio-8083-exec-7] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 100
2019-06-01 16:21:33.755 [http-nio-8083-exec-2] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 106.17.6.118, request #1 to zhihu/spider/get within 10s, params = [type = 1;offset = 80;limit = 10;], allowed
2019-06-01 16:21:33.755 [http-nio-8083-exec-2] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 107
2019-06-01 16:21:33.805 [http-nio-8083-exec-9] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 123.120.29.78, request #1 to zhihu/spider/get within 10s, params = [type = 1;offset = 80;limit = 10;], allowed
2019-06-01 16:21:33.805 [http-nio-8083-exec-9] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 1057
2019-06-01 16:21:35.697 [http-nio-8083-exec-6] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 106.121.145.154, request #1 to zhihu/spider/get within 10s, params = [type = 1;offset = 90;limit = 10;], allowed
2019-06-01 16:21:35.697 [http-nio-8083-exec-6] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 1030
2019-06-01 16:21:36.197 [http-nio-8083-exec-1] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 120.229.218.95, request #1 to zhihu/spider/get within 10s, params = [type = 2;offset = 0;limit = 10;], allowed
2019-06-01 16:21:36.198 [http-nio-8083-exec-1] INFO  c.gdufe.osc.service.impl.ZhiHuSpiderImpl - [] - random image position: 2384
2019-06-01 16:21:36.725 [http-nio-8083-exec-8] INFO  c.g.osc.interceptor.IPBlockInterceptor - [] - ip = 183.236.187.208, request #1 to zhihu/spider/get within 10s, params = [type = 1;offset = 0;limit = 10;], allowed

 

Each visiting IP produces two log lines: one records the IP and the path it requested; the other (the random image position) is irrelevant here.

But how do I count how many times each IP has visited in total?

 

The shell script is as follows:

 

#!/bin/bash
# Copy the log into the current directory
cp /home/tomcat/apache-tomcat-8.5.23/workspace/osc/osc.log /home/shell/java/osc.log
# Replace the dots with colons, e.g. 120.74.147.123 becomes 120:74:147:123,
# so that sort below can split the octets on ':'
sed -i "s/\./:/g" osc.log
# Keep only the lines that contain an IP (they all contain "limit") and print just the IP field
awk '/limit/ {print $11}' osc.log > temp.txt
# Sort numerically octet by octet, count occurrences per IP, sort by count descending, keep the top 50
cat temp.txt | sort -t ':' -k1n -k2n -k3n -k4n | uniq -c | sort -nr | head -n 50 > result.txt
# Remove the temporary files
rm -rf temp.txt osc.log

 

This involves quite a few commands, all of which I crammed one by one today (last-minute studying).

The final result of the run is as follows:

The first column is the number of accesses; the second column is the IP.

At a glance, 43.243.12.43 is clearly not normal. It must be a crawler, so I simply ban it outright.
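With the interceptor above, banning an IP by hand only takes seeding its Redis counter past the threshold with no expiry; checkIP will then reject every later request. A minimal sketch, assuming RedisHelper stores plain integer counters keyed by the raw IP string, exactly as checkIP does:

# Push the counter past CNT (50) with no TTL; every later request from
# this IP increments it further and stays blocked.
redis-cli SET 43.243.12.43 51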

 

Afterword

 

I really did learn something from business needs driving technology. Only when there is real demand and real business does technology move forward.

 

Fellow readers, do you have any other anti-crawler techniques? Let's trade notes.

 


Origin www.cnblogs.com/wenbochang/p/10960066.html