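Case 6: Session Data Analysis

Day06. Session analysis of website access logs
What is a session? Here, it is a run of requests from the same visitor in which every gap between adjacent requests is under 30 minutes.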

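Data: time-range operations
The fields in the data are:
Visitor IP address
Visitor access time
Requested URL and protocol
Site response code
Site response size (amount of data returned)
Visitor's referral URL (the site the visitor came from)
Visitor's client operating system and browser information
Requirements: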

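1) Extract every session from the access log (if the gap between two adjacent requests from the same user is < 30 minutes, both requests belong to the same session; otherwise they belong to different sessions), and number the requests within each session, as illustrated below:
session_id ip request_time request_url request_order other_fields......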
session1 ip1 2017-10-11 08:10:30 /a 1 ......
session1 ip1 2017-10-11 08:11:20 /b 2 ......
session2 ip1 2017-10-11 09:10:30 /c 1 ......
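Flow:
Basic processing: read the file, split each line, and keep the useful fields

1: Put records with the same IP together (grouping)

2: Within each group, sort the records by time, earliest first

3: Compare the times of adjacent records ----> assign sessions and request order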
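Step 3 of the flow can be sketched in isolation as follows. This is a minimal illustration, not the article's implementation: the class name `SessionGapSketch` and the method `label` are hypothetical, and it assumes the request times of one IP are already sorted and given as epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

class SessionGapSketch {
    // 30-minute rule from the requirement, in milliseconds
    static final long GAP_MILLIS = 30L * 60 * 1000;

    /**
     * Input: request times (epoch millis) of ONE ip, sorted ascending.
     * Output: for each request, {session index, order within session}, both 1-based.
     */
    static List<int[]> label(long[] sortedTimes) {
        List<int[]> labels = new ArrayList<>();
        int session = 0;
        int order = 0;
        for (int i = 0; i < sortedTimes.length; i++) {
            // A new session starts at the first request or when the gap reaches 30 minutes
            boolean startsNew = i == 0 || sortedTimes[i] - sortedTimes[i - 1] >= GAP_MILLIS;
            if (startsNew) {
                session++;
                order = 1;
            } else {
                order++;
            }
            labels.add(new int[] {session, order});
        }
        return labels;
    }

    public static void main(String[] args) {
        // Offsets mirror the example table: 08:10:30, 08:11:20, 09:10:30
        long[] times = {0L, 50_000L, 3_600_000L};
        for (int[] l : label(times)) {
            System.out.println("session" + l[0] + " order " + l[1]);
        }
    }
}
```

With the three offsets above, the first two requests share a session and the third starts a new one, matching the example table.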

2) Summarize each session: the entry and exit pages of each session, the total session duration, and so on, as illustrated below:
session_id ip start_request_time end_request_time entry_page exit_page duration
session1 ip1 2017-10-11 08:10:30 2017-10-11 08:11:20 /a /b 50 s
session2 ip1 2017-10-11 09:10:30 2017-10-11 09:10:30 /c /c default value
session3 ip2 2017-10-11 07:15:10 2017-10-11 07:30:10 /h /x 900 s
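As a rough sketch of how one summary row could be built from the time-ordered requests of a single session: the class `SessionSummarySketch` and its helper type `Hit` are hypothetical names introduced only for this illustration, and the sketch uses `java.time` rather than the `Date` class used in the article's code.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

class SessionSummarySketch {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** One request inside a session (hypothetical helper type). */
    static class Hit {
        final LocalDateTime time;
        final String url;

        Hit(String time, String url) {
            this.time = LocalDateTime.parse(time, FMT);
            this.url = url;
        }
    }

    /** Builds one tab-separated summary row from the time-ordered hits of a session. */
    static String summarize(String sessionId, String ip, List<Hit> hits) {
        Hit first = hits.get(0);
        Hit last = hits.get(hits.size() - 1);
        // Duration is last request time minus first request time, in seconds
        long seconds = Duration.between(first.time, last.time).getSeconds();
        return String.join("\t", sessionId, ip, first.time.format(FMT), last.time.format(FMT),
                first.url, last.url, String.valueOf(seconds));
    }

    public static void main(String[] args) {
        List<Hit> session1 = Arrays.asList(
                new Hit("2017-10-11 08:10:30", "/a"),
                new Hit("2017-10-11 08:11:20", "/b"));
        System.out.println(summarize("session1", "ip1", session1));
    }
}
```

A session with a single request yields a duration of 0 in this sketch; a report could substitute a default value for that case, as the table suggests.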

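Step analysis:
 1 Read the log file, extract the request records, and group them by the user's IP (Map)
 2 Sort each user's requests by time
 3 Check whether the time gap between two adjacent requests is under 30 minutes to decide whether they belong to the same session
 4 Generate a sessionId for each request and tag it with its order within the session
 5 For the second requirement, collect the requests that share a sessionId and report the first and last request URLs and the time difference between them
Knowledge points:

 Collections (storage, sorting)  IO  time operations (format conversion, comparison, time difference)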
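The three time operations listed above (format conversion, comparison, difference) can be illustrated on the log's own timestamp format. This is a minimal sketch using `java.time`, which is an alternative to the `SimpleDateFormat` approach in the article's code; the class name `LogTimeSketch` is hypothetical.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class LogTimeSketch {
    // Matches timestamps such as 18/Sep/2013:06:49:18 +0000; Locale.US is needed for "Sep"
    static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);

    /** Format conversion: text from the log to a date-time with its offset. */
    static OffsetDateTime parse(String text) {
        return OffsetDateTime.parse(text, LOG_TIME);
    }

    /** Time difference in whole seconds between two log timestamps. */
    static long secondsBetween(String earlier, String later) {
        return Duration.between(parse(earlier), parse(later)).getSeconds();
    }

    public static void main(String[] args) {
        OffsetDateTime a = parse("18/Sep/2013:06:49:18 +0000");
        OffsetDateTime b = parse("18/Sep/2013:06:49:23 +0000");
        // Comparison
        System.out.println(a.isBefore(b));
        // Difference
        System.out.println(secondsBetween("18/Sep/2013:06:49:18 +0000", "18/Sep/2013:06:49:23 +0000"));
    }
}
```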

Sample data (partial):

194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"
163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
101.226.68.137 - - [18/Sep/2013:06:49:42 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
101.226.68.137 - - [18/Sep/2013:06:49:45 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:50:08 +0000] "-" 400 0 "-" "-"
183.195.232.138 - - [18/Sep/2013:06:50:16 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
183.195.232.138 - - [18/Sep/2013:06:50:16 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
66.249.66.84 - - [18/Sep/2013:06:50:28 +0000] "GET /page/6/ HTTP/1.1" 200 27777 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
221.130.41.168 - - [18/Sep/2013:06:50:37 +0000] "GET /feed/ HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
157.55.35.40 - - [18/Sep/2013:06:51:13 +0000] "GET /robots.txt HTTP/1.1" 200 150 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
50.116.27.194 - - [18/Sep/2013:06:51:35 +0000] "POST /wp-cron.php?doing_wp_cron=1379487095.2510800361633300781250 HTTP/1.0" 200 0 "-" "WordPress/3.6; http://blog.fens.me"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /nodejs-socketio-chat/ HTTP/1.1" 200 10818 "http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51hYQkAA&bvm=bv.52288139,d.aGc" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:36 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
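For orientation, one way to split a line like those above into the seven fields listed earlier is a single regular expression with capture groups. This is only a sketch under the assumption that every line follows the layout of the sample (quoted fields without embedded quotes); the class name `LogLineParser` is hypothetical, and the article's own code below extracts fields with separate patterns instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    // Layout: ip - - [time] "request" status bytes "referrer" "user agent"
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the seven captured fields in order, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "
                + "\"HEAD / HTTP/1.1\" 200 20 \"-\" \"DNSPod-Monitor/1.0\"";
        String[] f = parse(line);
        System.out.println(String.join(" | ", f));
    }
}
```

Lines that do not fit the layout return null, so the caller can skip them during cleaning.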


Implementation code:

Data analysis case

import java.util.Date;

public class SessionBean {
	private String sessionId;
	private String ip;
	private Date date;
	private String url;
	private int order;
	public String getSessionId() {
		return sessionId;
	}
	public void setSessionId(String sessionId) {
		this.sessionId = sessionId;
	}
	public String getIp() {
		return ip;
	}
	public void setIp(String ip) {
		this.ip = ip;
	}
	public Date getDate() {
		return date;
	}
	public void setDate(Date date) {
		this.date = date;
	}
	public String getUrl() {
		return url;
	}
	public void setUrl(String url) {
		this.url = url;
	}
	public int getOrder() {
		return order;
	}
	public void setOrder(int order) {
		this.order = order;
	}
	@Override
	public String toString() {
		return "SessionBean [sessionId=" + sessionId + ", ip=" + ip + ", date=" + date + ", url=" + url + ", order="
				+ order + "]";
	}

}

Be clear about what result the project needs, or have a development document to work from; otherwise it is easy to get lost.

import java.io.BufferedReader;
import java.io.FileReader;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import ch03.IpUtils;
public class TestMain {
	public static void main(String[] args) {
		// Build the map from ip to its list of SessionBeans
		Map<String, List<SessionBean>> map1 = getIpSessionBeanMap();	
		// Sort each list by time
		sortByDate(map1);
		// Generate the sessionIds
		makeSessionId(map1);
		// Map from sessionId to its list of requests
		Map<String, List<SessionBean>> map2 = new HashMap<>();
		Set<Entry<String,List<SessionBean>>> entrySet2 = map1.entrySet();
		for (Entry<String, List<SessionBean>> entry : entrySet2) {
			List<SessionBean> value = entry.getValue();
			for (SessionBean sessionBean : value) {
				// Each individual record is available here
				List<SessionBean> list = map2.getOrDefault(sessionBean.getSessionId(), new ArrayList<>());
				list.add(sessionBean);
				map2.put(sessionBean.getSessionId(), list);
			}
		}
		// The per-ip lists are already sorted, so each per-session list is also in ascending time order and needs no further sort
		
		Set<Entry<String,List<SessionBean>>> entrySet3 = map2.entrySet();
		for (Entry<String, List<SessionBean>> entry : entrySet3) {
			String sessionId = entry.getKey();
			List<SessionBean> list = entry.getValue();
			SessionBean first = list.get(0);
			SessionBean end = list.get(list.size()-1);
			long cha = end.getDate().getTime()-first.getDate().getTime();
			String ret = sessionId+"\t"+first.getIp()+"\t"
					+first.getDate()+"\t"+end.getDate()+"\t"+first.getUrl()
					+"\t"+end.getUrl()+"\t"+(cha/1000);
			System.out.println(ret);
		}

		/*Set<Entry<String,List<SessionBean>>> entrySet = map2.entrySet();
		for (Entry<String, List<SessionBean>> entry : entrySet) {
			System.out.println(entry.getKey());
			List<SessionBean> value = entry.getValue();
			for (SessionBean sessionBean : value) {
				System.out.println(sessionBean);
			}
			System.out.println("---------------------------");
		}*/
	}
	/**
	 * Assigns a sessionId and an in-session order to every request
	 * @param map1 requests grouped by ip, each list already sorted by time
	 */
	private static void makeSessionId(Map<String, List<SessionBean>> map1) {
		for (List<SessionBean> list : map1.values()) {
			// All requests in this list share the same ip
			SessionBean previous = null;
			for (SessionBean current : list) {
				if (previous != null && isSameSession(previous, current)) {
					// Same session: reuse the previous request's sessionId and continue the numbering
					current.setSessionId(previous.getSessionId());
					current.setOrder(previous.getOrder() + 1);
				} else {
					// First request of this ip, or not the same session: start a new session
					current.setSessionId(getSessionId(current.getIp()));
					current.setOrder(1);
				}
				previous = current;
			}
		}
	}
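	/**
	 * Decides whether two adjacent requests belong to the same session
	 * @param session1 the earlier request
	 * @param session2 the later request
	 * @return true if the gap is at least 0 and under 30 minutes
	 */
	private static boolean isSameSession(SessionBean session1, SessionBean session2) {
		long date1 = session1.getDate().getTime();
		long date2 = session2.getDate().getTime();
		// Same session when the gap is in [0, 30 minutes), as the requirement states "< 30 minutes"
		long cha = date2-date1;
		return cha>=0&&cha<(1000*60*30);
	}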
	/**
	 * Generates a sessionId: the ip (as a long) followed by the current nanoTime
	 * @param ip
	 * @return
	 */
	private static String getSessionId(String ip) {
		long longIp = IpUtils.strIpToLongIp(ip);
		long nanoTime = System.nanoTime();
		return ""+longIp+nanoTime;
	}
	/**
	 * Sorts every list in the map by request time, ascending
	 * @param map1
	 */
	private static void sortByDate(Map<String, List<SessionBean>> map1) {
		Set<Entry<String,List<SessionBean>>> entrySet = map1.entrySet();
		for (Entry<String, List<SessionBean>> entry : entrySet) {
			List<SessionBean> list = entry.getValue();
			Collections.sort(list, new Comparator<SessionBean>() {
				@Override
				public int compare(SessionBean o1, SessionBean o2) {
					// compareTo returns 0 for equal times, which the Comparator contract requires
					return o1.getDate().compareTo(o2.getDate());
				}
			});	
		}
	}
	private static Map<String, List<SessionBean>> getIpSessionBeanMap() {
		// Holds the list of SessionBeans for each ip
		Map<String, List<SessionBean>> map1 = new HashMap<>();
		try(BufferedReader br =new BufferedReader(new FileReader("../案例练习4/src/ch06/access.log.fensi"));) {
			String line = null;
			while((line = br.readLine())!=null){
				//System.out.println(line);
				String ipRegex = "(\\d+\\.){3}\\d+";
				// Match from the opening bracket up to the first closing bracket only
				String dateRegex = "\\[[^\\]]+\\]";
				String urlRegex = "(POST|GET){1}\\s(\\S)*\\s";
				String ip = getContByRegex(line,ipRegex);
				String date = getContByRegex(line,dateRegex);
				String url = getContByRegex(line,urlRegex);
				//System.out.println(url);
				// Data filtering / cleaning: skip lines with a missing field or an unparseable time
				Date parsed = date==null?null:parseDate(date);
				if(url!=null&&parsed!=null&&ip!=null){
					SessionBean session = new SessionBean();
					session.setIp(ip);
					session.setUrl(url);
					session.setDate(parsed);
					List<SessionBean> list = map1.getOrDefault(session.getIp(), new ArrayList<>());
					list.add(session);
					map1.put(ip, list);
				}
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
		return map1;
	}
	/**
	 * Converts the bracketed time string into a Date
	 * @param date e.g. [18/Sep/2013:06:51:37 +0000]
	 * @return the parsed Date, or null if the text cannot be parsed
	 */
	private static Date parseDate(String date) {
		//[18/Sep/2013:06:51:37 +0000]
		String substring = date.substring(1, date.length()-1);
		// Z consumes the "+0000" offset, so the time is not interpreted in the JVM's default zone
		SimpleDateFormat format = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
		try {
			return format.parse(substring);
		} catch (ParseException e) {
			e.printStackTrace();
		}
		return null;
	}
	/**
	 * Returns the first match of the regular expression in the line
	 * @param line
	 * @param ipRegex
	 * @return the matched text, or null if there is no match
	 */
	private static String getContByRegex(String line, String ipRegex) {
		Pattern compile = Pattern.compile(ipRegex);
		Matcher matcher = compile.matcher(line);
		if(matcher.find()){
			return matcher.group();
		}
		return null;
	}
}

Reposted from blog.csdn.net/a331685690/article/details/80281448