Analyzing risk data with Python

 

1 Tool Introduction - Loading the Data Analysis Packages

Start an IPython notebook and load the working environment:
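The original post shows this step as a screenshot; here is a minimal sketch of the setup, assuming the usual pandas/numpy/matplotlib stack:

```python
# Standard data analysis stack; the aliases are conventional,
# not taken from the original post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# IPython magic to render plots inline in the notebook.
%matplotlib inline
```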

2 Data Preparation

As the saying goes, you cannot make bricks without straw. This analysis mainly uses the access log of users of a proxy IP service, with the raw data stored in CSV form. So we first introduce pandas.read_csv, a common method that reads the data into a DataFrame.
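A sketch of the call (the file name here is a placeholder, not the original path):

```python
# Read the whole CSV access log into a DataFrame.
df = pd.read_csv('proxy_access_log.csv')
```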

Right - one line of code reads an entire two-dimensional table into a DataFrame variable. How easy is that! Of course, the IO tools pandas provides can also read large files in chunks. In a quick performance test, fully loading the roughly 21.5 million rows of data took only about 90 seconds - quite decent performance.
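For chunked reading, a sketch using read_csv's chunksize parameter (the chunk size here is illustrative):

```python
# Read the file in chunks of one million rows, then concatenate
# the chunks into a single DataFrame.
chunks = pd.read_csv('proxy_access_log.csv', chunksize=1_000_000)
df = pd.concat(chunks, ignore_index=True)
```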

3 Data Glimpse

In general, before analyzing data we first want a general understanding of it: how much data there is, what variables it contains, how those variables are distributed, whether there are duplicates or missing values, and a preliminary look at outliers. Below, let's take a quick glimpse at these aspects of the data together.

Use the shape attribute to view the number of rows and columns:
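The call itself, assuming the DataFrame is named df as in the sketches above:

```python
df.shape
```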

Out: (21524530, 22)  # a DataFrame of 21,524,530 records across 22 dimensions (columns)

Use the head() method to view the first 5 rows (the default); similarly, the tail() method views the last 5 rows by default. Both accept a parameter to view a custom number of rows.
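For example:

```python
df.head()    # first 5 rows by default
df.tail(10)  # last rows; pass a number to customize how many
```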

4 Data Cleaning

Because source data usually contains null values, or even entirely empty columns, which hurt the speed and efficiency of analysis, this invalid data needs to be dealt with after the data preview.

In general, the dropna method can be used to remove null data. But on first use, dropna() was found to remove almost all rows of the data; a check of the pandas user manual revealed that, with no arguments, dropna() removes every row that contains any null value.
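Roughly what that first attempt looked like:

```python
# With no arguments, dropna() drops every row containing at least
# one null value - here that removed almost all of the data.
df_nonull = df.dropna()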

If instead you want to remove only the columns whose values are all null, you need to add the two parameters how and axis:
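A sketch of the call:

```python
# Drop only the columns (axis=1) whose values are all null.
df = df.dropna(axis=1, how='all')
```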

Further, dropna's subset parameter lets you remove rows with null values in specified columns only, and the thresh parameter removes rows that have fewer than thresh non-null values.

Remove rows where the proxy_host or srcip field has no value:
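A sketch using the subset parameter:

```python
# Drop rows with a null in either of the two key columns.
df = df.dropna(subset=['proxy_host', 'srcip'])
```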

 

Remove all rows that have fewer than 10 non-null field values:
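A sketch using the thresh parameter:

```python
# Keep only rows with at least 10 non-null values.
df = df.dropna(thresh=10)
```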

5 Statistical Analysis

Now that we have a preliminary understanding of the data, recall that the raw data has 22 variables. Guided by the goal of the analysis, I will pick out a subset of those variables to analyze. Here we introduce pandas' slicing method loc.

df.loc[start_row_index:end_row_index, ['timestampe', 'proxy_host', 'srcip']] is an important pandas slicing method: the part before the comma slices rows, and the part after the comma slices columns, i.e. picks out the variables to analyze.

As shown below, I select the date, host, and source IP fields -
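A sketch of the selection (column names as spelled in the text, including 'timestampe'):

```python
# Keep every row but only the three columns of interest.
data = df.loc[:, ['timestampe', 'proxy_host', 'srcip']]
```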

First, let's look at the honeypot proxy's daily usage volume. We aggregate the data by day to get the daily PV (page view) count, and plot the result as a trend chart.
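A sketch of the daily aggregation and plot, assuming the timestampe column holds parseable timestamps (the actual format is not given in the post):

```python
# Derive a calendar date from the timestamp, count records per day,
# and plot the daily PV trend.
data['date'] = pd.to_datetime(data['timestampe']).dt.date
daily_pv = data.groupby('date').size()
daily_pv.plot(title='Daily honeypot proxy PV')
plt.show()
```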

As for dropping data columns: besides invalid values and columns excluded by the requirements, some redundant columns of the table itself also need to be cleaned up at this stage, such as the DataFrame's index numbers and type descriptions. Dropping these produces a new, effectively smaller dataset, which in turn improves computation efficiency.
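A sketch of dropping redundant columns; the column names here are hypothetical, since the post does not list them:

```python
# Hypothetical redundant columns - substitute the real names.
df = df.drop(columns=['index_no', 'type_description'])
```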

The chart above shows that honeypot proxy usage exploded on June 5th, June 19th-22nd, and June 25th. So the data on those days is abnormal - but what exactly happened? No rush: later on, Xiao An will walk you through tracking down exactly who (which source IPs) did what "bad things".

Digging further into the anomalies, let's look at the daily count of deduplicated IPs and its growth. We can group by day and use the nunique() method to compute the number of unique IPs per day directly.
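A sketch, reusing the date column derived above:

```python
# Count distinct source IPs per day, then the day-over-day change.
daily_ips = data.groupby('date')['srcip'].nunique()
ip_growth = daily_ips.diff()
```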

 

END

