Web Attack Detection: An In-Depth Machine Learning Practice


I. Overview

1.1 Pain points of traditional WAFs

Traditional WAFs rely on rules and black/white lists to detect Web attacks. On one hand, this approach places undue reliance on the breadth of security engineers' knowledge and is helpless against unknown attack types. On the other hand, even for known attack types, the inherent limitations of regular expressions combined with the extremely flexible syntax of shell, PHP, and other languages mean that bypasses are always possible in theory, so false positives and missed detections are a natural fact of life. And since the cost of improving accuracy is ever finer, ever more numerous regexes, the WAF is thrown into an endless whirlpool of patching that drags down overall performance.

In response to these problems, current research at mainstream security vendors falls into two camps: semantic parsing and AI-based recognition.

1.2 Semantic Analysis

The idea is to extract suspected executable code segments from the HTTP payload and parse them in a sandbox to see whether they could actually execute.

Take the common shell command cat: with an understanding of shell syntax, cat, c'a't, c"a"t, and ""c'a't"" are all the same thing. Semantic understanding can, in theory, fix some of the misses and false positives of regex matching, but it has difficulties of its own. For example, deciding which part of the HTTP request contains the suspected executable code segment, and how to cut and splice the HTTP payload so that it still parses normally, are both troublesome; moreover, SQL grammar, shell grammar, and JS grammar each have to be implemented separately.
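As a toy illustration of this point (not libinjection itself, and shell-only): Python's shlex module resolves quoting the way a shell does, so the obfuscated variants collapse to the same literal word.

  # Toy illustration (not libinjection): shell-style quote resolution
  # collapses the obfuscated variants of "cat" into the same token.
  import shlex

  for variant in ["cat", "c'a't", 'c"a"t', "\"\"c'a't\"\""]:
      print(variant, '->', shlex.split(variant))  # every line prints ['cat']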

Judging by the libinjection semantic parsing library, bypasses and misses still occur in many cases, and libinjection is itself essentially rule-based, a layer of abstraction over traditional WAF rules used for recognition. There are, in fact, WAFs on the market with very loud "semantics-based" slogans; exactly how well they work is currently not very clear.

1.3 AI recognition

Some AI enthusiasts optimistically believe that machine learning and deep learning are the ultimate solution to the pain points of traditional WAFs. Well... maybe, assuming AI has not yet invented an even more perfect solution. Even from the modest viewpoint of machine learning simply empowering the WAF, there is still a vast world to explore.

In the field of security detection, humans use AI techniques to construct, from the data, distinguishing features that can be expressed mathematically, and then train a model so that it acquires the ability to tell good from bad.

The final quality of the model therefore depends on the quality of the data and of the features: they determine the upper bound the model can reach, while the algorithm merely lets the model keep approaching that upper bound.

Feature extraction is a process of "mining the wonderful laws of nature": a given class of features distinguishes the class of attacks that carries those features. The core question is how to select features so that the model not only discriminates well but also generalizes, ideally well enough to distinguish even unknown attack types.

Compared with image recognition, speech recognition, and other fields, AI applications in Web security started a little later and have not gone as deep. The reason is that the accuracy and maintainability of machine learning for Web attack recognition cannot yet fully replace traditional WAF rules: regex-based matching is what-you-see-is-what-you-get and remains effective to maintain. To improve the applicability of AI to Web attack recognition, work therefore needs to start in the following directions:

  • Improve accuracy

  • Optimize logic and improve performance

  • Self-updating and efficient iteration of the model

  • Recognition of unknown attack types

II. Analysis of Web attack features

First, look at some attack examples:

1. XSS (cross-site scripting)

<script>alert(0)</script>

<img src=0 onerror=alert(0)>

2. SQL injection

+and+(select+0+from+(select+count(*),concat(floor(rand(0)*0),

union all select null,null,null,null,null,null,null,null#

3. Command execution

${@print(eval($_post[c]))}

exec xp_cmdshell('cat ../../../etc/passwd')#

From these examples it can be seen that the features of Web attack requests fall broadly into two categories:

Threat keyword features, such as:

select,script,etc/passwd

Irregular structural features, such as:

${@print(eval($_post[c]))}

2.1 Feature extraction based on state transitions

A common practice is to generalize characters with similar attributes into one state, represented by a fixed character. For example: letters are generalized to 'N', Chinese characters to 'Z', digits to '0', separators to 'F', and so on. The core idea is to use different states to express different character attributes, so that, as far as possible, the characters that carry meaning in a Web attack are kept in a region separate from the other characters; a payload is then converted into a sequence of states, which is used to train a transition probability matrix for the chain.
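A minimal sketch of this generalization, assuming the four states named above (treating "everything else" as a separator is an illustrative choice):

  # Map each character of a payload to a state symbol:
  # digits -> '0', Chinese characters -> 'Z', letters -> 'N', the rest -> 'F'.
  def to_state_sequence(payload: str) -> str:
      states = []
      for ch in payload:
          if ch.isdigit():
              states.append('0')
          elif '\u4e00' <= ch <= '\u9fff':
              states.append('Z')
          elif ch.isalpha():
              states.append('N')
          else:
              states.append('F')  # separators and all other symbols
      return ''.join(states)

  print(to_state_sequence("id=1' or '1'='1"))  # -> NNF0FFNNFF0FFF0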

The commonly used model is the Hidden Markov Model (HMM). Training an HMM on black samples lets you "use black to find black"; the benefit is a lower false-positive rate. Training an HMM on white samples lets you discover unknown attack types, but at the cost of a higher false-positive rate. Testing with the collected samples gave decent results: for attacks with obvious structural anomalies in the request parameters, such as certain XSS attacks and separator-insertion attacks on Web parameters, this scheme recognizes them well; SQL injections without structural anomalies, and probes of sensitive directories, are not recognized, which is also fully in line with expectations.
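A back-of-the-envelope sketch of the "train on white samples, score structure" idea, using a plain first-order Markov chain rather than a full HMM for brevity (the white samples and the probability floor are illustrative assumptions; to_state_sequence is the helper sketched above):

  import math
  from collections import defaultdict

  def train_transitions(state_seqs):
      """Count state-to-state transitions over white samples, normalized to probabilities."""
      counts = defaultdict(lambda: defaultdict(int))
      for seq in state_seqs:
          for a, b in zip(seq, seq[1:]):
              counts[a][b] += 1
      return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
              for a, nxt in counts.items()}

  def avg_log_prob(probs, seq, floor=1e-6):
      """Average transition log-probability; very low values suggest anomalous structure."""
      lps = [math.log(probs.get(a, {}).get(b, floor)) for a, b in zip(seq, seq[1:])]
      return sum(lps) / max(len(lps), 1)

  white = [to_state_sequence(u) for u in ("id=123", "page=2&size=10", "q=books")]
  model = train_transitions(white)
  print(avg_log_prob(model, to_state_sequence("id=1' or '1'='1")))  # strongly negative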

This approach, however, has a well-known defect: it looks at request-parameter anomalies from a purely structural standpoint. A structural anomaly is not necessarily a Web attack, and a normal structure is no guarantee that the request is not one.

(1) Structurally anomalous XSS attack -> recognized

var _=i[c].id;u.test(_)&&(s=(s+=(_=_.substring(0))+"#@#").replace(/\\|/g," "))}""!==s?(s=s.substring(0,s.length-0),_sendexpodatas

(2) Structurally anomalous obfuscated XSS attack -> recognized

/m/101/bookdetail/comment/129866160.page?title=xxx<marquee onstart="top[`ale`+`rt`](document[\'cookie\'])">

(3) Structurally anomalous SQL injection -> recognized

/wap/home.htm?utm_source=union%' and 3356=dbms_pipe.receive_message(chr(107)||chr(78)||chr(72)||chr(79),5) and '%'='&utm_medium=14&utm_campaign=32258543&utm_content=504973

(4) Structurally normal SQL injection -> not recognized

/hitcount.asp?lx=qianbo_about&id=1 and 1=2 union select password from 

(5) Structurally anomalous normal request -> false positive

/amapfromcookie().get("visitorid"),o=__ut._encode(loginusername),u=o?"r":"g",d=n.gettime(),c=_cuturltoshorrid")

(6) Structurally anomalous normal request -> false positive

o.value:"")&&(c=c+"&sperid="+o),x+=c,__ut._httpgifsendpassh0(x)}}_sendexpodatas=function(e,t,n){var a=0===t?getmainpr

(7) Structurally anomalous normal request -> false positive

/index.php?m=vod-search&wd={{page:lang}if-a:e{page:lang}val{page:lang}($_po{page:lang}st[hxg])}{endif-a}

2.2 Structure-based statistical features

Extract statistical features from the URL of the request, such as: URL length, path length, length of the parameter part, parameter-name length, parameter-value length, number of parameters, proportion of the URL taken up by parameters, number of special characters, number of dangerous special characters and high-risk character combinations, path depth, number of separators, and so on. With these statistics as features, logistic regression, SVM, ensemble algorithms, an MLP, or unsupervised learning models may be selected.
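A minimal sketch of such a statistical feature extractor, assuming the request is available as a raw path-plus-query string (the "dangerous character" set and exact feature list here are illustrative, not the author's production choices):

  from urllib.parse import urlsplit, parse_qsl

  DANGEROUS = set("<>'\"();|&$`")  # illustrative high-risk characters

  def url_stat_features(url: str) -> dict:
      parts = urlsplit(url)
      params = parse_qsl(parts.query, keep_blank_values=True)
      return {
          "url_len": len(url),
          "path_len": len(parts.path),
          "path_depth": parts.path.count('/'),
          "param_part_len": len(parts.query),
          "param_count": len(params),
          "param_len_ratio": len(parts.query) / max(len(url), 1),
          "max_name_len": max((len(k) for k, _ in params), default=0),
          "max_value_len": max((len(v) for _, v in params), default=0),
          "special_char_count": sum(not c.isalnum() for c in url),
          "dangerous_char_count": sum(c in DANGEROUS for c in url),
      }

  print(url_stat_features("/hitcount.asp?lx=qianbo_about&id=1 and 1=2 union select password from users"))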

If validation is done only on URL requests to a single domain, the model performs acceptably. However, we face the company's system of thousands of domains, where different domains exhibit different URL directory hierarchies, different naming conventions, different request parameters, and so on. In such an extremely complex scenario, many of the above features become ambiguous in the data itself. As a result, the model distinguishes requests poorly, and its accuracy over the full range of URLs is too low. In fact, even in a relatively simple, well-fitted environment, it is hard to push the model's accuracy up to 97%.

2.3 Features based on segmentation of code fragments

Segment the URL request into words according to particular rules, extract features with TF-IDF, and retain discriminative keyword-combination features, supplementing with open-source attack samples collected online to make the features as complete as possible. How to segment words "losslessly" here is closely tied to the structure of the keyword combinations; it is the key to the feature engineering, and must be refined continuously against the model's performance (the focus of the sections below).

In fact, the common features among Web attacks are certain reserved keywords and dangerous character combinations, and these keywords and character combinations are finite. In theory, with the massive traffic and the full WAF corpus of Web attack samples we have today, almost all of these keywords and character combinations can be covered.
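A minimal sketch of the TF-IDF direction, assuming a simple regex tokenizer (the author's actual segmentation rules are not given in the post):

  import re
  from sklearn.feature_extraction.text import TfidfVectorizer

  def url_tokenize(url: str):
      # keep runs of word characters and each punctuation mark as separate tokens
      return re.findall(r"[a-z0-9_]+|[^\sa-z0-9_]", url.lower())

  corpus = [  # illustrative samples
      "/index.php?id=1 union all select null,null#",
      "/search?q=<script>alert(0)</script>",
      "/m/101/bookdetail/comment/129866160.page?title=hello",
  ]
  vectorizer = TfidfVectorizer(tokenizer=url_tokenize, lowercase=False, token_pattern=None)
  X = vectorizer.fit_transform(corpus)  # sparse TF-IDF matrix, one row per request
  print(sorted(vectorizer.vocabulary_)[:8])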

III. MLP model based on word segmentation and feature extraction

The Universal Approximation Theorem (Hornik et al., 1989; Cybenko, 1989) states that, in theory, a neural network can approximate a function of arbitrary complexity to arbitrary accuracy.

3.1 Feature engineering

Decoding:

Recursive URL decoding, Base64 decoding, and decimal/hexadecimal decoding;

Character generalization:

Operations such as generalizing digits uniformly to '0' and converting uppercase letters to lowercase;

Event matching:

XSS payloads contain tags and events; events of the same type, and tags, are collected, matched in order, and replaced with custom character combinations that are put into the bag-of-words model;

Keyword matching:

The principle is similar to event matching above: keywords of the same type and with the same properties are generalized into the same character combination and put into the bag-of-words model, which has the benefit of reducing the feature dimensionality;

Feature vector conversion:

After decoding and word matching, a sample is converted into a fixed-length feature vector of '0's and '1's.
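A minimal end-to-end sketch of this 3.1 pipeline (the keyword vocabulary is a hypothetical stand-in for the author's bag of words):

  import re
  from urllib.parse import unquote

  VOCAB = ["select", "union", "script", "alert", "eval",
           "etc/passwd", "onerror", "xp_cmdshell"]  # hypothetical bag of words

  def normalize(payload: str) -> str:
      prev = None
      while prev != payload:               # recursive URL decoding until stable
          prev, payload = payload, unquote(payload)
      payload = payload.lower()            # uppercase -> lowercase
      return re.sub(r"\d+", "0", payload)  # digits generalized to '0'

  def to_binary_vector(payload: str) -> list:
      p = normalize(payload)
      return [1 if kw in p else 0 for kw in VOCAB]  # fixed-length 0/1 vector

  print(to_binary_vector("/a?q=%3Cscript%3Ealert(1)%3C/script%3E"))
  # -> [0, 0, 1, 1, 0, 0, 0, 0]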

3.2 Model results

To save space, only the evaluation results of the models and some thoughts on feature extraction are presented here.

Random forest: (evaluation figure omitted)

Logistic regression: (evaluation figure omitted)

MLP model: (evaluation figure omitted)
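The post omits the model code itself; for concreteness, a minimal Keras MLP over the fixed-length 0/1 vectors of Section 3.1 might look as follows (layer sizes are illustrative assumptions, not the author's production configuration):

  from keras.models import Sequential
  from keras.layers import Dense, Dropout

  def build_mlp(input_dim: int) -> Sequential:
      model = Sequential()
      model.add(Dense(64, activation='relu', input_dim=input_dim))
      model.add(Dropout(0.5))
      model.add(Dense(32, activation='relu'))
      model.add(Dense(1, activation='sigmoid'))  # P(attack)
      model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
      return model

  # model = build_mlp(input_dim=len(VOCAB))  # VOCAB from the 3.1 sketch above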

3.3 Summary

Shortcomings

  • The model requires repeated calibration to optimize the feature-extraction and conversion rules;

  • Poor recognition of unknown attack types;

  • Ineffective at recognizing obfuscated attacks;

  • Does not learn the ordering information of keywords.

Again take the common shell command cat: with an understanding of shell syntax, cat, c'a't, c"a"t, and ""c'a't"" are all the same thing. The MLP model presented here can understand the word cat, but it cannot understand obfuscations such as c'a't (word segmentation destroys the information).

Advantages

  • More efficient prediction than deep learning models;

  • Compared with deep learning models, distributed deployment is more convenient and scales to heavy traffic;

  • High accuracy, with essentially complete recognition of known attack types;

  • Maintainable: missed and misclassified request types only need to be labeled and fed back into training.

Some people may wonder why the MLP model on the keyword features above can achieve approximately 100% accuracy. This is the result of repeated tuning. Before converting a request URL into a feature vector, a great deal of generalization and cleaning work was done, some of it also using regexes. Early on, misjudged requests were handled by adjusting the dimensions of the bag-of-words vector and the URL cleaning method, digging out the distinguishing features of positive and negative samples so that, after vector conversion, the training samples fed to the model were as unambiguous as possible. Once the model was online, the false positives generated each day had their features re-extracted and adjusted, and were fed back into the training set as positive samples to update the model. Through this bit-by-bit accumulation, the model grows ever more refined.

IV. LSTM model for recognizing obfuscated and unknown attacks

Based on the feature-extraction ideas of Section III, experimenting with combinations of segmentation methods and parameters yields an MLP model with the best training results, satisfying exact recognition of known attack types. However, because part of the MLP model's feature extraction depends on rules, misses and false positives exist forever, in theory: the attack samples to learn from are never sufficient, so humans must continually review, discover new attacks, adjust the feature extraction, adjust the parameters, and retrain... a road that seemingly never ends.

4.1 Why LSTM

Recall the Web attack requests above: a security expert can recognize an attack at a glance, while a machine learning model requires us to tell it a series of distinguishing features by hand, and then to use sample data combined with those features to let the ML model simulate a function that outputs yes or no.

When a security expert sees a URL request, they read it against the "experience memory" in their head: whether the structure of the request is normal, whether it contains Web attack keywords, what each fragment means... all of this is understood from the context around each character of the URL. Traditional neural networks cannot do this, but recurrent neural networks can: they allow information to persist.

LSTM exploits precisely this advantage of understanding text from its context, using the characters before and after to judge whether a URL request is a Web attack. A further benefit of this property is that it saves the complex feature-engineering step.

It is precisely this way of understanding the features of a URL request that gives the model a certain ability to recognize unknown attacks. As for unknown obfuscated attacks: the MLP model can understand the word cat but not the obfuscation c'a't, because segmentation splits it apart. The LSTM model, by contrast, takes each character as a feature, with contextual links between characters; whether it sees cat, c'a't, c"a"t, or ""c'a't"", after conversion through the embedding layer the feature vectors they express are similar, and to the model they are approximately the same thing.

4.2 Feature vectorization and model training

Here, only the parameter values of the requests are used for training.

import sys
import time
import logging

import numpy as np
import pandas as pd
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, Activation
from keras.callbacks import ReduceLROnPlateau

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def arg2vec(arg):
    """Map one parameter value to a fixed-length sequence of character indices."""
    arglis = [c for c in arg]
    # index 0 is reserved for padding, 1 for unknown characters (incl. Chinese)
    x = [wordindex[c] if c in I else 1 for c in arglis]
    vec = sequence.pad_sequences([x], maxlen=maxlen)
    return np.array(vec).reshape(-1, maxlen)

def build_model(max_features, maxlen):
    """Build the LSTM model."""
    model = Sequential()
    model.add(Embedding(max_features, 32, input_length=maxlen))
    model.add(LSTM(16))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop', metrics=['acc'])
    return model

def run():
    model = build_model(max_features, maxlen)
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                                  patience=4, mode='auto', min_delta=0.0001)
    model.fit(X, y, batch_size=512, epochs=20,
              validation_split=0.1, callbacks=[reduce_lr])
    return model

if __name__ == "__main__":
    startTime = time.time()
    filename = sys.argv[1]
    data = pd.read_csv(filename)
    # character vocabulary; anything else maps to the "unknown" index
    I = ['v', 'i', '%', '}', 'r', '^', 'a', 'c', 'y', '.', '_', '|', 'h',
         'w', 'd', 'g', '{', '!', '$', '[', ' ', '"', ';', '\t', '>', '<',
         '\\', 'l', '\n', '\r', '(', '=', ':', 'n', '~', '`', '&', 'x',
         "'", '+', 'k', ']', ')', 'f', 'u', '0', 'q', '#', 'm', '@', '*',
         'e', 'z', '?', 't', 's', 'b', 'p', 'o', '-', 'j', '/', ',']
    wordindex = {k: v + 2 for v, k in enumerate(I)}
    max_features = len(wordindex) + 2  # add unknown (incl. Chinese) and padding states
    maxlen = 128
    X = np.array([arg2vec(x) for x in data['args']]).reshape(-1, 128)
    y = data['lable']  # column name as in the author's CSV
    model = run()
    logger.info("Saving model!")
    modelname = 'model/lstm' + time.strftime('%y_%m_%d') + '.h5'
    model.save(modelname)
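For completeness, a hypothetical inference-side sketch (not in the original post) showing how the trained model could score a single parameter value; the model file name and threshold are assumptions:

  from keras.models import load_model

  def predict_arg(model, arg: str, threshold: float = 0.5) -> bool:
      """Score one decoded parameter value; True means 'judged an attack'."""
      vec = arg2vec(arg)                       # (1, maxlen) index matrix
      score = float(model.predict(vec)[0][0])  # sigmoid output in [0, 1]
      return score >= threshold

  # model = load_model('model/lstm19_06_25.h5')  # hypothetical file name
  # predict_arg(model, "1 and 1=2 union select password from users")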

4.3 Model evaluation

With a test set of 10,000 samples, accuracy was 99.4%;

When the test set grew to 5.84 million samples, accuracy after GPU training reached 99.99%;

Observing the misidentified samples, most errors were caused by the fixed-length truncation of URLs, which leaves fragments whose malicious intent is ill-defined.

4.4 Summary

Shortcomings

  • High resource overhead and low prediction efficiency;

  • The model requires inputs of the same size; above, request URLs longer than 128 bytes are truncated and shorter ones are zero-padded to 128 bytes, and such rigid cutting may destroy information in the original URL.

Advantages

  • No complicated feature engineering;

  • Capable of recognizing unknown attacks;

  • Strong generalization ability.

V. Reflections

Driven by the needs of my work, I have tried a variety of directions for detecting Web attacks and a variety of feature-extraction methods, but none has given me truly satisfactory results, and at times I could not even accept the defects inherent in a given direction. Doing Web attack recognition with traditional machine learning methods depends heavily on feature engineering, which has consumed most of my time and still does.

Before the current LSTM model, the best-performing model in Suning's production environment was the MLP model, but it has a serious flaw: because its features are extracted from Web attack keywords, a large amount of regex-based segmentation and URL cleaning/generalization had to be used during feature extraction to guarantee recognition accuracy, and in essence this approach is not much different from a rule-based WAF. Its only advantage is to provide a means of detection not identical to the WAF rules, identifying some types that the rules miss or misreport, and thereby helping maintain and upgrade the rule base.

In the long run, I believe the LSTM-based detection direction above is the most promising: here every character is a feature vector, and in theory, as long as you feed it enough samples, it will learn on its own what a given character combination, appearing at a given position in a URL, means, and, like a real security expert, recognize the attack at a glance, no matter what variant it takes.

Origin: blog.csdn.net/kclax/article/details/93631780