website_fingerprinting
Currently this project supports the following models:
- Deep Fingerprinting
- SDAE
- LSTM
- CNN
The remaining two are statistical machine learning models (currently these two models are not fully adapted, but their feature extraction is effective):
- CUMULATION
- AppScanner
Instructions
Data preparation
First, prepare the data in the required format: organize the network traffic into the six files listed below and place them in the same directory, using exactly these file names.
X_train_pkt_length.pkl : packet length sequences, training set.
X_valid_pkt_length.pkl : packet length sequences, validation set.
X_test_pkt_length.pkl : packet length sequences, test set.
y_train_pkt_length.pkl : traffic labels, training set.
y_valid_pkt_length.pkl : traffic labels, validation set.
y_test_pkt_length.pkl : traffic labels, test set.
Each X_*_pkt_length.pkl is a NumPy matrix saved with pickle.dump(); its shape is m × l, where m is the number of samples and l is the length of the packet length sequence. The packet length sequences of all samples in the same data set must be padded to the same length.
Each y_*_pkt_length.pkl is also a NumPy matrix saved with pickle.dump(); its shape is m × 1, where m is the number of samples and the i-th element is an integer giving the label of the i-th sample of the corresponding training, validation, or test set.
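As a sketch of the expected shapes, here is a toy example (the sample values and names are hypothetical, not taken from the project):

```python
import numpy as np

# Toy illustration: 3 samples, packet length sequences padded to l = 1000.
X_train = np.zeros((3, 1000), dtype=np.int32)
X_train[0, :4] = [60, 1500, 52, 40]  # first sample's packet lengths, rest zero-padded

# One integer label per sample, shaped m x 1.
y_train = np.array([[0], [1], [2]], dtype=np.int64)

print(X_train.shape)  # (3, 1000)
print(y_train.shape)  # (3, 1)
```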
Saving the data set requires steps similar to the following:

```python
import gzip
import pickle

with gzip.GzipFile(path_dir + "/" + "X_train_" + feature_name + ".pkl", "wb") as fp:
    pickle.dump(X_train, fp, -1)
with gzip.GzipFile(path_dir + "/" + "X_valid_" + feature_name + ".pkl", "wb") as fp:
    pickle.dump(X_valid, fp, -1)
with gzip.GzipFile(path_dir + "/" + "X_test_" + feature_name + ".pkl", "wb") as fp:
    pickle.dump(X_test, fp, -1)
with gzip.GzipFile(path_dir + "/" + "y_train_" + feature_name + ".pkl", "wb") as fp:
    pickle.dump(y_train, fp, -1)
with gzip.GzipFile(path_dir + "/" + "y_valid_" + feature_name + ".pkl", "wb") as fp:
    pickle.dump(y_valid, fp, -1)
with gzip.GzipFile(path_dir + "/" + "y_test_" + feature_name + ".pkl", "wb") as fp:
    pickle.dump(y_test, fp, -1)
```
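Files saved this way can be loaded back with gzip and pickle.load(). A minimal round-trip sketch (the path and matrix are illustrative, not real project data):

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Save a toy matrix the same way the project does, then load it back.
path = os.path.join(tempfile.mkdtemp(), "X_train_pkt_length.pkl")
X_train = np.arange(10).reshape(5, 2)

with gzip.GzipFile(path, "wb") as fp:
    pickle.dump(X_train, fp, -1)  # -1 selects the highest pickle protocol

with gzip.GzipFile(path, "rb") as fp:
    X_loaded = pickle.load(fp)

print(X_loaded.shape)  # (5, 2)
```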
For each of the training, validation, and test sets, the number of samples in the packet length sequence file must equal the number of samples in the corresponding traffic label file.
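This constraint amounts to a simple shape check per split, which can be sketched as (variable names illustrative):

```python
import numpy as np

X_train = np.zeros((100, 1000))  # 100 packet length sequences
y_train = np.zeros((100, 1))     # 100 labels

# The first dimension (number of samples) must match within each split.
assert X_train.shape[0] == y_train.shape[0]
print("train split consistent:", X_train.shape[0], "samples")
```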
The project provides an example data set, app_dataset, which has 55 classes. The packet length sequence of each sample has length 1000: shorter sequences are padded with zeros, and longer ones are truncated to 1000.
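A pad-or-truncate step like the one used for app_dataset can be sketched as follows (a minimal example, not the project's actual preprocessing code):

```python
import numpy as np

def fix_length(seq, target_len=1000):
    """Truncate a packet length sequence to target_len, or zero-pad it up to target_len."""
    seq = list(seq)[:target_len]           # truncate if longer than target_len
    seq += [0] * (target_len - len(seq))   # zero-pad if shorter
    return seq

samples = [[60, 1500, 52], list(range(1200))]  # one short and one long sequence
X = np.array([fix_length(s) for s in samples])
print(X.shape)  # (2, 1000)
```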
Modify the data directory
After preparing the data as described above, point the project at your data directory. In website_fingerprinting/data_utils.py, modify the NB_CLASSES variable and the default data set directory variable dataset_dir. NB_CLASSES is the number of distinct labels in the data set; dataset_dir is the directory of the default data set.
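For the example app_dataset, these edits might look like the following (the values shown are illustrative; use your own class count and path):

```python
# website_fingerprinting/data_utils.py -- illustrative settings for app_dataset
NB_CLASSES = 55                       # number of distinct labels in the data set
dataset_dir = "/path/to/app_dataset"  # directory containing the six .pkl files
```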
Configure the model
Before running a model, you need to modify its configuration file. Each model's configuration file is in a directory named after the model: for example, the configuration file of the Deep Fingerprinting model is df_model_config.py in the df directory. In the configuration file, set the number of classes and the packet length sequence length; the parameters that need to be modified are marked in each model file.
Run the model
Run X_example.py to train a model, where X can be df, cnn, lstm, or sdae.
Run X_eval.py to test a model, where X can likewise be df, cnn, lstm, or sdae.
For example, running df_example.py on the provided app_dataset gives the following result: