This job is mainly about building a deep model and reducing dimensionality with PCA.

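Background

In the car sales industry, common tax problems include under-invoicing, under-reporting revenue, keeping plate registration, mortgage, and insurance income off the books, and delaying the recognition of warranty claim payments, all of which cost the government a large amount of tax revenue. Some operating indicators of a car sales company can, to a degree, show how likely it is to evade tax. The sample data provides various attributes of taxpayers in the car sales industry together with a flag marking whether each one evaded tax. By extracting the taxpayers' operating features we can build a tax-evasion detection model and identify evading taxpayers.

Main workflow of the analysis

1.1 Data extraction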

import pandas as pd
%matplotlib inline
from pylab import mpl
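mpl.rcParams['font.sans-serif'] = ['FangSong'] # set the default font (needed to render the Chinese column names)
mpl.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from rendering as a box in saved figures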

test=pd.read_csv('1.csv')
test.head()

	纳税人编号	销售类型	销售模式	汽车销售平均毛利	维修毛利	企业维修收入占销售收入比重	增值税税负	存货周转率	成本费用利润率	整体理论税负	整体税负控制数	办牌率	单台办牌手续费收入	代办保险率	保费返还率	输出
0	1	国产轿车	4S店	0.0635	0.3241	0.0879	0.0084	8.5241	0.0018	0.0166	0.0147	0.4000	0.02	0.7155	0.1500	正常
1	2	国产轿车	4S店	0.0520	0.2577	0.1394	0.0298	5.2782	-0.0013	0.0032	0.0137	0.3307	0.02	0.2697	0.1367	正常
2	3	国产轿车	4S店	0.0173	0.1965	0.1025	0.0067	19.8356	0.0014	0.0080	0.0061	0.2256	0.02	0.2445	0.1301	正常
3	4	国产轿车	一级代理商	0.0501	0.0000	0.0000	0.0000	1.0673	-0.3596	-0.1673	0.0000	0.0000	0.00	0.0000	0.0000	异常
4	5	进口轿车	4S店	0.0564	0.0034	0.0066	0.0017	12.8470	-0.0014	0.0123	0.0095	0.0039	0.08	0.0117	0.1872	正常

Column glossary (the Chinese names are kept because the code refers to them): 纳税人编号 = taxpayer ID; 销售类型 = sales type (国产轿车 = domestic sedan, 进口轿车 = imported sedan); 销售模式 = sales mode (4S店 = 4S dealership, 一级代理商 = first-tier agent); 汽车销售平均毛利 = average gross margin on car sales; 维修毛利 = repair gross margin; 企业维修收入占销售收入比重 = repair revenue as a share of sales revenue; 增值税税负 = VAT burden; 存货周转率 = inventory turnover; 成本费用利润率 = profit to cost-and-expense ratio; 整体理论税负 = overall theoretical tax burden; 整体税负控制数 = overall tax burden control figure; 办牌率 = plate registration rate; 单台办牌手续费收入 = plate registration fee income per vehicle; 代办保险率 = insurance agency rate; 保费返还率 = premium rebate rate; 输出 = output label (正常 = normal, 异常 = abnormal).

test['输出'].unique()

array(['正常', '异常'], dtype=object)

def function(a):
    # encode the label: 1 for '正常' (normal), 0 for anything else ('异常', abnormal)
    if '正常' in a:
        return 1
    else:
        return 0
test['输出'] = test['输出'].apply(function)
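
The same encoding can also be written as an explicit dictionary lookup. This is only a sketch on a made-up four-row frame (`demo` and `label_map` are names I introduce here, not part of the original notebook). The point is that an unexpected label shows up as NaN instead of silently becoming 0.

```python
import pandas as pd

# Hypothetical miniature of the '输出' column; the real data comes from 1.csv.
demo = pd.DataFrame({'输出': ['正常', '异常', '正常', '正常']})

# Explicit mapping: values missing from the dictionary become NaN.
label_map = {'正常': 1, '异常': 0}
demo['label'] = demo['输出'].map(label_map)

# Stop early if an unexpected label slipped in, then cast to int.
if demo['label'].isna().any():
    raise ValueError('unexpected value in the label column')
demo['label'] = demo['label'].astype(int)

# Class balance after encoding.
print(demo['label'].value_counts().to_dict())
```

Checking the class balance right after encoding is cheap and makes it obvious which class the 1 stands for.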

1.2 Checking for missing values

test.isnull().sum()

纳税人编号            0
销售类型             0
销售模式             0
汽车销售平均毛利         0
维修毛利             0
企业维修收入占销售收入比重    0
增值税税负            0
存货周转率            0
成本费用利润率          0
整体理论税负           0
整体税负控制数          0
办牌率              0
单台办牌手续费收入        0
代办保险率            0
保费返还率            0
输出               0
dtype: int64

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124 entries, 0 to 123
Data columns (total 16 columns):
纳税人编号            124 non-null int64
销售类型             124 non-null object
销售模式             124 non-null object
汽车销售平均毛利         124 non-null float64
维修毛利             124 non-null float64
企业维修收入占销售收入比重    124 non-null float64
增值税税负            124 non-null float64
存货周转率            124 non-null float64
成本费用利润率          124 non-null float64
整体理论税负           124 non-null float64
整体税负控制数          124 non-null float64
办牌率              124 non-null float64
单台办牌手续费收入        124 non-null float64
代办保险率            124 non-null float64
保费返还率            124 non-null float64
输出               124 non-null int64
dtypes: float64(12), int64(2), object(2)
memory usage: 15.6+ KB

1.3 Data visualization

test.hist(figsize=(20,20))

[Figure: 4 x 4 grid of histograms, one per numeric column (output_12_1.png)]

# one-hot encode the categorical variables with pandas
test_1 = pd.get_dummies(test)

test_1.head()

	纳税人编号	汽车销售平均毛利	维修毛利	企业维修收入占销售收入比重	增值税税负	存货周转率	成本费用利润率	整体理论税负	整体税负控制数	办牌率	...	销售类型_国产轿车	销售类型_进口轿车	销售模式_4S店	销售模式_一级代理商
0	1	0.0635	0.3241	0.0879	0.0084	8.5241	0.0018	0.0166	0.0147	0.4000	...	1	0	1	0
1	2	0.0520	0.2577	0.1394	0.0298	5.2782	-0.0013	0.0032	0.0137	0.3307	...	1	0	1	0
2	3	0.0173	0.1965	0.1025	0.0067	19.8356	0.0014	0.0080	0.0061	0.2256	...	1	0	1	0
3	4	0.0501	0.0000	0.0000	0.0000	1.0673	-0.3596	-0.1673	0.0000	0.0000	...	1	0	0	1
4	5	0.0564	0.0034	0.0066	0.0017	12.8470	-0.0014	0.0123	0.0095	0.0039	...	0	1	1	0

5 rows × 27 columns
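
Note that `pd.get_dummies(test)` keeps the 纳税人编号 ID column as a feature. A variant that drops the identifier first and encodes only the two categorical columns might look like the sketch below. The three-row `demo` frame is made up for illustration, and whether dropping the ID helps on the real data is an assumption I have not tested.

```python
import pandas as pd

# Hypothetical three-row stand-in for the taxpayer table.
demo = pd.DataFrame({
    '纳税人编号': [1, 2, 3],
    '销售类型': ['国产轿车', '进口轿车', '国产轿车'],
    '销售模式': ['4S店', '4S店', '一级代理商'],
    '维修毛利': [0.3241, 0.0034, 0.0],
})

# Drop the identifier, then one-hot encode only the categorical columns.
features = demo.drop(columns=['纳税人编号'])
encoded = pd.get_dummies(features, columns=['销售类型', '销售模式'])

print(encoded.columns.tolist())
```

Naming the columns explicitly also protects against a numeric column being read as text and exploding into one dummy column per value.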

2.1 PCA dimensionality reduction

# Standardize the features, reduce them with PCA, then split into training and test sets
y = test_1['输出']
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
# standardize the features (note: the 纳税人编号 ID column is still among them)
x = test_1.drop(['输出'], axis=1)
x = X_scaler.fit_transform(x)
# PCA
pca = PCA(n_components=0.9)  # keep enough components to explain 90% of the variance
pca.fit(x)
X = pca.transform(x)

print('data shape: {0}; no. positive: {1}; no. negative: {2}'.format(
    X.shape, y[y==1].shape[0], y[y==0].shape[0]))
from sklearn.model_selection import train_test_split
# hold out 10% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

data shape: (124, 17); no. positive: 71; no. negative: 53
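
One thing to be aware of in the cell above: the scaler and the PCA are fitted on all 124 rows before the split, so the test rows influence the transformation. A possible alternative, sketched here on synthetic data from `make_classification` (not the taxpayer table), is to split first and fit a small pipeline on the training part only. The names `X_demo`, `reducer` and so on are mine.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows as the taxpayer table.
X_demo, y_demo = make_classification(
    n_samples=124, n_features=26, n_informative=8, random_state=0)

# Split first; stratify keeps the class ratio, random_state makes it repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0)

# Fit the scaler and the PCA on the training rows only, then reuse them on the test rows.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.9))
X_tr_red = reducer.fit_transform(X_tr)
X_te_red = reducer.transform(X_te)

pca_step = reducer.named_steps['pca']
print(X_tr_red.shape, X_te_red.shape)
print(pca_step.explained_variance_ratio_.sum())
```

With `test_size=0.1` the split is 111 training rows and 13 test rows, the same sizes as in the notebook. Fixing `random_state` also makes the reports further down reproducible from run to run.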

3.1 SVM model

from sklearn.svm import SVC
from sklearn.metrics import classification_report
svm = SVC() # define the SVM model (importing SVC directly avoids shadowing the sklearn.svm module)
# fit the model
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))

             precision    recall  f1-score   support

          0       0.57      0.80      0.67         5
          1       0.83      0.62      0.71         8

avg / total       0.73      0.69      0.70        13
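
The report can be read back into raw counts. Assuming the rounded figures come from 4 of the 5 class-0 samples and 5 of the 8 class-1 samples being classified correctly, the arithmetic works out: precision for class 0 is 4/7 = 0.57 and for class 1 is 5/6 = 0.83. The sketch below rebuilds that with illustrative labels. The order of the labels is made up and only the counts matter.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Illustrative labels only, arranged to match the counts implied by the report:
# class 0: 4 right, 1 wrong; class 1: 5 right, 3 wrong.
y_true = np.array([0] * 5 + [1] * 8)
y_pred = np.array([0, 0, 0, 0, 1] + [1, 1, 1, 1, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

print(cm)
print(precision, recall, f1)
```

With only 13 test samples, a single sample moving from one cell to another shifts these scores by several points, so small differences between models here should not be over-read.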

3.2 Decision tree model

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier() # define the decision tree model
# fit the model
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))

             precision    recall  f1-score   support

          0       0.44      0.80      0.57         5
          1       0.75      0.38      0.50         8

avg / total       0.63      0.54      0.53        13

# predictions on the test set (from the SVM, not the tree)
svm.predict(X_test)

array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1], dtype=int64)
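
Because one 13-sample test set gives noisy scores, a fairer way to compare the SVM and the tree could be stratified k-fold cross-validation. The sketch below uses synthetic data from `make_classification` with the same shape as the PCA output (124 x 17), so the numbers it prints say nothing about the real taxpayer data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in shaped like the PCA-reduced features.
X_demo, y_demo = make_classification(n_samples=124, n_features=17, random_state=0)

# Five stratified folds: every sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {'svm': SVC(), 'tree': DecisionTreeClassifier(random_state=0)}

# Macro F1 weights both classes equally, whatever their sizes.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring='f1_macro')
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(name, fold_scores.mean(), fold_scores.std())
```

The standard deviation across folds gives a rough idea of how much of the gap between two models is just noise.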

Building the LM neural network model

# Build the LM neural network model (the optimizer used below is Adam)
from keras.models import Sequential  # container for a linear stack of layers
from keras.layers import Dense, Activation
net_file = 'net.model'
net = Sequential()  # create the network
net.add(Dense(units=10, input_dim=17))  # 17 inputs = the 17 PCA components
net.add(Activation('relu'))
net.add(Dense(units=1))
net.add(Activation('sigmoid'))
net.compile(loss='binary_crossentropy', optimizer='adam')
net.fit(X_train, y_train, epochs=2, batch_size=10)  # mini-batches of 10 samples
net.save_weights(net_file)  # save the trained weights

Using Theano backend.

Epoch 1/2
111/111 [==============================] - 0s 4ms/step - loss: 0.7726
Epoch 2/2
111/111 [==============================] - 0s 4ms/step - loss: 0.7472

predict_result=net.predict_classes(X_test).reshape(len(X_test))  # predicted class labels

predict_result

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
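
As far as I know, `predict_classes` is no longer available in recent TensorFlow/Keras releases. For a single sigmoid output the same labels can be obtained by thresholding the predicted probabilities at 0.5. The sketch below uses made-up probabilities in place of `net.predict(X_test)`, so it runs without Keras.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like the result of net.predict(X_test).
probs = np.array([[0.31], [0.62], [0.48], [0.50], [0.91]])

# Threshold at 0.5 and flatten to a 1-D label vector.
labels = (probs > 0.5).astype(int).ravel()
print(labels)
```

Keeping the probabilities around also makes it possible to try other thresholds, which may matter here since the network predicts class 0 for most test samples.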

print(classification_report(y_test,predict_result))

             precision    recall  f1-score   support

          0       0.45      1.00      0.62         5
          1       1.00      0.25      0.40         8

avg / total       0.79      0.54      0.49        13

Happy丶lazy


20191202_2_Identifying tax evaders
