Knowledge Graph Modeling

Our goal was business anti-fraud, and the existing features could not be used because another modeling team already owned them. To improve anti-fraud performance, we could only mine more and better variables from the raw data, without any of the existing features, while still aiming to outperform the existing models.

After understanding the situation, I started building a relationship network, because this was an area the company had not yet covered, and even across the industry few companies do it well.

So I just got started. The work began with researching data types and relationships, and with data cleaning. The first problem was that SQL cannot express a graph structure, so I wrote my own Java code to read data from HDFS, build the graph in memory, and compute features. Even though I got a machine with 256 GB of memory, I still ran out: there were simply too many edges to load, hundreds of millions of them. The next step was to optimize the memory footprint of the data structures: use a short where an int would do, use a long instead of a String, and even shard the data yourself so that when computing features you only load the data the current shard needs into memory (see the sketch below). Finally the first milestone was reached: the features were there, the labels were there. It took 1.5 months from starting the work to building the first model. Fortunately, the model performed remarkably well and won the support of management.
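
A minimal sketch of the memory-saving idea, with hypothetical class and field names: encode node IDs as primitive longs instead of Strings, keep edges in parallel long[] arrays instead of object lists, and hash-shard edges by source node so each feature pass only loads one shard.

```java
import java.util.Arrays;

public class EdgeShard {
    private long[] src;   // source node IDs, primitive longs instead of String keys
    private long[] dst;   // destination node IDs
    private int size;

    public EdgeShard(int capacity) {
        this.src = new long[capacity];
        this.dst = new long[capacity];
    }

    public void add(long from, long to) {
        if (size == src.length) {                 // grow when full
            src = Arrays.copyOf(src, size * 2);
            dst = Arrays.copyOf(dst, size * 2);
        }
        src[size] = from;
        dst[size] = to;
        size++;
    }

    // Example feature: out-degree of one node within this shard.
    public int outDegree(long node) {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (src[i] == node) count++;
        }
        return count;
    }

    // Shard routing: key every edge by its source node, so all outgoing edges of a
    // node land in the same shard and per-shard degree features stay correct.
    public static int shardOf(long srcNode, int numShards) {
        return (Long.hashCode(srcNode) & 0x7fffffff) % numShards;
    }
}
```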

Since the features were strong, even a plain XGBoost model already showed a significant lift. The next step was deployment, and only then did I realize that going online was the biggest difficulty: for this model to run in production, a real-time relationship network had to be built first.
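
A minimal training sketch using XGBoost4J, assuming the graph features and labels have been exported to LibSVM files; the file names and parameter values are hypothetical, not the author's actual settings.

```java
import java.util.HashMap;
import java.util.Map;

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class TrainFraudModel {
    public static void main(String[] args) throws XGBoostError {
        // Graph features + fraud labels exported from the offline pipeline.
        DMatrix train = new DMatrix("graph_features_train.libsvm");
        DMatrix valid = new DMatrix("graph_features_valid.libsvm");

        Map<String, Object> params = new HashMap<>();
        params.put("objective", "binary:logistic");  // fraud / not-fraud
        params.put("eval_metric", "auc");
        params.put("max_depth", 6);
        params.put("eta", 0.1);

        Map<String, DMatrix> watches = new HashMap<>();
        watches.put("valid", valid);

        Booster booster = XGBoost.train(train, params, 200, watches, null, null);
        booster.saveModel("fraud_model.bin");        // scores are later served from HBase
    }
}
```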

Where should the relationship network be stored? I tried Neo4j, OrientDB, Titan, and other graph databases. It took more than three weeks to write demos and test logical correctness, performance, and so on. The conclusion: none of the open-source graph databases met our needs.

A valuable model should not die on a technical implementation problem, so I refused to give up. In the end I decided to store only the edge data, in an HBase table, and let HBase handle the storage problem. Using the Flink real-time streaming framework, I wrote a pile of code to brute-force incremental edges into HBase. It took more than two months to build the real-time relationship network with Flink + Kafka, with end-to-end latency under one second and parallelism above 30.
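
A minimal sketch of what such a Flink + Kafka to HBase pipeline can look like, assuming edges arrive on a Kafka topic as "src,dst,timestamp" strings; the topic, table, and column-family names are hypothetical, not the author's actual schema.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EdgeStreamJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(30);  // roughly the concurrency level mentioned above

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "edge-writer");

        env.addSource(new FlinkKafkaConsumer<>("edge-events", new SimpleStringSchema(), props))
           .addSink(new HBaseEdgeSink());

        env.execute("realtime-relationship-network");
    }

    // Writes each incremental edge as one HBase cell: rowkey = src node, qualifier = dst node.
    static class HBaseEdgeSink extends RichSinkFunction<String> {
        private transient Connection connection;
        private transient Table table;

        @Override
        public void open(Configuration parameters) throws Exception {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            table = connection.getTable(TableName.valueOf("graph_edges"));
        }

        @Override
        public void invoke(String value, Context context) throws Exception {
            String[] parts = value.split(",");              // src,dst,timestamp
            Put put = new Put(Bytes.toBytes(parts[0]));
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes(parts[1]), Bytes.toBytes(parts[2]));
            table.put(put);
        }

        @Override
        public void close() throws Exception {
            if (table != null) table.close();
            if (connection != null) connection.close();
        }
    }
}
```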

After that came the graph integrity tests and the timeliness tests. Once those passed, I still had to write the feature engineering on top of the relationship network and the model prediction myself, and the prediction results had to be observed online for a month. The real launch did not happen until October; I had started at the end of February, so it took seven months to get it online.
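
As one illustration of feature engineering on the relationship network, here is a minimal sketch of a first-degree out-degree feature read from the edge table, following the hypothetical schema in the previous sketch (rowkey = source node, one qualifier per neighbour).

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DegreeFeature {
    // How many distinct counterparties a node connects to (first-degree out-degree).
    public static int outDegree(String nodeId) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("graph_edges"))) {
            Get get = new Get(Bytes.toBytes(nodeId));
            get.addFamily(Bytes.toBytes("e"));               // one qualifier per neighbour
            Result result = table.get(get);
            return result.isEmpty() ? 0 : result.getFamilyMap(Bytes.toBytes("e")).size();
        }
    }
}
```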

Looking back at the whole process, the modeling work took no more than two weeks. Stream computing, building the real-time relationship network, feature engineering, feature accuracy testing, and the other engineering work took 99% of the time, all done by myself. For the business side, using it is as simple as looking up the model prediction score in an HBase table by rowkey, which gives no hint of how hard the road to production was.
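
A minimal sketch of that business-side lookup, assuming scores are written to a hypothetical "fraud_scores" table with column family "s" and qualifier "score"; the actual table layout is not given in the original.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScoreLookup {
    // Fetch the model prediction score for one entity by rowkey; -1 if not scored yet.
    public static float fraudScore(String rowkey) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("fraud_scores"))) {
            byte[] value = table.get(new Get(Bytes.toBytes(rowkey)))
                                .getValue(Bytes.toBytes("s"), Bytes.toBytes("score"));
            return value == null ? -1f : Bytes.toFloat(value);
        }
    }
}
```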

The model performed well in production, and the hard work was worth it.
