HIN Application Research Summary

1. Code security

iDev: enhancing social coding security by cross-platform user identification between GitHub and stack overflow【A】

Yujie Fan, Yiming Zhang, Shifu Hou, Lingwei Chen, Yanfang Ye, Chuan Shi, Liang Zhao, & Shouhuai Xu (2019). iDev: Enhancing Social Coding Security by Cross-platform User Identification Between GitHub and Stack Overflow. International Joint Conference on Artificial Intelligence.

Enhancing social coding security with cross-platform user identification between GitHub and Stack Overflow

Background and issues: As platforms such as GitHub and Stack Overflow grow in popularity, security risks grow with them, mainly because risky or harmful code can easily be embedded and propagated. The paper uses heterogeneous graph representation learning to identify users across platforms and to detect cross-platform poisoning attackers.


Contribution: Automatic cross-platform (GitHub and Stack Overflow) user identification that uses user attributes and social coding attributes to identify users and detect poisoning attackers.

Method and model: The paper constructs a cross-platform user-code interaction graph, modeled as an attributed heterogeneous information network (AHIN), and proposes AHIN2Vec to learn user representations on it. The learned representations serve as node features for the downstream task of cross-platform user identification.
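
To make the idea of an attributed heterogeneous information network concrete, here is a minimal sketch of a container that stores typed nodes with attributes and typed edges. The class name `AHIN`, the node types, the relation names, and the sample users are all hypothetical and chosen for illustration; this is not the data model used in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AHIN:
    """Minimal attributed heterogeneous information network (illustrative only)."""
    node_type: dict = field(default_factory=dict)   # node id -> node type
    node_attr: dict = field(default_factory=dict)   # node id -> attribute dict
    edges: dict = field(default_factory=lambda: defaultdict(set))  # (src, relation) -> {dst}

    def add_node(self, node_id, ntype, **attrs):
        self.node_type[node_id] = ntype
        self.node_attr[node_id] = attrs

    def add_edge(self, src, relation, dst):
        self.edges[(src, relation)].add(dst)
        # Store the inverse relation so the graph can be traversed both ways.
        self.edges[(dst, relation + "^-1")].add(src)

    def neighbors(self, node_id, relation):
        return self.edges.get((node_id, relation), set())


# Hypothetical toy data: one GitHub account and one Stack Overflow account.
g = AHIN()
g.add_node("gh:alice", "github_user", location="Berlin")
g.add_node("so:alice_b", "so_user", location="Berlin")
g.add_node("repo:1", "repository")
g.add_node("q:42", "question")
g.add_edge("gh:alice", "owns", "repo:1")
g.add_edge("so:alice_b", "asks", "q:42")
```

A representation learner would consume both the typed structure (`edges`) and the attributes (`node_attr`); matching accounts across platforms then becomes a comparison of the learned user vectors.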


2. API recommendation

Group preference based API recommendation via heterogeneous information network【A】

Fenfang Xie, Liang Chen, Dongding Lin, Chuan Chen, Zibin Zheng, & Xiaola Lin (2018). Poster: Group Preference Based API Recommendation via Heterogeneous Information Network. International Conference on Software Engineering.

Heterogeneous information network API recommendation based on group recommendation

Background and problem: A heterogeneous information network (HIN) is a logical network that can contain multiple types of nodes and multiple types of edges (relationships). Previous research on API recommendation mainly focuses on homogeneous information networks with few edge types, so it does not take advantage of the rich heterogeneous information available.

Methods and models: GPRec. Input: mashup information (tag, category, description), API information (tag, category, description, provider), and historical call records between mashups and APIs. Model: GPRec uses mashups, APIs, and their attributes to construct a heterogeneous information network. Mashups are connected through different meta-paths, each capturing a different semantic relationship, and four similarity measures are used to compute the similarity between mashups. Each mashup-API pair is then ranked with a group-preference-based Bayesian personalized ranking algorithm. Output: personalized ranking results.
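
As an illustration of meta-path-based similarity between mashups, the sketch below computes PathSim over a Mashup-Tag-Mashup meta-path. The summary does not name the four similarity measures GPRec uses, so choosing PathSim here is an assumption, and the mashup names and tags are made-up toy data.

```python
from collections import Counter

# Hypothetical toy data: the tags attached to each mashup.
mashup_tags = {
    "m1": ["maps", "travel"],
    "m2": ["maps", "travel", "photos"],
    "m3": ["music"],
}


def path_count(a, b):
    """Number of Mashup-Tag-Mashup path instances between mashups a and b."""
    tags_a, tags_b = Counter(mashup_tags[a]), Counter(mashup_tags[b])
    return sum(tags_a[t] * tags_b[t] for t in tags_a)


def pathsim(a, b):
    """PathSim: 2 * |paths a~b| / (|paths a~a| + |paths b~b|)."""
    denom = path_count(a, a) + path_count(b, b)
    return 2 * path_count(a, b) / denom if denom else 0.0


sim_12 = pathsim("m1", "m2")  # two shared tags
sim_13 = pathsim("m1", "m3")  # no shared tags
```

Swapping the tag dictionary for categories or providers gives the same measure under a different meta-path, which is how one network can yield several semantic views of "similar mashups".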


3. Android malware detection

Out-of-sample Node Representation Learning for Heterogeneous Graph in Real-time Android Malware Detection【A】

Yanfang Ye, Shifu Hou, Lingwei Chen, Jingwei Lei, Wenqiang Wan, Jiabin Wang, Qi Xiong, & Fudong Shao (2019). Out-of-sample Node Representation Learning for Heterogeneous Graph in Real-time Android Malware Detection. International Joint Conference on Artificial Intelligence.

Out-of-Sample Node Representation Learning for Heterogeneous Graphs in Real-time Android Malware Detection

Background and issues: The increasing sophistication of Android malware calls for defensive techniques that can protect mobile users from these threats. Because Android malware detection is speed-sensitive and needs a cost-effective solution, scalable methods for learning heterogeneous graph (HG) representations are needed, especially for out-of-sample nodes. The paper uses heterogeneous graph representation learning to characterize apps and detect malware.


Method and model: Construct a heterogeneous graph of app-component relationships and learn node representations on it with the proposed heterogeneous graph learning method. The learned representations are used as node features for the downstream malware detection task.
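
The out-of-sample problem can be pictured with a deliberately simple baseline: a new app that was not in the training graph gets a vector by averaging the vectors of its already-embedded neighbors. This neighbor-averaging rule is an assumption made for illustration and is not the paper's method; the component identifiers and two-dimensional vectors are hypothetical.

```python
def out_of_sample_embedding(neighbor_ids, embeddings):
    """Approximate a new node's vector as the mean of its embedded neighbors."""
    known = [embeddings[n] for n in neighbor_ids if n in embeddings]
    if not known:
        raise ValueError("no embedded neighbors to aggregate")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical embeddings learned offline for two components.
embeddings = {"comp:sms": [1.0, 0.0], "comp:device_id": [0.0, 1.0]}

# A new app that uses both known components plus one never seen before.
new_app_vec = out_of_sample_embedding(
    ["comp:sms", "comp:device_id", "comp:unseen"], embeddings
)
```

The appeal of any such shortcut is that no retraining of the whole graph is needed when a new app arrives, which is the speed requirement the paper's background section emphasizes.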


4. Duplicate bug report detection

HINDBR: Heterogeneous Information Network Based Duplicate Bug Report Prediction【B】

Guanping Xiao, Xiaoting Du, Yulei Sui, & Tao Yue (2020). HINDBR: Heterogeneous Information Network Based Duplicate Bug Report Prediction. International Symposium on Software Reliability Engineering.

Duplicate Bug Report Prediction Based on Heterogeneous Information Network

Background and problem: Bug tracking systems contain duplicate bug reports. Existing methods mainly identify duplicates through text similarity, but this approach becomes infeasible in the just-in-time (JIT) setting.

Methods and models: HINDBR. Through HIN representation learning, HINDBR embeds the semantic relationships of bug reports into a low-dimensional space and uses the Manhattan distance to measure how far apart two report vectors are. When two vectors are close to each other in the latent space, the corresponding bug reports are considered duplicates.


Constructed HIN:

  1. Nodes: bug report (BID) [TextBID, unstructured features], component (COM), product (PRO), version (VER), priority (PRI), and severity (SEV) [structured features].
  2. Relations: Bug-Component, Component-Product, Bug-Version, Bug-Priority, Bug-Severity.

Feature representation and fusion:

  1. Structural features: HIN2Vec is used to pre-train the structural features.
  2. Unstructured features: Word2Vec is used to pre-train features for the unstructured text.
  3. For each node, the structural and unstructured features are fused into its final feature representation.

Duplicate Bug similarity measurement method: Manhattan distance
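
A minimal sketch of the comparison step follows: two bug reports are each represented by a fused vector, and the Manhattan (L1) distance between the vectors decides whether they look like duplicates. The summary does not say how the features are fused, so concatenation is an assumption here, and the vectors and the `THRESHOLD` value are hypothetical.

```python
def fuse(structural, textual):
    """Concatenate structural and text features (one simple fusion choice)."""
    return list(structural) + list(textual)


def manhattan(u, v):
    """Manhattan (L1) distance: the sum of absolute coordinate differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical pre-trained features for two bug reports.
bug_a = fuse([0.5, 1.0], [0.0, 2.0])
bug_b = fuse([0.5, 0.0], [1.0, 2.0])

distance = manhattan(bug_a, bug_b)
THRESHOLD = 2.5  # hypothetical cutoff; in practice it would be tuned on labeled pairs
is_duplicate = distance < THRESHOLD
```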

Limitations and future work: Because the pre-trained vectors are built from one specific dataset, the model cannot be generalized to other datasets for duplicate bug report detection.

5. Program understanding and representation

Learning to represent programs with heterogeneous graphs【B】

Wenhan Wang, Kechi Zhang, Ge Li, & Zhi Jin (2021). Learning to Represent Programs with Heterogeneous Graphs. arXiv: Software Engineering.

Learning Program Representations Using Heterogeneous Graphs

Background and problem: Code representation converts programs into semantic vectors and is crucial for source code processing. Graphs built by augmenting the abstract syntax tree (AST) contain rich semantic and structural information. Existing methods for learning code representations mainly use homogeneous graphs, so the type information of nodes and edges is ignored. The paper uses heterogeneous graph representation learning to understand source code, with method name prediction and code classification as tasks.


Method and model: The Heterogeneous Program Graph (HPG) assigns types to nodes and edges; the Abstract Syntax Description Language (ASDL) is used to generate the heterogeneous graph from the abstract syntax tree. Node representations are learned with the Heterogeneous Graph Transformer (HGT), and the results are used as node features for downstream tasks.
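
The idea of typed nodes and typed edges can be tried out with Python's own `ast` module, whose grammar is itself specified in ASDL. In the sketch below the node type is the AST class name and the edge type is the field name that links parent to child. This is only an analogy to the HPG construction, not the paper's implementation; the function name `build_typed_graph` is made up for this example.

```python
import ast


def build_typed_graph(source):
    """Return (nodes, edges): nodes are (id, type), edges are (parent, field, child)."""
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node):
        node_id = len(nodes)
        nodes.append((node_id, type(node).__name__))   # node type = AST class name
        for field_name, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    # edge type = the field that connects parent and child
                    edges.append((node_id, field_name, visit(child)))
        return node_id

    visit(tree)
    return nodes, edges


nodes, edges = build_typed_graph("x = a + 1")
node_types = {t for _, t in nodes}
edge_types = {f for _, f, _ in edges}
```

For the statement above, the node types include `Assign` and `BinOp`, and the edge types include `targets`, `value`, `left`, and `right`. A homogeneous graph would discard exactly this information.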


[Figures: Heterogeneous Graph Transformer (HGT)]

Limitations of the method:

  1. Attribute features of nodes and edges are not introduced, so the representations need to be initialized.
  2. The sub-token structure relies on strong assumptions about how entities are named.
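
The second limitation is easy to see with a small sub-token splitter. The sketch below splits snake_case and camelCase identifiers with a regular expression; the splitting rule is an assumption chosen for illustration and may differ from the one used in the paper.

```python
import re


def subtokens(identifier):
    """Split snake_case and camelCase identifiers into lowercase sub-tokens."""
    parts = []
    for chunk in identifier.split("_"):
        # Acronyms (HTTP), capitalized or lowercase words (Response, get), digit runs.
        parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
    return [p.lower() for p in parts]


well_named = subtokens("parse_HTTPResponse")   # ['parse', 'http', 'response']
camel_case = subtokens("getFileName")          # ['get', 'file', 'name']
poorly_named = subtokens("tmp")                # ['tmp']
```

Descriptive names yield informative sub-tokens, while names such as `tmp` contribute almost nothing, so a model built on sub-tokens depends on developers following naming conventions.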

6. Bug report developer assignment

KSAP: An approach to bug report assignment using KNN search and heterogeneous proximity【B】

Wen Zhang, Song Wang, & Qing Wang (2016). KSAP: An approach to bug report assignment using KNN search and heterogeneous proximity. Information & Software Technology.

A Bug Report Assignment Method Based on KNN Search and Heterogeneous Proximity

Background and Issues : Timely assignment of bug reports to developers is critical to software quality assurance. Assigning bugs to the proper developers is difficult as software systems evolve.

Models and methods: KSAP. When a new bug report is submitted, a heterogeneous graph is constructed for the bug report. KSAP then assigns the report to developers in two stages. The first stage uses k-nearest-neighbor (KNN) search to find historically resolved bug reports that are similar to the new one. The second stage ranks the developers who contributed to those similar reports by different kinds of heterogeneous proximity.
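
The two-stage flow can be sketched as follows. Stage 1 finds the k most similar resolved reports with cosine similarity over word counts; stage 2 is heavily simplified to similarity-weighted votes, whereas KSAP ranks developers by heterogeneous proximity. The report texts, developer names, and the function `recommend` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical resolved reports: (text, developers who contributed).
history = [
    ("crash when saving file", ["dev_a", "dev_b"]),
    ("crash on file open", ["dev_a"]),
    ("button color wrong in dark theme", ["dev_c"]),
]


def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def recommend(new_report, k=2):
    # Stage 1: k nearest historical reports by text similarity.
    scored = sorted(
        ((cosine(new_report, text), devs) for text, devs in history),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Stage 2 (simplified): rank developers by similarity-weighted votes.
    votes = defaultdict(float)
    for sim, devs in scored:
        for dev in devs:
            votes[dev] += sim
    return sorted(votes, key=votes.get, reverse=True)


ranking = recommend("crash while saving a file")
```

Here `dev_a` ranks first because that developer contributed to both of the nearest reports; `dev_c` is never considered because the unrelated report falls outside the k nearest neighbors.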


Entities: developer, bug, comment, component, product.

Relations:

[Figure: relations among the entities]

Future work: More entities in the bug repository, such as the version and platform of a bug report, will be considered so that more heterogeneous information can be used to recommend developers. The authors also plan to address the problem of overspecialization in heterogeneous proximity ranking.

7. Smart Contract Vulnerability Detection

MANDO: Multi-Level Heterogeneous Graph Embeddings for Fine-Grained Detection of Smart Contract Vulnerabilities

Hoang H. Nguyen, Nhat-Minh Nguyen, Chunyao Xie, Zahra Ahmadi, Daniel Kudendo, Thanh-Nam Doan, & Lingxiao Jiang (2022). MANDO: Multi-Level Heterogeneous Graph Embeddings for Fine-Grained Detection of Smart Contract Vulnerabilities

Multi-level heterogeneous embedding for fine-grained detection of smart contract vulnerabilities

Background and problem: Learning on heterogeneous graphs composed of different types of nodes and edges improves on the results of homogeneous graph techniques. A control flow graph, which represents the possible execution flows of software code, is one such heterogeneous graph. Because these graphs carry more semantic information about the code, developing techniques and tools for them is beneficial for detecting software vulnerabilities. Existing methods cannot handle heterogeneous graphs that have many nodes and edges of many different types.

Models and methods: MANDO. Given the source code of an Ethereum smart contract, MANDO constructs a heterogeneous control flow graph and a heterogeneous call graph, and fuses the two into a heterogeneous contract graph. Metapath-based methods then learn node representations on the heterogeneous contract graph, and the learned representations are used as node features for the downstream task of identifying vulnerabilities in contracts.
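
A rough sketch of what fusing the two graphs could look like is given below. Both graphs are lists of typed edges, and the fused graph adds a `contains` edge from each function to its control-flow statements. The `contains` linking rule, the edge type names, and the toy contract are assumptions made for illustration; the summary does not describe how MANDO performs the fusion.

```python
# Hypothetical toy graphs; each edge is (source, edge_type, target).
control_flow_edges = [
    ("withdraw:entry", "next", "withdraw:check_balance"),
    ("withdraw:check_balance", "if_true", "withdraw:send"),
]
call_edges = [("withdraw", "internal_call", "send_funds")]


def fuse_graphs(cfg_edges, cg_edges):
    """Merge both edge lists and link each function node to its CFG statements."""
    fused = list(cfg_edges) + list(cg_edges)
    statements = {n for src, _, dst in cfg_edges for n in (src, dst)}
    for stmt in sorted(statements):
        function = stmt.split(":", 1)[0]   # "withdraw:entry" -> "withdraw"
        fused.append((function, "contains", stmt))
    return fused


contract_graph = fuse_graphs(control_flow_edges, call_edges)
edge_types = {etype for _, etype, _ in contract_graph}
```

The fused graph mixes statement-level and function-level nodes, which is what allows detection at more than one granularity.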


8. Unsafe Code Fragment Detection

ICSD: An Automatic System for Insecure Code Snippet Detection in Stack Overflow over Heterogeneous Information Network【B】

Yanfang Ye, Shifu Hou, Lingwei Chen, Xin Li, Liang Zhao, Shouhuai Xu, Jiabin Wang, & Qi Xiong (2018). ICSD: An Automatic System for Insecure Code Snippet Detection in Stack Overflow over Heterogeneous Information Network. Annual Computer Security Applications Conference.

An automated system for detecting insecure code snippets in Stack Overflow using a heterogeneous information network

Background and problems: As the social coding paradigm (e.g., Stack Overflow) becomes more popular, so does the security risk that unsafe code is easily embedded and distributed.

Model and method :

  1. Use code content [function name, function, API] and social coding attributes to detect unsafe code snippets in Stack Overflow. Social coding attributes include users, badges, questions, answers, code snippets, and more.
  2. Leveraging HIN to learn rich semantic relations, a meta-path-based approach incorporates higher-level semantic features to establish the relevance of code segments.
  3. A snippet2vec framework is proposed to learn the representation of rich semantic knowledge and structural knowledge in HIN.
  4. Multi-view fusion classifiers for downstream tasks [detection of unsafe code fragments].


Model advantages :

  1. An up-to-date feature representation of Stack Overflow data.
  2. A multi-view fusion classifier based on state-of-the-art representation learning models.
  3. A practical system for automatic detection of unsafe code snippets.

9. Bug report developer assignment

A spatial–temporal graph neural network framework for automated software bug triaging

Hongrun Wu, Yutao Ma, Zhenglong Xiang, Chen Yang, & Keqing He (2021). A Spatial-Temporal Graph Neural Network Framework for Automated Software Bug Triaging. arXiv: Software Engineering.

Spatial-Temporal Graph Neural Networks for Automated Software Bug Triaging

Background and questions :

  1. A bug triaging procedure is essential for assigning bugs efficiently to the appropriate developers.
  2. Most existing methods focus only on the static tossing graph of a single time slice, which lacks dynamics and scalability.
  3. No previous work considers the periodic interactions among developers.

Model and method :

  1. The paper proposes a spatial-temporal dynamic graph neural network (ST-DGNN), which has two parts: a joint random walk (JRWalk) mechanism and a graph recurrent convolutional neural network (GRCNN) model.
  2. JRWalk uses two sampling strategies to sample the local topology, considering the importance of nodes [node degree, reputation] and of edges [edge weight, preference].
  3. GRCNN has three structurally identical components that model hourly, daily, and weekly periodicity. It learns features of dynamic developer collaboration networks (DCNs) on spatial-temporal graphs. [CNN, LSTM]
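
To illustrate importance-aware sampling, the sketch below performs a random walk in which each step is chosen with probability proportional to edge weight. It covers only the edge-weight side of the idea; JRWalk combines two strategies and also accounts for node importance, which is not reproduced here. The graph, weights, and function name are hypothetical.

```python
import random

# Hypothetical developer collaboration graph: node -> {neighbor: edge weight}.
graph = {
    "dev_a": {"dev_b": 3.0, "dev_c": 1.0},
    "dev_b": {"dev_a": 3.0},
    "dev_c": {"dev_a": 1.0},
}


def weighted_walk(start, length, rng):
    """Walk `length` steps; each next node is drawn proportionally to edge weight."""
    walk = [start]
    for _ in range(length):
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop early
        nodes, weights = zip(*neighbors.items())
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


walk = weighted_walk("dev_a", 5, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the walk reproducible, which helps when sampled neighborhoods feed a training pipeline.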


10. Code Review Recommendations

Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations

Jiyang Zhang, Chandra Maddila, Ram Bairi, Christian Bird, Ujjwal Raizada, Apoorva Agrawal, Yamini Jhawar, Kim Herzig, & Arie van Deursen (2022). Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations

Using Large-Scale Heterogeneous Graph Representation Learning for Code Review Recommendations

Background and questions :

  1. Code review is an important process in the development of mature software.
  2. Most reviewer recommendation systems rely mainly on the history of file changes and review comments. While these methods can identify and suggest qualified reviewers, they may miss reviewers who have the required expertise but have never interacted with the changed files.

Model and method :

  1. Coral, a reviewer recommendation system. It is built on a socio-technical graph constructed from a rich set of entities (developers, repositories, files, pull requests, work items, etc.) and their relationships in a modern source control system.
  2. Using RGCN for heterogeneous graph representation learning.
  3. Adopt an inductive learning paradigm.
  4. The model structure is simple and suitable for large-scale heterogeneous graphs.
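
For readers unfamiliar with RGCN, the sketch below implements one relational graph convolution step in plain Python: each node combines a self-connection with the mean of its neighbors' messages, using a separate weight matrix per relation. The entity names, relation names, and identity weights are hypothetical toy values and do not reflect Coral's actual graph schema or trained parameters.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges_by_relation, weights, self_weight):
    """h_i' = ReLU(W_0 h_i + sum over relations r of mean_{j in N_r(i)} W_r h_j)."""
    out = {}
    for node, h in features.items():
        total = matvec(self_weight, h)                      # self-connection
        for relation, edges in edges_by_relation.items():
            neighbors = [src for src, dst in edges if dst == node]
            for src in neighbors:
                message = matvec(weights[relation], features[src])
                total = [t + m / len(neighbors) for t, m in zip(total, message)]
        out[node] = [max(0.0, t) for t in total]            # ReLU
    return out


# Hypothetical toy graph; an edge (src, dst) sends a message from src to dst.
features = {"dev": [1.0, 0.0], "file": [0.0, 1.0], "pr": [1.0, 1.0]}
edges_by_relation = {"creates": [("dev", "pr")], "modified_in": [("file", "pr")]}
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {"creates": identity, "modified_in": identity}

updated = rgcn_layer(features, edges_by_relation, weights, identity)
```

Because the layer only needs a node's features and its typed neighborhood, it can be applied to nodes that were absent during training, which fits the inductive paradigm listed above.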


[Figure: RGCN]

Limitations:

  1. Attribute features of nodes and edges are not considered.
  2. The IID (independent and identically distributed) assumption must be satisfied.
  3. The model lacks the ability to adaptively fuse node features with graph topology.

Summary

HIN General Process

  1. Collect relevant datasets.
  2. Construct heterogeneous graphs [nodes, edges, attributes, meta-paths].
  3. Learn node representations on the heterogeneous graphs.
  4. Apply the learned representations to downstream tasks.

Challenges Facing HIN

  1. How to collect and clean datasets.
  2. How to construct effective heterogeneous graphs.
  3. How to design suitable encoders that learn node representations for the downstream task.
  4. How to extend to out-of-distribution (OOD) scenarios and to large-scale graph data.
  5. How to construct a sufficient and effective set of meta-paths for meta-path-based representation learning methods.

Heterogeneous Graph Representation Learning Methods

  1. Meta-path-based methods [HIN2Vec]
  2. Attention-based methods [HGT]
  3. Methods combining meta-paths and attention [HAN]

Meta-path

[Figure: meta-path illustration]
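
A meta-path is a sequence of node types that a path through the graph must follow. The sketch below runs a meta-path-guided random walk along Developer-Bug-Component-Bug-Developer, which connects developers who worked on bugs in the same component. The graph, node names, and the function `metapath_walk` are hypothetical and only meant to show the mechanism.

```python
import random

# Hypothetical toy HIN: node types and undirected adjacency lists.
node_type = {
    "d1": "developer", "d2": "developer",
    "b1": "bug", "b2": "bug",
    "c1": "component",
}
adjacency = {
    "d1": ["b1"], "d2": ["b2"],
    "b1": ["d1", "c1"], "b2": ["d2", "c1"],
    "c1": ["b1", "b2"],
}


def metapath_walk(start, metapath, rng):
    """Follow the node types in `metapath`; return None if the walk gets stuck."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the first meta-path type")
    walk = [start]
    for wanted in metapath[1:]:
        candidates = [n for n in adjacency[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            return None
        walk.append(rng.choice(candidates))
    return walk


metapath = ["developer", "bug", "component", "bug", "developer"]
walk = metapath_walk("d1", metapath, random.Random(7))
```

Walks collected this way can be fed to a skip-gram style model, so nodes that co-occur along semantically meaningful paths end up with similar vectors.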
