The why-question traceability explanation problem in graph structure clustering

  A graph is a general-purpose model with strong expressive power that can describe many complex real-world systems. Many applications represent the relationships among data items as a graph: social networks, information networks, collaboration networks, e-commerce networks, communication networks, biological protein networks, and so on. From a data management perspective, loading large-scale data on demand saves a great deal of computing resources. Graph clustering offers one way to meet this need, and it also helps with analyzing, understanding, and visualizing large-scale graphs.

      While studying graph structure clustering algorithms, we found two main problems: the quality of the data, and unreasonable parameters for the clustering method.

      1. Graph data quality problems. These fall into four categories. First, the data source itself may contain missing or wrong information; for example, because of human factors, machine failures, or the limits of positioning technology, the location information of mobile device users is sometimes inaccurate or missing. Second, errors can be introduced when graph data is extracted; for example, most methods for extracting data from web pages are slow, error-prone, and hard to maintain. Third, extraction can produce duplicates; for example, in online services one user may hold several accounts, which creates the illusion of several users. Fourth, integration can introduce errors; when multi-source data is integrated, different sources may disagree about the same fact, which leads to data conflicts and uncertainty. Graph structure clustering is sensitive to its input data, so if the data has quality problems, the clustering results cannot meet users' needs.

      2. Unreasonable clustering parameters. Graph structure clustering is very sensitive to its clustering parameters. Because users have limited domain knowledge, the parameters they submit may not express their real clustering intent. The clustering result returned by the graph database system then differs from what the user expected and does not meet the user's needs. In this case, users may submit many clustering operations until they are satisfied, or give up on the data set altogether. As a result, data resources are wasted, the usability of graph data is reduced, and significant economic losses can follow indirectly.
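The parameter sensitivity described above can be seen on a toy graph. The sketch below is a minimal illustration, not the method of any particular system: it assumes a SCAN-style formulation in which two adjacent vertices are compared by the overlap of their closed neighborhoods, a vertex is a core if at least `mu` vertices (itself included) are similar to it at threshold `eps`, and clusters are grown from cores. The names `similarity`, `eps_neighbors`, `cluster`, `eps`, and `mu` are chosen for this example and do not come from the text.

```python
from math import sqrt

# Toy graph: two triangles joined by one bridge edge (2, 3).
# Maps each vertex to the set of its neighbors.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

def similarity(adj, u, v):
    """Structural similarity of u and v over their closed neighborhoods."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))

def eps_neighbors(adj, v, eps):
    """v itself plus every neighbor whose similarity to v is at least eps."""
    return {u for u in adj[v] | {v} if similarity(adj, u, v) >= eps}

def cluster(adj, eps, mu):
    """Grow one cluster from each core vertex that is not yet in a cluster."""
    cores = {v for v in adj if len(eps_neighbors(adj, v, eps)) >= mu}
    clusters, visited_cores = [], set()
    for seed in sorted(cores):
        if seed in visited_cores:
            continue
        members, stack = set(), [seed]
        while stack:
            v = stack.pop()
            if v in members:
                continue
            members.add(v)
            if v in cores:  # only core vertices expand the cluster
                stack.extend(eps_neighbors(adj, v, eps) - members)
        visited_cores |= members & cores
        clusters.append(members)
    return clusters

# Same data, same mu, three different eps values.
for eps in (0.5, 0.7, 0.9):
    print(eps, cluster(adj, eps, mu=3))
```

On this toy graph with `mu = 3`, `eps = 0.7` separates the two triangles, `eps = 0.5` merges all six vertices into one cluster, and `eps = 0.9` leaves no core vertex and therefore no cluster. A user who does not know which threshold suits the data can easily get any of the three outcomes.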

      These problems lead to the why question. A why question asks two things: first, why unexpected data appears in the query result; second, what can be done to keep that unexpected data out of the result. We would like a graph structure clustering system to support this kind of traceability explanation. Providing traceability explanations for graph structure clustering results can effectively address the user dissatisfaction with clustering results that is caused by data sensitivity and parameter sensitivity.

      The following introduces, by type of query request, the methods that explain why-not questions and why questions through query refinement.

      (1) SQL queries. For why-not questions on relational SPJA (Select-Project-Join-Aggregation) queries, ConQueR, an explanation method based on query refinement, was proposed. It requires that the result of the refined query contain both the original query result and the expected why-not data. Albarrack et al. also proposed a semi-automatic SQL debugger that explains the missing data of SQL queries and suggests repairs to the original query so that it returns the expected data, while Freire et al. studied the explanation of why questions for join queries.

         To explain the aggregate (statistical) results of SQL queries, the concept of intervention was first proposed, that is, removing from the database the tuples that have the greatest impact on the query result. An intervention-based method was then proposed to explain why and why-not questions about aggregate information in SQL queries. Salimi et al. studied the causality of query results in databases based on causality and responsibility, related it to data repair under denial constraints and to consistency-based diagnosis, and then discussed the relationships among query-result causality, abductive diagnosis, and the view update problem, as well as the definition of query-result causality under integrity constraints.
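The idea of intervention can be sketched in a few lines. This is a deliberately simplified leave-one-out version, assuming that each tuple is removed on its own and that impact is measured as the absolute change in the aggregate; the published methods are more general than this. The function names and the `sales` data are made up for the example.

```python
def average(rows):
    """Aggregate used in this example: mean of the value column."""
    return sum(value for _, value in rows) / len(rows)

def rank_by_intervention(rows, aggregate):
    """Rank tuples by how much the aggregate changes when each one is removed."""
    baseline = aggregate(rows)
    effects = []
    for i, row in enumerate(rows):
        remaining = rows[:i] + rows[i + 1:]  # intervention: delete one tuple
        effects.append((abs(aggregate(remaining) - baseline), row))
    effects.sort(key=lambda effect: effect[0], reverse=True)
    return effects

# Hypothetical table of (region, sales) tuples.
sales = [("north", 10), ("south", 12), ("east", 11), ("west", 95)]
ranking = rank_by_intervention(sales, average)

for change, row in ranking:
    print(row, round(change, 2))
```

The tuple at the top of the ranking is the one whose removal moves the average the most, so it is the first candidate explanation for a surprising aggregate value.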

        (2) Top-k queries. A top-k query is an effective way to show users only the most important data in a data set. Because users lack the necessary knowledge of the data set, the top-k queries they submit may not express their real query intent; for example, the data they expect may not appear in the query result. To explain such cases, the why and why-not questions of a top-k query are answered by refining the original top-k query. However, this approach considers only the impact of the refinement on the original query, and does not consider its impact on the original query result.
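As a rough illustration of refinement for a why-not question on a top-k query, the sketch below changes only `k`: it finds the smallest `k` at which the expected item appears, so the refined result contains the original result plus the missing item. Refining the weights is left out here. The linear scoring function, the `hotels` data, and all names are assumptions made for this example.

```python
def score(attrs, weights):
    """Weighted sum of an item's attributes (higher is better)."""
    return sum(w * a for w, a in zip(weights, attrs))

def ranked(items, weights):
    """Item names ordered from best to worst score."""
    return sorted(items, key=lambda name: score(items[name], weights), reverse=True)

def top_k(items, weights, k):
    return ranked(items, weights)[:k]

def refine_k(items, weights, k, missing):
    """Smallest k' >= k such that `missing` is in the top-k' result."""
    rank = ranked(items, weights).index(missing) + 1
    return max(k, rank)

# Hypothetical items with (quality, value-for-money) attributes.
hotels = {"a": (0.9, 0.8), "b": (0.8, 0.7), "c": (0.4, 0.9), "d": (0.6, 0.5)}
weights = (0.5, 0.5)

original = top_k(hotels, weights, 2)
new_k = refine_k(hotels, weights, 2, "c")
refined = top_k(hotels, weights, new_k)
print(original, new_k, refined)
```

How far `new_k` is from the original `k` measures the change to the query; how many extra items enter the result measures the change to the query result. These are the two kinds of impact the paragraph above distinguishes.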

      (3) Skyline queries. Over the past few decades, skyline queries have received great attention from data researchers and have proved very valuable for multi-criteria decision support. To help users understand why certain points of interest do not appear in the query result, Chester et al. proposed the sky-not query, which explains the why and why-not questions of skyline queries. When the result of a database query does not meet the user's needs, user feedback can help uncover the true intent behind the query. Based on this, Liu et al. proposed FlexIQ, an effective and flexible interactive query explanation framework.
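For readers unfamiliar with skyline queries, a small sketch may help. It assumes that smaller values are better in every dimension, and it answers a why-not question in the simplest possible way, by listing the points that dominate the missing one. It is not the sky-not method itself. The `hotels` data and function names are made up for this example.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Names of all points that no other point dominates."""
    return {
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }

def why_not(points, name):
    """Why-not explanation: the points that dominate `name`."""
    target = points[name]
    return sorted(
        other for other, q in points.items()
        if other != name and dominates(q, target)
    )

# Hypothetical hotels described by (price, distance to the beach in km).
hotels = {"A": (50, 8), "B": (80, 2), "C": (60, 5), "D": (90, 6)}

print(skyline(hotels))
print(why_not(hotels, "D"))
```

Hotel D is missing from the skyline because B and C are both cheaper and closer. A point that is already in the skyline gets an empty explanation.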

      (4) Graph queries. With the rapid development of graph database technology, graph databases help users store data and run complex queries, but they also bring an additional cost: a query may return an empty result set or a very large one. To address the why and why-not questions of graph queries, Vasilyeva et al. introduced a new type of graph query called a difference query, which explains which part of a query graph is present in the data graph and which part is missing. They then proposed a subgraph-based method to explain why a graph query returns too many results or none at all. Query failure, that is, an empty query result, is a major problem in pattern matching query processing for graph databases.
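The idea of splitting a query graph into a present part and a missing part can be illustrated very roughly as follows. This sketch checks each query edge separately against the label triples that occur in the data graph, and it ignores whether the matched edges connect at the same vertices, so it is much coarser than real subgraph matching and is not the published method. The labels, edges, and function name are hypothetical.

```python
# Hypothetical data graph: vertex labels and typed, directed edges.
labels = {"alice": "Person", "acme": "Company", "berlin": "City"}
data_edges = {
    ("alice", "works_at", "acme"),
    ("acme", "located_in", "berlin"),
}

# Query pattern expressed as (source label, edge type, target label).
query_edges = [
    ("Person", "works_at", "Company"),
    ("Company", "located_in", "City"),
    ("Person", "lives_in", "City"),
]

def split_query(query_edges, data_edges, labels):
    """Split query edges into those present in the data graph and those missing."""
    present_types = {(labels[s], t, labels[d]) for s, t, d in data_edges}
    matched = [edge for edge in query_edges if edge in present_types]
    missing = [edge for edge in query_edges if edge not in present_types]
    return matched, missing

matched, missing = split_query(query_edges, data_edges, labels)
print("present:", matched)
print("missing:", missing)
```

Here the whole pattern returns an empty result, and the split shows that the `lives_in` edge is the part with no counterpart in the data.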

      Although the research above has addressed traceability explanation for many kinds of query requests, it still cannot meet users' needs. For example, existing results cannot be applied directly to the traceability explanation problem of graph structure clustering. Therefore, research on parameter-sensitive traceability explanation also needs to study the parameter-sensitive traceability explanation problem for graph structure clustering.

        I will keep studying why questions and will keep writing up what I learn. Thank you.



Origin blog.51cto.com/15064656/2602789