Dynamic Shape Analysis via Degree Metrics

Overview
Applications continue to grow in size and complexity, which makes debugging and program understanding more challenging. Programs written in managed languages such as Java, C#, and Ruby further exacerbate this challenge because they tend to encode most of their state in the heap. This article introduces dynamic shape analysis, which aims to characterize the data structures in the heap by dynamically summarizing object pointer relationships and detecting dynamic degree metrics based on class. The analysis identifies recursive data structures, automatically discovers dynamic degree metrics, and reports errors when the metrics are violated. Uses of dynamic shape analysis include helping programmers find data structure errors during development, generating assertions for verification by static or dynamic analysis, and detecting subtle errors in deployment. We implement dynamic shape analysis in a Java Virtual Machine (JVM). Using the SPECjvm and DaCapo benchmarks, we show that most objects in the heap are part of recursive data structures that maintain strong dynamic degree metrics. We show that once dynamic shape analysis establishes degree metrics from correct executions, it can find automatically inserted errors in subsequent executions of microbenchmarks. This suggests that it can be used in deployment to improve software reliability.

Keywords: dynamic shape analysis, degree metrics, dynamic invariants

1. Introduction
Programs written in object-oriented languages encode their state in objects. As programs allocate more and more objects in the heap, much of their semantics resides in heap-allocated data structures, so it is not surprising that data structure and concurrency errors show up there. Heap analysis can therefore help developers detect errors and specify their programs correctly.

To manage objects, programmers use conventional data structures, such as arrays and recursive data structures. A recursive data structure (RDS) forms a regular pattern of nodes and references (pointers), such that removing the references between any two nodes yields two instances of the same data structure. For example, a singly linked list with n nodes is a simple recursive data structure and can be divided into two smaller singly linked lists, one of size x and the other of size n-x. Researchers have long used static shape analysis to characterize recursive data structures in the heap based on the code that creates and manipulates them [9,10,16,28,29]. By analyzing program statements, static shape analysis detects recursive data structures and their invariants. For example, it can detect that a singly linked list has two invariants: n-1 nodes have exactly one incoming pointer, and n-1 nodes have exactly one outgoing pointer. Despite recent progress [9], static shape analysis is not widely used because it requires flow and context sensitivity, which makes it very expensive and necessarily conservative.

In this article, we introduce dynamic shape analysis, which uses periodic garbage collection to dynamically detect recursive data structures and the degree invariants they maintain during a particular program execution. Our analysis computes a class-field summary graph (CFSG), which summarizes the dynamic object graph based on class definitions. The CFSG records the number of objects of each class and, for recursive data structures, degree metrics: invariants over in-degree and out-degree. When a specific number of nodes of a data structure exhibit a specific degree, we call it a fixed metric and track the number of objects that exhibit it. For example, in a singly linked list with n nodes, exactly n-1 nodes have an out-degree equal to 1, and the last node has an out-degree equal to 0. For degree metrics that are not fixed, the CFSG records a range metric: the fraction of objects with a given degree, together with its variance.

A completely accurate dynamic shape analysis would have to analyze the heap after every pointer mutation, which is prohibitively expensive. To make the cost tractable, we piggyback on garbage collection in a tool called ShapeUp, which we added to a Java Virtual Machine (JVM). Since garbage collection is periodic and relatively infrequent, dynamic shape analysis can be made efficient. For example, ShapeUp adds 4 to 8% on average to total running time and less than 1% space overhead. Performing dynamic shape analysis infrequently trades accuracy for efficiency: ShapeUp loses accuracy because the program may violate a degree metric between collections. However, our results indicate that performing dynamic shape analysis relatively infrequently yields information accurate enough to discover many errors.

We evaluate ShapeUp by first identifying the library and custom recursive data structures in the SPECjvm and DaCapo Java benchmarks. We show that the vast majority of objects in the heap are part of recursive data structures, and that these data structures maintain dynamic degree metrics throughout execution. Although individual data structures maintain degree metrics (for example, a degree of 1), the heap as a whole does not, because of its transient nature. Using microbenchmarks, we demonstrate one application of dynamic shape analysis, in which ShapeUp uses correct executions to establish invariants and then finds errors in incorrect executions. We automatically generate incorrect executions and find that some errors cause ShapeUp to detect anomalies and report the violated metrics. These results indicate that ShapeUp may be useful for finding hard-to-detect errors that make it into deployment.

In summary, dynamic heap analysis effectively summarizes objects in the heap by class, discovers dynamic invariants, and finds violations by mining the largely regular structure of the heap from the object graph. This article begins the study of dynamic shape analysis via degree metrics by showing how degree metrics can be used to detect and report errors in microbenchmarks. It leaves to future work more precise summaries that can handle multiple instances of a single data structure, and the application to real programs. The next section discusses the potential uses of dynamic shape analysis in more detail and explains how it differs from static shape analysis.

2. Motivation
The use cases for dynamic and static shape analysis differ. Static shape analysis is a flow- and context-sensitive analysis that proves that a program constructs and manipulates recursive data structures correctly. It strives to prove that a data structure never violates certain invariants, for example, that every node in a binary tree has an in-degree of at most one. Unfortunately, there are points in the program (i.e., while it manipulates the data structure) at which the invariant is temporarily violated. To help static shape analysis, programmers specify the invariants and the points at which they expect the invariants to hold. Even with this help, static analysis must be conservative, so it cannot always prove properties even when they hold. In addition, static shape analysis remains intractable for all but modestly sized programs.

In contrast, dynamic shape analysis samples the data structures in the heap of a particular program run. It analyzes the current shape of a data structure and, given the correct invariants, can determine whether the current dynamic shape violates them. It cannot guarantee that the data structure never violates its invariants, because invariants can be violated between samples. However, dynamic shape analysis can help developers find persistent errors during early development, during testing, and after deployment. Data structure reports during development can help programmers discover obvious errors. For example, consider a developer who intends to create a doubly linked list but forgets to set the backward pointer and so creates a singly linked list. The ShapeUp invariant report will clearly show that n-1 nodes exhibit a degree of 1 instead of the expected 2.

Perhaps more interesting is detecting invariant violations during or after deployment. This article demonstrates that dynamic shape analysis can establish invariants by observing correct executions and then find anomalies, such as when a well-tested program suffers a run-time error due to a race. If an error persists, periodic analysis of the heap will detect it. In addition, information from dynamic shape analysis can enhance static analysis and verification. For example, the user can provide dynamically detected invariants as input to static or dynamic verification. Dynamic shape analysis may therefore help programmers at multiple stages of development and deployment.
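
To make the doubly linked list example above concrete, the following Java sketch builds a list with and without the backward pointer and tallies the out-degree of each node. The names DNode, BackPointerBug, build, and outDegreeHistogram are hypothetical and are not part of ShapeUp, and the tally walks the list directly instead of piggybacking on a garbage collector.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical node class for a doubly linked list. */
final class DNode {
    DNode next;
    DNode prev;
}

final class BackPointerBug {
    /** Builds an n-node list; when setPrev is false the backward pointer is "forgotten". */
    static DNode build(int n, boolean setPrev) {
        DNode head = null;
        DNode tail = null;
        for (int i = 0; i < n; i++) {
            DNode node = new DNode();
            if (head == null) {
                head = node;
            } else {
                tail.next = node;
                if (setPrev) {
                    node.prev = tail;   // the assignment the buggy version omits
                }
            }
            tail = node;
        }
        return head;
    }

    /** Maps each out-degree (number of non-null node references) to a node count. */
    static Map<Integer, Integer> outDegreeHistogram(DNode head) {
        Map<Integer, Integer> hist = new TreeMap<>();
        for (DNode cur = head; cur != null; cur = cur.next) {
            int out = (cur.next != null ? 1 : 0) + (cur.prev != null ? 1 : 0);
            hist.merge(out, 1, Integer::sum);
        }
        return hist;
    }
}
```

For n = 5, the correct list gives {1=2, 2=3}, while the list without backward pointers gives {0=1, 1=4}: n-1 nodes have an out-degree of 1 and none has an out-degree of 2.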

3. Related work

Related work includes dynamic invariant detection based on program location, static shape analysis, error detection and correction using invariant specifications, and heap analysis for C programs.

Static shape analysis has long tried to understand heap structure by analyzing code to identify recursive data structures [9, 10, 16, 28, 29]. Unfortunately, it is not widely used because it requires flow and context sensitivity, which makes it very expensive and necessarily conservative. Our analysis effectively provides the same kind of information, but the information is specific to one or more program executions because the analysis observes the current state of the heap rather than reasoning about all possible heap states. Dynamic shape analysis, like static shape analysis, can be used to generate specifications and tests.

More recently, dynamic analyses have discovered likely invariants by mining dynamic program behavior, correlating it with program locations, and then identifying anomalous executions [15, 17, 18, 21, 22, 23, 25, 31]. For example, Hangal and Lam show that anomalous behavior usually precedes a crash: the program violates one or more dynamic invariants that were established in previous executions or earlier in the current execution. They show that recording variable and condition values and reporting previously unseen values helps debugging. We show that the same hypothesis applies to the heap: the heap object graph encodes semantics, and anomalous heap relationships reveal software defects.

Recent work has also shown how to detect and correct data structure errors using programmer-specified invariants [1, 4, 11] or user-defined predicate routines [8, 14]. These approaches require the programmer to specify the properties of the data structure and then use model checking and partial evaluation to detect and repair errors that occur in the field. User-defined predicates can encode valuable information that ShapeUp will not discover, for example, which value encodes the number of nodes that should be in the data structure, and the shape of those nodes. The advantage of ShapeUp is that it is fully automatic and requires no predicate routines. It detects similar errors by automatically discovering many of the same properties that user predicates contain. Developers can also use the results of our analysis to write predicate routines for complex recursive data structures.

HeapMD checks simple heap properties in C programs [12]. Specifically, it shows that many C heaps contain stable fractions of objects whose in-degree and out-degree are zero, one, or two. Although HeapMD inspired our work, we find that the more transient nature of Java heaps and the more complex relationships in their object graphs rarely yield stable whole-heap invariants. However, by partitioning the heap by class and connectivity, we find that recursive data structures do have many stable invariants.

Jump and McKinley introduced the class points-from graph, which summarizes the objects in the heap and the pointer relationships between them by user class [20]. This heap representation helps developers find memory leaks by identifying growing parts of the graph. Our work is orthogonal and complementary. In this paper, we extend the class points-from graph with field edges to create the class-field summary graph. In addition, we use the resulting graph to characterize the data structures in the SPECjvm and DaCapo Java benchmarks, to identify the degree invariants of recursive data structures in correct program executions, and to identify erroneous data structures quickly and accurately.

4. Data structure analysis              

To manage large amounts of data, programs written in modern languages use recursive data structures. Developers implicitly and explicitly maintain invariants in these data structures and in the code that allocates and manipulates them. A recursive data structure (RDS) is a set of objects linked by references (pointers) in a regular pattern, such that any part of it consists of smaller or simpler instances of the same data structure. For example, a contiguous subset of a singly linked list is itself a singly linked list. Although the definition of the data structure is unbounded, the size of any particular RDS in the heap is bounded.

We examine the composition of the heap in terms of recursive data structures for the SPECjvm and DaCapo benchmarks and show the results in Figure 1. We separate the recursive data structures by where they are implemented. Custom data structures are implemented by the application itself. Library data structures are implemented in a library and merely used by the application. Figure 1 shows that the benchmarks we analyze use recursive data structures heavily. Although compress and mpegaudio rely strictly on arrays to process data, in the other benchmarks 91% of objects are part of an RDS, and 33% of those are contained in custom data structures.

In Java and other object-oriented languages, a recursive data structure is implemented separately from the data it contains. We therefore refine the definition of an RDS to include object class, and distinguish the objects of the class that implements the recursive backbone of the data structure. The recursive backbone defines the shape of the RDS and consists of objects of a single class that refer to other objects of the same class. For example, a tree is composed of smaller trees (subtrees), where the smallest tree is a single node. A class definition for a tree contains a node class, which holds references to other nodes and references to one or more data objects. Table 1 details the microbenchmarks we implemented to evaluate ShapeUp: a singly linked list, a doubly linked list, a binary tree, a binary tree with parent pointers, and a simplified hash map. For each data structure, the first column illustrates an example instantiation. The second column shows the definition of the node class. The third column shows the class-field summary graph (CFSG, explained in the next section), which summarizes the recursive backbone and reflects the shape of the RDS. This shape is the property that both static and dynamic shape analysis seek. Next, we describe how ShapeUp uses the garbage collector to create this summary.
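
The contents of Table 1 are not reproduced here, so the node classes below are only a guess at what such definitions might look like; every class and field name is hypothetical, and the hash map is left out. The sketch also shows one simple way to spot a candidate backbone class: look for fields whose declared type is the class itself. ShapeUp works on the classes of live objects at both ends of a reference, so this check on declared field types is a simplification.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical node classes; each keeps its payload in a separate data object.
final class ListNode { ListNode next; Object data; }
final class DoublyListNode { DoublyListNode next; DoublyListNode prev; Object data; }
final class TreeNode { TreeNode left; TreeNode right; Object data; }
final class ParentTreeNode { ParentTreeNode left; ParentTreeNode right; ParentTreeNode parent; Object data; }
final class Payload { String value; }   // plain data, no reference to its own class

final class Backbone {
    /** Returns the names of the fields of c whose declared type is c itself. */
    static List<String> backboneFields(Class<?> c) {
        List<String> names = new ArrayList<>();
        for (Field f : c.getDeclaredFields()) {
            if (f.getType() == c) {
                names.add(f.getName());
            }
        }
        return names;   // reflection does not guarantee any particular field order
    }
}
```

A class with at least one such field (for example, TreeNode with left and right) is a candidate for the recursive backbone, while Payload has none.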

4.1 Summary data structure for RDS analysis              

Dynamic shape analysis examines the heap at every garbage collection. The state of the program heap can be expressed as a directed graph G = (V, E), where V is the set of all heap-allocated objects and E is the set of references between objects in the heap. That is, if field f of object o refers to object p, the edge (o.f, p) exists in G. The in-degree of an object o is the number of other objects in the heap that reference o. The out-degree of o is the number of objects that o actually references (not the potential number specified by its class). The roots of the heap graph are the references stored in statics (global variables), stacks, and registers. A tracing garbage collector starts at these roots and finds the live objects by performing a transitive closure over all live object references in the heap. ShapeUp piggybacks on this scan and summarizes the structure of the heap in a class-field summary graph (CFSG), which describes the dynamic shape of the objects of each class.
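
The definitions above can be illustrated with a small Java sketch that models the heap graph explicitly and computes both degrees with a worklist traversal from the roots. HeapObject and HeapGraphDegrees are hypothetical names introduced for illustration; a real collector traces actual objects rather than a map-based model like this one.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a heap object: a class name plus named reference fields. */
final class HeapObject {
    final String className;
    final Map<String, HeapObject> fields = new LinkedHashMap<>();

    HeapObject(String className) {
        this.className = className;
    }

    /** Sets field f to refer to target (null models an unset reference). */
    HeapObject set(String f, HeapObject target) {
        fields.put(f, target);
        return this;
    }
}

/** Computes in-degree and out-degree for every object reachable from the (non-null) roots. */
final class HeapGraphDegrees {
    final Map<HeapObject, Integer> inDegree = new IdentityHashMap<>();
    final Map<HeapObject, Integer> outDegree = new IdentityHashMap<>();

    HeapGraphDegrees(List<HeapObject> roots) {
        Deque<HeapObject> worklist = new ArrayDeque<>();
        for (HeapObject root : roots) {
            if (inDegree.putIfAbsent(root, 0) == null) {
                worklist.add(root);
            }
        }
        while (!worklist.isEmpty()) {
            HeapObject o = worklist.poll();
            // Distinct objects actually referenced by o; null fields do not count.
            Set<HeapObject> referents =
                    Collections.newSetFromMap(new IdentityHashMap<HeapObject, Boolean>());
            for (HeapObject p : o.fields.values()) {
                if (p != null) {
                    referents.add(p);
                }
            }
            outDegree.put(o, referents.size());
            for (HeapObject p : referents) {
                if (inDegree.putIfAbsent(p, 0) == null) {
                    worklist.add(p);   // first time p is reached
                }
                if (p != o) {
                    inDegree.merge(p, 1, Integer::sum);   // only *other* objects count
                }
            }
        }
    }
}
```

For a list object whose head field points to a three-node singly linked list, the list object has in-degree 0 and out-degree 1, each node has in-degree 1, and the last node has out-degree 0.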

The CFSG summarizes the entire heap with class nodes and field edges. A class node records the total number of objects of that class in the heap. For each live object oC1 (an object o of class C1) found while the garbage collector traces the heap, ShapeUp determines the object's class C1 and increments the counter of the corresponding class node. A field edge represents each reference (o.f, p) as a directed edge between the class nodes of o and p, distinguished by the field f. To capture the shape of recursive data structures, ShapeUp uses degree metrics, which are defined over the in-degree and out-degree of object instances. Since the shape of an RDS is defined by its recursive backbone, ShapeUp tracks degree metrics only for the objects and edges that make up the recursive backbone. ShapeUp records these degrees in histograms attached to the CFSG class nodes.

By definition, a backbone edge of an RDS is an edge (oC1.f, pC2) such that C1 = C2. For such an edge, ShapeUp increments the out-degree of oC1 and the in-degree of pC2 and accumulates them in histograms, kept in the CFSG class node, that count objects by in-degree and by out-degree. Object scanning lets ShapeUp compute the in-degree and out-degree of each individual object at low cost. When the collector scans an object, its out-degree is known, so ShapeUp directly increments the out-degree histogram bin corresponding to the number of non-null references. Because the in-degree of an object instance is not fully known until the collector completes the transitive closure over the heap, the in-degree histogram is computed incrementally. Figure 2 gives pseudocode for this process, and the following section walks through an example.
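
Figure 2 is not reproduced here. As a rough stand-in, and not the authors' pseudocode, the Java sketch below accumulates the two histograms while scanning a heap that contains only objects of one backbone class. The names Node, BackboneScan, and scanFrom are hypothetical, and an identity map stands in for the per-object in-degree counter that a collector would keep in the object header.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical backbone class: every Node-to-Node reference is a backbone edge. */
final class Node {
    Node next;
    Node prev;
}

final class BackboneScan {
    final Map<Integer, Integer> inHist = new TreeMap<>();    // in-degree -> object count
    final Map<Integer, Integer> outHist = new TreeMap<>();   // out-degree -> object count
    // Stand-in for the in-degree counter in each object header; presence means "discovered".
    private final Map<Node, Integer> inDegree = new IdentityHashMap<>();

    /** Scans everything reachable from root, which is assumed to be reached by a non-backbone edge. */
    void scanFrom(Node root) {
        Deque<Node> gray = new ArrayDeque<>();
        inDegree.put(root, 0);
        bump(inHist, 0, +1);
        gray.add(root);
        while (!gray.isEmpty()) {
            Node o = gray.poll();
            int out = 0;
            for (Node child : new Node[] { o.next, o.prev }) {
                if (child == null) {
                    continue;
                }
                out++;
                Integer old = inDegree.get(child);
                if (old == null) {
                    // White child: first discovery, so it enters the in = 1 bin.
                    inDegree.put(child, 1);
                    bump(inHist, 1, +1);
                    gray.add(child);
                } else {
                    // Already discovered: move it from its old bin to the next one.
                    inDegree.put(child, old + 1);
                    bump(inHist, old, -1);
                    bump(inHist, old + 1, +1);
                }
            }
            bump(outHist, out, +1);   // the out-degree is final once o has been scanned
        }
    }

    private static void bump(Map<Integer, Integer> hist, int bin, int delta) {
        hist.merge(bin, delta, Integer::sum);
    }
}
```

Scanning a correct four-node doubly linked list from its head ends with two nodes in the in = 1 and out = 1 bins and two nodes in the in = 2 and out = 2 bins; the in = 0 bin is emptied as soon as the second node is scanned.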

Figure 1. Recursive data structure in the heap 

 

Figure 2. Pseudocode for accumulating in / out degree metrics during object scanning. 

4.2 Step-by-step example

Figure 3 shows a step-by-step example of constructing the CFSG for the recursive backbone of a doubly linked list. In the figure, instances of the same class have the same shape (i.e., the doubleLinkedList object is a square and the nodes are circles). The left side of the figure shows the heap graph during the garbage collector's object scan. Following the tri-color abstraction [13], black objects have been scanned by the collector, gray objects are queued for scanning, and white objects have not yet been discovered. The corresponding CFSG is shown on the right. We walk through the figure step by step:

Table 1. Data structures in the microbenchmarks, showing an example heap graph, the implementation, and the corresponding CFSG.

Figure 3. Building the CFSG step by step. A histogram bin is written in = k (#), where # is the number of objects in the bin.

Step 1: We start after the collector has scanned the doubleLinkedList object (black). Above each object, the figure shows the degree of that object instance. Here, both objects are marked with †, indicating that neither has been identified as a recursive backbone object. In the CFSG, the doubleLinkedList class node shows one instance, with an edge from it to the node class. Note that the number of node objects in the CFSG is zero, because the node object has only been queued (gray), not scanned. No recursive backbone objects have been identified yet.

Step 2: The collector scans the first node object, which defines the recursive backbone. The CFSG captures the recursive backbone with a self-edge that represents the next field of the node, and increments the in-degree histogram for this first object (in = 0 (1)). We use separate histograms to track in-degree and out-degree in the CFSG node. Since the out-degree is known, the corresponding histogram bin is incremented (out = 1 (1)). The in-degree must be computed incrementally, which requires a byte in the object header to hold the in-degree of each instance. As the collector scans each object, it examines the object's children and increments the in-degree in each child's header. For each gray child that is already queued for scanning, ShapeUp decrements the histogram bin corresponding to the child's previous in-degree and increments the bin corresponding to its new in-degree. For each white child, ShapeUp only increments the bin for in = 1 (in = 1 (1)).

Step 3: Scanning the second node object adds a second self-edge to the CFSG, representing the previous pointer in the node, and adds out = 2 to the CFSG node. The in-degree of each child is updated incrementally (in = 0 (0), in = 1 (3)).

Steps 4-5: The scan continues through each node of the doubly linked list.

Step 6: Finally, the collector processes the last node of the data structure. At this point, ShapeUp has captured the shape of the entire data structure in the CFSG.

The bottom of Figure 3 shows the most general form of the CFSG for a correct doubly linked list. In summary, n−2 objects have in-degree and out-degree equal to 2, and 2 objects (the head and the tail) have in-degree and out-degree equal to 1. We call these the fixed metrics of the data structure. The third column of Table 1 shows the general CFSGs for several common data structures. At the end of a collection, the CFSG completely summarizes the number of objects of each class and the number of objects with each in-degree and out-degree at the time of the collection.

Finally, we add a phase at the end of garbage collection in which we merge the current CFSG into the cumulative CFSG for the program. For each degree metric (in = 0, in = 1, in = 2, and so on), we summarize the average percentage of objects that exhibit it. For a single data structure, if exactly 0, 1, 2, n, n−1, or n−2 objects exhibit a given degree, we record a fixed metric. Otherwise, we record a range metric: the percentage of objects that ShapeUp observed with that degree over one or more executions. We found that fixed metrics are very sensitive to violations when anomalies are introduced, whereas range metrics are more forgiving. In the next section, we evaluate the CFSG and show how to use it to detect errors in recursive data structures.
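
One way to read the fixed versus range distinction is sketched below in Java. The rule used here is an assumption made for illustration: a metric counts as fixed only if the same form (0, 1, 2, n, n−1, or n−2) matches the observed count in every sample, and otherwise the mean and variance of the observed fractions are reported. The names DegreeMetric, Form, fixedForm, meanFraction, and variance are hypothetical, and each sample is a pair {count, n} with n > 0.

```java
import java.util.Optional;

final class DegreeMetric {
    /** Candidate fixed forms, tried in this order. */
    enum Form { ZERO, ONE, TWO, N, N_MINUS_1, N_MINUS_2 }

    /** Number of objects a form predicts for a data structure with n objects. */
    static long expected(Form form, long n) {
        switch (form) {
            case ZERO: return 0;
            case ONE: return 1;
            case TWO: return 2;
            case N: return n;
            case N_MINUS_1: return n - 1;
            default: return n - 2;
        }
    }

    /** Returns the first form that matches every sample {count, n}, if there is one. */
    static Optional<Form> fixedForm(long[][] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        for (Form form : Form.values()) {
            boolean matchesAll = true;
            for (long[] s : samples) {
                if (s[0] != expected(form, s[1])) {
                    matchesAll = false;
                    break;
                }
            }
            if (matchesAll) {
                return Optional.of(form);
            }
        }
        return Optional.empty();
    }

    /** Mean fraction count / n over all samples (the range metric). */
    static double meanFraction(long[][] samples) {
        double sum = 0.0;
        for (long[] s : samples) {
            sum += (double) s[0] / s[1];
        }
        return sum / samples.length;
    }

    /** Population variance of the per-sample fractions. */
    static double variance(long[][] samples) {
        double mean = meanFraction(samples);
        double sum = 0.0;
        for (long[] s : samples) {
            double diff = (double) s[0] / s[1] - mean;
            sum += diff * diff;
        }
        return sum / samples.length;
    }
}
```

With samples {2, 4}, {4, 6}, and {8, 10}, only the n−2 form matches every sample, so the metric is fixed. With samples {3, 10} and {5, 10}, no form matches, and the range metric has a mean fraction of 0.4. A single sample can match more than one form (2 of 4 objects fits both 2 and n−2), which is why the sketch requires agreement across all samples.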

5. Experimental methodology 

We implemented ShapeUp in MMTk, the memory management toolkit in Jikes RVM version 2.9.1 [2, 3]. MMTk implements a number of high-performance collectors [5, 6]. We use the full-heap mark-sweep and generational mark-sweep collectors and measure performance on the SPECjvm [30] and DaCapo v.06-10-MR2 [7] benchmark suites. We use a configuration that precompiles as much as possible, including key libraries and the optimizing compiler (the Fast build-time configuration), and turn off assertion checking. In addition, we apply replay compilation to eliminate the nondeterministic behavior of the adaptive compilation system [19].

Overhead: We configure ShapeUp to perform dynamic shape analysis during the natural garbage collections that are triggered when the heap is full. Following best practice [7], we test a range of heap sizes proportional to the live memory size of each benchmark, from the minimum heap size in which the program can execute up to 6 times that minimum. A common practice in production systems is to choose a heap size of approximately twice the minimum. The current implementation of the CFSG increases the total time of the benchmarks by 4.6% on average with the full-heap collector (at most 39%, for pmd) and by 8.4% on average with the generational collector (at most 92%, for hsqldb), while increasing the program's space requirements by less than 1%. For brevity, we omit the detailed results. We believe that performance tuning can reduce this overhead further. If users can afford more overhead and want more samples for debugging, for example during testing or development, ShapeUp can analyze the heap more frequently. Users can also obtain a sample at a specific point of interest by inserting a call to System.gc() in the source code.
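
As a small illustration of that last point, the Java sketch below requests a collection while a data structure of interest is still live. SamplePoint, buildStructure, and runPhase are hypothetical names. On a standard JVM, System.gc() is only a request that the virtual machine may ignore, so whether a collection (and any analysis attached to it) actually runs at that point depends on the VM and its configuration.

```java
import java.util.ArrayList;
import java.util.List;

final class SamplePoint {
    /** Builds some structure whose shape we would like to sample. */
    static List<Integer> buildStructure(int n) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            data.add(i);
        }
        return data;
    }

    static int runPhase(int n) {
        List<Integer> data = buildStructure(n);
        // Point of interest: ask for a collection while `data` is still reachable,
        // so that a GC-piggybacked analysis, if present, can observe it.
        System.gc();
        return data.size();
    }
}
```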

7. Conclusion

Programmers are increasingly challenged by the size and complexity of the programs they create. As programs allocate more and more objects in the heap, heap analysis becomes critical for program understanding and debugging. This article presents a dynamic shape analysis tool that characterizes the shape of recursive data structures through a very low-cost summary of the heap graph, the class-field summary graph (CFSG). The CFSG completely summarizes the number of objects of each class and the references between them by field. Degree metrics on recursive backbone objects capture the shape of recursive data structures in the form of dynamic invariants. We evaluate ShapeUp by characterizing the recursive data structures in the SPECjvm and DaCapo benchmarks. We show that the vast majority of objects in the heap are part of a recursive data structure. Although degree metrics summarized over the entire heap are not sufficient to understand the behavior of most Java programs, we show that the degree metrics of individual recursive data structures hold throughout execution. We demonstrate how to use dynamically discovered degree metrics to find erroneous executions of microbenchmarks, and show that for some data structures a single error is sufficient to trigger a violation reported by ShapeUp.

Future work should: (1) explore more precise summaries of degree metrics, including lightweight techniques for separating multiple instances of a recursive data structure [27]; (2) consider data objects in addition to the recursive types; and (3) evaluate ShapeUp on real applications. This paper shows that dynamic heap analysis can effectively classify and summarize the heap in order to mine degree metrics from its regular structure, discover dynamic invariants, and find violations.


Origin blog.csdn.net/zhang14916/article/details/89819339