A large-scale study on the usage of Java’s concurrent programming constructs

Original address:

https://www.sciencedirect.com/science/article/abs/pii/S0164121215000849?via%3Dihub

 

A Large-Scale Study on the Usage of Java’s Concurrent Programming Constructs

Gustavo Pinto, Weslley Torres, Benito Fernandes, Fernando Castor, Roberto SM Barros
{ ghlp, wst, jbfan, castor, roberto } @cin.ufpe.br
Informatics Center, Federal University of Pernambuco (CIn-UFPE), Av. Jornalista Anibal Fernandes, S / N, Recife-PE, 50.740-560, Brazil.

                                                                                                                                                          

Abstract

 

In both academia and industry, there is a strong belief that multicore technology will fundamentally change the way software is built. However, little is known about the current state of use of concurrent programming constructs. In this work, we present an empirical study of the usage of concurrent programming constructs in 2227 real-world, stable, and mature Java projects from SourceForge. We studied the use of concurrency techniques in the most recent versions of these applications and how that usage has evolved over time. The main results of our research are: (i) more than 75% of the latest versions of the projects explicitly create threads or use some kind of concurrency control mechanism; (ii) more than half of the projects have at least 47 synchronized methods and 3 implementations of the Runnable interface per 100 KLoC, which means that concurrent programming constructs are not only used often but are also used intensively; (iii) the adoption of the java.util.concurrent library is only moderate (approximately 23% of concurrent projects use it); (iv) efficient and thread-safe data structures such as ConcurrentHashMap are not yet widely used, although they have many advantages.

Key words:

Java, concurrency, software evolution, OSS

                                                                                                                                                                                    

1. Introduction

        Multicore systems offer the potential for inexpensive, scalable, high-performance computing with significant reductions in energy consumption. To realize this potential, it is necessary to exploit new heterogeneous architectures comprising collections of multiple processing elements. To take advantage of multicore technology, applications must be concurrent, which poses a challenge because concurrent programming is notoriously difficult (Sutter, 2005). Many programming languages provide constructs for concurrent programming. These solutions differ greatly in terms of abstraction, error-proneness, and performance. The Java programming language is especially rich when it comes to concurrent programming constructs. For example, it includes the concept of monitors, a low-level mechanism that supports both mutual exclusion and condition-based synchronization, as well as a high-level library (Lea, 2005), java.util.concurrent. The latter, also known as j.u.c., was introduced in version 1.5 of the language.

        In both academia and industry, there is a strong belief that multicore technology will fundamentally change the way software is built. However, to the best of our knowledge, reliable information is lacking on the current state of concurrent software development practices in terms of the structures developers use. In this work, we aim to partially fill this gap.

        Specifically, we present an empirical study aimed at establishing the current state of practice in the use of concurrent programming constructs in Java applications. We analyzed 2227 stable and mature Java projects from SourceForge, one of the most popular open source code repositories, comprising more than 600 million lines of code (LoC, excluding blank lines and comments). Our analysis covered several versions of these applications and was based on more than 50 automatically collected source code metrics. We also studied the correlations between these metrics in an attempt to find trends in the use of concurrent programming constructs. We chose Java because it is a widely used object-oriented programming language. As noted above, it also supports multithreading through both low-level and high-level mechanisms. Finally, it is the language with the highest number of projects on SourceForge.

        Evidence of how concurrent programs are written in practice can increase developers' awareness of the available mechanisms. It can also show how some well-accepted mechanisms are put into practice. Furthermore, it can tell researchers who design new mechanisms about the kinds of constructs that developers might prefer to use. Tool vendors can also benefit by helping developers adopt lesser-known but more efficient mechanisms, for example, by implementing refactorings (Dig, Marrero, Ernst, 2009, Ishizaki, Daijavad, Nakatani, 2011 and Schaifer, Sridharan, Dolby, Tip, 2011a). Finally, the findings of this study can help lecturers argue more convincingly to students that concurrent programming is important, not only for the future of software development, but also for the present.

        Based on the data obtained, we answer the following research questions (RQs).

        RQ1: Do Java applications use concurrent programming constructs?

        We found that more than 75% of the most recent versions of the examined projects include some form of concurrent programming, i.e., at least one occurrence of the synchronized keyword. For medium projects (20001 - 100000 LoC) this percentage rises to more than 90%, and it reaches 100% for large projects (over 100000 LoC). Furthermore, the average numbers (per 100,000 LoC) of synchronized methods, classes extending Thread, and classes implementing Runnable are 66.75, 13, and 13.85, respectively. These results indicate that projects use concurrent programming constructs both frequently and intensively¹. On the other hand, and perhaps counterintuitively, the overall percentage of concurrent projects has not changed significantly over the years despite the prevalence of multicore machines.

                                                                                                                                                          

¹ Throughout the paper, we often use the terms "frequently" and "intensively". We use the first to refer to constructs that are used by many different projects, and we use "often" as a synonym for "frequently". We use "intensively" to refer to constructs that are used many times within a single project. For example, synchronized methods are used both frequently and intensively because most projects use this construct, and most of those projects use it many times.

                                                                                                                                                          

        RQ2: Are developers moving to library-based concurrency?

        Our data shows that only 23.21% of the analyzed concurrent projects use classes from the java.util.concurrent library. On the other hand, adoption of this library is growing. However, with some exceptions, this growth does not seem to be related to a decrease in the use of Java's traditional concurrent programming constructs. Also, the library is more common among projects that are still under active development, for example, those with at least one release since 2009. Therefore, among mature projects that are still active, the percentage using the library is higher than 23.21%.

        RQ3: How do developers protect shared variables from concurrent threads?

        Most projects use synchronized blocks and methods. The volatile modifier, explicit locks (including variations such as read-write locks), and atomic variables are less common, although some of them seem to be gaining popularity. We also noticed a growing trend in the use of synchronized blocks.
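
        The three traditional mechanisms named in this answer can be contrasted in a few lines. The sketch below is only illustrative: the class SharedCounter and all of its members are hypothetical names, not code from any analyzed project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; none of these names come from the study.
class SharedCounter {
    private int hits;                                     // guarded by "this"
    private final Object logLock = new Object();
    private final List<String> log = new ArrayList<>();  // guarded by logLock
    private volatile boolean closed;                      // visibility only, no compound updates

    // Synchronized method: the whole body runs while holding the monitor of "this".
    synchronized void hit() {
        hits++;
    }

    synchronized int hits() {
        return hits;
    }

    // Synchronized block: only the critical section is locked, on a private lock object.
    void record(String event) {
        String line = event.trim();   // work that needs no lock stays outside the block
        synchronized (logLock) {
            log.add(line);
        }
    }

    int logSize() {
        synchronized (logLock) {
            return log.size();
        }
    }

    // volatile: a write by one thread is visible to subsequent reads by other threads.
    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }

    // Starts `threads` workers that each call hit() and record() `perThread` times.
    static SharedCounter runDemo(int threads, int perThread) {
        SharedCounter counter = new SharedCounter();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.hit();
                    counter.record(" event ");
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        counter.close();
        return counter;
    }

    public static void main(String[] args) {
        SharedCounter c = runDemo(4, 1000);
        System.out.println(c.hits() + " hits, " + c.logSize() + " log entries, closed=" + c.isClosed());
    }
}
```

        Note that volatile is used here only for a simple flag; it would not make hits++ safe, because an increment is a read followed by a write.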

        RQ4: Do developers still use the java.lang.Thread class to create and manage threads?

        We found that implementing the Runnable interface is the most common way to define new threads. In addition, a considerable number of projects use executors to manage thread execution (11.14% of concurrent projects). Projects that adopt executors exhibit a weak trend toward reducing the number of classes that explicitly extend Thread.

        RQ5: Do developers use thread-safe data structures?

        We found that developers still use Hashtable and HashMap, although the former is thread-safe but inefficient and the latter is not thread-safe at all. Nonetheless, in many projects there is a tendency to use ConcurrentHashMap in place of other associative data structures.

        RQ6: How often do developers use condition-based synchronization?

        A large number of concurrent projects include calls to the notify(), notifyAll(), or wait() methods. We also noticed that a small number of projects have eliminated many uses of these methods in favor of the CountDownLatch class, which is part of the java.util.concurrent library. This number is not large enough for statistical analysis. However, it shows that mechanisms with simple semantics, such as CountDownLatch, have the potential to replace low-level, more traditional ones in some cases.
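
        To make the low-level idiom concrete, here is a minimal sketch of a one-shot gate written with wait() and notifyAll(). The class Gate and its methods are hypothetical names chosen for this illustration; the point is the amount of care (a guarded loop, a shared flag, a monitor) that the traditional style requires.

```java
// Hypothetical one-shot gate built on the intrinsic monitor methods wait() and notifyAll().
class Gate {
    private boolean isOpen;   // guarded by "this"

    // Blocks the caller until openGate() has been called.
    synchronized void awaitOpen() throws InterruptedException {
        while (!isOpen) {     // the loop guards against spurious wakeups
            wait();
        }
    }

    synchronized void openGate() {
        isOpen = true;
        notifyAll();          // wake every thread waiting on this monitor
    }

    // Starts `waiters` threads that block on the gate, opens it, and returns how many passed.
    static int runDemo(int waiters) {
        Gate gate = new Gate();
        int[] passed = new int[1];
        Object passedLock = new Object();
        Thread[] threads = new Thread[waiters];
        for (int i = 0; i < waiters; i++) {
            threads[i] = new Thread(() -> {
                try {
                    gate.awaitOpen();
                    synchronized (passedLock) {
                        passed[0]++;
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        gate.openGate();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (passedLock) {
            return passed[0];
        }
    }

    public static void main(String[] args) {
        System.out.println("threads that passed the gate: " + runDemo(3));
    }
}
```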

        RQ7: Are developers trying to catch exceptions that might cause sudden thread failures?

        Our data shows that fewer than 3% of concurrent projects implement the interface for handling uncaught exceptions in threads. This means that, in 97% of concurrent projects, an exception stemming from a programming error can cause threads to die silently, potentially affecting the behavior of the threads that interact with them. Furthermore, by analyzing these implementations, we found that even when developers implement a handler, they often do not know what to do with an uncaught exception in a thread. This provides some indication that a new exception handling mechanism that explicitly addresses the needs of concurrent applications is called for.
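
        The text above does not name the interface. In standard Java the mechanism for this purpose is Thread.UncaughtExceptionHandler, and the sketch below assumes that this is the one meant. The class HandlerDemo and the thread name are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; it assumes Thread.UncaughtExceptionHandler is the interface in question.
class HandlerDemo {
    // Runs a task that fails on a worker thread and returns the exception seen by the handler.
    static Throwable runFailingWorker() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("programming error in worker");
        }, "worker-1");
        // Without a handler, the default behavior is only a stack trace on stderr;
        // the threads that interact with this one are not told that it died.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("handler received: " + runFailingWorker());
    }
}
```

        A real handler would typically log the failure or notify a supervisor; recording the exception in a variable is only the simplest way to show that the handler ran.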

        To gain a basic intuition about what developers believe to be true about the usage of concurrent programming constructs, we also conducted a survey with more than 160 software developers. These developers are committers on projects whose source code we analyzed. Respondents were asked various questions, such as "Which do you think is the most frequently used concurrent/parallel programming construct in the Java language?". In this paper, we compare the survey findings with the data obtained by analyzing the Java source code.

        This work makes the following contributions:

        It is the first large-scale study on the usage of concurrent programming constructs in the Java language, including an analysis of how the use of these constructs has evolved over time.

        It provides a substantial body of data on the current state of practice in real concurrent projects and on how that practice has evolved over time.

        It presents the findings of a survey of committers to some of the analyzed projects. This survey provides an overview of developers' perceptions about the use of concurrent programming constructs.

        The rest of the paper is organized as follows: Section 2 provides some background on concurrent programming in Java. Section 3 describes our survey setup and some preliminary results. Next, in Section 4, we describe the infrastructure we use to download and extract data for analysis. In Section 5, we present the research results in terms of the research questions. We then present the threats to the validity of this work in Section 6 and some implications in Section 7. Section 8 is dedicated to related work. Finally, in Section 9, we present our conclusions and discuss future directions.

2. Background

        Before presenting our study, we provide a brief background on concurrent programming. A detailed introduction to concurrent programming concepts is available elsewhere (Tanenbaum, 2008).

        In general, processes and threads are the main abstractions for concurrent programming. A process is a container that keeps all the information needed to run a program, for example, the memory locations that the process can read from and write to. A thread, on the other hand, can be thought of as a lightweight process. There are different implementations of threads, and threads and processes differ from each other: multiple threads can exist within the same process and share its data, whereas different processes do not share these resources. Additionally, threads can share the code of the program. This sharing is a double-edged sword, since it comes at the cost of well-known concurrency bugs such as race conditions.

        However, one of the main reasons to use threads is that, since threads do not own resources of their own, they are easier and faster to create than processes. For example, creating a thread can be 100 times faster than creating a process (Tanenbaum, 2008).

        On a single processor, multithreading usually occurs through time-division multiplexing: the processor switches between different threads. These context switches usually happen so quickly that the end user perceives the threads as running simultaneously. On a multiprocessor or multicore system, threads or tasks actually run at the same time, with each processor or core running a particular thread. The number of threads running simultaneously is limited by the number of processors available.

        Concurrent programming has been an exciting area of research over the past decade. Although there is no consensus on a single model of concurrency, many advances have been made with the development of various competing models (Burckhardt, Baldassin, Leijen, 2010 and Yi, Sadowski, Flanagan, 2011). Moreover, regardless of the concurrency model, many researchers (Dig, Marrero, Ernst, 2009, Goetz, Peierls, Bloch, Bowbeer, Holmes, Lea, 2006, Ishizaki, Daijavad, Nakatani, 2011 and Okur, Dig, 2012) believe that high-level concurrency libraries can improve software quality.

        The java.util.concurrent library is designed to simplify the development of concurrent applications in the Java language. Using this framework, even an inexperienced programmer can write working concurrent applications. The library provides features that simplify common concurrent programming tasks, and it is also optimized for performance. Below we discuss some of its best-known constructs. We assume the reader is familiar with the Java programming language and with the basic concepts of concurrent programming, such as locks, mutual exclusion, and synchronization. The java.util.concurrent library contains some constructs, such as semaphores and exchangers, that we do not discuss in this article because they are rarely used. For example, we found that the Semaphore class was never used in the analyzed projects.

        Locks: Implementations of the Lock interface, such as ReentrantLock, support more flexible locking than is possible with synchronized methods and blocks. They allow more flexible structuring, may have quite different properties, and may support multiple associated Condition objects (Condition is an interface that defines condition variables associated with locks). A lock is a tool for controlling access to a shared resource by multiple threads. In general, a lock provides exclusive access to a shared resource: only one thread at a time can acquire the lock, and every access to the shared resource requires that the lock be acquired first. However, some locks allow concurrent access to a shared resource, such as the read lock of a ReadWriteLock. Lock implementations also provide functionality beyond that of synchronized methods and blocks, such as a non-blocking attempt to acquire a lock (tryLock()) and an attempt to acquire a lock that can be interrupted.
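
        As a rough illustration of the difference between a blocking acquire and tryLock(), consider the sketch below. The Account class and its methods are hypothetical; only Lock, ReentrantLock, lock(), tryLock(), and unlock() are part of java.util.concurrent.locks.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical account class used only to illustrate explicit locks.
class Account {
    private final Lock lock = new ReentrantLock();
    private long balance;   // guarded by lock

    Account(long initialBalance) {
        balance = initialBalance;
    }

    // Blocking acquire; the lock is always released in a finally block.
    void deposit(long amount) {
        lock.lock();
        try {
            balance += amount;
        } finally {
            lock.unlock();
        }
    }

    // Non-blocking attempt: returns false immediately if another thread holds the lock.
    boolean tryWithdraw(long amount) {
        if (!lock.tryLock()) {
            return false;
        }
        try {
            if (balance < amount) {
                return false;
            }
            balance -= amount;
            return true;
        } finally {
            lock.unlock();
        }
    }

    long balance() {
        lock.lock();
        try {
            return balance;
        } finally {
            lock.unlock();
        }
    }

    // Holds the lock in the calling thread while a second thread attempts tryWithdraw().
    static boolean contendedAttempt() {
        Account account = new Account(100);
        boolean[] result = new boolean[1];
        account.lock.lock();
        try {
            Thread other = new Thread(() -> result[0] = account.tryWithdraw(10));
            other.start();
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            account.lock.unlock();
        }
        return result[0];
    }

    public static void main(String[] args) {
        Account account = new Account(100);
        account.deposit(50);
        System.out.println("withdraw 120 succeeded: " + account.tryWithdraw(120));
        System.out.println("contended attempt succeeded: " + contendedAttempt());
    }
}
```

        With synchronized, the second thread in contendedAttempt() would simply block until the monitor is released; tryLock() lets it give up and do something else instead.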

        Atomic data types: These data types are provided by a small toolkit of classes that support lock-free, thread-safe programming on single variables. Essentially, the java.util.concurrent.atomic package extends the notion of volatile values, fields, and array elements to ones that also provide an atomic conditional update operation through the compareAndSet() method. This method atomically sets a variable to the value of its second parameter if the variable's current value is equal to the first parameter, and returns true if it succeeds. The classes in this package also contain methods to unconditionally get and set values, and to atomically increment and decrement the value of variables. Examples of classes in this package are AtomicBoolean, AtomicInteger, and AtomicIntegerArray.
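
        The behavior of compareAndSet() is easiest to see in a retry loop. The sketch below uses the real AtomicInteger API; the class CasDemo and the method updateMax are hypothetical names for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class illustrating compareAndSet() on an AtomicInteger.
class CasDemo {
    // Lock-free "raise the maximum": retry until the update succeeds or becomes unnecessary.
    static void updateMax(AtomicInteger max, int candidate) {
        while (true) {
            int current = max.get();
            if (candidate <= current) {
                return;                       // nothing to do
            }
            if (max.compareAndSet(current, candidate)) {
                return;                       // we won the race
            }
            // Another thread changed the value in between; read it again and retry.
        }
    }

    public static void main(String[] args) {
        AtomicInteger value = new AtomicInteger(5);
        System.out.println(value.compareAndSet(5, 7));   // expected value matches: updated
        System.out.println(value.compareAndSet(5, 9));   // expected value is stale: not updated
        System.out.println(value.incrementAndGet());     // unconditional atomic increment

        AtomicInteger max = new AtomicInteger(0);
        updateMax(max, 42);
        updateMax(max, 17);
        System.out.println(max.get());
    }
}
```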

        Concurrent collections: A set of collections designed to be used in a multi-threaded environment. This set includes ConcurrentHashMap, CopyOnWriteArrayList, CopyOnWriteArraySet, and ArrayBlockingQueue. The "Concurrent" prefix used for some classes in this package is shorthand for several differences from similar "synchronized" classes, which are governed by a single lock. For example, the Hashtable class and Collections.synchronizedMap(...) are synchronized, but ConcurrentHashMap is "concurrent". Concurrent collections are thread-safe, but are not governed by a single lock. ConcurrentHashMap, in particular, safely allows any number of concurrent reads and a tunable number of concurrent writes.
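
        A small sketch of what "concurrent" buys in practice: several threads update the same ConcurrentHashMap without any external lock, relying on its atomic merge() operation. The class WordCount and its method are hypothetical names for this illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical word counter; every worker thread counts the same word list once.
class WordCount {
    static Map<String, Integer> count(String[] words, int threads) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (String word : words) {
                    // merge() is atomic on ConcurrentHashMap, so no external lock is needed.
                    counts.merge(word, 1, Integer::sum);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(new String[] {"a", "b", "a"}, 4));
    }
}
```

        The same loop over a plain HashMap would be a data race, and over a Hashtable or a synchronized map the read-then-write update would still need an explicit lock around it to be atomic.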

        Synchronizers: java.util.concurrent also provides some classes that can replace the wait() and notify() methods. CountDownLatch is a synchronization aid that allows one or more threads to wait until a set of operations being performed by other threads has completed. CyclicBarrier is another synchronization aid. It allows a group of threads to wait for each other to reach a common barrier point.
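
        A minimal sketch of the CountDownLatch pattern described above, in which one thread waits for a set of workers to finish. The class LatchDemo is a hypothetical name; CountDownLatch, countDown(), and await() are the real API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: the caller waits until every worker has counted down.
class LatchDemo {
    static int runWorkers(int workers) {
        CountDownLatch done = new CountDownLatch(workers);
        AtomicInteger finished = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                finished.incrementAndGet();   // stands in for the worker's real task
                done.countDown();             // signal completion
            }).start();
        }
        try {
            done.await();                     // blocks until the count reaches zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return finished.get();
    }

    public static void main(String[] args) {
        System.out.println("workers finished: " + runWorkers(3));
    }
}
```

        Compared with a hand-written wait()/notifyAll() protocol, there is no shared flag, no guarded loop, and no monitor to manage.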

        Executors: The Executor interface, together with its subinterfaces and implementations, supports multiple ways of managing thread execution. They provide a framework for asynchronous task execution. ExecutorService manages the queuing and scheduling of tasks and allows controlled shutdown. The ExecutorService interface and its implementations provide methods to asynchronously execute any function expressed as a Callable, the result-bearing analog of Runnable. The ScheduledExecutorService subinterface adds support for delayed and periodic task execution. A Future returns the result of a function, allows checking whether execution has completed, and provides a means to cancel execution. The implementations provide tunable, flexible thread pools. The Executors class provides factory methods for the most common kinds and configurations of executors, as well as some utility methods for using them².
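
        The pieces named in this paragraph fit together as in the following sketch: tasks are expressed as Callable objects, submitted to an ExecutorService created by an Executors factory method, and their results are collected through Future objects. The class ExecutorDemo and the method sumOfSquares are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class: computes 1^2 + 2^2 + ... + n^2 with one task per term.
class ExecutorDemo {
    static long sumOfSquares(int n, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final long value = i;
                Callable<Long> task = () -> value * value;   // result-bearing task
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get();                        // blocks until the task is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();                                  // controlled shutdown
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10, 4));
    }
}
```

        No class here extends Thread or calls start(); thread creation and reuse are left entirely to the pool.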

3. Survey

        We conducted a survey of programmers to gather information on how developers perceive the usage of concurrent programming constructs in Java. Using this information, we can check whether the intuitions of these developers are reflected in the source code of real systems. The questionnaire design followed the guidelines suggested by Kitchenham and Pfleeger (2008), who specify the following stages: planning, creating the questionnaire, defining the target audience, evaluating, conducting the survey, and analyzing the results. First, we defined the topics of interest: respondents' experience with concurrent programming and how familiar they are with it. Finally, we asked questions directly about the state of use of concurrent programming techniques. The questionnaire had 9 questions and was structured to limit responses to multiple-choice and Likert-scale answers (on a scale from 0 to 10, where 0 means no knowledge and 10 means expert). It also includes one question (#9) that respondents could answer in free text.

        After defining all the questions, we obtained feedback iteratively and used it to clarify and rephrase some questions and their explanations. This feedback came from a group of experts and from the analysis and discussion of a pilot survey. Together with the questionnaire instructions, we included some simple examples to clarify our intent. Table 1 presents the questions of the questionnaire. The complete list of questions and all survey responses are available on the companion website³.

                                                                                                                                                           

² In this article, we often use the term "executor construct". We use it to refer to classes related to the Executor framework, such as Executor, ExecutorService, ScheduledExecutorService, Executors, etc.

³ http://www.cin.ufpe.br/~groundhog

                                                                                                                                                           

        Our target population consists of programmers who have committed to at least one of the analyzed open source projects. The analyzed projects are hosted on SourceForge, which uses Subversion as its default version control system. However, Subversion does not necessarily track the email addresses of commit authors. For example, an anonymous author id or a pseudonym could be used; in fact, a pseudonym is more common than an email address. Another problem is that the repositories of earlier SourceForge projects were hosted outside SourceForge, which made it difficult to track them for a large number of projects. Therefore, to collect the email addresses of these programmers, we investigated projects that had moved to GitHub, because it makes it easier to find the email addresses of committers. We found 72 projects that had moved to GitHub.

        Among these projects, we found 2353 unique email addresses, but only 1953 of them were valid. When we sent the survey to these programmers, 273 emails were rejected by the server with an unknown-domain notification and another 18 triggered automatic reply messages. Over a 20-day period, we obtained 164 responses, a 9.75% response rate. This response rate is almost double the response rate typically found in surveys in the field of software engineering (Kitchenham and Pfleeger, 2008). Table 2 summarizes the survey data.

        The table shows that 26% of the respondents have more than 12 years of software development experience. On average, the respondents consider themselves to have moderate experience in concurrent programming (a value of 6 on a scale from 0 to 10, where 0 means no knowledge and 10 means expert). In their experience, the five most commonly used concurrent programming constructs are those that have been available since the first version of the Java language. Furthermore, on average, they believe that half of the open source projects use at least one of the basic concurrency constructs and that 30% of the projects use the java.util.concurrent library.

        Additionally, 53 percent of respondents indicated that they have used concurrent programming techniques to improve the performance and/or scalability of applications. One of the anonymous respondents described in detail how difficult it is to write correct concurrent programs and achieve performance improvements:

Concurrency is tough on many levels: actually parallelizing code, avoiding potential deadlocks, and so on. If not all developers on the project are disciplined, it is also easy to slip in practices such as the proper use of locks, and so create fragile concurrent code. The Java constructs help with the details, but the main burden still falls on the programmer to thoroughly understand concurrency and its implications. Java is also a language with many gotchas (such as long not being guaranteed to be atomic at all times in all environments). The java.util.concurrent utilities can help, but the problems can only be avoided if the programmer understands them and knows the methods and tools to use. Newer JLS versions have quelled some of the problems (e.g., if I remember correctly, you can now count on all statements having completed before a constructor returns, which was not the case before), but the language's idiosyncrasies still place a big burden on developers: to know all the traps or to develop an all-out defense.

In the remainder of this paper, we discuss the main findings of the survey based on the seven research questions mentioned in Section 1.

4. Study Setup

        This section presents the configuration of our study: our basic assumptions, our mining infrastructure, and the suite of metrics we employ. We built a set of tools that download projects from SourceForge, analyze their source code, and collect metrics from them. It comprises a crawler, a metrics collection tool, and some helper shell scripts. We call this set of tools our infrastructure. Figure 1 depicts it. Initially, the crawler populates the project repository with Java projects from SourceForge, including their various versions.

        We obtain the projects by means of HTTP requests, without directly accessing each project's source code repository. We use this approach because we are only concerned with analyzing project releases, that is, stable versions that are available to the general public. Source code repositories often do not mark releases clearly, and their conventions are inconsistent. SourceForge, on the other hand, makes it relatively easy to obtain releases via HTTP requests.

        Once all projects have been downloaded, all compressed files are extracted into our local repository. We are currently able to decompress zip, rar, tar, gz, tgz, bz2, tbz, tbz2, bzip2, and 7z files. After that, the metrics collection tool parses the source code, collects metrics, and stores the results in a metrics repository. Finally, it generates a CSV file that serves as input for statistical analysis with R (Ihaka and Gentleman, 1996).

        The crawler is an extension of crawler4j⁴, an open source web crawler framework. This framework is multi-threaded and written in Java. We also implemented additional scripts to organize project versions by date and to check whether the target project is ready for analysis, repairing its structure if necessary. To collect concurrency metrics, we used the JavaCompiler⁵ class to parse the source code and build a parse tree. The tree is traversed, and the metrics are extracted and stored in a text file. The extracted metrics include counts of classes that extend the Thread class, classes that implement the Runnable interface, uses of Java keywords such as synchronized and volatile, and instantiations of types belonging to the j.u.c. library, such as AtomicInteger, ConcurrentHashMap, ReentrantLock, and many others. Table 3 lists the elements whose occurrences we counted.
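
        The authors' metrics tool itself is not shown here. The following is only a minimal sketch of how a few such counts could be collected with javax.tools.JavaCompiler, assuming a JDK (not a bare JRE) and the Compiler Tree API in com.sun.source. The class ConcurrencyMetrics, its fields, and the in-memory file name Sample.java are hypothetical, and the name matching is deliberately simplistic: it only recognizes the unqualified names Thread and Runnable and ignores anonymous classes and lambdas.

```java
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.tree.SynchronizedTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import javax.lang.model.element.Modifier;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical metrics collector: walks the parse tree and counts a few constructs.
class ConcurrencyMetrics extends TreeScanner<Void, Void> {
    int synchronizedMethods;
    int synchronizedBlocks;
    int threadSubclasses;
    int runnableImplementations;

    @Override
    public Void visitMethod(MethodTree node, Void unused) {
        if (node.getModifiers().getFlags().contains(Modifier.SYNCHRONIZED)) {
            synchronizedMethods++;
        }
        return super.visitMethod(node, unused);
    }

    @Override
    public Void visitSynchronized(SynchronizedTree node, Void unused) {
        synchronizedBlocks++;
        return super.visitSynchronized(node, unused);
    }

    @Override
    public Void visitClass(ClassTree node, Void unused) {
        Tree superclass = node.getExtendsClause();
        if (superclass != null && superclass.toString().equals("Thread")) {
            threadSubclasses++;
        }
        for (Tree implemented : node.getImplementsClause()) {
            if (implemented.toString().equals("Runnable")) {
                runnableImplementations++;
            }
        }
        return super.visitClass(node, unused);
    }

    // Parses one source text held in memory (no code generation) and returns its counts.
    static ConcurrencyMetrics analyze(String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available (JDK required)");
        }
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///Sample.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, Collections.singletonList(file));
        ConcurrencyMetrics metrics = new ConcurrencyMetrics();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                metrics.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return metrics;
    }

    public static void main(String[] args) {
        ConcurrencyMetrics m = analyze(
                "class A extends Thread { synchronized void m() {} }\n"
                + "class B implements Runnable { public void run() { synchronized (this) {} } }");
        System.out.println(m.synchronizedMethods + " synchronized methods, "
                + m.synchronizedBlocks + " synchronized blocks, "
                + m.threadSubclasses + " Thread subclasses, "
                + m.runnableImplementations + " Runnable implementations");
    }
}
```

        Because only the parse phase runs, the analyzed sources do not need to compile or have their dependencies on the classpath, which suits a pipeline that processes many downloaded projects.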

                                                                                                                                                          

⁴ http://code.google.com/p/crawler4j/

⁵ http://docs.oracle.com/javase/6/docs/api/javax/tools/JavaCompiler.html

                                                                                                                                                          

        Our analysis focuses exclusively on projects identified as mature and stable. In addition, projects that do not have at least one release after 2004 are not considered, since java.util.concurrent was released as part of the JDK in December 2004. To avoid trivial systems, we only examined projects that have at least 1000 lines of code. We analyzed the projects considering both their latest releases and their evolution over time. In the latter case, we worked with multiple versions of each project. To better understand their evolution, we also calculated the differences in some metric values between the most recent and earlier versions of a system. We then calculated the Pearson correlation (Pearson, 1936) between these differences. This helped us determine, for example, that several projects exhibited a tendency to switch from directly extending the Thread class to using executors to manage thread execution.
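
        The correlation step can be sketched as follows. The study ran its statistics in R; this Java version is only an illustration of the computation, and the class Pearson, its methods, and the numbers in main are hypothetical, not data from the study.

```java
// Hypothetical helper illustrating Pearson correlation between per-project metric differences.
class Pearson {
    // Element-wise difference between a metric in the newer and in the older version.
    static double[] differences(double[] newer, double[] older) {
        if (newer.length != older.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] delta = new double[newer.length];
        for (int i = 0; i < newer.length; i++) {
            delta[i] = newer[i] - older[i];
        }
        return delta;
    }

    // Pearson correlation coefficient; the result is NaN if either input has zero variance.
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double covariance = 0;
        double varianceX = 0;
        double varianceY = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        // Made-up normalized counts for four projects, older versus newer version.
        double[] threadDelta = differences(new double[] {3, 5, 8, 1}, new double[] {5, 6, 8, 4});
        double[] executorDelta = differences(new double[] {4, 2, 0, 6}, new double[] {0, 0, 0, 0});
        System.out.println("correlation = " + correlation(threadDelta, executorDelta));
    }
}
```

        A coefficient close to -1 for such a pair of differences would be consistent with projects replacing one construct by the other, while a value near 0 would suggest that the two changes are unrelated.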

 

        All results presented in this paper are normalized to avoid distortions caused by very large absolute values and to make them more directly comparable. For example, to calculate the normalized number of Runnable implementations for the version of the DrJava project released on 2011.08.22, we divide the number of classes implementing Runnable, which is 6, by the number of lines of code, which is 112703, resulting in roughly 0.000053238. This result is then multiplied by 100000, giving a final result of roughly 5.3238. All collected metrics are normalized in this way, and we use these normalized values in the rest of this paper. References to absolute values are explicitly marked as such. Both the absolute and the normalized values of all metrics are available on the companion website for this article. Finally, based on the survey results, we formulate some assumptions that represent developers' expectations about the state of use of some concurrent programming techniques. We assume the following:

  1. A1: Java projects often use concurrent programming constructs (average estimate: 51.43%);
  2. A2: Java projects often use the j.u.c. library (average estimate: 36.63%);
  3. A3: Synchronized methods are the most frequently used concurrent programming construct;
  4. A4: ConcurrentHashMap is the most frequently used construct of the j.u.c. library;
  5. A5: Redesigning existing systems to leverage multicore architectures is commonplace.
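
        The normalization step described before the list of assumptions amounts to a single formula: occurrences divided by lines of code, times 100000. A minimal sketch, with the hypothetical names Normalize and per100kLoc:

```java
// Hypothetical helper: occurrences of a construct per 100,000 lines of code.
class Normalize {
    static double per100kLoc(int occurrences, int linesOfCode) {
        if (linesOfCode <= 0) {
            throw new IllegalArgumentException("linesOfCode must be positive");
        }
        return occurrences * 100_000.0 / linesOfCode;
    }

    public static void main(String[] args) {
        // The DrJava example from the text: 6 Runnable implementations in 112703 LoC.
        System.out.printf("%.4f%n", per100kLoc(6, 112703));
    }
}
```

        Evaluating the DrJava example directly gives a value of about 5.3237, which agrees with the figure quoted in the text up to rounding in the last digit.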

                                                                                                                                                          

Figure 1. In (a), the crawler populates the infrastructure repository with Java projects from SourceForge. In (b), the shell script extracts all compressed files into our local repository. In (c), the metrics collection tool parses the source code, collects metrics, and stores the results in a metrics repository. In (d), the metrics collection tool generates an input CSV file for R to perform statistical analysis on.

To be continued.
