Concurrent programming in Python

A running program is called a process. Each process has its own system state, including its memory, a list of open files, a program counter that tracks which instruction is executing, and a call stack that holds local variables. Normally, a process executes sequentially along a single flow of control, which is called the main thread of the process. At any given moment, such a program is doing only one thing.

A program can create new processes through functions in the standard library's os or subprocess modules (such as os.fork() or subprocess.Popen()). These new processes, called child processes, run independently, each with its own system state and main thread. Because the processes are independent of one another, they execute concurrently with the original process, which means the original process can carry on with other work after creating a child process.
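As a minimal sketch of this (the "sleep" command is just a stand-in for any external program), subprocess.Popen launches a child process and returns immediately, so the parent keeps running:

import subprocess

# Launch a child process; Popen returns right away, so the parent does not block here.
child = subprocess.Popen(["sleep", "2"])

print("child pid:", child.pid)   # the parent can do other work while the child runs

# Later, wait for the child to finish and collect its exit status.
child.wait()
print("child exited with code", child.returncode)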

Although processes are independent of each other, they can communicate through a mechanism called inter-process communication (IPC). One typical style is message passing, which can be understood simply as exchanging raw byte buffers: send() and recv() primitives transmit or receive messages over I/O channels such as pipes or network sockets. Other IPC styles rely on memory mapping (for example, the mmap module). With memory mapping, processes can create shared regions of memory, and modifications to those regions are visible to all participating processes.
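As a rough sketch of the message-passing style, the standard multiprocessing module's Pipe() returns a pair of connection objects with send() and recv() primitives (the worker function below is purely illustrative):

from multiprocessing import Process, Pipe

def worker(conn):
    # The child receives a message over the pipe and sends a reply back.
    msg = conn.recv()
    conn.send("echo: " + msg)
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send("hello")       # message passing between the two processes
    print(parent_conn.recv())       # prints "echo: hello"
    p.join()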

Multiple processes are useful when several tasks need to run at the same time, with different processes responsible for different parts of the work. Another way to subdivide work into tasks is to use threads. Like a process, a thread has its own flow of control and execution stack, but a thread runs within the process that created it, sharing all of that process's data and system resources. Threads are useful when an application needs to perform tasks concurrently, with the potential complication that the tasks share a large amount of system state.
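A minimal sketch of that sharing (the names results and worker are only illustrative): a thread appends to a list that lives in its process's memory, and the main thread sees the change directly, with no IPC involved:

import threading

results = []          # ordinary data, shared by every thread in this process

def worker(n):
    results.append(n * n)

t = threading.Thread(target=worker, args=(7,))
t.start()
t.join()
print(results)        # [49] -- the main thread sees what the worker thread wrote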

When you use multiple processes or threads, the operating system is responsible for scheduling them. It does this by giving each process (or thread) a small time slice and rapidly cycling through all active tasks, dividing CPU time into small fragments shared among them. For example, with 10 active processes, the operating system allocates roughly one-tenth of the CPU time to each and switches between them in turn. When the system has more than one CPU core, the operating system can schedule processes onto different cores, keeping the load balanced and achieving true parallel execution.

Programs written with these concurrency mechanisms must deal with some inherently complex issues. A major source of complexity is synchronizing and sharing data. In particular, multiple tasks attempting to update the same data structure at the same time can produce corrupted data and inconsistent program state (formally known as a race condition). To solve this problem, mutexes or similar synchronization primitives are used to identify and protect critical sections of the program. For example, if several threads are trying to write to the same file at the same time, a mutex is needed to make these writes happen sequentially: while one thread is writing, the other threads must wait until the current thread releases the resource.
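As a small sketch of such a critical section (the file name and helper are illustrative, using threading.Lock as the mutex), each thread must acquire the lock before appending to the shared file, so the writes happen one at a time:

import threading

write_lock = threading.Lock()

def append_line(path, line):
    # Only one thread at a time can be inside this block (the critical section).
    with write_lock:
        with open(path, "a") as f:
            f.write(line + "\n")

threads = [threading.Thread(target=append_line, args=("shared.log", f"message {i}"))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()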

Concurrent programming in Python

Python has long supported different forms of concurrent programming, including threads, subprocesses, and concurrency implemented with generator functions.

Python supports both message passing and thread-based concurrent programming on most systems. Although most programmers are more familiar with the thread interface, Python's threading mechanism has significant limitations. Python uses an internal Global Interpreter Lock (GIL) to ensure thread safety, and the GIL allows only one thread to execute Python code at a time. As a result, a Python program effectively runs on a single processor core, even on a multi-core machine. Despite the many debates about the GIL in the Python community, it is unlikely to be removed in the foreseeable future.

Python provides some convenient tools for managing concurrent operations based on threads and processes. Even simple programs can use these tools to make tasks run concurrently and thus finish faster.

* The subprocess module provides an API for creating child processes and communicating with them. It is especially well suited to running programs that work with text, because the API supports passing data through the new process's standard input and output channels (a short sketch follows this list).
* The signal module exposes the UNIX signal mechanism, which can be used to send event notifications between processes. Signals are handled asynchronously and usually interrupt whatever the program is doing when a signal arrives. Signals provide a coarse-grained messaging system, but other inter-process communication techniques are more reliable and can deliver more complex messages.
* The threading module provides a high-level, object-oriented API for concurrent operations. Thread objects run concurrently within a single process and share memory. Threads are a good way to scale I/O-bound tasks.
* The multiprocessing module is similar to the threading module but provides the same operations for processes. Each process class corresponds to a real operating system process without shared memory, but the module provides mechanisms for sharing data and passing messages between processes. Often, converting a thread-based program into a process-based one is as simple as changing a few import statements.
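To illustrate the subprocess case mentioned above, here is a small sketch (assuming a Unix-like sort command is available) that feeds data to a child process's standard input and reads the sorted output back from its standard output:

import subprocess

# Start the child with pipes attached to its stdin and stdout; text=True gives str I/O.
proc = subprocess.Popen(["sort"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

# Send input, close stdin, and read everything the child writes to stdout.
out, _ = proc.communicate("banana\napple\ncherry\n")
print(out)   # apple, banana, cherry -- one per line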

Threading module example

As an example of the threading module, consider a simple problem: summing a large range of numbers by splitting the range into segments that are added up in parallel.


import threading

class SummingThread(threading.Thread):

    def __init__(self, low, high):
        super().__init__()
        self.low = low
        self.high = high
        self.total = 0

    def run(self):
        for i in range(self.low, self.high):
            self.total += i

thread1 = SummingThread(0, 500000)
thread2 = SummingThread(500000, 1000000)

thread1.start()  # This actually causes the thread to run
thread2.start()

thread1.join()   # This waits until the thread has completed
thread2.join()

# At this point, both threads have completed
result = thread1.total + thread2.total
print(result)

Custom threading library

I've written a small Python library that makes working with threads easier; it contains a few useful classes and functions.

Key components:

* do_threaded_work - a function that distributes a given list of work items to a processing function across multiple threads (the order in which items are processed is indeterminate, but result ordering can optionally be preserved)
* ThreadedWorker - a class that creates a thread which pulls work items from a synchronized work queue and writes the results to a synchronized result queue
* start_logging_with_thread_info - adds the thread id to all log messages (depends on the logging configuration)
* stop_logging_with_thread_info - removes the thread id from all log messages (depends on the logging configuration)


import logging
import queue
import threading

def do_threaded_work(work_items, work_func, num_threads=None, per_sync_timeout=1, preserve_result_ordering=True):
    """ Executes work_func on each work_item. Note: execution order is not preserved, but output ordering is (optionally).

        Parameters:
        - num_threads               Default: len(work_items)  --- Number of threads used to process items in work_items.
        - per_sync_timeout          Default: 1                --- Each synchronized operation can optionally time out.
        - preserve_result_ordering  Default: True             --- Reorders result items to match the original work_items ordering.

        Return:
        --- list of results from applying work_func to each work_item. Order is optionally preserved.

        Example:

        def process_url(url):
            # TODO: Do some work with the url
            return url

        urls_to_process = ["http://url1.com", "http://url2.com", "http://site1.com", "http://site2.com"]

        # process urls in parallel
        result_items = do_threaded_work(urls_to_process, process_url)

        # print results
        print(repr(result_items))
    """
    if not num_threads:
        num_threads = len(work_items)

    work_queue = queue.Queue()
    result_queue = queue.Queue()

    # Fill the work queue, tagging each item with its index if ordering matters.
    index = 0
    for work_item in work_items:
        if preserve_result_ordering:
            work_queue.put((index, work_item))
        else:
            work_queue.put(work_item)
        index += 1

    if preserve_result_ordering:
        wrapped_work_func = lambda work_item: (work_item[0], work_func(work_item[1]))

    start_logging_with_thread_info()

    # spawn a pool of threads, and pass them the queue instances
    for _ in range(num_threads):
        if preserve_result_ordering:
            t = ThreadedWorker(work_queue, result_queue, work_func=wrapped_work_func, queue_timeout=per_sync_timeout)
        else:
            t = ThreadedWorker(work_queue, result_queue, work_func=work_func, queue_timeout=per_sync_timeout)
        t.daemon = True
        t.start()

    work_queue.join()
    stop_logging_with_thread_info()
    logging.info('work_queue joined')

    # Drain the result queue.
    result_items = []
    while not result_queue.empty():
        result = result_queue.get(timeout=per_sync_timeout)
        logging.info('found result[:500]: ' + repr(result)[:500])
        if result:
            result_items.append(result)

    if preserve_result_ordering:
        # Sort by the original index, then strip the index tags.
        result_items.sort(key=lambda pair: pair[0])
        result_items = [work_item for index, work_item in result_items]

    return result_items

class ThreadedWorker(threading.Thread):
    """ Generic threaded worker.

        Input to work_func: an item taken from work_queue.

    Example usage:

    import queue

    urls_to_process = ["http://url1.com", "http://url2.com", "http://site1.com", "http://site2.com"]

    work_queue = queue.Queue()
    result_queue = queue.Queue()

    def process_url(url):
        # TODO: Do some work with the url
        return url

    def main():
        # populate the queue with data first, so the workers do not exit on an empty queue
        for url in urls_to_process:
            work_queue.put(url)

        # spawn a pool of threads, and pass them the queue instances
        for i in range(3):
            t = ThreadedWorker(work_queue, result_queue, work_func=process_url)
            t.daemon = True
            t.start()

        # wait on the queue until everything has been processed
        work_queue.join()

        # print results
        print(repr(result_queue))

    main()
    """

    def __init__(self, work_queue, result_queue, work_func, stop_when_work_queue_empty=True, queue_timeout=1):
        threading.Thread.__init__(self)
        self.work_queue = work_queue
        self.result_queue = result_queue
        self.work_func = work_func
        self.stop_when_work_queue_empty = stop_when_work_queue_empty
        self.queue_timeout = queue_timeout

    def should_continue_running(self):
        if self.stop_when_work_queue_empty:
            return not self.work_queue.empty()
        else:
            return True

    def run(self):
        while self.should_continue_running():
            try:
                # grab an item from work_queue
                work_item = self.work_queue.get(timeout=self.queue_timeout)
            except queue.Empty:
                logging.warning('ThreadedWorker queue was empty or Queue.get() timed out')
                continue

            try:
                # work on the item
                work_result = self.work_func(work_item)

                # place work_result into result_queue
                self.result_queue.put(work_result, timeout=self.queue_timeout)
            except queue.Full:
                logging.warning('ThreadedWorker queue was full or Queue.put() timed out')
            except Exception:
                logging.exception('Error in ThreadedWorker')
            finally:
                # signal to work_queue that this item is done
                self.work_queue.task_done()

def start_logging_with_thread_info():
    try:
        formatter = logging.Formatter('[thread %(thread)-3s] %(message)s')
        logging.getLogger().handlers[0].setFormatter(formatter)
    except Exception:
        logging.exception('Failed to start logging with thread info')

def stop_logging_with_thread_info():
    try:
        formatter = logging.Formatter('%(message)s')
        logging.getLogger().handlers[0].setFormatter(formatter)
    except Exception:
        logging.exception('Failed to stop logging with thread info')

Usage example


from test import ThreadedWorker   # assumes the library above was saved as test.py
from queue import Queue

urls_to_process = ["http://facebook.com", "http://pypix.com"]

work_queue = Queue()
result_queue = Queue()

def process_url(url):
    # TODO: Do some work with the url
    return url

def main():
    # populate the queue with data first, so the workers do not exit on an empty queue
    for url in urls_to_process:
        work_queue.put(url)

    # spawn a pool of threads, and pass them the queue instances
    for i in range(5):
        t = ThreadedWorker(work_queue, result_queue, work_func=process_url)
        t.daemon = True
        t.start()

    # wait on the queue until everything has been processed
    work_queue.join()

    # print results
    while not result_queue.empty():
        print(result_queue.get())

main()
