We call a running program a process. Each process has its own system state, including its memory, a list of open files, a program counter that tracks which instruction is executing, and a call stack that holds local variables. Normally, a process executes sequentially along a single flow of control, called the main thread of the process. At any given moment, the program does only one thing.
A program can create new processes through the os or subprocess modules in the Python standard library (for example, os.fork() or subprocess.Popen()). These new processes, called child processes, run independently, each with its own system state and main thread. Because the processes are independent of each other, they execute concurrently with the original process, which means the original process can perform other work after creating a child.
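As a minimal sketch of this, the following uses subprocess.Popen() to launch a child process (here just another Python interpreter printing a line, chosen for portability) while the parent continues with its own work:

```python
import subprocess
import sys

# Launch a child process running its own Python interpreter; the command
# here simply prints one line to its standard output.
child = subprocess.Popen(
    [sys.executable, "-c", "print('hello from the child')"],
    stdout=subprocess.PIPE,
    text=True,
)

# The parent is free to do other work while the child runs...
parent_work = sum(range(10))

# ...and can later collect the child's output and exit status.
output, _ = child.communicate()
print(output.strip())    # output produced by the child
print(child.returncode)  # 0 on a clean exit
```

Because the child has its own state, nothing the child does here affects the parent's variables; the only communication is through the stdout pipe that Popen set up.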
Although processes are independent of each other, they can communicate through mechanisms collectively known as inter-process communication (IPC). One typical model is message passing, where a message is simply a buffer of bytes: primitives such as send() and recv() transmit or receive messages over an I/O channel such as a pipe or a network socket. Other IPC models rely on memory mapping (for example, the mmap module). With memory mapping, processes can create shared regions of memory, and modifications to those regions are visible to every process that maps them.
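A pipe is the simplest example of such a byte channel. The sketch below keeps both ends of the pipe in a single process for brevity; in a real program the write end would typically belong to another process (for example, after os.fork() on UNIX):

```python
import os

# A pipe is a one-way byte channel: one file descriptor for reading,
# one for writing.
read_fd, write_fd = os.pipe()

# Write a message into the pipe and close the write end so the reader
# sees end-of-stream after the data.
os.write(write_fd, b"hello via pipe")
os.close(write_fd)

# Read the raw bytes back out.
message = os.read(read_fd, 1024)
os.close(read_fd)
print(message)  # b'hello via pipe'
```

The message really is just a byte buffer; any structure (text encoding, serialized objects) is a convention layered on top by the communicating processes.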
Multiprocessing suits scenarios where several tasks must run simultaneously, with different processes responsible for different parts of the work. Another way to subdivide work into tasks, however, is to use threads. Like a process, a thread has its own flow of control and execution stack, but a thread runs inside the process that created it, sharing all of that process's data and system resources. Threads are useful when an application needs to perform tasks concurrently but those tasks must share a large amount of program state.
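This sharing is the key difference from child processes: a thread can read and write its parent process's data directly. A minimal sketch:

```python
import threading

# Data owned by the process; every thread in the process can see it.
results = []

def worker(n):
    # The thread writes into the shared list directly -- no copying or
    # message passing is needed, unlike with child processes.
    results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9]
```

Note that the completion order of the threads is not guaranteed, which is why the result is sorted before printing.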
When multiple processes or threads are in use, the operating system is responsible for scheduling. It does this by giving each process (or thread) a small time slice and rapidly cycling through all active tasks, dividing CPU time into small fragments. For example, if 10 active processes are executing on your system, the operating system allocates roughly one-tenth of the CPU time to each process and switches among the ten in turn. When the system has more than one CPU core, the operating system can schedule processes onto different cores to keep the load even, achieving true parallel execution.
Programs written using concurrent execution mechanisms must contend with some complex issues. A major source of complexity is synchronizing and sharing data. Multiple tasks attempting to update the same data structure at the same time can produce corrupted data and inconsistent program state (formally known as a race condition). To solve this problem, use mutexes or similar synchronization primitives to identify and protect the critical sections of your program. For example, if several threads are trying to write to the same file at the same time, you need a mutex to make those writes happen sequentially: while one thread is writing, the others must wait until the current thread releases the lock.
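A classic illustration of a critical section is incrementing a shared counter. The read-modify-write sequence is not atomic, so in the sketch below a threading.Lock serializes the updates; each thread must acquire the mutex before touching the counter:

```python
import threading

counter = 0
counter_lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        # The lock marks the critical section: only one thread at a time
        # may perform the read-modify-write on the shared counter.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 -- without the lock, updates could be lost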
Concurrent programming in Python
Python has long supported different forms of concurrent programming, including threads, subprocesses, and other implementations of concurrency using generator functions.
Python supports both message passing and thread-based concurrent programming on most systems. Although most programmers are more familiar with the thread interface, Python's threading mechanism has significant limitations. Python uses an internal Global Interpreter Lock (GIL) to ensure thread safety, and the GIL allows only one thread to execute Python code at a time. As a result, a Python program effectively runs on a single processor, even on a multi-core system. Despite much debate in the Python community about the GIL, it is unlikely to be removed in the foreseeable future.
Python provides some convenient tools for managing concurrent operations based on threads and processes. Even simple programs can use these tools to run tasks concurrently and finish faster.

The subprocess module provides an API for creating and communicating with child processes. It is especially well suited to running text-oriented programs, because its API supports passing data through the standard input and output channels of the new process.

The signal module exposes the UNIX signal mechanism, which can be used to send event notifications between processes. Signals are handled asynchronously, usually by interrupting the program's current work when a signal arrives. Signals provide a coarse-grained message-passing system, but other IPC techniques are more reliable and can deliver more complex messages.

The threading module provides a set of high-level, object-oriented APIs for working with concurrency. Thread objects run concurrently within a single process and share memory resources. Threads scale well for I/O-intensive tasks.

The multiprocessing module is similar to the threading module, but it operates on processes instead. Each process class wraps a real operating-system process, and processes do not share memory resources; the multiprocessing module therefore provides its own mechanisms for sharing data and passing messages between processes. In many cases, converting a thread-based program to a process-based one is as simple as changing a few import statements.
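The import-swap point can be made concrete. The sketch below is thread-based; the commented lines show the multiprocessing equivalents, assuming the work function is defined at module top level (so it can be pickled for a child process) and communication goes through a queue, whose API is the same in both worlds:

```python
from threading import Thread   # multiprocessing: from multiprocessing import Process as Thread
from queue import Queue        # multiprocessing: from multiprocessing import Queue

def work(inbox, outbox):
    # Take one item from the input queue and report the result through
    # the output queue; queues hide whether the worker is a thread or a
    # process.
    outbox.put(inbox.get() * 2)

inbox, outbox = Queue(), Queue()
inbox.put(21)

t = Thread(target=work, args=(inbox, outbox))
t.start()
t.join()

result = outbox.get()
print(result)  # 42
```

With the two imports swapped, the rest of the code would be unchanged, which is exactly the kind of conversion the paragraph above describes.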
Threading module example
Taking the threading module as an example, consider a simple problem: how to split a large summation into segments that are computed in parallel.
import threading

class SummingThread(threading.Thread):
    def __init__(self, low, high):
        super(SummingThread, self).__init__()
        self.low = low
        self.high = high
        self.total = 0

    def run(self):
        for i in range(self.low, self.high):
            self.total += i

thread1 = SummingThread(0, 500000)
thread2 = SummingThread(500000, 1000000)
thread1.start()  # This actually causes the thread to run
thread2.start()
thread1.join()   # This waits until the thread has completed
thread2.join()
# At this point, both threads have completed
result = thread1.total + thread2.total
print(result)
Custom Threading class library
I've written a small Python library that makes working with threads easier; it contains some useful classes and functions.

Key components:

* do_threaded_work - this function distributes a given set of work items to the corresponding processing function (the order of processing is indeterminate)
* ThreadedWorker - this class creates a thread that pulls work items from a synchronized work queue and writes the results to a synchronized result queue
* start_logging_with_thread_info - adds the thread id to all log messages (depends on the logging configuration)
* stop_logging_with_thread_info - removes the thread id from all log messages (depends on the logging configuration)
import logging
import queue
import threading


def do_threaded_work(work_items, work_func, num_threads=None,
                     per_sync_timeout=1, preserve_result_ordering=True):
    """Executes work_func on each work_item.

    Note: execution order is not preserved, but output ordering
    (optionally) is.

    Parameters:
    - num_threads               Default: len(work_items) --- number of
      threads used to process the items in work_items.
    - per_sync_timeout          Default: 1 --- timeout for each
      synchronized queue operation.
    - preserve_result_ordering  Default: True --- reorders the result
      items to match the original work_items ordering.

    Return:
    - list of results from applying work_func to each work_item;
      order is optionally preserved.

    Example:

        def process_url(url):
            # TODO: do some work with the url
            return url

        urls_to_process = ["http://url1.com", "http://url2.com",
                           "http://site1.com", "http://site2.com"]

        # process urls in parallel
        result_items = do_threaded_work(urls_to_process, process_url)

        # print results
        print(repr(result_items))
    """
    global wrapped_work_func

    if not num_threads:
        num_threads = len(work_items)

    work_queue = queue.Queue()
    result_queue = queue.Queue()

    index = 0
    for work_item in work_items:
        if preserve_result_ordering:
            work_queue.put((index, work_item))
        else:
            work_queue.put(work_item)
        index += 1

    if preserve_result_ordering:
        wrapped_work_func = lambda work_item: (work_item[0], work_func(work_item[1]))

    start_logging_with_thread_info()

    # spawn a pool of threads and pass them the queue instances
    for _ in range(num_threads):
        if preserve_result_ordering:
            t = ThreadedWorker(work_queue, result_queue,
                               work_func=wrapped_work_func,
                               queue_timeout=per_sync_timeout)
        else:
            t = ThreadedWorker(work_queue, result_queue,
                               work_func=work_func,
                               queue_timeout=per_sync_timeout)
        t.daemon = True
        t.start()

    work_queue.join()
    stop_logging_with_thread_info()
    logging.info('work_queue joined')

    result_items = []
    while not result_queue.empty():
        result = result_queue.get(timeout=per_sync_timeout)
        logging.info('found result[:500]: ' + repr(result)[:500])
        if result:
            result_items.append(result)

    if preserve_result_ordering:
        result_items = [work_item for index, work_item in result_items]

    return result_items


class ThreadedWorker(threading.Thread):
    """Generic threaded worker.

    Input to work_func: an item from work_queue.

    Example usage:

        import queue

        urls_to_process = ["http://url1.com", "http://url2.com",
                           "http://site1.com", "http://site2.com"]

        work_queue = queue.Queue()
        result_queue = queue.Queue()

        def process_url(url):
            # TODO: do some work with the url
            return url

        def main():
            # spawn a pool of threads and pass them the queue instances
            for i in range(3):
                t = ThreadedWorker(work_queue, result_queue, work_func=process_url)
                t.daemon = True
                t.start()

            # populate the queue with data
            for url in urls_to_process:
                work_queue.put(url)

            # wait on the queue until everything has been processed
            work_queue.join()

            # print results
            print(repr(result_queue))

        main()
    """

    def __init__(self, work_queue, result_queue, work_func,
                 stop_when_work_queue_empty=True, queue_timeout=1):
        threading.Thread.__init__(self)
        self.work_queue = work_queue
        self.result_queue = result_queue
        self.work_func = work_func
        self.stop_when_work_queue_empty = stop_when_work_queue_empty
        self.queue_timeout = queue_timeout

    def should_continue_running(self):
        if self.stop_when_work_queue_empty:
            return not self.work_queue.empty()
        else:
            return True

    def run(self):
        while self.should_continue_running():
            # grab an item from work_queue
            try:
                work_item = self.work_queue.get(timeout=self.queue_timeout)
            except queue.Empty:
                logging.warning('ThreadedWorker Queue was empty or Queue.get() timed out')
                continue

            try:
                # work on the item
                work_result = self.work_func(work_item)

                # place work_result into result_queue
                self.result_queue.put(work_result, timeout=self.queue_timeout)
            except queue.Full:
                logging.warning('ThreadedWorker Queue was full or Queue.put() timed out')
            except Exception:
                logging.exception('Error in ThreadedWorker')
            finally:
                # signal to work_queue that the item is done; only called
                # for items actually taken off the queue
                self.work_queue.task_done()


def start_logging_with_thread_info():
    try:
        formatter = logging.Formatter('[thread %(thread)-3s] %(message)s')
        logging.getLogger().handlers[0].setFormatter(formatter)
    except Exception:
        logging.exception('Failed to start logging with thread info')


def stop_logging_with_thread_info():
    try:
        formatter = logging.Formatter('%(message)s')
        logging.getLogger().handlers[0].setFormatter(formatter)
    except Exception:
        logging.exception('Failed to stop logging with thread info')
Usage example
from queue import Queue

from test import ThreadedWorker

urls_to_process = ["http://facebook.com", "http://pypix.com"]

work_queue = Queue()
result_queue = Queue()

def process_url(url):
    # TODO: do some work with the url
    return url

def main():
    # spawn a pool of threads and pass them the queue instances
    for i in range(5):
        t = ThreadedWorker(work_queue, result_queue, work_func=process_url)
        t.daemon = True
        t.start()

    # populate the queue with data
    for url in urls_to_process:
        work_queue.put(url)

    # wait on the queue until everything has been processed
    work_queue.join()

    # print results
    print(repr(result_queue))

main()