Multi-threaded applications in Python web crawlers

Preface: as test engineers, we often need to solve the problem of where test data comes from. The solutions come down to three approaches: (1) crawl real data directly from the Internet; (2) copy data from the production environment; (3) build data ourselves with a script or tool. Some time ago, in order to obtain more test data, I wrote a crawler to fetch data from the Internet. Its basic functions met the project's needs, but its crawling efficiency was not very high. Wanting to be a better test engineer, I decided to study how multi-threading can be applied to crawlers in order to improve their efficiency.

First, why we need multi-threading
  
  To know a thing, we should also know why it is so. Before diving into the details of multi-threading, let's look at why it is needed in the first place. By analogy: suppose you are moving house. A single thread is like hiring one mover, who is responsible for packing, carrying, driving and unloading all by himself; his efficiency is understandably very low. Multi-threading is like hiring four movers: A packs, B carries the boxes to the truck, C drives to the destination, and D unloads.
  This shows that the benefit of multi-threading is efficiency: it makes full use of resources. The downside is that the threads must be coordinated, or things easily descend into chaos (like the proverb: one boy carries water, two boys share the load, three boys carry none). Therefore, to improve the crawler's efficiency, we must pay special attention to managing the threads properly when using them.

Second, multithreading basics
  
  Process: a process consists of three parts: the program, the data set, and the process control block. It is one run of a program over a data set; running the same program twice on the same data set creates two processes. A process is the basic unit of resource management. In the operating system, each process has its own address space, and each process starts with one default thread of control.

  Thread: a thread is an entity within a process. It is the basic unit of CPU scheduling and dispatch, and the smallest unit of execution. Threads were introduced to reduce the cost of context switching and improve the concurrency of the system, overcoming the limitation that a process can only do one thing at a time. Threads are managed by their process, and multiple threads share the address space and resources of their parent process.

  The relationship between processes and threads:
  A thread can belong to only one process, while a process can have multiple threads, but at least one.
  Resources are allocated to the process; all threads within a process share all of that process's resources.
  The CPU is allocated to threads: what actually runs on the CPU is a thread.
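A minimal sketch (my own illustration, not from the original article) of the last two points: several threads update the same object in their parent process's memory, coordinated with a lock as the article advises.

```python
import threading

counter = {"value": 0}       # shared state living in the parent process
lock = threading.Lock()      # coordinates access between the threads

def work():
    for _ in range(100_000):
        with lock:           # without the lock, concurrent += would lose updates
            counter["value"] += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 400000: all four threads updated the same shared object
```

Because all four threads see the same `counter` object, no data needs to be copied between them; this is exactly the shared-resource property described above.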

  How threads work:
  A single thread runs serially on the CPU; multiple threads on multiple CPUs run in parallel. Concurrency is "pseudo-parallelism": at any instant a single CPU executes only one task, but the CPU divides its time into slices, each thread occupies only a very short slice, and the threads take turns. Because the slices are so short, to the user all the threads appear to run "simultaneously". Concurrency is also the most practical way to run multiple threads on a single CPU.
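This interleaving pays off most when threads spend their time waiting. A minimal sketch (my own, not from the original article): four threads each "wait on the network" for 0.5 s, yet the total elapsed time stays close to 0.5 s rather than the 2 s a single thread would need, because the waits overlap.

```python
import threading
import time

def io_task():
    time.sleep(0.5)  # simulate waiting on I/O (a network request, say)

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:          # wait for all four waits to finish
    t.join()
elapsed = time.time() - start

# Four 0.5 s waits overlap instead of running back-to-back (2 s serially)
print("elapsed: {:.2f} s".format(elapsed))
```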

  The working states of a process:
  A process has three states: running, blocked, and ready. The transitions between them work as follows: a running process may actively enter the blocked state because it is waiting for input; it may passively enter the ready state because the scheduler chooses another process (typically when the CPU time allocated to it runs out); a blocked process enters the ready state once valid input arrives; and a ready process runs again once the scheduler picks it.

  
Third, a multi-threaded communication example

  Back to the crawler problem. We know that when crawling blog articles, we first crawl the list page, then crawl the article detail pages according to the results from the list page. Crawling the list page is certainly faster than crawling the detail pages.
  In this case, we can design thread A to be responsible for crawling the article list pages, while threads B, C, and D are responsible for crawling the article details. Thread A puts its results into a structure similar to a global variable, and threads B, C, and D take results out of that structure.
  In Python, two modules support this kind of multi-threading: the threading module, responsible for creating and starting threads, and the queue module, responsible for maintaining that "global-variable-like" structure. One thing worth adding here: some readers may ask, why not just use a plain global variable? Why use a queue at all? Because a plain global variable is not thread-safe. For example, suppose the global variable (a list) contains only one URL. Thread B checks that the list is not empty, but before it removes the URL, the CPU time slice passes to thread C, which removes the last URL. When the time slice returns to B, B tries to take data from the now-empty list and raises an error. The queue module implements a multi-producer, multi-consumer queue whose put and get operations are thread-safe.
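The race just described can be seen avoided in a minimal sketch (my own, not the article's code; the URLs are hypothetical): one producer puts 20 URLs into a queue.Queue, and three consumers drain it concurrently. Because get() is atomic, every URL is taken exactly once and no consumer ever pulls from an "empty list".

```python
import threading
from queue import Queue, Empty

url_queue = Queue()  # thread-safe replacement for a bare global list

def producer():
    for i in range(20):
        url_queue.put("http://testedu.com/{}".format(i))  # hypothetical URLs

consumed = []
consumed_lock = threading.Lock()

def consumer():
    while True:
        try:
            # get() is atomic: two consumers can never take the same URL
            url = url_queue.get(timeout=0.2)
        except Empty:
            return  # queue drained, nothing left to do
        with consumed_lock:
            consumed.append(url)

p = threading.Thread(target=producer)
p.start()
workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()
p.join()
for w in workers:
    w.join()

print(len(consumed))  # 20: every URL consumed exactly once, none lost
```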

  Without further ado, here is the code:

import threading  # thread creation, starting and other operations
from queue import Queue  # thread-safe queue
import time  # timing and sleeps

# crawl an article detail page
def get_detail_html(detail_url_list, id):
    while True:
        url = detail_url_list.get()  # get() takes one element out of the queue (blocking while it is empty)
        time.sleep(2)  # sleep 2 s to simulate the network request and crawling of a detail page
        print("thread {id}: get {url} detail finished".format(id=id, url=url))  # print the thread id and the url just crawled
        detail_url_list.task_done()  # tell the queue this item has been fully processed

# crawl the article list pages
def get_detail_url(queue):
    for i in range(10000):
        time.sleep(1)  # sleep 1 s: the list page is crawled faster than the detail pages
        queue.put("http://testedu.com/{id}".format(id=i))  # put() places one element into the queue; Queue is FIFO, so the first url put in is the first taken out
        print("get detail url {id} end".format(id=i))  # print which article url has been obtained

# main function
if __name__ == "__main__":
    detail_url_queue = Queue(maxsize=1000)  # a thread-safe FIFO queue holding at most 1000 urls
    # create the four threads
    thread = threading.Thread(target=get_detail_url, args=(detail_url_queue,))  # thread A crawls the url list
    html_thread = []
    for i in range(3):
        thread2 = threading.Thread(target=get_detail_html, args=(detail_url_queue, i), daemon=True)
        html_thread.append(thread2)  # threads B, C, D crawl the article details; as daemons, they exit together with the main thread
    start_time = time.time()
    # start the four threads
    thread.start()
    for i in range(3):
        html_thread[i].start()
    # wait for thread A to finish producing, then wait until every queued url has been processed;
    # join() blocks the calling thread until the work it is waiting on completes
    thread.join()
    detail_url_queue.join()

    print("last time: {} s".format(time.time() - start_time))  # total time for all four threads to finish crawling

 

  Run result: (screenshot of the console output omitted)
  Postscript: from the orderly run results we can see that the threads cooperate well, with no errors or warnings. Clearly, using a Queue for communication between multiple threads is much safer than using a global variable directly. Moreover, the multi-threaded version takes far less crawling time than a single-threaded one: it improves the crawler's efficiency while keeping it thread-safe, which makes it a very practical approach for crawling test data. I hope you find it useful!

Source: https://mp.weixin.qq.com/s/LsRNxAVJywKwEXxo8WuwLw


Origin www.cnblogs.com/songzhenhua/p/11824483.html