Data augmentation in practice, with notes on multi-processing and multi-threading

  • State-of-the-art neural network models fit the target output with a composition of linear and nonlinear functions. Because this is a fitting problem, more samples generally give more accurate results, which is why the datasets used to train neural networks keep growing.

  • In practice we often have only a few thousand, or even a few hundred, samples. Against the millions of parameters in a neural network, it is easy to fall into the trap of overfitting. Convergence takes a long training process, during which the network sees the same few training images over and over; it ends up memorizing them instead of learning features that generalize. A natural idea is to generate a series of images from each original image, expanding the dataset by a factor of hundreds or thousands. This is one of the purposes of data augmentation.

  • Neural networks have no common sense, so they always separate two categories in the most "convenient" way. Suppose we want to train a network to distinguish apples from oranges, but our data contains only red apples and green oranges. No matter how many photos we take, the network will simply conclude that red things are apples and green things are oranges. This happens often in practice: lighting, shooting angle, or any other inconspicuous difference between the classes can be used by the network as the basis for classification.

  • The goal of data augmentation is not to pile up data mindlessly, but to cover as many as possible of the situations that the original data misses but that occur in real life. Augmentation increases the diversity of the images in the dataset, which improves the model's performance and generalization ability. In PyTorch, the commonly used augmentation functions are collected in the torchvision.transforms module. The examples below pass PIL images to these transforms, so we also need the PIL.Image module.

  • Import packages:

    • from PIL import Image
      from pathlib import Path
      import matplotlib.pyplot as plt
      import torch
      import numpy as np
      plt.rcParams["savefig.bbox"]="tight"
      org_img=Image.open(Path("nest.jpg"))
      torch.manual_seed(0)
      print(np.array(org_img).shape) 
      # (2286, 2603, 3)
      
    • resize, zoom

    • import torchvision.transforms as T
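      # An int size would only scale the shorter edge and keep the aspect ratio;
      # pass (height, width) to get the square outputs named in the titles below.
      resize_img=[T.Resize(size=(newsize,newsize))(org_img) for newsize in [1000,2000]]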
      
      ax1=plt.subplot(131)
      ax1.set_title("original")
      ax1.imshow(org_img)
      
      ax2=plt.subplot(132)
      ax2.set_title("1000*1000")
      ax2.imshow(resize_img[0])
      
      ax3=plt.subplot(133)
      ax3.set_title("2000*2000")
      ax3.imshow(resize_img[1])
      
      plt.show()
      
    • Grayscale

    • gray_img = T.Grayscale()(org_img)
      ax1 = plt.subplot(121)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(122)
      ax2.set_title('gray')
      ax2.imshow(gray_img,cmap='gray')
      plt.show()
      
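    • Normalization (standardization)

    • # (x - 0.5) / 0.5 maps each channel from [0, 1] to [-1, 1]
      norm_img=T.Normalize(mean=(0.5,0.5,0.5),std=(0.5,0.5,0.5))(T.ToTensor()(org_img))
      # ToPILImage expects float values in [0, 1], so this picture is only a rough visualization
      norm_img=[T.ToPILImage()(norm_img)]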
      
      ax1 = plt.subplot(121)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(122)
      ax2.set_title('normalize')
      ax2.imshow(norm_img[0])
      
      plt.show()
      
    • rotate

    • # RandomRotation(degrees=180) samples an angle uniformly from [-180, 180]
      rotate_img=[T.RandomRotation(degrees=180)(org_img)]
      
      ax1=plt.subplot(121)
      ax1.set_title("original")
      ax1.imshow(org_img)
      
      ax2=plt.subplot(122)
      ax2.set_title("random angle in [-180, 180]")
      ax2.imshow(rotate_img[0])
      
      plt.show()
      
    • center crop

    • center_crop=[T.CenterCrop(size=newsize)(org_img) for newsize in (300,600)]
      
      ax1 = plt.subplot(131)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(132)
      ax2.set_title('300*300')
      ax2.imshow(np.array(center_crop[0]))
      
      ax3 = plt.subplot(133)
      ax3.set_title('600*600')
      ax3.imshow(np.array(center_crop[1]))
      
      plt.show()
      
    • random crop

    • rand_crop=[T.RandomCrop(size=newsize)(org_img) for newsize in [500,1000]]
      
      ax1 = plt.subplot(131)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(132)
      ax2.set_title('500*500')
      ax2.imshow(np.array(rand_crop[0]))
      
      ax3 = plt.subplot(133)
      ax3.set_title('1000*1000')
      ax3.imshow(np.array(rand_crop[1]))
      
      plt.show()
      
    • Gaussian blur

    • blur_img=[T.GaussianBlur(kernel_size=(3,3),sigma=x)(org_img) for x in (30,60)]
      
      ax1 = plt.subplot(131)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(132)
      ax2.set_title('sigma=30')
      ax2.imshow(np.array(blur_img[0]))
      
      ax3 = plt.subplot(133)
      ax3.set_title('sigma=60')
      ax3.imshow(np.array(blur_img[1]))
      
      plt.show()
      
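    • Color jitter: brightness, contrast, saturation and hue

    • colorjitter=[T.ColorJitter(brightness=(0.2,0.8),contrast=(0.5,0.5),saturation=(0.5,0.5),hue=0.5)(org_img)]
      # Brightness, contrast, saturation and hue
      # brightness_factor is sampled uniformly from [max(0, 1 - brightness), 1 + brightness], or from the given (min, max) range as here. It must be non-negative.
      # contrast_factor is sampled uniformly from [max(0, 1 - contrast), 1 + contrast], or from the given (min, max) range. It must be non-negative.
      # saturation_factor is sampled uniformly from [max(0, 1 - saturation), 1 + saturation], or from the given (min, max) range. It must be non-negative.
      # hue_factor is sampled uniformly from [-hue, hue]; the value must satisfy 0 <= hue <= 0.5 or -0.5 <= min <= max <= 0.5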
      ax1 = plt.subplot(121)
      ax1.set_title('original')
      ax1.imshow(org_img)
      ax2 = plt.subplot(122)
      ax2.set_title('colorjitter')
      ax2.imshow(np.array(colorjitter[0]))
      plt.show()
      
    • horizontal flip

    • horizon=[T.RandomHorizontalFlip(p=1)(org_img)]
      # p is the probability of applying the flip
      ax1 = plt.subplot(121)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(122)
      ax2.set_title('horizon')
      ax2.imshow(np.array(horizon[0]))
      
      plt.show()
      
    • flip vertically

    • vertical=[T.RandomVerticalFlip(p=1)(org_img)]
      
      ax1 = plt.subplot(121)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(122)
      ax2.set_title('VerticalFlip')
      ax2.imshow(np.array(vertical[0]))
      
      plt.show()
      
    • Add custom noise

    • def add_noise(inputs,noise_fac=0.5):
          # add zero-mean Gaussian noise scaled by noise_fac
          noise=inputs + torch.randn_like(inputs)*noise_fac
          noise=torch.clip(noise,0.0,1.0)
          # clip limits every element to [a_min, a_max]: values above a_max become a_max, values below a_min become a_min
          return noise
      
      noise_img=[add_noise(T.ToTensor()(org_img),fac) for fac in (0.4,0.8)]
      noise_img=[T.ToPILImage()(n_img) for n_img in noise_img]
      
      ax1 = plt.subplot(131)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(132)
      ax2.set_title('noise_factor=0.4')
      ax2.imshow(np.array(noise_img[0]))
      
      ax3 = plt.subplot(133)
      ax3.set_title('noise_factor=0.8')
      ax3.imshow(np.array(noise_img[1]))
      
      plt.show()
      
    • Add mask block

    • def add_box(img,num_box,size=100):
          h,w=size,size
          img=np.asarray(img).copy()
          img_h,img_w=img.shape[:2]  # array shape is (height, width, channels)
          for k in range(num_box):
              # sample the top-left corner so the whole box stays inside the image
              y=np.random.randint(0,img_h-h)
              x=np.random.randint(0,img_w-w)
              img[y:y+h,x:x+w]=0
          img=Image.fromarray(img.astype('uint8'))
          return img
      
      block_img=[add_box(org_img,num_box=15)]
      
      ax1 = plt.subplot(121)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(122)
      ax2.set_title('add black boxes')
      ax2.imshow(np.array(block_img[0]))
      
      plt.show()
      
    • center mask block

    • def add_center(o_img,size=150):
          h,w=size,size  # half-height and half-width of the masked block
          img=np.asarray(o_img).copy()
          img_h,img_w=img.shape[:2]  # array shape is (height, width, channels)
          cy,cx=img_h//2,img_w//2  # true image center
          img[cy-h:cy+h,cx-w:cx+w]=0
          img=Image.fromarray(img.astype('uint8'))
          return img
      
      center_img=[add_center(org_img,size=200)]
      
      ax1 = plt.subplot(121)
      ax1.set_title('original')
      ax1.imshow(org_img)
      
      ax2 = plt.subplot(122)
      ax2.set_title('add_center')
      ax2.imshow(np.array(center_img[0]))
      
      plt.show()
      
Multi-process and multi-thread

  • Concurrency: performing multiple tasks alternately over a period of time. Example: when a single-core CPU handles multiple tasks, the operating system lets the tasks take turns executing.

  • Parallelism: performing multiple tasks truly simultaneously over a period of time. Example: when a multi-core CPU handles multiple tasks, the operating system assigns a task to each core, so the cores really do execute several tasks at the same time. Note that a multi-core CPU executes tasks in parallel: at any moment several tasks are running together.

  • The process is the smallest unit for the operating system to allocate resources, and the thread is the smallest unit for the operating system to schedule. (One is for allocation and the other is for scheduling)

  • An application includes at least one process, and a process includes one or more threads; a thread is a smaller, lighter-weight unit than a process.

  • Each process has an independent memory unit during execution, and multiple threads of a process share memory during execution.

  • Think of a factory (the CPU) whose power supply is limited and can run only one workshop at a time, that is, only one task at a time. The factory contains many workshops (processes), each performing a single task. Only one workshop can run at a time; if another workshop wants to work, the current one must rest.

  • A multi-core CPU is like multiple factories. It can allow multiple workshops (similar to multiple processes) to work together at the same time. Of course, they work in different factories, that is, they run on different CPU cores.

  • Each workshop has many workers, which correspond to threads. A workshop can have multiple workers, that is, a process can have multiple threads.

  • This is the basic relationship between CPU, cores, processes and threads. So far everything is harmonious and nothing calls for a lock. So why do locks exist? To answer that, we have to talk about memory usage.

    • The resources a process owns (memory, for example) are limited, and all threads in the process share them. It is like a workshop with limited floor space in which each worker needs a place to work. What happens when the places run out? We add a lock. When a work area is full, a lock on the door tells the other workers that there is no room and that they must wait for someone to come out. In other words, only after a thread finishes and releases its memory can another thread come in and use it.

    • A simple method was invented to prevent such conflicts: mutual exclusion (Mutex). When a thread uses a piece of shared memory, it locks the door so that no one else can come in. Those who arrive later must wait until the lock is released, that is, until the space is freed, before they can enter and use it.

    • Some rooms can hold up to n people at the same time; if more than n people arrive, the extra ones must wait outside. This corresponds to memory areas that may be used by only a fixed number of threads at once. The solution is to hang n keys at the door. Each person who goes in takes a key and hangs it back up on the way out. Anyone who arrives and finds no key left knows to wait in line at the door. This approach is called a semaphore, and it ensures that multiple threads do not conflict with each other.
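  • The mutex and the n-key room can both be sketched with Python's threading module. This is a minimal illustration, not a definitive pattern: the counter, the room size of 2, the thread counts and the sleep time are all assumptions chosen to keep the example short.

```python
import threading
import time

counter = 0
counter_lock = threading.Lock()          # mutex: one thread at a time

def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with counter_lock:               # lock the door, update, unlock
            counter += 1

room = threading.Semaphore(2)            # two "keys" hang at the door
inside = 0
max_inside = 0
stats_lock = threading.Lock()

def visit():
    """Enter the room, stay briefly, and record the peak occupancy."""
    global inside, max_inside
    with room:                           # take a key, or wait until one is returned
        with stats_lock:
            inside += 1
            max_inside = max(max_inside, inside)
        time.sleep(0.05)                 # simulated work inside the room
        with stats_lock:
            inside -= 1

workers = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
visitors = [threading.Thread(target=visit) for _ in range(6)]
for t in workers + visitors:
    t.start()
for t in workers + visitors:
    t.join()

print(counter)      # 40000: no update was lost
print(max_inside)   # never more than 2
```

    The `with` statement acquires the lock or semaphore on entry and releases it on exit, even if the body raises, which is why it is usually preferred over calling acquire() and release() by hand.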

  • Multiple processes are scheduled by the operating system and can run on different cores at the same time. Python threads, in contrast, take turns on the CPU: while one thread runs, the others rest, and when the next thread starts running, the rest go back to waiting.

  • Multi-threading: threading. It exploits the fact that the CPU and IO can work at the same time, so the CPU does not sit idle waiting for IO to complete.

    • Multi-threading, Thread (threading). Advantages: compared with processes, threads are more lightweight and use fewer resources. Disadvantages: compared with processes, threads can only run concurrently and cannot use multiple CPU cores (because of the GIL); compared with coroutines, the number that can be started is limited, they occupy more memory, and there is thread-switching overhead. Suitable for: IO-intensive work where the number of simultaneous tasks is modest.
  • Multi-process: multiprocessing. It uses the capabilities of multi-core CPUs to execute tasks truly in parallel.

    • Multi-process, Process (multiprocessing). Advantages: can use multiple CPU cores for parallel computation. Disadvantages: uses the most resources, and fewer processes can be started than threads. Suitable for: CPU-intensive computation.
  • Asynchronous IO: asyncio. Within a single thread, it exploits the fact that the CPU and IO can work at the same time to execute functions asynchronously.

    • Multi-coroutine, Coroutine (asyncio). Advantages: the smallest memory overhead and the largest number of tasks that can be started. Disadvantages: library support is limited (aiohttp works, requests does not) and the code is more complex. Suitable for: IO-intensive work with a very large number of simultaneous tasks, provided suitable async libraries exist.
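  • As a rough illustration of the coroutine approach, the sketch below runs three simulated IO waits inside a single thread. The function name fake_download and the 0.2 s delays are assumptions; asyncio.sleep stands in for a real non-blocking network call.

```python
import asyncio
import time

async def fake_download(name, delay):
    """Stand-in for an IO-bound call; asyncio.sleep yields control while waiting."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Three coroutines wait concurrently inside one thread.
    return await asyncio.gather(
        fake_download("a", 0.2),
        fake_download("b", 0.2),
        fake_download("c", 0.2),
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results)                 # ['a done', 'b done', 'c done']
print(f"{elapsed:.2f}s")       # roughly 0.2s rather than 0.6s
```

    A blocking call such as time.sleep or requests.get inside a coroutine would stall every other coroutine, which is why async-aware libraries are needed.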
  • There can be multiple threads in a process, and there can be multiple coroutines in a thread.

  • Use Lock to lock resources to prevent conflicting access

  • Use Queue to implement data communication between different threads/processes and implement the producer-consumer model

  • Use a thread pool or process pool (Pool) to simplify submitting tasks, waiting for completion, and collecting results

  • Use subprocess to start an external program as a child process and interact with its input and output
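  • For the last item, a minimal sketch of input and output interaction with subprocess.run. Using a second Python interpreter as the "external program" is an assumption made so that the example runs anywhere Python does.

```python
import subprocess
import sys

# Start a second Python interpreter as the external program (an illustrative choice),
# send text to its stdin, and capture what it writes to stdout.
child_code = "import sys; print(sys.stdin.read().upper())"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    input="hello from the parent",
    capture_output=True,
    text=True,
    check=True,          # raise CalledProcessError on a non-zero exit code
)
print(result.stdout.strip())   # HELLO FROM THE PARENT
```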

  • CPU-intensive (also called CPU-bound or compute-intensive) means that I/O finishes quickly while the CPU has to do a large amount of computation and processing; the hallmark is very high CPU usage.

    • For example: compression and decompression, encryption and decryption, regular expression search
  • I/O-intensive (I/O-bound) means that for most of the running time the CPU is waiting for I/O (for example hard disk or memory) reads and writes, so CPU usage stays low.

    • For example: file processing programs, web crawler programs, reading and writing database programs
  • The Global Interpreter Lock (GIL) is a mechanism used by programming-language interpreters to synchronize threads so that only one thread executes at any time. Even on a multi-core processor, an interpreter that uses a GIL lets only one thread execute at a time. The GIL simplifies the management of shared resources.

    • Python threads are real operating-system threads, but the interpreter has a GIL, and a thread must acquire it before it can execute Python code. The interpreter periodically makes the running thread release the GIL so that other threads get a chance to run: older versions of CPython did this after every 100 bytecode instructions, while CPython 3.2 and later use a time interval (5 ms by default). The GIL in effect locks the execution of all threads, so Python threads can only take turns. Even 100 threads on a 100-core CPU use only one core.
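  • In current CPython the hand-over of the GIL between threads is driven by a time interval, which can be inspected from Python. A small sketch, assuming a standard CPython 3 interpreter:

```python
import sys

# Interval, in seconds, after which CPython asks the running thread to release the GIL.
interval = sys.getswitchinterval()
print(interval)   # 0.005 unless it has been changed

# The value can be tuned with sys.setswitchinterval(); here it is simply set back
# to the value that was just read, so nothing changes.
sys.setswitchinterval(interval)
```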
  • Multi-threading is still useful for IO-intensive work. During I/O (read, write, send, recv, etc.) a thread releases the GIL, so the CPU and IO proceed in parallel and multi-threading can greatly speed up IO-intensive programs. For CPU-intensive work, however, multi-threading only adds switching overhead and makes the program slower.
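  • The IO-bound case can be checked with a small timing experiment. In this sketch time.sleep stands in for a blocking read or recv (it also releases the GIL while waiting); the task count and the 0.2 s delay are arbitrary assumptions.

```python
import threading
import time

def fake_io(task_id, results):
    """Stand-in for a blocking read or recv; time.sleep releases the GIL while waiting."""
    time.sleep(0.2)
    results.append(task_id)

tasks = range(5)

# One after another: total time is about the sum of the waits.
sequential_results = []
start = time.perf_counter()
for t in tasks:
    fake_io(t, sequential_results)
sequential_time = time.perf_counter() - start

# One thread per task: the waits overlap.
threaded_results = []
threads = [threading.Thread(target=fake_io, args=(t, threaded_results)) for t in tasks]
start = time.perf_counter()
for th in threads:
    th.start()
for th in threads:
    th.join()
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

    Replacing the sleep with a pure-Python computation would remove the advantage, because the threads would then compete for the GIL instead of waiting on IO.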

  • Use the multiprocessing module to achieve real parallel computing and take advantage of multi-core CPUs. Python provides multiprocessing precisely to work around the GIL.
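  • A minimal sketch of CPU-bound work spread over several processes. It uses concurrent.futures.ProcessPoolExecutor, a standard-library wrapper built on multiprocessing, rather than the multiprocessing API directly. The prime-counting task, the limit of 50,000 and the worker count are assumptions for illustration.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def run_parallel(limits, workers=4):
    """Spread the limits over worker processes, each with its own interpreter and GIL."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))

if __name__ == "__main__":   # guard so that child processes can import this module safely
    limits = [50_000] * 4
    start = time.perf_counter()
    print(run_parallel(limits))
    print(f"{time.perf_counter() - start:.2f}s")
```

    The speed-up depends on the number of cores and on the cost of starting processes and passing data between them, so very small tasks may not benefit.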

  • Creating a thread requires the system to allocate resources, and terminating a thread requires the system to reclaim them. If threads can be reused, the overhead of creation and termination is reduced. A thread's lifetime can be divided into three parts: startup time, the running time of the thread body, and destruction time. In multi-threaded processing, if threads cannot be reused, every task has to go through all three steps (start, run, destroy), which inevitably increases response time and reduces efficiency.

  • Benefits of using a thread pool: 1. Performance: thread resources are reused, which removes much of the overhead of creating and terminating threads; 2. Applicable scenarios: handling bursts of many requests, or work that needs many threads where each task is short; 3. Protection: it prevents the system from creating too many threads and becoming overloaded and slow; 4. Code: the thread pool syntax is more concise than creating and managing threads yourself.

  • Using a thread pool: the threads are created in advance and placed in the pool, and after finishing a task a thread is not destroyed but assigned the next task. This avoids creating threads repeatedly, saves the overhead of creation and destruction, and brings better performance and system stability. The basic principle: we put tasks into a queue and start N threads. Each thread takes a task from the queue, executes it, reports that it has finished, and then takes the next task, until the queue is empty and the threads exit.

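  • import threading
    import time
    from queue import Queue
    from threading import Thread

    # Create a queue instance to hold the tasks
    queue = Queue()
    # Define the task that the pool threads execute
    def do_job():
        while True:
            i = queue.get()
            time.sleep(1)
            print('index %s, current: %s' % (i, threading.current_thread()))
            queue.task_done()
    if __name__ == '__main__':
        # Create a thread pool of 3 threads
        for i in range(3):
            t = Thread(target=do_job)
            t.daemon=True # daemon threads exit when the main thread exits, even if still running
            t.start()
        # Simulate putting 10 tasks into the queue 3 seconds after the pool is created
        time.sleep(3)
        for i in range(10):
            queue.put(i)
        queue.join()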
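  • The hand-written pool above can also be expressed with the standard library's concurrent.futures.ThreadPoolExecutor, which manages the queue and the worker threads internally. This is a sketch of the same idea, three workers and ten tasks, with a shorter simulated delay chosen for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def do_job(i):
    """Simulated task: wait briefly, then report which pool thread handled it."""
    time.sleep(0.1)
    return i, threading.current_thread().name

# Three reusable worker threads process ten tasks.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(do_job, i) for i in range(10)]
    results = [f.result() for f in as_completed(futures)]

for index, thread_name in sorted(results):
    print(f"index {index}, thread: {thread_name}")
```

    Leaving the `with` block waits for all submitted tasks, so no explicit join or daemon flag is needed.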
    
  • [Figure: multi-threading and multi-process support in Python]

  • A semaphore is a synchronization object that maintains a count between 0 and a specified maximum. When a thread completes a wait on the semaphore, the count is decremented by one; when a thread releases the semaphore, the count is incremented by one. When the count is 0, a thread waiting on the semaphore blocks until the semaphore becomes signaled again. A count greater than 0 is the signaled state; a count equal to 0 is the nonsignaled state.

  • To make a child process a daemon of the main process, set daemon = True on the child process object before starting it. Multi-threading is one way to implement multitasking in a Python program; a thread is the smallest unit of program execution, and the threads belonging to one process share all the resources owned by that process.

Origin blog.csdn.net/weixin_43424450/article/details/133269326