Multi-threaded crawling and storage infrastructure

Multithreading:

A Python program normally runs in a single foreground (main) thread, so statements execute strictly in order: each one must finish before the next starts. Sometimes this wastes time. For example, if downloading one piece of data takes t1 and downloading a second takes t2, running them in order takes t1 + t2; but if you run both in background threads, the total time is only max(t1, t2). This advantage shows up most clearly in data-downloading (I/O-bound) tasks.

First, import the threading library:

import threading

xxx = threading.Thread(target=yyy, args=(...)): creates a thread object named xxx that runs the function yyy in the background; args is the tuple of arguments passed to yyy

xxx.start(): starts the thread xxx

xxx.join(): blocks here until the thread xxx has finished, before the next statement runs
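A minimal sketch of the three calls just described; the `download` function and its timings are made up for illustration:

```python
import threading
import time

def download(name, seconds):
    # Simulate a time-consuming download
    time.sleep(seconds)
    print(name + ' finished')

t1 = threading.Thread(target=download, args=('file1', 0.2))
t2 = threading.Thread(target=download, args=('file2', 0.4))
start = time.time()
t1.start()
t2.start()
t1.join()   # block until t1 finishes
t2.join()   # block until t2 finishes
elapsed = time.time() - start
print('all done')  # total time is about max(0.2, 0.4) s, not 0.2 + 0.4
```

With both threads running in the background, the elapsed time is close to the longer of the two downloads rather than their sum.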

Notes on storing and reading files:

First, two functions from the os library:

os.path.exists(path): returns a bool indicating whether path exists

xxx = os.getcwd(): returns the absolute path of the current working directory
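A quick sketch of both calls (the file name checked is arbitrary):

```python
import os

cwd = os.getcwd()                # absolute path of the current working directory
print(cwd)
print(os.path.exists(cwd))       # True: the current directory certainly exists
print(os.path.exists('no_such_file.xyz'))  # False unless that file happens to exist
```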

Code to write text (a string) to a file:

  

import os

def write_file(path, data):
    '''
    :param path: path of the file to write
    :param data: data to write to the file
    :return:
    '''
    f = open(path, 'wb')
    f.write(data.encode('utf-8'))
    f.close()
    print('File written successfully')

def get_path(name):
    '''
    :param name: the file name entered
    :return: the absolute path of the output file
    '''
    os_path = os.getcwd()  # get the path of the current folder
    file_name = name + '.txt'  # name of the new file
    return os_path + '\\' + file_name  # gives the absolute path (Windows separator)

def construct_file(name, data):
    '''
    Write the file; if the name already exists, ask whether to overwrite it
    :param name: the file name entered
    :param data: the data to write
    :return: 0 on success, 1 if a new name must be entered
    '''
    path = get_path(name)
    if not os.path.exists(path):
        write_file(path, data)
    else:
        print('File name already exists')
        print('Overwrite the file? Y|N')
        if input() == 'Y':
            print('File overwritten successfully')
            write_file(path, data)
            return 0
        else:
            print('Re-enter the file name...')
            return 1
    return 0

def write_dataintofile(data):
    '''
    :param data: the data to write to the file
    :return: no return value
    '''
    print('Please enter a file name...')
    while construct_file(input(), data):
        pass

 

Code to read a string from input while removing the blanks:

def inp():
    '''Read a line of input and drop every space character.'''
    sr = input()
    enter = []
    for i in sr:
        if i == ' ':
            continue
        enter.append(i)
    return ''.join(enter)
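The same space-stripping can be done with str.replace; a quick sketch with a hypothetical helper that takes the string as a parameter instead of reading input:

```python
def strip_spaces(s):
    # Same effect as the loop above: drop every space character
    return s.replace(' ', '')

print(strip_spaces('a b  c'))  # abc
```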

 

Here are the mode parameters for open(), so I always remember how each one behaves:

"Rt" read-only open a text file, allowing only read data
"wt" write-only open or create a text file, allowing only write data
"at" append to open a text file and write data at the end of the file
"rb" read-only open a binary file, allowing only read data
"wb" write-only open or create a binary file, allowing only write data
"ab" append to open a binary file and write data at the end of the file
"rt +" read and write to open a text file, allows read and write
"wt +" write open or create a text file, allowing write
"at +" to open a text file read and write, to allow reading or adding at the end of the data file
"rb +" read a binary file to open, allow read and write
"wb +" write open or create a binary file that allows reading and writing
"ab +" to open a binary file reading and writing, allow read, or append data at the end of file

Though I don't know why you rarely see people use the + modes. Do they have some disadvantage?
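For what it's worth, here is a small sketch of what a + mode does; the file name and contents are made up, and the file is kept in a temporary directory:

```python
import os
import tempfile

# Create a demo file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wt') as f:
    f.write('hello world')

with open(path, 'rt+') as f:   # read and write, without truncating
    text = f.read()            # read the existing content
    f.seek(0)
    f.write('HELLO')           # overwrite the first five characters in place

with open(path, 'rt') as f:
    print(f.read())            # HELLO world
```

Unlike "wt", "rt+" does not truncate the file, so you can read what is already there and patch it in place.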

Filling in a gap from earlier

Reading and downloading images:

from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request

if __name__ == '__main__':
    url = 'https://misaka.design.blog/'
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)  # open the Request so the headers are actually sent
    doc = res.read()
    dammit = UnicodeDammit(doc, ["utf-8", "gbk"])
    doc = dammit.unicode_markup
    soup = BeautifulSoup(doc, "html.parser")
    data = soup.select("a[class='post-thumbnail'] img")
    # fetch the image at the url just obtained
    img = urllib.request.urlopen(data[0]['src'])
    imga = img.read()  # read the binary data at that address
    print(imga)

    f = open(r'imag1.jpg', 'wb')  # create a jpg file
    f.write(imga)
    # writing the binary data to the file downloads the picture
    f.close()

 

Origin www.cnblogs.com/cherrypill/p/12407717.html