Python distributed crawler practice

Recently I read Fan Chuanhui's book Python Crawler Development and Project Practice and worked through the distributed crawler in Chapter 7.

The code would not run in my own environment, and after a good deal of digging I found the failures came down to the following points:

Book environment: Python 2.7, Linux

Local environment: Python 3.4, Windows

1. The book imports the task queue with import Queue. That module no longer exists under that name in Python 3, and since the queues here are shared between processes, it should be from multiprocessing import Queue instead.
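
For example, the change looks roughly like this:

# Python 2, as in the book
import Queue
url_q = Queue.Queue()

# Python 3, as used here
from multiprocessing import Queue
url_q = Queue()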

2. On Windows a lambda cannot be serialized, so to use custom callables with BaseManager they have to be ordinary functions defined at the top of the code, before they are used, as follows:

from multiprocessing import Queue
from multiprocessing.managers import BaseManager

# On Windows a lambda cannot be serialized, so the callables registered with
# BaseManager must be ordinary module-level functions defined before use.
url_q = Queue()
result_q = Queue()

def get_url_q():
    return url_q

def get_result_q():
    return result_q

class NodeManager(object):
    def start_manager(self, url_q, result_q):
        # Register the two queues on the network so worker nodes can fetch them
        BaseManager.register('get_task_queue', callable=get_url_q)
        BaseManager.register('get_result_queue', callable=get_result_q)

3. On Windows, passing '' as the address argument of BaseManager does not mean the local machine (on Linux it does), so the local address has to be written out explicitly. In addition, under Python 3 the authkey argument must be bytes, so it needs to be encoded, as follows:

manager=BaseManager(address=('127.0.0.1',8001),authkey='baike'.encode('utf-8'))
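
For reference, the SpiderWork side then connects with the same explicit address and encoded authkey; a minimal sketch (the registered names must match the ones used in NodeManager):

from multiprocessing.managers import BaseManager

class SpiderWork(object):
    def __init__(self):
        # Register the names only; the actual queues live on the manager side
        BaseManager.register('get_task_queue')
        BaseManager.register('get_result_queue')
        # Same explicit address and bytes authkey as above
        self.m = BaseManager(address=('127.0.0.1', 8001), authkey='baike'.encode('utf-8'))
        self.m.connect()
        self.task = self.m.get_task_queue()
        self.result = self.m.get_result_queue()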

4. Beyond that, the main problem I hit in practice was converting the data between encoding formats. Mismatched encodings make the crawler silently stall at some step (PS: mine got stuck right after parsing a URL and never moved on).
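
What helped was decoding the page explicitly as soon as it is downloaded, so everything downstream works with one encoding; a minimal sketch, assuming a urllib-based downloader (the function name and the utf-8 choice are only for illustration):

import urllib.request

def download(url):
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        return None
    # Decode to str right away; adjust the codec if your pages are not UTF-8
    return response.read().decode('utf-8', errors='ignore')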

5. The code is run in two console windows: the NodeManager process only starts the manager and prints nothing, while the SpiderWork process does the crawling and produces the output.
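
Assuming the two scripts are saved as NodeManager.py and SpiderWork.py (the file names are only for illustration), the run order is roughly:

# Window 1: start the control node first; it prints nothing
python NodeManager.py

# Window 2: start the crawler node; the crawl output appears here
python SpiderWork.py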
