Design and architecture of a Baidu Cloud network disk resource search engine based on Python

Everyone knows that Baidu Cloud (the Baidu network disk) hosts a huge amount of shared resources: software, video self-study tutorials of every kind, e-books, and even movies and BT torrents, yet Baidu Cloud itself offers no corresponding search function. I personally find it painful to hunt down software and American TV shows this way, so I tried to build a search system for Baidu Cloud resources.
Resource crawler idea:
The most important asset of a search engine is a massive resource pool; once you have the resources, adding full-text retrieval on top of them already gives you a simple search engine. So the first step is to crawl Baidu Cloud's shared resources. The crawling idea: open the homepage of any Baidu Cloud sharer, yun.baidu.com/share/home?uk=xxxxxx&view=share#category/type=0, and you will see the sharer's followees and fans. By recursively traversing followees and fans you collect a large number of sharer uks, and from those uks a large number of shared resources.
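A minimal sketch of that traversal, using the same getfollowlist endpoint and field names as the crawler code further below (the seed uk is a placeholder, and Baidu may nowadays also require valid cookies or a Referer header):

# -*- coding: utf-8 -*-
import requests

URL_FOLLOW = ('http://yun.baidu.com/pcloud/friend/getfollowlist'
              '?query_uk={uk}&limit=20&start={start}')

def fetch_followed_uks(uk, start=0):
    # fetch one page of the sharer's follow list and return the followed uks
    data = requests.get(URL_FOLLOW.format(uk=uk, start=start), timeout=10).json()
    return [item['follow_uk'] for item in data.get('follow_list', [])]

# breadth-first expansion starting from one known sharer
seen = set()
queue = [1234567890]          # placeholder seed uk
while queue:
    uk = queue.pop(0)
    if uk in seen:
        continue
    seen.add(uk)
    for new_uk in fetch_followed_uks(uk):
        if new_uk not in seen:
            queue.append(new_uk)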
System implementation environment:
Language: Python
Operating system: Linux
Other middleware: nginx, MySQL, Sphinx
The system consists of several independent parts:
1. A standalone resource crawler based on requests
2. A resource indexing program based on the open-source full-text search engine Sphinx
3. A simple website built with Django + Bootstrap 3 and served with nginx 1.8 + FastCGI (flup) + Python; demo site: http://www.itjujiao.com (a minimal sketch of the search flow follows this list)
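The retrieval side ties parts 2 and 3 together: the site sends the query to searchd through the sphinxapi client bundled with Sphinx, gets back matching document IDs, and loads the corresponding filenames from MySQL. A minimal sketch, in which the index name 'share', the single share table and the use of shareid as the Sphinx document ID are illustrative assumptions rather than the real deployment:

# -*- coding: utf-8 -*-
import sphinxapi              # client shipped with Sphinx as api/sphinxapi.py
import MySQLdb as mdb

def search(keyword, offset=0, limit=20):
    cl = sphinxapi.SphinxClient()
    cl.SetServer('127.0.0.1', 9312)       # default searchd API port
    cl.SetLimits(offset, limit)
    result = cl.Query(keyword, 'share')   # 'share' is a hypothetical index name
    if not result or not result.get('matches'):
        return []
    ids = [str(m['id']) for m in result['matches']]

    conn = mdb.connect('127.0.0.1', 'root', '', 'baiduyun', charset='utf8')
    cur = conn.cursor()
    # the ids come from Sphinx, not from user input, so joining them is safe here
    cur.execute('SELECT shareid, userid, filename FROM share WHERE shareid IN (%s)'
                % ','.join(ids))
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows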
PS:
At present the crawler has collected roughly 40 million (4000W) records. Sphinx's memory requirements are very high, which turned out to be a huge pitfall.
Baidu rate-limits the crawler's IP, so I wrote a simple collector that scrapes proxies from xicidaili; requests can be configured to go through an HTTP proxy.
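For reference, pointing requests at one of the collected proxies only takes a proxies dict; the address below is the first entry of the crawler's PROXY_LIST, and the uk in the URL is a placeholder:

import requests

proxies = {'http': 'http://42.121.33.160:809'}   # proxy collected from xicidaili
r = requests.get('http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk=123456&limit=20&start=0',
                 proxies=proxies, timeout=10)
print(r.status_code)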
Word segmentation is handled by Sphinx, which supports Chinese only as unigram (single-character) segmentation; that is rather crude and the results are not as good as expected. English tokenization also leaves a lot of room for improvement: for example, searching for xart does not return x-art, even though x-art is exactly the result set I want (you know).
The database is MySQL. Because of the practical upper limit on records per table, the resource table is split into 10 tables. After the first crawl Sphinx builds a full index; after that, only incremental indexing is performed.
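The article does not say how records are routed across the 10 tables; a common choice, shown here purely as an assumption, is to take the sharer's uk modulo the number of tables so that all of one user's shares land in the same table (the table names share_0 .. share_9 are hypothetical):

NUM_SHARDS = 10

def share_table_for(userid):
    # hypothetical routing rule: userid mod 10 picks the table
    return 'share_%d' % (int(userid) % NUM_SHARDS)

def insert_share(dbcurr, userid, filename, shareid):
    table = share_table_for(userid)
    # parameterized values avoid the quoting problems of raw string formatting
    dbcurr.execute(
        'INSERT INTO %s (userid, filename, shareid, status) VALUES (%%s, %%s, %%s, 0)' % table,
        (userid, filename, shareid))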
Subsequent optimization:
1. Word segmentation: the current search results are not ideal, and pointers from anyone more experienced are welcome. For example, searching for "The Secret of the Scroll of Kung Fu Panda" returns nothing, while "Kung Fu Panda" does return a result set (e.g. Kung Fu Panda 3, English audio with Chinese and English subtitles.mp4; Kung Fu Panda 2.Kung.Fu.Panda.2.2011.BDrip.720P, Mandarin/Cantonese/English/Taiwanese audio, Chinese and English effect subtitles.mp4; Kung Fu Panda 3 (Korean version) 2016, HD with Chinese subtitles.mkv; etc.), and so does "The Secret of the Scroll" (e.g. [US] Kung Fu Panda: Secrets of the Scroll.2016.1080p.mp4; Kung Fu Panda: Secrets of the Scroll, HD1280, Chinese and English subtitles.mp4; etc.).
2. Data deduplication: many of the crawled entries turn out to be duplicates of the same shared resource; deduplication will be handled later.

Crawler implementation code (this is just the idea; the code is a bit messy):
#coding: utf8

import re
import time
import json
import threading
from Queue import Queue        # Python 2 standard library

import requests
import MySQLdb as mdb

# MySQL connection settings (local instance, root with an empty password)
DB_HOST = '127.0.0.1'
DB_USER = 'root'
DB_PASS = ''


# regexes for pulling the task parameters back out of a request URL
re_start = re.compile(r'start=(\d+)')
re_uid = re.compile(r'query_uk=(\d+)')
re_pptt = re.compile(r'&pptt=(\d+)')
re_urlid = re.compile(r'&urlid=(\d+)')

# page size for follow/fans lists and for share lists
ONEPAGE = 20
ONESHAREPAGE = 20

# Baidu Cloud pcloud API endpoints; urlid carries the id of the urlids row that generated the request
URL_SHARE = 'http://yun.baidu.com/pcloud/feed/getsharelist?auth_type=1&start={start}&limit=20&query_uk={uk}&urlid={id}'
URL_FOLLOW = 'http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}&urlid={id}'
URL_FANS = 'http://yun.baidu.com/pcloud/friend/getfanslist?query_uk={uk}&limit=20&start={start}&urlid={id}'

QNUM = 1000
hc_q = Queue(20)      # (type, url) tasks waiting to be fetched
hc_r = Queue(QNUM)    # (response body, url) pairs waiting to be parsed

success = 0
failed = 0

# proxy pool collected from xicidaili; only the last field (a usage counter) is touched below
PROXY_LIST = [[0, 10, "42.121.33.160", 809, "", "", 0],
                [5, 0, "218.97.195.38", 81, "", "", 0],
                ]

# fetcher thread: takes (type, url) tasks from hc_q, downloads them with a shared
# requests session and pushes (response body, url) onto hc_r
def req_worker(inx):
    s = requests.Session()
    while True:
        req_item = hc_q.get()
        
        req_type = req_item[0]
        url = req_item[1]
        r = s.get(url)
        hc_r.put((r.text, url))
        print "req_worker#", inx, url
        
# parser thread: pulls responses off hc_r, decodes the JSON and writes users, shares
# and follow-up page tasks into MySQL (the string-formatted SQL below is fragile;
# parameterized queries would be safer)
def response_worker():
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'baiduyun', charset='utf8')
    dbcurr = dbconn.cursor()
    dbcurr.execute('SET NAMES utf8')
    dbcurr.execute('set global wait_timeout=60000')
    while True:
        
        metadata, effective_url = hc_r.get()
        #print "response_worker:", effective_url
        try:
            tnow = int(time.time())
            id = re_urlid.findall(effective_url)[0]
            start = re_start.findall(effective_url)[0]
            if True:
                if 'getfollowlist' in effective_url: #type = 1
                    follows = json.loads(metadata)
                    uid = re_uid.findall(effective_url)[0]
                    if "total_count" in follows.keys() and follows["total_count"]>0 and str(start) == "0":
                        for i in range((follows["total_count"]-1)/ONEPAGE):
                            try:
                                dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 1, 0)' % (uid, str(ONEPAGE*(i+1)), str(ONEPAGE)))
                            except Exception as ex:
                                print "E1", str(ex)
                                pass
                    
                    if "follow_list" in follows.keys():
                        for item in follows["follow_list"]:
                            try:
                                dbcurr.execute('INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, "%s", 0, 0, 0, %s)' % (item['follow_uk'], item['follow_uname'], str(tnow)))
                            except Exception as ex:
                                print "E13", str(ex)
                                pass
                    else:
                        print "delete 1", uid, start
                        dbcurr.execute('delete from urlids where uk=%s and type=1 and start>%s' % (uid, start))
                elif 'getfanslist' in effective_url: #type = 2
                    fans = json.loads(metadata)
                    uid = re_uid.findall(effective_url)[0]
                    if "total_count" in fans.keys() and fans["total_count"]>0 and str(start) == "0":
                        for i in range((fans["total_count"]-1)/ONEPAGE):
                            try:
                                dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 2, 0)' % (uid, str(ONEPAGE*(i+1)), str(ONEPAGE)))
                            except Exception as ex:
                                print "E2", str(ex)
                                pass
                    
                    if "fans_list" in fans.keys():
                        for item in fans["fans_list"]:
                            try:
                                dbcurr.execute('INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, "%s", 0, 0, 0, %s)' % (item['fans_uk'], item['fans_uname'], str(tnow)))
                            except Exception as ex:
                                print "E23", str(ex)
                                pass
                    else:
                        print "delete 2", uid, start
                        dbcurr.execute('delete from urlids where uk=%s and type=2 and start>%s' % (uid, start))
                else:
                    shares = json.loads(metadata)
                    uid = re_uid.findall(effective_url)[0]
                    if "total_count" in shares.keys() and shares["total_count"]>0 and str(start) == "0":
                        for i in range((shares["total_count"]-1)/ONESHAREPAGE):
                            try:
                                dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 0, 0)' % (uid, str(ONESHAREPAGE*(i+1)), str(ONESHAREPAGE)))
                            except Exception as ex:
                                print "E3", str(ex)
                                pass
                    if "records" in shares.keys():
                        for item in shares["records"]:
                            try:
                                dbcurr.execute('INSERT INTO share(userid, filename, shareid, status) VALUES(%s, "%s", %s, 0)' % (uid, item['title'], item['shareid']))
                            except Exception as ex:
                                #print "E33", str(ex), item
                                pass
                    else:
                        print "delete 0", uid, start
                        dbcurr.execute('delete from urlids where uk=%s and type=0 and start>%s' % (uid, str(start)))
                dbcurr.execute('delete from urlids where id=%s' % (id, ))
                dbconn.commit()
        except Exception as ex:
            print "E5", str(ex), id

        
        pid = re_pptt.findall(effective_url)
        
        if pid:
            print "pid>>>", pid
            ppid = int(pid[0])
            PROXY_LIST[ppid][6] -= 1
    dbcurr.close()
    dbconn.close()
    
# scheduler thread: turns pending rows in urlids/user into fetch tasks on hc_q
def worker():
    global success, failed
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'baiduyun', charset='utf8')
    dbcurr = dbconn.cursor()
    dbcurr.execute('SET NAMES utf8')
    dbcurr.execute('set global wait_timeout=60000')
    while True:

        #dbcurr.execute('select * from urlids where status=0 order by type limit 1')
        # NOTE: type>0 skips the share-list pages (type=0); swap in the commented query above to process every type
        dbcurr.execute('select * from urlids where status=0 and type>0 limit 1')
        d = dbcurr.fetchall()
        #print d
        if d:
            id = d[0][0]
            uk = d[0][1]
            start = d[0][2]
            limit = d[0][3]
            type = d[0][4]
            dbcurr.execute('update urlids set status=1 where id=%s' % (str(id),))
            url = ""
            if type == 0:
                url = URL_SHARE.format(uk=uk, start=start, id=id).encode('utf-8')
            elif  type == 1:
                url = URL_FOLLOW.format(uk=uk, start=start, id=id).encode('utf-8')
            elif type == 2:
                url = URL_FANS.format(uk=uk, start=start, id=id).encode('utf-8')
            if url:
                hc_q.put((type, url))
                
            #print "processed", url
        else:
            dbcurr.execute('select * from user where status=0 limit 1000')
            d = dbcurr.fetchall()
            if d:
                for item in d:
                    try:
                        dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 0, 0)' % (item[1], str(ONESHAREPAGE)))
                        dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 1, 0)' % (item[1], str(ONEPAGE)))
                        dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 2, 0)' % (item[1], str(ONEPAGE)))
                        dbcurr.execute('update user set status=1 where userid=%s' % (item[1],))
                    except Exception as ex:
                        print "E6", str(ex)
            else:
                time.sleep(1)
                
        dbconn.commit()
    dbcurr.close()
    dbconn.close()
        
    
# 16 fetcher threads plus one scheduler thread; the main thread runs the parser loop
for item in range(16):
    t = threading.Thread(target = req_worker, args = (item,))
    t.setDaemon(True)
    t.start()

s = threading.Thread(target = worker, args = ())
s.setDaemon(True)
s.start()

response_worker()
