Resource crawler idea:
What matters most for a search engine is a massive pool of resources; once you have the resources, building full-text retrieval on top of them gives you a simple search engine. So the first step is to crawl the shared resources on Baidu Cloud. The crawling idea: open the home page of any Baidu Cloud sharer, yun.baidu.com/share/home?uk=xxxxxx&view=share#category/type=0, and you will find that besides the shares it also lists the sharer's subscriptions and fans. By recursively traversing subscriptions and fans, you can collect a large number of sharer uks, and from those uks, a large number of shared resources.
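The recursive traversal described above is essentially a breadth-first walk over the subscriber/fan graph. A minimal sketch, where `get_follows` and `get_fans` are hypothetical stand-ins for the getfollowlist/getfanslist API calls (the real crawler issues HTTP requests and parses JSON):

```python
from collections import deque

# Toy stand-in for the follow/fan graph returned by the Baidu Cloud API.
SAMPLE_GRAPH = {
    1001: {"follows": [1002, 1003], "fans": [1004]},
    1002: {"follows": [1003], "fans": []},
    1003: {"follows": [], "fans": [1001]},
    1004: {"follows": [1005], "fans": []},
    1005: {"follows": [], "fans": []},
}

def get_follows(uk):
    return SAMPLE_GRAPH.get(uk, {}).get("follows", [])

def get_fans(uk):
    return SAMPLE_GRAPH.get(uk, {}).get("fans", [])

def collect_uks(seed_uk):
    """Breadth-first traversal of subscriptions and fans from one sharer."""
    seen = {seed_uk}
    queue = deque([seed_uk])
    while queue:
        uk = queue.popleft()
        for neighbor in get_follows(uk) + get_fans(uk):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

The `seen` set is what keeps the recursion from revisiting the same sharer; in the real crawler that role is played by the `user` table's status column.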
System implementation environment:
Language: python
Operating system: Linux
Other middleware: nginx mysql sphinx
The system consists of several independent parts:
1. Independent resource crawler based on requests
2. Resource indexing program based on the open source full-text search engine sphinx
3. A simple website developed with Django + bootstrap3, served by nginx 1.8 + fastCGI (flup) + python. Demo site: http://www.itjujiao.com
PS:
At present the crawler has collected about 40 million (4000W) records. Sphinx's memory requirements are huge, a real pit.
Baidu bans the crawler's IP, so I wrote a simple program to collect proxies from xicidaili; requests can be configured to use an HTTP proxy.
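Rotating the collected proxies through requests is straightforward: each call accepts a `proxies=` dict. A minimal sketch; the pool entries and the round-robin rule are illustrative, not the post's actual scheduler:

```python
import requests

# Hypothetical proxy pool collected from xicidaili; addresses are placeholders.
PROXIES = [
    {"http": "http://42.121.33.160:809"},
    {"http": "http://218.97.195.38:81"},
]

def pick_proxy(attempt):
    """Round-robin over the pool so a banned IP only costs one request."""
    return PROXIES[attempt % len(PROXIES)]

def fetch_with_proxy(session, url, attempt):
    # requests routes the call through the chosen proxy via the proxies= kwarg.
    return session.get(url, proxies=pick_proxy(attempt), timeout=10)
```

In practice the pool would also track failure counts per proxy (the crawler below keeps a counter in `PROXY_LIST[i][6]` for this purpose).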
Word segmentation is handled by sphinx, which supports Chinese. Chinese is indexed with unigram (single-character) segmentation, which is rather crude, and the results are not as good as expected. English tokenization also leaves plenty of room for improvement: for example, when I search for xart, x-art does not appear in the results, even though x-art is exactly the result set I want (you know).
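One possible fix for the xart/x-art case is to normalize terms at both index and query time, folding case and dropping punctuation so the variants collapse to one token. This is only a sketch of the idea; inside sphinx itself the equivalent knobs would be `charset_table` / `ignore_chars`:

```python
import re

def normalize(term):
    # Fold case, then drop everything that is not a letter, digit, or CJK
    # character, so "x-art", "X.Art", and "xart" all become the same token.
    return re.sub(r'[^0-9a-z\u4e00-\u9fff]+', '', term.lower())
```

Applying the same function to both the indexed titles and the user's query is what makes the two sides agree.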
The database is MySQL. Considering the record limit of a single table, the resource table is split into 10 tables. Sphinx does a full index build after the first crawl, then incremental index builds afterwards.
Subsequent optimization:
1. Word segmentation. The current segmented search results are not ideal; pointers from anyone knowledgeable are welcome. For example, searching for "Kung Fu Panda: Secrets of the Scroll" returns nothing, while "Kung Fu Panda" does return results (Kung Fu Panda 3, English/Chinese subtitles.mp4; Kung Fu Panda 2.Kung.Fu.Panda.2.2011.BDrip.720P, four-language audio with Chinese/English special-effects subtitles.mp4; Kung Fu Panda 3 (Korean version) 2016, HD Chinese subtitles.mkv; etc.), and so does "Secrets of the Scroll" ([US] Kung Fu Panda: Secrets of the Scroll.2016.1080p.mp4; Kung Fu Panda: Secrets of the Scroll HD1280, Chinese/English dual subtitles.mp4; etc.).
2. Data deduplication. Many of the crawled records turn out to be re-shares of the same resource; deduplication will be handled later.
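A simple first pass at deduplication is to canonicalize each filename (case, whitespace) and keep only the first record per canonical key. This is a sketch of one possible rule, not the post's eventual implementation:

```python
import hashlib

def dedup_key(filename):
    # Collapse runs of whitespace and fold case before hashing, so trivially
    # re-titled re-shares of the same file map to the same key. The exact
    # canonicalization rule here is hypothetical.
    canonical = " ".join(filename.lower().split())
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def dedup(records):
    """Keep the first record for each canonical filename key."""
    seen = set()
    unique = []
    for name in records:
        key = dedup_key(name)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique
```

A stronger version would key on the share's file size or Baidu's own content hash rather than the title, since re-shares are often renamed.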
Crawler implementation code (just the idea; the code is a bit messy):
```python
#coding: utf8
import re
import urllib2
import time
from Queue import Queue
import threading, errno, datetime
import json
import requests
import MySQLdb as mdb

DB_HOST = '127.0.0.1'
DB_USER = 'root'
DB_PASS = ''

re_start = re.compile(r'start=(\d+)')
re_uid = re.compile(r'query_uk=(\d+)')
re_pptt = re.compile(r'&pptt=(\d+)')
re_urlid = re.compile(r'&urlid=(\d+)')

ONEPAGE = 20
ONESHAREPAGE = 20

URL_SHARE = 'http://yun.baidu.com/pcloud/feed/getsharelist?auth_type=1&start={start}&limit=20&query_uk={uk}&urlid={id}'
URL_FOLLOW = 'http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}&urlid={id}'
URL_FANS = 'http://yun.baidu.com/pcloud/friend/getfanslist?query_uk={uk}&limit=20&start={start}&urlid={id}'

QNUM = 1000
hc_q = Queue(20)    # URLs waiting to be fetched
hc_r = Queue(QNUM)  # fetched responses waiting to be parsed

success = 0
failed = 0

PROXY_LIST = [[0, 10, "42.121.33.160", 809, "", "", 0],
              [5, 0, "218.97.195.38", 81, "", "", 0], ]

def req_worker(inx):
    """Fetch worker: pulls URLs off hc_q and pushes response bodies onto hc_r."""
    s = requests.Session()
    while True:
        req_item = hc_q.get()
        req_type = req_item[0]
        url = req_item[1]
        r = s.get(url)
        hc_r.put((r.text, url))
        print "req_worker#", inx, url

def response_worker():
    """Parse worker: decodes the JSON responses and writes users/shares to MySQL."""
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'baiduyun', charset='utf8')
    dbcurr = dbconn.cursor()
    dbcurr.execute('SET NAMES utf8')
    dbcurr.execute('set global wait_timeout=60000')
    while True:
        metadata, effective_url = hc_r.get()
        #print "response_worker:", effective_url
        try:
            tnow = int(time.time())
            id = re_urlid.findall(effective_url)[0]
            start = re_start.findall(effective_url)[0]
            if 'getfollowlist' in effective_url:  # type = 1
                follows = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                # On the first page, enqueue the remaining pages of the list.
                if "total_count" in follows.keys() and follows["total_count"] > 0 and str(start) == "0":
                    for i in range((follows["total_count"] - 1) / ONEPAGE):
                        try:
                            dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 1, 0)' % (uid, str(ONEPAGE * (i + 1)), str(ONEPAGE)))
                        except Exception as ex:
                            print "E1", str(ex)
                if "follow_list" in follows.keys():
                    for item in follows["follow_list"]:
                        try:
                            dbcurr.execute('INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, "%s", 0, 0, 0, %s)' % (item['follow_uk'], item['follow_uname'], str(tnow)))
                        except Exception as ex:
                            print "E13", str(ex)
                else:
                    print "delete 1", uid, start
                    dbcurr.execute('delete from urlids where uk=%s and type=1 and start>%s' % (uid, start))
            elif 'getfanslist' in effective_url:  # type = 2
                fans = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                if "total_count" in fans.keys() and fans["total_count"] > 0 and str(start) == "0":
                    for i in range((fans["total_count"] - 1) / ONEPAGE):
                        try:
                            dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 2, 0)' % (uid, str(ONEPAGE * (i + 1)), str(ONEPAGE)))
                        except Exception as ex:
                            print "E2", str(ex)
                if "fans_list" in fans.keys():
                    for item in fans["fans_list"]:
                        try:
                            dbcurr.execute('INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, "%s", 0, 0, 0, %s)' % (item['fans_uk'], item['fans_uname'], str(tnow)))
                        except Exception as ex:
                            print "E23", str(ex)
                else:
                    print "delete 2", uid, start
                    dbcurr.execute('delete from urlids where uk=%s and type=2 and start>%s' % (uid, start))
            else:  # getsharelist, type = 0
                shares = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                if "total_count" in shares.keys() and shares["total_count"] > 0 and str(start) == "0":
                    for i in range((shares["total_count"] - 1) / ONESHAREPAGE):
                        try:
                            dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 0, 0)' % (uid, str(ONESHAREPAGE * (i + 1)), str(ONESHAREPAGE)))
                        except Exception as ex:
                            print "E3", str(ex)
                if "records" in shares.keys():
                    for item in shares["records"]:
                        try:
                            dbcurr.execute('INSERT INTO share(userid, filename, shareid, status) VALUES(%s, "%s", %s, 0)' % (uid, item['title'], item['shareid']))
                        except Exception as ex:
                            #print "E33", str(ex), item
                            pass
                else:
                    print "delete 0", uid, start
                    dbcurr.execute('delete from urlids where uk=%s and type=0 and start>%s' % (uid, str(start)))
            # The page has been processed; remove it from the work table.
            dbcurr.execute('delete from urlids where id=%s' % (id,))
            dbconn.commit()
        except Exception as ex:
            print "E5", str(ex), id
        # Decrement the usage counter of the proxy that served this request.
        pid = re_pptt.findall(effective_url)
        if pid:
            print "pid>>>", pid
            ppid = int(pid[0])
            PROXY_LIST[ppid][6] -= 1
    dbcurr.close()
    dbconn.close()

def worker():
    """Scheduler: turns pending urlids rows into fetch tasks and seeds new users."""
    global success, failed
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'baiduyun', charset='utf8')
    dbcurr = dbconn.cursor()
    dbcurr.execute('SET NAMES utf8')
    dbcurr.execute('set global wait_timeout=60000')
    while True:
        #dbcurr.execute('select * from urlids where status=0 order by type limit 1')
        dbcurr.execute('select * from urlids where status=0 and type>0 limit 1')
        d = dbcurr.fetchall()
        #print d
        if d:
            id = d[0][0]
            uk = d[0][1]
            start = d[0][2]
            limit = d[0][3]
            type = d[0][4]
            dbcurr.execute('update urlids set status=1 where id=%s' % (str(id),))
            url = ""
            if type == 0:
                url = URL_SHARE.format(uk=uk, start=start, id=id).encode('utf-8')
            elif type == 1:
                url = URL_FOLLOW.format(uk=uk, start=start, id=id).encode('utf-8')
            elif type == 2:
                url = URL_FANS.format(uk=uk, start=start, id=id).encode('utf-8')
            if url:
                hc_q.put((type, url))
                #print "processed", url
        else:
            # No pending pages: seed page-0 tasks for users we haven't visited yet.
            dbcurr.execute('select * from user where status=0 limit 1000')
            d = dbcurr.fetchall()
            if d:
                for item in d:
                    try:
                        dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 0, 0)' % (item[1], str(ONESHAREPAGE)))
                        dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 1, 0)' % (item[1], str(ONEPAGE)))
                        dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 2, 0)' % (item[1], str(ONEPAGE)))
                        dbcurr.execute('update user set status=1 where userid=%s' % (item[1],))
                    except Exception as ex:
                        print "E6", str(ex)
            else:
                time.sleep(1)
        dbconn.commit()
    dbcurr.close()
    dbconn.close()

# 16 fetch threads, 1 scheduler thread; the main thread parses responses.
for item in range(16):
    t = threading.Thread(target=req_worker, args=(item,))
    t.setDaemon(True)
    t.start()

s = threading.Thread(target=worker, args=())
s.setDaemon(True)
s.start()

response_worker()
```