A magnet link and BT torrent search engine built on a DHT protocol crawler

System features and technologies used.

The system consists of several separate parts:

  • A web crawler built with the Python Scrapy framework, which crawls magnet links and torrents;

  • A simple website built with the PHP CI framework;

  • The search engine currently queries MySQL directly; Sphinx is being considered for the future;

  • Chinese word segmentation.

    Written in PHP: a simple version of word segmentation based on the reverse maximum matching algorithm. As for the dictionary, ha ha, it directly uses the Chrome word list; the word list can be downloaded at this address: http://www.mdbg.net/chindict/chindict.php?page=cedict .

  • New word discovery mechanism

    A mechanism for discovering new words based on search keywords.

    The dictionary still has many problems. For example, the latest movie titles cannot be segmented correctly: "Interstellar" is split into "Star" and "through", so "The Stolen Years", "Cross Fire", "snail speed", "The Great Gatsby", "Fake Fiction", "Star Trek" and "Steve Jobs Biography" also appear in the search results.

    Of course, this is not too big a problem. "The Hobbit" was split into "Huo", "bit" and "people", but fortunately that produced no valid matches and did not throw the search results into chaos. This kind of over-segmentation can be solved by enlarging the dictionary, so I plan to write a crawler for Douban and add all Douban movie titles to the dictionary to aid segmentation.

  • Resource aliases

    This will make our system smarter and more user-friendly. When we search on Baidu, we often run into this situation: when we search for "walnut-cracking tool", Baidu asks, "Did you mean: Nokia?" When we search for "the best language in the world", Baidu asks, "Did you mean: PHP?" Similarly, when a user searches for the English title "Interstellar", the system should return the results that match the Chinese title of Interstellar.

    We do not need anything as complex as online translation; we just need to keep crawling Douban and build a table of the movies' English titles. Moreover, to cater to the special needs of some otaku, we also need a Japanese table.

  • English word segmentation

    Does English need segmentation too? Aren't spaces the word boundaries? It is normal to think so; I thought the same at first, so for English I simply used PHP's explode(' ', $query) function.

    But just now (2015-02-01 21:59:35) I noticed a problem in the search logs: today the keyword xart was searched 169 times, while the keyword x-art was searched only 54 times, yet x-art is the official name (do not ask me why I know so much). So I just adjusted the code so that xart and x-art are both directed to x-art.

  • BitTorrent: the early versions of BitTorrent were developed in Python and are open source, so many libraries are used directly from BitTorrent, and a number of libraries and helper functions were ported directly to PHP. (The bencode written by Petru Paler is so well done that my wife asked me: why are you kneeling while you write code?)
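The reverse maximum matching approach mentioned in the list above can be sketched roughly as follows. This is only an illustration in Python, not the PHP code the site actually runs; the function name `reverse_max_match`, its parameters, and the tiny dictionaries are all hypothetical. The Chinese title 霍比特人 is my assumed original for the Hobbit example: scanning from the end of the string, the longest dictionary word is taken each time, and a single character is emitted when nothing matches.

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Segment text by reverse maximum matching (hypothetical sketch).

    Scans from the end of the text, each time taking the longest suffix
    found in the dictionary; falls back to a single character.
    """
    if max_len is None:
        # Longest dictionary entry bounds the window size.
        max_len = max((len(word) for word in dictionary), default=1)

    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break

    words.reverse()  # words were collected from right to left
    return words


# Without the full title in the dictionary the title is over-segmented.
small_dict = {"比特"}
print(reverse_max_match("霍比特人", small_dict))

# Adding the title to the dictionary keeps it as one word.
bigger_dict = {"比特", "霍比特人"}
print(reverse_max_match("霍比特人", bigger_dict))
```

With the small dictionary the title falls apart into three pieces, which is the over-segmentation described above; adding the full title to the dictionary is what importing movie titles is meant to achieve.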

Anyone who understands how P2P works knows that BT does not require a central server, because each node acts as both a client and a server. So, based on dhtfck by the great 0x0d, I wrote a DHT crawler that disguises itself as a node in the DHT network. When another client wants to download a torrent, it broadcasts a query in the DHT network; when the query reaches my node, I know: oh, someone wants to download this torrent, so this torrent must exist somewhere in the DHT network. So I save this torrent's information to MySQL.

For more detail on the whole process, see the DHT protocol.
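To make the "someone asked my node" step concrete, here is a minimal sketch of how an incoming DHT query could be decoded to pull out the infohash. It is not the crawler's actual code: `bdecode`, `_decode_at` and `extract_infohash` are hypothetical names, and the decoder is a small stand-in for a real bencode library. The message layout (a bencoded dictionary with `y`, `q` and `a` keys, and the `get_peers` / `announce_peer` queries) follows the BitTorrent DHT specification (BEP 5).

```python
def _decode_at(data, i):
    """Decode one bencoded value starting at index i; return (value, next_index)."""
    lead = data[i:i + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = _decode_at(data, i)
            items.append(item)
        return items, i + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = _decode_at(data, i)
            value, i = _decode_at(data, i)
            result[key] = value
        return result, i + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", i)
        start = colon + 1
        end = start + int(data[i:colon])
        if end > len(data):
            raise ValueError("truncated string")
        return data[start:end], end
    raise ValueError("invalid bencoded data")


def bdecode(data):
    """Decode a complete bencoded message."""
    value, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing data")
    return value


def extract_infohash(packet):
    """Return the hex infohash of a get_peers/announce_peer query, else None."""
    try:
        msg = bdecode(packet)
    except (ValueError, TypeError):
        return None
    if not isinstance(msg, dict) or msg.get(b"y") != b"q":
        return None
    if msg.get(b"q") not in (b"get_peers", b"announce_peer"):
        return None
    args = msg.get(b"a")
    if not isinstance(args, dict):
        return None
    info_hash = args.get(b"info_hash")
    if isinstance(info_hash, bytes) and len(info_hash) == 20:
        return info_hash.hex()
    return None


# Sample get_peers query in the format shown in BEP 5.
sample = (b"d1:ad2:id20:abcdefghij01234567899:info_hash"
          b"20:mnopqrstuvwxyz123456e1:q9:get_peers1:t2:aa1:y1:qe")
print(extract_infohash(sample))
```

A real crawler would also have to answer the query so that other nodes keep talking to it; that part is left out here.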

Note: I only save the torrent's infohash. With this information you can build a magnet link, but that still does not give you the torrent file; the torrent file has to be obtained by other means.
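Building a magnet link from a stored infohash is mostly string formatting. The sketch below assumes the infohash is kept as a 40-character hex string; the function name `build_magnet` and the optional display-name parameter are my own illustration, not part of the original system.

```python
from urllib.parse import quote


def build_magnet(infohash_hex, name=""):
    """Build a magnet URI from a 40-character hex infohash (hypothetical helper)."""
    infohash_hex = infohash_hex.strip().lower()
    is_hex = all(c in "0123456789abcdef" for c in infohash_hex)
    if len(infohash_hex) != 40 or not is_hex:
        raise ValueError("expected a 40-character hex SHA-1 infohash")

    link = "magnet:?xt=urn:btih:" + infohash_hex
    if name:
        # dn is the optional display name; it must be URL-encoded.
        link += "&dn=" + quote(name)
    return link


print(build_magnet("6d6e6f707172737475767778797a313233343536", "Some File"))
```

The link only identifies the torrent; a client still has to fetch the metadata from peers, which is why the torrent file itself must be obtained separately.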

The Python web crawler searches actively but blindly, looking for torrents and magnet links across the vast number of pages on the Internet. The DHT crawler instead waits passively: when someone comes asking, it records the query. If a torrent is asked about many times, it is a popular torrent, which is something the Python web crawler cannot find out.
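The popularity idea boils down to counting how often each infohash is asked about. A minimal in-memory sketch follows; the real system stores its records in MySQL, and the class name `PopularityTracker` and its methods are hypothetical.

```python
from collections import Counter


class PopularityTracker:
    """Counts how many times each infohash has been requested (illustrative only)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, infohash):
        # Called once for every query that mentions this infohash.
        self.counts[infohash] += 1

    def top(self, n):
        # Most requested infohashes first, as (infohash, count) pairs.
        return self.counts.most_common(n)


tracker = PopularityTracker()
for seen in ["aaa", "bbb", "aaa", "ccc", "aaa", "bbb"]:
    tracker.record(seen)
print(tracker.top(2))
```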

Since the open-source version of BitTorrent is written in Python, my DHT crawler also uses Python. For the server side, the twisted framework is the natural choice. Anyone familiar with nodejs will know the defining feature of this framework: asynchronous network IO. Although most developers learned about asynchronous IO through nodejs, twisted predates nodejs by N years.

The crawler currently running is a very simple version: a multi-threaded, socket-based DHT server that I wrote a week ago. As of now (February 1, 2015), it has been running for six days and has collected a total of 45,234,859 magnet links.
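The running code is not shown here, but a multi-threaded, socket-based UDP listener of this kind could look roughly like the sketch below, using only Python's standard `socketserver` module. The class names `DhtListener` and `QueryHandler` are hypothetical, and the handler only stores raw datagrams; a real crawler would decode each packet, reply to the sender, and write the infohash to the database.

```python
import socketserver
import threading


class QueryHandler(socketserver.BaseRequestHandler):
    """Handles one incoming UDP datagram in its own thread."""

    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        with self.server.lock:
            self.server.received.append((self.client_address, data))


class DhtListener(socketserver.ThreadingUDPServer):
    """UDP server that spawns a thread per datagram (illustrative only)."""

    daemon_threads = True  # do not block interpreter exit on handler threads

    def __init__(self, address):
        super().__init__(address, QueryHandler)
        self.lock = threading.Lock()   # protects the shared list below
        self.received = []             # (sender address, raw packet) pairs


def start_listener(host="0.0.0.0", port=6881):
    """Start the listener in a background thread and return the server object."""
    server = DhtListener((host, port))
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling `start_listener()` binds the socket and returns immediately; `server.shutdown()` followed by `server.server_close()` stops it. The port number 6881 is just a commonly used BitTorrent port chosen for the example.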

A daily-updated roundup of magnet search sites:

https://www.cnblogs.com/cilisousuo/p/12099547.html

Origin www.cnblogs.com/cilisousuo/p/12110831.html