Python Web Crawler Project Selection 3

Reference links:
A collection of hands-on crawler projects

WeChat official account crawler. A crawler interface for WeChat official accounts built on Sogou's WeChat search; it can be extended into a general Sogou-search crawler. The result is a list in which each item is a dictionary holding one official account's details.

Douban Books crawler. Crawls every book under a Douban Books tag and stores the books in Excel ranked by score, which makes screening easy, for example picking high-scoring books with more than 1000 ratings. Different topics can be stored in different Excel sheets. The crawler disguises itself as a browser with a User-Agent header and adds a random delay to better mimic browser behavior and avoid being blocked.
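The User-Agent disguise, the random delay, and the "more than 1000 ratings" screening mentioned above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names, the User-Agent strings, the delay bounds, and the `score` / `ratings` field names are all assumptions.

```python
import random
import time
import urllib.request

# Hypothetical pool of browser User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def build_request(url):
    """Return a Request carrying a randomly chosen browser User-Agent."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def polite_fetch(url, min_delay=1.0, max_delay=3.0):
    """Sleep for a random interval, then fetch the page body as bytes."""
    time.sleep(random.uniform(min_delay, max_delay))
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read()


def top_rated(books, min_ratings=1000):
    """Keep books with more than min_ratings ratings, highest score first."""
    kept = [b for b in books if b["ratings"] > min_ratings]
    return sorted(kept, key=lambda b: b["score"], reverse=True)
```

Writing the sorted rows into separate Excel sheets would need a third-party library, so that step is left out of the sketch.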

Zhihu crawler. This project crawls Zhihu user information and the social graph between users, using the Scrapy framework for crawling and MongoDB for data storage.

Bilibili user crawler. Total records: 20,119,918. Crawled fields: user id, nickname, gender, avatar, rank, experience, number of fans, birthday, address, registration date, signature, and so on. After the crawl it generates a Bilibili user data report.

Sina Weibo crawler. Mainly crawls Weibo users' personal information, posts, fans, and followed accounts. The code obtains Sina Weibo cookies to log in, and logging in with multiple accounts helps get around Sina's anti-crawling measures. It mainly uses the Scrapy crawler framework.

Distributed novel-download crawler. A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: the underlying storage is a MongoDB cluster, distribution is handled through Redis, and crawler status is displayed with Graphite. It targets a single novel site.
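To give a feel for how Redis can coordinate several crawler workers, here is a minimal sketch of a shared URL frontier: a "seen" set for de-duplication plus a queue of pending URLs. It is not taken from the project. `Frontier`, `InMemoryStore`, and the key names are hypothetical, and `InMemoryStore` is only a local stand-in that mimics the `sadd`, `lpush`, and `rpop` calls so the sketch runs without a Redis server.

```python
class InMemoryStore:
    """Tiny stand-in exposing the three Redis-style calls used below."""

    def __init__(self):
        self._sets = {}
        self._lists = {}

    def sadd(self, name, value):
        # Return 1 if the value was newly added, 0 if it was already present.
        members = self._sets.setdefault(name, set())
        if value in members:
            return 0
        members.add(value)
        return 1

    def lpush(self, name, value):
        # Insert at the head of the list and return the new length.
        items = self._lists.setdefault(name, [])
        items.insert(0, value)
        return len(items)

    def rpop(self, name):
        # Remove and return the tail element, or None if the list is empty.
        items = self._lists.get(name)
        return items.pop() if items else None


class Frontier:
    """De-duplicating FIFO queue of URLs shared by all workers."""

    def __init__(self, store, prefix="novel"):
        self.store = store
        self.seen_key = f"{prefix}:seen"
        self.queue_key = f"{prefix}:queue"

    def push(self, url):
        # Only enqueue URLs that no worker has seen before.
        if self.store.sadd(self.seen_key, url):
            self.store.lpush(self.queue_key, url)
            return True
        return False

    def pop(self):
        # lpush + rpop gives first-in, first-out ordering.
        return self.store.rpop(self.queue_key)
```

With the redis-py package installed, a `redis.Redis(decode_responses=True)` client could presumably be passed in place of `InMemoryStore`, so that every worker process shares the same set and queue.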

CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run the script under src/ to fetch data. The fetched data is stored in the /data directory, and the first line of each data file holds the field names.
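Since the first line of each data file holds the field names, such files can be read back as dictionaries. The sketch below assumes comma-separated text, which the description does not confirm; `load_records` and the sample field names are made up for illustration.

```python
import csv
import io


def load_records(fileobj, delimiter=","):
    """Read rows into dicts keyed by the field names on the first line."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    return list(reader)


# Usage with an in-memory sample; a real file would be opened with
# open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO("title,author,year\nPaper A,Li,2015\nPaper B,Wang,2016\n")
records = load_records(sample)
```

If the files turn out to be tab-separated, passing `delimiter="\t"` should be enough.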

Lianjia (Homelink) crawler. Crawls Lianjia's second-hand housing transaction records for Beijing over the years. It covers all the code from the accompanying Lianjia crawler article, including the simulated-login code.

JD.com (Jingdong) crawler. A Scrapy-based crawler for the JD.com website; results are saved as CSV.

QQ group crawler. Fetches QQ group information in batches, including group name, group number, member count, group owner, and group description, and finally generates an XLS(X) / CSV result file.

WooYun crawler. Crawler and search for WooYun's public vulnerabilities and knowledge base. The list of all public vulnerabilities and the text of each one are stored in MongoDB, about 2 GB in total. Crawling the whole site, text and images, for offline lookup takes about 10 GB of space and two hours (on a 10M telecom line). Crawling the whole knowledge base takes about 500 MB. The vulnerability search uses Flask as the web server and Bootstrap as the front end.

Flight ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight ticket crawler that currently integrates two major Chinese ticket websites (Qunar + Ctrip).

NetEase client content crawler based on requests, MySQLdb, and torndb.

A set of Douban crawlers covering movies, books, groups, albums, and more.

QQ Zone (Qzone) crawler, covering blog posts, status updates, and personal information; it can fetch 4 million records a day.

Baidu MP3 whole-site crawler; uses Redis so that an interrupted crawl can be resumed.

Taobao and Tmall crawler. It can fetch page information by search keyword or item id, and stores the data in MongoDB.

A stock data crawler (Shanghai and Shenzhen markets) and stock-picking strategy testing framework. It fetches market data for all Shanghai and Shenzhen stocks within a selected date range, supports defining stock-picking strategies as expressions, supports multi-threading, and saves data to JSON and CSV files.
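One plausible way to define a stock-picking strategy as an expression is to evaluate a Python boolean expression against each stock's fields. The sketch below shows that idea only; it is not the framework's actual syntax, and `select_stocks` and the `symbol` / `open` / `close` / `volume` field names are assumptions.

```python
def select_stocks(rows, expression):
    """Return the symbols of rows for which the expression is true.

    The expression is plain Python evaluated with the row's fields as
    variables. Built-ins are removed, but eval is still not a sandbox,
    so this should only be used with expressions you wrote yourself.
    """
    code = compile(expression, "<strategy>", "eval")
    picked = []
    for row in rows:
        if eval(code, {"__builtins__": {}}, dict(row)):
            picked.append(row["symbol"])
    return picked


# Made-up sample quotes for illustration.
quotes = [
    {"symbol": "600000", "open": 10.0, "close": 10.5, "volume": 2000},
    {"symbol": "000001", "open": 9.5, "close": 9.0, "volume": 5000},
]
picked = select_stocks(quotes, "close > open and volume > 1000")
```

Keeping the strategy as a string means it can be changed from a config file or the command line without touching the crawler code.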

Baidu cloud disk crawler.

Social data crawler. Supports Weibo, Zhihu, and Douban.

Proxy pool. A proxy IP pool for Python crawlers.
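As a rough illustration of what a proxy IP pool does, the sketch below keeps a set of proxy addresses, hands one out at random, and drops a proxy after repeated failures. It is an assumption about the general technique, not the project's design; the `ProxyPool` class and its `max_failures` threshold are hypothetical.

```python
import random


class ProxyPool:
    """Holds proxy addresses and evicts ones that keep failing."""

    def __init__(self, max_failures=3):
        self._failures = {}  # proxy address -> consecutive failure count
        self._max_failures = max_failures

    def add(self, proxy):
        # Adding an address that is already present keeps its counter.
        self._failures.setdefault(proxy, 0)

    def get(self):
        # Pick a random proxy so load is spread across the pool.
        if not self._failures:
            raise LookupError("proxy pool is empty")
        return random.choice(list(self._failures))

    def report(self, proxy, ok):
        # Reset the counter on success; evict after too many failures.
        if proxy not in self._failures:
            return
        if ok:
            self._failures[proxy] = 0
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]

    def __len__(self):
        return len(self._failures)
```

A crawler would call `get()` before each request and `report()` afterwards, so that dead proxies fall out of rotation on their own.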

Crawls all comments on NetEase Cloud Music songs.

Crawls the girl pictures on Jandan.


Crawls videos from imooc.

CNKI crawler.

Knownsec crawler challenge.

Aiss app image crawler, which also bypasses payment for VIP image sets.

Uses dynamic IPs to get around Sina's anti-crawler mechanism and crawl content quickly.

Crawls blog articles on CSDN.

Crawls proxy IPs from Xici and verifies that the proxies are usable.
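Verifying proxy availability usually means sending a test request through each proxy and keeping the ones that answer. Below is a minimal sketch using only the standard library; the test URL, timeout, worker count, and function names are assumptions, and the project itself may check proxies differently.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_proxy(proxy, test_url="http://example.com/", timeout=5):
    """Return True if a GET through the proxy ("host:port") returns 200."""
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and malformed responses all count
        # as "proxy not usable".
        return False


def filter_working(proxies, check=check_proxy, workers=8):
    """Check proxies concurrently and return the usable ones in order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]
```

The `check` parameter is there so the filtering logic can be tried with a fake checker, without any network access.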

