Steps for crawling data with a Python web crawler

Web crawler:

  A web crawler is an important part of a search engine's crawling system (Baidu, Google, etc.). Its main purpose is to download web pages from the Internet to a local machine, forming a mirror of the Internet's content.

Steps:

Step one: Get web links


  1. Observe how the URLs of the pages you want to crawl change; usually only a small part differs. For example, on some sites only the last number in the URL changes, so you can obtain links to more pages simply by changing that number;

  2. Collect the links of the pages into a dictionary, which acts as a temporary database, so they can be accessed directly through a function call when needed;

  3. Note that a crawler cannot crawl just any website; we need to comply with each site's crawler agreement (robots.txt), and many sites, such as Taobao and Tencent, do not allow arbitrary crawling;

  4. In the crawler era, almost every site has set up anti-crawler mechanisms. When we encounter an access-denied error such as 404, we can disguise the crawler as a person browsing the site by setting the User-Agent header, rather than letting the site see a bare program, and then fetch the page content (see the sketch after this list).
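
  Below is a minimal sketch of items 1, 2, and 4, assuming a hypothetical site whose page URLs differ only in a trailing number (the example.com pattern is made up for illustration) and using the third-party requests library:

```python
import requests

# Item 1: only the trailing number changes, so generate the URLs directly.
# The URL pattern below is hypothetical, purely for illustration.
def build_links(base_url, count):
    return {n: f"{base_url}{n}" for n in range(1, count + 1)}

# Item 2: the dictionary acts as a temporary database of links.
links = build_links("http://example.com/page/", 10)

# Item 4: a browser-like User-Agent so the request does not look like a bare script.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36"
}

def fetch(url):
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of silently keeping bad HTML
    return response.text

html = fetch(links[1])
print(html[:200])
```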

Step two: Data storage

  1. The crawler saves the pages it crawls into a raw-page database, where the stored page data is exactly the same HTML a user's browser would receive;

  2. When search engines crawl pages, they perform duplicate-content detection; if a site is full of plagiarized, collected, or copied content and therefore carries very little weight, the engine may stop crawling it;

  3. Data can be stored in many ways: in a local database, in a temporary database, or in files such as txt or csv; in short, many forms are possible (a file-based sketch follows this list);
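
  A minimal sketch of the txt and csv forms mentioned in item 3, using only the standard library; the pages dictionary is a made-up stand-in for whatever the fetch step returned:

```python
import csv

# Hypothetical crawl result: url -> raw HTML, standing in for real fetched pages.
pages = {
    "http://example.com/page/1": "<html><body>first page</body></html>",
    "http://example.com/page/2": "<html><body>second page</body></html>",
}

# txt form: store each raw page exactly as received.
for n, (url, html) in enumerate(pages.items(), start=1):
    with open(f"page_{n}.txt", "w", encoding="utf-8") as f:
        f.write(html)

# csv form: one summary row per page, easy to open in a spreadsheet.
with open("pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "html_length"])
    for url, html in pages.items():
        writer.writerow([url, len(html)])
```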

Step three: Pre-processing (data cleaning)

  1. The data we get is often very messy, full of whitespace and HTML tags; we want to strip these unnecessary things out to improve the data's readability and usability (see the sketch after this list);

  2. We can also use visualization software to model the data, so its content can be seen at a glance;
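
  A minimal cleaning sketch using the third-party BeautifulSoup library (bs4) to remove tags and extra whitespace; the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up messy page data, standing in for real crawled HTML.
raw_html = """
<html><body>
  <h1> Example   title </h1>
  <p>Some   text with
     messy     spacing.</p>
</body></html>
"""

# Strip the tags, keeping only the text content.
soup = BeautifulSoup(raw_html, "html.parser")
text = soup.get_text()

# Collapse runs of whitespace and newlines into single spaces.
cleaned = " ".join(text.split())
print(cleaned)  # -> Example title Some text with messy spacing.
```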

Step four: Using the data

  We can use the crawled data as a kind of market research, saving human resources while comparing benefits from multiple angles, so as to best meet our needs.

Summary:

  Python can be used to crawl data, but Python was not designed only for crawlers; it can do many other things. Still, it does have certain advantages for crawlers: it is simple and easy to write, crawls quickly, and handles common crawler problems such as cookies and verification codes with ease. It is a valuable language.
