A Beginner's Guide to Python Web Crawlers: The Three Stages on the Way to Mastery

Whenever you learn a technology, a clear goal acts like a beacon guiding you forward. Many people start learning and then give up halfway, largely because they never had a clear objective. So before you set out to learn web crawling, ask yourself why you want to learn it. Some people do it for a job, some for fun, and some to pull off a particular trick. One thing is certain: knowing how to write crawlers will make your work much more convenient.

A Must-Read for Beginners

Starting from zero, you can break the journey into three stages.

The first stage is getting started: master the necessary basics, such as the fundamentals of Python and the basic principles of network requests.

The second stage is imitation: follow other people's crawler code, understand every line of it, and become familiar with the mainstream crawling tools.

The third stage is doing it yourself: at this point you begin to develop your own problem-solving ideas and can design a crawler system independently.

The technologies a crawler touches include, but are not limited to: proficiency in a programming language (Python in this article), knowledge of HTML, the basics of the HTTP protocol, regular expressions, databases, common packet-capture tools, and crawler frameworks. For large-scale crawling you also need to understand distributed systems, message queues, common data structures and algorithms, and caching, and the list can even extend to machine learning applications; large-scale systems are supported by many technologies behind the scenes. Data analysis, data mining, and machine learning are all inseparable from data, and that data often has to be fetched by a crawler. So even as a profession in its own right, crawling has a great future.

Does that mean you must finish learning all of the above before you can start writing crawlers? Of course not. Learning is a lifelong matter; as long as you can write Python code, you can get started with crawlers right away. It is like learning to drive: once you can start the car, get on the road. And writing code is much safer than driving.

Writing Crawlers in Python

First you need Python itself: know the basic syntax, know how to use functions and classes, and know the common methods of list and dict. That much counts as basic entry-level knowledge.
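As a rough yardstick of that entry level, here is a small self-check sketch: if the function, class, list, and dict usage below all read comfortably, you know enough Python to start. The names (`count_words`, `Page`) are made up for illustration.

```python
def count_words(text):
    """Count word occurrences using a dict."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


class Page:
    """A minimal class bundling a URL with its raw HTML."""

    def __init__(self, url, html=""):
        self.url = url
        self.html = html

    def size(self):
        return len(self.html)


# list of objects, dict built from a string
pages = [Page("https://example.com", "<html>hello hello</html>")]
print(count_words("to be or not to be"))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(pages[0].size())
```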
Knowledge of HTTP

The basic principle of a crawler is to download data from a remote server through network requests, and the technology behind network requests is the HTTP protocol. As a beginner you need to understand the basic principles of HTTP. Although the full HTTP specification could fill a book, the deeper content can be studied slowly later; combine theory with practice.

Network request frameworks are all implementations of the HTTP protocol. For example, the famous Requests library is a network library that simulates a browser sending HTTP requests. Once you know the HTTP protocol, you can study the network-related modules in a targeted way, such as Python's built-in urllib and urllib2 (urllib in Python 3), httplib, and Cookie handling. Of course, you can also skip these and learn how to use Requests directly, provided you are familiar with the basic content of HTTP. The data you crawl down is in most cases HTML text; a small portion is in XML or JSON format. To process each type of data correctly, you should be familiar with the corresponding solutions. For example, JSON data can be handled directly with Python's own json module; for HTML you can use libraries such as BeautifulSoup or lxml; for XML you can additionally use third-party libraries such as untangle or xmltodict.
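To keep the example runnable offline, here is a sketch of handling the two most common payload types with the standard library alone (Requests, BeautifulSoup, and lxml offer nicer APIs, but the ideas are the same). The inline strings stand in for what a real HTTP response body would contain.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# JSON response body: json.loads gives you plain dicts and lists.
body = '{"title": "Hello", "tags": ["python", "crawler"]}'
data = json.loads(body)
print(data["tags"])          # ['python', 'crawler']

# HTML response body: feed it to the parser and read back the links.
extractor = LinkExtractor()
extractor.feed('<p><a href="/a">A</a> and <a href="/b">B</a></p>')
print(extractor.links)       # ['/a', '/b']
```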

Crawler Tools

Among crawler tools, learn to use the developer tools in Chrome or Firefox to inspect elements, trace request information, and so on. Nowadays most websites provide interfaces for their apps and mobile browsers; prefer these interfaces, as they are relatively easier to crawl. Proxy tools such as Fiddler are also worth knowing.

For getting started, learning regular expressions is not mandatory; you can pick them up when you really need them. For example, after you crawl data back and need to clean it, you may find that ordinary string-manipulation methods simply cannot handle it. Then try regular expressions, which often have a multiplier effect. Python's re module is used to work with regular expressions.
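Two small cleaning tasks of the kind the paragraph describes, sketched with the re module (the sample strings are made up):

```python
import re

# Pull prices out of messy scraped text -- hard with str methods alone.
raw = "Widget A: $19.99, Widget B: $5, Widget C: $120.50"
prices = re.findall(r"\$(\d+(?:\.\d+)?)", raw)
print(prices)  # ['19.99', '5', '120.50']

# Collapse runs of whitespace left over from HTML extraction.
messy = "hello   \n\t world"
clean = re.sub(r"\s+", " ", messy)
print(clean)  # 'hello world'
```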


Data Cleaning and Storage

Data that has been cleaned finally needs to be stored persistently. You can store it in files, such as CSV files, or in a database: SQLite for a simple setup, MySQL for something more professional, or the distributed document database MongoDB. All of these databases are very friendly to Python and have ready-made library support; all you have to do is become familiar with how to use their APIs.
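A minimal persistence sketch using the standard-library sqlite3 module; the table name and columns are illustrative, and `:memory:` would be replaced with a file path for real storage.

```python
import sqlite3


def store_pages(rows, db_path=":memory:"):
    """Insert (url, title) rows and return what is stored, sorted by url."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    # Parameter substitution (?) avoids SQL injection from scraped text.
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", rows)
    conn.commit()
    result = list(conn.execute("SELECT url, title FROM pages ORDER BY url"))
    conn.close()
    return result


saved = store_pages([
    ("https://example.com/2", "Second page"),
    ("https://example.com/1", "First page"),
])
print(saved)
```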


The Road to Advancement

From capturing data, to storing it, to cleaning it, the basic workflow is now complete, and you can be considered to have gotten started. The next step tests your inner strength. Many websites have anti-crawler strategies and try every means to stop you from obtaining data by irregular methods: all kinds of strange captchas limiting your request operations, rate limits on requests, IP restrictions, even data encryption. In short, the goal is to raise the cost of obtaining the data. At that point you will need more knowledge: a deeper understanding of the HTTP protocol, common encryption algorithms, the various HTTP headers and cookies, and HTTP proxies. Crawlers and anti-crawler measures are a pair locked in a love-hate struggle, each move answered by a countermove.
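Two of the cheapest first responses to such blocking are sending browser-like headers and throttling your own request rate. The sketch below uses the standard-library urllib; the User-Agent string and delay value are illustrative, not magic numbers, and no request is actually sent here.

```python
import time
import urllib.request


def build_request(url):
    """Attach browser-like headers so naive filters let the request through."""
    return urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    })


class Throttle:
    """Ensure successive requests are at least `delay` seconds apart."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.last = 0.0

    def wait(self):
        gap = self.delay - (time.time() - self.last)
        if gap > 0:
            time.sleep(gap)
        self.last = time.time()


req = build_request("https://example.com")
print(req.get_header("User-agent"))  # the UA string we set above
```

In real use you would call `throttle.wait()` before each `urllib.request.urlopen(req)`; rotating proxies and handling captchas are further steps beyond this sketch.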

There is no established, unified solution for dealing with anti-crawler measures; it depends on your experience and the knowledge system you have built. This is not a height you can reach with a mere 21-day introductory tutorial.

Large-scale crawling usually starts from one URL, then parses the URL links in each page and adds them to the collection of URLs to be crawled. We need a queue, or a priority queue, to distinguish the sites to crawl first from those to crawl later. Each time a page is crawled, a depth-first or breadth-first strategy decides which link to take next. Every network request also involves a DNS resolution (converting the domain name into an IP address); to avoid resolving the same names repeatedly, we should cache the resolved IPs.

With so many URLs, how do you know which have been crawled and which have not? The simple approach is to store the already-crawled URLs in a dict, but when the number of URLs becomes huge, the dict occupies a great deal of memory; at that point you should consider a Bloom filter. Crawling data with a single thread is inefficient; to improve efficiency, use multithreading, multiprocessing, coroutines, or distributed operation. All of this requires repeated practice.
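The queue-plus-visited-set loop above can be sketched as a breadth-first crawl. To keep it runnable offline, fetching is simulated with a hypothetical dict mapping each URL to the links found on it; in a real crawler, `fetch_links` would download and parse the page.

```python
from collections import deque

SITE = {  # stand-in link graph instead of real pages
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/c"],
    "/c": ["/"],
}


def fetch_links(url):
    """Simulated download+parse step."""
    return SITE.get(url, [])


def crawl(start):
    """Visit every reachable URL exactly once, breadth-first."""
    seen = {start}            # swap for a Bloom filter at large scale
    queue = deque([start])    # swap for a priority queue to rank sites
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl("/"))  # ['/', '/a', '/b', '/c']
```

Popping from the right end of the deque instead would turn the same skeleton into a depth-first crawl.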


Origin: blog.51cto.com/14510224/2438403