How can a complete beginner get started with Python web crawlers?

Before learning about crawlers, we need to understand one question:

What can crawlers do?

Besides collecting data from the Internet, crawlers can take over a lot of tedious manual operations. These include not only reading data but also submitting it, for example:

1. Voting

2. Managing multiple accounts across multiple platforms (such as accounts on various e-commerce platforms)

3. WeChat bots

Real applications go far beyond these. The examples above are uses beyond plain data collection; the collected data itself also has very broad applications:

1. Machine learning corpora

2. Vertical services (used-car valuation)

3. Aggregation services (Qunar, Meituan)

4. News recommendation (Toutiao)

5. Prediction and diagnosis (the medical field)

So crawlers can do a great deal, and demand for them keeps growing. Yet many back-end developers feel that crawlers are very simple: use a library (requests) to fetch some HTML, parse it, and you're done. But are crawlers really that simple?

Before answering, let's ask a few questions:

1. If a page requires login before it can be accessed, what do you do?

2. Many people will answer that you just simulate the login. In practice, many sites use all kinds of measures to make simulated login harder, such as CAPTCHAs, obfuscated login logic, and encrypted parameters. How do you solve these?

3. What do you do when a site only supports login from a mobile phone?

4. To improve the user experience and reduce server load, many websites load page elements asynchronously or render them with JavaScript. Are you able to analyze how that data is loaded?

5. Websites deploy an endless variety of anti-crawling measures. When your crawler gets blocked, can you work out how the other side detected it?

6. How does a crawler discover the latest data? How does it tell whether a record has been updated?
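For question 6, one common approach is to store a fingerprint (a hash) of each record and compare it on the next crawl. The sketch below assumes each record arrives as a dict and has a stable identifier; the names `fingerprint`, `classify`, and `seen` are made up for illustration, and `seen` would normally live in a database rather than in memory.

```python
import hashlib


def fingerprint(record: dict) -> str:
    """Return a stable SHA-256 digest of a record's fields."""
    # Sort the keys so that field order does not change the digest.
    canonical = "\n".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify(record_id: str, record: dict, seen: dict) -> str:
    """Compare a freshly crawled record with its stored fingerprint.

    Returns "new", "updated", or "unchanged", and refreshes `seen`.
    """
    digest = fingerprint(record)
    previous = seen.get(record_id)
    seen[record_id] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "updated"


if __name__ == "__main__":
    seen = {}  # hypothetical in-memory store; use a database in practice
    print(classify("car-1", {"title": "Used car", "price": 9000}, seen))
    print(classify("car-1", {"title": "Used car", "price": 9000}, seen))
    print(classify("car-1", {"title": "Used car", "price": 8500}, seen))
```

Run as a script, this prints `new`, `unchanged`, and `updated` in that order. Hashing only the fields you care about (rather than the whole page) avoids false "updated" results caused by ads or timestamps.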

If you're only writing a simple crawler, such as a one-off job that grabs certain data from a site once, then of course it's simple. But if you want to run a crawler as a service, you have to face the problems above, and that list doesn't even mention extracting and parsing the data.

With those questions in mind, let's look at what we need to learn:

Stage 1: Basics

1. Computer networking basics, including the TCP/IP protocols, socket programming, and the HTTP protocol

2. Front-end basics: mainly JavaScript and Ajax

3. Basic Python syntax

4. Database basics: any database will do, but MySQL or PostgreSQL is strongly recommended

5. HTML parsing basics: using BeautifulSoup, XPath, and CSS selectors

6. HTML downloading basics: using urllib or requests

7. Data storage basics: for a relational database (MySQL), start with pymysql and then move on to peewee; for a document database (MongoDB), start with pymongo and then mongoengine
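Items 5 to 7 form the basic pipeline of download, parse, and save. Here is a minimal sketch of the parse and save steps. To keep it runnable without installing anything, it swaps in the standard library's `html.parser` for BeautifulSoup and `sqlite3` for MySQL, and it parses an inline sample string instead of downloading a page; in a real crawler the HTML would come from urllib or requests. `LinkParser`, `extract_links`, `save_links`, and `SAMPLE_HTML` are names invented for this example.

```python
import sqlite3
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> list:
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def save_links(conn, links):
    """Store links, replacing any existing row with the same href."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (href TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO links (href, title) VALUES (?, ?)", links
    )
    conn.commit()


# Hypothetical sample page, used in place of a real download.
SAMPLE_HTML = """
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second <b>post</b></a></li>
</ul>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_links(conn, extract_links(SAMPLE_HTML))
    for row in conn.execute("SELECT href, title FROM links ORDER BY href"):
        print(row)
```

Using the href as the primary key means re-running the crawl overwrites rows instead of duplicating them, which is usually what you want for a repeatable job.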

Stage 2: Crawlers in practice

After the previous stage you only have the most basic crawler knowledge. To really crawl real sites, you need to study further:

1. Simulated login: you need to understand how cookie and session login works. If you need to crawl Weibo specifically, you also need to know the OAuth 2.0 flow

2. Dynamic page analysis: the most basic approach is to analyze the HTML and JavaScript directly, but many sites make this logic very complicated, so you also need to learn the basics of selenium and the related chromedriver

3. CAPTCHA recognition: this starts with basic recognition such as OCR. For more complex CAPTCHAs, recognizing them yourself means learning machine learning and image recognition; the simplest route is to call a third-party service

4. Anti-crawling: you need to know basic nginx configuration and become more familiar with the details of the HTTP protocol

5. Crawlers are usually developed with multithreading, so you need to learn more about multithreaded development, including inter-thread communication and thread synchronization
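For item 5, a small worker pool shows both ideas at once: a `queue.Queue` handles communication between threads, and a `threading.Lock` handles synchronization on shared state. This is only a sketch; `fake_fetch` is a stand-in for a real download with urllib or requests, and `crawl_all` is a name made up here.

```python
import queue
import threading


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real download (urllib or requests)."""
    return f"<html>{url}</html>"


def crawl_all(urls, fetch=fake_fetch, num_workers=4):
    """Fetch every URL with a pool of threads and return {url: outcome}.

    The outcome is the page body, or the exception raised by `fetch`.
    """
    tasks = queue.Queue()            # inter-thread communication
    results = {}
    results_lock = threading.Lock()  # synchronization for shared dict

    def worker():
        while True:
            url = tasks.get()
            if url is None:          # sentinel: no more work
                tasks.task_done()
                break
            try:
                outcome = fetch(url)
            except Exception as exc:  # keep the worker alive on failure
                outcome = exc
            with results_lock:
                results[url] = outcome
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    pages = crawl_all([f"https://example.com/page/{i}" for i in range(5)])
    print(len(pages))
```

Catching exceptions inside the worker matters: if a worker thread died on a failed download, its sentinel would never be consumed and `tasks.join()` would block forever.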

Stage 3: Crawler monitoring and operations

Once a crawler is running in production you have to monitor it, and it's best to have a page for monitoring and managing your crawlers. So you need to understand:

1. Linux basics, for deploying services

2. Docker basics; the strengths and popularity of Docker deployment need no introduction

3. django or flask, because we need to build the crawler monitoring page

Stage 4: Crawler frameworks and distributed crawlers

1. You need to know at least one crawler framework: scrapy or pyspider

2. If you know scrapy, you also need to learn scrapy-redis and how it solves the problem of distributed crawling

3. You need to understand distributed storage solutions, such as the Hadoop suite

4. You need to understand the MongoDB document database

5. You need to know the Elasticsearch search engine

6. You need to understand Kafka, the distributed publish-subscribe messaging system

7. You need to understand the principles behind distributed infrastructure such as distributed locks
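At the heart of a distributed crawler is a shared request queue plus a shared set of "already seen" URLs, so that several crawler processes never fetch the same page twice. The sketch below keeps both in memory to show the idea only; in a distributed setup they would live in a shared store such as Redis, which is roughly the job scrapy-redis takes on for scrapy. `Frontier` and `url_fingerprint` are hypothetical names, and the normalization rules are a simplifying assumption.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def url_fingerprint(url: str) -> str:
    """Hash a lightly normalized URL (lowercased host, fragment dropped)."""
    parts = urlsplit(url)
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path or "/", parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class Frontier:
    """In-memory URL queue with de-duplication (illustration only)."""

    def __init__(self):
        self._seen = set()       # would be a shared set in a real cluster
        self._pending = deque()  # would be a shared queue in a real cluster

    def push(self, url: str) -> bool:
        """Queue the URL unless an equivalent one was already seen."""
        fp = url_fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        self._pending.append(url)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


if __name__ == "__main__":
    frontier = Frontier()
    frontier.push("https://Example.com/a#top")
    frontier.push("https://example.com/a")  # duplicate after normalization
    frontier.push("https://example.com/b")
    print(frontier.pop(), frontier.pop(), frontier.pop())
```

Storing fixed-length fingerprints instead of full URLs keeps the seen-set compact, which starts to matter once it holds millions of entries.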

Stage 5: Crawler applications

This stage belongs to the application domain. If you want to do artificial intelligence, you need the relevant AI knowledge; if you want to do data analysis, you need to learn the basics of data analysis; if you want to build a web service, you need to learn web development; and if you want to build search engines or recommender systems, you need to understand the fundamentals behind them.

Following a systematic learning sequence like this will definitely save you a lot of detours. Earlier posts have already covered Python tutorials, and more updates will follow!


Origin www.cnblogs.com/cherry-tang/p/11237973.html