1. What is a web crawler?
If we picture the Internet as a vast spider web, then the data on each computer is prey caught in the web, and a crawler program is a little spider that travels along the strands to fetch the data it wants.
Explanation 1: a program that fetches web pages by URL and extracts useful information from them
Explanation 2: a program that simulates a browser, sends requests to a server, and receives the server's responses
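Explanation 2 can be sketched with Python's standard library alone. This is a minimal illustration, not a full crawler: it sends an HTTP request the way a browser would (including a browser-like `User-Agent` header) and returns the HTML the server sends back.

```python
import urllib.request

def fetch(url: str, user_agent: str = "Mozilla/5.0") -> str:
    """Simulate a browser: send a request to the server, return the response body."""
    # Attach a browser-like User-Agent so the server treats us like a browser
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Decode using the charset the server declares, falling back to UTF-8
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

# html = fetch("https://example.com")  # returns the page's HTML source
```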
2. Core tasks of a crawler
1. Fetch the page: download the entire page, including all of its content
2. Parse the data: extract the data you need from the fetched page
3. The hard part: the ongoing game between crawlers and anti-crawling measures
3. Uses of crawlers
- Data analysis / building datasets
- Cold-starting social apps with seed content
- Public opinion monitoring
- Competitor monitoring
4. Classification of crawlers
General-purpose crawlers:
Example: search engines such as Baidu, 360, Google, and Sogou - Bole Online
Function
Visit the page -> fetch the data -> store the data -> process the data -> provide search services
robots protocol
A conventional agreement: the site publishes a robots.txt file stating which of its content must not be crawled. It is advisory only and has no technical enforcement.
Crawlers you write yourself are not technically forced to comply with it.
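Python's standard library can read the robots protocol for you. A small sketch: here the rules are parsed from an in-memory list of lines for illustration; against a real site you would call `set_url(".../robots.txt")` followed by `read()`.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# A sample robots.txt that forbids crawling anything under /private/
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MySpider", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("MySpider", "https://example.com/private/data.html"))  # False
```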
Website ranking (SEO)
1. Ranked by PageRank score (which factors in site traffic, click-through rate, and similar metrics)
2. Baidu paid placement (PPC)
Shortcomings
1. Most of the crawled data is useless
2. The data cannot be tailored precisely to a specific user's needs
Focused crawlers
Function
Implement a crawler program that fetches exactly the data required
Design approach
1. Determine the URL(s) to crawl
How do we get the URLs?
2. Simulate a browser, access the URL over HTTP, and obtain the HTML returned by the server
How do we access it?
3. Parse the HTML string (extract the required data according to certain rules)
How do we parse it?
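Step 3 can be done with the standard library's `html.parser`; real projects often use richer tools (e.g. XPath or CSS-selector libraries), but the idea is the same. A minimal sketch that pulls every `<a href="...">` link out of an HTML string:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Parse an HTML string and collect every <a href="..."> link."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag; keep only anchors with an href
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<html><body><a href="/page1">one</a> <a href="/page2">two</a></body></html>'
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['/page1', '/page2']
```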
5. Anti-crawling measures
1.User-Agent:
The User Agent (UA for short) is a special string header that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on.
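By default, urllib advertises itself as `Python-urllib/3.x`, which many sites reject outright. Supplying a browser-like User-Agent is the usual first countermeasure; the UA string below is a typical Chrome-on-Windows value, used here purely as an example.

```python
import urllib.request

# A browser-like User-Agent string (example value; any real browser UA works)
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

# Attach the header when building the request, before opening it
req = urllib.request.Request("https://example.com", headers={"User-Agent": ua})
print(req.get_header("User-agent"))  # confirms the header is set
```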
2. Proxy IP
Xici Proxy
Kuaidaili
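Routing requests through a proxy hides your own IP from the target site. A sketch with urllib's `ProxyHandler`; the proxy address is a placeholder, to be replaced with one from a provider such as those listed above.

```python
import urllib.request

# Placeholder proxy address -- substitute a real one from a proxy provider
proxy = urllib.request.ProxyHandler({
    "http": "http://1.2.3.4:8080",
    "https": "http://1.2.3.4:8080",
})

# Build an opener whose requests all travel through the proxy
opener = urllib.request.build_opener(proxy)
# opener.open("https://example.com")  # the target server sees the proxy's IP
```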
What are high-anonymity, anonymous, and transparent proxies? What's the difference?
1. With a transparent proxy, the target server knows you are using a proxy and also knows your real IP
2. With an anonymous proxy, the target server knows you are using a proxy but does not know your real IP
3. With a high-anonymity proxy, the target server neither knows you are using a proxy nor knows your real IP
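The three levels above differ in which headers the proxy adds to the forwarded request. The classifier below is an illustrative heuristic using the conventional header names (`X-Forwarded-For`, `Via`), not a guaranteed detection method — some proxies use other headers or strip them inconsistently.

```python
def classify_proxy(headers: dict) -> str:
    """Classify a proxy by the headers the target server receives.

    - transparent: forwards your real IP in X-Forwarded-For
    - anonymous:   reveals a proxy is in use (e.g. Via) but hides your IP
    - high-anonymity (elite): adds neither, looks like a direct visit
    """
    if "X-Forwarded-For" in headers:
        return "transparent"
    if "Via" in headers or "Proxy-Connection" in headers:
        return "anonymous"
    return "high-anonymity"

print(classify_proxy({"X-Forwarded-For": "203.0.113.7"}))  # transparent
print(classify_proxy({"Via": "1.1 proxy01"}))              # anonymous
print(classify_proxy({}))                                  # high-anonymity
```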
3. CAPTCHA challenges
CAPTCHA-solving platforms
YunDaMa (cloud CAPTCHA-solving platform)
Chaojiying (Super Eagle)
4. Dynamically loaded pages: the site returns JavaScript, not the page's real data
Selenium drives a real browser to send requests and render the page
5. Data encryption
Analyze the site's JavaScript code to reproduce its encryption or signing logic
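A common pattern is that the site's JavaScript computes a signature over the request parameters, and the server rejects requests without it. Once you have read the JS and understood the recipe, you reproduce it in Python. The routine below is entirely hypothetical (the `sign` name, salt, and "sorted params + salt, then MD5" recipe are invented for illustration); the real recipe always comes from the specific site's own JavaScript.

```python
import hashlib

def sign(params: dict, salt: str = "s3cret") -> str:
    """Hypothetical re-implementation of a site's JS signing routine:
    join sorted key=value pairs, append a salt, take the MD5 hex digest.
    The actual algorithm must be recovered by reading the site's JS."""
    payload = "&".join(f"{k}={v}" for k, v in sorted(params.items())) + salt
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

token = sign({"q": "python", "page": "1"})
print(token)  # a 32-character hex string sent along with the request
```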