01. Crawler Basics

1. Introduction to Python Crawlers

Crawling is a highly practical skill. Crawlers often need to fetch content from commercial or government websites, and those sites may change at any time. In addition, network conditions and website anti-crawler mechanisms can interfere with crawler code demonstrations.

1.1 The use of crawlers

Web crawler: a program that automatically collects information from the Internet according to certain rules.

First, let me ask: it is said that we are now in the "big data era", so where does all that data come from?

1.2 Application Directions

1.2.1 Building a Custom Search Engine

By learning to write crawlers, you can build a custom search engine and gain a deeper understanding of how search engines collect data.

Some people want to understand in depth how search engine crawlers work, or hope to build a private search engine of their own; in that case learning crawlers is essential. Simply put, once we know how to write a crawler, we can use it to collect information from the Internet automatically and then store or process it. When we need to look something up, we only have to search within the collected information, which amounts to a private search engine. Of course, we still have to design how to crawl the information, how to store it, how to segment words, how to compute relevance, and so on; crawler technology mainly solves the information-gathering part.

1.2.2 SEO Optimization

For many SEO practitioners, learning about crawlers provides a deeper understanding of how search engine crawlers work, which in turn makes search engine optimization more effective. Since SEO means optimizing for search engines, you need a very clear picture of how they work, and that includes mastering how their crawlers behave; knowing yourself and the "enemy" this well is what lets you win every battle.

1.2.3 Data Analysis

In the era of big data, when doing big data analysis or data mining, the data source can be websites that publish statistics, or literature and internal materials. However, these channels often cannot satisfy our demand for data, and hunting for the data on the Internet by hand takes far too much effort. In that case we can use crawler technology to automatically fetch the data we are interested in from the Internet, bring it back as our own data source, and then carry out deeper analysis to obtain more valuable information.
1.2.4 Employment

From an employment perspective, crawler engineers are currently in short supply, and their salaries are generally high. Therefore, mastering this technology in depth is very beneficial to employment.

Some people learn crawling in order to find a job or switch careers. From this perspective, crawler engineering is a good choice: demand for crawler engineers keeps growing while people who can fill these positions are relatively few, so it is a fairly in-demand specialty. And as the big data era arrives, crawler technology will be applied more and more widely, with good room for growth in the future.

1.3 Why Use Python for Crawlers

  1. PHP: PHP is "the best language in the world"! But it was not born for this job: its support for multi-threading and asynchronous processing is weak, and its concurrency is limited, while a crawler is a tool program with fairly high demands on speed and efficiency. PHP's efficiency has reportedly improved, but its ecosystem for this still cannot keep up with Python's.
  2. Java: The ecosystem is very complete, and Java is Python's biggest competitor for crawlers. But the language itself is verbose: programs need a lot of code, refactoring is expensive, and any modification ripples through large parts of the codebase. Crawler collection code has to be modified often. After all, life is short...
  3. C/C++: Runtime efficiency is unmatched, but learning and development costs are high; even a small crawler program may take half a day or more to write. In one sentence: don't develop crawlers in C++ unless you enjoy losing your hair.
  4. Python: Elegant syntax, concise code, high development efficiency, and a wealth of modules: the HTTP request and HTML parsing libraries are very rich, and frameworks such as Scrapy and Scrapy-Redis make developing crawlers extremely simple. Learning resources are plentiful. In addition, Python supports asynchronous programming and is very friendly to asynchronous network programming, which is the direction things are heading and suits crawler programs well!
1.3.1 A First Python Crawler

Use Python to write a crawler that fetches the Baidu homepage, as sketched below.
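
Here is a minimal sketch using the third-party requests library (the standard library's urllib would also work); the User-Agent string and encoding handling are my own assumptions, not something the original specifies:

```python
import requests

# Pretend to be an ordinary browser; many sites treat the default requests User-Agent poorly.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Send an HTTP GET request for the Baidu homepage.
response = requests.get("https://www.baidu.com", headers=headers)

# Baidu's homepage is UTF-8 encoded; set it explicitly before reading the text.
response.encoding = "utf-8"

print(response.status_code)   # 200 on success
print(response.text[:200])    # the first 200 characters of the HTML source
```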

2. Crawlers

2.1 Classification of crawlers

2.1.1 General-Purpose Crawlers

General-purpose web crawlers are an important part of the crawling systems of search engines (Baidu, Google, Sogou, etc.). Their main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content and providing search support for the search engine.

How search engines work:


  • Step 1: Crawl the web page

Search engines crawl data from thousands of websites.

  • Step 2: Data Storage

The search engine crawls the web pages through crawlers and stores the data into the original page database (that is, the document library). The page data is exactly the same as the HTML obtained by the user's browser.

  • Step 3: Provide search services and website ranking

The search engine performs various preprocessing steps on the pages crawled back by the crawler: Chinese word segmentation, noise elimination, and index processing.

After organizing and processing this information, the search engine provides keyword retrieval services and displays the relevant results to users, ranked in order.

Search engine limitations:

  • Search engines crawl the entire web page, not specific and detailed information.

  • Search engines cannot provide search results that are specific to a customer's needs.

2.1.2 Focused Crawlers

To address these limitations of general-purpose crawlers, focused crawler technology is widely used. A focused crawler is a web crawler program "oriented to the needs of a specific topic". The difference from a general search engine crawler is that a focused crawler processes and filters content while crawling, trying to ensure that only web page data relevant to the need is collected.

The rest of this course will concentrate on focused crawlers.

2.2 The Robots Protocol

Robots is an agreement between a website and crawlers. It uses a simple plain-text (txt) file to tell crawlers what they are permitted to do. In other words, robots.txt is the first file a search engine looks at when visiting a website: when a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it exists, the spider determines its crawling scope from the contents of the file; if it does not, the spider can access every page on the site that is not password-protected. -- Baidu Encyclopedia

The Robots protocol is also called the crawler protocol or robot protocol; its full name is the "Robots Exclusion Protocol". A website uses it to tell search engines which pages may be crawled and which may not, for example:

Taobao: https://www.taobao.com/robots.txt

Baidu: https://www.baidu.com/robots.txt
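
As a small illustration (a sketch, not part of the original text), Python's standard-library urllib.robotparser can read a robots.txt file and answer whether a given user agent is allowed to fetch a URL:

```python
from urllib.robotparser import RobotFileParser

# Download and parse Baidu's robots.txt.
rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch a given URL;
# the answer depends on the current contents of the robots.txt file.
print(rp.can_fetch("*", "https://www.baidu.com/"))
```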

3. Request and response

HTTP communication consists of two parts: the client's request message and the server's response message.

The process of the browser sending an HTTP request:


  1. When we enter the URL https://www.baidu.com in the browser, the browser sends a Request to obtain the HTML file at https://www.baidu.com, and the server sends a Response object back to the browser.
  2. The browser parses the HTML in the Response and finds that it references many other files, such as image files, CSS files, and JS files. The browser then automatically sends further Requests to fetch those images, CSS files, and JS files.
  3. When all of the files have been downloaded successfully, the web page is rendered completely according to the HTML structure.

Crawling data with crawler technology is, in fact, the same process: requesting data from the server and receiving the server's response data.
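
To make one request/response round trip concrete, here is a small sketch (not from the original text) that prints both sides of the exchange with the requests library:

```python
import requests

response = requests.get("https://www.baidu.com")

# The request message the client actually sent.
print("Request method:", response.request.method)
print("Request headers:", dict(response.request.headers))

# The response message the server sent back.
print("Status code:", response.status_code)
print("Response headers:", dict(response.headers))
print("Body length:", len(response.content))
```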

4. Chrome Developer Tools

Every website implements its pages differently, and each one needs its own analysis. Are there any general analysis methods? Let me share my usual "routine" for analyzing a site: the tool I use most for analyzing pages and captured requests is Chrome Developer Tools.

Chrome Developer Tools is a set of web development and debugging tools built into Google Chrome that can be used to iterate on, debug, and profile websites. Because many domestic browsers are built on the Chrome (Chromium) kernel, home-grown browsers such as UC Browser, QQ Browser, and 360 Browser also include this feature.

Next, let's take a look at some of the more powerful features of Chrome Developer Tools.

4.1 Elements Panel

Through the Elements panel, we can see which tag contains the rendered content we want to capture and which CSS attributes it uses (for example, class="middle"). For instance, to capture the dynamic titles on my Zhihu homepage, I right-click the page and choose "Inspect" to open the Elements panel of Chrome Developer Tools.

In this way we can quickly locate a DOM node on the page and then derive a parsing expression for it: move the mouse over the node, right-click, and choose "Copy" to quickly copy an XPath or CSS selector expression for use with content-parsing libraries.
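
As a sketch of how such a copied expression is used in code (the XPath below is hypothetical, not copied from a real page), using the third-party lxml library:

```python
import requests
from lxml import etree

html_text = requests.get("https://www.baidu.com").text
tree = etree.HTML(html_text)

# A hypothetical XPath copied from the Elements panel via right-click -> Copy -> Copy XPath.
titles = tree.xpath('//div[@class="middle"]/a/text()')
print(titles)
```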

4.2 Console Panel

The Console panel is a separate window used to display JavaScript and DOM object information.

In the JS-decryption part of this crawler course, the Console will be used to debug and run JavaScript code.

4.3 Sources Panel

All of the current web page's source files can be viewed in the Sources panel.

  • The left column shows the source files in a tree structure.
  • The middle column is where JS code is debugged.
  • The right column holds the breakpoint-debugging controls.

The Sources panel will be used later in the JS-decryption topic.

4.4 Network Panel

The Network panel records information about every network request made by the page, including detailed timing data, HTTP request and response headers, cookies, and more. This is what we usually call packet capture.

4.4.1 Toolbar

Stop recording network log

By default, as long as the developer tools are open, every network request is recorded and displayed in the Network panel. Red means recording is on; gray means it is off.

Clear

Clears all data. Clear the previous data each time you start a new analysis.

Filter

The packet filter. Red means the filter bar is open; blue means it is closed.

It is often used to filter HTTP requests by type, for example to show only asynchronous (Ajax) requests, images, or videos.

The largest pane is called the Requests Table, which lists every HTTP request that was made. By default the table is sorted chronologically, with the oldest resources at the top. Click a resource's name to display more information about it.

Requests Table parameters:

  • All: all request data (images, videos, audio, JS code, CSS code)
  • XHR: short for XMLHttpRequest, the core of Ajax; dynamically loaded content that is analyzed frequently
  • CSS: CSS style files
  • JS: JavaScript files, often the target of JS-decryption analysis
  • Img: image files
  • Font: font files (font anti-crawling)
  • Doc: documents, i.e. HTML document content
  • WS: WebSocket, socket-based data communication on the web, generally used for data that updates in real time
  • Manifest: resources cached through the manifest, with information such as a JS library file's address, size, and type

Search

In the search box, anything that appears anywhere in the captured data can be searched for directly. Commonly used for locating data and for JS decryption.

Preserve log

Keeps the log across page loads. When analyzing flows that jump through multiple pages, be sure to check it, otherwise every new navigation clears all of the historical data. When writing crawlers, always keep this checked.

Disable cache

Disables caching of JavaScript and CSS files so that the latest versions are always fetched.

Hide data URLs

Used to hide data URLs. So what is a data URL? The traditional src attribute of an img tag points to a resource on a remote server, and the browser must send a separate request for every external resource. Data URL technology instead embeds the image data directly into the page as a base64 string, integrated with the HTML.
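
As a side note (a sketch, not from the original), a base64 data URL can be decoded back into raw image bytes in Python:

```python
import base64

# A hypothetical, truncated data URL as it might appear in an img tag's src attribute.
data_url = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg=="

# Strip the "data:<mime>;base64," prefix and decode the remaining base64 payload.
header, b64_payload = data_url.split(",", 1)
image_bytes = base64.b64decode(b64_payload)
print(header, "->", len(image_bytes), "bytes")
```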

Has blocked cookies

Shows only requests whose response cookies were blocked. Leave this option unchecked.

Blocked Requests

Shows only blocked requests. Leave this option unchecked.

3rd-party requests

Shows only requests whose origin differs from the page's origin. Leave this option unchecked.

4.5 Request Details

Headers

Displays the headers of the HTTP request and response. Here we can see the request method, the parameters carried with the request, and so on.

  • General

    Request URL : the URL actually requested
    Request Method : the request method
    Status Code : the status code; 200 on success

  • Response Headers

    Data the server sets when it responds; for example, the latest cookie values updated by the server appear here.

  • Request Headers

    The request headers. When data cannot be retrieved, the reason is usually found here; anti-crawling measures also rely on data in the request headers. (A sketch of copying these headers into a crawler follows this list.)
    Accept : the data formats the client accepts (generally ignored)
    Accept-Encoding : the encodings the client accepts (generally ignored)
    Accept-Language : the languages the client accepts (generally ignored)
    Connection : keep the connection alive (generally ignored)
    Cookie : cookie information, i.e. identity information; crawling VIP resources requires it
    Host : the requested host address
    User-Agent : the user agent; the server infers general client information from it
    Sec-xxx-xxx : other headers; they may be useless, or they may need to be forged for anti-crawling checks, so analyze them case by case
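
Here is that sketch: copying the important headers from the Network panel into a requests call (the URL and header values below are placeholders, not real credentials):

```python
import requests

# Headers copied from the Network panel; the values here are placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # pretend to be a browser
    "Cookie": "session=xxxxxx",                                 # placeholder identity information
    "Referer": "https://www.example.com/",                      # hypothetical referring page
}

response = requests.get("https://www.example.com/some/page", headers=headers)
print(response.status_code)
```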

Preview

Preview shows a preview of the request result. It is generally used to view requested images and is especially useful when scraping image websites.

Response

Response shows the result returned for the request. Usually the content is the source code of the whole page; if the request is an asynchronous one, the returned content is generally JSON text data.

This data may differ from the page the browser displays, because the browser loads parts of the page dynamically.
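
A small sketch (the API URL below is hypothetical) of how a crawler handles these two kinds of response bodies:

```python
import requests

# An ordinary page request: the response body is HTML source code.
html = requests.get("https://www.example.com/").text

# A hypothetical asynchronous (XHR) endpoint: the response body is JSON text.
resp = requests.get("https://www.example.com/api/items?page=1")
data = resp.json()  # parse the JSON body into Python objects
print(type(data), len(html))
```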

Initiator

The call stack that initiated the request.

Timing

A breakdown of the timing of the request and response.

Further reading on HTTP transmission:

https://mp.weixin.qq.com/s/aSwXVrz47lAvQ4k0o4VcZg
