Python Crawler: A Collection of Crawler Notes

Foreword

You can think of learning crawlers as borrowing money (say, one million) from your friend Latiao. First, to borrow money from Latiao you need to know his home address. Then you need a way to get to wherever he is (you can walk, or take a car). Once there, Latiao has many things on him: the one million, a lighter, cigarettes, a phone, clothes. You need to filter out the thing you actually want from all of that, and once you have it, you can go deposit the money. The crawler's running process works the same way:
The crawler's process is very important. Once this process is clear in your mind, you have a basic grasp of how a crawler works; you could say you have already learned 20% of crawling.

1. Obtain data address information

Know the URL

First, let's get to know the URL. Its formal name is "Uniform Resource Locator". On the Internet, data is located through URLs (just as, to borrow money from Latiao, you first need to know the address where he currently is). So what do the parts of the URLs we use every day actually mean?
A URL consists of four parts: the protocol, the domain name, the file name (path), and the parameters.
1. The most common protocols are http and https.
2. The domain name is what we usually call the server address.
3. The file name (path) part is where the data we need lives on the server.
4. The parameter part filters the data according to the conditions we are querying.
All in all: to obtain data from the Internet, we first need to know its URL.
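The four parts above can be inspected with Python's standard library. A minimal sketch, using a made-up example URL:

```python
from urllib.parse import urlparse, parse_qs

# A made-up example URL, just to illustrate the four parts.
url = "https://www.example.com/search/list.html?keyword=latiao&page=1"

parts = urlparse(url)
print(parts.scheme)           # protocol part: 'https'
print(parts.netloc)           # domain name part: 'www.example.com'
print(parts.path)             # file name part: '/search/list.html'
print(parse_qs(parts.query))  # parameter part: {'keyword': ['latiao'], 'page': ['1']}
```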

Static and dynamic data

Go back to the case of borrowing money from Latiao. To find Latiao you need his address, but the address he gives you may be his work address; once he goes home, the address changes. The URLs we deal with are the same. The URL you see on a page serves static data, but on some sites the data behind the URL is constantly updated (similar to a news site); data that keeps being loaded in like this is called dynamic data. So how do we distinguish whether the data we want is static or dynamic?
1. Observe the page directly: static data finishes loading quickly, while dynamic data loads in relatively slowly afterwards.
2. Check whether the data you want appears in the page source; if it does not, it is dynamic data.

Packet capture

So how do we get dynamic data? By capturing packets. What is packet capture? As we all know, all the data we obtain on the Internet travels over the network, so we can intercept that data while it is in transit. An analogy: suppose we need to move somewhere for work and rent a flat. The natural idea is that the tenant goes straight to the landlord for listing information, but that is the ideal case; in reality the listings are all in the hands of an agent. So if I want to rent, I first find the agent, the agent gets the good listings from the landlord, the landlord returns the listing information to the agent, and the agent passes it on to me. Packet capture plays the same role: sitting in the middle, I can intercept all the data passing through.
So how do we capture packets? Every browser ships with its own capture tool: right-click on the page and choose Inspect (Google Chrome is recommended here; it is convenient, fast, and more professional). The key panels are:
Elements: the page's code after it has been loaded
Console: can be used to debug the page's code
Sources: the source files of the web page
Network: all the data loaded over the network
These four panels are the key content we need to learn. The dynamic data we want sits under the XHR filter of the Network panel, and that is how we get at the network data we are after.

2. Send network requests

Once we have the target address, the natural first thought is to paste it into the browser's address bar and see what data comes back. To do the same from a crawler, we need code. Python's third-party libraries make this easy: the common ones are urllib, requests, scrapy, and so on; when you are just starting out, requests already meets daily needs. One thing to keep in mind: when a crawler sends requests to someone else's URL, it is unwelcome. It's like borrowing money from Latiao when he doesn't know you; he has no reason to lend to you. Likewise, some websites refuse to hand data to a crawler's request. So what can we do? Disguise ourselves. If you disguise yourself as one of Latiao's relatives or friends, he may well lend you the money. The core of a crawler lies in disguising itself as a browser when sending network requests.
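As a sketch of sending a request with requests (the URL and parameters below are placeholders, not a real endpoint): building the request first lets us inspect exactly what would go over the wire before sending it.

```python
import requests

# Placeholder URL and parameters, for illustration only.
req = requests.Request(
    "GET",
    "https://www.example.com/search",
    params={"keyword": "latiao"},
    headers={"User-Agent": "Mozilla/5.0"},  # disguise as a browser
)
prepared = req.prepare()
print(prepared.url)  # the parameters are encoded into the URL

# To actually send it (requires network access):
# resp = requests.get("https://www.example.com/search",
#                     params={"keyword": "latiao"},
#                     headers={"User-Agent": "Mozilla/5.0"})
```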

Disguising as a client (browser, app)

So how do we disguise ourselves? When we capture packets, each request carries a request header, and in it we can see the fields that were sent. Let's focus on the key ones:
Accept: the data types the browser accepts
Accept-Encoding: the accepted encodings
Accept-Language: the accepted languages
Connection: the type of connection
Cookie: keeps state; it can be used to record your user information. Think of it this way: when you borrowed money from me before, I wrote you an IOU; next time you come to borrow and bring the IOU, I know it's you.
Host: the host being connected to
Referer: the source page, used as an anti-leeching measure; similar to today's travel itinerary code, it says where you came from
User-Agent: the user agent, the browser's identity string, which you can think of as its ID card
These fields are what we bring along to prove our identity when sending a request.
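A minimal sketch of attaching these identity fields to a request, using the standard library's urllib (all header values and the URL here are placeholders):

```python
import urllib.request

# Placeholder header values, for illustration only.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.example.com/",
    "Accept-Language": "en-US,en;q=0.9",
}
req = urllib.request.Request("https://www.example.com/data", headers=headers)

# urllib normalizes header names to Capitalized-with-dashes form.
print(req.get_header("User-agent"))
# To actually send it: urllib.request.urlopen(req)  (requires network access)
```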

Request header encryption

Request headers are not fixed. Sometimes they contain special fields, so which headers you need to add depends on the URL you are requesting. The request fields you see may also be encrypted. If we run into such encrypted parameters, how do we pass them? That requires JS reverse engineering (not covered here).

Request methods

The request method distinguishes the rules of the request to the website. The common ones are GET and POST: GET generally fetches a page's data, while POST submits data to the server (for example, when logging in you need to send an account and password).
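The difference shows up in where the data travels. A sketch with requests (the URL and credentials are placeholders): GET puts parameters in the URL, while POST puts them in the request body.

```python
import requests

# Placeholder URL and form fields, for illustration only.
get_req = requests.Request("GET", "https://www.example.com/login",
                           params={"next": "home"}).prepare()
post_req = requests.Request("POST", "https://www.example.com/login",
                            data={"account": "user1", "password": "secret"}).prepare()

print(get_req.url)    # parameters appear in the URL
print(post_req.url)   # the URL stays clean...
print(post_req.body)  # ...the form data travels in the body instead
```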

3. Extract data

The data a crawler obtains divides into structured data and unstructured data.
Structured data: json, xml
Unstructured data: html
Different kinds of data call for different extraction methods. If we obtain json data, we can convert it directly into a dictionary and read the values out. If we obtain html data, we can extract from it with xpath, bs4, pyquery, regular expressions, and so on. Here we focus on learning xpath.
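For the structured case, a sketch with a made-up json payload (the field names are invented for illustration): once parsed into a dictionary, extraction is just key access.

```python
import json

# A made-up json response, like what a dynamic site's XHR call might return.
raw = '{"code": 0, "data": {"items": [{"title": "first item"}, {"title": "second item"}]}}'

payload = json.loads(raw)  # json text -> Python dict
titles = [item["title"] for item in payload["data"]["items"]]
print(titles)
```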

XPath terminology

Node

In XPath, there are seven types of nodes: elements, attributes, text, namespaces, processing instructions, comments, and document (root) nodes. XML documents are treated as node trees. The root of the tree is called the document node or root node.

Take a look at the following XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author> 
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

Examples of nodes in the XML document above:

<bookstore> (document node)
<author>J K. Rowling</author> (element node)
lang="en" (attribute node)

Basic value (atomic value)

Atomic values are nodes that have no children and no parent.

Examples of basic values:

J K. Rowling
"en"

Item

Items are atomic values or nodes.

Node relationship

Parent

Every element and attribute has a parent.

In the following example, the book element is the parent of the title, author, year, and price elements:

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

Children

An element node can have zero, one, or more children.

In the following example, the title, author, year, and price elements are all children of the book element:

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

Sibling

Nodes with the same parent

In the following example, the title, author, year, and price elements are all siblings:

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

Ancestor

The parent of a node, the parent of the parent, and so on.

In the following example, the ancestors of the title element are the book element and the bookstore element:

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

Descendant

A child of a node, a child of a child, etc.

In the following example, the descendants of bookstore are the book, title, author, year, and price elements:

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

XPath uses path expressions to select nodes or sets of nodes in an XML document. Nodes are selected by following paths or steps.

XML instance document

We will use this XML document in the following examples.

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or step.

The most useful path expressions are listed below:

expression result
nodename Selects all child nodes of the named node.
/ Selects from the root node.
// Selects nodes in the document, starting from the current node, that match the selection, no matter where they are.
. Selects the current node.
.. Selects the parent of the current node.
@ Selects attributes.

example

In the table below, we have listed some path expressions and their results:

path expression result
bookstore Selects all child nodes of the bookstore element.
/bookstore Select the root element bookstore. Note: A path always represents an absolute path to an element if it starts with a forward slash ( / )!
bookstore/book Selects all book elements that are children of bookstore.
//book Selects all book elements, no matter where they are in the document.
bookstore//book Selects all book elements that are descendants of the bookstore element, regardless of where they are located below the bookstore.
//@lang Selects all attributes that are named lang.
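These expressions can be tried with Python's standard library ElementTree, which supports a subset of XPath (full XPath 1.0, including forms like //@lang, needs a third-party library such as lxml). Note that ElementTree paths are relative to the element you call them on, here the bookstore element:

```python
import xml.etree.ElementTree as ET

xml_doc = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(xml_doc)  # root is the <bookstore> element

# bookstore/book/title -> a child path, relative to the current element
titles = [t.text for t in root.findall("./book/title")]
print(titles)

# //book -> a descendant search from the current element
books = root.findall(".//book")
print(len(books))
```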

Predicates

Predicates are used to find a specific node or a node that contains a specified value.

Predicates are enclosed in square brackets.

example

In the table below, we list some path expressions with predicates, and the results of the expressions:

path expression result
/bookstore/book[1] Selects the first book element that is a child element of the bookstore.
/bookstore/book[last()] Selects the last book element that is a child element of the bookstore.
/bookstore/book[last()-1] Selects the penultimate book element that is a child element of the bookstore.
/bookstore/book[position()<3] Selects the first two book elements that are children of the bookstore element.
//title[@lang] Selects all title elements that have an attribute named lang.
//title[@lang='eng'] Selects all title elements that have a lang attribute with the value 'eng'.
/bookstore/book[price>35.00] Selects all book elements of the bookstore element, and the value of the price element must be greater than 35.00.
/bookstore/book[price>35.00]/title Selects all title elements of the book element in the bookstore element, and the value of the price element must be greater than 35.00.
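Most of these predicates also work in the standard library's ElementTree (the numeric comparison price>35.00 does not; that needs lxml). A sketch on the same two-book document:

```python
import xml.etree.ElementTree as ET

xml_doc = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = ET.fromstring(xml_doc)

first = root.find("book[1]/title").text      # first book child
last = root.find("book[last()]/title").text  # last book child
eng = root.findall(".//title[@lang='eng']")  # titles whose lang attribute is 'eng'
print(first, last, len(eng))
```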

Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

wildcard describe
* Matches any element node.
@* Matches any attribute node.
node() Matches any node of any kind.

example

In the table below, we list some path expressions, and the results of these expressions:

path expression result
/bookstore/* Selects all child elements of the bookstore element.
//* Selects all elements in the document.
//title[@*] Selects all title elements with attributes.
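The wildcard forms can be sketched with ElementTree as well (the attribute wildcard [@*] needs lxml; a named attribute test like [@lang] works in the standard library):

```python
import xml.etree.ElementTree as ET

xml_doc = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = ET.fromstring(xml_doc)

children = root.findall("./*")                # /bookstore/* : all child elements
everything = root.findall(".//*")             # //*          : all descendant elements
with_lang = root.findall(".//title[@lang]")   # titles carrying a lang attribute
print(len(children), len(everything), len(with_lang))
```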

Selecting several paths

You can select several paths by using the "|" operator in a path expression.

example

In the table below, we list some path expressions, and the results of these expressions:

path expression result
//book/title | //book/price Selects all title and price elements of the book element.
//title | //price Selects all title and price elements in the document.
/bookstore/book/title | //price Selects all title elements that belong to the book element of the bookstore element, and all price elements in the document.

XML instance document

We will use this XML document in the following example:

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

XPath axis

Axes define the set of nodes relative to the current node.

axis name result
ancestor Selects all ancestors (parents, grandparents, etc.) of the current node.
ancestor-or-self Selects all ancestors of the current node (parents, grandparents, etc.) as well as the current node itself.
attribute Selects all attributes of the current node.
child Selects all child elements of the current node.
descendant Selects all descendant elements (children, grandchildren, etc.) of the current node.
descendant-or-self Selects all descendants (children, grandchildren, etc.) of the current node, as well as the current node itself.
following Selects everything in the document after the closing tag of the current node.
namespace Selects all namespace nodes of the current node.
parent Selects the parent of the current node.
preceding Selects all nodes that appear before the start tag of the current node in the document.
preceding-sibling Selects all siblings before the current node.
self Selects the current node.

Location path expressions

A location path can be absolute or relative.

An absolute location path starts with a forward slash ( / ), while a relative one does not. In both cases, the location path consists of one or more steps, each separated by a slash:

Absolute location path:

/step/step/...

Relative location path:

step/step/...

Each step is evaluated against the nodes in the current node-set.

A step consists of:

  • an axis

    defines the tree relationship between the selected nodes and the current node

  • a node test

    identifies nodes within an axis

  • zero or more predicates

    refine the selected node-set further

Syntax for a step:

axisname::nodetest[predicate]

Example

example result
child::book Selects all book nodes that are children of the current node.
attribute::lang Selects the lang attribute of the current node.
child::* Selects all element children of the current node.
attribute::* Selects all attributes of the current node.
child::text() Selects all text node children of the current node.
child::node() Selects all children of the current node.
descendant::book Selects all book descendants of the current node.
ancestor::book Selects all book ancestors of the current node.
ancestor-or-self::book Selects all book ancestors of the current node, and the current node itself if it is a book node.
child::*/child::price Selects all price grandchildren of the current node.

XPath expressions can return node-sets, strings, booleans, and numbers.

XPath operators

The operators that can be used in XPath expressions are listed below:

operator description example return value
| Union of two node-sets //book | //cd Returns a node-set with all book and cd elements
+ Addition 6 + 4 10
- Subtraction 6 - 4 2
* Multiplication 6 * 4 24
div Division 8 div 4 2
= Equal price=9.80 Returns true if price is 9.80, false if price is 9.90.
!= Not equal price!=9.80 Returns true if price is 9.90, false if price is 9.80.
< Less than price<9.80 Returns true if price is 9.00, false if price is 9.90.
<= Less than or equal price<=9.80 Returns true if price is 9.00, false if price is 9.90.
> Greater than price>9.80 Returns true if price is 9.90, false if price is 9.80.
>= Greater than or equal price>=9.80 Returns true if price is 9.90, false if price is 9.70.
or Or price=9.80 or price=9.70 Returns true if price is 9.80, false if price is 9.50.
and And price>9.00 and price<9.90 Returns true if price is 9.80, false if price is 8.50.
mod Modulus (division remainder) 5 mod 2 1

XML instance document

We will use this XML document in the example below:

"books.xml":

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>

Loading the XML document

All modern browsers support loading XML documents via XMLHttpRequest.

Code for most modern browsers:

var xmlhttp = new XMLHttpRequest();

Code for old Microsoft browsers (IE 5 and 6):

var xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");

Selecting nodes

Unfortunately, Internet Explorer handles XPath differently from the other browsers.

Our examples include code that works in most major browsers.

Internet Explorer uses the selectNodes() method to select nodes from an XML document:

xmlDoc.selectNodes(xpath);

Firefox, Chrome, Opera, and Safari use the evaluate() method to select nodes from XML documents:

xmlDoc.evaluate(xpath, xmlDoc, null, XPathResult.ANY_TYPE,null);

Selecting all titles

The following example selects all title nodes:

/bookstore/book/title
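The same selection can also be done in Python rather than browser JavaScript; a sketch with the standard library's ElementTree on a cut-down version of books.xml (paths are relative to the bookstore element):

```python
import xml.etree.ElementTree as ET

books_xml = """<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(books_xml)
# /bookstore/book/title, written relative to the bookstore element:
titles = [t.text for t in root.findall("./book/title")]
print(titles)
```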

4. Save data

Where the data is stored generally depends on the company's requirements; it basically goes into a database. The main databases to master are MySQL and MongoDB.
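The document recommends MySQL and MongoDB; as a runnable stand-in for the same insert-and-query pattern, here is a sketch with the standard library's sqlite3 (the table and rows are made up):

```python
import sqlite3

# An in-memory database stands in for MySQL/MongoDB here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")

# Rows as a crawler might have extracted them.
rows = [("Harry Potter", 29.99), ("Learning XML", 39.95)]
conn.executemany("INSERT INTO books VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(count)
```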


Origin blog.csdn.net/AI19970205/article/details/124282549