Tools - XPath Helper: automatically obtaining and verifying xpath rules

One, installing xpath helper

1. Purpose and meaning

  • 1) Purpose: XPath Helper is a Chrome extension that helps with parsing web pages when building crawlers

    • It quickly locates the XPath node that corresponds to the target information, produces the xpath rule, extracts the target content, and lets you verify the result
    • The generated xpath can be edited; the matches for the edited expression are shown in the result box next to it and highlighted on the web page, so you can verify an xpath against the live page and see what it returns in real time
    • Application scenario: a crawler that needs only a data source link plus one xpath configuration per web page. Suppose thousands of news websites need their article links extracted, and the people configuring them do not know xpath: they need to pick it up easily and get immediate feedback on whether a configured xpath works. The xpath helper plug-in fits this case well
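The "data source link + one xpath" setup in the application scenario can be sketched in Python. This is only a minimal sketch, assuming the third-party lxml package is installed; `SITE_CONFIG`, `extract_links`, the news.example.com link, and the sample HTML are hypothetical names and data made up for illustration, and a real crawler would download the page instead of using an inline string.

```python
from lxml import etree

# Hypothetical per-site configuration: one source link plus one xpath rule.
SITE_CONFIG = {
    "https://news.example.com/": '//ul[@class="news-list"]/li/a/@href',
}

# Inline stand-in for the downloaded page (assumption: no network access).
SAMPLE_HTML = """
<html><body>
  <ul class="news-list">
    <li><a href="/article/1">First article</a></li>
    <li><a href="/article/2">Second article</a></li>
  </ul>
</body></html>
"""


def extract_links(page_html, xpath_rule):
    """Apply one configured xpath rule to a page and return the matches."""
    tree = etree.HTML(page_html)
    return [str(value) for value in tree.xpath(xpath_rule)]


for source_link, xpath_rule in SITE_CONFIG.items():
    links = extract_links(SAMPLE_HTML, xpath_rule)
    print(source_link, links)
```

Adding a new site then means adding one entry to the configuration, which is the part XPath Helper helps people write and check.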

2. Install xpath helper

  • 1) Download address: xpath helper download. The downloaded file has a .crx suffix. As shown in Figure 1, click the three dots in the upper right corner of Google Chrome, then More Tools > Extensions, to open the extensions page

  • 2) As shown in Figure 2, first turn on developer mode in the upper right corner, then drag the downloaded .crx plug-in onto the area shown in the figure to complete the installation

  • 3) As shown in Figure 3, pin xpath helper to the browser toolbar so the plug-in is quick to reach later

3. Install Pasty

  • 1) Pasty is a browser plug-in that opens a batch of links at once. Download address: Pasty download. The installation steps are the same as for xpath helper

  • 2) How to use: copy the URLs (dozens at a time is fine), switch to the browser, and click the Pasty button in the upper right corner; the browser then opens the copied URLs in a batch

Two, ways to obtain an xpath with xpath-helper

  • 1) Use: after installing the plug-in in Google Chrome and restarting the browser, there are generally two ways to obtain an xpath

    • One is the extraction built into xpath helper
    • The other relies on the Google Chrome developer tools
  • 2) Defect: whether you use the built-in Ctrl+Shift shortcut or the Copy XPath function of the developer tools (Elements panel), the extracted xpath is sometimes too long, which makes it hard to read and maintain. You can rewrite it by hand into a concise form, but that requires some knowledge of xpath rules; sections Three, Four, and Five cover the syntax and some advanced usage
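To illustrate the difference between a long positional xpath and a concise one, here is a minimal sketch, assuming the third-party lxml package is installed. The sample HTML and the `news` id are made up for illustration; lxml's `getpath` is used only as a stand-in for the kind of structural path that tools can generate, not as what XPath Helper or Chrome actually output.

```python
from lxml import etree

# Hypothetical sample page.
SAMPLE_HTML = """
<html><body>
  <div><p>banner</p></div>
  <div>
    <ul id="news">
      <li><a href="/a1">First</a></li>
      <li><a href="/a2">Second</a></li>
    </ul>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# A hand-written, concise rule anchored on the id attribute.
concise_rule = '//ul[@id="news"]/li[1]/a'
target = tree.xpath(concise_rule)[0]

# lxml can generate a purely structural path for an element, similar in
# spirit to the long positional paths that tools sometimes produce.
long_rule = tree.getroottree().getpath(target)
print(long_rule)

# Both rules select the same element.
same_element = tree.xpath(long_rule)[0] is target
print(same_element)
```

The concise rule keeps working if another div is added above the list, while the positional path depends on every level of the page layout staying the same.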

1. The extraction built into xpath helper

  • 1) Press CTRL + SHIFT + X, or click the xpath helper icon in the upper right corner, to open the XPath Helper plug-in
  • 2) Hold CTRL + SHIFT and move the mouse over the area to be extracted; press X to turn extraction on or off. The extracted area is highlighted, its xpath rule appears on the left side of the xpath helper console at the top of the page, and the matched content is shown in real time on the right

2. Rely on Google Developer Tools

  • 1) Press F12, or right-click and choose Inspect, to open the Chrome developer tools (Elements panel) on the right. Click the arrow button in the upper left corner of the developer tools, then go back to the web page on the left and hover over or click some content; the html element for that content is shown in sync on the right

  • 2) Right-click the html element on the right and choose Copy XPath to get the xpath rule, then paste it into the xpath helper console at the upper left; the corresponding content is matched there as well. The xpath obtained this way is often noticeably simpler than the one from the first method, so this is the method I recommend

Three, xpath basic syntax

1. Understand HTML tags

  • 1) HTML (HyperText Markup Language): the following is a sample of html tags. Editing the html content changes how the front-end web page is displayed, for example the font size or image size

    <!DOCTYPE html>
    <html>
    	<head>
    		<meta charset="utf-8">
    		<title>title</title>
    	</head>
    	<body>
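    		<h4>1. Heading tags are defined with the "h1-h6" tags: h1 defines the largest heading and h6 the smallest</h4>
    		<p>2. Paragraph tag</p>
    		<a href="https://www.baidu.com/">3. Click to jump to Baidu; the href attribute points to the Baidu link</a>
    		<br>
    		<span>4. The img below is an image tag</span>
    		<br>
    		<img src="https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png" alt="logo image" width="304" height="228">
    		<br>
    		<div>5. The table below is a table tag</div>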
    		<table border="1">
    			<tr><th>Header 1</th><th>Header 2</th></tr>
    			<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>
    			<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>
    		</table>
    		
    	</body>
    </html>
    
  • 2) As shown in the figure, in an online html editor, the html content written on the left is rendered as the corresponding page on the right

    • Feature 1: HTML tags are keywords surrounded by angle brackets, such as <html>; the outermost root tag is <html>, which contains <head> and <body>
    • Feature 2: HTML tags usually appear in pairs, such as <body> and </body> or <a> and </a>; a few, such as <img> and <br>, are void elements with no end tag
    • Feature 3: The first tag in a pair is the start tag, such as <a>, and the second is the end tag, such as </a>
    • Feature 4: Tags nest hierarchically, which gives parent/child, ancestor/descendant, and sibling relationships; for example, the child tags of <body></body> include the <p></p> and <a></a> tags
  • 3) Further reading: an introduction to the meaning of more html tags
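The hierarchy described in Feature 4 can be inspected from code. The following is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML is a shortened, hypothetical version of the page above.

```python
from lxml import etree

SAMPLE_HTML = """
<html>
  <head><title>title</title></head>
  <body>
    <p>paragraph</p>
    <a href="https://www.baidu.com/">link</a>
  </body>
</html>
"""

root = etree.HTML(SAMPLE_HTML)
body = root.find("body")

# Children of <body>: the tags nested directly inside it.
child_tags = [child.tag for child in body]

# The parent of <body> is the root <html> tag.
parent_tag = body.getparent().tag

# <p> and <a> are siblings: the element right after <p> is <a>.
next_sibling_tag = body.find("p").getnext().tag

print(child_tags, parent_tag, next_sibling_tag)
```

These parent, child, and sibling relationships are what xpath rules navigate.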

2. Understand xpath rules

  • 1) xpath rules: typically used to parse html, locate specific tag elements, and obtain specified attributes or content. They usually look like this:

    Find the div tags under the body tag, starting from the root node: /html/body/div
    Find all a tags in the html document: //a
    Get the text content of a span tag: //span[@class="title-content-title"]/text()
    Get the href attribute value of an a tag: //*[@id="hotsearch-content-wrapper"]/li[1]/a/@href
    Get the src attribute value of an img tag: //*[@id="s-top-more"]/div[1]/a[1]/img/@src
    


  • 2) Basic syntax of xpath rules, using https://www.baidu.com/ as the case URL
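The five rule forms listed above can be tried outside the browser as well. This is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML is a small hypothetical page that only borrows the ids and class names from the rules and is not the real Baidu markup.

```python
from lxml import etree

# Hypothetical page that reuses the ids and class names from the rules.
SAMPLE_HTML = """
<html><body>
  <div id="s-top-more">
    <div><a href="/more"><img src="/img/logo.png"></a></div>
  </div>
  <ul id="hotsearch-content-wrapper">
    <li><a href="/s?wd=one"><span class="title-content-title">one</span></a></li>
    <li><a href="/s?wd=two"><span class="title-content-title">two</span></a></li>
  </ul>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

rules = [
    '/html/body/div',
    '//a',
    '//span[@class="title-content-title"]/text()',
    '//*[@id="hotsearch-content-wrapper"]/li[1]/a/@href',
    '//*[@id="s-top-more"]/div[1]/a[1]/img/@src',
]

results = {}
for rule in rules:
    matches = tree.xpath(rule)
    # Elements are shown by tag name; text and attribute results are strings.
    results[rule] = [m if isinstance(m, str) else m.tag for m in matches]
    print(rule, "->", results[rule])
```

Rules ending in a tag name return elements, while rules ending in text() or @attribute return strings.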

Four, use cases of xpath-helper

1. Get the text content - text()

  • 1) xpath rule: //span[@class="title-content-title"]/text()
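The same text() rule can be evaluated in Python. This is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML is hypothetical and only reuses the class name from the rule.

```python
from lxml import etree

SAMPLE_HTML = """
<html><body>
  <span class="title-content-title">  first topic </span>
  <span class="title-content-title">second topic</span>
  <span class="other">not selected</span>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# text() returns the text nodes directly inside each matching span.
raw_texts = tree.xpath('//span[@class="title-content-title"]/text()')

# Page text often carries stray whitespace, so strip it before use.
titles = [text.strip() for text in raw_texts]
print(titles)
```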

2. Get the link of an a tag - @href

  • 1) xpath rule: //*[@id="hotsearch-content-wrapper"]//a/@href
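Extracted href values are often relative, so a crawler usually resolves them against the page URL. This is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML is hypothetical, and resolving links with `urljoin` is an extra step added here for illustration, not something the plug-in does.

```python
from urllib.parse import urljoin

from lxml import etree

BASE_URL = "https://www.baidu.com/"  # the case URL used in this article

SAMPLE_HTML = """
<html><body>
  <ul id="hotsearch-content-wrapper">
    <li><a href="/s?wd=first">first</a></li>
    <li><div><a href="https://www.baidu.com/s?wd=second">second</a></div></li>
  </ul>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# The double slash before "a" matches links at any depth under the wrapper.
hrefs = tree.xpath('//*[@id="hotsearch-content-wrapper"]//a/@href')

# Relative links are resolved against the page URL; absolute ones are kept.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs]
print(absolute_links)
```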

3. Get the link of an img tag - @src

  • 1) xpath rule: //div[@id="lg"]/img/@src
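The @src rule works the same way as @href. This is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML and the image paths are hypothetical, and the protocol-relative src (starting with //) is included only to show that `urljoin` fills in the scheme.

```python
from urllib.parse import urljoin

from lxml import etree

BASE_URL = "https://www.baidu.com/"  # the case URL used in this article

SAMPLE_HTML = """
<html><body>
  <div id="lg"><img src="//www.baidu.com/img/logo.png" alt="logo"></div>
  <div id="other"><img src="/img/banner.png"></div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Only the img directly inside the div with id "lg" is selected.
srcs = tree.xpath('//div[@id="lg"]/img/@src')

# urljoin turns the protocol-relative src into a full URL.
image_urls = [urljoin(BASE_URL, src) for src in srcs]
print(image_urls)
```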

Five, advanced usage of xpath

1. Sequential position selection

//ul[@class="s-hotsearch-content"]/li[1]
//ul[@class="s-hotsearch-content"]/li[last()]
//ul[@class="s-hotsearch-content"]/li[position()>2]
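The three position rules above can be checked against a small list. This is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML is hypothetical and only reuses the class name from the rules.

```python
from lxml import etree

SAMPLE_HTML = """
<html><body>
  <ul class="s-hotsearch-content">
    <li>item 1</li><li>item 2</li><li>item 3</li><li>item 4</li>
  </ul>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)
base = '//ul[@class="s-hotsearch-content"]'

# xpath positions start at 1, not 0.
first = tree.xpath(base + '/li[1]/text()')

# last() refers to the final li in the list.
last = tree.xpath(base + '/li[last()]/text()')

# position()>2 keeps the third li and everything after it.
after_second = tree.xpath(base + '/li[position()>2]/text()')

print(first, last, after_second)
```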

2. Attribute/text fuzzy matching

//title[text()='百度一下,你就知道']
//*[contains(text(),'百度')]
//span[text()>2]

//*[contains(@class,'title')]
//ul[starts-with(@class,'s-')]
//div[not(@class="hot-title")]
//li/attribute::class
//div[@id!='right']

//ul[@*] 
//ul/node() and //ul/*
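To see what each matching rule in this subsection selects, the following minimal sketch counts the matches on a small page. It assumes the third-party lxml package is installed; the sample HTML is hypothetical, and the Chinese text literals in the rules are replaced with English ones ('Baidu home', 'Baidu') so they match the sample.

```python
from lxml import etree

# Hypothetical page; class names are borrowed from the rules above.
SAMPLE_HTML = """
<html><head><title>Baidu home</title></head><body>
  <div id="left" class="hot-title">hot</div>
  <div id="right">side</div>
  <ul class="s-hotsearch-content">
    <!-- hot list -->
    <li class="item">
      <span class="title-content-index">3</span>
      <span class="title-content-title">Baidu news</span>
    </li>
    <li class="item"><span class="title-content-index">1</span></li>
  </ul>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

rules = [
    "//title[text()='Baidu home']",    # exact text match
    "//*[contains(text(),'Baidu')]",   # text contains a substring
    "//span[text()>2]",                # text compared as a number
    "//*[contains(@class,'title')]",   # attribute contains a substring
    "//ul[starts-with(@class,'s-')]",  # attribute starts with a prefix
    '//div[not(@class="hot-title")]',  # negated condition
    "//li/attribute::class",           # same as //li/@class
    "//div[@id!='right']",             # has an id that is not 'right'
    "//ul[@*]",                        # ul with at least one attribute
]

counts = {rule: len(tree.xpath(rule)) for rule in rules}
for rule, count in counts.items():
    print(f"{count} match(es): {rule}")

# node() also returns comments and text nodes; * returns child elements only.
all_nodes = tree.xpath("//ul/node()")
only_elements = tree.xpath("//ul/*")
print(len(all_nodes), len(only_elements))
```

Note that //div[@id!='right'] only matches div tags that have an id attribute, whereas //div[not(@class="hot-title")] also matches div tags without any class.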

3. Combining conditions with and / or

//ul[starts-with(@class,'s-')]|//title[text()='百度一下,你就知道']
//ul[not(@class="tbhead") and @class="s-hotsearch-content"]
//div[@class="title-content-noindex" or @class="content-wrap"]
//span[starts-with(@class,'hot')][text()='换一换']
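The four ways of combining conditions above (union with |, and, or, and stacked predicates) can be compared on a small page. This is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML is hypothetical, and the Chinese text literals are replaced with English ones ('Baidu home', 'refresh') to match it.

```python
from lxml import etree

SAMPLE_HTML = """
<html><head><title>Baidu home</title></head><body>
  <ul class="s-hotsearch-content"><li>one</li></ul>
  <ul class="tbhead"><li>two</li></ul>
  <div class="title-content-noindex">a</div>
  <div class="content-wrap">b</div>
  <div class="footer">c</div>
  <span class="hot-refresh-text">refresh</span>
  <span class="hot-title">title</span>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# | is a union: the results of both rules, in document order.
union = tree.xpath(
    "//ul[starts-with(@class,'s-')]|//title[text()='Baidu home']"
)
union_tags = [element.tag for element in union]

# and: both conditions must hold for the same ul.
both = tree.xpath('//ul[not(@class="tbhead") and @class="s-hotsearch-content"]')

# or: either class value is accepted.
either = tree.xpath(
    '//div[@class="title-content-noindex" or @class="content-wrap"]'
)

# Stacked predicates filter one after another, which acts like "and" here.
stacked = tree.xpath("//span[starts-with(@class,'hot')][text()='refresh']")

print(union_tags, len(both), len(either), len(stacked))
```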

4. Ancestor/sibling nodes (axes)

//ul[@class="s-hotsearch-content"]/ancestor::*
//ul[@class="s-hotsearch-content"]/ancestor::div
//div[not(@class="hot-title")]/following::*
//div[not(@class="hot-title")]/following-sibling::*
//div/preceding::*
//div/preceding-sibling::*
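The axes above differ in which direction they walk from the starting node. This is a minimal sketch, assuming the third-party lxml package is installed; the sample HTML and the `tags` helper are hypothetical, and the rules are adapted to start from one specific node so the results are easy to read.

```python
from lxml import etree

SAMPLE_HTML = """
<html><head><title>t</title></head><body>
  <div id="wrapper">
    <div class="hot-title">title</div>
    <ul class="s-hotsearch-content"><li>one</li></ul>
    <p>footer</p>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)


def tags(rule):
    """Hypothetical helper: tag names of the matched elements."""
    return [element.tag for element in tree.xpath(rule)]


ul = '//ul[@class="s-hotsearch-content"]'
hot = '//div[@class="hot-title"]'

ancestors = tags(ul + "/ancestor::*")         # every enclosing tag
ancestor_divs = tags(ul + "/ancestor::div")   # enclosing div tags only
following = tags(hot + "/following::*")       # everything after the node
following_siblings = tags(hot + "/following-sibling::*")  # same parent only
preceding = tags("//p/preceding::*")          # everything before, minus ancestors
preceding_siblings = tags("//p/preceding-sibling::*")     # same parent only

print(ancestors, ancestor_divs)
print(following, following_siblings)
print(preceding, preceding_siblings)
```

The sibling axes stay within the same parent, while following and preceding cover the rest of the document in that direction.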


Origin blog.csdn.net/weixin_43411585/article/details/128908199