Learning to build a web crawler with JMeter

Learning website: https://www.cnblogs.com/Zfc-Cjk/p/9937269.html

Learning .jmx file: a .jmx test plan for a court website that I had never heard of

Problem encountered: garbled characters (an encoding issue). Solution: https://www.cnblogs.com/shishibuwan/p/11307194.html

Summary of the approach after studying: in short, send a request to the page, extract all the values from the response, and traverse them with a ForEach Controller:

Get the URL and decide which fields need to be crawled;

Loop over the extracted values with the ForEach Controller;

Finally, write the output to a local file.

1、Getting the page

 

 2、XPath Extractor

The XPath Extractor works on the response of the request it is attached to. It suits cases where the response is an XML fragment. Right-click the request whose data you need and add a Post Processor -> XPath Extractor.

XPath is mostly used when the response is XML.

 

 

Use Tidy: select this option when the response is HTML. When the response is XML or XHTML (for example, an RSS feed), uncheck it.

Reference Name: the name of the variable that stores the extracted value.

XPath Query: the XPath expression used to extract the value.

Default Value: the value assigned to the variable when nothing matches.

Match No.: 0 takes a random match, a positive number N takes the Nth match, and -1 takes all matches
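What "-1 takes all matches" produces can be imitated outside JMeter with the JDK's javax.xml.xpath API. This is only a sketch under assumptions: the XML fragment is made up, and extractAll is a hypothetical helper, not JMeter code. It mirrors the JMeter convention of storing multiple matches as name_1 ... name_N plus name_matchNr.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MatchAllSketch {
    // Hypothetical helper: stores every match of the query as refName_1..refName_N
    // and the number of matches as refName_matchNr.
    static Map<String, String> extractAll(String xml, String query, String refName) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes;
        try {
            nodes = (NodeList) xpath.evaluate(
                    query, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath query or XML: " + query, e);
        }
        Map<String, String> vars = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            vars.put(refName + "_" + (i + 1), nodes.item(i).getNodeValue());
        }
        vars.put(refName + "_matchNr", String.valueOf(nodes.getLength()));
        return vars;
    }

    public static void main(String[] args) {
        // Made-up response fragment for illustration only.
        String xml = "<ul><li><a href='/a'>First</a></li><li><a href='/b'>Second</a></li></ul>";
        System.out.println(extractAll(xml, "//a/text()", "title1"));
    }
}
```

The numbered variables are what a ForEach Controller later iterates over.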

 

When to use the XPath Extractor versus the Regular Expression Extractor:

If the text to extract is an attribute value of an element on the page, the XPath Extractor is recommended;

If the position of the text on the page is not fixed, or the text is not an element attribute, the Regular Expression Extractor is recommended.
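The regular-expression case can be pictured with java.util.regex. This is a minimal sketch, assuming a made-up response body and a hypothetical helper firstGroup. The pattern finds the number wherever it appears in the text, without relying on the element structure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String body, String regex) {
        Matcher m = Pattern.compile(regex).matcher(body);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up response body: the count is plain text, not an attribute value.
        String body = "<p>updated today, total: 42 items</p>";
        System.out.println(firstGroup(body, "total: (\\d+)"));
    }
}
```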

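3、XPath Extractor basic syntax

//*[@class='A']/@href selects the href attribute of every node with class = A, searching from the root

//*[@class='A']/text() selects the text inside every node with class = A, searching from the root

//*[contains(@class,'A')] selects every node whose @class value contains A, searching from the root

substring-before(.//*[@class='A']/text(),'0') returns the part of the text selected by .//*[@class='A']/text() that comes before the first '0'; if there is no '0', it returns an empty string

substring-after(.//*[@class='A']/text(),'0') returns the part of the text selected by .//*[@class='A']/text() that comes after the first '0'; if there is no '0', it returns an empty string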
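These expressions can be tried outside JMeter with the JDK's built-in XPath 1.0 engine. The sketch below uses a made-up XML fragment and a hypothetical helper eval. When a query selects several nodes, the string form of evaluate returns the value of the first one.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathSyntaxSketch {
    // Made-up fragment: the first link has class exactly "A", the second has class "A big".
    static final String XML =
            "<div><a class='A' href='/list/1'>Price 100 yuan</a>"
            + "<a class='A big' href='/list/2'>No digits</a></div>";

    // Hypothetical helper: evaluates an XPath expression and returns the result as a string.
    static String eval(String xml, String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath expression: " + expr, e);
        }
    }

    public static void main(String[] args) {
        String[] queries = {
            "//*[@class='A']/@href",                            // attribute of the exact-class node
            "//*[@class='A']/text()",                           // text of the exact-class node
            "count(//*[contains(@class,'A')])",                 // both links contain "A" in @class
            "substring-before(.//*[@class='A']/text(),'0')",    // text before the first '0'
            "substring-after(.//*[@class='A']/text(),'0')",     // text after the first '0'
            "substring-before(.//*[@class='A big']/text(),'0')" // no '0' present: empty string
        };
        for (String q : queries) {
            System.out.println(q + " -> [" + eval(XML, q) + "]");
        }
    }
}
```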

For more details, see:

https://www.blazemeter.com/blog/using-xpath-extractor-jmeter-0/

.//a[@class='linkto']/@href narrows the search step by step (a, then class, then href): it finds every a element with class = linkto and matches the href attribute of each one

 

Result extracted by //a[@href]/text()

 

 

4、ForEach Controller

 

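The input variable prefix is title1. With "Add _ before number" checked in the lower left corner, the prefix becomes "title1_". Start index for loop is 16 and end index for loop is 17; the range is open on the left and closed on the right, i.e. (16,17]. The ForEach Controller takes the variables after the start index in turn (title1_17, title1_18, and so on) and assigns each one to financial_type; with this range it takes only title1_17.
As shown in the figure below, the loop starts from title1_17: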
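The index rule can be checked with a small simulation. This is a sketch, not JMeter code: iterate is a hypothetical helper, the variable values are made up, and stopping at the first missing variable is an assumption about how the loop ends.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ForEachSketch {
    // Hypothetical helper imitating the loop: the start index is exclusive, the end index inclusive.
    static List<String> iterate(Map<String, String> vars, String prefix,
                                boolean addUnderscore, int start, int end) {
        String separator = addUnderscore ? "_" : "";
        List<String> visited = new ArrayList<>();
        for (int i = start + 1; i <= end; i++) {
            String value = vars.get(prefix + separator + i);
            if (value == null) {
                break; // assumption: the loop ends when the next variable does not exist
            }
            visited.add(value); // in JMeter this value would go into the output variable
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up variables, as an extractor with "take all matches" might have stored them.
        Map<String, String> vars = new HashMap<>();
        vars.put("title1_16", "sixteenth title");
        vars.put("title1_17", "seventeenth title");
        vars.put("title1_18", "eighteenth title");

        // (16,17] visits only title1_17
        System.out.println(iterate(vars, "title1", true, 16, 17));
    }
}
```

Widening the end index to 18 would visit title1_17 and then title1_18.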

 

 5、Getting the second-level titles

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// BeanShell script: JMeter substitutes ${title1}, ${title2} and ${text} before the script runs.
// Target directory (目录 = "directory", 2级目录 = "second-level directory").
String filePar = "D:\\目录\\目录_${title1}\\2级目录_${title2}";
File myPath = new File(filePar);
if (!myPath.exists()) {
    myPath.mkdirs();
    System.out.println("Created directory: " + filePar);
}

// Output file name (列表 = "list").
String filename = "列表_${title2}.txt";
try {
    // true = append to the file instead of overwriting it
    FileWriter fw = new FileWriter(filePar + "\\" + filename, true);

    String originalLine = "${text}";
    System.out.println("*** " + originalLine);
    fw.write(originalLine);
    fw.close();
} catch (IOException e) {
    e.printStackTrace();
}
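
The same step can also be written as plain Java that takes the values as parameters instead of relying on ${...} substitution. This is only a sketch under assumptions: appendText is a hypothetical helper, the ASCII names dir_, sub_ and list_ stand in for the Chinese directory and file names above, and the base directory is passed in rather than fixed to D:. Writing the bytes as UTF-8 explicitly is one way to keep the output from depending on the platform default encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendSketch {
    // Hypothetical helper: creates baseDir/dir_<title1>/sub_<title2>/ if needed and
    // appends one line of text to list_<title2>.txt inside it.
    static Path appendText(Path baseDir, String title1, String title2, String text) {
        try {
            Path dir = baseDir.resolve("dir_" + title1).resolve("sub_" + title2);
            Files.createDirectories(dir); // no error if the directories already exist
            Path file = dir.resolve("list_" + title2 + ".txt");
            byte[] line = (text + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
            Files.write(file, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("crawl");
        Path file = appendText(base, "news", "sports", "first entry");
        appendText(base, "news", "sports", "second entry");
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```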

 

Origin www.cnblogs.com/shishibuwan/p/11330301.html