Methods for detecting dead links on web pages

 

Before looking at the test methods, it helps to understand the concepts behind links and dead links.

  • Kinds of dead links
  1. Protocol dead links: the TCP or HTTP status of the page itself signals the failure; common
    status codes are 404, 403, and 503.
  2. Content dead links: the server returns a normal status,
    but the content has been replaced by a message unrelated to the original page, such as
    "does not exist", "has been deleted", or "permission required".
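
The two kinds above can be told apart in code roughly as follows. This is a minimal sketch, not a definitive rule: the function name `classify_response` and the marker phrases in `CONTENT_DEAD_MARKERS` are assumptions made for illustration, and real sites will need their own phrases.

```python
# Hypothetical marker phrases; tune these for the site under test.
CONTENT_DEAD_MARKERS = (
    "page not found",
    "does not exist",
    "has been deleted",
    "permission required",
)


def classify_response(status, body):
    """Return 'protocol', 'content', or 'ok' for one fetched page."""
    # Protocol dead link: the HTTP status itself reports the failure.
    if status >= 400:
        return "protocol"
    # Content dead link: status looks normal, but the body is an error notice.
    text = body.lower()
    if any(marker in text for marker in CONTENT_DEAD_MARKERS):
        return "content"
    return "ok"
```

Keyword matching of this kind is only a heuristic: a normal article that happens to contain one of the phrases would be flagged as well.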
  • Causes of dead links
  1. The site's directory structure was changed.
  2. A file on the server was moved or deleted.
  3. The server is misconfigured.
  4. A dynamic link depends on a database that is no longer available.
  • Impact of dead links
  1. They break functionality and hurt the user experience.
  2. They reduce the number of pages indexed by search engines and lower the site's weight in search rankings.
  3. They can slow down site loading.
  4. They damage the overall image of the site.
  • HTML link syntax

The a tag
Use the href attribute to create a link to another document:
<a href="url">Link text</a>
Use the name attribute to create a bookmark within the document:
<a name="label">Anchor (text displayed on the page)</a>
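
To make the syntax above concrete, here is a small sketch that pulls both kinds of `a` tag out of a page using Python's standard `html.parser` module. The names `LinkExtractor` and `extract_links` are illustrative choices, not part of any tool mentioned in this article.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href targets and name bookmarks from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []    # values of <a href="...">
        self.anchors = []  # values of <a name="...">

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)
            elif name == "name" and value:
                self.anchors.append(value)


def extract_links(html):
    """Return (hrefs, anchors) found in the given HTML text."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.hrefs, parser.anchors
```

For example, `extract_links('<a href="/about">About</a><a name="top">Top</a>')` yields one href (`/about`) and one bookmark (`top`).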

  • A link can be text or an image; clicking it jumps to a new target.

    Targets:

  1. Another page;
  2. A different location on the same page;
  3. An image, an e-mail address, or a file;
  4. An application.
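
A checker usually needs to know which kind of target an href points at before deciding how to test it. The sketch below groups hrefs with `urllib.parse.urlparse`; the function name `link_target_kind` and the category labels are assumptions, and pages and files are lumped together because they cannot be told apart reliably from the URL alone.

```python
from urllib.parse import urlparse


def link_target_kind(href):
    """Roughly classify an href by the kind of target it points to."""
    parts = urlparse(href)
    if parts.scheme == "mailto":
        return "email"
    # A bare "#fragment" points to another location on the same page.
    if not parts.scheme and not parts.netloc and not parts.path and parts.fragment:
        return "same-page"
    # Relative URLs and http(s) URLs lead to pages, images, or other files.
    if parts.scheme in ("", "http", "https"):
        return "page-or-file"
    # Any other scheme is handed to an external application.
    return "application"
```

Only the `page-or-file` group can be checked with an HTTP request; the other groups need different handling, or can be skipped.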

Dead link detection is a basic, routine checkpoint in web page testing. The relevant test methods are described below:

[Method one] Manual clicking

Check by hand, as part of functional testing, whether the links work: identify which parts of the page are links, click each one, and observe whether the target is correct.

Disadvantages:

  • Low efficiency: the tester has to rule out other page elements (non-link text, images, buttons, and so on), then click, wait, and judge each link by hand, which is time-consuming;
  • Human error: testers easily fall into a fixed mindset on routinely repeated test items, or the change scope given by the developers is incomplete; either can cause dead links to be missed.

[Method two] Web-based testing tool: Webmaster Tools

Open the web-based detection tool, enter the link of the site to be checked, and click Query.

Advantages:

  • Easy to use.

Disadvantages:

  • Works only for the online (production) environment;
  • Checks only URLs, not other site elements or resources;
  • Detects only protocol dead links;
  • Traverses only shallow levels: links on sub-pages are not followed further.

[Method three] Software-based detection tool: Xenu

Download the tool, enter the link of the site to be tested (either a test environment or the online environment), configure the detection settings, and start the check.

Advantages:

  • Comprehensive: starting from the root of the site under test, it searches all files and reads every page's hyperlinks, images, files, include files, CSS files, and in-page links;
  • Efficient: supports up to 100 threads, so checking is very fast;
  • Records links to files or pages that do not exist,
    together with the specific location where each problem link appears;
  • Can output test reports and send e-mail notifications;
  • Can recheck links that failed.

Status categories in the check report:

  • Normal link: ok, mail host ok;
  • Access timed out or unreachable: timeout, no connection, no such host;
  • Link target not found (empty link): not found;
  • Nothing returned, that is, an empty page: no info to return;
  • No object data, commonly seen with 400 errors when accessing the server: no object data.

Disadvantages:

  • Not open source

[Method four] Programming

If you were to detect dead links programmatically, how would you go about it?

[Idea one] Crawler approach

First crawl the site to collect all the links, then check whether each link is valid.
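
The crawl-then-check idea can be sketched as a breadth-first traversal with a visited set and a depth limit. This is an illustrative outline, not the code of either project listed in this section: the function name `crawl`, the `max_depth` parameter, and the injected `fetch(url)` callable (assumed to return a status code and the list of hrefs found on the page) are all assumptions, chosen so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; return a list of (url, status) for dead links.

    fetch(url) must return (status_code, list_of_hrefs). It is passed in
    so that the HTTP client can be swapped out or faked in tests.
    """
    seen = {start_url}                # never queue the same URL twice
    queue = deque([(start_url, 0)])   # FIFO queue gives breadth-first order
    dead = []
    while queue:
        url, depth = queue.popleft()
        status, hrefs = fetch(url)
        if status != 200:
            dead.append((url, status))
            continue
        if depth >= max_depth:
            continue                  # checked, but do not follow its links
        for href in hrefs:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return dead
```

The visited set and the depth limit are what keep the traversal from looping forever on pages that link back to each other. A real `fetch` would also need timeout handling and a filter for off-site links.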

For example:

  • [Python] Multithreaded website dead link detection tool

     [Project page](https://github.com/Flowerowl/pylinktester)
    

    Idea: a thread manager starts crawler threads that crawl links breadth-first, and also starts checker threads that check the crawled links. Links that were crawled successfully are not checked again; the others are rechecked (based on Python 2).

        Design points:
        1. Configurable thread count and crawl depth;
        2. Handles link timeouts, with a configurable number of timeout retries;
        3. Keeps a set of crawled links and, during checking, a set of unvisited links, so nothing is checked twice;
        4. Writes logs to a file;
        5. Crawler threads use a breadth-first algorithm.
    
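  • A Python script that checks the validity of site links

     [Project page](https://github.com/TronGeek/CheckLinks-Python)
    

    Idea: parse the a tags in each response, traverse all links on the pages, including image, JS, and CSS links, and check whether each one returns status 200 (based on Python 3).

        Design points:
        1. Drawback: it is single-threaded and sets no crawl depth, so it runs slowly and the traversal loop may never terminate on its own;
        2. Outputs a CSV log file;
        3. Checks page URLs as well as image, JS, and CSS links;
        4. Classifies links and filters out off-site links;
        5. Login settings can be configured;
        6. E-mail notification can be configured.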
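
Two of the design points above, filtering out off-site links and writing a CSV log, can be sketched with the standard library alone. This is not code from the linked project: the function names, the column layout, and the rule that any status other than 200 counts as dead are assumptions made for illustration.

```python
import csv
import io
from urllib.parse import urlparse


def is_internal(url, site_host):
    """True if the URL belongs to the site under test."""
    return urlparse(url).netloc == site_host


def write_report(results, site_host, out):
    """Write (url, status) pairs for internal links to a CSV stream."""
    writer = csv.writer(out)
    writer.writerow(["url", "status", "result"])
    for url, status in results:
        if not is_internal(url, site_host):
            continue  # off-site links are skipped
        writer.writerow([url, status, "ok" if status == 200 else "dead"])


def report_as_text(results, site_host):
    """Convenience wrapper that returns the CSV report as a string."""
    buffer = io.StringIO()
    write_report(results, site_host, buffer)
    return buffer.getvalue()
```

To write to a file instead, pass a handle opened with `open(path, "w", newline="")` to `write_report`.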
    

[Idea two] Reverse approach

Specify in advance the links to be checked, then check whether each link is valid.
Idea: first configure the web resources to check, then test whether each page opens properly and whether its resources are present.

      Design points:
      1. Adding only the pages that need checking lets you check specific pages quickly and in a targeted way (provided you know the exact URLs of those pages in advance and have configured them).
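
One possible reading of the reverse approach is a small configuration table that maps each page URL to the resources it is expected to reference. In the sketch below, the `PAGES_TO_CHECK` entries, the function name `check_configured_pages`, and the injected `fetch(url)` callable (assumed to return a status code and the page body) are all hypothetical, and testing for a resource by substring search is a deliberate simplification.

```python
# Hypothetical configuration: page URL -> resources expected in its HTML.
PAGES_TO_CHECK = {
    "http://test.local/home": ["logo.png", "main.css"],
    "http://test.local/help": ["help.js"],
}


def check_configured_pages(pages, fetch):
    """Check only the preconfigured pages; return a list of problem strings.

    fetch(url) must return (status_code, body_text).
    """
    problems = []
    for url, required in pages.items():
        status, body = fetch(url)
        if status != 200:
            problems.append(f"{url}: status {status}")
            continue
        for resource in required:
            # Simplification: the resource counts as present if its name
            # appears anywhere in the page body.
            if resource not in body:
                problems.append(f"{url}: missing {resource}")
    return problems
```

Because nothing is crawled, the run time depends only on the number of configured pages, which is what makes this approach fast and targeted.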

Conclusion

Each of the above methods for detecting dead links has advantages and disadvantages; choose among them flexibly according to the specific test scenario.


Origin blog.csdn.net/sinat_16683257/article/details/82911148