Parsing HTML with jsoup

I recently had a requirement in a project: download the images in a web page that come from outside our own servers. The approach was to match the address in the src attribute of each img tag in the HTML and then download the image from that address. However, some src addresses did not point to a valid image resource, so the download failed. I then found that these img tags also had a data-src or original-src attribute, and that the address in those attributes could be downloaded.

This gave me an idea: collect every attribute of the img tag whose name contains the string src. If the address in src cannot be downloaded, try the address in data-src or in another attribute whose name contains src.
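The fallback idea can be sketched without any HTML library. This is only a minimal illustration: the class name `ImageUrlFallback`, the method `firstUsable`, and the `canDownload` check are hypothetical names, and the check stands in for a real download attempt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class ImageUrlFallback {

    // Returns the first candidate address that the supplied check accepts.
    // Candidates are tried in order, so put the src value first and the
    // data-src / original-src values after it.
    static Optional<String> firstUsable(List<String> candidates, Predicate<String> canDownload) {
        for (String url : candidates) {
            if (url != null && !url.isEmpty() && canDownload.test(url)) {
                return Optional.of(url);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption for the demo: only .jpeg addresses "download" successfully.
        Predicate<String> canDownload = url -> url.endsWith(".jpeg");
        List<String> candidates = Arrays.asList("/static/images/photo.jpg", "/image/261/1.jpeg");
        System.out.println(firstUsable(candidates, canDownload).orElse("no usable address"));
    }
}
```

In real code the predicate would wrap the actual download and return false when it fails.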

To get the src attribute of an img tag, the first thing that comes to mind is probably a regular expression.
Here is a small test using one:

       String html="<p>pic1:<img width=\"200\" data-src=\"/image/261/1.jpeg\" alt=\"\"/> pic2: <img width=\"200\" src=\"/image/751/3.jpg\" alt=\"\"/>,pic3:<img width=\"200\" src=\"/image/132/5.jpeg\" alt=\"\"/></p>";
       // Matches any attribute whose name ends in "src" (src, data-src, ...) and captures its value
       Pattern p = Pattern.compile("<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");
       Matcher m = p.matcher(html);
       while(m.find()){
           // The whole img tag
           System.out.println("img tag-------------"+m.group());
           // The value of the src attribute
           System.out.println("src attribute-------------"+m.group(1));
       }

Output:

img tag-------------<img width="200" data-src="/image/261/1.jpeg" alt=""/>
src attribute-------------/image/261/1.jpeg
img tag-------------<img width="200" src="/image/751/3.jpg" alt=""/>
src attribute-------------/image/751/3.jpg
img tag-------------<img width="200" src="/image/132/5.jpeg" alt=""/>
src attribute-------------/image/132/5.jpeg

Today I tried jsoup for parsing HTML tags for the first time and found it far more convenient than hand-writing regular expressions. Besides, once a regular expression is finished and some time has passed, I probably could not read it anymore.

Here is a simple example of parsing a page with jsoup.

I took part of the source code of the following page:
http://domestic.firefox.sina.com/17/0412/08/4OPJ52GTXH0M3V9W.html
Part of the source code:

<li><a href="http://domestic.firefox.sina.com/" title="国内">国内</a></li>
      <li><a href="http://world.firefox.sina.com/" title="国际">国际</a></li>
      <li><a href="http://mil.firefox.sina.com/" title="军事">军事</a></li>
      <li><a href="http://photo.firefox.sina.com/" title="图片">图片</a></li>
      <li><a href="http://society.firefox.sina.com/" title="社会">社会</a></li>
      <li><a href="http://ent.firefox.sina.com/" title="娱乐">娱乐</a></li>
      <li><a href="http://tech.firefox.sina.com/" title="科技">科技</a></li>
      <li><a href="http://sports.firefox.sina.com/" title="体育">体育</a></li>
      <li><a href="http://finance.firefox.sina.com/" title="财经">财经</a></li>
      <li><a href="http://auto.firefox.sina.com/" title="汽车">汽车</a></li> 
<img src="http://img.firefoxchina.cn/2016/07/1/201607131534240.png" alt="新浪国内">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/news/crawl/20170412/tUlE-fyecrxv5553275.jpg" alt="衡水中学进浙江引热议 当地校长:办学还是办厂">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/news/transform/20170412/NCxo-fyecfam0409134.jpg" alt="《奔跑吧兄弟》被指抄袭葫芦娃 被诉索赔200万">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/news/transform/20170412/QEZ5-fyecfam0405087.jpg" alt="对抗巡视组 他们用特咸饭菜等招数公开挑衅">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/news/transform/20170412/TfuR-fyecezv3239499.jpg" alt="外媒:中国贸易商接通知退回十多船朝鲜煤炭">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/news/transform/20170412/eNkL-fyecezv3239189.jpg" alt="中国工资低? 上海最低工资已超部分东欧国家">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/translate/20170412/hPZz-fyecezv3239988.jpg" alt="贵州省副省长慕德贵兼任省委宣传部部长(图)">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/translate/20170412/ufxr-fyecezv3235689.jpg" alt="一则公告让许晴王学兵等明星瞬间蒸发6000多万">
<img src="/static/images/photo.jpg" data-original="http://n.sinaimg.cn/news/crawl/20170412/4DGJ-fyecfam0401539.jpg" alt="武汉留美女生曾亲历美联航驱客:被壮汉架下飞机">
<img src="http://static.firefoxchina.cn/img/201701/7_587834a17cfee0.png">
<img src="http://static.firefoxchina.cn/img/201701/7_5876ea657366d0.png">

Here we no longer use regular expressions to get the href attribute of the a tags and the src attribute of the img tags; we let jsoup parse them instead.

Download the jsoup jar from the Internet (I have shared a copy on Baidu cloud disk: http://pan.baidu.com/s/1jIE7fBS) and add it to the project.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String[] args) throws Exception {

        Document doc = Jsoup.connect("http://domestic.firefox.sina.com/17/0412/08/4OPJ52GTXH0M3V9W.html").get();
        // Select img elements that have a src attribute
        Elements imgTags = doc.select("img[src]");
        System.out.println("=====imgsTag====" + imgTags);
        for (Element element : imgTags) {
            String src = element.attr("abs:src"); // src resolved to an absolute URL
            String src2 = element.attr("src");    // src exactly as written in the HTML
            System.out.println("===src===" + src);
            System.out.println("===src2===" + src2);
        }
        // Select a elements that have an href attribute
        Elements aTag = doc.select("a[href]");
        System.out.println("=====aTag====" + aTag);
        for (Element element : aTag) {
            String href = element.attr("href");
            System.out.println("===href===" + href);
        }
        // Select img elements whose src ends with .jpg
        Elements jpgs = doc.select("img[src$=.jpg]");
        for (Element element : jpgs) {
            String src = element.attr("abs:src");
            System.out.println("===src===" + src);
        }
    }
}

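In the code above, the two calls

     String src=element.attr("abs:src");
     String src2=element.attr("src");

differ in that the former returns the src address as an absolute URL, resolved against the address the document was loaded from, while the latter returns the attribute value exactly as it is written in the HTML. If the value cannot be turned into an absolute URL, abs:src returns an empty string.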
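To make "absolute path" concrete, the sketch below resolves a relative src against the page address using only java.net.URI. It illustrates the idea of resolution and is not jsoup's own implementation; the class name `AbsoluteSrc` and method `toAbsolute` are hypothetical.

```java
import java.net.URI;

final class AbsoluteSrc {

    // Resolves a possibly relative src value against the address of the page.
    // An src that is already absolute is returned unchanged.
    static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://domestic.firefox.sina.com/17/0412/08/4OPJ52GTXH0M3V9W.html";
        // prints http://domestic.firefox.sina.com/static/images/photo.jpg
        System.out.println(toAbsolute(page, "/static/images/photo.jpg"));
        // prints http://static.firefoxchina.cn/img/201701/7_587834a17cfee0.png
        System.out.println(toAbsolute(page, "http://static.firefoxchina.cn/img/201701/7_587834a17cfee0.png"));
    }
}
```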

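The attr method of Element returns the value of the attribute whose name is exactly "src". Suppose that, like the regular expression above, we also want the attributes whose names merely contain the string "src", but we do not know in advance what such an attribute is called: it may be data-src, original-src, or something else. In that case the only option is to iterate over all the attributes of the img tag.

For example, in the following case we want the value of the data-src attribute.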

<p><img data-src="http://mmbiz.qpic.cn/mmbiz_jpg/LoyT0npAgkkRjJibID5PXg2zT6iarg9IMkdqpvGv58Fq9tSGUGibZibX2uYfibIryXPuwX44SRjGrY4JURnAvPqvaOQ/0?wx_fmt=jpeg" data-ratio="0.66625" data-w="800" src="http://mmbiz.qpic.cn/mmbiz_jpg/LoyT0npAgkkRjJibID5PXg2zT6iarg9IMkdqpvGv58Fq9tSGUGibZibX2uYfibIryXPuwX44SRjGrY4JURnAvPqvaOQ/640?wx_fmt=jpeg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1" data-fail="0" alt="" /></p>

We can iterate over the attributes of the img tag:

     // Requires: java.util.Iterator, org.jsoup.nodes.Attribute, org.jsoup.nodes.Attributes
     Document doc = Jsoup.connect("http://domestic.firefox.sina.com/17/0412/08/4OPJ52GTXH0M3V9W.html").get();
     // Select img elements that have a src attribute
     Elements imgTags = doc.select("img[src]");
     for (Element element : imgTags) {
         Attributes attributes = element.attributes();
         Iterator<Attribute> iterator = attributes.iterator();
         while (iterator.hasNext()) {
             Attribute attribute = iterator.next();
             String key = attribute.getKey();
             // The attribute name contains "src" but is not src itself
             if (!key.equals("src") && key.contains("src")) {
                 //element.removeAttr(key);
                 String otherSrc = attribute.getValue();
                 System.out.println("====otherSrc====" + otherSrc);
                 break; // only the first such attribute is used
             }
         }
     }
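One caveat: in the page source shown earlier, the real image address sits in data-original, whose name does not contain "src", so the check above would skip it. A possible refinement is to also accept a short list of known attribute names. The sketch below shows only the selection rule, with the attributes modeled as a plain Map so that it runs without jsoup; the class name `AlternateSrcAttributes`, the method `alternateValues`, and the `extraNames` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AlternateSrcAttributes {

    // Returns the values of attributes other than src whose name contains "src"
    // or equals one of the extra names (for example data-original),
    // in the order the attributes appear in the map.
    static List<String> alternateValues(Map<String, String> attributes, List<String> extraNames) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String name = entry.getKey().toLowerCase(Locale.ROOT);
            if (name.equals("src")) {
                continue; // the plain src attribute is handled separately
            }
            if (name.contains("src") || extraNames.contains(name)) {
                values.add(entry.getValue());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Attributes of one img tag from the page source above, copied by hand.
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("src", "/static/images/photo.jpg");
        attributes.put("data-original", "http://n.sinaimg.cn/news/crawl/20170412/tUlE-fyecrxv5553275.jpg");
        System.out.println(alternateValues(attributes, Arrays.asList("data-original")));
    }
}
```

With jsoup, the same rule could be applied inside the loop over element.attributes() by testing each attribute's key.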

Origin blog.csdn.net/Charles_dai001/article/details/79029400