XML constraints and XML/HTML parsing

XML:


Concept: Extensible Markup Language (XML)

  • Extensible: tags are customizable.
  • Uses
    • Storing data
      1. Configuration files
      2. Transmitting data over the network
  • Differences between xml and html
    1. xml tags are user-defined; html tags are predefined.
    2. xml syntax is strict; html syntax is lenient.
    3. xml stores data; html displays data.
  • w3c: World Wide Web Consortium

Syntax:


  • Basic syntax:
    1. An xml document uses the .xml file extension
    2. The first line should be the document declaration (nothing, not even whitespace, may come before it)
    3. An xml document has exactly one root tag
    4. Attribute values must be enclosed in quotation marks (single or double quotes both work)
    5. Tags must be properly closed
    6. xml tag names are case sensitive
  • Getting started:

    
    <?xml version='1.0' ?>
    <users>
        <user id='1'>
            <name>zhangsan</name>
            <age>23</age>
            <gender>male</gender>
            <br/>
        </user>
    
        <user id='2'>
            <name>lisi</name>
            <age>24</age>
            <gender>female</gender>
        </user>
    </users>
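
The basic syntax rules above can be checked mechanically: a parser rejects any document that breaks them. The following is a minimal sketch using the JAXP DOM parser that ships with the JDK. The class name `WellFormedCheck` and the helper `isWellFormed` are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {

    // Hypothetical helper: true if the text parses as well-formed xml.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so fatal errors are not printed to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // one of the syntax rules was broken
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Follows all the rules
        System.out.println(isWellFormed("<users><user id='1'><name>zhangsan</name></user></users>"));
        // Tag not closed
        System.out.println(isWellFormed("<users><user id='1'></users>"));
        // Attribute value not quoted
        System.out.println(isWellFormed("<users><user id=1/></users>"));
        // Tag names differ in case
        System.out.println(isWellFormed("<users></Users>"));
        // More than one root tag
        System.out.println(isWellFormed("<users/><users/>"));
    }
}
```

Only the first call should report a well-formed document; each of the others violates one rule from the list.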
    
  • Components:

    1. Document declaration
      1. Format:
      
      <?xml attribute-list ?>
      
      2. Attribute list:
        • version: the version number; a required attribute
        • encoding: the character encoding. Tells the parsing engine which character set the current document uses. Default when omitted: UTF-8 (or UTF-16 when the file starts with a byte order mark)
        • standalone: whether the document is independent
          • Values:
            • yes: does not depend on other files
            • no: depends on other files
    2. Processing instruction (for reference only): binds a css stylesheet that controls how the data is displayed

      
      <?xml-stylesheet type="text/css" href="a.css" ?>
    3. Tags: tag names are user-defined
      • Rules:
        • Names can contain letters, digits, and other characters
        • Names cannot start with a digit or a punctuation character (an underscore is allowed)
        • Names cannot start with the letters xml (or XML, Xml, etc.)
        • Names cannot contain spaces
    4. Attributes:
      The value of an id attribute (as in the example above) must be unique within the document
    5. Text:
      • CDATA section: the data inside it is passed through exactly as written, so special characters such as '<' do not need to be escaped as &lt;
        • Format:
        
         <![CDATA[ data content ]]>
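
To illustrate the CDATA rule, the sketch below parses the same text twice, once with escaped entities and once inside a CDATA section, and reads it back with the JDK's DOM parser. The class name `CdataDemo`, the helper `textOf`, and the `<code>` element are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataDemo {

    // Hypothetical helper: parse the xml and return the text of the root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Without CDATA, '<' and '&' have to be written as entities.
        String escaped = "<code>if (a &lt; b &amp;&amp; b &lt; c) {}</code>";
        // Inside CDATA the same characters are written literally.
        String cdata = "<code><![CDATA[if (a < b && b < c) {}]]></code>";

        System.out.println(textOf(escaped));
        System.out.println(textOf(cdata));
        // Both forms carry the same text once parsed.
        System.out.println(textOf(escaped).equals(textOf(cdata)));
    }
}
```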
        

Constraints


  • Constraints: rules that specify how an xml document must be written

    • As a user of a framework (a programmer), you should be able to:
      1. Introduce a constraint document into an xml document
      2. Read a simple constraint document
    • Classification:
      1. DTD: a simple constraint technology. Its drawback is that it cannot fully constrain the content of elements and attributes

      2. Schema: a more complex constraint technology
    • DTD:
      • Introducing a dtd document into an xml document
        • Internal dtd: the constraint rules are defined inside the xml document
        • External dtd: the constraint rules are defined in an external dtd file
          • Local:
          
          <!DOCTYPE root-tag-name SYSTEM "location of the dtd file">
          
          • Network:
          
          <!DOCTYPE root-tag-name PUBLIC "name of the dtd file" "URL of the dtd file">
          
          
          Contents of the dtd file
          <!ELEMENT students (student*) >
          <!ELEMENT student (name,age,sex)>
          <!ELEMENT name (#PCDATA)>
          <!ELEMENT age (#PCDATA)>
          <!ELEMENT sex (#PCDATA)>
          <!ATTLIST student number ID #REQUIRED>
          
          XML document that introduces the external dtd
          <?xml version="1.0" encoding="UTF-8" ?>
          <!DOCTYPE students SYSTEM "student.dtd">
          
          <students>
              <student number="s0001">
                  <name>tom</name>
                  <age>18</age>
                  <sex>male</sex>
              </student>
          
          </students>
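
How a dtd is enforced can be sketched with the JDK's JAXP parser, which validates against the dtd once `setValidating(true)` is set. To stay self-contained, this example puts the rules shown above in an internal dtd instead of a separate student.dtd file. The class name `DtdValidationDemo` and its helpers are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DtdValidationDemo {

    // The same rules as the dtd file above, written as an internal dtd.
    static final String DTD =
            "<!DOCTYPE students [" +
            "<!ELEMENT students (student*)>" +
            "<!ELEMENT student (name,age,sex)>" +
            "<!ELEMENT name (#PCDATA)>" +
            "<!ELEMENT age (#PCDATA)>" +
            "<!ELEMENT sex (#PCDATA)>" +
            "<!ATTLIST student number ID #REQUIRED>" +
            "]>";

    // Hypothetical helper: returns the validation problems (empty list means valid).
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on dtd validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations are reported here, not thrown.
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            problems.add(e.getMessage()); // not even well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String valid = DTD + "<students><student number='s0001'>"
                + "<name>tom</name><age>18</age><sex>male</sex></student></students>";
        // The sex element is missing, which breaks (name,age,sex).
        String missingSex = DTD + "<students><student number='s0001'>"
                + "<name>tom</name><age>18</age></student></students>";
        // An ID value must be an xml name, so it cannot start with a digit.
        String digitId = DTD + "<students><student number='0001'>"
                + "<name>tom</name><age>18</age><sex>male</sex></student></students>";

        System.out.println(validate(valid));
        System.out.println(validate(missingSex));
        System.out.println(validate(digitId));
    }
}
```

Collecting the messages in a list instead of stopping at the first one is a design choice for the sketch; it makes it easy to show every violation in a document.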
          
    • Schema:
      • Introducing a schema:
        1. Write the root element of the xml document
        2. Declare the xsi prefix: xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        3. Introduce the xsd file together with its namespace: xsi:schemaLocation="xxx/xml student.xsd" (the namespace, a space, then the location of the xsd file)
        4. Declare a prefix for each xsd constraint (used to tell several xsd documents apart) as its identifier: xmlns="xxx/xml" for the default namespace, or xmlns:a="xxx/xml" for the prefix a

        
        <students   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="xxx/xml"
        xsi:schemaLocation="xxx/xml student.xsd">
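
A schema can also be applied from Java code. The sketch below uses the JDK's `javax.xml.validation` API and hands the xsd to the validator directly instead of relying on `xsi:schemaLocation`, so that it runs without any files. The namespace `http://example.com/xml`, the schema content, and the class name `SchemaValidationDemo` are all assumptions for this example; they are not the student.xsd referred to above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class SchemaValidationDemo {

    // Hypothetical schema: students contains student elements with a name and an integer age.
    static final String XSD =
            "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'" +
            " targetNamespace='http://example.com/xml'" +
            " xmlns='http://example.com/xml' elementFormDefault='qualified'>" +
            "<xsd:element name='students'><xsd:complexType><xsd:sequence>" +
            "<xsd:element name='student' minOccurs='0' maxOccurs='unbounded'>" +
            "<xsd:complexType><xsd:sequence>" +
            "<xsd:element name='name' type='xsd:string'/>" +
            "<xsd:element name='age' type='xsd:int'/>" +
            "</xsd:sequence></xsd:complexType>" +
            "</xsd:element>" +
            "</xsd:sequence></xsd:complexType></xsd:element>" +
            "</xsd:schema>";

    // Hypothetical helper: true if the xml conforms to the schema above.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no error handler set, a violation is thrown as a SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<students xmlns='http://example.com/xml'>"
                + "<student><name>tom</name><age>18</age></student></students>";
        // age is declared as xsd:int, which a dtd could not express.
        String bad = "<students xmlns='http://example.com/xml'>"
                + "<student><name>tom</name><age>eighteen</age></student></students>";

        System.out.println(isValid(good));
        System.out.println(isValid(bad));
    }
}
```

The second document shows the kind of data-type check that a schema adds over a dtd.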
        

Parsing: operating on an xml document by reading its data into memory


  • Operating on an xml document
    1. Parsing (reading): read the data in the document into memory
    2. Writing: save the data in memory to an xml document for persistent storage
  • Ways to parse xml:
    1. DOM: loads the whole markup document into memory at once and builds a dom tree there
      • Advantages: easy to work with; supports all CRUD operations on the document
      • Disadvantages: uses a lot of memory
    2. SAX: reads the document sequentially, piece by piece, and is event-driven
      • Advantages: uses very little memory
      • Disadvantages: read-only; cannot add, modify, or delete
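
The difference between the two approaches is easiest to see side by side. The sketch below reads the same `name` tags once through a DOM tree and once through SAX callbacks, using only the JAXP classes in the JDK. The class name `DomVsSaxDemo` and its two helpers are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DomVsSaxDemo {

    static final String XML =
            "<users><user id='1'><name>zhangsan</name></user>"
            + "<user id='2'><name>lisi</name></user></users>";

    // DOM: the whole tree is built first, then queried.
    static List<String> namesWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("name");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // SAX: no tree is kept; the handler reacts to events as the parser streams through.
    static List<String> namesWithSax(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        private StringBuilder text; // non-null only while inside <name>

                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes attributes) {
                            if ("name".equals(qName)) {
                                text = new StringBuilder();
                            }
                        }

                        @Override
                        public void characters(char[] ch, int start, int length) {
                            if (text != null) {
                                text.append(ch, start, length);
                            }
                        }

                        @Override
                        public void endElement(String uri, String localName, String qName) {
                            if ("name".equals(qName)) {
                                names.add(text.toString());
                                text = null;
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesWithDom(XML));
        System.out.println(namesWithSax(XML));
    }
}
```

Both helpers return the same names. The SAX version has to track its own state (the `text` field), which is the price of not holding the document in memory.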
  • Common xml parsers:
    1. JAXP: the parser provided by Sun; supports both the dom and sax approaches
    2. DOM4J: a very good parser
    3. Jsoup: a Java HTML parser that can also parse xml (described in detail below)
    4. PULL: the parser built into the Android operating system; works in the sax style
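
Since JAXP ships with the JDK, the two operations named earlier (parsing and writing) can be sketched as one round trip without extra jars: parse a document into a dom tree, add a node, and serialize the tree back to text. The class name `DomWriteDemo` and the helper `addUser` are assumptions made for this example; writing to a file instead of a string would only change the `StreamResult` target.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomWriteDemo {

    // Hypothetical helper: parse, append one <user>, and return the serialized result.
    static String addUser(String xml, String id, String name) {
        try {
            // 1. Parsing: read the document into memory as a dom tree.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // 2. Modify the tree in memory.
            Element user = doc.createElement("user");
            user.setAttribute("id", id);
            Element nameElement = doc.createElement("name");
            nameElement.setTextContent(name);
            user.appendChild(nameElement);
            doc.getDocumentElement().appendChild(user);

            // 3. Writing: serialize the tree back to xml text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<users><user id='1'><name>zhangsan</name></user></users>";
        System.out.println(addUser(xml, "2", "lisi"));
    }
}
```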
  • Jsoup: jsoup is a Java HTML parser that can parse a URL, a file, or HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.
    • Getting started:
      • Steps:
        1. Import the jar package
        2. Get the Document object
        3. Get the Element object for the desired tag
        4. Retrieve the data
    • Code:
    
        // 2.1 Get the path of student.xml on the classpath
        String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
        // 2.2 Parse the xml document: load it into memory and get the dom tree ---> Document
        Document document = Jsoup.parse(new File(path), "utf-8");
        // 3. Get the element objects (Element)
        Elements elements = document.getElementsByTag("name");
    
        System.out.println(elements.size());
        // 3.1 Get the Element for the first name tag; Elements extends ArrayList<Element>
        Element element = elements.get(0);
        // 3.2 Get the data
        String name = element.text();
        System.out.println(name);
    
  • Using the objects:
    1. Jsoup: utility class that parses html or xml documents and returns a Document
      • parse: parses an html or xml document and returns a Document
        • parse (File in, String charsetName): parses an xml or html file
        • parse (String html): parses an html or xml string
        • parse (URL url, int timeoutMillis): fetches the html or xml document at the given network address and parses it
    2. Document: the document object. It represents the dom tree in memory
      • Getting Element objects
        • getElementById (String id): gets the single element whose id attribute has the given value
        • getElementsByTag (String tagName): gets the collection of elements with the given tag name
        • getElementsByAttribute (String key): gets the collection of elements that have the given attribute
        • getElementsByAttributeValue (String key, String value): gets the collection of elements whose given attribute has the given value
    3. Elements: a collection of Element objects. It can be used like an ArrayList<Element>
    4. Element: the element object
      1. Getting child element objects
        • getElementById (String id): gets the single element whose id attribute has the given value
        • getElementsByTag (String tagName): gets the collection of elements with the given tag name
        • getElementsByAttribute (String key): gets the collection of elements that have the given attribute
        • getElementsByAttributeValue (String key, String value): gets the collection of elements whose given attribute has the given value
      2. Getting an attribute value
        • String attr (String key): gets the attribute value for the given attribute name
      3. Getting the text content
        • String text (): gets the text content
        • String html (): gets the entire content of the tag body as a string (including child tags)
    5. Node: the node object
      • It is the parent class of Document and Element
      
      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.select.Elements;
      
      import java.io.IOException;
      import java.net.URL;
      
      public class JsoupDemo01 {
          public static void main(String[] args) throws IOException {
              // Fetch and parse the page, waiting at most 10 seconds
              URL url = new URL("https://www.baidu.com/");
              Document document = Jsoup.parse(url, 10000);
              // Collect every element whose type attribute equals "submit"
              Elements btn = document.getElementsByAttributeValue("type", "submit");
              // attr() on Elements returns the value from the first matched element that has the attribute
              String text = btn.attr("value");
              System.out.println(btn);
              System.out.println("-------------");
              System.out.println(text);
              System.out.println("-------------");
          }
      }
      
      
  • Shortcut query methods:
    1. selector: css selectors
      • Method: Elements select (String cssQuery)
        • Syntax: see the syntax defined in the documentation of the Selector class
    2. XPath: the XML Path Language, a language for locating parts of an XML document (XML being a subset of the Standard Generalized Markup Language)
      • Using XPath with Jsoup requires importing an additional jar package (JsoupXpath).
      • Consult the w3school reference manual for the complete xpath syntax
      • Code:

        
        // 1. Get the path of student.xml on the classpath
        String path = JsoupDemo6.class.getClassLoader().getResource("student.xml").getPath();
        // 2. Get the Document object
        Document document = Jsoup.parse(new File(path), "utf-8");
        
        // 3. Create a JXDocument object from the Document object
        JXDocument jxDocument = new JXDocument(document);
        
        // 4. Query with xpath syntax
        // 4.1 Query all student tags
        List<JXNode> jxNodes = jxDocument.selN("//student");
        for (JXNode jxNode : jxNodes) {
            System.out.println(jxNode);
        }
        
        System.out.println("--------------------");
        
        // 4.2 Query the name tags under all student tags
        List<JXNode> jxNodes2 = jxDocument.selN("//student/name");
        for (JXNode jxNode : jxNodes2) {
            System.out.println(jxNode);
        }
        
        System.out.println("--------------------");
        
        // 4.3 Query the name tags under student tags that have an id attribute
        List<JXNode> jxNodes3 = jxDocument.selN("//student/name[@id]");
        for (JXNode jxNode : jxNodes3) {
            System.out.println(jxNode);
        }
        System.out.println("--------------------");
        // 4.4 Query the name tags under student tags whose id attribute value is sex
        
        List<JXNode> jxNodes4 = jxDocument.selN("//student/name[@id='sex']");
        for (JXNode jxNode : jxNodes4) {
            System.out.println(jxNode);
        }
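
The same four xpath expressions can be tried without the extra jar. The sketch below swaps in the JDK's built-in `javax.xml.xpath` API in place of JsoupXpath, so it is not the code from the listing above, only the same queries. The sample document, the id value `n1`, the class name `JdkXPathDemo`, and the helper `select` are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JdkXPathDemo {

    // Hypothetical sample data: only the first name tag carries an id attribute.
    static final String XML =
            "<students>"
            + "<student number='s0001'><name id='n1'>tom</name><age>18</age></student>"
            + "<student number='s0002'><name>jerry</name><age>19</age></student>"
            + "</students>";

    // Hypothetical helper: evaluate the expression and return the text of each matched node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // All student tags
        System.out.println(select(XML, "//student"));
        // The name tags under all student tags
        System.out.println(select(XML, "//student/name"));
        // Only the name tags that have an id attribute
        System.out.println(select(XML, "//student/name[@id]"));
        // Only the name tags whose id attribute equals n1
        System.out.println(select(XML, "//student/name[@id='n1']"));
    }
}
```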
        


Origin www.cnblogs.com/huxiaobai/p/12129447.html