1. The concept: what is information markup?
In simple terms, information markup means attaching a label to a piece of information (a slightly clumsy definition, but that really is all it is). For example, once "Beijing Zhongguancun" is marked, we know it is a place name, and once "Beijing Institute of Technology" is marked, we know it is the name of a university. What happens if we mark up a whole set of information?
Information markup:
- Once marked, information forms an organized structure, which adds a dimension to the information.
- Marked information can be used for communication, storage, or display.
- The markup structure is as valuable as the information itself.
- Marked information is easier for programs to understand and use, and also easier for people to understand and apply.
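To make the points above concrete, here is a minimal Python sketch that contrasts unmarked and marked information. The "label" and "value" key names and the helper function are my own illustrative choices, not part of any standard.

```python
# Unmarked information: a pile of strings with no hint of what each one is.
raw = ["Beijing Zhongguancun", "Beijing Institute of Technology"]

# Marked information: each value carries a label.
# The "label"/"value" key names are hypothetical, chosen for illustration.
marked = [
    {"label": "place name", "value": "Beijing Zhongguancun"},
    {"label": "university name", "value": "Beijing Institute of Technology"},
]


def values_with_label(items, label):
    """Return every value whose label matches the requested one."""
    return [item["value"] for item in items if item["label"] == label]


# A program can now pick out universities without guessing from the text.
universities = values_with_label(marked, "university name")
print(universities)
```

With only the raw list, a program would have to guess which string is a place and which is a university; with the labels, the lookup is trivial.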
2. The three forms of information markup
HTML (HyperText Markup Language) is how information is organized on the WWW (World Wide Web). It embeds hypermedia such as sound, images, and video in text. HTML organizes different types of information with predefined tags of the form <> ... </>.
Simply put, HTML is itself one kind of information markup. Internationally, three general forms of information markup are recognized: XML, JSON, and YAML.
- XML (eXtensible Markup Language)
It closely resembles HTML: information is marked with tags.
When a tag has no content, it can be abbreviated to a single pair of angle brackets, < ... />. When a tag has content, use the pair <> ... </>.
Comments can be embedded: they start with an angle bracket and an exclamation mark and run to the closing angle bracket, <!-- ... -->.
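As a rough illustration of these three XML rules, the sketch below parses a small document with Python's standard xml.etree.ElementTree module. The element names (university, name, address, logo) and the data are made up for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative document: tag names and values are assumptions for this sketch.
xml_text = """
<university>
    <!-- This is a comment -->
    <name>Beijing Institute of Technology</name>
    <address>
        <city>Beijing</city>
        <area>Zhongguancun</area>
    </address>
    <logo src="logo.png" />
</university>
"""

root = ET.fromstring(xml_text.strip())

# Tags with content: <name> ... </name>
name = root.find("name").text
area = root.find("address/area").text

# Abbreviated empty tag: <logo ... /> has attributes but no text content.
logo = root.find("logo")

# The default parser drops comments, so only real elements remain as children.
child_tags = [child.tag for child in root]

print(name, area, logo.attrib, child_tags)
```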
- JSON (JavaScript Object Notation)
It uses typed key-value pairs, key: value. Wrapping a key or value in double quotes marks it as a string. If the value is a number, write the number directly.
When one key corresponds to several values, group them with [] and separate them with commas.
Nested key-value pairs are written inside {}.
Simply put, JSON organizes information with typed key-value pairs.
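Here is a small sketch of what "typed" means in practice, using Python's standard json module. The field names and values are invented for the example; in particular, "visits" is just a made-up counter.

```python
import json

# Illustrative JSON text: field names and values are assumptions.
json_text = """
{
    "name": "Beijing Institute of Technology",
    "visits": 3,
    "tags": ["university", "Beijing"],
    "address": {"city": "Beijing", "area": "Zhongguancun"}
}
"""

record = json.loads(json_text)

# Each value arrives with the type its JSON notation implies.
kinds = {key: type(value).__name__ for key, value in record.items()}
print(kinds)

# The number is already a number, so arithmetic needs no conversion.
next_visits = record["visits"] + 1
```

The quoted value becomes a string, the bare number an integer, the [] group a list, and the nested {} a dictionary.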
- YAML
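It uses untyped key-value pairs, key: value. Neither keys nor values need double quotes.
Ownership (nesting) is expressed through indentation, much like Python.
Parallel items are expressed with a leading "-".
"|" marks a whole block of data, and "#" marks a comment.
The commonly used forms are plain key: value pairs, "-" lists of parallel values, and nested keys indented under their parent.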
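Python has no YAML parser in its standard library, so the sketch below goes the other way: a toy function (to_yaml_lines, a hypothetical helper of my own, not a real library API) that renders a nested dictionary as YAML-style text. It only handles simple values without special characters, and it is meant purely to show the indentation, "-", "|", and "#" notations described above.

```python
def to_yaml_lines(data, depth=0):
    """Render a nested dict as YAML-style lines (toy sketch, not a full emitter)."""
    pad = "  " * depth
    lines = []
    for key, value in data.items():
        if isinstance(value, dict):
            # Nesting: children are indented under their parent key.
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, depth + 1))
        elif isinstance(value, list):
            # Parallel values: one "-" item per line.
            lines.append(f"{pad}{key}:")
            lines.extend(f"{pad}  - {item}" for item in value)
        elif isinstance(value, str) and "\n" in value:
            # Whole block of text: introduced by "|".
            lines.append(f"{pad}{key}: |")
            lines.extend(f"{pad}  {line}" for line in value.splitlines())
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Illustrative record: all names and values are assumptions.
record = {
    "name": "Beijing Institute of Technology",
    "tags": ["university", "Beijing"],
    "address": {"city": "Beijing", "area": "Zhongguancun"},
    "note": "first line\nsecond line",
}

yaml_text = "\n".join(["# illustrative record"] + to_yaml_lines(record))
print(yaml_text)
# Prints:
# # illustrative record
# name: Beijing Institute of Technology
# tags:
#   - university
#   - Beijing
# address:
#   city: Beijing
#   area: Zhongguancun
# note: |
#   first line
#   second line
```

For real work a dedicated YAML library is the better choice, since it handles quoting, escaping, and the many cases this toy ignores.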
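3. Comparison of the three forms of information markup
- XML: marks information with <> tags (the earliest general-purpose information markup language; highly extensible, but verbose).
- JSON: marks information with typed key-value pairs (the information carries types, which suits processing by programs, especially JavaScript; more concise than XML).
- YAML: marks information with untyped key-value pairs (the information carries no types, a high proportion of the text is actual information, and it is very readable).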
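In an XML example, the proportion of useful information is low, because most of the text is taken up by tags.
In a JSON example, keys and string values all need double quotes to express their type.
In a YAML example, most of the text is the information itself.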
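The difference in "useful information" can be checked roughly by writing one record in all three forms and comparing lengths. This is only a sketch under my own assumptions: the record is invented, the XML uses one element per field, and the YAML text is assembled by hand rather than by a YAML library.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative record: field names and values are assumptions.
record = {
    "name": "Beijing Institute of Technology",
    "city": "Beijing",
    "area": "Zhongguancun",
}

# XML: one element per field, wrapped in a root element.
root = ET.Element("university")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

# JSON: typed key-value pairs, every key and string in double quotes.
json_text = json.dumps(record)

# YAML-style: hand-assembled "key: value" lines (safe here only because
# the values contain no special characters).
yaml_text = "\n".join(f"{key}: {value}" for key, value in record.items())

# Payload = characters that belong to the values themselves.
payload = sum(len(value) for value in record.values())
for label, text in [("XML", xml_text), ("JSON", json_text), ("YAML", yaml_text)]:
    print(f"{label}: {len(text)} characters, {payload / len(text):.0%} payload")
```

For this record the XML text is the longest and the YAML text the shortest, which matches the comparison above. The exact ratios depend on the data and on how the XML is structured.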
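Typical uses of the three forms of information markup:
- XML: information exchange and transfer on the Internet. HTML belongs to the same family of tag-based markup.
- JSON: communication between mobile apps and server nodes; it has no comments. Generally speaking, JSON is used where programs handle interfaces. After being transferred, JSON data can serve as part of the program code and be used directly by the program, which is where JSON's type information delivers the most value.
- YAML: configuration files for all kinds of systems; it supports comments and is easy to read. A high proportion of the text is actual information.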
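4. Summary
This article briefly introduced the three forms of information markup: XML, JSON, and YAML. Each has its own characteristics and suits a different range of uses.
With information markup covered, I will next talk about the general approaches to information extraction, working step by step toward a deeper understanding of how Python crawlers work. The goal is not just to be able to write a crawler, but to understand crawlers.
My knowledge is limited, so if anything in this article is off, please point it out~~ (●′ω`●)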