Getting Started with Python Web Crawlers - Information Organization and Extraction Methods (1)

1. The concept: what is information marking?

  In simple terms, information marking means attaching a mark (a label) to a piece of information (it sounds a bit silly, but that really is all there is to it). For example, marking "Beijing Zhongguancun" tells us that it is a place name, and marking "Beijing Institute of Technology" tells us that it is the name of a university. What do we gain by marking a whole set of information?
Information marking:

  • Marked information can form an organizational structure, which adds a dimension to the information.
  • Marked information can be used for communication, storage, or display.
  • The structure of the marks is as valuable as the information itself.
  • Marked information is easier for programs to understand and use, and also easier for people to understand and apply.
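
The idea can be sketched in a few lines of Python. This is only an illustration: the label names "place" and "university" and the helper values_with_label are assumptions made for this example, not part of any library.

```python
# Unmarked information: the strings alone do not say what they describe.
raw = ["Beijing Zhongguancun", "Beijing Institute of Technology"]

# Marked information: each value is paired with a label describing its meaning.
# The label names are illustrative assumptions.
labeled = [
    {"label": "place", "value": "Beijing Zhongguancun"},
    {"label": "university", "value": "Beijing Institute of Technology"},
]


def values_with_label(items, label):
    """Return the values whose mark equals the given label."""
    return [item["value"] for item in items if item["label"] == label]


# Once the information is marked, a program can select it by meaning.
universities = values_with_label(labeled, "university")
print(universities)
```

With the plain list there is nothing for the program to select on; with the marked version the extra dimension (the label) makes the query trivial.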

2. Three forms of information marking

  HTML (HyperText Markup Language) is how information is organized on the WWW (World Wide Web). It embeds hypermedia such as sound, images, and video in text. HTML organizes different types of information through predefined tags of the form <> ... </>.

Simply put, HTML is one kind of information marking. In the general sense, there are three internationally recognized forms: XML, JSON, and YAML.

  • XML (eXtensible Markup Language)
    As the name suggests, it is very similar to HTML.
    An element with content is written as a pair of tags, <name> ... </name>. An element with no content can be abbreviated to a single tag, <name />.
    Comments can be embedded: they begin with <!-- and end with -->.
  • JSON (JavaScript Object Notation)
    Typed key-value pairs, "key": "value". Keys are written in double quotes, and a value in double quotes is a string. If the value is a number, write the number directly, without quotes.
    When one key corresponds to several values, organize them with [] and separate them with commas.
    Nested key-value pairs are enclosed in {}.
    Simply put, JSON organizes information as typed key-value pairs.
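  • YAML (YAML Ain't Markup Language)
    Untyped key-value pairs, key: value. Neither the key nor the value is written in double quotes.
    Indentation expresses ownership (nesting), much like Python.
    A leading "-" expresses parallel (sibling) items.
    "|" introduces a whole block of data, and "#" starts a comment.
    The commonly used forms are therefore key: value, a key followed by a list of "-" items, and a key followed by indented sub-keys.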
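
The syntax rules above can be tried out with a short Python sketch. The records below are made up for illustration and are not taken from the original figures. XML and JSON are parsed with the standard library; YAML has no standard-library parser, so the sketch only parses it if the third-party PyYAML package happens to be installed (an assumption, hence the try/except).

```python
import json
import xml.etree.ElementTree as ET

# XML: paired tags for an element with content, a single abbreviated tag for
# an empty element, and <!-- ... --> for comments.
xml_text = """
<school>
  <!-- illustrative record -->
  <name>Beijing Institute of Technology</name>
  <img src="campus.jpg" size="10" />
</school>
""".strip()
root = ET.fromstring(xml_text)
name = root.find("name").text        # text of an element with content
img_src = root.find("img").get("src")  # attribute of an empty element

# JSON: quoted keys, typed values, [] for several values, {} for nesting.
json_text = """
{
  "name": "Beijing Institute of Technology",
  "founded": 1940,
  "tags": ["university", "engineering"],
  "address": {"city": "Beijing", "zipcode": "100081"}
}
"""
school = json.loads(json_text)

# YAML: no quotes, indentation for nesting, "-" for parallel items,
# "|" for a block of text, "#" for comments.
yaml_text = """
# illustrative record
name: Beijing Institute of Technology
address:
  city: Beijing
tags:
  - university
  - engineering
intro: |
  A whole block of text,
  kept line by line.
"""
try:
    import yaml  # PyYAML, third-party; assumed optional here
    config = yaml.safe_load(yaml_text)
except ImportError:
    config = None  # the YAML text is still readable as-is

print(name, img_src)
print(type(school["founded"]).__name__, school["tags"], school["address"]["city"])
```

Note that json.loads returns the number as an int and the quoted values as str, which is what "typed key-value pairs" means in practice.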

3. Comparison of the three forms of information marking

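  • XML: marks information with <> tags (the earliest general-purpose information markup language; very extensible, but verbose).
  • JSON: marks information with typed key-value pairs (the information carries types, which suits processing by programs, JavaScript in particular; more concise than XML).
  • YAML: marks information with untyped key-value pairs (the information carries no types; a high proportion of the text is actual information, so it is very readable).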

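XML example: the proportion of useful information is low, because most of the text is taken up by tags.
JSON example: keys and string values all need double quotes to express their type.
YAML example: no quotes or closing tags are needed, so most of the text is the information itself.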
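
To make the comparison concrete, the sketch below writes one made-up record in all three notations and estimates how much of each document is actual information. The record, the helper payload_ratio, and the way the ratio is measured (value characters divided by all characters, whitespace ignored) are assumptions chosen for this illustration, not a standard metric.

```python
# The same hypothetical record in each notation; all values are illustrative.
VALUES = ["Ming", "Li", "No. 5 Zhongguancun South Street", "Beijing",
          "100081", "Computer System", "Security"]

DOCUMENTS = {
    "XML": """
<person>
  <firstName>Ming</firstName>
  <lastName>Li</lastName>
  <address>
    <streetAddr>No. 5 Zhongguancun South Street</streetAddr>
    <city>Beijing</city>
    <zipcode>100081</zipcode>
  </address>
  <prof>Computer System</prof><prof>Security</prof>
</person>
""",
    "JSON": """
{
  "firstName": "Ming",
  "lastName": "Li",
  "address": {
    "streetAddr": "No. 5 Zhongguancun South Street",
    "city": "Beijing",
    "zipcode": "100081"
  },
  "prof": ["Computer System", "Security"]
}
""",
    "YAML": """
firstName: Ming
lastName: Li
address:
  streetAddr: No. 5 Zhongguancun South Street
  city: Beijing
  zipcode: 100081
prof:
  - Computer System
  - Security
""",
}


def strip_whitespace(text):
    """Remove all whitespace so indentation style does not affect the count."""
    return "".join(text.split())


def payload_ratio(document, values):
    """Share of non-whitespace characters that belong to the values."""
    payload = sum(len(strip_whitespace(v)) for v in values)
    return payload / len(strip_whitespace(document))


ratios = {name: payload_ratio(text, VALUES) for name, text in DOCUMENTS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0%} of the characters are values")
```

For this record the ratio rises from XML to JSON to YAML, which matches the descriptions above: XML repeats every tag name twice, JSON adds quotes and punctuation, and YAML adds little beyond the keys.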
Typical uses of the three forms:

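  • XML: information exchange and transfer on the Internet. HTML uses the same tag-based style and can be grouped into this category.
  • JSON: communication between mobile applications and service nodes; it has no comments. Generally, JSON is used where programs handle interfaces. After transmission, JSON data can become part of the program code and be run directly by the program (as in JavaScript, from whose object syntax JSON derives); that is where JSON's type information is most useful.
  • YAML: configuration files for all kinds of systems; it supports comments and is easy to read. A high proportion of the text is actual information.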
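
The JSON point can be sketched with Python's standard json module. The payload and its field names are hypothetical, and the "transmission" is just an in-memory encode and decode standing in for a real network call.

```python
import json

# A hypothetical interface payload; field names and values are illustrative.
payload = {
    "title": "Information marking",
    "views": 120,
    "published": True,
    "tags": ["XML", "JSON", "YAML"],
}

# Sender side: serialize to JSON text, then to bytes for transmission.
wire_bytes = json.dumps(payload).encode("utf-8")

# Receiver side: decode and parse. The types come back without any
# extra conversion code, which is why JSON suits program interfaces.
received = json.loads(wire_bytes.decode("utf-8"))
print(type(received["views"]).__name__, type(received["tags"]).__name__)

# JSON has no comment syntax, so a document containing one is rejected.
try:
    json.loads('{\n  // a comment\n  "views": 120\n}')
    comment_accepted = True
except json.JSONDecodeError:
    comment_accepted = False
print("comment accepted:", comment_accepted)
```

The lack of comments is a reason to prefer YAML for configuration files that people edit by hand, while JSON stays the simpler choice between programs.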

4. Summary

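  This article briefly introduced the three forms of information marking: XML, JSON, and YAML. Each has its own characteristics, and each suits a different range of uses.
  With information marking covered, the next article will discuss the general methods of information extraction, working step by step toward the underlying principles of Python crawlers. The goal is not just to be able to write a crawler, but to understand how crawlers work.
  My knowledge is limited, so if anything in this article is wrong, corrections are very welcome~~ (●′ω`●)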


Origin blog.csdn.net/weixin_43275558/article/details/104406737