[Crawler] Python regular expressions

First, classification of data

  1, structured data

    Features: data is organized in rows, and each row represents one entity. Every row has the same attributes.
    Example: the tables of a relational database.
    Processing method: SQL

  2, semi-structured data

    Features: another form of structured data. It does not fit the characteristics of relational data and cannot be described by the relational model. However, it carries tags that separate semantic elements and describe the hierarchy of records and fields.
       It is therefore also called a self-describing structure.
    Examples: XML, HTML, JSON
    Processing methods: regular expressions, XPath, JSONPath, CSS selectors.

  3, unstructured data:

    Features: no fixed data structure.
    For example: documents, pictures, audio, video.
    Processing method: usually saved as a whole in binary form.

Second, JSON data

  1, what is JSON?

    JSON is a string format that uses the [] (array) and {} (object) notation of the JS language to store data.
    JSON data is essentially a string.

  2, JS arrays and objects

    JS array: var array = ['aaa', 'bb', 'cc'] ---- corresponds to a Python list
    JS object: var obj = {name: 'zhangsan', age: 10} ---- corresponds to a Python dict
        name = obj.name

  3, parsing JSON data

    json module:
    (1) operations on JSON strings

      json.loads(json_str) ---> Python list or dict
      json.dumps(python_list_or_dict) ---> json_str

    (2) operations on JSON files

      json.load(fp) ---> reads the JSON data from a file and returns a Python list or dict
      json.dump(python_list_or_dict, fp) ---> saves a Python list or dict to the file that fp refers to
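
The four json functions above can be sketched as follows. This is a minimal illustration, not part of the original notes: the sample data is made up, and `io.StringIO` stands in for a real file opened with `open()`.

```python
import io
import json

# String <-> Python object
json_str = '{"name": "zhangsan", "age": 10, "tags": ["aaa", "bb"]}'
data = json.loads(json_str)      # JSON string -> Python dict
round_trip = json.dumps(data)    # Python dict -> JSON string

# File <-> Python object (io.StringIO is an in-memory stand-in for a file)
fp = io.StringIO()
json.dump(data, fp)              # write the dict to the file object as JSON
fp.seek(0)                       # rewind before reading it back
loaded = json.load(fp)           # read JSON from the file object

print(data["name"], data["tags"])
print(loaded == data)
```

With a real file the pattern is the same, only `fp` would come from `open(path, 'w')` or `open(path)`.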

  4, significance of JSON:

    (1) As a data transfer format, JSON is efficient.
    (2) Unlike XML, JSON does not require strict closing tags, so when it is used for data transfer its proportion of effective data (valid data relative to total data) is much higher than XML's.
    (3) For the same amount of traffic, JSON transfers more data than XML.

Third, regular expressions

  1, metacharacters

    (1) boundary matching

      ^ ---- beginning of the line
      $ ---- end of the line

    (2) number of repetitions

      ? ---- 0 or 1
      * ---- >= 0
      + ---- >= 1
      {n,} ---- >= n
      {n,m} ---- >= n and <= m
      {n} ---- exactly n times

    (3) representing kinds of characters

      [] ---- brackets match one single character from the set
      [abc] ---- matches a, b, or c
      [a-zA-Z0-9] ---- matches one letter or digit
      \d ---- digit
      \w ---- letter, digit, or underscore
      \s ---- whitespace characters: line breaks, tabs, spaces
      \b ---- word boundary
      . ---- any character except newline
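
A short sketch of the metacharacters above in action. The patterns and sample strings are illustrative assumptions, chosen only to show each category once.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string
is_phone = re.compile(r'^\d{3}-\d{4}$')
exact = bool(is_phone.search('555-1234'))      # whole string fits
extra = bool(is_phone.search('x555-1234'))     # extra leading character

# Quantifiers: ? (0 or 1) and {n,m} (between n and m times)
colors = re.findall(r'colou?r', 'color colour colouur')
runs = re.findall(r'a{2,3}', 'a aa aaa aaaa')

# Character classes: \w, \s, \b
words = re.findall(r'\b\w+\b', 'hello, big_world 42')
parts = re.split(r'\s+', 'a b\tc\nd')

print(exact, extra)
print(colors, runs)
print(words, parts)
```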

  2, using the re module

    Python's re module is used for regular expression processing.

    (1) steps for using the re module:

# 1, import the module
import re
# 2, compile the regular expression into a pattern object
pattern = re.compile(
    r'regular expression',  # the r prefix keeps backslashes in metacharacters literal
    flags=0,                # matching mode, e.g. re.S or re.I
)
# 3, call the pattern object's methods to match the content

    (2) Methods of pattern objects:

      ①match method: by default starts at the beginning of the string, matches only once, and returns a match object (None if there is no match).

pattern.match(
    string,   # the target string to match against
    pos,      # optional start position, default 0
    endpos,   # optional end position, default is the end of the string
)  # -> match object, or None
          a, methods of the match object
            match.group() --- the matched content
            match.span() --- the (start, end) range of the match
            match.start() --- start position
            match.end() --- end position
          b, these methods take an optional argument: 0 (or nothing) means the whole match, 1 means the first group, and so on.
            match.group(0) --- the matched content
            match.span(0) --- the matched range
            match.start(0) --- start position
            match.end(0) --- end position
            match.groups() --- the contents of all groups, in order, returned as a tuple
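
A minimal sketch of `match` and the match object methods. The e-mail-like pattern and the sample strings are assumptions made up for this illustration.

```python
import re

pattern = re.compile(r'(\w+)@(\w+)\.com')
m = pattern.match('tom@example.com is the address')

whole = m.group()      # same as m.group(0): the whole match
user = m.group(1)      # first group
domain = m.group(2)    # second group
both = m.groups()      # all groups as a tuple
span = m.span()        # (start, end) of the whole match

# match only tries at the start position, so this finds nothing
no_match = pattern.match('mail: tom@example.com')
# passing a start position of 6 skips the 'mail: ' prefix
shifted = pattern.match('mail: tom@example.com', 6)

print(whole, user, domain, both, span)
print(no_match, shifted.span())
```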

      ②search method: scans the string for the first position where the pattern matches, matches only once, and returns a match object (None if there is no match).

pattern.search(
    string,   # the target string to match against
    pos,      # optional start position, default 0
    endpos,   # optional end position, default is the end of the string
)  # -> match object, or None
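
A small sketch of `search`, including the optional start position. The sample text is an assumption for illustration.

```python
import re

pattern = re.compile(r'\d+')
text = 'order 66, then 77'

first = pattern.search(text)        # first match anywhere in the string
later = pattern.search(text, 8)     # start scanning at index 8
missing = pattern.search('no digits here')

print(first.group(), first.span())
print(later.group(), later.span())
print(missing)
```

Because `search` returns None when nothing matches, real code usually checks the result before calling `.group()`.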

      ③findall method: scans the whole string, matches as many times as possible, and returns all results in a list.

pattern.findall(
    string,   # the target string to match against
    pos,      # optional start position, default 0
    endpos,   # optional end position, default is the end of the string
)  # -> list
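
A brief sketch of `findall`, with and without groups in the pattern. The salary string is reused from the sub example in these notes; the second pattern is an assumption.

```python
import re

text = 'zhangsan: 3000, lisi: 4000'

# Without groups: a list of the matched strings
numbers = re.compile(r'\d+').findall(text)
# With several groups: a list of tuples, one tuple per match
pairs = re.compile(r'(\w+): (\d+)').findall(text)
# With a start position: only matches at or after index 14
tail = re.compile(r'\d+').findall(text, 14)

print(numbers)
print(pairs)
print(tail)
```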

      ④finditer method: scans the whole string, matches as many times as possible, and returns an iterator of match objects.

pattern.finditer(
    string,   # the target string to match against
    pos,      # optional start position, default 0
    endpos,   # optional end position, default is the end of the string
)  # -> iterator of match objects; finditer is mainly used when there are many matches
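
A minimal sketch of `finditer`. Unlike `findall`, each item is a match object, so positions are available as well as text. The sample string is an assumption.

```python
import re

pattern = re.compile(r'\d+')
text = 'a1 b22 c333'

# Matches are produced one at a time as the loop advances
for m in pattern.finditer(text):
    print(m.group(), m.span())

values = [m.group() for m in pattern.finditer(text)]
spans = [m.span() for m in pattern.finditer(text)]
```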

      ⑤split method: splits the string at every place the regular expression matches and returns the pieces.

pattern.split(
    string,    # the string to split
    maxsplit,  # optional maximum number of splits; default splits at every match
)  # -> list
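
A short sketch of `split` with and without a split limit. The separator pattern and input are illustrative assumptions.

```python
import re

# Split on any run of commas, semicolons, or whitespace
pattern = re.compile(r'[,;\s]+')
text = 'a, b;c   d'

all_parts = pattern.split(text)      # split at every match
limited = pattern.split(text, 1)     # split at most once

print(all_parts)
print(limited)
```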

      ⑥sub method: replaces the content matched by the regular expression with the specified string.

pattern.sub(
    repl,     # what to replace with
    content,  # the string to do the replacement in
    count,    # optional number of replacements; default replaces all
)  # -> the string after replacement

      repl can also be a function:
        requirements for the function:
          a, it must take one parameter, which is the match object for each match found in the target string.
          b, it must return a value, and the return value must be a string; this string is used as the replacement.

import re
# zhangsan: 3000, lisi: 4000
# raise each salary by 1000
content = 'zhangsan: 3000, lisi: 4000'
p = re.compile(r'\d+')
def add(match):
    return str(int(match.group()) + 1000)
result = p.sub(add, content)
print(result)  # zhangsan: 4000, lisi: 5000
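
For comparison, here is a sketch of `sub` with a plain string as repl, with a count limit, and with group references inside the replacement string. The `***` mask and the swap pattern are assumptions for illustration.

```python
import re

pattern = re.compile(r'\d+')
content = 'zhangsan: 3000, lisi: 4000'

masked = pattern.sub('***', content)          # replace every match
first_only = pattern.sub('***', content, 1)   # replace only the first match
# \1 and \2 in the replacement refer to the groups of the pattern
swapped = re.sub(r'(\w+): (\d+)', r'\2: \1', content)

print(masked)
print(first_only)
print(swapped)
```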

      ⑦ Groups

          A group in a regular expression is written with (); each pair of parentheses is one group.
          What groups are for:
            a, filtering out just the part of the match that is needed
            b, referring back to an earlier group within the same expression:
              \1 refers to the first group
            c, using findall with groups
import re

content = '<html><h1>regular expression</h1></html>'
p = re.compile(r'<(html)><(h1)>(.*)</\2></\1>')
# print(p.search(content).group())
print(p.findall(content))  # [('html', 'h1', 'regular expression')]
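
Two more small sketches of groups: a backreference that finds a doubled word, and a group that filters out part of a match. Both patterns and inputs are assumptions made up for this illustration.

```python
import re

# \1 must repeat exactly what the first group matched
doubled = re.compile(r'\b(\w+) \1\b')
m = doubled.search('this is is a test')
repeated_word = m.group(1)
whole_match = m.group()

# Groups as a filter: keep only the year and month of a date
date = re.search(r'(\d{4})-(\d{2})-\d{2}', 'released on 2019-12-19')
year, month = date.groups()

print(whole_match, repeated_word)
print(year, month)
```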

      ⑧ Greedy and non-greedy

          a, greedy and non-greedy differ in how much content is matched: as much as possible or as little as possible.
          b, quantifiers such as * control the number of repetitions. Regular expressions are greedy by default.
          c, non-greedy matching is controlled with ?.
          d, adding ? after a quantifier makes the quantifier take its minimum number of repetitions, i.e., non-greedy.
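
The difference can be seen in a minimal sketch like the following. The HTML fragment and digit string are illustrative assumptions.

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: .* runs to the last </b> it can still match
greedy = re.findall(r'<b>(.*)</b>', html)
# Non-greedy: .*? stops at the first </b>
lazy = re.findall(r'<b>(.*?)</b>', html)

# The same ? works after any quantifier, e.g. {2,4}
digits_greedy = re.search(r'\d{2,4}', '123456').group()
digits_lazy = re.search(r'\d{2,4}?', '123456').group()

print(greedy, lazy)
print(digits_greedy, digits_lazy)
```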

      ⑨ match mode:

        re.S ---- . also matches newline
        re.I ---- ignore case
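
A short sketch of both modes, separately and combined with `|`. The sample text is an assumption.

```python
import re

text = 'Hello\nWORLD'

# re.I: letters match regardless of case
case_sensitive = re.search(r'hello', text)
case_insensitive = re.search(r'hello', text, re.I)

# re.S: the dot also matches the newline between the two words
dot_default = re.search(r'Hello.WORLD', text)
dot_all = re.search(r'Hello.WORLD', text, re.S)

# Modes can be combined with |
combined = re.compile(r'hello.world', re.S | re.I).search(text)

print(case_sensitive, case_insensitive.group())
print(dot_default, dot_all is not None, combined is not None)
```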

      ⑩ universal matching regular expression: .*? (matches any content, as little as possible) used together with re.S
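
A minimal sketch of this combination, pulling out the content between tags even when it spans several lines. The HTML snippet is an assumption for illustration.

```python
import re

html = '<div>\n  first\n</div>\n<div>second</div>'

# .*? with re.S: any content, including newlines, as little as possible
blocks = re.findall(r'<div>(.*?)</div>', html, re.S)
# Without re.S the dot stops at newlines, so the first block is missed
without_flag = re.findall(r'<div>(.*?)</div>', html)

print(blocks)
print(without_flag)
```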

Origin www.cnblogs.com/Tree0108/p/12070785.html