First, the classification of data
1, structured data
Features: the data is organized in rows, and each row represents one entity. Every row has the same attributes.
Example: tables in a relational database.
Approach: SQL
2, semi-structured data
Features: another form of structured data. It does not fit the relational model and cannot be described by it. However, it carries tags or other markers that separate semantic elements and describe the hierarchy of records and fields.
For this reason it is also called self-describing structure.
For example: XML, HTML, JSON
Processing methods: regular expressions, XPath, JSONPath, CSS selectors.
3, unstructured data:
Features: no fixed data structure.
For example: documents, pictures, audio, video.
Approach: usually saved as a whole in binary form.
Second, JSON data
1, what is JSON?
JSON is a string format that uses the object ({}) and array ([]) syntax of JavaScript to store data structures.
In essence, JSON data is a string.
2, JS arrays and objects
JS array: var array = ['aaa', 'bb', 'cc'] ---- corresponds to a Python list
JS object: var obj = {name: 'zhangsan', age: 10} ---- corresponds to a Python dictionary
name = obj.name
3, parsing JSON data
json module:
(1) operations on JSON strings
json.loads(json_str) ---> Python list or dict
json.dumps(Python list or dict) ---> json_str
(2) operations on JSON files
json.load(fp) ---> reads the JSON data from the file and returns a Python list or dict
json.dump(Python list or dict, fp) ---> saves the Python list or dict to the file that fp refers to
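The four functions above can be sketched as a round trip. This is a minimal example; the sample record and the temporary file name are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample record, used only for illustration.
record = {"name": "zhangsan", "age": 10, "tags": ["aaa", "bb", "cc"]}

# (1) String operations: dict -> JSON string -> dict.
json_str = json.dumps(record)
parsed = json.loads(json_str)
print(type(json_str).__name__)  # str
print(parsed["tags"])           # ['aaa', 'bb', 'cc']

# (2) File operations: dict -> file -> dict.
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump(record, fp)
with open(path, encoding="utf-8") as fp:
    loaded = json.load(fp)
print(loaded == record)         # True
```

Note that loads and dumps work on strings, while load and dump take an open file object.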
4, why JSON matters:
(1) JSON is an efficient format for data transfer.
(2) Unlike XML, JSON has no strict closing tags, so when it is used for data transfer its proportion of valid data (valid data relative to total data) is much higher than that of XML.
(3) With the same amount of traffic, JSON therefore carries more data than XML.
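The ratio argument can be checked on a toy record. This is only a rough sketch: the record is hypothetical, the XML string is one assumed way of encoding it, and real ratios depend on the data and the schema.

```python
import json

# Hypothetical record; the XML layout below is one assumed encoding of it.
record = {"name": "zhangsan", "age": 10}

json_text = json.dumps(record, separators=(",", ":"))
xml_text = "<person><name>zhangsan</name><age>10</age></person>"

# Valid data: the characters of the values themselves.
payload = sum(len(str(value)) for value in record.values())

json_ratio = payload / len(json_text)
xml_ratio = payload / len(xml_text)
print(len(json_text), len(xml_text))
print(round(json_ratio, 2), round(xml_ratio, 2))
```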
Third, regular expressions
1, metacharacters
(1) the boundary matching
^ ---- beginning of the line
$ ----- end of the line
(2) the number of repetitions
? ---- 0 or 1
* ---- >= 0
+ ---- >= 1
{n,} ---- >= n
{n,m} ---- >= n and <= m
{n} ---- exactly n times
(3) character representations
[] ---- brackets match one single character from the set inside them
[abc] ---- matches a, b, or c
[a-zA-Z0-9] ---- matches any letter or digit
\d ---- digit
\w ---- letter, digit, or underscore
\s ---- whitespace characters: line breaks, tabs, spaces
\b ---- word boundary
. ---- any character except newline
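A few of the metacharacters above in action. The test strings are invented for illustration, and the module-level re.search and re.findall are used here only for brevity.

```python
import re

# Boundaries: ^ and $ anchor the pattern to the start and end.
anchored = bool(re.search(r'^abc$', 'abc'))
not_anchored = bool(re.search(r'^abc$', 'xabc'))
print(anchored, not_anchored)  # True False

# Repetition: ? and {n,m}
optional_b = re.findall(r'ab?', 'a ab abb')
two_or_three = re.findall(r'\d{2,3}', '1 22 333 4444')
print(optional_b)    # ['a', 'ab', 'ab']
print(two_or_three)  # ['22', '333', '444']

# Character representations
alnum = re.findall(r'[a-zA-Z0-9]+', 'ab_12-Cd')
word = re.findall(r'\w+', 'ab_12-Cd')
whole_cat = re.findall(r'\bcat\b', 'cat concat cat.')
dots = re.findall(r'.', 'a\nb')
print(alnum)      # ['ab', '12', 'Cd']
print(word)       # ['ab_12', 'Cd']
print(whole_cat)  # ['cat', 'cat']
print(dots)       # ['a', 'b']
```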
2, using the re module.
Python's re module is used for regular expression processing.
(1) steps for using the re module:
    # 1. import the module
    import re
    # 2. compile the regular expression into a pattern object
    pattern = re.compile(
        r'regular expression',  # the r prefix marks a raw string, so backslashes in metacharacters stay literal
        flags=0,                # matching mode, for example re.I or re.S
    )
    # 3. call the corresponding method of the pattern object to match content
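The three steps can be sketched with a concrete pattern. The pattern, the re.I flag, and the sample text are chosen only for illustration.

```python
import re

# Step 1 is the import above. Step 2: compile once, then reuse the pattern.
# re.I (ignore case) is one example of a matching mode.
pattern = re.compile(r'\bpython\b', re.I)

# Step 3: call a method of the pattern object.
text = 'Python and PYTHON and pythonic'
found = pattern.findall(text)
print(found)  # ['Python', 'PYTHON']
```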
(2) The method of pattern objects:
① match method: matches from the beginning of the string by default, matches only once, and returns a match object.
    pattern.match(
        'target string',
        pos,     # start position of the match, default pos=0
        endpos,  # end position of the match, default is the end of the string
    )  # -> match object, or None if there is no match
a, methods of the match object
match.group() ---- the matched text
match.span() ---- the (start, end) range of the match
match.start() ---- start position
match.end() ---- end position
b, these methods accept an optional argument: 0 means the whole match and may be omitted, while 1 fetches the first group.
match.group(0) ---- the whole match
match.span(0) ---- range of the whole match
match.start(0) ---- start position
match.end(0) ---- end position
match.groups() ---- returns the contents of all groups, in order, as a tuple
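The match object methods above, tried on a small example. The date-like pattern and the input strings are hypothetical.

```python
import re

pattern = re.compile(r'(\d{4})-(\d{2})')
m = pattern.match('2024-05 report')

print(m.group())           # 2024-05
print(m.group(1))          # 2024
print(m.span())            # (0, 7)
print(m.span(2))           # (5, 7)
print(m.start(), m.end())  # 0 7
print(m.groups())          # ('2024', '05')

# match only tries at the start position, so this returns None.
missed = pattern.match('report 2024-05')
print(missed)              # None

# Passing pos=7 moves the start position to the first digit.
shifted = pattern.match('report 2024-05', 7)
print(shifted.group())     # 2024-05
```

Because match returns None when nothing matches, it is safer to check the result before calling group().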
② search method: matches starting from any position, matches only once, and returns a match object.
    pattern.search(
        'target string',
        pos,     # start position of the match, default pos=0
        endpos,  # end position of the match, default is the end of the string
    )  # -> match object, or None if there is no match
③ findall method: matches across the whole string as many times as possible and returns all the results in a list.
    pattern.findall(
        'target string',
        pos,     # start position of the match, default pos=0
        endpos,  # end position of the match, default is the end of the string
    )  # -> list
④ finditer method: matches across the whole string as many times as possible and returns an iterator of match objects.
    pattern.finditer(
        'target string',
        pos,     # start position of the match, default pos=0
        endpos,  # end position of the match, default is the end of the string
    )  # -> iterator
    # finditer is mainly used when there are many matches.
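A side-by-side sketch of search, findall, and finditer on the same input. The sample text is invented.

```python
import re

pattern = re.compile(r'\d+')
text = 'a1 bb22 ccc333'

# search: first match anywhere in the string.
first = pattern.search(text)
print(first.group(), first.span())  # 1 (1, 2)

# findall: every match, collected into a list of strings.
all_numbers = pattern.findall(text)
print(all_numbers)                  # ['1', '22', '333']

# finditer: every match, produced lazily as match objects.
positions = [(m.group(), m.start()) for m in pattern.finditer(text)]
print(positions)                    # [('1', 1), ('22', 5), ('333', 11)]
```

finditer keeps only one match object alive at a time, which is why it suits inputs with many matches.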
⑤ split method: splits the string at the content matched by the regular expression and returns the substrings.
    pattern.split(
        'string to split',
        maxsplit,  # maximum number of splits, default splits at every match
    )  # -> list
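A short sketch of split with and without a split limit. The delimiter pattern and the input are made up for illustration.

```python
import re

# Split on any run of commas, semicolons, or whitespace.
pattern = re.compile(r'[,;\s]+')
text = 'aaa, bb;cc  dd'

parts = pattern.split(text)
print(parts)       # ['aaa', 'bb', 'cc', 'dd']

# With maxsplit=1 only the first delimiter is used.
first_cut = pattern.split(text, 1)
print(first_cut)   # ['aaa', 'bb;cc  dd']
```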
⑥ sub method: replaces the content matched by the regular expression with the specified string.
    pattern.sub(
        repl,    # the replacement content
        string,  # the string in which to replace
        count,   # number of replacements, default replaces all
    )  # -> the string after replacement
repl can also be a function.
Requirements for the function:
a, it must take one parameter, which is the match object produced for each match in the target string.
b, it must return a value, and the return value must be a string, which is used as the replacement content.
    # zhangsan:3000,lisi:4000
    # raise each salary by 1000
    import re

    def add(match):
        # called once per match; returns the replacement string
        return str(int(match.group()) + 1000)

    content = 'zhangsan:3000,lisi:4000'
    p = re.compile(r'\d+')
    result = p.sub(add, content)  # 'zhangsan:4000,lisi:5000'
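For comparison, a sketch of sub with a plain string as repl and with the count argument. The '***' mask is an arbitrary choice for illustration.

```python
import re

p = re.compile(r'\d+')
content = 'zhangsan:3000,lisi:4000'

# String replacement: every match is replaced.
masked_all = p.sub('***', content)
print(masked_all)    # zhangsan:***,lisi:***

# count=1: only the first match is replaced.
masked_first = p.sub('***', content, 1)
print(masked_first)  # zhangsan:***,lisi:4000
```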
⑦ Groups
In a regular expression a group is written with (); each pair of parentheses is one group.
What groups are for:
a, filtering content
b, an earlier group can be referenced later in the same expression:
\1 refers to the first group
c, using groups with findall
    import re

    content = '<html><h1>regular expression</h1></html>'
    p = re.compile(r'<(html)><(h1)>(.*)</\2></\1>')
    # print(p.search(content).group())
    print(p.findall(content))  # [('html', 'h1', 'regular expression')]
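Uses a and b of groups can be sketched separately. Both inputs are hypothetical, and the non-greedy .*? is an addition not covered in the notes above.

```python
import re

# a. Filtering: the pattern matches the whole tag, findall keeps only the group.
html = '<a href="/p/1">one</a><a href="/p/2">two</a>'
links = re.findall(r'<a href="(.*?)">', html)
print(links)  # ['/p/1', '/p/2']

# b. Backreference: \1 must repeat exactly what group 1 matched.
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(doubled.group())   # is is
print(doubled.group(1))  # is
```

When a pattern has exactly one group, findall returns a list of strings; with several groups it returns a list of tuples.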