Data extraction - Python crawler - the json module and JsonPath

One, what is Json?

Put simply, Json is built from JavaScript objects and arrays. These two structures, the object and the array, can be nested and combined to represent arbitrarily complex data.

  • Object: in js an object is the content enclosed in { }, a set of key-value pairs of the form { key: value, key: value, ... }. In object-oriented languages the key is a property of the object and the value is that property's value, so a property is read as object.key. A value can be a number, a string, an array, or another object.
  • Array: in js an array is the content enclosed in [ ], for example ["Python", "javascript", "C++", ...]. As in every language, elements are read by index. An element can be a number, a string, an array, or an object.
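
As a small illustration of the two structures, the sketch below parses a Json object that contains an array, then reads values back by key and by index. It assumes Python 3, and the sample data (the `city` and `languages` keys) is made up for this example.

```python
import json

# Hypothetical sample: a JSON object whose "languages" value is a JSON array.
text = '{"city": "Beijing", "languages": ["Python", "javascript", "C++"]}'

data = json.loads(text)

# A JSON object becomes a dict, so values are read by key.
city = data["city"]

# A JSON array becomes a list, so values are read by index.
first_language = data["languages"][0]

print(city, first_language)
```

Python dicts use `data["city"]` rather than the `object.key` form found in js, but the idea is the same.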

Two, Json basic functions

The json module provides four functions, dumps, dump, loads, and load, for converting between Json strings and Python data types.

1.json.loads()

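Decodes a Json-formatted string into a Python object. The type conversions from Json to Python are as follows: object to dict, array to list, string to unicode, number (int) to int or long, number (real) to float, true to True, false to False, null to None.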

# json_loads.py
# -*- coding: utf-8 -*-

import json

strList = '[1, 2, 3, 4]'

strDict = '{"city": "北京", "name": "大猫"}'

json.loads(strList)
# [1, 2, 3, 4]

json.loads(strDict)  # Json strings are automatically decoded to Unicode
# {u'city': u'\u5317\u4eac', u'name': u'\u5927\u732b'}
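
To make the type mapping concrete, here is a minimal sketch that decodes one string containing every Json value type and inspects the resulting Python types. It assumes Python 3, where strings come back as plain `str` rather than the `u'...'` form shown above; the sample keys `a` to `d` are invented for the example.

```python
import json

# Hypothetical sample covering each JSON value type.
text = '{"a": [1, 2.5, "x"], "b": true, "c": false, "d": null}'

obj = json.loads(text)

# object -> dict, array -> list, integer -> int, real -> float,
# string -> str, true/false -> True/False, null -> None
print(type(obj).__name__)
print([type(v).__name__ for v in obj["a"]])
print(obj["b"], obj["c"], obj["d"])
```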

2.json.dumps()

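Serializes a Python object into a Json string: it takes a Python object and returns a str encoded as Json.

The type conversions from Python to Json are as follows: dict to object, list and tuple to array, str and unicode to string, int, long, and float to number, True to true, False to false, None to null.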

# json_dumps.py
# -*- coding: utf-8 -*-

import json
import chardet

listStr = [1, 2, 3, 4]
tupleStr = (1, 2, 3, 4)
dictStr = {"city": "北京", "name": "大刘"}

json.dumps(listStr)
# '[1, 2, 3, 4]'
json.dumps(tupleStr)
# '[1, 2, 3, 4]'

# Note: json.dumps() escapes non-ASCII characters by default (ensure_ascii=True).
# Pass ensure_ascii=False to disable the escaping and keep the UTF-8 text.
# chardet.detect() returns a dict; confidence is the detection confidence.

json.dumps(dictStr)
# '{"city": "\\u5317\\u4eac", "name": "\\u5927\\u5218"}'

chardet.detect(json.dumps(dictStr))
# {'confidence': 1.0, 'encoding': 'ascii'}

print json.dumps(dictStr, ensure_ascii=False)
# {"city": "北京", "name": "大刘"}

chardet.detect(json.dumps(dictStr, ensure_ascii=False))
# {'confidence': 0.99, 'encoding': 'utf-8'}
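
The effect of ensure_ascii can also be checked without chardet. The following is a rough Python 3 sketch (the listing above is Python 2): it serializes the same dict both ways and confirms that the two strings differ only in how the characters are written, not in what they decode to.

```python
import json

data = {"city": "北京", "name": "大刘"}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keep the original characters

print(escaped)
print(readable)

# The escaped form contains only ASCII characters.
only_ascii = all(ord(ch) < 128 for ch in escaped)

# Both strings decode back to the same Python object.
same_object = json.loads(escaped) == json.loads(readable) == data
print(only_ascii, same_object)
```

In Python 3 the result of json.dumps is always a `str`, so the encoding is only chosen later, when the string is written to a file or socket.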

3.json.dump()

Serializes a built-in Python object to Json and writes it to a file.

# json_dump.py
# -*- coding: utf-8 -*-

import json

# "with" closes each file, so the data is flushed to disk
listStr = [{"city": "北京"}, {"name": "大刘"}]
with open("listStr.json", "w") as fp:
    json.dump(listStr, fp, ensure_ascii=False)

dictStr = {"city": "北京", "name": "大刘"}
with open("dictStr.json", "w") as fp:
    json.dump(dictStr, fp, ensure_ascii=False)

4.json.load()

Reads a Json file and converts its contents into the corresponding Python objects.

# json_load.py

import json

with open("listStr.json") as fp:
    strList = json.load(fp)
print strList
# [{u'city': u'\u5317\u4eac'}, {u'name': u'\u5927\u5218'}]

with open("dictStr.json") as fp:
    strDict = json.load(fp)
print strDict
# {u'city': u'\u5317\u4eac', u'name': u'\u5927\u5218'}
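
json.dump() and json.load() are meant to be used as a pair. A minimal round-trip sketch, assuming Python 3 and a temporary directory chosen for this example so no files are left behind:

```python
import json
import os
import tempfile

data = [{"city": "北京"}, {"name": "大刘"}]

# Write to a temporary directory so the sketch cleans up after itself.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "listStr.json")

    # An explicit encoding keeps non-ASCII text portable across platforms.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, ensure_ascii=False)

    with open(path, encoding="utf-8") as fp:
        loaded = json.load(fp)

# What was read back equals what was written.
print(loaded == data)
```

Passing encoding="utf-8" to open() is a design choice for this sketch: without it, Python 3 uses the platform default encoding, which may not be able to represent the Chinese text.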

Three, JsonPath

1.JsonPath rules

XPath   JSONPath   Description
/       $          Root node
.       @          Current node
/       . or []    Child node
..      n/a        Parent node; not supported by JsonPath
//      ..         Recursive descent: select all matching nodes regardless of position
*       *          Wildcard: matches all element nodes
@       n/a        Attribute access; not supported, because Json is a recursive key-value structure with no attributes
[]      []         Subscript operator (simple iteration such as an array index or selecting by content)
|       [,]        Union operator: select several alternatives in one subscript
[]      ?()        Filter expression
n/a     ()         Script expression evaluation
()      n/a        Grouping; not supported by JsonPath
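
To show what the recursive-descent rule `..` selects, here is a hand-written sketch in plain Python 3. It is not the jsonpath library and not how that library is implemented; `find_key` is a hypothetical helper that only mimics the idea behind an expression such as `$..name`, and the sample document is invented.

```python
def find_key(node, key):
    """Collect every value stored under `key`, at any depth (like `$..key`)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the value itself may contain more matches.
            found.extend(find_key(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


# Hypothetical sample data with "name" at two different depths.
doc = {
    "name": "root",
    "cities": [
        {"name": "Beijing", "code": 1},
        {"name": "Shanghai", "code": 2},
    ],
}

names = find_key(doc, "name")
print(names)
```

The point is that the position of a node does not matter: every `name` is collected, whether it sits at the top level or inside the `cities` array.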

2. Examples

# jsonpath_lagou.py

import urllib2
import jsonpath
import json

url = 'http://www.lagou.com/lbs/getAllCitySearchLabels.json'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
html = response.read()

# Convert the Json-formatted string into a Python object
jsonobj = json.loads(html)

# Starting from the root node, match every name node
citylist = jsonpath.jsonpath(jsonobj, '$..name')

print citylist
print type(citylist)

content = json.dumps(citylist, ensure_ascii=False)
print content

# Write the result to a file as UTF-8
with open('city.json', 'w') as fp:
    fp.write(content.encode('utf-8'))

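Notes

String encoding conversion

This is one of the most painful areas for Chinese programmers: garbled text is almost always caused by Chinese characters.
In fact, encoding problems are easy to handle as long as you remember one thing:

Any encoding on any platform can be converted to and from Unicode.

To convert between UTF-8 and GBK, first convert UTF-8 to Unicode, then convert Unicode to GBK, and vice versa.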

# -*- coding: utf-8 -*-

# A UTF-8 encoded byte string (the source file is saved as UTF-8)
utf8Str = "你好地球"

# 1. Decode the UTF-8 byte string into Unicode
unicodeStr = utf8Str.decode("UTF-8")

# 2. Encode the Unicode string as GBK
gbkData = unicodeStr.encode("GBK")

# 3. Decode the GBK data back into Unicode
unicodeStr = gbkData.decode("gbk")

# 4. Encode the Unicode string as UTF-8 again
utf8Str = unicodeStr.encode("UTF-8")
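
The same idea carries over to Python 3, where the split is more explicit: `str` is always Unicode and encoded data is always `bytes`. A minimal sketch under that assumption, going through Unicode in both directions:

```python
text = "你好地球"  # str: already Unicode in Python 3

# Unicode -> UTF-8 bytes
utf8_bytes = text.encode("utf-8")

# UTF-8 -> Unicode -> GBK
gbk_bytes = utf8_bytes.decode("utf-8").encode("gbk")

# GBK -> Unicode -> UTF-8
back_to_utf8 = gbk_bytes.decode("gbk").encode("utf-8")

# Same text, different byte lengths in the two encodings.
print(len(utf8_bytes), len(gbk_bytes))
print(back_to_utf8 == utf8_bytes)
```

Each of these four characters takes 3 bytes in UTF-8 and 2 bytes in GBK, which is why the two byte strings differ even though they represent the same text.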

 


Origin www.cnblogs.com/Iceredtea/p/11294362.html